Genome-wide Copy Number Analysis

Qunyuan Zhang, Ph.D.

Division of Statistical Genomics

Department of Genetics & Center for Genome Sciences

Washington University School of Medicine

02 - 08 – 2006

Course: M 21-621 Computational Statistical Genetics

Four Questions
  • What is Copy Number?
  • What can Copy Number tell us?
  • How to measure/quantify Copy Number?
  • How to analyze Copy Number?
What is Copy Number?
  • Gene Copy Number

The gene copy number (also "copy number variants" or CNVs) is the amount of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells. For instance, the EGFR copy number can be higher than normal in Non-small cell lung cancer. …Elevating the gene copy number of a particular gene can increase the expression of the protein that it encodes.

From Wikipedia

DNA Copy Number

A Copy Number Variant (CNV) represents a copy number change involving a DNA fragment that is ~1 kilobase or larger.

From Nature Reviews Genetics, Feuk et al. 2006

  • DNA Copy Number ≠ DNA Tandem Repeat Number (e.g., microsatellites, repeat units <10 bases)

  • DNA Copy Number ≠ RNA Copy Number
  • RNA Copy Number = Gene Expression Level

DNA → (transcription) → mRNA

  • Copy number is the number of copies of a particular fragment of a nucleic acid molecule. It refers to DNA copy number in most publications.
What can Copy Number tell us?
  • Genetic Diversity/Polymorphisms

- restriction fragment length polymorphism (RFLP)

- amplified fragment length polymorphism (AFLP)

- random amplification of polymorphic DNA (RAPD)

- variable number of tandem repeats (VNTR; e.g., mini- and microsatellites)

- single nucleotide polymorphism (SNP)

- presence/absence of transposable elements

- structural alterations (e.g., deletions, duplications, inversions, …)

- DNA copy number variant (CNV)

  • Association with phenotypes/diseases, genes/genetic factors

Genetic Alterations in Tumor Cells (DNA Copy Number Changes)

Sources of copy number change (normal cell → tumor cells):

- Homologous repeats

- Segmental duplications

- Chromosomal rearrangements

- Duplicative transpositions

- Non-allelic recombinations

[Figure: normal cell versus tumor cells, with copy number states CN=0 and CN=1 (deletion), CN=2 (normal), CN=3 and CN=4 (amplification)]
How to measure/quantify Copy Number?

  • Quantitative Polymerase Chain Reaction (Q-PCR): DNA Amplification

(dNTPs, primers, Taq polymerase, fluorescent dye)

less CN → amplification → less DNA → low fluorescent intensity

more CN → amplification → more DNA → high fluorescent intensity

(one fragment each time)

  • Microarray: DNA Hybridization

(dNTPs, primers, Taq polymerase, fluorescent dye)

less CN → amplification → less DNA → arrayed probes → low intensities

more CN → amplification → more DNA → arrayed probes → high intensities

(multiple/different fragments, mixed pool)
Microarray: From Image to Copy Number

Affymetrix Mapping 250K Sty-I chip

~250K probe sets

~250K SNPs

probe set (24 probes)

more DNA copy number → more DNA hybridization → higher intensity

How to Analyze Copy Number?

  • A Real Example

~400 cancer patients

Normal tissue & tumor tissue (~400 pairs, ~800 DNA samples)

Affymetrix 250K Sty-I Human Mapping SNP Array

DNA hybridization signals (intensities on chip images)

Genotype calling

SNP genotypes

Two analyses:

- LOH (loss of heterozygosity) analysis: genotypic changes

- DNA copy number analysis: DNA copy number changes



  • General Procedures for Copy Number Analysis

Finished chips → (scanner) → Raw image data [.DAT files]

Raw image data + (experiment info [.EXP]) → (image processing software) → Probe level raw intensity data [.CEL files]

Preprocessing (with chip description file [.CDF]): Background adjustment, Normalization, Summarization

Summarized intensity data

Raw copy number (CN) data [log ratio of tumor/normal intensities]

Individual level: Significance test of CN changes; Estimation of CN; Smoothing and boundary determination

Population level: Concurrent regions among population; Amplification and deletion frequencies among populations; Association analysis
Background Adjustment/Correction

Reduces unevenness of a single chip

Makes intensities of different positions on a chip comparable

Before adjustment After adjustment

Corrected Intensity (S’) = Observed Intensity (S) – Background Intensity (B)

For each region i, B(i) = Mean of the lowest 2% intensities in region i

Affymetrix MAS 5.0
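The regional background rule above can be sketched in a few lines of Python. This is only an illustration of the two formulas on the slide (B(i) as the mean of the lowest 2% of intensities in region i, then S' = S - B). The function name `background_correct` and the toy data are assumptions, and the full MAS 5.0 procedure does more than this (for example, it blends backgrounds from neighbouring regions), which is omitted here.

```python
def background_correct(region_intensities, low_fraction=0.02):
    """Subtract a regional background from every intensity in one chip region.

    The background B is the mean of the lowest `low_fraction` of the
    intensities in the region (2% on the slide).
    """
    ordered = sorted(region_intensities)
    # Use at least one value so that tiny regions still get an estimate.
    n_low = max(1, int(len(ordered) * low_fraction))
    background = sum(ordered[:n_low]) / n_low
    corrected = [s - background for s in region_intensities]
    return corrected, background

# Toy region (hypothetical data): 100 probe intensities, 101 to 200.
region = [float(v) for v in range(101, 201)]
corrected, background = background_correct(region)
```

With these toy values the two lowest intensities (101 and 102) define the background, so some corrected values can fall at or below zero; a real pipeline has to handle that before taking logarithms.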


Background Adjustment/Correction

Eliminates non-specific hybridization signal

Obtains accurate intensity values for specific hybridization

[Figure labels: sense or antisense strands; 25 oligonucleotide probes; probe set]

Methods: PM (perfect match) only, PM-MM (perfect match minus mismatch), Ideal MM, etc.


Normalization

Reduces technical variation between chips

Makes intensities from different chips comparable

[Figure: before normalization versus after normalization]

Standardization:

$$S' = \frac{S - \operatorname{mean}(S)}{\operatorname{SD}(S)}, \qquad S' \sim N(0, 1)$$

Other methods: Base Line Array (linear); Quantile Normalization; Contrast Normalization; etc.
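As a rough illustration of two of the normalization ideas named above, the sketch below standardizes one chip to mean 0 and SD 1, and quantile-normalizes several chips so that they share one intensity distribution. The function names and the toy chips are assumptions for illustration; ties are not treated specially, which a production implementation would need to consider.

```python
import statistics

def standardize(intensities):
    """S' = (S - mean of S) / SD of S, computed within one chip."""
    mean = statistics.fmean(intensities)
    sd = statistics.stdev(intensities)
    return [(s - mean) / sd for s in intensities]

def quantile_normalize(chips):
    """Give every chip the same distribution: the rank-wise mean across chips.

    `chips` is a list of equal-length lists, one list of intensities per chip.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the r-th smallest value over all chips.
    reference = [statistics.fmean(sc[r] for sc in sorted_chips) for r in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: two chips, three probes each.
chip_a = [2.0, 4.0, 6.0]
chip_b = [30.0, 10.0, 20.0]
z_a = standardize(chip_a)
qn_a, qn_b = quantile_normalize([chip_a, chip_b])
```

After quantile normalization both toy chips contain the same set of values (6, 12, 18), placed according to each probe's original rank on its own chip.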



Summarization

Combines the multiple probe intensities for each probe set to produce a summarized value for subsequent analyses.

Average methods:

PM only or PM-MM, allele specific or non-specific

Model-based method: Li & Wong, 2001 (Gene Expression Index)
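The average methods can be sketched as below. The function `summarize_probe_set` and the numbers are hypothetical, and only the simple averages (PM only, or PM-MM) are shown; the model-based method of Li & Wong is not implemented here.

```python
def summarize_probe_set(pm, mm=None):
    """Summarize one probe set by a plain average.

    pm: perfect-match intensities of the probes in the set.
    mm: optional mismatch intensities; when given, PM - MM is averaged.
    """
    if mm is None:
        values = list(pm)
    else:
        values = [p - m for p, m in zip(pm, mm)]
    return sum(values) / len(values)

# Hypothetical probe set with three probe pairs.
pm = [100.0, 120.0, 140.0]
mm = [20.0, 30.0, 10.0]
pm_only = summarize_probe_set(pm)
pm_minus_mm = summarize_probe_set(pm, mm)
```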

Raw Copy Number Data

S: Summarized raw intensity

S': Log transformation, S' = log2(S)

[Figure: intensities before and after log transformation]

Raw CN:

Log ratio of tumor / normal intensities

$$\mathrm{CN} = S'_{\mathrm{tumor}} - S'_{\mathrm{normal}} = \log_2\left(\frac{S_{\mathrm{tumor}}}{S_{\mathrm{normal}}}\right)$$

Pair design

S_normal = S of the paired normal sample

Group design

S_normal = average S of the group of normal samples
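A minimal sketch of the raw CN calculation for both designs follows. The function names and intensities are illustrative assumptions; each list holds one summarized intensity S per SNP.

```python
import math

def raw_cn_paired(s_tumor, s_normal):
    """Pair design: log2 ratio against the patient's own normal sample."""
    return [math.log2(t / n) for t, n in zip(s_tumor, s_normal)]

def raw_cn_group(s_tumor, normal_group):
    """Group design: log2 ratio against the average S of a group of normals."""
    n_samples = len(normal_group)
    reference = [
        sum(sample[i] for sample in normal_group) / n_samples
        for i in range(len(s_tumor))
    ]
    return [math.log2(t / r) for t, r in zip(s_tumor, reference)]

# Hypothetical summarized intensities for three SNPs.
tumor = [400.0, 100.0, 200.0]
paired_normal = [200.0, 200.0, 200.0]
normal_group = [[100.0, 200.0, 300.0], [300.0, 200.0, 100.0]]

cn_paired = raw_cn_paired(tumor, paired_normal)
cn_group = raw_cn_group(tumor, normal_group)
```

In this toy case a raw CN of 1 means the tumor intensity is twice the normal intensity, -1 means half, and 0 means no change.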
Individual Level Analysis

Analysis for each individual sample (or each sample pair)

  • Significance test of CN amplification and deletion
  • Boundary finding (smoothing and segmentation)
  • CN estimation
Significance Test for Copy Number Changes: -log(p) values, chr. 1, pair #101

Window-based t test

Window size = 0.5 Mbp (~30 SNPs); N = SNP number in window

$$t = \frac{\text{Mean CN of window}}{\text{SD of window}} \times \sqrt{N} \sim t(\mathrm{df} = N - 1)$$

[Figure: -log(p) values plotted against window position (Mbp)]
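The window statistic can be computed as in the sketch below, which treats it as a one-sample t test of the hypothesis that the mean raw CN in the window is zero (no copy number change). The function name and the toy window are assumptions. Converting t to a p-value needs the t distribution, which the Python standard library does not provide; a statistics package (for example `scipy.stats.t.sf`) would be one option.

```python
import math
import statistics

def window_t_statistic(raw_cn_window):
    """Return (t, df) for one window of raw CN values (log ratios)."""
    n = len(raw_cn_window)
    mean = statistics.fmean(raw_cn_window)
    sd = statistics.stdev(raw_cn_window)  # sample SD, n - 1 in the denominator
    t = mean / sd * math.sqrt(n)
    return t, n - 1

# Hypothetical window of four SNPs (a real window would hold about 30).
t_value, df = window_t_statistic([0.5, 1.5, 1.0, 1.0])
```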

Segmentation

Bioconductor R Packages:

GLAD package, adaptive weights smoothing (AWS) method

DNAcopy package, circular binary segmentation method
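To give a feel for what segmentation does, here is a deliberately simplified sketch in Python. It is plain recursive binary segmentation on a least-squares criterion, not the circular binary segmentation of DNAcopy and not the AWS method of GLAD, and it has no significance test. The function names, the `min_gain` threshold and the data are all assumptions made for illustration.

```python
import statistics

def sum_squared_error(segment_values):
    """Sum of squared deviations from the segment mean."""
    mean = statistics.fmean(segment_values)
    return sum((v - mean) ** 2 for v in segment_values)

def best_split(values, min_size=2):
    """Return (index, gain) of the split that most reduces the squared error."""
    total = sum_squared_error(values)
    best_index, best_gain = None, 0.0
    for i in range(min_size, len(values) - min_size + 1):
        gain = total - sum_squared_error(values[:i]) - sum_squared_error(values[i:])
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain

def segment(values, start=0, min_gain=0.5, min_size=2):
    """Recursively split; return (start, end, mean) tuples, end exclusive."""
    index, gain = best_split(values, min_size)
    if index is None or gain < min_gain:
        return [(start, start + len(values), statistics.fmean(values))]
    left = segment(values[:index], start, min_gain, min_size)
    right = segment(values[index:], start + index, min_gain, min_size)
    return left + right

# Hypothetical log ratios along ten consecutive SNPs.
log_ratios = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.1]
segments = segment(log_ratios)
boundaries = [(s, e) for s, e, _ in segments]
```

On this toy input the sketch finds three segments: SNPs 0 to 3 near a log ratio of 0, SNPs 4 to 7 near 1, and SNPs 8 to 9 back near 0.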


CN Estimation: Hidden Markov Model (HMM)

Software: CNAT; dChip; CNAG

[Figure: HMM along consecutive SNPs (…, SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4, …), each with an observed log ratio]

hidden status (unknown CN)

observed status (raw CN = log ratio of intensities)

CN estimation: finding the sequence of CN values that maximizes the likelihood of the observed raw CN.

Algorithm: Viterbi algorithm (can be iterative)

The following information/assumptions are needed:

Background probabilities: overall probabilities of the possible CN values.

P(CN=x); x = -2, -1, 0, 1, 2, 3, …, n (usually n < 10)

Transition probabilities: probabilities of the CN value of each SNP conditional on that of the previous SNP.

P(CN_i+1=x | CN_i=y); x, y = -2, -1, 0, 1, 2, 3, …, or n

Emission probabilities: probabilities of the observed raw CN value of each SNP conditional on the hidden/unknown/true CN status.

P(log ratio<x | CN=y) = f(x | CN=y); x = any real number; y = -2, -1, 0, 1, 2, 3, …, or n
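The three ingredients above (background, transition and emission probabilities) are enough to run the Viterbi algorithm. The sketch below is a minimal illustration, not the implementation used in CNAT, dChip or CNAG. Everything numeric in it is an assumption: three hidden states labelled by total copy number 1, 2 and 3, Gaussian emissions centred at log2(CN/2) with a common SD of 0.3, and hand-picked background and transition probabilities.

```python
import math

def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(log_ratios, states, start_prob, trans_prob, emission_mean, emission_sd):
    """Return the most probable hidden CN state for each SNP."""
    # score[s]: best log probability of any state path that ends in state s.
    score = {
        s: math.log(start_prob[s])
        + gaussian_log_pdf(log_ratios[0], emission_mean[s], emission_sd)
        for s in states
    }
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointer = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + math.log(trans_prob[p][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(trans_prob[best_prev][s])
                            + gaussian_log_pdf(x, emission_mean[s], emission_sd))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path

# Assumed model parameters (illustrative only).
states = [1, 2, 3]
start_prob = {1: 0.1, 2: 0.8, 3: 0.1}                    # background probabilities
trans_prob = {a: {b: (0.8 if a == b else 0.1) for b in states} for a in states}
emission_mean = {cn: math.log2(cn / 2) for cn in states}  # expected log ratio per state

# Hypothetical raw CN (log ratios) along ten consecutive SNPs.
observed = [0.05, -0.1, 0.6, 0.5, 0.7, 0.0, 0.05, -0.9, -1.1, -1.0]
estimated_cn = viterbi(observed, states, start_prob, trans_prob, emission_mean, emission_sd=0.3)
```

Working in log probabilities avoids numerical underflow on long chromosomes. The "iterative" variant mentioned above would re-estimate the probabilities from the decoded path and run the decoding again; that loop is not shown.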






HMM Estimation of CN for Chr. 1 (Pair #101)

Black: Normal Intensities; Red: Tumor Intensities; Green: Tumor - Normal; Blue: HMM estimated CNs in Tumor Tissue

Population Level Analysis

Analysis for the whole group (or sub-group) of samples

  • Overall significance test
  • Amplification and deletion frequencies summarization
  • Common/concurrent region finding
  • Associations (with mutations, LOHs, clinical variables …)
Sliding Window Analysis

[Figure: overlapping windows (Window 1, Window 2, …, Window k, …, Window N) sliding along the SNP sequence]

Each window (k) contains 30 consecutive SNPs (k, k+1, k+2, k+3, …, k+29)
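The window definition can be sketched directly. SNPs are numbered from 1 to match the slide, so window k covers SNPs k to k+29; the function names and the toy chromosome of 100 SNPs are assumptions.

```python
def sliding_windows(n_snps, window_size=30):
    """Return one range of SNP numbers per window, using 1-based numbering."""
    return [range(k, k + window_size) for k in range(1, n_snps - window_size + 2)]

def window_means(raw_cn, window_size=30):
    """Mean raw CN in every window; raw_cn[0] belongs to SNP 1."""
    return [
        sum(raw_cn[snp - 1] for snp in window) / window_size
        for window in sliding_windows(len(raw_cn), window_size)
    ]

# Hypothetical chromosome with 100 SNPs and no copy number change.
windows = sliding_windows(100)
means = window_means([0.0] * 100)
```

Because consecutive windows share 29 of their 30 SNPs, neighbouring test statistics are strongly correlated, which matters when interpreting runs of significant windows.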

Sliding Window Test of Significance of CN Changes

-log(p) values, based on ~ 400 pairs


CN Change Frequencies in Population (Chr. 14, ~400 pairs)

Black: Freq. (CN>0); Red: Freq. (CN>0, significant amplification at 0.01 level); Green: Freq. (CN<0, significant deletion at 0.01 level)
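The three frequency curves described in the legend can be computed as in the sketch below, assuming that each sample pair already has a raw CN and a p-value at every position (for example from the window-based test). The function name, the data layout and the numbers are illustrative assumptions.

```python
def cn_change_frequencies(cn_by_sample, p_by_sample, alpha=0.01):
    """Per-position frequencies across samples.

    cn_by_sample[j][i]: raw CN (log ratio) of sample j at position i.
    p_by_sample[j][i]:  p-value of the CN change of sample j at position i.
    Returns (freq_gain, freq_sig_amp, freq_sig_del).
    """
    n_samples = len(cn_by_sample)
    n_positions = len(cn_by_sample[0])
    freq_gain, freq_sig_amp, freq_sig_del = [], [], []
    for i in range(n_positions):
        pairs = [(cn[i], p[i]) for cn, p in zip(cn_by_sample, p_by_sample)]
        freq_gain.append(sum(1 for cn, _ in pairs if cn > 0) / n_samples)
        freq_sig_amp.append(sum(1 for cn, p in pairs if cn > 0 and p < alpha) / n_samples)
        freq_sig_del.append(sum(1 for cn, p in pairs if cn < 0 and p < alpha) / n_samples)
    return freq_gain, freq_sig_amp, freq_sig_del

# Hypothetical data: four sample pairs, two positions.
cn = [[0.5, -0.4], [0.3, -0.6], [-0.1, -0.2], [0.8, 0.1]]
pv = [[0.001, 0.2], [0.2, 0.005], [0.5, 0.009], [0.004, 0.3]]
freq_gain, freq_sig_amp, freq_sig_del = cn_change_frequencies(cn, pv)
```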


Population Level Segmentation Analysis (~400 pairs)

Circular Binary Segmentation approach, Bioconductor package DNAcopy


Separate Tumor Samples from Normal Samples Using Six Chromosomal Peaks with Significant CN Changes

(Classification Based on RAW CN)




Affymetrix Chips

Illumina Chips

dChip

Bioconductor R Packages:

GLAD package, adaptive weights smoothing (AWS) method

DNAcopy package, circular binary segmentation method

Windows?

Unix?

Parallel Computation?

  • R Gentleman et al. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005
  • JL Freeman et al. Genome Research 2006; 16:949-961
  • J Huang et al. Hum Genomics. 2004;1(4):287-99
  • X Zhao et al. Cancer Research 2004; 64:3060-3071
  • Y Nannya et al. Cancer Research 2005, 65: 6071-6079
  • … see google …

Division of Statistical Genomics: Aldi Kraja, Ingrid Borecki, Michael Province

Medical Sequencing Group: Li Ding, John Osborne, Ken Chen

Center for Genome Sciences

Washington University School of Medicine