Copy Number Variations

Copy Number Variations DTC BioInformatics Course Hilary Term 2010 WTCHG, Thursday 12th of February Jean-Baptiste Cazier • http://www.well.ox.ac.uk/dr-jean-baptiste-cazier • Jean-Baptiste.Cazier@well.ox.ac.uk

Outline • Lecture • Definitions • More important than it may seem • Identification • Technology, Algorithmic, Design • Recent studies • McCarroll & Korn, GSV, WTCCC • Break • The special case of Cancer • More problems • Conclusions • Break • Practical • Applications with CGH data in R • Application with SNP data in Illumina’s BeadStudio (PC only)

Definitions • Acronyms: • CNP: • Copy Number Polymorphisms • CNV: • Copy Number Variations • CNA: • Copy Number Aberrations • Copy Number Alterations • Creation: Germline vs Somatic • Did the CNV come from the original cell, or did it arise later in only a few cells? • Very many CNVs are shared within populations, like SNPs or STRs • Somatic propagation of CNVs is a mark of Cancer Finding the missing heritability of complex diseases TA Manolio et al. Nature 461, 747-753 (2009)

Gain, Loss, etc • Normal: • 2 chromosomes are inherited, one from each parent • Deletion: • Homozygous: 0 copies left • Hemizygous: 1 copy left • Sizeable event: • -> not InDels • Gain • Can be 3, 4, 5, … copies • Most often nearby, but not always • Not LINE, SINE, repeats, etc. • Copy Neutral Loss of Heterozygosity • Not a Copy Number Polymorphism per se, but needs to be addressed Copy Number Variation in Human Health, Disease, and Evolution Zhang F et al, Annu. Rev. Genomics Hum. Genet. 2009 (10) 451-481

Mechanisms • 4 main mechanisms in the generation of CNV: • NAHR • Non-Allelic Homologous Recombination • NHEJ • Non-Homologous End-Joining • FoSTeS • Fork Stalling and Template Switching • L1 retrotransposition Copy Number Variation in Human Health, Disease, and Evolution Zhang F et al, Annu. Rev. Genomics Hum. Genet. 2009 (10) 451-481

Characterization • Identification: a Genome-Wide test • Karyotyping • Multi color chromosome painting • Comparative Genomic Hybridization (CGH) • Array CGH (aCGH) • “SNP”- array • Validation: a local test • qPCR: quantitative Polymerase Chain Reaction • MLPA: Multiplex Ligation-dependent Probe Amplification • Fluorescent In-Situ Hybridization (FISH) • Sequencing

Array technology • Array CGH • Agilent, Nimblegen • 2 channels: compare hybridization level to a common background reference • Usually 42 million probes genome-wide • Resolution up to 200bp • SNP array • Illumina, Affymetrix • Test one or few samples at a time • Initially developed for genotyping • 2 channels: allele A/B • Increasing density of markers • From 10,000 Linkage SNPs • Up to 5M SNPs and CNV probes Affymetrix

CNV in color • (a) Aberrations leading to aneuploidy. • (b) Aberrations leaving the chromosome apparently intact Chromosome aberrations in solid tumors Albertson DG et al. Nature Genetics 34, 369 - 376 (2003) [Figure: SNP array]

Revival • Genome-Wide Association provided some success in the identification of variants for many diseases: • AMD, Coeliac disease, Type 2 Diabetes, Prostate Cancer, Colorectal Cancer, etc. • However most variants are ‘only’ statistically significant: • 80% fall outside of coding regions • The case of Missing Heritability: • Whatever the number of variants identified, they usually account for only a small proportion of the heritability Finding the missing heritability of complex diseases TA Manolio et al. Nature 461, 747-753 (2009)

Missing Heritability • Need to find other “reasons” to explain the difference. • Heritability definition • Proportion of phenotypic variance attributable to additive genetic factors • The Common Variant Common Disease model is challenged • Look for more markers • Rarer with strong effect • Common with lower effect • Gene-Gene interaction • Shared environment • This is essentially a question of power • Groups are joining forces in very large consortia • Better technological coverage of the rarer variants • More variant types • Copy Number Variation • InDels, Segmental Duplications. • Comparable phenotyping in meta analysis ? • The ‘Dark Matter’ • Does it really exist ? • Can we see it beyond its influence ? Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio). Finding the missing heritability of complex diseases TA Manolio et al. Nature 461, 747-753 (2009)

SNP-array signature • Sample data for a number of different copy number and LOH events. • The Log R Ratio scales with copy number • The distribution of the B allele frequency is governed by a more complex relationship with allowable genotypes. [Figure panels: Simulation, Real data; events shown: Gain, Neutral, Loss]
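The relationship between copy number, Log R Ratio and B allele frequency can be sketched as follows. This is an idealised illustration, not the normalisation used by any particular platform: it assumes the Log R Ratio is simply log2 of the copy number over a diploid reference, and that each copy carries either allele A or allele B. The function names are hypothetical.

```python
import math

def expected_log_r_ratio(copy_number, reference_copies=2):
    """Idealised Log R Ratio: log2 of total copies over the reference copy number."""
    if copy_number == 0:
        return float("-inf")  # no signal at all for a homozygous deletion
    return math.log2(copy_number / reference_copies)

def expected_baf_clusters(copy_number):
    """Allowable B allele frequencies: k copies of B out of n copies, k = 0..n."""
    if copy_number == 0:
        return []  # BAF is undefined when no allele is present
    return [k / copy_number for k in range(copy_number + 1)]

for n in range(0, 5):
    print(n, expected_log_r_ratio(n), expected_baf_clusters(n))
```

Real arrays usually show a compressed Log R Ratio compared with these ideal values. Copy neutral LOH is the case where the Log R Ratio stays at 0 but the heterozygous cluster at 0.5 disappears, leaving only 0 and 1.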

Copy Number Loss SNP array aCGH

Copy Number Loss and Gain SNP array aCGH

Mixed Cell Population SNP array aCGH

Copy Neutral LOH SNP array aCGH

Automatic recognition of CNVs • Originally done by visual inspection • Problem of reproducibility • Problem of accuracy • With increasing probe density, events become impossible to inspect by eye • Automation and test • Moving average • Probe selection / compilation • Segmentation, Hidden Markov Model • Significance testing • Need to compile data with uncertainty

Moving average
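A moving average over probe log ratios can be sketched as below. This is a minimal illustration that assumes probes are already ordered along the chromosome and equally weighted; the window size and the example values are made up.

```python
def moving_average(values, window=5):
    """Centred moving average; the window shrinks at the chromosome ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical log2 ratios with a loss in the middle of the region
log_ratios = [0.1, -0.1, 0.0, 0.05, -0.9, -1.1, -1.0, -0.95, 0.0, 0.1]
print(moving_average(log_ratios, window=3))
```

A wider window reduces noise but blurs breakpoints and can hide events shorter than the window.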

Automation using a Hidden Markov Model • Automatically select the optimal Copy Number sequence over a chromosome to fit the Model • Evaluate the probability of the sequence of intensity signal fitting this model • Can test various models and select the most appropriate • The Model can be trained simply by feeding “typical” data sets • Look for minimum number of changes • Look for maximum instability • Select a most likely default state • … [Figure: lattice of copy number states 2, 1, 0 along the chromosome]

Process • Definition: • Find the underlying states giving the observation • Underlying states are the number of copies: 0,1,2, … • Observation is the Signal Intensity • Defined by 3 probabilistic entities • Start Value: • (P(0), P(1), P(2)) • State Transition: • (P(0|0), P(1|0), P(2|0), • P(0|1), P(1|1), P(2|1), • P(0|2), P(1|2), P(2|2)) • Emission probability • (P(Obs|0), P(Obs|1), P(Obs|2)) [Figure: lattice of hidden states 2, 1, 0 beneath the observations Obs1, Obs2, Obs3, Obs4, …, ObsN]
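The three probabilistic entities above are enough to decode the most likely copy number sequence with the Viterbi algorithm. The sketch below is a minimal illustration, not the model of any specific package: the start probabilities, the probability of staying in the same state, and the Gaussian emission means and standard deviation are all assumed values chosen for the example.

```python
import math

STATES = [0, 1, 2]                    # copy number states
START = {0: 0.01, 1: 0.04, 2: 0.95}   # assumed start probabilities P(state)
STAY = 0.98                           # assumed P(s|s); the rest is shared equally
MEANS = {0: -3.0, 1: -1.0, 2: 0.0}    # assumed mean log2 ratio emitted by each state
SD = 0.3                              # assumed common noise level of the signal

def log_transition(prev, cur):
    """log P(cur | prev)."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))

def log_emission(obs, state):
    """log P(obs | state) under a Gaussian emission model."""
    z = (obs - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2.0 * math.pi))

def viterbi(observations):
    """Return the most likely sequence of copy number states."""
    if not observations:
        return []
    score = {s: math.log(START[s]) + log_emission(observations[0], s) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(obs, cur))
            pointer[cur] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path

signal = [0.05, -0.1, 0.0, -1.1, -0.9, -1.0, -1.05, 0.1, 0.0]
print(viterbi(signal))
```

With a high probability of staying in the same state, a single outlying probe is absorbed into the surrounding state, while a run of consistent probes triggers a state change. This is one way to express "look for the minimum number of changes".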

Segmentation CNAM employs a powerful optimal segmenting algorithm using dynamic programming to detect inherited and de novo CNVs on a per-sample (univariate) and multi-sample (multivariate) basis. Unlike Hidden Markov Models, which assume the means of different copy number states are consistent, optimal segmenting properly delineates CNV boundaries in the presence of mosaicism, even at a single probe level, and with controllable sensitivity and false discovery rate.
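Segmentation by dynamic programming can be illustrated with a generic penalised least-squares formulation. The sketch below is not the CNAM algorithm, whose details are not given here; it assumes a squared-error cost per segment plus a fixed penalty for each segment, and the function name and example values are hypothetical.

```python
def optimal_segments(values, penalty=0.5):
    """Split values into segments minimising squared error plus penalty per segment.

    Returns a list of (start, end, mean) with end exclusive.
    """
    n = len(values)
    # Prefix sums give the squared error of any segment in constant time
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [float("inf")] * n   # best[j]: optimal score for values[:j]
    last_start = [0] * (n + 1)          # start of the last segment in that optimum
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                last_start[j] = i

    # Backtrack from the end to recover the segment boundaries
    segments = []
    j = n
    while j > 0:
        i = last_start[j]
        segments.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    segments.reverse()
    return segments

log_ratios = [0.0, 0.1, -0.1, 0.0, -1.0, -1.1, -0.9, -1.0, 0.0, 0.1]
for start, end, mean in optimal_segments(log_ratios):
    print(start, end, round(mean, 3))
```

Unlike the HMM, this formulation does not fix the segment means in advance, so intermediate levels such as those produced by mosaicism get their own mean. The penalty plays the role of a sensitivity control: lowering it allows shorter and weaker segments.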

Available software • Graphical Interface: • Agilent • Golden Helix • Partek • BeadStudio/GenomeStudio • Golf • CNAT • CNAG • dChip • PennCNV • … • Uneven field of quality and specificity • Command line • QuantiSNP • BirdSuite • OncoSNP * • … • R packages • Somatics * • DNAcopy • Aroma • … * Cancer Specific tools

Development of recent arrays • In 2008 McCarroll and Korn published the identification of CNPs and CNVs using the high resolution Affymetrix SNP 6.0 array, which they helped design

SNP 6.0 by McCarroll • “We designed a hybrid genotyping array (Affymetrix SNP 6.0) to simultaneously measure 906,600 SNPs and copy number at 1.8 million genomic locations. By characterizing 270 HapMap samples, we developed a map of human CNV (at 2-kb breakpoint resolution) informed by integer genotypes for 1,320 copy number polymorphisms (CNPs)” McCarroll • Published the analysis together with the chip design and an algorithm suite: BirdSuite • Performs both genotyping and CNV identification • First calls known CNPs • Then looks for new CNVs • 80% of observed copy number differences are due to common CNPs (MAF>5%) • > 99% derive from inheritance rather than new mutation • Found a common deletion polymorphism in perfect LD with Crohn’s disease SNPs • 2kb upstream IRGM • Affects level of expression

High density of probes • Can identify smaller events • E.g. Important to spot residual event in translocation/fusion genes • Gain confidence in SNP-regions by increasing the number of probes • Can get better resolution, i.e. more accurate breakpoints: • Can split existing large regions into smaller ones • Better coverage of CNP • These regions were mostly not covered by SNP-only arrays • Beware of overrepresentation of these regions • Tiling across the genome • More exhaustive picture

Increase density • [Figure: copy number tracks (axis ticks 4, 2, 1) for the 10K, 250K Nsp, 250K Sty and 6.0 arrays] • Loss of 65Kb region confidently identified only with SNP 6.0, Bryan Young et al, Cancer Research UK

Too much data ? • [Figure: copy number and Log 2 Ratio tracks for Run I and Run II, with t-tests on Run I, on Run II, and on the summation of I and II] • Replicates increase the signal-to-noise ratio and help avoid false positives and false negatives • But it costs twice as much !
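The benefit of replicates can be illustrated with simulated noise. The sketch below assumes independent Gaussian noise of equal size in both runs, and it averages the runs rather than summing them, which changes the scale but not the signal-to-noise ratio. All values are simulated.

```python
import random
import statistics

def average_replicates(run_a, run_b):
    """Probe-wise mean of two replicate runs of the same sample."""
    if len(run_a) != len(run_b):
        raise ValueError("replicate runs must cover the same probes")
    return [(a + b) / 2.0 for a, b in zip(run_a, run_b)]

rng = random.Random(0)   # fixed seed so the simulation is repeatable
n_probes = 2000
noise_sd = 0.3           # assumed per-run noise on the log2 ratio

# Two runs of a copy-neutral region: true log2 ratio 0 plus independent noise
run_1 = [rng.gauss(0.0, noise_sd) for _ in range(n_probes)]
run_2 = [rng.gauss(0.0, noise_sd) for _ in range(n_probes)]
combined = average_replicates(run_1, run_2)

sd_single = statistics.pstdev(run_1)
sd_combined = statistics.pstdev(combined)
print("noise in one run:", round(sd_single, 3))
print("noise after averaging:", round(sd_combined, 3))
```

Under these assumptions the noise drops by a factor of about the square root of 2, so a real shift in copy number stands out more clearly for the same threshold.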

Potential Issues • Interpretation • What to use as a baseline ? i.e. define the Ratio • Variations in probe coverage: • Gaps • Overlapping probes • Inaccurate reference • Reference build is inaccurate • Probes cannot match the locus accurately • Systematic error • Autocorrelation with GC content • Preparation, e.g. genome amplification

Overlapping probes in regions of CNP

Probes in repeat elements

SNPs in probes • The special case of rodents: • There can be many strains from a limited number of founders • Full sequencing has been limited • The reference used for the probe generation can be far from the strain tested • This will lead to failure across the genome Gauguier et al, in preparation

Systematic SNPs in probes • There can be mosaicism • Grouping of SNPs in specific regions • Generates systematic drops in hybridization at specific loci • Can be misinterpreted as deletion • Be aware of the regions with SNPs • And correct for the lack of hybridization • Design specific probes for the strain Gauguier et al, in preparation

Recent CNV Survey • Recently 2 projects started in parallel to identify and characterize CNVs in humans: • The Genome Structural Variation Consortium (GSV) • CNV discovery project to identify common CNVs using aCGH by Nimblegen, • Detection in 20 CEU, 20 YRI, 1 reference • Assayed in 450 HapMap samples • The Wellcome Trust Case Control Consortium (WTCCC) • Test for association to diseases of CNVs in the WTCCC • 16,000 cases, WTCCC plus Breast cancer • 3,000 common controls

The GSV study design

The GSV study outcome Localization Function of CNVs

The GSV study outcome (II) • Designed an array with 42 million probes • Covers 11,700 CNVs larger than 443 bp • 8,599 validated independently • Generated reference genotypes for 4,978 on 450 samples • Identified 30 loci with CNVs that are candidates for influencing phenotype • Striking effect of purifying selection • Acts on exonic and intronic deletions • So functional variants should be rare • But most common CNVs are already well tagged by existing SNP arrays • May need to look elsewhere to solve the missing heritability

The WTCCC study • Use the WTCCC cohort of 16,000 samples and 3,000 common controls. • Bipolar, type 1 diabetes, type 2 diabetes, coronary artery disease, hypertension, rheumatoid arthritis, Crohn’s disease + Breast Cancer • 1,500 1958 Birth Cohort and 1,500 National Blood Donor • Designed a specific array using GSV set, McCarroll,1M and WTCCC1 • 104,000 probes targeting 12,000 putative loci • Perform assay using the Agilent platform by Oxford Gene Technology (OGT) against a common pooled reference sample • Attempt to design a robust pipeline to call all CNV across the different studies • Use CNVtools by Plagnol and local by Cardin (“Chiamesque”) http://www.wtccc.org.uk/ccc1/plus_typing_array.shtml

The WTCCC results • 3,900 CNV identified • 3,100 validated after QC • Concordance of 99.8% on 420 known duplicates • Remaining 8,000 CNVs from original selection: • False positive in discovery • Too noisy, but genuine • Genuine but very rare • 19 CNVs taken forward to replication with Bayes Factor: ~10^-4 p-value • 14 failed to replicate either using tagged SNPs or direct typing • 5 associations

The WTCCC conclusions • Each CNV behaves uniquely • Size, genomic location, biological sample type, sample preparation • Designed 16 different pipelines • Key parameters: • Normalization • Integration of the 10 probes • Impossible to define a one-pipeline-fits-all approach • Shows the importance of having duplicates and a large amount of diverse data • Confirmed the overrepresentation of CNVs in intronic regions • Confirmed the high level of tagging with SNP 6.0 or HapMap2 • MAF > 10% : 75% tagged at r²>0.8 • MAF <5% : 40% tagged at r²>0.8 • Found few new CNVs associated with phenotype

Conclusions of these studies • Both identified many CNVs in the human genome • Characterization of CNVs is very difficult, and not easily streamlined • Careful interpretation of association results • Some artifacts will survive confirmation • Many CNVs co-localize with variants identified by GWAS • Good functional candidates • But most of the common CNVs are already well tagged with SNPs • This will not bring new common variants in common disease • i.e. these will not solve the mystery of missing heritability. • Rare CNVs can still be associated with diseases, but only as much as SNPs

What to do with CNVs then ? • Copy Number Variations are key in Cancer • Cancers are typical of somatic variations • They are therefore mostly unique • Cannot be tagged • Relatively common event • Although still difficult, identifying them is essential

Cancer Schematic illustration of chromosomal evolution in human solid tumor progression. The stages of progression are arranged with the earlier lesions at the top. Cells may begin to proliferate excessively owing to loss of tissue architecture, abrogation of checkpoints and other factors. In general, relatively few aberrations occur before the development of in situ cancer. A sharp increase in genome complexity (the number of independent chromosomal aberrations) in many (but not all) tumors coincides with the development of in situ disease. The types and range in aberration number vary markedly between tumors. Examples shown: HCT116, a mismatch repair–defective cell line; T47D, a mismatch repair–proficient cell line. Chromosome aberrations in solid tumors Albertson DG et al. Nature Genetics 34, 369 - 376 (2003)

Germline vs. Somatic • Germline variants • The aberration exists from the start, and is inherited • Such variants are more likely to be common Copy Number Polymorphisms, predisposing variants. • Approach similar to non-cancer studies • Somatic events • Aberrations happen during the life-time • Happen more than once • Heterogeneous events; => Each cancer is unique • In Tumours, recurrent aberrations are more likely to be linked to the cancer as a selective advantage • We want to identify the regions with recurrent events
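Looking for recurrent events amounts to counting how many tumours carry the same aberration. The sketch below is a deliberately simplified illustration: it assumes calls have already been summarised to a (region, event) label per sample, whereas real somatic CNVs each have their own breakpoints and would need to be compared by overlap. The sample names and calls are invented.

```python
from collections import Counter

def recurrent_regions(calls_by_sample, min_samples=2):
    """Count how many samples carry each (region, event); keep the recurrent ones."""
    counts = Counter()
    for calls in calls_by_sample.values():
        counts.update(set(calls))   # count each event at most once per sample
    return {event: n for event, n in counts.items() if n >= min_samples}

# Hypothetical calls, purely for illustration
calls_by_sample = {
    "tumour_A": [("8p", "loss"), ("17q", "gain")],
    "tumour_B": [("8p", "loss")],
    "tumour_C": [("8p", "loss"), ("17q", "gain"), ("3p", "loss")],
}
print(recurrent_regions(calls_by_sample))
```

Events seen in a single tumour are dropped, while those shared by several tumours are kept as candidates for a selective advantage.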

More issues • Interpretation • What to use as a baseline ? i.e. define the Ratio • Within sample baseline of 2 is not an easy assumption anymore • Heterogeneity of tissue • Biopsy can be “contaminated” by normal tissue • Cancers are usually made up of a set of co-existing clones • CNVs are unique • Each one has its own breakpoints • Systematic error • Preparation, e.g. genome amplification • Sample quality

Copy Number Variations in Cancer • It is possible to analyse tumour samples using classic Copy Number tools, but the results are likely to be unsatisfactory as many model assumptions are violated: • The normalisation of SNP genotyping data can be affected by tumour samples containing large scale chromosomal alterations. • Most aberrations do not follow the classic diploidy and cannot fit usual clusters • So Genotype Calls might be forced on the wrong model AA/AB/BB: • Deletions should be 0 or A / B, • Copy Neutral LOH should be AA/BB • Triploid should be AAA/AAB/ABB/BBB • There can be intra-tumour heterogeneity • E.g. Mix of triploid and tetraploid • There can be contamination with normal cells (stromal contamination) Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Korn et al. Nat Genet. 2008 Oct;40(10):1253-60
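The effect of contamination with normal cells on the SNP-array signal can be sketched with a simple mixture. This is an idealised illustration, not the model of any tool named here: it assumes a single tumour clone mixed with diploid normal cells that are heterozygous (AB) at the SNP, and the function and parameter names are hypothetical.

```python
import math

def mixed_sample_signal(tumour_total, tumour_b, purity, normal_total=2, normal_b=1):
    """Expected (Log R Ratio, B allele frequency) for a tumour/normal cell mixture.

    tumour_total: total copies in tumour cells; tumour_b: copies of allele B.
    purity: fraction of tumour cells in the sample, between 0 and 1.
    """
    total = purity * tumour_total + (1.0 - purity) * normal_total
    b_copies = purity * tumour_b + (1.0 - purity) * normal_b
    if total == 0:
        return float("-inf"), float("nan")  # nothing left to hybridise
    return math.log2(total / 2.0), b_copies / total

# Hemizygous deletion of the B allele at decreasing tumour purity
for purity in (1.0, 0.75, 0.5):
    print(purity, mixed_sample_signal(1, 0, purity))

# Copy neutral LOH (BB in the tumour) at 50% purity
print(mixed_sample_signal(2, 2, 0.5))
```

As purity drops, the Log R Ratio moves back towards 0 and the B allele frequency moves back towards 0.5, so the event falls between the clusters expected for AA/AB/BB and a forced genotype call becomes unreliable.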

A deletion found in tumour AML sample at 8p using unpaired analysis. [Figure: Tumour sample vs Baseline, copy number axis 4, 2, 1]

Same deletion found in corresponding diagnostic AML sample at 8p [Figure: Tumour sample vs Baseline and Normal sample vs Baseline, copy number axis 4, 2, 1]

Need for pairing [Figure: Tumour sample vs Baseline, Normal sample vs Baseline, and Tumour sample vs Normal sample, copy number axis 4, 2, 1]
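Pairing can be sketched as a subtraction of log ratios. The example assumes both samples were measured against the same baseline on the same probes, so that log2(T/B) - log2(N/B) = log2(T/N); the values are invented.

```python
def paired_log_ratio(tumour_vs_baseline, normal_vs_baseline):
    """Tumour vs matched normal log2 ratio, probe by probe."""
    if len(tumour_vs_baseline) != len(normal_vs_baseline):
        raise ValueError("both samples must be measured on the same probes")
    return [t - n for t, n in zip(tumour_vs_baseline, normal_vs_baseline)]

# Probe 1: loss in both samples (inherited); probe 2: loss in tumour only (somatic)
tumour = [-1.0, -1.0, 0.0]
normal = [-1.0, 0.0, 0.0]
print(paired_log_ratio(tumour, normal))
```

A deletion present in both the tumour and the matched normal cancels out, while a change acquired by the tumour remains visible.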

Outliers, Batches PCA on 118 samples

Batch effect Removed the outlier, colored by batch

Type: Normal vs Tumour Removed the outlier, colored by type
