1 / 74

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays. Henrik Bengtsson (work with Terry Speed) Dept of Statistics, UC Berkeley January 24, 2008 Postdoctoral Seminars, Mathematical Biosciences Institute, The Ohio State University. Acknowledgments.

jrandy
Download Presentation

Copy-number estimation on the latest generation of high-density oligonucleotide microarrays

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Copy-number estimation on the latest generation of high-density oligonucleotide microarrays Henrik Bengtsson (work with Terry Speed) Dept of Statistics, UC Berkeley January 24, 2008 Postdoctoral Seminars, Mathematical Biosciences Institute, The Ohio State University

  2. Acknowledgments UC Berkeley: James Bullard Kasper Hansen Elizabeth Purdom Terry Speed WEHI, Melbourne: Mark Robinson Ken Simpson ISREC, Lausanne: “Asa” Wirapati John Hopkins: Benilton Carvalho Rafael Irizarry Affymetrix, California: Ben Bolstad Simon Cawley Luis Jevons Chuck Sugnet Jim Veitch

  3. Size = 264 kb, Number of loci = 72 Copy number analysis is about finding "aberrations" in a person's genome.

  4. Single Nucleotide Polymorphisms (SNPs) make us unique Definition: A sequence variation such that two genomes may differ by a single nucleotide (A, T, C, or G). Allele A:A ...CGTAGCCATCGGTA/GTACTCAATGATAG... Allele B:G A person has either genotype AA, AB, or BB at this SNP.

  5. Human Genetic Variation:Breakthrough of the Year 2007 (Science) • 3 billion DNA bases. • First sequenced 2001. • HapMap: 270 individuals genotyped.3 million known SNPs (places where one base differ from one person to another). Estimate: 15 million SNPs. • Genomewide association studies takeover (over linkage analysis). • Copy Number Polymorphism:- 1,000s to millions of bases lost or added.- Estimate: 20% of differences in gene activity are due to copy-number variants; SNPs (genotypes) account for the rest. • January 22, 2008: The 3-year "1,000 Genomes Project" will sequence 1,000 individuals. This follows the HapMap Project (SNPs).

  6. Objectives of this presentation • Total copy number estimation/segmentation • Estimate single-locus CNs well(segmentation methods take it from there) • All generations of Affymetrix SNP arrays: • SNP chips: 10K, 100K, 500K • SNP & CN chips: 5.0, 6.0 • Small and very large data sets

  7. Available in aroma.affymetrix “Infinite” number of arrays: 1-1,000s Requirements: 1-2GB RAM Arrays: SNP, exon, expression, (tiling). Dynamic HTML reports Import/export to existing methods Open source: R Cross platform: Windows, Linux, Mac

  8. Affymetrix chips

  9. Running the assaytake 4-5 working days • Start with target gDNA (genomic DNA) or mRNA. • Obtain labeled single-strandedtargetDNA fragments forhybridization to theprobeson the chip. • After hybridization, washing, and scanning we get a digital image. • Image summarized across pixels to probe-level intensities before we begin. Thisis our "raw data".

  10. Restriction enzymes digest the DNA, which is then amplified and hybridized

  11. * * * * 5 µm 5 µm > 1 million identical 25bp sequences The Affymetrix GeneChip is a synthesized high-density 25-mer microarray * 1.28 cm 1.28 cm 6.5 million probes/chip

  12. Target DNA find their way to complementary probes by massive parallel hybridization

  13. Hybridization + Scanning DAT File(s) [Image, pixel intensities] Image analysis CDF [Chip Description File] CEL File(s) [Probe Cell Intensity] + workable raw data Pre-processing Segmentation

  14. Affymetrix copy-number & genotyping arrays

  15. * * * * * * * * 25 nucleotides Other DNA Other DNA Target seq. Other seq. other PMs PM Terminology Target sequence: ...CGTAGCCATCGGTAAGTACTCAATGATAG... ||||||||||||||||||||||||| Perfect match (PM): ATCGGTAGCCATTCATGAGTTACTA

  16. * * * * * * * * * CN=1 CN=2 CN=3 PM = c PM = 2¢c PM = 3¢c Copy-number probes are used to quantify the amount of DNA at known loci CN locus:...CGTAGCCATCGGTAAGTACTCAATGATAG... PM:ATCGGTAGCCATTCATGAGTTACTA

  17. Raw copy numbers- log-ratios relative to a reference From the preprocessing, we obtain for sample i=1,2,...,I, CN locus j=1,2,...,J: Observed signals: (i1, i2, ..., iJ) These are not absolute copy-number levels. In order to interpret these, we compare each of them to a reference "R", i.e. ij / Rj, but even better "raw copy numbers": Mij = log2 (ij / Rj) = log2(ij) - log2(Rj) The reference can be from normal tissue, or from a pool of normal samples.

  18. Even without a segmentation algorithm, we can easily spot a deletion here. Copy number regions are found by lining up estimates along the chromosome Example: Log-ratios for one sample on Chromosome 22.

  19. Single Nucleotide Polymorphisms (SNPs) make us unique Definition: A sequence variation such that two genomes may differ by a single nucleotide (A, T, C, or G). Allele A:A ...CGTAGCCATCGGTA/GTACTCAATGATAG... Allele B:G A person has either genotype AA, AB, or BB at this SNP.

  20. * * * * * * * * * * * * * * * * * * * AA BB AB PMA >> PMB PMA << PMB PMA ¼PMB Affymetrix probes for a SNP - can be used for genotyping PMA:ATCGGTAGCCATTCATGAGTTACTA Allele A:...CGTAGCCATCGGTAAGTACTCAATGATAG...Allele B:...CGTAGCCATCGGTAGGTACTCAATGATAG...PMB:ATCGGTAGCCATCCATGAGTTACTA

  21. * * * * * * * * * * * * * * * * * * * * * * * * AA AB AAB PM =PMA+PMB = 2c PM = PMA + PMB = 2c PM =PMA+PMB = 3c BB PM =PMA + PMB = 2c SNPs can also be used forestimating copy numbers *

  22. Combing CN estimates from SNPs and CN probes means higher resolution SNPs + CN probes

  23. A brief history...

  24. Genome-Wide Human SNP Array 6.0is the state-of-the-art array • > 906,600 SNPs: • Unbiased selection of 482,000 SNPs:historical SNPs from the SNP Array 5.0 (== 500K) • Selection of additional 424,000 SNPs: • Tag SNPs • SNPs from chromosomes X and Y • Mitochondrial SNPs • Recent SNPs added to the dbSNP database • SNPs in recombination hotspots • > 946,000 copy-number probes: • 202,000 probes targeting 5,677 CNV regions from the Toronto Database of Genomic Variants. Regions resolve into 3,182 distinct, non-overlapping segments; on average 61 probe sets per region • 744,000 probes, evenly spaced along the genome

  25. How did we get here? Data from 2003 on Chr22 (on of the smaller chromosomes)

  26. 2003: 10,000 loci x1

  27. 2004: 100,000 loci x10

  28. 2005: 500,000 loci x50

  29. 2006: 900,000 loci x90

  30. 2007: 1,800,000 loci x180

  31. 4£ further out… 10K 294kb 100K 26kb 500K 6.0kb # loci 5.0 3.6kb 6.0 1.6kb year Rapid increase in density Distance between loci: next? 2003 2004 2005 2006 2007

  32. Affymetrix & Illumina are competing - we get more bang for the buck (cup) Price source: Affymetrix Pricing Information, http://www.affymetrix.com/, January 2008.

  33. Preprocessing forcopy-number analysisCopy-number estimation using Robust Multichip Analysis (CRMA)

  34. Copy-number estimation using Robust Multichip Analysis (CRMA)

  35. * * * * * * * * * * * * * * * * * * * AA BB AB PMA >> PMB PMA << PMB PMA ¼PMB Crosstalk between alleles adds significant artifacts to signals Cross-hybridization: Allele A: TCGGTAAGTACTC Allele B: TCGGTATGTACTC

  36. BB AB PMB AA + PMA offset Crosstalk between alleles is easy to spot

  37. PMB PMA Crosstalk between alleles can be estimated and corrected for

  38. Before removing crosstalk the arrays differ significantly... Crosstalk calibration corrects for differences in distributions too log2 PM

  39. When removing crosstalk systemdifferences between arrays goes away Crosstalk calibration corrects for differences in distributions too log2 PM

  40. Four measurements of the same thing: log2 PM With different scales: log(b*PM) = log(b)+log(PM) log2 PM With different scales and some offset:log(a+b*PM) = ... log2 PM How can a translation and a rescaling make such a big difference?

  41. * * * * * * * * * * * * * * * * * * PM = PMA +PMB Copy-number estimation using Robust Multichip Analysis (CRMA) AA AB PM =PMA+PMB * BB PM =PMA +PMB

  42. 1 2 3 4 5 6 7 1 2 3 4 5 6 7 PMA PMB PMA PMB Genotype AA Genotype BB 1 2 3 4 5 6 7 PMA PMB Genotype AB For robustness (against outliers), there are multiple probes per SNP

  43. Copy-number estimation using Robust Multichip Analysis (CRMA) The log-additive model: log2(PMijk) = log2ij+ log2jk + ijk sample i, SNP j, probe k. Fit using robust linear models (rlm)

  44. Probe-level summarization- probe affinity model For a particular SNP, the total CN signal for sample i=1,2,...,I is: i Which we observe via K probe signals: (PMi1, PMi2, ..., PMiK) rescaled by probe affinities: (1, 2, ..., K) A model for the observed PM signals is then: PMik = k * i + xik where xik is noise.

  45. Probe-level summarization- the log-additive model For one SNP, the model is: PMik = k * i + ik Take the logarithm on both sides: log2(PMik) = log2(k * i + ik) ¼ log2(k * i)+ ik = log2k + log2i + ik Sample i=1,2,...,I, and probe k=1,2,...,K.

  46. Probe-level summarization- the log-additive model With multiple arrays i=1,2,...,I, we can estimate the probe-affinity parameters {k} and therefore also the "chip effects" {i} in the model: log2(PMik) = log2k + log2i + ik Conclusion: We have summarized signals (PMAk,PMBk)for probes k=1,2,...,K into one signal iper sample.

  47. 100K Copy-number estimation using Robust Multichip Analysis (CRMA) Longer fragments ) less amplified by PCR ) weaker SNP signals 

  48. 500K Copy-number estimation using Robust Multichip Analysis (CRMA) Longer fragments ) less amplified by PCR ) weaker SNP signals 

  49. Copy-number estimation using Robust Multichip Analysis (CRMA) Normalize to get same fragment-length effect for all hybridizations

  50. Copy-number estimation using Robust Multichip Analysis (CRMA) Normalize to get same fragment-length effect for all hybridizations

More Related