
Informatics for next-generation sequence analysis – SNP calling

Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008.


Presentation Transcript


  1. Informatics for next-generation sequence analysis – SNP calling. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008

  2. Read length and throughput
     [Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
     • Illumina/Solexa, AB/SOLiD short-read sequencers: 1-4 Gb in 25-50 bp reads
     • 454 pyrosequencer: 20-100 Mb in 100-250 bp reads
     • ABI capillary sequencer

  3. Current and future application areas
     • Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
       [Figure labels: reference genome, DEL, SNP]
     • De novo genome sequencing
     • Short-read sequencing will be (at least) an alternative to micro-arrays for:
       • DNA-protein interaction analysis (ChIP-Seq)
       • novel transcript discovery
       • quantification of gene expression
       • epigenetic analysis (methylation profiling)

  4. Fundamental informatics challenges (I)
     1. Interpreting machine readouts – base calling, base error estimation
     2. Dealing with non-uniqueness in the genome: resequenceability
     3. Alignment of billions of reads

  5. Informatics challenges (II)
     4. SNP, short INDEL, and structural variation discovery
     5. Data visualization
     6. Data storage & management

  6. Resequencing-based SNP discovery
     • Read mapping
     • Read alignment
     • Paralog identification
     • SNP detection + inspection
     [Figure label: genome reference sequence]

  7. SNP calling workflow
     • read alignment
     • SNP detection
     • visual checking

  8. Bayesian detection algorithm
     Inputs:
     • Base call + base quality
     • Polymorphism rate (prior)
     • Base composition
     • Depth of coverage
     Output: Bayesian posterior probability, i.e. the SNP score
     [Figure: allele combinations (A A A A A, C C C C C, G G G G G, T T T T T), labeled "monomorphic combination" and "polymorphic combination"]
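The slide names the inputs and the output of the Bayesian detector but not its formulas. The following is a minimal sketch of one way such a score could be computed, not the authors' implementation. It assumes one base call per haploid sample, uniform base composition, errors spread evenly over the three other bases, and a prior that splits the polymorphism rate evenly over all polymorphic allele combinations. The names `snp_score`, `base_likelihood`, and `theta` are illustrative.

```python
from itertools import product

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def base_likelihood(observed, true_base, q):
    """P(observed call | true base), assuming errors are uniform over the 3 other bases."""
    e = phred_to_error(q)
    return 1 - e if observed == true_base else e / 3


def snp_score(calls, theta=0.001):
    """Posterior probability that the site is polymorphic.

    calls: list of (base, phred_quality), one call per haploid sample.
    theta: assumed prior probability that the site is polymorphic.
    """
    n = len(calls)
    if n < 2:
        raise ValueError("need at least two samples to observe a polymorphism")
    n_mono = 4              # AAAA..., CCCC..., GGGG..., TTTT...
    n_poly = 4 ** n - 4     # every other allele combination
    poly_mass = 0.0
    total_mass = 0.0
    # Enumerate every possible combination of true alleles across samples.
    for combo in product(BASES, repeat=n):
        is_poly = len(set(combo)) > 1
        prior = theta / n_poly if is_poly else (1 - theta) / n_mono
        likelihood = 1.0
        for (obs, q), true_base in zip(calls, combo):
            likelihood *= base_likelihood(obs, true_base, q)
        total_mass += prior * likelihood
        if is_poly:
            poly_mass += prior * likelihood
    return poly_mass / total_mass


if __name__ == "__main__":
    agree = [("A", 30)] * 5
    split = [("A", 30), ("A", 30), ("C", 30), ("C", 30), ("C", 30)]
    print("all samples agree:", snp_score(agree))
    print("two alleles seen: ", snp_score(split))
```

The enumeration grows as 4^n, so this form is only practical for a handful of samples; it is meant to show how base qualities and the prior combine into a single posterior.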

  9. Base quality values for SNP calling
     • base quality values help us decide if mismatches are true polymorphisms or sequencing errors
     • accurate base qualities are crucial, especially in lower coverage
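To make the role of base quality concrete, here is a small sketch assuming the qualities are on the standard Phred scale (Q = -10 log10 of the error probability). It computes how likely it is that a set of mismatching reads are all sequencing errors that happen to show the same alternate base. The function name `error_only_likelihood` and the uniform error model are assumptions for illustration.

```python
def phred_to_error(q):
    """Phred scale: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_only_likelihood(qualities):
    """Probability that every mismatching read miscalled the same alternate base.

    Assumes independent reads and errors spread evenly over the 3 other bases.
    """
    p = 1.0
    for q in qualities:
        p *= phred_to_error(q) / 3
    return p


if __name__ == "__main__":
    # One mismatching read vs. three, at low, medium, and high quality.
    for quals in ([10], [20], [40], [20, 20, 20]):
        print(quals, error_only_likelihood(quals))
```

With a single mismatching read, as is typical in low coverage, the result depends entirely on one quality value, which is why miscalibrated qualities hurt most there.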

  10. Priors for specific resequencing scenarios
      [Figure: reads AACGTTAGCATA and AACGTTCGCATA, differing at one position (A vs. C), grouped under the labels individual 1, individual 2, individual 3 and strain 1, strain 2, strain 3]

  11. Consensus sequence generation (genotyping)
      [Figure: reads AACGTTAGCATA and AACGTTCGCATA grouped under the labels individual 1-3 and strain 1-3, annotated with the calls A, A/C, C, C/C, A, A/A]
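The slide shows genotype calls such as A/C and C/C but not how they are derived. A minimal sketch of a diploid genotype caller follows, under assumptions that are not from the presentation: each read samples one of the two alleles with equal probability, errors are uniform over the other three bases, and `het_prior` is an illustrative prior for heterozygous genotypes. The haploid strain calls shown on the slide are not covered here.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def call_genotype(reads, het_prior=0.001):
    """Return (genotype, posterior) for one diploid individual at one site.

    reads: list of (base, phred_quality) covering the site.
    het_prior: assumed total prior probability of a heterozygous genotype.
    """
    unnormalized = {}
    # 10 unordered genotypes: 4 homozygous, 6 heterozygous.
    for a1, a2 in combinations_with_replacement(BASES, 2):
        prior = het_prior / 6 if a1 != a2 else (1 - het_prior) / 4
        likelihood = 1.0
        for obs, q in reads:
            e = 10 ** (-q / 10)
            p1 = 1 - e if obs == a1 else e / 3
            p2 = 1 - e if obs == a2 else e / 3
            # Each read is drawn from either allele with probability 1/2.
            likelihood *= 0.5 * p1 + 0.5 * p2
        unnormalized[a1 + "/" + a2] = prior * likelihood
    total = sum(unnormalized.values())
    best = max(unnormalized, key=unnormalized.get)
    return best, unnormalized[best] / total


if __name__ == "__main__":
    mixed = [("A", 30)] * 5 + [("C", 30)] * 2
    print(call_genotype(mixed))
    print(call_genotype([("C", 30)] * 5))
```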

  12. SNP calling in Roche/454 pyrosequences

  13. SNP calling in low 454 coverage
      • DNA courtesy of Chuck Langley, UC Davis
      • with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
      • 10 different African and American melanogaster isolates
      • 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
      • can we detect SNPs in survey-style 454 read coverage?
      [Figure labels: iso-1 reference; 46-2 454 read; 46-2 ABI reads (2 fwd + 2 rev)]
      • 92.9% validation rate (1,342 / 1,443)
      • 2.0% missed SNP rate (25 / 1,247)

  14. SNP calling in Illumina/Solexa short-reads

  15. SNP calling in short-read coverage
      C. elegans reference genome (Bristol, N2 strain); Pasadena, CB4858 (1 ½ machine runs)
      • SNP calling error rate very low:
        • Validation rate = 97.8% (224/229)
        • Conversion rate = 92.6% (224/242)
        • Missed SNP rate = 3.75% (26/693)
      • INDEL candidates validate and convert at similar rates to SNPs:
        • Validation rate = 89.3% (193/216)
        • Conversion rate = 87.3% (193/221)
      [Figure labels: SNP, INS]

  16. SNP calling in AB/SOLiD color-space reads
      A C G G T C G T C G T G T G C G T
      A C G G T C G T C G T G T G C G T   No change
      A C G G T C G C C G T G T G C G T   SNP
      A C G G T C G T C G T G T G C G T   Measurement error
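The colors that distinguish these cases are lost in the transcript. The sketch below assumes the standard two-base encoding, in which each color is determined by a pair of adjacent bases (equivalent to XOR on a 2-bit base code). Under that encoding a single-base substitution changes two adjacent colors in a consistent way, while an isolated single-color difference is more likely a measurement error. The function names are illustrative, and the leading primer base of real SOLiD reads is ignored.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a base sequence as colors, one per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def classify(ref_seq, read_colors):
    """Classify a color-space read against the reference sequence."""
    ref_colors = to_colors(ref_seq)
    diffs = [i for i, (r, c) in enumerate(zip(ref_colors, read_colors)) if r != c]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A substitution leaves the XOR of the two flanking colors unchanged.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "SNP"
    return "other"


if __name__ == "__main__":
    ref = "ACGGTCGTCGTGTGCGT"
    snp = "ACGGTCGCCGTGTGCGT"   # T -> C substitution, as on the slide
    noisy = to_colors(ref)
    noisy[6] = (noisy[6] + 1) % 4  # flip a single color
    print(classify(ref, to_colors(ref)))
    print(classify(ref, to_colors(snp)))
    print(classify(ref, noisy))
```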

  17. Mutational profiling: deep 454/Illumina/SOLiD data
      [Figure: Pichia stipitis reference sequence; image from JGI web site]
      • collaboration with Doug Smith at Agencourt
      • Pichia stipitis converts xylose to ethanol (bio-fuel production)
      • one mutagenized strain had especially high conversion efficiency
      • determine where the mutations were that caused this phenotype
      • we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
      • 14 true point mutations in the entire genome
      • in about 15X nominal coverage each technology can find every point mutation with essentially no false positives
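"Nominal coverage" is not defined on the slide. A common definition is total sequenced bases divided by genome size, and the sketch below assumes that definition. The 35 bp read length in the example is a hypothetical value within the 25-50 bp short-read range quoted earlier, not a figure from this experiment.

```python
import math


def nominal_coverage(n_reads, read_length_bp, genome_size_bp):
    """Total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(coverage, read_length_bp, genome_size_bp):
    """Smallest number of reads that reaches the requested nominal coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    genome = 15_000_000  # 15 Mb genome, as on the slide
    print(reads_needed(15, 35, genome))             # reads for 15X with 35 bp reads
    print(nominal_coverage(6_000_000, 50, genome))  # coverage from 6M reads of 50 bp
```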

  18. Our software is available for testing http://bioinformatics.bc.edu/marthlab/Beta_Release

  19. Credits
      Michael Stromberg, Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang
      Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt)
      Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
      http://bioinformatics.bc.edu/marthlab
