
Next-generation sequencing – the informatics angle

Gabor T. Marth, Boston College Biology Department. AGBT 2008, Marco Island, FL. February 6, 2008.


Presentation Transcript


  1. Next-generation sequencing – the informatics angle. Gabor T. Marth, Boston College Biology Department. AGBT 2008, Marco Island, FL. February 6, 2008

  2. T1. Roche / 454 FLX system
     • pyrosequencing technology
     • variable read-length
     • the only new technology with >100bp reads
     • tested in many published applications
     • supports paired-end read protocols with up to 10kb separation size

  3. T2. Illumina / Solexa Genome Analyzer
     • fixed-length short-read sequencer
     • read properties are very close to those of traditional capillary sequences
     • very low INDEL error rate
     • tested in many published applications
     • paired-end read protocols support short (<600bp) separation

  4. T3. AB / SOLiD system
     2-base color encoding (rows: 1st base, columns: 2nd base):
            A  C  G  T
         A  0  1  2  3
         C  1  0  3  2
         G  2  3  0  1
         T  3  2  1  0
     • fixed-length short-read sequencer
     • employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
     • requires color-space informatics
     • published applications underway / in review
     • paired-end read protocols support up to 10kb separation size
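
The 2-base encoding on this slide can be sketched in a few lines. This is a minimal illustration, not the vendor's software: it assumes the standard color matrix (identical bases give color 0), which is equivalent to XOR-ing 2-bit base codes A=0, C=1, G=2, T=3. The names `BASE_BITS`, `encode_colors` and `decode_colors` are hypothetical.

```python
# Illustrative sketch of 2-base (color-space) encoding.
# BASE_BITS, encode_colors and decode_colors are hypothetical names.
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_colors(seq):
    """Return one color (0-3) per adjacent base pair, via XOR of 2-bit codes."""
    bits = [BASE_BITS[base] for base in seq.upper()]
    return [first ^ second for first, second in zip(bits, bits[1:])]


def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the color calls."""
    current = BASE_BITS[first_base.upper()]
    bases = [BITS_BASE[current]]
    for color in colors:
        current ^= color  # XOR with the color gives the next base code
        bases.append(BITS_BASE[current])
    return "".join(bases)


reference = encode_colors("ACGT")  # [1, 3, 1]
variant = encode_colors("ACTT")    # [1, 2, 0]: the SNP changes two adjacent colors
```

Because every base contributes to two colors, a true SNP alters two adjacent colors while an isolated miscall alters only one, which is the basis for the error reduction mentioned above.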

  5. T4. Helicos / Heliscope system
     • experimental short-read sequencer system
     • single molecule sequencing
     • no amplification
     • variable read-length
     • error rate reduced with 2-pass template sequencing

  6. A1. Variation discovery: SNPs and short-INDELs
     1. sequence alignment
     2. dealing with non-unique mapping
     3. looking for allelic differences

  7. A2. Structural variation detection
     • copy number (for amplifications, deletions) from depth of read coverage
     • structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
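
As a rough illustration of the depth-of-coverage idea, the sketch below counts read starts in fixed windows and scales each window by the median count. The window size, the median baseline and the function names `window_counts` and `depth_ratios` are assumptions made for the example, and the read positions are invented.

```python
from statistics import median


def window_counts(read_starts, region_length, window):
    """Count reads whose 0-based start falls in each fixed-size window."""
    n_windows = (region_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts


def depth_ratios(counts):
    """Scale each window count by the median window count."""
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median window count is zero; coverage too sparse")
    return [count / baseline for count in counts]


starts = [5, 15, 25, 35, 36, 37, 45]  # made-up read start positions
counts = window_counts(starts, region_length=50, window=10)
ratios = depth_ratios(counts)  # the fourth window sits at 3x the median
```

A ratio well above 1 hints at an amplification and a ratio well below 1 at a deletion; real data would also need correction for the representational biases discussed later in the talk.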

  8. A3. Identification of protein-bound DNA
     [Figure: genome sequence with aligned reads]
     • chromatin structure (CHIP-SEQ): Mikkelsen et al. Nature 2007
     • transcription factor binding sites: Robertson et al. Nature Methods, 2007

  9. A4. Novel transcript discovery (genes)
     • novel transcripts in known genes
     • novel genes / exons
     [Figure labels: Known exon 1, Known exon 2; Inferred exon 1, Inferred exon 2]

  10. A5. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

  11. A6. Expression profiling by tag counting
      [Figure: aligned reads piled up over two genes]
      Jones-Rhoades et al. PLoS Genetics, 2007
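
Tag counting reduces to tallying aligned reads per gene. The sketch below is a simplified illustration with invented coordinates; the half-open gene intervals, the single-position read representation and the name `count_tags` are all assumptions.

```python
def count_tags(read_positions, genes):
    """Count reads per gene.

    read_positions: iterable of aligned read start positions.
    genes: dict mapping gene name to a half-open (start, end) interval.
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


genes = {"geneA": (100, 200), "geneB": (500, 800)}  # hypothetical annotation
reads = [110, 150, 199, 200, 510, 790, 900]         # made-up alignments
tag_counts = count_tags(reads, genes)
```

Here `geneA` collects three tags and `geneB` two; the reads at 200 and 900 fall outside both intervals and are not counted.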

  12. A7. De novo organismal genome sequencing
      [Figure: short reads, read pairs and longer reads combined into assembled sequence contigs]
      Lander et al. Nature 2001

  13. C1. Read length
      [Figure: read-length ranges on a 0-300 bp axis (read length [bp]): 20-35 (var), 25-35 (fixed), 25-40 (fixed), ~250 (var)]

  14. When does read length matter?
      • longer reads are needed where one must use parts of reads for mapping:
          • de novo sequencing
          • novel transcript discovery
      [Figure: example reads aacttagacttaca, gacttacatacgta and accgattactatacta placed against Known exon 1 and Known exon 2]
      • short reads are often sufficient where the entire read length can be used for mapping:
          • SNPs, short-INDELs, SVs
          • CHIP-SEQ
          • short RNA discovery
          • counting (mRNA, miRNA)

  15. C2. Read error rate
      • error rate dictates how many errors the aligner should tolerate
      • the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
      • in applications where specific alleles must also be called, error rate is even more important
      • error rate typically 0.4 - 1%
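
To see how the per-base error rate translates into aligner tolerance, the following sketch computes the fraction of reads with at most a given number of miscalled bases. It assumes independent errors at a constant rate along the read (a simplification) and a 35 bp read; `fraction_within` is a hypothetical name.

```python
from math import comb


def fraction_within(read_length, error_rate, max_errors):
    """Binomial probability that a read has at most max_errors miscalled bases."""
    return sum(
        comb(read_length, k) * error_rate**k * (1 - error_rate) ** (read_length - k)
        for k in range(max_errors + 1)
    )


# Fraction of 35 bp reads covered when the aligner allows 0, 1 or 2 errors
for allowed in (0, 1, 2):
    print(allowed, round(fraction_within(35, 0.01, allowed), 3))
```

Under these assumptions, at a 1% error rate only about 70% of 35 bp reads are error-free, so an aligner that allows no mismatches would discard a substantial share of the data.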

  16. C3. Error rate grows with each cycle
      • this phenomenon limits useful read length

  17. C4. Substitutions vs. INDEL errors
      • where INDEL errors dominate:
          • gapped alignment necessary
          • good SNP discovery accuracy
          • short-INDEL discovery difficult
      • where substitution errors dominate:
          • SNP discovery may require higher coverage for allele confirmation
          • INDELs can be discovered with very high confidence!

  18. C5. Quality values are important for allele calling
      • inaccurate or poorly calibrated base quality values hinder allele calling
      • PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
      Q-values should be accurate … and high!
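
For reference, a PHRED quality value Q and the estimated error probability P are related by Q = -10 log10(P). The two helper functions below are a small illustration of that conversion; their names are hypothetical.

```python
from math import log10


def phred_to_error_prob(q):
    """Convert a PHRED quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a PHRED quality value."""
    return -10 * log10(p)


p_q20 = phred_to_error_prob(20)     # 1 expected error in 100 base calls
q_rare = error_prob_to_phred(0.001) # 1 error in 1000 corresponds to about Q30
```

A base call at Q20 is therefore expected to be wrong once in 100 calls, which is why low-quality mismatches are weak evidence for an alternate allele.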

  19. Quality values should be well-calibrated
      • the assigned base quality value should be calibrated to represent the actual base quality in every sequencing cycle
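
One way to check calibration is to compare assigned quality values with the error rate actually observed, separately for each cycle. The sketch below assumes that base calls have already been labeled as correct or incorrect, for example by alignment to a trusted reference (an assumption, not something stated on the slide). The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from math import log10


def empirical_qualities(observations):
    """Map (cycle, assigned_q) to the quality implied by the observed error rate.

    observations: iterable of (cycle, assigned_q, is_error) tuples.
    Bins with no observed errors map to None (rate cannot be estimated).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cycle, assigned_q, is_error in observations:
        totals[(cycle, assigned_q)] += 1
        errors[(cycle, assigned_q)] += int(is_error)
    result = {}
    for key, total in totals.items():
        rate = errors[key] / total
        result[key] = -10 * log10(rate) if rate > 0 else None
    return result


# Toy data: Q20 calls are accurate in cycle 1 but over-optimistic in cycle 30
toy = [(1, 20, i < 10) for i in range(1000)]
toy += [(30, 20, i < 100) for i in range(1000)]
calibration = empirical_qualities(toy)
```

In the toy data the Q20 calls in cycle 1 show a 1% error rate (empirical Q20), while the Q20 calls in cycle 30 show a 10% error rate (empirical Q10), which is the kind of cycle-dependent miscalibration the slide warns about.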

  20. C6. Representational biases / library complexity
      [Figure labels: fragmentation biases; PCR amplification biases; sequencing; sequencing biases; low/no representation; high representation]

  21. Dispersal of read coverage
      • this affects variation discovery (deeper starting read coverage is needed)
      • its major impact is on counting applications
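
A simple way to quantify dispersal is the variance-to-mean ratio of read depth, which is close to 1 when reads are sampled uniformly at random (Poisson-like coverage) and larger when coverage is over-dispersed. This metric is offered as an illustration, not as the measure used in the talk; `dispersion_index` is a hypothetical name and the depths are invented.

```python
from statistics import mean, pvariance


def dispersion_index(depths):
    """Variance-to-mean ratio of per-position (or per-window) read depth."""
    average = mean(depths)
    if average == 0:
        raise ValueError("no coverage in the supplied depths")
    return pvariance(depths) / average


even = dispersion_index([10, 10, 10, 10])   # perfectly even coverage
patchy = dispersion_index([0, 20, 0, 20])   # same mean depth, very uneven
```

Both toy regions have a mean depth of 10, but the patchy one leaves half of its positions uncovered, so a higher starting coverage would be needed to call variants everywhere.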

  22. Amplification errors
      [Figure: an early amplification error gets propagated onto every clonal copy; many reads come from clonal copies of a single fragment]
      • early PCR errors in “clonal” read copies lead to false positive allele calls
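
One common countermeasure, not described on the slide itself, is to treat reads that share chromosome, start position and strand as putative clonal copies and count them as a single observation. The sketch below assumes reads are plain dictionaries with hypothetical `chrom`, `start` and `strand` keys.

```python
def collapse_duplicates(reads):
    """Keep the first read seen for each (chrom, start, strand) key."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [  # made-up alignments; the first three look like clonal copies
    {"chrom": "chr1", "start": 100, "strand": "+", "allele": "T"},
    {"chrom": "chr1", "start": 100, "strand": "+", "allele": "T"},
    {"chrom": "chr1", "start": 100, "strand": "+", "allele": "T"},
    {"chrom": "chr1", "start": 112, "strand": "-", "allele": "C"},
]
unique_reads = collapse_duplicates(reads)
```

After collapsing, the apparent three-read support for allele `T` shrinks to a single independent observation, which is less likely to trigger a false positive call.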

  23. C7. Paired-end reads
      • circularization: 500bp - 10kb (sweet spot ~3kb)
      • fragment length limited by library complexity
      Korbel et al. Science 2007
      • fragment amplification: fragment length 100 - 600 bp
      • fragment length limited by amplification efficiency
      • paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
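
The rescue of a non-uniquely mapping end can be sketched as follows: given the unique position of one end, keep only those candidate positions of the mate that are compatible with the expected fragment length. This is a simplified illustration that ignores read orientation and assumes all positions lie on the same chromosome; `rescue_mate` and its parameters are hypothetical.

```python
def rescue_mate(anchor_pos, candidates, min_frag, max_frag):
    """Pick the mate position consistent with the fragment length range.

    Returns the single compatible candidate, or None if zero or several
    candidates satisfy the constraint (the mate stays ambiguous).
    """
    compatible = [
        pos for pos in candidates
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    return compatible[0] if len(compatible) == 1 else None


# One end maps uniquely at 1000; the other end has three candidate locations
rescued = rescue_mate(1000, [1250, 50000, 910000], min_frag=100, max_frag=600)
ambiguous = rescue_mate(1000, [1200, 1300], min_frag=100, max_frag=600)
```

In the first call only the candidate at 1250 fits a 100 - 600 bp fragment, so the mate is rescued; in the second call two candidates fit and the mate remains unplaced.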

  24. Paired-end reads for SV discovery
      • longer fragments tend to have wider fragment length distributions
      • SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)
      • longer fragments increase the chance of spanning SV breakpoints and/or entire events
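
The 10% versus 30% comparison can be made concrete with a back-of-the-envelope calculation. The slide does not state the fragment length or the detection rule, so the sketch assumes a 3kb mean fragment (the circularization "sweet spot" from the previous slide) and a cutoff of 3 standard deviations for a single read pair; both are illustrative assumptions, as are the function names.

```python
def deletion_z_score(deletion_size, mean_fragment, relative_std):
    """Shift in apparent fragment length, in units of the library's std.

    A read pair spanning a deletion maps deletion_size bp farther apart
    on the reference than the library's typical fragment length.
    """
    return deletion_size / (relative_std * mean_fragment)


def is_detectable(deletion_size, mean_fragment, relative_std, z_cutoff=3.0):
    """True if a single spanning pair exceeds the assumed z-score cutoff."""
    return deletion_z_score(deletion_size, mean_fragment, relative_std) >= z_cutoff


tight = deletion_z_score(2000, 3000, 0.10)  # std = 300 bp
wide = deletion_z_score(2000, 3000, 0.30)   # std = 900 bp
```

With a 300 bp standard deviation a 2kb deletion shifts the pair by more than 6 standard deviations, whereas with a 900 bp standard deviation the shift is only a little over 2 and falls below the assumed cutoff.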

  25. C8. Technologies / properties / applications

  26. Thanks
      Michael Stromberg (MOSAIK talk Thursday, 7:40PM), Chip Stewart, Michele Busby, Aaron Quinlan, Damien Croteau-Chonka, Eric Tsung, Derek Barnett, Weichun Huang
      http://bioinformatics.bc.edu/marthlab
      Michael Egholm, David Bentley, Francisco de la Vega, Kristen Stoops, Ed Thayer, Clive Brown, Elaine Mardis
