1 / 48

Next-generation sequence analysis

Next-generation sequence analysis. Gabor T. Marth Boston College Biology Department PSB 2008 January 4-8. 2008. Read length and throughput. Illumina/Solexa, AB/SOLiD short-read sequencers. 1Gb. (1-4 Gb in 25-50 bp reads). bases per machine run. 100 Mb. 454 pyrosequencer.

Download Presentation

Next-generation sequence analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Next-generation sequence analysis Gabor T. Marth Boston College Biology Department PSB 2008 January 4-8. 2008

  2. Read length and throughput Illumina/Solexa, AB/SOLiD short-read sequencers 1Gb (1-4 Gb in 25-50 bp reads) bases per machine run 100 Mb 454 pyrosequencer (20-100 Mb in 100-250 bp reads) 10 Mb ABI capillary sequencer 1Mb read length 10 bp 100 bp 1,000 bp

  3. Sequencing chemistries DNA base extension DNA ligation Church, 2005

  4. Template clonal amplification Church, 2005

  5. Massively parallel sequencing Church, 2005

  6. Features of NGS data • Short sequence reads • 100-200bp • 25-35bp (micro-reads) • Huge amount of sequence per run • Up to gigabases per run • Huge number of reads per run • Up to 100’s of millions • Higher error as compared with Sanger sequencing • Error profile different to Sanger

  7. Current and future application areas Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing • Short-read sequencing will be (at least) an alternative to micro-arrays for: • DNA-protein interaction analysis (CHiP-Seq) • novel transcript discovery • quantification of gene expression • epigenetic analysis (methylation profiling)

  8. 3. Alignment of billions of reads Fundamental informatics challenges 1. Interpreting machine readouts – base calling, base error estimation 2. Dealing with non-uniqueness in the genome: resequenceability

  9. Informatics challenges (cont’d) 4. SNP and short INDEL, and structural variation discovery 5. Data visualization 6. Data storage & management

  10. Challenge 1. Base accuracy and base calling • machine read-outs are quite different • read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improves) • what is the instrument-specific error profile? • are the base quality values satisfactory? • (1) are base quality values accurate? • (2) are most called bases high-quality?

  11. 454 pyrosequencer error profile • multiple bases in a homo-polymeric run are incorporated in a single incorporation test  the number of bases must be determined from a single scalar signal the majority of errors are INDELs • error rates are nucleotide-dependent

  12. 454 base quality values • the native 454 base caller assigns too low base quality values

  13. PYROBAYES: determine base number New 454 base caller: data likelihoods priors posterior base number probability

  14. PYROBAYES: base calls and quality values • call the most likely number of nucleotides • produce three base quality values: • QS (substitution) • QI (insertion) • QD (deletion)

  15. PYROBAYES: Performance • better correlation between assigned and measured quality values • higher fraction of high-quality bases

  16. Illumina/Solexa base accuracy • error rate grows as a function of base position within the read • a large fraction of the reads contains 1 or 2 errors

  17. Illumina/Solexa base accuracy (cont’d) • Actual base accuracy for a fixed base quality value is a function of base position within the read (i.e. there is need for quality value calibration) • Most errors are substitutions  PHRED quality values work

  18. 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base C 2 3 0 1 G 3’ 3’ 3’ 5’ 5’ 5’ 1 3 2 0 T N N N A Tz z z N N N G Az z z N N N T Gz z z AB SOLiD System dibase sequencing 2-base, 4-color: 16 probe combinations • 4 dyes to encode 16 2-base combinations • Detect a single color indicates 4 combinations & eliminates 12 • Each color reflects position, not the base call • Each base is interrogated by two probes • Dual interrogation eases discrimination • errors (random or systematic) vs. SNPs (true polymorphisms)

  19. The decoding matrix allows a sequence of transitions to be converted to a base sequence, as long as one of two bases is known. 2nd Base A C G T 0 1 2 3 A 1 0 3 2 1st Base 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 3 3 AA AC AC AA AG AT AA AG AG CC CA CA CC CT CG CC CT CT GG GT GT GG GA GC GG GA GA TT TG TG TT TC TA TT TC TC A A C A A G C C T C C C A C C T A A G A G G T G G A T T C T T T G T T C G G A G C 2 3 0 1 G 1 3 2 0 T Converting dibase (color) into base calls 4 Possible Sequences

  20. Alignment to reference in “color-space” Reference • Working in color space: • Reverse-complementation becomes simply reverse • Apply color transition rules to remove measurement errors from partial assemblies • If reference of Sanger reads are combined, translate to color space

  21. A C G G T C G T C G T G T G C G T SOLiD error checking code (I) A C G G T C G T C G T G T G C G T No change A C G G T C G C C G T G T G C G T SNP A C G G T C G T C G T G T G C G T Measurement error

  22. G T C encodes G A C encodes 3 Possible Changes in Dibase Encoding 3 Possible Changes in the Middle Base G C C encodes G G C encodes SOLiD error checking code (II) Only Some Transitions Indicate a SNP in sample Allowed Transitions

  23. A C G G T C G T C G T G T G C G T A C G G T C G T C G T G T G C G T Invalid adjacent A C G G T C - T C G T G T G C G T 1 base deletion A C G G T C - - C G T G T G C G T 2 base deletion SOLiD error checking code (III)

  24. SOLiD di-base sequencing accuracy and QV

  25. RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length • Near-perfect micro-repeats can be also a problem because we want to align reads even with a few sequencing errors and / or SNPs Challenge 2. Resequenceability • Reads from repeats cannot be uniquely mapped back to their true region of origin

  26. Repeats at the fragment level “base masking” “fragment masking”

  27. Fragment level repeat annotation • bases in repetitive fragments may be resequenced with reads representing other, unique fragments  fragment-level repeat annotations spare a higher fraction of the genome than base-level repeat masking

  28. Find perfect and near-perfect micro-repeats • Hash based methods (fast but only work out to a couple of mismatches) • Exact methods (very slow but find every repeat copy) • Heuristic methods (fast but miss a fraction of the repeats)

  29. Challenge 3. Read alignment and assembly • resequencing requires reference sequence-guided read alignment • to align billions of reads the aligner has to be fast and efficient • INDEL errors require gapped alignment • individually aligned reads must be “assembled” together • has to work for every read type (short, medium-length, and long reads) • must tolerate sequencing errors and SNPs • must work with both base-level and fragment-level repeat annotations • transcribed sequences require additional features e.g. splice-site aware alignment capability • most frequently used tools: BLAT (only pair-wise), SSAHA (pair-wise), MAQ (pair-wise and assembly), ELAND (pair-wise), MOSAIK (pair-wise and assembly, gapped)

  30. MOSAIK: method Step 1. initial short-hash based scan for possible read locations Step 2. evaluation of candidate read locations with SW method

  31. MOSAIK – performance • Solexa read alignments to C. elegans genome: • 100 million reads aligned in 95 minutes • 18,000 reads / second • 454 reads to Pichia (yeast-size) genome • GS20: 2,000 reads / second • FLX: 300 reads / second • Solexa read alignments to masked human genome: • 40 seconds for 1 million reads • 18,000 reads / second • 5.5 GB RAM used (more for longer initial • hash sizes)

  32. ABI/cap. 454/FLX 454/GS20 Illumina MOSAIK: co-assembling different read types

  33. Challenge 4. Polymorphism discovery • shallow and deep read coverage • most candidates will never be “checked” only very low error rates are acceptable • we updated PolyBayes to deal with new read types • made the new software (PBSHORT) much more efficient

  34. Structural variation discovery • copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage • structural rearrangements (inversions and translocations) require paired-end read data

  35. Challenge 5. Data visualization • aid software development: integration of trace data viewing, fast navigation, zooming/panning • facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays • promote hypothesis generation: integration of annotation tracks

  36. Challenge 6. Massive data volumes • two connected working groups to define standard data formats Short-read format working group ssrformat@ubc.ca (Asim Siddiqui, UBC) Assembly format working group Boston College http://assembly.bc.edu

  37. Next-generation sequencing software Machine manufacturers’ sites plus third-party developers’ sites, e.g.: http://sourceforge.net/projects/maq/ http://bioinformatics.bc.edu/marthlab/Mosaik

  38. Applications in various discovery projects 1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster) 2. Mutational profiling in deep 454 data (Pichia stipitis) (image from Nature Biotech.) 3. SNP and INDEL discovery in deep Illumina / Solexa short-read coverage (Caenorhabditis elegans)

  39. SNP calling in single-read 454 coverage DNA courtesy of Chuck Langley, UC Davis • collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.) • goal was to assess polymorphism rates between 10 different African and Americanmelanogaster isolates • 10 runs of 454 reads (~300,000 reads per isolate) were collected • key informatics question: can we detect SNPs with high accuracy in low-coverage, survey-style 454 reads aligned to finished reference genome sequence? • reads were base-called with PyroBayes and aligned to the 180Mb reference melanogaster genome sequence with Mosaik 0.16 x nominal read coverage  most reads are singletons • SNP detection with PolyBayes

  40. iso-1 reference 46-2 454 read 46-2 ABI reads (2 fwd + 2 rev) SNP calling success rates • 92.9 % validation rate (1,342 / 1,443) • single-read coverage: 92.9% (1,275 / 1,372 ) • double-read coverage: 94.3% (67 / 71) • 2.0% missed SNP rate (25 / 1247) • single-read coverage: 2.12% (25 / 1176) • double-read coverage: 0% (0 / 59)

  41. Genome variation in melanogaster isolates • 658,280 SNPs discovered among all 10 lines. • Nucleotide diversity Ѳ≈ 5x10-3 (1 SNP / 200 bp) between each line and reference (in line with expectations). • 20.2% (133,264 sites) polymorphic among two or more lines. The 1 SNP / 900 bp nominal density is sufficient for high-resolution marker mapping

  42. SNP calling in short-read coverage C. elegans reference genome (Bristol, N2 strain) Bristol, N2 strain (3 ½ machine runs) Pasadena, CB4858 (1 ½ machine runs) • goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes • 5 runs (~120 million) Illumina reads from the Wash. U. Genome Center, as part of a collaborative project lead by Elaine Mardis, at Washington University • primary aim was to detect polymorphisms between the Pasadena and the Bristol strain

  43. SNP calling error rate very low: • Validation rate = 97.8% (224/229) • Conversion rate = 92.6% (224/242) • Missed SNP rate = 3.75% (26/693) SNP • INDEL candidates validate and convert at similar rates to SNPs: • Validation rate = 89.3% (193/216) • Conversion rate = 87.3% (193/221) INS Polymorphism discovery in C. elegans • MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU) • PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster) • SNP density: 1 in 1,630 bp (of non-repeat genome sequence)

  44. Mutational profiling: deep 454/Illumina/SOLiD data Pichia stipitis reference sequence Image from JGI web site • collaboration with Doug Smith at Agencourt • Pichia stipitis converts xylose to ethanol (bio-fuel production) • one mutagenized strain had especially high conversion efficiency • determine where the mutations were that caused this phenotype • we resequenced the 15MB genome with 454 Illumina, and SOLiD reads • 14 true point mutations in the entire genome

  45. Mutational profiling: comparisons

  46. Informatics of transcriptome sequencing • novel transcript discovery Inferred Exon 1 Inferred Exon 2 new genes & exons novel transcripts in known genes Inferred Exon 1 Inferred Exon 2 • measuring gene expression levels by sequence tag counting requires SAGE informatics-like approaches

  47. Protein-DNA interactions: CHiP-Seq Protein-bound DNA fragments are isolated with chromatin immunoprecipitation (ChIP) and then sequenced (Seq) on a high-throughput sequencing platform. Sequences are mapped to the genome sequence with a read alignment program. Regions over-represented in the sequences are identified. Johnson et al. Science, 2007

  48. Protein-DNA interactions: CHIP-SEQ ChIP-Seq scales well for simultaneous analysis of binding sites in the entire genome. Mikkelsen et al. Nature 2007.

More Related