
Wrangling Short Read Data with SHRiMP

Stephen M. Rumble, Department of Computer Science, University of Toronto, 07/19/08.


Presentation Transcript


  1. Wrangling Short Read Data with SHRiMP • Stephen M. Rumble • Department of Computer Science, University of Toronto • 07/19/08

  2. Handling NGS Data • NGS: at least 3 distinct read types: • Illumina/Solexa, 454: letter-space • AB SOLiD: color-space (di-base sequencing) • 2-pass SMS (Helicos): 2 reads of the same location, higher error rates • Need new algorithms • SOLiD: Biologists want letters, not colors • 2-pass: How to best handle two reads?

  3. SHRiMP Overview • Isolate similarity in stages: • Spaced Seed Filtering (common) • Vectorized Smith-Waterman (common) • Full Alignment (specialized for SOLiD, 2-pass, letter-space) • Compute p-values (and other statistics)

  4. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

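  5. AB SOLiD: Color-space Sequencing • AB SOLiD reads look like this: T012233102, T012033102 • [Figure: the di-base color code (colors 0-3 on the transitions between A, C, G and T) and the letter-space translations of the two reads, TGAGCGTTC and TGAATAGGA, which agree only on their first three letters (|||)]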
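The translation of the first read on this slide can be reproduced with a short sketch. It assumes the standard di-base code, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), and that the leading letter of the read is the known primer base. The function name `decode_colors` is illustrative and is not part of SHRiMP.

```python
# Assumed di-base code: color = code(previous base) XOR code(next base).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def decode_colors(read):
    """Translate a color-space read (primer base followed by colors) to letters."""
    current = BASE_TO_BITS[read[0]]  # primer base: known, not part of the sequence
    letters = []
    for color in read[1:]:
        current ^= int(color)        # each color moves us to the next base
        letters.append(BITS_TO_BASE[current])
    return "".join(letters)

print(decode_colors("T012233102"))  # TGAGCGTTC
```

Because every letter depends on all colors before it, changing a single color changes every letter from that point on.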

  6. AB SOLiD: Color space is complex! • Indels (letters = colors): TGAGTTA = 122103; TGA-TTA = 12-303; TGAGTTTA = 1221003; TGAGTATA = 1221333 • SNPs (letters = colors): TGAGTT = 12210; TGACTT = 12120; TGAATT = 12030; TGATTT = 12300 • Genome G: TTGAGTTATGGAT = 012210331023 • Read R: 012120331023 = TTGACTTATGGAT • It’s bloody complicated!
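The SNP examples on this slide can be checked with a small encoder. This is a sketch assuming the same XOR form of the di-base code (A=0, C=1, G=2, T=3); `encode_colors` and `changed_positions` are illustrative helper names.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(letters):
    """Encode a letter sequence as the colors between each pair of adjacent bases."""
    return "".join(
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(letters, letters[1:])
    )

def changed_positions(x, y):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]

reference = encode_colors("TGAGTT")
for variant in ("TGACTT", "TGAATT", "TGATTT"):
    colors = encode_colors(variant)
    # An isolated SNP changes the two colors that flank the altered base.
    print(variant, colors, changed_positions(reference, colors))
```

In each case exactly two adjacent colors change, which is what separates a real SNP from a single-color sequencing error.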

  7. AB SOLiD: Translations • Look at: 012233102 • Recall: 012033102 • TGAGCGTTC vs. TGAGCGTTC: all nine letters match (|||||||||) • TGAGCGTTC vs. TGAATAGGA: only the first three letters match (|||) • 4 translations for every color sequence • [Figure: the di-base color code diagram]
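A minimal sketch of the "4 translations" idea: if the base preceding the first color is unknown, each of the four possible starting bases yields a different letter sequence. The XOR form of the color code (A=0, C=1, G=2, T=3) and the name `all_translations` are assumptions made for illustration.

```python
BITS_TO_BASE = "ACGT"

def all_translations(colors):
    """Map each assumed starting base to the letter sequence it implies."""
    result = {}
    for start in range(4):
        current, letters = start, []
        for color in colors:
            current ^= int(color)
            letters.append(BITS_TO_BASE[current])
        result[BITS_TO_BASE[start]] = "".join(letters)
    return result

for start, letters in all_translations("012233102").items():
    print(start, letters)
```

Any two translations differ at every position by the same fixed letter swap, so a color error can be modelled as a jump from one translation to another.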

  8. AB SOLiD: Modified Smith-Waterman • 4 S-W matrices, one per translation • Errors transition into other matrix • ‘Crossover’ penalty charged for errors • [Figure: a genome segment aligned against the matrices for Translation A and Translation C, with the alignment path crossing from one matrix to the other]
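The four-matrix idea can be sketched as below. This is a simplified illustration and not SHRiMP's implementation: it uses linear gap costs, allows a crossover only on a diagonal step, and borrows match/mismatch/gap values from the SMS slides later in the talk. The crossover value of -3 and all function names are assumptions.

```python
LETTERS = "ACGT"

def translations(colors):
    """Four letter-space translations of a color string, one per assumed start base."""
    result = []
    for start in range(4):
        current, letters = start, []
        for c in colors:
            current ^= int(c)
            letters.append(LETTERS[current])
        result.append("".join(letters))
    return result

def colorspace_sw(genome, colors, match=4, mismatch=-3, gap=-2, crossover=-3):
    """Best local score of a color read against a letter-space genome."""
    trans = translations(colors)
    n, m = len(genome), len(colors)
    # H[k][i][j]: best score ending at genome[i-1] and read letter j-1 in translation k.
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(4)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                step = match if genome[i - 1] == trans[k][j - 1] else mismatch
                # Arriving from another matrix models a color error: pay the crossover.
                diag = max(
                    H[p][i - 1][j - 1] + (0 if p == k else crossover) for p in range(4)
                )
                H[k][i][j] = max(
                    0, diag + step, H[k][i - 1][j] + gap, H[k][i][j - 1] + gap
                )
                best = max(best, H[k][i][j])
    return best

genome = "TTGAGTTATGGAT"
print(colorspace_sw(genome, "012210331023"))  # error-free colors of the genome
print(colorspace_sw(genome, "012211331023"))  # same read with one color error
```

With a single color error the read still aligns end to end, paying one crossover rather than a run of letter mismatches.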

  9. AB SOLiD: Obligatory Comparison • SHRiMP and AB Mapper (1.6) • SHRiMP seed 1111001111 • AB 35_2, 35_3 schemas • 10,000 35mers • C. savignyi (173Mb), very high polymorphism • Considering single top hits only
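To illustrate what a spaced seed such as 1111001111 does, here is a minimal sketch of seed-based candidate lookup in letter space. The toy genome, the read, and the function names are made up for the example; SHRiMP's actual index structure is not described on the slides.

```python
from collections import defaultdict

def spaced_keys(seq, seed):
    """Yield (offset, key); the key keeps only letters under '1' positions of the seed."""
    span = len(seed)
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        yield i, "".join(ch for ch, s in zip(window, seed) if s == "1")

def build_index(genome, seed):
    """Map each spaced key to the genome positions where it occurs."""
    index = defaultdict(list)
    for pos, key in spaced_keys(genome, seed):
        index[key].append(pos)
    return index

def candidate_starts(read, index, seed):
    """Genome start positions implied by seed hits (genome position - read offset)."""
    hits = set()
    for offset, key in spaced_keys(read, seed):
        for pos in index.get(key, ()):
            hits.add(pos - offset)
    return hits

genome = "ACGTTGCAAGGCTTACGATCG"   # toy sequence, not real data
read = "GCAATGCTTA"                # genome[5:15] with a mismatch under a '0' position
spaced = candidate_starts(read, build_index(genome, "1111001111"), "1111001111")
solid = candidate_starts(read, build_index(genome, "1111111111"), "1111111111")
print(spaced, solid)
```

The spaced seed still finds position 5 because the mismatch falls under a "don't care" position, whereas a contiguous seed of the same span finds nothing.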

  10. AB SOLiD: Resultant Alignments • SHRiMP emits letter-space alignments • Clear to biologists • Color-space need not be scary!
      G: 798  GAACCCCTTACAACTGAACCCCTTAC 823
              ||X||||||||||||||||||| |||
      T:      GAaCCCCTTACAACTGAACCCC-TAC
      R:   1 T1211000203110121201000-231 25

  11. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

  12. 2-pass SMS Reads • SMS reads have high error rates • “Dark bases” (skipped letters) • Multiple passes are possible • Ameliorate errors over passes • Good chance of missing base in one read • Acceptable chance of getting it in at least one
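The benefit of a second pass can be made concrete with a one-line calculation. This sketch assumes each pass misses a given base independently with the same probability, which the slides do not state; the 7% figure is only an example value.

```python
def prob_base_seen(miss_rate, passes=2):
    """Probability that at least one pass reads the base, assuming independent passes."""
    return 1.0 - miss_rate ** passes

print(prob_base_seen(0.07, passes=1))
print(prob_base_seen(0.07, passes=2))
```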

  13. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2 • Optimal alignment (S=9): CTG-ACT / CAGCA-T
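The score on this slide can be reproduced with a small dynamic program. The sketch assumes an end-to-end (global) alignment with a linear gap cost, which matches the alignments drawn on the slide; `global_score` is an illustrative name.

```python
def global_score(a, b, match=4, mismatch=-3, gap=-2):
    """Best end-to-end alignment score of a and b with linear gap costs."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * gap]
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

print(global_score("CTGACT", "CAGCAT"))  # 9
```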

  14. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2 • Optimal alignments (S=9): CTGAC-T / CAG-CAT and CTG-ACT / CAGCA-T

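  15. SMS 2-pass: SHRiMP with 2 reads • Reads: CTGACT and CAGCAT • Scoring: Match = +4, Mismatch = -3, Gap = -2 • Optimal alignments (S=9): CTGAC-T / CAG-CAT and CTG-ACT / CAGCA-T • Near-optimal alignments (S=8): C-TG-ACT / CA-GCA-T and CT-GAC-T / C-AG-CAT • [Figure: graph whose nodes are the aligned column pairs used by these alignments]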
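The S values quoted for these alignments follow directly from the scoring scheme, as this sketch shows by scoring each gapped alignment column by column. The helper name `alignment_score` is illustrative.

```python
def alignment_score(top, bottom, match=4, mismatch=-3, gap=-2):
    """Score a gapped pairwise alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

for top, bottom in [("CTG-ACT", "CAGCA-T"), ("CTGAC-T", "CAG-CAT"),
                    ("C-TG-ACT", "CA-GCA-T"), ("CT-GAC-T", "C-AG-CAT")]:
    print(top, bottom, alignment_score(top, bottom))
```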

  16. SMS 2-pass: SHRiMP with 2-pass data • [Figure: DAG of aligned column pairs for the two reads] • Build a DAG representing the (near-)optimal alignments of the two reads • Generate seeds (short paths) from the DAG • Do a k-mer scan; if seeds are encountered, align both reads to the location using vectorized SW • Do full WSG alignment for top hits
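One common way to collect the (near-)optimal alignments of two reads into a DAG is to combine a forward and a backward dynamic program and keep every step that lies on some path scoring within a slack of the optimum. The sketch below does this for the two example reads; it assumes global alignment with linear gaps and is not taken from SHRiMP's source. The `slack` parameter and function names are hypothetical.

```python
def column_score(x, y, match=4, mismatch=-3):
    return match if x == y else mismatch

def near_optimal_edges(a, b, slack=0, gap=-2):
    """Return (best score, DAG edges) for alignments scoring >= best - slack."""
    n, m = len(a), len(b)
    neg = float("-inf")
    # fwd[i][j]: best score aligning a[:i] with b[:j]
    fwd = [[neg] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                fwd[i][j] = max(fwd[i][j], fwd[i-1][j-1] + column_score(a[i-1], b[j-1]))
            if i:
                fwd[i][j] = max(fwd[i][j], fwd[i-1][j] + gap)
            if j:
                fwd[i][j] = max(fwd[i][j], fwd[i][j-1] + gap)
    # bwd[i][j]: best score aligning a[i:] with b[j:]
    bwd = [[neg] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] = max(bwd[i][j], bwd[i+1][j+1] + column_score(a[i], b[j]))
            if i < n:
                bwd[i][j] = max(bwd[i][j], bwd[i+1][j] + gap)
            if j < m:
                bwd[i][j] = max(bwd[i][j], bwd[i][j+1] + gap)
    best = fwd[n][m]
    edges = []
    for i in range(n + 1):
        for j in range(m + 1):
            steps = []
            if i < n and j < m:
                steps.append((i + 1, j + 1, column_score(a[i], b[j]), a[i] + b[j]))
            if i < n:
                steps.append((i + 1, j, gap, a[i] + "-"))
            if j < m:
                steps.append((i, j + 1, gap, "-" + b[j]))
            for ni, nj, weight, column in steps:
                # Keep the step if the best path through it is within the slack.
                if fwd[i][j] + weight + bwd[ni][nj] >= best - slack:
                    edges.append(((i, j), (ni, nj), column))
    return best, edges

best, optimal = near_optimal_edges("CTGACT", "CAGCAT", slack=0)
_, relaxed = near_optimal_edges("CTGACT", "CAGCAT", slack=1)
print(best, len(optimal), len(relaxed))
```

Each edge carries an aligned column such as CC or A-, so short paths through the graph can be read off as seeds.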

  17. SMS 2-pass: Results (in brief) • 10,000 synthetic reads (~25-65 bp) • 7% deletion, 1% insertion, 1% substitution rate • Mapped to Human chromosome 1 • Spaced seed span 9: 111110111
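Synthetic reads with these error rates could be produced along the following lines. The slides do not say how the reads were generated, so the per-base independent error model, the function name `simulate_read`, and the toy source sequence are all assumptions.

```python
import random

def simulate_read(source, rng, del_rate=0.07, ins_rate=0.01, sub_rate=0.01):
    """Copy `source`, applying independent per-base insertions, deletions, substitutions."""
    out = []
    for base in source:
        if rng.random() < ins_rate:
            out.append(rng.choice("ACGT"))      # spurious inserted base
        r = rng.random()
        if r < del_rate:
            continue                            # "dark base": skipped letter
        if r < del_rate + sub_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
source = "ACGTTGCAAGGCTTACGATCGATTACAGGCAT"     # toy sequence, not real data
# Two passes over the same location give two independently corrupted reads.
first_pass = simulate_read(source, rng)
second_pass = simulate_read(source, rng)
print(first_pass)
print(second_pass)
```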

  18. SHRiMP Summary • Fast mapping of short reads to a genome • Handles: color-space (SOLiD) reads, 2-pass (SMS) reads, insertions and deletions • Easy to parallelize • Computation of p-values & other statistics for hits

  19. Acknowledgements • SHRiMP is brought to you by: • Michael Brudno • Adrian Dalca • Marc Fiume • Vlad Yanovsky • Phil Lacroute • Arend Sidow • University of Toronto, Stanford University • http://compbio.cs.toronto.edu/shrimp
