1 / 47

Canadian Bioinformatics Workshops

Join the Canadian Bioinformatics Workshops to learn about mapping and assembly in next generation sequencing. Topics include genome re-sequencing, de novo genome sequencing, and handling non-uniqueness in the genome. Explore AB SOLiD reads and 2-pass SMS reads.

franksr
Download Presentation

Canadian Bioinformatics Workshops

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Next Generation Sequencing: Mapping & Assembly Michael Brudno Department of Computer Science Banting & Best Dept of Medical Research University of Toronto 14/05/2008

  3. Questions to Ask • What do you want to learn? • Do you have a finished similar genome? • Do you have matepair data? • Do you have other read types? • What compute power do you have available?

  4. What are your options • What do you want to learn? • -- Variation (in a specie) • -- Conservation (between species) • You can try to build a genome from scratch • You can do reference assisted assembly • -- However you will miss some differences • You can map reads to a reference genome • -- Great if you want to discover variation • -- What if you don’t have a reference?

  5. Mapping versus Assembly Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing Dealing with non-uniqueness in the genome: resequenceability

  6. Mapping NGS Data • NGS: at least 3 distinct read types: • Illumina/Solexa, 454  letter-space • AB SOLiD  color-space (di-base sequencing) • 2-pass SMS (Helicos) • 2 reads, same location • higher error rates • Goal: to ALIGN the reads to the reference(BLAST?) • Need new algorithms • More data • SOLiD: Biologists want letters, not colors • 2-pass: How to best handle two reads?

  7. If I had a VERY fast computer A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

  8. Finding Exact Matches Lookup Table Level 1: size 41

  9. Implementation Lookup table Level 2: size 42

  10. Implementation Lookup table Level k: 4k 416 = OUT OF MEMORY

  11. Spaced Seed Filtering • Slide fixed window over genome • Index reads by each seed in genome Genome Reads

  12. } Common SHRiMP Overview Isolate similarity in stages: • Spaced Seed Filtering • Vectorized Smith-Waterman • Full Alignment • Specialized for SOLiD, 2-pass, Letter-space • Compute p-values (and other statistics)

  13. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

  14. AB SOLiD: Color-space Sequencing AB SOLiD reads look like this: T012233102 T012033102 G G G A T G G C A A T A C G T T T A 0 0 TGAGCGTTC|||TGAATAGGA 2 A G 1 3 3 1 C T 2 0 0

  15. AB SOLiD: Color space is complex! INDELS TGAGTTA 122103 TGA-TTA 12-303 TGAGTTTA 1221003 TGAGTATA 1221333 SNPs TGAGTT 12210 TGACTT 12120 TGAATT 12030 TGATTT 12300 G: TTGAGTTATGGAT 012210331023 R: 012120331023 TTGACTTATGGAT It’s bloody complicated!

  16. AB SOLiD: Translations TGAGCGTTC|||||||||TGAGCGTTC TGAGCGTTC|||TGAATAGGA • Look at: 012233102 • Recall: 012033102 • 4 translations for every color sequence 0 0 2 A G 1 3 3 1 C T 2 0 0

  17. AB SOLiD: Modified Smith-Waterman • 4 S-W matrices, one per translation • Errors transition into other matrix • ‘Crossover’ penalty charged for errors G A T A C C T T T G A G C G T T C C C A T T G Genome … A G C G T T C Translation A Translation C

  18. AB SOLiD: Obligatory Comparison • SHRiMP and AB Mapper (1.6) • SHRiMP seed 1111001111 • AB 35_2, 35_3 schemas • 10,000 35mers • C. savignyi (173Mb), very high polymorphism • Considering single top hits only

  19. AB SOLiD: Resultant Alignments • SHRiMP emits letter-space alignments • Clear to biologists • Color-space need notbe scary! G: 798 GAACCCCTTACAACTGAACCCCTTAC 823 ||X||||||||||||||||||| ||| T: GAaCCCCTTACAACTGAACCCC-TAC R: 1 T1211000203110121201000-231 25

  20. Outline • AB SOLiD Reads • 2-pass (SMS) Reads

  21. 2-pass SMS Reads • SMS reads have high error rates • “Dark bases” (skipped letters) • Multiple passes are possible • Ameliorate errors over passes • Good chance of missing base in one read • Acceptable chance of getting it in at least one

  22. SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

  23. SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTGAC-T CAG-CAT CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

  24. SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTGAC-T CAG-CAT C-TG-ACT CA-GCA-T S=8 CT-GAC-T C-AG-CAT AT CC A — —A CC A— —T TT GG AA —C C— —T A— CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

  25. AT CC A — —A CC A— —T TT GG AA —C C— —T A— SMS 2-pass: SHRiMP with 2-pass data • Build a DAG representing the (near) optimal alignments of the two reads • Generate seeds (short paths) from the DAG • Do k-mer scan; if seeds encountered align both reads to the location using vectorized SW. • Do full WSG alignment for top hits

  26. SMS 2-pass: Results (in brief) • 10,000 synthetic reads (~25-65 bp) • 7% deletion,1% insertion, 1% sub rate • Mapped to Human chromosome 1 • Spaced seed span 9: 111110111

  27. SHRiMP Statistics • p_chance -- The Probability that the hit matches the genome by chance • p_genome -- The probability that the hit has that many or more mutations • Normalized odds -- Normalized ratio • p_genome/p_chance • One read is never sufficient to call a SNP!!!

  28. Statistics Primer You get tested for a horrible disease which only strikes 1/thousand people. You test positive. The test has only a 1% false positive rate (and no false negatives). What are the odds that you have the disease? Test 1 thousand people. -- 1 will have the disease -- 10 will test false positive SNPs are 1/1000, but harmful ones are rarer

  29. SHRiMP Summary • Fast mapping of short reads to a genome • -- Handles: • color-space (SOLiD) reads • 2-pass (SMS) reads • insertions and deletions • -- Easy to parallelize • Computation of p-values & other statistics for hits

  30. Whole Genome Shotgun Sequencing DNA SEQUENCER Sanger vs. NGS reads ASSEMBLER C++ contigs FINISHING Insert Size ± Error sequence

  31. Assembler • Input: set of strings over {A,C,G,T} called reads • Output: A common superstring of the reads. • {TACAT,CATAC, ACGTAC} TACATACGTAC • Initially: Shortest Common Superstring (SCS) • NP-hard [Gallant et al 1980] • Over-collapsing of repeats • Alternate approaches • de Bruijn graphs [Pevzner, Tang, Waterman 01] • string graphs [Myers 05] • Both formulations are NP-hard.

  32. CA GC AG TC AT TT De Bruijn Graphs {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Nodes are (k-1)-mers • Edges are k-mers • The set of k-mers is called a k-spectrum • Finding shortest string with given k-spectrum equivalent to Chinese Postman Pevzner 1989

  33. CA GC AG TC AT TT Chinese Postman Tours {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Solving Chinese Postman: An Eulerian tour is a solution • Euleriazation: making a graph Eulerian • Can be done with min cost flow: • Unbalanced nodes are sources/sinks • Duplicate all edges used in flow Pevzner 1989

  34. Motivation: String Graphs • Several downsides of the de Bruijn approach • Division into k-mers arbitrary • Very sensitive to sequencing errors • Not memory efficient (one node per k-mer) • Goal • One node per read (or better) • No division into k-mers • Flexibility in the presence of sequencing errors Myers 2005

  35. ACGTAC CATAC TACAT How To Build A String Graph (1)  TACATACGTAC {ACGTAC, CATAC, TACAT} • Nodes are reads • Edges are overlaps • Weights are lengths of non-overlapping prefix • Transitively inferable overlaps 3 5 3 2 2

  36. How To Build A String Graph(2) • Build overlap graph • Remove transitively inferable overlaps • Collapse Chains Required Optional Goal: Find the shortest path using all of the required edges and any optional ones Myers 2005

  37. 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 Simpifying String Graphs Repeatedly collapse vertices using 4 simple rules 2 1 1 2 1 1 2 1 1 2 1 1 1 1 After all collapsals are performed, the remaining nodes (conflict nodes) all have in and out degree of at least 2.

  38. Resolving Conflict Nodes (Repeats) With Mate Pairs A B A’ B’ B A A’ B’ ? Does there exist a short path between A’ and B’? • Dijkstra’s shortest path algorithm -- bounded • Greedily join edges if they have enough supporting reads.

  39. Modeling Double-Strandedness 3’ 5’ GGCAAT • How can two DNA molecules overlap? ATTGCC 5’ 3’ A A C +CTT +AAC -AAG -GTT C T T A A C +TCG +AAC -CGA -GTT T C G T G G +TGG +AAC -CCA -GTT A A C Kececioglu 1992

  40. +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected Graphs • A walk has to “match” directions at each node. GGCAAT ATTGCC

  41. {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ GTTGGCAAT GC GT CC CA TG GC GG AT TT GG AC CC GT TT CA TG AC AT AA Using de Bruijn Graphs The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989] {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’ AA

  42. The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum {GTT, TAA, TTG, TGG, GGC, GCA, CAA} {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ +AC +AA GTTGGCAAT +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Using Bidirected de Bruijn Graphs {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’

  43. Some Comparisons

  44. Assemblers • Previous (Sanger) Assemblers • NGS Assemblers • SSAKE (Jeck et al., 2007) • VCAKE (Warren et al. 2007) • ALLPATHS (Butler et al. 2008) • Edena (Hernandez et al. 2008) • Euler-SR (Chaisson and Pevzner 2008) • SHARCGS (Dohm et al. 2007) • Shorty (Chen and Skiena 2007) • Velvet (Zerbino and Birney, 2008)

  45. Acknowledgments Variation SHRiMP Paul Medvedev Assembly Seunghak Lee Elango Cheran Stephen Rumble Vladimir Yanovsky http://compbio.cs.toronto.edu/shrimp Rest of the Group FUNDING: NSERC, CFI, NIH

  46. Acknowledgments Variation SHRiMP Paul Medvedev Assembly Seunghak Lee Elango Cheran Stephen Rumble Vladimir Yanovsky http://compbio.cs.toronto.edu/shrimp Rest of the Group FUNDING: NSERC, CFI, NIH

More Related