Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Next Generation Sequencing: Mapping & Assembly Michael Brudno Department of Computer Science Banting & Best Dept of Medical Research University of Toronto 14/05/2008

Questions to Ask • What do you want to learn? • Do you have a finished similar genome? • Do you have matepair data? • Do you have other read types? • What compute power do you have available?

What are your options • What do you want to learn? • -- Variation (in a specie) • -- Conservation (between species) • You can try to build a genome from scratch • You can do reference assisted assembly • -- However you will miss some differences • You can map reads to a reference genome • -- Great if you want to discover variation • -- What if you don’t have a reference?

Mapping versus Assembly Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery reference genome DEL SNP De novo genome sequencing Dealing with non-uniqueness in the genome: resequenceability

Mapping NGS Data • NGS: at least 3 distinct read types: • Illumina/Solexa, 454  letter-space • AB SOLiD  color-space (di-base sequencing) • 2-pass SMS (Helicos) • 2 reads, same location • higher error rates • Goal: to ALIGN the reads to the reference(BLAST?) • Need new algorithms • More data • SOLiD: Biologists want letters, not colors • 2-pass: How to best handle two reads?

If I had a VERY fast computer A C T A G A C T T G T C C A G T Cell being computed Previously computed cells

Finding Exact Matches Lookup Table Level 1: size 41

Implementation Lookup table Level 2: size 42

Implementation Lookup table Level k: 4k 416 = OUT OF MEMORY

Spaced Seed Filtering • Slide fixed window over genome • Index reads by each seed in genome Genome Reads

} Common SHRiMP Overview Isolate similarity in stages: • Spaced Seed Filtering • Vectorized Smith-Waterman • Full Alignment • Specialized for SOLiD, 2-pass, Letter-space • Compute p-values (and other statistics)

Outline • AB SOLiD Reads • 2-pass (SMS) Reads

AB SOLiD: Color-space Sequencing AB SOLiD reads look like this: T012233102 T012033102 G G G A T G G C A A T A C G T T T A 0 0 TGAGCGTTC|||TGAATAGGA 2 A G 1 3 3 1 C T 2 0 0

AB SOLiD: Color space is complex! INDELS TGAGTTA 122103 TGA-TTA 12-303 TGAGTTTA 1221003 TGAGTATA 1221333 SNPs TGAGTT 12210 TGACTT 12120 TGAATT 12030 TGATTT 12300 G: TTGAGTTATGGAT 012210331023 R: 012120331023 TTGACTTATGGAT It’s bloody complicated!

AB SOLiD: Translations TGAGCGTTC|||||||||TGAGCGTTC TGAGCGTTC|||TGAATAGGA • Look at: 012233102 • Recall: 012033102 • 4 translations for every color sequence 0 0 2 A G 1 3 3 1 C T 2 0 0

AB SOLiD: Modified Smith-Waterman • 4 S-W matrices, one per translation • Errors transition into other matrix • ‘Crossover’ penalty charged for errors G A T A C C T T T G A G C G T T C C C A T T G Genome … A G C G T T C Translation A Translation C

AB SOLiD: Obligatory Comparison • SHRiMP and AB Mapper (1.6) • SHRiMP seed 1111001111 • AB 35_2, 35_3 schemas • 10,000 35mers • C. savignyi (173Mb), very high polymorphism • Considering single top hits only

AB SOLiD: Resultant Alignments • SHRiMP emits letter-space alignments • Clear to biologists • Color-space need notbe scary! G: 798 GAACCCCTTACAACTGAACCCCTTAC 823 ||X||||||||||||||||||| ||| T: GAaCCCCTTACAACTGAACCCC-TAC R: 1 T1211000203110121201000-231 25

Outline • AB SOLiD Reads • 2-pass (SMS) Reads

2-pass SMS Reads • SMS reads have high error rates • “Dark bases” (skipped letters) • Multiple passes are possible • Ameliorate errors over passes • Good chance of missing base in one read • Acceptable chance of getting it in at least one

SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTGAC-T CAG-CAT CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

SMS 2-pass: SHRiMP with 2 reads C T G A C T C A G C A T CTGAC-T CAG-CAT C-TG-ACT CA-GCA-T S=8 CT-GAC-T C-AG-CAT AT CC A — —A CC A— —T TT GG AA —C C— —T A— CTG-ACT CAGCA-T S=9 Match = +4 Mismatch = -3 Gap = -2

AT CC A — —A CC A— —T TT GG AA —C C— —T A— SMS 2-pass: SHRiMP with 2-pass data • Build a DAG representing the (near) optimal alignments of the two reads • Generate seeds (short paths) from the DAG • Do k-mer scan; if seeds encountered align both reads to the location using vectorized SW. • Do full WSG alignment for top hits

SMS 2-pass: Results (in brief) • 10,000 synthetic reads (~25-65 bp) • 7% deletion,1% insertion, 1% sub rate • Mapped to Human chromosome 1 • Spaced seed span 9: 111110111

SHRiMP Statistics • p_chance -- The Probability that the hit matches the genome by chance • p_genome -- The probability that the hit has that many or more mutations • Normalized odds -- Normalized ratio • p_genome/p_chance • One read is never sufficient to call a SNP!!!

Statistics Primer You get tested for a horrible disease which only strikes 1/thousand people. You test positive. The test has only a 1% false positive rate (and no false negatives). What are the odds that you have the disease? Test 1 thousand people. -- 1 will have the disease -- 10 will test false positive SNPs are 1/1000, but harmful ones are rarer

SHRiMP Summary • Fast mapping of short reads to a genome • -- Handles: • color-space (SOLiD) reads • 2-pass (SMS) reads • insertions and deletions • -- Easy to parallelize • Computation of p-values & other statistics for hits

Whole Genome Shotgun Sequencing DNA SEQUENCER Sanger vs. NGS reads ASSEMBLER C++ contigs FINISHING Insert Size ± Error sequence

Assembler • Input: set of strings over {A,C,G,T} called reads • Output: A common superstring of the reads. • {TACAT,CATAC, ACGTAC} TACATACGTAC • Initially: Shortest Common Superstring (SCS) • NP-hard [Gallant et al 1980] • Over-collapsing of repeats • Alternate approaches • de Bruijn graphs [Pevzner, Tang, Waterman 01] • string graphs [Myers 05] • Both formulations are NP-hard.

CA GC AG TC AT TT De Bruijn Graphs {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Nodes are (k-1)-mers • Edges are k-mers • The set of k-mers is called a k-spectrum • Finding shortest string with given k-spectrum equivalent to Chinese Postman Pevzner 1989

CA GC AG TC AT TT Chinese Postman Tours {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} • Solving Chinese Postman: An Eulerian tour is a solution • Euleriazation: making a graph Eulerian • Can be done with min cost flow: • Unbalanced nodes are sources/sinks • Duplicate all edges used in flow Pevzner 1989

Motivation: String Graphs • Several downsides of the de Bruijn approach • Division into k-mers arbitrary • Very sensitive to sequencing errors • Not memory efficient (one node per k-mer) • Goal • One node per read (or better) • No division into k-mers • Flexibility in the presence of sequencing errors Myers 2005

ACGTAC CATAC TACAT How To Build A String Graph (1)  TACATACGTAC {ACGTAC, CATAC, TACAT} • Nodes are reads • Edges are overlaps • Weights are lengths of non-overlapping prefix • Transitively inferable overlaps 3 5 3 2 2

How To Build A String Graph(2) • Build overlap graph • Remove transitively inferable overlaps • Collapse Chains Required Optional Goal: Find the shortest path using all of the required edges and any optional ones Myers 2005

1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 Simpifying String Graphs Repeatedly collapse vertices using 4 simple rules 2 1 1 2 1 1 2 1 1 2 1 1 1 1 After all collapsals are performed, the remaining nodes (conflict nodes) all have in and out degree of at least 2.

Resolving Conflict Nodes (Repeats) With Mate Pairs A B A’ B’ B A A’ B’ ? Does there exist a short path between A’ and B’? • Dijkstra’s shortest path algorithm -- bounded • Greedily join edges if they have enough supporting reads.

Modeling Double-Strandedness 3’ 5’ GGCAAT • How can two DNA molecules overlap? ATTGCC 5’ 3’ A A C +CTT +AAC -AAG -GTT C T T A A C +TCG +AAC -CGA -GTT T C G T G G +TGG +AAC -CCA -GTT A A C Kececioglu 1992

+AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Bidirected Graphs • A walk has to “match” directions at each node. GGCAAT ATTGCC

{GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ GTTGGCAAT GC GT CC CA TG GC GG AT TT GG AC CC GT TT CA TG AC AT AA Using de Bruijn Graphs The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989] {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’ AA

The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum {GTT, TAA, TTG, TGG, GGC, GCA, CAA} {GTT, TAA, TTG, TGG, GGC, GCA, CAA} 5’ 3’ +AC +AA GTTGGCAAT +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC +AC +AA +AT -GT -TT -AT +CC +CA +GC -GG -TG -GC Using Bidirected de Bruijn Graphs {AAC, ATT, CAA, CCA, GCC, TGC, TTG} ATTGCCAAC 5’ 3’

Some Comparisons

Assemblers • Previous (Sanger) Assemblers • NGS Assemblers • SSAKE (Jeck et al., 2007) • VCAKE (Warren et al. 2007) • ALLPATHS (Butler et al. 2008) • Edena (Hernandez et al. 2008) • Euler-SR (Chaisson and Pevzner 2008) • SHARCGS (Dohm et al. 2007) • Shorty (Chen and Skiena 2007) • Velvet (Zerbino and Birney, 2008)

Acknowledgments Variation SHRiMP Paul Medvedev Assembly Seunghak Lee Elango Cheran Stephen Rumble Vladimir Yanovsky http://compbio.cs.toronto.edu/shrimp Rest of the Group FUNDING: NSERC, CFI, NIH

Canadian Bioinformatics Workshops