
Assembly group

Assembly group. Gena Tang Pushkar Pande Tianjun Ye Xing Liu Racchit Thapliyal Robert Arthur Kevin Lee. Presentation overview. Biology background Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference Tools and Techniques Work flow & strategy



Presentation Transcript


  1. Assembly group Gena Tang Pushkar Pande Tianjun Ye Xing Liu Racchit Thapliyal Robert Arthur Kevin Lee

  2. Presentation overview • Biology background • Algorithms De novo -Overlap-Layout-Consensus - De Bruijn graphs Reference • Tools and Techniques • Work flow & strategy • Group management


  4. Pyrosequencing • Measures the presence or absence of each nucleotide at any given position • [Figure: flowgram; flow order TACG, key (TCAG), signal levels labeled 1-mer, 2-mer, 3-mer, 4-mer]

  5. Video http://www.youtube.com/watch?v=kYAGFrbGl6E Margulies et al., 2005

  6. Data statistics • ~300nt per read • 40X coverage (but widely varying, 12-80)

  7. Biology of H. influenzae • ~ 2 Mb genome (1.8-2.0Mb) • Mostly coding sequences • Good for assembly • Reference genomes of 4 closely related species • H. influenzae • H. parasuis • H. ducreyi • H. somnus

  8. Biology of H. influenzae • High degree of genomic plasticity • 10% of genes in clinically isolated strains are novel • In situ horizontal gene transfer • Supragenome – distributed genome hypothesis • Reference mapping relatively ineffective • Making life difficult!!!

  9. Results preview

  10. Second test for high rates of genomic recombination • Assemble all of the H. hemolyticus genomes together • Should give a more complete mapping because of higher coverage • 40X * 5 genomes → 200X coverage • But … we get 3700 contigs • (average of 50 for single strain assembly)

  11. Rampant recombination • These data hint at rampant recombination • Reference mapping relatively worthless

  12. Whole-genome alignment (intra-species) • On average, 27 insertions, 147 deletions (>90bp) • Average length of non-matching seq = 321kb (18%)

  13. Presentation overview • Biology background • Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference • Tools and Techniques • Work flow & strategy • Group management

  14. Some terminology • Read: a 500-900 nt long word that comes out of the sequencer • Mate pair: a pair of reads from the two ends of the same insert fragment • Contig: a contiguous sequence formed by several overlapping reads with no gaps. • Supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs. • Consensus sequence: the sequence derived from the multiple alignment of the reads in a contig

  15. Assembly Algorithm • Goal: Find the shortest common superstring of a set of reads. • Input: reads {s1, s2, s3, …} • Output: the shortest string T such that every s_i is a substring of T. • Comment: This is an NP-hard problem, so we need to use an approximation algorithm.

  16. Greedy Algorithm • Process: (1) Calculate pairwise alignments of all fragments. (2) Choose the two fragments with the largest overlap. (3) Merge the chosen fragments. (4) Repeat steps 2 and 3 until only one fragment is left.
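The greedy process can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: overlaps are exact suffix/prefix matches (no sequencing errors, no reverse complements), and the helper names `overlap` and `greedy_assemble` are made up for this example.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the two fragments with the largest exact overlap."""
    if not reads:
        return ""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Steps 1-2: score every ordered pair and pick the largest overlap.
        n, a, b = max(
            ((overlap(a, b), a, b) for a, b in permutations(frags, 2)),
            key=lambda t: t[0],
        )
        # Step 3: merge the pair; with n == 0 this is plain concatenation.
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])
    return frags[0]
```

Scoring all ordered pairs in every round is quadratic per merge, which hints at the efficiency limitation of the greedy approach.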

  17. Greedy algorithm Input reads Take pairwise alignment Best one Merge the best one

  18. Greedy Algorithm • Comment: • The greedy algorithm was the first successful assembly algorithm implemented. • Used for organisms such as bacteria and single-celled eukaryotes. • It has some efficiency limitations.

  19. Overlap-layout-consensus (summary) • This approach is based on graph theory. • Assemblers based on this approach: Arachne, Celera, Newbler etc.

  20. Step 1: Find Overlapping Reads • Sort all k-mers in reads (k~24) • Find pairs of reads sharing a k-mer • Extend to full alignment; throw away if not > 95% similar
      TACATAGATTACACAGATTACT-GA
      ||  |||||||||||||||||| ||
      TAGTTAGATTACACAGATTACTAGA

  21. Step 1: Find Overlapping Reads • One caveat: repeats • A k-mer that appears N times, initiates N^2 comparisons. • Solution: • Discard all k-mers that appear more than c*Coverage, (c~10)
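A minimal sketch of the k-mer seeding step with the repeat filter. It swaps the sorted k-mer list for a hash index, which finds the same read pairs. The function name `candidate_pairs` and the parameter `max_occurrences` (standing in for c*Coverage) are assumptions for this example, and an occurrence is counted here as the number of distinct reads containing the k-mer. The full alignment and the 95% similarity check are not shown.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=24, max_occurrences=None):
    """Return index pairs (i, j), i < j, of reads that share at least one k-mer.

    k-mers found in more than max_occurrences reads are treated as repeats
    and skipped, so a k-mer seen N times does not trigger ~N^2 comparisons.
    """
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)

    pairs = set()
    for rids in index.values():
        if max_occurrences is not None and len(rids) > max_occurrences:
            continue  # likely a repeat: discard this k-mer
        pairs.update(combinations(sorted(rids), 2))
    return pairs
```

Each returned pair would then be extended to a full alignment and kept only if it passes the similarity threshold.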

  22. Step 2: Construct overlap graph • A graph is constructed: • Nodes are reads • Edges represent overlapping reads CGTAGTGGCAT Overlap graph ATTCACGTAG
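As a rough sketch, the overlap graph for the two reads on this slide can be built as follows. Exact suffix/prefix matching and the `min_len` threshold are simplifying assumptions; the function names are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b carries the overlap length."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


graph = build_overlap_graph(["ATTCACGTAG", "CGTAGTGGCAT"])
# ATTCACGTAG ends with CGTAG, which starts CGTAGTGGCAT: one edge of weight 5.
```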


  24. Step 3: Find Contigs • Terminology in graph theory: • Simple path: a path in the graph that contains each node at most once. • Longest simple path: a simple path that cannot be extended. • Hamiltonian path: a path in the graph that contains each node exactly once. CGTAGTGGCAT ATTCACGTAG

  25. Step 4: Multiple sequence alignments and consensus • Recall: we now have several contigs (i.e., several longest simple paths) • Compute the multiple alignment of the reads in each contig, and derive one consensus sequence as the final contig.
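A toy version of the consensus step, assuming the reads of a contig have already been laid out as equal-length rows padded with '-' (the alignment itself is not computed here). A simple per-column majority vote is one possible consensus rule, not necessarily the one any particular assembler uses.

```python
from collections import Counter


def consensus(aligned_rows):
    """Majority base per column; '-' marks padding or a gap and is ignored."""
    width = max(len(row) for row in aligned_rows)
    out = []
    for col in range(width):
        bases = [row[col] for row in aligned_rows
                 if col < len(row) and row[col] != "-"]
        if bases:
            # Ties go to the base seen first in this column.
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)


rows = [
    "ATTCACGTAG------",
    "-----CGTAGTGGCAT",
    "--TCACGAAGTG----",  # carries one disagreeing base (A instead of T)
]
contig = consensus(rows)
```

The single disagreeing base in the third row is outvoted by the other two reads covering that column.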

  26. The Eulerian path/de Bruijn graph approach • Summary: • Based on graph theory • Eulerian path: a path in a graph that visits every edge exactly once. • Examples: EULER, Velvet, ALLPATHS, ABySS, SOAPdenovo… • In theory the Eulerian path approach is more efficient; in practice, however, both approaches are equally fast.

  27. Step 1: k-mer hash table • Break reads into overlapping k-mers. Example: 10bp read: ATTCGACTCC for k=5-mers: ATTCG TTCGA TCGAC CGACT GACTC ACTCC
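The k-mer breakdown from this slide, as a short sketch. A `Counter` plays the role of the hash table; the function names are illustrative only.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_table(reads, k):
    """Hash table mapping each k-mer to the number of times it occurs."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


five_mers = kmers("ATTCGACTCC", 5)
```

A read of length L yields L - k + 1 k-mers, so the 10 bp read gives six 5-mers.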

  28. Step 2: Build de Bruijn graph • Nodes: k-mers • Edges: if the (k-1)-suffix of one node equals the (k-1)-prefix of another node, add a directed edge from the first to the second. TTCGA TCGAC ATTCG
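A small sketch of this construction, following the slide's definition (nodes are k-mers, edges join a (k-1)-suffix to a matching (k-1)-prefix). The adjacency-dictionary representation and the name `de_bruijn` are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer node to the sorted list of its successor k-mers."""
    nodes = {read[i:i + k] for read in reads
             for i in range(len(read) - k + 1)}
    by_prefix = defaultdict(list)  # (k-1)-prefix -> nodes starting with it
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    # Edge u -> v whenever the (k-1)-suffix of u is the (k-1)-prefix of v.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}


g = de_bruijn(["ATTCGACTCC"], 5)
# g["ATTCG"] is ["TTCGA"], g["TTCGA"] is ["TCGAC"], and so on along the read.
```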

  29. Step 3: simplification of the graph • Whenever a node A has only one outgoing arc that points to another node B that has only one ingoing arc, the two nodes are merged. TGCAT TTGCA ATTGC TGCAG ATTGCA
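The merge rule can be sketched on the small example from this slide. The graph is assumed to be a dictionary from node sequence to successor list in which every node (including sinks) appears as a key, and neighbouring nodes are assumed to overlap by k-1 bases; this is an illustrative layout, not Velvet's internal data structure.

```python
def simplify(graph, k):
    """Merge A -> B whenever A has one outgoing arc and B one incoming arc."""
    graph = {node: list(succs) for node, succs in graph.items()}
    changed = True
    while changed:
        changed = False
        indegree = {node: 0 for node in graph}
        for succs in graph.values():
            for s in succs:
                indegree[s] += 1
        for a, succs in list(graph.items()):
            if len(succs) != 1:
                continue
            b = succs[0]
            if b == a or indegree[b] != 1:
                continue
            merged = a + b[k - 1:]  # neighbours overlap by k-1 bases
            b_succs = graph.pop(b)
            del graph[a]
            graph[merged] = b_succs
            # Redirect every arc that pointed at A to the merged node.
            for node in graph:
                graph[node] = [merged if s == a else s for s in graph[node]]
            changed = True
            break  # degrees changed, so recompute them
    return graph


example = {
    "ATTGC": ["TTGCA"],
    "TTGCA": ["TGCAT", "TGCAG"],
    "TGCAT": [],
    "TGCAG": [],
}
simplified = simplify(example, k=5)
```

ATTGC and TTGCA collapse into ATTGCA, while the branch to TGCAT and TGCAG is left alone because TTGCA has two outgoing arcs.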

  30. Step 3: simplification of the graph

  31. Other error correction steps • In Velvet: • Error removal • Removing tips Tip: a chain of nodes that is disconnected on one end.

  32. Step 4: Removing bubbles with the Tour Bus Algorithm • Consider two paths redundant if they start and end at the same nodes (forming a “bubble”) and contain similar sequences. • Such bubbles can be created by errors or biological variants, such as SNPs or cloning artifacts prior to sequencing. Erroneous bubbles are removed by an algorithm called “Tour Bus”.

  33. Step 4: Removing bubbles with the Tour Bus Algorithm

  34. Step 5: Find the Eulerian path • Algorithm for directed graphs: (1) Start with an empty stack and an empty circuit (the Eulerian path), and choose the starting vertex: • If every vertex has equal in-degree and out-degree, choose any vertex. • If all but two vertices have equal in-degree and out-degree, and of those two one has out-degree one greater than its in-degree while the other has in-degree one greater than its out-degree, choose the vertex whose out-degree is one greater than its in-degree. • Otherwise no Eulerian circuit or path exists. (2) If the current vertex has no outgoing edges (i.e., neighbors), add it to the circuit, remove the last vertex from the stack, and make that the current vertex. Otherwise, add the current vertex to the stack, take any of its neighbors, remove the edge between the vertex and that neighbor, and make the neighbor the current vertex. (3) Repeat step 2 until the current vertex has no outgoing edges and the stack is empty. The circuit then lists the path in reverse order.
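The stack-based procedure translates almost directly into Python. The sketch below takes the graph as a list of directed (u, v) edges, which is an assumption of this example; it returns None when the degree conditions fail or when the edges are not all reachable from the start vertex.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a vertex list that uses every directed edge once, or None."""
    if not edges:
        return []
    adj = defaultdict(list)
    outdeg = defaultdict(int)
    indeg = defaultdict(int)
    for u, v in edges:
        adj[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    vertices = set(outdeg) | set(indeg)

    # Step 1: degree check and choice of the starting vertex.
    starts = [v for v in vertices if outdeg[v] - indeg[v] == 1]
    ends = [v for v in vertices if indeg[v] - outdeg[v] == 1]
    if any(abs(outdeg[v] - indeg[v]) > 1 for v in vertices):
        return None
    if len(starts) > 1 or len(starts) != len(ends):
        return None
    current = starts[0] if starts else edges[0][0]

    # Steps 2-3: walk until stuck, then back up along the stack.
    stack, circuit = [], []
    while True:
        if adj[current]:
            stack.append(current)
            current = adj[current].pop()  # take a neighbor, remove the edge
        else:
            circuit.append(current)
            if not stack:
                break
            current = stack.pop()

    path = circuit[::-1]  # the circuit is collected in reverse order
    return path if len(path) == len(edges) + 1 else None
```

In a de Bruijn setting the vertices would be (k-1)-mers and the edges k-mers, so the returned path spells out the assembled sequence.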

  35. Some terminologies of output

  36. Some terminology of output • k-mer coverage: C_k = C*(L-k+1)/L, where C is the base coverage, L the read length, and k the k-mer length • N50 size: 50% of the genome is in contigs of length N50 or larger • Example: 1 Mbp genome, contigs (kbp): 300, 100, 50, 45, 30, 20, 15, 15, 10, … N50 = 30 kbp (300+100+50+45+30 = 525 >= 500 kbp) • Note: N50 is meaningful for comparison only when the genome size is the same
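Both quantities are easy to compute. In this sketch the N50 is taken against a supplied genome size, as in the slide's example, and falls back to the total contig length when none is given; the function names are illustrative. The k-mer coverage call uses the 40X coverage and ~300 nt reads from the data statistics, with k = 31 picked arbitrarily for illustration.

```python
def n50(contig_lengths, genome_size=None):
    """Length of the contig at which the running total reaches half the target."""
    target = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= target:
            return length
    return None  # the contigs cover less than half of the target size


def kmer_coverage(base_coverage, read_length, k):
    """C_k = C * (L - k + 1) / L."""
    return base_coverage * (read_length - k + 1) / read_length


# Slide example, lengths in kbp, 1 Mbp genome (only the listed contigs are used).
example_n50 = n50([300, 100, 50, 45, 30, 20, 15, 15, 10], genome_size=1000)
example_ck = kmer_coverage(40, 300, 31)
```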

  37. Assembly Algorithm with reference sequence • Map the k-mers of the reference sequence to their positions to get a "location map". • Map each read onto the "location map" according to its k-mers. • Example (location map of 5-mers): AATTG GGTTA AATGGTTACCA CCCAATTGAAA
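One way to sketch the location map is a dictionary from each reference k-mer to its positions, with each read placed by letting its k-mers vote for a start offset. The voting scheme, the function names, and the toy reference string are assumptions for this example; real mappers also handle mismatches and reverse strands.

```python
from collections import Counter, defaultdict


def build_location_map(reference, k):
    """k-mer -> list of start positions in the reference."""
    location_map = defaultdict(list)
    for i in range(len(reference) - k + 1):
        location_map[reference[i:i + k]].append(i)
    return location_map


def map_read(read, location_map, k):
    """Return the reference offset supported by the most k-mers, or None."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for pos in location_map.get(read[j:j + k], ()):
            votes[pos - j] += 1  # implied start of the read on the reference
    return votes.most_common(1)[0][0] if votes else None


reference = "CCCAATTGAAATGGTTACCA"  # hypothetical toy reference
location_map = build_location_map(reference, 5)
start = map_read("AATGGTTA", location_map, 5)
```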

  38. Visualization using Mauve

  39. Presentation overview • Biology background • Algorithms De novo -Overlap-Layout-Consensus - De Bruijn graphs Reference • Tools and Techniques • Work flow & strategy • Group management

  40. What is an .sff file? • Standard flowgram format (SFF): a binary file format used to encode results of pyrosequencing from the 454 Life Sciences platform for high-throughput sequencing. • Structure: a header section + read data sections

  41. Header A summary of general information regarding the file content

  42. Read data section • Reads' universal accession numbers (h), sequence information (s), quality scores of basecalls (q), clipping positions (c), flowgram values (f), flowgram indices (i) • The nucleotide bases + the quality scores

  43. 6 genomes, 6 .sff files • Number of reads ranges from 72548 to 391117 • [Figure labels: Reads, Assembler, Contigs/Scaffold]

  44. Newbler • GS De Novo Assembler: a software package designed specifically for assembling sequence data generated by pyrosequencing platforms • De novo assembly • Overlap-Layout-Consensus methodology • Deals better with reads longer than 250 bp • GS Reference Mapper

  45. Velvet • Algorithms for de novo assembly • Short read assembly (25~50 bp) • Uses de Bruijn graphs. • Using only very short reads and paired-end information, Velvet can produce contigs of significant length: up to 50-kb N50 in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs.

  46. AMOS • Open-source whole genome assembly software - Assemblers: Minimus2 - Validation and Visualization: Hawkeye - Scaffolding: Bambus - Trimming, Overlapping, & Error Correction

  47. Other Assemblers • Celera • MIRA • Edena • Finishing is a big challenge!

  48. Problems • Sequencing errors: misread base pairs, poly-A runs… • Some portions of the genome may be unsequenced • Identical and nearly identical sequences (repeats) can increase the time and space complexity of algorithms exponentially • Gaps & errors
