
Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs

Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. March 12, 2008. Daniel R. Zerbino and Ewan Birney. Presenter: Seunghak Lee.

Presentation Transcript


  1. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. March 12, 2008. Daniel R. Zerbino and Ewan Birney. Presenter: Seunghak Lee

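  2. What is a de Bruijn graph? • A “de Bruijn graph” is a directed graph • An edge represents an overlap between sequences of symbols • Symbols: S = {s_1, s_2, …, s_m}; the vertices V are the length-n sequences over S • E = {((v_1, v_2, …, v_n), (w_1, w_2, …, w_n)) : v_2 = w_1, v_3 = w_2, …, v_n = w_{n-1}}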
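
The edge rule on this slide can be checked on a toy alphabet. The following is a minimal sketch, not code from Velvet: the function name `de_bruijn_graph` and the binary alphabet are illustrative assumptions. Every vertex is a length-n string, and an edge runs from v to w when the last n-1 symbols of v equal the first n-1 symbols of w.

```python
from itertools import product


def de_bruijn_graph(symbols, n):
    """Build the de Bruijn graph of length-n sequences over `symbols`.

    Hypothetical helper for illustration; returns (vertices, edges).
    """
    vertices = ["".join(p) for p in product(symbols, repeat=n)]
    # Shifting v left by one symbol and appending any symbol s gives a
    # successor w with v[1:] == w[:-1], which is the edge condition.
    edges = [(v, v[1:] + s) for v in vertices for s in symbols]
    return vertices, edges


vertices, edges = de_bruijn_graph("01", 2)
for v, w in edges:
    print(v, "->", w)
```

With m symbols and sequences of length n there are m^n vertices and m^(n+1) edges, since each vertex has exactly m successors.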

  3. Introduction • New sequencing techniques are commercially available (e.g. 454 Sequencing, Solexa) • 454 Sequencing: reads of ~100–200 bp • Solexa: reads of ~30 bp • Algorithms designed for whole genome shotgun (WGS) assembly are not suitable for short reads • An overlap graph with one node per read becomes extremely large • Short reads produce more ambiguous connections in the assembly

  4. Introduction (cont) • The Euler assembler (Pevzner 2001) used k-mers as the nodes of a de Bruijn graph • Reads are mapped as paths through the de Bruijn graph • High redundancy does not affect the number of nodes • “Velvet” deals effectively with experimental errors and repeats by using de Bruijn graphs with k-mers

  5. De Bruijn Graphs – structure [figure: graph structure]

  6. De Bruijn Graphs – structure (cont) • Adjacent k-mers overlap by k-1 nucleotides • Each node is attached to a twin node, which holds the reverse series of reverse-complement k-mers • Twin nodes capture overlaps between reads from opposite strands • The union of a node and its twin node is called a “block” • The last k-mer of an arc's origin node overlaps with the first k-mer of its destination node
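
The k-1 overlap and the twin relationship can be seen on a single short read. This is an illustrative sketch only: the read `ATGGCGT`, the value k = 3, and the helper names are assumptions chosen for the example, not Velvet's data structures.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the consecutive k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "ATGGCGT"  # hypothetical read
k = 3

forward = kmers(read, k)                  # series held by the node
twin = kmers(reverse_complement(read), k) # series held by its twin

print(forward)  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
print(twin)     # ['ACG', 'CGC', 'GCC', 'CCA', 'CAT']
```

Reading the forward series backwards and reverse-complementing each k-mer gives exactly the twin series, which is why a read from the opposite strand lands on the twin node.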

  7. De Bruijn Graphs – construction • Reads are hashed with a predefined k-mer length • Small k → increased connectivity → more ambiguous repeats • Large k → increased specificity → decreased connectivity • Choose k by balancing “sensitivity” against “specificity”

  8. De Bruijn Graphs – construction (cont) • For each k-mer, the hash table records the ID of the first read in which it occurs and its position in that read • Each k-mer is recorded together with its reverse complement • A node is created at each distinct interruption point, i.e. where an overlap with another read begins or ends • Reads are traced through the graph • Directed arcs are created where necessary
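
The first-occurrence hash table described on this slide might look like the sketch below. It is an assumption-laden illustration, not Velvet's implementation: `index_kmers` is a hypothetical name, and storing each k-mer under the lexicographically smaller of itself and its reverse complement is one common way to record both strands under a single key.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_kmers(reads, k):
    """Map each k-mer to (read ID, position) of its first occurrence.

    Assumption: a k-mer and its reverse complement share one key,
    the lexicographically smaller of the two strings.
    """
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            key = min(kmer, reverse_complement(kmer))
            # setdefault keeps the first record and ignores later ones.
            table.setdefault(key, (read_id, pos))
    return table


table = index_kmers(["ATGGC", "TGGCA"], 3)  # two hypothetical reads
for key, (read_id, pos) in table.items():
    print(key, "first seen in read", read_id, "at position", pos)
```

Because later occurrences are ignored, the table grows with the number of distinct k-mers rather than with coverage depth.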

  9. De Bruijn Graphs – simplification • Chains of blocks are simplified • Simplification loses no information • If node A has only one outgoing arc, which points to node B, and node B has only one incoming arc → merge A and B
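
The merge rule can be sketched on a plain arc list. This is a simplified illustration under stated assumptions: nodes are opaque labels, twin nodes are ignored, the function name `simplify_chains` is hypothetical, and a closed cycle in which every node is mergeable is not handled.

```python
def simplify_chains(arcs):
    """Group nodes into maximal chains that the merge rule would collapse.

    arcs: iterable of (origin, destination) pairs.
    Returns a list of chains, each a list of node labels in path order.
    """
    succ, pred, nodes = {}, {}, set()
    for a, b in set(arcs):
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)
        nodes.update((a, b))

    def mergeable(a, b):
        # A has exactly one outgoing arc (to B), B has exactly one incoming arc.
        return len(succ.get(a, [])) == 1 and len(pred.get(b, [])) == 1

    chains = []
    for node in sorted(nodes):
        preds = pred.get(node, [])
        if len(preds) == 1 and mergeable(preds[0], node):
            continue  # node belongs to a chain that starts earlier
        chain = [node]
        while True:
            nexts = succ.get(chain[-1], [])
            if len(nexts) != 1 or not mergeable(chain[-1], nexts[0]):
                break
            chain.append(nexts[0])
        chains.append(chain)
    return chains


arcs = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
print(simplify_chains(arcs))  # [['A', 'B', 'C'], ['D'], ['E']]
```

Node C branches to D and E, so the chain stops there and D and E stay separate nodes.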

  10. De Bruijn Graphs – error removal • Velvet focuses on “topological features” of the graph • First step: remove tips • Tip: a chain of nodes that is disconnected at one end • Two criteria are used: (1) length and (2) minority count • Length: a tip is removed if it is shorter than 2k bp, since two nearby errors can create a tip of up to 2k bp [figure: two errors, each labeled k]

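  11. De Bruijn Graphs – error removal (cont) • Minority count: the arc leading into the tip has multiplicity m, lower than the multiplicity n of another arc leaving the same junction (m < n) • Starting from node B, going through the tip is an alternative to a more common path [figure: nodes A, B, C and a tip; arc multiplicities m and n]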
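
The two tip criteria can be combined into a single predicate. This sketch assumes that both criteria must hold before a tip is clipped; the function name and its parameters are hypothetical and do not come from the Velvet source code.

```python
def should_clip_tip(tip_length_bp, k, tip_multiplicity, sibling_multiplicities):
    """Decide whether a tip should be removed.

    tip_length_bp: length of the tip in base pairs.
    tip_multiplicity: multiplicity m of the arc leading into the tip.
    sibling_multiplicities: multiplicities of the other arcs leaving
        the same junction node.
    Assumption: length AND minority count must both be satisfied.
    """
    short_enough = tip_length_bp < 2 * k
    in_minority = any(tip_multiplicity < n for n in sibling_multiplicities)
    return short_enough and in_minority


# Hypothetical values: k = 31, so the length limit is 62 bp.
print(should_clip_tip(40, 31, 2, [15]))   # short and in the minority
print(should_clip_tip(80, 31, 2, [15]))   # too long to be a tip
print(should_clip_tip(40, 31, 15, [2]))   # short, but the more common path
```

The minority test protects a genuine sequence that happens to end near a junction: if it carries more reads than its alternative, it is kept even though it is short.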

  12. De Bruijn Graphs – error removal (cont) • Second step: remove bubbles using Tour Bus • Redundant paths start and end at the same nodes • Bubbles are created by errors or by biological variants such as SNPs [figure: bubble]

  13. De Bruijn Graphs – error removal (cont) • Tour Bus: 1. Detect redundant paths 2. Compare them using dynamic programming 3. If they are similar, merge them
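
Step 2, the dynamic programming comparison, can be illustrated with a standard edit-distance table over the sequences of two redundant paths. This is a sketch of the general technique rather than Tour Bus itself: path detection and merging are omitted, and the `max_divergence` threshold of 0.2 is a hypothetical value.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def paths_similar(a, b, max_divergence=0.2):
    """Return True if the two path sequences are close enough to merge.

    max_divergence is an assumed cutoff: edits per base of the longer path.
    """
    longest = max(len(a), len(b))
    return longest == 0 or edit_distance(a, b) / longest <= max_divergence


# Two hypothetical bubble branches that differ by one substitution.
print(edit_distance("ACGTACGT", "ACGAACGT"))
print(paths_similar("ACGTACGT", "ACGAACGT"))
```

A single-base difference such as a sequencing error or a SNP gives a small distance, so the two branches qualify for merging, while unrelated sequences do not.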

  14. De Bruijn Graphs – error removal (cont) • Third step: remove erroneous connections • This step runs after the Tour Bus algorithm • Erroneous connections are removed with a basic coverage cutoff • Genuine short nodes that cannot be simplified in the graph should have high coverage
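
A basic coverage cutoff amounts to a filter over per-node coverage. The sketch below assumes coverage is already known for each node; the function name, node labels, coverage values, and the cutoff of 4.0 are all hypothetical.

```python
def apply_coverage_cutoff(node_coverage, cutoff):
    """Keep only the nodes whose coverage is at least `cutoff`.

    node_coverage: dict mapping node label to its average coverage.
    """
    return {node: cov for node, cov in node_coverage.items() if cov >= cutoff}


# Hypothetical coverages: n2 is a low-coverage connection.
node_coverage = {"n1": 38.0, "n2": 1.5, "n3": 42.0}
kept = apply_coverage_cutoff(node_coverage, cutoff=4.0)
print(sorted(kept))  # ['n1', 'n3']
```

Running this filter after tips and bubbles have been handled matters, because a genuine short node that survives simplification is expected to carry high coverage and so passes the cutoff.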

  15. Breadcrumb: resolution of repeats • Using read pairs, pair up the long nodes • Flag paired reads using unambiguous long nodes [figure: unambiguous long nodes]

  16. Breadcrumb: resolution of repeats (cont) • Using read pairs, pair up the long nodes • Flag paired reads using unambiguous long nodes [figure: unambiguous long nodes]

  17. Breadcrumb: resolution of repeats (cont) • Nodes are extended as far as possible using the flagged paired reads • All nodes between A and B are paired up to either A or B

  18. Experimental Results • The error removal pipeline was tested on simulated data • Reads were simulated from E. coli, S. cerevisiae, C. elegans, and H. sapiens • Coverage density vs. N50 for H. sapiens • N50 is limited by the natural repetition of the reference genome [figure: N50 vs. coverage density; series: Ideal, + Error (1%), + SNP]
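
Since N50 is the metric plotted on this slide, a short worked computation may help. N50 is the length L such that contigs of length L or longer together contain at least half of all assembled bases. The contig lengths below are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to half the total.
        if 2 * running >= total:
            return length
    return 0


contigs = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths in kbp
print(n50(contigs))  # 70: 80 + 70 = 150 covers at least half of 290
```

A higher N50 means that more of the assembly sits in long contigs, which is why repeats in the reference genome cap the achievable value.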

  19. Experimental Results (cont) • The error removal pipeline was tested on experimental data • A 173,428 bp human BAC was sequenced using Solexa machines • Reads were 35 bp long, and k = 31 • Tour Bus increased sensitivity by correcting errors while preserving the integrity of the graph structure

  20. Experimental Results (cont)

  21. Experimental Results (cont)

  22. Conclusions • Velvet is a de Bruijn graph-based sequence assembly method for short reads • Errors are handled by removing tips and by the Tour Bus algorithm • A large number of repeats are resolved by the Breadcrumb algorithm • Velvet was assessed on simulated and real datasets and performed well on both
