allpaths de novo assembly of whole genome shotgun microreads
Skip this Video
Download Presentation
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Loading in 2 Seconds...

play fullscreen
1 / 22

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads - PowerPoint PPT Presentation

  • Uploaded on

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads. Jonathan Butler, Iain MacCallum , Michael Kleber , Ilya A. Shlyakhter , Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum , and David B. Jaffe Presented by: Mohit Jain. Introduction. Overview. Algorithm. Results.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads' - therese

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
allpaths de novo assembly of whole genome shotgun microreads

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe

Presented by: Mohit Jain







(P) DNA Sequencing:

  • Chr length: ~1000 - ~250,000,000 bps
  • Longest sequence-able fragment: ~600 bps

(S) Shotgun Method:determine sequence by breaking genome into many small segments (reads)

(P) Sequence/Genome Assembly:combining these reads to reconstruct the genome

Slide: 1/


(S) Original Genome:

Shortest Common Superstring (SCS) Problem = The shortest sequence that contains every read as a substring

(P) Repeats

Genomes have repeats, and SCS represents repeats only once

(S) Overlap Graph

s overlap graph
(S)Overlap Graph

Each read forms a Node

Edge exists between two nodes if the reads overlap


Step 1: Removing redundant edges, classify edges as required/optional

Step 2: Find the shortest walk which includes all required edges

Red/Thin: False Overlaps

p overlap graph
(P)Overlap Graph

Microreads =only 25-50 bases long (for HTS)

shorter reads = shorter overlap => more reads => more overlaps

- Very large number of (mostly false) overlaps

- Large number of reads + short overlap + higher error rate

s de bruijn graph
(S)De Bruijn Graph
  • To construct de Bruijn graph:

all reads are broken in to overlapping subsequences of length k (k-mer)

  • Each k-1 subsequence represents a Node
  • A directed Edge e exists between two nodes a and b iff there exists a k-mer such that its prefix = a and its suffix = b
s de bruijn graph7
(S)De Bruijn Graph
  • Condensed by collapsing non-ambiguous paths
  • Genome: An Eulerian path (Superwalk: walk including all edges) in this graph
paired reads mate pairs
Paired Reads (Mate pairs)
  • Sequence two ends of a fragment of known size
  • Results better assemblies, but more complicated
current approaches
Current Approaches
  • Velvet

(Velvet and Euler USR are based on De Bruijn Graph method)


Step I. Builds Unipath Graph

Step II. Localizes reads sequences before assembly

Unipath: maximal unbranched sequence

short fragment pair merger
Short Fragment Pair Merger
  • Fills the gap in between two paired reads
  • Builds a local unipath graph
  • Extend both ends (of all reads) based on the local unipath graph
  • For each pair, search for other pairs which overlap on both ends, and merge to obtain longer reads
short fragment pair merger13
Short Fragment Pair Merger
    • Repeat the process for all pairs.
  • Once sequence is complete, update the local unipath graph
  • Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome
allpaths paired read assembly algorithm
ALLPATHS Paired-Read Assembly Algorithm

Step 1: Creating Approximate Unipaths

1a: Error correction

1b: k-mer numbering and searchable data structure (Ignoring any overlaps between reads)

1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered

Read pairs  Unipaths  Localize


Step 2: Selecting Seeds

Seeds = Unipaths around which assemblies of genomic regions are build

Ideal seed: Long Unipaths with Low Copy Number (=1)

Copy Number = Inferred from read coverage of the unipaths

2a: For each unipath, compute the closest unipaths in the set that are to the left and to the right of the given unipath

2b: If the distance between left and right neighbours is less than 4 kb, then the middle unipath is removed

2c: After all such unipaths are removed, remaining forms the seeds unipaths


Step 3: Assembling neighbourhoods around the seeds

Neighbourhood = Seed + 10 kb on each side

3a: Define a collection of low-copy number unipaths, using iterative linking

3b: Construct two sets of read clouds:

primary(B): only reads, whose true genomic locations are near the seed

secondary(C): contains all the short-fragment read pairs (~0.5 kb) near the seed


Problem of too-many closures persists, hence use Short-Fragment Pair Merger (progressively merge the secondary read cloud pairs)


paired-read links



Step 4: Finding All Paths

compute the closures (include false closures) of all the merged short-fragment pairs

Step 5: Gluing Together the Local Assembly

sequence graph is formed by iteratively joining closures

Step 6: Building the Global Assembly

outputs of local assemblies are glued together to yield a single sequence graph


Step 7: Editing the Assembly

To remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other

  • Simulated Data

10 reference genomes from bacteria and fungi, and 1 10-Mb segment of human genome; with introduced errors

  • Real Data



Simulated Data

  • Highly complete and contiguousassemblies (Proportion of genome covered > 96%)
  • Assembly ambiguities regions <20 per megabase
  • Assemblies of C.jejuni and E.coli have no errors. Very high accuracy, less than one error per 106 bases

Real Data

  • High coverage (99.1%)
  • High continuity
  • High accuracy (Final assembly matches the reference sequence exactly, with only 12 exceptions)
+ / -

+ Read Localization

+ Multi-CPU compatible

+ Extremely good (accurate) results

- Slow

- Very memory intensive

- Impractical assumptions on input data

(500bp +/- 5bp insert size)