
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads


ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads. Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. Presented by: Mohit Jain. Introduction. Overview. Algorithm. Results.

Presentation Transcript

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe

Presented by: Mohit Jain

Motivation

(P) DNA Sequencing:

  • Chromosome length: ~1,000 to ~250,000,000 bp

  • Longest sequenceable fragment: ~600 bp

(S) Shotgun Method: determine the sequence by breaking the genome into many small segments (reads)

(P) Sequence/Genome Assembly: combining these reads to reconstruct the genome


Motivation (continued)

(S) Original Genome:

Shortest Common Superstring (SCS) problem: find the shortest sequence that contains every read as a substring

(P) Repeats

Genomes contain repeats, but the SCS represents each repeat only once

(S) Overlap Graph

(S) Overlap Graph

Each read forms a Node

Edge exists between two nodes if the reads overlap


Step 1: Remove redundant edges and classify the remaining edges as required or optional

Step 2: Find the shortest walk which includes all required edges

Red/Thin: False Overlaps

(P) Overlap Graph

Microreads = only 25-50 bases long (for HTS)

Shorter reads => shorter overlaps => more reads => more overlaps

- Very large number of (mostly false) overlaps

- Large number of reads + short overlap + higher error rate

(S) De Bruijn Graph

  • To construct the de Bruijn graph:

    all reads are broken into overlapping subsequences of length k (k-mers)

  • Each subsequence of length k-1 represents a node

  • A directed edge e exists between two nodes a and b iff there exists a k-mer whose prefix of length k-1 is a and whose suffix of length k-1 is b
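The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the ALLPATHS implementation: the function name `build_de_bruijn` and the toy reads are made up for the example, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read to enumerate its k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTA", "CGTAC"]
graph = build_de_bruijn(reads, k=3)
# Edges: AC->CG, CG->GT, GT->TA, TA->AC
```

Note that the overlap between the two reads is never computed explicitly: reads that share k-mers simply contribute the same edges.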

(S) De Bruijn Graph (continued)

  • Condensed by collapsing non-ambiguous paths

  • Genome: an Eulerian path in this graph (a superwalk: a walk that includes all edges)
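As an illustration of reading a genome off the graph, the sketch below finds an Eulerian path with Hierholzer's algorithm and spells out the sequence. It assumes an ideal graph in which such a path exists (no errors, each edge used exactly once); the names `eulerian_path` and `spell` and the toy edge list are hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a node sequence that uses every directed edge exactly once.

    Assumes an Eulerian path exists; no validity check is performed.
    """
    out = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for a, b in edges:
        out[a].append(b)
        balance[a] += 1
        balance[b] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, d in balance.items() if d == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: node is finished
    return path[::-1]


def spell(path):
    """Concatenate (k-1)-mer nodes that overlap by k-2 bases."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Edges of the 3-mer graph of the toy sequence ACGTACC.
edges = [("AC", "CG"), ("CG", "GT"), ("GT", "TA"), ("TA", "AC"), ("AC", "CC")]
print(spell(eulerian_path(edges)))  # prints ACGTACC
```

With real data the graph usually admits many Eulerian paths, which is one reason paired reads are needed to choose among them.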

Paired Reads (Mate pairs)

  • Sequence two ends of a fragment of known size

  • Results in better assemblies, but makes assembly more complicated

Current Approaches

  • Velvet



    (Velvet and EULER-USR are based on the de Bruijn graph method)

ALLPATHS

Step I. Builds a unipath graph

Step II. Localizes read sequences before assembly

Unipath: maximal unbranched sequence
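To make the definition concrete, here is a minimal sketch that extracts maximal unbranched paths from a de Bruijn graph given as a list of distinct directed edges. It is an illustration under simplifying assumptions, not the ALLPATHS code: the names `unipaths` and `spell` are hypothetical, and isolated cycles without any branch point are not reported.

```python
from collections import defaultdict


def unipaths(edges):
    """Return maximal unbranched paths as lists of nodes.

    `edges` must contain each directed edge once.
    """
    out, indeg = defaultdict(list), defaultdict(int)
    for a, b in edges:
        out[a].append(b)
        indeg[b] += 1
    nodes = set(out) | set(indeg)

    def is_internal(n):
        # Exactly one way in and one way out: no branch at this node.
        return indeg[n] == 1 and len(out[n]) == 1

    paths = []
    for n in nodes:
        if is_internal(n):
            continue  # unipaths start only at branch points or ends
        for nxt in out[n]:
            path = [n, nxt]
            while is_internal(path[-1]):
                path.append(out[path[-1]][0])
            paths.append(path)
    return paths


def spell(path):
    """Concatenate (k-1)-mer nodes that overlap by k-2 bases."""
    return path[0] + "".join(node[-1] for node in path[1:])


edges = [("AC", "CG"), ("CG", "GT"), ("GT", "TA"), ("TA", "AC"), ("AC", "CC")]
print(sorted(spell(p) for p in unipaths(edges)))  # prints ['ACC', 'ACGTAC']
```

Here the node AC has two outgoing edges, so every unipath starts or stops there.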

Short Fragment Pair Merger

  • Fills the gap in between two paired reads

  • Builds a local unipath graph

  • Extend both ends (of all reads) based on the local unipath graph

  • For each pair, search for other pairs which overlap on both ends, and merge to obtain longer reads
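The last bullet can be illustrated with a toy merge rule. The sketch below assumes, purely for illustration, that both reads of a pair are given on the same strand and that two pairs may be merged only when their left reads and their right reads overlap at the same offset; the slides do not state the exact criterion. The names `shifts` and `merge_pairs` and the sequences are hypothetical.

```python
def shifts(a, b, min_overlap):
    """Offsets s at which b, placed s bases into a, matches the rest of a."""
    return [s for s in range(len(a) - min_overlap + 1) if b.startswith(a[s:])]


def merge_pairs(p, q, min_overlap):
    """Merge pair q onto pair p when both ends overlap at a common offset.

    Returns the merged (left, right) pair, or None if no common offset exists.
    """
    (p_left, p_right), (q_left, q_right) = p, q
    common = set(shifts(p_left, q_left, min_overlap)) & set(
        shifts(p_right, q_right, min_overlap)
    )
    if not common:
        return None
    s = min(common)
    # Both reads grow by the same s bases, so the pair stays consistent.
    return p_left[:s] + q_left, p_right[:s] + q_right


p = ("ACGTAC", "TTGACC")
q = ("GTACGG", "GACCAT")
print(merge_pairs(p, q, min_overlap=3))  # prints ('ACGTACGG', 'TTGACCAT')
```

Requiring agreement on both ends is what makes the merge more trustworthy than a single-read overlap: a chance overlap on one end is unlikely to be matched by an overlap at the same offset on the other.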

Short Fragment Pair Merger (continued)

  • Repeat the process for all pairs.

  • Once sequence is complete, update the local unipath graph

  • Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome

    ALLPATHS Paired-Read Assembly Algorithm

    Step 1: Creating Approximate Unipaths

    1a: Error correction

    1b: k-mer numbering and searchable data structure (Ignoring any overlaps between reads)

    1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered

    Read pairs => Unipaths => Localize
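Step 1b can be pictured as giving every distinct k-mer an integer ID and keeping an index of where each ID occurs. The sketch below uses plain first-seen numbering, which is an assumption for illustration only; the slides do not describe the numbering scheme ALLPATHS actually uses, and the name `number_kmers` is hypothetical.

```python
def number_kmers(reads, k):
    """Assign integer IDs to k-mers and index their occurrences.

    Returns (ids, occurrences):
      ids:         k-mer string -> integer ID (in order of first appearance)
      occurrences: integer ID   -> list of (read index, offset in read)
    """
    ids, occurrences = {}, {}
    for r, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # len(ids) is evaluated before insertion, so new k-mers get 0, 1, 2, ...
            kid = ids.setdefault(kmer, len(ids))
            occurrences.setdefault(kid, []).append((r, i))
    return ids, occurrences


ids, occurrences = number_kmers(["ACGT", "CGTA"], k=3)
print(ids)             # prints {'ACG': 0, 'CGT': 1, 'GTA': 2}
print(occurrences[1])  # prints [(0, 1), (1, 0)]
```

Once k-mers are integers, a lookup such as "which reads contain this k-mer" is a dictionary access, and no read-to-read overlap ever has to be computed.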


    Step 2: Selecting Seeds

    Seeds = Unipaths around which assemblies of genomic regions are built

    Ideal seed: Long Unipaths with Low Copy Number (=1)

    Copy Number = Inferred from read coverage of the unipaths
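One simple way to turn coverage into a copy-number estimate is to divide each unipath's read depth by the depth of a typical single-copy unipath. The sketch below assumes that the median depth corresponds to copy number 1, which is an assumption for illustration and not necessarily how ALLPATHS does it; the name `estimate_copy_numbers` and the depths are hypothetical.

```python
from statistics import median


def estimate_copy_numbers(coverage):
    """Estimate copy number per unipath from mean read depth.

    `coverage` maps unipath ID -> mean read depth. The median depth is
    taken as the single-copy depth (assumption: most unipaths are unique).
    """
    unit = median(coverage.values())
    return {u: max(1, round(c / unit)) for u, c in coverage.items()}


coverage = {"u1": 40.0, "u2": 38.0, "u3": 81.0, "u4": 42.0, "u5": 119.0}
print(estimate_copy_numbers(coverage))
# prints {'u1': 1, 'u2': 1, 'u3': 2, 'u4': 1, 'u5': 3}
```

Short unipaths give noisy depth estimates, which is one reason long unipaths are preferred as seeds.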

    2a: For each unipath, compute the closest unipaths in the set that are to the left and to the right of the given unipath

    2b: If the distance between left and right neighbours is less than 4 kb, then the middle unipath is removed

    2c: After all such unipaths are removed, the remaining unipaths form the seeds
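Steps 2a to 2c amount to thinning the candidate unipaths until neighbouring seeds are reasonably far apart. The sketch below assumes each candidate has a known start and end coordinate on a line and processes candidates greedily from left to right; in practice positions would have to be inferred (for example from paired-read links), and the removal order is an assumption. The name `select_seeds` and the coordinates are hypothetical.

```python
def select_seeds(unipaths, min_gap=4000):
    """Thin candidate seeds; `unipaths` is a list of (name, start, end).

    A unipath is dropped when the distance from the end of its left
    neighbour to the start of its right neighbour is below `min_gap`.
    """
    seeds = sorted(unipaths, key=lambda u: u[1])
    i = 1
    while i < len(seeds) - 1:
        left, right = seeds[i - 1], seeds[i + 1]
        if right[1] - left[2] < min_gap:
            del seeds[i]  # neighbours already cover this region
        else:
            i += 1        # keep it and move on
    return [name for name, _, _ in seeds]


candidates = [
    ("a", 0, 1000),
    ("b", 1500, 2500),
    ("c", 3000, 4000),
    ("d", 9000, 10000),
    ("e", 15000, 16000),
]
print(select_seeds(candidates))  # prints ['a', 'c', 'd', 'e']
```

Only `b` is removed: its neighbours `a` and `c` are 2 kb apart, so the neighbourhoods built around them would cover `b` anyway.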


    Step 3: Assembling neighbourhoods around the seeds

    Neighbourhood = Seed + 10 kb on each side

    3a: Define a collection of low-copy number unipaths, using iterative linking

    3b: Construct two sets of read clouds:

    primary (B): contains only reads whose true genomic locations are near the seed

    secondary (C): contains all the short-fragment read pairs (~0.5 kb) near the seed


    The problem of too many closures persists, hence the Short-Fragment Pair Merger is used (it progressively merges the read pairs of the secondary read cloud)


    paired-read links



    Step 4: Finding All Paths

    Compute the closures (including false closures) of all the merged short-fragment pairs
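A closure can be thought of as a path through the unipath graph that connects the two reads of a pair and whose length is consistent with the fragment size. The sketch below enumerates all such paths between two nodes by depth-first search with a length bound. It is a simplified illustration: path length is taken as the plain sum of node lengths (overlaps between adjacent unipaths are ignored), all lengths are assumed positive, and the name `find_closures` and the toy graph are hypothetical.

```python
def find_closures(graph, length, source, target, min_len, max_len):
    """List every path source -> target with total length in [min_len, max_len].

    graph:  node -> list of successor nodes
    length: node -> positive length in bases
    Nodes may be revisited, so repeats can be traversed more than once.
    """
    closures = []
    stack = [([source], length[source])]
    while stack:
        path, total = stack.pop()
        if total > max_len:
            continue  # too long already; positive lengths guarantee termination
        if path[-1] == target and total >= min_len:
            closures.append(path)
        for nxt in graph.get(path[-1], []):
            stack.append((path + [nxt], total + length[nxt]))
    return closures


# Toy graph: A -> R -> B, with a loop R -> C -> R through a repeat R.
graph = {"A": ["R"], "R": ["B", "C"], "C": ["R"]}
length = {"A": 100, "R": 50, "B": 100, "C": 200}
print(find_closures(graph, length, "A", "B", min_len=400, max_len=600))
# prints [['A', 'R', 'C', 'R', 'B']]
```

A tight length window leaves a single closure here, while a loose one (say 200 to 600) admits two. This is why the number of closures grows quickly for longer fragments and why the search is restricted to a local neighbourhood.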

    Step 5: Gluing Together the Local Assembly

    sequence graph is formed by iteratively joining closures

    Step 6: Building the Global Assembly

    outputs of local assemblies are glued together to yield a single sequence graph


    Step 7: Editing the Assembly

    To remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other

    Experiments

    • Simulated Data

      10 reference genomes from bacteria and fungi, and one 10-Mb segment of the human genome, with introduced errors

    • Real Data


    Results

    Simulated Data

    • Highly complete and contiguous assemblies (proportion of genome covered > 96%)

    • Ambiguous regions in the assembly: < 20 per megabase

    • Assemblies of C. jejuni and E. coli have no errors. Very high accuracy: less than one error per 10^6 bases

      Real Data

    • High coverage (99.1%)

    • High continuity

    • High accuracy (Final assembly matches the reference sequence exactly, with only 12 exceptions)

    + / -

    + Read Localization

    + Multi-CPU compatible

    + Extremely good (accurate) results

    - Slow

    - Very memory intensive

    - Impractical assumptions on input data

    (500 bp +/- 5 bp insert size)