
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe

Presented by: Mohit Jain



Introduction

Overview

Algorithm

Results

Motivation

(P) DNA Sequencing:

  • Chromosome length: ~1,000 to ~250,000,000 bp

  • Longest sequenceable fragment: ~600 bp

(S) Shotgun Method: determine the sequence by breaking the genome into many small segments (reads)

(P) Sequence/Genome Assembly: combining these reads to reconstruct the genome
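
As a minimal illustration of the shotgun idea above, the following Python sketch samples fixed-length reads from a toy genome. The function name `shotgun_reads`, the toy sequence, and the uniform, error-free, single-strand sampling are assumptions made for illustration only; real reads contain errors and come from both strands.

```python
import random


def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at random positions."""
    rng = random.Random(seed)            # fixed seed keeps the sketch reproducible
    last_start = len(genome) - read_len  # last position where a full read fits
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


genome = "ATGGCGTGCAATGGCGTACCA"   # hypothetical toy genome
reads = shotgun_reads(genome, read_len=6, n_reads=20)
```

Assembly is the inverse task: given only `reads`, recover `genome`.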

Motivation

(S) Original Genome:

Shortest Common Superstring (SCS) problem: find the shortest sequence that contains every read as a substring

(P) Repeats

Genomes contain repeats, but the SCS represents each repeat only once
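
To see why repeats defeat the SCS formulation, here is a small sketch that builds a common superstring with a greedy merge heuristic. Exact SCS is computationally hard, so the greedy rule is a stand-in technique and not something the slides prescribe. The toy genome and the function names are assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    strings = list(reads)
    while True:
        strings = list(dict.fromkeys(strings))  # drop exact duplicates
        # drop strings that are already contained in another string
        strings = [s for s in strings
                   if not any(s != t and s in t for t in strings)]
        if len(strings) == 1:
            return strings[0]
        n, a, b = max(((overlap(a, b), a, b)
                       for a in strings for b in strings if a != b),
                      key=lambda item: item[0])
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[n:])  # merge b onto a, sharing the overlap


genome = "ACGACGACGT"   # toy genome containing a tandem repeat of ACG
reads = [genome[i:i + 4] for i in range(len(genome) - 3)]
superstring = greedy_superstring(reads)
```

The resulting superstring contains every read yet is shorter than the genome, because the repeated copies of ACG are collapsed.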

(S) Overlap Graph


(S) Overlap Graph

Each read forms a node

An edge exists between two nodes if the reads overlap

Algorithm:

Step 1: Remove redundant edges; classify the remaining edges as required or optional

Step 2: Find the shortest walk that includes all required edges

(Figure legend: red/thin edges are false overlaps)
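
The graph construction described on this slide can be sketched as follows. This is only the node-and-edge construction, not the edge classification or shortest-walk steps. The `min_overlap` threshold and the toy reads are assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_overlap):
    """Map each read to the reads it overlaps, with the overlap length."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b)
                if n >= min_overlap:
                    graph[a][b] = n   # directed edge a -> b
    return graph


reads = ["ATGGC", "GGCGT", "CGTGC"]   # toy reads
graph = overlap_graph(reads, min_overlap=3)
```

Lowering `min_overlap` adds edges: with `min_overlap=1` the first read also links to the third through a single shared base. Very short overlaps like this can arise by chance, which is how false edges enter the graph.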


(P) Overlap Graph

Microreads = only 25-50 bases long (from high-throughput sequencing, HTS)

Shorter reads = shorter overlaps => more reads => more overlaps

- Very large number of (mostly false) overlaps

- Large number of reads + short overlaps + higher error rate
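
The "shorter reads => more reads" step follows from the usual coverage relationship, number of reads = coverage x genome length / read length. The sketch below works through it. The genome size and coverage are hypothetical numbers chosen only to show the scaling.

```python
import math


def reads_needed(genome_length, read_length, coverage):
    """Reads required so that each base is covered `coverage` times on average."""
    return math.ceil(coverage * genome_length / read_length)


# Hypothetical 1 Mb genome at 10x coverage
long_reads = reads_needed(1_000_000, 600, 10)   # ~600-base reads
microreads = reads_needed(1_000_000, 30, 10)    # 30-base microreads
```

At the same coverage, 30-base reads require 20 times as many reads as 600-base reads. An all-against-all overlap comparison grows roughly with the square of the read count.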


(S) De Bruijn Graph

  • To construct the de Bruijn graph:

    all reads are broken into overlapping subsequences of length k (k-mers)

  • Each subsequence of length k-1 represents a node

  • A directed edge e exists from node a to node b iff there exists a k-mer whose prefix is a and whose suffix is b
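
The construction above can be sketched in a few lines. This version stores each edge once (a set of successors), so k-mer multiplicity is discarded. That simplification, the function name, and the toy reads are assumptions.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix (k-1)-mer -> suffix (k-1)-mer
    return dict(graph)


graph = de_bruijn(["ATGGC", "GGCGT"], k=3)
```

Note that the k-mer GGC occurs in both reads but contributes a single edge, so overlaps between reads never have to be computed explicitly.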


(S) De Bruijn Graph

  • Condensed by collapsing non-ambiguous paths

  • Genome: an Eulerian path (superwalk: a walk that includes every edge) in this graph
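
To make the Eulerian-path view concrete, the sketch below finds one Eulerian path with Hierholzer's algorithm and spells out the sequence it encodes. It assumes an Eulerian path exists and that each k-mer of the toy genome is used exactly once. The helper names are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def eulerian_path(edges):
    """Return the node list of one Eulerian path (assumes one exists)."""
    out = defaultdict(list)
    balance = Counter()              # out-degree minus in-degree per node
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())        # dead end: emit node
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ATGGCGTGCA"   # toy genome
edges = [(km[:-1], km[1:]) for km in kmers(genome, 3)]
assembled = spell(eulerian_path(edges))
```

The assembled string uses exactly the same k-mers as the genome but need not equal it: when nodes are revisited, several Eulerian paths exist, which is the ambiguity that repeats introduce.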


Paired Reads (Mate pairs)

  • Sequence two ends of a fragment of known size

  • Results in better assemblies, but makes assembly more complicated


Current Approaches

  • Velvet

  • EULER-USR

  • ALLPATHS

    (Velvet and EULER-USR are based on the de Bruijn graph method)


ALLPATHS

Step I. Builds Unipath Graph

Step II. Localizes read sequences before assembly

Unipath: maximal unbranched sequence



Short Fragment Pair Merger

  • Fills the gap between two paired reads

  • Builds a local unipath graph

  • Extends both ends (of all reads) based on the local unipath graph

  • For each pair, searches for other pairs that overlap on both ends, and merges them to obtain longer reads


Short Fragment Pair Merger

  • Repeat the process for all pairs.

  • Once the sequence is complete, update the local unipath graph

  • Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome
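
The pair-merging rule ("overlap on both ends") can be sketched as below. This is a simplified reading of the slides, not the ALLPATHS implementation: it assumes equal-length, error-free reads, both written in the same orientation, and requires the same shift on the left and right reads. `merge_pairs` and `min_overlap` are hypothetical names.

```python
def merge_pairs(p, q, min_overlap):
    """Merge pair q onto pair p if both ends overlap with the same shift.

    p and q are (left_read, right_read) tuples of equal-length reads.
    Returns the merged, longer pair, or None if no consistent shift exists.
    """
    read_len = len(p[0])
    for shift in range(1, read_len - min_overlap + 1):
        left_ok = p[0][shift:] == q[0][:read_len - shift]
        right_ok = p[1][shift:] == q[1][:read_len - shift]
        if left_ok and right_ok:
            return (p[0][:shift] + q[0], p[1][:shift] + q[1])
    return None


p = ("ACGTTG", "GCTTAC")
q = ("GTTGCA", "TTACCG")   # shifted 2 bases to the right of p on both ends
r = ("GTTGCA", "AAAAAA")   # overlaps p on the left end only

merged = merge_pairs(p, q, min_overlap=3)
rejected = merge_pairs(p, r, min_overlap=3)
```

Requiring agreement on both ends is what makes the merge safer than a single-read overlap: a chance match on one end is unlikely to be matched at the same shift on the other.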


    ALLPATHS Paired-Read Assembly Algorithm

    Step 1: Creating Approximate Unipaths

    1a: Error correction

    1b: k-mer numbering and building a searchable data structure (ignoring any overlaps between reads)

    1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered

    Read pairs -> Unipaths -> Localize
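
Step 1c can be illustrated with a small sketch that walks a de Bruijn graph and emits each maximal unbranched stretch. It uses (k-1)-mer nodes as on the earlier slides instead of the k-mer numbering of step 1b, skips error correction, and ignores isolated cycles. Those simplifications and the function name are assumptions.

```python
from collections import defaultdict


def unipaths(reads, k):
    """Return the maximal unbranched sequences of the de Bruijn graph."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    nodes = sorted(set(succ) | set(pred))

    def is_through(node):
        # exactly one way in and one way out: no branch here
        return len(succ[node]) == 1 and len(pred[node]) == 1

    paths = []
    for node in nodes:
        if is_through(node):
            continue                      # unipaths start at branch or end nodes
        for nxt in sorted(succ[node]):
            seq = node + nxt[-1]
            while is_through(nxt):        # walk until a branch is encountered
                nxt = next(iter(succ[nxt]))
                seq += nxt[-1]
            paths.append(seq)
    return paths


paths = unipaths(["ATGGCGTGCA"], k=3)
```

Each returned sequence corresponds to one edge of the condensed graph; branches occur only where unipaths meet.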



    Step 2: Selecting Seeds

    Seeds = unipaths around which assemblies of genomic regions are built

    Ideal seed: a long unipath with low copy number (= 1)

    Copy number = inferred from read coverage of the unipaths

    2a: For each unipath, find the closest unipaths in the set to its left and to its right

    2b: If the distance between the left and right neighbours is less than 4 kb, the middle unipath is removed

    2c: After all such unipaths are removed, the remaining unipaths form the seeds
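
A rough sketch of the seed-thinning rule (2a to 2c) and of copy-number estimation follows. The slides do not say how distances between unipaths are obtained, so this sketch assumes each candidate unipath already carries approximate `start` and `end` coordinates. It also assumes a known single-copy coverage value. All names and numbers are hypothetical, and the left-to-right removal order is one arbitrary choice.

```python
def estimate_copy_number(coverage, single_copy_coverage):
    """Infer copy number as the ratio of observed to single-copy coverage."""
    return max(1, round(coverage / single_copy_coverage))


def select_seeds(candidates, max_gap=4000):
    """Drop a unipath when its left and right neighbours are < max_gap apart."""
    seeds = sorted(candidates, key=lambda u: u["start"])
    removed = True
    while removed:
        removed = False
        for i in range(1, len(seeds) - 1):
            left, right = seeds[i - 1], seeds[i + 1]
            if right["start"] - left["end"] < max_gap:
                del seeds[i]          # neighbours already cover this region
                removed = True
                break                 # neighbours changed, so rescan
    return seeds


candidates = [
    {"name": "a", "start": 0, "end": 1000},
    {"name": "b", "start": 1500, "end": 2500},
    {"name": "c", "start": 3000, "end": 4000},
    {"name": "d", "start": 9000, "end": 10000},
]
seeds = select_seeds(candidates)
```

Here `b` is dropped because `a` and `c` lie only 2 kb apart, while `c` is kept because `a` and `d` are 8 kb apart. The effect is a set of seeds spaced so that their neighbourhoods overlap without being redundant.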



    Step 3: Assembling neighbourhoods around the seeds

    Neighbourhood = Seed + 10 kb on each side

    3a: Define a collection of low-copy-number unipaths, using iterative linking

    3b: Construct two sets of read clouds:

    primary (B): only reads whose true genomic locations are near the seed

    secondary (C): all the short-fragment read pairs (~0.5 kb) near the seed

    The problem of too many closures persists, so the Short-Fragment Pair Merger is used to progressively merge the read pairs of the secondary read cloud

    (Figure labels: partners, unipaths, paired-read links, C)



    Step 4: Finding All Paths

    Compute the closures (including false closures) of all the merged short-fragment pairs

    Step 5: Gluing Together the Local Assembly

    A sequence graph is formed by iteratively joining closures

    Step 6: Building the Global Assembly

    The outputs of the local assemblies are glued together to yield a single sequence graph
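
For Step 4, a closure of a read pair can be thought of as a walk through the graph from the left read to the right read whose length is consistent with the fragment size. The sketch below enumerates all such walks on a toy graph. The graph encoding (node to list of `(next_node, added_length)` with positive lengths), the length window, and the function name are assumptions made for illustration.

```python
def closures(graph, start, end, min_len, max_len):
    """Enumerate all walks from start to end with total length in [min_len, max_len].

    Nodes may be revisited, since a repeat can be traversed more than once.
    Edge lengths must be positive so that the search terminates.
    """
    results = []
    stack = [(start, [start], 0)]
    while stack:
        node, path, length = stack.pop()
        if node == end and min_len <= length <= max_len:
            results.append(path)
        for nxt, step in graph.get(node, []):
            if length + step <= max_len:      # prune walks that are already too long
                stack.append((nxt, path + [nxt], length + step))
    return results


# Toy graph: the left read L leads into a repeat R1 (self-loop), then to the right read R
graph = {
    "L": [("R1", 100)],
    "R1": [("R1", 100), ("R", 100)],
}
found = closures(graph, "L", "R", min_len=250, max_len=450)
```

Even this tiny graph yields two closures (two or three passes through the repeat), and only one can be true. With a wider length window or more branching the number of closures grows quickly.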



    Step 7: Editing the Assembly

    To remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other


    Experiments

    • Simulated Data

      10 reference genomes from bacteria and fungi, and one 10-Mb segment of the human genome, with introduced errors

    • Real Data

      Solexa


    Results

    Simulated Data

    • Highly complete and contiguous assemblies (proportion of genome covered > 96%)

    • Ambiguous regions in the assembly: < 20 per megabase

    • Assemblies of C. jejuni and E. coli have no errors. Very high accuracy overall: less than one error per 10^6 bases

    Real Data

    • High coverage (99.1%)

    • High continuity

    • High accuracy (Final assembly matches the reference sequence exactly, with only 12 exceptions)


    + / -

    + Read Localization

    + Multi-CPU compatible

    + Extremely good (accurate) results

    - Slow

    - Very memory intensive

    - Impractical assumptions on input data

    (500 bp +/- 5 bp insert size)