Loading in 2 Seconds...

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Loading in 2 Seconds...

- 167 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads' - therese

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe

Presented by: Mohit Jain

Overview

Algorithm

Results

Motivation(P) DNA Sequencing:

- Chr length: ~1000 - ~250,000,000 bps
- Longest sequence-able fragment: ~600 bps

(S) Shotgun Method:determine sequence by breaking genome into many small segments (reads)

(P) Sequence/Genome Assembly:combining these reads to reconstruct the genome

Slide: 1/

Motivation

(S) Original Genome:

Shortest Common Superstring (SCS) Problem = The shortest sequence that contains every read as a substring

(P) Repeats

Genomes have repeats, and SCS represents repeats only once

(S) Overlap Graph

(S)Overlap Graph

Each read forms a Node

Edge exists between two nodes if the reads overlap

Algorithm:

Step 1: Removing redundant edges, classify edges as required/optional

Step 2: Find the shortest walk which includes all required edges

Red/Thin: False Overlaps

(P)Overlap Graph

Microreads =only 25-50 bases long (for HTS)

shorter reads = shorter overlap => more reads => more overlaps

- Very large number of (mostly false) overlaps

- Large number of reads + short overlap + higher error rate

(S)De Bruijn Graph

- To construct de Bruijn graph:

all reads are broken in to overlapping subsequences of length k (k-mer)

- Each k-1 subsequence represents a Node
- A directed Edge e exists between two nodes a and b iff there exists a k-mer such that its prefix = a and its suffix = b

(S)De Bruijn Graph

- Condensed by collapsing non-ambiguous paths
- Genome: An Eulerian path (Superwalk: walk including all edges) in this graph

Paired Reads (Mate pairs)

- Sequence two ends of a fragment of known size
- Results better assemblies, but more complicated

ALLPATHS

Step I. Builds Unipath Graph

Step II. Localizes reads sequences before assembly

Unipath: maximal unbranched sequence

Short Fragment Pair Merger

- Fills the gap in between two paired reads
- Builds a local unipath graph
- Extend both ends (of all reads) based on the local unipath graph
- For each pair, search for other pairs which overlap on both ends, and merge to obtain longer reads

Short Fragment Pair Merger

- Repeat the process for all pairs.
- Once sequence is complete, update the local unipath graph
- Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome

ALLPATHS Paired-Read Assembly Algorithm

Step 1: Creating Approximate Unipaths

1a: Error correction

1b: k-mer numbering and searchable data structure (Ignoring any overlaps between reads)

1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered

Read pairs Unipaths Localize

Seeds = Unipaths around which assemblies of genomic regions are build

Ideal seed: Long Unipaths with Low Copy Number (=1)

Copy Number = Inferred from read coverage of the unipaths

2a: For each unipath, compute the closest unipaths in the set that are to the left and to the right of the given unipath

2b: If the distance between left and right neighbours is less than 4 kb, then the middle unipath is removed

2c: After all such unipaths are removed, remaining forms the seeds unipaths

Step 3: Assembling neighbourhoods around the seeds

Neighbourhood = Seed + 10 kb on each side

3a: Define a collection of low-copy number unipaths, using iterative linking

3b: Construct two sets of read clouds:

primary(B): only reads, whose true genomic locations are near the seed

secondary(C): contains all the short-fragment read pairs (~0.5 kb) near the seed

partners

Problem of too-many closures persists, hence use Short-Fragment Pair Merger (progressively merge the secondary read cloud pairs)

unipaths

paired-read links

C

compute the closures (include false closures) of all the merged short-fragment pairs

Step 5: Gluing Together the Local Assembly

sequence graph is formed by iteratively joining closures

Step 6: Building the Global Assembly

outputs of local assemblies are glued together to yield a single sequence graph

To remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other

Experiments

- Simulated Data

10 reference genomes from bacteria and fungi, and 1 10-Mb segment of human genome; with introduced errors

- Real Data

Solexa

Results

Simulated Data

- Highly complete and contiguousassemblies (Proportion of genome covered > 96%)
- Assembly ambiguities regions <20 per megabase
- Assemblies of C.jejuni and E.coli have no errors. Very high accuracy, less than one error per 106 bases

Real Data

- High coverage (99.1%)
- High continuity
- High accuracy (Final assembly matches the reference sequence exactly, with only 12 exceptions)

+ / -

+ Read Localization

+ Multi-CPU compatible

+ Extremely good (accurate) results

- Slow

- Very memory intensive

- Impractical assumptions on input data

(500bp +/- 5bp insert size)

Download Presentation

Connecting to Server..