Fragment assembly

Fragment Assembly - PowerPoint PPT Presentation

PowerPoint Slideshow about 'Fragment Assembly' - cheche

Presentation Transcript

Introduction

  • Fragments are typically 200-700 bp long

  • The “target” string is about 30k-100k bp long

  • Problem: given a set of fragments, reconstruct the target

Introduction

  • The solution is a multiple alignment of the fragments in which spaces at the ends are not penalized

  • The alignment is called the “layout”

  • The output is called the “consensus sequence”

  • This is an optimization problem

Complications

  • Base-call errors:

  • Substitution errors [p 107]

  • Insertion errors (possibly from the host sequence) [p 108, fig 4.3]

  • Deletion errors [fig 4.4]

  • Majority voting over each column of the layout (or some other form of optimization) resolves them
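The majority-voting idea can be sketched as follows. This is a minimal illustration, not the method of the referenced text: it assumes the layout is already given as equal-length rows, with '-' for a gap inside a fragment and ' ' for columns a fragment does not cover. The function name `consensus` and the sample layout are hypothetical.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of a layout given as equal-length rows.

    '-' is a gap inside a fragment; ' ' marks a column the fragment
    does not cover and therefore does not vote on.
    """
    result = []
    for column in zip(*layout):
        votes = Counter(c for c in column if c != " ")
        if not votes:
            continue  # no fragment covers this column
        winner, _ = votes.most_common(1)[0]
        if winner != "-":  # a winning gap means the column is dropped
            result.append(winner)
    return "".join(result)

# Hypothetical layout: the middle fragment has a substitution (A for C).
layout = [
    "ACCGT  ",
    " CAGTG ",
    "  CGTGC",
]
print(consensus(layout))
```

Ties are broken arbitrarily here; a real assembler would weigh base-call quality as well.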

Complications

  • Chimeras:

  • Two non-contiguous fragments get joined into a single fragment [p 109, fig 4.5]

  • They need to be weeded out in a preprocessing step

  • Similarly, contaminant fragments (possibly from the host) need to be filtered out as well

Complications

  • Unknown orientation:

  • Fragments may come from either strand

  • If a fragment comes from the opposite strand, its reverse complement must appear in the target string

  • Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments)

  • [p 109, fig 4.6]
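As a small illustration of the orientation issue, the sketch below computes the reverse complement of a fragment and lists the two orientations an assembler has to consider. The helper names `reverse_complement` and `orientations` are assumptions made for this example.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def orientations(fragment):
    """Both forms of a fragment that may appear in the target strand."""
    return (fragment, reverse_complement(fragment))

print(orientations("ACCGT"))
```

With n fragments, choosing one orientation per fragment independently gives the 2^n combinations mentioned above.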

Complications

  • Repeats:

  • Regions (each a superstring of some fragments) may repeat within a target

  • Problem 1: under approximate alignment, which copy of the repeat does a fragment really come from? [p 110, fig 4.7]

  • Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9]

  • Inverted repeats: repeat of the reverse complement [fig 4.10]

Complications

  • Insufficient coverage:

  • The chance of full coverage increases with redundancy (a heuristic: sample fragments totaling 8 times the target length)

  • A gap that remains uncovered after many fragments have been aligned is unlikely to be covered by further random fragments: random sampling is not a good solution here

Complications

  • Insufficient coverage:

  • What you get with insufficient coverage is multiple “contigs,” not one contig

  • A “t-contig” is a contig in which adjacent fragments are required to overlap by at least t bases

  • Expected number of contigs: [p 112, formula 4.1]

  • Lower t means fewer contigs (more aligned segments), but a weaker consensus

Reconstruction

  • The shortest common superstring is not always the best solution

  • Fig 4.12 vs Fig 4.13 (p115/116)

Reconstruction

  • A superstring is to be reconstructed out of the fragments

  • This is an alignment problem with no end penalty

  • d_s is the edit distance without end penalty: d_s(f, S) is the minimum of the edit distance d(f, s) over all substrings s of S

  • Fig 4.14 (p117) for best aligned subsequence-matching

  • Note: a matched character is charged 0, a mismatch 1, and a gap 2, because this is a “distance” rather than a “similarity”

  • From here on we write d for d_s

Reconstruction

  • f is an approximate substring of S at error level e if

    d(f, S) <= e|f|,

    e = 0 means no error is allowed

    e > 0 allows insert/delete/substitution errors (up to e|f| in total cost)

  • Either f or its reverse complement f- should match
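The substring distance and the error-level test can be sketched with a standard dynamic program in which the first row is free, so that f may start anywhere in S, and the minimum is taken over the last row, so that it may end anywhere. The costs (match 0, mismatch 1, gap 2) follow the slide above; the function names are hypothetical.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def substring_distance(f, s, mismatch=1, gap=2):
    """Minimum edit distance between f and any substring of s."""
    # Row 0: the empty prefix of f matches at any position of s for free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [prev[0] + gap]  # f[:i] against nothing: i gaps
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            cur.append(min(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # f may end at any position of s, so take the best cell of the last row.
    return min(prev)

def is_approximate_substring(f, s, e):
    """True if f or its reverse complement fits in s within e * |f|."""
    best = min(substring_distance(f, s),
               substring_distance(reverse_complement(f), s))
    return best <= e * len(f)

print(substring_distance("GATC", "AAGTTCTT"))
print(is_approximate_substring("AAC", "AAGTTCTT", 0.0))
```

The table has |f| x |S| cells, so this is only practical for short strings; it is meant to make the definition concrete, not to be an efficient matcher.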

Reconstruction: Problem

  • Input: Set F of fragments, error level e

  • Output: Shortest possible string S s.t. for all f in F

    Min(d(f, S), d(f-, S)) <= e|f|

Reconstruction: Multicontig

  • How much overlap do we require between strings?

  • Ideally, every column of the layout L, from column 1 through |L|, should contain the same character in all of its rows

  • Fig 4.4 (p 118): t-contig for t=3, 2, 1

  • Balance between t and number of t-contigs

Reconstruction: Multicontig

  • S is an e-consensus sequence for a contig, for 0 <= e <= 1, if the edit distance satisfies d(f, S) <= e|f| for every fragment f in the contig

  • Multicontig problem:

  • Input: set F, integer t >= 0, 0 <= e <= 1

  • Output: A partition of F into the minimum number of subsets Ci, such that each Ci is a t-contig with an e-consensus

Reconstruction: Overlap Multi-graph

  • Nodes are the fragments

  • A directed arc from a to b is labeled with the overlap length t: the t-suffix of a equals the t-prefix of b

  • Arcs between all pairs of nodes, but no self-loop

  • Fig 4.15 (p 121): example

  • Length of the superstring spelled by a path = total length of all fragments involved minus total wt (sum of overlaps) along the path

  • A max weight Hamiltonian path is what we are looking for in this graph: it gives the superstring with maximum total overlap
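A minimal sketch of these definitions, assuming exact overlaps and keeping only the longest overlap for each ordered pair (the multigraph proper has one arc per overlap length). The names `overlap`, `overlap_graph` and `path_superstring`, and the sample fragments, are chosen for illustration only.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def overlap_graph(fragments):
    """Map each ordered pair (i, j), i != j, to its largest overlap."""
    n = len(fragments)
    return {(i, j): overlap(fragments[i], fragments[j])
            for i in range(n) for j in range(n) if i != j}

def path_superstring(fragments, order):
    """Spell the superstring of a path and return it with the path weight."""
    s = fragments[order[0]]
    weight = 0
    for u, v in zip(order, order[1:]):
        t = overlap(fragments[u], fragments[v])
        s += fragments[v][t:]  # append only the non-overlapping tail
        weight += t
    return s, weight

fragments = ["TACGA", "ACCC", "CTAAAG", "GACA"]
s, w = path_superstring(fragments, [0, 3, 1, 2])
# Superstring length equals total fragment length minus path weight.
print(s, w, len(s) == sum(map(len, fragments)) - w)
```

Because every unit of overlap removes one character from the result, maximizing path weight and minimizing superstring length are the same objective.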

Reconstruction

  • Fragments that are substrings of other fragments in the set are noise: remove them

  • Draw the OMG of the substring-free set of fragments

  • A shortest common superstring always corresponds to a Hamiltonian path in this graph

Reconstruction: OMG

  • Thm 4.1 (p 123): if F is substring-free, then for every common superstring S of F there is a Ham. Path P s.t. S(P) is contained in S (as a subsequence)

  • The fragments are strictly ordered over S: order of left end points = order of right end points (otherwise one fragment would be a substring of another)

  • The path P follows the same order of fragments (as in S) in the OMG

  • S may contain extra garbage material, so S(P) is within S

Reconstruction: OMG

  • If S is a shortest common superstring, then S cannot be longer than S(P), so S = S(P)

  • In other words, a shortest common superstring of the fragment set F corresponds to a maximum-weight Ham. Path in the OMG of the substring-free collection F’

Reconstruction: OMG

  • Think of an algorithm for weeding out substrings from F

  • Also, weed out multi-edges by keeping the largest wt edge between any pair of nodes

  • If the wt on an edge is below a threshold t, then the wt should be treated as 0
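One simple answer to the weeding-out exercise is sketched below, under the assumption that a quadratic scan is acceptable: process fragments from longest to shortest and keep a fragment only if it is not a substring of one already kept. The names `remove_substrings` and `apply_threshold` are hypothetical.

```python
def remove_substrings(fragments):
    """Return the fragments that are not substrings of another fragment."""
    kept = []
    # Longest first: a fragment can only be contained in a longer one,
    # and set() has already removed exact duplicates.
    for f in sorted(set(fragments), key=len, reverse=True):
        if not any(f in g for g in kept):
            kept.append(f)
    return kept

def apply_threshold(weight, t):
    """Treat an overlap shorter than the threshold t as no overlap."""
    return weight if weight >= t else 0

print(sorted(remove_substrings(["ACCGT", "CCG", "GTTA", "ACCGT", "TTA"])))
```

Keeping the largest weight per ordered pair then turns the multigraph into an ordinary weighted digraph. For large fragment sets, a suffix-based index would be a more realistic choice than pairwise `in` tests.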

Reconstruction: OMG

  • Greedy Algorithm to draw Ham. Path (p 125)

  • Collects edges from largest to smallest wt, while

    (1) preventing cycles (union-find),

    (2) keeping the indegree of each node <= 1 (first node has 0)

    (3) keeping the outdegree of each node <= 1 (last node has 0)

    [Does not return Ham. Path. Can you modify to return Ham. Path?]

  • Alg is NOT optimal, example (p 126): greedy returns wt 3, optimal wt is 4
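The greedy selection can be sketched as follows, assuming exact overlaps and a plain union-find for cycle detection. Like the algorithm described above, it returns the chosen edges rather than an ordered path. A brute-force search over all orderings is included only to compare weights on a tiny input. All names and the three sample fragments are illustrative; the fragments were picked so that greedy reaches weight 3 while weight 4 is achievable.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_edges(fragments):
    """Pick edges by decreasing overlap, keeping degrees <= 1 and no cycle."""
    n = len(fragments)
    edges = sorted(((overlap(fragments[i], fragments[j]), i, j)
                    for i in range(n) for j in range(n) if i != j),
                   key=lambda e: -e[0])
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    has_in, has_out, chosen = [False] * n, [False] * n, []
    for w, i, j in edges:
        if has_out[i] or has_in[j] or find(i) == find(j):
            continue  # would break a degree bound or close a cycle
        chosen.append((i, j, w))
        has_out[i] = has_in[j] = True
        parent[find(i)] = find(j)
    return chosen

def best_path_weight(fragments):
    """Exhaustive maximum path weight; only feasible for a few fragments."""
    return max(sum(overlap(fragments[u], fragments[v])
                   for u, v in zip(p, p[1:]))
               for p in permutations(range(len(fragments))))

fragments = ["ATGC", "TGCAT", "GCC"]
print(sum(w for _, _, w in greedy_edges(fragments)), best_path_weight(fragments))
```

Greedy commits to the single largest overlap first, which here blocks the two smaller overlaps that together weigh more.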

Reconstruction: OMG

  • Subinterval: a fragment that can be embedded within another fragment in the set

  • A subinterval-free and repeat-free graph that is connected at level t has a Ham. Path that generates the target string

Reconstruction: OMG

  • If a repeat exists in the original string, then the graph will have a cycle

  • False positive: substrings from two different portions of the target have a t-overlap

  • If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p129): proof by contradiction, otherwise the subinterval-free fragments can be totally ordered

Reconstruction: OMG

  • If there are no repeats in a subinterval-free graph, then there exists a unique Ham. Path

  • If a cycle exists, it does not necessarily come from a repeat

Reconstruction: OMG

  • Example 4.6 (p 130): greedy alg finds wrong string, but the Ham. Path finds the correct one

  • Greedy does not care about linkage (optimizes on total overlap – finds shortest common superstring)

  • Ham path chooses any t-overlap connections – cares for linkage only

Parameters in aligning for fragment assembly

  • Score on a column: traditionally {0,-1,-2} in sum-of-pairs

  • Entropy:

    Sum over characters c (the bases and the space) of -p_c log p_c, where p_c is the probability of c in the column

  • All same character: p_c = 1, entropy = 0

  • For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5

  • Entropy measures the uniformity of a column alone, which makes it a better metric
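The entropy of a column can be computed directly from character frequencies. The sketch below uses the natural logarithm and estimates p_c as the fraction of rows in the column showing c; the function name `column_entropy` is an assumption.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy of one layout column: sum over c of -p_c * log(p_c)."""
    n = len(column)
    return sum(-(k / n) * math.log(k / n)
               for k in Counter(column).values())

print(column_entropy("aaaaa"))  # uniform column
print(column_entropy("atcg-"))  # all five symbols different
print(math.log(5))              # value quoted above for comparison
```

Summing this value over all columns gives one possible score for a whole layout.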

Parameters in aligning for fragment assembly

  • Coverage: by how many fragments is each column “covered”? (Average, min, max)

  • This is different from the concept of t-overlap

  • If a column (of the target) is covered by 0 fragments, then the layout is disconnected

  • Conflicts with the requirement of a subinterval-free collection if we expect coverage > 1 for all columns
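Coverage per column is easy to compute once a layout is available. This sketch assumes equal-length rows in which ' ' marks columns a fragment does not reach; the function names and sample layouts are hypothetical.

```python
def column_coverage(layout):
    """Number of fragments covering each column of the layout."""
    return [sum(c != " " for c in column) for column in zip(*layout)]

def is_connected(layout):
    """A column with coverage 0 splits the layout into separate contigs."""
    return min(column_coverage(layout)) > 0

layout = ["ACCGT  ", " CAGTG ", "  CGTGC"]
cov = column_coverage(layout)
print(cov, min(cov), max(cov), sum(cov) / len(cov))
print(is_connected(["ACC    ", "    TGC"]))
```

Positive coverage in every column is necessary for a single contig but says nothing about how strongly adjacent fragments are linked.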

Parameters in aligning for fragment assembly

  • Coverage is not enough; we also need good linkage (example: p 133)

  • The Ham. Path algorithm takes care of linkage

Steps in assembly:

  • Step 1: Overlap finding

  • Approximate: delete, insert, replace allowed

  • by a semi-global DP algorithm

  • with appropriate end-gap penalty,

  • pairwise over all fragments and their reverse complements

Steps in assembly:

  • Step 2: Construct the overlap graph over (F union F-bar) for the fragment set F

  • (-- after eliminating substrings?)

  • Construct a Hamiltonian path in this graph

  • Cycles and unbalanced coverage may indicate repeats

Steps in assembly:

  • Step 3: Fine-tune the multiple alignment to get a consensus target

  • Manual or algorithmic

  • Examples on p 137-138