Fragment Assembly

  • Fragments are typically 200-700 bp long
  • The “target” string is about 30k-100k bp long
  • Problem: given a set of fragments, reconstruct the target
  • The solution is a multiple alignment of the fragments, ignoring spaces at the ends
  • The alignment is called the “layout”
  • The output is called the “consensus sequence”
  • An optimization problem
  • Base-call errors:
  • Substitution errors [p 107]
  • Insertion errors (possibly from the host sequence) [p 108, fig 4.3]
  • Deletion errors [fig 4.4]
  • Majority voting solves them (or some form of optimization)
  • Chimeras:
  • Two non-contiguous fragments get joined as a single fragment [p 109, fig 4.5]
  • Need to be weeded out as a preprocessing step
  • Similar to chimeras, contaminant fragments (possibly from the host) need to be filtered out as well
  • Unknown orientation:
  • Fragments may come from either strand
  • If a fragment comes from the opposite strand, its reverse complement must be in the target string
  • Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments)
  • [p 109, fig 4.6]
  • Repeats:
  • Regions (superstrings of some fragments) may repeat in the target
  • Problem 1: under approximate alignment, which copy of the repeat does a fragment really come from? [p 110, fig 4.7]
  • Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9]
  • Inverted repeats: repeat of the reverse complement [fig 4.10]
  • Insufficient coverage:
  • The chance of full coverage increases with redundancy (a heuristic: sample fragments totaling 8 times the target length)
  • The chance of covering a gap shrinks when it remains uncovered even after many fragments are aligned: random sampling is not a good solution here
  • What you get with insufficient coverage is multiple “contigs,” not one contig
  • A “t-contig” is a contig in which we expect an overlap of at least t bases between adjacent fragments
  • Expected number of contigs: [p 112, formula 4.1]
  • A lower t means fewer contigs (more fragments joined), but a weaker consensus
  • Shortest common superstrings are not the best solution
  • Fig 4.12 vs Fig 4.13 (p115/116)
  • A superstring is to be reconstructed out of the fragments
  • An alignment problem with no end penalty
  • d_s is the edit distance without end penalty: the minimum of the edit distance d between the fragment and any substring of the superstring
  • Fig 4.14 (p 117): best-aligned substring matching
  • Note: a matched character is charged 0, a mismatch 1, and a gap 2, because this is a “distance” rather than a “similarity”
  • From here on we write d for d_s
  • f is an approximate substring of S at error level e if

d(f, S) ≤ e|f|

e = 0 means no error is allowed

e > 0 allows insertion/deletion/substitution errors

  • Both f and its reverse complement f-bar should be tried for a match
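The end-penalty-free distance and the error-level test can be sketched as follows. This is a minimal illustration that assumes the costs from the note above (match 0, mismatch 1, gap 2); the function names `substring_distance` and `is_approximate_substring` are made up for this sketch and do not come from the slides.

```python
def substring_distance(f: str, s: str, mismatch: int = 1, gap: int = 2) -> int:
    """Minimum edit distance between f and any substring of s.

    Characters of s before and after the matched region are free
    (no end penalty); every character of f must be accounted for.
    """
    # Row 0: the empty prefix of f can start anywhere in s at no cost.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        # Column 0: f[:i] aligned against nothing costs i gaps.
        cur = [i * gap] + [0] * len(s)
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            cur[j] = min(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # f may end anywhere in s at no cost.
    return min(prev)


def is_approximate_substring(f: str, s: str, e: float) -> bool:
    """True if d(f, S) <= e * |f| for error level e in [0, 1]."""
    return substring_distance(f, s) <= e * len(f)
```

With `e = 0` only exact substrings pass the test; raising `e` tolerates proportionally more mismatches and gaps per fragment.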
Reconstruction: Problem
  • Input: a set F of substrings and an error level e
  • Output: the shortest possible string S such that, for all f in F,

min(d(f, S), d(f-bar, S)) ≤ e|f|

Reconstruction: Multicontig
  • How much overlap do we require between strings?
  • Ideally, each column in the layout L should hold the same character, for all columns 1 through |L|
  • Fig 4.4 (p 118): t-contig for t=3, 2, 1
  • There is a trade-off between t and the number of t-contigs
  • S is an e-consensus sequence (multicontig) for 0 ≤ e ≤ 1 if the edit distance satisfies d(f, S) ≤ e|f| for every fragment f
  • Multicontig problem:
  • Input: set F, integer t ≥ 0, error level 0 ≤ e ≤ 1
  • Output: a minimum partition of F in which each part C_i is a t-contig with an e-consensus
Reconstruction: Overlap Multi-graph
  • Nodes are the fragments
  • Directed arcs are labeled with the overlap length t between nodes: the t-suffix of the source equals the t-prefix of the destination
  • Arcs exist between all pairs of nodes, but there are no self-loops
  • Fig 4.15 (p 121): example
  • Length of the created superstring = total length of all fragments involved − total weight (overlap) along the path
  • A maximum-weight Hamiltonian path is what we are looking for in this graph: it gives the maximally overlapped superstring
  • Fragments that are substrings of other fragments in the set are noise: remove them
  • Draw the OMG of the substring-free set of fragments
  • A shortest common superstring always corresponds to a Hamiltonian path in this graph
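A small sketch of the overlap multigraph with exact suffix-prefix overlaps may help here. It assumes a substring-free fragment list and stores arcs as `(source, destination, overlap)` triples, including a weight-0 arc for every ordered pair; the function names and the three example fragments are illustrative only.

```python
from itertools import permutations


def exact_overlaps(a: str, b: str) -> list[int]:
    """All t >= 1 such that the t-suffix of a equals the t-prefix of b.

    Only proper overlaps are considered, which is enough when no
    fragment is a substring of another.
    """
    return [t for t in range(1, min(len(a), len(b))) if a[-t:] == b[:t]]


def overlap_multigraph(fragments: list[str]) -> list[tuple[int, int, int]]:
    """Arcs (i, j, t) for every ordered pair i != j, with no self-loops."""
    arcs = []
    for i, j in permutations(range(len(fragments)), 2):
        for t in [0] + exact_overlaps(fragments[i], fragments[j]):
            arcs.append((i, j, t))
    return arcs


def path_superstring(fragments: list[str], path: list[int], weights: list[int]) -> str:
    """Spell S(P); weights[k] is the overlap used between path[k] and path[k+1]."""
    s = fragments[path[0]]
    for k, t in enumerate(weights):
        s += fragments[path[k + 1]][t:]
    return s


fragments = ["ACGT", "GTTA", "TAC"]
arcs = overlap_multigraph(fragments)
superstring = path_superstring(fragments, [0, 1, 2], [2, 2])
```

For this path the superstring has length 7, which is the total fragment length (11) minus the total overlap weight (4).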
Reconstruction: OMG
  • Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Hamiltonian path P such that S(P) is contained in S
  • Fragments are strictly ordered along S: the order of left endpoints equals the order of right endpoints (otherwise one fragment would be a substring of another)
  • The path follows the same order of fragments (as in S) in the OMG
  • S may contain extra garbage material, so S(P) lies within S
  • If S is a shortest common superstring, then it contains no extra material, so S = S(P)
  • In other words, a shortest common superstring of the fragment set F corresponds to a Hamiltonian path in the OMG of the substring-free collection F’
  • Think of an algorithm for weeding out substrings from F
  • Also weed out multi-edges by keeping only the largest-weight edge between any pair of nodes
  • If the weight of an edge is below a threshold t, treat it as 0
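One possible answer to the clean-up steps above, as a rough sketch: a quadratic substring filter, followed by collapsing the multigraph into a simple weighted graph with a threshold. The arc format `(i, j, weight)` and both function names are assumptions made for this illustration.

```python
def remove_substrings(fragments: list[str]) -> list[str]:
    """Drop exact duplicates and any fragment contained in another one."""
    unique = list(dict.fromkeys(fragments))  # keeps first occurrence, preserves order
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def simplify_arcs(arcs: list[tuple[int, int, int]], t: int) -> dict[tuple[int, int], int]:
    """Keep the heaviest arc per ordered node pair; weights below t count as 0."""
    best: dict[tuple[int, int], int] = {}
    for i, j, w in arcs:
        if w < t:
            w = 0
        best[(i, j)] = max(w, best.get((i, j), 0))
    return best
```

The substring filter compares every pair of fragments, which is fine for a classroom example; a large fragment set would call for an index structure instead.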
  • Greedy algorithm to build a Hamiltonian path (p 125)
  • Collects edges from largest to smallest weight, subject to:

(1) no cycle is created (union-find),

(2) the indegree of each node stays ≤ 1 (the first node has 0),

(3) the outdegree of each node stays ≤ 1 (the last node has 0)

[It does not return a Hamiltonian path. Can you modify it to return one?]

  • The algorithm is NOT optimal; example (p 126): it returns weight 3, while the optimal weight is 4
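The greedy procedure can be sketched as below, under the assumption that the graph is complete (a weight, possibly 0, for every ordered pair), so that the collected edges chain into a single path. Walking the successor links afterwards is one possible way to return the path itself. The 3-node instance is a hypothetical example built for this sketch, not the one from p 126.

```python
from itertools import permutations


def greedy_hamiltonian_path(n: int, weight: dict[tuple[int, int], int]):
    """Pick arcs from heaviest to lightest; return (path, total weight)."""
    parent = list(range(n))  # union-find forest for cycle detection

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    succ: dict[int, int] = {}  # chosen outgoing arc of each node
    has_in = [False] * n       # whether a node already has an incoming arc
    total = 0
    for (i, j), w in sorted(weight.items(), key=lambda kv: kv[1], reverse=True):
        if i in succ or has_in[j] or find(i) == find(j):
            continue  # would break outdegree, indegree, or acyclicity
        parent[find(i)] = find(j)
        succ[i] = j
        has_in[j] = True
        total += w
    # Recover the path: start at a node without an incoming arc and follow succ.
    start = next(v for v in range(n) if not has_in[v])
    path = [start]
    while path[-1] in succ:
        path.append(succ[path[-1]])
    return path, total


def best_path_weight(n: int, weight: dict[tuple[int, int], int]) -> int:
    """Brute-force optimum, only usable for tiny n."""
    return max(
        sum(weight.get((p[k], p[k + 1]), 0) for k in range(n - 1))
        for p in permutations(range(n))
    )


# Hypothetical instance: the heaviest arc 0->1 blocks both weight-2 arcs.
w = {(0, 1): 3, (0, 2): 2, (2, 1): 2, (1, 0): 0, (1, 2): 0, (2, 0): 0}
greedy_path, greedy_weight = greedy_hamiltonian_path(3, w)
optimal_weight = best_path_weight(3, w)
```

On this instance greedy commits to the arc of weight 3 and ends with total weight 3, while the path 0, 2, 1 reaches weight 4.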
  • Subintervals: fragments that can be embedded within another fragment in the set
  • A subinterval-free and repeat-free graph that is connected at level t has a Hamiltonian path that generates the target string
  • If a repeat exists in the original string, then the graph will have a cycle
  • False positive: fragments from two different portions of the target have a t-overlap
  • If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p 129): proof by contradiction, since otherwise the subinterval-free fragments could be totally ordered
  • If there are no repeats in a subinterval-free graph, then there exists a unique Hamiltonian path
  • If a cycle exists, it may not come from a repeat
  • Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Hamiltonian path finds the correct one
  • Greedy does not care about linkage (it optimizes total overlap, aiming at the shortest common superstring)
  • The Hamiltonian path accepts any t-overlap connection: it cares about linkage only
Parameters in aligning for fragment assembly
  • Score on a column: traditionally {0, -1, -2} in sum-of-pairs
  • Entropy:

Entropy = −Σ_c p_c log p_c, summed over every alphabet character and the space c, where p_c is the probability of c in the column

  • All same character: p_c = 1, entropy = 0
  • For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5
  • Entropy measures uniformity alone, which makes it a better metric
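The column entropy can be computed directly from character counts. This short sketch takes p_c to be the relative frequency of c in the column and uses the natural logarithm; both choices are assumptions, since the slides do not fix them.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Entropy of one layout column; '-' counts as a character like any other."""
    n = len(column)
    return sum(-(k / n) * math.log(k / n) for k in Counter(column).values())


uniform = column_entropy("aaaaa")  # every fragment agrees
mixed = column_entropy("atcg-")    # five different symbols
```

The two calls reproduce the boundary cases listed above: entropy 0 for a unanimous column and log 5 for five distinct symbols.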
  • Coverage: by how many fragments is each column “covered”? (average, min, max)
  • This is different from the concept of t-overlap
  • If a column (of the target) is covered by 0 fragments, then the layout is disconnected
  • This conflicts with the requirement of a subinterval-free collection if we expect coverage > 1 for all columns
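Per-column coverage is easy to tabulate once a layout is known. The sketch below assumes a simplified layout given as `(start, length)` pairs over the target columns, ignoring gaps inside fragments; the representation and the function names are illustrative.

```python
def column_coverage(layout: list[tuple[int, int]], target_length: int) -> list[int]:
    """Number of fragments covering each target column."""
    cov = [0] * target_length
    for start, length in layout:
        for col in range(start, min(start + length, target_length)):
            cov[col] += 1
    return cov


def coverage_summary(cov: list[int]) -> dict:
    """Average, min and max coverage, plus whether every column is covered."""
    return {
        "average": sum(cov) / len(cov),
        "min": min(cov),
        "max": max(cov),
        "connected": min(cov) > 0,
    }


cov = column_coverage([(0, 4), (2, 4), (7, 3)], target_length=10)
summary = coverage_summary(cov)
```

In this example column 6 is covered by no fragment, so the layout splits into two contigs even though the average coverage is above 1.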
  • Coverage is not enough; we also need good linkage. Example: p 133
  • The Hamiltonian path algorithm takes care of that
Steps in assembly:
  • Step 1: Overlap finding
  • Approximate: deletions, insertions and replacements are allowed
  • by a semi-global DP algorithm
  • with an appropriate end-gap penalty,
  • pairwise over all fragments and their reverse complements
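A semi-global DP for one ordered pair of fragments might look like the sketch below. It scores similarity rather than distance (match +1, mismatch −1, gap −2, all assumed values) because with free end gaps a pure distance would always prefer the empty overlap. Skipping a prefix of `a` and a suffix of `b` is free; everything else is charged.

```python
def best_overlap(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, j), where j is the length of the prefix of b that
    takes part in the overlap.
    """
    # Row 0: no character of a used yet, so b[:j] can only face gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 is 0: any prefix of a may be skipped for free.
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Last row: all of a is consumed; the unused suffix of b is free.
    score = max(prev)
    return score, prev.index(score)
```

In an assembler this would be run for every ordered pair drawn from the fragments and their reverse complements, and pairs scoring below a chosen threshold would be treated as non-overlapping.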
  • Step 2: Construct the OMG over (F union F-bar) for the fragment set F
  • (after eliminating substrings?)
  • Construct a Hamiltonian path in this graph
  • Cycles and unbalanced coverage may indicate repeats
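Building the node set F union F-bar only requires a reverse-complement helper. The sketch below assumes fragments over the DNA alphabet A, C, G, T and tags each node with the index of its source fragment and an orientation sign; that tagging scheme is an illustrative choice, not something the slides prescribe.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(f: str) -> str:
    """Complement every base, then reverse the string."""
    return f.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(fragments: list[str]) -> list[tuple[int, str, str]]:
    """Nodes (index, orientation, sequence) for F union F-bar."""
    nodes = []
    for idx, f in enumerate(fragments):
        nodes.append((idx, "+", f))
        nodes.append((idx, "-", reverse_complement(f)))
    return nodes


nodes = with_reverse_complements(["AACG", "GGTA"])
```

Keeping the source index on each node makes it possible to check later that a path uses every fragment in exactly one orientation.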
  • Step 3: Fine-tune the multiple alignment to get a consensus target
  • Manual or algorithmic
  • Examples on p 137-138
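As a simple algorithmic baseline for Step 3, a consensus can be read off a layout by majority vote per column. The sketch assumes the layout is a list of `(offset, aligned_row)` pairs in which `-` marks a gap inside a fragment; columns where the gap wins are dropped, and uncovered columns are emitted as `N`. These conventions are assumptions made for the example.

```python
from collections import Counter


def consensus(layout: list[tuple[int, str]]) -> str:
    """Majority-vote consensus over the columns of a layout."""
    width = max(offset + len(row) for offset, row in layout)
    out = []
    for col in range(width):
        votes = Counter(
            row[col - offset]
            for offset, row in layout
            if offset <= col < offset + len(row)
        )
        if not votes:
            out.append("N")  # no fragment covers this column
            continue
        symbol, _ = votes.most_common(1)[0]
        if symbol != "-":    # a winning gap means the column is an insertion error
            out.append(symbol)
    return "".join(out)


with_substitution = consensus([(0, "ACGTTA"), (2, "GATAC"), (3, "TTAC")])
with_insertion = consensus([(0, "AC-GT"), (0, "AC-GT"), (0, "ACTGT")])
```

In the first layout the lone `A` in column 3 is outvoted by two `T`s; in the second, the extra `T` in one fragment is outvoted by two gaps and disappears from the consensus.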