
Fragment Assembly

- Fragments are typically 200-700 bp long
- “Target” string is about 30k-100k bp long
- Problem: given a set of fragments, reconstruct the target

- Multiple alignment of the fragments, ignoring spaces at the ends
- The alignment is called “layout”
- The output is called the “consensus sequence”
- An optimization problem

- Base-call errors:
- Substitution errors [p 107]
- Insertion errors (possibly from the host sequence) [p 108, fig 4.3]
- Deletion errors [fig 4.4]
- Majority voting solves them (or some form of optimization)
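
Majority voting over a layout can be sketched as follows. This is a minimal illustration, not the book's procedure: the function name `majority_consensus`, the use of `' '` for positions outside a fragment, and the example layout are all assumptions made here.

```python
from collections import Counter

def majority_consensus(layout):
    """Column-wise majority vote over a layout given as a list of rows.

    ' ' marks columns outside a fragment and is ignored;
    '-' is a gap inside a fragment and takes part in the vote.
    """
    width = max(len(row) for row in layout)
    consensus = []
    for col in range(width):
        votes = Counter(
            row[col] for row in layout if col < len(row) and row[col] != " "
        )
        if not votes:
            continue  # column covered by no fragment
        winner = votes.most_common(1)[0][0]
        if winner != "-":  # a winning gap marks the column as an insertion error
            consensus.append(winner)
    return "".join(consensus)

# Hypothetical layout: the second fragment carries a substitution error (A for T).
layout = [
    "ACGTAC  ",
    " CGAACGG",
    "  GTACGG",
]
print(majority_consensus(layout))  # ACGTACGG
```

The vote only works where a column is covered by enough fragments; with a single covering fragment an error passes straight into the consensus.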

- Chimeras:
- Two non-contiguous fragments get joined as a single fragment [p 109, fig 4.5]
- Need to be weeded out as a preprocessing step
- Similar to chimeras, contaminant fragments (possibly from the host) need to be filtered out as well

- Unknown orientation:
- Fragments may come from either strand
- If a fragment comes from the opposite strand, its reverse complement must be in the target string
- Consequence: try both the forward and the reverse-complement orientation of each fragment (2^n combinations in the worst case, for n fragments)
- [p 109, fig 4.6]
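
A small sketch of the orientation problem, assuming a plain DNA alphabet. The helper names `reverse_complement` and `orientation_choices` are illustrative only.

```python
from itertools import product

# Translation table for complementing bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(_COMPLEMENT)[::-1]

def orientation_choices(fragments):
    """Enumerate every way to orient the fragments: 2^n tuples for n fragments."""
    pairs = [(f, reverse_complement(f)) for f in fragments]
    return list(product(*pairs))

print(reverse_complement("AACG"))                             # CGTT
print(len(orientation_choices(["AACG", "TTGA", "CCAT"])))     # 8
```

Enumerating all 2^n orientations is only feasible for tiny inputs; it is shown here to make the worst-case count concrete.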

- Repeats:
- Regions (super-string of some fragments) may repeat in a target
- Consequent problem: where do the fragments really come from, on approximate alignment? [p 110, fig 4.7]
- Problem 2: where should the inter-repeat fragments go? [p111, fig 4.8, fig 4.9]
- Inverted repeats: repeat of the reverse complement [fig 4.10]

- Insufficient coverage:
- Chance of coverage increases with redundancy (a heuristic: sample fragments totalling about 8 times the target length)
- A gap that remains uncovered even after many fragments are aligned is unlikely to be covered by further sampling: random sampling is not a good solution here

- Insufficient coverage:
- With insufficient coverage you get multiple “contigs,” not one contig
- A “t-contig” is a contig in which adjacent fragments are required to overlap by at least t characters
- Expected number of contigs: [p 112, formula 4.1]
- A lower t means fewer contigs (more fragments joined), but a weaker consensus
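
Formula 4.1 is not reproduced in these notes. As a stand-in, the sketch below uses the standard Lander-Waterman estimate for the expected number of contigs, n * exp(-n(l - t)/T), for n fragments of length l, target length T, and required overlap t. Treat the exact form, the function name, and the sample numbers as assumptions rather than the book's formula.

```python
import math

def expected_contigs(n, frag_len, target_len, t):
    """Lander-Waterman style estimate of the expected number of contigs."""
    return n * math.exp(-n * (frag_len - t) / target_len)

# Hypothetical project: 500 fragments of 400 bp over a 50 kbp target.
for t in (20, 40, 100):
    print(t, round(expected_contigs(500, 400, 50_000, t), 2))
```

Under this estimate a smaller t gives a smaller expected number of contigs, which matches the trade-off stated above.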

- Shortest common superstrings are not the best solution
- Fig 4.12 vs Fig 4.13 (p115/116)

- Superstring to be reconstructed out of fragments
- An alignment problem with no end penalty
- d_s is the edit distance without end penalties: d_s(f, S) is the minimum of the edit distance d(f, s) over all substrings s of S
- Fig 4.14 (p117) for best aligned subsequence-matching
- Note: a matched character is charged 0, a mismatch 1, and a gap 2, since this is a “distance” rather than a “similarity”
- We will use d for d_s
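
The distance without end penalty can be computed with a standard dynamic program in which starting and ending anywhere in S is free. The sketch below uses the costs stated above (match 0, mismatch 1, gap 2); the function name `substring_distance` is an assumption made for illustration.

```python
def substring_distance(f, s, mismatch=1, gap=2):
    """Minimum edit distance between f and any substring of s.

    Row 0 is all zeros, so an alignment may start at any position of s;
    taking the minimum of the last row lets it end at any position of s.
    """
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [prev[0] + gap]  # f[:i] aligned against nothing: i gaps
        for j in range(1, len(s) + 1):
            sub = 0 if f[i - 1] == s[j - 1] else mismatch
            cur.append(min(prev[j - 1] + sub,   # match or mismatch
                           prev[j] + gap,       # character of f against a gap
                           cur[j - 1] + gap))   # character of s against a gap
        prev = cur
    return min(prev)

print(substring_distance("CGT", "AACGTAA"))  # 0: exact substring
print(substring_distance("CGT", "AACTTAA"))  # 1: one mismatch
print(substring_distance("ACGT", "ACT"))     # 2: one gap
```

Only two rows of the table are kept, so memory is linear in |S| while time stays proportional to |f| * |S|.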

- f is an approximate substring of S at error level e if
d(f, S) <= e|f|

e = 0 means no errors are allowed

e > 0 allows insertion/deletion/substitution errors, up to a total cost of e|f|

- A fragment may come from either strand, so both f and its reverse complement f- are tried against S

- Input: Set F of substrings, error level e
- Output: Shortest possible string S s.t. for all f in F,
min(d(f, S), d(f-, S)) <= e|f|

- How much overlap do we require between strings?
- Ideally, each column in the layout L should have same character, for all columns 1 through |L|
- Fig 4.4 (p 118): t-contig for t=3, 2, 1
- Balance between t and number of t-contigs

- S is an e-consensus sequence for a contig (0 <= e <= 1) if the edit distance satisfies d(f, S) <= e|f| for every fragment f in the contig
- Multicontig problem:
- Input: set F, integer t >= 0, 0 <= e <= 1
- Output: a partition of F into the minimum number of subsets Ci, each of which is a t-contig with an e-consensus

- Nodes are the fragments
- Directed arcs are labeled with the overlap length t between nodes: the t-suffix of the source fragment equals the t-prefix of the destination fragment
- Arcs between all pairs of nodes, but no self-loops
- Fig 4.15 (p 121): example
- Length of the superstring spelled by a path = total length of all fragments on the path - total weight (overlap) along the path
- We look for a maximum-weight Hamiltonian path in this graph: it gives the superstring with maximum total overlap
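
The overlap graph and the length relation can be checked on a toy instance. The sketch assumes exact (error-free) overlaps and keeps only the longest overlap per ordered pair; the fragments and the names `overlap`, `overlap_graph`, and `spell` are made up for illustration.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(frags):
    """Arc weights for every ordered pair of distinct fragments (no self-loops)."""
    return {(i, j): overlap(a, b)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags) if i != j}

def spell(frags, path, graph):
    """Superstring obtained by merging fragments along a path of node indices."""
    s = frags[path[0]]
    for u, v in zip(path, path[1:]):
        s += frags[v][graph[(u, v)]:]  # append only the non-overlapping tail
    return s

frags = ["ACGTAC", "TACGGA", "GGATT"]
graph = overlap_graph(frags)
path = [0, 1, 2]
weight = sum(graph[(u, v)] for u, v in zip(path, path[1:]))
superstring = spell(frags, path, graph)
print(superstring, len(superstring), sum(map(len, frags)) - weight)
```

Both printed numbers are 11: the superstring length equals the total fragment length (17) minus the path weight (6).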

- Fragments that are substrings of other fragments in the set add no information: remove them
- Draw the overlap multigraph (OMG) of the substring-free set of fragments
- A shortest common superstring always corresponds to a Hamiltonian path in this graph

- Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. Path P such that S(P) is contained in S
- Fragments are strictly ordered along S: the order of left endpoints equals the order of right endpoints (otherwise one fragment would be a substring of another)
- The path follows the same order of fragments (as in S) in the OMG
- S may contain extra garbage material, so S(P) is within S

- If S is a shortest common superstring, then S cannot contain anything beyond S(P), so S = S(P)
- In other words, a shortest common superstring of the fragment set F corresponds to a Ham. Path in the OMG of the substring-free collection F’

- Think of an algorithm for weeding out substrings from F
- Also, weed out multi-edges by keeping the largest wt edge between any pair of nodes
- If the wt on an edge is below a threshold t, then the wt should be treated as 0
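
One straightforward (quadratic) way to weed out substrings is sketched below. It is only an illustration: the name `remove_substrings` is an assumption, and faster approaches based on suffix structures exist.

```python
def remove_substrings(frags):
    """Return a substring-free collection: drop duplicates and any fragment
    contained in another fragment.

    Fragments are visited from longest to shortest, so a fragment can only be
    contained in one that was visited earlier. Checking against the kept
    fragments is enough: anything contained in a dropped fragment is also
    contained in the kept fragment that caused the drop.
    """
    ordered = sorted(set(frags), key=lambda f: (-len(f), f))
    kept = []
    for f in ordered:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept

print(remove_substrings(["ACCGT", "CGT", "TACCGT", "TTAC", "TTAC"]))
# ['TACCGT', 'TTAC']
```

Fragments from the opposite strand are not handled here; a fuller version would also test each fragment's reverse complement.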

- Greedy Algorithm to draw Ham. Path (p 125)
- Collects edges from largest to smallest weight, while
(1) preventing cycles (union-find),
(2) keeping the indegree of each node <= 1 (the first node has 0),
(3) keeping the outdegree of each node <= 1 (the last node has 0)

[Does not return Ham. Path. Can you modify to return Ham. Path?]

- Alg is NOT optimal, example (p 126): returns 3, optimal wt is 4
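
The greedy selection can be sketched as below, assuming exact overlaps and arcs (including zero-weight ones) between all ordered pairs. With all pairs present, the accepted arcs always link up into a single path, so the sketch walks them to return the node order. The three fragments are a hypothetical instance, not necessarily the one on p 126, chosen because greedy reaches weight 3 while the optimum is 4.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_path(frags):
    """Pick arcs from heaviest to lightest, subject to the three conditions."""
    n = len(frags)
    arcs = sorted(((overlap(frags[i], frags[j]), i, j)
                   for i in range(n) for j in range(n) if i != j),
                  key=lambda arc: -arc[0])
    parent = list(range(n))  # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    succ, has_pred, total = {}, set(), 0
    for w, i, j in arcs:
        # Reject if outdegree(i) or indegree(j) is used, or the arc closes a cycle.
        if i in succ or j in has_pred or find(i) == find(j):
            continue
        succ[i] = j
        has_pred.add(j)
        parent[find(i)] = find(j)
        total += w
    # Walk the selected arcs from the only node without a predecessor.
    order = [next(i for i in range(n) if i not in has_pred)]
    while order[-1] in succ:
        order.append(succ[order[-1]])
    return order, total

def optimal_weight(frags):
    """Brute-force maximum path weight; only feasible for tiny inputs."""
    return max(sum(overlap(frags[a], frags[b]) for a, b in zip(p, p[1:]))
               for p in permutations(range(len(frags))))

frags = ["ATGC", "TGCAT", "GCC"]
print(greedy_path(frags))      # ([0, 1, 2], 3)
print(optimal_weight(frags))   # 4
```

Greedy commits to the weight-3 arc ATGC -> TGCAT first, which blocks the two weight-2 arcs that together form the optimal path TGCAT -> ATGC -> GCC.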

- Subintervals: a fragment is a subinterval if it can be embedded within another fragment in the set
- A subinterval-free and repeat-free graph that is connected at level t has a Ham. Path that generates the target string

- If a repeat exists in the original string, then the graph will have a cycle
- False positive: fragments from two different portions of the target have a t-overlap
- If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p129): proof by contradiction, since otherwise the subinterval-free fragments could be totally ordered

- If there are no repeats in a subinterval-free graph, then there exists a unique Ham. Path
- If a cycle exists, it does not necessarily come from a repeat

- Example 4.6 (p 130): greedy alg finds wrong string, but the Ham. Path finds the correct one
- Greedy does not care about linkage (optimizes on total overlap – finds shortest common superstring)
- Ham path chooses any t-overlap connections – cares for linkage only

- Score on a column: traditionally {0, -1, -2} in sum-of-pairs
- Entropy: Sum [over characters c of the alphabet, plus space] of -p_c log p_c, where p_c is the probability of c in the column
- All same character: p_c = 1, entropy = 0
- For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5
- Entropy measures uniformity alone, a better metric
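
The two cases above can be checked numerically. In this minimal sketch p_c is taken as the relative frequency of c in the column and the logarithm is natural; the function name `column_entropy` is an assumption.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy of one layout column, with p_c the relative frequency of c."""
    n = len(column)
    return sum(-(k / n) * math.log(k / n) for k in Counter(column).values())

print(column_entropy("aaaaa"))               # uniform column
print(column_entropy("atcg-"), math.log(5))  # all five symbols different
```

A uniform column scores 0 and the all-different column scores log 5, the maximum for five symbols.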

- Coverage: how many fragments cover each column? (average, min, max)
- This is different from the concept of t-overlap
- If a column (of the target) is covered by 0 fragments, then the layout is disconnected
- Expecting coverage > 1 for all columns conflicts with the requirement of a subinterval-free collection
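
Per-column coverage and its summary statistics can be computed directly from a layout. The sketch assumes rows padded with `' '` outside each fragment; the layout and the name `coverage` are illustrative.

```python
def coverage(layout):
    """Number of fragments covering each column of the layout."""
    width = max(len(row) for row in layout)
    return [sum(1 for row in layout if col < len(row) and row[col] != " ")
            for col in range(width)]

layout = [
    "ACGTAC  ",
    " CGAACGG",
    "  GTACGG",
]
cov = coverage(layout)
print(cov)                                   # [1, 2, 3, 3, 3, 3, 2, 2]
print(min(cov), max(cov), sum(cov) / len(cov))
print("disconnected" if 0 in cov else "connected")
```

Here the minimum is 1 and the maximum is 3, so no column is uncovered and the layout is connected.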

- Coverage is not enough, we need good linkage, Example: p 133
- Ham. Path algorithm is doing that

- Step 1: Overlap finding
- Approximate: deletions, insertions, and substitutions allowed
- by a semi-global DP algorithm
- with an appropriate end-gap penalty,
- pairwise between all fragments and their reverse complements

- Step 2: Construct the overlap graph over (F union F-bar) for the fragment set F, where F-bar holds the reverse complements
- (after eliminating substrings?)
- Construct a Hamiltonian path in this graph
- Cycles and unbalanced coverage may indicate repeats

- Step 3: fine tuning the multiple alignment to get a consensus target
- Manual or algorithmic
- Examples in p 137-138