Rapid Global Alignments

  1. Rapid Global Alignments: how to align genomic sequences in (more or less) linear time

  2. Methods to CHAIN Local Alignments: Sparse Dynamic Programming, O(N log N)

  3. The Problem: Find a Chain of Local Alignments • (x, y) → (x', y') requires x < x' and y < y' • Each local alignment has a weight • FIND the chain with highest total weight

  4. Sparse DP for rectangle chaining • 1, …, N: rectangles • ($h_j$, $l_j$): y-coordinates of rectangle j • w(j): weight of rectangle j • V(j): optimal score of chain ending in j • L: list of triplets ($l_j$, V(j), j) • L is sorted by $l_j$ • L is implemented as a balanced binary tree [Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

  5. Sparse DP for rectangle chaining Main idea: • Sweep through x-coordinates • To the right of b, anything chainable to a is chainable to b (this holds when $l_b \le l_a$) • Therefore, if V(b) > V(a), rectangle a is “useless”: remove it • In L, keep rectangles j sorted with increasing $l_j$-coordinates $\Rightarrow$ sorted with increasing V(j) [Figure: rectangles a and b with scores V(a) and V(b)]

  6. Sparse DP for rectangle chaining Go through rectangle x-coordinates, from left to right: • When on the leftmost end of rectangle i, compute V(i): • j: rectangle in L, with largest $l_j < h_i$ • V(i) = w(i) + V(j) • When on the rightmost end of i, possibly store V(i) in L: • j: rectangle in L, with largest $l_j \le l_i$ • If V(i) > V(j): • INSERT ($l_i$, V(i), i) in L • REMOVE all ($l_k$, V(k), k) with $V(k) \le V(i)$ and $l_k \ge l_i$ [Figure: rectangles j and i]
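A minimal Python sketch of the sweep described in slides 4 to 6 may help make the steps concrete. Several things here are assumptions, not part of the slides: the function name `chain_rectangles`, the tuple layout `(x_start, x_end, h, l, w)`, non-negative weights, and the use of sorted Python lists with `bisect` in place of a balanced binary tree. Searches are O(log N) as on the slides, but list insertion and deletion are not, so this sketch shows the logic rather than the O(N log N) bound.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """rects: list of (x_start, x_end, h, l, w), with h <= l the y-range.

    Rectangle j may precede i only if x_end_j < x_start_i and l_j < h_i.
    Returns (best_score, chain), chain being rectangle indices in order.
    Assumes non-negative weights (an assumption of this sketch).
    """
    n = len(rects)
    if n == 0:
        return 0, []
    # Sweep events: kind 0 = left end, kind 1 = right end. At equal x the
    # left ends sort first, so a rectangle ending at x is not yet in L when
    # another one starts at the same x (strict x < x').
    events = []
    for i, (x1, x2, _h, _l, _w) in enumerate(rects):
        events.append((x1, 0, i))
        events.append((x2, 1, i))
    events.sort()

    V = [0] * n
    pred = [-1] * n
    ls, vs, ids = [], [], []  # L as parallel lists: sorted by l, hence by V

    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            pos = bisect_left(ls, h)       # entries [0, pos) have l_j < h_i
            V[i] = w
            if pos > 0:
                V[i] += vs[pos - 1]        # largest l_j < h_i has the best V
                pred[i] = ids[pos - 1]
        else:
            pos = bisect_right(ls, l)      # entries [0, pos) have l_j <= l_i
            if pos > 0 and vs[pos - 1] >= V[i]:
                continue                   # i is dominated: do not store it
            start = bisect_left(ls, l)     # entries from start on: l_k >= l_i
            end = start
            while end < len(ls) and vs[end] <= V[i]:
                end += 1                   # consecutive "useless" entries
            ls[start:end] = [l]
            vs[start:end] = [V[i]]
            ids[start:end] = [i]

    best = max(range(n), key=V.__getitem__)
    chain = []
    i = best
    while i != -1:
        chain.append(i)
        i = pred[i]
    return V[best], chain[::-1]
```

For example, with the made-up rectangles `[(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 1, 2, 4), (6, 9, 6, 9, 2)]` the best chain is rectangles 0, 1, 4 with total weight 13.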

  7. Example [Figure: five rectangles labeled 1: 5, 2: 6, 3: 3, 4: 4, 5: 2, drawn on x and y axes; coordinate ticks 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

  8. Time Analysis • Sorting the x-coords takes O(N log N) • Going through x-coords: N steps • Each of N steps requires O(log N) time: • Searching L takes log N • Inserting to L takes log N • All deletions are consecutive, so log N per deletion • Each element is deleted at most once: N log N for all deletions • Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

  9. Putting it All Together: Fast Global Alignment Algorithms • FIND local alignments • CHAIN local alignments
     Tool      FIND          CHAIN
     GLASS     k-mers        hierarchical DP
     MUMmer    suffix tree   sparse DP
     Avid      suffix tree   hierarchical DP
     LAGAN     CHAOS         sparse DP

  10. LAGAN: Pairwise Alignment • FIND local alignments • CHAIN local alignments • DP restricted around chain

  11. LAGAN • Find local alignments • Chain: O(N log N) L.I.S. (longest increasing subsequence) • Restricted DP
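The O(N log N) L.I.S. step can be illustrated with the classic patience-style algorithm. This is a sketch under simplifying assumptions that are not in the slides: anchors are reduced to points `(x, y)` and all have equal weight, so the best chain is simply the longest one. The name `longest_chain` is hypothetical.

```python
from bisect import bisect_left

def longest_chain(anchors):
    """Longest chain of (x, y) points, strictly increasing in both x and y."""
    # Sort by x; for equal x put larger y first so that two points sharing
    # an x-coordinate can never both enter a strictly increasing run of y.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tail_ys = []   # tail_ys[k]: smallest final y of any chain of length k + 1
    tails = []     # tails[k]: index in pts of the point achieving tail_ys[k]
    prev = [-1] * len(pts)
    for i, (_, y) in enumerate(pts):
        k = bisect_left(tail_ys, y)       # O(log N) search
        if k > 0:
            prev[i] = tails[k - 1]
        if k == len(tails):
            tails.append(i)
            tail_ys.append(y)
        else:
            tails[k] = i
            tail_ys[k] = y
    # Walk predecessor links back from the end of the longest chain.
    chain = []
    i = tails[-1] if tails else -1
    while i != -1:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]
```

With weighted anchors the same sweep needs the sparse DP of the earlier slides instead of a plain L.I.S.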

  12. LAGAN: recursive call • What if a box is too large? • Recursive application of LAGAN, more sensitive word search

  13. A trick to save on memory • “necks” have tiny tracebacks • …only store tracebacks

  14. Multiple Sequence Alignments

  15. Overview • Definition • Scoring Schemes • Algorithms

  16. Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • A faint similarity between two sequences becomes significant if present in many • Multiple alignments can help improve the pairwise alignments

  17. Scoring Function • Ideally: • Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model • More on phylogenetic models later [Figure: tree relating sequences x, y, z, w, v, with a node marked “?”]

  18. Scoring Function • A comprehensive model would have too many parameters, too inefficient to optimize • Possible simplifications: • Ignore phylogenetic tree • Statistically independent columns: $S(m) = G(m) + \sum_i S(m_i)$ • m: alignment matrix • $m_i$: column i of m • G: function penalizing gaps

  19. Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment (columns where both sequences have a gap are dropped). Example:
      x: AC-GCGG-C
      y: AC-GC-GAG
      z: GCCGC-GAG
      Induces:
      x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
      y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

  20. Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: $S(m) = \sum_{k<l} s(m_k, m_l)$ • $s(m_k, m_l)$: score of induced alignment (k, l)

  21. Sum Of Pairs (cont’d) • Heuristic way to incorporate evolution tree [Figure: tree over Human, Mouse, Duck, Chicken] • Weighted SOP: $S(m) = \sum_{k<l} w_{kl} \, s(m_k, m_l)$ • $w_{kl}$: weight decreasing with distance
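The induced pairwise alignments and the (weighted) sum-of-pairs score can be sketched in a few lines of Python. The scoring values (match +1, mismatch -1, gap -2, no affine gaps) and the function names are illustrative assumptions, not values from the slides; `weights` is an optional dictionary keyed by row-index pairs `(k, l)` standing in for $w_{kl}$.

```python
from itertools import combinations

def induced_pair(row_a, row_b):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of the induced pairwise alignment (assumed linear gap cost)."""
    score = 0
    for a, b in zip(*induced_pair(row_a, row_b)):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m_k, m_l); w_kl = 1 if no weights."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[k, l]
        total += w * pair_score(rows[k], rows[l])
    return total

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # x, y, z from the example
print(induced_pair(rows[0], rows[1]))  # ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(rows))              # -1 under the assumed scoring
```

Under these assumed scores the three induced alignments score 0 for (x, y), -4 for (x, z), and 3 for (y, z), which gives the total of -1.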

  22. Consensus
      -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
      TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
      CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      • Find optimal consensus string m* to maximize $S(m) = \sum_i s(m^*, m_i)$ • $s(m_k, m_l)$: score of pairwise alignment (k, l)

  23. Multiple Sequence Alignments Algorithms

  24. 1. Multidimensional Dynamic Programming Generalization of Needleman-Wunsch: $S(m) = \sum_i S(m_i)$ (sum of column scores) $F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \bigl( F(\text{nbr}) + S(\text{nbr}) \bigr)$

  25. 1. Multidimensional Dynamic Programming • Example: in 3D (three sequences): • 7 neighbors/cell
      $$F(i,j,k) = \max \begin{cases}
      F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
      F(i-1,j-1,k) + S(x_i, x_j, -) \\
      F(i-1,j,k-1) + S(x_i, -, x_k) \\
      F(i-1,j,k) + S(x_i, -, -) \\
      F(i,j-1,k-1) + S(-, x_j, x_k) \\
      F(i,j-1,k) + S(-, x_j, -) \\
      F(i,j,k-1) + S(-, -, x_k)
      \end{cases}$$
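The 3D recurrence can be written out directly. The sketch below is illustrative only: it assumes a sum-of-pairs column score with made-up values (match +1, mismatch -1, gap -2, gap-gap pairs scoring 0), stores the whole cube in a dictionary, and uses the hypothetical names `column_score` and `align3`. It is practical only for very short sequences.

```python
from itertools import product

def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; a gap-gap pair contributes 0."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

# The 2^3 - 1 = 7 neighbors: each sequence either advances (1) or gaps (0).
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]

def align3(x, y, z):
    """Exact 3-sequence global alignment; returns (score, (row_x, row_y, row_z))."""
    F = {(0, 0, 0): 0}
    back = {}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell is filled in before the cell itself.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best, best_move = None, None
        for di, dj, dk in MOVES:
            if di > i or dj > j or dk > k:
                continue  # neighbor would fall outside the cube
            col = (x[i - 1] if di else "-",
                   y[j - 1] if dj else "-",
                   z[k - 1] if dk else "-")
            cand = F[i - di, j - dj, k - dk] + column_score(*col)
            if best is None or cand > best:
                best, best_move = cand, (di, dj, dk)
        F[i, j, k] = best
        back[i, j, k] = best_move
    # Trace back from the far corner to recover the alignment columns.
    cols = []
    i, j, k = len(x), len(y), len(z)
    while (i, j, k) != (0, 0, 0):
        di, dj, dk = back[i, j, k]
        cols.append((x[i - 1] if di else "-",
                     y[j - 1] if dj else "-",
                     z[k - 1] if dk else "-"))
        i, j, k = i - di, j - dj, k - dk
    cols.reverse()
    rows = tuple("".join(c[r] for c in cols) for r in range(3))
    return F[len(x), len(y), len(z)], rows
```

Calling `align3("ACG", "ACG", "ACG")` returns a score of 9 with three identical gap-free rows, since every column contributes three matching pairs.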

  26. 1. Multidimensional Dynamic Programming Running Time: • Size of matrix: $L^N$, where L = length of each sequence, N = number of sequences • Neighbors/cell: $2^N - 1$ • Therefore: $O(2^N L^N)$

  27. 2. Progressive Alignment • Multiple Alignment is NP-complete • Most used heuristic: Progressive Alignment. Algorithm: • Align two of the sequences $x_i$, $x_j$ • Fix that alignment • Align a third sequence $x_k$ to the alignment $x_i$, $x_j$ • Repeat until all sequences are aligned. Running Time: $O(N L^2)$

  28. 2. Progressive Alignment • When evolutionary tree is known: • Align closest first, in the order of the tree. Example: Order of alignments: 1. (x, y) 2. (z, w) 3. (xy, zw) [Figure: tree with leaves x, y, z, w]

  29. CLUSTALW: progressive alignment CLUSTALW: most popular multiple protein alignment. Algorithm: • Find all $d_{ij}$: alignment dist($x_i$, $x_j$) • Construct a tree (Neighbor-joining hierarchical clustering) • Align nodes in order of decreasing similarity + a large number of heuristics

  30. CLUSTALW & the CINEMA viewer

  31. MLAGAN: progressive alignment of DNA • Given N sequences, phylogenetic tree • Align pairwise, in order of the tree (LAGAN) [Figure: tree over Human, Baboon, Mouse, Rat]

  32. MLAGAN: main steps Given a collection of sequences, and a phylogenetic tree • Find local alignments for every pair of sequences x, y • Find anchors between every pair of sequences, similar to LAGAN anchoring • Progressive alignment • Multi-Anchoring based on reconciling the pairwise anchors • LAGAN-style limited-area DP • Optional refinement steps

  33. MLAGAN: multi-anchoring To anchor the (X/Y) and (Z) alignments: [Figure: anchor plots X vs. Z and Y vs. Z combined into X/Y vs. Z]

  34. Heuristics to improve multiple alignments • Iterative refinement schemes • A*-based search • Consistency • Simulated Annealing • …

  35. Iterative Refinement One problem of progressive alignment: • Initial alignments are “frozen” even when new evidence comes. Example:
      x: GAAGTT
      y: GAC-TT
      z: GAACTG
      w: GTACTG
      Frozen! Now it is clear that the correct y = GA-CTT

  36. Iterative Refinement Algorithm (Barton-Sternberg): 1. Align most similar $x_i$, $x_j$ 2. Align $x_k$ most similar to ($x_i x_j$) 3. Repeat 2 until ($x_1 \dots x_N$) are aligned 4. For j = 1 to N, remove $x_j$ and realign it to $x_1 \dots x_{j-1} x_{j+1} \dots x_N$ 5. Repeat 4 until convergence. Note: Guaranteed to converge

  37. Iterative Refinement For each sequence y: • Remove y • Realign y (while rest fixed) [Figure: axes x, y, z; allow y to vary; x, z fixed projection]

  38. Iterative Refinement Example: align (x, y), (z, w), (xy, zw):
      x: GAAGTTA
      y: GAC-TTA
      z: GAACTGA
      w: GTACTGA
      After realigning y:
      x: GAAGTTA
      y: G-ACTTA   (+ 3 matches)
      z: GAACTGA
      w: GTACTGA

  39. Iterative Refinement Example not handled well:
      x:  GAAGTTA
      y1: GAC-TTA
      y2: GAC-TTA
      y3: GAC-TTA
      z:  GAACTGA
      w:  GTACTGA
      • Realigning any single $y_i$ changes nothing

  40. Restricted MDP Here is another way to improve a multiple alignment: • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m • Running Time: $O(2^N R^{N-1} L)$

  41. Restricted MDP • Run MDP, restricted to radius R from m • Running Time: $O(2^N R^{N-1} L)$ [Figure: axes x, y, z]
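The simplest instance of restricted DP is the pairwise case (N = 2), where the bound $O(2^N R^{N-1} L)$ becomes $O(R L)$. The sketch below makes two simplifying assumptions that are not in the slides: the reference path m is taken to be the main diagonal, so the allowed region is the band $|i - j| \le R$, and the scoring values (match +1, mismatch -1, gap -2) are made up. The name `banded_global_score` is hypothetical.

```python
def banded_global_score(x, y, radius, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score using only cells with |i - j| <= radius.

    Returns None when the end cell (len(x), len(y)) lies outside the band.
    """
    n, m = len(x), len(y)
    if abs(n - m) > radius:
        return None
    F = {(0, 0): 0}  # only in-band cells are ever stored
    for i in range(n + 1):
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # The diagonal neighbor of an in-band cell is always in band.
                sub = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(F[i - 1, j - 1] + sub)
            if i > 0 and (i - 1, j) in F:
                candidates.append(F[i - 1, j] + gap)
            if j > 0 and (i, j - 1) in F:
                candidates.append(F[i, j - 1] + gap)
            F[i, j] = max(candidates)
    return F[n, m]
```

For "GAAGTT" against "GACTT" a radius of 1 already gives the same score as an unrestricted run, because the optimal path never strays more than one cell from the diagonal. A production version would follow an arbitrary chain or prior alignment rather than the diagonal.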

  42. Restricted MDP
      x:  GAAGTTA
      y1: GAC-TTA
      y2: GAC-TTA
      y3: GAC-TTA
      z:  GAACTGA
      w:  GTACTGA
      • Within radius 1 of the optimal $\Rightarrow$ Restricted MDP will fix it.

  43. Optional refinement steps in MLAGAN • Limited-area iterative refinement • Radius-r 3-sequence refinement on each node of the tree
