
Multiple Sequence Alignments



  1. Multiple Sequence Alignments

  2. Progressive Alignment
• When the evolutionary tree is known:
  • Align the closest sequences first, in the order given by the tree
  • In each step, align two sequences x, y, or two profiles p_x, p_y, to generate a new alignment with associated profile p_result
• Weighted version:
  • Tree edges have weights, proportional to the divergence along that edge
  • The new profile is a weighted average of the two old profiles
(Figure: a guide tree over sequences x, y, z, w, with intermediate profiles p_xy, p_zw and root profile p_xyzw)
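The weighted-average step can be made concrete with a tiny sketch (not from the slides: the column-frequency representation and the numeric weights below are illustrative assumptions). A profile is treated as a positions-by-alphabet matrix of residue frequencies, and the parent profile is the edge-weighted average of its two already-aligned children:

    import numpy as np

    def merge_profiles(p_x, p_y, w_x, w_y):
        # p_x, p_y: (num_columns, alphabet_size) frequency matrices of two
        # already-aligned child profiles; w_x, w_y: weights derived from the
        # divergence on the corresponding tree edges (assumed given)
        return (w_x * p_x + w_y * p_y) / (w_x + w_y)

    # Toy example over the two-letter alphabet {A, C}
    p_x = np.array([[1.0, 0.0], [0.5, 0.5]])
    p_y = np.array([[0.8, 0.2], [0.0, 1.0]])
    print(merge_profiles(p_x, p_y, w_x=1.0, w_y=3.0))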

  3. Progressive Alignment
• When the evolutionary tree is unknown:
  • Perform all pairwise alignments
  • Define a distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment of x and y
  • Construct a guide tree (UPGMA / neighbor joining / other methods)
  • Align along the tree
(Figure: four sequences x, y, z, w with an unknown tree topology)
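A minimal sketch of the "unknown tree" pipeline, assuming the pairwise alignments are already available and defining D(x, y) as one minus percent identity (an illustrative choice, not the slide's); UPGMA corresponds to average-linkage clustering, here taken from SciPy:

    import numpy as np
    from scipy.cluster.hierarchy import average, to_tree
    from scipy.spatial.distance import squareform

    def percent_identity(ax, ay):
        # Fraction of columns in which the two aligned sequences agree
        matches = sum(a == b and a != '-' for a, b in zip(ax, ay))
        return matches / len(ax)

    # Hypothetical pairwise alignments of four sequences (indices 0..3)
    aligned = {
        (0, 1): ("GAAGTT", "GAC-TT"), (0, 2): ("GAAGTT", "GAACTG"),
        (0, 3): ("GAAGTT", "GTACTG"), (1, 2): ("GAC-TT", "GAACTG"),
        (1, 3): ("GAC-TT", "GTACTG"), (2, 3): ("GAACTG", "GTACTG"),
    }

    n = 4
    D = np.zeros((n, n))
    for (i, j), (ax, ay) in aligned.items():
        D[i, j] = D[j, i] = 1.0 - percent_identity(ax, ay)

    # UPGMA guide tree = average-linkage hierarchical clustering on D;
    # progressive alignment would then walk this tree from the leaves up
    guide_tree = to_tree(average(squareform(D)))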

  4. Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated annealing
• …

  5. Iterative Refinement
One problem with progressive alignment:
• Initial alignments are “frozen” even when new evidence arrives later
Example:
  x: GAAGTT
  y: GAC-TT   ← frozen!
  z: GAACTG
  w: GTACTG
Once z and w are added, it is clear that the correct alignment of y is G-ACTT.

  6. Iterative Refinement
Algorithm (Barton-Sternberg):
• For j = 1 to N: remove xj, and realign it to the profile of x1, …, xj-1, xj+1, …, xN
• Repeat the realignment loop until convergence
(Figure: y is allowed to vary while the projection of the alignment onto x and z is kept fixed)
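A minimal sketch of that leave-one-out loop; remove_sequence, align_sequence_to_profile, and sum_of_pairs_score are hypothetical helpers standing in for whatever profile aligner and scoring function are actually used:

    def barton_sternberg_refinement(msa, max_rounds=10):
        # msa: current multiple alignment (e.g. a list of gapped strings)
        best_score = sum_of_pairs_score(msa)                       # hypothetical scorer
        for _ in range(max_rounds):
            improved = False
            for j in range(len(msa)):
                seq, rest = remove_sequence(msa, j)                # hypothetical helper
                candidate = align_sequence_to_profile(seq, rest)   # hypothetical helper
                score = sum_of_pairs_score(candidate)
                if score > best_score:
                    msa, best_score, improved = candidate, score, True
            if not improved:
                break   # converged: no single-sequence realignment helps
        return msa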

  7. Iterative Refinement
Example: align (x, y), then (z, w), then (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+3 matches)
  z: GAACTGA
  w: GTACTGA
Variant: refinement on a tree

  8. Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
• Realigning any single yi changes nothing: the other two copies of y keep the old gap placement in the profile, so leave-one-out refinement is stuck.

  9. Consistency
(Figure: three sequences x, y, z with positions xi, yj, yj’ and zk; the pairwise alignments x-z and z-y give indirect evidence about which position of y should align to xi)

  10. Consistency
Basic method for applying consistency:
• Compute all pairwise alignments xy, xz, yz, …
• When aligning x and y during progressive alignment:
  • For each pair of positions (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
  • Align x and y with DP, using the modified s(·,·) function
(Figure: the same positions xi, yj, yj’, zk as on the previous slide)

  11. Some Resources
Genome resources:
• Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
• Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
Protein multiple aligners:
• CLUSTALW – most widely used: http://www.ebi.ac.uk/clustalw/
• MUSCLE – most scalable: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
• PROBCONS – most accurate: http://probcons.stanford.edu/

  12. MUSCLE at a glance
• Fast measurement of all pairwise distances between sequences:
  • D_draft(x, y) defined in terms of the number of common k-mers (k ≈ 3) – O(N^2 L log L) time
• Build tree T_draft based on those distances, with UPGMA
• Progressive alignment over T_draft, resulting in multiple alignment M_draft
• Measure new Kimura-based distances D(x, y) based on M_draft
• Build tree T based on D
• Progressive alignment over T, to build M
  • Only perform alignment steps for the parts of the tree that have changed
• Iterative refinement; for many rounds, do:
  • Tree partitioning: split M on one branch and realign the two resulting profiles
  • If the new alignment M’ has a better sum-of-pairs score than the previous one, accept it
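The slide does not give the exact formula for D_draft, but the idea of a distance based on shared k-mers can be sketched as follows (the "one minus fraction of shared k-mers" definition is an illustrative stand-in, not MUSCLE's actual formula):

    from collections import Counter

    def kmer_distance(x, y, k=3):
        # Count k-mers in each sequence, then measure how many are shared
        kx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
        ky = Counter(y[i:i + k] for i in range(len(y) - k + 1))
        shared = sum((kx & ky).values())          # multiset intersection
        return 1.0 - shared / min(sum(kx.values()), sum(ky.values()))

    print(kmer_distance("GAAGTTACCA", "GTACTGACCA"))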

  13. PROBCONS at a glance
• Computation of all posterior matrices Mxy: Mxy(i, j) = Prob(xi ~ yj), using an HMM
• Re-estimation of the posterior matrices M’xy with probabilistic consistency:
  • M’xy(i, j) = (1/N) Σ_z Σ_k Mxz(i, k) · Myz(j, k), i.e. M’xy = Avg_z(Mxz Mzy)
• Compute, for every pair x, y, the maximum expected accuracy alignment:
  • Axy: the alignment A that maximizes Σ over aligned (i, j) in A of M’xy(i, j)
  • Define E(x, y) = Σ over aligned (i, j) in Axy of M’xy(i, j)
• Build tree T with hierarchical clustering, using the similarity measure E(x, y)
• Progressive alignment on T to maximize E(·,·)
• Iterative refinement; for many rounds, do:
  • Randomized partitioning: split the sequences of M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
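The consistency re-estimation is just an average of matrix products. A numpy sketch, assuming the posterior matrices M[(x, y)] have already been computed by the HMM and treating M[(x, x)] as the identity:

    import numpy as np

    def probabilistic_consistency(M, lengths):
        # M[(x, y)][i, j] = Prob(xi ~ yj); lengths maps sequence name -> length
        names, N = list(lengths), len(lengths)
        M_new = {}
        for x in names:
            for y in names:
                if x == y:
                    continue
                acc = np.zeros((lengths[x], lengths[y]))
                for z in names:
                    M_xz = np.eye(lengths[x]) if z == x else M[(x, z)]
                    M_zy = np.eye(lengths[y]) if z == y else M[(z, y)]
                    acc += M_xz @ M_zy          # sum over positions zk
                M_new[(x, y)] = acc / N         # M'xy = Avg_z(Mxz Mzy)
        return M_new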

  14. Rapid Global Alignments
How to align genomic sequences in (more or less) linear time

  15. Motivation
• Genomic sequences are very long:
  • Human genome ≈ 3 × 10^9 bp
  • Mouse genome ≈ 2.7 × 10^9 bp
• Aligning genomic regions is useful for revealing common gene structure
• It is useful to compare regions > 1,000,000 bp long

  16. Main Idea
Genomic regions of interest contain islands of similarity, such as genes
• Find local alignments
• Chain an optimal subset of them
• Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

  17. Saving cells in DP
• Find local alignments
• Chain them – O(N log N) with L.I.S.
• Restricted DP

  18. Methods to CHAIN Local Alignments
Sparse Dynamic Programming – O(N log N)

  19. The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’) requires x < x’ and y < y’
Each local alignment has a weight
FIND the chain with the highest total weight

  20. Quadratic Time Solution
• Build a Directed Acyclic Graph (DAG):
  • Nodes: local alignments [(xa, xb) × (ya, yb)] and their scores
  • Directed edges: pairs of local alignments that can be chained
    • edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) ) exists when
      xa < xb < xc < xd and ya < yb < yc < yd
Each local alignment is a node vi with alignment score si

  21. Quadratic Time Solution
Initialization:
  Find each node va such that there is no edge (u, va); set V(a) = sa
Iteration:
  For each vi, the optimal chain ending in vi has total score
    V(i) = max over j such that there is an edge (vj, vi) of ( weight(vj, vi) + V(j) )
Termination:
  Optimal global chain: j = argmax ( V(j) ); trace the chain back from vj
Worst-case time: quadratic
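A sketch of the quadratic chaining DP, treating the contribution of each step simply as the score of the alignment being appended and ignoring any gap penalty between chained blocks (an assumption; the slide leaves weight(vj, vi) unspecified):

    def chain_local_alignments(blocks):
        # blocks: non-empty list of (xa, xb, ya, yb, score) local alignments
        blocks = sorted(blocks, key=lambda b: b[0])       # order by x start
        n = len(blocks)
        V = [b[4] for b in blocks]                        # best chain score ending at i
        back = [None] * n
        for i in range(n):
            for j in range(i):
                chainable = blocks[j][1] < blocks[i][0] and blocks[j][3] < blocks[i][2]
                if chainable and V[j] + blocks[i][4] > V[i]:
                    V[i] = V[j] + blocks[i][4]
                    back[i] = j
        # Trace the optimal global chain back from its last block
        i = max(range(n), key=V.__getitem__)
        chain = []
        while i is not None:
            chain.append(blocks[i])
            i = back[i]
        return chain[::-1], max(V)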

  22. Sparse Dynamic Programming
Back to the LCS problem:
• Given two sequences x = x1, …, xm and y = y1, …, yn
• Find the longest common subsequence
• Quadratic solution with DP
• How about when the “hits” xi = yj are sparse?

  23. Sparse Dynamic Programming
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

  24. Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence x = x1, …, xm over an ordered alphabet
• Find the longest subsequence s = s1, …, sk with s1 < s2 < … < sk

  25. Sparse Dynamic Programming – L.I.S.
Let the input be w: w1, …, wn
INITIALIZATION:
  L: 1-indexed array, L[1] ← w1
    // L[j]: smallest last element wi of a j-long increasing subsequence seen so far
  B: 0-indexed array of backpointers; B[0] ← 0
  P: array used for traceback
ALGORITHM:
  for i = 2 to n {
    find j such that L[j] < w[i] ≤ L[j+1]
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
  }
That’s it!
• Running time?

  26. Sparse LCS expressed as LIS
Create a sequence w:
• Every matching point (i, j) is inserted into w as follows:
  • For each column j, from smallest to largest, insert into w the points (i, j) in decreasing row-i order
• The 11 example points (see the figure) are inserted in the order given
• a = (y, x) and b = (y’, x’) can be chained iff
  • a is before b in w, and
  • y < y’

  27. Sparse LCS expressed as LIS
Create a sequence w:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider w’s elements as ordered by their first coordinate:
• (y, x) < (y’, x’) if y < y’
Claim: an increasing subsequence of w is a common subsequence of x and y
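The insertion order described on slide 26 is just a sort: by column ascending, and within a column by row descending. A quick check that this reproduces the w shown above (the unsorted points list is written in arbitrary order):

    def hits_to_w(points):
        # points: match points written as (y, x) = (row, column)
        # Sort by column x ascending, and within a column by row y descending,
        # so no increasing run of y-values can use two points from one column
        return sorted(points, key=lambda p: (p[1], -p[0]))

    points = [(4,2), (3,3), (2,5), (10,5), (1,6), (8,6),
              (3,7), (4,8), (5,9), (7,9), (9,10)]
    print(hits_to_w(points))
    # -> the same order as the w shown on the slide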

  28. Sparse Dynamic Programming for LIS
• Algorithm:
  initialize an empty array L
    /* at each point, lj contains the smallest value that ends a j-long increasing subsequence seen so far */
  for i = 1 to |w|
    binary search for w[i] in L, to find j such that lj < w[i] ≤ lj+1
    replace lj+1 with w[i]
    keep a backpointer from w[i] to the element currently stored at lj
That’s it!

  29. Sparse Dynamic Programming for LIS
Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L after each insertion:
• (4,2)
• (3,3)
• (3,3) (10,5)
• (2,5) (10,5)
• (2,5) (8,6)
• (1,6) (8,6)
• (1,6) (3,7)
• (1,6) (3,7) (4,8)
• (1,6) (3,7) (4,8) (7,9)
• (1,6) (3,7) (4,8) (5,9)
• (1,6) (3,7) (4,8) (5,9) (9,10)
Tracing the backpointers from (9,10) gives the chain (1,6) (3,7) (4,8) (5,9) (9,10): a longest common subsequence of length 5.
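A runnable version of the O(n log n) scheme above, using binary search over the tails array L and backpointers for the traceback; running it on the slide's w reproduces the chain just described (the variable names are mine, not the slide's):

    from bisect import bisect_left

    def longest_increasing_chain(w):
        # LIS over the first coordinate of each (y, x) pair, strict increase
        tail_keys = []            # tail_keys[j] = smallest y ending a (j+1)-long run
        tail_idx = []             # index into w of the element currently stored there
        prev = [None] * len(w)    # backpointer to the predecessor in the chain
        for i, (y, _x) in enumerate(w):
            j = bisect_left(tail_keys, y)      # ties replace, never extend
            if j == len(tail_keys):
                tail_keys.append(y); tail_idx.append(i)
            else:
                tail_keys[j] = y; tail_idx[j] = i
            prev[i] = tail_idx[j - 1] if j > 0 else None
        chain, i = [], tail_idx[-1]            # trace back from the longest chain's end
        while i is not None:
            chain.append(w[i]); i = prev[i]
        return chain[::-1]

    w = [(4,2), (3,3), (10,5), (2,5), (8,6), (1,6),
         (3,7), (4,8), (7,9), (5,9), (9,10)]
    print(longest_increasing_chain(w))
    # -> [(1, 6), (3, 7), (4, 8), (5, 9), (9, 10)]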

  30. Sparse DP for rectangle chaining
• 1, …, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of a chain ending in j
• L: list of triplets (lj, V(j), j)
  • L is sorted by lj, top to bottom
  • L is implemented as a balanced binary tree

  31. Sparse DP for rectangle chaining
Main idea:
• Sweep through the x-coordinates
• To the right of b, anything chainable to a is also chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it
• In L, keep rectangles j sorted with increasing lj-coordinates ⇒ also sorted with increasing V(j)

  32. Sparse DP for rectangle chaining
Go through the rectangle x-coordinates, from lowest to highest:
• When at the leftmost end of i:
  • j: rectangle in L with the largest lj < hi
  • V(i) = w(i) + V(j)
• When at the rightmost end of i:
  • k: rectangle in L with the largest lk ≤ li
  • If V(i) ≥ V(k):
    • INSERT (li, V(i), i) into L
    • REMOVE all (lj, V(j), j) with V(j) ≤ V(i) and lj ≥ li
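A sketch of the sweep; a plain sorted Python list stands in for the balanced binary search tree, so the lookups match the slide but this version is not truly O(N log N). The (x_left, x_right, h, l, weight) tuple layout is an assumption about how the rectangles are represented:

    from bisect import bisect_left, bisect_right

    def chain_rectangles(rects):
        # rects: list of (x_left, x_right, h, l, weight), with h < l the y-range
        events = sorted((x, side, i)
                        for i, (x1, x2, h, l, w) in enumerate(rects)
                        for x, side in ((x1, 0), (x2, 1)))    # side 0 = left end
        V = [0.0] * len(rects)
        L = []                                  # triplets (l_j, V(j), j), sorted by l_j
        for x, side, i in events:
            _x1, _x2, h, l, w = rects[i]
            keys = [e[0] for e in L]
            if side == 0:                       # leftmost end: pick best predecessor
                pos = bisect_left(keys, h)      # rectangles with l_j < h_i
                V[i] = w + (L[pos - 1][1] if pos > 0 else 0.0)
            else:                               # rightmost end: try to insert i
                pos = bisect_right(keys, l)     # k = rectangle with largest l_k <= l_i
                if pos == 0 or V[i] >= L[pos - 1][1]:
                    L.insert(pos, (l, V[i], i))
                    while pos + 1 < len(L) and L[pos + 1][1] <= V[i]:
                        del L[pos + 1]          # drop dominated entries with l_j >= l_i
        return max(V) if V else 0.0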

  33. Example
(Figure: a worked example of the sweep over five weighted rectangles, with the x-coordinate events marked along the axis)

  34. Time Analysis
• Sorting the x-coordinates takes O(N log N)
• Going through the x-coordinates: N steps
• Each of the N steps requires O(log N) time:
  • Searching L takes O(log N)
  • Inserting into L takes O(log N)
  • All deletions are consecutive, so O(log N) per deletion
• Each element is deleted at most once: O(N log N) for all deletions
• Recall that INSERT, DELETE, and SUCCESSOR take O(log N) time in a balanced binary search tree

  35. Examples
(Figures: Human Genome Browser; ABC)

  36. Whole-genome alignment Rat—Mouse—Human

  37. Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
