
Comp. Genomics


liona

Presentation Transcript


  1. Comp. Genomics Recitation 4 Multiple Sequence Alignment or Computational Complexity – What is it good for?

  2. Outline • MSA is very expensive • Speedup – Carillo Lipman • Approximation algorithms • Heuristics

  3. MSA is expensive • The running time of DP MSA is $O(2^N L^N)$ • E.g., to align 10 proteins of 50 residues each we need $10^{20}$ operations • If one operation takes one microsecond, that is $10^{14}$ seconds, or roughly three million years, to align these 10 proteins • How do we obtain practical algorithms?
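The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sketch: the helper names `dp_msa_operations` and `years_at` are made up for illustration, and the operation count is taken to be exactly $2^N L^N$ with no hidden constant.

```python
# Rough cost estimate for dynamic-programming MSA, assuming the
# operation count is exactly 2^N * L^N (hypothetical helper names).
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def dp_msa_operations(num_seqs: int, length: int) -> int:
    """Number of operations for N sequences of length L: 2^N * L^N."""
    return (2 ** num_seqs) * (length ** num_seqs)


def years_at(ops: int, seconds_per_op: float = 1e-6) -> float:
    """Wall-clock years needed at the given time per operation."""
    return ops * seconds_per_op / SECONDS_PER_YEAR


ops = dp_msa_operations(10, 50)   # 10 proteins, 50 residues each
years = years_at(ops)             # one microsecond per operation
print(f"operations: {ops:.1e}, years: {years:.2e}")
```

With one microsecond per operation the result is on the order of a few million years, which is already far beyond anything practical.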

  4. Exercise 1 • Definitions: • $D(x_i,x_j)$: the cost of the optimal pairwise alignment between the i-th and j-th sequences • $A^*$: the optimal SOP MSA ($c(A^*)$ = its cost) • $A^*_{i,j}$: the projection of the optimal SOP MSA on the ij plane ($c(A^*_{i,j})$ = its cost) • $c'$: an upper bound on $c(A^*)$ • Given $D$ and $c'$, find an upper bound on $c(A^*_{u,v})$.

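  5. The Carillo-Lipman bound • $c'$ is a bound on the cost of the optimal MSA: $c' \ge c(A^*)$ • The cost of the optimal MSA is the sum of the costs of its projections: $c(A^*) = \sum_{i<j} c(A^*_{i,j})$ • Break the sum: $c(A^*) = c(A^*_{u,v}) + \sum_{(i,j) \ne (u,v)} c(A^*_{i,j})$ • The optimal pairwise alignment is better than any other, so $c(A^*_{i,j}) \ge D(x_i,x_j)$ • Hence $c(A^*_{u,v}) \le c' - \sum_{(i,j) \ne (u,v)} D(x_i,x_j)$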

  6. The Carillo-Lipman bound • Which cells can we cancel? • Those whose projection on some 2-dimensional plane falls in a cell such that the optimal 2-dimensional path through that cell costs more than the bound for that plane

  7. Exercise 2 • We are building an MSA of x, y and z, of lengths n, m and l: • $D(x[1..i], y[1..j]) = 5$, $D(x[1..i], z[1..k]) = 7$, $D(y[1..j], z[1..k]) = 3$ • $D(x[i+1..n], y[j+1..m]) = 3$, $D(x[i+1..n], z[k+1..l]) = 7$, $D(y[j+1..m], z[k+1..l]) = 4$ • Carillo-Lipman gave us: $c(A^*_{x,y}) \le 13$, $c(A^*_{x,z}) \le 14$, $c(A^*_{y,z}) \le 15$ (we assume lower scores are better) • True or false: the cell (i,j,k) needs to be considered in the MSA

  8. Solution • The cost of the optimal path through (i,j) on the xy plane is 5 + 3 = 8 • The Carillo-Lipman bound is 13, so the projection on the xy plane does not cancel the cell • Similarly, the other bounds do not cancel the cell: 7 + 7 = 14 ≤ 14 on the xz plane and 3 + 4 = 7 ≤ 15 on the yz plane • The claim is true
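The check used in this solution can be written as a small function. This is a sketch only: `cell_survives` and the dictionary layout are hypothetical, and it assumes, as in the exercise, that lower scores are better and that the prefix and suffix pairwise costs are already known.

```python
# Carillo-Lipman style pruning test for one cell of the 3-D DP cube.
# Names and data layout are illustrative assumptions, not from the slides.

def cell_survives(prefix_cost: dict, suffix_cost: dict, bound: dict) -> bool:
    """A cell is kept if, on every plane, the best pairwise path through
    its projection (prefix cost + suffix cost) does not exceed the bound."""
    return all(prefix_cost[p] + suffix_cost[p] <= bound[p] for p in bound)


# Values from Exercise 2.
prefix = {"xy": 5, "xz": 7, "yz": 3}    # D on the prefixes ending at (i, j, k)
suffix = {"xy": 3, "xz": 7, "yz": 4}    # D on the suffixes starting after (i, j, k)
bounds = {"xy": 13, "xz": 14, "yz": 15}  # Carillo-Lipman bounds per plane

keep = cell_survives(prefix, suffix, bounds)
print(keep)
```

A single plane exceeding its bound is enough to cancel the cell, which is why the function uses `all`.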

  9. Approximation algorithms • Carillo-Lipman is still impractical for many long sequences • Hence, our goal is to obtain faster algorithms • Approximation algorithms guarantee a solution within a certain factor of OPT • Good: constant-factor algorithms (e.g., a 1.785-approximation) • Worse: an approximation ratio that depends on the input size (e.g., log(n)) • Not always good empirically, but important for inspiring good heuristics

  10. Approximation ratio • Has to be maintained for any legal input • Cost (score) of OPT is c(OPT) • Cost (score) of ALG is c(ALG) • Approximation ratio: c(ALG)/c(OPT)

  11. Reminder: SP MSA • Input: strings $S_1, \dots, S_k$ of length n • $d(i,j)$: the distance between $S_i$ and $S_j$ as induced by the MSA • Sum-of-pairs (SP) score: $SP(M) = \sum_{i<j} d(i,j)$ • Goal: find an MSA with minimum SP score • We'll look for minimal scores
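As an illustration of the SP score, the sketch below sums the induced pairwise distances over all pairs of rows of an alignment. The function names and the unit cost function (1 for a mismatch or a gap, 0 for a match or for two gaps) are assumptions made for the example; any symmetric column cost could be passed in instead.

```python
from itertools import combinations


def unit_sigma(a: str, b: str) -> int:
    """Assumed column cost: 0 for equal symbols (including gap/gap), else 1."""
    return 0 if a == b else 1


def induced_distance(row_i: str, row_j: str, sigma=unit_sigma) -> int:
    """d(i, j): cost of the pairwise alignment the MSA induces on two rows."""
    return sum(sigma(a, b) for a, b in zip(row_i, row_j))


def sp_score(msa: list, sigma=unit_sigma) -> int:
    """Sum-of-pairs score: sum of d(i, j) over all pairs i < j."""
    return sum(induced_distance(r, s, sigma) for r, s in combinations(msa, 2))


example = ["AC-T",
           "A-GT",
           "ACGT"]
score = sp_score(example)
print(score)
```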

  12. The center-star algorithm • Assumptions: • The triangle inequality holds • σ(-,-) = 0 • σ(x,y) = σ(y,x) • Input: strings $S_1, \dots, S_k$ • The algorithm: • Find the string $S^*$ that minimizes $\sum_{i} D(S^*, S_i)$, the sum of its optimal pairwise alignment costs to all other strings • Iteratively add all other sequences to the alignment • Running time: $O(k^2 n^2)$
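The first step of the algorithm, picking the center string, might look as follows. This is a sketch under an assumed cost function: unit-cost edit distance (1 per gap or mismatch), which is symmetric and satisfies the triangle inequality. The names `edit_distance` and `choose_center` are illustrative only.

```python
def edit_distance(s: str, t: str) -> int:
    """Optimal pairwise alignment cost D(s, t) with unit gap/mismatch costs,
    computed row by row in O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # gap in t
                           cur[j - 1] + 1,            # gap in s
                           prev[j - 1] + (a != b)))   # match or mismatch
        prev = cur
    return prev[-1]


def choose_center(seqs: list):
    """Return the index of the string minimizing the sum of distances to all
    others, plus the list of all sums. Uses O(k^2) pairwise alignments."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals


seqs = ["AAAA", "AAAT", "AATT", "TTTT"]
center, totals = choose_center(seqs)
print(center, totals)
```

The $O(k^2)$ calls to `edit_distance`, each costing $O(n^2)$, account for the $O(k^2 n^2)$ running time.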

  13. Exercise 3 • Find the approximation ratio of the center-star algorithm • Use the following definitions: • $M^*$: an optimal alignment • $M$: the alignment produced by this algorithm • $d(i,j)$: the distance M induces on the pair $S_i, S_j$ • $D(S,T)$: the minimum cost of an alignment between S and T • With $S_1$ as the center, for all i: $d(1,i) = D(S_1,S_i)$ (we perform an optimal alignment between $S'_1$ and $S_i$, and σ(-,-) = 0)

  14. Solution
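One standard way to argue Exercise 3 is sketched below, under the assumptions of slide 12 and with $S_1$ as the center. It is not necessarily the derivation presented in the recitation. The symbol $d^*(i,j)$, the distance that $M^*$ induces on $S_i, S_j$, is introduced here for convenience, and sums over $i \ne j$ run over ordered pairs, so each equals twice the SP score.

```latex
\begin{align*}
2\,SP(M) &= \sum_{i \ne j} d(i,j)
  \le \sum_{i \ne j} \bigl( d(i,1) + d(1,j) \bigr)
  = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l), \\
2\,SP(M^*) &= \sum_{i \ne j} d^*(i,j)
  \ge \sum_{i} \sum_{j \ne i} D(S_i,S_j)
  \ge k \sum_{l=2}^{k} D(S_1,S_l), \\
\frac{SP(M)}{SP(M^*)} &\le \frac{2(k-1)}{k} = 2 - \frac{2}{k} < 2.
\end{align*}
```

The first line uses the triangle inequality together with $d(1,i) = D(S_1,S_i)$. The second uses that an induced pairwise alignment can never cost less than the optimal pairwise alignment, and that $S_1$ minimizes $\sum_j D(S_i,S_j)$ over all i.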

  15. Randomized approximation algorithms • Now we want to reduce the $O(k^2 n^2)$ time further • Randomized algorithms make random choices and are judged by their expected performance • Approximation ratio of a randomized algorithm: $E(c(ALG)/c(OPT))$ • The expectation is over the choices of the algorithm (coin tosses)

  16. Center-star reminder • Algorithm: • Compute the sequence “closest-to-all-others” • Align all sequences to it • Approximation: • (2-1/k) approximation to the optimal sum-of-pairs (SP) alignment • Complexity: • $O(k^2)$ pairwise alignments to choose the center star: $O(k^2 n^2)$ (the bottleneck!) • $O(k)$ pairwise alignments to construct the MSA: $O(kn^2)$

  17. Random-star • What if instead of picking the best sequence as the starting point, we pick a random sequence from the group? • Ex4: Show that for any r we can construct an example (choose a cost function, sequences and the number of sequences) on which a bad choice of starting sequence makes the algorithm more than r times worse than OPT: $d(M_b)/d(M_{opt}) > r$

  18. Solution • Bad sequence: CCC…C (k letters) • Other sequences: k copies of AGAG…AG and k copies of GAGA…GA (k letters each) • Costs: 1 for a gap and 1 for a mismatch

  19. Solution • $d(M_b)$, with the bad sequence as the starting point: • $k^2$ couples with k mismatches (AGAG vs GAGA) • 2k couples with k mismatches (CCCC vs AGAG/GAGA) • $d(M_b) = k^3 + 2k^2$ • $d(M_{opt})$, bounded by shifting the GAGA copies by one position: • $k^2$ couples with 2 gaps (AGAG vs GAGA) • k couples with k mismatches (CCCC vs AGAG) • k couples with k-1 mismatches and 2 gaps (CCCC vs GAGA) • $d(M_{opt}) \le 4k^2 + k$ • Ratio: at least $(k^3 + 2k^2)/(4k^2 + k) = \Theta(k)$ • Choose k large enough…
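The two cost formulas can be verified numerically by writing out both alignments and scoring them. The sketch below assumes k is even (so that AG…AG has exactly k letters) and uses the unit costs of slide 18. The function names are hypothetical.

```python
from itertools import combinations


def sp_cost(rows: list) -> int:
    """Sum-of-pairs cost with unit costs; equal symbols (incl. gap/gap) cost 0."""
    return sum(a != b
               for r, s in combinations(rows, 2)
               for a, b in zip(r, s))


def bad_star_alignment(k: int) -> list:
    """Alignment built around the bad center CCC...C: no gaps anywhere."""
    ag, ga = "AG" * (k // 2), "GA" * (k // 2)
    return ["C" * k] + [ag] * k + [ga] * k


def shifted_alignment(k: int) -> list:
    """Better alignment: the GAGA copies are shifted right by one column."""
    ag, ga = "AG" * (k // 2) + "-", "-" + "GA" * (k // 2)
    return ["C" * k + "-"] + [ag] * k + [ga] * k


for k in (4, 10, 20):  # k must be even in this sketch
    bad, good = sp_cost(bad_star_alignment(k)), sp_cost(shifted_alignment(k))
    assert bad == k**3 + 2 * k**2
    assert good == 4 * k**2 + k
    print(k, bad, good, round(bad / good, 2))
```

The printed ratio grows roughly like k/4, so any target r is exceeded once k is large enough.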

  20. Random-star • Apparently the idea is not that bad • Ex5: Show that if we choose the starting sequence at random from the group of k sequences, we get an expected 2-approximation: $E(d(M_b)/d(M_{opt})) \le 2$

  21. Solution

  22. Random-star • Ex6: Use this idea for an $O(kn^2)$ algorithm! • Solution: choose a sequence at random and proceed as in center-star • Complexity: • No need for the initial pairwise alignments • All subsequent alignments can be implemented in $O(kn^2)$
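A minimal sketch of random-star is given below, assuming unit gap and mismatch costs. The helpers `align`, `merge` and `random_star` are illustrative names, and the merging step follows the usual "once a gap, always a gap" rule. Only the k-1 pairwise alignments against the random center are computed.

```python
import random


def align(s: str, t: str):
    """Optimal global alignment of s and t with unit costs (DP + traceback)."""
    n, m = len(s), len(t)
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def merge(msa: list, c_al: str, s_al: str) -> list:
    """Add s_al to msa, whose first row is the (gapped) center.
    Gaps already in the center are never removed."""
    center, out, new, i, j = msa[0], [[] for _ in msa], [], 0, 0
    while i < len(center) or j < len(c_al):
        if i < len(center) and j < len(c_al) and center[i] == c_al[j]:
            for r, row in zip(out, msa):
                r.append(row[i])
            new.append(s_al[j]); i += 1; j += 1
        elif i < len(center) and center[i] == "-":
            for r, row in zip(out, msa):   # old gap column: pad the new row
                r.append(row[i])
            new.append("-"); i += 1
        else:                              # new gap in the center: pad old rows
            for r in out:
                r.append("-")
            new.append(s_al[j]); j += 1
    return ["".join(r) for r in out] + ["".join(new)]


def random_star(seqs: list, rng=random):
    """Pick a random center and align every other sequence to it."""
    c = rng.randrange(len(seqs))
    order, msa = [c], [seqs[c]]
    for idx, s in enumerate(seqs):
        if idx != c:
            msa = merge(msa, *align(seqs[c], s))
            order.append(idx)
    return order, msa


order, msa = random_star(["ACGT", "AGT", "ACGGT", "CGT"], random.Random(0))
print("\n".join(msa))
```

Passing a seeded `random.Random` instance makes a run reproducible. Calling `random_star(seqs)` with the default uses the module-level generator.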

  23. Progressive alignment • A heuristic method for finding MSA • There is no analytic bound on accuracy • Development is guided by empirical results • Examples: Feng-Doolittle, CLUSTALW

  24. Progressive alignment • General algorithm: • Globally align every pair of sequences • Use the alignment scores to construct a “guide tree” • Combine alignments (sequence-sequence, sequence-profile, profile-profile) according to the guide tree

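  25. Progressive alignment • (Figure: guide-tree example on the sequences WLW, WNW and FRF, with the intermediate alignment W-NW / WL-W and the combined alignment W-NW / WL-W / F-RF)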
