1 / 75

CSE 5290: Algorithms for Bioinformatics Fall 2009

This presentation discusses dynamic programming as an efficient algorithm for solving optimization problems in bioinformatics, focusing on sequence alignment. It explores examples such as counting paths in a grid and the Manhattan Tourist Problem.

poulos
Download Presentation

CSE 5290: Algorithms for Bioinformatics Fall 2009

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 5290:Algorithms for Bioinformatics Fall 2009 Suprakash Datta datta@cs.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cs.yorku.ca/course/5290 CSE 5290, Fall 2009

  2. Last time Greedy algorithms, genome rearrangements Next: Dynamic programming, sequence alignment Some of the following slides are based on slides by the authors of our text. CSE 5290, Fall 2009

  3. Dynamic programming (DP) • Typically used for optimization problems • Often results in efficient algorithms • Not applicable to all problems Caveats: • Need not yield poly-time algorithms • No unique formulations for most problems • May not rule out greedy algorithms CSE 5290, Fall 2009

  4. Example • Counting the number of shortest paths in a grid • Counting the number of shortest paths in a grid with blocked intersections • Finding paths in a weighted grid • Sequence alignment CSE 5290, Fall 2009

  5. Setting up DP in practice • The optimal solution should be computable as a (recursive) function of the solution to sub-problems • Solve sub-problems systematically and store solutions (to avoid duplication of work). CSE 5290, Fall 2009

  6. Number of paths in a grid • Combinatorial approach • DP approach: how can we decompose the problem into sub-problems ? CSE 5290, Fall 2009

  7. Number of paths in a grid with blocked intersections • Combinatorial approach? • DP approach: how can we decompose the problem into sub-problems ? CSE 5290, Fall 2009

  8. Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * * Sink CSE 5290, Fall 2009

  9. Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * * Sink CSE 5290, Fall 2009

  10. Manhattan Tourist Problem: Formulation Goal: Find the best path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” Output: A best path in Gfrom “source” to “sink” CSE 5290, Fall 2009

  11. MTP: An Example 0 1 2 3 4 j coordinate source 3 2 4 0 3 5 9 0 0 1 0 4 3 2 2 3 2 4 13 1 1 6 5 4 2 0 7 3 4 15 19 2 i coordinate 4 5 2 4 1 0 2 3 3 3 20 3 8 5 6 5 2 sink 1 3 2 CSE 5290, Fall 2009 23 4

  12. MTP: Greedy Algorithm Is Not Optimal 1 2 5 source 3 10 5 5 2 5 1 3 5 3 1 4 2 3 promising start, but leads to bad choices! 5 0 2 0 22 0 0 0 sink 18 CSE 5290, Fall 2009

  13. MTP: Simple Recursive Program MT(n,m) ifn=0 or m=0 returnMT(n,m) x  MT(n-1,m)+ length of the edge from (n- 1,m) to (n,m) y  MT(n,m-1)+ length of the edge from (n,m-1) to (n,m) returnmax{x,y} What’s wrong with this approach? CSE 5290, Fall 2009

  14. si-1, j + weight of the edge between (i-1, j) and (i, j) si, j-1 + weight of the edge between (i, j-1) and (i, j) max si, j = MTP: Recurrence Computing the score for a point (i,j) by the recurrence relation: The running time is n x m for a n by mgrid (n = # of rows, m = # of columns) CSE 5290, Fall 2009

  15. A2 A3 A1 B sA1 + weight of the edge (A1, B) sA2 + weight of the edge (A2, B) sA3 + weight of the edge (A3, B) max of sB = Manhattan Is Not A Perfect Grid What about diagonals? • The score at point B is given by: CSE 5290, Fall 2009

  16. max of sy + weight of vertex (y, x) where y є Predecessors(x) sx = Manhattan Is Not A Perfect Grid (cont’d) Computing the score for point x is given by the recurrence relation: • Predecessors (x) – set of vertices that have edges leading to x • The running time for a graph G(V, E) (V is the set of all vertices and Eis the set of all edges) is O(E) since each edge is evaluated once CSE 5290, Fall 2009

  17. Traveling in the Grid • The only hitch is that one must decide on the order in which visit the vertices • By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble. • We need to traverse the vertices in some order • Try to find such order for a directed cycle • ??? CSE 5290, Fall 2009

  18. DAG: Directed Acyclic Graph • Since Manhattan is not a perfect regular grid, we represent it as a DAG • DAG for Dressing in the morning problem CSE 5290, Fall 2009

  19. Topological Ordering • A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label • In other words, if vertices are positioned on a line in an increasing order of labels then all edges go from left to right. CSE 5290, Fall 2009

  20. Topological ordering • 2 different topological orderings of the DAG CSE 5290, Fall 2009

  21. Longest Path in DAG Problem • Goal: Find a longest path between two vertices in a weighted DAG • Input: A weighted DAG G with source and sink vertices • Output: A longest path in G from source to sink CSE 5290, Fall 2009

  22. max of sv= Longest Path in DAG: Dynamic Programming • Suppose vertex v has indegree 3 and predecessors {u1, u2, u3} • Longest path to v from source is: In General: sv = maxu (su + weight of edge from uto v) su1 + weight of edge fromu1to v su2 + weight of edge fromu2to v su3+ weight of edge fromu3to v CSE 5290, Fall 2009

  23. Traversing the Manhattan Grid a) b) • 3 different strategies: • a) Column by column • b) Row by row • c) Along diagonals c) CSE 5290, Fall 2009

  24. T A T G C C T A G T A A T Alignment: 2 row representation Given 2 DNA sequences v and w: v : m = 7 w : n = 6 Alignment : 2 * k matrix ( k > m, n ) letters of v A T -- G T T A T -- letters of w A T C G T -- A -- C 4 matches 2 insertions 2 deletions CSE 5290, Fall 2009

  25. Aligning DNA Sequences n = 8 V = ATCTGATG matches mismatches insertions deletions 4 m = 7 1 W = TGCATAC 2 match 2 mismatch V W deletion indels insertion CSE 5290, Fall 2009

  26. Aligning DNA Sequences - 2 • Brute force is infeasible…. • Number of alignments of X[1..n],Y[1..m], n<m is ( ) • For m=n, this is about 22n/pn m+n n CSE 5290, Fall 2009

  27. Longest Common Subsequence (LCS) – Alignment without Mismatches • Given two sequences • v = v1v2…vm and w = w1w2…wn • The LCS of vand w is a sequence of positions in • v: 1 < i1 < i2 < … < it< m • and a sequence of positions in • w: 1 < j1 < j2 < … < jt< n • such that it -th letter of v equals to jt-letter of wand t is maximal CSE 5290, Fall 2009

  28. 0 1 2 2 3 3 4 5 6 7 8 0 0 1 2 3 4 5 5 6 6 7 LCS: Example i coords: elements of v A T -- C -- T G A T C elements of w -- T G C A T -- A -- C j coords: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7) positions in v: 2 < 3 < 4 < 6 < 8 Matches shown inred positions in w: 1 < 3 < 5 < 6 < 7 Every common subsequence is a path in 2-D grid CSE 5290, Fall 2009

  29. LCS Problem as Manhattan Tourist Problem A T C T G A T C j 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 CSE 5290, Fall 2009

  30. Edit Graph for LCS Problem A T C T G A T C j 0 1 2 3 4 5 6 7 8 i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 CSE 5290, Fall 2009

  31. Edit Graph for LCS Problem A T C T G A T C j 0 1 2 3 4 5 6 7 8 Every path is a common subsequence. Every diagonal edge adds an extra element to common subsequence LCS Problem: Find a path with maximum number of diagonal edges i 0 T 1 G 2 C 3 A 4 T 5 A 6 C 7 CSE 5290, Fall 2009

  32. si-1, j si, j-1 si-1, j-1 + 1 if vi = wj max si, j = Computing LCS Let vi = prefix of v of length i: v1 … vi and wj = prefix of w of length j: w1 … wj The length of LCS(vi,wj) is computed by: CSE 5290, Fall 2009

  33. Computing LCS (cont’d) i-1,j i-1,j -1 0 1 si-1,j + 0 0 i,j -1 si,j = MAX si,j -1 + 0 i,j si-1,j -1 + 1, if vi = wj CSE 5290, Fall 2009

  34. Every Path in the Grid Corresponds to an Alignment W A T C G 0 1 2 2 3 4 V = A T - G T | | | W= A T C G – 0 1 2 3 4 4 V A T G T CSE 5290, Fall 2009

  35. A T A T A T A T T A T A T A T A Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences vand w : v : w: • The Hamming distance: dH(v, w) = 8 is large but the sequences are very similar CSE 5290, Fall 2009

  36. A T A T A T A T T A T A T A T A Aligning Sequences with Insertions and Deletions By shifting one sequence over one position: v : -- w: -- • The edit distance: dH(v, w) = 2. • Hamming distance neglects insertions and deletions in DNA CSE 5290, Fall 2009

  37. Edit Distance • Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v,w) = MIN number of elementary operations to transform v w CSE 5290, Fall 2009

  38. Edit Distance vs Hamming Distance Hamming distance always compares i-th letter of v with i-th letter of w V = ATATATAT W= TATATATA Hamming distance: d(v, w)=8 Computing Hamming distance is a trivial task. CSE 5290, Fall 2009

  39. Edit Distance vs Hamming Distance Edit distance may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT Just one shift Make it all line up W= TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)=8d(v, w)=2 Computing Hamming distance Computing edit distance is a trivial task is a non-trivial task CSE 5290, Fall 2009

  40. Edit Distance vs Hamming Distance Edit distance may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT W= TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)=8d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i ??? CSE 5290, Fall 2009

  41. Edit Distance: Example TGCATAT  ATCCGAT in 5 steps TGCATAT  (delete last T) TGCATA  (delete last A) TGCAT  (insert A at front) ATGCAT  (substitute C for 3rdG) ATCCAT  (insert G before last A) ATCCGAT (Done) CSE 5290, Fall 2009

  42. Edit Distance: Example TGCATAT  ATCCGAT in 5 steps TGCATAT  (delete last T) TGCATA  (delete last A) TGCAT  (insert A at front) ATGCAT  (substitute C for 3rdG) ATCCAT  (insert G before last A) ATCCGAT (Done) What is the edit distance? 5? CSE 5290, Fall 2009

  43. Edit Distance: Example (cont’d) TGCATAT  ATCCGAT in 4 steps TGCATAT  (insert A at front) ATGCATAT  (delete 6thT) ATGCATA  (substitute G for 5thA) ATGCGTA  (substitute C for 3rdG) ATCCGAT (Done) CSE 5290, Fall 2009

  44. Edit Distance: Example (cont’d) TGCATAT  ATCCGAT in 4 steps TGCATAT  (insert A at front) ATGCATAT  (delete 6thT) ATGCATA  (substitute G for 5thA) ATGCGTA  (substitute C for 3rdG) ATCCGAT (Done) Can it be done in 3 steps??? CSE 5290, Fall 2009

  45. The Alignment Grid • Every alignment path is from source to sink CSE 5290, Fall 2009

  46. A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - Corresponding path - CSE 5290, Fall 2009

  47. A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignments in Edit Graph (cont’d) • and represent indels in v and w with score 0. • represent matches with score 1. • The score of the alignment path is 5. CSE 5290, Fall 2009

  48. A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment: CSE 5290, Fall 2009

  49. A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph Old Alignment 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 New Alignment 0122345677 v= AT_GTTAT_ w= ATCG_TA_C 0123445667 CSE 5290, Fall 2009

  50. A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),(7,7) CSE 5290, Fall 2009

More Related