
B89902003 林聖凱 B89902005 高海峰 B89902027 謝俊瑋

The Longest Common Subsequence Problem for Arc-annotated Sequences, by Tao Jiang, Guo-Hui Lin, Bin Ma, and Kaizhong Zhang.



  1. The Longest Common Subsequence Problem for Arc-annotated Sequences
     Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang
     B89902003 林聖凱, B89902005 高海峰, B89902027 謝俊瑋
     2004 L.K.H@NTUCSIE

  2. Overview
     • Uses of arc-annotated sequences
        • The secondary and tertiary structure of RNA
        • Protein sequences
     • Solves the open questions in
        • P.A. Evans, Algorithms and Complexity for Annotated Sequence Analysis, Ph.D. Thesis, University of Victoria, 1999.
        • P.A. Evans, Finding common subsequences with pseudoknots, in Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM'99), LNCS 1645, pp. 270-280.

  3. Definitions (I)
     • Symbol definitions
        • S: sequence
        • P: arc set
     • Arc definition
        • The pair (S, P) is called an arc-annotated sequence

  4. Definitions (II)
     • Arc-preserving
        • The arc mapping is kept when performing LCS
     • Cutwidth
        • The number of arcs crossing a position
     • Arc-cutwidth
        • The maximum cutwidth over all positions of the sequence
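The two cutwidth definitions above can be illustrated with a minimal sketch. It assumes 0-based positions and that an arc (l, r) counts at position p when l <= p <= r (crossing or ending there); the slide does not state the boundary convention, and the function names are hypothetical.

```python
def cutwidth(arcs, pos):
    """Number of arcs (l, r) with l <= pos <= r (assumed convention)."""
    return sum(1 for l, r in arcs if l <= pos <= r)


def arc_cutwidth(length, arcs):
    """Maximum cutwidth over all positions 0 .. length-1 of the sequence."""
    return max((cutwidth(arcs, p) for p in range(length)), default=0)


# Two nested arcs on a sequence of length 8 overlap on positions 3..6.
print(arc_cutwidth(8, [(1, 7), (3, 6)]))
```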

  5. Restrictions (I)
     • The problem is NP-hard if there are no restrictions on the arc annotations
     • Fortunately, RNA and protein sequences satisfy some constraints

  6. Restrictions (II)
     1. No sharing of endpoints
     2. No crossing
     3. No nesting
     4. No arcs

  7. Restrictions (III)
     • Five levels
        • Unlimited: no restrictions
        • Crossing: restriction 1
        • Nested: restrictions 1, 2
        • Chain: restrictions 1, 2, 3
        • Plain: restriction 4
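As a rough illustration of the five levels, the sketch below returns the most restrictive level that a given arc set satisfies. Arcs are assumed to be pairs of 0-based positions, and `restriction_level` is a hypothetical helper name, not something defined in the slides.

```python
def restriction_level(arcs):
    """Most restrictive of the five levels satisfied by the arc set."""
    arcs = [tuple(sorted(a)) for a in arcs]
    if not arcs:
        return "plain"                      # restriction 4: no arcs
    endpoints = [p for a in arcs for p in a]
    if len(set(endpoints)) != len(endpoints):
        return "unlimited"                  # restriction 1 violated
    crossing = nesting = False
    for idx, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[idx + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                crossing = True             # restriction 2 violated
            if (l1 < l2 and r2 < r1) or (l2 < l1 and r1 < r2):
                nesting = True              # restriction 3 violated
    if crossing:
        return "crossing"
    if nesting:
        return "nested"
    return "chain"
```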

  8. Result (I)
     • |S1| = n, |S2| = m
     • (The results table is not preserved in the transcript.)

  9. Result (II)
     • LCS (crossing, crossing)
        • 2-approximation algorithm
     • LCS (crossing, plain)
        • MAX SNP-hard
     • LCS (nested, plain)
        • Dynamic programming algorithm

  10. LCS (crossing, crossing): definitions (I)
     • (S1, P1), (S2, P2): the arc-annotated sequences
     • Y: the common subsequence returned by a plain LCS computation, with |Y| = L
     • M: the mapping between S1 and S2 induced by Y
        • M = {(i1, j1), …, (iL, jL)}
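To make the mapping M concrete, here is a small checker for the arc-preserving property. It assumes the usual formal reading of "the arc mapping is kept": for any two matched pairs (i1, j1) and (i2, j2), (i1, i2) is an arc of P1 exactly when (j1, j2) is an arc of P2. The name `is_arc_preserving` is hypothetical.

```python
from itertools import combinations


def is_arc_preserving(mapping, p1, p2):
    """True if every pair of matched positions agrees on arc membership."""
    arcs1 = {frozenset(a) for a in p1}
    arcs2 = {frozenset(a) for a in p2}
    return all(
        (frozenset((i1, i2)) in arcs1) == (frozenset((j1, j2)) in arcs2)
        for (i1, j1), (i2, j2) in combinations(mapping, 2)
    )
```

For instance, matching both endpoints of an arc of S1 to two positions of S2 that are not joined by an arc fails the check.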

  11. LCS (crossing, crossing): definitions (II)
     • Graph GM
        • Vertices: the pairs (ik, jk), (il, jl), … of M
        • Every vertex of GM has degree at most 2

  12. LCS (crossing, crossing): algorithm

  13. LCS (crossing, crossing): result
     • LCS (crossing, crossing) has a 2-approximation algorithm with time complexity O(nm)
     • LCS (crossing, nested), LCS (crossing, chain), and LCS (crossing, plain) also have a 2-approximation algorithm with time complexity O(nm)
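The algorithm on slide 12 did not survive in the transcript. One way to reach a factor of 2 that is consistent with slides 10 and 11 is sketched below, under the assumption that GM joins two matched pairs whenever they violate arc preservation. With no shared endpoints, each pair then has at most one conflict through P1 and one through P2, so GM consists of paths and even cycles and a 2-colouring keeps at least half of M. All function names are hypothetical.

```python
def lcs_mapping(s1, s2):
    """Plain LCS by dynamic programming; returns the matched index pairs."""
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if s1[i] == s2[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    pairs, i, j = [], n, m
    while i and j:
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def approx_lapcs(s1, p1, s2, p2):
    """Keep at least half of a plain LCS mapping, dropping arc conflicts."""
    mapping = lcs_mapping(s1, s2)
    arcs1 = {frozenset(a) for a in p1}
    arcs2 = {frozenset(a) for a in p2}
    k = len(mapping)
    adj = [[] for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            in1 = frozenset((mapping[a][0], mapping[b][0])) in arcs1
            in2 = frozenset((mapping[a][1], mapping[b][1])) in arcs2
            if in1 != in2:                  # conflict edge of GM (assumed)
                adj[a].append(b)
                adj[b].append(a)
    color, keep = [None] * k, []
    for start in range(k):
        if color[start] is not None:
            continue
        color[start], comp, stack = 0, [start], [start]
        while stack:                        # 2-colour one component
            u = stack.pop()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    comp.append(v)
                    stack.append(v)
        side0 = [v for v in comp if color[v] == 0]
        side1 = [v for v in comp if color[v] == 1]
        keep.extend(side0 if len(side0) >= len(side1) else side1)
    return sorted(mapping[v] for v in keep)
```

Since an optimal arc-preserving common subsequence is no longer than a plain LCS, keeping half of M is within a factor of 2 of the optimum. The conflict-graph loop is quadratic in |M|, which is at most nm.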

  14. LCS (unlimited, plain)
     • Prove that it cannot be approximated within ratio (the ratio formula is missing from the transcript)
     • Lemma 1
        • MaxIS-B is Max SNP-complete when B >= 3
     • Lemma 2
        • MaxIS-Cubic is Max SNP-complete

  15. Proof of Lemma 2 (I)
     • L-reduction from MaxIS-3 to MaxIS-Cubic
     • G(V, E): an instance of MaxIS-3 with n vertices
        • i vertices of degree 1
        • j vertices of degree 2
        • n - i - j vertices of degree 3
     • V': a maximum independent set (IS) of G
     • opt(G) = |V'|

  16. Proof of Lemma 2 (II)
     • Trivially, opt(G) >= n/4
     • i + j <= 4 * opt(G)
     • G': an instance of MaxIS-Cubic
     • opt(G'): the size of a maximum IS of G'
     • Goal: construct G' from G and a special graph H

  17. Proof of Lemma 2 (III)
     • Graph H (figure not preserved in the transcript)
        • Number of triangles: 2i + j
        • Cycle size: 2(2i + j)

  18. Proof of Lemma 2 (IV)
     • H has a maximum IS of size 2(2i + j)
     • Construct G'
        • Connect each degree-1 vertex of G to two free vertices in H
        • Connect each degree-2 vertex of G to one free vertex in H
     • G' is a cubic graph

  19. Proof of Lemma 2 (V)
     • k' = opt(G) + 2(2i + j)
     • k': the size of an IS of G' built from a maximum IS of G
     • opt(G') >= opt(G) + 2(2i + j) ……(1)

  20. Proof of Lemma 2 (VI)
     • In the other direction
     • V'': an IS of G', with |V''| = k'
     • Deleting the vertices of V'' that lie in H leaves an IS of G of size k
     • At most 2(2i + j) vertices of V'' are in H
     • k >= k' - 2(2i + j) ……(2)

  21. Proof of Lemma 2 (VII)
     • From (1) (formula not preserved in the transcript)
     • From (2) (formula not preserved in the transcript)
     • The L-reduction holds
     • MaxIS-Cubic is Max SNP-complete
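The formulas behind "From (1)" and "From (2)" are lost, so the following is only a sketch of how the two L-reduction conditions might follow from inequalities (1) and (2) together with i + j <= 4 * opt(G) from slide 16. The constants 17 and 1 are derived here and are not taken from the slides.

```latex
\begin{align*}
\text{(2) for a maximum IS of } G'
  &:\quad \mathrm{opt}(G) \;\ge\; \mathrm{opt}(G') - 2(2i+j) \\
\text{combined with (1)}
  &:\quad \mathrm{opt}(G') \;=\; \mathrm{opt}(G) + 2(2i+j) \\
  &\phantom{:}\quad \le\; \mathrm{opt}(G) + 4(i+j)
     \;\le\; \mathrm{opt}(G) + 16\,\mathrm{opt}(G)
     \;=\; 17\,\mathrm{opt}(G) \\
\text{(2) for any IS of } G' \text{ of size } k'
  &:\quad \mathrm{opt}(G) - k \;\le\; \mathrm{opt}(G) + 2(2i+j) - k'
     \;=\; \mathrm{opt}(G') - k'
\end{align*}
```

The first chain bounds the optimum of the constructed instance by a constant multiple of opt(G); the last line bounds the error of the recovered solution by the error of the given one.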

  22. Proof of LCS(unlimited, plain) (I)
     • Show that MaxIS can be L-reduced to LCS(unlimited, plain)
     • MaxIS cannot be approximated

  23. Proof of LCS(unlimited, plain) (II)
     • G(V, E): an instance of MaxIS with n vertices
     • I: the instance of LCS consisting of
        • S1 = a^n with P1 = E
        • S2 = a^n with P2 = ∅
     • An independent set {v_i1, …, v_ik} of G corresponds one-to-one to the arc-preserving common subsequence consisting of the i1-th, …, ik-th a's of S1
     • So LCS(unlimited, plain) includes MaxIS as a subproblem
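The correspondence can be checked by brute force on a tiny graph. The sketch below builds the instance described above and compares an exhaustive search for the longest arc-preserving common subsequence with an exhaustive search for a maximum independent set. The helper names are hypothetical, vertices are numbered from 0, and the search is exponential, so it is only meant for very small inputs.

```python
from itertools import combinations


def maxis_to_lcs_instance(n, edges):
    """S1 = a^n with P1 = E, S2 = a^n with P2 empty."""
    return "a" * n, {frozenset(e) for e in edges}, "a" * n, set()


def lapcs_bruteforce(s1, p1, s2, p2):
    """Longest arc-preserving common subsequence by exhaustive search."""
    for k in range(min(len(s1), len(s2)), 0, -1):
        for xs in combinations(range(len(s1)), k):
            for ys in combinations(range(len(s2)), k):
                if any(s1[i] != s2[j] for i, j in zip(xs, ys)):
                    continue
                pairs = list(zip(xs, ys))
                if all((frozenset((a[0], b[0])) in p1)
                       == (frozenset((a[1], b[1])) in p2)
                       for a, b in combinations(pairs, 2)):
                    return k
    return 0


def max_independent_set_size(n, edges):
    """Size of a maximum independent set by exhaustive search."""
    edge_set = {frozenset(e) for e in edges}
    for k in range(n, 0, -1):
        for vs in combinations(range(n), k):
            if not any(frozenset(p) in edge_set
                       for p in combinations(vs, 2)):
                return k
    return 0
```

On a 5-cycle, for example, both searches return the same value, as the one-to-one correspondence predicts.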

  24. Corollary
     • LCS(unlimited, chain), LCS(unlimited, nested), and LCS(unlimited, unlimited) cannot be approximated within ratio (the ratio formula is missing from the transcript)

  25. LCS(crossing, plain) is MAX SNP-hard
     • Use an L-reduction from MaxIS-Cubic to LCS(crossing, plain)
     • G(V, E) is a cubic graph, n = |V|
     • For S1, construct a segment Tu of letters aaaabbccc for each vertex u ∈ V
     • For each edge (u, v), introduce an arc between a "c" of Tu and a "c" of Tv; each letter c can be used only once

  26. Instance I constructed from cubic graph G
     • S2 is obtained by concatenating n identical segments aaaacccbb
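A minimal sketch of the construction from slides 25 and 26 follows. The slides do not say which of the three c's of a segment an edge should use, so the code simply takes the next unused one; that choice, the 0-based vertex numbering, and the name `build_instance` are assumptions.

```python
SEG1 = "aaaabbccc"   # segment T_u of S1, one per vertex
SEG2 = "aaaacccbb"   # segment repeated n times to form S2


def build_instance(n, edges):
    """Return (S1, P1, S2) for a cubic graph on vertices 0 .. n-1."""
    used = [0] * n                       # c's already consumed per segment
    first_c = SEG1.index("c")
    arcs = []
    for u, v in edges:
        ends = []
        for w in (u, v):
            if used[w] == 3:
                raise ValueError("vertex %d has degree above 3" % w)
            ends.append(w * len(SEG1) + first_c + used[w])
            used[w] += 1
        arcs.append((min(ends), max(ends)))
    if any(c != 3 for c in used):
        raise ValueError("graph is not cubic")
    return SEG1 * n, arcs, SEG2 * n
```

Because every c is an endpoint of exactly one arc, no two arcs share an endpoint, which matches the crossing level; S2 carries no arcs and is therefore plain.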

  27. Proof (1)
     • opt(I) ≥ opt(G) + 6n
     • Assume Y is an arc-preserving common subsequence of length k' for (S1, P1) and (S2, P2)
        • (1) In each segment, all four a's should be matched
        • (2) If a "b" is matched, then no "c" is matched, and vice versa

  28. Proof (2)
     • Define a subset V' of the vertices of G: for every segment Tu in S1, if all three of its c's are matched, put u in V'
     • V' is an independent set of G; let k = |V'|
     • k ≥ k' - 6n, and n/4 ≤ opt(G) ≤ n/2
     • opt(I) = opt(G) + 6n ≤ 25 opt(G) (a)
     • |k - opt(G)| ≤ |k' - opt(I)| (b)

  29. Proof (3)
     • Inequalities (a) and (b) show that the reduction is an L-reduction, so LCS(crossing, plain) is MAX SNP-hard
     • LCS(crossing, chain), LCS(crossing, nested), and LCS(crossing, crossing) are therefore all MAX SNP-hard

  30. Notes
     • Consider an additional constraint:
        • for any (i1, j1) in the mapping, if (i1, i2) ∈ P1 then (i2, j2) is in the mapping for some j2, and if (j1, j2) ∈ P2 then (i2, j2) is in the mapping for some i2
     • Under this definition, LCS(crossing, crossing) is NP-hard and LCS(crossing, nested) is solvable in polynomial time

  31. LCS(nested, plain)
     • Input: a pair (S1, P1) and (S2, Ø) of arc-annotated sequences, with P1 nested
     • Output: the length of a longest arc-preserving common (LAPC) subsequence for the pair; the LAPC subsequence carries no arcs

  32. Notation: u(i)
     • n = |S1|
     • m = |S2|
     • u(i) denotes the arc in P1 incident on position i of sequence S1
        • u(i)_l and u(i)_r are the left and right endpoints of that arc
     • If u(i) does not exist, position i is called "free"
     • x(S1[i], S2[j]) = 1 if S1[i] = S2[j], and 0 otherwise

  33. Dynamic Programming Algorithm
     • Alas, I know little about dynamic programming
     • But I know divide-and-conquer
     • pang feng says DP is bottom-up, D&C is top-down

  34. Divide and Conquer algorithm
     • Two functions:
        • 弧DP(i1,i2;j,j') (the "arc" case) gives the length of a LAPC subsequence for the pair (S1[i1, i2]) and (S2[j,j'], Ø); it applies if and only if i1 = u(i2)_l
        • 無DP(i,i';j,j') (the "no arc" case) gives the length of a LAPC subsequence for the pair (S1[i, i']) and (S2[j,j'], Ø); it applies if and only if i < u(i')_l or i' is free
     • How?

  35. Divide and Conquer algorithm
     • 無DP(i,i';j,j'): if i' is free, use the simple LCS algorithm:
        無DP(i,i';j,j') = max {
           無DP(i,i'-1;j,j'-1) + x(S1[i'], S2[j']),
           無DP(i,i'-1;j,j'),
           無DP(i,i';j,j'-1)
        }

  36. 無DP(i,i';j,j')
     • Else if i' = u(i')_r and i < u(i')_l:
        無DP(i,i';j,j') = max over j ≤ j'' ≤ j'+1 of { 無DP(i, u(i')_l - 1; j, j''-1) + 弧DP(u(i')_l, i'; j'', j') }
     • An empty interval of S2 contributes 0, as in the examples on slides 42 and 44

  37. 無DP(i,i';j,j')
     • Else (i = u(i')_l)
        • Just call 弧DP(i,i';j,j')

  38. 弧DP(i1,i2;j,j')
        弧DP(i1,i2;j,j') = max {
           無DP(i1+1, i2-1; j+1, j') + x(S1[i1], S2[j]),
           無DP(i1+1, i2-1; j, j'-1) + x(S1[i2], S2[j']),
           無DP(i1+1, i2-1; j, j'),
           弧DP(i1, i2; j, j'-1),
           弧DP(i1, i2; j+1, j')
        }
     • Merge 弧DP and 無DP into a single DP
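The recurrences of slides 35 to 38 can be put together as a memoized top-down sketch, with 弧DP and 無DP merged into one function as the last bullet suggests. It assumes 0-based positions, a nested arc set given as (left, right) pairs, and that an empty interval of either sequence contributes 0; `lapcs_nested_plain`, `dp`, and `partner` are hypothetical names.

```python
from functools import lru_cache


def lapcs_nested_plain(s1, arcs, s2):
    """Length of a longest arc-preserving common subsequence of
    (s1, arcs) and (s2, no arcs), for a nested arc set."""
    partner = {}
    for left, right in arcs:
        partner[left], partner[right] = right, left

    def x(a, b):
        return 1 if a == b else 0

    @lru_cache(maxsize=None)
    def dp(i, i2, j, j2):
        if i > i2 or j > j2:
            return 0                      # empty interval
        left = partner.get(i2)
        if left is None or left > i2 or left < i:
            # i2 is free inside [i, i2]: plain LCS step (slide 35)
            return max(dp(i, i2 - 1, j, j2 - 1) + x(s1[i2], s2[j2]),
                       dp(i, i2 - 1, j, j2),
                       dp(i, i2, j, j2 - 1))
        if left > i:
            # arc (left, i2) ends at i2 and starts after i: split (slide 36)
            return max(dp(i, left - 1, j, k - 1) + dp(left, i2, k, j2)
                       for k in range(j, j2 + 2))
        # left == i: the arc spans the whole interval (slide 38);
        # at most one of its two endpoints may be matched
        return max(dp(i + 1, i2 - 1, j + 1, j2) + x(s1[i], s2[j]),
                   dp(i + 1, i2 - 1, j, j2 - 1) + x(s1[i2], s2[j2]),
                   dp(i + 1, i2 - 1, j, j2),
                   dp(i, i2, j, j2 - 1),
                   dp(i, i2, j + 1, j2))

    return dp(0, len(s1) - 1, 0, len(s2) - 1)
```

With S1 = ATGCATGC, arcs (2,8) and (4,7) in 1-based positions, and S2 = AT, the call `lapcs_nested_plain("ATGCATGC", [(1, 7), (3, 6)], "AT")` evaluates to 2. Top-down memoization is used here only for brevity; a bottom-up table per (i, i') interval computes the same values.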

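  39. Example
     • Top-down approach
     • S1 = A T G C A T G C (positions 1 to 8), S2 = A T
     • Recursion tree over intervals (i, i') of S1:
        • (1,8) splits into (1,1) and (2,8)
        • (2,8) is an arc; its inside is (3,7)
        • (3,7) splits into (3,3) and (4,7)
        • (4,7) is an arc; its inside is (5,6)
        • (5,6) reduces to (5,5)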

  40. Example: bottom up
     • S1 = A T G C A T G C (positions 1 to 8), S2 = A T
     • (5,5): table T, where T[1,1] denotes DP(5,5;1,1), T[1,2] denotes DP(5,5;1,2), and T[2,2] denotes DP(5,5;2,2)
     • (5,6): position 6 is free, so the simple LCS recurrence applies:
        DP(5,6;1,2) = max {
           DP(5,5;1,1) + x(S1[6], S2[2]),
           DP(5,5;1,2),
           DP(5,6;1,1)
        }

  41. Example: bottom up
     • (4,7): an arc, so use 弧DP:
        弧DP(4,7;1,2) = max {
           DP(5,6;2,2) + x(S1[4], S2[1]),
           DP(5,6;1,1) + x(S1[7], S2[2]),
           DP(5,6;1,2),
           DP(4,7;1,1),
           DP(4,7;2,2)
        }

  42. Example: bottom up
     • (3,7): 7 = u(7)_r and 3 < u(7)_l, so combine the tables of (3,3) and (4,7):
        DP(3,7;1,2) = max {
           DP(3,3;1,0) + DP(4,7;1,2),
           DP(3,3;1,1) + DP(4,7;2,2),
           DP(3,3;1,2) + DP(4,7;3,2)
        }

  43. Example: bottom up
     • (2,8): an arc, so use 弧DP:
        弧DP(2,8;1,2) = max {
           DP(3,7;2,2) + x(S1[2], S2[1]),
           DP(3,7;1,1) + x(S1[8], S2[2]),
           DP(3,7;1,2),
           DP(2,8;1,1),
           DP(2,8;2,2)
        }

  44. Example: bottom up
     • (1,8): 8 = u(8)_r and 1 < u(8)_l, so combine the tables of (1,1) and (2,8)
     • Answer:
        DP(1,8;1,2) = max {
           DP(1,1;1,0) + DP(2,8;1,2),
           DP(1,1;1,1) + DP(2,8;2,2),
           DP(1,1;1,2) + DP(2,8;3,2)
        }

  45. Time Complexity
     • Table size: m(m+1)/2 = O(m^2), one entry per interval [j, j'] of S2
     • Number of tables, one per possible interval (i, i') of S1:
        • Arc: at most n/2 = O(n)
        • Inside an arc: at most as many as there are arcs
        • Free: at most O(n)
     • Table entries in total:
        • O(n) * O(m^2) = O(nm^2)

  46. Time Complexity
     • Computing one entry costs at most O(m), because of the maximum over j'' in
        無DP(i,i';j,j') = max { 無DP(i, u(i')_l - 1; j, j''-1) + 弧DP(u(i')_l, i'; j'', j') }
     • Time complexity:
        • O(m) * O(nm^2) = O(nm^3)

  47. Extending the LCS(nested, plain) Algorithm
     • Extend to LCS(nested, chain)
        • Add two new values α, β to DP(i,i';j,j')
        • DP(i,i';j,j'; α,β)
     • Extend to LCS(crossing, nested)
        • Restrict the cutwidth to a constant k
        • Add k pairs (αi, βi) to DP(i,i';j,j')

  48. LCS(nested, chain): notation
     • -: denotes nothing
     • ρ: the rightmost position of [j, j'-1] other than α, β

  49. Modification (I)
     • If i' is free and j' = u(j')_l:
        DP(i,i';j,j'; α,-) = max {
           DP(i,i'-1;j,ρ; α',-) + x(S1[i'], S2[j']),
           DP(i,i'-1;j,j'; α,-),
           DP(i,i';j,ρ; α',-)
        }
     • DP(i,i';j,j'; α,j') = DP(i,i';j,j'; α,-)
     • If α < ρ then α' = α, else α' = -

  50. Modification (II)
     • If i' is free and j' = u(j')_r (≠ α):
        DP(i,i';j,j'; α,-) = max {
           DP(i,i'-1;j,ρ; α,β') + x(S1[i'], S2[j']),
           DP(i,i'-1;j,j'; α,-),
           DP(i,i';j,j'-1; α,-)
        }
     • DP(i,i';j,j'; α,j') = DP(i,i';j,j'-1; α,-)
     • If j <= u(j')_l < ρ then β' = u(j')_l, else β' = -
