Comprehensive Study on Arc-Annotated Sequences for RNA and Protein Analysis

The Longest Common Subsequence Problem for Arc-annotated Sequences • Tao Jiang, Guo-Hui Lin, Bin Ma, Kaizhong Zhang • B89902003 林聖凱, B89902005 高海峰, B89902027 謝俊瑋 • 2004 L.K.H@NTUCSIE

Overview • Uses of arc-annotated sequences • The secondary and tertiary structure of RNA • Protein sequences • Solves the open questions posed in • P.A. Evans, Algorithms and Complexity for Annotated Sequence Analysis, Ph.D. Thesis, University of Victoria, 1999. • P.A. Evans, Finding common subsequences with pseudoknots, in Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching (CPM’99), LNCS 1645, pp. 270-280.

Definitions (I) • Symbol definition • S: sequence • P: arc set • Arc definition • An arc is a pair of positions (i1, i2) of S with i1 < i2 • The pair (S, P) is called an arc-annotated sequence

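Definitions (II) • Arc-preserving • The arc mapping is kept when performing LCS: two matched positions are joined by an arc in one sequence exactly when their images are joined by an arc in the other • Cutwidth • The number of arcs crossing a given position • Arc-cutwidth • The maximum cutwidth over all positions of the sequence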
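The cutwidth definitions can be made concrete with a small sketch. It assumes arcs are given as 1-based position pairs (l, r) with l < r, and that an arc "crosses" position p when l <= p <= r; the slide does not say whether endpoints count, so that convention and the function names `cutwidth_at` and `arc_cutwidth` are assumptions.

```python
def cutwidth_at(arcs, p):
    """Number of arcs (l, r) crossing position p (assumed convention: l <= p <= r)."""
    return sum(1 for l, r in arcs if l <= p <= r)


def arc_cutwidth(n, arcs):
    """Maximum cutwidth over all positions 1..n of a length-n sequence."""
    return max((cutwidth_at(arcs, p) for p in range(1, n + 1)), default=0)


# Two nested arcs on a sequence of length 8 (illustrative data only).
arcs = [(2, 8), (4, 7)]
print(arc_cutwidth(8, arcs))
```

Under this convention, positions 4 through 7 lie under both arcs, so they determine the arc-cutwidth of the example.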

Restrictions (I) • The problem is NP-hard if there are no restrictions on the arc annotations • Fortunately, RNA and protein sequences satisfy some constraints

Restrictions (II) • 1. No sharing of endpoints • 2. No crossing • 3. No nesting • 4. No arcs

Restrictions (III) • Five levels • Unlimited: no restrictions • Crossing: restriction 1 • Nested: restrictions 1, 2 • Chain: restrictions 1, 2, 3 • Plain: restriction 4
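As an illustration of the five levels, the sketch below classifies an arc set by the most restrictive level it satisfies. It assumes arcs are pairs (l, r) with l < r, that two arcs (a, b) and (c, d) with a < c cross when c < b < d, and that they nest when d < b; the function name `restriction_level` is hypothetical.

```python
from itertools import combinations


def restriction_level(arcs):
    """Return the most restrictive of the five levels that the arc set satisfies."""
    if not arcs:
        return "plain"                       # restriction 4: no arcs
    endpoints = [p for arc in arcs for p in arc]
    if len(endpoints) != len(set(endpoints)):
        return "unlimited"                   # restriction 1 (no shared endpoints) fails
    # After sorting, every pair ((a, b), (c, d)) has a < c.
    pairs = list(combinations(sorted(arcs), 2))
    if any(c < b < d for (a, b), (c, d) in pairs):
        return "crossing"                    # restriction 2 (no crossing) fails
    if any(d < b for (a, b), (c, d) in pairs):
        return "nested"                      # restriction 3 (no nesting) fails
    return "chain"


for example in ([], [(1, 2), (3, 4)], [(2, 8), (4, 7)], [(1, 3), (2, 4)], [(1, 2), (2, 3)]):
    print(example, restriction_level(example))
```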

Result (I) • |S1| = n, |S2| = m

Result (II) • LCS (crossing, crossing) • 2-approximation algorithm • LCS (crossing, plain) • MAX SNP-hard • LCS (nested, plain) • Dynamic programming algorithm

LCS (crossing, crossing) - def (I) • (S1, P1), (S2, P2): the two arc-annotated sequences • Y: the result of the ordinary LCS() on S1 and S2 • |Y| = L • M: the mapping between S1 and S2 induced by Y • M = {(i1, j1), …, (iL, jL)}

LCS (crossing, crossing) - def (II) • Graph G_M • Vertices: the pairs (ik, jk), (il, jl), … of M • The maximum degree of a vertex of G_M is ≤ 2

LCS (crossing, crossing) - Algorithm

LCS (crossing, crossing) - Result • LCS (crossing, crossing) has a 2-approximation algorithm with time complexity O(nm) • LCS (crossing, nested), LCS (crossing, chain), and LCS (crossing, plain) also have a 2-approximation algorithm with time complexity O(nm)
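The slides do not spell out the algorithm, so the following is only a sketch of one way a 2-approximation with these ingredients could work. It assumes that two vertices of G_M are joined when exactly one of the two sequences has an arc between the matched positions (this edge rule is an assumption, not stated on the slides). With no shared endpoints, every vertex then has degree at most 2 and the graph is bipartite, so the larger colour class of each component keeps at least half of the ordinary LCS. All function names are hypothetical.

```python
def lcs_mapping(s1, s2):
    """Ordinary LCS by dynamic programming; returns matched 1-based pairs (i, j)."""
    n, m = len(s1), len(s2)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                      # trace one optimal mapping back
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i, j))
            i, j = i - 1, j - 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def two_approx(s1, p1, s2, p2):
    """Arc-preserving mapping of size >= |LCS| / 2 (crossing-level inputs assumed)."""
    m = lcs_mapping(s1, s2)
    a1 = {frozenset(a) for a in p1}
    a2 = {frozenset(a) for a in p2}
    adj = {k: [] for k in range(len(m))}
    for k in range(len(m)):
        for l in range(k + 1, len(m)):
            in1 = frozenset((m[k][0], m[l][0])) in a1
            in2 = frozenset((m[k][1], m[l][1])) in a2
            if in1 != in2:                      # assumed edge rule: arc on one side only
                adj[k].append(l)
                adj[l].append(k)
    color, keep = {}, []
    for start in adj:                           # 2-colour each component
        if start in color:
            continue
        color[start] = 0
        comp, stack = [start], [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in color:
                    color[w] = 1 - color[v]
                    comp.append(w)
                    stack.append(w)
        side0 = [v for v in comp if color[v] == 0]
        side1 = [v for v in comp if color[v] == 1]
        keep += side0 if len(side0) >= len(side1) else side1
    return [m[k] for k in sorted(keep)]


print(two_approx("abcd", [(1, 3), (2, 4)], "abcd", [(1, 2), (3, 4)]))
```

The graph is built in O(L^2) time, which is at most O(nm) because L ≤ min(n, m), so the sketch stays within the bound quoted on the slide.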

LCS (unlimited, plain) • Prove that it cannot be approximated within a certain ratio • Lemma 1 • MaxIS-B is Max SNP-complete when B ≥ 3 • Lemma 2 • MaxIS-Cubic is Max SNP-complete

Proof of Lemma 2 (I) • L-reduction from MaxIS-3 to MaxIS-Cubic • G(V,E): an instance of MaxIS-3 with n vertices • i vertices of degree 1 • j vertices of degree 2 • n-i-j vertices of degree 3 • V’: a maximum independent set (IS) of G • opt(G) = |V’|

Proof of Lemma 2 (II) • Trivially, opt(G) ≥ n/4 • Hence i+j ≤ n ≤ 4·opt(G) • G’: an instance of MaxIS-Cubic • opt(G’): the size of a maximum IS of G’ • Goal: construct G’ from G and a special graph H

Proof of Lemma 2 (III) • The special graph H • Number of triangles: 2i+j • Cycle size: 2(2i+j)

Proof of Lemma 2 (IV) • H has a maximum IS of size 2(2i+j) • Construct G’ • Connect each degree-1 vertex of G to two free vertices of H • Connect each degree-2 vertex of G to one free vertex of H • G’ is a cubic graph

Proof of Lemma 2 (V) • k’ = opt(G) + 2(2i+j) • k’: the size of one IS of G’ • opt(G’) ≥ opt(G) + 2(2i+j) … (1)

Proof of Lemma 2 (VI) • Conversely • V’’: an IS of G’, |V’’| = k’ • Deleting the vertices of V’’ that lie in H gives an IS of G of size k • At most 2(2i+j) vertices of V’’ are in H • k ≥ k’ - 2(2i+j) … (2)

Proof of Lemma 2 (VII) • From (1) • From (2) • So the reduction is an L-reduction • MaxIS-Cubic is Max SNP-complete
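One way to fill in the two steps, using only (1), (2) and the bound i + j ≤ 4·opt(G), is sketched below. The constant 17 is what this derivation yields and is not taken from the slides.

```latex
\begin{align*}
\mathrm{opt}(G') &= \mathrm{opt}(G) + 2(2i+j)
   && \text{by (1), and by (2) applied to a maximum IS of } G' \\
 &\le \mathrm{opt}(G) + 4(i+j) \\
 &\le 17\,\mathrm{opt}(G)
   && \text{since } i + j \le 4\,\mathrm{opt}(G) \\
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) - k' + 2(2i+j) = \mathrm{opt}(G') - k'
   && \text{by (2)}
\end{align*}
```

The first chain bounds the optimum of the new instance by a constant multiple of the old one, and the last line bounds the error of the recovered solution by the error of the given one; these are the two conditions an L-reduction asks for.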

Proof of LCS(unlimited, plain) (I) • Show that MaxIS can be L-reduced to LCS(unlimited, plain) • MaxIS itself cannot be approximated within a certain ratio

Proof of LCS(unlimited, plain) (II) • G(V,E): an instance of MaxIS, n = |V| • I: the LCS instance consisting of • S1 = a^n with P1 = E • S2 = a^n with P2 = Ø • An independent set {v_i1, …, v_ik} of G corresponds one-to-one to the arc-preserving common subsequence consisting of the i1-th, …, ik-th a’s of S1 • So LCS() includes MaxIS as a subproblem.
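The correspondence can be checked on a small graph with a brute-force sketch. It assumes vertices are numbered 1..n so that an edge (u, v) doubles as an arc between positions u and v; because S2 carries no arcs, a choice of positions in S1 is arc-preserving exactly when no arc of P1 joins two chosen positions. Both solvers are exponential reference code with hypothetical names, not the method of the slides.

```python
from itertools import combinations


def is_subsequence(t, s):
    """True if string t is a subsequence of string s."""
    it = iter(s)
    return all(ch in it for ch in t)


def lapcs_vs_plain_bruteforce(s1, p1, s2):
    """Exponential reference solver for LCS(unlimited, plain)."""
    arcs = {frozenset(a) for a in p1}
    n = len(s1)
    for k in range(n, 0, -1):
        for pos in combinations(range(1, n + 1), k):
            # With P2 empty, no arc of P1 may join two chosen positions.
            if any(frozenset(pair) in arcs for pair in combinations(pos, 2)):
                continue
            if is_subsequence("".join(s1[p - 1] for p in pos), s2):
                return k
    return 0


def max_independent_set_size(n, edges):
    """Exponential reference solver for MaxIS on vertices 1..n."""
    es = {frozenset(e) for e in edges}
    for k in range(n, 0, -1):
        for vs in combinations(range(1, n + 1), k):
            if not any(frozenset(p) in es for p in combinations(vs, 2)):
                return k
    return 0


def reduce_maxis_to_lcs(n, edges):
    """Instance I from the slide: S1 = a^n with P1 = E, S2 = a^n with no arcs."""
    return "a" * n, list(edges), "a" * n


edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]   # a 5-cycle, illustrative only
s1, p1, s2 = reduce_maxis_to_lcs(5, edges)
print(lapcs_vs_plain_bruteforce(s1, p1, s2), max_independent_set_size(5, edges))
```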

Corollary • LCS(unlimited, chain), LCS(unlimited, nested), and LCS(unlimited, unlimited) cannot be approximated within that ratio either

LCS(crossing, plain) is MAX SNP-hard • Use an L-reduction from MaxIS-Cubic to LCS(crossing, plain) • G(V, E) is a cubic graph, n = |V| • For S1, construct a segment Tu of letters aaaabbccc for each vertex u ∈ V • For each edge (u, v), introduce an arc between a “c” from Tu and a “c” from Tv; each letter c can be used only once

Instance I constructed from cubic graph G • S2 is obtained by concatenating n identical segments of aaaacccbb
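The construction of instance I can be sketched as follows. The sketch assumes 0-based vertex numbers, 1-based sequence positions, and that the three c's of each segment are handed out to the incident edges in input order; the slides do not fix an order, and the function name `build_instance` is hypothetical.

```python
def build_instance(n, edges):
    """Build (S1, P1, S2) from a cubic graph on vertices 0..n-1."""
    seg1, seg2 = "aaaabbccc", "aaaacccbb"
    s1 = seg1 * n                       # segment T_u occupies positions 9u+1 .. 9u+9
    s2 = seg2 * n                       # S2 carries no arcs
    used_c = [0] * n                    # how many c's of T_u already carry an arc
    p1 = []
    for u, v in edges:
        iu = 9 * u + 7 + used_c[u]      # next unused c of T_u (offsets 7, 8, 9)
        iv = 9 * v + 7 + used_c[v]      # next unused c of T_v
        used_c[u] += 1
        used_c[v] += 1
        p1.append((min(iu, iv), max(iu, iv)))
    return s1, p1, s2


# K4 is the smallest cubic graph; used here purely as illustrative input.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
s1, p1, s2 = build_instance(4, k4)
print(s1)
print(p1)
print(s2)
```

Since every vertex of a cubic graph has exactly three incident edges, each c receives exactly one arc, so P1 has no shared endpoints and satisfies the crossing-level restriction.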

Proof (1) • Opt(I) ≥ Opt(G) + 6n • Assume Y is an arc-preserving common subsequence of length k’ for (S1, P1) and (S2, P2) • (1) In each segment, all four “a” should be matched • (2) Within a segment, if a “b” is matched then no “c” is matched, and vice versa

Proof (2) • Define a subset V’ of the vertices of G: for every segment Tu in sequence S1, if all three of its “c” are matched, we put u in V’ • V’ is an independent set of G; let k = |V’| • k ≥ k’ - 6n, n/4 ≤ opt(G) ≤ n/2 • Opt(I) = Opt(G) + 6n ≤ 25·Opt(G) (a) • |k - opt(G)| ≤ |k’ - opt(I)| (b)
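The two inequalities follow from k ≥ k’ - 6n and n/4 ≤ opt(G). The arithmetic is sketched below, with the constant 25 applied to opt(G), which is the form an L-reduction needs.

```latex
\begin{align*}
\mathrm{opt}(I) &= \mathrm{opt}(G) + 6n
   \le \mathrm{opt}(G) + 24\,\mathrm{opt}(G) = 25\,\mathrm{opt}(G)
   && \text{since } n \le 4\,\mathrm{opt}(G) \\
\mathrm{opt}(G) - k &\le \mathrm{opt}(G) - (k' - 6n) = \mathrm{opt}(I) - k'
   && \text{since } k \ge k' - 6n
\end{align*}
```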

Proof (3) • Inequalities (a) and (b) show that the reduction is an L-reduction, thus LCS(crossing, plain) is MAX SNP-hard • LCS(crossing, chain), LCS(crossing, nested), and LCS(crossing, crossing) are therefore all MAX SNP-hard

Notes • With an additional constraint: • for any (i1, j1) in the mapping, if (i1, i2) ∈ P1 then, for some j2, (i2, j2) is in the mapping, and if (j1, j2) ∈ P2 then, for some i2, (i2, j2) is in the mapping. • Under this definition, LCS(crossing, crossing) is NP-hard and LCS(crossing, nested) is solvable in polynomial time

LCS(nested, plain) • Input: a pair (S1, P1) and (S2, Ø) of arc-annotated sequences with P1 being nested • Output: the length of a longest arc-preserving common (LAPC) subsequence for the pair (no arc on the LAPC subsequence)

Notation: u(i) • n = |S1| • m = |S2| • u(i) denotes the arc in P1 incident on position i of sequence S1 • u(i)l and u(i)r denote the left and right endpoints of u(i) • If u(i) does not exist, we call i “free” • x(S1[i], S2[j]) = 1 if S1[i] = S2[j], and 0 otherwise

Dynamic Programming Algorithm - Alas, I know little about dynamic programming - but I know divide-and-conquer - pang feng says DP is bottom-up, D&C is top-down

Divide and Conquer algorithm • Two functions: • 弧DP (“arc DP”): 弧DP(i1,i2;j,j’) gives the length of a LAPC subsequence for the pair (S1[i1, i2], P1[i1, i2]) and (S2[j,j’], Ø); it applies only when i1 = u(i2)l • 無DP (“no-arc DP”): 無DP(i,i’;j,j’) gives the length of a LAPC subsequence for the pair (S1[i, i’], P1[i, i’]) and (S2[j,j’], Ø); it applies only when i < u(i’)l or i’ is free • How?

Divide and Conquer algorithm • 無DP(i,i’;j,j’): if i’ is free • 無DP(i,i’;j,j’) = max{ 無DP(i,i’-1;j,j’-1) + x(S1[i’], S2[j’]), 無DP(i,i’-1;j,j’), 無DP(i,i’;j,j’-1) } • This is the simple LCS algorithm

無DP(i,i’;j,j’) • Else if i’ = u(i’)r and i < u(i’)l • 無DP(i,i’;j,j’) = max over j ≤ j’’ ≤ j’+1 of { 無DP(i, u(i’)l-1; j, j’’-1) + 弧DP(u(i’)l, i’; j’’, j’) } • An empty range of S2 contributes 0

無DP(i,i’;j,j’) • Else (i = u(i’)l) • Just call 弧DP(i,i’;j,j’)

弧DP(i1,i2;j,j’) • 弧DP(i1,i2;j,j’) = max{ 無DP(i1+1, i2-1; j+1, j’) + x(S1[i1], S2[j]), 無DP(i1+1, i2-1; j, j’-1) + x(S1[i2], S2[j’]), 無DP(i1+1, i2-1; j, j’), 弧DP(i1, i2; j, j’-1), 弧DP(i1, i2; j+1, j’) } • Finally, merge 弧DP and 無DP into a single DP
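The recurrences for the two functions can be put together in a top-down, memoized sketch. The names `free_dp` and `arc_dp` are English stand-ins for 無DP and 弧DP; the sketch assumes arcs are 1-based pairs (l, r) with l < r, that P1 is nested, that empty ranges score 0, and that the split index j’’ runs from j to j’+1, as in the worked example of the deck.

```python
from functools import lru_cache


def lapcs_nested_plain(s1, p1, s2):
    """Length of a longest arc-preserving common subsequence, P1 nested, S2 plain."""
    n, m = len(s1), len(s2)
    left = {r: l for l, r in p1}            # right endpoint -> left endpoint

    def x(i, j):
        return 1 if s1[i - 1] == s2[j - 1] else 0

    @lru_cache(maxsize=None)
    def free_dp(i, i2, j, j2):
        if i > i2 or j > j2:                # empty range on either side
            return 0
        if i2 not in left:                  # i2 is free: plain LCS step
            return max(free_dp(i, i2 - 1, j, j2 - 1) + x(i2, j2),
                       free_dp(i, i2 - 1, j, j2),
                       free_dp(i, i2, j, j2 - 1))
        l = left[i2]
        if l == i:                          # the whole interval is one arc
            return arc_dp(i, i2, j, j2)
        # Split S1 before the arc and try every split point of S2.
        return max(free_dp(i, l - 1, j, k - 1) + arc_dp(l, i2, k, j2)
                   for k in range(j, j2 + 2))

    @lru_cache(maxsize=None)
    def arc_dp(i1, i2, j, j2):
        if j > j2:
            return 0
        # At most one endpoint of the arc (i1, i2) may be matched.
        return max(free_dp(i1 + 1, i2 - 1, j + 1, j2) + x(i1, j),
                   free_dp(i1 + 1, i2 - 1, j, j2 - 1) + x(i2, j2),
                   free_dp(i1 + 1, i2 - 1, j, j2),
                   arc_dp(i1, i2, j, j2 - 1),
                   arc_dp(i1, i2, j + 1, j2))

    return free_dp(1, n, 1, m)


# The deck's example: S1 = ATGCATGC with arcs (2,8) and (4,7), S2 = AT.
print(lapcs_nested_plain("ATGCATGC", [(2, 8), (4, 7)], "AT"))
```

Memoization plays the role of the tables T in the bottom-up walk-through: each (i, i’; j, j’) entry is computed once, and the split case scans O(m) choices of j’’.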

Example • Top-down approach • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • Recursion tree on intervals of S1: (1,8) splits into (1,1) and (2,8) • (2,8) calls (3,7) • (3,7) splits into (3,3) and (4,7) • (4,7) calls (5,6) • (5,6) calls (5,5)

Example: bottom up • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • (5,5): table T, where T[1,1] denotes DP(5,5;1,1), T[1,2] denotes DP(5,5;1,2), and T[2,2] denotes DP(5,5;2,2) • T[1,1] = 1, T[1,2] = 1, T[2,2] = 0 • (5,6): 6 is free, so use 無DP(i,i’;j,j’) = max{ 無DP(i,i’-1;j,j’-1) + x(S1[i’], S2[j’]), 無DP(i,i’-1;j,j’), 無DP(i,i’;j,j’-1) } • DP(5,6;1,2) = max{ DP(5,5;1,1) + x(S1[6], S2[2]), DP(5,5;1,2), DP(5,6;1,1) } • Resulting table for (5,6): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1

Example: bottom up • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • Table for (5,6): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1 • (4,7) is an arc, so use 弧DP(i1,i2;j,j’) = max{ 無DP(i1+1, i2-1; j+1, j’) + x(S1[i1], S2[j]), 無DP(i1+1, i2-1; j, j’-1) + x(S1[i2], S2[j’]), 無DP(i1+1, i2-1; j, j’), 弧DP(i1, i2; j, j’-1), 弧DP(i1, i2; j+1, j’) } • 弧DP(4,7;1,2) = max{ DP(5,6;2,2) + x(S1[4], S2[1]), DP(5,6;1,1) + x(S1[7], S2[2]), DP(5,6;1,2), DP(4,7;1,1), DP(4,7;2,2) } • Resulting table for (4,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1

Example: bottom up • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • (3,7): 7 = u(7)r and 3 < u(7)l, so use 無DP(i,i’;j,j’) = max{ 無DP(i, u(i’)l-1; j, j’’-1) + 弧DP(u(i’)l, i’; j’’, j’) } • Table for (3,3): T[1,1] = 0, T[1,2] = 0, T[2,2] = 0 • Table for (4,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1 • DP(3,7;1,2) = max{ DP(3,3;1,0) + DP(4,7;1,2), DP(3,3;1,1) + DP(4,7;2,2), DP(3,3;1,2) + DP(4,7;3,2) } • Resulting table for (3,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1

Example: bottom up • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • Table for (3,7): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1 • (2,8) is an arc, so use 弧DP(i1,i2;j,j’) = max{ 無DP(i1+1, i2-1; j+1, j’) + x(S1[i1], S2[j]), 無DP(i1+1, i2-1; j, j’-1) + x(S1[i2], S2[j’]), 無DP(i1+1, i2-1; j, j’), 弧DP(i1, i2; j, j’-1), 弧DP(i1, i2; j+1, j’) } • 弧DP(2,8;1,2) = max{ DP(3,7;2,2) + x(S1[2], S2[1]), DP(3,7;1,1) + x(S1[8], S2[2]), DP(3,7;1,2), DP(2,8;1,1), DP(2,8;2,2) } • Resulting table for (2,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1

Example: bottom up • S1 = A T G C A T G C (positions 1 to 8), S2 = A T • (1,8): 8 = u(8)r and 1 < u(8)l, so use 無DP(i,i’;j,j’) = max{ 無DP(i, u(i’)l-1; j, j’’-1) + 弧DP(u(i’)l, i’; j’’, j’) } • Table for (1,1): T[1,1] = 1, T[1,2] = 1, T[2,2] = 0 • Table for (2,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1 • Answer: DP(1,8;1,2) = max{ DP(1,1;1,0) + DP(2,8;1,2), DP(1,1;1,1) + DP(2,8;2,2), DP(1,1;1,2) + DP(2,8;3,2) } • Resulting table for (1,8): T[1,1] = 1, T[1,2] = 2, T[2,2] = 1

Time Complexity • Table size: m(m+1)/2 = O(m^2) • Number of tables = number of possible intervals (i, i’) of S1: • Arc intervals: at most n/2 = O(n) • Intervals inside an arc: at most as many as the arcs • Free intervals: at most O(n) • Total number of table entries: • O(n) · O(m^2) = O(nm^2)

Time Complexity • Computing one entry costs at most: • O(m), because of the maximum over j’’ in 無DP(i,i’;j,j’) = max{ 無DP(i, u(i’)l-1; j, j’’-1) + 弧DP(u(i’)l, i’; j’’, j’) } • Time complexity: • O(m) · O(nm^2) = O(nm^3)

Extending the LCS(nested, plain) Algorithm • Extend to LCS(nested, chain) • Add two new values α, β to DP(i,i’;j,j’) • DP(i,i’;j,j’; α,β) • Extend to LCS(crossing, nested) • Restrict the cutwidth to a constant k • Add k pairs (αi, βi) to DP(i,i’;j,j’)

LCS(nested, chain) Notation • -: denotes nothing • ρ: the rightmost position of [j, j’-1] other than α and β

Modification (I) • If i’ is free and j’ = u(j’)l: • DP(i,i’;j,j’; α,-) = max{ DP(i,i’-1;j,ρ; α’,-) + x(S1[i’],S2[j’]), DP(i,i’-1;j,j’; α,-), DP(i,i’;j,ρ; α’,-) } • DP(i,i’;j,j’; α,j’) = DP(i,i’;j,j’; α,-) • If α < ρ then α’ = α, else α’ = -

Modification (II) • If i’ is free and j’ = u(j’)r (≠ α): • DP(i,i’;j,j’; α,-) = max{ DP(i,i’-1;j,ρ; α,β’) + x(S1[i’],S2[j’]), DP(i,i’-1;j,j’; α,-), DP(i,i’;j,j’-1; α,-) } • DP(i,i’;j,j’; α,j’) = DP(i,i’;j,j’-1; α,-) • If j ≤ u(j’)l < ρ then β’ = u(j’)l, else β’ = -
