Longest Common Subsequence

Longest Common Subsequence • Consider the problem of efficiently searching for a substring or generally a pattern in large piece of text. • This is what text editors and programs like “grep” do when you perform a search.

Longest Common Subsequence • In many instances you do not want to find a piece of text exactly, but rather something that is similar. • In genetics research for example you compare extremely long strings of genetic codes from DNA to determine how similar two strings are • Presumably this determines how closely related two species are • Also in document retrieval on the web, you compare two strings to see how similar they are

Longest Common Subsequence • There are a number of measures of similarity in strings • The first is the edit distance • The minimum # of single character insertions or deletions necessary to convert one string to the other • E.g. X = <abc>, Y = <sabdfc> • Insert s before a, insert df after b: Distance 3

Longest Common Subsequence • Another method of measuring the degree of similarity between two strings is to compute their longest common subsequence (LCS) • X = <ABRACADABRA>, Y = <YABBADABBADOO> • Z = <ABADABA>

Subsequence: Definition • Definition: • Given two sequences X = <x1,x2,…xm>, and Z=<z1,z2,…zk>, we say that Z is a subsequence of X if there is a strictly increasing sequence of k indices <i1,i2,i3, …ik> (1<=i1<i2<i3<…<ik<=n) such that Z = <Xi1,Xi2,Xi3,…,Xik> • E.g.: Let X = <ABRACADABRA> and let Z=<AADAA>, then Z is a subsequence of X • Indice sequence: <1,4,7,8,11>

LCS: Definition • LCS: Given two sequences X and Y, the LCS of X and Y is the longest sequence Z, which is both a subsequence of X and Y X: A C B B R A D R A A A LCS: A A B A B A D Y: Y A B A B D A A D O O B B • LCS may not be unique • The LCS of <ABC> and <BAC> is either <AC> or <BC>

Brute Force Solution • Try all possible subsequences from one string, and search for matches in the other string • This is hopelessly inefficient since there are an exponential number of possible subsequences • We will design a Dynamic Programming solution

Dynamic Programming Solution • We need to break the problem into smaller pieces. It turns out considering all pairs of prefixes will suffice for us • A prefix of a sequence is just an initial string of values, Xi = <x1, x2, …, xi>. X0 is the empty sequence

Dynamic Programming Solution • The idea will be to compute the longest common subsequence for every possible pair of prefixes of X and Y • Let c[i, j] denote the length of the LCS of Xi and Yj • Eventually we are interested in c[m, n] since this will be the LCS of the two entire strings • The idea is to compute c[i,j] assuming we know the values of c[i’,j’] for i’<=i, j’<=j

Dynamic Programming Solution • Basis: • c[I,0] = c[j,0] = 0. If either sequence is empty, then LCS is empty • Last characters match (xi = yj): • Let Xi = <ABCA> and let Yj=<CDA>. Since both end in A, LCS must end in A (otherwise we can make LCS longer by adding A). Since A is part of the LCS, we can find the overall LCS by removing A from both subsequences and taking the LCS of Xi-1 = <ABC> and Yj-1 = <CD>, which is <C>, and then adding A to the end giving <CA> as the answer • Conclusion: xi = yj c[i,j] = 1 + c[i-1,j-1]

Dynamic Programming Solution • Last characters DO NOT match (xi <> yj) • In this case xi and yj cannot both be in LCS (since they have to be the last char of the LCS). Thus either xi is NOT part of the LCS, or yj is NOT part of the LCS (and possibly both are NOT part of the LCS) • In the first case, the LCS of Xi and Yj is the LCS of Xi-1 and Yj, which is c[i-1,j]. • In the second case the LCS is the LCS of Xi and Yj-1, which is c[i,j-1] • Conclusion: xi <> yj  c[i,j] = max{c[i-1,j], c[i, j-1]}

LCS: Final Formulation • Combining all observations, we have the following rules {0 if i=0 or j=0 • c[i,j] ={c[i-1,j-1] + 1 if i,j>0 and xi=yj {max{c[i-1,j], c[i,j-1] if i,j>0 and xi<>yj

LCS: Algorithm LCS(x[1..m], y[1..n]){ For i=0 to m do c[i,0] = 0; // Initialization For j=0 to n do c[0,j] = 0; // Initialization For i=1 to m do { For j=1 to n do { if (x[i] == y[j]) c[i,j] = c[i-1,j-1] + 1; // Last chars eq else c[i,j] = max(c[i,j-1], c[i-1,j]); // not eq } //end-for } //end-for return c[m, n]; } //end-LCS

LCS: Running Time • It is easy to see that the algorithm takes m*n = O(m*n) steps • The algorithm computes the length of the LCS. It does not compute the actual sequence. • But it is easy to modify the algorithm to compute the actual sequence – few slides later • The algorithm also uses O(m*n) space. • We can achieve a savings in space by observing that we only need the two most recent rows of the matrix c, c[i,*], c[i-1,*]. Thus we can reduce the space to O(n). In fact, if n>m, then we can swap X and Y so the space is O(min(m,n))

LCS: Example X = <BACDB> Y = <BDCB> 0 1 2 3 4 B D C B 0 0 0 0 0 0 0 1 1 1 1 1 B 0 1 1 1 1 2 A 3 C 0 1 1 2 2 0 1 2 2 2 4 D 5 B 0 1 2 2 3 LCS = <BCB>

Extracting the Actual Sequence • The trick for extracting the sequence is to first perform the entire computation, but at each step keep track of the choice that was made as to how the sequence was generated • We will record each decision for each prefix in an array b[i,j] • ADDXY if we add x[i] & y[j] to the LCS • SKIPX if x[1..i-1]&y[1..j] have a longer LCS match • SKIPY if x[1..i]&y[1..j-1] have a longer LCS match • Then starting at the final answer, c[m,n], trace back to find how this answer was arrived at • Every time a diagonal step taken by the algorithm corresponds to one additional char being added to the LCS

LCS: Algorithm LCS(x[1..m], y[1..n]){ For i=0 to m do c[i,0] = 0; // Initialization For j=0 to n do c[0,j] = 0; // Initialization For i=1 to m do { For j=1 to n do { if (x[i] == y[j]){ c[i,j] = c[i-1,j-1] + 1; // Last chars eq b[i,j] = ADDXY; } else if (c[i-1,j] > c[i,j-1]){ c[i,j] = c[i-1,j]; //x[1..i-1] & y[1..j] b[i,j] = SKIPX; // skip x[i] } else { c[i,j] = c[i,j-1]; //x[1..i] & y[1..j-1] b[i,j] = SKIPY; // skip y[j] } //end-else } //end-for } //end-for return c[m, n]; } //end-LCS

PrintLCS PrintLCS(i, j){ if (i==0 || j == 0) return; else if (b[i,j] == ADDXY){ printLCS(i-1, j-1); print x[i]; // In this case x[i] == y[j] } else if (b[I,j] == SKIPX){ printLCS(i-1, j); } else PrintLCS(i, j-1); } //end-PrintLCS

Longest Common Subsequence