Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String Edits and Alignments Lecturer: Dr. Rose Slides by: Dr. Rose February 13, 2003

Homework: due 2/20/03 Chapter 11 questions: • #1 • #4 • #7 • #8 Additional question for grad students: • #10

Linear Space Alignments • Dynamic programming takes Θ(nm) space for alignments. • Can alignments be computed in linear space? • Hirschberg’s method • Good news: reduces space from Θ(nm) to O(n) where n < m. • Bad news: doubles the worst-case time bound.

Linear Space for Similarity • Recall: similarity is expressed as a single scalar. • There is an alignment that corresponds to this scalar, i.e., the optimal alignment whose value is this scalar. • We’ve needed the O(nm) table for the alignment. • Q: If we only want the similarity value, do we need the table? • A: No. We only need the space required to compute the value.

Linear Space for Similarity • Q: How much space is that? • A: Two rows. • Recall: to compute cell (i, j) we need cells (i-1, j-1), (i-1, j), and (i, j-1). • Cells (i-1, j-1) and (i-1, j) are on the previous row. • Cells (i, j-1) and (i, j) are on the current row. • We only need the current row, C, and the previous row, P. • When the current row is done, copy it to the previous row for the next iteration, i.e., C → P.

Linear Space for Similarity • After n iterations row C holds the last row n of the full table. • The similarity value V(n,m) is in the last cell of C. • The time complexity is still O(nm) but space is now O(m).
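
The two-row idea can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name and the scoring values (match = +1, mismatch = -1, gap = -1) are assumptions chosen for the example, since the slides do not fix a scoring scheme.

```python
def similarity_two_rows(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return V(n, m) while storing only two rows of length m + 1.

    The scoring values are illustrative assumptions, not from the slides.
    """
    m = len(s2)
    prev = [j * gap for j in range(m + 1)]       # row 0 (the row P)
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * m                # row i (the row C)
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,      # cell (i-1, j-1)
                         prev[j] + gap,          # cell (i-1, j)
                         cur[j - 1] + gap)       # cell (i, j-1)
        prev = cur                               # C becomes P for the next row
    return prev[m]                               # last cell of the last row
```

For instance, `similarity_two_rows("AB", "AC")` evaluates to 0 under these assumed scores (one match plus one mismatch).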

Linear Space Alignments • Q: How can we find the actual alignment in linear space? • Consider an alignment solution path in the table computation.

Linear Space Alignments • Imagine that we knew that the optimal alignment solution path went through cell (n/2, k*). • Knowing this, we could solve the problem by piecing together solution paths for the diagonal quadrants.

Linear Space Alignments • As important, we could ignore the antidiagonal quadrants. • We could repeat this process, reducing the amount of space needed to find the optimal alignment.

Linear Space Alignments • Repeating this process would reduce the amount of space needed to find the optimal alignment. • Q: How far can we go?

Linear Space Alignments • Q: How can we find the cell (n/2, k*)? • A: Stay tuned! • Defn. Let a^r denote the reverse of string a. • Defn. V^r(i, j) is the similarity of the first i characters of S1^r with the first j characters of S2^r.

Linear Space Alignments • An equivalent formulation: V^r(i, j) is the similarity of the last i characters of S1 with the last j characters of S2. • It should be obvious how V^r(i, j) can be computed in O(nm) time and O(m) space. • Furthermore, any row can be computed in O(m) space.

Linear Space Alignments • Lemma: V(n, m) = max_{0 ≤ k ≤ m} [V(n/2, k) + V^r(n/2, m-k)] • Q: What does this lemma say? • A: The alignment value V(n, m) is the sum of the values of the two smaller alignment problems V(n/2, k) and V^r(n/2, m-k), where k is chosen to yield the largest sum.

Linear Space Alignments Defn: Let k* be the position k that maximizes V(n/2, k) + V^r(n/2, m-k). Defn: Let L denote the solution path from cell (0,0) to cell (n,m).
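
As a rough sketch of how k* could be located, the code below computes row n/2 of V and the matching row of V^r in linear space and maximizes their sum. The helper names and the scoring values (+1 / -1 / -1) are assumptions made for illustration. When n is odd, the reverse pass here covers the remaining n - n/2 rows of S1.

```python
def last_row(s1, s2, match=1, mismatch=-1, gap=-1):
    """Last row of the global similarity table for s1 vs s2, in O(len(s2)) space."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap]
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def find_k_star(s1, s2):
    """Return (k*, V(n, m)) using the lemma; scores are assumed values."""
    m = len(s2)
    half = len(s1) // 2
    fwd = last_row(s1[:half], s2)                  # fwd[k] = V(n/2, k)
    rev = last_row(s1[half:][::-1], s2[::-1])      # rev[j] = V^r(n - n/2, j)
    sums = [fwd[k] + rev[m - k] for k in range(m + 1)]
    best = max(sums)
    return sums.index(best), best
```

The second returned value should agree with the last cell of the ordinary table, which gives a quick sanity check of the lemma on small inputs.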

Linear Space Alignments Defn: Let L_{n/2} be the subpath of L that starts with the last node of L in row n/2-1 and ends with the first node of L in row n/2+1.

Linear Space Alignments Lemma: • Position k* can be found in row n/2 in time O(nm) and space O(m). • The subpath L_{n/2} can be found and stored within the same bounds. Proof sketch: 1. Process the first n/2 rows of the S1 & S2 alignment, saving row n/2 with its traceback pointers. 2. Process the first n/2 rows of the S1^r & S2^r alignment, saving row n/2 with its traceback pointers.

Linear Space Alignments Proof sketch continued: 3. For each k, add V(n/2, k) and V^r(n/2, m-k). 4. Set k* to be the k that maximizes V(n/2, k) + V^r(n/2, m-k). Steps 1 & 2 take O(nm) time and O(m) space. Steps 3 & 4 take O(m) time. 5. One set of traceback pointers leads from k* to k1 in row n/2-1. 6. The other set leads from k* to k2 in row n/2+1. Steps 5 & 6 give the subpath L_{n/2}.

Linear Space Alignments Summary: In O(nm) time and O(m) space we have: • Found the value V(n,m) • Found k*, k1, and k2 • Found the subpath L_{n/2} • Created two subproblems: • Aligning S1[1..n/2-1] with S2[1..k1] • Aligning S1[n/2+1..n] with S2[k2..m]

Linear Space Alignments Aligning S1[1..n/2-1] with S2[1..k1] is the top problem labeled A. Aligning S1[n/2+1..n] with S2[k2..m] is the bottom problem labeled B.

Linear Space Alignments Q: What is the dynamic programming time for a p by q table? A: cpq, where c is some constant. Q: How long does it take to determine row n/2 of an n by m table? A: cnm/2. • Thus it takes cnm time to process the two middle rows (for V & V^r). We can solve problems A and B in time proportional to their total size. • The middle row of A can be determined in ck*n/2. • The middle row of B can be determined in c(m-k*)n/2. • Altogether this is cnm/2.

Linear Space Alignments Q: How are we going to find the optimal alignment in linear space? A: Use recursion!

Linear Space Alignments Hirschberg’s Algorithm
Procedure OPTA(l, l´, r, r´) {
    h = (l + l´)/2;  /* midpoint row of the first substring */
    Find k*, k1, k2, & L_h in space O(l´ - l) = O(m)
    OPTA(l, h-1, r, k1);  /* new top problem */
    output subpath L_h;
    OPTA(h+1, l´, k2, r´);  /* new bottom problem */
}
The first call is: OPTA(1, n, 1, m)
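
A runnable sketch of the recursion is given below. Note that it is the commonly used variant that splits at row n/2 and handles strings of length 0 or 1 as base cases, rather than the OPTA formulation above that outputs the subpath L_h explicitly. All names and the scoring values (+1 / -1 / -1) are assumptions made for illustration.

```python
def last_row(s1, s2, match=1, mismatch=-1, gap=-1):
    """Last row of the global similarity table, in O(len(s2)) space."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [i * gap]
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev


def _align_single(c, s2, match, mismatch, gap):
    """Base case: optimally align the one-character string c with s2."""
    best_j, best = None, (len(s2) + 1) * gap        # option: c opposite a gap
    for j, d in enumerate(s2):
        score = (match if c == d else mismatch) + (len(s2) - 1) * gap
        if score > best:
            best_j, best = j, score
    if best_j is None:
        return c + "-" * len(s2), "-" + s2
    return "-" * best_j + c + "-" * (len(s2) - best_j - 1), s2


def hirschberg(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return an optimal global alignment as two gapped strings."""
    if not s1:
        return "-" * len(s2), s2
    if not s2:
        return s1, "-" * len(s1)
    if len(s1) == 1:
        return _align_single(s1, s2, match, mismatch, gap)
    half, m = len(s1) // 2, len(s2)
    fwd = last_row(s1[:half], s2, match, mismatch, gap)
    rev = last_row(s1[half:][::-1], s2[::-1], match, mismatch, gap)
    k_star = max(range(m + 1), key=lambda k: fwd[k] + rev[m - k])
    top = hirschberg(s1[:half], s2[:k_star], match, mismatch, gap)
    bottom = hirschberg(s1[half:], s2[k_star:], match, mismatch, gap)
    return top[0] + bottom[0], top[1] + bottom[1]


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score a gapped alignment column by column (used to check the result)."""
    return sum(gap if "-" in (x, y) else (match if x == y else mismatch)
               for x, y in zip(a, b))
```

Calling `hirschberg("AGTACGCA", "TATGC")` returns two equal-length gapped strings whose score, as computed by `alignment_score`, matches the value in the last cell of the full table.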

Linear Space Alignments Analysis: • The first call uses cnm time. • The second level uses cnm/2 time for its 2 subproblems. • The ith level of recursion entails 2^(i-1) subproblems. • There are n/2^(i-1) rows in each of the level-i problems. • The total time at level i is cnm/2^(i-1). Thm. Hirschberg’s optimal alignment algorithm takes time Σ_{i=1}^{1+log n} cnm/2^(i-1) ≤ 2cnm and space O(m).
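
The time bound in the theorem is a geometric series. As a worked step, taking the logarithm base 2 and assuming for simplicity that n is a power of two:

```latex
\sum_{i=1}^{1+\log_2 n} \frac{cnm}{2^{\,i-1}}
  = cnm \sum_{j=0}^{\log_2 n} \frac{1}{2^{j}}
  = cnm \left( 2 - \frac{1}{n} \right)
  \le 2cnm
```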

Linear Space Alignments Q: What about computing local alignment? Recall: • This is solved by finding the cell (i*, j*) with maximum value v. • i* and j* are the end indices of substrings a and b. We can compute v row-wise, using only linear space. (Recall that v(i, j) is the value of the optimal suffix alignment from Chapter 11.) Q: How do we find the start indices of a and b without the full table? A: The author suggests reverse dynamic programming. Huh? ‘Reverse the polarity’? Where is Dr. Who?

Linear Space Alignments Finding the start indices of a and b: Extend the algorithm for v to set a pointer h(i, j) for each cell (i, j): If v(i, j) = 0, then set h(i, j) to (i, j). If v(i, j) > 0 and the normal traceback pointer would be to cell (p, q), then set h(i, j) to h(p, q). Consequently, h(i*, j*) specifies the starting cell, i.e., the starting positions of a and b. Finding a and b can be done in linear space. Therefore, local alignment can be done in O(nm) time and O(m) space.
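
The h(i, j) bookkeeping can be sketched as below, keeping only two rows of values and two rows of start pointers. The function name and the scoring values (+1 / -1 / -1) are assumptions for illustration. Cells are indexed as in the table, so the start cell (p, q) means the substrings begin at positions p+1 and q+1.

```python
def local_alignment_linear(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (best value, start cell h(i*, j*), end cell (i*, j*)) in O(m) space."""
    m = len(s2)
    prev_v = [0] * (m + 1)
    prev_h = [(0, j) for j in range(m + 1)]          # row 0: every v is 0
    best, best_start, best_end = 0, (0, 0), (0, 0)
    for i in range(1, len(s1) + 1):
        cur_v = [0] * (m + 1)
        cur_h = [(i, 0)] + [None] * m                # column 0: v is 0
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [
                (prev_v[j - 1] + sub, prev_h[j - 1]),   # traceback to (i-1, j-1)
                (prev_v[j] + gap, prev_h[j]),           # traceback to (i-1, j)
                (cur_v[j - 1] + gap, cur_h[j - 1]),     # traceback to (i, j-1)
            ]
            score, origin = max(candidates, key=lambda c: c[0])
            if score <= 0:                           # v(i, j) = 0: h(i, j) = (i, j)
                score, origin = 0, (i, j)
            cur_v[j], cur_h[j] = score, origin       # else inherit h(p, q)
            if score > best:
                best, best_start, best_end = score, origin, (i, j)
        prev_v, prev_h = cur_v, cur_h
    return best, best_start, best_end
```

With the returned cells, the two substrings are `s1[start[0]:end[0]]` and `s2[start[1]:end[1]]` in Python's 0-based slicing.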

Bounded Differences Imagine problems in which there is a bound on the number of expected differences. Q: Can we solve the alignment faster than O(nm)? A: Yes. If the alignment contains at most k differences, O(km) is possible. Core Idea: The main diagonal consists of cells (i, i), i ≤ n ≤ m. No k-difference alignment can stray into cells (i, i + l) or (i, i - l) with l > k.

Bounded Differences Recall: a solution path must extend from cell (0,0) to cell (n,m), which lies on or to the right of the main diagonal. Observation: k ≥ m - n is required for a k-difference solution to exist.

Bounded Differences Q: How can we achieve time complexity O(km) in a table with O(nm) entries? A: Only fill O(km) of the O(nm) cells straddling the main diagonal.

Bounded Differences Algorithm: Fill in the table in strips 2k+1 cells wide centered on the main diagonal. Note: The recurrence requires the three neighboring cells. Q: How do we handle neighbors in the forbidden zone? A: Ignore them.

Bounded Differences Thm. There is a global alignment of S1 and S2 with at most k differences IFF the algorithm from the previous slide assigns cell (n,m) the value k or less. • The k-difference global alignment problem can be solved in time O(km) and space O(km).
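
A minimal sketch of the strip computation follows, assuming unit costs for mismatches and spaces (edit distance). It stores each row as 2k+1 band cells indexed by d = j - i + k and treats cells outside the strip as infinite, which is one way to "ignore" forbidden neighbors. It returns only the value, not the alignment, and the function name is hypothetical.

```python
def banded_edit_distance(s1, s2, k):
    """Return the edit distance if it is at most k, otherwise None.

    Only cells (i, j) with |i - j| <= k are filled in.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None                       # cell (n, m) lies outside the strip
    INF = float("inf")
    width = 2 * k + 1
    prev = [INF] * width                  # row 0: band index d maps to j = d - k
    for d in range(width):
        if 0 <= d - k <= m:
            prev[d] = d - k
    for i in range(1, n + 1):
        cur = [INF] * width
        for d in range(width):
            j = i + d - k                 # column of band cell d in row i
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[d] = i
                continue
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            diag = prev[d] + cost                           # cell (i-1, j-1)
            up = prev[d + 1] + 1 if d + 1 < width else INF  # cell (i-1, j)
            left = cur[d - 1] + 1 if d >= 1 else INF        # cell (i, j-1)
            cur[d] = min(diag, up, left)
        prev = cur
    result = prev[m - n + k]              # band index of cell (n, m)
    return result if result <= k else None
```

For example, `banded_edit_distance("kitten", "sitting", 3)` gives 3, while the same call with k = 2 gives `None`.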

Bounded Differences Q: What if we don’t know the value of k? Q: How can we decide on a k value? Soln. Start with k = 1. If no solution is found, let k = 2 * k. Repeat the doubling of k until a solution is found. • We double k to find the optimal value k*. • We stop doubling k when a solution is found. • k* is the value of the best alignment found with the current value of k. • Since we have been doubling k, k* ≤ k.

Bounded Differences Thm. The doubling of k, starting from 1, finds the edit distance k* and a corresponding alignment in O(k* m) time and space. Proof: Let kL be the largest value of k used for a given pair of strings. Then kL ≤ 2k*. The effort involved is O(kL m + kL m/2 + kL m/4 + ... + m) = O(kL m). But O(kL m) = O(k* m). Q: Why do we state kL ≤ 2k* instead of kL < 2k*?
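
The doubling strategy can be sketched as a small driver around a banded solver. The solver below is an assumed unit-cost strip computation written for this example (hypothetical names), and it returns only the distance and the final k, not the alignment.

```python
def banded_edit_distance(s1, s2, k):
    """Edit distance restricted to the strip |i - j| <= k; None if it exceeds k."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    INF = float("inf")
    width = 2 * k + 1
    prev = [d - k if 0 <= d - k <= m else INF for d in range(width)]  # row 0
    for i in range(1, n + 1):
        cur = [INF] * width
        for d in range(width):
            j = i + d - k
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[d] = i
                continue
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            up = prev[d + 1] + 1 if d + 1 < width else INF
            left = cur[d - 1] + 1 if d >= 1 else INF
            cur[d] = min(prev[d] + cost, up, left)
        prev = cur
    result = prev[m - n + k]
    return result if result <= k else None


def edit_distance_by_doubling(s1, s2):
    """Return (k*, last k tried), doubling k from 1 until a solution exists."""
    k = 1
    while True:
        distance = banded_edit_distance(s1, s2, k)
        if distance is not None:
            return distance, k        # distance is k*, and k is kL
        k *= 2                        # no k-difference alignment: double k
```

On "kitten" versus "sitting" the driver tries k = 1, 2, 4 and stops at k = 4 with distance 3, consistent with kL ≤ 2k*.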

Homework Due 2/27/03 Part 1 #24. Show how to solve the alphabet-weight alignment problem with affine weights in O(nm) time. #27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is initiated and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is “completed”. Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time. Continued on next page.

Homework Part 2 #1. Show how to compute the value V(n,m) of the optimal alignment using only min(n,m) + 1 space in addition to the space needed to represent the two input strings. #4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when |m - n| < k. Continued on next page.

Homework Part 2 continued Grad students only: #5. (Required portion of question.) Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. (Optional, extra credit portion of question.) The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d-1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details of this approach as well.
