Bioinformatics Algorithms and Data Structures


Chapter 12: Refining Core String Edits and Alignments

Lecturer: Dr. Rose

Slides by: Dr. Rose

February 13, 2003

Chapter 11 questions:

- #1
- #4
- #7
- #8
Additional question for grad students

- #10

- Dynamic programming takes Θ(nm) space for alignments.
- Can alignments be computed in linear space?
- Hirschberg’s method
- Good news: reduces space from Θ(nm) to O(n), where n < m.
- Bad news: doubles the worst-case time bound.

- Recall: similarity is expressed as a single scalar.
- There is an alignment that corresponds to this scalar.
- i.e., the optimal alignment whose value is this scalar.

- We’ve needed the O(n*m) table for the alignment.
- Q: If we only want the similarity value do we need the table?
- A: No. We only need the space required to compute the value.

- Q: How much space is that?
- A: Two rows.
- Recall: to compute cell (i, j) we need cells (i-1, j-1), (i-1, j), and (i, j-1).
- Cells (i-1, j-1) and (i-1, j) are on the previous row.
- Cells (i, j-1) and (i, j) are on the current row.
- We only need the current row, C, and the previous row, P.
- When the current row is done, copy it to the previous row for the next iteration, i.e., P ← C.

- After n iterations row C holds the last row n of the full table.
- The similarity value V(n,m) is in the last cell of C.
- The time complexity is still O(nm) but space is now O(m).
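The two-row idea can be sketched in a few lines of Python. The scoring scheme (match +1, mismatch -1, space -1) and the function names are assumptions made for illustration; the slides do not fix a scoring scheme.

```python
def similarity_last_row(s1, s2, match=1, mismatch=-1, space=-1):
    """Return row n of the similarity table V, keeping only two rows in memory.

    The scores are hypothetical defaults, not taken from the slides.
    """
    m = len(s2)
    prev = [j * space for j in range(m + 1)]        # row 0 (P)
    for i in range(1, len(s1) + 1):
        cur = [i * space] + [0] * m                 # current row (C)
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair,        # cell (i-1, j-1)
                         prev[j] + space,           # cell (i-1, j)
                         cur[j - 1] + space)        # cell (i, j-1)
        prev = cur                                  # P <- C
    return prev


def similarity(s1, s2):
    """V(n, m) is the last cell of the last row."""
    return similarity_last_row(s1, s2)[-1]
```

Only `prev` and `cur` are alive at any time, so the space is O(m) while the double loop still costs O(nm) time.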

- Q: How can we find the actual alignment in linear space?
- Consider an alignment solution path in the table computation.

- Suppose we knew that the optimal alignment solution path went through cell (n/2, k*).

- Knowing this, we could solve the problem by piecing together solution paths for the diagonal quadrants.

- As important, we could ignore the antidiagonal quadrants.
- We could repeat this process, reducing the amount of space needed to find the optimal alignment.

- Q: How far can we go?

- Q: How can we find the cell (n/2, k*)?
- A: Stay tuned!
- Defn. Let $a^r$ denote the reverse of string $a$.
- Defn. $V^r(i,j)$ is the similarity of the first i characters of $S_1^r$ with the first j characters of $S_2^r$.

- An equivalent formulation: $V^r(i,j)$ is the similarity of the last i characters of S1 with the last j characters of S2.

- It should be obvious how $V^r(i,j)$ can be computed in O(nm) time and O(m) space.
- Furthermore, any row can be computed in O(m) space.

- Lemma: $V(n,m) = \max_{0 \le k \le m} \left[ V(n/2,k) + V^r(n/2,m-k) \right]$
- Q: What does this lemma say?
- A: The optimal alignment value V(n, m) is the sum of the values of two smaller alignment problems, $V(n/2,k)$ and $V^r(n/2,m-k)$, where k is chosen to yield the largest sum.

Defn: Let k* be the position k that maximizes $V(n/2,k) + V^r(n/2,m-k)$.

Defn: Let L denote the solution path from cell (0,0) to cell (n,m)

Defn: Let $L_{n/2}$ be the subpath of L that starts with the last node of L in row n/2 - 1 and ends with the first node of L in row n/2 + 1.

Lemma:

- Position k* can be found in row n/2 in time O(nm) and space O(m).
- The subpath $L_{n/2}$ can be found and stored in the same bounds.
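Proof sketch:

1. Compute $V(n/2,k)$ for every k by forward dynamic programming, keeping the traceback pointers for row n/2.
2. Compute $V^r(n/2,m-k)$ for every k by dynamic programming on the reversed strings, again keeping the pointers for row n/2.
3. For each k, add $V(n/2,k)$ and $V^r(n/2,m-k)$.
4. Set k* to be the k that maximizes $V(n/2,k) + V^r(n/2,m-k)$.

Steps 1 & 2 take O(nm) time and O(m) space.

Steps 3 & 4 take O(m) time.

5. Following one set of traceback pointers from cell (n/2, k*) leads to node k1 in row n/2 - 1.
6. Following the other set from cell (n/2, k*) leads to node k2 in row n/2 + 1.

Steps 5 & 6 give the subpath $L_{n/2}$.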
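The search for k* can be sketched as follows. This is a minimal illustration that covers only the value computation (forward row, reverse row, and the maximizing k); it does not keep the traceback pointers needed for k1 and k2. The scoring scheme and the names `last_row` and `find_k_star` are assumptions. When n is odd, the reverse half has n - n/2 rows rather than n/2.

```python
def last_row(a, b, match=1, mismatch=-1, space=-1):
    """Last row of the similarity table for a (rows) vs b (columns), O(len(b)) space."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair, prev[j] + space, cur[j - 1] + space)
        prev = cur
    return prev


def find_k_star(s1, s2):
    """Return (k*, V(n, m)) using one forward and one reverse pass."""
    m = len(s2)
    h = len(s1) // 2
    fwd = last_row(s1[:h], s2)                  # fwd[k] = V(n/2, k)
    rev = last_row(s1[h:][::-1], s2[::-1])      # rev[j] = V^r for the last rows, j columns
    totals = [fwd[k] + rev[m - k] for k in range(m + 1)]
    best = max(totals)
    return totals.index(best), best             # first maximizing k
```

For example, `find_k_star("ACGT", "ACGT")` splits the second string after its second character, and the returned total equals the full similarity value.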

Summary :

In O(mn) time and O(m) space we have:

- Found the value V(n,m)
- Found k*, k1 , and k2
- Found the subpath $L_{n/2}$.
- Created two subproblems:
  - Aligning S1[1..n/2-1] with S2[1..k1]
  - Aligning S1[n/2+1..n] with S2[k2..m]

Aligning S1[1..n/2-1] with S2[1..k1] is the top problem labeled A.

Aligning S1[n/2+1..n] with S2[k2..m] is the bottom problem labeled B.

Q: What is the dynamic programming time for a p by q table?

A: cpq, where c is some constant.

Q: How long does it take to determine row n/2 of an n by m table?

A: cnm/2

- Thus it takes cnm time to compute the two middle rows (for V and $V^r$).
We can solve problems A and B in time proportional to their total size.

- The middle row of A can be determined in ck*n/2
- The middle row of B can be determined in c(m-k*)n/2
- Altogether this is cnm/2.

Q: How are we going to find the optimal alignment in linear space?

A: Use recursion!

Hirschberg’s Algorithm

Procedure OPTA(l, l´, r, r´) {

if (l > l´) return; /* base case: no rows of S1 remain */

h = (l + l´)/2; /* middle row of S1[l..l´] */

Find k*, k1, k2, & Lh in space O(r´ - r) = O(m)

OPTA(l, h-1, r, k1); /* new top problem */

output subpath Lh;

OPTA(h+1, l´, k2, r´); /* new bottom problem */

}

The first call is: OPTA(1,n,1,m)
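A runnable sketch of the recursion is given below. It is a common simplified variant, not the exact OPTA procedure: it splits directly at cell (n/2, k*) and recurses on the two halves instead of tracking k1, k2, and the subpath. The scoring constants and function names are assumptions, strings are assumed not to contain the '-' character, and slicing is used for brevity (passing index bounds, as OPTA does, would avoid the copies).

```python
MATCH, MISMATCH, SPACE = 1, -1, -1   # hypothetical scoring scheme


def last_row(a, b):
    """Last row of the similarity table for a vs b in O(len(b)) space."""
    prev = [j * SPACE for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * SPACE] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + pair, prev[j] + SPACE, cur[j - 1] + SPACE)
        prev = cur
    return prev


def hirschberg(a, b):
    """Return an optimal global alignment (aligned_a, aligned_b)."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # With these scores a mismatch (-1) is never worse than two spaces (-2),
        # so pair the single character with a matching column if one exists.
        j = max(b.find(a), 0)
        return "-" * j + a + "-" * (len(b) - j - 1), b
    h = len(a) // 2
    m = len(b)
    fwd = last_row(a[:h], b)                    # V(h, k) for all k
    rev = last_row(a[h:][::-1], b[::-1])        # reverse scores for the bottom half
    k_star = max(range(m + 1), key=lambda k: fwd[k] + rev[m - k])
    top = hirschberg(a[:h], b[:k_star])         # top problem
    bottom = hirschberg(a[h:], b[k_star:])      # bottom problem
    return top[0] + bottom[0], top[1] + bottom[1]


def alignment_score(x, y):
    """Score two aligned strings column by column."""
    return sum(SPACE if "-" in (p, q) else MATCH if p == q else MISMATCH
               for p, q in zip(x, y))
```

A quick sanity check is that `alignment_score(*hirschberg(a, b))` equals `last_row(a, b)[-1]`, the value V(n, m) computed by the two-row method.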

Analysis:

- The first call uses cnm time.
- The second level of recursion uses cnm/2 time for its 2 subproblems.
- The ith level of recursion entails $2^{i-1}$ subproblems.
- There are $n/2^{i-1}$ rows in each of the level-i problems.
- The total time at level i is $cnm/2^{i-1}$.
Thm. Hirschberg’s optimal alignment algorithm takes time $\sum_{i=1}^{\log n} cnm/2^{i-1} \le 2cnm$ and space O(m).
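
The time bound is a geometric series. Written out, assuming logarithms are base 2 and n is a power of two:

```latex
\sum_{i=1}^{\log n} \frac{cnm}{2^{i-1}}
  = cnm \sum_{j=0}^{\log n - 1} \frac{1}{2^{j}}
  = cnm \left( 2 - \frac{2}{n} \right)
  < 2cnm
```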

Q: What about computing local alignment?

Recall:

- This is solved by finding the cell (i*, j*) with maximum value v.
- i* and j* are the end indices of substrings a and b.
We can compute v row-wise. (Recall from Chapter 11 that v(i,j) is the value of the optimal suffix alignment.)

Q: How do we find the start indices of a and b without the full table?

A: Author suggests reverse dynamic programming.

Huh? ‘reverse the polarity’? Where is Dr. Who?

Finding the start indices of a and b:

Extend the algorithm for v to set pointer h(i, j) for each cell (i, j):

If v(i, j) = 0 then set h(i, j) to (i, j)

If v(i, j) > 0 & normal traceback pointer would be to cell (p, q) then set h(i, j) to h(p, q).

Consequently, h(i*, j*) specifies the starting cell, i.e., the starting positions of a and b.

Finding a and b can be done in linear space.

Local alignment can be done in O(nm) time & O(m) space.
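
The h(i, j) pointers can be carried along with the two-row computation of v. The sketch below is one way to do it; the scoring scheme and the function name are assumptions, and only the value and the start and end cells are returned, not the alignment itself.

```python
def local_alignment_ends(s1, s2, match=1, mismatch=-1, space=-1):
    """Return (v, start_cell, end_cell) of a best local alignment in O(m) space.

    start_cell is h(i*, j*); the aligned substrings are
    s1[start[0]:end[0]] and s2[start[1]:end[1]].
    """
    m = len(s2)
    prev_v = [0] * (m + 1)
    prev_h = [(0, j) for j in range(m + 1)]
    best = (0, (0, 0), (0, 0))
    for i in range(1, len(s1) + 1):
        cur_v = [0] * (m + 1)
        cur_h = [(i, 0)] + [None] * m
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [(prev_v[j - 1] + pair, prev_h[j - 1]),   # diagonal
                          (prev_v[j] + space, prev_h[j]),          # up
                          (cur_v[j - 1] + space, cur_h[j - 1])]    # left
            v, h = max(candidates, key=lambda c: c[0])
            if v <= 0:
                v, h = 0, (i, j)        # v(i, j) = 0: the pointer is the cell itself
            cur_v[j], cur_h[j] = v, h   # otherwise inherit h(p, q) from the traceback cell
            if v > best[0]:
                best = (v, h, (i, j))
        prev_v, prev_h = cur_v, cur_h
    return best
```

For instance, with `s1 = "xxxACGTyyy"` and `s2 = "zzACGTww"` the returned cells delimit the shared substring `"ACGT"` in both strings.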

Imagine problems in which there is a bound on the number of expected differences.

Q: Can we solve the alignment faster than O(nm)?

A: Yes. If the alignment contains at most k differences, O(km) time is possible.

Core Idea: The main diagonal comprises cells (i,i), i ≤ n ≤ m. A k-difference alignment cannot stray into cells (i, i + l) or (i, i – l) with l > k.

Recall: a solution path must extend from cell (0,0) to cell (n,m), which lies on or to the right of the main diagonal.

Observation: k >= m – n is required for a k-difference solution to exist.

Q: How can we achieve time complexity O(km) in a table with O(nm) entries?

A: Only fill O(km) of the O(nm) cells straddling the main diagonal.

Algorithm: Fill in the table in strips 2k+1 cells wide centered on the main diagonal.

Note: The recurrence requires the three neighboring cells.

Q: How do we handle neighbors in the forbidden zone?

A: Ignore them.

Thm. There is a global alignment of S1 and S2 with at most k differences IFF the algorithm from the previous slide assigns cell (n,m) the value k or less.

- The k-difference global alignment problem can be solved in time O(km) and space O(km).
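
A minimal sketch of the strip computation for unit-cost edit distance follows. It computes only the value (so two rows of the strip suffice); recovering the alignment would need the O(km) stored cells mentioned above. The function name and the use of `None` for "no k-difference alignment" are assumptions.

```python
def k_difference_distance(s1, s2, k):
    """Edit distance of s1 and s2 if it is at most k, else None.

    Only cells (i, j) with |i - j| <= k are filled; neighbors outside
    the strip are ignored by treating them as infinitely expensive.
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None                      # cell (n, m) lies outside the strip
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}          # row 0 inside the strip
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i               # column 0: i deletions
                continue
            substitute = prev.get(j - 1, inf) + (s1[i - 1] != s2[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    d = prev[m]
    return d if d <= k else None
```

Each row touches at most 2k+1 cells, which gives the O(km) time bound when n ≤ m.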

Q: What if we don’t know the value of k?

Q: How can we decide on a k value?

Soln. Start with k = 1.

If no solution is found let k = 2 * k

Repeat the doubling of k until a solution is found.

- We double k to find the optimal value k*.
- We stop doubling k when a solution is found.
- k* is the number of differences in the best alignment found with the current value of k.
- Since we have been doubling k, k* ≤ k.

Thm. Doubling k, starting from 1, yields the edit distance k* and a corresponding alignment in O(k* m) time and space.

Proof: Let $k_L$ be the largest value of k used for a given pair of strings. Then $k_L \le 2k^*$. The effort involved is $O(k_L m + k_L m/2 + k_L m/4 + \dots + m) = O(k_L m)$. But $O(k_L m) = O(k^* m)$.
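
The doubling strategy can be sketched on top of a banded distance routine. Both function names are hypothetical, and the sketch returns only the distance and the final k, not the alignment.

```python
def banded_distance(s1, s2, k):
    """Edit distance if it is at most k, else None (strip of width 2k+1)."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            cur[j] = min(prev.get(j - 1, inf) + (s1[i - 1] != s2[j - 1]),
                         prev.get(j, inf) + 1,
                         cur.get(j - 1, inf) + 1)
        prev = cur
    return prev[m] if prev[m] <= k else None


def edit_distance_by_doubling(s1, s2):
    """Return (k_star, k_last): the edit distance and the last k tried."""
    k = 1
    while True:
        d = banded_distance(s1, s2, k)
        if d is not None:
            return d, k      # a solution was found, so stop doubling
        k *= 2               # no k-difference alignment: double k and retry
```

The loop terminates because once k reaches max(n, m) the strip covers the whole table.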

Q: Why do we state $k_L \le 2k^*$ instead of $k_L < 2k^*$?

Part 1

#24. Show how to solve the alphabet-weight alignment problem with affine weights in O(nm) time.

#27. The recurrence relations we developed for the affine gap model follow the logic of paying Wg + Ws when a gap is initiated and then paying Ws for each additional space used in that gap. An alternative logic is to pay Wg + Ws at the point when the gap is “completed”. Write recurrence relations for the affine gap model that follow that logic. The recurrences should compute the alignment in O(nm) time.

Continued on next page.

Part 2

#1. Show how to compute the value V(n,m) of the optimal alignment using only min(n,m) + 1 space in addition to the space needed to represent the two input strings.

#4. Show how to reduce the size of the strip needed in the method of Section 12.2.3, when |m - n| < k.

Continued on next page.

Part 2 continued

Grad students only:

#5. Fill in the details of how to find the actual alignments of P in T that occur with at most k differences. The method uses the O(km) values stored during the k differences algorithm. The solution is somewhat simpler if the k differences algorithm also stores a sparse set of pointers recording how each farthest-reaching d-path extends a farthest-reaching (d-1)-path. These pointers only take O(km) space and are a sparse version of the standard dynamic programming pointers. Fill in the details of this approach as well.

Required portion of question.

Optional, extra credit portion of question.