290 likes | 429 Views
In this presentation, we explore advanced dynamic programming techniques for sequence alignment in computational genomics. Topics include an overview of gap alignment strategies such as global, local, and ends-free alignments; the affine gap penalty model; and the efficient computation of shortest paths using the Floyd-Warshall algorithm. We delve into recurrence relations and base conditions essential for alignment setups, as well as how to manage gaps and special cases that arise in genomic sequences, ensuring both accuracy and efficiency in alignment scores.
E N D
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky
Outline • Alignment re-cap • End-space free alignment • Affine gap alignment algorithm and proof • Bounded gap/spaces alignments
Dynamic programming • Useful in many string-related settings • Will be repeatedly used in the course • General idea • Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial
Dynamic programming for shortest paths • Finding the shortest path from X to Y using the Floyd Warshall • Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy wij if k=0 • dij(k)= min{dij(k-1), dik(k-1)+dkj(k-1)} otherwise
Something1|G Something2|C Alignment reminder Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Something1|G Something1G|- Something1|C Somethin g2|C
Global alignment • Input: S1,S2 • Output: Minimum cost alignment • V(k,l) – score of aligning S1[1..k] with S2[1..l] • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj)
Alignment reminder • Global alignment • All of S1 has to be aligned with all of S2 • Every gap is “payed for” • Solution equals V(n,m) Traceback all the way Alignment score here
Local alignment • Local alignment • Subset of S1 aligned with a subset of S2 • Gaps outside subsets “costless” • Solution equals the maximum score cell in the DP matrix • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) 0
Ends-free alignment • Something between global and local • Consider aligning a gene to a (bacterial) genome • Gaps in the beginning and end of S and T are costless • But all of S,T should be aligned • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) • The optimal solution is found at the last row/column (not necessarily at bottom right corner)
Something1|G Something2|C Handling weird gaps • Affine gap: different cost for a “new” and “old” gaps Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Two new things to keep track Two additional matrices Now we care if there were gaps here Something1|G Something1G|- Something1|C Somethin g2|C
G(i,j) S.....i T.....j Alignment with Affine Gap Penalty • Base Conditions: • V(i, 0) = F(i, 0) = Wg + iWs • V(0, j) = E(0, j) = Wg + jWs • Recursive Computation: • V(i, j) = max{ E(i, j), F(i, j), G(i, j)} • where: • G(i, j) = V(i-1, j-1) + (si, tj) • E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws } • F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws } S.....i------ T..............j E(i,j) S...............i T.....j------- • Time complexity O(nm) - compute 4 matrices instead of one. • Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.
When do constant and affine gap costs differ? AGAGACTGACGCTTA ATATTA • Consider: AGAGACTGACGCTTA ATA---------TTA AGAGACTGACGCTTA ----A-T-A---TTA Constant penalty: Mismatch: -5 Gap: -1 -14 -9 Affine penalty: Mismatch: -5 Gap open: -3 Gap extend: -0.5 -12 -14.5
Bounding the number of gaps • Lets say we are allowed to have at most K gaps • (Gaps ≠ Spaces Gap can contain many spaces) • Now we keep track of the number of gaps we opened so far • Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)
Bounding the number of gaps • A “multi-layer” DP matrix • Actually separate functions – V,E,F, on every layer, keeping track of layer no. • Every time we open or close a gap we “jump” to the next layer • Where to look for the solution? (not only at last layer!) • What is the complexity?
Bounding the number of spaces • Let’s say that no gap can exceed k spaces • Of course now cannot also bound number of gaps as well (why?) • How many matrices do we need now? • Here, no monotone notion of layer like before • What’s the complexity?
What about arbitrary gap functions? • If the gap cost is an arbitrary function of its length f(k) • Thus, when computing Dij, we need to look at j places “back” and i places “up”: • Complexity? Something1|G Something1|C min
Special cases • How about a logarithmic penalty? Wg+Ws*log(k) • This is a special case of a convex penalty, which is solvable in O(mn*log(m)) • The logarithmic case can be done in O(mn) • For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))
Supersequence • Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU) • Given SandT, find their shortest common supersequence
Reminder: LCS • Longest common non-contigous subsequence: • Adjust global alignment with similarity scores • 1 for match • 0 for gaps • -∞ for mismatches
Supersequence • Find the longest common sub-sequence of S,T • Generate the string as follows: • for every column in the alignment • Match – add the matching character (once!) • Gap – add the character aligned against the gap
Supersequence • For S=“Pride” T=“Parade”: • P-R-IDE • PARA-DE • PARAIDE – Shortest common supersequence
Exercise: Finding repeats • Basic objective: find a pair of subsequences within S with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?) • But using local alignment is still a good idea
Variant #1 • Specific requirement: the two sequences may overlap • Solution: Change the local alignment algorithm: • Compute only the upper triangular submatrix (V(i,j), where j>i). • Set diagonal values to 0 • Complexity: O(n2) time and O(n) space
Variant #2 • Specific requirement: the two sequences may not overlap • Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n] • Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n • Pick the highest-scoring alignment • Complexity: O(n3) time and O(n) space
Variant #3 • Specific requirement: the two sequences must be consequtive (tandem repeat) • Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], • No penalties for gaps in the beginning of S[1..k] • No penalties for gaps in the end of S[k+1..n] • Complexity: O(n3) time and O(n) space
Variant #4 • Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat) • Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)
Variant #4 • Solution: Compute the local alignment between S and SRC • Look for results on the diagonal i+j=n • AGCTAACGCGTTCGAA (n=16) • Complexity:O(n2) time, O(n) space Index 8 Index 8