Comp. Genomics

Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky

Outline • Alignment re-cap • End-space free alignment • Affine gap alignment algorithm and proof • Bounded gap/spaces alignments

Dynamic programming • Useful in many string-related settings • Will be repeatedly used in the course • General idea • Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial

Dynamic programming for shortest paths • Finding the shortest path from X to Y using the Floyd Warshall • Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy wij if k=0 • dij(k)= min{dij(k-1), dik(k-1)+dkj(k-1)} otherwise

Global alignment • Input: S1,S2 • Output: Minimum cost alignment • V(k,l) – score of aligning S1[1..k] with S2[1..l] • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj)

Alignment reminder • Global alignment • All of S1 has to be aligned with all of S2 • Every gap is “payed for” • Solution equals V(n,m) Traceback all the way Alignment score here

Local alignment • Local alignment • Subset of S1 aligned with a subset of S2 • Gaps outside subsets “costless” • Solution equals the maximum score cell in the DP matrix • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) 0

Ends-free alignment • Something between global and local • Consider aligning a gene to a (bacterial) genome • Gaps in the beginning and end of S and T are costless • But all of S,T should be aligned • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) • The optimal solution is found at the last row/column (not necessarily at bottom right corner)

G(i,j) S.....i T.....j Alignment with Affine Gap Penalty • Base Conditions: • V(i, 0) = F(i, 0) = Wg + iWs • V(0, j) = E(0, j) = Wg + jWs • Recursive Computation: • V(i, j) = max{ E(i, j), F(i, j), G(i, j)} • where: • G(i, j) = V(i-1, j-1) + (si, tj) • E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws } • F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws } S.....i------ T..............j E(i,j) S...............i T.....j------- • Time complexity O(nm) - compute 4 matrices instead of one. • Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.

When do constant and affine gap costs differ? AGAGACTGACGCTTA ATATTA • Consider: AGAGACTGACGCTTA ATA---------TTA AGAGACTGACGCTTA ----A-T-A---TTA Constant penalty: Mismatch: -5 Gap: -1 -14 -9 Affine penalty: Mismatch: -5 Gap open: -3 Gap extend: -0.5 -12 -14.5

Bounding the number of gaps • Lets say we are allowed to have at most K gaps • (Gaps ≠ Spaces  Gap can contain many spaces) • Now we keep track of the number of gaps we opened so far • Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)

Bounding the number of gaps • A “multi-layer” DP matrix • Actually separate functions – V,E,F, on every layer, keeping track of layer no. • Every time we open or close a gap we “jump” to the next layer • Where to look for the solution? (not only at last layer!) • What is the complexity?

Bounding the number of spaces • Let’s say that no gap can exceed k spaces • Of course now cannot also bound number of gaps as well (why?) • How many matrices do we need now? • Here, no monotone notion of layer like before • What’s the complexity?

What about arbitrary gap functions? • If the gap cost is an arbitrary function of its length f(k) • Thus, when computing Dij, we need to look at j places “back” and i places “up”: • Complexity? Something1|G Something1|C min

Special cases • How about a logarithmic penalty? Wg+Ws*log(k) • This is a special case of a convex penalty, which is solvable in O(mn*log(m)) • The logarithmic case can be done in O(mn) • For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))

Supersequence • Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU) • Given SandT, find their shortest common supersequence

Reminder: LCS • Longest common non-contigous subsequence: • Adjust global alignment with similarity scores • 1 for match • 0 for gaps • -∞ for mismatches

Supersequence • Find the longest common sub-sequence of S,T • Generate the string as follows: • for every column in the alignment • Match – add the matching character (once!) • Gap – add the character aligned against the gap

Supersequence • For S=“Pride” T=“Parade”: • P-R-IDE • PARA-DE • PARAIDE – Shortest common supersequence

Exercise: Finding repeats • Basic objective: find a pair of subsequences within S with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?) • But using local alignment is still a good idea

Variant #1 • Specific requirement: the two sequences may overlap • Solution: Change the local alignment algorithm: • Compute only the upper triangular submatrix (V(i,j), where j>i). • Set diagonal values to 0 • Complexity: O(n2) time and O(n) space

Variant #2 • Specific requirement: the two sequences may not overlap • Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n] • Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n • Pick the highest-scoring alignment • Complexity: O(n3) time and O(n) space

Variant #2

Variant #3 • Specific requirement: the two sequences must be consequtive (tandem repeat) • Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], • No penalties for gaps in the beginning of S[1..k] • No penalties for gaps in the end of S[k+1..n] • Complexity: O(n3) time and O(n) space

Variant #3

Variant #4 • Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat) • Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)

Variant #4 • Solution: Compute the local alignment between S and SRC • Look for results on the diagonal i+j=n • AGCTAACGCGTTCGAA (n=16) • Complexity:O(n2) time, O(n) space Index 8 Index 8 

Comp. Genomics

Comp. Genomics

Presentation Transcript

DNA Chips and Their Analysis Comp. Genomics: Lecture 13

Genomics

Genomics

Computational Genomics Fall 2004/5 www.cs.tau.ac.il/~bchor/CG05/comp-genom.html

Comp. Genomics

Comp. Genomics

Comp. Genomics

Genomics

Comp. Genomics

Intro to Comp Genomics

Comp. Genomics

Comp. Genomics

Comp. Genomics

Genomics

Comp. Genomics

Genomics

Comp. Genomics

Computational Genomics Spring 2009 cs.tau.ac.il/~bchor/CG09/comp-genom.html

Genomics

Comp. Genomics

Genomics

Comp. Genomics