- 103 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Comp. Genomics' - dandre

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Outline

- Alignment re-cap
- End-space free alignment
- Affine gap alignment algorithm and proof
- Bounded gap/spaces alignments

Dynamic programming

- Useful in many string-related settings
- Will be repeatedly used in the course
- General idea
- Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial

Dynamic programming for shortest paths

- Finding the shortest path from X to Y using the Floyd Warshall
- Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy
wij if k=0

- dij(k)= min{dij(k-1), dik(k-1)+dkj(k-1)} otherwise

Something2|C

Alignment reminderSomething1|G

Something1|C

Something1|G

Something2|C

Something1|G

Something1|C

Somethin g1|G

Something2C|-

Something1|G

Something1G|-

Something1|C

Somethin g2|C

Global alignment

- Input: S1,S2
- Output: Minimum cost alignment
- V(k,l) – score of aligning S1[1..k] with S2[1..l]
- Base conditions:
- V(i,0) = k=0..i(sk,-)
- V(0,j) = k=0..j(-,tk)

- Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-)

V(i,j-1) + (-,tj)

Alignment reminder

- Global alignment
- All of S1 has to be aligned with all of S2
- Every gap is “payed for”
- Solution equals V(n,m)

Traceback all the way

Alignment score here

Local alignment

- Local alignment
- Subset of S1 aligned with a subset of S2
- Gaps outside subsets “costless”
- Solution equals the maximum score cell in the DP matrix

- Base conditions:
- V(i,0) = 0
- V(0,j) = 0

- Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-)

V(i,j-1) + (-,tj)

0

Ends-free alignment

- Something between global and local
- Consider aligning a gene to a (bacterial) genome
- Gaps in the beginning and end of S and T are costless
- But all of S,T should be aligned
- Base conditions:
- V(i,0) = 0
- V(0,j) = 0

- Recurrence relation: V(i-1,j-1) + (si,tj)
1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-)

V(i,j-1) + (-,tj)

- The optimal solution is found at the last row/column
(not necessarily at bottom right corner)

Something2|C

Handling weird gaps- Affine gap: different cost for a “new” and “old” gaps

Something1|G

Something1|C

Something1|G

Something2|C

Something1|G

Something1|C

Somethin g1|G

Something2C|-

Two new things to keep track Two additional matrices

Now we care if there were gaps here

Something1|G

Something1G|-

Something1|C

Somethin g2|C

S.....i

T.....j

Alignment with Affine Gap Penalty

- Base Conditions:
- V(i, 0) = F(i, 0) = Wg + iWs
- V(0, j) = E(0, j) = Wg + jWs
- Recursive Computation:
- V(i, j) = max{ E(i, j), F(i, j), G(i, j)}
- where:
- G(i, j) = V(i-1, j-1) + (si, tj)
- E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws }
- F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws }

S.....i------

T..............j

E(i,j)

S...............i

T.....j-------

- Time complexity O(nm) - compute 4 matrices instead of one.
- Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.

When do constant and affine gap costs differ?

AGAGACTGACGCTTA

ATATTA

- Consider:

AGAGACTGACGCTTA

ATA---------TTA

AGAGACTGACGCTTA

----A-T-A---TTA

Constant penalty:

Mismatch: -5

Gap: -1

-14

-9

Affine penalty:

Mismatch: -5

Gap open: -3

Gap extend: -0.5

-12

-14.5

Bounding the number of gaps

- Lets say we are allowed to have at most K gaps
- (Gaps ≠ Spaces Gap can contain many spaces)
- Now we keep track of the number of gaps we opened so far
- Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)

Bounding the number of gaps

- A “multi-layer” DP matrix
- Actually separate functions – V,E,F, on every layer, keeping track of layer no.
- Every time we open or close a gap we “jump” to the next layer
- Where to look for the solution? (not only
at last layer!)

- What is the complexity?

Bounding the number of spaces

- Let’s say that no gap can exceed k spaces
- Of course now cannot also bound number
of gaps as well (why?)

- How many matrices do we need now?
- Here, no monotone notion of layer like before
- What’s the complexity?

What about arbitrary gap functions?

- If the gap cost is an arbitrary function of its length f(k)
- Thus, when computing Dij, we need to look at j places “back” and i places “up”:
- Complexity?

Something1|G

Something1|C

min

Special cases

- How about a logarithmic penalty? Wg+Ws*log(k)
- This is a special case of a convex penalty, which is solvable in O(mn*log(m))
- The logarithmic case can be done in O(mn)
- For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))

Supersequence

- Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.
- e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU)
- Given SandT, find their shortest common supersequence

Reminder: LCS

- Longest common non-contigous subsequence:
- Adjust global alignment with similarity scores
- 1 for match
- 0 for gaps
- -∞ for mismatches

Supersequence

- Find the longest common sub-sequence of S,T
- Generate the string as follows:
- for every column in the alignment
- Match – add the matching character (once!)
- Gap – add the character aligned against the gap

Supersequence

- For S=“Pride” T=“Parade”:
- P-R-IDE
- PARA-DE
- PARAIDE – Shortest common supersequence

Exercise: Finding repeats

- Basic objective: find a pair of subsequences within S with maximum similarity
- Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?)
- But using local alignment is still a good idea

Variant #1

- Specific requirement: the two sequences may overlap
- Solution: Change the local alignment algorithm:
- Compute only the upper triangular submatrix (V(i,j), where j>i).
- Set diagonal values to 0

- Complexity: O(n2) time and O(n) space

Variant #2

- Specific requirement: the two sequences may not overlap
- Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n]
- Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n
- Pick the highest-scoring alignment

- Complexity: O(n3) time and O(n) space

Variant #3

- Specific requirement: the two sequences must be consequtive (tandem repeat)
- Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n],
- No penalties for gaps in the beginning of S[1..k]
- No penalties for gaps in the end of S[k+1..n]

- Complexity: O(n3) time and O(n) space

Variant #4

- Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat)
- Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)

Variant #4

- Solution: Compute the local alignment between S and SRC
- Look for results on the diagonal i+j=n
- AGCTAACGCGTTCGAA (n=16)
- Complexity:O(n2) time, O(n) space

Index 8

Index 8

Download Presentation

Connecting to Server..