Comp. Genomics

Comp. Genomics. Recitation 2 12/3/09 Slides by Igor Ulitsky. Outline. Alignment re-cap End-space free alignment Affine gap alignment algorithm and proof Bounded gap/spaces alignments. Dynamic programming. Useful in many string-related settings Will be repeatedly used in the course

Comp. Genomics

Recitation 2

12/3/09

Slides by Igor Ulitsky


Outline

  • Alignment re-cap

  • End-space free alignment

  • Affine gap alignment algorithm and proof

  • Bounded gap/spaces alignments


Dynamic programming

  • Useful in many string-related settings

  • Will be repeatedly used in the course

  • General idea

    • Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial


Dynamic programming for shortest paths

  • Finding the shortest path from X to Y using the Floyd-Warshall algorithm

  • Idea: if we know the shortest paths that use only intermediate vertices {1,…, k-1}, computing the shortest paths that use {1,…, k} is easy

  • $d_{ij}^{(k)} = \begin{cases} w_{ij} & \text{if } k = 0 \\ \min\{ d_{ij}^{(k-1)},\ d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \} & \text{otherwise} \end{cases}$
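
The recurrence can be sketched in a few lines of Python. This is only an illustration: the function name `floyd_warshall`, the adjacency-matrix input and the small example graph are assumptions, not part of the slides. Updating the table in place is safe because row k and column k do not change during round k.

```python
import math

def floyd_warshall(w):
    """All-pairs shortest paths.

    w[i][j] is the weight of edge i -> j (math.inf if absent, 0 on the diagonal).
    """
    n = len(w)
    d = [row[:] for row in w]          # d^(0) = w
    for k in range(n):                 # allow vertex k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d

INF = math.inf
# Hypothetical 3-vertex graph: 0 -> 1 (3), 1 -> 2 (1), 2 -> 0 (2)
example = [[0, 3, INF],
           [INF, 0, 1],
           [2, INF, 0]]
dist = floyd_warshall(example)
```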


Alignment reminder

[Figure: the three possible last columns of an alignment of a string ending in G with a string ending in C: G aligned with C, G aligned with a space, or a space aligned with C.]


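Global alignment

  • Input: S1, S2

  • Output: Optimal alignment (maximum score under the recurrence below)

  • V(k,l): score of aligning S1[1..k] with S2[1..l]; s_i and t_j denote the i-th character of S1 and the j-th character of S2, and $\sigma(a,b)$ is the score of aligning a with b (“-” is a space)

  • Base conditions:

    • $V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -)$

    • $V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$

  • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

    $V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \end{cases}$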


Alignment reminder

  • Global alignment

    • All of S1 has to be aligned with all of S2

    • Every gap is “paid for”

    • Solution equals V(n,m)

[Figure: DP matrix. The alignment score is read at the bottom-right cell, and the traceback runs all the way back to the top-left cell.]


Local alignment

  • Local alignment

    • A substring of S1 is aligned with a substring of S2

    • Gaps outside these substrings are “costless”

    • Solution equals the maximum score cell in the DP matrix

  • Base conditions:

    • V(i,0) = 0

    • V(0,j) = 0

  • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

    $V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \\ 0 \end{cases}$
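
As a sketch, the only changes relative to global alignment are the zero base conditions, the extra 0 in the max, and reading the answer from the best cell anywhere in the matrix. The scoring values and the function name below are illustrative assumptions.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Score of the best local alignment of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # V(i,0) = V(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                       # start a fresh alignment here
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            best = max(best, V[i][j])              # solution: maximum cell
    return best
```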


Ends-free alignment

  • Something between global and local

  • Consider aligning a gene to a (bacterial) genome

  • Gaps in the beginning and end of S and T are costless

  • But all of S,T should be aligned

  • Base conditions:

    • V(i,0) = 0

    • V(0,j) = 0

  • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

    $V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i,t_j) \\ V(i-1,j) + \sigma(s_i,-) \\ V(i,j-1) + \sigma(-,t_j) \end{cases}$

  • The optimal solution is found in the last row or the last column (not necessarily at the bottom-right corner)
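
A hedged Python sketch of the ends-free variant: zero base conditions as in local alignment, no 0 inside the max, and the answer taken over the last row and last column. Scoring values and the function name are assumptions for illustration.

```python
def ends_free_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # free leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    # Free trailing gaps: best cell in the last row or the last column.
    return max(max(V[n]), max(row[m] for row in V))
```

For example, a short "gene" placed inside a longer "genome" scores as if the flanking genome were not there, whereas a global alignment would pay for every flanking space.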


Handling weird gaps

  • Affine gap: different costs for opening a “new” gap and for extending an “old” one

  • Two new things to keep track of → two additional matrices

  • Now we care whether the previous column contained a gap

[Figure: the last-column cases from the alignment reminder (G aligned with C, G aligned with a space, a space aligned with C), now annotated with whether the preceding column already ends in a gap.]


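Alignment with Affine Gap Penalty

  • One matrix per kind of last column (Wg is the gap-opening weight, Ws the weight of each space):

    • G(i, j): s_i is aligned with t_j (S.....i over T.....j)

    • E(i, j): t_j is aligned with a space, i.e. the alignment ends with a gap in S (S.....i------ over T..............j)

    • F(i, j): s_i is aligned with a space, i.e. the alignment ends with a gap in T (S...............i over T.....j-------)

  • Base Conditions:

  • V(0, 0) = 0

  • V(i, 0) = F(i, 0) = Wg + iWs

  • V(0, j) = E(0, j) = Wg + jWs

  • Recursive Computation:

  • V(i, j) = max{ E(i, j), F(i, j), G(i, j) }

  • where:

  • G(i, j) = V(i-1, j-1) + $\sigma(s_i, t_j)$

  • E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws }

  • F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws }

  • Time complexity O(nm): compute 4 matrices instead of one.

  • Space complexity O(nm): saving 3 (Why?) matrices. O(n+m) with Hirschberg's algorithm.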
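
The recurrences can be sketched directly in Python. This is an illustrative sketch, not the course's reference code: the function name and default weights are assumptions, Wg and Ws are passed as negative scores (`wg`, `ws`), and all four matrices are stored in full for readability rather than to save space.

```python
NEG_INF = float("-inf")

def affine_alignment_score(s, t, match=1, mismatch=-1, wg=-3.0, ws=-0.5):
    """Global alignment score where a gap of k spaces scores wg + k * ws."""
    n, m = len(s), len(t)
    def table():
        return [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V, G, E, F = table(), table(), table(), table()
    V[0][0] = 0.0
    for i in range(1, n + 1):                 # V(i,0) = F(i,0) = Wg + i*Ws
        V[i][0] = F[i][0] = wg + i * ws
    for j in range(1, m + 1):                 # V(0,j) = E(0,j) = Wg + j*Ws
        V[0][j] = E[0][j] = wg + j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            G[i][j] = V[i - 1][j - 1] + sub
            E[i][j] = max(E[i][j - 1] + ws,           # extend a gap in S
                          G[i][j - 1] + wg + ws,      # open a new gap in S
                          F[i][j - 1] + wg + ws)
            F[i][j] = max(F[i - 1][j] + ws,           # extend a gap in T
                          G[i - 1][j] + wg + ws,      # open a new gap in T
                          E[i - 1][j] + wg + ws)
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]
```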


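When do constant and affine gap costs differ?

  • Consider two alignments of AGAGACTGACGCTTA with ATATTA:

    Alignment 1:
    AGAGACTGACGCTTA
    ATA---------TTA

    Alignment 2:
    AGAGACTGACGCTTA
    ----A-T-A---TTA

  • Scores (matches score 0):

    Scheme                                                       Alignment 1   Alignment 2
    Constant penalty (mismatch -5, gap -1 per space)             -14           -9
    Affine penalty (mismatch -5, gap open -3, gap extend -0.5)   -12           -14.5

  • The affine totals charge “gap open” for the first space of a gap and “gap extend” for each additional space

  • The constant penalty prefers alignment 2 (many short gaps); the affine penalty prefers alignment 1 (one long gap and a mismatch)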
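
The four totals can be checked with a short script. The sketch below assumes matches score 0 and that the first space of a gap costs `gap_open` while every further space costs `gap_extend`, which is the convention that reproduces the numbers above; the function name is hypothetical.

```python
def score_alignment(row_s, row_t, mismatch=-5.0, gap_open=-1.0, gap_extend=-1.0):
    """Score two aligned rows of equal length; '-' marks a space."""
    total = 0.0
    prev = None                      # row that held a space in the previous column
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            cur = "s" if a == "-" else "t"
            total += gap_extend if cur == prev else gap_open
            prev = cur
        else:
            if a != b:
                total += mismatch
            prev = None
    return total

S = "AGAGACTGACGCTTA"
one_long_gap = "ATA---------TTA"
many_short_gaps = "----A-T-A---TTA"

constant = [score_alignment(S, row) for row in (one_long_gap, many_short_gaps)]
affine = [score_alignment(S, row, gap_open=-3.0, gap_extend=-0.5)
          for row in (one_long_gap, many_short_gaps)]
```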


Bounding the number of gaps

  • Let's say we are allowed to have at most K gaps

  • (Gaps ≠ spaces: a gap can contain many spaces)

  • Now we keep track of the number of gaps we have opened so far

  • We also still need to keep track of whether a gap is currently open in S or T (E/F matrices)


Bounding the number of gaps

  • A “multi-layer” DP matrix

  • Actually separate functions V, E, F on every layer, keeping track of the layer number

  • Every time we open or close a gap we “jump” to the next layer

  • Where to look for the solution? (Not only at the last layer!)

  • What is the complexity?


Bounding the number of spaces

  • Let's say that no gap can exceed k spaces

  • Of course, now we cannot also bound the number of gaps (why?)

  • How many matrices do we need now?

  • Here, there is no monotone notion of a layer like before

  • What's the complexity?


What about arbitrary gap functions?

  • Suppose the cost of a gap is an arbitrary function f(k) of its length k

  • Then, when computing Dij, we need to look at up to j places “back” and up to i places “up”, and take the minimum over all of them

  • Complexity?
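
The formula for this slide did not survive extraction, so the Python sketch below is one common way to write such a recurrence, under stated assumptions: costs are minimised, matches cost 0, mismatches cost `mismatch`, and `gap_cost(k)` plays the role of f(k). The function and parameter names are hypothetical. The two inner loops are the "back" and "up" scans.

```python
import math

def general_gap_distance(s, t, gap_cost, mismatch=1):
    """Minimum alignment cost when a gap of k spaces costs gap_cost(k)."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0 and j > 0:                       # s_i aligned with t_j
                best = D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            for k in range(1, j + 1):                 # gap of k spaces ending at t_j
                best = min(best, D[i][j - k] + gap_cost(k))
            for k in range(1, i + 1):                 # gap of k spaces ending at s_i
                best = min(best, D[i - k][j] + gap_cost(k))
            D[i][j] = best
    return D[n][m]
```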


Special cases

  • How about a logarithmic penalty? Wg+Ws*log(k)

  • This is a special case of a convex penalty, which is solvable in O(mn*log(m))

  • The logarithmic case can be done in O(mn)

  • For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))


Supersequence

  • Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.

  • e.g., YABADABADU is a non-contiguous supersequence of BABU (yaBAdaBadU)

  • Given S and T, find their shortest common supersequence


Reminder: LCS

  • Longest common non-contiguous subsequence:

    • Adjust global alignment with similarity scores

    • 1 for match

    • 0 for gaps

    • -∞ for mismatches


Supersequence

  • Find the longest common sub-sequence of S,T

  • Generate the string as follows:

    • for every column in the alignment

    • Match – add the matching character (once!)

    • Gap – add the character aligned against the gap


Supersequence

  • For S=“Pride” T=“Parade”:

  • P-R-IDE

  • PARA-DE

  • PARAIDE – Shortest common supersequence
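
The construction can be sketched in Python as an LCS table followed by a traceback that emits matched characters once and gap-aligned characters as they are. The function name is an assumption, and ties in the traceback may produce a different supersequence of the same (shortest) length than the one shown above.

```python
def shortest_common_supersequence(s, t):
    """Build a shortest common supersequence of s and t via their LCS table."""
    n, m = len(s), len(t)
    # lcs[i][j] = length of the LCS of s[:i] and t[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:            # match column: add the character once
            out.append(s[i - 1]); i -= 1; j -= 1
        elif lcs[i - 1][j] >= lcs[i][j - 1]:
            out.append(s[i - 1]); i -= 1    # s_i aligned against a gap
        else:
            out.append(t[j - 1]); j -= 1    # t_j aligned against a gap
    out.extend(reversed(s[:i]))             # leftover prefix of s, if any
    out.extend(reversed(t[:j]))             # leftover prefix of t, if any
    return "".join(reversed(out))

def is_subsequence(small, big):
    """Helper for checking the result: is small a subsequence of big?"""
    it = iter(big)
    return all(c in it for c in small)

scs = shortest_common_supersequence("PRIDE", "PARADE")
```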


Exercise: Finding repeats

  • Basic objective: find a pair of subsequences within S with maximum similarity

  • Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?)

  • But using local alignment is still a good idea


Variant #1

  • Specific requirement: the two sequences may overlap

  • Solution: Change the local alignment algorithm:

    • Compute only the upper triangular submatrix (V(i,j), where j>i).

    • Set diagonal values to 0

  • Complexity: O(n^2) time and O(n) space
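
A sketch of this variant in Python, keeping only two rows to stay within O(n) space. The scoring values and the function name are assumptions; only the score is returned, not the repeat itself.

```python
def best_repeat_score(s, match=1, mismatch=-1, gap=-1):
    """Best local alignment of s with itself, restricted to cells with j > i."""
    n = len(s)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)                  # cells with j <= i stay 0 (diagonal = 0)
        for j in range(i + 1, n + 1):
            sub = match if s[i - 1] == s[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + sub,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Forcing the diagonal to 0 is what prevents the trivial answer of aligning every position with itself.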


Variant #2

  • Specific requirement: the two sequences may not overlap

  • Solution: Absence of overlap means that some k exists such that one string lies in S[1..k] and the other in S[k+1..n]

    • Check local alignments between S[1..k] and S[k+1..n] for every 1<=k<n

    • Pick the highest-scoring alignment

  • Complexity: O(n^3) time and O(n) space
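
As a sketch, this is a loop over split points around a two-row local alignment. Scoring values and both function names are assumptions for illustration.

```python
def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment score of s and t using two rows (linear space)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def best_nonoverlapping_repeat(s):
    """Try every split point k and align S[1..k] with S[k+1..n] locally."""
    return max((local_score(s[:k], s[k:]) for k in range(1, len(s))), default=0)
```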



Variant #3

  • Specific requirement: the two sequences must be consecutive (tandem repeat)

  • Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], with

    • No penalties for gaps in the beginning of S[1..k]

    • No penalties for gaps in the end of S[k+1..n]

  • Complexity: O(n^3) time and O(n) space



Variant #4

  • Specific requirement: the two sequences must be consecutive, and the similarity is measured between the first sequence and the reverse complement of the second, taken from S^RC, the reverse complement of S (inverted repeat)

  • Tempting (albeit wrong) to use something in the spirit of variant #3, which would give complexity O(n^3)


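Variant #4

  • Solution: Compute the local alignment between S and S^RC

  • Look for results on the anti-diagonal i+j=n

  • Example: AGCTAACGCGTTCGAA (n=16)

  • Complexity: O(n^2) time, O(n) space

[Figure: S and S^RC with index 8 marked in each; the inverted repeat AACG|CGTT is split after position 8, so i = j = 8 and i+j = n.]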
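
A hedged Python sketch of this solution: align S locally against its reverse complement with two rows, and in each row i inspect only the cell j = n - i. The DNA alphabet, scoring values and function names are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(s):
    return "".join(COMPLEMENT[c] for c in reversed(s))

def best_inverted_repeat(s, match=1, mismatch=-1, gap=-1):
    """Return (score, i): best local alignment of s vs. its reverse complement
    ending on the anti-diagonal i + j = n, and the index i where it ends in s."""
    n = len(s)
    rc = reverse_complement(s)
    prev = [0] * (n + 1)
    best, best_i = 0, 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = match if s[i - 1] == rc[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        if cur[n - i] > best:               # only cells with i + j = n count
            best, best_i = cur[n - i], i
        prev = cur
    return best, best_i
```

The condition i + j = n is what makes the two halves adjacent: position j of S^RC corresponds to position n + 1 - j of S, so the second half starts at i + 1 exactly when j = n - i.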

