Comp. Genomics


Presentation Transcript

  1. Comp. Genomics Recitation 1

  2. Outline • Sequence alignment • End-space free alignment • Alignment with gaps

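  3. Alignment basic step • Let x[1..i] end in G and y[1..j] end in C. The last column of their alignment is one of three cases: • xi aligned with yj (G over C) • xi aligned with a gap (G over -) • a gap aligned with yj (- over C)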

  4. Global alignment • All of x has to be aligned with all of y • Therefore, every gap is “paid for” • The solution score is found in one cell, the bottom-right corner of the DP matrix • The traceback goes all the way back to the top-left corner

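  5. Global alignment • Input: sequences x, y • Output: maximum-score alignment • F(i,j): score of aligning x[1..i] with y[1..j], where $\sigma(a,b)$ is the score of aligning symbol a with symbol b (either may be a gap) • Base conditions: $F(i,0) = \sum_{k=1}^{i} \sigma(x_k, -)$ and $F(0,j) = \sum_{k=1}^{j} \sigma(-, y_k)$ • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

     $$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$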
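
The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not code from the recitation: the function name and the simple `match` / `mismatch` / `gap` parameters are assumptions standing in for a general scoring function σ.

```python
def global_align_score(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment table F and return F[n][m].

    match/mismatch/gap are illustrative stand-ins for sigma.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # x_i aligned with y_j
                F[i - 1][j] + gap,    # x_i aligned with a gap
                F[i][j - 1] + gap,    # y_j aligned with a gap
            )
    return F[n][m]
```

With these example scores, aligning `"ACGT"` with `"AGT"` must pay for one gap, so the best score is three matches minus one gap.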

  6. Local alignment • A substring of x is aligned with a substring of y • Gaps outside the substrings are “costless” • The solution equals the maximum-score cell in the DP matrix • Base conditions: F(i,0) = 0, F(0,j) = 0 • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

     $$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \\ 0 \end{cases}$$
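
As a rough sketch of how the extra 0 changes the computation, the following Python function fills the local-alignment table and tracks the best cell. The function name and the default scores are assumptions chosen for illustration only.

```python
def local_align_score(x, y, match=2, mismatch=-1, gap=-2):
    """Return the maximum cell of the local-alignment table.

    match/mismatch/gap are illustrative stand-ins for sigma.
    """
    n, m = len(x), len(y)
    # Row 0 and column 0 stay 0: leading gaps are free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                    # start a fresh alignment here
                F[i - 1][j - 1] + s,
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
            best = max(best, F[i][j])
    return best
```

Because the answer is the maximum over all cells rather than the corner cell, trailing characters are ignored as well as leading ones.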

  7. Local alignment example • Scoring: match and mismatch scores from BLOSUM50, gap: -8 • Local alignment shown: AWGHE over AW-HE

  8. Overlap matches (end-space free alignment) • Something between global and local • Consider aligning a gene x to a (bacterial) genome y • Gaps at the beginning and end of x and y are costless • But all of x should be aligned • Base conditions: F(i,0) = 0, F(0,j) = 0 • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

     $$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$

     • The optimal solution is found in the last row or last column (not necessarily at the bottom-right corner)
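
A minimal sketch of the overlap variant, under the same illustrative scoring assumptions (`match`, `mismatch`, `gap` and the function name are not from the slides): the table is filled as in global alignment, but the border is zero and the answer is read from the last row or last column.

```python
def overlap_align_score(x, y, match=1, mismatch=-1, gap=-2):
    """Best end-space-free alignment score of x and y."""
    n, m = len(x), len(y)
    # Zero first row and column: gaps at the start are free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Gaps at the end are free: take the best cell on the
    # last row (x exhausted) or last column (y exhausted).
    last_row = max(F[n])
    last_col = max(F[i][m] for i in range(n + 1))
    return max(last_row, last_col)
```

For a short "gene" such as `"ACGT"` inside a longer "genome" such as `"TTACGTTT"`, the best cell sits in the last row but not in the corner.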

  9. Handling weird gaps • Affine gap: different costs for “new” and “old” gaps • The last column is still one of three cases (xi over yj, xi over -, - over yj), but now we care whether there was a gap in the previous column • Two new things to keep track of → two additional matrices

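  10. Alignment with affine gap penalty • Three matrices over x[1..i], y[1..j]: $M(i,j)$: best score when $x_i$ is aligned with $y_j$; $I_x(i,j)$: best score when the alignment ends with $x_i$ opposite a gap (x 1...i over y 1...j----); $I_y(i,j)$: best score when the alignment ends with $y_j$ opposite a gap (x 1...i---- over y 1...j) • $W_g$ is paid once per gap and $W_s$ once per gapped symbol, with $W_g, W_s < 0$ • Base conditions: $M(0,0) = 0$, $M(i,0) = I_x(i,0) = W_g + i W_s$, $M(0,j) = I_y(0,j) = W_g + j W_s$ • Recursive computation:

     $$M(i,j) = \max \begin{cases} M(i-1,j-1) + \sigma(x_i,y_j) \\ I_x(i-1,j-1) + \sigma(x_i,y_j) \\ I_y(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$$

     $$I_x(i,j) = \max \begin{cases} M(i-1,j) + W_g + W_s \\ I_x(i-1,j) + W_s \end{cases}$$

     • $I_y(i,j)$ is computed symmetrically from $M(i,j-1)$ and $I_y(i,j-1)$ • The optimal solution is the maximum of the relevant cells in the three matrices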
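
The three-matrix scheme can be sketched as follows. This is a hedged illustration: the function name and the default scores are assumptions, cells the slide leaves undefined are set to minus infinity, and the `Iy` update is written as the mirror image of the `Ix` rule, which is an assumption since only the `Ix` rule is spelled out.

```python
NEG_INF = float("-inf")

def affine_align_score(x, y, match=1, mismatch=-1, Wg=-2, Ws=-1):
    """Global alignment score with affine gaps (Wg open, Ws per symbol)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # Base conditions as on the slide.
    M[0][0] = 0
    for i in range(1, n + 1):
        M[i][0] = Ix[i][0] = Wg + i * Ws
    for j in range(1, m + 1):
        M[0][j] = Iy[0][j] = Wg + j * Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # x_i aligned with y_j, whatever the previous column was.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i opposite a gap: open a new gap or extend an old one.
            Ix[i][j] = max(M[i - 1][j] + Wg + Ws, Ix[i - 1][j] + Ws)
            # y_j opposite a gap (assumed symmetric to Ix).
            Iy[i][j] = max(M[i][j - 1] + Wg + Ws, Iy[i][j - 1] + Ws)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

With these example values a single gap of length 3 costs `Wg + 3 * Ws = -5`, so six matches around one long gap still give a positive score.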

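  11. When do constant and affine gap costs differ? • Consider aligning AGAGACTGACGCTTA with ATATTA in two ways:

     Alignment 1: AGAGACTGACGCTTA
                  ATA---------TTA

     Alignment 2: AGAGACTGACGCTTA
                  ----A-T-A---TTA

     • Constant penalty (mismatch: -5, gap: -1): alignment 1 scores -14, alignment 2 scores -9 • Affine penalty (mismatch: -5, gap open: -3, gap extend: -0.5): alignment 1 scores -12, alignment 2 scores -14.5 • The numbers assume that matches score 0 and that the gap-open cost covers the first position of a gap, with each further position costing the gap-extend amount • The constant penalty prefers alignment 2 (many short gaps, no mismatch); the affine penalty prefers alignment 1 (one long gap, one mismatch)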
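
The four scores on this slide can be reproduced with a small scoring helper. This is a sketch under the conventions the numbers imply (matches score 0, the open cost is charged for the first position of each gap run); the function name and signature are hypothetical.

```python
def score_alignment(a, b, mismatch, gap_open, gap_extend, match=0):
    """Score a gapped alignment given as two equal-length rows.

    A gap run of length L costs gap_open + (L - 1) * gap_extend.
    A constant penalty is the special case gap_open == gap_extend.
    """
    assert len(a) == len(b)
    total = 0
    prev_gap_row = None  # row that held a gap in the previous column
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            row = "a" if ca == "-" else "b"
            total += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            total += match if ca == cb else mismatch
            prev_gap_row = None
    return total

x = "AGAGACTGACGCTTA"
one_long_gap = "ATA---------TTA"
many_short_gaps = "----A-T-A---TTA"

print(score_alignment(x, one_long_gap, -5, -1, -1))       # constant
print(score_alignment(x, many_short_gaps, -5, -1, -1))    # constant
print(score_alignment(x, one_long_gap, -5, -3, -0.5))     # affine
print(score_alignment(x, many_short_gaps, -5, -3, -0.5))  # affine
```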

  12. Question • Given two sequences x and y, the fragmentation number of x,y is the minimal k such that x and y can be broken into substrings x1,x2,...,xk ; y1,y2,...,yk and every xi is a substring of the corresponding yi • Suggest an algorithm for finding the fragmentation number of two sequences

  13. Solution • Global alignment with the following modifications: • No penalty for gaps at the ends of y • Gaps are only allowed in x (characters of x may not be skipped) • Mismatches are not allowed (score -∞) • Affine gaps score, with open cost 1 and extension cost 0
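
One possible reading of this solution, offered as an interpretation rather than something stated on the slide: with open cost 1 and extension cost 0, the optimal cost counts the gap runs that fall strictly between matched characters, and the fragmentation number is that count plus one. The sketch below computes it directly with a simplified DP instead of the full three-matrix affine scheme. The function name is hypothetical and x is assumed to be non-empty.

```python
def fragmentation_number(x, y):
    """Minimal k, or None if x cannot be embedded in y at all.

    Assumes x is non-empty. Counts interior gap runs (cost 1 each,
    extension free) over all exact embeddings of x into y.
    """
    INF = float("inf")
    # prev[j]: fewest interior gap runs with the current prefix of x
    # embedded in y and its last character matched to y[j].
    prev = [0 if x[0] == c else INF for c in y]
    for xi in x[1:]:
        cur = [INF] * len(y)
        best_far = INF  # min of prev[0 .. j-2]: reaching j needs a new gap
        for j in range(1, len(y)):
            if j >= 2:
                best_far = min(best_far, prev[j - 2])
            if xi == y[j]:
                # Either continue right after y[j-1] (no new gap),
                # or jump over skipped characters of y (one new gap).
                cur[j] = min(prev[j - 1], best_far + 1)
        prev = cur
    best = min(prev, default=INF)
    return None if best == INF else best + 1
```

For example, `"ab"` inside `"cabd"` needs a single fragment, while `"ab"` against `"acb"` needs two because the `c` separates the matched characters.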

  14. Question • How do we align two sequences with a bound k on the maximal number of gaps? • Analyze the complexity

  15. Solution • We will divide every cell in the alignment matrix into 2k sub-cells. • The meaning of a sub-cell is as follows: k cells with superscript 1: k cells with superscript 2:

  16. Solution • The update rule for sub-cells with superscript 1: • The update rule for sub-cells with superscript 2:

  17. What about arbitrary gap functions? • Suppose the gap cost is an arbitrary function of the gap length, γ(k) • When computing cell (i,j), we need to look at all possible gap lengths “back” along row i and column j

  18. Alignment with arbitrary gap functions • Recursive computation:

     $$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ F(k,j) + \gamma(i-k), & k = 0,\dots,i-1 \\ F(i,k) + \gamma(j-k), & k = 0,\dots,j-1 \end{cases}$$
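
A direct, unoptimized sketch of this recurrence follows. The function name, the `match` / `mismatch` scores, and the convention that `gamma(L)` returns a negative score for a gap of length `L` are assumptions made for illustration. Border cells are filled by the same gap rules, so `F[i][0]` ends up as the best way to cover `i` symbols with gaps.

```python
def general_gap_align_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment where a gap of length L scores gamma(L)."""
    n, m = len(x), len(y)
    F = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(F[i - 1][j - 1] + s)
            # A gap of length i - k ending at x_i, for every start k.
            candidates.extend(F[k][j] + gamma(i - k) for k in range(i))
            # A gap of length j - k ending at y_j, for every start k.
            candidates.extend(F[i][k] + gamma(j - k) for k in range(j))
            F[i][j] = max(candidates)
    return F[n][m]
```

Passing a linear `gamma` such as `lambda L: -2 * L` reproduces ordinary global alignment with a per-symbol gap cost, and `lambda L: -2 - L` behaves like an affine penalty. The inner scans over `k` are what make each cell cost linear rather than constant time.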

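  19. Complexity • Suppose the two sequences are of length n. • The matrix has O(n²) cells, and each cell takes O(n) time because of the maximization over k, so the total is O(n³) time and O(n²) space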

  20. LCS • Longest common non-contiguous subsequence: • Use global alignment with similarity scores: • +1 for a match • 0 for an indel • -∞ for a mismatch
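
With these scores the global-alignment recurrence collapses to the familiar LCS table. A small sketch (the function name is an assumption):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Global alignment with +1 per match, 0 per indel, and
    mismatches forbidden (score of minus infinity).
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # indels cost 0, so borders are 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(F[i - 1][j], F[i][j - 1])  # indel, cost 0
            if a[i - 1] == b[j - 1]:
                best = max(best, F[i - 1][j - 1] + 1)  # match, +1
            # A mismatching diagonal step is never taken.
            F[i][j] = best
    return F[n][m]
```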

  21. Exercise: Shortest common supersequence • A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contiguous supersequence of BABU • Problem: Given A and B, find their shortest common supersequence

  22. Solution • For A=“PRIDE”, B=“PARADE”: • Compute the LCS using global alignment: A = P-R-IDE, B = PARA-DE • Reading off the non-gap character of each column gives PARAIDE, a shortest common supersequence • Notice that PRDE is the longest common subsequence of A and B.
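
The construction can be sketched as an LCS table followed by a traceback that emits one character per alignment column. The function name is hypothetical, and ties in the traceback are broken arbitrarily, so other inputs may admit several equally short answers.

```python
def shortest_common_supersequence(a, b):
    """Build a shortest common supersequence from the LCS table."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []  # built back to front
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])  # matched column: emit once
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            out.append(a[i - 1])  # a_i opposite a gap
            i -= 1
        else:
            out.append(b[j - 1])  # b_j opposite a gap
            j -= 1
    out.extend(reversed(a[:i]))  # leftover prefix of a
    out.extend(reversed(b[:j]))  # leftover prefix of b
    return "".join(reversed(out))
```

The result always has length `len(a) + len(b) - LCS(a, b)`, which is 5 + 6 - 4 = 7 for PRIDE and PARADE.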

  23. Exercise: Finding repeats • Basic objective: find a pair of subsequences within a string x with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of x with itself! (Why is this wrong?) • But using local alignment is still a good idea

  24. Variant #1 • First variant: the two sequences may not overlap • Solution: absence of overlap means that there exists an index k such that one substring is in x[1..k] and the other in x[k+1..n] • Check local alignments between x[1..k] and x[k+1..n] for all 1 ≤ k < n • Pick the highest-scoring alignment • Complexity: O(n³) time (n split points, O(n²) per local alignment) and O(n) space
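
A sketch of this split-and-align procedure, computing scores only. The function names and default scores are assumptions. Keeping just the previous DP row is what gives linear space for the score.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, keeping one DP row."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def best_nonoverlapping_repeat(x):
    """Try every split point k and return (best score, k)."""
    return max(
        ((local_score(x[:k], x[k:]), k) for k in range(1, len(x))),
        default=(0, 0),
    )
```

Recovering the repeat itself, rather than only its score, would need a traceback or a linear-space refinement, which this sketch leaves out.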

  25. Variant #1, Pictorially

  26. Variant #2 • Second variant: the two sequences must be consecutive (tandem repeat) • Solution: similar to variant #1, but somewhat “ends-free”: seek a global alignment between x[1..k] and x[k+1..n] with • No penalties for gaps at the beginning of x[1..k] • No penalties for gaps at the end of x[k+1..n] • Complexity: O(n³) time and O(n) space
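
A sketch of the tandem-repeat variant: for each split point, a suffix of the left part is aligned with a prefix of the right part, so the two copies meet exactly at the split. The function names and default scores are assumptions for illustration.

```python
def suffix_prefix_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning some suffix of a with some prefix of b."""
    # Row 0: skipping the start of b is NOT free.
    prev = [j * gap for j in range(len(b) + 1)]
    for ca in a:
        cur = [0]  # column 0: skipping the start of a is free
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # All of a's tail is used; the end of b is free: best cell of the last row.
    return max(prev)

def best_tandem_repeat(x):
    """Try every split point k and return (best score, k)."""
    return max(
        ((suffix_prefix_score(x[:k], x[k:]), k) for k in range(1, len(x))),
        default=(0, 0),
    )
```

For `"GGACGTACGTCC"` the best split falls between the two copies of `ACGT`, at k = 6.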

  27. Variant #2, Pictorially
