
Sequence Alignment Tutorial #3

Sequence Alignment Tutorial #3. © Ydo Wexler & Dan Geiger. Sequence Alignment (Reminder). Global Alignment. Input: two sequences S1, S2 over the same alphabet. Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps).


Presentation Transcript


  1. Sequence Alignment Tutorial #3. © Ydo Wexler & Dan Geiger

  2. Sequence Alignment (Reminder)
     Global Alignment:
     Input: two sequences S1, S2 over the same alphabet
     Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps)
     Example:
     • S1 = GCGCATGGATTGAGCGA
     • S2 = TGCGCCATTGATGACC
     • A possible alignment:
       S’1 = -GCGC-ATGGATTGAGCGA
       S’2 = TGCGCCATTGAT-GACC--
     Goal: How similar are the two sequences S1 and S2?
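
The global alignment problem above is usually solved with a dynamic-programming table. The sketch below computes only the optimal score. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for illustration; the slide does not fix a scoring scheme.

```python
def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score under an assumed linear gap penalty."""
    n, m = len(s1), len(s2)
    # V[i][j] = best score of aligning the prefixes s1[:i] and s2[:j]
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap          # s1[:i] aligned against gaps only
    for j in range(1, m + 1):
        V[0][j] = j * gap          # s2[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            V[i][j] = max(diag, V[i - 1][j] + gap, V[i][j - 1] + gap)
    return V[n][m]


print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC"))
```

The table has (|S1|+1)·(|S2|+1) entries and each is filled in constant time.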

  3. Sequence Alignment (Reminder)
     Local Alignment:
     Input: two sequences S1, S2 over the same alphabet
     Output: two sequences S’1, S’2 of equal length (S’1, S’2 are substrings of S1, S2 with possibly additional gaps)
     Example:
     • S1 = GCGCATGGATTGAGCGA
     • S2 = TGCGCCATTGATGACC
     • A possible alignment:
       S’1 = ATTGA-G
       S’2 = ATTGATG
     Goal: Find the pair of substrings of the two input sequences that has the highest similarity
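
Local alignment differs from the global table in two ways: a cell may restart at 0, and the answer is the best cell anywhere in the table. A minimal sketch follows, again with an assumed scoring scheme (match +1, mismatch -1, gap -1) and a hypothetical function name.

```python
def local_alignment_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings (assumed scoring values)."""
    n, m = len(s1), len(s2)
    # V[i][j] = best score of an alignment ending at s1[i-1], s2[j-1],
    # or 0 if starting a fresh alignment here is better
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            V[i][j] = max(0, diag, V[i - 1][j] + gap, V[i][j - 1] + gap)
            best = max(best, V[i][j])
    return best


print(local_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC"))
```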

  4. Sequence Alignment (Reminder)
       -GCGC-ATGGATTGAGCGA
       TGCGCCATTGAT-GACC-A
     Three elements:
     • Perfect matches
     • Mismatches
     • Insertions & deletions (indels)
     • Score each position independently
     • Score of an alignment is the sum of the position scores
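
Scoring each position independently can be sketched in a few lines. The default values used here (match 0, mismatch -1, indel -1) are borrowed from the scoring scheme on slide 10; the function name is hypothetical.

```python
def alignment_score(a, b, match=0, mismatch=-1, indel=-1):
    """Sum of per-column scores for two gapped sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':
            total += indel         # insertion or deletion
        elif x == y:
            total += match         # perfect match
        else:
            total += mismatch
    return total


# The alignment shown on this slide: 13 matches, 2 mismatches, 4 indels.
print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # -6
```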

  5. Breaking Number
     • Input: Two sequences M, E over the same alphabet (|M| ≥ |E|)
     • Output: The smallest k such that there exist partitions M = M1M2…Mk, E = E1E2…Ek where Ei is a substring of Mi for all i = 1..k. If no such k exists, return ∞.
     Example:
       M = AAAATTTAAATTTA
       E = AATTATA
       M1 = AAAATTT   M2 = AAATTT   M3 = A
       E1 = AATT      E2 = AT       E3 = A

       AAAATTTAAATTTA
       --AATT---AT--A
     Find an O(|M||E|) algorithm for computing the breaking number of M, E.

  6. Breaking Number (cont)
     • Solution: Reduce the problem to global alignment with modifications:
       • Do not allow mismatches
       • Do not allow gaps in M
       • No penalty for gaps at the start/end of the sequence
       • Constant penalty for gaps (regardless of their length)
     • Scoring scheme (affine gap penalty):
       • Match: 0
       • Mismatch: -∞
       • Gap introduction: -1
       • Gap elongation: 0

       AAAATTTAAATTTA
       --AATT---AT--A
     Breaking number = -(score of the alignment) + 1.

  7. Breaking Number (cont)
     • Complexity: Standard O(|M||E|) dynamic programming
     • Correctness: Two-way argument
       (1) An alignment of score -(k-1) corresponds to a partition of M, E into k blocks
       (2) A partition of M, E into k blocks has an alignment score of -(k-1)
     • Optimal alignment has score -∞ ⇒ there is no valid partition (by 2)
     • Optimal alignment has score -k ⇒
       • there is a valid partition into k+1 blocks (by 1)
       • there is no valid partition into fewer blocks (by 2)
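
The reduction can be sketched as a two-state dynamic program. Instead of maximizing a score it minimizes the number of internal gap runs, which is the negated score under the scheme above. The two-table layout and all names are assumptions of this sketch, not taken from the slides.

```python
import math


def breaking_number(M, E):
    """Smallest k with M = M1..Mk, E = E1..Ek and each Ei a substring of Mi."""
    n, m = len(M), len(E)
    INF = math.inf
    # match[i][j]: fewest internal gap runs aligning M[:i] with E[:j]
    #              when M[i-1] is paired with E[j-1] (mismatches are forbidden)
    # gap[i][j]:   same, but M[i-1] is paired with a gap in E
    match = [[INF] * (m + 1) for _ in range(n + 1)]
    gap = [[INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0                      # empty alignment
    for i in range(1, n + 1):
        for j in range(m + 1):
            # opening a gap run is free before E starts or after E ends
            open_cost = 0 if j in (0, m) else 1
            # extending a gap run is free (gap elongation 0)
            gap[i][j] = min(gap[i - 1][j], match[i - 1][j] + open_cost)
            if j >= 1 and M[i - 1] == E[j - 1]:
                match[i][j] = min(match[i - 1][j - 1], gap[i - 1][j - 1])
    best = min(match[n][m], gap[n][m])
    return best + 1 if best < INF else INF


print(breaking_number("AAAATTTAAATTTA", "AATTATA"))  # 3
```

Gaps in M never appear because no transition consumes a character of E without also consuming one of M. Both tables have (|M|+1)·(|E|+1) cells filled in constant time each, matching the O(|M||E|) bound.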

  8. Multiple Sequence Alignment
     S1 = AGGTC
     S2 = GTTCG
     S3 = TGAAC
     [Figure: two possible alignments of S1, S2, S3]

  9. Multiple Sequence Alignment (cont)
     • Input: Sequences S1, S2, …, Sk over the same alphabet
     • Output: Gapped sequences S’1, S’2, …, S’k of equal length
       • |S’1| = |S’2| = … = |S’k|
       • Removing the spaces from S’i yields Si
     The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

  10. Multiple Sequence Alignment Example
      Consider the following alignment:
        AC-CDB-
        -C-ADBD
        A-BCDAD
      Scoring scheme: match: 0, mismatch/indel: -1
      SP score: (-4) + (-3) + (-5) = -12
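
The SP score of this example can be checked with a short sketch. It assumes that a column pairing two gaps contributes 0 to the induced pairwise score (the convention δ(-,-) = 0 stated on slide 15); the function names are hypothetical.

```python
from itertools import combinations


def induced_pair_score(a, b, match=0, penalty=-1):
    """Score of the pairwise alignment that two rows of a multiple alignment induce."""
    total = 0
    for x, y in zip(a, b):
        if x == '-' and y == '-':
            continue               # gap against gap contributes nothing
        total += match if x == y else penalty
    return total


def sp_score(rows):
    """Sum-of-pairs score: sum over all pairs of rows."""
    return sum(induced_pair_score(a, b) for a, b in combinations(rows, 2))


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```

The three pairwise terms are -3 (rows 1, 2), -4 (rows 1, 3) and -5 (rows 2, 3).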

  11. Multiple Sequence Alignment Complexity
      Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
      • Instead of a 2-dimensional table we have a k-dimensional table
      • Each dimension is of length n+1
      • Each entry depends on 2^k - 1 adjacent entries
      Complexity: O(2^k n^k)
      This problem is known to be NP-hard (no polynomial-time algorithm is known)

  12. Multiple Sequence Alignment Approximation Algorithm
      We use cost instead of score ⇒ find an alignment of minimal cost
      Assumption: the cost function δ is a distance function
      • δ(x,x) = 0
      • δ(x,y) = δ(y,x) ≥ 0
      • δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality)
        (e.g. cost of a mismatch ≤ cost of two indels)
      D(S,T): cost of a minimum-cost global alignment between S and T

  13. Multiple Sequence Alignment Approximation Algorithm
      The ‘star’ algorithm:
      Input: Γ, a set of k strings S1, …, Sk.
      • Find the string S’ ∈ Γ (the center) that minimizes Σ_{S ∈ Γ} D(S’, S)
      • Denote S1 = S’ and the rest of the strings as S2, …, Sk
      • Iteratively add S2, …, Sk to the alignment as follows:
        • Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1
        • Align Si to S’1 to produce S’i and S’’1 aligned
        • Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1
        • Replace S’1 by S’’1
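
The steps of the star algorithm can be sketched as follows. The unit cost function `delta` (0 for identical symbols, including gap against gap, and 1 otherwise) is an assumption chosen because it satisfies the distance-function requirements of slide 12; the helper names and the `is_new` bookkeeping are likewise illustrative.

```python
def delta(x, y):
    # Assumed unit cost: 0 for identical symbols (gap vs. gap included), else 1.
    return 0 if x == y else 1


def align(s, t):
    """Min-cost global alignment of s (may already contain '-') with t.

    Returns (cost, gapped s, gapped t, is_new), where is_new[c] is True
    when column c holds a gap newly inserted into s.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], '-')
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta('-', t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], '-'),
                          D[i][j - 1] + delta('-', t[j - 1]))
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b, is_new = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); is_new.append(False)
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], '-'):
            a.append(s[i - 1]); b.append('-'); is_new.append(False)
            i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); is_new.append(True)
            j -= 1
    return D[n][m], ''.join(reversed(a)), ''.join(reversed(b)), is_new[::-1]


def star_alignment(strings):
    k = len(strings)
    # Center: the string minimizing the sum of alignment costs to all strings.
    center = min(range(k),
                 key=lambda c: sum(align(strings[c], s)[0] for s in strings))
    order = [center] + [i for i in range(k) if i != center]
    rows = [strings[center]]             # rows[0] is the gapped center
    for idx in order[1:]:
        _, new_center, new_row, is_new = align(rows[0], strings[idx])
        # Add a space to every earlier row wherever one was added to the center.
        adjusted = []
        for row in rows[1:]:
            chars = iter(row)
            adjusted.append(''.join('-' if new else next(chars) for new in is_new))
        rows = [new_center] + adjusted + [new_row]
    # Return the rows in the order of the input strings.
    result = [None] * k
    for row, idx in zip(rows, order):
        result[idx] = row
    return result


for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Because ties in the traceback can be broken in several ways, the printed alignment is one of possibly many that the procedure could produce.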

  14. Multiple Sequence Alignment Approximation Algorithm
      Time analysis:
      • Choosing S1: execute DP for all sequence pairs, O(k^2 n^2)
      • Adding Si to the alignment: execute DP for Si, S’1, O(i·n^2)
        (in the i-th stage the length of S’1 can be up to i·n)
      Total complexity: O(k^2 n^2) + Σ_{i=2..k} O(i·n^2) = O(k^2 n^2)

  15. Multiple Sequence Alignment Approximation Algorithm
      Approximation ratio:
      • M*: an optimal alignment
      • M: the alignment produced by this algorithm
      • d(i,j): the distance M induces on the pair Si, Sj
      For all i: d(1,i) = D(S1,Si)
      (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

  16. Multiple Sequence Alignment Approximation Algorithm
      Approximation ratio: [equations not preserved; the derivation steps are labeled "Triangle inequality" and "Definition of S1"]
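
The bound that this slide derives can be sketched as follows. This is the standard argument for the star algorithm written in the notation of slide 15, not a transcription of the slide. Two symbols are introduced here as assumptions: d*(i,j), the distance that the optimal alignment M* induces on Si, Sj, and cost(·), the sum of induced pairwise distances.

```latex
\begin{align*}
\mathrm{cost}(M) &= \sum_{i<j} d(i,j)
  \le \sum_{i<j} \bigl[ d(1,i) + d(1,j) \bigr]
  && \text{(triangle inequality)} \\
  &= (k-1) \sum_{i=2}^{k} D(S_1,S_i)
  && \text{(since } d(1,i) = D(S_1,S_i) \text{)} \\[4pt]
\mathrm{cost}(M^*) &= \sum_{i<j} d^*(i,j)
  \ge \sum_{i<j} D(S_i,S_j)
  = \tfrac{1}{2} \sum_{i=1}^{k} \sum_{j \ne i} D(S_i,S_j) \\
  &\ge \tfrac{k}{2} \sum_{i=2}^{k} D(S_1,S_i)
  && \text{(definition of } S_1 \text{)} \\[4pt]
\frac{\mathrm{cost}(M)}{\mathrm{cost}(M^*)}
  &\le \frac{2(k-1)}{k} = 2 - \frac{2}{k} < 2
\end{align*}
```

In the first step each index appears in k-1 pairs and d(1,1) = 0, which gives the factor (k-1). In the second bound, every string's sum of distances to the others is at least that of the center S1, which gives the factor k.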
