1 / 73

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA)

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA). Xingquan Zhu Dept. of Computer Science and Engineering. Outline. Dynamic Programming Dot Matrix Plot FASTA BLAST. Difficulty of PSA. Consider two sequences of length n Number of possible global alignments

saad
Download Presentation

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. COT 6930HPC and BioinformaticsPairwise Sequence Alignment (PSA) Xingquan Zhu Dept. of Computer Science and Engineering

  2. Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST

  3. Difficulty of PSA • Consider two sequences of length n • Number of possible global alignments • For n = 40, there are over 1023 possible alignments • Approaches • “Optimal” solution • Dynamic programming • Heuristic solutions • Dot matrix plot • FASTA • BLAST

  4. Dynamic Programming (DP) • Alignment • Problem can be subdivided • Optimal alignment of n bases • 1. Incorporates optimal alignment of 1…n-1 bases • 2. Combine with best alignment for nth base • Bioinformatics application • Global alignment [Needleman-Wunsch 1970] • Local alignment [Smith-Waterman 1981] • Complexity • O(n2) or O(n × m) algorithm (sequences of length n, m) • Feasible for moderate sized sequences, not entire genomes

  5. Dynamic Programming (DP) • Dynamic programming usually consists of three components. • Recursive relation • Tabular computation • Trace back • This efficient recursive method is used to search through all possible alignments and finding the one with the optimal score.

  6. How DP is used for alignment? • An alignment is represented as a path through a matrix. To search through the matrix of all possible paths and find the optimal path DP is used.

  7. Four possible outcomes in aligning two sequences [1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!)

  8. Intuition of Dynamic Programming If we already have the optimal solution to: XY AB then we know the next pair of characters will either be: XYZ or XY- or XYZ ABCABC AB- (where “-” indicates a gap). So we can extend the match by determining which of these has the highest score.

  9. Intuition of Dynamic Programming Considering sequences: “HEAGA” and “PA”

  10. DP (Global Alignment) • Recursive formula • Notation • xi ith letter of string x • yj jth letter of string y • x1..i prefix of x from letters 1 through i • F matrix of optimal scores (DP Matrix) • F(i, j) represents optimal score lining up x 1..i with y1..j • d gap penalty • s scoring matrix

  11. Constant vs. Non-constant Gap Penalty • Constant gap penalty • Non-constant gap penalty

  12. DP (Global Alignment) • Algorithm • Initialize: F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) • Fill from top left to bottom right using recursive formula

  13. DP Example • Sequences • Seq1: HEAGAWGHEE • Seq2: PAWHEAE • Gap Penalty: -8 (Constant) • Scoring Matrix (Blosum 50)

  14. DP Example Algorithm Initialize F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50

  15. DP Example Algorithm Initialize F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50

  16. DP Example (Traceback) • Trace arrows from bottom right to top left • Diagonal – both match • Up – left sequence match a gap • Or insert a gap to top sequence • Left – top sequence match a gap • Or insert a gap to left sequence

  17. Global alignment vs. local alignment • Global alignment: the entire sequence of each protein or DNA sequence is contained in the alignment. • Local alignment: only regions of greatest similarity between two sequences are aligned percent identity: ~26% RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

  18. Global vs. local alignment • Global alignment are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. • Global methods are useful when you want to force two sequences to align over their entire length • Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.

  19. DP (local alignment) • Make 0 minimal score (i.e., start new alignment) • Alignment can start / end anywhere • Start at highest score(s) • End when 0 reached

  20. DP (local alignment) Blosum 50

  21. DP (Local Alignment) • Traceback • Start at highest score and trace arrows back to first 0

  22. Multiple, Repeat, & Overlap Matches • Multiple alignments • “Optimal” global alignment may be best by small margin • Can report several high-scoring alignments • Repeat matches • Allow sections to match repeatedly • Find repeated patterns (domain / motif) • Overlap matches • Matching when the two sequences overlap • Does not penalize overhanging ends

  23. Overlap Alignment Consider the following problem: • Find the most significant overlap between two sequences S,T ? • Possible overlap relations: a. b. Difference from local alignment: require alignment between the endpoints of the two sequences.

  24. Overlap Alignment Formally: given S[1..n] , T[1..m] find i,j such that: d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) } is maximal. Solution: Same asGlobal alignment except we don’t not penalise overhanging ends.

  25. Overlap Alignment • Initialization:F(i,0)=0,F(0,j)=0 • Recurrence:as in global alignment • Score:maximum value at the bottom line and rightmost line

  26. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5

  27. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5

  28. Overlap Alignment (Example) S =PAWHEAE T =HEAGAWGHEE • Scoring scheme: • Match: +4 • Mismatch: -1 • Indel: -5

  29. Scoring scheme : • Match: +4 • Mismatch: -1 • Indel: -5 -2 Overlap Alignment (Example) The best overlap is: PAWHEAE------ ---HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: ---PAW-HEAE HEAGAWGHEE-

  30. Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST

  31. Dot Matrix Plot • Method • Put two sequences on vertical and horizontal axes of graph • Put dots wherever there is a match • Diagonal line is region of identity (local alignment) • Filter non-diagonals (minimal length, identity threshold)

  32. Simple Dot Plot Example

  33. Simple Dot Plot Example

  34. The Dotplot is a Table for overview of the similarities between two sequences The palindromic sequence: MAXISTAYAWAYATSIXAM The stretches of similar residues show up as diagonals The palindromic sequence is very important part of DNA sequence. Many enzyme recognize these sequences

  35. Dot Matrix Plot Example • Chimpanzee DNA aligned with itself • Filters needed to eliminate random matches filtered 6 of 10 filtered 8 of 10

  36. Dot Matrix Plot • Advantage • Extremely simple • Can use scoring matrix • Can use multiple filters to rank matches • Example • Red > 50 bases, 95% identical (very close) • Blue > 25 bases, 75% identical (close) • Green > 10 bases, 50% identical (distant) • Limitations • O(n2) algorithm • Relies on visual analysis • Difficult to find “best” alignment • Despite multiple filters, difficult to compare alignments

  37. Dot Matrix Plot • Demo • http://www.vivo.colostate.edu/molkit/dnadot/

  38. Outline • Dynamic Programming • Dot Matrix Plot • FASTA • BLAST

  39. G A A T T C A G T T A G 1 1 1 1 1 1 1 1 1 1 1 G 1 1 1 1 1 1 1 2 2 2 2 A 1 2 2 2 2 2 2 2 2 2 2 T 1 2 2 3 3 3 3 3 3 3 3 C 1 2 2 3 3 4 4 4 4 4 4 G 1 2 2 3 3 4 4 5 5 5 5 A 1 2 3 3 3 4 5 5 5 5 6 FASTA - Idea - • Problem of Dynamic Programming DP compute the score in a lot of useless area for optimal sequence • FASTA focuses on diagonal area

  40. FASTA - Heuristic - • Heuristic Good local alignment should have some exact match subsequence. FASTA focus on this area

  41. FASTA High Level Description Let q be a query max  0 For each sequence, s in DB compare q with s and compute a score, y if max < y max  y; bestSequence  s ; Return bestSequence

  42. FASTA - Algorithm - • Step 1 Find all hot-spots // Hot spots is pairs of words of length k that exactly match k=4-6 for DNA, k=1-2 for Amino acid sequences Sequence 1 Hot Spots Sequence 2

  43. G A A T T C A G T T A G * * Q Location G * * A 2,3,7,11 A * * * * C 6 T * * * * G 1,8 C * G * * T 4,5,9,10 A * * * * FASTA - Algorithm - • Step 1 in detail Use look-up Table Query : G A A T T C A G T T A Sequence: G G A T C G A Dot—Matrix Look-up Table

  44. FASTA - Algorithm - • Step 2 Score the Hot-spot and locate the ten best diagonal run. // There is some scoring system; ex. PAM250

  45. FASTA - Algorithm - • Step 3 Combine sub-alignments into one alignment with GAP GAP One of local alignment

  46. FASTA - Algorithm - • Step 4 # Consider weighted direct graph. # Let node be a sub-alignment found in step 1 # Let u and v be nodes # Edge (u,v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) # Find the maximum weight path Sub-sequence Edge One Sequence

  47. One of Sequence FASTA - Algorithm - GAP • Step 4 in detail Sub-alignment Gap -5 -3 -3 Max Weight Path

  48. FASTA - Algorithm - • Step 5 Use the dynamic programming in restricted area around the best-score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter

  49. FASTA - Algorithm - • Summary of Algorithm 1: Find all hot-spots // Hot spots is pairs of words of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment 4: Score Each alignment with gap penalty and pick up the best-score alignment 5: Use the dynamic programming in restricted area around the best-score alignment to find out the alignment greater than the best-score alignment.

  50. FASTA - Complexity - • Complexity # Step 1 and 2 // select the best 10 diagonal run Let n be a sequence from DB O(n+m) because Step 1 just uses look up the table O(n+m) << O(mn) m,n = 100 to 200

More Related