1 / 46

Pairwise Sequence Alignment and Scoring Matrices

Stat 115 Lecture 3. Pairwise Sequence Alignment and Scoring Matrices. Xiaole Shirley Liu And Jun Liu. Mol Bio Quick Facts. Building block of DNA is deoxyribonucleic acid Building block of protein is amino acid Protein is a peptide, a long peptide

gerda
Download Presentation

Pairwise Sequence Alignment and Scoring Matrices

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Stat 115 Lecture 3 Pairwise Sequence Alignmentand Scoring Matrices Xiaole Shirley Liu And Jun Liu

  2. Mol Bio Quick Facts • Building block of DNA is deoxyribonucleic acid • Building block of protein is amino acid • Protein is a peptide, a long peptide • Only exons can code for functional proteins or RNAs • Introns are spliced out STAT 115

  3. Outline • Motivation and introduction • Dynamic programming • Global sequence alignment • Needleman-Wunsch • 3 steps: Initialize, fill matrix, trace back • Gap penalties • Local sequence alignment, Smith-Waterman • Scoring matrices • PAM • BLOSUM STAT 115

  4. gap match mismatch Pairwise Sequence Alignment • Given: two sequences, scoring for match/mismatch/gap • Goal: find pairing of letters in the two sequences that optimize the total score This is a hard example. That is another easy example. This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. STAT 115

  5. Align Biological Sequences • DNA (4 nt + gap) TTGATCAC TTTA-CAC • Protein (20 aa + gap) RKVA--GMAKPNM RKIAVAAASKPAV • Sometimes > 4 nt for DNA and > 20 aa for proteins • A word on IUPAC STAT 115

  6. A adenosine C cytidine G guanine T thymidine U uridine R G A (purine) Y T C (pyrimidine) K G T (keto) M A C (amino) S G C (strong) W A T (weak) B C G T (not A) D A G T (not C) H A C T (not G) V A C G (not T) N A C G T (any) – gap IUPAC for DNA STAT 115

  7. A Ala B Asp or Asn C Cys D Asp E Glu F Phe G Gly H His I Iso K Lys L Leu M Met N Asn P Pro G Gln R Arg S Ser T Thr U Sel V Val W Try Y Tyr Z Glu or Gln X Any * Translation stop – Gap IUPAC for Protein STAT 115

  8. Why Align Two Sequences • If two sequences are similar, they might share the same ancestor • If two sequences are similar, they may share the same structure, therefore similar function • In genome sequencing assembly, if two sequences have overlapping similar regions, they might be connected to represent longer sequenced region. STAT 115

  9. Scoring Schemes • Match, mismatch, gap score determines the final alignment This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. • Match OK, mismatch costly, gap cheap. This is a-- h-ard---- example. Th--at is anothe-r easy example. • Match cheap, mismatch cheap, gap costly. This is a hard example.------ That is another easy example. STAT 115

  10. Dot Matrix Approach • Naïve algorithm • Dot – match, find diagonal lines • Can’t afford more complex scoring • Visual analysis, hard to find optimal alignments STAT 115

  11. Dynamic Programming • Essence of dynamic programming: • Store the sub-problem solutions for later use • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Earliest method, Needleman & Wunsch 1969 • Still the best (sensitive and optimal) algorithm for pair-wise alignment STAT 115

  12. Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two Best previous alignment i j STAT 115

  13. Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Repeat the process until reaching the two sequences’ ends New best alignment = Best previous alignment + align (i,j) i j STAT 115

  14. Dynamic Programming Steps • Initialize NxM matrix • N, M are the length of the two sequences • Fill the matrix • Each element record the current best score, and pointer to the previous best alignment • Always search the previous column and row for the best previous alignment • Trace back to obtain optimal alignment STAT 115

  15. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A G G C STAT 115

  16. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G G C STAT 115

  17. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G C STAT 115

  18. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C STAT 115

  19. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 STAT 115

  20. Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 Trace back AATGC -AGGC AATGC AG-GC STAT 115

  21. F(i-1,j-1) F(i,j-1) F(i-1,j) F(i,j) Alignment Recursion (linear gap penalty) C A T T G 0 -1 -2 -3 -4 -5 T C ATG -1 0 -1 0 -2 1 0 -1 E.g., -3 3 2 1 0 5 -4 -1 3 4 4 5 6 -5 -2 2 STAT 115

  22. F(i,j) Affine Gap Penalty • Gap penalty function: • Typically a>b (e.g., a=12; b=2) • An order O(nm) algorithm. Key: keep 3 functions, each recording a directional optimum. STAT 115

  23. Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G C g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115

  24. Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G 0 0.4 1 2 1.4 C 0 0.3 0.4 1 3 g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115

  25. End Gaps • Should we penalize gaps at the ends? ATCCGCATACCGGA --CCGCATAC---- • If two sequences similar length and entire sequences are supposed to be similar, penalize. • If two sequences have very different length, do not penalize (most of the time, ignore end gap penalties) STAT 115

  26. Global vs. Local Alignment • Global: Needleman-Wunsch • Find best alignment for the whole 2 sequences • Could have no penalty for mismatches/gaps • Trace back from lower right corner to upper left corner • Local: Smith-Waterman • Find high scoring subsequences • E.g. Two proteins only share one similar functional domain • Can be achieved by modifying global alignment method STAT 115

  27. Local Alignment Modifications • Use negative mismatch and gap penalties • The minimum score for [i, j] is 0 • If S[i,j] < 0, rewrite it to 0, point to self • If previous col/row is all 0, S[i,j] point to self • The best score is sought anywhere in the matrix • Not just last column and last row (should keep a global pointer to the best score) • Trace back until a cell pointing to itself (not necessary to the beginning of the two sequences) STAT 115

  28. Matrix Filling in Smith-Waterman max {k < j} S(i,j-k) + g(k) S(i,j) = max S(i-1,j-1) + m(i,j) max {l < i} S(i-l,j) + g(l) 0 S(i,j-2) g(2) S(i-1,j-1) S(i,j) S(i-3,j) g(3) STAT 115

  29. Example STAT 115

  30. Finding suboptimal alignments STAT 115

  31. Smith-Waterman • Negative mis-match & negative gaps • Scoring matrix >= 0 • Trace from maximum STAT 115

  32. Scoring Matrices • For DNA, match + 5, mismatch – 4 • For proteins, different amino acid pairs receive different scores • Consideration: size, shape, electric charge, van der Waals interaction, ability to form salt, hydrophobic, and hydrogen bonds • Substitution matrices • Often symmetrical • + / – scores: functional similarity STAT 115

  33. PAM Matrices • MO Dayhoff 1978 • PAM: percent accepted mutations • Database of 1572 changes in 71 groups of closely related proteins (> 85% similar) • Construct phylogenetic tree of each group, tabulate probability of amino acid changes between every pairs of aa • For statistician: Markov chain transition b/w aa pairs STAT 115

  34. PAM Matrix Family • PAM-N • PAM-0: 1 on diagonal and 0 all the rest • PAM-N: what would happen if N out of 100 aa mutate • For statisticians: matrix multiplication N times • Bigger N, more diverged substitution matrice • Final matrix STAT 115

  35. PAM250 • 250 substitutions in 100 residues, only 1/5 residue remain unchanged STAT 115

  36. BLOSUM Matrices • BLOcks amino acid SUbstitution Matrices • Henikoff and Henikoff, PNAS. 1992, 89:10915-9 • Check >500 protein families in the Prosite database (Bairoch 1991) • Find ~2000 blocks of aligned segments • BLOCKS database • Characterize ungapped patterns 3-60 aa long • Check aligned columns for observed substitutions STAT 115

  37. BLOSUM Matrix Entry • How frequently do aa appear • How often do we expect to see i, j together • How often do we actually see them together in all the alignments • BLOSUM entry STAT 115

  38. BLOSUM62 Matrix STAT 115

  39. BLOSUM Matrices • Blocks are grouped before looking at aa substitutions • BLOSUMN: if sequences > N% identical, their contributions are weighted to sum to 1 • Most widely used: BLOSUM62 and PAM250 STAT 115

  40. More About Dynamic Programming • Example 1: Suppose I have x0 amount of savings at retirement, and also receive stamount of social security payment every year. Annual interest rate is qt, what is an optimal spending plan if I want to leave zero dollars at the end (say, year 5)? Maximizing total spending? STAT 115

  41. Example 2: Secretary Problem • We get to observe the “qualities” of m secretaries: X1,…, Xm sequentially according to a random order. Our goal is to maximize the probability of finding the “best” candidate with no looking back! • Heuristic: We start our reasoning backwards. Suppose we have seen X1,…, Xj, should we stop or go on? STAT 115

  42. Let’s start reasoning ... • What if we wait till the last person? • What if we wait till having two people left? • Strategy: if m-1st person is better than previous ones, take her; otherwise wait till the last one. • Get a recursion? If we let go j-1 people, and take the best-person-so-far starting from jth person ... Pj maximized at STAT 115

  43. Final “answer” • We should reject the first 37% of the candidates and start to look: recruit the first person who is the best among all that have been interviewed. • The chance of getting the best one is ~37% ! STAT 115

  44. Summary • Dynamic programming finds optimal alignment between two sequences • Keep subproblem solution for later use • Needleman & Wunsch, global • Smith & Waterman, local • Scoring sensitive to gap penalty and substitution matrices used • Substitution matrices capture aa similarity • PAM and BLOSUM matrices STAT 115

  45. Question for Thoughts • Given a string of integers (both positive and negative) • E.g. 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6 • Can you read each number only ONCE, and tell from which number to which number you will get the largest sum? • 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6, -2 • Largest sum = 12 • Hint: dynamic programming STAT 115

  46. Acknowledgement • Russ Altman • Theodor Hanekamp • Eric Rouchka STAT 115

More Related