Pairwise Sequence Alignment and Scoring Matrices

Stat 115 Lecture 3 Pairwise Sequence Alignmentand Scoring Matrices Xiaole Shirley Liu And Jun Liu

Mol Bio Quick Facts • Building block of DNA is deoxyribonucleic acid • Building block of protein is amino acid • Protein is a peptide, a long peptide • Only exons can code for functional proteins or RNAs • Introns are spliced out STAT 115

Outline • Motivation and introduction • Dynamic programming • Global sequence alignment • Needleman-Wunsch • 3 steps: Initialize, fill matrix, trace back • Gap penalties • Local sequence alignment, Smith-Waterman • Scoring matrices • PAM • BLOSUM STAT 115

gap match mismatch Pairwise Sequence Alignment • Given: two sequences, scoring for match/mismatch/gap • Goal: find pairing of letters in the two sequences that optimize the total score This is a hard example. That is another easy example. This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. STAT 115

Align Biological Sequences • DNA (4 nt + gap) TTGATCAC TTTA-CAC • Protein (20 aa + gap) RKVA--GMAKPNM RKIAVAAASKPAV • Sometimes > 4 nt for DNA and > 20 aa for proteins • A word on IUPAC STAT 115

A adenosine C cytidine G guanine T thymidine U uridine R G A (purine) Y T C (pyrimidine) K G T (keto) M A C (amino) S G C (strong) W A T (weak) B C G T (not A) D A G T (not C) H A C T (not G) V A C G (not T) N A C G T (any) – gap IUPAC for DNA STAT 115

A Ala B Asp or Asn C Cys D Asp E Glu F Phe G Gly H His I Iso K Lys L Leu M Met N Asn P Pro G Gln R Arg S Ser T Thr U Sel V Val W Try Y Tyr Z Glu or Gln X Any * Translation stop – Gap IUPAC for Protein STAT 115

Why Align Two Sequences • If two sequences are similar, they might share the same ancestor • If two sequences are similar, they may share the same structure, therefore similar function • In genome sequencing assembly, if two sequences have overlapping similar regions, they might be connected to represent longer sequenced region. STAT 115

Scoring Schemes • Match, mismatch, gap score determines the final alignment This is a --hard---- example. || ||||| | | ||||||||| That is another easy example. • Match OK, mismatch costly, gap cheap. This is a-- h-ard---- example. Th--at is anothe-r easy example. • Match cheap, mismatch cheap, gap costly. This is a hard example.------ That is another easy example. STAT 115

Dot Matrix Approach • Naïve algorithm • Dot – match, find diagonal lines • Can’t afford more complex scoring • Visual analysis, hard to find optimal alignments STAT 115

Dynamic Programming • Essence of dynamic programming: • Store the sub-problem solutions for later use • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Earliest method, Needleman & Wunsch 1969 • Still the best (sensitive and optimal) algorithm for pair-wise alignment STAT 115

Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two Best previous alignment i j STAT 115

Dynamic Programming • Best alignment at (i,j) is the best alignment previous to (i,j) plus aligning these two • Repeat the process until reaching the two sequences’ ends New best alignment = Best previous alignment + align (i,j) i j STAT 115

Dynamic Programming Steps • Initialize NxM matrix • N, M are the length of the two sequences • Fill the matrix • Each element record the current best score, and pointer to the previous best alignment • Always search the previous column and row for the best previous alignment • Trace back to obtain optimal alignment STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A G G C STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G G C STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G C STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 STAT 115

Fill the Matrix • BestScore[i, j] = max{BS[i-1, j] - ; BS[i, j-1] - ; BS[i-1,j-1]+match(i,j)}  A A T G C  0 -1 -2 -3 -4 -5 A -1 1 0 -1 -2 -3 G -2 0 1 0 0 -1 G -3 -1 0 1 1 0 C -4 -2 -1 0 1 2 Trace back AATGC -AGGC AATGC AG-GC STAT 115

F(i-1,j-1) F(i,j-1) F(i-1,j) F(i,j) Alignment Recursion (linear gap penalty) C A T T G 0 -1 -2 -3 -4 -5 T C ATG -1 0 -1 0 -2 1 0 -1 E.g., -3 3 2 1 0 5 -4 -1 3 4 4 5 6 -5 -2 2 STAT 115

F(i,j) Affine Gap Penalty • Gap penalty function: • Typically a>b (e.g., a=12; b=2) • An order O(nm) algorithm. Key: keep 3 functions, each recording a directional optimum. STAT 115

Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G C g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115

Gap Penalty • Gap penalty = g + l e A A T G C A 1 1 0 0 0 G 0 1 1 1.4 0.3 G 0 0.4 1 2 1.4 C 0 0.3 0.4 1 3 g – gap start l – gap length e – gap extend e.g. g = -0.5 e = -0.1 STAT 115

End Gaps • Should we penalize gaps at the ends? ATCCGCATACCGGA --CCGCATAC---- • If two sequences similar length and entire sequences are supposed to be similar, penalize. • If two sequences have very different length, do not penalize (most of the time, ignore end gap penalties) STAT 115

Global vs. Local Alignment • Global: Needleman-Wunsch • Find best alignment for the whole 2 sequences • Could have no penalty for mismatches/gaps • Trace back from lower right corner to upper left corner • Local: Smith-Waterman • Find high scoring subsequences • E.g. Two proteins only share one similar functional domain • Can be achieved by modifying global alignment method STAT 115

Local Alignment Modifications • Use negative mismatch and gap penalties • The minimum score for [i, j] is 0 • If S[i,j] < 0, rewrite it to 0, point to self • If previous col/row is all 0, S[i,j] point to self • The best score is sought anywhere in the matrix • Not just last column and last row (should keep a global pointer to the best score) • Trace back until a cell pointing to itself (not necessary to the beginning of the two sequences) STAT 115

Matrix Filling in Smith-Waterman max {k < j} S(i,j-k) + g(k) S(i,j) = max S(i-1,j-1) + m(i,j) max {l < i} S(i-l,j) + g(l) 0 S(i,j-2) g(2) S(i-1,j-1) S(i,j) S(i-3,j) g(3) STAT 115

Example STAT 115

Finding suboptimal alignments STAT 115

Smith-Waterman • Negative mis-match & negative gaps • Scoring matrix >= 0 • Trace from maximum STAT 115

Scoring Matrices • For DNA, match + 5, mismatch – 4 • For proteins, different amino acid pairs receive different scores • Consideration: size, shape, electric charge, van der Waals interaction, ability to form salt, hydrophobic, and hydrogen bonds • Substitution matrices • Often symmetrical • + / – scores: functional similarity STAT 115

PAM Matrices • MO Dayhoff 1978 • PAM: percent accepted mutations • Database of 1572 changes in 71 groups of closely related proteins (> 85% similar) • Construct phylogenetic tree of each group, tabulate probability of amino acid changes between every pairs of aa • For statistician: Markov chain transition b/w aa pairs STAT 115

PAM Matrix Family • PAM-N • PAM-0: 1 on diagonal and 0 all the rest • PAM-N: what would happen if N out of 100 aa mutate • For statisticians: matrix multiplication N times • Bigger N, more diverged substitution matrice • Final matrix STAT 115

PAM250 • 250 substitutions in 100 residues, only 1/5 residue remain unchanged STAT 115

BLOSUM Matrices • BLOcks amino acid SUbstitution Matrices • Henikoff and Henikoff, PNAS. 1992, 89:10915-9 • Check >500 protein families in the Prosite database (Bairoch 1991) • Find ~2000 blocks of aligned segments • BLOCKS database • Characterize ungapped patterns 3-60 aa long • Check aligned columns for observed substitutions STAT 115

BLOSUM Matrix Entry • How frequently do aa appear • How often do we expect to see i, j together • How often do we actually see them together in all the alignments • BLOSUM entry STAT 115

BLOSUM62 Matrix STAT 115

BLOSUM Matrices • Blocks are grouped before looking at aa substitutions • BLOSUMN: if sequences > N% identical, their contributions are weighted to sum to 1 • Most widely used: BLOSUM62 and PAM250 STAT 115

More About Dynamic Programming • Example 1: Suppose I have x0 amount of savings at retirement, and also receive stamount of social security payment every year. Annual interest rate is qt, what is an optimal spending plan if I want to leave zero dollars at the end (say, year 5)? Maximizing total spending? STAT 115

Example 2: Secretary Problem • We get to observe the “qualities” of m secretaries: X1,…, Xm sequentially according to a random order. Our goal is to maximize the probability of finding the “best” candidate with no looking back! • Heuristic: We start our reasoning backwards. Suppose we have seen X1,…, Xj, should we stop or go on? STAT 115

Let’s start reasoning ... • What if we wait till the last person? • What if we wait till having two people left? • Strategy: if m-1st person is better than previous ones, take her; otherwise wait till the last one. • Get a recursion? If we let go j-1 people, and take the best-person-so-far starting from jth person ... Pj maximized at STAT 115

Final “answer” • We should reject the first 37% of the candidates and start to look: recruit the first person who is the best among all that have been interviewed. • The chance of getting the best one is ~37% ! STAT 115

Summary • Dynamic programming finds optimal alignment between two sequences • Keep subproblem solution for later use • Needleman & Wunsch, global • Smith & Waterman, local • Scoring sensitive to gap penalty and substitution matrices used • Substitution matrices capture aa similarity • PAM and BLOSUM matrices STAT 115

Question for Thoughts • Given a string of integers (both positive and negative) • E.g. 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6 • Can you read each number only ONCE, and tell from which number to which number you will get the largest sum? • 3, -1, -5, 2, 4, -3, 6, -4, 2, 5, -8, 3, 1, -7, 6, -2 • Largest sum = 12 • Hint: dynamic programming STAT 115

Acknowledgement • Russ Altman • Theodor Hanekamp • Eric Rouchka STAT 115

Pairwise Sequence Alignment and Scoring Matrices

Pairwise Sequence Alignment and Scoring Matrices

Presentation Transcript

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment (I)

Pairwise sequence Alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment

Pairwise sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

Pairwise sequence alignment

Pairwise Sequence Alignment

Pairwise sequence Alignment

Pairwise Sequence Alignment (II)

Pairwise and multiple sequence alignment

Pairwise Sequence Alignment

Pairwise Sequence Alignment (cont.)

Scoring Matrices for Sequence Alignment

Pairwise Sequence Alignment

Pairwise sequence alignment

Scoring Matrices for Sequence Alignment

Pairwise sequence alignment (practice)