Pair-wise Sequence Alignment

C E N T E R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment

Bioinformatics • “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) • “Nothing in bioinformatics makes sense except in the light of Biology”

The Genetic Code

A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

Searching for similarities What is the function of the new gene? The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques): – Find a set of similar protein sequences to the unknown sequence – Identify similarities and differences – For long proteins: first identify domains

Is similarity really interesting?

Evolutionary and functional relationships • Reconstruct evolutionary relation: • Based on sequence • -Identity (simplest method) • -Similarity • Homology (common ancestry: the ultimate goal) • Other (e.g., 3D structure) • Functional relation: • SequenceStructureFunction

Searching for similarities Common ancestry is moreinteresting: Makes it more likely that genes share the same function Homology: sharing a commonancestor – a binary property (yes/no) – it’s a nice tool: When (anunknown) gene X ishomologous to (a known) gene G itmeans that we gain a lot of informationon X: what we know about G can be transferred to X as a good suggestion.

How to go from DNA to protein sequence A piece of double stranded DNA: 5’attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’ DNA direction is from 5’ to 3’

How to go from DNA to protein sequence 6-frame translation using the codon table (last lecture): 5’attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’

Bioinformatics tool Algorithm Data tool Biological Interpretation (model)

Example today: Pairwise sequence alignment needs sense of evolution Global dynamic programming MDAGSTVILCFVG Evolution M D A A S T I L C G S Amino Acid Exchange Matrix Search matrix MDAGSTVILCFVG- Gap penalties (open,extension) MDAAST-ILC--GS

Z X Y How to determine similarity? • Frequent evolutionary events:1. Substitution2. Insertion, deletion3. Duplication4. Inversion • Evolution at work We’ll only use these Common ancestor, usually extinct available

GTACT--C GGA-TGAC Alignment - the problem Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||algorithms4genomes algorithmsforgenomes ||||||||||? |||||||algorithms4--genomes • Operations: • substitution, • Insertion • deletion

A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE----- ---GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---- *** **** **** ** ******

Dynamic programmingScoring alignments – Substitution (or match/mismatch) • DNA • proteins – Gap penalty • Linear: gp(k)=ak • Affine: gp(k)=b+ak • Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores of all alignment columns

Dynamic programmingScoring alignments – Substitution (or match/mismatch) • DNA • proteins – Gap penalty • Linear: gp(k)=ak • Affine: gp(k)=b+ak • Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores over all alignment columns affine penalty concave linear , • / Gap length General alignment score: Sa,b=

Substitution Matrices: DNA define a score for match/mismatch ofletters Simple: Used in genome alignments:

Substitution matrices for a.a. • Amino acids are not equal: • Some aresimilar and easily substituted: • biochemical properties • structure • Some mutations occur more often due to similar codons • The two above give us substitution matrices http://www.people.virginia.edu/~rjh9u/aminacid.html orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm

BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html

Gap Scoring intro extension Constant -2 -2 Affine -3 -1 Constant vs. affine gap scoring …and +1 for match Seq1 G T A - - G - T - ASeq2 - - A T G - A T G - Const -2 –2 1 –2 –2 (SUM = -7) -2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7) -3 –3 1 -3 –3 (SUM = -11)‏

Dynamic programmingScoring alignments T D W V T A L K T D W L - - I K 2020 10 1 Affine gap penalties (open, extension) Amino Acid Exchange Matrix Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px + +s(L,I)+s(K,K)

Amino acid exchange matrices 2020 How do we get one? And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure

A 2 R -2 6 N 0 0 2 D 0 -1 2 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 5 H -1 2 2 1 -3 3 1 -2 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 A R N D C Q E G H I L K M F P S T W Y V PAM250 matrix amino acid exchange matrix (log odds) Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation

Amino acid exchange matrices Amino acids are not equal: 1. Some are easily substituted because they have similar: • physico-chemical properties • structure 2. Some mutations between amino acids occur more often due tosimilar codons The two above observations give us ways to define substitutionmatrices

Pair-wise alignment T D W V T A L K T D W L - - I K Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 22n = ~ n (n!)2 n 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments!

Technique to overcome the combinatorial explosion:Dynamic Programming • Alignment is simulated as Markov process, all sequence positions are seen as independent • Chances of sequence events are independent • Therefore, probabilities per aligned position need to be multiplied • Amino acid matrices contain so-called log-odds values (log10 of the probabilities), so probabilities can be summed

To say the same more statistically… • To perform statistical analyses on messages or sequences, we need a reference model. • The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. • This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. • Given a probability distribution, Pi, for the letters in ai.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ... n is simply Pi Pj Pk···Pn. • As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi + log Pj + log Pk+ ··· + log Pn. • In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score.

Sequence alignmentHistory of Dynamic Programming algorithm • 1970Needleman-Wunsch global pair-wise alignment • Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins,J Mol Biol. 48(3):443-53. • 1981Smith-Waterman local pair-wise alignment • Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Pairwise sequence alignment Global dynamic programming MDAGSTVILCFVG Evolution M D A A S T I L C G S Amino Acid Exchange Matrix Search matrix Gap penalties (open,extension) MDAGSTVILCFVG- MDAAST-ILC--GS

Global dynamic programming j-1j i-1 i Value from residue exchange matrix H(i-1,j-1)+ S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal H(i,j) = Max This is a recursive formula

Global alignmentThe Needleman-Wunsch algorithm • Goal: find the maximal scoring alignment • Scores: m match, -s mismatch, -g for insertion/deletion • Dynamic programming • Solve smaller subproblem(s)‏ • Iteratively extend the solution • The best alignment for X[1…i] and Y[1…j]is called M[i, j] X1 … Xi-1 -- Xi Y1 … - Yj-1 Yj -

Goal: find the maximal scoring alignment • Scores: m match, -s mismatch, -g for insertion/deletion • The best alignment for X[1…i]and Y[1…j]is called M[i, j] • 3 ways to extend the alignment X[1…i-1]X[i] X[1…i] - X[1…i-1] X[i] Y[1…j-1]Y[j] Y[1…j-1]Y[j] Y[1…j] - + m M[i-1, j-1] - s The algorithm M[i,j] = M[i, j-1] - g M[i-1, j] - g

The algorithm – final equation Corresponds to: * X1…Xi-1 XiY1…Yj-1 Yj M[i-1, j-1] + score(X[i],Y[j])‏ M[i, j] = max M[i, j-1] – g X1…Xi -Y1…Yj-1 Yj M[i-1, j] – g X1…Xi-1 XiY1…Yj - j-1 j i-1 * -g i -g

Example: global alignment of two sequences • Align two DNA sequences: • GAGTGA • GAGGCGA (note the length difference)‏ • Parameters of the algorithm: • Match:score(A,A) = 1 • Mismatch:score(A,T) = -1 • Gap:g = -2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2

j 0 1 2 3 4 5 6 i - G A G T G A 0 - 0 1 G 2 A 3 G 4 G 5 C 6 G 7 A The algorithm. Step 1: init M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 • Create the matrix • Initiation • 0 at [0,0] • Apply the equation… M[i-1, j] – 2

- G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 A -4 G -6 G -8 C -10 G -12 A -14 The algorithm. Step 1: init M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 • Initiation of the matrix: • 0 at pos [0,0] • Fill in the first row using the “” rule • Fill in the first column using “” M[i-1, j] – 2 j i

- G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 A -4 -1 2 G -6 G -8 C -10 G -12 A -14 The algorithm. Step 2: fill in M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 • Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ M[i-1, j] – 2 j i

- G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 The algorithm. Step 2: fill in M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 • We are done… • Where’s the result? M[i-1, j] – 2 j The lowest-rightmost cell i

- G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 The algorithm. Step 3:traceback M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 • Start at the last cell of the matrix • Go against the direction of arrows • Sometimes the value may be obtained from more than one cell (which one?)‏ M[i-1, j] – 2 j i

- G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 The algorithm. Step 3: traceback The ‘low’ and the ‘high’ road • Extract the alignments j a)‏ GAGT-GA a GAGGCGA b)‏ b GA-GTGA GAGGCGA i You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?

X1…Xi -Y1…Yj-1 Yj Affine gaps • Penalties: go - gap opening (e.g. -8)‏ ge - gap extension (e.g. -1)‏ M[i-1, j-1] X1…XiY1…Yj M[i, j] = max score(X[i], Y[j]) + Ix[i-1, j-1] Iy[i-1, j-1] M[i, j-1] + go X1…-Y1… Yj Ix[i, j] = max Ix[i, j-1] + ge X1… - -Y1…Yj-1 Yj M[i-1, j] + go Iy[i, j] = max Iy[i-1, j] + gx • @ home: think of boundary values M[*,0], I[*,0] etc.

Semi-global pairwise alignment • Global alignment: all gaps are penalised • Semi-global alignment: N- and C-terminal (5’ and 3’) gaps (end-gaps) are not penalised MSTGAVLIY--TS----- ---GGILLFHRTSGTSNS End-gaps End-gaps

Variation on global alignment • Global alignment: previous algorithms is called globalalignment, because it uses all letters from both sequences.CAGCACTTGGATTCTCGG CAGC-----G-T----GG • Semi-globalalignment: don’t penalize for end gaps CAGCA-CTTGGATTCTCGG ---CAGCGTGG--------

Semi-global pairwise alignment Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other Danger:really bad alignments for divergent sequences (particularly if gap penalties are high) Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic, and so these ends match seq X: seq Y:

- G A G T G - 0 0 0 0 0 0 A 0 -1 1 -1 -1 -1 G 0 1 -2 2 0 -2 T 0 -1 0 0 3 1 Semi-global alignment • Ignore 5’ or N-terminal end gaps • First row/column set to 0 • Ignore C-terminal or 3’ end gaps • Read the result from last row/column

Take-home messages • Homology • Why are we interested in similarity? • Pairwise alignment: global alignment and semi-global alignment • Three types of gap penalty regimes: linear, affine and concave • Make sure you can calculate alignment using the DP algorithm

a heuristic • Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

Global dynamic programmingPAM250, Gap=6 (linear) These values are copied from the PAM250 matrix (see earlier slide) The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps Higgs & Attwood, p. 124

Global dynamic programmingLinear, Affine or Concave gap penalties j-1 All penalty schemes are possible because the exact length of the gap is known i-1 Gap opening penalty Max{S0<x<i-1, j-1- Pi - (i-x-1)Px} Si-1,j-1 Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px} Si,j = si,j + Max Gap extension penalty

Pair-wise Sequence Alignment