C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Introduction to bioinformatics 2007 Lecture 5. Pairwise Sequence Alignment. Bioinformatics.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
C
E
N
T
E
R
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to bioinformatics 2007
Lecture 5
Pairwise Sequence Alignment
Bioinformatics
A protein sequence alignment
MSTGAVLIYTSILIKECHAMPAGNE
GGILLFHRTHELIKESHAMANDEGGSNNS
* * * **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
atttggcggatcgcctctacgggcc
*** **** **** ** ******
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
Is similarity really interesting?
Searching for similarities
Common ancestry is moreinteresting:
Makes it more likely that genes share
the same function
Homology: sharing a commonancestor
– a binary property (yes/no)
– it’s a nice tool:
When (anunknown) gene X ishomologous to (a known) gene G itmeans that we gain a lot of informationon X: what we know about G can be transferred to X as a good suggestion.
A piece of double stranded DNA:
5’attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
6frame translation using the codon table (last lecture):
5’attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
Bioinformatics tool
Algorithm
Data
tool
Biological
Interpretation
(model)
Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
MDAGSTVILCFVG
Evolution
M
D
A
A
S
T
I
L
C
G
S
Amino Acid Exchange
Matrix
Search matrix
MDAGSTVILCFVG
Gap penalties (open,extension)
MDAASTILCGS
Z
X
Y
How to determine similarity?
We’ll only use these
Common ancestor, usually extinct
available
GTACTC
GGATGAC
Alignment  the problem
Algorithmsforgenomes
 algorithms4genomes
algorithmsforgenomes
algorithms4genomes
algorithmsforgenomes
? algorithms4genomes
A protein sequence alignment
MSTGAVLIYTSILIKECHAMPAGNE
GGILLFHRTHELIKESHAMANDEGGSNNS
* * * **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
atttggcggatcgcctctacgggcc
*** **** **** ** ******
Dynamic programmingScoring alignments
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores of all alignment columns
Dynamic programmingScoring alignments
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum
of the scores over all alignment columns
affine
penalty
concave
linear
,
Gap length
General alignment score:
Sa,b=
Substitution Matrices: DNA
define a score for match/mismatch ofletters
Simple:
Used in genome alignments:
Substitution matrices for a.a.
http://www.people.virginia.edu/~rjh9u/aminacid.html
orange: nonpolar and hydrophobic.
green: polar and hydrophilic
magenta box are acidic
light blue box are basic
http://www.cimr.cam.ac.uk/links/codon.htm
BLOSUM 62 substitution matrix
http://www.carverlab.org/testing/epp.html
Gap
Scoring
intro
extension
Constant
2
2
Affine
3
1
Constant vs. affine gap scoring
…and +1 for match
Seq1G T A  G  T  ASeq2  A T G A T G 
Const2 –2 1 –2 –2 (SUM = 7)2 –2 1 2 –2 (SUM = 7)
Affine –4 1 4 (SUM = 7)3 –3 1 3 –3 (SUM = 11)
Dynamic programmingScoring alignments
T D W V T A L K
T D W L   I K
2020
10
1
Affine gap penalties (open, extension)
Amino Acid Exchange Matrix
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)Po2Px +
+s(L,I)+s(K,K)
Amino acid exchange matrices
2020
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure
A 2
R 2 6
N 0 0 2
D 0 1 2 4
C 2 4 4 5 12
Q 0 1 1 2 5 4
E 0 1 1 3 5 2 4
G 1 3 0 1 3 1 0 5
H 1 2 2 1 3 3 1 2 6
I 1 2 2 2 2 2 2 3 2 5
L 2 3 3 4 6 2 3 4 2 2 6
K 1 3 1 0 5 1 0 2 0 2 3 5
M 1 0 2 3 5 1 2 3 2 2 4 0 6
F 4 4 4 6 4 5 5 5 2 1 2 5 0 9
P 1 0 1 1 3 0 1 1 0 2 3 1 2 5 6
S 1 0 1 0 0 1 0 1 1 1 3 0 2 3 1 2
T 1 1 0 0 2 1 0 0 1 0 2 0 1 3 0 1 3
W 6 2 4 7 8 5 7 7 3 5 2 3 4 0 6 2 5 17
Y 3 4 2 4 0 4 4 5 0 1 1 4 2 7 5 3 3 0 10
V 0 2 2 2 2 2 2 1 2 4 2 2 2 1 1 1 0 6 2 4
A R N D C Q E G H I L K M F P S T W Y V
PAM250 matrix
amino acid exchange matrix (log odds)
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation
Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physicochemical properties
• structure
2. Some mutations between amino acids occur more often due tosimilar codons
The two above observations give us ways to define substitutionmatrices
Pairwise alignment
T D W V T A L K
T D W L   I K
Combinatorial explosion
 1 gap in 1 sequence: n+1 possibilities
 2 gaps in 1 sequence: (n+1)n
 3 gaps in 1 sequence: (n+1)n(n1), etc.
2n (2n)! 22n
= ~
n (n!)2 n
2 sequences of 300 a.a.: ~1088 alignments
2 sequences of 1000 a.a.: ~10600 alignments!
To say the same more statistically…
Sequence alignmentHistory of Dynamic Programming algorithm
Pairwise sequence alignment
Global dynamic programming
MDAGSTVILCFVG
Evolution
M
D
A
A
S
T
I
L
C
G
S
Amino Acid Exchange
Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG
MDAASTILCGS
Global dynamic programming
j1j
i1
i
Value from residue exchange matrix
H(i1,j1)+ S(i,j)
H(i1,j)  g
H(i,j1)  g
diagonal
vertical
horizontal
H(i,j) = Max
This is a recursive formula
X1 … Xi1  Xi
Y1 …  Yj1 Yj 
X[1…i1]X[i]
X[1…i] 
X[1…i1] X[i]
Y[1…j1]Y[j]
Y[1…j1]Y[j]
Y[1…j] 
+ m
M[i1, j1]
 s
M[i,j] =
M[i, j1]  g
M[i1, j]  g
Corresponds to:
*
X1…Xi1 XiY1…Yj1 Yj
M[i1, j1] + score(X[i],Y[j])
M[i, j] =
max
M[i, j1] – g
X1…Xi Y1…Yj1 Yj
M[i1, j] – g
X1…Xi1 XiY1…Yj 
j1
j
i1
*
g
i
g
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
0
1
2
3
4
5
6
i

G
A
G
T
G
A
0

0
1
G
2
A
3
G
4
G
5
C
6
G
7
A
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2

G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
A
4
G
6
G
8
C
10
G
12
A
14
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i

G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
A
4
1
2
G
6
G
8
C
10
G
12
A
14
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i

G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
The lowestrightmost cell
i

G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i

G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
The ‘low’ and the ‘high’ road
j
a)
GAGTGA
a
GAGGCGA
b)
b
GAGTGA
GAGGCGA
i
You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?
X1…Xi Y1…Yj1 Yj
go  gap opening (e.g. 8)
ge  gap extension (e.g. 1)
M[i1, j1]
X1…XiY1…Yj
M[i, j] =
max
score(X[i], Y[j]) +
Ix[i1, j1]
Iy[i1, j1]
M[i, j1] + go
X1…Y1… Yj
Ix[i, j] =
max
Ix[i, j1] + ge
X1…  Y1…Yj1 Yj
M[i1, j] + go
Iy[i, j] =
max
Iy[i1, j] + gx
MSTGAVLIYTS
GGILLFHRTSGTSNS
Endgaps
Endgaps
CAGCACTTGGATTCTCGGCAGCGTGG
seq X:
seq Y:
Applications of semiglobal:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
Danger: if gap penalties high  really bad alignments for divergent sequences
Protein sequences have N and Cterminal amino acids that are often small and hydrophilic

G
A
G
T
G

0
0
0
0
0
0
A
0
1
1
1
1
1
G
0
1
2
2
0
2
T
0
1
0
0
3
1
Takehome messages
Global dynamic programmingPAM250, Gap=6 (linear)
These values are copied from the PAM250 matrix (see earlier slide)
The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps
Higgs & Attwood, p. 124
Global dynamic programmingLinear, Affine or Concave gap penalties
j1
All penalty schemes are possible because the exact length of the gap is known
i1
Gap opening penalty
Max{S0<x<i1, j1 Pi  (ix1)Px}
Si1,j1
Max{Si1, 0<y<j1  Pi  (jy1)Px}
Si,j = si,j + Max
Gap extension penalty
These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (8 is the lowest number in the PAM250 matrix)
The extra bottom row and rightmost column give the final global alignment scores
j1
i1
Global dynamic programming
Semiglobal dynamic programming two examples with different gap penalties 
These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (8 is the lowest number in the PAM250 matrix)
Global score is 65 –10 – 1*2 –10 – 2*2
Local dynamic programming(Smith & Waterman, 1981)
LCFVMLAGSTVIVGTR
E
D
A
S
T
I
L
C
G
S
Negative
numbers
Amino Acid
Exchange Matrix
Search matrix
Gap penalties
(open, extension)
AGSTVIVG
ASTILCG
Local dynamic programming(Smith & Waterman, 1981)
j1
i1
Gap opening penalty
Si,j + Max{S0<x<i1,j1  Pi  (ix1)Px}
Si,j + Si1,j1
Si,j + Max {Si1,0<y<j1  Pi  (jy1)Px}
0
Si,j = Max
Gap extension penalty
Local dynamic programming
Comparing two sequences
We want to be able to choose the best alignment between two sequences.
A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold
They can identify insertions, deletions, inversions