C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Introduction to bioinformatics 2007 Lecture 5. Pairwise Sequence Alignment. Bioinformatics.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
E
N
T
E
R
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to bioinformatics 2007
Lecture 5
Pairwise Sequence Alignment
MSTGAVLIYTSILIKECHAMPAGNE
GGILLFHRTHELIKESHAMANDEGGSNNS
* * * **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
atttggcggatcgcctctacgggcc
*** **** **** ** ******
What is the function of the new gene?
The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
Common ancestry is moreinteresting:
Makes it more likely that genes share
the same function
Homology: sharing a commonancestor
– a binary property (yes/no)
– it’s a nice tool:
When (anunknown) gene X ishomologous to (a known) gene G itmeans that we gain a lot of informationon X: what we know about G can be transferred to X as a good suggestion.
A piece of double stranded DNA:
5’attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
6frame translation using the codon table (last lecture):
5’attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
Global dynamic programming
MDAGSTVILCFVG
Evolution
M
D
A
A
S
T
I
L
C
G
S
Amino Acid Exchange
Matrix
Search matrix
MDAGSTVILCFVG
Gap penalties (open,extension)
MDAASTILCGS
X
Y
How to determine similarity?
We’ll only use these
Common ancestor, usually extinct
available
GGATGAC
Alignment  the problem
Algorithmsforgenomes
 algorithms4genomes
algorithmsforgenomes
algorithms4genomes
algorithmsforgenomes
? algorithms4genomes
MSTGAVLIYTSILIKECHAMPAGNE
GGILLFHRTHELIKESHAMANDEGGSNNS
* * * **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
atttggcggatcgcctctacgggcc
*** **** **** ** ******
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores of all alignment columns
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum
of the scores over all alignment columns
affine
penalty
concave
linear
,
Gap length
General alignment score:
Sa,b=
define a score for match/mismatch ofletters
Simple:
Used in genome alignments:
http://www.people.virginia.edu/~rjh9u/aminacid.html
orange: nonpolar and hydrophobic.
green: polar and hydrophilic
magenta box are acidic
light blue box are basic
http://www.cimr.cam.ac.uk/links/codon.htm
http://www.carverlab.org/testing/epp.html
Scoring
intro
extension
Constant
2
2
Affine
3
1
Constant vs. affine gap scoring
…and +1 for match
Seq1 G T A   G  T  ASeq2   A T G  A T G 
Const 2 –2 1 –2 –2 (SUM = 7) 2 –2 1 2 –2 (SUM = 7)
Affine –4 1 4 (SUM = 7) 3 –3 1 3 –3 (SUM = 11)
T D W V T A L K
T D W L   I K
2020
10
1
Affine gap penalties (open, extension)
Amino Acid Exchange Matrix
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)Po2Px +
+s(L,I)+s(K,K)
2020
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure
R 2 6
N 0 0 2
D 0 1 2 4
C 2 4 4 5 12
Q 0 1 1 2 5 4
E 0 1 1 3 5 2 4
G 1 3 0 1 3 1 0 5
H 1 2 2 1 3 3 1 2 6
I 1 2 2 2 2 2 2 3 2 5
L 2 3 3 4 6 2 3 4 2 2 6
K 1 3 1 0 5 1 0 2 0 2 3 5
M 1 0 2 3 5 1 2 3 2 2 4 0 6
F 4 4 4 6 4 5 5 5 2 1 2 5 0 9
P 1 0 1 1 3 0 1 1 0 2 3 1 2 5 6
S 1 0 1 0 0 1 0 1 1 1 3 0 2 3 1 2
T 1 1 0 0 2 1 0 0 1 0 2 0 1 3 0 1 3
W 6 2 4 7 8 5 7 7 3 5 2 3 4 0 6 2 5 17
Y 3 4 2 4 0 4 4 5 0 1 1 4 2 7 5 3 3 0 10
V 0 2 2 2 2 2 2 1 2 4 2 2 2 1 1 1 0 6 2 4
A R N D C Q E G H I L K M F P S T W Y V
PAM250 matrix
amino acid exchange matrix (log odds)
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation
Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physicochemical properties
• structure
2. Some mutations between amino acids occur more often due tosimilar codons
The two above observations give us ways to define substitutionmatrices
T D W V T A L K
T D W L   I K
Combinatorial explosion
 1 gap in 1 sequence: n+1 possibilities
 2 gaps in 1 sequence: (n+1)n
 3 gaps in 1 sequence: (n+1)n(n1), etc.
2n (2n)! 22n
= ~
n (n!)2 n
2 sequences of 300 a.a.: ~1088 alignments
2 sequences of 1000 a.a.: ~10600 alignments!
Global dynamic programming
MDAGSTVILCFVG
Evolution
M
D
A
A
S
T
I
L
C
G
S
Amino Acid Exchange
Matrix
Search matrix
Gap penalties (open,extension)
MDAGSTVILCFVG
MDAASTILCGS
j1j
i1
i
Value from residue exchange matrix
H(i1,j1)+ S(i,j)
H(i1,j)  g
H(i,j1)  g
diagonal
vertical
horizontal
H(i,j) = Max
This is a recursive formula
X1 … Xi1  Xi
Y1 …  Yj1 Yj 
X[1…i1]X[i]
X[1…i] 
X[1…i1] X[i]
Y[1…j1]Y[j]
Y[1…j1]Y[j]
Y[1…j] 
+ m
M[i1, j1]
 s
The algorithmM[i,j] =
M[i, j1]  g
M[i1, j]  g
Corresponds to:
*
X1…Xi1 XiY1…Yj1 Yj
M[i1, j1] + score(X[i],Y[j])
M[i, j] =
max
M[i, j1] – g
X1…Xi Y1…Yj1 Yj
M[i1, j] – g
X1…Xi1 XiY1…Yj 
j1
j
i1
*
g
i
g
M[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
0
1
2
3
4
5
6
i

G
A
G
T
G
A
0

0
1
G
2
A
3
G
4
G
5
C
6
G
7
A
The algorithm. Step 1: initM[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
A
4
G
6
G
8
C
10
G
12
A
14
The algorithm. Step 1: initM[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i
G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
A
4
1
2
G
6
G
8
C
10
G
12
A
14
The algorithm. Step 2: fill inM[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i
G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
The algorithm. Step 2: fill inM[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
The lowestrightmost cell
i
G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
The algorithm. Step 3:tracebackM[i1, j1] ± 1
M[i, j] =
max
M[i, j1] – 2
M[i1, j] – 2
j
i
G
A
G
T
G
A

0
2
4
6
8
10
12
G
2
1
1
3
5
7
9
A
4
1
2
0
2
4
6
G
6
3
0
3
1
1
3
G
8
5
2
1
2
2
0
C
10
7
4
1
0
1
1
G
12
9
6
3
2
1
0
A
14
11
8
5
4
1
2
The algorithm. Step 3: tracebackThe ‘low’ and the ‘high’ road
j
a)
GAGTGA
a
GAGGCGA
b)
b
GAGTGA
GAGGCGA
i
You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?
go  gap opening (e.g. 8)
ge  gap extension (e.g. 1)
M[i1, j1]
X1…XiY1…Yj
M[i, j] =
max
score(X[i], Y[j]) +
Ix[i1, j1]
Iy[i1, j1]
M[i, j1] + go
X1…Y1… Yj
Ix[i, j] =
max
Ix[i, j1] + ge
X1…  Y1…Yj1 Yj
M[i1, j] + go
Iy[i, j] =
max
Iy[i1, j] + gx
MSTGAVLIYTS
GGILLFHRTSGTSNS
Endgaps
Endgaps
CAGCACTTGGATTCTCGG CAGCGTGG
seq X:
seq Y:
Applications of semiglobal:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
Danger: if gap penalties high  really bad alignments for divergent sequences
Protein sequences have N and Cterminal amino acids that are often small and hydrophilic
G
A
G
T
G

0
0
0
0
0
0
A
0
1
1
1
1
1
G
0
1
2
2
0
2
T
0
1
0
0
3
1
Semiglobal alignmentThese values are copied from the PAM250 matrix (see earlier slide)
The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps
Higgs & Attwood, p. 124
j1
All penalty schemes are possible because the exact length of the gap is known
i1
Gap opening penalty
Max{S0
Si1,j1
Max{Si1, 0
Si,j = si,j + Max
Gap extension penalty
These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (8 is the lowest number in the PAM250 matrix)
The extra bottom row and rightmost column give the final global alignment scores
j1
i1
These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (8 is the lowest number in the PAM250 matrix)
Global score is 65 –10 – 1*2 –10 – 2*2
LCFVMLAGSTVIVGTR
E
D
A
S
T
I
L
C
G
S
Negative
numbers
Amino Acid
Exchange Matrix
Search matrix
Gap penalties
(open, extension)
AGSTVIVG
ASTILCG
j1
i1
Gap opening penalty
Si,j + Max{S0
Si,j + Si1,j1
Si,j + Max {Si1,0
0
Si,j = Max
Gap extension penalty
We want to be able to choose the best alignment between two sequences.
A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.
They can identify insertions, deletions, inversions