Pair wise sequence alignment
This presentation is the property of its rightful owner.
Sponsored Links
1 / 61

Pair-wise Sequence Alignment PowerPoint PPT Presentation


  • 37 Views
  • Uploaded on
  • Presentation posted in: General

C. E. N. T. E. R. F. O. R. I. N. T. E. G. R. A. T. I. V. E. B. I. O. I. N. F. O. R. M. A. T. I. C. S. V. U. Introduction to bioinformatics 2007 Lecture 5. Pair-wise Sequence Alignment. Bioinformatics.

Download Presentation

Pair-wise Sequence Alignment

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Pair wise sequence alignment

C

E

N

T

E

R

F

O

R

I

N

T

E

G

R

A

T

I

V

E

B

I

O

I

N

F

O

R

M

A

T

I

C

S

V

U

Introduction to bioinformatics 2007

Lecture 5

Pair-wise Sequence Alignment


Pair wise sequence alignment

Bioinformatics

  • “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

  • “Nothing in bioinformatics makes sense except in the light of Biology”


The genetic code

The Genetic Code


Pair wise sequence alignment

A protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----

---GGILLFHRTHELIKESHAMANDEGGSNNS

* * * **** ***

A DNA sequence alignment

attcgttggcaaatcgcccctatccggccttaa

att---tggcggatcg-cctctacgggcc----

*** **** **** ** ******


Pair wise sequence alignment

Searching for similarities

What is the function of the new gene?

The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques):

– Find a set of similar protein sequences to the unknown sequence

– Identify similarities and differences

– For long proteins: first identify domains


Pair wise sequence alignment

Is similarity really interesting?


Pair wise sequence alignment

  • Evolutionary and functional relationships

  • Reconstruct evolutionary relation:

  • Based on sequence

    • -Identity (simplest method)

    • -Similarity

  • Homology (common ancestry: the ultimate goal)

  • Other (e.g., 3D structure)

  • Functional relation:

  • SequenceStructureFunction


Pair wise sequence alignment

Searching for similarities

Common ancestry is moreinteresting:

Makes it more likely that genes share

the same function

Homology: sharing a commonancestor

– a binary property (yes/no)

– it’s a nice tool:

When (anunknown) gene X ishomologous to (a known) gene G itmeans that we gain a lot of informationon X: what we know about G can be transferred to X as a good suggestion.


How to go from dna to protein sequence

How to go from DNA to protein sequence

A piece of double stranded DNA:

5’attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’

DNA direction is from 5’ to 3’


How to go from dna to protein sequence1

How to go from DNA to protein sequence

6-frame translation using the codon table (last lecture):

5’attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’


Pair wise sequence alignment

Bioinformatics tool

Algorithm

Data

tool

Biological

Interpretation

(model)


Pair wise sequence alignment

Example today: Pairwise sequence alignment needs sense of evolution

Global dynamic programming

MDAGSTVILCFVG

Evolution

M

D

A

A

S

T

I

L

C

G

S

Amino Acid Exchange

Matrix

Search matrix

MDAGSTVILCFVG-

Gap penalties (open,extension)

MDAAST-ILC--GS


Pair wise sequence alignment

Z

X

Y

How to determine similarity?

  • Frequent evolutionary events:1. Substitution2. Insertion, deletion3. Duplication4. Inversion

  • Evolution at work

We’ll only use these

Common ancestor, usually extinct

available


Pair wise sequence alignment

GTACT--C

GGA-TGAC

Alignment - the problem

Algorithmsforgenomes

||||||| algorithms4genomes

algorithmsforgenomes

||||||||||algorithms4genomes

algorithmsforgenomes

||||||||||? |||||||algorithms4--genomes

  • Operations:

    • substitution,

    • Insertion

    • deletion


Pair wise sequence alignment

A protein sequence alignment

MSTGAVLIY--TSILIKECHAMPAGNE-----

---GGILLFHRTHELIKESHAMANDEGGSNNS

* * * **** ***

A DNA sequence alignment

attcgttggcaaatcgcccctatccggccttaa

att---tggcggatcg-cctctacgggcc----

*** **** **** ** ******


Pair wise sequence alignment

Dynamic programmingScoring alignments

– Substitution (or match/mismatch)

• DNA

• proteins

– Gap penalty

• Linear: gp(k)=ak

• Affine: gp(k)=b+ak

• Concave, e.g.: gp(k)=log(k)

The score for an alignment is the sum of the scores of all alignment columns


Pair wise sequence alignment

Dynamic programmingScoring alignments

– Substitution (or match/mismatch)

• DNA

• proteins

– Gap penalty

• Linear: gp(k)=ak

• Affine: gp(k)=b+ak

• Concave, e.g.: gp(k)=log(k)

The score for an alignment is the sum

of the scores over all alignment columns

affine

penalty

concave

linear

,

  • /

Gap length

General alignment score:

Sa,b=


Pair wise sequence alignment

Substitution Matrices: DNA

define a score for match/mismatch ofletters

Simple:

Used in genome alignments:


Pair wise sequence alignment

Substitution matrices for a.a.

  • Amino acids are not equal:

    • Some aresimilar and easily substituted:

      • biochemical properties

      • structure

    • Some mutations occur more often due to similar codons

  • The two above give us substitution matrices

http://www.people.virginia.edu/~rjh9u/aminacid.html

orange: nonpolar and hydrophobic.

green: polar and hydrophilic

magenta box are acidic

light blue box are basic

http://www.cimr.cam.ac.uk/links/codon.htm


Pair wise sequence alignment

BLOSUM 62 substitution matrix

http://www.carverlab.org/testing/epp.html


Pair wise sequence alignment

Gap

Scoring

intro

extension

Constant

-2

-2

Affine

-3

-1

Constant vs. affine gap scoring

…and +1 for match

Seq1G T A - -G - T - ASeq2- - A T G- A T G -

Const-2 –2 1 –2 –2 (SUM = -7)-2 –2 1 -2 –2 (SUM = -7)

Affine –4 1 -4 (SUM = -7)-3 –3 1 -3 –3 (SUM = -11)‏


Pair wise sequence alignment

Dynamic programmingScoring alignments

T D W V T A L K

T D W L - - I K

2020

10

1

Affine gap penalties (open, extension)

Amino Acid Exchange Matrix

Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px +

+s(L,I)+s(K,K)


Pair wise sequence alignment

Amino acid exchange matrices

2020

How do we get one?

And how do we get associated gap penalties?

First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure


Pair wise sequence alignment

A 2

R -2 6

N 0 0 2

D 0 -1 2 4

C -2 -4 -4 -5 12

Q 0 1 1 2 -5 4

E 0 -1 1 3 -5 2 4

G 1 -3 0 1 -3 -1 0 5

H -1 2 2 1 -3 3 1 -2 6

I -1 -2 -2 -2 -2 -2 -2 -3 -2 5

L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6

K -1 3 1 0 -5 1 0 -2 0 -2 -3 5

M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6

F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9

P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6

S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2

T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3

W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17

Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10

V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4

A R N D C Q E G H I L K M F P S T W Y V

PAM250 matrix

amino acid exchange matrix (log odds)

Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation


Pair wise sequence alignment

Amino acid exchange matrices

Amino acids are not equal:

1. Some are easily substituted because they have similar:

• physico-chemical properties

• structure

2. Some mutations between amino acids occur more often due tosimilar codons

The two above observations give us ways to define substitutionmatrices


Pair wise sequence alignment

Pair-wise alignment

T D W V T A L K

T D W L - - I K

Combinatorial explosion

- 1 gap in 1 sequence: n+1 possibilities

- 2 gaps in 1 sequence: (n+1)n

- 3 gaps in 1 sequence: (n+1)n(n-1), etc.

2n (2n)! 22n

= ~

n (n!)2 n

2 sequences of 300 a.a.: ~1088 alignments

2 sequences of 1000 a.a.: ~10600 alignments!


Technique to overcome the combinatorial explosion dynamic programming

Technique to overcome the combinatorial explosion:Dynamic Programming

  • Alignment is simulated as Markov process, all sequence positions are seen as independent

  • Chances of sequence events are independent

    • Therefore, probabilities per aligned position need to be multiplied

    • Amino acid matrices contain so-called log-odds values (log10 of the probabilities), so probabilities can be summed


Pair wise sequence alignment

To say the same more statistically…

  • To perform statistical analyses on messages or sequences, we need a reference model.

  • The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner.

  • This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment.

  • Given a probability distribution, Pi, for the letters in ai.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ... n is simply Pi Pj Pk···Pn.

  • As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi + log Pj + log Pk+ ··· + log Pn.

  • In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score.


Pair wise sequence alignment

Sequence alignmentHistory of Dynamic Programming algorithm

  • 1970Needleman-Wunsch global pair-wise alignment

  • Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins,J Mol Biol. 48(3):443-53.

  • 1981Smith-Waterman local pair-wise alignment

    • Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.


Pair wise sequence alignment

Pairwise sequence alignment

Global dynamic programming

MDAGSTVILCFVG

Evolution

M

D

A

A

S

T

I

L

C

G

S

Amino Acid Exchange

Matrix

Search matrix

Gap penalties (open,extension)

MDAGSTVILCFVG-

MDAAST-ILC--GS


Pair wise sequence alignment

Global dynamic programming

j-1j

i-1

i

Value from residue exchange matrix

H(i-1,j-1)+ S(i,j)

H(i-1,j) - g

H(i,j-1) - g

diagonal

vertical

horizontal

H(i,j) = Max

This is a recursive formula


Global alignment the needleman wunsch algorithm

Global alignmentThe Needleman-Wunsch algorithm

  • Goal: find the maximal scoring alignment

  • Scores: m match, -s mismatch, -g for insertion/deletion

  • Dynamic programming

    • Solve smaller subproblem(s)‏

    • Iteratively extend the solution

  • The best alignment for X[1…i] and Y[1…j]is called M[i, j]

X1 … Xi-1 -- Xi

Y1 … - Yj-1 Yj -


The algorithm

  • Goal: find the maximal scoring alignment

  • Scores: m match, -s mismatch, -g for insertion/deletion

  • The best alignment for X[1…i]and Y[1…j]is called M[i, j]

  • 3 ways to extend the alignment

X[1…i-1]X[i]

X[1…i] -

X[1…i-1] X[i]

Y[1…j-1]Y[j]

Y[1…j-1]Y[j]

Y[1…j] -

+ m

M[i-1, j-1]

- s

The algorithm

M[i,j] =

M[i, j-1] - g

M[i-1, j] - g


The algorithm final equation

The algorithm – final equation

Corresponds to:

*

X1…Xi-1 XiY1…Yj-1 Yj

M[i-1, j-1] + score(X[i],Y[j])‏

M[i, j] =

max

M[i, j-1] – g

X1…Xi -Y1…Yj-1 Yj

M[i-1, j] – g

X1…Xi-1 XiY1…Yj -

j-1

j

i-1

*

-g

i

-g


Example global alignment of two sequences

Example: global alignment of two sequences

  • Align two DNA sequences:

    • GAGTGA

    • GAGGCGA (note the length difference)‏

  • Parameters of the algorithm:

    • Match:score(A,A) = 1

    • Mismatch:score(A,T) = -1

    • Gap:g = -2

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

M[i-1, j] – 2


The algorithm step 1 init

j

0

1

2

3

4

5

6

i

-

G

A

G

T

G

A

0

-

0

1

G

2

A

3

G

4

G

5

C

6

G

7

A

The algorithm. Step 1: init

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

  • Create the matrix

  • Initiation

    • 0 at [0,0]

    • Apply the equation…

M[i-1, j] – 2


The algorithm step 1 init1

-

G

A

G

T

G

A

-

0

-2

-4

-6

-8

-10

-12

G

-2

A

-4

G

-6

G

-8

C

-10

G

-12

A

-14

The algorithm. Step 1: init

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

  • Initiation of the matrix:

    • 0 at pos [0,0]

    • Fill in the first row using the “” rule

    • Fill in the first column using “”

M[i-1, j] – 2

j

i


The algorithm step 2 fill in

-

G

A

G

T

G

A

-

0

-2

-4

-6

-8

-10

-12

G

-2

1

-1

-3

A

-4

-1

2

G

-6

G

-8

C

-10

G

-12

A

-14

The algorithm. Step 2: fill in

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

  • Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏

M[i-1, j] – 2

j

i


The algorithm step 2 fill in1

-

G

A

G

T

G

A

-

0

-2

-4

-6

-8

-10

-12

G

-2

1

-1

-3

-5

-7

-9

A

-4

-1

2

0

-2

-4

-6

G

-6

-3

0

3

1

-1

-3

G

-8

-5

-2

1

2

2

0

C

-10

-7

-4

-1

0

1

1

G

-12

-9

-6

-3

-2

1

0

A

-14

-11

-8

-5

-4

-1

2

The algorithm. Step 2: fill in

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

  • We are done…

  • Where’s the result?

M[i-1, j] – 2

j

The lowest-rightmost cell

i


The algorithm step 3 trace back

-

G

A

G

T

G

A

-

0

-2

-4

-6

-8

-10

-12

G

-2

1

-1

-3

-5

-7

-9

A

-4

-1

2

0

-2

-4

-6

G

-6

-3

0

3

1

-1

-3

G

-8

-5

-2

1

2

2

0

C

-10

-7

-4

-1

0

1

1

G

-12

-9

-6

-3

-2

1

0

A

-14

-11

-8

-5

-4

-1

2

The algorithm. Step 3:traceback

M[i-1, j-1] ± 1

M[i, j] =

max

M[i, j-1] – 2

  • Start at the last cell of the matrix

  • Go against the direction of arrows

  • Sometimes the value may be obtained from more than one cell (which one?)‏

M[i-1, j] – 2

j

i


The algorithm step 3 trace back1

-

G

A

G

T

G

A

-

0

-2

-4

-6

-8

-10

-12

G

-2

1

-1

-3

-5

-7

-9

A

-4

-1

2

0

-2

-4

-6

G

-6

-3

0

3

1

-1

-3

G

-8

-5

-2

1

2

2

0

C

-10

-7

-4

-1

0

1

1

G

-12

-9

-6

-3

-2

1

0

A

-14

-11

-8

-5

-4

-1

2

The algorithm. Step 3: traceback

The ‘low’ and the ‘high’ road

  • Extract the alignments

j

a)‏

GAGT-GA

a

GAGGCGA

b)‏

b

GA-GTGA

GAGGCGA

i

You can also jump from the high to the low road in the middle (following the arrow): to what alignment does that lead?


Affine gaps

X1…Xi -Y1…Yj-1 Yj

Affine gaps

  • Penalties:

    go - gap opening (e.g. -8)‏

    ge - gap extension (e.g. -1)‏

M[i-1, j-1]

X1…XiY1…Yj

M[i, j] =

max

score(X[i], Y[j]) +

Ix[i-1, j-1]

Iy[i-1, j-1]

M[i, j-1] + go

X1…-Y1… Yj

Ix[i, j] =

max

Ix[i, j-1] + ge

X1… - -Y1…Yj-1 Yj

M[i-1, j] + go

Iy[i, j] =

max

Iy[i-1, j] + gx

  • @ home: think of boundary values M[*,0], I[*,0] etc.


Semi global pairwise alignment

Semi-global pairwise alignment

  • Global alignment: all gaps are penalised

  • Semi-global alignment: N- and C-terminal (5’ and 3’) gaps (end-gaps) are not penalised

    MSTGAVLIY--TS-----

    ---GGILLFHRTSGTSNS

End-gaps

End-gaps


Variation on global alignment

Variation on global alignment

  • Global alignment: previous algorithms is called globalalignment, because it uses all letters from both sequences.CAGCACTTGGATTCTCGGCAGC-----G-T----GG

  • Semi-globalalignment: don’t penalize for end gaps

    CAGCA-CTTGGATTCTCGG---CAGCGTGG--------

    • Applications of semi-global:

      • Finding a gene in genome

      • Placing marker onto a chromosome

      • One sequence much longer than the other

  • Danger! – really bad alignments for divergent seqs

seq X:

seq Y:


Semi global pairwise alignment1

Semi-global pairwise alignment

Applications of semi-global:

– Finding a gene in genome

– Placing marker onto a chromosome

– One sequence much longer than the other

Danger: if gap penalties high -- really bad alignments for divergent sequences

Protein sequences have N- and C-terminal amino acids that are often small and hydrophilic


Semi global alignment

-

G

A

G

T

G

-

0

0

0

0

0

0

A

0

-1

1

-1

-1

-1

G

0

1

-2

2

0

-2

T

0

-1

0

0

3

1

Semi-global alignment

  • Ignore 5’ or N-terminal end gaps

    • First row/column set to 0

  • Ignore C-terminal or 3’ end gaps

    • Read the result from last row/column


Pair wise sequence alignment

Take-home messages

  • Homology

  • Why are we interested in similarity?

  • Pairwise alignment: global alignment and semi-global alignment

  • Three types of gap penalty regimes: linear, affine and concave

  • Make sure you can calculate alignment using the DP algorithm


Pair wise sequence alignment

  • a heuristic

    • Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work


Pair wise sequence alignment

Global dynamic programmingPAM250, Gap=6 (linear)

These values are copied from the PAM250 matrix (see earlier slide)

The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps

Higgs & Attwood, p. 124


Pair wise sequence alignment

Global dynamic programmingLinear, Affine or Concave gap penalties

j-1

All penalty schemes are possible because the exact length of the gap is known

i-1

Gap opening penalty

Max{S0<x<i-1, j-1- Pi - (i-x-1)Px}

Si-1,j-1

Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}

Si,j = si,j + Max

Gap extension penalty


Global dynamic programming gap o 10 gap e 2

Global dynamic programmingGapo=10, Gape=2

These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

The extra bottom row and rightmost column give the final global alignment scores


Easy dp recipe for using affine gap penalties

Easy DP recipe for using affine gap penalties

j-1

  • M[i,j] is optimal alignment (highest scoring alignment until [i,j])

  • Check

    • preceding row until j-2: apply appropriate gap penalties

    • preceding row until i-2: apply appropriate gap penalties

    • and cell[i-1, j-1]: apply score for cell[i-1, j-1]

i-1


Dp is a two step process

DP is a two-step process

  • Forward step: calculate scores

  • Trace back: start at highest score and reconstruct the path leading to the highest score

    • These two steps lead to the highest scoring alignment (the optimal alignment)

    • This is guaranteed when you use DP!


Pair wise sequence alignment

Global dynamic programming


Pair wise sequence alignment

Semi-global dynamic programming- two examples with different gap penalties -

These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

Global score is 65 –10 – 1*2 –10 – 2*2


Pair wise sequence alignment

Local dynamic programming(Smith & Waterman, 1981)

LCFVMLAGSTVIVGTR

E

D

A

S

T

I

L

C

G

S

Negative

numbers

Amino Acid

Exchange Matrix

Search matrix

Gap penalties

(open, extension)

AGSTVIVG

A-STILCG


Pair wise sequence alignment

Local dynamic programming(Smith & Waterman, 1981)

j-1

i-1

Gap opening penalty

Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}

Si,j + Si-1,j-1

Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}

0

Si,j = Max

Gap extension penalty


Pair wise sequence alignment

Local dynamic programming


Dot plots

Dot plots

  • Way of representing (visualising) sequence similarity without doing dynamic programming (DP)

  • Make same matrix, but locally represent sequence similarity by averaging using a window


Pair wise sequence alignment

Comparing two sequences

We want to be able to choose the best alignment between two sequences.

A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.


Pair wise sequence alignment

Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold

They can identify insertions, deletions, inversions


  • Login