Sequence alignment
This presentation is the property of its rightful owner.
Sponsored Links
1 / 66

Sequence Alignment PowerPoint PPT Presentation


  • 93 Views
  • Uploaded on
  • Presentation posted in: General

Sequence Alignment. Arthur W. Chou Clark University Spring 2008. Sequence Alignment. Input: two sequences over the same alphabet Output: an alignment of the two sequences Example: GCGCATGGATTGAGCGA TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A.

Download Presentation

Sequence Alignment

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Sequence alignment

Sequence Alignment

Arthur W. Chou

Clark University

Spring 2008


Sequence alignment1

Sequence Alignment

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

  • GCGCATGGATTGAGCGA

  • TGCGCCATTGATGACCA

    A possible alignment:

    -GCGC-ATGGATTGAGCGA

    TGCGCCATTGAT-GACC-A


Why align sequences

Why align sequences?

Lots of sequences don’t have known ancestry, structure, or function. A few of them do.

If they align, they are similar.

If they are similar, they might have the same

ancestry, similar structure or function.

If one of them has known ancestry, structure, or

function, then alignment to the others yields

insight about them.


Sequence alignment

Alignments

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three kinds of match:

Exact matches

Mismatches

Indels (gaps)


Choosing alignments

Choosing Alignments

There are many possible alignments

For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?


Scoring alignments

Scoring Alignments

  • Similar sequences evolved from a common ancestor

  • Evolution changed the sequences from this ancestral sequence by mutations:

    • Replacement: one letter replaced by another

    • Deletion: deletion of a character

    • Insertion:insertion of a character

  • Scoring of sequence similarity should examine how many and which operations took place


Sequence alignment

Simple Scoring Rule

Score each position independently:

  • Match: +1

  • Mismatch: -1

  • Indel: -2

    Score of an alignment is sum of position scores


Sequence alignment

Example

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+113) + (-1  2) + (-2  4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1  5) + (-1  6) + (-2  11) = -23


Sequence alignment

More General Scores

  • The choice of +1,-1, and -2 scores is quite arbitrary

  • Depending on the context, some changes are more plausible than others

    • Exchange of an amino-acid by one with similar properties (size, charge, etc.) vs.

    • Exchange of an amino-acid by one with opposite properties

  • Probabilistic interpretation: How likely is one alignment versus another ?


Sequence alignment

Dot Matrix Method

  • A dot is placed at each position where two residues match.

  • It's a visual aid. The human eye can rapidly identify similar regions in sequences.

  • It's a good way to explore sequence organization: e.g. sequence repeats.

  • It does not provide an alignment.

  • This method produces dot-plots with too much noise

  • to be useful

  • The noise can be reduced by calculating a score using a window of residues.

  • The score is compared to a threshold or stringency.

THEFA-TCAT

||||| ||||

THEFASTCAT


Sequence alignment

Tissue-Type plasminogen Activator

Urokinase-Type plasminogen Activator

Dot Matrix Representation

  • Produces a graphical representation of similarity regions

  • The horizontal and vertical dimensions correspond to the compared sequences

  • A region of similarity stands out as a diagonal


Sequence alignment

HEF

THE

THE

|||

THE

THE

HEF

CAT

THE

Score: -5

Score: 23

Score: -5

Score: -4

Dot Matrix or Dot-plot

  • Each window of the first sequence is aligned (without

    gaps) to each window of the 2nd sequence

  • A colour is set into a rectangular array according to the

    score of the aligned windows


Sequence alignment

Dot Matrix Display

  • Diagonal rows ( ) of dots

    reveal sequence similarity

    or repeats.

  • Anti-diagonal rows ( )

    of dots represent inverted

    repeats.

  • Isolated dots represent

    random similarity.

H C G E T F G R W F T P E W

K

C •

G •

P •

T • •

F • •

G •

R •

I

A

C •

G • •

E • •

M


Sequence alignment

Dot matrix web server

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

We can filter it by using a

sliding window looking for longer strings of matches and eliminates random matches


Longest common subsequence

Longest Common Subsequence

Sequence A:nematode_knowledge

Sequence B:empty_bottle

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

  • LCS Alignment with match score 1,

    mismatch score 0, and gap penalty 0


What is an algorithm

What is an algorithm?

  • A step-by-step description of the procedures to accomplish a task.

  • Properties:

  • Determination of output for each input

  • Generality

  • Termination

  • Criteria:

  • Correctness (proof, test, etc.)

  • Time efficiency (no. of steps is small)

  • Space efficiency (spaced used is small)


Na ve algorithm exhaustive search

Naïve algorithm: exhaustive search

sequences of length “n”

i

G C G A A T G G A T T G A G C G T

T G A G C C A T T G A T G A C C A

j

i j j i j i j j i i . . . . . . . . . . . . . .

2n

Worst case time complexity is ~ 2


Dynamic programming algorithms for pairwise sequence alignment

Dynamic programming algorithms for pairwise sequence alignment

  • Similar to Longest Common Subsequence

  • Introduced for biological sequences by

    • S. B. Needleman & C. D. Wunsch.A general method applicable to the search for similarities in the amino acid sequence of two proteins.J. Mol. Biol. 48:443-453 (1970)


Dynamic programming

Dynamic Programming

  • Optimality substructure

  • Reduction to a “small” number of sub-problems

  • Memorization of solutions to sub-problems in a table

  • Table look-up and tracing

- G C G C – A T G G A T T G A G C G A

T G C G C C A T T G A T – G A C C - A


Recursive lcs relation

Recursive LCS Relation

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

Two strings:

A: n e m a t o d e _ k n o w l e d g e

B: e m p t y _ b o t t l e

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

A[ 2 ] == B[ 1 ]

hence lcs_len ( 2 , 1 ) = 1 + lcs_len ( 2+1, 1+1 )


Recursive lcs relation1

Recursive LCS Relation

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

int lcs_len ( i , j ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’ ) return 0 ;

else

if (A[ i ] == B[ j ] )

return ( 1 + lcs_len ( i+1, j+1 ) ) ;

else

return max ( lcs_len ( i+1, j ) ,

lcs_len ( i, j+1 )

);

}


Reduction to subproblems

Reduction to Subproblems

intlcs_len ( String A , String B )

{ returnsubproblem ( 0, 0 ); }

intsubproblem ( int i, intj)

{ if (A[i] == ‘\n’ orB[j] == ‘\n’) return 0;

else

if ( A[i] == B[j] )

return (1 +subproblem(i+1, j+1));

else return

max (subproblem(i+1, j) ,

subproblem ( i,j+1) );

}


Memorizing the solutions

Memorizing the solutions :

Matrix L[ i , j ] = -1 ; // initializing the memory device

int subproblem ( int i, int j ) {

if ( L[i, j] < 0 ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[i , j] = 0;

else if ( A[ i ] == B[ j ] )

L[i, j] = 1 + subproblem(i+1, j+1);

else L[i, j] = max( subproblem(i+1, j),

subproblem(i, j+1));

} return L[ i, j ] ;

}


Sequence alignment

Iterative LCS: Table Look-up

To get the length of LCS of Sq. A and Sq. B

{ // length of A is m and length of B is n

first allocate storage for the matrix L;

for each row i from m downto 0

for each column j from n downto 0

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}


Iterative lcs table look up

Iterative LCS: Table Look-up

int lcs_len ( String A , String B ) // the length

{

// First allocate storage for the matrix L;

for ( i = m ; i >= 0 ; i-- ) // A has length m+1

for ( j = n ; j >= 0 ; j-- ) { // B has length n+1

if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}


Dynamic programming algorithm

Dynamic Programming Algorithm

B

j

j+1

A

Matrix L

L[i, j] = 1 + L[i+1, j+1] , if A[ i ] == B[ j ] ;

L[i, j] = max ( L[i+1, j], L[i, j+1] ) otherwise

L[ i, j ]

L[ i, j+1 ]

i

L[ i+1, j ]

L[i+1, j+1]

i+1


Sequence alignment

n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


Obtain the subsequence

Obtain the subsequence

Sequence S = empty; // the LCS

i = 0; j = 0;

while ( i < m and j < n) {

if ( A[ i ] == B[ j ] ) {

add A[i] to end of S;

i++; j++;

} else

if ( L[i+1, j] >= L[i, j+1]) i++;

else j++;

}


Sequence alignment

n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


Dynamic programming with scores and penalties

i

x

Dynamic Programming with scores and penalties

y

j


Dynamic programming with scores and penalties general

Dynamic Programming with scores and penalties: general

  • from ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

    s ( A[i] , B[j] ) + S[i+1, j+1]

    S[i , j] = max max { S[i+x, j] – w( x );

    gap x in sequence B }

    max { S[i, j+y] – w( y );

    gap y in sequence A }

s : score

w : penalty function

best score

from i, j onward


Algorithm for simple gap penalty

Algorithm for simple gap penalty

If for each gap, the penalty is a fixed constant “c”, then

• s(A[ i ] , B[ j ]) + S[i+1, j+1];

S[i , j] = max• S[ i+1, j ] – c ; // one gap

• S[ i, j+1 ] – c ; // one gap


Table tracing

Table Tracing

To do table tracing based on similarity matrix of amino acids, we re-define S[i , j] to be the optimal score of choosing the match of A[i] with B[j].

S[ i , j ] = s (A[ i ] , B[ j ]) + // s : score

S[i+1, j+1] // w : gap penalty

max { S[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { S[i+1, j+1+y] – w( y );

gap y in sequence A }


Diagram

Diagram

Matrix S:

j

j+1

i

i+1


Summation operation

Summation operation

1. Start at lower right corner.

2. Move diagonally up one position.

3. Find largest value in either

row segment starting diagonally below current position and extending to the right or

column segment starting diagonally below current position and extending down.

4. Add this value to the value in the current cell.

5. Repeat steps 3 and 4 for all cells to the left in current row and all cells above in current column.

6. If we are not in the top left corner, go to step 2.


Sequence alignment

----V

HGQKV


Sequence alignment

----VA

HGQKVA


Sequence alignment

----VADALTK

HGQKVADALTK


Sequence alignment

----VADALTK

HGQKVADALTK


Sequence alignment

----VADALTKPVNFKFA

HGQKVADALTK------A


Sequence alignment

----VADALTKPVNFKFAVAH

HGQKVADALTK------AVAH


Dynamic programming by tracing a similarity matrix

Dynamic Programming by Tracing a Similarity Matrix

  • Recall the algorithm Table Tracing

  • Tracing a Similarity Matrix of Amino Acids


Use of dynamic programming to evaluate homology between pairs of sequences

Use of dynamic programming to evaluate homology between pairs of sequences

  • If we just want to know maximum match possible between two sequences, then we don’t need to do trace-back but can just look at the highest value in the first row or column (“match score”). This represents the best possible alignment score.


Gap penalty alternatives

Gap penalty alternatives :

  • constant gap penalty for gap > 1

  • gap penalty proportional to gap size (affine gap penalty)

    • one penalty for starting a gap (gap opening penalty)

    • different (lower) penalty for adding to a gap (gap extension penalty)

    • dynamic programming algorithm can be made more efficient


Gap penalty alternatives cont

Gap penalty alternatives (cont.)

  • gap penalty proportional to gap size and sequence

    • for nucleic acids, can be used to mimic thermodynamics of helix formation.

    • two kinds of gap opening penalties

      • one for gap closed by AT, different for GC.

    • different gap extension penalty.


End gaps

End gaps

  • Some programs treat end gaps as normal gaps and apply penalties, other programs do not apply penalties for end gaps.


End gaps cont

End gaps (cont.)

  • Can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if match score changes.

  • Penalties for end gaps appropriate for aligned sequences where ends "should match“.

  • Penalties for end gaps inappropriate when surrounding sequences are expected to be different (e.g., conserved exon surrounded by varying introns).


Global vs local similarity

Global vs. Local Similarity

  • Should result of alignment include all amino acids or proteins or just those that match?

    • If yes, a global alignment is desired

    • If no, a local alignment is desired

      • Global alignment is accomplished by including negative scores for “mismatched” positions, thus scores get worse as we move away from region of match (local alignment).

      • Instead of starting trace-back with highest value in first row or column, start with highest value in entire matrix, stop when score hits zero.


Local alignment general scheme

Local Alignment: General Scheme

s : score

  • From ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

    s ( A[i] , B[j] ) + H[i+1, j+1]

    H[i , j] = max max { H[i+x, j] – w( x );

    gap x in sequence B }

    max { H[i, j+y] – w( y );

    gap y in sequence A }

w : penalty function

Best score of any prefix of the subsequence from i, j onward.

0


Local alignment table tracing

Local Alignment: Table Tracing

Best score of any prefix of the subsequence from i, j onward if matching A[i] with B[j].

To do table tracing based on similarity matrix of amino acids, we re-define H[i , j] to be the optimal score of choosing the match of A[i] with B[j].

H[ i , j ] = s (A[ i ] , B[ j ]) +

H[i+1, j+1] // w : gap penalty

max { H[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { H[i+1, j+1+y] – w( y );

gap y in sequence A }

if positive; otherwise 0.


  • Login