Sequence alignment
Download
1 / 66

Sequence Alignment - PowerPoint PPT Presentation


  • 142 Views
  • Uploaded on

Sequence Alignment. Arthur W. Chou Clark University Spring 2008. Sequence Alignment. Input: two sequences over the same alphabet Output: an alignment of the two sequences Example: GCGCATGGATTGAGCGA TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Sequence Alignment' - suchin


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Sequence alignment

Sequence Alignment

Arthur W. Chou

Clark University

Spring 2008


Sequence alignment1
Sequence Alignment

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

  • GCGCATGGATTGAGCGA

  • TGCGCCATTGATGACCA

    A possible alignment:

    -GCGC-ATGGATTGAGCGA

    TGCGCCATTGAT-GACC-A


Why align sequences

Why align sequences?

Lots of sequences don’t have known ancestry, structure, or function. A few of them do.

If they align, they are similar.

If they are similar, they might have the same

ancestry, similar structure or function.

If one of them has known ancestry, structure, or

function, then alignment to the others yields

insight about them.


Alignments

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three kinds of match:

Exact matches

Mismatches

Indels (gaps)


Choosing alignments

Choosing Alignments

There are many possible alignments

For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?


Scoring alignments
Scoring Alignments

  • Similar sequences evolved from a common ancestor

  • Evolution changed the sequences from this ancestral sequence by mutations:

    • Replacement: one letter replaced by another

    • Deletion: deletion of a character

    • Insertion:insertion of a character

  • Scoring of sequence similarity should examine how many and which operations took place


Simple Scoring Rule

Score each position independently:

  • Match: +1

  • Mismatch: -1

  • Indel: -2

    Score of an alignment is sum of position scores


Example

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+113) + (-1  2) + (-2  4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1  5) + (-1  6) + (-2  11) = -23


More General Scores

  • The choice of +1,-1, and -2 scores is quite arbitrary

  • Depending on the context, some changes are more plausible than others

    • Exchange of an amino-acid by one with similar properties (size, charge, etc.) vs.

    • Exchange of an amino-acid by one with opposite properties

  • Probabilistic interpretation: How likely is one alignment versus another ?


Dot Matrix Method

  • A dot is placed at each position where two residues match.

  • It's a visual aid. The human eye can rapidly identify similar regions in sequences.

  • It's a good way to explore sequence organization: e.g. sequence repeats.

  • It does not provide an alignment.

  • This method produces dot-plots with too much noise

  • to be useful

  • The noise can be reduced by calculating a score using a window of residues.

  • The score is compared to a threshold or stringency.

THEFA-TCAT

||||| ||||

THEFASTCAT


Tissue-Type plasminogen Activator

Urokinase-Type plasminogen Activator

Dot Matrix Representation

  • Produces a graphical representation of similarity regions

  • The horizontal and vertical dimensions correspond to the compared sequences

  • A region of similarity stands out as a diagonal


HEF

THE

THE

|||

THE

THE

HEF

CAT

THE

Score: -5

Score: 23

Score: -5

Score: -4

Dot Matrix or Dot-plot

  • Each window of the first sequence is aligned (without

    gaps) to each window of the 2nd sequence

  • A colour is set into a rectangular array according to the

    score of the aligned windows


Dot Matrix Display

  • Diagonal rows ( ) of dots

    reveal sequence similarity

    or repeats.

  • Anti-diagonal rows ( )

    of dots represent inverted

    repeats.

  • Isolated dots represent

    random similarity.

H C G E T F G R W F T P E W

K

C •

G •

P •

T • •

F • •

G •

R •

I

A

C •

G • •

E • •

M


Dot matrix web server

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

We can filter it by using a

sliding window looking for longer strings of matches and eliminates random matches


Longest common subsequence
Longest Common Subsequence

Sequence A:nematode_knowledge

Sequence B:empty_bottle

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

  • LCS Alignment with match score 1,

    mismatch score 0, and gap penalty 0


What is an algorithm
What is an algorithm?

  • A step-by-step description of the procedures to accomplish a task.

  • Properties:

  • Determination of output for each input

  • Generality

  • Termination

  • Criteria:

  • Correctness (proof, test, etc.)

  • Time efficiency (no. of steps is small)

  • Space efficiency (spaced used is small)


Na ve algorithm exhaustive search

Naïve algorithm: exhaustive search

sequences of length “n”

i

G C G A A T G G A T T G A G C G T

T G A G C C A T T G A T G A C C A

j

i j j i j i j j i i . . . . . . . . . . . . . .

2n

Worst case time complexity is ~ 2


Dynamic programming algorithms for pairwise sequence alignment
Dynamic programming algorithms for pairwise sequence alignment

  • Similar to Longest Common Subsequence

  • Introduced for biological sequences by

    • S. B. Needleman & C. D. Wunsch.A general method applicable to the search for similarities in the amino acid sequence of two proteins.J. Mol. Biol. 48:443-453 (1970)


Dynamic programming
Dynamic Programming alignment

  • Optimality substructure

  • Reduction to a “small” number of sub-problems

  • Memorization of solutions to sub-problems in a table

  • Table look-up and tracing

- G C G C – A T G G A T T G A G C G A

T G C G C C A T T G A T – G A C C - A


Recursive lcs relation
Recursive LCS Relation alignment

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

Two strings:

A: n e m a t o d e _ k n o w l e d g e

B: e m p t y _ b o t t l e

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

A[ 2 ] == B[ 1 ]

hence lcs_len ( 2 , 1 ) = 1 + lcs_len ( 2+1, 1+1 )


Recursive lcs relation1
Recursive LCS Relation alignment

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

int lcs_len ( i , j ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’ ) return 0 ;

else

if (A[ i ] == B[ j ] )

return ( 1 + lcs_len ( i+1, j+1 ) ) ;

else

return max ( lcs_len ( i+1, j ) ,

lcs_len ( i, j+1 )

);

}


Reduction to subproblems
Reduction to Subproblems alignment

intlcs_len ( String A , String B )

{ returnsubproblem ( 0, 0 ); }

intsubproblem ( int i, intj)

{ if (A[i] == ‘\n’ orB[j] == ‘\n’) return 0;

else

if ( A[i] == B[j] )

return (1 +subproblem(i+1, j+1));

else return

max (subproblem(i+1, j) ,

subproblem ( i,j+1) );

}


Memorizing the solutions

Memorizing the solutions : alignment

Matrix L[ i , j ] = -1 ; // initializing the memory device

int subproblem ( int i, int j ) {

if ( L[i, j] < 0 ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[i , j] = 0;

else if ( A[ i ] == B[ j ] )

L[i, j] = 1 + subproblem(i+1, j+1);

else L[i, j] = max( subproblem(i+1, j),

subproblem(i, j+1));

} return L[ i, j ] ;

}


Iterative LCS: Table Look-up alignment

To get the length of LCS of Sq. A and Sq. B

{ // length of A is m and length of B is n

first allocate storage for the matrix L;

for each row i from m downto 0

for each column j from n downto 0

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}


Iterative lcs table look up

Iterative LCS: Table Look-up alignment

int lcs_len ( String A , String B ) // the length

{

// First allocate storage for the matrix L;

for ( i = m ; i >= 0 ; i-- ) // A has length m+1

for ( j = n ; j >= 0 ; j-- ) { // B has length n+1

if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}


Dynamic programming algorithm

Dynamic Programming Algorithm alignment

B

j

j+1

A

Matrix L

L[i, j] = 1 + L[i+1, j+1] , if A[ i ] == B[ j ] ;

L[i, j] = max ( L[i+1, j], L[i, j+1] ) otherwise

L[ i, j ]

L[ i, j+1 ]

i

L[ i+1, j ]

L[i+1, j+1]

i+1


n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


Obtain the subsequence

Obtain the subsequence d  g  e

Sequence S = empty; // the LCS

i = 0; j = 0;

while ( i < m and j < n) {

if ( A[ i ] == B[ j ] ) {

add A[i] to end of S;

i++; j++;

} else

if ( L[i+1, j] >= L[i, j+1]) i++;

else j++;

}


n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0


Dynamic programming with scores and penalties

i d  g  e

x

Dynamic Programming with scores and penalties

y

j


Dynamic programming with scores and penalties general
Dynamic Programming with d  g  escores and penalties: general

  • from ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

    s ( A[i] , B[j] ) + S[i+1, j+1]

    S[i , j] = max max { S[i+x, j] – w( x );

    gap x in sequence B }

    max { S[i, j+y] – w( y );

    gap y in sequence A }

s : score

w : penalty function

best score

from i, j onward


Algorithm for simple gap penalty

Algorithm for simple gap penalty d  g  e

If for each gap, the penalty is a fixed constant “c”, then

• s(A[ i ] , B[ j ]) + S[i+1, j+1];

S[i , j] = max• S[ i+1, j ] – c ; // one gap

• S[ i, j+1 ] – c ; // one gap


Table tracing

Table Tracing d  g  e

To do table tracing based on similarity matrix of amino acids, we re-define S[i , j] to be the optimal score of choosing the match of A[i] with B[j].

S[ i , j ] = s (A[ i ] , B[ j ]) + // s : score

S[i+1, j+1] // w : gap penalty

max { S[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { S[i+1, j+1+y] – w( y );

gap y in sequence A }


Diagram
Diagram d  g  e

Matrix S:

j

j+1

i

i+1


Summation operation
Summation operation d  g  e

1. Start at lower right corner.

2. Move diagonally up one position.

3. Find largest value in either

row segment starting diagonally below current position and extending to the right or

column segment starting diagonally below current position and extending down.

4. Add this value to the value in the current cell.

5. Repeat steps 3 and 4 for all cells to the left in current row and all cells above in current column.

6. If we are not in the top left corner, go to step 2.


----V d  g  e

HGQKV


----VA d  g  e

HGQKVA


----VADALTK d  g  e

HGQKVADALTK


----VADALTK d  g  e

HGQKVADALTK


----VADALTKPVNFKFA d  g  e

HGQKVADALTK------A


----VADALTKPVNFKFAVAH d  g  e

HGQKVADALTK------AVAH


Dynamic programming by tracing a similarity matrix
Dynamic Programming by Tracing a Similarity Matrix d  g  e

  • Recall the algorithm Table Tracing

  • Tracing a Similarity Matrix of Amino Acids


Use of dynamic programming to evaluate homology between pairs of sequences
Use of dynamic programming to evaluate homology between pairs of sequences

  • If we just want to know maximum match possible between two sequences, then we don’t need to do trace-back but can just look at the highest value in the first row or column (“match score”). This represents the best possible alignment score.


Gap penalty alternatives
Gap penalty alternatives pairs of sequences :

  • constant gap penalty for gap > 1

  • gap penalty proportional to gap size (affine gap penalty)

    • one penalty for starting a gap (gap opening penalty)

    • different (lower) penalty for adding to a gap (gap extension penalty)

    • dynamic programming algorithm can be made more efficient


Gap penalty alternatives cont
Gap penalty alternatives (cont.) pairs of sequences

  • gap penalty proportional to gap size and sequence

    • for nucleic acids, can be used to mimic thermodynamics of helix formation.

    • two kinds of gap opening penalties

      • one for gap closed by AT, different for GC.

    • different gap extension penalty.


End gaps
End gaps pairs of sequences

  • Some programs treat end gaps as normal gaps and apply penalties, other programs do not apply penalties for end gaps.


End gaps cont
End gaps (cont.) pairs of sequences

  • Can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if match score changes.

  • Penalties for end gaps appropriate for aligned sequences where ends "should match“.

  • Penalties for end gaps inappropriate when surrounding sequences are expected to be different (e.g., conserved exon surrounded by varying introns).


Global vs local similarity
Global vs. Local Similarity pairs of sequences

  • Should result of alignment include all amino acids or proteins or just those that match?

    • If yes, a global alignment is desired

    • If no, a local alignment is desired

      • Global alignment is accomplished by including negative scores for “mismatched” positions, thus scores get worse as we move away from region of match (local alignment).

      • Instead of starting trace-back with highest value in first row or column, start with highest value in entire matrix, stop when score hits zero.


Local alignment general scheme
Local Alignment: General Scheme pairs of sequences

s : score

  • From ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

    s ( A[i] , B[j] ) + H[i+1, j+1]

    H[i , j] = max max { H[i+x, j] – w( x );

    gap x in sequence B }

    max { H[i, j+y] – w( y );

    gap y in sequence A }

w : penalty function

Best score of any prefix of the subsequence from i, j onward.

0


Local alignment table tracing

Local Alignment: Table Tracing pairs of sequences

Best score of any prefix of the subsequence from i, j onward if matching A[i] with B[j].

To do table tracing based on similarity matrix of amino acids, we re-define H[i , j] to be the optimal score of choosing the match of A[i] with B[j].

H[ i , j ] = s (A[ i ] , B[ j ]) +

H[i+1, j+1] // w : gap penalty

max { H[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { H[i+1, j+1+y] – w( y );

gap y in sequence A }

if positive; otherwise 0.


ad