1 / 66

# Sequence Alignment - PowerPoint PPT Presentation

Sequence Alignment. Arthur W. Chou Clark University Spring 2008. Sequence Alignment. Input: two sequences over the same alphabet Output: an alignment of the two sequences Example: GCGCATGGATTGAGCGA TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Sequence Alignment' - suchin

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Sequence Alignment

Arthur W. Chou

Clark University

Spring 2008

Input: two sequences over the same alphabet

Output: an alignment of the two sequences

Example:

• GCGCATGGATTGAGCGA

• TGCGCCATTGATGACCA

A possible alignment:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

### Why align sequences?

Lots of sequences don’t have known ancestry, structure, or function. A few of them do.

If they align, they are similar.

If they are similar, they might have the same

ancestry, similar structure or function.

If one of them has known ancestry, structure, or

function, then alignment to the others yields

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Three kinds of match:

Exact matches

Mismatches

Indels (gaps)

### Choosing Alignments

There are many possible alignments

For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?

• Similar sequences evolved from a common ancestor

• Evolution changed the sequences from this ancestral sequence by mutations:

• Replacement: one letter replaced by another

• Deletion: deletion of a character

• Insertion:insertion of a character

• Scoring of sequence similarity should examine how many and which operations took place

Score each position independently:

• Match: +1

• Mismatch: -1

• Indel: -2

Score of an alignment is sum of position scores

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

Score: (+113) + (-1  2) + (-2  4) = 3

------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Score: (+1  5) + (-1  6) + (-2  11) = -23

More General Scores

• The choice of +1,-1, and -2 scores is quite arbitrary

• Depending on the context, some changes are more plausible than others

• Exchange of an amino-acid by one with similar properties (size, charge, etc.) vs.

• Exchange of an amino-acid by one with opposite properties

• Probabilistic interpretation: How likely is one alignment versus another ?

• A dot is placed at each position where two residues match.

• It's a visual aid. The human eye can rapidly identify similar regions in sequences.

• It's a good way to explore sequence organization: e.g. sequence repeats.

• It does not provide an alignment.

• This method produces dot-plots with too much noise

• to be useful

• The noise can be reduced by calculating a score using a window of residues.

• The score is compared to a threshold or stringency.

THEFA-TCAT

||||| ||||

THEFASTCAT

Urokinase-Type plasminogen Activator

Dot Matrix Representation

• Produces a graphical representation of similarity regions

• The horizontal and vertical dimensions correspond to the compared sequences

• A region of similarity stands out as a diagonal

THE

THE

|||

THE

THE

HEF

CAT

THE

Score: -5

Score: 23

Score: -5

Score: -4

Dot Matrix or Dot-plot

• Each window of the first sequence is aligned (without

gaps) to each window of the 2nd sequence

• A colour is set into a rectangular array according to the

score of the aligned windows

• Diagonal rows ( ) of dots

reveal sequence similarity

or repeats.

• Anti-diagonal rows ( )

of dots represent inverted

repeats.

• Isolated dots represent

random similarity.

H C G E T F G R W F T P E W

K

C •

G •

P •

T • •

F • •

G •

R •

I

A

C •

G • •

E • •

M

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

We can filter it by using a

sliding window looking for longer strings of matches and eliminates random matches

Longest Common Subsequence

Sequence A:nematode_knowledge

Sequence B:empty_bottle

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

• LCS Alignment with match score 1,

mismatch score 0, and gap penalty 0

• A step-by-step description of the procedures to accomplish a task.

• Properties:

• Determination of output for each input

• Generality

• Termination

• Criteria:

• Correctness (proof, test, etc.)

• Time efficiency (no. of steps is small)

• Space efficiency (spaced used is small)

### Naïve algorithm: exhaustive search

sequences of length “n”

i

G C G A A T G G A T T G A G C G T

T G A G C C A T T G A T G A C C A

j

i j j i j i j j i i . . . . . . . . . . . . . .

2n

Worst case time complexity is ~ 2

• Similar to Longest Common Subsequence

• Introduced for biological sequences by

• S. B. Needleman & C. D. Wunsch.A general method applicable to the search for similarities in the amino acid sequence of two proteins.J. Mol. Biol. 48:443-453 (1970)

Dynamic Programming alignment

• Optimality substructure

• Reduction to a “small” number of sub-problems

• Memorization of solutions to sub-problems in a table

• Table look-up and tracing

- G C G C – A T G G A T T G A G C G A

T G C G C C A T T G A T – G A C C - A

Recursive LCS Relation alignment

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

Two strings:

A: n e m a t o d e _ k n o w l e d g e

B: e m p t y _ b o t t l e

n e m a t o d e _ k n o w l e d g e

| | | | | | |

e m p t y _ b o t t l e

A[ 2 ] == B[ 1 ]

hence lcs_len ( 2 , 1 ) = 1 + lcs_len ( 2+1, 1+1 )

Recursive LCS Relation alignment

lcs_len( i , j): length of LCS from i-th position onward in String A and from j-th position onward in String B

int lcs_len ( i , j ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’ ) return 0 ;

else

if (A[ i ] == B[ j ] )

return ( 1 + lcs_len ( i+1, j+1 ) ) ;

else

return max ( lcs_len ( i+1, j ) ,

lcs_len ( i, j+1 )

);

}

Reduction to Subproblems alignment

intlcs_len ( String A , String B )

{ returnsubproblem ( 0, 0 ); }

intsubproblem ( int i, intj)

{ if (A[i] == ‘\n’ orB[j] == ‘\n’) return 0;

else

if ( A[i] == B[j] )

return (1 +subproblem(i+1, j+1));

else return

max (subproblem(i+1, j) ,

subproblem ( i,j+1) );

}

### Memorizing the solutions : alignment

Matrix L[ i , j ] = -1 ; // initializing the memory device

int subproblem ( int i, int j ) {

if ( L[i, j] < 0 ) {

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[i , j] = 0;

else if ( A[ i ] == B[ j ] )

L[i, j] = 1 + subproblem(i+1, j+1);

else L[i, j] = max( subproblem(i+1, j),

subproblem(i, j+1));

} return L[ i, j ] ;

}

Iterative LCS: Table Look-up alignment

To get the length of LCS of Sq. A and Sq. B

{ // length of A is m and length of B is n

first allocate storage for the matrix L;

for each row i from m downto 0

for each column j from n downto 0

if (A[ i ] == ‘\n’ or B[ j ] == ‘\n’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}

### Iterative LCS: Table Look-up alignment

int lcs_len ( String A , String B ) // the length

{

// First allocate storage for the matrix L;

for ( i = m ; i >= 0 ; i-- ) // A has length m+1

for ( j = n ; j >= 0 ; j-- ) { // B has length n+1

if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) L[ i, j ] = 0;

else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];

else L[ i, j ] = max(L[i+1, j], L[i, j+1]);

}

return L[0, 0];

}

### Dynamic Programming Algorithm alignment

B

j

j+1

A

Matrix L

L[i, j] = 1 + L[i+1, j+1] , if A[ i ] == B[ j ] ;

L[i, j] = max ( L[i+1, j], L[i, j+1] ) otherwise

L[ i, j ]

L[ i, j+1 ]

i

L[ i+1, j ]

L[i+1, j+1]

i+1

n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

### Obtain the subsequence d  g  e

Sequence S = empty; // the LCS

i = 0; j = 0;

while ( i < m and j < n) {

if ( A[ i ] == B[ j ] ) {

add A[i] to end of S;

i++; j++;

} else

if ( L[i+1, j] >= L[i, j+1]) i++;

else j++;

}

n  e  m  a  t  o  d  e  _  k  n  o  w  l  e  d  g  ee   7  7  6  5  5  5  5  5  4  3  3  3  2  2  2  1  1  1  0m   6  6  6  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0p   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0t   5  5  5  5  5  4  4  4  4  3  3  3  2  2  1  1  1  1  0y   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0_   4  4  4  4  4  4  4  4  4  3  3  3  2  2  1  1  1  1  0b   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0o   3  3  3  3  3  3  3  3  3  3  3  3  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0t   3  3  3  3  3  2  2  2  2  2  2  2  2  2  1  1  1  1  0l   2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  0e   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

i d  g  e

x

### Dynamic Programming with scores and penalties

y

j

Dynamic Programming with d  g  escores and penalties: general

• from ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

s ( A[i] , B[j] ) + S[i+1, j+1]

S[i , j] = max max { S[i+x, j] – w( x );

gap x in sequence B }

max { S[i, j+y] – w( y );

gap y in sequence A }

s : score

w : penalty function

best score

from i, j onward

### Algorithm for simple gap penalty d  g  e

If for each gap, the penalty is a fixed constant “c”, then

• s(A[ i ] , B[ j ]) + S[i+1, j+1];

S[i , j] = max• S[ i+1, j ] – c ; // one gap

• S[ i, j+1 ] – c ; // one gap

### Table Tracing d  g  e

To do table tracing based on similarity matrix of amino acids, we re-define S[i , j] to be the optimal score of choosing the match of A[i] with B[j].

S[ i , j ] = s (A[ i ] , B[ j ]) + // s : score

S[i+1, j+1] // w : gap penalty

max { S[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { S[i+1, j+1+y] – w( y );

gap y in sequence A }

Diagram d  g  e

Matrix S:

j

j+1

i

i+1

Summation operation d  g  e

1. Start at lower right corner.

2. Move diagonally up one position.

3. Find largest value in either

row segment starting diagonally below current position and extending to the right or

column segment starting diagonally below current position and extending down.

4. Add this value to the value in the current cell.

5. Repeat steps 3 and 4 for all cells to the left in current row and all cells above in current column.

6. If we are not in the top left corner, go to step 2.

----V d  g  e

HGQKV

----VA d  g  e

HGQKVA

• Recall the algorithm Table Tracing

• Tracing a Similarity Matrix of Amino Acids

Use of dynamic programming to evaluate homology between pairs of sequences

• If we just want to know maximum match possible between two sequences, then we don’t need to do trace-back but can just look at the highest value in the first row or column (“match score”). This represents the best possible alignment score.

Gap penalty alternatives pairs of sequences :

• constant gap penalty for gap > 1

• gap penalty proportional to gap size (affine gap penalty)

• one penalty for starting a gap (gap opening penalty)

• different (lower) penalty for adding to a gap (gap extension penalty)

• dynamic programming algorithm can be made more efficient

Gap penalty alternatives (cont.) pairs of sequences

• gap penalty proportional to gap size and sequence

• for nucleic acids, can be used to mimic thermodynamics of helix formation.

• two kinds of gap opening penalties

• one for gap closed by AT, different for GC.

• different gap extension penalty.

End gaps pairs of sequences

• Some programs treat end gaps as normal gaps and apply penalties, other programs do not apply penalties for end gaps.

End gaps (cont.) pairs of sequences

• Can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if match score changes.

• Penalties for end gaps appropriate for aligned sequences where ends "should match“.

• Penalties for end gaps inappropriate when surrounding sequences are expected to be different (e.g., conserved exon surrounded by varying introns).

Global vs. Local Similarity pairs of sequences

• Should result of alignment include all amino acids or proteins or just those that match?

• If yes, a global alignment is desired

• If no, a local alignment is desired

• Global alignment is accomplished by including negative scores for “mismatched” positions, thus scores get worse as we move away from region of match (local alignment).

• Instead of starting trace-back with highest value in first row or column, start with highest value in entire matrix, stop when score hits zero.

Local Alignment: General Scheme pairs of sequences

s : score

• From ‘i-th’ pos. in A and ‘j-th’ pos. in B onward

s ( A[i] , B[j] ) + H[i+1, j+1]

H[i , j] = max max { H[i+x, j] – w( x );

gap x in sequence B }

max { H[i, j+y] – w( y );

gap y in sequence A }

w : penalty function

Best score of any prefix of the subsequence from i, j onward.

0

### Local Alignment: Table Tracing pairs of sequences

Best score of any prefix of the subsequence from i, j onward if matching A[i] with B[j].

To do table tracing based on similarity matrix of amino acids, we re-define H[i , j] to be the optimal score of choosing the match of A[i] with B[j].

H[ i , j ] = s (A[ i ] , B[ j ]) +

H[i+1, j+1] // w : gap penalty

max { H[i+1+x, j+1] – w( x );

+ max gap x in sequence B }

max { H[i+1, j+1+y] – w( y );

gap y in sequence A }

if positive; otherwise 0.