Global Sequence Alignment by Dynamic Programming

Global Sequence Alignment by Dynamic Programming

Needleman-Wunsch Algorithm • General Algorithm for sequence comparison • Maximize a similarity score to give the maximum match • Maximum match=largest number of amino acids or nucleotides of one sequence that can be matched with another, allowing for all possible deletions.

Needleman-Wunsch Algorithm • Finds the best GLOBAL alignment of any two sequences. • N-M involves an iterative matrix method of calculation • All possible pairs (nucleotides or amino acids) are represented in a two-dimensional array. • All possible alignments are represented by pathways through this array.

Needleman-Wunsch Algorithm • Sequence alignment methods predate dot-matrix searches, and all of the alignment methods in use today are related to the original method of Needleman and Wunsch (1970). • Needleman and Wunsch wanted to quantify the similarity between two sequences.

Needleman-Wunsch Algorithm • Over the course of evolution, some positions undergo base or amino acid substitutions, and bases or amino acids can be inserted or deleted. • Any measurement of similarity must therefore be done with respect to the best possible alignment between two sequences. • Because insertion/deletion events are rare compared to base substitutions, it makes sense to penalize gaps more heavily than mismatches when calculating a similarity score.

Dynamic Programming • Finding the best alignment of 2 sequences is a hard problem solved by a computational method called dynamic programming • Multiple sequence alignment of 3 or more sequences can be solved by dynamic programming and statistical methods

How Do We Generate the Correct Alignment ? • We can't. • We can never guarantee that a particular alignment is correct except for the simplest, unambiguous alignments! • Such an alignment would require aligning each sequence in turn to the ancestral sequence first. • Since there is no possibility to know the ancestral sequence and the evolutionary steps, the evolutionary correctness of any alignment cannot be determined.

The Optimal Alignment. • If the optimal alignment does not support homology, then the correct alignment (which has a smaller or equal score) will not support homology either. • But again: there is no guarantee that the optimal alignment is the correct alignment, even though it may be the best guess.

Dynamic Programming • Dynamic programming is a term from operations research, where it was first used to describe a class of algorithms for the optimization of dynamic systems.

Dynamic Programming • In dynamic programming the principle of divide-and-conquer is used extensively: subdivide a problem that is to large to be computed, into smaller problems that may be efficiently computed. • Then assemble the answers to give a solution for the large problem. • When you do not know which smaller problem to solve, simply solve all smaller problems, store the answers and assembled them later to a solution for the large problem.

Dynamic Programming • Global optimal alignment is a difficult problem. • The major difficulty comes from the fact, that one cannot simply slide one sequence along another and sum over the similarity scores looked up in the appropriate mutation data matrix. • This will not work, because biological sequences may have gaps or insertions of sequences relative to each other.

Three steps in Dynamic Programming 1. Initialization 2 Matrix fill or scoring 3. Traceback and alignment

Sample Matrix • To align with a cell in the diagonal means an alignment in the next position. • An increasing diagonal line means a stretch of sequence identity. • To align with an off-diagonal cell requires the insertion of a corresponding number of gaps.

Needleman-Wunsch Algorithm • We can compute for every cell the highest possible score that can be obtained for a path originating from that cell. • That is done by looking in all the elements permissible for extending the path, and adding the highest value found to the contents of the cell.

Needleman-Wunsch Algorithm • If this is done in an orderly way, the highest score found is the global maximum alignment score. • Then the optimal path consists of all those cells that contributed to the global maximum alignment score. • There can be several equivalent optimal paths.

Sample Matrix • Create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned. • Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step • One possible solution of the matrix fill step finds the maximum global alignment score by starting in the upper left hand corner in the matrix and finding the maximal score Mi,j for each position in the matrix. • In order to find Mi,j for any i,j it is minimal to know the score for the matrix positions to the left, above and diagonal to i, j. • In terms of matrix positions, it is necessary to know Mi-1,j, Mi,j-1 and Mi-1, j-1.

Matrix Fill Step • In the example, Mi-1,j-1 will be red, Mi,j-1 will be green and Mi-1,j will be blue. • Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1.

Matrix Fill Step • Now let's look at column 2. The location at row 2 will be assigned the value of the maximum of 1(mismatch), 1(horizontal gap) or 1 (vertical gap). So its value is 1. • At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2(match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.

Traceback Step • After the matrix fill step, the maximum alignment score for the two test sequences is 6. • The traceback step determines the actual alignment(s) that result in the maximum score. • With a simple scoring algorithm like this one, there are likely to be multiple maximal alignments. • The traceback step begins in the M,J position in the matrix- the position that leads to the maximal score. • In this case, there is a 6 in that location.

Traceback Step • Traceback takes the current cell and looks to the neighbor cells that could be direct predecessors. • This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). • The algorithm for traceback chooses as the next cell in the sequence one of the possible predecessors. The neighbors are marked in red and are also equal to 5.

Traceback Step • Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal match/mismatch neighbor. • If more than one possible predecessor exists, any can be chosen.

Traceback Step • This gives us a current alignment of (Seq #1) A | (Seq #2) A • So now we look at the current cell and determine which cell is its direct predecessor. • In this case, it is the cell with the red 5.

This is One Traceback Giving an alignment of : G A A T T C A G T T A | | | | | | G G A _ T C _ G _ _ A

This is an alternative Traceback giving an alignment of : G _ A A T T C A G T T A | | | | | | G G _ A _ T C _ G _ _ A

What Is the Best Alignment Between These Two Amino Acid Sequences? • A zinc-finger core sequence: CKHVFCRVCI • A sequence fragment from a viral protein: CKKCFCKCV

Practice Matrix C K H V F C R V C I +-------------------- C | K | K | C | F | C | K | C | V |

Place a 1 Where There Is a Match in the Matrix C K H V F C R V C I +-------------------- C | 1 1 1 K | 1 K | 1 C | 1 1 1 F | 1 C | 1 1 1 K | 1 C | 1 1 1 V | 1 1

Place Zeroes on the Ends Since They Can’t Align C K H V F C R V C I +-------------------- C | 1 1 1 0 K | 1 0 K | 1 0 C | 1 1 1 0 F | 1 0 C | 1 1 1 0 K | 1 0 C | 1 1 1 0 V | 0 0 0 1 0 0 0 1 0 0

Loading the Matrix • Then proceed to the next row and column, adding to each matrix cell the maximal value of any other cell that could be the next step on a path to the matrix. • For instance the value 1 at (C6,C8) now becomes a 2, since it could be extended from the 1 at (V8,V9) through a 1.

Filling the Matrix C K H V F C R V C I +-------------------- C | 1 1 1 0 K | 1 0 0 K | 1 0 0 C | 1 1 1 0 F | 1 0 0 C | 1 1 1 0 K | 1 0 0 C | 21 1 1 1 21 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

Repeat This Procedure for the Next Row and Column: C K HV F C R V C I +-------------------- C | 1 1 1 1 0 K | 1 1 0 0 K | 1 1 0 0 C | 1 1 1 1 0 F | 1 1 0 0 C | 1 1 1 1 0 K | 2 32 22 1 1 1 0 0 C | 2 1 1 1 1 21 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

And the Next … C K H V F C R V C I +-------------------- C | 1 1 1 1 1 0 K | 1 1 1 0 0 K | 1 1 1 0 0 C | 1 1 1 1 1 0 F | 1 1 1 0 0 C | 4 2 2 2 2 2 1 1 1 0 K | 2 32 2 2 1 1 1 0 0 C | 2 1 1 1 1 21 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

And the Next Row… C K H V F C R V C I +-------------------- C | 1 2 1 1 1 0 K | 1 1 1 1 0 0 K | 1 1 1 1 0 0 C | 1 2 1 1 1 0 F | 3 2 2 2 3 1 1 1 0 0 C | 4 2 2 2 2 2 1 1 1 0 K | 2 3 2 2 2 1 1 1 0 0 C | 2 1 1 1 1 21 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

And the Next … C K H V F C R V C I +-------------------- C | 1 2 2 1 1 1 0 K | 1 2 1 1 1 0 0 K | 1 2 1 1 1 0 0 C | 43 3 3 2 2 1 1 1 0 F | 3 2 2 2 3 1 1 1 0 0 C | 4 2 2 2 2 2 1 1 1 0 K | 2 3 2 2 2 1 1 1 0 0 C | 2 1 1 1 1 2 1 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

And the Next … C K H V F C R V C I +-------------------- C | 1 3 2 2 1 1 1 0 K | 1 3 2 1 1 1 0 0 K | 3 4 3 3 2 1 1 1 0 0 C | 4 3 3 3 2 2 1 1 1 0 F | 3 2 2 2 3 1 1 1 0 0 C | 4 2 2 2 2 2 1 1 1 0 K | 2 3 2 2 2 1 1 1 0 0 C | 2 1 1 1 1 21 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

And the Next … C K H V F C R V C I +-------------------- C | 1 3 3 2 2 1 1 1 0 K | 4 4 3 3 2 1 1 1 0 0 K | 3 4 3 3 2 1 1 1 0 0 C | 4 3 3 3 2 2 1 1 1 0 F | 3 2 2 2 3 1 1 1 0 0 C | 4 2 2 2 2 2 1 1 1 0 K | 2 3 2 2 2 1 1 1 0 0 C | 2 1 1 1 1 2 1 0 1 0 V | 0 0 0 1 0 0 0 1 0 0

Global Sequence Alignment by Dynamic Programming