Introduction to Bioinformatics: Lecture V Alignment Counting and Alignment Algorithms

Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
• Complexity of inexact string matching: exercises in counting alignments with gaps
• The dynamic programming algorithm for sequence alignment: how it works
• The dynamic programming algorithm for sequence alignment: why it works
• Limitations and faster heuristic approaches: BLAST

Web watch: “Genes and Disease” and other NCBI resources
Genes (proteins) work in herds. Being co-localized may imply co-expression and interactions. Always check the neighbors of your favorite gene!
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=gnd.TOC&depth=2
Additional reading materials regarding sequence alignment: check out the web site for the course …

How many alignments with gaps are there?
All the possible alignments (with gaps, but without the unnecessary alignment of two gaps against each other) may be represented as walks on a grid in which only three steps (South, East, Southeast) are allowed, i.e., there is a bijection between the set of walks on such a grid and the set of alignments. Each node below holds the number of walks that reach it from the upper left corner:
0 1 1 1
1 3 5 7
1 5 13 25
1 7 25 63
The next two entries along the diagonal are 321 and 1683.
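The counts in the grid above follow from a simple rule: a node can be reached only from its North, West, or Northwest neighbor, so its count is the sum of those three counts. The sketch below tabulates the counts that way. The function name `count_walks` is an illustrative choice, and the code assumes the common convention of counting one (empty) walk at the starting corner, whereas the grid above shows 0 there.

```python
def count_walks(n, m):
    """Count grid walks from (0, 0) to (n, m) using South, East, Southeast steps.

    Each walk corresponds to one gapped alignment of a string of length n
    with a string of length m (no gap aligned against a gap).
    """
    # Border nodes are reachable by exactly one straight walk.
    # Assumption: the corner (0, 0) is counted as 1 (the empty walk).
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Arrive from the North, the West, or the Northwest neighbor.
            counts[i][j] = counts[i - 1][j] + counts[i][j - 1] + counts[i - 1][j - 1]
    return counts[n][m]


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, count_walks(k, k))
```

The rapid growth of these counts is why exhaustive enumeration of alignments is impractical even for short sequences.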


Time for the main idea of the algorithm …
Suppose we knew the best alignments (and their scores) up to the nodes which delineate the yellow part of the DP graph. What would be the best extension, given that we have three choices:
• Align the last characters in each string (diagonal extension) and add the score for this pair, e.g., s(a3,b3) = -5
• Align a3 to a gap (horizontal extension), s(a3,-) = -8
• Align b3 to a gap (vertical extension), s(-,b3) = -8
Note well, the score for an extension does not depend on the alignment up to this point.
[Figure: DP graph fragment; node scores shown: 5, 12, 4, -3]
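The three-way choice described on this slide can be written as a single comparison. The sketch below is only an illustration: the function name `best_extension` and the predecessor scores 7, 3 and 1 in the usage example are hypothetical and are not read off the figure; only the step scores -5, -8 and -8 come from the slide.

```python
def best_extension(f_diag, f_left, f_up, s_pair, s_gap_a, s_gap_b):
    """Pick the best of the three possible extensions into a DP node.

    f_diag, f_left, f_up: best scores already known at the three
        predecessor nodes (hypothetical inputs).
    s_pair:  score for aligning the two last characters (diagonal step).
    s_gap_a: score for aligning the character of a to a gap (horizontal step).
    s_gap_b: score for aligning the character of b to a gap (vertical step).
    """
    candidates = {
        "diagonal": f_diag + s_pair,
        "horizontal": f_left + s_gap_a,
        "vertical": f_up + s_gap_b,
    }
    # The step score is added to the predecessor score; nothing else
    # about the earlier part of the alignment is needed.
    move = max(candidates, key=candidates.get)
    return move, candidates[move]


if __name__ == "__main__":
    # Step scores from the slide; predecessor scores are made up.
    print(best_extension(7, 3, 1, s_pair=-5, s_gap_a=-8, s_gap_b=-8))
```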

Tracing back optimal local extensions given the best alignment up to the previous node in the graph …
Pairwise scores shown: s(a1,b1) = -5, s(a2,b1) = 10, s(a3,b2) = 10
[Figure: DP graph with traceback pointers; node scores shown include -5, 2, -6, 5, 12, -8, -16, 4, -3]

This conceptual step may be reversed to obtain the best score and alignment up to a given point
The score is a sum of independent piecewise scores. In particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step. With d denoting the gap penalty:
• Global alignment (Needleman-Wunsch):
F(0,0) = 0; F(k,0) = F(0,k) = -k d;
F(i,j) = max { F(i-1,j-1) + s(ai,bj) ; F(i-1,j) - d ; F(i,j-1) - d }
• Local alignment (Smith-Waterman):
F(0,0) = 0; F(k,0) = F(0,k) = 0;
F(i,j) = max { 0 ; F(i-1,j-1) + s(ai,bj) ; F(i-1,j) - d ; F(i,j-1) - d }
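The two recurrences differ only in the boundary values and in the extra 0 inside the max. A minimal sketch that fills the table F for either case is given below. The names `fill_table`, `global_score`, `local_score` and `score_pair` are illustrative, and the match/mismatch scores (+1/-1) and gap penalty d = 2 are assumptions chosen for simplicity, not the substitution matrix used in the lecture.

```python
def score_pair(x, y, match=1, mismatch=-1):
    """Toy substitution score s(x, y); an assumption, not BLOSUM50."""
    return match if x == y else mismatch


def fill_table(a, b, d=2, local=False, s=score_pair):
    """Fill the DP table F for global (NW) or local (SW) alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: F(k,0) = F(0,k) = -k d.
        for i in range(1, n + 1):
            F[i][0] = -i * d
        for j in range(1, m + 1):
            F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(
                F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # align a_i with b_j
                F[i - 1][j] - d,                          # a_i against a gap
                F[i][j - 1] - d,                          # b_j against a gap
            )
            # Local alignment may restart from score 0 at any cell.
            F[i][j] = max(0, best) if local else best
    return F


def global_score(a, b, d=2):
    """Score of the best global alignment: the bottom right cell."""
    return fill_table(a, b, d=d, local=False)[len(a)][len(b)]


def local_score(a, b, d=2):
    """Score of the best local alignment: the maximum over all cells."""
    return max(max(row) for row in fill_table(a, b, d=d, local=True))


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
    print(local_score("XXAWGHEYY", "ZZAWGHEQQ"))
```

Both functions run in time and memory proportional to the product of the two sequence lengths.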

The general scheme of the NW algorithm
• Fill the DP table using the recurrence relations, starting from the upper left corner (by convention).
• Find the best score in the DP table (for global alignment this is, by definition, the last, bottom right cell).
• Trace back the alignment using the pointers in the DP graph that show how the best local steps led to the best overall alignment.
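The three steps of the scheme can be put together in a short sketch. The function name `needleman_wunsch`, the pointer labels, and the scoring parameters (match +1, mismatch -1, gap penalty d = 2) are assumptions made for illustration; ties between equally good steps are broken in favor of the diagonal step, which is one arbitrary choice among several valid ones.

```python
def needleman_wunsch(a, b, d=2, match=1, mismatch=-1):
    """Global alignment by DP with traceback (toy scoring, see lead-in)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"

    # Step 1: fill the table and remember which step gave the best score.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                (F[i - 1][j - 1] + s, "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])

    # Step 2: for global alignment the best score sits in the last cell.
    # Step 3: follow the pointers back to the upper left corner.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:  # "left"
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
    print(score)
    print(row_a)
    print(row_b)
```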

Examples of pairwise scores from the BLOSUM50 matrix
[Table not reproduced in the transcript]

An example of DP table for global alignment
[DP table not reproduced in the transcript; resulting alignment:]
HEAGAWGHE
--P-AW-HE

An example of DP table for local alignment
[DP table not reproduced in the transcript; resulting alignment:]
AWGHE
AW-HE

Why does it work?
• All the possible alignments (with gaps) are represented in the DP table (graph).
• The score is a sum of independent piecewise scores. In particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step.
• Once the best score in the DP table is found, the traceback procedure generates the alignment, since only the best “past” leading to the present score is represented by the pointers between the cells.

Why does it work?
Formally, one needs to show that the walk (alignment) found using the NW DP recurrence relations and the traceback procedure is indeed optimal, i.e., it maximizes the alignment score.
An argument instead of a proof
In the case of global alignment, each path starts at cell (0,0) and must end at cell (n,m). Consider the latter cell and the immediate past that led to it through one of the 3 neighboring cells (the one that is most favorable once the cost of the incremental step is included). Changing the last step (e.g., from the initially chosen, optimal diagonal step) to an alternative one does not affect the scores at the preceding cells, which represent the best trajectories up to those points. Therefore, if optimal solutions have been found at the previous steps, an alternative last step cannot yield a better score than the chosen one. Formalizing this argument gives a proof by induction combined with reductio ad absurdum.
Problem: How can the NW algorithm be modified for suffix-prefix matches?
Problem: What is the meaning of the cut-off threshold for the SW algorithm?

Approximate, heuristic solutions may be nearly as good and much faster: BLAST algorithm
• BLAST approach: gapless seeds (High Scoring Pairs with well-defined confidence measures), followed by DP extensions
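The seed-and-extend idea can be illustrated with a heavily simplified sketch. This is not the BLAST implementation: it assumes exact word matches of length w as seeds, a toy +1/-1 scoring, and a gapless extension that stops once the running score falls more than `x_drop` below the best score seen, whereas the slide refers to DP extensions and statistically scored High Scoring Pairs. The function names and all parameter values are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w=3):
    """Return (i, j) positions where query and subject share an exact word of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [
        (i, j)
        for i in range(len(query) - w + 1)
        for j in index.get(query[i:i + w], [])
    ]


def extend_ungapped(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed at (i, j) in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best
    gapless segment pair found around the seed.
    """
    def walk(qi, sj, step):
        best = run = best_len = k = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += match if query[qi] == subject[sj] else mismatch
            k += 1
            if run > best:
                best, best_len = run, k
            if best - run > x_drop:
                break  # score dropped too far below the best; give up
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, j - left_len, w + left_len + right_len


if __name__ == "__main__":
    q, s = "XXAWGHEYY", "ZZZAWGHEQ"
    for i, j in find_seeds(q, s):
        print((i, j), extend_ungapped(q, s, i, j))
```

Because only the neighborhoods of seeds are examined, most of the DP table is never computed, which is the source of the speed-up and also the reason the result is not guaranteed to be optimal.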
