Introduction to Bioinformatics: Lecture V Alignment Counting and Alignment Algorithms

Introduction to Bioinformatics: Lecture V Alignment Counting and Alignment Algorithms. Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC.


Presentation Transcript


  1. Introduction to Bioinformatics: Lecture V Alignment Counting and Alignment Algorithms. Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC. JM - http://folding.chmcc.org

  2. Outline of the lecture
     • Complexity of inexact string matching: exercises in counting alignments with gaps
     • The dynamic programming algorithm for sequence alignment: how it works
     • The dynamic programming algorithm for sequence alignment: why it works
     • Limitations and faster heuristic approaches: BLAST

  3. Web watch: “Genes and Disease” and other NCBI resources. Genes (proteins) work in herds. Being co-localized may imply co-expression and interactions. Always check the neighbors of your favorite gene!
     http://www.ncbi.nlm.nih.gov/
     http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=gnd.TOC&depth=2
     Additional reading materials regarding sequence alignment: check out the web site for the course …

  4. How many alignments with gaps are there? All possible alignments (with gaps, but without the unnecessary alignment of two gaps against each other) may be represented as walks on a grid in which only three steps (South, East, Southeast) are allowed, i.e., there is a bijection between the set of walks on such a grid and the set of alignments. Node labels on the grid, as shown on the slide:

       0   1   1   1
       1   3   5   7
       1   5  13  25
       1   7  25  63

     The values along the diagonal continue with 321 and 1683.
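
The counts on the grid can be reproduced with a short sketch. It assumes that each node is reached from its North, West and Northwest neighbours, so the number of walks into a node is the sum of the three neighbouring counts. The function name `count_alignments` is hypothetical, and the corner node is set to 1 (the empty walk) rather than the 0 label shown on the slide, since that is what makes the recurrence give 3, 13, 63.

```python
def count_alignments(n: int, m: int) -> int:
    """Count walks from (0, 0) to (n, m) using South, East and Southeast steps."""
    # walks[i][j] = number of walks that reach node (i, j).
    # First row and first column: exactly one walk each (all East or all South).
    walks = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            walks[i][j] = (
                walks[i - 1][j]        # last step South
                + walks[i][j - 1]      # last step East
                + walks[i - 1][j - 1]  # last step Southeast
            )
    return walks[n][m]


# Diagonal of the grid for two sequences of equal length k.
for k in range(1, 6):
    print(k, count_alignments(k, k))
```

The diagonal grows roughly geometrically with the sequence length, which is why enumerating alignments one by one is not a practical search strategy.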

  5. How many alignments with gaps are there?

  6. How many alignments with gaps are there?

  7. How many alignments with gaps are there?

  8. Time for the main idea of the algorithm … Suppose we knew the best alignments (and their scores) up to the nodes that delineate the yellow part of the DP graph. What would be the best extension, given that we have three choices?
     • Align the last characters in each string (diagonal extension) and add the score for this pair, e.g., s(a3,b3) = -5
     • Align a3 to a gap (horizontal extension), s(a3,-) = -8
     • Align b3 to a gap (vertical extension), s(-,b3) = -8
     Note well: the score for an extension does not depend on the alignment up to this point.
     [Figure: DP graph; node values shown: 5, 12, 4, -3]
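
A minimal sketch of this single extension step is given below. The function name `best_extension` and the three predecessor scores in the usage lines are made up for illustration (they are not read off the slide's figure); only the pair score -5 and the gap score -8 come from the slide.

```python
def best_extension(diag, left, up, pair_score, gap_score):
    """Pick the best of the three ways to extend into the current node.

    diag, left, up: best scores already known at the three predecessor nodes.
    pair_score:     score for aligning the two last characters, s(a, b).
    gap_score:      score for aligning a character to a gap (negative here).
    """
    candidates = {
        "diagonal": diag + pair_score,   # align the two characters
        "horizontal": left + gap_score,  # align a character of a to a gap
        "vertical": up + gap_score,      # align a character of b to a gap
    }
    move = max(candidates, key=candidates.get)
    return candidates[move], move


# Hypothetical predecessor scores; s(a3,b3) = -5 and gap score -8 as on the slide.
print(best_extension(diag=3, left=9, up=1, pair_score=-5, gap_score=-8))
```

The function only looks at the three predecessor scores and the incremental score, never at how those predecessors were reached, which is the independence property the slide stresses.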

  9. Tracing back optimal local extensions given the best alignment up to the previous node in the graph …
     [Figure: DP graph with pair scores s(a1,b1) = -5, s(a2,b1) = 10, s(a3,b2) = 10; node values shown: 2, -6, 5, 12, -8, -16, 4, -3, -5]

  10. This conceptual step may be reversed to obtain the best score and alignment up to a given point. The score is a sum of independent piecewise scores; in particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step (d denotes the penalty for aligning a character to a gap):
     • Global alignment (Needleman-Wunsch):
       F(0,0) = 0; F(k,0) = F(0,k) = -k d;
       F(i,j) = max { F(i-1,j-1) + s(a_i,b_j) ; F(i-1,j) - d ; F(i,j-1) - d }
     • Local alignment (Smith-Waterman):
       F(0,0) = 0; F(k,0) = F(0,k) = 0;
       F(i,j) = max { 0 ; F(i-1,j-1) + s(a_i,b_j) ; F(i-1,j) - d ; F(i,j-1) - d }
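
The two recurrences differ only in the boundary values and in the extra 0 inside the max, which a short sketch can make explicit. The scoring function `s` (match +2, mismatch -1), the gap penalty d = 2, and the helper names `fill_table`, `global_score` and `local_score` are assumptions chosen for illustration, not values from the lecture.

```python
def s(x, y, match=2, mismatch=-1):
    """Toy pair score (an assumption; the lecture uses a substitution matrix)."""
    return match if x == y else mismatch


def fill_table(a, b, d=2, local=False):
    """Fill the DP table F for global (NW) or local (SW) alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: F(k,0) = F(0,k) = -k d. Local alignment keeps zeros.
        for k in range(1, n + 1):
            F[k][0] = -k * d
        for k in range(1, m + 1):
            F[0][k] = -k * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(
                F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # diagonal step
                F[i - 1][j] - d,                          # gap in b
                F[i][j - 1] - d,                          # gap in a
            )
            F[i][j] = max(0, best) if local else best
    return F


def global_score(a, b, d=2):
    # Global score sits in the bottom right cell.
    return fill_table(a, b, d)[len(a)][len(b)]


def local_score(a, b, d=2):
    # Local score is the highest value anywhere in the table.
    return max(max(row) for row in fill_table(a, b, d, local=True))


print(global_score("TTACGTT", "GGACGGG"), local_score("TTACGTT", "GGACGGG"))
```

With these toy scores the local score is higher than the global one for the example pair, because the local variant is free to ignore the mismatching flanks.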

  11. The general scheme of the NW algorithm
     • Use the recurrence relations, starting from the upper left corner (convention).
     • Find the highest score in the DP table (by definition, the last, bottom right cell in the case of global alignment).
     • Trace back the alignment using the pointers in the DP graph that show how the best local steps led to the best overall alignment.
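
The three steps above can be sketched as follows. This is an illustrative version, not the lecture's code: the scores (match +2, mismatch -1, gap penalty d = 2), the function name `needleman_wunsch` and the tie-breaking order (diagonal first) are all assumptions.

```python
def needleman_wunsch(a, b, d=2, match=2, mismatch=-1):
    """Global alignment: fill the table, store pointers, then trace back."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Step 1: boundary values and recurrence, starting at the upper left corner.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                (F[i - 1][j - 1] + pair, "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            # On ties, max keeps the first option, i.e. the diagonal step.
            F[i][j], ptr[i][j] = max(options, key=lambda t: t[0])

    # Steps 2 and 3: start at the bottom right cell and follow the pointers.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_a)
print(row_b)
```

Storing one pointer per cell recovers a single optimal alignment; when several alignments share the best score, the tie-breaking order decides which one is reported.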

  12. Examples of pairwise scores from the BLOSUM50 matrix

  13. An example of a DP table for global alignment
     HEAGAWGHE
     --P-AW-HE

  14. An example of a DP table for local alignment
     AWGHE
     AW-HE

  15. Why does it work?
     • All the possible alignments (with gaps) are represented in the DP table (graph).
     • The score is a sum of independent piecewise scores; in particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step.
     • Once the best score in the DP table is found, the traceback procedure generates the alignment, since only the best “past” leading to the present score is represented by the pointers between the cells.

  16. Why does it work? Formally, one needs to show that the walk (alignment) found using the NW DP recurrence relations and the traceback procedure is indeed optimal, i.e., it maximizes the alignment score.
     An argument instead of a proof: In the case of global alignment, each path starts at cell (0,0) and must end at cell (n,m). Consider the latter cell and the immediate past that led to it through one of the 3 neighboring cells (the one that is most favorable once the cost of the incremental step is included). Changing the last step (e.g., from the initially chosen, optimal diagonal step) to an alternative one does not affect the scores at the preceding cells, which represent the best trajectories up to that point. Hence, if optimal solutions have been found in the previous steps, any alternative last step gives a solution that is at best equally good and otherwise suboptimal. Formalizing this argument yields a proof by induction with reductio ad absurdum.
     Problem: How to modify the NW algorithm for suffix-prefix matches?
     Problem: What is the meaning of the cut-off threshold for the SW algorithm?
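
The optimality claim can also be checked numerically on small inputs. The sketch below is a sanity check, not a proof: it scores every alignment explicitly (one per grid walk) and compares the maximum with the DP result. The scores (match +2, mismatch -1, gap penalty `D` = 2) and the names `all_alignment_scores` and `nw_score` are assumptions for illustration.

```python
from itertools import product

D = 2  # gap penalty (assumed value)


def s(x, y):
    """Toy pair score (assumed): +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def all_alignment_scores(a, b):
    """Yield the score of every alignment of a and b, one per grid walk."""
    if not a and not b:
        yield 0
        return
    if a and b:  # first column aligns a[0] with b[0]
        for rest in all_alignment_scores(a[1:], b[1:]):
            yield s(a[0], b[0]) + rest
    if a:        # first column aligns a[0] with a gap
        for rest in all_alignment_scores(a[1:], b):
            yield rest - D
    if b:        # first column aligns b[0] with a gap
        for rest in all_alignment_scores(a, b[1:]):
            yield rest - D


def nw_score(a, b):
    """Global alignment score via the NW recurrence, one table row at a time."""
    prev = [-j * D for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [-i * D]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + s(x, y), prev[j] - D, cur[j - 1] - D))
        prev = cur
    return prev[-1]


# Compare exhaustive search and DP on all strings over {A, C} up to length 3.
words = ["".join(p) for k in range(4) for p in product("AC", repeat=k)]
for a in words:
    for b in words:
        assert max(all_alignment_scores(a, b)) == nw_score(a, b)
print("DP matches exhaustive search on", len(words) ** 2, "pairs")
```

For two strings of length 3 the exhaustive search visits 63 alignments, the same number that appears on the counting grid, while the DP fills only 16 cells.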

  17. Approximate, heuristic solutions may be nearly as good and much faster: BLAST algorithm
     • BLAST approach: gapless seeds (high-scoring segment pairs, HSPs, with well-defined confidence measures), followed by DP extensions.
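
The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: it uses exact k-mer matches as seeds and a plain gapless extension that stops once the running score drops a fixed amount below the best seen, and it omits the statistics and the gapped DP extension mentioned on the slide. The function names, the scores and the parameters `k` and `x_drop` are assumptions.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_ungapped(query, subject, i, j, k=3, match=2, mismatch=-1, x_drop=3):
    """Extend the seed at (i, j) in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the extended segment.
    """
    def walk(qi, sj, step):
        best = run = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += match if query[qi] == subject[sj] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            if best - run >= x_drop:  # score dropped too far: stop extending
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left + right  # the seed itself is k exact matches
    return score, i - left_len, j - left_len, k + left_len + right_len


query, subject = "CCACGTGAA", "TTACGTGCC"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_ungapped(query, subject, *seeds[0]))
```

Only diagonals that contain a seed are examined at all, which is where the speed-up over filling the full DP table comes from, at the price of possibly missing alignments that contain no exact k-mer match.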
