Sequence comparison: Dynamic programming

Sequence comparison: Dynamic programming. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas. Sequence comparison overview. Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need

Presentation Transcript


  1. Sequence comparison: Dynamic programming. Genome 559: Introduction to Statistical and Computational Genomics. Prof. James H. Thomas

  2. Sequence comparison overview • Problem: Find the “best” alignment between a query sequence and a target sequence. • To solve this problem, we need • a method for scoring alignments, and • an algorithm for finding the alignment with the best score. • The alignment score is calculated using • a substitution matrix • gap penalties. • The algorithm for finding the best alignment is dynamic programming (DP).
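
As a minimal sketch of how a substitution matrix and a linear gap penalty combine into an alignment score, the function below sums one score per alignment column. The substitution values (match 10, transition 0, transversion -5) are an assumption, since the transcript does not reproduce the matrix; the gap penalty of -4 is the one stated on the slides. The names `substitution_score` and `alignment_score` are hypothetical.

```python
# ASSUMED scoring scheme (the matrix is not in the transcript):
# match = 10, transition (A<->G, C<->T) = 0, transversion = -5.
# Only the letters A, C, G, T are expected.
PURINES = {"A", "G"}


def substitution_score(a, b):
    """Hypothetical substitution matrix, written as a function."""
    if a == b:
        return 10
    # Both purines or both pyrimidines: a transition.
    if (a in PURINES) == (b in PURINES):
        return 0
    return -5


def alignment_score(top, bottom, gap=-4):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        # A column with one gap costs the linear gap penalty.
        total += gap if "-" in (a, b) else substitution_score(a, b)
    return total


print(alignment_score("GA-ATC", "CATA-C"))
```

Under these assumed values, a gapped alignment can outscore the ungapped one because two gap penalties cost less than the mismatches they avoid.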

  3. A simple alignment problem. • Problem: find the best pairwise alignment of GAATC and CATAC. • Use a linear gap penalty of -4. • Use the following substitution matrix: [matrix figure not reproduced in this transcript]

  4. How many possibilities? • How many different possible alignments of two sequences of length n exist? • Examples: GAATC / CATAC; GAAT-C / C-ATAC; -GAAT-C / C-A-TAC; GAATC- / CA-TAC; GAAT-C / CA-TAC; GA-ATC / CATA-C

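  5. How many possibilities? • How many different alignments of two sequences of length n exist? • Examples: GAATC / CATAC; GAAT-C / C-ATAC; -GAAT-C / C-A-TAC; GAATC- / CA-TAC; GAAT-C / CA-TAC; GA-ATC / CATA-C • Number of alignments by length: n = 5: 2.5 x 10^2; n = 10: 1.8 x 10^5; n = 20: 1.4 x 10^11; n = 30: 1.2 x 10^17; n = 40: 1.1 x 10^23. • FYI, for two sequences of length m and n, the number of possible alignments is: [formula not reproduced in this transcript]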
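
The counts quoted for n = 5 through 40 agree, to the two significant figures shown, with the binomial coefficient C(2n, n). The general formula from the slide is not in the transcript, so the generalization C(m + n, m) used below is an assumption rather than the author's stated formula. The function name `count_alignments` is hypothetical.

```python
from math import comb


def count_alignments(m, n):
    """ASSUMED counting formula: the binomial coefficient C(m + n, m)."""
    return comb(m + n, m)


# Compare against the lengths tabulated on the slide.
for n in (5, 10, 20, 30, 40):
    print(n, f"{count_alignments(n, n):.1e}")
```

Whatever the exact formula, the growth is fast enough that enumerating every alignment is impractical even for short sequences.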

  6. DP matrix (example: GA / CA). Columns are indexed by j = 0, 1, 2, 3, etc.; rows are indexed by i = 0, 1, 2, 3, 4, 5. The value in position (i,j) is the score of the best alignment of the first i positions of one sequence versus the first j positions of the other sequence. [Figure label: initialization row and column]

  7. DP matrix. Moving horizontally in the matrix introduces a gap in the sequence along the left edge. Example: GAA / CA-

  8. DP matrix. Moving vertically in the matrix introduces a gap in the sequence along the top edge. Example: GA- / CAT

  9. DP matrix. Moving diagonally in the matrix aligns two residues. Example: GAA / CAT
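
The three kinds of move can be sketched as a small path-to-alignment converter. It assumes GAATC runs along the top edge and CATAC along the left edge, which is inferred from the slide examples rather than stated. The move codes "D", "H" and "V" and the function name `path_to_alignment` are hypothetical.

```python
def path_to_alignment(top, left, moves):
    """Convert a path starting at the top-left corner into an alignment.

    "D" = diagonal (aligns two residues)
    "H" = horizontal (gap in the left-edge sequence)
    "V" = vertical (gap in the top-edge sequence)
    """
    i = j = 0  # i walks the left-edge sequence, j the top-edge sequence
    top_row, left_row = [], []
    for move in moves:
        if move == "D":
            top_row.append(top[j])
            left_row.append(left[i])
            i += 1
            j += 1
        elif move == "H":
            top_row.append(top[j])
            left_row.append("-")
            j += 1
        elif move == "V":
            top_row.append("-")
            left_row.append(left[i])
            i += 1
        else:
            raise ValueError(f"unknown move: {move!r}")
    return "".join(top_row), "".join(left_row)


print(path_to_alignment("GAATC", "CATAC", "DDH"))  # horizontal last step
print(path_to_alignment("GAATC", "CATAC", "DDV"))  # vertical last step
print(path_to_alignment("GAATC", "CATAC", "DDD"))  # diagonal last step
```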

  10. Initialization: start at the top left and move progressively.

  11. Introducing a gap: G / -

  12. Introducing a gap: - / C

  13. Complete the first row and column. Example: ----- / CATAC
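
A minimal sketch of the initialization step, assuming the linear gap penalty of -4 from the example: every cell in the first row or column corresponds to aligning a prefix against nothing but gaps, so its score is the prefix length times the gap penalty. The function name `init_matrix` is hypothetical, and interior cells are left as `None` until they are filled.

```python
def init_matrix(n_rows, n_cols, gap=-4):
    """Build an (n_rows + 1) x (n_cols + 1) matrix with edges filled in."""
    F = [[None] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(n_rows + 1):
        F[i][0] = i * gap  # i residues aligned against i gaps
    for j in range(n_cols + 1):
        F[0][j] = j * gap  # j residues aligned against j gaps
    return F


F = init_matrix(5, 5)
print(F[0])                    # first row
print([row[0] for row in F])   # first column
```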

  14. Three ways to get to i=1, j=1. First way: G- / -C

  15. Three ways to get to i=1, j=1. Second way: -G / C-

  16. Three ways to get to i=1, j=1. Third way: G / C

  17. Accept the highest scoring of the three. Then simply repeat the same rule progressively across the matrix.
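
Repeating the take-the-best-of-three rule over every cell might look like the sketch below. The substitution values (match 10, transition 0, transversion -5) are assumed, because the matrix itself is not in the transcript; the gap penalty is the stated -4. The names `subst` and `fill_matrix` are hypothetical, and rows are taken to follow the left-edge sequence.

```python
PURINES = {"A", "G"}


def subst(a, b):
    """ASSUMED substitution scores; expects only A, C, G, T."""
    if a == b:
        return 10
    return 0 if (a in PURINES) == (b in PURINES) else -5


def fill_matrix(top, left, gap=-4):
    """Fill the DP matrix; F[i][j] scores left[:i] against top[:j]."""
    n, m = len(left), len(top)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + subst(left[i - 1], top[j - 1]),  # diagonal
                F[i - 1][j] + gap,  # vertical: gap in the top sequence
                F[i][j - 1] + gap,  # horizontal: gap in the left sequence
            )
    return F


for row in fill_matrix("GAATC", "CATAC"):
    print(row)
```

Each cell depends only on its upper, left and upper-left neighbours, so filling row by row guarantees those three values already exist.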

  18. DP matrix

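  19. DP matrix. Three candidate alignments and their scores: -G / CA (-4); G- / CA (-9); --G / CA- (-12). Other values shown: 0, -4, -4.

  20. DP matrix. Three candidate alignments and their scores: -G / CA (-4); G- / CA (-9); --G / CA- (-12). Other values shown: 0, -4, -4.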

  21. DP matrix

  22. DP matrix

  23. DP matrix

  24. DP matrix. What is the alignment associated with this entry? (Just follow the arrows back; this is called the traceback.)

  25. DP matrix. Traceback alignment: -G-A / CATA

  26. DP matrix. Continue and find the optimal global alignment and its score.

  27. The best alignment starts at the lower right and follows the traceback arrows to the top left.
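
The traceback can be sketched without storing arrows: at each cell, re-check which of the three moves reproduces the stored score and step back along it. This is one possible implementation, not necessarily the one used in the course. The substitution values are assumed (match 10, transition 0, transversion -5), and the names `subst` and `global_align` are hypothetical. Ties are broken in the order diagonal, vertical, horizontal, so only one of the equally good alignments is returned.

```python
PURINES = {"A", "G"}


def subst(a, b):
    """ASSUMED substitution scores; expects only A, C, G, T."""
    if a == b:
        return 10
    return 0 if (a in PURINES) == (b in PURINES) else -5


def global_align(top, left, gap=-4):
    """Return (score, aligned_top, aligned_left) for one best alignment."""
    n, m = len(left), len(top)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + subst(left[i - 1], top[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the lower right to the top left.
    i, j = n, m
    top_row, left_row = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and F[i][j] == F[i - 1][j - 1] + subst(left[i - 1], top[j - 1])):
            top_row.append(top[j - 1])
            left_row.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top_row.append("-")          # vertical move: gap in top
            left_row.append(left[i - 1])
            i -= 1
        else:
            top_row.append(top[j - 1])   # horizontal move: gap in left
            left_row.append("-")
            j -= 1
    # Columns were collected back to front, so reverse them.
    return F[n][m], "".join(reversed(top_row)), "".join(reversed(left_row))


score, a, b = global_align("GAATC", "CATAC")
print(score)
print(a)
print(b)
```

Changing the tie-breaking order can yield a different alignment with the same score.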

  28. One best traceback: GA-ATC / CATA-C

  29. Another best traceback: GAAT-C / -CATAC

  30. GAAT-C / -CATAC and GA-ATC / CATA-C

  31. Multiple solutions • When a program returns a single sequence alignment, it may not be the only best alignment, but it is guaranteed to be one of them. • In our example, all of the following alignments have equal scores: GA-ATC / CATA-C; GAAT-C / CA-TAC; GAAT-C / C-ATAC; GAAT-C / -CATAC

  32. DP in equation form • Align sequences x and y. • F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

  33. DP equation graphically: take the max of these three.
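
The equation itself is not in the transcript. A standard way to write the rule, using the symbols F, s and d defined on slide 32 and assuming d is a negative number (d = -4 in the example), would be:

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d,\\[4pt]
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i,\,y_j) & \text{(diagonal: align } x_i \text{ with } y_j\text{)}\\
F(i-1,\,j) + d & \text{(gap aligned to } x_i\text{)}\\
F(i,\,j-1) + d & \text{(gap aligned to } y_j\text{)}
\end{cases}
\end{aligned}
```

Here x is assumed to index the rows (i) and y the columns (j). Some textbooks instead define d as a positive cost and subtract it.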

  34. Dynamic programming • Yes, it’s a weird name. • DP is closely related to recursion and to mathematical induction. • We can prove that the resulting score is optimal.
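
The link to recursion can be illustrated by writing the same rule as a memoized recursive function: each score is defined in terms of three smaller subproblems, and caching the results does the job the matrix does in the iterative version. This is a sketch with assumed substitution values (match 10, transition 0, transversion -5); the names `subst` and `best_score` are hypothetical.

```python
from functools import lru_cache

PURINES = {"A", "G"}


def subst(a, b):
    """ASSUMED substitution scores; expects only A, C, G, T."""
    if a == b:
        return 10
    return 0 if (a in PURINES) == (b in PURINES) else -5


def best_score(x, y, gap=-4):
    """Best global alignment score of x and y, computed top-down."""

    @lru_cache(maxsize=None)
    def F(i, j):
        # Base cases: a prefix aligned against gaps only.
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        return max(F(i - 1, j - 1) + subst(x[i - 1], y[j - 1]),
                   F(i - 1, j) + gap,
                   F(i, j - 1) + gap)

    return F(len(x), len(y))


print(best_score("GAATC", "CATAC"))
```

Without the cache the recursion would recompute the same subproblems many times. For long sequences the iterative matrix fill is the safer choice, because deep recursion can exceed Python's recursion limit.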

  35. Summary • Scoring a pairwise alignment requires a substitution matrix and gap penalties. • Dynamic programming is an efficient algorithm for finding an optimal alignment. • Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to that position. • DP iteratively fills in the matrix using a simple mathematical rule.

  36. Practice problem: find a best pairwise alignment of GAATC and AATTC, using d = -4.
