
Bioinformatics (4)


hisa

Presentation Transcript


  1. Bioinformatics (4): Sequence Analysis

  2. DNA2: the last 5000 generations. NA1: Common & simple. Sequence Similarity and Homology. [Figure]

  3. Alignments & Scores
     Local (motif): ACCACACA / ACACCATA, four matched columns (::::). Score = 4(+1) = 4
     Global (e.g. haplotype): ACCACACA / ACACCATA, column marks ::xx::x:. Score = 5(+1) + 3(-1) = 2
     Suffix (shotgun assembly): ACCACACA / ACACCATA, three matched columns (:::). Score = 3(+1) = 3
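
The global score on this slide can be reproduced in a few lines of Python. This is a minimal sketch: the function names `column_marks` and `ungapped_score` are illustrative, and only the +1/-1 weights come from the slide.

```python
def column_marks(a, b):
    """Mark each ungapped column ':' for a match and 'x' for a mismatch."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal lengths")
    return "".join(":" if x == y else "x" for x, y in zip(a, b))


def ungapped_score(a, b, match=1, mismatch=-1):
    """Sum +1 per matching column and -1 per mismatching column."""
    marks = column_marks(a, b)
    return marks.count(":") * match + marks.count("x") * mismatch


print(column_marks("ACCACACA", "ACACCATA"))    # ::xx::x:
print(ungapped_score("ACCACACA", "ACACCATA"))  # 2
```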

  4. "Hardness" of (multi-) sequence alignment. Align 2 sequences of length N allowing gaps, e.g. ACCAC-ACA / AC-ACCATA (column marks ::x::x:x:), or ACCACACA / A-----CACCATA (column marks :xxxxxx:), etc. There are 2N gap positions, with gap lengths of 0 to N each: a naïve algorithm might scale as O(N^(2N)). For N = 3 x 10^9 this is rather large. Now, what about k > 2 sequences, or rearrangements other than gaps?
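
To get a feel for how fast the search space grows, the sketch below counts gapped alignments exactly for small N. It uses a different count from the slide's O(N^(2N)) estimate: it assumes every alignment is built from three column types (residue/residue, residue/gap, gap/residue) and treats alignments that differ only in the order of adjacent gap columns as distinct. The function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped alignments of sequences of lengths n and m.

    Every alignment ends in one of three column types: residue/residue,
    residue/gap, or gap/residue.
    """
    if n == 0 or m == 0:
        return 1  # only one way left: pad the remaining residues with gaps
    return (count_alignments(n - 1, m - 1)
            + count_alignments(n - 1, m)
            + count_alignments(n, m - 1))


for n in (1, 2, 3, 10, 20):
    print(n, count_alignments(n, n))
```

Even under this simpler count the number of candidates grows exponentially with N, which is why enumerating alignments is not an option for genome-sized inputs.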

  5. Increasingly complex (accurate) searches
     Exact (StringSearch): CGCG
     Regular expression (PrositeSearch): CGN{0-9}CG = CGAACG
     Substitution matrix (BlastN): CGCG ~= CACG
     Profile matrix (PSI-blast): CGc(g/a) ~= CACG
     Gaps (Gap-Blast): CGCG ~= CGAACG
     Dynamic Programming (NW, SM): CGCG ~= CAGACG
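
The first two search levels can be tried directly with Python's standard library. This is a sketch only: reading the slide's `N{0-9}` as "0 to 9 arbitrary bases" (`[ACGT]{0,9}`) is an assumption, and the text `TTCGCGTT` is a made-up example.

```python
import re

# Exact string search: the pattern must occur literally.
print("CGCG" in "TTCGCGTT")  # True

# Regular expression: CG, then 0 to 9 arbitrary bases, then CG.
# Reading the slide's N{0-9} as [ACGT]{0,9} is an assumption.
pattern = re.compile(r"CG[ACGT]{0,9}CG")
print(bool(pattern.fullmatch("CGAACG")))  # True
print(bool(pattern.fullmatch("CACG")))    # False: this needs a substitution
```

The last line shows why the later levels exist: a regular expression cannot tolerate the C-to-A substitution that a substitution matrix scores.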

  6. Comparisons of homology scores
     Pearson WR. Protein Sci 1995 Jun;4(6):1145-60. Comparison of methods for searching protein sequence databases.
     Methods Enzymol 1996;266:227-58. Effective protein sequence comparison.
     Algorithm: FASTA, Blastp, Blitz
     Substitution matrix: PAM120, PAM250, BLOSUM50, BLOSUM62
     Database: PIR, SWISS-PROT, GenPept

  7. Scoring matrix based on large set of distantly related blocks: Blosum62

  8. Scoring Functions and Alignments
     • Scoring function:
       • (match) = +1, or substitution matrix;
       • (mismatch) = -1, or substitution matrix;
       • (indel) = -2;
       • (other) = 0.
     • Alignment score: sum of columns.
     • Optimal alignment: maximum score.
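
The scoring function on this slide can be sketched as a column-by-column sum. The function name `alignment_score` is illustrative, and treating a gap/gap column as the slide's "(other) = 0" case is an assumption.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of two equal-length gapped alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            total += 0          # assumed reading of the slide's "(other) = 0"
        elif x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("ACCAC-ACA", "AC-ACCATA"))  # 1
print(alignment_score("AIM-S", "A-MOS"))          # -1
```

The first example has 6 matches, 1 mismatch, and 2 indels, so 6 - 1 - 4 = 1, which is lower than the ungapped global score of 2 for the same pair.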

  9. Calculating Alignment Scores

  10. What is dynamic programming? "A dynamic programming algorithm solves every subsubproblem just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered." -- Cormen et al., "Introduction to Algorithms", The MIT Press.
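
The "solve once, save in a table" idea can be illustrated with a memoized recursion for the optimal global alignment score. This is a sketch under the +1/-1/-2 scoring of slide 8; the function name `best_global_score` is illustrative, and `functools.lru_cache` plays the role of the table.

```python
from functools import lru_cache


def best_global_score(a, b, match=1, mismatch=-1, indel=-2):
    """Optimal global alignment score via memoized recursion."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j): optimal score for the prefixes a[:i] and b[:j]
        if i == 0:
            return j * indel
        if j == 0:
            return i * indel
        w = match if a[i - 1] == b[j - 1] else mismatch
        return max(best(i - 1, j - 1) + w,
                   best(i - 1, j) + indel,
                   best(i, j - 1) + indel)

    score = best(len(a), len(b))
    # cache misses = number of distinct subproblems actually solved
    return score, best.cache_info().misses


score, solved = best_global_score("ACCACACA", "ACACCATA")
print(score, solved)  # solved is at most (8 + 1) * (8 + 1) = 81
```

Without the cache the same recursion would revisit each prefix pair many times; with it, the work is bounded by the number of table cells.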

  11. Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b). [Figure: (a) Path Matrix for the sequences AIMS and AMOS; (b) Search Tree, with pruning by an optimization function. Alignment: AIM-S / A-MOS]

  12. Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length. [Figure: both panels show cell D(i,j) and its neighbors D(i-1,j-1), D(i-1,j), and D(i,j-1). Edge labels: w(s(i),t(j)), d, and b. Panel (b) distinguishes three values D(i,j)(1), D(i,j)(2), and D(i,j)(3).]

  13. Recursion of Optimal Global Alignments
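
The slide's formula did not survive in this transcript. As an assumed reconstruction in the notation of slide 12 (D for the path matrix, w for the substitution weight, d for a constant gap penalty taken here as a negative number such as the -2 of slide 8), the standard Needleman-Wunsch recursion reads:

```latex
\begin{aligned}
D_{0,0} &= 0, \qquad D_{i,0} = i\,d, \qquad D_{0,j} = j\,d,\\
D_{i,j} &= \max
\begin{cases}
D_{i-1,j-1} + w_{s(i),t(j)} \\
D_{i-1,j} + d \\
D_{i,j-1} + d
\end{cases}
\end{aligned}
```

The optimal global score is then the bottom-right entry, D for i = n and j = m.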

  14. Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the initial values are assigned to the path matrix. [Figure: three path matrices showing where initial 0 values are placed: (a) Global vs. Global, (b) Local vs. Global, (c) Local vs. Local]

  15. Recursion of Optimal Local Alignments
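
Only the slide title survives here. A minimal sketch of the usual Smith-Waterman recursion, assuming the +1/-1/-2 scoring of slide 8, differs from the global case in two ways: the borders start at 0, and every cell may restart at 0. The function name `local_alignment_score` is illustrative.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, indel=-2):
    """Best local alignment score (Smith-Waterman style table fill)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                0,                        # restart: drop a negative prefix
                table[i - 1][j - 1] + w,  # align a[i-1] with b[j-1]
                table[i - 1][j] + indel,  # gap in b
                table[i][j - 1] + indel,  # gap in a
            )
            best = max(best, table[i][j])
    return best


print(local_alignment_score("ACCACACA", "ACACCATA"))  # 4, as on slide 3
```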

  16. The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the diagonals that contain candidate markers. [Figure: path matrix with row index i = 1..n, column index j = 1..m, and diagonal index l = 1..n+m-1]
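
One way to picture the diagonal search is to count exact short-word matches per diagonal and keep the best ones. This is a sketch under assumptions: "candidate markers" is read as exact words of length `word` (in the spirit of FASTA-style k-tuples), the diagonal is labeled by j - i rather than the figure's index l, and the function name `best_diagonals` is illustrative.

```python
from collections import Counter


def best_diagonals(a, b, word=2, top=1):
    """Rank diagonals j - i by the number of exact word matches on them."""
    positions = {}
    for i in range(len(a) - word + 1):
        positions.setdefault(a[i:i + word], []).append(i)
    hits = Counter()
    for j in range(len(b) - word + 1):
        for i in positions.get(b[j:j + word], []):
            hits[j - i] += 1  # cells on one diagonal share the same j - i
    return hits.most_common(top)


print(best_diagonals("ACCACACA", "ACACCATA"))
```

Dynamic programming would then be run only in a narrow band around the reported diagonal instead of over all n x m cells.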

  17. Computing Row-by-Row
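
Only the slide title survives here. A minimal sketch of row-by-row computation, assuming the +1/-1/-2 scoring of slide 8: each row depends only on the row above it, so two rows are enough when only the score is needed. The function name `global_score_two_rows` is illustrative.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, indel=-2):
    """Optimal global alignment score using only two rows of the table."""
    previous = [j * indel for j in range(len(b) + 1)]  # row 0: all-gap prefixes
    for i in range(1, len(a) + 1):
        current = [i * indel] + [0] * len(b)           # column 0 of row i
        for j in range(1, len(b) + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(previous[j - 1] + w,      # diagonal
                             previous[j] + indel,      # from the row above
                             current[j - 1] + indel)   # from the left
        previous = current
    return previous[-1]


print(global_score_two_rows("ACCACACA", "ACACCATA"))  # 2
```

The trade-off is that the discarded rows are no longer available for a traceback, so this form returns the score but not the alignment itself.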

  18. Traceback Optimal Global Alignment
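
Only the slide title survives here. A minimal sketch of a traceback, assuming the +1/-1/-2 scoring of slide 8: fill the full table, then walk back from the bottom-right corner, at each cell choosing a predecessor that explains the stored value. The function name `global_alignment` and the tie-breaking order (diagonal first) are illustrative choices.

```python
def global_alignment(a, b, match=1, mismatch=-1, indel=-2):
    """Fill the full table, then trace back one optimal global alignment."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel
    for j in range(1, m + 1):
        D[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + w,
                          D[i - 1][j] + indel,
                          D[i][j - 1] + indel)

    # Walk back from the bottom-right corner, re-deriving each step.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        w = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + w:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + indel:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(global_alignment("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```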

  19. Local and Global Alignments

  20. Time and Space Complexity of Computing Alignments

  21. The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing. [Figure: (a) cells (i-1, j-1), (i, j-1), (i+1, j-1), (i-1, j), (i, j), (i+1, j); (b) cells (i, j-2), (i+1, j-2), (i-1, j-1), (i, j-1), (i+1, j-1), (i-1, j), (i, j)]
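
A common parallel ordering, assumed here to be what panel (b) depicts, processes the matrix one anti-diagonal at a time. The sketch below only generates that order; the function name `wavefronts` is illustrative.

```python
def wavefronts(n, m):
    """Group cells (i, j), 1 <= i <= n, 1 <= j <= m, by anti-diagonal i + j.

    Cell (i, j) needs (i-1, j-1), (i-1, j), and (i, j-1), which all lie on
    earlier anti-diagonals, so cells within one group are independent.
    """
    for s in range(2, n + m + 1):
        yield [(i, s - i) for i in range(max(1, s - m), min(n, s - 1) + 1)]


for group in wavefronts(3, 3):
    print(group)
```

For a 3 x 3 matrix this yields 5 groups (n + m - 1), the widest being (1, 3), (2, 2), (3, 1); all cells in a group could be computed at the same time.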

  22. Time and Space Problems
      • Comparing two one-megabase genomes.
      • Space:
        • An entry: 4 bytes;
        • Table: 4 * 10^6 * 10^6 = 4 * 10^12 bytes = 4 T bytes memory.
      • Time:
        • 1000 MHz CPU: 1M entries/second;
        • 10^12 entries: 1M seconds, about 11.6 days.
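
A quick check of the arithmetic on this slide: 4 bytes times 10^12 entries comes to 4 x 10^12 bytes (terabytes rather than gigabytes), and 10^6 seconds is closer to 11.6 days. The variable names are illustrative; the throughput of 1M entries per second is the slide's own figure.

```python
n = m = 10**6                 # two one-megabase genomes
bytes_per_entry = 4
entries = n * m               # 10^12 table cells
table_bytes = entries * bytes_per_entry
entries_per_second = 10**6    # the slide's throughput figure
seconds = entries / entries_per_second
days = seconds / 86_400       # seconds per day

print(f"{table_bytes / 1e12:.0f} TB, {seconds:.0e} s, {days:.1f} days")
```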

  23. A Multiple Alignment of Immunoglobulins

  24. A multiple alignment <=> Dynamic programming on a hyperlattice. From G. Fullen, 1996.

  25. Computing a Node on Hyperlattice [Figure labels: A, S, V]
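
For k sequences, each lattice node is computed from up to 2^k - 1 neighboring nodes, one for each non-empty choice of which sequences contribute a residue to the last column. The sketch below enumerates them; the function name `predecessors` is illustrative, and three sequences are assumed for the example because the figure shows three labels.

```python
from itertools import product


def predecessors(node):
    """Cells that feed into `node` on a k-dimensional alignment lattice.

    Each move steps back by 0 or 1 along every axis (0 = gap in that
    sequence); the all-zero move is excluded, leaving up to 2**k - 1 cells.
    """
    cells = []
    for step in product((0, 1), repeat=len(node)):
        if any(step):
            cell = tuple(c - s for c, s in zip(node, step))
            if min(cell) >= 0:          # stay inside the lattice
                cells.append(cell)
    return cells


print(len(predecessors((1, 1, 1))))  # 7 for three sequences
```

The per-node cost therefore grows as 2^k on top of the N^k nodes, which is why exact dynamic programming is practical only for a handful of sequences.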
