
Algorithms in Bioinformatics



Presentation Transcript


  1. Algorithms in Bioinformatics. Lawrence D’Antonio, Ramapo College of New Jersey. Bioinformatics Workshop, Fall 2003

  2. Topics • Algorithm basics • Types of algorithms in bioinformatics • Sequence alignment • Database Searches

  3. Algorithm basics • What is an algorithm? • Algorithm complexity • P vs. NP • NP completeness

  4. What is an algorithm? • An algorithm is a step-by-step procedure to solve a problem • The word “algorithm” comes from the name of the 9th-century Islamic mathematician al-Khwarizmi

  5. Algorithm Complexity • If the algorithm works with n pieces of data and the number of steps grows at most proportionally to n, then we say that the running time is O(n). • If the number of steps grows at most proportionally to log n, then the running time is O(log n).

  6. Example • Problem: find the largest element in a sequence of n elements. • Solution idea: iteratively compare the elements of the sequence, keeping track of the largest seen so far.

  7. Algorithm: • Initialize the first element as the largest. • For each remaining element: if the current element is larger than the largest so far, make it the new largest. • Running time: O(n)
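The steps on this slide can be sketched in Python as follows. This is a minimal illustration; the function name `largest` is chosen here and does not come from the slides.

```python
def largest(items):
    """Return the largest element of a non-empty sequence.

    Each element is examined exactly once, so the running time is O(n).
    """
    best = items[0]          # initialize the first element as the largest
    for x in items[1:]:      # for each remaining element
        if x > best:         # larger than the largest so far?
            best = x         # then it becomes the new largest
    return best


print(largest([3, 9, 2, 7]))  # 9
```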

  8. Polynomial Time • An algorithm is said to run in polynomial time if its running time can be written in the form O(n^k) for some constant power k. • The underlying problem is said to be of class P.

  9. Polynomial Time Examples • Searching: binary search on a sorted sequence, O(log n) • Sorting: quicksort, O(n log n) on average (O(n^2) in the worst case)
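As an illustration of the O(log n) bound, here is a minimal binary search sketch. It assumes the input is already sorted in ascending order; the name `binary_search` and the convention of returning -1 for a missing target are choices made for this example.

```python
def binary_search(sorted_items, target):
    """Return an index of target in sorted_items, or -1 if it is absent.

    The search interval is halved on every iteration, giving O(log n) steps.
    """
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1     # target can only be in the upper half
        else:
            hi = mid - 1     # target can only be in the lower half
    return -1


print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3
```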

  10. NP Algorithms • An algorithm is nondeterministic if it begins by guessing a solution to the problem and then verifies the guess. • A problem is in the class NP if there is a nondeterministic algorithm for that problem which runs in polynomial time.

  11. NP Complete • A problem is NP-complete if it is in NP and every other problem in NP can be reduced to it in polynomial time, so that an efficient solution to it could be used to solve all other NP problems. • A problem is NP-hard if it is at least as hard as the NP-complete problems

  12. NP Complete Examples • Traveling salesman • Knapsack problem • Partition problem • Graph coloring

  13. P = NP ? • P ⊆ NP • If P ≠ NP, then no NP-complete problem can be solved in polynomial time.

  14. Polynomial vs. Exponential

  15. Algorithms in Bioinformatics • Algorithms to compare DNA, RNA, or protein sequences • Database searches to find homologous sequences • Sequence assembly • Construction of evolutionary trees • Structure prediction

  16. Edit operations on sequences
      Substitution    Insertion    Deletion
      AATAAGC         AAT-AAGC     AATAAGC
      ATTAAGC         AATTAAGC     AA-AAGC

  17. What is sequence alignment? • Compare two sequences using matches, substitutions and indels (insertions or deletions).
      G A A - - T C A T
      G - T G G - C A -
      • 3 matches, 1 substitution, 5 indels

  18. Complexity of DNA Problems • 3 billion base pairs in the human genome • Many of the problems are NP-complete • About 10^600 possible alignments for two 1000-character sequences
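The slide does not say how its alignment count was obtained. One counting convention that gives a figure of this size (an assumption made here, not something the slide states) is the binomial coefficient C(2n, n), the number of ways to interleave two sequences of length n. The sketch below only checks the order of magnitude.

```python
import math


def interleavings(n):
    """C(2n, n): assumed counting convention for alignments of two length-n sequences."""
    return math.comb(2 * n, n)


count = interleavings(1000)
# Number of decimal digits; a d-digit number lies between 10^(d-1) and 10^d.
print(len(str(count)))
```

Whatever the exact convention, the count grows exponentially with n, which is why the possible alignments cannot simply be enumerated.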

  19. Types of sequence alignment • Determine the alignment of two sequences that maximizes similarity (global alignment) • Determine substrings of two sequences with maximum similarity (local alignment) • Determine the alignment for several sequences that maximizes the sum of pairs similarity (multiple alignment)

  20. Significance of Alignment • Functional similarity • Structural similarity • Homology

  21. Scoring System • Assign a score for each possible match, substitution and indel • Distance functions: find the alignment that minimizes the distance between the sequences • Similarity functions: find the alignment that maximizes the similarity between the sequences

  22. Edit Distance
      G A A - - T C A T
      G - T G G - C A -
      • Similarity function: 1 for match, -1 for substitution, -2 for indel • Score: 3(1) + 1(-1) + 5(-2) = -8
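The score on this slide can be recomputed column by column. The sketch below assumes the two aligned rows are given as equal-length strings with "-" marking a space; the function name `alignment_score` is chosen for this example.

```python
def alignment_score(row_a, row_b, match=1, substitution=-1, indel=-2):
    """Score an existing alignment column by column with the slide's similarity function."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += indel          # a character aligned with a space
        elif x == y:
            score += match
        else:
            score += substitution
    return score


# 3 matches, 1 substitution, 5 indels: 3 - 1 - 10
print(alignment_score("GAA--TCAT", "G-TGG-CA-"))  # -8
```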

  23. Dynamic Programming • Used on optimization problems • Bottom-up approach • Builds up the solution from optimal solutions to subproblems

  24. Dynamic Programming Alignment Algorithm (Needleman-Wunsch) • Given sequences a_1, a_2, …, a_n and b_1, b_2, …, b_m to be aligned: • Initialize the alignment matrix (aligning with spaces) • Entry [i,j] gives the optimal alignment score for sequences a_1, a_2, …, a_i and b_1, b_2, …, b_j (where 1 ≤ i ≤ n, 1 ≤ j ≤ m)

  25. Computing Alignment Matrix • If a_1, a_2, …, a_i and b_1, b_2, …, b_j have been aligned, there are three possible next moves: • Match a_(i+1) with b_(j+1) • Match a_(i+1) with a space • Match b_(j+1) with a space • Choose the move that maximizes the similarity of the two sequences

  26. Global Alignment Matrix

  27. Optimal Global Alignment

  28. Alignment Running Time • Assuming two sequences of n characters each • Running time is O(n^2) (each entry of the matrix must be calculated)
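A compact sketch of the dynamic programming procedure described on the preceding slides is given below. It assumes the simple scoring used earlier (1 for a match, -1 for a substitution, -2 for an indel); the function name and the way ties are broken in the traceback are choices made for this example, so it may return a different optimal alignment than the one shown in the original figures.

```python
def needleman_wunsch(a, b, match=1, substitution=-1, indel=-2):
    """Global alignment: return (optimal score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # score[i][j] = optimal score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel            # a[:i] aligned with spaces only
    for j in range(1, m + 1):
        score[0][j] = j * indel            # b[:j] aligned with spaces only

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else substitution

    # Fill every entry once: (n + 1) * (m + 1) entries, i.e. O(n^2) when m = n.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # match a_i with b_j
                score[i - 1][j] + indel,           # match a_i with a space
                score[i][j - 1] + indel,           # match b_j with a space
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GAATCAT", "GTGGCA")
print(best)
print(top)
print(bottom)
```

The two nested loops visit each matrix entry once and do constant work per entry, which is where the O(n^2) running time comes from.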

  29. Variations of Alignment Algorithm • Gap penalty • Local alignment • Multiple alignment

  30. Gap Penalty • A gap is a number k of consecutive spaces • k consecutive spaces are more probable than k isolated spaces • Typical gap penalty function: a + b·k (affine gap penalty) • Here the first space in a gap is penalized a+b, further spaces are penalized b each.

  31. Gap Penalty Example • Use gap penalty 1 + k for a gap of k spaces
      A - A - C - A
      A C T A T C A
      • Score: -6 (three gaps of length 1, each penalized 1 + 1 = 2)
      A A C - - - A
      A C T A T C A
      • Score: -4 (one gap of length 3, penalized 1 + 3 = 4)
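The two scores in this example can be reproduced by summing the affine penalty a + b·k over each run of spaces. The sketch below counts gap penalties only, which is what the slide's scores appear to reflect; the function name `gap_penalty` is chosen for this example.

```python
import re


def gap_penalty(row, a=1, b=1):
    """Total affine gap penalty for one aligned row: each run of k spaces costs a + b*k."""
    return sum(a + b * len(run) for run in re.findall(r"-+", row))


print(-gap_penalty("A-A-C-A"))  # -6: three separate gaps of length 1
print(-gap_penalty("AAC---A"))  # -4: a single gap of length 3
```

Both rows contain three spaces, but the single long gap is penalized less, matching the idea that consecutive spaces are more probable than isolated ones.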

  32. Local Alignment • Find conserved regions in otherwise dissimilar sequences (e.g., viral and host DNA) • Smith-Waterman algorithm • Includes a fourth possibility at each step (don’t align)

  33. Local Alignment Example • Align the following
      G C T C T G C G A A T A
      C G T T G A G A T A C T

  34. Optimal Local Alignment
      G C T C T G C G A A T A
      C G T T G A G A T A C T
      Aligned region (unaligned flanks in parentheses):
      (G C T C) T G C G A A T A
        (C G T) T G A G - A T A (C T)
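A minimal Smith-Waterman sketch follows. The slides do not state which scores produced the alignment above, so the values used here (1 for a match, -1 for a substitution, -2 for an indel) are assumptions, and the region this code reports for the slide's sequences may differ from the one shown. The "don't align" option appears as the 0 inside `max`.

```python
def smith_waterman(a, b, match=1, substitution=-1, indel=-2):
    """Local alignment: return (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else substitution

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                 # fourth option: don't align
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + indel,
                score[i][j - 1] + indel,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero entry is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("GCTCTGCGAATA", "CGTTGAGATACT"))
```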

  35. Multiple Alignment • Find the alignment among a set of sequences that maximizes the sum of scores for all pairs of sequences • Dynamic programming run-time for k sequences of length n: O(k^2 2^k n^k) • Multiple alignment is NP-complete

  36. Other Features • Usually used for protein alignment • Can be used for global or local alignment

  37. Multiple Alignment Example

  38. Multiple vs. Pairwise Alignment • Optimal multiple alignment does not imply optimal pairwise alignment
      Multiple alignment:    Induced pairwise alignment:
      A T
      A -                    A -
      - T                    - T
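The point of this slide can be checked numerically with a sum-of-pairs score. The slide does not give its scoring scheme, so the values below (1 for a match, -1 for a substitution, -1 for an indel) are assumptions chosen for illustration, as are the function names.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, substitution=-1, indel=-1):
    """Score two aligned rows; a column with a space in both rows contributes nothing."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            score += indel
        elif x == y:
            score += match
        else:
            score += substitution
    return score


def sum_of_pairs(rows):
    """Sum of the pairwise scores induced by a multiple alignment."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))


print(sum_of_pairs(["AT", "A-", "-T"]))  # -2 for the whole multiple alignment
print(pair_score("A-", "-T"))            # -2 for the induced alignment of A and T
print(pair_score("A", "T"))              # -1 when A and T are aligned directly
```

Under these assumed scores, the pairwise alignment of A and T induced by the multiple alignment (-2) is worse than aligning the two sequences directly (-1).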

  39. Substitution Matrices • In homologous sequences certain amino acid substitutions are more likely to occur than others • Types of substitution matrices • PAM • BLOSUM

  40. PAM Matrices • Defines units of evolutionary distance • 1 PAM unit represents an average of one mutation per 100 amino acids • Start with a set of highly similar sequences and compute • p_a = probability of occurrence of amino acid a • M_ab = probability of a mutating to b

  41. PAM Matrix Formula • Entries in a k-PAM matrix
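The formula from this slide is not present in the transcript. For orientation only, the log-odds form commonly given for PAM scores (Dayhoff's convention, with the scaling factor 10 and base-10 logarithm taken as assumptions rather than read from the slide) can be written with the symbols p_a and M_ab defined on the previous slide:

```latex
% Assumed standard form; not recovered from the slide itself.
% M^k is the k-th power of the 1-PAM mutation matrix M,
% so (M^k)_{ab} is the probability that a has become b after k PAM units.
\mathrm{PAM}_k(a,b) \;=\; 10 \,\log_{10} \frac{(M^k)_{ab}}{p_b}
```

Entries are usually rounded to the nearest integer; a positive entry means the substitution is seen more often than chance would predict.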

  42. PAM250 Matrix

  43. BLOSUM Matrices (Omit) • Uses log-odds ratio similar to PAM • Uses short highly conserved sequences • BLOSUM x matrices are created after removing sequences that are more than x percent identical • Better at local alignments

  44. BLOSUM Matrices • A motif is a conserved amino acid pattern found in a group of proteins with similar biological meaning (PROSITE) • A block is a conserved amino acid pattern in a group of proteins (no spaces allowed in the pattern) (BLOCKS)

  45. Motif Example • Motif obtained from a group of 34 tubulin proteins
      M[FYW] . . F[VLI]H . [FYW] . . EGM
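The motif notation maps directly onto a regular expression: a bracket lists the amino acids allowed at a position and a dot stands for any amino acid. The sketch below uses Python's `re` module; the test sequences are made up for illustration and are not real tubulin sequences.

```python
import re

# The slide's motif with the spacing removed.
TUBULIN_MOTIF = re.compile(r"M[FYW]..F[VLI]H.[FYW]..EGM")


def find_motif(protein):
    """Return the start index of the first motif occurrence, or -1 if there is none."""
    hit = TUBULIN_MOTIF.search(protein)
    return hit.start() if hit else -1


print(find_motif("AAMFAAFVHAFAAEGMKK"))  # 2  (hypothetical sequence containing the motif)
print(find_motif("MAAAFVHAFAAEGM"))      # -1 (second residue is not F, Y or W)
```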

  46. Defining BLOSUM (I) • BLOSUMn uses blocks that are n% identical (BLOSUM62 is most common) • Consider all pairs of amino acids appearing in the same column in the blocks

  47. Defining BLOSUM (II) • Define n(i,j) to be the frequency that amino acids i,j appear in a column pair • Define e(i,j) to be the frequency that amino acids i,j appear in any pair • Define BLOSUM entry
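The entry formula itself is not present in the transcript. With n(i,j) and e(i,j) as defined on this slide, the usual log-odds form is shown below; the factor 2 and the base-2 logarithm (half-bit units, as commonly used for BLOSUM62) are assumptions rather than values read from the slide.

```latex
% Assumed standard form; not recovered from the slide itself.
% n(i,j): observed frequency of the pair (i,j) in block columns
% e(i,j): expected frequency of the pair (i,j)
\mathrm{BLOSUM}(i,j) \;=\; \operatorname{round}\!\left( 2 \,\log_{2} \frac{n(i,j)}{e(i,j)} \right)
```

A pair observed more often than expected gets a positive entry, and a pair observed less often gets a negative one.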

  48. PAM vs. BLOSUM • PAM derived from highly similar sequences (evolutionary model) • BLOSUM derived from protein families sharing a common ancestor (conserved domain model)

  49. Database Searches • FASTA • BLAST

  50. FASTA • Looks for sequences in a database similar to a query sequence • Heuristic, exclusion method • Compares query sequence to each database sequence (called the text)
