
MUMmer: fast alignment of large-scale DNA and protein sequences






Presentation Transcript


  1. MUMmer: fast alignment of large-scale DNA and protein sequences Presented by: Arthur Loder Course: CSE 497 Computational Issues in Molecular Biology Date: February 17, 2004

  2. MUMmer’s Significance • MUMmer is a system for rapidly aligning entire genomes or very large protein sequences • Input=2 strings; Output=base-to-base alignment • What distinguishes MUMmer from previous algorithms? • Can align very large sequences (millions of nucleotides long)

  3. Complete Genome Alignment • What it means to align complete genomes: • Human chromosome #1: 200 million base pairs • Previous alignment algorithms are space-efficient, using O(n) space • But their O(n²) time complexity is unacceptably slow at this scale • We would like to align using a global dynamic programming algorithm (Needleman/Wunsch), but it is infeasible; we need a shortcut

  4. MUMmer vs. BLAST • Instead, use technique somewhat similar to BLAST (find similar regions between strings) • BLAST for comparing 1 known string to a large set of unknown strings • MUMmer for aligning 2 very similar known strings

  5. You know you’ve made it when…

  6. Overview: MUMs Part 1 • String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… • String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… • How align without Needleman/Wunsch ?

  7. Overview: MUMs Part 2 • String A: …CTTCAGCTAGCTAGTCCCTAGCTAATCAAGAGACTAGGATCAAGGCTAGSCTGAAGTGCACCAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAAGACTGCGGATGCGAGTCGAGGCTTTAGAGCTAGCTAGCGCGATCGAGGCTAGCTATGCTAGCTATCATCGCAAGCTAGCTGAGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCCATCTACTCTAGTAGATCGATTAGTCGATCGATGCTAGATCGGATCGAGTCGAGATCGATGGAGTCGAGATCGATCTAATCTATCTCTAAATGGAGCGA… • String B: …GCATCGTAGGCTGAGGCTTCGAGGCTAGTCGATGCTAGGTTGCAATCCATCTATGAGCGATCGAATCGGATCGAGTCGAGCTAGCTAAGCTAGCTAGGGAGTCCAAACTCGCAAAGCTAGTGATCGATCGATATCGATTCGATCGGTGTCGCGATCGGGCGTAGCGATGTCTCTAGTCTCTAGTCGAGCTGATCGTAGCTAGTAATGTATCATAGCTAATCGCACTACTACGATGCGATCTCTAGTCGATCTATCTCGGCTTCGATCGTA… • Easier with large exact matches highlighted?

  8. Overview: MUMs Part 3 • Any optimal global alignment will probably use these two subsequences as “anchors” • This is the shortcut needed to calculate a global alignment quickly on large sequences • Intuitive in the alignment process, but still only a heuristic

  9. Overview: MUMs Part 4 • MUM: Maximal Unique Match • MUMs occur exactly once in each sequence • Ignores repeat sequences • MUMs found efficiently using suffix tree data structure (to be explained later)
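As an illustration of the MUM definition only (this is not MUMmer's method — MUMmer finds MUMs in linear time with a suffix tree, covered below), a brute-force sketch; the function name and `min_len` parameter are mine:

```python
def find_mums(a, b, min_len=4):
    """Brute-force MUMs: substrings that occur exactly once in each
    sequence and cannot be extended left or right. Far too slow for
    genomes; MUMmer's suffix tree does this in linear time."""
    mums = []
    for length in range(min_len, min(len(a), len(b)) + 1):
        for i in range(len(a) - length + 1):
            sub = a[i:i + length]
            if a.count(sub) == 1 and b.count(sub) == 1:  # unique in both
                j = b.index(sub)
                # maximal: the match must not extend on either side
                left = i > 0 and j > 0 and a[i - 1] == b[j - 1]
                right = (i + length < len(a) and j + length < len(b)
                         and a[i + length] == b[j + length])
                if not (left or right):
                    mums.append((i, j, length))  # start in A, start in B, length
    return mums
```

For example, `find_mums("xxGATTACAyy", "zzGATTACAww")` reports the single shared anchor `GATTACA`; its shorter substrings are excluded because they are extendable, hence not maximal.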

  10. Overview: Choosing MUMs • Once we have anchors, we need to choose which ones to use in the alignment • All MUMs: • MUMs used in alignment (subset):

  11. Overview: Closing the Gaps • After choosing and aligning anchors, what next? • Close the gaps • Use dynamic programming (Smith-Waterman) to extend the MUMs • The gaps are small regions, so they can be computed quickly • Implicit assumption: the sequences are very similar

  12. 3 Phases of MUMmer • 3 Phases: • 1) Obtaining MUMs (via Suffix Trees) • 2) MUM choosing (via Longest Increasing Subsequences) • 3) Gap closing (via dynamic programming, Smith-Waterman) • Composed of previously known algorithms, packaged together to form a novel algorithm

  13. Critiquing MUMmer’s Output • Sample Output: Sequence A: …ACTGC_TGAC_CTA… Sequence B: …ACC_CA_GGCTCG_… ^ ^ ^ • MUMmer best case: the same alignment as Needleman/Wunsch • MUMmer worst case: a suboptimal alignment • At least it is computable, whereas Needleman/Wunsch is not feasible at this scale

  14. Phase 1: Obtaining MUMs via Suffix Trees Edward M. McCreight. (1976) A Space-Economical Suffix Tree Construction Algorithm. http://doi.acm.org/10.1145/321941.321946

  15. Suffix Trees Outline • I. Suffix Trees • A. Motivations • B. Tries • C. Suffix Trees • D. How MUMmer utilizes suffix trees

  16. Tries • The term ‘trie’ comes from ‘retrieval’ • Introduced in the 1960s by Fredkin • Suffix trees are a type of trie • Uses: • Quickly search large text via preprocessing • Used for regular expressions, longest common substring, automatic command completion, etc.

  17. Non-Compact Trie Example • 5 strings encoded: BIG, BIGGER, BILL, GOOD, GOSH • Every edge represents a symbol of the alphabet

  18. Implementation of Tries • Use linked list • Include pointers to sibling and first child
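A minimal sketch of that linked-list layout (class and function names are mine): each node stores one character plus first-child and sibling pointers, so a node's children form a linked list rather than an array.

```python
class TrieNode:
    """Trie node in the linked-list representation: one character,
    a pointer to the first child, and a pointer to the next sibling."""
    def __init__(self, ch=""):
        self.ch = ch
        self.first_child = None
        self.sibling = None
        self.is_word = False

def _find_child(node, ch):
    # walk the sibling chain looking for a child labeled ch
    child = node.first_child
    while child is not None and child.ch != ch:
        child = child.sibling
    return child

def insert(root, word):
    node = root
    for ch in word:
        child = _find_child(node, ch)
        if child is None:            # no such child: prepend a new one
            child = TrieNode(ch)
            child.sibling = node.first_child
            node.first_child = child
        node = child
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = _find_child(node, ch)
        if node is None:
            return False
    return node.is_word
```

Inserting the slide's five strings (BIG, BIGGER, BILL, GOOD, GOSH) builds exactly the non-compact trie of slide 17, with shared prefixes stored once.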

  19. Compacting Tries : Part 1 • Method 1: trim chains leading to leaves • Compact trie for strings: BIG, BIGGER, BILL, GOOD, GOSH

  20. Compacting Tries : Part 2 • Method 2: Patricia Tries • Before, one edge per character • Now, unary nodes are collapsed

  21. Suffix Trie • A normal trie, but the input strings are the suffixes of the text • Assume text string [t1…tn] • Q: The tree has how many leaves? • A: The tree has n leaves

  22. Suffix Tree • First compact suffix trie • Next collapse unary nodes

  23. Suffix Trees: Decreasing storage • Rather than storing strings on the edges, store a pair of indices (x, y), where x is the start position of the substring in the text and y is the end • Storage becomes O(n)
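The index-pair idea fits in a couple of lines (1-based inclusive indices; the helper name is mine): an edge stores only (x, y), and its label is recovered from the text on demand.

```python
def edge_label(text, x, y):
    """Recover an edge label from its (x, y) index pair (1-based,
    inclusive) instead of storing a copy of the substring."""
    return text[x - 1:y]

# e.g. for the text "ababc", the pair (3, 5) stands for the label "abc"
```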

  24. Suffix Tree Algorithms • First linear-time algorithm given by Weiner (1973) • McCreight developed a more space-efficient algorithm (1976) • Both original papers have a reputation for being difficult to understand

  25. McCreight’s Algorithm Part 1 • Algorithm M: • “Maps finite string S of characters into auxiliary index to S in the form of a digital search tree T whose paths are the suffixes of S, and whose terminal nodes correspond uniquely to positions within S.” A Space-Economical Suffix Tree Construction Algorithm. Edward M. McCreight. http://doi.acm.org/10.1145/321941.321946

  26. McCreight’s Algorithm Part 2 • S = ababc • Definitions: • sufi is suffix of S beginning at character position i • headi is longest prefix of sufi which is also prefix for sufj for some j < i • taili is sufi – headi • suf3 =? • abc • head3 =? • ab • tail3 =? • abc – ab = c
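These definitions can be checked mechanically with a naive sketch (function names are mine; this recomputes heads quadratically purely for illustration — avoiding exactly this cost is the point of McCreight's algorithm):

```python
def suf(s, i):
    """Suffix of S beginning at 1-based position i (slide notation)."""
    return s[i - 1:]

def head(s, i):
    """Longest prefix of suf_i that is also a prefix of some suf_j, j < i."""
    si, best = suf(s, i), ""
    for j in range(1, i):
        sj = suf(s, j)
        k = 0
        while k < len(si) and k < len(sj) and si[k] == sj[k]:
            k += 1                     # extend the common prefix
        if k > len(best):
            best = si[:k]
    return best

def tail(s, i):
    """tail_i = suf_i minus head_i."""
    return suf(s, i)[len(head(s, i)):]
```

For S = ababc this reproduces the slide's answers: suf(S, 3) = abc, head(S, 3) = ab, tail(S, 3) = c.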

  27. McCreight’s Algorithm Part 3 • Builds the suffix tree by adding sufi to Treei-1 • Initially, Tree1 contains only suf1 (the entire string) • To obtain Tree2, add suf2 to Tree1 • Continue until you have added sufn to Treen-1 • Treen is the final suffix tree

  28. McCreight’s Algorithm Part 4 • Adding a suffix (going from Tree2 to Tree3) • suf3 = abc; head3 = ab; tail3 = c

  29. Suffix Trees Complexity • Adding a non-terminal and a new arc that corresponds to the tail takes at most constant time • If the head could be found in at most constant time, the algorithm would run in time linear in n, the length of the string S • This is done using suffix links (see paper for details)

  30. Finding MUMs from Suffix Trees

  31. Finding MUMs from Suffix Trees 2

  32. Finding MUMs from Suffix Trees 3 • More general case:

  33. Phase 2: Choosing MUMs For Alignment via Longest Increasing Subsequence (LIS) Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.

  34. Motivation For Choosing MUMs • Q: Why can’t we use all MUMs for the alignment? • A: Because MUMs can cross; we can only choose a set of MUMs whose order is increasing in both sequences • Problem: given a set of MUMs, how do we choose the optimal sequence?

  35. Choosing MUMs (Continued) • Configuration can be uniquely represented: • P = {1, 2, 3, 4, 6, 7, 5}; • LIS(P) = {1, 2, 3, 4, 6, 7} • Determining optimal sequence of MUMs reduces to finding LIS of P

  36. IS Definition • Increasing Subsequence: values (strictly) increase from left to right • Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} • Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10}

  37. DS Definition • Decreasing Subsequence: values that decrease from left to right • Sequence P = {4, 2, 1, 5, 8, 6, 9, 10} • Examples? <insert class participation here> {4, 2, 1}, {4, 2}, {4, 1}, {2, 1}, or {8, 6}

  38. Covers Definition Part 1 • Cover of P: set of decreasing subsequences of P that contains all numbers of P • P = {7, 3, 4, 8, 6, 2, 1} • Some possible covers ? {{7, 3} {4} {8} {6, 2, 1}} OR {{7, 3, 2, 1} {4} {8, 6}} And Others…

  39. Covers Definition Part 2 • Size of a cover: the number of decreasing subsequences it contains • Smallest cover: a cover with minimum size • “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P” • Why?

  40. Covers & Relation to LIS • “If I is an increasing subsequence of P with length equal to the size of a cover of P, call it C, then I is a longest increasing subsequence of P and C is a smallest cover of P”. Why? • Because no increasing subsequence can contain more than one element from each decreasing subsequence in a cover

  41. Covers: Examples • P = {7, 3, 4, 8, 6, 2, 1} • Two possible covers: {{7, 3} {4} {8} {6, 2, 1}} {{7, 3, 2, 1} {4} {8, 6}} • What is the size of the smallest cover? • 3 (no cover can contain < 3 decreasing subsequences) • How many elements in LIS? • 3

  42. Covers: Examples Continued • P = {7, 3, 4, 8, 6, 2, 1} • For a particular cover, say: {{7, 3, 2, 1} {4} {8, 6}} • You can choose only one element from each subsequence; otherwise the chosen subsequence would not be increasing • Example: IS: {3, 4, 6}; to add an element, you would need to choose from a subsequence from which you have already chosen

  43. Greedy Cover Algorithm Part 1 • To create a smallest cover, use Greedy Cover algorithm: • Start from left of sequence P • Examine each number • Place number at the end of the left-most subsequence it can extend • If none exists, make a new decreasing subsequence (to the far right)

  44. Greedy Cover Algorithm Part 2 • Example: P = 6, 3, 5, 1, 9, 7, 2, 10, 4 • {6} • {6, 3} • {6, 3} {5} • {6, 3, 1} {5} • {6, 3, 1} {5} {9} • {6, 3, 1} {5} {9, 7} • {6, 3, 1} {5, 2} {9, 7} • {6, 3, 1} {5, 2} {9, 7} {10} • {6, 3, 1} {5, 2} {9, 7, 4} {10} • {6, 3, 1}{5, 2}{9, 7, 4} {10} (smallest cover)
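The walkthrough above can be reproduced by a direct sketch of Greedy Cover (function name mine; this scans the subsequence tails linearly for clarity, whereas the O(n log n) bound quoted later requires binary search over the tails):

```python
def greedy_cover(p):
    """Greedy Cover: place each number at the end of the left-most
    decreasing subsequence it can extend; if none exists, start a
    new decreasing subsequence on the far right."""
    cover = []
    for x in p:
        for subseq in cover:
            if subseq[-1] > x:       # x can extend this subsequence
                subseq.append(x)
                break
        else:                        # no subsequence can take x
            cover.append([x])
    return cover
```

On the slide's input P = 6, 3, 5, 1, 9, 7, 2, 10, 4 this yields the same smallest cover, {6, 3, 1} {5, 2} {9, 7, 4} {10}.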

  45. Obtaining LIS From Smallest Cover • LIS Algorithm: • Set i = # of subsequences in the greedy cover • Set I to the empty list • Choose any element x in subsequence i and place it at the front of list I • While i > 1: • Scan from the top of subsequence (i-1) and find the first element y smaller than x • Set x = y and i = i - 1 • Place x at the front of list I

  46. Obtaining LIS : Example • P = {6, 3, 5, 1, 9, 7, 2, 10, 4} • Smallest Cover: {6, 3, 1} {5, 2} {9, 7, 4} {10} • i = # of subsequences = 4 • i = 4; x = 10; I = {10} • i = 3; x = 9; I = {9, 10} • i = 2; x = 5; I = {5, 9, 10} • i = 1; x = 3; I = {3, 5, 9, 10} • LIS: {3, 5, 9, 10}
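The same trace can be run by a direct sketch of the LIS-from-cover procedure (function name mine; it takes an already-built cover as input):

```python
def lis_from_cover(cover):
    """Walk the cover right to left: take an element from the last
    subsequence, then repeatedly scan the previous subsequence from
    the top for the first element smaller than x."""
    i = len(cover) - 1      # 0-based index of the last subsequence
    x = cover[i][0]         # any element works; take the top one
    lis = [x]
    while i > 0:
        i -= 1
        x = next(y for y in cover[i] if y < x)  # first y smaller than x
        lis.insert(0, x)
    return lis
```

Feeding it the cover {6, 3, 1} {5, 2} {9, 7, 4} {10} reproduces the LIS {3, 5, 9, 10} from the trace above.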

  47. How MUMmer Utilizes LIS • P = {1, 2, 3, 4, 6, 7, 5} • LIS = {1, 2, 3, 4, 6, 7}

  48. Obtaining LIS: Complexity Analysis • The greedy cover can be found in O(n log n) • The LIS is found from the greedy cover in O(n)

  49. Phase 3: Closing the Gaps via Smith-Waterman Smith, T. and Waterman, M. (1981) Identification of Common Molecular Subsequences. Journal of Molecular Biology , 147, 195-197. http://citeseer.ist.psu.edu/smith81identification.html

  50. Closing the Gaps • After the global MUM alignment is found, we need to close the local gaps • Gap: an interruption in the MUM alignment • Types of gaps: • 1) SNP (single nucleotide polymorphism) • 2) Insertion • 3) Highly polymorphic region • 4) Repeat
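Because each gap is a short region between adjacent MUMs, quadratic local alignment is affordable there. A minimal score-only Smith-Waterman sketch (the scoring values are mine, not MUMmer's defaults, and real gap closing would also track a traceback to recover the alignment):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment DP matrix and return the best score.
    Cells are clamped at 0, which is what makes the alignment local."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

With this scoring, two identical 4-mers score 8 (four matches at +2), and two sequences with nothing in common score 0.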
