
CS5263 Bioinformatics


Presentation Transcript


  1. CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

  2. Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T

  3. The naïve algorithm
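A minimal Python sketch of this naïve scan (my own illustration, not the lecture's code; naive_match is a hypothetical name; T has length m and P length n, as in the slides):

```python
def naive_match(T, P):
    """Slide P along T one position at a time; at each position compare
    left to right until a mismatch or a full match.  Worst case O(mn)."""
    m, n = len(T), len(P)
    occurrences = []
    for s in range(m - n + 1):           # candidate start position in T
        k = 0
        while k < n and T[s + k] == P[k]:
            k += 1
        if k == n:                       # all n characters of P matched
            occurrences.append(s)
    return occurrences
```

For example, naive_match("xpbctbxabpqqaabpq", "ab") returns [7, 13], the 0-indexed start positions of "ab".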

  4. Time complexity • Worst case: O(mn) • Best case: O(m) • e.g., T = aaaaaaaaaaaaaa vs P = baaaaaaa (mismatch on the first character at every position) • Average case? • Alphabet A, C, G, T • Assume both P and T are random, with each character equally likely • How many characters do you need to compare before moving to the next position?

  5. Average case time complexity
P(mismatch at 1st position) = 3/4
P(mismatch at 2nd position) = (1/4) · (3/4)
P(mismatch at 3rd position) = (1/4)^2 · (3/4)
P(mismatch at kth position) = (1/4)^(k-1) · (3/4)
Expected number of comparisons per position, with p = 1/4:
Σ_k k · (1-p) · p^(k-1) = ((1-p)/p) · Σ_k k · p^k = 1/(1-p) = 4/3
Average complexity: 4m/3. Not as bad as you thought it might be.
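For completeness, the geometric-series step behind the 4/3 figure, written out (treating the number of comparisons as unbounded and the characters as i.i.d. uniform, as the slide does):

```latex
E[\text{comparisons per position}]
  = \sum_{k \ge 1} k\,(1-p)\,p^{k-1}
  = (1-p)\cdot\frac{1}{(1-p)^2}
  = \frac{1}{1-p}
  = \frac{4}{3} \quad (p = 1/4)
```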

  6. Biological sequences are not random T: aaaaaaaaaaaaaaaaaaaaaaaaa P: aaaab Besides, the 4m/3 average case is still slow for long genomic sequences, especially if P is not in T… Smarter algorithms: O(m + n) in the worst case, sub-linear in practice

  7. String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking • One fixed T, many P • Search a completed genome for a short sequence • Two (or many) T’s for common patterns

  8. How to speed up? • Pre-process T or P • Why can pre-processing save time? • It uncovers the structure of T or P • It determines when we can skip ahead without missing anything • It determines when we can infer the result of character comparisons without doing them. ACGTAXACXTAXACGXAX ACGTACA

  9. Cost for exact string matching Total cost = cost(preprocessing) + cost(comparison) + cost(output), where the preprocessing cost is overhead, the comparison cost is what we try to minimize, and the output cost is constant. Hope: gain > overhead

  10. Which string to preprocess? • One T and one P • Preprocessing P? • One T and many P all at once • Preprocessing P or T? • One fixed T, many P (unknown) • Preprocessing T? • Two (or many) T’s for common patterns • ???

  11. Pattern pre-processing algorithms • Karp – Rabin algorithm • Small alphabet and small pattern • Boyer – Moore algorithm • The choice for most cases • Typically sub-linear time • Knuth-Morris-Pratt algorithm (KMP) • grep • Aho-Corasick algorithm • fgrep

  12. Karp – Rabin Algorithm • Let’s say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 • Convert pattern to integer 101100 = 2^5 + 2^3 + 2^2 = 44

  13. Karp – Rabin algorithm Text: 01010001011001010101001 Pattern: 101100 = 44 (decimal). Slide a 6-bit window across the text and update its value in constant time: double it, subtract 64 (= 2^6) if the outgoing bit was 1, and add the incoming bit. For example, on the bit string 10111011001010101001:
101110 = 2^5 + 2^3 + 2^2 + 2^1 = 46
011101 = 46 * 2 - 64 + 1 = 29
111011 = 29 * 2 - 0 + 1 = 59
110110 = 59 * 2 - 64 + 0 = 54
101100 = 54 * 2 - 64 + 0 = 44 → matches the pattern

  14. Karp – Rabin algorithm What if the pattern is too long to fit into a single integer? Pattern: 101100, but our machine only has 5 bits. Basic idea: hashing. Hash of the pattern: 44 % 13 = 5. Keep the window value modulo 13 while sliding:
101110 = 46 (% 13 = 7)
011101 = 46 * 2 - 64 + 1 = 29 (% 13 = 3)
111011 = 29 * 2 - 0 + 1 = 59 (% 13 = 7)
110110 = 59 * 2 - 64 + 0 = 54 (% 13 = 2)
101100 = 54 * 2 - 64 + 0 = 44 (% 13 = 5) → hash matches; verify the occurrence character by character
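A small Python sketch of this rolling-hash search (my illustration, not the lecture's code; karp_rabin is a hypothetical name; the alphabet is binary and the modulus q = 13 matches the slide; a hash hit is verified character by character because collisions are possible):

```python
def karp_rabin(T, P, q=13):
    """Rolling-hash search over binary strings ('0'/'1' characters).
    The window value is updated in O(1): drop the outgoing bit's weight,
    double, add the incoming bit; everything is kept modulo q."""
    m, n = len(T), len(P)
    if n > m:
        return []
    high = pow(2, n - 1, q)                  # weight of the leading bit, mod q
    p_hash = t_hash = 0
    for i in range(n):                       # hashes of P and of the first window
        p_hash = (p_hash * 2 + int(P[i])) % q
        t_hash = (t_hash * 2 + int(T[i])) % q
    occurrences = []
    for s in range(m - n + 1):
        if t_hash == p_hash and T[s:s + n] == P:   # verify on a hash match
            occurrences.append(s)
        if s + n < m:                               # roll the window one bit
            t_hash = ((t_hash - int(T[s]) * high) * 2 + int(T[s + n])) % q
    return occurrences
```

On T = "10111011001010101001" and P = "101100" it reports the single occurrence at 0-indexed position 4, matching the hand computation above.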

  15. Boyer – Moore algorithm • Three ideas: • Right-to-left comparison • Bad character rule • Good suffix rule

  16. Boyer – Moore algorithm • Right-to-left comparison within each alignment lets us skip some characters of T without missing any occurrence. But how?

  17. Bad character rule T: xpbctbxabpqqaabpq (positions 1-17) P: tpabxab, aligned under positions 3-9. Comparing right to left, the last four characters match (^^^^) and then there is a mismatch (*). What would you do now?

  18. Bad character rule T: xpbctbxabpqqaabpq P: tpabxab *^^^^ Shift P right so that the mismatched text character lines up with its rightmost occurrence in P: P: tpabxab (a shift of 2)

  19. Bad character rule T: xpbctbxabpqqaabpqz P: tpabxab *^^^^ After that shift, the very first (rightmost) comparison mismatches (*) on a 'q' that does not occur in P at all, so P can be shifted all the way past it: P: tpabxab

  20. Basic bad character rule For P = tpabxab, record for each character in the alphabet the position of its rightmost occurrence in P. Pre-processing: O(n)

  21. Basic bad character rule T: xpbctbxabpqqaabpqz P: tpabxab *^^^^ When the rightmost occurrence of the mismatched text character T(k) in P is to the left of the mismatch position i, shift P so that T(k) aligns with that rightmost occurrence. Here i = 3 and the rightmost 't' in P is at position 1, so shift 3 - 1 = 2. P: tpabxab

  22. Basic bad character rule T: xpbctbxabpqqaabpqz P: tpabxab * When T(k) does not occur in P at all, shift the left end of P to align with T(k+1). Here i = 7, so shift 7 - 0 = 7. P: tpabxab

  23. Basic bad character rule T: xpbctbxabpqqaabpqz P: tpabxab *^^ When the rightmost occurrence of T(k) in P is to the right of position i, the rule would give a negative shift (here i = 5 and 5 - 6 < 0), so shift P by just one position. P: tpabxab

  24. Extended bad character rule T: xpbctbxabpqqaabpqz P: tpabxab *^^ Find the occurrence of T(k) in P that is immediately to the left of position i and shift P to align it with T(k). Here i = 5 and that occurrence is at position 3, so shift 5 - 3 = 2. P: tpabxab Preprocessing is still O(n)

  25. Extended bad character rule • Best possible: m / n comparisons • Works better for large alphabet size • In some cases the extended bad character rule is sufficiently good • Worst-case: O(mn)
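A Python sketch of the extended bad character table and shift (my illustration, 0-indexed, with hypothetical names extended_bad_char_table and bad_char_shift; the full Boyer – Moore algorithm combines this with the good suffix rule):

```python
def extended_bad_char_table(P):
    """For each character, the list of positions where it occurs in P,
    in increasing order.  O(n) time and space."""
    positions = {}
    for i, c in enumerate(P):
        positions.setdefault(c, []).append(i)
    return positions

def bad_char_shift(positions, i, c):
    """Mismatch at pattern position i (0-indexed) against text character c:
    shift so the closest occurrence of c to the left of i lines up with it;
    if c does not occur to the left of i, shift the pattern past position i."""
    for j in reversed(positions.get(c, [])):   # scan occurrences right to left
        if j < i:
            return i - j
    return i + 1
```

With P = "tpabxab", a mismatch on 't' at 0-indexed position 2 gives a shift of 2, and a mismatch on 'q' at position 6 gives a shift of 7, matching the shifts worked out on slides 21 and 22.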

  26. T: prstabstubabvqxrst (positions 1-18) P: qcabdabdab *^^ The last two characters match and then there is a mismatch; according to the extended bad character rule, P can only be shifted by 1 here: P: qcabdabdab

  27. (Weak) good suffix rule T: prstabstubabvqxrst P: qcabdabdab *^^ The matched suffix "ab" also occurs earlier in P; shifting P so that the rightmost earlier copy of "ab" lines up with the matched text gives a shift of 3: P: qcabdabdab

  28. (Weak) good suffix rule In preprocessing: for every suffix t of P, find the rightmost copy t' of t elsewhere in P (t' ≠ t). On a mismatch after matching t in T, shift P so that t' is aligned with the occurrence of t just matched in T.
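A simplified Python sketch of the weak-rule table (hypothetical name weak_good_suffix_shift, quadratic for clarity; the lecture's linear-time preprocessing and the extra case where only a prefix of P matches part of t are not reproduced here):

```python
def weak_good_suffix_shift(P):
    """For each matched-suffix length L (1..n-1), how far to shift P so that
    the rightmost earlier copy of that suffix lines up with the matched text.
    If no earlier copy exists, this sketch simply shifts by the full length n."""
    n = len(P)
    shift = {}
    for L in range(1, n):
        t = P[n - L:]                        # the matched suffix of length L
        shift[L] = n                         # default: shift past everything
        for end in range(n - 1, L - 1, -1):  # rightmost copy of t ending before the end of P
            if P[end - L:end] == t:
                shift[L] = n - end
                break
    return shift
```

For P = "qcabdabdab" this gives shift[2] = 3: after matching the suffix "ab", P moves 3 places so the previous "ab" lines up, which is the shift shown on slide 27.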

  29. (Strong) good suffix rule T: prstabstubabvqxrst P: qcabdabdab *^^ (same situation as before: the suffix "ab" matches, then a mismatch)

  30. (Strong) good suffix rule T: prstabstubabvqxrst P: qcabdabdab *^^ The strong rule also requires the character to the left of the earlier copy to differ from the character to the left of the matched suffix, so the copy of "ab" preceded by 'd' is skipped and P is shifted by 6, to the copy preceded by 'c': P: qcabdabdab

  31. (Strong) good suffix rule • Pre-processing can be done in linear time • If P occurs in T, matching may still take O(mn) • If P does not occur in T, the worst case is O(m + n) In preprocessing: for any suffix t of P, find the rightmost copy t' of t (t' ≠ t) such that the character to the left of t' differs from the character to the left of t.

  32. Lessons From B-M • Sub-linear time is possible • But we still need to read T from disk! • Bad cases require periodicity in P or T • Matching a random P against T is easy! • Large alphabets mean large shifts • Small alphabets make complicated shift data structures possible • B-M works better for English text and amino acids than for DNA.

  33. Algorithm KMP • Not the fastest, but the best known • Good for multiple-pattern matching and real-time matching • Idea • Left-to-right comparison • Shift P by more than one character when possible

  34. Basic idea In pre-processing: for any position i in P, find the longest proper suffix t = P[j+1..i] of P[1..i] that matches a prefix t' of P, such that the character following t differs from the character following t', i.e., P[i+1] != P[i-j+1]. sp'(i) = length(t)

  35. Example P: aataac, with sp'(i) = 0 1 0 0 2 0 for the characters a a t a a c
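A Python sketch of how these values can be computed (hypothetical name failure_links, 0-indexed; it first builds the usual weak sp table and then strengthens it; on P = "aataac" it returns the sp' values 0 1 0 0 2 0 shown above, and on "abababc" the values 0 0 0 0 0 4 0 used a few slides later):

```python
def failure_links(P):
    """Return (sp, sp_strong): sp[i] is the length of the longest proper suffix
    of P[:i+1] that is also a prefix of P; sp_strong[i] additionally requires
    the character following that prefix to differ from P[i+1] (no such
    constraint at the last position).  O(n) time."""
    n = len(P)
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and P[i] != P[k]:
            k = sp[k - 1]                 # fall back along the failure chain
        if P[i] == P[k]:
            k += 1
        sp[i] = k
    sp_strong = [0] * n
    for i in range(n):
        k = sp[i]
        # reject candidates whose next character equals P[i+1]
        while k > 0 and i + 1 < n and P[k] == P[i + 1]:
            k = sp[k - 1]
        sp_strong[i] = k
    return sp, sp_strong
```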

  36. Failure link P: aataac If a character in T fails to match at position 6, re-compare it with the character at position 3, since sp'(5) = 2. sp'(i): 0 1 0 0 2 0

  37. FSA for P: aataac, with states 0-6 and forward edges labeled a, a, t, a, a, c. If the next character in T is 't' (from state 5), we go to state 3. All other failing input goes to state 0. sp'(i): 0 1 0 0 2 0

  38. Another example P: abababc, with sp'(i) = 0 0 0 0 0 4 0 for the characters a b a b a b c

  39. Failure link P: abababc If a character in T fails to match at position 7, re-compare it with the character at position 5, since sp'(6) = 4. sp'(i): 0 0 0 0 0 4 0

  40. FSA for P: abababc, with states 0-7. If the next character in T is 'a' (from state 6), go to state 5. All other failing input goes to state 0. sp'(i): 0 0 0 0 0 4 0

  41. Difference between Failure Link and FSA? • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m • FSA • Preprocessing time and space are O(n|Σ|) • May be a problem for a very large alphabet • Comparison time is always m.

  42. Failure link P: aataac If a character in T fails to match at position 6, re-compare it with the character at position 3, since sp'(5) = 2. sp'(i): 0 1 0 0 2 0

  43. Example P: aataac, T: aacaataaaaataaccttacta. Matching with failure links, the successive alignments of P compare as ^^*, .*, ^^^^^*, ..*, .^^^^^ (^ = match, * = mismatch, . = position carried over without re-comparison). Each character in T may be compared multiple times, up to n times, but the work splits into a comparison phase and a shift phase, each bounded by m, so the time complexity is O(2m).
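A Python sketch of this matching phase (hypothetical name kmp_search; it works with either the weak sp table or the strong sp' table from the earlier sketch, and makes at most about 2m character comparisons):

```python
def kmp_search(T, P, sp):
    """Scan T left to right, keeping k = number of pattern characters currently
    matched.  On a mismatch, follow failure links (k = sp[k-1]) instead of
    restarting, so no text character is ever re-read."""
    occurrences = []
    k = 0
    for i, c in enumerate(T):
        while k > 0 and c != P[k]:
            k = sp[k - 1]                 # shift P using the failure link
        if c == P[k]:
            k += 1
        if k == len(P):                   # full occurrence ending at position i
            occurrences.append(i - len(P) + 1)
            k = sp[k - 1]
    return occurrences
```

For example, using either table returned by failure_links("aataac"), kmp_search("aacaataaaaataaccttacta", "aataac", table) reports the single occurrence starting at 0-indexed position 9.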

  44. Example Running the FSA (states 0-6) over T: aacaataaaaataaccttacta visits the states 1201234501234560001001. Each character in T is examined exactly once, so exactly m comparisons are needed; the trade-off is that pre-processing takes longer.

  45. How to do pre-processing?
