1 / 69

Survey: String Matching with k Mismatches

Survey: String Matching with k Mismatches. Moshe Lewenstein Bar Ilan University. String Matching with k Mismatches. Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987

Download Presentation

Survey: String Matching with k Mismatches

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

  2. String Matching with k Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000

  3. Exact String Matching Input: T = t1 . . . tn P = p1 … pm Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A…

  4. Exact String Matching Input: T = t1 . . . tn P = p1 … pm Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… 3

  5. Exact String Matching Input: T = t1 . . . tn P = p1 … pm Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… 37

  6. Exact String Matching Input: T = t1 . . . tn P = p1 … pm Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… 37 11

  7. Exact String Matching Input: T = t1 . . . tn P = p1 … pm Output: All locations i of T where P appears Example: P = A B C A A B T = A B A B C A A B C A A B C A A B A A… Answer: {3,7,11,..}

  8. Exact String Matching • Problem: Matching not exact in applications of: • Computational Biology • Musicology • Text Editing • Meteorology • etc. • Need other definitions of string matching!

  9. Approximate String Matching Idea: Find all text locations where distance from pattern is sufficiently small. distance metric: HAMMING DISTANCE Let S = s1s2…sm R = r1r2…rm Ham(S,R) = The number of locations j where sjrj Example: S = ABCABC R = ABBAAC Ham(S,R)= 2

  10. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C…

  11. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2 Ham(P,T1)= 2

  12. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4 Ham(P,T2)= 4

  13. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6 Ham(P,T3)= 6

  14. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2 Ham(P,T4)= 2

  15. String Matching with Mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

  16. String Matching with k Mismatches Input: T = t1 . . . tn, P = p1 … pm Output: Every i in T s.t. Ham(P,titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

  17. String Matching with k Mismatches Input: T = t1 . . . tn, P = p1 … pm Output: Every i in T s.t. Ham(P,titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

  18. String Matching with k Mismatches Input: T = t1 . . . tn, P = p1 … pm Output: Every i in T s.t. Ham(P,titi+1…ti+m-1) k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, … Y,N,N,Y,…

  19. Naïve Algorithm (for counting mismatches or k-mismatches problem) - Goto each location of text and compute hamming distance of P and Ti Running Time: O(nm) n = |T|, m = |P|

  20. The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

  21. Trie • A tree representing a set of strings. c a { aeef ad bbfe bbfg c } b e b d e f f e g

  22. Trie (Cont) • Assume no string is a prefix of another c Each string corresponds to a leaf. a b e b d e f f e g

  23. Compressed Trie • Compress unary nodes, label edges by strings c  c a a b e b bbf d d eef e f f e g e g

  24. Suffix tree Suffix tree of string s: a compressed trie of all suffixes of s Prefix-free: add a special character, say $, at the end of s

  25. Suffix tree (Example) Let s = abab,a suffix tree of s is a compressed trie of all suffixes of s=abab$ $ { $ b$ ab$ bab$ abab$ } a b b $ a a $ b b $ $

  26. $ a 5 b $ a a $ b 4 b $ $ 3 2 1 Suffix Tree properties b • Succint in space - O(n). • - Can be built in O(n) time. McCreight, Weiner, • Ukkonen, Farach-Colton

  27. Exact string matching $ s=abab$ a b 5 b $ a a $ b 4 b $ $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.

  28. Exact string matching $ s=abab$ 1 3 a b 5 b $ a a $ b 4 b $ $ 3 2 1 Leaves correspond to locations of appearance!

  29. Exact string matching $ s=abab$ 1 3 a b 5 b $ a a $ b 4 b $ $ 3 2 1 Prepare Tree: O(n) time Find matches: O(m + occ) time occ = # of matches

  30. Lowest common ancestors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

  31. Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ a s = abbaab$ b 7 $ a a b b b $ a 6 a b a b $ b 4 a $ $ a 3 b 5 2 $ 1

  32. Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ a s = abbaab$ aab$ b 7 $ a a b b b $ a 6 a b a b $ b 4 a $ $ a 3 b 5 2 $ 1

  33. Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ a s = abbaab$ aab$ abbaab$ b 7 $ a a b b b $ a 6 a b a b $ b 4 a $ $ a 3 b 5 2 $ 1

  34. Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes $ a s = abbaab$ aab$ abbaab$ b 7 $ a a b b b $ a 6 a b a b $ b 4 a $ $ a 3 b 5 2 $ 1

  35. $ b 7 $ a a b b b $ a 6 a b a b $ b 4 a $ $ a 3 b 5 2 $ 1 LCA/LCP properties a Preprocesssing time : O(n) Query Time: O(1) Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

  36. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  37. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  38. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  39. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  40. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  41. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  42. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  43. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  44. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  45. The Kangaroo Method (for k-mismatches) Preprocess: Build suffix tree of both P and T - O(n+m) time LCA preprocessing - O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)

  46. a b a c c a c b a c a b a c c Boolean Convolutions (FFT) Method P = T = a b b b c c c a a a a b a c b ... ...

  47. a b a c c a c b a c a b a c c Boolean Convolutions (FFT) Method P = a-mask a b a c c a c b a c a b a c c T = a b b b c c c a a a a b a c b ... ...

  48. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 Boolean Convolutions (FFT) Method P = a-mask a b a c c a c b a c a b a c c Pa T = a b b b c c c a a a a b a c b ... ...

  49. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 Boolean Convolutions (FFT) Method P = Pa T = a b b b c c c a a a a b a c b ... ...

  50. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 Boolean Convolutions (FFT) Method P = Pa T = a b b b c c c a a a a b a c b ... ... not-a mask a b b b c c c a a a a b a c b ... ...

More Related