1 / 52

String Matching with Mismatches

String Matching with Mismatches. Some slides are stolen from Moshe Lewenstein (Bar Ilan University). String Matching with Mismatches. Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987

xena
Download Presentation

String Matching with Mismatches

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

  2. String Matching with Mismatches Landau – Vishkin 1986 Galil – Giancarlo 1986 Abrahamson 1987 Amir - Lewenstein - Porat 2000

  3. Approximate String Matching problem: Find all text locations where distance from pattern is sufficiently small. distance metric: HAMMING DISTANCE Let S = s1s2…sm R = r1r2…rm Ham(S,R) = The number of locations j where sjrj Example: S = ABCABC R = ABBAAC Ham(S,R)= 2

  4. Problem 1: Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C…

  5. Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2 Ham(P,T1)= 2

  6. Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4 Ham(P,T2)= 4

  7. Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6 Ham(P,T3)= 6

  8. Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2 Ham(P,T4)= 2

  9. Counting mismatches Input: T = t1 . . . tn P = p1 … pm Output: For each i in T Ham(P,titi+1…ti+m-1) Example: P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

  10. Problem 2: k-mismatches Input: T = t1 . . . tn P = p1 … pm Output: Every i where Ham(P,titi+1…ti+m-1) ≤ k Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, …

  11. Problem 2: k-mismatches Input: T = t1 . . . tn P = p1 … pm Output: Every i where Ham(P,titi+1…ti+m-1) ≤ kh Example: k = 2 P = A B B A A C T = A B C A A B C A C… 2, 4, 6, 2, … 1, 0, 0, 1,

  12. Naïve Algorithm (for counting mismatches or k-mismatches problem) - Goto each location of text and compute hamming distance of P and Ti Running Time: O(nm) n = |T|, m = |P|

  13. The Kangaroo Method (for k-mismatches) Landau – Vishkin 1986 Galil – Giancarlo 1986

  14. The Kangaroo Method (for k-mismatches) • Create suffix tree (+ lca) for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  15. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  16. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  17. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  18. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  19. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  20. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Check P at each location i of T by kangrooing • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  21. The Kangaroo Method (for k-mismatches) • Create suffix tree for: s = P#T • Do up to k LCP queries for every text location • Example: • P = A B A B A A B A C A B • T = A B B A C A B A B A B C A B B C A B C A … • i

  22. The Kangaroo Method (for k-mismatches) Preprocess: Build suffix tree of both P and T - O(n+m) time LCA preprocessing - O(n+m) time Check P at given text location Kangroo jump till next mismatch - O(k) time Overall time: O(nk)

  23. How do we do counting in less than O(nm) ?

  24. If we pad the pattern and the text with zeros this is like a convolution of two vectors of length m+n We can do this using FFT in O(nlog(n)) time Lets start with binary strings P = 0 1 0 1 1 0 1 1 T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ... ...

  25. a b a c c a c b a c a b a c c And if the strings are not binary ? P = T = a b b b c c c a a a a b a c b ... ...

  26. a b a c c a c b a c a b a c c P = a-mask a b a c c a c b a c a b a c c T = a b b b c c c a a a a b a c b ... ...

  27. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = a-mask a b a c c a c b a c a b a c c Pa T = a b b b c c c a a a a b a c b ... ...

  28. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = Pa T = a b b b c c c a a a a b a c b ... ...

  29. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = Pa T = a b b b c c c a a a a b a c b ... ... not-a mask a b b b c c c a a a a b a c b ... ...

  30. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = Pa T = a b b b c c c a a a a b a c b ... ... not-a mask a b b b c c c a a a a b a c b ... ... Tnot a 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

  31. a b a c c a c b a c a b a c c 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = Pa T = a b b b c c c a a a a b a c b ... ... Tnot a 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

  32. 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = a b a c c a c b a c a b a c c Pa T = a b b b c c c a a a a b a c b ... ... Tnot a 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Multiply Pa and T(not a) to count mismatches using “a” Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 Tnot a ... 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ...

  33. 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 P = a b a c c a c b a c a b a c c Pa T = a b b b c c c a a a a b a c b ... ... Tnot a 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 Multiply Pa and Tnot a to count mismatches using a Pa 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 Tnot a ... 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ...

  34. Boolean Convolutions (FFT) Method

  35. Boolean Convolutions (FFT) Method Running Time: One boolean convolution - O(n log m) time # of matches of all symbols - O(n| | log m) time

  36. How do we do counting in less than O(nm) ? Lets count matches rather than mismatches

  37. ... ... For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters. P = a b c d e b b d b g d e f h d c c a b g h h ... T = ... counter increment

  38. ... ... For each character you have a list of offsets where it occurs in the pattern, When you see the char in the text, you increment the appropriate counters. P = a b c d e b b d b g d e f h d c c a b g h h ... T = ... counter increment

  39. ... ... This is fast if all characters are “rare” P = a b c d e b b d b g d e f h d c c a b g h h ... T = ... counter increment

  40. Partition the characters into rare and frequent Rare: occurs ≤ c times in the pattern For rare characters run this scheme with the counters Takes O(nc) time

  41. Frequest chars You have at most m/c of them Do a convolution for each Total cost O(m/c n log(n)).

  42. Fix c Complexity:

  43. Back to the k-mismatch problem Want to beat the O(nk) kangaroo bound Frequent Symbol: a symbol that appears at least times in P.

  44. Few (≤√k) frequent symbols Do the counters scheme for non-frequent Convolve for each frequent O(n ) O(n log n)

  45. (≥√k) frequent symbols Intuition: There cannot be too many places where we match

  46. (≥√k) frequent symbols - Consider frequent symbols. - For each of them consider the first appearances. Do the counters scheme just for these symbols and occurrences

  47. a b a c c a c b a c a b a c c a b a c c a c b a c a b a c c a b b b c c c a a a a b a c b ... ... k = 4, = 4 P = a-mask a b a c c a c b a c a b a c c c-mask T = use a-mask

  48. a b a c c a c b a c a b a c c a b a c c a c b a c a b a c c Example of Masked Counting k = 4, = 4 P = a-mask a b a c c a c b a c a b a c c c-mask a b a c c a c b a c a b a c c T = d a b b b c c c a a a a b a c b ... ... 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 counter ... ...

  49. a b a c c a c b a c a b a c c a b a c c a c b a c a b a c c Example of Masked Counting k = 4, = 4 P = a-mask a b a c c a c b a c a b a c c c-mask a b a c c a c b a c a b a c c T = d a b b b c c c a a a a b a c b ... ... 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 counter ... ...

  50. Counting Stage: Run through text and count occurrences of all marks. Time: O(n ). Important Observations: 1) Sum of all counters 2 n 2) Every counter whose value is less than k already has more than k errors. Why? The total # of elements in all masks is 2 = 2k. For location i of T, if counteri < k then no match at location i.

More Related