
Outline

Outline. String Matching Introduction Naïve Algorithm Rabin-Karp Algorithm Knuth-Morris-Pratt (KMP) Algorithm. Introduction. What is string matching ? Finding all occurrences of a pattern in a given text (or body of text ) Many applications

mhyland

Presentation Transcript


  1. Outline String Matching • Introduction • Naïve Algorithm • Rabin-Karp Algorithm • Knuth-Morris-Pratt (KMP) Algorithm

  2. Introduction • What is string matching? • Finding all occurrences of a pattern in a given text (or body of text) • Many applications • While using editor/word processor/browser • Login name & password checking • Virus detection • Header analysis in data communications • DNA sequence analysis, Web search engines (e.g. Google), image analysis

  3. String-Matching Problem • The text is in an array T [1..n] of length n • The pattern is in an array P [1..m] of length m • Elements of T and P are characters from a finite alphabet Σ • E.g., Σ = {0, 1} or Σ = {a, b, …, z} • Usually T and P are called strings of characters

  4. String-Matching Problem …contd • We say that pattern P occurs with shift s in text T if: • 0 ≤ s ≤ n-m and • T [(s+1)..(s+m)] = P [1..m] • If P occurs with shift s in T, then s is a valid shift; otherwise s is an invalid shift • String-matching problem: finding all valid shifts for a given T and P

  5. Example 1 • [Figure: text T (positions 1–13) with pattern P (positions 1–4) aligned at shift s = 3] • Shift s = 3 is a valid shift (n = 13, m = 4, and 0 ≤ s ≤ n-m holds)

  6. Example 2 • [Figure: pattern P (positions 1–4) and text T (positions 1–13), with shifts s = 3 and s = 9 marked; the characters a a b b a a a a survive from the figure]

  7. Naïve String-Matching Algorithm • Input: text strings T [1..n] and P [1..m] • Result: all valid shifts displayed
      NAÏVE-STRING-MATCHER (T, P)
        n ← length[T]
        m ← length[P]
        for s ← 0 to n-m
          if P[1..m] = T [(s+1)..(s+m)]
            print “pattern occurs with shift” s
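The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `naive_string_matcher` is an assumption, and Python strings are 0-indexed, so the slide's T[(s+1)..(s+m)] becomes `text[s:s + m]`. The shift s keeps the same meaning.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s, i.e. every s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):        # s = 0, 1, ..., n-m
        if text[s:s + m] == pattern:  # compares up to m characters
            shifts.append(s)
    return shifts


# Text and pattern taken from the slides' own example.
print(naive_string_matcher("abcabca", "abca"))  # [0, 3]
```

Returning the shifts as a list, rather than printing them, is a design choice made here so the result can be checked or reused.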

  8. Naïve Algorithm • The naïve algorithm checks, at every position in the text from 0 to n-m, whether an occurrence of the pattern starts there. • After each attempt, it shifts the pattern by exactly one position to the right. • Example (from left to right): text a b c a b c a, pattern a b c a, tried at shift = 0, then shift = 1, shift = 2, shift = 3

  9. Analysis: Worst-case Example • [Figure: pattern P (positions 1–4) aligned against text T (positions 1–13); the characters a a a a a a a a a b b b survive from the figure]

  10. Worst-case Analysis • There are m comparisons for each shift in the worst case • There are n-m+1 shifts • So, the worst-case running time is Θ((n-m+1)m) • In the example on the previous slide, we have (13-4+1)·4 = 40 comparisons in total • The naïve method is inefficient because information gained at one shift is not reused at later shifts
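The (n-m+1)·m bound can be checked by counting character comparisons directly. The sketch below is an illustration only: `count_comparisons` is a hypothetical helper, and the input (a text of 13 a's with pattern "aaab") is an assumed worst-case instance with n = 13 and m = 4, not necessarily the exact strings in the slide's figure.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break  # mismatch: give up on this shift
    return comparisons


# Assumed worst-case input: every shift needs all m = 4 comparisons.
print(count_comparisons("a" * 13, "aaab"))  # 40 == (13 - 4 + 1) * 4
```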

  11. Naïve Algorithm • Example (from right to left): text a b c a b c a, pattern a b c a, tried at shift = 3, then shift = 2, shift = 1, shift = 0 • The pattern occurs with shifts 0 and 3

  12. Rabin-Karp Algorithm • Has a worst-case running time of O((n-m+1)m), but the average case is O(n+m) • Also works well in practice • Based on the number-theoretic notion of modular equivalence • We assume that Σ = {0, 1, 2, …, 9}, i.e., each character is a decimal digit • In general, use radix d, where d = |Σ|

  13. Rabin-Karp Approach • We can view a string of k characters (digits) as a length-k decimal number • E.g., the string “31425” corresponds to the decimal number 31,425 • Given a pattern P [1..m], let p denote the corresponding decimal value • Given a text T [1..n], let t_s denote the decimal value of the length-m substring T [(s+1)..(s+m)] for s = 0, 1, …, n-m

  14. The Rabin-Karp algorithm

  15. The Rabin-Karp algorithm

  16. Rabin-Karp Approach …contd • t_s = p iff T [(s+1)..(s+m)] = P [1..m] • s is a valid shift iff t_s = p • p can be computed in O(m) time • p = P[m] + 10 (P[m-1] + 10 (P[m-2] + …)) • t_0 can similarly be computed in O(m) time • The remaining values t_1, t_2, …, t_(n-m) can be computed in O(n-m) time, since t_(s+1) can be computed from t_s in constant time
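The nested formula for p is Horner's rule: one multiply and one add per digit, which is where the O(m) bound comes from. A minimal sketch, assuming the pattern is a string of decimal digits (the name `decimal_value` is illustrative):

```python
def decimal_value(digits: str) -> int:
    """Evaluate a digit string with Horner's rule in O(len(digits)) steps."""
    value = 0
    for ch in digits:                 # most significant digit first
        value = 10 * value + int(ch)
    return value


print(decimal_value("31415"))  # 31415
```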

  17. Rabin-Karp Approach …contd • t_(s+1) = 10(t_s - 10^(m-1)·T [s+1]) + T [s+m+1] • E.g., if T = {…, 3, 1, 4, 1, 5, 2, …}, m = 5 and t_s = 31,415, then t_(s+1) = 10(31415 – 10000·3) + 2 = 14152 • Thus we can compute p in Θ(m) time and t_0, t_1, t_2, …, t_(n-m) in Θ(n-m+1) time • So we can find all occurrences of the pattern P[1..m] in the text T[1..n] with Θ(m) preprocessing time and Θ(n-m+1) matching time • But there is a problem: this assumes p and t_s are small numbers • They may be too large to work with easily
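The constant-time update can be sketched as below, without any modulus yet. The function name `window_values` is an assumption made for illustration. With 0-based Python indexing, the slide's T[s+1] is `text[s]` and T[s+m+1] is `text[s + m]`.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_(n-m) for a text of decimal digits."""
    n = len(text)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)   # weight of the high-order digit
    t = int(text[:m])      # t_0, computed once in O(m)
    values = [t]
    for s in range(n - m):
        # Drop the old high-order digit, shift left, append the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```

Python integers never overflow, so this version stays exact, but each arithmetic step is only constant time while the values fit in a machine word.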

  18. Rabin-Karp Approach …contd • Solution: we can use modular arithmetic with a suitable modulus, q • E.g., • t_(s+1) ≡ (10(t_s – T[s+1]·h) + T [s+m+1]) (mod q) • where h ≡ 10^(m-1) (mod q) • q is chosen as a small prime number, e.g., 13 for radix 10 • Generally, if the radix is d, then dq should fit within one computer word

  19. How values modulo 13 are computed • The window 3 1 4 1 5 slides to 1 4 1 5 2: the old high-order digit is 3, the new low-order digit is 2 • 31415 ≡ 7 (mod 13) and 14152 ≡ 8 (mod 13) • 14152 ≡ ((31415 – 3·10000)·10 + 2) (mod 13) ≡ ((7 – 3·3)·10 + 2) (mod 13) ≡ 8 (mod 13)
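The modular update above can be reproduced in a few lines. This is a sketch with assumed names (`next_hash` and its parameters are illustrative). It relies on Python's `%` returning a non-negative result for a positive modulus, so the intermediate value -18 reduces to 8 without extra handling.

```python
def next_hash(t_s: int, old_digit: int, new_digit: int,
              h: int, d: int = 10, q: int = 13) -> int:
    """Compute t_(s+1) mod q from t_s mod q in constant time."""
    return (d * (t_s - old_digit * h) + new_digit) % q


h = pow(10, 4, 13)               # h = 10^(m-1) mod q with m = 5
print(h)                         # 3
print(next_hash(7, 3, 2, h))     # 8
```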

  20. Problem of Spurious Hits • t_s ≡ p (mod q) does not imply that t_s = p • Modular equivalence does not necessarily mean that two integers are equal • A case in which t_s ≡ p (mod q) when t_s ≠ p is called a spurious hit • On the other hand, if two integers are not modular equivalent, then they cannot be equal

  21. Example • Pattern P = 3 1 4 1 5, with p mod 13 = 7 • Text T (positions 1–14): 2 3 1 4 1 5 2 6 7 3 9 9 2 1 • t_s mod 13 for s = 0, …, 9: 1 7 8 4 5 10 11 7 9 11 • s = 1 (window 3 1 4 1 5) is a valid match; s = 7 (window 6 7 3 9 9) is a spurious hit

  22. Rabin-Karp Algorithm • Basic structure like the naïve algorithm, but uses modular arithmetic as described • For each hit, i.e., for each s where t_s ≡ p (mod q), verify character by character whether s is a valid shift or a spurious hit • In the worst case, every shift is verified • Running time can be shown as O((n-m+1)m) • Average-case running time is O(n+m)
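Putting the pieces together, a Rabin-Karp matcher might look like the sketch below. It assumes the text and pattern are strings of decimal digits (radix d = 10) and uses q = 13 as in the slides. The function name and the choice to return spurious hits separately are illustrative assumptions, made here so the hit verification is visible.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)              # d^(m-1) mod q
    p = t = 0
    for i in range(m):                # Horner's rule, reduced mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                    # a hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                 # roll the window one position right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


# Text and pattern from the slides' mod-13 example.
print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

With such a small modulus, spurious hits are frequent; a larger prime q would make them rarer at the same cost per shift.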

  23. 3. The KMP Algorithm • The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm). • But it shifts the pattern more intelligently than the brute force algorithm.

  24. If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? • Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]

  25. Example • [Figure: text T and pattern P aligned; the mismatch occurs at j = 5 and the new value is jnew = 2]

  26. Why j == 5 • Find the largest prefix (start) of "a b a a b" (P[0..j-1]) which is a suffix (end) of "b a a b" (P[1..j-1]) • Answer: "a b" • Set j = 2 // the new j value

  27. KMP Failure Function • KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. • j = mismatch position in P[] • k = position before the mismatch (k = j-1). • The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

  28. Failure Function Example (k == j-1) • P: "abaaba", j: 0 1 2 3 4 5 • Table: j = 0, 1, 2, 3, 4 → F(j) = 0, 0, 1, 1, 2 • In code, F() is represented by an array, like the table. F(k) is the size of the largest prefix.

  29. Why is F(4) == 2? P: "abaaba" • F(4) means • find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size of the largest prefix of "abaab" that is also a suffix of "baab" = find the size of "ab" = 2
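The failure function can be computed for the whole pattern in O(m) time by reusing earlier entries. The sketch below uses the standard linear-time construction, which the slides do not spell out; the name `failure_function` is an assumption. It returns an entry for every index, including the last one, which the slide's table omits.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of pattern[0..k]
    that is also a suffix of pattern[1..k]."""
    m = len(pattern)
    fail = [0] * m
    k = 0                              # length of the currently matched prefix
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]            # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```

The first five entries agree with the table for "abaaba", including F(4) == 2.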

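  30. Using the Failure Function • Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm. • If a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then k = j-1; j = F(k); // obtain the new j • If the mismatch occurs at j == 0, there is no prefix to reuse, so move on to the next text character (i = i+1)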
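A complete matcher built on this rule might look like the sketch below. It is an illustration under stated assumptions: the function name `kmp_search` is hypothetical, the failure table is built with the standard linear-time construction, and the matcher continues after a full match so that it reports every valid shift. Note that the text index `i` never moves backwards.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Build the failure table for the pattern.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text left to right.
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:                 # full match ending at i-1
                matches.append(i - m)
                j = fail[m - 1]        # keep going for further matches
        elif j > 0:
            j = fail[j - 1]            # k = j-1; j = F(k)
        else:
            i += 1                     # mismatch at j == 0: advance the text
    return matches


print(kmp_search("abcabca", "abca"))  # [0, 3]
```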

  31. Example • [Figure: text T and pattern P aligned] • Table: k = 0, 1, 2, 3, 4 → F(k) = 0, 0, 1, 0, 1

  32. Why is F(4) == 1? P: "abacab" • F(4) means • find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size of the largest prefix of "abaca" that is also a suffix of "baca" = find the size of "a" = 1

  33. KMP Advantages • KMP runs in optimal time: O(m+n) • very fast • The algorithm never needs to move backwards in the input text, T • this makes the algorithm good for processing very large files that are read in from external devices or through a network stream

  34. KMP Disadvantages • KMP doesn’t work as well as the size of the alphabet increases • more chance of a mismatch (more possible mismatches) • mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later
