A Fast String Searching Algorithm

Robert S. Boyer and J Strother Moore. Communications of the ACM, vol. 20, no. 10, Oct. 1977.


Presentation Transcript


  1. A Fast String Searching Algorithm. Robert S. Boyer and J Strother Moore. Communications of the ACM, vol. 20, no. 10, Oct. 1977.

  2. Outline: • Introduction • The Knuth-Morris-Pratt algorithm • The Boyer-Moore algorithm • Bad Character heuristic • Good Suffix heuristic • Matching Algorithm • Experimental Result • Conclusion

  3. Introduction • String Matching: • Searching for a pattern in a text or a longer string. • If the pattern exists in the string, return the position of the first character of the substring that matches the pattern. • [Figure: the pattern aligned against the string at shift s.]

  4. Introduction (cont.) • Some definitions: • m: the length of the pattern. • n: the length of the string (or text). • s (shift): the offset of the first character of the matched substring from the start of the string. • w ⊏ x: a string w is a prefix of a string x. • w ⊐ x: a string w is a suffix of a string x.

  5. Introduction (cont.) • The naive string-matching algorithm:
       for s ← 0 to n-m
         do if pattern[1..m] = string[s+1..s+m]
           then print “Pattern occurs with shift” s
     • Time Complexity: • Θ((n-m+1)m) in the worst case. • Θ(n²) if m = ⌊n/2⌋.
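The naive matcher above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `naive_search` is an assumption, and shifts are 0-based, so a shift s means the pattern starts at index s of the string.

```python
def naive_search(string, pattern):
    """Return every shift s at which pattern occurs in string (0-based)."""
    n, m = len(string), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible alignments in turn.
    for s in range(n - m + 1):
        # The slice comparison inspects up to m characters, which is
        # where the (n - m + 1) * m worst-case bound comes from.
        if string[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_search("BACBABABAABCBAB", "ABA"))
```

When the pattern is longer than the string, the range is empty and the function returns an empty list.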

  6. Knuth-Morris-Pratt Algorithm • [Figure: the string B A C B A B A B A A B C B A B with the pattern A B A B A C A aligned at shift s, where q characters match; the pattern is then moved to shift s’, where k characters are already known to match.] • s + q = s’ + k

  7. Knuth-Morris-Pratt Algorithm (cont.) • Prefix Function: • f(j) = the largest i < j such that P[1..i] = P[j-i+1..j], or 0 if no such i exists. • [Figure: the prefixes Pq and Pk of the pattern, illustrated on A B A B A.]

  8. Knuth-Morris-Pratt Algorithm (cont.) • Prefix Function Algorithm:
       f[1] ← 0
       k ← 0
       for q ← 2 to m do
         while k > 0 and P[k+1] ≠ P[q] do
           k ← f[k]
         if P[k+1] = P[q] then
           k ← k+1
         f[q] ← k
       return f[1..m]
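A direct Python transcription of the prefix function algorithm might look like the sketch below. The name `prefix_function` is an assumption, and the indices are 0-based: `f[q]` here corresponds to f[q+1] in the slide's 1-based notation, while the stored values (lengths) are unchanged.

```python
def prefix_function(pattern):
    """f[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1]."""
    m = len(pattern)
    f = [0] * m
    k = 0  # length of the prefix currently matched
    for q in range(1, m):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[k] != pattern[q]:
            k = f[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        f[q] = k
    return f


if __name__ == "__main__":
    print(prefix_function("ABABACABABA"))
```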

  9. Knuth-Morris-Pratt Algorithm (cont.) • Example:
       k     1  2  3  4  5  6  7  8  9 10 11
       P[k]  A  B  A  B  A  C  A  B  A  B  A
       f[k]  0  0  1  2  3  0  1  2  3  4  5
     • Time Complexity: • Prefix function: O(m) by amortized analysis • Matching function: O(n) • Total: O(m+n), i.e. linear complexity
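The slides do not spell out the matching function itself. One common way to write it, given here as a hedged sketch rather than the presenters' own code, reuses the prefix function so that the string index never moves backwards. The name `kmp_search` and the 0-based shifts are assumptions.

```python
def kmp_search(string, pattern):
    """Return every 0-based shift at which pattern occurs in string."""
    n, m = len(string), len(pattern)
    if m == 0:
        return []

    # Prefix function, as in the algorithm on the previous slide.
    f = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = f[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        f[q] = k

    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(n):  # i only increases, which gives the O(n) bound
        while q > 0 and pattern[q] != string[i]:
            q = f[q - 1]
        if pattern[q] == string[i]:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = f[q - 1]  # keep going to find overlapping matches
    return shifts


if __name__ == "__main__":
    print(kmp_search("BACBABABAABCBAB", "ABA"))
```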

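  10. The Boyer-Moore Algorithm • Symbols used: • Σ: the alphabet (the set of characters) • patlen: the length of the pattern • m: the number of characters at the end of the pattern that have matched so far (from here on, m no longer denotes the pattern length) • char: the mismatched character in the string • [Figure: the pattern aligned under the string, with char marking the mismatch and m the matched suffix.]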

  11. Characteristics • The pattern is compared from its rightmost character to its leftmost character. • When the pattern is relatively long and Σ is reasonably large, this algorithm is likely to be the most efficient string-matching algorithm.

  12. Bad Character heuristic • Observation 1: • If char does not occur in pat: • Pattern shift: j characters • String pointer shift: patlen characters • Example: [Figure: a mismatch on a character that does not occur in the pattern.]

  13. Bad Character heuristic (cont.) • Observation 2: • char occurs in the pattern. • Let δ1[char] be the position of the rightmost occurrence of char in the pattern, and let j be the current position in the pattern. • If j < δ1[char], shift the pattern right by 1. • If j > δ1[char], shift the pattern right by j - δ1[char]. • δ1[] is an array with one entry per character of Σ.

  14. Bad Character heuristic (cont.) • Example: j = 3 and δ1[B] = 2 • Pattern shift: 1 • String pointer shift: 1 (m + pattern shift)
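Observations 1 and 2 can be combined into one small function. The sketch below is an illustration under stated assumptions: the names `rightmost_positions` and `bad_character_shift` are hypothetical, a dictionary stands in for the array indexed by Σ, and positions are 1-based to match the slides. The pattern "ABC" in the usage lines is only an assumed example that happens to give j = 3 and δ1[B] = 2.

```python
def rightmost_positions(pat):
    """delta1 in the slides' sense: the 1-based position of the
    rightmost occurrence of each character of pat."""
    # Later positions overwrite earlier ones, leaving the rightmost.
    return {c: pos for pos, c in enumerate(pat, start=1)}


def bad_character_shift(pat, j, char):
    """Pattern shift after a mismatch at pattern position j (1-based)
    against the string character char."""
    delta1 = rightmost_positions(pat)
    if char not in delta1:
        return j                # Observation 1: slide past char entirely
    if j < delta1[char]:
        return 1                # rightmost char lies right of j
    return j - delta1[char]     # align the rightmost char with the string


if __name__ == "__main__":
    print(bad_character_shift("ABC", 3, "B"))  # mismatch at j = 3 on B
    print(bad_character_shift("ABC", 3, "X"))  # X does not occur in pat
```

The case j = δ1[char] cannot arise, because a mismatch at j means pat(j) differs from char. The string pointer shift is the pattern shift plus m, the number of characters already matched (patlen - j).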

  15. Good Suffix heuristic • Two sequences [c1..cn] and [d1..dn] unify if, for every i from 1 to n, either ci = di, or ci = $, or di = $, where $ is a character that does not occur in pat. • The position of the rightmost plausible reoccurrence, rpr(j), is the greatest k ≤ patlen such that [pat(j+1)..pat(patlen)] and [pat(k)..pat(k+patlen-j-1)] unify, and either k ≤ 1 or pat(k-1) ≠ pat(j). Positions to the left of pat(1) are treated as $.

  16. Good Suffix heuristic (cont.) • Example: [Figure: a table of j, pat(j), and rpr(j) for a sample pattern.] • Pattern shift: j + 1 - rpr(j) • String pointer shift: m + j + 1 - rpr(j) = patlen - j + j + 1 - rpr(j) = patlen + 1 - rpr(j) = δ2[j]

  17. Good Suffix heuristic (cont.) • Algorithm:

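  18. Boyer-Moore Matching Algorithm • Here δ1(char) and δ2(j) are string pointer shifts: δ1(char) is patlen if char does not occur in pat, otherwise patlen minus the position of the rightmost occurrence of char in pat.
        if n < patlen then return false
        i ← patlen
        j ← patlen
        while j > 0 do {
          if string(i) = pat(j) then { j ← j-1; i ← i-1 }
          else { i ← i + max(δ1(string(i)), δ2(j)); j ← patlen; if i > n then return false }
        }
        return i + 1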
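The matching loop can be sketched end to end in Python as follows. This is a hedged illustration, not the presenters' implementation: the name `boyer_moore_search` is an assumption, δ2 is computed by brute force from the rpr definition rather than by a linear-time routine, and the result is a 0-based index (or -1) instead of the slide's 1-based position or false.

```python
def boyer_moore_search(string, pat):
    """Return the 0-based index of the first occurrence of pat, or -1."""
    n, patlen = len(string), len(pat)
    if patlen == 0:
        return 0

    def p(i):
        # pat(i), 1-based; positions left of pat(1) act as $ (None).
        return pat[i - 1] if 1 <= i <= patlen else None

    def rpr(j):
        # Brute-force rightmost plausible reoccurrence (assumed search
        # order: from min(j+1, patlen) downwards).
        k = min(j + 1, patlen)
        while True:
            ok = all(p(k + t) is None or p(k + t) == p(j + 1 + t)
                     for t in range(patlen - j))
            if ok and (k <= 1 or p(k - 1) != p(j)):
                return k
            k -= 1

    # delta1 as a string pointer shift: patlen - rightmost position.
    delta1 = {c: patlen - pos for pos, c in enumerate(pat, start=1)}
    # delta2[j] for j = 1..patlen; index 0 is unused padding.
    delta2 = [0] + [patlen + 1 - rpr(j) for j in range(1, patlen + 1)]

    i = patlen  # 1-based string pointer, as in the slides
    while i <= n:
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:
            i -= 1
            j -= 1
        if j == 0:
            return i  # 1-based start is i + 1, so 0-based start is i
        i += max(delta1.get(string[i - 1], patlen), delta2[j])
    return -1


if __name__ == "__main__":
    text = "WHICH-FINALLY-HALTS.--AT-THAT-POINT"
    print(boyer_moore_search(text, "AT-THAT"))
```

Taking the maximum of the two shifts is what guarantees progress: δ2[j] always moves the string pointer past the end of the previous alignment, even when δ1 alone would not.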

  19. Boyer-Moore Matching Algorithm • Time Complexity: • Bad Character heuristic: O(patlen) • Good Suffix heuristic: O(patlen) • Matching: O(n) • Total: O(n+patlen)

  20. Experimental Result

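  21. Conclusion • The Boyer-Moore algorithm runs in O(n+m) time in the worst case and is sublinear on average: it typically inspects fewer than n characters of the string. • Boyer-Moore is the most efficient string-matching algorithm when the pattern is long and the alphabet is reasonably large.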
