1 / 48

Boyer Moore Algorithm

Boyer Moore Algorithm. Idan Szpektor. Boyer and Moore. What It’s About. A String Matching Algorithm Preprocess a Pattern P (|P| = n) For a text T (| T| = m), find all of the occurrences of P in T Time complexity: O(n + m), but usually sub-linear. Right to Left (like in Hebrew).

Jimmy
Download Presentation

Boyer Moore Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Boyer Moore Algorithm Idan Szpektor

  2. Boyer and Moore

  3. What It’s About • A String Matching Algorithm • Preprocess a Pattern P (|P| = n) • For a text T(| T| = m), find all of the occurrences of P in T • Time complexity: O(n + m), but usually sub-linear

  4. Right to Left (like in Hebrew) • Matching the pattern from right to left • For a pattern abc: ↓ T:bbacdcbaabcddcdaddaaabcbcb P:abc • Worst case is still O(n m)

  5. The Bad Character Rule (BCR) • On a mismatch between the pattern and the text, we can shift the pattern by more than one place. Sublinearity! ddbbacdcbaabcddcdaddaaabcbcb acabc ↑

  6. BCR Preprocessing • A table, for each position in the pattern and a character, the size of the shift. O(n |Σ|) space. O(1) access time. a b a c b: 1 2 3 4 5 • A list of positions for each character. O(n + |Σ|) space. O(n) access time, But in total O(m).

  7. BCR - Summary • On a mismatch, shift the pattern to the right until the first occurrence of the mismatched char in P. • Still O(n m) worst case running time: T: aaaaaaaaaaaaaaaaaaaaaaaaa P: abaaaa

  8. The Good Suffix Rule (GSR) • We want to use the knowledge of the matched characters in the pattern’s suffix. • If we matched S characters in T, what is (if exists) the smallest shift in P that will align a sub-string of P of the same S characters ?

  9. GSR (Cont…) • Example 1 – how much to move: ↓ T: bbacdcbaabcddcdaddaaabcbcb P: cabbabdbab cabbabdbab

  10. GSR (Cont…) • Example 2 – what if there is no alignment: ↓ T: bbacdcbaabcbbabdbabcaabcbcb P: bcbbabdbabc bcbbabdbabc

  11. GSR - Detailed • We mark the matched sub-string in T with t and the mismatched char with x • In case of a mismatch: shift right until the first occurrence of t in P such that the next char y in P holds y≠x • Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.

  12. Boyer Moore Algorithm • Preprocess(P) • k := n while (k ≤ m) do • Match P and T from right to left starting at k • If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule). • else, print the occurrence and shift P right (advance k) by the good suffix rule.

  13. Algorithm Correctness • The bad character rule shift never misses a match • The good suffix rule shift never misses a match

  14. Preprocessing the GSR – L(i) • L(i) – The biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n] 1 2 3 4 5 6 7 8 9 10 11 12 13 P: b b a b b a a b b c a b b L: 0 0 0 0 0 0 0 0 0 5 9 0 12

  15. Preprocessing the GSR – l(i) • l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P P: b b a b b a a b b c a b b l: 2 2 2 2 2 2 2 2 2 2 2 1

  16. Using L(i) and l(i) in GSR • If mismatch occurs at position n, shift P by 1 • If a mismatch occurs at position i-1 in P: • If L(i) > 0, shift P by n – L(i) • else shift P by n – l(i) • If P was found, shift P by n – l(2)

  17. Building L(i) and l(i) – the Z • For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s. s: bb a c d c b b a a b b c d d Z: 1 0 0 0 0 31 0 0 21 0 0 0 • Naively, we can build Z in O(n^2)

  18. From Z to N • N(i) is the longest suffix of P[1..i] that is also a suffix of P. • N(i) is Z(i), built over P reversed. s: d d c b b a a b b c d c a bb N: 0 0 0 12 0 0 13 0 0 0 0 1

  19. Building L(i) in O(n) • L(i) – The biggest index j < n, such that prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n] • L(i) – The biggest index j < n such that: N(j) == | P[i..n] | == n – i + 1 • for i := 1 to n, L(i) := 0 • for j := 1 to n-1 • i := n – N(j) + 1 • L(i) := j

  20. Building l(i) in O(n) • l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P • l(i) – The biggest j <= | P[i..n] | == n – i + 1 such that N(j) == j • k := 0 • for j := 1 to n-1 • If(N(j) == j), k := j • l(n – j + 1) := k

  21. Building Z in O(n) • For calculating Z(i), we want to use the previously calculated Z(1)…Z(i-1) • For each I we remember the right most Z(j):j, such that j < i and j + Z(j) >= k + Z(k), for all k < i

  22. Building Z in O(n) (Cont…) ↑ ↑ ↑ ↑ • S i’ j i • If i < j + Z(j), s[i … j + Z(j) - 1] appeared previously, starting at i’ = i – j + 1. • Z(i’) < Z(j) – (i - j) ?

  23. Building Z in O(n) (Cont…) • For Z(2) calculate explicitly • j := 2, i := 3 • While i <= |s|: • if i >= j + Z(j), calculate Z(i) explicitly • else • Z(i) := Z(i’) • If Z(i’) >= Z(j) – (i - j), calculate Z(i) tail explicitly • If j + Z(j) < i + Z(i), j := i

  24. Building Z in O(n) - Analysis • The algorithm builds Z correctly • The algorithm executes in O(n) • A new character is matched only once • All other operations are in O(1)

  25. Boyer Moore Worst Case Analysis • Assume P consists of n copies of a single char and T consists of m copies of the same char: T: aaaaaaaaaaaaaaaaaaaaaaaaa P: aaaaaa • Boyer Moore Algorithm runs in Θ(m n) when finding all the matches

  26. The Galil Rule • In a specific matching phase, We mark with k the position in T of the right end of P. We mark with s the position of last matched char in this phase. s k k’ T:bbacdcbaabcddcdaddaaabcbcb P:abaab abaab

  27. The Galil Rule (Cont…) • All the chars in position s < j ≤ k are known to be matching. The algorithm doesn’t need to check them. • An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).

  28. Don’t Sleep Yet…

  29. O(n + m) proof - Outline • Preprocess in O(n) – already proved • Properties of strings • Proof of search in O(m) if P is not in T, using only the good suffix rule. • Proof of search in O(m) even if P is in T, adding the Galil rule.

  30. Properties of Strings • If for two strings δ, γ: δγ = γδ then there is a string β such that δ = βi and γ = βj, i, j > 0 - Proof by induction • Definition: A string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β. β’ β β β

  31. Properties of Strings (Cont…) • A string is prefix semiperiodic if it contains one or more complete copies of β followed by a non-empty prefix of β. • A string isprefix semiperiodic iff it is semiperiodic with the same length period

  32. Lemma 1 • Suppose P occurs in T starting at position p and also at position q, q > p. If q – p ≤  n/2  then P is semiperiodic with periodα = P[n-(q-p)+1…n] p α α’ α α q α’ α α α

  33. Proof - when P is Not Found in T • We have R rounds during the search. • After each round the good suffix rule decides on a right shift of si chars. • Σsi ≤ m • We shall use Σsi as an upper bound.

  34. Proof (Cont…) • For each round we count the matched chars by: • fi – the number of chars matched for the first time • gi –the number of chars already matched in previous rounds. • Σfi = m • We want to prove that gi ≤ 3si ( Σgi≤ 3m).

  35. Proof (Cont…) • Each round don’t find P  it matched a substring ti and one bad char xi in T (xiti T) T: bbacdcbaabcbbabdbabcaabcbcb P: bdbabc • |ti|+1 ≤ 3si gi ≤ 3si (because gi + fi = |ti|+1) • For the rest of the proof we assume that for the specific round i: |ti| + 1 > 3si

  36. Lemma 2 (|ti| + 1 > 3si) • In round i we look at the matched suffix of P, marked P*. P*= yi ti, yi≠xi. • Both P* and ti are semiperiodic with period α of length si and hence with minimal length period β, α = βk. • Proof: by Lemma 1.

  37. Lemma 3 (|ti| + 1 > 3si) • Suppose P overlapped ti during round i. We shall examine in what ways could P overlap ti in previous rounds. • In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in ti. - proof: • Both round h and i fail at char xi • two cases of possible shift after round h are invalid

  38. Lemma 4 (|ti| + 1 > 3si) • In round h < i, P can correctly match at most |β|-1 chars in ti.  • By Lemma 3, P is not aligned with a right end of ti in phase h. • Thus if it matched |β| chars or more there is a suffix γ of β followed by a prefix δ of β such that δγ = γδ. • By the string properties there is a substring μ such that β = μk, k>1. • This contradicts the minimal period size property of β.

  39. Lemma 5 (|ti| + 1 > 3si) • If in round h < i the right end of P is aligned with a char in ti, it can only be aligned with one of the following: • One of the left-most |β|-1 chars of ti • One of the right-most |β| chars of ti -proof: • If not, By Lemma 3,4, max |β|-1 chars are matched and only from the middle of a β copy, while there are at least |β| • A shift cannot pass the right end of that β copy

  40. Proof (Cont…) • If |ti| + 1 > 3si then gi ≤ 3si  • Using Lemma 5, in previous rounds we could match only the bad char xi, the last |β|-1 chars in ti or start from the first |β| right chars in ti. • In the last case, using Lemma 4, we can only match up to |β|-1 chars • in total we could previously match:gi = 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3si

  41. Proof - Final • Number of matches = ∑(fi + gi) =∑fi + ∑gi ≤ m + ∑3si ≤ m + 3m = 4m

  42. Proof - when P is Found in T • Split the rounds to two groups: • “match” rounds –an occurrence of P in T was found. • “mismatch” rounds –P was not found in T. • we have proved O(m) for “mismatch” rounds.

  43. Proof (Cont…) • After P was found in T, P will be shifted by a constant length s. (s = n – l(2)). • |n| + 1 ≤ 3s ∑ matches in round i ≤ ∑3s≤ m • For the rest of the proof we assume that:|n| + 1 > 3s

  44. Proof (|n| + 1 > 3s) • By Lemma 1, P is semiperiodic with minimal length period β, |β| = s. • If round i+1 is also a “match” round then, by the Galil rule, only the new |β| chars are compared. • A contiguous series of “match” rounds, i…i+k is called a “run”.

  45. Proof (|n| + 1 > 3s) • ∑ The length of a “run”, not including chars that where already matched in previous “runs” ≤ m • How many chars in a “run” where already matched in previous “runs”?

  46. Lemma (|n| + 1 > 3s) • Suppose k-1 was a “match” round and k is a “mismatch” round that ends the “run”. • If k’ > k is the first “match” round then it overlaps at most |β|-1 chars with the previous “run” (ended by round k-1).  • The left end of P at round k’ cannot be aligned with the left end of a full copy of |β| at round k-1. • As a result, P cannot overlap |β| chars or more with round k-1.

  47. Proof (|n| + 1 > 3s) • By the Lemma and because the shift after every “match” round is of |β|, only the first round of a “run” can overlap, and only with the last previous “run”.  • ∑ The length of the chars that where already matched in previous “runs” ≤ m

  48. Proof (|n| + 1 > 3s) - Final • ∑ The length of a “run” = ∑ The length of a “run”, not including chars that where already matched in previous “runs” + ∑ The length of the chars that where already matched in previous “runs” ≤ m + m

More Related