Boyer moore algorithm

Boyer Moore Algorithm - PowerPoint PPT Presentation



Boyer Moore Algorithm. Idan Szpektor. Boyer and Moore. What It’s About. A String Matching Algorithm Preprocess a Pattern P (|P| = n) For a text T (| T| = m), find all of the occurrences of P in T Time complexity: O(n + m), but usually sub-linear. Right to Left (like in Hebrew).


PowerPoint Slideshow about 'Boyer Moore Algorithm' - Jimmy


Presentation Transcript

Boyer Moore Algorithm

Idan Szpektor



What It’s About

  • A String Matching Algorithm

  • Preprocess a Pattern P (|P| = n)

  • For a text T (|T| = m), find all of the occurrences of P in T

  • Time complexity: O(n + m), but usually sub-linear


Right to Left (like in Hebrew)

  • Matching the pattern from right to left

  • For a pattern abc:

    T:bbacdcbaabcddcdaddaaabcbcb

    P:abc

  • Worst case is still O(n m)


The Bad Character Rule (BCR)

  • On a mismatch between the pattern and the text, we can shift the pattern by more than one place.

    Sublinearity!

    ddbbacdcbaabcddcdaddaaabcbcb

    acabc


BCR Preprocessing

  • A table giving, for each position in the pattern and each character, the size of the shift. O(n |Σ|) space. O(1) access time.

    P:   a b a c b

    pos: 1 2 3 4 5

  • A list of positions for each character. O(n + |Σ|) space. O(n) access time, but O(m) in total over the whole search.


BCR - Summary

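  • On a mismatch, shift the pattern to the right until the closest occurrence in P of the mismatched text char, to the left of the mismatch position, is aligned with it. If there is no such occurrence, shift P completely past that char.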

  • Still O(n m) worst case running time:

    T: aaaaaaaaaaaaaaaaaaaaaaaaa

    P: abaaaa
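
The rule can be sketched in Python as follows. This is a minimal illustration, not the presentation's own code: it uses the "list of positions for each character" variant, 0-based indices, and the hypothetical names `bad_char_positions` and `bcr_search`. After a full match it simply shifts by one, since the bad character rule alone says nothing about that case.

```python
from collections import defaultdict


def bad_char_positions(pattern):
    """Map each character to its 0-based positions in pattern, rightmost first."""
    positions = defaultdict(list)
    for idx in range(len(pattern) - 1, -1, -1):
        positions[pattern[idx]].append(idx)
    return positions


def bcr_search(pattern, text):
    """Return all start indices of pattern in text, shifting by the bad character rule only."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    positions = bad_char_positions(pattern)
    matches = []
    start = 0  # index in text aligned with the left end of the pattern
    while start <= m - n:
        i = n - 1
        # Compare from right to left.
        while i >= 0 and pattern[i] == text[start + i]:
            i -= 1
        if i < 0:
            matches.append(start)
            start += 1  # assumption: shift by one after a full match
        else:
            bad = text[start + i]
            # Closest occurrence of the bad character to the left of i, or -1 if none.
            left = next((p for p in positions.get(bad, []) if p < i), -1)
            start += i - left
    return matches
```

For the text and the pattern abc from the "Right to Left" slide, `bcr_search("abc", "bbacdcbaabcddcdaddaaabcbcb")` returns `[8, 20]` (0-based start positions).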


The Good Suffix Rule (GSR)

  • We want to use the knowledge of the matched characters in the pattern’s suffix.

  • If we matched S characters in T, what is the smallest shift of P (if one exists) that aligns another sub-string of P, consisting of the same S characters, with them?


GSR (Cont…)

  • Example 1 – how much to move:

    T: bbacdcbaabcddcdaddaaabcbcb

    P: cabbabdbab

    cabbabdbab


GSR (Cont…)

  • Example 2 – what if there is no alignment:

    T: bbacdcbaabcbbabdbabcaabcbcb

    P: bcbbabdbabc

    bcbbabdbabc


GSR - Detailed

  • We mark the matched sub-string in T with t and the mismatched char in T with x

  • In case of a mismatch: shift right to the right-most other occurrence of t in P whose preceding char differs from the pattern char that mismatched against x

  • Otherwise (no such occurrence), shift right so that the largest prefix of P that matches a suffix of t is aligned with that suffix.


Boyer Moore Algorithm

  • Preprocess(P)

  • k := n

    while (k ≤ m) do

    • Match P and T from right to left starting at k

    • If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).

    • else, print the occurrence and shift P right (advance k) by the good suffix rule.


Algorithm Correctness

  • The bad character rule shift never misses a match

  • The good suffix rule shift never misses a match


Preprocessing the GSR – L(i)

  • L(i) – The biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]. L(i) = 0 if there is no such j.

    i: 1 2 3 4 5 6 7 8 9 10 11 12 13

    P: b b a b b a a b b c  a  b  b

    L: 0 0 0 0 0 0 0 0 0 0  9  2  12


Preprocessing the GSR – l(i)

  • l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P

    i: 1 2 3 4 5 6 7 8 9 10 11 12 13

    P: b b a b b a a b b c  a  b  b

    l: - 2 2 2 2 2 2 2 2 2  2  2  1


Using L(i) and l(i) in GSR

  • If mismatch occurs at position n, shift P by 1

  • If a mismatch occurs at position i-1 in P:

    • If L(i) > 0, shift P by n – L(i)

    • else shift P by n – l(i)

  • If P was found, shift P by n – l(2)
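
The case analysis above can be written as a small Python function. This is only a sketch: the name `good_suffix_shift` and the convention of passing 0 for "P was found" are assumptions made for the illustration. The tables are 1-based lists (index 0 unused), so the code reads like the slide.

```python
def good_suffix_shift(mismatch_pos, n, big_l, small_l):
    """Shift of P after a mismatch at 1-based pattern position mismatch_pos.

    big_l[i] and small_l[i] hold L(i) and l(i); index 0 is unused.
    mismatch_pos == 0 is used here (by assumption) to mean "P was found".
    """
    if mismatch_pos == 0:
        # Full match: shift by n - l(2); a one-character pattern just moves by 1.
        return n - small_l[2] if n > 1 else 1
    if mismatch_pos == n:
        # Nothing was matched yet, so there is no good suffix to reuse.
        return 1
    i = mismatch_pos + 1  # the matched suffix is P[i..n]
    if big_l[i] > 0:
        return n - big_l[i]
    return n - small_l[i]
```

For example, with n = 13 and L(11) = 9, a mismatch at position 10 gives a shift of 13 – 9 = 4.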


Building L(i) and l(i) – the Z

  • For a string s, Z(i) is the length of the longest sub-string of s starting at i that matches a prefix of s.

    i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    s: b b a c d c b b a a  b  b  c  d  d

    Z: - 1 0 0 0 0 3 1 0 0  2  1  0  0  0

  • Naively, we can build Z in O(n^2)
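
A direct translation of the definition, as a minimal Python sketch. The name `z_naive` is hypothetical, indices are 0-based (so `z[i - 1]` is the slide's Z(i)), and `z[0]` is set to the length of s by convention, since s matches itself in full.

```python
def z_naive(s):
    """Z values of s by direct comparison, O(n^2) in the worst case."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n  # convention: the whole string is a prefix of itself
    for i in range(1, n):
        # Extend the match between s[i:] and the prefix of s one char at a time.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z
```

For the string on this slide, `z_naive("bbacdcbbaabbcdd")[1:]` gives 1 0 0 0 0 3 1 0 0 2 1 0 0 0.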


From Z to N

  • N(i) is the length of the longest suffix of P[1..i] that is also a suffix of P.

  • N(i) is Z(n – i + 1), built over P reversed.

    i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    s: d d c b b a a b b c  d  c  a  b  b

    N: 0 0 0 1 2 0 0 1 3 0  0  0  0  1  -
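
The relation between N and Z can be checked with a short Python sketch. The helper names `z_values` and `n_values` are assumptions for this illustration. The Z helper is the naive quadratic one, chosen for brevity. Lists are 0-based, so `n_vals[j - 1]` holds N(j).

```python
def z_values(s):
    """Naive Z values; z[0] is set to len(s) since s matches itself in full."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    for i in range(1, n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def n_values(p):
    """Return a list whose entry j - 1 is N(j), for j = 1..n."""
    z_rev = z_values(p[::-1])
    n = len(p)
    # 1-based N(j) = Z_rev(n - j + 1); in 0-based lists that is z_rev[n - j].
    return [z_rev[n - j] for j in range(1, n + 1)]
```

With this convention the last entry, N(n), equals n, because P is trivially a suffix of itself.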


Building L(i) in O(n)

  • L(i) – The biggest index j < n, such that prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

  • L(i) – The biggest index j < n such that: N(j) == | P[i..n] | == n – i + 1

  • for i := 1 to n, L(i) := 0

  • for j := 1 to n-1

    • if N(j) > 0 (otherwise i would be n + 1, outside the table)

      • i := n – N(j) + 1

      • L(i) := j
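
The loop above, as a minimal Python sketch. The name `build_big_l` is hypothetical, and the N values are assumed to be given as a 1-based list (index 0 unused). Entries with N(j) = 0 are skipped, because they would map to position n + 1.

```python
def build_big_l(n_vals):
    """Build L(1..n) from N(1..n); both lists are 1-based with index 0 unused."""
    n = len(n_vals) - 1
    big_l = [0] * (n + 1)
    for j in range(1, n):  # j = 1 .. n-1
        if n_vals[j] > 0:
            i = n - n_vals[j] + 1
            big_l[i] = j  # later (larger) j overwrite earlier ones, so the biggest j wins
    return big_l
```

Because j increases through the loop, each L(i) ends up holding the biggest qualifying j, as the definition requires.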


Building l(i) in O(n)

  • l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P

  • l(i) – The biggest j <= | P[i..n] | == n – i + 1 such that N(j) == j

  • k := 0

  • for j := 1 to n-1

    • if N(j) == j, k := j

    • l(n – j + 1) := k
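
The same loop as a minimal Python sketch, with the hypothetical name `build_small_l` and a 1-based list of N values (index 0 unused). The loop never writes l(1), so that entry is left at 0 here.

```python
def build_small_l(n_vals):
    """Build l(2..n) from N(1..n); both lists are 1-based with index 0 unused."""
    n = len(n_vals) - 1
    small_l = [0] * (n + 1)
    k = 0  # longest prefix of P seen so far that is also a suffix of P
    for j in range(1, n):  # j = 1 .. n-1
        if n_vals[j] == j:
            k = j
        small_l[n - j + 1] = k
    return small_l
```

N(j) == j means that the whole prefix P[1..j] is also a suffix of P, which is exactly the condition in the second bullet.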


Building Z in O(n)

  • For calculating Z(i), we want to use the previously calculated Z(1)…Z(i-1)

  • For each i we remember the index j < i whose match reaches furthest to the right, i.e. j + Z(j) >= k + Z(k) for all k < i


Building Z in O(n) (Cont…)

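    [Figure: the string s with arrows marking its start, i’, j and i]

  • If i < j + Z(j), s[i … j + Z(j) - 1] appeared previously, starting at i’ = i – j + 1.

  • If Z(i’) < Z(j) – (i - j), then Z(i) = Z(i’); otherwise Z(i) is at least Z(j) – (i - j) and the match has to be extended explicitly.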


Building Z in O(n) (Cont…)

  • For Z(2) calculate explicitly

  • j := 2, i := 3

  • While i <= |s|:

    • if i >= j + Z(j), calculate Z(i) explicitly

    • else

      • Z(i) := min(Z(i’), Z(j) – (i - j)), where i’ = i – j + 1

      • If Z(i’) >= Z(j) – (i - j), calculate Z(i) tail explicitly

    • If j + Z(j) < i + Z(i), j := i

    • i := i + 1
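
A compact Python sketch of this linear-time construction. It is the usual formulation with a remembered interval rather than a line-by-line copy of the slide. The name `z_array` and the variables `left` and `right` are assumptions, with `left` playing the role of j and `right` the role of j + Z(j). Indices are 0-based and `z[0]` is set to len(s) by convention.

```python
def z_array(s):
    """Z values of s in O(n) time."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # s[left:right] equals a prefix of s and reaches furthest right
    for i in range(1, n):
        if i < right:
            # Reuse the value at the mirrored position, capped at the known region.
            z[i] = min(z[i - left], right - i)
        # Extend explicitly; only characters at or beyond `right` are newly examined.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z
```

When the mirrored value is smaller than the remaining known region, the while loop stops after a single failed comparison, so the total work stays linear.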


Building Z in O(n) - Analysis

  • The algorithm builds Z correctly

  • The algorithm executes in O(n)

    • A new character is matched only once

    • All other operations are in O(1)


Boyer Moore Worst Case Analysis

  • Assume P consists of n copies of a single char and T consists of m copies of the same char:

    T: aaaaaaaaaaaaaaaaaaaaaaaaa

    P: aaaaaa

  • Boyer Moore Algorithm runs in Θ(m n) when finding all the matches


The Galil Rule

  • In a specific matching phase, we mark with k the position in T of the right end of P. We mark with s the position of the last matched char in this phase.

    s k k’

    T:bbacdcbaabcddcdaddaaabcbcb

    P:abaab

    abaab


The Galil Rule (Cont…)

  • If the shift moves the left end of P to the right of s, the prefix of P that now lies at positions ≤ k of T is known to match. The algorithm doesn’t need to check those chars again.

  • An extended Boyer Moore algorithm with the Galil rule runs in O(m + n) worst case (even without the bad-character rule).
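
A self-contained Python sketch of the search with the good suffix rule and the Galil rule, without the bad character rule. All names (`z_array`, `good_suffix_tables`, `boyer_moore_galil`, `known`) are assumptions for this illustration. Text indices are 0-based, while the L and l tables keep the 1-based layout of the slides. As a conservative choice, the remembered bound `known` is dropped whenever the shift does not move the left end of P past the compared region.

```python
def z_array(s):
    """Z values of s in O(n) time (z[0] = len(s) by convention)."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(z[i - left], right - i)
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def good_suffix_tables(p):
    """Return 1-based tables (L, l); unused slots are 0."""
    n = len(p)
    z_rev = z_array(p[::-1])
    big_l = [0] * (n + 2)
    small_l = [0] * (n + 2)
    longest = 0
    for j in range(1, n):  # j = 1 .. n-1
        n_j = z_rev[n - j]  # N(j), read from Z over the reversed pattern
        if n_j > 0:
            big_l[n - n_j + 1] = j
        if n_j == j:
            longest = j
        small_l[n - j + 1] = longest
    return big_l, small_l


def boyer_moore_galil(p, t):
    """Return all 0-based start positions of p in t."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    big_l, small_l = good_suffix_tables(p)
    matches = []
    k = n - 1    # index in t aligned with the right end of p
    known = -1   # Galil bound: t[k-n+1 .. known] is already known to match
    while k < m:
        i, h = n - 1, k
        while i >= 0 and h > known and p[i] == t[h]:
            i -= 1
            h -= 1
        if i < 0 or h == known:
            matches.append(k - n + 1)
            shift = n - small_l[2]
            known = k
        else:
            if i == n - 1:
                shift = 1
            elif big_l[i + 2] > 0:  # mismatch at 1-based position i + 1
                shift = n - big_l[i + 2]
            else:
                shift = n - small_l[i + 2]
            # Keep the bound only if the left end of p moves past the compared region.
            known = k if shift >= i + 1 else -1
        k += shift
    return matches
```

For P = aaaaaa over a text made only of a's, every phase after the first compares a single new character, which is the behaviour the O(m + n) bound relies on.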



O(n + m) proof - Outline

  • Preprocess in O(n) – already proved

  • Properties of strings

  • Proof of search in O(m) if P is not in T, using only the good suffix rule.

  • Proof of search in O(m) even if P is in T, adding the Galil rule.


Properties of Strings

  • If for two strings δ, γ: δγ = γδ then there is a string β such that δ = β^i and γ = β^j, i, j > 0

    - Proof by induction

  • Definition: A string s is semiperiodic with period β if s consists of a non-empty suffix of β (possibly the entire β) followed by one or more complete copies of β.

    s = β’ β β β   (β’ is a non-empty suffix of β)
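
The definition can be turned into a small check. This is a minimal Python sketch. The function name `is_semiperiodic` is hypothetical, and it tests one given candidate period β rather than searching for a period.

```python
def is_semiperiodic(s, beta):
    """True if s is a non-empty suffix of beta followed by one or more full copies of beta."""
    b = len(beta)
    if b == 0:
        return False
    # Length of the leading piece: between 1 and b (b means the entire beta).
    head_len = len(s) % b or b
    copies = (len(s) - head_len) // b
    if copies < 1:
        return False
    return s[:head_len] == beta[-head_len:] and s[head_len:] == beta * copies
```

For instance, bcabcabc is semiperiodic with period abc (the suffix bc followed by two copies), while abc on its own is not, because no complete copy follows the leading piece.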


Properties of Strings (Cont…)

  • A string is prefix semiperiodic if it contains one or more complete copies of β followed by a non-empty prefix of β.

  • A string is prefix semiperiodic iff it is semiperiodic with the same length period


Lemma 1

  • Suppose P occurs in T starting at position p and also at position q, q > p. If q – p ≤ ⌊n/2⌋ then P is semiperiodic with period α = P[n-(q-p)+1…n]

    [Figure: the occurrence of P at p and the occurrence at q, each drawn as α’ followed by copies of α]


Proof - when P is Not Found in T

  • We have R rounds during the search.

  • After each round the good suffix rule decides on a right shift of s_i chars.

  • Σs_i ≤ m

  • We shall use Σs_i as an upper bound.


Proof (Cont…)

  • For each round we count the matched chars by:

    • f_i – the number of chars matched for the first time

    • g_i – the number of chars already matched in previous rounds.

  • Σf_i ≤ m

  • We want to prove that g_i ≤ 3s_i (so Σg_i ≤ 3m).


Proof (Cont…)

  • Each round that doesn’t find P ⇒ it matched a substring t_i and hit one bad char x_i in T (x_i t_i ⊆ T)

    T: bbacdcbaabcbbabdbabcaabcbcb

    P: bdbabc

  • |t_i| + 1 ≤ 3s_i ⇒ g_i ≤ 3s_i (because g_i + f_i = |t_i| + 1)

  • For the rest of the proof we assume that for the specific round i: |t_i| + 1 > 3s_i


Lemma 2 (|t_i| + 1 > 3s_i)

  • In round i we look at the matched suffix of P, marked P*. P* = y_i t_i, y_i ≠ x_i.

  • Both P* and t_i are semiperiodic with period α of length s_i and hence with minimal length period β, α = β^k.

  • Proof: by Lemma 1.


Lemma 3 (|t_i| + 1 > 3s_i)

  • Suppose P overlapped t_i during round i. We shall examine in what ways P could have overlapped t_i in previous rounds.

  • In any round h < i, the right end of P could not have been aligned with the right end of any full copy of β in t_i.

    - proof:

    • Both round h and round i fail at char x_i

    • two cases of possible shift after round h are invalid


Lemma 4 (|t_i| + 1 > 3s_i)

  • In round h < i, P can correctly match at most |β|-1 chars in t_i.

  • By Lemma 3, P is not aligned with the right end of a full copy of β in t_i in round h.

  • Thus if it matched |β| chars or more there is a suffix γ of β followed by a prefix δ of β such that δγ = γδ.

  • By the string properties there is a substring μ such that β = μ^k, k>1.

  • This contradicts the minimal period size property of β.


Lemma 5 (|t_i| + 1 > 3s_i)

  • If in round h < i the right end of P is aligned with a char in t_i, it can only be aligned with one of the following:

    • One of the left-most |β|-1 chars of t_i

    • One of the right-most |β| chars of t_i

      - proof:

    • If not, by Lemmas 3 and 4, at most |β|-1 chars are matched and only from the middle of a β copy, while there are at least |β|

    • A shift cannot pass the right end of that β copy


Proof (Cont…)

  • If |t_i| + 1 > 3s_i then g_i ≤ 3s_i

  • Using Lemma 5, in previous rounds we could match only the bad char x_i, the left-most |β|-1 chars in t_i, or start from one of the right-most |β| chars in t_i.

  • In the last case, using Lemma 4, we can only match up to |β|-1 chars

  • In total we could previously match: g_i ≤ 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3s_i


Proof - Final

  • Number of matches = ∑(f_i + g_i) = ∑f_i + ∑g_i ≤ m + ∑3s_i ≤ m + 3m = 4m


Proof - when P is Found in T

  • Split the rounds to two groups:

    • “match” rounds – an occurrence of P in T was found.

    • “mismatch” rounds – P was not found in T.

  • We have proved O(m) for “mismatch” rounds.


Proof (Cont…)

  • After P was found in T, P will be shifted by a constant length s (s = n – l(2)).

  • n + 1 ≤ 3s ⇒ ∑ matches in round i ≤ ∑3s ≤ 3m

  • For the rest of the proof we assume that: n + 1 > 3s


Proof (n + 1 > 3s)

  • By Lemma 1, P is semiperiodic with minimal length period β, |β| = s.

  • If round i+1 is also a “match” round then, by the Galil rule, only the new |β| chars are compared.

  • A contiguous series of “match” rounds, i…i+k is called a “run”.


Proof (n + 1 > 3s)

  • ∑ The length of a “run”, not including chars that were already matched in previous “runs” ≤ m

  • How many chars in a “run” were already matched in previous “runs”?


Lemma (n + 1 > 3s)

  • Suppose k-1 was a “match” round and k is a “mismatch” round that ends the “run”.

  • If k’ > k is the first “match” round then it overlaps at most |β|-1 chars with the previous “run” (ended by round k-1).

  • The left end of P at round k’ cannot be aligned with the left end of a full copy of β at round k-1.

  • As a result, P cannot overlap |β| chars or more with round k-1.


Proof (n + 1 > 3s)

  • By the Lemma and because the shift after every “match” round is of |β|, only the first round of a “run” can overlap, and only with the last previous “run”.

  • ∑ The length of the chars that were already matched in previous “runs” ≤ m


Proof (n + 1 > 3s) - Final

  • ∑ The length of a “run” = ∑ The length of a “run”, not including chars that were already matched in previous “runs” + ∑ The length of the chars that were already matched in previous “runs” ≤ m + m = 2m

