
Boyer-Moore string search algorithm

Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997). Original: Robert S. Boyer, J Strother Moore (1977). Presented by: Vladimir Zoubritsky.



Presentation Transcript


  1. Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore (1977) Presented by: Vladimir Zoubritsky

  2. Agenda • Problem Statement • Bad character rule • Boyer-Moore-Horspool algorithm • Good Suffix Rule • Preprocessing • Analysis

  3. Problem Statement • Given a pattern P(1..n) and a text T(1..m) defined over an alphabet Σ, find one or all occurrences of P in T. • The Boyer-Moore algorithm (1977) provides an efficient solution. With the modifications noted in the time analysis, it runs in linear time in the worst case, and in sub-linear time in most practical cases.

  4. Right to left matching idea • Other known algorithms, e.g. Brute Force, match the pattern from left to right. • Algorithm: Align the left end of P with index k of T. Start matching at T(k+n-1) and move left; if all letters match, report an occurrence. • By itself, matching from right to left has the same running time as Brute Force. • Based on the matched suffix, we can decide to skip over ranges of characters.
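The right-to-left comparison on its own can be sketched as below. This is a minimal illustration, not code from the presentation; the function name `naive_right_to_left` is made up for this example. Positions are reported 1-based to match the slides.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions k at which pattern occurs in text.

    Every alignment is tried (no skipping), so the running time is the
    same as Brute Force; only the comparison direction differs.
    """
    n, m = len(pattern), len(text)
    occurrences = []
    for k in range(m - n + 1):            # k is the 0-based alignment
        i = n - 1                          # start at the right end of P
        while i >= 0 and pattern[i] == text[k + i]:
            i -= 1                         # move left while characters match
        if i < 0:                          # ran off the left end: full match
            occurrences.append(k + 1)
    return occurrences
```

For example, `naive_right_to_left("aba", "ababa")` reports the two overlapping occurrences at positions 1 and 3.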

  5. Algorithm Skeleton • Align P with the beginning of T and match from right to left. • If all of P matched, report an occurrence. • Otherwise shift P by the larger of the bad character shift and the good suffix shift. Conditional correctness: If neither shift ever skips over an occurrence of P in T, the algorithm will report all occurrences.

  6. Bad Character rule • Definition: For each character x in the alphabet, let R(x) be the position of the right-most occurrence of x in P. R(x) is defined to be zero if x does not occur in P.
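As a small sketch of this definition, R can be stored as a dictionary that is filled in one left-to-right pass over P. The names `rightmost_positions` and `lookup` are assumptions made for this example; characters absent from P are simply not stored and are treated as R(x) = 0 at lookup time.

```python
def rightmost_positions(pattern):
    """Map each character of pattern to its right-most 1-based position."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos        # a later occurrence overwrites an earlier one
    return R


def lookup(R, x):
    """R(x) as defined on the slide: zero when x does not occur in P."""
    return R.get(x, 0)
```

The table holds at most one entry per distinct character of P, and building it takes O(n) time.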

  7. Bad character shift • Definition: Suppose that for a particular alignment of P against T, the rightmost n-i characters of P match their counterparts in T, but the character P(i) mismatches its counterpart, say at position k of T. If the right-most position of the character T(k) in P is j, with j < i, then shift so that character j of P is below character k of T; otherwise shift by 1. • The shift is therefore max[1, i-R(T(k))].

  8. Bad character shift • Simple case: The text character T(k) aligned with P(n) does not appear in P: P is shifted by n (to start after k).

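  9. Bad character shift • General case: With x = T(k), shift by i - R(x). Correctness is straightforward: x does not occur in P to the right of position R(x), so any smaller shift would place a different character below T(k).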
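The shift formula max[1, i - R(T(k))] covers both the simple and the general case. A minimal sketch, assuming a hypothetical helper `bad_character_shift` that rebuilds the table on each call for brevity (a real search would build it once):

```python
def bad_character_shift(pattern, i, text_char):
    """Shift given a mismatch at 1-based pattern position i against text_char."""
    # R maps each character to its right-most 1-based position in pattern.
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    # Characters missing from the pattern count as R(x) = 0.
    return max(1, i - R.get(text_char, 0))
```

With P = "abcab" and a mismatch at i = 5: a text character "x" that is absent from P gives a shift of 5 (the simple case), "c" gives 5 - 3 = 2 (the general case), and a mismatch at i = 2 against "a" (whose right-most position 4 lies to the right of i) falls back to a shift of 1.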

  10. Boyer-Moore-Horspool algorithm • Described by Horspool in 1980. • Basic idea: use the Boyer-Moore algorithm, but only with the bad character shift rule. • Worst case running time in degenerate cases may be O(nm). • Best case is sub-linear: O(m/n).

  11. Boyer-Moore-Horspool worst case • A pattern and a text can be constructed so that every shift is 1 (the same as Brute Force).

  12. Boyer-Moore-Horspool best case • In the case when the text character aligned with the last character of the pattern never appears in the pattern, each shift would be of n steps.
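The worst and best cases can be observed by counting character comparisons. The sketch below follows the description on the slides (Boyer-Moore matching with the bad character rule only); Horspool's published variant differs slightly in that it always keys the shift on the text character under P(n). The name `bad_character_search` and the returned comparison counter are assumptions made for this illustration.

```python
def bad_character_search(pattern, text):
    """Return (occurrences, comparisons); occurrences are 1-based starts."""
    n, m = len(pattern), len(text)
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    occurrences, comparisons = [], 0
    k = 0                                  # 0-based text index of P's left end
    while k + n <= m:
        i = n                              # 1-based position in the pattern
        while i >= 1:
            comparisons += 1
            if pattern[i - 1] != text[k + i - 1]:
                break
            i -= 1
        if i == 0:                         # whole pattern matched
            occurrences.append(k + 1)
            k += 1
        else:                              # mismatch at P(i): bad character shift
            k += max(1, i - R.get(text[k + i - 1], 0))
    return occurrences, comparisons
```

Searching for "abc" in twelve copies of "x" uses 4 comparisons (m/n, one per alignment, each followed by a shift of n), while searching for "baa" in ten copies of "a" uses 24 comparisons (n comparisons at each of the m-n+1 alignments, with a shift of 1 every time).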

  13. Boyer-Moore-Horspool Time • Preprocessing: Scanning the pattern is done in O(n) time, using O(|Σ|) space. • Worst case: O(nm). • Best case: O(m/n). • Average time: An average number of comparisons for the general case of Boyer-Moore-Horspool was established in [Baeza-Yates 1990]. • The bad character rule alone is not strong enough to guarantee linear time (see the worst case above).

  14. Good Suffix Rule • Definition: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left. Then find, if it exists, the rightmost copy t' of t in P, such that t' is not a suffix of P, and the character to the left of t' in P differs from the one to the left of t in P. Shift P to the right, so that substring t' in P is below substring t in T.

  15. Good suffix rule (cont'd) • If t' does not exist, then shift the left end of P past the left end of t in T by the least amount, so that a prefix of the shifted P matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right.

  16. Correctness of the good-suffix shift • Recall: Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next character to the left. • If there is only one occurrence of t in P, then any alignment with the left end of P aligned before the left end of t will not yield a match. • If we align t with a previous copy t' of t in P, and the character before t' is equal to the character before t in P, this alignment will fail the same way.

  17. Preprocessing of P • The originally published preprocessing algorithm was complex and erroneous; an updated version was still complex. • We will use a simpler version based on the Z algorithm. • We want the preprocessing to compute values for the functions L'(i) and l'(i), defined later.

  18. Preprocessing of P (cont'd) • An intermediate value we will require is N_j(P). N_j(P) is defined as the length of the longest suffix of P[1..j] which is also a suffix of P. • Recall that Z_i(S) is the length of the longest substring of S that starts at position i and is also a prefix of S. • We can compute the values N_j(P) by running the Z-algorithm on the reverse of P: N_j(P) = Z_{n-j+1}(reverse of P).
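A minimal sketch of this step, assuming a standard linear-time Z-algorithm (the slides do not show one; `z_array` and `n_values` are names chosen for this example). The Z values are stored 0-based, and N is returned 1-based with index 0 unused, to match the slides' indexing.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that is a prefix of s."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0                       # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:                      # reuse information from the Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                      # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """N[j] for j = 1..n (N[0] unused): Z values of the reversed pattern, reversed."""
    z = z_array(pattern[::-1])
    return [0] + z[::-1]
```

For P = "cabdabdab", this gives N_3(P) = 2 (the suffix "ab" of "cab") and N_6(P) = 5 (the suffix "abdab" of "cabdab"). N_n(P) is trivially n.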

  19. Preprocessing of P: calculating L'(i) • L'(i) gives the right-end position of the right-most copy of P[i..n] which is not a suffix of P and is preceded by a different character than P(i-1). L'(i) is zero if no such position exists. • Using N_j(P), we can define L'(i) as the largest j < n so that N_j(P) = |P[i..n]| = n-i+1. • L'(i) can be accumulated in linear time from the values of N_j(P).
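The accumulation can be sketched as a single pass over j = 1..n-1: each j with N_j(P) > 0 is recorded at index i = n - N_j(P) + 1, and later (larger) j overwrite earlier ones. To keep the example short and self-contained, N is computed here by a quadratic stand-in rather than by the Z-algorithm; the function names are assumptions for this illustration.

```python
def n_values_naive(pattern):
    """N[j] for j = 1..n (N[0] unused), by direct comparison (quadratic time)."""
    n = len(pattern)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        # Compare the suffix of P[1..j] with the suffix of P, right to left.
        while k < j and pattern[j - 1 - k] == pattern[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def l_prime_values(pattern):
    """L'[i] for i = 1..n (L'[0] unused); zero where no suitable copy exists."""
    n = len(pattern)
    N = n_values_naive(pattern)
    L_prime = [0] * (n + 1)
    for j in range(1, n):                  # j < n, in increasing order
        if N[j] > 0:
            L_prime[n - N[j] + 1] = j      # the largest such j wins
    return L_prime
```

For P = "cabdabdab" the only non-zero entries are L'(5) = 6 and L'(8) = 3.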

  20. Preprocessing of P: calculating l'(i) • l'(i) is the length of the longest suffix of P[i..n] that is also a prefix of P, if one exists, and zero otherwise. • We can also define l'(i) in terms of N_j(P): l'(i) is the largest j ≤ |P[i..n]| = n-i+1 so that N_j(P) = j. • In a similar way, l'(i) can be accumulated in linear time from the N_j(P) values.
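Since the bound n-i+1 grows as i decreases, l'(i) can be filled in from i = n down to i = 1 while remembering the largest j seen so far with N_j(P) = j. The sketch below again uses a quadratic stand-in for N to stay self-contained; the function names are assumptions for this illustration.

```python
def n_values_naive(pattern):
    """N[j] for j = 1..n (N[0] unused), by direct comparison (quadratic time)."""
    n = len(pattern)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and pattern[j - 1 - k] == pattern[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def l_small_prime_values(pattern):
    """l'[i] for i = 1..n (l'[0] unused)."""
    n = len(pattern)
    N = n_values_naive(pattern)
    l_prime = [0] * (n + 1)
    longest = 0
    for i in range(n, 0, -1):
        j = n - i + 1                      # the newly allowed prefix length
        if N[j] == j:                      # P[1..j] is also a suffix of P
            longest = j
        l_prime[i] = longest
    return l_prime
```

For P = "abcab", l'(2) = l'(3) = l'(4) = 2 (the prefix "ab" is a suffix of each of those suffixes), l'(5) = 0, and l'(1) = 5.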

  21. Using the preprocessing results • The first part of the good suffix rule says we should find a copy of t which is preceded by a different character, i.e. use a non-zero value of L'(i). • The second part looks for the least shift such that a prefix of P matches a suffix of t, i.e. uses a non-zero value of l'(i).
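Putting the pieces together, a search loop might look like the following sketch. It assumes the shift amounts from Gusfield's formulation, which the slides do not spell out: after a mismatch at P(i) with P[i+1..n] matched, shift by n - L'(i+1) if L'(i+1) > 0 and by n - l'(i+1) otherwise; if the mismatch is at P(n), the good suffix shift is 1; after an occurrence, shift by n - l'(2). N is computed by a quadratic stand-in rather than the Z-algorithm, and the function names are made up for this example.

```python
def preprocess(pattern):
    """Return (R, L_prime, l_prime) with 1-based indexing (index 0 unused)."""
    n = len(pattern)
    N = [0] * (n + 1)
    for j in range(1, n + 1):              # quadratic stand-in for the Z-based N
        k = 0
        while k < j and pattern[j - 1 - k] == pattern[n - 1 - k]:
            k += 1
        N[j] = k
    L_prime = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            L_prime[n - N[j] + 1] = j
    l_prime = [0] * (n + 2)                # l_prime[n + 1] stays 0
    longest = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            longest = n - i + 1
        l_prime[i] = longest
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    return R, L_prime, l_prime


def boyer_moore(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    R, L_prime, l_prime = preprocess(pattern)
    occurrences = []
    k = n                                  # 1-based text position under P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:                         # occurrence found
            occurrences.append(k - n + 1)
            k += n - l_prime[2]
        else:                              # mismatch at P(i) against T(h)
            bad = max(1, i - R.get(text[h - 1], 0))
            if i == n:
                good = 1                   # nothing matched yet
            elif L_prime[i + 1] > 0:
                good = n - L_prime[i + 1]
            else:
                good = n - l_prime[i + 1]
            k += max(bad, good)
    return occurrences
```

For example, `boyer_moore("abcab", "xabcabcabx")` finds the overlapping occurrences at positions 2 and 5.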

  22. Boyer-Moore Time • Using the linear time implementation of the Z algorithm, the preprocessing takes O(n) time and O(n) space. • The original Boyer-Moore algorithm can take O(nm) time in cases where P appears in T; a few simple modifications remove this behaviour [Galil 1979]. • A tight bound of 3m comparisons was established for Boyer-Moore running time [Cole 1991]. • An average case analysis has been proposed, but it remains difficult to reduce to a simple expression as in BMH [Tsai 2005]. • For other, “Boyer-Moore-like” algorithms the following time bounds were established:

  23. Experimental Analysis • On average, for sufficiently large alphabets (8 characters), Boyer-Moore-Horspool has a fast running time and a sub-linear number of character comparisons. • On average and in the worst case, Boyer-Moore is faster than “Boyer-Moore-like” algorithms. Data from Michailidis and Margaritis [2001]

  24. Questions?
