A Fast String Matching Algorithm: The Boyer-Moore Algorithm
The obvious search algorithm
• Considers each character position of str and determines whether the successive patlen characters of str, starting at that position, match pat.
• In the worst case, the number of comparisons is on the order of i * patlen, where i is the position of the first occurrence of pat in str.
• Ex. pat: aab; str: ..aaaaac
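The quadratic behaviour is easy to see by counting comparisons. The following is a minimal sketch of the obvious algorithm; the function name `naive_search` and the comparison counter are illustrative additions, not part of the slides.

```python
def naive_search(pat: str, text: str):
    """Return (0-based index of the first match or -1, number of char comparisons)."""
    comparisons = 0
    for start in range(len(text) - len(pat) + 1):
        matched = 0
        # Compare pat against text left to right at this alignment.
        while matched < len(pat):
            comparisons += 1
            if text[start + matched] != pat[matched]:
                break
            matched += 1
        if matched == len(pat):
            return start, comparisons
    return -1, comparisons


# pat occurs at 1-based position i = 4, patlen = 3: 12 = i * patlen comparisons.
print(naive_search("aab", "aaaaab"))  # (3, 12)
```

Every alignment re-reads the leading "aa" before failing on the third character, which is where the i * patlen cost comes from.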
Knuth-Morris-Pratt Algorithm
• Linear search algorithm.
• Preprocesses pat in time linear in patlen and searches str in time linear in i + patlen.
• Ex. pat: EXAMPLE; str: HERE IS A SIMPLE EXAMPLE …, with pat shown at successive alignments under str.
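For comparison, here is a short sketch of the Knuth-Morris-Pratt idea in its common textbook form (a prefix/failure table built in one pass over pat, then a single left-to-right pass over str). The names `failure_table` and `kmp_search` are illustrative, and this formulation is an assumption; the slides do not show one.

```python
def failure_table(pat):
    """fail[q] = length of the longest proper prefix of pat[:q+1] that is also its suffix."""
    fail = [0] * len(pat)
    k = 0
    for q in range(1, len(pat)):
        while k > 0 and pat[q] != pat[k]:
            k = fail[k - 1]
        if pat[q] == pat[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(pat, text):
    """Return the 0-based index of the first occurrence of pat in text, or -1."""
    if not pat:
        return 0
    fail = failure_table(pat)
    k = 0  # number of pattern characters currently matched
    for pos, ch in enumerate(text):
        while k > 0 and ch != pat[k]:
            k = fail[k - 1]  # fall back without re-reading text
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            return pos - len(pat) + 1
    return -1


print(kmp_search("EXAMPLE", "HERE IS A SIMPLE EXAMPLE"))  # 17
```

Each character of str is read once, so the search never does better than one reference per character, which is the point of contrast with Boyer-Moore.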
Characteristics of the Boyer-Moore Algorithm
• Basic idea: match the pattern against the string from the right end of pat rather than from the left.
• Preprocess pat to compute two tables, delta1 and delta2, used for shifting pat and the pointer into str.
• Ex. pat: AT-THAT; str: …WHICH-FINALLY-HALTS.--AT-THAT-POINT
Informal Description
Compare the last char of pat with the patlen-th char of str; call that char of str "char" (here F).
Observation 1: if char is known not to occur in pat, skip patlen chars of str.
pat: AT-THAT
str: WHICH-FINALLY-HALTS.--AT-THAT-POINT
pat:        AT-THAT
Informal Description
Observation 2: if char is in pat, slide pat down delta1 positions so that char is aligned with its rightmost occurrence in pat.
delta1(char) = if char does not occur in pat, then patlen; else patlen - j, where j is the maximum integer such that pat(j) = char.
pat:        AT-THAT
str: WHICH-FINALLY-HALTS.--AT-THAT-POINT
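The delta1 rule can be sketched in a few lines. The sketch assumes the table is stored as a dictionary holding only the characters of pat, with patlen as the default for every other character; the name `delta1_table` is illustrative.

```python
def delta1_table(pat):
    """Map each char of pat to patlen - j, where j is its rightmost 1-based position."""
    patlen = len(pat)
    table = {}
    for j, ch in enumerate(pat, start=1):
        table[ch] = patlen - j  # later (more rightward) occurrences overwrite earlier ones
    return table


table = delta1_table("AT-THAT")
print(table)                # {'A': 1, 'T': 0, '-': 4, 'H': 2}
print(table.get("F", 7))    # 7: F does not occur in pat, so delta1 is patlen
```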
Informal Description
Observation 3a: the last m chars of pat match str, and we come to a mismatch at some new char. Move strptr by delta1(char), so that pat is shifted by k = delta1(char) - m.
pat:       AT-THAT
str: …FINALLY-HALTS.--AT-THAT-POINT
pat:             AT-THAT
Informal Description
Observation 3b: the final m chars of pat (a subpattern, subpat) have been matched. Find the rightmost plausible reoccurrence of subpat in pat and align it with the matched m chars of str (slide pat down k positions, so strptr moves by k + m = delta2(j), where j is the position in pat of the mismatch).
pat:             AT-THAT
str: …FINALLY-HALTS.--AT-THAT-POINT
pat:                  AT-THAT
The delta1 & delta2 tables
• The delta1 table has as many entries as there are chars in the alphabet.
  pat:    a b c d e            pat:    a t - t h a t
  delta1: 4 3 2 1 0  else 5    delta1: 1 0 4 0 2 1 0  else 7
• The delta2 table has as many entries as there are chars in pat.
  pat:    a b c d e            pat:    a  t  -  t  h  a  t
  delta2: 9 8 7 6 1            delta2: 11 10 9  8  7  4  1
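Ex: we compute the delta2 entry for j = 5, pat = edbcabc.
j =    -2 -1  0  1  2  3  4  5  6  7
pat:            e  d  b  c  a  b  c
        e  d  b  c  a  b  c
The matched suffix is pat(6..7) = bc. Its rightmost plausible reoccurrence (bc preceded by a char other than pat(5) = a) starts at position 3. Then delta2(5) = patlen + 1 - 3 = 5.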
The algorithm
      stringlen ← length of string.
      i ← patlen.
top:  if i > stringlen then return false.
      j ← patlen.
loop: if j = 0 then return i + 1.
      if string(i) = pat(j) then
          j ← j - 1.
          i ← i - 1.
          goto loop.
      close;
      i ← i + max(delta1(string(i)), delta2(j)).
      goto top.
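The pseudocode translates almost line for line into Python. The sketch below keeps the 1-based pointers i and j of the slide and converts to 0-based indexing only when reading characters. The function name, the dictionary form of delta1 and the brute-force construction of delta2 are assumptions made for illustration.

```python
def boyer_moore_search(pat, string):
    """Return the 0-based index of the first occurrence of pat in string, or -1."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 0

    # delta1: distance from the rightmost occurrence of a char to the end of pat.
    delta1 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}

    def at(pos):
        # 1-based access; None = off the left end of pat (matches anything).
        return pat[pos - 1] if pos >= 1 else None

    # delta2 by brute force; index 0 is unused so delta2[j] is 1-based like the slide.
    delta2 = [0] * (patlen + 1)
    for j in range(1, patlen + 1):
        m, k = patlen - j, j
        while not (all(at(k + t) in (None, at(j + 1 + t)) for t in range(m))
                   and (k <= 1 or at(k - 1) != at(j))):
            k -= 1
        delta2[j] = patlen + 1 - k

    i = patlen                                   # 1-based pointer into string
    while i <= stringlen:                        # "top"
        j = patlen
        while j > 0 and string[i - 1] == pat[j - 1]:   # "loop"
            j -= 1
            i -= 1
        if j == 0:
            return i                             # slide returns i + 1 (1-based)
        i += max(delta1.get(string[i - 1], patlen), delta2[j])
    return -1


print(boyer_moore_search("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 22
```

Because delta2(j) is always at least patlen - j + 1, each pass through "top" aligns pat strictly further right, so the loop terminates.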
Loops: fast, undo, slow
• Fast: scans down string, effectively looking for the last character in pat, skipping according to delta1.
• 80% of the search time is spent in it.
• Undo: decides whether this situation arose because all of string has been scanned or because pat(patlen) was hit.
• Slow: backs up, checking for matches.
• It is easy to implement on a byte-addressable machine.
• char ← string(i), etc.
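One way to realise the three loops is sketched below. It assumes a table `delta0` that equals delta1 except that the last character of pat maps to a sentinel `large` greater than stringlen + patlen, so the fast loop needs only one test per step. Both `delta0` and `large` are assumptions not spelled out on the slide, and delta2 is again built by brute force for brevity.

```python
def bm_search_fast_loop(pat, string):
    """Return the 0-based index of the first occurrence of pat in string, or -1."""
    patlen, stringlen = len(pat), len(string)
    if patlen == 0:
        return 0
    large = stringlen + patlen + 1       # sentinel: any value > stringlen + patlen

    delta1 = {ch: patlen - j for j, ch in enumerate(pat, start=1)}
    delta0 = dict(delta1)
    delta0[pat[-1]] = large              # hitting pat(patlen) jumps past the end

    def at(pos):                         # 1-based; None = off the left end of pat
        return pat[pos - 1] if pos >= 1 else None

    delta2 = [0] * (patlen + 1)          # 1-based, index 0 unused
    for j in range(1, patlen + 1):
        m, k = patlen - j, j
        while not (all(at(k + t) in (None, at(j + 1 + t)) for t in range(m))
                   and (k <= 1 or at(k - 1) != at(j))):
            k -= 1
        delta2[j] = patlen + 1 - k

    i = patlen                           # 1-based pointer into string
    while True:
        # fast: skip until the last char of pat is hit or string is exhausted
        while i <= stringlen:
            i += delta0.get(string[i - 1], patlen)
        # undo: i <= large means we simply ran off the end of string
        if i <= large:
            return -1
        i = i - large - 1                # char just left of the one that was hit
        j = patlen - 1
        # slow: back up, checking for matches
        while j > 0 and string[i - 1] == pat[j - 1]:
            j -= 1
            i -= 1
        if j == 0:
            return i
        i += max(delta1.get(string[i - 1], patlen), delta2[j])


print(bm_search_fast_loop("AT-THAT", "WHICH-FINALLY-HALTS.--AT-THAT-POINT"))  # 22
```

The fast loop touches one character of string per iteration and does no comparison against pat, which is why most of the running time can be spent there cheaply.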
Measured cost of each search
• Three strings: binary alphabet, English, random alphabet.
• Fig. 1: the number of references made to string.
• Fig. 2: the total number of machine instructions that actually got executed.
Boyer-Moore vs. the Knuth, Morris, and Pratt algorithm (for English text)
Boyer-Moore:
• Every reference to string passes about 4 characters for a pattern of length 5.
• For sufficiently large alphabets and sufficiently long patterns, it executes fewer than 1 instruction per character passed.
K.M.P.:
• The search references string about 1.1 times per character.
• The cost per character can be expected to be at least 3.3 instructions.
Conclusion
• Requires fewer CPU cycles.
• Most efficient on a byte-addressable machine.
• Inadvisable when the task is to find the first of several possible substrings, or to identify a location in string defined by a regular expression.
• The Aho and Corasick algorithm is more suitable for those cases.
Conclusion
• Possible improvement: fetch larger bytes (several characters at a time) in the fast loop and use a hash array to encode the extended delta1 table.
• This exponentially increases the effective size of the alphabet and reduces the frequency of common characters.