The wide window string matching algorithm Longtao He, Binxing Fang, Jie Sui

The wide window string matching algorithm. Longtao He, Binxing Fang, Jie Sui. Theoretical Computer Science, Volume 332, Issues 1-3, February 28, 2005, pp. 391-404.
Professor: R.C.T. Lee
Speaker: K.W. Liu
Department of Computer Science, National Chi Nan University

String Matching Problem
Given: a text string T = t1t2t3…tn and a pattern string P = p1p2p3…pm, where |P| ≤ |T|.
Output: all occurrences of the pattern string within the text string.
Example: T = ababababaabbaabbabababa, P = aabbaabb

Traditional Method
Example: T = ababababaabbaabbabababa, P = aabbaabb
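The figure for this slide is not preserved. As a rough sketch, assuming that the traditional method means the brute-force scan that tries every alignment of P against T, it could look as follows in Python. The function name `naive_search` is illustrative only, and positions are 0-based.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based position at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        # Compare the pattern with the text character by character.
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            positions.append(start)
    return positions


print(naive_search("ababababaabbaabbabababa", "aabbaabb"))  # [8]
```

For the example on this slide the single occurrence starts at 0-based position 8, that is, at t9 in the 1-based notation of the slides.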

In this talk, we shall provide three ideas:
• The wide-window method
• The convolution method
• The bit pattern (modified convolution) method

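Basic Idea of the Wide Window Approach
• Open a window of size 2|P|-1 on T.
• Divide it into two parts:
  • The first part, denoted as T1, has size |P|-1.
  • The second part, denoted as T2, has size |P|.
[Figure: a window of length 2|P|-1 on T, split into T1 (length |P|-1) followed by T2 (length |P|).]
Since |T1| < |P|, P cannot fit inside T1 alone. If P occurs in the window, the occurrence must reach into T2, so some suffix of P must be a prefix of T2.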

Find all prefixes of T2 which are also suffixes of P.
• Let r denote the length of the longest such prefix.
• We can be sure that the part of T2 after its first r characters can be ignored.
[Figure: the window T1 T2 with the r-prefix of T2 marked; the rest of T2 is labelled "Can be ignored".]

For every prefix of T2 which is a suffix of P, we should check whether there exists a suffix of T1 which is also a prefix of P, such that the two lengths add up to |P|.

Definition:
n-suffix: Given a string S, the n-suffix of S is the suffix of S whose length is n, where 0 ≤ n ≤ |S|.
Example: S = abcde
0-suffix of S = ε
1-suffix of S = e
2-suffix of S = de
3-suffix of S = cde
n-prefix: Given a string S, the n-prefix of S is the prefix of S whose length is n, where 0 ≤ n ≤ |S|.
Example: S = abcde
0-prefix of S = ε
1-prefix of S = a
2-prefix of S = ab
3-prefix of S = abc
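The two definitions map directly onto string slicing. The helper names `n_prefix` and `n_suffix` below are illustrative, not part of the original talk.

```python
def n_prefix(s: str, n: int) -> str:
    """Return the prefix of s whose length is n (0 <= n <= |s|)."""
    if not 0 <= n <= len(s):
        raise ValueError("n must satisfy 0 <= n <= |S|")
    return s[:n]


def n_suffix(s: str, n: int) -> str:
    """Return the suffix of s whose length is n (0 <= n <= |s|)."""
    if not 0 <= n <= len(s):
        raise ValueError("n must satisfy 0 <= n <= |S|")
    # s[-n:] would return the whole string for n == 0, so slice from the front.
    return s[len(s) - n:]


print(n_suffix("abcde", 3), n_prefix("abcde", 2))  # cde ab
```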

An Example of the Wide Window Approach
Given: T = aababcbdcea, P = abcbd
Let us produce a wide window whose length is |P| - 1 + |P| = 2|P| - 1. In this case, |P| = 5 and 2|P| - 1 = 9.
The window covers aababcbdc, with T1 = aaba (length |P|-1 = 4) and T2 = bcbdc (length |P| = 5).

We first find all prefixes of T2 which are equal to some suffixes of P. In this case, we obtain bcbd, whose length is 4.
|P| - 4 = 5 - 4 = 1
If the 1-suffix of T1 is the 1-prefix of P, we have found a matching.
1-suffix of T1 = a
1-prefix of P = a
∴ 1-suffix of T1 = 1-prefix of P. Thus we conclude that a matching is found.

Another Example
Given: T = ababa, P = aba
Let us produce a wide window whose length is |P| - 1 + |P| = 2|P| - 1. In this case, |P| = 3 and 2|P| - 1 = 5.
The window covers ababa, with T1 = ab (length |P|-1 = 2) and T2 = aba (length |P| = 3).

We first find all prefixes of T2 which are equal to some suffixes of P. In our case, we obtain aba and a, whose lengths are 3 and 1.
|P| - 3 = 3 - 3 = 0 (∵ the whole of P is equal to T2, ∴ one matching is found)
|P| - 1 = 3 - 1 = 2
If the 2-suffix of T1 is the 2-prefix of P, we have found a matching.
2-suffix of T1 = ab
2-prefix of P = ab
∴ 2-suffix of T1 = 2-prefix of P. Thus we conclude that two matchings are found.
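The two examples can be checked with a small Python sketch of the check done inside one window. It uses plain string comparison rather than the convolution or bit pattern methods of the talk, and the name `window_offsets` is an assumption made for illustration. An offset of -z means that a matching starts z characters before the first character of T2.

```python
def window_offsets(t1: str, t2: str, p: str) -> list[int]:
    """Offsets, relative to the start of T2, of the matchings inside one window."""
    m = len(p)
    offsets = []
    for x in range(1, m + 1):
        # The x-prefix of T2 must be the x-suffix of P, and the
        # (m - x)-suffix of T1 must be the (m - x)-prefix of P.
        if t2.startswith(p[m - x:]) and t1.endswith(p[:m - x]):
            offsets.append(-(m - x))
    return offsets


print(window_offsets("aaba", "bcbdc", "abcbd"))  # [-1]
print(window_offsets("ab", "aba", "aba"))        # [-2, 0]
```

In the first example the matching starts one character before T2. In the second example one matching starts two characters before T2 and one starts at T2 itself.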

Question: How can we find a suffix of a string S1 which is also a prefix of a string S2?
Answer: We use the convolution method.

Convolution Method
T = aabc, P = ab (reversed: ba)

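We may use the convolution method to find all prefixes of T2 which are equal to some suffixes of P.
T2 = bcbdc (reversed: cdbcb), P = abcbd
                    a b c b d   (P)
                    c d b c b   (reversed T2)
b = T2[1]           0 1 0 1 0
c = T2[2]         0 0 1 0 0
b = T2[3]       0 1 0 1 0
d = T2[4]     0 0 0 0 1
c = T2[5]   0 0 1 0 0
sum (+)     0 0 1 1 0 4 0 1 0
Each row compares P with one character of T2 and is shifted one column to the left of the previous row. If any zero appears in a column, we cannot get a matching from that column. The column whose sum is 4 is covered by four rows, all of which contain 1: a 4-suffix of P is equal to a prefix of T2. The four leftmost columns are the unused region; they are not used to find a matching.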

In the same convolution table, the unused region (the four leftmost columns of the sum 0 0 1 1 0 4 0 1 0) may be ignored. No further sliding to the left is needed.
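The sum row of the convolution table can be reproduced with a short Python sketch. The function names are illustrative assumptions. The second function reads off the prefix lengths: counting from the right, the x-th column is covered by exactly x rows, so it signals a matching when its sum equals x.

```python
def convolution_sums(p: str, t2: str) -> list[int]:
    """Column sums of the shifted rows, one row per character of t2."""
    m, n = len(p), len(t2)
    sums = [0] * (m + n - 1)
    for i, ch in enumerate(t2):
        shift = n - 1 - i  # leftmost column of the row for t2[i]
        for j, pc in enumerate(p):
            if pc == ch:
                sums[shift + j] += 1
    return sums


def matching_prefix_lengths(p: str, t2: str) -> list[int]:
    """Lengths x such that the x-prefix of t2 equals the x-suffix of p."""
    sums = convolution_sums(p, t2)
    limit = min(len(p), len(t2))
    return [x for x in range(1, limit + 1) if sums[len(sums) - x] == x]


print(convolution_sums("abcbd", "bcbdc"))         # [0, 0, 1, 1, 0, 4, 0, 1, 0]
print(matching_prefix_lengths("abcbd", "bcbdc"))  # [4]
```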

We may also use the logic operator (AND, &) to find all prefixes of T2 which are equal to some suffixes of P.
T2 = bcbdc, P = abcbd
Instead of adding the shifted rows, we AND them column by column. The unused region is dropped, so each row keeps only the bits that can still take part in a matching:
b = T2[1]   0 1 0 1 0
c = T2[2]   0 1 0 0
b = T2[3]   0 1 0
d = T2[4]   0 1
c = T2[5]   0
AND (&)     0 1 0 0 0
The second column is covered by four rows and its result is 1: a 4-suffix of P is equal to a prefix of T2.

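We may use the convolution method to find all suffixes of T1 which are equal to some prefixes of P.
T1 = aaba, P = abcbd (reversed: dbcba)
                  d b c b a   (reversed P)
                    a a b a   (T1)
a = T1[4]         0 0 0 0 1
b = T1[3]       0 1 0 1 0
a = T1[2]     0 0 0 0 1
a = T1[1]   0 0 0 0 1
sum (+)     0 0 0 1 1 2 0 1
If any zero appears in a column, we cannot get a matching from that column. The last column is covered by one row, which contains 1: a 1-prefix of P is equal to a suffix of T1. The four leftmost columns are the unused region; they are not used to find a matching.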

In the same convolution table, the unused region (the four leftmost columns of the sum 0 0 0 1 1 2 0 1) may be ignored. No further sliding to the right is needed.

We may use the logic operator (AND, &) to find all suffixes of T1 which are equal to some prefixes of P.
T1 = aaba, P = abcbd (reversed: dbcba)
The unused region is dropped, so each row keeps only the bits that can still take part in a matching:
a = T1[4]   0 0 0 1
b = T1[3]   0 1 0
a = T1[2]   0 1
a = T1[1]   1
AND (&)     0 0 0 1
The last column is covered by one row and its result is 1: a 1-prefix of P is equal to a suffix of T1.
∴ 1-suffix of T1 = 1-prefix of P. Thus we conclude that a matching is found.

The Bit Pattern Approach
Let us consider the following case: T = bcbdc, P = abcbd
Our job is to determine whether there is a prefix of T which is a suffix of P. Indeed, in this case, the 4-prefix of T (bcbd) is also the 4-suffix of P. As indicated before, we may use convolution.

Convolution as an AND operation on the vectors V1, V2, V3, V4 and V5: the result shows that the 4-prefix of T is the 4-suffix of P.
What are the vectors V1, V2, …, V5?

Given a string S = s1s2…sn and a character α, the α-bit pattern of S is defined as b1b2…bn, where bi = 1 if si = α and bi = 0 otherwise.
Example: S = abcbd
a-bit pattern of S = 1 0 0 0 0
b-bit pattern of S = 0 1 0 1 0
c-bit pattern of S = 0 0 1 0 0
d-bit pattern of S = 0 0 0 0 1
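The definition translates into a one-line Python function. The name `bit_pattern` is an illustrative choice.

```python
def bit_pattern(s: str, alpha: str) -> list[int]:
    """alpha-bit pattern of s: 1 where s has the character alpha, 0 elsewhere."""
    return [1 if ch == alpha else 0 for ch in s]


for alpha in "abcd":
    print(alpha, bit_pattern("abcbd", alpha))
```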

T = b c b d c, P = a b c b d
The AND operation is applied to the vectors V1, V2, V3, V4 and V5. We can now observe that
• V1 = b-bit pattern of P, as we are comparing T[1] = b with P,
• V2 = c-bit pattern of P, as we are comparing T[2] = c with P,
• V3 = b-bit pattern of P, as we are comparing T[3] = b with P,
• V4 = d-bit pattern of P, as we are comparing T[4] = d with P,
• V5 = c-bit pattern of P, as we are comparing T[5] = c with P.

T = b c b d c, P = a b c b d
(1) T[1] = b. We want to decide whether P[5] = T[1] = b.
b-bit pattern of P = 0 1 0 1 0
The last bit is 0 ≠ 1, so T[1] ≠ P[5].
Besides, we know that T[1] = P[2] = P[4].

T = b c b d c, P = a b c b d
(2) T[2] = c. We want to decide whether T[1]T[2] = bc = P[4]P[5].
c-bit pattern of P = 0 0 1 0 0
We AND the T[1]-bit pattern of P and the T[2]-bit pattern of P in the following way: ignore the last bit of the T[1]-bit pattern (0 1 0 1 0 becomes 0 1 0 1) and the first bit of the T[2]-bit pattern (0 0 1 0 0 becomes 0 1 0 0), then compute 0 1 0 1 & 0 1 0 0 = 0 1 0 0.
The result of comparing T[1] and P[5] can be ignored from now on.
The last bit of the result is 0. The 2-prefix of T ≠ the 2-suffix of P.
What does 0 1 0 0 mean? It means that T[1] = P[2] = b and T[2] = P[3] = c.
We keep this result 0 1 0 0.

T = b c b d c, P = a b c b d
Resulting vector = 0 1 0 0
(3) T[3] = b. We want to decide whether T[1]T[2]T[3] = P[3]P[4]P[5].
b-bit pattern of P = 0 1 0 1 0
We only take the last 3 bits, namely 0 1 0, because we are interested in P[3]P[4]P[5].
AND-operation: ignore the last bit of the resulting vector (0 1 0 0 becomes 0 1 0), then compute 0 1 0 & 0 1 0 = 0 1 0.
The last bit of the result is 0. The 3-prefix of T ≠ the 3-suffix of P.
0 1 0 means that T[1] = P[2] = b, T[2] = P[3] (previously obtained) and T[3] = P[4].
We keep the resulting vector 0 1 0.

T = b c b d c, P = a b c b d
Resulting vector = 0 1 0
(4) T[4] = d. We want to decide whether T[1]T[2]T[3]T[4] = P[2]P[3]P[4]P[5].
d-bit pattern of P = 0 0 0 0 1
We only take the last 2 bits, namely 0 1.
AND-operation: ignore the last bit of the resulting vector (0 1 0 becomes 0 1), then compute 0 1 & 0 1 = 0 1.
The last bit of the result is 1. The 4-prefix of T = the 4-suffix of P.
0 1 means that T[1] = P[2] = b, T[2] = P[3] = c, T[3] = P[4] = b (previously obtained) and T[4] = P[5] = d.

T = b c b d c, P = a b c b d
Resulting vector = 0 1
(5) T[5] = c. We want to decide whether T[1]T[2]T[3]T[4]T[5] = P[1]P[2]P[3]P[4]P[5].
c-bit pattern of P = 0 0 1 0 0
We only take the last 1 bit, namely 0.
AND-operation: ignore the last bit of the resulting vector (0 1 becomes 0), then compute 0 & 0 = 0.
The last bit of the result is 0. The 5-prefix of T ≠ the 5-suffix of P.
The only matching found remains the one previously obtained: T[1] = P[2] = b, T[2] = P[3], T[3] = P[4], T[4] = P[5].
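The five steps can be replayed with a short Python sketch. It keeps the vectors as lists of bits, as on the slides, rather than as machine words, and the name `and_trace` is an assumption made for illustration. At step x the last bit of the previous result is dropped, only the last |P|-x+1 bits of the current bit pattern are taken, and the two are ANDed.

```python
def and_trace(t: str, p: str) -> tuple[list[str], list[int]]:
    """Replay the AND steps; return the vector kept after each step and
    the lengths x for which the x-prefix of t equals the x-suffix of p."""
    m = len(p)
    pattern = {ch: [1 if pc == ch else 0 for pc in p] for ch in set(p)}
    current = [1] * m
    trace, lengths = [], []
    for x, ch in enumerate(t[:m], start=1):
        bits = pattern.get(ch, [0] * m)
        kept = current if x == 1 else current[:-1]  # ignore the old last bit
        tail = bits[x - 1:]                         # last m - x + 1 bits
        current = [a & b for a, b in zip(kept, tail)]
        trace.append("".join(map(str, current)))
        if current[-1] == 1:
            lengths.append(x)
    return trace, lengths


print(and_trace("bcbdc", "abcbd"))
# (['01010', '0100', '010', '01', '0'], [4])
```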

Definition: The Logic Operator (AND, &)
1 & 1 = 1
1 & 0 = 0
0 & 1 = 0
0 & 0 = 0
Bit Pattern Of String (BPS)
Given a string S which is composed of n distinct characters, BPS builds one bit pattern per distinct character. Each pattern marks the positions at which that character appears in the string.
S = abcabcabc
S is composed of 3 distinct characters, which are a, b and c.
a-bit pattern = 1 0 0 1 0 0 1 0 0
b-bit pattern = 0 1 0 0 1 0 0 1 0
c-bit pattern = 0 0 1 0 0 1 0 0 1

The Algorithm
ww(T = t1t2…tn, P = p1p2…pm)
Preprocessing
    Find the character set of P
    Build the character_bit pattern of P and the character_rbit pattern of reversed P
Search
    For k = 1 to ⌊n/m⌋ do
        Open a wide window whose length is 2m-1 and whose center point is at km
        Let the window be denoted as a1a2…a2m-1
        Let a1a2…am-1 be denoted as T1
        Let amam+1…a2m-1 be denoted as T2
        /* we use the modified convolution method to find the matchings */
        Find out all prefixes of T2 which are suffixes of P.
        state 1: Find out the corresponding prefixes of P which are suffixes of T1.
        /* each time we can jump the wide window by |P| */
        state 2:
    End For

Having constructed the character bit pattern of P, we may use the character bit pattern of P to find whether the prefix of T2 is equal to the suffix of P. Having constructed the character bit pattern of reversed P, we may use the character bit pattern of reversed P to find whether the suffix of T1 is equal to the prefix of P.

Find out all prefixes of T2 which are suffixes of P.
Suf_bit = 1 … 1      // temporary space for storing the result
|Suf_bit| = |T2|     // the length of the temporary space is equal to the length of T2
x = 1                // x is the index for reading T2
Read T2 from left to right; for each character read:
    if the character belongs to the character set of P
        // We use the AND-operation to simulate the convolution method.
        // After each simulation, we store the result in the temporary space.
        Suf_bit[|P|…x] = Suf_bit[|P|…x] & character_bit[(|P|-x+1)…1]
        /* If the |P|th to xth bits of Suf_bit are all zero, no longer prefix of T2 can be
           equal to a suffix of P. Therefore we can skip the remaining characters of T2. */
        if the |P|th to xth bits of Suf_bit are all zero
            goto state 1
        end if
    else    // the character read from T2 is not one of the characters of P
        Set the |P|th to xth bits of Suf_bit to zero.
        goto state 1    // finish reading T2; the prefixes found so far still need the T1 check
    end if
    // if the xth bit of Suf_bit is 1, the x-prefix of T2 is equal to the x-suffix of P
    if Suf_bit[x] == 1
        if x == |P|     // the whole of P is equal to T2
            we found a matching at km
        else
            we found an x-prefix of T2 which is the x-suffix of P
        end if
    end if
    x++     // increase the index for reading the next character
Fig.: Find out all prefixes of T2 which are suffixes of P.

Having constructed the character bit pattern of reversed P, we may use the character bit pattern of reversed P to find whether the suffix of T1 is equal to the prefix of P.

Find out the corresponding prefixes of P which are suffixes of T1.
state 1:
// If the previous step did not find any prefix of T2 equal to a suffix of P,
// we do not need to look for a suffix of T1 equal to a prefix of P.
if the |P|th to 1st bits of Suf_bit are all zero
    goto state 2
else
    Pre_bit = 1 … 1          // temporary space for storing the result
    y = |Pre_bit| = |T1|     // the length of the temporary space is equal to the length of T1
    z = 1                    // z is the index for reading T1
    Read T1 from right to left; for each character read:
        if the character belongs to the character set of P
            // We use the AND-operation to simulate the convolution method.
            Pre_bit[y…z] = Pre_bit[y…z] & character_rbit[(y-z+1)…1]
            /* If the yth to zth bits of Pre_bit are all zero, no longer suffix of T1 can be
               equal to a prefix of P. Therefore we can skip the remaining characters of T1. */
            if the yth to zth bits of Pre_bit are all zero
                goto state 2
            end if
        else    // the character read from T1 is not one of the characters of P
            goto state 2    // finish reading T1
        end if
        // if the zth bit of Pre_bit is 1, the z-suffix of T1 is equal to the z-prefix of P
        if (Pre_bit[z] == 1) then
            /* If a suffix of T1 is equal to a prefix of P, we need to check whether the
               corresponding prefix of T2 appeared in the Suf_bit pattern. */
            if (Pre_bit[z] & Suf_bit[|P|-z])
                Found a matching at km - z
            end if
        end if
        z++     // read the next character
end if
state 2:
Fig.: Find out the corresponding prefixes of P which are suffixes of T1.
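The two scans and the window loop can be put together in a compact Python sketch. This is not the authors' implementation. It stores each bit pattern in a Python integer, with the lowest bit standing for the last character of the pattern. It assumes that a character outside P only ends the scan of T2, so that prefixes already found are still checked against T1, and it records all bits of T1 before combining them with Suf_bit. All function names are illustrative, and positions are 0-based.

```python
def char_bits(p: str) -> dict[str, int]:
    """Bit pattern of every character of p, stored as an integer."""
    m = len(p)
    bits: dict[str, int] = {}
    for i, ch in enumerate(p):
        bits[ch] = bits.get(ch, 0) | (1 << (m - 1 - i))
    return bits


def scan(s: str, bits: dict[str, int], m: int) -> int:
    """Bit x-1 of the result is set iff the x-prefix of s equals the
    x-suffix of the pattern that bits was built from."""
    full = (1 << m) - 1
    vec = full
    for x, ch in enumerate(s[:m], start=1):
        done = (1 << (x - 1)) - 1  # bits below x-1 are final; keep them
        vec = (vec & done) | (vec & (bits.get(ch, 0) << (x - 1)) & full)
        if vec >> (x - 1) == 0:    # no longer prefix can match
            break
    return vec & ((1 << min(len(s), m)) - 1)  # drop unconfirmed bits


def wide_window_search(t: str, p: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of p in t."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    bits, rbits = char_bits(p), char_bits(p[::-1])
    found = []
    for k in range(1, n // m + 1):
        c = k * m - 1                    # 0-based index of t_km
        suf = scan(t[c:c + m], bits, m)  # prefixes of T2 vs suffixes of P
        if suf == 0:
            continue                     # nothing to combine: next window
        # T1 is read from right to left against the reversed pattern.
        pre = scan(t[c - m + 1:c][::-1], rbits, m)
        for z in range(m - 1, 0, -1):
            if (pre >> (z - 1)) & 1 and (suf >> (m - z - 1)) & 1:
                found.append(c - z)
        if (suf >> (m - 1)) & 1:         # the whole of P equals T2
            found.append(c)
    return found


print(wide_window_search("aababcbdcea", "abcbd"))  # [3]
print(wide_window_search("ababa", "aba"))          # [0, 2]
```

In the 1-based notation of the slides, the first result corresponds to a matching at km - z = 5 - 1 = 4, and the second to matchings at positions 1 and 3.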

Example: T = aababcbdc, P = abcbd
Let us produce a wide window whose length is |P| - 1 + |P| = 2|P| - 1. In this case, |P| = 5 and 2|P| - 1 = 9.
T1 = aaba (length |P|-1 = 4), T2 = bcbdc (length |P| = 5)

Preprocessing
Build the character bit pattern of P
P = abcbd
Find all bit patterns of P. P is composed of a, b, c, b, d.
The character set of P = {a, b, c, d}
a_bit = 1 0 0 0 0
b_bit = 0 1 0 1 0
c_bit = 0 0 1 0 0
d_bit = 0 0 0 0 1
Having constructed the character bit pattern of P, we may use the character bit pattern of P to find whether the prefix of T2 is equal to the suffix of P.

Build the character bit pattern of reversed P
P = abcbd, reversed P = dbcba
Find all character bit patterns of reversed P. P is composed of a, b, c, b, d.
The character set of P = {a, b, c, d}
a_rbit = 0 0 0 0 1
b_rbit = 0 1 0 1 0
c_rbit = 0 0 1 0 0
d_rbit = 1 0 0 0 0
Having constructed the character bit pattern of reversed P, we may use the character bit pattern of reversed P to find whether the suffix of T1 is equal to the prefix of P.
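The preprocessing step for this example can be sketched in Python as follows. The function name `char_bit_table` is illustrative, and the patterns are kept as strings of 0s and 1s for readability rather than as machine words.

```python
def char_bit_table(p: str) -> dict[str, str]:
    """Map every distinct character of p to its bit pattern, written as a string."""
    return {
        ch: "".join("1" if pc == ch else "0" for pc in p)
        for ch in sorted(set(p))
    }


p = "abcbd"
character_bit = char_bit_table(p)         # patterns of P
character_rbit = char_bit_table(p[::-1])  # patterns of reversed P
print(character_bit)
print(character_rbit)
```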

T2 = bcbdc. x = 1: the character is ‘b’
∴ Suf_bit[5…1] = Suf_bit[5…1] & b_bit[5…1]
∵ Suf_bit[1] is ‘0’, the 1-prefix of T2 is not equal to the 1-suffix of P
Suf_bit[5…1] = 0 1 0 1 0

x = 2: the character is ‘c’
∴ Suf_bit[5…2] = Suf_bit[5…2] & c_bit[4…1]
∵ Suf_bit[2] is ‘0’, the 2-prefix of T2 is not equal to the 2-suffix of P
Suf_bit[5…1] = 0 1 0 0 0

x = 3: the character is ‘b’
∴ Suf_bit[5…3] = Suf_bit[5…3] & b_bit[3…1]
∵ Suf_bit[3] is ‘0’, the 3-prefix of T2 is not equal to the 3-suffix of P
Suf_bit[5…1] = 0 1 0 0 0

x = 4: the character is ‘d’
∴ Suf_bit[5…4] = Suf_bit[5…4] & d_bit[2…1]
∵ Suf_bit[4] is ‘1’, the 4-prefix of T2 is equal to the 4-suffix of P
Suf_bit[5…1] = 0 1 0 0 0

x = 5: the character is ‘c’
∴ Suf_bit[5…5] = Suf_bit[5…5] & c_bit[1…1]
∵ Suf_bit[5] is ‘0’, the 5-prefix of T2 is not equal to the 5-suffix of P
∴ Suf_bit[5…1] = 0 1 0 0 0
We have found one suffix of P, the 4-suffix, which is a prefix of T2. The corresponding prefix of P which we need to find at the end of T1 is the (|P|-4)-prefix, that is, the 1-prefix. If we find it, we have a matching.

Having constructed the character bit pattern of reversed P, we may use the character bit pattern of reversed P to find whether the suffix of T1 is equal to the prefix of P.

T1 = aaba, read from right to left. z = 1: the character is ‘a’
∴ Pre_bit[4…1] = Pre_bit[4…1] & a_rbit[4…1]
∵ Pre_bit[1] is ‘1’, the 1-suffix of T1 is equal to the 1-prefix of P
Pre_bit[4…1] = 0 0 0 1
if (Pre_bit[1] == 1) then
    if (Pre_bit[1] & Suf_bit[5-1])    // check |prefix| + |suffix| = |P|, with Suf_bit[5…1] = 0 1 0 0 0
        Found a matching

z = 2: the character is ‘b’
∴ Pre_bit[4…2] = Pre_bit[4…2] & b_rbit[3…1]
∵ Pre_bit[2] is ‘0’, the 2-suffix of T1 is not equal to the 2-prefix of P
if (Pre_bit[2] == 1) then
    if (Pre_bit[2] & Suf_bit[5-2])    → no need to check
Pre_bit[4…1] = 0 0 0 1
Since Pre_bit[4…2] is now all zero, the pseudocode would already go to state 2 here; the remaining steps are shown for completeness.

z = 3: the character is ‘a’
∴ Pre_bit[4…3] = Pre_bit[4…3] & a_rbit[2…1]
∵ Pre_bit[3] is ‘0’, the 3-suffix of T1 is not equal to the 3-prefix of P
if (Pre_bit[3] == 1) then
    if (Pre_bit[3] & Suf_bit[5-3])    → no need to check
Pre_bit[4…1] = 0 0 0 1

z = 4: the character is ‘a’
∴ Pre_bit[4…4] = Pre_bit[4…4] & a_rbit[1…1]
∵ Pre_bit[4] is ‘0’, the 4-suffix of T1 is not equal to the 4-prefix of P
if (Pre_bit[4] == 1) then
    if (Pre_bit[4] & Suf_bit[5-4])    → no need to check
Pre_bit[4…1] = 0 0 0 1
