1 / 29

# COMP170 Tutorial 13: Pattern Matching - PowerPoint PPT Presentation

COMP170 Tutorial 13: Pattern Matching. T:. P:. Overview. 1. What is Pattern Matching? 2. The Naive Algorithm 3. The Boyer-Moore Algorithm 4. The Rabin-Karp Algorithm 5. Questions. 1. What is Pattern Matching?. Definition:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' COMP170 Tutorial 13: Pattern Matching' - anise

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### COMP170 Tutorial 13: Pattern Matching

T:

P:

1. What is Pattern Matching?

2. The Naive Algorithm

3. The Boyer-Moore Algorithm

4. The Rabin-Karp Algorithm

5. Questions

• Definition:

• given a text string T and a pattern string P, find the pattern inside the text

• T: “the rain in spain stays mainly on the plain”

• P: “n th”

• Applications:

• text editors, Search engines (e.g. Google), image analysis

• Assume S is a string of size m.

• A substring S[i .. j] of S is the string fragment between indexes i and j.

• A prefix of S is a substring S[0 .. i]

• A suffix of S is a substring S[i .. m-1]

• i is any index between 0 and m-1

a

n

d

r

e

w

0

5

Examples

• Substring S[1..3] == "ndr"

• All possible prefixes of S:

• "andrew", "andre", "andr", "and", "an”, "a"

• All possible suffixes of S:

• "andrew", "ndrew", "drew", "rew", "ew", "w"

2. The Naive Algorithm

• Check each position in the text T to see if the pattern P starts in that position

T:

a

n

d

r

e

w

T:

a

n

d

r

e

w

P:

r

e

w

P:

r

e

w

P moves 1 char at a time through T

. . . .

Algorithm and Analysis

• Brutal force

continued

• The brute force algorithm is fast when the alphabet of the text is large

• e.g. A..Z, a..z, 1..9, etc.

• It is slower when the alphabet is small

• e.g. 0, 1 (as in binary files, image files, etc.)

• Example of a worst case:

• T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"

• P: "aaah"

• Example of a more average case:

• T: "a string searching example is standard"

• P: "store"

continued

Reverse na text is large ive algorithm

• Why not search from the end of P?

• Boyer and Moore

Reverse-Naive-Search(T,P)

01 for s ¬ 0 to n – m

02 j ¬ m – 1 // start from the end

03 // check if T[s..s+m–1] = P[0..m–1]

04 while T[s+j] = P[j] do

05 j ¬ j - 1

06 if j < 0 return s

07 return –1

• Running time is exactly the same as of the naive algorithm…

3. The Boyer-Moore Algorithm text is large

• The Boyer-Moore pattern matching algorithm is based on two techniques.

• 1. The looking-glass technique

• find P in T by moving backwards through P, starting at its end

• 2. The text is large character-jump technique

• when a mismatch occurs at T[i] =/= P[m-1]

• the character in pattern P[m-1] is not the same as T[i]

• There are 2 possible cases.

T

x

i

P

b

Case 1 text is large

• If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].

T

T

?

?

a

a

x

x

i

P

P

x

x

b

b

c

c

Case text is large 2

• If the character T[i] does not appear in P, then shift P to align P[0] with T[i+1].

T

T

?

?

a

a

x

x

?

i

inew

P

P

d

d

b

b

c

c

0

No x in P

Case text is large 3

• If T[i] = P[m-1] and the match is incomplete, align T[i] with the last occurrence of T[i] in P.

T

T

?

?

a

a

x

x

?

inew

i

P

P

a

a

a

a

b

b

c

c

Boyer-Moore Example (1) text is large

T:

P:

Boyer-Moore algorithm text is large

• To implement, we need to find out for each character c in the alphabet, the amount of shift needed if P[m-1] aligns with the character c in the input text and they don’t match.

Example: Suppose the alphabet is

{a, b,c} and the pattern is ababbb.

Then,

shift[c] = 6

shift[a] = 3

shift[b] = 1

This takes O(m + A) time, where A is the number of possible characters. Afterwards, matching P with substrings in T is very fast in practice.

Analysis text is large

• Boyer-Moore worst case running time is O(nm + A)

• But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.

• e.g. good for English text, poor for binary

• Boyer-Moore is significantly faster than brute force for searching English text.

Fingerprint text is large idea

• Assume:

• We can compute a fingerprint f(P)of P in O(m) time.

• If f(P)¹ f(T[s .. s+m–1]), then P ¹ T[s .. s+m–1]

• We can compare fingerprints in O(1)

• We can compute f’ = f(T[s+1.. s+m]) from f(T[s .. s+m–1]), in O(1)

f’

f

Algorithm with Fingerprints text is large

• Let the alphabet S={0,1,2,3,4,5,6,7,8,9}

• Let fingerprint to be just a decimal number, i.e., f(“1045”) = 1*103 + 0*102 + 4*101 + 5 = 1045

Fingerprint-Search(T,P)

01 fp ¬ compute f(P)

02 f ¬ compute f(T[0..m–1])

03 for s ¬ 0 to n – m do

04 if fp = f return s

05 f ¬ (f – T[s]*10m-1)*10 + T[s+m]

06 return –1

T[s]

new f

f

T[s+m]

• Running time O(m+n)

• Where is the catch?

Using a Hash Function text is large

• Problem:

• we can not assume we can do arithmetics with m-digits-long numbers in O(1) time

• Solution: Use a hash function h = f mod q

• For example, if q = 7, h(“52”) = 52 mod 7 = 3

• h(S1) ¹h(S2) Þ S1¹S2

• But h(S1) = h(S2) does not imply S1=S2!

• For example, if q = 7, h(“73”) = 3, but “73” ¹ “52”

• Basic “mod q” arithmetics:

• (a+b) mod q = (a mod q + b mod q) mod q

• (a*b) mod q = (a mod q)*(b mod q) mod q

Preprocessing and Stepping text is large

• Preprocessing:

• fp = P[m-1] + 10*(P[m-2] + 10*(P[m-3]+ … … + 10*(P[1] + 10*P[0])…)) mod q

• In the same way compute ft from T[0..m-1]

• Example: P = “2531”, q = 7, what is fp?

• Stepping:

• ft = (ft–T[s]*10m-1 mod q)*10 + T[s+m]) mod q

• 10m-1 mod q can be computed once in the preprocessing

• Example: Let T[…] = “5319”, q = 7, what is the corresponding ft?

T[s]

new ft

ft

T[s+m]

Rabin-Karp Algorithm text is large

Rabin-Karp-Search(T,P)

01 q¬ a prime larger than m

02 c ¬10m-1mod q //run a loop multiplying by 10mod q

03 fp ¬ 0; ft ¬ 0

04 for i¬ 0 to m-1 // preprocessing

05 fp ¬ (10*fp + P[i]) mod q

06   ft ¬ (10*ft + T[i]) mod q

07 for s ¬ 0 to n – m // matching

08 if fp = ft then // run a loop to compare strings

09 if P[0..m-1] = T[s..s+m-1] return s

10 ft ¬ ((ft – T[s]*c)*10 + T[s+m]) mod q

11 return –1

• How many character comparisons are done if

T = “2531978” and P = “1978”?

Analysis text is large

• If q is a prime, the hash function distributes m-digit strings evenly among the q values

• Thus, only every q-th value of shift s will result in matching fingerprints (which will require comparing stings with O(m) comparisons)

• Expected running time (if q > m):

• Outer loop: O(n-m)

• All inner loops:

• Total time: O(n-m)

• Worst-case running time: O((n-m+1)m)

Rabin-Karp in Practice text is large

• If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).

• Choosing prime q > m can be done with randomized algorithms in O(m), or q can be fixed to be the largest prime so that 10*q fits in a computer word.

• Rabin-Karp is simple and can be easily extended to two-dimensional pattern matching.

Question 1 text is large

• What is the worst case complexity of the Naïve algorithm? Find an example of the worst case.

• What is the worst case complexity of the BM algorithm? Find an example of the worst case.

Question 2 text is large

• Illustrate how does BM work for the following pattern matching problem.

• P: abacab

Answer to question 1 text is large

• Example of a worst case for Naïve algorithm:

• T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"

• P: "aaah“

• Time complexity O(mn)

BM text is large Worst Case Example

• T: "aaaaa…a"

• P: "baaaaa“

• Complexity

• O(mn+A)

T:

P:

T: text is large

P: