
COMP170 Tutorial 13: Pattern Matching


1. What is Pattern Matching?

2. The Naive Algorithm

3. The Boyer-Moore Algorithm

4. The Rabin-Karp Algorithm

5. Questions

- Definition:
- given a text string T and a pattern string P, find the pattern inside the text
- T: "the rain in spain stays mainly on the plain"
- P: "n th"

- Applications:
- text editors, search engines (e.g. Google), image analysis
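As a quick illustration of the definition, the example text and pattern can be fed to Python's built-in `str.find`. This only shows what "finding P inside T" returns; it is not one of the algorithms covered in this tutorial.

```python
# Locate the pattern P inside the text T with Python's built-in search.
T = "the rain in spain stays mainly on the plain"
P = "n th"

index = T.find(P)  # index of the first occurrence, or -1 if P is absent
print(index)                         # → 32
print(T[index:index + len(P)] == P)  # → True
```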

- Assume S is a string of size m.
- A substring S[i .. j] of S is the string fragment between indexes i and j.
- A prefix of S is a substring S[0 .. i]
- A suffix of S is a substring S[i .. m-1]
- i is any index between 0 and m-1

- Example: S = "andrew", with indexes 0 to 5 (m = 6)

- Substring S[1..3] == "ndr"
- All possible prefixes of S:
- "andrew", "andre", "andr", "and", "an", "a"

- All possible suffixes of S:
- "andrew", "ndrew", "drew", "rew", "ew", "w"
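The substring, prefix and suffix definitions map directly onto Python slices. The sketch below assumes a small helper, `substring` (a name chosen here, not from the slides), to bridge the inclusive S[i .. j] notation and Python's end-exclusive slices.

```python
S = "andrew"
m = len(S)

# S[i .. j] in the slides includes both ends; Python slices exclude the end.
def substring(S, i, j):
    return S[i:j + 1]

prefixes = [substring(S, 0, i) for i in range(m)]      # S[0 .. i]
suffixes = [substring(S, i, m - 1) for i in range(m)]  # S[i .. m-1]

print(substring(S, 1, 3))  # → ndr
print(prefixes)  # → ['a', 'an', 'and', 'andr', 'andre', 'andrew']
print(suffixes)  # → ['andrew', 'ndrew', 'drew', 'rew', 'ew', 'w']
```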

- Check each position in the text T to see if the pattern P starts in that position

[Figure: T = "andrew", P = "rew". P moves 1 char at a time through T until it matches at index 3.]

- Brute force
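The brute force idea of checking each position in T can be sketched in a few lines of Python. The function name `naive_search` is chosen here for illustration.

```python
def naive_search(T, P):
    """Return the first index where P starts in T, or -1 if there is none."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):          # every possible starting position
        j = 0
        while j < m and T[s + j] == P[j]:
            j += 1                      # characters agree so far
        if j == m:                      # all m characters matched
            return s
    return -1

print(naive_search("andrew", "rew"))  # → 3
```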


- The brute force algorithm is fast when the alphabet of the text is large
- e.g. A..Z, a..z, 1..9, etc.

- It is slower when the alphabet is small
- e.g. 0, 1 (as in binary files, image files, etc.)

- Example of a worst case:
- T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
- P: "aaah"

- Example of a more average case:
- T: "a string searching example is standard"
- P: "store"
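To see the difference between the two cases, one can count character comparisons. The sketch below is a brute force search with a counter added; `naive_search_count` is a name assumed here. In the worst-case input every shift costs m comparisons, while in the average case most shifts fail on the first character.

```python
def naive_search_count(T, P):
    """Brute-force search that also counts character comparisons."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if T[s + j] != P[j]:
                break
            j += 1
        if j == m:
            return s, comparisons
    return -1, comparisons

worst = naive_search_count("aaaaaaaaaaaaaaaaaaaaaaaaaah", "aaah")
average = naive_search_count("a string searching example is standard", "store")
print(worst)    # (index, comparisons) for the worst-case input
print(average)  # index is -1: "store" does not occur in the text
```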


- Why not search from the end of P?
- Boyer and Moore

Reverse-Naive-Search(T,P)

for s ← 0 to n - m
    j ← m - 1 // start from the end
    // check if T[s..s+m-1] = P[0..m-1]
    while j ≥ 0 and T[s+j] = P[j] do
        j ← j - 1
    if j < 0 return s
return -1

- Running time is exactly the same as that of the naive algorithm…
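A direct Python rendering of Reverse-Naive-Search might look like the sketch below. The `j >= 0` guard keeps the inner loop from running past the start of P on a full match.

```python
def reverse_naive_search(T, P):
    """Naive search that compares each window from right to left."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        j = m - 1                       # start from the end of P
        while j >= 0 and T[s + j] == P[j]:
            j -= 1                      # move backwards through P
        if j < 0:                       # every character of P matched
            return s
    return -1

print(reverse_naive_search("andrew", "rew"))  # → 3
```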

- The Boyer-Moore pattern matching algorithm is based on two techniques.
- 1. The looking-glass technique
- find P in T by moving backwards through P, starting at its end

- 2. The character-jump technique
- when a mismatch occurs at T[i] = x with x ≠ P[m-1]
- the last pattern character P[m-1] is not the same as the text character x = T[i]

- There are 2 possible cases.

[Figure: T[i] = x is aligned with the last character of P, which is b.]

- If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i].

[Figure: P before and after the shift; the x in P is moved under T[i] = x.]

- If the character T[i] does not appear in P, then shift P to align P[0] with T[i+1].

[Figure: no x in P; P is shifted so that P[0] is under T[i+1], and i moves to inew.]

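- If T[i] = P[m-1] and the match is incomplete, align T[i] with the last occurrence of T[i] in P[0 .. m-2], i.e. the rightmost occurrence before the final character. If there is none, shift P to align P[0] with T[i+1].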

[Figure: T[i] = P[m-1] but the match is incomplete; P is shifted and i moves to inew.]

- To implement this, we need to find, for each character c in the alphabet, the amount of shift needed when P[m-1] is aligned with the character c in the input text and the match at this position fails.

Example: Suppose the alphabet is {a, b, c} and the pattern is ababbb. Then,

shift[c] = 6

shift[a] = 3

shift[b] = 1

This takes O(m + A) time, where A is the number of possible characters. Afterwards, matching P with substrings in T is very fast in practice.
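The shift table and the search loop can be sketched as follows. This follows the simplified rule on these slides, where the shift depends only on the text character aligned with P[m-1] (the same rule used by the Boyer-Moore-Horspool variant); it is not the full Boyer-Moore algorithm. The names `build_shift` and `character_jump_search` are assumptions made for this example.

```python
def build_shift(P, alphabet):
    """Shift for each character c when c is aligned with P[m-1]."""
    m = len(P)
    shift = {c: m for c in alphabet}    # c absent from P[0..m-2]: jump by m
    for j in range(m - 1):              # the last position m-1 is skipped
        shift[P[j]] = m - 1 - j         # later occurrences overwrite earlier ones
    return shift

def character_jump_search(T, P, alphabet):
    """Return the first index of P in T, or -1, using the shift table."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    shift = build_shift(P, alphabet)
    i = m - 1                           # T[i] is aligned with P[m-1]
    while i < n:
        j, k = m - 1, i
        while j >= 0 and T[k] == P[j]:  # looking-glass: compare right to left
            j -= 1
            k -= 1
        if j < 0:
            return k + 1                # full match starts here
        i += shift.get(T[i], m)         # character jump
    return -1

print(build_shift("ababbb", "abc"))  # → {'a': 3, 'b': 1, 'c': 6}
T = "the rain in spain stays mainly on the plain"
print(character_jump_search(T, "n th", set(T)))  # → 32
```

Building the table touches each alphabet character once and each pattern character once, which matches the O(m + A) bound.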

- Boyer-Moore worst case running time is O(nm + A)
- But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
- e.g. good for English text, poor for binary

- Boyer-Moore is significantly faster than brute force for searching English text.

- Assume:
- We can compute a fingerprint f(P) of P in O(m) time.
- If f(P) ≠ f(T[s .. s+m-1]), then P ≠ T[s .. s+m-1]
- We can compare fingerprints in O(1)
- We can compute f' = f(T[s+1 .. s+m]) from f(T[s .. s+m-1]) in O(1)


- Let the alphabet Σ = {0,1,2,3,4,5,6,7,8,9}
- Let the fingerprint be just a decimal number, i.e., f("1045") = 1*10^3 + 0*10^2 + 4*10^1 + 5 = 1045

Fingerprint-Search(T,P)

fp ← compute f(P)
f ← compute f(T[0..m-1])
for s ← 0 to n - m do
    if fp = f return s
    if s < n - m then // T[s+m] exists only before the last shift
        f ← (f - T[s]*10^(m-1))*10 + T[s+m]
return -1

[Figure: sliding the window by one position; T[s] drops out of f and T[s+m] enters to give the new f.]
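Fingerprint-Search can be tried out directly in Python, whose integers have arbitrary precision. This is only a sketch for digit strings; the function name `fingerprint_search` is assumed here. Note that arithmetic on m-digit integers is not constant-time, even though the code hides that.

```python
def fingerprint_search(T, P):
    """Search a digit string T for P using the decimal value as fingerprint."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    if m > n:
        return -1
    fp = int(P)                 # f(P)
    f = int(T[:m])              # f(T[0..m-1])
    high = 10 ** (m - 1)        # weight of the leading digit
    for s in range(n - m + 1):
        if fp == f:
            return s
        if s < n - m:           # slide: drop T[s], append T[s+m]
            f = (f - int(T[s]) * high) * 10 + int(T[s + m])
    return -1

print(fingerprint_search("2531978", "1978"))  # → 3
```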

- Running time O(m+n)
- Where is the catch?

- Problem:
- we cannot assume we can do arithmetic with m-digit-long numbers in O(1) time

- Solution: Use a hash function h = f mod q
- For example, if q = 7, h("52") = 52 mod 7 = 3
- h(S1) ≠ h(S2) ⇒ S1 ≠ S2
- But h(S1) = h(S2) does not imply S1 = S2!
- For example, if q = 7, h("73") = 3, but "73" ≠ "52"

- Basic "mod q" arithmetic:
- (a+b) mod q = (a mod q + b mod q) mod q
- (a*b) mod q = ((a mod q)*(b mod q)) mod q
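A quick numeric check of the collision example and of the two identities, using the values q = 7, 52 and 73 from the slides:

```python
q = 7
a, b = 52, 73

# Two different strings can share a hash value (a collision).
print(a % q, b % q)  # → 3 3

# The two identities, checked on one pair of numbers.
print((a + b) % q == ((a % q) + (b % q)) % q)  # → True
print((a * b) % q == ((a % q) * (b % q)) % q)  # → True
```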

- Preprocessing:
- fp = (P[m-1] + 10*(P[m-2] + 10*(P[m-3] + … + 10*(P[1] + 10*P[0])…))) mod q
- In the same way compute ft from T[0..m-1]
- Example: P = “2531”, q = 7, what is fp?
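The nested formula is Horner's rule with a reduction mod q at every step, so no intermediate value grows beyond about 10*q. A minimal sketch for digit strings, with `fingerprint_mod` as an assumed name:

```python
def fingerprint_mod(digits, q):
    """Hash of a decimal digit string, reduced mod q at every step."""
    h = 0
    for ch in digits:
        h = (10 * h + int(ch)) % q
    return h

# Steps for "2531" with q = 7: 2, then 25 mod 7 = 4, then 43 mod 7 = 1, then 11 mod 7 = 4
print(fingerprint_mod("2531", 7))  # → 4
print(2531 % 7)                    # → 4
```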

- Stepping:
- ft = ((ft - T[s]*(10^(m-1) mod q))*10 + T[s+m]) mod q
- 10^(m-1) mod q can be computed once in the preprocessing
- Example: Let T[…] = “5319”, q = 7, what is the corresponding ft?
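One stepping update can be worked through as below. The slide gives only the window "5319"; the sketch assumes the preceding window was "2531" (the pattern from the preprocessing example), i.e. that the text begins "25319".

```python
q = 7
m = 4
c = pow(10, m - 1, q)       # 10^(m-1) mod q, computed once; here 1000 mod 7 = 6

T = "25319"                 # assumed text: window "2531" followed by the digit 9
ft = int(T[0:m]) % q        # hash of "2531", which is 4

# Slide the window: remove T[0] = 2, append T[4] = 9.
ft = ((ft - int(T[0]) * c) * 10 + int(T[m])) % q
print(ft)               # → 6
print(int("5319") % q)  # → 6
```

The intermediate value (4 - 12)*10 + 9 = -71 is negative; Python's `%` returns a non-negative result for a positive modulus, so no extra correction is needed here. In languages where the remainder can be negative, adding q before reducing is a common fix.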

[Figure: T[s] leaves ft and T[s+m] enters to give the new ft.]

Rabin-Karp-Search(T,P)

q ← a prime larger than m
c ← 10^(m-1) mod q // run a loop multiplying by 10 mod q
fp ← 0; ft ← 0
for i ← 0 to m-1 // preprocessing
    fp ← (10*fp + P[i]) mod q
    ft ← (10*ft + T[i]) mod q
for s ← 0 to n - m // matching
    if fp = ft then // run a loop to compare strings
        if P[0..m-1] = T[s..s+m-1] return s
    if s < n - m then // T[s+m] exists only before the last shift
        ft ← ((ft - T[s]*c)*10 + T[s+m]) mod q
return -1
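A Python sketch of Rabin-Karp-Search for digit strings follows. The default q = 101 is an assumption made for this example (any prime larger than m fits the pseudocode), and the function name `rabin_karp_search` is chosen here.

```python
def rabin_karp_search(T, P, q=101):
    """Rabin-Karp on decimal digit strings; q is assumed to be a prime > len(P)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    if m > n:
        return -1
    c = pow(10, m - 1, q)               # 10^(m-1) mod q
    fp = ft = 0
    for i in range(m):                  # preprocessing
        fp = (10 * fp + int(P[i])) % q
        ft = (10 * ft + int(T[i])) % q
    for s in range(n - m + 1):          # matching
        # Equal hashes may be a collision, so confirm with a string comparison.
        if fp == ft and T[s:s + m] == P:
            return s
        if s < n - m:                   # roll the hash to the next window
            ft = ((ft - int(T[s]) * c) * 10 + int(T[s + m])) % q
    return -1

print(rabin_karp_search("2531978", "1978"))       # → 3
print(rabin_karp_search("2531978", "1978", q=7))  # → 3
```

The result does not depend on q, because every hash match is verified; q only affects how many spurious matches have to be checked.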

- How many character comparisons are done if T = "2531978" and P = "1978"?

- If q is a prime, the hash function distributes m-digit strings evenly among the q values
- Thus, on average only every q-th value of shift s will result in matching fingerprints (which will require comparing strings with O(m) comparisons)

- Expected running time (if q > m):
- Outer loop: O(n-m)
- All inner loops: ((n-m)/q)*m = O(n-m)
- Total time: O(n-m)

- Worst-case running time: O((n-m+1)m)

- If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).
- Choosing prime q > m can be done with randomized algorithms in O(m), or q can be fixed to be the largest prime so that 10*q fits in a computer word.
- Rabin-Karp is simple and can be easily extended to two-dimensional pattern matching.

- What is the worst case complexity of the Naïve algorithm? Find an example of the worst case.
- What is the worst case complexity of the BM algorithm? Find an example of the worst case.

- Illustrate how BM works for the following pattern matching problem.
- T: abacaabadcabacabaabb
- P: abacab

- Example of a worst case for Naïve algorithm:
- T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
- P: "aaah"

- Time complexity O(mn)

- Example of a worst case for BM algorithm:
- T: "aaaaa…a"
- P: "baaaaa"

- Time complexity O(mn+A)
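The worst case can be checked by counting comparisons. The sketch below uses the simplified shift rule from this tutorial (shift chosen from the text character under P[m-1]), and it assumes a text of n = 20 copies of "a", since the slide does not fix the length. The name `character_jump_count` is made up for this example.

```python
def character_jump_count(T, P):
    """Search with the character-jump shift table, counting comparisons."""
    n, m = len(T), len(P)
    shift = {}
    for j in range(m - 1):              # last occurrence in P[0..m-2]
        shift[P[j]] = m - 1 - j
    comparisons = 0
    i = m - 1                           # T[i] is aligned with P[m-1]
    while i < n:
        j, k = m - 1, i
        while j >= 0:
            comparisons += 1
            if T[k] != P[j]:
                break
            j -= 1
            k -= 1
        if j < 0:
            return k + 1, comparisons
        i += shift.get(T[i], m)         # characters not in the table jump by m
    return -1, comparisons

n, P = 20, "baaaaa"                     # n = 20 is an assumed text length
index, comparisons = character_jump_count("a" * n, P)
print(index, comparisons)  # → -1 90
```

Every alignment matches five a's and then fails on b, and the shift for a is only 1, so the count is m*(n-m+1) = 6*15 = 90.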
