
# Advisor: Prof. R. C. T. Lee Reporter: Z. H. Pan - PowerPoint PPT Presentation

Fast text searching: allowing errors. Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91. Advisor: Prof. R. C. T. Lee. Reporter: Z. H. Pan. Given a text T(1, n), a pattern P(1, m) and an error bound k, the approximate string matching problem asks for every location of T at which P occurs with at most k errors.


### Fast text searching: allowing errors

Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan

Given a text T(1,n), a pattern P(1,m) and an error bound k, our approximate string matching problem is defined as follows:

Find all locations i of T such that the following condition is satisfied: there exists a suffix A of T(1, i) such that d(A,P)≦k, where d(x,y) is the edit distance between x and y.

T=deaabeg,

P=aabac and k=2.

For i=5.

T(1, 5)= deaab.

We note that there exists a suffix A=aab of T(1, 5) such that d(A,P)=d(aab,aabac)=2.

T=deaabeg, P=aab and k=2.

Consider i=5.

T(1,5)=deaab.

We have A=aab of T(1,5) and d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0.

Consider i=6.

T(1,6)=deaabe.

We have A=aabe of T(1,6) and d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.
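The two examples above can be checked mechanically. The following is a minimal brute-force sketch, not the method of the paper: the function names `edit_distance` and `match_ends` are illustrative, and it assumes the empty string counts as a suffix of T(1, i).

```python
def edit_distance(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(y) + 1))
    for a in range(1, len(x) + 1):
        cur = [a] + [0] * len(y)
        for b in range(1, len(y) + 1):
            cost = 0 if x[a - 1] == y[b - 1] else 1
            cur[b] = min(prev[b - 1] + cost, prev[b] + 1, cur[b - 1] + 1)
        prev = cur
    return prev[len(y)]


def match_ends(text: str, pattern: str, k: int) -> list[int]:
    """Return every 1-based i such that some suffix A of T(1, i) has d(A, P) <= k."""
    ends = []
    for i in range(1, len(text) + 1):
        # Try every suffix of T(1, i); start == i gives the empty suffix (an assumption).
        if any(edit_distance(text[start:i], pattern) <= k for start in range(i + 1)):
            ends.append(i)
    return ends


print(edit_distance("aab", "aabac"))    # distance used in the first example
print(match_ends("deaabeg", "aab", 2))  # end positions for the second example
```

This check takes far more time than the table-based approach developed in the rest of the slides; it is only meant to make the problem definition concrete.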

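[Figure: a substring S of T split into S1 followed by S2, aligned with the pattern P split into P1 followed by P2.]

Let S be a substring of T.

If there exist a suffix S2 of S and a suffix P2 of P, with S = S1S2 and P = P1P2, such that

d(S2, P2) = 0 and d(S1, P1) ≦ k,

we have d(S, P) ≦ k.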

We may decompose A and B as follows:

B=ab+cd.

Thus d(A,B)=2.

Consider T(1,i) and P(1, j).

Case 1: T(i)=P(j). Let B denote the prefix P(1, j-1) of P. We consider whether there is a suffix A of T(1, i-1) such that d(A, B)≦k.

[Figure: A ends at position i-1 of T and B is P(1, j-1); T(i) is aligned with P(j).]

Case 2: T(i)≠P(j). We consider three cases:

2.1 Let B denote P(1, j). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to an insertion, as illustrated below:

[Figure: A ends at position i-1 of T and B is P(1, j); T(i) is the inserted character.]

2.2 Let B denote P(1, j-1). There is a suffix A of T(1, i) such that d(A,B)≦k-1. This corresponds to a deletion, as illustrated below:

[Figure: A ends at position i of T and B is P(1, j-1); P(j) is the deleted character.]

2.3 Let B denote P(1, j-1). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to a substitution, as illustrated below:

[Figure: A ends at position i-1 of T and B is P(1, j-1); T(i) is substituted for P(j).]
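Cases 1 and 2.1 to 2.3 can be read as a single recurrence. The sketch below restates them in terms of a minimum distance D(i, j), the smallest d(A, P(1, j)) over all suffixes A of T(1, i), rather than the 0/1 threshold used in the slides. The name `suffix_distances` and the border values in row 0 and column 0 are assumptions made for this illustration.

```python
def suffix_distances(text: str, pattern: str) -> list[list[int]]:
    """D[i][j] = min over suffixes A of T(1, i) of d(A, P(1, j)), with 1-based i and j."""
    n, m = len(text), len(pattern)
    # Assumed borders: D[i][0] = 0 (the empty suffix matches the empty prefix),
    # D[0][j] = j (with no text, all j pattern characters are errors).
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if text[i - 1] == pattern[j - 1]:
                D[i][j] = D[i - 1][j - 1]            # Case 1: T(i) = P(j)
            else:
                D[i][j] = 1 + min(D[i - 1][j],       # Case 2.1: insertion
                                  D[i][j - 1],       # Case 2.2: deletion
                                  D[i - 1][j - 1])   # Case 2.3: substitution
    return D


D = suffix_distances("deaabeg", "aab")
print([D[i][3] for i in range(1, 8)])  # best distance to the whole pattern at each end position
```

A location i is then reported whenever D[i][m] is at most k, which is the same condition as in the problem definition.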

To solve our approximate string matching problem, we start with a table, called Rk[n, m]. Let S=T(1, i). For 1≦i≦n and 1≦j≦m,

$$R_k(i,j) = \begin{cases} 1 & \text{if there exists a suffix } A \text{ of } S \text{ such that } d(A, P(1,j)) \le k, \\ 0 & \text{otherwise.} \end{cases}$$

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=9, j=4.

S=T(1, 9)=aabaacaab

P(1, 4)=aaba

A=aab

d(A,P(1, 4))=d(aab,aaba)=1

∴ R1(9, 4)=1

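R1 (columns: text position i and character ti; rows: pattern position j and character pj):

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |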

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=13 and j=5.

S=T(1, 13)=aabaacaabacab

P(1, 5)=aabac

There doesn’t exist any suffix A of S such that d(A,P(1, 5))≦1.

∴ R1(13,5)=0

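R1 (columns: text position i and character ti; rows: pattern position j and character pj):

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |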
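The entries R1(9, 4)=1 and R1(13, 5)=0 from the two examples can be reproduced directly from the definition of Rk(i, j). This is a brute-force sketch for checking small cases only. The names `edit_distance` and `r_by_definition` are illustrative, and the empty string is assumed to count as a suffix.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(x: str, y: str) -> int:
    """Recursive edit distance with memoisation; adequate for short strings."""
    if not x or not y:
        return len(x) + len(y)
    if x[-1] == y[-1]:
        return edit_distance(x[:-1], y[:-1])
    return 1 + min(edit_distance(x[:-1], y),       # drop last character of x
                   edit_distance(x, y[:-1]),       # drop last character of y
                   edit_distance(x[:-1], y[:-1]))  # substitute


def r_by_definition(text: str, pattern: str, k: int, i: int, j: int) -> int:
    """R_k(i, j): 1 iff some suffix A of T(1, i) satisfies d(A, P(1, j)) <= k."""
    prefix = pattern[:j]
    return int(any(edit_distance(text[s:i], prefix) <= k for s in range(i + 1)))


T, P = "aabaacaabacab", "aabac"
for j in range(1, len(P) + 1):
    # One printed line per pattern row j, columns i = 1..13.
    print(P[j - 1], "".join(str(r_by_definition(T, P, 1, i, j)) for i in range(1, len(T) + 1)))
```

Each printed line corresponds to one row of the R1 table, so the output can be compared against it cell by cell.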

Question: How can we find Rk(i, j)?

There are three types of operation in edit distance:

(1) Insertion

(2) Deletion

(3) Substitution

We consider them separately and combine the results later.

Let RIk(i,j), RDk(i,j) and RSk(i,j) denote the value of Rk(i,j) obtained by considering insertion, deletion and substitution, respectively.

Likewise, let RIk[i,j], RDk[i,j] and RSk[i,j] denote the corresponding tables for insertion, deletion and substitution, respectively.

Consider RIk(i,j) first.

$$RI_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i-1,j) = 1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1) = 1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: alignment of T and P illustrating an insertion at position i of T, with P(1, j) matched against a suffix ending at position i-1.]

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RI1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,4)=0

∴ RI1(6,4)=0

(3) When i=11 and j=4.

t11=‘c’≠p4=‘a’, R0(10,4)=1

∴ RI1(11,4)=1

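R0[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | a | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

RI1[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 3 | b | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |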
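The three RI1 examples can be replayed with a few lines of code. In this sketch the tables Rk are obtained by thresholding a minimum-distance table, which is a shortcut for checking the rule and not how the slides build them. The helper names `r_table` and `ri` and the border values are assumptions.

```python
def r_table(text: str, pattern: str, k: int) -> list[list[int]]:
    """R[i][j] = 1 iff some suffix of T(1, i) is within k edits of P(1, j)."""
    n, m = len(text), len(pattern)
    # Assumed borders: D[i][0] = 0 and D[0][j] = j.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if text[i - 1] == pattern[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return [[int(D[i][j] <= k) for j in range(m + 1)] for i in range(n + 1)]


def ri(text: str, pattern: str, k: int, i: int, j: int) -> int:
    """Insertion rule: look at R_{k-1}(i-1, j) on a mismatch, R_k(i-1, j-1) on a match."""
    if text[i - 1] != pattern[j - 1]:
        return r_table(text, pattern, k - 1)[i - 1][j]
    return r_table(text, pattern, k)[i - 1][j - 1]


T, P = "aabaacaabacab", "aabac"
for i, j in [(13, 3), (6, 4), (11, 4)]:
    print(f"RI1({i},{j}) =", ri(T, P, 1, i, j))
```

Recomputing the tables on every call is wasteful but keeps the sketch short; a real implementation would build each table once.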

Consider RDk(i,j).

$$RD_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i,j-1) = 1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1) = 1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: alignment of T and P illustrating a deletion, with P(1, j-1) matched against a suffix ending at position i of T.]

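Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RD1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(6,3)=0

∴ RD1(6,4)=0

(3) When i=3 and j=4.

t3=‘b’≠p4=‘a’, R0(3,3)=1

∴ RD1(3,4)=1

R0[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | a | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

RD1[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | b | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |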

Consider RSk(i,j).

$$RS_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i-1,j-1) = 1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1) = 1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: two alignments of T and P illustrating a substitution, with P(1, j-1) matched against a suffix ending at position i-1 of T and T(i) replacing P(j).]

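Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RS1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,3)=0

∴ RS1(6,4)=0

(3) When i=5 and j=5.

t5=‘a’≠p5=‘c’, R0(4,4)=1

∴ RS1(5,5)=1

R0[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | a | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

RS1[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |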

Once RIk(i,j), RDk(i,j) and RSk(i,j) have been found, we immediately determine Rk(i,j) by

Rk(i,j) = RIk(i,j) or RDk(i,j) or RSk(i,j).

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

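R1[13,5]:

|   |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   | a | a | b | a | a | c | a | a | b | a | c | a | b |
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |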
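The combination step can be sketched as a loop that builds R0, R1, ..., Rk level by level, taking the OR of the insertion, deletion and substitution rules at every cell. The slides do not state the border values, so the ones used here (row 0, column 0 and an all-zero level for k = -1) are assumptions, as are the names `r_tables` and `row`.

```python
def r_tables(text: str, pattern: str, k_max: int) -> list[list[list[int]]]:
    """Return [R_0, ..., R_k_max], where R_k[i][j] uses 1-based i and j."""
    n, m = len(text), len(pattern)
    prev = [[0] * (m + 1) for _ in range(n + 1)]  # assumed level k = -1: nothing matches
    tables = []
    for k in range(k_max + 1):
        cur = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cur[i][0] = 1                 # assumed border: the empty prefix always matches
        for j in range(1, m + 1):
            cur[0][j] = int(j <= k)       # assumed border: empty text costs j errors
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if text[i - 1] == pattern[j - 1]:
                    ri = rd = rs = cur[i - 1][j - 1]   # match clause shared by all rules
                else:
                    ri = prev[i - 1][j]                # insertion:    R_{k-1}(i-1, j)
                    rd = prev[i][j - 1]                # deletion:     R_{k-1}(i, j-1)
                    rs = prev[i - 1][j - 1]            # substitution: R_{k-1}(i-1, j-1)
                cur[i][j] = ri | rd | rs
        tables.append(cur)
        prev = cur
    return tables


def row(table: list[list[int]], j: int) -> str:
    """Row j of a table as a string of 0/1 digits over i = 1..n."""
    return "".join(str(table[i][j]) for i in range(1, len(table)))


T, P = "aabaacaabacab", "aabac"
R0, R1 = r_tables(T, P, 1)
for j in range(1, len(P) + 1):
    print(P[j - 1], row(R1, j))
```

A match is reported at every i where the last row of Rk is 1. For this example that row of R1 marks positions 4, 5, 6, 10, 11 and 12.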

Thank you!