
### Fast text searching: allowing errors

Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan

Given a text T(1,n), a pattern P(1,m) and an error bound k, our approximate string matching problem is defined as follows:

Find all locations i of T such that the following condition is satisfied: there exists a suffix A of T(1, i) such that d(A,P)≦k, where d(x,y) is the edit distance between x and y.

Example:

T=deaabeg,

P=aabac and k=2.

For i=5.

T(1, 5)= deaab.

We note that there exists a suffix A=aab of T(1, 5) such that d(A,P)=d(aab,aabac)=2.

Example:

T=deaabeg, P=aab and k=2.

Consider i=5.

T(1,5)=deaab.

We have A=aab of T(1,5) and d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0.

Consider i=6.

T(1,6)=deaabe.

We have A=aabe of T(1,6) and d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.
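The definition can be checked directly on small inputs. The following is a minimal Python sketch, not the method of the paper: it computes the edit distance with the textbook dynamic program and then tries every suffix of T(1, i). The names `edit_distance` and `match_positions` are illustrative assumptions, and the empty suffix is assumed to be allowed.

```python
def edit_distance(x: str, y: str) -> int:
    """Textbook edit distance (insertion, deletion, substitution, unit cost)."""
    prev = list(range(len(y) + 1))              # distances from "" to y[:j]
    for a in x:
        cur = [prev[0] + 1]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (a != b),  # match or substitution
                           prev[j] + 1,             # drop a from x
                           cur[j - 1] + 1))         # add b to x
        prev = cur
    return prev[-1]


def match_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix A of T(1, i) has d(A, P) <= k."""
    hits = []
    for i in range(1, len(text) + 1):
        # Suffixes of text[:i] start at s = 0..i (s == i is the empty suffix).
        if any(edit_distance(text[s:i], pattern) <= k for s in range(i + 1)):
            hits.append(i)
    return hits


print(edit_distance("aab", "aabac"))          # 2
print(edit_distance("aabe", "aab"))           # 1
print(match_positions("deaabeg", "aab", 1))   # [4, 5, 6]
```

This brute-force check recomputes a full edit distance for every suffix, so it is only suitable for verifying small examples such as the ones above.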

Our approach is based upon the following observation:

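[Figure: a substring S of T split as S = S1S2, aligned with the pattern P split as P = P1P2.]

Let S be a substring of T, written S = S1S2, and let P = P1P2.

If S2 is a suffix of S and P2 is a suffix of P such that

d(S2, P2) = 0 and d(S1, P1) ≦ k,

we have d(S, P) ≦ k.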

Example:

We may decompose A and B as follows:

B=ab+cd.

Thus d(A,B)=2.

A Recursive Operation for the Dynamic Programming Approach

Consider T(1,i) and P(1, j).

Case 1: T(i)=P(j). Let B denote the prefix P(1, j-1) of P. We check whether there is a suffix A of T(1,i-1) such that d(A,B)≦k.

[Figure: T ends with the suffix A at position i-1, followed by T(i); P consists of B = P(1, j-1) followed by P(j).]

Case 2: T(i)≠P(j). We consider three cases:

2.1 Let B denote P(1, j). There is a suffix A of T(1,i-1) such that d(A,B)≦k-1. This corresponds to an insertion, as illustrated below:

[Figure: insertion. T ends with the suffix A at position i-1, followed by the extra character T(i); B = P(1, j).]

2.2 Let B denote P(1, j-1). There is a suffix A of T(1, i) such that d(A,B)≦k-1. This corresponds to a deletion, as illustrated below:

[Figure: deletion. T ends with the suffix A at position i; P consists of B = P(1, j-1) followed by the unmatched character P(j).]

2.3 Let B denote P(1, j-1). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to a substitution, as illustrated below:

[Figure: substitution. T ends with the suffix A at position i-1, followed by T(i); P consists of B = P(1, j-1) followed by P(j), and T(i) is replaced by P(j).]
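The case analysis above can be sketched as a memoized recursive function. This is only an illustration under stated assumptions: the name `make_matcher` is hypothetical, and the base cases (an empty pattern prefix always matches; an empty text matches P(1, j) only if j ≦ k) are not spelled out on the slides and are assumed here.

```python
from functools import lru_cache


def make_matcher(text: str, pattern: str):
    """Return matches(i, j, k): True if some suffix A of T(1, i) has
    d(A, P(1, j)) <= k. Indices are 1-based as in the text."""

    @lru_cache(maxsize=None)
    def matches(i: int, j: int, k: int) -> bool:
        if k < 0:
            return False                  # no error budget left
        if j == 0:
            return True                   # assumed base case: empty prefix of P
        if i == 0:
            return j <= k                 # assumed base case: empty text
        if text[i - 1] == pattern[j - 1]:
            return matches(i - 1, j - 1, k)           # Case 1
        return (matches(i - 1, j, k - 1)              # Case 2.1: insertion
                or matches(i, j - 1, k - 1)           # Case 2.2: deletion
                or matches(i - 1, j - 1, k - 1))      # Case 2.3: substitution

    return matches


matches = make_matcher("deaabeg", "aab")
print([i for i in range(1, 8) if matches(i, 3, 1)])   # [4, 5, 6]
```

The cache is keyed on (i, j, k), so each subproblem is evaluated once, which mirrors the table that the following slides build explicitly.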

To solve our approximate string matching problem, we start with a table, called Rk[n, m]. Let S=T(1, i). For 1≦i≦n and 1≦j≦m,

$$R_k(i,j) = \begin{cases} 1 & \text{if there exists a suffix } A \text{ of } S \text{ such that } d(A, P(1,j)) \le k, \\ 0 & \text{otherwise.} \end{cases}$$

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=9, j=4.

S=T(1, 9)=aabaacaab

P(1, 4)=aaba

A=aab

d(A,P(1, 4))=d(aab,aaba)=1

∴ R1(9, 4)=1

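R1 (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |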

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=13 and j=5.

S=T(1, 13)=aabaacaabacab

P(1, 5)=aabac

There doesn’t exist any suffix A of S such that d(A,P(1, 5))≦1.

∴ R1(13,5)=0


Question: How can we find Rk(i, j)?

There are three types of operation in edit distance:

(1) Insertion

(2) Deletion

(3) Substitution

We consider them separately and combine the results later.

Let RIk(i,j), RDk(i,j) and RSk(i,j) denote the values of Rk(i,j) obtained by considering insertion, deletion and substitution respectively, and let RIk[i,j], RDk[i,j] and RSk[i,j] denote the corresponding tables.

Consider RIk(i,j) first.

$$RI_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i-1,j)=1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1)=1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: insertion. T(i) is an extra character following a suffix of T(1, i-1) that is matched against P(1, j).]

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RI1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,4)=0

∴ RI1(6,4)=0

(3) When i=11 and j=4.

t11=‘c’≠p4=‘a’, R0(10,4)=1

∴ RI1(11,4)=1

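R0[13,5] (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | a | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |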

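RI1[13,5] (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 3 | b | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |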

Consider RDk(i,j).

$$RD_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i,j-1)=1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1)=1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: deletion. P(j) has no counterpart in T; a suffix of T(1, i) is matched against P(1, j-1).]

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RD1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(6,3)=0

∴ RD1(6,4)=0

(3) When i=3 and j=4.

t3=‘b’≠p4=‘a’, R0(3,3)=1

∴ RD1(3,4)=1

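RD1[13,5] (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | b | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |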

Consider RSk(i,j).

$$RS_k(i,j) = \begin{cases} 1 & \text{if } t_i \ne p_j \text{ and } R_{k-1}(i-1,j-1)=1, \text{ or } t_i = p_j \text{ and } R_k(i-1,j-1)=1, \\ 0 & \text{otherwise.} \end{cases}$$

[Figure: substitution. T(i) is aligned with P(j); a suffix of T(1, i-1) is matched against P(1, j-1).]

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RS1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,3)=0

∴ RS1(6,4)=0

(3) When i=5 and j=5.

t5=‘a’≠p5=‘c’, R0(4,4)=1

∴ RS1(5,5)=1

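RS1[13,5] (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | a | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |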
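The worked examples for the three partial recurrences can be re-checked mechanically. The sketch below is an assumption-laden illustration rather than the slides' procedure: it obtains Rk from a Sellers-style distance table D (a technique swapped in here, where D[i][j] is the smallest edit distance between P(1, j) and any suffix of T(1, i)), and then applies the definitions of RIk, RDk and RSk at single cells. The names `suffix_distance_table`, `R` and `partial` are hypothetical.

```python
def suffix_distance_table(text: str, pattern: str) -> list[list[int]]:
    """D[i][j] = min over suffixes A of T(1, i) of d(A, P(1, j)).
    Sellers-style dynamic program, used here only to obtain R_k."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]      # D[i][0] = 0 (empty prefix)
    D[0] = list(range(m + 1))                      # empty text: distance j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (text[i - 1] != pattern[j - 1]),
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


T, P = "aabaacaabacab", "aabac"
D = suffix_distance_table(T, P)


def R(k: int, i: int, j: int) -> int:
    """R_k(i, j) read off the distance table; 0 for a negative error bound."""
    return int(k >= 0 and D[i][j] <= k)


def partial(kind: str, k: int, i: int, j: int) -> int:
    """RI_k ("I"), RD_k ("D") or RS_k ("S") at (i, j), per the definitions."""
    if T[i - 1] == P[j - 1]:
        return R(k, i - 1, j - 1)
    prev = {"I": (i - 1, j), "D": (i, j - 1), "S": (i - 1, j - 1)}[kind]
    return R(k - 1, *prev)


print(partial("I", 1, 13, 3), partial("I", 1, 6, 4), partial("I", 1, 11, 4))  # 1 0 1
print(partial("D", 1, 13, 3), partial("D", 1, 6, 4), partial("D", 1, 3, 4))   # 1 0 1
print(partial("S", 1, 13, 3), partial("S", 1, 6, 4), partial("S", 1, 5, 5))   # 1 0 1
```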

Once RIk(i,j), RDk(i,j) and RSk(i,j) have been found, we immediately determine Rk(i,j) by

Rk(i,j) = RIk(i,j) or RDk(i,j) or RSk(i,j).

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

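R1[13,5] (columns: position i and character ti of T; rows: position j and character pj of P):

| j | pj | 1 (a) | 2 (a) | 3 (b) | 4 (a) | 5 (a) | 6 (c) | 7 (a) | 8 (a) | 9 (b) | 10 (a) | 11 (c) | 12 (a) | 13 (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | a | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | b | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 4 | a | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 5 | c | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |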
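Putting the pieces together, the tables R0, R1, ..., Rk can be filled bottom-up, each cell being the OR of its RI, RD and RS values. The following Python sketch illustrates this under assumptions that the slides do not state: the function name `build_tables` is hypothetical, and the boundary values Rd(i, 0) = 1 and Rd(0, j) = 1 exactly when j ≦ d are chosen here so that the recurrences are defined at the table edges.

```python
def build_tables(text: str, pattern: str, k: int) -> list[list[list[int]]]:
    """R[d][i][j] for d = 0..k, built from RI, RD and RS as defined above.
    Row 0 and column 0 hold boundary values assumed for this sketch."""
    n, m = len(text), len(pattern)
    R = []
    for d in range(k + 1):
        # Assumed boundaries: R_d(i, 0) = 1, and R_d(0, j) = 1 iff j <= d.
        cur = [[int(j == 0 or (i == 0 and j <= d)) for j in range(m + 1)]
               for i in range(n + 1)]
        prev = R[d - 1] if d > 0 else None        # R_{d-1}; none for d = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if text[i - 1] == pattern[j - 1]:
                    ri = rd = rs = cur[i - 1][j - 1]
                elif prev is None:
                    ri = rd = rs = 0              # exact matching, no errors
                else:
                    ri = prev[i - 1][j]           # insertion
                    rd = prev[i][j - 1]           # deletion
                    rs = prev[i - 1][j - 1]       # substitution
                cur[i][j] = ri | rd | rs
        R.append(cur)
    return R


T, P = "aabaacaabacab", "aabac"
R = build_tables(T, P, 1)
for j in range(1, len(P) + 1):                    # one printed row per pattern position
    print(P[j - 1], *(R[1][i][j] for i in range(1, len(T) + 1)))
print([i for i in range(1, len(T) + 1) if R[1][i][len(P)]])   # [4, 5, 6, 10, 11, 12]
```

The last line lists the positions i with R1(i, 5) = 1, that is, the text positions where an occurrence of aabac with at most one error ends.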