
slide1

Fast text searching: allowing errors

Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91

Advisor: Prof. R. C. T. Lee

Reporter: Z. H. Pan

slide2
Given a text T(1,n), a pattern P(1,m) and an error bound k.

Our approximate string matching problem is defined as follows:

Find all locations i of T such that the following condition is satisfied: there exists a suffix A of T(1, i) such that d(A,P)≦k, where d(x,y) is the edit distance between x and y.

slide3

Example:

T=deaabeg,

P=aabac and k=2.

For i=5.

T(1, 5)= deaab.

We note that there exists a suffix A=aab of T(1, 5) such that d(A,P)=d(aab,aabac)=2≦k, so location i=5 is reported.

slide4

Example:

T=deaabeg, P=aab and k=2.

Consider i=5.

T(1,5)=deaab.

The suffix A=aab of T(1,5) gives d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0.

Consider i=6.

T(1,6)=deaabe.

The suffix A=aabe of T(1,6) gives d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.
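The problem definition can be checked directly on these small examples. The following is a minimal brute-force sketch in Python, not the method of the paper: `edit_distance` and `matching_end_positions` are hypothetical helper names introduced only for illustration, and every non-empty suffix of T(1, i) is simply compared against P.

```python
def edit_distance(x: str, y: str) -> int:
    """Classical edit distance (insertion, deletion, substitution), row by row."""
    prev = list(range(len(y) + 1))          # distance from "" to each prefix of y
    for a_idx, a in enumerate(x, start=1):
        cur = [a_idx]                       # distance from x[:a_idx] to ""
        for b_idx, b in enumerate(y, start=1):
            cur.append(min(prev[b_idx] + 1,             # drop a from x
                           cur[b_idx - 1] + 1,          # add b to x
                           prev[b_idx - 1] + (a != b))) # match or substitute
        prev = cur
    return prev[len(y)]


def matching_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """All 1-based i such that some suffix A of T(1, i) has d(A, P) <= k."""
    hits = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if any(edit_distance(prefix[s:], pattern) <= k for s in range(i)):
            hits.append(i)
    return hits


print(edit_distance("aab", "aabac"))                 # 2
print(edit_distance("aabe", "aab"))                  # 1
print(matching_end_positions("deaabeg", "aab", 0))   # [5]
```

This sketch recomputes a full edit distance for every suffix, so it is only suitable for checking small examples by hand.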

slide5

Our approach is based upon the following observation:

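[Figure: a substring S of T split into S1 followed by S2, aligned with P split into P1 followed by P2.]

Let S be a substring of T, and write S=S1S2 and P=P1P2, where S2 is a suffix of S and P2 is a suffix of P.

If d(S2, P2) = 0 and d(S1, P1) ≦ k,

then d(S, P) ≦ k.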

slide6

Example:

A=addcd and B=abcd. k=2.

We may decompose A and B as follows:

A=add+cd.

B=ab+cd.

d(add,ab)=2 and d(cd,cd)=0.

Thus d(A,B)≦2. (In this example d(A,B) is exactly 2.)
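The decomposition in this example can be verified numerically. The sketch below is only an illustration under stated assumptions: `edit_distance` and `check_observation` are hypothetical names, and the edit distance is computed by a plain memoized recursion.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Edit distance between x and y by memoized recursion over prefix lengths."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (x[i - 1] != y[j - 1]))
    return d(len(x), len(y))


def check_observation(s1: str, s2: str, p1: str, p2: str, k: int) -> bool:
    """If d(S2, P2) = 0 and d(S1, P1) <= k, confirm that d(S1S2, P1P2) <= k."""
    if edit_distance(s2, p2) != 0 or edit_distance(s1, p1) > k:
        raise ValueError("the premises of the observation do not hold")
    return edit_distance(s1 + s2, p1 + p2) <= k


print(edit_distance("add", "ab"))                     # 2
print(check_observation("add", "cd", "ab", "cd", 2))  # True
```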

slide7

A Recursive Operation for the Dynamic Programming Approach

Consider T(1,i) and P(1, j).

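Case 1: T(i)=P(j). Let B=P(1, j-1) be the prefix of P that ends just before P(j). We check whether there is a suffix A of T(1,i-1) such that d(A,B)≦k.

[Figure: T with positions i-1 and i marked and the suffix A ending at i-1, aligned with P with positions 1, j-1 and j marked and B=P(1, j-1).]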

slide8

Case 2: T(i)≠P(j). We consider three cases:

2.1 Let B=P(1, j). There is a suffix A of T(1,i-1) such that d(A,B)≦k-1. This corresponds to an insertion, as illustrated below:

[Figure: T with positions i-1 and i marked and the suffix A ending at i-1, aligned with P with positions 1 and j marked and B=P(1, j); the step is labeled insertion.]

slide9

Case 2: T(i)≠P(j). We consider three cases:

2.2 Let B=P(1, j-1). There is a suffix A of T(1, i) such that d(A,B)≦k-1. This corresponds to a deletion, as illustrated below:

[Figure: T with position i marked and the suffix A ending at i, aligned with P with positions 1, j-1 and j marked and B=P(1, j-1).]

slide10

Case 2: T(i)≠P(j). We consider three cases:

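2.3 Let B=P(1, j-1). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to a substitution, as illustrated below:

[Figure: T with positions i-1 and i marked and the suffix A ending at i-1, aligned with P with positions 1, j-1 and j marked and B=P(1, j-1).]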
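Cases 1 and 2.1 to 2.3 can be read as a recursive test. The sketch below is one possible rendering in Python, not code from the paper: `make_matcher` and `has_suffix` are hypothetical names, and the two boundary rules (an empty pattern prefix always matches, and an empty text matches P(1, j) only when j≦k) are assumptions added here because the slides do not state them.

```python
from functools import lru_cache


def make_matcher(text: str, pattern: str):
    """Return has_suffix(i, j, k): does some suffix A of T(1, i) satisfy
    d(A, P(1, j)) <= k?  Indices are 1-based as in the slides."""

    @lru_cache(maxsize=None)
    def has_suffix(i: int, j: int, k: int) -> bool:
        if k < 0:
            return False
        if j == 0:
            return True                  # assumed boundary: empty prefix of P
        if i == 0:
            return j <= k                # assumed boundary: delete all of P(1, j)
        if text[i - 1] == pattern[j - 1]:
            return has_suffix(i - 1, j - 1, k)        # Case 1
        return (has_suffix(i - 1, j, k - 1)           # Case 2.1, insertion
                or has_suffix(i, j - 1, k - 1)        # Case 2.2, deletion
                or has_suffix(i - 1, j - 1, k - 1))   # Case 2.3, substitution

    return has_suffix


has_suffix = make_matcher("aabaacaabacab", "aabac")
print(has_suffix(9, 4, 1))    # True
print(has_suffix(13, 5, 1))   # False
```

Memoization keeps the recursion from revisiting the same (i, j, k) triple, which is what turns the case analysis into a dynamic programming scheme.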

slide11

To solve our approximate string matching problem, we start with a table, called Rk[n, m]. Let S=T(1, i).

Rk(i,j), where 1≦i≦n and 1≦j≦m,

=1 if there exists a suffix A of S such that d(A, P(1, j))≦k,

=0 otherwise.

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=9, j=4.

S=T(1, 9)=aabaacaab

P(1, 4)=aaba

A=aab

d(A,P(1, 4))=d(aab,aaba)=1

∴ R1(9, 4)=1

R1 (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  1  1  1  1  1  1  1  1  1  1  1  1  1
3  b  0  1  1  1  1  1  0  1  1  1  0  0  1
4  a  0  0  1  1  1  0  1  0  1  1  1  0  0
5  c  0  0  0  1  1  1  0  0  0  1  1  1  0

slide12

Example:

T:aabaacaabacab, P:aabac and k=1.

Consider i=13 and j=5.

S=T(1, 13)=aabaacaabacab

P(1, 5)=aabac

There doesn’t exist any suffix A of S such that d(A,P(1, 5))≦1.

∴ R1(13,5)=0

R1 (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  1  1  1  1  1  1  1  1  1  1  1  1  1
3  b  0  1  1  1  1  1  0  1  1  1  0  0  1
4  a  0  0  1  1  1  0  1  0  1  1  1  0  0
5  c  0  0  0  1  1  1  0  0  0  1  1  1  0
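The R1 table of this example can be reproduced mechanically. The sketch below does not use the Rk recurrence of the slides. Instead it swaps in the standard minimum-distance dynamic programming for approximate matching (row 0 is all zeros so that a match may start anywhere in T) and then thresholds each cell at k. The name `r_table` is hypothetical.

```python
def r_table(text: str, pattern: str, k: int) -> list[list[int]]:
    """Rows j = 1..m, columns i = 1..n; an entry is 1 when some suffix A of
    T(1, i) satisfies d(A, P(1, j)) <= k."""
    n, m = len(text), len(pattern)
    # dist[j][i] = smallest d(A, P(1, j)) over all suffixes A of T(1, i)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, m + 1):
        dist[j][0] = j                      # empty text: delete all j characters
        for i in range(1, n + 1):
            dist[j][i] = min(
                dist[j - 1][i - 1] + (pattern[j - 1] != text[i - 1]),
                dist[j][i - 1] + 1,
                dist[j - 1][i] + 1,
            )
    return [[int(dist[j][i] <= k) for i in range(1, n + 1)]
            for j in range(1, m + 1)]


table = r_table("aabaacaabacab", "aabac", 1)
for j, row in enumerate(table, start=1):
    print(j, "".join(map(str, row)))
print(table[4 - 1][9 - 1], table[5 - 1][13 - 1])   # 1 0
```

The last line prints R1(9, 4) and R1(13, 5), the two cells discussed in the examples.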

slide13

Question: How can we find Rk(i, j)?

Answer: Dynamic Programming.

There are three types of operation in edit distance:

(1) Insertion

(2) Deletion

(3) Substitution

We consider them separately and combine the results later.

slide14
Let RIk(i,j), RDk(i,j) and RSk(i,j) denote the values of Rk(i,j) obtained through insertion, deletion and substitution, respectively.

Likewise, let RIk[i,j], RDk[i,j] and RSk[i,j] denote the corresponding tables for insertion, deletion and substitution, in the same way that Rk[i,j] denotes the table of Rk(i,j) values.

slide15

Consider RIk(i,j) first.

RIk(i,j)

=1 if ti≠pj and Rk-1(i-1,j)=1

or ti=pj and Rk(i-1,j-1)=1,

=0 otherwise.

[Figure: insertion case, with positions i-1 and i marked on T and position j marked on P.]

slide16

RIk(i,j)

=1 if ti≠pj and Rk-1(i-1,j)=1, or ti=pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RI1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,4)=0

∴ RI1(6,4)=0

(3) When i=11 and j=4.

t11=‘c’≠p4=‘a’, R0(10,4)=1

∴ RI1(11,4)=1

R0[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  0  1  1  0  1  1  0  1  0  1  0
2  a  0  1  0  0  1  0  0  1  0  0  0  0  0
3  b  0  0  1  0  0  0  0  0  1  0  0  0  0
4  a  0  0  0  1  0  0  0  0  0  1  0  0  0
5  c  0  0  0  0  0  0  0  0  0  0  1  0  0

RI1[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  0  1  1  1  1  1  1  1  1  1  0  1  0
3  b  0  0  1  1  0  0  0  0  1  1  0  0  1
4  a  0  0  0  1  1  0  0  0  0  1  1  0  0
5  c  0  0  0  0  0  1  0  0  0  0  1  1  0
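The three RI1 examples can be replayed against the tables. In the sketch below the R0 and R1 tables are transcribed from the slides as row strings, and the helper names `r` and `ri` are hypothetical. Only interior cells (i≧1, j≧1) are available, which is an assumption of this illustration since the slides do not list boundary values.

```python
TEXT = "aabaacaabacab"
PATTERN = "aabac"

# Rows j = 1..5 of the R0 and R1 tables from the slides, one character per
# text position i = 1..13.
R = {
    0: ["1101101101010",
        "0100100100000",
        "0010000010000",
        "0001000001000",
        "0000000000100"],
    1: ["1111111111111",
        "1111111111111",
        "0111110111001",
        "0011101011100",
        "0001110001110"],
}


def r(k: int, i: int, j: int) -> bool:
    """Look up Rk(i, j) in the transcribed tables (1-based, interior cells only)."""
    if k not in R or not 1 <= i <= len(TEXT) or not 1 <= j <= len(PATTERN):
        raise ValueError("cell is outside the transcribed tables")
    return R[k][j - 1][i - 1] == "1"


def ri(k: int, i: int, j: int) -> bool:
    """RIk(i, j) as defined on the slide."""
    if TEXT[i - 1] != PATTERN[j - 1]:
        return r(k - 1, i - 1, j)       # ti != pj: look at Rk-1(i-1, j)
    return r(k, i - 1, j - 1)           # ti == pj: look at Rk(i-1, j-1)


for i, j in [(13, 3), (6, 4), (11, 4)]:
    print(f"RI1({i},{j}) = {int(ri(1, i, j))}")
```

The three printed values are 1, 0 and 1, in agreement with examples (1) to (3).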

slide17

Consider RDk(i,j).

RDk(i,j)

=1 if ti≠pj and Rk-1(i,j-1)=1

or ti=pj and Rk(i-1,j-1)=1,

=0 otherwise.

[Figure: deletion case, with position i marked on T and positions j-1 and j marked on P.]

slide18

RDk(i,j)

=1 if ti≠pj and Rk-1(i,j-1)=1, or ti=pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RD1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(6,3)=0

∴ RD1(6,4)=0

(3) When i=3 and j=4.

t3=‘b’≠p4=‘a’, R0(3,3)=1

∴ RD1(3,4)=1

R0[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  0  1  1  0  1  1  0  1  0  1  0
2  a  0  1  0  0  1  0  0  1  0  0  0  0  0
3  b  0  0  1  0  0  0  0  0  1  0  0  0  0
4  a  0  0  0  1  0  0  0  0  0  1  0  0  0
5  c  0  0  0  0  0  0  0  0  0  0  1  0  0

RD1[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  1  1  0  1  1  0  1  1  0  1  0  1  0
3  b  0  1  1  0  1  0  0  1  1  0  0  0  1
4  a  0  0  1  1  0  0  0  0  1  1  0  0  0
5  c  0  0  0  1  0  0  0  0  0  1  1  0  0

slide19

Consider RSk(i,j).

RSk(i,j)

=1 if ti≠pj and Rk-1(i-1,j-1)=1

or ti=pj and Rk(i-1,j-1)=1

=0 otherwise

[Figure: substitution case, with positions i-1 and i marked on T and positions j-1 and j marked on P.]

slide20

RSk(i,j)

=1 if ti≠pj and Rk-1(i-1,j-1)=1, or ti=pj and Rk(i-1,j-1)=1

=0 otherwise

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

(1) When i=13 and j=3.

t13=p3=‘b’, R1(12,2)=1

∴ RS1(13,3)=1

(2) When i=6 and j=4.

t6=‘c’≠p4=‘a’, R0(5,3)=0

∴ RS1(6,4)=0

(3) When i=5 and j=5.

t5=‘a’≠p5=‘c’, R0(4,4)=1

∴ RS1(5,5)=1

R0[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  0  1  1  0  1  1  0  1  0  1  0
2  a  0  1  0  0  1  0  0  1  0  0  0  0  0
3  b  0  0  1  0  0  0  0  0  1  0  0  0  0
4  a  0  0  0  1  0  0  0  0  0  1  0  0  0
5  c  0  0  0  0  0  0  0  0  0  0  1  0  0

RS1[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  0  1  1  1  1  1  1  1  1  1  1  1  1
3  b  0  0  1  0  0  1  0  0  1  0  0  0  1
4  a  0  0  0  1  0  0  1  0  0  1  0  0  0
5  c  0  0  0  0  1  0  0  0  0  0  1  0  0
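The deletion and substitution rules differ from the insertion rule only in which neighbouring cell of Rk-1 is consulted on a mismatch. The sketch below makes that explicit with a small offset table. It is an illustration only: `OFFSETS`, `r` and `single_op` are hypothetical names, the R0 and R1 rows are transcribed from the slides, and boundary cells (i=0 or j=0) are not covered.

```python
TEXT = "aabaacaabacab"
PATTERN = "aabac"

# Rows j = 1..5 of R0 and R1 from the slides, columns i = 1..13.
R = {
    0: ["1101101101010", "0100100100000", "0010000010000",
        "0001000001000", "0000000000100"],
    1: ["1111111111111", "1111111111111", "0111110111001",
        "0011101011100", "0001110001110"],
}

# On a mismatch, which cell of Rk-1 is read: (step back in i, step back in j).
OFFSETS = {"I": (1, 0), "D": (0, 1), "S": (1, 1)}


def r(k: int, i: int, j: int) -> bool:
    """Rk(i, j) from the transcribed tables (1-based, interior cells only)."""
    if k not in R or not 1 <= i <= len(TEXT) or not 1 <= j <= len(PATTERN):
        raise ValueError("cell is outside the transcribed tables")
    return R[k][j - 1][i - 1] == "1"


def single_op(op: str, k: int, i: int, j: int) -> bool:
    """RIk, RDk or RSk at (i, j), selected by op in {'I', 'D', 'S'}."""
    if TEXT[i - 1] == PATTERN[j - 1]:
        return r(k, i - 1, j - 1)            # shared match rule
    di, dj = OFFSETS[op]
    return r(k - 1, i - di, j - dj)          # one error spent on this operation


print(int(single_op("D", 1, 3, 4)))   # RD1(3,4) = 1
print(int(single_op("S", 1, 5, 5)))   # RS1(5,5) = 1
print(int(single_op("S", 1, 6, 4)))   # RS1(6,4) = 0
```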

slide21

After RIk(i,j), RDk(i,j) and RSk(i,j) have all been found, we immediately determine Rk(i,j) by

Rk(i,j)= RIk(i,j) or RDk(i,j) or RSk(i,j).

Example: Text = aabaacaabacab. Pattern = aabac. k=1.

R1[13,5] (rows: j and pj; columns: i and ti)

i:    1  2  3  4  5  6  7  8  9 10 11 12 13
T:    a  a  b  a  a  c  a  a  b  a  c  a  b
1  a  1  1  1  1  1  1  1  1  1  1  1  1  1
2  a  1  1  1  1  1  1  1  1  1  1  1  1  1
3  b  0  1  1  1  1  1  0  1  1  1  0  0  1
4  a  0  0  1  1  1  0  1  0  1  1  1  0  0
5  c  0  0  0  1  1  1  0  0  0  1  1  1  0
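Putting the three rules together gives a bottom-up computation of R0, R1, ..., Rk. The following is a minimal sketch rather than the implementation from the paper: `compute_r_tables` and `match_ends` are hypothetical names, and the boundary values Rk(i,0)=1 and Rk(0,j)=1 exactly when j≦k are assumptions, since the slides define Rk only for 1≦i≦n and 1≦j≦m.

```python
def compute_r_tables(text: str, pattern: str, k_max: int):
    """Return [R0, R1, ..., Rk_max]; each table is indexed as table[i][j]
    with 0 <= i <= n and 0 <= j <= m."""
    n, m = len(text), len(pattern)
    all_false = [[False] * (m + 1) for _ in range(n + 1)]
    tables = []
    for k in range(k_max + 1):
        prev = tables[k - 1] if k > 0 else all_false   # plays the role of Rk-1
        cur = [[False] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cur[i][0] = True                 # assumed boundary
        for j in range(1, m + 1):
            cur[0][j] = j <= k               # assumed boundary
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = text[i - 1] == pattern[j - 1]
                same = match and cur[i - 1][j - 1]
                ri = same or (not match and prev[i - 1][j])      # insertion
                rd = same or (not match and prev[i][j - 1])      # deletion
                rs = same or (not match and prev[i - 1][j - 1])  # substitution
                cur[i][j] = ri or rd or rs
        tables.append(cur)
    return tables


def match_ends(text: str, pattern: str, k: int) -> list[int]:
    """All 1-based i with Rk(i, m) = 1, i.e. the answers to the problem."""
    table = compute_r_tables(text, pattern, k)[k]
    m = len(pattern)
    return [i for i in range(1, len(text) + 1) if table[i][m]]


print(match_ends("aabaacaabacab", "aabac", 0))   # [11]
print(match_ends("aabaacaabacab", "aabac", 1))   # [4, 5, 6, 10, 11, 12]
```

The positions reported for k=1 are the columns holding a 1 in row 5 of the R1 table above.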