
Fast text searching: allowing errors

Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91. Advisor: Prof. R. C. T. Lee. Reporter: Z. H. Pan.


Presentation Transcript


  1. Fast text searching: allowing errors. Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91. Advisor: Prof. R. C. T. Lee. Reporter: Z. H. Pan.

  2. Given a text T(1, n), a pattern P(1, m) and an error bound k, our approximate string matching problem is defined as follows: find all locations i of T such that there exists a suffix A of T(1, i) with d(A, P) ≤ k, where d(x, y) is the edit distance between x and y.

  3. Example: T = deaabeg, P = aabac and k = 2. Consider i = 5. T(1, 5) = deaab. There exists a suffix A = aab of T(1, 5) such that d(A, P) = d(aab, aabac) = 2 ≤ k, so location i = 5 is reported.

  4. Example: T = deaabeg, P = aab and k = 2. Consider i = 5. T(1, 5) = deaab. The suffix A = aab of T(1, 5) gives d(A, P) = d(aab, aab) = 0. Thus we have found a substring aab in T such that d(aab, P) = 0. Consider i = 6. T(1, 6) = deaabe. The suffix A = aabe of T(1, 6) gives d(A, P) = d(aabe, aab) = 1. Again, we have found a substring aabe in T such that d(aabe, P) = 1.
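The definition and the two examples above can be checked directly by brute force. The following is a minimal sketch, not the method of the paper: the function names `edit_distance` and `match_ends_bruteforce` are hypothetical, and the empty suffix is assumed to count as a suffix of T(1, i).

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(y) + 1))
    for a_idx, a in enumerate(x, start=1):
        cur = [a_idx]
        for b_idx, b in enumerate(y, start=1):
            cur.append(min(prev[b_idx - 1] + (a != b),  # match or substitution
                           prev[b_idx] + 1,             # drop a from x
                           cur[b_idx - 1] + 1))         # add b to x
        prev = cur
    return prev[-1]


def match_ends_bruteforce(text: str, pattern: str, k: int) -> list[int]:
    """Every 1-based i such that some suffix A of T(1, i) has d(A, P) <= k."""
    ends = []
    for i in range(1, len(text) + 1):
        # s == 0 gives all of T(1, i); s == i gives the empty suffix (assumption).
        if any(edit_distance(text[s:i], pattern) <= k for s in range(i + 1)):
            ends.append(i)
    return ends


print(edit_distance("aab", "aabac"))               # distance used in slide 3
print(edit_distance("aabe", "aab"))                # distance used in slide 4
print(match_ends_bruteforce("deaabeg", "aab", 0))  # exact occurrences only
```

This tries every suffix at every position, so it is only useful for checking small examples; the table-based approach in the following slides avoids the repeated work.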

  5. Our approach is based upon the following observation. Let S be a substring of T, split as S = S1 S2, and let P be split as P = P1 P2. If the suffix S2 of S and the suffix P2 of P satisfy d(S2, P2) = 0, and d(S1, P1) ≤ k, then d(S, P) ≤ k.

  6. Example: A = addcd, B = abcd and k = 2. We may decompose A and B as follows: A = add + cd, B = ab + cd. Since d(cd, cd) = 0 and d(add, ab) = 2 ≤ k, we have d(A, B) ≤ 2; in fact d(A, B) = 2.
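The decomposition example can be verified numerically. This is a small sketch under the assumption of unit-cost edit distance; `edit_distance` is a hypothetical helper, not something defined in the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(y) + 1))
    for a_idx, a in enumerate(x, start=1):
        cur = [a_idx]
        for b_idx, b in enumerate(y, start=1):
            cur.append(min(prev[b_idx - 1] + (a != b),
                           prev[b_idx] + 1,
                           cur[b_idx - 1] + 1))
        prev = cur
    return prev[-1]


a1, a2 = "add", "cd"   # A = A1 + A2
b1, b2 = "ab", "cd"    # B = B1 + B2
assert a2 == b2        # identical suffixes, so d(A2, B2) = 0

bound = edit_distance(a1, b1)            # d(add, ab)
whole = edit_distance(a1 + a2, b1 + b2)  # d(addcd, abcd)
print(bound, whole)
assert whole <= bound  # the observation: the prefix distance bounds d(A, B)
```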

  7. A Recursive Operation for the Dynamic Programming Approach. Consider T(1, i) and P(1, j). Case 1: T(i) = P(j). Let B denote the prefix P(1, j-1) of P. We check whether there is a suffix A of T(1, i-1) such that d(A, B) ≤ k. [Diagram: A ends at position i-1 of T; B ends at position j-1 of P.]

  8. Case 2: T(i) ≠ P(j). We consider three cases. 2.1: Let B denote P(1, j). There is a suffix A of T(1, i-1) such that d(A, B) ≤ k-1. This corresponds to an insertion. [Diagram: A ends at position i-1 of T; B ends at position j of P; T(i) is the inserted character.]

  9. Case 2: T(i) ≠ P(j). We consider three cases. 2.2: Let B denote P(1, j-1). There is a suffix A of T(1, i) such that d(A, B) ≤ k-1. This corresponds to a deletion. [Diagram: A ends at position i of T; B ends at position j-1 of P; P(j) is the deleted character.]

  10. Case 2: T(i) ≠ P(j). We consider three cases. 2.3: Let B denote P(1, j-1). There is a suffix A of T(1, i-1) such that d(A, B) ≤ k-1. This corresponds to a substitution. [Diagram: A ends at position i-1 of T; B ends at position j-1 of P; T(i) is substituted for P(j).]
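Cases 1 and 2.1 to 2.3 can be written as one recursive predicate. The sketch below is one possible reading, with assumptions that the slides do not state: the boundary values (an empty pattern prefix always matches, an empty text matches P(1, j) only if j ≤ k, and a negative error budget never matches) and the hypothetical name `make_r`.

```python
from functools import lru_cache


def make_r(text: str, pattern: str):
    """Return r(k, i, j): is there a suffix A of T(1, i) with d(A, P(1, j)) <= k?"""

    @lru_cache(maxsize=None)
    def r(k: int, i: int, j: int) -> bool:
        if k < 0:
            return False      # assumed boundary: no error budget left
        if j == 0:
            return True       # assumed boundary: empty prefix vs empty suffix
        if i == 0:
            return j <= k     # assumed boundary: empty text, j errors needed
        if text[i - 1] == pattern[j - 1]:
            return r(k, i - 1, j - 1)       # Case 1: T(i) = P(j)
        return (r(k - 1, i - 1, j)          # Case 2.1: insertion
                or r(k - 1, i, j - 1)       # Case 2.2: deletion
                or r(k - 1, i - 1, j - 1))  # Case 2.3: substitution

    return r


r = make_r("aabaacaabacab", "aabac")
print(r(1, 9, 4), r(1, 13, 5))
```

Memoisation with `lru_cache` keeps each (k, i, j) from being evaluated more than once, which is what turns the recursion into dynamic programming.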

  11. To solve our approximate string matching problem, we start with a table called R_k[n, m]. Let S = T(1, i). For 1 ≤ i ≤ n and 1 ≤ j ≤ m:
     R_k(i, j) = 1 if there exists a suffix A of S such that d(A, P(1, j)) ≤ k,
     R_k(i, j) = 0 otherwise.
     Example: T = aabaacaabacab, P = aabac and k = 1. Consider i = 9, j = 4. S = T(1, 9) = aabaacaab, P(1, 4) = aaba, A = aab, and d(A, P(1, 4)) = d(aab, aaba) = 1. ∴ R_1(9, 4) = 1.
     R_1 (columns: text position i; rows: pattern position j):
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=3 (b):   0  1  1  1  1  1  0  1  1  1  0  0  1
       j=4 (a):   0  0  1  1  1  0  1  0  1  1  1  0  0
       j=5 (c):   0  0  0  1  1  1  0  0  0  1  1  1  0

  12. Example: T = aabaacaabacab, P = aabac and k = 1. Consider i = 13 and j = 5. S = T(1, 13) = aabaacaabacab and P(1, 5) = aabac. There does not exist any suffix A of S such that d(A, P(1, 5)) ≤ 1. ∴ R_1(13, 5) = 0.
     R_1 (columns: text position i; rows: pattern position j):
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=3 (b):   0  1  1  1  1  1  0  1  1  1  0  0  1
       j=4 (a):   0  0  1  1  1  0  1  0  1  1  1  0  0
       j=5 (c):   0  0  0  1  1  1  0  0  0  1  1  1  0
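One way to reproduce the R_1 table is to compute, for every (i, j), the smallest edit distance between P(1, j) and any suffix of T(1, i), and then threshold it at k. This is a sketch under that assumption and is not the construction used in the following slides; `best_suffix_distances` and `r_table` are hypothetical names.

```python
def best_suffix_distances(text: str, pattern: str) -> list[list[int]]:
    """dist[i][j] = min over suffixes A of T(1, i) of d(A, P(1, j))."""
    n, m = len(text), len(pattern)
    dist = [[0] * (m + 1) for _ in range(n + 1)]  # dist[i][0] = 0: empty suffix
    for j in range(1, m + 1):
        dist[0][j] = j                            # empty text: j characters missing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (text[i - 1] != pattern[j - 1]),
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
            )
    return dist


def r_table(text: str, pattern: str, k: int) -> list[list[int]]:
    """Rows are pattern positions j = 1..m, columns are text positions i = 1..n."""
    dist = best_suffix_distances(text, pattern)
    return [[int(dist[i][j] <= k) for i in range(1, len(text) + 1)]
            for j in range(1, len(pattern) + 1)]


for p_char, row in zip("aabac", r_table("aabaacaabacab", "aabac", 1)):
    print(p_char, *row)
```

The last row of the table (j = m) marks the text positions i that solve the original problem.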

  13. Question: How can we find R_k(i, j)? Answer: Dynamic programming. There are three types of operations in edit distance: (1) insertion, (2) deletion, (3) substitution. We consider them separately and combine the results later.

  14. Let RI_k(i, j), RD_k(i, j) and RS_k(i, j) denote the values of R_k(i, j) related to insertion, deletion and substitution, respectively, and let RI_k[i, j], RD_k[i, j] and RS_k[i, j] denote the corresponding tables.

  15. Consider RI_k(i, j) first.
     RI_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i-1, j) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1),
     RI_k(i, j) = 0 otherwise.
     [Diagram: insertion, relating position i of T to position j of P.]

  16. RI_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i-1, j) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1); RI_k(i, j) = 0 otherwise.
     Example: Text = aabaacaabacab. Pattern = aabac. k = 1.
     (1) When i = 13 and j = 3: t_13 = p_3 = 'b' and R_1(12, 2) = 1, ∴ RI_1(13, 3) = 1.
     (2) When i = 6 and j = 4: t_6 = 'c' ≠ p_4 = 'a' and R_0(5, 4) = 0, ∴ RI_1(6, 4) = 0.
     (3) When i = 11 and j = 4: t_11 = 'c' ≠ p_4 = 'a' and R_0(10, 4) = 1, ∴ RI_1(11, 4) = 1.
     R_0[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  0  1  1  0  1  1  0  1  0  1  0
       j=2 (a):   0  1  0  0  1  0  0  1  0  0  0  0  0
       j=3 (b):   0  0  1  0  0  0  0  0  1  0  0  0  0
       j=4 (a):   0  0  0  1  0  0  0  0  0  1  0  0  0
       j=5 (c):   0  0  0  0  0  0  0  0  0  0  1  0  0
     RI_1[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   0  1  1  1  1  1  1  1  1  1  0  1  0
       j=3 (b):   0  0  1  1  0  0  0  0  1  1  0  0  1
       j=4 (a):   0  0  0  1  1  0  0  0  0  1  1  0  0
       j=5 (c):   0  0  0  0  0  1  0  0  0  0  1  1  0

  17. Consider RD_k(i, j).
     RD_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i, j-1) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1),
     RD_k(i, j) = 0 otherwise.
     [Diagram: deletion, relating position i of T to positions j-1 and j of P.]

  18. RD_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i, j-1) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1); RD_k(i, j) = 0 otherwise.
     Example: Text = aabaacaabacab. Pattern = aabac. k = 1.
     (1) When i = 13 and j = 3: t_13 = p_3 = 'b' and R_1(12, 2) = 1, ∴ RD_1(13, 3) = 1.
     (2) When i = 6 and j = 4: t_6 = 'c' ≠ p_4 = 'a' and R_0(6, 3) = 0, ∴ RD_1(6, 4) = 0.
     (3) When i = 3 and j = 4: t_3 = 'b' ≠ p_4 = 'a' and R_0(3, 3) = 1, ∴ RD_1(3, 4) = 1.
     R_0[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  0  1  1  0  1  1  0  1  0  1  0
       j=2 (a):   0  1  0  0  1  0  0  1  0  0  0  0  0
       j=3 (b):   0  0  1  0  0  0  0  0  1  0  0  0  0
       j=4 (a):   0  0  0  1  0  0  0  0  0  1  0  0  0
       j=5 (c):   0  0  0  0  0  0  0  0  0  0  1  0  0
     RD_1[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   1  1  0  1  1  0  1  1  0  1  0  1  0
       j=3 (b):   0  1  1  0  1  0  0  1  1  0  0  0  1
       j=4 (a):   0  0  1  1  0  0  0  0  1  1  0  0  0
       j=5 (c):   0  0  0  1  0  0  0  0  0  1  1  0  0

  19. Consider RS_k(i, j).
     RS_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i-1, j-1) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1),
     RS_k(i, j) = 0 otherwise.
     [Diagram: substitution, relating positions i-1 and i of T to positions j-1 and j of P.]

  20. RS_k(i, j) = 1 if (t_i ≠ p_j and R_{k-1}(i-1, j-1) = 1) or (t_i = p_j and R_k(i-1, j-1) = 1); RS_k(i, j) = 0 otherwise.
     Example: Text = aabaacaabacab. Pattern = aabac. k = 1.
     (1) When i = 13 and j = 3: t_13 = p_3 = 'b' and R_1(12, 2) = 1, ∴ RS_1(13, 3) = 1.
     (2) When i = 6 and j = 4: t_6 = 'c' ≠ p_4 = 'a' and R_0(5, 3) = 0, ∴ RS_1(6, 4) = 0.
     (3) When i = 5 and j = 5: t_5 = 'a' ≠ p_5 = 'c' and R_0(4, 4) = 1, ∴ RS_1(5, 5) = 1.
     R_0[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  0  1  1  0  1  1  0  1  0  1  0
       j=2 (a):   0  1  0  0  1  0  0  1  0  0  0  0  0
       j=3 (b):   0  0  1  0  0  0  0  0  1  0  0  0  0
       j=4 (a):   0  0  0  1  0  0  0  0  0  1  0  0  0
       j=5 (c):   0  0  0  0  0  0  0  0  0  0  1  0  0
     RS_1[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   0  1  1  1  1  1  1  1  1  1  1  1  1
       j=3 (b):   0  0  1  0  0  1  0  0  1  0  0  0  1
       j=4 (a):   0  0  0  1  0  0  1  0  0  1  0  0  0
       j=5 (c):   0  0  0  0  1  0  0  0  0  0  1  0  0
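The three rules can be evaluated on the worked cells of slides 16, 18 and 20. In this sketch the lookups R_0 and R_1 are obtained by thresholding best-suffix edit distances, which is an assumption made for convenience; the helper names (`DIST`, `r`, `same`, `ri`, `rd`, `rs`) are hypothetical.

```python
TEXT, PATTERN = "aabaacaabacab", "aabac"


def best_suffix_distances(text: str, pattern: str) -> list[list[int]]:
    """dist[i][j] = min over suffixes A of T(1, i) of d(A, P(1, j))."""
    n, m = len(text), len(pattern)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j - 1] + (text[i - 1] != pattern[j - 1]),
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    return dist


DIST = best_suffix_distances(TEXT, PATTERN)


def r(k: int, i: int, j: int) -> bool:
    """R_k(i, j), with R_k = 0 everywhere when k < 0 (assumption)."""
    return k >= 0 and DIST[i][j] <= k


def same(i: int, j: int) -> bool:
    return TEXT[i - 1] == PATTERN[j - 1]  # t_i = p_j, 1-based


def ri(k, i, j):  # insertion rule
    return (not same(i, j) and r(k - 1, i - 1, j)) or (same(i, j) and r(k, i - 1, j - 1))


def rd(k, i, j):  # deletion rule
    return (not same(i, j) and r(k - 1, i, j - 1)) or (same(i, j) and r(k, i - 1, j - 1))


def rs(k, i, j):  # substitution rule
    return (not same(i, j) and r(k - 1, i - 1, j - 1)) or (same(i, j) and r(k, i - 1, j - 1))


cases = [("RI", ri, [(13, 3), (6, 4), (11, 4)]),
         ("RD", rd, [(13, 3), (6, 4), (3, 4)]),
         ("RS", rs, [(13, 3), (6, 4), (5, 5)])]
for name, rule, cells in cases:
    for i, j in cells:
        print(f"{name}_1({i},{j}) = {int(rule(1, i, j))}")
```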

  21. After RI_k(i, j), RD_k(i, j) and RS_k(i, j) have been found, we immediately determine R_k(i, j) by R_k(i, j) = RI_k(i, j) or RD_k(i, j) or RS_k(i, j).
     Example: Text = aabaacaabacab. Pattern = aabac. k = 1.
     R_1[13, 5]:
       i:         1  2  3  4  5  6  7  8  9 10 11 12 13
       T:         a  a  b  a  a  c  a  a  b  a  c  a  b
       j=1 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=2 (a):   1  1  1  1  1  1  1  1  1  1  1  1  1
       j=3 (b):   0  1  1  1  1  1  0  1  1  1  0  0  1
       j=4 (a):   0  0  1  1  1  0  1  0  1  1  1  0  0
       j=5 (c):   0  0  0  1  1  1  0  0  0  1  1  1  0
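The combination step can be sketched as a level-by-level computation: R_0 first, then R_1 from R_0, and so on. The boundary values used here (R_k(i, 0) = 1 for all i, and R_k(0, j) = 1 only when j ≤ k) are assumptions, since the slides do not state them, so intermediate component values may depend on that choice; `r_levels` is a hypothetical name.

```python
def r_levels(text: str, pattern: str, k_max: int) -> list[list[list[bool]]]:
    """levels[k][i][j] = R_k(i, j) for 0 <= i <= n, 0 <= j <= m."""
    n, m = len(text), len(pattern)
    levels = []
    for k in range(k_max + 1):
        cur = [[False] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cur[i][0] = True            # assumed boundary: empty pattern prefix
        for j in range(1, m + 1):
            cur[0][j] = j <= k          # assumed boundary: empty text
        prev = levels[-1] if levels else None   # R_{k-1}; absent when k = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = text[i - 1] == pattern[j - 1]
                hit = same and cur[i - 1][j - 1]      # shared t_i = p_j branch
                err = prev is not None and not same   # an error may be spent
                ri = hit or (err and prev[i - 1][j])      # insertion
                rd = hit or (err and prev[i][j - 1])      # deletion
                rs = hit or (err and prev[i - 1][j - 1])  # substitution
                cur[i][j] = ri or rd or rs
        levels.append(cur)
    return levels


levels = r_levels("aabaacaabacab", "aabac", 1)
for j, p_char in enumerate("aabac", start=1):
    print(p_char, *(int(levels[1][i][j]) for i in range(1, 14)))
```

Each level only reads the previous level and already-filled cells of the current one, so the work is proportional to k · n · m table cells.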

  22. Thank you!
