
A Unifying Framework for Compressed Pattern Matching

Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa. Department of Informatics, Kyushu University, Japan.





Presentation Transcript


  1. A Unifying Framework for Compressed Pattern Matching. Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa. Department of Informatics, Kyushu University, Japan

  2. Contents • Pattern matching and compressed pattern matching • Previous results • Collage system • Proposed algorithm • Conclusion

  3. Pattern Matching Problem. [Figure: a pattern and a text; the text is compressed.] We introduce a general framework that captures the essence of compressed pattern matching for various dictionary-based compression methods. The goal is to find all occurrences of a pattern in a compressed text without decompressing it, which is one of the most active topics in string matching. Our framework covers compression methods such as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary-based method. Technically, our pattern matching algorithm substantially extends the algorithm for LZW-compressed text presented by Amir, Benson, and Farach.

  4. Compressed Pattern Matching. [Figure: the conventional approach decompresses the compressed text into the original text and runs a pattern matching machine on it; the new machine runs directly on the compressed text.]

  5. Previous Results (1)

     year | researcher | compression
     1988 | Eliam-Tsoreff and Vishkin | run-length
     1992 | Amir, Landau, and Vishkin | two-dimensional run-length
     1992 | Amir and Benson | two-dimensional run-length
     1994 | Amir, Benson, and Farach | two-dimensional run-length
     1994 | Manber | original compression scheme
     1995 | Farach and Thorup | LZ77
     1996 | Gasieniec, et al. | LZ77
     1996 | Amir, Benson, and Farach | LZW
     1998 | Kida, et al. | LZW
     1997 | Karpinski, Rytter, and Shinohara | straight-line programs
     1997 | Miyazaki, Shinohara, and Takeda | straight-line programs
     1997 | Takeda | finite state encoding
     1998 | Fukamachi, Shinohara, and Takeda | Huffman encoding
     1998 | Shibata | byte pair encoding

  6. Previous Results (2)

     year | researcher | compression
     1998 | de Moura, Navarro, Ziviani, and Baeza-Yates | word-based encoding
     1999 | Kida, Takeda, Shinohara, and Arikawa | LZW
     1999 | Navarro and Raffinot | LZ family
     1999 | Shibata, et al. | byte pair encoding
     1999 | Kida, et al. | dictionary-based methods (collage system)
     1999 | Shibata, Takeda, Shinohara, and Arikawa | antidictionaries

     Slide annotations: "faster than Agrep!" and "Today's talk" (the latter marks the collage system entry).

  7. Motivation. Previous: a separate pattern matching (PM) algorithm for each compression method (Compression A with PM Algorithm A, Compression B with PM Algorithm B, Compression C with PM Algorithm C). Ours: Compressions A, B, and C are all expressed as a collage system, and one general pattern matching algorithm runs on this unifying framework.

  8. Collage System Definition and Several Examples

  9. Dictionary Based Compression. The original text is factorized into a series of phrases, and the phrases are encoded with the help of a dictionary structure to produce the compressed text. • How to choose the phrases. • How to design the data structure of the dictionary. • How to encode phrases.

  10. Definition of Collage System • A collage system is a pair 〈D, S〉. • D: a sequence of assignments (dictionary structure): X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n. • S: a sequence of variables defined in D (compressed text): S = X_{i_1}, X_{i_2}, ..., X_{i_l}, where each X_{i_j} is defined in D. • ||D|| = n: the number of assignments in D. • |S| = l: the number of variables in S.

  11. Definition of Collage System. D: a sequence of assignments (dictionary structure) X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n, where each expr_k is one of: • a, for a ∈ Σ ∪ {ε} (primitive assignment) • X_i · X_j, for i, j < k (concatenation) • (X_i)^j, for i < k and an integer j (j times repetition) • [j]X_i, for i < k and an integer j (prefix truncation) • X_i[j], for i < k and an integer j (suffix truncation)

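  12. Example of Collage System. D: X_1 = a; X_2 = b; X_3 = X_1 · X_2; X_4 = X_2 · X_1; X_5 = (X_3)^3; X_6 = [3]X_5; X_7 = X_6 · X_4. S: X_3, X_6, X_4, X_7. Derivation tree T(X_7): X_7 = babba is the concatenation of X_6 = bab and X_4 = ba; X_6 is the prefix truncation [3] of X_5 = ababab; X_5 is the 3 times repetition of X_3 = ab; X_3 = X_1 · X_2 and X_4 = X_2 · X_1, with X_1 = a and X_2 = b. height(X_7) = 4; height(D) = 4. The text represented by S is abbabbababba.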
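The example on this slide can be checked mechanically. The following is a minimal sketch, not the authors' code: the tuple tags (`"prim"`, `"cat"`, `"rep"`, `"ptrunc"`, `"strunc"`) and the helper names `expand` and `height` are illustrative assumptions. It reads `[j]X` as "X with its first j characters removed", which is the reading consistent with the slide, where `[3]` applied to `ababab` gives `bab`; suffix truncation is treated symmetrically by assumption.

```python
from functools import lru_cache

# Dictionary D of the slide; keys are the variable indices.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),       # X3 = X1 . X2
    4: ("cat", 2, 1),       # X4 = X2 . X1
    5: ("rep", 3, 3),       # X5 = (X3)^3
    6: ("ptrunc", 5, 3),    # X6 = [3]X5 (assumed: drop the first 3 characters)
    7: ("cat", 6, 4),       # X7 = X6 . X4
}
S = [3, 6, 4, 7]            # compressed text: X3, X6, X4, X7


@lru_cache(maxsize=None)
def expand(k):
    """Return the string represented by variable X_k."""
    op = D[k]
    if op[0] == "prim":
        return op[1]
    if op[0] == "cat":
        return expand(op[1]) + expand(op[2])
    if op[0] == "rep":
        return expand(op[1]) * op[2]
    if op[0] == "ptrunc":
        return expand(op[1])[op[2]:]
    if op[0] == "strunc":   # assumed: drop the last j characters
        s = expand(op[1])
        return s[:len(s) - op[2]]
    raise ValueError(f"unknown expression {op!r}")


@lru_cache(maxsize=None)
def height(k):
    """Height of the derivation tree of X_k (primitive assignments have height 0)."""
    op = D[k]
    if op[0] == "prim":
        return 0
    if op[0] == "cat":
        return 1 + max(height(op[1]), height(op[2]))
    return 1 + height(op[1])


text = "".join(expand(k) for k in S)
print(text, height(7))
```

Expanding every variable like this is exactly the decompression that the talk's algorithm avoids; the sketch only serves to confirm the values shown in the derivation tree.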

  13. Example of Collage System: Byte Pair Encoding (BPE). Original text: ababcbabccabcacb. Replacing ab with D gives DDcbDccDcacb; replacing Dc with E gives DEbEcEacb. D: X_1 = a; X_2 = b; X_3 = c; X_4 = X_1 · X_2 (D = ab); X_5 = X_4 · X_3 (E = Dc). S: X_4, X_5, X_2, X_5, X_3, X_5, X_1, X_3, X_2
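To make the correspondence concrete, here is a small sketch of a byte pair encoder that emits its result directly as a collage system. It is an illustration under assumptions, not the encoder used by the authors: the function name `bpe_collage`, the tuple tags, the greedy left-to-right replacement, and the stopping rule (stop when no pair occurs twice) are all choices made for this example.

```python
from collections import Counter


def bpe_collage(text):
    """Return (D, S): D is a list of assignments (X_k is D[k-1]), S a list of indices."""
    alphabet = sorted(set(text))
    D = [("prim", c) for c in alphabet]            # X_1 .. X_q, one per character
    index = {c: i + 1 for i, c in enumerate(alphabet)}
    seq = [index[c] for c in text]
    while True:
        pairs = Counter(zip(seq, seq[1:]))         # adjacent pairs (overlaps counted)
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                              # no pair worth replacing
            break
        D.append(("cat", a, b))                    # new variable X_new = X_a . X_b
        new = len(D)
        out, i = [], 0
        while i < len(seq):                        # greedy left-to-right replacement
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return D, seq


D, S = bpe_collage("ababcbabccabcacb")
print(D)
print(S)
```

On the slide's text this yields the same two concatenation assignments and the same sequence S as shown above. Only primitive assignments and concatenations are needed, so BPE falls in the truncation-free case.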

  14. Example of Collage System (LZSS [gzip]). D: X_1 = a_1; X_2 = a_2; ...; X_q = a_q; X_{q+1} = (([i_1]X_{l(1)} · X_{l(1)+1} ··· X_{r(1)})^{m_1})[j_1] b_1; X_{q+2} = (([i_2]X_{l(2)} · X_{l(2)+1} ··· X_{r(2)})^{m_2})[j_2] b_2; ...; X_{q+n} = (([i_n]X_{l(n)} · X_{l(n)+1} ··· X_{r(n)})^{m_n})[j_n] b_n. S: X_{q+1}, X_{q+2}, ..., X_{q+n}

  15. What is ‘Collage’? This is college!

  16. Collage is ... • an artistic composition technique. 1. Cut or tear up materials. 2. Paste the pieces over a surface.

  17. Our Algorithm: Pattern Matching Algorithm on a Collage System

  18. Compressed pattern matching on a collage system. The problem of compressed pattern matching can be solved in O((||D|| + |S|) · height(D) + m^2 + r) time using O(||D|| + m^2) space. If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time, that is, O(compressed text length + m^2 + r). Here ||D|| is the number of assignments in D, |S| is the number of variables in S, m is the pattern length, and r is the number of pattern occurrences.

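  19. Basic Idea. [Figure: the KMP automaton for a pattern π, drawn with its goto function and failure function over numbered states, processing the original text abababba. The text is factorized into the phrases X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4} of S, and the state of the automaton is tracked phrase by phrase.]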

  20. Jump and Output. The function Jump(j, u) = δ_KMP(j, u) • It simulates the sequence of state transitions for u. • The domain is Q × D. • It is answered in O(1) time. The set Output(j, u) = { 1 ≤ i ≤ |u| : P is a suffix of P[1:j] · u[1:i] } • This set contains the pattern occurrences. • It is answered in O(l) time, where l is the size of the set.
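The two definitions can be written down almost literally. The sketch below is a naive reference version for illustration only: it works on an explicit string u rather than on a dictionary variable, so it does not achieve the O(1) and O(l) bounds stated on the slide. The function names and the 0-based Python slicing are assumptions of this example; the state j is identified with the prefix of P of length j.

```python
def jump(P, j, u):
    """State reached from state j after reading u.

    This is the length of the longest prefix of P that is a suffix of
    P[1:j] . u (written P[:j] + u in Python slicing).
    """
    s = P[:j] + u
    for k in range(min(len(P), len(s)), -1, -1):
        if s.endswith(P[:k]):
            return k


def output(P, j, u):
    """All i in 1..|u| such that P is a suffix of P[1:j] . u[1:i]."""
    return [i for i in range(1, len(u) + 1) if (P[:j] + u[:i]).endswith(P)]


# From state 2 of P = "aba" (prefix "ab" already read), reading u = "aba":
print(jump("aba", 2, "aba"), output("aba", 2, "aba"))
```

Because the state after a phrase depends only on the state before it and on the phrase itself, the whole text can be processed one variable of S at a time, which is the idea the following slides build on.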

  21. Realization of Jump. For Jump(q, X_k), if X_k is ... • a: O(1) time. • X_i · X_j: O(1) time, provided that the factor concatenation problem for a string of length m can be solved in O(1) time. • (X_i)^j: O(1) time. • [j]X_i or X_i[j]: O(height(X_i)) time.

  22. Factor Concatenation Problem. Instance: two factors x and y of a string P, each represented as a node of the suffix trie of P. Question: is the string xy a factor of P? If 'yes', return its node number. Example: P = COPACABANA, x = OPA, y = CABAN; the concatenation xy = OPACABAN equals P[2:9], so the answer is 'yes'.
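As an illustration of the question being asked, here is a deliberately naive sketch. It replaces the suffix trie nodes of the slide with plain strings and a hash table that maps each factor of P to its first occurrence, so it does not meet the preprocessing bounds discussed in the talk. The names `build_factor_table` and `concat_factors` and the 1-based inclusive positions are assumptions of this example.

```python
def build_factor_table(P):
    """Map every factor of P to its first occurrence as (start, end), 1-based inclusive."""
    table = {}
    m = len(P)
    for start in range(m):
        for end in range(start + 1, m + 1):
            table.setdefault(P[start:end], (start + 1, end))
    return table


def concat_factors(table, x, y):
    """Return the occurrence of xy in P, or None if xy is not a factor of P."""
    return table.get(x + y)


table = build_factor_table("COPACABANA")
print(concat_factors(table, "OPA", "CABAN"))   # the slide's example
print(concat_factors(table, "ANA", "C"))       # ANAC is not a factor
```

A query is a single hash lookup here, but hashing the concatenated string costs time proportional to its length, which is why the talk represents factors by node numbers instead.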

  23. Solution to the problem • Using a suffix trie, it can be solved in O(m) time after O(m^2) time and space preprocessing. • Using a two-dimensional lookup table, it can be solved in O(1) time, but the preprocessing needs O(m^4) time and space. • It can be solved in O(1) time after O(m^2) time and space preprocessing.

  24. Realization of Output. For Output(q, X_k), if X_k is ... (l denotes the size of the set Output) • a: O(1) time. • X_i · X_j: it can be enumerated in O(l) time from the Output sets of X_i and X_j. • (X_i)^j: O(l) time. • [j]X_i or X_i[j]: O(l · height(X_i)) time.

  25. Outline of Our Algorithm
      Input: a pattern P and a collage system 〈D, S〉 (S = X_{i_1}, X_{i_2}, ..., X_{i_n}).
      Output: all occurrences of the pattern.

      /* preprocess D and P */
      preprocess(D); preprocess(P);
      l := 0; q := 0;
      for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
          report 'pattern occurs at position l + d';
        q := Jump(q, X_{i_j});   /* state transition */
        l := l + |X_{i_j}|;      /* calculate the offset */
      end
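The outline can be turned into a short runnable sketch. This is a simplified illustration, not the algorithm with the stated bounds: it tabulates Jump and Output for every state and every variable, which costs far more than O(||D|| + m^2) space, it does not use the factor concatenation technique, and it leaves out truncation. It relies only on the composition rules Jump(q, X_i · X_j) = Jump(Jump(q, X_i), X_j) and Output(q, X_i · X_j) = Output(q, X_i) ∪ { |X_i| + d : d ∈ Output(Jump(q, X_i), X_j) }. All names and the tuple encoding of D are assumptions; P must be non-empty, and reported positions are the 1-based end positions of occurrences.

```python
def build_dfa(P):
    """Deterministic KMP automaton: delta[q][c] is the next state, q in 0..len(P)."""
    m, alphabet = len(P), set(P)
    delta = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    delta[0][P[0]] = 1
    x = 0                                  # state after reading P[1:q]; used on mismatch
    for q in range(1, m + 1):
        delta[q] = dict(delta[x])
        if q < m:
            delta[q][P[q]] = q + 1
            x = delta[x][P[q]]
    return delta


def compose(a, b):
    """Tables (length, jump, out) of a concatenation, from the tables of its parts."""
    la, ja, oa = a
    lb, jb, ob = b
    states = range(len(ja))
    jump = [jb[ja[q]] for q in states]
    out = [oa[q] + [la + d for d in ob[ja[q]]] for q in states]
    return la + lb, jump, out


def search(P, D, S):
    """D: {k: ("prim", c) | ("cat", i, j) | ("rep", i, n)}; S: list of variable indices."""
    m, delta = len(P), build_dfa(P)
    states = range(m + 1)
    eps = (0, list(states), [[] for _ in states])     # tables of the empty string
    T = {}
    for k in sorted(D):                    # assignments refer to smaller indices only
        op = D[k]
        if op[0] == "prim" and op[1] == "":
            T[k] = eps
        elif op[0] == "prim":
            jump = [delta[q].get(op[1], 0) for q in states]
            T[k] = (1, jump, [[1] if s == m else [] for s in jump])
        elif op[0] == "cat":
            T[k] = compose(T[op[1]], T[op[2]])
        elif op[0] == "rep":
            t = eps
            for _ in range(op[2]):
                t = compose(t, T[op[1]])
            T[k] = t
        else:
            raise ValueError("truncation is not covered by this sketch")
    l, q, hits = 0, 0, []                  # main loop, as in the outline
    for k in S:
        length, jump, out = T[k]
        hits.extend(l + d for d in out[q])
        q = jump[q]
        l += length
    return hits


bpe_D = {1: ("prim", "a"), 2: ("prim", "b"), 3: ("prim", "c"),
         4: ("cat", 1, 2), 5: ("cat", 4, 3)}
bpe_S = [4, 5, 2, 5, 3, 5, 1, 3, 2]        # the BPE example: ababcbabccabcacb
print(search("abc", bpe_D, bpe_S))
```

The main loop never expands a variable into its string; all the work per variable of S is a table lookup plus the enumeration of the reported positions.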

  26. Concluding Remarks Conclusion and Future Works

  27. Our Results. Complexity of our algorithm: O(||D|| + m^2) space and O((||D|| + |S|) · height(D) + m^2 + r) time. If D contains no truncation: O(||D|| + |S| + m^2 + r) time. No truncation: LZ78, LZW, BPE, run-length, etc. Truncation: LZ77, LZSS, etc. For comparison, 1998 Kida, et al. (LZW): O(n + m^2) space and O(n + m^2 + r) time.

  28. Conclusion • We introduced a general framework for compressed pattern matching (collage system). • We proposed a compressed pattern matching algorithm on a collage system and showed its complexity. • O((||D|| + |S|) · height(D) + m^2 + r) time • O(||D|| + m^2) space • (If no truncation) O(||D|| + |S| + m^2 + r) time

  29. Future Works • Can we reduce the complexity of the preprocessing from O(m^2) to O(m)? • Improve our algorithm to deal with multiple patterns. • Develop an approximate pattern matching algorithm on a collage system. • Develop a new compression method that is suitable for compressed pattern matching.
