
Shift-And Approach to Pattern Matching in LZW Compressed Text

Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.



  1. Shift-And Approach to Pattern Matching in LZW Compressed Text. Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.

  2. Motivation
     • The available storage devices are limited!
     • I am eager to stuff in as much information as possible!
     • I want to do pattern matching as fast as possible!
     ...Yes! Data compression! ...but a suffix trie is very large...
     [Figure labels: phone numbers, electronic book, address book, memo, dictionary, e-mail, schedule, database.]

  3. Our goal
     Conventional approach: Compressed Text → decompress → Original Text → Pattern Matching Machine.
     Our goal: Compressed Text → New Machine!

  4. Previous researches
     Year  Researchers                        Compression method
     1988  Eilam-Tzoreff and Vishkin          run-length
     1992  Amir, Landau, and Vishkin          two-dimensional run-length
     1992  Amir and Benson                    two-dimensional run-length
     1994  Amir, Benson, and Farach           two-dimensional run-length
     1994  Manber                             original compression scheme
     1995  Farach and Thorup                  LZ77
     1996  Gasieniec, et al.                  LZ77
     1996  Amir, Benson, and Farach           LZW
     1998  Kida, et al.                       LZW (AC automaton, DCC’98)
     1997  Karpinski, Rytter, and Shinohara   straight-line programs
     1997  Miyazaki, Shinohara, and Takeda    straight-line programs
     1997  Takeda                             finite state encoding
     1998  Fukamachi, Shinohara, and Takeda   Huffman encoding
     1998  Shibata                            byte pair encoding

  5. Recent researches
     Year  Researchers                                   Compression method
     1999  Shibata, Takeda, Shinohara, and Arikawa       antidictionaries
     1998  de Moura, Navarro, Ziviani, and Baeza-Yates   word-based encoding
     1999  Navarro and Raffinot                          LZ family
     1999  Kida, Takeda, Shinohara, and Arikawa          LZW (Shift-And algorithm)
     1999  Shibata, et al.                               byte pair encoding
     1999  Kida, et al.                                  dictionary-based methods (collage system)
     Venue labels on the slide: CPM’99 (three entries), SPIRE’99 (one entry).

  6. Our main results
     • The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
     • The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
     • The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
     Notation: |D| = size of the dictionary trie, n = compressed text length, m = pattern length, r = number of pattern occurrences, Σ = alphabet.

  7. Lempel-Ziv-Welch Compression: how to compress and decompress

  8. Lempel-Ziv-Welch (LZW) compression
     Original text:   a b ab ab ba b c aba bc abab aba
     Compressed text: 1 2 4 4 5 2 3 6 9 11 6
     [Figure: dictionary trie with root 0 and nodes 1 to 12, edges labeled a, b, c.]
     O(|D|) = O(n)

  9. How to compress a text
     Original text:   a b ab ab ba b c aba bc abab
     Compressed text: 1 2 4 4 5 2 3 6 9 11
     [Figure: dictionary trie with root 0 and nodes 1 to 12, growing by one node for each code that is output.]
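The compression step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the alphabet {a, b, c} is pre-loaded as nodes 1 to 3 (which is what the slide's numbering suggests); the function name `lzw_compress` and the dictionary-of-edges trie representation are choices made here, not taken from the slides.

```python
def lzw_compress(text, alphabet):
    """Return (codes, trie); trie maps (parent_node, char) -> node number.

    Node 0 is the root. Alphabet characters get nodes 1..len(alphabet)
    (an assumption that matches the numbering in the slide's example).
    """
    trie = {}
    for number, ch in enumerate(alphabet, start=1):
        trie[(0, ch)] = number
    next_node = len(alphabet) + 1
    codes = []
    node = 0
    for ch in text:
        if (node, ch) in trie:
            # Keep walking down the trie while the phrase is known.
            node = trie[(node, ch)]
        else:
            # Emit the longest known phrase, then grow the trie by one node.
            codes.append(node)
            trie[(node, ch)] = next_node
            next_node += 1
            node = trie[(0, ch)]
    if node != 0:
        codes.append(node)
    return codes, trie


text = "abababbabcababcabab"  # a b ab ab ba b c aba bc abab
codes, trie = lzw_compress(text, "abc")
print(codes)  # [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Each emitted code adds exactly one trie node, which is why the trie size |D| is bounded by O(n).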

  10. How to decompress a compressed text
     Compressed text: 1 2 4 4 5 2 3 6 9 11
     Original text:   a b ab ab ba b c aba bc abab
     [Figure: the same dictionary trie with root 0 and nodes 1 to 12, rebuilt while decoding.]
     Cost annotations on the slide: O(N) time, O(n) time.
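Decompression rebuilds the same dictionary while reading the codes. The sketch below assumes the same node numbering as the slide (alphabet characters as nodes 1 to 3); `lzw_decompress` is a hypothetical name, and storing each phrase as a full string is a simplification for readability, where a real implementation would keep parent pointers instead.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (illustrative sketch)."""
    phrases = {number: ch for number, ch in enumerate(alphabet, start=1)}
    next_node = len(alphabet) + 1
    out = []
    previous = None
    for code in codes:
        if code in phrases:
            phrase = phrases[code]
        else:
            # The code names the node that is being created right now,
            # so its phrase is the previous phrase plus its own first char.
            phrase = previous + previous[0]
        if previous is not None:
            # Mirror the compressor: previous phrase + first char of this one.
            phrases[next_node] = previous + phrase[0]
            next_node += 1
        out.append(phrase)
        previous = phrase
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
# abababbabcababcabab
```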

  11. Compressed Pattern Matching in LZW Compressed Text with Shift-And approach

  12. Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
     pattern: aabac
     text:    aabaacaabacab
     [Figure: mask bits for a, b, and c, and the state bit-vector after each text character. Each step shifts the state by one bit and ANDs it with the mask of the current character. When the last bit becomes 1: Pattern was found!]
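The Shift-And scan on uncompressed text can be sketched as follows. This is an illustrative version that uses Python integers as bit-vectors, with bit i-1 standing for pattern position i; the name `shift_and` and the choice to report 1-based end positions are assumptions made here.

```python
def shift_and(pattern, text):
    """Return the 1-based end positions of all occurrences of pattern."""
    m = len(pattern)
    # Mask bits: bit i-1 of mask[ch] is set when pattern[i] == ch.
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # Shift, inject a fresh match at position 1, keep matching bits only.
        state = ((state << 1) | 1) & mask.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and("aabac", "aabaacaabacab"))  # [11]
```

On the slide's example the only occurrence of aabac covers text positions 7 to 11, so the scan reports end position 11.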

  13. Properties of Shift-And approach
     • Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
     • Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
     • This method has many variations:
       • generalized pattern matching
       • pattern matching with k mismatches
       • pattern matching for multiple patterns

  14. Basic idea of our algorithm
     pattern: aabac
     text:    aabaacaabacab
     compressed text: 1 6 15
     [Figure: mask bits and the state bit-vector after each text character. Each code of the compressed text stands for a whole substring, so the state should "Jump!" over that substring in one step.]
     Can each jump be done in O(1) time?

  15. Basic idea of our algorithm
     pattern: aabac
     text:    aabaacaabacab
     compressed text: 1 6 15
     [Figure: the same state table; the pattern ends inside the substring covered by a single jump. Pattern was found!]
     We need a mechanism for reporting all pattern occurrences.

  16. Technical details
     Lemma 1 (Realization of ‘Jump’). The state transition function can be built in O(|D|+m) time using O(|D|) space so that it returns its value in O(1) time.
     Lemma 2 (Realization of ‘Output’). The procedure which enumerates the pattern occurrences can be built in O(|D|+m) time using O(|D|) space so that it runs in O(r) time.
     Notation: |D| = size of the dictionary trie, m = pattern length, r = number of pattern occurrences.

  17. Overview of the algorithm
     Input.  Pattern P and LZW compressed text u_1, u_2, …, u_n.
     Output. All occurrences of the pattern.
       Construct mask bits from P.
       Initialize the dictionary trie, M̂, U, and V;
       l := 0; S := ∅;
       for i := 1 to n do begin
         for each d ∈ Output(S, u_i) do
           report ‘pattern occurs at position l+d’;
         S := f̂(S, u_i);    /* Jump the state! */
         l := l + |u_i|;     /* increment the offset */
         Update the dictionary trie, M̂, U, and V;
       end

  18. Detail of our algorithm: realization of Jump and Output

  19. Detail of ‘Jump’
     For a ∈ Σ and S ⊆ {1, …, m}:
       M(a) = { 1 ≤ i ≤ m | Pattern[i] = a }
       f(S, a) = ((S ⊕ 1) ∪ {1}) ∩ M(a), where S ⊕ k = { i + k | i ∈ S, i + k ≤ m }
     As bit operations: ⊕ is a bit shift, ∪ is OR, ∩ is AND.
     Example for the pattern aabac: M(a) = {1, 2, 4}, M(b) = {3}, M(c) = {5}; state S = {1, 3}.
     [Figure: mask bits for a, b, c and one state transition on a bit-vector.]

  20. Detail of ‘Jump’
     For a ∈ Σ, u ∈ Σ*, and S ⊆ {1, …, m}, define f̂ recursively:
       f̂(S, ε) = S
       f̂(S, ua) = f(f̂(S, u), a)
     With M̂(u) = f̂({1, …, m}, u), the jump over a whole string u takes O(1) time:
       f̂(S, u) = ((S ⊕ |u|) ∪ {1, …, |u|}) ∩ M̂(u)
     Here M(a) = { 1 ≤ i ≤ m | Pattern[i] = a }, f(S, a) = ((S ⊕ 1) ∪ {1}) ∩ M(a), and S ⊕ k = { i + k | i ∈ S, i + k ≤ m }.
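To see that the closed form for the jump agrees with the character-by-character definition, the sketch below implements both with integer bit-vectors (bit i-1 stands for position i). The names `make_masks`, `step`, `jump_naive`, and `jump` are hypothetical, and M̂(u) is recomputed naively here purely for the comparison; in the algorithm it would be looked up from the dictionary trie.

```python
def make_masks(pattern):
    """M(a) as a bit-vector for every character a of the pattern."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def step(state, ch, masks):
    """f(S, a) = ((S shifted by 1) OR {1}) AND M(a)."""
    return ((state << 1) | 1) & masks.get(ch, 0)


def jump_naive(state, u, masks):
    """f-hat(S, u) by its recursive definition, one character at a time."""
    for ch in u:
        state = step(state, ch, masks)
    return state


def jump(state, u, pattern, masks):
    """f-hat(S, u) by the closed form, given M-hat(u)."""
    m = len(pattern)
    full = (1 << m) - 1                  # the set {1, ..., m}
    k = len(u)
    m_hat = jump_naive(full, u, masks)   # M-hat(u); a table lookup in practice
    low = (1 << min(k, m)) - 1           # the set {1, ..., |u|}, capped at m
    return (((state << k) & full) | low) & m_hat


pattern = "aabac"
masks = make_masks(pattern)
state = 0b00101                          # S = {1, 3}
print(bin(jump(state, "a", pattern, masks)))  # 0b1011, i.e. {1, 2, 4}
```

Once M̂(u) is stored with each trie node, the closed form costs one shift, one OR, and one AND per code, independent of |u|.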

  21. Move of f̂(S, u)
     text: aabaacaabacab
     [Figure: the state bit-vector is shifted by |u| positions and ANDed with M̂(u), jumping over the substrings aba and acaabac in one step each.]

  22. Move of f̂(S, u)
     text: aabaacaabacab
     [Figure: next animation step of the same jump over the substrings aba and acaabac, using M̂(u).]

  23. How to calculate M̂(u)
       M̂(u·a) = f̂({1, …, m}, u·a)
               = f(f̂({1, …, m}, u), a)
               = f(M̂(u), a)
               = ((M̂(u) ⊕ 1) ∪ {1}) ∩ M(a)
     [Figure: in the dictionary trie D, the node u stores M̂(u); its child u·a, reached by the edge a, stores M̂(u·a).]
     Each new node costs O(1); total: O(|D|) time and space.
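The per-node update can be sketched as below: each new trie node gets its M̂ value from its parent with one shift, one OR, and one AND. The edge list is the dictionary trie of the earlier LZW example (nodes 1 to 12), and the pattern `aba` is picked here only for illustration; the list names are hypothetical.

```python
pattern = "aba"
m = len(pattern)
full = (1 << m) - 1

# M(a) as bit-vectors: bit i-1 is set when pattern[i] == a.
masks = {}
for i, ch in enumerate(pattern):
    masks[ch] = masks.get(ch, 0) | (1 << i)

# (parent, edge label) for nodes 1..12 of the trie in the LZW example.
edges = [(0, "a"), (0, "b"), (0, "c"), (1, "b"), (2, "a"), (4, "a"),
         (4, "b"), (5, "b"), (2, "c"), (3, "a"), (6, "b"), (9, "a")]

m_hat = [full]   # M-hat of the root (empty string) is {1, ..., m}
phrase = [""]    # kept only so the result can be inspected
for parent, ch in edges:
    # M-hat(u.a) = ((M-hat(u) shifted by 1) OR {1}) AND M(a): O(1) per node.
    m_hat.append(((m_hat[parent] << 1) | 1) & masks.get(ch, 0))
    phrase.append(phrase[parent] + ch)

for node in (1, 4, 6, 11):
    positions = [i + 1 for i in range(m) if m_hat[node] >> i & 1]
    print(node, phrase[node], positions)
```

For node 6 (phrase aba) the set contains m = 3, which records that the whole phrase ends with the pattern.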

  24. How to enumerate the occurrences
     Output : 2^{1, …, m} × D → set of positions
       Output(S, u) = { 1 ≤ j ≤ |u| | m ∈ f̂(S, u[1..j]) }
     [Figure: the state S stands for the length-i prefix of the pattern, for the largest i ∈ S, followed by the string u. Pattern occurrences end at positions 2 and 11 of u, so Output(S, u) = {2, 11}.]

  25. Realization of Output(S, u)
       Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ V(u), where m ⊖ S = { m − i | i ∈ S }
       U(u) = { 1 ≤ i ≤ |u| | i < m and u[1..i] = Pattern[m−i+1..m] }
       V(u) = { 1 ≤ i ≤ |u| | i ≥ m and u[i−m+1..i] = Pattern }
     The first term, (m ⊖ S) ∩ U(u), depends on S; V(u) is independent of S.
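The decomposition into U and V can be checked against the direct definition of Output. The sketch below uses plain Python sets rather than bit-vectors so that it mirrors the formulas literally; all function names are hypothetical, and U and V are computed by brute force here rather than incrementally.

```python
def step_set(S, ch, pattern):
    """f(S, a) on sets of pattern positions."""
    shifted = {i + 1 for i in S} | {1}
    return {i for i in shifted if i <= len(pattern) and pattern[i - 1] == ch}


def output_by_definition(S, u, pattern):
    """Positions j of u at which an occurrence ends, starting from state S."""
    m = len(pattern)
    state = set(S)
    out = set()
    for j, ch in enumerate(u, start=1):
        state = step_set(state, ch, pattern)
        if m in state:
            out.add(j)
    return out


def u_set(u, pattern):
    """U(u): short prefixes of u that equal a suffix of the pattern."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i < m and u[:i] == pattern[m - i:]}


def v_set(u, pattern):
    """V(u): positions where a whole occurrence ends inside u."""
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i >= m and u[i - m:i] == pattern}


def output_by_uv(S, u, pattern):
    """Output(S, u) = ((m minus S) intersect U(u)) union V(u)."""
    m = len(pattern)
    return ({m - i for i in S} & u_set(u, pattern)) | v_set(u, pattern)


pattern = "aabac"
S = {1, 4}  # state after reading "aaba"
print(sorted(output_by_uv(S, "caabac", pattern)))  # [1, 6]
```

In the example the first occurrence straddles the boundary (found through U because 4 ∈ S) and the second lies entirely inside u (found through V).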

  26. How to calculate U(u) and V(u)
       if m ∈ M̂(u·a) then U(u·a) = U(u) ∪ { |u·a| }
       else U(u·a) = U(u);
     Each update takes O(1) time. V(u) can be handled in the same way as in [DCC’98].
     [Figure: in the dictionary trie D, the node u stores U(u) and V(u); its child u·a, reached by the edge a, stores U(u·a) and V(u·a).]
     Total: O(|D|) time and space.
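Putting the pieces together, the whole scan can be sketched as below. This is one possible reading of the slides, not the authors' implementation, and several details are assumptions made here: alphabet characters are nodes 1..|Σ|, U(u) is stored reversed (bit m−j−1 for j ∈ U(u)) so that the S-dependent part of Output becomes a single AND, positions are added to U only when |u·a| < m as its definition requires, and V(u) is enumerated through a link to the nearest trie node whose phrase ends with the pattern, which may differ from the DCC’98 technique.

```python
def search_lzw(codes, alphabet, pattern):
    """Return sorted 1-based end positions of pattern in the decoded text."""
    m = len(pattern)
    full, accept = (1 << m) - 1, 1 << (m - 1)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # Per-node tables; node 0 is the root (empty string).
    parent, first, length = [0], [""], [0]
    m_hat, u_rev, v_link = [full], [0], [0]

    def add_node(p, ch):
        n = length[p] + 1
        mh = ((m_hat[p] << 1) | 1) & masks.get(ch, 0)
        hit = bool(mh & accept)          # m is in M-hat(u.a)
        new = len(parent)
        parent.append(p)
        first.append(first[p] or ch)
        length.append(n)
        m_hat.append(mh)
        # U stored reversed: bit (m - n - 1) means position n is in U.
        u_rev.append(u_rev[p] | (1 << (m - n - 1)) if hit and n < m else u_rev[p])
        # Nearest node (self or ancestor) whose phrase ends with the pattern.
        v_link.append(new if hit and n >= m else v_link[p])

    for ch in alphabet:
        add_node(0, ch)

    found, state, offset, prev = [], 0, 0, None
    for code in codes:
        if prev is not None:
            # Dictionary update: previous phrase + first char of this phrase.
            ch = first[code] if code < len(parent) else first[prev]
            add_node(prev, ch)
        # Occurrences that started before this phrase: (m - S) AND U(u).
        hits, i = state & u_rev[code], 1
        while hits:
            if hits & 1:
                found.append(offset + m - i)
            hits >>= 1
            i += 1
        # Occurrences entirely inside this phrase: V(u).
        x = v_link[code]
        while x:
            found.append(offset + length[x])
            x = v_link[parent[x]]
        k = length[code]
        state = (((state << k) & full) | ((1 << min(k, m)) - 1)) & m_hat[code]
        offset += k
        prev = code
    return sorted(found)


print(search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc", "aba"))  # [3, 5, 13, 18]
```

The bit-scanning loop over `hits` is written for clarity; reaching the O(r) bound per Lemma 2 would need a constant-time lowest-set-bit operation instead.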

  27. But... Is it really fast ? Uhmm.... -- Is this really practical? --

  28. Experimental comparisons
     ◆ Method 1: decompress the compressed text, then search it with Shift-And.
     ◆ Method 2: run our previous algorithm (DCC’98) directly on the compressed text.
     ◆ Method 3: run our new algorithm directly on the compressed text.

  29. Experimental comparisons
     Original text: "The Brown corpus", 6.8 Mbytes
     Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
     Language: C (with gcc compiler)
     Machine: Sun SPARCstation 20 with remote disk storage
     File transfer rate: 0.96 Mbyte/sec

  30. Experimental results
     Method                             CPU time (s)   Elapsed time (s)
     Shift-And with decompression       7.52           8.16
     Our previous algorithm (DCC’98)    6.57           7.31
     New algorithm                      5.15           6.05
     In CPU time, the new algorithm is about 1.5 times faster than Shift-And with decompression and about 1.3 times faster than our previous algorithm. Elapsed time = CPU time + file I/O time.

  31. Experimental results
     Method                             CPU time (s)   Elapsed time (s)
     Shift-And with decompression       7.52           8.16
     Our previous algorithm (DCC’98)    6.57           7.31
     New algorithm                      5.15           6.05
     Shift-And in original text         3.09           9.36
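The "1.5 times" and "1.3 times" figures can be checked directly from the reported timings. The short calculation below assumes that the first number in each row is the CPU time and the second the elapsed time; the dictionary keys and the helper `speedup` are names chosen here.

```python
# Timings in seconds, as listed on the slide (column assignment is an assumption).
cpu = {
    "Shift-And with decompression": 7.52,
    "Previous algorithm (DCC'98)": 6.57,
    "New algorithm": 5.15,
}
elapsed = {
    "Shift-And with decompression": 8.16,
    "Previous algorithm (DCC'98)": 7.31,
    "New algorithm": 6.05,
}


def speedup(times, baseline, target="New algorithm"):
    """How many times faster the target is than the baseline."""
    return times[baseline] / times[target]


for name in ("Shift-And with decompression", "Previous algorithm (DCC'98)"):
    print(f"{name}: CPU x{speedup(cpu, name):.2f}, elapsed x{speedup(elapsed, name):.2f}")
```

The CPU-time ratios come out near 1.46 and 1.28, which round to the quoted 1.5 and 1.3; the elapsed-time ratios are somewhat smaller.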

  32. Conclusion
     • The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
     • We implemented the algorithm, and showed that it is approximately 1.3 times faster than our previous algorithm.
     • Our new algorithm has several extensions:
       • generalized pattern matching
       • pattern matching with k mismatches
       • pattern matching for multiple patterns
