
An Online Algorithm for Finding the Longest Previous Factors

ESA 2008 @ Universität Karlsruhe, Sep 15, 2008. Daisuke Okanohara (University of Tokyo), Kunihiko Sadakane (Kyushu University).


Presentation Transcript


  1. ESA 2008 @ Universität Karlsruhe, Sep 15, 2008. An Online Algorithm for Finding the Longest Previous Factors. Daisuke Okanohara, University of Tokyo; Kunihiko Sadakane, Kyushu University

  2. Problem: Finding the longest previous factors (matching) • Input: a text T[0…n-1] • At every position k, report the longest substring T[k-len+1…k] that also occurs at an earlier position in the history: T[pos…pos+len-1] = T[k-len+1…k] • c.f. LZ77, LZ-factorization • Example outputs from the figure: (pos, len) = (0, 4) and (pos, len) = (5, 2)
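As a reference point for the problem statement, here is a brute-force sketch in Python. The function name `longest_previous_factors`, the 0-based positions without a sentinel, and the `(-1, 0)` return value for "no earlier match" are assumptions of this sketch, not part of the talk. It only pins down what is being asked; it is far slower than the index-based methods discussed in the slides.

```python
def longest_previous_factors(text):
    """For each end position k, return (pos, length) of the longest substring
    ending at k that also ends at some earlier position j < k.

    Overlapping occurrences are allowed; (-1, 0) means no match.
    """
    result = []
    for k in range(len(text)):
        best_pos, best_len = -1, 0
        for j in range(k):                      # candidate earlier end position
            length = 0
            # Extend the match leftwards from j and k simultaneously.
            while length <= j and text[j - length] == text[k - length]:
                length += 1
            if length > best_len:               # ties keep the leftmost occurrence
                best_pos, best_len = j - length + 1, length
        result.append((best_pos, best_len))
    return result
```

For the text `abaababaa` used later in the walkthrough, the last entry is `(0, 4)`: the suffix `abaa` already occurs at position 0.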

  3. Applications • Data Compression • LZ77, Prediction by Partial Matching • Pattern Analysis • Log analysis • Data Mining

  4. Previous approaches • Sequential search on the fly • O(n^2) time for a text of length n • Offline index approach • Read the whole text beforehand and build an index (suffix array/tree) for it • Search for the match using the index [Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99] • 6n bytes and O(n log n) time [Chen 08]: suffix arrays with range minimum queries

  5. New Problem: Finding the longest previous factors online • Report match information just after reading each character • Covers the case where we don’t know the length of the data beforehand, e.g. streaming data • Previous approaches cannot deal with this problem

  6. Our approach for the new problem • Online construction of enhanced prefix arrays • Update the index just after reading each character • Although many methods used in LZ77 cannot report the longest match, our method can • Succinct data structures • Keep all information very compactly, using about the same space as the original text

  7. Prefix arrays • Keep NOT suffix arrays (SA), but prefix arrays (PA) • because when a character is appended to the end of a text, SA may incur Ω(n) changes, but PA does not • In PA, prefixes are sorted in the reverse-lexicographic order • Example: T = aaaa, Tnew = aaaaz
     SA for T: 0 $, 1 a$, 2 aa$, 3 aaa$, 4 aaaa$
     SA for Tnew: 0 $, 4 aaaaz$, 3 aaaz$, 2 aaz$, 1 az$, 5 z$
     PA for T: 0 $, 1 $a, 2 $aa, 3 $aaa, 4 $aaaa
     PA for Tnew: 0 $, 1 $a, 2 $aa, 3 $aaa, 4 $aaaa, 5 $aaaaz
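The aaaa / aaaaz example can be replayed with a naive construction. This is only a sketch: `suffix_array`, `prefix_array` and `restrict` are hypothetical helper names, both arrays are built by plain sorting, and `$` is assumed to compare smaller than every text character (true for ASCII letters).

```python
def suffix_array(text):
    """Start positions of the suffixes of text + '$' in lexicographic order."""
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def prefix_array(text):
    """End positions of the prefixes of '$' + text, compared right to left."""
    t = "$" + text
    return sorted(range(len(t)), key=lambda i: t[:i + 1][::-1])


def restrict(order, limit):
    """Keep only the entries below limit, preserving their relative order."""
    return [i for i in order if i < limit]


# Relative order of the four old suffixes before and after appending 'z'.
old_sa = restrict(suffix_array("aaaa"), 4)
new_sa = restrict(suffix_array("aaaaz"), 4)
# Relative order of the five old prefixes before and after appending 'z'.
old_pa = prefix_array("aaaa")
new_pa = restrict(prefix_array("aaaaz"), 5)

print(old_sa != new_sa)   # True: every old suffix changed its relative rank
print(old_pa == new_pa)   # True: old prefixes keep their order, one row is added
```

Appending a character changes every suffix but leaves every existing prefix untouched, which is why only one new row has to be placed in the prefix array.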

  8. Our idea • Weiner’s suffix tree construction algorithm • Inserts the suffixes starting from the shortest ones • Modify it to insert prefixes starting from the shortest ones • A similar idea is used for the incremental construction of compressed suffix arrays [Chan et al. 2007], [Lippert 2005] • We extend this work to the succinct version • Our algorithm reports matching information as a by-product of construction • It does not require a tree representation; we just use array information

  9. Preliminary: Dynamic Rank/Select Dictionary (DRSD) • For a text T[0…n-1], DRSD supports: • rank(T, c, i): return the number of occurrences of c in T[0…i] • select(T, c, i): return the position of the i-th c in T • insert(T, c, i): insert c at T[i] • delete(T, i): delete T[i] • These operations can be supported in O(log n) time if σ < log n, using n log σ + o(n log σ) bits of space, where σ is the alphabet size [Lee et al. 07]
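To make the interface concrete, the following is a deliberately naive stand-in backed by a Python list. The class name `NaiveDRSD` and the choice that `select` counts occurrences from 1 and returns -1 when there is no such occurrence are assumptions of this sketch. Every operation takes linear time here, so it illustrates only the semantics, not the succinct structure cited on the slide.

```python
class NaiveDRSD:
    """List-backed rank/select dictionary with insert and delete (O(n) each)."""

    def __init__(self, symbols=""):
        self.data = list(symbols)

    def rank(self, c, i):
        """Number of occurrences of c in data[0..i] (inclusive); 0 if i < 0."""
        return self.data[:i + 1].count(c) if i >= 0 else 0

    def select(self, c, i):
        """Position of the i-th occurrence of c (i counts from 1), or -1."""
        seen = 0
        for pos, sym in enumerate(self.data):
            if sym == c:
                seen += 1
                if seen == i:
                    return pos
        return -1

    def insert(self, c, i):
        """Insert c so that it becomes data[i]."""
        self.data.insert(i, c)

    def delete(self, i):
        """Remove data[i]."""
        del self.data[i]
```

A quick usage example: with `d = NaiveDRSD("abbab$aaa")`, `d.rank("a", 4)` counts the two occurrences of `a` in the first five symbols, and `d.select("a", 3)` returns 6.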

  10. Preliminary: Range Minimum Query (RMQ) • Given an array E[0…n-1] of elements from a totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l…r] • i.e. rmq(E, l, r) = argmin_{k∈[l, r]} E[k] • in case of a tie, return the leftmost such element • In the static case, RMQ can be supported in O(1) time using 2n+o(n) bits of space [Fischer, 2007] • In the dynamic case, RMQ/insert/delete can be supported in O(T log n) time using O(n) bits if the lookup cost (E[i]) is O(T)
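The query itself is simple to state in code. This linear-scan sketch only fixes the semantics, including the leftmost-on-tie rule; it is not the constant-time or dynamic structure referred to on the slide.

```python
def rmq(E, l, r):
    """Index of the smallest element in E[l..r] (inclusive), leftmost on ties."""
    best = l
    for k in range(l + 1, r + 1):
        if E[k] < E[best]:        # strict comparison keeps the leftmost minimum
            best = k
    return best
```

For example, `rmq([3, 1, 2, 1], 0, 3)` returns 1, the leftmost of the two positions holding the minimum value 1.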

  11. Data structures • Keep the following data structures for T[0…k] • Assume T[0]=$, where $ is the unique smallest character • B[0…k]: (Prefix-) BW-transformed text • B[i] = T[PA[i]+1], and B[i] = $ if PA[i]=k • H[0…k]: Height array • explained in the following slides • C[0…σ-1]: Cumulative array, where σ is the alphabet size • C[c] = the total number of characters c’ in T s.t. c’ < c • s: The position for the next prefix to be inserted

  12. T = $abaababa

  13. T = $abaababa • PA stores the end position of each prefix (we will omit this) • Prefix stores the prefixes sorted in the reverse-lexicographic order (neither PA nor Prefix is stored explicitly) • We can examine PA[i] with the SAlookup operation in O(log^2 n) time, as in the FM-index [Ferragina 00]

  14. B stores the next character for each prefix (the Burrows-Wheeler transform for prefix arrays) T = $abaababa

  15. H stores the length of the longest common suffix between adjacent prefixes T = $abaababa

  16. T = $abaababa, s = 4 • s denotes the position of $ in B, which is where the longest prefix is placed

  17. T = $abaababa • C[c] = the number of characters c’ in T (= B) that are smaller than c • C[$]=0, C[a]=1, C[b]=6
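The arrays introduced on the last few slides can be rebuilt naively for the running example. This is a sketch under stated assumptions: `build_naive` is a hypothetical helper, PA is materialised by sorting (the talk never stores it), H[i] is taken to be the longest common suffix of the prefixes in rows i and i+1 with a trailing 0, and rows are numbered from 0, so individual index values need not coincide with those quoted from the figures.

```python
def build_naive(text):
    """Return (PA, B, H, C, s) for a text that starts with the sentinel '$'."""
    n = len(text)
    # Rows are prefixes text[0..p], sorted by comparing them right to left.
    pa = sorted(range(n), key=lambda p: text[:p + 1][::-1])
    # B[i]: the character following the prefix in row i ('$' for the longest one).
    b = [text[p + 1] if p + 1 < n else "$" for p in pa]

    def common_suffix(i, j):
        """Length of the longest common suffix of text[0..i] and text[0..j]."""
        length = 0
        while length <= min(i, j) and text[i - length] == text[j - length]:
            length += 1
        return length

    # H[i]: longest common suffix of rows i and i+1; the last row gets 0.
    h = [common_suffix(pa[i], pa[i + 1]) for i in range(n - 1)] + [0]
    # C[c]: number of characters in the text that are smaller than c.
    c = {ch: sum(1 for x in text if x < ch) for ch in sorted(set(text))}
    s = pa.index(n - 1)           # row holding the longest prefix
    return pa, b, h, c, s


pa, b, h, c, s = build_naive("$abaababa")
print("".join(b))   # abbab$aaa
print(c)            # {'$': 0, 'a': 1, 'b': 6}
```

The cumulative counts agree with the values listed on the slide (C[$]=0, C[a]=1, C[b]=6).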

  18. T = $abaababaa • The next character `a’ arrives!

  19. T = $abaababaa • Replace $ in B[s] with a (because $ is placed at the position of the longest prefix)

  20. T = $abaababaa • Find the position for the new prefix $abaababaa • Count the number of occurrences of a in B[0…s-1]: rank(B, a, s-1) = 2

  21. T = $abaababaa • Insert $abaababaa at the 3rd position among the prefixes ending in a: C[a] + rank(B, a, s-1) = 3 • s := C[a] + rank(B, a, s-1), then insert(B, $, s)

  22. T = $abaababaa • Update H • The new H value is actually the length of the longest match in the history

  23. T = $abaababa • Recall that in the previous step, $abaa and $aba come from the prefixes whose B is `a’ • These positions can be found by using rank and select • c.f. succ(T, c, s) = select(T, c, rank(T, c, s))
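A small sketch of how the neighbouring rows with a given B character can be located with rank and select alone. The function names `pred` and `succ`, the inclusive rank, the 1-based occurrence count in select, and the -1 return value are conventions chosen for this sketch; the slide's own formula may rely on a slightly different indexing convention.

```python
def rank(B, c, i):
    """Number of occurrences of c in B[0..i] (inclusive); 0 when i < 0."""
    return B[:i + 1].count(c) if i >= 0 else 0


def select(B, c, i):
    """Index of the i-th occurrence of c in B (i counts from 1), or -1."""
    seen = 0
    for pos, sym in enumerate(B):
        if sym == c:
            seen += 1
            if seen == i:
                return pos
    return -1


def pred(B, c, s):
    """Last row strictly before s whose B character is c, or -1."""
    return select(B, c, rank(B, c, s - 1))


def succ(B, c, s):
    """First row strictly after s whose B character is c, or -1."""
    return select(B, c, rank(B, c, s) + 1)
```

With B = `abbab$aaa` as a list and s = 5 (0-based rows), `pred(B, "a", 5)` is 3 and `succ(B, "a", 5)` is 6.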

  24. T = $abaababa • RMQ(H, 4, 6) = 5, H[5] = 0 • Therefore H[RMQ(H, 4, 6)] + 1 = 1 is the new value for the next H entry

  25. T = $abaababa • RMQ(H, 3, 3) = 3, H[3] = 3 • Therefore H[RMQ(H, 3, 3)] + 1 = 4 is the new value for the next H entry

  26. T = $abaababaa • New H entries: H[rmq(H, 3, 3)] + 1 = 4 and H[rmq(H, 4, 6)] + 1 = 1

  27. T = $abaababaa • Report max(4, 1) = 4 as the length of the longest factor, and report the position of $abaa as SAlookup[2] - len = 0 • Report (pos=0, len=4) as the maximal match

  28. Overall algorithm All operations are rank, select, RMQ
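Putting the steps of the walkthrough together, one possible end-to-end sketch looks as follows. It is not the authors' implementation: plain Python lists replace the succinct dynamic structures, PA is stored explicitly so that positions can be reported without an SAlookup, H[i] is assumed to hold the longest common suffix of rows i and i+1, rows are 0-based, and `online_lpf` is a hypothetical name. Every step costs linear time here instead of the polylogarithmic bounds of the talk.

```python
def online_lpf(text):
    """Yield (pos, length) right after each character of text is read.

    pos is 0-based in text (no sentinel); (-1, 0) means no earlier match.
    Assumes every character of text compares greater than '$'.
    """
    B, H, PA = ["$"], [0], [0]        # state for T = "$"
    counts = {"$": 1}                 # occurrences of each character in T
    s = 0                             # row of the longest prefix
    for k, c in enumerate(text, start=1):      # c becomes T[k]
        B[s] = c                               # replace '$' by the new character
        below = [i for i in range(s) if B[i] == c]
        above = [i for i in range(s + 1, len(B)) if B[i] == c]
        # Common suffix with each future neighbour: 1 + a range minimum over H.
        len_pred = 1 + min(H[below[-1]:s]) if below else 0
        len_succ = 1 + min(H[s:above[0]]) if above else 0
        smaller = sum(n for ch, n in counts.items() if ch < c)   # C[c]
        new_s = smaller + len(below)           # C[c] + rank(B, c, s-1)
        if len_pred == 0 and len_succ == 0:
            yield (-1, 0)
        elif len_pred >= len_succ:
            yield (PA[below[-1]] + 1 - len_pred, len_pred)
        else:
            yield (PA[above[0]] + 1 - len_succ, len_succ)
        H[new_s - 1] = len_pred                # predecessor row vs. new row
        H.insert(new_s, len_succ)              # new row vs. successor row
        B.insert(new_s, "$")                   # the new longest prefix gets '$'
        PA.insert(new_s, k)
        counts[c] = counts.get(c, 0) + 1
        s = new_s
```

Feeding `abaababaa` one character at a time, the last reported pair is `(0, 4)`, which agrees with the (pos=0, len=4) result of the walkthrough.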

  29. Overall Analysis • H is stored in 2n bits [Sadakane, SODA 02] • a naïve representation requires O(n log n) bits • requires one SA lookup operation to decode • B is stored in n log σ + o(n log σ) bits • by using a dynamic rank/select dictionary • The bottleneck of our algorithm is rmq(H, l, r), which requires O(log^3 n) time • SAlookup requires O(log^2 n) time

  30. Overall Analysis (cont.) • We can solve the online longest previous factor problem in O(log^3 n) time per character, using n log_2 σ + o(n log σ) + O(n) bits of space • where σ is the alphabet size and n is the length of the text

  31. Simulating window buffer • If the working space is limited, we often discard the history starting from the oldest characters • We can simulate this using almost the same operations as in the insertion operation • We do not actually discard a character but ignore it • If we actually discarded the oldest character, it may cause Ω(n) changes in B and H • The effect of the discarded characters remains (prefixes are still sorted according to the discarded characters) • But this does not cause a problem if we only report the matching information up to the history size

  32. Experiments • In the experiments, we used a simpler data structure (the algorithm is the same) • B and H are stored in a balanced binary tree • Each leaf stores a small block of B and H • We call this implementation OS • Compare OS with other offline algorithms • These require reading the whole text beforehand • CPSa, CPSd: SA+LCP with stack [Chen et al. 07] • CPS6n: SA with RMQ [Chen et al. 08] • kk-lz: mreps, specialized for σ=4 [Kolpakov 01]

  33. Peak memory usage in bytes per input symbol • The space of OS is the smallest on many real data sets, especially when the values in H are small

  34. Runtime in milliseconds for searching the longest previous factors • OS is about 2 to 10 times slower than the fastest ones due to the dynamic operations

  35. Conclusion • Solved the online longest matching problem by using enhanced prefix arrays • Simple and easy to implement • Requires about 3 to 6 times the space of the input text • Actually this is a by-product of the construction of compressed suffix trees, c.f. Weiner’s algorithm • Simple, with much room for improvement • by using better rank/select/rmq implementations

  36. Future work • Construction of compressed suffix trees • Update the parenthesis tree efficiently • Actually, the time complexity for this is smaller • Practical improvements • Currently, dynamic succinct data structures are not efficient due to cache misses and memory fragmentation • Approximate version of the longest matching problem, which is enough for many applications • Thank you for your attention!

  37. Weiner’s suffix tree construction algorithm (figure: trees built step by step over the prefixes of $abracadabr)
