An Online Algorithm for Finding the Longest Previous Factors

ESA2008@Universitat Karlsruhe, Sep 15, 2008 An Online Algorithm for Finding the Longest Previous Factors Daisuke OkanoharaUniversity of Tokyo Kunihiko SadakaneKyushu University

Problem: Finding the longest previous factors (matching) • Input : A text T[0…n-1] • At all position k, report the longest substring T[k-len+1…k] that also occurs in previous positions (history)T[pos…pos+len-1] = T[k-len+1…k] • c.f. LZ77, LZ-factorization (pos, len) = (0, 4) (pos, len) = (5, 2)

Applications • Data Compression • LZ77, Prediction by Partial Matching • Pattern Analysis • Log analysis • Data Mining

Previous approach • Sequential search on the fly • O(n2) time for a text of length n • Offline- Index approach • Read an whole text beforehand, and build an index (suffix array/trees) for it. • Search the match using the index [Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99] • 6n bytes, and O(n log n) time [Chen 08]Suffix Arrays with Range Minimum Query

New Problem: Online finding the longest previous factors • Report match information just after reading each character • A case where we don’t know the length of data beforehand, e.g. streaming data • Previous approaches cannot deal with this problem

Our approach for new problem • Online construction of enhanced prefixarrays • Update an index just after reading each character • Although many methods used in LZ77 cannot report the longest match, our method can. • Succinct data structures • Keep all information very compactly; using about the same space for an original text

Prefix arrays • Keep NOTsuffix arrays (SA), but prefix arrays (PA) • because when a character is added at the last of a text, SA may cause W (n) changes, but PA not • In PA, prefixes are sorted in the reverse-lexicographic order Tnew=aaaaz T=aaaa SA for T SA for Tnew PA for T PA for Tnew 0 $ 1 a$ 2 aa$ 3 aaa$ 4 aaaa$ 0 $ 4 aaaaz$ 3 aaaz$ 2 aaz$ 1 az$ 5 z$ 0 $ 1 $a 2 $aa 3 $aaa 4 $aaaa 0 $ 1 $a 2 $aa 3 $aaa 4 $aaaa 5 $aaaaz

Our idea • Weiner’s suffix tree construction algorithm • Insert the suffixes from the shortest ones • Modify it to the insert prefixes form the shortest ones • Similar idea is used for the incremental construction of compressed suffix arrays [Chan, et. al 2007], [Lippert 2005] • We extend this work to the succinct version • Our algorithm reports matching information as a by-product of construction • Do not require tree representation, we just use array information

Preliminary: Dynamic Rank/Select Dictionary (DRSD) • For an text T[0…n-1], DRSD supports: • rank(T, c, i): return the number of c in T[0…i] • select(T, c, i): return the position of i-th c in T • insert(T, c, i): insert c at T[i] • delete(T, i): delete T[i] • These operations can be supported in time (O(logn) time if s < logn), bits space where sis the alphabet size [Lee, et. al. 07],

Preliminary: Range Minimum Query (RMQ) • Given an array E[0…n-1] of elements from totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l…r] • i.e. rmq(E, l, r) = argmink∈[l, r]E[k] • return the leftmost such element in the tie • In the static case, RMQ can be supported in O(1) time using 2n+o(n) bits space [Fischer, 2007] • In the dynamic case, RMQ/insert/delete can be supported in O(Tlogn) time using O(n) bits if the lookup cost (E[i]) is O(T)

Data structures • Keep the following data structures for T[0…k] • Assume T[0]=$, $ is the unique smallest character • B[0…k]: (Prefix-) BW-transformed Text • B[i] = T[PA[i]+1] and B[i] = $ if PA[i]=k • H[0…k]: Height Array • will be explained in the next slide • C[0…s-1] : Cumulative Array • C[c] = the total number of characters c’ s.t. c’ < c in T • s: The position for the next prefix to be inserted

T = $abaababa

T = $abaababa PA stores the end position of each prefix(we will omit this) Prefix stores prefixes sorted in the reverse-lexicographic order (Neither PA nor prefix are stored explicitly) We can examine PA[i] by using SAlookupoperation using O(log2n) time as in FM-index[Ferragina 00]

B stores the next character for each prefix(Burrows Wheeler’s transform for prefix arrays) T = $abaababa

H stores the length of the longest common suffix between adjacent prefixes T = $abaababa

T = $abaababa s = 4 s denotes the position where $ in B, andthe longest prefix is placed.

T = $abaababa C[c] = the number of characters c’ that is smaller than c in T(=B) C[$]＝０ C[a]＝1 C[b]＝6

T = $abaababaa The next character `a’comes !

T = $abaababaa Replace $ in B[s] with a (because $ is placed in the position of the longest prefix) a

T = $abaababaa Find the position for the new prefix $abaababaa Count the number of a in B[0…s-1] = rank(B, a, s-1) = 2

T = $abaababaa Insert $abaababaa at 3rd position in aC[a]+rank(B, a, s-1) =3 s := C[a]+rank(B, a, s-1), insert(B, s, $)

T = $abaababaa Update H This is actually the length of the longest match in the history

T = $abaababa Recall that in the previous step, $abaaand $aba are placed in the prefixes whose B is `a’ These positions can be found by using rank and select c. f. succ(T, `c’, s) = select(T, c, rank(T, s, c))

T = $abaababa RMQ（H, 4, 6) = 5, H[5] = 0 Therefore RMQ（H, 4, 6) + 1 is the new value for the next H entry

T = $abaababa RMQ（H, 3, 3) = 3, H[3] = 3 Therefore RMQ（H, 3, 3) + 1 is the new value for the nextH entry

T = $abaababaa rmq(H, 3, 3) + 1 rmq(H, 4, 6) + 1

T = $abaababaa Report max(4, 1) = 4as the length of thelongest factor and report the positionof $abaa as SAlookup[2]- len = 0 Report (pos=0, len=4) as the max. matching

Overall algorithm All operations are rank, select, RMQ

Overall Analysis • H is stored in 2n bits [Sadakane , Soda 02] • naïve representation requires O(n log n) bits • requires one SA lookup operation to decode • B is stored in nlogs+ o(nlogs) bits • by using dynamic rank/select dictionary • The bottleneck of our algorithm is rmq(H, I, r) which requires O(log3n) time • SAlookuprequires O(log2n) time

Overall Analysis (cont.) • We can solve the online longest previous factor problem in O(log3n) time for each character, using nlog2s+ o(nlogs) + O(n) bits of space • where s is the alphabet size, and n is the length of a text

Simulating window buffer • If the working space is limited, we often discards the history from the oldest ones • We can simulate this by using the almost the same operations as in the insertion operation • We actually do not discard a character but ignore it • If we actually discard an oldest character , it may cause W(n) changes in B and H • The effect of discarded character is remained (prefixes are sorted according to the discarded characters) • But this does not cause the problem if we only report the matching information up to the history size

Experiments • In experiment, we used a simpler data structure (algorithm is same) • B and H is store in the balanced binary tree • Each leaf stores the small block of B and H • We call this implementation as OS • Compare OS with otheroffline algorithms • Require to read the whole text beforehand • CPSa, CPSd: SA+LCP with stack [Chen, et. al. 07] • CPS6n: SA with RMQ [Chen, et. al. 08] • kk-lz: mreps, specialized for σ=4 [Kolpakov 01]

Peak memory usage in bytes per input symbol • The space of OS is smallest in many real data especially when the values in H is small

Runtime in milliseconds for searching the longest previous factors • OS is about 2～10 times slower than the fastest ones due to the dynamic operations

Conclusion • Solve online longest matching problem by using enhanced prefix arrays • Simple and easy to implement • Require about 3～6 times space of the input text • Actually this is a by-product of construction of compressed suffix trees c.f. Weiner’s algorithm • Simple; and much room for improvements • by using better rank/select/rmq implementation

Future work • Construction of compressed suffix trees • Update the parenthesis tree efficiently • Actually, the time complexity for this is smaller • Practical improvements • Currently, dynamic succinct data structure is not efficient due to cache misses, and memory fragmentation • Approximated version of longest matching problem; enough for many application Thank you for you attention !

Weiner’s suffix tree’s construction alg. $a $abraca $abracada $abra $ab $abracadab $abracad $abr $a $abraca $abracada $abra $ab $abracad $abr $a $abraca $abracada $abra $ab $abracadab $abracad $abr $abracadabr . a a ba $ $ $abr $abr $ab $ $abracad $abrac $abr $abracad abr $abrac $abr a $abrac ab $abrac $abracada $ $ $abracad $ $abrac $abr $abracad $abrac $abracada

An Online Algorithm for Finding the Longest Previous Factors

An Online Algorithm for Finding the Longest Previous Factors

Presentation Transcript

An Algorithm for Bootstrapping Communications

An Adaptive Algorithm for Finding Frequent Sets in Landmark Windows

A Fast Multiple Longest Common Subsequence (MLCS) Algorithm

An Efficient Online Algorithm for Hierarchical Phoneme Classification

Output Sensitive Algorithm for Finding Similar Objects

An Iterative Algorithm for Finding the Minimum Sampling Frequency For Two Bandpass Signals

Experimenting an approximation algorithm for the LCS

An Algorithm for the Consecutive Ones Property

An Optimal Algorithm for Online Square Detection

AlgoViz : An Online Educational Community for Algorithm Visualizations

An Algorithm for: Explaining Algorithms

An almost linear time and linear space algorithm for the longest common subsequence problem

Root-Finding Algorithm

An efficient algorithm for finding double-vertex dominators in circuit graphs

The longest night

Factors for considering an online program - KNU Open University

Finding the Best Deals for Online Furniture

Finding Houses for Sale Online

Building an Online Educational Community for Algorithm Visualization

The Longest Memory

Numerical on Longest Path Matrix (LPM) Algorithm

Tips to Finding an Online Pharmacy