
Compressed Index for a Dynamic Collection of Texts



Presentation Transcript


  1. Compressed Index for a Dynamic Collection of Texts H.W. Chan, W.K. Hon, T.W. Lam The University of Hong Kong

  2. Problem Definition • Given a collection L = {T1, T2, …, Tk} of texts of total length n over an alphabet Σ • We want to build an index for L such that, given any pattern P, the occurrences of P in each Ti can be found quickly • The index should also support fast insertion of a text Ti into L and fast deletion of Ti from L

  3. Previous Work & Our Result

  4. Two Basic Tools: CSA, FM-index • Definition 1: The main component of the CSA for a text T is a function Ψ such that Ψ[i] = SA⁻¹[SA[i] + 1], where SA[i] is the i-th entry of the suffix array and SA⁻¹ is the inverse of SA
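To make Definition 1 concrete, here is a minimal Python sketch that builds SA, SA⁻¹ and Ψ naively for one short text. It uses 0-based indices (the slide is 1-based), assumes the text ends with a unique smallest sentinel `$`, and lets Ψ wrap around for the sentinel suffix; the function names and the wrap-around convention are assumptions for illustration, not taken from the slides.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_function(text):
    """Return (sa, psi) with psi[i] = SA^-1[SA[i] + 1], using 0-based indices.

    Assumption: text ends with a unique sentinel '$' that is smaller than
    every other character. For the sentinel suffix, SA[i] + 1 wraps to 0.
    """
    n = len(text)
    sa = suffix_array(text)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank          # inverse[pos] = rank of suffix at pos
    psi = [inverse[(sa[i] + 1) % n] for i in range(n)]
    return sa, psi


sa, psi = psi_function("banana$")
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(psi)  # [4, 0, 5, 6, 3, 1, 2]
```

Within the ranks whose suffixes start with the same character (for example ranks 1 to 3 for `a`), the Ψ values 0, 5, 6 are increasing, which is the property the later slides rely on.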

  5. Two Basic Tools: CSA, FM-index • Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT such that BWT[i] = T[SA[i] − 1]. The main component of the FM-index is a set of |Σ| functions count_c, one for every c ∈ Σ, such that count_c[i] = number of occurrences of c in BWT[1…i]
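Definition 2 can be illustrated the same way. The sketch below builds BWT and every count_c as explicit arrays for a short text; it is 0-based, so `counts[c][i]` is the number of c in `bwt[0..i]`, and it assumes a trailing sentinel `$` so that `T[SA[i] − 1]` can wrap around for the suffix starting at position 0. The function name is hypothetical, and a real FM-index would not store these arrays uncompressed.

```python
def bwt_and_counts(text):
    """Return (bwt, counts) for a text that ends with the sentinel '$'.

    bwt[i] = text[sa[i] - 1]; Python's text[-1] gives the wrap-around.
    counts[c][i] = number of occurrences of c in bwt[0..i] (inclusive).
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[pos - 1] for pos in sa)

    counts = {c: [] for c in sorted(set(text))}
    running = {c: 0 for c in counts}
    for ch in bwt:
        running[ch] += 1
        for c in counts:
            counts[c].append(running[c])
    return bwt, counts


bwt, counts = bwt_and_counts("banana$")
print(bwt)          # annb$aa
print(counts["a"])  # [1, 1, 1, 1, 1, 2, 3]
print(counts["n"])  # [0, 1, 2, 2, 2, 2, 2]
```

Each count_c is a non-decreasing sequence, which is what makes it a candidate for difference encoding.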

  6. Our Index • Our index is a dynamic version of CSA + FM-index for the concatenated text T1T2…Tk • We exploit a property shared by Ψ and count: each of them is essentially a small number of sequences of increasing values

  7. Our Index • Maintaining a dynamic CSA and FM-index amounts to maintaining dynamic sequences of increasing values • Observation 3: A balanced search tree is good for a dynamic sequence • Observation 4: Difference encoding of increasing values can save space
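Observation 4 can be seen on a toy sequence. The sketch below stores an increasing sequence as gaps between consecutive values and compares a rough bit count; `bits_needed` is a hypothetical stand-in for a real variable-length code (the slides do not say which code is used), so the numbers only indicate the trend.

```python
from itertools import accumulate


def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    gaps = []
    previous = 0
    for value in values:
        gaps.append(value - previous)
        previous = value
    return gaps


def delta_decode(gaps):
    """Recover the original values as prefix sums of the gaps."""
    return list(accumulate(gaps))


def bits_needed(numbers):
    """Rough cost: binary length of each number, at least 1 bit each."""
    return sum(max(1, x.bit_length()) for x in numbers)


values = [1000, 1003, 1004, 1010, 1025]
gaps = delta_encode(values)
print(gaps)                                    # [1000, 3, 1, 6, 15]
print(delta_decode(gaps) == values)            # True
print(bits_needed(values), bits_needed(gaps))  # 51 20
```

The price of the smaller representation is that reading one value requires summing the gaps before it, which is why the gaps need a supporting structure to keep access fast.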

  8. Our Index • Combining Observations 3 and 4 gives a Differential Balanced Search Tree that handles the values in the dynamic CSA and FM-index • Drawback: computation of Ψ and count is slowed down by an O(log n) factor • Pattern matching: O(|P| log n + occ log² n) time
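One way to picture a search tree over differences is sketched below. It is not the authors' data structure: a randomized treap (balanced only in expectation) stands in for the balanced search tree, gaps are kept as plain integers rather than compressed, and the class and method names are hypothetical. Each node stores one gap plus the size and gap total of its subtree, so the i-th value is a prefix sum found along one root-to-node path.

```python
import random


class Node:
    def __init__(self, gap):
        self.gap = gap                  # difference from the previous value
        self.priority = random.random()
        self.left = None
        self.right = None
        self.size = 1                   # number of nodes in this subtree
        self.total = gap                # sum of gaps in this subtree


def _size(t):
    return t.size if t else 0


def _total(t):
    return t.total if t else 0


def _pull(t):
    t.size = 1 + _size(t.left) + _size(t.right)
    t.total = t.gap + _total(t.left) + _total(t.right)


def _split(t, k):
    """Split t into (first k nodes, remaining nodes)."""
    if t is None:
        return None, None
    if _size(t.left) >= k:
        a, b = _split(t.left, k)
        t.left = b
        _pull(t)
        return a, t
    a, b = _split(t.right, k - _size(t.left) - 1)
    t.right = a
    _pull(t)
    return t, b


def _merge(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if a.priority > b.priority:
        a.right = _merge(a.right, b)
        _pull(a)
        return a
    b.left = _merge(a, b.left)
    _pull(b)
    return b


class DifferentialSequence:
    """Increasing sequence stored as gaps in an implicit treap."""

    def __init__(self):
        self.root = None

    def __len__(self):
        return _size(self.root)

    def get(self, i):
        """Value at index i = sum of gaps 0..i, found on one downward path."""
        t, acc = self.root, 0
        while t is not None:
            left_size = _size(t.left)
            if i < left_size:
                t = t.left
                continue
            acc += _total(t.left) + t.gap
            if i == left_size:
                return acc
            i -= left_size + 1
            t = t.right
        raise IndexError("index out of range")

    def insert(self, i, value):
        """Make `value` element i; later values keep their old values."""
        left, right = _split(self.root, i)
        gap = value - _total(left)
        if right is not None:
            first, rest = _split(right, 1)
            first.gap -= gap            # successor is now relative to value
            _pull(first)
            right = _merge(first, rest)
        self.root = _merge(_merge(left, Node(gap)), right)


seq = DifferentialSequence()
for v in [3, 7, 12, 20]:
    seq.insert(len(seq), v)
seq.insert(2, 9)
print([seq.get(i) for i in range(len(seq))])  # [3, 7, 9, 12, 20]
```

Both `get` and `insert` walk a single path of the tree, which matches the logarithmic slowdown the slide mentions, here only in expectation because of the treap.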

  9. Insertion & Deletion (sketch idea) • Insertion corresponds to finding update points in the increasing sequences of Ψ and count • To insert a text T into L, there are O(|T|) such update points • Update points can be found by simulating a pattern matching query of T against L • Total time: O(|T| log n)
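The slides do not spell out the query that insertion simulates. As a reference point, the following sketch shows the standard FM-index backward search on a static text, which narrows a range of suffix ranks by one pattern character per step using count. It is a naive illustration under assumed names (`backward_search`, `first`), it scans BWT directly instead of using a compressed count structure, and it does not perform any update.

```python
def backward_search(text, pattern):
    """Count occurrences of pattern in text (text must end with '$').

    Maintains a half-open range [lo, hi) of suffix ranks whose suffixes
    start with the part of the pattern processed so far.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[pos - 1] for pos in sa]

    # first[c] = number of characters in text that are smaller than c
    first, smaller = {}, 0
    for c in sorted(set(text)):
        first[c] = smaller
        smaller += text.count(c)

    def count(c, i):
        """Number of occurrences of c in bwt[0:i] (naive linear scan)."""
        return bwt[:i].count(c)

    lo, hi = 0, len(text)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + count(c, lo)
        hi = first[c] + count(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(backward_search("banana$", "ana"))  # 2
print(backward_search("banana$", "nan"))  # 1
print(backward_search("banana$", "x"))    # 0
```

Each step touches the sequences at two positions, so a query for a text T visits O(|T|) positions in total, consistent with the number of update points stated on the slide.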

  10. Insertion & Deletion (sketch idea) • Deletion reverses the insertion process • Update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query • Total time: O(|T| log n)
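Why iterating Ψ reaches every position of a text can be checked on a single static text. The sketch below starts at the rank of the whole text and follows Ψ, visiting the rank of each suffix in text order. How the index locates the starting rank of a text inside the collection is not given on the slides, so here it is simply read from SA⁻¹; the function name is hypothetical.

```python
def ranks_in_text_order(text):
    """Follow psi from the rank of suffix 0 and list the ranks visited.

    Assumes text ends with a unique smallest sentinel '$'.
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    psi = [inverse[(sa[i] + 1) % n] for i in range(n)]

    visited = []
    rank = inverse[0]            # rank of the suffix starting at position 0
    for _ in range(n):
        visited.append(rank)
        rank = psi[rank]         # move to the suffix one position later
    return visited


print(ranks_in_text_order("banana$"))  # [4, 3, 6, 2, 5, 1, 0]
```

The visited ranks are exactly the entries that belong to the text, one per character, so a deletion has |T| positions to clean up.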

  11. Conclusion, Progress & Future Work • In the literature, there is a dual problem called Dictionary Management, which maintains a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query. Fast insertion/deletion of patterns is also required • O(n) bits: some progress …

  12. Conclusion, Progress & Future Work • There is another problem called Dynamic Text, which maintains a single text T, and when a pattern P is given later, it supports finding all occurrences of P in T. The text T is subject to insertion/deletion of substrings. • O(n log n) bits: Sahinalp & Vishkin, FOCS’96 • O(n) bits: ??
