
Compressed Index for Dictionary Matching

Compressed Index for Dictionary Matching. WK Hon (NTHU) , TW Lam (HKU) , R Shah (LSU) , SL Tam (HKU) , JS Vitter (Purdue). Outline. Dictionary Matching Problem Summary of Results Description of Our Solution (Brief): Based on (I) Suffix Tree


Presentation Transcript


  1. Compressed Index for Dictionary Matching • WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)

  2. Outline • Dictionary Matching Problem • Summary of Results • Description of Our Solution (Brief): Based on (I) Suffix Tree (II) A Simple Sampling Idea (III) Handling Irregularities • Open Problems

  3. Dictionary Matching • Input: A set of d short patterns { P1, P2, …, Pd } of total length n • Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each Pj, all positions in T where it occurs
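The problem statement can be pinned down with a brute-force reference that uses no index at all. This is only a sketch of the expected output, not of any method from the slides; the function name `find_all` and the sample text are illustrative assumptions.

```python
def find_all(patterns, text):
    """Return {pattern: [start positions in text]} by brute force (no index)."""
    result = {}
    for p in patterns:
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            # Advance by one so that overlapping occurrences are also reported.
            start = text.find(p, start + 1)
        result[p] = positions
    return result


print(find_all(["ate", "hat", "chat"], "chatehat"))
# {'ate': [2], 'hat': [1, 5], 'chat': [0]}
```

Any index for dictionary matching should report the same positions, only faster than rescanning the text once per pattern.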

  4. Dictionary Matching • Relevant parameters for measuring the index's performance: • d = number of patterns • n = total length of patterns • |T| = length of T • s = size of the alphabet of T and the patterns • occ = total number of occurrences in the search result

  5. Summary of Results • [Table not recovered; surviving annotations: "optimal"; ε = constant in (0,1); |patterns| + o(n log s)]

  6. Existing Solution I: Patricia Trie • Compact trie storing all d patterns • [Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]

  7. Existing Solution I: Patricia Trie • Advantage: Space: |patterns| + O( d log n ) bits, a very small overhead in addition to the input patterns

  8. Existing Solution I: Patricia Trie • Searching strategy: for each position k in T, match T from the root starting at k, and report the occurrence of any Pj found • Disadvantage: Searching takes worst-case O(|T|n + occ) time
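The restart-from-the-root strategy can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a plain, uncompacted trie of nested dicts in place of a Patricia trie, and the names `build_trie`, `match_from_every_position`, and `END` are hypothetical.

```python
END = None  # key marking "a pattern ends here"; never equal to a text character


def build_trie(patterns):
    """Build a plain trie of nested dicts (a stand-in for a Patricia trie)."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node[END] = p
    return root


def match_from_every_position(root, text):
    """For each position k, walk down from the root and report (k, pattern)."""
    occurrences = []
    for k in range(len(text)):
        node = root
        i = k
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                occurrences.append((k, node[END]))
    return occurrences


trie = build_trie(["ate", "chair", "chat", "hat", "have", "vet"])
print(match_from_every_position(trie, "chatehave"))
# [(0, 'chat'), (1, 'hat'), (2, 'ate'), (5, 'have')]
```

Every position restarts the walk at the root, so characters of the text can be re-read many times; that repeated work is where the poor worst-case search bound comes from.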

  9. Existing Solution II: Suffix Tree • Compact trie storing all suffixes of all d patterns • [Figure: suffix tree for { ate, chair, chat, hat, have, vet }]

  10. Existing Solution II: Suffix Tree • Same searching strategy: for each position k in T, match T from the root starting at k, and report the occurrence of any Pj found • Matching time = O(|T|) • Searching: worst-case O(|T| + occ) time

  11. Existing Solution II: Suffix Tree • Disadvantage: Space: O( n log n ) bits, which could be much larger than O( n log s ), the space for |patterns|
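A quick calculation shows how far apart the two space bounds can be. The values of n and s below are assumptions chosen for illustration, and the constants hidden by the O-notation are ignored.

```python
import math


def suffix_tree_bits(n):
    """n log n bits: the order of growth of a pointer-based suffix tree."""
    return n * math.log2(n)


def pattern_bits(n, s):
    """n log s bits: the space needed to write the patterns down."""
    return n * math.log2(s)


n, s = 2**20, 4  # assumed: about a million symbols over a 4-letter alphabet
ratio = suffix_tree_bits(n) / pattern_bits(n, s)
print(ratio)  # 10.0
```

For a small alphabet the gap is a factor of (log n) / (log s), which grows as the dictionary gets larger.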

  12. Our Solution • No suffixes: poor searching • All suffixes: poor space • Some suffixes: good space + searching

  13. Our Solution: Sampling • Store one suffix for every α suffixes • [Figure: sampled suffix tree with α = 2 for { ate, chair, chat, hat, have, vet }]
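The sampling step can be illustrated by listing which suffixes would be kept. The slide does not say which suffixes are chosen, so this sketch assumes the ones starting at offsets 0, `alpha`, 2·`alpha`, … within each pattern; the function name `sampled_suffixes` is hypothetical.

```python
def sampled_suffixes(patterns, alpha):
    """Keep one suffix out of every `alpha`, starting with the pattern itself.

    Assumption: the kept suffixes start at offsets 0, alpha, 2*alpha, ...
    """
    kept = []
    for p in patterns:
        for start in range(0, len(p), alpha):
            kept.append(p[start:])
    return kept


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
print(sampled_suffixes(patterns, 2))
# ['ate', 'e', 'chair', 'air', 'r', 'chat', 'at', 'hat', 't', 'have', 've', 'vet', 't']
print(len(sampled_suffixes(patterns, 1)), len(sampled_suffixes(patterns, 2)))
# 22 13
```

With a sampling factor of 1 this degenerates to the full suffix tree (all 22 suffixes); larger factors store proportionally fewer suffixes.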

  14. Our Solution: Sampling • Store one suffix for every α suffixes • [Figure: the same sampled suffix tree (α = 2), with irregularities marked]

  15. Our Solution: Sampling • Same searching strategy: for each position k in T, match T from the root starting at k, and report the occurrence of any Pj found • Need to handle irregularities • Matching time = O(|T|) despite irregularities

  16. When α = log_s n • Handling irregularities: predecessor search in a set of (log n)-bit integers, using a Y-fast trie • Search: O(|T| log log n + occ) time • Space: O( n log s ) bits
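The predecessor query itself is easy to state in code. The sketch below is not a Y-fast trie: it swaps in binary search over a sorted list, which answers the same query in O(log m) time for m keys instead of the doubly logarithmic time a Y-fast trie offers. Here "predecessor" is taken to mean the largest key not exceeding the query, and the keys are made-up values.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x, or None if every key is larger than x."""
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i > 0 else None


keys = [3, 8, 15, 42]  # assumed example keys, kept in sorted order
print(predecessor(keys, 20))  # 15
print(predecessor(keys, 3))   # 3
print(predecessor(keys, 2))   # None
```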

  17. When α = (log^(1+ε) n) / log s • Handling irregularities: predecessor search in a set of (log^(1+ε) n)-bit strings, using a String B-tree • Search: O(|T| (log^ε n + log d) + occ) time • Space: |patterns| + o(n log s) bits

  18. When α = (log^(1+ε) n) / log s • Handling irregularities: predecessor search in a set of (log^(1+ε) n)-bit strings, using a String B-tree • Search: O(|T| (log^ε n + log d) + occ) time • Space: nH_k + o(n log s) bits [FerVen 07]

  19. Open Problems • Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nH_k-type space bound • External Memory Version: Can an index operate in external memory and still support fast searching?
