
Tunable Compression of Word-level Index for Versioned Corpora

Klaus Berberich, Srikanta Bedathur, Gerhard Weikum. Max-Planck Institute for Informatics, Saarbruecken, Germany.

Presentation Transcript


  1. Tunable Compression of Word-level Index for Versioned Corpora • Klaus Berberich, Srikanta Bedathur, Gerhard Weikum • Max-Planck Institute for Informatics, Saarbruecken, Germany

  2. Introduction • Most document collections are not static • Intranet documents, Mail folders, Blogs, Source-code, and contents of the World Wide Web • Contents are being archived – possibly time-stamped and/or versioned • Wikis • Document repositories (SVN, CVS, …) • Desktop • Web Archives! • Search over evolving collections • Ability to query the collection “as of” given time • Time-travel Search [BBNW’07] EIIR 2008, Glasgow

  3. Outline • Time-travel Search • Our Time-machine: FluxCapacitor/TTIX • Phrase Queries in TTIX • FUSION and Controlled FUSION • Experimental Evaluation

  4. Historical Information Needs • News articles discussing Cola-drinks Cancer controversy during 2005-2006 • Contemporary articles about “Harry Potter and the Philosopher’s Stone” • Angela Merkel’s interview during 2002

  5. Time-Travel Search • Example: "Angela Merkel Interview @ 2002" (keyword query + time-context for evaluation and ranking) • Keyword search extended with a time-context for evaluation: Q = q @ ts • Evaluate q using the collection that existed at time ts • Key Challenges • Dealing with the MASSIVE size • Adapting the scoring models (typically defined for static collections) • Efficient query processing • Opportunities • Redundancy in content • Sufficiency of good approximations • Append-only data growth
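
The query model Q = q @ ts can be illustrated with a minimal sketch. It assumes that each posting carries a half-open validity interval [begin, end), as the interval notation on the later slides suggests. The names `Posting` and `time_travel_lookup`, the integer timestamps, and the sample scores are hypothetical and chosen only for illustration; this is not the FluxCapacitor/TTIX implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Posting:
    doc_id: str
    begin: int    # start of validity interval (inclusive)
    end: int      # end of validity interval (exclusive)
    score: float  # illustrative relevance score


def time_travel_lookup(index: Dict[str, List[Posting]], term: str, ts: int) -> List[Posting]:
    """Return the postings for `term` whose document version was valid at time ts."""
    return [p for p in index.get(term, []) if p.begin <= ts < p.end]


# Toy index: two versions of D1 and one version of D2 (assumed data).
index = {
    "merkel": [
        Posting("D1", 0, 3, 1.9),
        Posting("D1", 3, 7, 2.2),
        Posting("D2", 0, 2, 1.6),
    ]
}

hits_at_2 = time_travel_lookup(index, "merkel", 2)
hits_at_1 = time_travel_lookup(index, "merkel", 1)
```

A linear scan like this reads every posting of the term, which is exactly the cost the index organization on the following slides tries to reduce.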

  6. Outline • Time-travel Search • Our Time-machine: FluxCapacitor/TTIX • Phrase Queries in TTIX • FUSION and controlled FUSION • Experimental Evaluation

  7. FluxCapacitor/TTIX [Berberich, Bedathur, Neumann, Weikum: SIGIR 2007, VLDB 2007] • Adapt the inverted index structure to include the validity time-interval of each document version • Documents D1, D2, D3 are observed to have changed at different times • Index Compaction via Approximate Temporal Coalescing • A sublist materialization framework for trading off space-performance • [Figure: version history of documents D1, D2, D3 on a time axis t0 … t13, now, including a D3 "deletion"; next to it, the time-stamped inverted index, whose vocabulary entries point to postings annotated with validity intervals such as [t0,t3), [t0,t2), [t2,t4), [t0,t1), [t3,t7)]

  8. Phrase Queries • Significantly improve effectiveness • Essential for quickly locating • entities – e.g., “Coca Cola”, “Where Eagles Dare”, … • concepts – e.g., “Water filtering” • … • Indexing for Phrase queries • For each word, need to store positional information for every occurrence • Index-size blowup • Size reduction via gap encoding + space-efficient coding on positions [Scholer et al. 2002]
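
Gap encoding followed by a space-efficient integer code can be sketched as below. The sketch uses the Elias-gamma code as one example of such a code and assumes 1-based, strictly increasing word positions, so that every gap is at least 1. The function names are illustrative, and bits are kept as a Python string for readability rather than packed into bytes.

```python
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Replace each position by its distance from the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def from_gaps(gaps: List[int]) -> List[int]:
    """Inverse of to_gaps: a running sum restores the positions."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def elias_gamma_encode(n: int) -> str:
    """Elias-gamma: (len - 1) zeros, then n in binary (n >= 1)."""
    if n < 1:
        raise ValueError("Elias-gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode_all(bits: str) -> List[int]:
    """Decode a concatenation of Elias-gamma codewords."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


positions = [3, 7, 8, 20]
gaps = to_gaps(positions)  # [3, 4, 1, 12]
encoded = "".join(elias_gamma_encode(g) for g in gaps)
decoded = from_gaps(elias_gamma_decode_all(encoded))
```

Small gaps get short codewords, which is why positions that lie close together compress well.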

  9. Phrase Queries in FluxCapacitor • Baseline: for each document version dtb, a posting with the following structure: • Validity time-interval (= 64 bits) • Document identifier (= 64 bits) • List of word-positions • Word-positions compressed using standard techniques (gap + Elias-/Golomb-encodings) • Can this be improved?

  10. Outline • Time-travel Search • Our Time-machine: FluxCapacitor/TTIX • Phrase Queries in TTIX • FUSION and controlled FUSION • Experimental Evaluation

  11. Word-Positions across Versions • High level of redundancy between versions • Append-only changes leave most parts unchanged • e.g., word b between dt1 and dt2 • Numerical closeness of positions • Small shifts in positions • e.g., word c between dt2 and dt3 • [Figure: position lists of words b and c across versions]

  12. FUSION • Idea: merge (or fuse) multiple consecutive document versions, and exploit redundancy and positional proximity => better compressibility • Positions: all word-positions in any of the versions • Timestamps: all intermediate version timestamps • Signatures: for each version, a bit-signature of positions • [Figure: fused position lists of words b and c]
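
The three components of a fused posting (positions, timestamps, signatures) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data layout: versions are given as `(begin, end, positions)` triples for one word in one document, they are contiguous in time, and signatures are plain lists of 0/1 rather than packed bits. The names `FusedPosting`, `fuse`, and `positions_at` are hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple

Version = Tuple[int, int, Sequence[int]]  # (begin, end, word positions)


class FusedPosting(NamedTuple):
    positions: List[int]         # all positions seen in any fused version
    timestamps: List[int]        # version boundaries, len(versions) + 1 entries
    signatures: List[List[int]]  # per version: one bit per fused position


def fuse(versions: Sequence[Version]) -> FusedPosting:
    """Fuse a non-empty run of contiguous versions into one posting."""
    position_sets = [set(pos) for _, _, pos in versions]
    positions = sorted(set().union(*position_sets))
    timestamps = [versions[0][0]] + [end for _, end, _ in versions]
    signatures = [[1 if p in s else 0 for p in positions] for s in position_sets]
    return FusedPosting(positions, timestamps, signatures)


def positions_at(fused: FusedPosting, ts: int) -> List[int]:
    """Recover the word positions of the version that was valid at time ts."""
    for i, signature in enumerate(fused.signatures):
        if fused.timestamps[i] <= ts < fused.timestamps[i + 1]:
            return [p for p, bit in zip(fused.positions, signature) if bit]
    return []


versions = [
    (0, 2, [5, 12]),      # first version
    (2, 4, [5, 12, 30]),  # append-only change: one new occurrence
    (4, 7, [6, 13, 31]),  # small shift of all positions
]
fused = fuse(versions)
```

Because the fused position list is sorted and shared by all versions, its gaps tend to be small, which helps the gap-based codes; the price is that a query for one version has to read the positions of every version in the fusion.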

  13. Query Processing – win some, lose some • Save on overall space • Naïve organization + processing => reads the whole list, computes ranking • FUSION maintains a smaller list, so (naïve) query processing is faster • Who is naïve!? • Skip pointers to jump ahead during query processing • In the worst case, FUSION ends up reading and processing all the versions, instead of just one version! • Baseline: good performance, bad storage • FUSION: bad (worst-case) performance, good storage

  14. Controlled FUSION • Compute a set of fusions over contiguous versions such that • it takes minimal storage for word positions • for any version, the maximum worst-case query processing overhead is within η • Can be set up as an optimization problem • Optimal solution computable in O(n^3) time and O(n) space • Assumption: storage cost is monotonic • In practice, we found it close to O(n^2)
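
The optimization can be illustrated with a simple dynamic program over contiguous partitions of the version sequence. This is a sketch under simplifying assumptions, not the algorithm from the talk: the storage cost of a fusion is taken to be the number of distinct positions it holds, and the overhead of a version is the fused list length divided by that version's own list length, which must stay within η. The name `controlled_fusion` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def controlled_fusion(versions: Sequence[Sequence[int]], eta: float) -> Tuple[float, List[Tuple[int, int]]]:
    """Partition versions into contiguous fusions of minimal total size.

    Assumed cost model: cost(group) = number of distinct positions in the
    group; a group is feasible if that number is at most eta times the
    length of its shortest member. Returns (total cost, [(start, end), ...])
    with half-open index ranges.
    """
    n = len(versions)
    sets = [set(v) for v in versions]
    best = [math.inf] * (n + 1)  # best[j]: minimal cost for the first j versions
    best[0] = 0
    cut = [0] * (n + 1)          # cut[j]: start index of the last group
    for j in range(1, n + 1):
        union: set = set()
        smallest = math.inf
        for i in range(j - 1, -1, -1):  # candidate group versions[i:j]
            union |= sets[i]
            smallest = min(smallest, len(sets[i]))
            if len(union) > eta * smallest:
                break  # extending the group further only makes it worse
            if best[i] + len(union) < best[j]:
                best[j] = best[i] + len(union)
                cut[j] = i
    groups, j = [], n
    while j > 0:
        groups.append((cut[j], j))
        j = cut[j]
    groups.reverse()
    return best[n], groups


versions = [[5, 12], [5, 12, 30], [6, 13, 31], [6, 13, 31, 40]]
cost, groups = controlled_fusion(versions, eta=1.5)
unfused_cost, unfused_groups = controlled_fusion(versions, eta=1.0)
```

With η = 1.5 the first two and the last two versions are fused, storing 7 positions instead of 12. With η = 1.0 no overhead is tolerated, so every version stays on its own. For η ≥ 1 a single-version group is always feasible, so a solution always exists.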

  15. Outline • Time-travel Search • Our Time-machine: FluxCapacitor/TTIX • Phrase Queries in TTIX • FUSION and controlled FUSION • Experimental Evaluation

  16. Experimental Evaluation • English Wikipedia • Revision history (2004 – 2005) • 10% sample (~35,000 docs, ~900,000 ver.) • Baseline: • Elias- code: 97.51 GBytes • Elias- code: 97.77 GBytes • FUSION: • η between 1.1 – 10 • Elias- & Elias- codes for compressing word-positions in each fused posting

  17. Experimental Results • η = 1.5: 44% of the baseline • η = 1.5: 35% of the baseline

  18. Conclusions • Time-travel Search • Key to archive search & analysis • An interesting and important problem! • Our Time-machine: FluxCapacitor/TTIX • Builds on inverted index framework • Tunable index-size reduction • FUSION • Adds phrase-querying to FluxCapacitor/TTIX • More than 50% space reduction over baseline • With 50% worst-case overhead in query processing

  19. Thank You! Questions?
