Improved index compression techniques for versioned document collections
Jinru He, Junyuan Zeng, and Torsten Suel

Computer Science & Engineering

Polytechnic Institute of NYU

Content of this Talk

  • Introduction

  • Related work

  • Our improved approaches

  • Conclusion and future work


Introduction

  • What is a versioned document collection? A collection in which each document exists in multiple versions over time, such as Wikipedia article histories or web archives.


Challenges

Index representation and compression (our focus)

Index traversal techniques

Support for temporal range queries

Aggregated query processing (e.g., stable top-k)

Content of this Talk

  • Introduction

  • Related work

  • Our improved approaches

  • Conclusion and future work


Related Work: Inverted Index

An inverted index consists of inverted lists.

Each term has an inverted list.

Each inverted list is a sequence of postings.

Each posting contains a docID and a frequency value.

Inverted lists are sorted by docID and compressed.

To improve compressibility, we usually store the differences between consecutive docIDs (d-gaps) instead of the docIDs themselves.
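As an illustrative sketch (not the authors' code), d-gap encoding and decoding of a sorted posting list might look like:

```python
def to_dgaps(doc_ids):
    """Convert a sorted docID list into d-gaps (first docID kept as-is)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    """Recover the original docIDs by prefix-summing the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

# Small gaps are cheaper to encode with variable-length codes:
print(to_dgaps([10, 30, 34, 37]))  # [10, 20, 4, 3]
```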


Related Work: Index Compression Schemes

  • Interpolative coding (IPC): Stuiver and Moffat 1997

    • works very well for clustered data (but slow)

    • thus, great for reordered collections

  • OPT-PFD: Yan, Ding, and Suel 2009

    • works well for clustered data

    • based on S. Heman (2005)

    • decompression is fast

  • Binary arithmetic coding

    • a binary coder driven by the probability of symbols

    • works well if the prediction is good

    • very slow, not practical (only used to approach theoretical bounds)
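To illustrate why interpolative coding rewards clustered data, here is a hedged sketch that only counts the bits a simplified binary interpolative coder would spend (plain fixed-width binary per range; a real coder also emits the bits and uses centered minimal binary codes):

```python
from math import ceil, log2

def ipc_bits(sorted_ids, lo, hi):
    """Count the bits a simplified binary interpolative coder would use
    to encode sorted_ids, all known to lie in [lo, hi]. Encodes the
    middle element first, then recurses on each half with tightened
    bounds."""
    n = len(sorted_ids)
    if n == 0:
        return 0
    mid = n // 2
    x = sorted_ids[mid]
    # By sortedness, x must lie in [lo + mid, hi - (n - 1 - mid)].
    span = (hi - (n - 1 - mid)) - (lo + mid) + 1
    bits = ceil(log2(span)) if span > 1 else 0  # 0 bits if value is forced
    return (bits
            + ipc_bits(sorted_ids[:mid], lo, x - 1)
            + ipc_bits(sorted_ids[mid + 1:], x + 1, hi))

# A fully clustered run costs 0 bits; scattered IDs cost more:
print(ipc_bits([0, 1, 2, 3], 0, 3), ipc_bits([3, 8, 12], 0, 15))  # 0 10
```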


Related Work on Archive Indexing

  • One-level indexing of versioned collections

    • DIFF by P. Anick and R. Flynn 1992

      • Index the symmetric difference between versions.

    • Posting coalescing by Berberich et al. 2007

      • Lossy compression method

    • MSA by Herscovici et al. 2007

      • A virtual document represents a range of versions.

    • He, Yan, and Suel CIKM 2009


Related Work on Archive Indexing (cont.)

  • Two-level indexes by Altingovde et al. 2008

    • The top level indexes the union of all versions (e.g., a first-level docID list 10, 30, 34 …)

    • The lower level uses bit vectors (He et al. 2009)

    • The length of each bit vector is the number of versions of the document.

    • In the bit vector of term t, the ith position is set to 1 if t appears in the ith version, and to 0 otherwise.
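A minimal sketch of the second-level bit vector just described (illustrative; the function name is my own):

```python
def term_bitvector(term, versions):
    """Build the second-level bit vector for one term in one document.
    versions: list of term sets, one per version; bit i is 1 iff the
    term appears in the i-th version."""
    return [1 if term in v else 0 for v in versions]

versions = [{"cat", "dog"}, {"cat"}, {"dog"}]
print(term_bitvector("cat", versions))  # [1, 1, 0]
```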


Data Sets

  • 10% of Wikipedia from Jan. 2001 to Jan. 2008: 0.24 million documents, with 35 versions per document on average.

  • 1.06 million web pages from the Ireland domain, collected between 1996 and 2006 from the Internet Archive, with 15 versions per document on average.


Data Analysis

  • Most changes are small

    • More than 50% of changes between two consecutive versions involve fewer than 5 terms.

  • Term changes are bursty

    • Terms that have just appeared are likely to disappear again shortly.

  • Change size is bursty

    • Fewer than 10% of versions account for more than 50% (Wikipedia) and 70% (Ireland) of all changes.

  • Terms are dependent

    • 48.8% of terms that appear together also disappear together.

    • 30.5% of terms disappear together otherwise.


Content of this Talk

  • Introduction

  • Related work

  • Our improved approaches

    • Combinatorial approach

    • Better practical methods

    • Query processing

  • Conclusion and future work


Combinatorial Approach

  • For simplicity, each document version is a set (docIDs only).

  • The first level stores the IDs of any docs where the term has occurred.

  • The second level models the version bit vector.

  • Given the known versions of a document, we derive models to predict what the next versions look like.


Combinatorial Lower Bound

  • Based on communication complexity (Orlitsky 1990)

  • Basic model: for a document with m versions, s total terms, and c_j changes in version j, the total number of possible version sequences is the product over j of C(s, c_j).

  • It can be proved that the minimum number of bits required to encode these versions without ambiguity is the sum over j of log2 C(s, c_j).
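A short sketch of computing this bound (using Python's `math.comb`; `changes` holds the number of changed terms per version):

```python
from math import comb, log2

def lower_bound_bits(s, changes):
    """Combinatorial lower bound in bits for a document with s total
    terms, where changes[j] terms change going into version j: each
    version has C(s, c_j) possibilities, so the bound is the sum of
    log2 C(s, c_j) over versions."""
    return sum(log2(comb(s, c)) for c in changes if c > 0)
```

For example, `lower_bound_bits(1000, [5, 3, 0])` gives the minimum number of bits needed to encode three versions of a 1000-term document with 5, 3, and 0 changed terms.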


Combinatorial Upper Bound

  • Idea: assign a probability to the next bit using information in the document.

  • Can then use arithmetic coding to compress the bit vector.

  • Given the number of changes between two versions ch(), and the total number of terms in document d as s(d), predict each bit from their ratio.

  • Similar derivations apply for more complicated models.
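As a hedged sketch of the cost model behind this bound: if every bit is predicted to be 1 with some probability p (e.g., an estimate derived from ch() and s(d)), an ideal arithmetic coder spends -log2(p) bits on a 1 and -log2(1-p) bits on a 0:

```python
from math import log2

def code_cost_bits(bits, p_one):
    """Bits an ideal arithmetic coder spends on a 0/1 vector when every
    bit is predicted to be 1 with probability p_one (0 < p_one < 1)."""
    return sum(-log2(p_one if b else 1.0 - p_one) for b in bits)

# A prediction close to the true 1-density beats one bit per bit:
sparse = [0] * 9 + [1]
print(code_cost_bits(sparse, 0.5))  # 10.0 bits with a blind predictor
print(code_cost_bits(sparse, 0.1))  # ~4.69 bits with a matched predictor
```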


Feature-based Prediction

  • In addition, what if we exploit more features of the versions?

  • K-D tree: partition the 8-dimensional feature space.

    • Each bit is an 8-dimensional point in the space.

    • Recursively partition dimensions of the K-D tree to decrease the overall entropy.

    • Trade-off: the smaller the index size, the larger the size of the tree itself.

  • Note: similar to C4.5 decision trees.


Experimental Results

Lower-level index size in MB

  • Information about the changes between versions has a huge impact.

  • Additional features yield only moderate further gains.


Better Practical Methods

  • Combinatorial methods achieve good compression

  • Decompression is slow (arithmetic coding)

  • Changes play a role

  • Here we re-engineer existing compression methods using change information:

    • Two levels applied to DIFF and MSA

    • Bit vector reordering

    • Hybrid of DIFF and MSA


Two-level DIFF and MSA

  • Leverage change information and apply it to known index compression methods

  • First-level index:

    • The union of all versions

  • Second level:

    • Each bit in the bit vector corresponds to a virtual document

    • Reorder the bits in the bit vector

    • Apply standard compression techniques (OPT-PFD, IPC)


Bit Vector Reordering on DIFF

  • The bit vector is transformed into the DIFF bit vector

  • Reorder the bits by the changes between versions

  • Index the gaps between the '1' bits
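The three steps above can be sketched as follows (illustrative; in practice the reordering permutation would be derived from the change sizes, which I assume are given here):

```python
def diff_bits(bits):
    """DIFF transform: bit i is 1 iff the term's presence changed at
    version i (first bit kept, then XOR of consecutive versions)."""
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

def reorder(bits, order):
    """Permute bit positions so versions with many changes come first,
    clustering the 1 bits (order: an assumed permutation of indices)."""
    return [bits[i] for i in order]

def one_gaps(bits):
    """Gaps between the positions of 1 bits, ready for a standard
    d-gap compressor such as OPT-PFD or IPC."""
    ones = [i for i, b in enumerate(bits) if b]
    if not ones:
        return []
    return [ones[0]] + [b - a for a, b in zip(ones, ones[1:])]

print(one_gaps(diff_bits([1, 1, 0, 0, 1])))  # [0, 2, 2]
```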


Hybrid DIFF and MSA

  • MSA

    • Works well when a virtual document contains many terms.

    • Works less well if there are too many small non-empty virtual documents.

  • Idea

    • Pick large virtual documents for MSA

    • Let DIFF finish up the rest of the postings
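A minimal sketch of such a hybrid split (the size threshold is an assumed tuning parameter, not taken from the slides):

```python
def split_hybrid(virtual_docs, threshold):
    """Assign large virtual documents (many postings) to MSA and the
    small ones to DIFF. virtual_docs maps a virtual-doc ID to its
    posting list; threshold is an assumed tuning parameter."""
    msa = {k: v for k, v in virtual_docs.items() if len(v) >= threshold}
    diff = {k: v for k, v in virtual_docs.items() if len(v) < threshold}
    return msa, diff

msa, diff = split_hybrid({"v1": [1, 2, 3, 4], "v2": [5]}, threshold=3)
print(sorted(msa), sorted(diff))  # ['v1'] ['v2']
```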


Experimental Results (docID only)

  • Reordering improves compression by 30% on Wikipedia.

  • Improvements on the Ireland data set are limited.


Experimental Results (docID and Frequency)


Query Processing

  • We have achieved good compression…

  • What about query processing?

  • Actually, we can do even better than previous work!


Query Processing

  • Intersect first on the first level, e.g., the docID lists 10, 30, 34 … and 30, 33, 37 …


Query Processing (cont.)

  • Bit vectors corresponding to the result docIDs are fetched

  • AND the bit vectors together

  • The first-level index is small, and many bit vectors can be skipped, speeding up second-level query processing!
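A hedged sketch of this two-level intersection (hypothetical data layout: per-term first-level docID lists plus per-document version bit vectors, needed only for docIDs surviving level one):

```python
def two_level_and(first_a, first_b, bitvecs_a, bitvecs_b):
    """Two-level conjunctive query for terms A and B: intersect the
    first-level docID lists, then AND the per-document version bit
    vectors only for the surviving docIDs (all others are skipped)."""
    result = {}
    for doc in sorted(set(first_a) & set(first_b)):
        anded = [x & y for x, y in zip(bitvecs_a[doc], bitvecs_b[doc])]
        if any(anded):  # keep docs where some version contains both terms
            result[doc] = anded
    return result

# Using the docID lists from the slide: only doc 30 survives level one.
print(two_level_and([10, 30, 34], [30, 33, 37],
                    {30: [1, 0, 1]}, {30: [1, 1, 0]}))  # {30: [1, 0, 0]}
```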






Query Processing Results

  • The 2-DIFF query processing is about 32% faster.

  • The hybrid (Mix) performs well.

  • 2R-DIFF gains a moderate speed improvement while achieving the best compressibility.


Conclusion and Future Work

  • New index organization

    • Reduced index size

    • Faster query processing

  • Simple models exploiting change information matter.

  • Future work:

    • Generative models incorporating user behavior

    • Different classes of query processing (stable top-k, temporal range queries)


Q & A

  • Thank You!