
Using Fingerprints in n-Gram Indices

Stefan Selbach, selbach@informatik.uni-wuerzburg.de. Digital Libraries: Advanced Methods and Technologies, Digital Collections. 17.09.2009.


Presentation Transcript


  1. Using Fingerprints in n-Gram Indices Stefan Selbach selbach@informatik.uni-wuerzburg.de Digital Libraries: Advanced Methods and Technologies, Digital Collections 17.09.2009

  2. Using Fingerprints in n-Gram Indices Overview • Introduction • Inverted Index • N-Gram Index • Bitmaps • Signature Files • n-Gram Fingerprints • n-Gram Fingerprints in Combination with Posting Lists • Fingerprint Compression • Conclusion and Future Work

  3. Introduction

  4. Inverted Index • Very common index structure • Term-oriented • Every term is linked to its postings

  5. n-Gram Index • Uses n-Grams as indexing terms • Any kind of subsequence can be searched • An n-Gram is a subsequence of a text with length n • Postings for longer subsequences can be calculated from the postings of the n-Grams they contain
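A minimal sketch of how postings for a longer subsequence might be derived from n-gram postings. The slide's own formula is not preserved in the transcript, so the function names and the (doc_id, offset) posting layout below are assumptions made for illustration.

```python
from collections import defaultdict

def build_ngram_index(docs, n):
    """Map every n-gram to a list of (doc_id, offset) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((doc_id, offset))
    return index

def search(index, n, query):
    """Postings of a query of length >= n, derived from its n-grams."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    # Start with the postings of the first n-gram of the query.
    result = set(index.get(query[:n], []))
    # The n-gram starting at query position k must occur k characters later,
    # so its postings are shifted back by k before intersecting.
    for k in range(1, len(query) - n + 1):
        gram = query[k:k + n]
        result &= {(d, o - k) for d, o in index.get(gram, [])}
    return sorted(result)

docs = ["banana bandana", "cabana"]
index = build_ngram_index(docs, 2)
hits = search(index, 2, "bana")  # (doc_id, offset) pairs where "bana" starts
```

Every n-gram of the query takes part in the intersection, which is why searching costs more than a single lookup in a term-based inverted index.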

  6. n-Gram Index • Index structure is very similar to an inverted index • Searching is more complex

  7. Bitmaps • Bitmaps are occurrence maps • Each bit signals an occurrence of a specific term in a specific document
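As an illustration of an occurrence map, the sketch below stores one Python integer per term and sets bit d when the term occurs in document d. The helper names and the sample documents are hypothetical, not taken from the presentation.

```python
def build_bitmaps(docs, terms):
    """One integer per term; bit d is set when the term occurs in document d."""
    bitmaps = {}
    for term in terms:
        bits = 0
        for doc_id, text in enumerate(docs):
            if term in text:
                bits |= 1 << doc_id
        bitmaps[term] = bits
    return bitmaps

def docs_with_all(bitmaps, terms, num_docs):
    """Documents containing all terms, found with a bitwise AND of the bitmaps."""
    combined = (1 << num_docs) - 1
    for term in terms:
        combined &= bitmaps.get(term, 0)
    return [d for d in range(num_docs) if (combined >> d) & 1]

docs = ["index structure", "bitmap index", "signature file"]
bitmaps = build_bitmaps(docs, ["index", "bitmap", "file"])
both = docs_with_all(bitmaps, ["index", "bitmap"], len(docs))
```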

  8. Signature Files

  9. n-Gram Fingerprint

  10. N-Gram Fingerprint The idea: Create fingerprints that: • Have a fixed size • Contain information about the postings

  11. N-Gram Fingerprint A 2D-Fingerprint is a bit-matrix

  12. N-Gram Fingerprint • Given two 1-grams and their fingerprints Bw1 and Bw2, the fingerprint Bw1w2 can be approximated: • B’w2 is constructed by cyclically shifting each column of Bw2 by one position to the left.
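The approximation formula itself did not survive in the transcript. One plausible reading, sketched below, is a bitwise AND of Bw1 with the shifted B’w2. The sketch assumes that a fingerprint bit is addressed by (file ID mod ROWS, offset mod COLS), that "left" means moving a bit from offset class c to class c - 1, and that ROWS and COLS are small illustrative values; none of this is confirmed by the slides.

```python
ROWS, COLS = 8, 16          # hypothetical residue-class counts (file ID, offset)
MASK = (1 << COLS) - 1

def postings(docs, gram):
    """All (doc_id, offset) positions at which gram starts."""
    return [(d, o) for d, text in enumerate(docs)
            for o in range(len(text) - len(gram) + 1)
            if text.startswith(gram, o)]

def fingerprint(posting_list):
    """Bit matrix as a list of ROWS integers, each holding COLS bits."""
    rows = [0] * ROWS
    for doc_id, offset in posting_list:
        rows[doc_id % ROWS] |= 1 << (offset % COLS)
    return rows

def shift_offsets(fp, k):
    """Cyclically move every bit from offset class c to class (c - k) mod COLS."""
    k %= COLS
    return [((row >> k) | (row << (COLS - k))) & MASK for row in fp]

def approx_bigram(fp_w1, fp_w2):
    """Assumed approximation: AND of Bw1 with Bw2 shifted by one offset class."""
    return [a & b for a, b in zip(fp_w1, shift_offsets(fp_w2, 1))]

docs = ["abcab", "bab"]
fp_a = fingerprint(postings(docs, "a"))
fp_b = fingerprint(postings(docs, "b"))
approx = approx_bigram(fp_a, fp_b)
exact = fingerprint(postings(docs, "ab"))
```

Under these assumptions every bit of the exact fingerprint is also set in the approximation. The reverse does not hold in general, because different postings can fall into the same residue classes.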

  13. N-Gram Fingerprint

  14. N-Gram Fingerprint Search Speed Results from the “Online Encyclopedia of Dermatology from P. Altmeyer”

  15. N-Gram Fingerprints in Combination with Posting Lists

  16. Combining Fingerprints and Posting Lists By combining fingerprints and posting lists • No verification step is needed • Posting lists are partitioned into smaller subsets. Each bit of the fingerprint corresponds to a separate posting list • Costs for intersection of posting lists are reduced
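The partitioning can be sketched as follows, assuming that each fingerprint bit is addressed by (file ID mod ROWS, offset mod COLS) and that an empty subset corresponds to a 0 bit. The constants, function names and sample postings are hypothetical.

```python
from collections import defaultdict

ROWS, COLS = 4, 8  # hypothetical residue-class counts (file ID, offset)

def partition(posting_list):
    """Split a posting list into subsets, one per fingerprint bit."""
    buckets = defaultdict(set)
    for doc_id, offset in posting_list:
        buckets[(doc_id % ROWS, offset % COLS)].add((doc_id, offset))
    return buckets

def intersect_adjacent(buckets_w1, buckets_w2):
    """Postings where the 1-gram w1 is directly followed by w2."""
    hits = []
    for (r, c), left in buckets_w1.items():
        # Only the subset one offset class further can contain a match.
        right = buckets_w2.get((r, (c + 1) % COLS))
        if not right:
            continue  # corresponding fingerprint bit is 0: nothing to intersect
        hits.extend(p for p in left if (p[0], p[1] + 1) in right)
    return sorted(hits)

w1 = partition([(0, 0), (0, 3), (1, 1), (2, 7), (3, 5)])
w2 = partition([(0, 1), (0, 4), (1, 0), (1, 2), (2, 8)])
hits = intersect_adjacent(w1, w2)
```

Because real postings are compared inside each subset, the result is exact, and only small subsets with matching residue classes are ever intersected.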

  17. Combining Fingerprints and Posting Lists

  18. Managing n-Gram Posting Lists • A very large number of posting subsets has to be managed. For example: 1024 residue classes for the file ID, 128 residue classes for the offset, 14,000 different n-grams • Subsets are stored in a hash • The hash value is a function of the residue classes
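To give a sense of scale, the numbers in the example multiply out to roughly 1.8 billion possible subsets. The sketch below computes that product and shows one way a hash slot could be derived from the residue classes. The slides do not specify the actual hash function, so `slot` and its parameters are assumptions.

```python
FILE_CLASSES = 1024     # residue classes for the file ID (from the slide)
OFFSET_CLASSES = 128    # residue classes for the offset (from the slide)
NUM_NGRAMS = 14_000     # different n-grams (from the slide)

def subset_count():
    """Upper bound on the number of posting subsets to manage."""
    return FILE_CLASSES * OFFSET_CLASSES * NUM_NGRAMS

def slot(ngram_id, file_id, offset, table_size):
    """Hypothetical hash slot computed from the n-gram and both residue classes."""
    file_class = file_id % FILE_CLASSES
    offset_class = offset % OFFSET_CLASSES
    key = (ngram_id * FILE_CLASSES + file_class) * OFFSET_CLASSES + offset_class
    return key % table_size

total = subset_count()  # 1,835,008,000 possible subsets
```

Many of these subsets are empty in practice, which is one reason to store only the non-empty ones in a hash rather than in a dense array.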

  19. Managing n-Gram Posting Lists

  20. Managing n-Gram Posting Lists [Figure: hash collisions and collision resolving. Y-axis: frequency (0 to 40000); x-axis: number of ... (0 to 140); series: ... collisions, ... comparisons, ... comparisons after sorting]

  21. Results • Performance improved by 40% compared to the setup without posting lists

  22. Fingerprint compression

  23. Fingerprint Compression • Fingerprints with high or low densities do not contain much information • Fingerprints can be compressed by reducing the resolution • Dictionary based compression
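One way to reduce the resolution, sketched under the assumption that fingerprint columns are offset residue classes, is to fold each row onto itself with a bitwise OR, which halves the number of classes. The function names are hypothetical, and the dictionary-based step is not shown.

```python
def halve_offset_resolution(fp, cols):
    """Fold each row from cols offset classes down to cols // 2 by OR-ing halves."""
    if cols % 2:
        raise ValueError("cols must be even")
    half = cols // 2
    low_mask = (1 << half) - 1
    # Offset class c maps to class c mod half, so column c + half lands on column c.
    return [(row & low_mask) | (row >> half) for row in fp]

def density(fp, cols):
    """Fraction of bits that are set in the fingerprint."""
    ones = sum(bin(row).count("1") for row in fp)
    return ones / (len(fp) * cols)

fp = [0b10000001, 0b00010000]          # two rows, eight offset classes
small = halve_offset_resolution(fp, 8)
before, after = density(fp, 8), density(small, 4)
```

Folding never clears a bit that was set, so the smaller fingerprint cannot miss a posting. It only becomes denser and therefore less selective.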

  24. Fingerprint Compression • Results: Fingerprint convolution • In combination with the dictionary based compression, the index size is reduced by an additional 30%

  25. Conclusion and Future Work

  26. Conclusion • Fingerprints improve the scalability of n-gram indices • Fingerprints improve the performance of n-gram indices • The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently • The fingerprints can be stored in a compressed index while losing only a minimum of performance

  27. Future Work • Combination of term based inverted index and n-Gram fingerprint index • Profit from the advantages of both using terms and n-Grams as indexing terms • Substring search • Ranking • Thesaurus information

  28. Thank You!
