generalized inverted index for keyword search authors hao wu guoliang li lizhu zhou n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Generalized Inverted Index for Keyword Search Authors: Hao Wu, Guoliang Li, Lizhu Zhou PowerPoint Presentation
Download Presentation
Generalized Inverted Index for Keyword Search Authors: Hao Wu, Guoliang Li, Lizhu Zhou

Loading in 2 Seconds...

play fullscreen
1 / 18
chynna

Generalized Inverted Index for Keyword Search Authors: Hao Wu, Guoliang Li, Lizhu Zhou - PowerPoint PPT Presentation

201 Views
Download Presentation
Generalized Inverted Index for Keyword Search Authors: Hao Wu, Guoliang Li, Lizhu Zhou
An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Generalized Inverted Index for Keyword SearchAuthors: Hao Wu, GuoliangLi,LizhuZhou Presentation By Manoj Raj Devalla

  2. Introduction • A keyword search is most common type of search. • Each time a user queries a search engine it has to search billions of web pages and has to return most among them that matches the criteria. • For example, Google Search. • The most common mechanism used to access text data from keyword based search is Inverted Index.

  3. Introduction(Contd.) • An inverted index is an index data structure that stores a mapping of contents to its location within a document. • The fastness of the search is achieved at a cost of increased processing of the document when it is newly added.

  4. Motivation • Inverted file may be large at times. • We need some efficient techniques to reduce the storage space consumed. • Also we need to reduce disk I/O time. • Solution???

  5. Motivation(Contd.) • A new index structure Ginix (Generalized INvertedIndeX) has been proposed to overcome the problems. • It Merges consecutive Doc IDs in inverted file into intervals. • It is also compatible with other techniques such as Variable-Byte Coding, PForDelta etc.

  6. Basic Concepts • Depending upon the algorithm implemented, the inverted lists consist of Document ID’s, number of occurrences and location inside the document. • For example, if Id contains {12, 3, 12, 23, 32} then that keyword has 3 occurrences at location 12, 23, 32 in the document 12.

  7. Related Work • Various compression techniques have been proposed such as Variable-Byte Coding, PForDelta etc. • Variable-Byte Codingis simple to implement but does not achieve a better compression ratio. • In this method, an integer n is compressed as a sequence of bytes.

  8. Variable-Byte Coding • Each byte consists of a status bit that is used to indicate if another byte follows the current byte or not, and 7 data bits. • The status bit will be the most significant bit. • For example, if inverted list contains (12, 25, 39, 279) it will be encoded as follows, 279 = 10000010 00010111

  9. Ginix • Merges document IDs that are consecutive into interval list. • Lists are sorted by ascending order. • An advantage in this scheme is that, the interval list needs only two numbers to identify an interval - a lower boundary lb(r) and upper boundary ub(r). • Unlike other approaches, in Ginix we do not require to decompress the interval list.

  10. Ginix(Contd.) • Sometimes we may end up having same lower boundary and upper boundary number. • In such cases, that list is further divided into three lists. • S, L, U where, • S - Single element intervals. • L - Lower bound multi-element intervals. • U - Upper bound multi-element intervals.

  11. Ginix(Contd.) Table 1 A sample dataset of 7 paper titles.

  12. Ginix(Contd.) Table 2: Traditional Inverted file vs Ginix

  13. Storage Consumption

  14. Storage Consumption(Contd.)

  15. Performance

  16. Performance(Contd.)

  17. Conclusion • Ginix proved to reduce the storage consumption. • Using Ginix, disk I/O time is also reduced. • Ginixis also compatible with other compression techniques. • Overall performance of Ginix is much higher than that of traditional Inverted Index.

  18. Reference [1] Wu Hao, Guoliang Li, Lizhu Zhou. “Ginix: Generalized inverted index for keyword search”. Tsinghua Science and Technology, Vol. 8, Issue 1, pages 77-87, Feb. 2013. [2] HaoYan, Shuai Ding, TorstenSuel. “Inverted index compression and query processing with optimized document ordering”. WWW ’09 Proceeding of the 18th international conference on world wide web, pages 401-410, ACM New York, 2009. [3]http://www.1and1.com/domain-name-search [4]Wikipedia, http://en.wikipedia.org/wiki/Inverted_index. [5]Gavin Powell (2006). "Chapter 8: Building Fast-Performing Database Models". Beginning Database Design ISBN 978-0-7645-7490-0. Wrox Publishing. [6]A. Moffat, J. Zobel. “Inverted files for text search engines”. ACM Computing Surveys, vol. 38, Issue 2, pp. 6, 2006.