Presented By Guan Guan

Building a Distributed Full-Text Index for the Web, by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina, Stanford University.

Presentation Transcript


  1. Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented By Guan Guan

  2. Overview • INTRODUCTION • TESTBED ARCHITECTURE • PIPELINED INDEXER DESIGN • MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM • COLLECTING GLOBAL STATISTICS • CONCLUSIONS

  3. Inverted Index • A book index and an inverted index are similar.

  4. Steps to build an inverted index • Processing each page to extract postings. • Sorting the postings first on index terms and then on locations. • Writing out the sorted postings as a collection of inverted lists on disk. Index build time becomes critical for two reasons: • Web scale and growth rate. • Rate of change.
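
The three steps on slide 4 can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' implementation: the function name `build_inverted_index`, the whitespace tokenizer, and the use of a dictionary in place of on-disk inverted lists are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """pages maps page_id -> text; returns term -> sorted list of (page_id, position)."""
    # Step 1: process each page to extract postings (term, page_id, position).
    postings = []
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            postings.append((term, page_id, position))

    # Step 2: sort the postings first on index term, then on location.
    postings.sort()

    # Step 3: group the sorted postings into one inverted list per term.
    # A real indexer would write these lists to disk; a dict stands in here.
    index = defaultdict(list)
    for term, page_id, position in postings:
        index[term].append((page_id, position))
    return dict(index)


pages = {1: "web index build", 2: "index the web"}
index = build_inverted_index(pages)
```

Because the postings are sorted before grouping, every inverted list comes out already ordered by location.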

  5. Purpose of the Paper • To optimize build times for massive (Web) collections (challenges and solutions). • To propose a pipeline architecture on each indexing node that enhances performance through intra-node parallelism (build performance issues). • To propose an appropriate format for inverted files that makes optimal use of the features of an embedded database system. • Any distributed system for building inverted indexes needs to address the issue of collecting global statistics (e.g., inverse document frequency, IDF). The paper examines different strategies for collecting such statistics from a distributed collection.

  6. TESTBED ARCHITECTURE • Distributors. These nodes store the collection of Web pages to be indexed. Pages are gathered by a Web crawler. • Indexers. These nodes execute the core of the index building engine. • Query servers. Each of these nodes stores a portion of the final inverted index and an associated lexicon. The lexicon lists all the terms in the corresponding portion of the index and their associated statistics. Figure: overview of the indexing process.

  7. PIPELINED INDEXER DESIGN • The core of the indexing system is the index-builder process that executes on each indexer. • Logical phases.

  8. PIPELINED INDEXER DESIGN • Multi-threaded execution. • Performance gain through pipelining: saves 1.5 hours for 5 million pages, and 30-40% in general.
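
A multi-threaded pipeline of this kind can be sketched with Python threads and bounded queues. The sketch assumes the three logical phases are loading, processing, and flushing (the slides do not name them), and every identifier in it (`loader`, `processor`, `flusher`, `run_pipeline`, `buffer_size`) is hypothetical. Flushing is simulated by appending each sorted run to a list rather than writing it to disk.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream between stages


def loader(pages, out_q):
    # Loading phase: read pages from the input stream.
    for page in pages:
        out_q.put(page)
    out_q.put(SENTINEL)


def processor(in_q, out_q, buffer_size):
    # Processing phase: extract postings and sort them once the buffer fills.
    buffer = []
    while True:
        page = in_q.get()
        if page is SENTINEL:
            break
        page_id, text = page
        buffer.extend((term, page_id, pos) for pos, term in enumerate(text.split()))
        if len(buffer) >= buffer_size:
            out_q.put(sorted(buffer))
            buffer = []
    if buffer:
        out_q.put(sorted(buffer))
    out_q.put(SENTINEL)


def flusher(in_q, runs):
    # Flushing phase: a real indexer would write each sorted run to disk.
    while True:
        run = in_q.get()
        if run is SENTINEL:
            break
        runs.append(run)


def run_pipeline(pages, buffer_size=4):
    page_q = queue.Queue(maxsize=8)  # bounded queues keep stages in step
    run_q = queue.Queue(maxsize=2)
    runs = []
    threads = [
        threading.Thread(target=loader, args=(pages, page_q)),
        threading.Thread(target=processor, args=(page_q, run_q, buffer_size)),
        threading.Thread(target=flusher, args=(run_q, runs)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return runs


pages = [(1, "a b c"), (2, "b c d"), (3, "a d")]
runs = run_pipeline(pages)
```

The point of the design is overlap: while one buffer is being flushed, the next can be processed and a third loaded. In CPython, threads mainly overlap I/O waits rather than CPU work, so this shows the structure only and says nothing about the speedups reported on the slide.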

  9. MANAGING INVERTED FILES IN AN EMBEDDED DATABASE SYSTEM • Challenge 1: custom implementation vs. existing data management systems. Solution: Berkeley DB. • Challenge 2: designing a scheme for storing inverted files that makes optimal use of the storage structures provided by the data management system. Candidate schemes: full list, single payload, mixed list.

  10. Three storage schemes: • 1. Full list: The key is an index term, and the value is the complete inverted list for that term. • 2. Single payload: Each posting (an index term and location pair) is a separate key.

  11. 3. Mixed list:
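
The three key/value layouts can be compared with a small sketch that uses plain Python dictionaries as a stand-in for the embedded database. The full list and single payload layouts follow the definitions on slide 10. The transcript does not preserve the mixed list definition, so the layout below (the key is a posting, the value holds a bounded number of the postings that follow it in sorted order) is an assumption, as are all names and the cap `MAX_POSTINGS_PER_ENTRY`.

```python
MAX_POSTINGS_PER_ENTRY = 3  # hypothetical cap on postings per key/value entry

# Postings as (term, location) pairs, already sorted.
postings = [("cat", 10), ("cat", 20), ("cat", 30), ("cat", 40), ("dog", 15)]


def full_list(postings):
    # Key: index term. Value: the complete inverted list for that term.
    store = {}
    for term, location in postings:
        store.setdefault(term, []).append(location)
    return store


def single_payload(postings):
    # Key: one posting (term, location). Value: empty.
    return {(term, location): b"" for term, location in postings}


def mixed_list(postings, chunk=MAX_POSTINGS_PER_ENTRY):
    # ASSUMED layout: the key is the first posting of a chunk, and the value
    # holds the postings that follow it, so value length stays bounded.
    store = {}
    for start in range(0, len(postings), chunk):
        group = postings[start:start + chunk]
        store[group[0]] = group[1:]
    return store


full = full_list(postings)
single = single_payload(postings)
mixed = mixed_list(postings)
```

Under these assumptions, the full list store has one entry per term with values of unbounded length, the single payload store has one entry per posting, and the mixed list store sits in between with values of roughly constant length.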

  12. Comparison of storage schemes • Index size: With the mixed list scheme, the length of the value field is approximately constant. • Zig-zag joins: In the full list scheme, the entire list must be retrieved to compute the join, whereas with the mixed list scheme, access to specific portions of the inverted list is available. • Hot updates: Since the length of the value field is limited, hot updates are faster with mixed lists than with full lists.
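
To show why access to specific portions of a list matters, here is a minimal sketch of a zig-zag join over two sorted lists of locations. It works on in-memory lists and uses binary search to skip ahead. With a keyed store, each skip would instead be a seek to the first key at or after the target, which is an assumption about how the schemes would be used and is not taken from the slides. The function name `zigzag_join` is hypothetical.

```python
from bisect import bisect_left


def zigzag_join(a, b):
    """Intersect two sorted lists, skipping ahead instead of scanning every entry."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump in a to the first entry that is >= the current entry of b.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Jump in b to the first entry that is >= the current entry of a.
            j = bisect_left(b, a[i], j + 1)
    return result


matches = zigzag_join([1, 4, 7, 9, 20], [2, 4, 9, 10, 20, 30])
```

Each jump needs only the part of the list near the target. A scheme that stores the whole list under one key cannot serve the jump without retrieving the entire value.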

  13. Experimental Results • Data set: 2 million Web pages, 4.9 million distinct terms, 312 million postings. • The optimal mixed list is 30% better than the full list.

  14. COLLECTING GLOBAL STATISTICS • ME Strategy (sending local information during merging). • FL Strategy (sending local information during flushing).
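
Whichever strategy is used, the local information sent by the indexers has to be combined into global values. The sketch below merges per-indexer document frequencies and derives IDF weights from them. The formula `log(N / df)` is one common definition of IDF and is an assumption here, as are the function names and the sample counts. The code does not model when the information is sent, which is the point on which ME and FL differ.

```python
import math
from collections import Counter


def merge_local_stats(local_counts):
    """Sum the per-indexer document frequencies into global document frequencies."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)  # adds counts term by term
    return total


def idf(global_df, num_docs):
    # ASSUMED formula: idf(term) = log(N / df(term)).
    return {term: math.log(num_docs / df) for term, df in global_df.items()}


# Hypothetical local document frequencies reported by two indexers.
indexer_a = {"web": 3, "index": 1}
indexer_b = {"web": 1, "crawler": 2}

global_df = merge_local_stats([indexer_a, indexer_b])
weights = idf(global_df, num_docs=8)
```

A term that appears in few documents across the whole collection gets a higher weight than a common one. No single indexer can compute that weight from its local counts alone.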

  15. Experiments • In general, experiments show the FL strategy outperforming ME, although they seem to converge as the collection size becomes large. Furthermore, as the collection size grows, the relative overheads of both strategies decrease.

  16. CONCLUSIONS • In this paper we addressed the problem of efficiently constructing inverted indexes over large collections of Web pages. • We proposed a new pipelining technique to speed up index construction and demonstrated how to identify the right buffer sizes for maximum performance. • We proposed and compared different schemes for storing and managing inverted files using an embedded database system. • Finally, we identified the key characteristics of methods for efficiently collecting global statistics from distributed inverted indexes.

  17. Q & A
