
Building a Distributed Full-Text Index for the Web

S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina


Presentation Transcript


  1. Building a Distributed Full-Text Index for the Web. S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina

  2. Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

  3. Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

  4. Pages: 1: Pig Cat Fish Cat; 2: Fly Dog Pig; 3: Dog Cat Fish Dog. Inverted lists: Cat -> (1,2), (1,4), (3,2); Dog -> (2,2), (3,1), (3,4); Fish -> (1,3), (3,3); Pig -> (1,1), (2,3). Each entry in an inverted list is a location; together the lists form the inverted index.

  5. An inverted index consists of one inverted list for each index term, with the terms kept in sorted order. An inverted list consists of locations in sorted order. A location consists of (page identifier, position in the page). A posting consists of (index term, location).

  6. Building an inverted index over a collection of web pages involves: 1. Processing each page to extract postings. 2. Building an inverted list for each term. 3. Writing the inverted lists out to disk.
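The first two steps can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the indexer described in the talk: it uses the three example pages from the slides (page 1: Pig Cat Fish Cat; page 2: Fly Dog Pig; page 3: Dog Cat Fish Dog), the function names `extract_postings` and `build_inverted_index` are hypothetical, tokenization is a plain whitespace split, and step 3 (writing to disk) is left out.

```python
from collections import defaultdict

# Example pages taken from the slides: page identifier -> page text.
PAGES = {
    1: "Pig Cat Fish Cat",
    2: "Fly Dog Pig",
    3: "Dog Cat Fish Dog",
}


def extract_postings(page_id, text):
    """Step 1: turn one page into postings (index term, (page id, position))."""
    return [(term, (page_id, pos)) for pos, term in enumerate(text.split(), start=1)]


def build_inverted_index(pages):
    """Step 2: sort all postings by term, then location, and group them by term."""
    postings = []
    for page_id, text in pages.items():
        postings.extend(extract_postings(page_id, text))
    postings.sort()  # tuples sort by term first, then by (page id, position)

    index = defaultdict(list)
    for term, location in postings:
        index[term].append(location)
    return dict(index)


index = build_inverted_index(PAGES)
for term, locations in index.items():
    print(term, "->", locations)
```

Because the postings are sorted before grouping, both the terms and the locations inside each inverted list come out in sorted order, which is the property the later zig-zag join slide relies on.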

  7. Important problems when building a web-scale inverted index: 1. Scale and growth rate. 2. Rate of change.

  8. Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

  9. Distributors. • Indexers. • Query servers.

  10. Distributed inverted index organization: 1. Local inverted files. 2. Global inverted files.

  11. Global inverted files. Query server 1 (terms a-e): Cat -> (1,2), (1,4), (3,2); Dog -> (2,2), (3,1), (3,4). Query server 2 (terms f-z): Fish -> (1,3), (3,3); Pig -> (1,1), (2,3). Pages: 1: Pig Cat Fish Cat; 2: Fly Dog Pig; 3: Dog Cat Fish Dog.

  12. Local inverted files. Query server 1 (pages 1 and 2): Cat -> (1,2), (1,4); Dog -> (2,2); Fish -> (1,3); Fly -> (2,1); Pig -> (1,1), (2,3). Query server 2 (page 3): Cat -> (3,2); Dog -> (3,1), (3,4); Fish -> (3,3). Pages: 1: Pig Cat Fish Cat; 2: Fly Dog Pig; 3: Dog Cat Fish Dog.
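The two organizations differ only in the rule that assigns a posting to a query server. The sketch below is an illustration under assumptions, not code from the talk: the helper names (`partition`, `by_page`, `by_term`, `servers_for_query`) are hypothetical, the page split (pages 1 and 2 on server 1, page 3 on server 2) and the term ranges (a-e, f-z) follow the example slides, and the posting for Fly is included even though the global-files slide does not show it.

```python
from collections import defaultdict

# Sorted postings for the three example pages: (term, (page id, position)).
POSTINGS = [
    ("Cat", (1, 2)), ("Cat", (1, 4)), ("Cat", (3, 2)),
    ("Dog", (2, 2)), ("Dog", (3, 1)), ("Dog", (3, 4)),
    ("Fish", (1, 3)), ("Fish", (3, 3)),
    ("Fly", (2, 1)),
    ("Pig", (1, 1)), ("Pig", (2, 3)),
]


def partition(postings, server_of):
    """Group postings into per-server inverted files using the given rule."""
    servers = defaultdict(lambda: defaultdict(list))
    for term, location in postings:
        servers[server_of(term, location)][term].append(location)
    return {server: dict(lists) for server, lists in servers.items()}


def by_page(term, location):
    """Local inverted files: a server indexes a disjoint subset of pages."""
    return 1 if location[0] in (1, 2) else 2


def by_term(term, location):
    """Global inverted files: a server holds whole lists for a range of terms."""
    return 1 if term[0].lower() <= "e" else 2


def servers_for_query(files, term):
    """Which servers must be contacted to answer a single-term query."""
    return sorted(server for server, lists in files.items() if term in lists)


local_files = partition(POSTINGS, by_page)
global_files = partition(POSTINGS, by_term)

print(servers_for_query(local_files, "Dog"))
print(servers_for_query(global_files, "Dog"))
```

With local files a query for Dog touches both servers, each returning a partial list; with global files it touches only the server that owns the a-e range.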

  13. Local vs. Global • Resilience to failures. • Network load.

  14. Testbed environment: The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, equipped with multiple disks. All the machines are interconnected by a 100 Mbps Ethernet LAN.

  15. The WebBase collection: To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples, of 100,000 pages each, from different portions of the WebBase repository.

  16. Table 1: Properties of the WebBase collection

  17. Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

  18. Design of the Indexer • Software pipeline. • The storage of the inverted files generated by the process.

  19. Software pipeline • The indexing process can logically be split into 3 phases: • Processing -> CPU-intensive. • Flushing -> disk. • Loading -> network.

  20. The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time. Example (figure): execution of the pipeline, with phases labeled L (loading), P (processing), and F (flushing).
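To see why overlapping the phases helps, the following sketch compares a strictly sequential schedule with an idealized pipelined one. It is a simplified flow-shop model, not the analysis from the talk: it assumes every buffer takes the same time `l` to load, `p` to process, and `f` to flush, that each resource (network, CPU, disk) serves one buffer at a time, and that there is always memory available to hold a buffer waiting for the next phase. The function names and the sample times are hypothetical.

```python
def sequential_time(n_buffers, l, p, f):
    """No overlap: each buffer is loaded, processed, and flushed in turn."""
    return n_buffers * (l + p + f)


def pipelined_time(n_buffers, l, p, f):
    """Overlap phases of different buffers; each resource handles one at a time."""
    load_done = proc_done = flush_done = 0.0
    for _ in range(n_buffers):
        # The network starts the next load as soon as the previous one ends.
        load_done = load_done + l
        # The CPU needs the buffer loaded and the previous processing finished.
        proc_done = max(proc_done, load_done) + p
        # The disk needs the buffer processed and the previous flush finished.
        flush_done = max(flush_done, proc_done) + f
    return flush_done


# Hypothetical phase times (arbitrary units) for 4 buffers.
print(sequential_time(4, l=2, p=3, f=1))
print(pipelined_time(4, l=2, p=3, f=1))
```

In this model the pipelined running time is governed by the slowest phase, so balancing the three phase durations is what shortens the overall schedule.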

  21. Pipeline time (figure with time axis t).

  22. Theoretical analysis vs. experimental results

  23. Design of the Indexer • Software pipeline. • The storage of the inverted files generated by the process.

  24. Storage schemes: We considered three storage schemes for storing inverted files as sets of (key, value) pairs in a B-tree: 1. Full list. 2. Single payload. 3. Mixed list.
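The slide gives only the names of the schemes, so the sketch below rests on an assumed reading of them: full list stores one pair per term whose value is the whole inverted list; single payload stores one pair per posting with an empty value; mixed list uses a posting as the key and packs a run of the following postings into the value, even when they belong to a different term. Plain Python lists of pairs stand in for the B-tree, the function names are hypothetical, and the fixed `chunk` size is an arbitrary choice for illustration (no compression is shown).

```python
# Sorted postings for two terms from the example pages.
POSTINGS = [
    ("Cat", (1, 2)), ("Cat", (1, 4)), ("Cat", (3, 2)),
    ("Dog", (2, 2)), ("Dog", (3, 1)), ("Dog", (3, 4)),
]


def full_list(postings):
    """One (key, value) pair per term: key = term, value = entire inverted list."""
    lists = {}
    for term, location in postings:
        lists.setdefault(term, []).append(location)
    return sorted(lists.items())


def single_payload(postings):
    """One pair per posting: key = (term, location), value left empty."""
    return [((term, location), None) for term, location in postings]


def mixed_list(postings, chunk=4):
    """Key = first posting of a run, value = the postings that follow it."""
    pairs = []
    for start in range(0, len(postings), chunk):
        block = postings[start:start + chunk]
        pairs.append((block[0], block[1:]))
    return pairs


for name, pairs in [("full list", full_list(POSTINGS)),
                    ("single payload", single_payload(POSTINGS)),
                    ("mixed list", mixed_list(POSTINGS))]:
    print(name, len(pairs), "pairs")
```

Under this reading, the trade-off on the comparison slide becomes visible: full list has the fewest keys but must rewrite a long value on update, single payload has the most keys, and mixed list sits in between.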

  25. A qualitative comparison of these storage schemes: • Index size • Zig-zag joins • Hot updates

  26. Zig-zag join using ordered indexes. First list: 1 2 3 4 7 9 18. Second list: 1 7 9 11 12 17 19.
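A zig-zag join intersects two sorted lists by repeatedly seeking forward in whichever list is behind, instead of scanning every element. The sketch below is a minimal illustration on the two lists from the slide; it uses binary search from Python's standard `bisect` module as a stand-in for an ordered-index seek, and the function name `zigzag_join` is hypothetical.

```python
from bisect import bisect_left


def zigzag_join(a, b):
    """Intersect two sorted lists, skipping ahead with an index seek on each side."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in a to the first element >= b[j], skipping smaller entries.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in b to the first element >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result


first = [1, 2, 3, 4, 7, 9, 18]
second = [1, 7, 9, 11, 12, 17, 19]
print(zigzag_join(first, second))  # [1, 7, 9]
```

After matching 1, the join seeks directly from 2 to 7 in the first list and from 11 to 19 in the second, which is the skipping that an ordered index makes cheap.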

  27. Experimental results (using mixed list)

  28. Table 5: Mixed-list scheme index sizes. Only one posting was generated for all the occurrences of a word in a page.

  29. Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

  30. Two problems that must be addressed when building an inverted index on a distributed architecture: • Page distribution: the question of when and how to distribute pages to the indexing nodes. • Collecting global statistics: the question of where, when, and how to compute and distribute global statistics.

  31. Two strategies for page distribution: • A priori distribution. • Runtime distribution.

  32. Three advantages of runtime distribution: • Space. • Load balancing. • Effective pipelining.

  33. Collecting global statistics • A dedicated server known as the statistician. • Parallel computation. • Minimizes the number of conversations among servers. • Avoids extra disk I/O. • Reduces network overhead.

  34. Two strategies for sending information to the statistician: • ME Strategy: sending local information during merging. • FL Strategy: sending local information during flushing.
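The role of the statistician can be sketched as a server that merges per-indexer summaries into one global table. Everything specific here is an assumption made for illustration: the slides do not say which statistic is collected, so the sketch uses document frequency (the number of pages containing a term); the `Statistician` class and `local_doc_freq` helper are hypothetical; and the sketch does not model the timing difference between the ME and FL strategies, only the fact that each indexer sends local information once.

```python
from collections import Counter


def local_doc_freq(pages):
    """On an indexer: count, for each term, how many local pages contain it."""
    counts = Counter()
    for text in pages.values():
        counts.update(set(text.split()))  # count each term once per page
    return counts


class Statistician:
    """Dedicated server that accumulates local summaries into global statistics."""

    def __init__(self):
        self.doc_freq = Counter()

    def receive(self, local_counts):
        # One message per indexer keeps the number of conversations small.
        self.doc_freq.update(local_counts)


# Two indexers, each holding a disjoint subset of the example pages.
indexer_1 = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig"}
indexer_2 = {3: "Dog Cat Fish Dog"}

statistician = Statistician()
for pages in (indexer_1, indexer_2):
    statistician.receive(local_doc_freq(pages))

print(dict(statistician.doc_freq))
```

Because each indexer computes its summary from data it already holds in memory, sending it while merging or flushing avoids a separate pass over the index on disk.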

  35. Comparison
