
On the feasibility of Peer-to-Peer Web Indexing and Search


Presentation Transcript


  1. On the feasibility of Peer-to-Peer Web Indexing and Search J. Li, B. Loo, J. Hellerstein, M. Kaashoek, D. Karger, R. Morris Presented by: Ranjit R.

  2. Briefly • P2P full-text keyword search. • Two classes of keyword search: • Flooding. • Intersection of index lists in DHTs. • Feasibility analysis: • P2P networks cannot make naïve use of either of the above techniques due to resource constraints. • Paper presents: • Optimizations & compromises for P2P search on DHTs. • Performance. • Concludes that these optimizations bring the problem to within an order of magnitude of feasibility, i.e., bring costs down close to an optimistic per-query budget.

  3. Motivation for P2P Web search • Stress test for P2P infrastructures. • Resistance to censorship. • Robustness. • Infeasibility of existing P2P keyword-based search systems: • Gnutella, KaZaA. • Both use flooding → performance problems (see [6]). • A DHT-based system [17] proposes full-text keyword search over 10^5 documents (but there are 5.5 x 10^9 documents on the Web [5]). • Key question: • Will P2P Web search work?

  4. Issues to ponder • Size of the problem: • Size of the Web index? • Rate of submission of Web search queries? • Resource constraints: • Communication costs. • Available disk space on peers. • Goal of this paper: • Evaluate the fundamental costs of and constraints on P2P Web search.

  5. Basics of Web search • Inverted index: • Two parts: • Index of terms: • Sorted list of the distinct terms in the document collection. • Posting list for each term: • List of documents that contain the term. • Complexity: • Term lookup: O(log N), where N is the number of distinct terms.

  6. Basics of Web search (Contd.) • Consider the following two documents: • D1: The GDP increased 2 percent this quarter. • D2: The spring economic slowdown continued to spring downwards this quarter. • An inverted index for these two documents is given below: • 2 → [D1] • continued → [D2] • downwards → [D2] • economic → [D2] • GDP → [D1] • increased → [D1] • percent → [D1] • quarter → [D1] [D2] • slowdown → [D2] • spring → [D2] • the → [D1] [D2] • this → [D1] [D2] • to → [D2] • (A small code sketch of building this index follows below.)
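A minimal Python sketch of building such an inverted index (the tokenization and lower-casing are illustrative simplifications, not taken from the paper):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,")].add(doc_id)
    # Keep posting lists sorted so lookups and intersections are cheap.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {
    "D1": "The GDP increased 2 percent this quarter.",
    "D2": "The spring economic slowdown continued to spring downwards this quarter.",
}
index = build_inverted_index(docs)
print(index["quarter"])  # ['D1', 'D2']
```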

  7. Basics of Web search (Contd.) • Ranking factors: • Importance of the documents. • Frequency of the search terms in the document. • How close together the terms occur within the document.

  8. Constraints • Workload: • Google indexes ~3 billion Web docs and serves ~1,000 queries per second ([2]). • ~1,000 words per doc. • Keys (doc-IDs) in the index = 3 x 10^12. • DHT scenario: • Each doc-ID is a 20-byte SHA-1 hash. • Total inverted index size = 6 x 10^13 bytes.

  9. Constraints (Contd.) • Fundamental constraints: • Storage constraints: • How much storage per peer to store part of the index? • ~1 GB/peer → 60,000 PCs (with no compression of the index). • Communication constraints: • What is an optimistic communication budget per query? • Total bandwidth consumed by queries ≤ the Internet's capacity. • ~100 gigabits/s in US Internet backbones in 1999 [7]. • 1,000 queries per second. • Assume Web search may consume 10% of Internet capacity. • From the above, the budget per query is 10 megabits ≈ 1 MB. (The arithmetic is worked below.)
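A back-of-the-envelope check of these numbers (the figures are from the slides; the script itself is only illustrative):

```python
# Workload (slide 8): 3 billion docs, ~1,000 words per doc, 20-byte doc-IDs.
docs = 3e9
postings = docs * 1000               # 3e12 (term, doc-ID) entries in the index
index_bytes = postings * 20          # 6e13 bytes, uncompressed

# Storage (slide 9): ~1 GB of index stored per peer.
peers_needed = index_bytes / 1e9     # ~60,000 PCs

# Communication budget: 10% of a 100 Gbit/s backbone shared by 1,000 queries/s.
bits_per_query = 0.10 * 100e9 / 1000     # 10 megabits per query
bytes_per_query = bits_per_query / 8     # ~1.25 MB, which the slide rounds to ~1 MB

print(index_bytes, peers_needed, bytes_per_query)
```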

  10. Naïve Approaches • Naïve implementations of P2P text search: • Partition-by-document. • Partition-by-keyword. • Partition-by-document: • E.g. Gnutella, KaZaA. • Divide documents among hosts. • Each peer maintains a local inverted index of the documents it is responsible for. • Query approach: flooding. Return highly ranked doc(s). • Cost: • 60,000 peers. • Flood to each peer → 60,000 packets. • Each packet ~100 bytes. • Total bandwidth consumed = 6 MB.

  11. Naïve Approaches (Contd.) • Partition-by-keyword: • Responsibility for words is divided among peers, i.e. each peer stores the posting list for the word(s) it is responsible for. • A query for one or more terms requires postings to be sent over the network. • Two-term queries: • The smaller posting list is sent to the holder of the larger one, which performs the intersection and returns the highly ranked doc(s). (See the intersection sketch below.)
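A sketch of the two-term query path under partition-by-keyword, assuming sorted posting lists of doc-IDs (the lists and the omitted peer/network plumbing are illustrative):

```python
def intersect_sorted(smaller, larger):
    """Merge-style intersection of two sorted posting lists."""
    result, i, j = [], 0, 0
    while i < len(smaller) and j < len(larger):
        if smaller[i] == larger[j]:
            result.append(smaller[i]); i += 1; j += 1
        elif smaller[i] < larger[j]:
            i += 1
        else:
            j += 1
    return result

# The peer holding the shorter list ships it to the peer holding the longer one,
# which intersects, ranks, and returns the top documents.
term1_postings = [2, 5, 9, 17, 42]       # shorter list: sent over the network
term2_postings = [1, 2, 3, 9, 20, 42]    # longer list: stays local
print(intersect_sorted(term1_postings, term2_postings))  # [2, 9, 42]
```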

  12. Naïve Approaches (Contd.) • Partition-by-keyword: • Cost: • 81,000 queries to a search engine at mit.edu. • 40% one term, 35% two terms, 25% three or more terms. • mit.edu has 1.7 million web pages. • The average query moved 300,000 bytes of postings over the network. • Scaling to the size of the Web indexed by Google (3 billion pages) → 530 MB of postings moved per query. (Worked below.)
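The scaling step, worked out (all figures are the slide's):

```python
avg_bytes_per_query = 300_000      # postings moved per query against the mit.edu index
scale = 3e9 / 1.7e6                # Web pages indexed by Google vs. pages at mit.edu
print(avg_bytes_per_query * scale) # ~5.3e8 bytes, i.e. roughly 530 MB per query
```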

  13. Naïve Approaches (Contd.) • Improve upon which approach? • Partition-by-document • bandwidth/query = 6 MB, or • Partition-by-keyword? • bandwidth/query = 530 MB • Authors chose: • Partition-by-keyword (530 MB) • Reason: • To capitalize on vast research on inverted index intersection.

  14. Optimizations for Partition-by-keyword • Scenario: • Query trace of 81,000 queries on a data set of 1.7 million web pages from mit.edu. • Caching and pre-computation: • Caching: • Avoid re-fetching postings for repeated queries. • Reduced communication costs by 38%. • Pre-computation: • Compute and store intersections of posting lists in advance. • Not feasible to compute intersections of all term pairs. • Compute intersections of all pairs of popular query terms (query popularity is Zipf-distributed). • Savings: 50%. (A caching sketch follows below.)
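A minimal sketch of the caching idea: a peer remembers intersections it has already computed, keyed by the unordered term pair (the toy posting lists are illustrative). Pre-computation is essentially the same table filled ahead of time for popular term pairs.

```python
# Toy posting lists standing in for the per-term lists stored in the DHT.
POSTINGS = {"spring": [2, 5, 9], "quarter": [1, 2, 9, 20]}

intersection_cache = {}  # keyed by the unordered pair of terms

def cached_intersection(t1, t2):
    key = frozenset((t1, t2))
    if key not in intersection_cache:
        # Cache miss: this is where postings would cross the network.
        intersection_cache[key] = sorted(set(POSTINGS[t1]) & set(POSTINGS[t2]))
    return intersection_cache[key]

print(cached_intersection("spring", "quarter"))  # postings fetched once
print(cached_intersection("quarter", "spring"))  # answered from the cache
```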

  15. Optimizations (Contd.) • Compression: • Reduce communication cost. • Approaches: • Bloom filters. • Gap compression. • Adaptive Set Intersection. • Clustering.

  16. Optimizations (Contd.) • Bloom filters (Intro.) • A probabilistic data structure for quickly testing membership in a large set, using multiple hash functions over a single array of bits.

  17. Optimizations (Contd.) • Bloom filters (Intro.) • Network Applications of Bloom Filters: A Survey. Broder et al.

  18. Optimizations (Contd.) • Bloom filters • Efficient Peer-to-Peer Keyword Searching. Reynolds et al.

  19. Optimizations (Contd.) • Bloom filters: • Represent a set compactly. • At the cost of some probability of false positives. • Two-round Bloom intersection: • Compression ratio of 13. • Four-round Bloom intersection: • Compression ratio of 40. (The more rounds, the fewer false positives, provided the intersection set is small.) • Compressed Bloom filters: • Compression ratio of 50. (A small sketch follows below.)
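A minimal Bloom filter plus one round of Bloom-based intersection (the filter size, hash construction, and toy posting lists are assumptions for illustration; a real deployment would size the filter for its target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from a SHA-1 digest of the item.
        digest = hashlib.sha1(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Round one: peer A sends a Bloom filter of its posting list instead of the list itself;
# peer B keeps only the candidates that match (a superset of the true intersection).
a_postings = [2, 5, 9, 17, 42]
b_postings = [1, 2, 3, 9, 20, 42, 77]

filt = BloomFilter()
for doc in a_postings:
    filt.add(doc)
candidates = [doc for doc in b_postings if doc in filt]
print(candidates)  # contains {2, 9, 42}, possibly plus a few false positives
```

In the two-round scheme, B then sends the candidates back to A, which removes the false positives before ranking; additional rounds shrink the candidate set further.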

  20. Optimizations (Contd.) • Gap compression (GC): • Periodically remap doc-IDs from 160-bit hashes to integers from 1 to the number of documents, so that sorted posting lists have small gaps. • E.g. D-Gap compression of bit blocks: • Bitmap: 0001000111001111 • Set bits (0-based positions): {3, 7, 8, 9, 12, 13, 14, 15} • Run lengths: {[0], 3, 1, 3, 3, 2, 4} • Cumulative run boundaries: {[0], 2, 3, 6, 9, 11, 15}, i.e. boundary(N) = boundary(N-1) + length(run N). • Gaps can then be coded compactly (e.g. Fibonacci codes). • http://bmagic.sourceforge.net/dGap.html • Compression ratio of 30. (A gap-encoding sketch follows below.)
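A sketch of gap (delta) encoding for a sorted posting list of small integer doc-IDs; the resulting small gaps can then be fed to a compact code such as the Fibonacci coding the slide mentions (the list below reuses the bit positions from the D-Gap example):

```python
def gap_encode(sorted_doc_ids):
    """Store differences between consecutive doc-IDs instead of the IDs themselves."""
    gaps, prev = [], 0
    for doc_id in sorted_doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def gap_decode(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [3, 7, 8, 9, 12, 13, 14, 15]     # set-bit positions from the example above
gaps = gap_encode(postings)                 # [3, 4, 1, 1, 3, 1, 1, 1] -- small numbers
assert gap_decode(gaps) == postings
print(gaps)
```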

  21. Optimizations (Contd.) • Adaptive set intersection (AS): • Avoid transferring whole posting lists by exploiting their structure. • E.g. • A = {1, 3, 4, 7} and B = {8, 10, 20, 30}: max(A) = 7 < 8 = min(B), thus A ∩ B = ∅ with almost no transfer. • A = {1, 4, 8, 20} and B = {3, 7, 10, 30}: the ranges interleave, so elements of A must be transferred. • Compression ratio of 40 (with GC). • Clustering: • Statistical clustering techniques to group similar documents. • Achieves a compression ratio of 75x with GC and AS. (A sketch of the skipping idea follows below.)
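A sketch of the skipping idea behind adaptive set intersection: binary-search ahead in the other sorted list instead of scanning it, so disjoint or barely overlapping lists finish after a few probes. This is a single-machine illustration; the paper's distributed setting saves transfer of list elements, not just comparisons.

```python
from bisect import bisect_left

def adaptive_intersect(a, b):
    """Intersect two sorted lists, skipping ahead with binary search."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # jump past everything smaller than b[j]
        else:
            j = bisect_left(b, a[i], j + 1)
    return result

# Disjoint ranges are detected almost immediately: 7 < 8, so A ∩ B is empty.
print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))   # []
# Interleaved ranges force element-by-element work (and, distributed, real transfer).
print(adaptive_intersect([1, 4, 8, 20], [3, 7, 10, 30]))   # []
```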

  22. Optimizations (Contd.)

  23. Compromises • Maximum reduction in communication costs from the optimizations: 75x. • Another ~7x improvement comes from compromising result quality and P2P structure. (Target = 530x cost reduction.) • Compromising result quality: • Incremental intersection (fig. from Reynolds et al.) plus ranking functions for results. (A sketch follows below.)
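A sketch of incremental intersection in the spirit of Reynolds et al.: postings are consumed in rank-ordered chunks and the search stops once enough top results are found, trading recall for bandwidth. The chunk size, ranking order, and toy lists are all illustrative.

```python
def incremental_intersect(ranked_postings, other_postings, wanted=3, chunk=4):
    """Intersect rank-ordered postings chunk by chunk, stopping early."""
    other = set(other_postings)          # held by the remote peer in the real protocol
    results = []
    for start in range(0, len(ranked_postings), chunk):
        # Ship only the next chunk of the highest-ranked postings.
        for doc in ranked_postings[start:start + chunk]:
            if doc in other:
                results.append(doc)
        if len(results) >= wanted:       # enough hits for a first results page
            break
    return results[:wanted]

# Doc-IDs listed in decreasing rank order (illustrative).
a = [42, 9, 17, 2, 5, 31, 70, 8]
b = [9, 42, 3, 2, 20, 77, 8]
print(incremental_intersect(a, b))  # usually answered after the first chunk or two
```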

  24. Compromises (Contd.) • Compromising P2P structure: • Exploit the Internet's aggregate bandwidth, e.g. by replicating the entire inverted index with one copy per ISP.

  25. Conclusion • Feasibility analysis for P2P web search. • Naïve search implementation not feasible. • Obtain feasibility through a combination of optimizations and compromises.
