
Result Merging in a Peer-to-Peer Web Search Engine MINERVA


Presentation Transcript


  1. Result Merging in a Peer-to-Peer Web Search Engine MINERVA
     Master thesis project
     Speaker: Sergey Chernov
     Supervisor: Prof. Dr. Gerhard Weikum
     Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group

  2. Overview
     1 Result merging problem in the MINERVA system
     2 Selected merging strategies: GIDF, ICF, CORI, LM
     3 Our approach: result merging with the preference-based language model
     4 Summary and future work

  3. Problems of present Web search engines
     • Size of the indexable Web:
       • The Web is huge; it is difficult to cover all of it
       • Timely re-crawls are required
       • Deep Web
     • Monopoly of Google:
       • Controls 80% of web search requests
       • Sites may be censored by the engine
     • Make use of Peer-to-Peer technology:
       • Exploit previously unused CPU/memory/disk power
       • Keep up-to-date results for small portions of the Web
       • Conquer the Deep Web with specialized web crawlers

  4. MINERVA project
     • MINERVA is a Peer-to-Peer Web search engine
     [Architecture figure: peers Pi run local search engines with indexes Si built on locally crawled pages (crawlers Ci); peers exchange representative statistics through a distributed directory based on the Chord protocol, arranged as a Chord ring]
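
To make the directory part of the figure concrete, here is a small illustrative sketch (not code from the thesis) of how a peer might publish per-term statistics into a Chord-style distributed hash table: each term is hashed onto the identifier ring, and the node responsible for that key stores the per-peer statistics used later for database selection and merging. The names ChordDirectory, chord_key, publish, and lookup are hypothetical, and real Chord routing is replaced by a plain dictionary.

    import hashlib
    from collections import defaultdict

    RING_BITS = 32  # identifier space of the ring: 2**32 positions

    def chord_key(term):
        """Hash a term onto the Chord identifier ring."""
        digest = hashlib.sha1(term.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** RING_BITS)

    class ChordDirectory:
        """Toy stand-in for the distributed directory: ring key -> per-peer term statistics."""
        def __init__(self):
            self.entries = defaultdict(list)  # key -> list of (peer_id, df, collection_size)

        def publish(self, peer_id, term, df, collection_size):
            self.entries[chord_key(term)].append((peer_id, df, collection_size))

        def lookup(self, term):
            return self.entries[chord_key(term)]

    # Each peer posts summary statistics for the terms in its local index;
    # a querying peer later looks them up for database selection and merging.
    directory = ChordDirectory()
    directory.publish("P1", "merging", df=1200, collection_size=50000)
    directory.publish("P2", "merging", df=300, collection_size=8000)
    print(directory.lookup("merging"))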

  5. Query processing in distributed IR
     [Pipeline figure: <P, q> --Selection--> <P', q> --Retrieval--> <<R1, R2, R3>, q> --Merging--> RM]
     • q – query, P – set of peers, P' – subset of peers most "relevant" for q
     • Ri – ranked result list of documents from peer Pi', RM – merged result list of documents
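
The following is a minimal end-to-end sketch of the three phases in Python, assuming a toy Peer class with hypothetical estimate_relevance and local_search methods; the scoring inside them is a placeholder, not the scoring used in MINERVA.

    class Peer:
        """Minimal stand-in for a MINERVA peer with a local search engine."""
        def __init__(self, peer_id, documents):
            self.peer_id = peer_id
            self.documents = documents  # {doc_id: text}

        def estimate_relevance(self, query):
            # Placeholder for the database-selection score; here: number of query-term matches.
            return sum(term in text for text in self.documents.values() for term in query.split())

        def local_search(self, query, top_n):
            # Placeholder local scoring: query-term occurrence counts per document.
            scores = {d: sum(text.split().count(t) for t in query.split())
                      for d, text in self.documents.items()}
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            return [(doc_id, score) for doc_id, score in ranked[:top_n] if score > 0]

    def select_peers(peers, query, k=3):
        """Selection: <P, q> -> <P', q>."""
        return sorted(peers, key=lambda p: p.estimate_relevance(query), reverse=True)[:k]

    def merge(result_lists, top_k=10):
        """Merging: <R1, ..., Rn, q> -> RM. Naively sorts by raw local scores,
        which is exactly what the merging methods on the following slides improve on."""
        all_hits = [hit for results in result_lists for hit in results]
        return sorted(all_hits, key=lambda hit: hit[1], reverse=True)[:top_k]

    # The three phases, end to end:
    peers = [Peer("P1", {"d1": "peer to peer web search", "d2": "result merging strategies"}),
             Peer("P2", {"d3": "chord protocol ring", "d4": "web search result merging"})]
    selected = select_peers(peers, "web search merging", k=2)                   # Selection
    result_lists = [p.local_search("web search merging", 5) for p in selected]  # Retrieval
    print(merge(result_lists))                                                  # Merging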

  6. Naive merging approaches
     • How can we combine results from peers?
     • 1. Retrieve the k documents with the highest similarity scores
       Problem: scores are incomparable across peers
     • 2. Take the same number of documents from each peer
       Problem: ignores differences in database quality
     • 3. Fetch the best documents from the peers, re-rank them, and select the top k
       Problem: a good solution, but how do we compute the final scores?
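
To see why raw similarity scores are incomparable (problem 1), consider a small numerical illustration, not taken from the slides: the same document with the same term frequency receives very different tf*idf scores on two peers, simply because each peer derives its IDF from its own local collection.

    import math

    def local_tfidf(tf, df, num_docs):
        """tf*idf with a collection-local IDF; a common textbook form, used only for illustration."""
        return tf * math.log(num_docs / df)

    # Identical document (tf = 3 for the query term), but different local statistics:
    score_on_peer1 = local_tfidf(tf=3, df=10,  num_docs=1_000_000)  # term is rare on a large peer
    score_on_peer2 = local_tfidf(tf=3, df=900, num_docs=1_000)      # term is frequent on a small peer

    print(score_on_peer1)  # ~34.5 -> looks "highly relevant"
    print(score_on_peer2)  # ~0.3  -> looks "barely relevant"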

  7. Result merging problem
     [Figure: peers P1, P2, P3 with overlapping document sets DB1, DB2, DB3 on the left; a single peer P holding the combined database DB1+2+3 on the right; the goal is LeftScore ≈ RightScore]
     • Objective: make scores completely comparable
     • Solution: replace all collection-dependent statistics with global ones
     • Baseline: estimate document scores as if all documents were placed in a single database
     • Difficulty: overlap between collections distorts the statistics used for score estimation
     • Methods:
       • GIDF – Global Inverted Document Frequency
       • ICF – Inverted Collection Frequency
       • CORI – merging used in the CORI system
       • LM – Language Modeling
     • Final goal: find the method that produces the most accurate scores and is most robust to overlap

  8. Selected result merging methods (1)
     • GIDF: compute the Global Inverted Document Frequency
       DFi – number of documents containing the term on peer i, |Di| – total number of documents on peer i
     • ICF: replace IDF with the Inverted Collection Frequency
       CF – number of peers whose collections contain the term, |C| – number of collections (peers) in the system
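
The formula images on this slide are not preserved in the transcript. The following is a reconstruction based only on the variable definitions above and the standard formulations of these statistics; the exact normalization or smoothing constants used in the thesis may differ.

    % Global inverted document frequency of term t over all peers (one common formulation)
    GIDF(t) = \log \frac{\sum_i |D_i|}{\sum_i DF_i(t)}

    % Inverted collection frequency: collections take the role of documents
    ICF(t) = \log \frac{|C|}{CF(t)}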

  9. Selected result merging methods (2)
     • CORI: COllection Retrieval Inference network
       DatabaseRank – obtained during the database selection step
       LocalScore – score computed with local statistics
       The constants are heuristics tuned for the INQUERY search engine
     • LM: Language Modeling
       λ – smoothing parameter, a heuristic trade-off between two models:
       P(q | document language model) and P(q | global language model built from all documents on all peers)
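
The formula images are again missing from the transcript. The CORI merging heuristic below is the standard form from the distributed IR literature (its 0.4 and 1.4 are the heuristic constants mentioned on the slide), and the LM score is the usual linearly smoothed mixture; both are reproduced from those standard sources rather than from the thesis itself, so details such as prior score normalization may differ. Here c_d denotes the collection (peer) that returned document d.

    % CORI result merging heuristic
    MergedScore(d) = \frac{LocalScore(d) + 0.4 \cdot LocalScore(d) \cdot DatabaseRank(c_d)}{1.4}

    % Smoothed language-model score: mixture of the document model and the global model G
    P(q \mid d) = \prod_{t \in q} \bigl[ \lambda \, P(t \mid d) + (1 - \lambda) \, P(t \mid G) \bigr]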

  10. Experimental setup
     • TREC 2002, 2003, and 2004 Web Track datasets
     • 4 topics
     • 50 peers, 10-15 per topic
     • Documents are replicated twice
     • 25 title queries from the topic distillation tasks of the 2002 and 2003 Web Tracks
     • 3 database selection algorithms:
       • RANDOM – random peer ordering
       • CORI – the de-facto standard
       • IDEAL – manually created ranking

  11. Experiments – CORI database ranking, all merging methods

  12. Experiments – all database rankings, the best LM merging method

  13. Experiments – IDEAL database ranking, the best LM merging method, limited statistics

  14. Preference-based language model (1)
     • 1. Execute the query on the best peer
     • 2. Assume the first top-k results are relevant (pseudo-relevance feedback)
     • 3. Estimate a preference-based language model from these top-k documents
     • 4. Compute the cross-entropy between each document's language model and the preference-based language model
     • 5. Combine this ranking with the LM merging method
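
A compact Python sketch of steps 2-5, under simplifying assumptions (unigram models, whitespace tokenization, a fixed floor probability instead of proper smoothing); the mixing weight beta and all helper names are hypothetical and not taken from the thesis.

    import math
    from collections import Counter

    def language_model(texts):
        """Unigram maximum-likelihood model over a list of texts."""
        counts = Counter(t for text in texts for t in text.lower().split())
        total = sum(counts.values())
        return {term: c / total for term, c in counts.items()}

    def cross_entropy(pref_model, doc_model, epsilon=1e-6):
        """-sum_t P(t | preference model) * log P(t | document model);
        lower means the document looks more like the pseudo-relevant documents."""
        return -sum(p * math.log(doc_model.get(term, epsilon))
                    for term, p in pref_model.items())

    def preference_based_ranking(best_peer_topk_texts, candidate_docs, lm_merge_scores, beta=0.5):
        """Steps 2-5: build the preference model from the best peer's top-k results,
        score each candidate by negative cross-entropy, and mix with its LM merging score."""
        pref_model = language_model(best_peer_topk_texts)                     # step 3
        final = {}
        for doc_id, text in candidate_docs.items():
            ce = cross_entropy(pref_model, language_model([text]))            # step 4
            final[doc_id] = beta * lm_merge_scores[doc_id] - (1 - beta) * ce  # step 5
        return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

    # Usage with toy data:
    topk = ["peer to peer web search engines", "distributed web search result merging"]
    candidates = {"d1": "result merging in distributed web search",
                  "d2": "cooking recipes for pasta"}
    lm_scores = {"d1": -4.2, "d2": -9.1}  # hypothetical LM merging scores (log-probabilities)
    print(preference_based_ranking(topk, candidates, lm_scores))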

  15. Preference-based language model (2)
     • A globally normalized similarity score and a preference-based similarity score are combined into the final result merging score, where:
       Q – query
       tk – term in Q
       G – entire document set over all peers
       Dij – document j on peer i
       U – set of pseudo-relevant documents
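
The score formulas on this slide were images and are missing from the transcript. A plausible reconstruction, combining the variable definitions above with the LM merging score from slide 9, is given below; the actual combination used in the thesis may be weighted or normalized differently, and the weight β is a hypothetical symbol.

    % globally normalized LM similarity score (cf. slide 9)
    s_{glob}(Q, D_{ij}) = \sum_{t_k \in Q} \log \bigl[ \lambda \, P(t_k \mid D_{ij}) + (1 - \lambda) \, P(t_k \mid G) \bigr]

    % preference-based similarity score: negative cross-entropy between the
    % preference model estimated on U and the document model
    s_{pref}(Q, D_{ij}) = \sum_{t_k} P(t_k \mid U) \, \log P(t_k \mid D_{ij})

    % final result merging score: linear combination of the two
    s_{final}(Q, D_{ij}) = \beta \, s_{glob}(Q, D_{ij}) + (1 - \beta) \, s_{pref}(Q, D_{ij})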

  16. Experiments – IDEAL database ranking, preference-based language model merging

  17. Conclusions
     • All merging algorithms are very close in absolute retrieval effectiveness
     • Language modeling methods are more effective than TF*IDF-based methods
     • Limited statistics are a reasonable choice in a peer-to-peer setting
     • Pseudo-relevance feedback from the topically organized collections slightly improves retrieval quality
