
Query Routing in Peer-to-Peer Web Search Engine

Presentation Transcript


  1. Query Routing in Peer-to-Peer Web Search Engine International Max Planck Research School for Computer Science Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender

  2. Talk Outline • Motivation • Proposed Search Engine architecture • Query routing and database selection • Similarity-based measures • Example: GlOSS • Document-frequency-based measures • Example: CORI • Evaluation of methods • Proposals • Conclusion

  3. Problems of present Web search engines • Size of the indexable Web: • the Web is huge, it is difficult to cover it all • timely re-crawls are required • technical limits • the Deep Web • Monopoly of Google: • controls 80% of web search requests • paid sites get updated more frequently and receive higher ratings • sites may be censored by the engine

  4. Make use of Peer-to-Peer technology • Exploit previously unused CPU/memory/disk power • Provide up-to-date results for small portions of the Web • Conquer the Deep Web with personalized and specialized web crawlers • The global directory must be shared among the peers! [Figure: crawlers feed a Chord ring that serves as the global directory, storing for each keyword (e.g. computer, elephant, cancer) a ranking of peer usefulness (richness)]
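A minimal sketch (not from the talk) of how such a keyword-level global directory can be partitioned over a Chord-style ring: each keyword is hashed to a ring position and the clockwise successor peer stores that keyword's ranking of peer richness. The peer names, hash function, and ring size below are illustrative assumptions.

    import hashlib
    from bisect import bisect_left

    RING_BITS = 32                      # illustrative ring size, not from the talk

    def ring_position(key):
        """Hash a string onto the Chord identifier ring."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** RING_BITS)

    def responsible_peer(keyword, peers):
        """Return the peer whose ring position is the clockwise successor of the keyword."""
        positions = sorted((ring_position(p), p) for p in peers)
        idx = bisect_left(positions, (ring_position(keyword), ""))
        return positions[idx % len(positions)][1]   # wrap around the ring

    peers = ["Peer1", "Peer2", "Peer3", "Peer4"]
    for term in ("computer", "elephant", "cancer"):
        print(term, "->", responsible_peer(term, peers))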

  5. Query routing • Goal: find peers with relevant documents • Previously known as the Database Selection Problem • Not all existing techniques are applicable to P2P

  6. Database Selection Problem • 1st inference: Is this document relevant? • It is a subjective user judgment that we model • We use only representations of user needs and documents (keywords, inverted indices) • 2nd inference: A database is potentially able to satisfy the query if it • has many documents (size-based naive approach) • has many documents containing all query words • has a high number of such documents above a given similarity • has a high summed similarity of such documents

  7. Measuring usefulness • The number of documents with all query words is unknown • no full document representations available • only database summaries (representatives) • The 3rd inference (usefulness) is built on top of the previous two • Steps of database selection (see the sketch below): • rely on sensible 1st and 2nd inferences • choose database representatives for the 3rd inference • calculate usefulness measures • choose the most useful databases
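The selection steps above can be condensed into a short sketch; the names are illustrative and 'usefulness' stands in for any of the measures discussed on the following slides.

    def select_databases(query_terms, representatives, usefulness, k=10):
        """Rank databases by a usefulness measure computed from their
        summaries (representatives) and return the k most promising ones."""
        scores = {db: usefulness(query_terms, summary)
                  for db, summary in representatives.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]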

  8. Similarity-based measures • Definition: usefulness is the sum of document similarities that exceed a threshold l • Simplest case: the summed weight of query terms across the collection • no assumptions about word co-occurrence • l = 0
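Spelled out from the definition above (a reconstruction, with sim denoting the retrieval model's query-document similarity):

    \mathrm{Usefulness}(l, q, db) \;=\; \sum_{\substack{d \in db \\ \mathrm{sim}(q, d) > l}} \mathrm{sim}(q, d)

With l = 0 and no co-occurrence assumptions, this reduces to the summed weight of the query terms over the whole collection:

    \mathrm{Usefulness}(0, q, db) \;=\; \sum_{t \in q} \sum_{d \in db} w(t, d)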

  9. GlOSS • High-correlation assumption: • sort all n query terms Ti in descending order of their DFs • DFn → Tn, Tn-1, … , T1 • DFn-1 - DFn → Tn-1, Tn-2, … , T1 • … • DF1 - DF2 → T1 • Use averaged term weights to calculate document similarity • l > 0 • l is query dependent • l is collection dependent • usually because local IDFs differ • Proposal: use global term importance • Usually l is set to 0 in experiments
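A hedged sketch of the high-correlation estimate described above, assuming the database summary provides per-term document frequencies (df) and average term weights (avg_w); the exact weighting details in the GlOSS papers may differ.

    def gloss_estimate(query_terms, df, avg_w, l=0.0):
        # Sort query terms by descending DF: T1 has the largest DF, Tn the smallest.
        terms = sorted(query_terms, key=lambda t: df[t], reverse=True)
        n = len(terms)
        usefulness = 0.0
        for i in range(n - 1, -1, -1):
            # DF(T_{i+1}) - DF(T_{i+2}) documents are assumed to contain exactly T1 .. T_{i+1}.
            group_size = df[terms[i]] - (df[terms[i + 1]] if i + 1 < n else 0)
            est_sim = sum(avg_w[t] for t in terms[:i + 1])   # averaged-weight similarity estimate
            if group_size > 0 and est_sim > l:
                usefulness += group_size * est_sim
        return usefulness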

  10. Problems of similarity-based measures • Is this inference good? • A few highly scored documents and many low-scored documents are regarded as equal • Proposal: sum only the first K similarities • Highly scored documents can be a bad indicator of usefulness • most relevant documents have moderate scores • highly scored documents can be non-relevant
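A minimal sketch of the top-K proposal; K is a tuning parameter and the per-document similarity estimates are assumed to come from the database summary.

    def topk_usefulness(estimated_sims, k=10):
        """Sum only the K highest estimated document similarities."""
        return sum(sorted(estimated_sims, reverse=True)[:k])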

  11. Document-frequency-based measures • Don't use term frequencies (actual similarities) • Exploit document frequencies only • Exploit a global measure of term importance • average IDF • ICF (inverse collection frequency) = log(|C| / CF) • Main assumption: documents with rare terms • have more meaning for the user • most likely contain the other query terms

  12. CORI: using TF·IDF normalization • DF: document frequency of query term • DFMAX: maximum document frequency among all terms in collection • CF: number of collections containing query term • |C|: number of collections in the system [Figure: the CORI scoring formula]
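For reference, the commonly published CORI belief formula (Callan et al.) is sketched below in this slide's notation; judging by the DFMAX symbol, the talk's own variant normalizes DF differently, so treat this as the textbook form rather than the slide's exact formula. Here cw_i is the number of term occurrences in collection C_i and avg_cw their average over all collections.

    T = \frac{DF}{DF + 50 + 150 \cdot \frac{cw_i}{avg\_cw}}, \qquad
    I = \frac{\log\left(\frac{|C| + 0.5}{CF}\right)}{\log\left(|C| + 1.0\right)}, \qquad
    \mathrm{belief}(t, C_i) = 0.4 + 0.6 \cdot T \cdot I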

  13. CORI issues • Pure document frequencies make CORI better • the fewer the statistics, the simpler • smaller variance • it better estimates the ranking, not the actual database summaries • no use of document richness • To be normalized or not to be? • small databases are not necessarily better • a collection may specialize well in several topics

  14. Using usefulness measures: example directory statistics with |C| = 1000; term "Information": CF = 120; term "Retrieval": CF = 40
  Information: Peer1 (DF = 20, avg_tf = 12, DFmax = 60), Peer2 (DF = 60, avg_tf = 6, DFmax = 400), Peer3 (DF = 20, avg_tf = 15, DFmax = 60)
  Retrieval: Peer1 (DF = 5, avg_tf = 8, DFmax = 60), Peer2 (DF = 10, avg_tf = 4, DFmax = 400), Peer3 (DF = 5, avg_tf = 10, DFmax = 60)
  [Figure: the resulting per-term peer rankings for these statistics]
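As a worked illustration (not necessarily the measure used on the slide), a plain DF·ICF score over these statistics ranks the peers as follows; the dictionary layout mirrors the table above.

    import math

    C = 1000                                    # collections in the system
    cf = {"information": 120, "retrieval": 40}  # collections containing each term

    stats = {   # per peer: term -> (DF, avg_tf, DFmax), copied from the table above
        "Peer1": {"information": (20, 12, 60),  "retrieval": (5, 8, 60)},
        "Peer2": {"information": (60, 6, 400),  "retrieval": (10, 4, 400)},
        "Peer3": {"information": (20, 15, 60),  "retrieval": (5, 10, 60)},
    }

    def df_icf(peer_stats):
        return sum(df * math.log(C / cf[t]) for t, (df, _tf, _dfmax) in peer_stats.items())

    ranking = sorted(stats, key=lambda p: df_icf(stats[p]), reverse=True)
    print(ranking)   # ['Peer2', 'Peer1', 'Peer3'] (Peer1 and Peer3 tie under this measure)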

  15. Analysis of experiments • CORI is the best, but • only when choosing more than 50 out of 236 databases • only 10% better when choosing more than 90 databases • Test collections are strange • chronologically or even randomly separated documents • no topic specificity • no actual Web data used • no overlap among collections • Experiments are unrealistic, so it is unclear • which method is better • whether there is any satisfactory method at all

  16. Possible solutions • Most of the measures can be unified in one framework • We can play with it and try • various normalization schemes • different notions of term importance (ICF, local IDF) • using statistics of the top documents • changing the power of the factors • DF·ICF^4 performs no worse than CORI • changing the form of the expression [Figure: the GlOSS and CORI formulas]
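A hedged sketch of the unified-framework idea: one scoring template whose normalization, notion of term importance, and exponent can all be varied; the parameter names are mine, not the talk's, and the DF·ICF^4 variant mentioned above is one instantiation.

    import math

    def usefulness(query_terms, df, importance, power=1.0, normalize=lambda x: x):
        """Generic DF-based usefulness: sum over the query terms of
        normalize(DF(term)) * importance(term) ** power."""
        return sum(normalize(df[t]) * importance(t) ** power for t in query_terms)

    # e.g. the DF * ICF^4 variant, with C collections and cf[t] of them containing t:
    # score = usefulness(query, df, importance=lambda t: math.log(C / cf[t]), power=4)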

  17. Conclusion What has been done: • measures were analytically evaluated • a sensible subset of measures was chosen • the measures were implemented What could be done next: • carry out new, more realistic experiments • choose an appropriate usefulness measure • experiment with database representatives • build our own measure • try to exploit collection metadata • bookmarks, authoritative documents, collection descriptions

  18. Thank you for your attention!
