
P2P Content Search: Give the Web Back to the People

IPTPS 2006 - The 5th International Workshop on Peer-to-Peer Systems, Santa Barbara, California, USA, February 27-28, 2006. Outline of the talk: Feasibility of P2P Web Search, Problem Statement, Learning from Queries, Exploiting Correlation.


Presentation Transcript


  1. P2P Content Search: Give the Web Back to the People
     IPTPS 2006 - The 5th International Workshop on Peer-to-Peer Systems
     Santa Barbara, California, USA, February 27-28, 2006
     Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum (Max-Planck-Institut for Informatics, Saarbrücken, Germany)
     Peter Triantafillou (University of Patras, Greece)
     Outline of the Talk:
     • Feasibility of P2P Web Search
     • Problem Statement
     • Learning from Queries
     • Exploiting Correlation
     • Experiments

  2. P2P and Web Search: Marriage in Heaven
     Li, Loo, Hellerstein, Kaashoek, Karger, Morris questioned the "Feasibility of Peer-to-Peer Web Indexing and Search" (IPTPS 2003).
     But: the authors assume distribution of the full term-document index → non-scalable!
     Better: a light-weight approach with a distributed term-peer directory.
     Variety of projects following this line: PlanetP (Rutgers), Pepper (CMU), Galanx (Wisconsin), Odissea (Brooklyn), Minerva (MPII), and others.
     P2P Web Search has potential advantages:
     • Highly distributed data
     • Better processing power

  3. Architectural Model
     • Peers are connected by an overlay network (e.g. DHT, random graph) and IP
     • Each peer has a full-fledged local search engine (with crawler / importer, indexer, query processor)
     • Each peer has autonomously compiled (e.g. crawled) its own content according to the user's thematic interests → peer-specific collections
     • When a query is issued by a peer, it is first executed locally and then possibly routed to carefully selected other peers
     • Peers can post summaries / synopses / metadata / QoS info to a (distributed) network-wide directory with efficient per-key lookup

  4. Minerva System Architecture
     • Built on top of a scalable, churn-resilient DHT
     • Conceptually global but physically distributed meta-data directory
     • Query routing driven by statistics on peer quality
     [Figure: peers P1 to P8 in the overlay. Labels: peer ranking and statistics, peer lists, query peer, local index. Directory entries shown: term a: P1,P4,P8; term b: P3,P5,P8; term c: P2,P4,P6; term d: P1,P3; term e: P1,P2,P5; term f: P2,P4,P6.]
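
A minimal sketch of how a term could be mapped to the directory peer responsible for it. The slide only states that the directory sits on a DHT; the successor-on-a-ring rule, the 16-bit identifier space, and the names `ring_position` and `Ring` are assumptions made for illustration, not Minerva's actual implementation.

```python
import hashlib
from bisect import bisect_left


def ring_position(name, bits=16):
    """Hash a peer name or a term onto a small identifier ring (assumed size)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class Ring:
    """Hypothetical DHT view: each key belongs to the next peer clockwise."""

    def __init__(self, peer_names):
        self.points = sorted((ring_position(p), p) for p in peer_names)
        self.positions = [pos for pos, _ in self.points]

    def responsible_peer(self, term):
        # First peer whose position is >= the term's position, wrapping around.
        index = bisect_left(self.positions, ring_position(term)) % len(self.points)
        return self.points[index][1]


peers = [f"P{i}" for i in range(1, 9)]
ring = Ring(peers)
directory_peer_for_native = ring.responsible_peer("native")
```

Every peer that knows the peer set computes the same owner for a term, which is what makes the directory conceptually global while its entries stay physically distributed.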

  5. Problem Statement
     Example query q: "native american music"
     • Ask the global directory for three single-term PeerLists
     • Combine them into a single PeerList for the complete query
     • Ask the top peers for their best documents
     • Combine all documents into a single result list
     What can happen?
     • Great results: top peers for q are selected!
     • Bad results: selected peers are good for individual terms, mediocre for the complete query.
     [Figure: peer Pq holds Doc1 "american music" and Doc2 "native american" and posts "native", "american" and "music" to directory peers Pi, Pj and Pk. Directory entries: Pi native: P27, P4, P8, P112, P36, ...; Pj american: P1, P4, P18, P108, P25, ...; Pk music: P13, P4, P88, P36, ...]
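
The four routing steps above can be sketched as follows. The peer names come from the slide's example, but the quality scores are invented, and summing per-term scores is only one plausible way to "combine" PeerLists; the slide does not specify the combination function.

```python
from collections import defaultdict

# Hypothetical directory: term -> {peer: quality score}. Scores are invented.
DIRECTORY = {
    "native": {"P27": 0.9, "P4": 0.6, "P8": 0.5},
    "american": {"P1": 0.8, "P4": 0.7, "P18": 0.4},
    "music": {"P13": 0.9, "P4": 0.5, "P88": 0.3},
}


def route_query(terms, directory, k=2):
    """Sum the per-term scores of each peer and return the k best peers."""
    combined = defaultdict(float)
    for term in terms:
        for peer, score in directory.get(term, {}).items():
            combined[peer] += score
    # Highest combined score first; ties broken by peer name for determinism.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return [peer for peer, _ in ranked[:k]]


top_peers = route_query(["native", "american", "music"], DIRECTORY)
```

P4 wins because it appears in all three lists, yet nothing in these single-term statistics says whether its documents contain the three terms together, which is exactly the failure mode the slide warns about.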

  6. Problem: Term Correlations
     Queries with correlated or specifically "associated" termsets:
     • "Michael Jordan", "Lake Superior", "Bell Labs", "hurricane Katrina", "Native American Music", "PhD admission", "black magic", "ice hockey Honolulu", "Natalya Kournikova"
     Architectural compromise:
     • Best peers for $q = \{t_1, \dots, t_{|q|}\}$ may not be in $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ and possibly not even in $\bigcup_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$
     • Also possible: $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ is empty!
     • Name and phrase recognition helps but is insufficient
     • Lack of correlation-awareness is standard in IR, but more severe in P2P because of the peer-granularity directory
     Consider correlated termsets for query routing!
     The solution:
     • Special handling of correlated termsets as termset posts in the directory, but...
     • ... efficiency & scalability are critical!
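
A small illustration of the compromise described above, with invented statistics: peer P9 is assumed to be the best peer for the complete query, but it ranks low for each individual term and therefore drops out of every top-k list.

```python
def top_k(peer_scores, k):
    """Return the set of the k highest-scoring peers for one term."""
    ranked = sorted(peer_scores.items(), key=lambda item: (-item[1], item[0]))
    return {peer for peer, _ in ranked[:k]}


# Invented per-term statistics; P9 is assumed to hold the documents
# in which both terms actually co-occur.
PER_TERM = {
    "michael": {"P1": 0.9, "P2": 0.8, "P9": 0.3},
    "jordan": {"P3": 0.9, "P4": 0.8, "P9": 0.3},
}
best_peer_for_query = "P9"

lists = [top_k(scores, k=2) for scores in PER_TERM.values()]
intersection = set.intersection(*lists)  # peers in every top-k list
union = set.union(*lists)                # peers in at least one top-k list
```

Here the intersection is empty and P9 is missing even from the union, so no combination of the single-term top-k lists can route the query to it.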

  7. Critical Issues... and what remains to be done?
     • How to decide that a termset is correlated?
     • How to store termset posts in the directory?
     • How to exploit termset posts for queries?

  8. Possible Approaches
     Extraction of all possible term pairs out of the documents:
     • Brute-force precomputation of termset posts
     • But: quadratic explosion, and what about triples, quadruples, ...
     Possible sources of correlated termsets:
     • Names and phrases from dictionaries or thesauri → incomplete!
     • Frequent itemset mining on data → computationally expensive!
     Impossible to predict all correlated termsets of interest!
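
To make the "quadratic explosion" concrete, the number of candidate termsets of size s over a vocabulary of n terms is the binomial coefficient C(n, s). The vocabulary size of 100,000 below is an assumed figure for illustration, not a number from the talk.

```python
from math import comb


def termset_counts(vocabulary_size, max_size=4):
    """Number of distinct termsets of each size from 2 up to max_size."""
    return {size: comb(vocabulary_size, size) for size in range(2, max_size + 1)}


counts = termset_counts(100_000)  # assumed vocabulary size
pairs = counts[2]                 # n * (n - 1) / 2 candidate pairs
```

Pairs alone already number about five billion under this assumption, and each larger termset size multiplies the count by several orders of magnitude.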

  9. Our Approach... driven by "Give the Web back to the people"
     Exploit query logs to learn correlated termsets.
     Advantages of query logs:
     • Reflect the real behavior of millions of users
     • Only termsets of interest need to be learned as correlated
     • As we will see: integration into the existing architecture for free
     Queries are a gold mine!
     Looking at query logs...
     • ... to validate that logs are useful for recognizing correlated termsets
     • Excite Search Engine Log (1999) with about 2 million real web queries

  10. Learning Correlated Termsets from Queries
     • PeerList request: piggybacking the complete query
     • Directory peers remember the query as termsets
     Learning included in Query Routing
     [Figure: the query "native american music" is attached to the PeerList requests sent to the directory peers for native (P3,P5,P8), american (P1,P4,P8) and music (P2,P4,P6); the directory peers record termsets such as "american native", "music native" and "american music".]
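
A sketch of the learning step under stated assumptions: the class `DirectoryPeer`, the rule of recording every sub-termset that contains the peer's own term, and the `min_count` threshold are all hypothetical. The slide only says that the full query is piggybacked on the PeerList request and remembered as termsets.

```python
from collections import Counter
from itertools import combinations


class DirectoryPeer:
    """Hypothetical directory peer responsible for a single term."""

    def __init__(self, term, peer_list):
        self.term = term
        self.peer_list = list(peer_list)
        self.seen_termsets = Counter()

    def peerlist_request(self, full_query):
        """Answer the request and record termsets from the piggybacked query."""
        others = sorted(set(full_query) - {self.term})
        for size in range(1, len(others) + 1):
            for combo in combinations(others, size):
                self.seen_termsets[frozenset((self.term, *combo))] += 1
        return self.peer_list

    def correlated_termsets(self, min_count=2):
        """Termsets seen at least min_count times (assumed threshold)."""
        return [ts for ts, n in self.seen_termsets.items() if n >= min_count]


native = DirectoryPeer("native", ["P3", "P5", "P8"])
for query in (["native", "american", "music"],
              ["native", "american", "music"],
              ["native", "speaker"]):
    native.peerlist_request(query)
learned = native.correlated_termsets()
```

The request and the reply are the same messages the routing step needs anyway, so the learning adds local bookkeeping at the directory peer but no extra message.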

  11. Collecting and Storing Termset Posts
     • Directory peers manage termset posts
     • Posting procedure extended with termset posting
     No extra communication protocol needed!
     [Figure: a peer posts "american", "native" and the termset "american native" to the directory. Directory entries: native: P3,P5,P8; american: P1,P4,P8; music: P2,P4,P6; american native: P8.]
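
One way the extended posting procedure might look, as a sketch only. The message flow (the reply to a single-term post carries the termsets the directory peer has learned), the document-count score, and all names below are assumptions; the slide states just that posting is extended with termset posting.

```python
class DirectoryPeer:
    """Hypothetical directory peer for one term with its learned termsets."""

    def __init__(self, term, learned_termsets):
        self.term = term
        self.learned_termsets = [frozenset(ts) for ts in learned_termsets]
        self.posts = {}  # key (term or termset) -> {peer: score}

    def post(self, peer_id, key, score):
        self.posts.setdefault(key, {})[peer_id] = score
        # The reply to a term post piggybacks the termsets of interest.
        return self.learned_termsets if key == self.term else []


def local_score(documents, termset):
    """Assumed statistic: number of local documents containing all terms."""
    return sum(1 for doc in documents if termset <= set(doc.split()))


def publish(peer_id, documents, directory_peer):
    term = directory_peer.term
    score = local_score(documents, frozenset([term]))
    wanted = directory_peer.post(peer_id, term, score)
    for termset in wanted:
        termset_score = local_score(documents, termset)
        if termset_score > 0:  # only post termsets this peer can serve
            directory_peer.post(peer_id, termset, termset_score)


native_dir = DirectoryPeer("native", [{"american", "native"}])
p8_docs = ["native american music", "native american history", "music charts"]
publish("P8", p8_docs, native_dir)
```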

  12. Exploiting Termset Postings
     • Integrated in standard query execution
     • Fallback option always possible
     No additional communication round!
     [Figure: the PeerList requests for native, american and music carry the full query "native american music". Directory entries: native: P3,P5,P8,P2; american: P1,P4,P8; music: P2,P4,P6,P8; termset posts: american music native: P8; native music: P8,P4. The replies yield a PeerList for the complete query.]
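
A sketch of query execution with the fallback option, using the directory entries shown on this slide. Representing directory keys as frozensets and merging single-term lists by simple concatenation without duplicates are assumptions made for the example.

```python
# Directory entries taken from the slide; keys are termsets (assumed encoding).
DIRECTORY = {
    frozenset(["native"]): ["P3", "P5", "P8", "P2"],
    frozenset(["american"]): ["P1", "P4", "P8"],
    frozenset(["music"]): ["P2", "P4", "P6", "P8"],
    frozenset(["american", "music", "native"]): ["P8"],
    frozenset(["native", "music"]): ["P8", "P4"],
}


def peerlist_for_query(query_terms, directory):
    """Prefer a termset post for the complete query, else fall back."""
    key = frozenset(query_terms)
    if key in directory:
        return directory[key], "termset"
    merged = []
    for term in query_terms:
        for peer in directory.get(frozenset([term]), []):
            if peer not in merged:
                merged.append(peer)
    return merged, "fallback"


hit = peerlist_for_query(["native", "american", "music"], DIRECTORY)
miss = peerlist_for_query(["native", "american"], DIRECTORY)
```

When no termset post exists, the query is answered exactly as before, so the termset lookup can only improve routing and never blocks it.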

  13. No Termset for Complete Query
     • Especially for large queries
     • Covering problem!
     Integrated into Query Routing!
     [Figure: the query "a b c d e" is sent from P1 to the directory peers for a, b, c, d and e; no termset post exists for the complete query, only for subsets such as "a b c", "a b d", "b c e", "a b", "b c", "c e" and "d e".]
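
The covering problem can be approached with a greedy heuristic, sketched below. Greedy set cover is a swapped-in standard technique, not something the talk commits to (optimizing termset covers is listed as future work); the termsets are the ones visible in the slide's figure.

```python
# Termsets for which posts exist, as read from the slide's figure.
KNOWN_TERMSETS = ["a b c", "b c e", "a b d", "b c", "a b", "c e", "d e"]


def greedy_cover(query_terms, known_termsets):
    """Cover the query with known termsets, falling back to single terms."""
    remaining = set(query_terms)
    candidates = [ts for ts in known_termsets if ts <= remaining]
    cover = []
    while remaining:
        # Pick the termset covering the most still-uncovered terms.
        best = max(candidates, key=lambda ts: len(ts & remaining),
                   default=frozenset())
        if len(best & remaining) < 2:
            # No termset helps any more: use single-term PeerLists.
            cover.extend(frozenset([t]) for t in sorted(remaining))
            break
        cover.append(best)
        remaining -= best
    return cover


known = [frozenset(ts.split()) for ts in KNOWN_TERMSETS]
cover = greedy_cover("a b c d e".split(), known)
```

For the five-term query this picks "a b c" and then "d e", so two termset PeerLists replace five single-term ones.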

  14. What about Networking Costs?
     Big concern: too many messages, too high bandwidth consumption?
     Our approach is still scalable because...
     All messages are piggybacked, no extra costs!
     • Learning correlated termsets is integrated in the query routing process
     • Asking for termsets is integrated in the posting process
     • Exploiting correlated termsets in query processing is free and includes the fallback option, too
     ... It's all free!!

  15. Experimental Evaluation
     • Experiments: 750 peers with .Gov partitions (~1.2 million web documents)
     • Running 50 expanded queries from the TREC-2003 Web Track (examples: "robots research artificial" or "shipwrecks accident")
     Major Gain in Benefit / Cost

  16. Conclusion and Future Work
     • Reconcile scalability with good search-result quality
     • No extra networking costs and...
     • ... greatly improved benefit/cost for query routing and processing
     • Consider and benefit from user and community behavior
     • Optimization of termset covers for queries with many terms
     • Real-life testbed with real users!
     Thank You for Your Attention!
