Santa Barbara, California, USA

P2P Web Search: Give the Web Back to the People - PowerPoint PPT Presentation



Santa Barbara, California, USA February 27-28, 2006 IPTPS 2006 - The 5th International Workshop on Peer-to-Peer System P2P Content Search: Give the Web Back to the People Outline of the Talk Feasibility of P2P Web Search Problem Statement Learning from Queries Exploiting Correlation


Santa Barbara, California, USA

February 27-28, 2006

IPTPS 2006 - The 5th International Workshop on Peer-to-Peer Systems

P2P Content Search:

Give the Web Back to the People

Outline of the Talk

Feasibility of P2P Web Search

Problem Statement

Learning from Queries

Exploiting Correlation

Experiments

Christian Zimmer, Matthias Bender, Sebastian Michel, Gerhard Weikum

Max Planck Institute for Informatics, Saarbrücken, Germany

Peter Triantafillou

University of Patras, Greece


P2P and Web Search: Marriage in Heaven

Li, Loo, Hellerstein, Kaashoek, Karger, and Morris questioned the feasibility of P2P web search ("On the Feasibility of Peer-to-Peer Web Indexing and Search", IPTPS 2003)

But: the authors assume that the full term-document index is distributed

→ non-scalable!

Better: light-weight approach with distributed term-peer directory

Variety of projects following this line: PlanetP (Rutgers), Pepper (CMU), Galanx (Wisconsin), Odissea (Brooklyn), Minerva (MPII), and others

P2P Web Search has potential advantages:

  • Highly distributed data

  • Better processing power


Architectural Model

Peers are connected by overlay network (e.g. DHT, random graph) and IP

Each peer has full-fledged local search engine (with crawler / importer, indexer, query processor)

Each peer has autonomously compiled (e.g. crawled) its own content according to the user's thematic interests

→ peer-specific collections

When a query is issued by a peer, it is first executed locally and then possibly routed to carefully selected other peers

Peers can post summaries / synopses / metadata / QoS info to a (distributed) network-wide directory with efficient per-key lookup
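The directory idea above can be illustrated with a minimal sketch. It assumes a single in-memory object in place of the distributed directory, and a numeric per-term quality score as the posted summary; the class name `Directory` and its methods are hypothetical and not taken from the talk.

```python
from collections import defaultdict


class Directory:
    """Toy stand-in for the network-wide directory.

    Assumption: a real system would partition the keys (terms) over a DHT;
    here everything lives in one dictionary for illustration.
    """

    def __init__(self):
        # term -> {peer_id: score}; the score is a hypothetical quality summary
        self._posts = defaultdict(dict)

    def post(self, peer_id, term, score):
        """A peer publishes a summary for one term of its local collection."""
        self._posts[term][peer_id] = score

    def peer_list(self, term):
        """Per-key lookup: peers that posted the term, best score first."""
        posts = self._posts.get(term, {})
        return sorted(posts.items(), key=lambda kv: (-kv[1], kv[0]))
```

A peer would call `post` once per term it wants to advertise; a querying peer would call `peer_list` once per query term before deciding where to route the query.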


Minerva System Architecture

  • Built on top of a scalable, churn-resilient DHT

  • Conceptually global but physically distributed meta-data directory

Query routing is driven by statistics on peer quality

[Figure: peers P1 to P8, each with a local index. Directory peers hold peer lists with peer ranking and statistics: term a: P1,P4,P8; term b: P3,P5,P8; term c: P2,P4,P6; term d: P1,P3; term e: P1,P2,P5; term f: P2,P4,P6. A query peer is shown with the terms a, b, c.]


Problem Statement

Example query q: "native american music"

  • Ask global directory for three single-term PeerLists

  • Combine into single PeerList for complete query

  • Ask top peers for best documents

  • Combine all documents into a single result list

What can happen?

  • Great results: top peers for q are selected!

  • Bad results: selected peers are good for individual terms but mediocre for the complete query.

[Figure: directory peers Pi, Pj, Pk with the PeerLists native: P27, P4, P8, P112, P36, ...; american: P1, P4, P18, P108, P25, ...; music: P13, P4, P88, P36, ...; query peer Pq; documents Doc1 and Doc2 with the terms native, american, music; Post messages for native, american, music.]
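The routing steps listed above can be sketched as follows. The talk does not say how the single-term PeerLists are combined, so summing per-term scores is an assumption made purely for illustration, and all score values are invented; only the term names and some peer ids come from the example.

```python
from collections import defaultdict


def combine_peer_lists(peer_lists, top_k=2):
    """Merge single-term PeerLists into one ranking for the whole query.

    peer_lists: dict term -> {peer_id: per-term quality score}.
    Assumption: scores are simply summed across terms.
    """
    totals = defaultdict(float)
    for scores in peer_lists.values():
        for peer, score in scores.items():
            totals[peer] += score
    # Highest total first; peer id breaks ties deterministically.
    ranked = sorted(totals, key=lambda p: (-totals[p], p))
    return ranked[:top_k]


# Hypothetical per-term scores for the query "native american music".
peer_lists = {
    "native": {"P27": 0.9, "P4": 0.5, "P8": 0.4},
    "american": {"P1": 0.8, "P4": 0.6, "P18": 0.3},
    "music": {"P13": 0.7, "P4": 0.5, "P88": 0.2},
}
top_peers = combine_peer_lists(peer_lists)
```

With these numbers the query would be forwarded to P4 and P27. P27 scores well only for "native", which is exactly the "good for individual terms, mediocre for the complete query" case.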


Problem: Term Correlations

Queries with correlated or specifically "associated" termsets:

  • "Michael Jordan", "Lake Superior", "Bell Labs", "hurricane Katrina", "Native American Music", "PhD admission", "black magic", "ice hockey Honolulu", "Natalya Kournikova"

Architectural compromise:

  • Best peers for $q = \{t_1, \dots, t_{|q|}\}$ may not be in $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ and possibly not even in $\bigcup_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$

  • Also possible: $\bigcap_{t \in q} \mathrm{PeerList}(t)_{\text{top-}k}$ is empty!

  • Name and phrase recognition helps but insufficient

  • Lack of correlation-awareness is standard in IR, but more severe in P2P because of peer-granularity directory

Consider correlated termsets for query routing!

The solution:

  • Special handling of correlated termsets as termset posts in the directory, but...

  • ... efficiency & scalability are critical!
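A small numeric example may make the architectural compromise concrete. All peers and scores below are invented: P9 is imagined as the only peer whose documents contain both terms together, yet it ranks low for each term on its own.

```python
def top_k(peer_scores, k):
    """Return the k best peers for one term as a set."""
    ranked = sorted(peer_scores, key=lambda p: (-peer_scores[p], p))
    return set(ranked[:k])


# Hypothetical single-term directory entries for the query "michael jordan".
per_term = {
    "michael": {"P1": 0.9, "P2": 0.8, "P9": 0.3},
    "jordan": {"P3": 0.9, "P4": 0.8, "P9": 0.3},
}

k = 2
lists = [top_k(scores, k) for scores in per_term.values()]
intersection = set.intersection(*lists)  # peers in every top-k list
union = set.union(*lists)                # peers in at least one top-k list
```

Here the intersection of the top-2 lists is empty, and P9 is not even in their union, so single-term routing cannot find it no matter how the lists are combined.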


Critical Issues...

... and what remains to be done?

  • How to decide that a termset is correlated?

  • How to store termset posts in the directory?

  • How to exploit termset posts for queries?


Possible Approaches

Extraction of all possible term pairs out of the documents

  • Brute-force precomputation of termset posts

  • But: quadratic explosion, and what about triples, quadruples, ...?

Possible sources of correlated termsets

  • Names and phrases from dictionaries or thesauri

    → incomplete!

  • Frequent itemset mining on data

    → computationally expensive!

Impossible to predict all correlated termsets of interest!
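The size of the brute-force candidate space is easy to quantify. The sketch below counts an upper bound on the number of termsets of each size for a vocabulary of 100,000 terms; that vocabulary size is an arbitrary illustrative figure, not a number from the talk.

```python
from math import comb


def candidate_termsets(vocabulary_size, max_size):
    """Upper bound on the number of distinct termsets per size (2..max_size)."""
    return {size: comb(vocabulary_size, size) for size in range(2, max_size + 1)}


# Assumed vocabulary of 100,000 distinct terms.
counts = candidate_termsets(100_000, 4)
for size, n in counts.items():
    print(f"termsets of size {size}: {n:,}")
```

Pairs alone already number about 5 billion, and each additional term multiplies the count by several orders of magnitude, which is why precomputing posts for all termsets is not an option.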


Our Approach...

...driven by "Give the Web back to the people"

Exploit query logs to learn correlated termsets

Advantages of query logs:

  • Reflect the real behavior of millions of users

  • Only termsets of interest need to be learned as correlated

  • As we will see: integration into the existing architecture comes for free

Queries are a gold mine!

Looking at query logs...

  • ... to validate that logs are useful to recognize correlated termsets

  • Excite Search Engine Log (1999) with about 2 million real web queries


Learning Correlated Termsets from Queries

  • PeerList request: piggybacking complete query

  • Directory peers remember query as termsets

Learning included in Query Routing

[Figure: PeerList requests for "native", "american", and "music" are sent to the directory peers, each carrying the complete query "native american music". Directory entries shown: native: P3,P5,P8; american: P1,P4,P8; music: P2,P4,P6. Termsets remembered at the directory peers: "american music native", "american music", "american native", "music native".]
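One way a directory peer could learn from piggybacked queries is sketched below. It is a sketch under assumptions: the peer counts every multi-term subset of the query that contains its own term, and treats a termset as correlated once it has been seen a minimum number of times. The talk does not specify which subsets are remembered or how the threshold is chosen; `DirectoryPeer` and its methods are hypothetical names.

```python
from collections import Counter
from itertools import combinations


class DirectoryPeer:
    """Directory peer responsible for one term (hypothetical sketch)."""

    def __init__(self, term):
        self.term = term
        self.termset_counts = Counter()

    def handle_peerlist_request(self, query_terms):
        """Learn from the complete query piggybacked on a PeerList request."""
        terms = set(query_terms)
        if self.term not in terms:
            return
        others = sorted(terms - {self.term})
        # Assumption: remember every subset of the query that includes our term.
        for size in range(1, len(others) + 1):
            for combo in combinations(others, size):
                termset = tuple(sorted((self.term,) + combo))
                self.termset_counts[termset] += 1

    def correlated_termsets(self, min_count):
        """Termsets seen in at least min_count queries."""
        return sorted(ts for ts, n in self.termset_counts.items() if n >= min_count)
```

Because the counting happens while the regular PeerList request is served, no separate learning message is needed.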


Collecting and Storing Termset Posts

  • Directory Peers manage termset posts

  • Posting procedure extended with termset posting

No extra Communication Protocol needed!

[Figure: Post messages for "american", "native", and the termset "american native" are sent to the directory peers; the resulting termset entry is american native: P8. Single-term entries shown: native: P3,P5,P8; american: P1,P4,P8; music: P2,P4,P6. Learned termsets shown: "american music native", "american music", "american native", "music native".]
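The extended posting procedure might look roughly like this. It is a sketch under several assumptions: the reply to a single-term post carries the termsets the directory peer has learned, the posted score is a simple document count, and the termset post is shown as a direct method call, whereas the talk suggests it rides on the regular posting traffic. All names are hypothetical.

```python
from collections import defaultdict


class DirectoryPeer:
    """Directory peer for one term, holding single-term and termset posts."""

    def __init__(self, term, learned_termsets):
        self.term = term
        self.learned_termsets = [frozenset(ts) for ts in learned_termsets]
        self.posts = defaultdict(dict)  # frozenset of terms -> {peer_id: score}

    def post(self, peer_id, score):
        """Store a single-term post; the reply lists the termsets of interest."""
        self.posts[frozenset([self.term])][peer_id] = score
        return self.learned_termsets

    def post_termset(self, peer_id, termset, score):
        self.posts[frozenset(termset)][peer_id] = score


def publish(peer_id, local_docs, directory_peer):
    """Post a term and any requested termsets the peer has documents for.

    local_docs: list of sets of terms, one set per local document.
    Assumption: the score is the number of matching local documents.
    """
    term = directory_peer.term
    doc_count = sum(1 for doc in local_docs if term in doc)
    if doc_count == 0:
        return
    wanted = directory_peer.post(peer_id, doc_count)
    for termset in wanted:
        matches = sum(1 for doc in local_docs if termset <= doc)
        if matches > 0:
            directory_peer.post_termset(peer_id, termset, matches)
```

Only termsets that the directory has learned from queries are ever posted, which keeps the number of termset posts far below the brute-force candidate space.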


Exploiting Termset Postings

  • Integrated in standard query execution

  • Fallback option always possible

No additional Communication Round!

[Figure: PeerList requests for "native", "american", and "music" each carry the complete query "native american music". Directory entries shown: native: P3,P5,P8,P2; american: P1,P4,P8; music: P2,P4,P6,P8; native music: P8,P4; american music native: P8. The termset entry american music native: P8 is returned as the PeerList for the complete query.]
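The lookup with fallback can be sketched as a single function. The directory is modelled as one dictionary for simplicity, although in the architecture it is spread over many directory peers; the fallback merge (concatenating single-term lists without duplicates) is an assumption. The entries are the ones visible in the slide's figure.

```python
def peer_list_for_query(directory, query_terms):
    """Return the PeerList for a query, preferring a termset post.

    directory: dict frozenset(terms) -> list of peer ids, best first.
    """
    full = frozenset(query_terms)
    if full in directory:
        return directory[full]  # a termset post covers the complete query
    # Fallback: merge the single-term PeerLists, keeping first occurrences.
    merged = []
    for term in query_terms:
        for peer in directory.get(frozenset([term]), []):
            if peer not in merged:
                merged.append(peer)
    return merged


directory = {
    frozenset(["native"]): ["P3", "P5", "P8", "P2"],
    frozenset(["american"]): ["P1", "P4", "P8"],
    frozenset(["music"]): ["P2", "P4", "P6", "P8"],
    frozenset(["native", "music"]): ["P8", "P4"],
    frozenset(["american", "music", "native"]): ["P8"],
}
```

For "native american music" the termset post answers directly with P8; for a query such as "native american", which has no termset post here, the function falls back to the single-term lists.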


No Termset for Complete Query

  • Especially for large queries

  • Covering problem!

Integrated into Query Routing!

[Figure: the query "a b c d e" is sent to the directory peers for its terms a, b, c, d, e, but no termset post exists for the complete query. Termsets shown at the directory peers: a b c, a b d, b c e, b c, a b, c e, d e, and the single terms a, b, c, e. The query has to be covered by a combination of these termsets.]
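The covering problem can be approached with the standard greedy set-cover heuristic, sketched below. This is a swapped-in textbook technique used for illustration; the talk lists the optimization of termset covers as future work and does not say which method the authors use.

```python
def greedy_cover(query_terms, known_termsets):
    """Cover the query terms with known termsets, largest gain first.

    Returns (cover, uncovered): the chosen termsets and any terms that
    no known termset contains.
    """
    uncovered = set(query_terms)
    # Only termsets fully contained in the query are usable; sort them so
    # that ties are broken deterministically (first in sorted order wins).
    candidates = sorted(
        (frozenset(ts) for ts in known_termsets if set(ts) <= uncovered),
        key=sorted,
    )
    cover = []
    while uncovered:
        best = max(candidates, key=lambda ts: len(ts & uncovered), default=None)
        if best is None or not (best & uncovered):
            break  # the remaining terms have no post at all
        cover.append(best)
        uncovered -= best
    return cover, uncovered


known = [("a", "b", "c"), ("a", "b", "d"), ("b", "c", "e"),
         ("c", "e"), ("d", "e"), ("e",)]
cover, uncovered = greedy_cover(["a", "b", "c", "d", "e"], known)
```

For the slide's example the heuristic picks "a b c" and then "d e", covering all five terms with two termset PeerLists instead of five single-term ones.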


What about Networking Costs?

Big concern: too many messages and too much bandwidth consumption?

Our approach is still scalable because...

All messages are piggybacked, no extra costs!

  • Learning correlated termsets is integrated into the query routing process

  • Asking for termsets is integrated into the posting process

  • Exploiting correlated termsets during query processing comes for free and includes the fallback option, too

... It's all free!!


Experimental Evaluation

  • Experiments: 750 peers with .Gov partitions (~1.2 million web documents)

  • Running 50 expanded queries from TREC-2003 Web Track (examples: "robots research artificial", "shipwrecks accident")

Major Gain in Benefit / Cost


Conclusion and Future Work

  • Reconcile scalability with good search-result quality

  • No extra networking costs and...

  • ... greatly improved benefit/cost for query routing and processing

  • Consider and benefit from user and community behavior

  • Optimization of termset covers for queries with many terms

  • Real-life testbed with real users!

Thank You for Your Attention!

