
Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems

Christian Zimmer, Srikanta Bedathur, Gerhard Weikum

Max-Planck Institute for Informatics, Saarbrücken, Germany

http://www.mpi-inf.mpg.de

DASFAA Conference 2008


Outline of the Talk

  • Motivation

  • System Architecture

  • Caching Framework

  • Exact Caching (EC)

  • Approximate Caching (AC)

  • Experimental Evaluation

  • Conclusions & Open Issues

Motivation

Basics

  • High Potential of P2P-based Information Retrieval (P2P IR) systems:

    • benefits in general: scalable, efficient, resilient to failures and dynamics, democratic, privacy preserving, and resilient to authoritarian controls

    • benefits from intellectual input of users: click streams, query logs, bookmarks, etc.

  • Performance Challenges:

    • providing high quality results (recall & precision)

    • enabling high scalability (number of participating peers & huge amounts of data)

    • unreliable networks: slow response times, intermittent loss of good results

    • extra load on network: many peers for good recall

Motivation (con't)

Caching of Results

  • Traditional performance booster (using previous query executions to help in the future)

  • Remember popular items to avoid computing / fetching

    Typical Issues

  • What to Cache?

    • value of cached items

    • inverted lists / full results / partial results

  • Where to Cache?

    • on querying peers, every node along lookup path (UIC), spread to neighbors (DiCAS), on good nodes (View Trees)

  • How much to Cache?

    • buffer size

  • When to drop from Cache?

    • buffering policy

  • Goals of Caching?

    • response time improvements, query result-quality improvement


System Architecture

[Figure: peers P1 to P8 with directory labels D(a), D(b), D(c)]

Maintaining Metadata

  • Autonomous peers with local index (local search engine)

  • Distributed global directory layered on top of distributed hash table (DHT)

  • DHT partitions term space such that each peer is responsible for subset of terms

  • Peers distribute per-term summaries (Posts) to global directory (size of the index, number of documents containing this term, etc.)

  • Directory manages aggregated statistical information in compact form

Minerva Search Architecture
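
The directory idea above can be sketched in a few lines. This is a minimal illustration, not the Minerva implementation: the hash-modulo mapping stands in for a real DHT lookup, and the names `responsible_peer`, `Directory`, `publish`, and `peerlist` as well as the Post fields are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def responsible_peer(term, peers):
    """Map a term to one directory peer via a stable hash (stand-in for a DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]


class Directory:
    """Toy global directory: per-term Posts grouped by the responsible peer."""

    def __init__(self, peers):
        self.peers = list(peers)
        # directory peer -> term -> list of Posts (hypothetical layout)
        self.posts = defaultdict(lambda: defaultdict(list))

    def publish(self, peer_id, term, doc_count, index_size):
        """A peer publishes its per-term summary to the responsible directory peer."""
        target = responsible_peer(term, self.peers)
        self.posts[target][term].append(
            {"peer": peer_id, "doc_count": doc_count, "index_size": index_size}
        )

    def peerlist(self, term):
        """Return all Posts known for a term (the PeerList of that term)."""
        target = responsible_peer(term, self.peers)
        return list(self.posts[target][term])
```

Because the mapping is deterministic, every peer that publishes or looks up the same term reaches the same directory peer without any coordination.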


System Architecture

[Figure: peers P1 to P8 with directory labels D(a), D(b), D(c)]

Query Execution

  • Multi-term query a b c

  • Peerlist requests to retrieve metadata from directory (metadata retrieval)

  • Compute most promising peers for complete query (e.g., CORI, DTF)

  • Complete query forwarded to these peers executing query locally (local result retrieval)

  • Local results returned and merged to global query result

Minerva Search Architecture

query: a b c
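
The query execution steps above can be sketched as two small functions. This is only a sketch under stated assumptions: summing per-term document counts is a crude stand-in for CORI or DTF, keeping the best score per document when peers overlap is an assumption, and the names `select_peers` and `merge_results` are hypothetical.

```python
import heapq


def select_peers(peerlists, max_peers):
    """Rank peers by summed per-term document counts (crude stand-in for CORI or DTF)."""
    totals = {}
    for posts in peerlists.values():
        for post in posts:
            totals[post["peer"]] = totals.get(post["peer"], 0) + post["doc_count"]
    # Highest total first; peer id breaks ties so the ranking is deterministic.
    ranked = sorted(totals, key=lambda p: (-totals[p], p))
    return ranked[:max_peers]


def merge_results(local_results, k):
    """Merge per-peer (doc_id, score) lists into a global top-k result."""
    best = {}
    for results in local_results.values():
        for doc_id, score in results:
            # Assumption: a document returned by several peers keeps its best score.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return heapq.nlargest(k, best.items(), key=lambda item: (item[1], item[0]))
```

A querying peer would first collect the peerlists for all query terms, call `select_peers`, forward the complete query to the selected peers, and pass their answers to `merge_results`.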

Caching Framework

Main Goals

  • Caching for result-quality improvement

  • Integration of result caching with query routing (reduces message traffic)

  • Cache placement for seamless reuse

  • Aggressive result-reuse under certain conditions

    Where and What to Cache?

  • Potential locations for caching:

    • Query initiator or additional overlays: limited utility to network

    • Directory: choose one directory peer involved in query execution using deterministic scheme (avoids load balancing concerns)

  • Caching full results:

    • Metadata of results (URL, statistics, etc.)

    • Set of source peers contributing to cached results
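
The slides do not say which deterministic scheme selects the directory peer, so the following is only one possible sketch. It hashes the order-independent form of the query and uses the hash to pick one of the query terms; the directory peer responsible for that term would then hold the cached result. The function names are hypothetical.

```python
import hashlib


def canonical_query(terms):
    """Order-independent, duplicate-free form of a query."""
    return tuple(sorted(set(terms)))


def cache_term(terms):
    """Pick the query term whose directory peer should hold the cached result.

    Assumes a non-empty query. Every peer computes the same term for the same
    set of query terms, so no coordination is needed.
    """
    query = canonical_query(terms)
    digest = hashlib.sha1(" ".join(query).encode("utf-8")).hexdigest()
    return query[int(digest, 16) % len(query)]
```

Since the choice depends only on the query terms, the peer that stores a cached result and the peer that later asks for it agree on the location.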

Caching Framework (con't)

Extending Query Execution

  • Query Routing:

    • initiating peer sends full query to all directory peers responsible for query terms

    • directory checks availability of cached result and if available returns it to initiator

  • Adding / Updating Cache

    • query initiator computes full query result and cached result for top-k items

    • initiator determines directory peer responsible for maintaining cached result

    • directory peer incorporates received cache result in its cache

      Two Caching Strategies based on Caching Framework

  • Exact Caching (EC):

    • P2P counterpart of traditional result caching

  • Approximate Caching (AC):

    • aggressively reuse cached results of query subsets


Exact Caching (EC)

[Figure: peers P1 to P8 with directory labels D(a), D(b), D(c)]

Main Property

  • Cached result only used if it was generated by exactly the same query

    Caching Approach

  • After query execution: cached results stored at directory (by selecting one directory peer)

  • Request for a b c by another peer

  • Metadata retrieval additionally returns the cached result

  • Initiator satisfied: saves additional communication at same result-quality

  • Improving: local result retrieval from additional peers

  • Updating cached result

query: a b c
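
A minimal sketch of an exact cache kept at a directory peer, assuming that queries are treated as term sets and that an update merges new results into the stored top-k list. The class and field names are hypothetical, and the default k=50 is only an example value.

```python
from dataclasses import dataclass, field


@dataclass
class CachedResult:
    results: list  # (doc_id, score) pairs, best first
    source_peers: set = field(default_factory=set)  # peers that contributed


class ExactCache:
    """Cache that only answers queries with exactly the same term set."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def key(terms):
        # Assumption: term order does not matter for cache identity.
        return frozenset(terms)

    def lookup(self, terms):
        """Return the cached result for this exact query, or None."""
        return self._entries.get(self.key(terms))

    def update(self, terms, results, source_peers, k=50):
        """Store a new result or merge it into an existing entry."""
        entry = self._entries.setdefault(self.key(terms), CachedResult([]))
        best = dict(entry.results)
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
        entry.results = sorted(best.items(), key=lambda item: (-item[1], item[0]))[:k]
        entry.source_peers |= set(source_peers)
        return entry
```

The stored source peers let a later initiator skip peers that already contributed when it decides to improve the result further.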

Approximate Caching (AC)

Limitation of Exact Caching

  • EC only applicable when exact query was executed before

  • Approximate Caching tries to overcome this issue if cached result for complete query is not available

    Caching Approach

  • Aggressively retrieve and combine cached results of subsets of requested query to approximate full query

  • Avoids local result retrieval

  • Metadata retrieval:

    • querying peer requests peerlists for all query terms

    • directory peers return all existing maximal cached results for subsets of query term set

    • querying peer only considers cached results for maximal subqueries received from directory

  • By Design:

    • directory peers for query terms responsible for all possible subqueries

    • if AC strategy not satisfying, metadata retrieval already done
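
Selecting the maximal subqueries can be sketched as a subset filter. The function name and the representation of queries as term sets are assumptions for illustration; a cached query counts as maximal if no other cached subset of the requested query strictly contains it.

```python
def maximal_subqueries(query_terms, cached_queries):
    """Return cached term sets that are subsets of the query and not contained
    in any other cached subset of the query."""
    query = frozenset(query_terms)
    candidates = {frozenset(c) for c in cached_queries}
    # Keep only non-empty cached queries that use query terms exclusively.
    subsets = {c for c in candidates if c and c <= query}
    maximal = [s for s in subsets if not any(s < other for other in subsets)]
    return sorted(maximal, key=lambda s: sorted(s))
```

For a request a b c d with cached results for a c d, b c d, b c, b d, and c, only a c d and b c d survive the filter, because each of the shorter subqueries is contained in one of them.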


Approximate Caching (AC) (con't)

[Figure: peers P1 to P8 annotated with cached subquery results: a c d, b c d, b c, b d, c]

An Example

  • Request for a b c d

  • No cached result for full query, but directory stores cached results for subqueries

  • Metadata retrieval additionally returns all cached results for maximal subqueries

  • To combine subquery results, querying peer only considers maximal ones

    Unsatisfactory Approximate Result

  • Querying peer retrieves local results from top-ranked peers for full query

[Figure labels: query: a b c d; directory labels D(a), D(b), D(c), D(d)]

Approximate Caching (AC) (con't)

How to Combine Cached Results of Different Subqueries

  • Having determined document set contained in all cached results for maximal subqueries, documents need to be ranked for approximate result for full query

  • Consider document scores $\mathrm{score}_{d,p,q}$ from cached results for document $d$ as local result of peer $p$ concerning (sub-)query $q$

    Final Score Computation

  • To rank the document set and get approximate result

  • $\mathrm{score}_d = \max_{p,q} \left( |q| \cdot \mathrm{score}_{d,p,q} \right)$

    • takes different query sizes into account: longer queries more selective and approximate better full query

    • more than one cached result can include a document: only consider maximal score
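
A small sketch of the final score computation, assuming that |q| multiplies the cached score and that the candidate document set is the union of all documents in the cached results for maximal subqueries. The function name and the input layout are hypothetical.

```python
def combine_cached_scores(cached_results):
    """Combine cached subquery results into one ranking for the full query.

    cached_results: iterable of (subquery_terms, peer_id, {doc_id: score}).
    Each cached score is weighted by the number of terms in its subquery,
    and a document keeps only its maximal weighted score.
    """
    combined = {}
    for terms, _peer, doc_scores in cached_results:
        weight = len(set(terms))  # |q|
        for doc_id, score in doc_scores.items():
            weighted = weight * score
            combined[doc_id] = max(combined.get(doc_id, float("-inf")), weighted)
    # Best score first; doc id breaks ties deterministically.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

With a three-term subquery scoring a document at 0.2 and a two-term subquery scoring it at 0.4, the document ends up with 0.8: the two-term result wins even after the longer subquery is weighted more heavily.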

Experimental Evaluation

Experimental Setup

  • P2P IR Benchmark recently proposed for P2P system evaluation [ExpDB 2006]

  • > 800,000 documents from Wikipedia

  • 99 Google Zeitgeist queries (1-3 query terms)

  • Documents distributed to 1,000 peers (with controlled overlap)

  • In addition: AOL query-log (real-world log with time ordering)

  • Result retrieval returns top-25 local results per peer; final result contains top-50 documents for full query

    Measurements

  • Relative Recall: fraction of ideal result documents included in results of P2P query processing

    • Ideal result: top-50 result documents of a centralized query execution over the combined document collection

  • Network Resource Consumption: total network traffic incurred during query processing

    • number of messages transferred across network

    • number of communication rounds
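
Relative recall as defined above can be computed directly from the two result lists. A minimal sketch, with a hypothetical function name and documents identified by arbitrary ids:

```python
def relative_recall(p2p_result, ideal_result):
    """Fraction of ideal result documents that also appear in the P2P result."""
    ideal = set(ideal_result)
    if not ideal:
        raise ValueError("ideal result must not be empty")
    return len(ideal & set(p2p_result)) / len(ideal)
```

In the setup described here, `ideal_result` would be the top-50 list of the centralized execution and `p2p_result` the top-50 list obtained from the peers.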

Experimental Evaluation (con't)

I. Improving Recall with Exact Caching (EC)

  • Focus on query result improvement by asking additional peers

  • Updated cached result stored in directory

  • Initial query processing disseminates query to 5% of network; each improvement step considers up to 5% additional network peers

  • Relative recall averaged over all 99 Zeitgeist queries
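
The stepwise improvement can be pictured as cutting the ranked peer list into batches: the first batch is the initial 5% of the network, and each later batch is one improvement step. This is a sketch only; the function name is hypothetical and the 50% cap is an illustrative default.

```python
def improvement_batches(ranked_peers, step_fraction=0.05, max_fraction=0.5):
    """Split a ranked peer list into batches of step_fraction of the network,
    stopping once max_fraction of the network has been covered."""
    step = max(1, round(len(ranked_peers) * step_fraction))
    limit = round(len(ranked_peers) * max_fraction)
    return [ranked_peers[i:i + step] for i in range(0, limit, step)]
```

For the 1,000 peers used in the experiments, this yields batches of 50 peers each.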

Experimental Evaluation (con't)

II. Cache Management Strategies

  • Assumes bounded cache space at directory peers such that cache management policy influences recall for Exact Caching strategy

  • Cache at directory peer restricted to three cached results each

  • Synthetic query workload from Zeitgeist queries:

    • all possible 9180 one- and two-term queries from single query terms

    • assuming a power law distribution (total of 102,158 requests)

  • Cache replacement strategies: LFU, LRU, FIFO, RAN, UNL (upper bound), and NOC (lower bound)

  • Measures: overall relative recall and cache hit ratio
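
As an illustration of one of the listed replacement strategies, here is a minimal LRU cache bounded to three cached results that also tracks the cache hit ratio. The class and method names are hypothetical, and queries are assumed to be identified by their term set.

```python
from collections import OrderedDict


class LRUResultCache:
    """Bounded result cache with least-recently-used replacement."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first
        self.hits = 0
        self.requests = 0

    def get(self, query):
        """Return the cached result or None, and record the request."""
        self.requests += 1
        key = frozenset(query)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        return None

    def put(self, query, result):
        """Insert or refresh a result, evicting the least recently used one."""
        key = frozenset(query)
        self._entries[key] = result
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0
```

Swapping the eviction rule in `put` (for example, evicting the entry with the fewest hits) would give the other bounded strategies, while an unbounded dictionary corresponds to UNL and an always-empty cache to NOC.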

Experimental Evaluation (con't)

III. Cost Analysis

  • Network cost analysis: per query network traffic, number of messages, and communication rounds in three scenarios:

    • No Caching (NC): standard query processing (5% of network)

    • EC Single-Step (EC-SS): Exact Caching without query result improvement

    • EC Multi-Step (EC-MS): Exact Caching with query result improvement up to 50% of network in 5% steps

  • For details (different phases, assumptions, etc.), see the paper

Experimental Evaluation (con't)

Metric                         NC (No Caching)   EC-SS (EC Single-Step)   EC-MS (EC Multi-Step)
average relative recall        0.32              0.32                     0.71 (+122%)
network traffic (per query)    55.3 Kbytes       23.1 Kbytes (-58.2%)     41.0 Kbytes (-25.9%)
messages (per query)           106               25.7 (-75.8%)            61.4 (-42.1%)
response time (rounds)         2                 1.19 (-40.3%)            1.60 (-20.0%)

Experimental Evaluation (con't)

IV. Approximate Caching Scenarios

  • 4000 generated random 3- and 4-term queries from benchmark query set

  • Comparison of 5 scenarios against standard query routing (SQR):

  • Effectiveness of AC in terms of relative recall depending on number of peers that contributed to cached subquery result

Experimental Evaluation (con't)

V. Real-World Query-Log

  • Using AOL query-log to have time-order of queries: overall 57,344 requests with 39,640 unique queries

  • Combination of EC and AC

  • Results: ~25% hit rate & recall improvement from 0.45 to 0.52

Experimental Evaluation (con't)

VI. Impact of Churn

  • On benefits of EC-MS

  • Different churn rates: fraction of peers that leave the network

Conclusions & Open Issues

Conclusions

  • Introduced simple, yet effective, caching framework to take advantage of previous work of peers in P2P network

  • Exact Caching (EC):

    • possibility to improve recall - or to reduce response time / network cost

    • experiments used Wikipedia benchmark and real-world query-log

    • investigated various cache replacement strategies and considered churn in P2P

  • Approximate Caching (AC):

    • aggressive reuse of cached results of subqueries - if full query results not available

    • places demands on existing cached results to achieve satisfying outcomes

      Open Issues

  • Proactive Caching (anticipate interesting queries, e.g., from existing logs)

  • Maintaining cache freshness (new or better results are available)

  • Replication (metadata and/or documents)


Thank You For Your Attention!

Questions or Comments?
