Query-Driven Indexing for P2P Text Retrieval


The Future of Web Search

19.07.2007 Bertinoro, Italy

Alvis

Query-Driven Indexing for P2P Text Retrieval

Gleb Skobeltsyn

EPFL, Switzerland

June 19, 2007

  • Joint work with:
    • Toan Luu
    • Ivana Podnar Žarko
    • Martin Rajman
    • Karl Aberer

G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval

Goal
  • Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
  • Each peer:
    • Provides resources (bandwidth, storage)
    • Searches the whole network
    • Publishes its own documents

[Figure: DHT]


Naïve (single-term) approach

... is to distribute the global inverted index in a DHT using term partitioning:

  • h(“gleb”) → {d2,d3}
  • h(“epfl”) → {d1,d2}
  • h(t’) → {d4,d5}

Query: “epfl & gleb”

[Figure: DHT peers, each holding a part of the index; the posting lists {d1,d2} and {d2,d3} are intersected, leaving {d2} as the answer.]

This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker and I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
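The term partitioning on this slide can be illustrated with a small in-process sketch. It is not the system from the talk: the `ToyDHT` class, the SHA-1 based peer assignment, and the number of peers are assumptions made only for illustration.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """In-process stand-in for a DHT (hypothetical): a term is hashed to one peer."""

    def __init__(self, num_peers: int) -> None:
        # Each peer stores term -> set of document ids.
        self.peers = [defaultdict(set) for _ in range(num_peers)]

    def peer_for(self, term: str) -> int:
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.peers)

    def publish(self, doc_id: str, terms: list[str]) -> None:
        # Term partitioning: every term of the document goes to its own peer.
        for term in set(terms):
            self.peers[self.peer_for(term)][term].add(doc_id)

    def posting_list(self, term: str) -> set[str]:
        return set(self.peers[self.peer_for(term)].get(term, set()))


def conjunctive_query(dht: ToyDHT, terms: list[str]) -> set[str]:
    """Fetch one full posting list per term and intersect them."""
    if not terms:
        return set()
    lists = sorted((dht.posting_list(t) for t in terms), key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = result & postings
    return result


dht = ToyDHT(num_peers=9)
dht.publish("d1", ["epfl"])
dht.publish("d2", ["epfl", "gleb"])
dht.publish("d3", ["gleb"])
answer = conjunctive_query(dht, ["epfl", "gleb"])
print(answer)
```

In this naïve scheme whole posting lists travel between peers for every multi-term query, which is the cost the multi-term approaches in the following slides try to avoid.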

Single-term vs. multi-term P2P indexing

Vocabulary size could grow exponentially!

How to choose keys to keep a satisfactory retrieval quality?

Multi-term indexing: framework
  • Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism
  • Each key corresponds to a term or a set of terms
  • Each key is assigned a truncated posting list (TPL) that stores at most DFmax top-ranked document references
  • The distributed index contains {key, TPL} pairs
  • The indexing load is handled by an optimized DHT layer:
    • F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS’07
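A truncated posting list can be sketched as a bounded min-heap. The class name, the `truncated` flag, and the use of a numeric score per document are assumptions for illustration; the slides only state that a TPL keeps at most DFmax top-ranked references.

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max (score, doc_id) entries with the highest scores."""

    def __init__(self, df_max: int) -> None:
        self.df_max = df_max
        self._heap: list[tuple[float, str]] = []  # min-heap ordered by score
        self.truncated = False  # becomes True once an entry had to be dropped

    def add(self, doc_id: str, score: float) -> None:
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
        else:
            # The list is full: keep the df_max best and drop the weakest entry.
            self.truncated = True
            heapq.heappushpop(self._heap, (score, doc_id))

    def top(self) -> list[str]:
        """Document ids ordered from the highest to the lowest score."""
        ordered = sorted(self._heap, key=lambda entry: (-entry[0], entry[1]))
        return [doc_id for _, doc_id in ordered]


tpl = TruncatedPostingList(df_max=2)
tpl.add("d1", 0.2)
tpl.add("d2", 0.9)
tpl.add("d3", 0.5)
print(tpl.top(), tpl.truncated)
```

Recording whether truncation happened is a design choice of this sketch; it is convenient later, when deciding whether longer keys are worth generating at all.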

Multi-term indexing techniques
  • Indexing with Highly Discriminative Keys (HDKs), based on:
    • Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07
    • Beyond term indexing: A P2P framework for Web information retrieval. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2, 2006
  • Query-Driven Indexing (QDI), based on:
    • Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07
    • Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07

Indexing with HDK
  • Data-driven key generation:
    • Each time a new document is indexed, the posting list of some key k can reach the maximum size DFmax
      • This triggers the generation of new keys (k + other frequent keys)
    • A number of filters reduce the number of keys, e.g.:
      • Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (within a window of size w).
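The proximity filter could look roughly like the sketch below. The exact window semantics are not given on the slide; here it is assumed that t1 and t2 must both fall inside some run of w consecutive tokens, and the function name `qualifies` is hypothetical.

```python
def qualifies(tokens: list[str], t1: str, t2: str, w: int) -> bool:
    """True if some occurrences of t1 and t2 lie within w consecutive tokens.

    Assumption: "close" means the positions differ by at most w - 1.
    """
    positions_1 = [i for i, token in enumerate(tokens) if token == t1]
    positions_2 = [i for i, token in enumerate(tokens) if token == t2]
    return any(abs(i - j) < w for i in positions_1 for j in positions_2)


doc = "query driven indexing for p2p text retrieval".split()
near = qualifies(doc, "query", "indexing", w=3)   # positions 0 and 2
far = qualifies(doc, "query", "retrieval", w=3)   # positions 0 and 6
print(near, far)
```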

Indexing with HDK
  • Pros:
    • The ICDE’07 paper proves that the number of keys grows linearly
    • Elegant key generation mechanism
    • Low bandwidth consumption during query processing (posting lists of limited size)
  • Cons:
    • In practice the number of keys is LARGE: 68M for 0.6M docs
    • High bandwidth consumption at indexing
  • Problem:
    • Too many keys are superfluous (almost never used)

Query-Driven Indexing

Let's index only what is queried!

Contents
  • Introduction
  • Single-term vs. multi-term indexing
  • HDK approach for indexing
  • Query-driven approach for indexing/retrieval
    • Indexing structure
    • Example
    • Scalability
    • Evaluation
  • Conclusion

Query-Driven Index (QDI)
  • The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
    • Avoids maintenance of superfluous keys
    • Generates only those keys that are requested by users
    • Uses the query log to discover such keys
  • Problems:
    • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
      • Smart Broadcast (ONM) or
      • Conventional intersection like TA, but less frequent
    • An incomplete index degrades the quality of query results
      • We show that the degradation is low

Which keys to index?
  • Each single term found in the document collection has to be indexed.
    • We call the set of all single-term keys the basic single-term index.
    • The posting lists are truncated at DFmax.
  • A key k is non-superfluous and can be activated iff:
    • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
    • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
    • all immediate sub-keys of k (of size |k|−1) are indexed and their associated posting lists are truncated (redundancy filter).
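The three filters can be combined into one predicate, sketched below. The data layout is an assumption for illustration: keys are `frozenset`s of terms, `qf` maps a key to its query frequency, and `index` maps each indexed key to a boolean telling whether its posting list is truncated.

```python
from itertools import combinations


def can_activate(key: frozenset, qf: dict, index: dict, qf_min: int, s_max: int) -> bool:
    """Apply the size, popularity and redundancy filters to a candidate key."""
    # Size filter: 2 <= |k| <= s_max.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: QF(k) >= QF_min.
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k| - 1 must be indexed
    # and its posting list must be truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        if not index.get(frozenset(sub), False):
            return False
    return True


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
index = {a: True, b: False, c: True}   # the posting list of b is NOT truncated
qf = {a | c: 3, a | b: 3, a | b | c: 5}

ac_ok = can_activate(a | c, qf, index, qf_min=2, s_max=3)
ab_ok = can_activate(a | b, qf, index, qf_min=2, s_max=3)
abc_ok = can_activate(a | b | c, qf, index, qf_min=2, s_max=3)
print(ac_ok, ab_ok, abc_ok)
```

In the example only ac passes: ab fails because the posting list of b is complete, so b alone already answers any query containing it, and abc fails because its sub-keys ab and bc are not indexed.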

QDI: Retrieval

  • Single-term index is generated
  • Process abc
    • Probe Pabc
    • Probe Pab, Pbc and Pac
    • Probe Pa, Pb and Pc
    • Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
    • Contact peers in the list, re-rank the obtained results w.r.t. abc
    • Output top-10
  • Inc. the QF for ab, bc and ac
  • Activate (index) ac

[Figure: a peer submits the query ?abc to the key lattice (abc; ab, ac, bc; a, b, c), with DFmax marked. The probes for abc, ab, bc and ac return nothing; the QF counters of ab, bc and ac receive +1, and ac becomes popular.]
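The probing order on this slide (largest key first, then its sub-keys) can be sketched as a top-down walk over the term lattice. This is a simplified illustration, not the talk's implementation: the index is a plain dictionary, the rule of skipping keys already covered by a found super-key is inferred from the slides, and re-ranking is left out.

```python
from itertools import combinations


def probe_lattice(query_terms: list[str], index: dict, s_max: int) -> dict:
    """Probe keys from the largest to the smallest and collect the hits.

    A key is skipped when a larger key that contains it was already found.
    """
    terms = sorted(set(query_terms))
    found: dict = {}
    for size in range(min(len(terms), s_max), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key <= hit for hit in found):
                continue  # already covered by a found super-key
            if key in index:
                found[key] = index[key]
    return found


def candidates(found: dict) -> set:
    """Union of all retrieved posting lists, to be re-ranked w.r.t. the full query."""
    docs: set = set()
    for postings in found.values():
        docs.update(postings)
    return docs


single_term_index = {
    frozenset("a"): ["d1", "d2"],
    frozenset("b"): ["d2", "d3"],
    frozenset("c"): ["d4"],
}
hits_before = probe_lattice(["a", "b", "c"], single_term_index, s_max=3)

# After the key ac has been activated, a and c no longer need to be probed.
index_with_ac = dict(single_term_index)
index_with_ac[frozenset("ac")] = ["d1", "d4"]
hits_after = probe_lattice(["a", "b", "c"], index_with_ac, s_max=3)
print(sorted("".join(sorted(k)) for k in hits_after))
```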

QDI: Retrieval 2
  • Assume the frequency of b is below DFmax
  • Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated)

[Figure: the key lattice (abc; ab, bc, ac; a, b, c) with DFmax marked; the nodes abc, ab and bc are grayed.]

QDI: Retrieval 3

  • Single-term index is generated and ac is indexed
  • Process abc
    • Probe Pabc
    • Probe Pab, Pbc and Pac: obtain the result for ac
    • Probe Pb and obtain the result for b
    • Contact all peers in the list to re-rank the obtained results w.r.t. abc
    • Output top-10
  • Inc. the QF for ab, bc and ac

[Figure: a peer submits the query ?abc to the key lattice (abc; ab, ac, bc; a, b, c). The probes for abc, ab and bc return nothing; ac and b return results; the QF counters of ab, bc and ac receive +1.]

Scalability
  • The retrieval traffic is bounded by a constant due to truncated posting lists (the constant depends on DFmax and the query size)
  • The indexing traffic depends on the number of keys to be activated.
    • The number of keys in the HDK approach (an UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
    • In QDI the number of keys does not depend on the document collection size but only on the size of the query log
    • We can use the QFmin parameter to adjust the tradeoff between indexing traffic and retrieval quality
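To see why the retrieval traffic stays bounded, one can count the worst case by hand. The bound below is a back-of-the-envelope illustration and not a formula from the talk: it assumes every term combination of size 1 to smax is probed once and every probe returns a full list of DFmax references.

```python
from math import comb


def max_probes(query_size: int, s_max: int) -> int:
    """Number of term combinations of size 1..s_max in a query of the given size."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, size) for size in range(1, largest + 1))


def max_postings_transferred(query_size: int, s_max: int, df_max: int) -> int:
    """Worst case: every probe answers with a posting list of df_max references."""
    return max_probes(query_size, s_max) * df_max


print(max_probes(3, 3))                       # a, b, c, ab, ac, bc, abc
print(max_postings_transferred(3, 3, 100))
```

The result depends only on the query size, smax and DFmax, and not on the number of documents or peers.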

AOL logs
  • 17M Queries from March, April, May 2006 (92 days)
  • 650K anonymous user sessions
  • Extracted all unique queries from each user session:

2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.

2006-05-31 23:50:30 l6 screensaver

2006-05-31 23:50:30 horses for sale in tn ky

2006-05-31 23:50:30 bank of america.com

2006-05-31 23:50:30 ask

2006-05-31 23:50:29 del rosa lanes

2006-05-31 23:50:28 www.spirit airlines.com

2006-05-31 23:50:28 find holy women of the bible

2006-05-31 23:50:27 trains

2006-05-31 23:50:27 todaysmiricles

2006-05-31 23:50:27 constition

2006-05-31 23:50:26 german grocceries in las vegas nv

2006-05-31 23:50:25 porn

2006-05-31 23:50:25 northwest indiana

2006-05-31 23:50:24 united.eprize.net

2006-05-31 23:50:24 jessica laguna

<-0.7Gb

Distribution of combinations in the AOL logs
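Counting term combinations in a query log might be done as in the following sketch. The record layout (session id, query string), whitespace tokenization and lowercasing are assumptions for illustration; duplicates are removed per session, in line with the extraction step described for the AOL logs.

```python
from collections import Counter
from itertools import combinations


def query_frequencies(log, s_max: int) -> Counter:
    """Count every term combination of size 1..s_max over session-unique queries.

    log: iterable of (session_id, query) pairs.
    """
    # Keep each distinct query only once per session.
    unique = {(session, " ".join(query.lower().split())) for session, query in log}
    qf: Counter = Counter()
    for _, query in unique:
        terms = sorted(set(query.split()))
        for size in range(1, min(len(terms), s_max) + 1):
            for combo in combinations(terms, size):
                qf[frozenset(combo)] += 1
    return qf


sample_log = [
    (1, "babe ruth"),
    (1, "Babe  Ruth"),        # repeated within the same session
    (2, "babe ruth 1920"),
    (2, "trains"),
]
qf = query_frequencies(sample_log, s_max=3)
print(qf[frozenset({"babe", "ruth"})])
```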

TREC Experiment
  • WT10G collection (~1.69 M docs)
  • 100 TREC queries (from TREC Web Track 9 & 10)
  • Query statistics generated from 17M AOL queries
  • Okapi BM25 weighting scheme used to compute the ranking score
  • QFmin = 1, 3, 5, ∞
  • DFmax = 100, 500
  • smax = 3

[Table: TREC precision at top-ranked pages]

Precision is similar to centralized indexing

Overlap experiment
  • Use the query log to build the index (days 1..91)
  • Choose randomly 2K test queries from day 92
  • Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its combinations that are indexed according to the logs.
  • This mimics our P2P IR system if Google’s ranking is used.
  • Example:

[Figure: results of the original query compared with those of its non-superfluous (indexed) combinations; two results are marked X, giving an overlap of 3/5 = 60%.]
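The overlap measure in this experiment can be sketched as follows. The function name and signature are hypothetical, and the definition (share of the reference top-k results that also appear in the union of the combination results) is an interpretation of the slide rather than a quoted formula. Each combination's list is assumed to be already cut to its top-DFmax entries.

```python
def overlap_at_k(reference: list[str], combination_results: list[list[str]], k: int) -> float:
    """Share of the top-k reference results found in the union of combination results."""
    top = reference[:k]
    if not top:
        return 0.0
    pool: set[str] = set()
    for results in combination_results:
        pool.update(results)
    return sum(1 for doc in top if doc in pool) / len(top)


reference = ["r1", "r2", "r3", "r4", "r5"]       # results for the original query
combination_results = [
    ["r1", "r2", "x1"],                          # results for one indexed combination
    ["r2", "r4"],                                # results for another
]
score = overlap_at_k(reference, combination_results, k=5)
print(score)
```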

Overlap example
  • Cut-n-paste from the simulation log:

>id=481,q=“what did babe ruth do in the 1920”

“1920 babe ruth”, qf=0---->[email protected]=100%

“1920 babe”, qf=0--------->[email protected]= 9%

+++“1920 ruth”, qf=1--------->[email protected]=33%

+++“babe ruth”, qf=495 ------->[email protected]= 69%

---“1920”, qf=716 ------------>[email protected]= 1%

---“babe”, qf=3196 ----------->[email protected]= 2%

---“ruth”, qf=1653 ----------->[email protected]= 7%

Size: 192, Keys used: 2, [email protected]: 94%

Google experiment: impact of smax, DFmax

[Figure: impact of smax for all possible combinations (QFmin = 0)]

[Figure: impact of DFmax with QFmin = 1, smax = 3]

Google experiment: impact of QFmin
  • The number of keys does not depend on the document collection size
  • The HDK approach would require ~65M keys for 650K documents
  • More than 30% of the badly performing queries are misspellings, so the real quality is higher

[Figure: impact of QFmin (DFmax = 600)]

[Figure: number of keys for different QFmin]

Google experiment: impact of the log size

[Figure: impact of the log size (QFmin = 1, DFmax = 600)]

Conclusions
  • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
    • Stores posting lists in a DHT for terms and term combinations
    • Stores at most DFmax top document references in a posting list
    • Efficiently collects the query statistics in a distributed fashion
    • Based on these statistics, activates (indexes) only popular keys
    • Computes the result of a multi-term query based only on the index entries available at the moment, with no costly intersections
  • We also showed that:
    • With real query logs our approach achieves good retrieval quality
    • The QFmin parameter adjusts the traffic/quality tradeoff

Last slide

Thank you for your attention!

Questions?

AlvisP2P - to appear in July at

http://globalcomputing.epfl.ch/alvis/
