Query-Driven Indexing for P2P Text Retrieval

The Future of Web Search, 19.07.2007, Bertinoro, Italy. Alvis. Query-Driven Indexing for P2P Text Retrieval. Gleb Skobeltsyn, EPFL, Switzerland, June 19, 2007. Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer.


The Future of Web Search

19.07.2007 Bertinoro, Italy

Alvis

Query-Driven Indexing for P2P Text Retrieval

Gleb Skobeltsyn

EPFL, Switzerland

June 19, 2007

  • Joint work with:

    • Toan Luu

    • Ivana Podnar Žarko

    • Martin Rajman

    • Karl Aberer

G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval



Goal

  • Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)

  • Each peer:

    • Provides resources (bandwidth, storage)

    • Searches the whole network

    • Publishes its own documents

[Figure: DHT]


Naïve (single-term) approach

... is to distribute the global inverted index in a DHT using term partitioning:

[Figure: peers of a DHT, each holding a part of the inverted index: h(“gleb”) → {d2,d3}, h(“epfl”) → {d1,d2}, h(t’) → {d4,d5}. For the query “epfl & gleb” the posting list {d1,d2} is intersected with the posting list of “gleb”, which yields {d2}.]

This slide was borrowed from the presentation "Enhancing P2P File-Sharing with an Internet-Scale Query Processor" by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica.

Query: “epfl & gleb”
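The term-partitioned lookup in this example can be sketched in a few lines of Python. This is only an illustration, not the system from the talk: the `ToyDHT` class, the number of peers and the use of SHA-1 to map a term to a peer are assumptions made for the sketch; only the posting lists and the query come from the slide.

```python
import hashlib


class ToyDHT:
    """In-process stand-in for a DHT (hypothetical): each peer stores the
    posting lists of the terms whose hash falls into its bucket."""

    def __init__(self, num_peers=4):
        self.peers = [dict() for _ in range(num_peers)]

    def _peer_for(self, term):
        # h(term): map the term to the peer responsible for it.
        # Assumption: SHA-1 modulo the number of peers.
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.peers[int(digest, 16) % len(self.peers)]

    def publish(self, term, doc_id):
        self._peer_for(term).setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        return self._peer_for(term).get(term, set())


def conjunctive_query(dht, terms):
    """Fetch one posting list per term and intersect them."""
    postings = [dht.lookup(t) for t in terms]
    return set.intersection(*postings) if postings else set()


dht = ToyDHT()
for term, docs in {"gleb": ["d2", "d3"], "epfl": ["d1", "d2"]}.items():
    for doc in docs:
        dht.publish(term, doc)

result = conjunctive_query(dht, ["epfl", "gleb"])
print(result)  # {'d2'}
```

Note that the complete posting list of every query term has to be fetched before the intersection can be computed.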


Single-term vs. multi-term P2P indexing

Vocabulary size could grow exponentially!

How to choose keys to keep a satisfactory retrieval quality?


Multi-term indexing: framework

  • Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism

  • Each key corresponds to a term or a set of terms

  • Each key is assigned a truncated posting list (TPL) that stores at most DFmax top-ranked document references

  • The distributed index contains {key, TPL} pairs

  • The indexing load is handled by an optimized DHT layer:

    • F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer

    • Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
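A {key, TPL} pair can be pictured with the following Python sketch. It is a minimal illustration under assumptions: the class name `TruncatedPostingList`, the `truncated` flag, the example scores and the use of a `frozenset` as key are not taken from the talk.

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max (score, doc_id) entries with the highest scores.
    Hypothetical helper, not the authors' data structure."""

    def __init__(self, df_max):
        self.df_max = df_max
        self._heap = []          # min-heap: the weakest kept entry is on top
        self.truncated = False   # True once an entry has been dropped

    def add(self, doc_id, score):
        entry = (score, doc_id)
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry, then drop the lowest-scoring one.
            heapq.heappushpop(self._heap, entry)
            self.truncated = True

    def top(self):
        """Document references ordered from best to worst score."""
        return [doc for score, doc in sorted(self._heap, reverse=True)]


def make_key(terms):
    """A key is a term or a set of terms; term order does not matter."""
    return frozenset(terms)


index = {}  # stands in for the distributed collection of {key, TPL} pairs
tpl = index.setdefault(make_key(["p2p", "retrieval"]),
                       TruncatedPostingList(df_max=2))
for doc_id, score in [("d1", 0.4), ("d2", 0.9), ("d3", 0.7)]:
    tpl.add(doc_id, score)

print(tpl.top())       # ['d2', 'd3']
print(tpl.truncated)   # True
```

Because a list never holds more than DFmax references, the amount of data returned for one key is bounded regardless of how many documents contain the key.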


Multi-term indexing techniques

  • Indexing with Highly Discriminative Keys (HDKs), based on:

    • Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

      I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer

      in ICDE’07

    • Beyond term indexing: A P2P framework for Web information retrieval

      I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer

      Informatica, vol. 30, no. 2, 2006.

  • Query-Driven Indexing (QDI), based on:

    • Web Text Retrieval with a P2P Query-Driven Index

      G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer

      in SIGIR’07

    • Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval

      G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer

      in Infoscale’07


Indexing with HDK

  • Data-Driven key generation:

    • Each time a new document is indexed, the posting list for some key k can reach the maximum size DFmax

      • This triggers the generation of new keys (k + other frequent keys)

    • Use a number of filters to reduce the number of keys, e.g.:

      • Proximity Filter: a document qualifies for a key t1&t2 if t1 occurs close to t2 (within a window of size w).
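The proximity filter can be illustrated with a short Python sketch. The slide does not define "close" precisely, so the check below assumes that two terms are close when their token positions differ by less than w; the function name `qualifies` and the sample document are made up for the example.

```python
def qualifies(tokens, t1, t2, w):
    """True if t1 and t2 occur less than w token positions apart.
    Assumption: this is how the window size w is interpreted."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


doc = "query driven indexing for scalable peer to peer text retrieval".split()
print(qualifies(doc, "query", "indexing", w=3))   # True
print(qualifies(doc, "query", "retrieval", w=3))  # False
```

With such a filter, a document contributes to the key t1&t2 only if the two terms appear near each other, which keeps the number of generated keys down.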


Indexing with HDK

  • Pros:

    • The ICDE’07 paper proves that the number of keys grows linearly

    • Elegant key generation mechanism

    • Low bandwidth consumption during query processing (posting lists of limited size)

  • Cons:

    • In practice the number of keys is LARGE: 68M keys for 0.6M documents

    • High bandwidth consumption at indexing time

  • Problem:

    • Too many keys are superfluous (almost never used)


Query-Driven Indexing

Let's index only what is queried!


Contents

  • Introduction

  • Single-term vs. multi-term indexing

  • HDK approach for indexing

  • Query-driven approach for indexing/retrieval

    • Indexing structure

    • Example

    • Scalability

    • Evaluation

  • Conclusion


Query-Driven Index (QDI)

  • The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:

    • Avoids the maintenance of superfluous keys

    • Generates only keys that are requested by users

    • Uses the query log to discover such keys

  • Problems:

    • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key

      • Smart Broadcast (ONM) or

      • Conventional intersection, e.g. with TA, but performed less frequently

    • An incomplete index degrades the quality of query results

      • We show that the degradation is low


Which keys to index?

  • Each single term found in the document collection has to be indexed.

    • We call the set of all single-term keys the basic single-term index.

    • The posting lists are truncated at DFmax.

  • A key k is non-superfluous and can be activated iff:

    • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).

    • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).

    • all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
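The three filters can be written down as one predicate. The following Python sketch assumes that keys are `frozenset`s of terms and that the index state is available as plain in-memory sets; the function name `can_activate` and its parameters are hypothetical.

```python
from itertools import combinations


def can_activate(key, qf, index, truncated, qf_min, s_max):
    """Apply the popularity, size and redundancy filters to a candidate key.

    key       -- frozenset of terms
    qf        -- dict: key -> query frequency derived from the query log
    index     -- set of keys that are currently indexed
    truncated -- set of indexed keys whose posting list reached DFmax
    (All four containers are assumptions made for this sketch.)
    """
    if qf.get(key, 0) < qf_min:                   # popularity filter
        return False
    if not 2 <= len(key) <= s_max:                # size filter
        return False
    for sub in combinations(key, len(key) - 1):   # redundancy filter
        sub_key = frozenset(sub)
        if sub_key not in index or sub_key not in truncated:
            return False
    return True


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
ab, ac = a | b, a | c
index = {a, b, c}
truncated = {a, c}            # b's posting list is shorter than DFmax
qf = {ab: 3, ac: 3}

print(can_activate(ac, qf, index, truncated, qf_min=2, s_max=3))  # True
print(can_activate(ab, qf, index, truncated, qf_min=2, s_max=3))  # False
```

In the usage example ab is rejected by the redundancy filter: the posting list of b is complete, so a query containing b can already be answered from it.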


QDI: Retrieval

  • The single-term index has been generated

  • Process the query abc:

    • Probe Pabc: nothing is found

    • Probe Pab, Pbc and Pac: nothing is found

    • Probe Pa, Pb and Pc

    • Obtain the top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)

    • Contact the peers in the list and re-rank the obtained results w.r.t. abc

    • Output the top-10

  • Increment the QF for ab, bc and ac

  • Activate (index) ac
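The probing order in this walk-through can be sketched in Python as follows. The sketch assumes the index is a plain dictionary from `frozenset` keys to posting lists, and it assumes a skipping rule that the slide only implies: a key is not probed if a larger key containing it has already returned a result. The name `probe_lattice` and the sample posting lists are made up.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe keys from the largest term combination down to single terms.

    Assumption: a key is skipped when it is a proper subset of a key that
    already returned a posting list. Returns {key: posting_list}.
    """
    terms = sorted(set(query_terms))
    hits = {}
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < hit for hit in hits):
                continue                      # covered by a larger key
            postings = index.get(key)
            if postings is not None:
                hits[key] = postings
    return hits


K = frozenset
index = {K("a"): ["d1", "d2"], K("b"): ["d2", "d3"], K("c"): ["d4"]}
first = probe_lattice("abc", index)
print(sorted("".join(sorted(k)) for k in first))   # ['a', 'b', 'c']

index[K("ac")] = ["d1", "d4"]                      # ac has been activated
second = probe_lattice("abc", index)
print(sorted("".join(sorted(k)) for k in second))  # ['ac', 'b']
```

The posting lists returned by the hits would then be merged and re-ranked with respect to the full query; that step is left out of the sketch.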

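[Figure: a peer submits the query ?abc to the key lattice (abc; ab, ac, bc; a, b, c). The probes for abc, ab, bc and ac return nothing; a, b and c return posting lists truncated at DFmax. The QF counters of the three two-term keys are incremented (+1), and one key is marked as popular.]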


QDI: Retrieval 2

  • Assume the frequency of b is below DFmax

  • Note how the redundancy filter simplifies the lattice in such a case: the posting list of b is not truncated, so no key that contains b can be activated

    (grayed nodes cannot be activated)

[Figure: the key lattice (abc; ab, bc, ac; a, b, c) with the DFmax threshold; the nodes abc, ab and bc are grayed.]


QDI: Retrieval 3

  • The single-term index has been generated and ac is indexed

  • Process the query abc:

    • Probe Pabc: nothing is found

    • Probe Pab, Pbc and Pac: obtain the result for ac

    • Probe Pb and obtain the result for b

    • Contact all peers in the list to re-rank the obtained results w.r.t. abc

    • Output the top-10

  • Increment the QF for ab, bc and ac

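[Figure: a peer submits the query ?abc to the key lattice (abc; ab, ac, bc; a, b, c). The probes for abc, ab and bc return nothing; ac and b return posting lists. The QF counters of ab, bc and ac are incremented (+1).]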


Scalability

  • The retrieval traffic is bounded by a constant due to truncated posting lists (the bound depends on DFmax and the query size)

  • The indexing traffic depends on the number of keys to be activated.

    • The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents

    • In QDI the number of keys does not depend on the document collection size but only on the size of the query log

    • We can use the QFmin parameter to adjust the tradeoff between indexing traffic and retrieval quality
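To see why the retrieval traffic is bounded, one can count how many posting-list entries a single query could fetch at most. The following Python sketch is a back-of-the-envelope bound, not a figure from the talk: it assumes that every term combination of size 1..smax is probed and that each probe returns a full list of DFmax references. The function names are hypothetical.

```python
from math import comb


def max_probed_keys(query_size, s_max):
    """Number of term combinations of size 1..s_max in a query."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, i) for i in range(1, largest + 1))


def max_postings_transferred(query_size, s_max, df_max):
    """Assumed worst case: every probed key returns df_max references."""
    return max_probed_keys(query_size, s_max) * df_max


print(max_probed_keys(3, 3))                 # 7
print(max_postings_transferred(3, 3, 500))   # 3500
```

The bound depends only on the query size, smax and DFmax, and not on the number of documents or peers.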


AOL logs

  • 17M Queries from March, April, May 2006 (92 days)

  • 650K anonymous user sessions

  • Extracted all unique queries from each user session:

2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.

2006-05-31 23:50:30 l6 screensaver

2006-05-31 23:50:30 horses for sale in tn ky

2006-05-31 23:50:30 bank of america.com

2006-05-31 23:50:30 ask

2006-05-31 23:50:29 del rosa lanes

2006-05-31 23:50:28 www.spirit airlines.com

2006-05-31 23:50:28 find holy women of the bible

2006-05-31 23:50:27 trains

2006-05-31 23:50:27 todaysmiricles

2006-05-31 23:50:27 constition

2006-05-31 23:50:26 german grocceries in las vegas nv

2006-05-31 23:50:25 porn

2006-05-31 23:50:25 northwest indiana

2006-05-31 23:50:24 united.eprize.net

2006-05-31 23:50:24 jessica laguna

(log size: 0.7 GB)


Distribution of combinations in the AOL logs
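The counts behind such a distribution can be derived from the query log by enumerating the term combinations of every query. The Python sketch below is an assumption-laden illustration: it splits queries on whitespace, ignores user sessions, and uses the hypothetical helpers `parse_query` and `count_combinations`; how the authors tokenised the log is not stated in the slides.

```python
from collections import Counter
from itertools import combinations


def parse_query(line):
    """Split 'YYYY-MM-DD HH:MM:SS query text' and return the query text."""
    parts = line.split(maxsplit=2)
    return parts[2] if len(parts) == 3 else ""


def count_combinations(queries, s_max):
    """QF(k) for every term combination k with 2 <= |k| <= s_max.
    Assumption: terms are lower-cased, whitespace-separated tokens."""
    qf = Counter()
    for query in queries:
        terms = sorted(set(query.lower().split()))
        for size in range(2, min(len(terms), s_max) + 1):
            for combo in combinations(terms, size):
                qf[frozenset(combo)] += 1
    return qf


lines = [
    "2006-05-31 23:50:30 horses for sale in tn ky",
    "2006-05-31 23:50:28 find holy women of the bible",
    "2006-05-31 23:50:27 trains",
]
qf = count_combinations([parse_query(l) for l in lines], s_max=3)
print(qf[frozenset(["horses", "sale"])])   # 1
print(qf[frozenset(["trains"])])           # 0
```

Single-term queries such as "trains" contribute no combinations, since only keys with at least two terms are counted.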


TREC Experiment

  • WT10G collection (~1.69 M docs)

  • 100 TREC queries (from TREC Web Track 9 & 10)

  • Query statistics generated from 17M AOL queries

  • Okapi BM25 weighting scheme used to compute the ranking score

  • QFmin = 1, 3, 5, ∞

  • DFmax= 100, 500

  • smax=3
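For readers unfamiliar with Okapi BM25, the ranking score of a document can be sketched as below. The slides do not state which BM25 variant or parameters were used, so the values k1=1.2 and b=0.75 (common defaults) and the smoothed IDF form are assumptions, as are the function name and the example statistics.

```python
import math


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, num_docs,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag of query terms.

    doc_tf -- dict: term -> frequency in the document
    df     -- dict: term -> number of documents containing the term
    k1, b  -- assumed defaults, not taken from the talk
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = df.get(term, 0)
        # Smoothed IDF variant that stays positive for frequent terms.
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1.0)
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


df = {"p2p": 10, "retrieval": 50}
query = ["p2p", "retrieval"]
match = bm25_score(query, {"p2p": 2, "retrieval": 1}, 100, 120, df, 1000)
no_match = bm25_score(query, {"web": 3}, 100, 120, df, 1000)
print(match > no_match)  # True
```

A document that contains none of the query terms scores zero, so any matching document ranks above it.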

TREC: Precision at Top Ranked Pages (table)

Precision is similar to centralized indexing


Overlap experiment

  • Use the query log to build the index (days 1..91)

  • Randomly choose 2K test queries from day 92

  • Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its combinations that are indexed according to the logs.

  • This mimics our P2P IR system if Google’s ranking is used.

  • Example:

[Figure: the Google results for the original query next to the results for its non-superfluous (indexed) combinations; two results are marked X.]

overlap=3/5=60%
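The overlap measure in this example can be computed as the fraction of reference results that are also found in the union of the per-key results. The Python sketch below uses made-up result identifiers and a hypothetical function name `overlap`; it reproduces the 3/5 ratio of the example under that reading of the measure.

```python
def overlap(reference, retrieved):
    """Fraction of the reference results that also appear in `retrieved`."""
    reference = list(reference)
    if not reference:
        return 0.0
    found = set(retrieved)
    return sum(1 for doc in reference if doc in found) / len(reference)


# Hypothetical data: top results for the original query, and the top
# results for each of its indexed combinations.
google_top = ["u1", "u2", "u3", "u4", "u5"]
per_key_results = [["u1", "u7"], ["u3", "u4", "u9"]]
union = set().union(*per_key_results)

print(overlap(google_top, union))  # 0.6
```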


Overlap example

  • Cut-n-paste from the simulation log:

>id=481,q=“what did babe ruth do in the 1920”

“1920 babe ruth”, qf=0---->overlap=100%

“1920 babe”, qf=0--------->overlap= 9%

+++“1920 ruth”, qf=1--------->overlap=33%

+++“babe ruth”, qf=495 ------->overlap= 69%

---“1920”, qf=716 ------------>overlap= 1%

---“babe”, qf=3196 ----------->overlap= 2%

---“ruth”, qf=1653 ----------->overlap= 7%

Size: 192, Keys used: 2, overlap: 94%


Google experiment: impact of smax, DFmax

[Plot: impact of smax for all possible combinations (QFmin=0)]

[Plot: impact of DFmax with QFmin=1, smax=3]


Google experiment: impact of QFmin

  • The number of keys does not depend on the document collection size

  • The HDK approach would require ~65M keys for 650K documents

  • >30% of the badly performing queries are misspellings => the real quality is higher

[Plot: impact of QFmin (DFmax=600)]

[Plot: number of keys for different QFmin]


Google experiment: impact of the log size

[Plot: impact of the log size (QFmin=1, DFmax=600)]


Conclusions

  • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:

    • Stores posting lists in a DHT for terms and term combinations

    • Stores at most DFmax top document references in a posting list

    • Efficiently collects the query statistics in a distributed fashion

    • Based on these statistics, activates (indexes) only popular keys

    • Computes the result of a multi-term query based only on the index entries available at the moment, with no costly intersections

  • We also showed that:

    • With real query logs our approach achieves good retrieval quality

    • The QFmin parameter adjusts the traffic/quality tradeoff


Thank you for your attention!

Questions?

AlvisP2P: to appear in July at http://globalcomputing.epfl.ch/alvis/
