
Query-Driven Indexing for P2P Text Retrieval

The Future of Web Search, 19.07.2007, Bertinoro, Italy. Alvis. Query-Driven Indexing for P2P Text Retrieval. Gleb Skobeltsyn, EPFL, Switzerland, June 19, 2007. Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer.


Presentation Transcript


  1. The Future of Web Search, 19.07.2007, Bertinoro, Italy • Alvis • Query-Driven Indexing for P2P Text Retrieval • Gleb Skobeltsyn, EPFL, Switzerland, June 19, 2007 • Joint work with: • Toan Luu • Ivana Podnar Žarko • Martin Rajman • Karl Aberer G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval

  2. Goal • Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs) • Each peer: • Provides resources (bandwidth, storage) • Searches the whole network • Publishes its own documents • [Figure: DHT]

  3. Naïve (single-term) approach • ... is to distribute the global inverted index in a DHT using term partitioning • Query: “epfl & gleb” • [Figure: peers labelled K and I holding h(“gleb”)-{d2,d3}, h(“epfl”)-{d1,d2}, h(t’)-{d4,d5}; the lists {d1,d2} and {d2} are shown for the query] • This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
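
The term-partitioned index on this slide can be sketched as follows. This is a minimal illustration, assuming a toy in-memory stand-in for the DHT: `NUM_PEERS`, `peer_for`, `publish` and `query_and` are hypothetical names, and only the posting lists for “gleb” and “epfl” come from the slide.

```python
import hashlib

NUM_PEERS = 4  # assumed toy network size, not from the slides


def peer_for(term: str) -> int:
    """Map a term to the peer responsible for it (stand-in for DHT hashing)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PEERS


# One dictionary per peer: term -> full (untruncated) posting list.
peers = [dict() for _ in range(NUM_PEERS)]


def publish(term: str, doc_id: str) -> None:
    """Store a document reference at the peer that owns the term."""
    peers[peer_for(term)].setdefault(term, set()).add(doc_id)


def query_and(terms):
    """Fetch each term's posting list from its peer and intersect them."""
    lists = [peers[peer_for(t)].get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()


# Posting lists taken from the slide's example.
for term, docs in {"gleb": ["d2", "d3"], "epfl": ["d1", "d2"]}.items():
    for doc_id in docs:
        publish(term, doc_id)

result = query_and(["epfl", "gleb"])
print(result)
```

In this sketch whole posting lists have to be fetched before they can be intersected, which is the cost that the truncated, multi-term keys on the following slides aim to avoid.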

  4. Single-term vs. multi-term P2P indexing • Vocabulary size could grow exponentially! • How to choose keys to keep a satisfactory retrieval quality?

  5. Multi-term indexing: framework • Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism • Each key corresponds to a term or a set of terms • Each key is assigned a truncated posting list (TPL) that stores at most DFmax top-ranked document references • The distributed index contains {key, TPL} pairs • The indexing load is handled by an optimized DHT layer: • F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer, Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
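
A truncated posting list as described on this slide might look like the sketch below. It is an illustration only: the class name, the `truncated` flag, the scores and the tiny `df_max` value are assumptions, and the flag is set here when a reference had to be dropped, which is one possible reading of "truncated".

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max (score, doc_id) entries with the highest scores."""

    def __init__(self, df_max: int):
        self.df_max = df_max
        self.truncated = False  # becomes True once a reference is dropped
        self._heap = []         # min-heap: the weakest entry sits at index 0

    def add(self, doc_id: str, score: float) -> None:
        entry = (score, doc_id)
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, entry)
            return
        self.truncated = True
        if entry > self._heap[0]:
            # The new entry beats the weakest one, so it takes its place.
            heapq.heapreplace(self._heap, entry)

    def top(self):
        """Document references ordered from best to worst score."""
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


# The distributed index is a set of {key, TPL} pairs; a key is a set of terms.
index = {}
key = frozenset({"epfl", "gleb"})
tpl = index.setdefault(key, TruncatedPostingList(df_max=2))
for doc_id, score in [("d1", 0.1), ("d2", 0.9), ("d3", 0.5)]:
    tpl.add(doc_id, score)

print(tpl.top(), tpl.truncated)
```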

  6. Single-term vs. multi-term P2P indexing • Vocabulary size could grow exponentially! • How to choose keys to keep a satisfactory retrieval quality?

  7. Multi-term indexing techniques • Indexing with Highly Discriminative Keys (HDKs), based on: • Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07 • Beyond term indexing: A P2P framework for Web information retrieval, I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2, 2006 • Query-Driven Indexing (QDI), based on: • Web Text Retrieval with a P2P Query-Driven Index, G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07 • Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval, G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07

  8. Indexing with HDK • Data-driven key generation: • Each time a new document is indexed, the posting list for some key k can reach its maximum size DFmax • This triggers the generation of new keys (k + other frequent keys) • A number of filters reduce the number of keys, e.g.: • Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (as specified by a window size w)
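
The proximity filter could be sketched as below. The slide does not spell out how the window is measured, so this sketch assumes that "close" means two token positions differ by at most w; the function name and the sample document are made up for illustration.

```python
def passes_proximity_filter(tokens, t1, t2, w):
    """True if some occurrence of t1 lies within w positions of some t2.

    Assumption: distance is the absolute difference of token positions.
    """
    positions_1 = [i for i, tok in enumerate(tokens) if tok == t1]
    positions_2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) <= w for i in positions_1 for j in positions_2)


doc = "query driven indexing for p2p text retrieval".split()

near = passes_proximity_filter(doc, "query", "indexing", w=2)   # distance 2
far = passes_proximity_filter(doc, "query", "retrieval", w=2)   # distance 6
print(near, far)
```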

  9. Indexing with HDK • Pros: • The ICDE’07 paper proves that the number of keys grows linearly • Elegant key generation mechanism • Low bandwidth during query processing (posting lists of limited size) • Cons: • In practice the number of keys is LARGE: 68M for 0.6M docs • High bandwidth consumption at indexing • Problem: • Too many keys are superfluous (almost never used)

  10. Query-Driven Indexing • Let's index only what is queried!

  11. Contents • Introduction • Single-term vs. multi-term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion

  12. Query-Driven Index (QDI) • The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem: • Avoids maintenance of superfluous keys • Generates only keys that are requested by users • Uses the query log to discover such keys • Problems: • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key: • Smart Broadcast (ONM) or • Conventional intersection like TA, but less frequent • An incomplete index degrades the quality of query results • We show that the degradation is low

  13. Which keys to index? • Each single term found in the document collection has to be indexed • We call the set of all single-term keys the basic single-term index • The posting lists are truncated at DFmax • A key k is non-superfluous and can be activated iff: • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter) • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter) • all immediate sub-keys of k (of size |k|−1) are indexed and their associated posting lists are truncated (redundancy filter)
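
The three filters above can be combined into a single activation check, sketched here under stated assumptions: keys are modelled as frozensets of terms, and `qf`, `indexed` and `truncated` are hypothetical data structures standing in for the query statistics and index state.

```python
from itertools import combinations


def can_activate(key, qf, indexed, truncated, qf_min, s_max):
    """Apply the popularity, size and redundancy filters to a candidate key.

    key       -- frozenset of terms
    qf        -- dict: key -> query frequency derived from the query log
    indexed   -- set of keys currently in the index
    truncated -- set of indexed keys whose posting list is truncated
    """
    # Popularity filter: QF(k) >= QFmin.
    if qf.get(key, 0) < qf_min:
        return False
    # Size filter: 2 <= |k| <= smax.
    if not 2 <= len(key) <= s_max:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for combo in combinations(sorted(key), len(key) - 1):
        sub_key = frozenset(combo)
        if sub_key not in indexed or sub_key not in truncated:
            return False
    return True


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
ab, ac = frozenset("ab"), frozenset("ac")
qf = {ab: 5, ac: 3}
indexed = {a, b, c}
truncated = {a, c}  # assume the posting list of b stays below DFmax

print(can_activate(ac, qf, indexed, truncated, qf_min=3, s_max=3))
print(can_activate(ab, qf, indexed, truncated, qf_min=3, s_max=3))
```

With these sample values `ac` qualifies, while `ab` is rejected by the redundancy filter because the posting list of `b` is complete, so a combined key would add nothing.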

  14. QDI: Retrieval • Single-term index is generated • Process abc • Probe Pabc • Probe Pab, Pbc and Pac • Probe Pa, Pb and Pc • Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively) • Contact peers in the list, re-rank the obtained results w.r.t. abc • Output top-10 • Increment the QF for ab, bc and ac • Activate (index) ac • [Figure: a peer issuing ?abc over the key lattice abc; ab, ac, bc; a, b, c, with the labels “nothing”, “+1”, “popular” and “DFmax”]

  15. QDI: Retrieval 2 • Assume the frequency of b is below DFmax • Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated) • [Figure: key lattice abc; ab, bc, ac; a, b, c, with grayed nodes and the DFmax threshold]

  16. QDI: Retrieval 3 • Single-term index is generated and ac is indexed • Process abc • Probe Pabc • Probe Pab, Pbc and Pac: obtain the result for ac • Probe Pb and obtain the result for b • Contact all peers in the list to re-rank the obtained results w.r.t. abc • Output top-10 • Increment the QF for ab, bc and ac • [Figure: a peer issuing ?abc over the key lattice abc; ab, ac, bc; a, b, c, with the labels “nothing” and “+1”]
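
The probing order in the retrieval slides can be sketched as a top-down walk over the lattice of term combinations. This is an illustration under assumptions: `probe_lattice` and the dictionary-based index are hypothetical, and re-ranking and QF updates are left out.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe term combinations from the largest to the smallest.

    index -- dict: frozenset of terms -> posting list (list of doc ids)
    Returns (key, posting list) hits. A key is skipped when a larger key
    that contains it has already been found.
    """
    terms = sorted(set(query_terms))
    hits = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < found for found, _ in hits):
                continue  # covered by a more specific key found earlier
            if key in index:
                hits.append((key, index[key]))
    return hits


# State from the third retrieval slide: single terms plus the activated key ac.
index = {
    frozenset("a"): ["d1", "d2"],
    frozenset("b"): ["d2", "d3"],
    frozenset("c"): ["d1", "d4"],
    frozenset("ac"): ["d1"],
}
found_keys = [key for key, _ in probe_lattice(["a", "b", "c"], index)]
print(found_keys)
```

With `ac` in the index only `ac` and `b` contribute posting lists. Without it the same call falls back to `a`, `b` and `c`, as on the first retrieval slide.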

  17. Scalability • The retrieval traffic is bounded by a constant due to truncated posting lists (the constant depends on DFmax and the query size) • The indexing traffic depends on the number of keys to be activated • The number of keys in the HDK approach (upper bound) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents • The number of QDI keys does not depend on the document collection size but only on the size of the query log • We can use the QFmin parameter to adjust the tradeoff: indexing traffic <-> retrieval quality
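
A rough way to see why the retrieval traffic is bounded: a query can probe only a fixed number of term combinations, and each probe returns at most DFmax references. The sketch below computes that worst case. It is a back-of-the-envelope bound on posting-list references only, assuming every combination of up to smax terms is probed and found at full length; the function name is hypothetical, and the traffic for re-ranking is not modelled.

```python
from math import comb


def max_postings_fetched(query_size: int, s_max: int, df_max: int) -> int:
    """Worst-case number of document references fetched for one query."""
    largest_key = min(query_size, s_max)
    probes = sum(comb(query_size, i) for i in range(1, largest_key + 1))
    return probes * df_max


# A three-term query with smax = 3 has 3 + 3 + 1 = 7 possible keys.
bound = max_postings_fetched(query_size=3, s_max=3, df_max=100)
print(bound)
```

The bound depends only on the query size, smax and DFmax, not on the number of documents or peers.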

  18. Contents • Introduction • Single-term vs. multi-term indexing • HDK approach for indexing • Query-driven approach for indexing/retrieval • Indexing structure • Example • Scalability • Evaluation • Conclusion

  19. AOL logs • 17M queries from March, April, May 2006 (92 days) • 650K anonymous user sessions • Extracted all unique queries from each user session (<- 0.7Gb): … 2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.; 2006-05-31 23:50:30 l6 screensaver; 2006-05-31 23:50:30 horses for sale in tn ky; 2006-05-31 23:50:30 bank of america.com; 2006-05-31 23:50:30 ask; 2006-05-31 23:50:29 del rosa lanes; 2006-05-31 23:50:28 www.spirit airlines.com; 2006-05-31 23:50:28 find holy women of the bible; 2006-05-31 23:50:27 trains; 2006-05-31 23:50:27 todaysmiricles; 2006-05-31 23:50:27 constition; 2006-05-31 23:50:26 german grocceries in las vegas nv; 2006-05-31 23:50:25 porn; 2006-05-31 23:50:25 northwest indiana; 2006-05-31 23:50:24 united.eprize.net; 2006-05-31 23:50:24 jessica laguna …

  20. Distribution of combinations in the AOL logs

  21. TREC Experiment • WT10G collection (~1.69M docs) • 100 TREC queries (from TREC Web Track 9 & 10) • Query statistics generated from 17M AOL queries • Okapi BM25 weighting scheme used to compute the ranking score • QFmin = 1, 3, 5, ∞ • DFmax = 100, 500 • smax = 3 • [Table: TREC precision at top-ranked pages] • Precision is similar to centralized indexing

  22. Overlap experiment • Use the query log to build the index (days 1..91) • Randomly choose 2K test queries from day 92 • Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its combinations that are indexed according to the logs • This mimics our P2PIR system if Google’s ranking is used • Example: [Figure: results of the original query compared with the results of its non-superfluous (indexed) combinations] overlap@5 = 3/5 = 60%
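
The overlap measure in this experiment can be sketched as below, assuming overlap@k is the fraction of the top-k results of the original query that also appear in the union of the results of its indexed combinations. The function name and the document ids are made up; the numbers reproduce the slide's overlap@5 = 3/5 example.

```python
def overlap_at_k(reference, candidates, k):
    """Fraction of the top-k reference results found among the candidates."""
    top = reference[:k]
    if not top:
        return 0.0
    pool = set(candidates)
    return sum(1 for doc in top if doc in pool) / len(top)


# Top-5 results of the original query (hypothetical ids).
original = ["r1", "r2", "r3", "r4", "r5"]

# Results of two indexed combinations of the query (hypothetical).
combination_results = [["r1", "x1", "r3"], ["r4", "x2"]]
union = [doc for results in combination_results for doc in results]

score = overlap_at_k(original, union, k=5)
print(f"overlap@5 = {score:.0%}")
```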

  23. Overlap example • Cut-and-paste from the simulation log: >id=481, q=“what did babe ruth do in the 1920”; “1920 babe ruth”, qf=0 -> Ov@100=100%; “1920 babe”, qf=0 -> Ov@100=9%; +++“1920 ruth”, qf=1 -> Ov@100=33%; +++“babe ruth”, qf=495 -> Ov@100=69%; ---“1920”, qf=716 -> Ov@100=1%; ---“babe”, qf=3196 -> Ov@100=2%; ---“ruth”, qf=1653 -> Ov@100=7%; Size: 192, Keys used: 2, Overlap@100: 94%

  24. Google experiment: impact of smax, DFmax • [Figure: impact of smax for all possible combinations (QFmin=0)] • [Figure: impact of DFmax with QFmin=1, smax=3]

  25. Google experiment: impact of QFmin • [Figure: impact of QFmin (DFmax=600)] • [Figure: number of keys for different QFmin] • Does not depend on the document collection size • HDK approach would require ~65M keys for 650K documents • >30% of badly performing queries are misspellings => real quality is higher

  26. Google experiment: impact of the log size • [Figure: impact of the log size (QFmin=1, DFmax=600)]

  27. Conclusions • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks: • Stores posting lists in a DHT for terms and term combinations • Stores at most DFmax top document references in a posting list • Efficiently collects the query statistics in a distributed fashion • Based on these statistics, activates (indexes) only popular keys • Computes the result of a multi-term query based only on the index entries available at the moment (no costly intersections) • We also showed that: • With real query logs our approach achieves good retrieval quality • The QFmin parameter adjusts the traffic/quality tradeoff

  28. Last slide • Thank you for your attention! Questions? • AlvisP2P - to appear in July at http://globalcomputing.epfl.ch/alvis/
