Heavy-Tailed Distribution and Multi-Keyword Queries

Heavy-Tailed Distribution and Multi-Keyword Queries. Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui, Microsoft Corporation. SIGIR 2007. 2008. 07. 31. Summarized by JongHeum Yeon, IDS Lab., Seoul National University.

Presentation Transcript


  1. Heavy-Tailed Distribution and Multi-Keyword Queries • Surajit Chaudhuri, Kenneth Church, Arnd Christian König, Liying Sui • Microsoft Corporation • SIGIR 2007 • 2008. 07. 31. • Summarized by JongHeum Yeon, IDS Lab., Seoul National University

  2. INTRODUCTION • Inverted Index in Information Retrieval • T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana" • "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1} • Search for "what", "is", "it" • {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1} • On web sites with large product catalogs (Amazon, eBay, …), some queries require costly deep traversal of long lists • The challenge is to reduce the worst-case overhead required to process arbitrary keyword queries
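The toy example on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming whitespace tokenization and in-memory sets; the function names `build_inverted_index` and `intersect` are illustrative and do not come from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def intersect(index, keywords):
    """Return the IDs of documents that contain every keyword."""
    lists = [set(index.get(w, ())) for w in keywords]
    if not lists:
        return []
    return sorted(set.intersection(*lists))

# T0, T1, T2 from the slide
docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = intersect(index, ["what", "is", "it"])

print(index["is"])  # [0, 1, 2]
print(result)       # [0, 1]
```

Every list is read in full before intersecting, which is what makes queries over frequent keywords expensive.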

  3. Motivating Scenario • More frequent terms have relatively long inverted lists • Intersecting long inverted lists is very slow relative to other queries • Figure (not reproduced): 20 million products; keyword frequency classes F (> 900K), M (50K), L (< 1K)

  4. Problem Statement • Given a document collection, propose a set of indexes to materialize such that: • The time for intersecting keywords does not exceed a given threshold Δ • The additional indexes are no larger than k (a small factor) times the size of the original inverted index

  5. INDEX STRUCTURE AND USAGE • Notation • Query Q • words(Q) = {w1, … , wl} • kmax : maximum number of terms in a query • γ : global vocabulary • π : global ordering • Given a keyword combination C = {w1, … , wl}, its words are sorted by the global ordering so that permutations of the same combination are not handled separately • size(Q) : number of items (= documents) whose text contains all keywords of query Q • size(w) : for a single word w, the number of documents containing w • |Q| : number of keywords that query Q contains
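The role of the global ordering π can be illustrated with a short sketch. It assumes π is given as a rank per vocabulary word; the alphabetical `rank` table and the name `canonical_key` are hypothetical and only serve the illustration.

```python
def canonical_key(keywords, rank):
    """Sort a keyword combination by the global ordering.

    Every permutation of the same combination maps to one tuple,
    so only one entry per combination needs to be stored.
    """
    return tuple(sorted(set(keywords), key=lambda w: rank[w]))

# Hypothetical global ordering (alphabetical here); the slides do not fix one.
rank = {"a": 0, "banana": 1, "is": 2, "it": 3, "what": 4}

k1 = canonical_key(["what", "is", "it"], rank)
k2 = canonical_key(["it", "what", "is"], rank)

print(k1)        # ('is', 'it', 'what')
print(k1 == k2)  # True
```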

  6. Cost Model • Cost = disk seeks to the beginning of posting lists + scanning of postings • Unit of cost : scanning a single posting in an inverted index • Δ : cost bound
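A minimal sketch of this cost model follows. It assumes the seek cost is expressed in the same unit as a posting scan and uses CostSeek = 1000, the value reported on the experiments slide; the list lengths and the bound passed as `delta` are made-up numbers.

```python
COST_SEEK = 1000  # seek cost in posting-scan units (value from the experiments slide)

def id_intersection_cost(list_lengths, cost_seek=COST_SEEK):
    """One seek per posting list plus a full scan of every list."""
    return len(list_lengths) * cost_seek + sum(list_lengths)

def within_bound(cost, delta):
    """Check the cost against the bound Δ."""
    return cost <= delta

lengths = [900_000, 50_000]           # hypothetical posting-list lengths
cost = id_intersection_cost(lengths)  # 2 * 1000 + 950_000

print(cost)                               # 952000
print(within_bound(cost, delta=200_000))  # False
```

With lists this long the scan term dominates, so the seek cost only matters for queries over short lists.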

  7. Processing Strategies • Execution Strategies • ID-intersection • Retrieves the inverted lists of all queried keywords and intersects them • Requires |Q| disk seeks and reads each list in its entirety • Post-filtering • Used when some wi in Q is very rare • Uses the inverted list of wi to read the text of the items containing wi, then verifies the remaining keyword constraints against that text
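The two strategies can be contrasted on the toy collection from the introduction. This is a sketch under simplifying assumptions: everything is held in memory, text is tokenized on whitespace, and post-filtering simply starts from the keyword with the shortest list. The function names are illustrative.

```python
docs = ["it is what it is", "what is it", "it is a banana"]
index = {"a": [2], "banana": [2], "is": [0, 1, 2], "it": [0, 1, 2], "what": [0, 1]}

def id_intersection(index, keywords):
    """Read every queried posting list in full and intersect them."""
    postings = [index.get(w, []) for w in keywords]
    if not postings:
        return []
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    return sorted(result)

def post_filtering(index, docs, keywords):
    """Scan only the rarest keyword's list, then check the other keywords in the text."""
    if not keywords:
        return []
    rarest = min(keywords, key=lambda w: len(index.get(w, [])))
    others = [w for w in keywords if w != rarest]
    hits = []
    for doc_id in index.get(rarest, []):
        words = set(docs[doc_id].split())
        if all(w in words for w in others):
            hits.append(doc_id)
    return hits

print(id_intersection(index, ["what", "is", "it"]))  # [0, 1]
print(post_filtering(index, docs, ["banana", "is"]))  # [2]
```

Post-filtering never touches the long list for "is"; it reads one posting for "banana" and one document text.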

  8. Index Structure • Materialize inverted indexes for combinations of frequent keywords, and only for a small fraction of those combinations • For each vocabulary item w, keep a list of all keyword combinations containing w for which the corresponding inverted index has been materialized

  9. Query Processing • Query Q = {w1, … , wl} • If Q contains a rare keyword : use the post-filtering strategy • Otherwise : retrieve all match-list entries
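The dispatch described on this slide might be sketched as below. The slide stops at "retrieve all match-list entries", so keeping only the materialized combinations that are fully contained in the query is an assumption of this sketch, as are the names `plan_query`, `list_len`, `match_list` and the rarity threshold of 50.

```python
def plan_query(keywords, list_len, match_list, rare_threshold):
    """Return a (strategy, detail) pair describing how the query would be processed."""
    rare = [w for w in keywords if list_len.get(w, 0) < rare_threshold]
    if rare:
        # Post-filtering is driven by the rarest keyword.
        return ("post-filtering", min(rare, key=lambda w: list_len.get(w, 0)))
    query = set(keywords)
    usable = set()
    for w in keywords:
        for combo in match_list.get(w, []):
            # Assumption: only combinations fully contained in the query are useful.
            if set(combo) <= query:
                usable.add(combo)
    return ("match-list", sorted(usable))

# Hypothetical statistics and match lists
list_len = {"digital": 900_000, "camera": 800_000, "case": 600_000, "zx81": 12}
match_list = {
    "camera": [("camera", "digital"), ("camera", "lens")],
    "digital": [("camera", "digital")],
}

plan_a = plan_query(["digital", "camera", "case"], list_len, match_list, rare_threshold=50)
plan_b = plan_query(["digital", "zx81"], list_len, match_list, rare_threshold=50)

print(plan_a)  # ('match-list', [('camera', 'digital')])
print(plan_b)  # ('post-filtering', 'zx81')
```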

  10. EXPERIMENTS • Evaluation of Query Cost • Materialized the index structure : 10K frequent words • kmax = 4, CostSeek = 1000 • Δ : cost of scanning 20% of the number of postings • Speed-ups • 18x (2 keywords) • 14x (4 keywords) • Evaluation of Index Sizes • 899M postings • No additional indexes for keywords occurring in fewer than 50 documents • 141K keywords for indexing • Multi-keyword index structures contained 734M postings • Accuracy of Intersection-size Estimation • Match list covers 99.3%
