
Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing


Presentation Transcript


  1. Efficient and Self-tuning Incremental Query Expansions for Top-k Query Processing. Martin Theobald, Ralf Schenkel, Gerhard Weikum. Max-Planck Institute for Informatics, Saarbrücken, Germany. ACM SIGIR ’05.

  2. An Initial Example…
• TREC Robust Track ’04, hard query no. 363 (Aquaint news corpus): “transportation tunnel disasters”
• Increased retrieval robustness: count only the best match per document and expansion set
• Increased efficiency: top-k-style query evaluations, open scans on new terms only on demand, no threshold tuning
Expansion sets with term similarities (from the slide’s figure):
• transportation (1.0): transit 0.9, highway 0.8, train 0.7, truck 0.6, metro 0.6, “rail car” 0.5, car 0.1, …
• tunnel (1.0): tube 0.9, underground 0.8, “Mont Blanc” 0.7, …
• disasters (1.0): catastrophe 1.0, accident 0.9, fire 0.7, flood 0.6, earthquake 0.6, “land slide” 0.5, …
Expansion terms come from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.; term similarities from, e.g., Rocchio, Robertson & Sparck-Jones weights, concept similarities, or other correlation measures.
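The best-match aggregation above is easy to state in code: within an expansion set, a document is scored by the single highest sim(t, t′) · s(t′, d) product, not by the sum over all expansion terms. A minimal Python sketch, using the ~disasters expansion set from the figure and hypothetical local scores for one document:

```python
# Best-match aggregation per expansion set: only the highest-scoring
# expansion term counts per document, so adding weakly related terms
# cannot inflate a document's score (this is what limits topic drift).
expansions = {  # sim(disasters, t') from the expansion set above
    "catastrophe": 1.0, "accident": 0.9, "fire": 0.7,
    "flood": 0.6, "earthquake": 0.6, "land slide": 0.5,
}
local_scores = {"accident": 0.8, "fire": 0.3}  # hypothetical s(t', d)

# score(d, ~disasters) = max over t' of sim(disasters, t') * s(t', d)
score = max(sim * local_scores.get(term, 0.0)
            for term, sim in expansions.items())
print(score)  # 0.72: "accident" (0.9 * 0.8) is the best match
```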

  3. Outline
• Computational model & background on top-k algorithms
• Incremental Merge over inverted lists
• Probabilistic candidate pruning
• Phrase matching
• Experiments & Conclusions

  4. Computational Model
• Vector space model with a Cartesian product space D1 × … × Dm and a data set D ⊆ D1 × … × Dm
• Precomputed local scores s(ti,d) ∈ Di for all d ∈ D
  • e.g., tf·idf variations, probabilistic models (Okapi BM25), etc.
  • typically normalized to s(ti,d) ∈ [0,1]
• Monotonic score aggregation
  • aggr: (D1 × … × Dm) × (D1 × … × Dm) → ℝ+
  • e.g., sum, max, product (using sum over log sij), cosine (using L2 norm)
• Partial-match queries (aka “andish”)
  • Non-conjunctive query evaluations
  • Weak local matches can be compensated
• Access model
  • Disk-resident inverted index over a large text corpus
  • Inverted lists sorted by decreasing local scores
  → Inexpensive sequential accesses to per-term lists: “getNextItem()”
  → More expensive random accesses: “getItemBy(docid)”
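The access model boils down to a small interface over score-sorted lists. A minimal sketch of such a list (the class and method names below are illustrative, not the paper’s actual API; only getNextItem()/getItemBy(docid) come from the slide):

```python
class InvertedList:
    """An inverted list sorted by descending local score, supporting
    cheap sequential access and more expensive random access."""

    def __init__(self, postings):
        # postings: iterable of (docid, score) pairs
        self.postings = sorted(postings, key=lambda p: -p[1])
        self.cursor = 0
        self.by_doc = dict(self.postings)

    def get_next_item(self):
        """Sequential access: next (docid, score) in score order, or None."""
        if self.cursor >= len(self.postings):
            return None
        item = self.postings[self.cursor]
        self.cursor += 1
        return item

    def get_item_by(self, docid):
        """Random access: the local score of docid (0.0 if absent)."""
        return self.by_doc.get(docid, 0.0)
```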

  5. No-Random-Access (NRA) Algorithm [Fagin et al., PODS ’01; Balke et al., VLDB ’00; Buckley & Lewit, SIGIR ’85]
• NRA(q, L):
  • scan all lists Li (i = 1..m) in parallel // e.g., round-robin
    • ⟨d, s(ti,d)⟩ = Li.getNextItem()
    • E(d) = E(d) ∪ {i}
    • highi = s(ti,d)
    • worstscore(d) = ∑i∈E(d) s(ti,d)
    • bestscore(d) = worstscore(d) + ∑i∉E(d) highi
    • if worstscore(d) > min-k then
      • add d to top-k
      • min-k = min{ worstscore(d’) | d’ ∈ top-k }
    • else if bestscore(d) > min-k then
      • candidates = candidates ∪ {d}
    • if max{ bestscore(d’) | d’ ∈ candidates } ≤ min-k then return top-k
Example (figure): query q = (transportation, tunnel, disaster) over an inverted index for a corpus d1, …, dn, with k = 1:
  transport: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
  tunnel: d64 0.8, d23 0.6, d10 0.6, d12 0.2, d78 0.1, …
  disaster: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
NRA can STOP at scan depth 3, whereas a naive join-then-sort takes between O(m·n) and O(m·n²) runtime.
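A self-contained Python sketch of the NRA loop above: it scans plain (docid, score) lists round-robin and maintains [worstscore, bestscore] bounds per seen document. Run on the slide’s example lists, it stops at scan depth 3 with d10 as the top-1 result:

```python
def nra(lists, k):
    """No-Random-Access top-k over score-sorted lists of (docid, score)."""
    m = len(lists)
    high = [l[0][1] if l else 0.0 for l in lists]  # current high_i per list
    seen = {}    # docid -> {list index i: s(ti, d)} for dimensions seen so far
    bounds = []
    depth, max_depth = 0, max(len(l) for l in lists)
    while depth < max_depth:
        for i in range(m):  # round-robin: one sequential access per list
            if depth < len(lists[i]):
                d, s = lists[i][depth]
                high[i] = s
                seen.setdefault(d, {})[i] = s
            else:
                high[i] = 0.0
        depth += 1
        # [worstscore, bestscore] bounds for every document seen so far
        bounds = []
        for d, dims in seen.items():
            worst = sum(dims.values())
            best = worst + sum(high[i] for i in range(m) if i not in dims)
            bounds.append((worst, best, d))
        bounds.sort(reverse=True)
        min_k = bounds[k - 1][0] if len(bounds) >= k else 0.0
        # stop when no candidate outside the top-k can still beat min-k
        if len(bounds) >= k and all(b <= min_k for _, b, _ in bounds[k:]):
            return [(d, w) for w, _, d in bounds[:k]]
    return [(d, w) for w, _, d in bounds[:k]]

transport = [("d78", .9), ("d23", .8), ("d10", .8), ("d1", .7), ("d88", .2)]
tunnel    = [("d64", .8), ("d23", .6), ("d10", .6), ("d12", .2), ("d78", .1)]
disaster  = [("d10", .7), ("d78", .5), ("d64", .4), ("d99", .2), ("d34", .1)]
print(nra([transport, tunnel, disaster], k=1))  # [('d10', 2.1)]
```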

  6. Outline
• Computational model & background on top-k algorithms
• Incremental Merge over inverted lists
• Probabilistic candidate pruning
• Phrase matching
• Experiments & Conclusions

  7. Dynamic & Self-tuning Query Expansions
• Incrementally merge inverted lists Li1 … Lim’ in descending order of local scores
• Dynamically add lists into the set of active expansions exp(ti)
• Only touch short prefixes of each list; there is no need to open all lists
• Best-match score aggregation for combined term similarities and local scores
(Figure: a top-k operator evaluates (transport, tunnel, ~disaster); the virtual list ~disaster is produced by incrementally merging the inverted lists for accident, disaster, and fire.)
→ Increased retrieval robustness & fewer topic drifts
→ Increased efficiency through fewer active expansions
→ No threshold tuning of term similarities in the expansions

  8. Incremental Merge Operator
• Expansion terms ~t = {t1, t2, t3}, obtained from relevance feedback, thesaurus lookups, …
• Expansion similarities from correlation measures, large-corpus statistics, …: sim(t,t1) = 1.0, sim(t,t2) = 0.9, sim(t,t3) = 0.5
• Index list meta data (e.g., histograms) provides the initial high-scores of each list
• Incremental Merge is iteratively triggered by the top-k operator through sequential “getNextItem()” accesses
Example (figure):
  t1: d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, …
  t2: d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, …
  t3: d11 0.9, d78 0.9, d64 0.7, d99 0.7, d34 0.6, …
Merged virtual list ~t, with local scores weighted by sim(t,ti): d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, d88 0.3, …
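A minimal sketch of this operator in Python: a priority queue keyed on sim(t, ti) · s(ti, d) pulls one posting at a time from whichever expansion list currently offers the highest weighted score, reproducing the merged list above. Duplicate docids are deliberately passed through; the consuming top-k operator keeps only the best match per expansion set.

```python
import heapq

def incremental_merge(lists, sims):
    """lists[i]: [(docid, score), ...] sorted by descending score;
    sims[i] = sim(t, ti). Yields (docid, weighted score) on demand."""
    heap = []
    for i, l in enumerate(lists):
        if l:  # seed with each list's first (highest-scoring) posting
            d, s = l[0]
            heapq.heappush(heap, (-sims[i] * s, i, 0, d))
    while heap:
        neg, i, pos, d = heapq.heappop(heap)
        if pos + 1 < len(lists[i]):  # advance list i by one posting
            d2, s2 = lists[i][pos + 1]
            heapq.heappush(heap, (-sims[i] * s2, i, pos + 1, d2))
        yield d, -neg

# The example from the slide: ~t = {t1, t2, t3}, sims 1.0, 0.9, 0.5
t1 = [("d78", .9), ("d23", .8), ("d10", .8), ("d1", .4), ("d88", .3)]
t2 = [("d64", .8), ("d23", .8), ("d10", .7), ("d12", .2), ("d78", .1)]
t3 = [("d11", .9), ("d78", .9), ("d64", .7), ("d99", .7), ("d34", .6)]
for d, s in incremental_merge([t1, t2, t3], [1.0, 0.9, 0.5]):
    print(d, round(s, 2))  # d78 0.9, d23 0.8, d10 0.8, d64 0.72, ...
```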

  9. Outline
• Computational model & background on top-k algorithms
• Incremental Merge over inverted lists
• Probabilistic candidate pruning
• Phrase matching
• Experiments & Conclusions

  10. Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ’04]
• For each physically stored index list Li
  • Treat each s(ti,d) ∈ [0,1] as a random variable Si and consider the probability P[Si > δ] that its score exceeds a threshold δ
  • Approximate the local score distribution using an equi-width histogram with n buckets

  11. Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ’04]
• For each physically stored index list Li: as on slide 10
• For a virtual index list ~Li = Li1 ∪ … ∪ Lim’
  • Consider the max-distribution (assuming feature independence)
  • Alternatively, construct a meta histogram for the active expansions

  12. Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ’04]
• For each physically stored index list Li: as on slide 10
• For a virtual index list ~Li = Li1 ∪ … ∪ Lim’: as on slide 11
• For all d in the candidate queue
  • Consider the convolution over the local score distributions to predict d’s aggregated score
  • Drop d from the candidate queue if P[worstscore(d) + ∑i∉E(d) Si > min-k] ≤ ε
→ Return the current top-k list if the candidate queue is empty!
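A minimal sketch of the pruning test, assuming equi-width histograms over [0,1] and feature independence; a faithful implementation would condition each histogram on the still-unscanned tail of its list (scores below the current high_i), which this sketch omits:

```python
import numpy as np

N_BUCKETS = 10  # equi-width buckets over [0, 1]

def histogram(scores):
    """Bucket probabilities approximating a list's local score distribution."""
    counts, _ = np.histogram(scores, bins=N_BUCKETS, range=(0.0, 1.0))
    return counts / counts.sum()

def drop_candidate(worstscore, remaining_hists, min_k, eps):
    """True if P[worstscore + sum of the remaining Si > min-k] <= eps."""
    dist = np.array([1.0])              # point mass at score 0
    for h in remaining_hists:           # one convolution per missing dimension
        dist = np.convolve(dist, h)
    support = np.arange(len(dist)) / N_BUCKETS  # approx. score per sum-bucket
    p_beat = dist[worstscore + support > min_k].sum()
    return p_beat <= eps
```

With ε = 0 this degenerates to the conservative NRA threshold test; larger ε prunes candidates more aggressively at a small, controllable loss in result quality (cf. the ε-sweep on slide 17).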

  13. Outline
• Computational model & background on top-k algorithms
• Incremental Merge over inverted lists
• Probabilistic candidate pruning
• Phrase matching
• Experiments & Conclusions

  14. Incremental Merge for Multidimensional Phrases
Query q = {undersea, “fiber optic cable”}; the phrase is expanded into the subqueries “fiber optic cable” (sim = 1.0) and “fiber optics” (sim = 0.8), each evaluated by a Nested Top-k operator whose outputs are combined by an Incremental Merge operator below the top-level Top-k operator (figure).
• A Nested Top-k operator iteratively prefetches & joins candidate items for each subquery condition (“getNextItem()”)
• Propagates candidates in descending order of bestscore(d) values to provide monotonic upper score bounds
• Provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
• The top-level top-k operator performs phrase tests only for the most promising items, via random accesses to a term-to-position index (expensive predicates & minimal probes [Chang & Hwang, SIGMOD ’02])
• Single threshold condition for algorithm termination (candidate pruning at the top-level queue only)
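The phrase test itself is a cheap random access once a promising candidate is known. A minimal sketch against a hypothetical term-to-position index (the index layout and all names below are illustrative, not the paper’s data structures):

```python
def phrase_match(positions, docid, phrase_terms):
    """True if phrase_terms occur at consecutive word positions in docid."""
    pos_lists = [positions.get((docid, t)) for t in phrase_terms]
    if any(pl is None for pl in pos_lists):
        return False  # some phrase term does not occur in the document
    first, rest = pos_lists[0], pos_lists[1:]
    # check offsets +1, +2, ... from each occurrence of the first term
    return any(all(p + 1 + i in pl for i, pl in enumerate(rest))
               for p in first)

positions = {  # hypothetical (docid, term) -> set of word positions
    ("d78", "fiber"): {3, 17}, ("d78", "optic"): {4}, ("d78", "cable"): {5},
}
print(phrase_match(positions, "d78", ["fiber", "optic", "cable"]))  # True
```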

  15. Outline
• Computational model & background on top-k algorithms
• Incremental Merge over inverted lists
• Probabilistic candidate pruning
• Phrase matching
• Experiments & Conclusions

  16. Experiments – Aquaint with Fixed Expansions
• Aquaint corpus of English news articles (528,155 docs)
• 50 “hard” queries from the TREC 2004 Robust track
• WordNet expansions using a simple form of WSD
• Okapi BM25 model for local scores, Dice coefficients as term similarities
• Fixed expansion technique (synonyms + first-order hyponyms)
(Results table, comparing the title-only baseline, static expansions, and dynamic expansions on P@10, MAP@1000, relative precision, #SA, #RA, CPU seconds, max memory (KB), and the average/maximum number of expansion terms m.)

  17. Experiments – Aquaint with Fixed Expansions, cont’d
(Figures: probabilistic pruning performance, and Incremental Merge vs. top-k with static expansions; the parameter ε, 0 ≤ ε ≤ 1, controls the pruning aggressiveness.)

  18. Conclusions & Ongoing Work
• Increased efficiency
  • Incremental Merge vs. Join-then-Sort & top-k using static expansions
  • Very good precision/runtime ratio for probabilistic pruning
• Increased retrieval robustness
  • Largely avoids topic drifts
  • Modeling of fine-grained semantic similarities (Incremental Merge & Nested Top-k operators)
• Scalability (see paper)
  • Large expansions (m < 876 terms per query) on Aquaint
  • Expansions for the Terabyte collection (~25,000,000 docs)
• Efficient support for XML-IR (INEX benchmark)
  • Inverted lists for combined tag-term pairs, e.g., sec=mining
  • Efficiently supports the child-or-descendant axis, e.g., //article//sec=mining
  • Vague content & structure queries (VCAS), e.g., //article//~sec=~mining
  • TopX engine, VLDB ’05

  19. Thank you!
