Topical Query Decomposition

Topical Query Decomposition. Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis. Yahoo! Research Barcelona, Spain. KDD 08. Abstract. Given a query and a document retrieval system

Presentation Transcript


  1. Topical Query Decomposition • Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis • Yahoo! Research Barcelona, Spain • KDD 08

  2. Abstract • Given a query and a document retrieval system, produce a small set of queries whose union of result documents approximately matches the result set of the original query. • Formulated as a set cover problem: solved with a greedy algorithm • Formulated as a clustering problem: solved with a two-phase algorithm based on hierarchical agglomerative clustering and dynamic programming

  3. Introduction • A query log L is a list of pairs <q, D(q)> • q: a query • D(q): its result, the set of documents that answer query q • Q(q): the maximal set of queries pi such that each D(pi) has at least one document in common with D(q)
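
As an illustration of the definition of Q(q), here is a minimal Python sketch. It assumes the query log is held in memory as a dictionary from query string to a set of document ids, and it assumes q itself is excluded from Q(q); the function name, the toy queries, and the document ids are all hypothetical.

```python
from typing import Dict, Set


def candidate_queries(log: Dict[str, Set[str]], q: str) -> Set[str]:
    """Return Q(q): every other logged query that shares a document with D(q).

    Assumption: q itself is not counted as a candidate.
    """
    target = log[q]
    return {p for p, docs in log.items() if p != q and docs & target}


# Toy query log (hypothetical data): query -> D(query).
toy_log = {
    "barcelona": {"d1", "d2", "d3", "d4"},
    "barcelona fc": {"d1", "d2", "d9"},
    "barcelona hotels": {"d3", "d4"},
    "madrid hotels": {"d7", "d8"},
}

print(sorted(candidate_queries(toy_log, "barcelona")))
```

On this toy log, "madrid hotels" is not a candidate for "barcelona" because its result set is disjoint from D("barcelona").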

  4. The goal is to compute a cover. • Select a subcollection C ⊆ Q(q7) whose result sets cover almost all of D(q7)

  5. Problem Statement – 1/3 • Red-Blue set cover problem • U = {b1,…,bn, r1,…,rm} (for a query q) • B = {b1,…,bn}: blue points (the documents in D(q)) • R = {r1,…,rm}: red points (documents returned by candidate queries but not in D(q)) • S = {S1,…,Sk} is provided by the query log L • Si ⊆ U • SiB: blue points in Si (SiB = Si ∩ B) • SiR: red points in Si (SiR = Si ∩ R) • Goal: find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.
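
The red-blue instance for one query can be sketched in a few lines of Python. The sketch assumes that blue points are the documents in D(q) and red points are documents returned by a candidate query but absent from D(q); the function names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def split_blue_red(docs: Set[str], target: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split one candidate result set Si into (SiB, SiR) relative to D(q)."""
    return docs & target, docs - target


def build_instance(
    log: Dict[str, Set[str]], q: str, candidates: Iterable[str]
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Map each candidate query to its blue and red points for query q."""
    target = log[q]
    return {p: split_blue_red(log[p], target) for p in candidates}


# Toy query log (hypothetical data).
toy_log = {
    "barcelona": {"d1", "d2", "d3", "d4"},
    "barcelona fc": {"d1", "d2", "d9"},
    "barcelona hotels": {"d3", "d4"},
}
instance = build_instance(toy_log, "barcelona", ["barcelona fc", "barcelona hotels"])
print(instance["barcelona fc"])
```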

  6. Problem Statement – 2/3 • For each query q, the sets Si come from the candidate queries Q(q) • Each set Si, with its blue and red points, has a weight: its scatter sc(Si) (coherence is the opposite of scatter)

  7. Problem Statement – 3/3 • Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence. • More precisely, we want C to satisfy the following properties: • Cover-blue • Not-cover-red • Small-overlap • Coherence

  8. Greedy Algorithm – 1/2 • At the i-th iteration, select the set S that minimizes s(S, VB, VR) • λC, λR, λO are parameters that weight the relative importance of the three terms. • VB: blue points already covered in previous iterations • VR: red points already covered in previous iterations • Reference: D. Peleg. Approximation algorithms for the Label-Cover MAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007

  9. Greedy Algorithm – 2/2
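
The greedy loop can be sketched as follows. The transcript does not preserve the formula for s(S, VB, VR), so the score used here is an assumption: a weighted sum of scatter, newly covered red points, and overlap with already covered blue points, divided by the number of newly covered blue points. The parameter names (lam_c, lam_r, lam_o, min_coverage) and the stopping rule are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(
    sets: Dict[str, Set[str]],
    blue: Set[str],
    scatter: Dict[str, float],
    lam_c: float = 1.0,
    lam_r: float = 1.0,
    lam_o: float = 1.0,
    min_coverage: float = 0.9,
) -> Tuple[List[str], Set[str]]:
    """Repeatedly pick the set with the lowest (assumed) cost per new blue point."""
    covered_blue: Set[str] = set()   # VB
    covered_red: Set[str] = set()    # VR
    chosen: List[str] = []
    remaining = dict(sets)
    while remaining and len(covered_blue) < min_coverage * len(blue):
        best_name, best_score = None, float("inf")
        for name, docs in remaining.items():
            s_blue, s_red = docs & blue, docs - blue
            new_blue = s_blue - covered_blue
            if not new_blue:
                continue  # this set adds no coverage
            # Assumed score: not taken from the slides.
            cost = (
                lam_c * scatter.get(name, 0.0)
                + lam_r * len(s_red - covered_red)
                + lam_o * len(s_blue & covered_blue)
            )
            score = cost / len(new_blue)
            if score < best_score:
                best_name, best_score = name, score
        if best_name is None:
            break  # no remaining set covers a new blue point
        docs = remaining.pop(best_name)
        chosen.append(best_name)
        covered_blue |= docs & blue
        covered_red |= docs - blue
    return chosen, covered_blue


# Toy instance (hypothetical data); scatter defaults to 0 for every set.
blue_docs = {"d1", "d2", "d3", "d4"}
toy_sets = {
    "fc": {"d1", "d2", "d9"},
    "hotels": {"d3", "d4"},
    "noise": {"d1", "d7", "d8", "d10"},
}
picked, covered = greedy_cover(toy_sets, blue_docs, scatter={})
print(picked, sorted(covered))
```

In this toy run the "noise" set is never selected, because it pays for three red points to gain a single blue one.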

  10. Integer Programming • S1 + S2 + … + Sl ≤ 10 • Si ≤ 1

  11. Clustering-Based Method • Two-phase approach • First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit), producing a dendrogram G. • Second phase: match the clusters of the hierarchy produced by the agglomerative algorithm with the sets of S. • The main idea is to match sets of S to clusters of G • Every node T ∈ G corresponds to a cluster • T(B): the set of points of B in the cluster at node T

  12. Clustering-Based Method • Dendrogram G

  13. Clustering-Based Method - Dynamic Programming - 1/2 • Complete Coverage: • For each set S ∈ S and each node T ∈ G, compute a matching score m(T, S) • m*(T): the score of the best matching set in S • The dynamic program computes the optimal cost of covering the points of TB with sets in S.

  14. Clustering-Based Method - Dynamic Programming - 2/2 • Partial Coverage: • λU weights the relative importance of the two terms: the scatter cost of the selected sets and the number of uncovered points.
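
A dynamic program over the dendrogram can be sketched as below. The recurrences are not preserved in the transcript, so this sketch assumes a common form: the cost of a node is the minimum of its best matching score m*(T) and the summed costs of its two children, and, for partial coverage, a third option of leaving the node's points uncovered at a price of lam_u per point. The Node class, the matching score (size of the symmetric difference), and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional


@dataclass
class Node:
    """A dendrogram node T holding the points T(B) of its cluster."""
    points: FrozenSet[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


Score = Callable[[FrozenSet[str], FrozenSet[str]], float]


def sym_diff_score(cluster: FrozenSet[str], s: FrozenSet[str]) -> float:
    """Hypothetical m(T, S): number of points in exactly one of the two sets."""
    return float(len(cluster ^ s))


def cover_cost(
    node: Node,
    sets: Dict[str, FrozenSet[str]],
    match: Score = sym_diff_score,
    lam_u: Optional[float] = None,
) -> float:
    """Assumed recurrence: best single match, split into children, or skip."""
    options = [min(match(node.points, s) for s in sets.values())]  # m*(T)
    if lam_u is not None:
        options.append(lam_u * len(node.points))  # leave T(B) uncovered
    if node.left is not None and node.right is not None:
        options.append(
            cover_cost(node.left, sets, match, lam_u)
            + cover_cost(node.right, sets, match, lam_u)
        )
    return min(options)


# Toy dendrogram and candidate sets (hypothetical data).
root = Node(
    frozenset("abcd"),
    left=Node(frozenset("ab")),
    right=Node(frozenset("cd")),
)
toy_sets = {
    "s1": frozenset("ab"),
    "s2": frozenset("cdx"),
    "s3": frozenset("a"),
}
print(cover_cost(root, toy_sets))              # complete coverage
print(cover_cost(root, toy_sets, lam_u=0.25))  # partial coverage
```

Passing lam_u=None corresponds to complete coverage; a small lam_u makes it cheap to leave poorly matched clusters uncovered.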

  15. Application • Query log L: 2.9 million distinct queries • A majority of users look only at the first page of results, while few users request more result pages. • D(q): the set of result documents seen by any user who asked for q in the query log • 24 million distinct documents seen by the users

  16. Application - Candidate queries for the cover • For each query q, the candidate queries Qk(q)

  17. Application - Results • A set of 100 queries was randomly picked from the top 10,000 queries submitted by users. • Measures reported: • Cost of k queries • The number of documents included outside the set D(q) • Average number of queries covering each element • Coverage after the top k candidates have been picked
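
Three of the listed measures can be computed directly from a chosen cover, as in this minimal sketch. It assumes the average number of covering queries is taken over the covered documents of D(q) only; the function name, the dictionary keys, and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def cover_stats(chosen: List[str], sets: Dict[str, Set[str]], target: Set[str]) -> Dict[str, float]:
    """Summarize a cover of D(q) built from the chosen queries."""
    union: Set[str] = set().union(*(sets[name] for name in chosen))
    covered = union & target
    # Assumption: multiplicity is averaged over covered documents of D(q).
    hits = sum(len(sets[name] & target) for name in chosen)
    return {
        "coverage": len(covered) / len(target),
        "outside_docs": float(len(union - target)),
        "avg_queries_per_doc": hits / len(covered) if covered else 0.0,
    }


# Toy cover (hypothetical data).
target_docs = {"d1", "d2", "d3", "d4"}
toy_sets = {"fc": {"d1", "d2", "d9"}, "hotels": {"d3", "d4"}}
stats = cover_stats(["hotels", "fc"], toy_sets, target_docs)
print(stats)
```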

  18. Conclusions • A novel problem: topical query decomposition • Elegant solutions: • red-blue metric set cover • clustering with predefined clusters (hierarchical agglomerative clustering) • The set-cover formulation provides solutions of better quality • Code and data for reproducing the results shown in Table 3 are available at • http://www.yr-bcn.es/querydecomp/
