Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation

Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation. Kaushik Chakrabarti. Venkatesh Ganti. Jiawei Han. Dong Xin. Presented by: Vaidergorn Eitan. Outline. Introduction System Overview Scoring Functions SQL implementation Early Termination Approach Experiments

Presentation Transcript


  1. Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation Kaushik Chakrabarti Venkatesh Ganti Jiawei Han Dong Xin Presented by: Vaidergorn Eitan

  2. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  3. Introduction • In more and more document collections, the documents relate to objects. • Example: a laptop reviews site, where the reviews relate to laptops.

  4. Introduction • OF (Object Finder) queries • Example over laptop reviews: "I need the best “lightweight” and “business use” laptop."

  5. Introduction • The goal: • Get the top K target objects. • Exploit the relationships between documents and objects. • Exploit the fact that only K results are needed.

  6. Introduction • Search Objects - SOs - Documents • Target Objects - TOs

  7. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  8. System Overview • FTS (Full Text Search): • Input: Keyword/s. • Output: Ranked lists of documents

  9. System Overview • FTS (Full Text Search): • Most relational DBMS now support FTS functionality.

  10. System Overview • DBMS: • T • R • T is used only for the final lookup of the TO values.

  11. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  12. Scoring Functions • The OF evaluation system returns the top K target objects, that is, those with the best scores according to the scoring function.

  13. Scoring Functions • W={w1,w2,…,wN}: the keywords in the OF query. • Li: the ranked, sorted list of <document id, DocScore> pairs for keyword wi. • Dt: the list of documents related to target object t.

  14. Scoring Functions • Score matrix Mt – for each t in TOs • Score(t) - the relevance score for the TO t. • compute rows score • compute cols score

  15. Scoring Functions Row-marginal Class:

  16. Scoring Functions Column-marginal Class:
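
The formulas on the row-marginal and column-marginal slides did not survive in this transcript. As a minimal sketch, assuming the score matrix Mt has one row per document in Dt and one column per keyword (the orientation, the function names, and the sample matrix are all assumptions made for illustration), the two classes differ in the order in which Fcomb and Fagg are applied:

```python
def row_marginal(matrix, f_comb, f_agg):
    # Combine each row (one document, all keywords), then aggregate over documents.
    return f_agg([f_comb(row) for row in matrix])


def column_marginal(matrix, f_comb, f_agg):
    # Aggregate each column (one keyword, all documents), then combine over keywords.
    columns = list(zip(*matrix))
    return f_comb([f_agg(col) for col in columns])


# Hypothetical score matrix for one target object: 2 documents x 2 keywords.
m_t = [[1.0, 0.1],
       [0.1, 1.0]]

# With SUM for both functions the two classes agree.
row_sum = row_marginal(m_t, f_comb=sum, f_agg=sum)
col_sum = column_marginal(m_t, f_comb=sum, f_agg=sum)

# With MIN as the combination function they can differ.
row_min = row_marginal(m_t, f_comb=min, f_agg=sum)
col_min = column_marginal(m_t, f_comb=min, f_agg=sum)
```

With MIN as the combination function, the row-marginal score only rewards documents that match every keyword, while the column-marginal score lets different documents cover different keywords.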

  17. Scoring Functions • Fcomb is monotonic: Fcomb(x1,…,xn) ≤ Fcomb(y1,…,yn) when xi ≤ yi for all i. • Fagg is subset monotonic: Fagg(S) ≤ Fagg(S’) if S ⊆ S’. • Fagg distributes over append: Fagg(R1 append R2) = Fagg(Fagg(R1), Fagg(R2)). Here, append is the ordered concatenation of tuples.
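
As a small illustration of these three properties, the following sketch checks them for SUM on hand-picked inputs. The sample values are hypothetical, and subset monotonicity of SUM relies on scores being non-negative, which is assumed here.

```python
def f_comb(xs):
    # Combination function: SUM over the per-keyword scores.
    return sum(xs)


def f_agg(scores):
    # Aggregation function: SUM over document scores (assumed non-negative).
    return sum(scores)


# Monotonic: x is component-wise <= y, so Fcomb(x) <= Fcomb(y).
x, y = [1, 2, 3], [1, 4, 3]
comb_monotonic = all(a <= b for a, b in zip(x, y)) and f_comb(x) <= f_comb(y)

# Subset monotonic: S is a subset of S', so Fagg(S) <= Fagg(S').
s, s_prime = [2, 5], [2, 5, 1]
subset_monotonic = f_agg(s) <= f_agg(s_prime)

# Distributes over append: aggregating partial aggregates gives the same result.
r1, r2 = [3, 1], [4]
distributes = f_agg(r1 + r2) == f_agg([f_agg(r1), f_agg(r2)])
```

The last property is what allows a ranked list to be consumed piece by piece: the aggregate of the documents seen so far can be folded together with the aggregate of the next batch.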

  18. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  19. SQL Implementation
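
The query shown on this slide did not survive in the transcript. The following is only a hypothetical sketch of what a baseline of this kind could look like, not the authors' query: it joins every ranked list with the relationship table R, aggregates per target object and keyword, combines across keywords, and keeps the top K. It uses SQLite from Python rather than SQL Server, a plain table `L` as a stand-in for the FTS output, SUM for both functions, and made-up rows; the final lookup in T is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (doc_id INTEGER, to_id TEXT);                  -- document-to-TO relationships
    CREATE TABLE L (kw INTEGER, doc_id INTEGER, doc_score REAL);  -- stand-in for the FTS ranked lists
""")
# Hypothetical data: 3 documents, 2 target objects, 2 keywords.
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(1, "t1"), (2, "t1"), (2, "t2"), (3, "t2")])
conn.executemany("INSERT INTO L VALUES (?, ?, ?)",
                 [(1, 1, 1.0), (1, 2, 0.5), (2, 2, 0.25), (2, 3, 0.75)])

TOP_K_SQL = """
    SELECT to_id, SUM(agg_score) AS score          -- Fcomb = SUM over keywords
    FROM (
        SELECT R.to_id AS to_id, L.kw AS kw,
               SUM(L.doc_score) AS agg_score       -- Fagg = SUM over documents
        FROM L JOIN R ON L.doc_id = R.doc_id
        GROUP BY R.to_id, L.kw
    ) AS per_keyword
    GROUP BY to_id
    ORDER BY score DESC
    LIMIT ?
"""
top_k = conn.execute(TOP_K_SQL, (2,)).fetchall()
```

A query of this shape touches every matching document before returning anything, which is the cost the early termination approach tries to avoid.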

  20. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  21. Early Termination Approach • Intuition: top scoring documents typically contribute the most to the scores of high scoring TOs. • The TOs related to these top scoring documents are most likely to be the best candidate matches. • We progressively retrieve documents in the decreasing order of their scores, and maintain upper and lower bound scores for the related TOs.

  22. Early Termination Approach • Generate-only approach: • Relies on the bounds. • Stops when the best K TOs have been identified. • Generate-Prune approach: • Candidate generation phase. • The stopping condition is more relaxed. • Pruning phase.

  23. Candidate Generation • Ci: documents are retrieved from Li in chunks of Ci. • Prefix(Li): the documents retrieved so far from the ranked list Li. • SeenTOs: the current aggregation scores. • AggResulti: for each Li, a table containing numSeen, aggScore, and the upper bound and lower bound scores.

  24. Candidate Generation

  25. Candidate Generation • The Algorithm has 5 steps:

  26. Candidate Generation • Step1 - Retrieve Documents: • We retrieve the next Ci documents from each Li. • Retrieving in chunks reduces the number of join queries (with R).

  27. Candidate Generation • Step2 - Update SeenTOs (figure: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2))

  28. Candidate Generation (figure, continued: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2))

  29. Candidate Generation (figure, continued: Prefix(L1), Prefix(L2), AggResult(1), AggResult(2))
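
The figures for the update step are not part of this transcript. As a rough sketch of what updating SeenTOs with one chunk could look like, assuming SUM as the aggregation function, a dictionary `doc_to_tos` standing in for the join with R, and hypothetical document ids and scores:

```python
def update_seen_tos(seen_tos, chunk, doc_to_tos, i):
    """Fold one chunk of (doc_id, doc_score) pairs from list Li into SeenTOs."""
    for doc_id, doc_score in chunk:
        # Stand-in for the join with R: which TOs does this document relate to?
        for t in doc_to_tos.get(doc_id, []):
            entry = seen_tos.setdefault(t, {"numSeen": {}, "aggScore": {}})
            entry["numSeen"][i] = entry["numSeen"].get(i, 0) + 1
            # SUM as the aggregation function (an assumption of this sketch).
            entry["aggScore"][i] = entry["aggScore"].get(i, 0.0) + doc_score
    return seen_tos


# Hypothetical relationships and a chunk retrieved from L1.
doc_to_tos = {"d1": ["t1"], "d2": ["t1", "t2"]}
seen_tos = update_seen_tos({}, [("d1", 1.0), ("d2", 0.5)], doc_to_tos, i=1)
```

Keeping `numSeen` next to `aggScore` matters for the bounds: it records how many more documents from each list could still contribute to a target object.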

  30. Candidate Generation • Step3 - Compute bounds: • t.lb = Fcomb(t.aggScore[1], …, t.aggScore[N]).

  31. Candidate Generation • B: the maximum number of documents in any ranked list Li that can contribute to the score of any target object t. • xi: the DocScore of the last document retrieved from Li. • t.ub[i] = Fagg(t.aggScore[i], Fagg(xi, xi, …, xi)), where xi is repeated (B - t.numSeen[i]) times. • t.ub = Fcomb(t.ub[1], …, t.ub[N]). • Example: t1.ub[1] = 1.0 + 1.0*(2-1) = 2; t2.ub[1] = 1.0 + 1.0*(2-1) = 2.
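
The bound computation can be sketched as follows, assuming SUM for both Fcomb and Fagg as in the slide's arithmetic. The function names are illustrative; the first call reproduces the slide's example (aggScore 1.0, x1 = 1.0, B = 2, one document seen), while the two-list inputs further down are hypothetical.

```python
def lower_bound(agg_score, f_comb=sum):
    # t.lb: combine the aggregates seen so far, assuming nothing more arrives.
    return f_comb(agg_score)


def upper_bound_i(agg_score_i, num_seen_i, x_i, b, f_agg=sum):
    # t.ub[i]: the remaining (B - numSeen[i]) documents can score at most x_i each.
    unseen = [x_i] * (b - num_seen_i)
    return f_agg([agg_score_i, f_agg(unseen)])


def upper_bound(agg_score, num_seen, x, b, f_comb=sum, f_agg=sum):
    # t.ub: combine the per-list upper bounds.
    return f_comb([upper_bound_i(a, n, xi, b, f_agg)
                   for a, n, xi in zip(agg_score, num_seen, x)])


# Slide example: t1.ub[1] = 1.0 + 1.0*(2-1)
ub_1 = upper_bound_i(agg_score_i=1.0, num_seen_i=1, x_i=1.0, b=2)

# Hypothetical two-list example.
lb = lower_bound([1.0, 0.5])
ub = upper_bound([1.0, 0.5], num_seen=[1, 1], x=[1.0, 0.3], b=2)
```

Because each list is read in decreasing score order, xi only shrinks as retrieval proceeds, so the upper bounds tighten while the lower bounds grow.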

  32. Candidate Generation • Step4 - Stopping Condition: We can stop when there are at least K objects in SeenTOs whose lower bound scores are at least as high as the upper bound score of any unseen TO. • UnseenUB = Fcomb(Fagg(x1, x1, …), …, Fagg(xN, xN, …)), where each xi is repeated B times. • So the stopping criterion is: LBK ≥ UnseenUB • LBK: the Kth highest lower bound.

  33. Candidate Generation • Example (figure of Prefix(L1)): x1=0.2; x2=0.3

  34. Candidate Generation • LBK ≥ UnseenUB • UnseenUB = ((0.2+0.2)+(0.3+0.3)) = 1 • LB3 = 1.1, so the stopping condition holds for K=3.
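
A minimal sketch of the stopping test, assuming SUM for both functions and B = 2 as the slide's arithmetic implies. The values x1 = 0.2, x2 = 0.3 and LB3 = 1.1 come from the example; the other lower bounds in the list are hypothetical fillers, and the function names are illustrative.

```python
def unseen_upper_bound(x, b, f_comb=sum, f_agg=sum):
    # Best score an unseen TO could still reach: B documents at x_i in every list.
    return f_comb([f_agg([xi] * b) for xi in x])


def should_stop(lower_bounds, x, b, k):
    # Stop when the Kth highest lower bound is at least UnseenUB.
    if len(lower_bounds) < k:
        return False
    lb_k = sorted(lower_bounds, reverse=True)[k - 1]
    return lb_k >= unseen_upper_bound(x, b)


unseen_ub = unseen_upper_bound([0.2, 0.3], b=2)        # (0.2+0.2) + (0.3+0.3)
stop = should_stop([1.5, 1.0, 1.2, 1.1], x=[0.2, 0.3], b=2, k=3)
```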

  35. Candidate Generation • Step5 - Identify candidates: • Top(List,X): the top X elements in the list. • The set of candidates is defined by Top(UB,h). • h: the least value that satisfies LBK ≥ UB(h+1), where UB(h+1) is the (h+1)th highest upper bound.

  36. Candidate Generation • LBK ≥ UB(h+1) • LB3 = 1.1 • LB3 ≥ UB(4+1) => h=4 • Top(LB,3)={t1,t3,t4} • Top(UB,4)={t1,t2,t3,t4} • Top(UB,h)={t1,t2,t3,t4}
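
Candidate identification can be sketched as below. The upper bounds of t1 to t4 and LB3 = 1.1 are taken from the slides; the individual lower bounds and the fifth object t5 are hypothetical values chosen to be consistent with them. Starting the search for h at K is an assumption of this sketch, made so that at least K candidates are always kept.

```python
def identify_candidates(lb, ub, k):
    # LBK: the Kth highest lower bound among the seen TOs.
    lb_k = sorted(lb.values(), reverse=True)[k - 1]
    # TOs ordered by decreasing upper bound (ties keep insertion order).
    by_ub = sorted(ub, key=lambda t: ub[t], reverse=True)
    # Least h >= K such that the (h+1)th highest upper bound is at most LBK.
    h = k
    while h < len(by_ub) and ub[by_ub[h]] > lb_k:
        h += 1
    return by_ub[:h]


ub = {"t1": 2.5, "t2": 1.8, "t3": 1.6, "t4": 1.6, "t5": 1.0}
lb = {"t1": 1.5, "t2": 1.0, "t3": 1.2, "t4": 1.1, "t5": 0.5}
candidates = identify_candidates(lb, ub, k=3)
```

Any object outside Top(UB,h) has an upper bound no higher than the lower bounds of K other objects, so it cannot displace them from the top K.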

  37. Pruning to the Final Top-K

  38. Pruning to the Final Top-K • UB={t1(2.5), t2(1.8), t3(1.6), t4(1.6)}, K=3 • Exact score of t1: ((1+0.1)+(0.1+1))=2.2 • Exact scores: t1=2.2, t2=1.6, t3=1.6, t4=1.6 • UB={t1(2.2), t2(1.6), t3(1.6), t4(1.6)} • The final top-K results are {t1, t2, t3}
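
The pruning arithmetic can be sketched as follows, assuming SUM for both functions. Only the per-document scores of t1 appear on the slide; the breakdowns for t2 to t4 are hypothetical values that reproduce the slide's totals of 1.6, and breaking ties by candidate order is an assumption of this sketch.

```python
def exact_score(doc_scores_per_list, f_comb=sum, f_agg=sum):
    # Aggregate all related documents in each list, then combine across lists.
    return f_comb([f_agg(scores) for scores in doc_scores_per_list])


def prune_to_top_k(candidates, k):
    # Replace the bounds by exact scores and keep the K best candidates.
    scored = {t: exact_score(lists) for t, lists in candidates.items()}
    ranked = sorted(scored, key=lambda t: scored[t], reverse=True)
    return ranked[:k], scored


candidates = {
    "t1": [[1.0, 0.1], [0.1, 1.0]],   # from the slide: (1+0.1)+(0.1+1)
    "t2": [[0.8], [0.8]],             # hypothetical breakdown, total 1.6
    "t3": [[0.8], [0.8]],             # hypothetical breakdown, total 1.6
    "t4": [[0.8], [0.8]],             # hypothetical breakdown, total 1.6
}
top_k, scores = prune_to_top_k(candidates, k=3)
```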

  39. Exact Top-K with Approximate Scores (example: K=2) • Crossing object: a target object whose rank in LB is greater than K and whose rank in UB is K or less. • Boundary objects: a pair of target objects (A,B) such that: • The top K in UB and LB are the same. • A is the Kth object in LB and the uth object in UB (u ≤ K). • B is the (K+1)th object in UB and the lth object in LB (l ≥ K+1). • LBK ≤ UB(K+1)
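
The crossing-object definition can be checked mechanically. The sketch below uses hypothetical bounds with K = 2 and ranks objects by decreasing bound, breaking ties by insertion order (an assumption; the slides do not say how ties are ranked).

```python
def ranks(bounds):
    # Rank 1 is the highest bound; ties keep insertion order.
    ordered = sorted(bounds, key=lambda t: bounds[t], reverse=True)
    return {t: r for r, t in enumerate(ordered, start=1)}


def crossing_objects(lb, ub, k):
    # Rank in LB is more than K, while rank in UB is K or less.
    lb_rank, ub_rank = ranks(lb), ranks(ub)
    return [t for t in lb if lb_rank[t] > k and ub_rank[t] <= k]


# Hypothetical bounds for three target objects.
lb = {"A": 1.0, "B": 0.9, "C": 0.5}
ub = {"A": 1.2, "B": 1.0, "C": 1.1}
crossing = crossing_objects(lb, ub, k=2)
```

Here C is crossing: its upper bound places it in the top 2, but its lower bound does not, so the top-K set is not yet settled.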

  40. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  41. Experiment • Our documents comprise a collection of 714,192 news articles from '03-'04, obtained from the MSNBC news portal. • We index those news articles in the SQL Server FTS engine. • We extract three types of named entities: PersonNames, OrganizationNames, and LocationNames.

  42. Experiment • To get realistic OF queries, we picked the following top 10 sports news queries on Google in 2004.

  43. Experiment • “PersonNames” is the desired entity type for all the queries. All our measurements are averaged across the 10 queries. • We implemented all 3 approaches to evaluate OF queries: SQL implementation, GenPrune, GenOnly. • SUM is the combination function. SUM is the aggregation function.

  44. Experiment

  45. Experiment

  46. Outline • Introduction • System Overview • Scoring Functions • SQL implementation • Early Termination Approach • Experiments • Conclusions

  47. Conclusions • We introduced the class of OF queries and defined its semantics. • We defined two broad classes of scoring functions, which exploit relationships between documents and objects, to compute the relevance score of the target objects for a given set of keywords. • We presented early termination techniques; the experiments show that our approach is 4-5 times faster than the SQL implementation.

  48. Questions
