Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers

SIGMOD’09. Tingjian Ge and Stan Zdonik, Computer Science Department, Brown University, Providence, RI, USA. Samuel Madden, CSAIL, MIT, Cambridge, MA, USA.


Presentation Transcript


  1. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD’09. Tingjian Ge and Stan Zdonik, Computer Science Department, Brown University, Providence, RI, USA. Samuel Madden, CSAIL, MIT, Cambridge, MA, USA.

  2. Outline • Introduction • Problem formulation • Computing score distribution of top-k • Computing c-Typical-Topk • Empirical study • Conclusions • My thoughts

  3. Introduction • An important observation about U-Topk: there is a tradeoff between reporting high-score tuples and reporting tuples with a high probability of being in the top-k. • The U-Topk answer is atypical! In the example: a vector with twice the score of the U-Topk answer has a probability not much less than 0.2; a higher total score occurs with probability 0.76; the expected score is 164.1, higher than that of U-Topk.

  4. Problem formulation • Since the complete distribution of all potential top-k tuple vectors is too large to compute, we provide a number of typical vectors that sample the distribution instead. • c-Typical-Topk scores • c-Typical-Topk tuples

  5. Computing score distribution of top-k • Two simple algorithms • k-Combo: simply generate all k-combinations iteratively. • StateExpansion: example states from the figure (tuple set, probability P, total score S): {}, P=1, S=0; {}, P=0.7, S=0; {}, P=0.42, S=0; {T7}, P=0.3, S=125; {T3}, P=0.28, S=110; {T7, T3}, P=0.12, S=235. Completed states are emitted as output vectors.
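The StateExpansion states on this slide can be reproduced with a short Python sketch. It assumes independent tuples given in descending score order, each written as (name, score, probability); the function name `state_expansion` and this input layout are illustrative assumptions, not taken from the paper.

```python
def state_expansion(tuples, k):
    """Enumerate top-k vectors by expanding states one tuple at a time.

    tuples: list of (name, score, prob), sorted by descending score and
            assumed independent (an assumption of this sketch).
    Returns a list of (members, probability, total_score) for every
    state that reaches k tuples.
    """
    states = [((), 1.0, 0)]          # start state: {}, P=1, S=0
    outputs = []
    for name, score, prob in tuples:
        next_states = []
        for members, p, s in states:
            # Branch 1: the tuple exists and joins the vector.
            grown = (members + (name,), p * prob, s + score)
            if len(grown[0]) == k:
                outputs.append(grown)        # complete top-k vector
            else:
                next_states.append(grown)
            # Branch 2: the tuple does not exist in this possible world.
            next_states.append((members, p * (1 - prob), s))
        states = next_states
    return outputs


# The two tuples visible on the slide: T7 (score 125, prob 0.3), T3 (score 110, prob 0.4).
vectors = state_expansion([("T7", 125, 0.3), ("T3", 110, 0.4)], k=2)
```

With these two tuples the only complete top-2 vector is {T7, T3} with total score 235 and probability 0.3 × 0.4, matching the last state listed on the slide. The number of states can double with every tuple, which is the cost the next slide refers to.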

  6. Computing score distribution of top-k (cont.) • Both algorithms above have high time complexity. • The authors therefore propose a main algorithm based on dynamic programming.
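The slides do not spell out the recurrence, so the following Python sketch is only one plausible shape for such a dynamic program. It assumes independent tuples in descending score order and assumes the recurrence "top-j of the suffix starting at tuple i = (tuple i exists, plus top-(j-1) of the rest) or (tuple i is absent, top-j of the rest)". The name `topk_score_distribution` is hypothetical.

```python
def topk_score_distribution(tuples, k):
    """Score distribution of the top-k total score (sketch, assumed recurrence).

    tuples: list of (score, prob), sorted by descending score, independent.
    Returns a dict mapping total score -> probability. Probabilities sum to
    less than 1 when some possible worlds contain fewer than k tuples.
    """
    # dist[j] maps total score -> probability for the top-j of the current suffix.
    dist = [{0: 1.0}] + [{} for _ in range(k)]
    for score, prob in reversed(tuples):
        new = [{0: 1.0}]                     # top-0 always has score 0
        for j in range(1, k + 1):
            cell = {}
            # Tuple exists: it takes the first slot, the rest supplies j-1 tuples.
            for s, p in dist[j - 1].items():
                cell[s + score] = cell.get(s + score, 0.0) + p * prob
            # Tuple is absent: the rest must supply all j tuples.
            for s, p in dist[j].items():
                cell[s] = cell.get(s, 0.0) + p * (1 - prob)
            new.append(cell)
        dist = new
    return dist[k]


example = topk_score_distribution([(125, 0.3), (110, 0.4)], k=2)
```

Unlike state expansion, each cell stores only (score, probability) pairs, so vectors that share a total score collapse into one entry.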

  7. Computing score distribution of top-k (cont.) • The need for approximation • The number of distributions in each cell is upper bounded by [expression missing from the transcript]. • Thus, the time complexity of this algorithm is [expression missing from the transcript]. • However, in most applications there is no need to keep all distributions, since the scores of different combinations are very close or even identical. • Here, we set a constant c’ as the maximum size of a cell. Once a cell holds more than c’ entries, we coalesce the two closest vectors into one: the score is their average and the probability is their sum.
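The coalescing rule on this slide can be sketched in Python as follows. A cell is modelled as a list of (score, probability) pairs; the function name `coalesce` and the parameter name `max_size` (standing in for c’) are assumptions of this sketch.

```python
def coalesce(cell, max_size):
    """Shrink a cell to at most max_size entries by merging closest scores.

    cell: list of (score, prob) pairs.
    Each merge replaces the two entries with the closest scores by one
    entry whose score is their plain average and whose probability is
    their sum, as described on the slide.
    """
    entries = sorted(cell)                   # sort by score
    while len(entries) > max_size and len(entries) > 1:
        # In a sorted list the globally closest pair is always adjacent.
        i = min(range(len(entries) - 1),
                key=lambda t: entries[t + 1][0] - entries[t][0])
        (s1, p1), (s2, p2) = entries[i], entries[i + 1]
        entries[i:i + 2] = [((s1 + s2) / 2, p1 + p2)]
    return entries


merged = coalesce([(100, 0.1), (150, 0.3), (101, 0.2)], max_size=2)
```

Here the entries with scores 100 and 101 are merged into a single entry with score 100.5. The sketch follows the slide in using the unweighted average; a probability-weighted average would be a different design choice and is not what the slide states.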

  8. Computing score distribution of top-k (cont.) • Handling mutually exclusive rules • Example tuples from the figure: T7: (183, 0.15), T4: (138, 0.15), T2: (118, 0.2)

  9. Computing score distribution of top-k (cont.) • Handling ties • Configuration of top-k: (k-g) uncertain tuples and g tuples from a tie group (the ending tie group with the lowest score) • E.g., configuration of top-5: • Pr({T1, T2, T3, T4, T5, T6, T7}) = 0.00012 • Pr({T1, T2, T3, T4, T5, T6}) = 0.00048 • Pr({T1, T2, T3, T4, T5, T7}) = 0.00018 • Pr({T1, T2, T3, T4, T6, T7}) = 0.00012 • → (56, 0.0009) • We can simply run the main algorithm with the uncertain database in score ranking order.
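The arithmetic behind the tie example is a sum over configurations that share a total score. The Python sketch below uses the four probabilities from the slide and assumes, as the slide's result (56, 0.0009) implies, that each configuration has total score 56. The function name `coalesce_by_score` is hypothetical.

```python
from collections import defaultdict


def coalesce_by_score(configurations):
    """Merge configurations with equal total score into one (score, prob) entry.

    configurations: iterable of (total_score, probability) pairs.
    Returns a list of (score, summed_probability), highest score first.
    """
    totals = defaultdict(float)
    for score, prob in configurations:
        totals[score] += prob
    return sorted(totals.items(), reverse=True)


# The four top-5 configurations listed on the slide.
configs = [(56, 0.00012), (56, 0.00048), (56, 0.00018), (56, 0.00012)]
merged = coalesce_by_score(configs)
```

The four probabilities add up to 0.0009, so the tie group contributes a single entry with score 56 to the score distribution.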

  10. Computing c-Typical-Topk • Use dynamic programming to solve the following functions • F^a(j) finds the first typical score position from s_j to s_n. • G^a(j) finds the first position that is closest to the second typical score. • E.g., figure over the score axis s_1 … s_10 … s_26 … s_n, with annotations: the first typical score from s_1 to s_n (this is what we want); the first position that is closest to the second typical score; then find the first typical score position from s_26 to s_n, recursively.
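The slide does not state the objective being optimized. The Python sketch below assumes it is the expected distance between a top-k score and its nearest typical score, with typical scores restricted to scores that occur in the distribution. It uses a plain interval-partition dynamic program rather than the F and G functions named on the slide, so it illustrates the idea and is not the authors' algorithm. The name `c_typical_scores` is hypothetical.

```python
from functools import lru_cache


def c_typical_scores(dist, c):
    """Pick up to c typical scores minimizing expected distance (sketch).

    dist: list of (score, prob) pairs.
    Returns (expected_distance, typical_scores).
    """
    pts = sorted(dist)
    n = len(pts)

    def segment_cost(i, j):
        # Best single representative for pts[i..j], chosen among its own scores.
        best = None
        for m in range(i, j + 1):
            cost = sum(p * abs(s - pts[m][0]) for s, p in pts[i:j + 1])
            if best is None or cost < best[0]:
                best = (cost, pts[m][0])
        return best

    @lru_cache(maxsize=None)
    def solve(a, j):
        # Minimal cost of covering pts[j:] with at most a typical scores.
        if j == n:
            return (0.0, ())
        if a == 0:
            return (float("inf"), ())
        best = (float("inf"), ())
        for e in range(j, n):
            # pts[j..e] share one typical score; the rest is solved recursively.
            seg_cost, rep = segment_cost(j, e)
            rest_cost, rest = solve(a - 1, e + 1)
            if seg_cost + rest_cost < best[0]:
                best = (seg_cost + rest_cost, (rep,) + rest)
        return best

    cost, reps = solve(c, 0)
    return cost, list(reps)


cost, typical = c_typical_scores([(100, 0.1), (102, 0.3), (200, 0.4), (205, 0.2)], c=2)
```

On this made-up distribution the sketch selects 102 and 200, the higher-probability score in each cluster. The recursion mirrors the slide's description: fix the first typical score, find where scores start to be closer to the next one, and solve the remainder recursively.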

  11. Empirical study

  12. Conclusions • Observe the need to put more emphasis on the ranking score. • Provide the distribution of top-k vectors and c-Typical-Topk answers. • Experimental results verify the motivation.

  13. My thoughts • The approximation approach in the dynamic programming algorithm • In each cell, we keep only the m vectors with the highest probability.
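The pruning rule proposed here is a top-m selection by probability. A minimal Python sketch, modelling a cell as a list of (score, probability) pairs; the function name `prune_top_m` and the sample values are illustrative assumptions:

```python
import heapq


def prune_top_m(cell, m):
    """Keep only the m entries of a cell with the highest probability.

    cell: list of (score, prob) pairs.
    Returns the kept entries, highest probability first.
    """
    return heapq.nlargest(m, cell, key=lambda entry: entry[1])


kept = prune_top_m([(235, 0.12), (225, 0.09), (210, 0.14)], m=2)
```

Unlike coalescing, pruning discards probability mass, so the kept probabilities no longer sum to the original total.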

  14. My thoughts (cont.) • The approach to handling ties is not applicable to our program. • In this paper, the problem of ties can be solved because vectors with the same score are coalesced into one. • However, in our program, we keep all vectors separately. • Many studies assume that all scores are distinct.
