Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers

Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers • SIGMOD’09 • Tingjian Ge and Stan Zdonik, Computer Science Department, Brown University, Providence, RI, USA • Samuel Madden, CSAIL, MIT, Cambridge, MA, USA

Outline • Introduction • Problem formulation • Computing score distribution of top-k • Computing c-typical-topk • Empirical study • Conclusions • My thoughts

Introduction • One important observation about U-Topk • There is a tradeoff between reporting high-score tuples and reporting tuples with a high probability of being in the top-k. • The U-Topk answer is atypical! • A top-k vector with twice the score of U-Topk occurs with probability not much less than 0.2. • A top-k vector with a higher total score occurs with probability 0.76. • The expected score is 164.1 higher than U-Topk.

Problem formulation • Since the complete distribution of all potential top-k tuple vectors is too large to compute, we provide a number of typical vectors that sample the distribution instead. • c-Typical-Topk scores • c-Typical-Topk tuples

Computing score distribution of top-k • Two simple algorithms • k-Combo • Simply generate all k-combinations iteratively. • StateExpansion • Example state tree (figure labels, with P = probability and S = total score; the figure also marks the output vectors): {}, P=1, S=0; {}, P=0.7, S=0; {}, P=0.42, S=0; {T7}, P=0.3, S=125; {T3}, P=0.28, S=110; {T7, T3}, P=0.12, S=235
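The state-expansion idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes independent tuples processed in descending score order, k = 2, and tuple values T7 = (score 125, probability 0.3) and T3 = (score 110, probability 0.4), which are inferred from the probabilities in the trace above. The function name `state_expansion` and the tuple layout are hypothetical.

```python
def state_expansion(tuples, k):
    """Enumerate top-k vectors by branching on each tuple's presence.

    tuples: list of (name, score, prob), assumed independent and
            already sorted by score in descending order.
    Returns (outputs, pending): completed k-vectors and unfinished states,
    each as (chosen_names, probability, total_score).
    """
    states = [((), 1.0, 0)]  # start: empty vector, P=1, S=0
    outputs = []
    for name, score, prob in tuples:
        next_states = []
        for chosen, p, s in states:
            # Branch 1: the tuple exists and joins the vector.
            extended = (chosen + (name,), p * prob, s + score)
            if len(extended[0]) == k:
                outputs.append(extended)      # vector is complete
            else:
                next_states.append(extended)  # keep expanding
            # Branch 2: the tuple does not exist.
            next_states.append((chosen, p * (1 - prob), s))
        states = next_states
    return outputs, states


# Assumed example values, inferred from the slide's trace.
example = [("T7", 125, 0.3), ("T3", 110, 0.4)]
outputs, pending = state_expansion(example, k=2)
for chosen, p, s in outputs + pending:
    print(set(chosen) or "{}", round(p, 4), s)
```

Under these assumptions the single completed vector is {T7, T3} with P = 0.12 and S = 235, and the pending states include {T3} with P = 0.28 and {} with P = 0.42, matching the trace. The number of states can double with every tuple, which is the high cost the next slide refers to.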

Computing score distribution of top-k (cont.) • The two algorithms above have high complexity. • The authors propose a main algorithm based on dynamic programming.

Computing score distribution of top-k (cont.) • The need for approximation • The number of distributions in each cell is upper bounded by . • Thus, the time complexity of this algorithm is . • However, in most applications, there is no need to keep all distributions since the scores of different combinations are very close, or even the same. • Here, we set a constant number c’ to be the maximum size of a cell. Once the number of distributions is more than c’, we coalesce the closest two vectors into one: the score value is their average and the probability is their sum.
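The coalescing step described above can be sketched like this. It is an illustration under assumptions: a cell is represented as a list of (score, probability) pairs, `coalesce` and `max_size` (standing in for c’) are hypothetical names, and the example values are made up for demonstration.

```python
def coalesce(cell, max_size):
    """Shrink a cell to at most max_size (score, prob) entries.

    While the cell is too large, merge the two entries whose scores are
    closest: the merged score is their plain average and the merged
    probability is their sum, as stated on the slide.
    """
    entries = sorted(cell)  # sort by score so the closest pair is adjacent
    while len(entries) > max_size and len(entries) > 1:
        # Index of the adjacent pair with the smallest score gap.
        i = min(range(len(entries) - 1),
                key=lambda j: entries[j + 1][0] - entries[j][0])
        (s1, p1), (s2, p2) = entries[i], entries[i + 1]
        # The average lies between s1 and s2, so the list stays sorted.
        entries[i:i + 2] = [((s1 + s2) / 2, p1 + p2)]
    return entries


# Illustrative cell with four entries, limited to three.
cell = [(235, 0.12), (110, 0.28), (125, 0.18), (0, 0.42)]
merged = coalesce(cell, max_size=3)
print(merged)
```

Here the entries with scores 110 and 125 are the closest pair, so they merge into a single entry with score 117.5 and probability of about 0.46. The total probability mass is preserved, while the score of the merged entry is only approximate.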

Computing score distribution of top-k (cont.) • Handling mutually exclusive rules T7: (183, 0.15) T4: (138, 0.15) T2: (118, 0.2)

Computing score distribution of top-k (cont.) • Handling ties • Configuration of top-k • (k-g) uncertain tuples and g tuples from a tie group (the ending tie group with the lowest score) • E.g., configuration of top-5 • Pr({T1, T2, T3, T4, T5, T6, T7}) = 0.00012 • Pr({T1, T2, T3, T4, T5, T6}) = 0.00048 • Pr({T1, T2, T3, T4, T5, T7}) = 0.00018 • Pr({T1, T2, T3, T4, T6, T7}) = 0.00012 • → (56, 0.0009) • We can simply run the main algorithm with the uncertain database in score ranking order.

Computing c-Typical-Topk • Use dynamic programming to solve the following functions: • Fa(j) finds the first typical score position from sj to sn. • Ga(j) finds the first position that is closest to the second typical score. • E.g., on the sequence s1 … s10 … s26 … sn, the figure marks: the first typical score from s1 to sn; the first position that is closest to the second typical score; "This is what we want." • Then, find the first typical score position from s26 to sn, recursively.
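A small sketch may help make the goal of this dynamic program concrete. It rests on assumptions the slides do not state: that the c typical scores are chosen from the scores in the distribution so as to minimize the expected distance between a random top-k score and its nearest typical score. It also uses a plain segment-based dynamic program, not the Fa/Ga recursion from the slide. The name `typical_scores` and the example distribution are hypothetical.

```python
def typical_scores(dist, c):
    """Pick c representative scores from a discrete score distribution.

    dist: list of (score, prob). Assumed objective: minimize the expected
    distance from a random score to its nearest chosen score.
    Returns (expected_distance, chosen_scores).
    """
    pts = sorted(dist)
    n = len(pts)
    c = min(c, n)

    def segment_cost(i, j):
        # Best single representative for pts[i..j], taken from that segment.
        best = None
        for m in range(i, j + 1):
            cost = sum(p * abs(s - pts[m][0]) for s, p in pts[i:j + 1])
            if best is None or cost < best[0]:
                best = (cost, pts[m][0])
        return best

    inf = float("inf")
    # table[q][j]: best (cost, chosen) covering the first j points
    # with q representatives, each serving a contiguous run of scores.
    table = [[(inf, [])] * (n + 1) for _ in range(c + 1)]
    table[0][0] = (0.0, [])
    for q in range(1, c + 1):
        for j in range(1, n + 1):
            for i in range(q - 1, j):
                prev_cost, prev_sel = table[q - 1][i]
                if prev_cost == inf:
                    continue
                cost, rep = segment_cost(i, j - 1)
                if prev_cost + cost < table[q][j][0]:
                    table[q][j] = (prev_cost + cost, prev_sel + [rep])
    return table[c][n]


# Illustrative distribution of top-k total scores (made-up values).
dist = [(0, 0.42), (110, 0.28), (125, 0.18), (235, 0.12)]
cost, chosen = typical_scores(dist, c=2)
print(cost, chosen)
```

For this example the sketch selects the scores 0 and 125, with an expected distance of about 17.4. The sketch recomputes segment costs naively, which is fine for small inputs but slower than a recursion that reuses partial results.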

Empirical study

Conclusions • Observe the need to put more emphasis on the ranking score. • Provide the distribution of top-k vectors and c-Typical-Topk answers. • Experimental results verify the motivation.

My thoughts • The approximation approach in the dynamic programming algorithm • In each cell, we only keep the first m vectors with the highest probability.

My thoughts (cont.) • The approach to handling ties is not applicable to our program. • In this paper, the problem of ties can be solved since vectors with the same score are coalesced into one. • However, in our program, we keep all vectors separately. • Many studies assume that all scores are distinct.
