Efficient Top-K Query Evaluation on Probabilistic Data

PAPER BY: CHRISTOPHER RÉ, NILESH DALVI, DAN SUCIU
International Conference on Data Engineering (ICDE), 2007
PRESENTED BY: JITENDRA GUPTA

OUTLINE
• Introduction
• Challenges in Probabilistic Databases
• Possible Worlds
• DNF Formula Based Query Evaluation
• Monte Carlo (MC) Simulation
• Critical Region
• Multisimulations
• Experiments & Results
• Conclusions & Future Work

Introduction
• Probabilistic databases model data that contains unreliable, inconsistent, or imprecise information, but evaluating SQL queries on such data is difficult.
• Imprecision in the data leads to a large number of low-quality answers, while users are interested only in the answers with the highest probabilities.
• Previous approaches restricted the class of SQL queries in order to evaluate them exactly; the algorithm in the paper instead efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database.
• The paper takes a new approach to query evaluation on probabilistic databases by combining top-k style queries with approximation algorithms that have provable guarantees.

Probabilistic Databases
• A probabilistic database is an uncertain database in which the possible worlds have associated probabilities. In the simplest definition, every tuple belongs to the database with some probability between 0 and 1.
• Queries that are #P-complete are not handled efficiently in probabilistic databases.

Challenges in Probabilistic Databases
• The major challenge in probabilistic databases is query evaluation. Dalvi & Suciu have shown that most SQL queries are #P-complete; the algorithm described in the paper handles such queries efficiently.
• Computing the exact output probabilities is computationally hard.
• A naive algorithm computing the output probabilities needs to iterate through all possible worlds (here, all possible subsets of TitleMatchp).
• Another challenge is that the number of potential answers whose probabilities must be computed is large, while the user is likely to inspect only the first few of them.

Possible Worlds
• A possible world is any subset of the tuples in the database. Its probability is the product of the probabilities p of the tuples in it and the complements 1 − p of the tuples that are not in it.
• Consider the following probabilistic database containing two relations S and T:
• A probabilistic database over schema S is a pair (W, P) where W = {W1, . . . , Wn} is a set of database instances over S, and P : W → [0, 1] is a probability distribution (i.e. Σ_{j=1..n} P(Wj) = 1). Each instance Wj for which P(Wj) > 0 is called a possible world.
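The product rule above can be sketched in a few lines of Python. This is a minimal illustration that assumes every tuple is independent of the others; the tuple names and probabilities are made up for the example and do not come from the paper.

```python
from itertools import product

def possible_worlds(tuple_probs):
    """Yield (world, probability) pairs for a tuple-independent database.

    tuple_probs maps a (hypothetical) tuple name to its marginal probability.
    """
    names = sorted(tuple_probs)
    for bits in product([False, True], repeat=len(names)):
        world = frozenset(n for n, present in zip(names, bits) if present)
        prob = 1.0
        for n, present in zip(names, bits):
            # p for a tuple in the world, 1 - p for a tuple left out
            prob *= tuple_probs[n] if present else 1.0 - tuple_probs[n]
        yield world, prob

# Illustrative data only
probs = {"t1": 0.5, "t2": 0.8}
worlds = dict(possible_worlds(probs))
```

With two tuples there are four worlds, and their probabilities sum to 1, as the definition of (W, P) requires.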

Example on possible worlds
• Let Jp be a database instance over schema Sp. Then Mod(Jp) is the probabilistic database (W, P) over the schema S.
• The possible worlds semantics is shown in the figure below, using the example considered in the paper:

DNF Formula Based Query Evaluation
• In Boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula as a disjunction of conjunctive clauses.
• Let (W, P) be a probabilistic database and let t1, t2, . . . be all the tuples in all possible worlds. We interpret each tuple as a Boolean propositional variable, and each possible world W as a truth assignment to these variables: ti = true if ti belongs to W, and ti = false if ti does not belong to W.
• Consider now a DNF formula E over tuples: clearly E is true in some worlds and false in others. Define its probability P(E) to be the sum of P(W) over all worlds W where E is true. Continuing our example, the expression E = (t1 ∧ t5) ∨ t2 is true in the possible worlds W3, W7, W10, W11, and its probability is thus P(E) = P(W3) + P(W7) + P(W10) + P(W11).
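As a rough sketch of this definition, the function below sums P(W) over every world in which a DNF formula holds. It assumes independent tuples, represents a formula as a list of clauses (each clause a tuple of variable names), and uses made-up probabilities rather than the values from the paper's example.

```python
from itertools import product

def dnf_probability(clauses, tuple_probs):
    """Exact P(E) by enumerating all worlds; exponential in the number of tuples."""
    names = sorted(tuple_probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = {n for n, present in zip(names, bits) if present}
        prob = 1.0
        for n, present in zip(names, bits):
            prob *= tuple_probs[n] if present else 1.0 - tuple_probs[n]
        # E is true in this world if some clause has all of its tuples present
        if any(set(clause) <= world for clause in clauses):
            total += prob
    return total

# E = (t1 AND t5) OR t2, with illustrative probabilities
probs = {"t1": 0.5, "t2": 0.4, "t5": 0.5}
p = dnf_probability([("t1", "t5"), ("t2",)], probs)
```

Because the two clauses share no tuples here, the result can be checked by hand: 1 − (1 − 0.25)(1 − 0.4) = 0.55.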

DNF Formula Based Query Evaluation
• Example query: qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE ...
• Each answer returned by qe has 7 tuple variables, defined in the FROM clause: (a, b, axp, byp, x, y, d), where axp and byp are probabilistic tuples (from the TitleMatch table).
• Thus, every row t returned by qe defines a Boolean expression t.E = axp ∧ byp.

Group by usage
• Next we group the rows by their directors, and for each group G = {(axp1, byp1), . . . , (axpm, bypm)} construct a DNF formula: G.E = (axp1 ∧ byp1) ∨ . . . ∨ (axpm ∧ bypm). The director's probability is given by P(G.E).
• How do we calculate the director's probability?
• Brute-force approach: enumerate every possible world and calculate the truth value of the Boolean expression; p = P(G.E) is the total probability of the worlds in which G.E is true. This is a #P-hard problem.
• The alternative is Monte Carlo simulation, which is far better than the brute-force approach.
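The grouping step can be pictured as follows. This is only a sketch: it assumes each query row has already been reduced to a (director, ax, by) triple, and the identifiers in the sample rows are hypothetical.

```python
from collections import defaultdict

def group_lineage(rows):
    """Group rows by director; each group's clause list is the DNF formula G.E.

    rows is an iterable of (director, ax, by) triples, where ax and by name
    the two probabilistic tuples of that row.
    """
    groups = defaultdict(list)
    for director, ax, by in rows:
        groups[director].append((ax, by))  # one conjunct (ax AND by) per row
    return dict(groups)

# Hypothetical rows, for illustration only
rows = [("d1", "ax1", "by1"), ("d2", "ax2", "by2"), ("d1", "ax3", "by3")]
lineage = group_lineage(rows)
```

Here `lineage["d1"]` holds two conjuncts, so its formula is (ax1 ∧ by1) ∨ (ax3 ∧ by3).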

Monte Carlo Simulation
• An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the Boolean expression G.E (Eq.(3)); the probability p = P(G.E) is approximated by the frequency p̃ with which G.E was true.
• The Luby & Karp algorithm is a variant of MC. The important part for our algorithm is that after running N steps it guarantees, with high probability, that p lies in some interval [a^N, b^N].
• Algorithm:
    fix an order on the disjuncts: t1, t2, . . . , tm
    C := 0
    repeat
      Choose a random disjunct ti ∈ G
      Choose a random truth assignment s.t. ti.E = true
      if forall j < i, tj.E = false then C := C + 1
    until N times
    return p̃ = C/N
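A minimal Python sketch of this estimator follows, under a few stated assumptions: tuples are independent, a disjunct is chosen with probability proportional to its own probability, and the hit frequency is scaled by the sum of the disjunct probabilities. The last two points are standard details of the Karp-Luby estimator that the slide's pseudocode does not spell out; the function names and sample data are illustrative.

```python
import random

def clause_prob(clause, probs):
    """Probability that all tuples of one conjunct are present (independent tuples)."""
    p = 1.0
    for t in clause:
        p *= probs[t]
    return p

def karp_luby(clauses, probs, n_steps, rng):
    """Estimate P(clause_1 OR ... OR clause_m) with n_steps samples."""
    weights = [clause_prob(c, probs) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(n_steps):
        # Pick disjunct i with probability proportional to P(t_i.E)
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on disjunct i being true
        world = {t for t in probs if t in clauses[i] or rng.random() < probs[t]}
        # Count the sample only if no earlier disjunct is also true
        if not any(set(clauses[j]) <= world for j in range(i)):
            hits += 1
    return total * hits / n_steps

probs = {"t1": 0.5, "t2": 0.4, "t5": 0.5}
estimate = karp_luby([("t1", "t5"), ("t2",)], probs, 20000, random.Random(0))
```

For this small formula the exact value is 0.55, so the estimate should land close to it; the interval [a^N, b^N] mentioned above would be derived from N and the desired confidence, which this sketch leaves out.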

Critical region
• Critical region: (c, d) = (top_k(a1, . . . , an), top_{k+1}(b1, . . . , bn)), where top_i denotes the i-th largest value and [ai, bi] is the current interval for object Gi
• Top objects: T = {Gi | d <= ai}
• Bottom objects: B = {Gi | bi <= c}
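To make these definitions concrete, here is a small sketch that computes the critical region and the top and bottom sets from a list of intervals. The function names and the sample intervals are assumptions for illustration, and the code expects more than k objects.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest lower bound and the (k+1)-th largest upper bound."""
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_and_bottom(intervals, k):
    """Indices of objects already known to be in (T) or out of (B) the top k."""
    c, d = critical_region(intervals, k)
    top = [i for i, (a, _) in enumerate(intervals) if d <= a]
    bottom = [i for i, (_, b) in enumerate(intervals) if b <= c]
    return top, bottom

# Illustrative intervals [a_i, b_i] for four objects
intervals = [(0.7, 0.9), (0.6, 0.8), (0.1, 0.3), (0.2, 0.5)]
c, d = critical_region(intervals, 2)
top, bottom = top_and_bottom(intervals, 2)
```

In this example c = 0.6 and d = 0.5, so the critical region is empty and the first two objects are separated from the rest.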

Multisimulations
• The idea in our algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly.
• G = {G1, . . . , Gn} is a set of n objects with unknown probabilities p1, . . . , pn, and TopK ⊆ G is the set of k objects with the highest probabilities.
• We assume c < d from now on, and call Gi a crosser if its interval [ai, bi] overlaps the critical region (c, d).
• Gi is a double crosser if ai < c and d < bi.
• Gi is a lower (upper) crosser if ai < c (d < bi).

Multisimulations
• Algorithm:
    MS_TopK(G, k): /* G = {G1, . . . , Gn} */
    Let [a1, b1] = . . . = [an, bn] = [0, 1], (c, d) = (0, 1)
    while c <= d do
      Case 1: choose a double crosser to simulate
      Case 2: choose upper and lower crosser to simulate
      Case 3: choose a maximal crosser to simulate
      Update (c, d) using Eq.(5)
    end while
    return TopK = T = {Gi | d <= ai}
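The control loop can be sketched in Python as below. This is a toy under explicit assumptions, not the paper's implementation: `ToySimulation` is a hypothetical stand-in whose `step()` simply halves the distance between each interval bound and the true probability, "maximal crosser" is taken to mean the widest crossing interval, non-strict inequalities are used so that the initial [0, 1] intervals count as double crossers, and the object probabilities are assumed to be distinct.

```python
class ToySimulation:
    """Hypothetical stand-in for one Monte Carlo simulation of an object G_i."""
    def __init__(self, p):
        self.p, self.a, self.b = p, 0.0, 1.0

    def step(self):
        # Shrink [a, b] toward the true probability (a real simulation would sample)
        self.a = (self.a + self.p) / 2
        self.b = (self.b + self.p) / 2

def width(s):
    return s.b - s.a

def critical_region(sims, k):
    c = sorted((s.a for s in sims), reverse=True)[k - 1]
    d = sorted((s.b for s in sims), reverse=True)[k]
    return c, d

def ms_topk(sims, k):
    """Return indices of the top-k objects, simulating only crossers."""
    c, d = critical_region(sims, k)
    while c <= d:
        crossers = [s for s in sims if s.a <= d and c <= s.b]
        double = [s for s in crossers if s.a <= c and d <= s.b]
        lower = [s for s in crossers if s.a <= c and s.b < d]
        upper = [s for s in crossers if c < s.a and d <= s.b]
        if double:                      # Case 1
            max(double, key=width).step()
        elif lower and upper:           # Case 2
            max(lower, key=width).step()
            max(upper, key=width).step()
        else:                           # Case 3
            max(crossers, key=width).step()
        c, d = critical_region(sims, k)
    return [i for i, s in enumerate(sims) if d <= s.a]

sims = [ToySimulation(p) for p in (0.875, 0.625, 0.25, 0.125)]
result = ms_topk(sims, 2)
```

With these four illustrative probabilities and k = 2, the loop stops as soon as the two highest intervals are separated from the others, without refining any interval further than needed.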

Experiments - Various Query Result Sizes
• The experiments use 4 queries that illustrate different scales for the number of groups and the average size of each group, where S = Small and L = Large:
• SS
• SL
• LS
• LL

Results Obtained

Conclusions & Future Work
• The paper shows that its algorithm obtains near-optimal answers for top-k queries on probabilistic databases, with applications to imprecision in data.
• Future work needs to address data models in which the probabilities are not listed explicitly.

THANK YOU. QUESTIONS?
