
Efficient Top-K Query Evaluation on Probabilistic Data

Paper by: Christopher Ré, Nilesh Dalvi, Dan Suciu. International Conference on Data Engineering (ICDE), 2007. Presented by: Jitendra Gupta.


Presentation Transcript


  1. Efficient Top-K Query Evaluation on Probabilistic Data
     Paper by: Christopher Ré, Nilesh Dalvi, Dan Suciu. International Conference on Data Engineering (ICDE), 2007.
     Presented by: Jitendra Gupta

  2. Outline
     • Introduction
     • Challenges in Probabilistic Databases
     • Possible Worlds
     • DNF Formula Based Query Evaluation
     • Monte Carlo (MC) Simulation
     • Critical Region
     • Multisimulations
     • Experiments & Results
     • Conclusions & Future Work

  3. Introduction
     • Probabilistic databases model data that contains unreliable, inconsistent, or imprecise information, but evaluating SQL queries on such data is difficult.
     • Imprecision in the data leads to a large number of low-quality answers, and users are interested only in the answers with the highest probabilities.
     • Previous approaches restricted the SQL queries so that they could be evaluated exactly; the algorithm in the paper instead efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database.
     • The paper's new approach to query evaluation on probabilistic databases combines top-k style queries with approximation algorithms that have provable guarantees.

  4. Probabilistic Databases
     • A probabilistic database is an uncertain database in which each possible world has an associated probability. In the simplest definition, every tuple belongs to the database with some probability between 0 and 1.
     • #P-complete queries are not handled efficiently in probabilistic databases.

  5. Challenges in Probabilistic Databases
     • The major challenge in probabilistic databases is query evaluation. Dalvi & Suciu have shown that most SQL queries are #P-complete, and the algorithm described in the paper handles such queries efficiently.
     • Computing the exact output probabilities is computationally hard.
     • Any algorithm that computes the output probabilities exactly essentially needs to iterate through all possible worlds (here, all possible subsets of TitleMatch^p).
     • Another challenge is that the number of potential answers for which a probability must be computed is large, while the user is likely to inspect only the first few of them.

  6. Possible Worlds
     • When tuples are independent, a possible world is any subset of the tuples in the database. Its probability is the product of the probabilities p of the tuples that are in it and of the complements 1 - p of the tuples that are not in it.
     • Consider the following probabilistic database containing two relations S and T: [Figure: example probabilistic database with relations S and T]
     • A probabilistic database over schema S is a pair (W, P), where W = {W1, ..., Wn} is a set of database instances over S and P : W -> [0, 1] is a probability distribution (i.e. $\sum_{j=1}^{n} P(W_j) = 1$). Each instance Wj for which P(Wj) > 0 is called a possible world.
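
The product rule on this slide can be sketched in a few lines. This is only an illustration under the assumption that tuples are independent; the function name `possible_worlds` and the tuple names and probabilities are made up for the example and do not come from the paper.

```python
from itertools import product

def possible_worlds(tuple_probs):
    """Yield (world, probability) for every subset of independent tuples."""
    names = list(tuple_probs)
    for present in product([True, False], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, present) if keep)
        prob = 1.0
        for n, keep in zip(names, present):
            p = tuple_probs[n]
            # Multiply by p if the tuple is in the world, by 1 - p otherwise.
            prob *= p if keep else 1.0 - p
        yield world, prob

# Hypothetical tuples and probabilities, for illustration only.
probs = {"t1": 0.5, "t2": 0.8}
worlds = dict(possible_worlds(probs))
```

The world probabilities sum to 1, which matches the requirement that P is a probability distribution over W.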

  7. Example on Possible Worlds
     • Let J^p be a database instance over schema S^p. Then Mod(J^p) is the probabilistic database (W, P) over the schema S.
     • The possible world semantics are shown in the figure below, for the example considered in the paper: [Figure: possible worlds of the paper's example]

  8. DNF Formula Based Query Evaluation
     • In Boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula as a disjunction of conjunctive clauses.
     • Let (W, P) be a probabilistic database and let t1, t2, ... be all the tuples in all possible worlds. We interpret each tuple as a Boolean propositional variable, and each possible world W as a truth assignment to these propositional variables, as follows: ti = true if ti belongs to W, and ti = false if ti does not belong to W.
     • Consider now a DNF formula E over tuples: clearly E is true in some worlds and false in others. Define its probability P(E) to be the sum of P(W) over all worlds W in which E is true. Continuing our example, the expression $E = (t_1 \wedge t_5) \vee t_2$ is true in the possible worlds W3, W7, W10, W11, and its probability is thus P(E) = P(W3) + P(W7) + P(W10) + P(W11).
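
A minimal sketch of this definition of P(E), assuming independent tuples so that each world's probability is a simple product. The probabilities below are invented, so the result is not the value from the paper's example; `dnf_probability` is a hypothetical helper name.

```python
from itertools import product

def dnf_probability(clauses, tuple_probs):
    """Exact P(E) for a DNF E: sum of P(W) over worlds W in which E is true."""
    names = sorted(tuple_probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))  # truth assignment for this world
        p_world = 1.0
        for n in names:
            p_world *= tuple_probs[n] if world[n] else 1.0 - tuple_probs[n]
        # E is true if at least one conjunctive clause is fully satisfied.
        if any(all(world[t] for t in clause) for clause in clauses):
            total += p_world
    return total

# E = (t1 AND t5) OR t2, with made-up tuple probabilities.
probs = {"t1": 0.5, "t2": 0.4, "t5": 0.5}
p_e = dnf_probability([("t1", "t5"), ("t2",)], probs)
```

The loop visits 2^n worlds for n tuples, which is why exact evaluation of this kind does not scale.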

  9. DNF Formula Based Query Evaluation
     • Example query:
       qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE ...
     • Each answer returned by qe has 7 tuple variables, defined in the FROM clause: (a, b, ax^p, by^p, x, y, d), where ax^p and by^p are probabilistic tuples (from the TitleMatch table).
     • Thus, every row t returned by qe defines a Boolean expression $t.E = ax^p \wedge by^p$.

  10. Group By Usage
     • Next we group the rows by their directors, and for each group G = {(ax^p_1, by^p_1), ..., (ax^p_m, by^p_m)} we build a DNF formula: $G.E = (ax^p_1 \wedge by^p_1) \vee \dots \vee (ax^p_m \wedge by^p_m)$. The director's probability is given by P(G.E).
     • How do we calculate the director's probability?
     • Brute-force approach: enumerate every possible world and compute the truth value of the Boolean expression. p = P(G.E) is then the total probability of the worlds in which G.E is true. This is a #P-hard problem.
     • The alternative is Monte Carlo simulation, which is far better than the brute-force approach.
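
The grouping step can be illustrated as follows. The row layout (director, ax, by) and all values are hypothetical; the sketch only shows how each row contributes one conjunct to its director's DNF formula.

```python
from collections import defaultdict

def group_dnf(rows):
    """Map each director to the DNF clauses contributed by that director's rows."""
    groups = defaultdict(list)
    for director, ax, by in rows:
        groups[director].append((ax, by))  # one conjunct "ax AND by" per row
    return dict(groups)

# Made-up rows standing in for the answers of the example query.
rows = [
    ("Director A", "ax1", "by1"),
    ("Director A", "ax2", "by2"),
    ("Director B", "ax3", "by3"),
]
groups = group_dnf(rows)
```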

  11. Monte Carlo Simulation
     • An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the Boolean expression G.E (Eq. (3)); the probability p = P(G.E) is approximated by the frequency p̃ with which G.E was true.
     • The Luby & Karp algorithm is a variant of MC. The important part for our algorithm is that, after running N steps, it guarantees with high probability that p lies in some interval, $p \in [a^N, b^N]$.
     • Algorithm:
         fix an order on the disjuncts: t1, t2, ..., tm
         C := 0
         repeat
           choose a random disjunct ti ∈ G
           choose a random truth assignment such that ti.E = true
           if for all j < i, tj.E = false then C := C + 1
         until N times
         return p̃ = C/N
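
A runnable sketch of the estimator, under stated assumptions. It follows the standard Karp-Luby formulation rather than the slide's pseudocode verbatim: the disjunct is drawn with probability proportional to its own probability, and the frequency C/N is scaled by the sum of the disjunct probabilities. Tuples are assumed independent, and the function names and probabilities are invented for illustration.

```python
import math
import random

def clause_prob(clause, probs):
    """Probability that every tuple of a conjunctive clause is present."""
    return math.prod(probs[t] for t in clause)

def karp_luby(clauses, probs, n_steps, rng):
    """Estimate P(E) for a DNF E over independent tuples."""
    weights = [clause_prob(c, probs) for c in clauses]
    total = sum(weights)
    names = sorted(probs)
    count = 0
    for _ in range(n_steps):
        # Choose a disjunct t_i with probability proportional to P(t_i.E).
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Choose a random truth assignment in which t_i.E is true.
        world = {n: (n in clauses[i]) or (rng.random() < probs[n]) for n in names}
        # Count the sample only if no earlier disjunct is also true.
        if not any(all(world[t] for t in clauses[j]) for j in range(i)):
            count += 1
    # Assumption: scale the frequency by the total clause weight.
    return total * count / n_steps

probs = {"t1": 0.5, "t2": 0.4, "t5": 0.5}  # made-up probabilities
estimate = karp_luby([("t1", "t5"), ("t2",)], probs, 20000, random.Random(0))
```

For these inputs the exact value is 1 - (1 - 0.4)(1 - 0.25) = 0.55, so the estimate should land close to that.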

  12. Critical Region
     • Critical region: $(c, d) = (top_k(a_1, \dots, a_n), top_{k+1}(b_1, \dots, b_n))$, where $top_k$ denotes the k-th largest value.
     • Top objects: $T = \{G_i \mid d \le a_i\}$
     • Bottom objects: $B = \{G_i \mid b_i \le c\}$
     • [Slide figures: Assumptions; Intervals]
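
The definitions on this slide translate directly into code. The sketch below assumes intervals are given as (a_i, b_i) pairs and that k is smaller than the number of objects; the function names and sample intervals are illustrative only.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest lower bound and (k+1)-th largest upper bound."""
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_objects(intervals, d):
    """Indices of objects G_i with d <= a_i."""
    return [i for i, (a, _) in enumerate(intervals) if d <= a]

def bottom_objects(intervals, c):
    """Indices of objects G_i with b_i <= c."""
    return [i for i, (_, b) in enumerate(intervals) if b <= c]

# Made-up intervals [a_i, b_i] for four objects, with k = 2.
intervals = [(0.7, 0.9), (0.6, 0.8), (0.1, 0.3), (0.0, 0.2)]
c, d = critical_region(intervals, 2)
top = top_objects(intervals, d)
bottom = bottom_objects(intervals, c)
```

Here c exceeds d, so the critical region is empty and the top and bottom sets together cover all four objects.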

  13. Multisimulations
     • The idea of the algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly.
     • G = {G1, ..., Gn} is a set of n objects with unknown probabilities p1, ..., pn, and TopK ⊆ G is the set of k objects with the highest probabilities.
     • We assume c < d from now on, and call Gi a crosser if $[c, d] \subseteq [a_i, b_i]$.
     • Gi is a double crosser if $a_i < c$ and $d < b_i$.
     • Gi is a lower (upper) crosser if $a_i < c$ ($d < b_i$).
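
The crosser definitions can be checked mechanically. This sketch applies them exactly as written on the slide, treating lower and upper crossers as crossers that satisfy only one of the two strict inequalities; the function name and sample values are hypothetical.

```python
def classify(interval, c, d):
    """Classify one object's interval [a, b] against the critical region (c, d)."""
    a, b = interval
    if not (a <= c and d <= b):
        return "not a crosser"   # [c, d] is not contained in [a, b]
    if a < c and d < b:
        return "double crosser"
    if a < c:
        return "lower crosser"
    if d < b:
        return "upper crosser"
    return "crosser"             # a == c and b == d

c, d = 0.4, 0.6  # made-up critical region with c < d
labels = [classify(iv, c, d) for iv in [(0.3, 0.7), (0.3, 0.6), (0.4, 0.9), (0.5, 0.55)]]
```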

  14. Multisimulations
     • Algorithm:
         MS TopK(G, k):   /* G = {G1, ..., Gn} */
           Let [a1, b1] = ... = [an, bn] = [0, 1], (c, d) = (0, 1)
           while c <= d do
             Case 1: choose a double crosser to simulate
             Case 2: choose upper and lower crosser to simulate
             Case 3: choose a maximal crosser to simulate
             Update (c, d) using Eq. (5)
           end while
           return TopK = T = {Gi | d <= ai}
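
The loop above can be sketched end to end under several assumptions that are not from the paper: each simulation is replaced by a hypothetical `BernoulliSim` that draws coin flips with a hidden probability, its interval is a simple Hoeffding bound, "maximal" is read as "widest interval", and a `max_steps` cap plus a widest-interval fallback are added as safety nets.

```python
import math
import random

class BernoulliSim:
    """Hypothetical stand-in for one Monte Carlo simulation of an object G_i."""
    def __init__(self, p, rng, delta=0.01):
        self.p, self.rng, self.delta = p, rng, delta
        self.hits = self.n = 0

    def step(self):
        self.n += 1
        self.hits += self.rng.random() < self.p

    def interval(self):
        if self.n == 0:
            return 0.0, 1.0
        est = self.hits / self.n
        eps = math.sqrt(math.log(2 / self.delta) / (2 * self.n))  # Hoeffding half-width
        return max(0.0, est - eps), min(1.0, est + eps)

def critical_region(ivs, k):
    c = sorted((a for a, _ in ivs), reverse=True)[k - 1]
    d = sorted((b for _, b in ivs), reverse=True)[k]
    return c, d

def ms_topk(sims, k, max_steps=100_000):
    steps = 0
    while steps < max_steps:
        ivs = [s.interval() for s in sims]
        c, d = critical_region(ivs, k)
        if c > d:  # critical region is empty: the top k are separated
            break
        width = lambda i: ivs[i][1] - ivs[i][0]
        crossers = [i for i, (a, b) in enumerate(ivs) if a <= c and d <= b]
        double = [i for i in crossers if ivs[i][0] < c and d < ivs[i][1]]
        lower = [i for i in crossers if ivs[i][0] < c]
        upper = [i for i in crossers if d < ivs[i][1]]
        if double:                       # Case 1
            chosen = [max(double, key=width)]
        elif lower and upper:            # Case 2
            chosen = [max(lower, key=width), max(upper, key=width)]
        else:                            # Case 3, with a fallback if no crosser exists
            chosen = [max(crossers or range(len(sims)), key=width)]
        for i in chosen:
            sims[i].step()
            steps += 1
    ivs = [s.interval() for s in sims]
    _, d = critical_region(ivs, k)
    return {i for i, (a, _) in enumerate(ivs) if d <= a}

rng = random.Random(7)
sims = [BernoulliSim(p, rng) for p in (0.9, 0.8, 0.2, 0.1)]  # made-up probabilities
result = ms_topk(sims, 2)
```

With well-separated probabilities the loop stops after relatively few steps per object, since each interval only has to shrink far enough to clear the critical region.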

  15. Experiments: Various Query Result Sizes
     • The experiments use 4 queries that illustrate different scales for the number of groups and the average size of each group, where S = small and L = large:
     • SS
     • SL
     • LS
     • LL

  16. Results Obtained

  17. Conclusions & Future Work
     • The paper shows that the algorithm gives near-optimal answers for top-k queries on probabilistic databases, with applications to imprecision in data.
     • Future work should also handle data models in which the probabilities are not listed explicitly.

  18. THANK YOU ?QUESTIONS?
