1 / 18

Presented By Chandrashekar Vijayarenu Anirban Maiti

Efficient Top-k Query Evaluation on Probabilistic Data Paper By Christopher R´e Nilesh Dalvi Dan Suciu. Presented By Chandrashekar Vijayarenu Anirban Maiti. Agenda . Overview DNF Evaluation Query Evaluation Problem definition Monte Carlo (MC) Simulation Multi simulation(MS)

abra-hyde
Download Presentation

Presented By Chandrashekar Vijayarenu Anirban Maiti

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Top-k Query Evaluation on Probabilistic DataPaper By Christopher R´eNileshDalvi Dan Suciu Presented By ChandrashekarVijayarenu AnirbanMaiti

  2. Agenda • Overview • DNF Evaluation • Query Evaluation • Problem definition • Monte Carlo (MC) Simulation • Multi simulation(MS) • Experiments

  3. Probabilistic Databases • Probabilistic database is an uncertain database in which the possible worlds have associated probabilities. The simplistic definition is that every tuple belongs to the database with some probability (between 0 - 1) • Example:

  4. Select Query on Probabilistic Database • Find directors with a highly rated Drama and low rated comedy SELECT DISTINCT d.dirName AS Director FROM AMZNReviewsa, AMZNReviewsb, TitleMatchax, TitleMatchby, IMDBMoviex, IMDBMoviey, IMDBDirectord WHERE a.asin=ax.asin and b.asin=by.asin and ax.mid=x.mid and by.mid=y.mid and x.did=y.did and y.did=d.did and x.genre=’comedy’ and y.genre=’drama’ and abs(x.year - y.year) <= 5 and a.rating<2 and b.rating>4 Major challenge in probabilistic database is query evaluation. Dalvi and Suciu have shown that most SQL queries are #P Complete. This paper propose a new approach to query evaluation on probabilistic databases, by combining top-k style queries. User specifies a SQL Query and a number k, and the system returns the highest ranked k answers.

  5. Query Processing Challenges • Compute exact output probabilities is computationally hard. Meaning, any algorithm computing the probabilities need to iterate through all possible subsets of TitleMatch. • Potential answers for which we need to calculate the probability is large. User is likely to end up inspecting just the first few of them. This paper introduces the multisimulation algorithm which enables effective processing of probabilistic queries with some error guarantees.

  6. Overview

  7. Possible Worlds • A possible world is thus any subset of the tuples in the database and its probability can be computed as a product of the probabilities of the tuples in it, and the respective probabilities of the tuples that are not in that world. Consider the following probabilistic database containing two relations S and T: • A probabilistic database over schema S is a pair (W,P) where W = {W1, . . . ,Wn} is a set of database instances over S, and P : W ->[0, 1] is a probability distribution (i.e. P j=1,n P(Wj) = 1). Each instance Wj for which P(Wj) > 0 is called a possible world. • Let Jp be a database instance over schema Sp. Then Mod(Jp) is the probabilistic database (W,P) over the schema S obtained as described below

  8. DNF Formula In boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula which is a disjunction of conjunctive clauses. In our Example E = (t1Λt5) V t2 = true in the possible worlds W3,W7,W10,W11, and its probability is thus P(E) = P(W3) + P(W7) + P(W10) + P(W11).

  9. Query Evaluation  DNF Evaluation qe = SELECT * FROM AMZNReviewsa, AMZNReviewsb, TitleMatchax, TitleMatchby, IMDBMoviex, IMDBMoviey, IMDBDirectord WHERE ... • Each answer returned by qe will have 7 tuple variables defined in where clause: (a, b, axp, byp, x, y, d) • axp and bypare probabilistic tuples. (From TupleMatch table) • Thus, every row returned by qe defines a boolean expression t.E = axpΛbyp.

  10. Introduce Group By Next we group the rows by their directors, and for each group G = {(axp1, byp1), . . . , (axpm, bypm)} DNF formula: G.E = (axp1Λbyp1) V . . . V (axpmΛbxpm) The director’s probability give by P(G.E). How to calculate the director’s probability? Brute Force Approach: Choose every possible world and calculate the truth value of the boolean expression. p = P(G.E) is the frequency with which G.E = true #P Hard problem Monte Carlo Simulation: Choose the possible world at random and calculate the truth value.

  11. Problem Definition G = {G1, . . . ,Gn} of n objects, with unknown Probabilities p1, . . . , pn, and a number k <= n. Our goal is to find a set of k objects with the highest probabilities, denoted Top-K Subset of G. Solution: The idea in our algorithm is to run in parallel several Monte-Carlo simulations, one for each candidate answer, and approximate each probability only to the extent needed to compute correctly the top-k answers

  12. Monte Carlo (MC) Simulation An MC algorithm repeatedly chooses at random a possible world, and computes the truth value of the Boolean expression G.E (Eq.(3)); the probability p = P(G.E) is approximated by the frequency ˜p with which G.E was true. Luby and Karp have described the variant shown in Algorithm fix an order on the disjuncts: t1, t2, . . . , tm C := 0 repeat Choose a random disjuncttiЄ G Choose a random truth assignment s.t. ti.E = true if forall j < itj.E = false then C := C + 1 until N times return ˜p = C/N

  13. Multisimulation (MS) Assumptions: Intervals:

  14. Multisimulation (MS) Critical Region: The critical region, top objects, and bottom objects are: (c, d) = (topk(a1, . . . , an), topk+1(b1, . . . , bn))……Eq (5) T = {Gi | d <= ai} B = {Gi | bi <= c} Algorithm: MS TopK(G, k) : /* G = {G1, . . . ,Gn} */ Let [a1, b1] = . . . = [an, bn] = [0, 1], (c, d) = (0, 1) while c <=d do Case 1:choose a double crosser to simulate Case 2:choose upper and lower crosser to simulate Case 3:choose a maximal crosser to simulate Update (c, d) using Eq.(5) end while returnTopK= T = {Gi| d <=ai}

  15. Example : Let us select Top 2. c d G1 G2 G3 G4 G5 0.1 0.3 0.4 0.5 0.6 0.8 0.9 0.2 0.7

  16. Experiments

  17. Experiments results

  18. Thank you

More Related