
Top-K Query Evaluation on Probabilistic Data



Presentation Transcript


  1. Top-K Query Evaluation on Probabilistic Data
     Christopher Ré, Nilesh Dalvi and Dan Suciu
     University of Washington

  2. High Level Overview
     • DBMS: precise answers over clean data
     • Data are often imprecise
     • Information integration
     • Information extraction
     • Probabilistic DBs (PDBs) handle imprecision
     • Many low-quality answers
     • Top-K ranked by probability
     This talk: compute Top-K efficiently
     Evaluating Complex SQL on PDBs

  3. Overview
     • Motivating Example
     • Query Processing Background
     • Multisimulation
     • Experimental Results

  4. Overview
     • Motivating Example
     • Query Processing Background
     • Multisimulation
     • Experimental Results

  5. Example Application
     Query: find all years where ‘Anthony Hopkins’ starred in a good movie
     IMDB:
     • Lots of interesting data about movies (e.g. actors, directors)
     • Well maintained and clean
     • But no reviews!
     On the web there are lots of reviews
     How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.
     Is a movie good or bad? Alice wants to do sentiment analysis.
     A probabilistic database can help Alice store and query her uncertain data.

  6. Imprecision is out there… Object Reconciliation
     Inputs: clean IMDB data and data extracted from reviews
     Fellegi-Sunter approach: score (s) each (RID, MID) pair
     Output: (RID, MID) pairs
     Our approach: convert scores to probabilities
     [Figure: score axis with thresholds t’ and t, labeled “No Match” and “Match”]
     12/8/2006

  7. Imprecision is out there… Object Reconciliation
     Fellegi-Sunter approach: score (s) each (RID, MID) pair
     [Figure: score axis with thresholds t’ and t, labeled “No Match” and “Match”]

  8. Overview
     • Motivating Example
     • Query Processing Background
     • Multisimulation
     • Experimental Results

  9. Query Processing Background
     • Intensional Query Processing [FR97]
     • Associate an event with each tuple
     • Probability that the event is satisfied = query value
     Query processing builds the event expression
     Technical point: projection as the last operator implies the result is a DNF
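As a small illustration of the event-expression view on slide 9, the sketch below represents the event of one output tuple as a DNF over base-tuple events and computes its probability exactly by enumerating possible worlds. The function name, the event ids (`r1`, `m1`, `r2`) and the assumption that base events are independent are illustrative choices, not taken from the talk. Enumeration is exponential in the number of events, so this is only meant for tiny inputs.

```python
from itertools import product


def dnf_probability_exact(clauses, prob):
    """Exact probability that a positive DNF over independent events is true.

    clauses: iterable of sets of event ids; each set is one conjunction.
    prob:    dict mapping each event id to its marginal probability.
    """
    clauses = [frozenset(c) for c in clauses]
    variables = sorted({v for c in clauses for v in c})
    total = 0.0
    # Enumerate every possible world (truth assignment to the base events).
    for world in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, world))
        # The DNF holds if at least one conjunction is fully satisfied.
        if any(all(assignment[v] for v in c) for c in clauses):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if assignment[v] else 1.0 - prob[v]
            total += weight
    return total


# Hypothetical event for one output tuple: (r1 AND m1) OR r2.
example_clauses = [{"r1", "m1"}, {"r2"}]
example_prob = {"r1": 0.5, "m1": 0.5, "r2": 0.5}
p_exact = dnf_probability_exact(example_clauses, example_prob)
```

Because the two conjunctions in the example share no events, the result can be checked by hand as 1 - (1 - 0.25)(1 - 0.5).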

  10. DNF Sampling at a High Level
      • Estimate p(t), the probability that the DNF is satisfied
      • Do this for each output tuple, t
      • #P-hard [Valiant79], even for conjunctive queries only [RDS06, DS04]
      • Randomized approximation [LK84]
      Simulation reduces uncertainty
      [Figure: interval on a 0.0 to 1.0 axis, labeled “Uncertain about p(t)”]
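A minimal sketch of a randomized DNF estimator in the spirit of the approximation cited as [LK84] follows. It is a standard coverage-style (Karp-Luby) estimator written under the assumptions that base events are independent and that every clause is a conjunction of positive events; the function and parameter names are hypothetical and the talk's actual implementation may differ.

```python
import random


def karp_luby_estimate(clauses, prob, num_samples, rng):
    """Estimate the probability of a positive DNF over independent events.

    clauses:     list of sets of event ids (each set is one conjunction).
    prob:        dict mapping each event id to its marginal probability.
    num_samples: number of simulation steps.
    rng:         a random.Random instance.
    """
    clauses = [sorted(c) for c in clauses]
    # Weight of a clause = probability that all of its events are true.
    weights = []
    for c in clauses:
        w = 1.0
        for v in c:
            w *= prob[v]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    variables = sorted({v for c in clauses for v in c})
    hits = 0
    for _ in range(num_samples):
        # Pick a clause proportionally to its weight ...
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # ... then sample a world conditioned on that clause being true.
        world = {v: rng.random() < prob[v] for v in variables}
        for v in clauses[i]:
            world[v] = True
        # Count the sample only if clause i is the first satisfied clause,
        # so that each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[v] for v in c))
        if first == i:
            hits += 1
    return total * hits / num_samples
```

Each additional sample tightens the estimate, which is the sense in which simulation reduces uncertainty about p(t).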

  11. Naïve Query Processing
      • Naïve algorithm (PTIME): simulate until all intervals are small
      • “Epsilon”-small
      Can we do better?
      [Figure: intervals on a 0.0 to 1.0 axis for Christopher Walken, Samuel L. Jackson, Harvey Keitel and Bruce Willis, with rank labels 1 to 4]
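The naïve strategy can be sketched as follows: simulate every output tuple round-robin until every confidence interval is narrower than epsilon, then report the k largest estimates. This sketch assumes each tuple is a black-box 0/1 sampler and uses a Hoeffding bound for the interval; both are stand-ins chosen for illustration, and all names are hypothetical.

```python
import math
import random


def hoeffding_interval(successes, n, delta):
    """Two-sided Hoeffding confidence interval for a Bernoulli mean."""
    if n == 0:
        return (0.0, 1.0)
    mean = successes / n
    half = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return (max(0.0, mean - half), min(1.0, mean + half))


def naive_top_k(samplers, k, eps, delta):
    """Simulate all tuples until every interval has width <= eps.

    samplers: dict mapping a tuple id to a zero-argument function returning 0 or 1.
    Returns (top-k tuple ids by estimated probability, total number of draws).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    stats = {t: [0, 0] for t in samplers}  # tuple id -> [successes, draws]
    while True:
        intervals = [hoeffding_interval(s, n, delta) for s, n in stats.values()]
        if all(hi - lo <= eps for lo, hi in intervals):
            break
        # One more simulation step for every tuple, whether it matters or not.
        for t, draw in samplers.items():
            stats[t][0] += draw()
            stats[t][1] += 1
    ranked = sorted(samplers, key=lambda t: stats[t][0] / stats[t][1], reverse=True)
    return ranked[:k], sum(n for _, n in stats.values())
```

The cost is the same for every tuple, including those whose intervals were already clearly above or below the cut, which is what motivates the question "Can we do better?".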

  12. Overview
      • Motivating Example
      • Query Processing Background
      • Multisimulation
      • Experimental Results

  13. A Better Method: Multisimulation
      • Separate Top-K with few simulations
      • Concentrate on intervals in Top-K
      • Asymptotically, confidence intervals are nested
      • Compare against OPT
      • OPT “knows” which intervals to simulate
      [Figure: intervals on a 0.0 to 1.0 axis for Christopher Walken, Samuel L. Jackson, Harvey Keitel and Bruce Willis, with rank labels 1 to 4]

  14. The Critical Region
      • The critical region is the interval (kth-highest min, (k+1)st-highest max)
      • Example for k = 2
      [Figure: 0.0 to 1.0 probability axis]
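The definition on slide 14 translates directly into code. In the sketch below, each tuple has a (low, high) confidence interval; the interval values and function names are made up for illustration, and treating the top-k as separated once the region is empty (c >= d) is this sketch's reading of the slide.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th highest lower bound and the (k+1)-st highest upper bound.

    intervals: dict mapping a tuple id to its (low, high) confidence interval.
    Requires 1 <= k < len(intervals).
    """
    if not 1 <= k < len(intervals):
        raise ValueError("need 1 <= k < number of intervals")
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def top_k_is_separated(intervals, k):
    """The top-k set is determined once the critical region is empty."""
    c, d = critical_region(intervals, k)
    return c >= d


# Hypothetical intervals for four output tuples, with k = 2.
example_intervals = {
    "t1": (0.6, 0.9),
    "t2": (0.4, 0.7),
    "t3": (0.3, 0.5),
    "t4": (0.1, 0.2),
}
region = critical_region(example_intervals, 2)
```

For these made-up intervals the second-highest lower bound is 0.4 and the third-highest upper bound is 0.5, so `t2` and `t3` still overlap inside the region and more simulation is needed.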

  15. Three Simple Rules: Rule 1
      • Pick a “Double Crosser”
      • OPT must pick this too
      [Figure: 0.0 to 1.0 probability axis]

  16. Three Simple Rules: Rule 2
      • If all crossers are lower crossers, or all are upper crossers, then pick a maximal one
      • OPT must pick this too
      [Figure: 0.0 to 1.0 probability axis]

  17. Three Simple Rules: Rule 3
      • Pick an upper and a lower crosser
      • OPT may only pick 1 of these two
      [Figure: 0.0 to 1.0 probability axis]
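The three rules can be sketched as a single selection step that returns which intervals to simulate next. The slides do not spell out the definitions, so the following are assumptions of this sketch: a lower crosser strictly straddles the lower end c of the critical region, an upper crosser strictly straddles the upper end d, a double crosser does both, and "maximal" is read as "widest". All names are hypothetical.

```python
def choose_intervals_to_simulate(intervals, k):
    """Pick the tuple ids to simulate next; returns [] once top-k is separated.

    intervals: dict mapping a tuple id to its (low, high) confidence interval.
    """
    if not 1 <= k < len(intervals):
        raise ValueError("need 1 <= k < number of intervals")
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    c, d = lows[k - 1], highs[k]  # critical region (c, d)
    if c >= d:
        return []

    def width(t):
        lo, hi = intervals[t]
        return hi - lo

    # Assumed definitions of the crosser types (not given on the slides).
    lower = [t for t, (lo, hi) in intervals.items() if lo < c < hi]
    upper = [t for t, (lo, hi) in intervals.items() if lo < d < hi]
    double = [t for t in lower if t in upper]

    if double:
        # Rule 1: a double crosser must be simulated.
        return [max(double, key=width)]
    if lower and upper:
        # Rule 3: simulate one upper and one lower crosser together.
        return [max(upper, key=width), max(lower, key=width)]
    # Rule 2: only one kind of crosser (or none that strictly straddles):
    # take the widest crosser, else the widest interval overlapping (c, d).
    candidates = lower + upper
    if not candidates:
        candidates = [t for t, (lo, hi) in intervals.items() if lo < d and hi > c]
    return [max(candidates, key=width)]
```

A full multisimulation loop would call this step repeatedly, run one more simulation on each returned tuple, and shrink its interval, stopping when the function returns an empty list.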

  18. Multisimulation is a 2-Approximation
      • Thm: Multisimulation performs at most twice as many simulations as OPT
      • And no deterministic algorithm can do better on every instance
      • Extensions
      • Top-K Set (shown)
      • Anytime (produce from 1 to k)
      • Rank (produce top k ranked)
      • All (rank all intervals)

  19. Overview
      • Motivating Example
      • Query Processing Background
      • Multisimulation
      • Experimental Results

  20. Experiment Details: Uncertain tuples

  21. Running Time

  22. Running Time
      “Find all years in which Anthony Hopkins was in a highly rated movie” (SS)
      Small number of output tuples (33)
      Small DNFs per output (avg. 20.4, max. 63)

  23. Running Time
      “Find all directors who have a highly rated drama but low rated comedy” (LL)
      Large number of output tuples (1415)
      Large DNFs per output (avg. 234.8, max. 9088)

  24. Conclusions
      • Mystiq is a general purpose probabilistic database
      • Multisimulation and Logical Optimization
      • Key to performance on large data sets
      • Advert: Demo on my laptop

  25. Running Time
      “Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)
      Small number of output tuples (33)
      Large DNFs per output (avg. 117.7, max. 685)

  26. Running Time
      “Find all directors in the 80s who had a highly rated movie” (LS)
      Large number of output tuples (3259)
      Small DNFs per output (avg. 3.03, max. 30)

  27. [Figure: Christopher Walken, Samuel L. Jackson, Harvey Keitel and Bruce Willis on a 0.0 to 1.0 axis]

  28. [Figure: Christopher Walken, Samuel L. Jackson, Harvey Keitel and Bruce Willis on a 0.0 to 1.0 axis, with rank labels 1 to 4]

  29. [Figure: 0.0 to 1.0 axis]

  30. [Figure: 0.0 to 1.0 axis]

  31. [Figure: 0.0 to 1.0 axis]

  32. [Figure: 0.0 to 1.0 axis]

  33. [Figure: 0.0 to 1.0 axis]
