
An overview of the Mystiq System

Christopher Ré, Dan Suciu, and the Mystiq Team, University of Washington. One-slide overview: data are uncertain in many applications, such as business (deduplication, information extraction) and physical-world data (RFID). Probabilistic DBs (pDBs) manage uncertainty.


Presentation Transcript


  1. An overview of the Mystiq System Christopher Ré, Dan Suciu and the Mystiq Team University of Washington

  2. One slide overview • Data are uncertain in many applications • Business: deduplication, information extraction • Data from the physical world: RFID • Probabilistic DBs (pDBs) manage uncertainty • Query and build applications on uncertain data • Value: higher recall, without loss of precision • This talk: an overview of Mystiq • DEMO

  3. Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SELECT-FROM-WHERE (SFW) queries 2. Safe Queries 3. Materialized Views

  4. [R,Dalvi&S’07] Example: Alice Looks for Movies • “I’d like to know which movies are really good…” • Internet Movie Database (IMDB): • Lots of data! • Well maintained and clean • But no reviews! • Think: Enterprise Data

  5. On the web there are lots of reviews… • What is the title of the movie in the review? • Which movie does that title match in my DB? • Is the review positive or negative? (“…Pulp Fiction was a great..”, “…Pul Fiction was awful…”) • Should I trust the reviewer? • Alice may need (Buzzwords): • Information Extraction • Fuzzy joins • Sentiment analysis • Social networks • Alice is forced to deal with uncertainty

  6. “Find actors in Pulp Fiction who appeared in two bad movies five years earlier” • Answer combines uncertainty from information extraction, fuzzy joins, etc. • A probabilistic database helps Alice store and query uncertain data • Alice may need (Buzzwords): • Information Extraction • Fuzzy joins • Sentiment analysis • Social networks

  7. Alice needs Fuzzy Joins • Clean database: IMDB • Reviews • Titles don’t match

  8. [Gravano et al. ’01, Arasu ’06] Result of a Fuzzy Join • TitleReviewMatchp • Higher scores, more likely to match

  9. Queries over Fuzzy Joins • Tables: IMDB, TitleReviewMatchp, Reviews • Who reviewed movies made in 1935? SELECT DISTINCT z.By FROM IMDB x, TitleReviewMatchp y, Amazon z WHERE x.title = y.title AND x.year = 1935 AND y.review = z.review • Answer: ranked! • Find movies reviewed by Jim and Joe: SELECT DISTINCT x.Title FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2 WHERE . . .z1.By=‘Joe’ . . . . z2.By=‘Jim’ . . .

  10. Hasn’t this been solved? (an analogy to keep in mind) SCALE Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?

  11. Mystiq Design Goals: scale. • Middleware/Query rewriting system. • RDBMS does heavy lifting. • In apps, lots of certain data. • Research Focus: Efficient query evaluation • Philosophy: Change as little as possible. • Restricted inference at large scales • Use DB tricks: static analysis, data complexity, materialized views.

  12. Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views

  13. [Barbara et al. ’92] Mystiq’s BID (block-independent-disjoint) tables • Table HasObjectp has key columns, non-key columns, and a probability column • What does it mean? • NB: Probabilities need not add to 1

  14. [Fagin, Halpern, Megiddo ’90] Possible Worlds Semantics • The pDB HasObjectp defines a distribution over possible worlds of HasObject • 3 * 4 = 12 worlds • Example world probabilities: p1p3, p1p4, p1(1 - p3 - p4 - p5)

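  15. Possible Worlds Query Semantics • Q = “John has laptop77 and doesn’t have book302” over the pDB HasObjectp • Worlds of HasObject that satisfy Q have probabilities p1p3, p1p5, and p1(1 - p3 - p4 - p5) • P[Q] = p1p3 + p1p5 + p1(1 - p3 - p4 - p5) = p1(1 - p4) • Goal: Compute P[Q] without expanding all worlds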
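The possible-worlds semantics on the two slides above can be checked by brute force on a tiny table. The following is a minimal sketch, not Mystiq code: the owner names other than John, the numeric probabilities, and the helper names `possible_worlds` and `query` are assumptions made for illustration. It enumerates all 3 * 4 = 12 worlds and sums the probabilities of those that satisfy Q.

```python
from itertools import product

# Hypothetical BID table HasObject^p (all numbers are made up).
# One block per key; alternatives inside a block are mutually exclusive,
# and different blocks are independent.
blocks = {
    "laptop77": {"John": 0.5, "Jim": 0.3},               # p1, p2
    "book302": {"Mary": 0.2, "John": 0.4, "Fred": 0.1},  # p3, p4, p5
}

def possible_worlds(blocks):
    """Yield (world, probability); a world maps each key to an owner or None."""
    keys = list(blocks)
    choices = []
    for key in keys:
        alternatives = list(blocks[key].items())
        # Probabilities need not add to 1: the remainder means "no tuple".
        alternatives.append((None, 1.0 - sum(blocks[key].values())))
        choices.append(alternatives)
    for combo in product(*choices):
        world = {key: owner for key, (owner, _) in zip(keys, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def query(world):
    """Q: John has laptop77 and does not have book302."""
    return world["laptop77"] == "John" and world["book302"] != "John"

worlds = list(possible_worlds(blocks))
p_q = sum(p for world, p in worlds if query(world))
print(len(worlds), p_q)
```

With these assumed numbers the sum agrees with the closed form p1(1 - p4) up to floating-point rounding. The enumeration is exponential in the number of blocks, which is why the slide's goal is to avoid expanding all worlds.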

  16. Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views

  17. [Fuhr&Roellke’97, Graedel et al. ’98, Dalvi & S ‘04] SFW Query via Intensional Eval • Goal: Make relational ops compute a Boolean expression f • Operators: duplicate-removing projection (Π), selection (σ), join (⋈) • Tuples = variables in the expression • Pr[q] is reduced to Pr[f is SAT] • Approximate Pr[f is SAT] • NB: f is also known as lineage
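To make the reduction from Pr[q] to Pr[f is SAT] concrete, here is a small sketch under stated assumptions: the tuple names, probabilities, and the DNF lineage are invented, tuples are treated as fully independent (ignoring the disjointness inside BID blocks), and the estimator is plain Monte Carlo sampling rather than the LK estimator the talk refers to.

```python
import random
from itertools import product

# Assumed independent tuple probabilities (made-up numbers).
prob = {"t1": 0.9, "t2": 0.5, "t3": 0.4}
# Assumed lineage f in DNF: (t1 AND t2) OR (t1 AND t3).
lineage = [{"t1", "t2"}, {"t1", "t3"}]

def satisfied(lineage, present):
    """f is true if every tuple of at least one clause is present."""
    return any(clause <= present for clause in lineage)

def exact_probability(lineage, prob):
    """Brute force over all 2^n tuple subsets; only feasible for tiny n."""
    names = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        present = {n for n, b in zip(names, bits) if b}
        weight = 1.0
        for n, b in zip(names, bits):
            weight *= prob[n] if b else 1.0 - prob[n]
        if satisfied(lineage, present):
            total += weight
    return total

def estimate_probability(lineage, prob, trials, rng):
    """Naive Monte Carlo: sample worlds and count how often f holds."""
    hits = 0
    for _ in range(trials):
        present = {n for n, p in prob.items() if rng.random() < p}
        hits += satisfied(lineage, present)
    return hits / trials

exact = exact_probability(lineage, prob)
approx = estimate_probability(lineage, prob, 20000, random.Random(0))
print(exact, approx)
```

Each sampled world is one "simulation"; more simulations narrow the gap between the estimate and the exact value, which is the cost that the later slides try to reduce.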

  18. Approximating Tuple answers • Q = “Find actors in Pulp Fiction who appeared in two bad movies five years earlier” • SQL queries have provably fast approximate inference (LK) • We don’t know the probability p (somewhere between 0.0 and 1.0) that ‘C. Walken’ (Christopher Walken) is in the output of Q • Run many “simulations” to reduce uncertainty

  19. [R,Dalvi&S’07] Motivation for Top-K for SFW queries • “Find the top (most-likely) actor in Pulp Fiction who appeared in two bad movies five years earlier” • Naïve: Run LK until all intervals are small • LK is fast in theory… • Lots of wasted effort. Can we do better? • Intervals on a 0.0 to 1.0 axis: 1 Christopher Walken, 2 Samuel L. Jackson, 3 Harvey Keitel, 4 Bruce Willis • “Confidence intervals” contain the true probability

  20. [R,Dalvi&S’07] A Better Method: Multisimulation • Goal: Separate the Top-K with few simulations • LK is more expensive than SQL, so reduce this cost • Ranking is all that is important • Intuition: • Concentrate LK on intervals in the Top-K • View intervals as “nested” or “shrinking”

  21. [R,Dalvi&S’07] Key Idea: Critical Region • The critical region is the interval (kth-highest min, (k+1)st-highest max) • Example for k = 2, with intervals drawn on a 0.0 to 1.0 axis

  22. [R,Dalvi&S’07] Key Idea: Critical Region • The critical region is the interval (kth-highest min, (k+1)st-highest max) • Example for k = 2, with intervals drawn on a 0.0 to 1.0 axis: the top 2 are now separated
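The definition of the critical region on the two slides above is easy to compute directly. This is a sketch under assumptions: the interval values are invented, the function names are hypothetical, and the separation test (the region is empty once its upper end is no larger than its lower end) is one reading of the slides rather than Mystiq's actual code.

```python
def critical_region(intervals, k):
    """Return (kth-highest lower bound, (k+1)st-highest upper bound).

    `intervals` is a list of (low, high) confidence intervals, one per
    candidate answer; there must be more than k of them.
    """
    if len(intervals) <= k:
        raise ValueError("need more than k intervals")
    lows = sorted((low for low, _ in intervals), reverse=True)
    highs = sorted((high for _, high in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    """Assumed test: the top k are separated when the region is empty."""
    low_end, high_end = critical_region(intervals, k)
    return high_end <= low_end

# Made-up confidence intervals for four candidate answers.
before = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.55), (0.0, 0.3)]
# After more simulations, the third interval has shrunk.
after = [(0.6, 0.9), (0.5, 0.8), (0.1, 0.45), (0.0, 0.3)]

print(critical_region(before, 2), top_k_separated(before, 2))
print(critical_region(after, 2), top_k_separated(after, 2))
```

In the `before` data the third interval still reaches into the region, so more simulations are needed; in `after` the region is empty and the top 2 are fixed without shrinking the other intervals any further.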

  23. DEMO See how Mystiq uses the critical region to reduce unnecessary simulations.

  24. Three Simple Rules: Rule 1 • Pick a “Double Crosser” • OPT must pick this too • Compare vs. OPT, which “knows” which intervals to simulate

  25. Three Simple Rules: Rule 2 • If all crossers are lower crossers, or all are upper crossers, pick a maximal one • OPT must pick this too • Compare vs. OPT, which “knows” which intervals to simulate

  26. Three Simple Rules: Rule 3 • Pick an upper and a lower crosser • OPT may only pick 1 of these two • Compare vs. OPT, which “knows” which intervals to simulate

  27. [R,Dalvi&S’07] Multisimulation Performance • Thm: Multisimulation performs at most twice as many simulations as OPT • And no deterministic algorithm can do better on every instance. • Practice: very slow without low-level optimization • Still slow with current techniques. • Open question! • Slow compared to SQL, not compared to inference

  28. Outline • Motivation/Type of Apps considered • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views

  29. [Fuhr&Roellke’97, Dalvi & S ‘04] Extensional Query Evaluation • Goal: Make relational ops compute probabilities • Operators: duplicate-removing projection (Π, “not all are false”), selection (σ), join (⋈) • Why? It’s SQL-scale and SQL-fast

  30. [Fuhr&Roellke’97, Dalvi & S ‘04] Extensional Plan to SQL • Query: SELECT DISTINCT loc FROM Reviewers • Plan: Π{loc} over Reviewers • Translation: SELECT loc, 1 - PRODUCT(1-p) AS p FROM Reviewers GROUP BY loc • Important point: Extensional evaluation is SQL, so it is SQL-fast • So pDBs are just SQL, but…
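The translated query on this slide can be tried out with SQLite from Python. This is a sketch with assumptions: SQLite has no built-in PRODUCT aggregate, so one is registered as a user-defined aggregate; the `name` column and all rows of `Reviewers` are invented; and tuples are assumed independent, which is what makes the "not all are false" formula valid.

```python
import sqlite3

class Product:
    """User-defined aggregate that multiplies its inputs."""

    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("PRODUCT", 1, Product)

# Hypothetical probabilistic table: p is the probability that the row exists.
conn.execute("CREATE TABLE Reviewers (name TEXT, loc TEXT, p REAL)")
conn.executemany(
    "INSERT INTO Reviewers VALUES (?, ?, ?)",
    [("Ann", "Seattle", 0.5), ("Bob", "Seattle", 0.5), ("Cat", "Boston", 0.2)],
)

# Extensional translation of SELECT DISTINCT loc FROM Reviewers:
# a location is in the answer unless all of its rows are absent.
rows = conn.execute(
    "SELECT loc, 1 - PRODUCT(1 - p) AS p FROM Reviewers "
    "GROUP BY loc ORDER BY loc"
).fetchall()
result = dict(rows)
print(result)
```

The database engine does all the work in a single GROUP BY, which is the sense in which extensional evaluation is "SQL-fast".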

  31. “Cities where someone reviewed ‘Iron Man’” • SELECT DISTINCT x.City FROM Personp x, Reviewedp y WHERE x.Name = y.Reviewer AND y.Movie = ‘Iron Man’ • Two plans order Π and ⋈ differently: one is wrong (the probabilities it combines are not independent!), the other is correct • The result depends on the plan!

  32. [Dalvi&S’04] Safe Plans • A plan that correctly computes probabilities (extensionally) is called a safe plan • Query Compilation = finding this condition, i.e., it is a syntactic condition • Intuition: A plan is safe if it only multiplies independent probabilities.
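The intuition that a safe plan only multiplies independent probabilities can be seen on a three-tuple example. The sketch below is illustrative only: the data, the extra review-id column, and the function names are assumptions. It evaluates the "Iron Man" query with an unsafe plan (join, then project), a safe plan (project Reviewed onto the reviewer, join, then project), and a brute-force pass over all possible worlds.

```python
from itertools import product

# Hypothetical tuple-independent tables (made-up numbers).
person = {("Alice", "Seattle"): 0.5}        # Person^p(Name, City)
reviewed = {("Alice", "r1"): 0.5,           # Reviewed^p(Reviewer, ReviewId),
            ("Alice", "r2"): 0.5}           # already filtered to 'Iron Man'

def independent_or(probs):
    """Probability that at least one of several independent events holds."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none

def unsafe_plan(city):
    # Join first, then project onto City. Both join results share Alice's
    # Person tuple, so treating them as independent is wrong.
    joined = [pp * rp
              for (name, c), pp in person.items() if c == city
              for (reviewer, _), rp in reviewed.items() if reviewer == name]
    return independent_or(joined)

def safe_plan(city):
    # Project Reviewed onto Reviewer first, then join, then project.
    per_reviewer = {}
    for (reviewer, _), rp in reviewed.items():
        per_reviewer.setdefault(reviewer, []).append(rp)
    joined = [pp * independent_or(per_reviewer[name])
              for (name, c), pp in person.items()
              if c == city and name in per_reviewer]
    return independent_or(joined)

def brute_force(city):
    # Ground truth: enumerate every possible world.
    tuples = ([("P", k, p) for k, p in person.items()]
              + [("R", k, p) for k, p in reviewed.items()])
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        weight, people, reviewers = 1.0, set(), set()
        for (kind, key, p), present in zip(tuples, bits):
            weight *= p if present else 1.0 - p
            if present and kind == "P" and key[1] == city:
                people.add(key[0])
            if present and kind == "R":
                reviewers.add(key[0])
        if people & reviewers:
            total += weight
    return total

print(unsafe_plan("Seattle"), safe_plan("Seattle"), brute_force("Seattle"))
```

On this data the unsafe plan overestimates the probability because it counts Alice's Person tuple twice, while the safe plan matches the brute-force value. The safe plan's final projection is valid here only because each name appears once in the assumed Person table.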

  33. DEMO See how safe plans allow query answering at SQL speed

  34. [Dalvi&S’04] Thm: The algorithm is complete • Theorem: The following are equivalent (no self-joins): • Q has PTIME data complexity • Q admits an extensional plan (and one finds it in PTIME) • Q does not have Qbad as a subquery • Qbad :- R(x), S(x,y), T(y) has #P-complete data complexity • Bottom line: If there is a plan, we find it. If we don’t find a plan, it’s provably hard • NB: We never looked at the data, so this is query compilation

  35. Outline • Motivation/Type of Apps considered • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views

  36. [R&S 07] Views in Block-based pDBs by example • Tables: W(Chef,Restaurant) WorksAt, S(Restaurant,Dish) Serves, R(Chef,Dish,Rate) Rated, with tuple probabilities p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish” • V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) • {c→`Tom’, r→ `D. Lounge’, d→`Crab’} • 0.72 = 0.9 * 0.8

  37. [R&S 07] Eager Materialization of BID Views (example coming…) • Idea: Throw away the lineage, process views • Why? • Lineage can be much larger than the view • Can do expensive prob. computations off-line • Use the view directly in the safe-plan optimizer • Interleave Monte-Carlo sampling with safe plans • pDB analog of Materialized Views • Allows GB-scale pDB processing • Catch: tuples in the view must be independent for any instance.

  38. [R&S 07] Eager Materialization of pDB Views • Tables: W(Chef,Restaurant) WorksAt, S(Restaurant,Dish) Serves, R(Chef,Dish,Rate) Rated, with tuple probabilities p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish” • V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) • Can we understand the view without lineage? • Not every probabilistic view is good for materialization!

  39. [R&S 07] Is a view a good candidate for materialization? • Thm: Deciding if a view is representable as a BID is decidable & NP-Hard (Complete for P2) • Good News: There is a simple but cautious PTIME test, i.e., a sufficient condition • In the wild, the simple test works, i.e., it is necessary as well • NB: Can take into account a query q, i.e., can we use V1 without the lineage to answer q?

  40. Uses for Views • Precomputation • Can make #P hard query, safe • No magic, precompute the hard part • Intermediate Results • Approximate Views • All views have small lineage [R&S 08]

  41. Conclusions • Discussed the Mystiq System • http://mystiq.cs.washington.edu • 3 strategies for processing: • Multisimulation • Safe Plans • Materialized Views • Allows interesting, large-scale applications

  42. [R&S 07] Eager Materialization of pDB Views • Tables: W(Chef,Restaurant) WorksAt, S(Restaurant,Dish) Serves, R(Chef,Dish,Rate) Rated, with tuple probabilities p1, q1, p2, q2 • “Chefs that serve a highly rated dish” • V2(c) :- W(c,r),S(r,d),R(c,d,’High’) • Can we understand the view without lineage? • Obs: if no prob. tuple is shared by two chefs, then they are independent • Where could such a tuple live? • V2 is a good choice for materialization

  43. [R&S 07] Views in BID pDBs • Tables: W(Chef,Restaurant) WorksAt, S(Restaurant,Dish) Serves, R(Chef,Dish,Rate) Rated, with tuple probabilities p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish” • V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) • View has correlations • Thm [R,Dalvi,S ’07]: BID are complete with of views

  44. DEMO
