1 / 33

Probabilistic/Uncertain Data Management -- III

Probabilistic/Uncertain Data Management -- III. Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004. Das Sarma et al. “Working models for uncertain data”, ICDE’2006. Slides based on the Suciu/Dalvi SIGMOD’05 tutorial. What is a Probabilistic Database ?.

sezja
Download Presentation

Probabilistic/Uncertain Data Management -- III

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Probabilistic/Uncertain Data Management -- III • Dalvi, Suciu. “Efficient query evaluation on probabilistic databases”, VLDB’2004. • Das Sarma et al. “Working models for uncertain data”, ICDE’2006. Slides based on the Suciu/Dalvi SIGMOD’05 tutorial

  2. What is a Probabilistic Database ? • “An item belongs to the database” is a probabilistic event • Tuple-existence uncertainty • Attribute-value uncertainty • “A tuple is an answer to the query” is a probabilistic event • Can be extended to all data models; we discuss only probabilistic relational data

  3. Definition A probabilistic database Ipis a probability distribution on INST Pr : INST ! [0,1] s.t. åi=1,N Pr(Ii) = 1 Possible Worlds Semantics The set of all possible database instances: INST = {I1, I2, I3, . . ., IN} Definition A possible world is I s.t. Pr(I) > 0

  4. Query Semantics Given a query Q and a probabilistic database Ip,what is the meaning of Q(Ip) ?

  5. Query Semantics Semantics 1: Possible Answers A probability distribution on sets of tuples 8 A. Pr(Q = A) = åI 2 INST. Q(I) = A Pr(I) Semantics 2: Possible Tuples A probability function on tuples 8 t. Pr(t 2 Q) = åI 2 INST. t2 Q(I) Pr(I)

  6. Possible Worlds Query Semantics Possible answers semantics • Precise • Can be used to compose queries • Difficult user interface Possible tuples semantics • Less precise, but simple; sufficient for most apps • Cannot be used to compose queries • Simple user interface

  7. Possible Worlds Semantics: Summary Completemodel; Clean formal semantics for SQL queries Not very useful as a representation or implementation tool • HUGE number of possible worlds! Need more effective representation formalisms • Something that users can understand/explore • Allow more efficient query execution • Avoid “possible worlds explosion” • Perhaps giving up completeness

  8. Representation Formalisms ProblemNeed a good representation formalism • Will be interpreted as possible worlds • Several formalisms exists, but no winner Main open problem in probabilistic db

  9. Evaluation of Formalisms Completeness? • What possible worlds can it represent? • What probability distributions on worlds? Closure? • Is it closed under evaluation of query operators?

  10. A Complete Formalism: Intensional Databases Atomic event ids e1, e2, e3, … Probabilities: p1, p2, p3, … 2 [0,1] Event expressions: Æ, Ç, : e3Æ (e5Ç: e2) Intensional probabilistic database J: each tuple t has an event attribute t.E

  11. e1e2e3= 000 001 010 011 100 101 110 111 (1-p1)(1-p2)(1-p3)+(1-p1)(1-p2)p3+(1-p1)p2(1-p3)+p1(1-p2)(1-p3 p1(1-p2) p3 (1-p1)p2 p3 p1p2(1-p3)+p1p2p3 Intensional DB ) Possible Worlds J = Ip ;

  12. E1 = e1E2 = :e1Æ e2 E3 = :e1Æ:e2Æ e3 E4 = :e1Æ:e2Æ:e3Æ e4 Pr(e1) = p1Pr(e2) = p2/(1-p1) Pr(e3) = p3/(1-p1-p2) Pr(e4) = p4 /(1-p1-p2-p3) “Prefix code” Possible Worlds ) Intensional DB p1 p2 =Ip J = p3 p4 Intensional DBs are complete

  13. Closure Under Operators P - s £ One still needs to compute probability of event expression

  14. Summary on Intensional Databases Event expression for each tuple • Possible worlds: any subset • Probability distribution: any Complete… but impractical • Evaluate the probability of long event expressions Important abstraction: consider restrictions Related to c-tables [Imilelinski&Lipski:1984]

  15. INST = P(TUP)N = 2M TUP = {t1, t2, …, tM} = all tuples pr : TUP ! [0,1] No restrictions A Restricted Formalism: Explicit Independent Tuples Tuple independent probabilistic database Pr(I) = Õt 2 I pr(t) £Õt Ï I (1-pr(t))

  16. I1 I2 I3 I4 I5 I6 I7 I8 (1-p1)(1-p2)(1-p3) p1(1-p2)(1-p3) (1-p1)p2(1-p3) (1-p1)(1-p2)p3 p1p2(1-p3) p1(1-p2)p3 (1-p1)p2p3 p1p2p3 Tuple Prob. ) Possible Worlds E[ size(Ip) ] = 2.3 tuples å = 1 J = Ip = ;

  17. Tuple-Independent DBs are Incomplete p1 Very limited – cannot capture correlations across tuples Not Closed • Query operators can introduce complex correlations! p1p2 =Ip ; 1-p1 - p1p2

  18. Tuple Prob. ) Query Evaluation SELECT DISTINCT x.city FROM Person x, Purchase y WHERE x.Name = y.Customer and y.Product = ‘Gadget’ p1( ) 1-(1-q2)(1-q3) 1- (1- )£(1 - ) p2( ) 1-(1-q5)(1-q6) p3 q7

  19. Application: Similarity Predicates Step 1:evaluate ~ predicates SELECT DISTINCT x.city FROM Person x, Purchase y WHERE x.Name = y.Cust and y.Product = ‘Gadget’ and x.profession ~ ‘scientist’ and y.category ~ ‘music’

  20. Application: Similarity Predicates Step 1:evaluate ~ predicates SELECT DISTINCT x.city FROM Personp x, Purchasep y WHERE x.Name = y.Cust and y.Product = ‘Gadget’ and x.profession ~ ‘scientist’ and y.category ~ ‘music’ Step 2:evaluate restof query

  21. Summary on Explicit Independent Tuples Independent tuples • Possible worlds: subsets • Probability distribution: restricted • Closure: no

  22. Query Evaluation on Probabilistic DBs • Focus on possible tuple semantics • Compute likelihood of individual answer tuples • Probability of Boolean expressions • Complexity of query evaluation

  23. Needed for query processing Probability of Boolean Expressions E = X1X3Ç X1X4Ç X2X5 Ç X2X6 Randomly make each variable true with the following probabilities Pr(X1) = p1, Pr(X2) = p2, . . . . . , Pr(X6) = p6 What is Pr(E) ??? Answer: re-group cleverly E = X1 (X3Ç X4 ) Ç X2 (X5 Ç X6) Pr(E)=1 - (1-p1(1-(1-p3)(1-p4))) (1-p2(1-(1-p5)(1-p6)))

  24. Now let’s try this: E = X1X2Ç X1X3Ç X2X3 No clever grouping seems possible. Brute force: Pr(E)=(1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3 Seems inefficient in general…

  25. [Valiant:1979] Complexity of Boolean Expression Probability Theorem [Valiant:1979]For a boolean expression E, computing Pr(E) is #P-complete NP = class of problems of the form “is there a witness ?” SAT #P = class of problems of the form “how many witnesses ?” #SAT The decision problem for 2CNF is in PTIMEThe counting problem for 2CNF is #P-complete

  26. Summary on Boolean Expression Probability • #P-complete • It’s hard even in simple cases: 2DNF • Can approximate through Monte Carlo (MC) simulation

  27. Query Complexity Data complexity of a query Q: • Compute Q(Ip), for probabilistic database Ip Simplest scenario only: • Possible tuples semantics for Q • Independent tuples for Ip

  28. [Fuhr&Roellke:1997,Dalvi&Suciu:2004] Extensional Query Evaluation Relational ops compute probabilities P s £ - Unlike intensional evaluation, data complexity: PTIME

  29. £ P P £ [Dalvi&Suciu:2004] SELECT DISTINCT x.City FROM Personp x, Purchasep y WHERE x.Name = y.Cust and y.Product = ‘Gadget’ Wrong ! Correct Depends on plan !!!

  30. [Dalvi&Suciu:2004] Query Complexity Sometimes @ correct (“safe”) extensional plan Data complexityis #P complete Qbad :- R(x), S(x,y), T(y) • Theorem The following are equivalent • Q has PTIME data complexity • Q admits an extensional plan (and one finds it in PTIME) • Q does not have Qbad as a subquery

  31. Computing a Safe SPJ Extensional Plan Problem is due to projection operations • An “unsafe” extensional projection combines tuples that are correlated assuming independence Projection over a join that projects away at least one of the join attrs  Unsafe projection! • Intuitive: Joins create correlated output tuples

  32. Computing a Safe SPJ Extensional Plan Algorithm for Safe Extensional SPJ Evaluation • Apply safe projections as late as possible in the plan • If no more safe projections exist, look for joins where all attributes are included in the output • Recurse on the LHS, RHS of the join Sound and complete safe SPJ evaluation algorithm • If a safe plan exists, the algo finds it!

  33. Summary on Query Complexity Extensional query evaluation: • Very popular • Guarantees polynomial complexity • However, result depends on query plan and correctness not always possible! General query complexity • #P complete (not surprising, given #SAT) • Already #P hard for very simple query (Qbad) Probabilistic databases have high query complexity

More Related