
General Database Statistics Using Maximum Entropy

General Database Statistics Using Maximum Entropy. Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin--Madison), and Dan Suciu (University of Washington).


Presentation Transcript


  1. General Database Statistics Using Maximum Entropy
     Raghav Kaushik (Microsoft Research), Christopher Ré (University of Wisconsin--Madison), and Dan Suciu (University of Washington)

  2. Study Cardinality Estimation
     1. Model: the information that the optimizer knows
     2. Prediction: use the model to estimate the cardinality of future queries
     Propose a declarative language with statistical assertions, e.g. “We estimate that the number of distinct Employees is 10”.
     Contribution: a principled, declarative approach to cardinality estimation based on entropy maximization.

  3. Motivating Applications
     1. Incorporate query feedback records. Underutilized: there is no general-purpose mechanism.
     2. Optimizers for new domains (DB Kit 2.0)
        • Cloud computing, information extraction
     3. Data generation and description

  4. Outline
     • Statistical programs and desiderata
     • Semantics of statistical programs
     • Two examples
     • Conclusions

  5. Statistical Assertions
     An assertion is a conjunctive query (CQ) view plus a sharp (#) statement: V(x) :- R(x,y), …  #V = 10^6
     • V1(x) :- R(x,-)  #V1 = 20: “The number of values in the output of V1 is 20”
     • V2(y) :- R(-,y), S(y)  #V2 = 50: “The number of values in the output of V2 is 50”
     A program is a set of assertions.

  6. Model as a Probabilistic Database
     V(x) :- R(x,y), …  #V = 10^6
     Intuitively, # is the “expected value”.
     V1(x) :- R(x,-)  #V1 = 20: “The number of values in the output of V1 is 20”
     A model is a probabilistic database such that the expected number of tuples in V1 is 20.
     OK, but which probabilistic database?

  7. Desiderata for our solution
     V(x) :- R(x,y), …  #V = 10^6
     Two desiderata for the distribution:
     • (D1): Should agree with the provided statistics
     • (D2): Should assume nothing else
     Approach: maximize entropy subject to (D1).
     Challenge: compute the parameters of the MaxEnt distribution.
     Technical desideratum: we want the parameters analytically.

  8. Outline
     • Statistical programs and desiderata
     • Semantics of statistical programs
     • Two examples
     • Conclusions

  10. Notation for Probabilistic Databases
      Consider a domain D of size n. Fix a schema R = R1, R2, …
      Let Inst(n) be the set of all instances over R on D. An element I of Inst(n) is called a world.
      A probabilistic database is a pair (Inst(n), p), where p is a probability distribution over Inst(n). Essentially, it is any discrete probability distribution on relations.

  11. Achieving (D1): Stats must agree. The semantics of #
      # means “expected value”.
      V1(x) :- R(x,-)  #V1 = 20: “The number of values in the output of V1 is 20”
      NB: In truth, we let n tend to infinity and settle for asymptotic equality.
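To make the "expected value" reading of # concrete, here is a minimal sketch that enumerates the worlds of a tiny probabilistic database and computes the expected size of V1(x) :- R(x,-). The two-element domain, the uniform distribution, and the helper names `v1` and `expected_size` are assumptions chosen for illustration; they are not taken from the talk.

```python
from itertools import product

def v1(world):
    """V1(x) :- R(x,-): the distinct first components of the binary relation R."""
    return {x for (x, _) in world}

def expected_size(pdb, view):
    """E_p[|view(I)|] for a pdb given as a list of (world, probability) pairs."""
    return sum(p * len(view(world)) for world, p in pdb)

# Hypothetical toy pdb over the domain {0, 1}: every subset of the four
# possible R-tuples is a world, and all 16 worlds are equally likely.
tuples = list(product(range(2), repeat=2))
worlds = [frozenset(t for t, keep in zip(tuples, bits) if keep)
          for bits in product([0, 1], repeat=len(tuples))]
uniform_pdb = [(w, 1 / len(worlds)) for w in worlds]

# Each x appears in V1 unless both of its tuples are absent (probability 1/4),
# so the expected size is 2 * 3/4 = 1.5.
e_v1 = expected_size(uniform_pdb, v1)
print(e_v1)
```

Under this reading, the uniform toy database would satisfy the assertion #V1 = 1.5, and a different distribution p would be needed to satisfy any other value.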

  12. Achieving (D1): Stats must agree. Multiple Views
      Given V1, V2, … with #Vi = d_i for i = 1, …, t.
      If p satisfies the equations E_p[|Vi|] = d_i for every i, we have achieved (D1): p agrees with the provided statistics.
      Many such distributions exist. How do we pick one?

  14. Achieving (D2): No ad-hoc assumptions. Selecting the best one
      Maximize entropy subject to the constraints.
      One can show that p has the following form:
      Z is a normalizing constant and a_i is a positive parameter for i = 1, …, t.
      NB: p is a function of the stats only, and so we have achieved (D2).
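The formula for p did not survive in this transcript. As a hedged reconstruction, the standard solution of an entropy maximization problem with expected-value constraints has the exponential-family shape below. The notation |V_i(I)| (the number of tuples that view V_i returns on world I) is an assumption; only Z and the a_i are named on the slide.

```latex
% Assumed notation: |V_i(I)| = number of tuples returned by view V_i on world I.
p(I) \;=\; \frac{1}{Z} \prod_{i=1}^{t} a_i^{\,|V_i(I)|},
\qquad
Z \;=\; \sum_{J \in \mathrm{Inst}(n)} \; \prod_{i=1}^{t} a_i^{\,|V_i(J)|}

% The parameters a_i are then fixed by the constraints (D1):
\mathbf{E}_p\bigl[\,|V_i|\,\bigr]
  \;=\; \frac{a_i}{Z}\,\frac{\partial Z}{\partial a_i}
  \;=\; d_i ,
\qquad i = 1, \dots, t
```

Read this way, finding the a_i analytically amounts to finding a closed form for Z and then solving these t equations.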

  15. Benefits of MaxEnt
      Every (consistent) statistical program induces a well-defined distribution.
      • Every query has a well-defined cardinality estimate
      • Statistics are treated as a whole, not as individual stats
      • Can add new statistics to our heart’s content
      Technical challenge: compute the a_i analytically.

  16. Outline
      • Statistical programs and desiderata
      • Semantics of statistical programs
      • Two examples
      • Conclusions

  17. Two quick Examples
      • I: A material random graph
        • Even simple EM solutions have interesting theory
      • II: Intersection models
        • Generating function, and
        • Different, analytic technique

  20. Example I: Random Graphs are EM
      Random graph: add edges independently at random.
      V(x,y) :- R(x,y)  #V = d
      By linearity, E[V] = x n^2 = d, where x is the probability of each edge.
      This is MaxEnt… write:
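The claim that the random graph is the MaxEnt solution can be checked by brute force on a small domain. The sketch below assumes the single-assertion MaxEnt distribution has the form p(I) proportional to a^|I| over the n^2 possible edges, with a = x / (1 - x) and x = d / n^2; the function name `maxent_edge_distribution` and the values n = 2, d = 1 are illustrative choices, not from the talk.

```python
from itertools import product
from math import isclose

def maxent_edge_distribution(n, d):
    """Distribution p(I) = a^|I| / Z over all edge sets on n nodes, tuned so E[|I|] = d.

    Assumes 0 < d < n^2. Each world is a 0/1 tuple with one bit per possible edge.
    """
    m = n * n                  # number of possible edges R(x, y)
    x = d / m                  # per-edge probability, from x * n^2 = d
    a = x / (1 - x)            # MaxEnt parameter, so that a / (1 + a) = x
    z = (1 + a) ** m           # normalizing constant
    dist = {bits: a ** sum(bits) / z for bits in product([0, 1], repeat=m)}
    return x, a, dist

x, a, dist = maxent_edge_distribution(n=2, d=1.0)

# (D1): the expected number of edges matches the asserted statistic d.
expected_edges = sum(p * sum(bits) for bits, p in dist.items())

# The same distribution is the product of independent coin flips with bias x.
independent = all(
    isclose(p, x ** sum(bits) * (1 - x) ** (len(bits) - sum(bits)))
    for bits, p in dist.items()
)
print(expected_edges, independent)
```

The exhaustive enumeration is only feasible for tiny n; it is meant as a sanity check of the closed form, not as an estimation procedure.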

  21. Example II: an intersection model
      V(x) :- R1(x), R2(x)  #R1 = d1, #R2 = d2, #V = d3
      Read: each element is either in R1, in R2, or in all three (R1, R2, and V).
      E.g., a term with x1^k is an instance with k distinct values in R1.
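The generating function itself is missing from this transcript. A plausible reading of "each element is either in R1, R2, or all three" is that every domain element contributes a factor 1 + x1 + x2 + x1·x2·x3, so the normalizing constant is that factor raised to the n-th power. The sketch below treats this as an assumption and checks it against a brute-force sum over all worlds; the parameter names `a1`, `a2`, `a3` and the numeric values are hypothetical.

```python
from itertools import product
from math import isclose

# Per-element membership states as (in R1, in R2); the element is in V iff both.
STATES = [(0, 0), (1, 0), (0, 1), (1, 1)]

def brute_force_z(n, a1, a2, a3):
    """Sum of a1^|R1| * a2^|R2| * a3^|V| over all 4^n worlds on n elements."""
    total = 0.0
    for world in product(STATES, repeat=n):
        r1 = sum(in1 for in1, _ in world)
        r2 = sum(in2 for _, in2 in world)
        v = sum(in1 * in2 for in1, in2 in world)
        total += a1 ** r1 * a2 ** r2 * a3 ** v
    return total

def closed_form_z(n, a1, a2, a3):
    """Assumed generating function: one factor per domain element."""
    return (1 + a1 + a2 + a1 * a2 * a3) ** n

def expected_sizes(n, a1, a2, a3):
    """E[|R1|], E[|R2|], E[|V|], obtained as a_i * d(log Z)/d(a_i)."""
    per_element = 1 + a1 + a2 + a1 * a2 * a3
    both = a1 * a2 * a3
    return (n * (a1 + both) / per_element,
            n * (a2 + both) / per_element,
            n * both / per_element)

z_brute = brute_force_z(3, 0.5, 2.0, 3.0)
z_closed = closed_form_z(3, 0.5, 2.0, 3.0)
d1, d2, d3 = expected_sizes(3, 0.5, 2.0, 3.0)
print(isclose(z_brute, z_closed), d1, d2, d3)
```

Given target statistics d1, d2, d3, one would solve the three expressions in `expected_sizes` for a1, a2, a3; the sketch only evaluates them in the forward direction.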

  27. Results in the paper
      • Normal form for statistical programs
      • Syntactic classes that we can solve analytically
        • “Project-Semijoin” queries (previous slide)
      • A general technique, conditioning:
        • Start with a tuple-independent prior, and condition
        • Introduces inclusion constraints
      • Extensions to handle histograms

  28. Conclusion
      • Showed a principled, general model for database statistics based on MaxEnt
      • Analytically solved syntactic classes of statistics
      • Applications: query feedback and the cloud
