1 / 29

Selectivity Estimation using Probabilistic Models

Selectivity Estimation using Probabilistic Models. Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng. Outline. Introduction Estimation for single tables SRM: Statistical Relational Models Selectivity Estimation using SRMs Learning SRM. Introduction.

gryta
Download Presentation

Selectivity Estimation using Probabilistic Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Selectivity Estimation using Probabilistic Models Author: Lise Getoor, Ben Taskar, Daphne Koller Presenter: Qidan Cheng

  2. Outline • Introduction • Estimation for single tables • SRM: Statistical Relational Models • Selectivity Estimation using SRMs • Learning SRM

  3. Introduction • Accurate estimates of the result size of queries are crucial to several query processing components of DBMS. • Cost-based query optimizers: choose the optimal query execution plan. • Query profilers: predicting resource consumption and distribution of query results. • Answer counting queries

  4. Introduction • How to estimate the size of a selection query over multiple attributes for a single table? The result size is determined by the joint frequency distribution of the values of these attributes. Query: select * from R where R.A1 = a1 and … and R.Ak = ak Size(Q) = |R|  PD(a1,…,ak)

  5. Introduction • But…exponential in # of attributes vn, representing all combination of attribute values is infeasible. • Attribute value independence assumption: joint distribution is product of single attribute distributions • Problem: overestimate or underestimate the query size. → Bayesian Network

  6. Estimation for single table • Example: Given a simple relation R with three attributes: Education (high-school, college,degree) 3 values Income (low, medium, high) 3 values Home-Owner (false, true) 2 values Joint distribution need 18 numbers. • Observation: Some of the correlations between attributes might be indirect ones, mediated by other attributes. → Conditional Independence

  7. Estimation for single table • P(H=h|E=e,I=i)=P(H=h|I=i) Home-owner is conditionally independent of Education given Income. Education Income Home-owner → Compact form of the joint distribution P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)

  8. Estimation for single table Figure (b) can encode precisely the same joint distribution as in Figure (a)

  9. Bayesian Networks Pneumonia Tuberculosis Lung Infiltrates X-Ray Sputum Smear Nodes = random variables Edges = direct probabilistic influence Network structure encodes independence assumptions: X-Ray conditionally independent of Pneumonia given Infiltrates

  10. Bayesian Networks P(I |P, T ) P T Pneumonia Tuberculosis p t 0.8 0.2 p t 0.6 0.4 Lung Infiltrates p t 0.2 0.8 t p 0.01 0.99 X-Ray Sputum Smear • Associated with each node Xi there is a conditional probability distribution P(Xi|Pai:) —distribution over Xi for each assignment to parents

  11. BN Semantics P T I X S Compact & natural representation: nodes have  k parents  2k n vs. 2n params conditional independencies in BN structure local probability models full joint distribution over domain + =

  12. BNs for Query Estimation • Query: select * from R where R.A1 = a1 and … and R.Ak = ak • P(a1,a2,…an)= P(ai|parents(ai)) Use Bayesian inference algorithm to compute PD(a1,…,ak) • Algorithm complexity depends on BN connectivity; efficient in practice Size(Q) = |R|  PD(a1,…,ak)

  13. Join Selectivity Estimation Size(Purchase Person) = | Purchase | Naïve Approach Uniform Join Assumption Purchase Person Assuming referential integrity

  14. Join Selectivity Estimation Example query Q: “person.income=high and purchase.type=luxury” p = P (person.income=high) q = P (purchase.type=luxury) SizeQ = |Purchase|*p*q Problems: Joining Two Tables

  15. Correlated Attributes Type = luxury Income = high Income = low Type = necessity Purchase Person The attributes of the two different tables are often correlated

  16. Skewed Join Type = luxury Income = high Income = low Type = necessity Purchase Person The probability that two tuples join with each other can also be correlated with various attributes.

  17. Join Indicator R S Query: select * from R, S where R.F = S.K and R.A = a and S.B = b P(JF) = prob. randomly chosen tuple from R joins with a randomly chosen tuple from S size(Q) = | R | | S | P(JF, a, b)

  18. Statistical Relational Models • Model distribution of attributes across multiple tables • Allow attribute values to depend on attributes in the same table (like a BN) • Allow attribute values to depend on attributes in other tables along a foreign key join • Can model the join probability of two tuples using join indicator variable

  19. Statistical Relational Model • A SRM for a relational database is a pair(S,θ),which specifies a local probabilistic model for each of the following variables • A variable R.A for each table R and each attribute A R.* • A boolean join indicator variable R.JF • For each variable R.X • S specifies a set of parents Pa(R.X) • Θ specifies a CPD P(R.X|Pa(R.X))

  20. Example SRM School Person Prestige Age Purchase Bought-by Attended Type Income Jschool Jperson Type=necessity false true Income = high 0.999, 0.001 false true 0.9998, 0.0002 0.99, 0.01

  21. Universal Foreign Key Closure t s s1 s2 • Schema: R, S, T.R.F refers to S, S.F refers to Tstratification: T < S < R s.F2 = t.K r.F1 = s.K r • Schema: R, SR.F1 refers to S, R.F2 refers to Sstratification: S < R r.F1 = s1.K r.F2 = s2.K r

  22. Universal Foreign Key Closure • Minimal extension Q+ to a query Q: Let Q be a keyjoin query over r1,r2,…rk For each r, if there is an attribute R.A with parent R.F.B where R.F points to S, then there is a unique tuple variable s representing the join r.F=s.K • Proposition: Let Q be a query and let Q+ be its minimal extension. Then sizeQ[D]=sizeQ+[D]

  23. Answering Queries Using SRMs Prestige Age Person Purchase Type Income Jschool Jperson • Construct Query Evaluation BN for Query: select * from Person, Purchase where Person.id = Purchase.buyer-id and Person.Income = high and Purchase.Type=luxury School Compute upward closure of query attributes by including all parents as well

  24. SRM Learning Database Strain Patient Contact • Learn parameters & qualitative dependency structure • Extend known techniques for learning Bayesian networks from data.

  25. Structure selection • Define scoring function: log-likelihood function l( θ,S | D)=log P(D | S, θ) Finding the model that has maximum log-likelihood given data. • Do the greedy local structure search

  26. Parameter Estimation • The model contains a parameter Θa|x for each value a of A and each assignment of values x to X. Θa|x = P(R.A=a |X=x) Θa|x = FD(R.A=a,X=x)/FD(X=x)

  27. System Architecture Query Q Selectivity Estimator Size(Q) execution time Database Model Constructor offline

  28. Conclusions • SRM is unique in its ability to handle select and join operators • Estimates the high-dimensional joint distribution using a set of lower-dimensional conditional distributions To do: • Incremental maintenance of the SRM as the database changes • Joins over non-key attributes

  29. Selected Publications • “Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D. Koller and B. Taskar, JMLR 2002. • “Probabilistic Models of Text and Link Structure for Hypertext Classification”, L. Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond Classification’, 2001. • “Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D. Koller, SIGMOD-01. • “Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N. Lavrac, 2001. • see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99. • “Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman, D. Koller, and B. Taskar, ICML-01. • “From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D. Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational Learning: Crossing the Boundaries, 2000. • Notes from AAAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2000. • Notes from IJCAI Workshop on Learning Statistical Models from Relational Data, eds. L.Getoor and D. Jensen, 2003. See http://www.cs.umd.edu/~getoor

More Related