Robust Estimation With Sampling and Approximate Pre-Aggregation

Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle

Overview • Issues and Questions • Approximate Pre-Aggregation (APA) • Maximum Likelihood Estimation • Performing the Estimation • Experiments • Observations

Issues • Assumes numerical or discretized attributes • Many data warehouse (for DSS) attributes are categorical (ex. country = France) • Can not handle too many dimensions • Stratified sampling (used to handle variance in values over which aggregation is performed) • Assumes workload is known • Targeting particular queries means other queries suffer

Questions • Can we decrease the effect of random sampling variance on categorical (or mixed) data? • Can we still require only one pass through the data? • Can we increase the accuracy of the estimation for all aggregate queries? • Can we do this without prior knowledge of the query workload?

Approximate Pre-Aggregation • Uses a true random sampling that is not biased towards any query or workload. • Sample is combined with small set of statistics about the data that is gathered at same time that sampling is performed.

Approximate Pre-Aggregation (example) • Using a sample of 50% of the data (unshaded) to answer the query: select SUM(cmplaints) from sample where prof = ‘Smith’ we get 36, and since the sample constitutes 50% of the database tuples, we estimate that 72 students complained about Professor Smith. • The actual number is 121 – a relative error of 40.5%! • The issue is in the variance of students who complained each semester. • APA uses some additional information to decrease this error.

Approximate Pre-Aggregation (cont.) • Suppose we also know that the total number of complaints in the entire table is 148, and the total number of database tuples is 16. • Use the selection predicate (i.e. the “where” clause), and divide the data space into 2n quadrants. • In this example, n=1 (prof=‘Smith’), so there will be two quadrants: prof=‘Smith’ and prof<>’Smith’. • For each of the quadrants, estimate the probability density function (pdf). • Use additional information to create constraints on the distributions. • In this example, we know that the number of complaints is 148 and the number of tuples is 16, which gives us a mean of 148/16 = 9.25 • Then need to resolve any errors in the mean. • In this example, the mean of the “prof=‘Smith’” quadrant is 4.5 and the mean of the “prof<>’Smith’” quadrant is 1.25, for a total mean of 5.75 – significantly different from 9.25. • Use the Maximum Likelihood Estimation (MLE) to re-estimate the mean. • In this example, the mean of complaints for Professor Smith now becomes 8.52 (instead of 4.5), which gives us an estimated total of 136.3 (8.52 * 16)… much closer to the actual 121 than 72.

Maximum Likelihood Estimation • Let x be an observable outcome from a given experiment. • In our example, x might be the fact that our random sample predicts that the value of our relational aggregate SUM query is 72. • Let the pdf with respect to the observing outcome x be the following function, where the parameters are the hidden parameters that we wish to estimate (they describe the model we want to discover):

Maximum Likelihood Estimation (cont.) • In order to “fit” the hidden model parameters to the observation (or observations) we maximize the likelihood that our particular model produced the data (loglikelihood):

Maximum Likelihood Estimation (in APA) • Basic idea of APA is simply to find the best (or most likely) explanation for the sample which does not violate any of the known facts about the database. • Need to pose problem of approximate aggregation over categorical data as a MLE problem. • Three specific components of maximum likelihood that we need to describe in the context of APA: • The experimental outcomes x1,x2,…,xn • The model parameters • The pdf • Once we these three components have been defined, we have transformed the problem of estimation of aggregate functions over categorical data into a MLE problem, and we can begin the task of developing a method to solve the problem.

Maximum Likelihood Estimation (outcomes) • First, we describe how we obtain the “outcomes” x1,x2,…,xn to postulate APA as a MLE problem. In APA, those outcomes are a set of predictions made by our sample. • Suppose we have the following aggregate query: select SUM(salary) from Employee where sex=‘M’ and dept=‘accounting’ and job_type=‘supervisor’ • We can number each of the clauses in the relational selection predicate from 1 to m. In this case: • b1=(sex=‘male’), b2=(dept=‘accounting’) and b3=(job_type=‘supervisor’) • Also the negation of each of the clauses • Conceptually, the result is a data cube (which we will call 2m). Each of the boolean conditions corresponds to a single cell in the multidimensional data cube.

Maximum Likelihood Estimation (outcomes, cont.) • The outcomes x1,x2,…,x2m for the MLE in APA are the results of the aggregate function in question with respect to each of the relational selection predicates in 2m, estimated using our sample. • Examples: • We know that our sample-based estimate for SUM(salary) over b1 ^ b2 ^ b3 is $1.5M. b1 ^ b2 ^ b3 is the first entry in 2m. This, $1.5M is used as the value of x1. • We know that our sample-based estimate for SUM(salary) over b1 ^ ~b2 ^ ~b3 is $1.1M. This is the fourth entry in 2m, thus x4 is $1.1M. • In this way, the random sample is used to estimate the value for each cell in the cube, and these values become the outcomes x1,x2,…,x2m.

Maximum Likelihood Estimation (parameters) • Second, we need to describe the set of model parameters which we will attempt to estimate. • In APA, these parameters are defined to be the APA guess as to the real value of the aggregate function in question, with respect to each of the cells in the multidimensional data cube. • If xi is the value of cell i predicted by the sample, then the model parameter 0i is the APA maximum likelihood estimate for the correct value for the aggregate function applied to cell i. • Example: • Assume we know the fact: (SUM(salary) where job_type!=‘supervisor’) = $2.3M • The relational selection predicate in this fact is (b1^b2^~b3) v (b1^~b2^~b3) v (~b1^b2^~b3) v (~b1^~b2^~b3). This is a disjunction of the third, fourth, seventh and eight predicates present in 2m. • Thus, this fact is equivalent to the constraint that 03+04+07+08 = $2.3M.

Maximum Likelihood Estimation (pdf) • Finally, we need to define the probability density function f, which will give us the likelihood that we would see the experimental observations x1,x2,…,x2m, given model parameters 01,02,…,02m. • Example: Assume we attempt to estimate the value for an aggregate of the form: select AGG(expression) from TABLE where (predicate) Assume we have a sample size of n from a database of size db, and the estimated value of the query based on the sample is z… (Hellerstein and Haas) • To make a long story short, the pdf is defined as follows:

Performing the Estimation • Many ways to attempt to estimate the solution to a maximum likelihood problem. • Usually necessary to approximate or estimate because of the inherent intractability of discovering the most likely model in the general case. • Best known is the Expectation Maximization algorithm (EM). • Begins with an initial guess at the solution and then repeatedly refines the guess until it reaches a locally optimal solution • EM simply seeks to maximize the loglikelihood value, while APA needs to maximize within the constraints of the model parameters. • Instead, APA uses quadratic programming formulation.

Performing the Estimation (quadratic programming) • Quadratic programming is an extension of linear programming with the generalization that the objective function to maximize may contain products of two variables, and not simply scalars. • A key advantage of this is that many algorithms have been developed that efficiently solve problems posed as quadratic equations. • The ability of quadratic programming to incorporate constraints makes it ideal for APA. • The constraints for this quadratic programming formulation are simply the linear sums of the values 01,02,…,02B.

Experiments • Six different approximate options, over eight real, high-dimensional data sets, were performed: • Random sampling • Stratified sampling • APA0: store and use all “0-dimensional” facts as constraints in the quadratic programming. • APA1: same as APA0, as well as “1-dimensional” facts. • APA2: same as APA1, as well as “2-dimensional” facts. • APA3: same as APA2, as well as “3-dimensional” facts. • Wavelets • Results from AVG aggregation:

Observations • Wavelets are unsuitable in this domain. • Random sampling better for COUNT queries. • Additional accuracy of APA2 and APA3 probably not worth the overhead; also are impractical for numerical data as that would require joint distributions of numerical and categorical attributes (difficult). • APA0 and APA1 can easily be extended to handle numerical attributes. • Should be possible to easily extend APA to work across foreign key joins, using a technique for sampling from the results of joins (ex. join synopses). • Largely sidestepped issues associated with computational efficiency.

Robust Estimation With Sampling and Approximate Pre-Aggregation