Loading in 5 sec....

Overcoming Limitations of Sampling for Aggregation QueriesPowerPoint Presentation

Overcoming Limitations of Sampling for Aggregation Queries

- 251 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'overcoming limitations of sampling for aggregation queries' - Patman

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Overcoming Limitations of Sampling for Aggregation Queries

Surajit Chaudhuri, Microsoft Research

Gautam Das, Microsoft Research

Mayur Datar, Stanford University

Rajiv Motwani, Stanford University

Vivek Narasayya, Microsoft Research

Presented by Daniel Kuang

Outline

Introduction

Limitations of uniform random sampling

Handling Data skew

Handling low selectivity & small groups

Implementation result

Conclusion

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Introduction

- Approximate aggregation query processing using sampling
- Uniform sampling performs poorly when
- Distribution of aggregated attribute is skewed
- low selectivity

- Solution proposed by the paper
- Outlier indexing
- Weighted sampling based on workload information
- A combined approach significantly reduces the approximation error.

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Limitations of uniform random sampling – an example

- Relation R with 10,000 tuples aggregate on column ‘C’
- 99% tuples have value 1 - 9900
- 1% tuples have value 1000 – 100, 000
- 1% URS of R – Sample size = 100
- sum result = sum of sample x scale factor(100)
- All of 100 tuples have value 1 in sample
estimated result = 100x100 = 10, 000

- 2 or more tuples have value 1,000 in sample
estimated result > 209, 800

- Reasonable result only when we get exactly one tuple of value 1000 per sample – probability 0.37
- With probability of 0.63, we would get a large error in the estimate.

Actual sum =109, 900

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Limitations of uniform random sampling – skewed data

- Contribution of the outliers to the error in estimating sum via uniform sampling
- Relation R size N having values {y1, y2, ……yN}. Let U be a uniform random sample of yi’s of size n. Then
- Actual sum
- Estimated sum
- Standard error
- Where S is the standard deviation of the values in the relation defined as
- If there are outliers in the data then S could be very large. For a given error bound we need to increase the sample size n.

yi

Ye = (N/n)

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Limitations of uniform random sampling – low selectivity

- Selection query partitions relation into 2 sub-relations –
- Tuples that satisfy the condition – relevant sub-relation
- Tuples that do not satisfy the condition

- Number of tuples sampled from relevant sub-relation is proportional to its size.
- If this relevant sample size is small due to low selectivity of the query, it may lead to larger error
- Success of uniform sampling depends on the size of the relevant sub-relation

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes

- Idea of outlier indexing – identify outlier tuples and store them separately
- Example: selection query (Q) with sum aggregate
- Partition to two sub-relations Ro and RNO.
- Apply the query to the sub-relation of outliers.
- Apply the query to the sub-relation of RNO.
- Combine the above two to get overall estimate result of query

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes – an example

- Aggregate Query Q - Select sum (sales) from lineitem
- Preprocessing
- 1. Determine outliers – lineitem_outlier view
- 2. Sample non-outliers – sample table lineitem_samp

- Query processing
- 1. Aggregate outliers – sum(sales) from lineitem_outlier view
- 2. Aggregate non-outliers – apply the query to sample T and extrapolate – sum(sales) for lineitem_samp * scale factor
- 3. Combine aggregates – approximate result from RNO + exact result from RO = approximate result for R

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes – selection of outliers

- Query error is solely due to error in estimating the aggregate from non-outliers
- Additional overhead of maintaining and accessing the outlier index
- An optimal sub-relation Ro that leads to the minimum possible sampling error

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes – selection of outliers

Theorem 2

- Consider a multiset R = {y1, y2,…yN} in sorted order. Ro C R such that
|Ro| ≤ τ

S(R\ Ro) = min R’C R,|R’|≤ τ{S(R\R’)}

- There exists some 0 ≤ τ’ ≤ τ such that Ro = {yi|1 ≤ i≤ τ’ }U{yi|(N+ τ’ +1- τ) ≤i ≤N}
- States that the subset that minimizes the standard deviation over the remaining set consists of the leftmost τ’ elements and the right most τ- τ’ elements from the multiset R when elements are arranged in sorted order.

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes – selection of outliers

- Alogirithm Outlier-Index(R,C, τ):
- Let the values in column C be sorted in relation R
- For i = 1 to τ+1, compute
- E(i) = S({yi, yi+1,…yN- τ+i-1})

- Let i’ be the value of i where E(i) is minimum.
- Outlier-index is the tuples that correspond to the set of values {yj|1 ≤ j≤ τ’ }U{yj|(N+ τ’ +1- τ) ≤j ≤N}
- where τ’ = i’ - 1

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling data skew – outlier indexes – storage allocation

Given sufficient space to store m tuples, how do we allocate storage between samples and outlier-index in order to minimize error?

S(τ) is the standard deviation in non-outliers for an optimal outlier-index of τ. If we allocate space m such that τ tuples in the outlier and m- τ in the sample. Then error will be proportional to S(τ)/√(m- τ).

Finding the value of τ for which S(τ)/√(m- τ) will be minimum.

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Use of outlier-indexing – extension to other aggregates allocation

- Avg aggregate: same as sum aggregate
- Count aggregate: outlier-indexing not beneficial since there is no variance among the data values
- Rank order dependent aggregates: such as min, max, and median – outlier not useful

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling low selectivity and small groups allocation

- Use weighted sampling with the help of workload information instead of uniform sampling
- More samples from the subsets that are small in size but are more important

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling low selectivity and small groups – Exploit workload info

- Workload collection – collect workload information consisting of representative queries from query profilers and query logs
- Trace query patterns – parsed information e.g. the set of selection conditions
- Trace tuple usage – usage of specific tuples like frequency of access to each tuple
- Weighted sampling – perform sampling by taking into account weights of the tuples

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Handling low selectivity and small groups – Weighted Sampling

- Let the weight of tuple ti be wi
- Normalized weight be wi’ = wi/ Σwi
- probability of acceptance of this tuple in the sample pi= n.wi’
- For each tuple included in the sample, store corresponding pi.
- Each aggregate computed over this tuple gets multiplied by the inverse of pi
- works well if
- Access pattern is local i.e. most queries access a small part of relation
- Workload is a good representative of actual queries

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Implementation & Experimental Result Sampling

- Parameters
- (1) skew of the data (z) -- 1,1.5, 2, 2.5, and 3
- (2) sampling fraction (f) -- from 1% to 100%
- (3)storage for the outlier-index - 1%, 5%, I0%, and 20%

- Comparisons
- (1) Uniform sampling
- (2) Weighted sampling
- (3) weighted sampling + outlier-indexing

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Experimental Result - SamplingVarying the data skew

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Experimental Result - SamplingVarying the sampling fraction

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Experimental Result - SamplingVarying the selectivity of queries

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Conclusion and Open Problems Sampling

- Skew can lead to large errors - outlier-indexing significantly reduce the error.
- Problem of low selectivity of queries – Weighted sampling based on workload information.

Ranjan Dash [email protected]

Daniel Kuang [email protected]

Download Presentation

Connecting to Server..