
New Sampling-Based Summary Statistics for Improving Approximate Query Answers

Yinghui Wang, 02-07-2006. Outline: Introduction, Concise samples, Counting samples, Application to hot list queries, Conclusion, Reference.


Presentation Transcript


  1. New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang 02-07-2006

  2. Outline • Introduction • Concise samples • Counting samples • Application to hot list queries • Conclusion • Reference

  3. Introduction • In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries whenever possible. • The effectiveness of a synopsis is evaluated as a function of its footprint, i.e., the number of memory words needed to store the synopsis. • [Figure 1: A traditional data warehouse. Labels: New Data, Data Warehouse, Queries, Response.] • [Figure 2: Data warehouse set-up for providing approximate query answers. Labels: New Data, Approx. Answer Engine, Data Warehouse, Queries, Response.]

  4. Definition • Concise samples: a uniform random sample of the data set in which every value appearing more than once in the sample is represented as a value and a count, i.e., a <value, count> pair. • Counting samples: a variation on concise samples in which the counts keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. • Hot list queries: request an ordered set of <value, count> pairs for the k most frequently occurring data values, for some k, e.g., the top selling items in a database of sales transactions.

  5. Outline • Introduction • Concise samples • Counting samples • Application to hot list queries • Conclusion • Reference

  6. Concise samples • Consider a relation R with n tuples and an attribute A. The goal is to obtain a uniform random sample of R.A, i.e., the values of A for a random subset of the tuples in R. • Definition: Let $S = \{\langle v_1, c_1\rangle, \ldots, \langle v_j, c_j\rangle, v_{j+1}, \ldots, v_l\}$ be a concise sample. Then $\text{sample-size}(S) = l - j + \sum_{i=1}^{j} c_i$ and $\text{footprint}(S) = l + j$. • Lemma 1: For any footprint m ≥ 2, there exist data sets for which the sample-size of a concise sample is n/m times larger than its footprint, where n is the size of the data set.
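The two quantities in the definition can be computed directly. The sketch below is only an illustration: it assumes a concise sample is stored as a Python dict from value to count, where a count of 1 stands for a singleton and a count above 1 stands for a <value, count> pair. The function names are hypothetical.

```python
from typing import Dict, Hashable


def sample_size(concise: Dict[Hashable, int]) -> int:
    """Number of sample points represented: l - j + sum of the pair counts."""
    return sum(concise.values())


def footprint(concise: Dict[Hashable, int]) -> int:
    """Memory words used: 2 per <value, count> pair, 1 per singleton (l + j)."""
    return sum(2 if count > 1 else 1 for count in concise.values())


# Example: two pairs (j = 2) and two singletons, so l = 4.
S = {"a": 3, "b": 2, "c": 1, "d": 1}
print(sample_size(S), footprint(S))
```

For a data set in which all n tuples carry the same value, the concise sample {v: n} has footprint 2 and sample-size n, which is the kind of case Lemma 1 refers to.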

  7. Concise samples – offline/static • Offline/static computation • Repeat m times: select a random tuple from the relation and extract its value for attribute A. • Semi-sort the set of values, and replace every value occurring multiple times with a <value, count> pair. • Continue to sample until either adding the next sample point would increase the concise sample footprint to m+1, or n samples have been taken. • For each new value sampled, look it up in the concise sample and then either add a new singleton value, convert a singleton to a <value, count> pair, or increment the count of an existing pair.
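A minimal sketch of the offline procedure, under stated assumptions: tuples are drawn uniformly with replacement, the relation's attribute values are held in an in-memory list, and the sample is a dict from value to count (count 1 meaning a singleton). It folds the initial m draws and the continued sampling into a single loop, which yields the same stopping rule. The function name and parameters are hypothetical.

```python
import random
from typing import Dict, Hashable, Optional, Sequence


def offline_concise_sample(
    values: Sequence[Hashable], m: int, rng: Optional[random.Random] = None
) -> Dict[Hashable, int]:
    """Sample until the footprint would reach m + 1 or n samples were taken."""
    rng = rng or random.Random()
    sample: Dict[Hashable, int] = {}
    used = 0  # current footprint in memory words
    for _ in range(len(values)):  # at most n samples
        v = rng.choice(values)
        count = sample.get(v, 0)
        # A new singleton or a singleton-to-pair conversion costs one word;
        # incrementing an existing pair costs nothing.
        cost = 1 if count <= 1 else 0
        if used + cost > m:
            break  # adding this point would push the footprint to m + 1
        sample[v] = count + 1
        used += cost
    return sample


skewed = ["x"] * 1000
print(offline_concise_sample(skewed, m=2, rng=random.Random(0)))
```

On the fully skewed input above the footprint never exceeds 2, so all n = 1000 draws fit in the sample.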

  8. Concise samples – online • With concise samples, the sample-size depends on the data distribution to date, and any changes in the data distribution must be reflected in the sampling frequency. • Maintenance algorithm: Let S be the current concise sample and consider a new tuple t. Maintain an entry threshold β (initially 1) for new tuples to be selected for the sample. • Add t.A to S with probability 1/β. • Do a look-up on t.A in S: a) if it is represented by a pair, its count is incremented; b) if t.A is a singleton in S, a pair is created; c) if it is not in S, a singleton is created. • The footprint increases by 1 in cases b) and c). • When the footprint exceeds its bound, raise the threshold to some β’ and subject each sample point in S to this higher threshold. Subsequent inserts are selected for the sample with probability 1/β’.
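The maintenance steps can be sketched as a small class. Several details here are assumptions rather than something the slide states: the sample is a dict from value to count, the threshold is raised by a fixed factor of 1.1 each time, and when the threshold goes from β to β’ each sample point is kept independently with probability β/β’ (so that every retained point has effectively been selected with probability 1/β’). The footprint is recomputed on every insert for clarity, not speed. All names are hypothetical.

```python
import random
from typing import Dict, Hashable, Optional


class ConciseSample:
    def __init__(self, max_footprint: int, rng: Optional[random.Random] = None):
        self.max_footprint = max_footprint
        self.beta = 1.0  # entry threshold, initially 1
        self.counts: Dict[Hashable, int] = {}  # count 1 means singleton
        self.rng = rng or random.Random()

    def footprint(self) -> int:
        return sum(2 if c > 1 else 1 for c in self.counts.values())

    def sample_size(self) -> int:
        return sum(self.counts.values())

    def insert(self, value: Hashable) -> None:
        # Select the new tuple's value with probability 1/beta.
        if self.rng.random() >= 1.0 / self.beta:
            return
        # Cases a), b), c) all amount to incrementing a count in this layout.
        self.counts[value] = self.counts.get(value, 0) + 1
        while self.footprint() > self.max_footprint:
            self._raise_threshold(self.beta * 1.1)  # factor is an assumption

    def _raise_threshold(self, new_beta: float) -> None:
        keep = self.beta / new_beta  # assumed retention probability
        for v in list(self.counts):
            kept = sum(1 for _ in range(self.counts[v]) if self.rng.random() < keep)
            if kept:
                self.counts[v] = kept
            else:
                del self.counts[v]
        self.beta = new_beta


cs = ConciseSample(max_footprint=10, rng=random.Random(0))
for i in range(10_000):
    cs.insert(i % 500)
print(cs.footprint(), cs.sample_size(), round(cs.beta, 2))
```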

  9. Concise samples (cont.) • Theorem 2: Consider the family of exponential distributions: for $i = 1, 2, \ldots$, $\Pr(v = i) = \alpha^{-i}(\alpha - 1)$, for $\alpha > 1$. For any footprint $m \ge 2$, the expected sample-size of a concise sample with footprint $m$ is at least $\alpha^{m/2}$. • Theorem 3: For any data set, when using a concise sample S with sample-size m, the expected gain is E[m − number of distinct values in S] = [formula not preserved in this transcript].
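The expected gain in Theorem 3 can be explored numerically. The sketch below is a derivation under an explicit assumption, not necessarily the form shown on the slide: if the m sample points are drawn independently and uniformly with replacement, and value j has relative frequency p_j, then by linearity of expectation the expected number of distinct values is the sum over j of 1 − (1 − p_j)^m. Subtracting this from m and expanding (1 − p_j)^m with the binomial theorem gives an equivalent alternating sum. Function names are hypothetical.

```python
from math import comb
from typing import Sequence


def expected_gain_direct(freqs: Sequence[int], m: int) -> float:
    """E[m - #distinct] = sum_j (m*p_j - 1 + (1 - p_j)^m), with p_j = f_j / n."""
    n = sum(freqs)
    return sum(m * f / n - 1 + (1 - f / n) ** m for f in freqs)


def expected_gain_binomial(freqs: Sequence[int], m: int) -> float:
    """Same quantity as sum_{k=2..m} (-1)^k * C(m, k) * sum_j p_j^k."""
    n = sum(freqs)
    return sum(
        (-1) ** k * comb(m, k) * sum((f / n) ** k for f in freqs)
        for k in range(2, m + 1)
    )


freqs = [5, 3, 2]  # value frequencies in a toy relation with n = 10
print(expected_gain_direct(freqs, 10), expected_gain_binomial(freqs, 10))
```

The gain is zero when m = 1 and grows with skew: a relation holding a single value gives a gain of m − 1.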

  10. Concise samples (cont.) • Update time overheads • The coin flips that must be performed to decide which inserts are added to the concise sample and to evict values from the concise sample when the threshold is raised • The lookups into the current concise sample to see if a value is already present in the sample

  11. Concise samples – experimental evaluation • D: potential number of distinct values; m: footprint size. • Figure 3: Comparing sample-sizes of concise and traditional samples as a function of skew, for varying footprints and D/m ratios. In (a) and (b), the authors compare footprint 100 and footprint 1000, respectively, for the same data sets. In (c) and (d), the authors compare D/m = 50 and D/m = 5, respectively, for the same footprint 1000.

  12. Outline • Introduction • Concise samples • Counting samples • Application to hot list queries • Conclusion • Reference

  13. Counting samples • Counting samples: a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. • Definition: A counting sample for R.A with threshold β is any subset of R.A obtained as follows: 1. For each value v occurring c times in R, we flip a coin with probability 1/β of heads until the first heads, up to at most c coin tosses in all; if the i-th coin toss is heads, then v occurs c−i+1 times in the subset, else v is not in the subset. 2. Each value v occurring c > 1 times in the subset is represented as a pair <v, c>, and each value v occurring exactly once is represented as a singleton v.
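The definition translates almost line by line into code. This is a minimal sketch that assumes the relation is summarized as a dict from value to its number of occurrences c, and that the resulting sample is also a dict from value to count (count 1 meaning a singleton). The function name is hypothetical.

```python
import random
from typing import Dict, Hashable, Optional


def counting_sample(
    freqs: Dict[Hashable, int], beta: float, rng: Optional[random.Random] = None
) -> Dict[Hashable, int]:
    """Build a counting sample with threshold beta from value frequencies."""
    rng = rng or random.Random()
    sample: Dict[Hashable, int] = {}
    for v, c in freqs.items():
        # Flip a coin with heads probability 1/beta, at most c times.
        for i in range(1, c + 1):
            if rng.random() < 1.0 / beta:
                sample[v] = c - i + 1  # first heads on toss i
                break
        # No heads in c tosses: v is not in the sample.
    return sample


relation = {"a": 500, "b": 40, "c": 3}
print(counting_sample(relation, beta=20.0, rng=random.Random(0)))
```

With β = 1 every first toss is heads, so the sample is an exact histogram of the relation.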

  14. Counting samples (cont.) • An algorithm for incremental maintenance is introduced. • Theorem 4: Let R be an arbitrary relation, and let β be the current threshold for a counting sample S. (i) Any value v that occurs at least β times in R is expected to be in S. (ii) Any value v that occurs $f_v$ times in R will be in S with probability $1 - (1 - 1/\beta)^{f_v}$. (iii) For all $\alpha > 1$, if $f_v \ge \alpha\beta$, then with probability $\ge 1 - e^{-\alpha}$, the value will be in S and its count will be at least $f_v - \alpha\beta$.
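Part (ii) of Theorem 4 is easy to evaluate, and it also explains the bound in part (iii): the chance that none of the first αβ coin flips is heads is (1 − 1/β)^(αβ), which is at most e^(−α). The short sketch below illustrates this for sample numbers chosen here (β = 100, α = 2); the function name is hypothetical.

```python
import math


def inclusion_probability(f_v: int, beta: float) -> float:
    """Probability that a value occurring f_v times is in the counting sample."""
    return 1.0 - (1.0 - 1.0 / beta) ** f_v


beta, alpha = 100.0, 2.0
f_v = int(alpha * beta)
print(inclusion_probability(f_v, beta), 1.0 - math.exp(-alpha))
```

The first printed value should be at least as large as the second, in line with the 1 − e^(−α) bound.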

  15. Outline • Introduction • Concise samples • Counting samples • Application to hot list queries • Conclusion • Reference

  16. Application to hot list queries • Hot list queries request an ordered set of <value, count> pairs for the k most frequently occurring data values, for some k. • Algorithms • Using traditional samples • Using concise samples • Using counting samples • Using histogram on disk: maintains a full histogram on disk, i.e., <value, count> pairs for all distinct values in R, with a copy of the top m/2 pairs stored as a synopsis within the approximate answer engine. It is considered only as a baseline for accuracy comparisons.
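As a rough illustration of answering a hot list query from a concise sample, the sketch below takes the k largest counts in the sample and scales each by n divided by the sample-size. The scaling is the usual estimator for a uniform sample and is an assumption here; the slide does not spell out how the reported counts are computed, and the function name and tie-breaking rule are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def hot_list(
    sample_counts: Dict[Hashable, int], k: int, n: int
) -> List[Tuple[Hashable, float]]:
    """Return up to k (value, estimated count) pairs, most frequent first."""
    size = sum(sample_counts.values())
    if size == 0:
        return []
    scale = n / size  # each sample point stands for about n/size tuples
    # Sort by descending sample count; break ties by the value's string form.
    top = sorted(sample_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))[:k]
    return [(value, count * scale) for value, count in top]


print(hot_list({"a": 6, "b": 3, "c": 1}, k=2, n=1000))
```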

  17. Application to hot list queries (cont.) • [Figure. x-axis: rank of a value; y-axis: count for the value.]

  18. Application to hot list queries (cont.)

  19. Application to hot list queries (cont.)

  20. Application to hot list queries – overheads

  21. Conclusion • Concise samples may offer the best choice when considering both accuracy and overheads. • The paper assumes batch-like processing of data warehouse inserts, in which inserts and queries do not intermix. Handling the more general case requires addressing concurrency bottlenecks. • Future work is to explore the effectiveness of concise samples and counting samples in other concrete approximate answer scenarios.

  22. Reference • P. B. Gibbons and Y. Matias. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD 1998.
