new sampling based summary statistics for improving approximate query answers n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
New Sampling-Based Summary Statistics for Improving Approximate Query Answers PowerPoint Presentation
Download Presentation
New Sampling-Based Summary Statistics for Improving Approximate Query Answers

Loading in 2 Seconds...

play fullscreen
1 / 22

New Sampling-Based Summary Statistics for Improving Approximate Query Answers - PowerPoint PPT Presentation


  • 125 Views
  • Uploaded on

New Sampling-Based Summary Statistics for Improving Approximate Query Answers. Yinghui Wang 02-07-2006. Outline. Introduction Concise samples Counting samples Application to hot list queries Conclusion Reference. Introduction.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'New Sampling-Based Summary Statistics for Improving Approximate Query Answers' - marcellus


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
new sampling based summary statistics for improving approximate query answers

New Sampling-Based Summary Statistics for Improving Approximate Query Answers

Yinghui Wang

02-07-2006

outline
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries
  • Conclusion
  • Reference
introduction
Introduction
  • In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible.
  • Effectiveness of a synopsis is evaluated as a function of its footprint, i.e., the number of memory words to store the synopsis.

New Data

Data Warehouse

Queries

Response

Figure 1:A traditional data warehouse

New Data

Approx.

Answer

Engine

Data Warehouse

Queries

Response

Figure 2: Data warehouse set-up for providing

approximate query answers.

definition
Definition
  • Concise samples
    • a uniform random sample of the data set such that values appearing more than once in the sample are represented as a value and a count, ex: <value, count>.
  • Counting samples
    • a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample.
  • Hot list queries
    • request an ordered set of <value, count> pairs for the k most frequently occurring data values, for some k.
    • ex: the top selling items in a database of sales transactions.
outline1
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries
  • Conclusion
  • Reference
concise samples
Concise samples

Consider a relation R with n tuples and an attribute A. The goal is to obtain a uniform random sample of R.A, i.e., the values of A for a random subset of the tuples in R.

  • Definition:Let S = {<v1, c1>,…, <vj, cj>, vj+1,..., vl} be a consice sample. Then sample-size(S) = l-j+∑ji = 1ci, and footprint(S) = l+j
  • Lemma 1 For any footprint m ≥ 2, there exists data sets for which the sample-size of a consice sample is n/m times lager than its footprint, where n is the size of the data set.
concise samples offline static
Concise samples – offline/static
  • Offline/static computation
    • Repeat m times: select a random tuple from the relation and extract its value for attribute A.
    • Semi-sort the set of values, and replace every value occurring multiple times with a <value, count> pair.
    • Continue to sample until either adding the sample point would increase the concise sample footprint to m+1 or n samples have been taken.
    • For each new value sampled, look-up to see if it is already in the concise sample and then either add a new singleton value, convert a singleton to a <value, count> pair, or increment the count for a pair.
concise samples online
Concise samples – online
  • With concise samples, the sample-size depends on the data distribution to date, and any changes in the data distribution must be reflected in the sampling frequency.
  • Maintenance algorithm:

Let S be the current concise sample and consider a new tuple t. Set up an entrythresholdβ(initially 1) for new tuples to be selected for the sample.

    • Addt.A to S with probability1/ β.
    • Do a look-up on t.A inS.
      • if it is represented by a pair, its count is incremented.
      • if t.A is a singleton in S, a pair is created,
      • if it is not in S, a singleton is created.
    • Increase footprint by 1 in cases b) and c)
    • Raise the threshold to someβ’. Subject each sample point in S to this higher threshold.Subsequent inserts are selected for the sample withprobability 1/ β’
concise samples cont
Concise samples (cont.)
  • Theorem 2 Consider the family of exponential distributions: for I = 1,2,…,Pr(v=i) = α-i(α-1), for α>1. For any footprint m≥2, the expected sample-size of a concise sample with footprint m is at least αm/2
  • Theorem 3 For any data set, when using a concise sample S with sample-size m, the expected gain is

E[m-number of distinct values in S] =

concise samples cont1
Concise samples (cont.)
  • Update time overheads
    • The coin flips that must be performed to decide which inserts are added to the concise sample and to evict values from the concise sample when the threshold is raised
    • The lookups into the current concise sample to see if a value is already present in the sample
concise samples experimental evaluation
Concise Samples – experimental evaluation

D: potential number

of distinct values

m: footprint size

Figure 3: Comparing sample-sizes of concise and traditional samples as a function of skew, for varying footprints and D/m ratios. In (a) and (b), authors compare footprint 100 and footprint 1000, respectively, for the same data sets. In (c) and (d), authors compare D/m = 50 and D/m = 5 , respectively, for the same footprint 1000.

outline2
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries
  • Conclusion
  • Reference
counting samples
Counting samples
  • Counting samples – a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample.
  • Definition: A counting sample for R.A with thresholdβis anysubset of R.A obtained as follows:

1. For each value v occurring c times in R, we flip a coinwith probability1/βof heads until the first heads, up to atmost ccoin tosses in all; if the ith coin toss is heads, then voccurs c-i+1 times in the subset, else vis not in the subset.

2. Each value v occurring c>1times in the subset is representedas a pair <v, c>, and each value v occurring exactlyonce is represented as a singleton v.

counting samples cont
Counting samples (cont.)
  • An algorithm for incremental maintenance is introduced .
  • Theorem 4 Let R be an arbitrary relation, and let βbe the currentthreshold for a counting sample S. (i) Any value v that occurs at least βtimes in R is expected to be in S. (ii) Any value v thatoccurs fv times in R will be in S with probability 1-(1-1/β) fv. (iii) For allα>1, if fv≥ αβ, then with probability ≥ 1 - e-α, the value will be in S and its count will be at least fv - αβ
outline3
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries
  • Conclusion
  • Reference
application to hot list queries
Application to hot list queries
  • Hot list queries request an ordered set of <value, count> pairs for the k most frequently occurring data values, for some k.
  • Algorithms
    • Using traditional samples
    • Using concise samples
    • Using counting samples
    • Using histogram on disk – maintains a full histogram on disk, i.e., <value, count> pairs for all distinct values in R, with a copy of the top m/2 pairs stored as a synopsis within the approximate answer engine.
      • -- is considered only as a baseline for accuracy comparisons
application to hot list queries cont
Application to hot list queries (cont.)

x-axis: rank of a value

y-axis: count for the

values

conclusion
Conclusion
  • Using concise samples may offer the best choice when considering both accuracy and overheads.
  • In this paper, a batch-like processing of data warehouse inserts, in which inserts and queries do not intermix, is assumed. To address the more general case , issues of concurrency bottlenecks need to be addressed.
  • Future work is to explore the effectiveness of using concise samples and counting samples for other concrete approximate answer scenarios.
reference
Reference
  • P. B. Gibbons and Y. Matias. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD 1998.