1 / 32

STHoles: A Multidimensional Workload-Aware Histogram

STHoles: A Multidimensional Workload-Aware Histogram. Nicolas Bruno* Columbia University. Surajit Chaudhuri Microsoft Research. Luis Gravano* Columbia University. * Work done in part while the authors were visiting Microsoft Research. SIGMOD 2001. Histograms as Succinct Data Set Summaries.

robbin
Download Presentation

STHoles: A Multidimensional Workload-Aware Histogram

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno*Columbia University Surajit ChaudhuriMicrosoft Research Luis Gravano*Columbia University * Work done in part while the authors were visiting Microsoft Research. SIGMOD 2001

  2. Histograms as Succinct Data Set Summaries • Used for selectivity estimation and approximate query processing. • Data set partitioned into buckets, each approximated by aggregate statistics.

  3. Histograms • Each bucket consists of a bounding box and a tuple frequency value. • Uniformity is assumed inside buckets. • Histograms should partition data set in buckets with uniform tuple density. • Multi-dimensional data makes partitioning even more challenging.

  4. Outline • Overview of existing multidimensional histogram techniques. • Introduction to STHoles histograms. • System architecture and STHoles construction algorithm. • Experimental evaluation.

  5. Histograms Techniques: EquiDepth • Correctly identifies core of densest clusters. • Partitioning uses “equi-count” instead of “equi-density” Gaussian Data Set EquiDepth Histogram [Muralikrishna and DeWitt 1988]

  6. Histogram Techniques: MHist • Works well for highly skewed data distributions. • Devotes too many buckets to the densest clusters. • Bad initial “choices” are amplified in later steps. Gaussian Data Set MHist Histogram [Poosala and Ioannidis 1997]

  7. Histogram Techniques: GenHist • More robust than previous techniques (based on multidimensional information). • Difficult to choose right values of various parameters. • Requires at least 5-10 passes over the data. GenHist Histogram [Gunopulos et al. 2000] Gaussian Data Set

  8. Histogram Techniques: STGrid • Incorporates feedback from query execution. • Grid partitioning strategy is sometimes too rigid. • Focuses on efficiency rather than accuracy. Gaussian Data Set STGrid Histogram [Aboulnaga and Chaudhuri 1999]

  9. Our New Histogram Technique: STHoles • Flexible bucket partitioning. • Exploits workload information to allocate buckets. • Query feedback captures uniformly dense regions. • Does not examine actual data set.

  10. Non rectangular region STHoles Histograms • Tree structure among buckets. • Buckets with holes: relaxes rectangular regions while using rectangular bucket structures.

  11. System Architecture for STHoles Range Query

  12. STHoles Construction Algorithm • Initialize histogram H as an empty histogram. • For each query q in workload: 1- Gather simple statistics from query results. 2- Identify candidate holes and drill (add)them as new buckets in H. 3- Merge superfluous buckets in H.

  13. ? Drilling New Candidate Buckets For each query q in workload and bucket b in histogram: • Count how many tuples in result stream lie inside qb. • Drill qb as a new bucket (child of b). q

  14. Shrinking Candidate Buckets • Partition constraint: Bounding boxes must be rectangular. • Apply greedy technique to shrink a candidate hole to a rectangle.

  15. Merging Buckets • To avoid exceeding available space. • Merge most “similar” buckets in terms of tuple density.

  16. Parent-Child Merges Eliminate buckets too similar to their parents. Example: The interesting region in bc is covered by its child b1.

  17. Sibling-Sibling Merges • Consolidate buckets with similar densities that cover close regions. • Extrapolate frequency distributions to yet unseen regions.

  18. An Example STHoles Histogram Gaussian Data Set STHoles Histogram

  19. Experimental Setting • Data Sets: • Real: (UCI Repository) • Sample of Census data set (200K tuples) • Cover data set (500K tuples) • Synthetic: Variations of Gaussian and Zipfian(Array) distributions. 200K to 500K tuples, 2 to 4 dimensions. • Histograms: • 1024 available bytes per histogram. • EquiDept, MHist, GenHist, STGrid, STHoles.

  20. Accuracy Metric: Absolute Error. Census data set Biased (tuples) workload Gaussian (area) workload (with some normalization; details in paper) Experimental Setting (cont.) • Workloads [Pagel et al. 1993]: • 1,000 queries. • Query centers follow different distributions: Uniform, Biased, Gaussian. • Query boundaries follow different constraints: area covered, tuples covered.

  21. Biased workload, query boundaries cover around 1% of the data domain Comparison with Other Approaches: Biased Workload

  22. Uniform workload, query boundaries cover around 1% of the data set tuples. Comparison with Other Approaches: Uniform Workload

  23. Biased workload Convergence with Workload

  24. Handling Data Set Updates From Gaussian to Zipfian data distributions.

  25. Other Experiments • Varying: • data skew. • data dimensionality. • histogram size. • workload generation parameters. • number of attributes in queries. • Overhead for intercepting query results in Microsoft SQL Server 2000 is less than 8%. • STHoles lead to robust selectivity estimates across data distributions and workloads. • See full paper for details!

  26. Summary: STHoles, a Multidimensional Workload-Aware Histogram • Exploits query feedback. • Built without examining data set. • Allows bucket nesting to capture complex shapes using only rectangular bucket structures. • Results in robust and accurate selectivity estimations. • In many cases, outperforms the best techniques that access full data sets.

  27. Related Work (Histograms) • Unidimensional: • EquiDepth [Piatetsky-Shapiro and Connell 1984] • MaxDiff [Poosala et al. 1996] • V-Optimal [Jagadish et al. 1998] • Many more! • Multidimensional: • EquiDepth [Muralikrishna and DeWitt 1988] • MHist [Poosala and Ioannidis 1997] • GenHist [Gunopulos et al. 2000] • STGrid [Aboulnaga and Chaudhuri 1999]

  28. Related Work (Other Techniques) • Sampling [Olken and Rotem 1990] • Wavelets [Matias et al. 1997] • Discrete transformations [Lee et al. 1999] • Parametric Curve Fitting [Chen and Roussopoulos 1994]

  29. Evaluation Metric • Absolute Error: • Normalized Absolute Error:

  30. Overhead Evaluation over Microsoft SQL Server 2000

  31. Zipfian Data Set Census Data Set Gaussian Data Set Varying Histogram Size

  32. Gaussian Data Set Zipfian Data Set Census Data Set Varying Spatial Selectivity

More Related