1 / 24

Query Optimization In Compressed Database Systems

Query Optimization In Compressed Database Systems. Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs. Why Compression? . CPU speed outpaces Disk speed exponentially! x10 / decade (bandwidth), x100 / decade (latency) Trade CPU for I/O: improve query performance

hall
Download Presentation

Query Optimization In Compressed Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Query Optimization In Compressed Database Systems Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs

  2. Why Compression? • CPU speed outpaces Disk speed exponentially! • x10 / decade (bandwidth), x100 / decade (latency) • Trade CPU for I/O: improve query performance + Save bandwidth for sequential I/O + Improve buffer pool hit ratio - Pay decompression cost • Environment • Decision support queries • Lossless compression

  3. Issues • Database compression methods • Efficient query processing

  4. General-purpose compression Only compression ratio matters Large decompression unit (whole file) Database compression Both compression ratio and decompression cost matter Small decompression unit (attribute or tuple) Database Compression Methods Our setting: allow to decompress a single attribute

  5. Efficient Query Processing • Compared to uncompressed DB • When to decompress • Assumption: no compression in query processing • Our story • Different strategies of when to decompress • None of them is always optimal • Combined optimization problem: Query plan + decompression placement • Solutions • Experiments

  6. Lazy Transient All uncompressed AB uncompressed All compressed R.A = S.B d(R.A) = d(S.B) D(R) D(R.A) D(S.B) D(S) R S R S Different Decompression Strategies Eager R.A = S.B Mem Disk R S

  7. Which Strategy Is Optimal? • Lazy vs. eager • Lazy is always better • Transient vs. Lazy • Transient: more I/O savings • Lazy: lower decompression cost • In practice • Numerical attributes: transient is always better • String attributes: no clear winner • Expensive to decompress • High I/O savings if compressed

  8. An Example With TPCH Data Select S_NAME, S_ADDRESS, C_NAME, C_PHONE From Supplier, Customer Where S_ADDRESS = C_ADDRESS Order by S_NAME, C_NAME Sort(S_N, C_N) S_A = C_A Supplier Customer

  9. Transient sort (3s) Transient sort (0.5s) 3 attributes compressed All attributescompressed Lazy BNL (2s) Transient BNL (42s) Transient vs. Lazy Lazy sort (7s) 1 attribute compressed Lazy BNL (2s) An optimization problem!

  10. Transient sort (0.5s) All attributes compressed Pruned by System R Transient SM (2.5s) Interactions With Traditional Optimization Algorithm: run System R, then decide when to decompress. Transient sort (3s) 3 attributes compressed Lazy BNL (2s) Optimal plan returned by System R is no longer optimal!

  11. Compression Aware Optimization • Given a query and a compressed DB: Find the optimal query plan • New operators • Explicit decompression operators • Transient versions of existing relational operators • Search space: O (nm) factor over old search space • n is the depth of the plan • m is the number of attributes • Each attribute explicitly decompressed at most once • For each attribute, n places to decompress explicitly

  12. Dynamic Programming - OPT Extend system R optimizer • Bottom up, one minimal plan per interesting property • What attributes remain compressed as a new property Lazy BNL (2s)Property: S_A, C_A uncompressed Transient SM join (2.5s)Property: all compressed Customer Supplier Customer Supplier Blowup reduced from nm to 2m

  13. Min-K Heuristic Algorithm • Store plans for k rather than 2m properties • The k properties whose plans are cheapest • Storage blowup reduced from 2m to k • Time: still exponential blowup in the worst case Stored plans: Join on S_A, C_A Lazy: S_A, C_ATransient: S_A, C_ALazy: S_A, transient: C_ATransient: S_A, Lazy: C_A S_A,… C_A,…

  14. Min-K Heuristics (2) • If transient decompression is bad for one join attribute, often so for the other • BNL join: both S_A and C_A decompressed N2 times • Only consider two cases Stored plans: Join on S_A, C_A Lazy: S_A, C_A Transient: S_A, C_A S_A,… C_A,… • Time blowup is 2k

  15. Experiments • Setup • Modify Predator query engine & optimizer • Algorithms • Uncompressed, Eager, Lazy, Transient-Only,Two-Step, OPT, Min-1, Min-2 • 100 MB TPCH data • 50% compression ratio • Pentium III 550 Mhz, vary buffer pool size

  16. Experimental Setup (2) • Randomly add join conditions on string attributes • Divide queries into workloads • Number of string join conditions, number of join tables • Metrics: for algorithm X • Average relative-cost: Average(cost of plan returned by X / cost of opt plan) • Average blowup factor: Average(# plans searched by X / # plans by System R)

  17. Average Relative Cost Queries with 3-4 join tables, buffer pool 10% of compressed DB

  18. Distribution of Query Performance Percentage of Good plans (cost within twice of OPT) for all queries

  19. Optimization Cost Queries with 3-4 join tables

  20. Related Work • How to compress • Roth&Horn93, Iyer&Wilhite94, Goldstein98 • How to query • Graefe&Shapiro91, Westmann00, Greer99 • Query optimization • Compressed MOLAP aggregates: Li99 • Compressed Bitmap indices:Amer-Yahia&Johnson00 • Expensive predicates: • Chaudhuri&Shim99, Hellerstein93

  21. Conclusions & Future Work • Novel optimization problem • Search for regular query plan + when to decompress • Separate search sub-optimal • OPT and Min-K heuristic • Up to an order improvement in experiments • Future work • Caching decompressed values • Updates

  22. 3 places to place D(S_A) Regular sort Before: convert to transient After: as it is D(S_A) Transient join Search Space Sort(S_A) 3 extended plans (3 is depth) nm blow up over old space • n: depth of plan • m: number of attributes S_A = C_A S_A, …

  23. Relative-Cost - Varying Buffer Pool Size Queries with 3- 4 join tables, 2 additional string joins

  24. Relative Performance (2) Queries with more than 5 join tables

More Related