1 / 28

Compression Aware Physical Database Design

Compression Aware Physical Database Design. Microsoft Research . Brown University . Hideaki Kimura *. Vivek Narasayya Manoj Syamala. hkimura@cs.brown.edu. { viveknar,manojsy }@ microsoft.com. (*) Graduates soon. On Job Market. Background: Compression in DB. Every Major DBMS Supports

jaron
Download Presentation

Compression Aware Physical Database Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Compression Aware Physical Database Design Microsoft Research Brown University Hideaki Kimura* Vivek Narasayya Manoj Syamala hkimura@cs.brown.edu {viveknar,manojsy}@microsoft.com (*) Graduates soon. On Job Market.

  2. Background: Compression in DB • Every Major DBMS Supports • Saves Storage Consumption • Saves I/O Bandwidth DBMS A: 4x! DBMS B: 10x! DBMS C: 12x! Query Process Engine Tables, Indexes SELECT Decompress Compressed Data Compress INSERT

  3. Compression Schemes in DB Dictionary Encoding NULL Suppression • Local dict. (Oracle, SQL Server) • Global dict. (DB2) + Prefix Suppression, LZO, RLE…

  4. Two Types of Compression in DB Order Independent Order Dependent • NULL-Supp. • Global dict. • … IAB IBA IAB IBA IAB IBA page fragmented = ≠ • Run Length Enc. • Local dict. • …

  5. Benefits and Overheads • Saves Storage Space, I/O • CPU Overhead to Compress & Decompress • Different Compression Scheme= Different Saving ↔ Overhead How Do We Use It? DBA

  6. Issue 1: To Compress or not.. • Depends on Data • Depends on Workload • SELECTs/INSERTs Frequency • CPU bottleneck? IO bottleneck? 10GB 10GB 1GB 9GB -10% -90% Low Compression Ratio High Compression Ratio

  7. Q1 Q2 Syntactically Relevant Indexes Issue 2: What Index to Create I4 I3 I5 I2 I1 Select Candidate Configurations I3 I5 I1 Physical DB Design Tool DBMS Configuration Enumerate BestConfiguration I1 I5 Hypothetical Indexes Prune What-if Analysis Query Optimizer Estimate Runtime

  8. Naïve Solution: Staged Design • Run Design Tool to Select Indexes • Compress them, then Repeat. Stage 2 Stage 1 Compress! MV MV MV Workload Idx Idx Idx 100 MB Budget 100 MB 50 MB 100 MB

  9. Problem in tight space budget SELECT SUM(Price*Discount) FROM Sales WHERE State='CA' and Jul 01 < Shipdate < Sep 01 • Misses an index that makes sense only with compression Sales Choice for 100 MB? I1 (State, Shipdate): 95 MB → 50 MB I2 (State, Shipdate) Include (Price, Discount): 170 MB → 90 MB

  10. Example: Tight Space Budget Good design: 175MB CREATE COMPRESSED INDEX (L_PARTKEY,L_ORDERKEY,L_SUPPKEY) INCLUDE (L_QUANTITY,L_EXTENDEDPRICE,L_DISCOUNT) Staged: 155MB CREATE INDEX (L_ORDERKEY) INCLUDE(L_SUPPKEY,L_COMMITDATE,L_RECEIPTDATE) ?

  11. Problem in plenty space budget Choice for 200 MB? • Result in too high CPU overheads for compression/decompression. I1 (State, Shipdate): 95 MB → 50 MB INSERT INTO Sales … UPDATE Sales SET Price=.. I2 (State, Shipdate) Include (Price, Discount): 170 MB→ 90 MB CPU Overheads

  12. Example: Plenty Space Budget Worse with More Budget!

  13. IntegratedSolution Needed! • How to Estimate Index-size after compression? • How to Evaluate benefits/overheads of compression? • How Compression affects Candidate Selection/Enumeration?

  14. Size Estimation • Essential Metric of Indexes • To Fit Space Budget • To Estimate I/O cost • Need Compression Fraction Table #tuple=1M Col-A Width=8 Col-B Width=4 Col-C Width=10 Clust. Key Width=4 Stats Size (IABC) = (8 + 4 + 10 + 4) * 1M = 26 MB Comp. Size (IABC) = 26 MB * CF (IABC)

  15. Prior work SampleCF Overheads • SampleCF[Idreoset al. ICDE'10] Table 1GB Sample 10MB CREATE COMPRESSED INDEX • Sample Size: Cost ↔ Accuracy • Still Expensive for 1,000s of indexes

  16. Solution Overview

  17. Index Size Deduction Local dict. (ORD-DEP) NULL supp. (ORD-IND) IBA IAB Ia Ib Ia,b Sum-up Savings SampleCF Ia Ib Col-Set Deduction Col-Ext Deduction Ia,b Ib,a Estimate From Run-Length More Details in paper

  18. Optimize Accuracy-Cost Trade-off • Size-Estimation Strategy • Sample Size? • Deduction Path? • Expected Errors? • Formulate as Graph Problem • Greedy algorithm to solve(details in the paper)

  19. Issues in Design Tool • Query Cost model to consider (De)Compression CPU cost • Candidate Selection/Enumeration Key Challenge:Space-Performance Trade-off

  20. Candidate Selection:Space-Performance Trade-off Q1 Q2 • Add CompressedIndexes IB ID IA IC Compressed Versions Most of them are Ignored! ID IB IC IA Select Fastest Compressed Indexes are often Slower-but-Smaller IA IC (exception: very highcompression ratio)

  21. Skyline Candidate Selection • Construct Skyline of Configurations • Pick Both Fast-Indexes and Small-Indexes

  22. Comp. IC IB 10MB ICB 10MB 5MB Enumeration: Problem ICB IB IA IC IA Seed IB IC IA IA IB IA Optimal Design 15MB Room IA ICB IC IA ICB • Greedy picks un-compressed indexes too early

  23. IC IA ICB IB IA ICC … Local Backtrack in Enumeartion Recover If Oversized IB IB IC IA IA IA • Recover oversized configurations • Compress indexes in the config.

  24. Experimental Results • Implemented on SQL Server 2008 • Modified Database Tuning Advisor (DTA) "DTAc" • Modified Query Cost Model • TPC-H Scale-1 (more results in paper) • SELECT-intensive/UPDATE-intensive • Compared Estimated Runtime

  25. Candidate Selection/Enumeration • Both Skyline & Backtrack are required esp. for tight budget Clustered/2ndary Indexes

  26. DTAc vs. DTA • Especially better in tight budget • Choose lightly compressed designs in UPDATE-intensive Clustered/2ndary/MV Indexes

  27. Overhead in DTA • Reduce Size Estimation Overheads for a factor of 3 • Mostly <10% Estimation Error

  28. Conclusion • Opportunities and Challenges • Integrated Approach to exploit compression in physical design • Space-Performance Tradeoff • Size Estimation • Open Issues • Column-Store

More Related