
Presentation Transcript


  1. Deep Compression in DB2 9
  Author: Bill Minor, Toronto Lab
  Speaker: Christopher Tsounis, Sr. Consulting I/T Specialist
  Platform: LUW

  2. Disclaimer
  • This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.
  • The licensed program described in this information and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, or any equivalent agreement between us.
  • Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems.
  • Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.
  • All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.
  • This information may contain examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

  3. Agenda
  • Compression Concepts and Overview
  • Enablement and Management
  • Compression Estimation
  • Use Cases
  • Implementation Considerations
  • Logging
  • Limitations
  • Proof Points: Compression Ratios and Performance
  • Utilities
  • Summary
  • White Papers

  4. DB2 ‘Compression’
  • V8 GA - NULL and Default Value Compression
    • No disk storage consumed for NULL column values, zero-length data in variable-length columns, and system default values
  • V8 GA - Multidimensional Clustering (MDC)
    • Significant index compression can be achieved through block indexes
    • One key per thousands of records (vs. one key per record with traditional indexes)
  • V8 FP4 - Database Backup Compression
    • Smaller backup images; compresses index and LF/LOB table spaces
  • DB2 9 - Deep Compression: the DB2 Storage Optimization Solution

  5. The Compression Value Proposition: use fewer data pages to store table data.
  What are the benefits?
  • Consume less storage
  • Consume storage at a slower rate
  • Possibly, improve your performance

  6. Compression Basics
  • It is substitutional: replaces longer strings with 12-bit symbols, e.g., in “Once upon a time ………… The End”, “Once” -> [symbol #0]
  • Uses a static dictionary: the [string, symbol #] list is pre-built and stored in the table, e.g., a sample dictionary {(“Once”, 0), (“upon a”, 1), … (“The End”, 4095)}
  • Compression is just the process of matching a byte string (data row) against the longest substrings present in the dictionary and substituting the corresponding symbol #s
  • Expansion is even easier: given a compressed row [symbol #][symbol #] …, simply look up and substitute each [symbol #] with the corresponding byte string

  7. DB2 Deep Compression Concepts
  • Dictionary based - a symbol table for compressing/decompressing data records
  • Lempel-Ziv (LZ) based algorithm (static dictionary) using 12-bit symbols
  • One dictionary per table, stored within the permanent table object (~75KB in size; disk + memory)
  • Data resides compressed on pages (both on disk and in the bufferpool)
    • Significant I/O bandwidth savings
    • Significant memory (bufferpool) savings
  • CPU costs: rows must be decompressed before being processed for evaluation
  • Log data from compressed records is written in compressed format
  • Does not compress rows where no storage saving is realized for that row. Example: “red” -> ‘#1#2#3’ with symbol #1=“r”, symbol #2=“e”, symbol #3=“d”. “red” is 3 bytes, but ‘#1#2#3’ is 12+12+12 bits (4.5 bytes), so it is more beneficial to store “red” as-is than in its longer compressed form.

  8. DB2 Data Row Compression
  • Repeating patterns within the data (and just within each row) are the key to good compression. Text data tends to compress well because of recurring strings, as does data with lots of repeating characters or leading/trailing blanks.
  • Example - logical table rows (the string inputs seen by the compression algorithm, 35 bytes per row):
    ‘Fred Smith 500 10000 Plano TX 24355’
    ‘John Smith 500 20000 Plano TX 24355’
  • After dictionary symbol substitution, the 70 bytes of row data are represented on disk by 10 symbols (10 x 1.5 bytes = 15 bytes).

  9. DB2 Data Row Compression
  [Diagram: in an uncompressed row (e.g., ‘… 000 PLANO TX 24355’), common sequences of consecutive bytes are replaced with a 12-bit symbol (e.g., ‘05’); a data page with compressed rows holds far more records than a data page with uncompressed rows.]

  10. How Do I Compress?
  In order to compress data, two prerequisites must be satisfied:
  1) The table COMPRESS attribute must be set to YES
  2) A compression dictionary must exist for the table object
  Once both step 1) and step 2) are satisfied, all data subsequently populated into the table is subject to being compressed.

  11. STEP 1. Enablement - Table DDL
  CREATE TABLE <table name> … COMPRESS { YES | NO }     (default: COMPRESS NO)
  ALTER TABLE <table name> … COMPRESS { YES | NO }
  • Compression is enabled at the table level via either the CREATE TABLE or ALTER TABLE statement
  • Compression will be in effect once a table dictionary is built
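  A minimal sketch of both statements (table and column definitions hypothetical):

    -- new table created compression-eligible
    CREATE TABLE sales (id INTEGER NOT NULL, details VARCHAR(200)) COMPRESS YES;
    -- or enable the attribute on an existing table
    ALTER TABLE sales COMPRESS YES;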

  12. STEP 2. Dictionary Building – Classic or ‘Offline’ Table REORG
  • When the compression dictionary is being built, a temporary in-memory buffer of 10MB is required
  • This memory is allocated from the utility heap
  • All the data rows that exist in the table participate in building the compression dictionary
  Syntax (simplified from the railroad diagram):
  REORG TABLE <table name> [INDEX <index name>] [ALLOW READ ACCESS | ALLOW NO ACCESS]
       [USE <tbspace>] [INDEXSCAN] [LONGLOBDATA] [KEEPDICTIONARY | RESETDICTIONARY]
  (KEEPDICTIONARY is the default.)
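  For example, building (or rebuilding) the dictionary and compressing the existing rows of a hypothetical table:

    -- build a new dictionary from the table's current data and compress all rows
    REORG TABLE sales RESETDICTIONARY;
    -- KEEPDICTIONARY (the default) reuses an existing dictionary instead
    REORG TABLE sales KEEPDICTIONARY;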

  13. Compression Dictionary Build in DB2 9
  [Diagram: an empty table is populated via INSERT/LOAD with uncompressed row data; a classic table REORG then builds the dictionary, compresses the existing row data, and rebuilds the indexes.]

  14. Dictionary Management by Table REORG: KEEPDICTIONARY vs. RESETDICTIONARY
  [Diagram: with KEEPDICTIONARY, an existing dictionary is preserved and reused during the REORG; with RESETDICTIONARY, a new dictionary is built from the table's current data.]

  15. Compression Estimator (Advisor) – DB2 INSPECT
  • This tool looks at all the rows of the table data and builds a compression dictionary from them. The dictionary is then used to test compression against the records contained in the sample. Results include:
    • Estimate of compression savings
    • Dictionary size
  • Will insert the dictionary if COMPRESS YES is set
    • Allows for online dictionary creation/insertion
    • Future rows inserted/updated are compressed
    • Does not address existing rows (use REORG for those)
  Syntax (simplified from the railroad diagram):
  INSPECT { <check clause> | ROWCOMPESTIMATE TABLE { NAME <table-name> [SCHEMA <schema-name>] | TBSPACEID <tbspc id> OBJECTID <obj id> } } …

  16. DB2 INSPECT – Compression Evaluation
  • INSPECT ROWCOMPESTIMATE TABLE NAME <tbname> SCHEMA <schema> RESULTS KEEP <filename>
  • In the db2dump directory, you will see <filename>
  • db2inspf <filename> <outfile>
  • ../sqllib/db2dump/<outfile> will contain the results of the compression estimation:
    Action: ROWCOMPESTIMATE TABLE
    Schema name: MIKEW   Table name: EMPLOYEE
    Tablespace ID: 2   Object ID: 6
    Result file name: emp
    Table phase start (ID Signed: 6, Unsigned: 6; Tablespace ID: 2) : MIKEW.EMPLOYEE
    Data phase start. Object: 6 Tablespace: 2
    Row compression estimate results:
      Percentage of pages saved from compression: 46
      Percentage of bytes saved from compression: 46
      Percentage of rows ineligible for compression due to small row size: 0
      Compression dictionary size: 13312 bytes.
      Expansion dictionary size: 10240 bytes.
    Data phase end.
    Table phase end.

  17. DB2 9 INSPECT Compression Dictionary Build
  [Diagram: a table populated via INSERT/LOAD keeps its existing rows uncompressed; INSPECT ROWCOMPESTIMATE builds the dictionary and can insert it online, after which newly populated rows are compressed.]

  18. INSPECT Online Dictionary Build
  • The ROWCOMPESTIMATE option always builds a compression dictionary as part of its processing
  • The dictionary insert is only applicable if no dictionary exists in the table and the table COMPRESS attribute is YES
  • An IX table lock is required (same as INSERT)
  • An Alter Table lock is required

  19. Disabling Table Compression
  • For a table that is already compressed, compression can be disabled by setting the table COMPRESS attribute to NO
    • The dictionary still exists in the table but is no longer used to compress
    • Subsequent table population is not subject to compression
    • An UPDATE of an existing compressed record changes the record to uncompressed
  • The COMPRESS attribute can be toggled back to YES, and as long as the dictionary still exists it will be respected (data is subject to compression again); see the sketch below
  • The compression dictionary can only be removed during table truncation operations
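  A minimal sketch of toggling the attribute (table name hypothetical):

    ALTER TABLE sales COMPRESS NO;   -- dictionary remains; new/updated rows are stored uncompressed
    ALTER TABLE sales COMPRESS YES;  -- the surviving dictionary is respected again for new rows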

  20. Dictionary Removal / Characteristics
  • The dictionary is deleted only during table truncation, and only if the COMPRESS table attribute is turned off
    • If the COMPRESS attribute is set, the dictionary is preserved
  • Operations such as table REORG, LOAD REPLACE, or IMPORT REPLACE drive table truncation - the dictionary can be removed there
  • The dictionary is static - no other alters or updates are performed against the dictionary or any other dictionary records
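  One hedged way to remove the dictionary, assuming the documented REORG RESETDICTIONARY behavior when the COMPRESS attribute is NO (table name hypothetical):

    ALTER TABLE sales COMPRESS NO;
    -- with COMPRESS NO, RESETDICTIONARY discards the existing dictionary
    -- and leaves all rows uncompressed
    REORG TABLE sales RESETDICTIONARY;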

  21. Row Compression: New Table
  • Create a table that is eligible for compression:
    CREATE TABLE Sales COMPRESS YES
  • Get a representative data sample:
    LOAD FROM filesmall OF DEL REPLACE INTO Sales
  • Build the dictionary on the sample data:
    REORG TABLE Sales
  • LOAD respects the dictionary:
    LOAD FROM filerest OF DEL INSERT INTO Sales

  22. Row Compression: New Table Partitions
  • Make the table eligible for compression:
    ALTER TABLE Sales COMPRESS YES
  • Create a new sales data range:
    CREATE TABLE NewSales … COMPRESS YES
  • Load data into the new partition:
    LOAD FROM file OF DEL REPLACE INTO NewSales
  • Data now gets compressed:
    REORG TABLE NewSales RESETDICTIONARY
  • The new partition is added to the table:
    ALTER TABLE Sales ATTACH PARTITION Mar05 STARTING '03/01/2005' ENDING '03/31/2005' FROM TABLE NewSales
  • Execute SET INTEGRITY

  23. DB2 INSPECT – Partitioned Tables
  db2 "create table t_dpart (c1 int, c2 varchar(100)) partition by range(c1) (starting from (1) ending (100) every (25)) compress yes"
  db2 "describe data partitions for table bminor.t_dpart show detail"
  PartitionId  PartitionName  TableSpId  PartObjId  LongTblSpId  AccessMode  Status
  0            PART0          3          260        3            F
  1            PART1          3          261        3            F
  2            PART2          3          262        3            F
  3            PART3          3          263        3            F
  db2 inspect rowcompestimate table tbspaceid 3 objectid 260 results keep inspd.out

  24. New Administrative Table Function
  db2 describe "select * from table(sysproc.admin_get_tab_info('MIKEW','STAFF')) as t"
  sqltype         sqllen  sqlname.data             sqlname.length
  --------------  ------  -----------------------  --------------
  449 VARCHAR     128     TABSCHEMA                9
  449 VARCHAR     128     TABNAME                  7
  453 CHARACTER   1       TABTYPE                  7
  501 SMALLINT    2       DBPARTITIONNUM           14
  497 INTEGER     4       DATA_PARTITION_ID        17
  453 CHARACTER   1       AVAILABLE                9
  493 BIGINT      8       DATA_OBJECT_L_SIZE       18
  493 BIGINT      8       DATA_OBJECT_P_SIZE       18
  493 BIGINT      8       INDEX_OBJECT_L_SIZE      19
  493 BIGINT      8       INDEX_OBJECT_P_SIZE      19
  493 BIGINT      8       LONG_OBJECT_L_SIZE       18
  493 BIGINT      8       LONG_OBJECT_P_SIZE       18
  493 BIGINT      8       LOB_OBJECT_L_SIZE        17
  493 BIGINT      8       LOB_OBJECT_P_SIZE        17
  493 BIGINT      8       XML_OBJECT_L_SIZE        17
  493 BIGINT      8       XML_OBJECT_P_SIZE        17
  501 SMALLINT    2       INDEX_TYPE               10
  453 CHARACTER   1       REORG_PENDING            13
  449 VARCHAR     10      INPLACE_REORG_STATUS     20
  449 VARCHAR     12      LOAD_STATUS              11
  453 CHARACTER   1       READ_ACCESS_ONLY         16
  453 CHARACTER   1       NO_LOAD_RESTART          15
  501 SMALLINT    2       NUM_REORG_REC_ALTERS     20
  453 CHARACTER   1       INDEXES_REQUIRE_REBUILD  23
  453 CHARACTER   1       LARGE_RIDS               10
  453 CHARACTER   1       LARGE_SLOTS              11
  493 BIGINT      8       DICTIONARY_SIZE          15
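  A hedged sketch of using this function to compare logical vs. physical data size (schema and table names hypothetical; units assumed to be KB for the size columns and bytes for DICTIONARY_SIZE):

    SELECT SUBSTR(TABNAME, 1, 20) AS TABNAME,
           DATA_OBJECT_L_SIZE,     -- logical size of the data object (KB)
           DATA_OBJECT_P_SIZE,     -- physical size on disk (KB)
           DICTIONARY_SIZE         -- compression dictionary size (bytes)
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_INFO('MIKEW', 'STAFF')) AS T;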

  25. Compression Catalog Information – syscat.tables
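  A hedged sketch of querying the compression-related catalog columns (column names such as COMPRESSION, PCTPAGESSAVED, and AVGCOMPRESSEDROWSIZE are assumed from the DB2 9 SYSCAT.TABLES view; the statistics columns are populated by RUNSTATS):

    SELECT SUBSTR(TABSCHEMA, 1, 12) AS TABSCHEMA,
           SUBSTR(TABNAME, 1, 20)   AS TABNAME,
           COMPRESSION,             -- 'R' indicates row compression is enabled
           PCTPAGESSAVED,           -- pages saved, per the last RUNSTATS
           AVGCOMPRESSEDROWSIZE     -- average compressed row size, per RUNSTATS
    FROM SYSCAT.TABLES
    WHERE TABSCHEMA NOT LIKE 'SYS%';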

  26. New Larger RIDs – More Records/Data Page
  RID = Record Identifier = (page_number, slot_number)
  In DB2 9 large table spaces, RIDs grow from 4 bytes (a 3-byte page number and a 1-byte slot number) to 6 bytes (a 4-byte page number and a 2-byte slot number), so a data page is no longer capped at 255 record slots and compressed rows can actually fill the page.

  27. SQL Access Plans: Explain Tables – db2exfmt
  • Create the explain tables, e.g.: db2 -tvf $HOME/sqllib/misc/EXPLAIN.DDL
  • Generate and format the access plan:
    db2 explain all for <statement>
    db2exfmt -d <dbname> -n % -e % -s % -w -1 -# 0 -o <resultsFile> -g TIC
  Objects Used in Access Plan:
    Schema: BMINOR   Name: STAFF2   Type: Table
    Time of creation: 2006-11-22-09.34.28.735089
    Last statistics update: 2006-11-22-09.37.21.861142
    Number of columns: 7
    Number of rows: 35
    Width of rows: 41
    Number of buffer pool pages: 6
    Number of data partitions: 1
    Distinct row values: No
    Tablespace name: IBMDB2SAMPLEREL
    Tablespace overhead: 7.500000
    Tablespace transfer rate: 0.060000
    Source for statistics: Single Node
    Prefetch page count: 32
    Container extent page count: 32
    Table overflow record count: 0
    Table Active Blocks: -1
    Average Row Compression Ratio: 1.63207
    Percentage Rows Compressed: 100
    Average Compressed Row Size: 27
  Access Plan:
    Total Cost: 50.4599
    Query Degree: 1
           Rows
          RETURN
          (   1)
           Cost
            I/O
            |
            35
          TBSCAN
          (   2)
          50.4599
            6
            |
            35
    TABLE: BMINOR.STAFF2

  28. Compression – Log Records
  • Data from compressed tables is stored in log records in compressed form
  • INSERT and DELETE log records will consume less storage
  • Although UPDATE log records also contain compressed data, the overall amount of logged data may be more than for the uncompressed version

  29. Compression and UPDATE Logging
  • There are three different types of UPDATE log records that can be written by DB2:
    • Full before- and after-row-image logging
      • Writes the most bytes
      • The only type of UPDATE logging for DATA CAPTURE CHANGES
    • Full XOR logging
      • XOR differences between before and after row images – less logging than full before/after image logging
    • Partial XOR logging
      • XOR differences between before and after images, for UPDATEs where there is no row size increase
  • When a compressed row is updated, the compressed images can change even if the uncompressed record lengths do not: a different number of symbols may be needed to represent the updated version of the row, hence a Full XOR log record is required

  30. Compression and UPDATE Processing
  • With the delivery of larger RIDs in DB2 9, more data records can be packed onto a data page
  • If the data page is tightly packed with compressed records and there is no free space on the page for size-increasing UPDATEs, a pointer-overflow record is created: a pointer to the new record is placed on the existing page while the actual record resides on another page
  • For data that is heavily updated, reserve free space within a page via ALTER TABLE … PCTFREE (see the sketch below)
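  A minimal sketch (table name and percentage hypothetical; PCTFREE takes effect when pages are next rebuilt by REORG or LOAD):

    ALTER TABLE sales PCTFREE 10;  -- reserve 10% free space on each data page
    REORG TABLE sales;             -- rebuild pages so the new PCTFREE is honored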

  31. Log Consumption – Column Ordering Summary
  • Columns which are updated frequently (changing value) should be:
    • grouped together
    • defined towards or at the end of the table definition (see the sketch below)
  • These recommendations are actually independent of:
    • row compression
    • row format (default or null/value compression)
  • The benefits would be:
    • better performance
    • fewer bytes logged
    • fewer log pages written
    • a smaller active log requirement for transactions performing a large number of updates
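  A hedged illustration of the column-ordering recommendation (schema hypothetical):

    CREATE TABLE orders (
      order_id     INTEGER NOT NULL,  -- static once inserted
      customer_id  INTEGER NOT NULL,  -- static
      order_date   DATE,              -- static
      status       CHAR(1),           -- updated frequently: grouped at the end
      quantity     INTEGER,           -- updated frequently
      last_updated TIMESTAMP          -- updated frequently
    ) COMPRESS YES;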

  32. What Tables Should Be Compressed? Some considerations:
  • Compression is about storage savings, so look at the largest tables first (see the query sketch below)
  • Generate potential savings estimates via INSPECT ROWCOMPESTIMATE
  • Compress all tables where there is a 50% savings or more
  • What is the typical table activity – read only? Some writes? Heavy writes?
  • Characterize the current system load: I/O and CPU
    • Largely I/O-bound systems will benefit
    • Compress read-only tables
  • Benchmark a typical workload against the compressed table(s) and compare to the uncompressed case
  • Determine AVGROWSIZE and the number of rows per page in advance to determine the need for large RIDs
  • Collect statistics and gauge compression effectiveness as the data evolves
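  One hedged way to list candidate tables by current size (FPAGES reflects the last RUNSTATS; the schema filter is an assumption):

    SELECT SUBSTR(TABSCHEMA, 1, 12) AS TABSCHEMA,
           SUBSTR(TABNAME, 1, 20)   AS TABNAME,
           FPAGES                   -- total pages occupied, per the last RUNSTATS
    FROM SYSCAT.TABLES
    WHERE TYPE = 'T' AND TABSCHEMA NOT LIKE 'SYS%'
    ORDER BY FPAGES DESC
    FETCH FIRST 20 ROWS ONLY;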

  33. Limitations of Data Row Compression
  • Row compression is only supported for tables that support table REORG. The following cannot be compressed:
    • RCT tables (range-clustered tables)
    • Catalog tables
    • Declared global temporary tables
    • System temporary tables
  • Row compression applies only to rows in the table's data object (the key is whether the data is stored with the table rows or somewhere else):
    • Pure XML is in a separate object – XML data is not compressed
    • LOBs are stored in a separate object – LOB data is not compressed
    • LONGs are stored in a separate object – LONG data is not compressed
    • Indexes are not compressed
  • Row compression support and table data replication support are not compatible: the DATA CAPTURE CHANGES option is not compatible with the COMPRESS YES option. Even if the table's COMPRESS attribute is NO, if the table still has a dictionary, DATA CAPTURE CHANGES is not allowed, since there may be compressed rows in the table.
  • Compression is not practical for a table smaller than 100KB, since the space savings may not offset the storage requirements for the dictionary.

  34. Compression Ratios (Customer Financial Data)
  Compression Type   32KB Page Count   Space Required on Disk
  No compression     5893888           179.9 GB
  Row compression    1392446           42.5 GB
  Pages saved: 76.4%

  35. Compression Savings – Customer Data Warehouse
  Data warehouse estimated reduction in size from 2.873 TB to 1.453 TB.
  Table  % of Table Sampled  Number of Rows  Sample Size (GB)  Sample Size Compressed (GB)  Compression (% saved)  Total Est. Disk Savings (GB)
  T1     100                 159259747       71.49             13.23                        81.5%                  58
  T2     10                  77390826        21.48             5.23                         75.7%                  162
  T3     100                 191362309       141.46            36.18                        74.4%                  105
  T4     20                  19474261        11.80             3.05                         74.2%                  43
  T5     10                  23370458        25.78             8.54                         66.9%                  172
  T6     10                  39573969        23.24             6.32                         72.8%                  169

  36. TPCH - Space Savings per Table
  [Chart not reproduced in the transcript.]

  37. TPCH - Throughput Comparison
  [Chart: throughput, where higher is better; the compressed configuration used half as many disks as the uncompressed configuration.]

  38. Query Performance
  • Sample warehouse query workload:
    • I/O-bound system: observed overall improvements of 20-30% in end-to-end workload execution time
    • CPU-bound system: performance neutral, or degradations on the order of 10% in the end-to-end workload
  • Compression trades I/O time for CPU time:
    • If the system is CPU bound, or if an agent is already consuming an entire CPU on its own, then compression will likely have a negative impact
    • If there is CPU headroom and I/O waits are occurring, compression can be beneficial

  39. Backup Compression and Data Row Compression
  • Test backup compression in addition to row compression for tables
  • Backup compression can be expensive and may not provide much added value in further reducing the backup image size
  • The time/size/value trade-off depends on the percentage of table space content already subject to row compression, e.g.: Are all tables compressed? Are indexes or long data stored in the same table space?
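  A minimal sketch of the comparison (database name and target path hypothetical): take one backup with COMPRESS and one without, then compare image sizes and elapsed times.

    db2 backup database salesdb to /backups compress
    db2 backup database salesdb to /backups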

  40. Dictionary Build Strategies
  [Diagram comparing two approaches: LOAD 10% of the data, REORG to build the dictionary, then continue LOADing (the remaining rows are compressed as they arrive), versus LOAD 100% of the data followed by a full table REORG.]

  41. In Summary
  • The storage savings and performance impact of data row compression are intimately tied to the characteristics of the data within the database, the configuration of the database (layout and tuning), as well as the application workload.
  • In general terms, the net performance impact of compression depends on whether the database is CPU or I/O bound.
  • Data row compression can significantly reduce table size at the expense of additional CPU utilization.
  • Test drive DB2 9 to get a sense for what impact you can expect with respect to both storage and performance.

  42. Additional References
  • Chris Eaton's blog - http://blogs.ittoolbox.com/database/technology
  • IBM developerWorks - http://www.ibm.com/developerworks/
    • Introducing DB2 9: DB2 9 unveiled
      http://www-128.ibm.com/developerworks/db2/library/long/dm-0607ahuja2/#download
    • Introducing DB2 9, Part 1: Data compression in DB2 9
      http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0605ahuja/
    • Row compression in DB2 9 - Analysis of a DSS database environment
      http://www-128.ibm.com/developerworks/db2/library/long/dm-0610chang/
    • Shrink your database using DB2 9 for Linux, UNIX, and Windows row compression - New capabilities for reducing your disk storage in DB2 9
      http://www-128.ibm.com/developerworks/edu/dm-dw-dm-0611read-i.html
