
Tuning the Optimizer Statistics



  1. Tuning the Optimizer Statistics • Eric Miner • Senior Engineer • Data Server Technology • eric.miner@sybase.com

  2. There are Two Kinds of Optimizer Statistics • Table/Index level – describes a table and its index(es) • Page/row counts, cluster ratios, deleted and forwarded rows • Some are updated dynamically as DML occurs • page/row counts, deleted rows, forwarded rows, cluster ratios • Stored in systabstats • Column level – describes the data to the optimizer • Histogram (distribution), density values, default selectivity values • Static, need to be updated or written directly • Stored in sysstatistics • This presentation deals with the column level statistics
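
Where these live can be checked directly. A minimal sketch against the TW4 table used in the later examples (catalog column names are per ASE 11.9+ and should be verified on your version):

-- table/index level statistics: several counters are kept up to
-- date dynamically as DML occurs
select indid, rowcnt, forwrowcnt, delrowcnt
from systabstats
where id = object_id('TW4')

-- column level statistics are stored in an internal format; use
-- optdiag or sp__optdiag to read them in human form
select count(*) from sysstatistics where id = object_id('TW4')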

  3. What Are These Column Level Statistics Used For? • The Histogram Values • Describes the distribution of values in the column • Belongs to a column, not an index • Used in costing SARGs • A step is the point in the column where a value is read to obtain a 'boundary value' • A cell represents the rows that fall between two steps • Each cell has a weight, which is the fraction of rows in the column it represents – read as a percentage of all rows • There are approximately the same number of rows in each cell – except Frequency count cells

  4. Some Quick Definitions

Range cell density: 0.0037264745412389
Total density: 0.3208892191740000
Range selectivity: default used (0.33)
In between selectivity: default used (0.25)

Histogram for column: "A"
Column datatype: integer
Requested step count: 20
Actual step count: 10

Step  Weight      Value
1     0.00000000  <= 154143023
2     0.05263700  <= 154220089
3     0.05245500  <= 171522361
4     0.00000000  <  800000000
5     0.34489399  =  800000000
6     0.04968300  <= 859388217
7     0.00000000  <  860000000
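
Output in this form is what optdiag prints. A hedged sketch of the command that produces it, assuming a database named mydb and that the column is TW4.A as in the later join examples (login and server name are placeholders):

optdiag statistics mydb..TW4.A -Usa -Ppassword -SMYSERVER

Reading the histogram above: step 5 is a frequency count cell, and its weight of 0.34489399 says roughly 34.5% of all rows equal 800000000 – about 344,894 rows on the 1,000,000-row tables used throughout these examples.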

  5. What Are These Column Level Statistics Used For? cont. • A cell can represent either a single value or multiple values of the column • Range cell – more than one value:

2 0.05263982 <= 31423
3 0.05263316 <= 63045

All values between 31424 and 63045 are in cell 3 • Frequency count cell – only one value (very accurate):

5 0.00000000 < 170016
6 0.15908001 = 170016
7 0.05263815 <= 201861
8 0.10263316 = 201862
9 0.05264576 <= 317462

Cells 6 and 8 represent only one value • Cell (step) 1 represents the NULL values in the column

  6. Statistics On Inner Columns of Composite Indexes • Stats on inner columns of composite indexes • Think of a composite index as a 3D object: columns with statistics are transparent, those without statistics are opaque • Columns with statistics give the optimizer a clearer picture of an index – sometimes good, sometimes not • This is a fairly common practice • It does add maintenance • update index statistics is most commonly used to do this (see the sketch below): update index statistics tab_name [ind_name]
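
A minimal sketch of that command on the table from the next slides, where the composite index on TW4 is named E_B: this builds (or refreshes) statistics on every column of the index, so the inner column B gets its own histogram and density values.

update index statistics TW4 E_B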

  7. Statistics On Inner Columns of Composite Indexes cont. • Index on columns E and B – no statistics on column B • select * from TW4 where E = "yes" and B >= 959789065 and id >= 600000 and F > "May 14, 2002" and A_A = 959000000 • Beginning selection of qualifying indexes for table 'TW4', varno = 0, objectid 464004684. The table (Allpages) has 1000000 rows, 24098 pages, • Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. • No statistics available for B, using the default range selectivity to estimate selectivity. • Estimated selectivity for B, selectivity = 0.330000.

  8. Statistics On Inner Columns of Composite Indexes cont. • The best qualifying index is 'E_B' (indid 7) costing 49264 pages, with an estimate of 191 rows to be returned per scan of the table • FINAL PLAN (total cost = 481960): varno=0 (TW4) indexid=0 () path=0xfbccc120 pathtype=sclause method=NESTED ITERATION • Table: TW4 scan count 1, logical reads: (regular=24098 apf=0 total=24098), physical reads: (regular=16468 apf=0 total=16468), apf IOs used=0

  9. Statistics On Inner Columns of Composite Indexes cont. • Statistics are now on column B • Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. • Estimated selectivity for B, selectivity = 0.022199, upper limit = 0.074835. • The best qualifying index is 'E_B' (indid 7) costing 3317 pages, with an estimate of 13 rows to be returned per scan of the table • FINAL PLAN (total cost = 55108): varno=0 (TW4) indexid=7 (E_B) path=0xfbd1da08 pathtype=sclause method=NESTED ITERATION • Table: TW4 scan count 1, logical reads: (regular=4070 apf=0 total=4070), physical reads: (regular=820 apf=0 total=820)

  10. Statistics On Non-Indexed Columns and Joins • Stats on non-indexed columns can't help with index selection but can affect join ordering • Columns with statistics give the optimizer a clearer picture of the column – no hard-coded assumptions have to be used • When costing joins on non-indexed columns, having statistics may result in better plans than using the default values • Without statistics there is no Total density or histogram that the optimizer can use to cost the column in the join • Yes, in some circumstances histograms can be used in costing joins: if there is a SARG on the joining column and that column is also in the join table, the SARG from the joining table can be used to filter the join table • If there is no SARG on the join column or on the joining column, the Total density value (with stats) or the default value (w/o stats) will be used – see the sketch below
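
The command behind the example that follows is the simple column form of update statistics – a sketch that gives the non-indexed join column TW2.A a histogram and Total density of its own:

update statistics TW2 (A)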

  11. Statistics On Non-Indexed Columns and Joins cont. • SARG example: select ... from TW1, TW4 where TW1.A = TW4.A and TW1.A = 10 • Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A TW4.A = 10 • Estimated selectivity for a, selectivity = 0.003726, upper limit = 0.049683. • Histogram values used • select ... from TW1, TW4 where TW1.A = TW4.A and TW1.B = 10 • Selecting best index for the JOIN CLAUSE: TW4.A = TW1.A • Estimated selectivity for a, selectivity = 0.320889. • Total density value used

  12. Statistics On Non-Indexed Columns and Joins – Example • select * from TW1, TW2 where TW1.A = TW2.A and TW1.A = 805975090 • A simple join with a SARG on the join column of one table • Table TW2 column A has no statistics, TW1 column A does • Selecting best index for the JOIN CLAUSE: (for TW2.A) TW2.A = TW1.A TW2.A = 805975090 ← inherited from the SARG on TW1 – but it can't help… no stats • Estimated selectivity for A, selectivity = 0.100000. • The best qualifying access is a table scan, costing 13384 pages, with an estimate of 50000 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement • Join selectivity is 0.100000. • The SARG inherited from the other table doesn't help in this case

  13. Statistics On Non-Indexed Columns and Joins – Example cont. Without statistics on TW2.A the plan includes a reformat with TW1 as the outer table FINAL PLAN (total cost = 2855774): varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION varno=1 (TW2) indexid=0 () path=0xfbd0bb10 pathtype=join method=REFORMATTING • Not the best plan – but the optimizer had little to go on

  14. Statistics On Non-Indexed Columns and Joins – Example cont. Table TW2 column A now has statistics. The inherited SARG on TW1.A can now be used to help filter the join on TW2.A Selecting best index for the JOIN CLAUSE: TW2.A = TW1.A TW2.A = 805975090 Estimated selectivity for A, selectivity = 0.001447, upper limit = 0.052948. The best qualifying access is a table scan, costing 13384 pages, with an estimate of 724 rows to be returned per scan of the table, using no data prefetch (size 2K I/O), in data cache 'default data cache' (cacheid 0) with MRU replacement Join selectivity is 0.001447.

  15. Statistics On Non-Indexed Columns and Joins – Example cont. With statistics on TW2.A reformatting is not used and the join order has changed FINAL PLAN (total cost = 1252148): varno=1 (TW2) indexid=0 () path=0xfbd0b800 pathtype=sclause method=NESTED ITERATION varno=0 (TW1) indexid=2 (A_E_F) path=0xfbd46800 pathtype=sclause method=NESTED ITERATION

  16. The Effects of Changing the Number of Steps (Cells) • The number of cells (steps) affects SARG costing – as the number of steps changes, costing does too • Cell weights and Range cell density are used in costing SARGs • Cell weight is used as the column's 'upper limit'; Range cell density is used as 'selectivity' for Equi-SARGs – as seen in 302 output • The result of interpolation is used as the column's 'selectivity' for Range SARGs • Increasing the number of steps narrows the average cell width, thus the weight of Range cells decreases • Can also result in more Frequency count cells and thus change the Range cell density value • More cells means more granular cells

  17. The Effects of Changing the Number of Steps (Cells) cont. • Average cell width = # of rows / (# of requested steps – 1) • Table has 1 million rows: • with 20 requested steps, 1,000,000/19 = 52,632 rows per cell • with 200 requested steps, 1,000,000/199 = 5,025 rows per cell • What does this mean? • As you increase the number of steps (cells) they become narrower – representing fewer values • We'll see that this has an effect on how the optimizer estimates the cost of a SARG • update statistics … using X values • create index … using X values (see the sketch below)
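
A sketch of requesting the two step counts compared on the next slides, on the deck's 1,000,000-row TW2 table:

update statistics TW2 (B) using 20 values   -- ~52,632 rows per range cell
update statistics TW2 (B) using 200 values  -- ~5,025 rows per range cell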

  18. The Effects of Changing the Number of Steps (Cells) cont. • Changing the number of steps – effects on Equi-SARGs • select A from TW2 where B = 842000000 • With 20 cells (steps) in the histogram: Range cell density: 0.0012829768785739

9 0.05263200 <= 825569337
10 0.05264200 <= 842084405

• SARG value falls into cell 10 • Estimated selectivity for B, selectivity = 0.001283, upper limit = 0.052642. • Here 'selectivity' is the Range cell density and 'upper limit' is the weight of the qualifying cell

  19. The Effects of Changing the Number of Steps (Cells) cont. • With 200 cells (steps) in the histogram: Range cell density: 0.0002303825911991

77 0.00507200 <= 839463989
78 0.00506000 <= 842019895

• SARG value falls into cell 78 • Estimated selectivity for B, selectivity = 0.000230, upper limit = 0.005060. • In this case more cells result in a lower estimated selectivity • Increasing the number of steps has decreased the average cell width and lowered the Range cell density and the average cell weight • Range cell density decreased because Frequency count cells appeared in the histogram

  20. The Effects of Changing the Number of Steps (Cells) cont. • Changing the number of steps – effects on Range SARGs • select * from TW2 where B between 825570000 and 830000000 • With 20 cells (steps) in the histogram: Range cell density: 0.0012829768785739

9 0.05263200 <= 825569337
10 0.05264200 <= 842084405

• SARG values fall into cell 10 • Estimated selectivity for B, selectivity = 0.014121, upper limit = 0.052642. • Here 'selectivity' is the product of interpolation, 'upper limit' is the weight of the qualifying cell • Interpolation estimates how much of the cell will qualify for the range SARG

  21. The Effects of Changing the Number of Steps (Cells) cont. • select * from TW2 where B between 825570000 and 830000000 • With 200 cells (steps) in the histogram: Range cell density: 0.0002303825911991

67 0.00505200 <= 825505843
68 0.00503000 <= 825570611
69 0.00508000 <= 825635378
70 0.00504000 <= 825690418
71 0.00506400 <= 825702450
72 0.00503200 <= 825767218
73 0.00510200 <= 825831945
74 0.00425800 <= 825833785
75 0.00598400 <= 839462921

• Estimated selectivity for B, selectivity = 0.029624, upper limit = 0.034606. • The SARG values now span multiple cells • Interpolation estimates how much of cells 68 and 75 to use, since not all of those two cells qualify

  22. Some Statistics Related FAQs cont. • How many steps should I request? • It will depend on your data and your queries • Increase requested steps to get Frequency count cells when there are highly duplicated values • An FC cell represents only one value – very accurate weight • Range SARGs will estimate what portion of a cell qualifies for the SARG • More cells means narrower cells (they represent fewer values) • Narrower cells mean more accurate estimates • Can have an effect on equi-SARGs – lower selectivity

  23. Removing Statistics Can Affect Query Plans • Sometimes no statistics are better than having them • This will usually be an issue when very dense columns are involved:

Histogram for column: "E"
Step Weight Value
1 0.00000000 < "no"
2 0.47256401 = "no"
3 0.00000000 < "yes"
4 0.52743602 = "yes"

• This can also show up when you have 'spikes' (Frequency count cells) in the distribution

  24. Removing Statistics Can Affect Query Plans cont. • select count(*) from TW4 where E = "yes" and C = 825765940 • The table…has 1000000 rows, 24098 pages, • Estimated selectivity for E, selectivity = 0.527436, upper limit = 0.527436. • Estimating selectivity of index 'E_AA_B', indid 6 scan selectivity 0.52743602, filter selectivity 0.527436 527436 rows, 174107 pages • The best qualifying index is 'E_AA_B' (indid 6) costing 174107 pages, with an estimate of 526 rows • FROM TABLE TW4 Nested iteration. Table Scan.

  25. Removing Statistics Can Affect Query Plans cont. • delete statistics TW4(E) • Estimated selectivity for E, selectivity = 0.100000. • Estimating selectivity of index 'E_AA_B', indid 6 scan selectivity 0.100000, filter selectivity 0.100000 100000 rows, 20584 pages • The best qualifying index is 'E_AA_B' (indid 6) costing 20584 pages, with an estimate of 92 rows • FROM TABLE TW4 Nested iteration. Index : E_AA_B Forward scan. Positioning by key.

  26. Maintaining Tuned Statistics Tuned statistics will add to your maintenance Any statistical value you write to sysstatistics either via optdiag or sp_modifystats will be overwritten by update statistics • Keep optdiag input files for reuse • If needed get an optdiag output file, edit it and read it in • Keep scripts that run sp_modifystats • Rewrite tuned statistics after running update statistics that affects the column with the modified statistics
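
A hedged sketch of the optdiag round trip this implies (database, login, server and file names are placeholders): dump the statistics, edit the values being tuned, read the file back in – and rerun the read step whenever update statistics overwrites the tuned values.

optdiag statistics mydb..TW4 -o TW4.opt -Usa -Ppassword -SMYSERVER
optdiag statistics mydb..TW4 -i TW4.opt -Usa -Ppassword -SMYSERVER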

  27. Monitoring Table/Index Level Fragmentation Using The Statistics • Fragmentation can be both an optimizer and a space management concern • The more fragmentation the less efficient page reads are • Deleted rows – fewer rows per page, affects costing • Forwarded rows – 2 I/Os each, the optimizer adds them to costing • Empty data/leaf pages – more reads may be necessary • Clustering can get worse • Watch the DPCR of the table or APL clustered index • In general the Cluster Ratios are not a good indicator of fragmentation since they are often low under normal conditions • Use optdiag outputs to monitor these values

  28. Monitoring Table/Index Level Fragmentation Using The Statistics cont. • ASE 12.0 and above: check the 'Space utilization' value • 'Large I/O efficiency' is another value to watch

Empty data page count: 0
Forwarded row count: 0.0000000000000000
Deleted row count: 0.0000000000000000
Derived statistics:
Data page cluster ratio: 0.9994653835872761
Space utilization: 0.9403543288085808
Large I/O efficiency: 1.0000000000000000

• 'Space utilization' and 'Large I/O efficiency' are not used by the optimizer • The further from 1 the more fragmentation there is
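
The dynamically maintained counters can also be watched straight from systabstats – a sketch, with catalog column names per ASE 11.9+ (verify on your version); rising forwarded/deleted row counts on the data layer suggest growing fragmentation:

select indid, forwrowcnt, delrowcnt, emptypgcnt
from systabstats
where id = object_id('TW4')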

  29. Maintaining the Statistics • When data changes the statistics become out of date • In general, up-to-date statistics are needed to get the best query plans • Statistics are usually updated using update statistics commands • The more statistics you have the more maintenance • It's a trade-off between the gain in query performance and the increased statistics maintenance • There's no point in updating statistics if the table is static

  30. Update Statistics • Update statistics has been extended to allow for placement of statistics on columns • update statistics table_name (col_name) • update index statistics table_name [ind_name] • update all statistics table_name • Specify the requested number of steps (cells) to use when building the column's histogram: update statistics table_name (col_name) using 200 values
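
Concrete forms of the above, sketched against the deck's TW4 table (the index name E_B is carried over from the earlier examples):

update statistics TW4 (C)                   -- one column's histogram and densities
update index statistics TW4 E_B             -- every column of index E_B
update all statistics TW4                   -- every column of the table
update statistics TW4 (C) using 200 values  -- request 200 steps for the histogram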

  31. How Update Statistics Works • Column and table/index values have to be read in order to gather the statistics • What does it do? • Reads the column to gather information for density and histogram, writes the column level statistics • While reading the column it gathers index/table level statistics – row & page counts, forwarded rows, deleted rows, the cluster ratios, etc. • Takes a sample value every X rows for a histogram boundary value (based on the number of rows and requested steps) • If the same value is seen for multiple steps it is saved to make a Frequency count cell

  32. How Update Statistics Works cont. • Values have to be in sorted order for statistics gathering • If it's the leading column of an index no sort is necessary – just scan the index leaf for statistics • If not the leading column of an index – create a worktable, read the values in, sort and scan for statistics • update statistics tab_name (col_name) – a table scan will be done to read the column • update index statistics tab_name [ind_name] – then only an index scan (with a sort of the inner columns) • The sort is done in a worktable in tempdb; update index statistics and update all statistics will use a lot of tempdb space unless sampling is used (see the sketch below)
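
The I/O difference between the two forms, sketched on TW4 with the E_B index from the earlier slides:

update statistics TW4 (B)        -- B does not lead an index: table scan,
                                 -- then sort in a tempdb worktable
update index statistics TW4 E_B  -- leading column E is read in order from
                                 -- the index leaf; inner column B is still
                                 -- sorted in a tempdb worktable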

  33. Some Statistics Related Myths & Legends • "Update statistics will result in improved performance" • It only guarantees up-to-date statistics • Due to the distribution, the statistics may not give a 'pretty' picture of the column • "Always use update all statistics" • You rarely need statistics on all columns of a table • It can take a VERY long time to run, and makes maintenance difficult at best • Consider adding stats to composite index columns instead

  34. Statistics Tools • Some useful tools for working with the statistics – some are by Sybase, some are by users • Optdiag – read, write and simulate the statistics • Well known and documented • sp_modifystats – make modifications to density values (more functionality coming soon – 11.9.2.4, 12.0.0.4, 12.5) • sp__optdiag (that's a double underscore) – by Kevin Sherlock • Displays the statistics à la optdiag output – very handy • http://www.sypron.nl/download.html
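
A hedged sketch of sp_modifystats as documented in the later releases the slide mentions – the exact syntax should be checked against your version. Here the Total density of TW4.A is overwritten with an absolute value:

exec sp_modifystats "mydb..TW4", "A", MODIFY_DENSITY, total, absolute, "0.001"

As slide 26 warns, update statistics will silently overwrite a value written this way.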

  35. Sampling for Update Statistics • A new feature in 12.5.0.3 • Designed to dramatically reduce the time it takes to run update statistics • 'Opens' your maintenance window • Decreases the cost of using ASE • Randomly selected pages are read instead of reading all pages to gather the column level statistics – less I/O • The percentage of pages to be sampled can be specified: update statistics tab_name with sampling = X percent • X is the percentage of pages you want to sample • Can be between 1 and 100
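
A minimal sketch on the deck's TW4 table – note the column form, since (as the 'How Does It Work?' slides explain) the sampling rate is honored for a named column but ignored for the major attributes of an index:

update statistics TW4 (C) with sampling = 10 percent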

  36. Definitions • Column Level Statistics – those statistics that describe the values in the column to the optimizer; an attribute of a column (i.e., the histogram and density values) • Sampling – randomly reading rows from a specified percentage (subset) of pages rather than all pages of the table in order to gather column level statistics • Sampling Rate – the specified percentage of pages to read • Full Scan – gathering statistics by reading all pages of the object (table or index) • Major Attribute of an Index – the 'leading' column of an index as listed in the create index command

  37. Sampling for Update Statistics cont. • Unofficial tests show that a sampling rate of 10% on a 1 million row numeric column reduces the time for update statistics to run from 9 minutes to 30 seconds

  38. Sampling for Update Statistics cont. • The resulting histogram will be based on the values that are sampled • It will differ from a histogram obtained from a 'full scan' update statistics • The lower the specified percentage of sampling the more the histogram will differ from a full scan histogram • Test your queries against sampled statistics – in most cases you won't see any major changes • Density values are not updated by sampling • In most cases this won't be an issue

  39. Why Sampling for Update Statistics? • As datasets have grown, the time it takes to run update statistics has also grown – dramatically!! • This became more of an issue with 'update index statistics', introduced in 11.9.x, due to the extra sort in a worktable • TCO and auto-tuning/admin require a faster way to run update statistics • Without a faster update statistics neither effort would succeed • Speeding up update statistics is a long-standing customer feature request • Random page sampling is the most I/O efficient method • It dramatically decreases the time to run update statistics

  40. Why Sampling for Update Statistics? cont. • Some timing test results – not official, not for general release • 'Your mileage may vary' • Timings are from tests run by Sybase QA • 1 million row int column – timings based on 'elapsed time'

20% sampling rate –
Full scan time: 2465850
Sampling time: 398783
Percentage of time saved (elapsed time): 83%

10% sampling rate –
Full scan time: 2139013
Sampling time: 153130
Percentage of time saved (elapsed time): 92%

• Variations in full scan time are taken into account

  41. How Does It Work? • Specify the percentage of pages to read via update statistics 'with sampling = X percent' • The percent value can be between 1 and 100 • 'with' extensions must follow 'using' – with sampling = X percent and/or with consumers = X must follow using X values: update statistics authors (auth_id) using 40 values with sampling = 10 percent • Sampling reads all rows from each page read • Row values are moved to the worktable to be sorted and the statistics gathered • This saves tempdb space since the sampled sets of values are smaller than if the whole column was read into the worktable

  42. How Does It Work? cont. • Specific update statistics syntax and its effects • update statistics table_name [index_name] with sampling = X percent • Will full scan index pages to update/create statistics on the major attribute(s) of the specified index or all indexes on the table, ignoring the specified sampling rate – sampling will not be done

  43. How Does It Work? cont. update index statistics tab_name [ind_name] with sampling = X percent • Will full scan index pages to update/create statistics for the major attribute(s) of the indexes or specified index on the table, ignoring sampling. • For minor index attributes will use sampling to scan the requested percentage of pages, read those values into a worktable, sort and gather statistics from there. • The space used in tempdb will decrease as the sampling rate decreases update statistics tab_name (col_name) with sampling = X percent • Will use sampling to update/create statistics for the specified column using the specified sampling rate. This applies to all columns whether major attributes of an index or not • Will not affect multi-column density values
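
Sketches of the behaviors just described, again on TW4 with the E_B index assumed from earlier slides:

update index statistics TW4 E_B with sampling = 10 percent
    -- major attribute E: full index scan; inner column B: 10% page sample
update statistics TW4 (B) with sampling = 10 percent
    -- named column: the 10% sample is honored whether or not B leads an index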

  44. How Does It Work? cont. update all statistics table_name with sampling = X percent • Will full scan index pages to gather statistics for the major attribute of all indexes – will not use sampling on these columns • Will use sampling to gather statistics for all columns that are not the major attribute of an index • The space used in tempdb will decrease as the sampling rate decreases

  45. How Does It Work? cont. • Sampling is not used for create index • Since a full scan is required to build an index there is no additional cost for building the statistics

  46. Trade Offs • A sampled set is never as accurate as examining the entire dataset • A histogram created with sampling is not likely to match a histogram created via a full scan • Histogram boundary values will vary • Cell weights will vary • Minimum and maximum histogram boundary values will vary • Since cell weight(s) and Range cell density are used to cost all SARGs, a histogram from a sampled set will have an effect on SARG costing • Variations in the upper and lower histogram values may result in 'out of bounds costing' by the optimizer • The smaller the sampling rate the greater the variance is likely to be

  47. Trade Offs cont. • If there are existing density values they will not be overwritten. If there are no density values a default value of 0.100000 will be used for both Range cell and Total density values • There is currently no information saved about the use of sampling (whether or not it was used and the sampling rate) • Different cell types may appear • As the sampling rate decreases it is possible that Frequency count and/or Range cells may appear where they didn’t exist prior to sampling • The same pages will be resampled if the dataset is static and the same sampling rate used

  48. Examples of Variations in the Histogram • Full scan histogram –

Step Weight Value
1 0.00000000 <= 154218543
2 0.05315000 <= 805909305
3 0.05305000 <= 808793353
4 0.05311000 <= 822687028
5 0.05304000 <= 825700873
6 0.05314000 <= 839464505
7 0.05292000 <= 842544649
8 0.05305000 <= 858863369
<edited>
20 0.04621000 <= 960051465

• Note the boundary values, cell weights and the upper and lower boundary values • Variations within the histogram are the main issue that needs to be tested

  49. Examples of Variations in the Histogram cont. • 10% sampled histogram –

Step Weight Value
1 0.00000000 <= 154218799
2 0.05253968 <= 805909300
3 0.05253968 <= 808728585
4 0.05253968 <= 822686772
5 0.05269841 <= 825636617
6 0.05349206 <= 839464498
7 0.05253968 <= 842543113
8 0.05253968 <= 858797321
<edited>
20 0.04888889 <= 960050979

• Note the variations in the boundary values, cell weights and the upper and lower boundary values

  50. Tuning and Troubleshooting • 'Trial-and-error' testing/tuning will need to be done to determine the optimal sampling rate for a given dataset • In most cases variations in the statistics will have no effect; in other cases small variations may change query plans • There is no 'rule of thumb' on what sampling rate to use • In some cases the same sampling rate may be fine across all or most tables/columns • In some cases sampling may not result in efficient plans • Use showplan and traceon 302/310 outputs to track changes to the query plan as the sampling rate changes • Using sample queries, get the above outputs from statistics gathered by a full scan; then update statistics with the sampling rate, rerun the query and compare outputs (a sketch follows)
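
A sketch of that compare-and-rerun loop in an isql session, reusing the range SARG query from slide 20 (trace flag 3604 routes dbcc output to the client):

dbcc traceon(3604, 302, 310)
go
set showplan on
go
select * from TW2 where B between 825570000 and 830000000
go
dbcc traceoff(3604, 302, 310)
go

Run this once against full scan statistics and once after update statistics with sampling, then compare the selectivity lines and the final plans.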
