
Estimation, Statistics and “Oh My!”


Presentation Transcript


  1. Estimation, Statistics and “Oh My!” Dave Ballantyne Clear Sky SQL

  2. The ‘Me’ slide • Freelance Database Developer/Designer • Specializing in SQL Server for 15+ years • SQLLunch • Lunchtime usergroup • London & Cardiff, 2nd & 4th Tuesdays • TSQL Smells script author • http://tsqlsmells.codeplex.com/ • Email: dave@ClearSkySql.co.uk • Twitter: @DaveBally

  3. “Oh my!” • This is also me • Far too often… • Estimates are central • Statistics provide the estimates

  4. The plan • Every journey starts with a plan • Is this the ‘best’ way to Lyon? • Fastest • Most efficient • Shortest • SQL Server makes similar choices

  5. Estimation • SQL Server uses a cost-based optimizer • Cost based = compares predicted costs • Therefore estimations are needed • Every choice is based on these estimations • It has to form a plan up front • And the plan cannot change during execution if it turns out to be ‘wrong’

  6. Estimation – Per execution

  7. Estimation • Costs are not actual costs • They have little or no relevance to the real execution costs • Cost is not a metric • A cost of 1 does not equal 1 of anything (seconds, IOs, …) • Their purpose: • Pick between different candidate plans for a query

  8. How is estimation calculated? - Statistics • Included within an index • Automatically created and updated • The optimizer decides “It would be useful if I knew…” • Auto-created statistics cover only a single column • Not in read-only databases • Can be manually created and updated • Auto-update can operate asynchronously (AUTO_UPDATE_STATISTICS_ASYNC)
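
  A minimal T-SQL sketch of the options mentioned above; the database name MyDb and the table/statistic names are illustrative only, not from the deck:

    -- Auto-create / auto-update behaviour is set per database
    ALTER DATABASE MyDb SET AUTO_CREATE_STATISTICS ON;
    ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS ON;
    -- Auto-update can run asynchronously so queries don't wait for the update
    ALTER DATABASE MyDb SET AUTO_UPDATE_STATISTICS_ASYNC ON;

    -- Statistics can also be created and updated manually
    CREATE STATISTICS st_Customer_LastName ON dbo.Customer (LastName);
    UPDATE STATISTICS dbo.Customer st_Customer_LastName;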

  9. Statistics Data • DBCC SHOW_STATISTICS(tbl, stat) • WITH STAT_HEADER • Displays statistics header information • WITH HISTOGRAM • Displays detailed step information • WITH DENSITY_VECTOR • Displays only the density of the columns • Density = 1 / (count of distinct values) • WITH STATS_STREAM • Binary stats blob (undocumented / not supported)
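
  For example, assuming an illustrative table dbo.Customer with a statistic named st_Customer_LastName (not names from the deck), the options look like this:

    DBCC SHOW_STATISTICS ('dbo.Customer', st_Customer_LastName) WITH STAT_HEADER;
    DBCC SHOW_STATISTICS ('dbo.Customer', st_Customer_LastName) WITH HISTOGRAM;
    DBCC SHOW_STATISTICS ('dbo.Customer', st_Customer_LastName) WITH DENSITY_VECTOR;
    -- Returns the raw binary blob; undocumented and not supported for general use
    DBCC SHOW_STATISTICS ('dbo.Customer', st_Customer_LastName) WITH STATS_STREAM;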

  10. Statistics – WITH STAT_HEADER • Rows = total rows in the table • Rows Sampled = rows read and sampled • Density = rows / distinct values (excluding boundary values) – not used by the optimizer • Average Key Length = average byte length of all columns • Steps = number of steps in the histogram (max 200)

  11. Statistics – WITH HISTOGRAM • Each step contains data on a range of values • The range is defined by <= RANGE_HI_KEY and > the previous step’s RANGE_HI_KEY • In the example: EQ_ROWS = 6 rows equal to the boundary value 470 • RANGE_ROWS = 17 rows > 407 and < 470 • DISTINCT_RANGE_ROWS = 9 distinct values > 407 and < 470 • AVG_RANGE_ROWS = density of the range (17 / 9) • Row 3 of the histogram covers > 2 and <= 407, row 4 covers > 407 and <= 470

  12. Statistics in practice • Find the step where the predicate value is <= RANGE_HI_KEY and > the previous RANGE_HI_KEY • When the predicate value equals the RANGE_HI_KEY, Estimate = EQ_ROWS

  13. Statistics in practice • Find the step where the predicate value is <= RANGE_HI_KEY and > the previous RANGE_HI_KEY • When the predicate value is < the RANGE_HI_KEY (it falls inside the step), Estimate = AVG_RANGE_ROWS
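
  A sketch of the two cases, using the histogram numbers from slide 11 (boundary value 470 with EQ_ROWS = 6; RANGE_ROWS = 17 and DISTINCT_RANGE_ROWS = 9 between 407 and 470); the table and column names are illustrative:

    -- Predicate equals a step boundary (RANGE_HI_KEY = 470): estimate = EQ_ROWS = 6
    SELECT * FROM dbo.Orders WHERE OrderQty = 470;

    -- Predicate falls inside the step (between 407 and 470):
    -- estimate = AVG_RANGE_ROWS = RANGE_ROWS / DISTINCT_RANGE_ROWS = 17 / 9 = ~1.89
    SELECT * FROM dbo.Orders WHERE OrderQty = 433;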

  14. Statistics • Greater accuracy on range boundary values • Based upon the ‘maxdiff’ algorithm • The histogram is relevant only for the leading column • It can estimate for the Smiths • But not for Smiths called John • Additional columns get cumulative density only • Density = 1 / (count of distinct values)

  15. WITH DENSITY_VECTOR • 19,517 distinct values • Density vector = 1 / (count of distinct values) • 1 / 19,517 = ~5.123738E-05

  16. DENSITY_VECTORS in practice

  17. DENSITY_VECTORS in practice • 19,972 (total rows) * 5.123738E-05 = ~1.02331

  18. DENSITY_VECTORS in practice • All Diazes will estimate to the same value: • As will all Smiths, Joneses & Ballantynes • The statistics contain no detail on FirstName • Only how many distinct values there are • And they assume those values are evenly distributed • Not only across a single surname • But across ALL surnames

  19. Multiple Statistics • So far we have only used a single statistic for estimations • For this query: • To provide the best estimate the optimizer ideally needs to know about LastName and FirstName

  20. Multiple Statistics - Usage • LastName = ‘Sanchez’ AND FirstName = ‘Ken’ • Correlating multiple stats for AND conditions • Find the intersection
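
  The kind of query being discussed, sketched against an AdventureWorks-style Person table (illustrative; the actual demo tables are not shown in the transcript):

    -- Two single-column statistics (LastName, FirstName) must be correlated
    SELECT BusinessEntityID
    FROM   Person.Person
    WHERE  LastName  = 'Sanchez'
    AND    FirstName = 'Ken';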

  21. Multiple Statistics • AND logic • Intersection estimate = (Selectivity 1 * Selectivity 2) • OR logic • Row estimate 1 + Row estimate 2 - (Intersection estimate) • Avoids double counting

  22. Multiple Stats – In action • 10% * 10% = 1% • 10% * 20% = 2% • No correlation in the data is assumed
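
  Worked through with the same illustrative selectivities, applying the AND and OR formulas from slide 21 (no correlation assumed):

    AND : 0.10 * 0.10 = 0.01        -> 1% of the table
    OR  : 0.10 + 0.10 - 0.01 = 0.19 -> 19% of the table
    AND : 0.10 * 0.20 = 0.02        -> 2% of the table
    OR  : 0.10 + 0.20 - 0.02 = 0.28 -> 28% of the table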

  23. Aged Statistics • To keep statistics fresh they get ‘aged’ • Update thresholds: • 0 rows to > 0 rows • <= 6 rows (for temp tables): 6 modifications • <= 500 rows: 500 modifications • >= 501 rows: 500 + 20% of the table • Crossing the threshold causes statistics to be updated on next use • And causes statements to be recompiled on next execution • Temp tables in stored procedures are more complex
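
  As a worked example of the >= 501-row rule (row count chosen purely for illustration): a 100,000-row table needs 500 + (20% * 100,000) = 20,500 modifications before its statistics are considered stale.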

  24. Large Tables • Trace flag 2371 - dynamically lowers the statistics update threshold for large tables • > 25,000 rows • 2008 R2 (SP1) & 2012
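
  A sketch of enabling the flag globally (it can also be supplied as a startup parameter):

    -- -1 applies the trace flag globally rather than to the current session
    DBCC TRACEON (2371, -1);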

  25. Filtered Statistics • Useful when the density vector is not accurate enough • Manually created statistics only • An additional WHERE clause can be utilised • In the header: Rows Sampled = number of filtered rows sampled • Filter Expression = the filter • Unfiltered Rows = total rows in the table before the filter
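
  A minimal sketch of a manually created filtered statistic, assuming an illustrative dbo.Customer table with City and LastName columns:

    -- Histogram is built only over the rows where City = 'London'
    CREATE STATISTICS st_Customer_LastName_London
    ON dbo.Customer (LastName)
    WHERE City = 'London';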

  26. Filtered Statistics • The filter is matched and the histogram is used • Density of London * Density of Ramos

  27. Sampled Data • For ‘large’ data sets a smaller sample can be used • Here 100% of the rows have been sampled • Here ~52% of the rows have been sampled • Statistics will assume the same distribution of values throughout the entire dataset
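
  The sample rate can also be controlled when updating statistics manually; a sketch using the same illustrative table and statistic names as above:

    -- Read every row (100% sampled)
    UPDATE STATISTICS dbo.Customer st_Customer_LastName WITH FULLSCAN;

    -- Read roughly half the rows; the sampled distribution is assumed to hold for the rest
    UPDATE STATISTICS dbo.Customer st_Customer_LastName WITH SAMPLE 52 PERCENT;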

  28. Non Literal values • Also Auto/Forced Parameterization

  29. Non Literal Values • On equality the average density is assumed • Remember the density vector? • 19,972 (total rows) * 0.0008285004 = ~16.5468
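
  The situation being described: with a local variable the optimizer cannot see the value at compile time, so it falls back to the density vector. A sketch against an AdventureWorks-style table (illustrative), reusing the numbers on the slide:

    -- Literal: the histogram can be used, so the estimate is value-specific
    SELECT * FROM Person.Person WHERE FirstName = 'Ken';

    -- Local variable: value unknown at compile time,
    -- estimate = total rows * density = 19,972 * 0.0008285004 = ~16.5
    DECLARE @FirstName nvarchar(50) = 'Ken';
    SELECT * FROM Person.Person WHERE FirstName = @FirstName;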

  30. Non Literal Values Stored Procedures

  31. Parameter Sniffing • Enables a better plan to be built • (most of the time) • Uses specific values rather than average values • Values can be seen in properties pane • Erratic execution costs are often Parameter Sniffing problems

  32. Optimize For • Force a value to be used in optimization • A literal value • Or UNKNOWN • Falls back to density information
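
  A sketch of both forms inside a stored procedure; the procedure, table and value are illustrative, not from the deck:

    CREATE PROCEDURE dbo.GetPeopleByLastName
        @LastName nvarchar(50)
    AS
    SELECT *
    FROM   Person.Person
    WHERE  LastName = @LastName
    -- Optimize using a specific literal value...
    OPTION (OPTIMIZE FOR (@LastName = N'Smith'));
    -- ...or fall back to the density information with:
    -- OPTION (OPTIMIZE FOR UNKNOWN);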

  33. Forcing Recompilation • OPTION (RECOMPILE) • Recompiles on every execution • The plans aren’t cached • No point, as by definition the plan won’t be reused • Uses variables as if they were literals • More accurate estimates
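
  With OPTION (RECOMPILE) even a local variable is treated as a known value at compile time; a sketch with illustrative names:

    DECLARE @LastName nvarchar(50) = N'Smith';
    -- Recompiled on every execution; the current value of @LastName is used for estimation
    SELECT *
    FROM   Person.Person
    WHERE  LastName = @LastName
    OPTION (RECOMPILE);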

  34. Unbalanced data • The Achilles’ heel • A plan is fixed • But… the facts are: • More Smiths than Ballantynes • More customers in London than in Leeds

  35. Unbalanced Data This is known as the ‘Plan Skyline’ problem

  36. Unbalanced Data

  37. Unbalanced Data • But wait… • It gets worse • That was only EQ_ROWS • RANGE_ROWS ??

  38. Unbalanced Data • Variations in plans • Shape • Which is the ‘primary’ table? • Physical joins • Index (non)usage • Bookmark lookups • Memory grants • 500 rows need more memory than 50 • Parallel plans

  39. Unbalanced data Demo • Can the engine resolve this? • No! • We can help though • And without recompiling • The aim is to prevent ‘car-crash’ queries • Not necessarily to provide a ‘perfect’ plan

  40. Which stats are used? • “There is always a trace flag” • Paul White (@SQL_Kiwi) • TF 9292 • Shows when a statistics header is read • TF 9204 • Shows when statistics have been fully loaded • TF 8666 • Displays internal debugging info in the query plan
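
  These flags can be enabled for a single statement with QUERYTRACEON (undocumented, sysadmin-only); a sketch with an illustrative query, output appearing in the messages tab:

    DBCC TRACEON (3604);  -- route trace output to the client
    SELECT *
    FROM   Person.Person
    WHERE  LastName = 'Smith'
    OPTION (QUERYTRACEON 9292, QUERYTRACEON 9204);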

  41. TF 9292 and 9204

  42. TF 8666

  43. References • Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 (white paper) • Plan Caching in SQL Server 2008 (white paper) • Microsoft SQL Server 2008 Internals (Microsoft Press)

  44. Q&A My Email : dave@ClearSkySQL.co.uk Twitter : @DaveBally
