This guide discusses measuring SQL execution outliers to track performance better. By analyzing elapsed times for SQL queries, in particular MERGE operations into an orders table, it shows why averages and similar summary metrics hide the executions users actually feel, and how percentiles (such as p99) capture the worst runs, giving a clearer picture of user experience and query efficiency.
Measuring SQL Execution Outliers (to track performance better) by Maxym Kharchenko
A very important SQL

MERGE INTO orders_table
USING dual
   ON (dual.dummy IS NOT NULL
       AND id = :1 AND p_id = :2
       AND order_id = :3 AND relevance = :4
       AND …

Typical elapsed time: 100 ms
*Bad* elapsed time: > 200 ms
SQL latency metrics

    Elapsed                  Elapsed Time
   Time (s)    Executions    per Exec (s)  %Total   %CPU    %IO  SQL Id
-----------  ------------  --------------  ------  -----  -----  -------------
      635.5        10,090             0.1    31.5   16.5   77.6  fskp2vz7qrza2
Module: MYmodule
merge into orders_table using dual on (dual.dummy is not null and id = :1
and p_id = :2 and order_id = :3 and relevance = :4 and …
What exactly is “average”?

[Chart: latency histogram with the average marked]
“Average” = the “most typical” value: roughly 95% of all executions fall near it.
You can make predictions with the “average”: with an average of 100 ms, the probability of an execution taking >= 200 ms is about 0.6%.
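As a rough sanity check on that claim (the standard deviation is not given in the original, so 40 ms is an assumption): if latencies were approximately normal with mean 100 ms and sigma 40 ms, then z = (200 - 100) / 40 = 2.5, and P(elapsed >= 200 ms) = 1 - Phi(2.5), which is approximately 0.6%, matching the figure above.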
People feel the *BAD* variance, not the average.
Percentiles

[Charts: the same latency distribution, first with the “average” marked, then with the 99th percentile marked]
Average: (what we think) typical latency is 102 ms
p99: the worst 1% of executions are at least as bad as 532 ms
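Once individual execution timings are captured, both numbers can be computed directly in SQL. A minimal sketch, assuming a hypothetical elapsed_log capture table with one row per execution:

-- elapsed_log(sql_id, elapsed_ms) is a hypothetical capture table
select round(avg(elapsed_ms), 1)                    as avg_ms,
       percentile_cont(0.99)
         within group (order by elapsed_ms)         as p99_ms
from   elapsed_log
where  sql_id = 'fskp2vz7qrza2';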
Application-side tracing

App:  start_exec = time()  …  elapsed = time() - start_exec
Db:   Exec: 4fucahsywt13m:19731969

• “True” user experience
• Precise (captures “everything”)
• (Lots of) DIY by developers
• Captures *not only* db time
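A minimal sketch of that timing wrapper, shown here in PL/SQL for brevity (in practice this code lives in the application language, and the names are illustrative):

declare
  t0      timestamp := systimestamp;      -- start_exec = time()
  elapsed interval day to second;
begin
  -- ... execute the SQL call being measured here, e.g. the MERGE ...
  null;
  elapsed := systimestamp - t0;           -- elapsed = time() - start_exec
  dbms_output.put_line('elapsed: ' || elapsed);
end;
/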
Server-side (10046) tracing

App:  start_exec = time()  …  elapsed = time() - start_exec
Db:   Exec: 4fucahsywt13m:19731969

• Precise (captures “everything”)
• Detailed: breakdown by events and SQL “stages”
• Cumbersome to process (lots of individual trace files and “events”)
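For reference, session-level 10046 tracing is switched on and off like this (level 12 adds bind and wait detail):

alter session set events '10046 trace name context forever, level 12';
-- ... run the workload to be traced ...
alter session set events '10046 trace name context off';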
Sampling

• v$sql.elapsed_time
Sampling

with number_generator as (
  select level as l
  from   dual
  connect by level <= 1000
),
target_sqls as (
  select /*+ ordered no_merge use_nl(s) */ …
  from   number_generator i, gv$sql s
Sampling

SQL> @sqlc fdcz4kx11era5

                                      Gets    Ela (ms)  LAST
 C#    Plan hash   EXECUTIONS        pExec       pExec  Active
---  -----------  -----------  -----------  ----------  --------------
  2    245875337    1,700,541       444.62      137.57  +0 00:00:01
  7    245875337            2        23.50       21.39  +0 01:15:16
  3    245875337            1        26.00       10.38  +27 04:42:52
Sampling

SQL> @ssql fdcz4kx11era5 2 1000

         Elapsed       CPU          IO      App     CCS
 Ex         TIME      TIME        TIME     TIME    TIME     Pct
---  -----------  --------  ----------  -------  ------  ------
  1          330         0           0        0       0       0
  1          340     1,000           0        0       0    3.33
  1          786       999           0        0       0    6.67
  1        1,518     2,000         188        0       0     10*
  2       11,963     1,999      11,103        0       0   13.33
  1       14,851     4,999      10,908        0       0   16.67
  1       15,724     2,000      14,780        0       0      20
  1       16,471     2,000      15,163        0       0   23.33
…
  1       90,256     5,999      87,365        0       0   86.67
  1       97,171     2,000      93,585        0      27      90
  1      120,635     1,999     117,660        0       0   93.33
  1      142,201     6,999     138,853        0       0   96.67
  1      167,552     4,998     165,333        0       0     100
Sampling

SQL> @ssql2 fdcz4kx11era5 2 50000 avg 10

               Elapsed                 CPU           IO
Pct     Execs  TIME                   TIME         TIME
----  -------  --------------  -----------  -----------
p0        148  .23-7.11                .89         2.30
p10       148  7.18-14.03             1.11         9.44
p20       146  14.03-20.26            1.48        15.82
p30       143  20.39-29.01            1.86        22.92
p40       146  29.1-40.73             1.91        32.63
p50       143  40.77-55.21            2.37        45.50
p60       142  55.22-77.92            3.15        63.09
p70       145  77.99-113.33           3.58        90.72
p80       141  113.41-173.64          4.46       136.22
p90       138  174.34-634.15          6.83       245.30
Sampling

SQL> @ssql3 fdcz4kx11era5 2 50000 avg 10

                                                   Elapsed        CPU          IO
Bucket  Range (ms)             Execs  Graph           TIME       TIME        TIME
------  --------------------  ------  ----------  --------  ---------  ----------
     1  .19-51.81                686  ##########     22.39       1.51       20.91
     2  51.81-103.44             303  ####           76.37       2.89       73.75
     3  103.44-155.07            198  ##            127.59       3.55      124.23
     4  155.07-206.69             91  #             174.25       4.68      169.82
     5  206.69-258.32             46                224.91       5.47      220.11
     6  258.32-309.95             22                267.26       6.90      261.46
     7  309.95-361.57              7                339.04       9.00      331.30
     8  361.57-413.2               8                264.19       6.90      258.24
     9  413.2-464.83               3                318.62       6.00      311.41
    10  464.83-516.45              2                492.26      10.00      483.53
The scripts are here: http://intermediatesql.com
Sampling

with i_gen as (
  select level as l
  from   dual
  connect by level <= &REPS
),
target_sqls as (
  select /*+ ordered no_merge use_nl(s) */ …
  from   i_gen i, gv$sql s

• SQL access to data
• Simplified time breakdown
• Can capture “hours”
• Slightly imprecise (captures 90-95% of runs)
• x$ data: “suspect”?

A completed sketch of this query follows the list.
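A completed, hedged version of the sampling query (the filter and delta logic are an assumption, not the author's original script): the ordered nested-loop join forces gv$sql to be re-read once per generated row, and deltas between consecutive reads approximate individual runs.

with i_gen as (
  select level as l from dual connect by level <= 1000
),
samples as (
  select /*+ ordered no_merge use_nl(s) */
         i.l            as sample#,
         s.executions,
         s.elapsed_time                    -- cumulative, in microseconds
  from   i_gen i, gv$sql s
  where  s.sql_id = 'fdcz4kx11era5'
)
select sample#,
       executions   - lag(executions)   over (order by sample#) as exec_delta,
       elapsed_time - lag(elapsed_time) over (order by sample#) as ela_delta_us
from   samples;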
Monitoring

SQL> desc v$session
  sql_id
  sql_exec_start
  sql_exec_id

Individual executions are captured in v$sql_monitor; capture can be forced with the /*+ MONITOR */ hint.
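A minimal query sketch against the monitoring view (columns as documented; the sql_id is the one from the earlier sampling slides):

select sql_exec_id,
       sql_exec_start,
       status,
       round(elapsed_time / 1e6, 3) as elapsed_s
from   v$sql_monitor
where  sql_id = 'fdcz4kx11era5'
order  by sql_exec_start;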
Monitoring

NAME                       VALUE    DESCRIPTION
-------------------------  -------  ------------------------------------------------------------
_sqlmon_binds_xml_format   default  format of column binds_xml in [G]V$SQL_MONITOR
_sqlmon_max_plan           480      Maximum number of plans entry that can be monitored.
                                    Defaults to 20 per CPU
_sqlmon_max_planlines      300      Number of plan lines beyond which a plan cannot be monitored
_sqlmon_recycle_time       60       Minimum time (in s) to wait before a plan entry can be recycled
_sqlmon_threshold          5        CPU/IO time threshold before a statement is monitored.
                                    0 is disabled

• Precise (captures “everything”)
• SQL access to data
• Capture size is limited (think: “seconds”)
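Individual monitored executions can also be rendered as a readable report through the documented DBMS_SQLTUNE API:

select dbms_sqltune.report_sql_monitor(
         sql_id => 'fdcz4kx11era5',
         type   => 'TEXT') as report
from   dual;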
Can I find the worst performers in ASH?

[Diagram: executions 1-11 on a timeline; successive one-second ASH samples catch only whatever is running at each instant, e.g. executions (1, 2, 3, 7), then (3, 5, 7, 9), then (7)]
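Counting ASH samples per execution shows why this is hard: ASH samples roughly once per second, so an execution appears about once per second of runtime, and sub-second runs usually leave no sample at all. A hedged sketch:

select sql_exec_id,
       min(sample_time) as first_seen,
       count(*)         as ash_samples    -- roughly seconds of runtime
from   v$active_session_history
where  sql_id = 'fdcz4kx11era5'
and    sql_exec_id is not null
group  by sql_exec_id
order  by ash_samples desc;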
Takeaways

• Percentiles are better performance metrics than averages
• Percentile calculation requires capturing (most of) individual SQL runs
• A number of ways exist to capture and measure individual SQL runs