This guide discusses measuring SQL execution outliers to track performance better. By analyzing elapsed times for SQL queries, in particular MERGE operations into an orders table, it shows why averages and similar summary metrics hide the executions users actually feel, and how percentiles (such as p99) capture the worst runs, giving a clearer picture of user experience and query efficiency.
Measuring SQL Execution Outliers (to track performance better) by Maxym Kharchenko
A very important SQL

MERGE INTO orders_table
USING dual
   ON (dual.dummy IS NOT NULL
       AND id = :1 AND p_id = :2
       AND order_id = :3 AND relevance = :4
       AND …

Typical elapsed time: 100 ms
*Bad* elapsed time: > 200 ms
SQL latency metrics

    Elapsed                  Elapsed Time
   Time (s)    Executions    per Exec (s)  %Total   %CPU    %IO  SQL Id
-----------  ------------  --------------  ------  -----  -----  -------------
      635.5        10,090             0.1    31.5   16.5   77.6  fskp2vz7qrza2
Module: MYmodule
merge into orders_table using dual on (dual.dummy is not null and id = :1
and p_id = :2 and order_id = :3 and relevance = :4 and …
What exactly is “average”?

[Chart: latency histogram with the average marked]
“Average” = the “most typical” value: roughly 95% of all executions fall near it.
You can make predictions with the “average”: with an average of 100 ms, the probability of an execution taking >= 200 ms is about 0.6%.
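As a rough sanity check on that claim (the standard deviation is not given in the original, so 40 ms is an assumption): if latencies were approximately normal with mean 100 ms and sigma 40 ms, then z = (200 - 100) / 40 = 2.5, and P(elapsed >= 200 ms) = 1 - Phi(2.5), which is approximately 0.6%, matching the figure above.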
People feel the *BAD* variance, not the average.
Percentiles

[Charts: the same latency distribution, first with the “average” marked, then with the 99th percentile marked]
Average: (what we think) typical latency is 102 ms
p99: the worst 1% of executions are at least as bad as 532 ms
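Once individual execution timings are captured, both numbers can be computed directly in SQL. A minimal sketch, assuming a hypothetical elapsed_log capture table with one row per execution:

-- elapsed_log(sql_id, elapsed_ms) is a hypothetical capture table
select round(avg(elapsed_ms), 1)                    as avg_ms,
       percentile_cont(0.99)
         within group (order by elapsed_ms)         as p99_ms
from   elapsed_log
where  sql_id = 'fskp2vz7qrza2';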
Application-side tracing

App:  start_exec = time()  …  elapsed = time() - start_exec
Db:   Exec: 4fucahsywt13m:19731969

• “True” user experience
• Precise (captures “everything”)
• (Lots of) DIY by developers
• Captures *not only* db time
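A minimal sketch of that timing wrapper, shown here in PL/SQL for brevity (in practice this code lives in the application language, and the names are illustrative):

declare
  t0      timestamp := systimestamp;      -- start_exec = time()
  elapsed interval day to second;
begin
  -- ... execute the SQL call being measured here, e.g. the MERGE ...
  null;
  elapsed := systimestamp - t0;           -- elapsed = time() - start_exec
  dbms_output.put_line('elapsed: ' || elapsed);
end;
/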
Server-side (10046) tracing

App:  start_exec = time()  …  elapsed = time() - start_exec
Db:   Exec: 4fucahsywt13m:19731969

• Precise (captures “everything”)
• Detailed: breakdown by events and SQL “stages”
• Cumbersome to process (lots of individual trace files and “events”)
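For reference, session-level 10046 tracing is switched on and off like this (level 12 adds bind and wait detail):

alter session set events '10046 trace name context forever, level 12';
-- ... run the workload to be traced ...
alter session set events '10046 trace name context off';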
Sampling

• v$sql.elapsed_time
Sampling

with number_generator as (
  select level as l
  from   dual
  connect by level <= 1000
),
target_sqls as (
  select /*+ ordered no_merge use_nl(s) */ …
  from   number_generator i, gv$sql s
Sampling

SQL> @sqlc fdcz4kx11era5

                                      Gets    Ela (ms)  LAST
 C#    Plan hash   EXECUTIONS        pExec       pExec  Active
---  -----------  -----------  -----------  ----------  --------------
  2    245875337    1,700,541       444.62      137.57  +0 00:00:01
  7    245875337            2        23.50       21.39  +0 01:15:16
  3    245875337            1        26.00       10.38  +27 04:42:52
Sampling

SQL> @ssql fdcz4kx11era5 2 1000

         Elapsed       CPU          IO      App     CCS
 Ex         TIME      TIME        TIME     TIME    TIME     Pct
---  -----------  --------  ----------  -------  ------  ------
  1          330         0           0        0       0       0
  1          340     1,000           0        0       0    3.33
  1          786       999           0        0       0    6.67
  1        1,518     2,000         188        0       0     10*
  2       11,963     1,999      11,103        0       0   13.33
  1       14,851     4,999      10,908        0       0   16.67
  1       15,724     2,000      14,780        0       0      20
  1       16,471     2,000      15,163        0       0   23.33
…
  1       90,256     5,999      87,365        0       0   86.67
  1       97,171     2,000      93,585        0      27      90
  1      120,635     1,999     117,660        0       0   93.33
  1      142,201     6,999     138,853        0       0   96.67
  1      167,552     4,998     165,333        0       0     100
Sampling

SQL> @ssql2 fdcz4kx11era5 2 50000 avg 10

               Elapsed                 CPU           IO
Pct     Execs  TIME                   TIME         TIME
----  -------  --------------  -----------  -----------
p0        148  .23-7.11                .89         2.30
p10       148  7.18-14.03             1.11         9.44
p20       146  14.03-20.26            1.48        15.82
p30       143  20.39-29.01            1.86        22.92
p40       146  29.1-40.73             1.91        32.63
p50       143  40.77-55.21            2.37        45.50
p60       142  55.22-77.92            3.15        63.09
p70       145  77.99-113.33           3.58        90.72
p80       141  113.41-173.64          4.46       136.22
p90       138  174.34-634.15          6.83       245.30
Sampling

SQL> @ssql3 fdcz4kx11era5 2 50000 avg 10

                                                   Elapsed        CPU          IO
Bucket  Range (ms)             Execs  Graph           TIME       TIME        TIME
------  --------------------  ------  ----------  --------  ---------  ----------
     1  .19-51.81                686  ##########     22.39       1.51       20.91
     2  51.81-103.44             303  ####           76.37       2.89       73.75
     3  103.44-155.07            198  ##            127.59       3.55      124.23
     4  155.07-206.69             91  #             174.25       4.68      169.82
     5  206.69-258.32             46                224.91       5.47      220.11
     6  258.32-309.95             22                267.26       6.90      261.46
     7  309.95-361.57              7                339.04       9.00      331.30
     8  361.57-413.2               8                264.19       6.90      258.24
     9  413.2-464.83               3                318.62       6.00      311.41
    10  464.83-516.45              2                492.26      10.00      483.53
The scripts are here: http://intermediatesql.com
Sampling

with i_gen as (
  select level as l
  from   dual
  connect by level <= &REPS
),
target_sqls as (
  select /*+ ordered no_merge use_nl(s) */ …
  from   i_gen i, gv$sql s

• SQL access to data
• Simplified time breakdown
• Can capture “hours”
• Slightly imprecise (captures 90-95% of runs)
• x$ data: “suspect”?

A completed sketch of this query follows the list.
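A completed, hedged version of the sampling query (the filter and delta logic are an assumption, not the author's original script): the ordered nested-loop join forces gv$sql to be re-read once per generated row, and deltas between consecutive reads approximate individual runs.

with i_gen as (
  select level as l from dual connect by level <= 1000
),
samples as (
  select /*+ ordered no_merge use_nl(s) */
         i.l            as sample#,
         s.executions,
         s.elapsed_time                    -- cumulative, in microseconds
  from   i_gen i, gv$sql s
  where  s.sql_id = 'fdcz4kx11era5'
)
select sample#,
       executions   - lag(executions)   over (order by sample#) as exec_delta,
       elapsed_time - lag(elapsed_time) over (order by sample#) as ela_delta_us
from   samples;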
Monitoring

SQL> desc v$session
  sql_id
  sql_exec_start
  sql_exec_id

Individual executions are captured in v$sql_monitor; capture can be forced with the /*+ MONITOR */ hint.
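A minimal query sketch against the monitoring view (columns as documented; the sql_id is the one from the earlier sampling slides):

select sql_exec_id,
       sql_exec_start,
       status,
       round(elapsed_time / 1e6, 3) as elapsed_s
from   v$sql_monitor
where  sql_id = 'fdcz4kx11era5'
order  by sql_exec_start;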
Monitoring

NAME                       VALUE    DESCRIPTION
-------------------------  -------  ------------------------------------------------------------
_sqlmon_binds_xml_format   default  format of column binds_xml in [G]V$SQL_MONITOR
_sqlmon_max_plan           480      Maximum number of plans entry that can be monitored.
                                    Defaults to 20 per CPU
_sqlmon_max_planlines      300      Number of plan lines beyond which a plan cannot be monitored
_sqlmon_recycle_time       60       Minimum time (in s) to wait before a plan entry can be recycled
_sqlmon_threshold          5        CPU/IO time threshold before a statement is monitored.
                                    0 is disabled

• Precise (captures “everything”)
• SQL access to data
• Capture size is limited (think: “seconds”)
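Individual monitored executions can also be rendered as a readable report through the documented DBMS_SQLTUNE API:

select dbms_sqltune.report_sql_monitor(
         sql_id => 'fdcz4kx11era5',
         type   => 'TEXT') as report
from   dual;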
Can I find the worst performers in ASH?

[Diagram: executions 1-11 on a timeline; successive one-second ASH samples catch only whatever is running at each instant, e.g. executions (1, 2, 3, 7), then (3, 5, 7, 9), then (7)]
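Counting ASH samples per execution shows why this is hard: ASH samples roughly once per second, so an execution appears about once per second of runtime, and sub-second runs usually leave no sample at all. A hedged sketch:

select sql_exec_id,
       min(sample_time) as first_seen,
       count(*)         as ash_samples    -- roughly seconds of runtime
from   v$active_session_history
where  sql_id = 'fdcz4kx11era5'
and    sql_exec_id is not null
group  by sql_exec_id
order  by ash_samples desc;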
Takeaways

• Percentiles are better performance metrics than averages
• Percentile calculation requires capturing (most of) individual SQL runs
• A number of ways exist to capture and measure individual SQL runs