Revolutionizing Oracle Performance Analysis: Discoveries and Strategies

Oracle Operational Timing DataPaper 36571 Cary Millsap (cary.millsap@hotsos.com) Hotsos Enterprises, Ltd. OracleWorld 2003 / San Francisco, California USA 8:30a–9:30a Wednesday 10 September 2003

Additional resources for you… • Paper update • hotsos.com • Questions • Hotsos booth in exhibit hall • Book • Coming soon (est. July)

Motive

Three great “discoveries” are revolutionizing the effectiveness of Oracle performance analysts today. • User action focus • Response time focus • Amdahl’s Law • Recommendation • Eli Goldratt’s The Goal • “Goal readers are now doing the best work of their lives.”

Great Discovery #1: user action focus. You can’t extrapolate detail from an aggregate. • Some problems can’t be detected by observing system-wide data • Hence, system-wide focus is not reliable • This is one reason “tuning” is so hard

Example: What’s the problem here? It’s obvious, right? • CIO’s #1 problem • “We need PYUGEN to execute 2x faster” • DBA’s viewpoint • “The system’s number one wait event is latch free”

Wrong. You can not trust system-wide data to guide your way. Resource profile for PYUGEN Response Time Component Duration # Calls Dur/Call ----------------------------- --------------- ----------- ------------ SQL*Net message from client 984.0s 49.6% 95,161 0.010340s SQL*Net more data from client 418.8s 21.1% 3,345 0.125208s db file sequential read 279.3s 14.1% 45,084 0.006196s CPU service 248.7s 12.5% 222,760 0.001116s unaccounted-for 27.9s 1.4% latch free 23.7s 1.2% 34,695 0.000683s log file sync 1.1s 0.1% 506 0.002154s SQL*Net more data to client 0.8s 0.0% 15,982 0.000052s log file switch completion 0.3s 0.0% 3 0.093333s enqueue 0.3s 0.0% 106 0.002358s SQL*Net message to client 0.2s 0.0% 95,161 0.000003s other 0.2s 0.0% ----------------------------- --------------- ----------- ------------ Total 1,985.4s 100.0%

Great Discovery #2: response time focus. You can’t tell how long something tookby counting how many times it happened. • Try it…

Example: What’s the problem here? It’s obvious, right? Response Time Component # Calls ------------------------------ --------- CPU service 18,750 SQL*Net message to client 6,094 SQL*Net message from client 6,094 db file sequential read 1,740 log file sync 681 SQL*Net more data to client 108 SQL*Net more data from client 71 db file scattered read 34 direct path read 5 free buffer waits 4 log buffer space 2 direct path write 2 log file switch completion 1 latch free 1

Wrong. You can not trust call counts to guide your way. Response Time Component Duration # Calls Dur/Call ------------------------------ ------------------ --------- ----------- SQL*Net message from client 166.60s 91.7% 6,094 0.027338s CPU service 9.65s 5.3% 18,750 0.000515s unaccounted-for 2.22s 1.2% db file sequential read 1.59s 0.9% 1,740 0.000914s log file sync 1.12s 0.6% 681 0.001645s SQL*Net more data from client 0.25s 0.1% 71 0.003521s SQL*Net more data to client 0.11s 0.1% 108 0.001019s free buffer waits 0.09s 0.0% 4 0.022500s SQL*Net message to client 0.04s 0.0% 6,094 0.000007s db file scattered read 0.04s 0.0% 34 0.001176s log file switch completion 0.03s 0.0% 1 0.030000s log buffer space 0.01s 0.0% 2 0.005000s latch free 0.01s 0.0% 1 0.010000s direct path read 0.00s 0.0% 5 0.000000s direct path write 0.00s 0.0% 2 0.000000s ------------------------------ ------------------ --------- ----------- Total 181.76s 100.0%

Great Discovery #3: Amdahl’s Law. Performance improvement is proportional tohow much a program uses the thing you improved. • Gene Amdahl’s realization, formalized in 1967 • Many “tuning” efforts are misplaced • This is one reason so many “successful” performance improvement projects create no impact

These great discoveries lead us to a simple, obviously efficient Oracle optimization strategy… Work first to reduce the biggest response time componentof a business’ most important user action. • Sounds easy • So why don’t people do it? • Because it’s hard to get response time data for a specific user action

Oracle Operational Timing Data

So why do Oracle analysts use counts instead of timings? • Because counts are easier to obtain • Counts are easy • Select from V$SESSTAT union V$SESSION_EVENT at t0 • Select again at t1 • Then compute results[t1] – results[t0] • Maximum error is ±1 event (not a big deal) • But times are hard • Same method produces potentially enormous duration errors • Example…

Accounting for response time with SQL queries of Oracle fixed view data is hard. $ perl vprof.pl --username=system --password=manager 8 Press <Enter> to mark time t0: Press <Enter> to mark time t1: Resource Profile for Session 8 t0 = 15:31:50.507000 t1 = 15:32:02.696000 interval duration = 12.189000s accounted-for duration = 23.510000s Response Time Component Duration (seconds) Calls Dur/Call ---------------------------------- -------------------- ------------ ---------- SQL*Net message from client 22.78 186.9% 2 11.390000 db file sequential read 0.60 4.9% 199 0.003015 CPU service 0.09 0.7% 2 0.045000 db file scattered read 0.04 0.3% 4 0.010000 SQL*Net message to client 0.00 0.0% 2 0.000000 unaccounted-for -11.32 -92.9% 1 -11.321000 ---------------------------------- -------------------- ------------ ---------- Total 12.19 100.0% Source: Millsap, C. 2003. Optimizing Oracle Performance, Sebastopol CA: O’Reilly (July 2003)

One easy way out of the problem is to use Oracle’s extended SQL trace data. • Easy to activate • Low overhead if used sensibly • Produces a chronological record • Contains all the performance data you need in most cases • Doesn’t require any special tools to get started

The easiest way to activate extended SQL trace is to add a few statements to your application’s source code. alter session set timed_statistics=true alter session set max_dump_file_size=unlimited alter session set tracefile_identifier='POX20031031a' alter session set events '10046 trace name context forever, level 8' /* code to be traced goes here */ alter session set events '10046 trace name context off'

Raw trace data lines are not so hard to understand. PARSE #54:c=20000,e=11526,p=0,cr=2,cu=0,mis=1,r=0,dep=1,og=0,tim=1017039304725071 EXEC #1:c=10000,e=12137,p=0,cr=22,cu=0,mis=0,r=1,dep=0,og=4,tim=1017039275981174 FETCH #3:c=10000,e=306,p=0,cr=3,cu=0,mis=0,r=1,dep=2,og=4,tim=1017039275973158 WAIT #1: nam='SQL*Net message to client' ela= 40 p1=1650815232 p2=1 p3=0 WAIT #1: nam='SQL*Net message from client' ela= 1709 p1=1650815232 p2=1 p3=0 WAIT #34: nam='db file sequential read' ela= 14118 p1=52 p2=2755 p3=1 WAIT #44: nam='latch free' ela= 1327989 p1=-1721538020 p2=87 p3=13

The five fields I’ll speak about today are the cursor id, c, e, nam, and ela (see MetaLink 39817.1 for more info).

There are two types of wait event: events issued within db calls, and calls issued between db calls. • Events issued between db calls are recognizable by nam • SQL*Net message from client • SQL*Net message to client • single-task message • pipe get • rdbms ipc message • pmon timer • smon timer • Most other events are issued within db calls

A db call’s elapsed time is approximately its total CPU consumption plus the total duration of its wait events. WAIT #4: nam='db file sequential read' ela= 13060 p1=1 p2=53903 p3=1 WAIT #4: nam='db file sequential read' ela= 6978 p1=1 p2=4726 p3=1 FETCH #4:c=0,e=21340,p=2,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1033064137953092

A whole trace file’s elapsed duration is the sum of the db call durations plus the sum of the between-call stuff. PARSE #9:c=0,e=0,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=4,tim=1716466757 WAIT #9: nam='SQL*Net message to client' ela= 0 p1=1413697536 p2=1 p3=0 WAIT #9: nam='SQL*Net message from client' ela= 3 p1=1413697536 p2=1 p3=0 ... PARSE #9:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=1716466760

Extended SQL trace is reliable for diagnosing the root cause of virtually any performance problem. • Method is reliable for even more problem types by isolating the “un-measured” components • M – measurement intrusion effect • E – quantization error • N – time spent not executing • U – un-instrumented Oracle code (systematic error + bugs) • S – CPU consumption that is counted twice

What you can do with the timing data is spectacular. • Performance problems cannot hide from you • No more solving the wrong problem • No more stabs in the dark or trial-and-error “tuning” • No more multi-month performance diagnosis projects • No more CTD [Vaidyanatha and Deshpande (2000)] You either solve the right problem quicklyor prove that solving it is not worth the effort.

Performance problems cannot hide from this method. • Partial laundry list of problems we’ve solved • System-wide problems, and specific user-action problems • Query mistakes: bad SQL, faulty indexing, data density, … • Application mistakes: excessive parsing, no arrays, … • Serialization mistakes: locks, latches, memory buffers, … • Network mistakes: protocol selection, bad segments, … • Disk I/O mistakes: poorly sized caches, imbalances, … • Capacity shortages: swapping, paging, …

Why I use extended SQL trace data instead of Oracle’s V$ fixed views… • I don’t use V$ data anymore, because it’s a mess • Can’t poll fast enough with SQL • Too complicated to attach directly to SGA • Too many data sources (V$SESSTAT, V$SESSION_EVENT, V$LATCH, V$LOCK, V$FILESTAT, V$WAITSTAT, V$SQL, …) • No notion of e; therefore, can’t isolate M, E, N, U, S • No read consistency • Statistics are unreliable (CPU time used by this session, SECONDS_IN_WAIT, …) • Statistics are susceptible to overflow • Hard (impossible?) to determine recursive SQL relationships

Where do Oracle’s timing statistics actually come from? procedure dbcall { e0 = gettimeofday; # mark the wall time c0 = getrusage; # obtain resource usage statistics ... # execute the db call (may call wevent) c1 = getrusage; # obtain resource usage statistics e1 = gettimeofday; # mark the wall time e = e1 – e0; # elapsed duration of dbcall c = (c1.utime + c1.stime) – (c0.utime + c0.stime); # total CPU time consumed by dbcall } procedure wevent { ela0 = gettimeofday; # mark the wall time ... # execute the wait event (syscall) ela1 = gettimeofday; # mark the wall time ela = ela1 – ela0; # elapsed duration of wevent }

Understanding where Oracle timings come from motivates all sorts of interesting research projects… • How to quantify measurement intrusion effect • How to bound quantization error • How to bound un-instrumented call effects • How to estimate the amount of kernel-mode time double-counting • …

Questions & Answers

Oracle Operational Timing DataPaper 36571 Cary Millsap (cary.millsap@hotsos.com) Hotsos Enterprises, Ltd. OracleWorld 2003 / San Francisco, California USA 8:30a–9:30a Wednesday 10 September 2003

Revolutionizing Oracle Performance Analysis: Discoveries and Strategies

Revolutionizing Oracle Performance Analysis: Discoveries and Strategies

Presentation Transcript

Oracle Customer Data Management

Oracle Data Encryption

Data hierarchy in Oracle

Data types Timing in VHDL

Oracle Customer Data Management

Oracle Data Integration Solutions Product Update: Oracle GoldenGate and Oracle Data Integrator

Introducing Oracle Data Integrator and Oracle GoldenGate

Operational Data Store

Deploying Oracle GoldenGate with Oracle Data Guard

Energy Day Timing Issues and Operational Concerns

Oracle Data Integrator Agents

Oracle Transparent Data Encryption

Oracle Integrated Operational Planning Align Operational Plans with Financial Goals

ORACLE Data Mining

Oracle Data Guard

Oracle Data Visualizations

Operational Data Store

Purpose and timing of paper

DSP Board data bus timing