DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS

DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS Chris Lawson May 4, 2007 "It is the brain, the little gray cells on which one must rely.” -- Hercule Poirot

Our Agenda for Today • A fun trivia quiz • Overview: different types of diagnostics • Preparing for Oracle Statspack • Running Statspack reports • Limitations of Statspack • Some helpful Sql scripts “Use the little gray cells Hastings” • Case study: Diagnosing “eco_out” batch job. • Awarding of prize for correct trivia quiz answer And now, the Trivia Quiz . . .

TRIVIA QUIZOracle’s Statspack does not show … • Snapshot start/end times • High disk i/o Sql • High # of execution Sql • High # of parse Sql • Init.ora parameters • “ITL” locking events • Top 5 wait events • Database privilege problems • Database memory statistics “I say, Poirot!” Vote now--prize at end of presentation!

Here are 2 Kinds of Diagnostics Diagnostics that identify problems: • Simple run-time summaries, provided by application, show runtime patterns. • Graphs show server load or disk i/o over time. Diagnostics that help resolve problems: • Oracle-specific utilities, such as Statspack, that identify problem areas. • Custom scripts that identify long-running transactions & poorly tuned Sql.

Run-Time Summaries • Most batch job/data warehouse jobs provide runtime stats. • Some even show the runtimes for each step. • BillPay lists batch job stats in table Wfjob_Ctl: SQL> desc Wfjob_Ctl Name Null? Type ------------------------- -------- --------------- BATCH_DATE NOT NULL VARCHAR2(8) JOB_NAME NOT NULL VARCHAR2(40) JOB_RUN_NUMBER NOT NULL NUMBER JOB_STATUS VARCHAR2(2) JOB_RETURN_CODE NUMBER JOB_START_TIME DATE JOB_END_TIME DATE HOSTNAME VARCHAR2(20)

Sample Diagnostic: Find All Slow Batch Jobs Show jobs in past ½ day running > 20 mins: Select Job_name, To_char(job_start_time,'mon-dd Hh24:mi') STTME, To_char(job_end_time,'mon-dd Hh24:mi') ENDTME, Round((job_end_time-job_start_time)*24*60) MINS From Wfjob_ctl Where Job_start_time > Sysdate - .5 And (Job_end_time-job_start_time)*24*60 > 20 Order By 2

Find Slow Batch Jobs List jobs in past ½ day running longer than 20 mins: JOB_NAME STTME ENDTME MINS ------------------------------- ------------ ------------ ------- BP.westbp.gen_dailyrun.C dec-18 04:34 dec-18 06:08 94 BP.westbp.sam_gen_tkconf.C dec-18 06:00 dec-18 06:36 36 BP.westbp.prc_instalerts.C dec-18 06:01 dec-18 06:33 32 BP.westbp.cfin_prc_svcrs2.C dec-18 08:11 dec-18 08:31 20

Sample Report: Find ecoout Runtimes Runtimes for ecoout job for last 3 days: select job_name,to_char(job_start_time, 'DAY') wekday, TO_CHAR(job_start_time,'MON-DD hh24:mi') STTME, TO_CHAR(job_end_time,'MON-DD hh24:mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 3 and job_name like '%ecoout_gen_pmt%' order by job_start_time JOB_NAME WEKDAY STTME ENDTME MINS ------------------------ --------- ------------ ------------ ---- BP.westbp.ecoout_gen_pmt SATURDAY DEC-16 08:01 DEC-16 08:48 48

Find alert Runtimes Runtimes for all alerts job for last 2 days: select job_name, TO_CHAR(job_start_time,'MON-DD hh24:mi') STTME, TO_CHAR(job_end_time,'MON-DD hh24:mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 2 and job_name like '%alrt%' order by job_start_time JOB_NAME STTME ENDTME MINS ---------------------------- ------------ ------------ ----- BP.westbp.alrt_prc_daily.C DEC-16 10:43 DEC-16 11:39 56 BP.westbp.alrt_cln_pmt.C DEC-16 11:39 DEC-16 11:44 5 BP.westbp.alrt_cln_daily.C DEC-16 11:39 DEC-16 13:08 89 BP.westbp.alrt2_gen_pmtmd.C DEC-18 03:31 DEC-18 03:46 16 BP.westbp.alrt2_gen_ebnr.C DEC-18 03:31 DEC-18 03:38 7

Another Type of Diagnostic:Server Load Charts • This is a good way to spot trends and can help in capacity planning. • We’ve used these graphs to launch investigation after odd spikes. • Server load graphs help isolate the time period having trouble. • This method is typically a visual inspection.

Problem Identification: Server Load Graphs

Spikes Help Identify Problem

Bill-Pay Late January Problem Root cause: CSA users began running up to 8 bad SQL simultaneously

Introduction to Oracle Statspack • Statspack is an Oracle utility that lists performance statistics for a given database. This is called a “Statspack Report.” • You define the time period of interest by specifying the start-time and end-time. • The stats come from taking regular “snapshots” of the database--typically once per hour (changeable.)

Oracle Statspack: Snaps • The snaps taken are listed in Oracle table Stats$Snapshot • Here’s how to find the snaps taken in last day: Select Snap_id, To_char(snap_time,'dd-mon-yy-hh24:mi') Snaptime From Stats$snapshot Where Snap_time > Sysdate 1 Order By 1 SNAP_ID SNAPTIME -------- --------------- 4237 21-dec-06-02:49 4238 21-dec-06-03:49 4239 21-dec-06-04:49 * * *

Starting a Statspack Report SQL> @spreport Specify the Begin and End Snapshot Ids ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Enter value for begin_snap: 1001 Enter value for end_snap: 1002 Specify the Report Name ~~~~~~~~~~~~~~~~~~~~~~~ The default report file name is sp_1001_1002. To use this name, press <return> to continue, otherwise enter an alternative. Enter value for report_name: TESTREPORT

Statspack Key Sections • Top 5 resource-consuming events • All events causing waits • Top SQL ordered by logical i/o • Top SQL ordered by physical i/o • Top SQL ordered by # of executions • Aggregate db statistics • Most-read objects • Most-locked objects

Statspack: Top 5 Events Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time ---------------------------- ---------- ----------- -------- db file sequential read 6,672,182 25,852 62.21 CPU time 11,887 28.61 db file scattered read 2,611,855 2,831 6.81 log file sync 63,184 208 .50 ARCH wait on SENDREQ 261 199 .48 Single-block reads Multi-block reads Write of transaction after Commit

Statspack: List of “Top Sql” • If you know the time period, Statspack will usually zero-in on the problem. • Shows “resource hogs” that consume lots of disk, logical reads, physical reads, or huge number of executions. • Nearly all bad Sql will be detected—but a few can slip-through undetected! • E.g., what if user is blocked for a long time? That consumes neither disk or much logical i/o, so it’s missed. • More exceptions later.

Statspack: Top Sql Example Top resource consumer—logical reads CPU Elapsed Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 195,093,132 109,303 1,784.9 17.5 8750.55 97170.32 3351508081 select cur1.pae_id, cur1.mpa_id, cur1.mpa_nickname, nvl(cur1.mpasta_id, -1) mpa_status, cur3.pmt_start_date last_pay_date, cur3.pmt_amt pmt_amt_complete, cur2.pmt_start_date schedule_date, * * * CPU per Elap per Executions Rows Processed Rows per Exec Exec (s) Exec (s) Hash Value ------------ --------------- ---------------- ----------- ---------- ---------- 589,109 589,066 1.0 0.00 0.00 3012166360 Module: p_direct_check_inbound_pkg INSERT INTO CBPAYMENT_INSTRUCTION_AUDIT (PMTINSTR_ID, PMT_ID, IN STR_ID, MEM_ID, PAE_ID, MPA_ID, INSTRSTA_ID, PMTINSTRBAT_ID, WF_ * * * Top resource consumer—# of executions

Another type of Report: sprepsql • This report shows you actual execution plan for a particular Sql run previously. • You first identify the Sql by its “hash value,” shown in the first statspack report: Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 22,988,572 1,530 15,025.2 4.1 1939.33 1957.49 178837896 UPDATE CBNOTICE_MEMBER SET NTCSTA_ID = 3, NTCMEM_MODIFIER_ID = :1, NTCMEM_MODIFIED_DATE = :2 WHERE NTC_ID = :3 AND MEM_ID = :4 Note this value

sprepsql: Sample Outputfor an Update Statement • Statement Total Per Execute Total • --------------- --------------- ------ • Buffer Gets: 20,474,697 20,474,697.0 3.17 • Disk Reads: 19,895,404 19,895,404.0 22.28 • Rows processed: 4,401 4,401.0 • CPU Time(s/ms): 3,808 3,807,950.0 • Elapsed Time(s/ms): 6,369 6,369,227.7 • UPDATE MEMBER_ACTIVATION MA SET PMT_ACTIVATION_INDICATOR = :B2 , • PMT_ACTIVATION_DATE = SYSDATE , MODIFIED_BY = :B1 , MODIFIED_DATE = SYSDATE • WHERE MA.PMT_ACTIVATION_INDICATOR IS NULL AND EXISTS (SELECT NULL FROM • * * * -------------------------------------------------------------------------------- | Operation | PHV/Object Name | Rows | Bytes| Cost | -------------------------------------------------------------------------------- |UPDATE STATEMENT |----- 1119347130 ----| | | 529076 | |UPDATE | | | | | | HASH JOIN SEMI | | 925K| 24M| 529076 | | TABLE ACCESS FULL |MEMBER_ACTIVATION | 954K| 18M| 732 | | TABLE ACCESS FULL |CBPAYMENT | 92M| 705M| 499796 | --------------------------------------------------------------------------------

Statspack Good Practices:Tips & Traps • Old Snapshots are occasionally purged, so you can’t go back forever! • Certain problems can fly “under the radar.” • Example: Statspack lists resource-intensive Sql. But what if the problem is due to the cumulative effect of millions of similar (not identical) Sql? Statspack will miss this because each Sql is below the reporting threshold.

Statspack Good Practices:Tips & Traps Ha! I knew that! Here’s one that most of us missed: Oracle’s statistics are reset if table stats are gathered, so Statspack will rate that Sql as perfect! SECS EXEC BUFFER_GETS DISK_READS SQL_TEXT ----- ---- ----------- ---------- ---------------------------------- 2 2 10694 10683 select count(*) from chris_waivers After Analyze of “Chris Waivers” SECS EXEC BUFFER_GETS DISK_READS SQL_TEXT ----- ---- ----------- ---------- ---------------------------------- 0 0 0 0 select count(*) from chris_waivers

Statspack Good Practices:Summary • Statspack is an excellent diagnostic aid in identifying possible bottlenecks—not solving them. • Statspack usually uncovers the resource drivers for a specified period of time. • Remember, however, it will not find everything--a few things can slip by. • Like any tool, it can’t fix the design for you.

My Favorite Diagnostic Scripts I use custom scripts far more often than Statspack. Here are some favorites: • Show all Sql running on database. • Show the object a user is waiting on. • List cumulative wait events for a user. • List resource-intensive Sql run recently. • Show average speed of disk i/o.

Script 1: Show all Active Sql • This gives you an idea of some of the critical Sql for an application. • If you see the same Sql over and over, it either is inefficient, or run extremely often. • On most OLTP systems, only a handful of sessions are usually active at one point—because everyone else has already got their answer.

Example: Show all Active Sql Select DISTINCT Sid, username, substr(sql_text,1,200) stext From V$Session, V$Sql Where status = 'ACTIVE' And username is not null And v$session.sql_hash_value = hash_value and v$session.sql_address = v$sql.address and sql_text not like '%sql_text%' This eliminates Oracle background processes This eliminates this query itself SID USERNAME STEXT ----- ------------- ------------------------------------------------------- 128 BMAPP SELECT PM_ACCT_NO FROM CBPAYMENT_METHOD WHERE MEM_ID = :1 FOR UPDATE NOWAIT

Script 2: Show Object User is Waiting on • This is especially helpful in isolating bottlenecks for a particular Sql. • This shows you where your Sql is stuck—you may be surprised! • On FFIEC, this revealed that most queries were always reading one particular table (or its indexes.)

Actual Example: Show Object User is Waiting on SeleCT DISTINCT username,object_name, sql_text From V$Session, V$Sql, dba_objects o Where v$session.status = 'ACTIVE' And username is not null and o.object_id = row_wait_obj# And v$session.sql_hash_value = hash_value and v$session.sql_address = v$sql.address and username <> 'SYS' Scanning the primary key index USERNAME OBJECT_NAME SQL_TEXT ------------- ------------------ ---------------------------------------- ADMAPP PMT_PK WITH bills AS (SELECT /*+ Materialize */ bls.bls_external_billid bls_external_billid, blr.blr_bspbiller_id blr_bspbiller_id, bls.mem_id mem_id, bls.mpa_id mpa_id, bls.bls_id bls_id, TO_CHAR (bls.bls_first_seen, 'yyyyMMddHHmiss') bls_first_seen, bls.bls_notify_sent_to_bsp bls_notify_sent_to_bsp FROM cbbill_s

Script 3: Session Cumulative Wait Events • This summarizes the bottlenecks for the totality of a particular session. • Especially useful for a long-running batch job. • This shows how much time is due to each type of bottleneck. • This helps you avoid solving problems that aren’t really the bottleneck.

Actual Example:Session Cumulative Wait Events Select Sid, Event, Total_waits,time_waited/100 Timewaited, Average_wait/100 Avgwait, Round(100*total_waits/Time_waited) RATE From V$session_event Where Sid = 339 And Time_waited > 10000 And Event Not Like '%Net%‘ Order By Timewaited wait times are in centiseconds SID EVENT TOTAL_WAITS TIMEWAITED AVGWAIT RATE ---- ------------------------- ----------- ---------- ---------- ---------- 339 db file scattered read 2912329 534.3 0 5451 339 log file sync 138296 795.71 .01 174 339 enqueue 2594 6937.74 2.67 0 Blocked by another user

Script 4: Resource-Intensive Sql • Pretty similar to the Statspack Sql report. This often highlights serious problems. • Quickly shows you “resource hogs” for a database—big CPU or disk users. • You set a threshold—such as all SQL consuming over 1 million disk reads, or SQL that has run over 10 hours. • This script exemplifies why we don’t need to “guess” at the cause of performance problems. Although finding root cause is usually easy, applying the fix may not be.

Actual Example: Resource-Intensive Sql Select Round(elapsed_time/1000000/1) Secs, Rows_processed Rowct, Executions, Buffer_gets, Disk_reads, Sql_text From V$sql Where Elapsed_time > 9910000000 And Executions > 0 And Abs(buffer_gets)/(Rows_processed+.01) > 100 Order By Elapsed_time SECS ROWCT EXECUTIONS BUFFER_GETS DISK_READS ---- ----- ---------- ----------- ---------- 82136 0 20814258 62545409 3814708 select vcrypttrac0_."USER_NODE_LOG_ID" as USER1_, vcrypttrac0_."REQUEST_ID" as REQUEST2_25_, vcrypttrac0_."CLIENT_DEVICE_ID" as CLIENT3_25_, * * * Surprise! Sql doesn’t do anything! (actual production case)

Script 5: Show Disk i/o Rate • On many systems, disk i/o is a significant performance driver. • You want to eliminate the disk i/o first, but it’s still helpful to know roughly how fast you can perform a typical read. • Numbers typically range from 100-300 single-block reads per second • Note: Oracle uses the non-intuitive term, “sequential read” for a single read. • Multi-block reads are called “scattered reads.”

Example: Show Disk I/O Ratefor 2 Different Systems Select EVENT, TOTAL_WAITS, TIME_WAITED, Round(100*total_waits/Time_waited) Rate From V$system_event Where Time_waited > 1000 And Event Like '%Db File Sequential Read%' BILLPAY EVENT TOTAL_WAITS TIME_WAITED RATE ------------------------- ----------- ----------- ---------- db file sequential read 7485200484 2059793947 363 Why is disk so slow? FFIEC EVENT TOTAL_WAITS TIME_WAITED RATE ------------------------ ----------- ----------- ---------- db file sequential read 663982417 2481360621 27

Case Study: Diagnosing “eco” Batch • Called “ecoout_gen_pmt,” it is critical Billpay batch, starting about 03:30 daily. • Tight SLA—Especially for “big payment” days. Job typically runs several hours. Business is concerned that SLA is at risk, especially for Tuesday runs.

Eco-PaymentBatch Runtimes Querying wfjob_ctlshows historical runtimes and confirms issue: WEKDAY STTME ENDTME MINS --------- ------------ ------------ ------ TUESDAY JUL-11 03:47 JUL-11 06:33 165 TUESDAY JUL-18 03:48 JUL-18 09:48 360 TUESDAY JUL-25 03:41 JUL-25 06:04 143 TUESDAY AUG-01 03:51 AUG-01 07:55 244 SLA failures

Ecoout: Identify Root Cause The “resource-intensive” script finds this Sql running during the time period: UPDATE CBPAYMENT PMT_UPDATE SET (COLS) = (SELECT “stuff” FROM CBMEMBER M, CBMEMBER_PAYEE_ACCOUNT MPA , CBELEMENT E, ETC. Script shows that Sql ran for 3 hours, consuming 25 million logical reads to process 250k rows This suggests problem with the way Oracle executed this transaction.

Eco-out Root Cause • Running “Explain plan” for Sql shows Oracle combines 2 indexes: • “Bitmap conversion from ROWID.” • Execution plan shows these indexes Used: • Pmtinstr_instrsta_id_idx • Pmtinstr_instr_idx ?? Non-Selective! • So Oracle combines two indexes, one of which is a terrible choice. • The optimizer should simply do the obvious: just use the one “good” index. • BTW: This has happened more than one.

Eco-out Root Cause • This optimizer tactic is a notorious weakness in Oracle 9i--some shops turn-off this feature. • In some Billpay batch jobs, we issue an “Alter Session” command to disallow this feature. • Our fix: Sql hint to specify the desired index: /*+index(I PMTINSTR_INSTRSTA_ID_IDX) */ So what happened after Sql hint applied?

Eco-PaymentNew Runtimes

CHALLENGE QUESTIONOracle’s Statspack does not show … • Snapshot start/end times • High disk i/o Sql • High # of execution Sql • High # of parse Sql • Init.ora parameters • Database privilege problems • Database memory statistics • “ITL” locking events • Top wait events

PRIVILEGE PROBLEMS QUIZ ANSWER “This is NOT part of Statspack.”

“Case Closed” :Good Diagnostic Aids • Consider logs that show job runtime • Statspack: • spreport shows resource usage • sprepsql shows execution plan • In Oracle 10g, use AWR • Custom scripts that show resource usage such as high CPU or disk i/o • Don’t forget server CPU load graphs.

Case Closed:Tips & Traps • There isn’t just “one way” to always identify performance bottlenecks. • Statspack cannot detect everything; some things fly “under the radar.” • No one has a script to cover every possible problem. • So, develop a “toolkit” of approaches—custom Sql scripts, graphs, tools, etc.

Questions? “The world is full of obvious things which nobody by any chance ever observes.” Sherlock Holmes, The Hound of the Baskervilles.

DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS