E N D
2. Objectives Using DB2 IFI data to monitor and manage various DB2 system resources. Examples include DBM1 storage and latch contention.
Designing and deploying specialized DB2 “thread killer” utilities to enhance both DB2 application and system availability.
The nuts and bolts of replacing a vendor supplied Call Attach product with an in-house written solution.
Leveraging IFI metadata to quickly develop customized DB2 tools.
Setting up the re-useable framework that will serve as the foundation for all of the tools.
This presentation will describe the numerous DB2 tools that have been written at Progressive Insurance. These tools enable us to manage our large and diverse DB2 workload. In addition, the presentation will detail methods that the members of the audience can employ to transfer these ideas to their own organizations. We will cover the basic framework that we have used to construct all of the tools. The tools that will be discussed will include a Call Attach replacement, a DBM1 storage monitor, a latch contention monitor, various “thread killer” utilities, a method to create DB2 IFI metadata, and a WLM boost technique. This presentation will describe the numerous DB2 tools that have been written at Progressive Insurance. These tools enable us to manage our large and diverse DB2 workload. In addition, the presentation will detail methods that the members of the audience can employ to transfer these ideas to their own organizations. We will cover the basic framework that we have used to construct all of the tools. The tools that will be discussed will include a Call Attach replacement, a DBM1 storage monitor, a latch contention monitor, various “thread killer” utilities, a method to create DB2 IFI metadata, and a WLM boost technique.
3. Overview Our Environment
Challenges and Opportunities
The Development Framework and IFI Metadata
The Tools
Call Attach Replacement
Thread Cancel Utilities
DB2 System Monitors
Tying it all Together
Questions
4. Our Environment
5. Our Environment We Currently run Ten Way Datasharing on Eight LPARs
The Majority of the Work comes from CICS and Distributed Transactions.
The Datasharing Group Supports Both Operational and Warehouse Applications
Our Applications Make Extensive use of Dynamic SQL
“Speed to Market” is Paramount
6. Challenges and Opportunities Company growth and increasing software costs
User submitted work conflicting with warehouse utility jobs
DBM1 virtual storage constraints
DB2 lock/latch issues
Tiered Support Organization We ran into serious DBM1 virtual storage constraint problems at Version 7. That’s the primary reason that we expanded out to ten members. We have not experienced any of these problems since the migration to Version 8.
The same holds true for latch contention. We able to eliminate most of our latch contention problems while still at version 7. Most of this was due to application tuning.
Our organization is moving toward a tiered support structure. This is forcing us to build specialized tools in order to help our first responders diagnose problems more quickly.We ran into serious DBM1 virtual storage constraint problems at Version 7. That’s the primary reason that we expanded out to ten members. We have not experienced any of these problems since the migration to Version 8.
The same holds true for latch contention. We able to eliminate most of our latch contention problems while still at version 7. Most of this was due to application tuning.
Our organization is moving toward a tiered support structure. This is forcing us to build specialized tools in order to help our first responders diagnose problems more quickly.
7. The Development Framework PROGRAM ARCHITECTURE and SERVICES
Parameter Processing
Definition
Parsing
Validation
Extraction
Console Communications
WTO
Modify/Stop
Issuing Console Commands The development framework is basically a collection of reusable code that allows us to build new utilities or tools.
You can define the command syntax, set up the new tool to run as a started task or as a batch job, load all modules that the new tool will require, define messages, etc. The development framework is basically a collection of reusable code that allows us to build new utilities or tools.
You can define the command syntax, set up the new tool to run as a started task or as a batch job, load all modules that the new tool will require, define messages, etc.
8. The Development Framework PROGRAM ARCHITECTURE and SERVICES
Program Loading
Stack Processing
Dynamic File Allocation
Message Processing
Definition
Parameter Substitution
Destination
9. IFI Metadata We use assembler ADATA from all IFCID DSECTs as our starting point.
This data is then post-processed and loaded into DB2 tables.
The tables allow for versioning of the metadata.
The metadata has been enhanced with the creation of a table that defines the relationship of each IFCID and it’s sections.
10. IFI Metadata The development framework has been extended to support access to the IFI metadata and IFI data.
The IFI metadata can be extracted and processed to create new DB2 tools.
Additional programs provide an API that is used to obtain IFI data.
11. Call Attach Replacement The CAF API was written as a replacement for a third party product
//TDBA972A JOB (Z00000,N129),'CAF SAMPLE',
// CLASS=7,MSGCLASS=X
//CAFS EXEC PGM=DB2PROG
//SYSOUT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//DBAIN DD *
PLAN(TSTCP000) This was written as a replacement for a third party product. We had to replace it due to increasing software costs. The product had been integrated into thousands of production batch jobs.
This tool is a stub (DSNHLI) replacement and allows batch programs to run without TSO.
Most of the work is performed on the first call that the application makes to DB2. After that, we simply pass the calls to DB2. In addition, we check for bad SQLCODEs on the way back. This check can allow the stub to force abends for severe SQLCODEs – we have chosen not to implement this at the present time. This was written as a replacement for a third party product. We had to replace it due to increasing software costs. The product had been integrated into thousands of production batch jobs.
This tool is a stub (DSNHLI) replacement and allows batch programs to run without TSO.
Most of the work is performed on the first call that the application makes to DB2. After that, we simply pass the calls to DB2. In addition, we check for bad SQLCODEs on the way back. This check can allow the stub to force abends for severe SQLCODEs – we have chosen not to implement this at the present time.
12. Call Attach Replacement CAF SERVICES
Dynamic Initialization
Hierarchical Selection of Plan and Subsystem
Subsystems in Data Sharing Group on Same Lpar Selection
16 choices depending on bits 40 to 43 of TOD - .000255 to .003840 second range
SQL Error Logging (not ad nauseam) and Error Translation
SVC update of TCBUSER for anchor
Wait option for DB2 if not up
Terminate option for thread on DB2 Quiesce
Abend on Severe Error Option
RRS Option
Multi Threads for Multiple TCB usage The CAF replacement was enhanced over the years to support RRS and multitasking.
We run two DB2 members from the same group on the same LPAR. Our CAF can randomly choose which member to connect to. This allows us to load balance batch across the members. IBM supplies a USERMOD which extends this concept to other attach types.
The default subsystem (or group) is contained in a data only CSECT. Thus the load libraries are group specific. This was implemented in order to mimic the product that was being replaced. The group name, or any other default, can be overridden at run time.
The CAF replacement was enhanced over the years to support RRS and multitasking.
We run two DB2 members from the same group on the same LPAR. Our CAF can randomly choose which member to connect to. This allows us to load balance batch across the members. IBM supplies a USERMOD which extends this concept to other attach types.
The default subsystem (or group) is contained in a data only CSECT. Thus the load libraries are group specific. This was implemented in order to mimic the product that was being replaced. The group name, or any other default, can be overridden at run time.
13. Call Attach Replacement A basic architecture for a CAF stub replacement is documented in the Application Programming and SQL Guide.
Our CKCD module is a modified version of the one that’s in the IBM documentation.
Other Module’s functions:
PRCIALOC – Dynamic allocation
PRCIMSG* - Message processing
PRCICAF* - Initialization modules
PRCIPARS – Parm syntax processing
PRCITHDC – Actual thread create
PRCISQLI – Called by the stub, drives the initialization process. After the first call, this module passes the calls to DB2.
A basic architecture for a CAF stub replacement is documented in the Application Programming and SQL Guide.
Our CKCD module is a modified version of the one that’s in the IBM documentation.
Other Module’s functions:
PRCIALOC – Dynamic allocation
PRCIMSG* - Message processing
PRCICAF* - Initialization modules
PRCIPARS – Parm syntax processing
PRCITHDC – Actual thread create
PRCISQLI – Called by the stub, drives the initialization process. After the first call, this module passes the calls to DB2.
14. Call Attach Replacement The CAF replacement will report on negative SQLCODEs. The first five errors will be listed for any SQLCODE, after that, the messages will be suppressed. This feature was a part of the vendor product, so we had to include it our version.
We also added the program name and statement number to all of the displays.The CAF replacement will report on negative SQLCODEs. The first five errors will be listed for any SQLCODE, after that, the messages will be suppressed. This feature was a part of the vendor product, so we had to include it our version.
We also added the program name and statement number to all of the displays.
15. Call Attach Replacement The CAF API has been enhanced to exploit zIIP engines
//TDBA972A JOB (Z00000,N129),'CAF SAMPLE',
// CLASS=7,MSGCLASS=X
//CAFS EXEC PGM=DB2PROG
//SYSOUT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//DBAIN DD *
PLAN(TSTCP000) CONNECT(REM_DB2)
This is the latest enhancement that has been added to our CAF replacement. It is an attempt to use DB2 to DB2 DDF to exploit zIIP engines for large user submitted queries. We run our warehouses with in our operational data sharing group and we do not use SYSPLEX query parallelism. We came up with this idea as a workaround.This is the latest enhancement that has been added to our CAF replacement. It is an attempt to use DB2 to DB2 DDF to exploit zIIP engines for large user submitted queries. We run our warehouses with in our operational data sharing group and we do not use SYSPLEX query parallelism. We came up with this idea as a workaround.
16. Thread Cancel Utilities We have several tools to cancel DB2 threads, based on . . .
DB2 Object Usage
DBM1 Thread Storage
DB2 WORKDB Usage
17. DB2 Object Usage Cancel The utility supports thread cancel by tablespace or indexspace
CANCEL THREADS FOR TABLESPACE (DB.TS, ...) -
SETSTATUS(UTIL), -
WAITSTATUS(UTIL), -
INCLUDE_USER(PROD*,TDBA*,..),-
INCLUDE_JOB(P*...),-
EXCLUDE_USER(PROD*,TDBA*,..),-
EXCLUDE_JOB(P*...),-
DISPLAY
Te thread cancel utility was written for one main reason – to prevent the DBAs from receiving calls in the middle of the night !!
We have many large data warehouses and user queries were constantly getting in the way of DB2 utilities. The thread cancel runs in batch ahead of the utilities and cancels all of these queries.
The thread cancel accepts a list of tablespaces and/or indexes and it will cancel any thread that holds a lock on any related object. For example, if a user holds a lock on one table in a segmented tablespace, that user will be cancelled.
We do not support wildcarding in the syntax – we decided that the utility should be more granular. It does handle up to 100 tablespaces and indexes per invocation.
It will cancel users with locks on DSNDB06 and the TEMP or WORK tablespaces. We did not build in support for cancelling threads with locks on DSNDB01 objects.Te thread cancel utility was written for one main reason – to prevent the DBAs from receiving calls in the middle of the night !!
We have many large data warehouses and user queries were constantly getting in the way of DB2 utilities. The thread cancel runs in batch ahead of the utilities and cancels all of these queries.
The thread cancel accepts a list of tablespaces and/or indexes and it will cancel any thread that holds a lock on any related object. For example, if a user holds a lock on one table in a segmented tablespace, that user will be cancelled.
We do not support wildcarding in the syntax – we decided that the utility should be more granular. It does handle up to 100 tablespaces and indexes per invocation.
It will cancel users with locks on DSNDB06 and the TEMP or WORK tablespaces. We did not build in support for cancelling threads with locks on DSNDB01 objects.
18. DB2 Object Usage Cancel Supports CANCEL or REPORT
Allows QUERY of DB2 locks held by a user or users
Will issue DB2 commands to set objects in UT status
Executes DB2 commands to put objects back in RW status
Will not CANCEL production batch jobs
DISPLAY option issues display use, locks and claimers The REPORT option allows you to find threads without cancelling them.
We do not allow production batch jobs to be cancelled. They will be listed in the utilities output.
The DISPLAY option was put in as a debug aid. It will perform three displays for each object – use, locks and claimers.
The DB2 objects can also be started for UT access. The utility will build a file with start commands that can be used to place the objects back in RW status once all DB2 utilities have completed. The REPORT option allows you to find threads without cancelling them.
We do not allow production batch jobs to be cancelled. They will be listed in the utilities output.
The DISPLAY option was put in as a debug aid. It will perform three displays for each object – use, locks and claimers.
The DB2 objects can also be started for UT access. The utility will build a file with start commands that can be used to place the objects back in RW status once all DB2 utilities have completed.
19. Query Locks This option allows us to display locks that are held by specific users. This was implemented so that we could figure out how to construct the IFI requests to find holders of the various kinds of locks.
The lock types are documented in member DSNDQW02 of SDSNMACS.
We look for these lock types:
Tablespace locks
Tablespace mass-delete locks
Table locks
Table mass-delete locks
Index locks
Tablespace partition lock (only one for V8)
This option allows us to display locks that are held by specific users. This was implemented so that we could figure out how to construct the IFI requests to find holders of the various kinds of locks.
The lock types are documented in member DSNDQW02 of SDSNMACS.
We look for these lock types:
Tablespace locks
Tablespace mass-delete locks
Table locks
Table mass-delete locks
Index locks
Tablespace partition lock (only one for V8)
20. DB2 Object Usage Cancel This slide is the output of a REPORT on a QMF save data tablespace. We query the DB2 catalog to find all related objects. We then issue a number of IFI query lock requests to identify any and all users who hold locks on the objects. The IFI requests are global, so we will find users on any member. This slide is the output of a REPORT on a QMF save data tablespace. We query the DB2 catalog to find all related objects. We then issue a number of IFI query lock requests to identify any and all users who hold locks on the objects. The IFI requests are global, so we will find users on any member.
21. DBM1 Thread Storage Cancel This process is designed to stop rouge queries from gobbling up our DBM1 storage.
STGKILL -
SUBSYSTEM(DB5F) -
DBM1_STG%(92) -
CHASE ( -
DELTA(20) -
INTERVAL(30) )
We have had a number of DBM1 virtual storage problems in the past. While working through these issues, we discovered that individual threads can consume large amounts of DBM1 storage.
This utility will cancel or identify user threads that are chewing up large amounts of DBM1 storage.
No threads will be cancelled unless we determine that the subsystem may be nearing trouble.
We have not had as many issues since migrating to version 8. At version 7, we often had DBM1 storage spikes that would come and go within a matter of minutes.
The utility wakes up every minute and checks DBM1 and thread storage. The CHASE parm allows us to survey the situation for a period of time – we don’t cancel based on snapshot information. We have had a number of DBM1 virtual storage problems in the past. While working through these issues, we discovered that individual threads can consume large amounts of DBM1 storage.
This utility will cancel or identify user threads that are chewing up large amounts of DBM1 storage.
No threads will be cancelled unless we determine that the subsystem may be nearing trouble.
We have not had as many issues since migrating to version 8. At version 7, we often had DBM1 storage spikes that would come and go within a matter of minutes.
The utility wakes up every minute and checks DBM1 and thread storage. The CHASE parm allows us to survey the situation for a period of time – we don’t cancel based on snapshot information.
22. DBM1 Thread Storage Cancel Uses IFCID 217 data to report on and cancel threads that are consuming large amounts of DBM1 storage.
The focus is on above the line but below the bar storage.
Has proved to be less useful for us at version 8.
Rouge threads tend to be ones that process a large amount of dynamic SQL without a commit OR runaway queries with large sorts.
We have seen threads consume 200+ meg of DBM1 storage in several minutes. IFCID 217 could be more friendly if it contained the thread’s ACE. As it is we have to issue IFI requests based on the accounting information that is supplied in the IFCID 217 data.
This IFCID also reports on system threads and can be used to check for storage creeps that are related to these system threads.
IFCID 217 could be more friendly if it contained the thread’s ACE. As it is we have to issue IFI requests based on the accounting information that is supplied in the IFCID 217 data.
This IFCID also reports on system threads and can be used to check for storage creeps that are related to these system threads.
23. WORKDB Usage Cancel This process is designed to prevent runaway queries from consuming all of our DB2 sort space.
SRT KILL -
SUBSYSTEM(DB5F) -
GROUP(DSNDB0F) -
GROUP_ATTACH(DB0F) -
SRT _WORK_%_TOT(95) -
CUBW_STORAGE(40) -
CICS_INCLUDE(CIC1) -
* TRACE -
REPORT -
CHASE ( -
DELTA(20) -
INTERVAL(30) )
Our mix of operational and large warehouse workloads often leads to disruptions when user queries use up all of the WORKDB space on a member. We have had numerous batch abends because of this.
This utility attempts to identify and cancel user threads before they use up all WORKDB space.
This tool is designed to cancel user queries and it will not cancel production batch jobs or CICS transactions. The CICS_INCLUDE is used to monitor the CICS regions where we run QMF. Our mix of operational and large warehouse workloads often leads to disruptions when user queries use up all of the WORKDB space on a member. We have had numerous batch abends because of this.
This utility attempts to identify and cancel user threads before they use up all WORKDB space.
This tool is designed to cancel user queries and it will not cancel production batch jobs or CICS transactions. The CICS_INCLUDE is used to monitor the CICS regions where we run QMF.
24. WORKDB Usage Cancel At initialization the utility calculates the approximate amount of WORKDB space that is available.
Each invocation estimates the amount of space that each thread is consuming.
Threads that exceed certain thresholds will be cancelled – unless they are production jobs.
The usual culprits are Cartesian joins or queries with unusual GROUP BY’s.
Currently being reworked to use IFCID 342. When the utility starts up, it calculates the amount of available WORKDB space for the member and it uses heuristics to target specific user queries.
An unusual GROUP BY would be a group by a highly unique key on a large table.
Cartesian joins are the most typical culprit. Since we have large warehouses and a significant user base, we do not really use the DB2 governor to stop runaway queries.
We are currently looking at using the new IFCID 342 for this utility. When the utility starts up, it calculates the amount of available WORKDB space for the member and it uses heuristics to target specific user queries.
An unusual GROUP BY would be a group by a highly unique key on a large table.
Cartesian joins are the most typical culprit. Since we have large warehouses and a significant user base, we do not really use the DB2 governor to stop runaway queries.
We are currently looking at using the new IFCID 342 for this utility.
25. DB2 System Monitors These monitors run under our interval processor started task
DBM1 Storage Monitor
WLM Boost Utility
Replication Utility
26. Interval Processor The SCANNER – a generic interval processor
SCANDEF DB2_SUBSYSTEM((CONNRETRY_INTVL(60)))
*
QUEUE NUM(1),MAXDEPTH(2),TRIGGER(1),TICKER(1),MAXCPU(1),MAXELAPS(70)
*
PROCESS NAME(WLM_BOOST) -
INIT( -
INIT_PGM (PRCSWBSI), SYNTAX(DB9PWBST)), -
PROCEDURE( -
PROGRAM (PRCSWBST), OPTIONS(RESIDENT)), -
INVOCATION( -
QUEUE(1),INTERVAL(60),COMMAND(BOOST)) -
LIMITS(MAXCPU(2),MAXELAPS(60))
27. Interval Processor SCAN FACILITY
Interval Processing with Priority
Process Definition
Initialization
Process
Invocation
Limits
Three Tier MVS Task Structure
Main Task - Console Communications
Queue Task - Schedule Processes to Queue
Process Task - Manage Process Execution
28. Interval Processor SCAN FACILITY
Each Queue has its own sysout for reporting
Abend Handling
Recovery
Dump to a Queue Specific Sysout
Max Consecutive before disablement
RRS Rollback if defined Recoverable
Limits
Max Elapsed/Execution
Max CPU/Execution
29. DBM1 Storage Monitor This utility is a more specialized version of the MEMU rexx exec that is available from IBM.
STORAGE INTERVAL(60), -
STORAGE_WARNING(THRESHOLD(84),NODUMP), -
STORAGE_CRITICAL(THRESHOLD(92),DUMP), -
MAXDBAT(110), -
CTHREAD(120), -
REMAINING_ACTIVE_LOGS(10), -
LATCH_CONTENTION_THRESHOLD(15000), -
ABOVE_THE_BAR_THRESHOLD(2048), -
LATCH_RATES(01000, - . . . )
30. DBM1 Monitor – Features Has a warning and critical threshold for DBM1 storage usage
Checks for CTHREAD and MAXDBAT spikes
Tracks latch contention rates
Produces storage profile, similar to the output of the MEMU rexx
Issues WTOs that are used for email and paging
31. What We’ve Learned Large DB2 UTILITY LISTDEFs can consume a lot of DBM1 storage
Individual queries can use 300M of DBM1 storage
DB2 Latch contention rates need to be closely monitored
A small number of CPU intensive queries can consume large amounts of DBM1 storage
Thread spikes do not necessarily cause storage issues
32. WLM Boost Utility WLM Boost: used to issue –DIS THD(*) SERVICE WAIT commands when thread spikes occur.
BOOST -
SUBSYSTEM(DB6P) -
GROUP(DSNDB0P) -
GROUP_ATTACH(DB0P) -
THREADS(120) -
CHASE ( -
ITERATIONS(3) -
INTERVAL(5) -
)
33. Replication Utility Extract WEBSPHERE REPLICATION EVENT PUBLISHER XML from MQ queue and move to a dataset
EXTRACTQ QUEUE_MGR(MQB1), -
QUEUE(MN.QLOCAL.CMS_MAGNIFY_DATA), -
TRANSLATE(TO_EBCDIC), -
RESTART_QUEUE(MN.QLOCAL.CMS_MAGNIFY_XTRACT_RESTART), -
COMMIT_COUNT(100), -
DSN(TDBA003.TEST.EXTRACT(DDNAME(MAGEXTRD))), -
NEW,UNIT(SYSALLDA), -
DCB(RECFM(VB),LRECL(19000),BLKSIZE(19004),BUFNO(6)), -
SPACE(CYL(10,10)), -
VOLCNT(5), -
DSNLOG(XXXXXXXXXXX(LOGDDNAM(MQEXL))),START=WARM
34. Summary The combination of our development framework with IFI metadata has allowed us to quickly create tools that are tailored to our specific needs.
We would like to see IBM ship IFI metadata as a sample with DB2.
IBM should also continue to extend IFI to supply additional exception status information.
35. Thank You !!
Questions ?
36. Matt Ramsey, Mike Semega & Bob Vargo Progressive Insurance
Matt_Ramsey@Progressive.com, Mike_Semega@Progressive.com & Robert_S_Vargo@progressive.com