1 / 35

May 21, 2008 09:45 a.m. 10:45 a.m. Platform: DB2 for z

obedience
Download Presentation

May 21, 2008 09:45 a.m. 10:45 a.m. Platform: DB2 for z

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    2. Objectives Using DB2 IFI data to monitor and manage various DB2 system resources. Examples include DBM1 storage and latch contention. Designing and deploying specialized DB2 “thread killer” utilities to enhance both DB2 application and system availability. The nuts and bolts of replacing a vendor supplied Call Attach product with an in-house written solution. Leveraging IFI metadata to quickly develop customized DB2 tools. Setting up the re-useable framework that will serve as the foundation for all of the tools. This presentation will describe the numerous DB2 tools that have been written at Progressive Insurance. These tools enable us to manage our large and diverse DB2 workload. In addition, the presentation will detail methods that the members of the audience can employ to transfer these ideas to their own organizations. We will cover the basic framework that we have used to construct all of the tools. The tools that will be discussed will include a Call Attach replacement, a DBM1 storage monitor, a latch contention monitor, various “thread killer” utilities, a method to create DB2 IFI metadata, and a WLM boost technique. This presentation will describe the numerous DB2 tools that have been written at Progressive Insurance. These tools enable us to manage our large and diverse DB2 workload. In addition, the presentation will detail methods that the members of the audience can employ to transfer these ideas to their own organizations. We will cover the basic framework that we have used to construct all of the tools. The tools that will be discussed will include a Call Attach replacement, a DBM1 storage monitor, a latch contention monitor, various “thread killer” utilities, a method to create DB2 IFI metadata, and a WLM boost technique.

    3. Overview Our Environment Challenges and Opportunities The Development Framework and IFI Metadata The Tools Call Attach Replacement Thread Cancel Utilities DB2 System Monitors Tying it all Together Questions

    4. Our Environment

    5. Our Environment We Currently run Ten Way Datasharing on Eight LPARs The Majority of the Work comes from CICS and Distributed Transactions. The Datasharing Group Supports Both Operational and Warehouse Applications Our Applications Make Extensive use of Dynamic SQL “Speed to Market” is Paramount

    6. Challenges and Opportunities Company growth and increasing software costs User submitted work conflicting with warehouse utility jobs DBM1 virtual storage constraints DB2 lock/latch issues Tiered Support Organization We ran into serious DBM1 virtual storage constraint problems at Version 7. That’s the primary reason that we expanded out to ten members. We have not experienced any of these problems since the migration to Version 8. The same holds true for latch contention. We able to eliminate most of our latch contention problems while still at version 7. Most of this was due to application tuning. Our organization is moving toward a tiered support structure. This is forcing us to build specialized tools in order to help our first responders diagnose problems more quickly.We ran into serious DBM1 virtual storage constraint problems at Version 7. That’s the primary reason that we expanded out to ten members. We have not experienced any of these problems since the migration to Version 8. The same holds true for latch contention. We able to eliminate most of our latch contention problems while still at version 7. Most of this was due to application tuning. Our organization is moving toward a tiered support structure. This is forcing us to build specialized tools in order to help our first responders diagnose problems more quickly.

    7. The Development Framework PROGRAM ARCHITECTURE and SERVICES Parameter Processing Definition Parsing Validation Extraction Console Communications WTO Modify/Stop Issuing Console Commands The development framework is basically a collection of reusable code that allows us to build new utilities or tools. You can define the command syntax, set up the new tool to run as a started task or as a batch job, load all modules that the new tool will require, define messages, etc. The development framework is basically a collection of reusable code that allows us to build new utilities or tools. You can define the command syntax, set up the new tool to run as a started task or as a batch job, load all modules that the new tool will require, define messages, etc.

    8. The Development Framework PROGRAM ARCHITECTURE and SERVICES Program Loading Stack Processing Dynamic File Allocation Message Processing Definition Parameter Substitution Destination

    9. IFI Metadata We use assembler ADATA from all IFCID DSECTs as our starting point. This data is then post-processed and loaded into DB2 tables. The tables allow for versioning of the metadata. The metadata has been enhanced with the creation of a table that defines the relationship of each IFCID and it’s sections.

    10. IFI Metadata The development framework has been extended to support access to the IFI metadata and IFI data. The IFI metadata can be extracted and processed to create new DB2 tools. Additional programs provide an API that is used to obtain IFI data.

    11. Call Attach Replacement The CAF API was written as a replacement for a third party product //TDBA972A JOB (Z00000,N129),'CAF SAMPLE', // CLASS=7,MSGCLASS=X //CAFS EXEC PGM=DB2PROG //SYSOUT DD SYSOUT=* //SYSPRINT DD SYSOUT=* //DBAIN DD * PLAN(TSTCP000) This was written as a replacement for a third party product. We had to replace it due to increasing software costs. The product had been integrated into thousands of production batch jobs. This tool is a stub (DSNHLI) replacement and allows batch programs to run without TSO. Most of the work is performed on the first call that the application makes to DB2. After that, we simply pass the calls to DB2. In addition, we check for bad SQLCODEs on the way back. This check can allow the stub to force abends for severe SQLCODEs – we have chosen not to implement this at the present time. This was written as a replacement for a third party product. We had to replace it due to increasing software costs. The product had been integrated into thousands of production batch jobs. This tool is a stub (DSNHLI) replacement and allows batch programs to run without TSO. Most of the work is performed on the first call that the application makes to DB2. After that, we simply pass the calls to DB2. In addition, we check for bad SQLCODEs on the way back. This check can allow the stub to force abends for severe SQLCODEs – we have chosen not to implement this at the present time.

    12. Call Attach Replacement CAF SERVICES Dynamic Initialization Hierarchical Selection of Plan and Subsystem Subsystems in Data Sharing Group on Same Lpar Selection 16 choices depending on bits 40 to 43 of TOD - .000255 to .003840 second range SQL Error Logging (not ad nauseam) and Error Translation SVC update of TCBUSER for anchor Wait option for DB2 if not up Terminate option for thread on DB2 Quiesce Abend on Severe Error Option RRS Option Multi Threads for Multiple TCB usage The CAF replacement was enhanced over the years to support RRS and multitasking. We run two DB2 members from the same group on the same LPAR. Our CAF can randomly choose which member to connect to. This allows us to load balance batch across the members. IBM supplies a USERMOD which extends this concept to other attach types. The default subsystem (or group) is contained in a data only CSECT. Thus the load libraries are group specific. This was implemented in order to mimic the product that was being replaced. The group name, or any other default, can be overridden at run time. The CAF replacement was enhanced over the years to support RRS and multitasking. We run two DB2 members from the same group on the same LPAR. Our CAF can randomly choose which member to connect to. This allows us to load balance batch across the members. IBM supplies a USERMOD which extends this concept to other attach types. The default subsystem (or group) is contained in a data only CSECT. Thus the load libraries are group specific. This was implemented in order to mimic the product that was being replaced. The group name, or any other default, can be overridden at run time.

    13. Call Attach Replacement A basic architecture for a CAF stub replacement is documented in the Application Programming and SQL Guide. Our CKCD module is a modified version of the one that’s in the IBM documentation. Other Module’s functions: PRCIALOC – Dynamic allocation PRCIMSG* - Message processing PRCICAF* - Initialization modules PRCIPARS – Parm syntax processing PRCITHDC – Actual thread create PRCISQLI – Called by the stub, drives the initialization process. After the first call, this module passes the calls to DB2. A basic architecture for a CAF stub replacement is documented in the Application Programming and SQL Guide. Our CKCD module is a modified version of the one that’s in the IBM documentation. Other Module’s functions: PRCIALOC – Dynamic allocation PRCIMSG* - Message processing PRCICAF* - Initialization modules PRCIPARS – Parm syntax processing PRCITHDC – Actual thread create PRCISQLI – Called by the stub, drives the initialization process. After the first call, this module passes the calls to DB2.

    14. Call Attach Replacement The CAF replacement will report on negative SQLCODEs. The first five errors will be listed for any SQLCODE, after that, the messages will be suppressed. This feature was a part of the vendor product, so we had to include it our version. We also added the program name and statement number to all of the displays.The CAF replacement will report on negative SQLCODEs. The first five errors will be listed for any SQLCODE, after that, the messages will be suppressed. This feature was a part of the vendor product, so we had to include it our version. We also added the program name and statement number to all of the displays.

    15. Call Attach Replacement The CAF API has been enhanced to exploit zIIP engines //TDBA972A JOB (Z00000,N129),'CAF SAMPLE', // CLASS=7,MSGCLASS=X //CAFS EXEC PGM=DB2PROG //SYSOUT DD SYSOUT=* //SYSPRINT DD SYSOUT=* //DBAIN DD * PLAN(TSTCP000) CONNECT(REM_DB2) This is the latest enhancement that has been added to our CAF replacement. It is an attempt to use DB2 to DB2 DDF to exploit zIIP engines for large user submitted queries. We run our warehouses with in our operational data sharing group and we do not use SYSPLEX query parallelism. We came up with this idea as a workaround.This is the latest enhancement that has been added to our CAF replacement. It is an attempt to use DB2 to DB2 DDF to exploit zIIP engines for large user submitted queries. We run our warehouses with in our operational data sharing group and we do not use SYSPLEX query parallelism. We came up with this idea as a workaround.

    16. Thread Cancel Utilities We have several tools to cancel DB2 threads, based on . . . DB2 Object Usage DBM1 Thread Storage DB2 WORKDB Usage

    17. DB2 Object Usage Cancel The utility supports thread cancel by tablespace or indexspace CANCEL THREADS FOR TABLESPACE (DB.TS, ...) - SETSTATUS(UTIL), - WAITSTATUS(UTIL), - INCLUDE_USER(PROD*,TDBA*,..),- INCLUDE_JOB(P*...),- EXCLUDE_USER(PROD*,TDBA*,..),- EXCLUDE_JOB(P*...),- DISPLAY Te thread cancel utility was written for one main reason – to prevent the DBAs from receiving calls in the middle of the night !! We have many large data warehouses and user queries were constantly getting in the way of DB2 utilities. The thread cancel runs in batch ahead of the utilities and cancels all of these queries. The thread cancel accepts a list of tablespaces and/or indexes and it will cancel any thread that holds a lock on any related object. For example, if a user holds a lock on one table in a segmented tablespace, that user will be cancelled. We do not support wildcarding in the syntax – we decided that the utility should be more granular. It does handle up to 100 tablespaces and indexes per invocation. It will cancel users with locks on DSNDB06 and the TEMP or WORK tablespaces. We did not build in support for cancelling threads with locks on DSNDB01 objects.Te thread cancel utility was written for one main reason – to prevent the DBAs from receiving calls in the middle of the night !! We have many large data warehouses and user queries were constantly getting in the way of DB2 utilities. The thread cancel runs in batch ahead of the utilities and cancels all of these queries. The thread cancel accepts a list of tablespaces and/or indexes and it will cancel any thread that holds a lock on any related object. For example, if a user holds a lock on one table in a segmented tablespace, that user will be cancelled. We do not support wildcarding in the syntax – we decided that the utility should be more granular. It does handle up to 100 tablespaces and indexes per invocation. It will cancel users with locks on DSNDB06 and the TEMP or WORK tablespaces. We did not build in support for cancelling threads with locks on DSNDB01 objects.

    18. DB2 Object Usage Cancel Supports CANCEL or REPORT Allows QUERY of DB2 locks held by a user or users Will issue DB2 commands to set objects in UT status Executes DB2 commands to put objects back in RW status Will not CANCEL production batch jobs DISPLAY option issues display use, locks and claimers The REPORT option allows you to find threads without cancelling them. We do not allow production batch jobs to be cancelled. They will be listed in the utilities output. The DISPLAY option was put in as a debug aid. It will perform three displays for each object – use, locks and claimers. The DB2 objects can also be started for UT access. The utility will build a file with start commands that can be used to place the objects back in RW status once all DB2 utilities have completed. The REPORT option allows you to find threads without cancelling them. We do not allow production batch jobs to be cancelled. They will be listed in the utilities output. The DISPLAY option was put in as a debug aid. It will perform three displays for each object – use, locks and claimers. The DB2 objects can also be started for UT access. The utility will build a file with start commands that can be used to place the objects back in RW status once all DB2 utilities have completed.

    19. Query Locks This option allows us to display locks that are held by specific users. This was implemented so that we could figure out how to construct the IFI requests to find holders of the various kinds of locks. The lock types are documented in member DSNDQW02 of SDSNMACS. We look for these lock types: Tablespace locks Tablespace mass-delete locks Table locks Table mass-delete locks Index locks Tablespace partition lock (only one for V8) This option allows us to display locks that are held by specific users. This was implemented so that we could figure out how to construct the IFI requests to find holders of the various kinds of locks. The lock types are documented in member DSNDQW02 of SDSNMACS. We look for these lock types: Tablespace locks Tablespace mass-delete locks Table locks Table mass-delete locks Index locks Tablespace partition lock (only one for V8)

    20. DB2 Object Usage Cancel This slide is the output of a REPORT on a QMF save data tablespace. We query the DB2 catalog to find all related objects. We then issue a number of IFI query lock requests to identify any and all users who hold locks on the objects. The IFI requests are global, so we will find users on any member. This slide is the output of a REPORT on a QMF save data tablespace. We query the DB2 catalog to find all related objects. We then issue a number of IFI query lock requests to identify any and all users who hold locks on the objects. The IFI requests are global, so we will find users on any member.

    21. DBM1 Thread Storage Cancel This process is designed to stop rouge queries from gobbling up our DBM1 storage. STGKILL - SUBSYSTEM(DB5F) - DBM1_STG%(92) - CHASE ( - DELTA(20) - INTERVAL(30) ) We have had a number of DBM1 virtual storage problems in the past. While working through these issues, we discovered that individual threads can consume large amounts of DBM1 storage. This utility will cancel or identify user threads that are chewing up large amounts of DBM1 storage. No threads will be cancelled unless we determine that the subsystem may be nearing trouble. We have not had as many issues since migrating to version 8. At version 7, we often had DBM1 storage spikes that would come and go within a matter of minutes. The utility wakes up every minute and checks DBM1 and thread storage. The CHASE parm allows us to survey the situation for a period of time – we don’t cancel based on snapshot information. We have had a number of DBM1 virtual storage problems in the past. While working through these issues, we discovered that individual threads can consume large amounts of DBM1 storage. This utility will cancel or identify user threads that are chewing up large amounts of DBM1 storage. No threads will be cancelled unless we determine that the subsystem may be nearing trouble. We have not had as many issues since migrating to version 8. At version 7, we often had DBM1 storage spikes that would come and go within a matter of minutes. The utility wakes up every minute and checks DBM1 and thread storage. The CHASE parm allows us to survey the situation for a period of time – we don’t cancel based on snapshot information.

    22. DBM1 Thread Storage Cancel Uses IFCID 217 data to report on and cancel threads that are consuming large amounts of DBM1 storage. The focus is on above the line but below the bar storage. Has proved to be less useful for us at version 8. Rouge threads tend to be ones that process a large amount of dynamic SQL without a commit OR runaway queries with large sorts. We have seen threads consume 200+ meg of DBM1 storage in several minutes. IFCID 217 could be more friendly if it contained the thread’s ACE. As it is we have to issue IFI requests based on the accounting information that is supplied in the IFCID 217 data. This IFCID also reports on system threads and can be used to check for storage creeps that are related to these system threads. IFCID 217 could be more friendly if it contained the thread’s ACE. As it is we have to issue IFI requests based on the accounting information that is supplied in the IFCID 217 data. This IFCID also reports on system threads and can be used to check for storage creeps that are related to these system threads.

    23. WORKDB Usage Cancel This process is designed to prevent runaway queries from consuming all of our DB2 sort space. SRT KILL - SUBSYSTEM(DB5F) - GROUP(DSNDB0F) - GROUP_ATTACH(DB0F) - SRT _WORK_%_TOT(95) - CUBW_STORAGE(40) - CICS_INCLUDE(CIC1) - * TRACE - REPORT - CHASE ( - DELTA(20) - INTERVAL(30) ) Our mix of operational and large warehouse workloads often leads to disruptions when user queries use up all of the WORKDB space on a member. We have had numerous batch abends because of this. This utility attempts to identify and cancel user threads before they use up all WORKDB space. This tool is designed to cancel user queries and it will not cancel production batch jobs or CICS transactions. The CICS_INCLUDE is used to monitor the CICS regions where we run QMF. Our mix of operational and large warehouse workloads often leads to disruptions when user queries use up all of the WORKDB space on a member. We have had numerous batch abends because of this. This utility attempts to identify and cancel user threads before they use up all WORKDB space. This tool is designed to cancel user queries and it will not cancel production batch jobs or CICS transactions. The CICS_INCLUDE is used to monitor the CICS regions where we run QMF.

    24. WORKDB Usage Cancel At initialization the utility calculates the approximate amount of WORKDB space that is available. Each invocation estimates the amount of space that each thread is consuming. Threads that exceed certain thresholds will be cancelled – unless they are production jobs. The usual culprits are Cartesian joins or queries with unusual GROUP BY’s. Currently being reworked to use IFCID 342. When the utility starts up, it calculates the amount of available WORKDB space for the member and it uses heuristics to target specific user queries. An unusual GROUP BY would be a group by a highly unique key on a large table. Cartesian joins are the most typical culprit. Since we have large warehouses and a significant user base, we do not really use the DB2 governor to stop runaway queries. We are currently looking at using the new IFCID 342 for this utility. When the utility starts up, it calculates the amount of available WORKDB space for the member and it uses heuristics to target specific user queries. An unusual GROUP BY would be a group by a highly unique key on a large table. Cartesian joins are the most typical culprit. Since we have large warehouses and a significant user base, we do not really use the DB2 governor to stop runaway queries. We are currently looking at using the new IFCID 342 for this utility.

    25. DB2 System Monitors These monitors run under our interval processor started task DBM1 Storage Monitor WLM Boost Utility Replication Utility

    26. Interval Processor The SCANNER – a generic interval processor SCANDEF DB2_SUBSYSTEM((CONNRETRY_INTVL(60))) * QUEUE NUM(1),MAXDEPTH(2),TRIGGER(1),TICKER(1),MAXCPU(1),MAXELAPS(70) * PROCESS NAME(WLM_BOOST) - INIT( - INIT_PGM (PRCSWBSI), SYNTAX(DB9PWBST)), - PROCEDURE( - PROGRAM (PRCSWBST), OPTIONS(RESIDENT)), - INVOCATION( - QUEUE(1),INTERVAL(60),COMMAND(BOOST)) - LIMITS(MAXCPU(2),MAXELAPS(60))

    27. Interval Processor SCAN FACILITY Interval Processing with Priority Process Definition Initialization Process Invocation Limits Three Tier MVS Task Structure Main Task - Console Communications Queue Task - Schedule Processes to Queue Process Task - Manage Process Execution

    28. Interval Processor SCAN FACILITY Each Queue has its own sysout for reporting Abend Handling Recovery Dump to a Queue Specific Sysout Max Consecutive before disablement RRS Rollback if defined Recoverable Limits Max Elapsed/Execution Max CPU/Execution

    29. DBM1 Storage Monitor This utility is a more specialized version of the MEMU rexx exec that is available from IBM. STORAGE INTERVAL(60), - STORAGE_WARNING(THRESHOLD(84),NODUMP), - STORAGE_CRITICAL(THRESHOLD(92),DUMP), - MAXDBAT(110), - CTHREAD(120), - REMAINING_ACTIVE_LOGS(10), - LATCH_CONTENTION_THRESHOLD(15000), - ABOVE_THE_BAR_THRESHOLD(2048), - LATCH_RATES(01000, - . . . )

    30. DBM1 Monitor – Features Has a warning and critical threshold for DBM1 storage usage Checks for CTHREAD and MAXDBAT spikes Tracks latch contention rates Produces storage profile, similar to the output of the MEMU rexx Issues WTOs that are used for email and paging

    31. What We’ve Learned Large DB2 UTILITY LISTDEFs can consume a lot of DBM1 storage Individual queries can use 300M of DBM1 storage DB2 Latch contention rates need to be closely monitored A small number of CPU intensive queries can consume large amounts of DBM1 storage Thread spikes do not necessarily cause storage issues

    32. WLM Boost Utility WLM Boost: used to issue –DIS THD(*) SERVICE WAIT commands when thread spikes occur. BOOST - SUBSYSTEM(DB6P) - GROUP(DSNDB0P) - GROUP_ATTACH(DB0P) - THREADS(120) - CHASE ( - ITERATIONS(3) - INTERVAL(5) - )

    33. Replication Utility Extract WEBSPHERE REPLICATION EVENT PUBLISHER XML from MQ queue and move to a dataset EXTRACTQ QUEUE_MGR(MQB1), - QUEUE(MN.QLOCAL.CMS_MAGNIFY_DATA), - TRANSLATE(TO_EBCDIC), - RESTART_QUEUE(MN.QLOCAL.CMS_MAGNIFY_XTRACT_RESTART), - COMMIT_COUNT(100), - DSN(TDBA003.TEST.EXTRACT(DDNAME(MAGEXTRD))), - NEW,UNIT(SYSALLDA), - DCB(RECFM(VB),LRECL(19000),BLKSIZE(19004),BUFNO(6)), - SPACE(CYL(10,10)), - VOLCNT(5), - DSNLOG(XXXXXXXXXXX(LOGDDNAM(MQEXL))),START=WARM

    34. Summary The combination of our development framework with IFI metadata has allowed us to quickly create tools that are tailored to our specific needs. We would like to see IBM ship IFI metadata as a sample with DB2. IBM should also continue to extend IFI to supply additional exception status information.

    35. Thank You !! Questions ?

    36. Matt Ramsey, Mike Semega & Bob Vargo Progressive Insurance Matt_Ramsey@Progressive.com, Mike_Semega@Progressive.com & Robert_S_Vargo@progressive.com

More Related