The DB2 Detective –
Download
1 / 48

Sept. 2009 - PowerPoint PPT Presentation


  • 224 Views
  • Updated On :

The DB2 Detective – The Case of the Diligent DBA. Bill Minor IBM. Sept. 2009. Highlights. No matter the extent of preparation, it seems that unexpected outages pop up when least expected. As a result, unplanned downtime can cause

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Sept. 2009' - kylee


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Slide1 l.jpg

The DB2 Detective –

The Case of the Diligent DBA

Bill MinorIBM

Sept. 2009


Slide2 l.jpg

Highlights

No matter the extent of preparation, it seems that unexpected outages pop

up when least expected. As a result, unplanned downtime can cause

significant impact to business operations. When misfortune strikes, there are

always two predominant questions, "What happened?" and "When will the

system be available again?" Mysteries can be resolved by good DB2

Detective work - identifying and circumventing actions that can lead to

problems. When unexpected problems do occur, it is vital to be able to act

decisively and swiftly.

The goal of this session is to introduce database inspection tools and provide

insights that can be used to help monitor the integrity and maintain the

consistency of your data. I will also provide guidance for salvage, repair,

recovery, and mitigation operations that are vital when crisis strikes and time

is of the essence.


Agenda l.jpg

Help with defining a problem

Help with collecting and understanding information/diagnostics related to a problem

Tools to help in the event of an outage

Tools to help avoid outages in the first place

Features to help avoid, control, and reduce the impact of an unexpected problem (Technology Preview or TP)

RAS (Reliability, Availability and Serviceability)

Agenda


In the beginning l.jpg
In the Beginning ….

  • ‘There is an outage’ !!!!!! Where do we start?

    …… Take a deep breath … (DB2 911)

  • Keep it simple to begin; fundamentals are essential

  • Order of Investigation/ ‘Case’:

    • Problem Description

      • Problem Determination

        • Problem Resolution


Tools in the detective kit l.jpg
‘Tools’ in the Detective Kit

  • The problem description

  • db2diag.log

  • db2trc

  • db2pd

  • db2dart and inspect

  • db2top

  • db2exfmt

  • admin functions

  • db2cos, db2support, db2fodc

  • Resiliency (mitigation/toleration/avoidance)


An outage l.jpg
An ‘Outage’

  • “Outage” – subjective terminology:

    • Data availability:

      • slow access

      • no access

    • Unexpected crashes

  • Notes:

    • There is always a trade-off between how much time can be spent investigating a problem and making system available again to users

    • Other factors: risk mitigation; establishing cause; development of a workaround (in absence of a ‘fix’ obviously don’t want to hit this again)

      An outage can be planned (i.e. maintenance) or unplanned (unexpected

      errors, terminations, slow or unavailable access to data, …).

      ‘Path to DBA diligence’

    • Handling and avoiding unplanned outages via: investigation, problem definition, monitoring, prevention, education, …… (hopefully something from this presentation ) …


What information needs to be collected l.jpg
What Information Needs to be Collected?

  • Each problem has it own set of one or more signatures

  • Problems (even those with the same source – so ‘same problem’) can manifest themselves in different ways

    • Systems and environments are different: hardware, software, different users, different data (scope and volume), …

  • There is a set of fundamental or initial information/diagnostics that needs to be collected for every problem .. No matter what

  • Keep in mind that additional information or diagnostics could be required depending on the initial analysis investigation

    • Goal: To minimize the amount of effort required in collecting diagnostics


The problem description l.jpg
The Problem Description

  • VERY VERY VERY IMPORTANT (may seem obvious)

    • Avoid taking the wrong path

    • Invest in:

      • Establishing “What happened?”

      • Getting the correct information and diagnostics

        REQUIRED FUNDAMENTALS:

  • What is the current state/situation?

    • Describe what happened and what needs to happen

    • Include recent problem or change history

    • Client/Server configuration

  • What changed? No really, what changed? (Usually something has but just not apparent how, why or when)

  • Include specific actions known to be performed at the time of the incident

  • Product version and release number (FP? Special build?)

  • SQL error codes

  • Scope: System Wide? Instance Wide? Database? Application(s)?


Performance issues what and when l.jpg

Identification of such an issue requires a baseline – a before and after comparison

Typically the issue is, ‘things are slow’ (really mean ‘slower’)

Slow query

After running RUNSTATS

After applying a fixpack

After migrating to a new release of DB2

After making a database configuration change

Performance of triggers, stored procedures, UDFs with SQL, crashes/traps during query execution

Differentiate between overall system performance and specific actions

Date/time correlation? Other maintenance activities?

Different systems? (db2pd –osinfo)

Performance Issues – What? and When?


Problem scope l.jpg
Problem ‘Scope’

  • Scope and Cause are required in order develop and execute action plan; action plan may be ‘straightforward’ i.e. restore the database

  • Manifest itself in one of the following outage scenarios:

    • Instance trap – db2start required

    • Database trap

    • Application failure


Diagnostics to collect at the outset l.jpg

DB2 Build Level

Operating System type and level

DB2 registry variable settings

MPP? SMP? Single-processor? Database partitioned? Which partition?

Syntax of command/statement associated with issue

DB2 error received

db2diag.log file which sufficient history

Notification log

DB2 Trap files (Call stacks), DB2 dump files

History File

db2 trace

Messages file

Core file, system error logs (Unix:errpt Windows: event logs)

Log files and log file control structures

Diagnostics to Collect at the Outset

Fundamental


Db2diag log file l.jpg

Where is it? $HOME/sqllib/db2dump/

What is it? ‘Supplementary’ diagnostic logging facility

How to read? Each entry:

Time ordered entries

Database specific; Database node/partition (i.e. DPF specific)

Process/thread ID; Application specific identifiers/handles

Function name where entry was generated

Messaging and perhaps some data

2009-03-03-10.26.14.523824-300 I37414837E502 LEVEL: Error

PID : 8083 TID : 54 PROC : db2sysc 20

INSTANCE: svtdbm3 NODE : 020 DB : MASUN1

APPHDL : 0-138 APPID: *N0.svtdbm3.090303150827

AUTHID : SVTDBM3

EDUID : 54 EDUNAME: db2agntp (MASUN1) 20

FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800

MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full."

DIA8312C Disk was full.

Can be parsed with diag tool and/or admin function (examples in Backup Slides)

db2diag.log file


Db2trc trace facility l.jpg

Useful for problems/issues that can be recreated

Generates a flow and format of the code paths taken by DB2 while the trace is active

Traces all DB2 activity in an instance on a physical node (database partition). If many logical db2 nodes, all activity is traced

Trace to memory buffer or file (slower)

Trace to file for trace wraps

Always include corresponding db2diag.log file with a db2 trace

Examples:

db2trc on –l 8M … trace to memory

db2trc on –f mytrace.trc … trace to file

db2trc inf …. Information returned when trace is active)

db2trc flw mytrace.trc mytrace.flw ……

db2trc fmt mystrace.trc mytrace.fmt …… Readable versions of trace data

db2trc – Trace facility


Db2dart d atabase a nalysis and r eporting t ool l.jpg

Primarily to examine databases for architectural correctness

Check validity of meta-data structures, page headers, row headers, …

Runs against data as it exists on disk (primarily an offline tool)

Critical for investigating problems associated with data ‘corruption’

Granularity: database /db, tablespace /ts, table /t

EXAMPLES:

db2dart SAMPLE /db

db2dart SAMPLE /ts /tsi 4

db2dart SAMPLE /t /tsi 2 /oi 5

Marking problematic objects as invalid; recovering or extracting data

Consistency: Index key data checking

HWM – space tablespace space management

db2dart (Database Analysis and Reporting Tool)


Inaccessible table salvaging data l.jpg

Example: db2dart SAMPLE /ddel

Table ID or name, tablespace ID, first page, num of pages:

(suffic page number with 'p' for pool relative)

13 2 0 100

7 of 7 columns in the table will be dumped.

Column numbers and datatypes of the columns dumped:

0 SMALLINT

1 VARCHAR() -VARIABLE LENGTH CHARACTER STRING

2 SMALLINT

3 CHAR() -FIXED LENGTH CHARACTER STRING

4 SMALLINT

5 DECIMAL

6 DECIMAL

Default filename for output data file is TS2T13.DEL,

do you wish to change filename used? y/n

n

Filename used for output data file is TS2T13.DEL. If existing file, data will be appended to it.

Inaccessible Table – Salvaging Data


Marking indexes as invalid db2dart mi l.jpg

Example:

db2dart testdb /mi /tsi 2 /oi 4

Time of index rebuild dictated by setting of INDEXREC database cfg variable

Usually one of RESTART (default) or ACCESS

RESTART – database restarted

ACCESS – when index is first accessed

Messages written to db2diag.log file indicating index rebuild

Marking Indexes As Invalid – db2dart /MI


Checking database backup images l.jpg

Database backup images represent one of your ultimate recovery options

Hence, checking their integrity is prudent

db2ckbkp

Check the integrity of a database backup image

Can extract pages from the backup image to a file (-e option)

This may be part of a problem causal analysis or repair procedure

What ‘state’ is the data in the backup image?

Tivoli Storage Manager (TSM): db2adutil -verify

Checking Database Backup Images


History file l.jpg

Records information specific to the following database actions:

BACKUP, ROLLFORWARD, DROPPED TABLE, LOAD, CREATE TABLESPACE, RENAME TABLESPACE, REORG, ALTER TABLESPACE, ARCHIVE LOG

Dropped table recovery

CREATEor ALTER TABLESPACE option DROPPED TABLE RECOVERY

Allows recovery of dropped table data using table space-level restore and rollforward operations

Faster than database recovery and database available to users

History File


List history dropped table recovery l.jpg

Identify the table which was dropped actions:

Op Obj Timestamp+Sequence Type Dev Earliest Log Current Log Backup ID

-- --- ------------------ ---- --- ------------ ------------ --------------

D T 20090223125627 000000000000693f00020013

----------------------------------------------------------------------------

"BMINOR "."STAFF3" resides in 1 tablespace(s):

00001 USERSPACE1

----------------------------------------------------------------------------

Comment: DROP TABLE

Start Time: 20090223125627

End Time: 20090223125627

Status: A

----------------------------------------------------------------------------

EID: 9

DDL: CREATE TABLE "BMINOR "."STAFF3" ( "ID" SMALLINT NOT NULL , "NAME" VARCHAR(9) , "DEPT" SMALLINT , "JOB" CHAR(5) , "YEARS" SMALLINT , "SALARY" DECIMAL(7,2) , "COMM" DECIMAL(7,2) ) IN "USERSPACE1" ;

----------------------------------------------------------------------------

LIST HISTORY –Dropped Table Recovery


Dropped table recovery continued l.jpg

Restore a database- or table space-level backup image taken before the table was dropped

Create an export directory to which files containing the table data are to be written

Roll forward to a point in time after the table was dropped, using the RECOVER DROPPED TABLE option on the ROLLFORWARD DATABASE command (alternatively, roll forward to the end of the logs, so that updates to other tables in the table space or database are not lost)

Re-create the table using the CREATE TABLE statement from the recovery history file.

Import the table data that was exported during the rollforward operation into the table

Dropped Table Recovery(continued)


Performance explain plans l.jpg

Optimizer access plans change … could be legitimate, maybe not

Compare plans, stats, configuration: before vs after

Monitor

Example:

db2 –tvf $HOME/sqllib/misc/EXPLAIN.DDL

db2 explain all for “select * from bminor.staff where id=10”

db2exfmt -d sample -n % -e % -s % -w -l -# 0 -o explainPlan.out -g TIC

Performance – Explain Plans


Db2exfmt output example l.jpg

Access Plan: not

-----------

Total Cost: 11.5946

Query Degree: 1

Rows

RETURN

( 1)

Cost

I/O

|

6.28

FETCH

( 2)

11.5946

1

/---+---\

6.28 157

IXSCAN TABLE: BMINOR

( 3) STAFF

3.2496 Q1

0

|

157

INDEX: BMINOR

I1_STAFF

Q1

Objects Used in Access Plan:

---------------------------

Schema: BMINOR

Name: I1_STAFF

Type: Index

Time of creation: 2009-03-02-15.52.41.836445

Last statistics update:

Number of columns: 1

Number of rows: 157

Width of rows: -1

Number of buffer pool pages: 1

Distinct row values: No

Tablespace name: USERSPACE1

Tablespace overhead: 7.500000

Tablespace transfer rate: 0.060000

Source for statistics: Single Node

Prefetch page count: 32

Container extent page count: 32

Schema: BMINOR

Name: STAFF

Type: Table

Time of creation: 2009-03-02-15.50.29.727941

Last statistics update:

Number of columns: 7

Number of rows: 157

Width of rows: 40

Number of buffer pool pages: 1

Number of data partitions: 1

Distinct row values: No

Tablespace name: USERSPACE1

Tablespace overhead: 7.500000

Tablespace transfer rate: 0.060000

db2exfmt Output - Example


Db2pd monitor and troubleshoot db2 l.jpg
db2pd – Monitor and Troubleshoot DB2 not

  • Indispensable for monitoring and troubleshooting

  • Non-intrusively capable of obtaining various statistics, snapshot information, internal meta-data, … from an instance

  • Relevant options (some of my favorites):

    • -osinfo, -mempools, -memblocks, -edus (more to come), -stack, -applications, -transactions, -bufferpools, -logs, -pages, -tablespaces, -tcbstats index, -apinfo, -fvp, …


Achieving understanding via sql routines l.jpg

DB2 has a very rich and diverse set of SQL routines that can be used probe and monitor one’s system

For example:

PD_GET_DISG_HIST

PD_GET_LOG_MSGS

PDLOGMSGS_LAST24HOURS

DB_HISTORY

LOG_UTILIZATION

LONG_RUNNING_SQL

LOCKS_HELD

LOCKWAITS

CONTAINER UTILIZATION

TBSP_UTILIZATION

ADMIN_GET_TAB_INFO, ADMIN_GET_INDEX_INFO

APPL_PERFORMANCE

BP_HITRATIO, BP_READ_IO, BP_WRITE_IO

TOP_DYNAMIC_SQL

MON_GET_BUFFERPOOL, MON_GET_TABLE, MON_GET_TABLESPACE, ….

Achieving Understanding via SQL Routines


Db2top l.jpg

What is it? be used probe and monitor one’s system

DB2 dynamic monitoring tool. But so are snapshots?

Snapshot data is typically cumulative (point in time)

db2top provides delta values in real time

db2top integrates multiple types of snapshot data together

Diagnose performance and resource problems, optimize

Provides a unified, single-system view of a multi-partition database or single-partition database for the purposes of providing a dynamic real-time view of a running DB2 system

Availability: V8 FP17, V9.1 FP6, V9.5 FP2

developerworks whitepaper:

http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/

db2top


When unexpected problems occur db2cos db2 call out script l.jpg

Invoked automatically by Database invoked by default when the database manager cannot continue processing due to a panic, trap, segmentation violation or exception

Each default db2cos script will invoke db2pd commands to collect information in an unlatched manner

Order of events when trap/panic occurs:

Trap file is created

Signal handler is called

db2cos script is called

An entry is logged in the administration notification log

An entry is logged in the db2diag log file

Information is generated by db2pd (operating system, the Version and Service Level installed, the database manager and database configuration, as well as information about the state of the agents, memory pools, memory sets, memory blocks, applications, utilities, transactions, …)

When Unexpected Problems Occur – db2cos :DB2 Call Out Script


Db2support diagnostic data collection l.jpg

Collects environment data about either a client or server machine and places the files containing system data into a compressed file archive

Collects db and dbm cfg info, reg var settings, db2level, db2diag.log, … all into one file called db2support.zip

Collects up diagnostics generated by db2cos* scripts

db2supp_opt.zip within db2support.zip

This file contains a host of optimizer related info: query text, exfmt output (explain of query), db2look and db2batch output, catalog table info, …

To assist with Optimizer (plan) problems

db2support . –db <dbname> -st “select * from bminor.staff”

db2support – Diagnostic Data Collection


Db2pdcfg l.jpg

Basically a tool which can set or change flags which influence the database system behavior for problem determination purposes

Control behaviors of cos and fodc (more to come on this one)

Neat option: db2pdcfg –flushbp –db <dbname>

Allow faster failover from primary to secondary in HADR setups

Monitor and tune bufferpools for number of dirty pages (use inconjunction with –pages and –bufferpools)

Sync in-memory tablespace meta-data structures with on-disk versions

Execute before using db2dart /DHWM against active database

db2pdcfg


First occurrence data capture fodc l.jpg

A facility (V9.5) to capture diagnostic information when a problem first occurs

Automatic mode: invoked by DB2 engine when outage or error condition is detected

Traps, panics, corruptions

Manual mode: user initiated – db2fodc

Hangs, performance issues

Current setting: db2pdcfg –fodc status

db2pdcfg can be used to change defaults

Information collected and placed in subdirectory under DIAGPATH

FODC_<outagetype>_<timestamp>

Run db2support after db2fodc to collect all the fodc information and package into one zip file that can be sent to Support

First Occurrence Data Capture (FODC)


Db2fodc example l.jpg

Example: db2fodc –hang –db SAMPLE problem first occurs

2009-03-03-11.33.19.657457-300 E1683E654 LEVEL: Warning

PID : 619 TID : 47350926685104PROC : db2fodc

INSTANCE: bminor NODE : 000

FUNCTION: DB2 UDB, RAS/PD component, pdDb2FODCMain, probe:30

MESSAGE : ADM14003W FODC has been invoked by the user from db2fodc tool for

symptom "hang" and diagnostic information has been recorded in

directory

"/home/bminor/sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318".

Please look in this directory for detailed evidence about what

happened and contact IBM support if necessary to diagnose the

problem.

File Listing for ../sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318

6116.1.000.stack.txt 6116.17.000.stack.txt 6116.27.000.stack.txt 6116.31.000.stack.txt DB2CONFIG

6116.11.000.stack.txt 6116.18.000.stack.txt 6116.28.000.stack.txt 6116.32.000.apm.bin

6116.12.000.stack.txt 6116.19.000.stack.txt 6116.29.000.apm.bin 6116.32.000.dump.bin

6116.13.000.stack.txt 6116.20.000.stack.txt 6116.29.000.dump.bin 6116.32.000.stack.txt DB2PD

6116.14.000.stack.txt 6116.21.000.stack.txt 6116.29.000.stack.txt 6116.33.000.apm.bin DB2SNAPS

6116.15.000.dump.bin 6116.22.000.stack.txt 6116.30.000.apm.bin 6116.33.000.dump.bin DB2TRACE

6116.15.000.stack.txt 6116.23.000.stack.txt 6116.30.000.dump.bin 6116.33.000.stack.txt OSCONFIG

6116.16.000.apm.bin 6116.24.000.stack.txt 6116.30.000.stack.txt 6116.34.000.apm.bin OSSNAPS

6116.16.000.dump.bin 6116.25.000.stack.txt 6116.31.000.apm.bin 6116.34.000.dump.bin OSTRACE

6116.16.000.stack.txt 6116.26.000.stack.txt 6116.31.000.dump.bin 6116.34.000.stack.txt

db2eventlog.000 db2fodc_hang.log

db2fodc Example


Resiliency l.jpg
Resiliency problem first occurs

  • Avoiding and dealing with ‘errors’ that can result in unplanned outages


Online data consistency checking v9 5 l.jpg

Index to Data consistency problems can be categorized as: problem first occurs

Orphaned index entry:

An index entry points to a deleted or non-existent record slot

Invalid index entry:

An index entry points to a valid slot on a data page, but index key data

does not match record data

Missing key:

A data row is not indexed in at least one of the indexes on the table

Duplicate Record IDs (RIDs):

A data row is indexed more than once

These are classified as logical corruptions.

Online Data Consistency Checking (V9.5)


Key data checking via db2 inspect v9 5 l.jpg

New problem first occursINDEXDATA cross-object checking clause to the INSPECT utility

Example:

db2 inspect check table name STAFF indexdata results keep insp.out

NOTE:

The INDEXDATA option only examines the logical inconsistency between

index and data. Therefore, it is recommended that you first run INDEX and

DATA checking separately, to rule out any physical corruption, before

running INDEXDATA checking.

Key-Data Checking via DB2 INSPECT (V9.5)


Threaded engine model for unix in v9 5 l.jpg

DB2 has always had a threaded engine model on Windows platforms

In V9.5, DB2 Unix engine has now evolved from process-based to thread-based model

All threads within a process can share the memory allocated by, or within, that process

Several advantages: basically performance and memory management

From a DBA perspective:

Better administration and tuning of memory through administration

Resiliency improvements (upcoming slides)

Threaded Engine Model for Unix in V9.5


Db2 processes vs threads ps ef l.jpg

V9.1: platforms

bminor 6107356 4780132 0 13:00:53 - 0:00 db2sysc

bminor 6148108 6717616 0 13:01:03 - 0:00 db2agent (SAMPLE)

bminor 6180952 1458238 0 13:08:22 - 0:00 db2pclnr

bminor 6193316 1458238 0 13:08:22 - 0:00 db2pclnr

bminor 6660150 6107356 0 13:00:53 - 0:00 db2licc

bminor 6717616 6107356 0 13:00:53 - 0:00 db2ipccm

bminor 884882 1458238 0 13:08:22 - 0:00 db2loggw (SAMPLE)

bminor 1237024 1458238 0 13:08:22 - 0:00 db2lfr (SAMPLE)

bminor 1458238 6107356 0 13:00:53 - 0:00 db2gds

bminor 1556708 6717616 0 13:01:07 - 0:14 db2stmm (SAMPLE)

bminor 1757202 1458238 0 13:08:22 - 0:00 db2dlock (SAMPLE)

bminor 1908750 1458238 0 13:08:22 - 0:00 db2pclnr

bminor 2195682 1 0 13:08:21 - 0:00 /home4/bminor/sqllib/bin/db2bp

V9.5:

bminor 10364 10362 1 13:15 pts/19 00:00:00 db2sysc

bminor 10377 10362 1 13:15 pts/19 00:00:00 db2acd,0,0,0,1,0,0,1,0,8a18

bminor 10385 1 0 13:15 pts/19 00:00:00/home/bminor/sqllib/bin/db2bp

DB2 Processes vs. Threads: ps -ef


Running threads db2pd edus l.jpg

Database Partition 0 -- Active -- Up 2 days 17:59:14 platforms

List of all EDUs for database partition 0

db2sysc PID: 1187

db2wdog PID: 1185

db2acd PID: 1201

EDU ID TID Kernel TID EDU Name USR SYS

=========================================================================================================

65 47726691608896 20490 db2agntdp (SAMPLE ) 0.000000 0.000000

64 47726695803200 999 db2agent (SAMPLE) 0.240000 0.110000

63 47726708386112 998 db2agent (SAMPLE) 0.720000 0.160000

61 47726712580416 308 db2evmli (DB2DETAILDEADLOCK) 0.000000 0.000000

60 47726716774720 307 db2wlmd (SAMPLE) 0.000000 0.000000

59 47726720969024 306 db2pfchr (SAMPLE) 0.000000 0.000000

57 47726737746240 304 db2pfchr (SAMPLE) 0.000000 0.000000

55 47726733551936 302 db2pclnr (SAMPLE) 0.000000 0.000000

54 47726729357632 301 db2pclnr (SAMPLE) 0.000000 0.000000

53 47726746134848 300 db2dlock (SAMPLE) 0.000000 0.000000

52 47726762912064 32767 db2lfr (SAMPLE) 0.000000 0.000000

51 47726741940544 32766 db2loggw (SAMPLE) 0.020000 0.030000

50 47726758717760 32765 db2loggr (SAMPLE) 0.000000 0.030000

49 47726699997504 1554 db2taskd (SAMPLE) 0.000000 0.000000

25 47726754523456 1365 db2stmm (SAMPLE) 8.020000 0.460000

16 47726767106368 1265 db2agent (SAMPLE) 1.420000 0.810000

15 47726771300672 1199 db2resync 0.000000 0.000000

14 47726775494976 1196 db2ipccm 0.080000 0.030000

13 47726779689280 1195 db2licc 0.000000 0.000000

12 47726783883584 1194 db2thcln 0.000000 0.000000

11 47726788077888 1192 db2alarm 0.020000 0.020000

1 47726528031040 1191 db2sysc 5.420000 5.620000

Running Threads: db2pd -edus


Memory protection via storage keys v9 5 l.jpg

DB2 is leveraging technology which is specific to the POWER6 platform and AIX (TL5300-06 or greater)

Hardware feature called Storage Keys

Allows for memory protection functionality

DB2 V9.5 has the capability to detect erroneous access to critical regions of it’s own memory, primarily bufferpool memory

This is pro-active protection:

When such access is detected it is stopped before a corruption can result

Offending UDF or Stored Procedure receives an error code indicating that an incorrect memory access was performed

Cannot protect against all types of bugs but it does provide an added layer of protection

Memory Protection via Storage Keys (V9.5)


Enablement of memory protection via storage keys v9 5 l.jpg

Registry variable: platform and AIX (TL5300-06 or greater)

DB2_MEMORY_PROTECT = [ NO | YES ]

Error DBI1301E on attempted set if:

Hardware, OS does not support storage keys

Large page support (i.e. DB2_LARGE_PAGE_MEM) is enabled

Enablement of Memory Protection viaStorage Keys (V9.5)


Higher availability with storage key support l.jpg

Thread Suspension platform and AIX (TL5300-06 or greater): An additional level of protection that will increase availability of systems when abnormal termination of a DB2 Kernel Thread occurs

DB2_THREAD_SUSPENSION = [ OFF | ON ]

What is it?

It allows you to control whether a DB2 instance sustains a trap by suspending a faulty DB2 kernel thread (a thread that has tried illegally to access memory protected with storage protection keys).

Higher Availability with Storage Key Support


Error toleration tp l.jpg

Goal: Reduce unplanned outages platform and AIX (TL5300-06 or greater)

Error toleration via:

Trap resilience

Physical read error toleration

Logical read error toleration

Registry variable:

DB2RESILIENCE=ON/TRUE or not defined (ie default behavior) activates error toleration support for the above

DB2RESILIENCE=OFF/FALSE disables these capabilities.

Error Toleration (TP)


Extended trap resilience tp l.jpg

Registry Variable: platform and AIX (TL5300-06 or greater)

DB2RESILIENCE= [ ON | OFF ]

When a DB2 server detects a trap the following actions will be taken by DB2:

First Occurrence Data Capture (FODC) will be activated

Execution stack is dumped

If Storage Key support is active then process thread suspension decisions

If Storage Key suspension decisions allow for thread suspension, suspend thread

If Storage Key support is not active or fails decision points to suspend thread, then proceed with non Storage Key survival decisions (ie. Utilize this new capability)

DB2 will determine if the current environment is safe to allow recovery

Extended Trap Resilience (TP)


Messaging l.jpg

2008-11-17-09.56.27.542572-300 I75439E563 LEVEL: Severe

PID : 28040 TID : 47643040409920PROC : db2sysc

INSTANCE: marvin NODE : 000 DB : TESTDB

APPHDL : 0-21 APPID: *LOCAL.marvin.081117145537

AUTHID : MARVIN

EDUID : 29 EDUNAME: db2agent (TESTDB)

FUNCTION: DB2 UDB, RAS/PD component, pdResilienceIsSafeToRollback,probe:800

DATA #1 : String, 37 bytes

Trap Sustainability Criteria Checking

DATA #2 : Hex integer, 8 bytes

0x00000000000A0400

DATA #3 : Boolean, 1 bytes

true

DATA #1 A generic description of the data following

DATA #2 The hex dump of the trap sustainment criteria

DATA #3 The decision on whether to attempt to sustain the trap (true/false)

Messaging:


Physical read error toleration tp l.jpg

Physical read errors are detected when retrieving DB2 pages from the storage medium (for disk drive).

This error could be caused by issues with the contents of the data on disk or by problems accessing the physical device on which the data resides. Currently, when DB2 detects either of these errors the DB2 instance is shutdown.

This solution will provide enhanced toleration of physical read errors such that if such an error is encountered, rather than force the DB2 instance offline, the SQL statement or operation being executed will fail, a new SQLCODE will be returned to the application when appropriate, logging of appropriate diagnostics will occur, and the DB2 instance will remain active.

Physical Read Error Toleration (TP)


Toleration of physical read errors l.jpg

When a DB2 server detects a physical read error there will be a maximum of two more attempts to read the page from disk and validate the DB2 page header. If at the end of the multiple attempts an error is still determined to exist then:

If DB2RESILIENCE=ON

Sqlcode -1655E will be returned to the application

The SQL statement or DB2 operation fails

Administration message ADM6006E will be placed in the DB2 Administration log at notify level URGENT

Administration message ADM6006E will be placed in the DB2 diagnostic log at level URGENT

First Occurrence Data Capture (FODC) will be activated when the error is detected and will be turned off when the sqlcode is returned to the application. This will allow collection of diagnostic data generated due to this failure detection for further analysis by DB2 Support.

The database instance remains available

Application connection remains available for continued operations

If DB2RESILIENCE=OFF then panic the instance (that is, revert to old behavior)

This algorithm will aid in reducing application errors which are a result of transient file system I/O errors.

Toleration of Physical Read Errors


Logical read error toleration tp l.jpg

Inconsistency detected by DB2 when processing data contained within a DB2 page

Instead of hard failure (database or instance shutdown), isolate the issue and allow overall database use to continue

Problematic page is removed from the bufferpool, relevant diagnostics are generated, notification is returned (messages to admin and db2diag.log and new sqlcode to application)

Leveraged within the engine where it makes sense

Logical Read Error Toleration (TP)


Messaging46 l.jpg

SQL1656E within a DB2 page

An error occurred while processing data. The operation could not be completed, but the database remains accessible. Contact IBM Software Support.

Explanation:

The failure to complete the operation is due to a detected DB2 page inconsistency. The SQL statement has failed or the operation has been aborted. The database remains accessible.

User response:

Contact IBM Software Support for instructions on what diagnostic data to collect to assist in resolving this issue. The application can retry the operation, although it may continue to fail.

sqlcode: -1656

sqlstate: 58004

ADM6007C

DB2 detected an error while processing page page-number from table space tbspace-id for object object-id of object type object-type.

Explanation

DB2 was not able to complete the operation, but the database remains accessible. This may indicate the presence of a serious problem that could result in a future outage.

User response

Please contact IBM Software Support and they will guide you through the proper corrective actions.

Messaging


Resources to help you l.jpg

Developerworks (Best Practices Papers, …) within a DB2 page

IBM Database Magazine

IDUG Web Site and Solutions Journal

DB2 LUW Chat with the Lab

http://www-01.ibm.com/software/data/db2/9/labchats.html

IBM DB2 Support home:

www.ibm.com/software/data/db2/support/db2_9

DB2 Fixes:

http://www-01.ibm.com/software/data/db2/support/db2_9/download.html

Conferences, RDUGs, Customer Lab Advocate, Early Adoption Programs, …

Resources To Help You


Slide48 l.jpg

The DB2 Detective – within a DB2 page The Case of the Diligent DBA

Bill Minor

IBM

[email protected]


ad