Cheetah - Agile & Fast Performance enhancements

Agenda • Non-Blocking Checkpoints • Automatic Checkpoints • Recovery Time Objective • Automatic LRU Tuning • Automatic AIO VP Tuning • Support for Direct I/O

Cheetah Checkpoint Improvements

What is a checkpoint? • A checkpoint is a point in time where cached data (bufferpool) is flushed to disk to create a consistency point for fast recovery, backups, HDR…

What is an LRU? • LRUs are queues used to manage the bufferpool • An LRU comprises 2 lists • MLRU • Tracks modified pages in the queue • FLRU • Tracks free or unmodified pages in the queue

Existing characteristics of Checkpoints • Significant transaction blocking, even with fuzzy checkpoints • Fuzzy checkpoints • Unpredictable checkpoint processing time • Unpredictable recovery time

Existing characteristics of Checkpoints • Checkpoint tuning vs OLTP tuning. • Tune LRU very aggressively • Causes constant flushing of the buffer pool • Reduces the write cache • Flushers consume CPU cycles • Increases buffer contention • Tune LRU less aggressively • Checkpoints were longer • Transactions were blocked for longer periods • Longer disaster recovery time • It wasn’t easy to figure out the optimal tuning.

Non-Blocking Checkpoints

Non-Blocking Checkpoints • Most checkpoints do not block transactions during buffer flushing. • Exceptions… • Checkpoint running short on resources • Physical log 75% • At least one checkpoint per logical log space. • Admin, archive checkpoints • Fuzzy checkpoints completely removed • Phase A recovery has been removed • Physical logging activity returns to 7.3 levels • Will need to increase the size of the physical log!

Benefits of Non-Blocking Checkpoints? • Transaction processing continues during the disk flush portion of checkpoint processing • Allows LRU flushing to be relaxed • Dramatic transaction performance improvement. • More Frequent checkpoints • Shortens fast recovery

Interval checkpoint

Recommendations • Increase LRUMIN and LRUMAX to at least 60 and 70 • Make sure the physical log is large • Can be moved online • Can be larger than 2GB • Make sure the logical log space is large • Check the new onstat -g ckp output

What do you do if checkpoints block? • Use the automatic checkpoint feature • The server will automatically trigger checkpoints based on the resources remaining. • Increase the size of the physical/logical log • The server will suggest which resource to increase and what size it should be • Make LRU flushing more aggressive • Increase I/O performance • More AIO VPs and cleaners • Improve performance of the I/O subsystem

Automatic Checkpoints

Automatic Checkpoints • Triggered if potential transaction blocking is detected • Calculation based on: • Physical and logical log usage • Buffer flush speed • Transaction throughput • To help automatic checkpoints • Increase the physical log size • Increase the logical log size • Increase LRU flushing (use automatic LRU tuning) • The server will make suggestions when resources are lacking • Monitor online.log and onstat -g ckp

Automatic Checkpoints • Default is always on • onmode -wm AUTO_CKPTS=0 … turn off • onmode -wm AUTO_CKPTS=1 … turn on

Checkpoint Performance Advisory • During a checkpoint, IDS evaluates checkpoint-related configuration parameters and produces a performance advisory if they are not set optimally to avoid transaction blocking. • The performance advisory appears in the second part of the onstat -g ckp output and in online.log • Configuration parameters evaluated at checkpoint: • PHYSFILE • PHYSBUFF • LOGBUFF • LOGFILES and LOGSIZE

PHYSFILE – Physical log Size • 110% of the combined size of all bufferpools for optimum performance • Enables fast recovery to use all bufferpool resources • Depends on transactional workload and speed of the disks
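
The 110% guideline above can be worked through as a small calculation. This is only a sketch: the function name and the (page size, buffer count) input shape are assumptions for illustration, and the result is a starting point that still depends on workload and disk speed.

```python
def recommended_physfile_kb(bufferpools):
    """Return 110% of the combined bufferpool size, in KB.

    bufferpools is a list of (page_size_kb, buffer_count) pairs.
    This input shape is an assumption made for the example.
    """
    total_kb = sum(page_kb * count for page_kb, count in bufferpools)
    # Integer arithmetic, rounded up, so the result never falls below 110%.
    return (total_kb * 110 + 99) // 100


# One 2 KB bufferpool with 50000 buffers and one 16 KB bufferpool with 1000.
pools = [(2, 50000), (16, 1000)]
print(recommended_physfile_kb(pools))
```

The returned value would be a candidate for PHYSFILE, which is expressed in KB.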

PHYSBUFF - Physical buffer size • With RTO_SERVER_RESTART off, default value is 128KB • With RTO_SERVER_RESTART on, default value is 512 KB • If a smaller value is used, a message appears in the online.log.

Checkpoint Performance Advisory – Physical log • During checkpoint processing potential physical log overflow is detected. Performance advisory: Physical log is running out of room. Results: Blocking transactions until checkpoint is complete. Action: Increase physical log size.

Physical log and automatic checkpoints ON • If the physical log is less than 10MB (10000KB), or automatic checkpoints are being generated every 35 seconds, then automatic checkpoints are turned off. Performance advisory: The physical log is too small for automatic checkpoints. Results: Automatic checkpoints are disabled. Action: Increase the physical log size to at least ## Kb.

LOGBUFF – Logical log buffer • Default value is 64KB • If value < 64 KB, a message appears in the online.log • Assumes buffered logging is used. If non-buffered logging is used, smaller buffers can be used

Checkpoint Performance Advisory – Logical log • During checkpoint processing, the system detects the potential for reaching the checkpoint per log span limit. Performance advisory: Logical log is running out of room. Results: Blocking transactions until checkpoint is complete. Action: Increase logical log size.

Long Transaction blocking checkpoints • Long transactions are triggering frequent checkpoints Performance advisory: Long transactions are triggering blocking checkpoints. Results: Blocking transactions until checkpoint is complete. Action: Increase logical log size.

Logical log and automatic checkpoints ON • If the logical log is less than 20MB (20000KB), or automatic checkpoints are being generated every 35 seconds, then automatic checkpoints are turned off. Performance advisory: The logical log space is too small for automatic checkpoints. Results: Automatic checkpoints are disabled. Action: Increase the logical log space to at least ## Kb.
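
The two size thresholds quoted on the physical and logical log slides (10000 KB and 20000 KB) can be checked with a few lines. This is a minimal sketch with hypothetical names; it covers only the size conditions and leaves out the 35-second checkpoint frequency condition, which the slides do not describe in enough detail to model.

```python
# Thresholds taken from the slides; the constant names are hypothetical.
MIN_PHYSICAL_LOG_KB = 10000
MIN_LOGICAL_LOG_KB = 20000


def logs_too_small_for_auto_ckpts(physical_log_kb, logical_log_kb):
    """Return the logs whose size is below the slide's stated minimum."""
    too_small = []
    if physical_log_kb < MIN_PHYSICAL_LOG_KB:
        too_small.append("physical log")
    if logical_log_kb < MIN_LOGICAL_LOG_KB:
        too_small.append("logical log")
    return too_small


print(logs_too_small_for_auto_ckpts(8000, 30000))
```

An empty list means neither size condition would disable automatic checkpoints.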

Performance Warning Examples

23:28:26 Performance Advisory: The current size of the physical log buffer is smaller than recommended.
23:28:26 Results: Transaction performance might not be optimal.
23:28:26 Action: For better performance, increase the physical log buffer size to 128.

13:25:54 Performance Advisory: Based on the current workload, the physical log might be too small to accommodate the time it takes to flush the buffer pool.
13:25:54 Results: The server might block transactions during checkpoints.
13:25:54 Action: If transactions are blocked during the checkpoint, increase the size of the physical log to at least 14000 KB.

13:25:54 Performance Advisory: The physical log is too small for automatic checkpoints.
13:25:54 Results: Automatic checkpoints are disabled.
13:25:54 Action: To enable automatic checkpoints, increase the physical log to at least 14000 KB.

onstat -g ckp

IBM Informix Dynamic Server Version 11.10.FB7TL -- On-Line -- Up 01:03:54 -- 39936 Kbytes

AUTO_CKPTS=Off RTO_SERVER_RESTART=Off

                                                                    Critical Sections                Physical Log Logical Log
          Clock                             Total Flush Block #     Ckpt  Wait  Long  # Dirty Dskflu Total  Avg   Total  Avg
Interval  Time      Trigger    LSN          Time  Time  Time  Waits Time  Time  Time  Buffers /Sec   Pages  /Sec  Pages  /Sec
24        16:04:11  Plog       26:0x2d50f8  0.4   0.4   0.4   1     0.0   0.4   0.4   709     709    750    10    638    8
25        16:04:31  Plog       28:0x108c    0.6   0.6   0.6   2     0.0   0.6   0.6   940     940    722    38    1276   67
26        16:05:03  *User      28:0x32b018  0.1   0.0   0.0   1     0.0   0.1   0.1   34      34     187    5     810    24
27        16:20:05  CKPTINTVL  28:0x32e018  0.0   0.0   0.0   0     0.0   0.0   0.0   1       1      0      0     3      0
28        16:21:38  Plog       29:0x1c676c  0.5   0.5   0.5   1     0.0   0.5   0.5   705     705    750    8     640    6
29        16:21:52  *User      29:0x3b9018  0.1   0.0   0.0   1     0.0   0.1   0.1   33      33     186    12    499    33
30        16:23:45  *Backup    29:0x3bd018  0.1   0.0   0.0   0     0.0   0.0   0.0   16      16     18     0     4      0

Max Plog   Max Llog   Max Dskflush  Avg Dskflush  Avg Dirty  Blocked
pages/sec  pages/sec  Time          pages/sec     pages/sec  Time
200        200        1             405           10         1

The server is blocking transactions because the physical log is too small. Based on the current workload, to prevent the server from blocking future transactions, increase the size of the physical log to 14000 KB.

Based on the current workload, the logical log space might be too small to accommodate the time it takes to flush the buffer pool. The server might block transactions during checkpoints. If the server blocks transactions, increase the size of the logical log space to at least 14000 KB.
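
Each checkpoint history row in the onstat -g ckp output above has 17 whitespace-separated fields, so it can be split into named values for monitoring scripts. The sketch below is illustrative only: the dictionary keys are hypothetical names derived from the column headers, and the leading asterisk on some triggers is kept as a flag because the slides do not say what it means.

```python
# (name, type) for the 13 numeric columns that follow Interval, Clock Time,
# Trigger and LSN. The names are hypothetical, derived from the headers.
NUMERIC_COLUMNS = [
    ("total_time", float), ("flush_time", float), ("block_time", float),
    ("waits", int),
    ("crit_ckpt_time", float), ("crit_wait_time", float), ("crit_long_time", float),
    ("dirty_buffers", int), ("dskflu_per_sec", int),
    ("plog_total_pages", int), ("plog_avg_per_sec", int),
    ("llog_total_pages", int), ("llog_avg_per_sec", int),
]


def parse_ckp_row(line):
    """Parse one checkpoint history row into a dict of typed values."""
    parts = line.split()
    if len(parts) != 4 + len(NUMERIC_COLUMNS):
        raise ValueError(f"expected 17 fields, got {len(parts)}")
    trigger = parts[2]
    row = {
        "interval": int(parts[0]),
        "clock_time": parts[1],
        "trigger": trigger.lstrip("*"),
        "starred": trigger.startswith("*"),  # meaning not stated in the slides
        "lsn": parts[3],
    }
    for (name, convert), text in zip(NUMERIC_COLUMNS, parts[4:]):
        row[name] = convert(text)
    return row


sample = "26 16:05:03 *User 28:0x32b018 0.1 0.0 0.0 1 0.0 0.1 0.1 34 34 187 5 810 24"
print(parse_ckp_row(sample))
```

A script could, for example, flag rows whose block_time is above zero as candidates for the tuning steps described earlier.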

New SYSMASTER Tables • syscheckpoint • Keeps history on the last 20 checkpoints • sysckptinfo • Keeps info on automatic checkpoints

Recovery Time Objective (RTO)

Onconfig parameter • New onconfig parameter • RTO_SERVER_RESTART • Amount of time in seconds that Dynamic Server has to recover from a problem after you restart Dynamic Server and bring the server into online or quiescent mode. • Seeds the logical recovery pages in the physical log • Valid values are 60 – 1800 • Default is 0 (disabled)

RTO • Facts about RTO_SERVER_RESTART • Allows users to set a target fast recovery time. • RTO_SERVER_RESTART and CKPTINTVL are mutually exclusive. • If turned off, the system will use CKPTINTVL to trigger checkpoints (the old style). • Valid values 60 - 1800 seconds (1–30 minutes). • Automatically adjusts the checkpoint frequency to meet the RTO policy. • The server will fine-tune with each fast recovery to improve predictability. • This parameter can be updated with onmode -wf and -wm. • RTO_SERVER_RESTART=0 (off) is the default.

How does RTO_SERVER_RESTART work? • Estimate/calculate the speed of fast recovery • Server boot time • Physical log recovery (RAS_PLOG_SPEED) • Logical log recovery (RAS_LLOG_SPEED) • Assume all updates fit into the bufferpools (pages seeded in the physical log) • Automatic checkpoints based on resource usage to meet the RTO policy.
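
The value rules for RTO_SERVER_RESTART (0 for off, otherwise 60 to 1800 seconds) can be expressed as a small validation helper. This is a sketch with a hypothetical function name, not server code; it returns which parameter would drive checkpoints according to the facts listed above.

```python
def checkpoint_driver(rto_server_restart):
    """Return the parameter that triggers checkpoints for a given setting.

    0 means RTO is off and CKPTINTVL is used; 60 to 1800 seconds enables
    the RTO policy. Any other value is rejected.
    """
    if rto_server_restart == 0:
        return "CKPTINTVL"
    if 60 <= rto_server_restart <= 1800:
        return "RTO_SERVER_RESTART"
    raise ValueError("RTO_SERVER_RESTART must be 0 or between 60 and 1800")


print(checkpoint_driver(0))
print(checkpoint_driver(120))
```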
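
One way to picture the estimate is as the sum of the three components listed above. The sketch below is an assumption-laden illustration: it treats the recovery speeds as pages per second and simply adds the three times, which may differ from how the server actually uses RAS_PLOG_SPEED and RAS_LLOG_SPEED. All function and parameter names are hypothetical.

```python
def estimate_recovery_seconds(boot_seconds, plog_pages, plog_pages_per_sec,
                              llog_pages, llog_pages_per_sec):
    """Estimate fast recovery time as boot + physical + logical recovery.

    Speeds are assumed to be in pages per second (an assumption, not a
    documented unit).
    """
    physical = plog_pages / plog_pages_per_sec
    logical = llog_pages / llog_pages_per_sec
    return boot_seconds + physical + logical


def meets_rto(estimated_seconds, rto_server_restart):
    """True when the estimate fits within the configured RTO policy."""
    return estimated_seconds <= rto_server_restart


estimate = estimate_recovery_seconds(10, 20000, 1000, 6000, 200)
print(estimate, meets_rto(estimate, 60))
```

When the estimate drifts above the target, a more frequent checkpoint is the lever the slides describe for bringing recovery time back within the policy.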

Auto LRU Tuning

Automatic LRU Tuning (lru_min/max_dirty) • With interval checkpoints, LRU flushing can be less aggressive. • So go ahead and relax your lru_min/max_dirty • Can bring dramatic increases in performance. • LRU flushing will automatically adjust to be more aggressive • When a hot page is replaced, 1% more aggressive • When a foreground write occurs, 5% more aggressive • When the time to flush the bufferpool exceeds RTO_SERVER_RESTART, 10% more aggressive • Continues adjusting until optimal.
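
The 1%, 5% and 10% adjustments can be illustrated with a toy model. The slides do not say exactly how "more aggressive" is applied, so this sketch assumes the percentage is subtracted as percentage points from both lru_min_dirty and lru_max_dirty. The event names and the function are hypothetical.

```python
# Adjustment per event, taken from the slide; interpreting the values as
# percentage points is an assumption.
ADJUSTMENT_POINTS = {
    "hot_page_replaced": 1,
    "foreground_write": 5,
    "flush_exceeds_rto": 10,
}


def adjust_lru(lru_min_dirty, lru_max_dirty, event):
    """Return (new_min, new_max) after making LRU flushing more aggressive."""
    step = ADJUSTMENT_POINTS[event]
    new_min = max(lru_min_dirty - step, 0)
    # Keep max at or above min so the pair stays consistent.
    new_max = max(lru_max_dirty - step, new_min)
    return new_min, new_max


print(adjust_lru(70, 80, "foreground_write"))
```

Lower thresholds mean the page cleaners start flushing earlier, which is the sense in which flushing becomes more aggressive.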

LRU_MAX_DIRTY and LRU_MIN_DIRTY • Default values • LRU_MAX_DIRTY 60% • LRU_MIN_DIRTY 50% • A good starting point when AUTO_LRU_TUNING is ON • LRU_MAX_DIRTY 80% • LRU_MIN_DIRTY 70%

Automatic LRU Tuning – Configuration • AUTO_LRU_TUNING • 0 or 1 • ON by default • Dynamically switch off LRU tuning • onmode -wm AUTO_LRU_TUNING=0 • Dynamically switch on LRU tuning • onmode -wm AUTO_LRU_TUNING=1,min=val,max=val • Dynamically set LRU parameters when LRU tuning is on/off • onmode -wm AUTO_LRU_TUNING=min=val • onmode -wm AUTO_LRU_TUNING=max=val

Performance Advisory when auto LRU tuning ON • During a checkpoint, if the buffer flush time exceeds the RTO. Performance advisory: The time to flush the bufferpool ## is longer than RTO_SERVER_RESTART ##. Results: The IDS server can't meet the RTO policy. Action: Automatically adjusting LRU flushing to be more aggressive. Adjusting LRU for bufferpool - id ## size ##k Old max ## min ## New max ## min ##

…when auto LRU tuning OFF Performance advisory: The time to flush the bufferpool ## is longer than RTO_SERVER_RESTART ##. Results: The IDS server can't meet the RTO policy. Action: Automatic LRU tuning is off. Either turn on automatic LRU tuning or change LRU flushing to be more aggressive.

Automatic AIO VP Tuning

Automatic Tuning of AIO VPs • For cooked chunks • Monitors I/O performance and adds more AIO VPs and/or cleaners if needed • AUTO_AIOVPS configuration parameter • 0 or 1 • ON by default • Dynamically change it using onmode • onmode -wm/-wf AUTO_AIOVPS=1 • onmode -wm/-wf AUTO_AIOVPS=0

NUMAIOVPS or VPCLASS aio_num=# • Initial setting will be 2 AIO VPs per cooked chunk • If you add one cooked chunk, 2 more AIO VPs will be added up to a value of 128 • Changing the value in ONCONFIG does not have any impact if RTO_SERVER_RESTART is ON. • Possible to change the value dynamically using onmode -p

CLEANERS • Initial setting will be 1 cleaner thread per AIO VP • Value adjusted in conjunction with changes to the number of AIO VPs.
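
The initial sizing rules for AIO VPs and cleaners can be written out as arithmetic: 2 AIO VPs per cooked chunk, capped at 128, and 1 cleaner per AIO VP. This is a sketch of those stated rules only; the function names are hypothetical and the automatic tuning that happens afterwards is not modeled.

```python
MAX_AIO_VPS = 128  # upper limit stated on the slide


def initial_aio_vps(cooked_chunks):
    """2 AIO VPs per cooked chunk, up to the limit of 128."""
    return min(2 * cooked_chunks, MAX_AIO_VPS)


def initial_cleaners(aio_vps):
    """1 cleaner thread per AIO VP."""
    return aio_vps


vps = initial_aio_vps(3)
print(vps, initial_cleaners(vps))
```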

Additional Information on checkpoints http://www.ibm.com/developerworks/db2/library/techarticle/dm-0703lashley

Direct I/O for cooked files

Behavior of cooked files • Cooked file performance can be much slower than raw devices. [Figure: File System Cache]

The Solution with Cooked files • Direct I/O bypasses the file system cache • Unix and Linux operating systems support Direct I/O • Performance close to that of raw devices [Figure: File System Cache]

When is DIO used • DIO is not used by default on cooked files • Set DIRECT_IO = 1 in onconfig to turn it on • When using DIO, KAIO will be used by default. This can be switched off by setting KAIOOFF=1

What are the benefits of DIO? • File reads/writes bypass the operating system read and write caches. • Reducing CPU consumption and eliminating the overhead of copying data twice. • first between the disk and the file buffer cache • second from the file buffer cache to the application’s buffer. • Can reduce number of AIO VPs if KAIOOFF is not set.

Limitations • Cannot be used for temporary dbspaces. • Can only be used for dbspace chunks whose file systems support direct I/O for the page size
