Thomas E. Canty ServerCare, Inc. Session 126

Presentation Transcript


    1. Thomas E. Canty ServerCare, Inc. Session #126

    2. Speaker Qualifications Thomas E. Canty, Senior Oracle DBA, ServerCare, Inc. 19 years of Oracle experience, starting with version 5. Has presented at IOUG, OpenWorld, NoCOUG, and IASA. Has been a DBA, Developer, Architect, and IT Manager.

    3. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    4. Data Guard Modes
    Maximum Performance Mode: least performance impact; the default mode. Maximum Protection Mode: emphasis on data safety; requires at least one secondary. Maximum Availability Mode: emphasis on uptime; continues if the secondary is unavailable.
    Performance offers the most choices for configuration options. Protection requires data to be written to both the local and a standby redo log for a transaction to commit, and the database will shut down if it is unable to write to the standby redo log of at least one standby database. Availability is a compromise between Performance and Protection: it also requires data to be written to the local and a standby redo log for a transaction to commit, but it will not shut down if unable to write to a standby redo log.

    5. Physical vs. Logical Standby
    Physical: redo applied directly; limited read-only reporting on the standby; no data type restrictions; rolling upgrades not generally possible.
    Logical: redo converted to SQL before being applied; unrestricted read-only reporting; extra schemas are read/write; data type restrictions; rolling upgrades possible.

    6. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    7. Session Data Unit (SDU)
    In an Oracle Net connect descriptor:
    sales.servercare.com=
     (DESCRIPTION=
      (SDU=32767)
      (ADDRESS=(PROTOCOL=tcp)(HOST=sales-server)(PORT=1521))
      (CONNECT_DATA=(SID=sales.servercare.com)))
    Globally in sqlnet.ora:
    DEFAULT_SDU_SIZE=32767
    Oracle Net buffers data into SDU-sized chunks when sending data across the network. When large amounts of data are being transmitted, as with Data Guard, increasing the size of the SDU buffer can improve performance and network utilization. Configure it in an Oracle Net connect descriptor or globally in sqlnet.ora. Also, the TDU parameter should be some multiple of the SDU. In practice the SDU can't equal the MTU due to incompatibilities between the buffer sizes of the network layers. When data flows between layers with incompatible buffer sizes, fragmentation occurs, and as a result extra network traffic is generated. With this in mind, components of the stack can be tuned to minimize fragmentation, which reduces network traffic and thereby increases performance.

    8. Session Data Unit (SDU) (Cont.)
    On the standby DB, set in listener.ora:
    SID_LIST_listener_name=
     (SID_LIST=
      (SID_DESC=
       (SDU=32767)
       (GLOBAL_DBNAME=sales.servercare.com)
       (SID_NAME=sales)
       (ORACLE_HOME=/u01/app/oracle/product/10.2.0/db_1)))

    9. TCP Socket Buffer Size
    Set TCP socket buffer size = 3 * BDP. Data Guard broker configurations: set in sqlnet.ora. Non-broker configurations: set in the connect descriptor. BDP = Bandwidth Delay Product; RTT = Round Trip Time.

    10. TCP Socket Buffer Size
    Assume a gigabit network with an RTT of 25 ms.
    BDP = 1,000 Mbps * 25 ms = 1,000,000,000 bits/sec * 0.025 sec = 25,000,000 bits
    25,000,000 bits / 8 = 3,125,000 bytes
    In this example: socket buffer size = 3 * BDP = 3 * 3,125,000 = 9,375,000 bytes
    sqlnet.ora:
    RECV_BUF_SIZE=9375000
    SEND_BUF_SIZE=9375000
    The socket buffer size controls how much network bandwidth is used regardless of the bandwidth available in the network circuit. Socket buffer sizes need to be increased from their default values in order to improve utilization of the available bandwidth; especially when network latency is high, larger socket buffer sizes are needed to fully utilize it. The optimal socket buffer size is three times the size of the Bandwidth Delay Product (BDP). To compute the BDP, the bandwidth of the link and the network Round Trip Time (RTT) are required. RTT is the time required for a network communication to travel from the production DB to the standby and back, measured in ms.
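    For non-broker configurations, the buffer sizes go in the connect descriptor instead of sqlnet.ora; a minimal sketch, with host and service names purely illustrative:
    sales.servercare.com=
     (DESCRIPTION=
      (SEND_BUF_SIZE=9375000)
      (RECV_BUF_SIZE=9375000)
      (ADDRESS=(PROTOCOL=tcp)(HOST=sales-server)(PORT=1521))
      (CONNECT_DATA=(SID=sales.servercare.com)))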

    11. Network Queue Sizes
    Between the kernel network subsystems & the NIC driver:
    txqueuelen - transmit queue size
    netdev_max_backlog - receive queue size
    Assumes a gigabit network with 100 ms latency. Set the queues:
    ifconfig eth0 txqueuelen 10000
    sysctl.conf: net.core.netdev_max_backlog=20000
    You can regulate the size of the queue between the kernel network subsystems and the driver for the network interface card. Any queue should be sized so that losses do not occur due to local buffer overflows. This is especially important for TCP, because losses on local queues cause TCP to fall into congestion control, which limits the TCP sending rate. Increase the NIC device queue size from the default of 100 to 10,000:
    echo 20000 > /proc/sys/net/core/netdev_max_backlog
    echo 1 > /proc/sys/net/ipv4/route/flush

    12. Overall Network
    Ensure sufficient bandwidth to the standby. Verify TCP_NODELAY is set to YES (the default). On RHEL 3, increase /proc/sys/fs/aio-max-size on the standby from 131072 (the default) to 1048576. Set RECV_BUF_SIZE & SEND_BUF_SIZE = 3 * Bandwidth Delay Product (BDP). Use a Session Data Unit (SDU) size of 32767. Increase the send & receive queue sizes (txqueuelen, netdev_max_backlog).
    For RHEL 3 only it is recommended to set aio-max-size to 1048576, since Oracle uses I/Os of up to 1 MB; the parameter controls the maximum I/O size for asynchronous I/O and is deprecated in the 2.6 kernel. Not on slide: aio-max-size is set in sysctl.conf on Linux.

    13. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    14. ARCn Redo Transport
    Redo transport services control the automated transfer of redo data from the production database to one or more destinations. Redo transport services also manage the process of resolving any gaps in the archived redo log files due to a network failure.
    1. The archiver reads 10 MB from the archived redo log and issues a network send to the RFS (Remote File Server) process on the standby.
    2. The RFS process receives the data & performs I/O into either the standby redo logs or the archived redo logs, depending on the configuration.
    3. Once the I/O has completed, RFS sends an acknowledgement back to ARCH. (On a physical standby the redo is applied by the Managed Recovery Process (MRP); on a logical standby, by the Logical Standby Process (LSP).)
    4. The archiver reads the next 10 MB and repeats the process.

    15. ASYNC LGWR Redo Transport
    1. The LGWR process writes redo to the online redo log at the production database.
    2. The Logwriter Network Server (LNS) process on the production DB reads the online redo log and sends the data to the RFS process on the standby DB.
    3. The RFS process receives the redo being sent by LNS.
    4. Right away, the RFS process sends an acknowledgement back to LNS that the redo has been received; the commit is completed.
    5. The RFS process writes the redo to a standby redo log.

    16. SYNC LGWR Redo Transport
    1. The LGWR process writes redo to the online redo log at the production database.
    2. The Logwriter Network Server (LNS) process on the production DB sends the data to the RFS process on the standby DB.
    3. The RFS process receives the redo & completes the I/O into the standby redo log.
    4. The RFS process sends an acknowledgment back to LNS that the redo has been received.
    5. LNS tells the LGWR process that all the redo has been successfully received.

    17. Optimize ARCn Transport
    Increase MAX_CONNECTIONS to 5 for the standby destination (if possible): default (2), maximum (5). Increase LOG_ARCHIVE_MAX_PROCESSES, larger than MAX_CONNECTIONS and up to what network bandwidth allows: default (2), maximum (30).
    MAX_CONNECTIONS specifies the maximum number of network connections that can be used to transmit redo data to the destination. LOG_ARCHIVE_MAX_PROCESSES sets the maximum number of active ARCH processes; it makes it possible to send multiple archive logs in parallel to handle peaks in workload or to more quickly resolve log archive gaps caused by network or standby failures.
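    A minimal sketch of both settings on the primary, assuming the standby destination is LOG_ARCHIVE_DEST_2 and a net service name of standby_tns (both illustrative):
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=standby_tns MAX_CONNECTIONS=5';
    ALTER SYSTEM SET LOG_ARCHIVE_MAX_PROCESSES=8 SCOPE=BOTH;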

    18. Optimize LGWR Transport
    Decrease NET_TIMEOUT (default 180 secs.) - be careful, not too low. New COMMIT variants: COMMIT IMMEDIATE WAIT (the default), COMMIT NOWAIT, COMMIT NOWAIT BATCH.
    NET_TIMEOUT specifies the number of seconds the log writer process on the primary system waits for status from the network server (LNSn) process before terminating the network connection. Important to note for the default behavior: the redo information is written to disk immediately, and the session waits for the process to complete before returning. IMMEDIATE - the redo information is written to disk immediately (forcing a disk I/O for each commit). WAIT - Oracle does not return from the commit until the commit has completed successfully. New: BATCH - Oracle buffers the redo information; the log writer will write this redo to disk in its own time, so several I/Os can be "batched". NOWAIT - Oracle doesn't wait for the commit to complete, but returns right away. Therefore, applications or transactions that can utilize COMMIT NOWAIT will see a significant improvement in response time and database throughput over those that use the default COMMIT WAIT behavior.
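    A hedged sketch of both knobs (destination number, service name and timeout value are illustrative):
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=standby_tns LGWR SYNC NET_TIMEOUT=30';
    -- per transaction, where the application can tolerate losing an unconfirmed commit:
    COMMIT WRITE BATCH NOWAIT;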

    19. All Redo Transport
    Standby redo logs: use the fastest disks (RAID10 or RAID1, no RAID5); don't multiplex; use the recommended number of SRLs: (maximum # of online log files + 1) * maximum # of threads.
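    A sketch of applying that formula with a query against the primary's v$log, then adding one SRL (file path, thread and size are illustrative; repeat the ADD for each required log):
    SELECT (MAX(cnt) + 1) * COUNT(*) AS recommended_srls
    FROM (SELECT thread#, COUNT(*) cnt FROM v$log GROUP BY thread#);
    ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
      ('/u02/oradata/sales/srl01.log') SIZE 1G;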

    20. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    21. Checkpoint Phase
    A checkpoint occurs during a log switch, on LOG_CHECKPOINT_TIMEOUT expiration, or when LOG_CHECKPOINT_INTERVAL is reached. Reduce the log switch interval: resize the redo logs to 1 GB on both primary and secondary. Recommended: a checkpoint every 15 minutes.
    1st, the Checkpoint Phase on the primary. On the primary database, the checkpoint rate needs to be monitored and the log group size adjusted to ensure that checkpoints do not place an excessive burden on the system. To reduce the log switch interval it is generally recommended to resize the redo log files to 1 GB on both primary and secondary; ideally the checkpoint should occur every 15 minutes. This reduces the repeated updating of the file headers which occurs during the switch. Using the query on the next slide, we can determine the frequency of checkpoints by comparing the output over a period of time, to ensure that they do not occur too often.
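    Redo logs can't be resized in place; a hedged sketch of the usual add-then-drop approach (paths and group numbers are illustrative, and a group must be INACTIVE before it can be dropped):
    ALTER DATABASE ADD LOGFILE GROUP 4 ('/u01/oradata/sales/redo04.log') SIZE 1G;
    ALTER SYSTEM SWITCH LOGFILE;
    ALTER DATABASE DROP LOGFILE GROUP 1;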

    22. Checkpoint Phase (Cont.)
    Determine checkpoint frequency:
    COL NAME FOR A35;
    SELECT NAME, VALUE, TO_CHAR(SYSDATE, 'HH:MI:SS') TIME
    FROM V$SYSSTAT WHERE NAME = 'DBWR checkpoints';
    NAME                                     VALUE TIME
    ----------------------------------- ---------- --------
    DBWR checkpoints                           264 08:15:43
    SQL> /
    NAME                                     VALUE TIME
    ----------------------------------- ---------- --------
    DBWR checkpoints                           267 08:34:06
    Also look at your archive logs to determine if checkpoints happen more often than just at log switches.

    23. Redo Read (Secondary)
    Obtain the read rate for the standby redo log:
    SQL> ALTER SYSTEM DUMP LOGFILE '/u01/oradata/docprd/sredo01.log' validate;
    System altered.
    $ vi docprd_ora_3560.trc
    Mon Mar 12 08:59:52 2007
    ...
    ----- Redo read statistics for thread 1 -----
    Read rate (ASYNC): 4527Kb in 0.58s => 6.90 Mb/sec
    Longest record: 19Kb, moves: 0/7586 (0%)
    Change moves: 4340/18026 (24%), moved: 2Mb
    Longest LWN: 92Kb, moves: 1/1365 (0%), moved: 0Mb
    Last redo scn: 0x0000.01272351 (19342161)
    2nd, the Redo Read Phase. The overall recovery rate will always be bounded by the rate at which redo can be read, so ensure that the redo read rate surpasses your required recovery rate. You can also use 'dd': /bin/time dd if=/redo_logs/t_log8.f of=/dev/null bs=4096k. On the standby database, an important step in the redo apply phase is the reading of the redo logs into the database buffer cache; because of this, the read rate is an important metric in the redo apply phase, and if it is low it can adversely affect the total recovery time.

    24. Redo Apply (Secondary)
    Goal: redo apply rate (secondary) > redo create rate (primary). Carefully consider enabling DB_BLOCK_CHECKING (LOW, MEDIUM and FULL options); possible performance impact.
    3rd, the Redo Apply Phase. DB_BLOCK_CHECKING is performed in memory. LOW - block checking is performed after any in-memory block change. MEDIUM - all in-memory block change checking is performed, plus semantic block checking for all non-index-organized-table blocks. FULL - block checking is performed for all in-memory block changes, plus semantic block checking for all non-index-organized-table blocks and index blocks. When one of these three forms of block checking is enabled, Oracle Database verifies that the block is self-consistent whenever the corresponding types of block change occur. If the block is inconsistent, it is marked corrupt, an ORA-1578 error is returned, and a trace file is created containing details of the problem. Without block checking enabled, corruption can go undetected until the block is accessed again. Block checking for the SYSTEM tablespace is always enabled, no matter what setting is chosen for DB_BLOCK_CHECKING. While DB_BLOCK_CHECKSUM is set to true by default, DB_BLOCK_CHECKING is not on by default. Although DB_BLOCK_CHECKSUM will catch most block corruptions, Oracle recommends turning on DB_BLOCK_CHECKING on the primary database and also on the secondary, if the secondary meets performance expectations. It can be set to LOW, MEDIUM or FULL and will have a performance impact on the database; Oracle estimates the impact between one and 10 percent, so be cautious.
    The time required for the actual application of the redo logs on the standby database will dictate whether there is going to be a lag on the standby when the primary fails over to the secondary. The rate at which the redo logs are applied can be obtained from the steps below, which were outlined in the Metalink document MAA_WP_10gRecoveryBestPractices.pdf. The SQL should be run while the actual recovery is taking place.
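    A minimal sketch of enabling block checking dynamically (choose the level based on your own performance testing):
    ALTER SYSTEM SET DB_BLOCK_CHECKING=MEDIUM SCOPE=BOTH;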

    25. Redo Apply (Cont.)
    Determine the log block size (LEBSZ):
    SELECT LEBSZ FROM X$KCCLE WHERE ROWNUM=1;
    Get recovery blocks - take at least two snapshots. Managed recovery case:
    SELECT PROCESS, SEQUENCE#, THREAD#, BLOCK#, BLOCKS,
           TO_CHAR(SYSDATE, 'DD-MON-YYYY HH:MI:SS') TIME
    FROM V$MANAGED_STANDBY WHERE PROCESS='MRP0';
    Determine the recovery rate (MB/sec) for a specific archive sequence number, managed recovery case:
    ((BLOCK#_END - BLOCK#_BEG) * LOG_BLOCK_SIZE) / ((TIME_END - TIME_BEG) * 1024 * 1024)
    Here's how we find the redo apply rate.
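    A worked example under assumed numbers: with a 512-byte log block size, if two snapshots show MRP0 advancing 2,048,000 blocks in 100 seconds, the apply rate is (2,048,000 * 512) / (100 * 1024 * 1024), roughly 10 MB/sec.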

    26. Redo Apply (Cont.)

    27. Recovery
    Parallel Recovery (before 10.1.0.5): set to the number of CPUs: recover managed standby database parallel <#>;
    PARALLEL_EXECUTION_MESSAGE_SIZE: can increase to 4096 or 8192; uses additional shared pool memory; problems if set too high.
    DB_CACHE_SIZE: can set the secondary's DB_CACHE_SIZE >= the primary's; must set it back to match the primary before changing roles.
    Finally, the Recovery Phase. Parallel Recovery: application of the redo logs on the standby includes reading the blocks into the database buffer cache and the recovery slave processes applying the redo to the database. Prior to 10.1.0.5, the PARALLEL option needed to be specified to use parallel recovery, by issuing the "RECOVER MANAGED STANDBY DATABASE PARALLEL" command. In versions above this, the PARALLEL option is used by default and is equal to the number of CPUs on the system. The number of parallel slaves can be increased further by specifying the degree in the command, e.g. 'RECOVER MANAGED STANDBY DATABASE PARALLEL 5;'. PARALLEL_EXECUTION_MESSAGE_SIZE controls the size of the buffer used to pass messages between the slaves and the query coordinator. If a message is larger than the default 2 KB value, it is passed in chunks, resulting in some performance loss; for most systems raising PARALLEL_EXECUTION_MESSAGE_SIZE to 8 KB can improve recovery time tremendously. DB Cache Size: setting DB_CACHE_SIZE to a larger value than the primary's also significantly improves recovery time, since a larger number of blocks can be placed in the buffer cache. However, before the roles are switched the SGA component sizes must mirror what is on the primary database.
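    A hedged sketch of the three knobs together (degree and cache size are illustrative; PARALLEL_EXECUTION_MESSAGE_SIZE is static, so it takes effect after a restart, and set DB_CACHE_SIZE back to match the primary before a role switch):
    ALTER SYSTEM SET PARALLEL_EXECUTION_MESSAGE_SIZE=8192 SCOPE=SPFILE;
    ALTER SYSTEM SET DB_CACHE_SIZE=4G SCOPE=SPFILE;
    RECOVER MANAGED STANDBY DATABASE PARALLEL 8;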

    28. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    29. Arch Wait Events - Primary
    ARCH wait on ATTACH: time for all arch processes to spawn an RFS connection.
    ARCH wait on SENDREQ: time for all arch processes to write the received redo to disk plus open & close the remote archived redo logs.
    ARCH wait on DETACH: time for all arch processes to delete an RFS connection.
    Use AWR to find wait times, or the v$event... views (v$event_name).
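    A sketch of pulling these (and the LGWR/LNS events on the next two slides) from the cumulative instance statistics, using v$system_event:
    SELECT event, total_waits, time_waited
    FROM v$system_event
    WHERE event LIKE 'ARCH wait%'
       OR event LIKE 'LGWR wait%'
       OR event LIKE 'LNS wait%';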

    30. LGWR SYNC Wait Events - Primary
    LGWR wait on ATTACH: time for all log writer processes to spawn an RFS connection.
    LGWR wait on SENDREQ: time for all log writer processes to write the received redo to disk plus open & close the remote archived redo logs.
    LGWR wait on DETACH: time for all log writer processes to delete an RFS connection.

    31. LGWR ASYNC Wait Events - Primary
    LNS wait on ATTACH: time for all network servers to spawn an RFS connection.
    LNS wait on SENDREQ: time for all network servers to write the received redo to disk plus open & close the remote archived redo logs.
    LNS wait on DETACH: time for all network servers to delete an RFS connection.
    LGWR wait on full LNS buffer: time the log writer (LGWR) process waits for the network server (LNS) to free ASYNC buffer space. Not relevant with LGWR SYNC=PARALLEL.

    32. Wait Events on Secondary
    RFS Write: time to write to a standby redo log or archived redo log, plus non-I/O work like redo block checksum validation.
    RFS Random I/O: time for a write to a standby redo log to occur.
    RFS Sequential I/O: time for a write to an archived redo log to occur.

    33. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    34. 10g R2 Improvements
    Multiple archive processes can transmit a redo log in parallel to the standby database; the MAX_CONNECTIONS attribute of LOG_ARCHIVE_DEST_n controls the number of these processes. This reduces the time for redo transmission to the secondary and is beneficial during batch loads. Parallel recovery for redo apply is automatically set equal to the number of CPUs (10.1.0.5 and 10.2.0.1); prior to this, "PARALLEL" needed to be specified for parallel recovery. Fast-Start Failover automatically fails over to a previously chosen physical standby database.

    35. 10g R2 Improvements (Cont.)
    LGWR ASYNC uses a new process (LNSn) to transmit the redo data directly from the online redo log to the standby database. Physical standby database flashback: can flash back temporarily for reporting. Logical standby database: automatically deletes applied archived logs. RMAN: automatically creates temp datafiles after recovery, with no DBA intervention. The old primary database is automatically reconfigured as a new standby database once it reconnects to the Data Guard configuration. Previously the LGWR process was responsible for transferring redo data to the standby destination, at the cost of some performance; with the separate LNSn process, LGWR can continue writing redo to the online redo logs without having to process the redo transmission.

    36. 11g Improvements Physical standby database open read/write for test or other purposes with zero compromise in data protection using new Snapshot Standby Automatic failover configurable for immediate response to designated events or errors More flexibility in primary/standby configurations e.g. Windows primary and Linux standby Rolling upgrade options now in physical standby with Transient Logical Standby ASYNC transport enhanced to eliminate the impact of latency on network throughput

    37. 11g Improvements (Cont.) Fast detection of corruptions caused by lost writes in the storage layer SQL Apply supports XML data type (CLOB) Many performance, manageability, and security enhancements Support for new Oracle Database 11g Options – Oracle Active Data Guard and Oracle Advanced Compression Fast Start Failover now available for Maximum Performance mode

    38. Outline Overview Network Optimization ARCn & LGWR Redo Transport Checkpoint, Redo Read/Apply & Recovery Wait Events 10g R2 & 11g Improvements Best Practices

    39. Best Practices
    Geographically separate the primary & standby DBs. Ensure the standby hardware configuration is the same as the primary. Tune the standby for write-intensive operations. Test Data Guard before deploying in production. Set standard OS and DB parameters to recommended values. Perform switchover testing. Fully document a failover procedure. Use FORCE LOGGING mode.
    Perform a load test and obtain benchmarks on the largest volume of redo, paying particular attention to the network and the I/O performance of the storage. A documented failover procedure will alleviate confusion during a stressful time. Use FORCE LOGGING mode to ensure that all database data changes are logged and the standby remains consistent with production.
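    A minimal sketch of checking and enabling FORCE LOGGING on the primary:
    SELECT force_logging FROM v$database;
    ALTER DATABASE FORCE LOGGING;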

    40. Best Practices (Cont.)
    Use real-time apply. Use the Data Guard Broker. Enable Flashback Database on both primary and secondary databases. Evaluate using the AFFIRM attribute (possible performance issues on the primary). Verify asynchronous I/O is enabled. Carefully consider DB_BLOCK_CHECKING.
    Use real-time apply so that redo data is applied to the standby database as soon as it is received. Use the Data Guard Broker to create, manage and monitor the Data Guard configuration. Enable Flashback Database on both primary and secondary databases; doing this ensures that the old primary database can be easily reinstated as a new standby database following a failover. AFFIRM ensures disk I/O on the standby is synchronously completed before the transaction completes. Verify asynchronous I/O for both the DB and the OS; disk throughput can be measured using the 'dd' command.
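    A hedged sketch of enabling Flashback Database and real-time apply (the flashback command is run on both databases, mounted, with a flash recovery area configured; the recover command on the standby):
    ALTER DATABASE FLASHBACK ON;
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
      USING CURRENT LOGFILE DISCONNECT FROM SESSION;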

    41. Best Practices (Cont.)
    Don't multiplex standby redo logs (SRLs). Correctly set the number of SRLs. Increase PARALLEL_EXECUTION_MESSAGE_SIZE. Place SRLs in a fast disk group or on fast disks. Use at least two standby DBs with Maximum Protection Mode. Utilize COMMIT NOWAIT if appropriate.
    Setting PARALLEL_EXECUTION_MESSAGE_SIZE to 8192 dramatically increases the performance of parallel recovery. The number of standby redo logs should equal the sum of all online log groups for each thread (instance) plus the number of threads. Application of the redo logs on the standby requires reading the blocks into the database buffer cache, after which the recovery slave processes apply the redo to the database. Prior to 10.1.0.5, the PARALLEL option needed to be specified to use parallel recovery, by issuing the "RECOVER MANAGED STANDBY DATABASE PARALLEL" command; in versions after 10.1.0.5 the PARALLEL option is used by default and is equal to the number of CPUs on the system. The number of parallel slaves can be increased further by specifying the degree in the command, e.g. 'RECOVER MANAGED STANDBY DATABASE PARALLEL 5;'. Although LGWR SYNC will wait for confirmation that redo has been successfully received from the standby as described above, COMMIT NOWAIT can be used so that commits are returned to the application without waiting for redo to be written to disk. If applications or transactions are able to utilize COMMIT NOWAIT, significant improvement in response time and database throughput can be seen. This means that even though control is given back to the application, a commit is not guaranteed as it is with COMMIT WAIT, which is the default COMMIT.

    42. Best Practices (Cont.)
    Ensure appropriate bandwidth between primary and secondary. Increase the default send & receive queue sizes (txqueuelen, netdev_max_backlog). Session Data Unit: adjust the value to 32767 for improvement during large data transmissions.
    Incorrect sizing of the device queues can cause loss of data due to buffer overflow, which then triggers retransmission. This repeated retransmission can cause network saturation, resource consumption and response delays. OS commands like ifconfig and netstat can be used to obtain stats on network performance; high values for "errors" and "dropped" packets indicate a problem in the network.
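    A sketch of the OS-level checks the notes mention (interface name is illustrative):
    ifconfig eth0 | grep -E 'errors|dropped'
    netstat -s | grep -i retrans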

    43. Questions?
    Lots of things we didn't cover. If we didn't cover something you wanted to hear, please contact me.

