

  1. Large RAC Benchmark – Lessons Learned • Hanan Hit • NOCOUG, Feb 14, 2006

  2. Today’s agenda • SUT Definitions • Benchmark Goals • Used RAC Mode • OS (HP-UX) best practices • HMP - Hyper Messaging Protocol • UDP Usage • OS Scheduling • Deadlocks • Startup & Shutdown • Execution Plans • ORA Parameters • Multiple Block Sizes

  3. The Tested Application • Type 1 – PowerBuilder front end on Windows, Tuxedo middleware, and an Oracle backend. • Type 2 – Java on Sun servers using WebLogic as the middle tier. • Type 3 – Batch processing written in Pro*C/Pro*COBOL. • All the Oracle database servers run on HP hardware with Oracle 9.2.0.5 on HP-UX 11i v2.

  4. SUT

  5. SUT – (Cont.) • Database size: 30 TB • Raw devices • Non-shared Oracle Home • Two full XP 12K storage arrays from HP • 82 TB each (1,152 disks, 73 GB, 15K RPM) • 128 GB cache • 120,000 IOPS (maximum random disk performance)

  6. Machines

  7. Machines – (Cont.)

  8. Machines – (Cont.)

  9. Benchmark Goals • Manual Load Balancing – • Online only – 6K Type 1, 3.5K Type 2 (Test 1) • All the connections to the RAC instances should be balanced. • The throughput should be comparable with the earlier NON-RAC test environment. • Online + Batch – 6K Type 1, 3.5K Type 2, 500 Type 3 (Test 2) • All the connections to the RAC instances should be balanced. • The throughput should be comparable with the earlier NON-RAC tests. • Type 3 timings should be comparable with the NON-RAC tests. • Online only – 12K Type 1, 7K Type 2 (Test 3) • All the connections to the RAC instances should be balanced. • The throughput should increase by 95% compared with Test 1.

  10. Benchmark Goals – (Cont.) • RAC Load Balancing – • Online only – 6K Type 1, 3.5K Type 2 (Test 1) • All the connections to the RAC instances should be balanced. • The throughput should be comparable with the earlier NON-RAC test environment. • Configuration/Setup: • Amend /etc/tnsnames.ora with the RAC load-balancing connect strings (a sketch follows below). • RAC Fail Over Testing – • Online only – 6K Type 1, 3.5K Type 2 (Test 1) • All database connections that were connected to the RAC3 instance should fail over successfully to the RAC1 instance.
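A minimal sketch of the kind of client-side load-balancing entry referred to above, as a tnsnames.ora fragment. The alias, host names, and service name (BENCHDB_LB, rac1-rac3, BENCHDB) are placeholders for illustration and not the actual benchmark configuration:

    BENCHDB_LB =
      (DESCRIPTION =
        (LOAD_BALANCE = ON)
        (FAILOVER = ON)
        (ADDRESS = (PROTOCOL = TCP)(HOST = rac1)(PORT = 1521))
        (ADDRESS = (PROTOCOL = TCP)(HOST = rac2)(PORT = 1521))
        (ADDRESS = (PROTOCOL = TCP)(HOST = rac3)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = BENCHDB))
      )

With LOAD_BALANCE=ON the client picks an address at random, which is how the "balanced connections" goal above is measured from the client side.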

  11. RAC - Mode

  12. OS Port Specific Best Practices • IPC Protocols • HP's HMP • HP Tru64's RDG (Reliable Datagram) • Sun's RSM (Remote Shared Memory) • VIA on Intel • In general the recommendation is to use a low-latency user-mode IPC protocol rather than UDP or TCP.

  13. What is HyperFabric • HyperFabric is a high-speed cluster interconnect fabric that supports both industry-standard TCP/UDP over IP and HP's proprietary Hyper Messaging Protocol (HMP). • HyperFabric extends the scalability and reliability of TCP/UDP by providing transparent load balancing of connection traffic across multiple network interface cards (NICs) and transparent failover of traffic from one card to another without invoking MC/ServiceGuard. • The HyperFabric NIC incorporates a network processor that implements HP's Hyper Messaging Protocol and provides lower latency and lower host CPU utilization for standard TCP/UDP benchmarks over HyperFabric when compared to Gigabit Ethernet. • Hewlett-Packard released HyperFabric in 1998 with a link rate of 2.56 Gbps over copper. In 2001, Hewlett-Packard released HyperFabric 2 with a link rate of 4.0 Gbps over fiber and compatibility with the copper HyperFabric interface. Both HyperFabric products support clusters of up to 64 nodes.

  14. HyperFabric Switches • Hewlett-Packard's fastest cluster interconnect is provided by its proprietary HyperFabric switches. • The latest product is HyperFabric 2, a new set of hardware components with fiber connectors that enables a low-latency, high-bandwidth system interconnect. • With fiber interfaces, HyperFabric 2 runs at up to 4 Gbps full duplex over distances of up to 200 meters. • HyperFabric 2 also scales well, supporting up to 16 hosts via point-to-point connectivity and up to 64 hosts via fabric switches. It is backward compatible with previous versions of HyperFabric and is available on IA-64 and PA-RISC servers.

  15. Hyper Messaging Protocol (HMP) • HMP, also known as "Lowfat", is HP's reliable user-mode IPC protocol. • Hyper Messaging Protocol significantly expands on the feature set provided by TCP/UDP by providing a true reliable datagram model for both remote direct memory access (RDMA) and traditional message semantics. • Coupled with OS-bypass capability and the hardware protocol offload provided by HyperFabric, HMP delivers high bandwidth, low latency and extremely low CPU utilization, with an interface and feature set optimized for business-critical parallel applications such as Oracle9i RAC.

  16. How to Relink Oracle with HMP? • OS configuration • setprivgrp dba MLOCK • Persistent across reboots => add "dba MLOCK" to /etc/privgroup • The kernel parameter maxfiles must be at least 1024 • HMP configuration lives in /opt/clic/lib/skgxp/skclic.conf • HMP relink • make -f ins_rdbms.mk rac_on ipc_hms ioracle • UDP relink (back to the default) • make -f ins_rdbms.mk rac_on ipc_udp ioracle
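Putting the steps above together as a shell sketch. The make targets are the ones quoted on the slide, so verify them against the ins_rdbms.mk shipped with your release before relinking, and make sure all instances on the node are down:

    # One-time OS setup, run as root.
    setprivgrp dba MLOCK                 # let the dba group lock memory
    echo "dba MLOCK" >> /etc/privgroup   # persist across reboots (merge with any
                                         # existing "dba ..." line instead of duplicating it)

    # Relink Oracle against HMP, run as the oracle software owner.
    cd $ORACLE_HOME/rdbms/lib
    make -f ins_rdbms.mk rac_on ipc_hms ioracle

    # Fall back to the default UDP IPC library.
    make -f ins_rdbms.mk rac_on ipc_udp ioracle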

  17. HMP Diagnostics • Check the alert.log and verify that the HMP protocol is in use. • netstat -in – check that the CLIC #N interfaces are up on all nodes and that the MTU is at least 32K. • The cluster_interconnects init.ora (spfile) parameter will not show the IP address and should not be used. • > oradebug setmypid • > oradebug ipc
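A sketch of the checks listed above, to be run on every node; the oradebug dump goes to a trace file in user_dump_dest and names the SKGXP transport in use:

    # Interfaces: the CLIC interfaces should be up with an MTU of at least 32K.
    netstat -in | grep -i clic

    # Ask a foreground process to dump its IPC context; the resulting trace
    # file shows whether HMP or UDP is being used for the interconnect.
    sqlplus -s "/ as sysdba" <<'EOF'
    oradebug setmypid
    oradebug ipc
    exit
    EOF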

  18. HMP Findings • Each Oracle shadow process requires 300-400K of additional memory. • The system could not scale the number of concurrent connections beyond ~1K, and no additional connections could be established to either of the instances once this threshold had been reached. • The maximum memory per (single) card was 4 GB during the benchmark timeframe. • Once the above threshold was reached we could not start more than a single instance; the other instances were left waiting. • Direct connections could still be established to the single node that was active.

  19. HMP Findings (Cont.) • netstat may show that most of the traffic is going to a single interface, but this is not really a concern: the traffic is load balanced within the fabric and the command cannot see it. • We rolled back to UDP over HyperFabric, as this is usually more than "good enough".

  20. UDP Findings • Use UDP over HyperFabric, as this can make use of multiple cards. • Because of the failover scenarios we had to cope with at least twice the normal node load on a single node, hence a minimum of 20K concurrent connections. • The initial setup failed with ORA-27504 errors pointing to OS error 227. • The cause was an OS limit on anonymous UDP ports, which by default span ports 49152 through 65535 (roughly 16K ports). • The solution is to lower the starting port from 49152 to 20000 (see the sketch below): • /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000
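A sketch of the workaround; ndd changes are not persistent by themselves, so on HP-UX 11i they are normally also recorded in /etc/rc.config.d/nddconf (check the exact format for your release):

    # Show the current floor of the anonymous UDP port range, then lower it
    # so that well over 16K UDP ports are available for RAC connections.
    /usr/bin/ndd -get /dev/udp udp_smallest_anon_port
    /usr/bin/ndd -set /dev/udp udp_smallest_anon_port 20000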

  21. Scheduling & Process Prioritization • Most UNIX operating systems implement time-sharing (TS) and real-time (RT) scheduling. • The default for user processes is TS. • With the regular TS scheduler, the longer a process runs the weaker its priority becomes, and the chance of it being preempted increases. • The cost of a context switch is likely to be very high, especially for the LMD/LMS processes in RAC. • The SCHED_NOAGE scheduler allows processes to hold the CPU until they give it up. • Starting all processes at the same, highest priority of this scheduler removes the possibility that they will 'steal' the CPU from each other.

  22. SCHED_NOAGE • Unlike under the normal TS policy, a process scheduled with SCHED_NOAGE will not have its priority increased or decreased, nor will it be preempted. • How to implement: • As root: setprivgrp -g dba RTSCHED RTPRIO • To persist across reboots, enter "dba RTSCHED RTPRIO" in /etc/privgroup • In init.ora (spfile) set hpux_sched_noage=178 • Check with ps that the Oracle background processes run at priority 178 (see the sketch below). • In RAC, make sure this is true on all nodes.
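A quick verification sketch, to be run on every node once the privileges and hpux_sched_noage=178 are in place. The awk field positions assume the usual HP-UX ps -efl column layout (PRI in the seventh column), so treat them as an assumption and adjust if your output differs:

    # Print the priority (PRI) and command name of the Oracle background
    # processes; they should all show the configured priority (178).
    ps -efl | grep '[o]ra_' | awk '{ print $7, $NF }'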

  23. Deadlock etc. • Deadlock situations can escalate into a serious issue in a RAC environment. • Make sure to check for these issues prior to any RAC implementation. • Hints that might help: • *.event = "7445 trace name ERRORSTACK level 3" • We encountered a situation in which an instance crashed due to deadlocks: RAC1 crashed while the databases on the other nodes remained up.

  24. Deadlock etc. (Cont.) • *** In a RAC environment deadlock detection works differently than in NON-RAC. In NON-RAC, Oracle updates the alert.log with ORA-60 and generates a user trace file, but in RAC it puts the deadlock graph only in the LMD trace file (background_dump_dest; no user trace file). • Since we are dealing with global resources in RAC, the deadlock detection mechanism differs from a non-RAC environment. In RAC, the LMD process periodically checks for deadlocks on the database. How often it checks is controlled by the parameter _lm_dd_interval. The default is 60 seconds on most versions/platforms and can be checked with the following query: • select x.ksppinm, y.ksppstvl from x$ksppi x, x$ksppcv y where x.indx = y.indx and x.ksppinm like '%lm_dd%' order by x.ksppinm; • Setting _lm_dd_interval too low can cause LMD to consume too much CPU. Setting it too high can cause applications to hang because deadlocks are not being detected quickly enough. The default is sufficient for most cases.

  25. Deadlock etc. (Cont.) • *** Deadlocks may occur on the same node or across different nodes in the cluster. • Diagnosing ORA-60 errors on a single instance is easier than when the deadlock spans multiple instances (getting the SQL statement). • For multi-instance deadlocks, starting with 9.2.0.6 / 10.1.0.3 and above, Oracle dumps additional diagnostics in a RAC environment. • After checking the <node> in the LMD trace file, the following event can be set on all instances (spfile); a sketch follows below: • event="60 trace name errorstack level 3;name systemstate level 10"
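A sketch of setting that event cluster-wide through the spfile (it takes effect at the next startup of each instance). Note that systemstate dumps at level 10 can be large on a busy system, so treat this as a temporary diagnostic setting:

    sqlplus -s "/ as sysdba" <<'EOF'
    -- Record the ORA-60 diagnostic event for all instances in the spfile.
    ALTER SYSTEM SET
      event='60 trace name errorstack level 3;name systemstate level 10'
      SCOPE=SPFILE SID='*';
    exit
    EOF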

  26. Startup & Shutdown • Be cautious while performing startup and shutdown operations on such a large SGA’s. • Shutdown may take very long time (half an hour…). • While performing the below procedure we got few Oracle Errors • srvctl start database -d <NODE NAME> • Bug 2540942 (LMS may spin in kjctr_rksxp() during simultaneous shutdown). • During simultaneous shutdown of several instances with large buffer caches there is the chance of running out of tickets due to the amount of messages that need to be sent for cleanup purposes. Most visible side effect is LMS spinning in kjctr_rksxp(). the ora-600:[504] is the side effect of this.

  27. Different Execution Plans • Node 1 example – SQL hash value 2601636153:

    select members.MEMBER_ID into :b0:b1
    from (select t.MEMBER_ID
          from UFMI_INV u, RM_MEMBER_ID t
          where u.URBAN_ID(+) = :b2
            and u.FLEET_ID(+) = :b3
            and u.MEMBER_ID(+) = t.MEMBER_ID
            and u.MEMBER_ID is null
            and t.MEMBER_ID > :b4
            and t.MEMBER_ID <= :b5
          order by t.MEMBER_ID) members
    where ROWNUM <= :b6

    Plan, as captured (only the populated columns are shown; A-Rows, Buffers,
    Reads, Writes and E-Time were empty in the capture):

    | Operation        | Name            | E-Rows |
    | INDEX RANGE SCAN | UFMI_INV_PK     |      4 |
    | SORT JOIN        |                 |      4 |
    | INDEX RANGE SCAN | RM_MEMBER_ID_PK |   142K |
    | MERGE JOIN OUTER |                 |        |
    | FILTER           |                 |        |
    | FILTER           |                 |        |
    | VIEW             |                 |   142K |
    | COUNT STOPKEY    |                 |        |
    | SELECT STATEMENT |                 |        |

  28. Different Execution Plans (Cont.) • Node 3 example – same statement and same SQL hash value (2601636153), but a different plan:

    | Operation          | Name            | E-Rows |
    | INDEX RANGE SCAN   | UFMI_INV_PK     |      1 |
    | INDEX RANGE SCAN   | RM_MEMBER_ID_PK |      1 |
    | NESTED LOOPS OUTER |                 |        |
    | FILTER             |                 |        |
    | FILTER             |                 |        |
    | VIEW               |                 |      1 |
    | COUNT STOPKEY      |                 |        |
    | SELECT STATEMENT   |                 |        |

    Node 1 chose a sort/merge outer join with an estimate of 142K rows, while node 3 chose a nested-loops outer join with an estimate of 1 row.

  29. What Might Be the Reason? • The first set of binds on node 1 returns a larger result set. • Bind peeking • Versions 9i/10g use "bind variable peeking": the CBO assumes that the bind variable values used for the first execution of a SQL statement are representative of the values to be used in future executions of the same statement. • Bind variable peeking occurs every time the cursor is parsed or loaded. • If the data is skewed, the execution plan may change when, for example, the shared pool is flushed or the underlying objects are analyzed. • The optimizer makes different estimates because of the different parameter settings on each node. • Histograms built on these columns. • A sketch for comparing the per-node plans follows below.
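A sketch (9.2 dynamic views) for confirming that each node really compiled a different plan for the same statement; the hash value is the one shown on the previous slides:

    sqlplus -s "/ as sysdba" <<'EOF'
    -- One row per instance and child cursor: differing plan_hash_value
    -- values confirm that the nodes are running different plans.
    SELECT inst_id, child_number, plan_hash_value, executions, buffer_gets
    FROM   gv$sql
    WHERE  hash_value = 2601636153
    ORDER  BY inst_id, child_number;
    exit
    EOF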

  30. How to "Solve" the Issue? • Turn off bind peeking (_optim_peek_user_binds=false) • Pros – eliminates any bind peeking. • Cons – affects *all* statements, some of which could already be benefiting from bind peeking. • Hints (if the data is skewed or the result set of the binds varies a lot) • Pros – always get the same plan, regardless of the peeked binds. • Cons – unfortunately requires an application change. • Use stored outlines (we chose this option; a sketch follows below) • Pros – no need to change the application, and applied immediately. • Cons – extra DBA maintenance, and any change to the query disables the use of the outline.
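A stored-outline sketch for the option the team chose. The outline name, category, and the simplified statement text are hypothetical; the real outline must be created on the application's exact SQL text. USE_STORED_OUTLINES is not an init.ora parameter, so it has to be enabled with ALTER SYSTEM (or ALTER SESSION) on every instance:

    sqlplus -s "/ as sysdba" <<'EOF'
    -- Capture the desired plan for a statement into a named outline
    -- (requires the CREATE ANY OUTLINE privilege).
    CREATE OR REPLACE OUTLINE member_lookup_ol
      FOR CATEGORY bench_outlines
      ON SELECT t.member_id FROM rm_member_id t
         WHERE t.member_id > :b1 AND t.member_id <= :b2;

    -- Tell this instance to use outlines from that category; repeat on each
    -- instance, since the setting cannot be stored in the spfile.
    ALTER SYSTEM SET use_stored_outlines = bench_outlines;
    exit
    EOF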

  31. _Hidden Parameters & More • _row_cr => • Global cache cr request event – this event is generated when an instance has requested a consistent-read data block (or undo segment header) and the block to be transferred has not yet arrived at the requesting instance. Make sure to check V$CR_BLOCK_SERVER. • Setting _row_cr=TRUE (default FALSE) might reduce the number of CR rollbacks and avoid a costly cleanup/rollback in RAC: instead of performing a block-level cleanup/rollback, Oracle tries to generate a CR copy for the given row. • Once we used this parameter we hit another bug (3577772.8), which we found during a Direct Path export.

  32. _Hidden Parameters & More (Cont.) • FAST_START_MTTR_TARGET => • This parameter bounds the time needed for instance or crash recovery, and when set to a non-zero value it uses incremental checkpoints. The default value is 300 seconds (5 minutes). • The customer was using 0 seconds; setting the parameter back to the Oracle default reduced inter-node communication as well as overall I/O on the system. Make sure to adjust it to suit your site's requirements. • _BUMP_HIGHWATER_MARK_COUNT => • HW enqueue (high-water mark) – the table's high-water mark may be raised constantly while the table is growing and running out of blocks for new inserts on the free list. The default value for bumping up the HWM is 0, which may be too low on a busy system or for the average insert size. • We increased it to 5 and nearly eliminated this enqueue altogether. A sketch of both settings follows below.
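A sketch of applying the two changes described above across the cluster through the spfile. Underscore parameters should normally only be set on the advice of Oracle Support; the values below are simply the ones quoted on this slide:

    sqlplus -s "/ as sysdba" <<'EOF'
    -- Back to the documented default: bounded crash-recovery time with
    -- incremental checkpointing.
    ALTER SYSTEM SET fast_start_mttr_target = 300 SCOPE=BOTH SID='*';

    -- Bump the high-water mark by more blocks per HW enqueue request.
    ALTER SYSTEM SET "_bump_highwater_mark_count" = 5 SCOPE=SPFILE SID='*';
    exit
    EOF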

  33. Multiple Block Sizes? • The single-instance system was using a KEEP buffer pool nearly the same size as the DEFAULT buffer pool. • We tried to segregate the indexes from the data into a smaller block size (2K for the indexes). • ORA-00379: no free buffers available in buffer pool DEFAULT for block size 2K • Because of the non-symmetric machines, we hit the above error when the matching buffer pool was not defined on all instances (see the sketch below). • The recommendation – combine the different buffer pools into one; there is no need to use KEEP or RECYCLE. Oracle can manage a single pool well.
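Had the 2K index tablespace been kept, the fix for the ORA-00379 above would have been to define the matching cache on every instance; a sketch, with an illustrative size only:

    sqlplus -s "/ as sysdba" <<'EOF'
    -- A buffer cache for every non-default block size in use must exist on
    -- each instance that can touch those tablespaces; SID='*' covers them all.
    ALTER SYSTEM SET db_2k_cache_size = 256M SCOPE=SPFILE SID='*';
    exit
    EOF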

  34. Summary • Address application deadlocks. This is especially important when moving to RAC. • Combine the different buffer caches into one; there is no need to use KEEP or RECYCLE. Oracle can manage them well. • Move to locally managed tablespaces with ASSM (Automatic Segment Space Management). • Do not consider using HMP when going to production on very large systems; UDP over HyperFabric should be enough. • Try to use hints once you encounter different execution plans (binds). • If possible, partition the applications across a few physical nodes (only at really high concurrency and volumes). • All of the bugs mentioned have been fixed in newer versions of Oracle. • Take extra precautions when performing startup & shutdown in such large environments.

  35. References • HP Cluster Install • Oracle 9i – Best Practices – Performance Diagnostics & Tuning • http://www.oracle.com/technology/documentation/index.html (search for HMP)

  36. Thanks
