Oracle RAC (cluster) application scalability
Eric Grancher and Anton Topurov, CERN IT department / openlab
RAC-aware change
In order to reduce cluster contention, a new insertion scheme has been developed. In the new scheme:
- each “client” receives a unique number
- each of the tables is partitioned
- each of the “clients” inserts into a separate partition.
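A minimal sketch of such a partition-per-client scheme is shown below; the table and column names are assumptions for illustration, not the actual PVSS schema:

```sql
-- Sketch of the partition-per-client scheme (assumed names).
-- Concurrent clients insert into disjoint partitions, so the cluster
-- instances no longer contend for the same table and index blocks.
CREATE TABLE events_history (
  client_id  NUMBER    NOT NULL,   -- unique number given to each client
  element_id NUMBER    NOT NULL,
  ts         TIMESTAMP NOT NULL,
  value      NUMBER
)
PARTITION BY LIST (client_id) (
  PARTITION client_001 VALUES (1),
  PARTITION client_002 VALUES (2)
  -- ... one partition per registered client
);

-- Each client names its own partition explicitly in the insert:
INSERT INTO events_history PARTITION (client_001)
VALUES (1, :element_id, :ts, :value);
```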
Thanks to these changes, the system achieved 150’000 events per second at peak. However, it was observed that under heavy load the event loading could occasionally freeze.
Further modifications were made to pre-allocate space (tablespace re-utilisation as a background task).
Thanks to these changes, the system now achieves a consistent insertion rate of 150’000 events per second.
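One way such pre-allocation can be done from a scheduled background job is sketched below; the table and partition names are assumptions:

```sql
-- Sketch: extend a partition ahead of time from a background job,
-- so the insert path never stalls on extent allocation (names assumed).
ALTER TABLE events_history MODIFY PARTITION client_001
  ALLOCATE EXTENT (SIZE 100M);
```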
Introduction – Aim
Oracle Real Application Clusters (RAC) is Oracle's cluster database technology. It has been used at CERN since 1996 (it was called Oracle Parallel Server up to Oracle version 8.1). RAC is increasingly used in High Energy Physics, notably at CERN and at the LCG Tier-1 sites.
It is a “shared-everything” clustered system where each of the clustered nodes directly accesses the storage via a Storage Area Network (typically Fibre Channel) or via Network Attached Storage (typically NFS over Ethernet). RAC provides a scalable solution in terms of high availability and performance. The mechanism that coordinates data changes across the different cluster nodes is called “cache fusion”. Applications which are not “read-mostly” may not easily scale with the number of nodes in a RAC cluster.
Siemens/ETM’s PVSS software is a key component of the supervisory layer of the control systems for the LHC accelerator and experiments. It is a SCADA system. One of its components is the module that archives the monitoring data events in a database. Current requirements are as high as 150’000 events per second.
We have shown that nearly linear scalability can be obtained with RAC for a typical write-mostly workload. The tests were done on a six-node RAC cluster using 32 SATA disks connected to the hosts via Fibre Channel.
Achieving this requires an adequate data model and a database client interface capable of bulk operations. Instrumentation in both the database and the application is necessary for fine-grained tuning. We made extensive use of the table partitioning feature to reach such transaction rates.
The iterative methodology for RAC scalability, using the wait interface features provided in Oracle 10g, has been key to the success of this work.
Initial tests showed that “out of the box” archiving performance was about 100 events per second. The database was mostly idle, waiting on the clients (“SQL*Net message from client” wait event). Investigations showed that:
- a generic C++ database API was used (no bulk loading)
- a trigger was launched after each event upload.
Following the initial tests and recommendations, the company providing the software made a set of radical changes: the API was moved to OCCI (Oracle C++ Call Interface), and the software now bulk-loads a set of events into a temporary table and then makes a single call to a PL/SQL package.
Thanks to these changes, the system achieved 2’000 events per second.
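The temporary-table pattern described above can be sketched as follows; all object and procedure names here are assumptions, not the actual PVSS API:

```sql
-- Sketch of the bulk-loading pattern (all names are assumptions).
-- The OCCI client array-inserts a batch into the temporary table,
-- then makes a single PL/SQL call to process the whole batch.
CREATE GLOBAL TEMPORARY TABLE temp_events (
  element_id NUMBER,
  ts         TIMESTAMP,
  value      NUMBER
) ON COMMIT DELETE ROWS;

CREATE OR REPLACE PACKAGE event_loader AS
  PROCEDURE flush_batch;
END event_loader;
/

CREATE OR REPLACE PACKAGE BODY event_loader AS
  PROCEDURE flush_batch IS
  BEGIN
    -- move the batch from the temporary table to the history table
    INSERT INTO events_history (element_id, ts, value)
      SELECT element_id, ts, value FROM temp_events;
    COMMIT;  -- also clears the ON COMMIT DELETE ROWS temporary table
  END flush_batch;
END event_loader;
/
```

One PL/SQL call per batch replaces thousands of network round trips, which removes the “SQL*Net message from client” bottleneck seen in the initial tests.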
In order to reduce the I/O load (“db file sequential read” is the wait event associated with single-block I/O requests), several changes were tested and implemented:
- index usage analysis and reduction
- table structure change (move to Index Organised Table)
- replacement of the merge by an insert
- use of direct path insert (bypassing the buffer cache).
Thanks to these changes, the system achieved 16’000 events per second. The top wait events now indicated clearly that the system was slowed down by cluster-related wait events.
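Two of the changes in this iteration can be sketched as follows (assumed names; the APPEND hint requests a direct-path insert that bypasses the buffer cache):

```sql
-- Sketch of iteration-3 changes (assumed names): an index-organized
-- table that merges table and index storage, and a direct-path insert.
CREATE TABLE events_last (
  element_id NUMBER PRIMARY KEY,
  ts         TIMESTAMP,
  value      NUMBER
) ORGANIZATION INDEX;

-- plain direct-path insert instead of the earlier MERGE:
INSERT /*+ APPEND */ INTO events_history (element_id, ts, value)
  SELECT element_id, ts, value FROM temp_events;
```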
Figure 3. Cache fusion via the Global Cache Service (GCS): a data block held in exclusive mode by instance 5 has been changed there, and instance 1 attempts to modify it. The block is copied to instance 1 and the GCS is informed; the data block is not written to disk, and all communication is done over the cluster interconnect.
Figure 1. RAC logical view: shared disk, servers, and clients (interactive, batch).
Table 1. AWR report during a test after iteration 1.
Table 3. AWR report during a test after iteration 3.
Graph 1. Events per second loaded in the database.
The methodology used for performance tuning is based on the Oracle wait interface. Before version 7, developers and database administrators worked with ratios that provide a statistical view of some key database indicators (in-memory vs. on-disk sorts, full table scans vs. indexed table access…).
As of Oracle version 7, the database exposes its instrumentation results (event 10046) through the wait interface, which completely changed the methodology for performance tuning.
In version 10, Oracle introduced Active Session History (ASH) and the Automatic Workload Repository (AWR), which automatically gather and aggregate session-level and instance-wide wait events.
The methodology used for this project is an iterative AWR/ASH-based tuning exercise, repeated until the required performance was achieved.
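Step 2 of the loop (finding the top wait event) can be approximated with a query against the standard ASH view; the 15-minute window here is an arbitrary assumption:

```sql
-- Sketch of step 2: rank the wait events sampled by ASH over the
-- last 15 minutes of the test run (window choice is an assumption).
SELECT event, COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time   > SYSDATE - 15/1440
AND    session_state = 'WAITING'
GROUP  BY event
ORDER  BY samples DESC;
```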
Table 2. AWR report during a test after iteration 2.
Figure 4 detail, before: each event was inserted row by row (“insert into events_history (…) values (…)”) and fired a trigger updating the last-value table (“update eventlastval set …”).
Figure 4 detail, after: bulk insert into a temporary table, then PL/SQL direct-path load: “insert /*+ APPEND */ into events_history partition (…) select … from temp”.
We thank Manuel Gonzalez Berges, Piotr Golonka, Wayne Salter (CERN IT-CO), Chris Lambert, Nilo Segura Chinchilla (CERN IT-DES), Luca Canali (CERN IT-PSS), Svetozár Kapusta (ALICE experiment), Bjørn Engsig and Lothar Flatz (Oracle).
Figure 2. Iterative methodology for performance tuning: 1. run the workload and gather ASH/AWR information; 2. find the top event that slows down the processing; 3. understand why time is spent on this event; 4. modify the client code, database code or hardware configuration.
Figure 4. Description of the architecture changes (bulk insert into a temporary table, then PL/SQL load). Test setup with a six-node RAC and 150 clients (lxplus), each generating 1’000 changes/s.
CERN - IT Department
CH-1211 Genève 23