
Database post-mortem


Presentation Transcript


  1. Database post-mortem Maria Girone, IT WLCG CCRC’08 Post-Mortem Workshop

  2. Agenda of the DB post-mortem • Database Services for Physics (Maria Girone, 20 mins) • ATLAS DB Metrics (Sasha Vaniachine, 10 mins) • Castor and SRM (Giuseppe Lo Presti, 15 mins)

  3. Database Service for Physics • Upgrade to new hardware, standby DBs • Experience in CCRC’08 • DB resource usage • DB availability and performance monitoring • “DB Dashboard” • Power cut • Preparing for the LHC start-up

  4. Introduction • A wide range of mission-critical applications that are part of LCG rely on databases • These services are consistently rated by the LHC VOs as amongst the most critical • Our goal is to set up and operate a Distributed Database Service for the WLCG • Oracle Real Application Clusters (RAC) as building-block architecture at Tier0 and Tier1 sites • CERN uses Oracle Data Guard for major hardware/software upgrades • Oracle Streams replication between online and offline (ATLAS, CMS, LHCb) and from Tier0 to Tier1 sites (ATLAS and LHCb) • CMS uses Frontier/SQUID from Fermilab for distributing and caching database data via a web protocol (see the sketch below)
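
The Frontier/SQUID idea is worth a one-line illustration: read-only database lookups are expressed as HTTP requests, so ordinary web proxies (SQUID) can cache the answers and repeated queries never reach the Oracle backend. A minimal conceptual sketch in Python follows; the URL and query encoding are hypothetical stand-ins, not Frontier's actual wire format:

```python
import requests  # generic HTTP client

# Hypothetical caching front-end for read-only DB queries. Identical GET
# URLs are cache hits at the SQUID layer, so only the first lookup of a
# given (table, key) pair ever touches the Oracle backend.
FRONTIER_URL = "http://frontier.example.cern.ch:8000/query"  # hypothetical

def cached_lookup(table: str, key: str) -> str:
    resp = requests.get(FRONTIER_URL, params={"table": table, "key": key})
    resp.raise_for_status()  # surface HTTP errors instead of bad data
    return resp.text
```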

  5. DB Services for Physics at CERN • ~25 RAC databases (up to 6 nodes) • 125 servers, 150 disk arrays (2000 disks) • Or: 450 CPU cores, 900 GB of RAM, 550 TB of raw disk space(!) • Connected via the 3D project to 10 Tier-1 sites for synchronized databases • Sharing policies and procedures • Team of 5 DBAs + service coordinator and link to experiments • 24x7 (still) on “best effort” service for production databases • Maintenance without downtime within RAC features • 0.04% service unavailability (all 2008) = 3.5 hours/year • 0.22% server unavailability (all 2008) = 19 hours/year (arithmetic checked below) • Patch deployment, broken hardware
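
The unavailability figures convert to hours/year as a simple percentage of the 8760 hours in a year; a quick sanity check of the slide's numbers:

```python
# Convert the quoted unavailability percentages into hours per year.
HOURS_PER_YEAR = 365 * 24  # 8760

for label, pct in [("services", 0.04), ("servers", 0.22)]:
    hours = HOURS_PER_YEAR * pct / 100
    print(f"{label}: {pct}% unavailability = {hours:.1f} hours/year")

# services: 0.04% unavailability = 3.5 hours/year
# servers: 0.22% unavailability = 19.3 hours/year
```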

  6. Upgrade to new hardware (April 2008)

  7. Hardware details • Major migration in April in view of the LHC start-up (RAC5 and RAC6) • Dual-CPU quad-core DELL 2950 servers, 16 GB memory, Intel 5400-series “Harpertown”, 2.33 GHz clock • 60 disk arrays, each with 16 SATA disks of 400 GB • Dual-ported FC controllers • Due to some initial instabilities with the new hardware, we operated CCRC’08 in safe conditions, using the “old” hardware as Data Guard stand-by DBs • No changes needed in service procedures, but additional protection against failures • Potentially this is a model for future upgrades • No further hardware failures observed • RAC5 and RAC6 have a “mixed power configuration” • Storage, FC infrastructure, Ethernet switches and half of the servers are connected to the critical power

  8. Database set-up [diagram: layout of clusters RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]

  9. Database Resource Usage in 2008 • No significant increase in resource usage on databases • Clear improvement seen with the change of hardware

  10. DB availability and performance monitoring

  11. “DB Dashboard” • All-in-one page • Offline and online DBs + Streams • SLS availability • Streams layout + availability • Usage per instance • Usage per application • Users + applications • Logbook

  12. DB availability and performance page

  13. Streams Monitoring

  14. ATLAS – DB • Quad-core upgrade for CCRC’08 • More cache and CPU power • Critical DQ2 application showed a measurable performance improvement, up to 4x for some queries • Most applications moved to one preferred node: better cache utilization and less cluster interconnect traffic • PVSS • Production accounts ran smoothly during CCRC’08 • The upgrade to PVSS DB schema version 7.0 beta has been tested: minor problems found, all fixed • We expect the final patch to come out in a few weeks; experiments will decide if they want the upgrade to be done

  15. WLCG – DB • Locked owner accounts to reduce the probability of human errors • SAM/Gridview – biggest resource consumer • Sharing several tables • Schema review being addressed with the application developers • VOMS • Needs to follow some DB good practices • Use of bind variables • Use of sequences (see the sketch below)
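
The two good practices flagged for VOMS are worth a concrete illustration. A minimal sketch using the cx_Oracle client (credentials, DSN, and all table/sequence names are hypothetical): bind variables keep the statement text constant so Oracle can reuse one parsed cursor, and sequences generate unique IDs server-side without application locking.

```python
import cx_Oracle  # Python driver for Oracle databases

# Hypothetical connection details, for illustration only.
conn = cx_Oracle.connect("voms_user", "secret", "wlcg-db.example.cern.ch/voms")
cur = conn.cursor()

# Bind variable: the SQL text never changes, so Oracle hard-parses it once
# and reuses the cursor. Interpolating the literal into the string instead
# would force a fresh parse (and shared-pool entry) per distinct value.
cur.execute("SELECT role FROM vo_members WHERE dn = :dn",
            dn="/DC=ch/CN=example user")

# Sequence: a server-side counter hands out unique IDs safely under
# concurrency, avoiding the SELECT MAX(id)+1 anti-pattern.
cur.execute("INSERT INTO vo_members (id, dn) VALUES (member_seq.NEXTVAL, :dn)",
            dn="/DC=ch/CN=example user")
conn.commit()
```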

  16. CMS – DB • Quad-core upgrade for CCRC’08 • More cache and CPU power • Critical PhEDEx application showed a measurable performance improvement • Most applications moved to one preferred node: better cache utilization and less cluster interconnect traffic • CMS Dashboard • Significant increase in resource consumption (CPU and IOPS), frequent “high load” alarms • Causes identified (and reported to developers) • Better application change management • Automatic optimization of database jobs • Planned a schema review with developers • Smooth running of all other applications

  17. Streams Issues • ATLAS online capture process aborted on Tuesday 20/05 at 02:30 • LogMiner failed due to a memory problem on the database • Restarted on 20/05 at 09:20 (20 mins to synchronize) • Separate capture processes to synchronize the new databases at PIC aborted on Monday 19/05 at 19:40 • “_SGA_SIZE” parameter missing during the setup • A long transaction caused the LogMiner component to run out of memory • Restarted on Tuesday 20/05 at 09:15 (30 mins to synchronize) • No major impact on experiments’ production • Working-hour operation with central expertise seems an appropriate Service Level (a monitoring sketch follows below)
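
Both incidents were Streams capture-process aborts, which show up in the DBA_CAPTURE dictionary view together with the underlying ORA- error. A minimal detection sketch along those lines (connection details hypothetical; DBA_CAPTURE and its STATUS/ERROR_MESSAGE columns are standard Oracle 10g):

```python
import cx_Oracle

# Hypothetical Streams administrator connection.
conn = cx_Oracle.connect("strmadmin", "secret", "atlr.example.cern.ch/atlr")
cur = conn.cursor()

# A capture process that dies (e.g. the LogMiner memory errors above)
# moves to STATUS = 'ABORTED'; ERROR_MESSAGE carries the ORA- text.
cur.execute("""
    SELECT capture_name, status, error_message
      FROM dba_capture
     WHERE status <> 'ENABLED'
""")
for name, status, message in cur:
    print(f"ALERT: capture process {name} is {status}: {message}")
```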

  18. Tier1 sites DB status • ASGC • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • BNL • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • CNAF • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • GridKa • No problems during CCRC’08, and improved performance on LFC (increased # of threads from 20 to 60) • IN2P3 • No problems during CCRC’08 • NDGF • No problems during CCRC’08 • PIC • No problems during CCRC’08; upgraded ATLAS to a 3-node RAC (quad-core, 8 GB RAM, 4 TB storage) and LHCb to a 2-node RAC (quad-core, 8 GB RAM, 2 TB storage) • RAL • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • SARA • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • TRIUMF • Increased RAM from 4 GB/node to 10 GB/node and CPU from 1 dual-core (1.6 GHz) to 2 dual-core (3.0 GHz) per node on the ATLAS 3D RAC

  19. CC Power Cut 30.05.08

  20. CERN Power cut – Friday 30-5 • At 6:30 a power cut caused the “physics power” to go down • Devices connected to the “critical power” were not affected (RAC2 and RAC3) • RAC5 and RAC6 clusters have a “mixed power” configuration for servers • Consequences • Services transparently relocated onto surviving nodes on some clusters (RAC5) • Other clusters went down due to human error • RAC6 power for Ethernet switches was NOT on critical power • Faulty scripts of OEM agents prevented automatic start-up when the power was back • Reviewing procedures with CC operators in case of power loss

  21. Main issue during the power cut • Ethernet network switches in RAC6 were not connected to the critical power (wrong connection of the power bar) • The public and cluster interconnect networks went down

  22. Impact of the power cut on the DB Services

  23. Preparing for the LHC start-up

  24. Major DB Activities in June • Complete migration to 64-bit for all our services • Perform the upgrade of all our production RACs to Oracle 10.2.0.4 • Under validation since March 2008, no problems reported • Fixes some bugs relevant for our services • Most Tier1 sites are also planning this upgrade, coordinated via 3D meetings • On Streams: • ATLAS: request for adding a trigger configuration schema to the replication from ATLR to the 10 Tier1 sites • Propagation rules must be changed (a sketch of the call involved follows below) • New ATLAS database available at PIC • Joined to the Streams setup • Several sites running on separate Streams setups due to recent interventions will be merged in the next weeks
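
For context on what changing the propagation rules involves: in Oracle 10g Streams, schema-level propagation rules are added with the DBMS_STREAMS_ADM package, one call per destination. A hedged sketch issued through cx_Oracle (connection, queue, schema, and database-link names are all hypothetical):

```python
import cx_Oracle

# Hypothetical Streams administrator connection on the source database.
conn = cx_Oracle.connect("strmadmin", "secret", "atlr.example.cern.ch/atlr")
cur = conn.cursor()

# Forward DML on a (hypothetical) trigger-configuration schema from the
# local Streams queue to one Tier1 destination queue over a DB link.
cur.execute("""
    BEGIN
      DBMS_STREAMS_ADM.ADD_SCHEMA_PROPAGATION_RULES(
        schema_name            => 'ATLAS_TRIGGER_CONF',
        streams_name           => 'PROP_ATLR_TO_TIER1',
        source_queue_name      => 'strmadmin.streams_queue',
        destination_queue_name => 'strmadmin.streams_queue@tier1_db_link',
        include_dml            => TRUE,
        include_ddl            => FALSE,
        source_database        => 'ATLR.CERN.CH');
    END;
""")
```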

  25. Summary of DBs in CCRC’08 • Distributed database infrastructure is ready for accelerator turn-on • Smooth running during CCRC’08 • Experiments are ramping up to full use of the Tier1 infrastructure • Minor issues found, all being followed up • Oracle Data Guard for “critical DBs” at Tier0 during CCRC’08 worked well • Need a more defined plan if this becomes a request from experiments and WLCG • “DB dashboard” is a key tool for the application developers and DB resource coordinators • Well appreciated by our users • Would like to extend it to the Tier1 sites, picking up the recent developments from ATLAS • Reminder: 24x7 on “best effort”

  26. Conclusions • Recognizing the importance of DB services to the experiments’ activities, we have built up robust, scalable and flexible solutions • These solutions successfully address a wide range of use cases • Testing and validation – of hardware, DB versions and applications – have proven key to smooth production • Many years of close cooperation between application developers and database administrators have resulted in reliable, manageable services

  27. Many thanks to: CERN Database Service Teams, Experiments’ DBAs, Tier1 DBAs

  28. More Details • LCG 3D wiki • interventions, performance summaries • http://lcg3d.cern.ch • Physics Database Services at CERN wiki • https://twiki.cern.ch/twiki/bin/view/PSSGroup/PhysicsDatabasesSection • Support: phydb.support@cern.ch

  29. Backup slides

  30. CERN Power cut – consequences • Ethernet switches for the public network were down • DB servers were not reachable through TCP/IP • Ethernet switches for the cluster interconnect network were down • RAC6 clusters were reduced to a single node (due to cluster node eviction) • The WLCG DB cannot support the load on a single node • Virtual IP addresses (HA) did not fail over • Impact on new connections (see the client-side failover sketch below)
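
The VIP failover problem hit new connections in particular; the usual client-side mitigation is connect-time failover over an address list in the Oracle Net descriptor, so a client that cannot reach the first (virtual) address tries the next one instead of hanging. A minimal sketch (host names, service name, and credentials hypothetical):

```python
import cx_Oracle

# Oracle Net descriptor with FAILOVER=ON across two (hypothetical) VIPs:
# if rac6-vip1 does not answer -- e.g. a VIP that failed to relocate --
# the client retries rac6-vip2 before giving up.
dsn = """(DESCRIPTION=
  (ADDRESS_LIST=(FAILOVER=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac6-vip1.example.cern.ch)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac6-vip2.example.cern.ch)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=lcg_service)))"""

conn = cx_Oracle.connect("app_user", "secret", dsn)
```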

  31. CERN Power cut – other lessons learned • When power came up, most DB servers required manual DBA intervention to start up • This was traced to a faulty startup script provided with the latest OEM agent upgrade • Workaround established with Oracle • ASM instances on some CMSR nodes required manual startup • Lesson learned: the procedure for disk failure replacement had to be changed for the new kernel version • WLCG downtime when power was back • Human error caused cluster corruption when moving the LCG_SAME service to available nodes

  32. CERN Power cut – broken HW • The headnode for RAC6 storage died during the power cut • Impact: remote access and monitoring of the CMS, LCG and LHCb clusters was not available • Fixed only on Thursday, 5th June, thanks to FIO • Lesson learned: headnodes should also be connected to critical power • Two disks and a power supply died during the power cut • Not affecting availability • A higher than usual rate of failed power supplies was observed in the following days

  33. Why Oracle 10.2.0.4? • Oracle has recently delivered 10.2.0.4 • This includes numerous fixes: • Numerous library cache bugs • 6044413 – WRONG RESULTS WITH INDEX UNIQUE SCAN ON A 11GR1 DB (table pre-fetch) • Several Streams-related bugs • 6043052 – Leak in perm allocations with library cache comments, ORA-4031 generated • 4061534 – ORA-00600 [KRVTADC] IN CAPTURE PARALLEL PROCESS (using compressed tables) • 5205636 – Apply fails with ORA-26687 while dropping a table with referential constraints • All 10.2.0.3 patches (we had to apply a few)

  34. H/W Configuration Details • 1 disk array ≈ 6.4 TB raw space, of which ~25% is effective space (after mirroring and on-disk backups); see the arithmetic check below
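
The per-array figure is consistent with the hardware details on slide 7 (16 SATA disks of 400 GB per array); a quick check:

```python
# One array: 16 SATA disks of 400 GB each (slide 7).
raw_tb = 16 * 400 / 1000        # 6.4 TB raw per array
effective_tb = raw_tb * 0.25    # ~1.6 TB usable after mirroring + on-disk backups
print(raw_tb, effective_tb)     # 6.4 1.6
```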
