
Database post-mortem


Presentation Transcript


  1. Database post-mortem Maria Girone, IT WLCG CCRC’08 Post-Mortem Workshop

  2. Agenda of the DB post-mortem • Database Services for Physics (Maria Girone, 20 mins) • ATLAS DB Metrics (Sasha Vaniachine, 10 mins) • Castor and SRM (Giuseppe Lo Presti, 15 mins)

  3. Database Service for Physics • Upgrade to new hardware, standby DBs • Experience in CCRC’08 • DB resource usage • DB availability and performance monitoring • “DB Dashboard” • Power cut • Preparing for the LHC start-up

  4. Introduction • A wide range of mission-critical applications that are part of LCG rely on databases • These services are consistently rated by the LHC VOs as amongst the most critical • Our goal is to set up and operate a Distributed Database Service for the WLCG • Oracle Real Application Clusters (RAC) as building-block architecture at Tier0 and Tier1 sites • CERN uses Oracle Data Guard for major hardware/software upgrades • Oracle Streams replication between online and offline (ATLAS, CMS, LHCb) and from Tier0 to Tier1 sites (ATLAS and LHCb) • CMS uses Frontier/SQUID from Fermilab for distributing and caching database data via a web protocol (see the sketch below)
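
The Frontier/SQUID idea is worth a one-line illustration: read-only database lookups are expressed as HTTP requests, so ordinary web proxies (SQUID) can cache the answers and repeated queries never reach the Oracle backend. A minimal conceptual sketch in Python follows; the URL and query encoding are hypothetical stand-ins, not Frontier's actual wire format:

```python
import requests  # generic HTTP client

# Hypothetical caching front-end for read-only DB queries. Identical GET
# URLs are cache hits at the SQUID layer, so only the first lookup of a
# given (table, key) pair ever touches the Oracle backend.
FRONTIER_URL = "http://frontier.example.cern.ch:8000/query"  # hypothetical

def cached_lookup(table: str, key: str) -> str:
    resp = requests.get(FRONTIER_URL, params={"table": table, "key": key})
    resp.raise_for_status()  # surface HTTP errors instead of bad data
    return resp.text
```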

  5. DB Services for Physics at CERN • ~25 RAC databases (up to 6 nodes) • 125 servers, 150 disk arrays (2000 disks) • Or: 450 CPU cores, 900 GB of RAM, 550 TB of raw disk space(!) • Connected via the 3D project to 10 Tier-1 sites for synchronized databases • Sharing policies and procedures • Team of 5 DBAs + service coordinator and link to experiments • 24x7 (still) on “best effort” service for production databases • Maintenance without downtime within RAC features • 0.04% service unavailability (all 2008) = 3.5 hours/year • 0.22% server unavailability (all 2008) = 19 hours/year (arithmetic checked below) • Patch deployment, broken hardware
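
The unavailability figures convert to hours/year as a simple percentage of the 8760 hours in a year; a quick sanity check of the slide's numbers:

```python
# Convert the quoted unavailability percentages into hours per year.
HOURS_PER_YEAR = 365 * 24  # 8760

for label, pct in [("services", 0.04), ("servers", 0.22)]:
    hours = HOURS_PER_YEAR * pct / 100
    print(f"{label}: {pct}% unavailability = {hours:.1f} hours/year")

# services: 0.04% unavailability = 3.5 hours/year
# servers: 0.22% unavailability = 19.3 hours/year
```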

  6. Upgrade to new hardware (April 2008)

  7. Hardware details • Major migration in April in view of the LHC start-up (RAC5 and RAC6) • Dual-CPU quad-core DELL 2950 servers, 16 GB memory, Intel 5400-series “Harpertown”, 2.33 GHz clock • 60 disk arrays, each with 16 SATA disks of 400 GB • Dual-ported FC controllers • Due to some initial instabilities with the new hardware, we operated CCRC’08 in safe conditions, using the “old” hardware as Data Guard stand-by DBs • No changes needed in service procedures, but additional protection against failures • Potentially this is a model for future upgrades • No further hardware failures observed • RAC5 and RAC6 have a “mixed power configuration” • Storage, FC infrastructure, Ethernet switches and half of the servers are connected to the critical power

  8. Database set-up [diagram: layout of clusters RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]

  9. Database Resource Usage in 2008 • No significant increase in resource usage on databases • Clear improvement seen with the change of hardware

  10. DB availability and performance monitoring

  11. “DB Dashboard” • All-in-one page • Offline and online DBs + Streams • SLS availability • Streams layout + availability • Usage per instance • Usage per application • Users + applications • Logbook

  12. DB availability and performance page

  13. Streams Monitoring

  14. ATLAS – DB • Quad-core upgrade for CCRC’08 • More cache and CPU power • Critical DQ2 application showed a measurable performance improvement, up to 4x for some queries • Most applications moved to one preferred node: better cache utilization and less cluster interconnect traffic • PVSS • Production accounts ran smoothly during CCRC’08 • The upgrade to PVSS DB schema version 7.0 beta has been tested: minor problems found, all fixed • We expect the final patch to come out in a few weeks; experiments will decide if they want the upgrade to be done

  15. WLCG – DB • Locked owner accounts to reduce the probability of human errors • SAM/Gridview – biggest resource consumer • Sharing several tables • Schema review being addressed with the application developers • VOMS • Needs to follow some DB good practices • Use of bind variables • Use of sequences (see the sketch below)
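
The two good practices flagged for VOMS are worth a concrete illustration. A minimal sketch using the cx_Oracle client (credentials, DSN, and all table/sequence names are hypothetical): bind variables keep the statement text constant so Oracle can reuse one parsed cursor, and sequences generate unique IDs server-side without application locking.

```python
import cx_Oracle  # Python driver for Oracle databases

# Hypothetical connection details, for illustration only.
conn = cx_Oracle.connect("voms_user", "secret", "wlcg-db.example.cern.ch/voms")
cur = conn.cursor()

# Bind variable: the SQL text never changes, so Oracle hard-parses it once
# and reuses the cursor. Interpolating the literal into the string instead
# would force a fresh parse (and shared-pool entry) per distinct value.
cur.execute("SELECT role FROM vo_members WHERE dn = :dn",
            dn="/DC=ch/CN=example user")

# Sequence: a server-side counter hands out unique IDs safely under
# concurrency, avoiding the SELECT MAX(id)+1 anti-pattern.
cur.execute("INSERT INTO vo_members (id, dn) VALUES (member_seq.NEXTVAL, :dn)",
            dn="/DC=ch/CN=example user")
conn.commit()
```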

  16. CMS – DB • Quad-core upgrade for CCRC’08 • More cache and CPU power • Critical PhEDEx application showed a measurable performance improvement • Most applications moved to one preferred node: better cache utilization and less cluster interconnect traffic • CMS Dashboard • Significant increase in resource consumption (CPU and IOPS), frequent “high load” alarms • Causes identified (and reported to developers) • Better application change management • Automatic optimization of database jobs • Planned a schema review with developers • Smooth running of all other applications

  17. Streams Issues • ATLAS online capture process aborted on Tuesday 20/05 at 02:30 • LogMiner failed due to a memory problem on the database • Restarted on 20/05 at 09:20 (20 mins to synchronize) • Separate capture processes to synchronize the new databases at PIC aborted on Monday 19/05 at 19:40 • “_SGA_SIZE” parameter missing during the setup • A long transaction caused the LogMiner component to run out of memory • Restarted on Tuesday 20/05 at 09:15 (30 mins to synchronize) • No major impact on experiments’ production • Working-hour operation with central expertise seems an appropriate Service Level (a monitoring sketch follows below)
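
Both incidents were Streams capture-process aborts, which show up in the DBA_CAPTURE dictionary view together with the underlying ORA- error. A minimal detection sketch along those lines (connection details hypothetical; DBA_CAPTURE and its STATUS/ERROR_MESSAGE columns are standard Oracle 10g):

```python
import cx_Oracle

# Hypothetical Streams administrator connection.
conn = cx_Oracle.connect("strmadmin", "secret", "atlr.example.cern.ch/atlr")
cur = conn.cursor()

# A capture process that dies (e.g. the LogMiner memory errors above)
# moves to STATUS = 'ABORTED'; ERROR_MESSAGE carries the ORA- text.
cur.execute("""
    SELECT capture_name, status, error_message
      FROM dba_capture
     WHERE status <> 'ENABLED'
""")
for name, status, message in cur:
    print(f"ALERT: capture process {name} is {status}: {message}")
```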

  18. Tier1 sites DB status • ASGC • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • BNL • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • CNAF • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • GridKa • No problems during CCRC’08, and improved performance on LFC (increased # of threads from 20 to 60) • IN2P3 • No problems during CCRC’08 • NDGF • No problems during CCRC’08 • PIC • No problems during CCRC’08; upgraded ATLAS to a 3-node RAC (quad-core, 8 GB RAM, 4 TB storage) and LHCb to a 2-node RAC (quad-core, 8 GB RAM, 2 TB storage) • RAL • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • SARA • No problems during CCRC’08 • Planning upgrade to 10.2.0.4 • TRIUMF • Increased RAM from 4 GB/node to 10 GB/node and CPU from 1 dual-core (1.6 GHz) to 2 dual-core (3.0 GHz) per node on the ATLAS 3D RAC

  19. CC Power Cut 30.05.08

  20. CERN Power cut – Friday 30-5 • At 6:30 a power cut caused the “physics power” to go down • Devices connected to the “critical power” were not affected (RAC2 and RAC3) • RAC5 and RAC6 clusters have a “mixed power” configuration for servers • Consequences • Services transparently relocated onto surviving nodes on some clusters (RAC5) • Other clusters went down due to human error • RAC6 power for Ethernet switches was NOT on critical power • Faulty scripts of OEM agents prevented automatic start-up when the power was back • Reviewing procedures with CC operators in case of power loss

  21. Main issue during the power cut • Ethernet network switches in RAC6 were not connected to the critical power (wrong connection of the power bar) • The public and cluster interconnect networks went down

  22. Impact of the power cut on the DB Services

  23. Preparing for the LHC start-up

  24. Major DB Activities in June • Complete migration to 64-bit for all our services • Perform the upgrade of all our production RACs to Oracle 10.2.0.4 • Under validation since March 2008, no problems reported • Fixes some bugs relevant for our services • Most Tier1 sites are also planning this upgrade, coordinated via 3D meetings • On Streams: • ATLAS: request for adding a trigger configuration schema to the replication from ATLR to the 10 Tier1 sites • Propagation rules must be changed (a sketch of the call involved follows below) • New ATLAS database available at PIC • Joined to the Streams setup • Several sites running on separate Streams setups due to recent interventions will be merged in the next weeks
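
For context on what changing the propagation rules involves: in Oracle 10g Streams, schema-level propagation rules are added with the DBMS_STREAMS_ADM package, one call per destination. A hedged sketch issued through cx_Oracle (connection, queue, schema, and database-link names are all hypothetical):

```python
import cx_Oracle

# Hypothetical Streams administrator connection on the source database.
conn = cx_Oracle.connect("strmadmin", "secret", "atlr.example.cern.ch/atlr")
cur = conn.cursor()

# Forward DML on a (hypothetical) trigger-configuration schema from the
# local Streams queue to one Tier1 destination queue over a DB link.
cur.execute("""
    BEGIN
      DBMS_STREAMS_ADM.ADD_SCHEMA_PROPAGATION_RULES(
        schema_name            => 'ATLAS_TRIGGER_CONF',
        streams_name           => 'PROP_ATLR_TO_TIER1',
        source_queue_name      => 'strmadmin.streams_queue',
        destination_queue_name => 'strmadmin.streams_queue@tier1_db_link',
        include_dml            => TRUE,
        include_ddl            => FALSE,
        source_database        => 'ATLR.CERN.CH');
    END;
""")
```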

  25. Summary of DBs in CCRC’08 • Distributed database infrastructure is ready for accelerator turn-on • Smooth running during CCRC’08 • Experiments are ramping up to full use of the Tier1 infrastructure • Minor issues found, all being followed up • Oracle Data Guard for “critical DBs” at Tier0 during CCRC’08 worked well • Need a more defined plan if this becomes a request from experiments and WLCG • “DB dashboard” is a key tool for the application developers and DB resource coordinators • Well appreciated by our users • Would like to extend it to the Tier1 sites, picking up the recent developments from ATLAS • Reminder: 24x7 on “best effort”

  26. Conclusions • Recognizing the importance of DB services to the experiments’ activities, we have built up robust, scalable and flexible solutions • These solutions successfully address a wide range of use cases • Testing and validation – of hardware, DB versions and applications – have proven key to smooth production • Many years of close cooperation between application developers and database administrators have resulted in reliable, manageable services

  27. Many thanks to: CERN Database Service Teams, Experiments’ DBAs, Tier1 DBAs

  28. More Details • LCG 3D wiki • interventions, performance summaries • http://lcg3d.cern.ch • Physics Database Services at CERN wiki • https://twiki.cern.ch/twiki/bin/view/PSSGroup/PhysicsDatabasesSection • Support: phydb.support@cern.ch

  29. Backup slides

  30. CERN Power cut – consequences • Ethernet switches for the public network were down • DB servers were not reachable through TCP/IP • Ethernet switches for the cluster interconnect network were down • RAC6 clusters were reduced to a single node (due to cluster node eviction) • The WLCG DB cannot support the load on a single node • Virtual IP addresses (HA) did not fail over • Impact on new connections (see the client-side failover sketch below)
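
The VIP failover problem hit new connections in particular; the usual client-side mitigation is connect-time failover over an address list in the Oracle Net descriptor, so a client that cannot reach the first (virtual) address tries the next one instead of hanging. A minimal sketch (host names, service name, and credentials hypothetical):

```python
import cx_Oracle

# Oracle Net descriptor with FAILOVER=ON across two (hypothetical) VIPs:
# if rac6-vip1 does not answer -- e.g. a VIP that failed to relocate --
# the client retries rac6-vip2 before giving up.
dsn = """(DESCRIPTION=
  (ADDRESS_LIST=(FAILOVER=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac6-vip1.example.cern.ch)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac6-vip2.example.cern.ch)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=lcg_service)))"""

conn = cx_Oracle.connect("app_user", "secret", dsn)
```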

  31. CERN Power cut – other lessons learned • When power came up, most DB servers required manual DBA intervention to start up • This was traced to a faulty startup script provided with the latest OEM agent upgrade • Workaround established with Oracle • ASM instances on some CMSR nodes required manual startup • Lesson learned: the procedure for disk failure replacement had to be changed for the new kernel version • WLCG downtime when power was back • Human error caused cluster corruption when moving the LCG_SAME service to available nodes

  32. CERN Power cut – broken HW • The headnode for RAC6 storage died during the power cut • Impact: remote access and monitoring of the CMS, LCG and LHCb clusters was not available • Fixed only on Thursday, 5th June, thanks to FIO • Lesson learned: headnodes should also be connected to critical power • Two disks and a power supply died during the power cut • Not affecting availability • A higher than usual rate of failed power supplies was observed in the following days

  33. Why Oracle 10.2.0.4? • Oracle has recently delivered 10.2.0.4 • This includes numerous fixes: • Numerous library cache bugs • 6044413 – WRONG RESULTS WITH INDEX UNIQUE SCAN ON A 11GR1 DB (table pre-fetch) • Several Streams-related bugs • 6043052 – Leak in perm allocations with library cache comments, ORA-4031 generated • 4061534 – ORA-00600 [KRVTADC] IN CAPTURE PARALLEL PROCESS (using compressed tables) • 5205636 – Apply fails with ORA-26687 while dropping a table with referential constraints • All 10.2.0.3 patches (we had to apply a few)

  34. H/W Configuration Details • 1 disk array ≈ 6.4 TB raw space, of which ~25% is effective space (after mirroring and on-disk backups); see the arithmetic check below
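
The per-array figure is consistent with the hardware details on slide 7 (16 SATA disks of 400 GB per array); a quick check:

```python
# One array: 16 SATA disks of 400 GB each (slide 7).
raw_tb = 16 * 400 / 1000        # 6.4 TB raw per array
effective_tb = raw_tb * 0.25    # ~1.6 TB usable after mirroring + on-disk backups
print(raw_tb, effective_tb)     # 6.4 1.6
```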
