
CERN DB Services: Status, Activities, Announcements



  1. CERN DB Services: Status, Activities, Announcements Marcin Blaszczyk - IT-DB Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014

  2. Recap • Last workshop: 16th Nov 2010 – at that time • We were using 10.2.0.4 • We were installing new hardware to replace RAC3 & RAC4 • RAC8 in “Safehost” for standbys • RAC9 for integration DBs • 11.2 evaluation process • 10.2.0.5 upgrade under planning • Infrastructure for Physics DB Services • Quad-core machines with 16GB of RAM • FC infrastructure for storage (~2500 disks)

  3. Things have changed… • Service evolution • RAC8 in Safehost for standby installed • Performed in Q3 2010 • To assure geographical separation for DR • New standby installations - for each production DB • 10.2.0.5 upgrade • Performed in Q1 2011
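For illustration, the per-database standby installations mentioned above are typically built with RMAN active duplication; the sketch below is a minimal, simplified example (the exact procedure and parameters used at CERN are not stated in the slides – ADCR/ADCR_DG serve only as representative names):

    rman TARGET sys@ADCR AUXILIARY sys@ADCR_DG

    RMAN> -- clone the running primary over the network as a physical standby
    RMAN> DUPLICATE TARGET DATABASE FOR STANDBY
            FROM ACTIVE DATABASE
            DORECOVER
            SPFILE SET db_unique_name='ADCR_DG'
            NOFILENAMECHECK;

    -- afterwards, on the standby, start real-time redo apply:
    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
           USING CURRENT LOGFILE DISCONNECT FROM SESSION;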

  4. Oracle 11gR2 • SW upgrade + HW migration • Target version 11.2.0.3 • Performed in Q1 2012 • HW migration • New HW installations (RAC10 & RAC11) • 8 cores (16 threads) CPU, 48GB of memory • Move from ASM to NAS • Netapp NAS storage • Replication technology • Usage of streams replication - gradually reduced • Usage of Active Data Guard has grown

  5. Offloading with ADG • Offloading Backups to ADG • Significantly reduces load on primary • Removes sequential I/O of full backup • Offloading Queries to ADG • Transactional workload runs on primary • Read-only workload can be moved to ADG • Examples of workload on our ADGs: • Ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries • ORA-1555 (snapshot too old) • Sporadic occurrences • Oracle bug – to be confirmed if present in 11.2.0.4
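As an illustration of the backup offloading, the full backup can be taken against the Active Data Guard standby so its sequential I/O never touches the primary; a minimal RMAN sketch (assumes an RMAN recovery catalog so the backups remain usable for the primary – catalog and connection aliases are placeholders):

    rman TARGET sys@ADCR_ADG CATALOG rman_cat@RCATDB

    RMAN> -- full backup runs on the standby, not on the primary
    RMAN> BACKUP AS COMPRESSED BACKUPSET DATABASE PLUS ARCHIVELOG;

Read-only offloading needs no special syntax: reporting sessions simply connect to the ADG service, where SELECT database_role, open_mode FROM v$database returns PHYSICAL STANDBY / READ ONLY WITH APPLY.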

  6. New Architecture with ADG • Two deployment models, both using maximum performance mode with redo transport from the primary: • 1. Low load ADG: Primary Database shipping redo to a single Active Data Guard standby used both for users’ access and for disaster recovery • 2. Busy & critical ADG: Primary Database shipping redo to two Active Data Guard standbys – one for users’ access (offloading read-only workload), one for disaster recovery
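In either model, the health of the standby can be checked with standard Data Guard views; a minimal sketch (generic Oracle views, not CERN-specific monitoring):

    -- run on the Active Data Guard standby
    SELECT name, value
    FROM   v$dataguard_stats
    WHERE  name IN ('transport lag', 'apply lag');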

  7. IT-DB Service on 11gR2 • IT-DB service much more stable • Workload has been stabilized • High loads and node reboots eliminated • More powerful HW • Offloading to ADG helps a lot • 11g clusterware more stable • Storage model benefited from using NAS • single/multiple disk failure can’t affect DB service anymore • Faster and less vulnerable Streams replication

  8. Preparation for Run2 • Oracle SW • No single version is an ideal fit for the entire Run 2 • New software versions: • 11.2.0.4 vs 12.1.0.1 • New HW • 32 threads CPU, 128/256GB memory • New Storage NetApp model • More SSD cache • Consolidated storage

  9. Hardware upgrades in Q1 2014 • New servers and storage • Servers: more RAM, more CPU • 128GB of RAM (vs 48GB on current production machines) • Storage: more SSD cache • Newer NetApp model • Consolidated storage • Refresh cycle of OS and OS-related software • Puppet & RHEL 6 • Refresh cycle of our HW • New HW for production • Current production HW will be moved to standby

  10. Software upgrades in Q1 2014 • Available Oracle releases • 11.2.0.4 • 12.1.0.1 • Evolution – how to balance • Stable services • Latest releases for bug fixes • Newest releases for new features • Fit with LHC schedule

  11. DBAs & workload validation • DBAs - can do: • Test upgrades of integration and production databases • Share experience across user communities • Database CAPTURE and REPLAY with RAT testing • Capture workload from production and replay it in the upgraded DB • Useful to catch bugs and regressions • Unfortunately it cannot cover all edge cases
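A minimal sketch of the capture side of such a test, using the standard DBMS_WORKLOAD_CAPTURE package (capture name, directory object and duration below are placeholder values):

    -- on production: record the workload into a directory object
    BEGIN
      DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
        name     => 'run2_validation_capture',
        dir      => 'RAT_CAPTURE_DIR',
        duration => 3600);   -- seconds; NULL means run until FINISH_CAPTURE
    END;
    /

    -- stop capturing once a representative window has been recorded
    EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;

The captured files are then processed and replayed on the upgraded test database with the companion DBMS_WORKLOAD_REPLAY package.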

  12. Validation by the users • Validation by the application owners is very valuable to reduce risk • Functional tests • Tests with ‘real world’ data sizes • Tests with concurrent workload • The criticality depends • On the complexity of the application • On how well they can test their SQL

  13. Recent Changes: Q1-Q2 2014 • DB services for Experiments/WLCG • Target version 11.2.0.4 • Exceptions - target 12c • ATLARC • LHCBR • A few more IT-DB services • Interventions took 2-5 hours of DB downtime • Depending on system complexity: standby infrastructure, number of nodes etc.
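After such an intervention, a quick post-upgrade sanity check can be done with standard dictionary views; a minimal sketch:

    -- confirm the RDBMS version and that all components upgraded cleanly
    SELECT banner FROM v$version;

    SELECT comp_name, version, status
    FROM   dba_registry
    ORDER  BY comp_name;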

  14. Upgrade technique - overview • [Diagram: six-step upgrade of a primary RAC database protected by a Data Guard standby, with redo transport between the two] • Starting state: Clusterware 11g + RDBMS 11.2.0.3 • Intermediate state: Clusterware 12c + RDBMS 11.2.0.3 • Final state: Clusterware 12c + RDBMS 11.2.0.4 – upgrade complete • The diagram marks where RW access is retained and where DATABASE downtime occurs around the RDBMS upgrade step
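Where role changes between primary and standby are part of such a procedure, they are typically driven through the Data Guard Broker; a simplified sketch, not the literal sequence from the slide (ADCR/ADCR_DG are placeholder database names):

    dgmgrl sys@ADCR

    DGMGRL> SHOW CONFIGURATION;       -- verify both databases are healthy
    DGMGRL> SWITCHOVER TO 'ADCR_DG';  -- swap roles; the old primary becomes the standby
    DGMGRL> SHOW CONFIGURATION;       -- confirm the new roles after the switch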

  15. Phased approach to 12c • Some DBs already on 12.1 version • ATLARC, LHCBR • Smooth upgrade • No major issues discovered so far • Following Oracle SW evolution, depending on • Feedback on next 12c releases (12.2) • Testing status • Possibility to schedule upgrades • Next possible slot for upgrades to the 12c 1st patchset • Technical stop Q4 2014/Q1 2015? • Candidates: offline DBs (ATLR, CMSR, LCGR…)

  16. Monitoring & Security • Monitoring • RacMon • EM12c • Strmmon • Support level during LS1 • Best effort • Security • AuditMon • Firewall rules for external access • For ADCR in 2013 • For ATLR in 2014
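AuditMon and RacMon are in-house tools, but the kind of data they build on can be illustrated with the standard audit trail; a minimal sketch of listing recent failed logons (assumes classic auditing is enabled, as it is by default in 11g):

    -- recent failed logon attempts (ORA-01017 shows up as returncode 1017)
    SELECT os_username, username, userhost, timestamp, returncode
    FROM   dba_audit_session
    WHERE  returncode <> 0
    AND    timestamp > SYSDATE - 7
    ORDER  BY timestamp DESC;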

  17. IT-DB Operations Report ATLAS databases • Production DBs: 12 nodes*, ~69 TB of data • ATONR: 2 nodes, ~8 TB • ADCR: 4 nodes, ~19.5 TB • ATLR: 3 nodes, ~20.5 TB • ATLARC: 2 nodes, ~17 TB • *ATLAS DASHBOARD (1 node of WLCG database), ~4 TB • Standby DBs: 14 nodes, ~75 TB of data • ATONR_ADG: 2 nodes; ATONR_DG: 2 nodes • ADCR_ADG: 4 nodes; ADCR_DG: 3 nodes • ATLR_DG: 3 nodes • Integration DBs: 4 nodes, ~18 TB of data • INTR: 2 nodes, ~7.5 TB • INT8R: 2 nodes, ~9 TB • **ATLASINT: 2 nodes, ~2 TB (will be consolidated with INT8R) • Nearly 165 TB of space, 30 database servers • 12* databases (11 RAC clusters + 1 dedicated RAC node*)
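For reference, the size figures above are the kind of numbers that simple dictionary queries produce; a sketch (not the actual script used for this report):

    -- allocated datafile + tempfile space, in TB
    SELECT ROUND((SELECT SUM(bytes) FROM dba_data_files) / POWER(1024, 4)
               + (SELECT SUM(bytes) FROM dba_temp_files) / POWER(1024, 4), 1)
           AS allocated_tb
    FROM   dual;

    -- number of RAC instances currently running for this database
    SELECT COUNT(*) AS node_count FROM gv$instance;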

  18. Replication for ATLAS - current status

  19. Replication for ATLAS - plans • Replication changes overview • PVSS • Read-only replica: Active Data Guard • COOL • Online -> Offline: GoldenGate • Offline -> Tier1s: GoldenGate • MUON • Streams will be stopped when the new ATLAS solution for custom data movement is in place
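Before Streams is switched off for MUON, the remaining capture, propagation and apply processes can be checked with the standard dictionary views; a minimal sketch:

    -- remaining Streams components on the source / destination databases
    SELECT capture_name, status     FROM dba_capture;
    SELECT propagation_name, status FROM dba_propagation;
    SELECT apply_name, status       FROM dba_apply;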

  20. Conclusions • Focus on stability for DB services • Software evolution • Critical services have just moved to 11.2.0.4 • Long perspective: keep testing towards 12c • HW evolution • Technology evolution for replication • ADG & GG will fully replace Oracle Streams

  21. Acknowledgements • Work presented here on behalf of: • CERN Database Group

  22. Thank you! Marcin.Blaszczyk@cern.ch Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014
