
RAL Tier1 Operations

RAL Tier1 Operations. Andrew Sansum, 18th April 2012. Staffing. Staff changes since GridPP27: Leavers: Kier Hawker (Database Team Leader). New Starters: Orlin Alexandrov (Grid Team), Dimitrios (Fabric Team), Vasilij Savin (Fabric Team). New Roles: Ian Collier - “Grid Team” Leader


Presentation Transcript


  1. RAL Tier1 Operations Andrew Sansum 18th April 2012

  2. Staffing
     Staff changes since GridPP27:
     Leavers
     • Kier Hawker (Database Team Leader)
     New Starters
     • Orlin Alexandrov (Grid Team)
     • Dimitrios (Fabric Team)
     • Vasilij Savin (Fabric Team)
     New Roles
     • Ian Collier - “Grid Team” Leader
     • Richard Sinclair - Database Team Leader
     • James Adams - storage system development
     Tier-1 Status

  3. Some Changes
     • CVMFS in use for Atlas & LHCb:
       • The Atlas (NFS) software server used to give significant problems.
       • Some CVMFS teething issues, but overall much better!
     • Virtualisation:
       • Starting to bear fruit. Uses Hyper-V.
       • Numerous test systems.
       • Production systems that do not require particular resilience.
     • Quattor:
       • Large gains already made.

  4. Database Infrastructure
     We are making significant changes to the Oracle database infrastructure. Why?
     • Old servers are out of maintenance
     • Move from 32-bit to 64-bit databases
     • Performance improvements
     • Standby systems
     • Simplified architecture

  5. Database Disk Arrays - Future
     [Diagram. Labels: Oracle RAC Nodes; Fibrechannel SAN; Data Guard; Power Supplies (on UPS); Disk Arrays]

  6. Castor
     Changes since last GridPP Meeting:
     • Castor upgrade to 2.1.10 (March)
     • Castor version 2.1.10-1 (July), needed for the higher capacity "T10KC" tapes.
     • Updated Garbage Collection Algorithm to “LRU” rather than the default, which is based on size. (July)
     • (Moved ‘logrotate’ to 1pm rather than 4am.)
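The difference between the two garbage collection policies can be illustrated with a small sketch. This is not CASTOR code: the slide only says the default is "based on size", so the largest-first rule below is an assumption made for contrast, and `CachedFile`, `select_lru` and `select_largest_first` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    name: str
    size_bytes: int
    last_access: float  # seconds since epoch of the most recent read


def _take_until_freed(ordered_files, bytes_needed):
    """Walk the candidates in order, collecting files until enough space is freed."""
    chosen, freed = [], 0
    for f in ordered_files:
        if freed >= bytes_needed:
            break
        chosen.append(f)
        freed += f.size_bytes
    return chosen


def select_lru(files, bytes_needed):
    """LRU policy: evict the least recently accessed files first."""
    return _take_until_freed(sorted(files, key=lambda f: f.last_access), bytes_needed)


def select_largest_first(files, bytes_needed):
    """Assumed size-based policy: evict the largest files first."""
    return _take_until_freed(
        sorted(files, key=lambda f: f.size_bytes, reverse=True), bytes_needed
    )


files = [
    CachedFile("hot_big", 100, last_access=30.0),
    CachedFile("cold_small", 10, last_access=10.0),
    CachedFile("cold_medium", 50, last_access=20.0),
]
lru_victims = [f.name for f in select_lru(files, 50)]
size_victims = [f.name for f in select_largest_first(files, 50)]
```

In this toy case the size-based rule evicts the large file that was read most recently, while LRU keeps it and removes the two files that have gone longest without access.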

  7. Recent Developments (I)
     • Hardware
       • Procured and commissioned 2.6PB disk
       • Procured and commissioned 15k HS06 CPU
       • T10KC tape drives deployed and ATLAS data (1.5PB) migrated
       • New head nodes and core infrastructure storage capacity
       • Procured a new Tier-1 core network and new Site network
     • ORACLE Database Hardware upgrade and re-organisation
       • Rebuilding database SAN infrastructure
       • Increased CASTOR database resilience. Now have two copies of the CASTOR database, maintained in step by Oracle Data Guard.
       • Upgraded 3D service to ORACLE 11
     • Virtualisation infrastructure (Hyper-V) now approved for critical production systems (deployment starting).

  8. CASTOR (significant improvements in latency)
     • Upgraded to CASTOR 2.1.11-8 (major upgrade)
     • Head node replacement
     • EMI/UMD upgrades of Grid Middleware

  9. Castor Issues
     • Load-related issues on small/full service classes (e.g. AtlasScratchDisk; LHCbRawRDst)
       • Load can become concentrated on one or two disk servers.
       • Exacerbated by an uneven distribution of disk server sizes.
     • Solutions:
       • Add more capacity; clean-up.
       • Changes to tape migration policies.
       • Re-organization of service classes.

  10. Disk Server Outages by Cause (2011)

  11. Disk Drive Failure – Year 2011

  12. Double Disk Failures (2011)
      In the process of updating the firmware on the particular batch of disk controllers.

  13. Data Loss Incidents
      Summary of losses since GridPP26. Total of 12 incidents logged:
      • 1 - Due to a disk server failure (loss of 8 files for CMS)
      • 1 - Due to a bad tape (loss of 3 files for LHCb)
      • 1 - Files not in Castor Nameserver but no location (9 LHCb files)
      • 9 - Cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).
      Checksumming is in place for tape and disk files. Daily and random checks are made on disk files.
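A periodic check of a disk file against its recorded checksum could look roughly like the sketch below. The slide does not say which checksum algorithm is used; Adler-32 is assumed here, and `adler32_of_file` and `matches_recorded_checksum` are hypothetical helper names, not part of any CASTOR tool.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starts from 1, not 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def matches_recorded_checksum(path, recorded_hex):
    """Return True if the file's checksum equals the value recorded at write time."""
    return f"{adler32_of_file(path):08x}" == recorded_hex.lower()
```

Reading in chunks keeps memory use flat for large files. Files written before a checksum was recorded have nothing to compare against, which is consistent with most of the corrupt files on the slide being old.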

  14. T10KC Tapes In Production

      Type   Capacity   In Use   Total Capacity
      A      0.5TB      5570     2.2PB
      B      1TB        2170     1.9PB (CMS)
      C      5TB

  15. T10000C Issues
      • Failure of 6 out of 10 tapes.
      • Current A/B failure rate roughly 1 in 1000.
      • After writing part of a tape an error was reported.
      • Concerns are threefold:
        • A high rate of write errors causes disruption
        • If tapes could not be filled, our capacity would be reduced
        • We were not 100% confident that data would be secure
      • Updated firmware in drives.
      • 100 tapes now successfully written without problem.
      • In contact with Oracle.
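To see how far 6 failures in 10 tapes is from the A/B experience, one can ask how likely that would be if the new tapes failed independently at the same 1 in 1000 rate. Independence is an assumption made for this estimate only, and `prob_at_least` is a hypothetical helper.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial probability of at least k failures in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 6 or more failures in 10 tapes, if each tape failed with probability 1/1000.
p_six_of_ten = prob_at_least(6, 10, 1 / 1000)
print(f"{p_six_of_ten:.2e}")
```

The result is of the order of 1e-16, so chance at the A/B rate is not a plausible explanation. That points to a systematic cause, consistent with the failures stopping after the drive firmware update.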

  16. A couple of final comments
      Disk server issues are the main area of effort for hardware reliability / stability.
      ...but do not forget the network.
      Hardware that has performed reliably in the past may throw up a systematic problem.

  17. Formal Operations Processes
      [Diagram. Labels, in extraction order: WLCG DAILY ops Exception Review Requirements Production Scheduling Team Fault Review Change Review Exception Handling Management Meeting SIR Review Liaison Meeting]

  18. Service Exceptions 2011
      • Definitions
        • Service exception - High priority fault alert raising a pager call
        • Callout - Service exception raised outside formal working hours
      • Operations Team
        • Daytime - “Admin on Duty” (AoD). Holds pager, handles service exceptions, passes on to daytime teams.
        • “Nighttime” - Primary On-call (like AoD). Holds pager, fixes easy problems, operationally “in charge”. Second line On-call (one per team) guarantees response. Some (not guaranteed) third line support or escalation in serious incidents.
      • Exceptions Count in 2011
        • 461 Service exceptions
        • 265 callouts
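The two counts above imply a rough out-of-hours share and weekly load, worked through in the short sketch below. Averaging over a 52-week year is an assumption; the slide gives only annual totals.

```python
SERVICE_EXCEPTIONS_2011 = 461  # all high priority alerts that raised a pager call
CALLOUTS_2011 = 265            # the subset raised outside formal working hours
WEEKS_PER_YEAR = 52            # assumption: totals averaged over a full year

callout_fraction = CALLOUTS_2011 / SERVICE_EXCEPTIONS_2011
exceptions_per_week = SERVICE_EXCEPTIONS_2011 / WEEKS_PER_YEAR
callouts_per_week = CALLOUTS_2011 / WEEKS_PER_YEAR

print(f"out-of-hours share: {callout_fraction:.1%}")
print(f"exceptions per week: {exceptions_per_week:.1f}")
print(f"callouts per week: {callouts_per_week:.1f}")
```

More than half of the exceptions (about 57%) arrived outside working hours, which averages to roughly five callouts a week for the on-call staff.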

  19. Exceptions by Type by Week

  20. Exceptions by Service

  21. Plans for Future
      • ORACLE 11 upgrade for CASTOR/LFC/FTS needed by July
      • CASTOR
        • Switch on transfer manager (reduce transfer startup latency)
        • Upgrade to 2.1.11-9 (needed before Oracle 11 upgrade)
        • Upgrade to 2.1.12
      • Network (move Tier-1 backbone to 40Gb/s)
        • Site “front of house” network upgrade “early summer”
        • Tier-1 new routing and spine layer .. DRI ….
