
Operation of the CERN Managed Storage environment; current status and future directions


Presentation Transcript


  1. Operation of the CERN Managed Storage environment; current status and future directions
     CHEP 2004 / Interlaken
     Data Services team: Vladimír Bahyl, Hugo Caçote, Charles Curran, Jan van Eldik, David Hughes, Gordon Lee, Tony Osborne, Tim Smith

  2. Managed Storage Dream
     • Free to open… instant access
     • Any time later… unbounded recall
     • Find the exact same coins: goods integrity

  3. Managed Storage Reality
     • Maintain + upgrade, innovate + technology refresh
     • Ageing equipment, escalating requirements
     • Dynamic store / active data management
     [Diagram: tape store and disk cache]

  4. CERN Managed Storage
     • CASTOR Grid Service (new service): SRM service and GridFTP servers; scalability, redundancy
     • CASTOR Service: a highly distributed system of stage servers and disk caches in front of the CASTOR servers
       • 42 stagers / disk caches, 370 disk servers, 6,700 spinning disks
     • Tape store: 70 tape servers, 35,000 tapes
     • Goals: reliability, uniformity, automation

  5. CASTOR Service
     • Running experiments
       • CDR for NA48, COMPASS, Ntof
       • Experiment peaks of 120 MB/s; combined average 10 TB/day
       • Sustained 10 MB/s per dedicated 9940B drive
       • Record 1.5 PB in 2004
       • Pseudo-online analysis
     • Experiments in the analysis phase: LEP and fixed target
     • LHC experiments in construction
       • Data production / analysis (Tier-0/1 operations)
       • Test beam CDR
     (a back-of-the-envelope check of these rates follows below)
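  As a sanity check on the quoted figures, the sketch below converts 10 TB/day into an average rate and estimates how many dedicated 9940B drives a 120 MB/s experiment peak would occupy. It assumes decimal units and is purely illustrative; it is not taken from the talk.

    # Rough rate arithmetic for the CDR figures quoted on the slide (illustrative only).
    TB = 1e12  # bytes, decimal terabyte
    MB = 1e6   # bytes

    daily_volume = 10 * TB                      # "combined average 10 TB/day"
    avg_rate = daily_volume / 86_400 / MB       # seconds per day -> MB/s
    print(f"10 TB/day is an average of {avg_rate:.0f} MB/s")   # ~116 MB/s

    peak_rate = 120        # MB/s, experiment peaks
    per_drive = 10         # MB/s sustained per dedicated 9940B drive
    drives_needed = -(-peak_rate // per_drive)  # ceiling division
    print(f"A {peak_rate} MB/s peak keeps at least {drives_needed} drives busy at {per_drive} MB/s each")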

  6. Quattor-ising
     • Motivation: scale (see G. Cancio's talk); uniformity, manageability, automation
     • Configuration description (into CDB): HW and SW; nodes and services
     • Reinstallation
       • Quiescing a server ≠ draining a client!
       • Gigabit card gymnastics; BIOS upgrades for PXE
     • Eliminate peculiarities from CASTOR nodes
       • Switch misconfigurations, firmware upgrades
       • ext2 -> ext3 (conversion sketch below)
     • Result: manageable servers
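  The ext2 -> ext3 step amounts to adding a journal to the existing filesystem. The sketch below shows that single operation via tune2fs; the device name is a placeholder and this is not the team's actual reinstallation tooling.

    # Minimal sketch of the ext2 -> ext3 conversion mentioned on the slide:
    # tune2fs -j adds a journal to an existing ext2 filesystem in place.
    import subprocess

    def convert_ext2_to_ext3(device: str) -> None:
        """Add a journal to an ext2 filesystem, turning it into ext3."""
        # Safest on an unmounted (or read-only) filesystem.
        subprocess.run(["tune2fs", "-j", device], check=True)
        # Remember to change the fstab entry from 'ext2' to 'ext3' afterwards.

    if __name__ == "__main__":
        convert_ext2_to_ext3("/dev/sdb1")   # hypothetical data partition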

  7. LEMON-ising
     • Lemon agent everywhere
       • Linux box monitoring and alarms
       • Automatic HW static checks
     • Adding CASTOR-server-specific checks
       • Service monitoring
       • HW monitoring: temperatures, voltages, fans, etc.; lm_sensors -> IPMI (see tape section)
       • Disk errors: SMART via smartmontools; auto checks, predictive monitoring
       • Tape drive errors: SMART
     • Result: uniformly monitored servers (see the health-check sketch below)
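  A minimal sketch of the kind of hardware health probe described above, wrapping smartctl (from smartmontools) and ipmitool. It is illustrative only, not the Lemon sensor code; device names and privilege assumptions are mine.

    # Illustrative SMART + IPMI health probe; assumes smartmontools and ipmitool
    # are installed and the script runs with sufficient privileges.
    import subprocess

    def smart_health(device: str) -> bool:
        """Return True if 'smartctl -H' reports the drive as healthy."""
        out = subprocess.run(["smartctl", "-H", device],
                             capture_output=True, text=True).stdout
        return "PASSED" in out or "OK" in out

    def ipmi_sensors() -> str:
        """Dump chassis sensor readings (temperatures, voltages, fans) via IPMI."""
        return subprocess.run(["ipmitool", "sensor"],
                              capture_output=True, text=True).stdout

    if __name__ == "__main__":
        for dev in ("/dev/sda", "/dev/sdb"):   # hypothetical device list
            print(dev, "healthy" if smart_health(dev) else "FAILING -> raise alarm")
        print(ipmi_sensors())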

  8. Warranties

  9. Disk Replacement
     Unacceptably high failure rate!
     • 10 months before the case was agreed: head instabilities
     • 4 weeks to execute
     • 1,224 disks exchanged (= 18%), plus the cages

  10. Disk Storage Developments
     • Disk configurations / file systems: HW RAID-1 + ext3 -> HW RAID-5 + SW RAID-0 + XFS (setup sketch below)
     • IPMI: HW health monitoring + remote access
       • Remote reset and power-on/off (independent of the OS)
       • Serial console redirection over LAN
     • LEAF: hardware and state management
     • Next generations (see H. Meinhard's talk): 360 TB SATA-in-a-box; 140 TB external SATA disk arrays
     • New CASTOR stager (see J.-D. Durand's talk)
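  A sketch of the "HW RAID-5 + SW RAID-0 + XFS" layout named above: stripe a software RAID-0 across two volumes exported by the hardware RAID-5 controller and format the result with XFS. Device names are hypothetical and this is not the team's provisioning script.

    # Software RAID-0 over hardware RAID-5 volumes, formatted with XFS.
    import subprocess

    HW_RAID5_VOLUMES = ["/dev/sda", "/dev/sdb"]   # volumes exported by the HW controller
    MD_DEVICE = "/dev/md0"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Stripe the two controller volumes together for throughput.
    run(["mdadm", "--create", MD_DEVICE, "--level=0",
         f"--raid-devices={len(HW_RAID5_VOLUMES)}", *HW_RAID5_VOLUMES])

    # XFS copes better than ext2/ext3 with large files and large filesystems.
    run(["mkfs.xfs", MD_DEVICE])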

  11. Tape Service
     • 70 tape servers (Linux)
     • (Mostly) single FibreChannel-attached drives
     • 2 symmetric robotic installations, 5 x STK 9310 silos in each
     [Table: drives and media split across backup, bulk physics, and fast access]

  12. Chasing Instabilities
     • Tape server temperatures?

  13. Media Migration
     • Technology generations: migrate data to avoid obsolescence and reliability issues in drives
       • 1986: 3480 / 3490
       • 1995: Redwood
       • 2001: 9940
     • Financial: capacity gain in sub-generations

  14. Media Migration
     • Replace 9940A drives with 9940B drives: capacity, performance, reliability
       • 9940A: 60 GB, 12 MB/s -> 9940B: 200 GB, 30 MB/s
     • Migrate A-format data to B format: 9 months, using 25% of the B drive resources
     • 1% of A tapes are unreadable on B drives (drive head tolerances), so keep some A drives
     (a rough migration-throughput estimate follows below)
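  The slide's own figures allow a rough estimate of the copying effort; the number of drive pairs below is an assumption for illustration, not a figure from the talk.

    # Rough A -> B migration estimate using the slide's figures (9940A: 60 GB at 12 MB/s).
    GB = 1e9
    MB = 1e6

    tape_capacity = 60 * GB            # full 9940A cartridge
    read_rate = 12 * MB                # copying is limited by the slower A-side read

    seconds_per_tape = tape_capacity / read_rate
    print(f"One full A cartridge takes ~{seconds_per_tape / 3600:.1f} h to read")  # ~1.4 h

    drive_pairs = 5                    # assumed A-read / B-write pairs working in parallel
    tapes_per_day = drive_pairs * 86_400 / seconds_per_tape
    print(f"{drive_pairs} drive pairs -> ~{tapes_per_day:.0f} tapes/day")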

  15. Tape Service Developments
     • Removing the tails…
       • Tracking of all tape errors (18 months)
       • Retiring of problematic media
       • Proactive retiring of heavily used media (>5,000 mounts); repack onto new media
     • Checksums: populated when writing to tape, verified when loading back to disk (sketch below)
     • Drive testing
       • Commodity: LTO-2; high end: IBM 3592 / STK-NG
     • New technology: SL8500 library / Indigo
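  The checksum idea in minimal form: record a checksum when a file goes to tape and verify it when the file is staged back to disk. The algorithm (Adler-32) and the in-memory catalogue are illustrative choices; the slide does not specify the implementation.

    # Write-time checksum population and read-back verification (illustrative).
    import zlib

    def file_checksum(path: str, chunk_size: int = 1 << 20) -> int:
        """Adler-32 over the file contents, read in 1 MiB chunks."""
        value = 1
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                value = zlib.adler32(chunk, value)
        return value

    catalogue: dict[str, int] = {}          # stand-in for the name-server metadata

    def on_write_to_tape(path: str) -> None:
        catalogue[path] = file_checksum(path)

    def on_stage_to_disk(path: str) -> None:
        if file_checksum(path) != catalogue[path]:
            raise IOError(f"checksum mismatch for {path}: tape copy is suspect")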

  16. CASTOR Central Servers
     • Combined Oracle DB and application-daemons node
     • Assorted helper applications distributed (historically) across ageing nodes
     • Front-end / back-end split
       • FE: load-balanced application servers
         • Eliminate interference with the DB
         • Load distribution, overload localisation
       • BE: (developing) clustered DB
         • Reliability, security

  17. GRID Data Management
     • GridFTP + SRM servers (formerly): standalone, experiment-dedicated; hard to intervene on, not scalable
     • New load-balanced, shared 6-node service: castorgrid.cern.ch (see the lookup sketch below)
       • DNS hacks for Globus reverse-lookup issues
       • SRM modifications to support operation behind the load balancer
     • GridFTP standalone client: retire ftp and bbftp access to CASTOR
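  A quick look at how a DNS load-balanced alias such as castorgrid.cern.ch spreads clients across the servers: the alias publishes several A records. Purely illustrative; the alias may not resolve outside CERN (or at all) today.

    # Resolve a load-balanced alias and list every address it publishes.
    import socket

    def resolve_all(alias: str) -> list[str]:
        """Return every IPv4 address published for the alias."""
        _name, _aliases, addresses = socket.gethostbyname_ex(alias)
        return addresses

    if __name__ == "__main__":
        try:
            for ip in resolve_all("castorgrid.cern.ch"):
                print(ip)
        except socket.gaierror:
            print("alias not resolvable from here")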

  18. Conclusions
     • Stabilising HW and SW
     • Automation
     • Monitoring and control
     • Reactive -> proactive data management
