Tier1 View: Resilience Status, plans, and best practice

Martin Bly, RAL Tier1 Fabric Manager
GridPP22, UCL, 2 April 2009

Overview
“How to make critical services at the T1 bullet proof”
Resilience at the Tier1 - Martin Bly - GridPP22

Resilience - Why?
• Services and system components fail
• <insert_expletive_of_your_choice> happens!
• You don’t want your services to be brought down by a failure
• MoU commitments are quite taxing to meet even without failures
• You can’t hide from auntie SAM…
• Better to deal with problems without pressure to restart services
  • Fewer mistakes
• Even better to avoid the problems in the first place
• So: design the service implementation so that it *will* survive failures of whatever nature

Approaches to resilience
• Hardware
  • Use hardware that can survive component failure
• Software
  • Use software that can survive problems on hardware
  • Use software designed for distributed operation
  • Use software that has inbuilt resilience
• Location
  • Locate hosts such that a service can survive failure at host location

Hardware
Resilient hardware will help your services survive common failure modes and keep them operating until you can replace the failed component and make the service resilient again.

Storage
• Most common is RAID as used in storage arrays
  • Single (RAID5) or double (RAID6) disk failures do not take out the storage array
  • Use of hot spares allows automatic rebuilds to maintain the resilience
• RAID1 for system disks in servers: in the event of a single disk failure the server carries on
  • RAID1 with a hot spare can be used for super-critical systems: automatic rebuild maintains the resilience
  • Works with software RAID as well as hardware RAID controllers
    • If you set the BIOS up for hot-swap capability…
• Failed disks can be replaced without taking the service down
  • If you have hot-swap caddies
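The trade-off between the RAID levels named on this slide can be sketched as simple arithmetic. This is a minimal illustration, not a description of any Tier1 array: the function name `raid_summary` is hypothetical, RAID1 is treated as an n-way mirror, and hot spares are not counted as array members.

```python
def raid_summary(level: str, n_disks: int, disk_tb: float) -> tuple[float, int]:
    """Return (usable capacity in TB, disk failures survived without data loss).

    Assumes identical disks and the textbook layouts only:
    RAID1 = n-way mirror, RAID5 = single parity, RAID6 = double parity.
    """
    if level == "RAID1":
        min_disks, data_disks, tolerated = 2, 1, n_disks - 1
    elif level == "RAID5":
        min_disks, data_disks, tolerated = 3, n_disks - 1, 1
    elif level == "RAID6":
        min_disks, data_disks, tolerated = 4, n_disks - 2, 2
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if n_disks < min_disks:
        raise ValueError(f"{level} needs at least {min_disks} disks")
    return data_disks * disk_tb, tolerated
```

For example, `raid_summary("RAID6", 16, 1.0)` describes a 16-disk array of 1 TB disks: two disks' worth of capacity goes to parity, and the array survives a double disk failure.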

Memory
• ECC helps systems to detect and correct single-bit errors in the RAM and to detect (with some schemes, also correct) multi-bit errors; this can help prevent data corruption
• If the ECC correction rate begins to rise, the RAM may be failing, or need reseating, or be subject to interference, or be slipping out of tolerance
• Higher-end kit can stop using ‘bad’ RAM, if not interrupting the service is considered worth the (high) cost
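Watching for a rising ECC correction rate can be sketched as follows. This is only an illustration: it assumes something else has already collected samples of a cumulative corrected-error counter, and the function names, the three-interval window and the "strictly increasing" rule are all assumptions rather than anything used at the Tier1.

```python
def correction_rates(samples):
    """Convert (timestamp_seconds, cumulative_corrected_errors) samples
    into corrected errors per hour for each interval between samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("samples must be in strictly increasing time order")
        # A counter that goes backwards (e.g. after a reboot) counts as zero.
        delta = max(c1 - c0, 0)
        rates.append(delta * 3600.0 / (t1 - t0))
    return rates


def rate_is_rising(samples, window=3):
    """True if the last `window` interval rates are strictly increasing."""
    rates = correction_rates(samples)
    if len(rates) < window:
        return False
    recent = rates[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A host for which `rate_is_rising` returns True would be a candidate for reseating or replacing RAM at the next convenient intervention, before errors become uncorrectable.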

Power Supply
• Redundant PSU configurations
  • N+1 redundancy: at least one more PSU in a server than is needed to make it work. If one fails, the server keeps running and the failed unit can be replaced without taking the server down
• Multiple power feeds
  • For an N+1 redundant PSU configuration, one can feed each PSU from a different PDU. If one PDU fails (and they do), or the fuse blows (and they definitely do!), the other PSU is still powered and the service can continue
• UPS for systems where loss of power is a problem
  • Bridges blips, brownouts and short interruptions; provides a smoothed feed and harmonic reduction
  • Permanent or time-limited: how much power must it provide and how long must it continue?
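Whether a PSU-to-PDU cabling plan actually delivers the resilience described above can be checked with a few lines. This is a minimal sketch; the function names and the PDU labels in the example are hypothetical.

```python
def is_n_plus_1(psus_installed: int, psus_needed: int) -> bool:
    """True if the server has at least one more PSU than it needs to run."""
    return psus_installed >= psus_needed + 1


def survives_any_pdu_failure(psu_feeds, psus_needed: int) -> bool:
    """psu_feeds lists, for each PSU, the PDU that powers it.

    Returns True if, whichever single PDU fails, enough PSUs stay powered.
    """
    for failed_pdu in set(psu_feeds):
        still_powered = sum(1 for pdu in psu_feeds if pdu != failed_pdu)
        if still_powered < psus_needed:
            return False
    return True
```

For a server that needs one PSU, `survives_any_pdu_failure(["pdu-a", "pdu-b"], 1)` holds, whereas plugging both PSUs into the same PDU gives N+1 PSUs but no protection against a PDU or fuse failure.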

Interconnects
• Networking
  • Two or more bonded network ports can provide resilience if the cables are routed to different switches or via different routes; this increases performance too
  • Bonded links in fibre installations can provide resilience against transceiver failure or fibre cuts
  • ‘Stacked’ switches with bi-directional stacking capability
    • If one cable fails, data goes the other way
    • If one unit fails, data can still reach the unit on the other side
  • Fail-over links in site infrastructure and national / international long-haul links: fibre cuts happen with depressing regularity
• Fibre-channel
  • Multi-port FC HBAs and array controllers can be set up to provide two independent routes from servers to storage devices, with multi-path and failover support keeping the data flowing
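The behaviour of a bi-directional switch stack under a cable or unit failure can be illustrated by modelling the stack as a ring and checking reachability. This is a toy model under stated assumptions: units are numbered 0..n-1, each is cabled to its two neighbours, and the helper names are hypothetical.

```python
from collections import deque


def ring_links(n_units: int):
    """Stacking cables for n_units switches cabled as a closed ring."""
    if n_units < 3:
        raise ValueError("a ring needs at least 3 units in this model")
    return {frozenset((i, (i + 1) % n_units)) for i in range(n_units)}


def reachable(links, src, dst, failed_units=frozenset()):
    """Breadth-first search over working cables, skipping failed units."""
    if src in failed_units or dst in failed_units:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        unit = queue.popleft()
        if unit == dst:
            return True
        for link in links:
            if unit in link:
                (neighbour,) = link - {unit}
                if neighbour not in seen and neighbour not in failed_units:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return False
```

Removing one cable from `ring_links(4)` or marking one unit as failed still leaves every surviving pair of units connected; a second failure on the other side of the ring is what splits the stack.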

Software
Software services should be designed to be resilient and to be provided by multiple hosts and at distributed locations.
“This is the Grid – it’s distributed. If the services aren’t distributable, <expletive> rewrite them.” – anon

Monitoring
• If it can be monitored…
  • Look for and restart failed service daemons
  • Look for signatures of impending problems to predict component failure
    • Idle disks hide their faults
    • Regular low-level verification runs to push sick drives over the edge
    • Replace early in failure cycle
      • So it doesn’t fail during a rebuild…
    • Increased error rates on network links from failing line cards, transceivers or cable/fibre degradation
      • If you have redundant links, you can replace the faulty one and keep the service going
• Call-out system for problems that impact services
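The "look for and restart failed service daemons" idea, combined with a call-out for problems a restart does not cure, might look like the sketch below. It is deliberately independent of any real init system or monitoring tool: `is_running` and `restart` are caller-supplied functions, and the restart limit is an assumed policy.

```python
def watchdog_pass(services, is_running, restart, restart_counts, max_restarts=3):
    """One monitoring pass over the named services.

    is_running(name) -> bool and restart(name) are supplied by the caller.
    restart_counts is a dict carried between passes; it counts consecutive
    restart attempts so a daemon that keeps dying is escalated, not looped.
    Returns (restarted, escalated) lists of service names.
    """
    restarted, escalated = [], []
    for name in services:
        if is_running(name):
            restart_counts[name] = 0  # healthy again: reset the counter
            continue
        if restart_counts.get(name, 0) >= max_restarts:
            escalated.append(name)  # hand over to the call-out system
            continue
        restart(name)
        restart_counts[name] = restart_counts.get(name, 0) + 1
        restarted.append(name)
    return restarted, escalated
```

Run periodically, this restarts a failed daemon a bounded number of times and then reports it for human attention, which keeps an auto-restarter from masking a persistent fault.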

Multiple hosts
• Services can be provided by more than one host if the application supports it
  • Share the load and increase performance
  • If one host fails, the rest provide the service
  • Use DNS round-robin to ‘randomly’ select a host using a service alias with short TTL
    • Take broken host/s out of active DNS
  • Avoid single-points-of-failure
• Can locate multiple hosts…
  • … in different rooms
  • … in different buildings
  • … at different sites
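The effect of a round-robin service alias, and of taking a broken host out of the active set, can be modelled in a few lines. This is a toy model of the selection behaviour only, not a DNS server; the class name, the host names and the default TTL value are hypothetical.

```python
class RoundRobinAlias:
    """Toy model of a service alias that hands out its hosts in rotation."""

    def __init__(self, alias, hosts, ttl_seconds=60):
        self.alias = alias
        self.hosts = list(hosts)
        self.ttl_seconds = ttl_seconds  # short TTL so removals take effect quickly
        self.disabled = set()
        self._next = 0

    def active_hosts(self):
        return [h for h in self.hosts if h not in self.disabled]

    def resolve(self):
        """Return the next active host; fail if none are left."""
        active = self.active_hosts()
        if not active:
            raise LookupError(f"no active hosts behind {self.alias}")
        host = active[self._next % len(active)]
        self._next += 1
        return host

    def take_out(self, host):
        self.disabled.add(host)

    def put_back(self, host):
        self.disabled.discard(host)
```

With three hosts behind an alias, successive `resolve()` calls spread clients across all three; after `take_out()` on a broken host, new lookups only ever return the remaining two, while clients holding a cached answer recover once the short TTL expires.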

Tier1
Resilience steps at the Tier1…

Hardware at the Tier1
• Most of the hardware techniques are used at the Tier1
  • Bulk storage uses RAID1/5/6, ECC RAM, N+1 PSUs, multiple power feeds, regular verifies of arrays (scrubbing)
  • Service nodes use RAID1, ECC RAM, some with N+1 PSUs
  • Databases: RAID1/10/5/6, ECC RAM, N+1 PSU, dual FC links, multiple power feeds
  • Networking: redundant off-site link to SJ5
    • Working on redundancy (failover/backup) for OPN link to CERN
  • UPS (in the new building)
    • 24/7 UPS for critical services / database racks
    • Short-lived UPS for storage systems to allow clean shutdown

CASTOR Service
[Diagram; recoverable labels only]
• srm; ns (Neptune); ORACLE RAC (Pluto); LSF licence; FC array
• Shared CASTOR core
• Single CASTOR instance (e.g. CMS): stager, LSF master, rmmaster
• In general (all for CMS): mirror disks on stager/LSF master and rmmaster

3D Services + LHCB LFC
[Diagram; recoverable labels only]
• 3D
• LHCB LFC: read-only replica, single host, fast kickstart, failover to CERN
• 3D ORACLE RAC; FC array

FTS and General LFC
[Diagram; recoverable labels only]
• FTS: 5 web front ends in DNS RR
• LFC: DNS RR; LFC currently a single host, second host planned for mid September
• 1 channel / VO agent host (RAID 1); hot spare soon (work in progress, running late)
• Oracle: currently 2 independent servers; work active to deploy a 3-server RAC
• RAID 10 SAN; Oracle RAC

CE and Fabric
[Diagram; recoverable labels only]
• NIS
• CE: 3 doublets, one for each of ATLAS, CMS and LHCB; each CE has mirror disks
• DN to account mapping: mirrored disks
• ce02, 03, 04, 05
• torque/maui
• /home file system (hardware RAID)

CE/SRM instances

WMS and LB
• Now:
  • lcgwms01: LHC
  • lcgwms02: everyone
  • lcgwms03: non-LHC
• Developments:
  • lcgwms01: LHC
  • lcgwms02: LHC
  • lcgwms03: non-LHC
• All WMS use both LB systems
• WMS triplet, LB doublet

Other Tier1 Services
• UK-BDII: DNS R-R triplet of simple hosts
  • Copes with load, provides resilience
  • Easy kickstart for rapid instancing
• RGMA registry: single host, RAID disks, easy kickstart
• MONbox: single host, RAID disks, easy kickstart
• VO boxes: several x single host, easy kickstart
• Site BDII: DNS R-R doublet of simple hosts (same as UK-BDII)
• PROXY: doublet of simple hosts, easy kickstart
• GOCDB: internal failover with alternative database (Oracle), and external failover to another web front-end in Germany and a mirrored database in Italy. The latter is still being tested.
• APEL: has a warm standby and is buying new hardware

Tier1 Monitoring
• Catch problems early with Nagios where possible (or at least catch problems before anyone notices)
  • Load alarms
  • File systems near to full
  • Certificates close to expiry
  • Failed drives
• Some Ganglia/Cacti capacity planning reviews (but ad hoc), looking for long-term trends
• Service Operations team making a difference
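Checks such as "file system near to full" and "certificate close to expiry" boil down to comparing a measurement against warning and critical thresholds. The sketch below returns status values following the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function names and all threshold values are assumptions, not the Tier1's actual settings.

```python
from datetime import datetime, timedelta

# Status values following the Nagios plugin exit-code convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def filesystem_status(used_bytes, total_bytes, warn_pct=85.0, crit_pct=95.0):
    """Classify a file system by percentage used (thresholds are assumed)."""
    if total_bytes <= 0:
        return UNKNOWN
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def certificate_status(not_after, now, warn_days=30, crit_days=7):
    """Classify a certificate by time left before its expiry date."""
    remaining = not_after - now
    if remaining <= timedelta(days=crit_days):
        return CRITICAL
    if remaining <= timedelta(days=warn_days):
        return WARNING
    return OK
```

A wrapper script would gather the real numbers (disk usage, the certificate's expiry date), print a one-line summary and exit with the returned status so the monitoring system can raise the alarm before users notice.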

Tier1 Backups
• Critical hosts all backed up to tape store
• Tape details written to central loggers
  • So we can find which tape numbers to restore if the host is toast
• Speedy restores to toasted systems
• Verify and exercise backups…
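Finding "which tape numbers to restore" from centrally logged tape details could be as simple as the lookup below. The record layout (host, ISO backup date, tape identifier), the function name and the example values are all hypothetical; the real logger format is not described in the slides.

```python
def tapes_for_restore(log_records, host):
    """Return the tape identifiers holding the most recent backup of `host`.

    log_records is an iterable of (host, backup_date, tape_id) tuples, with
    backup_date as an ISO 'YYYY-MM-DD' string so that string order is date order.
    """
    host_records = [record for record in log_records if record[0] == host]
    if not host_records:
        return []
    latest_date = max(date for _, date, _ in host_records)
    return sorted({tape for _, date, tape in host_records if date == latest_date})
```

Because the log lives on a central logger rather than on the backed-up host, this lookup still works when the host itself is toast.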

Tier1 On-call
• A good driver for service improvement
• Continuous improvement process with weekly review of night-time incidents
• Review is driver for:
  • Auto-restarters (team still not 100% keen)
  • Improved monitoring (more plugins)
  • Better response documentation
  • Changes to processes
• Also runs daytime
• Gradually, routine operations will become more and more the responsibility of the service intervention team
• CASTOR team carry out “weekly” detailed review of all incidents (looking to see how to avoid them again). Will generalise to whole Tier-1

Tier1 People
• Several teams with some degree of expertise sharing within each team
  • Fabric, Grid/Support, CASTOR, Databases
  • This has been pretty successful and we are reasonably confident we can handle tractable problems without the specialist present
• As far as is reasonably fair/practicable we seek to ensure leave is scheduled to ensure expert cover (not always possible)
• On-call is also spreading expertise in critical services (e.g., even the Facility Manager knows how to restart the CASTOR request handler!)
• Able to call upon RAL Tier-2 staff (or other GridPP/elsewhere) in case of complete lack of expertise. Have done this occasionally. Should probably be prepared to do it more often.

Off Site services
• A few critical services are candidates for off-site replication; others, such as BDIIs and LHCB LFC, are already federated
• Possible candidates: FTS and general LFC (possibly RGMA)
  • Both essential to GridPP
  • LFC based on Oracle Streaming technology already deployed and tested elsewhere (3D)
• RAL could operate these remotely, but the existing configuration is very expensive (£40K hardware) plus Oracle licences
• Failover to new DNS names would also need to be site resilient (not trivial)
• May be worth exploring with nearby sites or Daresbury

Questions
To Andrew, please…!
