RAL Site Report
Martin Bly
HEPiX @ SLAC – 11-13 October 2005
Overview
• Intro
• Hardware
• OS/Software
• Services
• Issues
RAL T1
• Rutherford Appleton Lab hosts the UK LCG Tier-1
• Funded via the GridPP project from PPARC
• Supports LCG and UK Particle Physics users
• VOs:
  • LCG: Atlas, CMS, LHCb, (Alice), dteam
  • Babar
  • CDF, D0, H1, Zeus
  • Bio, Pheno
• Expts:
  • Minos, Mice, SNO, UKQCD
• Theory users
• …
Tier 1 Hardware
• ~950 CPUs in the batch service
  • 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
  • 1.0GHz systems retiring as they fail; phase-out end Oct '05
• New procurement
  • Aiming for 1400+ SPECint2000 per CPU
  • Systems in for testing as part of the tender evaluation
  • First delivery early '06, second delivery April/May '06
• ~40 systems for services (FEs, RB, CE, LCG servers, loggers, etc.)
• 60+ disk servers
  • Mostly SCSI-attached IDE or SATA, ~220TB unformatted
  • New procurement: probably a PCI/SATA solution
• Tape robot
  • 6K slots, 1.2PB, 10 drives
Tape Robot / Data Store
• Current data: 300TB, PP -> 200+TB (110TB Babar)
• Castor 1 system trials
  • Many CERN-specifics
• HSM (Hierarchical Storage Manager)
  • 500TB, DMF (Data Management Facility)
  • SCSI/FC
  • Real file system; data migrates to tape after inactivity
  • Not for PP data
  • Due November '05
• Procurement for a new robot underway
  • 3PB, ~10 tape drives
  • Expect to order end Oct '05, delivery December '05
  • In service by March '06 (for SC4)
  • Castor system
Networking
• Tier-1 backbone at 4x1Gb/s
  • Upgrading some links to 10Gb/s
  • Multi-port 10Gb/s layer-2 switch stack as hub, when available
• 1Gb/s production link from the Tier-1 to the RAL site
• 1Gb/s link to SJ4 (internet), via a 1Gb/s hardware firewall
• Site backbone upgrade to 10Gb/s expected late '05 / early '06
  • Tier-1 to site at 10Gb/s – possible mid-2006
  • Site to SJ5 at 10Gb/s – mid '06
  • Site firewall remains an issue – limited to 4Gb/s
• 2x1Gb/s link to UKLight
  • Separate development network in the UK
  • Links to CERN at 2Gb/s, Lancaster at 1Gb/s (pending)
  • Managed ~90MB/s during SC2, less since
  • Problems with small packet loss causing traffic limitations
  • Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
  • UKLight link to CERN requested at 4Gb/s for early '06
  • Over-running hardware upgrade (4 days expanded to 7 weeks)
Tier1 Network Core – SC3
[Network diagram: core built from switch stacks 5510-1 and 5510-2 linked at 4x1Gb/s to routers 7i-1 and 7i-3; dCache pools and gridftp servers attach at N x 1Gb/s; Router A and the firewall carry the 1Gb/s link to SJ4; the UKLight Router carries 2x1Gb/s to CERN and 290Mb/s to Lancaster; ADS caches, the RAL site and non-SC hosts share the same core.]
OS/Software
• Main services:
  • Batch, FEs, CE, RB, … : SL3 (3.0.3, 3.0.4, 3.0.5)
  • LCG 2_6_0
  • Torque/MAUI, 1 job per CPU (see the sketch below)
• Disk servers: custom RH7.2 and RH7.3 builds
  • Project underway to move disk servers to SL4.n
• Some internal services on SL4 (loggers)
• Solaris disk servers decommissioned; most hardware sold
• AFS on AIX (Transarc); project to move to Linux (SL3/4)
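A minimal sketch of how the one-job-per-CPU policy might be expressed in a standard Torque nodes file; the host names and the 'lcgpro' node property are illustrative assumptions, not the actual RAL configuration:

    # Torque server_priv/nodes: np set to the number of physical CPUs
    # (HT off), so the scheduler never places more than one job per CPU.
    node001 np=2 lcgpro
    node002 np=2 lcgpro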
Services (1) – Objyserv
• Objyserv database service (Babar)
• Old service on a traditional NFS server
  • Custom NFS, heavily loaded; unable to cope with increased activity on the batch farm due to threading issues in the server
  • Adding another server of the same technology was not tenable
• New service:
  • Twin ams-based servers: 2 CPUs, HT on, 2GB RAM
  • SL3, RAID1 data disks
  • 4 servers per host system
  • Internal redirection using iptables to different server ports, depending on which of the 4 IP addresses was used to make the connection (see the sketch below)
  • Copes with some ease: 600+ clients
• Contact: Chris Brew
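A minimal sketch of this kind of iptables redirection, assuming four service IP addresses on one host, each mapped to a different ams server port; the addresses and port numbers are illustrative, not the real RAL values:

    # Redirect connections to a different local port per destination address.
    iptables -t nat -A PREROUTING -d 192.168.0.11 -p tcp --dport 1234 -j REDIRECT --to-ports 2001
    iptables -t nat -A PREROUTING -d 192.168.0.12 -p tcp --dport 1234 -j REDIRECT --to-ports 2002
    iptables -t nat -A PREROUTING -d 192.168.0.13 -p tcp --dport 1234 -j REDIRECT --to-ports 2003
    iptables -t nat -A PREROUTING -d 192.168.0.14 -p tcp --dport 1234 -j REDIRECT --to-ports 2004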
Services (2) – Home file system
• Home file system migration
• Old system:
  • ~85GB on an A1000 RAID array
  • Sun Ultra10, Solaris 2.6, 100Mb/s NIC
  • Failed to cope with some forms of pathological use
• New system:
  • ~270GB SCSI RAID5, 6-disk chassis
  • 2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
  • SL3, ext3
  • Stable under I/O and quota testing, and during backup
• Migration:
  • 3 weeks of planning
  • 1 week of nightly rsync followed by checksumming, to convince ourselves the rsync was correct (see the sketch below)
  • 1-day farm shutdown to migrate
  • A single file detected with a checksum error
  • Quotas for users unchanged
  • Old system kept on standby so its backups can still be restored
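A minimal sketch of the nightly sync-and-verify step, assuming rsync over ssh; the paths and the host name 'newserver' are illustrative:

    # Mirror the home area to the new server, then compare checksums of
    # both trees to confirm the copy is faithful before cutting over.
    rsync -aHx --delete /home/ newserver:/home/
    (cd /home && find . -type f -print0 | xargs -0 md5sum | sort -k 2) > /tmp/home-old.md5
    ssh newserver 'cd /home && find . -type f -print0 | xargs -0 md5sum | sort -k 2' > /tmp/home-new.md5
    diff /tmp/home-old.md5 /tmp/home-new.md5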
Services (3) – Batch Server
• Catastrophic disk failure late on a Saturday evening over a holiday weekend
  • Staff not expected back until 8:30am Wednesday; problem noticed Tuesday morning
  • Initial inspection: the disk was a total failure
  • No easy access to backups – the backup tape numbers were in logs on the failed disk!
  • No easy recovery solution with no other systems staff available
  • Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper, etc. – but no accounting data and no new jobs started
• Wednesday:
  • Hardware 'revised' with two disks, software RAID1, clean install of SL3
  • Backups located; batch/scheduling configs recovered from the tape store
  • System restarted with MAUI off to allow Torque to sort itself out (see the sketch below)
  • Queues came up closed; MAUI restarted; service picked up smoothly
• Lessons:
  • Know where the backups are and how to identify which tapes are the right ones
  • Unmodified batch workers are not good enough for system services
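A minimal sketch of that restart sequence on a Torque/MAUI server, assuming standard init scripts; the queue name 'prod' is an illustrative assumption:

    # Bring Torque up without the scheduler so it can recover job state,
    # check the queues, re-enable them, then restart MAUI.
    service maui stop
    service pbs_server start
    qstat -q                                   # queues come back closed
    qmgr -c "set queue prod enabled = true"
    qmgr -c "set queue prod started = true"
    service maui start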
Issues
• How to run resilient services on non-resilient hardware?
  • Committed to running 24x365 at 98%+ uptime
  • Modified batch workers (extra disks and HS caddies) used as servers
  • Investigating HA-Linux – batch server and scheduling experiments positive (see the sketch below)
  • Candidates: RB, CE, BDII, R-GMA, …
  • Databases
• Building services maintenance
  • Air-con, power
  • Already two substantial shutdowns planned in 2006
  • New building
• UKLight is a development project network
  • There have been problems managing expectations for production services on a development network
  • Unresolved packet loss in CERN-RAL transfers – under investigation
• 10Gb/s kit is expensive
  • The components we would like are not yet affordable/available
  • Pushing against the LCG turn-on date
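For reference, a minimal sketch of a two-node Heartbeat (HA-Linux) v1 configuration of the kind such experiments typically use; the node names, service address and managed services are illustrative assumptions, not the RAL setup:

    # /etc/ha.d/ha.cf -- two-node heartbeat pair (illustrative)
    keepalive 2
    deadtime 30
    bcast eth0
    auto_failback on
    node batch1
    node batch2

    # /etc/ha.d/haresources -- batch1 normally owns the service IP,
    # the Torque server and the MAUI scheduler
    batch1 IPaddr::192.168.0.50 pbs_server maui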