Tier1 Site Report

This site report provides an overview of the hardware, services, monitoring, and networking at the Tier1 site at RAL. It also covers the change in the UK science funding structure, funding and staffing updates, the new computing building and tape silo, capacity hardware, storage commissioning, operating systems, and the services run at the site.

Presentation Transcript


  1. Tier1 Site Report HEPSysMan, RAL May 2007 Martin Bly

  2. Overview
  • RAL / Tier-1
  • Hardware
  • Services
  • Monitoring
  • Networking

  3. RAL / Tier-1
  • Change in UK science funding structure:
    • CCLRC and PPARC have been merged to form a new Research Council: the Science and Technology Facilities Council (STFC)
    • Combined remit covering large facilities, grants, etc.
    • RAL is one of several STFC institutes
    • Some internal restructuring and name changes in Business Units
    • New corporate styles, etc.
  • RAL hosts the UK WLCG Tier-1
    • Funded via the GridPP2 project by STFC
    • Supports WLCG and UK Particle Physics users and collaborators
    • atlas, cms, lhcb, alice, dteam, ops, babar, cdf, d0, h1, zeus, bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, mice, sno, ukqcd, harp, theory users …
  • Expect no change operationally as a result of STFC 'ownership'

  4. Finance & Staff
  • GridPP3 project funding approved
    • "From Production to Exploitation"
    • Provides for the UK Tier-1, Tier-2s and some software activity
    • April 2008 to March 2011
  • Tier1:
    • Increase in staff: 17 FTE (+3.4 FTE from ESC)
    • Hardware resources for WLCG: ~£7.2M
    • Tight funding settlement, contingencies for HW and power
  • Additional Tier1 staff now in post
    • 2 x systems administrators: James Thorne, Lex Holt
    • 1 x hardware technician: James Adams
    • 1 x PPS admin: Marian Klein

  5. New Computing Building
  • Funding for a new computer centre building
    • Funded by RAL/STFC as part of site infrastructure
    • Shared with HPC and other STFC computing facilities
  • Design complete: ~300 racks + 3-4 tape silos
  • Planning permission granted
  • Tender running for construction and fitting out
  • Construction starts in July; planned to be ready for occupation by mid-August 2008

  6. Tape Silo
  • Sun SL8500 tape silo
    • Expanded from 6,000 to 10,000 slots
    • 8 robot trucks
    • 18 x T10K and 10 x 9940B drives
    • 8 x T10K tape drives for CASTOR
  • Second silo planned this FY
    • SL8500, 6,000 slots
    • Tape passing between silos may be possible
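
For scale, the slot count translates into a rough native capacity. A minimal back-of-envelope sketch follows; the cartridge sizes (~500 GB native for T10K media, ~200 GB for 9940B media) are period-typical figures assumed here, not stated on the slide.

```python
# Rough native capacity of a 10,000-slot silo, for an assumed mix of media.
SLOTS = 10_000

def silo_capacity_tb(t10k_fraction=1.0, t10k_gb=500, b9940_gb=200):
    """Approximate native capacity in TB for a given fraction of T10K cartridges."""
    t10k_slots = SLOTS * t10k_fraction
    b9940_slots = SLOTS - t10k_slots
    return (t10k_slots * t10k_gb + b9940_slots * b9940_gb) / 1000

print(f"{silo_capacity_tb():.0f} TB")      # all T10K media: ~5000 TB native
print(f"{silo_capacity_tb(0.5):.0f} TB")   # a 50/50 media mix: ~3500 TB native
```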

  7. Capacity Hardware FY06/07
  • CPU
    • 64 x 1U twin dual-core Woodcrest 5130 units: ~550 kSI2K
    • 4GB RAM, 250GB data HDD, dual 1Gb/s NICs
    • Commissioned January 07
    • Total capacity ~1550 kSI2K, ~1275 job slots
  • Disk
    • 86 x 3U 16-bay servers: 516 TB (10^12 bytes) data capacity
    • 3Ware 9550SX, 14 x 500GB data drives, 2 x 250GB system drives
    • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb/s NICs
    • Commissioned March 07, going into production service as required
  • Total disk storage ~900TB
    • ~40TB being phased out at end of life (~5 years)
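
The headline figures can be cross-checked with simple arithmetic. The RAID layout in this sketch is an assumption (RAID6 across the 14 data drives), chosen only because it reproduces the quoted 516 TB; the slide itself does not state the layout.

```python
# Cross-check of the quoted capacity figures for the FY06/07 purchase.
cpu_nodes, cores_per_node = 64, 4            # twin dual-core Woodcrest 5130
print(cpu_nodes * cores_per_node)            # 256 new job slots (of ~1275 total)

disk_servers, data_drives, drive_tb = 86, 14, 0.5
raid6_parity = 2                             # assumed: two parity drives per RAID6 set
usable_tb = disk_servers * (data_drives - raid6_parity) * drive_tb
print(usable_tb)                             # 516.0 TB, matching the slide
```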

  8. Storage Commissioning
  • Problems with recent storage systems now solved!
  • Issue: WD5000YS (500GB) drives showed 'random throws' (ejects) from RAID units
    • No host logs of problems; testing the drive offline showed no drive issues
    • Common to two completely different hardware configurations
  • Problem isolated:
    • Non-return loop in the drive firmware
    • The drive head needs to move occasionally to avoid ploughing a furrow in the platter lubricant; due to timeout issues in some circumstances the drive would sit there stuck, communication with the controller would time out, and the drive would be ejected
    • Yanking the drive resets the electronics and no problem is evident (or logged)
  • WD patched the firmware once the problem was isolated
  • Subsequent reports of the same or similar problem from non-HEP sites
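
The isolation argument on this slide (the same drive model misbehaving in two unrelated hardware configurations points at the drive rather than the controllers) can be sketched as a simple tally. The record format, chassis and controller names below are entirely hypothetical.

```python
# Illustrative only: invented eject records from two different server generations.
from collections import Counter

eject_events = [
    {"chassis": "3U-16bay-A", "controller": "3ware-9550SX", "drive": "WD5000YS"},
    {"chassis": "3U-16bay-B", "controller": "other-vendor", "drive": "WD5000YS"},
    {"chassis": "3U-16bay-A", "controller": "3ware-9550SX", "drive": "WD5000YS"},
]

by_config = Counter((e["chassis"], e["controller"]) for e in eject_events)
by_drive = Counter(e["drive"] for e in eject_events)
print(by_config)  # ejects spread across configurations...
print(by_drive)   # ...but concentrated on one drive model -> suspect the drive firmware
```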

  9. Operating Systems
  • Grid services, batch workers, service machines
    • Mainly SL3.0.3, SL3.0.5, SL3.0.8, some SL4.2, SL4.4, all ix86
    • Planning for x86_64 WNs and SL4 batch services
  • Disk storage
    • New servers using SL4/i386/ext3, some x86_64
    • CASTOR, dCache, NFS, Xrootd
    • Older servers: SL4 migration in progress
  • Tape systems
    • AIX: ADS tape caches
    • Solaris: silo/library controllers
    • SL3/4: CASTOR caches, SRMs, tape servers
  • Oracle systems
    • RHEL3/4
  • Batch system
    • Torque/MAUI
    • Problems with jobs 'failing to launch'
    • Reduced by running Torque with RPP disabled

  10. Services
  • UK National BDII
    • Single system was overloaded: dropping connections, failing to reply to queries
    • Failing SAM tests, detrimental to UK Grid services (and reliability stats!)
    • Replaced the single unit with a DNS-'balanced' pair in Feb 07, extended to a triplet in March (see the sketch below)
  • UIs
    • Migration to gLite-flavour UIs in May 07
  • CE
    • Overloaded system moved to a twin dual-core (AMD) node with a faster SATA drive
    • Plan a second (gLite) CE to split the load
  • RB
    • Second RB added to ease load
  • PPS
    • Service now in production, testing gLite-flavour middleware
  • AFS
    • Hardware upgrade postponed, pending review of service needs
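
A rough illustration of the DNS-'balanced' BDII idea: a single service alias resolving to several hosts, so client queries spread across them and a single node failure is tolerated. The alias name is a placeholder, not the real RAL hostname; 2170 is the standard BDII LDAP port.

```python
# Minimal sketch of resolving a round-robin service alias and picking a member.
import socket
import random

def bdii_endpoints(alias="bdii.example.ac.uk", port=2170):
    """Return every address the alias currently resolves to (the round-robin set)."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def pick_endpoint(alias="bdii.example.ac.uk"):
    """A client that picks one member at random spreads load over the set."""
    return random.choice(bdii_endpoints(alias))
```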

  11. Storage Resource Management
  • dCache
    • Performance issues: LAN performance very good, WAN performance and tuning problems
    • Stability issues, now better:
      • increased the number of open file descriptors and the number of logins allowed (see the sketch below)
      • Java 1.4 -> 1.5
  • ADS
    • In-house system, many years old
    • Will remain for some legacy services, but not planned for PP
  • CASTOR
    • Replacing both dCache disk and tape SRMs for the major data services
    • Replaces T1 access to existing tape services
    • Production services for ATLAS, CMS, LHCb
    • CSA06 to CASTOR OK
    • Support issues
    • 'Plan B'
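
One of the dCache stability fixes was raising the open-file-descriptor limit. The sketch below only inspects and raises the limit seen by the current process; the target value is an arbitrary example, and the production change would be made in the system limits and service init scripts rather than in application code.

```python
# Inspect and (optionally) raise the per-process open file descriptor limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft={soft}, hard={hard}")

WANTED = 8192  # example value only
if soft < WANTED:
    # Raise the soft limit towards the hard limit for this process.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(WANTED, hard), hard))
```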

  12. CASTOR Issues
  • Lots of issues causing stability problems
    • Scheduling transfer jobs to servers in the wrong service class
    • Problems upgrading to the latest version
      • T1 running older versions, not in use at CERN
      • Struggle to get new versions running on the test instance
    • Support patchy
  • Performance on disk servers with a single file system is poor compared to servers with multiple file systems:
    • CASTOR schedules transfers per file system, whereas LSF applies limits per disk server (see the sketch below)
    • A new LSF plug-in should resolve this, but it needs the latest LSF and CASTOR
  • WAN tuning not good for LAN transfers
  • Problem with 'Reserved Space'
  • Lots of other niggles and unwarranted assumptions
    • Short hostnames
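
A toy illustration of the per-file-system versus per-server mismatch described above. The slot counts and server names are invented, but the shape of the problem is as the slide states: a server exporting a single file system is never offered more than one file system's worth of transfer slots, while a multi-file-system server can saturate its per-server limit.

```python
# Illustrative numbers only: compare what per-filesystem scheduling offers a
# server with what a per-server LSF limit actually allows.
SLOTS_PER_FILESYSTEM = 4
LSF_LIMIT_PER_SERVER = 8

servers = {"diskserver-a": 1, "diskserver-b": 6}   # hypothetical: filesystems per server

for name, n_fs in servers.items():
    offered = n_fs * SLOTS_PER_FILESYSTEM           # per-filesystem scheduling
    granted = min(offered, LSF_LIMIT_PER_SERVER)    # per-server cap
    print(f"{name}: {n_fs} fs -> offered {offered}, capped at {granted}")
# The single-filesystem server never sees more than 4 concurrent transfers,
# so its throughput lags the multi-filesystem server.
```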

  13. Monitoring
  • Nagios
    • Production service implemented, replacing SURE for alarm and exception handling
    • 3 servers (1 master + 2 slaves), see the sketch below
    • Almost all systems covered: 800+
    • Some stability issues with the server (memory use)
    • Call-out facilities to be added
  • Ganglia
    • Updating to the latest version, which is more stable
  • CACTI
    • Network monitoring
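
A minimal sketch of one way the 800+ host checks could be partitioned across the two slave servers, with the master keeping the consolidated alarm view. The hostnames and slave names are placeholders, and a real distributed Nagios setup is expressed in Nagios configuration rather than code like this.

```python
# Deterministically spread monitored hosts over a fixed set of slave servers.
import hashlib

SLAVES = ["nagios-slave1", "nagios-slave2"]   # hypothetical names

def slave_for(host: str) -> str:
    """Assign a monitored host to one slave by hashing its name."""
    digest = hashlib.md5(host.encode()).hexdigest()
    return SLAVES[int(digest, 16) % len(SLAVES)]

hosts = [f"node{i:03d}.example.ac.uk" for i in range(5)]
for h in hosts:
    print(h, "->", slave_for(h))
```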

  14. Networking
  • All systems have 1Gb/s connections, except the oldest fraction of the batch farm
  • 10Gb/s interlinks everywhere
    • 10Gb/s backbone complete within the T1
    • Nortel 5530/5510 stacks
    • Considering the T1 internal topology: will it meet the intra-farm transfer rates? (see the sketch below)
  • 10Gb/s link to the RAL site backbone
  • 10Gb/s link to the RAL T2
  • 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
    • Direct link to SJ5 rather than via the local MAN
    • Active 10 April 2007
  • Link to firewall now at 2Gb/s
    • Planned 10Gb/s bypass for T1-T2 data traffic
  • 10Gb/s OPN link to CERN
    • T1-T1 routing via the OPN being implemented
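
A back-of-envelope way to frame the intra-farm question raised above is to compare the aggregate 1 Gb/s edge bandwidth behind each switch stack with its 10 Gb/s uplink. The servers-per-stack values below are assumed examples, not figures from the slide.

```python
# Oversubscription ratio of edge bandwidth to uplink bandwidth per switch stack.
SERVER_NIC_GBPS = 1
UPLINK_GBPS = 10

def oversubscription(servers_per_stack: int) -> float:
    """Ratio of total server NIC bandwidth to the stack's 10 Gb/s uplink."""
    return servers_per_stack * SERVER_NIC_GBPS / UPLINK_GBPS

for n in (10, 20, 40):
    print(f"{n} servers per stack -> {oversubscription(n):.1f}:1 oversubscription")
# Ratios above 1:1 mean transfers only reach full NIC rate when traffic stays
# within a stack or not all servers are busy at once.
```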

  15. Testing Developments
  • Viglen HX2220i 'Twin' system
    • Intel Clovertown 'Quads'
    • Benchmarking, running in the batch system
  • Viglen HS216a storage
    • 3U 16-bay with a 3ware 9650SX-16 controller
    • Similar to recent servers, but the controller is PCI-E and runs RAID6
  • Data Direct Networks storage
    • 'RAID'-style controller with disk shelves attached via FC, FC-attached to servers
    • Aim to test performance under various load types and SRM clients (see the sketch below)
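
For context, a very small sequential-write probe of the sort used in this kind of hardware evaluation; real benchmarking would use established tools across a range of load types and SRM clients. The target path and sizes below are placeholders.

```python
# Time a large sequential write and report throughput in MB/s.
import os
import time

def write_throughput_mb_s(path="/mnt/testarray/bench.dat", mb=1024, block=1 << 20):
    """Write `mb` megabytes in `block`-sized chunks and return MB/s."""
    buf = os.urandom(block)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())          # include the time to push data to the array
    elapsed = time.time() - start
    os.remove(path)
    return mb / elapsed

# print(f"{write_throughput_mb_s():.0f} MB/s")   # run against the array under test
```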

  16. Comments, Questions?
