Tier1 Site Report

This site report provides an overview of the hardware, services, monitoring, and networking at the Tier1 site at RAL. It also covers the change in the UK science funding structure, funding and staffing updates, the new computing building and tape silo, capacity hardware, storage commissioning, operating systems, and the services run at the site.

Presentation Transcript


  1. Tier1 Site Report HEPSysMan, RAL May 2007 Martin Bly

  2. Overview
  • RAL / Tier-1
  • Hardware
  • Services
  • Monitoring
  • Networking

  3. RAL / Tier-1
  • Change in UK science funding structure:
    • CCLRC and PPARC have been merged to form a new Research Council: the Science and Technology Facilities Council (STFC)
    • Combined remit covering large facilities, grants, etc.
    • RAL is one of several STFC institutes
    • Some internal restructuring and name changes in Business Units
    • New corporate styles, etc.
  • RAL hosts the UK WLCG Tier-1
    • Funded via the GridPP2 project by STFC
    • Supports WLCG and UK Particle Physics users and collaborators
    • atlas, cms, lhcb, alice, dteam, ops, babar, cdf, d0, h1, zeus, bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, mice, sno, ukqcd, harp, theory users …
  • Expect no change operationally as a result of STFC 'ownership'

  4. Finance & Staff
  • GridPP3 project funding approved
    • "From Production to Exploitation"
    • Provides for the UK Tier-1, Tier-2s and some software activity
    • April 2008 to March 2011
  • Tier1:
    • Increase in staff: 17 FTE (+3.4 FTE from ESC)
    • Hardware resources for WLCG: ~£7.2M
    • Tight funding settlement, contingencies for HW and power
  • Additional Tier1 staff now in post
    • 2 x systems administrators: James Thorne, Lex Holt
    • 1 x hardware technician: James Adams
    • 1 x PPS admin: Marian Klein

  5. New Computing Building
  • Funding for a new computer centre building
    • Funded by RAL/STFC as part of site infrastructure
    • Shared with HPC and other STFC computing facilities
  • Design complete: ~300 racks + 3-4 tape silos
  • Planning permission granted
  • Tender running for construction and fitting out
  • Construction starts in July; planned to be ready for occupation by mid-August 2008

  6. Tape Silo
  • Sun SL8500 tape silo
    • Expanded from 6,000 to 10,000 slots
    • 8 robot trucks
    • 18 x T10K and 10 x 9940B drives
    • 8 x T10K tape drives for CASTOR
  • Second silo planned this FY
    • SL8500, 6,000 slots
    • Tape passing between silos may be possible
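
For scale, the slot count translates into a rough native capacity. A minimal back-of-envelope sketch follows; the cartridge sizes (~500 GB native for T10K media, ~200 GB for 9940B media) are period-typical figures assumed here, not stated on the slide.

```python
# Rough native capacity of a 10,000-slot silo, for an assumed mix of media.
SLOTS = 10_000

def silo_capacity_tb(t10k_fraction=1.0, t10k_gb=500, b9940_gb=200):
    """Approximate native capacity in TB for a given fraction of T10K cartridges."""
    t10k_slots = SLOTS * t10k_fraction
    b9940_slots = SLOTS - t10k_slots
    return (t10k_slots * t10k_gb + b9940_slots * b9940_gb) / 1000

print(f"{silo_capacity_tb():.0f} TB")      # all T10K media: ~5000 TB native
print(f"{silo_capacity_tb(0.5):.0f} TB")   # a 50/50 media mix: ~3500 TB native
```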

  7. Capacity Hardware FY06/07
  • CPU
    • 64 x 1U twin dual-core Woodcrest 5130 units: ~550 kSI2K
    • 4GB RAM, 250GB data HDD, dual 1Gb/s NICs
    • Commissioned January 07
    • Total capacity ~1550 kSI2K, ~1275 job slots
  • Disk
    • 86 x 3U 16-bay servers: 516 TB (10^12 bytes) data capacity
    • 3Ware 9550SX, 14 x 500GB data drives, 2 x 250GB system drives
    • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb/s NICs
    • Commissioned March 07, going into production service as required
  • Total disk storage ~900TB
    • ~40TB being phased out at end of life (~5 years)
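
The headline figures can be cross-checked with simple arithmetic. The RAID layout in this sketch is an assumption (RAID6 across the 14 data drives), chosen only because it reproduces the quoted 516 TB; the slide itself does not state the layout.

```python
# Cross-check of the quoted capacity figures for the FY06/07 purchase.
cpu_nodes, cores_per_node = 64, 4            # twin dual-core Woodcrest 5130
print(cpu_nodes * cores_per_node)            # 256 new job slots (of ~1275 total)

disk_servers, data_drives, drive_tb = 86, 14, 0.5
raid6_parity = 2                             # assumed: two parity drives per RAID6 set
usable_tb = disk_servers * (data_drives - raid6_parity) * drive_tb
print(usable_tb)                             # 516.0 TB, matching the slide
```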

  8. Storage Commissioning
  • Problems with recent storage systems now solved!
  • Issue: WD5000YS (500GB) drives showed 'random throws' (ejects) from RAID units
    • No host logs of problems; testing the drive offline showed no drive issues
    • Common to two completely different hardware configurations
  • Problem isolated:
    • Non-return loop in the drive firmware
    • The drive head needs to move occasionally to avoid ploughing a furrow in the platter lubricant; due to timeout issues in some circumstances the drive would sit there stuck, communication with the controller would time out, and the drive would be ejected
    • Yanking the drive resets the electronics and no problem is evident (or logged)
  • WD patched the firmware once the problem was isolated
  • Subsequent reports of the same or similar problem from non-HEP sites
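
The isolation argument on this slide (the same drive model misbehaving in two unrelated hardware configurations points at the drive rather than the controllers) can be sketched as a simple tally. The record format, chassis and controller names below are entirely hypothetical.

```python
# Illustrative only: invented eject records from two different server generations.
from collections import Counter

eject_events = [
    {"chassis": "3U-16bay-A", "controller": "3ware-9550SX", "drive": "WD5000YS"},
    {"chassis": "3U-16bay-B", "controller": "other-vendor", "drive": "WD5000YS"},
    {"chassis": "3U-16bay-A", "controller": "3ware-9550SX", "drive": "WD5000YS"},
]

by_config = Counter((e["chassis"], e["controller"]) for e in eject_events)
by_drive = Counter(e["drive"] for e in eject_events)
print(by_config)  # ejects spread across configurations...
print(by_drive)   # ...but concentrated on one drive model -> suspect the drive firmware
```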

  9. Operating Systems
  • Grid services, batch workers, service machines
    • Mainly SL3.0.3, SL3.0.5, SL3.0.8, some SL4.2, SL4.4, all ix86
    • Planning for x86_64 WNs and SL4 batch services
  • Disk storage
    • New servers using SL4/i386/ext3, some x86_64
    • CASTOR, dCache, NFS, Xrootd
    • Older servers: SL4 migration in progress
  • Tape systems
    • AIX: ADS tape caches
    • Solaris: silo/library controllers
    • SL3/4: CASTOR caches, SRMs, tape servers
  • Oracle systems
    • RHEL3/4
  • Batch system
    • Torque/MAUI
    • Problems with jobs 'failing to launch'
    • Reduced by running Torque with RPP disabled

  10. Services
  • UK National BDII
    • Single system was overloaded: dropping connections, failing to reply to queries
    • Failing SAM tests, detrimental to UK Grid services (and reliability stats!)
    • Replaced the single unit with a DNS-'balanced' pair in Feb 07, extended to a triplet in March (see the sketch below)
  • UIs
    • Migration to gLite-flavour UIs in May 07
  • CE
    • Overloaded system moved to a twin dual-core (AMD) node with a faster SATA drive
    • Plan a second (gLite) CE to split the load
  • RB
    • Second RB added to ease load
  • PPS
    • Service now in production, testing gLite-flavour middleware
  • AFS
    • Hardware upgrade postponed, pending review of service needs
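
A rough illustration of the DNS-'balanced' BDII idea: a single service alias resolving to several hosts, so client queries spread across them and a single node failure is tolerated. The alias name is a placeholder, not the real RAL hostname; 2170 is the standard BDII LDAP port.

```python
# Minimal sketch of resolving a round-robin service alias and picking a member.
import socket
import random

def bdii_endpoints(alias="bdii.example.ac.uk", port=2170):
    """Return every address the alias currently resolves to (the round-robin set)."""
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def pick_endpoint(alias="bdii.example.ac.uk"):
    """A client that picks one member at random spreads load over the set."""
    return random.choice(bdii_endpoints(alias))
```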

  11. Storage Resource Management
  • dCache
    • Performance issues: LAN performance very good, WAN performance and tuning problems
    • Stability issues, now better:
      • increased the number of open file descriptors and the number of logins allowed (see the sketch below)
      • Java 1.4 -> 1.5
  • ADS
    • In-house system, many years old
    • Will remain for some legacy services, but not planned for PP
  • CASTOR
    • Replacing both dCache disk and tape SRMs for the major data services
    • Replaces T1 access to existing tape services
    • Production services for ATLAS, CMS, LHCb
    • CSA06 to CASTOR OK
    • Support issues
    • 'Plan B'
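
One of the dCache stability fixes was raising the open-file-descriptor limit. The sketch below only inspects and raises the limit seen by the current process; the target value is an arbitrary example, and the production change would be made in the system limits and service init scripts rather than in application code.

```python
# Inspect and (optionally) raise the per-process open file descriptor limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft={soft}, hard={hard}")

WANTED = 8192  # example value only
if soft < WANTED:
    # Raise the soft limit towards the hard limit for this process.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(WANTED, hard), hard))
```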

  12. CASTOR Issues
  • Lots of issues causing stability problems
    • Scheduling transfer jobs to servers in the wrong service class
    • Problems upgrading to the latest version
      • T1 running older versions, not in use at CERN
      • Struggle to get new versions running on the test instance
    • Support patchy
  • Performance on disk servers with a single file system is poor compared to servers with multiple file systems:
    • CASTOR schedules transfers per file system, whereas LSF applies limits per disk server (see the sketch below)
    • A new LSF plug-in should resolve this, but it needs the latest LSF and CASTOR
  • WAN tuning not good for LAN transfers
  • Problem with 'Reserved Space'
  • Lots of other niggles and unwarranted assumptions
    • Short hostnames
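
A toy illustration of the per-file-system versus per-server mismatch described above. The slot counts and server names are invented, but the shape of the problem is as the slide states: a server exporting a single file system is never offered more than one file system's worth of transfer slots, while a multi-file-system server can saturate its per-server limit.

```python
# Illustrative numbers only: compare what per-filesystem scheduling offers a
# server with what a per-server LSF limit actually allows.
SLOTS_PER_FILESYSTEM = 4
LSF_LIMIT_PER_SERVER = 8

servers = {"diskserver-a": 1, "diskserver-b": 6}   # hypothetical: filesystems per server

for name, n_fs in servers.items():
    offered = n_fs * SLOTS_PER_FILESYSTEM           # per-filesystem scheduling
    granted = min(offered, LSF_LIMIT_PER_SERVER)    # per-server cap
    print(f"{name}: {n_fs} fs -> offered {offered}, capped at {granted}")
# The single-filesystem server never sees more than 4 concurrent transfers,
# so its throughput lags the multi-filesystem server.
```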

  13. Monitoring
  • Nagios
    • Production service implemented, replacing SURE for alarm and exception handling
    • 3 servers (1 master + 2 slaves), see the sketch below
    • Almost all systems covered: 800+
    • Some stability issues with the server (memory use)
    • Call-out facilities to be added
  • Ganglia
    • Updating to the latest version, which is more stable
  • CACTI
    • Network monitoring
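
A minimal sketch of one way the 800+ host checks could be partitioned across the two slave servers, with the master keeping the consolidated alarm view. The hostnames and slave names are placeholders, and a real distributed Nagios setup is expressed in Nagios configuration rather than code like this.

```python
# Deterministically spread monitored hosts over a fixed set of slave servers.
import hashlib

SLAVES = ["nagios-slave1", "nagios-slave2"]   # hypothetical names

def slave_for(host: str) -> str:
    """Assign a monitored host to one slave by hashing its name."""
    digest = hashlib.md5(host.encode()).hexdigest()
    return SLAVES[int(digest, 16) % len(SLAVES)]

hosts = [f"node{i:03d}.example.ac.uk" for i in range(5)]
for h in hosts:
    print(h, "->", slave_for(h))
```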

  14. Networking
  • All systems have 1Gb/s connections, except the oldest fraction of the batch farm
  • 10Gb/s interlinks everywhere
    • 10Gb/s backbone complete within the T1
    • Nortel 5530/5510 stacks
    • Considering the T1 internal topology: will it meet the intra-farm transfer rates? (see the sketch below)
  • 10Gb/s link to the RAL site backbone
  • 10Gb/s link to the RAL T2
  • 10Gb/s link to the UK academic network SuperJanet5 (SJ5)
    • Direct link to SJ5 rather than via the local MAN
    • Active 10 April 2007
  • Link to firewall now at 2Gb/s
    • Planned 10Gb/s bypass for T1-T2 data traffic
  • 10Gb/s OPN link to CERN
    • T1-T1 routing via the OPN being implemented
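
A back-of-envelope way to frame the intra-farm question raised above is to compare the aggregate 1 Gb/s edge bandwidth behind each switch stack with its 10 Gb/s uplink. The servers-per-stack values below are assumed examples, not figures from the slide.

```python
# Oversubscription ratio of edge bandwidth to uplink bandwidth per switch stack.
SERVER_NIC_GBPS = 1
UPLINK_GBPS = 10

def oversubscription(servers_per_stack: int) -> float:
    """Ratio of total server NIC bandwidth to the stack's 10 Gb/s uplink."""
    return servers_per_stack * SERVER_NIC_GBPS / UPLINK_GBPS

for n in (10, 20, 40):
    print(f"{n} servers per stack -> {oversubscription(n):.1f}:1 oversubscription")
# Ratios above 1:1 mean transfers only reach full NIC rate when traffic stays
# within a stack or not all servers are busy at once.
```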

  15. Testing Developments
  • Viglen HX2220i 'Twin' system
    • Intel Clovertown 'Quads'
    • Benchmarking, running in the batch system
  • Viglen HS216a storage
    • 3U 16-bay with a 3ware 9650SX-16 controller
    • Similar to recent servers, but the controller is PCI-E and runs RAID6
  • Data Direct Networks storage
    • 'RAID'-style controller with disk shelves attached via FC, FC-attached to servers
    • Aim to test performance under various load types and SRM clients (see the sketch below)
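
For context, a very small sequential-write probe of the sort used in this kind of hardware evaluation; real benchmarking would use established tools across a range of load types and SRM clients. The target path and sizes below are placeholders.

```python
# Time a large sequential write and report throughput in MB/s.
import os
import time

def write_throughput_mb_s(path="/mnt/testarray/bench.dat", mb=1024, block=1 << 20):
    """Write `mb` megabytes in `block`-sized chunks and return MB/s."""
    buf = os.urandom(block)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())          # include the time to push data to the array
    elapsed = time.time() - start
    os.remove(path)
    return mb / elapsed

# print(f"{write_throughput_mb_s():.0f} MB/s")   # run against the array under test
```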

  16. Comments, Questions?
