
RHIC/US ATLAS Tier 1 Computing Facility Site Report


Presentation Transcript


  1. Christopher Hollowell Physics Department Brookhaven National Laboratory hollowec@bnl.gov RHIC/US ATLAS Tier 1 Computing Facility Site Report HEPiX Upton, NY, USA October 18, 2004

  2. Facility Overview • Created in the mid-1990s to provide centralized computing services for the RHIC experiments • Role expanded in the late 1990s to serve as the Tier 1 computing center for ATLAS in the United States • Currently employs 28 staff members; planning to add 5 more in the next fiscal year

  3. Facility Overview (Cont.) • Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway • RHIC Run 5 scheduled to begin in late December 2004

  4. Centralized Disk Storage • 37 NFS servers running Solaris 9: recently upgraded from Solaris 8 • Underlying filesystems upgraded to VxFS 4.0 • Known issue with quotas on filesystems larger than 1 TB • ~220 TB of Fibre Channel SAN-based RAID5 storage available: ~100 TB added in the past year
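While per-user quotas remain problematic on the >1 TB VxFS filesystems, space can still be tracked at the filesystem level. The following is a minimal sketch of such a usage report using os.statvfs; it is not the facility's actual tooling, and the mount points are hypothetical.

```python
#!/usr/bin/env python3
# Minimal sketch: report usage on large NFS-served volumes while per-user
# quotas on >1 TB VxFS filesystems remain problematic.  Mount points are
# hypothetical examples, not the facility's actual volume names.
import os

MOUNTS = ["/data01", "/data02"]   # hypothetical volume mount points

def usage_tb(path):
    """Return (total_tb, used_tb) for the filesystem holding 'path'."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    tb = 1024.0 ** 4
    return total / tb, (total - free) / tb

for mount in MOUNTS:
    total, used = usage_tb(mount)
    print("%-10s %7.2f TB total, %7.2f TB used (%.0f%%)"
          % (mount, total, used, 100.0 * used / total))
```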

  5. Centralized Disk Storage (Cont.) • Scalability issues with NFS: in our configuration each server is network-limited to ~70 MB/s, versus 75-90 MB/s maximum local I/O; testing of new network storage models, including Panasas and IBRIX, is in progress • Panasas tests look promising: 4.5 TB of storage on 10 blades available for evaluation by our user community; DirectFlow client in use on over 400 machines • Both systems allow NFS export of data
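The ~70 MB/s per-server NFS figure versus 75-90 MB/s local I/O comes from throughput measurements; the sketch below shows the kind of simple sequential-write test that produces such numbers. The paths and transfer sizes are illustrative assumptions, not the facility's actual benchmark.

```python
#!/usr/bin/env python3
# Illustrative sequential-write throughput test of the sort used to compare
# NFS-mounted storage against local disk.  Paths and sizes are assumptions.
import os
import time

def write_throughput(path, size_mb=1024, block_mb=1):
    """Write size_mb of zeros to 'path' and return the rate in MB/s."""
    block = b"\0" * (block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())              # push data out of the page cache
    elapsed = time.time() - start
    os.unlink(path)
    return size_mb / elapsed

if __name__ == "__main__":
    print("NFS mount : %.1f MB/s" % write_throughput("/nfs/scratch/testfile"))
    print("local disk: %.1f MB/s" % write_throughput("/tmp/testfile"))
```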

  6. Centralized Disk Storage (Cont.)

  7. Centralized Disk Storage: AFS • Moving servers from Transarc AFS on AIX to OpenAFS 1.2.11 on Solaris 9 • Migration motivated by Kerberos 4/Kerberos 5 issues and Transarc AFS reaching end of life • Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers already running OpenAFS • 2 cells
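One way to move volumes from the old Transarc/AIX fileservers to the new OpenAFS/Solaris 9 ones is with vos move; the sketch below simply wraps it in a loop. Server, partition, and volume names are hypothetical, and the slides do not describe the actual migration procedure.

```python
#!/usr/bin/env python3
# Hedged sketch: scripting AFS volume moves with 'vos move' during a
# fileserver migration.  All server, partition, and volume names here are
# hypothetical placeholders.
import subprocess

SRC_SERVER, SRC_PART = "afs-aix01.bnl.gov", "a"   # old Transarc/AIX server (hypothetical)
DST_SERVER, DST_PART = "afs-sol01.bnl.gov", "a"   # new OpenAFS/Solaris 9 server (hypothetical)
VOLUMES = ["user.alice", "proj.phenix.sw"]        # example volume names

for vol in VOLUMES:
    print("moving volume", vol)
    subprocess.check_call([
        "vos", "move", "-id", vol,
        "-fromserver", SRC_SERVER, "-frompartition", SRC_PART,
        "-toserver", DST_SERVER, "-topartition", DST_PART,
        "-localauth",
    ])
```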

  8. Mass Tape Storage • Four STK Powderhorn silos, each capable of holding ~6000 tapes • 1.7 PB of data currently stored • HPSS version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5 • 45 tape drives available for use • Latest STK tape technology: 200 GB/tape • ~12 TB disk cache in front of the system

  9. Mass Tape Storage (Cont.) • PFTP, HSI and HTAR available as interfaces
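As a rough illustration of how the HSI interface can be scripted, the sketch below wraps hsi put/get calls with subprocess. The HPSS paths and file names are hypothetical, and HSI authentication is assumed to be handled by the site's standard configuration.

```python
#!/usr/bin/env python3
# Sketch of driving the HSI command-line interface to HPSS from a script.
# File and HPSS path names are hypothetical; HSI authentication is assumed
# to be set up already by the site configuration.
import subprocess

def hsi_put(local_path, hpss_path):
    """Store a local file into HPSS ('hsi put local : remote')."""
    subprocess.check_call(["hsi", "put %s : %s" % (local_path, hpss_path)])

def hsi_get(local_path, hpss_path):
    """Retrieve a file from HPSS ('hsi get local : remote')."""
    subprocess.check_call(["hsi", "get %s : %s" % (local_path, hpss_path)])

if __name__ == "__main__":
    hsi_put("run5_sample.root", "/home/user/run5_sample.root")
    hsi_get("run5_sample_copy.root", "/home/user/run5_sample.root")
```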

  10. CAS/CRS Farm • Farm of 1423 dual-CPU (Intel) systems • Added 335 machines this year • ~245 TB local disk storage (SCSI and IDE) • Upgrade of RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (+updates) underway: should be complete before next RHIC run

  11. CAS/CRS Farm (Cont.) • LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use; upgrade to LSF 6.0 planned • Kickstart used to automate node installation • Ganglia plus custom software used for system monitoring • Phasing out the original RHIC CRS Batch System: replacing it with a system based on Condor • Retiring 142 VA Linux 2U PIII 450 MHz systems after the next purchase
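To give a flavor of the Condor-based direction for the CRS replacement, here is a minimal vanilla-universe submission sketch. The executable, arguments, and file names are invented for illustration; the actual replacement system is not detailed in the slides.

```python
#!/usr/bin/env python3
# Minimal sketch of submitting a vanilla-universe Condor job from a script.
# The executable and file names are hypothetical examples only.
import os
import subprocess
import tempfile

SUBMIT = """\
universe   = vanilla
executable = /usr/local/bin/reco.sh
arguments  = run5_file_0001.daq
output     = reco_0001.out
error      = reco_0001.err
log        = reco_0001.log
queue
"""

fd, submit_path = tempfile.mkstemp(suffix=".sub")
with os.fdopen(fd, "w") as f:
    f.write(SUBMIT)

subprocess.check_call(["condor_submit", submit_path])   # hand the job to Condor
```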

  12. CAS/CRS Farm (Cont.)

  13. CAS/CRS Farm (Cont.)

  14. Security • Elimination of NIS: complete transition to Kerberos 5/LDAP in progress • Expect a Kerberos 5 TGT to X.509 certificate transition in the future, possibly via KCA • Hardening/monitoring of all internal systems • Growing web service issues: unknown services accessed through port 80
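Once NIS is gone, account attributes that used to come from the passwd map are served by LDAP. The sketch below shows a generic posixAccount lookup with the third-party python-ldap module; the server URI and base DN are hypothetical.

```python
#!/usr/bin/env python3
# Sketch of an LDAP posixAccount lookup replacing a NIS passwd-map query.
# Server URI and base DN are hypothetical; requires the python-ldap module.
import ldap

LDAP_URI = "ldap://ldap.example.bnl.gov"        # hypothetical server
BASE_DN = "ou=People,dc=example,dc=bnl,dc=gov"  # hypothetical base DN

def lookup_user(uid):
    """Return posixAccount attributes for a login name."""
    conn = ldap.initialize(LDAP_URI)
    conn.simple_bind_s()                        # anonymous bind
    attrs = ["uidNumber", "gidNumber", "homeDirectory", "loginShell"]
    return conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(uid=%s)" % uid, attrs)

if __name__ == "__main__":
    for dn, entry in lookup_user("hollowec"):
        print(dn, entry)
```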

  15. Grid Activities • Brookhaven planning to upgrade external network connectivity from OC12 (622 Mbps) to OC48 (2.488 Gbps) to support ATLAS activity • ATLAS Data Challenge 2: jobs submitted via Grid3 • GUMS (Grid User Management System) • Generates grid-mapfiles for gatekeeper hosts • In production since May 2004
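GUMS itself is not shown here, but the grid-mapfile it generates for the gatekeepers is a plain text file mapping certificate subject DNs to local accounts. The sketch below only illustrates that file format; the DNs and account names are invented.

```python
#!/usr/bin/env python3
# Illustration of the grid-mapfile format produced for gatekeeper hosts
# (this is not GUMS itself).  The DNs and local account names are invented.

MAPPINGS = {
    "/DC=org/DC=doegrids/OU=People/CN=Example User 12345": "usatlas1",
    "/DC=org/DC=doegrids/OU=People/CN=Another User 67890": "rhicgrid",
}

def write_gridmap(path, mappings):
    """Write one '"<subject DN>" <local account>' line per mapping."""
    with open(path, "w") as f:
        for dn, account in sorted(mappings.items()):
            f.write('"%s" %s\n' % (dn, account))

if __name__ == "__main__":
    write_gridmap("grid-mapfile", MAPPINGS)
```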

  16. Storage Resource Manager (SRM) • SRM: middleware providing dynamic storage allocation and data management services • Automatically handles network/space allocation failures • HRM (Hierarchical Resource Manager)-type SRM server in production • Accessible from within and outside the facility • 350 GB cache • Berkeley HRM 1.2.1
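The "automatically handles network/space allocation failures" point boils down to the client not having to retry transfers by hand. The sketch below is purely conceptual, showing a retry-with-backoff loop of the kind SRM provides internally; it is not the SRM or HRM client API.

```python
#!/usr/bin/env python3
# Conceptual sketch only: retry-with-backoff of the kind SRM layers over raw
# transfers.  This is not the SRM/HRM client API; transfer_fn stands in for
# any callable that raises on a network or space-allocation failure.
import time

def transfer_with_retry(transfer_fn, retries=5, delay=60):
    """Call transfer_fn, retrying after transient failures."""
    for attempt in range(1, retries + 1):
        try:
            return transfer_fn()
        except (IOError, OSError) as err:
            print("attempt %d failed: %s" % (attempt, err))
            if attempt == retries:
                raise
            time.sleep(delay)         # wait before re-requesting space/transfer
```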

  17. dCache • Provides a global name space over disparate storage elements • Hot spot detection • Client data access through the libdcap library or the libpdcap preload library • ATLAS & PHENIX dCache pools • PHENIX pool expanding performance tests to production machines • ATLAS pool interacting with HPSS using HSI: no way of throttling data transfer requests yet
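The two client access routes mentioned above can be sketched as follows: an explicit dccp copy (linked against libdcap), and an unmodified application run with the libpdcap preload library. The door host, port, pnfs path, and analysis binary are hypothetical.

```python
#!/usr/bin/env python3
# Sketch of dCache client access via dccp and via the libpdcap preload
# library.  The door host, port, pnfs path, and analysis binary are all
# hypothetical examples.
import os
import subprocess

DCAP_URL = "dcap://dcdoor.example.bnl.gov:22125/pnfs/example.bnl.gov/phenix/run4/sample.root"

# Explicit copy out of a dCache pool with dccp (uses libdcap)
subprocess.check_call(["dccp", DCAP_URL, "/tmp/sample.root"])

# Unmodified application opening the same file with libpdcap preloaded,
# which intercepts its I/O calls and routes them through dcap
env = dict(os.environ, LD_PRELOAD="libpdcap.so")
subprocess.check_call(["./analysis_job", DCAP_URL], env=env)   # hypothetical binary
```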
