
RHIC/US ATLAS Tier 1 Computing Facility Site Report


Presentation Transcript


  1. Christopher Hollowell Physics Department Brookhaven National Laboratory hollowec@bnl.gov RHIC/US ATLAS Tier 1 Computing Facility Site Report HEPiX Upton, NY, USA October 18, 2004

  2. Facility Overview • Created in the mid-1990s to provide centralized computing services for the RHIC experiments • Role expanded in the late 1990s to serve as the Tier 1 computing center for ATLAS in the United States • Currently employs 28 staff members; planning to add 5 more in the next fiscal year

  3. Facility Overview (Cont.) • Ramping up resources provided to ATLAS: Data Challenge 2 (DC2) underway • RHIC Run 5 scheduled to begin in late December 2004

  4. Centralized Disk Storage • 37 NFS servers running Solaris 9: recently upgraded from Solaris 8 • Underlying filesystems upgraded to VxFS 4.0 • Known issue with quotas on filesystems larger than 1 TB • ~220 TB of Fibre Channel SAN-based RAID5 storage available: ~100 TB added in the past year
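While per-user quotas remain problematic on the >1 TB VxFS filesystems, space can still be tracked at the filesystem level. The following is a minimal sketch of such a usage report using os.statvfs; it is not the facility's actual tooling, and the mount points are hypothetical.

```python
#!/usr/bin/env python3
# Minimal sketch: report usage on large NFS-served volumes while per-user
# quotas on >1 TB VxFS filesystems remain problematic.  Mount points are
# hypothetical examples, not the facility's actual volume names.
import os

MOUNTS = ["/data01", "/data02"]   # hypothetical volume mount points

def usage_tb(path):
    """Return (total_tb, used_tb) for the filesystem holding 'path'."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    tb = 1024.0 ** 4
    return total / tb, (total - free) / tb

for mount in MOUNTS:
    total, used = usage_tb(mount)
    print("%-10s %7.2f TB total, %7.2f TB used (%.0f%%)"
          % (mount, total, used, 100.0 * used / total))
```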

  5. Centralized Disk Storage (Cont.) • Scalability issues with NFS: in our configuration each server is network-limited to ~70 MB/s, versus 75-90 MB/s maximum local I/O; testing of new network storage models, including Panasas and IBRIX, is in progress • Panasas tests look promising: 4.5 TB of storage on 10 blades available for evaluation by our user community; DirectFlow client in use on over 400 machines • Both systems allow NFS export of data
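The ~70 MB/s per-server NFS figure versus 75-90 MB/s local I/O comes from throughput measurements; the sketch below shows the kind of simple sequential-write test that produces such numbers. The paths and transfer sizes are illustrative assumptions, not the facility's actual benchmark.

```python
#!/usr/bin/env python3
# Illustrative sequential-write throughput test of the sort used to compare
# NFS-mounted storage against local disk.  Paths and sizes are assumptions.
import os
import time

def write_throughput(path, size_mb=1024, block_mb=1):
    """Write size_mb of zeros to 'path' and return the rate in MB/s."""
    block = b"\0" * (block_mb * 1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())              # push data out of the page cache
    elapsed = time.time() - start
    os.unlink(path)
    return size_mb / elapsed

if __name__ == "__main__":
    print("NFS mount : %.1f MB/s" % write_throughput("/nfs/scratch/testfile"))
    print("local disk: %.1f MB/s" % write_throughput("/tmp/testfile"))
```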

  6. Centralized Disk Storage (Cont.)

  7. Centralized Disk Storage: AFS • Moving servers from Transarc AFS on AIX to OpenAFS 1.2.11 on Solaris 9 • Migration motivated by Kerberos 4/Kerberos 5 issues and Transarc AFS reaching end of life • Total of 7 fileservers and 6 DB servers: 2 DB servers and 2 fileservers already running OpenAFS • 2 cells
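One way to move volumes from the old Transarc/AIX fileservers to the new OpenAFS/Solaris 9 ones is with vos move; the sketch below simply wraps it in a loop. Server, partition, and volume names are hypothetical, and the slides do not describe the actual migration procedure.

```python
#!/usr/bin/env python3
# Hedged sketch: scripting AFS volume moves with 'vos move' during a
# fileserver migration.  All server, partition, and volume names here are
# hypothetical placeholders.
import subprocess

SRC_SERVER, SRC_PART = "afs-aix01.bnl.gov", "a"   # old Transarc/AIX server (hypothetical)
DST_SERVER, DST_PART = "afs-sol01.bnl.gov", "a"   # new OpenAFS/Solaris 9 server (hypothetical)
VOLUMES = ["user.alice", "proj.phenix.sw"]        # example volume names

for vol in VOLUMES:
    print("moving volume", vol)
    subprocess.check_call([
        "vos", "move", "-id", vol,
        "-fromserver", SRC_SERVER, "-frompartition", SRC_PART,
        "-toserver", DST_SERVER, "-topartition", DST_PART,
        "-localauth",
    ])
```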

  8. Mass Tape Storage • Four STK Powderhorn silos, each capable of holding ~6000 tapes • 1.7 PB of data currently stored • HPSS version 4.5.1: likely upgrade to version 6.1 or 6.2 after RHIC Run 5 • 45 tape drives available for use • Latest STK tape technology: 200 GB/tape • ~12 TB disk cache in front of the system

  9. Mass Tape Storage (Cont.) • PFTP, HSI and HTAR available as interfaces
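As a rough illustration of how the HSI interface can be scripted, the sketch below wraps hsi put/get calls with subprocess. The HPSS paths and file names are hypothetical, and HSI authentication is assumed to be handled by the site's standard configuration.

```python
#!/usr/bin/env python3
# Sketch of driving the HSI command-line interface to HPSS from a script.
# File and HPSS path names are hypothetical; HSI authentication is assumed
# to be set up already by the site configuration.
import subprocess

def hsi_put(local_path, hpss_path):
    """Store a local file into HPSS ('hsi put local : remote')."""
    subprocess.check_call(["hsi", "put %s : %s" % (local_path, hpss_path)])

def hsi_get(local_path, hpss_path):
    """Retrieve a file from HPSS ('hsi get local : remote')."""
    subprocess.check_call(["hsi", "get %s : %s" % (local_path, hpss_path)])

if __name__ == "__main__":
    hsi_put("run5_sample.root", "/home/user/run5_sample.root")
    hsi_get("run5_sample_copy.root", "/home/user/run5_sample.root")
```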

  10. CAS/CRS Farm • Farm of 1423 dual-CPU (Intel) systems • Added 335 machines this year • ~245 TB local disk storage (SCSI and IDE) • Upgrade of RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (+updates) underway: should be complete before next RHIC run

  11. CAS/CRS Farm (Cont.) • LSF (5.1) and Condor (6.6.6/6.6.5) batch systems in use; upgrade to LSF 6.0 planned • Kickstart used to automate node installation • Ganglia plus custom software used for system monitoring • Phasing out the original RHIC CRS Batch System: replacing it with a system based on Condor • Retiring 142 VA Linux 2U PIII 450 MHz systems after the next purchase
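To give a flavor of the Condor-based direction for the CRS replacement, here is a minimal vanilla-universe submission sketch. The executable, arguments, and file names are invented for illustration; the actual replacement system is not detailed in the slides.

```python
#!/usr/bin/env python3
# Minimal sketch of submitting a vanilla-universe Condor job from a script.
# The executable and file names are hypothetical examples only.
import os
import subprocess
import tempfile

SUBMIT = """\
universe   = vanilla
executable = /usr/local/bin/reco.sh
arguments  = run5_file_0001.daq
output     = reco_0001.out
error      = reco_0001.err
log        = reco_0001.log
queue
"""

fd, submit_path = tempfile.mkstemp(suffix=".sub")
with os.fdopen(fd, "w") as f:
    f.write(SUBMIT)

subprocess.check_call(["condor_submit", submit_path])   # hand the job to Condor
```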

  12. CAS/CRS Farm (Cont.)

  13. CAS/CRS Farm (Cont.)

  14. Security • Elimination of NIS: complete transition to Kerberos 5/LDAP in progress • Expect a Kerberos 5 TGT to X.509 certificate transition in the future, possibly via KCA • Hardening/monitoring of all internal systems • Growing web service issues: unknown services accessed through port 80
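Once NIS is gone, account attributes that used to come from the passwd map are served by LDAP. The sketch below shows a generic posixAccount lookup with the third-party python-ldap module; the server URI and base DN are hypothetical.

```python
#!/usr/bin/env python3
# Sketch of an LDAP posixAccount lookup replacing a NIS passwd-map query.
# Server URI and base DN are hypothetical; requires the python-ldap module.
import ldap

LDAP_URI = "ldap://ldap.example.bnl.gov"        # hypothetical server
BASE_DN = "ou=People,dc=example,dc=bnl,dc=gov"  # hypothetical base DN

def lookup_user(uid):
    """Return posixAccount attributes for a login name."""
    conn = ldap.initialize(LDAP_URI)
    conn.simple_bind_s()                        # anonymous bind
    attrs = ["uidNumber", "gidNumber", "homeDirectory", "loginShell"]
    return conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(uid=%s)" % uid, attrs)

if __name__ == "__main__":
    for dn, entry in lookup_user("hollowec"):
        print(dn, entry)
```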

  15. Grid Activities • Brookhaven planning to upgrade external network connectivity from OC12 (622 Mbps) to OC48 (2.488 Gbps) to support ATLAS activity • ATLAS Data Challenge 2: jobs submitted via Grid3 • GUMS (Grid User Management System) • Generates grid-mapfiles for gatekeeper hosts • In production since May 2004
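GUMS itself is not shown here, but the grid-mapfile it generates for the gatekeepers is a plain text file mapping certificate subject DNs to local accounts. The sketch below only illustrates that file format; the DNs and account names are invented.

```python
#!/usr/bin/env python3
# Illustration of the grid-mapfile format produced for gatekeeper hosts
# (this is not GUMS itself).  The DNs and local account names are invented.

MAPPINGS = {
    "/DC=org/DC=doegrids/OU=People/CN=Example User 12345": "usatlas1",
    "/DC=org/DC=doegrids/OU=People/CN=Another User 67890": "rhicgrid",
}

def write_gridmap(path, mappings):
    """Write one '"<subject DN>" <local account>' line per mapping."""
    with open(path, "w") as f:
        for dn, account in sorted(mappings.items()):
            f.write('"%s" %s\n' % (dn, account))

if __name__ == "__main__":
    write_gridmap("grid-mapfile", MAPPINGS)
```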

  16. Storage Resource Manager (SRM) • SRM: middleware providing dynamic storage allocation and data management services • Automatically handles network/space allocation failures • HRM (Hierarchical Resource Manager)-type SRM server in production • Accessible from within and outside the facility • 350 GB cache • Berkeley HRM 1.2.1
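The "automatically handles network/space allocation failures" point boils down to the client not having to retry transfers by hand. The sketch below is purely conceptual, showing a retry-with-backoff loop of the kind SRM provides internally; it is not the SRM or HRM client API.

```python
#!/usr/bin/env python3
# Conceptual sketch only: retry-with-backoff of the kind SRM layers over raw
# transfers.  This is not the SRM/HRM client API; transfer_fn stands in for
# any callable that raises on a network or space-allocation failure.
import time

def transfer_with_retry(transfer_fn, retries=5, delay=60):
    """Call transfer_fn, retrying after transient failures."""
    for attempt in range(1, retries + 1):
        try:
            return transfer_fn()
        except (IOError, OSError) as err:
            print("attempt %d failed: %s" % (attempt, err))
            if attempt == retries:
                raise
            time.sleep(delay)         # wait before re-requesting space/transfer
```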

  17. dCache • Provides a global name space over disparate storage elements • Hot spot detection • Client data access through the libdcap library or the libpdcap preload library • ATLAS & PHENIX dCache pools • PHENIX pool expanding performance tests to production machines • ATLAS pool interacting with HPSS using HSI: no way of throttling data transfer requests yet
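The two client access routes mentioned above can be sketched as follows: an explicit dccp copy (linked against libdcap), and an unmodified application run with the libpdcap preload library. The door host, port, pnfs path, and analysis binary are hypothetical.

```python
#!/usr/bin/env python3
# Sketch of dCache client access via dccp and via the libpdcap preload
# library.  The door host, port, pnfs path, and analysis binary are all
# hypothetical examples.
import os
import subprocess

DCAP_URL = "dcap://dcdoor.example.bnl.gov:22125/pnfs/example.bnl.gov/phenix/run4/sample.root"

# Explicit copy out of a dCache pool with dccp (uses libdcap)
subprocess.check_call(["dccp", DCAP_URL, "/tmp/sample.root"])

# Unmodified application opening the same file with libpdcap preloaded,
# which intercepts its I/O calls and routes them through dcap
env = dict(os.environ, LD_PRELOAD="libpdcap.so")
subprocess.check_call(["./analysis_job", DCAP_URL], env=env)   # hypothetical binary
```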
