Site Report: The Linux Farm at the RCF

HEPIX-HEPNT, October 22-25, 2002. Ofer Rind, RHIC Computing Facility, Brookhaven National Laboratory.

Presentation Transcript


  1. Site Report: The Linux Farm at the RCF
     HEPIX-HEPNT, October 22-25, 2002
     Ofer Rind, RHIC Computing Facility, Brookhaven National Laboratory

  2. RCF - Overview
     Provide computing facilities for RHIC users:
     • General computing environment
     • General interactive tasks (email, document processing, web)
     • Data analysis facility
     • Computing infrastructure for RHIC experiments
     • Code development, repository & distribution
     • Raw data recording & reconstruction
     • Data analysis
     ACF: US Atlas Tier 1 Computing Facility
     • Shared infrastructure and synergy with RCF
     Support staff: 25 FTEs (4 dedicated to Linux Farm)
     Ofer Rind - RHIC Computing Facility Site Report

  3. RCF - Structure
     [Figure: RCF structure diagram]

  4. RCF - Component Summary
     Mass Storage Subsystem
     • StorageTek library managed by HPSS
     • 4 Silos, 1.2PB capacity (expanding to 4.5PB)
     • In Run-2, raw data recorded at a common rate of 70MB/sec for a total of 170TB
     • Total data store ~300TB
     Disk Storage
     • Fibre channel SAN served by NFS
     • ~110TB RAID5
     • 14 Sun 450, Solaris 8 [2-02] (5 Sun 480 coming online)
     • IBM AFS servers (AIX)
     Linux Server Farm
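To put the Run-2 figures on slide 4 in perspective, the short calculation below estimates how long it would take to write 170TB at a sustained 70MB/sec. It is only a back-of-the-envelope sketch: it assumes decimal units and continuous recording at the quoted rate, neither of which the slide states, and the function name is illustrative.

```python
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes)
# and recording sustained at the quoted rate with no dead time.
RATE_MB_PER_S = 70   # rate quoted on the slide
TOTAL_TB = 170       # Run-2 raw data volume quoted on the slide


def recording_days(total_tb: float, rate_mb_per_s: float) -> float:
    """Days of continuous recording needed to write total_tb at rate_mb_per_s."""
    seconds = (total_tb * 1e12) / (rate_mb_per_s * 1e6)
    return seconds / 86400  # seconds per day


if __name__ == "__main__":
    print(f"{recording_days(TOTAL_TB, RATE_MB_PER_S):.1f} days")  # prints "28.1 days"
```

Under these assumptions the total corresponds to roughly four weeks of uninterrupted recording at the quoted rate.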

  5. Linux Farm Hardware
     • 840 1U and 2U servers (pre-'99 towers have been retired)
     • 69 kSPECint95, expanding to 100 kSPECint95 (2+ TFLOPS)
     • Most have 1GB mem (at least 500MB)
     • Local SCSI disks up to 140GB/node
     • Allocated by experiment
     • Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)

     Vendor     CPU             Nodes   Date
     VA Linux   PIII 450MHz     148     Jun 99
     VA Linux   PIII 700MHz     48      Aug 00
     VA Linux   PIII 800MHz     168     Nov 00
     IBM        PIII 1000MHz    316     Aug 01
     IBM        PIII 1400MHz    160     Oct 02

  6. Linux Farm Software Configuration
     • RedHat 7.2 upgraded to 2.4.9-31 kernel
     • Image(s) installed via Kickstart server and customized for RCF environment via rpm
     • NFS + AFS home directory and file access
     • Interactive login allowed on selected nodes
     • Job management:
       • (CAS) LSF 4.2 - slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.
       • (CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.
     • Expanding use of distributed disk models (rootd, ??)
     • Atlas Grid testbed

  7. Tracking LSF Usage
     [Figure: Star queues weekly job statistics (week of Oct. 10). Panels: job starts/hr, avg runtime/hr, runtime.]

  8. Security and Monitoring
     Security:
     • RCF firewall within BNL site firewall
     • SSH2-only access through gateway bastion nodes (Solaris x86)
     • User access restricted to a subset of systems (CAS only)
     Monitoring:
     • 24 hr. on-call staff for critical systems during RHIC operation
     • Cluster mgmt. software:
       • VACM (VA Linux)
       • xCAT (IBM, http://www.x-cat.org)
     • Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)
     • CTS system for problem reports
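The cron-driven "clean" scripts mentioned on slide 8 are not shown in the talk. As a rough illustration of the full-disk case only, a check of this kind could look like the sketch below. The 90% threshold, the function names, and the choice of the current directory as the path to check are all assumptions made for the example, not details of the RCF scripts.

```python
import shutil


def fraction_used(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_cleanup(used_fraction: float, threshold: float = 0.90) -> bool:
    """True when usage has reached the (hypothetical) cleanup threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # "." stands in for whatever local scratch area a real script would check.
    used = fraction_used(".")
    state = "needs cleanup" if needs_cleanup(used) else "ok"
    print(f"{used:.0%} used: {state}")
```

A script like this would typically be run from cron at a fixed interval; what it does once the threshold is crossed (purging scratch files, raising an alert) is site policy and is left out here.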

  9. Farm Alert System
     Web-monitoring (user-accessible) plus paging/email alerts
     Python scripts running locally transfer node status information to a MySQL database.
     Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
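The data flow described on slide 9 (per-node scripts writing status rows to a database, with alerts derived from those rows) can be sketched as follows. This is not the RCF code: the table layout, column names, function names, and load threshold are hypothetical, and the standard-library `sqlite3` module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
import time

# Hypothetical schema: one row per status report from a node.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node_status (
    node       TEXT    NOT NULL,
    checked_at REAL    NOT NULL,
    load1      REAL    NOT NULL,
    nfs_ok     INTEGER NOT NULL
)
"""


def record_status(conn, node, load1, nfs_ok, now=None):
    """Store one status report; `now` may be supplied for reproducible tests."""
    checked_at = time.time() if now is None else now
    conn.execute(
        "INSERT INTO node_status (node, checked_at, load1, nfs_ok) "
        "VALUES (?, ?, ?, ?)",
        (node, checked_at, load1, int(nfs_ok)),
    )
    conn.commit()


def nodes_needing_alert(conn, max_load=10.0):
    """Nodes whose most recent report shows high load or an NFS problem."""
    rows = conn.execute(
        """
        SELECT s.node
        FROM node_status AS s
        JOIN (SELECT node, MAX(checked_at) AS latest
              FROM node_status GROUP BY node) AS m
          ON s.node = m.node AND s.checked_at = m.latest
        WHERE s.load1 > ? OR s.nfs_ok = 0
        ORDER BY s.node
        """,
        (max_load,),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, after `conn = sqlite3.connect(":memory:")` and `conn.execute(SCHEMA)`, each node-side script would call `record_status`, and a central job would page or email for every name returned by `nodes_needing_alert(conn)`. Only the latest report per node is considered, so a node that has recovered drops off the alert list.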

  10. Network Operation Status
      Perl scripts monitor network service connectivity for all nodes (ssh, yp, etc.)
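The connectivity checks on slide 10 were written in Perl and are not reproduced in the talk. The sketch below shows the same idea in Python, as a plain TCP connect test per host and service. The host names and function names are hypothetical. Port 22 is the standard ssh port; yp (NIS) is RPC-based and uses dynamically assigned ports, so a simple connect test can at best probe the portmapper on port 111 as a rough proxy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name lookup failed
        return False


def check_services(hosts, services, timeout=2.0):
    """Map (host, service name) to reachability for every host/service pair."""
    return {
        (host, name): port_open(host, port, timeout)
        for host in hosts
        for name, port in services.items()
    }


# Example call with hypothetical node names:
# check_services(["node001", "node002"], {"ssh": 22, "portmap": 111})
```

A successful connect only shows that something is listening on the port. It does not confirm that the service behind it is healthy, which is why a production monitor usually adds protocol-level probes.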

  11. Load Monitoring and History
      MySQL database for usage history
      History available back to Sept. '01 via web interface.
      [Figure: CPU load averaged over (98) Phenix machines during the month of September.]
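The plot on slide 11 shows load averaged across a group of machines over time. A minimal sketch of that aggregation is given below, assuming the history is available as `(timestamp, node, load)` samples. That record layout and the function name are assumptions for illustration; the slide does not describe the actual database schema.

```python
from collections import defaultdict
from statistics import mean


def cluster_average(samples):
    """Average load across nodes at each timestamp.

    `samples` is an iterable of (timestamp, node, load) tuples; the result
    maps each timestamp to the mean load over the nodes that reported then.
    """
    by_time = defaultdict(list)
    for timestamp, _node, load in samples:
        by_time[timestamp].append(load)
    return {ts: mean(loads) for ts, loads in sorted(by_time.items())}


if __name__ == "__main__":
    demo = [("t0", "a", 1.0), ("t0", "b", 3.0), ("t1", "a", 2.0)]
    print(cluster_average(demo))  # {'t0': 2.0, 't1': 2.0}
```

Nodes that miss a sample are left out of that timestamp's mean rather than counted as zero load, so a node outage does not pull the average down.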

  12. Plans for the Near Future
      • 160 newly delivered IBM nodes to be brought online
      • Expect purchase bid to go out for ~220 more nodes at beginning of FY03 (pending funding approval)
      • Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)
      • Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services
      • Expanding Atlas GRID services
