This report outlines the recent upgrades and improvements in the central computing systems at the Thomas Jefferson National Accelerator Facility in Newport News, Virginia. Significant enhancements include a transition to Solaris 8 and HP 11i systems, the evaluation of RedHat 10, and the upgrade of Windows 2000 Domain services. The facility has expanded online disk space and improved batch farm management and monitoring via JASMine and Auger. Notable projects involve implementing enhanced backup systems, establishing better file scheduling, and addressing issues related to network infrastructure and storage solutions.
Jefferson Lab Site Report
Kelvin Edwards
Thomas Jefferson National Accelerator Facility
Newport News, Virginia USA
Kelvin.Edwards@jlab.org
757-269-7770
http://cc.jlab.org
HEPiX - TRIUMF, Oct. 20, 2003
Central Computing
• Sun systems
  • Upgrade to Solaris 8 almost complete
• HP systems
  • All upgraded to HP 11i
  • Moving away from HP for central services
• Linux systems
  • Still at RedHat 7.2
  • Evaluating RedHat 10 (Fedora 1)
• Windows 2000 Domain Upgrade
  • Implemented in May
  • Working on Group Policy issues
Central Computing (cont)
• Network Appliance
  • 2 recently upgraded to the FAS940 (~16k NFS Ops/sec)
  • ~4.5TB online disk space (1.5TB home, 2TB group)
• Linux fileserver
  • 3Ware SATA system
  • 2TB scratch area (16 × 160GB Seagate SATA drives)
• Backups
  • QuickRestore
  • Seagate LTOs, Overland Tape Library
Scientific Computing
• JASMine & Auger (http://cc.jlab.org/scicomp)
  • JASMine: Mass Storage Tape + Disk Cache
  • Auger: Batch Farm Management & Monitoring
• Typical Day
  • 2 – 4 TB of INPUT data through the farm
  • Process 2000 – 5000 jobs
• Certificates used for all user authentication
• Tape drives
  • 6 × 9840 – migrating data to 9940Bs
  • 13 × 9940A – Read only
  • 15 × 9940B – all data written to these tapes
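To put the "Typical Day" figures in perspective, the daily input volume can be converted into an average sustained rate. This is only a rough sketch: it assumes decimal terabytes and an even spread over 24 hours, neither of which the slides state, and real farm traffic is likely burstier. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes, not TiB
BYTES_PER_MB = 10**6


def average_rate_mb_per_s(tb_per_day: float) -> float:
    """Mean rate in MB/s if the daily volume were spread evenly over 24 hours."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_MB


if __name__ == "__main__":
    # 2 and 4 TB/day are the bounds quoted on the slide.
    for tb in (2, 4):
        rate = average_rate_mb_per_s(tb)
        print(f"{tb} TB/day is about {rate:.1f} MB/s averaged over a day")
```

Under these assumptions, 2 – 4 TB per day works out to roughly 23 – 46 MB/s of sustained input.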
Scientific Computing (cont)
• Linux File Servers
  • 16 Data Movers
    • 10 Mylex eXtremeRAID 2000 RAID cards (RAID-5) (SCSI)
    • 6 Adaptec 2200S RAID cards (RAID-50) (U320 SCSI)
  • 32 Cache/Work File Servers
    • Mixture of Mylex and 3Ware cards
• Batch Farming – over 24000 SPECint95, LSF
  • 178 RH 7.2 Linux dual-processors (P2 750 to P4 2.66GHz)
Noteworthy
• kswapd failures -- Solved
• Automount timeouts set to 60 seconds, NOT minutes
• Adaptec 2200S RAID cards
  • Instead of the MegaRAID cards
  • Not quite as fast, but acceptable
  • Timeout problem -- fix available
• Adaptec TOE (TCP Offload Engine)
  • Problems with RH7.2, custom kernel (XFS), and their driver
  • Anyone else using them? Good results?
Projects
• Windows
  • Standard builds (Server, IIS, desktop, laptop)
• Backup Software Upgrade
  • Reliaty (was QuickRestore)
• SSH v2 Internally
• Networks
  • Gigabit connection to our border router
  • VLANs for use on site
Projects (cont)
• JASMine
  • Rewrite disk cache
  • Support farm output caches
  • Policy-based file movement off-site
• Auger
  • Better file scheduling/pinning
Projects (cont)
• PPDG
  • SRM version 2
  • Replication
    • Replica Catalog web service interface
  • Remote Job submission
    • User and System JDLs
    • Batch web service integration with Auger