
CASPUR Site Report



Presentation Transcript


1. CASPUR Site Report
   Andrei Maslennikov, Sector Leader - Systems
   Orsay, April 2001

2. To be covered briefly:
• Central computers
• Storage
• Tape-related systems
• Central services
• AFS, SSH...
• External support
• CASPUR and HEP
• Projects for year 2001

3. Central computers
• Sun SMP - 3500/4500 - 22 processors - Solaris 7+
  - Parallel batch (GRD/Codine) on 8 (336 MHz/2 GB) and 14 (336 MHz/3.6 GB) CPUs (a sample job script follows this slide)
  - Not very popular, load < 40%
• Alpha SMP - 4100/ES40 - 48 processors - DU 4.0F+
  - Front-end: 4 CPUs x 500 MHz/2 GB + 2 x 532 MHz/1 GB
  - Parallel batch (GRD/Codine): 20 CPUs x 500 MHz/2 GB + 20 x 400 MHz/1 GB
  - Serial batch (GRD/Codine): 2 CPUs x 667 MHz/1 GB
  - Systems very stable for the past year
  - Batch load: 100%
• IBM SMP - Power3 - 72 processors - AIX 4.3.3 ML8 / PSSP 3.2+
  - Front-end: 4 CPUs x 375 MHz/4 GB
  - Parallel batch (GRD/Codine): 64 CPUs x 375 MHz/16 GB - 4 nodes on the Colony switch
  - Serial batch: 4 CPUs (2 x 375 / 2 x 200 MHz)
  - Batch load: 100%
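
For orientation, all batch work above goes through GRD/Codine, the direct ancestor of Sun Grid Engine. A minimal job script might look like the sketch below; the job name, resource request and executable are invented for illustration and are not CASPUR's actual configuration.

    #!/bin/sh
    # Minimal GRD/Codine job script (illustrative only; the resource
    # request and executable name are hypothetical).
    #$ -cwd                # run in the submission directory
    #$ -N demo_job         # job name
    #$ -l h_cpu=3600       # request one hour of CPU time
    ./my_simulation        # hypothetical user executable

Such a script would be submitted with "qsub demo_job.sh" and monitored with "qstat".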

4. Storage
• Scratch areas
  - Local RAID-0 scratch areas on the Alphas and the Sun (30-40 GB per node)
  - IBM 2102 FC Storage Server for the SP3 with 560 GB - GPFS
• Large-file network data areas (NFS)
  - Two Network Appliance filers: F540 (150 GB/FE) and F760 (600 GB/GE)
  - F540: mainly used for tape staging and as temporary space on the LAN - being decommissioned
  - F760: dedicated to the number-crunching nodes and not saturated - 1 TB is being added
• Small-file network data areas (AFS) (a sample volume set-up is sketched after this slide)
  - Some 1.4 TB of DotHill RAID-5 on a switched Fibre Channel SAN
  - 4 servers on the CASPUR LAN (4 x Sun UltraSparc 440)
  - 70 GB of commodity disk on the WAN (Bari and Lecce, 2 x Intel/Linux RH 7.0++/OAFS)
• NAS and SAN
  - NAS: F760, AFS servers <=> stager and main number-crunchers: being migrated to GE
  - SAN: 48 Brocade ports, 10 hosts, 6 disk systems, 6 tape drives: growing fast
  - Under discussion: tests of 4 DotHill 7120 systems of 0.7 TB each as the GPFS core on the SP3
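
The AFS areas are administered with the standard AFS tools. As a minimal sketch, creating and mounting a new user volume on one of the servers could look like the session below; the server name, partition and user are invented for illustration, only the cell name is CASPUR's.

    # Create a volume on server afs1, partition /vicepa, with a
    # ~2 GB quota (server, partition and user names are invented).
    vos create afs1 /vicepa user.jdoe -maxquota 2000000
    # Mount it in the cell tree and give the owner full rights.
    fs mkmount /afs/caspur.it/user/jdoe user.jdoe
    fs setacl /afs/caspur.it/user/jdoe jdoe all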

5. Tape-related systems
• Tape drives and robotics
  - The STK 9740 library (494 slots) is no longer sufficient to host both CASPUR and BABAR data
  - Just acquired a new LTO/FC 3584 system from IBM, with 300 slots and 4 drives
  - The choice was also influenced by the work done at CERN (Baud, Collin, Curran):
    http://cscct.home.cern.ch/cscct/ultrium/index.htm
  - LTO: excellent streaming speeds - measured 15 MB/s native
  - LTO: positioning is slow (avg. 100 s vs 15 s for the 9840); IBM is working on it
  - LTO: 1 drive costs 1/4th of a 9840, 1/?th of a 9940
  - Currently on FC: 9840 (bridged), DLT7000 (bridged), 4 x LTO (native)
• Tape services
  - Automated ADSM backup for some 20 service hosts and Windows desktops
  - Automated AFS backup
  - Tape locking via the Tape Dispatcher
  - Staging servers for CASPUR and BABAR (since 1998) (a catalogue-lookup sketch follows this slide):
    o fully portable (Perl + MySQL)
    o redundant data format
    o multitape support
    o users handle only file names
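
The stager itself is only characterized here as "Perl + MySQL" with users handling only file names, so the following is a minimal sketch of the core idea - resolving a requested file name to the tape volume and position that hold it. The database, table and column names are invented for illustration; the real CASPUR stager schema is not shown in this report.

    #!/usr/bin/perl -w
    # Sketch of a stager catalogue lookup (Perl + MySQL via DBI).
    # Schema names are invented for illustration.
    use strict;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=stager;host=localhost',
                           'stager', 'secret', { RaiseError => 1 });

    # The user supplies only a file name; the stager resolves it to
    # a tape volume and a file position on that volume.
    my $filename = shift @ARGV or die "usage: $0 <filename>\n";

    my $sth = $dbh->prepare(
        'SELECT volume, fseq FROM tape_files WHERE name = ?');
    $sth->execute($filename);

    if (my ($volume, $fseq) = $sth->fetchrow_array) {
        print "stage: $filename <- tape $volume, file seq $fseq\n";
        # ...the real stager would now lock the tape via the
        # Tape Dispatcher and copy the file to the staging pool...
    } else {
        die "$filename: not known to the stager\n";
    }
    $dbh->disconnect;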

6. Central services
• All our central services are Linux-based:
  - Syscontrol, DNS, Web, Mail, License, Print, Remote Access, DB
  - Linux system tree - always up to date
  - CASPUR BigBox CD with OAFS and SSH/OAFS - always fresh
• During the last year:
  - Moved to uniform hardware: rack-mounted systems (VA Linux)
  - System disks of the Syscontrol and DB hosts on a Mylex DAC960 controller (RAID-5)
  - Mail: implemented a commercial HA solution from SteelEye (LifeKeeper) (a failover probe sketch follows this slide):
    o redundant heartbeat (serial and Ethernet)
    o RAID-5 spool and software on a low-end Infortrend controller with 2 host channels
    o pinged mail while halting the current host: only 5 packets lost
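
The quoted failover figure ("only 5 packets lost") comes from probing the mail service while the active node was halted. A test of that kind can be reproduced with a few lines of Perl, here probing the SMTP port once per second; the host name is a placeholder, not CASPUR's mail server.

    #!/usr/bin/perl -w
    # Probe a mail server once per second during a failover and
    # count the probes that fail ('mail.example.org' is a placeholder).
    use strict;
    use Net::Ping;

    my $p = Net::Ping->new('tcp', 2);   # TCP connect probe, 2 s timeout
    $p->port_number(25);                # SMTP port
    my $lost = 0;
    for my $i (1 .. 60) {
        $lost++ unless $p->ping('mail.example.org');
        sleep 1;
    }
    print "$lost probes lost out of 60\n";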

7. AFS, SSH...
• AFS
  - OAFS is a marvel (cheap servers become possible)
  - Free enhanced OAFS client RPMs available at: /afs/caspur.it/project/openafs
  - Badly missing: an AFS port for Compaq Tru64 5.1
    o Transarc is doing nothing
    o An OAFS port may be done at KTH; we are now trying to help them get the OS source
    o Anybody else interested?
  - Maintenance contract: IBM cannot make an offer for more than a year. We currently receive support free of charge, but this will presumably end soon.
• SSH
  - 1.2.x is dangerous
  - Migrated urgently to OpenSSH 2.3.0p1 (AFS-aware, with direct authentication and watcher) on all architectures (a sample server configuration fragment follows this slide)
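
For reference, OpenSSH builds of that era compiled with AFS/Kerberos IV support exposed token passing through sshd_config options that were later removed from OpenSSH. The fragment below is illustrative of such a build, not CASPUR's actual configuration (which also carries local patches for direct authentication and the watcher).

    # Fragment of sshd_config for an AFS-aware OpenSSH 2.3.x build
    # (illustrative; these options only exist when the daemon is
    # compiled with AFS/Kerberos IV support).
    Protocol 1,2
    KerberosAuthentication yes
    KerberosTgtPassing yes
    AFSTokenPassing yes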

8. External support
• Turnkey departmental solution
  - OAFS cell on Linux
    o automated backup required (DLT or AIT autoloaders)
    o redundant disk where possible
    o user and space management tools
    o UNIX and Windows clients, Mac AFS gateway
    o the server is normally loaded with many other services: web, mail, DNS, NIS, etc.
  - Organization of work
    o a locally trained person per cluster is a must
    o a no-root-password policy is a must
    o remote-only support (notification mainly via e-mail)
    o at most 20% of total FTE resources dedicated (mainly for the initial set-up)
• Outside CASPUR:
  - 7 clusters (number 8 just ordered) with about 80 nodes
  - Some 20 stand-alone machines (we are getting rid of these)
  - All kinds of hardware and all flavours of UNIX

9. CASPUR and HEP
• Everyday work for INFN:
  - Full-scale AFS system support (maintenance and hotline)
  - ASIS mirroring to 18 INFN sections
  - SSH tree maintenance
  - Linux tree maintenance, incl. bootable CDs at the latest patch level
• BABAR cluster at CASPUR
  - 5 Sun E450 servers with 6 TB of disk (Sun, Compaq, DotHill)
  - Linux/OAFS file server with backup
  - 10-host Sun MC farm (Ultra 5)
  - 14-host Intel/Linux MC farm (rack-mounted, + 1 TB of IDE RAID)
  - Multitape stager on an E450 - 2 STK 9840 drives
  - GRD/Codine on all nodes
• Other
  - Regular exchanges with CERN
  - Virgo (software)

10. Projects for year 2001
• Syscontrol DB
  - MySQL now; migration to InterBase by the end of 2001
  - Hosts' DB and syslog event collector DB
  - Hooks for Syscontrol applications
• Control and monitoring
  - Agent up and running on all Linux hosts
  - Being ported to other architectures (encryption)
  - Server integration with the Syscontrol DB (event logs and configuration) (a collector sketch follows this slide)
• Problem management
  - Currently studying possible solutions; Razor is one of the options
• Console server
  - Planned for the second half of 2001
  - Currently evaluating the serial hardware
• Security
  - Accent on host-based security
  - A host security "index" is being developed, to be integrated with Syscontrol
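
As a rough illustration of the syslog event collector idea (feeding syslog lines into the Syscontrol DB), the sketch below reads syslog-format lines on standard input and inserts them into MySQL. The database and table layout are invented for illustration; the real Syscontrol schema is not described in this report.

    #!/usr/bin/perl -w
    # Sketch of a syslog event collector: read syslog lines on stdin
    # (e.g. piped from syslogd) and store them in the Syscontrol DB.
    # Database and table layout are invented for illustration.
    use strict;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=syscontrol;host=localhost',
                           'collector', 'secret', { RaiseError => 1 });
    my $ins = $dbh->prepare(
        'INSERT INTO syslog_events (host, received, message)
         VALUES (?, NOW(), ?)');

    while (my $line = <STDIN>) {
        chomp $line;
        # Naive parse of "Mon DD HH:MM:SS host message..."
        my ($host, $msg) = $line =~ /^\S+\s+\S+\s+\S+\s+(\S+)\s+(.*)$/;
        next unless defined $host;
        $ins->execute($host, $msg);
    }
    $dbh->disconnect;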
