
NL-T1 Report




  1. NL-T1 Report Ron Trompert

  2. Contents
  • Infrastructure
  • Usage
    • CPU
    • Disk
    • Tape
  • I/O
    • Disk storage
    • Compute
    • Tape
  • LHCOPN dashboard
  • Issues
  • New procurements
  • dCache

  3. Infrastructure

  4. Infrastructure

  5. CPU Usage

  6. CPU Usage (charts: SARA, NIKHEF)

  7. ATLAS Disk Usage (charts: SARA disk in GB, NIKHEF disk in GB)

  8. ATLAS Tape Usage (chart: SARA tape usage in TB)

  9. I/O storage (charts: in and out traffic)

  10. I/O Gina (chart)

  11. I/O tape
  • Read performance: about 400-500 MB/s from tape to dCache on average when there is no heavy tape writing going on
  • Some on-the-fly tuning was done to get there:
    • Adapted the dCache HSM copy script
    • CXFS/DMF client node tuning (queue lengths)
  • Performance is OK given the circumstances, but short of the 1 GB/s we aim for; it should improve with new hardware and DMF 5
  • Replaced HPN-SSH with Globus GridFTP for copying files between the CXFS/DMF client nodes and dCache
  • Write performance is also about 400-500 MB/s
  • Adapted the HSM copy script to compute checksums and to flush and fsync writes on the CXFS/DMF clients to avoid data corruption (see the sketch below)
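A minimal sketch of the copy-script logic described in the last bullet, assuming a dCache-style HSM "put" script that receives a source and destination path on its command line; the argument layout, chunk size, and checksum choice (adler32) are illustrative assumptions, not SARA's actual script. The point is the pattern: checksum while copying, then flush and fsync before reporting success.

```python
#!/usr/bin/env python
# Illustrative HSM copy sketch (hypothetical interface, not dCache's real one):
# copy src to dst, compute an adler32 checksum on the way, and flush/fsync
# before returning so a crash cannot leave a silently corrupted file behind.
import os
import sys
import zlib

CHUNK = 4 * 1024 * 1024  # 4 MiB reads keep the tape/SAN pipeline busy

def copy_with_checksum(src_path, dst_path):
    checksum = 1  # adler32 seed value
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
            dst.write(chunk)
        dst.flush()             # push Python's buffers to the kernel
        os.fsync(dst.fileno())  # force the kernel to commit data to disk
    return checksum & 0xffffffff

if __name__ == '__main__':
    src, dst = sys.argv[1], sys.argv[2]
    print('adler32=%08x' % copy_with_checksum(src, dst))
```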

  12. LHCOPN Dashboard

  13. Issues
  • Part of the farm at NIKHEF has not been usable due to long-standing network issues related to the built-in switches of the blade centers delivered in the autumn of 2009. The vendor has not been very active in resolving this, but we hope to have a solution soon.
  • Because of the issue above, ATLAS jobs run on only part of the farm, so they stay queued for longer. Pilot factories submit so many jobs that, behind this huge queue, the batch system sometimes finds no runnable non-ATLAS jobs at all, which leaves job slots unused.
  • According to the VO ID card, ATLAS jobs need 3072 MB of virtual memory. Our batch system limits this at 4096 MB, and even that is still too small for some ATLAS jobs. How is ATLAS going to tackle this? (A sketch of what such a limit amounts to follows this list.)
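For context on the last bullet: the batch-system cap is, in effect, a per-process address-space rlimit. A minimal sketch of imposing the same 4096 MB limit on a child process using the standard Python resource module; the job script name is a placeholder, and the mechanism our batch system actually uses may differ.

```python
# Sketch: cap a child process at 4096 MB of virtual memory (address space),
# the same kind of limit the batch system applies to ATLAS jobs. A job that
# allocates past the cap sees ENOMEM / MemoryError instead of growing further.
import resource
import subprocess

VMEM_LIMIT = 4096 * 1024 * 1024  # the 4096 MB cap from the slide

def limit_vmem():
    # applied in the child after fork(), before exec()
    resource.setrlimit(resource.RLIMIT_AS, (VMEM_LIMIT, VMEM_LIMIT))

subprocess.call(['./atlas_job.sh'], preexec_fn=limit_vmem)  # hypothetical job script
```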

  14. Issues
  • Is ATLAS able to use CVMFS via the mount point /cvmfs/atlas.cern.ch/? This would solve two problems (a mount-check sketch follows this slide):
    • A third of the content of the BDII
    • No quota on the experiment software disk
  • We have seen transfers from ATLASLOCALGROUPDISK to elsewhere. Isn't LOCAL supposed to be local?
  • FTS channels: wouldn't it be good to let the site admins within the NL cloud be channel admins of their own channels? Then you can tune a channel any way you want, or turn yourself off when going into downtime.
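On the CVMFS question: a worker node or pilot can verify cheaply that the ATLAS repository is actually mounted before a job relies on it. A minimal sketch, assuming the standard /cvmfs/atlas.cern.ch mount point from the slide:

```python
# Sketch: check that the ATLAS CVMFS repository is mounted and populated.
# Listing the directory triggers the autofs mount and fails fast if the
# mount is broken, which a bare os.path.isdir() check would not catch.
import os

CVMFS_ATLAS = '/cvmfs/atlas.cern.ch'

def cvmfs_available(repo=CVMFS_ATLAS):
    try:
        return bool(os.listdir(repo))  # non-empty listing == usable repo
    except OSError:
        return False

if not cvmfs_available():
    raise SystemExit('CVMFS repository %s is not available' % CVMFS_ATLAS)
```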

  15. New procurements
  • Compute: 50 kSI2006-rate in total: 27 kSI2006-rate at NIKHEF and 23 kSI2006-rate at SARA
  • Tape: 2 PB
  • Disk: 850 TiB at SARA, 280 TiB at NIKHEF
  • Pledges are still under discussion

  16. New procurements: Mass storage
  • Scalable solution with DMF 5
  • Investigating faster SAN storage

  17. dCache@SARA
  • The Golden Release 1.9.5-* has been a very reliable workhorse over the past year. But...
  • There will be a new Golden Release, 1.9.12, with some very nice features for admins as well as users, for example:
    • srmGetTurl no longer waits the standard 4 seconds
    • WebDAV (http/https): mount dCache on your laptop (see the sketch below)
  • So, we intend to upgrade.
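To illustrate the WebDAV door: once 1.9.12 serves files over http/https, a plain HTTP client can read them directly, with no SRM round trip (and no srmGetTurl wait) involved. A minimal sketch using the Python requests library; the door host, port, namespace path, and proxy location are placeholders, not SARA's actual endpoints.

```python
# Sketch: read a file from a dCache WebDAV door over https, authenticating
# with a grid proxy certificate. All names below are illustrative.
import requests

DOOR = 'https://dcache.example.org:2880'  # hypothetical WebDAV door
PROXY = '/tmp/x509up_u1000'               # hypothetical grid proxy (cert+key PEM)

resp = requests.get(DOOR + '/pnfs/example.org/data/atlas/somefile',
                    cert=PROXY,                               # client certificate
                    verify='/etc/grid-security/certificates',  # CA directory
                    stream=True)
resp.raise_for_status()
with open('somefile', 'wb') as out:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        out.write(chunk)
```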
