NorthGrid Status. Alessandra Forti, GridPP25, Ambleside, 25 August 2010.
Outline: APEL pies, Lancaster status, Liverpool status, Manchester status, Sheffield status, conclusions.
APEL pies (1)-(3).
Lancaster.
The new shared HPC facility at Lancaster.
1760 cores, delivering over 20k HEPSPEC06.
Behind an LSF batch system.
In a proper machine room (although that's had teething troubles).
Added bonus of also providing a home for future GridPP kit.
Access to a Panasas shelf for high-performance tarball and VO software areas.
260 TB Grid-only storage included in purchase.
Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.
But with root access comes root responsibility.
New "HEC" facility almost ready.
Final stages of acceptance testing.
A number of gotchas and niggles are causing some delays, but this is expected given the facility's scale.
Underspecced fuses, overspecced UPS, overheating distribution boards, misbehaving switches, oh my!
Site otherwise chugging along nicely (with some exceptions).
New hardware to replace older servers
Site needs a downtime in the near future to get these online and have a general clean up.
Tendering for an additional 400 TB of storage.
Liverpool.
Addition to Cluster
32-bit cluster replaced by 9 x quad chassis (4 worker nodes in each), boosting 64-bit capability and reducing power consumption.
Each node has 2 x X5620 CPUs (4-core), 2 x hot-swap SATA disks, IPMI and a redundant power supply.
Runs 64-bit SL5, gLite 3.2.
6 x Quad core worker nodes (56 cores)
16 x Quad chassis, 4 WNs each (512 cores)
3 GB RAM per core
Storage: 286 Terabytes
Plans: migrate from VMware to KVM (lcg-CE, CREAM-CE, site BDII, MON box, ...).
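If it helps to picture the KVM side of that plan, the sketch below defines and boots a guest for one of the service nodes through the libvirt Python bindings. The guest name, memory size, disk image path and bridge name are placeholders rather than the site's actual configuration, and a real migration would also involve converting the existing VMware images.

```python
# Minimal sketch: define and start a KVM guest for a grid service node
# via the libvirt Python bindings. All names, sizes and paths below are
# illustrative placeholders, not the real site configuration.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>cream-ce</name>                      <!-- hypothetical guest name -->
  <memory unit='MiB'>4096</memory>
  <vcpu>2</vcpu>
  <os><type arch='x86_64'>hvm</type><boot dev='hd'/></os>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/cream-ce.qcow2'/>  <!-- placeholder image -->
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>                 <!-- placeholder bridge -->
    </interface>
  </devices>
</domain>
"""

conn = libvirt.open('qemu:///system')   # connect to the local KVM hypervisor
dom = conn.defineXML(DOMAIN_XML)        # register the guest persistently
dom.create()                            # boot it
print(dom.name(), 'is running' if dom.isActive() else 'failed to start')
conn.close()
```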
Site shows as "not available" in gstat because GStat/Nagios doesn't recognise Red Hat Enterprise Server as an OS.
The OS was correctly advertised following the published procedure How_to_publish_the_OS_name.
The ticket has been reassigned to the gstat support unit, but the problem has affected the site Nagios tests.
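Since the dispute is about what OS name the site is actually publishing, a quick way to confirm it is an LDAP query for the GLUE 1.3 host attributes against the site BDII, sketched below with the ldap3 package; the BDII hostname and base DN are placeholders for the real site values.

```python
# Sketch: check which operating system name/release a site BDII is
# publishing for its hosts (GLUE 1.3 attributes). The hostname and
# search base below are placeholders for the real site values.
from ldap3 import Server, Connection, ALL

BDII_HOST = 'site-bdii.example.ac.uk'        # placeholder site BDII
SEARCH_BASE = 'mds-vo-name=resource,o=grid'  # typical GLUE 1.3 base DN

server = Server(BDII_HOST, port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)    # anonymous bind, as BDIIs allow

conn.search(SEARCH_BASE,
            '(objectClass=GlueHost)',
            attributes=['GlueHostOperatingSystemName',
                        'GlueHostOperatingSystemRelease',
                        'GlueHostOperatingSystemVersion'])

for entry in conn.entries:
    print(entry.GlueHostOperatingSystemName,
          entry.GlueHostOperatingSystemRelease,
          entry.GlueHostOperatingSystemVersion)

conn.unbind()
```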
Manchester.
Procurement at commissioning stage
New hardware will replace half of the current cluster
20 quad chassis
2x6 core per motherboard
13.7 HEPSPEC06 per core
4GB memory per core
125 GB disk space per core
2 Gb/s bonded link per motherboard
Total: 960 cores = 13.15k HEPSPEC06
(30 + 2 + 1) x 2 TB disks for the RAID (30 data disks + 2 parity disks + 1 hot spare)
2x250GB for the OS
Total: 540 TB usable
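The quoted totals are consistent with each other; the small check below reproduces them. Note that the 4-nodes-per-chassis reading of "quad chassis" and the implied count of storage units are inferences from the totals, not figures taken from the slides.

```python
# Quick consistency check of the procurement numbers quoted above.
# "4 nodes per quad chassis" and the number of storage units are
# inferences from the quoted totals, not numbers from the slides.

chassis = 20
nodes_per_chassis = 4            # assumed meaning of "quad chassis"
cores_per_node = 2 * 6           # 2 x 6-core CPUs per motherboard
hepspec_per_core = 13.7

cores = chassis * nodes_per_chassis * cores_per_node
print(cores, "cores")                                            # 960
print(round(cores * hepspec_per_core / 1000, 2), "k HEPSPEC06")  # ~13.15

data_disks, disk_tb = 30, 2      # 30 data disks + 2 parity + 1 hot spare
usable_per_unit = data_disks * disk_tb
print(usable_per_unit, "TB usable per RAID unit")                # 60
print(540 / usable_per_unit, "units implied by the 540 TB total")  # 9.0
```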
Computing nodes have arrived and are currently running on-site soak tests.
Evaluating the option of using ext4 as the file system rather than ext3.
Initial tests with iozone on old nodes show a visible improvement, especially in the multithreaded tests.
In the single-thread tests the improvement shows only in writes.
Need to start testing the new nodes.
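Something along the lines of the sketch below is enough to repeat the single-thread and multithreaded iozone runs on an ext3 and an ext4 mount and keep the reports for comparison; the mount points, file size, record size and thread count are placeholders rather than the exact parameters used on the old nodes.

```python
# Sketch of driving the ext3/ext4 comparison with iozone: one single-thread
# run and one multithreaded run per mount point. Mount points, sizes and
# thread count are placeholders, not the exact parameters used on site.
import subprocess

MOUNTS = {"ext3": "/mnt/ext3test", "ext4": "/mnt/ext4test"}  # placeholder mounts
THREADS = 4
SIZE, RECORD = "1g", "1m"

def run(cmd, logfile):
    # run iozone and keep the full report for later comparison
    with open(logfile, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)

for fs, mount in MOUNTS.items():
    # single-thread write/read test (-i 0 write, -i 1 read)
    run(["iozone", "-i", "0", "-i", "1", "-s", SIZE, "-r", RECORD,
         "-f", f"{mount}/iozone.tmp"],
        f"iozone-{fs}-single.log")

    # multithreaded throughput test: one file per thread
    files = [f"{mount}/iozone.{n}" for n in range(THREADS)]
    run(["iozone", "-i", "0", "-i", "1", "-s", SIZE, "-r", RECORD,
         "-t", str(THREADS), "-F", *files],
        f"iozone-{fs}-threads.log")
```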
Storage will be delivered this week
RAID card had a disk detection problem.
Issue now solved by a new version of the firmware.
Rack switches undergoing reconfiguration to allow:
4 x 2 Gb/s bonded links from the quad chassis
1 x 4 Gb/s bonded link from the storage units
8 x 1 Gb/s uplink to the main Cisco
Consistently delivered a high number of CPU hours
Current storage is stable, although slowed down by an SL5(.3?)/XFS problem that causes bursts of load on the machine.
Evaluating how to tackle the problem for the new storage.
Stefan Soldner-Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier2 from Roger Barlow.
Sheffield.
We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local-group cluster with a common Torque server.
Storage nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show the latest SL5 kernels no longer exhibit the XFS bug.
DPM head node 8 cores and 16 GB RAM
200 TB of disk space has been deployed, covering the pledge for 2010 according to the new GridPP requirements.
SW RAID5 (no hardware RAID controllers)
All disk pools assembled in Sheffield: 2 TB Seagate Barracuda disks and one 2 TB Western Digital.
5 x 16-bay units with 2 filesystems each, 2 x 24-bay units with 2 filesystems each.
Cold spare unit on standby near each server
95% of disk space reserved for ATLAS.
Plan to add 50 TB late in the autumn
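With software RAID5 and no hardware controllers, array health has to be watched from the OS itself. Below is a minimal sketch of the sort of check that could be run from cron on each pool server; it only parses /proc/mdstat and assumes nothing site-specific.

```python
# Sketch: flag degraded md (software RAID) arrays by parsing /proc/mdstat.
# A healthy array shows a status like [UUU]; a missing disk shows as "_".
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    bad, current = [], None
    with open(mdstat_path) as f:
        for line in f:
            if line.startswith("md"):
                current = line.split()[0]        # e.g. "md0"
            m = re.search(r"\[([U_]+)\]", line)  # the [UUU]/[UU_] status field
            if current and m and "_" in m.group(1):
                bad.append(current)
    return bad

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("DEGRADED arrays:", ", ".join(bad))
        sys.exit(1)
    print("all md arrays healthy")
```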
Additional 32A ring mains added to the machine room in the physics department
Fibre links connecting servers in Physics to WNs in CICS.
Old 5 kW aircon units replaced with new 10 kW units (3 x 10 kW aircon units).
Dedicated WAN link to cluster
Accepts jobs from grid CE and local cluster
Sends jobs to all WNs
Hosts DNS server
SL4.8, 4 single-core AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI in a RAID1.
MONBOX and BDII
50 WNs in the Physics Department (32 from the local HEP cluster + 18 new), 5 Gb backbone.
Phenom 3200 MHz x86_64, 8 GB of RAM, 140 GB per 4 cores, 11.96 HEPSPEC06/core.
102 old WNs in CICS, 1 Gb backbone
204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM and 72 GB local disk per 2 cores; 7.9 HEPSPEC06/core; connected to the servers via fibre link.
Jobs requiring greater network bandwidth are directed to the WNs with the better backbone (a possible mechanism is sketched after this list).
Software server with 1 TB disk (RAID1) and Squid server were moved from CICS.
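The slides do not say how the bandwidth-based routing is implemented; one common way to do it with a Torque server like the one described here is to tag the better-connected WNs with a node property and request it at submission time. The property name "fastnet", the node names and the job script below are hypothetical.

```python
# Hedged sketch of one way to steer bandwidth-heavy jobs to the Physics
# WNs on the 5 Gb backbone: tag those nodes with a Torque node property
# and request it when submitting. "fastnet" and the node/script names are
# hypothetical; the slides do not state the actual routing mechanism.
import subprocess

# On the Torque server, the better-connected nodes would carry a property,
# one line per node in $TORQUE_HOME/server_priv/nodes, e.g.:
#   wn-phys-01 np=4 fastnet
#   wn-phys-02 np=4 fastnet

def submit(job_script, needs_bandwidth=False):
    """Submit a job, requesting a 'fastnet' node only when it needs one."""
    resource = "nodes=1:fastnet" if needs_bandwidth else "nodes=1"
    out = subprocess.check_output(["qsub", "-l", resource, job_script])
    return out.decode().strip()   # the Torque job id

# Example: an analysis job reading heavily from the storage servers
# print(submit("analysis_job.sh", needs_bandwidth=True))
```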
Cluster availability and reliability in July were 100%.
Sheffield is active in ATLAS production and user analysis.