Alessandra Forti, GridPP25, Ambleside, 25 August 2010
Northgrid Status
Outline: Apel pies, Lancaster status, Liverpool status, Manchester status, Sheffield status, Conclusions
Apel pie (1), Apel pie (2), Apel pie (3): charts not reproduced here.
Lancaster
The new shared HPC facility at Lancaster.
1760 cores, delivering over 20k HEPSPEC2k6
Behind an LSF batch system.
In a proper machine room (although that's had teething troubles).
Added bonus of also providing a home for future GridPP kit.
Access to a “Panasas Shelf” for high performance tarball and VO software areas.
260 TB Grid-only storage included in purchase.
Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.
But with root access comes root responsibility.
New “HEC” facility almost ready.
Final stages of acceptance testing.
A number of gotchas and niggles are causing some delays, but this is expected given the facility's scale.
Underspecced fuses, overspecced UPS, overheating distribution boards, misbehaving switches, oh my!
Site otherwise chugging along nicely (with some exceptions).
New hardware to replace older servers
Site needs a downtime in the near future to get these online and have a general clean up.
Tendering for an additional 400 TB of storage
Addition to Cluster
32 bit cluster replaced by 9 x Quad chassis (4 worker nodes in each), boosts 64bit capability, reduces power consumption
Each node has 2 x X5620 CPUs (4 core), 2 x hot swap SATA, IPMI, redundant power supply
Runs 64bit SL5, gLite 3.2
6 x Quad core worker nodes (56 cores)
16 x Quad chassis, 4 WNs each (512 cores)
3 GB RAM per core
Storage: 286 Terabytes
Plans: Migrate from VMware to KVM (lcg-CE, CREAM-CE, Site BDII, MON Box ...)
Site shows as “not available” in gstat because GSTAT/Nagios doesn't recognise RedHat Enterprise Server as an OS
OS was correctly advertised following the published procedure How_to_publish_the_OS_name
The ticket has been reassigned to the gstat SU, but the problem has affected the site Nagios tests.
Procurement at commissioning stage
New hardware will replace half of the current cluster
20 quad chassis
2x6 core per motherboard
13.7 HEPSPEC06 per core
4GB memory per core
125 GB disk space per core
2 Gb/s bonded link per mother board
Total: 960 cores = 13.15k HEPSPEC06
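The compute total above can be checked with a few lines of arithmetic. This is only a sketch: the figures come from the slide, the variable and function names are illustrative, and it assumes "quad chassis" means four motherboards per chassis.

```python
# Figures from the slide; names are illustrative, not from the site's tooling.
CHASSIS = 20
MOTHERBOARDS_PER_CHASSIS = 4      # assumption: "quad chassis" = 4 motherboards
CORES_PER_MOTHERBOARD = 2 * 6     # 2 CPUs x 6 cores
HEPSPEC06_PER_CORE = 13.7
MEMORY_GB_PER_CORE = 4
DISK_GB_PER_CORE = 125


def compute_capacity():
    """Return (cores, total HEPSPEC06, total memory in GB, total disk in GB)."""
    cores = CHASSIS * MOTHERBOARDS_PER_CHASSIS * CORES_PER_MOTHERBOARD
    return (
        cores,
        cores * HEPSPEC06_PER_CORE,
        cores * MEMORY_GB_PER_CORE,
        cores * DISK_GB_PER_CORE,
    )


cores, hepspec06, memory_gb, disk_gb = compute_capacity()
print(f"{cores} cores, {hepspec06 / 1000:.2f}k HEPSPEC06, "
      f"{memory_gb} GB RAM, {disk_gb / 1000:.0f} TB local disk")
```

The result agrees with the quoted 960 cores and 13.15k HEPSPEC06.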
30+2+1x2TB for the raid (30 data disks + 2 parity disks + 1 hot spare)
2x250GB for the OS
Total: 540 TB usable
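A rough check of the storage figures, under stated assumptions: only the 30 data disks count towards usable space, all units are identical, and file system overhead is ignored. The number of storage units is not given on the slide; the value computed here is only what the quoted total would imply.

```python
# Per-unit layout from the slide: 30 data + 2 parity + 1 hot spare, 2 TB each.
DATA_DISKS = 30
PARITY_DISKS = 2
HOT_SPARES = 1
DISK_TB = 2
TOTAL_USABLE_TB = 540  # quoted total

raw_tb_per_unit = (DATA_DISKS + PARITY_DISKS + HOT_SPARES) * DISK_TB
usable_tb_per_unit = DATA_DISKS * DISK_TB  # parity and spare hold no user data

# Implied unit count (an inference, not a figure from the slide).
implied_units = TOTAL_USABLE_TB / usable_tb_per_unit

print(f"raw per unit: {raw_tb_per_unit} TB, usable per unit: {usable_tb_per_unit} TB")
print(f"implied number of units: {implied_units:g}")
```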
Computing nodes arrived and currently running on site soak tests
Testing the opportunity to use ext4 as the file system rather than ext3
Initial tests with iozone on old nodes show a visible improvement, especially with multithreaded tests
Single thread better in writing only
Need to start testing new nodes
Storage will be delivered this week
Raid card had a disk detection problem.
Issue is now solved by a new version of the firmware.
Rack switches undergoing reconfiguration to allow
4x2Gb/s bonded links from quad chassis
1x4Gb/s bonded link from storage units
8xGb/s uplink to the main cisco
Consistently delivered a high number of CPU hours
Current storage is stable although slowed down by SL5(.3?)/XFS problem that causes bursts of load on the machine
Evaluating how to tackle the problem for the new storage.
Stefan Söldner-Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier2 from Roger Barlow.
We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local group cluster with a common Torque server
Storage Nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show latest SL5 kernels no longer show xfs bug
DPM head node 8 cores and 16 GB RAM
200 TB of disk space has been deployed, covering the pledge for 2010 according to the new GridPP requirements
SW RAID5 (no raid controllers)
All disk pools assembled in Sheffield: 2 TB Seagate Barracuda disks and one 2 TB Western Digital
5x16bay unit with 2 fs, 2x24 bay unit with 2 fs
Cold spare unit on standby near each server
95% of disk space reserved for ATLAS
Plan to add 50 TB late in the autumn
Additional 32A ring mains added to the machine room in the physics department
Fibre links connecting servers in Physics to WNs in CICS
Old 5kW aircon replaced with a new 10kW (3x10kW aircon units)
Dedicated WAN link to cluster
Accepts jobs from grid CE and local cluster
Sends jobs to all WNs
Hosts DNS server
SL4.8, 4 single AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI in RAID1
MONBOX and BDII
50 WNs in Physics Department (32 from local hep cluster + 18 new), 5 Gb backbone
Phenom 3200 MHz x86_64, 8 GB of RAM, 140GB/4cores, 11.96 HepSpec/core
102 old WNs in CICS, 1 Gb backbone
204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM, 72 GB local disk per 2 cores; 7.9 HepSpec/core; connected to the servers via fibre link
Jobs requiring greater network bandwidth directed to WNs with better backbone
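The worker node figures above allow a rough estimate of the site's total compute power. This is a sketch only: it assumes each of the 50 Physics WNs has 4 cores (read from "140GB/4cores") and that each of the 204 Opterons counts as one core. The total is derived here and is not quoted on the slide.

```python
# Figures from the slide; the pool names and structure are illustrative.
pools = {
    "physics": {"cores": 50 * 4, "hepspec_per_core": 11.96},  # assumption: 4 cores per WN
    "cics": {"cores": 204, "hepspec_per_core": 7.9},          # 102 WNs, 204 single-core CPUs
}


def pool_hepspec(pool):
    """Total HepSpec for one pool of worker nodes."""
    return pool["cores"] * pool["hepspec_per_core"]


total_cores = sum(p["cores"] for p in pools.values())
total_hepspec = sum(pool_hepspec(p) for p in pools.values())

for name, pool in pools.items():
    print(f"{name}: {pool['cores']} cores, {pool_hepspec(pool):.1f} HepSpec")
print(f"total: {total_cores} cores, {total_hepspec:.1f} HepSpec")
```

On these assumptions the newer Physics nodes supply roughly 60% of the HepSpec total from about half of the cores.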
Software server with 1 TB disk (RAID1) and squid server were moved from CICS
Cluster availability and reliability in July is 100%
Sheffield is active in ATLAS production and user analysis