Alessandra Forti, GridPP25, Ambleside, 25 August 2010
Alessandra Forti

GridPP25 Ambleside

25 August 2010

Northgrid Status
Outline

  • Apel pies
  • Lancaster status
  • Liverpool status
  • Manchester status
  • Sheffield status
  • Conclusions

The new shared HPC facility at Lancaster.

1760 cores, delivering over 20k HEPSPEC06

Behind an LSF batch system.

In a proper machine room (although that's had teething troubles).

Added bonus of also providing a home for future GridPP kit.

Access to a “Panasas shelf” for high-performance tarball and VO software areas.

260 TB Grid-only storage included in purchase.

Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.

But with root access comes root responsibility.


New “HEC” facility almost ready.

Final stages of acceptance testing.

A number of gotchas and niggles are causing some delays, but this is expected given the facility's scale.

Underspecced fuses, overspecced UPS, overheating distribution boards, misbehaving switches, oh my!

Site otherwise chugging along nicely (with some exceptions).

New hardware to replace older servers

Site needs a downtime in the near future to get these online and have a general clean up.

Tendering for an additional 400 TB of storage


Addition to Cluster

32-bit cluster replaced by 9 quad chassis (4 worker nodes in each), boosting 64-bit capability and reducing power consumption

Each node has 2 x X5620 CPUs (4 cores each), 2 x hot-swap SATA disks, IPMI and a redundant power supply

Runs 64-bit SL5 and gLite 3.2

Entire Cluster


6 x Quad core worker nodes (56 cores)

16 x Quad chassis, 4 WNs each (512 cores)

3 GB RAM per core

8214.44 HEPSPEC06

Storage: 286 Terabytes

Plans: Migrate from VMWare to KVM (lcg-CE, CREAM-CE, Site BDII, MON Box ...)

Liverpool: problem

Site shows as “not available” in gstat because GSTAT/Nagios doesn't recognise RedHat Enterprise Server as an OS

The OS was correctly advertised following the published procedure How_to_publish_the_OS_name

The ticket has been reassigned to the gstat SU, but the problem has affected the site Nagios tests.

Manchester: new hardware

Procurement at commissioning stage

New hardware will replace half of the current cluster

Computing power

20 quad chassis

2 x 6-core CPUs per motherboard

13.7 HEPSPEC06 per core

4 GB memory per core

125 GB disk space per core

2 Gb/s bonded link per motherboard

Total: 960 cores = 13.15k HEPSPEC06


Storage

9 x 36-bay units

30+2+1 x 2 TB disks for the RAID (30 data disks + 2 parity disks + 1 hot spare)

2 x 250 GB disks for the OS

Total: 540 TB usable
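
The two totals on this slide follow directly from the per-unit figures. A minimal sketch of that arithmetic is given here as a cross-check; the constant and function names are illustrative, and the only assumption beyond the slide is that a "quad chassis" holds four motherboards.

```python
# Figures are taken from the slide; names are illustrative only.
CHASSIS = 20               # quad chassis
BOARDS_PER_CHASSIS = 4     # assumption: "quad" means four motherboards per chassis
CORES_PER_BOARD = 2 * 6    # 2 x 6-core CPUs per motherboard
HS06_PER_CORE = 13.7

STORAGE_UNITS = 9
DATA_DISKS_PER_UNIT = 30   # parity disks and the hot spare add no usable space
DISK_TB = 2


def compute_totals():
    """Return (total cores, total HEPSPEC06) for the new compute nodes."""
    cores = CHASSIS * BOARDS_PER_CHASSIS * CORES_PER_BOARD
    return cores, cores * HS06_PER_CORE


def usable_storage_tb():
    """Return usable storage in TB, counting data disks only."""
    return STORAGE_UNITS * DATA_DISKS_PER_UNIT * DISK_TB


if __name__ == "__main__":
    cores, hs06 = compute_totals()
    print(f"{cores} cores, {hs06 / 1000:.2f}k HEPSPEC06")
    print(f"{usable_storage_tb()} TB usable")
```

Under these assumptions the sketch reproduces the 960 cores, roughly 13.15k HEPSPEC06 and 540 TB quoted above.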

Manchester: new hardware

Computing nodes have arrived and are currently running on-site soak tests

Taking the opportunity to test ext4 as the file system instead of ext3

Initial tests with iozone on old nodes show a visible improvement, especially in multithreaded tests

Single-threaded tests improve for writes only

Need to start testing new nodes

Storage will be delivered this week

RAID card had a disk detection problem.

Issue is now solved by a new version of the firmware.

Rack switches undergoing reconfiguration to allow:

4x2Gb/s bonded links from quad chassis

1x4Gb/s bonded link from storage units

8xGb/s uplink to the main Cisco

Manchester: general news

Current stability

Consistently delivered a high number of CPU hours

Current storage is stable, although slowed down by an SL5(.3?)/XFS problem that causes bursts of load on the machines

Evaluating how to tackle the problem for the new storage.


Stefan Soldner Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier2 from Roger Barlow.

Sheffield: major cluster upgrade

We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local group cluster with a common Torque server


Storage nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show the latest SL5 kernels no longer show the XFS bug

DPM head node: 8 cores and 16 GB RAM

200 TB of disk space have been deployed, covering the pledge for 2010 according to the new GridPP requirements

SW RAID5 (no RAID controllers)

All disk pools assembled in Sheffield from 2 TB Seagate Barracuda disks, plus one 2 TB Western Digital

5 x 16-bay units with 2 fs, 2 x 24-bay units with 2 fs

Cold spare unit on standby near each server

95% of disk space reserved for ATLAS

Plan to add 50 TB late in the autumn
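
For orientation, the software RAID5 layout described on this slide can be turned into a rough capacity estimate. The sketch below is not the site's own calculation. It assumes that each "fs" is one RAID5 array, that every bay holds a 2 TB disk, and that each array loses one disk's worth of space to parity; all names are hypothetical.

```python
def raid5_data_disks(bays: int, arrays: int) -> int:
    """Data disks left when `bays` disks are split into `arrays` RAID5 sets."""
    if arrays < 1 or bays < 3 * arrays:
        raise ValueError("each RAID5 array needs at least three disks")
    return bays - arrays  # RAID5 spends one disk's worth of space per array


# (bays, arrays) per storage unit. Assumption: one RAID5 array per file system.
UNITS = [(16, 2)] * 5 + [(24, 2)] * 2
DISK_TB = 2  # assumption: every bay is populated with a 2 TB disk


def raw_tb(units=UNITS, disk_tb=DISK_TB) -> int:
    """Raw capacity in decimal TB before parity."""
    return sum(bays for bays, _ in units) * disk_tb


def usable_tb(units=UNITS, disk_tb=DISK_TB) -> int:
    """Capacity in decimal TB after RAID5 parity, before file system overhead."""
    return sum(raid5_data_disks(b, a) for b, a in units) * disk_tb


if __name__ == "__main__":
    print(f"raw: {raw_tb()} TB, after RAID5 parity: {usable_tb()} TB")
```

Under these assumptions the estimate is an upper bound and comes out above the 200 TB quoted on the slide; the slides do not say how the deployed figure was counted.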

Sheffield: major cluster upgrade


Additional 32A ring mains added to the machine room in the physics department

Fibre links connecting servers in Physics to WNs in CICS

Old 5 kW aircon replaced with new 10 kW units (3 x 10 kW aircon units)

Dedicated WAN link to cluster

Torque Server

Accepts jobs from grid CE and local cluster

Sends jobs to all WNs

Hosts DNS server


SL4.8, 4 x single-core AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI in RAID1


Sheffield: major cluster upgrade


50 WNs in the Physics Department (32 from the local HEP cluster + 18 new), 5 Gb backbone

Phenom 3200 MHz x86_64, 8 GB of RAM, 140 GB per 4 cores, 11.96 HepSpec/core

102 old WNs in CICS, 1 Gb backbone

204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM, 72 GB local disk per 2 cores; 7.9 HepSpec/core; connected to the servers via fibre link

Jobs requiring greater network bandwidth directed to WNs with better backbone

Software server with 1 TB disk (RAID1) and squid server were moved from CICS

Cluster availability and reliability in July were 100%

Sheffield is active in ATLAS production and user analysis
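
The per-core benchmark figures on this slide can be combined into an approximate total for the Sheffield cluster. This is a sketch rather than a number from the talk: it assumes 4 cores per Physics worker node (read off the "per 4 cores" disk figure), and the class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    cores: int
    hs06_per_core: float

    @property
    def hs06(self) -> float:
        """Total benchmark score contributed by this group."""
        return self.cores * self.hs06_per_core


# Assumption: each of the 50 Physics WNs has 4 cores.
PHYSICS = NodeGroup("Physics", cores=50 * 4, hs06_per_core=11.96)
# From the slide: 204 single-core Opterons across the 102 CICS WNs.
CICS = NodeGroup("CICS", cores=204, hs06_per_core=7.9)


def cluster_totals(groups):
    """Return (total cores, total HepSpec) summed over all node groups."""
    total_cores = sum(g.cores for g in groups)
    total_hs06 = sum(g.hs06 for g in groups)
    return total_cores, total_hs06


if __name__ == "__main__":
    cores, hs06 = cluster_totals([PHYSICS, CICS])
    print(f"{cores} cores, about {hs06:.0f} HepSpec in total")
```

With those assumptions the cluster comes to 404 cores and roughly 4000 HepSpec, of which the newer Physics nodes supply the larger share.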

Conclusions

  • Northgrid has been pretty stable, steadily crunching CPU hours and data for the past 4 months.
  • Sites have got, or are in the process of getting, new hardware, rejuvenating the CPUs and increasing the storage
  • Both the CPU and storage MoU requirements for Northgrid should be exceeded in the near future.