
SouthGrid Status



Presentation Transcript


  1. SouthGrid Status Pete Gronbech: 2nd April 2009 GridPP22 UCL

  2. UK Tier 2 reported CPU – Historical View to Q109

  3. UK Tier 2 reported CPU – Q1 2009

  4. SouthGrid Sites Accounting as reported by APEL

  5. Job distribution

  6. Site Upgrades since GridPP21
  • RALPPD: increase of 640 cores (1568 KSI2K) + 380 TB
  • Cambridge: 32 cores (83 KSI2K) + 20 TB
  • Birmingham: 64 cores on the pp cluster and 128 cores on the HPC cluster, which add ~430 KSI2K
  • Bristol: original cluster replaced by new quad-core systems (16 cores), plus an increased share of the HPC cluster: 53 KSI2K + 44 TB
  • Oxford: extra 208 cores, 540 KSI2K + 60 TB
  • Jet: extra 120 cores, 240 KSI2K

  7. New Total Q109

     SouthGrid GridPP    CPU (kSI2K)    Storage (TB)
     EDFA-JET                    483             1.5
     Birmingham                  700              90
     Bristol                     120              55
     Cambridge                   455              60
     Oxford                      972             160
     RALPPD                     2815             633
     Totals                     5545           999.5

  8. MoU

  9. Network rate capping
  • Oxford recently had its network link rate capped to 100 Mb/s.
  • This was the result of continuous 300-350 Mb/s traffic caused by CMS commissioning testing.
  • As it happens, the test completed at the same time as we were capped, so we passed it, and normal use is not expected to be this high.
  • Oxford's JANET link is actually 2 × 1 Gbit links, which had become saturated.
  • The short-term solution is to rate-cap only JANET traffic, to 200 Mb/s; all other on-site traffic remains at 1 Gb/s.
  • The long-term plan is to upgrade the JANET link to 10 Gb/s within the year.
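To spot sustained traffic like this before a cap is imposed, a site can sample its own interface counters. Below is a minimal sketch (not part of the original talk) that estimates outbound throughput from the Linux /proc/net/dev counters; the interface name eth0, the 10-second sampling interval and the 200 Mb/s alert threshold are illustrative assumptions.

```python
# Minimal sketch: estimate outbound throughput from Linux interface counters.
# The interface name, sampling interval and alert threshold are assumptions.
import time

IFACE = "eth0"            # hypothetical interface name
THRESHOLD_MBPS = 200.0    # illustrative alert level, cf. the 200 Mb/s cap
INTERVAL = 10             # seconds between samples

def tx_bytes(iface):
    """Return the transmitted-bytes counter for one interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":")[1].split()
                return int(fields[8])   # 9th data field is TX bytes
    raise ValueError(f"interface {iface} not found")

prev = tx_bytes(IFACE)
while True:
    time.sleep(INTERVAL)
    cur = tx_bytes(IFACE)
    mbps = (cur - prev) * 8 / (INTERVAL * 1e6)   # bytes -> megabits per second
    if mbps > THRESHOLD_MBPS:
        print(f"WARNING: {IFACE} sustained {mbps:.0f} Mb/s over the last {INTERVAL}s")
    prev = cur
```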

  10. SPEC benchmarking
  • Purchased the SPEC CPU2006 benchmark suite.
  • Ran it using the HEPiX scripts, i.e. the HEP-SPEC06 method.
  • Using the HEP-SPEC06 benchmark should provide a level playing field.
  • In the past, sites could choose any one of the many published values on the SPEC benchmark site.
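As an illustration of how per-node results from such a run might be rolled up into a published site capacity, here is a minimal sketch (not from the talk); the node counts, per-node HEP-SPEC06 scores and the HS06-to-kSI2K conversion factor are all illustrative assumptions.

```python
# Minimal sketch: roll up per-node benchmark scores into a site total.
# Node counts, per-node HEP-SPEC06 scores and the conversion factor are
# illustrative assumptions, not measured values.

HS06_PER_KSI2K = 4.0   # assumed factor for quoting totals in kSI2K

# hypothetical inventory: (number of identical nodes, HS06 per node)
node_groups = [
    (26, 67.4),   # e.g. newer dual-socket quad-core boxes
    (40, 33.1),   # e.g. older dual-socket dual-core boxes
]

total_hs06 = sum(count * score for count, score in node_groups)
total_ksi2k = total_hs06 / HS06_PER_KSI2K

print(f"Site total: {total_hs06:.0f} HEP-SPEC06 (~{total_ksi2k:.0f} kSI2K)")
```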

  11. Staff Changes
  • Jon Waklin and Yves Coppens left in Feb 09.
  • Kashif Mohammad started in Jan 09 as the deputy coordinator for SouthGrid.
  • Chris Curtis will replace Yves, starting in May; he is currently doing his PhD on the ATLAS project.
  • The Bristol post will be advertised; it is jointly funded by IS and GridPP.

  12. gridppnagios

  13. Resilience
  • What do we mean by resilience?
  • The ability to maintain high availability and reliability of our grid service
  • Guarding against failures of:
    • Hardware
    • Software

  14. Availability / Reliability

  15. Hardware Failures
  • The hardware
    • Critical servers
    • Good quality equipment
    • Dual PSUs
    • Dual mirrored system disks and RAID for storage arrays
    • All systems have 3-year maintenance with an on-site spares pool (disks, PSUs, IPMI cards)
    • Similar kit is bought for servers so hardware can be swapped
    • IPMI cards allow remote operation and control
  • The environment
    • UPS for critical servers
    • Network-connected PDUs for monitoring and power switching
    • Professional computer room(s)
    • Air conditioning: need to monitor the temperature
  • Actions based on the above environmental monitoring (see the sketch below)
    • Configure your UPS to shut down systems in the event of sustained power loss
    • Shut down the cluster in the event of high temperature
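The environmental-monitoring actions above can be driven by a small watchdog. The sketch below (not from the talk) polls an ambient temperature reading via ipmitool; the sensor name "Ambient Temp", the 30 C trip point, the polling interval and the output parsing all vary by site and BMC, and are assumptions here.

```python
# Minimal sketch: trip an action when the machine-room temperature is too high.
# Assumes ipmitool is installed, the BMC exposes a sensor called "Ambient Temp",
# and the reading is the last field of the "sensor reading" output -- all of
# which are site-specific assumptions.
import subprocess
import time

SENSOR = "Ambient Temp"     # hypothetical sensor name
LIMIT_C = 30.0              # illustrative trip temperature
CHECK_EVERY = 60            # seconds between polls

def ambient_temp_c():
    out = subprocess.check_output(
        ["ipmitool", "sensor", "reading", SENSOR], text=True)
    return float(out.strip().split()[-1])   # assumes reading is the last field

while True:
    temp = ambient_temp_c()
    if temp > LIMIT_C:
        print(f"Room at {temp:.1f} C, above {LIMIT_C} C: draining cluster")
        # e.g. stop accepting jobs, then power off worker nodes; the exact
        # commands are site-specific and omitted here.
        break
    time.sleep(CHECK_EVERY)
```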

  16. Hardware continued
  • Having guarded against hardware failure, if it does happen we need to ensure rapid replacement:
    • Restore from backups or reinstall
    • Automated installation system: PXE, Kickstart, cfengine
    • Good documentation
  • Duplication of critical servers
    • Multiple CEs
    • Virtualisation of some services allows migration to alternative VM servers (MON, BDII and CEs)
  • Less reliance on external services
    • Could set up a local WMS and top-level BDII (see the sketch below)
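One way a local top-level BDII reduces reliance on external services is to probe the preferred BDII and fall back to another when it does not respond. This is a minimal sketch (not from the talk); the hostnames are placeholders, and it assumes the OpenLDAP ldapsearch client is available and the BDIIs publish GLUE 1.3 data on port 2170.

```python
# Minimal sketch: pick the first top-level BDII that answers a trivial query.
# Hostnames are placeholders; assumes ldapsearch is installed and the BDIIs
# publish GLUE 1.3 under "o=grid" on port 2170.
import subprocess

BDIIS = ["top-bdii.example.local", "lcg-bdii.example.org"]  # local first, then fallback

def bdii_responds(host, timeout=10):
    """Return True if the top-level BDII answers a simple GLUE query."""
    cmd = ["ldapsearch", "-x", "-LLL",
           "-H", f"ldap://{host}:2170", "-b", "o=grid",
           "(objectClass=GlueSite)", "GlueSiteName"]
    try:
        subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

for host in BDIIS:
    if bdii_responds(host):
        print(f"Using BDII {host}")   # e.g. point LCG_GFAL_INFOSYS at this host
        break
else:
    print("No top-level BDII reachable")
```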

  17. Software Failures
  • The main cause of loss of availability is software failure:
    • Misconfiguration
    • Fragility of the gLite middleware
    • OS problems
      • Disks filling up
      • Service failures (e.g. ntp)
  • Good communications can help solve problems quickly: mailing lists, wikis, blogs, meetings
  • Good monitoring and alerting (Nagios etc.); a simple example check is sketched below
  • Learn from mistakes: update systems and procedures to prevent reoccurrence
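To make the monitoring point concrete, here is a minimal sketch (not from the talk) of a Nagios-style plugin covering two of the failure modes above, a filling disk and a dead ntpd. The mount point, the thresholds and the "pgrep ntpd" process check are illustrative assumptions; real plugins are usually more thorough.

```python
# Minimal sketch of a Nagios-style plugin (exit 0=OK, 1=WARNING, 2=CRITICAL)
# checking disk usage on one partition and whether ntpd is running.
# Mount point, thresholds and the pgrep-based check are assumptions.
import shutil
import subprocess
import sys

MOUNT = "/var"          # hypothetical partition to watch
WARN, CRIT = 80, 90     # percent-used thresholds

usage = shutil.disk_usage(MOUNT)
pct = 100 * usage.used / usage.total

ntpd_running = subprocess.run(
    ["pgrep", "-x", "ntpd"], capture_output=True).returncode == 0

if pct >= CRIT or not ntpd_running:
    print(f"CRITICAL: {MOUNT} {pct:.0f}% used, ntpd running={ntpd_running}")
    sys.exit(2)
if pct >= WARN:
    print(f"WARNING: {MOUNT} {pct:.0f}% used")
    sys.exit(1)
print(f"OK: {MOUNT} {pct:.0f}% used, ntpd running")
sys.exit(0)
```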

  18. Recent example
  • Many SAM failures, with occasional passes
  • All test jobs pass
  • Almost all ATLAS jobs pass
  • Error logs revealed messages about the proxy not being valid yet!
  • ntp on the SE head node had stopped
  • AND cfengine had been switched off on that node (so no automatic check and restart)
  • A SAM test always gets a new proxy, and if it got through the WMS and onto our cluster into a reserved express-queue slot within 4 minutes, it would fail
  • In this case the SAM tests were not accurately reflecting the usability of our cluster, BUT they were showing a real problem
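The underlying failure, a head node whose clock had drifted because ntpd had stopped, is easy to catch with an independent check. The sketch below is not from the talk; it assumes the third-party ntplib package is installed, and the reference server and 60-second skew limit are illustrative choices.

```python
# Minimal sketch: independent clock-skew check for a service node.
# Assumes the third-party ntplib package (pip install ntplib); the reference
# server and the skew limit are illustrative choices.
import ntplib

REFERENCE = "pool.ntp.org"
MAX_SKEW_S = 60.0    # a proxy "not valid yet" implies skew of this order

response = ntplib.NTPClient().request(REFERENCE, version=3)
skew = abs(response.offset)   # local clock offset from the reference, in seconds

if skew > MAX_SKEW_S:
    print(f"CRITICAL: clock off by {skew:.1f}s; check ntpd (and cfengine)")
else:
    print(f"OK: clock within {skew:.3f}s of {REFERENCE}")
```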

  19. Conclusions
  • These systems are extremely complex.
  • Automatic configuration and good monitoring can help, but systems need careful tending.
  • Sites should adopt best practice and learn from others.
  • We are improving, but it's an ongoing task.
