
SouthGrid Status



  1. SouthGrid Status. Pete Gronbech, 4th September 2008, GridPP 21, Swansea

  2. UK Tier 2 reported CPU – Historical View

  3. UK Tier 2 reported CPU – Q3 2008 (so far: July/August)

  4. SouthGrid Sites: Accounting as reported by APEL

  5. Jobs by VO Q2-Q3 08

  6. Q208 Report

  7. RAL PPD: 918 KSI2K, 158 TB
  • 158 TB of storage.
  • Setting up space tokens for ATLAS took some time; the fine-grained control of permissions is not really supported by dCache.
  • The second 2008 hardware upgrade (90 TB usable and 10 Supermicro twins, ~400 KSI2K) is to be installed in October.
  • Will tender with the Tier 1 for a large (~£300K) purchase nearer Christmas.
  • A major CMS and ATLAS site; uses significant CPU and storage.

  8. Status at Cambridge: 391 KSI2K, 43 TB
  • Already received: 3 new servers, all Dell 1950s, for the CE, UI and an internal "installation server". All of them are in the rack and powered up; presently testing the new 3.1 CE.
  • Ordered and expecting delivery: 20 TB of disk (2 x 10 TB disk servers) and 2 x 16-core nodes (32 cores in total) for topping up the WNs. All from Viglen; delivery is expected at the end of this week.
  • Current storage: ~40 TB running on SL4 64-bit.
  • Problems: the gLite 3.1 CE is still not ready for Condor, so why push hard to upgrade?
  • VOs: no known issues with LHCb so far; atlasprd jobs are running at the moment.

  9. Birmingham Status: 190 KSI2K, ~50 TB
  • 4 twin quad-core workers with Intel Xeon E5450 3.00 GHz CPUs are ready to be deployed in production.
  • Grid job submission is done with a new CE, epgce3.ph.bham.ac.uk. SL4 64-bit is installed with 32-bit gLite middleware.
  • epgce3 is installed on a Dell PowerEdge SC1435 (quad-core AMD Opteron 2350 with 8 GB of RAM and a 160 GB drive). This box is far too powerful to host a CE; the long-term plan is to run several Xen guest CEs or other grid servers on it, but for now we have reverted to a simple CE installation for simplicity.
  • Are other people using Xen? Yves would be very interested to find out how you secure your Xen host (iptables, whether you do bridging within a guest, ...); feedback would be most welcome! One possible setup is sketched below.
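  As a starting point for that discussion, here is a minimal, hypothetical sketch of how a Xen dom0 that bridges traffic for its guest CEs could be locked down with iptables, driven from a small Python script. The SSH-only management policy, the default xenbr0 bridge and the assumption that each guest runs its own host firewall are all assumptions, not Birmingham's actual configuration.

    # Hypothetical sketch only: one way to firewall a Xen dom0 that bridges
    # traffic for its guest CEs. Assumes the default xenbr0 bridged networking
    # and SSH-only access to dom0; run as root on the dom0.
    import subprocess

    def iptables(*args):
        """Run a single iptables command on dom0, raising if it fails."""
        subprocess.check_call(["iptables"] + list(args))

    # Start the dom0 INPUT and FORWARD chains from a known state.
    iptables("-F", "INPUT")
    iptables("-F", "FORWARD")

    # dom0 itself: allow loopback, established connections and SSH, drop the rest.
    iptables("-A", "INPUT", "-i", "lo", "-j", "ACCEPT")
    iptables("-A", "INPUT", "-m", "state", "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT")
    iptables("-A", "INPUT", "-p", "tcp", "--dport", "22", "-j", "ACCEPT")
    iptables("-P", "INPUT", "DROP")

    # Traffic that is merely being bridged through to the guests is passed on;
    # each guest CE is expected to run its own host firewall.
    iptables("-A", "FORWARD", "-m", "physdev", "--physdev-is-bridged", "-j", "ACCEPT")
    iptables("-P", "FORWARD", "DROP")

  A stricter alternative would be to filter per guest in the FORWARD chain using the physdev --physdev-in/--physdev-out matches instead of accepting all bridged traffic.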

  10. Birmingham Status – 2
  • 3 CEs now! epgce1 submits to the Atlas Farm and epgce2 submits to our old eScience cluster; we need one more to submit to BlueBear :(
  • Note that the clusters are very disparate and in different locations:
      Location             CPU               Arch      Mem/core
      Atlas Farm (PP)      Xeon 2 GHz        i686      512 MB
      Twins (PP)           new Xeon 3 GHz    x86_64    2 GB
      Old eScience (IS)    old Xeon 3 GHz    i686      1 GB
      BlueBear (IS)        Opteron 2218      x86_64    2 GB
  • In early July we deployed a DPM pool node (also a Dell PowerEdge SC1435) with 40 TB attached to it. It runs SL4 x86_64 and uses XFS for the DPM filesystem.

  11. Birmingham Status – 3
  • The old eScience cluster is running fine.
  • The VO software area is installed on BlueBear GPFS. We need to relocate the software to a new GPFS disk with a group quota rather than a user quota; this makes ATLAS software installation more difficult. Alessandro De Salvo's logs and tools are extremely helpful for sorting out issues.

  12. Bristol
  • The second phase of the Bristol HPC cluster is being installed in the new computer room on top of the Physics building.
  • The current plan is for Physics to get an increasing fraction of the first phase of the HPC. They started with 32 nodes; this has increased to 39 and may go up to 400 in a matter of weeks.
  • In addition to grid use, local CMS users make heavy use of the HPC cluster.
  • GridPP3 hardware money has been used to provide incremental upgrades to the infrastructure: a new CE for the HPC, a GridFTP server, a StoRM server and a new SL4 MON box have been purchased.
  • 50 TB of storage (GPFS) has been purchased and installed, and will be commissioned following Jon Waklin's paternity leave.
  • Had some recent problems with the 16-bay Infortrend array (SE), which appears as a SCSI device to the host. The Adaptec controller was replaced by an LSI and the cables were replaced; unfortunately it failed again under heavy CMS load yesterday.

  13. Bristol - 2

  14. Oxford: 510 KSI2K, 102 TB
  • Two sub-clusters:
  • (2007) 176 Intel 5345 quad-core CPUs running SL4
  • (2004) 76 2.8 GHz CPUs running SL4
  • A tender has just gone out for an upgrade of approximately 60 TB and 450 KSI2K, to be installed in October. Upgrading the networking to either 3Com 5500 or Nortel 5510/5530 class equipment.
  • The older Dell 2.8 GHz Xeons will be decommissioned following the new installation: the power usage costs outweigh the CPU provided.
  • We had a few teething problems with the move to SL4 CEs, including taking a little while to get APEL sorted.
  • All ATLAS space tokens are now in place. This was initially delayed by having to wait for backplane swaps on our DPM pool nodes.
  • The 4 grid racks have been moved to the new computer room at Begbroke Science Park. Ewan and I are now qualified removal men!

  15. Oxford – 2
  We have to pay for the electricity used at the Begbroke computer room:
  • The electricity cost to run the old (4-year-old) Dell nodes (~79 KSI2K) is ~£8600 per year.
  • The replacement cost in new twins is ~£6600, with an electricity cost of ~£1100 per year.
  • So there is a saving of ~£900 in the first year and ~£7500 per year thereafter (the arithmetic is spelled out below).
  • The conclusion is that it is not economically viable to run kit older than 4 years.
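  For clarity, a trivial sketch of the arithmetic behind those savings, using only the rounded figures quoted on the slide:

    # Arithmetic behind the Oxford electricity figures above
    # (rounded values taken from the slide; nothing else assumed).
    old_power_per_year = 8600   # GBP/year to keep the 4-year-old Dell nodes (~79 KSI2K) running
    replacement_cost   = 6600   # GBP, one-off purchase of equivalent new twin nodes
    new_power_per_year = 1100   # GBP/year to run the replacement twins

    first_year_saving = old_power_per_year - (replacement_cost + new_power_per_year)
    later_year_saving = old_power_per_year - new_power_per_year

    print("Saving in year 1:      ~GBP %d" % first_year_saving)   # ~900
    print("Saving per later year: ~GBP %d" % later_year_saving)   # ~7500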

  16. EFDA JET: 242 KSI2K, 1.5 TB
  • Had a problem with the CE not running SAM jobs (probably a certificate issue) although it was fine for other jobs; this took some time to fix and had a bad effect on the availability of JET and of SouthGrid. Rolled back to a backup of a working CE; unfortunately this means it is still running SL3. Will upgrade ASAP.
  • Worker nodes have already been upgraded to SL4.

  17. SouthGrid… Other Points
  • Work has started on preparing the infrastructure required to make Oxford a CMS site; RALPPD will provide the required PhEDEx service.
  • Slow progress in hiring the Deputy T2C; interviewing for the second time on Monday next week.
  • NGS integration: the Bristol HPC has become an NGS affiliate. The SouthGrid regional VO will be used to bring local groups onto the grid.

  18. SouthGrid Summary
  • Steady performance at most sites; JET is back on line after its problems, and Oxford had a few issues but is back up to speed again.
  • The majority of the work is still happening at RALPPD and Oxford.
  • Expansion is expected shortly at all sites.
  • Still to fully exploit the HPC clusters at Bristol and Birmingham.
