
London Tier 2



Presentation Transcript


1. London Tier 2 Status Report
   GridPP 11, Liverpool, 15 September 2004
   Ben Waugh, on behalf of Owen Maroney

2. LT2 Sites
   • Brunel University
   • Imperial College London (including the London e-Science Centre)
   • Queen Mary University of London
   • Royal Holloway University of London
   • University College London

3. LT2 Management
   • Internal LT2 MoU signed by all institutes
   • MoU with GridPP signed by David Colling as acting chair of the Management Board
   • Management Board being formed but has not yet met
   • Technical Board meets every three to four weeks

4. Contribution to LCG2
   • 'Snapshot' taken on 26th August
   • Number of WN CPUs in use by LCG (per-site table in the original slide)
     * QMUL have since turned on hyperthreading and now allow up to 576 jobs
     * Brunel joined LCG2 on 3rd September

5. Brunel
   • Test system (1 WN) PBS farm @ LCG-2_2_0
   • Joined Testzone on 3rd September
   • Completely LCFG installed
   • In the process of adding 60 WNs:
     • LCFG installation
     • private network
     • some problems with SCSI drives and network booting with LCFG
   • Have had problems with local firewall restrictions
     • these now seem to be resolved
     • GLOBUS_TCP_PORT_RANGE is not the default range (sketched below)
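The firewall issue above comes down to keeping GLOBUS_TCP_PORT_RANGE consistent with the ports the site actually opens. A minimal Python sketch of that idea follows; the port range is invented for the example, since the slide only says the range is non-default.

```python
import os
import socket

# Hypothetical non-default range: must match what the site firewall allows.
PORT_MIN, PORT_MAX = 20000, 20100

# Globus tools read this variable when choosing ports for inbound data and
# callback connections, so it has to agree with the firewall configuration.
os.environ["GLOBUS_TCP_PORT_RANGE"] = "%d,%d" % (PORT_MIN, PORT_MAX)


def some_port_bindable(lo, hi):
    """Crude local sanity check: can we bind at least one port in the range?"""
    for port in range(lo, hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))
            return True
        except OSError:
            continue
        finally:
            s.close()
    return False


if __name__ == "__main__":
    print("GLOBUS_TCP_PORT_RANGE =", os.environ["GLOBUS_TCP_PORT_RANGE"])
    print("at least one port bindable:", some_port_bindable(PORT_MIN, PORT_MAX))
```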

6. Imperial College London
   • 66 CPU PBS HEP farm @ LCG-2_1_1
     • joined LCG2 prior to 1st April
     • completely LCFG installed
     • in the core zone
     • early adopter of R-GMA
   • London e-Science Centre has a 900 CPU cluster
     • cluster runs a locally patched RH7.2; as a shared facility there is no possibility of changing the operating system
     • could not install LCG2 on RH7.2, so RH7.3 will be run under User Mode Linux to install LCG2
     • the batch system (Sun Grid Engine) is not currently supported by LCG
       • LeSC have already provided a globus-jobmanager for SGE; work is in progress on updating this for the LCG jobmanager (see the sketch after this slide)
       • an information provider is being developed by Durham
       • there is interest in SGE from other sites
     • the LeSC cluster will soon be SAMGrid-enabled, with LCG2 to follow
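The LeSC globus-jobmanager for SGE itself is not shown in the slides. As a rough illustration of the submission step such a jobmanager has to perform, here is a hedged Python sketch that wraps a command in an SGE batch script and hands it to qsub; the queue name, job name, and helper function are invented for the example.

```python
import os
import subprocess
import tempfile


def submit_to_sge(executable, arguments, queue="lcg", job_name="grid_job"):
    """Wrap a command in an SGE batch script and submit it with qsub.

    The queue and job name are illustrative defaults, not LeSC settings.
    """
    template = (
        "#!/bin/sh\n"
        "#$ -N {name}\n"   # job name
        "#$ -q {queue}\n"  # target SGE queue
        "#$ -cwd\n"        # run in the submission directory
        "exec {command}\n"
    )
    script = template.format(
        name=job_name,
        queue=queue,
        command=" ".join([executable] + list(arguments)),
    )
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as handle:
        handle.write(script)
    # qsub answers with e.g.: Your job 12345 ("grid_job") has been submitted
    result = subprocess.run(["qsub", path], capture_output=True, text=True)
    return result.stdout.strip()


# Example: print(submit_to_sge("/bin/hostname", []))
```

A real jobmanager also has to poll qstat for job state and clean up after completion; the sketch only covers submission.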

7. Queen Mary
   • 348 CPU Torque+Maui farm @ LCG-2_1_0
   • Joined Testzone on 6th July
   • Private-networked WNs
   • Existing Torque server:
     • "manual" installation on the WNs (a local automated procedure)
     • LCFG-installed CE and SE
     • CE configured as a client to the Torque server
   • OS is Fedora 2: the only site in LCG2 not running RH7.3!
   • Recently turned on hyperthreading; offers 576 job slots

8. Queen Mary: Fedora 2 Port
   • The CE and SE are LCFG-installed RH7.3
   • LCG-2_0_0 was installed on the Fedora 2 WNs by:
     • tarring up the /opt directory from an LCFG-installed RH7.3 node and untarring it on the Fedora 2 WN (sketched below)
     • only the Globus toolkit needed recompiling
     • the jar files in /usr/share/java were also needed
     • everything worked!
   • For LCG-2_1_0 this method failed:
     • the upgraded edg-rm functions no longer worked
     • recompiling the Globus toolkit did not help
     • LCG could not provide SRPMS for edg-rm
   • Current status: LCG-2_1_0 on the SE and CE, LCG-2_0_0 on the WNs
     • seems to work, but is clearly not ideal…
   • The LCG-2_2_0 upgrade will be used to test the lcg-* utilities, which will replace the edg-rm functions
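A minimal sketch of the copy step described above, assuming a made-up archive path; recompiling the Globus toolkit, as the slide notes, still has to be done separately on the Fedora 2 node.

```python
import tarfile

# Paths follow the slide: /opt from an LCFG-installed RH7.3 node, plus the
# jar files under /usr/share/java; the archive name is invented.
SOURCE_DIRS = ["/opt", "/usr/share/java"]
ARCHIVE = "/tmp/lcg-wn-rh73.tar.gz"


def pack(dirs, archive):
    """Run on the RH7.3 reference node: bundle the middleware directories."""
    with tarfile.open(archive, "w:gz") as tar:
        for directory in dirs:
            tar.add(directory, arcname=directory.lstrip("/"))


def unpack(archive, destination="/"):
    """Run on the Fedora 2 WN: unpack in place (file copy only)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)


if __name__ == "__main__":
    pack(SOURCE_DIRS, ARCHIVE)
```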

9. Royal Holloway
   • 148 CPU PBS+Maui farm @ LCG-2_1_1
   • Joined Testzone on 19th July
   • Private-networked WNs
   • Existing PBS server:
     • manual installation on the WNs
     • LCFG-installed CE and SE
     • CE configured as a client to the PBS server
   • Shared NFS /home directories
     • uses the pbs jobmanager, not the lcgpbs jobmanager
     • needed to configure the WNs so that jobs run in a scratch area, as there is not enough space in /home for the whole farm (see the wrapper sketch below)
   • Some problems still under investigation:
     • stability problems with Maui
     • large, compressed files sometimes become corrupted when copied to the SE; this looks like a hardware problem
   • Also: an 80 CPU BaBar farm running BaBarGrid
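The slides do not say exactly how the worker nodes were reconfigured. One simple way to get the effect described, assuming a node-local /scratch partition on each WN, is a job wrapper along these lines.

```python
import shutil
import subprocess
import tempfile

# Assumed local disk on each worker node; the slide only says jobs had to run
# in a scratch area because the NFS-shared /home is too small for the farm.
SCRATCH_ROOT = "/scratch"


def run_in_scratch(command):
    """Run a job command inside a private, node-local scratch directory."""
    workdir = tempfile.mkdtemp(prefix="job-", dir=SCRATCH_ROOT)
    try:
        # cwd= keeps the job's output files on local disk rather than in /home.
        return subprocess.call(command, cwd=workdir)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


# Example: run_in_scratch(["/bin/hostname"])
```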

10. University College London
   • UCL-HEP: 20 CPU PBS farm @ LCG-2_1_1
   • Joined Testzone on 18th June
   • Existing PBS server:
     • manual installation of the WNs
     • LCFG-installed CE and SE
     • CE configured as a client to the PBS server
   • Shared /home directories
     • uses the pbs jobmanager, not the lcgpbs jobmanager
     • so far no problems with space on the shared /home
   • Hyperthreading allows up to 76 jobs, but the grid queues are restricted to fewer than this
   • Stability problems with OpenPBS

11. University College London
   • UCL-CCC: 88 CPU PBS farm @ LCG-2_2_0
   • Joined Testzone on 24th June
   • A power failure took the farm offline from 4th to 25th August
   • Originally a cluster of 192 CPUs running Sun Grid Engine under RH9
     • UCL central computing services agreed to reinstall half of the farm with RH7.3 for LCG, using LCFG
   • Hyperthreading allows 176 jobs (44 dual-CPU WNs)

12. Contribution to GridPP
   • Promised vs. delivered (per-site table in the original slide)
     * The CPU count includes shared resources where CPUs are not 100% dedicated to Grid/HEP; the kSI2K value takes this sharing into account

13. Site Experience
   • Storage Elements are all 'classic' gridftp servers (a transfer check is sketched after this slide)
     • cannot pool large TB RAID arrays to deploy large disk spaces
   • Many farms are shared facilities:
     • existing batch queues
     • manual installation of the WNs, which needs to be automated for large farms!
     • the CE becomes a client to the batch server
   • Private-networked WNs
     • needed additional Replica Manager configuration
   • Some OS constraints
     • the lack of all SRPMS is still a problem
   • Most sites were taken by surprise by the lack of warning of new releases
     • problems scheduling the workload
   • Documentation has improved
     • but communication could be improved further!
   • The default LCFG-installed farms (IC-HEP, UCL-CCC) have been amongst the most stable and most easily upgraded
     • but this is not an option for most significant Tier 2 resources
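Since the Storage Elements are classic gridftp servers, and given the file-corruption problem noted on the Royal Holloway slide, one simple defence is to checksum a transfer end to end. A sketch with invented hostnames and paths, using the standard globus-url-copy client:

```python
import hashlib
import subprocess

# Hostnames and paths are invented for the example.
LOCAL_FILE = "/data/big-file.tar.gz"
SE_URL = "gsiftp://se.example.ac.uk/storage/big-file.tar.gz"
READBACK = "/tmp/big-file.readback"


def md5sum(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files do not exhaust memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_transfer():
    """Copy the file to the SE, read it back, and compare checksums."""
    before = md5sum(LOCAL_FILE)
    subprocess.check_call(["globus-url-copy", "file://" + LOCAL_FILE, SE_URL])
    subprocess.check_call(["globus-url-copy", SE_URL, "file://" + READBACK])
    return before == md5sum(READBACK)


if __name__ == "__main__":
    print("transfer intact:", verify_transfer())
```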

14. Summary
   • LT2 sites have managed to contribute a significant amount of resources to LCG, and there is still more to come!
   • This has required a significant amount of (unfunded) effort from staff, both HEP and IT, at the institutes
     • the 2.5 GridPP2-funded support posts to be appointed soon will help!
   • Any deviation from a "standard" installation comes at a price: installation, upgrades and maintenance
     • but the large resources at Tier 2s tend to be shared facilities, and do not have the freedom to install a "standard" OS, whatever it might be
     • LCG moving from RH7.3 to Scientific Linux will not necessarily help!
