GridPP Status Report



  1. GridPP Status Report David Britton, 15/Sep/09

  2. Introduction Issues from the last Oversight: • “Other Experiments.” • EGI/NGI/NGS etc. • CASTOR. • OPN network. Since the last Oversight: • The UK has continued to be a major contributor to wLCG • A focus on resilience and disaster management (GridPP22) • The UK infrastructure has been validated by STEP09. • Moved the Tier-1 to R89. • Procured significant new hardware. • Adapted to developments in the LHC schedule; the EGI+ proposals; and the UK funding constraints. To be covered by Project Manager: • Project Milestones/Deliverables. • Project Risks. • Project Finances.

  3. WLCG: Largest scientific Grid in the world. Worldwide: 288 sites in 55 countries, ~190,000 CPUs. In the UKI: 22 sites and about 19,000 CPUs. September 2009: >315,000 KSI2K.

  4. UK CPU Contribution. The same picture holds if non-LHC VOs are included.

  5. UK Site Contributions (2007 / 2008 / 2009): NorthGrid 34% / 22% / 15%; London 28% / 25% / 32%; ScotGrid 18% / 17% / 22%; Tier-1 13% / 15% / 13%; SouthGrid 7% / 16% / 13%; GridIreland 0% / 6% / 5%. All areas of the UK make valuable contributions. “Other VOs” used 16% of the CPU time this year.

  6. UK Site Contributions: Non-LHC VOs. The top-12 “Other VOs” span many disciplines. All regions supported the “Other VOs”.

  7. Tier-2 Resources. The Tier-2s have delivered (Brunel is currently installing 600TB of disk). Accounting error: 230TB delivered.

  8. Tier-2 Performance. The Tier-2s have improved and are performing well (figures are resource-weighted averages; see the note below).
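  Presumably (the slide does not spell this out, so it is an assumption) “resource-weighted” means each region’s figure is the per-site metric averaged with each site’s installed capacity as the weight, so that large sites count proportionally more than small ones:

      weighted average = Σ_i (metric_i × capacity_i) / Σ_i (capacity_i)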

  9. Service Resilience (GridPP23 Agenda). A sustained push was made on improving service resilience at all levels. Many improvements were made at many sites and, ultimately, STEP09 demonstrated that the UK Grid was ready for data (see later slide). Disaster management processes were developed and are regularly engaged (see later slide).

  10. STEP09 UK Highlights • RAL was the best ATLAS Tier-1 after BNL, the ATLAS-only Tier-1. • Glasgow ran more jobs than any of the 50-60 ATLAS Tier-2 sites throughout the world. • Tier-2 sites made good contributions and were tuning (not fire-fighting) during STEP09 and subsequent testing. • Quote: “The responsiveness of RAL to CMS during STEP09 was in stark contrast to many other Tier-1s.” • CMS noted the tape performance at RAL was very good, as was the CPU efficiency (CASTOR 2.1.7 worked well). • Many (if not all) of the metrics for the experiments were met, and in some cases significantly exceeded, at RAL during STEP09.

  11. STEP09: RAL Operations Overview • Generally very smooth operation: • Most service systems were relatively unloaded, with plenty of spare capacity. • Calm atmosphere. • Daytime “production team” monitored the service. • Only one callout. • Most of the team even took two days off site for a department meeting! • Very good liaison with the VOs and a good idea of what was going on. • In regular informal contact with UK representatives. • Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all were handled satisfactorily and fixed. Did not visibly impact the experiments. • Robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly. • Very useful exercise – learned a lot, but very reassuring. • More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

  12. STEP09: RAL Batch Service. Farm typically running > 2000 jobs. By 9th June at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%). • Problem 1: ATLAS job submission exceeded the 32K-file limit on the CE. • See the hole on the 9th. We thought ATLAS had paused, so it took time to spot. • Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts. • Need more ATLAS jobs in the queue faster; manually cap ALICE. Fixed by 9th June; see the decrease in (red) ALICE work. • Problem 3: Occupancy initially poor (~90%). Short on memory (2GB/core, but ATLAS jobs needed 3GB vmem). Gradually increased the MAUI over-commit on memory to 50%; occupancy rose to 98%. An illustrative fairshare configuration is sketched below.
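  For illustration only, a minimal MAUI fairshare configuration of the kind alluded to above might look like the following sketch. The share targets simply mirror the equilibrium mix quoted on this slide, and the MAXJOB cap on ALICE stands in for the manual cap that was applied; the actual RAL settings (including the memory over-commit parameter) are not given in this talk, so every value here is an assumption.

      # Give fairshare some weight in the job priority calculation (illustrative)
      FSWEIGHT        1
      # Fairshare window: 7 one-day intervals, with older usage decayed
      FSPOLICY        DEDICATEDPS
      FSDEPTH         7
      FSINTERVAL      24:00:00
      FSDECAY         0.80
      # Per-VO share targets, mirroring the equilibrium mix quoted on the slide
      GROUPCFG[atlas] FSTARGET=42
      GROUPCFG[cms]   FSTARGET=18
      GROUPCFG[lhcb]  FSTARGET=20
      # Hypothetical hard cap standing in for the manual ALICE cap described above
      GROUPCFG[alice] FSTARGET=3 MAXJOB=200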

  13. Data Transfers RAL achieved the highest average input and output data rates of any Tier-1.

  14. OPN Resilience

  15. Current Issues: R89 (GridPP22). In the end, hand-over to STFC was delayed from Dec to Apr 09. Hardware was delayed, but we were (almost) rescued by the LHC schedule change. Minor (?) issues remain with R89 (air-con trips; water-proof membrane?).

  16. Tier-1 Hardware • The FY2008 hardware procurement had to await the acceptance of R89. • The CPU is tested, accepted, and being deployed (14,000 HEPSPEC06 to add to the current 19,000). • The disk procurement (2 PB to add to the existing 1.9PB) was split into two halves (different disks and controllers, to mitigate the risk of acceptance problems). This has proved sensible, as one batch has demonstrated ejection issues. • One half of the disk is being deployed; progress is being made on the other half, and the best guess is deployment by the end of November. • A second SL8500 tape robot is available. • The FY09 hardware procurement is underway.

  17. Disaster Management • A four-stage disaster management process was established at the Tier-1 earlier this year as part of our focus on resilience and disaster management. • Designed to be used regularly so that the process stays familiar. This means a low threshold for triggering Stage-1 “disasters”. • At Stage-3, the process formally involves stake-holders outside the Tier-1, including GridPP management. This has now happened several times, including: • R89 air-con trip. • R89 water leak. • Disk procurement problem. • Swine flu planning. • The process is still being honed, but I believe it is very useful.

  18. EGI/NGI. EGI is the coordinating body in Amsterdam; NGIs are the national initiatives in member countries. The UK NGI involves STFC, EPSRC and JISC (at least), together with GridPP and the NGS. EGI is vital to GridPP, but it is not GridPP’s core business to run an e-science infrastructure for the whole of the UK: seek a middle ground.

  19. EU Landscape. UK involvement via the UK NGI with global tasks such as GOCDB, security, dissemination, training.... Possible further UK involvement (questions, not commitments): an FTS/LFC support post at RAL under EGI? Ganga within the heavy-user SSCs (ROSCOE)? APEL and GridSite within EMI (which spans UNICORE, ARC and gLite)? …

  20. User Support • Help pages. • GridPP23 talks. • User survey at RAL

  21. Actions • OPN – Detailed document provided. Cost is covered by existing GridPP hardware funds. Propose to proceed immediately to provision. • Other Experiments – Usage shown on Slide 6. The Allocation Policy is on the User Board web pages: http://www.gridpp.ac.uk/eb/allocpolicy.html • EGI/NGI/NGS – Paper provided. GridPP/UK has established potential links with all the structural units and is engaged in the developments. • CASTOR – Paper provided. Version 2.1.7, used during STEP09, worked well beyond the levels needed; the 2.1.8 upgrade is becoming an issue.

  22. Current Issues. Operational: • Timing of the CASTOR 2.1.8 upgrade. • Shake-down issues with R89. • Problem with 50% of the current disk purchase. High Level: • Hardware planning – lack of clarity on approved global resources. • Hardware pledges – financial constraints and the 2010 pledges. • GridPP4 – lack of information on scope, process or timing against a backdrop of severe financial problems within STFC.

  23. Key issue in the next six months: to receive a sustained flow of data from CERN and to meet all the experiment expectations associated with custodial storage, data reprocessing, data distribution, and analysis. • Requires: • A resilient OPN network. • Stable operation of CASTOR storage. • Tier-1 hardware and services. • Tier-1 to Tier-2 networking. • Tier-2 hardware and services. • Help, support, deployment and operations. • That is, the UK Particle Physics Grid. • The milestones necessary to meet these requirements have been met (with the possible exception of the first), and the entire system has been validated with STEP09. • We believe the UK is ready. • We know that problems will arise, and have focused on resilience to reduce the incidence of these, and on disaster management to handle those that do occur.

  24. The End

  25. Schedule • It is foreseen that the LHC will be ready for beam by mid-November. • Before that: • All sectors powered separately to operating energy ++ • Dry runs of many accelerator systems (from Spring): • Injection, extraction, RF, collimators. • Controls. • Full machine checkout before taking beam. • Beam tests: • TI8 (June). • TI2 (July). • TI2 and TI8 interleaved (September). • Injection tests (late October).
