
8th June 2009 Overview Board



Presentation Transcript


  1. WLCG Status Report 8th June 2009 Overview Board

  2. Agenda • General status & Milestones • STEP’09 • Resource planning – post-RRB • EGI • Progress • EGI & WLCG (+CERN) • Jamie: • Preparations for HEP SSC • Steven: • gLite consortium and later • Discussion • How WLCG should interact with EGI + NGIs in future

  3. WLCG MoU Signature Status • All anticipated signatures have now been received, including Brazil (April). • Today we have 49 MoU signatories, representing 34 countries: • Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep, Denmark, Estonia, • Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, • Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, • Switzerland, Taipei, Turkey, UK, Ukraine, USA.

  4. CERN + Tier 1 accounting - 2008

  5. Accounting - 2009 • Ramp up in 2009 delayed until September • But several Tier 1s have started • CPU usage is significantly increased wrt 2008 • All missing resources from 2008 are now installed except at • NL-T1, where installation is delayed until well after July; and • ASGC, where the fire delayed CPU installation

  6. Reliabilities now regularly reported for all experiments in addition to OPS (next slides) • Only T2 federation still not reporting is Ukraine

  7. Experiment-specific reliabilities

  8. Service issues Problems requiring a “Service Incident Report”: • 8/1: CERN many jobs killed due to memory problems • 17/1: CERN FTS transfer problems for ATLAS • 23/1: CERN FTS/SRM/Castor problems for ATLAS • 24/1: FZK FTS & LFC down for 3 days • 26/1: Backward-incompatible change on SRM • 21/2: CNAF: Network outage to Tier 2s and some Tier 1s • 25/2: ASGC fire affecting entire site – services relocated • 27/2: CERN accidental deletion of RAID volumes • 4/3: CERN general Castor outage for 3 hours • 14/3: CERN ATLAS Castor outage for 12 hours • 24/3: RAL site down after power glitches, knock-on effects for several days • 2/4: IN2P3 tape robotics failure • 11/4: TRIUMF cooling failure • 3/5: IN2P3 cooling down 44 hours (still in degraded mode until new cooling added in June) • 4/5: SARA MSS tape backend down • 14/5: PIC 5 hours cooling down • 19/5: Geant routing problem cut off CERN from all Geant customers (not OPN) • 20/5: dCache at NL-T1 – upgrade problems • Not all sites are (yet) reporting consistently, but this is improving • Power/cooling issues continue at ~1/month


  10. Milestones • Added milestones for: • 2009 procurements • SL5 deployment • SCAS/gLexec deployment • Updates to accounting (Tier 2 report, reporting installed capacity, user level reporting) • STEP’09 specifics • CREAM CE rollout • MSS Metrics • CPU benchmark transition

  11. Milestone table...

  12. MSS Metrics for Tier 1s • Metrics gathered for Tier 1s – large set of metrics • Most sites agree that they can provide almost all (maybe not by VO at the level of tape access) • Published by each site via XML • Displayed in SLS • Data available (automatically) now for: • CERN, TRIUMF, CNAF, BNL, ASGC • Others available soon
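Slide 12 says each site publishes its MSS metrics as XML, which SLS then displays. As an illustration only — the element names ("mssmetrics", "metric") and the metric names below are invented for this sketch, not the actual WLCG/SLS schema — a site-side publisher might look like:

```python
# Illustrative sketch of a site serialising MSS metrics as XML.
# NOTE: the tag and metric names are assumptions for illustration,
# not the real schema agreed between the Tier 1s and SLS.
import xml.etree.ElementTree as ET

def build_metrics_xml(site: str, metrics: dict) -> str:
    """Serialise a dict of MSS metrics into a small XML document."""
    root = ET.Element("mssmetrics", site=site)
    for name, value in metrics.items():
        elem = ET.SubElement(root, "metric", name=name)
        elem.text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_doc = build_metrics_xml("CERN", {
    "tape_read_MBps": 850,    # aggregate tape read rate (example value)
    "tape_write_MBps": 1200,  # aggregate tape write rate (example value)
    "files_staged": 41235,    # files recalled from tape in the period
})
print(xml_doc)
```

The point of the XML hand-off is that each site only needs to serve one well-formed document; SLS polls and renders it, so no site-specific display code is required.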

  13. STEP’09 • The LHCC mini review recommended a 2009 readiness exercise, specifically to address the issues of • Data recall from tape at Tier 1s, for more than 1 experiment • Analysis activities • At the WLCG workshop prior to CHEP it was agreed that we would have such an exercise despite the difficulties in co-scheduling this between the experiments • “Scale Testing for the Experimental Programme – 2009” (STEP’09) • Implication that each year we foresee increased scaling tests • Timescale: May (preparations), June • Essentially started last week (ATLAS, CMS, ALICE), this week (LHCb) • BUT: • IN2P3 had scheduled MSS upgrade (hw+sw) June 1-4, degraded performance until finished (agreed with experiments) • FZK had problem with tape backend hardware just before start of STEP • ASGC put in huge effort to prepare for STEP according to ATLAS requests after fire (Following summaries thanks to Julia Andreeva)

  14. STEP’09: ATLAS • Goals: • Parallel test of all main tasks at nominal data taking rate • Export from Tier 0 • Reprocessing + reconstruction at Tier 1s; tape reading/writing • Export of processed data to other Tier 1s and Tier 2s • Simulation at Tier 2 • Analysis at Tier 2 using 50% of T2 CPU, 25% pilot submission, 25% via WMS • Progress • Started June 1 • Simulation running at full rate • Load generator for data transfers reached 100% on 2nd June • Reprocessing running in 7 ATLAS clouds • Analysis in progress – using HammerCloud • 10-20k jobs concurrently between WMS, Panda, ARC; 130k jobs submitted so far (June 4 – less than 1 day of activity) • All clouds receive jobs from both WMS and Panda • ATLAS measures efficiency and read performance at each site

  15. STEP’09: CMS • Goals: • Tier 0 data recording in parallel with other experiments • Plan 48-hour runs: 10-11 June & 17-18 June • Ideally we would like a longer run (5 days) but for CMS this would interfere with the weekly cosmics run • Tier 1 focus on tape archiving and prestaging (2-21 June) • Data transfer goals: • Tier 0 – Tier 1 (2-16/6): latency between CERN MSS and Tier 1 • Tier 1 – Tier 1 (1-16/6): replicate 50 TB between all Tier 1s • Tier 1 – Tier 2 (4-9, 11-16/6): stress Tier 1 tapes, latency from Tier 1 MSS to Tier 2 • Analysis at Tier 2 • Demonstrate ability to use 50% of pledged resources with analysis jobs, overlaps with MC work. Throughout June. • Progress: • Started June 2 • CRUZET (Cosmics at 0 T) at CERN with export to Tier 1 last week (so no major Tier 0 activities) • First STEP’09 work at Tier 0 foreseen June 6 • Reprocessing at Tier 1s started June 3 • T0→T1 transfers started June 3 • T1→T1 transfers started June 4 • Analysis: job preparation under way (June 4)

  16. STEP’09: ALICE+LHCb • ALICE goals: • Tier 0 – Tier 1 data replication at 100 MB/s • Reprocessing with data recall from tape at Tier 1s • ALICE status: • Started June 1 • 15k concurrent jobs running • FTS transfers to start this week • LHCb goals: • Data injection into HLT • Data distribution to Tier 1s • Reconstruction at Tier 1s • LHCb status: • Will join STEP’09 this week

  17. Resource planning – post RRB • Next slides show the requirements as presented at the RRB at the end of April: Requirement >10% more/less than pledge

  18. ATLAS • Cosmic ray data in Q309 will produce 1.2PB (same as Aug-Nov 08) • In 6x10^6 sec will collect 1.2x10^9 events → 2PB raw • Raw stored on disk at T1s for a few weeks • Plan for 990M full sim events and 2200M fast sim events • CERN request was updated last Aug and was seen by RSG • Generally new requirements <= old requirements (except at CERN) • Provide resource needs profile by quarter (see document) • NB. The August 2008 request for 2009, while agreed by the RSG, has never been validated by the LHCC Requirement >10% more/less than pledge/requirement
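The ATLAS volume figures on this slide are internally consistent, which a back-of-envelope calculation (mine, not from the slide) confirms:

```python
# Cross-check of the ATLAS raw-data figures quoted above:
# 1.2e9 events collected in 6e6 seconds of running, ~2 PB of raw data.
live_seconds = 6e6
events = 1.2e9
raw_pb = 2.0

rate_hz = events / live_seconds       # implied average trigger rate
mb_per_event = raw_pb * 1e9 / events  # 1 PB = 1e9 MB

print(f"implied rate: {rate_hz:.0f} Hz")             # 200 Hz
print(f"implied event size: {mb_per_event:.2f} MB")  # ~1.67 MB/event
```

An average rate of 200 Hz and ~1.7 MB per raw event are both in line with the nominal ATLAS data-taking parameters of the period.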

  19. CMS • Model foresees 300Hz data taking rate ... • ... and CPU times assume higher lumi in ‘10 • recCPU: 100→200 HS06.s • simCPU: 360→540 HS06.s • Changes • 3 re-reconstructions in each of ’09, ‘10 • 40% overlap in PD datasets • Added storage needs for ‘09 cosmics • Tier 1: • Finish ‘09 re-reco in 1 month (was spread over full year) • Tier 2: • Require 1.5× more MC events than raw: sw changes and bug fixes • MC events produced in 8 months (can only start after Aug’09) • Tier 0: • Added 1 re-reco in each year • Capacity for express stream • Reco to finish in 2x runtime in ‘09 • Monitoring + commissioning is now 25% of total (was 10%) Requirement >10% more/less than pledge/requirement
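The sustained CPU these CMS figures imply can be estimated directly (my own rough calculation, assuming a per-event reconstruction cost of roughly 100 HS06.s in 2009 rising to 200 HS06.s at the higher 2010 luminosity, at the 300 Hz data-taking rate quoted above):

```python
# Rough estimate of the sustained capacity needed to reconstruct
# CMS data in real time: rate (Hz) x per-event cost (HS06.s).
rate_hz = 300
rec_cost_hs06_s = {"2009": 100, "2010": 200}  # per-event reco cost

capacity_khs06 = {yr: rate_hz * cost / 1000
                  for yr, cost in rec_cost_hs06_s.items()}

for yr, cap in capacity_khs06.items():
    print(f"{yr}: {cap:.0f} kHS06 sustained for real-time reconstruction")
```

This is only the prompt-reconstruction floor; the re-reconstruction passes and the requirement to finish the '09 re-reco in one month add multiples on top of it.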

  20. ALICE • Will collect p-p data at ~maximum rate: 1.5x10^9 events at 300 Hz • Initial running will give luminosity required without special machine tuning – cleaner data for many physics topics • First pp run energy is important in interpolating results to full Pb-Pb energy • Thus plan to collect large statistics pp in 2009-10 • Assume 1 month Pb-Pb at end of 2010 • Requests are within (or close to) existing ‘09 pledges except for Tier 2 disk • For 2010 – don’t know actual pledge for ALICE, but generally pledges are significantly lower than requirement. (so final column should be mostly pink for T1+T2!) Requirement >10% more/less than pledge/requirement

  21. LHCb • Uncertainty in running mode (pile-up) → add contingency on event sizes and simulation time • 2009 Simulation with assumed running conditions • Early data with loose trigger cuts and many reprocessing passes – alignment/calib + early physics • 2010 – several reprocessing passes and many stripping passes • Simulation over full period • CERN increase due to need for fast feedback to detector of alignment/calibration + anticipation of local analysis use • T1 CPU increase in 2010 due to more reprocessing • T2 requirements decrease as less overall simulation is needed NB. Previously LHCb had presented integrated CPU needs – shown here is the total capacity needed in each period, as for the other experiments Requirement >10% more/less than pledge/requirement

  22. Resource Planning - RRB • The Scrutiny Group also reported at the RRB: • Essential message was that they thought that the resources pledged for 2008/2009 should be sufficient for the data taking during 2009/2010 • Discussion followed ... • Conclusion was that the scrutiny group and the experiments together with their LHCC referees should discuss and come back with clarifications before the summer • LHCC will have a mini-review on computing 9-11 July (including LCG and LHCC referees of LCG)

  23. GDB topics – security challenge • Security service challenge: (from report at April GDB) • 2nd challenge run (following last year’s) • Each site was asked to trace a job back from the WN, through the CE and WMS, to the submitting UI; to ban a particular user; and to trace certain storage operations. • 9 Tier 1s were tested (NIKHEF & SARA; not the OSG sites or ASGC), plus the Prague Tier 2, which volunteered. • 6 of the 9 sites equalled or exceeded the maximum score (bonus points were possible). One of the others and the Prague T2 scored >90%. • The improvement for sites that were previously poor to middling was considerable. • The exception was the INFN T1 at CNAF: it took three attempts before there was any response at all, and even that was poorer than last year. The EGEE Security Officer will discuss this in detail with CNAF, but the MB should be concerned at the apparent inability of this T1 to react to standard procedures. • This test is currently being run against Tier 2s as well. UK and South-East Europe have completed, Asia Pacific and Benelux are in progress, and NDGF and OSG are preparing. The aim is to have completed all regions in time to report to the EGEE09 conference in September.

  24. EGI.eu • Location: Amsterdam, Science Park, Matrix Building IV • Decision taken by PB in Catania • Organizational Task Force • Members from NCF, NIKHEF and EGI_DS • Chair: Arjen van Rijn (NIKHEF) • Preparation of Convention and Statutes 13.5.2009

  25. Memorandum of Understanding • Identify parties ready to commit manpower and financial resources • NGIs and EIROforum organizations • Common Fund Administrator to deal with the financial contributions • Defines the EGI Collaboration • An interim step towards the EGI Council • Body with authority to deal with EGI project(s) preparation • Body with authority to assign an interim EGI.eu Director and other personnel • Released last Wednesday (6th May) • Comments until end of May • First round of signatures end of June • A minimal quorum of 10 parties and 150 k€ for the MoU to come into force • First financial contributions 1st October

  26. Letter of Intent • To identify parties interested in signing the MoU • Released together with the MoU • Deadline for signatures 25th May • Mostly informational, to collect preliminary interest • However, it will play a role in MoU endorsement

  27. LoI Signatures

  28. EGI.eu Convention and Statutes • Using the MoU as the input • Extending it to define a legal body—EGI.eu • EGI.eu will be a Foundation under Dutch Law • Open path towards ERIC in the future

  29. EGI Project(s) preparation • Leaders of Editorial Board nominated • Laura Perini for EGI proper—EGI.eu establishment, EGI operations, … • Cal Loomis for EGI application support • The idea is to encourage one project covering several scientific areas (SSCs) and their generic support • Steven Newhouse for interim EGI.eu director • Middleware development outside EGI Blueprint • Discussion within the UMD task force

  30. Schedule and Milestones • 25th May: LoI signed • 29th May: Next PB meeting • MoU discussion • Draft EGI.eu Convention and Statutes published • 30th May: Deadline for MoU comments • Early—Mid June: Final MoU published • 30th June: Deadline for MoU signature (first round) • Early July: Interim EGI Council convened • Endorsement of steps already taken • Endorsement of Editorial Board • Endorsement/election of EGI.eu director • New version of EGI.eu Convention and Statutes • 30th July: EC Call open • October/November: EGI.eu established • 1st October: Financial contributions to EGI Collaboration due • 5th December: EC Call closed

  31. WLCG and EGI • In previous meetings we have discussed planning for the EGEE to EGI transition period (or for the case where EGI is not in place) • Updated document (attached here): • Includes status of NGI planning for WLCG countries (presented on May12) in response to a number of questions posed to them (most Tier 1s + a few Tier 2-only countries) • Also includes list of services and responsibilities (present and anticipated) and list of middleware components with responsibilities

  32. Tier 1s were asked: • Which services do you currently provide for WLCG (via EGEE) that you will commit to continue to support (see attached slide), and what level of effort do you currently provide for these (separated into operation, maintenance, and development)? • Which services will you not be able to continue to support, or where may the level of effort be significantly decreased such that developments, bug fixes, etc. may be slowed? • What is the state of the planning for the NGI: • Will it be in place (and fully operational!) by the end of EGEE-III? • What is the management structure of the NGI? and • How do the Tier 1 and Tier 2s fit into that structure? • How will the effort that today is part of the ROCs (e.g. COD, TPM, etc.) for supporting WLCG operations evolve? How will daily operations support be provided? • Does the country intend to sign the Letter of Intent and MoU expressing the intention to be a full member of EGI? • Which additional services could the Tier 1 offer if other Tier 1s are unable to provide them? • Other issues particular to the country, or general problems to be addressed. • What are the plans to maintain the WLCG service if the NGI is not in place by May 2010, or if EGI.org is not in place? • For ASGC and TRIUMF it would be useful to hear about their plans in the absence of EGEE ROC support – i.e. do they have plans to continue or build local support centres? For BNL and FNAL it is assumed that nothing will really change on the timescale of the next year.

  33. Responses (May 12) • UK, France, Italy, Nordic, NL • Structures in place – expect to continue to provide existing services • Germany • Situation under discussion (Gauss alliance), but WLCG commitments clear • Spain • Structure for NGI not yet in place, but intend to fulfil Tier 1+Tier 2 service commitments • See document for details

  34. Outside Europe • CERN ROC will close at end of EGEE-III • Today supports several countries/sites outside of Europe • Latin America • Brazil, Mexico, Colombia propose to fund a LA-ROC to support Latin American LHC collaborators • Supported by many other LA countries (list...) • Will send people to CERN for training • Asia-Pacific • A-P ROC in Taipei will remain in much the same way as now • Canada • Will be self-supporting, but is also offering to set up a ROC potentially in support of other sites if necessary

  35. CERN’s roles in EGI • CERN will participate in all aspects of the EGI: • EGI.eu • Hopefully as a full member with voting rights (but still some doubts) • Specialised Support Centres (SSC) • CERN will lead the formation of an SSC for HEP (+astroparticle?) together with other partners • Will be of direct benefit to WLCG • Middleware • The gLite consortium must urgently be put in place – Letters of Intent have been signed by (almost) all key partners • This is a minimum solution for ongoing support of software in production • Hopefully a collaboration between gLite and ARC can eventually participate in a project proposal
