Status of the WLCG Tier-2 Centres

M.C. Vetterli, Simon Fraser University and TRIUMF
WLCG Overview Board, CERN, October 27th 2008

Sources of Information
• Discussions with experiment representatives in July
• APEL monitoring portal: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
• WLCG reliability reports: http://lcg.web.cern.ch/LCG/accounts.htm
• October GDB meeting, dedicated to Tier-2 issues: http://indico.cern.ch/conferenceDisplay.py?confId=20234
• Talks from the last OB & LHCC
Slides labeled with a * are from MV’s LHCC rapporteur talk

Tier-2 Performance Summary*
• Overall, the Tier-2s are contributing much more now
• Significant fractions of the Monte Carlo simulations are being done in the T2s for all experiments
• Reliability is better, but still needs to improve
• CCRC’08 exercise is generally considered a success for the Tier-2s

Tier-2 Centres in CCRC’08 – General*
• Overall, the Tier-2s and the experiments considered the CCRC’08 exercise to be a success
• The networking/data transfers were tested extensively; some FTS tuning was needed, but it worked out
• Experiments tended to continue other activities in parallel, which is a good test of the system, although the load was not as high as anticipated
• While CMS did include significant user analysis activities, the chaotic use of the Grid by a large number of inexperienced people is still to be tested

Tier-2 Issues/Concerns
As of the CB and meetings with the experiments this summer
• Communications: Do Tier-2s have a voice? Is there a good mechanism for disseminating information?
• Better monitoring: Pledges vs actual vs used
• Hardware acquisitions: What should be bought? kSI2006?
• Tier-2 capacity: Size of datasets? Effect of LHC delay?
• …

Tier-2 Issues/Concerns
• Upcoming onslaught of users: Some user analysis tests have been done but scaling is a concern
• User Support: Ticketing system exists but it is not really used for user support issues. This affects Tier-2s especially.
• Federated Tier-2s: Tools to federate? Monitoring? (averaging)
• Interoperability of EGEE, OSG, and NDGF should be improved
• Software/Middleware updates: Could be smoother; too frequent

Communications for Tier-2s
• Identified by the T2s at the last CB as a serious problem. Interesting to me that many in experiment computing management did not share this concern.
• Should communication be organized according to experiment or to Tier-1 association? There are also differing opinions on this. There are two issues:
  - Grid middleware/operations
  - Experiment software
• My view after studying this is that the situation is OK for “tightly coupled” Tier-2s, but not for remote and smaller Tier-2s that are not well coupled to a Tier-1.

Communications for Tier-2s
• Many lines of communication do indeed exist.
• Some examples are:
  CMS has two Tier-2 coordinators: Ken Bloom (Nebraska) and Giuseppe Bagliesi (INFN)
  - attend all operations meetings
  - feed T2 issues back to the operations group
  - write T2-relevant minutes
  - organize T2 workshops
  ALICE has designated 1 Core Offline person in 3 to have privileged contact with a given T2 site manager
  - weekly coordination meetings
  - Tier-2 federations provide a single contact person
  - A Tier-2 coordinates with its regional Tier-1

Communications for Tier-2s ATLAS uses its cloud structure for communications- Every Tier-2 is coupled to a Tier-1 - 5 national clouds; others have foreign members (e.g. “Germany” includes Krakow, Prague, Switzerland; Netherlands includes Russia, Israel, Turkey) - Each cloud has a Tier-2 coordinatorRegional organizations, such as:+ France Tier-2/3 technical group:- coordinates with Tier-1 and with experiments - monthly meetings - coordinates procurement and site management+ GRIF:Tier-2 federation of 5 labs around Paris+ Canada:Weekly teleconferences of technical personnel (T1 & T2) to share information and prepare for upgrades, large production, etc.+ Many others exist; e.g. in the US and the UK

Communications for Tier-2s
• Tier-2 Overview Board reps: Michel Jouvin and Atul Gurtu have just been appointed to the OB to give the Tier-2s a voice there.
• Tier-2 mailing list: Actually exists and is being reviewed for completeness & accuracy
• Tier-2 GDB: The October GDB was dedicated to Tier-2 issues
  + reports from experiments: role of the T2s; communications
  + talks on regional organizations
  + discussion of accounting
  + technical talks on storage, batch systems, middleware
  Seems to have been a success; repeat a couple of times per year?

Tier-2 Installed Resources
• But how much of this is a problem of under-use rather than under-contribution? A task force has been set up to extract installed capacities from the Glue schema
• Monthly APEL reports still undergo significant modifications from first draft.
  Good because communication with T2s is better
  Bad because APEL accounting still has problems
  Accounting seems to be very finicky; breaks when the CE or MON box is upgraded
• How are jobs distributed to the Tier-2s?
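The under-use versus under-contribution question on this slide comes down to two ratios: installed capacity against the pledge, and used capacity against what is installed. The following is a minimal sketch of that bookkeeping, not the task force's method. The class, the function, the 0.8 threshold and the two sites are all hypothetical, and the numbers are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteCapacity:
    name: str
    pledged: float    # capacity pledged to WLCG (any consistent unit, e.g. kSI2000)
    installed: float  # capacity actually installed at the site
    used: float       # capacity consumed by the experiments, from accounting

    def contribution(self) -> float:
        """Installed capacity as a fraction of the pledge."""
        return self.installed / self.pledged

    def utilisation(self) -> float:
        """Used capacity as a fraction of what is installed."""
        return self.used / self.installed


def diagnose(site: SiteCapacity, threshold: float = 0.8) -> list[str]:
    """Label a site whose ratios fall below an (arbitrary) threshold."""
    labels = []
    if site.contribution() < threshold:
        labels.append("under-contribution")
    if site.utilisation() < threshold:
        labels.append("under-use")
    return labels


# Made-up numbers for illustration only; not taken from any WLCG report.
sites = [
    SiteCapacity("site-a", pledged=1000, installed=1000, used=500),
    SiteCapacity("site-b", pledged=1000, installed=600, used=570),
]
for site in sites:
    print(site.name, diagnose(site))
```

In this toy input, site-a has met its pledge but is half idle, while site-b is busy but has installed less than it pledged. Telling the two cases apart needs the installed figure, which is what the Glue schema extraction is meant to supply.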


Tier-2 Hardware Questions
• How does the LHC delay affect the requirements and pledges for 2009?
  + We are told to go ahead and buy what was planned, but we have already seen some under-use of CPU and we are now starting to see this for storage as well
• We need to use something other than SpecInt2000!
  + this benchmark is totally out-of-date & useless for new CPUs
  + continued delays in SpecHEP can cause sub-optimal decisions

Tier-2 Hardware Questions
• Networking to the nodes is now an issue.
  + with 8 cores per node, 1 GigE connection ≈ 16.8 MB/sec/core
  + Tier-2 analysis jobs run on reduced data sets and can do rather simple operations
    have seen 7.5 MB/sec at ATLAS and much more (x10?)
  + Do we need to go to Infiniband?
  + We certainly need increased capability for the uplinks; we should have a minimum of fully non-blocking GigE to the worker nodes.
We need more guidance from the experiments. The next round of purchases is now!
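The per-core figure on this slide is the link bandwidth divided by the number of cores sharing it. A minimal sketch of the arithmetic follows; the function names are illustrative, while the 8 cores, the 7.5 MB/sec rate and the "x10" factor come from the slide. Gigabit Ethernet nominally carries 10^9 bit/s, which gives a slightly lower share than the quoted 16.8 MB/sec/core. The slide's number is reproduced if "giga" is read as 2^30, which is an assumption about how it was derived.

```python
def per_core_share_mb_s(link_bits_per_s: float, cores: int) -> float:
    """Network share per core in MB/s (1 MB = 10**6 bytes), with all cores reading at once."""
    return link_bits_per_s / 8 / cores / 1e6


def jobs_to_saturate(link_bits_per_s: float, job_rate_mb_s: float) -> float:
    """Number of concurrent jobs, each reading at job_rate_mb_s, that fill the link."""
    return link_bits_per_s / 8 / 1e6 / job_rate_mb_s


CORES_PER_NODE = 8          # from the slide
OBSERVED_RATE_MB_S = 7.5    # from the slide (ATLAS analysis jobs)

share_decimal = per_core_share_mb_s(1e9, CORES_PER_NODE)    # giga = 10**9
share_binary = per_core_share_mb_s(2**30, CORES_PER_NODE)   # giga = 2**30 (assumed reading)

print(f"per-core share, 10**9 bit/s link: {share_decimal:.3f} MB/s")
print(f"per-core share, 2**30 bit/s link: {share_binary:.3f} MB/s")
print(f"jobs at 7.5 MB/s to fill the link: {jobs_to_saturate(1e9, OBSERVED_RATE_MB_S):.1f}")
print(f"jobs at 75 MB/s to fill the link:  {jobs_to_saturate(1e9, 10 * OBSERVED_RATE_MB_S):.1f}")
```

Under these assumptions the share is about 15.6 MB/sec/core for a 10^9 bit/s link and about 16.8 for the 2^30 reading. At 7.5 MB/sec per job, roughly 17 concurrent jobs fill one GigE link, so an 8-core node still has headroom. At ten times that rate, fewer than two jobs would saturate it.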

Summary
• The role of the Tier-2 centres has increased markedly in the last year; >50% of Monte Carlo simulation is done in the T2s now.
• The CCRC’08 exercise is considered a success by the Tier-2s and by the experiments.
• Availability and reliability are up, but still need improvement.
• Resource acquisition vs pledges is better but still needs work
• Issues for Tier-2s:
  - communication should be (& is being) improved
  - work should ramp up on chaotic user analysis
  - reporting actual resources should be established
  - improved user support is needed
