
  1. CHEP – Mumbai, February 2006 State of Readiness of LHC Computing Infrastructure Jamie Shiers, CERN

  2. Introduction • Some attempts to define what “readiness” could mean • How we (will) actually measure it… • Where we stand today • What we have left to do – or can do in the time remaining… • Timeline to First Data • Related Talks • Summary & Conclusions

  3. What are the requirements? • Since the last CHEP, we have seen: • The LHC Computing Model documents and Technical Design Reports; • The associated LCG Technical Design Report; • The finalisation of the LCG Memorandum of Understanding (MoU) • Together, these define not only the functionality required (Use Cases), but also the requirements in terms of Computing, Storage (disk & tape) and Network • But not necessarily in a site-accessible format… • We also have close-to-agreement on the Services that must be run at each participating site • Tier0, Tier1, Tier2, VO-variations (few) and specific requirements • We also have close-to-agreement on the roll-out of Service upgrades to address critical missing functionality • We have an on-going programme to ensure that the service delivered meets the requirements, including the essential validation by the experiments themselves

  4. How do we measure success? • By measuring the service we deliver against the MoU targets • Data transfer rates; • Service availability and time to resolve problems; • Resources provisioned across the sites as well as measured usage… • By the “challenge” established at CHEP 2004: • [ The service ]“should not limit ability of physicist to exploit performance of detectors nor LHC’s physics potential“ • “…whilst being stable, reliable and easy to use” • Preferably both… • Equally important is our state of readiness for startup / commissioning, that we know will be anything but steady state • [ Oh yes, and that favourite metric I’ve been saving… ]

  5. LHC Startup • Startup schedule expected to be confirmed around March 2006 • Working hypothesis remains ‘Summer 2007’ • Lower than design luminosity & energy expected initially • But triggers will be opened so that data rate = nominal • Machine efficiency still an open question – look at previous machines??? • Current targets: • Pilot production services from June 2006 • Full production services from October 2006 • Ramp up in capacity & throughput to TWICE NOMINAL by April 2007

  6. LHC Commissioning Expect to be characterised by: • Poorly understood detectors, calibration, software, triggers etc. • Most likely no AOD or TAG from first pass – but ESD will be larger? • The pressure will be on to produce some results as soon as possible! • There will not be sufficient resources at CERN to handle the load • We need a fully functional distributed system, aka Grid • There are many Use Cases we have not yet clearly identified • Nor indeed tested – this remains to be done in the coming 9 months!

  7. LCG Service Hierarchy Tier-0 – the accelerator centre • Data acquisition & initial processing • Long-term data curation • Distribution of data → Tier-1 centres Tier-1 – “online” to the data acquisition process → high availability • Managed Mass Storage – grid-enabled data service • Data intensive analysis • National, regional support • Continual reprocessing activity (or is that continuous?) Tier-1 sites: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois), Brookhaven (NY) Tier-2 – ~100 centres in ~40 countries • Simulation • End-user analysis – batch and interactive Les Robertson
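
One way this hierarchy might be expressed for planning or monitoring scripts is sketched below; the structure, field names and role strings are purely illustrative (they simply restate the slide) and do not correspond to any official LCG data model.

```python
# Purely illustrative: the LCG service hierarchy above as a plain Python
# structure (not an official LCG/WLCG data model or API).

LCG_HIERARCHY = {
    "Tier-0": {
        "sites": ["CERN"],
        "roles": ["data acquisition & initial processing",
                  "long-term data curation",
                  "distribution of data to Tier-1 centres"],
    },
    "Tier-1": {
        "sites": ["TRIUMF (Vancouver)", "IN2P3 (Lyon)",
                  "Forschungszentrum Karlsruhe", "CNAF (Bologna)",
                  "NIKHEF (Amsterdam)", "Nordic distributed Tier-1",
                  "PIC (Barcelona)", "Academia Sinica (Taipei)",
                  "CLRC (Didcot)", "FermiLab (Illinois)", "Brookhaven (NY)"],
        "roles": ["managed mass storage (grid-enabled data service)",
                  "data-intensive analysis",
                  "national / regional support",
                  "continual reprocessing"],
    },
    "Tier-2": {
        "sites": ["~100 centres in ~40 countries"],
        "roles": ["simulation", "end-user analysis (batch and interactive)"],
    },
}

for tier, info in LCG_HIERARCHY.items():
    print(f"{tier}: {'; '.join(info['roles'])}")
```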

  8. The Dashboard • Sounds like a conventional problem for a ‘dashboard’ • But there is not one single viewpoint… • Funding agency – how well are the resources provided being used? • VO manager – how well is my production proceeding? • Site administrator – are my services up and running? MoU targets? • Operations team – are there any alarms? • LHCC referee – how is the overall preparation progressing? Areas of concern? • … • Nevertheless, much of the information that would need to be collected is common… • So separate the collection from presentation (views…) • As well as the discussion on metrics…
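
One way to sketch the “separate collection from presentation” idea: a single shared store of collected measurements, with several role-specific views layered on top. All class, function and field names here are illustrative, not part of any actual dashboard implementation.

```python
# Sketch: one common collection of measurements, multiple viewpoints.
# Names, fields and numbers are illustrative only.

from collections import defaultdict

measurements = []  # the shared collection layer

def collect(site, vo, metric, value):
    """Collection: every monitoring source publishes into one common store."""
    measurements.append({"site": site, "vo": vo, "metric": metric, "value": value})

def site_admin_view(site):
    """Site administrator: are my services up, am I meeting MoU targets?"""
    return [m for m in measurements if m["site"] == site]

def vo_manager_view(vo):
    """VO manager: how well is my production proceeding across sites?"""
    return [m for m in measurements if m["vo"] == vo]

def funding_agency_view():
    """Funding agency: how well are the provided resources being used?"""
    usage = defaultdict(float)
    for m in measurements:
        if m["metric"] == "cpu_hours_used":
            usage[m["site"]] += m["value"]
    return dict(usage)

collect("CERN", "ATLAS", "cpu_hours_used", 1200.0)
collect("IN2P3", "CMS", "cpu_hours_used", 800.0)
print(funding_agency_view())        # {'CERN': 1200.0, 'IN2P3': 800.0}
print(len(vo_manager_view("CMS")))  # 1
```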

  9. The Requirements • Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network • Look at the Computing TDRs; • Look at the resources pledged by the sites (MoU etc.); • Look at the plans submitted by the sites regarding acquisition, installation and commissioning; • Measure what is currently (and historically) available; signal anomalies. • Functional requirements, in terms of services and service levels, including operations, problem resolution and support • Implicit / explicit requirements in Computing Models; • Agreements from Baseline Services Working Group and Task Forces; • Service Level definitions in MoU; • Measure what is currently (and historically) delivered; signal anomalies. • Data transfer rates – the TierX → TierY matrix • Understand Use Cases; • Measure … And test extensively, both ‘dteam’ and other VOs
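
A rough sketch of the “measure what is available and signal anomalies” step: compare pledged capacity against what accounting reports and flag sites that fall below some threshold. The site names, numbers and the 80% threshold are all made up for illustration.

```python
# Illustrative anomaly check: pledged vs. measured capacity per site.
# Site names, capacities and threshold are fabricated examples.

PLEDGED_TB = {"SiteA": 400, "SiteB": 250, "SiteC": 600}   # hypothetical pledges
MEASURED_TB = {"SiteA": 390, "SiteB": 140, "SiteC": 610}  # hypothetical accounting

THRESHOLD = 0.80  # flag anything below 80% of pledge (arbitrary example value)

def signal_anomalies(pledged, measured, threshold=THRESHOLD):
    for site, pledge in pledged.items():
        delivered = measured.get(site, 0)
        if delivered < threshold * pledge:
            print(f"ANOMALY: {site} delivering {delivered} TB of "
                  f"{pledge} TB pledged ({delivered / pledge:.0%})")

signal_anomalies(PLEDGED_TB, MEASURED_TB)   # flags SiteB only
```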

  10. The Requirements • Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network • Look at the Computing TDRs; • Look at the resources pledged by the sites (MoU etc.); • Look at the plans submitted by the sites regarding acquisition, installation and commissioning; • Measure what is currently (and historically) available. • Functional requirements, in terms of services and service levels, including operations, problem resolution and support • Implicit / explicit requirements in Computing Models; • Agreements from Baseline Services Working Group and Task Forces; • Service Level definitions in MoU; • Measure what is currently (and historically) delivered; signal anomalies. • Data transfer rates – the TierX → TierY matrix • Understand Use Cases; • Measure … And test extensively, both ‘dteam’ and other VOs

  11. Resource Deployment and Usage – Resource Requirements for 2008

  12. ATLAS Resource Ramp-Up Needs – Tier-0, Tier-1s, CERN Analysis Facility, Tier-2s

  13. Site Planning Coordination • Site plans coordinated by LCG Planning Officer, Alberto Aimar • Plans are now collected in a standard format, updated quarterly • These allow tracking of progress towards agreed targets • Capacity ramp-up to MoU deliverables; • Installation and testing of key services; • Preparation for milestones, such as LCG Service Challenges…

  14. Measured Delivered Capacity Various accounting summaries: • LHC View http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php • Data Aggregation across Countries • EGEE View http://www2.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php • Data Aggregation across EGEE ROC • GridPP View http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php • Specific view for GridPP accounting summaries for Tier-2s

  15. The Requirements • Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network • Look at the Computing TDRs; • Look at the resources pledged by the sites (MoU etc.); • Look at the plans submitted by the sites regarding acquisition, installation and commissioning; • Measure what is currently (and historically) available. • Functional requirements, in terms of services and service levels, including operations, problem resolution and support • Implicit / explicit requirements in Computing Models; • Agreements from Baseline Services Working Group and Task Forces; • Service Level definitions in MoU; • Measure what is currently (and historically) delivered; signal anomalies. • Data transfer rates – the TierX → TierY matrix • Understand Use Cases; • Measure … And test extensively, both ‘dteam’ and other VOs

  16. Reaching the MoU Service Targets • These define the (high level) services that must be provided by the different Tiers • They also define average availability targets and intervention / resolution times for downtime & degradation • These differ from TierN to TierN+1 (less stringent as N increases) but refer to the ‘compound services’, such as “acceptance of raw data from the Tier0 during accelerator operation” • Thus they depend on the availability of specific components – managed storage, reliable file transfer service, database services, … • Can only be addressed through a combination of appropriate: • Hardware; Middleware and Procedures • Careful Planning & Preparation • Well understood operational & support procedures & staffing
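
The kind of check implied by these targets can be sketched as follows; the availability percentages and intervention times below are placeholders chosen only to show the shape of the comparison, not the actual WLCG MoU figures.

```python
# Illustrative check of measured service levels against MoU-style targets.
# The numbers are placeholders, NOT the real MoU values.

TARGETS = {
    "Tier0": {"availability": 0.99, "max_outage_hours": 4},
    "Tier1": {"availability": 0.98, "max_outage_hours": 12},
    "Tier2": {"availability": 0.95, "max_outage_hours": 72},
}

def meets_target(tier, measured_availability, worst_outage_hours):
    t = TARGETS[tier]
    return (measured_availability >= t["availability"]
            and worst_outage_hours <= t["max_outage_hours"])

print(meets_target("Tier1", 0.985, 8))   # True with these placeholder numbers
print(meets_target("Tier1", 0.985, 20))  # False: outage longer than allowed
```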

  17. Service Monitoring - Introduction • Service Availability Monitoring Environment (SAME) - uniform platform for monitoring all core services based on SFT experience • Two main end users (and use cases): • project management - overall metrics • operators - alarms, detailed info for debugging, problem tracking • A lot of work already done: • SFT and GStat are monitoring CEs and Site-BDIIs • Data schema (R-GMA) established • Basic displays in place (SFT report, CIC-on-duty dashboard, GStat) and can be reused

  18. Service Level Definitions Tier0 services: Critical/High; Tier1 services: High/Medium; Tier2 services: Medium/Low

  19. Service Functionality https://twiki.cern.ch/twiki/bin/view/LCG/Planning

  20. Breakdown of a normal year – From Chamonix XIV – 7-8 Service upgrade slots? ~ 140-160 days for physics per year Not forgetting ion and TOTEM operation Leaves ~ 100-120 days for proton luminosity running? Efficiency for physics 50%? ~ 50 days ~ 1200 h ~ 4 × 10⁶ s of proton luminosity running / year R. Bailey, Chamonix XV, January 2006
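
The chain of estimates above checks out with simple arithmetic:

```latex
100\text{--}120~\text{days} \times 50\%~\text{efficiency} \approx 50~\text{days}, \qquad
50~\text{days} \times 24~\mathrm{h/day} = 1200~\mathrm{h}, \qquad
1200~\mathrm{h} \times 3600~\mathrm{s/h} \approx 4.3 \times 10^{6}~\mathrm{s}
```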

  21. Site & User Support • Ready to move to single entry point now • Target is to replace all interim mailing lists prior to SC4 Service Phase • i.e. by end-May for 1st June start • Send mail to helpdesk@ggus.org | VO-user-support@ggus.org • Also portal at www.ggus.org

  22. PPS & WLCG Operations • Production-like operation procedures and tools need to be introduced in PPS • Must re-use as much as possible from production service. • This has already started (SFT, site registration) but we need to finish this very quickly – end of February? • PPS operations must be taken over by COD • Target proposed at last “COD meeting” was end March 2006 • This is a natural step also for “WLCG production operations” • And is consistent with the SC4 schedule • Production Services from beginning of June 2006

  23. The Requirements • Resource requirements, e.g. ramp-up in TierN CPU, disk, tape and network • Look at the Computing TDRs; • Look at the resources pledged by the sites (MoU etc.); • Look at the plans submitted by the sites regarding acquisition, installation and commissioning; • Measure what is currently (and historically) available. • Functional requirements, in terms of services and service levels, including operations, problem resolution and support • Implicit / explicit requirements in Computing Models; • Agreements from Baseline Services Working Group and Task Forces; • Service Level definitions in MoU; • Measure what is currently (and historically) delivered; signal anomalies. • Data transfer rates – the TierX → TierY matrix • Understand Use Cases; • Measure … And test extensively, both ‘dteam’ and other VOs

  24. Summary of Tier0/1/2 Roles • Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times; • Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s; • Tier2: handling analysis requirements and proportional share of simulated event production and reconstruction. N.B. there are differences in roles by experiment – essential to test using the complete production chain of each!

  25. Sustained Average Data Rates to Tier1 Sites (To Tape) Need additional capacity to recover from inevitable interruptions…
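
One illustrative way to quantify the headroom needed: if a transfer path is usable only a fraction u of the time, the rate sustained while it is up must exceed the nominal average by a factor 1/u (the 80% figure below is a made-up example, not a measured efficiency):

```latex
R_{\text{required}} = \frac{R_{\text{nominal}}}{u},
\qquad \text{e.g. } u = 0.8 \;\Rightarrow\; R_{\text{required}} = 1.25 \times R_{\text{nominal}}
```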

  26. LCG OPN Status • Based on expected data rates during pp and AA running, 10Gbit/s networks are required between the Tier0 and all Tier1s • Inter-Tier1 traffic (reprocessing and other Use Cases) was one of the key topics discussed at the SC4 workshop this weekend, together with TierX → TierY needs for analysis data, calibration activities and other studies • A number of sites already have their 10Gbit/s links in operation • The remainder are expected during the course of the year

  27. Service Challenge Throughput Tests • Currently focussing on Tier0 → Tier1 transfers with modest Tier2 → Tier1 upload (simulated data) • Recently achieved target of 1GB/s out of CERN with rates into Tier1s at or close to nominal rates • Still much work to do! • We still do not have the stability required / desired… • The daily average needs to meet / exceed targets • We need to handle this without “heroic efforts” at all times of day / night! • We need to sustain this over many (100) days • We need to test recovery from problems (individual sites – also Tier0) • We need these rates to tape at Tier1s (currently disk) • Agree on milestones for TierX → TierY transfers & demonstrate readiness
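
The “daily average needs to meet / exceed targets” criterion reduces to simple bookkeeping; the sketch below uses a 1 GB/s aggregate target and made-up daily volumes purely for illustration.

```python
# Illustrative daily-average throughput check (decimal units: 1 TB = 1e6 MB).
# Target and daily volumes are made-up example numbers.

TARGET_MB_PER_S = 1000          # e.g. 1 GB/s aggregate out of the Tier0
SECONDS_PER_DAY = 86_400

daily_terabytes = [78.0, 91.5, 83.2]   # fabricated daily totals

for day, tb in enumerate(daily_terabytes, start=1):
    avg_mb_per_s = tb * 1e6 / SECONDS_PER_DAY
    status = "OK" if avg_mb_per_s >= TARGET_MB_PER_S else "BELOW TARGET"
    print(f"day {day}: {avg_mb_per_s:7.1f} MB/s  {status}")
```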

  28. Achieved (Nominal) pp data rates Meeting or exceeding nominal rate (disk – disk) • To come: • SRM copy support in FTS; • CASTOR2 at remote sites; • SLC4 at CERN; • Network upgrades etc. Met target rate for SC3 (disk & tape) re-run Missing: rock solid stability – nominal tape rates SC4 T0-T1 throughput goals: nominal rates to disk (April) and tape (July)

  29. CMS Tier1 – Tier1 Transfers • In the CMS computing model the Tier-1 to Tier-1 transfers are reasonably small. • The Tier-1 centers are used for re-reconstruction of events, so reconstructed events from some samples and analysis objects from all samples are replicated between Tier-1 centers. Goal for Tier-1 to Tier-1 transfers: • FNAL -> One Tier-1: 1TB per day, February 2006 • FNAL -> Two Tier-1's: 1TB per day each, March 2006 • FNAL -> 6 Tier-1 Centers: 1TB per day each, July 2006 • FNAL -> One Tier-1: 4TB per day, July 2006 • FNAL -> Two Tier-1s: 4TB per day each, November 2006 ATLAS – 2 copies of ESD? 1 day = 86,400 s ≈ 10⁵ s Ian Fisk
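
For orientation, the TB-per-day goals above correspond to modest sustained rates (back-of-envelope, decimal units):

```latex
1~\mathrm{TB/day} = \frac{10^{12}~\text{bytes}}{86\,400~\mathrm{s}} \approx 11.6~\mathrm{MB/s} \approx 93~\mathrm{Mb/s},
\qquad
4~\mathrm{TB/day} \approx 46~\mathrm{MB/s} \approx 370~\mathrm{Mb/s}
```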

  30. SC4 milestones (2) Tier-1 to Tier-2 Transfers (target rate 300-500Mb/s) • Sustained transfer of 1TB data to 20% sites by end December • Sustained transfer of 1TB data from 20% sites by end December • Sustained transfer of 1TB data to 50% sites by end January • Sustained transfer of 1TB data from 50% sites by end January • Peak rate tests undertaken for the two largest Tier-2 sites in each Tier-2 by end February • Sustained individual transfers (>1TB continuous) to all sites completed by mid-March • Sustained individual transfers (>1TB continuous) from all sites completed by mid-March • Peak rate tests undertaken for all sites by end March • Aggregate Tier-2 to Tier-1 tests completed at target rate (rate TBC) by end March Tier-2 Transfers (target rate 100 Mb/s) • Sustained transfer of 1TB data between largest site in each Tier-2 to that of another Tier-2 by end February • Peak rate tests undertaken for 50% sites in each Tier-2 by end February
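
As a sanity check on these milestones, moving 1 TB at the quoted target rates takes several hours of sustained transfer (decimal units):

```latex
\frac{8 \times 10^{12}~\text{bits}}{300~\mathrm{Mb/s}} \approx 2.7 \times 10^{4}~\mathrm{s} \approx 7.4~\mathrm{h},
\qquad
\frac{8 \times 10^{12}~\text{bits}}{100~\mathrm{Mb/s}} = 8 \times 10^{4}~\mathrm{s} \approx 22~\mathrm{h}
```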

  31. June 12-14 2006 “Tier2” Workshop • Focus on analysis Use Cases and Tier2s in particular • List of Tier2s reasonably well established • Try to attract as many as possible! • Some 20+ already active – target of 40 by September 2006! • Still many to bring up to speed – re-use experience of existing sites! • Important to understand key data flows • How experiments will decide which data goes where • Where does a Tier2 archive its MC data? • Where does it download the relevant Analysis data? • The models have evolved significantly over the past year! • Two-three day workshop followed by 1-2 days of tutorials Bringing remaining sites into play: Identifying remaining Use Cases

  32. Summary of Key Issues • There are clearly many areas where a great deal still remains to be done, including: • Getting stable, reliable, data transfers up to full rates • Identifying and testing all other data transfer needs • Understanding experiments’ data placement policy • Bringing services up to required level – functionality, availability, (operations, support, upgrade schedule, …) • Delivery and commissioning of needed resources • Enabling remaining sites to rapidly and effectively participate • Accurate and concise monitoring, reporting and accounting • Documentation, training, information dissemination…

  33. And Those Other Use Cases? • A small 1 TB dataset transported at "highest priority" to a Tier1 or a Tier2 or even a user group where CPU resources are available. • I would give it 3 Gbps so I can support 2 of them at once (max in the presence of other flows and some headroom). So this takes 45 minutes. • 10 TB needs to be moved from one Tier1 to another or a large Tier2. • It takes 450 minutes, as above, so only ~two per day can be supported per 10G link.
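
The timings quoted above follow directly from the rates (decimal units):

```latex
\frac{8 \times 10^{12}~\text{bits}}{3~\mathrm{Gb/s}} \approx 2.7 \times 10^{3}~\mathrm{s} \approx 45~\text{min},
\qquad
10~\mathrm{TB} \Rightarrow 10 \times 45~\text{min} = 450~\text{min} \approx 7.5~\mathrm{h}
```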

  34. Timeline - 2006 O/S Upgrade? (SLC4) Sometime before April 2007!

  35. The Dashboard Again…

  36. (Some) Related Talks • The LHC Computing Grid Service (plenary) • BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plans • CMS experience in LCG SC3 • The LCG Service Challenges - Results from the Throughput Tests and Service Deployment • Global Grid User Support: the model and experience in the Worldwide LHC Computing Grid • The gLite File Transfer Service: Middleware Lessons Learned from the Service Challenges

  37. Summary • In the 3 key areas addressed by the WLCG MoU: • Data transfer rates; • Service availability and time to resolve problems; • Resources provisioned. we have made good – sometimes excellent – progress over the last year. • There still remains a huge amount to do, but we have a clear plan of how to address these issues. • Need to be pragmatic, focussed and work together on our common goals.
