
Focusing on the first beams CHEP 2006 TIFR, Mumbai 15 February 2006


Presentation Transcript


  1. Focusing on the first beams CHEP 2006 TIFR, Mumbai 15 February 2006

  2. Jamie Shiers on Monday gave a view of how ready we are for the start-up of LHC. This talk puts the LCG service into the context of the evolving HEP and scientific computing environment .. looks at where our expectations were and were not fulfilled .. and outlines where we need to focus our efforts now as we prepare for the first beams.

  3. Mission of LCG • Prepare and deploy the LHC computing environment to help the experiments analyse the data coming from the detectors • With a significant funding principle – LHC computing resources will NOT be centralised at CERN • And a few external constraints

  4. A bit of history [diagram: CERN – Tier-1 – Tier-2 hierarchy] • 1999 – the MONARC project • A straightforward distributed model • An inverted tree with data flowing out along the branches • Gave us the Tier nomenclature • 2000 – CHEP Padova – growing interest in grid technology • HEP community main driver in launching the DataGrid project in Europe • PPDG & GriPhyN in the US • middleware & testbeds for operational grids • 2001 – CHEP Beijing • Saw HEP infrastructure projects being prepared for launch – LCG, national projects • 2003 – CHEP San Diego – production grids • LCG-1 – integrating a number of national grid infrastructures • Grid3 growing out of a Supercomputer demo • 2004 – CHEP Interlaken – expanding to other communities and sciences • EU EGEE project with major EU funding – starts from the LCG grid • Open Science Grid

  5. The Worldwide LCG Collaboration • Members • The experiments • The computing centres – Tier-0, Tier-1, Tier-2 • Memorandum of understanding • Resources, services, defined service levels • Resource commitments pledged for the next year, with a 5-year forward look

  6. LCG services – built on two major science grid infrastructures EGEE - Enabling Grids for E-Science OSG - US Open Science Grid

  7. Enabling Grids for E-SciencE • EU supported project • Develop and operate a multi-science grid • Assist scientific communities to embrace grid technology • First phase concentrated on operations and technology • Second phase (2006-08) Emphasis on extending the scientific, geographical and industrial scope • world-wide Grid infrastructure • international collaboration • in phase 2 will have > 90 partners in 32 countries

  8. Applications >20 applications from 7 domains • High Energy Physics • Biomedicine • Earth Sciences • Computational Chemistry • Astronomy • Geo-Physics • Financial Simulation Another 8 applications from 4 domains are in evaluation stage

  9. Sustainability: Beyond EGEE-II • Need to prepare for permanent Grid infrastructure • Maintain Europe’s leading position in global science Grids • Ensure a reliable and adaptive support for all sciences • Independent of project funding cycles • Modelled on success of GÉANT • Infrastructure managed centrally in collaboration with National Grid Initiatives • Proposal: European Grid Organisation (EGO)

  10. Open Science Grid • Multi-disciplinary Consortium • Running physics experiments: CDF, D0, LIGO, SDSS, STAR • US LHC Collaborations • Biology, Computational Chemistry • Computer Science research • Condor and Globus • DOE Laboratory Computing Divisions • University IT Facilities • OSG today • 50 Compute Elements • 6 Storage Elements • VDT 1.3.9 • 23 VOs

  11. OSG Funding Situation • Core middleware: Condor and Globus supported for 5 more years from NSF. • OSG Proposal: 5 year program of work being submitted to multiple program offices in NSF and DOE. Expect to know by summer 2006. Three Thrusts: • OSG Facility • Education, Training and Outreach • Science Driven Extensions. • Cooperating Proposals in Parallel being submitted to SciDAC-2 and NSF: • dCache extensions within the dCache collaboration. • Advanced networks and monitoring. • Distributed systems and cybersecurity. • Data and storage management; • Petastore data analysis systems.

  12. Open Science Grid as part of the Worldwide LHC Computing Grid • Directly through the CMS & ATLAS US Tier-1 Facilities. • Collaborating with EGEE on interoperability of services, operations etc. • LCG VOs can be registered on OSG - ATLAS, CMS, Geant3, DTEAM. • OSG roadmap and baselines services defined to meet LHC needs and schedule.

  13. How does the computing technology that we have in our hands match up to our expectations?

  14. Processing & Storage Technology – Expectation v. Reality (no thought of GRID in 1996!) Some of the predictions of PASTA I – 1996: • Processor Conclusions • Processor performance in 2005: 4,000 SPECint92 = 1,000 SPECint2000 • “SMPs with modest number of processors will provide excellent price/performance and will be the basic building block ..” • Storage Conclusions • “It does not seem likely that alternative technologies such as optical disk, flash memories or holographic storage will provide serious competition for magnetic disk in the LHC time-frame.” • “Cheap disk could change the way in which tape based storage is used.” • Tape drive performance: “.. a conservative estimate for standard drives would be 50 MB/sec ..” • “There may not be a suitable [storage management] product - it may be necessary to implement an HEP solution.” • “The use of object databases could simplify the problem [of storage management]”

  15. Wide Area Networks – Expectation v. Reality • Monarc Phase 2 Report – March 2000: “.. it should be possible to build a useful distributed architecture computing system provided the availability of CERN->Tier-1 Regional Centre network bandwidth is of the order of 622 Mbps per Regional Centre. This is an important result, as all the projections for the future indicate that such connections should be commonplace in 2005.” • The reality in most of the countries involved in LCG is well ahead of the expectations for bandwidth & cost
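
To put the 622 Mbps figure in more familiar units, a back-of-the-envelope conversion (a simple illustrative calculation assuming an ideal, fully utilised link with no protocol overhead; the only input taken from the slide is the 622 Mbps itself):

```python
# Back-of-the-envelope conversion of the MONARC bandwidth assumption.
# Illustrative only: assumes an ideal, fully utilised link with no protocol overhead.

link_mbps = 622                              # assumed CERN -> Tier-1 bandwidth (Mbit/s)

bytes_per_sec = link_mbps * 1e6 / 8          # megabits/s -> bytes/s
mb_per_sec = bytes_per_sec / 1e6             # ~78 MB/s
tb_per_day = bytes_per_sec * 86400 / 1e12    # ~6.7 TB/day

print(f"{mb_per_sec:.1f} MB/s sustained, ~{tb_per_day:.1f} TB/day per Tier-1 link")
```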

  16. LCG Wide Area Network [diagram: the Tier-1 centres – ASCC, Brookhaven, CNAF, Fermilab, GridKa, IN2P3, Nordic, PIC, RAL, SARA, TRIUMF – connected by dedicated 10 Gbit links, with many Tier-2 sites attached] • Tier-2s and Tier-1s are inter-connected by the general purpose research networks • Any Tier-2 may access data at any Tier-1 • Each Tier-1 with ~10 Gbps to the local NREN
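
A minimal sketch of that topology as a data structure – the Tier-1 site names and the ~10 Gbps figure come from the slide, while the function and the example Tier-2 name are purely illustrative:

```python
# Minimal sketch of the network topology described on the slide.
# Tier-1 site names and the ~10 Gbps figure come from the slide; the rest is illustrative.

TIER1_SITES = [
    "ASCC", "Brookhaven", "CNAF", "Fermilab", "GridKa", "IN2P3",
    "Nordic", "PIC", "RAL", "SARA", "TRIUMF",
]

TIER1_NREN_LINK_GBPS = 10   # each Tier-1 has ~10 Gbps to its local NREN

def reachable_tier1s(tier2_site: str) -> list[str]:
    """Any Tier-2 may access data at any Tier-1 over the general-purpose research networks."""
    return list(TIER1_SITES)

# Hypothetical Tier-2 name, purely for illustration:
print(f"Tier-1s reachable from EXAMPLE-T2: {reachable_tier1s('EXAMPLE-T2')}")
```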

  17. Node Configuration & Node Management • Affordable heterogeneous cluster management tools • A major concern for CERN was automation of the management of the very large clusters needed for LHC • ELFms – Extremely Large Fabric Management System

  18. Grid Expectations v. Reality • Grid technology is not the panacea that some of us hoped for in 2000 at CHEP in Padova – the off-the-shelf tools to implement a flexible Monarc model • The grid projects have generated wide interest -- helped build an active community of service providers -- made new computing resources available -- paid for a good deal of HEP operation, and largely with non-HEP funding • But .. it has taken longer than we expected to get to a basic computing service that is reasonably reliable

  19. But if we were realists – like the Gartner Group analysts – that is exactly what we would have expected [figure: Gartner Group expectations curve with HEP Grid milestones marked at the CHEP conferences – Padova, Beijing, San Diego, Interlaken, Mumbai, Victoria?]

  20. Production Grids – What has been achieved • Basic middleware • A set of baseline services agreed and initial versions in production • All major LCG sites active • 1 GB/sec data distribution rate, mass storage to mass storage – more than 50% of the nominal LHC data rate • Grid job failure rate 5-10% for most experiments, down from ~30% in 2004 • Sustained 10K jobs per day • > 10K simultaneous jobs during prolonged periods
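
To make the headline numbers easier to compare, a short illustrative calculation – the 1 GB/s rate, the 10K jobs/day and the 5-10% failure rate come from the slide; the conversions themselves are just arithmetic:

```python
# Simple arithmetic on the service metrics quoted above (illustrative conversions only).

sustained_rate_gb_s = 1.0            # GB/s, mass storage to mass storage
tb_per_day = sustained_rate_gb_s * 86400 / 1000      # ~86 TB/day sustained

jobs_per_day = 10_000
failure_rate = 0.10                  # 5-10% quoted; take the pessimistic end
successful_jobs = jobs_per_day * (1 - failure_rate)

print(f"~{tb_per_day:.0f} TB/day moved at {sustained_rate_gb_s} GB/s sustained")
print(f"~{successful_jobs:.0f} of {jobs_per_day} daily jobs succeed at a {failure_rate:.0%} failure rate")
```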

  21. Operations Process & Infrastructure • EGEE operation • Started November 2004 • There is no “central” operation – but 6 teams working in weekly rotation • CERN, IN2P3, CNAF, RAL, Russia, Taipei • Monitoring and alarm management tools • Crucial in improving site stability and management • OSG Operations Centre in Indiana • Joint workshops EGEE/OSG examining common procedures

  22. Service Metrics [chart: total grid sites vs. number of sites passing the SFT tests; “log data lost” noted on the plot] • Grid level accounting in place • Site Functional Test (SFT) framework • Regular testing of basic services, plus VO-specific tests • Framework for service level monitoring (MoU) • Marked improvement in site availability since introduced • Investigating use of the SFT framework in OSG
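
A hypothetical sketch of the kind of bookkeeping such a framework enables – computing per-site availability from test pass/fail records. The test names, site names and records below are invented for illustration and are not the actual SFT implementation:

```python
# Hypothetical sketch of computing per-site availability from Site Functional Test results.
# Not the actual SFT code: test names, sites and records here are invented for illustration.

from collections import defaultdict

# Each record: (site, test_name, passed) for one test run in the reporting period.
sft_results = [
    ("CERN-PROD", "job-submission", True),
    ("CERN-PROD", "replica-management", True),
    ("EXAMPLE-T2", "job-submission", True),
    ("EXAMPLE-T2", "replica-management", False),
]

def site_availability(results):
    """Fraction of test runs passed per site over the period."""
    passed, total = defaultdict(int), defaultdict(int)
    for site, _test, ok in results:
        total[site] += 1
        passed[site] += ok
    return {site: passed[site] / total[site] for site in total}

for site, avail in site_availability(sft_results).items():
    print(f"{site}: {avail:.0%} of functional tests passed")
```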

  23. LCG Service Deadlines [timeline: Service Challenge 4 leads into the pilot service] • Pilot services – stable service from 1 June 2006 • LHC service in operation – 1 October 2006, with ramp up over the following six months to full operational capacity & performance • 2006 – cosmics • LHC service commissioned – 1 April 2007 • 2007 – first physics • 2008 – full physics run

  24. Priorities • There are a lot of energetic and imaginative people involved in this enterprise -- there is a real risk that we are still trying to do things that are too complicated, while simple and robust models have not yet been demonstrated • There are very few people with the right knowledge and in the right place to work on some of the really important things • Operating reliable services is more difficult than it looks – after the hard work of debugging & automation has been done .. and operating distributed services is even more difficult • Now – 2006 – we must be very clear about the priorities

  25. Service Challenges • Purpose • Understand what it takes to operate a real grid service – run for days/weeks at a time (not just limited to experiment Data Challenges) • Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns • Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance • Four progressive steps from October 2004 through September 2006 • End 2004 – SC1 – data transfer to a subset of Tier-1s • Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s • 2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services • Jun-Sep 2006 – SC4 – pilot service

  26. SC4 – the Pilot LHC Service from June 2006 • Must be able to support a demonstration of the complete chain • DAQ → Tier-0 → Tier-1: data recording, calibration, reconstruction • Simulation, batch and end-user analysis • Tier-1 ↔ Tier-2 data exchange • Service metrics → MoU service levels • Extension of the service to most Tier-2 sites
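
A toy sketch of what “Service metrics → MoU service levels” could mean in practice: measured values for each part of the chain are compared against agreed targets. The metric names loosely follow the slide; every numeric target and measurement below is an invented placeholder, not an actual MoU figure:

```python
# Toy sketch of checking measured service metrics against MoU targets for the SC4 chain.
# Metric names loosely follow the slide; all numeric values are invented placeholders.

MOU_TARGETS = {
    "tier0_to_tier1_rate_mb_s": 1000,   # sustained Tier-0 -> Tier-1 distribution rate
    "tier1_availability_pct": 98,       # Tier-1 service availability
    "job_success_rate_pct": 90,         # batch / analysis job success rate
}

measured = {
    "tier0_to_tier1_rate_mb_s": 1050,
    "tier1_availability_pct": 96,
    "job_success_rate_pct": 92,
}

for metric, target in MOU_TARGETS.items():
    ok = measured[metric] >= target
    print(f"{metric}: measured {measured[metric]}, target {target} -> {'OK' if ok else 'BELOW TARGET'}")
```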

  27. SC4 Planning • 3-day workshop just prior to CHEP to finalise the planning • We have just about enough now to underlie a basic physics service • Functionality - modest evolution from current service • Deploying software that is already in the hands of the integration and test team • Focus on reliability, performance • Some functions still have to be provided by the experiments • In the longer term the evolution must continue, with additional services and enhancements • But now is the time to concentrate on what may be the hardest part of any complicated distributed system – making it work

  28. Medium Term [roadmap diagram, anchored on SC4] • Additional planned functionality – to be agreed & completed in the next few months, then tested and deployed • SRM 2 – test and deployment; plan being elaborated • 3D distributed database services – development and test; October? • New functionality – evaluation & development cycles; possible components for later years, subject to progress & experience

  29. Summary • Two grid infrastructures are now in operation, on which we are able to complete the computing services for LHC • Reliability and performance have improved significantly over the past year • The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up by April 2007 to the capacity and performance needed for the first beams • Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year – reliable operation of the baseline services
