
Status of the LCG project


Presentation Transcript


  1. Status of the LCG project. Julia Andreeva, on behalf of the LCG project. CERN, Geneva, Switzerland. NEC 2005, 16.09.2005

  2. Contents • LCG project short overview • LHC computing model and requirements on the LCG project (as estimated in the LCG TDR) • Middleware evolution: the new generation, gLite • ARDA prototypes • Summary

  3. LCG project • The LCG project was approved by the CERN Council in September 2001. Partners: the LHC experiments, Grid projects in Europe and the US, and regional and national centres. • Goal: prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors. • Phase 1 (2002-05): development of a common software prototype and operation of a pilot computing service. • Phase 2 (2006-08): acquire, build and operate the LHC computing service.

  4. LCG activities • Applications Area: common projects, libraries and tools, data management. • Middleware Area: provision of grid middleware (acquisition, development, integration, testing, support). • Distributed Analysis: joint project on distributed analysis with the LHC experiments. • Grid Deployment Area: establishing and managing the Grid Service (middleware certification, security, operations) and the Service Challenges. • CERN Fabric Area: cluster management, data handling, cluster technology, networking (WAN and local), the computing service at CERN.

  5. Cooperation with other projects • Network Services: LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT. • Grid Software: Globus, Condor and VDT have provided key components of the middleware used; key members participate in OSG and EGEE. Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity. • Grid Operations: the majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres. The US LHC programmes contribute to and depend on the Open Science Grid (OSG); the formal relationship with LCG is through the US-ATLAS and US-CMS computing projects. The Nordic Data Grid Facility (NDGF) will begin operation in 2006; prototype work is based on the NorduGrid middleware ARC.

  6. Operations: Computing Resources (map of sites; legend: country providing resources, country anticipating joining EGEE/LCG) • In EGEE-0 (LCG-2): 150 sites, ~14,000 CPUs, ~100 PB storage. • This greatly exceeds the project expectations for the number of sites. • Challenges: new middleware, number of sites, heterogeneity, complexity.

  7. Grid Operations (diagram: OMC, CICs, ROCs and Resource Centres) • The grid is flat, but there is a hierarchy of responsibility, essential to scale the operation. • Operations Management Centre (OMC): at CERN; coordination etc. • Core Infrastructure Centres (CIC): act as a single operations centre (one centre on shift); daily grid operations (oversight, troubleshooting); run essential infrastructure services; provide 2nd-level support to the ROCs; UK/Ireland, France, Italy, CERN, plus Russia and Taipei. • Regional Operations Centres (ROC): front-line support for users and operations; provide local knowledge and adaptations; one in each region, many distributed. • User Support Centre (GGUS): at FZK (Karlsruhe), service desk. RC = Resource Centre.

  8. Operations focus (diagram: LCG-2 (=EGEE-0) as the 2004 product, with parallel prototyping leading to the 2005 product LCG-3 (=EGEE-x?)) • Main focus of activities now: • Improving the operational reliability and application efficiency: automating monitoring and alarms; ensuring a 24x7 service; removing sites that fail functional tests; operations interoperability with OSG and others. • Improving user support: demonstrate to users a reliable and trusted support infrastructure. • Deployment of gLite components: testing and certification towards a pre-production service; migration planning and deployment, while maintaining and growing interoperability. • Further developments now have to be driven by experience in real use.

  9. The LHC Computing Hierarchical Model • Tier-0 at CERN: record RAW data (1.25 GB/s for ALICE); distribute a second copy to the Tier-1s; calibrate and do first-pass reconstruction. • Tier-1 centres (11 defined): manage permanent storage of RAW, simulated and processed data; capacity for reprocessing and bulk analysis. • Tier-2 centres (>~100 identified): Monte Carlo event simulation; end-user analysis. • Tier-3: facilities at universities and laboratories; access to data and processing in Tier-2s and Tier-1s; outside the scope of the project.

  10. Tier-1s (map of the 11 Tier-1 centres)

  11. Tier-2s: ~100 identified, number still growing

  12. Experiments' Requirements • A single Virtual Organization (VO) across the Grid. • Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs). • Need for a reliable Workload Management System (WMS) to efficiently exploit distributed resources. • Non-event data, such as calibration and alignment data but also detector construction descriptions, will be held in databases: read/write access to central (Oracle) databases at Tier-0, read access at the Tier-1s, and a local database cache at the Tier-2s. • Analysis scenarios and specific requirements are still evolving; prototype work is in progress (ARDA). • Online requirements are outside the scope of LCG, but there are connections: raw data transfer and buffering; database management and data export; some potential use of the Event Filter Farms for offline processing.
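The tiered database access pattern described above (read/write at Tier-0, read-only at the Tier-1s, a local cache at the Tier-2s) can be pictured with a minimal read-through-cache sketch. This is illustrative only, not LCG software; the class names, key and values are hypothetical.

# Illustrative sketch only: tiered access to non-event (conditions) data.
# Tier-0 holds the master database (read/write); lower tiers read from it;
# a Tier-2 reads through a local cache to avoid repeated wide-area lookups.
class Tier0DB:
    def __init__(self):
        self._data = {}
    def write(self, key, value):          # only Tier-0 clients write
        self._data[key] = value
    def read(self, key):
        return self._data[key]

class Tier2CachedClient:
    def __init__(self, upstream):
        self.upstream = upstream          # e.g. a Tier-0 or Tier-1 service
        self.cache = {}
    def read(self, key):                  # read-through local cache
        if key not in self.cache:
            self.cache[key] = self.upstream.read(key)
        return self.cache[key]

central = Tier0DB()
central.write("ecal_calibration_v1", [1.02, 0.98, 1.01])   # hypothetical payload
site = Tier2CachedClient(central)
print(site.read("ecal_calibration_v1"))  # first read goes upstream
print(site.read("ecal_calibration_v1"))  # second read is served from the cache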

  13. Architecture: Grid services • Storage Element: a Mass Storage System (MSS) such as CASTOR, Enstore, HPSS or dCache; the Storage Resource Manager (SRM) provides a common way to access the MSS, independent of the implementation; File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy. • Computing Element: interface to the local batch system, e.g. the Globus gatekeeper; accounting, status query, job monitoring. • Virtual Organization Management: Virtual Organization Management Services (VOMS); authentication and authorization based on the VOMS model. • Grid Catalogue Services: mapping of Globally Unique Identifiers (GUIDs) to local file names; hierarchical namespace, access control. • Interoperability: EGEE and OSG both use the Virtual Data Toolkit (VDT); different implementations are hidden behind common interfaces.
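As a picture of what the catalogue services do, here is a minimal sketch of a GUID-to-replica mapping. It is illustrative only, not an LCG component; the GUID, logical file name and storage URLs are made up.

# Illustrative sketch only: a toy file catalogue that maps a GUID to a
# logical file name in a hierarchical namespace and to physical replicas.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogueEntry:
    lfn: str                                            # logical file name
    replicas: List[str] = field(default_factory=list)   # storage URLs at sites

class FileCatalogue:
    def __init__(self):
        self._by_guid: Dict[str, CatalogueEntry] = {}

    def register(self, guid: str, lfn: str) -> None:
        self._by_guid[guid] = CatalogueEntry(lfn)

    def add_replica(self, guid: str, surl: str) -> None:
        self._by_guid[guid].replicas.append(surl)

    def replicas(self, guid: str) -> List[str]:
        return list(self._by_guid[guid].replicas)

# Example: one GUID, one logical name, replicas at two storage elements.
cat = FileCatalogue()
cat.register("a1b2c3", "/grid/cms/2005/reco/file001.root")
cat.add_replica("a1b2c3", "srm://se1.example.org/castor/cms/file001.root")
cat.add_replica("a1b2c3", "srm://se2.example.org/dcache/cms/file001.root")
print(cat.replicas("a1b2c3"))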

  14. Technology: Middleware • The LCG-2 middleware is currently deployed at more than 100 sites. • It originated from Condor, EDG, Globus, VDT and other projects. • It will now evolve to include functionality of the gLite middleware provided by the EGEE project, which has just been made available. • Site services include security, the Computing Element (CE), the Storage Element (SE), and Monitoring and Accounting Services; these are currently available both from LCG-2 and from gLite. • VO services such as the Workload Management System (WMS), File Catalogues, Information Services and File Transfer Services exist in both flavours (LCG-2 and gLite), maintaining close relations with VDT, Condor and Globus.

  15. gLite middleware • The first release of gLite (v1.0) was made at the end of March 2005: http://glite.web.cern.ch/glite/packages/R1.0/R20050331 and http://glite.web.cern.ch/glite/documentation • Lightweight services • Interoperability and co-existence with the deployed infrastructure • Performance and fault tolerance • Portable • Service-oriented approach • Site autonomy • Open source license

  16. Main differences to LCG-2 • The Workload Management System works in both push and pull mode. • The Computing Element is moving towards a VO-based scheduler guarding the jobs of the VO (reduces the load on GRAM). • Re-factored file and replica catalogues. • Secure catalogues (based on the user DN; VOMS certificates being integrated). • Scheduled data transfers. • SRM-based storage. • Information Services: R-GMA with an improved API, Service Discovery and registry replication. • Move towards Web Services.
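The push versus pull distinction in the first bullet can be sketched as follows. This is a toy illustration, not gLite code; the CE names and job identifiers are invented.

# Illustrative sketch only: "push" vs "pull" workload management.
# Push: the WMS matches each job to a computing element and sends it there.
# Pull: a computing element asks the task queue for work when it has free slots.
from collections import deque

task_queue = deque(["job-1", "job-2", "job-3"])
free_slots = {"ce1.example.org": 1, "ce2.example.org": 2}   # hypothetical CEs

def push_mode():
    # WMS decides: naive ranking, send the job to the CE with the most free slots.
    while task_queue:
        ce = max(free_slots, key=free_slots.get)
        if free_slots[ce] == 0:
            break                                  # no capacity left anywhere
        job = task_queue.popleft()
        free_slots[ce] -= 1
        print(f"push: {job} -> {ce}")

def pull_mode(ce):
    # CE decides: whenever it has a free slot, it pulls the next queued job.
    while free_slots.get(ce, 0) > 0 and task_queue:
        job = task_queue.popleft()
        free_slots[ce] -= 1
        print(f"pull: {ce} <- {job}")

push_mode()   # pull_mode("ce2.example.org") would instead be driven by the CE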

  17. Prototypes • It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges. • Data Challenges were recommended by the 'Hoffmann Review' of 2001 and have now been carried out by all experiments. Though the main goal was to validate the distributed computing model and to gradually build up the computing systems, the results have also been used for physics performance studies and for detector, trigger and DAQ design. Limitations of the Grids have been identified and are being addressed. • Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use cases over an extended period, leading to stable production services. • The project 'A Realisation of Distributed Analysis for LHC' (ARDA) is developing end-to-end prototypes of distributed analysis systems using the EGEE middleware gLite for each of the LHC experiments.

  18. ARDA: A Realisation of Distributed Analysis for LHC • Distributed analysis on the Grid is the most difficult and least defined topic. • ARDA sets out to develop end-to-end analysis prototypes using the LCG-supported middleware. • ALICE uses the AliROOT framework based on PROOF. • ATLAS has used DIAL services with the gLite prototype as the backend. • CMS has prototyped the 'ARDA Support for CMS Analysis Processing' (ASAP), which is used by CMS physicists for daily analysis work. • LHCb has based its prototype on GANGA, a common project between ATLAS and LHCb.

  19. Running parallel instances of ATHENA on gLite (ATLAS/ARDA and Taipei ASCC)

  20. CMS: ASAP prototype (workflow diagram) • The user describes the task to the ASAP UI: application and application version, executable, ORCA data cards, data sample, working directory, Castor directory to save the output, number of events to be processed, number of events per job. • ASAP generates the gLite JDL and handles job submission, delegates user credentials using MyProxy, checks job status, resubmits in case of failure, fetches the results and stores them to Castor. • Jobs run on the worker nodes; the ASAP Job Monitoring service publishes the job status and the output file locations on the web. • Other components in the diagram: MonALISA, RefDB, PubDB.
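The job lifecycle on this slide (submit, check status, resubmit on failure, fetch results) can be sketched as a simple polling loop. This is not the ASAP code; the function names, states and retry policy below are hypothetical, and the middleware calls are passed in as stubs.

# Illustrative sketch only: submit jobs, poll their status, resubmit failures,
# and collect the output of the ones that finish.
import time

MAX_RETRIES = 3

def run_task(jobs, submit, status, fetch_output):
    """jobs: list of job descriptions; submit/status/fetch_output: callables
    wrapping the middleware (e.g. a gLite client), supplied by the caller."""
    active = {submit(j): {"job": j, "retries": 0} for j in jobs}
    results = []
    while active:
        time.sleep(30)                            # polling interval
        for job_id in list(active):
            st = status(job_id)
            if st == "Done":
                results.append(fetch_output(job_id))
                del active[job_id]
            elif st == "Failed":
                info = active.pop(job_id)
                if info["retries"] < MAX_RETRIES:  # resubmission in case of failure
                    new_id = submit(info["job"])
                    active[new_id] = {"job": info["job"],
                                      "retries": info["retries"] + 1}
                # after MAX_RETRIES the job is dropped and would be reported
    return results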

  21. Summary • The LCG infrastructure is proving to be an essential tool for the experiments. • The development and deployment of the gLite middleware aim to provide additional functionality and improved performance, and to satisfy the challenging requirements of the LHC experiments.

  22. Backup slide: What is EGEE? EGEE is the largest Grid infrastructure project in Europe: • 70 leading institutions in 27 countries, federated in regional Grids • Leveraging national and regional grid activities • Started in April 2004 (ends March 2006) • The EU review in February 2005 was successful • Preparing the 2nd phase of the project: a proposal to the EU Grid call of September 2005, for 2 years starting April 2006 • Promoting scientific partnership outside the EU. Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day. LCG and EGEE are different projects, but collaboration is ensured (sharing instead of duplication).

  23. Backup slide: Tier-0/-1/-2 Connectivity • National Research Networks (NRENs) at the Tier-1s: ASnet, LHCnet/ESnet, GARR, LHCnet/ESnet, RENATER, DFN, SURFnet6, NORDUnet, RedIRIS, UKERNA, CANARIE.

  24. Backup slide: The Eventflow • 50 days of running in 2007; 10^7 seconds/year of pp running from 2008 on, giving ~10^9 events/experiment per year; 10^6 seconds/year of heavy-ion running.
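For orientation, these numbers are consistent with a recorded-event rate of order 100 Hz per experiment; the rate itself is not on the slide, so treat this only as a rough check:

\(10^{7}\,\text{s/year} \times 100\,\text{Hz} \approx 10^{9}\,\text{events/year per experiment}\)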

  25. Backup slide: CPU Requirements (chart of the CPU requirement split between CERN, the Tier-1s and the Tier-2s; 58% of the requirement is pledged)

  26. Backup slide: Disk Requirements (chart of the disk requirement split between CERN, the Tier-1s and the Tier-2s; 54% of the requirement is pledged)

  27. Backup slide: Tape Requirements (chart of the tape requirement at CERN and the Tier-1s; 75% of the requirement is pledged)

  28. Backup slide: Tier-0 components • Batch system (LSF) to manage CPU resources • Shared file system (AFS) • Disk pool and mass storage (MSS) manager (CASTOR) • Extremely Large Fabric management system (ELFms): Quattor for system administration, installation and configuration; the LHC Era Monitoring (LEMON) system, server/client based; LHC-Era Automated Fabric (LEAF) for high-level commands to sets of nodes • CPU servers: 'white boxes', Intel processors, (Scientific) Linux • Disk storage: Network Attached Storage (NAS), mostly mirrored • Tape storage: currently STK robots; a future system is under evaluation • Network: fast Gigabit Ethernet switches connected to multi-gigabit backbone routers.

  29. Data Challenges • ALICE: PDC04 used AliEn services, native or interfaced to the LCG Grid; 400,000 jobs were run, producing 40 TB of data for the Physics Performance Report. PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), and analysis with PROOF, using Grid services from LCG SC3 and AliEn. • ATLAS: used tools and resources from LCG, NorduGrid and Grid3 at 133 sites in 30 countries, with over 10,000 processors; 235,000 jobs produced more than 30 TB of data using an automatic production system. • CMS: 100 TB of simulated data was reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there. • LHCb: LCG provided more than 50% of the capacity for the first data challenge in 2004-2005; the production used the DIRAC system.

  30. Service Challenges • A series of Service Challenges (SC) sets out to successively approach the production needs of LHC. • While SC1 did not meet the goal of transferring continuously for 2 weeks at a rate of 500 MB/s, SC2 exceeded that goal (500 MB/s) by sustaining a throughput of 600 MB/s to 7 sites. • SC3 starts soon, using gLite middleware components, with disk-to-disk throughput tests, 10 Gb networking of the Tier-1s to CERN, and an SRM (1.1) interface to managed storage at the Tier-1s. The goal is to achieve 150 MB/s disk-to-disk and 60 MB/s to managed tape. There will also be Tier-1 to Tier-2 transfer tests. • SC4 aims to demonstrate that all requirements, from raw data taking to analysis, can be met at least 6 months prior to data taking. The aggregate rate out of CERN is required to be 1.6 GB/s to tape at the Tier-1s. • The Service Challenges will turn into production services for the experiments.
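As a rough consistency check (not stated on the slide): shared across the 11 Tier-1 centres, the 1.6 GB/s aggregate rate out of CERN corresponds to

\(1.6\,\text{GB/s} \,/\, 11 \approx 145\,\text{MB/s per Tier-1},\)

the same order as the 150 MB/s per-site disk-to-disk target quoted for SC3.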

  31. Backup slide: Key dates for Service Preparation (timeline 2005-2008: SC3, SC4, then LHC Service Operation; cosmics, first beams, first physics, full physics run) • Sep 05: SC3 service phase. May 06: SC4 service phase. Sep 06: initial LHC Service in stable operation. Apr 07: LHC Service commissioned. • SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput of 1 GB/s, including 500 MB/s to mass storage (150 MB/s and 60 MB/s at the Tier-1s). • SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput (~1.5 GB/s mass storage throughput). • LHC Service in Operation from September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput.
