
The LHC Computing Grid Project
Technical Design Report
LHCC, 29 June 2005

Jürgen Knobloch, IT Department, CERN
This file is available at: http://cern.ch/lcg/tdr/LCG_TDR.ppt


Technical Design Report - limitations

  • Computing is different from detector building

    • It’s ‘only’ software – can and will be adapted as needed

    • Technology evolves rapidly – we have no control

    • Prices go down – Moore’s law – Buy just in time – Understand startup

  • We are in the middle of planning the next phase

    • The Memorandum of Understanding (MoU) is being finalized

    • The list of Tier-2 centres is evolving

    • Baseline Services have been agreed

    • EGEE continuation is being discussed

    • Experience from Service Challenges will be incorporated

    • Some of the information is made available from (dynamic) Web-sites

  • The LCG TDR appears simultaneously with the experiments’ TDRs

    • Some inconsistencies may have passed undetected

    • Some people were occupied on both sides


The LCG Project

  • Approved by the CERN Council in September 2001

    • Phase 1 (2001-2004): Development and prototyping of a distributed production prototype at CERN and elsewhere that will be operated as a platform for the data challenges, leading to a Technical Design Report, which will serve as a basis for agreeing the relations between the distributed Grid nodes and their co-ordinated deployment and exploitation.

    • Phase 2 (2005-2007): Installation and operation of the full world-wide initial production Grid system, requiring continued manpower efforts and substantial material resources.

  • A Memorandum of Understanding

    • … has been developed, defining the Worldwide LHC Computing Grid Collaboration with CERN as host laboratory and the major computing centres.

    • Defines the organizational structure for Phase 2 of the project.


Organizational Structure for Phase 2

[Organization chart: the LHC Committee (LHCC) provides scientific review and the Computing Resources Review Board (C-RRB) links to the funding agencies; the Collaboration Board (CB) represents the experiments and regional centres; an Overview Board (OB) and a Management Board (MB) manage the project; the Grid Deployment Board coordinates Grid operation and the Architects Forum coordinates the common applications.]


Cooperation with other projects

  • Network Services

    • LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT

  • Grid Software

    • Globus, Condor and VDT have provided key components of the middleware used. Key members participate in OSG and EGEE

    • Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity.

  • Grid Operational Groupings

    • The majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors). EGEE also supports Core Infrastructure Centres and Regional Operations Centres.

    • The US LHC programmes contribute to and depend on the Open Science Grid (OSG). Formal relationship with LCG through US-Atlas and US-CMS computing projects.

    • The Nordic Data Grid Facility (NDGF) will begin operation in 2006. Prototype work is based on the NorduGrid middleware ARC.


The Hierarchical Model

  • Tier-0 at CERN

    • Record RAW data (1.25 GB/s ALICE)

    • Distribute second copy to Tier-1s

    • Calibrate and do first-pass reconstruction

  • Tier-1 centres (11 defined)

    • Manage permanent storage – RAW, simulated, processed

    • Capacity for reprocessing, bulk analysis

  • Tier-2 centres (>~ 100 identified)

    • Monte Carlo event simulation

    • End-user analysis

  • Tier-3

    • Facilities at universities and laboratories

    • Access to data and processing in Tier-2s, Tier-1s

    • Outside the scope of the project
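As a sanity check of the Tier-0 recording rate quoted above, here is a minimal Python sketch; it assumes the ~10⁶ seconds/year of heavy-ion running quoted on the Eventflow slide below, and is an order-of-magnitude estimate only.

```python
# Back-of-envelope check of the Tier-0 recording rate quoted above.
# The heavy-ion live time is an assumption taken from the Eventflow slide.

ALICE_RAW_RATE_GB_PER_S = 1.25   # GB/s RAW to tape at Tier-0 (ALICE, heavy ion)
HEAVY_ION_SECONDS = 1e6          # assumed live seconds of heavy-ion running/year

raw_volume_pb = ALICE_RAW_RATE_GB_PER_S * HEAVY_ION_SECONDS / 1e6  # GB -> PB
print(f"ALICE heavy-ion RAW per year: ~{raw_volume_pb:.2f} PB")
# -> ~1.25 PB, before counting the second copy distributed to the Tier-1s
```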



Tier-2s

~100 identified – number still growing


The Eventflow

50 days running in 2007; 10⁷ seconds/year pp from 2008 on, giving ~10⁹ events/experiment; 10⁶ seconds/year heavy ion
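These numbers imply an average recorded-event rate of order 100 Hz per experiment; a short Python check (the 50-day figure is converted to seconds purely for scale):

```python
# The slide's numbers imply an average recorded-event rate of order 100 Hz.
LIVE_SECONDS_PP = 1e7         # seconds/year of pp running from 2008 on
EVENTS_PER_EXPERIMENT = 1e9   # ~events per experiment per year

avg_rate_hz = EVENTS_PER_EXPERIMENT / LIVE_SECONDS_PP
print(f"Average recording rate: ~{avg_rate_hz:.0f} Hz per experiment")

# The 50-day pilot run in 2007 corresponds to at most ~4.3e6 wall-clock seconds.
print(f"50 days = {50 * 86400:.1e} s")
```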


CPU Requirements

[Chart: projected CPU requirements split between CERN, Tier-1s and Tier-2s – 58% of the requirement is currently pledged.]


Disk Requirements

[Chart: projected disk requirements split between CERN, Tier-1s and Tier-2s – 54% of the requirement is currently pledged.]


Tape Requirements

[Chart: projected tape requirements at CERN and the Tier-1s – 75% of the requirement is currently pledged.]


Experiments’ Requirements

  • Single Virtual Organization (VO) across the Grid

  • Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)

  • Need for a reliable Workload Management System (WMS) to efficiently exploit distributed resources.

  • Non-event data, such as calibration and alignment data as well as detector construction descriptions, will be held in databases

    • read/write access to central (Oracle) databases at Tier-0 and read access at Tier-1s with a local database cache at Tier-2s

  • Analysis scenarios and specific requirements are still evolving

    • Prototype work is in progress (ARDA)

  • Online requirements are outside of the scope of LCG, but there are connections:

    • Raw data transfer and buffering

    • Database management and data export

    • Some potential use of Event Filter Farms for offline processing
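To illustrate the tiered database access pattern above, here is a minimal, purely hypothetical Python sketch; the policy table and function names are invented for illustration and do not correspond to the actual LCG database deployment.

```python
# Hypothetical sketch of the tiered conditions-database access described above:
# read/write Oracle at Tier-0, read-only replicas at Tier-1s, local cache at Tier-2s.

TIER_DB_POLICY = {
    "Tier-0": {"mode": "read-write", "backend": "central Oracle database"},
    "Tier-1": {"mode": "read-only",  "backend": "Oracle replica"},
    "Tier-2": {"mode": "read-only",  "backend": "local database cache"},
}

def conditions_access(tier: str, want_write: bool) -> str:
    """Decide how a job at the given tier should reach conditions data."""
    policy = TIER_DB_POLICY[tier]
    if want_write and policy["mode"] != "read-write":
        return "forward the update to Tier-0 (writes are only allowed centrally)"
    return f"use the {policy['backend']} ({policy['mode']})"

print(conditions_access("Tier-2", want_write=False))
print(conditions_access("Tier-1", want_write=True))
```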


Architecture – Grid services

  • Storage Element

    • Mass Storage System (MSS) (CASTOR, Enstore, HPSS, dCache, etc.)

    • Storage Resource Manager (SRM) provides a common way to access MSS, independent of implementation

    • File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy

  • Computing Element

    • Interface to the local batch system, e.g. the Globus gatekeeper.

    • Accounting, status query, job monitoring

  • Virtual Organization Management

    • Virtual Organization Management Services (VOMS)

    • Authentication and authorization based on VOMS model.

  • Grid Catalogue Services

    • Mapping of Globally Unique Identifiers (GUIDs) to local file names

    • Hierarchical namespace, access control

  • Interoperability

    • EGEE and OSG both use the Virtual Data Toolkit (VDT)

    • Different implementations are hidden by common interfaces
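To make the catalogue idea above concrete, here is a toy Python model of the GUID-to-replica mapping; the class, function names and SURLs are hypothetical and do not reproduce the interface of any actual LCG catalogue.

```python
# Toy model of a Grid file catalogue: a GUID maps to a logical file name
# in a hierarchical namespace and to one or more physical replicas.

from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    guid: str                 # globally unique identifier
    lfn: str                  # logical file name in a hierarchical namespace
    replicas: list = field(default_factory=list)   # physical replica locations

catalogue: dict = {}

def register_replica(guid: str, lfn: str, surl: str) -> None:
    """Add a physical replica for a file identified by its GUID."""
    entry = catalogue.setdefault(guid, CatalogueEntry(guid, lfn))
    entry.replicas.append(surl)

def lookup(guid: str) -> CatalogueEntry:
    """Return the logical name and all known replicas for a GUID."""
    return catalogue[guid]

register_replica("0001-fake-guid", "/grid/atlas/raw/run0001/file42",
                 "srm://tier1.example.org/atlas/run0001/file42")
register_replica("0001-fake-guid", "/grid/atlas/raw/run0001/file42",
                 "srm://tier2.example.edu/atlas/run0001/file42")
print(lookup("0001-fake-guid").replicas)
```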


Baseline Services

Mandate

The goal of the working group is to forge an agreement between the experiments and the LHC regional centres on the baseline services to be provided to support the computing models for the initial period of LHC running, which must therefore be in operation by September 2006.

The services concerned are those that supplement the basic services for which there is already general agreement and understanding (e.g. provision of operating system services, local cluster scheduling, compilers, ..) and which are not already covered by other LCG groups such as the Tier-0/1 Networking Group or the 3D Project. …

Members

Experiments: ALICE: L. Betev, ATLAS: M. Branco, A. de Salvo, CMS: P. Elmer, S. Lacaprara, LHCb: P. Charpentier, A. Tsaregorodtsev

Projects: ARDA: J. Andreeva, Apps Area: D. Düllmann, gLite: E. Laure

Sites: F. Donno (It), A. Waananen (Nordic), S. Traylen (UK), R. Popescu, R. Pordes (US)

Chair: I. Bird, Secretary: M. Schulz

Timescale: 15 February to 17 June 2005


Baseline Services – preliminary priorities


Architecture – Tier-0

[Diagram: Tier-0 fabric network – a 2.4 Tb/s core connects the WAN, the experimental areas and the campus network through a distribution layer (10 Gb/s links fanned out to 32×1 Gb/s) to ~2000 tape and disk servers and ~6000 CPU servers of ~8000 SPECint2000 each (2008); links range from Gigabit to double ten-gigabit Ethernet.]
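Rough arithmetic on the fabric figures above (a Python sketch, order-of-magnitude only):

```python
# Order-of-magnitude check of the 2008 Tier-0 fabric figures quoted above.
CPU_SERVERS = 6000
SI2K_PER_SERVER = 8000        # SPECint2000 per CPU server (2008 estimate)
DISK_TAPE_SERVERS = 2000
CORE_CAPACITY_TB_PER_S = 2.4  # Tb/s core switching capacity

total_msi2k = CPU_SERVERS * SI2K_PER_SERVER / 1e6
gbps_per_server = CORE_CAPACITY_TB_PER_S * 1000 / (CPU_SERVERS + DISK_TAPE_SERVERS)

print(f"Total CPU capacity: ~{total_msi2k:.0f} MSI2k")                    # ~48 MSI2k
print(f"Average core bandwidth per server: ~{gbps_per_server:.1f} Gb/s")  # ~0.3 Gb/s
```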


Tier-0 components

  • Batch system (LSF) manages CPU resources

  • Shared file system (AFS)

  • Disk pool and mass storage (MSS) manager (CASTOR)

  • Extremely Large Fabric management system (ELFms)

    • Quattor – system administration – installation and configuration

    • LHC Era MONitoring (LEMON) system, server/client based

    • LHC-Era Automated Fabric (LEAF) – high-level commands to sets of nodes

  • CPU servers – ‘white boxes’, Intel processors, Scientific Linux

  • Disk Storage – Network Attached Storage (NAS) – mostly mirrored

  • Tape Storage – currently STK robots – future system under evaluation

  • Network – fast gigabit Ethernet switches connected to multigigabit backbone routers


Tier-0/1/2 Connectivity

National Research Networks (NRENs) at the Tier-1s: ASnet, LHCnet/ESnet, GARR, LHCnet/ESnet, RENATER, DFN, SURFnet6, NORDUnet, RedIRIS, UKERNA, CANARIE


Technology - Middleware

  • Currently, the LCG-2 middleware is deployed in more than 100 sites

  • It originated from Condor, EDG, Globus, VDT, and other projects.

  • It will now evolve to include functionality of the gLite middleware, which has just been made available by the EGEE project.

  • In the TDR, we describe the basic functionality of LCG-2 middleware as well as the enhancements expected from gLite components.

  • Site services include security, the Computing Element (CE), the Storage Element (SE), Monitoring and Accounting Services – currently available both from LCG-2 and gLite.

  • VO services such as the Workload Management System (WMS), File Catalogues, Information Services and File Transfer Services exist in both flavours (LCG-2 and gLite), maintaining close relations with VDT, Condor and Globus.


Technology – Fabric Technology

  • Moore’s law still holds for processors and disk storage

    • For CPU and disks we count a lot on the evolution of the consumer market

    • For processors we expect an increasing importance of 64-bit architectures and multicore chips

    • The cost break-even point between disk and tape storage will not be reached for the initial LHC computing

  • Mass storage (tapes and robots) is still a computer centre item with computer centre pricing

    • It is too early to conclude on new tape drives and robots

  • Networking has seen a rapid evolution recently

    • Ten-gigabit Ethernet is now in the production environment

    • Wide-area networking can already count on 10 Gb/s connections between the Tier-0 and the Tier-1s. This will gradually extend to the Tier-1 to Tier-2 connections.
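A hedged illustration of the "buy just in time" argument made here and on the TDR-limitations slide: assuming a price/performance doubling time of roughly 18 months (an assumption for illustration, not a figure from the TDR), the same budget buys noticeably more capacity the later it is spent.

```python
# Illustration of "buy just in time"; the doubling time is an assumption.
DOUBLING_TIME_MONTHS = 18.0   # assumed price/performance doubling time

def capacity_per_unit_cost(delay_months: float) -> float:
    """Relative capacity bought for the same money after deferring the purchase."""
    return 2.0 ** (delay_months / DOUBLING_TIME_MONTHS)

for delay in (6, 12, 18):
    print(f"defer {delay:2d} months: x{capacity_per_unit_cost(delay):.2f}")
# -> roughly x1.26, x1.59, x2.00
```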



Common Physics Applications

  • Core software libraries

    • SEAL-ROOT merger

    • Scripting: CINT, Python

    • Mathematical libraries

    • Fitting, MINUIT (in C++)

  • Data management

    • POOL: ROOT I/O for bulk data, RDBMS for metadata

    • Conditions database – COOL

  • Event simulation

    • Event generators: generator library (GENSER)

    • Detector simulation: GEANT4 (ATLAS, CMS, LHCb)

    • Physics validation: compare GEANT4, FLUKA, test beam

  • Software development infrastructure

    • External libraries

    • Software development and documentation tools

    • Quality assurance and testing

    • Project portal: Savannah
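As a minimal illustration of the ROOT I/O layer that POOL uses for bulk event data, here is a short PyROOT sketch; this is a generic ROOT example, not the POOL API itself, and it assumes a ROOT installation with Python bindings.

```python
from array import array
import ROOT   # requires a ROOT installation with Python bindings

# Write a toy tree (ROOT I/O is the layer POOL uses underneath for bulk data).
f = ROOT.TFile("demo.root", "RECREATE")
tree = ROOT.TTree("events", "toy event data")
energy = array("d", [0.0])
tree.Branch("energy", energy, "energy/D")
for i in range(1000):
    energy[0] = 0.1 * i
    tree.Fill()
tree.Write()
f.Close()

# Read it back.
f = ROOT.TFile("demo.root")
tree = f.Get("events")
print("entries:", tree.GetEntries())
f.Close()
```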


Prototypes

  • It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges

  • Data Challenges have been recommended by the ‘Hoffmann Review’ of 2001. They have now been done by all experiments. Though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger, and DAQ design. Limitations of the Grids have been identified and are being addressed.

  • Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use cases over an extended period, leading to stable production services.

  • The project ‘A Realisation of Distributed Analysis for LHC’ (ARDA) is developing end-to-end prototypes of distributed analysis systems using the EGEE middleware gLite for each of the LHC experiments.


Data Challenges

  • ALICE

    • PDC04 used AliEn services, native or interfaced to the LCG Grid; 400,000 jobs were run, producing 40 TB of data for the Physics Performance Report.

    • PDC05: Event simulation, first-pass reconstruction, transmission to Tier-1 sites, second pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn

  • ATLAS

    • Tools and resources from LCG, NorduGrid, and Grid3 were used at 133 sites in 30 countries, with over 10,000 processors; 235,000 jobs produced more than 30 TB of data with an automatic production system.

  • CMS

    • 100 TB simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there.

  • LHCb

    • LCG provided more than 50% of the capacity for the first data challenge in 2004-2005. The production used the DIRAC system.


Service Challenges

  • A series of Service Challenges (SC) set out to successively approach the production needs of LHC

  • While SC1 did not meet its goal of transferring continuously for 2 weeks at a rate of 500 MB/s, SC2 exceeded that goal by sustaining a throughput of 600 MB/s to 7 sites.

  • SC3 starts now, using gLite middleware components, with disk-to-disk throughput tests, 10 Gb/s networking of the Tier-1s to CERN, and an SRM (1.1) interface to managed storage at the Tier-1s. The goal is to achieve 150 MB/s disk-to-disk and 60 MB/s to managed tape. There will also be Tier-1 to Tier-2 transfer tests.

  • SC4 aims to demonstrate that all requirements from raw data taking to analysis can be met at least 6 months prior to data taking. The aggregate rate out of CERN is required to be 1.6 GB/s to tape at Tier-1s.

  • The Service Challenges will turn into production services for the experiments.
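A quick consistency check of the SC4 target, as a Python sketch; it assumes the 1.6 GB/s aggregate is shared roughly evenly among the 11 Tier-1 centres listed earlier.

```python
# Rough check of the SC4 export target quoted above.
AGGREGATE_GB_PER_S = 1.6   # SC4 goal: GB/s out of CERN to tape at the Tier-1s
N_TIER1 = 11               # Tier-1 centres defined at the time of the TDR

per_tier1_mb_per_s = AGGREGATE_GB_PER_S * 1000 / N_TIER1
print(f"~{per_tier1_mb_per_s:.0f} MB/s per Tier-1 on average")
# -> ~145 MB/s, comparable to the 150 MB/s disk-to-disk target of SC3
```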




Key dates for Service Preparation

[Timeline 2005-2008: SC3 and SC4 service challenges leading to LHC Service Operation – cosmics, first beams, first physics, full physics run.]

Sep 05 – SC3 Service Phase

May 06 – SC4 Service Phase

Sep 06 – Initial LHC Service in stable operation

Apr 07 – LHC Service commissioned

  • SC3 – Reliable base service – most Tier-1s, some Tier-2s – basic experiment software chain – grid data throughput 1 GB/sec, including mass storage 500 MB/sec (150 MB/sec & 60 MB/sec at Tier-1s)

  • SC4 – All Tier-1s, major Tier-2s – capable of supporting the full experiment software chain including analysis – sustain nominal final grid data throughput (~1.5 GB/sec mass storage throughput)

  • LHC Service in Operation – September 2006 – ramp up to full operational capacity by April 2007 – capable of handling twice the nominal data throughput


ARDA- A Realisation of Distributed Analysis for LHC

  • Distributed analysis on the Grid is the most difficult and least defined topic

  • ARDA sets out to develop end-to-end analysis prototypes using the LCG-supported middleware.

  • ALICE uses the AliROOT framework based on PROOF.

  • ATLAS has used DIAL services with the gLite prototype as backend.

  • CMS has prototyped the ‘ARDA Support for CMS Analysis Processing’ (ASAP) that is used by several CMS physicists for daily analysis work.

  • LHCb has based its prototype on GANGA, a common project between ATLAS and LHCb.


Thanks to …

  • EDITORIAL BOARD

    • I. Bird, K. Bos, N. Brook, D. Duellmann, C. Eck, I. Fisk, D. Foster, B. Gibbard, C. Grandi, F. Grey, J. Harvey, A. Heiss, F. Hemmer, S. Jarp, R. Jones, D. Kelsey, J. Knobloch, M. Lamanna, H. Marten, P. Mato Vila, F. Ould-Saada, B. Panzer-Steindel, L. Perini, L. Robertson, Y. Schutz, U. Schwickerath, J. Shiers, T. Wenaus

  • Contributions from

    • J.P. Baud, E. Laure, C. Curran, G. Lee, A. Marchioro, A. Pace, and D. Yocum, A. Aimar, I. Antcheva, J. Apostolakis, G. Cosmo, O. Couet, M. Girone, M. Marino, L. Moneta, W. Pokorski, F. Rademakers, A. Ribon, S. Roiser, and R. Veenhof

  • Quality assurance by

    • The CERN Print Shop, F. Baud-Lavigne, S. Leech O’Neale, R. Mondardini, and C. Vanoli

  • … and the members of the Computing Groups of the LHC experiments who either directly contributed or have provided essential feedback.

