The DataGRID: A Testbed for Worldwide Distributed Scientific Data Analysis. Nordunet Conference, Helsinki. Les Robertson, CERN - IT Division, 29 September 2000. les.robertson@cern.ch
Summary • HEP offline computing – the needs of LHC • An ideal testbed for Grids? • The DataGRID project • Concluding remarks
CERN – The European Organisation for Nuclear Research: 20 European countries, 2,700 staff, 6,000 users
One of the four LHC detectors – the online system uses a multi-level trigger to filter out background and reduce the data volume: • 40 MHz (40 TB/sec) → level 1 – special hardware • 75 kHz (75 GB/sec) → level 2 – embedded processors • 5 kHz (5 GB/sec) → level 3 – PCs • 100 Hz (100 MB/sec) → data recording & offline analysis
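As an illustration only (not the experiments' actual trigger code), the cascade above can be thought of as successive filters, each cheap enough to keep up with its input rate. The sketch below simply models the rate reduction, using pass fractions taken from the ratios of the rates quoted on this slide.

```python
# Illustrative sketch of a multi-level trigger cascade (not real trigger code).
# Pass fractions below are just the ratios of the rates quoted on the slide.

LEVELS = [
    ("level 1 (special hardware)",    75e3 / 40e6),   # 40 MHz -> 75 kHz
    ("level 2 (embedded processors)",  5e3 / 75e3),   # 75 kHz ->  5 kHz
    ("level 3 (PCs)",                100.0 /  5e3),   #  5 kHz -> 100 Hz
]

def trigger_cascade(input_rate_hz: float) -> float:
    """Propagate an event rate through the cascade, printing each stage."""
    rate = input_rate_hz
    for name, pass_fraction in LEVELS:
        rate *= pass_fraction
        print(f"after {name}: {rate:,.0f} events/sec")
    return rate

if __name__ == "__main__":
    trigger_cascade(40e6)   # ends at ~100 Hz written to storage
```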
CERN Data Handling and Computation for Physics Analysis (data-flow diagram): detector → event filter (selection & reconstruction) → raw data and processed data → event summary data; event reprocessing, event simulation and batch physics analysis produce analysis objects (extracted by physics topic) for interactive physics analysis.
HEP Computing Characteristics • Large numbers of independent events • trivial parallelism • Large data sets • smallish records • mostly read-only • Modest I/O rates • few MB/sec per fast processor • Modest floating point requirement • SPECint performance • Very large aggregate requirements – computation, data • Scaling up is not just big – it is also complex • …and once you exceed the capabilities of a single geographical installation …?
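Because events are independent, the workload parallelises trivially: each worker can process its own events with no communication between them. A minimal sketch of this event-level parallelism follows; the reconstruct function is a hypothetical stand-in, not experiment code.

```python
# Minimal sketch of event-level ("trivial") parallelism.
# `reconstruct` is a stand-in for real per-event reconstruction code.
from multiprocessing import Pool

def reconstruct(event: dict) -> dict:
    # Placeholder for CPU-heavy, mostly-integer per-event processing.
    return {"id": event["id"], "tracks": len(event["hits"]) // 3}

def process_events(events, workers: int = 8):
    # Events share no state, so they can be farmed out independently.
    with Pool(processes=workers) as pool:
        return pool.map(reconstruct, events)

if __name__ == "__main__":
    sample = [{"id": i, "hits": list(range(i % 30))} for i in range(1000)]
    results = process_events(sample)
    print(len(results), "events reconstructed")
```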
Generic computing farm (diagram): network servers, application servers, disk servers, tape servers.
The LHC Detectors – CMS, ATLAS, LHCb: raw recording rate 0.1 – 1 GB/sec, 3.5 PetaBytes/year, ~10^8 events/year.
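A rough sanity check of these volume figures, assuming (purely for illustration) about 10^7 seconds of effective data taking per year:

```python
# Back-of-the-envelope check of the recording-rate and yearly-volume figures.
# The 1e7 s of effective data taking per year is an assumption for illustration.
recording_rate_mb_s = 100          # 100 MB/sec (lower end of 0.1 - 1 GB/sec)
live_seconds_per_year = 1e7        # assumed effective running time

volume_pb = recording_rate_mb_s * live_seconds_per_year / 1e9  # MB -> PB
print(f"~{volume_pb:.1f} PB/year per experiment at 100 MB/sec")
# ~1 PB/year per experiment; several experiments together give a few PB/year,
# the same order as the 3.5 PB/year quoted on the slide.
```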
Estimated CPU capacity required at CERN (chart): LHC experiments and other experiments, compared with Moore's law – some measure of the capacity that technology advances provide for a constant number of processors or investment. Jan 2000: 3.5K SI95. Less than 50% of the main analysis capacity will be at CERN.
Components to Fabrics Commodity components are just fine for HEP • Masses of experience with inexpensive farms • Long experience with mass storage • LAN technology is going the right way • Inexpensive high performance PC attachments • Compatible with hefty backbone switches • Good ideas for improving automated operation and management • Just needs some solid computer engineering R&D?
World Wide Collaboration – distributed computing & storage capacity. CMS: 1,800 physicists, 150 institutes, 32 countries.
Two Problems • Funding – will funding bodies place all their investment at CERN? No. • Geography – does a geographically distributed model better serve the needs of the world-wide distributed community? Maybe – if it is reliable and easy to use.
Solution? - Regional Computing Centres • Exploit established computing expertise & infrastructure in national labs, universities • Reduce dependence on links to CERN – full summary data available nearby, through a fat, fast, reliable network link • Tap funding sources not otherwise available to HEP at CERN • Devolve control over resource allocation • national interests? • regional interests? • at the expense of physics interests?
Regional Centres – a Multi-Tier Model (diagram): CERN (Tier 0) connected to Tier 1 centres (IN2P3, RAL, FNAL) and onward to Tier 2 centres (Lab a, Uni b, Lab c, Uni n, …), departments and desktops, over links of 2.5 Gbps, 622 Mbps and 155 Mbps. MONARC report: http://home.cern.ch/~barone/monarc/RCArchitecture.html
More realistically – a Grid Topology (diagram): the same CERN Tier 0, Tier 1 (IN2P3, RAL, FNAL), Tier 2, department and desktop resources, interconnected as a grid rather than a strict hierarchy, with 2.5 Gbps, 622 Mbps and 155 Mbps links.
The Basic Problem - Summary • Scalability → cost, complexity, management • Thousands of processors, thousands of disks, PetaBytes of data, Terabits/second of I/O bandwidth, … • Wide-area distribution → complexity, management, bandwidth • WAN bandwidth is, and will remain, only ~1% of LAN bandwidth • Distribute, replicate, cache, synchronise the data • Multiple ownership, policies, … • Integration of this amorphous collection of Regional Centres … with some attempt at optimisation • Adaptability → flexibility, simplicity • We shall only know how analysis will be done once the data arrives
Surely this is an ideal testbed for Grid technology?
Are Grids a solution? Computational Grids • Change of orientation of meta-computing activity • From inter-connected super-computers … towards a more general concept of a computational power Grid (The Grid – Ian Foster, Carl Kesselman**) • Has found resonance with the press and funding agencies But what is a Grid? “Dependable, consistent, pervasive access to resources**” So, in some way, Grid technology makes it easy to use diverse, geographically distributed, locally managed and controlled computing facilities – as if they formed a coherent local cluster ** Ian Foster and Carl Kesselman, editors, “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann, 1999
What does the Grid do for you? • You submit your work • And the Grid • Finds convenient places for it to be run • Organises efficient access to your data • Caching, migration, replication • Deals with authentication to the different sites that you will be using • Interfaces to local site resource allocation mechanisms, policies • Runs your jobs • Monitors progress • Recovers from problems • Tells you when your work is complete • If there is scope for parallelism, it can also decompose your work into convenient execution units based on the available resources, data distribution
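To make this concrete, here is a deliberately simplified, hypothetical sketch (not DataGRID or Globus code) of the kind of decision a grid workload broker makes: prefer a site that already holds the needed data and has free capacity, otherwise fall back to the best-connected site. The site list, fields and scoring rule are illustrative assumptions.

```python
# Hypothetical sketch of a grid broker's placement decision; the site list,
# fields and scoring rule are illustrative assumptions, not a real middleware API.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    datasets: set          # datasets already replicated at this site
    wan_mbps: float        # bandwidth available to fetch missing data

def choose_site(sites, needed_dataset, cpus_needed):
    """Prefer sites that already cache the data; fall back to the best-connected one."""
    candidates = [s for s in sites if s.free_cpus >= cpus_needed]
    local = [s for s in candidates if needed_dataset in s.datasets]
    if local:
        return max(local, key=lambda s: s.free_cpus)    # data is local: pick the freest site
    return max(candidates, key=lambda s: s.wan_mbps)    # otherwise minimise the transfer cost

if __name__ == "__main__":
    sites = [
        Site("CERN",  200, {"run-0042"}, 2500.0),
        Site("RAL",   500, set(),         622.0),
        Site("IN2P3", 120, {"run-0042"},  622.0),
    ]
    print("submit to:", choose_site(sites, "run-0042", 100).name)
```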
Current state (HEP-centric view) Globus project (http://www.globus.org) • Basic middleware • Authentication • Information service • Resource management • Good basis to build on - open, flexible, collaborative community (http://www.gridforum.org) But - • Who is handling lots of data? • Where are the production quality implementations? • Can you do real work with Grids?
The DataGRID Project www.cern.ch/grid
The Data Grid Project • Proposal for EC Fifth Framework funding • Principal goals: • Middleware for fabric & grid management • Large scale testbed • Production quality demonstrations • “mock data”, simulation analysis, current experiments • Three year phased developments & demos • Collaborate with and complement other European and US projects • Open source and communication – • Global GRID Forum • DataGRID Industry and Research Forum
DataGRID Partners • Managing partners: UK – PPARC, Italy – INFN, France – CNRS, Holland – NIKHEF, Italy – ESA/ESRIN, CERN – project management (Fabrizio Gagliardi) • Industry: IBM (UK), Compagnie des Signaux (F), Datamat (I) • Associate partners: Helsinki Institute of Physics & CSC (Finland), Swedish Natural Science Research Council (Parallelldatorcentrum – KTH, Karolinska Institute), Istituto Trentino di Cultura, Zuse Institut Berlin, University of Heidelberg, CEA/DAPNIA (F), IFAE Barcelona, CNR (I), CESNET (CZ), KNMI (NL), SARA (NL), SZTAKI (HU)
Programme of work Middleware • Grid Workload Management, Data Management, Monitoring services • Management of the Local Computing Fabric • Mass Storage Production quality testbed • Testbed Integration & Network Services Scientific Applications • High Energy Physics • Earth Observation • Biology
Middleware Wide-area - building on an existing framework (Globus) • workload management - Cristina Vistoli/INFN-CNAF • The workload is chaotic – unpredictable job arrival rates, data access patterns • The goal is maximising the global system throughput (events processed per second) • data management - Ben Segal/CERN • Management of petabyte-scale data volumes, in an environment with limited network bandwidth and heavy use of mass storage (tape) • Caching, replication, synchronisation, object database model • application monitoring - Robin Middleton/RAL • Tens of thousands of components, thousands of jobs and individual users • End-user - tracking of the progress of jobs and aggregates of jobs • Understanding application and grid level performance • Administrator – understanding which global-level applications were affected by failures, and whether and how to recover
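As a toy illustration of the data-management problem (not the project's actual design), the sketch below shows the core bookkeeping: a replica catalogue mapping logical file names to physical copies, with a cache-on-first-access policy when no local replica exists. Names and the replication policy are assumptions for the example.

```python
# Toy replica catalogue with cache-on-read; names and policy are illustrative
# assumptions, not the DataGRID data-management architecture.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        # logical file name -> set of sites holding a physical replica
        self.replicas = defaultdict(set)

    def register(self, lfn: str, site: str):
        self.replicas[lfn].add(site)

    def open(self, lfn: str, local_site: str) -> str:
        """Return the site to read from, replicating to the local cache if needed."""
        sites = self.replicas[lfn]
        if not sites:
            raise FileNotFoundError(lfn)
        if local_site in sites:
            return local_site                      # cache hit: read locally
        source = sorted(sites)[0]                  # pick any remote replica (no cost model here)
        self.register(lfn, local_site)             # replicate over the WAN into the local cache
        return source

if __name__ == "__main__":
    cat = ReplicaCatalogue()
    cat.register("lfn:run42/esd.root", "CERN")
    print(cat.open("lfn:run42/esd.root", "RAL"))   # first access: read from CERN, cache at RAL
    print(cat.open("lfn:run42/esd.root", "RAL"))   # second access: served locally
```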
Middleware Local fabric – • Effective local site management of giant computing fabrics- Tim Smith/CERN • Automated installation, configuration management, system maintenance • Automated monitoring and error recovery - resilience, self-healing • Performance monitoring • Characterisation, mapping, management of local Grid resources • Mass storage management - John Gordon/RAL • multi-PetaByte data storage • “real-time” data recording requirement • active tape layer – 1,000s of users • uniform mass storage interface • exchange of data and meta-data between mass storage systems
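The "uniform mass storage interface" goal can be pictured as a thin abstraction over different mass-storage systems, so that applications exchange data without caring which tape system sits underneath. The sketch below is a hypothetical illustration; the backend names and method signatures are assumptions, not the work package's API.

```python
# Hypothetical sketch of a uniform mass-storage interface over different
# tape/disk systems; backends and methods are illustrative assumptions.
from abc import ABC, abstractmethod

class MassStorage(ABC):
    @abstractmethod
    def stage_in(self, name: str) -> str:
        """Bring a file from tape to disk and return a local path."""

    @abstractmethod
    def stage_out(self, local_path: str, name: str) -> None:
        """Migrate a file from disk back into the mass-storage system."""

class FakeTapeStore(MassStorage):
    """Stand-in backend so the example runs without a real tape system."""
    def __init__(self, label: str):
        self.label = label

    def stage_in(self, name: str) -> str:
        print(f"[{self.label}] staging in {name}")
        return f"/scratch/{name}"

    def stage_out(self, local_path: str, name: str) -> None:
        print(f"[{self.label}] migrating {local_path} -> {name}")

def archive_results(store: MassStorage, files):
    # Application code only sees the uniform interface, not the backend.
    for f in files:
        store.stage_out(f"/scratch/{f}", f)

if __name__ == "__main__":
    archive_results(FakeTapeStore("site-A"), ["run42/dst.root"])
```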
Infrastructure Operate a production quality trans-European “testbed” interconnecting clusters in several sites – • Initial testbed participants: CERN, RAL, INFN (several sites), IN2P3-Lyon, ESRIN (ESA-Italy), SARA/NIKHEF (Amsterdam), ZUSE Institut (Berlin), CESNET (Prague), IFAE (Barcelona), LIP (Lisbon), IFCA (Santander) … • Define, integrate and build successive releases of the project middleware • Define, negotiate and manage the network infrastructure • assume that this is largely TEN-155 and then GÉANT • Stage demonstrations, data challenges • Monitor, measure, evaluate, report Work package managers – Testbed: François Etienne/IN2P3-Marseille; Networking: Christian Michau/CNRS
Applications • HEP - Federico Carminati/CERN • The four LHC experiments • Live testbed for the Regional Centre model • Earth Observation - Luigi Fusco/ESA-ESRIN • ESA-ESRIN • KNMI (Dutch meteo) climatology • Processing of atmospheric ozone data derived from ERS GOME and ENVISAT SCIAMACHY sensors • Biology - Christian Michau/CNRS • CNRS (France), Karolinska (Sweden) • Application being defined
DataGRID non-technical challenges • Large, diverse, dispersed project • but one of the motivations was coordinating this type of activity in Europe • Collaboration, convergence with US and other Grid activities - this area is very dynamic • Organising access to adequate Network bandwidth – a vital ingredient for success of a Grid • Keeping the feet on the ground – The GRID is a good idea but it will be hard to meet the expectations of some recent press articles
Concluding remarks (i) • The scale of the computing needs of the LHC experiments is large by the standards of most scientific applications – • the hardware technology will be there to evolve the current architecture of “commodity clusters” into large scale computing fabrics • there are many performance, organisational and management problems to be solved • disappointingly, solutions on this scale are not (yet?) emerging from industry • The scale and cost of LHC computing imposes a geographically distributed model • the Grid technologies look very promising – they could deliver a major step forward in the usability and effectiveness of wide-area computing • LHC computing offers an ideal testbed for Grids – • “Simple” basic model – but large scale and a chaotic workload • Data intensive – real data, real users, production environment • Trans-border data – worldwide collaborations
Concluding remarks (ii) • The vision is easy access to shared wide-area distributed computing facilities, without the user having to know the details • The DataGRID project aims to provide practical experience and tools that can be adapted to the needs of a wide range of scientific and engineering applications