The DataGRID: A Testbed for Worldwide Distributed Scientific Data Analysis. Nordunet Conference, Helsinki. Les Robertson, CERN - IT Division, 29 September 2000. les.robertson@cern.ch
Summary • HEP offline computing – the needs of LHC • An ideal testbed for Grids? • The DataGRID project • Concluding remarks
CERN – The European Organisation for Nuclear Research: 20 European countries, 2,700 staff, 6,000 users
One of the four LHC detectors – the online system uses a multi-level trigger to filter out background and reduce the data volume: • 40 MHz (40 TB/sec) → level 1 – special hardware • 75 kHz (75 GB/sec) → level 2 – embedded processors • 5 kHz (5 GB/sec) → level 3 – PCs • 100 Hz (100 MB/sec) → data recording & offline analysis
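As an illustration only (not the experiments' actual trigger code), the cascade above can be thought of as successive filters, each cheap enough to keep up with its input rate. The sketch below simply models the rate reduction, using pass fractions taken from the ratios of the rates quoted on this slide.

```python
# Illustrative sketch of a multi-level trigger cascade (not real trigger code).
# Pass fractions below are just the ratios of the rates quoted on the slide.

LEVELS = [
    ("level 1 (special hardware)",    75e3 / 40e6),   # 40 MHz -> 75 kHz
    ("level 2 (embedded processors)",  5e3 / 75e3),   # 75 kHz ->  5 kHz
    ("level 3 (PCs)",                100.0 /  5e3),   #  5 kHz -> 100 Hz
]

def trigger_cascade(input_rate_hz: float) -> float:
    """Propagate an event rate through the cascade, printing each stage."""
    rate = input_rate_hz
    for name, pass_fraction in LEVELS:
        rate *= pass_fraction
        print(f"after {name}: {rate:,.0f} events/sec")
    return rate

if __name__ == "__main__":
    trigger_cascade(40e6)   # ends at ~100 Hz written to storage
```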
CERN Data Handling and Computation for Physics Analysis (data-flow diagram): detector → event filter (selection & reconstruction) → raw data and processed data → event summary data; event reprocessing, event simulation and batch physics analysis produce analysis objects (extracted by physics topic) for interactive physics analysis.
HEP Computing Characteristics • Large numbers of independent events • trivial parallelism • Large data sets • smallish records • mostly read-only • Modest I/O rates • few MB/sec per fast processor • Modest floating point requirement • SPECint performance • Very large aggregate requirements – computation, data • Scaling up is not just big – it is also complex • …and once you exceed the capabilities of a single geographical installation …?
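Because events are independent, the workload parallelises trivially: each worker can process its own events with no communication between them. A minimal sketch of this event-level parallelism follows; the reconstruct function is a hypothetical stand-in, not experiment code.

```python
# Minimal sketch of event-level ("trivial") parallelism.
# `reconstruct` is a stand-in for real per-event reconstruction code.
from multiprocessing import Pool

def reconstruct(event: dict) -> dict:
    # Placeholder for CPU-heavy, mostly-integer per-event processing.
    return {"id": event["id"], "tracks": len(event["hits"]) // 3}

def process_events(events, workers: int = 8):
    # Events share no state, so they can be farmed out independently.
    with Pool(processes=workers) as pool:
        return pool.map(reconstruct, events)

if __name__ == "__main__":
    sample = [{"id": i, "hits": list(range(i % 30))} for i in range(1000)]
    results = process_events(sample)
    print(len(results), "events reconstructed")
```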
Generic computing farm (diagram): network servers, application servers, disk servers, tape servers.
The LHC Detectors – CMS, ATLAS, LHCb: raw recording rate 0.1 – 1 GB/sec, 3.5 PetaBytes/year, ~10^8 events/year.
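A rough sanity check of these volume figures, assuming (purely for illustration) about 10^7 seconds of effective data taking per year:

```python
# Back-of-the-envelope check of the recording-rate and yearly-volume figures.
# The 1e7 s of effective data taking per year is an assumption for illustration.
recording_rate_mb_s = 100          # 100 MB/sec (lower end of 0.1 - 1 GB/sec)
live_seconds_per_year = 1e7        # assumed effective running time

volume_pb = recording_rate_mb_s * live_seconds_per_year / 1e9  # MB -> PB
print(f"~{volume_pb:.1f} PB/year per experiment at 100 MB/sec")
# ~1 PB/year per experiment; several experiments together give a few PB/year,
# the same order as the 3.5 PB/year quoted on the slide.
```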
Estimated CPU capacity required at CERN (chart): LHC experiments and other experiments, compared with Moore's law – some measure of the capacity that technology advances provide for a constant number of processors or investment. Jan 2000: 3.5K SI95. Less than 50% of the main analysis capacity will be at CERN.
Components to Fabrics Commodity components are just fine for HEP • Masses of experience with inexpensive farms • Long experience with mass storage • LAN technology is going the right way • Inexpensive high performance PC attachments • Compatible with hefty backbone switches • Good ideas for improving automated operation and management • Just needs some solid computer engineering R&D?
World Wide Collaboration – distributed computing & storage capacity. CMS: 1,800 physicists, 150 institutes, 32 countries.
Two Problems • Funding – will funding bodies place all their investment at CERN? No. • Geography – does a geographically distributed model better serve the needs of the world-wide distributed community? Maybe – if it is reliable and easy to use.
Solution? - Regional Computing Centres • Exploit established computing expertise & infrastructure in national labs, universities • Reduce dependence on links to CERN – full summary data available nearby, through a fat, fast, reliable network link • Tap funding sources not otherwise available to HEP at CERN • Devolve control over resource allocation • national interests? • regional interests? • at the expense of physics interests?
Regional Centres – a Multi-Tier Model (diagram): CERN (Tier 0) connected to Tier 1 centres (IN2P3, RAL, FNAL) and onward to Tier 2 centres (Lab a, Uni b, Lab c, Uni n, …), departments and desktops, over links of 2.5 Gbps, 622 Mbps and 155 Mbps. MONARC report: http://home.cern.ch/~barone/monarc/RCArchitecture.html
More realistically – a Grid Topology (diagram): the same CERN Tier 0, Tier 1 (IN2P3, RAL, FNAL), Tier 2, department and desktop resources, interconnected as a grid rather than a strict hierarchy, with 2.5 Gbps, 622 Mbps and 155 Mbps links.
The Basic Problem - Summary • Scalability → cost, complexity, management • Thousands of processors, thousands of disks, PetaBytes of data, Terabits/second of I/O bandwidth, … • Wide-area distribution → complexity, management, bandwidth • WAN bandwidth is, and will remain, only ~1% of LAN bandwidth • Distribute, replicate, cache, synchronise the data • Multiple ownership, policies, … • Integration of this amorphous collection of Regional Centres … with some attempt at optimisation • Adaptability → flexibility, simplicity • We shall only know how analysis will be done once the data arrives
Surely this is an ideal testbed for Grid technology?
Are Grids a solution? Computational Grids • Change of orientation of meta-computing activity • From inter-connected super-computers … towards a more general concept of a computational power Grid (The Grid – Ian Foster, Carl Kesselman**) • Has found resonance with the press and funding agencies But what is a Grid? “Dependable, consistent, pervasive access to resources**” So, in some way, Grid technology makes it easy to use diverse, geographically distributed, locally managed and controlled computing facilities – as if they formed a coherent local cluster ** Ian Foster and Carl Kesselman, editors, “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann, 1999
What does the Grid do for you? • You submit your work • And the Grid • Finds convenient places for it to be run • Organises efficient access to your data • Caching, migration, replication • Deals with authentication to the different sites that you will be using • Interfaces to local site resource allocation mechanisms, policies • Runs your jobs • Monitors progress • Recovers from problems • Tells you when your work is complete • If there is scope for parallelism, it can also decompose your work into convenient execution units based on the available resources, data distribution
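To make this concrete, here is a deliberately simplified, hypothetical sketch (not DataGRID or Globus code) of the kind of decision a grid workload broker makes: prefer a site that already holds the needed data and has free capacity, otherwise fall back to the best-connected site. The site list, fields and scoring rule are illustrative assumptions.

```python
# Hypothetical sketch of a grid broker's placement decision; the site list,
# fields and scoring rule are illustrative assumptions, not a real middleware API.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    datasets: set          # datasets already replicated at this site
    wan_mbps: float        # bandwidth available to fetch missing data

def choose_site(sites, needed_dataset, cpus_needed):
    """Prefer sites that already cache the data; fall back to the best-connected one."""
    candidates = [s for s in sites if s.free_cpus >= cpus_needed]
    local = [s for s in candidates if needed_dataset in s.datasets]
    if local:
        return max(local, key=lambda s: s.free_cpus)    # data is local: pick the freest site
    return max(candidates, key=lambda s: s.wan_mbps)    # otherwise minimise the transfer cost

if __name__ == "__main__":
    sites = [
        Site("CERN",  200, {"run-0042"}, 2500.0),
        Site("RAL",   500, set(),         622.0),
        Site("IN2P3", 120, {"run-0042"},  622.0),
    ]
    print("submit to:", choose_site(sites, "run-0042", 100).name)
```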
Current state (HEP-centric view) Globus project (http://www.globus.org) • Basic middleware • Authentication • Information service • Resource management • Good basis to build on - open, flexible, collaborative community (http://www.gridforum.org) But - • Who is handling lots of data? • Where are the production quality implementations? • Can you do real work with Grids?
The DataGRID Project www.cern.ch/grid
The Data Grid Project • Proposal for EC Fifth Framework funding • Principal goals: • Middleware for fabric & grid management • Large scale testbed • Production quality demonstrations • “mock data”, simulation analysis, current experiments • Three year phased developments & demos • Collaborate with and complement other European and US projects • Open source and communication – • Global GRID Forum • DataGRID Industry and Research Forum
DataGRID Partners • Managing partners: UK – PPARC, Italy – INFN, France – CNRS, Holland – NIKHEF, Italy – ESA/ESRIN, CERN – project management (Fabrizio Gagliardi) • Industry: IBM (UK), Compagnie des Signaux (F), Datamat (I) • Associate partners: Helsinki Institute of Physics & CSC (Finland), Swedish Natural Science Research Council (Parallelldatorcentrum – KTH, Karolinska Institute), Istituto Trentino di Cultura, Zuse Institut Berlin, University of Heidelberg, CEA/DAPNIA (F), IFAE Barcelona, CNR (I), CESNET (CZ), KNMI (NL), SARA (NL), SZTAKI (HU)
Programme of work Middleware • Grid Workload Management, Data Management, Monitoring services • Management of the Local Computing Fabric • Mass Storage Production quality testbed • Testbed Integration & Network Services Scientific Applications • High Energy Physics • Earth Observation • Biology
Middleware Wide-area - building on an existing framework (Globus) • workload management - Cristina Vistoli/INFN-CNAF • The workload is chaotic – unpredictable job arrival rates, data access patterns • The goal is maximising the global system throughput (events processed per second) • data management - Ben Segal/CERN • Management of petabyte-scale data volumes, in an environment with limited network bandwidth and heavy use of mass storage (tape) • Caching, replication, synchronisation, object database model • application monitoring - Robin Middleton/RAL • Tens of thousands of components, thousands of jobs and individual users • End-user - tracking of the progress of jobs and aggregates of jobs • Understanding application and grid level performance • Administrator – understanding which global-level applications were affected by failures, and whether and how to recover
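As a toy illustration of the data-management problem (not the project's actual design), the sketch below shows the core bookkeeping: a replica catalogue mapping logical file names to physical copies, with a cache-on-first-access policy when no local replica exists. Names and the replication policy are assumptions for the example.

```python
# Toy replica catalogue with cache-on-read; names and policy are illustrative
# assumptions, not the DataGRID data-management architecture.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        # logical file name -> set of sites holding a physical replica
        self.replicas = defaultdict(set)

    def register(self, lfn: str, site: str):
        self.replicas[lfn].add(site)

    def open(self, lfn: str, local_site: str) -> str:
        """Return the site to read from, replicating to the local cache if needed."""
        sites = self.replicas[lfn]
        if not sites:
            raise FileNotFoundError(lfn)
        if local_site in sites:
            return local_site                      # cache hit: read locally
        source = sorted(sites)[0]                  # pick any remote replica (no cost model here)
        self.register(lfn, local_site)             # replicate over the WAN into the local cache
        return source

if __name__ == "__main__":
    cat = ReplicaCatalogue()
    cat.register("lfn:run42/esd.root", "CERN")
    print(cat.open("lfn:run42/esd.root", "RAL"))   # first access: read from CERN, cache at RAL
    print(cat.open("lfn:run42/esd.root", "RAL"))   # second access: served locally
```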
Middleware Local fabric – • Effective local site management of giant computing fabrics- Tim Smith/CERN • Automated installation, configuration management, system maintenance • Automated monitoring and error recovery - resilience, self-healing • Performance monitoring • Characterisation, mapping, management of local Grid resources • Mass storage management - John Gordon/RAL • multi-PetaByte data storage • “real-time” data recording requirement • active tape layer – 1,000s of users • uniform mass storage interface • exchange of data and meta-data between mass storage systems
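The "uniform mass storage interface" goal can be pictured as a thin abstraction over different mass-storage systems, so that applications exchange data without caring which tape system sits underneath. The sketch below is a hypothetical illustration; the backend names and method signatures are assumptions, not the work package's API.

```python
# Hypothetical sketch of a uniform mass-storage interface over different
# tape/disk systems; backends and methods are illustrative assumptions.
from abc import ABC, abstractmethod

class MassStorage(ABC):
    @abstractmethod
    def stage_in(self, name: str) -> str:
        """Bring a file from tape to disk and return a local path."""

    @abstractmethod
    def stage_out(self, local_path: str, name: str) -> None:
        """Migrate a file from disk back into the mass-storage system."""

class FakeTapeStore(MassStorage):
    """Stand-in backend so the example runs without a real tape system."""
    def __init__(self, label: str):
        self.label = label

    def stage_in(self, name: str) -> str:
        print(f"[{self.label}] staging in {name}")
        return f"/scratch/{name}"

    def stage_out(self, local_path: str, name: str) -> None:
        print(f"[{self.label}] migrating {local_path} -> {name}")

def archive_results(store: MassStorage, files):
    # Application code only sees the uniform interface, not the backend.
    for f in files:
        store.stage_out(f"/scratch/{f}", f)

if __name__ == "__main__":
    archive_results(FakeTapeStore("site-A"), ["run42/dst.root"])
```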
Infrastructure Operate a production quality trans-European “testbed” interconnecting clusters in several sites – • Initial testbed participants: CERN, RAL, INFN (several sites), IN2P3-Lyon, ESRIN (ESA-Italy), SARA/NIKHEF (Amsterdam), ZUSE Institut (Berlin), CESNET (Prague), IFAE (Barcelona), LIP (Lisbon), IFCA (Santander) … • Define, integrate and build successive releases of the project middleware • Define, negotiate and manage the network infrastructure • assume that this is largely TEN-155 and then GÉANT • Stage demonstrations, data challenges • Monitor, measure, evaluate, report Work package managers – Testbed: François Etienne/IN2P3-Marseille; Networking: Christian Michau/CNRS
Applications • HEP - Federico Carminati/CERN • The four LHC experiments • Live testbed for the Regional Centre model • Earth Observation - Luigi Fusco/ESA-ESRIN • ESA-ESRIN • KNMI (Dutch meteo) climatology • Processing of atmospheric ozone data derived from ERS GOME and ENVISAT SCIAMACHY sensors • Biology - Christian Michau/CNRS • CNRS (France), Karolinska (Sweden) • Application being defined
DataGRID non-technical challenges • Large, diverse, dispersed project • but one of the motivations was coordinating this type of activity in Europe • Collaboration, convergence with US and other Grid activities - this area is very dynamic • Organising access to adequate Network bandwidth – a vital ingredient for success of a Grid • Keeping the feet on the ground – The GRID is a good idea but it will be hard to meet the expectations of some recent press articles
Concluding remarks (i) • The scale of the computing needs of the LHC experiments is large by the standards of most scientific applications – • the hardware technology will be there to evolve the current architecture of “commodity clusters” into large scale computing fabrics • there are many performance, organisational and management problems to be solved • disappointingly, solutions on this scale are not (yet?) emerging from industry • The scale and cost of LHC computing imposes a geographically distributed model • the Grid technologies look very promising – they could deliver a major step forward in the usability and effectiveness of wide-area computing • LHC computing offers an ideal testbed for Grids – • “Simple” basic model – but large scale and a chaotic workload • Data intensive – real data, real users, production environment • Trans-border data – worldwide collaborations
Concluding remarks (ii) • The vision is easy access to shared wide-area distributed computing facilities, without the user having to know the details • The DataGRID project aims to provide practical experience and tools that can be adapted to the needs of a wide range of scientific and engineering applications