Data Grids for Next Generation Experiments Harvey B Newman California Institute of Technology


Presentation Transcript


  1. Data Grids for Next Generation Experiments
  • Harvey B Newman, California Institute of Technology
  • ACAT2000, Fermilab, October 19, 2000
  • http://l3www.cern.ch/~newman/grids_acat2k.ppt

  2. Physics and Technical Goals
  • The extraction of small or subtle new "discovery" signals from large and potentially overwhelming backgrounds; or "precision" analysis of large samples
  • Providing rapid access to event samples and subsets from massive data stores: from ~300 Terabytes in 2001, to Petabytes by ~2003, ~10 Petabytes by 2006, and ~100 Petabytes by ~2010
  • Providing analyzed results with rapid turnaround, by coordinating and managing the LIMITED computing, data handling and network resources effectively
  • Enabling rapid access to the data and the collaboration, across an ensemble of networks of varying capability, using heterogeneous resources

  3. Four LHC Experiments: The Petabyte to Exabyte Challenge
  • ATLAS, CMS, ALICE, LHCb: Higgs + new particles; Quark-Gluon Plasma; CP Violation
  • Data written to tape: ~25 Petabytes/year and UP (CPU: 6 MSi95 and UP)
  • 0.1 to 1 Exabyte (1 EB = 10^18 Bytes) total for the LHC experiments (~2010) (~2020?)

  4. LHC Vision: Data Grid Hierarchy
  • Online System: 1 bunch crossing, ~17 interactions per 25 nsecs, 100 triggers per second; event is ~1 MByte in size; ~PBytes/sec off the detector, ~100 MBytes/sec to the offline farm
  • Tier 0 +1: Experiment Offline Farm, CERN Computer Ctr (> 30 TIPS), with HPSS
  • Tier 1 (~0.6-2.5 Gbits/sec): FNAL Center, Italy Center, UK Center, France Center, each with HPSS
  • Tier 2 (~2.5 Gbits/sec): Tier2 Centers with HPSS
  • Tier 3 (~622 Mbits/sec): Institutes (~0.25 TIPS); physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels
  • Tier 4 (100 - 1000 Mbits/sec): Workstations with physics data cache
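
To make the slide's bandwidth figures concrete, a minimal Python sketch follows (not part of the talk; the link table and the 10 TB sample size are illustrative assumptions) estimating how long an event sample would take to cross each tier-to-tier link:

    # A minimal sketch (not from the talk): the tier-to-tier links and the
    # nominal bandwidths quoted on the slide, used to estimate how long an
    # event sample would take to move down the hierarchy.

    LINKS_GBPS = {
        "Online System -> Tier 0+1 (CERN)": 0.8,       # ~100 MBytes/sec
        "Tier 0+1 -> Tier 1 (regional center)": 2.5,   # ~0.6-2.5 Gbits/sec, upper value
        "Tier 1 -> Tier 2 (center)": 2.5,              # ~2.5 Gbits/sec
        "Tier 2 -> Tier 3 (institute)": 0.622,         # ~622 Mbits/sec
        "Tier 3 -> Tier 4 (workstation)": 1.0,         # 100-1000 Mbits/sec, upper value
    }

    def transfer_hours(sample_terabytes, link_gbps):
        """Idealized transfer time, ignoring protocol overhead and contention."""
        bits = sample_terabytes * 1e12 * 8
        return bits / (link_gbps * 1e9) / 3600.0

    if __name__ == "__main__":
        sample_tb = 10.0  # a 10 TB analysis sample, for illustration
        for link, gbps in LINKS_GBPS.items():
            print(f"{link:40s} {gbps:5.3f} Gbps  ~{transfer_hours(sample_tb, gbps):5.1f} h")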

  5. Why Worldwide Computing? Regional Center Concept: Advantages
  • Managed, fair-shared access for physicists everywhere
  • Maximize total funding resources while meeting the total computing and data handling needs
  • Balance between proximity of datasets to appropriate resources, and to the users: the Tier-N Model
  • Efficient use of network: higher throughput; per flow: local > regional > national > international
  • Utilizing all intellectual resources, in several time zones: CERN, national labs, universities, remote sites; involving physicists and students at their home institutions
  • Greater flexibility to pursue different physics interests, priorities, and resource allocation strategies by region, and/or by common interests (physics topics, subdetectors, ...)
  • Manage the system's complexity: partitioning facility tasks, to manage and focus resources

  6. SDSS Data Grid (in GriPhyN): A Shared Vision
  • Three main functions:
  • Raw data processing on a Grid (FNAL): rapid turnaround with TBs of data; accessible storage of all image data
  • Fast science analysis environment (JHU): combined data access + analysis of calibrated data; distributed I/O layer and processing layer, shared by the whole collaboration
  • Public data access: SDSS data browsing for astronomers and students; complex query engine for the public

  7. US-CERN BW Requirements Projection (PRELIMINARY)
  • [#] Includes ~1.5 Gbps each for ATLAS and CMS, plus BaBar, Run2 and other
  • [*] D0 and CDF at Run2: needs presumed to be comparable to BaBar

  8. Daily, Weekly, Monthly and Yearly Statistics on the 45 Mbps US-CERN Link

  9. Roles of Projects for HENP Distributed Analysis
  • RD45, GIOD: Networked Object Databases
  • Clipper/GC: High speed access to Objects or File data; FNAL/SAM for processing and analysis
  • SLAC/OOFS: Distributed File System + Objectivity Interface
  • NILE, Condor: Fault Tolerant Distributed Computing
  • MONARC: LHC Computing Models: Architecture, Simulation, Strategy, Politics
  • ALDAP: OO Database Structures & Access Methods for Astrophysics and HENP Data
  • PPDG: First Distributed Data Services and Data Grid System Prototype
  • GriPhyN: Production-Scale Data Grids
  • EU Data Grid

  10. Grid Services Architecture [*]
  • Applns: a rich set of HEP data-analysis related applications
  • Appln Toolkits: remote data toolkit, remote comp. toolkit, remote viz toolkit, remote collab. toolkit, remote sensors toolkit, ...
  • Grid Services: protocols, authentication, policy, resource management, instrumentation, discovery, etc.
  • Grid Fabric: data stores, networks, computers, display devices, ...; associated local services
  • [*] Adapted from Ian Foster: there are computing grids, access (collaborative) grids, data grids, ...
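
To illustrate the layering, here is a minimal sketch (all class and method names are hypothetical, not an API from the talk) in which an application-toolkit call rests on generic grid services, which in turn drive fabric-level resources:

    # Hypothetical sketch of the four layers named on the slide.
    # Each layer only talks to the layer directly below it.

    class GridFabric:
        """Fabric layer: concrete local resources (data stores, computers, networks)."""
        def read_local(self, path):
            return f"<contents of {path}>".encode()

    class GridServices:
        """Services layer: authentication, discovery, resource management, ..."""
        def __init__(self, fabric):
            self.fabric = fabric
            self.authenticated_users = set()

        def authenticate(self, user):
            self.authenticated_users.add(user)   # stand-in for real credentials

        def fetch(self, user, path):
            if user not in self.authenticated_users:
                raise PermissionError(f"{user} is not authenticated")
            return self.fabric.read_local(path)

    class RemoteDataToolkit:
        """Application-toolkit layer: convenience API used by physics applications."""
        def __init__(self, services):
            self.services = services

        def get_events(self, user, dataset):
            return self.services.fetch(user, f"/store/{dataset}")

    # Application layer: a data-analysis program using the toolkit.
    if __name__ == "__main__":
        toolkit = RemoteDataToolkit(GridServices(GridFabric()))
        toolkit.services.authenticate("physicist")
        print(toolkit.get_events("physicist", "higgs_candidates"))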

  11. The Particle Physics Data Grid (PPDG)
  • ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
  • Site to Site Data Replication Service at 100 MBytes/sec, between a PRIMARY SITE (Data Acquisition, CPU, Disk, Tape Robot) and a SECONDARY SITE (CPU, Disk, Tape Robot)
  • Multi-Site Cached File Access Service: university sites (CPU, Disk, Users) and satellite sites (Tape, CPU, Disk, Robot) around the PRIMARY SITE (DAQ, Tape, CPU, Disk, Robot)
  • First Round Goal: Optimized cached read access to 10-100 GBytes drawn from a total data set of 0.1 to ~1 Petabyte
  • Matchmaking, Co-Scheduling: SRB, Condor, Globus services; HRM, NWS
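
A minimal sketch of the "optimized cached read access" goal (hypothetical names; the real matchmaking and storage services are SRB, Condor, Globus, HRM and NWS): a satellite site serves reads from its local disk cache and stages misses from the primary site, evicting least-recently-used files to stay within its cache budget.

    # Hypothetical sketch of multi-site cached file access: a satellite site
    # serves reads from its local disk cache and stages misses from the
    # primary site, keeping the cache within a fixed size budget (LRU).
    from collections import OrderedDict

    class SiteCache:
        def __init__(self, capacity_gb, fetch_from_primary):
            self.capacity_gb = capacity_gb
            self.fetch_from_primary = fetch_from_primary  # callable: name -> (data, size_gb)
            self.files = OrderedDict()                    # name -> (data, size_gb)

        def read(self, name):
            if name in self.files:                        # cache hit
                self.files.move_to_end(name)
                return self.files[name][0]
            data, size_gb = self.fetch_from_primary(name) # cache miss: stage from primary
            while self.files and sum(s for _, s in self.files.values()) + size_gb > self.capacity_gb:
                self.files.popitem(last=False)            # evict least recently used file
            self.files[name] = (data, size_gb)
            return data

    if __name__ == "__main__":
        def primary_site(name):
            return (f"<events in {name}>".encode(), 2.0)  # pretend every file is 2 GB

        cache = SiteCache(capacity_gb=6.0, fetch_from_primary=primary_site)
        for f in ["run1.root", "run2.root", "run1.root", "run3.root", "run4.root"]:
            cache.read(f)
        print("cached at satellite site:", list(cache.files))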

  12. PPDG WG1: Request Manager
  • CLIENTs send a Logical Request to the REQUEST MANAGER (Request Interpreter, Request Planner with Matchmaking, Request Executor)
  • The Event-file Index and Replica Catalog resolve the request into a Logical Set of Files; the Network Weather Service informs replica selection
  • The Request Executor issues physical file transfer requests to the GRID
  • Storage: Disk Caches under DRMs, and an HRM with a Disk Cache in front of the tape system
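
A minimal sketch of that interpret-plan-execute flow (hypothetical catalogs and names, not the PPDG code): the logical request is expanded into files, each file is matched to the replica with the best predicted throughput, and only then are physical transfer requests issued.

    # Hypothetical sketch of the PPDG WG1 request-manager flow:
    # logical request -> logical set of files -> chosen replicas -> transfers.

    EVENT_FILE_INDEX = {                   # logical request -> logical file names
        "higgs_2e2mu_sample": ["evt_0001.db", "evt_0002.db"],
    }
    REPLICA_CATALOG = {                    # logical file -> physical replicas
        "evt_0001.db": ["gsiftp://fnal/cache/evt_0001.db", "gsiftp://cern/tape/evt_0001.db"],
        "evt_0002.db": ["gsiftp://cern/tape/evt_0002.db"],
    }
    NETWORK_WEATHER = {                    # predicted throughput to each site, MB/s
        "fnal": 90.0, "cern": 12.0,
    }

    def interpret(logical_request):
        """Request Interpreter: expand the request into a logical set of files."""
        return EVENT_FILE_INDEX[logical_request]

    def plan(logical_files):
        """Request Planner (matchmaking): pick the replica with the best predicted throughput."""
        def site_of(url):
            return url.split("://")[1].split("/")[0]
        return {f: max(REPLICA_CATALOG[f], key=lambda u: NETWORK_WEATHER[site_of(u)])
                for f in logical_files}

    def execute(chosen_replicas):
        """Request Executor: issue physical file transfer requests to the grid."""
        for logical, physical in chosen_replicas.items():
            print(f"transfer {physical} -> local disk cache  (for {logical})")

    if __name__ == "__main__":
        execute(plan(interpret("higgs_2e2mu_sample")))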

  13. Earth Grid System Prototype: Inter-communication Diagram
  • Client (LLNL), Replica Catalog (LDAP script, ANL), Request Manager (LBNL; LDAP C API or script), GIS with NWS, CORBA interface to the HRM
  • Data transfers via GSI-ncftp from GSI-wuftpd disk servers at ANL, ISI, LBNL and NCAR, and a GSI-pftpd server at SDSC in front of HPSS (disk on Clipper, HRM)

  14. Grid Data Management Prototype (GDMP)
  • GDMP V1.1: Caltech + EU DataGrid WP2; tests by Caltech, CERN, FNAL, Pisa for CMS "HLT" Production 10/2000; integration with ENSTORE, HPSS, Castor
  • Distributed Job Execution and Data Handling: goals are Transparency, Performance, Security, Fault Tolerance, Automation
  • Jobs are submitted to Site A, B or C and executed locally or remotely
  • Data is always written locally
  • Data is replicated to remote sites
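
A minimal sketch of the replication pattern on this slide (hypothetical names, not the GDMP API): a job always writes its output into the local site's catalog, and a separate replication step pushes the new file to the subscribed remote sites.

    # Hypothetical sketch of the GDMP pattern: jobs write data locally,
    # and newly written files are then replicated to remote sites.

    class Site:
        def __init__(self, name):
            self.name = name
            self.catalog = {}          # filename -> data held at this site

        def run_job(self, job_name, output_name):
            """Execute a job locally; output is always written to the local catalog."""
            self.catalog[output_name] = f"<output of {job_name} at {self.name}>"
            return output_name

        def replicate(self, filename, remote_sites):
            """Push a locally written file to each subscribed remote site."""
            for remote in remote_sites:
                if filename not in remote.catalog:
                    remote.catalog[filename] = self.catalog[filename]
                    print(f"{self.name} -> {remote.name}: replicated {filename}")

    if __name__ == "__main__":
        site_a, site_b, site_c = Site("Site A"), Site("Site B"), Site("Site C")
        produced = site_a.run_job("cms_hlt_production", "digis_001.db")
        site_a.replicate(produced, [site_b, site_c])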

  15. EU-Grid Project Work Packages

  16. GriPhyN: PetaScale Virtual Data Grids
  • Build the Foundation for Petascale Virtual Data Grids
  • Users (Production Team, Individual Investigator, Workgroups) work through Interactive User Tools
  • Middleware: Virtual Data Tools, Request Planning & Scheduling Tools, Request Execution & Management Tools
  • Supporting services: Resource Management Services, Security and Policy Services, Other Grid Services
  • Transforms act on distributed resources (code, storage, computers, and network) and the raw data source
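
At the core of GriPhyN is virtual data: a derived product is defined by the transformation that produces it, so a request can be satisfied either from an existing replica or by re-running the transform on the raw data. A minimal sketch under those assumptions (hypothetical names, not the GriPhyN toolkit):

    # Hypothetical sketch of virtual data: a derived product is looked up in a
    # materialized-data catalog first; if absent, the registered transform is
    # re-run on the raw data source and the result is registered for reuse.

    RAW_DATA_SOURCE = {"raw_run_17": [3, 1, 4, 1, 5, 9, 2, 6]}

    TRANSFORMS = {   # derived product -> (transform function, input dataset)
        "calibrated_run_17": (lambda raw: [2 * x for x in raw], "raw_run_17"),
    }

    MATERIALIZED = {}   # derived products that already exist somewhere on the grid

    def request(product):
        """Return the requested product, materializing it via its transform if needed."""
        if product in MATERIALIZED:
            print(f"{product}: found an existing replica")
            return MATERIALIZED[product]
        transform, source = TRANSFORMS[product]
        print(f"{product}: not materialized, running transform on {source}")
        result = transform(RAW_DATA_SOURCE[source])
        MATERIALIZED[product] = result        # register so later requests reuse it
        return result

    if __name__ == "__main__":
        request("calibrated_run_17")   # computed from raw data
        request("calibrated_run_17")   # served from the catalog the second time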

  17. Data Grids: Better Global Resource Use and Faster Turnaround
  • Build information and security infrastructures across several world regions
  • Authentication: prioritization, resource allocation
  • Coordinated use of computing, data handling and network resources through:
  • Data caching, query estimation, co-scheduling
  • Network and site "instrumentation": performance tracking, monitoring, problem trapping and handling
  • Robust transactions
  • Agent based: autonomous, adaptive, network efficient, resilient
  • Heuristic, adaptive load-balancing, e.g. self-organizing neural nets (Legrand)
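
As a toy illustration of the last bullet (a simple heuristic stand-in, not Legrand's self-organizing neural network): keep a running performance estimate per site, send each new job to the site currently expected to be fastest, and update the estimate from observed completion times.

    # Toy sketch of heuristic, adaptive load balancing across grid sites:
    # jobs go to the site with the best current performance estimate, and
    # each observed completion time feeds back into that estimate.
    import random

    class AdaptiveBalancer:
        def __init__(self, sites, smoothing=0.3):
            self.estimate = {s: 1.0 for s in sites}  # estimated seconds per job
            self.smoothing = smoothing               # weight of each new observation

        def pick_site(self):
            return min(self.estimate, key=self.estimate.get)

        def record(self, site, observed_seconds):
            old = self.estimate[site]
            self.estimate[site] = (1 - self.smoothing) * old + self.smoothing * observed_seconds

    if __name__ == "__main__":
        random.seed(1)
        true_speed = {"cern": 1.0, "fnal": 0.5, "caltech": 2.0}  # hidden seconds/job
        balancer = AdaptiveBalancer(true_speed)
        assigned = {s: 0 for s in true_speed}
        for _ in range(200):
            site = balancer.pick_site()
            assigned[site] += 1
            observed = true_speed[site] * random.uniform(0.8, 1.2)  # simulated run time
            balancer.record(site, observed)
        print("jobs per site:", assigned)   # most jobs should flow to the fastest site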

  18. GRIDs in 2000: Summary
  • Grids will change the way we do science and engineering: from computation to large-scale data
  • Key services and concepts have been identified, and development has started
  • Major IT challenges remain: an Opportunity & Obligation for HEP/CS Collaboration
  • Transition of services and applications to production use is starting to occur
  • In the future, more sophisticated integrated services and toolsets (Inter- and IntraGrids+) could drive advances in many fields of science & engineering
  • HENP, facing the need for Petascale Virtual Data, is both an early adopter and a leading developer of Data Grid technology
