Presentation Transcript


  1. Grid Computing in High Energy Physics: Enabling Data Intensive Global Science Paul Avery University of Florida avery@phys.ufl.edu Lishep2004 Workshop UERJ, Rio de Janeiro, February 17, 2004 Paul Avery

  2. The Grid Concept • Grid: Geographically distributed computing resources configured for coordinated use • Fabric: Physical resources & networks provide raw capability • Ownership: Resources controlled by owners and shared w/ others • Middleware: Software ties it all together: tools, services, etc. • Goal: Transparent resource sharing (figure: US-CMS "Virtual Organization") Paul Avery

  3. LHC: Key Driver for Data Grids • Complexity: Millions of individual detector channels • Scale: PetaOps (CPU), 100s of Petabytes (Data) • Distribution: Global distribution of people & resources (CMS collaboration: 2000+ physicists, 159 institutes, 36 countries) Paul Avery

  4. LHC Data and CPU Requirements (CMS, ATLAS, LHCb) • Storage: Raw recording rate 0.1 – 1 GB/s, plus Monte Carlo data; 100 PB total by ~2010 • Processing: PetaOps (> 300,000 3 GHz PCs) Paul Avery
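A quick back-of-envelope check of these numbers (a sketch only; the ~10^7 seconds of live data taking per year is an assumed duty cycle, not a figure from the slide):

```python
# Back-of-envelope check of the slide's storage numbers (illustrative only).
SECONDS_PER_YEAR = 1e7          # ~10^7 s of live data taking per year (assumption)
raw_rates_gb_s = (0.1, 1.0)     # raw recording rate per experiment, GB/s (from slide)

for rate in raw_rates_gb_s:
    pb_per_year = rate * SECONDS_PER_YEAR / 1e6   # 1 PB = 10^6 GB
    print(f"{rate:4.1f} GB/s  ->  ~{pb_per_year:5.1f} PB/year per experiment")

# With several experiments plus Monte Carlo on top, ~100 PB by ~2010 is plausible.
```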

  5. Complexity: Higgs Decay into 4 muons • 10^9 collisions/sec, selectivity: 1 in 10^13 Paul Avery
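To make the selectivity figure concrete, the round numbers quoted on the slide imply only a handful of selected candidate events per day (illustrative arithmetic only):

```python
# Rough event-rate arithmetic behind the "1 in 10^13" selectivity figure.
collision_rate = 1e9      # proton-proton collisions per second (from slide)
selectivity = 1e-13       # fraction selected as Higgs -> 4 muon candidates (from slide)

events_per_second = collision_rate * selectivity
events_per_day = events_per_second * 86_400
print(f"~{events_per_second:.1e} events/s  ->  ~{events_per_day:.1f} candidate events/day")
```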

  6. LHC Global Collaborations (CMS, ATLAS): several thousand collaborators per experiment Paul Avery

  7. Driver for Transatlantic Networks (Gb/s) 2001 estimates, now seen as conservative! Paul Avery

  8. HEP Bandwidth Roadmap (Gb/s) HEP: Leading role in network development/deployment Paul Avery

  9. Global LHC Data Grid Hierarchy (CMS experiment) • Online system → Tier 0 (CERN Computer Center): 0.1 – 1.5 GBytes/s • Tier 0 → Tier 1 national centers (e.g. Korea, Russia, UK, USA): 10-40 Gb/s • Tier 1 → Tier 2 centers: 2.5-10 Gb/s • Tier 2 → Tier 3 (institutes): 1-2.5 Gb/s • Tier 3 → Tier 4 (PCs), physics caches: 1-10 Gb/s • ~10s of Petabytes/yr by 2007-8; ~1000 Petabytes in < 10 yrs? Paul Avery
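A minimal sketch of this tier hierarchy as a data structure, assuming each tier records the bandwidth of its uplink to the parent tier; the site names are the examples from the slide and everything else is illustrative:

```python
# Sketch of the tiered distribution model described on the slide.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Tier:
    name: str
    uplink_gbps: Optional[Tuple[float, float]]   # (min, max) Gb/s of link to parent tier
    children: List["Tier"] = field(default_factory=list)

tier0 = Tier("CERN Tier 0", None)
for country in ("USA", "UK", "Russia", "Korea"):
    t1 = Tier(f"{country} Tier 1", (10, 40))
    t1.children.append(Tier("Tier 2 center", (2.5, 10)))
    tier0.children.append(t1)

def show(node: Tier, depth: int = 0) -> None:
    """Print the hierarchy with the uplink bandwidth of each node."""
    link = "" if node.uplink_gbps is None else f"  ({node.uplink_gbps[0]}-{node.uplink_gbps[1]} Gb/s uplink)"
    print("  " * depth + node.name + link)
    for child in node.children:
        show(child, depth + 1)

show(tier0)
```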

  10. Most IT Resources Outside CERN (chart: estimated 2008 resources, Tier0 + Tier1 + Tier2, with shares of 10%, 17%, 38%) Paul Avery

  11. Analysis by Globally Distributed Teams • Non-hierarchical: Chaotic analyses + productions • Superimpose significant random data flows Paul Avery

  12. Data Grid Projects Paul Avery

  13. Global Context of Data Grid Projects • U.S. projects: GriPhyN (NSF), iVDGL (NSF), Particle Physics Data Grid (DOE), PACIs and TeraGrid (NSF), DOE Science Grid (DOE), NEESgrid (NSF), NSF Middleware Initiative (NSF) • EU and Asia projects: European Data Grid (EU), EDG-related national projects, DataTAG (EU), LHC Computing Grid (CERN), EGEE (EU), CrossGrid (EU), GridLab (EU), Japanese and Korean projects • HEP-related Grid infrastructure projects • Two primary project clusters: US + EU • Not exclusively HEP, but driven/led by HEP (with CS) • Many 10s of $M brought into the field Paul Avery

  14. International Grid Coordination • Global Grid Forum (GGF) • Grid standards body • New PNPA research group • HICB: HEP Inter-Grid Coordination Board • Non-competitive forum, strategic issues, consensus • Cross-project policies, procedures and technology, joint projects • HICB-JTB Joint Technical Board • Technical coordination, e.g. GLUE interoperability effort • LHC Computing Grid (LCG) • Technical work, deployments • POB, SC2, PEB, GDB, GAG, etc. Paul Avery

  15. LCG: LHC Computing Grid Project • A persistent Grid infrastructure for LHC experiments • Matched to the decades-long research program of the LHC • Prepare & deploy computing environment for LHC expts • Common applications, tools, frameworks and environments • Deployment oriented: no middleware development(?) • Move from testbed systems to real production services • Operated and supported 24x7 globally • Computing fabrics run as production physics services • A robust, stable, predictable, supportable infrastructure Paul Avery

  16. LCG Activities • Relies on strong leverage • CERN IT • LHC experiments: ALICE, ATLAS, CMS, LHCb • Grid projects: Trillium, DataGrid, EGEE • Close relationship with EGEE project (see Gagliardi talk) • EGEE will create EU production Grid (funds 70 institutions) • Middleware activities done in common (same manager) • LCG deployment in common (LCG Grid is EGEE prototype) • LCG-1 deployed (Sep. 2003) • Tier-1 laboratories (including FNAL and BNL) • LCG-2 in process of being deployed Paul Avery

  17. Sep. 29, 2003 announcement Paul Avery

  18. LCG-1 Sites Paul Avery

  19. Grid Analysis Environment • GAE and analysis teams • Extending locally, nationally, globally • Sharing worldwide computing, storage & network resources • Utilizing intellectual contributions regardless of location • HEPCAL and HEPCAL II • Delineate common LHC use cases & requirements (HEPCAL) • Same for analysis (HEPCAL II) • ARDA: Architectural Roadmap for Distributed Analysis • ARDA document released as LCG-2003-033 (Oct. 2003) • Workshop January 2004, presentations by the 4 experiments (CMS: CAIGEE, ALICE: AliEn, ATLAS: ADA, LHCb: DIRAC) Paul Avery

  20. One View of ARDA (diagram of core services): User Application, Storage Element, Job Monitor, Computing Element, Information Service, Job Provenance, Authentication, Authorization, Auditing, Grid Access Service, Accounting, Metadata Catalog, Grid Monitoring, File Catalog, Workload Management, Site Gatekeeper, Data Management, Package Manager; from LCG-2003-033 (Oct. 2003) Paul Avery
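The decomposition above can be made concrete with a few interface stubs. This is not the ARDA specification, only a hedged sketch; the method names (lookup, query, submit, run_analysis) are assumptions chosen for illustration:

```python
# Illustrative service interfaces for a few of the ARDA components named above.
from abc import ABC, abstractmethod
from typing import List

class FileCatalog(ABC):
    @abstractmethod
    def lookup(self, logical_name: str) -> List[str]:
        """Return the physical replicas registered for a logical file name."""

class MetadataCatalog(ABC):
    @abstractmethod
    def query(self, selection: str) -> List[str]:
        """Return logical file names whose metadata match the selection."""

class WorkloadManagement(ABC):
    @abstractmethod
    def submit(self, job_description: dict) -> str:
        """Submit a job and return a job identifier."""

class GridAccessService(ABC):
    """Single entry point the user application talks to; it delegates to the
    catalogs and workload management behind the scenes."""
    @abstractmethod
    def run_analysis(self, metadata_selection: str, executable: str) -> str: ...
```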

  21. CMS Example: CAIGEE • Clients talk standard protocols (HTTP, SOAP, XML-RPC) to a "Grid Services Web Server", a.k.a. the Clarens data/services portal with a simple web-services API • The Clarens portal hides the complexity of Grid services from the client • Key features: global scheduler, catalogs, monitoring, and Grid-wide execution service • Clarens servers form a global peer-to-peer network (diagram: analysis clients, Grid services web server, scheduler, catalogs, metadata, virtual data, applications, data management, monitoring, replica, fully-abstract / partially-abstract / fully-concrete planners, grid execution priority manager, Grid-wide execution service) Paul Avery
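A minimal sketch of what "clients talk standard protocols" could look like from the analysis-client side, using Python's standard XML-RPC support. The portal URL and the file.ls method are hypothetical, not the documented Clarens API; system.listMethods is the generic XML-RPC introspection call, which a server may or may not expose:

```python
# Sketch of an analysis client talking XML-RPC to a Clarens-style portal.
import xmlrpc.client

PORTAL_URL = "https://clarens.example.org:8443/clarens/"   # hypothetical endpoint

portal = xmlrpc.client.ServerProxy(PORTAL_URL)
try:
    # Standard XML-RPC introspection, if the server supports it
    print(portal.system.listMethods())
    # Hypothetical catalog-browsing method exposed by the portal
    print(portal.file.ls("/store/user/higgs"))
except (xmlrpc.client.Fault, OSError) as err:
    print(f"portal call failed: {err}")
```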

  22. LCG Timeline • Phase 1: 2002 – 2005 • Build a service prototype, based on existing grid middleware • Gain experience in running a production grid service • Produce the Computing TDR for the final system • Phase 2: 2006 – 2008 • Build and commission the initial LHC computing environment • Phase 3: ? Paul Avery

  23. “Trillium”: U.S. Physics Grid Projects • Trillium = PPDG + GriPhyN + iVDGL • Large overlap in leadership, people, experiments • HEP members are main drivers, esp. LHC experiments • Benefit of coordination • Common software base + packaging: VDT + PACMAN • Collaborative / joint projects: monitoring, demos, security, … • Wide deployment of new technologies, e.g. Virtual Data • Forum for US Grid projects • Joint strategies, meetings and work • Unified U.S. entity to interact with international Grid projects • Build significant Grid infrastructure: Grid2003 Paul Avery

  24. Science Drivers for U.S. HEP Grids (chart: community growth and data growth, 2001-2009) • LHC experiments: 100s of Petabytes • Current HENP experiments: ~1 Petabyte (1000 TB) • LIGO: 100s of Terabytes • Sloan Digital Sky Survey: 10s of Terabytes • Future Grid resources: massive CPU (PetaOps), large distributed datasets (>100 PB), global communities (1000s) Paul Avery

  25. Virtual Data Toolkit • A unique resource for managing, testing, supporting, deploying, packaging, upgrading, & troubleshooting complex sets of software (build & test diagram: sources in CVS from contributors (VDS, etc.); NMI and VDT builds & tests on a Condor pool of 37 computers; packaging and patching into a Pacman cache, RPMs, and GPT source bundles; will use NMI processes soon) Paul Avery

  26. Virtual Data Toolkit: Tools in VDT 1.1.12 • Globus Alliance: Grid Security Infrastructure (GSI), job submission (GRAM), information service (MDS), data transfer (GridFTP), Replica Location Service (RLS) • Condor Group: Condor/Condor-G, DAGMan, Fault Tolerant Shell, ClassAds • EDG & LCG: Make Gridmap, Certificate Revocation List Updater, Glue Schema/Info provider • ISI & UC: Chimera & related tools, Pegasus • NCSA: MyProxy, GSI OpenSSH • LBL: PyGlobus, Netlogger • Caltech: MonALISA • VDT: VDT System Profiler, configuration software • Others: KX509 (U. Mich.) Paul Avery
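Since Condor-G and DAGMan are part of the toolkit, a multi-step workflow can be expressed as a DAG of jobs. The sketch below writes a two-step DAGMan input file; the submit-file names are placeholders, and only the JOB / PARENT-CHILD structure is the point:

```python
# Sketch: a simulation -> reconstruction workflow as a DAGMan input file.
dag_text = """\
JOB simulate simulate.sub
JOB reconstruct reconstruct.sub
PARENT simulate CHILD reconstruct
"""

with open("analysis.dag", "w") as f:
    f.write(dag_text)

# DAGMan would then run the workflow, e.g. via: condor_submit_dag analysis.dag
print(dag_text)
```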

  27. VDT Growth (currently 1.1.13) (chart milestones: VDT 1.0 with Globus 2.0b and Condor 6.3.1; VDT 1.1.3, 1.1.4 & 1.1.5 pre-SC2002; VDT 1.1.7, switch to Globus 2.2; VDT 1.1.8, first real use by LCG; VDT 1.1.11, Grid2003) Paul Avery

  28. Virtual Data: Derivation and Provenance • Most scientific data are not simple “measurements” • They are computationally corrected/reconstructed • They can be produced by numerical simulation • Science & eng. projects are more CPU and data intensive • Programs are significant community resources (transformations) • So are the executions of those programs (derivations) • Management of dataset dependencies critical! • Derivation: Instantiation of a potential data product • Provenance: Complete history of any existing data product • Previously: Manual methods • GriPhyN: Automated, robust tools (Chimera Virtual Data System) Paul Avery
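An illustrative model of the transformation / derivation / provenance distinction described above (not the Chimera Virtual Data Language itself): each derivation records which transformation produced a dataset and from which inputs, so the full history of a data product can be walked back or re-executed. The names and catalog entries are hypothetical.

```python
# Toy model of transformations, derivations, and provenance tracking.
from dataclasses import dataclass

@dataclass(frozen=True)
class Transformation:              # a registered program: a community resource
    name: str
    version: str

@dataclass(frozen=True)
class Derivation:                  # one execution of a transformation
    transformation: Transformation
    inputs: tuple                  # logical file names consumed
    output: str                    # logical file name produced

catalog = [
    Derivation(Transformation("reco", "v3"), ("raw_run1001",), "reco_run1001"),
    Derivation(Transformation("select_WW", "v1"), ("reco_run1001",), "ww_candidates"),
]

def provenance(lfn, derivations):
    """Yield every derivation step that the given logical file depends on."""
    for d in derivations:
        if d.output == lfn:
            yield d
            for parent in d.inputs:
                yield from provenance(parent, derivations)

for step in provenance("ww_candidates", catalog):
    inputs = ", ".join(step.inputs)
    print(f"{step.output} <- {step.transformation.name} {step.transformation.version} ({inputs})")
```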

  29. LHC Higgs Analysis with Virtual Data • Scientist adds a new derived data branch & continues analysis (diagram: derivation tree with branches decay = bb, decay = WW, decay = ZZ; WW → leptons, WW → e; cuts such as mass = 160, Pt > 20, and other cuts) Paul Avery

  30. Chimera: Sloan Galaxy Cluster Analysis (figures: Sloan data, galaxy cluster size distribution) Paul Avery

  31. Outreach: QuarkNet-Trillium Virtual Data Portal • More than a web site • Organize datasets • Perform simple computations • Create new computations & analyses • View & share results • Annotate & enquire (metadata) • Communicate and collaborate • Easy to use, ubiquitous • No tools to install • Open to the community • Grow & extend • Initial prototype implemented by graduate student Yong Zhao and M. Wilde (U. of Chicago) Paul Avery

  32. CHEPREO: Center for High Energy Physics Research and Educational Outreach, Florida International University • Physics Learning Center • CMS Research • iVDGL Grid Activities • AMPATH network (S. America) • Funded September 2003

  33. Grid2003: An Operational Grid • 28 sites (2100-2800 CPUs) • 400-1300 concurrent jobs • 10 applications • Running since October 2003 • http://www.ivdgl.org/grid2003 (site map includes Korea) Paul Avery

  34. Grid2003 Participants (diagram: US-CMS, US-ATLAS, iVDGL, GriPhyN, PPDG, DOE Labs, Biology, Korea) Paul Avery

  35. Grid2003 Applications • High energy physics • US-ATLAS analysis (DIAL) • US-ATLAS GEANT3 simulation (GCE) • US-CMS GEANT4 simulation (MOP) • BTeV simulation • Gravity waves • LIGO: blind search for continuous sources • Digital astronomy • SDSS: cluster finding (maxBcg) • Bioinformatics • Bio-molecular analysis (SnB) • Genome analysis (GADU/Gnare) • CS demonstrators • Job Exerciser, GridFTP, NetLogger-grid2003 Paul Avery

  36. Grid2003 Site Monitoring Paul Avery

  37. Grid2003: Three Months Usage Paul Avery

  38. Grid2003 Success • Much larger than originally planned • More sites (28), CPUs (2800), simultaneous jobs (1300) • More applications (10) in more diverse areas • Able to accommodate new institutions & applications • U Buffalo (Biology) Nov. 2003 • Rice U. (CMS) Feb. 2004 • Continuous operation since October 2003 • Strong operations team (iGOC at Indiana) • US-CMS using it for production simulations (next slide) Paul Avery

  39. Production Simulations on Grid2003 • US-CMS Monte Carlo simulation • Used 1.5 × the US-CMS resources (chart: USCMS vs. non-USCMS usage) Paul Avery

  40. Grid2003: A Necessary Step • Learning how to operate a Grid • Add sites, recover from errors, provide info, update, test, etc. • Need tools, services, procedures, documentation, organization • Need reliable, intelligent, skilled people • Learning how to cope with “large” scale • “Interesting” failure modes as scale increases • Increasing scale must not overwhelm human resources • Learning how to delegate responsibilities • Multiple levels: Project, Virtual Org., service, site, application • Essential for future growth • Grid2003 experience critical for building “useful” Grids • Frank discussion in “Grid2003 Project Lessons” doc Paul Avery

  41. Grid2003 Lessons (1): Investment • Building momentum • PPDG 1999 • GriPhyN 2000 • iVDGL 2001 • Time for projects to ramp up • Building collaborations • HEP: ATLAS, CMS, Run 2, RHIC, Jlab • Non-HEP: Computer science, LIGO, SDSS • Time for collaboration sociology to kick in • Building testbeds • Build expertise, debug Grid software, develop Grid tools & services • US-CMS 2002 – 2004+ • US-ATLAS 2002 – 2004+ • WorldGrid 2002 (Dec.) Paul Avery

  42. Grid2003 Lessons (2): Deployment • Building something useful draws people in • (Similar to a large HEP detector) • Cooperation, willingness to invest time, striving for excellence! • Grid development requires significant deployments • Required to learn what works, what fails, what’s clumsy, … • Painful, but pays for itself • Deployment provides powerful training mechanism Paul Avery

  43. Grid2003 Lessons (3): Packaging • Installation and configuration (VDT + Pacman) • Simplifies installation, configuration of Grid tools + applications • Major advances over 13 VDT releases • A strategic issue and critical to us • Provides uniformity + automation • Lowers barriers to participation → scaling • Expect great improvements (Pacman 3) • Automation: the next frontier • Reduce FTE overhead, communication traffic • Automate installation, configuration, testing, validation, updates • Remote installation, etc. Paul Avery

  44. Grid2003 and Beyond • Further evolution of Grid3: (Grid3+, etc.) • Contribute resources to persistent Grid • Maintain development Grid, test new software releases • Integrate software into the persistent Grid • Participate in LHC data challenges • Involvement of new sites • New institutions and experiments • New international partners (Brazil, Taiwan, Pakistan?, …) • Improvements in Grid middleware and services • Integrating multiple VOs • Monitoring • Troubleshooting • Accounting • … Paul Avery

  45. U.S. Open Science Grid • Goal: Build an integrated Grid infrastructure • Support US-LHC research program, other scientific efforts • Resources from laboratories and universities • Federate with LHC Computing Grid • Getting there: OSG-1 (Grid3+), OSG-2, … • Series of releases → increasing functionality & scale • Constant use of facilities for LHC production computing • Jan. 12 meeting in Chicago • Public discussion, planning sessions • Next steps • Creating interim Steering Committee (now) • White paper to be expanded into roadmap • Presentation to funding agencies (April/May?) Paul Avery

  46. Inputs to Open Science Grid (diagram: technologists, Trillium, US-LHC, university facilities, education community, multi-disciplinary facilities, laboratory centers, computer science, and other science applications feeding into the Open Science Grid) Paul Avery

  47. Grids: Enhancing Research & Learning • Fundamentally alters conduct of scientific research • "Lab-centric": Activities center around large facility • "Team-centric": Resources shared by distributed teams • "Knowledge-centric": Knowledge generated/used by a community • Strengthens role of universities in research • Couples universities to data intensive science • Couples universities to national & international labs • Brings front-line research and resources to students • Exploits intellectual resources of formerly isolated schools • Opens new opportunities for minority and women researchers • Builds partnerships to drive advances in IT/science/eng • HEP ↔ physics, astronomy, biology, CS, etc. • "Application" sciences ↔ computer science • Universities ↔ laboratories • Scientists ↔ students • Research community ↔ IT industry Paul Avery

  48. Grids and HEP’s Broad Impact • Grids enable 21st century collaborative science • Linking research communities and resources for scientific discovery • Needed by LHC global collaborations pursuing “petascale” science • HEP Grid projects driving important network developments • Recent US – Europe “land speed records” • ICFA-SCIC, I-HEPCCC, US-CERN link, ESNET, Internet2 • HEP Grids impact education and outreach • Providing technologies, resources for training, education, outreach • HEP has recognized leadership in Grid development • Many national and international initiatives • Partnerships with other disciplines • Influencing funding agencies (NSF, DOE, EU, S.A., Asia) • HEP Grids will provide scalable computing infrastructures for scientific research Paul Avery

  49. Grid References • Grid2003: www.ivdgl.org/grid2003 • Globus: www.globus.org • PPDG: www.ppdg.net • GriPhyN: www.griphyn.org • iVDGL: www.ivdgl.org • LCG: www.cern.ch/lcg • EU DataGrid: www.eu-datagrid.org • EGEE: egee-ei.web.cern.ch • Grid book (2nd edition): www.mkp.com/grid2 Paul Avery

  50. Extra Slides Paul Avery
