Grids for 21st Century Data Intensive Science Paul Avery University of Florida http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu University of Michigan, May 8, 2003 Paul Avery
Grids and Science Paul Avery
The Grid Concept • Grid: Geographically distributed computing resources configured for coordinated use • Fabric: Physical resources & networks provide raw capability • Middleware: Software ties it all together (tools, services, etc.) • Goal: Transparent resource sharing Paul Avery
Fundamental Idea: Resource Sharing • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage, computing, people, institutions • Communities require access to common services • Research collaborations (physics, astronomy, engineering, …) • Government agencies, health care organizations, corporations, … • “Virtual Organizations” • Create a “VO” from geographically separated components • Make all community resources available to any VO member • Leverage strengths at different institutions • Grids require a foundation of strong networking • Communication tools, visualization • High-speed data transmission, instrument operation Paul Avery
Some (Realistic) Grid Examples • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Fusion power (ITER, etc.) • Physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data • Astronomy • An international team remotely operates a telescope in real time • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour Paul Avery
Grids: Enhancing Research & Learning • Fundamentally alters conduct of scientific research • Central model: People, resources flow inward to labs • Distributed model: Knowledge flows between distributed teams • Strengthens universities • Couples universities to data intensive science • Couples universities to national & international labs • Brings front-line research and resources to students • Exploits intellectual resources of formerly isolated schools • Opens new opportunities for minority and women researchers • Builds partnerships to drive advances in IT/science/eng • “Application” sciences ↔ Computer Science • Physics ↔ Astronomy, biology, etc. • Universities ↔ Laboratories • Scientists ↔ Students • Research Community ↔ IT industry Paul Avery
Grid Challenges • Operate a fundamentally complex entity • Geographically distributed resources • Each resource under different administrative control • Many failure modes • Manage workflow across Grid • Balance policy vs. instantaneous capability to complete tasks • Balance effective resource use vs. fast turnaround for priority jobs • Match resource usage to policy over the long term • Goal-oriented algorithms: steering requests according to metrics • Maintain a global view of resources and system state • Coherent end-to-end system monitoring • Adaptive learning for execution optimization • Build high level services & integrated user environment Paul Avery
Data Grids Paul Avery
Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by data collection • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Internationally distributed collaborations • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? How to collect, manage, access and interpret this quantity of data? Drives demand for “Data Grids” to handle the additional dimension of data access & movement Paul Avery
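A rough cross-check of these projections (a minimal sketch; the growth factor and doubling time follow directly from the figures quoted on this slide):

```python
# Rough doubling-time check for the data-growth projections above.
import math

projections_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}  # Petabytes

start_year, end_year = 2000, 2015
growth = projections_pb[end_year] / projections_pb[start_year]   # ~2000x
years = end_year - start_year
doubling_time = years / math.log2(growth)                         # ~1.4 years

print(f"Overall growth 2000-2015: ~{growth:.0f}x")
print(f"Implied doubling time: ~{doubling_time:.1f} years")
```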
Data Intensive Physical Sciences • High energy & nuclear physics • Including new experiments at CERN’s Large Hadron Collider • Astronomy • Digital sky surveys: SDSS, VISTA, other Gigapixel arrays • VLBI arrays: multiple-Gbps data streams • “Virtual” Observatories (multi-wavelength astronomy) • Gravity wave searches • LIGO, GEO, VIRGO, TAMA • Time-dependent 3-D systems (simulation & data) • Earth Observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Dispersal of pollutants in atmosphere Paul Avery
Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Radiation Oncology (real-time display of 3-D images) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (1-10m, time dependent) Paul Avery
Driven by LHC Computing Challenges • Complexity: Millions of individual detector channels • Scale: PetaOps (CPU), Petabytes (Data) • Distribution: Global distribution of people & resources (1800 physicists, 150 institutes, 32 countries) Paul Avery
CMS Experiment at LHC • “Compact” Muon Solenoid at the LHC (CERN) • (Figure: detector shown to scale against a “Smithsonian standard man”) Paul Avery
LHC Data Rates: Detector to Storage • Detector output: 40 MHz, ~1000 TB/sec (physics filtering) • Level 1 Trigger (special hardware): 75 kHz, 75 GB/sec • Level 2 Trigger (commodity CPUs): 5 kHz, 5 GB/sec • Level 3 Trigger (commodity CPUs): 100 Hz, 100–1500 MB/sec raw data to storage Paul Avery
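The rejection factors implied by this trigger chain can be tallied directly from the rates on the slide (an illustrative sketch using only the numbers quoted above):

```python
# Event-rate reduction through the LHC trigger chain (rates from the slide).
stages = [
    ("Detector output",            40_000_000.0),  # 40 MHz
    ("Level 1 (special hardware)",     75_000.0),  # 75 kHz
    ("Level 2 (commodity CPUs)",        5_000.0),  # 5 kHz
    ("Level 3 (commodity CPUs)",          100.0),  # 100 Hz to storage
]

for (name_in, rate_in), (name_out, rate_out) in zip(stages, stages[1:]):
    print(f"{name_in} -> {name_out}: rejection ~{rate_in / rate_out:,.0f}x")

total = stages[0][1] / stages[-1][1]
print(f"Overall: only ~1 event in {total:,.0f} reaches storage")
```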
LHC: Higgs Decay into 4 Muons • (Event display: all charged tracks with pt > 2 GeV; reconstructed tracks with pt > 25 GeV; +30 minimum bias events) • 10^9 events/sec, selectivity: 1 in 10^13 Paul Avery
Hierarchy of LHC Data Grid Resources (Tier0 : Tier1 : Tier2 ~ 1:1:1) • CMS Experiment Online System → Tier 0 (CERN Computer Center, > 20 TIPS) at 100-1500 MBytes/s • Tier 0 → Tier 1 national centers (Korea, Russia, USA, UK, …) at 10-40 Gbps • Tier 1 → Tier 2 centers at 2.5-10 Gbps • Tier 2 → Tier 3 (institutes, physics cache) at 1-2.5 Gbps • Tier 3 → Tier 4 (PCs) at 1-10 Gbps • ~10s of Petabytes by 2007-8, ~1000 Petabytes in 5-7 years Paul Avery
Digital Astronomy • Future dominated by detector improvements • Moore’s Law growth in CCDs • Gigapixel arrays on horizon • Growth in CPU/storage tracking data volumes • (Plot: total area of 3m+ telescopes in the world in m², total number of CCD pixels in Mpixels) • 25 year growth: 30x in glass, 3000x in pixels Paul Avery
The Age of Astronomical Mega-Surveys • Next generation mega-surveys will change astronomy • Large sky coverage • Sound statistical plans, uniform systematics • The technology to store and access the data is here • Following Moore’s law • Integrating these archives for the whole community • Astronomical data mining will lead to stunning new discoveries • “Virtual Observatory” (next slides) Paul Avery
Virtual Observatories • Multi-wavelength astronomy, multiple surveys • Image data standards • Source catalogs • Specialized data: spectroscopy, time series, polarization • Information archives: derived & legacy data (NED, Simbad, ADS, etc.) • Discovery tools: visualization, statistics Paul Avery
Virtual Observatory Data Challenge • Digital representation of the sky • All-sky + deep fields • Integrated catalog and image databases • Spectra of selected samples • Size of the archived data • 40,000 square degrees • Resolution < 0.1 arcsec → > 50 trillion pixels • One band (2 bytes/pixel) → 100 Terabytes • Multi-wavelength: 500-1000 Terabytes • Time dimension: Many Petabytes • Large, globally distributed database engines • Multi-Petabyte data size, distributed widely • Thousands of queries per day, Gbyte/s I/O speed per site • Data Grid computing infrastructure Paul Avery
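The pixel count and single-band storage estimate follow from the survey parameters above; a minimal sketch of the arithmetic:

```python
# All-sky pixel count and one-band storage (parameters from the slide).
sky_sq_deg = 40_000                 # square degrees of sky coverage
arcsec_per_deg = 3600
resolution_arcsec = 0.1             # pixel scale
bytes_per_pixel = 2                 # one band

pixels_per_sq_deg = (arcsec_per_deg / resolution_arcsec) ** 2
total_pixels = sky_sq_deg * pixels_per_sq_deg        # ~5.2e13, i.e. >50 trillion
one_band_tb = total_pixels * bytes_per_pixel / 1e12  # ~100 Terabytes

print(f"Total pixels: {total_pixels:.2e}")
print(f"One band: ~{one_band_tb:.0f} TB")
```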
Sloan Sky Survey Data Grid Paul Avery
International Grid/Networking Projects: US, EU, E. Europe, Asia, S. America, … Paul Avery
Global Context: Data Grid Projects • U.S. Projects • Particle Physics Data Grid (PPDG) DOE • GriPhyN NSF • International Virtual Data Grid Laboratory (iVDGL) NSF • TeraGrid NSF • DOE Science Grid DOE • NSF Middleware Initiative (NMI) NSF • EU, Asia major projects • European Data Grid (EU, EC) • LHC Computing Grid (LCG) (CERN) • EU national Projects (UK, Italy, France, …) • CrossGrid (EU, EC) • DataTAG (EU, EC) • Japanese Project • Korea project Paul Avery
Particle Physics Data Grid • Funded 2001 – 2004 @ US$9.5M (DOE) • Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS Paul Avery
PPDG Goals • Serve high energy & nuclear physics (HENP) experiments • Unique challenges, diverse test environments • Develop advanced Grid technologies • Focus on end to end integration • Maintain practical orientation • Networks, instrumentation, monitoring • DB file/object replication, caching, catalogs, end-to-end movement • Make tools general enough for wide community • Collaboration with GriPhyN, iVDGL, EDG, LCG • ESNet Certificate Authority work, security Paul Avery
GriPhyN and iVDGL • Both funded through NSF ITR program • GriPhyN: $11.9M (NSF) + $1.6M (matching) (2000 – 2005) • iVDGL: $13.7M (NSF) + $2M (matching) (2001 – 2006) • Basic composition • GriPhyN: 12 funded universities, SDSC, 3 labs (~80 people) • iVDGL: 16 funded institutions, SDSC, 3 labs (~80 people) • Expts: US-CMS, US-ATLAS, LIGO, SDSS/NVO • Large overlap of people, institutions, management • Grid research vs Grid deployment • GriPhyN: CS research, Virtual Data Toolkit (VDT) development • iVDGL: Grid laboratory deployment • 4 physics experiments provide frontier challenges • VDT in common Paul Avery
GriPhyN Computer Science Challenges • Virtual data (more later) • Data + programs (content) + programs (executions) • Representation, discovery, & manipulation of workflows and associated data & programs • Planning • Mapping workflows in an efficient, policy-aware manner to distributed resources • Execution • Executing workflows, incl. data movements, reliably and efficiently • Performance • Monitoring system performance for scheduling & troubleshooting Paul Avery
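To make the "planning" challenge concrete, the sketch below shows a deliberately simplified policy-aware mapping of tasks to sites. The site names, metrics, and scoring rule are hypothetical illustrations, not the GriPhyN planner:

```python
# Toy policy-aware planner: assign each task to an allowed site, preferring
# sites that already hold the input data, then sites with more free CPUs.
# All names and numbers here are invented for illustration.

sites = {
    "siteA": {"free_cpus": 120, "allowed_vos": {"cms"},          "cached_files": {"f1", "f2"}},
    "siteB": {"free_cpus": 400, "allowed_vos": {"cms", "atlas"}, "cached_files": set()},
}

tasks = [
    {"name": "reco-1", "vo": "cms", "inputs": {"f1", "f2"}},
    {"name": "reco-2", "vo": "cms", "inputs": {"f3"}},
]

def score(task, site):
    """Higher is better: data locality dominates, spare CPUs break ties."""
    locality = len(task["inputs"] & site["cached_files"])
    return locality * 1000 + site["free_cpus"]

for task in tasks:
    eligible = {n: s for n, s in sites.items() if task["vo"] in s["allowed_vos"]}
    best = max(eligible, key=lambda n: score(task, eligible[n]))
    print(f"{task['name']} -> {best}")
```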
Goal: “PetaScale” Virtual-Data Grids • Users: production teams, workgroups, single researchers (interactive user tools) • Tools: request execution & management, request planning & scheduling, virtual data • Services: resource management, security and policy, other Grid services • Scale: PetaOps, Petabytes, Performance • Fabric: distributed resources (code, storage, CPUs, networks), raw data sources, transforms Paul Avery
GriPhyN/iVDGL Science Drivers • US-CMS & US-ATLAS • HEP experiments at LHC/CERN • 100s of Petabytes • LIGO • Gravity wave experiment • 100s of Terabytes • Sloan Digital Sky Survey • Digital astronomy (1/4 sky) • 10s of Terabytes • Common needs: massive CPU, large distributed datasets, large distributed communities • (Figure: community growth and data growth, 2001-2007) Paul Avery
Virtual Data: Derivation and Provenance • Most scientific data are not simple “measurements” • They are computationally corrected/reconstructed • They can be produced by numerical simulation • Science & eng. projects are more CPU and data intensive • Programs are significant community resources (transformations) • So are the executions of those programs (derivations) • Management of dataset transformations important! • Derivation: Instantiation of a potential data product • Provenance: Exact history of any existing data product We already do this, but manually! Paul Avery
Virtual Data Motivations (1) • “I’ve found some interesting data, but I need to know exactly what corrections were applied before I can trust it.” • “I’ve detected a muon calibration error and want to know which derived data products need to be recomputed.” • “I want to search a database for 3 muon SUSY events. If a program that does this analysis exists, I won’t have to write one from scratch.” • “I want to apply a forward jet analysis to 100M events. If the results already exist, I’ll save weeks of computation.” • (Diagram: Data is consumed-by/generated-by a Derivation; Data is the product-of a Derivation; a Derivation is an execution-of a Transformation) Paul Avery
Virtual Data Motivations (2) • Data track-ability and result audit-ability • Universally sought by scientific applications • Facilitates tool and data sharing and collaboration • Data can be sent along with its recipe • Repair and correction of data • Rebuild data products (cf. “make”) • Workflow management • Organizing, locating, specifying, and requesting data products • Performance optimizations • Ability to re-create data rather than move it (see the sketch below) • (Theme: manual/error prone → automated/robust) Paul Avery
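A hedged sketch of the last point: deciding whether to re-create a derived product from its recipe or transfer an existing copy. The cost model, numbers, and function name are invented for illustration only:

```python
# Toy "re-create vs. transfer" decision for a virtual data product.
# All parameters and defaults below are illustrative, not from any real catalog.

def plan_access(size_gb, cpu_hours_to_derive, wan_mb_per_s=50, cpu_slots_free=20):
    """Return the cheaper option and its rough wall-clock cost in hours."""
    transfer_hours = size_gb * 1024 / wan_mb_per_s / 3600
    rederive_hours = cpu_hours_to_derive / max(cpu_slots_free, 1)
    if transfer_hours < rederive_hours:
        return ("transfer", transfer_hours)
    return ("re-derive", rederive_hours)

print(plan_access(size_gb=500, cpu_hours_to_derive=40))    # cheap to recompute: re-derive
print(plan_access(size_gb=5,   cpu_hours_to_derive=2000))  # cheap to move: transfer
```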
“Chimera” Virtual Data System • Virtual Data API • A Java class hierarchy to represent transformations & derivations • Virtual Data Language • Textual for people & illustrative examples • XML for machine-to-machine interfaces • Virtual Data Database • Makes the objects of a virtual data definition persistent • Virtual Data Service (future) • Provides a service interface (e.g., OGSA) to persistent objects • Version 1.0 available • To be put into VDT 1.1.7 Paul Avery
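The real Chimera API is a Java class hierarchy and its Virtual Data Language has textual and XML forms; the Python sketch below only illustrates the core idea of separating a transformation (the registered program) from its derivations (recorded executions). Names and fields are invented for this sketch:

```python
# Illustrative only: representing transformations, derivations, and a simple
# provenance query. Not the Chimera API or VDL syntax.
from dataclasses import dataclass, field

@dataclass
class Transformation:          # a registered program
    name: str
    executable: str
    args_template: list

@dataclass
class Derivation:              # one (potential or actual) execution of it
    transformation: Transformation
    inputs: dict
    outputs: dict
    parameters: dict = field(default_factory=dict)

calib = Transformation("muon_calib", "/bin/muon_calib", ["-i", "{raw}", "-o", "{calibrated}"])
run42 = Derivation(calib,
                   inputs={"raw": "run42.raw"},
                   outputs={"calibrated": "run42.calib"},
                   parameters={"version": "v3"})

# Provenance query: which derivations produced a given file?
catalog = [run42]
produced_by = [d for d in catalog if "run42.calib" in d.outputs.values()]
print([d.transformation.name for d in produced_by])
```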
Chimera Application: SDSS Analysis • Galaxy cluster data → cluster size distribution • Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs) Paul Avery
Virtual Data and LHC Computing • US-CMS • Chimera prototype tested with CMS MC (~200K events) • Currently integrating Chimera into standard CMS production tools • Integrating virtual data into Grid-enabled analysis tools • US-ATLAS • Integrating Chimera into ATLAS software • HEPCAL document includes first virtual data use cases • Very basic cases, need elaboration • Discuss with LHC expts: requirements, scope, technologies • New ITR proposal to NSF ITR program ($15M) • Dynamic Workspaces for Scientific Analysis Communities • Continued progress requires collaboration with CS groups • Distributed scheduling, workflow optimization, … • Need collaboration with CS to develop robust tools Paul Avery
iVDGL Goals and Context • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, E. Europe, Asia, S. America, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A laboratory for other disciplines to perform Data Grid tests • A focus of outreach efforts to small institutions • Context of iVDGL in US-LHC computing program • Develop and operate proto-Tier2 centers • Learn how to do Grid operations (GOC) • International participation • DataTag • UK e-Science programme: support 6 CS Fellows per year in U.S. Paul Avery
US-iVDGL Sites (Spring 2003) • Tier1/Tier2/Tier3 sites: SKC, Boston U, Wisconsin, Michigan, PSU, BNL, Fermilab, LBL, Argonne, J. Hopkins, NCSA, Indiana, Hampton, Caltech, Oklahoma, Vanderbilt, UCSD/SDSC, FSU, Arlington, UF, FIU, Brownsville • Partners? EU, CERN, Brazil, Australia, Korea, Japan Paul Avery
US-CMS Grid Testbed • Sites: Korea, Wisconsin, CERN, Fermilab, Caltech, UCSD, FSU, Florida, FIU, Brazil Paul Avery
US-CMS Testbed Success Story • Production Run for Monte Carlo data production • Assigned 1.5 million events for “eGamma Bigjets” • ~500 sec per event on 750 MHz processor; all production stages from simulation to ntuple • 2 months continuous running across 5 testbed sites • Demonstrated at Supercomputing 2002 Paul Avery
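A back-of-the-envelope consistency check on this production run, using only the numbers quoted above:

```python
# Rough CPU budget for the 1.5M-event "eGamma Bigjets" production run.
events = 1_500_000
sec_per_event = 500                      # on a 750 MHz processor
cpu_seconds = events * sec_per_event     # 7.5e8 s, roughly 24 CPU-years

run_days = 60                            # "2 months continuous running"
wall_seconds = run_days * 86_400
avg_cpus_needed = cpu_seconds / wall_seconds

print(f"Total CPU time: ~{cpu_seconds / 3.15e7:.0f} CPU-years")
print(f"Average concurrent 750 MHz CPUs over {run_days} days: ~{avg_cpus_needed:.0f}")
```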
Creation of WorldGrid • Joint iVDGL/DataTag/EDG effort • Resources from both sides (15 sites) • Monitoring tools (Ganglia, MDS, NetSaint, …) • Visualization tools (Nagios, MapCenter, Ganglia) • Applications: ScienceGrid • CMS: CMKIN, CMSIM • ATLAS: ATLSIM • Submit jobs from US or EU • Jobs can run on any cluster • Demonstrated at IST2002 (Copenhagen) • Demonstrated at SC2002 (Baltimore) Paul Avery
WorldGrid Sites Paul Avery
Grid Coordination Paul Avery
U.S. Project Coordination: Trillium • Trillium = GriPhyN + iVDGL + PPDG • Large overlap in leadership, people, experiments • Driven primarily by HENP, particularly LHC experiments • Benefit of coordination • Common software base + packaging: VDT + PACMAN • Collaborative / joint projects: monitoring, demos, security, … • Wide deployment of new technologies, e.g. Virtual Data • Stronger, broader outreach effort • Forum for US Grid projects • Joint view, strategies, meetings and work • Unified entity to deal with EU & other Grid projects Paul Avery
International Grid Coordination • Global Grid Forum (GGF) • International forum for general Grid efforts • Many working groups, standards definitions • Close collaboration with EU DataGrid (EDG) • Many connections with EDG activities • HICB: HEP Inter-Grid Coordination Board • Non-competitive forum, strategic issues, consensus • Cross-project policies, procedures and technology, joint projects • HICB-JTB Joint Technical Board • Definition, oversight and tracking of joint projects • GLUE interoperability group • Participation in LHC Computing Grid (LCG) • Software Computing Committee (SC2) • Project Execution Board (PEB) • Grid Deployment Board (GDB) Paul Avery
HEP and International Grid Projects • HEP continues to be the strongest science driver • (In collaboration with computer scientists) • Many national and international initiatives • LHC a particularly strong driving function • US-HEP committed to working with international partners • Many networking initiatives with EU colleagues • Collaboration on LHC Grid Project • Grid projects driving & linked to network developments • DataTag, SCIC, US-CERN link, Internet2 • New partners being actively sought • Korea, Russia, China, Japan, Brazil, Romania, … • Participate in US-CMS and US-ATLAS Grid testbeds • Link to WorldGrid, once some software is fixed Paul Avery
New Grid Efforts Paul Avery
CHEPREO: An Inter-Regional Center for High Energy Physics Research and Educational Outreach at Florida International University • Components: E/O Center in Miami area, iVDGL Grid activities, CMS research, AMPATH network (S. America), int’l activities (Brazil, etc.) • Status: proposal submitted Dec. 2002; presented to NSF review panel; Project Execution Plan submitted; funding in June?
A Global Grid Enabled Collaboratory for Scientific Research (GECSR) • $4M ITR proposal from Caltech (HN PI, JB CoPI), Michigan (CoPI, CoPI), Maryland (CoPI), plus senior personnel from Lawrence Berkeley Lab, Oklahoma, Fermilab, Arlington (U. Texas), Iowa, Florida State • First Grid-enabled collaboratory • Tight integration between the science of collaboratories, a globally scalable work environment, sophisticated collaborative tools (VRVS, VNC; next-gen), and an agent-based monitoring & decision support system (MonALISA) • Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors • “Giving scientists from all world regions the means to function as full partners in the process of search and discovery” Paul Avery
Large ITR Proposal: $15M • Dynamic Workspaces: Enabling Global Analysis Communities Paul Avery
UltraLight Proposal to NSF • 10 Gb/s+ network: Caltech, UF, FIU, UM, MIT; SLAC, FNAL; int’l partners; Cisco • Applications: HEP, VLBI, radiation oncology, Grid projects Paul Avery