Grids for 21st Century Data Intensive Science Paul Avery University of Florida http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu University of Michigan, May 8, 2003 Paul Avery
Grids and Science Paul Avery
The Grid Concept • Grid: Geographically distributed computing resources configured for coordinated use • Fabric: Physical resources & networks provide raw capability • Middleware: Software ties it all together (tools, services, etc.) • Goal: Transparent resource sharing Paul Avery
Fundamental Idea: Resource Sharing • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage, computing, people, institutions • Communities require access to common services • Research collaborations (physics, astronomy, engineering, …) • Government agencies, health care organizations, corporations, … • “Virtual Organizations” • Create a “VO” from geographically separated components • Make all community resources available to any VO member • Leverage strengths at different institutions • Grids require a foundation of strong networking • Communication tools, visualization • High-speed data transmission, instrument operation Paul Avery
Some (Realistic) Grid Examples • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Fusion power (ITER, etc.) • Physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data • Astronomy • An international team remotely operates a telescope in real time • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour Paul Avery
Grids: Enhancing Research & Learning • Fundamentally alters conduct of scientific research • Central model: People, resources flow inward to labs • Distributed model: Knowledge flows between distributed teams • Strengthens universities • Couples universities to data intensive science • Couples universities to national & international labs • Brings front-line research and resources to students • Exploits intellectual resources of formerly isolated schools • Opens new opportunities for minority and women researchers • Builds partnerships to drive advances in IT/science/eng • “Application” sciences ↔ Computer Science • Physics ↔ Astronomy, biology, etc. • Universities ↔ Laboratories • Scientists ↔ Students • Research Community ↔ IT industry Paul Avery
Grid Challenges • Operate a fundamentally complex entity • Geographically distributed resources • Each resource under different administrative control • Many failure modes • Manage workflow across Grid • Balance policy vs. instantaneous capability to complete tasks • Balance effective resource use vs. fast turnaround for priority jobs • Match resource usage to policy over the long term • Goal-oriented algorithms: steering requests according to metrics • Maintain a global view of resources and system state • Coherent end-to-end system monitoring • Adaptive learning for execution optimization • Build high level services & integrated user environment Paul Avery
Data Grids Paul Avery
Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by data collection • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Internationally distributed collaborations • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? How to collect, manage, access and interpret this quantity of data? Drives demand for “Data Grids” to handle the additional dimension of data access & movement Paul Avery
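A rough cross-check of these projections (a minimal sketch; the growth factor and doubling time follow directly from the figures quoted on this slide):

```python
# Rough doubling-time check for the data-growth projections above.
import math

projections_pb = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}  # Petabytes

start_year, end_year = 2000, 2015
growth = projections_pb[end_year] / projections_pb[start_year]   # ~2000x
years = end_year - start_year
doubling_time = years / math.log2(growth)                         # ~1.4 years

print(f"Overall growth 2000-2015: ~{growth:.0f}x")
print(f"Implied doubling time: ~{doubling_time:.1f} years")
```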
Data Intensive Physical Sciences • High energy & nuclear physics • Including new experiments at CERN’s Large Hadron Collider • Astronomy • Digital sky surveys: SDSS, VISTA, other Gigapixel arrays • VLBI arrays: multiple-Gbps data streams • “Virtual” Observatories (multi-wavelength astronomy) • Gravity wave searches • LIGO, GEO, VIRGO, TAMA • Time-dependent 3-D systems (simulation & data) • Earth Observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Dispersal of pollutants in atmosphere Paul Avery
Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Radiation Oncology (real-time display of 3-D images) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (1-10m, time dependent) Paul Avery
Driven by LHC Computing Challenges • Complexity: Millions of individual detector channels • Scale: PetaOps (CPU), Petabytes (Data) • Distribution: Global distribution of people & resources (1800 physicists, 150 institutes, 32 countries) Paul Avery
CMS Experiment at LHC • “Compact” Muon Solenoid at the LHC (CERN) • (Figure: detector shown to scale against a “Smithsonian standard man”) Paul Avery
LHC Data Rates: Detector to Storage • Detector output: 40 MHz, ~1000 TB/sec (physics filtering) • Level 1 Trigger (special hardware): 75 kHz, 75 GB/sec • Level 2 Trigger (commodity CPUs): 5 kHz, 5 GB/sec • Level 3 Trigger (commodity CPUs): 100 Hz, 100–1500 MB/sec raw data to storage Paul Avery
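The rejection factors implied by this trigger chain can be tallied directly from the rates on the slide (an illustrative sketch using only the numbers quoted above):

```python
# Event-rate reduction through the LHC trigger chain (rates from the slide).
stages = [
    ("Detector output",            40_000_000.0),  # 40 MHz
    ("Level 1 (special hardware)",     75_000.0),  # 75 kHz
    ("Level 2 (commodity CPUs)",        5_000.0),  # 5 kHz
    ("Level 3 (commodity CPUs)",          100.0),  # 100 Hz to storage
]

for (name_in, rate_in), (name_out, rate_out) in zip(stages, stages[1:]):
    print(f"{name_in} -> {name_out}: rejection ~{rate_in / rate_out:,.0f}x")

total = stages[0][1] / stages[-1][1]
print(f"Overall: only ~1 event in {total:,.0f} reaches storage")
```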
LHC: Higgs Decay into 4 Muons • (Event display: all charged tracks with pt > 2 GeV; reconstructed tracks with pt > 25 GeV; +30 minimum bias events) • 10^9 events/sec, selectivity: 1 in 10^13 Paul Avery
Hierarchy of LHC Data Grid Resources (Tier0 : Tier1 : Tier2 ~ 1:1:1) • CMS Experiment Online System → Tier 0 (CERN Computer Center, > 20 TIPS) at 100-1500 MBytes/s • Tier 0 → Tier 1 national centers (Korea, Russia, USA, UK, …) at 10-40 Gbps • Tier 1 → Tier 2 centers at 2.5-10 Gbps • Tier 2 → Tier 3 (institutes, physics cache) at 1-2.5 Gbps • Tier 3 → Tier 4 (PCs) at 1-10 Gbps • ~10s of Petabytes by 2007-8, ~1000 Petabytes in 5-7 years Paul Avery
Digital Astronomy • Future dominated by detector improvements • Moore’s Law growth in CCDs • Gigapixel arrays on horizon • Growth in CPU/storage tracking data volumes • (Plot: total area of 3m+ telescopes in the world in m², total number of CCD pixels in Mpixels) • 25 year growth: 30x in glass, 3000x in pixels Paul Avery
The Age of Astronomical Mega-Surveys • Next generation mega-surveys will change astronomy • Large sky coverage • Sound statistical plans, uniform systematics • The technology to store and access the data is here • Following Moore’s law • Integrating these archives for the whole community • Astronomical data mining will lead to stunning new discoveries • “Virtual Observatory” (next slides) Paul Avery
Virtual Observatories • Multi-wavelength astronomy, multiple surveys • Image data standards • Source catalogs • Specialized data: spectroscopy, time series, polarization • Information archives: derived & legacy data (NED, Simbad, ADS, etc.) • Discovery tools: visualization, statistics Paul Avery
Virtual Observatory Data Challenge • Digital representation of the sky • All-sky + deep fields • Integrated catalog and image databases • Spectra of selected samples • Size of the archived data • 40,000 square degrees • Resolution < 0.1 arcsec → > 50 trillion pixels • One band (2 bytes/pixel) → 100 Terabytes • Multi-wavelength: 500-1000 Terabytes • Time dimension: Many Petabytes • Large, globally distributed database engines • Multi-Petabyte data size, distributed widely • Thousands of queries per day, Gbyte/s I/O speed per site • Data Grid computing infrastructure Paul Avery
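The pixel count and single-band storage estimate follow from the survey parameters above; a minimal sketch of the arithmetic:

```python
# All-sky pixel count and one-band storage (parameters from the slide).
sky_sq_deg = 40_000                 # square degrees of sky coverage
arcsec_per_deg = 3600
resolution_arcsec = 0.1             # pixel scale
bytes_per_pixel = 2                 # one band

pixels_per_sq_deg = (arcsec_per_deg / resolution_arcsec) ** 2
total_pixels = sky_sq_deg * pixels_per_sq_deg        # ~5.2e13, i.e. >50 trillion
one_band_tb = total_pixels * bytes_per_pixel / 1e12  # ~100 Terabytes

print(f"Total pixels: {total_pixels:.2e}")
print(f"One band: ~{one_band_tb:.0f} TB")
```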
Sloan Sky Survey Data Grid Paul Avery
International Grid/Networking Projects: US, EU, E. Europe, Asia, S. America, … Paul Avery
Global Context: Data Grid Projects • U.S. Projects • Particle Physics Data Grid (PPDG) DOE • GriPhyN NSF • International Virtual Data Grid Laboratory (iVDGL) NSF • TeraGrid NSF • DOE Science Grid DOE • NSF Middleware Initiative (NMI) NSF • EU, Asia major projects • European Data Grid (EU, EC) • LHC Computing Grid (LCG) (CERN) • EU national Projects (UK, Italy, France, …) • CrossGrid (EU, EC) • DataTAG (EU, EC) • Japanese Project • Korea project Paul Avery
Particle Physics Data Grid • Funded 2001 – 2004 @ US$9.5M (DOE) • Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS Paul Avery
PPDG Goals • Serve high energy & nuclear physics (HENP) experiments • Unique challenges, diverse test environments • Develop advanced Grid technologies • Focus on end to end integration • Maintain practical orientation • Networks, instrumentation, monitoring • DB file/object replication, caching, catalogs, end-to-end movement • Make tools general enough for wide community • Collaboration with GriPhyN, iVDGL, EDG, LCG • ESNet Certificate Authority work, security Paul Avery
GriPhyN and iVDGL • Both funded through NSF ITR program • GriPhyN: $11.9M (NSF) + $1.6M (matching) (2000 – 2005) • iVDGL: $13.7M (NSF) + $2M (matching) (2001 – 2006) • Basic composition • GriPhyN: 12 funded universities, SDSC, 3 labs (~80 people) • iVDGL: 16 funded institutions, SDSC, 3 labs (~80 people) • Expts: US-CMS, US-ATLAS, LIGO, SDSS/NVO • Large overlap of people, institutions, management • Grid research vs Grid deployment • GriPhyN: CS research, Virtual Data Toolkit (VDT) development • iVDGL: Grid laboratory deployment • 4 physics experiments provide frontier challenges • VDT in common Paul Avery
GriPhyN Computer Science Challenges • Virtual data (more later) • Data + programs (content) + programs (executions) • Representation, discovery, & manipulation of workflows and associated data & programs • Planning • Mapping workflows in an efficient, policy-aware manner to distributed resources • Execution • Executing workflows, incl. data movements, reliably and efficiently • Performance • Monitoring system performance for scheduling & troubleshooting Paul Avery
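To make the "planning" challenge concrete, the sketch below shows a deliberately simplified policy-aware mapping of tasks to sites. The site names, metrics, and scoring rule are hypothetical illustrations, not the GriPhyN planner:

```python
# Toy policy-aware planner: assign each task to an allowed site, preferring
# sites that already hold the input data, then sites with more free CPUs.
# All names and numbers here are invented for illustration.

sites = {
    "siteA": {"free_cpus": 120, "allowed_vos": {"cms"},          "cached_files": {"f1", "f2"}},
    "siteB": {"free_cpus": 400, "allowed_vos": {"cms", "atlas"}, "cached_files": set()},
}

tasks = [
    {"name": "reco-1", "vo": "cms", "inputs": {"f1", "f2"}},
    {"name": "reco-2", "vo": "cms", "inputs": {"f3"}},
]

def score(task, site):
    """Higher is better: data locality dominates, spare CPUs break ties."""
    locality = len(task["inputs"] & site["cached_files"])
    return locality * 1000 + site["free_cpus"]

for task in tasks:
    eligible = {n: s for n, s in sites.items() if task["vo"] in s["allowed_vos"]}
    best = max(eligible, key=lambda n: score(task, eligible[n]))
    print(f"{task['name']} -> {best}")
```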
Goal: “PetaScale” Virtual-Data Grids • Users: production teams, workgroups, single researchers (interactive user tools) • Tools: request execution & management, request planning & scheduling, virtual data • Services: resource management, security and policy, other Grid services • Scale: PetaOps, Petabytes, Performance • Fabric: distributed resources (code, storage, CPUs, networks), raw data sources, transforms Paul Avery
GriPhyN/iVDGL Science Drivers • US-CMS & US-ATLAS • HEP experiments at LHC/CERN • 100s of Petabytes • LIGO • Gravity wave experiment • 100s of Terabytes • Sloan Digital Sky Survey • Digital astronomy (1/4 sky) • 10s of Terabytes • Common needs: massive CPU, large distributed datasets, large distributed communities • (Figure: community growth and data growth, 2001-2007) Paul Avery
Virtual Data: Derivation and Provenance • Most scientific data are not simple “measurements” • They are computationally corrected/reconstructed • They can be produced by numerical simulation • Science & eng. projects are more CPU and data intensive • Programs are significant community resources (transformations) • So are the executions of those programs (derivations) • Management of dataset transformations important! • Derivation: Instantiation of a potential data product • Provenance: Exact history of any existing data product We already do this, but manually! Paul Avery
Virtual Data Motivations (1) • “I’ve found some interesting data, but I need to know exactly what corrections were applied before I can trust it.” • “I’ve detected a muon calibration error and want to know which derived data products need to be recomputed.” • “I want to search a database for 3 muon SUSY events. If a program that does this analysis exists, I won’t have to write one from scratch.” • “I want to apply a forward jet analysis to 100M events. If the results already exist, I’ll save weeks of computation.” • (Diagram: Data is consumed-by/generated-by a Derivation; Data is the product-of a Derivation; a Derivation is an execution-of a Transformation) Paul Avery
Virtual Data Motivations (2) • Data track-ability and result audit-ability • Universally sought by scientific applications • Facilitates tool and data sharing and collaboration • Data can be sent along with its recipe • Repair and correction of data • Rebuild data products (cf. “make”) • Workflow management • Organizing, locating, specifying, and requesting data products • Performance optimizations • Ability to re-create data rather than move it (see the sketch below) • (Theme: manual/error prone → automated/robust) Paul Avery
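A hedged sketch of the last point: deciding whether to re-create a derived product from its recipe or transfer an existing copy. The cost model, numbers, and function name are invented for illustration only:

```python
# Toy "re-create vs. transfer" decision for a virtual data product.
# All parameters and defaults below are illustrative, not from any real catalog.

def plan_access(size_gb, cpu_hours_to_derive, wan_mb_per_s=50, cpu_slots_free=20):
    """Return the cheaper option and its rough wall-clock cost in hours."""
    transfer_hours = size_gb * 1024 / wan_mb_per_s / 3600
    rederive_hours = cpu_hours_to_derive / max(cpu_slots_free, 1)
    if transfer_hours < rederive_hours:
        return ("transfer", transfer_hours)
    return ("re-derive", rederive_hours)

print(plan_access(size_gb=500, cpu_hours_to_derive=40))    # cheap to recompute: re-derive
print(plan_access(size_gb=5,   cpu_hours_to_derive=2000))  # cheap to move: transfer
```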
“Chimera” Virtual Data System • Virtual Data API • A Java class hierarchy to represent transformations & derivations • Virtual Data Language • Textual for people & illustrative examples • XML for machine-to-machine interfaces • Virtual Data Database • Makes the objects of a virtual data definition persistent • Virtual Data Service (future) • Provides a service interface (e.g., OGSA) to persistent objects • Version 1.0 available • To be put into VDT 1.1.7 Paul Avery
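The real Chimera API is a Java class hierarchy and its Virtual Data Language has textual and XML forms; the Python sketch below only illustrates the core idea of separating a transformation (the registered program) from its derivations (recorded executions). Names and fields are invented for this sketch:

```python
# Illustrative only: representing transformations, derivations, and a simple
# provenance query. Not the Chimera API or VDL syntax.
from dataclasses import dataclass, field

@dataclass
class Transformation:          # a registered program
    name: str
    executable: str
    args_template: list

@dataclass
class Derivation:              # one (potential or actual) execution of it
    transformation: Transformation
    inputs: dict
    outputs: dict
    parameters: dict = field(default_factory=dict)

calib = Transformation("muon_calib", "/bin/muon_calib", ["-i", "{raw}", "-o", "{calibrated}"])
run42 = Derivation(calib,
                   inputs={"raw": "run42.raw"},
                   outputs={"calibrated": "run42.calib"},
                   parameters={"version": "v3"})

# Provenance query: which derivations produced a given file?
catalog = [run42]
produced_by = [d for d in catalog if "run42.calib" in d.outputs.values()]
print([d.transformation.name for d in produced_by])
```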
Chimera Application: SDSS Analysis • Galaxy cluster data → cluster size distribution • Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs) Paul Avery
Virtual Data and LHC Computing • US-CMS • Chimera prototype tested with CMS MC (~200K events) • Currently integrating Chimera into standard CMS production tools • Integrating virtual data into Grid-enabled analysis tools • US-ATLAS • Integrating Chimera into ATLAS software • HEPCAL document includes first virtual data use cases • Very basic cases, need elaboration • Discuss with LHC expts: requirements, scope, technologies • New ITR proposal to NSF ITR program ($15M) • Dynamic Workspaces for Scientific Analysis Communities • Continued progress requires collaboration with CS groups • Distributed scheduling, workflow optimization, … • Need collaboration with CS to develop robust tools Paul Avery
iVDGL Goals and Context • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, E. Europe, Asia, S. America, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A laboratory for other disciplines to perform Data Grid tests • A focus of outreach efforts to small institutions • Context of iVDGL in US-LHC computing program • Develop and operate proto-Tier2 centers • Learn how to do Grid operations (GOC) • International participation • DataTag • UK e-Science programme: support 6 CS Fellows per year in U.S. Paul Avery
US-iVDGL Sites (Spring 2003) • Tier1/Tier2/Tier3 sites: SKC, Boston U, Wisconsin, Michigan, PSU, BNL, Fermilab, LBL, Argonne, J. Hopkins, NCSA, Indiana, Hampton, Caltech, Oklahoma, Vanderbilt, UCSD/SDSC, FSU, Arlington, UF, FIU, Brownsville • Partners? EU, CERN, Brazil, Australia, Korea, Japan Paul Avery
US-CMS Grid Testbed • Sites: Korea, Wisconsin, CERN, Fermilab, Caltech, UCSD, FSU, Florida, FIU, Brazil Paul Avery
US-CMS Testbed Success Story • Production Run for Monte Carlo data production • Assigned 1.5 million events for “eGamma Bigjets” • ~500 sec per event on 750 MHz processor; all production stages from simulation to ntuple • 2 months continuous running across 5 testbed sites • Demonstrated at Supercomputing 2002 Paul Avery
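A back-of-the-envelope consistency check on this production run, using only the numbers quoted above:

```python
# Rough CPU budget for the 1.5M-event "eGamma Bigjets" production run.
events = 1_500_000
sec_per_event = 500                      # on a 750 MHz processor
cpu_seconds = events * sec_per_event     # 7.5e8 s, roughly 24 CPU-years

run_days = 60                            # "2 months continuous running"
wall_seconds = run_days * 86_400
avg_cpus_needed = cpu_seconds / wall_seconds

print(f"Total CPU time: ~{cpu_seconds / 3.15e7:.0f} CPU-years")
print(f"Average concurrent 750 MHz CPUs over {run_days} days: ~{avg_cpus_needed:.0f}")
```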
Creation of WorldGrid • Joint iVDGL/DataTag/EDG effort • Resources from both sides (15 sites) • Monitoring tools (Ganglia, MDS, NetSaint, …) • Visualization tools (Nagios, MapCenter, Ganglia) • Applications: ScienceGrid • CMS: CMKIN, CMSIM • ATLAS: ATLSIM • Submit jobs from US or EU • Jobs can run on any cluster • Demonstrated at IST2002 (Copenhagen) • Demonstrated at SC2002 (Baltimore) Paul Avery
WorldGrid Sites Paul Avery
Grid Coordination Paul Avery
U.S. Project Coordination: Trillium • Trillium = GriPhyN + iVDGL + PPDG • Large overlap in leadership, people, experiments • Driven primarily by HENP, particularly LHC experiments • Benefit of coordination • Common software base + packaging: VDT + PACMAN • Collaborative / joint projects: monitoring, demos, security, … • Wide deployment of new technologies, e.g. Virtual Data • Stronger, broader outreach effort • Forum for US Grid projects • Joint view, strategies, meetings and work • Unified entity to deal with EU & other Grid projects Paul Avery
International Grid Coordination • Global Grid Forum (GGF) • International forum for general Grid efforts • Many working groups, standards definitions • Close collaboration with EU DataGrid (EDG) • Many connections with EDG activities • HICB: HEP Inter-Grid Coordination Board • Non-competitive forum, strategic issues, consensus • Cross-project policies, procedures and technology, joint projects • HICB-JTB Joint Technical Board • Definition, oversight and tracking of joint projects • GLUE interoperability group • Participation in LHC Computing Grid (LCG) • Software Computing Committee (SC2) • Project Execution Board (PEB) • Grid Deployment Board (GDB) Paul Avery
HEP and International Grid Projects • HEP continues to be the strongest science driver • (In collaboration with computer scientists) • Many national and international initiatives • LHC a particularly strong driving function • US-HEP committed to working with international partners • Many networking initiatives with EU colleagues • Collaboration on LHC Grid Project • Grid projects driving & linked to network developments • DataTag, SCIC, US-CERN link, Internet2 • New partners being actively sought • Korea, Russia, China, Japan, Brazil, Romania, … • Participate in US-CMS and US-ATLAS Grid testbeds • Link to WorldGrid, once some software is fixed Paul Avery
New Grid Efforts Paul Avery
CHEPREO: An Inter-Regional Center for High Energy Physics Research and Educational Outreach at Florida International University • Components: E/O Center in Miami area, iVDGL Grid activities, CMS research, AMPATH network (S. America), int’l activities (Brazil, etc.) • Status: proposal submitted Dec. 2002; presented to NSF review panel; Project Execution Plan submitted; funding in June?
A Global Grid Enabled Collaboratory for Scientific Research (GECSR) • $4M ITR proposal from Caltech (HN PI, JB CoPI), Michigan (CoPI, CoPI), Maryland (CoPI), plus senior personnel from Lawrence Berkeley Lab, Oklahoma, Fermilab, Arlington (U. Texas), Iowa, Florida State • First Grid-enabled collaboratory • Tight integration between the science of collaboratories, a globally scalable work environment, sophisticated collaborative tools (VRVS, VNC; next-gen), and an agent-based monitoring & decision support system (MonALISA) • Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors • “Giving scientists from all world regions the means to function as full partners in the process of search and discovery” Paul Avery
Large ITR Proposal: $15M • Dynamic Workspaces: Enabling Global Analysis Communities Paul Avery
UltraLight Proposal to NSF • 10 Gb/s+ network: Caltech, UF, FIU, UM, MIT; SLAC, FNAL; int’l partners; Cisco • Applications: HEP, VLBI, radiation oncology, Grid projects Paul Avery