

  1. Grids for 21st Century Data Intensive Science
  Paul Avery, University of Florida
  http://www.phys.ufl.edu/~avery/
  avery@phys.ufl.edu
  University of Michigan, May 8, 2003

  2. Grids and Science

  3. The Grid Concept
  • Grid: Geographically distributed computing resources configured for coordinated use
  • Fabric: Physical resources & networks provide raw capability
  • Middleware: Software ties it all together (tools, services, etc.)
  • Goal: Transparent resource sharing

  4. Fundamental Idea: Resource Sharing
  • Resources for complex problems are distributed
    • Advanced scientific instruments (accelerators, telescopes, …)
    • Storage, computing, people, institutions
  • Communities require access to common services
    • Research collaborations (physics, astronomy, engineering, …)
    • Government agencies, health care organizations, corporations, …
  • “Virtual Organizations”
    • Create a “VO” from geographically separated components
    • Make all community resources available to any VO member
    • Leverage strengths at different institutions
  • Grids require a foundation of strong networking
    • Communication tools, visualization
    • High-speed data transmission, instrument operation
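To make the "Virtual Organization" idea concrete, here is a minimal sketch of a VO as a pooled set of members and resources. The names, fields, and the `can_use` check are hypothetical illustrations; real VOs are managed by Grid middleware (certificate authorities, membership services), not a Python dictionary.

```python
# Toy "virtual organization" record: members from several institutions
# share a common pool of resources. All names here are invented.
vo = {
    "name": "example-physics-vo",
    "members": {"alice@uni-a.edu", "bob@lab-b.gov"},
    "resources": {
        "cluster.uni-a.edu": {"type": "compute", "cpus": 256},
        "storage.lab-b.gov": {"type": "storage", "capacity_tb": 500},
    },
}

def can_use(user, resource):
    """Any VO member may use any resource the VO has pooled."""
    return user in vo["members"] and resource in vo["resources"]

print(can_use("alice@uni-a.edu", "storage.lab-b.gov"))  # True
```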

  5. Some (Realistic) Grid Examples
  • High energy physics: 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
  • Fusion power (ITER, etc.): Physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data
  • Astronomy: An international team remotely operates a telescope in real time
  • Climate modeling: Climate scientists visualize, annotate, & analyze Terabytes of simulation data
  • Biology: A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

  6. Grids: Enhancing Research & Learning
  • Fundamentally alters the conduct of scientific research
    • Central model: people and resources flow inward to labs
    • Distributed model: knowledge flows between distributed teams
  • Strengthens universities
    • Couples universities to data intensive science
    • Couples universities to national & international labs
    • Brings front-line research and resources to students
    • Exploits intellectual resources of formerly isolated schools
    • Opens new opportunities for minority and women researchers
  • Builds partnerships to drive advances in IT/science/engineering
    • “Application” sciences ↔ Computer Science
    • Physics ↔ Astronomy, biology, etc.
    • Universities ↔ Laboratories
    • Scientists ↔ Students
    • Research community ↔ IT industry

  7. Grid Challenges
  • Operate a fundamentally complex entity
    • Geographically distributed resources
    • Each resource under different administrative control
    • Many failure modes
  • Manage workflow across the Grid
    • Balance policy vs. instantaneous capability to complete tasks
    • Balance effective resource use vs. fast turnaround for priority jobs
  • Match resource usage to policy over the long term
    • Goal-oriented algorithms: steering requests according to metrics
  • Maintain a global view of resources and system state
    • Coherent end-to-end system monitoring
    • Adaptive learning for execution optimization
  • Build high-level services & an integrated user environment
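A toy illustration of the policy-vs.-capability balance described above: score each site by its free capacity, weighted by how far it currently sits below its agreed usage share. The site names, numbers, and scoring rule are invented for illustration; real Grid schedulers and matchmakers are far more elaborate.

```python
# Minimal sketch of policy-aware dispatch: prefer sites with spare CPUs,
# but favor sites that have used less than their agreed policy share.
sites = [
    {"name": "site_a", "free_cpus": 120, "policy_share": 0.5, "recent_usage": 0.7},
    {"name": "site_b", "free_cpus": 40,  "policy_share": 0.3, "recent_usage": 0.1},
    {"name": "site_c", "free_cpus": 300, "policy_share": 0.2, "recent_usage": 0.2},
]

def score(site):
    # Balance instantaneous capability (free CPUs) against long-term policy.
    policy_deficit = site["policy_share"] - site["recent_usage"]
    return site["free_cpus"] * (1.0 + policy_deficit)

best = max(sites, key=score)
print(f"Dispatch job to {best['name']}")
```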

  8. Data Grids

  9. Data Intensive Science: 2000-2015
  • Scientific discovery increasingly driven by data collection
    • Computationally intensive analyses
    • Massive data collections
    • Data distributed across networks of varying capability
    • Internationally distributed collaborations
  • Dominant factor: data growth (1 Petabyte = 1000 TB)
    • 2000: ~0.5 Petabyte
    • 2005: ~10 Petabytes
    • 2010: ~100 Petabytes
    • 2015: ~1000 Petabytes?
  • How to collect, manage, access, and interpret this quantity of data?
  • Drives demand for “Data Grids” to handle the additional dimension of data access & movement
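A quick back-of-the-envelope calculation with the volumes listed above shows what such growth implies: roughly a 1.7x increase per year, i.e. the data volume doubling about every 1.4 years.

```python
import math

# Projected data volumes from the slide, in petabytes
volumes = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}

years = 2015 - 2000
growth_factor = volumes[2015] / volumes[2000]        # 2000x over 15 years
annual_rate = growth_factor ** (1 / years)           # ~1.66x per year
doubling_time = math.log(2) / math.log(annual_rate)  # ~1.4 years

print(f"Implied growth: {annual_rate:.2f}x/year, doubling every {doubling_time:.1f} years")
```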

  10. Data Intensive Physical Sciences
  • High energy & nuclear physics
    • Including new experiments at CERN’s Large Hadron Collider
  • Astronomy
    • Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
    • VLBI arrays: multiple-Gbps data streams
    • “Virtual” Observatories (multi-wavelength astronomy)
  • Gravity wave searches
    • LIGO, GEO, VIRGO, TAMA
  • Time-dependent 3-D systems (simulation & data)
    • Earth observation, climate modeling
    • Geophysics, earthquake modeling
    • Fluids, aerodynamic design
    • Dispersal of pollutants in atmosphere

  11. Data Intensive Biology and Medicine
  • Medical data
    • X-ray, mammography data, etc. (many petabytes)
    • Radiation oncology (real-time display of 3-D images)
  • X-ray crystallography
    • Bright X-ray sources, e.g. Argonne Advanced Photon Source
  • Molecular genomics and related disciplines
    • Human Genome, other genome databases
    • Proteomics (protein structure, activities, …)
    • Protein interactions, drug delivery
  • Brain scans (1-10 μm, time dependent)

  12. Driven by LHC Computing Challenges
  • Complexity: Millions of individual detector channels
  • Scale: PetaOps (CPU), Petabytes (data)
  • Distribution: Global distribution of people & resources
    • 1800 physicists, 150 institutes, 32 countries

  13. CMS Experiment at LHC
  • The “Compact” Muon Solenoid at the LHC (CERN)
  • [Figure: detector drawing, with a “Smithsonian standard man” shown for scale]

  14. LHC Data Rates: Detector to Storage
  • Collision rate 40 MHz → ~1000 TB/sec of raw detector data (physics filtering begins here)
  • Level 1 Trigger (special hardware): 75 KHz, 75 GB/sec
  • Level 2 Trigger (commodity CPUs): 5 KHz, 5 GB/sec
  • Level 3 Trigger (commodity CPUs): 100 Hz, 100-1500 MB/sec of raw data to storage
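The rates and bandwidths above imply the per-event data sizes at each stage and the overall rejection factor; a small check (using the upper end of the Level 3 bandwidth range):

```python
# Per-event sizes and rejection factor implied by the rates on this slide.
# (name, output rate in Hz, output bandwidth in bytes/s)
stages = [
    ("Collision rate ", 40e6, 1000e12),  # ~1000 TB/s of raw channel data
    ("Level 1 trigger", 75e3, 75e9),     # special-purpose hardware
    ("Level 2 trigger", 5e3,  5e9),      # commodity CPU farm
    ("Level 3 trigger", 100,  1500e6),   # upper end of 100-1500 MB/s to storage
]

for name, rate, bandwidth in stages:
    print(f"{name}: {rate:12,.0f} Hz -> ~{bandwidth / rate / 1e6:5.1f} MB/event")

rejection = stages[0][1] / stages[-1][1]
print(f"Overall: only 1 collision in {rejection:,.0f} reaches storage")
```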

  15. LHC: Higgs Decay into 4 Muons
  • 10^9 events/sec; selectivity: 1 in 10^13
  • [Event display: all charged tracks with pt > 2 GeV vs. reconstructed tracks with pt > 25 GeV (+30 minimum bias events)]

  16. Hierarchy of LHC Data Grid Resources
  • Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1
  • CMS Experiment Online System → Tier 0 (CERN Computer Center, > 20 TIPS) at 100-1500 MBytes/s
  • Tier 0 → Tier 1 national centers (Korea, Russia, UK, USA, …) at 10-40 Gbps
  • Tier 1 → Tier 2 centers at 2.5-10 Gbps
  • Tier 2 → Tier 3 institute servers (physics cache) at 1-2.5 Gbps
  • Tier 3 → Tier 4 (individual PCs) at 1-10 Gbps
  • ~10s of Petabytes by 2007-8; ~1000 Petabytes in 5-7 years
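To give a feel for why these inter-tier link speeds matter, here is a rough transfer-time estimate over the quoted bandwidths. The 1 PB dataset size is an illustrative choice, not a figure from the slide.

```python
# Rough time to move a petabyte-scale dataset over the inter-tier links.
def transfer_days(bytes_total, gbps):
    seconds = bytes_total * 8 / (gbps * 1e9)   # bits over link capacity
    return seconds / 86400

petabyte = 1e15
for gbps in (2.5, 10, 40):
    print(f"1 PB over a {gbps:>4} Gbps link: ~{transfer_days(petabyte, gbps):.1f} days")
```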

  17. Digital Astronomy
  • Future dominated by detector improvements
    • Moore’s Law growth in CCDs
    • Gigapixel arrays on the horizon
    • Growth in CPU/storage tracking data volumes
  • [Chart: “Glass” vs. “MPixels”: total area of 3m+ telescopes in the world (m²) and total number of CCD pixels (Mpixels)]
  • 25-year growth: 30x in glass, 3000x in pixels

  18. The Age of Astronomical Mega-Surveys
  • Next generation mega-surveys will change astronomy
    • Large sky coverage
    • Sound statistical plans, uniform systematics
  • The technology to store and access the data is here
    • Following Moore’s law
  • Integrating these archives for the whole community
    • Astronomical data mining will lead to stunning new discoveries
  • “Virtual Observatory” (next slides)

  19. Virtual Observatories
  • Multi-wavelength astronomy, multiple surveys
  • [Diagram components: image data with standards; source catalogs; specialized data (spectroscopy, time series, polarization); information archives of derived & legacy data (NED, Simbad, ADS, etc.); discovery tools (visualization, statistics)]

  20. Virtual Observatory Data Challenge
  • Digital representation of the sky
    • All-sky + deep fields
    • Integrated catalog and image databases
    • Spectra of selected samples
  • Size of the archived data
    • 40,000 square degrees
    • Resolution < 0.1 arcsec → > 50 trillion pixels
    • One band (2 bytes/pixel): 100 Terabytes
    • Multi-wavelength: 500-1000 Terabytes
    • Time dimension: many Petabytes
  • Large, globally distributed database engines
    • Multi-Petabyte data size, distributed widely
    • Thousands of queries per day, GByte/s I/O speed per site
    • Data Grid computing infrastructure
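The pixel and byte counts above follow directly from the sky area and resolution; a quick check:

```python
# Back-of-the-envelope check of the slide's archive-size estimate
sky_area_deg2 = 40_000       # survey area in square degrees
pixel_scale_arcsec = 0.1     # pixel size
bytes_per_pixel = 2          # one band, 2 bytes/pixel

arcsec2_per_deg2 = 3600 ** 2                                 # 1 deg = 3600 arcsec
pixels_per_deg2 = arcsec2_per_deg2 / pixel_scale_arcsec ** 2
total_pixels = sky_area_deg2 * pixels_per_deg2               # ~5e13 (> 50 trillion)
one_band_bytes = total_pixels * bytes_per_pixel              # ~1e14 bytes ~ 100 TB

print(f"{total_pixels:.1e} pixels, ~{one_band_bytes / 1e12:.0f} TB per band")
```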

  21. Sloan Sky Survey Data Grid

  22. International Grid/Networking Projects
  • US, EU, E. Europe, Asia, S. America, …

  23. Global Context: Data Grid Projects
  • U.S. projects
    • Particle Physics Data Grid (PPDG): DOE
    • GriPhyN: NSF
    • International Virtual Data Grid Laboratory (iVDGL): NSF
    • TeraGrid: NSF
    • DOE Science Grid: DOE
    • NSF Middleware Initiative (NMI): NSF
  • EU, Asia major projects
    • European Data Grid (EU, EC)
    • LHC Computing Grid (LCG) (CERN)
    • EU national projects (UK, Italy, France, …)
    • CrossGrid (EU, EC)
    • DataTAG (EU, EC)
    • Japanese project
    • Korea project

  24. Particle Physics Data Grid
  • Funded 2001-2004 @ US$9.5M (DOE)
  • Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS

  25. PPDG Goals
  • Serve high energy & nuclear physics (HENP) experiments
    • Unique challenges, diverse test environments
  • Develop advanced Grid technologies
    • Focus on end-to-end integration
    • Maintain practical orientation
    • Networks, instrumentation, monitoring
    • DB file/object replication, caching, catalogs, end-to-end movement
  • Make tools general enough for wide community
    • Collaboration with GriPhyN, iVDGL, EDG, LCG
    • ESNet Certificate Authority work, security

  26. GriPhyN and iVDGL
  • Both funded through the NSF ITR program
    • GriPhyN: $11.9M (NSF) + $1.6M (matching), 2000-2005
    • iVDGL: $13.7M (NSF) + $2M (matching), 2001-2006
  • Basic composition
    • GriPhyN: 12 funded universities, SDSC, 3 labs (~80 people)
    • iVDGL: 16 funded institutions, SDSC, 3 labs (~80 people)
    • Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
    • Large overlap of people, institutions, management
  • Grid research vs. Grid deployment
    • GriPhyN: CS research, Virtual Data Toolkit (VDT) development
    • iVDGL: Grid laboratory deployment
    • 4 physics experiments provide frontier challenges
    • VDT in common

  27. GriPhyN Computer Science Challenges
  • Virtual data (more later)
    • Data + programs (content) + programs (executions)
    • Representation, discovery, & manipulation of workflows and associated data & programs
  • Planning
    • Mapping workflows in an efficient, policy-aware manner to distributed resources
  • Execution
    • Executing workflows, including data movements, reliably and efficiently
  • Performance
    • Monitoring system performance for scheduling & troubleshooting

  28. Goal: “PetaScale” Virtual-Data Grids
  • Targets: PetaOps, Petabytes, performance
  • [Architecture diagram: production teams, workgroups, and single researchers use interactive user tools built on request execution & management tools, request planning & scheduling tools, and virtual data tools; these rest on resource management services, security and policy services, and other Grid services; transforms operate over distributed resources (code, storage, CPUs, networks) and raw data sources]

  29. GriPhyN/iVDGL Science Drivers
  • US-CMS & US-ATLAS
    • HEP experiments at LHC/CERN
    • 100s of Petabytes
  • LIGO
    • Gravity wave experiment
    • 100s of Terabytes
  • Sloan Digital Sky Survey
    • Digital astronomy (1/4 sky)
    • 10s of Terabytes
  • Common requirements: massive CPU; large, distributed datasets; large, distributed communities
  • [Chart: data growth and community growth, 2001-2007]

  30. Virtual Data: Derivation and Provenance
  • Most scientific data are not simple “measurements”
    • They are computationally corrected/reconstructed
    • They can be produced by numerical simulation
  • Science & engineering projects are more CPU and data intensive
    • Programs are significant community resources (transformations)
    • So are the executions of those programs (derivations)
  • Management of dataset transformations is important!
    • Derivation: instantiation of a potential data product
    • Provenance: exact history of any existing data product
  • We already do this, but manually!
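A minimal sketch, in plain Python, of the transformation/derivation/provenance distinction made above. The record structure and the `rebuild` helper are hypothetical illustrations, not the API of the Chimera system described a few slides later.

```python
# Hypothetical virtual-data bookkeeping: a transformation is a program,
# a derivation is one execution of it on specific inputs and parameters.
transformations = {
    "calibrate_muons_v2": {"program": "calib.exe", "version": "2.0"},
}

derivations = {
    "run1234_calibrated": {
        "transformation": "calibrate_muons_v2",
        "inputs": ["run1234_raw"],
        "parameters": {"alignment": "2003-04"},
    },
}

def provenance(product):
    """Exact history of a data product: which program, inputs, parameters."""
    d = derivations[product]
    return {**d, "transformation": transformations[d["transformation"]]}

def rebuild(product, execute):
    """Re-create a product from its recipe instead of moving it (cf. 'make')."""
    d = derivations[product]
    return execute(d["transformation"], d["inputs"], d["parameters"])

print(provenance("run1234_calibrated"))
```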

  31. Virtual Data Motivations (1)
  • “I’ve found some interesting data, but I need to know exactly what corrections were applied before I can trust it.”
  • “I’ve detected a muon calibration error and want to know which derived data products need to be recomputed.”
  • “I want to search a database for 3-muon SUSY events. If a program that does this analysis exists, I won’t have to write one from scratch.”
  • “I want to apply a forward jet analysis to 100M events. If the results already exist, I’ll save weeks of computation.”
  • [Diagram: data are consumed by and generated by derivations; a derivation is an execution of a transformation; a data product is the product of a derivation]

  32. Virtual Data Motivations (2)
  • Data track-ability and result audit-ability
    • Universally sought by scientific applications
  • Facilitates tool and data sharing and collaboration
    • Data can be sent along with its recipe
  • Repair and correction of data
    • Rebuild data products (cf. “make”)
  • Workflow management
    • Organizing, locating, specifying, and requesting data products
  • Performance optimizations
    • Ability to re-create data rather than move it
  • Manual/error-prone → automated/robust

  33. “Chimera” Virtual Data System
  • Virtual Data API
    • A Java class hierarchy to represent transformations & derivations
  • Virtual Data Language
    • Textual, for people & illustrative examples
    • XML, for machine-to-machine interfaces
  • Virtual Data Database
    • Makes the objects of a virtual data definition persistent
  • Virtual Data Service (future)
    • Provides a service interface (e.g., OGSA) to persistent objects
  • Version 1.0 available
    • To be put into VDT 1.1.7

  34. Chimera Application: SDSS Analysis
  • Input: galaxy cluster data; output: cluster size distribution
  • Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)

  35. Virtual Data and LHC Computing
  • US-CMS
    • Chimera prototype tested with CMS MC (~200K events)
    • Currently integrating Chimera into standard CMS production tools
    • Integrating virtual data into Grid-enabled analysis tools
  • US-ATLAS
    • Integrating Chimera into ATLAS software
  • HEPCAL document includes first virtual data use cases
    • Very basic cases, need elaboration
    • Discuss with LHC experiments: requirements, scope, technologies
  • New ITR proposal to NSF ITR program ($15M)
    • “Dynamic Workspaces for Scientific Analysis Communities”
  • Continued progress requires collaboration with CS groups
    • Distributed scheduling, workflow optimization, …
    • Need collaboration with CS to develop robust tools

  36. iVDGL Goals and Context
  • International Virtual-Data Grid Laboratory
    • A global Grid laboratory (US, EU, E. Europe, Asia, S. America, …)
    • A place to conduct Data Grid tests “at scale”
    • A mechanism to create common Grid infrastructure
    • A laboratory for other disciplines to perform Data Grid tests
    • A focus of outreach efforts to small institutions
  • Context of iVDGL in the US-LHC computing program
    • Develop and operate proto-Tier2 centers
    • Learn how to do Grid operations (GOC)
  • International participation
    • DataTag
    • UK e-Science programme: support 6 CS Fellows per year in the U.S.

  37. US-iVDGL Sites (Spring 2003)
  • [Map of Tier1/Tier2/Tier3 sites: SKC, Boston U, Wisconsin, Michigan, PSU, BNL, Fermilab, LBL, Argonne, J. Hopkins, NCSA, Indiana, Hampton, Caltech, Oklahoma, Vanderbilt, UCSD/SDSC, FSU, Arlington, UF, FIU, Brownsville]
  • Partners? EU, CERN, Brazil, Australia, Korea, Japan

  38. US-CMS Grid Testbed
  • [Map of testbed sites: Korea, Wisconsin, CERN, Fermilab, Caltech, UCSD, FSU, Florida, FIU, Brazil]

  39. US-CMS Testbed Success Story
  • Production run for Monte Carlo data production
    • Assigned 1.5 million events for “eGamma Bigjets”
    • ~500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
    • 2 months of continuous running across 5 testbed sites
  • Demonstrated at Supercomputing 2002
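The slide's numbers translate into a substantial amount of computing: roughly 24 CPU-years of work, which over two months of wall-clock time means about 145 processors kept busy on average across the 5 sites.

```python
# Rough scale of the production run described on this slide
events = 1_500_000
sec_per_event = 500                   # on a 750 MHz processor
run_seconds = 2 * 30 * 24 * 3600      # ~2 months of wall-clock time

cpu_seconds = events * sec_per_event              # 7.5e8 CPU-seconds
cpu_years = cpu_seconds / (365 * 24 * 3600)       # ~24 CPU-years
avg_cpus = cpu_seconds / run_seconds              # processors busy on average

print(f"~{cpu_years:.0f} CPU-years of work, ~{avg_cpus:.0f} CPUs busy across 5 sites")
```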

  40. Creation of WorldGrid
  • Joint iVDGL/DataTag/EDG effort
    • Resources from both sides (15 sites)
    • Monitoring tools (Ganglia, MDS, NetSaint, …)
    • Visualization tools (Nagios, MapCenter, Ganglia)
  • Applications: ScienceGrid
    • CMS: CMKIN, CMSIM
    • ATLAS: ATLSIM
  • Submit jobs from US or EU
    • Jobs can run on any cluster
  • Demonstrated at IST2002 (Copenhagen)
  • Demonstrated at SC2002 (Baltimore)

  41. WorldGrid Sites

  42. Grid Coordination

  43. U.S. Project Coordination: Trillium
  • Trillium = GriPhyN + iVDGL + PPDG
    • Large overlap in leadership, people, experiments
    • Driven primarily by HENP, particularly the LHC experiments
  • Benefits of coordination
    • Common software base + packaging: VDT + PACMAN
    • Collaborative / joint projects: monitoring, demos, security, …
    • Wide deployment of new technologies, e.g. virtual data
    • Stronger, broader outreach effort
  • Forum for US Grid projects
    • Joint view, strategies, meetings and work
    • Unified entity to deal with EU & other Grid projects

  44. International Grid Coordination
  • Global Grid Forum (GGF)
    • International forum for general Grid efforts
    • Many working groups, standards definitions
  • Close collaboration with EU DataGrid (EDG)
    • Many connections with EDG activities
  • HICB: HEP Inter-Grid Coordination Board
    • Non-competitive forum, strategic issues, consensus
    • Cross-project policies, procedures and technology, joint projects
  • HICB-JTB Joint Technical Board
    • Definition, oversight and tracking of joint projects
    • GLUE interoperability group
  • Participation in LHC Computing Grid (LCG)
    • Software Computing Committee (SC2)
    • Project Execution Board (PEB)
    • Grid Deployment Board (GDB)

  45. HEP and International Grid Projects
  • HEP continues to be the strongest science driver
    • In collaboration with computer scientists
    • Many national and international initiatives
    • LHC a particularly strong driving function
  • US-HEP committed to working with international partners
    • Many networking initiatives with EU colleagues
    • Collaboration on the LHC Grid Project
  • Grid projects driving & linked to network developments
    • DataTag, SCIC, US-CERN link, Internet2
  • New partners being actively sought
    • Korea, Russia, China, Japan, Brazil, Romania, …
    • Participate in US-CMS and US-ATLAS Grid testbeds
    • Link to WorldGrid, once some software is fixed

  46. New Grid Efforts

  47. An Inter-Regional Center for High Energy Physics Research and Educational Outreach (CHEPREO) at Florida International University
  • Status
    • Proposal submitted Dec. 2002
    • Presented to NSF review panel
    • Project Execution Plan submitted
    • Funding in June?
  • E/O Center in the Miami area
  • iVDGL Grid activities
  • CMS research
  • AMPATH network (S. America)
  • Int’l activities (Brazil, etc.)

  48. A Global Grid Enabled Collaboratory for Scientific Research (GECSR)
  • $4M ITR proposal from Caltech (HN PI, JB CoPI), Michigan (CoPI, CoPI), Maryland (CoPI)
    • Plus senior personnel from Lawrence Berkeley Lab, Oklahoma, Fermilab, Arlington (U. Texas), Iowa, Florida State
  • First Grid-enabled Collaboratory: tight integration between
    • The science of collaboratories
    • A globally scalable work environment
    • Sophisticated collaborative tools (VRVS, VNC; next-gen)
    • Agent-based monitoring & decision support system (MonALISA)
  • Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors
  • “Giving scientists from all world regions the means to function as full partners in the process of search and discovery”

  49. Large ITR Proposal ($15M): Dynamic Workspaces: Enabling Global Analysis Communities

  50. UltraLight Proposal to NSF
  • 10 Gb/s+ network
    • Caltech, UF, FIU, UM, MIT
    • SLAC, FNAL
    • International partners
    • Cisco
  • Applications
    • HEP
    • VLBI
    • Radiation oncology
    • Grid projects
