Data Grids and Data-Intensive Science

Presentation Transcript


  1. Data Grids and Data-Intensive Science Ian Foster, Mathematics and Computer Science Division, Argonne National Laboratory, and Department of Computer Science, The University of Chicago, http://www.mcs.anl.gov/~foster Keynote Talk, Research Library Group Annual Conference, Amsterdam, April 22, 2002

  2. Overview • The technology landscape • Living in an exponential world • Grid concepts • Resource sharing in virtual organizations • Petascale Virtual Data Grids • Data, programs, computers, computations as community resources

  3. Living in an Exponential World: (1) Computing & Sensors • Moore’s Law: transistor count doubles every 18 months • (Image: magnetohydrodynamics star formation)

  4. Living in an Exponential World: (2) Storage • Storage density doubles every 12 months • Dramatic growth in online data (1 petabyte = 1,000 terabytes = 1,000,000 gigabytes) • 2000: ~0.5 petabyte • 2005: ~10 petabytes • 2010: ~100 petabytes • 2015: ~1000 petabytes? • Transforming entire disciplines in the physical and, increasingly, biological sciences; humanities next?
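
As a rough sanity check on the projections above, here is a minimal sketch using the slide's own milestone values (order-of-magnitude estimates for total online data, not raw storage density) to compute the growth rate they imply:

```python
import math

# Sketch: annual growth implied by the slide's rough online-data milestones (petabytes).
milestones = {2000: 0.5, 2005: 10, 2010: 100, 2015: 1000}

years = sorted(milestones)
for y0, y1 in zip(years, years[1:]):
    factor = (milestones[y1] / milestones[y0]) ** (1 / (y1 - y0))  # growth per year
    doubling_months = 12 * math.log(2) / math.log(factor)          # implied doubling time
    print(f"{y0}-{y1}: ~{factor:.2f}x per year, doubling every ~{doubling_months:.0f} months")
```

The implied doubling time comes out near 14-18 months, slightly slower than the 12-month density figure quoted above, which is consistent with the milestones being coarse projections rather than exact extrapolations.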

  5. Data Intensive Physical Sciences • High energy & nuclear physics • Including new experiments at CERN • Gravity wave searches • LIGO, GEO, VIRGO • Time-dependent 3-D systems (simulation, data) • Earth Observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios • Astronomy: Digital sky surveys

  6. Ongoing Astronomical Mega-Surveys • Large number of new surveys • Multi-TB in size, 100M objects or larger • In databases • Individual archives planned and under way • Multi-wavelength view of the sky • >13 wavelength coverage within 5 years • Impressive early discoveries • Finding exotic objects by unusual colors • L, T dwarfs, high-redshift quasars • Finding objects by time variability • Gravitational micro-lensing • Surveys include MACHO, 2MASS, SDSS, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE, ...

  7. Crab Nebula in 4 Spectral Regions (images: X-ray, Optical, Infrared, Radio)

  8. Coming Floods of Astronomy Data • The planned Large Synoptic Survey Telescope will produce over 10 petabytes per year by 2008! • All-sky survey every few days, so will have fine-grain time series for the first time

  9. Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Digitizing patient records (ditto) • X-ray crystallography • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Virtual Population Laboratory (proposed) • Simulate likely spread of disease outbreaks • Brain scans (3-D, time dependent)

  10. A Brain is a Lot of Data! (Mark Ellisman, UCSD) • And comparisons must be made among many brains • We need to get to one micron to know the location of every cell; we’re just now starting to get to 10 microns • Grids will help get us there and further

  11. An Exponential World: (3) Networks (Or, Coefficients Matter …) • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference = order of magnitude per 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010 • Computers: x 60 • Networks: x 4000 • (Graph: Moore’s Law vs. storage improvements vs. optical improvements, from Scientific American, Jan. 2001, by Cleo Vilett; source: Vinod Khosla, Kleiner, Caufield and Perkins)
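
The "order of magnitude per 5 years" claim follows directly from the two doubling times quoted above; a short sketch of the arithmetic:

```python
# Sketch: the gap implied by the doubling times quoted above.
def growth(doubling_months: float, years: float) -> float:
    """Multiplicative improvement over `years`, given a doubling time in months."""
    return 2 ** (12 * years / doubling_months)

span = 5
computers = growth(18, span)   # ~10x
networks = growth(9, span)     # ~100x
print(f"Over {span} years: computers ~{computers:.0f}x, networks ~{networks:.0f}x, "
      f"ratio ~{networks / computers:.0f}x (an order of magnitude)")
```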

  12. Evolution of the Scientific Process • Pre-electronic • Theorize &/or experiment, alone or in small teams; publish paper • Post-electronic • Construct and mine very large databases of observational or simulation data • Develop computer simulations & analyses • Exchange information quasi-instantaneously within large, distributed, multidisciplinary teams

  13. The Grid “Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations”

  14. An Example Virtual Organization: CERN’s Large Hadron Collider • 1800 physicists, 150 institutes, 32 countries • 100 PB of data by 2010; 50,000 CPUs?

  15. Grid Communities & Applications: Data Grids for High Energy Physics (www.griphyn.org, www.ppdg.net, www.eu-datagrid.org) • Detector: there is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size • The detector delivers ~PBytes/sec to the Online System, which sends ~100 MBytes/sec to the Tier 0 CERN Computer Centre and its Offline Processor Farm (~20 TIPS) • Tier 1 regional centres (FermiLab ~4 TIPS; France, Germany, Italy), connected at ~622 Mbits/sec (or air freight, deprecated) • Tier 2 centres (~1 TIPS each, e.g., Caltech) with HPSS mass storage, connected at ~622 Mbits/sec • Tier 3 institute servers (~0.25 TIPS) with a physics data cache (~1 MBytes/sec) • Tier 4: physicist workstations (Pentium II 300 MHz class) • Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server • 1 TIPS is approximately 25,000 SpecInt95 equivalents
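
The headline rates in this tier model are mutually consistent; a small sketch, using only the numbers quoted above, that checks the trigger arithmetic:

```python
# Sketch: consistency check on the slide's headline numbers for the LHC tier model.
bunch_crossing_interval_s = 25e-9      # "a bunch crossing every 25 nsecs"
triggers_per_second = 100              # "100 triggers per second"
event_size_bytes = 1e6                 # "each triggered event is ~1 MByte"

crossing_rate_hz = 1 / bunch_crossing_interval_s          # 40 MHz of raw crossings
reduction = crossing_rate_hz / triggers_per_second        # selectivity of the trigger
offline_rate = triggers_per_second * event_size_bytes     # bytes/sec into the offline farm

print(f"Crossing rate: {crossing_rate_hz/1e6:.0f} MHz, trigger keeps 1 event in {reduction:,.0f}")
print(f"Rate to offline farm: ~{offline_rate/1e6:.0f} MBytes/sec")   # matches the ~100 MB/s figure
```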

  16. The Grid Opportunity: eScience and eBusiness • Physicists worldwide pool resources for peta-op analyses of petabytes of data • Civil engineers collaborate to design, execute, & analyze shake table experiments • An insurance company mines data from partner hospitals for fraud detection • An application service provider offloads excess load to a compute cycle provider • An enterprise configures internal & external resources to support eBusiness workload

  17. Grid Computing

  18. The Grid: A Brief History • Early 90s • Gigabit testbeds, metacomputing • Mid to late 90s • Early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments • 2002 • Dozens of application communities & projects • Major infrastructure deployments • Significant technology base (esp. Globus Toolkit™) • Growing industrial interest • Global Grid Forum: ~500 people, 20+ countries

  19. Challenging Technical Requirements • Dynamic formation and management of virtual organizations • Online negotiation of access to services: who, what, why, when, how • Establishment of applications and systems able to deliver multiple qualities of service • Autonomic management of infrastructure elements Open Grid Services Architecture http://www.globus.org/ogsa

  20. Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by IT • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Geographically distributed collaboration • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000: ~0.5 Petabyte • 2005: ~10 Petabytes • 2010: ~100 Petabytes • 2015: ~1000 Petabytes? • How to collect, manage, access and interpret this quantity of data? • Drives demand for “Data Grids” to handle the additional dimension of data access & movement

  21. Data Grid Projects • Particle Physics Data Grid (US, DOE) • Data Grid applications for HENP expts. • GriPhyN (US, NSF) • Petascale Virtual-Data Grids • iVDGL (US, NSF) • Global Grid lab • TeraGrid (US, NSF) • Dist. supercomp. resources (13 TFlops) • European Data Grid (EU, EC) • Data Grid technologies, EU deployment • CrossGrid (EU, EC) • Data Grid technologies, EU emphasis • DataTAG (EU, EC) • Transatlantic network, Grid applications • Japanese Grid Projects (APGrid) (Japan) • Grid deployment throughout Japan • Collaborations of application scientists & computer scientists • Infrastructure devel. & deployment • Globus based

  22. Biomedical InformaticsResearch Network (BIRN) • Evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer’s, etc.). • Today • One lab, small patient base • 4 TB collection • Tomorrow • 10s of collaborating labs • Larger population sample • 400 TB data collection: more brains, higher resolution • Multiple scale data integration and analysis

  23. GriPhyN = App. Science + CS + Grids • GriPhyN = Grid Physics Network • US-CMS High Energy Physics • US-ATLAS High Energy Physics • LIGO/LSC Gravity wave research • SDSS Sloan Digital Sky Survey • Strong partnership with computer scientists • Design and implement production-scale grids • Develop common infrastructure, tools and services • Integration into the 4 experiments • Application to other sciences via “Virtual Data Toolkit” • Multi-year project • R&D for grid architecture (funded at $11.9M +$1.6M) • Integrate Grid infrastructure into experiments through VDT

  24. U Florida U Chicago Boston U Caltech U Wisconsin, Madison USC/ISI Harvard Indiana Johns Hopkins Northwestern Stanford U Illinois at Chicago U Penn U Texas, Brownsville U Wisconsin, Milwaukee UC Berkeley UC San Diego San Diego Supercomputer Center Lawrence Berkeley Lab Argonne Fermilab Brookhaven GriPhyN Institutions

  25. GriPhyN: PetaScale Virtual Data Grids • Users (production teams, individual investigators, workgroups) draw on ~1 Petaop/s of computing and ~100 Petabytes of storage • Interactive user tools sit above request planning & scheduling tools, request execution & management tools, and virtual data tools • These in turn rely on resource management services, security and policy services, and other Grid services • Underneath lie transforms, distributed resources (code, storage, CPUs, networks), and raw data sources

  26. GriPhyN Research Agenda • Virtual Data technologies • Derived data, calculable via algorithm • Instantiated 0, 1, or many times (e.g., caches) • “Fetch value” vs. “execute algorithm” • Potentially complex (versions, cost calculation, etc) • E.g., LIGO: “Get gravitational strain for 2 minutes around 200 gamma-ray bursts over last year” • For each requested data value, need to • Locate item materialization, location, and algorithm • Determine costs of fetching vs. calculating • Plan data movements, computations to obtain results • Execute the plan
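
The fetch-versus-recompute decision listed above can be illustrated with a toy cost model. This is a hypothetical sketch, not GriPhyN or Virtual Data Toolkit code; the class names, cost units, and the `plan` function are all illustrative assumptions:

```python
from dataclasses import dataclass

# Illustrative sketch of the "fetch value vs. execute algorithm" decision for a
# requested virtual data item. All names are hypothetical, not actual GriPhyN APIs.

@dataclass
class Replica:
    site: str
    size_gb: float
    bandwidth_gb_per_s: float      # achievable transfer rate from that site

@dataclass
class Derivation:
    transformation: str            # program that can (re)materialize the data
    cpu_hours: float               # estimated cost to execute it

def plan(replicas: list[Replica], derivation: Derivation,
         cpu_hour_cost: float = 1.0, transfer_hour_cost: float = 1.0) -> str:
    """Return a one-step plan: fetch the cheapest replica, or recompute from the derivation."""
    fetch_costs = [(r.size_gb / r.bandwidth_gb_per_s / 3600 * transfer_hour_cost, r)
                   for r in replicas]
    compute_cost = derivation.cpu_hours * cpu_hour_cost
    if fetch_costs:
        best_cost, best = min(fetch_costs, key=lambda c: c[0])
        if best_cost <= compute_cost:
            return f"fetch from {best.site} (est. cost {best_cost:.2f})"
    return f"execute {derivation.transformation} (est. cost {compute_cost:.2f})"

# Example: one remote replica vs. regenerating the data with the original program.
print(plan([Replica("fnal.gov", size_gb=100, bandwidth_gb_per_s=0.01)],
           Derivation("extract_strain", cpu_hours=4)))
# -> fetch from fnal.gov (est. cost 2.78)
```

A real planner would also weigh storage availability, queue wait times, and site policies, as the next slide notes.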

  27. Virtual Data in Action • Data request may • Fetch item • Compute locally • Compute remotely • Access local data • Access remote data • Scheduling based on • Local policies • Global policies • Cost • (Figure: major facilities and archives; regional facilities and caches; local facilities and caches)

  28. GriPhyN Research Agenda (cont.) • Execution management • Co-allocation (CPU, storage, network transfers) • Fault tolerance, error reporting • Interaction, feedback to planning • Performance analysis (with PPDG) • Instrumentation, measurement of all components • Understand and optimize grid performance • Virtual Data Toolkit (VDT) • VDT = virtual data services + virtual data tools • One of the primary deliverables of R&D effort • Technology transfer to other scientific domains

  29. iVDGL: A Global Grid Laboratory • “We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.” (from NSF proposal, 2001) • International Virtual-Data Grid Laboratory • A global Grid lab (US, EU, South America, Asia, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A production facility for LHC experiments • An experimental laboratory for other disciplines • A focus of outreach efforts to small institutions • Funded for $13.65M by NSF

  30. Initial US-iVDGL Data Grid • (Map showing Tier 1 (FNAL), proto-Tier 2, and Tier 3 university sites: SKC, BU, Wisconsin, PSU, BNL, Fermilab, Hampton, Indiana, JHU, Caltech, UCSD, Florida, Brownsville) • Other sites to be added in 2002

  31. iVDGL Map (2002-2003) • (Map legend: Tier 0/1 facility, Tier 2 facility, Tier 3 facility; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links; Surfnet and DataTAG connections) • Later: Brazil, Pakistan, Russia, China

  32. Programs as Community Resources:Data Derivation and Provenance • Most scientific data are not simple “measurements”; essentially all are: • Computationally corrected/reconstructed • And/or produced by numerical simulation • And thus, as data and computers become ever larger and more expensive: • Programs are significant community resources • So are the executions of those programs

  33. Why data provenance matters, in four scenarios: • “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.” • “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.” • “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.” • “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.” • (Diagram: data objects are consumed-by/generated-by derivations; a derivation is an execution-of a transformation; transformations are created-by users)
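
The second scenario above ("which derived data to recompute?") reduces to a walk over the consumed-by/generated-by graph sketched in the diagram. A minimal illustration, with hypothetical dataset and derivation names rather than any real catalog schema:

```python
from collections import defaultdict

# Illustrative provenance walk: find everything downstream of a bad input.
# derivation -> (inputs consumed, outputs generated); example data only.
derivations = {
    "calibrate_run42":   (["raw_run42", "calib_table_v3"], ["calibrated_run42"]),
    "reconstruct_run42": (["calibrated_run42"], ["tracks_run42"]),
    "histogram_2002":    (["tracks_run42", "tracks_run41"], ["mass_plot_2002"]),
}

def downstream(bad_datum: str) -> set[str]:
    """All data products that (transitively) depend on `bad_datum`."""
    consumers = defaultdict(list)            # datum -> derivations that consume it
    for name, (inputs, _) in derivations.items():
        for d in inputs:
            consumers[d].append(name)
    tainted, frontier = set(), [bad_datum]
    while frontier:
        datum = frontier.pop()
        for deriv in consumers[datum]:
            for out in derivations[deriv][1]:
                if out not in tainted:
                    tainted.add(out)
                    frontier.append(out)
    return tainted

# A calibration table turns out to be wrong: everything derived from it must be redone.
print(downstream("calib_table_v3"))  # contains calibrated_run42, tracks_run42, mass_plot_2002
```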

  34. The Chimera Virtual Data System(GriPhyN Project) • Virtual data catalog • Transformations, derivations, data • Virtual data language • Data definition + query • Applications include browsers and data analysis applications
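
To make the catalog idea concrete, here is a toy sketch in the spirit of Chimera: it records transformations and derivations and answers a request either with an existing copy or with the recipe to produce one. The class, its methods, and the example entries are illustrative assumptions, not the actual Chimera virtual data language or API:

```python
from dataclasses import dataclass, field

# Minimal sketch of a virtual data catalog: transformations (programs),
# derivations (planned invocations), and which outputs are already materialized.

@dataclass
class Catalog:
    transformations: dict[str, str] = field(default_factory=dict)            # name -> executable
    derivations: dict[str, tuple[str, dict]] = field(default_factory=dict)   # output -> (transformation, args)
    materialized: set[str] = field(default_factory=set)                      # outputs that exist somewhere

    def define_transformation(self, name: str, executable: str) -> None:
        self.transformations[name] = executable

    def define_derivation(self, output: str, transformation: str, **args) -> None:
        self.derivations[output] = (transformation, args)

    def resolve(self, output: str) -> str:
        """Answer a request: return the data if it exists, else the recipe to produce it."""
        if output in self.materialized:
            return f"fetch existing copy of {output}"
        if output in self.derivations:
            tr, args = self.derivations[output]
            return f"run {self.transformations[tr]} with {args} to produce {output}"
        return f"no known way to produce {output}"

cat = Catalog()
cat.define_transformation("extract_strain", "/apps/ligo/strain")
cat.define_derivation("strain_grb_017.dat", "extract_strain", burst="GRB017", window_min=2)
print(cat.resolve("strain_grb_017.dat"))   # not materialized yet, so the recipe is returned
```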

  35. Early GriPhyN Challenge Problem: CMS Data Reconstruction (Scott Koranda, Miron Livny, and others) • A master Condor job running on a Caltech workstation coordinates a secondary Condor job on the Wisconsin (WI) pool, the NCSA Linux cluster, and NCSA UniTree (a GridFTP-enabled FTP server): • 2) Launch secondary job on WI pool; input files via Globus GASS • 3) 100 Monte Carlo jobs on Wisconsin Condor pool • 4) 100 data files transferred via GridFTP, ~1 GB each • 5) Secondary reports complete to master • 6) Master starts reconstruction jobs via Globus jobmanager on cluster • 7) GridFTP fetches data from UniTree • 8) Processed Objectivity database stored to UniTree • 9) Reconstruction job reports complete to master
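
The numbered steps form a simple two-stage pipeline (simulation at Wisconsin, then reconstruction at NCSA) driven by the master job. A sketch of that dependency structure, with descriptive stage labels standing in for the real Condor/Globus job definitions:

```python
from graphlib import TopologicalSorter

# Sketch of the challenge-problem pipeline as a dependency graph, paraphrasing the
# slide's numbered steps. Stage names are descriptive labels, not actual job names.
steps = {
    "launch_secondary_on_WI_pool":       set(),                                # step 2 (Globus GASS inputs)
    "run_100_monte_carlo_jobs_WI":       {"launch_secondary_on_WI_pool"},      # step 3
    "gridftp_100_files_to_master":       {"run_100_monte_carlo_jobs_WI"},      # step 4 (~1 GB each)
    "secondary_reports_complete":        {"gridftp_100_files_to_master"},      # step 5
    "start_reconstruction_at_NCSA":      {"secondary_reports_complete"},       # step 6 (Globus jobmanager)
    "gridftp_fetch_from_unitree":        {"start_reconstruction_at_NCSA"},     # step 7
    "store_objectivity_db_to_unitree":   {"gridftp_fetch_from_unitree"},       # step 8
    "reconstruction_reports_complete":   {"store_objectivity_db_to_unitree"},  # step 9
}

# The master Condor job at Caltech would drive these stages in dependency order.
print(list(TopologicalSorter(steps).static_order()))
```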

  36. Trace of a Condor-G Physics Run • (Figure: timeline of pre / simulation / post jobs on the UW Condor pool, with ooHits and ooDigis produced at NCSA; a delay is visible due to a script error)

  37. Knowledge-based Data Grid Roadmap (Reagan Moore, SDSC) • Three levels, each with ingest services, management, and access services: • Knowledge level: relationships between concepts (XTM DTD); knowledge repository for rules (KQL, model-based access); knowledge- or topic-based query/browse • Information level: attributes, semantics (XML DTD); information repository (data handling system); attribute-based query (SDLIP) • Data level: data fields, containers, folders; storage with replicas and persistent IDs; Grids (MCAT/HDF); feature-based query

  38. New Programs • U.K. eScience program • EU 6th Framework • U.S. Committee on Cyberinfrastructure • Japanese Grid initiative

  39. U.S. Cyberinfrastructure: Draft Recommendations • A new INITIATIVE to revolutionize science and engineering research at NSF and worldwide, to capitalize on new computing and communications opportunities • 21st Century Cyberinfrastructure includes supercomputing, but also massive storage, networking, software, collaboration, visualization, and human resources • Current centers (NCSA, SDSC, PSC) are a key resource for the INITIATIVE • Budget estimate: incremental $650M/year (continuing) • An INITIATIVE OFFICE with a highly placed, credible leader empowered to: • Initiate competitive, discipline-driven, path-breaking applications of cyberinfrastructure within NSF that contribute to the shared goals of the INITIATIVE • Coordinate policy and allocations across fields and projects, with participants across NSF directorates, Federal agencies, and international e-science • Develop high-quality middleware and other software that is essential and special to scientific research • Manage individual computational, storage, and networking resources at least 100x larger than individual projects or universities can provide

  40. Summary • Technology exponentials are changing the shape of scientific investigation & knowledge • More computing, even more data, yet more networking • The Grid: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • Petascale Virtual Data Grids represent the future in science infrastructure • Data, programs, computers, computations as community resources

  41. For More Information • Grid Book • www.mkp.com/grids • The Globus Project™ • www.globus.org • Global Grid Forum • www.gridforum.org • TeraGrid • www.teragrid.org • EU DataGrid • www.eu-datagrid.org • GriPhyN • www.griphyn.org • iVDGL • www.ivdgl.org • Background papers • www.mcs.anl.gov/~foster
