Presentation Transcript


  1. Global Data Grids for 21st Century Science Paul Avery University of Florida http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu GriPhyN/iVDGL Outreach Workshop, University of Texas, Brownsville, March 1, 2002 Paul Avery

  2. What is a Grid? • Grid: Geographically distributed computing resources configured for coordinated use • Physical resources & networks provide raw capability • “Middleware” software ties it together Paul Avery

  3. What Are Grids Good For? • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Engineering • Civil engineers collaborate to design, execute, & analyze shake table experiments • A multidisciplinary analysis in aerospace couples code and data in four companies From Ian Foster Paul Avery

  4. What Are Grids Good For? • Application Service Providers • A home user invokes architectural design functions at an application service provider… • …which purchases computing cycles from cycle providers • Commercial • Scientists at a multinational toy company design a new product • Cities, communities • An emergency response team couples real time data, weather model, population data • A community group pools members’ PCs to analyze alternative designs for a local road • Health • Hospitals and international agencies collaborate on stemming a major disease outbreak From Ian Foster Paul Avery

  5. Proto-Grid: SETI@home • Community: SETI researchers + enthusiasts • Arecibo radio data sent to users (250KB data chunks) • Over 2M PCs used Paul Avery

  6. More Advanced Proto-Grid:Evaluation of AIDS Drugs • Community • Research group (Scripps) • 1000s of PC owners • Vendor (Entropia) • Common goal • Drug design • Advance AIDS research Paul Avery

  7. Why Grids? • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage and computing • Groups of people • Communities require access to common services • Scientific collaborations (physics, astronomy, biology, eng. …) • Government agencies • Health care organizations, large corporations, … • Goal is to build “Virtual Organizations” • Make all community resources available to any VO member • Leverage strengths at different institutions • Add people & resources dynamically Paul Avery

  8. Grids: Why Now? • Moore’s law improvements in computing • Highly functional end systems • Burgeoning wired and wireless Internet connections • Universal connectivity • Changing modes of working and problem solving • Teamwork, computation • Network exponentials • (Next slide) Paul Avery

  9. Network Exponentials & Collaboration • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference = order of magnitude per 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010? • Computers: x 60 • Networks: x 4000 Scientific American (Jan-2001) Paul Avery
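
The "order of magnitude per 5 years" claim follows directly from the two doubling times. Below is a minimal check in Python, assuming pure exponential growth; the 1986-2000 figures on the slide are empirical and will not match a clean exponential exactly.

```python
# Rough check of the slide's doubling-time arithmetic (assumes pure
# exponential growth; the slide's historical figures are empirical).
def growth_factor(months, doubling_period_months):
    """Factor by which capacity grows over `months` if it doubles
    every `doubling_period_months` months."""
    return 2 ** (months / doubling_period_months)

five_years = 60  # months
computers = growth_factor(five_years, 18)   # ~10x
networks = growth_factor(five_years, 9)     # ~100x
print(f"Computers over 5 years: ~{computers:.0f}x")
print(f"Networks  over 5 years: ~{networks:.0f}x")
print(f"Gap: ~{networks / computers:.0f}x, i.e. about an order of magnitude")
```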

  10. Grid Challenges • Overall goal: Coordinated sharing of resources • Technical problems to overcome • Authentication, authorization, policy, auditing • Resource discovery, access, allocation, control • Failure detection & recovery • Resource brokering • Additional issue: lack of central control & knowledge • Preservation of local site autonomy • Policy discovery and negotiation important Paul Avery

  11. Layered Grid Architecture (Analogy to Internet Architecture) • Application (user level): specialized services, i.e. application-specific distributed services • Collective: managing multiple resources, ubiquitous infrastructure services • Resource: sharing single resources, negotiating access, controlling use • Connectivity: talking to things, communications, security • Fabric: controlling things locally, accessing and controlling resources • (Internet protocol analogy: Application, Transport, Internet, Link) From Ian Foster Paul Avery

  12. Globus Project and Toolkit • Globus Project™ (Argonne + USC/ISI) • O(40) researchers & developers • Identify and define core protocols and services • Globus Toolkit™ 2.0 • A major product of the Globus Project • Reference implementation of core protocols & services • Growing open source developer community • Globus Toolkit used by all Data Grid projects today • US: GriPhyN, PPDG, TeraGrid, iVDGL • EU: EU-DataGrid and national projects • Recent announcement of applying “web services” to Grids • Keeps Grids in the commercial mainstream • GT 3.0 Paul Avery

  13. Globus General Approach • Define Grid protocols & APIs • Protocol-mediated access to remote resources • Integrate and extend existing standards • Develop reference implementation • Open source Globus Toolkit • Client & server SDKs, services, tools, etc. • Grid-enable wide variety of tools • FTP, SSH, Condor, SRB, MPI, … • Learn about real world problems • Deployment • Testing • Applications [Figure: layered stack of applications, diverse global services, core services, and diverse resources] Paul Avery
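
To make "protocol-mediated access to remote resources" concrete, here is a minimal sketch of how an application might drive two Globus Toolkit command-line tools, grid-proxy-init (GSI proxy creation) and globus-url-copy (GridFTP transfer), from Python. The host names and file paths are hypothetical placeholders; only the two tool names come from the toolkit.

```python
# Minimal sketch: wrap two Globus Toolkit command-line tools from Python.
# Hostnames and paths below are illustrative placeholders.
import subprocess

def create_proxy():
    """Create a short-lived GSI proxy credential (prompts for a passphrase)."""
    subprocess.run(["grid-proxy-init"], check=True)

def gridftp_copy(src_url, dst_url):
    """Copy a file between two URLs over GridFTP."""
    subprocess.run(["globus-url-copy", src_url, dst_url], check=True)

if __name__ == "__main__":
    create_proxy()
    gridftp_copy("gsiftp://tier1.example.org/data/run42.root",
                 "file:///scratch/run42.root")
```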

  14. Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by IT • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Geographically distributed collaboration • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? How to collect, manage, access and interpret this quantity of data? Drives demand for “Data Grids” to handle additional dimension of data access & movement Paul Avery
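
To see why data movement becomes its own dimension of the problem, here is a back-of-the-envelope sketch of transfer times for the volumes projected above, assuming a single fully utilized 2.5 Gb/s link (the Tier0-Tier1 rate quoted later in the talk) and ignoring protocol overhead.

```python
# Back-of-the-envelope transfer times (assumes a fully utilized link,
# no protocol overhead; purely illustrative).
def transfer_days(bytes_to_move, link_gbps):
    seconds = bytes_to_move * 8 / (link_gbps * 1e9)
    return seconds / 86400

PB = 1e15  # bytes
for volume_pb in (0.5, 10, 100, 1000):          # the slide's 2000-2015 projections
    days = transfer_days(volume_pb * PB, 2.5)   # single 2.5 Gb/s link
    print(f"{volume_pb:7.1f} PB over 2.5 Gb/s: ~{days:,.0f} days")
```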

  15. Data Intensive Physical Sciences • High energy & nuclear physics • Including new experiments at CERN’s Large Hadron Collider • Gravity wave searches • LIGO, GEO, VIRGO • Astronomy: Digital sky surveys • Sloan Digital Sky Survey, VISTA, other Gigapixel arrays • “Virtual” Observatories (multi-wavelength astronomy) • Time-dependent 3-D systems (simulation & data) • Earth Observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios Paul Avery

  16. Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Digitizing patient records (ditto) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (3-D, time dependent) • Virtual Population Laboratory (proposed) • Database of populations, geography, transportation corridors • Simulate likely spread of disease outbreaks Craig Venter keynote @SC2001 Paul Avery

  17. Example: High Energy Physics • “Compact” Muon Solenoid at the LHC (CERN) [Figure: detector scale compared to the Smithsonian standard man] Paul Avery

  18. LHC Computing Challenges • Complexity of LHC interaction environment & resulting data • Scale: Petabytes of data per year (100 PB by ~2010-12) • Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries Paul Avery

  19. Global LHC Data Grid • Tier0: CERN • Tier1: National Lab • Tier2: Regional Center (University, etc.) • Tier3: University workgroup • Tier4: Workstation • Key ideas: • Hierarchical structure • Tier2 centers [Figure: hierarchy of Tier 0 (CERN), Tier 1, Tier 2, Tier 3, and Tier 4 sites] Paul Avery

  20. Global LHC Data Grid • Experiment: one bunch crossing every 25 nsec, 100 triggers per second, each event ~1 MByte • Experiment to Online System: ~PBytes/sec; Online System to CERN Computer Center (Tier 0+1, >20 TIPS, HPSS): ~100 MBytes/sec • Tier 0+1 to Tier 1 national centers (France, Italy, UK, USA; each with HPSS): 2.5 Gbits/sec • Tier 1 to Tier 2 centers: ~622 Mbits/sec • Tier 2 to Tier 3 institutes (~0.25 TIPS, physics data cache): 100-1000 Mbits/sec • Tier 4: workstations, other portals • CERN/Outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1 • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels Paul Avery
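
The ~100 MBytes/sec offline rate on this slide follows from the quoted trigger rate and event size; the sketch below reproduces it. The number of live seconds per year is an assumption added for illustration, not a figure from the slide.

```python
# Reproduce the slide's ~100 MB/s offline rate from trigger rate x event size.
trigger_rate_hz = 100          # events per second after triggering (from the slide)
event_size_mb = 1.0            # ~1 MByte per event (from the slide)
rate_mb_per_s = trigger_rate_hz * event_size_mb
print(f"Offline data rate: ~{rate_mb_per_s:.0f} MB/s")    # ~100 MB/s

live_seconds_per_year = 1e7    # assumed running time per year, not from the slide
annual_pb = rate_mb_per_s * live_seconds_per_year / 1e9
print(f"Raw data per year: ~{annual_pb:.0f} PB")           # order of a petabyte/year
```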

  21. Sloan Digital Sky Survey Data Grid Paul Avery

  22. LIGO (Gravity Wave) Data Grid [Figure: Hanford and Livingston observatories, the Caltech Tier1, MIT, and LSC Tier2 sites, connected by OC3, OC12, and OC48 links over Internet2 Abilene] Paul Avery

  23. Data Grid Projects • Particle Physics Data Grid (US, DOE) • Data Grid applications for HENP expts. • GriPhyN (US, NSF) • Petascale Virtual-Data Grids • iVDGL (US, NSF) • Global Grid lab • TeraGrid (US, NSF) • Dist. supercomp. resources (13 TFlops) • European Data Grid (EU, EC) • Data Grid technologies, EU deployment • CrossGrid (EU, EC) • Data Grid technologies, EU • DataTAG (EU, EC) • Transatlantic network, Grid applications • Japanese Grid Project (APGrid?) (Japan) • Grid deployment throughout Japan • Collaborations of application scientists & computer scientists • Infrastructure devel. & deployment • Globus based Paul Avery

  24. Coordination of U.S. Grid Projects • Three U.S. projects • PPDG: HENP experiments, short term tools, deployment • GriPhyN: Data Grid research, Virtual Data, VDT deliverable • iVDGL: Global Grid laboratory • Coordination of PPDG, GriPhyN, iVDGL • Common experiments + personnel, management integration • iVDGL as “joint” PPDG + GriPhyN laboratory • Joint meetings (Jan. 2002, April 2002, Sept. 2002) • Joint architecture creation (GriPhyN, PPDG) • Adoption of VDT as common core Grid infrastructure • Common Outreach effort (GriPhyN + iVDGL) • New TeraGrid project (Aug. 2001) • 13 TFlops across 4 sites, 40 Gb/s networking • Goal: integrate into iVDGL, adopt VDT, common Outreach Paul Avery

  25. Worldwide Grid Coordination • Two major clusters of projects • “US based”: GriPhyN Virtual Data Toolkit (VDT) • “EU based”: different packaging of similar components Paul Avery

  26. GriPhyN = App. Science + CS + Grids • GriPhyN = Grid Physics Network • US-CMS High Energy Physics • US-ATLAS High Energy Physics • LIGO/LSC Gravity wave research • SDSS Sloan Digital Sky Survey • Strong partnership with computer scientists • Design and implement production-scale grids • Develop common infrastructure, tools and services (Globus based) • Integration into the 4 experiments • Broad application to other sciences via “Virtual Data Toolkit” • Strong outreach program • Multi-year project • R&D for grid architecture (funded at $11.9M +$1.6M) • Integrate Grid infrastructure into experiments through VDT Paul Avery

  27. GriPhyN Institutions • U Florida • U Chicago • Boston U • Caltech • U Wisconsin, Madison • USC/ISI • Harvard • Indiana • Johns Hopkins • Northwestern • Stanford • U Illinois at Chicago • U Penn • U Texas, Brownsville • U Wisconsin, Milwaukee • UC Berkeley • UC San Diego • San Diego Supercomputer Center • Lawrence Berkeley Lab • Argonne • Fermilab • Brookhaven Paul Avery

  28. GriPhyN: PetaScale Virtual-Data Grids [Figure: production teams, individual investigators, and workgroups use Interactive User Tools to reach ~1 Petaflop of computing and ~100 Petabytes of storage; layered services include Virtual Data Tools, Request Planning & Scheduling Tools, and Request Execution & Management Tools, supported by Resource Management, Security and Policy, and other Grid Services; underneath sit transforms, raw data sources, and distributed resources (code, storage, CPUs, networks)] Paul Avery

  29. GriPhyN Research Agenda • Virtual Data technologies (fig.) • Derived data, calculable via algorithm • Instantiated 0, 1, or many times (e.g., caches) • “Fetch value” vs “execute algorithm” • Very complex (versions, consistency, cost calculation, etc.) • LIGO example • “Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year” • For each requested data value, need to • Locate the item and its algorithm • Determine costs of fetching vs calculating • Plan data movements & computations required to obtain results • Execute the plan Paul Avery
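
Here is a minimal sketch of the "fetch value" vs "execute algorithm" decision for a single derived data item, in Python. The cost model, field names, and example values are assumptions for illustration; the actual GriPhyN virtual data machinery handles versions, consistency, and multi-step plans.

```python
# Hedged sketch of the fetch-vs-recompute decision for one derived data item.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DerivedItem:
    name: str
    replica_url: Optional[str]     # None if never materialized (or evicted)
    fetch_cost: float              # e.g. estimated transfer seconds
    compute_cost: float            # e.g. estimated CPU + staging seconds

def plan(item: DerivedItem) -> str:
    """Return a one-step 'plan': fetch an existing replica or re-derive it."""
    if item.replica_url is not None and item.fetch_cost <= item.compute_cost:
        return f"fetch {item.name} from {item.replica_url}"
    return f"execute transformation to materialize {item.name}"

# Example: a LIGO-style strain segment that exists remotely but is cheap to recompute.
segment = DerivedItem("strain/grb-117/2min", "gsiftp://cache.example.org/...",
                      fetch_cost=300.0, compute_cost=45.0)
print(plan(segment))   # -> execute transformation to materialize ...
```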

  30. Virtual Data in Action • A data request may • Compute locally • Compute remotely • Access local data • Access remote data • Scheduling based on • Local policies • Global policies • Cost [Figure: an item may be fetched from major facilities and archives, regional facilities and caches, or local facilities and caches] Paul Avery

  31. GriPhyN Research Agenda (cont.) • Execution management • Co-allocation of resources (CPU, storage, network transfers) • Fault tolerance, error reporting • Interaction, feedback to planning • Performance analysis (with PPDG) • Instrumentation and measurement of all grid components • Understand and optimize grid performance • Virtual Data Toolkit (VDT) • VDT = virtual data services + virtual data tools • One of the primary deliverables of R&D effort • Technology transfer mechanism to other scientific domains Paul Avery

  32. GriPhyN/PPDG Data Grid Architecture [Figure: an Application produces a DAG that is handled by a Planner and an Executor (DAGMan, Kangaroo); Catalog Services (MCAT, GriPhyN catalogs), Monitoring, Info Services (MDS), Replica Management (GDMP), Policy/Security (GSI, CAS), and a Reliable Transfer Service support access to Compute Resources (Globus GRAM) and Storage Resources (GridFTP, GRAM, SRM); a legend marks components where an initial solution is operational] Paul Avery
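
The Executor in this architecture consumes a DAG of tasks (handled in practice by tools such as Condor DAGMan). The toy topological-order executor below is only meant to show what that means; it is not the GriPhyN/PPDG implementation, and the task names are invented.

```python
# Toy DAG executor: runs each task once all of its parents have completed.
# Illustrative only; the real system delegates this to Condor DAGMan.
from collections import deque

def execute_dag(tasks, deps, run):
    """tasks: iterable of names; deps: {task: set(parent tasks)}; run: callable."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, parents in remaining.items():
        for p in parents:
            children[p].append(t)
    ready = deque(t for t, parents in remaining.items() if not parents)
    while ready:
        task = ready.popleft()
        run(task)
        for child in children[task]:
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)

execute_dag(
    ["stage_in", "reconstruct", "analyze", "stage_out"],
    {"reconstruct": {"stage_in"}, "analyze": {"reconstruct"},
     "stage_out": {"analyze"}},
    run=lambda t: print("running", t),
)
```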

  33. Catalog Architecture • Transparency with respect to materialization and location • Derived Metadata Catalog: application-specific attributes mapped to derived-data ids (e.g., i2, i10) • Derived Data Catalog: id, transformation, parameters, and name (e.g., i1 = F applied to X, named F.X; i2 = F applied to Y, named F.Y; i10 = G applied to Y with parameter P, named G(P).Y); updated upon materialization • Transformation Catalog: transformation name, program location, and cost (e.g., F at URL:f, cost 10; G at URL:g, cost 20); URLs point to program storage • Metadata Catalog: object name mapped to logical object name (e.g., X to logO1, Y to logO2, F.X to logO3, G(1).Y to logO4) • Replica Catalog: logical container name mapped to physical file URLs (e.g., logC1 to URL1; logC2 to URL2, URL3; logC3 to URL4; logC4 to URL5, URL6); URLs point to physical file storage • GCMS mediates between object names and logical container names Paul Avery
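
The catalogs on this slide map naturally onto simple lookup tables. The sketch below shows one possible in-memory shape for that data, reusing the slide's own example entries (transformations F and G, derived items i1, i2, i10, logical containers logC1-logC4); the real catalogs (MCAT, the replica catalog, etc.) are services, not Python dictionaries.

```python
# Illustrative in-memory stand-ins for the slide's catalogs.
# Entries mirror the slide's example names; everything here is toy data.
transformation_catalog = {            # transformation -> program location and cost
    "F": {"program": "URL:f", "cost": 10},
    "G": {"program": "URL:g", "cost": 20},
}
derived_data_catalog = {              # derived id -> how it is (or would be) produced
    "i1":  {"trans": "F", "params": ["X"],      "name": "F.X"},
    "i2":  {"trans": "F", "params": ["Y"],      "name": "F.Y"},
    "i10": {"trans": "G", "params": ["Y", "P"], "name": "G(P).Y"},
}
replica_catalog = {                   # logical container name -> physical file URLs
    "logC1": ["URL1"],
    "logC2": ["URL2", "URL3"],
    "logC3": ["URL4"],
    "logC4": ["URL5", "URL6"],
}

def materialization_cost(derived_id):
    """Cost of (re)deriving an item, resolved through two catalog lookups."""
    trans = derived_data_catalog[derived_id]["trans"]
    return transformation_catalog[trans]["cost"]

print(materialization_cost("i10"))    # -> 20
```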

  34. iVDGL: A Global Grid Laboratory • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, South America, Asia, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A facility to perform production exercises for LHC experiments • A laboratory for other disciplines to perform Data Grid tests • A focus of outreach efforts to small institutions • Funded for $13.65M by NSF • “We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.” From NSF proposal, 2001 Paul Avery

  35. iVDGL Components • Computing resources • Tier1, Tier2, Tier3 sites • Networks • USA (TeraGrid, Internet2, ESNET), Europe (Géant, …) • Transatlantic (DataTAG), Transpacific, AMPATH, … • Grid Operations Center (GOC) • Indiana (2 people) • Joint work with TeraGrid on GOC development • Computer Science support teams • Support, test, upgrade GriPhyN Virtual Data Toolkit • Outreach effort • Integrated with GriPhyN • Coordination, interoperability Paul Avery

  36. Current iVDGL Participants • Initial experiments (funded by NSF proposal) • CMS, ATLAS, LIGO, SDSS, NVO • U.S. Universities and laboratories • (Next slide) • Partners • TeraGrid • EU DataGrid + EU national projects • Japan (AIST, TITECH) • Australia • Complementary EU project: DataTAG • 2.5 Gb/s transatlantic network Paul Avery

  37. U.S. iVDGL Proposal Participants • U Florida: CMS • Caltech: CMS, LIGO • UC San Diego: CMS, CS • Indiana U: ATLAS, GOC • Boston U: ATLAS • U Wisconsin, Milwaukee: LIGO • Penn State: LIGO • Johns Hopkins: SDSS, NVO • U Chicago/Argonne: CS • U Southern California: CS • U Wisconsin, Madison: CS • Salish Kootenai: Outreach, LIGO • Hampton U: Outreach, ATLAS • U Texas, Brownsville: Outreach, LIGO • Fermilab: CMS, SDSS, NVO • Brookhaven: ATLAS • Argonne Lab: ATLAS, CS • Site categories: T2 / Software, CS support, T3 / Outreach, T1 / Labs (funded elsewhere) Paul Avery

  38. Initial US-iVDGL Data Grid [Map: Tier1 (FNAL), proto-Tier2, and Tier3 university sites, including SKC, BU, Wisconsin, PSU, BNL, Fermilab, Hampton, Indiana, JHU, Caltech, UCSD, Florida, and Brownsville; other sites to be added in 2002] Paul Avery

  39. iVDGL Map (2002-2003) [Map legend: Tier0/1, Tier2, and Tier3 facilities; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links; Surfnet; DataTAG] • Later: Brazil, Pakistan, Russia, China Paul Avery

  40. Summary • Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing • The iVDGL will provide vast experience for new collaborations • Many challenges during the coming transition • New grid projects will provide rich experience and lessons • Difficult to predict situation even 3-5 years ahead Paul Avery

  41. Grid References • Grid Book • www.mkp.com/grids • Globus • www.globus.org • Global Grid Forum • www.gridforum.org • TeraGrid • www.teragrid.org • EU DataGrid • www.eu-datagrid.org • PPDG • www.ppdg.net • GriPhyN • www.griphyn.org • iVDGL • www.ivdgl.org Paul Avery
