
Scenarios for Grid Applications



  1. Scenarios for Grid Applications Ed Seidel Max Planck Institute for Gravitational Physics eseidel@aei.mpg.de

  2. Outline PHYSICS TOMORROW • Remote visualization, monitoring and steering • Distributed computing • Grid portals • More dynamic scenarios

  3. Why Grids? (1) eScience • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • 1,000 physicists worldwide pool resources for peta-op analyses of petabytes of data • Civil engineers collaborate to design, execute, & analyze shake table experiments • Climate scientists visualize, annotate, & analyze terabyte simulation datasets • An emergency response team couples real-time data, weather models, and population data Source: Ian Foster

  4. Why Grids? (2) eBusiness • Engineers at a multinational company collaborate on the design of a new product • A multidisciplinary analysis in aerospace couples code and data in four companies • An insurance company mines data from partner hospitals for fraud detection • An application service provider offloads excess load to a compute cycle provider • An enterprise configures internal & external resources to support eBusiness workload Source: Ian Foster

  5. Current Grid App Types • Community Driven • Serving the needs of distributed communities • Video Conferencing • Virtual Collaborative Environments • Code sharing to “experiencing each other” at a distance… • Data Driven • Remote access of huge data, data mining • Weather Information systems • Particle Physics • Process/Simulation Driven • Demanding Simulations of Science and Engineering • Get less attention in the Grid World, yet drive HPC! • Remote, steered, etc… • Present Examples: • “simple but very difficult” • Future Examples: • Dynamic • Interacting combinations of all types

  6. From Telephone Conference Calls to Access Grid International Video Meetings • New cyber arts • Humans interacting with virtual realities • Internet-linked pianos • Creating a virtual global research lab • Access Grid lead: Argonne • NSF STARTAP lead: UIC’s Electronic Visualization Lab Source: Smarr

  7. NSF’s EarthScope -- USArray: Explosions of Data! Typical! • High Resolution of Crust & Upper Mantle Structure • Transportable Array • Broadband Array • 400 Broadband Seismometers • ~70 Km Spacing • ~1500 x 1500 Km Grid • ~2 Year Deployments at Each Site • Rolling Deployment Over More Than 10 Years • Permanent Reference Network • Geodetic-Quality GPS Receivers • All Data to Community in Near Real Time • Bandwidth Will Be Driven by Visual Analysis in Repositories • Real-time simulations run as data come in Source: Smarr/Frank Vernon (IGPP SIO, UCSD)

  8. EarthScope Rollout: Rollout Over 14 Years, Starting With Existing Broadband Stations Source: Smarr

  9. Common Infrastructure Wide range of scientific disciplines will require a common infrastructure: Common Needs Driven by the Science/Engineering • Large Number of Sensors / Instruments • Data to community in real time! • Daily Generation of Large Data Sets • Growth in computing power from teraflop to petaflop machines • Experimental data • Data is on Multiple Length and Time Scales • Automatic Archiving in Distributed Repositories • Large Community of End Users • Multi-Megapixel and Immersive Visualization • Collaborative Analysis From Multiple Sites • Complex Simulations Needed to Interpret Data Source: Smarr

  10. Common Infrastructure Wide range of scientific disciplines will require a common infrastructure. Some will need optical networks: • Communications → Dedicated Lambdas • Data → Large Peer-to-Peer Lambda-Attached Storage Source: Smarr

  11. Huge Black Hole Collision Simulation: 3000 frames of volume rendering, TB of data

  12. Issues for Complex Simulations • Huge amounts of data needed/generated across different machines • How to retrieve, track, manage data across Grid • In this case, had to fly Berlin-NCSA, bring data back on disks! • Many components developed by distributed collaborations • How to bring communities together? • How to find/load/execute different components? • Many computational resources available • How to find best ones to start? • How to distribute work effectively? • Needs of computations change with time! • How to adapt to changes? How to monitor system? • How to interact with Experiments? Coming!

  13. Future View: Much is Here • Scale of computations much larger: Need Grids to keep up… • Much larger processes, task farm many more processes (e.g. Monte Carlo) • Also: how to simply make better use of current resources • Complexity approaching that of Nature • Simulations of the Universe and its constituents • Black holes, neutron stars, supernovae, human genome, human behavior • Teams of computational scientists working together: the future of science and engineering • Must support efficient, high level problem description • Must support collaborative computational science: look at Grand Challenges • Must support all different languages • Ubiquitous Grid Computing • Resources, applications, etc replaced by abstract notion: Grid Services • E.g., OGSA • Very dynamic simulations, even deciding their own future • Apps may find the services themselves: distributed, spawned, etc... • Must be tolerant of dynamic infrastructure • Monitored, viz’ed, controlled from anywhere, with colleagues elsewhere

  14. Future View Much already here: • Scale of computations much larger: Need Grids to keep up… • Much larger processes, task farm many more processes (e.g. Monte Carlo) • Also: how to simply make better use of current resources • Complexity approaching that of Nature • Simulations of the Universe and its constituents • Black holes, neutron stars, supernovae • Human genome, human behavior • Climate modeling, environmental effects

  15. Future View • Teams of computational scientists working together: the future of science and engineering • Must support efficient, high level problem description • Must support collaborative computational science: Grand Challenges • Must support all different languages • Ubiquitous Grid Computing • Resources, applications, etc replaced by abstract notion: Grid Services, e.g. OGSA • Very dynamic simulations, even deciding their own future • Apps may find services themselves: distributed, spawned, etc... • Must be tolerant of dynamic infrastructure • Monitor, viz, control from anywhere, with colleagues elsewhere

  16. Grid Applications So Far • SC93 - SC2000 • Typical scenario • Find remote resource (often using multiple computers) • Launch job (usually static, tightly coupled) • Visualize results (usually in-line, fixed) • Need to go far beyond this • Make it much, much easier • Portals, Globus, standards • Make it much more dynamic, adaptive, fault tolerant • Migrate this technology to the general user Metacomputing Einstein’s Equations: Connecting T3Es in Berlin, Garching, SDSC

  17. The Vision, Part I: “ASC” Cactus Computational Toolkit Science, Autopilot, AMR, PETSc, HDF, MPI, GrACE, Globus, Remote Steering 1. User has Science idea... 2. Composes/Builds Code Components w/Interface... 3. Selects Appropriate Resources... 4. Steers simulation, monitors performance... 5. Collaborators log in to monitor... Want to integrate and migrate this technology to the generic user…

  18. Start simple: sit here, compute there Accounts for one user (real case): • berte.zib.de • denali.mcs.anl.gov • golden.sdsc.edu • gseaborg.nersc.gov • harpo.wustl.edu • horizon.npaci.edu • loslobos.alliance.unm.edu • mcurie.nersc.gov • modi4.ncsa.uiuc.edu • ntsc1.ncsa.uiuc.edu • origin.aei-potsdam.mpg.de • pc.rzg.mpg.de • pitcairn.mcs.anl.gov • quad.mcs.anl.gov • rr.alliance.unm.edu • sr8000.lrz-muenchen.de 16 machines, 6 different usernames, 16 passwords, ... • Portals provide: • Single access to all resources • Locate/build executables • Central/collaborative parameter files • Job submission/tracking • Access to new Grid technologies • Use any Web browser!! This is hard, but it gets much worse from here…

  19. Start simple: sit here, compute there

  20. Distributed Computation: Harnessing Multiple Computers • Why would anyone want to do this? • Capacity: computers can’t keep up with needs • Throughput • Issues • Bandwidth (increasing faster than computation) • Latency • Communication needs, Topology • Communication/computation • Techniques to be developed • Overlapping communication/computation • Extra ghost zones to reduce latency (see the sketch below) • Compression • Algorithms to do this for the scientist
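
A minimal sketch of the “extra ghost zones” idea above, assuming a toy 1-D domain split across two subdomains: exchanging g ghost cells once every g steps amortizes one message latency over g stencil updates. Every name here is invented for illustration; a real Cactus run would exchange across machines with MPI.

```python
import numpy as np

def exchange(left, right, g):
    """Swap g boundary cells between two neighbouring subdomains
    (stands in for one MPI message in each direction)."""
    left[-g:], right[:g] = right[g:2 * g].copy(), left[-2 * g:-g].copy()

def update(u):
    """One step of a 3-point smoothing stencil (stands in for the physics)."""
    u[1:-1] = 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]

g = 4                               # ghost width = steps between messages
left = np.random.rand(64 + g)       # interior + g ghost cells on the right
right = np.random.rand(64 + g)      # g ghost cells on the left + interior

for step in range(100):
    if step % g == 0:               # one exchange every g steps, not every step
        exchange(left, right, g)    # wider ghosts stay valid for g updates
    update(left)
    update(right)
```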

  21. Remote Visualization & Steering • Any viz client: LCA Vision, OpenDX, Amira • Remote viz data: streaming HDF5, auto-downsample • Changing any steerable parameter over HTTP • Parameters: physics, algorithms, performance
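
To make the steering loop concrete, here is a hedged sketch of the pattern (not the actual Cactus HTTP interface): the simulation reads a shared parameter table every iteration, while a tiny HTTP endpoint lets any client change a steerable parameter. The parameter names and the port are invented.

```python
import threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

params = {"dt": 0.5, "downsample": 2}        # steerable parameters (invented)

class Steer(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g.  curl "http://localhost:8080/?dt=0.25&downsample=4"
        for k, v in parse_qs(urlparse(self.path).query).items():
            if k in params:                   # accept only known parameters
                params[k] = type(params[k])(v[0])
        self.send_response(200)
        self.end_headers()
        self.wfile.write(repr(params).encode())

threading.Thread(target=HTTPServer(("", 8080), Steer).serve_forever,
                 daemon=True).start()

for it in range(5):                           # stand-in simulation loop
    print("iteration", it, "with", params)    # picks up steered values live
    time.sleep(params["dt"])
```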

  22. SC02 (Baltimore) demo: remote data access between AEI (Germany) and NCSA (USA) • Remote file access via HDF5 VFD/GridFTP: clients use a file URL (downsampling, hyperslabbing) • Visualization and analysis of simulation data from a 2 TB remote data file (e.g. grr, psi on a timestep; lapse for r<1, every other point) • Network Monitoring Service reports when more bandwidth is available

  23. Vision, Part II: Dynamic Distributed Computing Make apps able to respond to changing Grid environment... • Many new ideas • Consider: the Grid IS your computer: • Networks, machines, devices come and go • Dynamic codes, aware of their environment, seek out resources • Distributed and Grid-based thread parallelism • Begin to change the way you think about problems: think global, solve much bigger problems • Many old ideas • 1960’s all over again • How to deal with dynamic processes • processor management • memory hierarchies, etc

  24. New Paradigms for Dynamic Grids • Code/User/Infrastructure should be aware of environment • What Grid Services are available? • Discover resources available NOW, and their current state • What is my allocation? • What is the bandwidth/latency between sites? • Code/User/Infrastructure should be able to make decisions (see the sketch below) • A slow part of my simulation can run asynchronously…spawn it off! • New, more powerful resources just became available…migrate there! • Machine went down…reconfigure and recover! • Need more memory (or less!)…get it by adding (dropping) machines!
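
A sketch of the decision layer this slide calls for, with the information-service query stubbed out: a real code would ask a Grid information service (e.g. a GIIS) instead of returning the canned values below. The hostnames are taken from the account list on slide 18; everything else is invented.

```python
def query_resources():
    """Stub for 'what is available NOW?' (canned values for the sketch)."""
    return [
        {"host": "modi4.ncsa.uiuc.edu", "free_cpus": 32,  "mbit_s": 100},
        {"host": "mcurie.nersc.gov",    "free_cpus": 256, "mbit_s": 20},
    ]

def decide(needed_cpus, min_mbit_s):
    """Pick the first resource meeting the simulation's current needs."""
    for r in query_resources():
        if r["free_cpus"] >= needed_cpus and r["mbit_s"] >= min_mbit_s:
            return r
    return None                        # nothing suitable: stay where we are

target = decide(needed_cpus=64, min_mbit_s=50)
print("migrate to", target["host"] if target else "nowhere; stay put")
```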

  25. New Paradigms for Dynamic Grids • Code/User/Infrastructure should be able to publish to central server for tracking, monitoring, steering… • Unexpected event…notify users! • Collaborators from around the world all connect, examine simulation. • Rethink your Algorithms: Task farming, Vectors, Pipelines, etc all apply on Grids… The Grid IS your Computer!

  26. New Grid Applications: examples • Intelligent Parameter Surveys, Monte Carlos • May control other simulations! • Dynamic Staging: move to faster/cheaper/bigger machine (“Grid Worm”) • Need more memory? Need less? • Multiple Universe: create clone to investigate steered parameter (“Grid Virus”) • Automatic Component Loading • Needs of process change, discover/load/execute new calculation component on appropriate machine • Automatic Convergence Testing • from initial data or initiated during simulation • Look Ahead • spawn off and run coarser resolution to predict likely future • Spawn Independent/Asynchronous Tasks • send to cheaper machine, main simulation carries on • Routine Profiling • best machine/queue, choose resolution parameters based on queue • Dynamic Load Balancing: inhomogeneous loads, multiple grids

  27. Issues Raised by Grid Scenarios • Infrastructure: • Is it ubiquitous? Is it reliable? Does it work? • Security: • How does a user or process authenticate as it moves from site to site? • Firewalls? Ports? • How does the user/application get information about the Grid? • Need reliable, ubiquitous Grid information services • Portal, cell phone, PDA • What is a file? Where does it live? • Crazy Grid apps will leave pieces of files all over the world • Tracking • How does the user track the Grid simulation hierarchies?

  28. What can be done now? • Some current examples work now: building blocks for the future • Dynamic, adaptive distributed computing • Increase scaling from 15% to 70% • Migration: Cactus Worm • Spawning • Task Farm/Steering combination • If these can be done now, think what you could do tomorrow • We are developing tools to enable such scenarios for any application

  29. Dynamic Adaptive Distributed Computation (T. Dramlitsch, with Argonne/U. Chicago) • SDSC IBM SP: 1024 procs, 5x12x17 = 1020 (GigE: 100 MB/sec) • NCSA Origin Array: 256+128+128, 5x12x(4+2+2) = 480 • Sites linked by an OC-12 line (but only 2.5 MB/sec) • Dynamic adaptation: number of ghost zones, compression, … • These experiments: Einstein Equations (but could be any Cactus application) • Achieved: first runs 15% scaling; with new techniques 70-85% scaling, ~250 GF • Won “Gordon Bell Prize” (Supercomputing 2001, Denver)

  30. Dynamic Adaptation (compression on; ghost zones widened from 2 to 3) • Automatically load balance at t=0 • Automatically adapt to bandwidth/latency issues • Application has NO KNOWLEDGE of the machine(s) it is on, networks, etc • Adaptive techniques make NO assumptions about the network • Issues: if network conditions change faster than the adaptation… (a sketch of such a rule follows)
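
A minimal sketch of such an adaptation rule: widen the ghost zone when message latency dominates, switch on compression when bandwidth is the bottleneck. The measurements and thresholds below are invented; the real code derives them by timing actual traffic.

```python
def adapt(latency_s, volume_mb, bandwidth_mbit_s, state):
    """Adjust ghost width and compression from measured link behaviour."""
    transfer_s = volume_mb * 8 / bandwidth_mbit_s   # time spent moving bytes
    if latency_s > transfer_s:
        state["ghosts"] = min(state["ghosts"] + 1, 5)   # amortize latency
    if bandwidth_mbit_s < 25:
        state["compress"] = True                        # trade CPU for bytes
    return state

state = {"ghosts": 2, "compress": False}
# e.g. a trans-Atlantic link: 100 ms latency, 2 MB per exchange, 20 Mbit/s
print(adapt(0.1, 2.0, 20.0, state))   # -> {'ghosts': 2, 'compress': True}
```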

  31. Cactus Worm: Basic Scenario • Cactus simulation (could be anything) starts, launched from a portal • (Queries a Grid Information Server, finds available resources) • Migrates itself to next site, according to some criterion • Registers new location to GIS, terminates old simulation • User tracks/steers, using HTTP, streaming data, etc. • Continues around Europe… • If we can do this, much of what we want can be done! (a schematic control loop follows)
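
A schematic of the Worm’s control loop, with every Grid operation stubbed out; none of these function names come from the actual Cactus code.

```python
def find_next_site():     return "origin.aei-potsdam.mpg.de"  # GIS query stub
def checkpoint(sim):      return f"{sim}.chkpt.h5"            # write state
def launch(site, chkpt):  print(f"restarting from {chkpt} on {site}")
def register(site):       print(f"registered new location {site} with GIS")

def migrate(sim, criterion_met):
    """One hop of the Worm: checkpoint, restart elsewhere, re-register."""
    if not criterion_met:
        return                      # e.g. queue time not yet exhausted
    site = find_next_site()         # query the Grid Information Server
    launch(site, checkpoint(sim))   # start the successor from the checkpoint
    register(site)                  # then the old simulation terminates

migrate("bh_collision", criterion_met=True)
```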

  32. Determining When to Migrate: Contract Monitor • GrADS project activity: Foster, Angulo, Cactus • Establish a “Contract” • Driven by user-controllable parameters • Time quantum for “time per iteration” • % degradation in time per iteration (relative to prior average) before noting violation • Number of violations before migration • Potential causes of violation • Competing load on CPU • Computation requires more processing power: e.g., mesh refinement, new sub-computation • Hardware problems • Going too fast! Using too little memory? Why waste a resource??
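
A sketch of that contract logic, assuming invented parameter names and values (the real GrADS monitor is richer): compare each iteration time with a rolling average and migrate after N consecutive violations.

```python
from collections import deque

class Contract:
    def __init__(self, max_degradation=0.20, violations_allowed=3):
        self.history = deque(maxlen=10)     # prior per-iteration times
        self.max_deg = max_degradation      # tolerated fractional slowdown
        self.limit = violations_allowed     # strikes before migrating
        self.strikes = 0

    def record(self, t_iter):
        """Return True when the contract is broken and migration should start."""
        avg = sum(self.history) / len(self.history) if self.history else t_iter
        violated = t_iter > avg * (1 + self.max_deg)
        self.strikes = self.strikes + 1 if violated else 0
        self.history.append(t_iter)
        return self.strikes >= self.limit

c = Contract()
for t in [1.0, 1.0, 1.1, 1.5, 1.6, 1.9]:    # competing load appears
    if c.record(t):
        print("contract broken after 3 violations: migrate")
```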

  33. Migration Experiments: Contract-Based (Foster, Angulo, GrADS, Cactus Team…) • Running at UC • Load applied • 3 successive contract violations • Resource discovery & migration • Running at UIUC (migration time not to scale)

  34. Spawning: SC2001 Demo • Black hole collision simulation • Every n timesteps, time-consuming analysis tasks are done • Process output data, find gravitational waves, horizons • They take much time • Processes do not run well in parallel • Solution: User invokes “Spawner” • Analysis tasks outsourced • Globus-enabled resource • Resource Discovery (only mocked up last year) • login, data transfer • Remote jobs started up • Main simulation can keep going without pausing • Except to spawn: may be time-consuming itself • It worked!

  35. Spawning on ARG Testbed • Main Cactus BH simulation starts here • All analysis tasks spawned automatically to free resources worldwide • User only has to invoke “Spawner” thorn… (a minimal sketch of the pattern follows)
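
The pattern is easy to sketch, assuming a local thread pool as a stand-in for the Globus-submitted remote jobs (all names invented): every n timesteps the analysis is handed off so the main evolution never waits.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def analyze(step):
    """Expensive, poorly-parallel task: horizons, wave extraction, ..."""
    time.sleep(0.2)
    print(f"  [spawned] analysis of step {step} done")

spawner = ThreadPoolExecutor(max_workers=4)  # stand-in for remote resources
n = 5                                        # analyse every n timesteps

for step in range(1, 21):
    if step % n == 0:
        spawner.submit(analyze, step)        # outsource; don't pause
    # ... main evolution continues immediately ...
spawner.shutdown(wait=True)
```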

  36. Task Farming/Steering Combo • Need to run large simulation, with dozens of input parameters • Selection is trial and error (resolution, boundary, etc) • Remember the look-ahead scenario? Run at lower resolution to predict the likely outcome! • Task farm dozens of smaller jobs across the Grid to set initial parameters for the big run • Task farm manager sends out jobs across resources, collects results • Lowest-error parameter set chosen • Main simulation starts with “best” parameters • If possible, low-resolution jobs with different parameters can be cloned off at various intervals, run for a while, and results returned to steer coordinate parameters for the next phase of evolution (see the sketch below)
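
A hedged sketch of that combination, with an invented error metric standing in for a real low-resolution run: farm the candidate parameter sets out, collect an error measure from each, and hand the best set to the main simulation.

```python
from concurrent.futures import ProcessPoolExecutor

def trial_run(params):
    """Stand-in for a cheap low-resolution simulation returning an error."""
    resolution, boundary = params
    error = abs(resolution - 0.3) + abs(boundary - 1.0)  # invented metric
    return error, params

# candidate (resolution, boundary) pairs to try, chosen by trial and error
candidates = [(r / 10, b) for r in range(1, 6) for b in (0.5, 1.0, 2.0)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as farm:      # stand-in for Grid resources
        error, best = min(farm.map(trial_run, candidates))
    print(f"main simulation starts with {best} (trial error {error:.2f})")
```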

  37. SC2002 Demo • Main Cactus BH simulation starts here • Dozens of low-resolution jobs sent out to test parameters • Data returned for main job • Huge job generates remote data to be visualized in Baltimore

  38. Future Dynamic Grid Computing (Brill wave scenario) • “We see something, but too weak. Please simulate to enhance signal!” • Physicist has new idea!

  39. Future Dynamic Grid Computing (AEI, NCSA, RZG, LRZ, SDSC) • Find best resources • Free CPUs!! Add more resources • Queue time over, find new machine • Found a black hole: load new component, look for horizon • Calculate/output invariants, further calculations • Clone job with steered parameter • Calculate/output grav. waves • Archive data; archive to LIGO experiment

  40. Grid Application Toolkit (GAT) • Application developer should be able to build simulations with tools that easily enable dynamic Grid capabilities • Want to build a programming API to easily allow: • Query information server (e.g. GIIS) • What’s available for me? What software? How many processors? • Network Monitoring • Decision Routines • How to decide? Cost? Reliability? Size? • Spawning Routines • Now start this up over here, and that up over there • Authentication Server • Issues commands, moves files on your behalf • Data Transfer • Use whatever method is desired (GsiFTP, streamed HDF5, scp…) • Etc. … Apps themselves find, and become, services (an API sketch follows)
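
To show the flavour of such an API, here is a toy sketch with invented class and method names; the real GAT (and its successor SAGA) define their own interfaces.

```python
class GridToolkit:
    """Invented stand-in for the GAT-style API described above."""
    def query_info(self, what):              # GIIS query: software, CPUs, ...
        return {"free_cpus": 128, "software": ["cactus", "hdf5"]}
    def monitor_network(self, a, b):         # bandwidth/latency between sites
        return {"mbit_s": 45, "latency_ms": 80}
    def spawn(self, site, task):             # "start this up over there"
        print(f"spawning {task} on {site}")
    def transfer(self, url, dest, method="gsiftp"):
        print(f"moving {url} -> {dest} via {method}")

gat = GridToolkit()
if gat.query_info("cpus")["free_cpus"] >= 64:        # decision routine
    gat.spawn("remote.site.example", "wave-analysis")
    gat.transfer("gsiftp://remote.site.example/out.h5", "/local/data")
```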

  41. Grid Related Projects • GridLab www.gridlab.org • Enabling these scenarios • ASC: Astrophysics Simulation Collaboratory www.ascportal.org • NSF funded (WashU, Rutgers, Argonne, U. Chicago, NCSA) • Collaboratory tools, Cactus Portal • Global Grid Forum (GGF) www.gridforum.org • Applications Working Group • GrADS: Grid Application Development Software www.isi.edu/grads • NSF funded (Rice, NCSA, U. Illinois, UCSD, U. Chicago, U. Indiana...) • TIKSL/GriKSL www.griksl.org • German DFN funded: AEI, ZIB, Garching • Remote online and offline visualization, remote steering/monitoring • Cactus Team www.CactusCode.org • Dynamic distributed computing …

  42. Summary • Grid computing has many promises • Making today’s computing easier by managing resources better • Creating an environment for advanced apps of tomorrow • In order to do this, your applications must be portable! • Rethink your problems for the Grid Computer • Distributed computing, grid threads, etc • Other forms of parallelism: Task farming, spawning, migration • Data access, data mining etc: all much more powerful • Next, we’ll show you • Demos of what you can do now • How to prepare for the future
