Parallelism and Distributed Applications

Daniel S. Katz • Director, Cyberinfrastructure and User Services, Center for Computation & Technology • Associate Research Professor, Electrical and Computer Engineering Department

Context • Scientific/Engineering applications • Complex, multi-physics, multiple time scales, multiple spatial scales • Physics components • Elements such as I/O, solvers, etc. • Computer Science components • Parallelism across components • Parallelism within components, particularly physics components • Goal: efficient application execution on both parallel and distributed platforms • Goal: simple, reusable programming

Types of Systems • A lot of levels/layers to be aware of: • Individual computers • Many layers of memory hierarchy • Multi-core -> many-core CPUs • Clusters • Used to be reasonably-tightly coupled computers (1 CPU per node) or SMPs (multiple CPUs per node) • Grid elements • Individual computers • Clusters • Networks • Instruments • Data stores • Visualization systems • Etc…

Types of Applications • Applications can be broken up into pieces (components) • Size (granularity) and relationship of pieces is key • Fairly large pieces, no dependencies • Parameter sweeps, Monte Carlo analysis, etc. • Fairly large pieces, some dependencies • Multi-stage applications - PHOEBUS • Workflow applications - Montage • Data grid apps? • Large pieces, tight dependencies (coupling, components?) • Distributed viz, coupled apps - Climate • Small pieces, no dependencies • Small pieces, some dependencies • Dataflow? • Small pieces, tight dependencies • MPI apps • Hybrids?

Parallelism within programs • Initial parallelism: bitwise/vector (SIMD) • “Highly computational tasks often contain substantial amounts of concurrency. At LLL the majority of these programs use very large, two-dimensional arrays in a cyclic set of instructions. In many cases, all new array values could be computed simultaneously, rather than stepping through one position at a time. To date, vectorization has been the most effective scheme for exploiting this concurrency. However, pipelining and independent multiprocessing forms of concurrency are also available in these programs, but neither the hardware nor the software exist to make it workable.” (James R. McGraw, Data Flow Computing: The VAL Language, MIT Computational Structures Group Memo 188, 1980) • Westinghouse’s Solomon introduced vector processing, early 1960s • Continued in ILLIAC IV, ~1970s • Goodyear MPP, 128x128 array of 1-bit processors, ~1980

Unhappy with your programming model?

Parallelism across programs • Co-operating Sequential Processes (CSP) - E. W. Dijkstra, The Structure of the “THE”-Multiprogramming System, 1968 • “We have given full recognition of the fact that in a single sequential process … only the time succession of the various states has a logical meaning, but not the actual speed with which the sequential process is performed. Therefore we have arranged the whole system as a society of sequential processes, progressing with undefined speed ratios. To each user program … corresponds a sequential process …” • “This enabled us to design the whole system in terms of these abstract "sequential processes". Their harmonious co-operation is regulated by means of explicit mutual synchronization statements. … The fundamental consequence of this approach … is that the harmonious co-operation of a set of such sequential processes can be established by discrete reasoning; as a further consequence the whole harmonious society of co-operating sequential processes is independent of the actual number of processors available to carry out these processes, provided the processors available can switch from process to process.”

Parallelism within programs (2) • MIMD • Taxonomy from Flynn, 1972 • Dataflow parallelism • “The data flow concept incorporates these forms of concurrency in one basic graph-oriented system. Every computation is represented by a data flow graph. The nodes … represent operations, the directed arcs represent data paths.” (McGraw, ibid) • “The ultimate goal of data flow software must be to help identify concurrency in algorithms and map as much as possible into the graphs.” (McGraw ibid) • Transputer - 1984 • programmed in occam • Uses CSP formalism, communication through named channels • MPPs - mid 1980s • Explicit message passing (CSP) • Other models: actors, Petri nets, …

PHOEBUS • This matrix problem is filled and solved by PHOEBUS • The K submatrix is a sparse finite element matrix • The Z submatrices are integral equation matrices • The C submatrices are coupling matrices between the FE and IE equations • 1996! - 3 executables, 2+ programming models, executables run sequentially [Figure: mesh, system of equations, MPP machine] Credit: Katz, Cwik, Zuffada, Jamnejad

Cholesky Factorization • SuperMatrix work - Chan and van de Geijn, Univ. of Texas, in progress • Based on FLAME library • Aimed at NUMA systems, OpenMP programming model • Initial realization: poor performance of LAPACK (w/ multithreaded BLAS) could be fixed by choosing a different variant Credit: Ernie Chan

Cholesky Factorization • Can represent as DAG [Figure: DAG of Chol, Trsm, Syrk, and Gemm tasks across Iteration 1, Iteration 2, Iteration 3, …] Credit: Ernie Chan

Cholesky SuperMatrix • Execute DAG tasks in parallel, possibly “out-of-order” • Similar in concept to Tomasulo’s algorithm and instruction-level parallelism on blocks of computation • Superscalar -> SuperMatrix Credit: Ernie Chan
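The DAG idea on the two Cholesky slides can be sketched in a few lines. The following is a minimal illustration, not the SuperMatrix or FLAME code: it assumes a right-looking blocked Cholesky on an n x n grid of blocks, and the names `cholesky_tasks`, `build_dag`, and `schedule_waves` are hypothetical. Tasks are listed in sequential program order, dependencies are inferred from which blocks each task reads and writes, and every task whose inputs are complete is released at once, which is what permits "out-of-order" execution across iterations.

```python
def cholesky_tasks(n):
    """Yield (task, blocks_read, block_written) in sequential program order.

    Assumes a right-looking blocked Cholesky on the lower triangle of an
    n x n grid of blocks. Each task also updates its output block in place.
    """
    for k in range(n):
        yield ("Chol", k), [], (k, k)
        for i in range(k + 1, n):
            yield ("Trsm", i, k), [(k, k)], (i, k)
        for i in range(k + 1, n):
            yield ("Syrk", i, k), [(i, k)], (i, i)
            for j in range(k + 1, i):
                yield ("Gemm", i, j, k), [(i, k), (j, k)], (i, j)


def build_dag(n):
    """Map each task to the set of tasks it must wait for.

    Only the last writer of each block is tracked. That is sufficient here
    because no block is rewritten after another task has started reading it.
    """
    last_writer = {}
    deps = {}
    for task, reads, written in cholesky_tasks(n):
        touched = list(reads) + [written]
        deps[task] = {last_writer[b] for b in touched if b in last_writer}
        last_writer[written] = task
    return deps


def schedule_waves(deps):
    """Group tasks into waves; all tasks in one wave could run in parallel."""
    remaining = {task: set(d) for task, d in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(schedule_waves(build_dag(4)), start=1):
        print(number, wave)
```

In this sketch `("Chol", 1)` depends only on `("Syrk", 1, 0)`, so a scheduler with limited workers may start the second iteration's factorization before the rest of the first iteration's updates have finished.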

Uintah Framework • de St. Germain, McCorquodale, Parker, Johnson at SCI Institute, Univ. of Utah • Based on task graph model • Each algorithm defines a description of computation • Required inputs and outputs • Callbacks to perform a task on a single region of space • Communication performed at graph edges • Graph created by Uintah

Uintah Tensor Product Task Graph [Figure: Master Graph (explicitly defined) and Detailed Graph (implicitly defined)] • Each task is replicated over regions in space • Expresses data parallelism and task parallelism • Resulting detailed graph is tensor product of master graph and spatial regions • Efficient: • Detailed tasks not replicated on all processors • Scalable: • Control structure known globally • Communication structure known locally • Dependencies specified implicitly w/ simple algebra • Spatial dependencies • Computes: • Variable (name, type) • Patch subset • Requires: • Variable (name, type) • Patch subset • Halo specification • Other dependencies: AMR, others Credit: Steve Parker
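As a rough illustration of the tensor-product idea, the sketch below expands a small master graph over a 1D row of patches. It is not Uintah's API: the task names, variable names, the `MASTER` table, and `detailed_graph` are all hypothetical, and a "halo" is simplified to a number of neighboring patches on each side. Edges are derived from the computes/requires declarations alone, which is the sense in which dependencies are specified implicitly.

```python
from itertools import product

# Hypothetical master graph: task -> (requires {variable: halo}, computes {variables}).
MASTER = {
    "interpolate": ({"particles": 0}, {"grid_velocity"}),
    "solve": ({"grid_velocity": 1}, {"grid_accel"}),
    "update": ({"grid_accel": 0, "particles": 0}, {"particles_new"}),
}


def detailed_graph(master, n_patches):
    """Expand the master graph over patches 0..n_patches-1.

    Returns (tasks, edges): tasks are (name, patch) pairs, and edges run from
    the detailed task that computes a variable on a patch to each detailed
    task that requires it, including halo neighbors.
    """
    producer = {
        var: name for name, (_, computes) in master.items() for var in computes
    }
    tasks = list(product(master, range(n_patches)))
    edges = set()
    for name, patch in tasks:
        requires, _ = master[name]
        for var, halo in requires.items():
            if var not in producer:
                continue  # initial data, not computed by any task in this graph
            low = max(0, patch - halo)
            high = min(n_patches, patch + halo + 1)
            for neighbor in range(low, high):
                edges.add(((producer[var], neighbor), (name, patch)))
    return tasks, edges


if __name__ == "__main__":
    tasks, edges = detailed_graph(MASTER, 4)
    print(len(tasks), "detailed tasks,", len(edges), "edges")
```

Because each detailed task only needs the edges that touch its own patch, a processor could build just its local slice of this graph rather than the whole product.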

Uintah - How It Works Credit: Steve Parker

Uintah - More Details • Task graphs can be complex • Can include loops, nesting, recursion • Optimal scheduling is NP-hard • “Optimal enough” scheduling isn’t too hard • Creating schedule can be expensive • But may not be done too often • Overall, good scaling and performance have been obtained with this approach Credit: Steve Parker

Applications and Grids • How to map applications to grids? • Some applications are Grid-unaware - they just want to run fast • May run on Grid-aware (Grid-enabled?) programming environments, e.g. MPICH-G2, MPIg • Other apps are Grid-aware themselves • This is where SAGA fits in, as an API to permit the apps to interact with the middleware [Figure: layer diagram with Grid-unaware applications, Grid-aware applications, Grid-enabled tools/environments, Simple API (SAGA), Middleware, and Grid resources, services, platforms] Credit: Thilo Kielmann

Common Grid Applications • Data processing • Data exists on the grid, possibly replicated • Data is staged to a single set of resources • Application starts on that set of resources • Parameter sweeps • Lots of copies of a sequential/parallel job launched on independent resources, with different inputs • Controlling process starts jobs and gathers outputs
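The parameter-sweep pattern can be sketched with a local thread pool standing in for independent grid resources. This is only an illustration under that assumption: `simulate` is a hypothetical stand-in for the real job, and on a grid each call would be a job submission to a remote resource rather than a local function call.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(params):
    """Hypothetical stand-in for one independent job with its own inputs."""
    x, y = params
    return x * x + y


def parameter_sweep(job, inputs, max_workers=4):
    """Controlling process: start one job per input, then gather the outputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = pool.map(job, inputs)  # results come back in input order
        return dict(zip(inputs, outputs))


if __name__ == "__main__":
    sweep = [(x, y) for x in range(3) for y in range(2)]
    for params, result in parameter_sweep(simulate, sweep).items():
        print(params, result)
```

The jobs share nothing and never communicate, which is why this class of application tolerates slow or unreliable links between resources.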

More Common Grid Applications • Workflow applications • Multiple units of work, either sequential or parallel, either small or large • Data often transferred between tasks by files • Task sequence described as a graph, possibly a DAG • Abstract graph doesn’t include resource information • Concrete graph does • Some process/service converts graph from abstract to concrete • Often all at once, ahead of job start - static mapping • Perhaps more gradually (JIT?) - dynamic mapping • Pegasus from ISI is an example of this, currently static • (Note: Parameter sweeps are very simple workflows)
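The abstract-to-concrete step can be illustrated with a small sketch. This is not how Pegasus is implemented: the function `make_concrete`, the dictionary layout, and the site names are hypothetical. It takes an abstract graph (tasks and parents only) plus a placement chosen ahead of time (static mapping) and produces a concrete graph that carries resource information and adds a transfer node wherever a file must cross sites.

```python
def make_concrete(abstract, placement):
    """Turn an abstract workflow into a concrete one for a fixed placement.

    abstract:  task -> list of parent tasks (no resource information)
    placement: task -> site name, decided before the run starts
    """
    concrete = {}
    for task, parents in abstract.items():
        deps = []
        for parent in parents:
            if placement[parent] == placement[task]:
                deps.append(parent)
                continue
            # Parent output lives on another site: insert a data-transfer node.
            transfer = f"transfer:{parent}->{task}"
            concrete[transfer] = {
                "site": f"{placement[parent]}->{placement[task]}",
                "deps": [parent],
            }
            deps.append(transfer)
        concrete[task] = {"site": placement[task], "deps": deps}
    return concrete


if __name__ == "__main__":
    abstract = {"project": [], "diff": ["project"], "add": ["diff"]}
    placement = {"project": "clusterA", "diff": "clusterA", "add": "clusterB"}
    for node, info in make_concrete(abstract, placement).items():
        print(node, info)
```

A dynamic mapper would call something like this repeatedly, placing only the tasks that are about to become runnable.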

Montage - a Workflow App • An astronomical image mosaic service for the National Virtual Observatory • http://montage.ipac.caltech.edu/ • Delivers custom, science grade image mosaics • Image mosaic: combine many images so that they appear to be a single image from a single telescope or spacecraft • User specifies projection, coordinates, spatial sampling, mosaic size, image rotation • Preserve astrometry (to 0.1 pixels) & flux (to 0.1%) • Modular, portable “toolbox” design • Loosely-coupled engines • Each engine is an executable compiled from ANSI C [Figures: 100 µm sky, aggregation of COBE and IRAS maps (Schlegel, Finkbeiner and Davis, 1998), covering 360 x 180 degrees in CAR projection; supernova remnant S147, from IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane; David Hockney, Pearblossom Highway, 1986]

Montage Workflow [Figure: three input images (1, 2, 3) pass through mProject 1, mProject 2, mProject 3; mDiff 1 2 and mDiff 2 3 produce difference images D12 and D23; mFitplane D12 and mFitplane D23 fit planes (ax + by + c = 0, dx + ey + f = 0); mConcatFit and mBgModel produce per-image corrections (a1x + b1y + c1 = 0, a2x + b2y + c2 = 0, a3x + b3y + c3 = 0); mBackground 1, mBackground 2, mBackground 3 apply them; mAdd 1 and mAdd 2 produce the final mosaic (overlapping tiles)]

Montage on the Grid Using Pegasus (Planning for Execution on Grids) [Figure: example DAG for 10 input files, with mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, and mAdd as Montage compute nodes, plus data stage-in nodes, data stage-out nodes, and registration nodes] • Pegasus (http://pegasus.isi.edu/): maps an abstract workflow to an executable form • Grid Information Systems: information about available resources, data location • Condor DAGMan: executes the workflow • MyProxy: user’s grid credentials • Grid

Montage Performance • MPI version on a single cluster is baseline • Grid version on a single cluster has similar performance for large problems • Grid version on multiple clusters has performance dominated by data transfer between stages

Workflow Application Issues • Apps need to map processing to clusters • Depending on mapping, various data movement is needed, so the mapping either leads to networking requirements or is dependent on the available networking • Prediction (and mapping) needs some intelligence • One way to do this is through Pegasus, which currently does static mapping of an abstract workflow to a concrete workflow, but will do more dynamic mapping at some future point • Networking resources and availability could be inputs to Pegasus, or Pegasus could be used to request network resources at various times during a run.

Making Use of Grids • In general, groups of users (communities) want to run applications • Code/User/Infrastructure is aware of environment and does: • Discover resources available now (or perhaps later) • Start my application • Have access to data and storage • Monitor and possibly steer the application • Other things that could be done: • Migrate app to faster resources that are now available • Recover from hardware failure by continuing with fewer processors or by restarting from checkpoint on different resources • Use networks as needed (reserve them for these times) Credit: Thilo Kielmann and Gabrielle Allen

Less Common Grid Applications • True distributed MPI application over multiple resources/clusters • Other applications that use multiple coupled clusters • Uncommon because these jobs run poorly without sufficient network bandwidth, and there has been no good way for users to reserve bandwidth when needed

SPICE • Used for analyzing RNA translocation through protein pores • Using “standard” molecular dynamics would need millions of CPU hours • Instead, use Steered Molecular Dynamics and Jarzynski’s Equation (SMD-JE) • Uses static visualization to understand structural features • Uses interactive simulations to determine “near-optimal” parameters • Uses haptic interaction - requires low-latency bi-directional communication between user and simulation • Uses “near-optimal” parameters and many large parallel simulations to determine “optimal” parameters • ~75 simulations on 128/256 processors • Uses “optimal” parameters to calculate full free energy profile along axis of pore • ~100 simulations on 2500 processors Credit: Shantenu Jha, et al.
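Jarzynski's equality relates the free energy difference to an average over the work done in many non-equilibrium pulls: exp(-ΔF/kT) = ⟨exp(-W/kT)⟩. The sketch below shows why many independent simulations are useful, since each one contributes a single work sample. It is not the SPICE code; the function name and the sample values are hypothetical, and the work values and kT are assumed to be in the same energy units.

```python
import math


def jarzynski_free_energy(work_samples, kT):
    """Estimate ΔF = -kT * ln(mean(exp(-W / kT))) from work samples.

    The smallest work value is factored out before exponentiating so the
    sum does not underflow when W is large compared with kT.
    """
    if not work_samples:
        raise ValueError("need at least one work sample")
    w_min = min(work_samples)
    total = sum(math.exp(-(w - w_min) / kT) for w in work_samples)
    return w_min - kT * math.log(total / len(work_samples))


if __name__ == "__main__":
    # Hypothetical work values, one per steered pull.
    samples = [12.4, 10.9, 13.7, 11.2, 12.0]
    print(jarzynski_free_energy(samples, kT=0.6))
```

The exponential average is dominated by the rare low-work pulls, so the estimate is never larger than the mean work and improves as more independent simulations are added.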

NEKTAR • Simulates arterial blood flow • Uses hybrid approach • 3D detailed CFD computed at bifurcations • Waveform coupling between bifurcations modeled w/ reduced set of 1D equations • 55 largest arteries in human body w/ 27 bifurcations would require about 7 TB memory • Parallelized across and within clusters [Figure: sites PSC, NCSA, TACC, SDSC] Credit: Shantenu Jha, et al.

Cactus • Freely available, modular, portable and manageable environment for collaboratively developing parallel, high-performance multi-dimensional simulations (components-based) • Developed for numerical relativity, but now general framework for parallel computing (CFD, astro, climate, chem. eng., quantum gravity, etc.) • Finite difference, AMR, FE/FV, multipatch • Active user and developer communities, main development now at LSU and AEI • Science-driven design issues • Open source, documentation, etc. • Just over 10 years old Credit: Gabrielle Allen

Cactus Structure [Figure: Core “Flesh” (ANSI C) surrounded by Plug-In “Thorns” (modules, Fortran/C/C++); labels: extensible APIs, parameters, scheduling, error handling, make system, grid variables, driver, input/output, interpolation, SOR solver, coordinates, boundary conditions, black holes, equations of state, remote steering, wave evolvers, multigrid] Credit: Gabrielle Allen

Cactus and Grids • HTTPD thorn, allows web browser to connect to running simulation, examine state of running simulation, change parameters • Worm thorn, makes Cactus app self-migrating • Spawner thorn, any routine can be done on another resource • TaskFarm, allows distributing of apps on Grid • Run a single app using distributed MPI Credit: Gabrielle Allen, Erik Schnetter

EnLIGHTened • Network research, driven by concrete application projects, all of which critically require progress in network technologies and tools that utilize them • EnLIGHTened testbed: 10 Gbps optical networks running over NLR. Four all-photonic Calient switches are interconnected via Louisiana Optical Network Initiative (LONI), EnLIGHTened wave, and the Ultralight wave, all using GMPLS control plane technologies. • Global alliance of partners • Will develop, test, and disseminate advanced software and underlying technologies to: • Provide generic applications with the ability to be aware of their network, Grid environment and capabilities, and to make dynamic, adaptive and optimized use (monitor & abstract, request & control) of networks connecting various high end resources • Provide vertical integration from the application to the optical control plane, including extending GMPLS • Will examine how to distribute the network intelligence among the network control plane, management plane, and the Grid middleware

EnLIGHTened Team • Savera Tanwir • Harry Perros • Mladen Vouk • Yufeng Xin • Steve Thorpe • Bonnie Hurst • Joel Dunn • Gigi Karmous-Edwards • Mark Johnson • John Moore • Carla Hunt • Lina Battestilli • Andrew Mabe • Ed Seidel • Gabrielle Allen • Seung Jong Park • Jon MacLaren • Andrei Hutanu • Lonnie Leger • Dan Katz • Olivier Jerphagnon • John Bowers • Steven Hunter • Rick Schlichting • John Strand • Matti Hiltunen • Javad Boroumand • Russ Gyurek • Wayne Clark • Kevin McGrattan • Peter Tompsu • Yang Xia • Xun Su • Dan Reed • Alan Blatecky • Chris Heermann

EnLIGHTened Testbed [Figure: map of the testbed with sites SEA, POR, BOI, OGD, DEN, KAN, CHI, CLE, PIT, WDC, SVL, San Diego, TUL, DAL, HOU, and VCL @NCSU; links to Canada, Asia, and Europe; waves shown: CAVE wave, EnLIGHTened wave (Cisco/NLR), Cisco/UltraLight wave, LONI wave] • Members: • MCNC GCNS • LSU CCT • NCSU • RENCI • Official Partners: • AT&T Research • SURA • NRL • Cisco Systems • Calient Networks • IBM • NSF Project Partners • OptIPuter • UltraLight • DRAGON • Cheetah • International Partners • Phosphorus - EC • G-lambda - Japan • GLIF

HARC: Highly Available Robust Co-allocator • Extensible, open-sourced co-allocation system • Can already reserve: • Time on supercomputers (advance reservation), and • Dedicated paths on GMPLS-based networks with simple topologies • Uses Paxos Commit to atomically reserve multiple resources, while providing a highly-available service • Used to coordinate bookings across EnLIGHTened and G-lambda testbeds in largest demonstration of its kind to date (more later) • Used for setting up the network for Thomas Sterling’s HPC Class which goes out live in HD (more later) Credit: Jon MacLaren
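The all-or-nothing property of co-allocation can be shown with a deliberately simplified sketch. This is not HARC's code and it is not Paxos Commit: it uses a plain single-coordinator prepare/commit scheme, which gives atomicity but none of the high availability that Paxos Commit adds by replicating the decision. The class `ResourceManager`, the function `co_allocate`, and the slot strings are hypothetical.

```python
class ResourceManager:
    """Hypothetical manager for one resource (a cluster, a network path)."""

    def __init__(self, name, busy=()):
        self.name = name
        self.booked = set(busy)  # slots already reserved
        self.held = set()        # slots tentatively held during a request

    def prepare(self, slot):
        """Tentatively hold the slot; refuse if it is already taken."""
        if slot in self.booked or slot in self.held:
            return False
        self.held.add(slot)
        return True

    def commit(self, slot):
        self.held.discard(slot)
        self.booked.add(slot)

    def abort(self, slot):
        self.held.discard(slot)


def co_allocate(managers, slot):
    """Reserve the same slot on every resource, or on none of them."""
    prepared = []
    for manager in managers:
        if manager.prepare(slot):
            prepared.append(manager)
        else:
            for held in prepared:  # release everything held so far
                held.abort(slot)
            return False
    for manager in managers:
        manager.commit(slot)
    return True


if __name__ == "__main__":
    cluster = ResourceManager("cluster")
    path = ResourceManager("network path", busy={"10:00"})
    print(co_allocate([cluster, path], "10:00"))  # network path is busy
    print(co_allocate([cluster, path], "11:00"))
```

In this simplified scheme a coordinator crash between the two loops would leave slots held indefinitely, which is the kind of failure a replicated commit protocol is meant to survive.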

Request Network bandwidth and Computers Request Network bandwidth and Computers CRM CRM CRM CRM CRM CRM Reservation From xx:xx to yy:yy Reservation From xx:xx to yy:yy EL NRM KDDI NRM NTT NRM CRM CRM CRM CRM CRM CRM CRM CRM Cluster Cluster Cluster Cluster Cluster Cluster Cluster Cluster Application (Visualization) Application (MPI) US JAPAN

Data grid applications • Remote visualization • Data is somewhere, needs to flow quickly and smoothly to a visualization app • Data could be simulation results, or measured data

iGrid 2005 demo Visualization at LSU Interaction among San Diego, LSU, Brno Data on remote LONI machines Distributed Viz/Collaboration

Video for visualization • But also for videoconference between the three sites • 1080i (1920x1080, 60fps interlaced): • 1.5 Gbps / unidirectional stream, 4.5 Gbps each site (two incoming, one outgoing streams) • Jumbo frames (9000 bytes), Layer 2 lossless (more or less) dedicated network • Hardware capture: • DVS Centaurus (HD-SDI) + DVI -> HD-SDI converter from Doremi Credit: Andrei Hutanu
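The bandwidth figures can be checked with a short calculation. The sketch assumes each stream carries the full HD-SDI raster for 1080i: 2200 samples per line and 1125 lines per frame including blanking, 30 frames per second (60 interlaced fields), and 20 bits per sample position (10-bit luma plus 10-bit chroma, 4:2:2). Those raster parameters are assumptions taken from the HD-SDI format rather than from the slide, and the function name is hypothetical.

```python
def hd_sdi_rate_bps(samples_per_line=2200, lines_per_frame=1125,
                    frames_per_s=30, bits_per_sample=20):
    """Raw serial rate of one stream, blanking included (assumed raster)."""
    return samples_per_line * lines_per_frame * frames_per_s * bits_per_sample


STREAMS_PER_SITE = 3           # two incoming, one outgoing
JUMBO_PAYLOAD_BITS = 9000 * 8  # payload only, protocol headers ignored

stream_bps = hd_sdi_rate_bps()
site_bps = STREAMS_PER_SITE * stream_bps
packets_per_s = stream_bps / JUMBO_PAYLOAD_BITS

if __name__ == "__main__":
    print(f"per stream: {stream_bps / 1e9:.3f} Gbps")
    print(f"per site:   {site_bps / 1e9:.3f} Gbps")
    print(f"jumbo frames per second per stream: {packets_per_s:.0f}")
```

Under these assumptions one stream is 1.485 Gbps and a site handles about 4.455 Gbps, which matches the rounded 1.5 Gbps and 4.5 Gbps on the slide.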

Hardware setup – one site Credit: Andrei Hutanu

Video distribution • Done in software (multicast not up to speed, optical multicast complicated to set up). Can do 1:4 distribution with high-end Opteron workstations. • HD class 1-to-n • Only one stream is distributed - the one showing the presenter (Thomas Sterling) - others are just to LSU

Data analysis • Future scenario motivated by increases in network speed • Possibilities of simulations to store results locally are limited • Downsampling the output, not storing all data • Use remote (distributed, possibly virtual) storage • Can store all data • This will enable new types of data analysis Credit: Andrei Hutanu

Components • Storage • high-speed distributed file systems or virtual RAM disks • potential use cases: global checkpointing facility; data analysis using the data from this storage • distribution could be determined by the analysis routines • Data access • Various data selection routines gather data from the distributed storage elements (storage supports app-specific operations) Credit: Andrei Hutanu

More Components • Data transport • Components of the storage are connected by various networks. May need to use different transport protocols • Analysis (visualization or numerical analysis) • Initially single-machine but can also be distributed • Data source • computed in advance and preloaded on the distributed storage initially • or live streaming from the distributed simulation Credit: Andrei Hutanu

Conclusions • Applications exist where infrastructure exists that enables them • Very few applications (and application authors) can afford to get ahead of the infrastructure • We can run the same (grid-unaware) applications on more resources • Perhaps add features such as fault tolerance • Use SAGA to help here?

SAGA • Intent: SAGA is to grid apps what MPI is to parallel apps • Questions/metrics: • Does SAGA enable rapid development of new apps? • Does it allow complex apps with less code? • Is it used in libraries? • Roots: Reality Grid (ReG Steering Library), GridLab (GAT), and others came together at GGF • Strawman API: • Uses SIDL (from Babel, CCA) • Language independent spec. • OO base design - can adapt to procedural languages • Status: • Started between GGF 11 & GGF 12 (July/Aug 2004) • Draft API submitted to OGF early Oct. 2006 • Currently, responding to comments…

More Conclusions • Infrastructure is getting better • Middleware developers are working on some of the right problems • If we want to keep doing the same things better • And add some new things (grid-aware apps) • Web 3.1 is coming soon… • We’re not driving the distributed computing world • Have to keep trying new things
