
Making progress with multi-tier programming


Presentation Transcript


  1. Making progress with multi-tier programming Scott B. Baden Daniel Shalit Department of Computer Science and Engineering University of California, San Diego

  2. Introducing Multi-tier Computers • Hierarchical construction • Two kinds of communication: slow messages, fast shared memory • SMP clusters (numerous vendors) • NPACI Blue Horizon • ASCI Blue-Pacific CTR (LLNL)

  3. High Opportunity Cost of Communication • Interconnect speeds are not keeping pace with node speeds • r: DGEMM floating-point rate per node, MFLOP/s • ß: peak point-to-point MPI message bandwidth, MB/s • IBM SP2/Power2SC: r = 640, ß = 100 • NPACI Blue Horizon: r = 5,500, ß = 100 • ASCI Blue-Pacific CTR: r = 750, ß = 80 • The ratio r/ß measures the opportunity cost: on Blue Horizon a node can perform roughly 55 floating-point operations in the time it takes to transfer one byte

  4. What programming models are available for multi-tier computers? • Single-tier • Flatten the hierarchical communication structure of the machine into one level or “tier” of parallelism • Simplest approach; MPI codes are reusable • Disadvantages: poor use of shared memory, unable to overlap communication with computation • Multi-tier • Use information about the hierarchical communication structure of the machine • Hybrid model: message passing + threads/OpenMP (sketched below) • More complicated, but can overcome the disadvantages of the single-tier model
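  As a point of reference, here is a minimal hybrid skeleton of the kind the multi-tier model implies: one MPI process per node and a team of threads within it. This is an illustration of the hybrid model, not KeLP2 code; the FUNNELED thread level is an assumption.

    // Hybrid "multi-tier" skeleton: MPI between nodes, OpenMP within a node.
    // Illustrative only; not KeLP2 source.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided;
        // Only the main thread will make MPI calls (funneled model).
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int node;                      // one MPI rank per node
        MPI_Comm_rank(MPI_COMM_WORLD, &node);

        #pragma omp parallel           // p threads per node share memory
        {
            int proc = omp_get_thread_num();
            std::printf("node %d, on-node processor %d\n", node, proc);
            // ... on-node computation out of shared memory ...
        }

        // ... inter-node communication issued by the main thread ...
        MPI_Finalize();
        return 0;
    }

  The single-tier alternative simply runs one such MPI process per processor and drops the threaded region.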

  5. Road Map • A hierarchical model of parallelism: multi-tier prototype of KeLP, KeLP2 • How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap • What are the opportunities and the limitations? • Guidelines for employing overlap • Studies on ASCI Blue-Pacific CTR • Progress on NPACI Blue Horizon

  6. What is KeLP ? • KeLP = Kernel Lattice Parallelism • Thesis topic of Stephen J. Fink (Ph.D. 1998) • A set of run time C++ class libraries for parallel computation • Reduce application development time without sacrificing performance • Run-time decomposition and communication objects • Structured blocked N-dimensional data • http://www-cse.ucsd.edu/groups/hpcl/scg/kelp

  7. Multi-tier programming • For an n-level machine, we identify n levels of parallelism plus one collective level of control • KeLP2 programs have 3 levels of control: • Collective level: operations performed on all nodes • Node level: operations performed on one node • Processor level: operations performed on one processor

  8. More about the model • Communication reflects the organization of the hardware • Two kinds of communication: slow messages, fast shared memory • A node communicates on behalf of its processors • Direct inter-processor communication only on-node • Hierarchical parallel control flow • Node level communication may run as a concurrent task • Processors execute computation out of shared memory

  9. KeLP’s central abstractions • MetaData: Region, FloorPlan • Distributed storage container: XArray • Parallel control flow: iterators

  10. The Region • Region: a box in multidimensional index space • A geometric calculus for manipulating regions • Similar to BoxTools (Colella et al.); does not support general domains as Titanium (UCB) does

  11. Aggregate abstractions • FloorPlan: a table of regions and their assignment to processors • XArray: a distributed collection of multidimensional arrays instantiated over a FloorPlan
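  A minimal, self-contained illustration of what these abstractions represent. These are not the KeLP classes themselves (the released library's interfaces differ); they only make the Region / FloorPlan / XArray relationship concrete.

    // Stand-ins for the KeLP metadata concepts, for exposition only.
    #include <vector>

    struct Region {              // a box in 2-D index space
        int lo[2], hi[2];
    };

    struct FloorPlanEntry {      // one row of the FloorPlan table
        Region region;           // a block of the global index space
        int    owner;            // node (or processor) that owns it
    };
    using FloorPlan = std::vector<FloorPlanEntry>;

    struct Patch {               // storage for one FloorPlan entry
        Region region;
        std::vector<double> data;
    };

    // An XArray-like container: one patch per FloorPlan entry; in KeLP the
    // patches live on their owning nodes rather than in one address space.
    std::vector<Patch> makeXArray(const FloorPlan& plan) {
        std::vector<Patch> xa;
        for (const FloorPlanEntry& e : plan) {
            int nx = e.region.hi[0] - e.region.lo[0] + 1;
            int ny = e.region.hi[1] - e.region.lo[1] + 1;
            xa.push_back({e.region, std::vector<double>(nx * ny, 0.0)});
        }
        return xa;
    }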

  12. Data Motion Model • Unit of transfer is a regular section • Build a persistent communication object, the KeLP Mover, in terms of regular section copies • Satisfy dependencies by executing the Mover; may be executed asynchronously to realize overlap • Replace point-to-point message passing with geometric descriptions of data dependences
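  One way to picture the Mover, again as a minimal stand-in rather than the real KeLP class (its actual interface may differ): the communication schedule is built once from geometric copy descriptions and can then be executed repeatedly, and asynchronously to realize overlap.

    // Conceptual stand-in for the KeLP Mover, not the real class.
    #include <future>
    #include <vector>

    struct SectionCopy { /* source region/patch -> destination region/patch */ };

    class MoverSketch {
        std::vector<SectionCopy> copies;   // geometric description, built once
        std::future<void> pending;
    public:
        void add(const SectionCopy& c) { copies.push_back(c); }
        void start() {                     // satisfy the dependences asynchronously
            pending = std::async(std::launch::async, [this] {
                for (const SectionCopy& c : copies) { (void)c; /* perform the copy (MPI or memcpy) */ }
            });
        }
        void wait() { if (pending.valid()) pending.get(); }
    };

  An application would call start() before computing on data that does not depend on the transfers, and wait() before touching data that does.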

  13. Road Map • A hierarchical model of parallelism: multi-tier prototype of KeLP, KeLP2 • How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap • What are the opportunities and the limitations? • Studies on ASCI Blue-Pacific CTR • Progress on NPACI Blue Horizon • Guidelines for employing overlap

  14. Single-tier formulation of an iterative method • Finite difference solver for Poisson’s eqn • Decompose the data BLOCKwise • Execute one process per processor • Transmit halo regions between processes • Compute inner region after communication completes
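  For concreteness, a sketch of the single-tier pattern in plain MPI (a 1-D block decomposition by rows; relax_rows is a hypothetical local update, not code from the talk):

    // Single-tier halo exchange: one MPI process per processor.
    // Ghost rows live in rows 0 and nrows-1 of each process's block.
    #include <mpi.h>
    #include <vector>

    void relax_rows(std::vector<double>& u, int r0, int r1, int ncols);  // hypothetical 5-point update

    void single_tier_step(std::vector<double>& u, int nrows, int ncols,
                          int rank, int nprocs, MPI_Comm comm) {
        int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request req[4];

        // Post receives into the ghost rows, then send the boundary rows.
        MPI_Irecv(&u[0],                   ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(&u[(nrows - 1) * ncols], ncols, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Isend(&u[1 * ncols],           ncols, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(&u[(nrows - 2) * ncols], ncols, MPI_DOUBLE, down, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        // Compute only after communication completes.
        relax_rows(u, 1, nrows - 1, ncols);
    }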

  15. Hierarchical multi-tier reformulation • One process per node, p threads per process • Transmit halo regions between nodes with MPI • Compute inner region in shared memory using threads
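  The multi-tier reformulation keeps the same kind of exchange but issues it once per node and spreads the relaxation over the node's p threads. A minimal sketch, assuming OpenMP threading, a funneled MPI thread level, and hypothetical helpers exchange_halo and relax_row:

    // Multi-tier, non-overlapped: one MPI process per node, p threads per process.
    #include <mpi.h>
    #include <vector>

    void exchange_halo(std::vector<double>& u, int nrows, int ncols,
                       int node, int nnodes, MPI_Comm comm);      // as in the sketch above
    void relax_row(std::vector<double>& u, int row, int ncols);   // hypothetical row update

    void multi_tier_step(std::vector<double>& u, int nrows, int ncols,
                         int node, int nnodes, MPI_Comm comm) {
        // Node level: only the main thread communicates (MPI_THREAD_FUNNELED).
        exchange_halo(u, nrows, ncols, node, nnodes, comm);

        // Processor level: the node's p threads relax the block out of shared memory.
        #pragma omp parallel for
        for (int i = 1; i < nrows - 1; i++)
            relax_row(u, i, ncols);
    }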

  16. A communication-computation imbalance • Only a single thread communicates on each node • Load imbalance due to a serial section • If we have enough computation, we can shift work to improve hardware utilization

  17. Overlapping Communication with Computation • Reformulate the algorithm • Isolate the inner region from the halo • Execute communication concurrently with computation on the inner region • Compute on the annulus when the halo finishes • Give up one processor to handle communication • It may not be practical to have that processor also compute (see the sketch below)
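  A sketch of the overlapped structure under the same assumptions (OpenMP, a funneled MPI thread level, hypothetical helpers): thread 0 acts as the communication proxy while the other p-1 threads update the inner region; once the halo has arrived, all p threads finish the annulus.

    // Overlapped: give up one processor (thread 0) to act as the communication proxy.
    #include <mpi.h>
    #include <omp.h>
    #include <vector>

    void exchange_halo(std::vector<double>& u, int nrows, int ncols,
                       int node, int nnodes, MPI_Comm comm);                        // assumed helper
    void relax_inner(std::vector<double>& u, int nrows, int ncols, int t, int nt);  // assumed helper
    void relax_annulus(std::vector<double>& u, int nrows, int ncols, int t, int nt);// assumed helper

    void overlapped_step(std::vector<double>& u, int nrows, int ncols,
                         int node, int nnodes, MPI_Comm comm) {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            int p = omp_get_num_threads();
            if (t == 0) {
                // Communication proxy: the thread that initialized MPI drives the exchange.
                exchange_halo(u, nrows, ncols, node, nnodes, comm);
            } else {
                // The remaining p-1 threads relax the inner region, which needs no halo data.
                relax_inner(u, nrows, ncols, t - 1, p - 1);
            }
            #pragma omp barrier
            // All p threads relax the annulus once the halo has arrived.
            relax_annulus(u, nrows, ncols, t, p);
        }
    }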

  18. Road Map • A hierarchical model of parallelism: multi-tier prototype of KeLP, KeLP2 • How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap • What are the opportunities and the limitations? • Guidelines for employing overlap • Studies on ASCI Blue-Pacific CTR • Progress on NPACI Blue Horizon

  19. A Performance Model of overlap • Give up one processor to communication • p = number of processors per node • Normalize the non-overlapped running time to 1.0; f < 1 is the fraction of that time spent in (multi-tier, non-overlapped) communication • Overlapped running time = MAX( (1-f) × p/(p-1), f ), where p/(p-1) is the slowdown factor from giving up one processor • Useful range: f > 1/p • When f > 0.8, the improvement is at most about 20% • Communication and computation take equal time when f = p/(2p-1)

  20. Performance • When we displace computation to make way for the proxy, computation time increases • Ideally, the wait on communication drops to zero • Compute bound (f < p/(2p-1)): overlapped time is T = (1-f) × p/(p-1), a speedup of (p-1)/(p(1-f)) over the non-overlapped T = 1.0 • Communication bound: overlapped time is T = f, an improvement of (1-f)/f • [Diagram: time bars of lengths f (communication) and 1-f (computation), with the computation dilated from T = 1.0 to T = (1-f) × p/(p-1) when it overlaps the communication]
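  To make the model concrete, here is a small calculation (the function names are mine) of the predicted overlapped time and the percentage improvement, i.e., speedup minus one. With p = 4 and f = 0.51 it reproduces the ~53% prediction quoted on slide 27 for the 8-node, N=128 case.

    // Predicted overlapped running time and improvement from the model above.
    #include <algorithm>
    #include <cstdio>

    double overlapped_time(double f, int p) {      // non-overlapped time is normalized to 1.0
        return std::max((1.0 - f) * p / (p - 1.0), f);
    }
    double improvement(double f, int p) {          // percentage gain over non-overlapped
        return (1.0 / overlapped_time(f, p) - 1.0) * 100.0;
    }

    int main() {
        // Compute-bound example from the ASCI Blue-Pacific results: p = 4, f = 0.51.
        std::printf("T = %.2f, improvement = %.0f%%\n",
                    overlapped_time(0.51, 4), improvement(0.51, 4));   // T = 0.65, ~53%
        return 0;
    }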

  21. Road Map • A hierarchical model of parallelism: multi-tier prototype of KeLP, KeLP2 • How to improve processor utilization when non-blocking, asynchronous, point-to-point message passing fails to realize communication overlap • What are the opportunities and the limitations? • Guidelines for employing overlap • Studies on ASCI Blue-Pacific CTR • Progress on NPACI Blue Horizon

  22. Results: ASCI Blue Pacific CTR • Multiple SMP nodes • 320 4-way 332 MHz PowerPC 604e compute nodes • 1.5 GB memory per node; 32 KB L1 and 256 KB L2 cache per processor • Differential MPI communication rates (peak ring bandwidth) • 82 MB/sec off-node, 77 MB/sec on-node • 81% parallel efficiency on 1 node with 4 threads

  23. Variants • KeLP2 • Overlapped, using inner annulus • Non-overlapped: communication runs serially, no annulus • Single tier: 1 thread per MPI process, 4 processes per node • MPI: hand coded, 4 processes / node

  24. Environment settings • KeLP2, overlapped: MP_CSS_INTERRUPT=no, MP_POLLING_INTERVAL=2000000000, MP_SINGLE_THREAD=yes, AIXTHREAD_SCOPE=S • KeLP2, non-overlapped: MP_CSS_INTERRUPT=yes, AIXTHREAD_SCOPE=S • MPI and KeLP2 single-tier: #PSUB -g @tpn4

  25. Software • KeLP layered on top of MPI and pthreads • Express parallelism in C++, computation in f77 • Compilers: mpCC_r, mpxlf_r • Compiler flags: -O3 -qstrict -qtune=auto -qarch=auto • OS: AIX 4.3.3

  26. Performance improves with overlap

  27. Comparison with the model • Consider compute bound cases: f < p/(2p-1) • On 8 nodes, N=128, f=0.51 • We predict 53%, observe 33% • Underestimated slowdown of computation • N=160: f=0.41. Predict 21%, observe 16% • N=256: f=0.28. Slight slowdown • For P=4, N=320, 64 nodes: f=0.52 • Predict 35%, observe 14% • Investigating cause of increased slowdown

  28. A closer look at performance • 8 nodes, N=128, without overlap • With overlap
  Per-iteration timings (avg / std dev):
                Iteration             FillPatch             Local Time            Mover Time
  (run 1)       3.514000 / 0.006992   0.814000 / 0.009661   2.727000 / 0.009487   0.792000 / 0.010328
  avsd 2        2.256000 / 0.006992   0.814000 / 0.005164   1.433000 / 0.004830   0.780000 / 0.004714
  avsd 3        1.889000 / 0.011972   0.825000 / 0.008498   1.053000 / 0.006749   0.789000 / 0.009944
  avsd 4        1.672000 / 0.028597   0.851000 / 0.028067   0.798000 / 0.011353   0.811000 / 0.028067

  29. NPACI Blue Horizon • Consider N=800 on 8 nodes (64 processors) • Nodes are ~8 times faster than the CTR's, but inter-node communication is twice as slow • Non-overlapped execution: f=0.34 • Predict 25% improvement with overlap • The communication proxy interferes with the computation threads • We can hide the communication, but only at the cost of slowing the computation down enough to offset any advantage • Currently under investigation

  30. A closer look at performance • [Chart: running times without overlap vs. with overlap] • By comparison, single-tier KeLP2 runs in 41 sec

  31. Inside the KeLP Mover • The Mover encapsulates installation- and architecture-specific optimizations without affecting the correctness of the code • Packetization to improve performance over ATM • The Mover runs as a proxy on SMP-based nodes

  32. Related Work • Fortran P [Sawdey et al., 1997] • SIMPLE [Bader & JáJá, 1997] • Multiprotocol Active Messages [Lumetta, Mainwaring and Culler, 1997] • Message Proxies [Lim, Snir, et al., 1998]

  33. Conclusions and Future Work • The KeLP Mover separates correctness concerns from policy decisions that affect performance • Accommodates generational changes in hardware, for software that will live for many generations • Multi-tier programming can realize performance improvements via overlap • Overlap will become more attractive in the future due to increased multiprocessing on the node • Future work • Hierarchical algorithm design • Large grain dataflow

  34. Acknowledgements and Further Information • Sponsorship: NSF (CARM, ACR), NPACI, DoE (ISCR), State of California, Sun Microsystems (Cal Micro), NCSA • Official NPACI release: KeLP1.3 • Runs on NPACI Blue Horizon, Origin 2000, Sun HPC, and clusters • Workstations: Solaris, Linux, etc. • http://www.cse.ucsd.edu/groups/hpcl/scg/kelp • Thanks to John May, Bronis de Supinski (CASC/LLNL), and Bill Tuel (IBM)
