
Split-C and Titanium: Global Address Space Programming


Presentation Transcript


  1. Split-C and Titanium: Global Address Space Programming Kathy Yelick CS267

  2. Comparison of Programming Models
     • F77 + heroic compiler: only for small-scale parallelism
     • Data parallel (HPF): good for regular applications; the compiler controls performance
     • Message-passing SPMD (MPI): programmer control; no global data structures
     • Shared memory with dynamic threads: shared data is easy, but locality cannot be ignored; the virtual processor model adds overhead
     • Shared address space SPMD: efficiency of a single thread per processor; the address space is partitioned but shared; encourages shared data structures matched to the architecture

  3. Overview
     • Split-C: a systems programming language based on C
       - creating parallelism: SPMD
       - communication: global pointers and spread arrays
       - memory consistency model
       - synchronization
       - optimization opportunities
     • Titanium: a scientific programming language based on Java and C++
       - parallelism: SPMD
       - communication: global pointers
       - memory consistency model
       - synchronization
       - language support for performance

  4. Split-C: Systems Programming • Widely used parallel extension to C • Supported on most large-scale parallel machines • Tunable performance • Consistent with C

  5. Split-C Overview
     • The model is a collection of processors plus a global address space
     • SPMD model: the same program runs on each node
     • Adds two new levels to the memory hierarchy:
       - local memory in the global address space (globally addressable local memory)
       - remote memory in the global address space (globally addressable remote memory)
     [Figure: per-processor memory hierarchies (cache, local address space) joined into one global address space across P0..P3; a global pointer g_P on one processor refers to an int x on another]

  6. SPMD Control Model
     • PROCS threads of control
       - independent
       - explicit synchronization
     • Synchronization
       - global barrier: barrier();
       - locks
     [Figure: PROCS processing elements (PE) all reaching a barrier() together]
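As a minimal illustration of this control model (a sketch, not from the original slides; all_hello is a made-up name, and only the PROCS, MYPROC, and barrier() primitives shown in this deck are used):

       /* Every processor runs the same code; MYPROC identifies it,
          and barrier() keeps all PROCS threads in step. */
       #include <stdio.h>

       void all_hello(void) {
           printf("hello from processor %d of %d\n", MYPROC, PROCS);
           barrier();                      /* nobody proceeds until everyone arrives */
           if (MYPROC == 0)
               printf("all %d processors reached the barrier\n", PROCS);
       }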

  7. C Pointers
     • &x reads as 'pointer to x'
     • Types read right to left: int * reads as 'pointer to int'
     • *P reads as 'value at P'

       /* assign the value 6 to x */
       int x;
       int *P = &x;
       *P = 6;

     Memory before the assignment:       Memory after the assignment:
       Address  Contents                   Address  Contents
       0xC000   ???      (int x)           0xC000   6        (int x)
       0xC004   0xC000   (int *P)          0xC004   0xC000   (int *P)

  8. Global Pointers

       int *global gp1;            /* global pointer to an int */
       typedef int *global g_ptr;
       g_ptr gp2;                  /* same type as gp1 */
       typedef double foo;
       foo *global *global gp3;    /* global ptr to a global ptr to a foo */
       int *global *gp4;           /* local ptr to a global ptr to an int */

     • A global pointer may refer to an object anywhere in the machine.
     • Each object (C structure) lives on one processor.
     • Global pointers can be dereferenced, incremented, and indexed just like local pointers.
     [Figure: gp3 chaining across several PEs to reach a foo]

  9. Memory Model

       on_one {
           int *global g_P = toglobal(2, &x);   /* global pointer to x on processor 2 */
           *g_P = 6;                            /* remote write */
       }

     [Figure: processor 0's g_P (at 0xC004) holds the global address (2, 0xC000); after the remote write, x (at 0xC000) on processor 2 changes from ??? to 6]
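A slightly larger sketch along the same lines (not from the slides; remote_sum and buf are made-up names; like the slide's toglobal(2, &x), it assumes a variable declared on every processor sits at the same local address everywhere):

       /* Sketch: walk processor p's copy of buf through a global pointer,
          using only dereference and pointer increment (slide 8). */
       int buf[4];                              /* each processor has its own buf */

       int remote_sum(int p) {
           int *global gp = toglobal(p, buf);   /* points at buf[0] on processor p */
           int i, sum = 0;
           for (i = 0; i < 4; i++) {
               sum += *gp;                      /* remote read from processor p */
               gp++;                            /* stays on processor p (global, not spread) */
           }
           return sum;
       }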

  10. C Arrays
     • Set 4 values to 0, 2, 4, 6 (array indices start at 0):

       for (i = 0; i < 4; i++) {
           A[i] = i*2;
       }

     • Pointers and arrays: A[i] == *(A + i)
     [Figure: memory cells A, A+1, A+2, A+3 holding 0, 2, 4, 6]

  11. Spread Arrays
     Spread arrays are spread over the entire machine:
     • the spreader (::) determines which dimensions are spread
     • dimensions to the right of the spreader define the objects on individual processors
     • dimensions to the left are linearized and spread in a cyclic map
     Example: double A[n][r]::[b][b] spreads the high dimensions [n][r] and keeps per-processor [b][b] blocks; block A[i][j] lives at A + i*r + j in units of sizeof(double)*b*b.
     The traditional C duality between arrays and pointers is preserved through spread pointers.
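A minimal sketch of declaring and touching such a spread array (not from the slides; all_clear is a made-up name, and it borrows the for_my_2D iterator and the tolocal conversion that appear in the matrix-multiply example two slides below, simply zeroing each block owned by the calling processor):

       void all_clear(int n, int m, int b, double A[n][m]::[b][b]) {
           int i, j, l, x, y;
           for_my_2D(i, j, l, n, m) {               /* the (i,j) blocks owned by MYPROC */
               double (*la)[b] = tolocal(A[i][j]);  /* local alias for my b-by-b block */
               for (x = 0; x < b; x++)
                   for (y = 0; y < b; y++)
                       la[x][y] = 0.0;
           }
           barrier();                               /* all blocks cleared before reuse */
       }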

  12. Spread Pointers

       double A[PROCS]::;
       for_my_1d (i, PROCS) { A[i] = i*2; }

     • Like global pointers, but index arithmetic moves across processors (cyclic):
       - a 1-dimensional address space, i.e. wrap and increment
       - the processor component varies fastest
     • No communication: for_my_1d visits only the indices MYPROC owns, so every A[i] touched is local.
     [Figure: A[0]..A[3] hold 0, 2, 4, 6, one element per PE]

  13. Blocked Matrix Multiply

       void all_mat_mult_blk(int n, int r, int m, int b,
                             double C[n][m]::[b][b],
                             double A[n][r]::[b][b],
                             double B[r][m]::[b][b]) {
           int i, j, k, l;
           double la[b][b], lb[b][b];               /* local copies of subblocks */
           for_my_2D(i, j, l, n, m) {
               double (*lc)[b] = tolocal(C[i][j]);
               for (k = 0; k < r; k++) {
                   bulk_read(la, A[i][k], b*b*sizeof(double));
                   bulk_read(lb, B[k][j], b*b*sizeof(double));
                   matrix_mult(b, b, b, lc, la, lb); /* highly optimized local routine */
               }
           }
           barrier();
       }

     • Configuration-independent use of spread arrays
     • Local copies of subblocks, multiplied by a highly optimized local routine
     • Blocking improves performance because the number of remote accesses is reduced.

  14. An Irregular Problem: EM3D
     • Maxwell's equations on an unstructured 3D mesh
     • Irregular bipartite graph of varying degree (about 20) with weighted edges
     • Basic operation: for all E nodes, then for all H nodes, subtract a weighted sum of the neighboring values of the other kind
     [Figure: E nodes v1, v2 connected to H nodes by edges with weights w1, w2, reflecting the coupling of the E and B/H fields]
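In symbols, the E-node update just described is (with $w_i$ the weight of edge $i$ and $v_{H_i}$ the neighboring H-node value; the H sweep is symmetric):

       $v_E \leftarrow v_E - \sum_i w_i \, v_{H_i}$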

  15. EM3D: Uniprocessor Version

       typedef struct node_t {
           double value;
           int edge_count;
           double *coeffs;           /* edge weights */
           double *(*values);        /* pointers to neighbors' values */
           struct node_t *next;
       } node_t;

       void all_compute_E() {
           node_t *n;
           int i;
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
       }

     How would you optimize this for a uniprocessor?
     • Minimize cache misses by organizing the list so that neighboring nodes are visited in order.

  16. EM3D: Simple Parallel Version
     Each processor has a list of local nodes:

       typedef struct node_t {
           double value;
           int edge_count;
           double *coeffs;
           double *global (*values);   /* neighbors may live on other processors */
           struct node_t *next;
       } node_t;

       void all_compute_e() {
           node_t *n;
           int i;
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }

     How do you optimize this?
     • Minimize remote edges
     • Balance load across processors: C(p) = a*Nodes + b*Edges + c*Remotes
     [Figure: nodes v1, v2, v3 split between processors M and N, with some edges crossing the boundary]

  17. EM3D: Eliminate Redundant Accesses
     Copy each remote value once into a local ghost node, then compute entirely locally:

       void all_compute_e() {
           ghost_node_t *g;
           node_t *n;
           int i;
           for (g = h_ghost_nodes; g; g = g->next)
               g->value = *(g->rval);              /* one remote read per ghost node */
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }
     [Figure: ghost copies of v1, v2, v3 on processor N avoid repeated remote reads from processor M]

  18. EM3D: Overlap Global Reads: GET
     The split-phase assignment := initiates each remote read without waiting; sync() waits for all outstanding gets to complete before the values are used:

       void all_compute_e() {
           ghost_node_t *g;
           node_t *n;
           int i;
           for (g = h_ghost_nodes; g; g = g->next)
               g->value := *(g->rval);             /* initiate get, do not wait */
           sync();                                 /* all gets have completed */
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }

  19. Split-C: Systems Programming
     • Tuning affects application performance
     [Chart: EM3D performance in microseconds per edge for the successive versions above]

  20. Global Operations and Shared Memory
     A binary-tree broadcast written with ordinary reads and writes to shared (spread) data:

       int all_bcast(int val) {
           int left  = 2*MYPROC + 1;
           int right = 2*MYPROC + 2;
           if (MYPROC > 0) {
               while (spread_lock[MYPROC] == 0) {}   /* spin until my parent signals */
               spread_lock[MYPROC] = 0;
               val = spread_buf[MYPROC];
           }
           if (left < PROCS) {
               spread_buf[left] = val;
               spread_lock[left] = 1;
           }
           if (right < PROCS) {
               spread_buf[right] = val;
               spread_lock[right] = 1;
           }
           return val;
       }

     • Requires sequential consistency: each child must observe the write to spread_buf before the write to spread_lock.

  21. Global Operations and Signaling Store
     The same broadcast using the signaling store :- ; store_sync(n) waits until n bytes of store data have arrived locally, so no spin loop or lock array is needed:

       int all_bcast(int val) {
           int left  = 2*MYPROC + 1;
           int right = 2*MYPROC + 2;
           if (MYPROC > 0) {
               store_sync(4);                 /* wait for one int from my parent */
               val = spread_buf[MYPROC];
           }
           if (left < PROCS)
               spread_buf[left] :- val;
           if (right < PROCS)
               spread_buf[right] :- val;
           return val;
       }

  22. Signaling Store and Global Communication
     Reshaping a blocked spread array into a cyclic one: every element moves with a single signaling store, and all_store_sync() ensures all stores everywhere have completed:

       void all_block_to_cyclic(int m, double B[PROCS*m]::, double A[PROCS]::[m]) {
           int i;
           double *a = &A[MYPROC];            /* my local block of A */
           for (i = 0; i < m; i++) {
               B[m*MYPROC + i] :- a[i];       /* store into the cyclic array */
           }
           all_store_sync();                  /* global: all stores have arrived */
       }
     [Figure: the PEs exchanging their blocks into a cyclic layout]

  23. Split-C Summary
     • Performance tuning capabilities of message passing
     • Support for shared data structures
     • Installed on NOW and available on most platforms
       http://www.cs.berkeley.edu/projects/split-c
     • Consistent with C design:
       - arrays are simply blocks of memory
       - no linguistic support for data abstraction
       - interfaces are difficult for complex data structures
       - explicit memory management

  24. Administrative • Homework 3 available • Work in teams of 3 (interdisciplinary) • Three languages + physics/numerical analysis • Other?

  25. Titanium
     • Builds on Split-C ideas:
       - global address space
       - SPMD parallelism
       - retains the local/global distinction
     • Based on Java, a cleaner C++:
       - classes, better library support, memory management
     • The language is extensible through classes:
       - domain-specific language extensions
       - current support for grid-based computations, particularly AMR (adaptive mesh refinement)
     • Optimizing compiler:
       - eliminates explicit put/get
       - communication and memory optimizations
       - synchronization analysis

  26. Titanium Overview • Linguistic support • Multigrid example • Compiler analyses and optimizations • Status

  27. Grid Support in Titanium
     • Multidimensional arrays (not in Java)
     • Points: array indexes for multidimensional arrays
     • Domains: sets of points

       Point<2> p = [1, 2];
       RectDomain<2> d = [[0,10],[20,30]];
       double array<2> a[d];

  28. Titanium example using domains
     • Gauss-Seidel red-black computation in multigrid (foreach does unordered iteration):

       void gsrb() {
           boundary(phi);
           for (domain<2> d = red; d != null; d = (d == red ? black : null)) {
               foreach (q in d)
                   res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)])*4
                             + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                             - 20.0*phi[q] - k*rhs[q]) * 0.05;
               foreach (q in d)
                   phi[q] += res[q];
           }
       }

  29. Memory Hierarchy Optimizations • Merge successive red-black calls • Merge successive Gauss-Seidel calls • Reduced memory traffic, but obfuscates the code

  30. Reordering Done for Performance
     • Machines and compilers reorder memory operations:
       - the network may reorder messages
       - remote memory is not equidistant
       - write buffers may be non-FIFO
       - superscalar instruction issue
     • Compilers and hardware for sequential languages maintain dependencies within a single thread of control
     • One processor may observe another processor's operations happening out of order
     • New compiler techniques are needed:
       - to prevent reordering of operations that affect correctness
       - to insert hardware primitives such as memory fences

  31. Care Needed to Ensure Semantics
     • Compiling sequential programs:

         x = expr1;          transformed to        y = expr2;
         y = expr2;                                x = expr1;

       Legal if x does not appear in expr2, y does not appear in expr1, and x and y are distinct.
     • Compiling parallel programs:

         Thread A              Thread B
         data = expr;          if (flag == 1)
         flag = 1;                 ... read data ...

       There are no dependencies within either thread, yet neither pair of statements can be reordered without breaking the program.
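To make the parallel half of this slide concrete in present-day terms (an illustration only, not part of the original lecture: it uses C11 atomics, which postdate these slides, to spell out the ordering the data/flag protocol relies on and that the compiler and hardware must not destroy):

       /* Illustrative sketch: the flag/data handshake above, written with
          explicit ordering so that neither the compiler nor the hardware may
          reorder the publish (flag = 1) before the payload write. */
       #include <stdatomic.h>

       int data;                 /* payload written by thread A */
       atomic_int flag = 0;      /* 0 initially; 1 means data is ready */

       void thread_A(void) {
           data = 42;                                               /* write payload first */
           atomic_store_explicit(&flag, 1, memory_order_release);   /* then publish */
       }

       void thread_B(void) {
           if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
               /* the acquire load pairs with the release store, so the
                  write to data is guaranteed to be visible here */
               int local = data;
               (void)local;
           }
       }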

  32. Why should you care?
     • Most compilers and hardware are either overly conservative or may reorder operations in ways that break your program.
     • Programmers should synchronize using explicit (system-defined) synchronization.
     • In Titanium:
       - synchronization is built into the language, not the runtime library
       - communication and remote-memory optimizations are enabled
       - the only optimizing compiler designed specifically for explicitly parallel programs

  33. Static Analysis for Parallel Performance
     • Analysis of synchronization:
       - barrier analysis to find parallel code segments
       - the single attribute improves analyzability
       - extends traditional control-flow analysis
     • Analysis of communication:
       - reorder shared-memory operations without observable effect
       - overlap and aggregate distributed-memory communication
       - extends traditional dependence analysis
     • Traditional analyses and optimizations in the presence of synchronization and communication

  34. Future Optimization Opportunities

  35. Titanium Status • Titanium language definition complete. • Titanium compiler running. • Compiles for uniprocessors, NOW; others soon. • Application developments underway. • Visit Supercomputing '97 for further updates.
