
Split-C and Titanium: Global Address Space Programming


Presentation Transcript


  1. Split-C and Titanium: Global Address Space Programming Kathy Yelick CS267

  2. Comparison of Programming Models
     • F77 + heroic compiler: only for small-scale parallelism
     • Data parallel (HPF): good for regular applications; the compiler controls performance
     • Message-passing SPMD (MPI): programmer control; no global data structures
     • Shared memory with dynamic threads: shared data is easy, but locality cannot be ignored; the virtual processor model adds overhead
     • Shared address space SPMD: efficiency of a single thread per processor; the address space is partitioned but shared; encourages shared data structures matched to the architecture

  3. Overview
     • Split-C: a systems programming language based on C
       - creating parallelism: SPMD
       - communication: global pointers and spread arrays
       - memory consistency model
       - synchronization
       - optimization opportunities
     • Titanium: a scientific programming language based on Java and C++
       - parallelism: SPMD
       - communication: global pointers
       - memory consistency model
       - synchronization
       - language support for performance

  4. Split-C: Systems Programming • Widely used parallel extension to C • Supported on most large-scale parallel machines • Tunable performance • Consistent with C

  5. Split-C Overview
     • The model is a collection of processors plus a global address space
     • SPMD model: the same program runs on each node
     • Adds two new levels to the memory hierarchy:
       - local memory in the global address space (globally addressable local memory)
       - remote memory in the global address space (globally addressable remote memory)
     [Figure: per-processor memory hierarchies (cache, local address space) joined into one global address space across P0..P3; a global pointer g_P on one processor refers to an int x on another]

  6. SPMD Control Model
     • PROCS threads of control
       - independent
       - explicit synchronization
     • Synchronization
       - global barrier: barrier();
       - locks
     [Figure: PROCS processing elements (PE) all reaching a barrier() together]
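As a minimal illustration of this control model (a sketch, not from the original slides; all_hello is a made-up name, and only the PROCS, MYPROC, and barrier() primitives shown in this deck are used):

       /* Every processor runs the same code; MYPROC identifies it,
          and barrier() keeps all PROCS threads in step. */
       #include <stdio.h>

       void all_hello(void) {
           printf("hello from processor %d of %d\n", MYPROC, PROCS);
           barrier();                      /* nobody proceeds until everyone arrives */
           if (MYPROC == 0)
               printf("all %d processors reached the barrier\n", PROCS);
       }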

  7. C Pointers
     • &x reads as 'pointer to x'
     • Types read right to left: int * reads as 'pointer to int'
     • *P reads as 'value at P'

       /* assign the value 6 to x */
       int x;
       int *P = &x;
       *P = 6;

     Memory before the assignment:       Memory after the assignment:
       Address  Contents                   Address  Contents
       0xC000   ???      (int x)           0xC000   6        (int x)
       0xC004   0xC000   (int *P)          0xC004   0xC000   (int *P)

  8. Global Pointers

       int *global gp1;            /* global pointer to an int */
       typedef int *global g_ptr;
       g_ptr gp2;                  /* same type as gp1 */
       typedef double foo;
       foo *global *global gp3;    /* global ptr to a global ptr to a foo */
       int *global *gp4;           /* local ptr to a global ptr to an int */

     • A global pointer may refer to an object anywhere in the machine.
     • Each object (C structure) lives on one processor.
     • Global pointers can be dereferenced, incremented, and indexed just like local pointers.
     [Figure: gp3 chaining across several PEs to reach a foo]

  9. Memory Model

       on_one {
           int *global g_P = toglobal(2, &x);   /* global pointer to x on processor 2 */
           *g_P = 6;                            /* remote write */
       }

     [Figure: processor 0's g_P (at 0xC004) holds the global address (2, 0xC000); after the remote write, x (at 0xC000) on processor 2 changes from ??? to 6]
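A slightly larger sketch along the same lines (not from the slides; remote_sum and buf are made-up names; like the slide's toglobal(2, &x), it assumes a variable declared on every processor sits at the same local address everywhere):

       /* Sketch: walk processor p's copy of buf through a global pointer,
          using only dereference and pointer increment (slide 8). */
       int buf[4];                              /* each processor has its own buf */

       int remote_sum(int p) {
           int *global gp = toglobal(p, buf);   /* points at buf[0] on processor p */
           int i, sum = 0;
           for (i = 0; i < 4; i++) {
               sum += *gp;                      /* remote read from processor p */
               gp++;                            /* stays on processor p (global, not spread) */
           }
           return sum;
       }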

  10. C Arrays
     • Set 4 values to 0, 2, 4, 6 (array indices start at 0):

       for (i = 0; i < 4; i++) {
           A[i] = i*2;
       }

     • Pointers and arrays: A[i] == *(A + i)
     [Figure: memory cells A, A+1, A+2, A+3 holding 0, 2, 4, 6]

  11. Spread Arrays
     Spread arrays are spread over the entire machine:
     • the spreader (::) determines which dimensions are spread
     • dimensions to the right of the spreader define the objects on individual processors
     • dimensions to the left are linearized and spread in a cyclic map
     Example: double A[n][r]::[b][b] spreads the high dimensions [n][r] and keeps per-processor [b][b] blocks; block A[i][j] lives at A + i*r + j in units of sizeof(double)*b*b.
     The traditional C duality between arrays and pointers is preserved through spread pointers.
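A minimal sketch of declaring and touching such a spread array (not from the slides; all_clear is a made-up name, and it borrows the for_my_2D iterator and the tolocal conversion that appear in the matrix-multiply example two slides below, simply zeroing each block owned by the calling processor):

       void all_clear(int n, int m, int b, double A[n][m]::[b][b]) {
           int i, j, l, x, y;
           for_my_2D(i, j, l, n, m) {               /* the (i,j) blocks owned by MYPROC */
               double (*la)[b] = tolocal(A[i][j]);  /* local alias for my b-by-b block */
               for (x = 0; x < b; x++)
                   for (y = 0; y < b; y++)
                       la[x][y] = 0.0;
           }
           barrier();                               /* all blocks cleared before reuse */
       }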

  12. Spread Pointers

       double A[PROCS]::;
       for_my_1d (i, PROCS) { A[i] = i*2; }

     • Like global pointers, but index arithmetic moves across processors (cyclic):
       - a 1-dimensional address space, i.e. wrap and increment
       - the processor component varies fastest
     • No communication: for_my_1d visits only the indices MYPROC owns, so every A[i] touched is local.
     [Figure: A[0]..A[3] hold 0, 2, 4, 6, one element per PE]

  13. Blocked Matrix Multiply

       void all_mat_mult_blk(int n, int r, int m, int b,
                             double C[n][m]::[b][b],
                             double A[n][r]::[b][b],
                             double B[r][m]::[b][b]) {
           int i, j, k, l;
           double la[b][b], lb[b][b];               /* local copies of subblocks */
           for_my_2D(i, j, l, n, m) {
               double (*lc)[b] = tolocal(C[i][j]);
               for (k = 0; k < r; k++) {
                   bulk_read(la, A[i][k], b*b*sizeof(double));
                   bulk_read(lb, B[k][j], b*b*sizeof(double));
                   matrix_mult(b, b, b, lc, la, lb); /* highly optimized local routine */
               }
           }
           barrier();
       }

     • Configuration-independent use of spread arrays
     • Local copies of subblocks, multiplied by a highly optimized local routine
     • Blocking improves performance because the number of remote accesses is reduced.

  14. An Irregular Problem: EM3D
     • Maxwell's equations on an unstructured 3D mesh
     • Irregular bipartite graph of varying degree (about 20) with weighted edges
     • Basic operation: for all E nodes, then for all H nodes, subtract a weighted sum of the neighboring values of the other kind
     [Figure: E nodes v1, v2 connected to H nodes by edges with weights w1, w2, reflecting the coupling of the E and B/H fields]
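In symbols, the E-node update just described is (with $w_i$ the weight of edge $i$ and $v_{H_i}$ the neighboring H-node value; the H sweep is symmetric):

       $v_E \leftarrow v_E - \sum_i w_i \, v_{H_i}$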

  15. EM3D: Uniprocessor Version

       typedef struct node_t {
           double value;
           int edge_count;
           double *coeffs;           /* edge weights */
           double *(*values);        /* pointers to neighbors' values */
           struct node_t *next;
       } node_t;

       void all_compute_E() {
           node_t *n;
           int i;
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
       }

     How would you optimize this for a uniprocessor?
     • Minimize cache misses by organizing the list so that neighboring nodes are visited in order.

  16. EM3D: Simple Parallel Version
     Each processor has a list of local nodes:

       typedef struct node_t {
           double value;
           int edge_count;
           double *coeffs;
           double *global (*values);   /* neighbors may live on other processors */
           struct node_t *next;
       } node_t;

       void all_compute_e() {
           node_t *n;
           int i;
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }

     How do you optimize this?
     • Minimize remote edges
     • Balance load across processors: C(p) = a*Nodes + b*Edges + c*Remotes
     [Figure: nodes v1, v2, v3 split between processors M and N, with some edges crossing the boundary]

  17. EM3D: Eliminate Redundant Accesses
     Copy each remote value once into a local ghost node, then compute entirely locally:

       void all_compute_e() {
           ghost_node_t *g;
           node_t *n;
           int i;
           for (g = h_ghost_nodes; g; g = g->next)
               g->value = *(g->rval);              /* one remote read per ghost node */
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }
     [Figure: ghost copies of v1, v2, v3 on processor N avoid repeated remote reads from processor M]

  18. EM3D: Overlap Global Reads: GET
     The split-phase assignment := initiates each remote read without waiting; sync() waits for all outstanding gets to complete before the values are used:

       void all_compute_e() {
           ghost_node_t *g;
           node_t *n;
           int i;
           for (g = h_ghost_nodes; g; g = g->next)
               g->value := *(g->rval);             /* initiate get, do not wait */
           sync();                                 /* all gets have completed */
           for (n = e_nodes; n; n = n->next) {
               for (i = 0; i < n->edge_count; i++)
                   n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
           }
           barrier();
       }

  19. Split-C: Systems Programming
     • Tuning affects application performance
     [Chart: EM3D performance in microseconds per edge for the successive versions above]

  20. Global Operations and Shared Memory
     A binary-tree broadcast written with ordinary reads and writes to shared (spread) data:

       int all_bcast(int val) {
           int left  = 2*MYPROC + 1;
           int right = 2*MYPROC + 2;
           if (MYPROC > 0) {
               while (spread_lock[MYPROC] == 0) {}   /* spin until my parent signals */
               spread_lock[MYPROC] = 0;
               val = spread_buf[MYPROC];
           }
           if (left < PROCS) {
               spread_buf[left] = val;
               spread_lock[left] = 1;
           }
           if (right < PROCS) {
               spread_buf[right] = val;
               spread_lock[right] = 1;
           }
           return val;
       }

     • Requires sequential consistency: each child must observe the write to spread_buf before the write to spread_lock.

  21. Global Operations and Signaling Store
     The same broadcast using the signaling store :- ; store_sync(n) waits until n bytes of store data have arrived locally, so no spin loop or lock array is needed:

       int all_bcast(int val) {
           int left  = 2*MYPROC + 1;
           int right = 2*MYPROC + 2;
           if (MYPROC > 0) {
               store_sync(4);                 /* wait for one int from my parent */
               val = spread_buf[MYPROC];
           }
           if (left < PROCS)
               spread_buf[left] :- val;
           if (right < PROCS)
               spread_buf[right] :- val;
           return val;
       }

  22. Signaling Store and Global Communication
     Reshaping a blocked spread array into a cyclic one: every element moves with a single signaling store, and all_store_sync() ensures all stores everywhere have completed:

       void all_block_to_cyclic(int m, double B[PROCS*m]::, double A[PROCS]::[m]) {
           int i;
           double *a = &A[MYPROC];            /* my local block of A */
           for (i = 0; i < m; i++) {
               B[m*MYPROC + i] :- a[i];       /* store into the cyclic array */
           }
           all_store_sync();                  /* global: all stores have arrived */
       }
     [Figure: the PEs exchanging their blocks into a cyclic layout]

  23. Split-C Summary
     • Performance tuning capabilities of message passing
     • Support for shared data structures
     • Installed on NOW and available on most platforms
       http://www.cs.berkeley.edu/projects/split-c
     • Consistent with C design:
       - arrays are simply blocks of memory
       - no linguistic support for data abstraction
       - interfaces are difficult for complex data structures
       - explicit memory management

  24. Administrative • Homework 3 available • Work in teams of 3 (interdisciplinary) • Three languages + physics/numerical analysis • Other?

  25. Titanium
     • Builds on Split-C ideas:
       - global address space
       - SPMD parallelism
       - retains the local/global distinction
     • Based on Java, a cleaner C++:
       - classes, better library support, memory management
     • The language is extensible through classes:
       - domain-specific language extensions
       - current support for grid-based computations, particularly AMR (adaptive mesh refinement)
     • Optimizing compiler:
       - eliminates explicit put/get
       - communication and memory optimizations
       - synchronization analysis

  26. Titanium Overview • Linguistic support • Multigrid example • Compiler analyses and optimizations • Status

  27. Grid Support in Titanium
     • Multidimensional arrays (not in Java)
     • Points: array indexes for multidimensional arrays
     • Domains: sets of points

       Point<2> p = [1, 2];
       RectDomain<2> d = [[0,10],[20,30]];
       double array<2> a[d];

  28. Titanium example using domains
     • Gauss-Seidel red-black computation in multigrid (foreach does unordered iteration):

       void gsrb() {
           boundary(phi);
           for (domain<2> d = red; d != null; d = (d == red ? black : null)) {
               foreach (q in d)
                   res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)])*4
                             + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                             - 20.0*phi[q] - k*rhs[q]) * 0.05;
               foreach (q in d)
                   phi[q] += res[q];
           }
       }

  29. Memory Hierarchy Optimizations • Merge successive red-black calls • Merge successive Gauss-Seidel calls • Reduced memory traffic, but obfuscates the code

  30. Reordering Done for Performance
     • Machines and compilers reorder memory operations:
       - the network may reorder messages
       - remote memory is not equidistant
       - write buffers may be non-FIFO
       - superscalar instruction issue
     • Compilers and hardware for sequential languages maintain dependencies within a single thread of control
     • One processor may observe another processor's operations happening out of order
     • New compiler techniques are needed:
       - to prevent reordering of operations that affect correctness
       - to insert hardware primitives such as memory fences

  31. Care Needed to Ensure Semantics
     • Compiling sequential programs:

         x = expr1;          transformed to        y = expr2;
         y = expr2;                                x = expr1;

       Legal if x does not appear in expr2, y does not appear in expr1, and x and y are distinct.
     • Compiling parallel programs:

         Thread A              Thread B
         data = expr;          if (flag == 1)
         flag = 1;                 ... read data ...

       There are no dependencies within either thread, yet neither pair of statements can be reordered without breaking the program.
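To make the parallel half of this slide concrete in present-day terms (an illustration only, not part of the original lecture: it uses C11 atomics, which postdate these slides, to spell out the ordering the data/flag protocol relies on and that the compiler and hardware must not destroy):

       /* Illustrative sketch: the flag/data handshake above, written with
          explicit ordering so that neither the compiler nor the hardware may
          reorder the publish (flag = 1) before the payload write. */
       #include <stdatomic.h>

       int data;                 /* payload written by thread A */
       atomic_int flag = 0;      /* 0 initially; 1 means data is ready */

       void thread_A(void) {
           data = 42;                                               /* write payload first */
           atomic_store_explicit(&flag, 1, memory_order_release);   /* then publish */
       }

       void thread_B(void) {
           if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
               /* the acquire load pairs with the release store, so the
                  write to data is guaranteed to be visible here */
               int local = data;
               (void)local;
           }
       }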

  32. Why should you care?
     • Most compilers and hardware are either overly conservative or may reorder operations in ways that break your program.
     • Programmers should synchronize using explicit (system-defined) synchronization.
     • In Titanium:
       - synchronization is built into the language, not the runtime library
       - communication and remote-memory optimizations are enabled
       - the only optimizing compiler designed specifically for explicitly parallel programs

  33. Static Analysis for Parallel Performance
     • Analysis of synchronization:
       - barrier analysis to find parallel code segments
       - the single attribute improves analyzability
       - extends traditional control-flow analysis
     • Analysis of communication:
       - reorder shared-memory operations without observable effect
       - overlap and aggregate distributed-memory communication
       - extends traditional dependence analysis
     • Traditional analyses and optimizations in the presence of synchronization and communication

  34. Future Optimization Opportunities

  35. Titanium Status • Titanium language definition complete. • Titanium compiler running. • Compiles for uniprocessors, NOW; others soon. • Application developments underway. • Visit Supercomputing '97 for further updates.
