
Next Generation Parallel Programming Languages & Libraries


1. Next Generation Parallel Programming Languages & Libraries (Co-Array Fortran, Global Arrays, UPC, Titanium)
Eun-Gyu Kim, Research Group Meeting, 8/12/2004, University of Illinois at Urbana-Champaign

2. Background: Message Passing Model
• A set of cooperating sequential processes, each with its own local address space
• Processes interact through explicit transactions (send, receive, …)
• Advantage: the programmer controls data and work distribution
• Disadvantages: communication overhead for small transactions; hard to program
• Example: MPI (see the sketch below)
(Diagram: processes with private address spaces exchanging messages over a network.)
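To make the explicit transactions concrete, here is a minimal sketch in C using MPI; the rank numbers, tag, and value are illustrative, not from the slides:

/* Rank 0 sends one integer to rank 1, which must post a matching receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                       /* data lives only in rank 0's address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Note that the programmer spells out both ends of every transaction, which is exactly the control (and the burden) the slide describes.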

3. Background: Data Parallel Model
• One thread (process) of execution
• Different data items are manipulated in the same way by that thread
• Conditional statements exclude (or include) parts of the data in an operation
• Parallelism is implicit (handled by the compiler)
• Advantages: easy to write and comprehend; no synchronization
• Disadvantage: no independent branching
• Example: HPF (High Performance Fortran)
(Diagram: one process operating uniformly over different data in separate address spaces.)

4. Background: Shared Memory Model
• Multiple simultaneous execution threads (processes)
• Threads read/write one shared memory space and invalidate cached copies when necessary
• Advantages: read remote memory via an expression; write remote memory through assignment
• Disadvantages: manipulating shared data leads to synchronization requirements; does not allow locality exploitation
• Example: OpenMP (usually; see the sketch below)
(Diagram: threads 1–3 accessing a shared address space, e.g. a shared variable x.)
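For contrast with message passing, a minimal shared-memory sketch in C with OpenMP; the array, its size, and the counter protected by the critical section are illustrative, not from the slides:

#include <omp.h>
#include <stdio.h>

int main(void) {
    static double x[1000];          /* one array in the shared address space */
    long touched = 0;
    #pragma omp parallel for        /* every thread reads/writes the shared array directly */
    for (int i = 0; i < 1000; i++) {
        x[i] = 2.0 * i;             /* remote data is written by plain assignment */
        #pragma omp critical        /* shared updates still need synchronization */
        touched++;
    }
    printf("x[999] = %.1f, touched = %ld\n", x[999], touched);
    return 0;
}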

5. Distributed Shared Memory Model
• Similar to the shared memory paradigm, but memory Mi has affinity to thread i
• At the same time, each thread has a global view of memory
• Advantages: helps exploit locality of references; simple statements, as in shared memory
• Disadvantage: synchronization is still necessary
• Examples: UPC, Titanium, Co-Array Fortran, Global Arrays
(Diagram: a partitioned shared address space M1, M2, M3, each partition having affinity to the corresponding thread.)

6. Historical Timeline (roughly 1994–2004)
• Fortran 95
• Co-Array Fortran: developed by Robert Numrich at the Minnesota Supercomputing Institute and NASA; adds parallel extensions to Fortran 95.
• Global Arrays toolkit: from Pacific Northwest National Laboratory; library-based interface for C, C++, Fortran, and Python; the ARMCI one-sided communication library version started in 1998.
• UPC (Unified Parallel C): consortium of government, academia, and HPC vendors coordinated by GWU, IDA, and NSA.
• MPI, OpenMP, HPF
• Titanium: led by Professor Katherine Yelick at the University of California, Berkeley.

7. Co-Array Fortran
Robert Numrich, Minnesota Supercomputing Institute / Goddard Space Flight Center

8. Co-Array Execution Model
• The number of images (threads) is fixed (num_images(), this_image())
• Each image executes the same program independently of the others
• An "object" has the same name in each image
• Each image works on its own local data
• An image moves remote data to local data only through explicit co-array syntax
• Designed for the Cray T3E and Cray X1

9. Co-Array
real :: x(n)[p,*]
x(n) is the co-array (data); [p,*] is the co-dimension (images).

10. Co-Array Declaration & Memory
real :: x(n)      ! local array x of length n
real :: x(n)[*]   ! replicates an array x of length n on each image
(Diagram: with the co-dimension [*], every image 0, 1, 2, … holds its own copy x(1) … x(n).)

11. Examples of Co-Array Declarations
real :: a(n)[*] - replicates array a of length n to all images.
integer :: z[p,*] - organizes a logical two-dimensional grid p x (num_images()/p); replicates scalar z to each image.
character :: b(n,m)[p,q,*] - organizes a logical three-dimensional grid p x q x (num_images()/(p x q)); replicates the two-dimensional array b of size n x m to each image.
real, allocatable :: c(:)[:] - declares an allocatable co-array c.
type(field) :: user_defined[*] - replicates a user-defined structure to all images.
integer :: local_x - defines a local variable local_x.

12. Co-Array Communication
y(:) = x(:)[p,q] - copies array x from image (p,q) to local array y.
x(index(:)) = y[index(:)] - gathers the value of y from the images listed in index(:) into local array x.
do i=2, num_images()
  x(:) = x(:) + x(:)[i]
end do
- reduction over array x.
p[:] = x - broadcasts the value x to all images.
* An absent co-dimension defaults to the local object.

13. Co-Array Synchronization
• sync_all() - barrier involving every image.
• sync_team(list(:)) - barrier involving only the images in list.

14. Irregular Structures in Co-Array
type some_type
  real, pointer, dimension(:) :: ptr
end type
type(some_type) :: z[np]
allocate(z%ptr(some_value))
• Although a co-array requires replication throughout all images, pointer components with local allocation can express irregular data structures (each image's z%ptr may have a different size).
(Diagram: images 1 and 2 each holding a pointer to locally allocated data of a different size.)

15. Why Co-Array?
• The syntax is easier to express than MPI, yet captures low-level details.
• Communication can be controlled explicitly, down to the level of specifying the image id.
• Synchronization can be controlled explicitly.
• Assignment is easier than a long library call.
• One-sided communication is possible.
• Irregular structures are supported.

16. Why NOT Co-Array?
• Poor support: only available on Cray machines, SGI, and HP Alpha; may become available soon on Blue Gene/L at LLNL (Lawrence Livermore).
• You still have to do the dirty work of managing which data lives on which image.

17. Global Arrays
Pacific Northwest National Laboratory

18. Global Arrays
• A toolkit to create and manage a global array data structure.
• Through the global array structure, physically distributed arrays can be accessed with shared-memory-style programming.
• Preserves the ability to exploit data locality.
• Library based, not a language extension; interfaces to Fortran, C, C++, and Python.
• Each process has a global view of the array and can read/write to it, e.g. access A(5,4) rather than a(2) on task 3.
• Accessing data does not require a task id.

19. Global Arrays Execution Model
(Diagram: each process gets a patch of the shared global object into local memory, computes/updates it, and puts the result back into the shared global object.)

20. Global Arrays Creation
• int NGA_Create(int type, int ndim, int dims[], char *array_name, int chunk[]) - for regular arrays.
• int NGA_Create_irreg(int type, int ndim, int dims[], char *array_name, int map[], int block[]) - for irregular arrays.

21. Global Arrays Put / Get
• NGA_Put(int g_a, int lo[], int hi[], void *buf, int ld[])
• NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[])
• lo[] and hi[] are the starting and ending points of the patch.
• ld[] holds the leading-dimension (stride) information of the local array.
• *buf is a local array with the same number of dimensions.
• g_a is the global array handle.
• GA_Fence() and GA_Sync() serve as synchronization methods.
(A hedged usage sketch follows.)
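A minimal sketch of creating a regular global array and moving a patch with put/get, assuming the standard Global Arrays C bindings (ga.h) running on top of MPI; the array size, patch bounds, and names are illustrative, not from the slides:

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[2] = {100, 100};
    int chunk[2] = {-1, -1};                    /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "a", chunk);
    GA_Zero(g_a);

    double buf[10][10];
    int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
    NGA_Get(g_a, lo, hi, buf, ld);              /* copy a 10x10 patch to a local buffer */
    buf[0][0] += 1.0;
    NGA_Put(g_a, lo, hi, buf, ld);              /* write it back, wherever it physically lives */
    GA_Sync();                                  /* make the update visible to all processes */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}

Note that no task id appears anywhere: the patch is named purely by its global indices lo and hi.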

22. Affinity Hints for Global Arrays
• What data does a processor own? NGA_Distribution(g_a, iproc, lo, hi)
• Where is the data? NGA_Access(g_a, lo, hi, ptr, ld)
* Use this information to organize the calculation so that maximum use is made of locally held data (see the sketch below).
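Continuing the sketch above (same g_a), a hedged example of how these affinity queries let each process update the patch it owns in place, again assuming the standard C bindings:

int me = GA_Nodeid();
int lo[2], hi[2], ld[1];
double *ptr;

NGA_Distribution(g_a, me, lo, hi);      /* which patch of g_a do I own? */
NGA_Access(g_a, lo, hi, &ptr, ld);      /* direct pointer into the locally held patch */
for (int i = 0; i <= hi[0] - lo[0]; i++)
    for (int j = 0; j <= hi[1] - lo[1]; j++)
        ptr[i * ld[0] + j] *= 2.0;      /* pure local work, no communication */
NGA_Release_update(g_a, lo, hi);        /* tell GA the patch was modified */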

23. Features in Global Arrays
• Rich library calls: accumulate, scatter, gather, …
• Interfaces to external linear algebra libraries: ScaLAPACK, PeIGS parallel eigensolvers.
• Interoperable with MPI.
• Nonblocking communication.
• Ghost cell library calls.
• Mirrored arrays for SMP nodes.

24. Why Global Arrays?
• Global layout of the data structure: no need to track which task to put/get from; data partitioning is done with ease.
• Flexible enough to capture the message passing model.
• Shared-memory-style programming.
• Development has been active, involving many scientific applications and linear algebra problems.
• Rich high-level function calls.
• Available on many major platforms.

25. Why NOT Global Arrays?
• It's a library: the optimization opportunity is smaller than in a language-extension model.
• A language-extension model has better syntax (i.e. assignment and expression references).

26. UPC (Unified Parallel C)
Consortium of government, academia, and vendors (GWU, Berkeley)

27. UPC Programming Model
• Global address space with affinity: private and shared.
• One thread per processor working independently (SPMD).
• Syntax uses assignments and expressions.
• Extension to ANSI C.
• Portability and ease of use are the main goals.

28. UPC Memory Model
• Each thread has shared and private address space.
• The shared address space can be accessed by all threads, whereas private space is only accessible to the thread it has affinity with.
• Communication occurs when accessing remote shared data.
• Private memory access: FAST. Shared memory access: SLOWER (but still relatively fast).
• Dynamic allocation is available for both shared and private space.
(Diagram: threads 0–3, each with a private region and a partition of the shared region.)

29. UPC Declarations
Assume THREADS = 3
shared int x;             // shared scalar
shared int y[THREADS];    // shared array
int z;                    // local scalar
Layout: x and y[0] have affinity to thread 0, y[1] to thread 1, y[2] to thread 2; each thread has its own private z.
* Unless a block size is provided, a cyclic distribution is chosen.

30. UPC Declarations
Assume THREADS = 4
shared int A[4][THREADS];
With the default block size of 1, elements are distributed cyclically, so thread j owns column j:
Thread 0: A[0][0] A[1][0] A[2][0] A[3][0]
Thread 1: A[0][1] A[1][1] A[2][1] A[3][1]
Thread 2: A[0][2] A[1][2] A[2][2] A[3][2]
Thread 3: A[0][3] A[1][3] A[2][3] A[3][3]

31. UPC Declaration with Blocking
Assume THREADS = 4
shared [3] int A[4][THREADS];
Elements are dealt out in blocks of 3, round-robin over the threads:
Thread 0: A[0][0] A[0][1] A[0][2] A[3][0] A[3][1] A[3][2]
Thread 1: A[0][3] A[1][0] A[1][1] A[3][3]
Thread 2: A[1][2] A[1][3] A[2][0]
Thread 3: A[2][1] A[2][2] A[2][3]

32. UPC Pointers
int *p1;                 // private pointer pointing locally (accesses private data)
shared int *p2;          // private pointer pointing into the shared space (access by one thread to shared space)
int *shared p3;          // shared pointer pointing locally (not recommended)
shared int *shared p4;   // shared pointer pointing into the shared space (accessed by all threads)

33. UPC Work Sharing (syntactic sugar)
shared int a[100], b[100], c[101];
int i;

upc_forall (i=0; i<100; i++; &a[i])            // executed by the thread that owns &a[i]
  a[i] = b[i] * c[i+1];

upc_forall (i=0; i<100; i++; i)                // round-robin
  a[i] = b[i] * c[i+1];

upc_forall (i=0; i<100; i++; (i*THREADS)/100)  // by chunks of 25
  a[i] = b[i] * c[i+1];

34. Matrix Multiplication
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4

shared [N*P/THREADS] int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, c[N][M];
shared [M/THREADS] int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};

void main(void) {
  int i, j, k;
  upc_forall (i=0; i<N; i++; &c[i][0]) {
    for (j=0; j<M; j++) {
      c[i][j] = 0;
      for (k=0; k<P; k++)
        c[i][j] += a[i][k] * b[k][j];
    }
  }
}

35. UPC Synchronization
• upc_barrier (a statement, not a function call)
• upc_notify and upc_wait (the two halves of a split-phase barrier)
• upc_lock() / upc_unlock() (operate on a upc_lock_t*)
(A small usage sketch follows.)
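A small sketch of how these primitives combine, assuming UPC 1.2 semantics; the shared counter and lock variable are illustrative, not from the slides:

#include <upc.h>
#include <stdio.h>

shared int counter;                      /* single shared scalar, affinity to thread 0 */
upc_lock_t *shared lock;                 /* one lock handle shared by all threads */

int main(void) {
    if (MYTHREAD == 0) {
        counter = 0;
        lock = upc_global_lock_alloc();  /* thread 0 allocates the lock for everyone */
    }
    upc_barrier;                         /* note: upc_barrier is a statement */

    upc_lock(lock);                      /* mutual exclusion around the shared update */
    counter += 1;
    upc_unlock(lock);

    upc_barrier;
    if (MYTHREAD == 0)
        printf("counter = %d (THREADS = %d)\n", counter, THREADS);
    return 0;
}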

36. UPC Optimization
• Use a local pointer instead of a shared pointer when dealing with local shared data, through casting and assignment; pointer arithmetic on it is faster.
• Aggregate remote accesses in blocks instead of one-by-one in a loop:
shared [] int a[1000], b[1000];
typedef struct { int array[1000]; } st_copy;
st_copy *block_a, *block_b;
block_a = (st_copy *) a;
block_b = (st_copy *) b;
*block_a = *block_b;
• Use split-phase barriers and overlap remote accesses.

37. Why UPC?
• Global address space: easy to program; communication is done through expressions and assignments; the global view of data makes programs more readable.
• Portability from sequential C code through the use of upc_forall().
• Shared-memory programming style.
• Good programmer involvement in optimization.

38. Why NOT UPC?
• Parallelism is fixed; no nested parallelism.
• One-dimensional partitioning at best!

39. Titanium
Professor Katherine Yelick, University of California, Berkeley

40. Titanium Design
• Based on Java: classes, automatic memory management.
• Compiled to C, then to a native binary (no JVM).
• Same parallelism model as UPC: dynamic threads are not supported; SPMD with a global address space.
• Designed specifically with Adaptive Mesh Refinement and PDE computations in mind.

41. SPMD Execution Model
• Same as UPC.
• A simple Hello:
class HelloWorld {
  public static void main (String[] argv) {
    int single gv = 1000;
    int lv = Ti.thisProc();
    System.out.println("Hello from proc " + lv + " with " + gv);
  }
}

42. Titanium "single" Keyword
• A "single" method is one called by all processes: public single static void allStep(…)
• A "single" variable has the same value on all processes: int single timestep = 0;
• You don't have to use it, but it is helpful to the compiler.

43. Titanium: Memory Model
• Global pointers and local pointers (rather than shared variables).
• A global pointer may point to remote locations.
• Global references are more expensive: dereferencing takes extra time (a check to see if the target is local).
(Diagram: threads 0–3, each with a private program stack and a partition of the shared object heap.)

44. Titanium: Global Address Space
• Processes allocate locally.
• References can be passed to other processes.
• Work with pointers rather than shared variables.
class C { int val; … }
C gv;         // global pointer
C local lv;   // local pointer
if (Ti.thisProc() == 0) lv = new C();
gv = broadcast lv from 0;
(Diagram: after the broadcast, every process's gv points to the object allocated by process 0, while each lv remains local.)

45. Titanium: Data Distribution
class Boxed {
  public Boxed(int j) { val = j; }
  public int val;
}
Object [1d] single allData;
allData = new Object [0:Ti.numProcs()-1];
allData.exchange(new Boxed(Ti.thisProc()));
(Diagram: after exchange(), every process's allData holds references to each process's Boxed object, with val = 0, 1, 2, ….)

46. Titanium: Unordered Iteration
• foreach (p in r) { … A[p] … }
• This is not a parallel construct: p is a Point, r is a Domain.
• Memory optimization is facilitated: it helps loop-dependency analysis, simplifies bounds checking, and avoids indexing details.

47. Titanium: Domains and Unordered Iteration
Point<2> lb = [1,1];
Point<2> ub = [10,20];
RectDomain<2> r = [lb:ub];                  // rectangular domain
double [2d] a = new double [r];
double [2d] b = new double [1:10,1:20];
double [2d] c = new double [lb:ub:[1,1]];   // 10x20 with [1,1] stride

for (int i=1; i<=10; i++)
  for (int j=1; j<=20; j++)
    c[i,j] = a[i,j] + b[i,j];

can be expressed as:

foreach (p in c.domain()) { c[p] = a[p] + b[p]; }

48. Titanium: Other Features
• Immutable classes: pass-by-value classes
• Templates
• Operator overloading
• Region-based garbage collection

49. Why Titanium?
• Clean object-oriented design.
• Many constructs and data types have been defined to help compiler optimizations: domains (obtaining subarrays without copying), immutable classes, unordered iteration.
• Multidimensional and irregular data structures can be expressed.

50. Why NOT Titanium?
• Might not be straightforward for simple loop-based programs.
• Still limited (static) parallelism.
• It is not real Java (it is translated to C).
