Presentation Transcript


  1. UPC collective functions. UPC Workshop, George Washington University, May 6-7, 2003

  2. The V1.0 UPC Collectives Spec
  • First draft by Wiebel and Greenberg, March 2002.
  • Spec discussed at the May 2002 and SC’02 UPC workshops.
  • Many helpful comments from Dan Bonachea and Brian Wibecan.
  • The pre4 V1.0 draft, dated April 2, is now on the table.

  3. Collective functions
  • Initialization
    • upc_all_init
  • 5.3 “Relocalization” collectives change data affinity. These are byte-oriented operations.
    • upc_all_broadcast
    • upc_all_scatter
    • upc_all_gather
    • upc_all_gather_all
    • upc_all_exchange
    • upc_all_permute
  • 5.4 “Computational” collectives for reduction and sorting. These operations respect data type and blocksize.
    • upc_all_reduce
    • upc_all_prefix_reduce
    • upc_all_sort

  4. Remaining collectives spec issues (large and small)
  • Wording used to specify the affinity of certain arguments
  • {signed} option for types supported by reduce and prefix reduce operations
  • What requirements are made of the phase of function arguments?
  • Associativity of reduce and prefix reduce operations
  • Commutativity of reduce and prefix reduce operations
  • Can nbytes be 0 in 5.3 functions?
  • What are the synchronization semantics?

  5. Wording used to specify the affinity of certain arguments
  • Resolved: The target of the src/dst pointer must have affinity to thread 0.
  • This applies to the collectives in which the other argument is a distributed array: the affinity-to-thread-0 requirement falls on the source of a broadcast or scatter and on the destination of a gather (see the declarations on slides 17-19).

  6. {signed} option for types supported by reduce and prefix reduce operations
  • “signed char” and “char” are separate and incompatible types.
  • Resolved: Remove the brackets around all signed keywords for all the types. Arguments of type “char” are treated in an implementation-dependent manner.
  • Resolved: Remove references to “ASCII values” since these equivalents are already specified by ANSI C.

  7. What requirements are made of the phase of function arguments?
  • Resolved: Remove the “common” statement regarding phase.
  • Resolved: To the 5.3 functions add: “The src and dst arguments are treated as if they have zero phase.”
  • Resolved: To the 5.4 functions add: “The phase field for the X argument is respected when referencing array elements.”

  8. Associativity and commutativity of reduce and prefix reduce operations
  • All provided reduction operators are assumed to be associative and commutative. All reduction operators (except those provided using UPC_NONCOMM_FUNC) are assumed to be commutative.
  • The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The “canonical” evaluation order of a reduction is in the order of array indices. However, the implementation may take advantage of associativity, or associativity and commutativity, in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition.
  • Advice to implementors: It is strongly recommended that the function be implemented so that the same result be obtained whenever the function is applied on the same arguments, appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processors.
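A small, self-contained C illustration (the values are chosen for this sketch and are not from the slides) of why evaluation order can change the result for an operation that is not strictly associative, such as floating point addition:

  #include <stdio.h>

  int main( void )
  {
      double a = 1.0e16, b = -1.0e16, c = 1.0;
      printf( "%g\n", (a + b) + c );   /* prints 1: the cancellation happens first   */
      printf( "%g\n", a + (b + c) );   /* prints 0: 1.0 is absorbed by -1.0e16 first */
      return 0;
  }

A reduction that is free to regroup (or, with commutativity, reorder) its operands can therefore legitimately return either value.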

  9. Alternative synchronization semantics
  1a) The collective function may begin to read or write data when any thread enters the collective function.
  1b) The collective function may begin to read or write data with affinity to a thread when that thread enters the collective function.
  1c) The collective function may begin to read or write data when all threads have entered the collective function.
  2a) The collective function may exit before the operation is complete. The operation is guaranteed to be complete at the beginning of the next synchronization phase.
  2b) The collective function may return in a thread when all reads and writes with affinity to the thread are complete.
  2c) The operation is complete when any thread exits the collective function.
  3) Each collective function implements any pair (1x, 2y) of synchronization requirements, based on the argument UPC_SYNC_SEM.

  10. Synch semantic naming ideas: UPC_BEGIN_ON_{ANY, MINE, ALL}_COMPLETE_{LATER, MINE, ALL}

  11. Can nbytes be 0 in 5.3 functions?
  • Resolved: Yes. Use the variable name numbytes to distinguish it from nbytes in the allocation functions. Add a statement that if numbytes is 0 then the function is a no-op.

  12. 1. Synchronization phase
  “Arguments to each call to a collective function must be ready at the beginning of the synchronization phase in which the call is made. Results of each call to a collective function are not ready until the beginning of the next synchronization phase.”
  • This is a policy that can be relaxed as implementations demonstrate that fewer constraints lead to better performance.
  • This is an easy-to-remember semantic.

  13. 2. Bill’s strict semantics
  On input, no data will be accessed until all threads enter the collective function. On exit, all output will be written before any thread exits the collective function.

  14. 3. Affinity-based semantics
  Source data with affinity to a thread must be ready when that thread calls the collective function. Destination data with affinity to a thread will be ready when that thread returns from the collective function.
  • Version A: Provide two versions of each collective, with distinct function names. “strict”: guarantee Bill’s strict semantics; “relaxed”: affinity-based semantics.
  • Version B: Only the “relaxed” affinity-based version is provided; the user provides explicit barriers to guarantee safety (see the sketch below).
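A minimal sketch of the Version B discipline, assuming only the relaxed, affinity-based collective exists and using the 3-argument upc_all_broadcast form shown later in these slides (BLK is invented for the example); the barriers are the user's responsibility:

  #include <upc.h>
  #define BLK 16

  shared [BLK] char dst[BLK*THREADS];
  shared []    char src[BLK];              /* source lives entirely on thread 0    */

  void safe_broadcast( void )
  {
      if( MYTHREAD == 0 ) {
          /* ... thread 0 fills src ... */
      }
      upc_barrier;                         /* src is ready before anyone reads it  */
      upc_all_broadcast( dst, src, BLK );
      upc_barrier;                         /* dst is ready before anyone uses it   */
  }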

  15. 4. “Split-phase” semantics Split-phase collectives. How can the split-phase concept be extended to describe the synchronization semantics of the collective functions?

  16. What are the synchronization semantics?
  • Resolution A: Provide two versions of each collective, with distinct function names. “strict”: guaranteed entry and exit barriers; “relaxed”: affinity-based semantics applies.
  • Resolution B: Only the “relaxed” affinity-based version is provided; the user provides explicit barriers to guarantee safety.

  17. void upc_all_broadcast(dst, src, blk);
  Thread 0 sends the same block of data to each thread.
  shared [blk] char dst[blk*THREADS];
  shared [] char src[blk];
  [diagram: the blk-byte src block on thread 0 is copied into each thread's block of dst (threads th0, th1, th2)]

  18. void upc_all_scatter(dst, src, blk);
  Thread 0 sends a unique block of data to each thread.
  shared [blk] char dst[blk*THREADS];
  shared [] char src[blk*THREADS];
  [diagram: block i of src on thread 0 is copied to thread i's block of dst (threads th0, th1, th2)]
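A runnable sketch of a scatter, following the declarations above and the 3-argument form used throughout these slides (BLK and the fill pattern are invented for the example); the explicit barriers follow the affinity-based discipline discussed earlier:

  #include <upc.h>
  #define BLK 16

  shared [BLK] char dst[BLK*THREADS];
  shared []    char src[BLK*THREADS];         /* all of src lives on thread 0       */

  int main( void )
  {
      int i;
      if( MYTHREAD == 0 )
          for( i = 0; i < BLK*THREADS; ++i )
              src[i] = (char)(i / BLK);       /* block i holds the value i          */
      upc_barrier;
      upc_all_scatter( dst, src, BLK );
      upc_barrier;
      /* dst[MYTHREAD*BLK .. MYTHREAD*BLK+BLK-1] now holds MYTHREAD's block,
         i.e. BLK copies of the value MYTHREAD                                      */
      return 0;
  }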

  19. void upc_all_gather(dst, src, blk);
  Each thread sends a block of data to thread 0.
  shared [] char dst[blk*THREADS];
  shared [blk] char src[blk*THREADS];
  [diagram: thread i's block of src is copied to block i of dst on thread 0 (threads th0, th1, th2)]

  20. void upc_all_gather_all(dst, src, blk);
  Each thread sends one block of data to all threads.
  [diagram: src and dst layout across threads th0, th1, th2]

  21. void upc_all_exchange(dst, src, blk);
  Each thread sends a unique block of data to each thread.
  [diagram: src and dst layout across threads th0, th1, th2]

  22. void upc_all_permute(dst, src, perm, blk);
  Thread i sends a block of data to thread perm(i).
  [diagram: src and dst layout across threads th0, th1, th2, with perm = {1, 2, 0}]
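A sketch of a permute call, assuming the 4-argument form above, src/dst layouts analogous to slides 17-19, and a perm array with one entry per thread (these declarations are assumptions, not taken from the slide); here every thread passes its block one thread to the right:

  #include <upc.h>
  #define BLK 16

  shared [BLK] char dst[BLK*THREADS];
  shared [BLK] char src[BLK*THREADS];
  shared int perm[THREADS];                        /* perm[i] has affinity to thread i */

  void shift_right( void )
  {
      perm[MYTHREAD] = (MYTHREAD + 1) % THREADS;   /* thread i sends to thread i+1     */
      upc_barrier;                                 /* perm and src must be ready       */
      upc_all_permute( dst, src, perm, BLK );
      upc_barrier;
  }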

  23. Thread 0 receives UPC_OP applied to src[i], i = 0, ..., n-1.
  int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);
  int i;
  shared [3] int src[4*THREADS];
  i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);   /* the same call is made on every thread */
  [diagram: src holds 1, 2, 4, ..., 2048 in blocks of 3 across th0, th1, th2; the per-thread partial sums are 3591 (th0), 56 (th1), 448 (th2), and thread 0 receives the total 4095]
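A self-contained version of the example above, assuming THREADS == 3 and the 5-argument, value-returning upc_all_reduceI form used on this slide (the result is taken on thread 0, as the caption says):

  #include <upc.h>
  #include <stdio.h>

  shared [3] int src[4*THREADS];

  int main( void )
  {
      int i, sum;
      upc_forall( i = 0; i < 4*THREADS; ++i; &src[i] )
          src[i] = 1 << i;                    /* 1, 2, 4, ..., 2048           */
      upc_barrier;
      sum = upc_all_reduceI( src, UPC_ADD, 12, 3, NULL );
      if( MYTHREAD == 0 )
          printf( "%d\n", sum );              /* 4095 when THREADS == 3       */
      return 0;
  }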

  24. Thread k receives UPC_OP applied to src[i], i = 0, ..., k.
  void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
  shared [*] int src[3*THREADS], dst[3*THREADS];
  [diagram: src holds 1, 2, 4, ..., 256, three elements per thread (th0, th1, th2); dst receives the prefix sums 1, 3, 7, ..., 511]

  25. A “pull” implementation of upc_all_broadcast
  void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
  {
      /* every thread copies thread 0's block into its own part of dst */
      upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
  }
  [diagram: each thread pulls the block from thread 0 (threads th0, th1, th2)]

  26. A “push” implementation of upc_all_broadcast
  void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
  {
      int i;
      upc_forall( i=0; i<THREADS; ++i; 0 )    // Thread 0 only: it pushes one copy per thread
          upc_memcpy( (shared char *)dst + i, (shared char *)src, blk );
  }
  [diagram: thread 0 pushes the block to every thread (threads th0, th1, th2)]

  27. Thread k receives UPC_OP applied to src[i], i = 0, ..., k.
  void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
  shared int src[3*THREADS], dst[3*THREADS];
  [diagram: the same prefix sum as slide 24, but with the default cyclic blocksize: src holds 1, 2, 4, ..., 256 dealt round-robin across th0, th1, th2; dst receives 1, 3, 7, ..., 511]

  28. Extensions
  • Strided copying
  • Vectors of offsets for src and dst arrays
  • Variable-sized blocks
  • Reblocking (cf. the preceding example of prefix reduce):
    shared int src[3*THREADS];
    shared [3] int dst[3*THREADS];
    upc_forall(i=0; i<3*THREADS; i++; ?)
        dst[i] = src[i];

  29. More sophisticated synchronization semantics
  • Consider the “pull” implementation of broadcast. There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other. Each thread does a pairwise synchronization with thread 0. Thread i will not have to wait if it reaches its synchronization point after thread 0. Thread 0 returns from the call after it has sync’d with each thread.
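A sketch (not from the slides) of that pairwise synchronization for the “pull” broadcast, using two hypothetical strict flag variables; each thread waits only on thread 0, and thread 0 waits for every thread to finish pulling. It is a one-shot sketch: the flags rely on static zero-initialization and are never reset for reuse.

  #include <upc.h>
  #include <stddef.h>

  strict shared int ready;                    /* set by thread 0 when src is valid */
  strict shared int pulled[THREADS];          /* pulled[i] set by thread i         */

  void broadcast_pull_pairwise( shared void *dst, shared const void *src, size_t blk )
  {
      int i;
      if( MYTHREAD == 0 )
          ready = 1;                          /* announce that src may be read     */
      else
          while( !ready )                     /* wait on thread 0 only, not peers  */
              ;
      upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
      pulled[MYTHREAD] = 1;
      if( MYTHREAD == 0 )                     /* thread 0 returns after every pull */
          for( i = 1; i < THREADS; ++i )
              while( !pulled[i] )
                  ;
  }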

  30. What requirements are made of the phase of function arguments?
  • Resolved: Remove the “common” statement regarding phase.
  • Resolved: To the 5.3 functions add: “The src and dst arguments are treated as if they have zero phase.”
  • Resolved: To the 5.4 functions add: “The phase field for the X argument is respected when referencing array elements.”
  • Suitably define “respected”.
  • Note that “respecting” the phase requires over 20 integer operations to compute the address of an arbitrary array element given:
    • a shared void * array address of arbitrary phase
    • an element index (offset)
    • the blocksize and element size
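A rough sketch of that arithmetic, using a hypothetical (thread, phase, local byte offset) representation of a pointer-to-shared; it assumes each block of the array starts at the same local offset on every thread, which is the usual model for pointer-to-shared arithmetic, and real compilers use their own representations:

  typedef struct {
      long thread;       /* owning thread                              */
      long phase;        /* element position within its block          */
      long local_off;    /* byte offset within the owner's local chunk */
  } pts_t;

  pts_t element_addr( pts_t base, long B, long esize, long idx, long nthreads )
  {
      pts_t r;
      long  k      = base.phase + idx;                  /* elements past the start of base's block */
      long  blocks = k / B;                             /* whole blocks advanced                   */
      long  rounds = (base.thread + blocks) / nthreads; /* extra local blocks on the owning thread */

      r.phase     = k % B;
      r.thread    = (base.thread + blocks) % nthreads;
      r.local_off = base.local_off + (rounds * B + r.phase - base.phase) * esize;
      return r;
  }

Even this simplified form needs two divisions, two remainders, two multiplications and a handful of additions before the local offset is combined with the owning thread's base address.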

  31. Commutativity of reduce and prefix reduce operations
  • All reduction operators (except those provided using UPC_NONCOMM_FUNC) are assumed to be commutative. A reduction operator that is assumed to be commutative but whose result depends on a particular order of execution has undefined results.
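A sketch of a user-supplied operator that is associative but not commutative, following the argument order of the upc_all_reduceI calls on slide 23 (the function name and the way it is passed are assumptions for this sketch): “first nonzero” keeps the leftmost nonzero value, so swapping operands changes the result, and it would have to be provided through UPC_NONCOMM_FUNC rather than as an ordinary commutative operator.

  #include <upc.h>

  shared [3] int src[4*THREADS];

  int first_nonzero( int a, int b )
  {
      return a != 0 ? a : b;      /* associative, but first_nonzero(1,2) != first_nonzero(2,1) */
  }

  int reduce_first_nonzero( void )
  {
      /* every thread makes the call, as on slide 23 */
      return upc_all_reduceI( src, UPC_NONCOMM_FUNC, 12, 3, first_nonzero );
  }

Because the operator is declared noncommutative, the implementation may still regroup the reduction (associativity) but must keep the operands in array-index order.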
