Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more detail)
Guang R. Gao, ACM Fellow and IEEE Fellow, Endowed Distinguished Professor
Electrical & Computer Engineering, University of Delaware
ggao@capsl.udel.edu
cpeg421-10-F/Topic-3-II-EARTH

Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization - SSB
• The percolation model and its applications
• Summary

The EARTH Multithreaded Execution Model
Two levels of fine-grain threads:
- threaded procedures
- fibers
[Figure: fibers within a frame; asynchronous function invocation (invoking a threaded function); a sync operation. Legend: signal token; each sync slot shows total # signals and arrived # signals.]

EARTH vs. CILK
[Figure: CILK model and EARTH model side by side. Labels: fiber within a frame, parallel function invocation, frames, fork a procedure, SYNC ops.]
Note: EARTH has its origin in the static dataflow model.

The “Fiber” Execution Model
[Figure: fibers connected by signal tokens; each sync slot shows total # signals and arrived # signals.]

The “Fiber” Execution Model
[Figure: the fiber graph with a different set of signal counts; each sync slot shows total # signals and arrived # signals.]
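The counts in the figure can be read as per-slot bookkeeping. Below is a minimal C sketch, assuming each sync slot tracks a total and an arrived signal count and enables its fiber once the two match. The names sync_slot, slot_init, and slot_signal are hypothetical and are not part of the EARTH runtime.

```c
#include <assert.h>

/* Hypothetical sync slot; field names follow the figure legend. */
typedef struct {
    int total;     /* total # signals required before the fiber may run */
    int arrived;   /* arrived # signals so far */
    int fiber_id;  /* fiber to enable when arrived == total */
} sync_slot;

void slot_init(sync_slot *s, int total, int fiber_id) {
    assert(total > 0);
    s->total = total;
    s->arrived = 0;
    s->fiber_id = fiber_id;
}

/* Deliver one signal token.
 * Returns the id of the fiber to enable, or -1 if it must keep waiting. */
int slot_signal(sync_slot *s) {
    s->arrived++;
    if (s->arrived == s->total) {
        s->arrived = 0;        /* assumption: the slot resets for reuse */
        return s->fiber_id;
    }
    return -1;
}
```

With total = 2, the first call to slot_signal leaves the fiber waiting and the second call enables it.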

A Loop Example
for (i = 1; i <= N; ++i) {
  S1: …
  S2: x[i] = …
  S3: y[i] = … + x[i-1] …
  . . .
  Sk: …
}
[Figure: iterations i = 1, 2, 3, …, N laid out against statements S1, S2, S3, …, Sk and threads T1, T2, T3.]
Note: How are loop-carried dependencies handled? What are the implications for cross-core software pipelining?

Main Features of EARTH
• Fast thread context switching
• Efficient parallel function invocation
• Good support for fine-grain dynamic load balancing
• Efficient support for split-phase transactions and fibers
• Features unique to the EARTH model in comparison to the CILK model

Compiling C for EARTH: Objectives
• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multi-threaded architectures. (EARTH-C)
• Develop compiler techniques to automatically translate programs written in EARTH-C to multi-threaded programs. (EARTH-C, Threaded-C)
• Determine whether EARTH-C plus the compiler can compete with hand-coded Threaded-C programs.

Summary of EARTH-C Extensions
• Explicit Parallelism
  • Parallel versus Sequential statement sequences
  • Forall loops
• Locality Annotation
  • Local versus Remote Memory references (global, local, replicate, …)
• Dynamic Load Balancing
  • Basic versus remote functions and invocation sites

EARTH-C Compiler Environment
[Figure: two diagrams, titled EARTH Compilation Environment and The EARTH Compiler. Labels: C, EARTH-C, McCAT, EARTH SIMPLE, EARTH-C Compiler, Threaded-C, Threaded-C Compiler, EARTH SIMPLE C, Split Phase Analysis, Program Dependence Analysis, Build DDG, Compute Remote Level, Thread Generation, Thread Partitioning, Merge Statements, Thread Synchronization, Thread Scheduling, Thread Code Generation.]

The McCAT/EARTH Compiler
EARTH-C
PHASE I (Standard McCAT Analyses & Transformations)
• Simplify
• goto elimination
• Local function inlining
• Points-to Analysis
• Heap Analysis
• R/W Set Analysis
• Array Dependence Tester
EARTH-SIMPLE-C
PHASE II (Parallelization)
• Forall Loop Detection
• Loop Partitioning
EARTH-SIMPLE-C
PHASE III
• Build Hierarchical DDG
• Thread Generation
• Code Generation
THREADED-C

The Fibonacci Example
[Figure: threaded function fib with arguments n, result, done; sync slot counts 0 0 and 2 2.]
  if (n < 2)
    DATA_RSYNC (1, result, done);
  else {
    TOKEN (fib, n-1, &sum1, slot_1);
    TOKEN (fib, n-2, &sum2, slot_2);
  }
  END_THREAD( );
THREAD-1;
  DATA_RSYNC (sum1 + sum2, result, done);
  END_THREAD( );
END_FUNCTION
\Petaflop\Workshop98-7B.ppt

Matrix Multiplication: Sequential Version
/* N, a, b, and c are assumed to be declared elsewhere. */
void main ( )
{
  int i, j, k;
  float sum;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}

The Inner Product Example
[Figure: threaded function inner with arguments result, a, b, done; sync slot counts 0 0 and 2 2.]
  BLKMOV_SYNC (a, row_a, N, slot_1);
  BLKMOV_SYNC (b, column_b, N, slot_1);
  sum = 0;
  END_THREAD( );
THREAD-1;
  for (i = 0; i < N; i++)
    sum = sum + (row_a[i] * column_b[i]);
  DATA_RSYNC (sum, result, done);
  END_THREAD( );
END_FUNCTION


EARTH-C to Threaded-C (Thread Generation)
Given a sequence of statements s1, s2, …, sn, we wish to create threads such that we:
• Maximize thread length (minimize thread switching overhead)
• Retain sufficient parallelism
• Issue remote memory requests as early as possible (prefetching)
• Compile split-phase remote memory operations and remote function calls correctly

An Example
int f(int *x, int i, int j)
{
  int a, b, sum, prod, fact;
  int r1, r2, r3;
  a = x[i];
  fact = 1;
  fact = fact * a;
  b = x[j];
  sum = a + b;
  prod = a * b;
  r1 = g(sum);
  r2 = g(prod);
  r3 = g(fact);
  return (r1 + r2 + r3);
}

Example Partitioned into Four Fibers
Fiber-0: a = x[i]; fact = 1;
Fiber-1 (sync count 1): fact = fact * a; b = x[j];
Fiber-2 (sync count 1): sum = a + b; prod = a * b; r1 = g(sum); r2 = g(prod); r3 = g(fact);
Fiber-3 (sync count 3): return (r1 + r2 + r3);

Better Strategy Using List Scheduling
• Put each instruction in the earliest possible thread.
• Within a thread, execute the remote operations as early as possible.
Build a Data Dependence Graph (DDG) and use a list scheduling strategy, where the selection of instructions is guided by Earliest Thread Number and Statement Type.

Instruction Types
Schedule First
• remote_read, remote_write
• remote_fn_call
• local_simple
• remote_compound
• local_compound
• basic_fn_call
Schedule Last

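List Scheduling Previous Example
Each statement is labeled (earliest thread number, statement type), where RR = remote read, LS = local simple, LC = local compound, RF = remote function call.
(0,RR) a = x[i];     (0,RR) b = x[j];     (0,LS) fact = 1;
(1,LS) sum = a+b;    (1,LS) prod = a*b;   (1,LC) fact = fact*a;
(1,RF) r1 = g(sum);  (1,RF) r2 = g(prod); (1,RF) r3 = g(fact);
(2,LS) return (r1 + r2 + r3);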

Resulting List Scheduled Threads
Thread 0: a = x[i]; b = x[j]; fact = 1;
Thread 1 (sync count 2): sum = a+b; r1 = g(sum); prod = a*b; r2 = g(prod); fact = fact*a; r3 = g(fact);
Thread 2 (sync count 3): return (r1+r2+r3);
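The thread numbers in the list-scheduling example can be reproduced with a simple rule. This is an assumption inferred from the labels on the slides, not the compiler's documented algorithm: a statement goes into the earliest thread that follows all of its split-phase (remote) predecessors and is no earlier than its local predecessors. The sketch below computes only the Earliest Thread Number; ordering by statement type within a thread is left out. The types instr and kind and the function earliest_thread are hypothetical names.

```c
#include <assert.h>

/* Statement types used on the slide labels. */
enum kind { RR, RF, LS, LC };  /* remote read, remote fn call, local simple, local compound */

typedef struct {
    enum kind kind;
    int npred;     /* number of DDG predecessors */
    int pred[3];   /* indices of DDG predecessors */
} instr;

static int is_split_phase(enum kind k) { return k == RR || k == RF; }

/* ins[] must be in topological order (every predecessor has a lower index). */
void earliest_thread(const instr *ins, int n, int *thread) {
    for (int i = 0; i < n; i++) {
        int t = 0;
        for (int p = 0; p < ins[i].npred; p++) {
            int j = ins[i].pred[p];
            assert(j < i);
            /* A remote result is only available in a later thread. */
            int need = thread[j] + (is_split_phase(ins[j].kind) ? 1 : 0);
            if (need > t) t = need;
        }
        thread[i] = t;
    }
}

/* DDG of f() from the slides, in program order. */
const instr example_f[10] = {
    {RR, 0, {0}},        /* 0: a = x[i]           */
    {LS, 0, {0}},        /* 1: fact = 1           */
    {LC, 2, {0, 1}},     /* 2: fact = fact * a    */
    {RR, 0, {0}},        /* 3: b = x[j]           */
    {LS, 2, {0, 3}},     /* 4: sum = a + b        */
    {LS, 2, {0, 3}},     /* 5: prod = a * b       */
    {RF, 1, {4}},        /* 6: r1 = g(sum)        */
    {RF, 1, {5}},        /* 7: r2 = g(prod)       */
    {RF, 1, {2}},        /* 8: r3 = g(fact)       */
    {LS, 3, {6, 7, 8}},  /* 9: return r1+r2+r3    */
};
```

Running earliest_thread over example_f places the two remote reads and fact = 1 in thread 0, the arithmetic and the three calls to g in thread 1, and the return in thread 2, which matches the three threads shown above.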

Generating Threaded-C Code
THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;

  /* THREAD_0:; */
  INIT_SYNC (0, 2, 2, 1);
  INIT_SYNC (1, 3, 3, 2);
  GET_SYNC_L (&x[i], &a, 0);
  GET_SYNC_L (&x[j], &b, 0);
  fact = 1;
  END_THREAD( );

  THREAD_1:;
  sum = a + b;
  TOKEN (g, &r1, SLOT_ADR(1), sum);
  prod = a * b;
  TOKEN (g, &r2, SLOT_ADR(1), prod);
  fact = fact * a;
  TOKEN (g, &r3, SLOT_ADR(1), fact);
  END_THREAD( );

  THREAD_2:;
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION( );
}

Fine-Grain Synchronization: Two Types

Enforce Data Dependencies
• A DoAcross loop with a positive, constant dependence distance D.
• Iterations that run in parallel are assigned to different threads.
for (i = D; i < N; ++i) {
  A[i] = …
  …
  … = A[i-D];
}
T0 (i = 2):     { A[2] = …     …  … = A[2-D] }
T1 (i = 2 + D): { A[2+D] = …   …  … = A[2] }
The data dependence on A[2] between T0 and T1 must be enforced by synchronization.

Memory Based Fine-Grain Synchronization:
• Full/Empty bits (HEP, Tera MTA, etc.) and I-Structures (dataflow-based machines)
• Associate a “state” with each memory location (fine granularity). Fine-grain synchronization on the location is realized through transitions of that state.
[Figure: I-Structure state transition diagram [ArvindEtAl89 @ TOPLAS]. States: Empty, Deferred, Full. Transitions are labeled read, write, and reset.]
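The I-Structure states can be sketched as a small state machine in C. This is a single-threaded illustration under the usual I-Structure assumptions (a read of a location that has not been written is deferred, a write releases deferred reads, and each location is written once between resets). The names icell, icell_read, icell_write, and icell_reset are hypothetical.

```c
#include <assert.h>

enum istate { EMPTY, DEFERRED, FULL };

typedef struct {
    enum istate state;
    int value;
    int pending;   /* number of deferred reads waiting for the write */
} icell;

/* reset: any state -> Empty */
void icell_reset(icell *c) {
    c->state = EMPTY;
    c->value = 0;
    c->pending = 0;
}

/* read: Full -> Full (value returned through *out, result 1);
 * Empty or Deferred -> Deferred (read is queued, result 0). */
int icell_read(icell *c, int *out) {
    if (c->state == FULL) {
        *out = c->value;
        return 1;
    }
    c->state = DEFERRED;
    c->pending++;
    return 0;
}

/* write: Empty or Deferred -> Full.
 * Returns how many deferred reads the write released. */
int icell_write(icell *c, int v) {
    assert(c->state != FULL);   /* assumption: single assignment */
    int released = c->pending;
    c->value = v;
    c->pending = 0;
    c->state = FULL;
    return released;
}
```

A read issued before the write moves the cell to Deferred; the later write reports one released read and leaves the cell Full, after which reads return the value immediately.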

With Memory Based Fine-Grain Sync
Original loop:
for (i = D; i < N; ++i) {
  A[i] = …
  …
  … = A[i-D];
}
With synchronized accesses:
for (i = D; i < N; ++i) {
  write_sync(&(A[i]), …)
  …
  … = read_sync(&(A[i-D]));
}
• A single atomic operation completes the synchronized write/read directly in memory
• No need to implement synchronization with other resources, e.g., shared memory
• Low overhead: just one memory transaction

With Memory Based Fine-Grain Sync
T0 (i = 2):     { write_sync(&(A[2]), …);     …  … = read_sync(&(A[2-D])); }
T1 (i = 2 + D): { write_sync(&(A[2 + D]), …); …  … = read_sync(&(A[2])); }
• A single atomic operation completes the synchronized write/read directly in memory
• No need to implement synchronization with other resources, e.g., shared memory
• Low overhead: just one memory transaction

An Alternative: Control-Flow Based Synchronization
• The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model.
for (i = D; i < N; ++i) {
  A[i] = …
  post(i);
  …
  wait(i-D);
  … = A[i-D];
}
• You may need to worry about this: there is no data dependency between A[i] = … and post(i), nor between wait(i-D) and … = A[i-D], so each pair may be reordered unless fences are added:
  A[i] = …; fence; post(i);
  wait(i-D); fence; … = A[i-D];
For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]
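One way to express post/wait with the required ordering is C11 atomics, where a release store and an acquire load play the role of the two fences. This is only a sketch under that assumption; the array size, the flag array posted, and the helpers post, wait_for, produce, and consume are hypothetical names rather than an API from the slides.

```c
#include <assert.h>
#include <stdatomic.h>

#define N 16

static double A[N];
static atomic_int posted[N];   /* static storage: all flags start at 0 */

/* post(i): publish that A[i] has been written.
 * The release store keeps the earlier write to A[i] from moving after it. */
void post(int i) {
    atomic_store_explicit(&posted[i], 1, memory_order_release);
}

/* wait(i): spin until iteration i has posted.
 * The acquire load keeps the later read of A[i] from moving before it. */
void wait_for(int i) {
    while (atomic_load_explicit(&posted[i], memory_order_acquire) == 0) {
        /* busy-wait */
    }
}

/* Producer side of the loop body: A[i] = ...; fence; post(i); */
void produce(int i, double v) {
    A[i] = v;
    post(i);
}

/* Consumer side of the loop body: wait(i-D); fence; ... = A[i-D]; */
double consume(int i) {
    wait_for(i);
    return A[i];
}
```

In the DoAcross loop, the thread running iteration i would call produce(i, …) and later consume(i - D). Note that the flags live in separate shared-memory words, which is the extra resource the memory-based scheme avoids.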

A Question!
Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?

Key Observation
At any instant of a “reasonable” parallel execution, only a small fraction of memory locations are actively participating in synchronization.
Solution: Synchronization State Buffer (SSB). Only record and manage the states of actively synchronized data units to support fine-grain synchronization.

What is SSB?
• A small hardware buffer attached to the memory controller of each memory bank.
• Records and manages the states of actively synchronized data units.
• Hardware cost
  • Each SSB is a small look-up table: easy to implement
  • Each SSB is independent: hardware cost grows only linearly with the number of memory banks
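The look-up-table idea can be illustrated in software. The C sketch below is not the hardware design; it only shows how a small table can hold state for the words that are actively synchronized and release an entry once the synchronization completes. The capacity of 16, the linear search, and the names ssb, ssb_write_sync, ssb_read_sync, and ssb_active are assumptions made for illustration, and only a single-writer-single-reader pattern is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SSB_ENTRIES 16   /* assumed capacity for this sketch */

typedef struct {
    uintptr_t addr;  /* address of the synchronized word */
    int in_use;      /* 1 while the word is actively synchronized */
    int full;        /* 1 once the writer has produced the value */
} ssb_entry;

typedef struct {
    ssb_entry e[SSB_ENTRIES];
} ssb;

/* Find the entry tracking addr, or NULL if the word is not active. */
static ssb_entry *ssb_find(ssb *b, uintptr_t addr) {
    for (size_t k = 0; k < SSB_ENTRIES; k++)
        if (b->e[k].in_use && b->e[k].addr == addr)
            return &b->e[k];
    return NULL;
}

/* Claim a free entry for addr, or NULL if the buffer is full. */
static ssb_entry *ssb_alloc(ssb *b, uintptr_t addr) {
    for (size_t k = 0; k < SSB_ENTRIES; k++)
        if (!b->e[k].in_use) {
            b->e[k].addr = addr;
            b->e[k].in_use = 1;
            b->e[k].full = 0;
            return &b->e[k];
        }
    return NULL;
}

/* Writer: store v and mark the word full.
 * Returns 0 on success, -1 if no entry is free. */
int ssb_write_sync(ssb *b, int *word, int v) {
    uintptr_t addr = (uintptr_t)word;
    ssb_entry *e = ssb_find(b, addr);
    if (e == NULL) e = ssb_alloc(b, addr);
    if (e == NULL) return -1;
    *word = v;
    e->full = 1;
    return 0;
}

/* Reader: if the word has been produced, copy it to *out, free the
 * entry, and return 1. Otherwise return 0 so the caller can retry. */
int ssb_read_sync(ssb *b, const int *word, int *out) {
    ssb_entry *e = ssb_find(b, (uintptr_t)word);
    if (e == NULL || !e->full) return 0;
    *out = *word;
    e->in_use = 0;   /* the word is no longer actively synchronized */
    return 1;
}

/* Number of words currently tracked by the buffer. */
int ssb_active(const ssb *b) {
    int n = 0;
    for (size_t k = 0; k < SSB_ENTRIES; k++) n += b->e[k].in_use;
    return n;
}
```

Only words between their write_sync and read_sync occupy an entry, so the table stays small no matter how large the memory is.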

SSB on Many-Core (IBM C64)
[Figure: IBM Cyclops-64, designed by Monty Denneau.]

SSB Synchronization Functionalities
Data Synchronization: Enforce RAW data dependencies
• Word-level support
• Two single-writer-single-reader (SWSR) modes
• One single-writer-multiple-reader (SWMR) mode
Fine-Grain Locking: Enforce mutual exclusion
• Word-level support
• Write lock (exclusive lock)
• Read lock (shared lock)
• Recursive lock
SSB is capable of supporting more functionality.

Experimental Infrastructure
[Figure: software toolchain. Labels: C Compiler (GCC/Open64), OpenMP Compiler, Binutils (assembler, linker), Libraries (Std C/Math lib, TiNy Threads Library/RTS, OpenMP RTS), Cyclops-64 Micro Kernel, FAST Simulator (Software), Ms. Clops Hardware Emulator.]
Simulation Testbed:
• IBM Cyclops-64 Chip Architecture
  • 160 thread units (500MHz)
  • Three-level explicitly addressable memory hierarchy
  • Efficient thread-level execution support
• SSB for on-chip SRAM bank: 16-entry, 8-way associative

SSB Fine-Grain Sync. is Efficient
• For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.
• For example, with up to 128 threads:
  • Livermore loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  • Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%
