
Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more details). Guang R. Gao, ACM Fellow and IEEE Fellow, Endowed Distinguished Professor, Electrical & Computer Engineering, University of Delaware, ggao@capsl.udel.edu.





Presentation Transcript


  1. Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more details). Guang R. Gao, ACM Fellow and IEEE Fellow, Endowed Distinguished Professor, Electrical & Computer Engineering, University of Delaware, ggao@capsl.udel.edu. cpeg421-10-F/Topic-3-II-EARTH

  2. Outline • Overview • Fine-grain multithreading • Compiling for fine-grain multithreading • The power of fine-grain synchronization - SSB • The percolation model and its applications • Summary

  4. The EARTH Multithreaded Execution Model. Two levels of fine-grain threads: threaded procedures and fibers. [Figure: fibers within a frame; async. function invocation; a sync operation; invoking a threaded function. Legend: signal token, total # signals, arrived # signals.]

  5. EARTH vs. CILK [Figure: CILK model and EARTH model side by side; labels: fiber within a frame, parallel function invocation, frames, fork a procedure, SYNC ops.] Note: EARTH has its origin in the static dataflow model.

  6-17. The “Fiber” Execution Model [Animation over 12 slides: signal tokens move between fibers; each sync slot is labeled with its total # signals and its arrived # signals.]
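The counting rule that these frames animate can be sketched in C. This is a minimal model and not EARTH's runtime: the names `SyncSlot`, `slot_init`, and `slot_signal` are hypothetical, and re-arming the counter after a firing is an assumption, modeled on the reset-count argument that `INIT_SYNC` appears to take in the Threaded-C example later in the deck.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one sync slot; not the EARTH runtime API. */
typedef struct {
    int total;    /* "Total # signals": how many signals enable the fiber */
    int arrived;  /* "Arrived # signals" since the slot was last armed   */
    int fired;    /* how many times the guarded fiber has been enabled   */
} SyncSlot;

static void slot_init(SyncSlot *s, int total)
{
    assert(total > 0);
    s->total = total;
    s->arrived = 0;
    s->fired = 0;
}

/* Deliver one signal token; returns true if this token enables the fiber. */
static bool slot_signal(SyncSlot *s)
{
    s->arrived++;
    if (s->arrived < s->total)
        return false;
    s->arrived = 0;  /* assumption: the slot re-arms itself after firing */
    s->fired++;
    return true;
}
```

With `slot_init(&s, 2)`, the first call to `slot_signal(&s)` returns false and the second returns true, which corresponds to a fiber whose slot reads 2 total / 2 arrived becoming ready to run.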

  18. A Loop Example
     for (i = 1; i <= N; ++i) {
         S1: …
         S2: x[i] = …
         S3: y[i] = … + x[i-1] …
         . . .
         Sk: …
     }
     [Figure: iterations i = 1, 2, 3, …, N, each running S1, S2, S3, …, Sk; labels T1, T2, T3.]
     Note: how are loop-carried dependences handled, and what does that imply for cross-core software pipelining?

  19. Main Features of EARTH • Fast thread context switching • Efficient parallel function invocation • Good support for fine-grain dynamic load balancing • Efficient support for split-phase transactions and fibers • Features unique to the EARTH model in comparison to the CILK model

  20. Outline • Overview • Fine-grain multithreading • Compiling for fine-grain multithreading • The power of fine-grain synchronization - SSB • The percolation model and its applications • Summary

  21. Compiling C for EARTH: Objectives • Design simple high-level extensions for C that allow programmers to write programs that run efficiently on multithreaded architectures. (EARTH-C) • Develop compiler techniques to automatically translate programs written in EARTH-C into multithreaded programs. (EARTH-C, Threaded-C) • Determine whether EARTH-C plus the compiler can compete with hand-coded Threaded-C programs.

  22. Summary of EARTH-C Extensions • Explicit parallelism: parallel versus sequential statement sequences; forall loops • Locality annotation: local versus remote memory references (global, local, replicate, …) • Dynamic load balancing: basic versus remote functions and invocation sites

  23. EARTH-C Compiler Environment [Figure: two diagrams. EARTH compilation environment: EARTH-C, EARTH-C Compiler, Threaded-C, Threaded-C Compiler. The EARTH compiler: EARTH SIMPLE C; McCAT; Split Phase Analysis; Program Dependence Analysis; Build DDG; Compute Remote Level; Thread Generation; Thread Partitioning; Merge Statements; Thread Synchronization; Thread Scheduling; Thread Code Generation.]

  24. The McCAT/EARTH Compiler
     EARTH-C
     → PHASE I (Standard McCAT Analyses & Transformations): Simplify; goto elimination; local function inlining; points-to analysis; heap analysis; R/W set analysis; array dependence tester
     → EARTH-SIMPLE-C
     → PHASE II (Parallelization): forall loop detection; loop partitioning
     → EARTH-SIMPLE-C
     → PHASE III: build hierarchical DDG; thread generation; code generation
     → THREADED-C

  25. The Fibonacci Example [Figure: frame of fib with arguments n, result, done; sync counts shown: 0 0 and 2 2.]
     if (n < 2)
         DATA_RSYNC(1, result, done);
     else {
         TOKEN(fib, n-1, &sum1, slot_1);
         TOKEN(fib, n-2, &sum2, slot_2);
     }
     END_THREAD();

     THREAD-1:
     DATA_RSYNC(sum1 + sum2, result, done);
     END_THREAD();
     END_FUNCTION
     \Petaflop\Workshop98-7B.ppt
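The control structure of this example can be imitated in plain C to see how the two fibers cooperate. The sketch below is a sequential simulation under stated assumptions, not Threaded-C: `Frame`, `spawn`, `deliver`, and `fib_fibers` are hypothetical names, `spawn` stands in for `TOKEN`, `deliver` stands in for `DATA_RSYNC`, and a single counter of two pending signals replaces the slide's sync slots. As on the slide, the base case returns 1 for n < 2.

```c
#include <assert.h>
#include <stdlib.h>

/* One activation frame of the threaded procedure fib (hypothetical layout). */
typedef struct Frame {
    int n;
    long sum1, sum2;       /* filled in by the two child invocations        */
    int pending;           /* signals fiber 1 still waits for (starts at 2) */
    long *result;          /* where this invocation delivers its value      */
    struct Frame *parent;  /* frame whose counter gets signaled, or NULL    */
} Frame;

enum { READY_MAX = 64 };         /* enough for small n; depth grows with n */
static Frame *ready[READY_MAX];  /* LIFO pool of frames whose fiber 0 can run */
static int ready_top = 0;

/* TOKEN analogue: create a frame and make its fiber 0 runnable. */
static void spawn(int n, long *result, Frame *parent)
{
    Frame *f = malloc(sizeof *f);
    assert(f != NULL && ready_top < READY_MAX);
    f->n = n; f->sum1 = 0; f->sum2 = 0; f->pending = 2;
    f->result = result; f->parent = parent;
    ready[ready_top++] = f;
}

/* DATA_RSYNC analogue: store the value, signal the parent, and run the
   parent's fiber 1 as soon as its last signal has arrived. */
static void deliver(Frame *f, long value)
{
    while (f != NULL) {
        Frame *parent = f->parent;
        *f->result = value;
        free(f);
        if (parent == NULL || --parent->pending > 0)
            return;
        value = parent->sum1 + parent->sum2;  /* body of fiber 1 */
        f = parent;
    }
}

long fib_fibers(int n)
{
    long answer = 0;
    spawn(n, &answer, NULL);
    while (ready_top > 0) {
        Frame *f = ready[--ready_top];  /* run fiber 0 of some ready frame */
        if (f->n < 2) {
            deliver(f, 1);
        } else {
            spawn(f->n - 1, &f->sum1, f);
            spawn(f->n - 2, &f->sum2, f);
        }
    }
    return answer;
}
```

Fiber 0 never waits for its children; it only issues them and ends. The addition runs later, triggered by the second arriving result, which is the split-phase behavior the slide illustrates.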

  26. Matrix Multiplication (Sequential Version)
     /* a, b, c, and N are assumed to be declared elsewhere */
     int main(void)
     {
         int i, j, k;
         float sum;
         for (i = 0; i < N; i++)
             for (j = 0; j < N; j++) {
                 sum = 0;
                 for (k = 0; k < N; k++)
                     sum = sum + a[i][k] * b[k][j];
                 c[i][j] = sum;
             }
         return 0;
     }

  27. The Inner Product Example [Figure: frame of inner with arguments a, b, result, done; sync counts shown: 0 0 and 2 2.]
     BLKMOV_SYNC(a, row_a, N, slot_1);
     BLKMOV_SYNC(b, column_b, N, slot_1);
     sum = 0;
     END_THREAD();

     THREAD-1:
     for (i = 0; i < N; i++)
         sum = sum + (row_a[i] * column_b[i]);
     DATA_RSYNC(sum, result, done);
     END_THREAD();
     END_FUNCTION

  29. EARTH-C to Threaded-C (Thread Generation). Given a sequence of statements s1, s2, …, sn, we wish to create threads that: • Maximize thread length (minimize thread-switching overhead) • Retain sufficient parallelism • Issue remote memory requests as early as possible (prefetching) • Compile split-phase remote memory operations and remote function calls correctly

  30. An Example
     int f(int *x, int i, int j)
     {
         int a, b, sum, prod, fact;
         int r1, r2, r3;
         a = x[i];
         fact = 1;
         fact = fact * a;
         b = x[j];
         sum = a + b;
         prod = a * b;
         r1 = g(sum);
         r2 = g(prod);
         r3 = g(fact);
         return (r1 + r2 + r3);
     }

  31. Example Partitioned into Four Fibers
     Fiber-0:                 a = x[i]; fact = 1;
     Fiber-1 (sync count 1):  fact = fact * a; b = x[j];
     Fiber-2 (sync count 1):  sum = a + b; prod = a * b; r1 = g(sum); r2 = g(prod); r3 = g(fact);
     Fiber-3 (sync count 3):  return (r1 + r2 + r3);

  32. Better Strategy Using List Scheduling • Put each instruction in the earliest possible thread. • Within a thread, execute the remote operations as early as possible. Build a Data Dependence Graph (DDG) and use a list scheduling strategy in which the selection of instructions is guided by Earliest Thread Number and Statement Type.

  33. Instruction Types, from schedule first to schedule last: • remote_read, remote_write • remote_fn_call • local_simple • remote_compound • local_compound • basic_fn_call

  34. List Scheduling the Previous Example. Each DDG node is labeled (earliest thread number, statement type):
     (0,RR) a = x[i];       (0,RR) b = x[j];       (0,LS) fact = 1;
     (1,LS) sum = a + b;    (1,LS) prod = a * b;   (1,LC) fact = fact * a;
     (1,RF) r1 = g(sum);    (1,RF) r2 = g(prod);   (1,RF) r3 = g(fact);
     (2,LS) return (r1 + r2 + r3);
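The labels on this slide can be reproduced with a short C sketch of the two steps the list-scheduling slides describe: compute each statement's earliest thread number from the DDG, then repeatedly pick the ready statement with the smallest (thread number, statement type) pair. Everything here is an illustrative assumption rather than the McCAT/EARTH implementation: the `Stmt` layout, the hand-written dependence lists, and the rule that a consumer of a remote read, remote write, remote call, or remote compound statement must start one thread later than its producer.

```c
#include <assert.h>

/* Statement types in the order of the "Instruction Types" slide:
   a smaller value is scheduled earlier. */
typedef enum { RR, RW, RF, LS, RC, LC, BF } StmtType;

enum { MAX_DEPS = 3, NSTMT = 10 };

typedef struct {
    const char *text;
    StmtType type;
    int ndeps;
    int deps[MAX_DEPS];  /* indices of earlier statements this one needs */
    int thread;          /* earliest thread number, computed below        */
} Stmt;

/* The function f from the example, in program order. */
static Stmt example[NSTMT] = {
    {"a = x[i]",            RR, 0, {0},       0},
    {"fact = 1",            LS, 0, {0},       0},
    {"fact = fact * a",     LC, 2, {0, 1},    0},
    {"b = x[j]",            RR, 0, {0},       0},
    {"sum = a + b",         LS, 2, {0, 3},    0},
    {"prod = a * b",        LS, 2, {0, 3},    0},
    {"r1 = g(sum)",         RF, 1, {4},       0},
    {"r2 = g(prod)",        RF, 1, {5},       0},
    {"r3 = g(fact)",        RF, 1, {2},       0},
    {"return r1 + r2 + r3", LS, 3, {6, 7, 8}, 0},
};

/* Assumption: results of these types arrive split-phase. */
static int is_split_phase(StmtType t)
{
    return t == RR || t == RW || t == RF || t == RC;
}

static void schedule(Stmt *s, int n, int *order)
{
    int done[NSTMT] = {0};
    assert(n <= NSTMT);

    /* Step 1: earliest thread number of every statement. */
    for (int i = 0; i < n; i++) {
        int t = 0;
        for (int d = 0; d < s[i].ndeps; d++) {
            const Stmt *p = &s[s[i].deps[d]];
            int need = p->thread + (is_split_phase(p->type) ? 1 : 0);
            if (need > t)
                t = need;
        }
        s[i].thread = t;
    }

    /* Step 2: list scheduling over the ready statements. */
    for (int k = 0; k < n; k++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            int is_ready = !done[i];
            for (int d = 0; is_ready && d < s[i].ndeps; d++)
                is_ready = done[s[i].deps[d]];
            if (!is_ready)
                continue;
            if (best < 0 || s[i].thread < s[best].thread ||
                (s[i].thread == s[best].thread && s[i].type < s[best].type))
                best = i;
        }
        assert(best >= 0);  /* holds for any acyclic DDG */
        done[best] = 1;
        order[k] = best;
    }
}
```

Calling `schedule(example, NSTMT, order)` places the two remote reads before `fact = 1` in thread 0 and issues each call to `g` immediately after the statement that produces its argument, so the remote operations start as early as the dependences allow.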

  35. Resulting List-Scheduled Threads
     Thread 0:                 a = x[i]; b = x[j]; fact = 1;
     Thread 1 (sync count 2):  sum = a + b; r1 = g(sum); prod = a * b; r2 = g(prod); fact = fact * a; r3 = g(fact);
     Thread 2 (sync count 3):  return (r1 + r2 + r3);

  36. Generating Threaded-C Code
     THREADED f(int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
     {
         SLOTS SYNC_SLOTS[2];
         int a, b, sum, prod, fact, r1, r2, r3;

         /* THREAD_0:; */
         INIT_SYNC(0, 2, 2, 1);
         INIT_SYNC(1, 3, 3, 2);
         GET_SYNC_L(&x[i], &a, 0);
         GET_SYNC_L(&x[j], &b, 0);
         fact = 1;
         END_THREAD();

     THREAD_1:;
         sum = a + b;
         TOKEN(g, &r1, SLOT_ADR(1), sum);
         prod = a * b;
         TOKEN(g, &r2, SLOT_ADR(1), prod);
         fact = fact * a;
         TOKEN(g, &r3, SLOT_ADR(1), fact);
         END_THREAD();

     THREAD_2:;
         DATA_RSYNC_L(r1 + r2 + r3, ret_parm, rsync_parm);
         END_FUNCTION();
     }

  37. Outline • Overview • Fine-grain multithreading • Compiling for fine-grain multithreading • The power of fine-grain synchronization - SSB • The percolation model and its applications • Summary

  38. Fine-Grain Synchronization: Two Types

  39. Enforce Data Dependencies • A DoAcross loop with a positive, constant dependence distance D. Iterations are assigned to different threads and run in parallel.
     for (i = D; i < N; ++i) {
         A[i] = …
         …
         … = A[i-D];
     }
     T0 (i = 2):      { A[2] = … … … = A[2-D] }
     T1 (i = 2 + D):  { A[2+D] = … … … = A[2] }
     The data dependence must be enforced by synchronization.

  40. Memory-Based Fine-Grain Synchronization • Full/Empty bits (HEP, Tera MTA, etc.) and I-Structures (dataflow-based machines) • Associate a “state” with each memory location (fine granularity); fine-grain synchronization on that location is realized through transitions of this state. [Figure: I-Structure state transitions among Empty, Deferred, and Full, with edges labeled read, write, and reset; ArvindEtAl89 @ TOPLAS.]
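The state transitions named on this slide can be written down as a small C state machine. This is a sketch of the usual I-structure reading of the diagram, with several assumptions: the names `ICell`, `icell_read`, `icell_write`, and `icell_reset` are hypothetical, deferred readers are only counted rather than queued, a second write to a full cell is reported as an error, and reset is accepted only when no read is deferred.

```c
#include <assert.h>

typedef enum { IS_EMPTY, IS_DEFERRED, IS_FULL } IState;
typedef enum { IS_OK, IS_WAIT, IS_ERROR } IResult;

typedef struct {
    IState state;
    int value;
    int deferred_reads;  /* reads that arrived before the write */
} ICell;

static void icell_init(ICell *c)
{
    c->state = IS_EMPTY;
    c->value = 0;
    c->deferred_reads = 0;
}

/* read: Full returns the value; Empty or Deferred records a deferred read. */
static IResult icell_read(ICell *c, int *out)
{
    assert(out != 0);
    if (c->state == IS_FULL) {
        *out = c->value;
        return IS_OK;
    }
    c->state = IS_DEFERRED;
    c->deferred_reads++;
    return IS_WAIT;
}

/* write: Empty or Deferred becomes Full. Returns how many deferred reads
   the write satisfies, or -1 if the cell was already Full. */
static int icell_write(ICell *c, int v)
{
    if (c->state == IS_FULL)
        return -1;
    int released = c->deferred_reads;
    c->value = v;
    c->state = IS_FULL;
    c->deferred_reads = 0;
    return released;
}

/* reset: back to Empty; refused while reads are still deferred (assumption). */
static IResult icell_reset(ICell *c)
{
    if (c->state == IS_DEFERRED)
        return IS_ERROR;
    icell_init(c);
    return IS_OK;
}
```

A read that arrives early is therefore not an error: it parks the cell in the Deferred state, and the later write both stores the value and reports how many waiting readers it released.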

  41. With Memory-Based Fine-Grain Sync
     Original loop:
     for (i = D; i < N; ++i) {
         A[i] = …
         …
         … = A[i-D];
     }
     With synchronized accesses:
     for (i = D; i < N; ++i) {
         write_sync(&(A[i]), …);
         …
         … = read_sync(&(A[i-D]));
     }
     • A single atomic operation completes the synchronized write/read directly in memory.
     • No need to implement synchronization with other resources, e.g., shared memory.
     • Low overhead: just one memory transaction.

  42. With Memory-Based Fine-Grain Sync
     T0 (i = 2):      { write_sync(&(A[2]), …); … … = read_sync(&(A[2-D])); }
     T1 (i = 2 + D):  { write_sync(&(A[2 + D]), …); … … = read_sync(&(A[2])); }
     • A single atomic operation completes the synchronized write/read directly in memory.
     • No need to implement synchronization with other resources, e.g., shared memory.
     • Low overhead: just one memory transaction.
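To make the blocking behavior of `read_sync` concrete, here is a deterministic C simulation of the DoAcross loop with one full/empty bit per array element. It is a sketch under several assumptions: the loop body (`A[i] = i * i` and `B[i] = A[i-D] + 1`) is invented for illustration, `try_read_sync` is a hypothetical non-blocking stand-in for the slide's `read_sync`, iterations are dealt cyclically to two logical threads, and a round-robin loop replaces real hardware threads so that the result is reproducible. On real hardware the bit test and the data access would be one atomic memory operation.

```c
#include <assert.h>
#include <stdbool.h>

enum { N = 8, D = 1, NTHREADS = 2 };

static int  A[N];
static bool full[N];  /* one full/empty bit per element of A */

static void write_sync(int *p, int v)
{
    assert(!full[p - A]);  /* single writer per element */
    *p = v;
    full[p - A] = true;
}

/* Non-blocking variant: returns false while the element is still empty. */
static bool try_read_sync(const int *p, int *out)
{
    if (!full[p - A])
        return false;
    *out = *p;
    return true;
}

typedef struct { int i; bool wrote; } LogicalThread;

/* Runs the loop and returns how many reads found their element empty. */
static int run_doacross(int B[N])
{
    int stalls = 0;
    for (int i = 0; i < N; i++)
        full[i] = false;
    for (int i = 0; i < D; i++)   /* values produced before the loop */
        write_sync(&A[i], i * i);

    LogicalThread th[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        th[t].i = D + t;          /* cyclic distribution of iterations */
        th[t].wrote = false;
    }

    int remaining = N - D;
    while (remaining > 0) {
        /* Step the highest-numbered thread first so that it runs ahead. */
        for (int t = NTHREADS - 1; t >= 0; t--) {
            LogicalThread *me = &th[t];
            int v;
            if (me->i >= N)
                continue;
            if (!me->wrote) {
                write_sync(&A[me->i], me->i * me->i);
                me->wrote = true;
            }
            if (!try_read_sync(&A[me->i - D], &v)) {
                stalls++;         /* producer iteration has not written yet */
                continue;
            }
            B[me->i] = v + 1;
            me->i += NTHREADS;
            me->wrote = false;
            remaining--;
        }
    }
    return stalls;
}
```

The loop always makes progress because the lowest unfinished iteration only depends on an iteration that has already finished. No separate flag array, lock, or barrier appears in the loop body; the synchronization state travels with the data element itself.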

  43. An Alternative: Control-Flow-Based Synchronization
     • The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model.
     for (i = D; i < N; ++i) {
         A[i] = …
         post(i);
         …
         wait(i-D);
         … = A[i-D];
     }
     • You may need to worry about this: there is no data dependency between A[i] = … and post(i), nor between wait(i-D) and … = A[i-D], so fences are needed to keep them in order:
     A[i] = …; fence; post(i);
     wait(i-D); fence; … = A[i-D];
     For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]

  44. A Question! Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?

  45. Key Observation: At any instant of a “reasonable” parallel execution, only a small fraction of memory locations are actively participating in synchronization. Solution: Synchronization State Buffer (SSB): record and manage the states of only the actively synchronized data units to support fine-grain synchronization.

  46. What is SSB? • A small hardware buffer attached to the memory controller of each memory bank. • Records and manages the states of actively synchronized data units. • Hardware cost: each SSB is a small look-up table, so it is easy to implement; the SSBs are independent of one another, so hardware cost grows only linearly with the number of memory banks.
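The idea of tracking state only for actively synchronized words can be sketched as a small set-associative table in C. This is a software illustration under loud assumptions and not the Cyclops-64 design: the 16-entry, 8-way geometry is borrowed from the experimental setup reported later in the deck, while the names (`Ssb`, `ssb_write_sync`, `ssb_read_sync`), the single-writer-single-reader protocol, and the `SSB_OVERFLOW` result for a full set are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SSB_ENTRIES = 16, SSB_WAYS = 8, SSB_SETS = SSB_ENTRIES / SSB_WAYS };

typedef enum { ST_FREE = 0, ST_FULL, ST_READER_WAITING } SsbState;
typedef enum { SSB_DONE, SSB_PENDING, SSB_OVERFLOW } SsbResult;

typedef struct { uintptr_t addr; SsbState state; } SsbEntry;
typedef struct { SsbEntry set[SSB_SETS][SSB_WAYS]; int active; } Ssb;

static void ssb_init(Ssb *b) { memset(b, 0, sizeof *b); }

/* Find the entry for addr, or claim a free way in its set (state stays
   ST_FREE until the caller sets it). Returns NULL if the set is full. */
static SsbEntry *ssb_find(Ssb *b, uintptr_t addr)
{
    SsbEntry *set = b->set[(addr / sizeof(uint64_t)) % SSB_SETS];
    SsbEntry *free_way = NULL;
    for (int w = 0; w < SSB_WAYS; w++) {
        if (set[w].state != ST_FREE && set[w].addr == addr)
            return &set[w];
        if (set[w].state == ST_FREE && free_way == NULL)
            free_way = &set[w];
    }
    if (free_way != NULL) {
        free_way->addr = addr;
        b->active++;
    }
    return free_way;
}

static void ssb_release(Ssb *b, SsbEntry *e)
{
    assert(b->active > 0);
    e->state = ST_FREE;  /* the word goes back to the untracked default */
    b->active--;
}

static SsbResult ssb_write_sync(Ssb *b, uintptr_t addr)
{
    SsbEntry *e = ssb_find(b, addr);
    if (e == NULL) return SSB_OVERFLOW;
    if (e->state == ST_FULL) return SSB_PENDING;  /* previous value unread */
    if (e->state == ST_READER_WAITING) { ssb_release(b, e); return SSB_DONE; }
    e->state = ST_FULL;                           /* newly claimed entry */
    return SSB_DONE;
}

static SsbResult ssb_read_sync(Ssb *b, uintptr_t addr)
{
    SsbEntry *e = ssb_find(b, addr);
    if (e == NULL) return SSB_OVERFLOW;
    if (e->state == ST_FULL) { ssb_release(b, e); return SSB_DONE; }
    e->state = ST_READER_WAITING;                 /* wait for the writer */
    return SSB_PENDING;
}
```

An entry exists only between the first and the second operation of a producer-consumer pair, so the number of occupied entries follows the number of in-flight synchronizations rather than the size of memory, which is the observation that motivates the SSB.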

  47. SSB on Many-Core (IBM C64). IBM Cyclops-64, designed by Monty Denneau.

  48. SSB Synchronization Functionalities. Data synchronization (enforce RAW data dependencies): • Word-level support • Two single-writer-single-reader (SWSR) modes • One single-writer-multiple-reader (SWMR) mode. Fine-grain locking (enforce mutual exclusion): • Word-level support • Write lock (exclusive lock) • Read lock (shared lock) • Recursive lock. SSB is capable of supporting more functionality.

  49. Experimental Infrastructure [Figure: software toolchain with OpenMP compiler, C compiler (GCC/Open64), binutils (assembler, linker), libraries (TiNy Threads library/RTS, OpenMP RTS, standard C/math lib), FAST simulator (software), Ms. Clops hardware emulator, and Cyclops-64 micro kernel.] Simulation testbed: • IBM Cyclops-64 chip architecture • 160 thread units (500 MHz) • Three-level explicitly addressable memory hierarchy • Efficient thread-level execution support • SSB for each on-chip SRAM bank: 16-entry, 8-way associative

  50. SSB Fine-Grain Sync. is Efficient • For all the benchmarks, the SSB-based version shows a significant performance improvement over the versions based on other synchronization mechanisms. • For example, with up to 128 threads: Livermore loop 6 (linear recurrence) shows a 312% improvement over the barrier-based version; ordered integer set (hash table) outperforms the software-based fine-grain methods by up to 84%.
