Towards Optimized UPC Implementations

Presentation Transcript


  1. Towards Optimized UPC Implementations Tarek A. El-Ghazawi The George Washington University, tarek@gwu.edu

  2. Agenda • Background • UPC Language Overview • Productivity • Performance Issues • Automatic Optimizations • Conclusions

  3. Parallel Programming Models • What is a programming model? • An abstract machine that outlines the programmer's view of data and execution • Where architecture and applications meet • A non-binding contract between the programmer and the compiler/system • Good Programming Models Should • Allow efficient mapping on different architectures • Keep programming easy • Benefits • Application – independence from architecture • Architecture – independence from applications

  4. Programming Models • Characterized by their process/thread and address space models: • Message Passing (e.g. MPI) • Shared Memory (e.g. OpenMP) • DSM/PGAS (e.g. UPC)

  5. Programming Paradigms Expressivity • Paradigms classified by whether parallelism and locality are implicit or explicit: • Implicit parallelism, implicit locality: Sequential (e.g. C, Fortran, Java) • Implicit parallelism, explicit locality: Data Parallel (e.g. HPF, C*) • Explicit parallelism, implicit locality: Shared Memory (e.g. OpenMP) • Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g. UPC, CAF, and Titanium)

  6. What is UPC? • Unified Parallel C • An explicit parallel extension of ISO C • A distributed shared memory/PGAS parallel programming language

  7. Why not message passing? • Performance • High penalty for short transactions • Cost of calls • Two-sided • Excessive buffering • Ease-of-use • Explicit data transfers • Domain decomposition does not maintain the original global application view • More code and conceptual difficulty

  8. Why DSM/PGAS? • Performance • No calls • Efficient short transfers • Locality • Ease-of-use • Implicit transfers • Consistent global application view • Less code and conceptual difficulty

  9. Why DSM/PGAS: New Opportunities for Compiler Optimizations • DSM programming model exposes sequential remote accesses at compile time • Opportunity for compiler-directed prefetching • [Figure: Sobel operator over an image partitioned across Thread0–Thread3, with ghost zones at the thread boundaries]

  10. History • Initial Tech. Report from IDA in collaboration with LLNL and UCB in May 1999 • UPC consortium of government, academia, and HPC vendors coordinated by GWU, IDA, and DoD • The participants currently are: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …

  11. Status • Specification v1.0 completed February of 2001, v1.1.1 in October of 2003, v1.2 will add collectives and UPC/IO • Benchmarking Suites: Stream, GUPS, RandomAccess, NPB suite, Splash-2, and others • Testing suite v1.0, v1.1 • Short courses and tutorials in the US and abroad • Research Exhibits at SC 2000-2004 • UPC web site: upc.gwu.edu • UPC Book by mid 2005 from John Wiley and Sons • Manual(s)

  12. Hardware Platforms • UPC implementations are available for • SGI Origin 2000/3000 • Intrepid – 32- and 64-bit GCC • UCB – 32-bit GCC • Cray T3D/E • Cray X-1 • HP AlphaServer SC, Superdome • UPC Berkeley Compiler: Myrinet, Quadrics, and Infiniband Clusters • Beowulf Reference Implementation (MPI-based, MTU) • New ongoing efforts by IBM and Sun

  13. UPC Execution Model • A number of threads working independently in a SPMD fashion • MYTHREAD specifies thread index (0..THREADS-1) • Number of threads specified at compile-time or run-time • Process and Data Synchronization when needed • Barriers and split phase barriers • Locks and arrays of locks • Fence • Memory consistency control
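
A minimal sketch of the SPMD execution model above (not from the slides), assuming a standard UPC compiler; it shows MYTHREAD, THREADS, and a barrier:

    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        /* every thread executes main() independently (SPMD) */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;   /* all threads synchronize here before continuing */
        return 0;
    }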

  14. UPC Memory Model • Shared space with thread affinity, plus private spaces • A pointer-to-shared can reference all locations in the shared space • A private pointer may reference only addresses in its private space or addresses in its portion of the shared space • Static and dynamic memory allocations are supported for both shared and private memory • [Figure: one shared space partitioned among Thread 0 through Thread THREADS-1, plus a private space (Private 0 … Private THREADS-1) per thread]
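
A short sketch of how the shared and private spaces above look in code, assuming the standard upc.h allocation routines; the variable names are illustrative:

    #include <upc.h>

    shared int counter;              /* shared scalar, affinity to thread 0 */
    shared [4] int grid[4*THREADS];  /* shared array, blocks of 4 elements per thread */
    int local_tmp;                   /* private: each thread has its own copy */

    int main(void) {
        /* collective dynamic allocation: THREADS blocks of one int each,
           all threads receive the same pointer-to-shared */
        shared int *buf = (shared int *) upc_all_alloc(THREADS, sizeof(int));
        buf[MYTHREAD] = MYTHREAD;    /* each thread writes the element it has affinity to */
        upc_barrier;
        if (MYTHREAD == 0) upc_free((shared void *) buf);
        return 0;
    }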

  15. UPC Pointers • How to declare them? • int *p1; /* private pointer pointing locally */ • shared int *p2; /* private pointer pointing into the shared space */ • int *shared p3; /* shared pointer pointing locally */ • shared int *shared p4; /* shared pointer pointing into the shared space */ • You may find many using “shared pointer” to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.

  16. UPC Pointers • [Figure: p1 and p2 reside in each thread's private space, p3 and p4 in the shared space; p1 points locally while p2, p3, and p4 point into the shared space]

  17. Synchronization - Barriers • No implicit synchronization among the threads • UPC provides the following synchronization mechanisms: • Barriers • Locks • Memory Consistency Control • Fence
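
A hedged sketch of the mechanisms listed above (lock and variable names are illustrative, not from the slides):

    #include <upc.h>

    shared int total;

    int main(void) {
        /* collective lock allocation: every thread gets the same lock pointer */
        upc_lock_t *lock = upc_all_lock_alloc();

        upc_lock(lock);      /* mutual exclusion around a shared update */
        total += MYTHREAD;
        upc_unlock(lock);

        upc_notify;          /* split-phase barrier: signal arrival ...       */
        /* ... purely local work could overlap with other threads here ...    */
        upc_wait;            /* ... then wait until all threads have notified */

        upc_fence;           /* fence: complete all outstanding shared accesses */
        if (MYTHREAD == 0) upc_lock_free(lock);
        return 0;
    }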

  18. Memory Consistency Models • Has to do with ordering of shared operations, and when a change of a shared object by a thread becomes visible to others • Consistency can be strict or relaxed • Under the relaxed consistency model, the shared operations can be reordered by the compiler / runtime system • The strict consistency model enforces sequential ordering of shared operations. (No operation on shared can begin before the previous ones are done, and changes become visible immediately)

  19. Memory Consistency Models • User specifies the memory model through: • declarations • pragmas for a particular statement or sequence of statements • use of barriers, and global operations • Programmers are responsible for using the correct consistency model
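
A sketch of the declaration- and include-level mechanisms above; the producer/consumer flag pattern is illustrative:

    #include <upc_relaxed.h>   /* file-level default: relaxed consistency        */
                               /* (a #pragma upc strict could instead switch a   */
                               /*  particular statement block to strict)         */

    strict shared int flag;    /* declaration-level: accesses to flag are strict */
    shared int data;

    void producer(void) {
        data = 1234;           /* relaxed write                                     */
        flag = 1;              /* strict write: not reordered before the data write */
    }

    void consumer(void) {
        while (flag == 0)      /* strict reads: poll until the producer signals */
            ;
        /* relaxed reads of data issued here are ordered after the strict read of flag */
    }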

  20. UPC and Productivity • Metrics • Lines of ‘useful’ Code • indicates the development time as well as the maintenance cost • Number of ‘useful’ Characters • alternative way to measure development and maintenance efforts • Conceptual Complexity • function level, • keyword usage, • number of tokens, • max loop depth, • …

  21. Manual Effort – NPB Example

  22. Manual Effort – More Examples

  23. Conceptual Complexity - HIST

  24. Conceptual Complexity - GUPS

  25. UPC Optimizations Issues • Particular Challenges • Avoiding Address Translation • Cost of Address Translation • Special Opportunities • Locality-driven compiler-directed prefetching • Aggregation • General • Low-level optimized libraries, e.g. collective • Backend optimizations • Overlapping of remote accesses and synchronization with other work

  26. Showing Potential Optimizations Through Emulated Hand-Tunings • Different hand-tuning levels: • Unoptimized UPC code • referred to as UPC.O0 • Privatized UPC code • referred to as UPC.O1 • Prefetched UPC code • hand-optimized variant using block get/put to mimic the effect of prefetching • referred to as UPC.O2 • Fully Hand-Tuned UPC code • hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching • referred to as UPC.O3 • T. El-Ghazawi and S. Chauvin, “UPC Benchmarking Issues”, Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP 2001), 2001, pp. 365-372

  27. Address Translation Cost and Local Space Privatization – Cluster • STREAM benchmark, results gathered on a Myrinet cluster

  28. Address Translation and Local Space Privatization – DSM Architecture • [Figure: STREAM benchmark bandwidth (MB/s) for bulk and element-by-element operations]

  29. Aggregation and Overlapping of Remote Shared Memory Accesses • [Figures: UPC N-Queens execution time; UPC Sobel Edge execution time (SGI Origin 2000)] • The benefit of hand-optimizations is highly application dependent: • N-Queens does not perform any better, mainly because it is an embarrassingly parallel program • Sobel Edge Detector gains an order-of-magnitude speedup after hand-optimization and scales perfectly linearly

  30. Impact of Hand-Optimizations on NPB.CG Class A on SGI Origin 2000

  31. Shared Address Translation Overhead • Address translation overhead is quite significant • More than 70% of the work for a local-shared memory access • Demonstrates the real need for optimization • [Figures: overhead present in local-shared memory accesses (SGI Origin 2000, GCC-UPC); quantification of the address translation overheads]

  32. Shared Address Translation Overheads for Sobel Edge Detection • UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code • Ox notations from T. El-Ghazawi and S. Chauvin, “UPC Benchmarking Issues”, Proceedings of the 2001 International Conference on Parallel Processing (ICPP 2001), Valencia, September 2001

  33. Reducing Address Translation Overheads via Translation Look-Aside Buffers • F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, “Fast Address Translation Techniques for Distributed Shared Memory Compilers”, IPDPS’05, Denver CO, April 2005 • Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations • Two alternative methods proposed to create and use MMTBs: • FT: basic method using direct addressing • RT: advanced method using indexed addressing • Was prototyped as a compiler-enabled optimization • no modifications to actual UPC codes are needed

  34. Different Strategies – Full-Table • Pros • Direct mapping • No address calculation • Cons • Large memory required • Can lead to competition over caches and main memory • Consider shared [B] int array[8]; • To initialize FT: ∀ i ∈ [0,7], FT[i] = _get_vaddr(&array[i]) • To access array[]: ∀ i ∈ [0,7], array[i] = _get_value_at(FT[i])
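
A conceptual plain-C sketch of the FT idea, assuming a hardware shared-memory machine (such as the SGI Origin used in this talk) where every shared element's virtual address can be cached once; translate_shared_address() is a hypothetical stand-in for the runtime's translation routine, not a UPC API, playing the role of _get_vaddr above:

    #include <upc.h>
    #define N 8

    shared [2] int array[N];      /* the shared [B] array of the example, with B = 2 */
    static int *FT[N];            /* full table: one cached virtual address per element */

    int *translate_shared_address(shared int *p);   /* hypothetical one-time translation */

    void build_ft(void) {
        /* pay the expensive shared-address translation once per element */
        for (int i = 0; i < N; i++)
            FT[i] = translate_shared_address(&array[i]);
    }

    int read_elem(int i) {
        return *FT[i];            /* later accesses become plain loads through the table */
    }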

  35. Different Strategies – Reduced-Table: Infinite Blocksize • BLOCKSIZE = infinite: only the address of the first element of the array needs to be saved, since all array data is contiguous • Consider shared [] int array[4]; • To initialize RT: RT[0] = _get_vaddr(&array[0]) • To access array[]: ∀ i ∈ [0,3], array[i] = _get_value_at(RT[0] + i) • RT Strategy: • Only one table entry in this case • Address calculation step is simple in this case • [Figure: with an infinite block size, all of array[0..3] resides on THREAD0; each thread's RT[0] holds the address of array[0]]

  36. Different Strategies – Reduced-Table: Default Blocksize • BLOCKSIZE = 1: only the address of the first element on each thread needs to be saved, since each thread's portion of the array data is contiguous • Consider shared [1] int array[16]; • To initialize RT: ∀ i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i]) • To access array[]: ∀ i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS)) • RT Strategy: • Less memory required than FT; the MMTB buffer has THREADS entries • Address calculation step is a bit costly, but much cheaper than in current implementations • [Figure: array[0..15] distributed round-robin over THREAD0–THREAD3; each RT[t] holds the address of array[t], the first element with affinity to thread t]

  37. Different Strategies – Reduced-Table: Arbitrary Blocksize • Arbitrary block sizes: only the address of the first element of each block needs to be saved, since all block data is contiguous • Consider shared [2] int array[16]; • To initialize RT: ∀ i ∈ [0,7], RT[i] = _get_vaddr(&array[i*blocksize(array)]) • To access array[]: ∀ i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array))) • RT Strategy: • Less memory required than for FT, but more than in the previous cases • Address calculation step more costly than in the previous cases • [Figure: blocks of two elements distributed round-robin over THREAD0–THREAD3; each RT[b] holds the address of the first element of block b]
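
The RT access above is just a block/offset decomposition; a plain-C sketch under the assumption that the per-block base addresses have already been translated once (filling RT is the runtime's job, mirroring the _get_vaddr step on the slide):

    #define NELEMS  16
    #define BLOCK   2                 /* blocksize(array) in the example above */
    #define NBLOCKS (NELEMS / BLOCK)

    static int *RT[NBLOCKS];          /* reduced table: one base address per block */

    static inline int read_elem(int i) {
        int block  = i / BLOCK;       /* which block owns element i                 */
        int offset = i % BLOCK;       /* position of element i inside its block     */
        return RT[block][offset];     /* direct load through the cached base address */
    }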

  38. Performance Impact of the MMTB – Sobel Edge • Performance of Sobel Edge detection using the new MMTB strategies (with and without O0) • FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0) • The RT strategy is slower than FT since the address calculation (arbitrary block size case) becomes more complex • FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)

  39. Performance Impact of the MMTB – Matrix Multiplication • Performance and hardware profiling of matrix multiplication using the new MMTB strategies • FT strategy: increase in L1 data cache misses due to the large table size • RT strategy: L1 misses kept low, but an increase in the number of loads and stores is observed, reflecting the extra address computations (arbitrary blocksize used)

  40. Comparison Among Optimizations of Storage, Memory Accesses and Computation Requirements • [Table: time and storage requirements of the address translation methods for the matrix multiply microkernel (E: element size in bytes, P: pointer size in bytes)] • Number of loads and stores can increase with arithmetic operators

  41. UPC Work-Sharing Construct Optimizations • By thread/index number (upc_forall integer): upc_forall(i=0; i<N; i++; i) loop body; • By the address of a shared variable (upc_forall address): upc_forall(i=0; i<N; i++; &shared_var[i]) loop body; • By thread/index number (for optimized): for(i=MYTHREAD; i<N; i+=THREADS) loop body; • By thread/index number (for integer): for(i=0; i<N; i++) { if(MYTHREAD == i%THREADS) loop body; } • By the address of a shared variable (for address): for(i=0; i<N; i++) { if(upc_threadof(&shared_var[i]) == MYTHREAD) loop body; }
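
For reference, a complete hedged example of the “upc_forall address” variant above; the thread with affinity to c[i] executes iteration i:

    #include <upc_relaxed.h>
    #define N (100 * THREADS)

    shared int a[N], b[N], c[N];

    int main(void) {
        int i;
        /* affinity expression &c[i]: only the owner of c[i] runs iteration i */
        upc_forall (i = 0; i < N; i++; &c[i])
            c[i] = a[i] + b[i];
        upc_barrier;
        return 0;
    }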

  42. Performance of Equivalent upc_forall and for Loops

  43. Performance Limitations Imposed by Sequential C Compilers -- STREAM

  44. Loopmark – SET/ADD Operations • Let us compare loopmarks for each Fortran / C operation

  45. Loopmark – SET/ADD Operations
  Fortran:
    MEMSET (bulk set)
      146. 1             t = mysecond(tflag)
      147. 1 V M--<><>   a(1:n) = 1.0d0
      148. 1             t = mysecond(tflag) - t
      149. 1             times(2,k) = t
    SET
      158. 1             arrsum = 2.0d0
      159. 1             t = mysecond(tflag)
      160. 1 MV------<   DO i = 1,n
      161. 1 MV            c(i) = arrsum
      162. 1 MV            arrsum = arrsum + 1
      163. 1 MV------>   END DO
      164. 1             t = mysecond(tflag) - t
      165. 1             times(4,k) = t
    ADD
      180. 1             t = mysecond(tflag)
      181. 1 V M--<><>   c(1:n) = a(1:n) + b(1:n)
      182. 1             t = mysecond(tflag) - t
      183. 1             times(7,k) = t
  C:
    MEMSET (bulk set)
      163. 1             times[1][k] = mysecond_();
      164. 1             memset(a, 1, NDIM*sizeof(elem_t));
      165. 1             times[1][k] = mysecond_() - times[1][k];
    SET
      217. 1             set = 2;
      220. 1             times[5][k] = mysecond_();
      222. 1 MV--<       for (i=0; i<NDIM; i++)
      223. 1 MV          {
      224. 1 MV            c[i] = (set++);
      225. 1 MV-->        }
      227. 1             times[5][k] = mysecond_() - times[5][k];
    ADD
      283. 1             times[10][k] = mysecond_();
      285. 1 Vp--<       for (j=0; j<NDIM; j++)
      286. 1 Vp          {
      287. 1 Vp            c[j] = a[j] + b[j];
      288. 1 Vp-->        }
      290. 1             times[10][k] = mysecond_() - times[10][k];
  Legend: V: Vectorized – M: Multistreamed – p: conditional, partial and/or computed

  46. UPC vs CAF using the NPB workloads • In general, UPC is slower than CAF, mainly due to: • Point-to-point vs barrier synchronization • Better scalability with proper collective operations • Program writers can do point-to-point synchronization using current constructs • Scalar performance of source-to-source translated code • Alias analysis (C pointers) • Can highlight the need for explicitly using restrict to help several compiler backends • Lack of support for multi-dimensional arrays in C • Can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC • Need for exhaustive C compiler analysis • A failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF • A failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
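
The alias-analysis point above is the classic restrict issue; a small C sketch of the hint that can help a source-to-source UPC backend vectorize (function and array names are illustrative):

    /* Without restrict, the C backend must assume c may alias a or b and will
       often refuse to vectorize or software-pipeline this loop. */
    void vadd(double * restrict c, const double * restrict a,
              const double * restrict b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }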

  47. Conclusions • UPC is a locality-aware parallel programming language • With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI • UPC is very productive, and UPC applications result in much smaller and more readable code than MPI • UPC compiler optimizations are still lagging, although substantial progress has been made • For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions

  48. Conclusions • In general, four types of optimizations: • Optimizations to Exploit the Locality Consciousness and other Unique Features of UPC • Optimizations to Keep the Overhead of UPC Low • Optimizations to Exploit Architectural Features • Standard Optimizations that are Applicable to all System Compilers

  49. Conclusions • Optimizations are possible at three levels: • A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations • C backend compilers that compete with Fortran • A strong run-time system that can work effectively with the operating system

  50. Selected Publications • T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming. John Wiley & Sons Inc., New York, 2005. ISBN: 0-471-22048-5. (June 2005) • T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, A. Mohamed, “Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study”, Journal of Future Generation Computer Systems, North-Holland (accepted)
