Compiler Perspectives on Transitioning Code Generation Techniques for Multi-Core and Many-Core Systems

HPC User Forum
Back End Compiler Panel
SiCortex Perspective
Kevin Harris, Compiler Manager
April 2009

Will compiler code generation techniques transition along with the hardware transition from multi-core to many-core and hybrid systems, and at what speed?
• Not quickly: Typical sequence of compiler, hardware & user response to hardware innovations:
  • Hand coded experimentation with primitives
  • Directly specify new primitives in source to simplify the development chain
  • Exploit “normal” language constructs to target subset capability of new primitives
  • Early attempts at optimization of feature usage
  • Language extensions to allow user specification of source attributes & choices to assist
  • New programming paradigm for radical programmer productivity improvement

What information do you need from a Compiler Intermediate Format to efficiently utilize multi-core, many-core and hybrid systems that is not available from traditional languages like C, C++, or F90? Are you looking at directive-based or library-based approaches or is there another approach that you like?
IR not an obvious problem: HW communication design is:
• Each core/ALU obeys single threaded von Neumann design. Traditional optimizers work well.
• Source languages continue to be single threaded or explicitly parallel for ease of understanding & control
• Performance limitations around the use of multiple cores and hybrids are all bandwidth & latency related:
  • Primary optimization challenge since the 80s: How to minimize data movement?
  • Optimization paradigms for single threaded (e.g. cache reuse) work on new opportunities – how to apply?
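As an illustration of the single-threaded cache-reuse paradigm mentioned above, here is a minimal sketch of loop tiling applied to a matrix transpose. The function names and the TILE value are assumptions made for this example, not taken from any SiCortex or Open64 code; a real tile size would be chosen per cache hierarchy.

```c
#include <stddef.h>

/* Hypothetical tile edge; a real value would be tuned to the cache size. */
#define TILE 32

/* Straightforward transpose: src is read with unit stride, but dst is
   written with stride n, so each dst cache line is revisited n times,
   far apart in time. */
void transpose_naive(size_t n, const double *src, double *dst)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: same result, but the work is done one TILE x TILE
   block at a time so the lines touched by a block can stay in cache. */
void transpose_tiled(size_t n, const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t i_end = ii + TILE < n ? ii + TILE : n;
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```

Both functions compute the same result; only the order of memory references changes. The open question raised on the slide is how to apply the same reordering idea when the "cache" is another core's or another node's memory.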

Is embedded global memory addressing (like Co-Array Fortran) to be widely available and supported even on distributed memory systems?
Yes! Partitioned Global Address Space (PGAS) languages bridge the “shared” vs. “distributed” design gap.
• UPC, Co-Array Fortran, Titanium, HPCS languages ...
• Shared memory does not scale: Cache coherency is too expensive. RDMA needed – here soon (OpenFabrics Alliance)
• The hard part about distributed memory:
  • Knowing where to put the data
  • Knowing when & how to move it (bandwidth & latency)
  • A problem even in the shared memory case (NUMA)
• Two level model: allows the compiler to provide the mechanisms, the programmer specifies the data placement and movement.
• Compiler & runtime can help: Prefetching, caching etc. Analogous to traditional local memory optimizations.
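The "where to put the data" question can be made concrete with a small sketch of the index arithmetic behind a block distribution, the kind of mapping a PGAS compiler and runtime would supply once the programmer has chosen a layout. This is plain C, not UPC or Co-Array Fortran, and the struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical descriptor for a 1-D array block-distributed over nodes. */
struct block_dist {
    size_t n_elems;   /* global array length */
    size_t n_nodes;   /* number of nodes (memory partitions) */
};

/* Elements per node, rounded up so every element has an owner. */
static size_t block_size(const struct block_dist *d)
{
    return (d->n_elems + d->n_nodes - 1) / d->n_nodes;
}

/* Node whose local memory holds global index i. */
size_t owner_of(const struct block_dist *d, size_t i)
{
    return i / block_size(d);
}

/* Offset of global index i within its owner's local block. */
size_t local_offset(const struct block_dist *d, size_t i)
{
    return i % block_size(d);
}

/* Nonzero if a reference from node `me` to index i stays in local memory;
   zero if it would need a remote access (for example an RDMA get or put). */
int is_local(const struct block_dist *d, size_t me, size_t i)
{
    return owner_of(d, i) == me;
}
```

With 10 elements over 4 nodes the block size is 3, so index 4 lives on node 1 at offset 1. Every reference for which is_local returns zero is data movement, which is where the bandwidth and latency costs on the slide come from.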

What kind of hybrid systems or processor extensions are going to be supported by your compiler's code generation suite?
SiCortex systems use multi-core chips with integrated communications fabric with built-in RDMA
• Extant parallel programming models work well
• Ideal platform for PGAS languages – coming soon!
Committed to Open Source model for compilers & tools
• Investment in gcc toolchain – bringing MIPS to HPC
• Large investment in Open64 compiler codebase
• Using open source components for PGAS support
Integrated platform allows tighter tool integration than commodity cluster approach

What new run-time libraries will be available to utilize multi-core, many-core, and hybrid systems and will they work seamlessly through dynamic linking?
Autotuned libraries for dense linear algebra and signal processing are clear successes. This trend will continue for other common HPC programming paradigms. Should work well for new HW paradigms.
Data layout and movement optimization is unsolved for irregular problems. Both static analysis and run-time information appear essential to get the best results. Pushing AMR down into the tools seems promising.
Modern dynamic languages (Java, C#, Matlab, ...) are not designed to exploit the power of static analysis – far from Fortran or C performance for HPC apps.
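The autotuning idea can be sketched in a few lines: run the same kernel with several candidate parameters on the target machine and keep the fastest. The code below is a minimal sketch under stated assumptions. The kernel, the function names, and the use of clock() with a single timed run per candidate are all choices made for illustration; production autotuners search much larger spaces and repeat measurements to reduce noise.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Keeps the compiler from discarding the kernel's stores as dead. */
static volatile double sink;

/* Kernel under tuning: blocked transpose with a run-time tile edge. */
static void transpose_blocked(size_t n, size_t tile,
                              const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t jj = 0; jj < n; jj += tile)
            for (size_t i = ii; i < n && i < ii + tile; i++)
                for (size_t j = jj; j < n && j < jj + tile; j++)
                    dst[j * n + i] = src[i * n + j];
}

/* Time each candidate tile edge on an n x n problem (n > 0) and return
   the fastest. Candidates must be nonzero. Returns 0 if the candidate
   list is empty or if allocation fails. */
size_t pick_tile(size_t n, const size_t *candidates, size_t n_candidates)
{
    double *src = calloc(n * n, sizeof *src);
    double *dst = calloc(n * n, sizeof *dst);
    size_t best = 0;
    double best_time = 0.0;

    if (src && dst) {
        for (size_t c = 0; c < n_candidates; c++) {
            clock_t start = clock();
            transpose_blocked(n, candidates[c], src, dst);
            double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            sink = dst[n * n - 1];
            if (best == 0 || elapsed < best_time) {
                best = candidates[c];
                best_time = elapsed;
            }
        }
    }
    free(src);
    free(dst);
    return best;
}
```

The selected value depends on the machine and on timing noise, which is the point: the choice is made from run-time measurement rather than from static analysis alone.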

Multi/Many-core vs. Hybrid
From a programming perspective, Multi/Many-core isn’t fundamentally different from well-studied SMP exploitation going back decades.
Cache coherency is expensive in both old SMPs and many-core contexts – difficult for compilers & libraries to optimize – false sharing, etc.
HPC use of many-core will be severely limited by memory BW. Not obvious what compilers or libraries can do to overcome this.
Use of hybrid systems (GPGPUs, FPGAs, Cell, ...) is relatively recent in comparison. Performance asymmetries still being explored. Much usage experience needed for tool evolution.
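False sharing, named above as one of the coherency costs, is usually avoided by hand with data layout. The sketch below contrasts a packed per-thread counter array with a padded one. The 64-byte line size, the thread count, and all names are assumptions for illustration; the code requires C11 for _Alignas.

```c
#include <stddef.h>

/* Assumed cache-line size in bytes; the real value is hardware specific. */
#define CACHE_LINE 64

/* Hypothetical number of threads, one counter slot per thread. */
#define N_THREADS 8

/* Packed layout: adjacent counters share a cache line, so threads that
   update different counters still invalidate each other's copy of it. */
struct packed_counter {
    long value;
};

/* Padded layout: each counter is aligned to, and fills, its own line. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;
};

struct packed_counter packed[N_THREADS];
struct padded_counter padded[N_THREADS];

/* Update the slot owned by thread_id; with the padded layout this write
   touches a cache line that no other thread's slot occupies. */
void bump(struct padded_counter *slots, size_t thread_id)
{
    slots[thread_id].value++;
}
```

The padded array trades memory (one line per counter) for the absence of coherency traffic between threads. A compiler cannot generally make this trade on its own, because it changes the data layout visible to the program.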
