
Exploiting SIMD parallelism with the CGiS compiler framework



  1. Exploiting SIMD parallelism with the CGiS compiler framework Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm Saarland University

  2. Outline • CGiS • Language, compiler and GPU back-end • SIMD back-end • Hardware • Challenges • Transformations and optimizations • Experimental results • Future Work • Conclusion

  3. CGiS • C-like data-parallel programming language • Goals: • Exploitation of parallel processing units in common PCs (GPU, SIMD units) • Easy access for inexperienced programmers • High abstraction level • 32-bit scalar and small vector data types • Two forms of explicit parallelism • SPMD (iteration), SIMD (vector types)

  4. CGiS Example: YUV to RGB

     PROGRAM yuv_to_rgb;
     INTERFACE
       extern in  float3 YUV<_>;
       extern out float3 RGB<_>;
     CODE
       procedure yuv2rgb (in float3 yuv, out float3 rgb)
       {
         rgb = yuv.x + [0, -0.344, 1.77] * yuv.y + [1.403, -0.714, 0] * yuv.z;
       }
     CONTROL
       forall (yuv in YUV, rgb in RGB)
       {
         yuv2rgb (yuv, rgb);
       }
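
For reference, a minimal hand-written scalar C equivalent of this kernel (illustrative only, not output of the CGiS compiler; one pixel is three consecutive floats, matching the float3 streams, and standard YUV-to-RGB coefficients are assumed):

    #include <stddef.h>

    /* Scalar reference: the CONTROL forall loop corresponds to the
       iteration over all pixels. */
    void yuv_to_rgb(const float *yuv, float *rgb, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            float y = yuv[3*i], u = yuv[3*i + 1], v = yuv[3*i + 2];
            rgb[3*i]     = y                + 1.403f * v;
            rgb[3*i + 1] = y - 0.344f * u   - 0.714f * v;
            rgb[3*i + 2] = y + 1.77f  * u;
        }
    }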

  5. CGiS Compiler Overview [diagram: the CGiS compiler translates CGiS source into processor (GPU/PPU) code and an interface; the application uses the generated code together with the CGiS runtime]

  6. CGiS for GPUs • nVidia G80: • 128 floating-point units • Processes both scalar and vector data • 2-on-2 mapping of CGiS's parallelism • Code generation for various GPU generations • NV30, NV40, G80, CUDA • Limited access to hardware features through the driver

  7. SIMD Hardware • Every common PC features SIMD units • Intel's SSE and Freescale's AltiVec • SIMD parallelism not easily accessible for standard compilers • Well-known vectorization problems • Data access • Hardware requires 16-byte aligned loads • Memory access slow, but cached • Only 4-way SIMD vector parallelism usable
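
A minimal C/SSE sketch of the alignment constraint (function and buffer names are illustrative): aligned loads are fast but fault on unaligned addresses, so an unaligned fallback and a scalar tail are needed.

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    void scale(float *dst, const float *src, size_t n, float f)
    {
        __m128 vf = _mm_set1_ps(f);
        size_t i = 0;
        if (((uintptr_t)src | (uintptr_t)dst) % 16 == 0) {
            /* _mm_load_ps/_mm_store_ps require 16-byte alignment */
            for (; i + 4 <= n; i += 4)
                _mm_store_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), vf));
        } else {
            /* unaligned loads work anywhere but are slower */
            for (; i + 4 <= n; i += 4)
                _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vf));
        }
        for (; i < n; ++i)   /* scalar tail for the remainder */
            dst[i] = src[i] * f;
    }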

  8. The SIMD Back-end • Goal: map CGiS's forms of parallelism to SIMD hardware • “2-on-1” mapping • SIMD vectorization problems • Avoided by design: no data dependency analyses needed • Control flow • Divergence in consecutive elements • Misalignment and data layout • Reordering might be needed • Gathering operations are bottlenecks in load-heavy algorithms on multidimensional streams

  9. Transformations and Optimizations • Control flow conversion • If/loop conversion • Loop sectioning for 2D streams • Increase cache performance for gather accesses • Kernel flattening • IR transformation that replaces compound variables and operations by scalar ones • “2-on-1”

  10. Control Flow Conversion • Full inlining • If/loop conversion with a slightly modified Allen-Kennedy algorithm • No guarded assignments • Masks for select operations are the results of vector compares • Variables live and written after a control flow join are copied at the branching • Select operations are inserted at the join
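
A hedged C/SSE sketch of the select pattern described above (illustrative, not actual compiler output): both branches are computed, a vector compare yields a per-lane mask, and the mask blends the two results at the join.

    #include <xmmintrin.h>

    /* SIMD if-conversion of: out = (x > t) ? x * a : x * b; */
    __m128 select_example(__m128 x, __m128 t, __m128 a, __m128 b)
    {
        __m128 mask   = _mm_cmpgt_ps(x, t);   /* all-ones lanes where x > t */
        __m128 then_v = _mm_mul_ps(x, a);     /* both branches computed ... */
        __m128 else_v = _mm_mul_ps(x, b);     /* ... unconditionally        */
        /* select: (mask & then_v) | (~mask & else_v) */
        return _mm_or_ps(_mm_and_ps(mask, then_v),
                         _mm_andnot_ps(mask, else_v));
    }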

  11. Loop Sectioning • Adaptation of the iteration sequence to better exploit cached data • Only interesting for 2D streams • Iteration space subdivided into stripes • Stripe width depends on access pattern, cache size and local variables
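
A schematic C sketch of the sectioned iteration order (the stripe width and the kernel call are placeholders; in CGiS the compiler chooses the width from the access pattern and cache size):

    /* Traverse a 2D stream in vertical stripes instead of full rows,
       so rows touched by gather accesses of neighbouring iterations
       are still resident in the cache. */
    void iterate_sectioned(int width, int height, int stripe,
                           void (*kernel)(int x, int y))
    {
        for (int x0 = 0; x0 < width; x0 += stripe)
            for (int y = 0; y < height; ++y)
                for (int x = x0; x < x0 + stripe && x < width; ++x)
                    kernel(x, y);
    }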

  12. Kernel Flattening • SIMD vectorization not directly applicable to yuv2rgb (float3 operands) • Thus “flatten” the procedure, or kernel: • Code transformation on the IR • All variables and all statements are split into scalar ones • Those can be subjected to SIMD vectorization

     procedure yuv2rgb (in float3 yuv, out float3 rgb)
     {
       rgb = yuv.x + [0, -0.344, 1.77] * yuv.y + [1.403, -0.714, 0] * yuv.z;
     }

  13. Kernel Flattening Example • Procedure yuv2rgb_f now features data types suitable to be SIMD-parallelized

     procedure yuv2rgb_f (in float yuv_x, in float yuv_y, in float yuv_z,
                          out float rgb_x, out float rgb_y, out float rgb_z)
     {
       float cy = -0.344, cz = 1.77, dx = 1.403, dy = -0.714;
       rgb_x = yuv_x + dx * yuv_z;
       rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
       rgb_z = yuv_x + cz * yuv_y;
     }
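
After flattening, each scalar statement maps directly onto a 4-wide SSE operation, processing four pixels per iteration. A hedged sketch, assuming the streams have already been reordered into separate 16-byte-aligned component arrays (see the following slides on data reordering):

    #include <stddef.h>
    #include <xmmintrin.h>

    void yuv2rgb_f_simd(const float *y, const float *u, const float *v,
                        float *r, float *g, float *b, size_t n)
    {
        const __m128 cy = _mm_set1_ps(-0.344f), cz = _mm_set1_ps(1.77f);
        const __m128 dx = _mm_set1_ps(1.403f),  dy = _mm_set1_ps(-0.714f);
        for (size_t i = 0; i + 4 <= n; i += 4) {   /* four pixels at a time */
            __m128 vy = _mm_load_ps(y + i);        /* aligned loads */
            __m128 vu = _mm_load_ps(u + i);
            __m128 vv = _mm_load_ps(v + i);
            _mm_store_ps(r + i, _mm_add_ps(vy, _mm_mul_ps(dx, vv)));
            _mm_store_ps(g + i, _mm_add_ps(vy,
                         _mm_add_ps(_mm_mul_ps(cy, vu), _mm_mul_ps(dy, vv))));
            _mm_store_ps(b + i, _mm_add_ps(vy, _mm_mul_ps(cz, vu)));
        }
    }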

  14. Kernel Flattening • But: data layout doesn’t fit • No stride-one access for single components • Reordering of data required • Locally via permutes or shuffles • Globally via memory copy

  15. Kernel Flattening: Data Reordering [figure illustrating the reordering of float3 stream data for SIMD processing]

  16. Global vs. Local Reordering • Global reordering • Reusable for further iterations • Simple, but expensive in-memory copy • Destroys locality for gather accesses • Local reordering • Original stream data untouched • Insertion of possibly many relatively cheap in-register permutation operations • Locality for gathering preserved
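
A hedged sketch of local reordering for the float3 streams of the running example: twelve consecutive floats (four x/y/z triples) are deinterleaved entirely in registers with shuffle operations, leaving the stream in memory untouched (names illustrative):

    #include <xmmintrin.h>

    /* Deinterleave four float3 elements (x0 y0 z0 x1 y1 z1 ...) into
       component vectors [x0..x3], [y0..y3], [z0..z3]. */
    void load_soa(const float *p, __m128 *vx, __m128 *vy, __m128 *vz)
    {
        __m128 t0 = _mm_loadu_ps(p);      /* x0 y0 z0 x1 */
        __m128 t1 = _mm_loadu_ps(p + 4);  /* y1 z1 x2 y2 */
        __m128 t2 = _mm_loadu_ps(p + 8);  /* z2 x3 y3 z3 */
        __m128 x2y2x3y3 = _mm_shuffle_ps(t1, t2, _MM_SHUFFLE(2, 1, 3, 2));
        __m128 y0z0y1z1 = _mm_shuffle_ps(t0, t1, _MM_SHUFFLE(1, 0, 2, 1));
        *vx = _mm_shuffle_ps(t0, x2y2x3y3, _MM_SHUFFLE(2, 0, 3, 0));
        *vy = _mm_shuffle_ps(y0z0y1z1, x2y2x3y3, _MM_SHUFFLE(3, 1, 2, 0));
        *vz = _mm_shuffle_ps(y0z0y1z1, t2, _MM_SHUFFLE(3, 0, 3, 1));
    }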

  17. Experimental Results • Tested on an Intel Core 2 Duo 1.83GHz and a PowerPC G5 1.8GHz • Generated code uses intrinsics, compiled with gcc 4.0.1 • Examples • Image processing: Gaussian blur • Loop sectioning • Computation of the Mandelbrot set • Control flow conversion • Block cipher encryption: RC5 • Kernel flattening

  18. Experimental Results

  19. Future Work • Replace intrinsics by inline assembly • Improvement of conditionals • Better control over register allocation • Improved register re-utilization for AltiVec • Becomes possible with inline assembly • Cell back-end • SIMD instruction set close to AltiVec • Work-list algorithm to distribute stream parts to the single PEs • More applications

  20. Conclusion • CGiS abstracts GPUs as well as SIMD units • The SIMD back-end of the CGiS compiler produces efficient code • Different transformations and optimizations are needed than for the GPU back-end • Full control flow conversion needed • Gather accesses gain speed with loop sectioning • Kernel flattening enables better exploitation of SIMD parallelism
