
Semi-Automatic Composition of Data Layout Transformations for Loop Vectorization

Shixiong Xu, David Gregg. University of Dublin, Trinity College. Lero@TCD.



  1. Semi-Automatic Composition of Data Layout Transformations for Loop Vectorization • Shixiong Xu, David Gregg • University of Dublin, Trinity College • Lero@TCD

  2. Outline • Motivation • Language Support for Data layout transformations • Data layout transformation pragmas • Composition of data layout transformations • Data layout aware loop transformations • Implementation and Experimental Evaluation • Conclusion

  3. Motivation (1/5) • Interleaved access to data organized as an array of structures (AoS) prevents loop vectorization from exploiting the full power of SIMD. • The performance of gather and scatter instructions is still not good enough on modern commodity processors (e.g. Intel AVX2). • State-of-the-art data permutation optimization handles only power-of-two strides, e.g. the data permutation optimization in GCC. • Even if a general data permutation optimization for arbitrary strides existed, aggressive data permutation could degrade performance because of the overhead of the permutation instructions.
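
The interleaved-access problem can be sketched in C as follows. This is a minimal illustration, not code from the presentation: the type `cell_aos`, the field count `NFIELDS`, and both function names are hypothetical. In the AoS version consecutive iterations touch addresses `NFIELDS` floats apart; in the structure-of-arrays version the same field is contiguous.

```c
#include <assert.h>
#include <stddef.h>

#define NFIELDS 5  /* hypothetical: five fields per element */

/* Array of structures: the fields of one element sit next to each other. */
typedef struct { float f[NFIELDS]; } cell_aos;

/* Sum field k over n elements. Consecutive iterations read addresses that
 * are NFIELDS floats apart, so a vectorizer needs gathers or permutations. */
float sum_field_aos(const cell_aos *a, size_t n, int k) {
    assert(k >= 0 && k < NFIELDS);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i].f[k];
    return s;
}

/* Structure-of-arrays view: field k of all elements is stored contiguously,
 * so the loop reads memory with unit stride. */
float sum_field_soa(const float *field_k, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += field_k[i];
    return s;
}
```

Both functions compute the same sum; only the memory access pattern differs.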

  4. Motivation (2/5) • For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation. • One easy way to get rid of these repeated data permutations is to transform the layout of the data throughout the program. • Compilers face great challenges when applying automatic data layout transformations. • Safety: automatic data layout transformations need very sophisticated whole-program data dependence and pointer aliasing analysis. • Profitability: compilers are guided by imprecise cost models, so it is hard for them to choose the best data layout transformations.

  5. Motivation (3/5) • For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation. • It is tedious and error-prone for programmers to change their code by hand. • Programmers need to change both the type declarations and any code that operates on the array to be transformed. • To the best of our knowledge, there is no suitable way for users to specify their own data layout transformations. • Prior work mainly focuses on how to annotate loop transformations rather than data layout transformations, e.g. POET (CGO 11). • This work is inspired by "Semi-automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies" (IJPP, 2006).

  6. Motivation (4/5) • Motivating example: tzetar() in SP (Scalar Penta-diagonal), one of the benchmarks in the NAS Parallel Benchmarks (NPB). • Annotation on the code figure: only one field is used, giving a data access of stride 5. • * In this paper, we don't consider other cache optimizations such as array padding.

  7. Motivation (5/5) • Possible data layout transformations and the corresponding vectorization strategies. [Figure; annotation: simplify vectorization]

  8. Data layout transformation pragmas (1/3) • array transform: a C language pragma that expresses data layout transformations on static arrays. • The array transform pragma consists of two parts: • array descriptor: gives a name to each array dimension • transform actions: represent the basic data layout transformations

  9. Data layout transformation pragmas (2/3) • Four basic data layout transformations: • strip-mining • interchange • pad • peel • The terms are borrowed from classic loop transformations. • See the paper for the detailed semantics of these transformations. • The actions are classified into two kinds: • pre-actions • post-actions: • array peel, which splits an array dimension for the purpose of alignment or to make the array dimension size a power of two.
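
As a rough intuition for the four transformations named above, each can be read as a remapping of array indices. The sketch below is an assumption based on the classic loop-transformation meaning of the terms, not the semantics defined in the paper; the type `idx2` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A pair of indices produced by splitting or reordering dimensions. */
typedef struct { size_t outer, inner; } idx2;

/* Strip-mining: split one dimension into tiles of `tile` elements.
 * Index i becomes (tile number, position inside the tile). */
idx2 strip_mine(size_t i, size_t tile) {
    assert(tile > 0);
    idx2 r = { i / tile, i % tile };
    return r;
}

/* Interchange: swap the order of two dimensions. */
idx2 interchange(idx2 p) {
    idx2 r = { p.inner, p.outer };
    return r;
}

/* Pad: grow the inner dimension from `cols` to `cols + pad` elements.
 * Returns the row-major offset of element (i, j) in the padded array. */
size_t padded_offset(size_t i, size_t j, size_t cols, size_t pad) {
    assert(j < cols);
    return i * (cols + pad) + j;
}

/* Peel: cut one dimension at `cut` into two pieces.
 * outer selects the piece (0 or 1), inner is the index inside it. */
idx2 peel(size_t i, size_t cut) {
    idx2 r = { i < cut ? 0u : 1u, i < cut ? i : i - cut };
    return r;
}
```

For example, with tiles of 4 elements, index 13 lands in tile 3 at position 1.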

  10. Data layout transformation pragmas (3/3) • Syntax of the array transform pragma

  11. Composition of Data layout Transformations (1/2) • Array permutation: a sequence of array interchanges. • Rectangular array tiling: a sequence of array strip-mining and array interchange steps.
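
Rectangular tiling as a composition can be sketched on a 2-D row-major array: strip-mine both dimensions, then interchange the two middle dimensions so that each tile is stored contiguously. The function name `tiled_offset` and its parameters are hypothetical, and the sketch assumes the tile sizes divide the array sizes evenly.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) of an n x m array after rectangular tiling
 * with ti x tj tiles. Assumes ti divides n and tj divides m. */
size_t tiled_offset(size_t i, size_t j, size_t n, size_t m,
                    size_t ti, size_t tj) {
    assert(ti > 0 && tj > 0 && n % ti == 0 && m % tj == 0);
    assert(i < n && j < m);

    /* Step 1, strip-mine both dimensions: (i, j) -> (io, ii, jo, ji). */
    size_t io = i / ti, ii = i % ti;
    size_t jo = j / tj, ji = j % tj;

    /* Step 2, interchange ii and jo: dimension order (io, jo, ii, ji),
     * so the ti x tj elements of one tile are adjacent in memory. */
    size_t tiles_per_row = m / tj;
    return ((io * tiles_per_row + jo) * ti + ii) * tj + ji;
}
```

The mapping is a bijection on the array's elements, so the transformed array occupies the same amount of storage as the original.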

  12. Composition of Data layout Transformations (2/2) • Motivating example:

  13. Data layout aware loop transformations (1/3) • A data layout transformation may change the code into a form that is not amenable to loop vectorization. • Array strip-mining introduces modulus operations to compute offsets within the resulting tiles; these hinder loop vectorization from detecting possible contiguous memory accesses.
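
A minimal sketch of the problem, assuming a 1-D array that has been strip-mined into tiles of `TILE` elements (the tile size and the function name `scale_naive` are hypothetical): every access now needs a division and a modulus, which hides the fact that consecutive iterations are usually adjacent in memory.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile size */

/* Multiply the first n logical elements by c. The original a[i] has become
 * a_t[i / TILE][i % TILE] after array strip-mining. */
void scale_naive(float a_t[][TILE], size_t n, float c) {
    assert(a_t != NULL);
    for (size_t i = 0; i < n; i++)
        a_t[i / TILE][i % TILE] *= c;  /* div and mod on every access */
}
```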

  14. Data layout aware loop transformations (2/3) • Solution: data layout aware loop strip-mining • Core idea: apply loop peeling and loop strip-mining according to the boundaries of the data tiles produced by array strip-mining. • This kills two birds with one stone: • it eliminates part of the modulus operations • it enhances data alignment, since data accesses that start at tile boundaries are possibly aligned
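
The core idea can be sketched by hand on a strip-mined 1-D array. This is an illustration under assumed names (`TILE`, `scale_range`), not the output of the Cetus pass: the loop is peeled up to the first tile boundary, strip-mined over whole tiles, and finished with a peeled remainder, so the main inner loop contains no modulus and starts at a tile boundary.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile size, matching the array strip-mining */

/* Multiply logical elements lo..hi-1 of a strip-mined array by c. */
void scale_range(float a_t[][TILE], size_t lo, size_t hi, float c) {
    assert(lo <= hi);
    size_t i = lo;

    /* Peeled prologue: advance to the first tile boundary (or to hi). */
    for (; i < hi && i % TILE != 0; i++)
        a_t[i / TILE][i % TILE] *= c;

    /* Strip-mined main loop: whole tiles only. The inner loop has unit
     * stride, no modulus, and begins at a tile boundary. */
    for (; i + TILE <= hi; i += TILE) {
        float *tile = a_t[i / TILE];
        for (size_t k = 0; k < TILE; k++)
            tile[k] *= c;
    }

    /* Peeled epilogue: leftover elements in a partial last tile. */
    for (; i < hi; i++)
        a_t[i / TILE][i % TILE] *= c;
}
```

Only the short prologue and epilogue still pay for the modulus; the bulk of the iterations run in the modulus-free inner loop.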

  15. Data layout aware loop transformations (3/3) • Solution: • data layout aware loop strip-mining

  16. Implementation and Experimental Evaluation (1/5) • Implementation • The pragma is implemented in the Cetus source-to-source compiler. • Array pragmas are collected and processed in the pragma parsing phase of Cetus. • * An optional pre-processing pass applies loop unrolling and constant propagation; it may be required by array peeling. • Array transformations are performed as a transformation pass in Cetus. • The high-level internal representation in Cetus simplifies the processing of array transformations.

  17. Implementation and Experimental Evaluation (2/5) • Experimental Evaluation • A case study of data layout tuning for loop vectorization. • We use SP from the NAS Parallel Benchmarks with the Class A data set, which has a size of 64 × 64 × 64 with 400 iterations. • The Intel C compiler 13.1.3 is the native C compiler. • Note that: • as seen in the motivating example, we don't consider other cache optimizations, e.g. array padding • we only focus on vectorization performance on a single core.

  18. Implementation and Experimental Evaluation (3/5) • Performance of the motivating example • The performance improvement from data layout transformation on tzetar() is much more significant with single precision than with double precision.

  19. Implementation and Experimental Evaluation (4/5) • Performance of SP [Figure; annotation: 1.8X]

  20. Implementation and Experimental Evaluation (5/5) • Performance breakdown of SP

  21. Conclusion • We put forward a new C language pragma to allow programmers to specify a sequence of data layout transformations. This language annotation serves as a script to control data layout transformations and thus can be integrated into a performance auto-tuning framework as an extra tuning dimension. • We implemented our proposed data layout transformation pragma in the Cetus source-to-source compiler. To reduce the overhead of address computation and help vectorization, we introduce data layout aware loop transformations along with the data layout transformations. • Manual tuning of data layout transformations on the SP in the NAS Parallel Benchmarks shows that with proper data layout transformations, significant performance improvements are possible from better vectorization.

  22. Q&A
