Semi-Automatic Composition of Data Layout Transformations for Loop Vectorization

Shixiong Xu, David Gregg

University of Dublin, Trinity College

Lero@TCD

Outline
  • Motivation
  • Language Support for Data layout transformations
    • Data layout transformation pragmas
    • Composition of data layout transformations
  • Data layout aware loop transformations
  • Implementation and Experimental Evaluation
  • Conclusion
Motivation (1/5)
  • Interleaved access to data organized as an array of structures (AoS) prevents loop vectorization from exploiting the full power of SIMD
    • the performance of gather and scatter instructions is still not good enough on modern commodity processors (e.g. Intel AVX2).
    • state-of-the-art data permutation optimizations handle only power-of-two strides, e.g. the data permutation optimization in GCC.
    • aggressive data permutation optimization may degrade performance because of the overhead of the data permutation instructions.
      • this holds even if a general data permutation optimization for arbitrary strides were available
Motivation (2/5)
  • For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation.
    • one easy way of getting rid of these repeated data permutations is to transform the layout of the data throughout the program.
    • compilers face great challenges when applying automatic data-layout transformations.
      • Safety: automatic data layout transformation needs very sophisticated whole-program data dependence and pointer aliasing analysis
      • Profitability: compilers are guided by imprecise cost models, so it is hard for them to choose the best data layout transformation.
Motivation (3/5)
  • For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation.
    • It is tedious and error-prone for programmers to change their code by hand.
      • Programmers need to change both the type declarations and any code that operates on the array to be transformed.
      • To the best of our knowledge, there are no suitable ways to allow users to specify their own data layout transformations.
        • Prior work mainly focuses on how to annotate loop transformations rather than data layout transformations, e.g. POET (CGO 11)
        • Our work is inspired by "Semi-automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies" (IJPP, 2006)
Motivation (4/5)
  • Motivating Example
    • tzetar() in SP (Scalar Penta-diagonal), one of the benchmarks in the NAS Parallel Benchmarks (NPB)

* in this paper, we don’t consider other cache optimizations such as array padding.

Figure annotations: only one field is used; data access of stride 5

Motivation (5/5)
  • Possible data layout transformation and corresponding vectorization strategies

Simplify vectorization
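
The stride-5 access from the motivating example, and the effect of changing the layout, can be sketched as follows. This is only an illustration, not code from SP: the function names, the flat-array representation, and the constant NFIELDS are assumptions made here. The first loop touches every fifth element, while the second, after moving the field index to the outermost position, reads and writes consecutive elements.

```c
#include <stddef.h>

#define NFIELDS 5  /* assumed: five values per grid point, hence stride 5 */

/* AoS-style layout: the NFIELDS values of point i are adjacent, so
 * updating one field across all points touches every NFIELDS-th element. */
void scale_field_aos(float *u, size_t n, size_t m, float a)
{
    for (size_t i = 0; i < n; i++)
        u[i * NFIELDS + m] *= a;   /* stride-NFIELDS access */
}

/* SoA-style layout: all values of field m are contiguous, so the same
 * update walks consecutive elements and maps onto plain SIMD loads/stores. */
void scale_field_soa(float *u, size_t n, size_t m, float a)
{
    for (size_t i = 0; i < n; i++)
        u[m * n + i] *= a;         /* unit-stride access */
}
```

Both functions compute the same result on their respective layouts; only the address pattern differs.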

Data layout transformation pragmas (1/3)
  • array transform, a C language pragma to express data layout transformations on static arrays
  • array transform pragma consists of two parts:
    • array descriptor:
      • give a name to each array dimension
    • transform actions:
      • present basic data layout transformations
Data layout transformation pragmas (2/3)
  • four basic data layout transformations
    • strip-mining
    • interchange
    • pad
    • peel
  • terms are borrowed from classic loop transformations
  • see the paper for the detailed semantics of these transformations.
  • classified into two kinds:
    • pre-actions
    • post-actions:
      • array peel: splits an array dimension for the purpose of alignment, or to make the array dimension size a power of two.
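
One way to picture three of these transformations is as changes to the address computation of a two-dimensional row-major array. The sketch below is an interpretation by analogy with the loop transformations of the same names; the paper defines the precise semantics, and all function names here are hypothetical. It assumes the tile size divides the dimension being strip-mined.

```c
#include <stddef.h>

/* Row-major offset of element (i, j) in the original array A[rows][cols]. */
size_t offset_original(size_t i, size_t j, size_t cols)
{
    return i * cols + j;
}

/* Interchange: the two dimensions swap places, giving A'[cols][rows]. */
size_t offset_interchanged(size_t i, size_t j, size_t rows)
{
    return j * rows + i;
}

/* Strip-mining the column dimension with tile size t (t divides cols):
 * A[rows][cols] becomes A'[rows][cols / t][t]. On its own this leaves the
 * memory order unchanged; it exposes a new dimension to interchange. */
size_t offset_strip_mined(size_t i, size_t j, size_t cols, size_t t)
{
    size_t tile = j / t;   /* which tile the column falls in */
    size_t lane = j % t;   /* position inside that tile */
    return (i * (cols / t) + tile) * t + lane;
}

/* Padding: each row is widened to padded_cols >= cols elements. */
size_t offset_padded(size_t i, size_t j, size_t padded_cols)
{
    return i * padded_cols + j;
}
```

Strip-mining alone maps every element to its original offset, which is why it becomes useful mainly in combination with interchange.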
Data layout transformation pragmas (3/3)
  • Syntax of the array transform pragma
Composition of Data layout Transformations (1/2)
  • Array permutation, a sequence of array interchanges.
  • Rectangular Array Tiling, a sequence of array strip-mining and array interchange.
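
As a rough illustration of rectangular array tiling as a composition, the sketch below strip-mines both dimensions of A[rows][cols] and then interchanges the two middle dimensions, giving A'[rows/ti][cols/tj][ti][tj]. The function name and parameters are hypothetical, and the sketch assumes ti divides rows and tj divides cols.

```c
#include <stddef.h>

/* Offset of logical element (i, j) after rectangular tiling:
 *   1. strip-mine rows by ti:    A[rows/ti][ti][cols]
 *   2. strip-mine columns by tj: A[rows/ti][ti][cols/tj][tj]
 *   3. interchange the middle two dimensions:
 *                                A[rows/ti][cols/tj][ti][tj]
 * Each ti x tj tile then occupies one contiguous block of memory. */
size_t offset_tiled(size_t i, size_t j, size_t cols, size_t ti, size_t tj)
{
    size_t it = i / ti, ii = i % ti;   /* tile row, row inside the tile */
    size_t jt = j / tj, jj = j % tj;   /* tile column, column inside the tile */
    size_t tiles_per_row = cols / tj;
    return ((it * tiles_per_row + jt) * ti + ii) * tj + jj;
}
```

For a 4 × 4 array with 2 × 2 tiles, the four elements of the top-left tile land at offsets 0 to 3, and the next tile to the right starts at offset 4.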
Data layout aware loop transformations (1/3)
  • Data layout transformation may change the code into a form that is not amenable to loop vectorization.
  • array strip-mining introduces modulus operations to compute offsets within the resulting tiles.

These operations hinder the loop vectorizer from detecting possible contiguous memory accesses.
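
The problem can be sketched with a hypothetical tiled layout u[n/TILE][NFIELDS][TILE], obtained by strip-mining the point dimension and interchanging it with the field dimension. The names TILE, NFIELDS, tiled_index and scale_field_naive are assumptions for illustration, and TILE is assumed to divide n. A loop that still iterates over the logical index i must recompute a division and a modulus on every iteration.

```c
#include <stddef.h>

#define NFIELDS 5  /* assumed number of fields per point */
#define TILE    8  /* assumed tile size, e.g. one SIMD vector of floats */

/* Offset of field m of logical point i in the layout
 * u[n / TILE][NFIELDS][TILE]. */
size_t tiled_index(size_t i, size_t m)
{
    return ((i / TILE) * NFIELDS + m) * TILE + (i % TILE);
}

/* Straightforward loop over the logical index. The accesses are in fact
 * contiguous within each tile, but the i / TILE and i % TILE terms hide
 * that from a vectorizer looking for a simple affine address. */
void scale_field_naive(float *u, size_t n, size_t m, float a)
{
    for (size_t i = 0; i < n; i++)
        u[tiled_index(i, m)] *= a;
}
```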

Data layout aware loop transformations (2/3)
  • Solution:
    • data layout aware loop strip-mining
      • core idea: apply loop peeling and loop strip-mining according to the boundaries of data tiles from array strip-mining
      • kill two birds with one stone
        • eliminate parts of the modulus operations
        • enhance the data alignment
          • data accesses from the tile boundaries are possibly aligned
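
A minimal sketch of the core idea, under the same hypothetical layout u[n/TILE][NFIELDS][TILE]: the loop over [lo, hi) is peeled up to the first tile boundary and then strip-mined by the tile size, so that the main loop body needs no modulus and starts each inner loop at a tile boundary. The names and constants are illustrative assumptions, not the code generated by the authors' compiler pass.

```c
#include <stddef.h>

#define NFIELDS 5  /* assumed number of fields per point */
#define TILE    8  /* assumed tile size used by the array strip-mining */

/* Offset of field m of logical point i in u[n / TILE][NFIELDS][TILE]. */
size_t tiled_index(size_t i, size_t m)
{
    return ((i / TILE) * NFIELDS + m) * TILE + (i % TILE);
}

/* Scale field m of points lo..hi-1, following the tile boundaries. */
void scale_field_range(float *u, size_t lo, size_t hi, size_t m, float a)
{
    size_t i = lo;

    /* Peeled prologue: scalar iterations up to the first tile boundary. */
    size_t first = (lo + TILE - 1) / TILE * TILE;
    if (first > hi)
        first = hi;
    for (; i < first; i++)
        u[tiled_index(i, m)] *= a;

    /* Strip-mined main loop: i is a multiple of TILE here, so each inner
     * loop covers one whole tile with unit stride and no modulus. */
    for (; i + TILE <= hi; i += TILE) {
        float *tile = &u[tiled_index(i, m)];
        for (size_t l = 0; l < TILE; l++)
            tile[l] *= a;
    }

    /* Peeled epilogue: leftover iterations in the last, partial tile. */
    for (; i < hi; i++)
        u[tiled_index(i, m)] *= a;
}
```

Only the short prologue and epilogue still pay for the division and modulus; whether the tile starts are actually aligned depends on how the array itself is allocated.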
Data layout aware loop transformations (3/3)
  • Solution:
    • data layout aware loop strip-mining
Implementation and Experimental Evaluation (1/5)
  • Implementation
    • the array transform pragma is implemented in the Cetus source-to-source compiler.
    • array pragmas are collected and processed in the pragma parsing phase in the Cetus compiler.
    • an optional pre-processing pass applies loop unrolling and constant propagation.
      • this pass may be required by array peeling.
    • array transformations are done as a transformation pass in the Cetus compiler.
    • the high-level internal representation in Cetus simplifies the processing of array transformations.
Implementation and Experimental Evaluation (2/5)
  • Experimental Evaluation
    • A case study for data layout tuning for loop vectorization
    • uses SP from the NAS Parallel Benchmarks with the Class A data set in NPB, which has a problem size of 64 × 64 × 64 with 400 iterations.
    • Intel C compiler 13.1.3 as the native C compiler

Note that:

    • as seen in the motivating example, we don’t consider other cache optimizations, e.g. array padding.
    • we only focus on the vectorization performance on a single core.
Implementation and Experimental Evaluation (3/5)
  • Performance of the Motivating Example
  • the performance improvement from data layout transformation on tzetar() is much more significant in single precision than in double precision.
Conclusion
  • We put forward a new C language pragma to allow programmers to specify a sequence of data layout transformations. This language annotation serves as a script to control data layout transformations and thus can be integrated into a performance auto-tuning framework as an extra tuning dimension.
  • We implemented our proposed data layout transformation pragma in the Cetus source-to-source compiler. To reduce the overhead of address computation and help vectorization, we introduce data layout aware loop transformations along with the data layout transformations.
  • Manual tuning of data layout transformations on the SP in the NAS Parallel Benchmarks shows that with proper data layout transformations, significant performance improvements are possible from better vectorization.