Optimizing Data Permutations for SIMD Devices

Gang Ren, Peng Wu¹, David Padua

University of Illinois at Urbana-Champaign

¹ IBM T.J. Watson Research Center

SIMD Is Everywhere

[Figure: SIMD Architecture, showing Memory, Register File, and an ALU with four parallel + units]

SIMD Compilation

float a[16], b[16], c[16];

for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

Explore Data Parallelism

  • Vectorization
  • Instruction Packing
  • If Conversion
  • ...

c[0:15] = a[0:15] + b[0:15];

Generating Efficient SIMD Code

  • Data Permutation Optimization
  • Idiom Recognition
  • Execution Mapping
  • Type Promotion Elimination
  • ...

...
vr1 = vec_load(a);
vr2 = vec_load(b);
vr3 = vec_add(vr1, vr2);
...
Strict SIMD Architecture (1)

  • Most SIMD devices only support memory accesses on contiguous and aligned memory sections

... = ...a[0:3:1]...;

vr1 = vec_load(a);

[Figure: a0 to a3 are loaded from contiguous, aligned memory (a0, a1, ..., a7, ...) into one register of the register file and passed to the four + units of the ALU]

Strict SIMD Architecture (2)

  • Additional permutation instructions are needed for non-contiguous and/or misaligned memory references

... = ...a[0:6:2]...;

vr1 = vec_load(a);
vr2 = vec_load(a+4);
vr4 = vperm(vr1, vr2, <0,2,4,6>);

[Figure: registers (a0, a1, a2, a3) and (a4, a5, a6, a7) are loaded from memory; vperm with pattern <0,2,4,6> selects (a0, a2, a4, a6) for the ALU]

Strict SIMD devices: All data reorganization must be accomplished with permutation instructions.
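The contiguous and stride-2 accesses above can be modeled with a short Python sketch. It assumes 4-element registers; `vec_load` and `vperm` here are toy functions written for illustration, not the real VMX or SSE2 intrinsics.

```python
V = 4  # assumed register width in elements

def vec_load(mem, offset=0):
    """Toy strict SIMD load: V contiguous elements starting at an aligned offset."""
    if offset % V != 0:
        raise ValueError("strict SIMD loads must be aligned")
    return mem[offset:offset + V]

def vperm(vr_a, vr_b, pattern):
    """Select lanes from the concatenation of two registers.

    pattern[i] is the index (0 .. 2*V-1) of the element placed in output lane i.
    """
    both = vr_a + vr_b
    return [both[p] for p in pattern]

a = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]

# Contiguous, aligned access a[0:3:1]: a single load is enough.
vr1 = vec_load(a, 0)

# Stride-2 access a[0:6:2]: two loads plus one permutation.
vr2 = vec_load(a, 4)
vr4 = vperm(vr1, vr2, [0, 2, 4, 6])
print(vr4)  # ['a0', 'a2', 'a4', 'a6']
```

The model makes the cost visible: the strided access needs an extra load and an extra instruction compared with the contiguous one.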

Overview of the Optimization Framework

c[0:15] = a[0:31:2] + b[0:15];

Normalization → Optimization → Code Generation

...
vr1 = vec_load(a);
vr2 = vec_load(a+4);
vr3 = vperm(vr1,vr2,…);
vr4 = vec_load(b);
...

Example: An 8-point FFT Program

Original program:

t0[0:6:2] = x[0:3] + x[4:7];
t0[1:7:2] = x[0:3] - x[4:7];
t1[0:7] = T8[0:7] * t0[0:7];
for (i = 0; i < 2; i++) {
    t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
    t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
    t3[0:3] = T4[0:3] * t2[0:3];
    y[i+0:i+2:2] = t3[0:1] + t3[2:3];
    y[i+4:i+6:2] = t3[0:1] - t3[2:3];
}

After normalization:

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t0[0:7] = Permute(v1[0:7], P1);
t1[0:7] = T8[0:7] * t0[0:7];
v2[0:7] = Permute(t1[0:7], P2);
u1[0:7] = Permute(v2[0:7], P3);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
v3[0:7] = Permute(u2[0:7], P4);
t2[0:7] = Permute(v3[0:7], P5);
t3[0:7] = T4_2[0:7] * t2[0:7];
u3[0:7] = Permute(t3[0:7], P6);
u4[0:3] = u3[0:3] + u3[4:7];
u4[4:7] = u3[0:3] - u3[4:7];
v4[0:7] = Permute(u4[0:7], P7);
y[0:7] = Permute(v4[0:7], P8);

After optimization:

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t1[0:7] = T8[0:7] * v1[0:7];
u1[0:7] = Permute(t1[0:7], Q1);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
t3[0:7] = T4_2[0:7] * u2[0:7];
u3[0:7] = Permute(t3[0:7], Q2);
y[0:3] = u3[0:3] + u3[4:7];
y[4:7] = u3[0:3] - u3[4:7];

Code generation: generating native permutation instructions from Permute operations

Overview of the Optimization Framework

Normalization

  • Use generic Permute to represent:
    • Non-unit strides
    • Misalignment
    • Other reorganizations
Data Permutations on Vectors

  • Permute(Xn, Pn): Xn is a vector and Pn is a permutation matrix
  • Use Permute to represent all data reorganizations explicitly

b[0:3] = Permute(a[0:3], <2,1,0,3>)

[Figure: a[0:3] = (a0, a1, a2, a3) at positions 0 to 3 is mapped to b[0:3] = (a2, a1, a0, a3)]

Two stride-2 accesses at right-hand side:

... = a[0:6:2] + a[1:7:2];

The same statement with the reorganization made explicit:

t[0:7] = Permute(a[0:7], <0,2,4,6,1,3,5,7>);
... = t[0:3] + t[4:7];
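A minimal Python sketch of the Permute operation follows. It assumes gather semantics, that is, element i of the result is taken from position P[i] of the input, which is what the two examples on this slide imply; the function name `permute` and the sample values are illustrative only. Note that the slides' section bounds are inclusive while Python slice bounds are exclusive.

```python
def permute(x, p):
    """Permute(x, p) with gather semantics: result[i] = x[p[i]]."""
    if sorted(p) != list(range(len(x))):
        raise ValueError("p must be a permutation of the indices of x")
    return [x[i] for i in p]

# b[0:3] = Permute(a[0:3], <2,1,0,3>)
b = permute(["a0", "a1", "a2", "a3"], [2, 1, 0, 3])
print(b)  # ['a2', 'a1', 'a0', 'a3']

a = [10, 11, 12, 13, 14, 15, 16, 17]

# Direct form with two stride-2 sections: a[0:6:2] + a[1:7:2]
direct = [u + v for u, v in zip(a[0:7:2], a[1:8:2])]

# Explicit form: one Permute, then unit-stride sections only.
t = permute(a, [0, 2, 4, 6, 1, 3, 5, 7])
explicit = [u + v for u, v in zip(t[0:4], t[4:8])]

print(direct == explicit)  # True
```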

Overview of the Optimization Framework

Optimization

  • Minimize Permute ops in a basic block
    • Based on two rules of Permute
  • An NP-complete problem
  • Propagation-based algorithm
Two Important Rules on Permutations

  • Composition Rule

Permute(Permute(a[0:3:1], <1, 0, 3, 2>), <2, 1, 0, 3>)
= Permute(a[0:3:1], <3, 0, 1, 2>)

  • Distributive Rule

Permute(a[0:3:1], <1, 0, 3, 2>) + Permute(b[0:3:1], <1, 0, 3, 2>)
= Permute(a[0:3:1] + b[0:3:1], <1, 0, 3, 2>)
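Both rules can be checked on the slide's own patterns with a few lines of Python. This is a sketch that assumes gather semantics for Permute (result[i] = x[P[i]]); the helper names `permute` and `compose` and the sample data are not from the slides.

```python
def permute(x, p):
    """Gather semantics: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def compose(p_inner, p_outer):
    """Single pattern equivalent to applying p_inner first and p_outer second."""
    return [p_inner[i] for i in p_outer]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
p1 = [1, 0, 3, 2]
p2 = [2, 1, 0, 3]

# Composition rule: two nested permutations collapse into one.
merged = compose(p1, p2)
print(merged)  # [3, 0, 1, 2]
nested = permute(permute(a, p1), p2)
single = permute(a, merged)

# Distributive rule: the same permutation on both operands of an
# element-wise operation can be applied once to the result instead.
lhs = [u + v for u, v in zip(permute(a, p1), permute(b, p1))]
rhs = permute([u + v for u, v in zip(a, b)], p1)

print(nested == single, lhs == rhs)  # True True
```

The composition rule removes one permutation outright; the distributive rule trades two permutations for one and moves it past the arithmetic, which is what lets permutations meet and merge.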

Propagation-Based Optimization Algorithm

  • Overview: Propagating permutation to permutation
    • Step 1: Pick up an unvisited permutation statement
    • Step 2: Propagate the permutation from the definition to the uses
    • Step 3: If a use is a permutation, go to (a), otherwise go to (b)
      • (a) Merge it with the propagated permutation pattern. Go to Step 1
      • (b) Propagate the permutation from right-hand side to left-hand side. Go to Step 2

Snapshot after P1 is propagated through the T8 multiplication (T8 becomes T8', P2 becomes P2'):

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t0[0:7] = Permute(v1[0:7], P1);
t1[0:7] = T8'[0:7] * v1[0:7];
v2[0:7] = Permute(t1[0:7], P2');
u1[0:7] = Permute(v2[0:7], P3);
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
v3[0:7] = Permute(u2[0:7], P4);
t2[0:7] = Permute(v3[0:7], P5);
t3[0:7] = T4_2[0:7] * t2[0:7];
u3[0:7] = Permute(t3[0:7], P6);
u4[0:3] = u3[0:3] + u3[4:7];
u4[4:7] = u3[0:3] - u3[4:7];
v4[0:7] = Permute(u4[0:7], P7);
y[0:7] = Permute(v4[0:7], P8);

Snapshot after P2' is merged into P3 (P3 becomes P3' and reads t1 directly):

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t0[0:7] = Permute(v1[0:7], P1);
t1[0:7] = T8'[0:7] * v1[0:7];
v2[0:7] = Permute(t1[0:7], P2');
u1[0:7] = Permute(t1[0:7], P3');
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
v3[0:7] = Permute(u2[0:7], P4);
t2[0:7] = Permute(v3[0:7], P5);
t3[0:7] = T4_2[0:7] * t2[0:7];
u3[0:7] = Permute(t3[0:7], P6);
u4[0:3] = u3[0:3] + u3[4:7];
u4[4:7] = u3[0:3] - u3[4:7];
v4[0:7] = Permute(u4[0:7], P7);
y[0:7] = Permute(v4[0:7], P8);

Snapshot after the second stage is processed (T4_2 becomes T4_2', P6 becomes P6', and y is computed directly from u3):

v1[0:3] = x[0:3] + x[4:7];
v1[4:7] = x[0:3] - x[4:7];
t0[0:7] = Permute(v1[0:7], P1);
t1[0:7] = T8'[0:7] * v1[0:7];
v2[0:7] = Permute(t1[0:7], P2');
u1[0:7] = Permute(t1[0:7], P3');
u2[0:3] = u1[0:3] + u1[4:7];
u2[4:7] = u1[0:3] - u1[4:7];
v3[0:7] = Permute(u2[0:7], P4);
t2[0:7] = Permute(v3[0:7], P5);
t3[0:7] = T4_2'[0:7] * u2[0:7];
u3[0:7] = Permute(t3[0:7], P6');
y[0:3] = u3[0:3] + u3[4:7];
y[4:7] = u3[0:3] - u3[4:7];
v4[0:7] = Permute(u4[0:7], P7);
y[0:7] = Permute(v4[0:7], P8);
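The first propagation step (P1 moved past the T8 multiplication and merged into P2) can be sketched in Python as follows. The slides do not give values for P1, P2, or T8, so the ones below are hypothetical, and gather semantics (result[i] = x[P[i]]) is assumed; the construction of T8' and P2' is one way to make the rewrite valid, not necessarily how the authors' implementation computes them.

```python
def permute(x, p):
    """Gather semantics: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def compose(p_inner, p_outer):
    """Single pattern equivalent to applying p_inner first and p_outer second."""
    return [p_inner[i] for i in p_outer]

def inverse(p):
    """Pattern q such that permute(permute(x, p), q) == x."""
    q = [0] * len(p)
    for i, src in enumerate(p):
        q[src] = i
    return q

# Hypothetical data and patterns (not taken from the slides).
v1 = [1, 2, 3, 4, 5, 6, 7, 8]
T8 = [2, 3, 5, 7, 11, 13, 17, 19]
P1 = [0, 4, 1, 5, 2, 6, 3, 7]
P2 = [7, 6, 5, 4, 3, 2, 1, 0]

# Before: t0 = Permute(v1, P1); t1 = T8 * t0; v2 = Permute(t1, P2)
t0 = permute(v1, P1)
t1 = [c * x for c, x in zip(T8, t0)]
v2_before = permute(t1, P2)

# After: the constant table absorbs the inverse of P1 ...
T8_prime = permute(T8, inverse(P1))
t1_prime = [c * x for c, x in zip(T8_prime, v1)]
# ... and P1 is merged into the following permutation.
P2_prime = compose(P1, P2)
v2_after = permute(t1_prime, P2_prime)

print(v2_before == v2_after)  # True
```

Because T8 is a constant table, permuting it happens at compile time, so the rewritten code executes one Permute where the original executed two.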

Propagating Permutations to Partial Uses

b[0:3] and b[4:7] are two partial uses of b[0:7].

b[0:7] = Permute(a[0:7], <3,2,1,0,7,6,5,4>);
c[0:3] = b[0:3] + b[4:7];

The permutation can be partitioned and propagated to the two partial uses:

b[0:3] = Permute(a[0:3], <3,2,1,0>);
b[4:7] = Permute(a[4:7], <3,2,1,0>);
c[0:3] = b[0:3] + b[4:7];

Not all permutations can be partitioned and propagated to partial uses:

b[0:7] = Permute(a[0:7], <0,4,1,5,2,6,3,7>);
c[0:3] = b[0:3] + b[4:7];

  • Improvements over partial use boundary
    • Permutation decomposition
      • Register-wise decomposition
      • Shuffle instruction decomposition
    • Permutation reshaping
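Whether a permutation can be partitioned at a partial-use boundary can be tested mechanically. The sketch below is an illustration under assumptions: the helper name `split_at_boundaries` is hypothetical, the block width of 4 matches the b[0:3] and b[4:7] partial uses on this slide, and only the simple case is handled where each output block reads from the input block at the same position.

```python
def split_at_boundaries(p, width):
    """Return one local pattern per block if p never crosses a block
    boundary, otherwise None."""
    blocks = []
    for start in range(0, len(p), width):
        block = p[start:start + width]
        if any(not (start <= src < start + width) for src in block):
            return None  # some element comes from another block
        blocks.append([src - start for src in block])
    return blocks

print(split_at_boundaries([3, 2, 1, 0, 7, 6, 5, 4], 4))
# [[3, 2, 1, 0], [3, 2, 1, 0]]
print(split_at_boundaries([0, 4, 1, 5, 2, 6, 3, 7], 4))
# None
```

The second pattern interleaves the two halves, so neither partial use can be produced from one half of a alone; cases like this are what decomposition and reshaping are meant to improve.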
Optimization: Permutation Reshaping

  • For permutations used in commutative operations

b[0:7] = Permute(a[0:7], <0,5,2,7,4,1,6,3>);
c[0:3] = b[0:3] + b[4:7];

=

b[0:7] = Permute(a[0:7], <0,1,2,3,4,5,6,7>);
c[0:3] = b[0:3] + b[4:7];

[Figure: both forms produce (c0, c1, c2, c3) = (a0+a4, a1+a5, a2+a6, a3+a7); the first form computes c1 and c3 as a5+a1 and a7+a3]

b[0:15] = Permute(a[0:15], <0,11,10,9,8>);
c[0:7] = b[0:7] + b[8:15];
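The reshaping example can be reproduced in Python. This sketch assumes gather semantics for Permute; `reshape_for_commutative` is a hypothetical helper that simply orders the two sources of each lane pair, which happens to yield the identity pattern here. A real compiler would presumably pick whichever equivalent pattern is cheapest.

```python
def permute(x, p):
    """Gather semantics: result[i] = x[p[i]]."""
    return [x[i] for i in p]

def reshape_for_commutative(p):
    """Swap the two sources of a lane pair (i, i + half) where needed so that
    the smaller index lands in the first half. Valid only when the halves are
    combined lane by lane with a commutative operation."""
    half = len(p) // 2
    q = list(p)
    for i in range(half):
        q[i], q[i + half] = sorted((p[i], p[i + half]))
    return q

a = [1, 2, 3, 4, 5, 6, 7, 8]
p = [0, 5, 2, 7, 4, 1, 6, 3]

b = permute(a, p)
c_original = [u + v for u, v in zip(b[0:4], b[4:8])]

q = reshape_for_commutative(p)
print(q)  # [0, 1, 2, 3, 4, 5, 6, 7]
b2 = permute(a, q)
c_reshaped = [u + v for u, v in zip(b2[0:4], b2[4:8])]
print(c_original == c_reshaped)  # True

# With a non-commutative operation the two forms differ.
d_original = [u - v for u, v in zip(b[0:4], b[4:8])]
d_reshaped = [u - v for u, v in zip(b2[0:4], b2[4:8])]
print(d_original == d_reshaped)  # False
```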
Overview of the Optimization Framework

Code Generation

  • "Strip-mine" Permute to vperm instructions
  • Map vperm to native permutation instructions
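Strip-mining a whole-array Permute into register-sized vperm operations can be sketched as below. The sketch assumes 4-element registers and a two-input vperm whose pattern indexes the concatenation of its two source registers; `strip_mine` is a hypothetical helper that only covers the easy case where every output register draws from at most two source registers.

```python
V = 4  # assumed register width in elements

def strip_mine(p, width=V):
    """Split a Permute pattern into one two-input vperm per output register.

    Returns a list of (source_register_a, source_register_b, local_pattern).
    Raises ValueError when an output register needs more than two sources,
    in which case a multi-level vperm network is required.
    """
    ops = []
    for start in range(0, len(p), width):
        chunk = p[start:start + width]
        regs = sorted({src // width for src in chunk})
        if len(regs) > 2:
            raise ValueError("output register needs more than two source registers")
        reg_a, reg_b = regs[0], regs[-1]
        local = [src % width + (0 if src // width == reg_a else width)
                 for src in chunk]
        ops.append((reg_a, reg_b, local))
    return ops

# Stride-2 pattern: each output register needs source registers 0 and 1.
print(strip_mine([0, 2, 4, 6, 1, 3, 5, 7]))
# [(0, 1, [0, 2, 4, 6]), (0, 1, [1, 3, 5, 7])]
```

The first tuple corresponds to a vperm of registers 0 and 1 with pattern <0,2,4,6>, the same shape as the stride-2 access shown for strict SIMD architectures.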
Generating Permutation Instructions (1)

a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

[Figure: b[0:15] is turned into a[0:15] by a network of vperm instructions with patterns such as <0,4,*,*>, <0,1,4,*>, and <0,1,2,4>]

Generating Permutation Instructions (2)

a[0:15] = Permute(b[0:15], <0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15>);

[Figure: b[0:15] is turned into a[0:15] by vperm instructions with patterns <0,4,*,*>, <0,4,1,5>, <0,1,4,5>, and <2,3,6,7>]

  • Two Steps:
    • Maximize empty slots when generating vperm instructions
    • Fill empty slots with data elements that go to the same target
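For the 16-element Permute on this slide (a 4x4 transpose), a two-level network of eight two-input vperm operations is enough. The Python sketch below is an assumption-laden reconstruction: it uses the patterns <0,4,1,5>, <0,1,4,5>, and <2,3,6,7> listed on the slide, adds <2,6,3,7> by symmetry, and the pairing of registers is chosen for illustration rather than read off the original figure.

```python
def vperm(vr_a, vr_b, pattern):
    """Two-input permute: pattern indexes the concatenation vr_a + vr_b."""
    both = vr_a + vr_b
    return [both[i] for i in pattern]

b = list(range(16))
r0, r1, r2, r3 = (b[i:i + 4] for i in range(0, 16, 4))

# Level 1: every slot is filled with an element that a later vperm
# sends to the same target register.
t0 = vperm(r0, r1, [0, 4, 1, 5])   # [0, 4, 1, 5]
t1 = vperm(r2, r3, [0, 4, 1, 5])   # [8, 12, 9, 13]
t2 = vperm(r0, r1, [2, 6, 3, 7])   # [2, 6, 3, 7]
t3 = vperm(r2, r3, [2, 6, 3, 7])   # [10, 14, 11, 15]

# Level 2: combine pairs of level-1 results into the output registers.
a0 = vperm(t0, t1, [0, 1, 4, 5])   # [0, 4, 8, 12]
a1 = vperm(t0, t1, [2, 3, 6, 7])   # [1, 5, 9, 13]
a2 = vperm(t2, t3, [0, 1, 4, 5])   # [2, 6, 10, 14]
a3 = vperm(t2, t3, [2, 3, 6, 7])   # [3, 7, 11, 15]

a = a0 + a1 + a2 + a3
expected = [b[i] for i in
            [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]]
print(a == expected)  # True
```

Building each output register separately from patterns with empty slots, such as <0,4,*,*>, takes more instructions; filling those slots with elements headed for the same target is what shares level-1 results between two outputs.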
Experiment Setups
  • Two SIMD devices: VMX (AltiVec) and SSE2
  • Tested applications
    • Group I : Applications with relatively simple permutation patterns
      • C-Saxpy: Complex version of saxpy ( y = alpha*x + y )
      • R-Color, C-Dot, R-FIR, …
    • Group II: Applications with complicated permutation patterns
      • FFT: Fast Fourier transform programs generated by the SPIRAL system
      • WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
      • Bitonic sorting: One of the fastest sorting networks
    • Group III: Reorganization-only applications
      • Matrix transpose
      • Bit-reversal reordering
Related Work
  • Optimizing permutation instructions introduced by misalignment
    • A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
    • P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05
  • Efficient permutation instruction generation
    • A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
    • M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
    • D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06
  • Similar idea, different applications
    • A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
    • S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
    • G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95
Conclusion
  • Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation
  • A unified framework is proposed to optimize data permutations
    • Putting all forms of data permutations into a unified representation
    • Propagating permutations across statements and merging them together
    • Generating efficient permutation instructions natively supported by devices
  • Experiments were conducted on different applications
    • Up to 77% of permutation instructions are eliminated
    • Average performance improves by 48% on VMX and 68% on SSE2
    • Near-peak overall speedups are achieved on some applications
Questions?

June 2006

Future Work
  • Extending the unified optimization framework
    • Handling data permutations with variable length
    • More optimizations on reduction, duplication, and special elements
  • Expanding the application domain of the framework
    • Implementing on a vectorizing compiler
    • Testing on more benchmark applications
  • Studying the interaction between data permutation optimization and other compiler techniques
Optimization: Permutation Decomposition

b[0:7] = Permute(a[0:7], <3,2,7,6,1,0,5,4>);
c[0:3] = b[0:3] + b[4:7];

=

t[0:7] = Permute(a[0:7], <2,3,6,7,0,1,4,5>);
b[0:3] = Permute(t[0:3], <1,0,3,2>);
b[4:7] = Permute(t[4:7], <1,0,3,2>);
c[0:3] = b[0:3] + b[4:7];

  • Observation: Register-wise permutation may cost nothing
  • Decompose permutations for propagation to partial uses

b[0:7] = Permute(a[0:7], <2,3,6,7,0,1,4,5>);

[Figure: a[0:7] = (a0, a1, a2, a3, a4, a5, a6, a7) is rearranged into b[0:7] = (a2, a3, a6, a7, a0, a1, a4, a5)]
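The decomposition on this slide can be checked directly. The Python sketch below assumes gather semantics for Permute (result[i] = x[P[i]]) and uses symbolic element names so the result is easy to read; the helper name `permute` is illustrative.

```python
def permute(x, p):
    """Gather semantics: result[i] = x[p[i]]."""
    return [x[i] for i in p]

a = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]

# Original single permutation.
b_direct = permute(a, [3, 2, 7, 6, 1, 0, 5, 4])

# Decomposed form: one coarse permutation that moves pairs of elements,
# followed by the same local pattern applied to each half.
t = permute(a, [2, 3, 6, 7, 0, 1, 4, 5])
b_decomposed = permute(t[0:4], [1, 0, 3, 2]) + permute(t[4:8], [1, 0, 3, 2])

print(b_direct)
# ['a3', 'a2', 'a7', 'a6', 'a1', 'a0', 'a5', 'a4']
print(b_direct == b_decomposed)  # True
```

After decomposition, the two local patterns are identical and stay within their halves, so they can be propagated to the partial uses b[0:3] and b[4:7].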