
Exploiting Superword Level Parallelism with Multimedia Instruction Sets


Presentation Transcript


1. Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe, MIT CSAIL
Presenters: Baixu Chen, Rui Hou, Yifan Zhao

2. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

3. Introduction
• Vector Parallelism (VP):
  • Single instruction, multiple data (SIMD)
  • A vector register holds several values at once; one instruction then performs the same operation on every value in a single pass
• Example
  • In the vectorized iteration below, each load brings 3 adjacent values into a vector register, and one vector add executes in place of 3 individual scalar adds, so execution is ideally 3x faster

    // original loop
    for (i = 0; i < 9; ++i) {
        A[i] += B[i] + 2;
    }

    // unrolled loop
    for (i = 0; i < 9; i += 3) {
        A[i]   += B[i]   + 2;
        A[i+1] += B[i+1] + 2;
        A[i+2] += B[i+2] + 2;
    }

    // vectorized iteration
    for (i = 0; i < 9; i += 3) {
        A[i..i+2] += B[i..i+2] + 2;
    }
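
The A[i..i+2] slice notation above is pseudocode. As a concrete companion, here is the same computation written with real SSE2 intrinsics -- a minimal sketch, not from the slides, assuming x86 SSE2, 32-bit ints, and a hypothetical function name vector_add (real ISAs use power-of-two widths, so it processes 4 elements per pass instead of 3):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    void vector_add(int *A, const int *B, int n) {
        __m128i two = _mm_set1_epi32(2);               /* (2, 2, 2, 2) */
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128i a = _mm_loadu_si128((const __m128i *)&A[i]);
            __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
            a = _mm_add_epi32(a, _mm_add_epi32(b, two));
            _mm_storeu_si128((__m128i *)&A[i], a);
        }
        for (; i < n; ++i)                             /* scalar remainder */
            A[i] += B[i] + 2;
    }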

4. Introduction
• SLP vs Vector Parallelism
  • VP: identify vectorizable instructions by unrolling the loop

    for (i = 0; i < 9; ++i) {
        A[i] += B[i] + 2;
    }
    ->
    for (i = 0; i < 9; i += 3) {
        A[i..i+2] += B[i..i+2] + 2;
    }

  • SLP: identify vectorizable instructions by finding isomorphic instructions, which share the same instruction structure

    a = b + c * z[i+0];
    d = e + f * z[i+1];
    g = h + i * z[i+2];

  [Figure: the three statements fused into SIMD form -- (a, d, g) = (b, e, h) + (c, f, i) * (z[i+0], z[i+1], z[i+2])]

5. Introduction
• Superword Level Parallelism (SLP):
  • Generally applicable - SLP is not restricted to parallelizing loops
  • Finds independent, isomorphic instructions within the same basic block
  • Goals: 1) gain speedup via parallelization; 2) minimize the cost of packing and unpacking
  • Prefers operations on adjacent memory, where the packing cost is minimal
• SLP vs Vector Parallelism: see the loop and isomorphic-statement examples on the previous slide
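
To make "not restricted to loops" concrete, here is a hedged before/after sketch of SLP applied to straight-line code; the SSE target, the function names, and the 1.5f multiplier are illustrative assumptions, not from the paper:

    #include <xmmintrin.h>  /* SSE */

    /* Scalar version: four independent, isomorphic statements in one
     * basic block -- no loop anywhere. */
    void scale_scalar(float *out, const float *in) {
        out[0] = in[0] * 1.5f;
        out[1] = in[1] * 1.5f;
        out[2] = in[2] * 1.5f;
        out[3] = in[3] * 1.5f;
    }

    /* SLP-packed version: the loads and stores are adjacent, so no
     * packing/unpacking code is needed -- the four statements collapse
     * into one SIMD multiply. */
    void scale_slp(float *out, const float *in) {
        __m128 v = _mm_loadu_ps(in);             /* one wide load     */
        v = _mm_mul_ps(v, _mm_set1_ps(1.5f));    /* one SIMD multiply */
        _mm_storeu_ps(out, v);                   /* one wide store    */
    }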

6. Introduction
• SLP vs Vector Parallelism vs ILP
  [Figure: Venn diagram relating ILP, vector parallelism, and SLP -- vector parallelism is a subset of SLP, and both are forms of instruction-level parallelism]

7. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

8. Loop Unrolling
• Allows SLP to also cover loop vectorization
• Unroll factor = SIMD datapath size / scalar variable size
  • e.g., sizeof(A[i]) == 4 bytes, datapath == 8 bytes -> unroll count == 2

    // original loop
    for (i = 0; i < 8; ++i) {
        A[i] += B[i] + 2;
    }
    // after unrolling by 2
    for (i = 0; i < 8; i += 2) {
        A[i]   += B[i]   + 2;
        A[i+1] += B[i+1] + 2;
    }
    // after SLP vectorization
    for (i = 0; i < 8; i += 2) {
        A[i:i+2] += B[i:i+2] + 2;
    }

  [Pipeline: Basic block -> Loop unrolling -> Alignment analysis -> Pre-optimization -> PackSet seeding -> PackSet extension -> Group combination -> Scheduling -> SIMD instructions]
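
The unroll-factor rule above, written out as code -- a trivial sketch; unroll_factor is a hypothetical helper name:

    #include <stdio.h>

    int unroll_factor(int datapath_bytes, int elem_bytes) {
        /* SIMD datapath size / scalar variable size */
        return datapath_bytes / elem_bytes;
    }

    int main(void) {
        /* 8-byte datapath, 4-byte elements -> unroll count == 2 */
        printf("%d\n", unroll_factor(8, (int)sizeof(int)));
        return 0;
    }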

9. Alignment Analysis
• SLP vectorization emits wider loads/stores
  • A[i] += B[i] + 2: loads 4 bytes
  • A[i:i+2] += B[i:i+2] + 2: loads 8 bytes
• Goal: emit aligned SIMD loads (aligned to the datapath size)
  • Unaligned loads are slower on most Intel chips and illegal on ARM chips
• Alignment analysis computes address-modulo information
  • 0x1001 mod 4 = 1, 0x1004 mod 8 = 4, etc.
  • Modulo = 0 -> aligned!
• Merge 2 loads only if the result is an aligned load
  • load A[i], 4 bytes; load A[i+1], 4 bytes -> load A[i], 8 bytes
  • Only done if A[i] is aligned to 8 bytes
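
A minimal sketch of the merge test, assuming alignment analysis has already resolved the address's value; the names are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    #define DATAPATH_BYTES 8

    /* Two adjacent 4-byte loads starting at addr may be merged into one
     * 8-byte load only when addr mod 8 == 0 (modulo = 0 -> aligned). */
    bool can_merge_loads(uintptr_t addr) {
        return addr % DATAPATH_BYTES == 0;
    }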

10. Pre-optimization
• Emit three-address code (flattened instructions)
  • Each instruction has at most 3 operands
  • Isomorphic instructions are easier to identify in this form
• Classical optimizations
  • LICM, dead code elimination, constant propagation, etc.
  • Don't waste effort vectorizing instructions we don't need!
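
A hypothetical before/after of the flattening step; t1, t2, and D are compiler-introduced temporaries in this sketch:

    /* before flattening: D = (E * 3) + A[i]; */
    int flatten_demo(int E, const int *A, int i) {
        int t1 = E * 3;    /* each instruction: at most one operator, */
        int t2 = A[i];     /* at most 3 operands                      */
        int D  = t1 + t2;
        return D;
    }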

11. Identifying Adjacent Memory References
• Find pairs of adjacent, independent memory loads
  • Adjacent: A[i] and A[i+1]
  • Independent: no overlap; they load into different registers
• PackSet: a set of pairs of instructions

    Example basic block:
    (1) B = A[i+0]
    (2) C = E * 3
    (3) D = C + B
    (4) F = A[i+1]
    (5) G = F * 6
    (6) H = G + F
    (7) I = A[i+2]
    (8) J = I * 8
    (9) K = J + I

    Seeded PackSet:
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
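
A minimal sketch of the seeding step over a toy IR in which every load is (base array, constant index, destination register); the structures and names are hypothetical, and a real implementation would use dependence analysis for the independence test:

    typedef struct { int base, index, dest; } Load;
    typedef struct { int left, right; } Pair;

    /* Emit a pair (i, j) for every two loads that are adjacent (same
     * base, consecutive indices) and independent (distinct statements
     * writing distinct registers). Returns the number of pairs. */
    int seed_packset(const Load *loads, int n, Pair *out) {
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j &&
                    loads[i].base == loads[j].base &&
                    loads[j].index == loads[i].index + 1 &&
                    loads[i].dest != loads[j].dest)
                    out[count++] = (Pair){ i, j };
        return count;
    }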

12. Extending the PackSet
• Add more instructions to the PackSet...
• ...by packing statements that use already-packed values as source operands (def-use)

    PackSet after following def-use chains (basic block as on slide 11):
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F    <- def-use
    L: (6) H = G + F    R: (9) K = J + I    <- def-use

13. Extending the PackSet
• Add more instructions to the PackSet...
• ...by packing statements that use already-packed values (def-use)
• ...and by producing needed source operands in packed form (use-def)

    PackSet after following use-def chains:
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F
    L: (6) H = G + F    R: (9) K = J + I
    L: (2) C = E * 3    R: (5) G = F * 6    <- use-def
    L: (5) G = F * 6    R: (8) J = I * 8    <- use-def
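
A minimal sketch of one direction of the extension, over a toy three-address IR; the structures are hypothetical, and a full implementation would also check independence, alignment, and profitability before packing:

    typedef struct { char op; int dest, src1, src2; } Stmt;
    typedef struct { int left, right; } Pair;

    /* Toy isomorphism test: same operator. */
    int isomorphic(const Stmt *a, const Stmt *b) {
        return a->op == b->op;
    }

    /* def-use direction: given a packed pair p, look for isomorphic
     * statements that use dest(p.left) and dest(p.right) in the same
     * operand position, and propose them as a new pair. */
    int follow_def_uses(const Stmt *stmts, int n, Pair p, Pair *out) {
        int count = 0;
        for (int u1 = 0; u1 < n; u1++)
            for (int u2 = 0; u2 < n; u2++)
                if (u1 != u2 && isomorphic(&stmts[u1], &stmts[u2]) &&
                    stmts[u1].src1 == stmts[p.left].dest &&
                    stmts[u2].src1 == stmts[p.right].dest)
                    out[count++] = (Pair){ u1, u2 };
        return count;
    }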

14. Extending the PackSet -- Cost Model
• Estimate the cost before/after the optimization; perform it only when profitable
• Packing and unpacking have a cost
  • Packing: scalar registers -> SIMD register
  • Unpacking: SIMD register -> scalar registers

    A = f()
    B = g()
    C = A + 2
    D = B + 3
    E = C / 5
    F = D * 7

  [Figure: A and B are packed into one SIMD register and added to the packed constants (2, 3), producing (C, D); C and D are then unpacked back to scalar registers for the divide and multiply]
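
A minimal sketch of the profitability arithmetic, using the n - 1 pack/unpack cost mentioned in the evaluation; all names and cost constants here are illustrative assumptions:

    /* Packing or unpacking an n-wide vector costs n - 1 move/insert
     * instructions in this model. */
    #define PACK_COST(n)   ((n) - 1)
    #define UNPACK_COST(n) ((n) - 1)

    /* n isomorphic scalar ops become 1 SIMD op. Operands already in
     * packed form (e.g., adjacent loads) contribute no packing cost. */
    int savings(int n, int operands_to_pack, int results_to_unpack) {
        int scalar_cost = n;
        int simd_cost   = 1 + operands_to_pack  * PACK_COST(n)
                            + results_to_unpack * UNPACK_COST(n);
        return scalar_cost - simd_cost;   /* > 0 means profitable */
    }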

15. Extending the PackSet -- Cost Model
• Reusing a packed value compensates for its packing cost
  • Each packed use turns n scalar instructions into 1 SIMD instruction
  • So we want to generate longer chains of packed uses
• Choose PackSet instructions wisely
  • Contiguous memory loads/stores -> no packing/unpacking cost
  • Prioritize minimum-cost instructions when making choices

    long tmp1, tmp2, A[], B[], D[];
    tmp1 = B[i+0] + D[i+0];
    tmp2 = B[i+1] + D[i+1];
    A[i+0] = tmp1;
    A[i+1] = tmp2;

  [Figure: (tmp1, tmp2) = (B[i+0], B[i+1]) + (D[i+0], D[i+1]); (A[i+0], A[i+1]) = (tmp1, tmp2) -- all adjacent loads and stores, so no packing/unpacking cost!]

16. Combination
• Combine all profitable pairs into larger groups
• Two pairs can be combined when the left statement of one is the same as the right statement of the other
• A statement may appear in at most one group in the final PackSet

    Pairs (after extension):
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F
    L: (6) H = G + F    R: (9) K = J + I
    L: (2) C = E * 3    R: (5) G = F * 6
    L: (5) G = F * 6    R: (8) J = I * 8

    Combined groups:
    { (1) B = A[i+0], (4) F = A[i+1], (7) I = A[i+2] }
    { (3) D = C + B,  (6) H = G + F,  (9) K = J + I }
    { (2) C = E * 3,  (5) G = F * 6,  (8) J = I * 8 }
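
A minimal sketch of the chaining rule, over the Pair representation used earlier; print_group is a hypothetical helper that walks a chain of pairs in which one pair's right statement is the next pair's left:

    #include <stdio.h>

    typedef struct { int left, right; } Pair;

    /* Starting from pair `start`, follow the chain left -> right ->
     * right -> ... and print the resulting group of statement ids. */
    void print_group(const Pair *pairs, int n, int start) {
        int cur = start;
        printf("{ %d", pairs[cur].left);
        for (;;) {
            printf(", %d", pairs[cur].right);
            int next = -1;
            for (int i = 0; i < n; i++)
                if (pairs[i].left == pairs[cur].right) { next = i; break; }
            if (next < 0) break;
            cur = next;
        }
        printf(" }\n");
    }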

17. Scheduling
• Previous steps may create dependencies between groups!
  • Groups were formed without an order; moving instructions together risks violating dependencies
• Schedule the groups to guarantee correctness
  • a -> b: some instruction in group a depends on some instruction in group b
  • Cycle: circular dependency between a and b
  • Break one of them (e.g., a) and move a's dependent instructions after b
• Finally, emit one SIMD instruction for each group

    Group a:  x = a[i+0] + k1;   y = a[i+0] + k2;   z = a[i+2] + s;
    Group b:  q = b[i+0] + y;    r = b[i+1] + k3;   s = b[i+2] + k4;
    (b depends on a through y, and a depends on b through s -> cycle)

    After breaking group a:
    x = a[i+0] + k1;
    y = a[i+0] + k2;
    q = b[i+0] + y;  r = b[i+1] + k3;  s = b[i+2] + k4;   // one SIMD instruction
    z = a[i+2] + s;
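
A minimal sketch of scheduling as topological emission with cycle breaking; deps and the two-group size are hypothetical, and a real pass would re-schedule a split group's statements individually rather than just report the split:

    #include <stdbool.h>
    #include <stdio.h>

    #define NGROUPS 2

    /* deps[a][b] == true means group a depends on group b. Emit every
     * group whose dependences are satisfied; if none is ready, a cycle
     * exists, so break (split) one group and continue. */
    void schedule(bool deps[NGROUPS][NGROUPS]) {
        bool emitted[NGROUPS] = { false };
        for (int done = 0; done < NGROUPS; done++) {
            int ready = -1;
            for (int g = 0; g < NGROUPS && ready < 0; g++) {
                if (emitted[g]) continue;
                bool ok = true;
                for (int d = 0; d < NGROUPS; d++)
                    if (deps[g][d] && !emitted[d]) ok = false;
                if (ok) ready = g;
            }
            if (ready < 0) {                      /* cycle detected */
                for (int g = 0; g < NGROUPS; g++)
                    if (!emitted[g]) { ready = g; break; }
                printf("split group %d into scalars\n", ready);
            } else {
                printf("emit SIMD group %d\n", ready);
            }
            emitted[ready] = true;
        }
    }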

18. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

19. Evaluation
• SLP compiler implemented in SUIF, a compiler infrastructure
• Tested on two benchmark suites: SPEC95fp and multimedia kernels
• Performance measured three ways
  • SLP availability: the percentage of dynamic instructions eliminated from a sequential program after parallelization (n - 1 instructions are needed to pack or unpack n values in a SIMD register)
  • Comparison to vector parallelism
  • Speedup on AltiVec (next slide)
  [Charts: SLP Availability; SLP vs. Vector Parallelism]

20. Evaluation
• Speedup on AltiVec
  • Measured on a Motorola MPC7400 microprocessor with the AltiVec instruction set
  [Chart: Speedup on AltiVec -- up to 6.7x]

21. Conclusion
• Multimedia architectures are abundant; automatic compilation is needed
• SLP is the right paradigm
  • 20% dynamic instruction savings from non-vectorizable code sequences in SPEC95fp
• SLP extraction is successful
  • Simple, local analysis
  • Provides speedups from 1.24x to 6.70x
• Future work: extend beyond basic blocks
