
Exploiting Superword Level Parallelism with Multimedia Instruction Sets


Presentation Transcript


1. Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe, MIT CSAIL
Presenters: Baixu Chen, Rui Hou, Yifan Zhao

2. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

3. Introduction
• Vector Parallelism (VP):
  • Single instruction, multiple data (SIMD)
  • A vector register holds several values at once; one instruction then performs the same operation on every value in a single pass
• Example
  • In the vectorized iteration below, each load brings 3 adjacent values into a vector register, and one vector add executes in place of 3 individual scalar adds, so execution is ideally 3x faster

    // original loop
    for (i = 0; i < 9; ++i) {
        A[i] += B[i] + 2;
    }

    // unrolled loop
    for (i = 0; i < 9; i += 3) {
        A[i]   += B[i]   + 2;
        A[i+1] += B[i+1] + 2;
        A[i+2] += B[i+2] + 2;
    }

    // vectorized iteration
    for (i = 0; i < 9; i += 3) {
        A[i..i+2] += B[i..i+2] + 2;
    }
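
The A[i..i+2] slice notation above is pseudocode. As a concrete companion, here is the same computation written with real SSE2 intrinsics -- a minimal sketch, not from the slides, assuming x86 SSE2, 32-bit ints, and a hypothetical function name vector_add (real ISAs use power-of-two widths, so it processes 4 elements per pass instead of 3):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    void vector_add(int *A, const int *B, int n) {
        __m128i two = _mm_set1_epi32(2);               /* (2, 2, 2, 2) */
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128i a = _mm_loadu_si128((const __m128i *)&A[i]);
            __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
            a = _mm_add_epi32(a, _mm_add_epi32(b, two));
            _mm_storeu_si128((__m128i *)&A[i], a);
        }
        for (; i < n; ++i)                             /* scalar remainder */
            A[i] += B[i] + 2;
    }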

4. Introduction
• SLP vs Vector Parallelism
  • VP: identify vectorizable instructions by unrolling the loop

    for (i = 0; i < 9; ++i) {
        A[i] += B[i] + 2;
    }
    ->
    for (i = 0; i < 9; i += 3) {
        A[i..i+2] += B[i..i+2] + 2;
    }

  • SLP: identify vectorizable instructions by finding isomorphic instructions, which share the same instruction structure

    a = b + c * z[i+0];
    d = e + f * z[i+1];
    g = h + i * z[i+2];

  [Figure: the three statements fused into SIMD form -- (a, d, g) = (b, e, h) + (c, f, i) * (z[i+0], z[i+1], z[i+2])]

5. Introduction
• Superword Level Parallelism (SLP):
  • Generally applicable - SLP is not restricted to parallelizing loops
  • Finds independent, isomorphic instructions within the same basic block
  • Goals: 1) gain speedup via parallelization; 2) minimize the cost of packing and unpacking
  • Prefers operations on adjacent memory, where the packing cost is minimal
• SLP vs Vector Parallelism: see the loop and isomorphic-statement examples on the previous slide
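
To make "not restricted to loops" concrete, here is a hedged before/after sketch of SLP applied to straight-line code; the SSE target, the function names, and the 1.5f multiplier are illustrative assumptions, not from the paper:

    #include <xmmintrin.h>  /* SSE */

    /* Scalar version: four independent, isomorphic statements in one
     * basic block -- no loop anywhere. */
    void scale_scalar(float *out, const float *in) {
        out[0] = in[0] * 1.5f;
        out[1] = in[1] * 1.5f;
        out[2] = in[2] * 1.5f;
        out[3] = in[3] * 1.5f;
    }

    /* SLP-packed version: the loads and stores are adjacent, so no
     * packing/unpacking code is needed -- the four statements collapse
     * into one SIMD multiply. */
    void scale_slp(float *out, const float *in) {
        __m128 v = _mm_loadu_ps(in);             /* one wide load     */
        v = _mm_mul_ps(v, _mm_set1_ps(1.5f));    /* one SIMD multiply */
        _mm_storeu_ps(out, v);                   /* one wide store    */
    }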

6. Introduction
• SLP vs Vector Parallelism vs ILP
  [Figure: Venn diagram relating ILP, vector parallelism, and SLP -- vector parallelism is a subset of SLP, and both are forms of instruction-level parallelism]

7. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

8. Loop Unrolling
• Allows SLP to also cover loop vectorization
• Unroll factor = SIMD datapath size / scalar variable size
  • e.g., sizeof(A[i]) == 4 bytes, datapath == 8 bytes -> unroll count == 2

    // original loop
    for (i = 0; i < 8; ++i) {
        A[i] += B[i] + 2;
    }
    // after unrolling by 2
    for (i = 0; i < 8; i += 2) {
        A[i]   += B[i]   + 2;
        A[i+1] += B[i+1] + 2;
    }
    // after SLP vectorization
    for (i = 0; i < 8; i += 2) {
        A[i:i+2] += B[i:i+2] + 2;
    }

  [Pipeline: Basic block -> Loop unrolling -> Alignment analysis -> Pre-optimization -> PackSet seeding -> PackSet extension -> Group combination -> Scheduling -> SIMD instructions]
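
The unroll-factor rule above, written out as code -- a trivial sketch; unroll_factor is a hypothetical helper name:

    #include <stdio.h>

    int unroll_factor(int datapath_bytes, int elem_bytes) {
        /* SIMD datapath size / scalar variable size */
        return datapath_bytes / elem_bytes;
    }

    int main(void) {
        /* 8-byte datapath, 4-byte elements -> unroll count == 2 */
        printf("%d\n", unroll_factor(8, (int)sizeof(int)));
        return 0;
    }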

9. Alignment Analysis
• SLP vectorization emits wider loads/stores
  • A[i] += B[i] + 2: loads 4 bytes
  • A[i:i+2] += B[i:i+2] + 2: loads 8 bytes
• Goal: emit aligned SIMD loads (aligned to the datapath size)
  • Unaligned loads are slower on most Intel chips and illegal on ARM chips
• Alignment analysis computes address-modulo information
  • 0x1001 mod 4 = 1, 0x1004 mod 8 = 4, etc.
  • Modulo = 0 -> aligned!
• Merge 2 loads only if the result is an aligned load
  • load A[i], 4 bytes; load A[i+1], 4 bytes -> load A[i], 8 bytes
  • Only done if A[i] is aligned to 8 bytes
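
A minimal sketch of the merge test, assuming alignment analysis has already resolved the address's value; the names are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    #define DATAPATH_BYTES 8

    /* Two adjacent 4-byte loads starting at addr may be merged into one
     * 8-byte load only when addr mod 8 == 0 (modulo = 0 -> aligned). */
    bool can_merge_loads(uintptr_t addr) {
        return addr % DATAPATH_BYTES == 0;
    }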

10. Pre-optimization
• Emit three-address code (flattened instructions)
  • Each instruction has at most 3 operands
  • Isomorphic instructions are easier to identify in this form
• Classical optimizations
  • LICM, dead code elimination, constant propagation, etc.
  • Don't waste effort vectorizing instructions we don't need!
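
A hypothetical before/after of the flattening step; t1, t2, and D are compiler-introduced temporaries in this sketch:

    /* before flattening: D = (E * 3) + A[i]; */
    int flatten_demo(int E, const int *A, int i) {
        int t1 = E * 3;    /* each instruction: at most one operator, */
        int t2 = A[i];     /* at most 3 operands                      */
        int D  = t1 + t2;
        return D;
    }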

11. Identifying Adjacent Memory References
• Find pairs of adjacent, independent memory loads
  • Adjacent: A[i] and A[i+1]
  • Independent: no overlap; they load into different registers
• PackSet: a set of pairs of instructions

    Example basic block:
    (1) B = A[i+0]
    (2) C = E * 3
    (3) D = C + B
    (4) F = A[i+1]
    (5) G = F * 6
    (6) H = G + F
    (7) I = A[i+2]
    (8) J = I * 8
    (9) K = J + I

    Seeded PackSet:
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
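
A minimal sketch of the seeding step over a toy IR in which every load is (base array, constant index, destination register); the structures and names are hypothetical, and a real implementation would use dependence analysis for the independence test:

    typedef struct { int base, index, dest; } Load;
    typedef struct { int left, right; } Pair;

    /* Emit a pair (i, j) for every two loads that are adjacent (same
     * base, consecutive indices) and independent (distinct statements
     * writing distinct registers). Returns the number of pairs. */
    int seed_packset(const Load *loads, int n, Pair *out) {
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j &&
                    loads[i].base == loads[j].base &&
                    loads[j].index == loads[i].index + 1 &&
                    loads[i].dest != loads[j].dest)
                    out[count++] = (Pair){ i, j };
        return count;
    }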

12. Extending the PackSet
• Add more instructions to the PackSet...
• ...by packing statements that use already-packed values as source operands (def-use)

    PackSet after following def-use chains (basic block as on slide 11):
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F    <- def-use
    L: (6) H = G + F    R: (9) K = J + I    <- def-use

13. Extending the PackSet
• Add more instructions to the PackSet...
• ...by packing statements that use already-packed values (def-use)
• ...and by producing needed source operands in packed form (use-def)

    PackSet after following use-def chains:
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F
    L: (6) H = G + F    R: (9) K = J + I
    L: (2) C = E * 3    R: (5) G = F * 6    <- use-def
    L: (5) G = F * 6    R: (8) J = I * 8    <- use-def
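
A minimal sketch of one direction of the extension, over a toy three-address IR; the structures are hypothetical, and a full implementation would also check independence, alignment, and profitability before packing:

    typedef struct { char op; int dest, src1, src2; } Stmt;
    typedef struct { int left, right; } Pair;

    /* Toy isomorphism test: same operator. */
    int isomorphic(const Stmt *a, const Stmt *b) {
        return a->op == b->op;
    }

    /* def-use direction: given a packed pair p, look for isomorphic
     * statements that use dest(p.left) and dest(p.right) in the same
     * operand position, and propose them as a new pair. */
    int follow_def_uses(const Stmt *stmts, int n, Pair p, Pair *out) {
        int count = 0;
        for (int u1 = 0; u1 < n; u1++)
            for (int u2 = 0; u2 < n; u2++)
                if (u1 != u2 && isomorphic(&stmts[u1], &stmts[u2]) &&
                    stmts[u1].src1 == stmts[p.left].dest &&
                    stmts[u2].src1 == stmts[p.right].dest)
                    out[count++] = (Pair){ u1, u2 };
        return count;
    }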

14. Extending the PackSet -- Cost Model
• Estimate the cost before/after the optimization; perform it only when profitable
• Packing and unpacking have a cost
  • Packing: scalar registers -> SIMD register
  • Unpacking: SIMD register -> scalar registers

    A = f()
    B = g()
    C = A + 2
    D = B + 3
    E = C / 5
    F = D * 7

  [Figure: A and B are packed into one SIMD register and added to the packed constants (2, 3), producing (C, D); C and D are then unpacked back to scalar registers for the divide and multiply]
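
A minimal sketch of the profitability arithmetic, using the n - 1 pack/unpack cost mentioned in the evaluation; all names and cost constants here are illustrative assumptions:

    /* Packing or unpacking an n-wide vector costs n - 1 move/insert
     * instructions in this model. */
    #define PACK_COST(n)   ((n) - 1)
    #define UNPACK_COST(n) ((n) - 1)

    /* n isomorphic scalar ops become 1 SIMD op. Operands already in
     * packed form (e.g., adjacent loads) contribute no packing cost. */
    int savings(int n, int operands_to_pack, int results_to_unpack) {
        int scalar_cost = n;
        int simd_cost   = 1 + operands_to_pack  * PACK_COST(n)
                            + results_to_unpack * UNPACK_COST(n);
        return scalar_cost - simd_cost;   /* > 0 means profitable */
    }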

15. Extending the PackSet -- Cost Model
• Reusing a packed value compensates for its packing cost
  • Each packed use turns n scalar instructions into 1 SIMD instruction
  • So we want to generate longer chains of packed uses
• Choose PackSet instructions wisely
  • Contiguous memory loads/stores -> no packing/unpacking cost
  • Prioritize minimum-cost instructions when making choices

    long tmp1, tmp2, A[], B[], D[];
    tmp1 = B[i+0] + D[i+0];
    tmp2 = B[i+1] + D[i+1];
    A[i+0] = tmp1;
    A[i+1] = tmp2;

  [Figure: (tmp1, tmp2) = (B[i+0], B[i+1]) + (D[i+0], D[i+1]); (A[i+0], A[i+1]) = (tmp1, tmp2) -- all adjacent loads and stores, so no packing/unpacking cost!]

16. Combination
• Combine all profitable pairs into larger groups
• Two pairs can be combined when the left statement of one is the same as the right statement of the other
• A statement may appear in at most one group in the final PackSet

    Pairs (after extension):
    L: (1) B = A[i+0]   R: (4) F = A[i+1]
    L: (4) F = A[i+1]   R: (7) I = A[i+2]
    L: (3) D = C + B    R: (6) H = G + F
    L: (6) H = G + F    R: (9) K = J + I
    L: (2) C = E * 3    R: (5) G = F * 6
    L: (5) G = F * 6    R: (8) J = I * 8

    Combined groups:
    { (1) B = A[i+0], (4) F = A[i+1], (7) I = A[i+2] }
    { (3) D = C + B,  (6) H = G + F,  (9) K = J + I }
    { (2) C = E * 3,  (5) G = F * 6,  (8) J = I * 8 }
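
A minimal sketch of the chaining rule, over the Pair representation used earlier; print_group is a hypothetical helper that walks a chain of pairs in which one pair's right statement is the next pair's left:

    #include <stdio.h>

    typedef struct { int left, right; } Pair;

    /* Starting from pair `start`, follow the chain left -> right ->
     * right -> ... and print the resulting group of statement ids. */
    void print_group(const Pair *pairs, int n, int start) {
        int cur = start;
        printf("{ %d", pairs[cur].left);
        for (;;) {
            printf(", %d", pairs[cur].right);
            int next = -1;
            for (int i = 0; i < n; i++)
                if (pairs[i].left == pairs[cur].right) { next = i; break; }
            if (next < 0) break;
            cur = next;
        }
        printf(" }\n");
    }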

17. Scheduling
• Previous steps may create dependencies between groups!
  • Groups were formed without an order; moving instructions together risks violating dependencies
• Schedule the groups to guarantee correctness
  • a -> b: some instruction in group a depends on some instruction in group b
  • Cycle: circular dependency between a and b
  • Break one of them (e.g., a) and move a's dependent instructions after b
• Finally, emit one SIMD instruction for each group

    Group a:  x = a[i+0] + k1;   y = a[i+0] + k2;   z = a[i+2] + s;
    Group b:  q = b[i+0] + y;    r = b[i+1] + k3;   s = b[i+2] + k4;
    (b depends on a through y, and a depends on b through s -> cycle)

    After breaking group a:
    x = a[i+0] + k1;
    y = a[i+0] + k2;
    q = b[i+0] + y;  r = b[i+1] + k3;  s = b[i+2] + k4;   // one SIMD instruction
    z = a[i+2] + s;
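
A minimal sketch of scheduling as topological emission with cycle breaking; deps and the two-group size are hypothetical, and a real pass would re-schedule a split group's statements individually rather than just report the split:

    #include <stdbool.h>
    #include <stdio.h>

    #define NGROUPS 2

    /* deps[a][b] == true means group a depends on group b. Emit every
     * group whose dependences are satisfied; if none is ready, a cycle
     * exists, so break (split) one group and continue. */
    void schedule(bool deps[NGROUPS][NGROUPS]) {
        bool emitted[NGROUPS] = { false };
        for (int done = 0; done < NGROUPS; done++) {
            int ready = -1;
            for (int g = 0; g < NGROUPS && ready < 0; g++) {
                if (emitted[g]) continue;
                bool ok = true;
                for (int d = 0; d < NGROUPS; d++)
                    if (deps[g][d] && !emitted[d]) ok = false;
                if (ok) ready = g;
            }
            if (ready < 0) {                      /* cycle detected */
                for (int g = 0; g < NGROUPS; g++)
                    if (!emitted[g]) { ready = g; break; }
                printf("split group %d into scalars\n", ready);
            } else {
                printf("emit SIMD group %d\n", ready);
            }
            emitted[ready] = true;
        }
    }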

18. Outline
• Introduction
• Vector Parallelism
• Superword Level Parallelism
• SLP Compiler Algorithm and Example
• Evaluation
• Conclusion

19. Evaluation
• SLP compiler implemented in SUIF, a compiler infrastructure
• Tested on two benchmark suites: SPEC95fp and multimedia kernels
• Performance measured three ways
  • SLP availability: the percentage of dynamic instructions eliminated from a sequential program after parallelization (n - 1 instructions are needed to pack or unpack n values in a SIMD register)
  • Comparison to vector parallelism
  • Speedup on AltiVec (next slide)
  [Charts: SLP Availability; SLP vs. Vector Parallelism]

20. Evaluation
• Speedup on AltiVec
  • Measured on a Motorola MPC7400 microprocessor with the AltiVec instruction set
  [Chart: Speedup on AltiVec -- up to 6.7x]

21. Conclusion
• Multimedia architectures are abundant; automatic compilation is needed
• SLP is the right paradigm
  • 20% dynamic instruction savings from non-vectorizable code sequences in SPEC95fp
• SLP extraction is successful
  • Simple, local analysis
  • Provides speedups from 1.24x to 6.70x
• Future work: extend beyond basic blocks
