
Exploiting Superword Level Parallelism with Multimedia Instruction Sets



Presentation Transcript


  1. Exploiting Superword Level Parallelism with Multimedia Instruction Sets Samuel Larsen Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology {slarsen,saman}@lcs.mit.edu www.cag.lcs.mit.edu/slp

  2. Overview • Problem statement • New paradigm for parallelism → SLP • SLP extraction algorithm • Results • SLP vs. ILP and vector parallelism • Conclusions • Future work

  3. Multimedia Extensions • Additions to all major ISAs • SIMD operations

  4. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable

  5. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable • Different extensions to the same ISA • MMX and SSE • SSE vs. 3DNow!

  6. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable • Different extensions to the same ISA • MMX and SSE • SSE vs. 3DNow! • Need automatic compilation

  7. Vector Compilation • Pros: • Successful for vector computers • Large body of research

  8. Vector Compilation • Pros: • Successful for vector computers • Large body of research • Cons: • Involved transformations • Targets loop nests

  9. Superword Level Parallelism (SLP) • Small amount of parallelism • Typically 2 to 8-way • Exists within basic blocks • Uncovered with a simple analysis

  10. Superword Level Parallelism (SLP) • Small amount of parallelism • Typically 2 to 8-way • Exists within basic blocks • Uncovered with a simple analysis • Independent isomorphic operations • New paradigm

  11. 1. Independent ALU Ops
      R = R + XR * 1.08327
      G = G + XG * 1.89234
      B = B + XB * 1.29835
      becomes
      [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]

  12. 2. Adjacent Memory References
      R = R + X[i+0]
      G = G + X[i+1]
      B = B + X[i+2]
      becomes
      [R G B] = [R G B] + X[i:i+2]

  13. 3. Vectorizable Loops
      for (i=0; i<100; i+=1)
        A[i+0] = A[i+0] + B[i+0]

  14. 3. Vectorizable Loops
      for (i=0; i<100; i+=4)
        A[i+0] = A[i+0] + B[i+0]
        A[i+1] = A[i+1] + B[i+1]
        A[i+2] = A[i+2] + B[i+2]
        A[i+3] = A[i+3] + B[i+3]
      becomes
      for (i=0; i<100; i+=4)
        A[i:i+3] = A[i:i+3] + B[i:i+3]

  15. 4. Partially Vectorizable Loops
      for (i=0; i<16; i+=1)
        L = A[i+0] - B[i+0]
        D = D + abs(L)

  16. 4. Partially Vectorizable Loops
      for (i=0; i<16; i+=2)
        L = A[i+0] - B[i+0]
        D = D + abs(L)
        L = A[i+1] - B[i+1]
        D = D + abs(L)
      becomes
      for (i=0; i<16; i+=2)
        [L0 L1] = A[i:i+1] - B[i:i+1]
        D = D + abs(L0)
        D = D + abs(L1)

  17. Exploiting SLP with SIMD Execution • Benefit: • Multiple ALU ops → One SIMD op • Multiple ld/st ops → One wide mem op

  18. Exploiting SLP with SIMD Execution • Benefit: • Multiple ALU ops → One SIMD op • Multiple ld/st ops → One wide mem op • Cost: • Packing and unpacking • Reshuffling within a register

  19. Packing/Unpacking Costs
      C = A + 2
      D = B + 3
      becomes
      [C D] = [A B] + [2 3]

  20. Packing/Unpacking Costs
      • Packing source operands: A and B must first be moved into a vector register
      A = f()
      B = g()
      [C D] = [A B] + [2 3]

  21. Packing/Unpacking Costs
      • Packing source operands
      • Unpacking destination operands: C and D must be extracted for their scalar uses
      A = f()
      B = g()
      [C D] = [A B] + [2 3]
      E = C / 5
      F = D * 7

  22. Optimizing Program Performance • To achieve the best speedup: • Maximize parallelization • Minimize packing/unpacking

  23. Optimizing Program Performance • To achieve the best speedup: • Maximize parallelization • Minimize packing/unpacking • Many packing possibilities • Worst case: n ops → n! configurations • Different cost/benefit for each choice

  24. Observation 1: Packing Costs Can Be Amortized
      • Use packed result operands
      A = B + C    G = A - H
      D = E + F    I = D - J

  25. Observation 1: Packing Costs Can Be Amortized
      • Use packed result operands
      A = B + C    G = A - H
      D = E + F    I = D - J
      • Share packed source operands
      A = B + C    G = B + H
      D = E + F    I = E + J

  26. Observation 2: Adjacent Memory is Key • Large potential performance gains • Eliminate ld/st instructions • Reduce memory bandwidth

  27. Observation 2: Adjacent Memory is Key • Large potential performance gains • Eliminate ld/st instructions • Reduce memory bandwidth • Few packing possibilities • Only one ordering exploits pre-packing

  28. SLP Extraction Algorithm
      • Identify adjacent memory references
      A = X[i+0]
      C = E * 3
      B = X[i+1]
      H = C - A
      D = F * 5
      J = D - B

  29. SLP Extraction Algorithm (continued)
      The adjacent loads of A and B are packed:
      [A B] = X[i:i+1]

  30. SLP Extraction Algorithm
      • Follow def-use chains

  31. SLP Extraction Algorithm (continued)
      The uses of A and B pack their consumers:
      [H J] = [C D] - [A B]

  32. SLP Extraction Algorithm
      • Follow use-def chains

  33. SLP Extraction Algorithm (continued)
      The definitions of C and D are packed in turn:
      [C D] = [E F] * [3 5]

  34. SLP Extraction Algorithm
      Final packed statements:
      [A B] = X[i:i+1]
      [C D] = [E F] * [3 5]
      [H J] = [C D] - [A B]

  35. SLP Compiler Results • SLP compiler implemented in SUIF • Tested on two benchmark suites • SPEC95fp • Multimedia kernels • Performance measured three ways: • SLP availability • Compared to vector parallelism • Speedup on AltiVec

  36. SLP Availability

  37. SLP vs. Vector Parallelism

  38. Speedup on AltiVec [results chart; the peak bar is labeled 6.7]

  39. SLP vs. Vector Parallelism • Extracted with a simple analysis • SLP is fine grain → basic blocks

  40. SLP vs. Vector Parallelism • Extracted with a simple analysis • SLP is fine grain → basic blocks • Superset of vector parallelism • Unrolling transforms VP to SLP • Handles partially vectorizable loops

  41. SLP vs. Vector Parallelism [diagram: a brace groups statements into a basic block]

  42. SLP vs. Vector Parallelism [diagram: parallelism across loop iterations]

  43. SLP vs. ILP • Subset of instruction level parallelism

  44. SLP vs. ILP • Subset of instruction level parallelism • SIMD hardware is simpler • Lack of heavily ported register files

  45. SLP vs. ILP • Subset of instruction level parallelism • SIMD hardware is simpler • Lack of heavily ported register files • SIMD instructions are more compact • Reduces instruction fetch bandwidth

  46. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this

  47. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this • SLP & ILP may compete • Occurs when parallelism is scarce

  48. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this • SLP & ILP may compete • Occurs when parallelism is scarce • Unroll the loop more times • When ILP is due to loop level parallelism

  49. Conclusions • Multimedia architectures abundant • Need automatic compilation

  50. Conclusions • Multimedia architectures abundant • Need automatic compilation • SLP is the right paradigm • 20% non-vectorizable in SPEC95fp
