
Exploiting Superword Level Parallelism with Multimedia Instruction Sets



Presentation Transcript


  1. Exploiting Superword Level Parallelism with Multimedia Instruction Sets Samuel Larsen Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology {slarsen,saman}@lcs.mit.edu www.cag.lcs.mit.edu/slp

  2. Overview • Problem statement • New paradigm for parallelism → SLP • SLP extraction algorithm • Results • SLP vs. ILP and vector parallelism • Conclusions • Future work

  3. Multimedia Extensions • Additions to all major ISAs • SIMD operations

  4. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable

  5. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable • Different extensions to the same ISA • MMX and SSE • SSE vs. 3DNow!

  6. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable • Different extensions to the same ISA • MMX and SSE • SSE vs. 3DNow! • Need automatic compilation

  7. Vector Compilation • Pros: • Successful for vector computers • Large body of research

  8. Vector Compilation • Pros: • Successful for vector computers • Large body of research • Cons: • Involved transformations • Targets loop nests

  9. Superword Level Parallelism (SLP) • Small amount of parallelism • Typically 2 to 8-way • Exists within basic blocks • Uncovered with a simple analysis

  10. Superword Level Parallelism (SLP) • Small amount of parallelism • Typically 2 to 8-way • Exists within basic blocks • Uncovered with a simple analysis • Independent isomorphic operations • New paradigm

  11. 1. Independent ALU Ops
      R = R + XR * 1.08327
      G = G + XG * 1.89234
      B = B + XB * 1.29835
      becomes
      [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]

  12. 2. Adjacent Memory References
      R = R + X[i+0]
      G = G + X[i+1]
      B = B + X[i+2]
      becomes
      [R G B] = [R G B] + X[i:i+2]

  13. 3. Vectorizable Loops
      for (i=0; i<100; i+=1)
        A[i+0] = A[i+0] + B[i+0]

  14. 3. Vectorizable Loops
      for (i=0; i<100; i+=4)
        A[i+0] = A[i+0] + B[i+0]
        A[i+1] = A[i+1] + B[i+1]
        A[i+2] = A[i+2] + B[i+2]
        A[i+3] = A[i+3] + B[i+3]
      becomes
      for (i=0; i<100; i+=4)
        A[i:i+3] = A[i:i+3] + B[i:i+3]

  15. 4. Partially Vectorizable Loops
      for (i=0; i<16; i+=1)
        L = A[i+0] - B[i+0]
        D = D + abs(L)

  16. 4. Partially Vectorizable Loops
      for (i=0; i<16; i+=2)
        L = A[i+0] - B[i+0]
        D = D + abs(L)
        L = A[i+1] - B[i+1]
        D = D + abs(L)
      becomes
      for (i=0; i<16; i+=2)
        [L0 L1] = A[i:i+1] - B[i:i+1]
        D = D + abs(L0)
        D = D + abs(L1)

  17. Exploiting SLP with SIMD Execution • Benefit: • Multiple ALU ops → One SIMD op • Multiple ld/st ops → One wide mem op

  18. Exploiting SLP with SIMD Execution • Benefit: • Multiple ALU ops → One SIMD op • Multiple ld/st ops → One wide mem op • Cost: • Packing and unpacking • Reshuffling within a register

  19. Packing/Unpacking Costs
      C = A + 2
      D = B + 3
      becomes
      [C D] = [A B] + [2 3]

  20. Packing/Unpacking Costs
      • Packing source operands: A and B must first be moved into a vector register
      A = f()
      B = g()
      [C D] = [A B] + [2 3]

  21. Packing/Unpacking Costs
      • Packing source operands
      • Unpacking destination operands: C and D must be extracted for their scalar uses
      A = f()
      B = g()
      [C D] = [A B] + [2 3]
      E = C / 5
      F = D * 7

  22. Optimizing Program Performance • To achieve the best speedup: • Maximize parallelization • Minimize packing/unpacking

  23. Optimizing Program Performance • To achieve the best speedup: • Maximize parallelization • Minimize packing/unpacking • Many packing possibilities • Worst case: n ops → n! configurations • Different cost/benefit for each choice

  24. Observation 1: Packing Costs Can Be Amortized
      • Use packed result operands
      A = B + C    G = A - H
      D = E + F    I = D - J

  25. Observation 1: Packing Costs Can Be Amortized
      • Use packed result operands
      A = B + C    G = A - H
      D = E + F    I = D - J
      • Share packed source operands
      A = B + C    G = B + H
      D = E + F    I = E + J

  26. Observation 2: Adjacent Memory is Key • Large potential performance gains • Eliminate ld/st instructions • Reduce memory bandwidth

  27. Observation 2: Adjacent Memory is Key • Large potential performance gains • Eliminate ld/st instructions • Reduce memory bandwidth • Few packing possibilities • Only one ordering exploits pre-packing

  28. SLP Extraction Algorithm
      • Identify adjacent memory references
      A = X[i+0]
      C = E * 3
      B = X[i+1]
      H = C - A
      D = F * 5
      J = D - B

  29. SLP Extraction Algorithm (continued)
      The adjacent loads of A and B are packed:
      [A B] = X[i:i+1]

  30. SLP Extraction Algorithm
      • Follow def-use chains

  31. SLP Extraction Algorithm (continued)
      The uses of A and B pack their consumers:
      [H J] = [C D] - [A B]

  32. SLP Extraction Algorithm
      • Follow use-def chains

  33. SLP Extraction Algorithm (continued)
      The definitions of C and D are packed in turn:
      [C D] = [E F] * [3 5]

  34. SLP Extraction Algorithm
      Final packed statements:
      [A B] = X[i:i+1]
      [C D] = [E F] * [3 5]
      [H J] = [C D] - [A B]

  35. SLP Compiler Results • SLP compiler implemented in SUIF • Tested on two benchmark suites • SPEC95fp • Multimedia kernels • Performance measured three ways: • SLP availability • Compared to vector parallelism • Speedup on AltiVec

  36. SLP Availability

  37. SLP vs. Vector Parallelism

  38. Speedup on AltiVec [results chart; the peak bar is labeled 6.7]

  39. SLP vs. Vector Parallelism • Extracted with a simple analysis • SLP is fine grain → basic blocks

  40. SLP vs. Vector Parallelism • Extracted with a simple analysis • SLP is fine grain → basic blocks • Superset of vector parallelism • Unrolling transforms VP to SLP • Handles partially vectorizable loops

  41. SLP vs. Vector Parallelism [diagram: a brace groups statements into a basic block]

  42. SLP vs. Vector Parallelism [diagram: parallelism across loop iterations]

  43. SLP vs. ILP • Subset of instruction level parallelism

  44. SLP vs. ILP • Subset of instruction level parallelism • SIMD hardware is simpler • Lack of heavily ported register files

  45. SLP vs. ILP • Subset of instruction level parallelism • SIMD hardware is simpler • Lack of heavily ported register files • SIMD instructions are more compact • Reduces instruction fetch bandwidth

  46. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this

  47. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this • SLP & ILP may compete • Occurs when parallelism is scarce

  48. SLP and ILP • SLP & ILP can be exploited together • Many architectures can already do this • SLP & ILP may compete • Occurs when parallelism is scarce • Unroll the loop more times • When ILP is due to loop level parallelism

  49. Conclusions • Multimedia architectures abundant • Need automatic compilation

  50. Conclusions • Multimedia architectures abundant • Need automatic compilation • SLP is the right paradigm • 20% non-vectorizable in SPEC95fp
