
Exploiting Instruction-Level Parallelism with Software Approach #1



  1. Exploiting Instruction-Level Parallelism with Software Approach #1 E. J. Kim

  2. To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. • Goal: to keep a pipeline full.

  3. Latencies (clock cycles between a producing instruction and a using instruction)
     • FP ALU op -> another FP ALU op: 3
     • FP ALU op -> store double: 2
     • Load double -> FP ALU op: 1
     • Load double -> store double: 0
     • Branch delay: 1
     • Integer ALU op -> branch: 1
     • Integer load: 1
     • Integer ALU op -> integer ALU op: 1

  4. Example
     for (i = 1000; i > 0; i = i - 1)
         x[i] = x[i] + s;

     Loop: L.D    F0, 0(R1)
           ADD.D  F4, F0, F2
           S.D    F4, 0(R1)
           DADDIU R1, R1, #-8
           BNE    R1, R2, LOOP

  5. Without any Scheduling (clock cycle issued)
     Loop: L.D    F0, 0(R1)      1
           stall                 2
           ADD.D  F4, F0, F2     3
           stall                 4
           stall                 5
           S.D    F4, 0(R1)      6
           DADDIU R1, R1, #-8    7
           stall                 8
           BNE    R1, R2, LOOP   9
           stall                 10

  6. With Scheduling (clock cycle issued)
     Loop: L.D    F0, 0(R1)      1
           DADDIU R1, R1, #-8    2
           ADD.D  F4, F0, F2     3
           stall                 4
           BNE    R1, R2, LOOP   5
           S.D    F4, 8(R1)      6   (fills the branch delay slot)

     The S.D now uses offset 8(R1) because the DADDIU has already decremented R1 by 8; recognizing that the delayed-branch slot can be filled this way is not trivial.

  7. The actual work of operating on the array element takes 3 of the scheduled loop's 6 clock cycles (load, add, store).
     • The remaining 3 cycles are
       • loop overhead (DADDIU, BNE)
       • a stall
     • To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.

  8. Reducing Loop Overhead • Loop Unrolling • Simple scheme for increasing the number of instructions relative to the branch and overhead instructions • Simply replicates the loop body multiple times, adjusting the loop termination code. • Improves scheduling • It allows instructions from different iterations to be scheduled together. • Uses different registers for each iteration.
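  To make the transformation concrete, here is a minimal C-level sketch of the running example unrolled by four. The function name and the per-copy temporaries (standing in for the renamed registers) are illustrative, and it assumes the trip count is a multiple of four.

     /* Original: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
        Unrolled by 4; t0..t3 play the role of the renamed registers. */
     void add_scalar_unrolled(double *x, double s)
     {
         for (int i = 1000; i > 0; i -= 4) {   /* assumes 1000 is a multiple of 4 */
             double t0 = x[i]     + s;
             double t1 = x[i - 1] + s;
             double t2 = x[i - 2] + s;
             double t3 = x[i - 3] + s;
             x[i]     = t0;
             x[i - 1] = t1;
             x[i - 2] = t2;
             x[i - 3] = t3;
         }
     }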

  9. Unrolled Loop (No Scheduling) (clock cycle issued)
     Loop: L.D    F0, 0(R1)      1
           stall                 2
           ADD.D  F4, F0, F2     3
           stall                 4
           stall                 5
           S.D    F4, 0(R1)      6
           L.D    F6, -8(R1)     7
           stall                 8
           ADD.D  F8, F6, F2     9
           stall                 10
           stall                 11
           S.D    F8, -8(R1)     12
           L.D    F10, -16(R1)   13
           stall                 14
           ADD.D  F12, F10, F2   15
           stall                 16
           stall                 17
           S.D    F12, -16(R1)   18
           L.D    F14, -24(R1)   19
           stall                 20
           ADD.D  F16, F14, F2   21
           stall                 22
           stall                 23
           S.D    F16, -24(R1)   24
           DADDIU R1, R1, #-32   25
           stall                 26
           BNE    R1, R2, LOOP   27
           stall                 28

     28 clock cycles for 4 elements: 7 cycles per element.

  10. Loop Unrolling • Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer. • Unrolling improves the performance of the loop by eliminating overhead instructions.
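  A hedged illustration of the redundancy that unrolling exposes: naive replication keeps one pointer update per copy of the body, and the optimizer then folds the intermediate updates into fixed offsets, just as the four DADDIUs in a naively unrolled MIPS loop collapse into the -8/-16/-24 offsets and a single DADDIU #-32. Both functions below are sketches, not generated code.

     /* Four verbatim copies of the body, each stepping the pointer. */
     void body_replicated(double *p, double s)
     {
         *p = *p + s; p = p - 1;
         *p = *p + s; p = p - 1;
         *p = *p + s; p = p - 1;
         *p = *p + s; p = p - 1;   /* the final value of p is dead here */
     }

     /* After optimization: the decrements are folded into offsets,
        leaving one pointer update per unrolled iteration (done by the caller). */
     void body_optimized(double *p, double s)
     {
         p[0]  = p[0]  + s;
         p[-1] = p[-1] + s;
         p[-2] = p[-2] + s;
         p[-3] = p[-3] + s;
     }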

  11. Loop Unrolling (Scheduling) (clock cycle issued)
     Loop: L.D    F0, 0(R1)      1
           L.D    F6, -8(R1)     2
           L.D    F10, -16(R1)   3
           L.D    F14, -24(R1)   4
           ADD.D  F4, F0, F2     5
           ADD.D  F8, F6, F2     6
           ADD.D  F12, F10, F2   7
           ADD.D  F16, F14, F2   8
           S.D    F4, 0(R1)      9
           S.D    F8, -8(R1)     10
           DADDIU R1, R1, #-32   11
           S.D    F12, 16(R1)    12
           BNE    R1, R2, LOOP   13
           S.D    F16, 8(R1)     14

     14 clock cycles for 4 elements: 3.5 cycles per element. The last two stores use offsets 16(R1) and 8(R1) because the DADDIU has already subtracted 32 from R1.

  12. Summary • Goal: To know when and how the ordering among instructions may be changed. • This process must be performed in a methodical fashion either by a compiler or by hardware.

  13. To obtain the final unrolled code:
     • Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
     • Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
     • Use different registers to avoid unnecessary constraints.
     • Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

  14. (continued)
     • Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
     • Schedule the code, preserving any dependences needed to yield the same result as the original code.

  15. Loop Unrolling I (No Delayed Branch)
     Loop: L.D    F0, 0(R1)       <- true dependence: F0 feeds the ADD.D, F4 feeds the S.D
           ADD.D  F4, F0, F2
           S.D    F4, 0(R1)
           L.D    F0, -8(R1)      <- name dependence: every copy of the body
           ADD.D  F4, F0, F2         reuses registers F0 and F4
           S.D    F4, -8(R1)
           L.D    F0, -16(R1)
           ADD.D  F4, F0, F2
           S.D    F4, -16(R1)
           L.D    F0, -24(R1)
           ADD.D  F4, F0, F2
           S.D    F4, -24(R1)
           DADDIU R1, R1, #-32
           BNE    R1, R2, LOOP

  16. Loop Unrolling II (Register Renaming)
     Loop: L.D    F0, 0(R1)
           ADD.D  F4, F0, F2
           S.D    F4, 0(R1)
           L.D    F6, -8(R1)
           ADD.D  F8, F6, F2
           S.D    F8, -8(R1)
           L.D    F10, -16(R1)
           ADD.D  F12, F10, F2
           S.D    F12, -16(R1)
           L.D    F14, -24(R1)
           ADD.D  F16, F14, F2
           S.D    F16, -24(R1)
           DADDIU R1, R1, #-32
           BNE    R1, R2, LOOP

     Only the true dependences (each L.D -> ADD.D -> S.D chain) remain.

  17. With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
     • Potential shortfall in registers: register pressure
       • It arises because scheduling code to increase ILP causes the number of live values to increase; it may not be possible to allocate all the live values to registers.
       • The combination of unrolling and aggressive scheduling can cause this problem.

  18. Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.

  19. Unrolling with Two-Issue (one integer operation paired with one FP operation per cycle)
           Integer instruction        FP instruction           Cycle
     Loop: L.D    F0, 0(R1)                                    1
           L.D    F6, -8(R1)                                   2
           L.D    F10, -16(R1)        ADD.D F4, F0, F2         3
           L.D    F14, -24(R1)        ADD.D F8, F6, F2         4
           L.D    F18, -32(R1)        ADD.D F12, F10, F2       5
           S.D    F4, 0(R1)           ADD.D F16, F14, F2       6
           S.D    F8, -8(R1)          ADD.D F20, F18, F2       7
           S.D    F12, -16(R1)                                 8
           DADDIU R1, R1, #-40                                 9
           S.D    F16, 16(R1)                                  10
           BNE    R1, R2, LOOP                                 11
           S.D    F20, 8(R1)                                   12

     12 clock cycles for 5 elements: 2.4 cycles per element.

  20. Static Branch Prediction • Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.

  21. Static Branch Prediction
     • Predict every branch as taken
       • Simplest scheme
       • Average misprediction rate for SPEC: 34% (ranging from 9% to 59%)
     • Predict on the basis of branch direction
       • Backward-going branches: predicted taken
       • Forward-going branches: predicted not taken
       • Unlikely to generate an overall misprediction rate of less than 30% to 40%

  22. Static Branch Prediction
     • Predict branches on the basis of profile information collected from earlier runs.
     • An individual branch is often highly biased toward taken or untaken (its behavior is bimodally distributed).
     • Changing the input so that the profile is from a different run leads to only a small change in the accuracy of profile-based prediction.

  23. VLIW
     • Very Long Instruction Word
       • Relies on compiler technology to minimize potential data hazard stalls.
       • Formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
     • Wide instructions with multiple operations per instruction (64, 128 bits or more)
     • Example: the Intel IA-64 architecture

  24. Basic VLIW Approach
     • VLIWs use multiple, independent functional units.
     • A VLIW packages the multiple operations into one very long instruction.
     • The hardware a superscalar needs for deciding how many instructions to issue is unnecessary.
     • Uses loop unrolling and scheduling to fill the operation slots.
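  As a rough sketch of what "one very long instruction" means, the packet below mirrors the five-slot machine used in the text's example (two memory references, two FP operations, and one integer operation or branch). The struct layout and field names are assumptions for illustration, not a real encoding.

     #include <stdint.h>

     /* One VLIW issue packet: all five slots issue in the same clock cycle.
        Slots the compiler cannot fill are encoded as explicit no-ops. */
     typedef struct {
         uint32_t mem_op1;        /* memory reference slot 1 (e.g., an L.D or S.D) */
         uint32_t mem_op2;        /* memory reference slot 2                       */
         uint32_t fp_op1;         /* FP operation slot 1 (e.g., an ADD.D)          */
         uint32_t fp_op2;         /* FP operation slot 2                           */
         uint32_t int_or_branch;  /* integer ALU operation or branch               */
     } VLIWPacket;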

  25. Local Scheduling: scheduling the code within a single basic block.
     • Global Scheduling: scheduling code across branches
       • much more complex
     • Trace Scheduling: Section 4.5
     (Figure 4.5 of the text shows sample VLIW instructions.)

  26. Problems • Increase in code size • Wasted functional units • In the previous example, only about 60% of the functional units were used.

  27. Detecting and Enhancing Loop-Level Parallelism
     • Loop-level parallelism: analyzed at the source level
     • ILP: analyzed on the machine-level code, after compilation

     for (i = 1000; i > 0; i--)
         x[i] = x[i] + s;

  28. Advanced Compiler Support for Exposing and Exploiting ILP
     for (i = 1; i <= 100; i++) {
         A[i + 1] = A[i] + C[i];     /* S1 */
         B[i + 1] = B[i] + A[i + 1]; /* S2 */
     }

  29. Loop-Carried Dependence
     • Data accesses in later iterations are dependent on data values produced in earlier iterations.

     for (i = 1; i <= 100; i++) {
         A[i + 1] = A[i] + C[i];     /* S1: loop-carried dependence on A */
         B[i + 1] = B[i] + A[i + 1]; /* S2: loop-carried dependence on B */
     }

     These loop-carried dependences force successive iterations of this loop to execute in series.

  30. Does a loop-carried dependence mean there is no parallelism?
     Consider:
         for (i = 0; i < 8; i = i + 1)
             A = A + C[i]; /* S1 */
     Could compute:
         Cycle 1: temp0 = C[0] + C[1]; temp1 = C[2] + C[3];
                  temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
         Cycle 2: temp4 = temp0 + temp1; temp5 = temp2 + temp3;
         Cycle 3: A = A + temp4 + temp5;
     • Relies on the associative nature of "+".

  31. for (i = 1; i <= 100; i++) {
         A[i] = A[i] + B[i];      /* S1 */
         B[i + 1] = C[i] + D[i];  /* S2 */
     }
     S1 uses the B[i] value produced by S2 in the previous iteration, a loop-carried dependence. Despite it, this loop can be made parallel, because the dependence is not circular.

  32. A[1] = A[1] + B[1];
     for (i = 1; i <= 99; i++) {
         B[i + 1] = C[i] + D[i];
         A[i + 1] = A[i + 1] + B[i + 1];
     }
     B[101] = C[100] + D[100];

  33. Recurrence
     • A recurrence exists when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
     • Detecting a recurrence can be important:
       • Some architectures (especially vector computers) have special support for executing recurrences.
       • Some recurrences can be the source of a reasonable amount of parallelism.

  34. for (i = 2; i <= 100; i = i + 1)
         Y[i] = Y[i - 1] + Y[i];    /* dependence distance: 1 */

     for (i = 6; i <= 100; i = i + 1)
         Y[i] = Y[i - 5] + Y[i];    /* dependence distance: 5 */

     The larger the distance, the more potential parallelism can be obtained by unrolling the loop, as the sketch below shows.
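  A hedged C sketch of that claim for the distance-5 loop: indices that are congruent mod 5 form independent chains, so an unroll-by-5 body contains five mutually independent adds. The function name is illustrative; the bounds match the slide's loop.

     /* Each statement reads a value written five original iterations earlier,
        never one written in the same unrolled body, so the five adds are
        independent and can be scheduled or issued in parallel. */
     void recurrence_distance5(double Y[101])
     {
         for (int i = 6; i <= 100; i += 5) {
             Y[i]     = Y[i - 5] + Y[i];
             Y[i + 1] = Y[i - 4] + Y[i + 1];
             Y[i + 2] = Y[i - 3] + Y[i + 2];
             Y[i + 3] = Y[i - 2] + Y[i + 3];
             Y[i + 4] = Y[i - 1] + Y[i + 4];
         }
     }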

  35. Finding Dependences
     • Determining whether a dependence actually exists is NP-complete in general.
     • Dependence analysis
       • The basic tool for detecting loop-level parallelism
       • Applies only under a limited set of circumstances
       • Techniques: the greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, ...
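  A concrete sketch of the GCD test named above: for a write to x[a*i + b] and a read of x[c*i + d] in the same loop, a dependence is possible only if GCD(a, c) divides (d - b). The helper below is illustrative; note the test is conservative, so passing it means a dependence may exist, not that it does.

     #include <stdlib.h>  /* abs */

     static int gcd(int a, int b)
     {
         while (b != 0) { int t = b; b = a % b; a = t; }
         return a;
     }

     /* Returns 1 if a dependence between x[a*i + b] and x[c*i + d] is possible. */
     int gcd_test(int a, int b, int c, int d)
     {
         int g = gcd(abs(a), abs(c));
         if (g == 0)                /* both strides zero: same element iff b == d */
             return b == d;
         return (d - b) % g == 0;   /* GCD(a, c) must divide (d - b) */
     }

  For the earlier loop A[i + 1] = A[i] + C[i], the write has a = 1, b = 1 and the read has c = 1, d = 0; GCD(1, 1) = 1 divides -1, so the real loop-carried dependence is correctly flagged as possible.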

  36. Eliminating Dependent Computation
     • Algebraic simplifications of expressions
     • Copy propagation: eliminates operations that copy values.
     Before:
         DADDIU R1, R2, #4
         DADDIU R1, R1, #4
     After:
         DADDIU R1, R2, #8

  37. Eliminating Dependent Computation
     • Tree height reduction: reduces the height of the tree structure representing a computation.
     Before (serial chain, height 3):
         ADD R1, R2, R3
         ADD R4, R1, R6
         ADD R8, R4, R7
     After (balanced tree, height 2):
         ADD R1, R2, R3
         ADD R4, R6, R7
         ADD R8, R1, R4

  38. Eliminating Dependent Computation
     • Recurrences
     Before (serial):    sum = sum + x1 + x2 + x3 + x4 + x5
     After (regrouped):  sum = (sum + x1) + (x2 + x3) + (x4 + x5)

  39. Software Pipelining • Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. • By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.

  40. Software Pipelining
     • Counterpart to what Tomasulo's algorithm does in hardware.
     • Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
     • Start-up code before the loop and finish-up code after the loop are required.

  41. Software Pipelining (figure not preserved in the transcript: it shows the pipelined loop body drawing its instructions from different iterations of the original loop)

  42. Software Pipelining - Example
     • Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.
     Loop: L.D    F0, 0(R1)
           ADD.D  F4, F0, F2
           S.D    F4, 0(R1)
           DADDIU R1, R1, #-8
           BNE    R1, R2, Loop
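  A hedged C-level sketch of one standard answer: each kernel iteration performs the store for original iteration i, the add for iteration i - 1, and the load for iteration i - 2, so no result is used in the cycle it is produced. The start-up and finish-up code the slide asks to omit are included here as prologue and epilogue for completeness; variable names are illustrative.

     void add_scalar_pipelined(double *x, double s)
     {
         double t_load, t_add;

         /* start-up (prologue): fill the software pipeline */
         t_load = x[1000];          /* load for original iteration 1000 */
         t_add  = t_load + s;       /* add  for iteration 1000 */
         t_load = x[999];           /* load for iteration 999  */

         /* kernel: store for i, add for i - 1, load for i - 2 */
         for (int i = 1000; i >= 3; i--) {
             x[i]   = t_add;        /* store, iteration i     */
             t_add  = t_load + s;   /* add,   iteration i - 1 */
             t_load = x[i - 2];     /* load,  iteration i - 2 */
         }

         /* finish-up (epilogue): drain the pipeline */
         x[2] = t_add;
         x[1] = t_load + s;
     }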

  43. Software Pipelining
     • Software pipelining consumes less code space.
     • Loop unrolling reduces the overhead of the loop (the branch and counter-update code).
     • Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.

  44. Hardware Support for More Parallelism at Compile Time
     Conditional Instructions
     • Predicated instructions: an extension of the instruction set.
     • A conditional instruction refers to a condition that is evaluated as part of the instruction's execution:
       • Condition true: executed normally
       • Condition false: treated as a no-op
     • Example: conditional move

  45. Example (R1 = A, R2 = S, R3 = T)
     if (A == 0) { S = T; }

     With a branch:
           BNEZ  R1, L
           ADDU  R2, R3, R0
     L:

     With a conditional move, which executes only if the third operand is equal to zero:
           CMOVZ R2, R3, R1

  46. Conditional moves are used to change a control dependence into a data dependence.
     • Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.
     • A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.
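  A small hedged illustration of that point in C: the branch-free form below is the kind of code a compiler can map onto a conditional move, turning the control dependence on the test into a data dependence on the operands. The function name is illustrative.

     /* if (a == 0) s = t;  rewritten without a branch */
     int cmov_if_zero(int a, int s, int t)
     {
         return (a == 0) ? t : s;   /* a candidate for a CMOVZ-style instruction */
     }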
