
Speeding it up Part 2: Pipeline problems & tricks



Presentation Transcript


  1. Speeding it up, Part 2: Pipeline problems & tricks dr.ir. A.C. Verschueren, Eindhoven University of Technology, Section of Digital Information Systems

  2. Giving more time to a pipeline stage
     • A pipeline stage which cannot handle the next instruction in one clock cycle has to 'stall' the stages in front of it
     Example: multiply uses 2 extra clocks in ALU ('.' marks stall cycles, time runs to the right)
        r1 := r2 + 3    F D E W
        r3 := r4 x r5     F D E E E W
        r6 := r4 - 2        F D . . E W      (must wait for E)
        r7 := r2 - r5         F . . D E W    (must wait for D)
        r0 := r5 + 22           . . F D E W  (must wait for F)
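The chart above can be reproduced with a small scheduling sketch. This is my own minimal Python model (names like `schedule` are assumptions, not from the lecture): each stage can hold one instruction, and a stage may not hand its instruction onward while the next stage is still busy, which is exactly the stall behaviour described on the slide.

```python
def schedule(latencies):
    """For each instruction, return the cycle in which it enters each
    stage (F, D, E, W), given its Execute latency in clock cycles."""
    free = {"F": 0, "D": 0, "E": 0, "W": 0}  # cycle each stage becomes free
    timing = []
    for lat in latencies:
        f = free["F"]
        d = max(f + 1, free["D"])      # wait if Decode is still occupied
        e = max(d + 1, free["E"])      # wait if Execute is still occupied
        w = max(e + lat, free["W"])
        free["F"] = d                  # F is blocked until D accepts
        free["D"] = e                  # D is blocked until E accepts
        free["E"] = e + lat            # multiply keeps E busy for extra clocks
        free["W"] = w + 1
        timing.append({"F": f, "D": d, "E": e, "W": w})
    return timing

# multiply uses 2 extra clocks in ALU (latency 3), the others take 1
timing = schedule([1, 3, 1])
```

Running this reproduces the stalls from the chart: the third instruction fetches in cycle 2 and decodes in cycle 3, but cannot execute until the multiply leaves the ALU.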

  3. The bad thing about pipeline stalls • Stalls force 'no operation' cycles on the expensive hardware of the previous stages • The following instructions finish later than absolutely necessary Pipeline stalls should be avoided whenever possible !

  4. Another pipeline problem: 'dependencies'
     • In the standard pipeline, instructions which depend upon each other's results give problems
     initial values: r1 = 11, r2 = 22, r3 = 34
        r1 := r2 + 3    F D E W    E: r2 + 3 = 22 + 3 = 25, W: r1 := 25
        r4 := r3 - r1     F D E W  E: r3 - r1 = 34 - 11 = 23, W: r4 := 23
     wrong value! '25' not written yet...

  5. Solving the dependency problem
     • Compare D, E and W stage operands, stall the pipeline if a match is found
        r1 := r2 + 3    F D E W      E: r2 + 3 = 22 + 3 = 25, W: r1 := 25
        r4 := r3 - r1     F D D E W  E: r3 - r1 = 34 - 25 = 9, W: r4 := 9
     stall while D source = E destination, then while D source = W destination
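The interlock check on this slide fits in a few lines. A minimal sketch, with assumed names (`must_stall` and its parameters are mine, not from the lecture): the Decode stage stalls while any of its source registers matches the destination register still in the E or W stage.

```python
def must_stall(d_sources, e_dest, w_dest):
    """True when the instruction in Decode reads a register that an
    older instruction in Execute or Write has not yet written back."""
    busy = {e_dest, w_dest} - {None}     # None = no instruction in that stage
    return any(src in busy for src in d_sources)

# r4 := r3 - r1 sits in Decode while r1 := r2 + 3 is still in Execute:
stalled = must_stall(("r3", "r1"), e_dest="r1", w_dest=None)
```

Once `r1 := r2 + 3` has left the W stage, the check returns False and `r4 := r3 - r1` may proceed with the freshly written value.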

  6. Result forwarding to solve dependencies
     • A result forwarding 'path' feeds the ALU result D back to multiplexers on the source operands s1 and s2, bypassing the data registers
     stage 1: Fetch (PC + 1, program memory, I1), stage 2: Decode (I2, reads S1, S2), stage 3: Execute (I3, ALU produces D), stage 4: Write
     source operand control and multiplexer specification:
        IF I3.dest = I2.source1 THEN s1 := D ELSE s1 := S1;
        IF I3.dest = I2.source2 THEN s2 := D ELSE s2 := S2;
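The multiplexer specification above translates directly into code. A sketch in Python of the same two comparisons (the function name and argument order are my choice; the logic is the slide's):

```python
def forward(i2_source1, i2_source2, i3_dest, S1, S2, D):
    """Multiplexer control from the slide: if the instruction now in
    Execute (I3) writes a register that the instruction in Decode (I2)
    reads, take the ALU result D instead of the stale register value."""
    s1 = D if i3_dest == i2_source1 else S1   # IF I3.dest = I2.source1 ...
    s2 = D if i3_dest == i2_source2 else S2   # IF I3.dest = I2.source2 ...
    return s1, s2

# r4 := r3 - r1 decodes while r1 := r2 + 3 executes (D = 25):
operands = forward("r3", "r1", "r1", S1=34, S2=11, D=25)
```

With the values from the earlier dependency example, the second operand is forwarded as 25 instead of the stale 11, so no stall cycle is needed.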

  7. Parallel pipelines to speed things up
     • No need to wait for the completion of slow operations, if handled by separate hardware
     Example: loads go through separate memory pipeline hardware (M stages) with its own write stage, so there are 2 write stages
        r1 := r2 + 3    F D E W
        r3 := [r4]        F D M M M W
        r6 := r4 - 2        F D E W        write order reversed!
        r7 := r2 - r5         F D E W
        r0 := r3 + 22           F D E W    forwarding!

  8. The ‘order of completion’ • In this example, we have 'out-of-order completion' (shorthand: ‘OOO’): r6 is written before r3, while the instruction ordering suggests r3 before r6! • The normal case is called 'in-order completion'

  9. Dependencies with OOO completion
     • write/read or 'true data dependency': reading the 2nd source must wait for the 1st destination write, otherwise the 2nd instruction uses a wrong source value
     • write/write dependency: writing the 2nd destination must be done after writing the 1st destination, otherwise the destination holds a wrong result at the end
     • read/write dependency or 'antidependency': writing the 2nd destination must be done after reading the 1st source value, otherwise the 1st instruction uses a wrong source value

  10. ‘Scoreboarding’ instead of forwarding • Result forwarding helps in a simple pipeline • It becomes rather complex in a multiple pipeline with out-of-order completion • One of the earlier DEC Alpha processors used more than 40 result forwarding paths • A 'register scoreboard' can be used to make sure that dependency relations are kept in order

  11. Operation of a register scoreboard • All registers have a 'scoreboard' bit, initially reset • Instructions wait in the Decode stage until all their source and destination scoreboard bits are reset (to zero) • Instructions which exit the Decode stage set the scoreboard bit in their destination register(s) • A scoreboard bit is reset during the writing of a destination register in any Writeback stage
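The four rules above map one-to-one onto a tiny state machine. A minimal Python sketch (class and method names are my own, not from the lecture hardware), with one scoreboard bit per register:

```python
class Scoreboard:
    def __init__(self, nregs):
        self.pending = [False] * nregs   # one bit per register, initially reset

    def can_issue(self, sources, dest):
        # instructions wait in Decode until all their source and
        # destination scoreboard bits are reset
        return not any(self.pending[r] for r in (*sources, dest))

    def issue(self, dest):
        # exiting Decode sets the bit of the destination register
        self.pending[dest] = True

    def writeback(self, dest):
        # writing the destination in any Writeback stage resets the bit
        self.pending[dest] = False
```

Usage: after `issue(1)` for an instruction writing r1, any follow-up instruction reading or writing r1 is held in Decode until `writeback(1)` clears the bit.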

  12. Scoreboard performance • A simple scoreboard is very conservative in its stalling decisions • It stalls the pipeline for true data dependencies But removes all forwarding paths in return ! • Write-write and antidependencies are stalled much longer than absolutely necessary • They should be stalled in the Writeback stage,not the Decode stage !

  13. The real reason for some dependencies • Write-write and antidependencies exist because a register is re-used to hold another value ! • If we use a different destination register for each write action, these dependencies vanish • This requires changing the program, which is not always possible • The number of available registers may not be enough to give every result a different register

  14. Register ‘renaming’ as solution • Write-write and antidependencies can be removed by writing each result in a different hardware register • This removes the direct relation between a register number in the program and a real register Register numbers are renamed into something else ! • We have to make sure that source register references always use the correct (renamed) hardware register

  15. Register renaming example (the original slide marks the true, anti- and write-write dependencies between these instructions)
      before renaming:        after renaming:
      1) R1 := R2 + 3         R1b := R2a + 3
      2) R3 := R1 x 2         R3b := R1b x 2
      3) R1 := R6 + R2        R1c := R6a + R2a
      4) R2 := R1 - 15        R2b := R1c - 15
      All registers start as R..a

  16. An implementation of register renaming • Use a lookup table in the Decode stage which indicates the 'current' hardware register for each of the software-visible registers • Source values are read from the hardware registers currently referenced from the lookup table • Each destination register gets a 'fresh' hardware register whose reference is placed in the lookup table • Later pipeline stages all use the hardware register references for result forwarding and/or writeback
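The lookup-table scheme can be sketched directly, using the R..a, R..b naming of the earlier example. This is a minimal model (class name and suffix encoding are my own assumptions); a version counter per software register generates the 'fresh' hardware names:

```python
class Renamer:
    SUFFIX = "abcdefghij"   # version 0 = 'a', 1 = 'b', ... (assumed encoding)

    def __init__(self, regs):
        self.version = {r: 0 for r in regs}
        self.current = {r: r + "a" for r in regs}   # lookup table in Decode

    def rename(self, dest, sources):
        # sources read the current hardware register from the lookup table
        srcs = tuple(self.current[s] for s in sources)
        # the destination gets a fresh hardware register,
        # whose reference replaces the table entry
        self.version[dest] += 1
        self.current[dest] = dest + self.SUFFIX[self.version[dest]]
        return self.current[dest], srcs

renamer = Renamer(["R1", "R2", "R3", "R6"])
```

Feeding in the four instructions of the renaming example reproduces the 'after renaming' column of that slide exactly.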

  17. The problem with register renaming • When is a hardware register not needed anymore ? OR, in other words • When can a hardware register be re-used ? • There must be another hardware register assigned for its software register number AND • All source value references to it must have been done Will be solved later

  18. Flow control instructions in the pipeline
      • When the PC is changed by an instruction, the Fetch stage must wait for the actual update
      • For instance: a relative jump calculated by the ALU, with the PC updated in the Writeback stage
         PC := PC + 5     F D E W      (is a jump: PC updated here, in W)
         r8 := r1 - 22      F          fetch at wrong address!
         r3 := r4 x r5        F        fetch at wrong address!

  19. Improving the flow control handling
      • The number of stall cycles can be reduced a lot: update the PC earlier in the pipeline
      • For instance in the Decode stage
         PC := 25           F D        (is a jump: PC updated here, in D)
         r3 := r4 x r5        F – –    fetch at wrong address! turned into a no-operation (NOP)
         25 : r8 := r1 - 22     F D E W

  20. Another method: use ‘delay slots’
      • The pipeline stall can be removed by executing instructions following the flow control instruction
      • These are executed before the actual jump is made
         X   : PC := 25        F D          (is a jump: PC updated here, in D)
         X+1 : r3 := r4 x r5     F D E W    'delay slot': executed anyway...
         25  : r8 := r1 - 22       F D E W

  21. Delay slots: to have or not to have • Using delay slots changes processor behaviour: old programs will not run anymore ! • Compilers try to find useful instructions for delay slots • They are able to fill 75% of the first delay slots • But fill only 40% of the second delay slots • If no useful instruction can be found, insert a NOP

  22. An alternative to delay slots • Sometimes there are several stages between fetching and execution (PC update) of a jump instruction • This would lead to many (unfillable) delay slots • Alternative solution: a 'branch target cache' (BTC) • This cache contains, for out-of-sequence jumps, the new PC value and the first (few) instruction(s) • It is indexed on the address of the jump instruction: the BTC ‘knows’ a jump is coming before it is fetched !

  23. Operation of the Branch Target Cache
      • If the Branch Target Cache hits, the fetch stage starts fetching after the target address
      • The BTC provides the first (few) instruction(s) itself
         10 : PC := 22        F – –      BTC checks address 10: hit ! PC updated to 23...
         22 : r3 := r4 x r5     D E W    BTC provides this instruction, no fetch needed
         23 : r8 := r1 - 22     F D E W
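The BTC behaviour described above amounts to a lookup table keyed on the jump's own address. A minimal sketch (class and method names are assumptions; real BTCs are set-associative hardware, not a dictionary):

```python
class BranchTargetCache:
    def __init__(self):
        # jump address -> (new PC value, first few instructions at the target)
        self.entries = {}

    def record(self, jump_addr, target_pc, first_insns):
        """Fill an entry once an out-of-sequence jump has been taken."""
        self.entries[jump_addr] = (target_pc, first_insns)

    def lookup(self, fetch_addr):
        """Checked on every fetch address, BEFORE the instruction there is
        fetched: on a hit, fetching continues after the target address while
        the cached instructions are fed into the pipeline."""
        return self.entries.get(fetch_addr)

btc = BranchTargetCache()
btc.record(10, 22, ["r3 := r4 x r5"])
```

After the entry is filled, fetching address 10 again hits in the BTC: the cached target instruction enters Decode directly and the fetch stage jumps ahead to address 23.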

  24. Jump prediction saves time
      • By predicting the outcome of a conditional jump, there is no need to wait until the test outcome is known
      • Example: condition test outcome known in the W stage; prediction: taken
         10 : JNZ r1,22       F D E W
         22 : r3 := r4 x r5     F D E W    prediction correct: no time lost
         23 : r7 := r9            F D E W
      • If the outcome shows the prediction was wrong, the predicted instructions are discarded and fetching restarts:
         11 : r2 := 3                 F D E W
         12 : r8 := 0                   F D E W
      Must avoid wrong predictions !

  25. How to predict a test outcome (1) • The prediction may be given with a bit in the instruction • This shifts the prediction problem to the assembler/compiler • The instruction set must be changed to hold this flag • The prediction may be based upon the type of test and/or the jump direction • End of loop jumps are taken most of the time • A single bit test is generally unpredictable...

  26. How to predict a test outcome (2) • Prediction can be based upon the previous outcome(s) of the condition test • This is done with a 'branch history buffer’ • A cache which holds information for the most recently executed conditional jumps • May be based solely on last execution or more complex (statistical) algorithms • Implemented in separate hardware or combined with branch target/instruction caches • Combination can achieve a 'hit rate' of > 90%!
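One common instance of the 'more complex (statistical) algorithms' mentioned above, though not named on the slide, is a two-bit saturating counter per jump kept in the branch history buffer. A sketch under that assumption: two wrong outcomes in a row are needed before the prediction flips, so a single odd iteration of a loop does not disturb it.

```python
class BranchHistoryBuffer:
    def __init__(self):
        # jump address -> counter 0..3; 2 or higher means 'predict taken'
        self.counter = {}

    def predict(self, addr):
        return self.counter.get(addr, 1) >= 2   # unseen jumps: weakly not-taken

    def update(self, addr, taken):
        # saturate at 0 and 3 so one mispredict cannot flip a strong state
        c = self.counter.get(addr, 1)
        self.counter[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

For an end-of-loop jump this predicts 'taken' after a couple of iterations and keeps predicting 'taken' across the single not-taken loop exit, which is exactly the behaviour that makes such jumps highly predictable.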

  27. CALL and RETURN handling • A subroutine CALL can be seen as a jump combined with a memory write • It is not more problematic than a normal JUMP • A subroutine RETURN gives more problems • The new PC value cannot be determined from the instruction location and contents • Special tricks exist to bypass the memory stack read (for instance a ‘return address cache’)
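The 'return address cache' trick mentioned above is commonly built as a small hardware stack (often called a return address stack; the class below and its fixed depth are my own sketch, not the lecture's design): CALL pushes the return address, RETURN pops it, so the new PC is predicted without waiting for the memory stack read.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth   # small, fixed-size hardware stack

    def call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                # oldest entry is lost when full
        self.stack.append(return_pc)

    def ret(self):
        # a prediction only: it must still be checked against the
        # real return address read from the memory stack
        return self.stack.pop() if self.stack else None
```

Nested calls pop in reverse order of the pushes, matching normal call/return nesting; the prediction only fails when the stack has overflowed or the program manipulates return addresses in memory.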

  28. Calculated and indirect jumps • These give huge problems in a pipeline • The new PC value must be determined before fetching can continue • Most of the speedup tricks break down on this problem • A Branch Target Cache can help a little bit, but only if the actual target remains stable • The predicted target must be checked afterwards !
