Chapter Six

Chapter Six Pipelining: Overview

Pipelining • Improve performance by increasing instruction throughput Ideal speedup is number of stages in the pipeline. Do we achieve this?

single-cycle vs. pipelined performance • This chapter assumes only 8 instructions: lw, sw, add, sub, and, or, slt, beq.

single-cycle vs. pipelined performance • Single-clock implementation: the clock must be as long as the longest instruction, i.e., lw at 8 ns. • To execute 3 lw instructions: 24 ns • All pipeline stages take a single clock cycle. The clock must accommodate the slowest operation, 2 ns. • Pipelined time: see next slide.

single-cycle vs. pipelined performance Pipelined time: 14 ns

single-cycle vs. pipelined performance • Speed-up: • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages • Ideal: a 5-stage pipeline gives a 5-fold speed-up. • Problems: • stages may be imperfectly balanced. • pipelining involves some overhead. • Result: time per instruction in the pipelined machine will exceed the minimum possible.

single-cycle vs. pipelined performance • Note that we got 14 ns vs. 24 ns, not a 4-fold improvement. • Total execution time of a short sequence is less important than throughput: • assume that we had 1003 instructions • Add 1000 instructions to the pipeline • Each instruction adds 2 ns to total execution time: 2 x 1000 + 14 = 2014 ns • Single clock: 8 x 1000 + 24 = 8024 ns • Ratio: 8024/2014 = 3.98 • Pipelining improves performance by increasing instruction throughput • It does not decrease the execution time of an individual instruction
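The arithmetic on this slide can be checked with a minimal Python sketch. The 8 ns, 2 ns, and five-stage figures come from the slides; the function names are illustrative assumptions, and the model assumes no stalls.

```python
STAGES = 5            # pipeline depth used in the slides
SINGLE_CYCLE_NS = 8   # single-cycle clock, set by lw
PIPE_CYCLE_NS = 2     # pipelined clock, set by the slowest stage

def single_cycle_time(n):
    """Total time in ns for n instructions on the single-cycle design."""
    return n * SINGLE_CYCLE_NS

def pipelined_time(n):
    """Total time in ns for n instructions, assuming no stalls:
    the first instruction takes STAGES cycles, each later one adds a cycle."""
    return (STAGES + n - 1) * PIPE_CYCLE_NS

def speedup(n):
    """Ratio of single-cycle time to pipelined time for n instructions."""
    return single_cycle_time(n) / pipelined_time(n)

for n in (3, 1003):
    print(n, single_cycle_time(n), pipelined_time(n), round(speedup(n), 2))
```

Under this model, speedup(n) approaches 8/2 = 4 as n grows, which is why the 1003-instruction ratio is so close to 4 while the short sequence is not.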

Designing instruction sets for pipelining • MIPS instructions are all the same length. • Makes it easier to fetch in stage 1 and decode in stage 2 • In the 80x86 instruction set, instructions vary from 1 byte to 17 bytes. Pipelining is harder. • MIPS has only a few instruction formats. • Source register fields are in the same place in each instruction • Second stage can begin reading the register file at the same time that the hardware is decoding the instruction. • If instruction formats were not the same, MIPS would have to split stage 2, giving 6 stages.

Designing instruction sets for pipelining • MIPS memory operands only appear in loads or stores. • Can use the execute stage to calculate memory address and then access memory in following stage. • 80x86: can operate on the operands in memory. • So stages 3 and 4 expand to an address stage, memory stage, then execute stage. • MIPS operands must be aligned in memory. • A single data transfer instruction cannot require two data memory accesses. • Always transfer data between processor and memory in a single pipeline stage.

Pipelining • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • We’ll build a simple pipeline and look at these issues • We’ll talk about modern processors and what really makes it hard: • exception handling • trying to improve performance with out-of-order execution, etc.

Pipeline hazards • Structural hazards • Hardware cannot support the combination of instructions that we want to execute in the same clock cycle. • Example: assume one memory (e.g., one cache). • assume the earlier pipeline example had a fourth instruction • in the same clock cycle, the first instruction is accessing data from memory while the fourth instruction is being fetched from the same memory.
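The single-memory conflict can be located mechanically. The sketch below is illustrative only: it assumes one instruction issues per cycle starting at cycle 1, that only lw and sw use memory in the MEM stage, and the function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(ops):
    """Return (cycle, data_instr, fetched_instr) tuples for every cycle in
    which a single shared memory is needed for both a data access and an
    instruction fetch. ops is a list of mnemonics in program order."""
    conflicts = []
    for i, op in enumerate(ops):
        if op not in ("lw", "sw"):
            continue  # assumption: only loads and stores access data memory
        # Instruction i (0-based) is in IF during cycle i + 1.
        mem_cycle = i + 1 + STAGES.index("MEM")
        fetched = mem_cycle - 1  # index of the instruction in IF that cycle
        if fetched < len(ops):
            conflicts.append((mem_cycle, i, fetched))
    return conflicts

print(memory_conflicts(["lw", "lw", "lw", "lw"]))
```

With three lw instructions there is no overlap; adding a fourth makes its fetch collide with the first instruction's data access in cycle 4.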

Pipeline hazards • Control hazards • Need to make a decision based on the results of one instruction while others are executing. • Branch instruction. • One solution: stall. • Assume we have enough extra hardware to test registers, calculate the branch address, update the PC during second stage (we’ll do this later). • Result: next slide.

Pipeline hazards • Control hazards • The lw instruction, executed if the branch fails, is stalled one extra 2-ns clock cycle before starting. • Called pipeline stall or bubble

Pipeline hazards • Control hazards • Problem: if the branch cannot be resolved in the second stage, the pipeline must stall longer. • Common with longer pipelines. • Too slow. • Solution: predict whether the branch will fail. Execute accordingly. Undo if wrong. • Example: always predict that branches will fail. • Only slows down when the branch is taken. • See next slide.

Pipeline hazards • Top figure: branch not taken. • Bottom figure: branch taken.

Pipeline hazards • More sophisticated prediction: • Always predict that a branch at the bottom of a loop is taken • Dynamic hardware predictors. • Guess depends on the behavior of each branch. • Predictions change over life of a program. • Example: keep a history for each branch as taken or untaken. Use past to predict future. • Accuracy of this: about 90% • If wrong: must restart the pipeline from proper branch address.
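The "keep a history for each branch" idea can be sketched as a one-bit predictor that guesses each branch will repeat its last outcome. This is a toy model, not a description of any particular processor: the class name, the table keyed by branch address, and the default guess of "not taken" for unseen branches are all assumptions.

```python
class OneBitPredictor:
    """Remembers the last outcome of each branch and predicts it again."""

    def __init__(self):
        self.history = {}  # branch address -> last outcome (True = taken)

    def predict(self, pc):
        # Assumption: a branch never seen before is predicted not taken.
        return self.history.get(pc, False)

    def update(self, pc, taken):
        self.history[pc] = taken

def accuracy(trace):
    """trace is a list of (branch address, taken) pairs in execution order."""
    predictor = OneBitPredictor()
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)

# Hypothetical loop-closing branch: taken 9 times, then falls through; run twice.
loop_trace = [(0x40, t) for t in ([True] * 9 + [False]) * 2]
print(accuracy(loop_trace))
```

In this toy trace the predictor is wrong twice per pass through the loop (once on entry, once on exit), so the measured accuracy depends heavily on the loop trip count.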

Pipeline hazards • Cost of stalls. Assume all other instructions have a CPI of 1. Each branch stalls 1 clock. • assume 17% of instructions are branches. • CPI becomes 1 + 0.17 x 1 = 1.17. • So the slowdown is 1.17 versus the ideal case. • Note that slt and slti are included as branch instructions, but will not stall. So this is an approximation.
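The CPI estimate generalizes to other branch frequencies and stall lengths. A minimal sketch, with a hypothetical function name and the slide's 17% and 1-cycle figures as inputs:

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    """Average CPI when every branch adds stall_cycles to the base CPI."""
    return base_cpi + branch_fraction * stall_cycles

cpi = effective_cpi(base_cpi=1, branch_fraction=0.17, stall_cycles=1)
slowdown = cpi / 1  # relative to the ideal CPI of 1
print(round(cpi, 2), round(slowdown, 2))
```

The same function shows why deeper pipelines suffer more: with a longer stall per branch, the added term grows in proportion.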

Pipeline hazards • Delayed decision (what MIPS actually does) • Delayed branch always executes the next sequential instruction. • Branch takes place after that one instruction delay. • Assembler automatically puts an instruction into the branch delay slot. • Compilers typically fill 50% of the branch delay slots

Pipeline hazards • Data hazards • An instruction depends on the result of a previous instruction that is still in the pipeline. • Example:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
• Problem: the sub needs the result of the add (i.e., $s0) • Can add bubbles, but add doesn’t write its result until stage 5! • Cannot leave this to compilers: it is too common • Solution: forwarding or bypassing • Get the needed value as soon as it is calculated, before it is written back.
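Spotting this kind of dependence is a simple comparison of register names. The sketch below is a simplification: it handles only three-register instructions such as add and sub, checks only adjacent instruction pairs, and uses hypothetical helper names.

```python
def parse(instr):
    """Split 'add $s0, $t0, $t1' into (op, destination, [sources])."""
    op, rest = instr.split(None, 1)
    regs = [r.strip() for r in rest.split(",")]
    return op, regs[0], regs[1:]

def raw_hazards(program):
    """Return (producer, consumer, register) for each adjacent pair in which
    the second instruction reads the register the first one writes."""
    hazards = []
    for i in range(len(program) - 1):
        _, dest, _ = parse(program[i])
        _, _, sources = parse(program[i + 1])
        for reg in sources:
            if reg == dest:
                hazards.append((i, i + 1, reg))
    return hazards

print(raw_hazards(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))
```

Forwarding hardware performs essentially this comparison between pipeline stages, then routes the ALU output to the dependent instruction instead of waiting for the register write.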

Sidetrack: new pipeline representation • Use symbols to represent the physical resources. • IF: instruction fetch stage. Box represents instruction memory. • ID: instruction decode/register read stage. Box represents register file. • EX: execution stage. Box represents ALU. • MEM: memory access stage. Box represents data memory. • WB: write back stage. Box represents register file. • Shading: right half means read, left half means write.

Pipeline hazards • Example: solution to above instructions.

Pipeline hazards • Forwarding is valid only if the destination stage is later in time than the source stage. • Cannot forward from the output of the memory access stage of the first instruction to the input of the execution stage of the following one. • Forwarding cannot prevent all pipeline stalls. • Example:
    lw $s0, 0($t1)      # data loaded into $s0 in stage 4
    sub $s0, $s0, $t1   # data needed in stage 3
• Must stall • See next slide.

Pipeline hazards • Load-use data hazard

Pipeline hazards • Can reorder code to avoid pipeline stalls • Example:
    # reg $t1 has address of v[k]
    lw $t0, 0($t1)   # reg $t0 (temp) = v[k]
    lw $t2, 4($t1)   # reg $t2 = v[k+1]
    sw $t2, 0($t1)   # v[k] = reg $t2
    sw $t0, 4($t1)   # v[k+1] = reg $t0 (temp)
• Hazard occurs on register $t2 between second lw and first sw. • Swap instructions to eliminate hazard:
    # reg $t1 has address of v[k]
    lw $t0, 0($t1)   # reg $t0 (temp) = v[k]
    lw $t2, 4($t1)   # reg $t2 = v[k+1]
    sw $t0, 4($t1)   # v[k+1] = reg $t0 (temp)
    sw $t2, 0($t1)   # v[k] = reg $t2
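The effect of the swap can be checked by counting load-use pairs in each ordering. This is a rough sketch: it follows the slide in treating any use of a loaded register by the very next instruction as one stall, it understands only lw, sw, and three-register instructions, and the function names are hypothetical.

```python
import re

def registers(operands):
    """All register names in an operand string such as '$t2, 4($t1)'."""
    return re.findall(r"\$\w+", operands)

def regs_read(instr):
    """Registers an instruction reads (lw, sw, and R-format only)."""
    op, operands = instr.split(None, 1)
    regs = registers(operands)
    if op == "sw":
        return regs       # sw reads both the stored value and the base
    return regs[1:]       # otherwise the first register is the destination

def load_use_stalls(program):
    """Count adjacent pairs where an instruction uses the result of the
    lw immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, operands = prev.split(None, 1)
        if op == "lw" and registers(operands)[0] in regs_read(cur):
            stalls += 1
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))
```

A compiler scheduling pass does something similar: it looks for an independent instruction to place between a load and its first use.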

Pipeline hazards • Original MIPS processors required software to follow a load with an instruction independent of that load. • Called a delayed load. • MIPS was designed to make forwarding easier. • Each MIPS instruction writes a single result at the end of its execution • Forwarding is harder if there are multiple results to forward per instruction • It is also harder if an instruction needs to write a result before the end of its execution.

Pipeline hazards: other hazards • Ian’s hazard: instruction is not in cache (memory) • Save the original PC value (current PC - 4) • Stall the pipeline • Fetch the instruction from RAM (or level 2 cache) • Write the cache entry when it arrives from RAM • Restart the program at the original PC value • refetches the instruction • this time finds it in cache • Data not in cache • Similar, but can continue to execute later instructions while waiting (if they don’t use data from the stalled instruction). • Other techniques covered in chapter 7
