Pipelining

Pipelining 6.1, 6.2

Performance Measurements • Cycle Time: Time __________________ • Latency: Time to finish a _____________, start to finish • Throughput: Average ______________ per unit time • CyclesPerJob: On average, number of _______________________________.

Goals • Faster clock rate • Use machine more efficiently • No longer execute only one instruction at a time

Laundry • Laundry-o-matic washes, dries & folds • Wash: 30 min • Dry: 40 min • Fold: 20 min • It switches them internally with no delay • How long to complete 1 load? ______

Laundry-o-Matic - SingleCycle 1 2 3 W D F Load 0 30 60 90 120 150 180 210 240 270 Minutes

Laundry-o-Matic • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes

Pipelined Laundry • Split the laundry-o-matic into a washer, dryer, and folder (what a concept) • Moving the laundry from one to another takes 6 minutes

Pipelined Laundry Switch all loads at the same time 1 2 3 W D F Load 0 30 60 90 120 150 180 210 240 270 Minutes

Pipelined Laundry • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes

Single-Cycle vs Pipelined • _________ has the higher cycle time • _________ has the higher clock rate • _________ has the higher single-load latency • _________ has the higher throughput • _________ has the higher CPL (Cycles per Load) • More stages makes a _______ clock rate

Obstacles to speedup in Pipelining W D F • 1. • 2. • Ideal cycle time w/out above limitations with n stage pipeline:

Creating Stages IF ID MEM WB • Fetch – get instruction • Decode – read registers • Execute – use ALU • Memory – access memory • WriteBack – write registers Fetch Decode Execute Memory WriteBack

Pipelined Machine Decode Execute Memory Fetch << 2 << 2 4 Addr Out Data Data Memory In Data src1 src1data src2 src2data Register File destreg destdata PC op/fun rs rt rd imm Read Addr Out Data Instruction Memory Sign Ext 16 32 (Writeback) Pipeline Register

IF ID MEM WB add $s0, $0, $0 IF ID MEM WB lw $s1, 0($t0) sw $s2, 0($t1) IF ID MEM WB or $s3, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time-> In what cycle was $s1 written? In what cycle was $s4 read? In what cycle was the Add executed?

Performance Analysis • Measurements related to our machine • Job = single instruction • Latency - Time to finish a complete _______________, start to finish. • Throughput – Average ______________ completed per unit time. • CyclesPerInstruction – An _________ completes each how many cycles? • Which is more important for reducing program execution time?

Machine Comparison Fetch Decode Execute Memory WriteBack 2ns 1ns 2ns 2ns 1ns 0.1 ns pipeline register delay Single-Cycle Implementation Clock cycle time: _____ ns Latency of a single instruction: _____ ns Throughput for machine: _____ inst/ns Pipelined Implementation Clock cycle time: _____ ns Latency of a single instruction: _____ ns Throughput for machine: _____ inst/ns

Example 2 – How do we speed up pipelined machine? Fetch Decode Execute Memory Writeback 6ns 4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Single cycle: 1 / ns Pipelined: 1 / ns

Example 2 – Split more stages Fetch Decode Execute Memory Writeback 6ns 4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Which stage(s) should we split? _________ and _________

Example 2 – After Split F WB ___ns ___ns ___ns ___ns ___ns ___ns ___ns 0.1 ns pipelined register delay Single cycle: 1 / ns Pipelined: 1 / ns

Incorrect Execution Easy Right? Not so fast. In what cycle does the sw write $s0? In what cycle does the or read $s0? lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF ID MEM WB sw $s2, 0($t1) IF ID MEM WB and $s6, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->

Correct, Slow ExecutionData Hazards ID & WB take only ½ cycle each. All other stages take the full cycle In what cycle does the sw write $s0? In what cycle does the oi read $s0? Stall - wasted cycles lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF IF IF ID MEM WB sw $s2, 0($t1) IF ID MEM WB and $s6, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->

Barriers to pipeline performance • Uneven stages • Pipeline register delays • Data Hazards • ______________________________________

Default: Stall Stall - wasted cycles lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF sw $s2, 0($t1) and $s6, $s4, $t3 1 2 3 4 5 6 7 8 Time->

Solution 1: Data Forwarding In what cycle is $s0 produced in the machine? In what cycle is $s0 used in the machine? Add wires that pass data from the stage in which it is produced to the stage in which it is used. lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF ID sw $s2, 0($t1) IF and $s6, $s4, $t3 1 2 3 4 5 6 7 8 Time->

Data-ForwardingWhere are those wires? Decode Execute Memory Fetch << 2 << 2 4 Addr Out Data Data Memory In Data src1 src1data src2 src2data Register File destreg destdata PC op/fun rs rt rd imm Read Addr Out Data Instruction Memory Sign Ext 16 32 (Writeback) Pipeline Register

Data ForwardingExample 2 Draw the timing diagram with data forwarding Draw arrows to indicate data passing through forwarding lw $t0, 0($s0) F D M W addi $t0, $t0, 1 F D add $s2, $s2, $t0 F sw $s2, 0($s0) 1 2 3 4 5 6 7 8 9 10 11 12 Time->

Who reorders instructions? • Static scheduling • Compiler • ___________, but does not know when caches miss or loads/stores are to the same locations (conservative) • Dynamic scheduling • Hardware • _____________________________, but has all knowledge (more accurate)

Solution 2: Instruction Reordering (Before reordering) Stall - wasted cycles lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF IF IF MEM WB ID sw $s2, 0($t1) IF ID MEM WB and $s6, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->

Solution 2: Instruction Reordering (With Instruction Reordering) lw $s0, 0($t4) IF ID MEM WB WB IF ID IF 1 2 3 4 5 6 7 8 Time->

Instruction Reordering - Are these equivalent? (Get the same answer) lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF IF IF MEM WB ID sw $s3, 0($t1) IF ID MEM WB and $s0, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->

A deeper look lw $s0, 0($t4) IF ID MEM WB WB sw $s3, 0($t1) IF ID MEM WB and $s0, $s4, $t3 IF ID MEM WB or $s3, $s0, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->

Dependencies And our friends RAW, WAR, and WAW

RAW – Read after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

WAR - Write after Read add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

WAW –Write after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0

Identify all of the dependencies RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8

Which dependencies might cause stalls (hazards)? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8 So why do we care about WAW and WAR dependencies at all?!?

Can we reorder the or? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8

No, we can not reorder the or RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB RAW sub $t2, $t0, $s3 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8

Can we reorder the mul? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8

No, siree. Mul stays put. RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB IF ID MEM WB sub $t2, $t0, $s3 or $s3, $t7, $s2 IF ID MEM WB 1 2 3 4 5 6 7 8

How to remove WAR, WAW dependencies? add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 and $t4, $s3, $s5 add $s3, $s4, $s6 IF ID MEM WB 1 2 3 4 5 6 7 8

___________________________ use a different register for that result (and all subsequent uses of that result) add $t0, $s0, $s1 IF ID MEM WB sub ____, $t0, $s3 IF ID MEM WB or ____, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 and $t4, ___, $s5 add $s3, $s4, $s6 IF ID MEM WB 1 2 3 4 5 6 7 8

Who renames registers? • Static register renaming • Compiler • Compiler is the one who makes assignments in the first place! • Number of registers limited by ____________ • Dynamic register renaming • Hardware • Can offer more registers – • Number of registers limited by ____________

Minimizing Data HazardsSummary

Dependencies Summary • What is the difference between a hazard and a dependency? • How can we get rid of name dependencies? • What limits this solution?

The Big Picture Fewer Stages More Stages More hardware Smaller More power Faster Clock Fewer Stalls Higher Throughput

The Big Picture • Throughput (inst / ns) = • inst / cycle * cycle / ns

Control Hazard(Incorrect Execution) In what cycle does the nextPC get calculated for the bne? In what cycle does the or get fetched? add $s5, $s4, $t1 IF ID MEM WB IF ID MEM WB bne $s0, $s1, end IF ID MEM WB or $s3, $s0, $t3 IF ID MEM WB end: sw $s2, 0($t1) 1 2 3 4 5 6 7 8 Time->

Pipelining

Pipelining

Presentation Transcript

Pipelining

Pipelining

PIPELINING

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining

Pipelining