Pipelining 6.1, 6.2
Performance Measurements • Cycle Time: Time __________________ • Latency: Time to finish a _____________, start to finish • Throughput: Average ______________ per unit time • CyclesPerJob: On average, number of _______________________________.
Goals • Faster clock rate • Use machine more efficiently • No longer execute only one instruction at a time
Laundry • Laundry-o-matic washes, dries & folds • Wash: 30 min • Dry: 40 min • Fold: 20 min • It switches them internally with no delay • How long to complete 1 load? ______
Laundry-o-Matic - Single-Cycle [Timing chart: loads 1, 2, 3 vs. time; each load passes through W, D, F; time axis 0 to 270 minutes]
Laundry-o-Matic • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes
Pipelined Laundry • Split the laundry-o-matic into a washer, dryer, and folder (what a concept) • Moving the laundry from one to another takes 6 minutes
Pipelined Laundry • Switch all loads at the same time [Timing chart: loads 1, 2, 3 vs. time; each load passes through W, D, F; time axis 0 to 270 minutes]
Pipelined Laundry • Cycle Time: Clothing is switched every ____ minutes • Latency: A single load takes a total of ______ minutes • Throughput: A load completes each ______ minutes • CyclesPerLoad: Every ____ cycle(s), a load completes
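The laundry numbers can be checked with a small sketch. It assumes the pipelined machine switches all loads together, so its cycle is the slowest stage plus the 6-minute move, and that the move is charged on every stage, including the last one. The function names are illustrative, not part of the slides.

```python
def finish_times_single_cycle(stage_minutes, n_loads):
    """Minute at which each load finishes when only one load is in the machine."""
    cycle = sum(stage_minutes)  # one long cycle covers wash + dry + fold
    return [cycle * (k + 1) for k in range(n_loads)]


def finish_times_pipelined(stage_minutes, move_minutes, n_loads):
    """Minute at which each load finishes when every stage works on a different load."""
    cycle = max(stage_minutes) + move_minutes  # slowest stage sets the pace
    depth = len(stage_minutes)                 # cycles needed by the first load
    return [cycle * (depth + k) for k in range(n_loads)]


stages = [30, 40, 20]  # wash, dry, fold (minutes)
print(finish_times_single_cycle(stages, 3))   # [90, 180, 270]
print(finish_times_pipelined(stages, 6, 3))   # [138, 184, 230]
```

Under these assumptions the first pipelined load finishes later than in the single-cycle machine, but every later load finishes one (shorter) cycle after the previous one.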
Single-Cycle vs Pipelined • _________ has the higher cycle time • _________ has the higher clock rate • _________ has the higher single-load latency • _________ has the higher throughput • _________ has the higher CPL (Cycles per Load) • More stages make a _______ clock rate
Obstacles to speedup in Pipelining W D F • 1. • 2. • Ideal cycle time w/out above limitations with n stage pipeline:
Creating Stages: IF ID EX MEM WB • Fetch (IF) – get instruction • Decode (ID) – read registers • Execute (EX) – use ALU • Memory (MEM) – access memory • WriteBack (WB) – write registers
Pipelined Machine [Figure: pipelined datapath divided into Fetch, Decode, Execute, Memory, and (Writeback) stages by pipeline registers; labeled components: PC, Instruction Memory, Register File (src1, src2, destreg), Sign Ext 16 -> 32, << 2 shifters, Data Memory]
Cycle                  1   2   3   4   5   6   7   8
add $s0, $0, $0        IF  ID  EX  MEM WB
lw  $s1, 0($t0)            IF  ID  EX  MEM WB
sw  $s2, 0($t1)                IF  ID  EX  MEM WB
or  $s3, $s4, $t3                  IF  ID  EX  MEM WB
• In what cycle was $s1 written? • In what cycle was $s4 read? • In what cycle was the add executed?
Performance Analysis • Measurements related to our machine • Job = single instruction • Latency - Time to finish a complete _______________, start to finish. • Throughput – Average ______________ completed per unit time. • CyclesPerInstruction – An _________ completes each how many cycles? • Which is more important for reducing program execution time?
Machine Comparison Fetch Decode Execute Memory WriteBack 2ns 1ns 2ns 2ns 1ns 0.1 ns pipeline register delay Single-Cycle Implementation Clock cycle time: _____ ns Latency of a single instruction: _____ ns Throughput for machine: _____ inst/ns Pipelined Implementation Clock cycle time: _____ ns Latency of a single instruction: _____ ns Throughput for machine: _____ inst/ns
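The comparison on this slide can be worked through numerically. The sketch below assumes the single-cycle machine pays no pipeline register delay, and that the pipelined clock is the slowest stage plus the register delay, paid once per stage. The function name `compare` is hypothetical.

```python
def compare(stage_ns, reg_delay_ns):
    """Cycle time, single-instruction latency, and throughput for both designs."""
    single_cycle = sum(stage_ns)                # all stages fit in one long cycle
    pipe_cycle = max(stage_ns) + reg_delay_ns   # slowest stage plus register delay
    return {
        "single": {
            "cycle_ns": single_cycle,
            "latency_ns": single_cycle,
            "inst_per_ns": 1 / single_cycle,
        },
        "pipelined": {
            "cycle_ns": pipe_cycle,
            "latency_ns": pipe_cycle * len(stage_ns),  # one cycle per stage
            "inst_per_ns": 1 / pipe_cycle,             # steady state, no stalls
        },
    }


# Fetch, Decode, Execute, Memory, WriteBack delays from the slide
result = compare([2, 1, 2, 2, 1], 0.1)
for name, metrics in result.items():
    print(name, metrics)
```

With these inputs the pipelined design has the worse latency but the better throughput, which is the trade-off the worksheet is after.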
Example 2 – How do we speed up pipelined machine? Fetch Decode Execute Memory Writeback 6ns 4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Single cycle: 1 / ns Pipelined: 1 / ns
Example 2 – Split more stages Fetch Decode Execute Memory Writeback 6ns 4ns 8ns 10ns 4ns 0.1 ns pipelined register delay Which stage(s) should we split? _________ and _________
Example 2 – After Split F WB ___ns ___ns ___ns ___ns ___ns ___ns ___ns 0.1 ns pipelined register delay Single cycle: 1 / ns Pipelined: 1 / ns
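One way to explore the split is to repeatedly halve the slowest stage and recompute the clock. This is only a sketch: it assumes a stage can be split into two perfectly equal halves, which real hardware rarely allows, and `split_slowest` is an illustrative helper rather than anything from the slides.

```python
def split_slowest(stage_ns, extra_stages):
    """Return a new stage list after halving the slowest stage `extra_stages` times."""
    stages = list(stage_ns)
    for _ in range(extra_stages):
        i = stages.index(max(stages))  # first occurrence of the slowest stage
        half = stages[i] / 2
        stages[i:i + 1] = [half, half]
    return stages


def pipelined_cycle(stage_ns, reg_delay_ns):
    """Clock period: slowest stage plus pipeline register delay."""
    return max(stage_ns) + reg_delay_ns


before = [6, 4, 8, 10, 4]           # Fetch, Decode, Execute, Memory, Writeback
after = split_slowest(before, 2)    # seven stages, as on the slide
print(after)
print(pipelined_cycle(before, 0.1), pipelined_cycle(after, 0.1))
```

After two splits the slowest remaining stage is Fetch, so splitting further stages other than Fetch would not raise the clock rate.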
Incorrect Execution • Easy, right? Not so fast. • In what cycle does the lw write $s0? • In what cycle does the or read $s0?
Cycle                  1   2   3   4   5   6   7   8
lw  $s0, 0($t4)        IF  ID  EX  MEM WB
or  $s3, $s0, $t3          IF  ID  EX  MEM WB
sw  $s2, 0($t1)                IF  ID  EX  MEM WB
and $s6, $s4, $t3                  IF  ID  EX  MEM WB
Correct, Slow Execution: Data Hazards • ID & WB take only ½ cycle each; all other stages take the full cycle • In what cycle does the lw write $s0? • In what cycle does the or read $s0? • Stall - wasted cycles • lw $s0, 0($t4): IF ID EX MEM WB • or $s3, $s0, $t3: IF IF IF ID EX MEM WB • sw $s2, 0($t1): IF ID EX MEM WB • and $s6, $s4, $t3: IF ID EX MEM WB • Cycles 1 2 3 4 5 6 7 8 Time->
Barriers to pipeline performance • Uneven stages • Pipeline register delays • Data Hazards • ______________________________________
Default: Stall Stall - wasted cycles lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF sw $s2, 0($t1) and $s6, $s4, $t3 1 2 3 4 5 6 7 8 Time->
Solution 1: Data Forwarding In what cycle is $s0 produced in the machine? In what cycle is $s0 used in the machine? Add wires that pass data from the stage in which it is produced to the stage in which it is used. lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF ID sw $s2, 0($t1) IF and $s6, $s4, $t3 1 2 3 4 5 6 7 8 Time->
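The difference between stalling and forwarding can be sketched as a tiny in-order scheduler that reports the cycle in which each instruction is decoded. It assumes a five-stage pipeline (IF ID EX MEM WB), a register file written in the first half of WB and read in the second half of ID, forwarding into the EX stage only, and that every source operand is needed at the start of EX (a simplification for stores). The tuple encoding and the name `id_cycles` are hypothetical.

```python
def id_cycles(instrs, forwarding):
    """instrs: list of (kind, dest, sources); returns the ID cycle of each instruction."""
    ready = {}    # register -> earliest ID cycle at which a consumer may read it
    cycles = []
    prev_id = 1   # first instruction: IF in cycle 1, ID in cycle 2
    for kind, dest, sources in instrs:
        cycle = prev_id + 1                    # in order, at most one decode per cycle
        for reg in sources:
            cycle = max(cycle, ready.get(reg, 0))
        cycles.append(cycle)
        if dest is not None:
            if not forwarding:
                ready[dest] = cycle + 3        # consumer decodes in the producer's WB cycle
            elif kind == "load":
                ready[dest] = cycle + 2        # value available after MEM: one bubble
            else:
                ready[dest] = cycle + 1        # ALU result forwarded from EX to EX
        prev_id = cycle
    return cycles


def stall_cycles(cycles):
    """Wasted cycles compared with decoding one instruction per cycle."""
    return cycles[-1] - (len(cycles) + 1)


program = [
    ("load", "$s0", ["$t4"]),          # lw  $s0, 0($t4)
    ("alu", "$s3", ["$s0", "$t3"]),    # or  $s3, $s0, $t3
    ("store", None, ["$s2", "$t1"]),   # sw  $s2, 0($t1)
    ("alu", "$s6", ["$s4", "$t3"]),    # and $s6, $s4, $t3
]
print(id_cycles(program, forwarding=False))  # [2, 5, 6, 7]
print(id_cycles(program, forwarding=True))   # [2, 4, 5, 6]
```

Under these assumptions the load-use pair costs two stall cycles without forwarding and one with it; an ALU result consumed by the very next instruction would cost none.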
Data Forwarding: Where are those wires? [Figure: pipelined datapath divided into Fetch, Decode, Execute, Memory, and (Writeback) stages by pipeline registers; labeled components: PC, Instruction Memory, Register File (src1, src2, destreg), Sign Ext 16 -> 32, << 2 shifters, Data Memory]
Data Forwarding: Example 2 • Draw the timing diagram with data forwarding • Draw arrows to indicate data passing through forwarding • lw $t0, 0($s0) F D M W addi $t0, $t0, 1 F D add $s2, $s2, $t0 F sw $s2, 0($s0) 1 2 3 4 5 6 7 8 9 10 11 12 Time->
Who reorders instructions? • Static scheduling • Compiler • ___________, but does not know when caches miss or loads/stores are to the same locations (conservative) • Dynamic scheduling • Hardware • _____________________________, but has all knowledge (more accurate)
Solution 2: Instruction Reordering (Before reordering) Stall - wasted cycles lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF IF IF ID MEM WB sw $s2, 0($t1) IF ID MEM WB and $s6, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->
Solution 2: Instruction Reordering (With Instruction Reordering) lw $s0, 0($t4) IF ID MEM WB WB IF ID IF 1 2 3 4 5 6 7 8 Time->
Instruction Reordering - Are these equivalent? (Get the same answer) lw $s0, 0($t4) IF ID MEM WB or $s3, $s0, $t3 IF IF IF ID MEM WB sw $s3, 0($t1) IF ID MEM WB and $s0, $s4, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->
A deeper look lw $s0, 0($t4) IF ID MEM WB WB sw $s3, 0($t1) IF ID MEM WB and $s0, $s4, $t3 IF ID MEM WB or $s3, $s0, $t3 IF ID MEM WB 1 2 3 4 5 6 7 8 Time->
Dependencies And our friends RAW, WAR, and WAW
RAW – Read after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0
WAR - Write after Read add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0
WAW – Write after Write add $t0, $s0, $s1 sub $t2, $t0, $s3 or $s3, $t7, $s2 and $t2, $t7, $s0
Identify all of the dependencies RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8
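The three dependency kinds can be found mechanically by comparing the destination and source registers of every ordered pair of instructions. The sketch below is a simplified pairwise check: it does not notice when an intervening instruction redefines a register, and the tuple encoding of instructions is an assumption made for illustration.

```python
from itertools import combinations


def dependencies(instrs):
    """instrs: list of (dest, sources). Returns (kind, earlier, later, register) tuples."""
    found = []
    for (i, (d1, s1)), (j, (d2, s2)) in combinations(enumerate(instrs), 2):
        if d1 is not None and d1 in s2:
            found.append(("RAW", i, j, d1))   # later reads what earlier writes
        if d2 is not None and d2 in s1:
            found.append(("WAR", i, j, d2))   # later writes what earlier reads
        if d1 is not None and d1 == d2:
            found.append(("WAW", i, j, d1))   # both write the same register
    return found


program = [
    ("$t0", ["$s0", "$s1"]),   # 0: add $t0, $s0, $s1
    ("$t2", ["$t0", "$s3"]),   # 1: sub $t2, $t0, $s3
    ("$s3", ["$t7", "$s2"]),   # 2: or  $s3, $t7, $s2
    ("$t2", ["$t7", "$s0"]),   # 3: and $t2, $t7, $s0
]
for dep in dependencies(program):
    print(dep)
# ('RAW', 0, 1, '$t0')
# ('WAR', 1, 2, '$s3')
# ('WAW', 1, 3, '$t2')
```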
Which dependencies might cause stalls (hazards)? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8 So why do we care about WAW and WAR dependencies at all?!?
Can we reorder the or? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8
No, we can not reorder the or RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB RAW sub $t2, $t0, $s3 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8
Can we reorder the and? RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB 1 2 3 4 5 6 7 8
No, siree. The and stays put. RAW WAR WAW add $t0, $s0, $s1 IF ID MEM WB and $t2, $t7, $s0 IF ID MEM WB IF ID MEM WB sub $t2, $t0, $s3 or $s3, $t7, $s2 IF ID MEM WB 1 2 3 4 5 6 7 8
How to remove WAR, WAW dependencies? add $t0, $s0, $s1 IF ID MEM WB sub $t2, $t0, $s3 IF ID MEM WB or $s3, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 and $t4, $s3, $s5 add $s3, $s4, $s6 IF ID MEM WB 1 2 3 4 5 6 7 8
___________________________ use a different register for that result (and all subsequent uses of that result) add $t0, $s0, $s1 IF ID MEM WB sub ____, $t0, $s3 IF ID MEM WB or ____, $t7, $s2 IF ID MEM WB and $t2, $t7, $s0 and $t4, ___, $s5 add $s3, $s4, $s6 IF ID MEM WB 1 2 3 4 5 6 7 8
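Renaming can be sketched as a single pass that gives every result a fresh register and rewrites later reads to use the newest name. Note that this version renames every destination, closer to what renaming hardware does, whereas the slide renames only the registers involved in WAR and WAW dependencies. The names `p0`, `p1`, ... are hypothetical, and the sketch assumes an unlimited supply of them.

```python
from itertools import count


def rename(instrs, fresh):
    """instrs: list of (op, dest, sources); fresh: iterator of unused register names."""
    current = {}   # original register -> name that holds its latest value
    renamed = []
    for op, dest, sources in instrs:
        new_sources = [current.get(reg, reg) for reg in sources]  # read newest names
        new_dest = dest
        if dest is not None:
            new_dest = next(fresh)        # every result gets its own register
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


program = [
    ("add", "$t0", ["$s0", "$s1"]),
    ("sub", "$t2", ["$t0", "$s3"]),
    ("or", "$s3", ["$t7", "$s2"]),
    ("and", "$t2", ["$t7", "$s0"]),
    ("and", "$t4", ["$s3", "$s5"]),
    ("add", "$s3", ["$s4", "$s6"]),
]
for op, dest, sources in rename(program, (f"p{i}" for i in count())):
    print(op, dest, sources)
```

After renaming, no two instructions write the same register and no instruction writes a register that an earlier one reads, so only the RAW dependencies remain (add to sub, and or to the second and).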
Who renames registers? • Static register renaming • Compiler • Compiler is the one who makes assignments in the first place! • Number of registers limited by ____________ • Dynamic register renaming • Hardware • Can offer more registers – • Number of registers limited by ____________
Dependencies Summary • What is the difference between a hazard and a dependency? • How can we get rid of name dependencies? • What limits this solution?
The Big Picture: Fewer Stages vs. More Stages • More hardware • Smaller • More power • Faster clock • Fewer stalls • Higher throughput
The Big Picture • Throughput (inst / ns) = (inst / cycle) * (cycles / ns)
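The throughput formula can be tried with concrete numbers. The sketch below assumes an in-order pipeline in which the first instruction needs `depth` cycles, each later instruction adds one cycle, and each stall adds one more; the instruction count, stall count, and cycle time used in the example are illustrative inputs, not values from the slides.

```python
def throughput_inst_per_ns(n_instr, stall_cycles, depth, cycle_ns):
    """Throughput = (instructions / cycle) * (cycles / ns)."""
    total_cycles = depth + (n_instr - 1) + stall_cycles
    inst_per_cycle = n_instr / total_cycles
    cycles_per_ns = 1 / cycle_ns
    return inst_per_cycle * cycles_per_ns


# Four instructions, two stall cycles, five stages, 2.1 ns clock
print(throughput_inst_per_ns(4, 2, 5, 2.1))
# A long stall-free run approaches one instruction per cycle
print(throughput_inst_per_ns(1_000_000, 0, 5, 2.1))
```

The two factors pull in opposite directions as stages are added: the clock gets faster (more cycles per ns) while stalls tend to lower instructions per cycle.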
Control Hazard (Incorrect Execution) • In what cycle does the nextPC get calculated for the bne? • In what cycle does the or get fetched?
Cycle                  1   2   3   4   5   6   7   8
add $s5, $s4, $t1      IF  ID  EX  MEM WB
bne $s0, $s1, end          IF  ID  EX  MEM WB
or  $s3, $s0, $t3              IF  ID  EX  MEM WB
end: sw $s2, 0($t1)                IF  ID  EX  MEM WB