CPU Pipelining Issues

What have you been beating your head against? This pipe stuff makes my head hurt! Read Chapter 4.7-4.8

5-Stage miniMIPS • Omits some details • NO bypass or interlock logic [Figure: 5-Stage miniMIPS datapath with stages Instruction Fetch, Register File, ALU, Memory, Write Back. Annotations: the address is available right after the instruction enters the Memory stage; the data is needed just before the rising clock edge at the end of the Write Back stage, so the data memory gets almost 2 clock cycles.]

Pipelining • Improve performance by increasing instruction throughput • Ideal speedup is number of stages in the pipeline. Do we achieve this?
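The ideal-speedup question can be checked with a little arithmetic. The sketch below assumes every stage takes exactly one cycle, there are no hazards or stalls, and the unpipelined machine spends one cycle per stage on each instruction; the function names are illustrative and not from the slides.

```python
def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles to finish n instructions on an ideal pipeline (no stalls)."""
    if n_instructions < 1 or n_stages < 1:
        raise ValueError("need at least one instruction and one stage")
    # The first instruction takes n_stages cycles to drain; every later
    # instruction completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions: int, n_stages: int) -> float:
    """Speedup over a machine that spends n_stages cycles per instruction."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, "instructions:", round(ideal_speedup(n, 5), 2), "x")
```

Under these assumptions the speedup only approaches the stage count for long instruction streams, and any stall pushes it lower.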

Pipelining • What makes it easy? • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • Individual instructions still take the same number of cycles • But we’ve improved the throughput by increasing the number of simultaneously executing instructions

Structural Hazards

Data Hazards • Problem with starting next instruction before first is finished • dependencies that “go backward in time” are data hazards

Software Solution • Have compiler guarantee no hazards • Where do we insert the “nops”? • sub $2, $1, $3 • and $12, $2, $5 • or $13, $6, $2 • add $14, $2, $2 • sw $15, 100($2) • Problem: this really slows us down!
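One way to answer the "where do the nops go" question is to pad mechanically. This is a rough sketch, not the compiler's actual algorithm. It assumes a 5-stage pipeline with no forwarding whose register file writes in the first half of a cycle and reads in the second half, so a reader must sit at least 3 slots after its writer. The tuple encoding and the `min_distance` parameter are made up for illustration.

```python
NOP = "nop"

# Each entry: (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2",  "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None,  ("$15", "$2")),
]


def insert_nops(program, min_distance=3):
    """Pad with nops so every reader is >= min_distance slots after its writer."""
    out = []
    last_writer = {}  # register name -> index in `out` of its latest writer
    for text, dest, srcs in program:
        for reg in srcs:
            if reg in last_writer:
                earliest = last_writer[reg] + min_distance
                while len(out) < earliest:
                    out.append(NOP)
        # $0 is hardwired to zero, so writes to it never create a dependence.
        if dest is not None and dest != "$0":
            last_writer[dest] = len(out)
        out.append(text)
    return out


if __name__ == "__main__":
    for line in insert_nops(PROGRAM):
        print(line)
```

With these assumptions the two nops land between the sub and the and; the later instructions are already far enough from the sub. Setting `min_distance=4` models a register file that cannot write and read the same register in one cycle.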

Forwarding • Use temporary results, don’t wait for them to be written • register file forwarding to handle read/write to same register • ALU forwarding
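The ALU forwarding decision can be sketched as a priority check on the destination registers held in the later pipeline registers. This follows the usual textbook formulation, not the miniMIPS bypass mux control. The `StageReg` fields and the returned labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StageReg:
    reg_write: bool  # does the instruction held here write a register?
    rd: int          # its destination register number


def forward_select(src: int, ex_mem: StageReg, mem_wb: StageReg) -> str:
    """Pick the source of an ALU operand that names register `src`."""
    # Register 0 always reads as zero, so it is never forwarded.
    if src != 0 and ex_mem.reg_write and ex_mem.rd == src:
        return "EX/MEM"   # newest result wins
    if src != 0 and mem_wb.reg_write and mem_wb.rd == src:
        return "MEM/WB"   # older result, still not in the register file
    return "REGFILE"      # no hazard: use the value read in decode


if __name__ == "__main__":
    # sub $2,$1,$3 is one stage ahead of and $12,$2,$5
    print(forward_select(2, StageReg(True, 2), StageReg(False, 0)))
```

Checking EX/MEM first matters: when two in-flight instructions write the same register, the younger one holds the value the reader should see.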

Can't always forward • Load word can still cause a hazard: • an instruction tries to read a register following a load instruction that writes to the same register. • Thus, we need a hazard detection unit to “stall” the instruction
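A hazard detection unit for the load-use case can be sketched as a single comparison. The version below is the common textbook condition, with parameter names chosen here for illustration. It conservatively treats the rt field of the following instruction as a source even when that instruction does not read it.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in decode needs a load result that is not ready."""
    # Only a load in the next stage can cause this hazard, and a load
    # into register 0 produces nothing anyone can depend on.
    if not id_ex_mem_read or id_ex_rt == 0:
        return False
    return id_ex_rt in (if_id_rs, if_id_rt)


if __name__ == "__main__":
    # lw $2, 20($1) followed immediately by and $4, $2, $5
    print(must_stall(True, 2, 2, 5))
```

When the function returns true, the decode and fetch stages hold their contents for one cycle and a bubble moves down the pipeline in place of the stalled instruction.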

Stalling • We can stall the pipeline by keeping an instruction in the same stage

Types of Data Hazards • RAW hazards corresponding to value dependences are most difficult to deal with, since they can never be eliminated • The second instruction is waiting for information produced by the first instruction • WAR and WAW hazards are name dependences • Two instructions happen to use the same register (name), although they don’t have to • Can often be eliminated by renaming, either in software or hardware • Implies the use of additional resources, hence additional cost • Renaming is not always possible: implicit operands such as accumulator, PC, or condition codes cannot be renamed • These hazards don’t cause problems for MIPS pipeline • Relative timing does not change even with pipelined execution, because reads occur early and writes occur late in pipeline
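The three categories can be told apart by comparing register sets. The sketch below only classifies the dependence between two instructions in program order; whether it turns into an actual hazard depends on the pipeline, as the slide notes. The `(dest, sources)` encoding is an assumption made for this example.

```python
def classify(first, second):
    """Dependences of `second` on an earlier `first`.

    Each instruction is (dest register or None, set of source registers).
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 is not None and d1 in s2:
        kinds.add("RAW")  # second reads what first writes (true dependence)
    if d2 is not None and d2 in s1:
        kinds.add("WAR")  # second overwrites a register first still reads
    if d1 is not None and d1 == d2:
        kinds.add("WAW")  # both write the same register
    return kinds


if __name__ == "__main__":
    sub = ("$2", {"$1", "$3"})    # sub $2, $1, $3
    and_ = ("$12", {"$2", "$5"})  # and $12, $2, $5
    print(classify(sub, and_))
```

Renaming the destination of the second instruction removes a WAR or WAW result from this function but can never remove a RAW one.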

Branch Hazards • When we decide to branch, other instructions are in the pipeline! • We are predicting “branch not taken” • need to add hardware for flushing instructions if we are wrong
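The cost of guessing "not taken" can be estimated with a simple CPI model. This is a back-of-the-envelope sketch: it assumes a base CPI of 1, a fixed number of flushed instructions per wrong guess, and no other stalls. The parameter names and the numbers in the usage lines are illustrative, not measurements.

```python
def cpi_predict_not_taken(branch_fraction: float,
                          taken_fraction: float,
                          flush_penalty: int) -> float:
    """Average cycles per instruction when every taken branch is mispredicted."""
    for f in (branch_fraction, taken_fraction):
        if not 0.0 <= f <= 1.0:
            raise ValueError("fractions must lie in [0, 1]")
    # Only taken branches are guessed wrong, and each wrong guess
    # throws away `flush_penalty` instructions already in the pipeline.
    return 1.0 + branch_fraction * taken_fraction * flush_penalty


if __name__ == "__main__":
    # Hypothetical mix: 20% branches, half of them taken.
    print(cpi_predict_not_taken(0.2, 0.5, 1))  # branch resolved early
    print(cpi_predict_not_taken(0.2, 0.5, 3))  # branch resolved late
```

The model shows why resolving branches earlier in the pipeline pays off: the penalty term scales directly with the number of instructions that must be flushed.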

Improving Performance • Try to avoid stalls! E.g., reorder these instructions: • lw $t0, 0($t1) • lw $t2, 4($t1) • sw $t2, 0($t1) • sw $t0, 4($t1) • Add a “branch delay slot” • the next instruction after a branch is always executed • rely on compiler to “fill” the slot with something useful • Superscalar: start more than one instruction in the same cycle
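The reordering exercise can be checked by counting load-use stalls before and after. The sketch assumes forwarding is present, so the only stall is one cycle when a load is immediately followed by an instruction that uses its result, and that a store's data register counts as a use. The tuple encoding is invented for this example.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: a lw directly followed by a user of its result.

    Each instruction is (opcode, dest register or None, source registers).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


ORIGINAL = [
    ("lw", "$t0", ("$t1",)),        # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),        # lw $t2, 4($t1)
    ("sw", None, ("$t2", "$t1")),   # sw $t2, 0($t1)
    ("sw", None, ("$t0", "$t1")),   # sw $t0, 4($t1)
]

# The two stores write different addresses, so they can trade places.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

if __name__ == "__main__":
    print("original stalls:", load_use_stalls(ORIGINAL))
    print("reordered stalls:", load_use_stalls(REORDERED))
```

Swapping the stores puts an independent instruction between the second load and the store that needs its result, which is exactly the gap the stall would otherwise fill.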

Dynamic Scheduling • The hardware performs the “scheduling” • hardware tries to find instructions to execute • out of order execution is possible • speculative execution and dynamic branch prediction • All modern processors are very complicated • Pentium 4: 20 stage pipeline, 6 simultaneous instructions • PowerPC and Pentium: branch history table • Compiler technology important

5-Stage miniMIPS: We wanted a simple, clean pipeline but… • broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic • added A/B bypass muxes to get data before it’s written to the regfile • added CLK EN to freeze the IF/RF stages so we can wait for lw to reach the WB stage [Figure: 5-Stage miniMIPS datapath, revised with the three changes listed here]

Bypass MUX Details • The previous diagram was oversimplified. • The bypass muxes really need to precede the A and B muxes, to provide the correct values for the jump target (JT), write data, and early branch decision logic. [Figure: Register File outputs RD1 and RD2 feed the A Bypass and B Bypass muxes, whose other inputs come from ALU/MEM/WB/PC. The bypassed values drive JT, the BZ comparison, and WDALU (to Mem), and enter the ASEL and BSEL muxes, which also take shamt, ’16’, and SEXT(imm), on the way to the ALU.]

Final 5-Stage miniMIPS • Added branch delay slot and early branch resolution logic to fix a CONTROL hazard • Added lots of bypass paths and detection logic to fix various DATA hazards • Added pipeline interlocks to fix the load delay DATA hazard [Figure: final 5-Stage miniMIPS datapath, with NOP inputs on the instruction path for stalls]

Pipeline Summary (I) • Started with unpipelined implementation – direct execute, 1 cycle/instruction – it had a long cycle time: mem + regs + alu + mem + wb • We ended up with a 5-stage pipelined implementation – increase throughput (3x???) – delayed branch decision (1 cycle) Choose to execute instruction after branch – delayed register writeback (3 cycles) Add bypass paths (6 x 2 = 12) to forward correct value – memory data available only in WB stage Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready

Pipeline Summary (II) • Fallacy #1: Pipelining is easy • Smart people get it wrong all of the time! • Fallacy #2: Pipelining is independent of ISA • Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes). • Fallacy #3: Increasing pipeline stages improves performance • Diminishing returns. Increasing complexity.
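The diminishing returns behind Fallacy #3 show up even in an idealized timing model. The sketch assumes perfectly balanced stages, a fixed per-stage register overhead, and no hazards, all of which flatter the deeper pipeline. The function names and the delay values are illustrative only.

```python
def cycle_time(total_logic_delay: float, n_stages: int,
               latch_overhead: float) -> float:
    """Clock period when the logic is split evenly across n_stages."""
    if n_stages < 1:
        raise ValueError("need at least one stage")
    return total_logic_delay / n_stages + latch_overhead


def speedup(total_logic_delay: float, n_stages: int,
            latch_overhead: float) -> float:
    """Throughput gain relative to a single-stage design."""
    base = cycle_time(total_logic_delay, 1, latch_overhead)
    return base / cycle_time(total_logic_delay, n_stages, latch_overhead)


if __name__ == "__main__":
    # Hypothetical: 10 ns of logic, 1 ns of register overhead per stage.
    for k in (1, 5, 10, 20, 40):
        print(k, "stages:", round(speedup(10.0, k, 1.0), 2), "x")
```

Because the overhead term does not shrink, the speedup in this model is bounded by (logic delay + overhead) / overhead no matter how many stages are added, and real hazards reduce it further.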
