460 likes | 608 Views
Lec 9: Pipelining. Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University. Basic Pipelining. Five stage “RISC” load-store architecture Instruction fetch (IF) get instruction from memory Instruction Decode (ID) translate opcode into control signals and read regs Execute (EX)
E N D
Lec 9: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University
Basic Pipelining Five stage “RISC” load-store architecture • Instruction fetch (IF) • get instruction from memory • Instruction Decode (ID) • translate opcode into control signals and read regs • Execute (EX) • perform ALU operation • Memory (MEM) • Access memory if load/store • Writeback (WB) • update register file Following slides thanks to Sally McKee
Sample Code (Simple) • Assume eight-register machine • Run the following code on a pipelined datapath add 3 1 2 ; reg 3 = reg 1 + reg 2 nand 6 4 5 ; reg 6 = ~(reg 4 & reg 5) lw 4 20 (2) ; reg 4 = Mem[reg2+20] add 5 2 5 ; reg 5 = reg 2 + reg 5 sw 7 12(3) ; Mem[reg3+12] = reg 7 Slides thanks to Sally McKee
+ A L U M U X 1 target PC+1 PC+1 0 R0 R1 regA ALU result R2 Register file regB valA M U X PC Inst mem Data mem instruction R3 ALU result mdata R4 valB R5 R6 M U X data R7 offset dest valB Bits 0-2 dest dest dest Bits 15-17 M U X Bits 21-23 op op op IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
Time Graphs Time: 1 2 3 4 5 6 7 8 9 add nand lw add sw fetch decode execute memory writeback fetch decode execute memory writeback fetch decode execute memory writeback fetch decode execute memory writeback fetch decode execute memory writeback Slides thanks to Sally McKee
Pipelining Recap • Powerful technique for masking latencies • Logically, instructions execute one at a time • Physically, instructions execute in parallel • Instruction level parallelism • Decouples the processor model from the implementation • Interface vs. implementation • BUT dependencies between instructions complicate the implementation
What Can Go Wrong? • Data hazards • register reads occur in stage 2 • register writes occur in stage 5 • could read the wrong value if is about to be written • Control hazards • branch instruction may change the PC in stage 4 • what do we fetch before that?
Handling Data Hazards • Detect and Stall • If hazards exist, stall the processor until they go away • Safe, but not great for performance • Detect and Forward • If hazards exist, fix up the pipeline to get the correct value (if possible) • Most common solution for high performance
Handling Data Hazards III: • Detect • dest of early instrs == source of current • Forward: • New bypass datapaths route computed data to where it is needed • New MUX and control to pick the right data • Beware: Stalling may still be required even in the presence of forwarding
Different type of Hazards • Use register value in subsequent instruction add 3 1 2 nand 5 3 4 or 6 3 1 sub 6 3 1 Slides thanks to Sally McKee
Data Hazards: AL ops add 3 1 2 nand 5 3 4 time add fetch decode execute memory writeback nand fetch decode execute memory writeback If not careful, you read the wrong value of R3
Data Hazards: AL ops add 3 1 2 nand 5 3 4 time add fetch decode execute memory writeback nand fetch decode execute memory writeback If not careful, you read the wrong value of R3
Data Hazards: AL ops add 3 1 2 nand 5 3 4 time add fetch decode execute memory writeback nand fetch decode execute memory writeback If not careful, you read the wrong value of R3
Handling Data Hazards: Forwarding • No point forwarding to decode • Forward to EX stage • From output of ALU and MEM stages
Data Hazards: AL ops add 3 1 2 nand 5 3 4 time add fetch decode execute memory writeback nand fetch decodeexecute memory writeback
Data Hazards: AL ops add 3 1 2 and 8 0 2 nand 5 3 4 time add fetch decode execute memory writeback add fetch decode execute memory writeback nand fetch decode execute memory writeback
DM DM DM Reg Reg Reg Reg Reg Reg IM IM IM IM ALU ALU ALU ALU Forwarding Illustration add $3 I n s t r. O r d e r add DM Reg Reg nand $5 $3
Data Hazards: AL ops add 3 1 2 and 8 0 2 and 8 2 0 nand 5 3 4 time add fetch decode execute memory writeback add fetch decode execute memory writeback add fetch decode execute memory writeback nand fetch decode execute memory writeback Write in first half of cycle, read in second half of cycle
Data Hazards: lw lw 3 12(zero) and 8 0 2 nand 5 3 4 time lw fetch decode execute memory writeback add fetch decode execute memory writeback nand fetch decode execute memory writeback
Data Hazards: lw lw 3 12(zero) and 8 0 2 nand 5 3 4 time lw fetch decode execute memorywriteback add fetch decode execute memory writeback nand fetch decodeexecute memory writeback
Data Hazards: lw lw 3 12(zero) and 8 0 2 nand 5 3 4 time lw fetch decode execute memorywriteback nand fetch decodeexecute memory writeback
DM DM DM Reg Reg Reg Reg Reg Reg IM IM IM ALU ALU ALU A tricky case add 1 1 2 add 11 3 add 1 1 4 I n s t r. O r d e r add 1 1 2 add 11 3 add 1 1 4
Another tricky case add 0 4 1 add 30 4
Handling Data Hazards: Summary • Forward: • New bypass datapaths route computed data to where it is needed • Beware: Stalling may still be required even in the presence of forwarding • For lw • MIPS 2000/3000: a delay slot • MIPS 4000 onwards: stall • But really rely on compiler to treat it like delay slot
Detection • Detection: • Compare regA with previous DestRegs • Compare regB with previous DestRegs • regA and regB going to ALU • Previous destregs • Previously in ALU, Memory and Writeback
+ A L U M U X 1 target PC+1 PC+1 0 R0 R1 regA ALU result R2 Inst mem Register file regB valA M U X PC Data mem instruction R3 ALU result mdata R4 M U X valB R5 R6 M U X data R7 offset dest valB dest dest dest op op op IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
Sample Code Which data hazards do you see? add 3 1 2 nand 5 3 4 add 7 6 3 lw 6 24(3) sw 6 12(2) Slides thanks to Sally McKee
Sample Code Which data hazards do you see? add 3 1 2 nand 5 3 4 add 7 6 3 lw 6 24(3) sw 6 12(2) Slides thanks to Sally McKee
First half of cycle 3 + Hazard A L U fwd fwd fwd M U X 1 2 1 0 R0 3 14 R1 regA 7 R2 Inst mem Register file regB 14 M U X PC Data mem nand 5 3 4 10 R3 3 11 R4 M U X 77 7 R5 data 1 R6 M U X 8 R7 add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
End of cycle 3 + A L U H1 M U X 1 3 2 0 R0 14 R1 regA 7 R2 Inst mem Register file regB 10 M U X PC Data mem add 7 6 3 10 R3 3 21 11 R4 5 M U X 77 11 R5 data 1 R6 M U X 8 R7 nand add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
First half of cycle 4 + New Hazard A L U H1 M U X 1 3 2 0 R0 21 14 R1 regA M U X 3 7 R2 Inst mem Register file regB 10 M U X PC Data mem add 7 6 3 10 R3 3 21 11 11 R4 5 M U X 77 11 R5 data 1 R6 M U X 8 R7 nand add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
End of cycle 4 + A L U H2 H1 M U X 1 4 3 0 R0 14 R1 regA 21 M U X 7 R2 Inst mem Register file regB 1 M U X PC Data mem lw 6 24(3) 10 R3 -2 11 R4 7 5 3 M U X 77 10 R5 data 1 R6 M U X 8 R7 add nand add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ 1 21 A L U H2 H1 First half of cycle 5 M U X 1 4 3 0 R0 3 14 R1 regA 21 M U X 7 R2 Inst mem Register file regB 1 M U X PC Data mem lw 6 24(3) 10 R3 -2 11 R4 7 5 3 M U X 77 10 R5 data 1 R6 M U X 8 R7 add nand add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ A L U H2 H1 End of cycle 5 M U X 1 5 4 0 R0 14 R1 regA -2 M U X 7 R2 Inst mem Register file regB 21 M U X PC Data mem sw 6 12(2) 21 R3 6 22 11 R4 7 5 M U X 77 R5 data 1 R6 M U X 8 R7 24 lw add nand IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
en + en A L U H2 H1 First half of cycle 6 M U X 1 5 4 Hazard 0 R0 6 14 R1 regA -2 M U X 7 R2 Inst mem Register file regB 21 M U X PC Data mem sw 6 12(2) 21 R3 22 11 R4 6 7 5 M U X 77 R5 L 1 R6 M U X data 8 R7 24 lw add nand IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ A L U nop H2 End of cycle 6 M U X 1 5 0 R0 14 R1 regA 22 M U X 7 R2 Inst mem Register file regB M U X PC Data mem sw 6 12(2) 21 R3 31 11 R4 6 7 M U X -2 R5 data 1 R6 M U X 8 R7 lw add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ A L U H2 First half of cycle 7 M U X 1 5 Hazard 0 R0 6 14 R1 regA 22 M U X 7 R2 Inst mem Register file regB M U X PC Data mem sw 6 12(2) 21 R3 31 11 R4 6 7 M U X -2 R5 data 1 R6 M U X 8 R7 nop lw add IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ A L U H3 End of cycle 7 M U X 1 5 0 R0 14 R1 regA M U X 7 R2 Inst mem Register file regB 1 M U X PC Data mem 21 R3 99 11 R4 6 6 M U X -2 7 R5 data 1 R6 M U X 22 R7 12 sw nop lw IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ 99 12 A L U H3 First half of cycle 8 M U X 1 5 0 R0 14 R1 regA M U X 7 R2 Inst mem Register file regB 1 M U X PC Data mem 21 R3 99 11 R4 6 M U X -2 7 R5 data 99 R6 M U X 8 R7 12 sw nop lw IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
+ A L U H3 End of cycle 8 M U X 1 5 0 R0 14 R1 regA M U X 7 R2 Inst mem Register file regB 1 M U X PC Data mem 21 R3 111 11 R4 6 M U X -2 7 R5 data 99 R6 M U X 8 R7 12 sw nop IF/ ID ID/ EX EX/ Mem Mem/ WB Slides thanks to Sally McKee
Control Hazards beqz 3 goToZero nand 5 1 4 add 6 7 8 time beqz fetch decode execute memory writeback nand fetch decode execute memory writeback
Control Hazards for 4-stage pipeline beqz 3 goToZero nand 5 1 4 add 6 7 8 time beqz F/D execute memory writeback nand F/D execute memory writeback
Control Hazards • Stall • Inject NOPs into the pipeline when the next instruction is not known • Pros: simple, clean; Cons: slow • Delay Slots • Tell the programmer that the N instructions after a jump will always be executed, no matter what the outcome of the branch • Pros: The compiler may be able to fill the slots with useful instructions; Cons: breaks abstraction boundary • Speculative Execution • Insert instructions into the pipeline • Replace instructions with NOPs if the branch comes out opposite of what the processor expected • Pros: Clean model, fast; Cons: complex
Control Hazards for 4-stage pipeline beqz 3 goToZero nand 5 1 4 add 6 7 8 goToZero: lui 2 1 time beqz F/D execute memory writeback nand Delay slot! F/D execute memory writeback lui F/D execute memory writeback
Hazard summary • Data hazards • Forward • Stall for lw/sw • Control hazard • Delay slot
Pipelining for Other Circuits • Pipelining can be applied to any combinational logic circuit • What’s the circuit latency at 2ns. per gate? • What’s the throughput? • What is the stage breakdown? Where would you place latches? • What’s the throughput after pipelining?