Computer Architecture and Design – ELEN 350

Computer Architecture and Design – ELEN 350 Part 8 [Some slides adapted from M. Irwin, D. Patterson, D. Garcia, and others]

Review: Single Cycle vs. Multiple Cycle Timing
[Figure: timing diagrams. Single Cycle Implementation: lw occupies Cycle 1 and sw occupies Cycle 2, with the unused tail of the sw cycle marked "Waste". Multiple Cycle Implementation: lw uses Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw uses Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction begins IFetch in Cycle 10.]
• The multicycle clock period is longer than 1/5th of the single cycle clock period because of stage register overhead

How Can We Make It Even Faster?
• Split the multiple instruction cycle into smaller and smaller steps
  • There is a point of diminishing returns where as much time is spent loading the state registers as doing the work
• Start fetching and executing the next instruction before the current one has completed
  • Pipelining: (all?) modern processors are pipelined for performance
• Fetch (and execute) more than one instruction at a time (out-of-order superscalar and VLIW)
• Fetch (and execute) instructions from more than one instruction stream (multithreading)

Pipelining
• Assume 30 min. for each task (wash, dry, fold, store) and that separate tasks use separate hardware and so can be overlapped
[Figure: laundry timelines, not pipelined vs. pipelined]
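The laundry analogy can be turned into a small calculation. The sketch below assumes the slide's four 30-minute tasks; the number of loads and the function names are illustrative choices, not taken from the slides.

```python
TASKS = ["wash", "dry", "fold", "store"]  # the four tasks named on the slide
TASK_MINUTES = 30                         # 30 min. per task, from the slide


def sequential_minutes(loads):
    """Each load finishes all four tasks before the next load starts."""
    return loads * len(TASKS) * TASK_MINUTES


def pipelined_minutes(loads):
    """A new load starts every 30 minutes once the previous one moves on."""
    if loads == 0:
        return 0
    return (len(TASKS) + loads - 1) * TASK_MINUTES


# The load counts here are hypothetical examples.
for loads in (1, 4, 100):
    seq = sequential_minutes(loads)
    pipe = pipelined_minutes(loads)
    print(f"{loads:3d} loads: {seq:5d} min sequential, {pipe:5d} min pipelined")
```

A single load takes the same two hours either way; the gain appears only when several loads overlap.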

A Pipelined MIPS Processor
[Figure: lw, sw, and an R-type instruction on a Cycle 1-8 time axis, each passing through IFetch, Dec, Exec, Mem, WB and each starting one cycle after the previous instruction]
• Start the next instruction before the current one has completed
  • improves throughput: total amount of work done in a given time
  • instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced
• clock cycle (pipeline stage time) is limited by the slowest stage
• for some instructions, some stages are wasted cycles
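To make the throughput and latency claims concrete, here is a minimal sketch. The per-stage latencies are hypothetical values chosen for illustration (they do not come from the slides), and pipeline register overhead is ignored.

```python
# Hypothetical stage latencies in picoseconds (assumed, not from the slides).
STAGE_PS = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}

# Single cycle: the clock must cover every stage of the longest instruction (lw).
single_cycle_period = sum(STAGE_PS.values())

# Pipelined: every stage gets one clock, so the slowest stage sets the period.
pipelined_period = max(STAGE_PS.values())

# Latency of one instruction in the pipeline: one period per stage.
pipelined_latency = len(STAGE_PS) * pipelined_period

print("single-cycle period:", single_cycle_period, "ps")
print("pipelined period:   ", pipelined_period, "ps")
print("pipelined latency:  ", pipelined_latency, "ps")
print("steady-state speedup:", single_cycle_period / pipelined_period)
```

With these assumed numbers the pipelined latency is longer than the single-cycle latency, yet one instruction completes every pipelined period, which is where the throughput gain comes from.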

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing of lw, sw, and an R-type instruction under three implementations. Single Cycle Implementation: lw in Cycle 1 and sw in Cycle 2, with the unused part of the sw cycle marked "Waste". Multiple Cycle Implementation: lw in Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw in Cycles 6-9 (IFetch, Dec, Exec, Mem), R-type beginning IFetch in Cycle 10. Pipeline Implementation: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, overlapped so that a new instruction starts every cycle.]
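The comparison can be sketched as a cycle count for the lw, sw, R-type sequence. The step counts for lw (5) and sw (4) match the figure; the 4 steps for an R-type instruction are the usual multicycle MIPS figure and are an assumption here, since the slide cuts off before the R-type instruction finishes. Function names are illustrative.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]

# Multicycle step counts; the R-type value is assumed (see lead-in).
MULTICYCLE_STEPS = {"lw": 5, "sw": 4, "R-type": 4}


def single_cycle_cycles(program):
    """One (long) clock cycle per instruction."""
    return len(program)


def multicycle_cycles(program):
    """Each instruction runs alone and uses only the steps it needs."""
    return sum(MULTICYCLE_STEPS[kind] for kind in program)


def pipelined_cycles(program):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return len(STAGES) + len(program) - 1 if program else 0


program = ["lw", "sw", "R-type"]
print("single cycle:", single_cycle_cycles(program), "long cycles")
print("multicycle:  ", multicycle_cycles(program), "short cycles")
print("pipelined:   ", pipelined_cycles(program), "short cycles")
```

The cycle counts are not directly comparable across rows because the single cycle clock period is several times longer than the other two.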

MIPS Pipeline Datapath Modifications
• What do we need to add/modify in our MIPS datapath?
  • State registers between each pipeline stage to isolate them
[Figure: MIPS datapath divided into five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB state registers, driven by the System Clock]

MIPS Pipeline Control Path Modifications
• All control signals can be determined during the Instruction Decode step
  • and held in the state registers between pipeline stages
[Figure: pipelined datapath with a Control unit added and the state registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Pipelining the MIPS ISA
• What makes it easy
  • all instructions have the same length (32 bits)
    • can fetch in the 1st stage and decode in the 2nd stage
  • few instruction formats (three) with symmetry across formats
    • can begin reading register file in 2nd stage
  • memory operations can occur only in loads and stores
    • can use the execute stage to calculate memory addresses
  • each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
• What makes it hard
  • structural hazards: what if we had only one memory?
  • control hazards: what about branches?
  • data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

Pipelined Datapath
• Pipeline registers wide enough to hold data coming in
[Figure: pipelined datapath with pipeline register widths IF/ID: 64 bits, ID/EX: 128 bits, EX/MEM: 97 bits, MEM/WB: 64 bits]

Pipelined Example
• Consider the following instruction sequence:
    lw  $t0, 10($t1)
    sw  $t3, 20($t4)
    add $t5, $t6, $t7
    sub $t8, $t9, $t10

Single-Clock-Cycle Diagram: Clock Cycle 1 (IF: LW)

Single-Clock-Cycle Diagram: Clock Cycle 2 (IF: SW, ID: LW)

Single-Clock-Cycle Diagram: Clock Cycle 3 (IF: ADD, ID: SW, EX: LW)

Single-Clock-Cycle Diagram: Clock Cycle 4 (IF: SUB, ID: ADD, EX: SW, MEM: LW)

Single-Clock-Cycle Diagram: Clock Cycle 5 (ID: SUB, EX: ADD, MEM: SW, WB: LW)

Single-Clock-Cycle Diagram: Clock Cycle 6 (EX: SUB, MEM: ADD, WB: SW)

Single-Clock-Cycle Diagram: Clock Cycle 7 (MEM: SUB, WB: ADD)

Single-Clock-Cycle Diagram: Clock Cycle 8 (WB: SUB)
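The eight single-clock-cycle diagrams can be reproduced with a short script. This is a minimal sketch that assumes an ideal five-stage pipeline with no stalls; the function and variable names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = ["lw", "sw", "add", "sub"]  # the example sequence, in program order


def occupancy(program, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction."""
    total_cycles = len(stages) + len(program) - 1
    table = []
    for cycle in range(1, total_cycles + 1):
        row = {}
        for depth, stage in enumerate(stages):
            index = cycle - 1 - depth  # which instruction sits in this stage
            if 0 <= index < len(program):
                row[stage] = program[index]
        table.append(row)
    return table


for cycle, row in enumerate(occupancy(PROGRAM), start=1):
    cells = " ".join(f"{stage}:{row.get(stage, '-'):<4}" for stage in STAGES)
    print(f"cycle {cycle}: {cells}")
```

Each printed row corresponds to one of the clock-cycle snapshots, reading the stages left to right from IF to WB.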

Pipelined Datapath with Control I
[Figure: pipelined datapath using the same control signals as the single-cycle datapath]

Pipeline Control Implementation
• Pass control signals along just like the data: extend each pipeline register to hold the control bits needed by succeeding stages
• Note: The 6-bit funct field of the instruction, required in the EX stage to generate ALU control, can be retrieved as the 6 least significant bits of the immediate field, which is sign-extended and passed from the IF/ID register to the ID/EX register
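The note about the funct field can be checked with a few lines of bit manipulation. This is a sketch using the standard MIPS R-type field layout; the helper names are made up for illustration.

```python
def encode_r_type(rs, rt, rd, funct, shamt=0):
    """Build a 32-bit R-type instruction word (opcode field is 0)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def sign_extend16(value):
    """Sign-extend a 16-bit value to 32 bits (as an unsigned Python int)."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if (value & 0x8000) else value


def funct_from_extended_immediate(instruction):
    """Recover funct from the sign-extended low 16 bits of the instruction."""
    extended = sign_extend16(instruction & 0xFFFF)
    return extended & 0x3F  # the 6 least significant bits


SUB_FUNCT = 0x22  # funct code for sub

# rd = 11 leaves bit 15 clear; rd = 17 sets it, so the extension is all ones.
for rd in (11, 17):
    word = encode_r_type(rs=2, rt=3, rd=rd, funct=SUB_FUNCT)
    print(f"rd={rd}: word=0x{word:08X}, funct=0x{funct_from_extended_immediate(word):02X}")
```

Sign extension only changes bits 16 and above, so the low 6 bits survive whether or not bit 15 of the immediate field is set.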

Pipelined Datapath with Control II
[Figure: pipelined datapath in which the control signals emanate from the control portions of the pipeline registers]

Pipelined Execution and Control
• Instruction sequence:
    lw  $10, 20($1)
    sub $11, $2, $3
    and $12, $4, $7
    or  $13, $6, $7
    add $14, $8, $9
• Label “before<i>” means the ith instruction before lw
[Figure: datapath snapshot, clock cycle 1]

Pipelined Execution and Control
[Figure: datapath snapshot, clock cycle 2, for the sequence lw $10, 20($1); sub $11, $2, $3; and $12, $4, $7; or $13, $6, $7; add $14, $8, $9]

Can Pipelining Get Us Into Trouble?
• Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource by two different instructions at the same time
  • data hazards: attempt to use data before it is ready
    • An instruction’s source operand(s) are produced by a prior instruction still in the pipeline
  • control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
    • branch and jump instructions, exceptions
• Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • and take action to resolve hazards

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
• Can help with answering questions like:
  • How many cycles does it take to execute this code?
  • What is the ALU doing during cycle 4?
  • Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!
[Figure: Inst 0 through Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg against time in clock cycles, with the initial cycles labeled "Time to fill the pipeline"]
• Once the pipeline is full, one instruction is completed every cycle (CPI = 1)
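How quickly the average CPI approaches 1 depends on how long the pipeline stays full. A minimal sketch, assuming a five-stage pipeline with no stalls (function names are illustrative):

```python
def pipelined_cycles(instructions, stages=5):
    """Cycles to run a stall-free instruction stream: fill time plus one per instruction."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def effective_cpi(instructions, stages=5):
    """Average cycles per instruction, including the time to fill the pipeline."""
    return pipelined_cycles(instructions, stages) / instructions


for n in (1, 5, 100, 1_000_000):
    print(f"{n:>9} instructions: {pipelined_cycles(n):>9} cycles, CPI = {effective_cpi(n):.6f}")
```

For the five instructions in the figure the fill time still dominates (9 cycles, CPI 1.8); for long instruction streams the fill time is amortized and the CPI tends to 1.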

A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 through Inst 4 in instruction order, each drawn as Mem, Reg, ALU, Mem, Reg against time in clock cycles; lw is "Reading data from memory" in the same cycle in which a later instruction is "Reading instruction from memory"]
• Can fix with separate instruction and data memories
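The conflict can be located by listing which instructions touch memory in each cycle. This sketch assumes a stall-free five-stage pipeline in which instruction i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4; the add instructions stand in for the unnamed Inst 1 through Inst 4 and are placeholders.

```python
from collections import defaultdict

USES_DATA_MEMORY = {"lw", "sw"}  # only loads and stores access memory in MEM


def memory_accesses_by_cycle(program):
    """Map each clock cycle to the (instruction index, stage) pairs using memory."""
    accesses = defaultdict(list)
    for index, op in enumerate(program):
        accesses[index + 1].append((index, "IF"))       # instruction fetch
        if op in USES_DATA_MEMORY:
            accesses[index + 4].append((index, "MEM"))  # data access
    return accesses


def structural_hazards(program):
    """Cycles in which a single shared memory would be needed twice."""
    accesses = memory_accesses_by_cycle(program)
    return {cycle: users for cycle, users in sorted(accesses.items()) if len(users) > 1}


program = ["lw", "add", "add", "add", "add"]  # lw, then placeholder Inst 1..4
print(structural_hazards(program))
```

With separate instruction and data memories the two accesses in the reported cycle go to different units, so the conflict disappears.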

How About Register File Access?
[Figure: the instructions below drawn as IM, Reg, ALU, DM, Reg against time in clock cycles; one clock edge controls register writing and the other controls loading of the pipeline state registers]
    add $1,
    Inst 1
    Inst 2
    add $2,$1,
• Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Register Usage Can Cause Data Hazards
[Figure: the instructions below drawn as IM, Reg, ALU, DM, Reg against time in clock cycles]
• Dependencies backward in time cause hazards
    add $1,$2,$5
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
• Read before write data hazard
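Which of the dependent instructions actually read a stale value can be worked out from instruction distances. The sketch below assumes a five-stage pipeline without forwarding, in which the register file is written in the first half of a cycle and read in the second half, so a consumer three or more instructions behind the producer is safe. The tuple layout and names are illustrative.

```python
# (mnemonic, destination register, source registers)
PROGRAM = [
    ("add", "$1", ("$2", "$5")),
    ("sub", "$4", ("$1", "$5")),
    ("and", "$6", ("$1", "$7")),
    ("or",  "$8", ("$1", "$9")),
    ("xor", "$4", ("$1", "$5")),
]

# Producer writes in WB (cycle i + 5); consumer reads in ID (cycle j + 2).
# Write-then-read in the same cycle works, so j - i >= 3 is safe.
SAFE_DISTANCE = 3


def raw_hazards(program):
    """List (producer, consumer, register) triples that read before the write."""
    hazards = []
    last_writer = {}  # register -> index of the most recent instruction writing it
    for j, (op, dest, sources) in enumerate(program):
        for reg in sources:
            i = last_writer.get(reg)
            if i is not None and j - i < SAFE_DISTANCE:
                hazards.append((program[i][0], op, reg))
        if dest != "$0":  # $0 is hardwired to zero and never really written
            last_writer[dest] = j
    return hazards


for producer, consumer, reg in raw_hazards(PROGRAM):
    print(f"{consumer} reads {reg} before {producer} has written it")
```

Under these assumptions only sub and and are affected; or reads $1 in the same cycle that add writes it, and xor reads it one cycle later.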

Loads Can Cause Data Hazards
[Figure: the instructions below drawn as IM, Reg, ALU, DM, Reg against time in clock cycles]
• Dependencies backward in time cause hazards
    lw  $1,4($2)
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
• Load-use data hazard

One Way to “Fix” a Data Hazard
[Figure: add $1, followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7, each drawn as IM, Reg, ALU, DM, Reg]
• Can fix data hazard by waiting (stall)

Another Way to “Fix” a Data Hazard
[Figure: the instructions below drawn as IM, Reg, ALU, DM, Reg against time in clock cycles]
    add $1,
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
• Fix data hazards by forwarding results as soon as they are available to where they are needed

Forwarding Hardware with Control
[Figure: datapath with forwarding hardware and control wires; certain details, e.g., branching hardware, are omitted to simplify the drawing]
• Note: so far we have only handled forwarding to R-type instructions!
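The decision made by a forwarding unit can be sketched as a small function. This follows the common textbook formulation for a five-stage MIPS pipeline rather than the exact wiring in the drawing; the signal names and the 10/01/00 mux encodings are assumptions.

```python
NO_FORWARD = 0b00    # use the value read from the register file
FROM_MEM_WB = 0b01   # forward the value being written back
FROM_EX_MEM = 0b10   # forward the previous ALU result


def forward_select(id_ex_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Pick the ALU operand source for one source register of the EX-stage instruction."""
    # The EX/MEM check comes first so the most recent result wins.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return FROM_EX_MEM
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return FROM_MEM_WB
    return NO_FORWARD


# sub $4,$1,$5 in EX while add $1,... is one stage ahead (in MEM):
print(forward_select(1, True, 1, False, 0))  # rs = $1 comes from EX/MEM
# and $6,$1,$7 in EX while sub is in MEM and add is in WB:
print(forward_select(1, True, 4, True, 1))   # rs = $1 comes from MEM/WB
```

The function is called once for Rs and once for Rt of the instruction in EX; register $0 is never forwarded because it always reads as zero.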

Forwarding with Load-use Data Hazards
[Figure: the instructions below drawn as IM, Reg, ALU, DM, Reg against time in clock cycles]
    lw  $1,4($2)
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
• Will still need one stall cycle even with forwarding

Adding the Hazard Hardware
[Figure: pipelined datapath with a Hazard Unit and a Forward Unit added; the Hazard Unit takes ID/EX.MemRead and ID/EX.RegisterRt as inputs and drives a mux that selects between the Control outputs and 0]

Stall Hardware
• Along with the Hazard Unit, we have to implement the stall
• Prevent the instructions in the IF and ID stages from progressing down the pipeline, done by preventing the PC register and the IF/ID pipeline register from changing
  • Hazard Detection Unit controls the writing of the PC (PC.write) and IF/ID (IF/ID.write) registers
• Insert a “bubble” between the lw instruction (in the EX stage) and the load-use instruction (in the ID stage) (i.e., insert a noop in the execution stream)
  • Set the control bits in the EX, MEM, and WB control fields of the ID/EX pipeline register to 0 (noop). The Hazard Unit controls the mux that chooses between the real control values and the 0’s.
• Let the lw instruction and the instructions after it in the pipeline (before it in the code) proceed normally down the pipeline
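The stall logic described above can be sketched as a function of a few pipeline register fields. ID/EX.MemRead, ID/EX.RegisterRt, PC.write, and IF/ID.write appear in the slides; the IF/ID.RegisterRs and IF/ID.RegisterRt inputs and the output key for zeroing the control bits follow the usual textbook formulation and are assumptions here.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_unit(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return the control outputs for one cycle (output names partly assumed)."""
    stall = load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt)
    return {
        "PC.write": not stall,        # freeze the PC during a stall
        "IF/ID.write": not stall,     # keep the load-use instruction in ID
        "zero_id_ex_control": stall,  # insert the bubble (noop) behind the lw
    }


# lw $1,4($2) in EX (rt = 1) and sub $4,$1,$5 in ID (rs = 1, rt = 5):
print(hazard_unit(id_ex_memread=True, id_ex_rt=1, if_id_rs=1, if_id_rt=5))
# Same registers, but the instruction in EX is not a load:
print(hazard_unit(id_ex_memread=False, id_ex_rt=1, if_id_rs=1, if_id_rt=5))
```

This check is conservative: it also stalls when the instruction in ID does not actually read its rt field, which costs a cycle but never produces a wrong result.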
