
Advanced Computer Architecture


Presentation Transcript


  1. Advanced Computer Architecture Syed Ali Raza Lecturer GC University, Lahore Pipelining

  2. What is Pipelining? • A way of speeding up execution of instructions • Key idea: overlap execution of multiple instructions

  3. The Laundry Analogy • A, B, C, D each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutes to put clothes into drawers

  4. If We Do Laundry Sequentially... [Timeline figure: loads A, B, C, D run back-to-back from 6 PM to 2 AM, each spending 30 minutes in each of the four stages] • Time required: 8 hours for 4 loads

  5. To Pipeline, We Overlap Tasks [Timeline figure: loads A, B, C, D overlap, a new load starting every 30 minutes] • Time required: 3.5 hours for 4 loads

  6. To Pipeline, We Overlap Tasks • Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload • Pipeline rate is limited by the slowest pipeline stage • Multiple tasks operate simultaneously • Potential speedup = number of pipe stages • Unbalanced lengths of pipe stages reduce speedup • Time to “fill” the pipeline and time to “drain” it reduces speedup
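The timing arithmetic behind the laundry slides can be sketched in a few lines. This is a minimal sketch, not anything from the deck; the function names are mine, and it assumes the idealized model on the slides (no fill/drain overhead beyond the first pass through the stages):

```python
def sequential_time(n_tasks, stage_times):
    # Each task runs through all stages to completion before the next starts.
    return n_tasks * sum(stage_times)

def pipelined_time(n_tasks, stage_times):
    # Fill the pipe once, then one task finishes per slowest-stage period:
    # the pipeline rate is limited by the slowest stage.
    slowest = max(stage_times)
    return sum(stage_times) + (n_tasks - 1) * slowest

stages = [30, 30, 30, 30]  # wash, dry, fold, stash (minutes)
print(sequential_time(4, stages))  # 480 minutes = 8 hours
print(pipelined_time(4, stages))   # 210 minutes = 3.5 hours
```

With balanced 30-minute stages this reproduces the slide numbers exactly; with unbalanced stages, `max(stage_times)` is what caps the speedup.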

  7. Pipelining a Digital System • Key idea: break a big computation up into pieces • Separate each piece with a pipeline register [Figure: a 1 ns combinational block split into five 200 ps stages separated by pipeline registers] • 1 nanosecond = 10^-9 second, 1 picosecond = 10^-12 second

  8. Pipelining a Digital System • Why do this? Because it’s faster for repeated computations • Non-pipelined: 1 operation finishes every 1 ns • Pipelined: 1 operation finishes every 200 ps

  9. Comments about Pipelining • Pipelining increases throughput, but not latency • An answer is available every 200 ps, BUT • A single computation still takes 1 ns • Limitations: • The computation must be divisible into equal-sized stages • Pipeline registers add overhead
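The two limitations on this slide are easy to quantify. The sketch below is mine, not from the lecture: the clock period of a pipeline is the slowest stage plus the pipeline-register delay, and the 20 ps register overhead used here is an assumed illustrative figure:

```python
def pipeline_cycle_ps(stage_ps, reg_overhead_ps):
    # The clock period is set by the slowest stage plus the delay
    # added by the pipeline register between stages.
    return max(stage_ps) + reg_overhead_ps

print(pipeline_cycle_ps([200] * 5, 0))    # 200 ps: ideal balanced stages
print(pipeline_cycle_ps([200] * 5, 20))   # 220 ps: register overhead (assumed 20 ps)
print(pipeline_cycle_ps([250, 150, 200, 200, 200], 20))  # 270 ps: unbalanced stages
```

Both effects shave the real speedup below the ideal "number of stages" figure: registers add a constant tax, and unbalanced stages waste time in every stage that is faster than the slowest one.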

  10. Pipelining a Processor • If we divide the instruction cycle into 5 steps: 1. Instruction Fetch (IF) 2. Instruction Decode and Register Read (ID) 3. Execute operation or calculate address (EX) 4. Memory access (MEM) 5. Write result into register (WB) • Review: Single-Cycle Processor • All 5 steps done in a single clock cycle • Dedicated hardware required for each step
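The classic pipeline diagrams that follow can be generated mechanically: in an ideal pipeline with no stalls, instruction i occupies stage s during cycle i + s. This helper is a sketch of my own for reproducing those diagrams, not part of the lecture:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instr):
    # Row i shows instruction i; it enters IF in cycle i (0-indexed)
    # and occupies stage s during cycle i + s.
    rows = []
    for i in range(n_instr):
        rows.append(["  "] * i + STAGES + ["  "] * (n_instr - 1 - i))
    return rows

for row in pipeline_diagram(3):
    print(" ".join(f"{c:>3}" for c in row))
```

For three instructions this prints the staircase pattern of the MIPS diagrams: each instruction starts one cycle after the previous one, and all five stages are busy once the pipe is full.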

  11. Review - Single-Cycle Processor • What do we need to add to actually split the datapath into stages?

  12. Basic Pipeline for MIPS [Pipeline diagram: four instructions in program order, each passing through Ifetch, Reg, ALU, DMem, Reg across cycles 1 to 7] • What do we need to add to actually split the datapath into stages?

  13. Basic Pipelined Processor

  14. Pipeline example: lw, IF stage

  15. Pipeline example: lw, ID stage

  16. Pipeline example: lw, EX stage

  17. Pipeline example: lw, MEM stage

  18. Pipeline example: lw, WB stage • Can you find a problem?

  19. Basic Pipelined Processor (Corrected)

  20. Comments about Pipelining • The good news • Multiple instructions are being processed at the same time • This works because the stages are isolated by registers • The bad news • Instructions interfere with each other: hazards • Example: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle • Example: an instruction may require a result produced by an earlier instruction that is not yet complete

  21. Pipeline Hazards • Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle • Structural hazards: two different instructions use the same hardware in the same cycle • Data hazards: an instruction depends on the result of a prior instruction still in the pipeline • Control hazards: pipelining of branches and other instructions that change the PC

  22. Structural Hazards • Attempt to use the same resource by two or more instructions at the same time • Example: a single memory for both instructions and data • Accessed by the IF stage • Accessed at the same time by the MEM stage

  23. Structural Hazards [Pipeline diagram, CC1 to CC9: LOAD followed by Instr 1 through Instr 4; the LOAD’s MEM stage and Instr 3’s instruction fetch access memory in the same clock cycle] • Operation on memory by 2 different instructions in the same clock cycle

  24. Resolving Structural Hazards • Problem • Attempt to use the same hardware resource • By two different instructions during the same cycle • Solution 1: Wait • Must detect the hazard • Must have a mechanism to delay the instruction’s access to the resource • Serious: the hazard cannot be ignored • Solution 2: Redesign the pipeline • Add more hardware to eliminate the structural hazard • In our example: use two memories with two memory ports • Instruction memory • Data memory • Both can be implemented as caches
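The single-memory conflict on the previous slide follows a fixed pattern: with 0-indexed stages, instruction i performs MEM in cycle i + 3 while instruction j performs IF in cycle j, so they collide exactly when j = i + 3. A detection sketch of my own, assuming one shared memory and no stalls:

```python
def memory_conflicts(n_instr, uses_mem):
    # uses_mem[i] is True if instruction i accesses data memory (load/store).
    # With a single shared memory, instruction i's MEM (cycle i + 3) collides
    # with instruction (i + 3)'s IF, which happens in the same cycle.
    conflicts = []
    for i in range(n_instr):
        j = i + 3
        if uses_mem[i] and j < n_instr:
            conflicts.append((i, j))
    return conflicts

# A load at position 0 conflicts with the fetch of instruction 3 (slide 23).
print(memory_conflicts(5, [True, False, False, False, False]))  # [(0, 3)]
```

This is exactly why Solution 2 splits instruction and data memory: the colliding accesses go to different structures, so the check above never fires.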

  25. Data Hazard on Registers [Pipeline diagram, CC1 to CC9: ADD R1,R2,R3 followed by SUB R4,R1,R3, AND R6,R1,R7, OR R8,R1,R9, XOR R10,R11,R1; the ADD writes R1 only in CC5, but the next three instructions try to read R1 before then]

  26. Data Hazard on Registers [Figure: within one clock cycle, store into register Ri in the first half, read from Ri in the second half] • Registers can be made to read and write in the same cycle: data is stored in the first half of the clock cycle, and that data can be read in the second half of the same clock cycle

  27. Data Hazard on Registers [Pipeline diagram, CC1 to CC9: same instruction sequence as slide 25; even with split-cycle register access, SUB R4,R1,R3 needs to stall 2 cycles before it can read the R1 written by ADD R1,R2,R3]
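The "stall 2 cycles" figure generalizes. With 0-indexed stages, a producer writes its register in WB during cycle i + 4 and a consumer reads in ID during cycle j + 1; the split-cycle register file from slide 26 lets a read in the second half-cycle see a write from the first half, so the consumer only needs j + 1 >= i + 4, i.e. a distance of at least 3 instructions. A small sketch of mine, assuming no forwarding:

```python
def stalls_without_forwarding(dist):
    # dist = how many instructions after the producer the consumer sits.
    # Producer writes in WB (cycle i + 4); consumer reads in ID (cycle j + 1);
    # split-cycle register access means j + 1 >= i + 4 suffices.
    return max(0, 3 - dist)

print(stalls_without_forwarding(1))  # 2: SUB immediately after ADD (slide 27)
print(stalls_without_forwarding(2))  # 1: AND two slots after ADD
print(stalls_without_forwarding(3))  # 0: OR reads R1 safely
```

This matches the diagram: SUB stalls twice, AND once, and OR onward need no stall at all.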

  28. Read After Write • Instri followed by Instrj • Read After Write (RAW): Instrj tries to read an operand before Instri writes it • Instri: LW R1, 0(R2) • Instrj: SUB R4, R1, R5

  29. Write After Read • Instri followed by Instrj • Write After Read (WAR): Instrj tries to write an operand before Instri reads it • Instri: ADD R1, R2, R3 • Instrj: LW R2, 0(R5) • Can’t happen in the DLX 5-stage pipeline because: • All instructions take 5 stages, • Reads are always in stage 2, and • Writes are always in stage 5

  30. Write After Write • Instri followed by Instrj • Write After Write (WAW): Instrj tries to write an operand before Instri writes it • Leaves the wrong result (Instri’s, not Instrj’s) • Instri: LW R1, 0(R2) • Instrj: LW R1, 0(R3) • Can’t happen in the DLX 5-stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • We will see WAR and WAW in later, more complicated pipes
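The three hazard classes from slides 28 through 30 reduce to set intersections on register names. A classifier sketch of my own, assuming each instruction is summarized by the sets of registers it reads and writes:

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    # Instruction i comes first in program order, j second.
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")  # j reads what i writes (true dependence)
    if i_reads & j_writes:
        hazards.append("WAR")  # j writes what i still needs to read
    if i_writes & j_writes:
        hazards.append("WAW")  # both write the same register
    return hazards

# Slide 28: LW R1,0(R2) then SUB R4,R1,R5 is a RAW hazard on R1.
print(classify_hazards({"R2"}, {"R1"}, {"R1", "R5"}, {"R4"}))  # ['RAW']
# Slide 30: LW R1,0(R2) then LW R1,0(R3) is a WAW hazard on R1.
print(classify_hazards({"R2"}, {"R1"}, {"R3"}, {"R1"}))  # ['WAW']
```

As the slides note, only RAW can actually bite in the DLX 5-stage pipeline; WAR and WAW are ruled out by the fixed read stage (2) and write stage (5).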

  31. Forwarding to Avoid Data Hazards [Pipeline diagram, CC1 to CC9: ADD R1,R2,R3 followed by SUB R4,R1,R3, AND R6,R1,R7, OR R8,R1,R9, XOR R10,R11,R1; the ALU result of ADD is forwarded directly into the EX stages of the dependent instructions, removing the stalls]
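Forwarding changes the stall arithmetic from the earlier sketch. In the standard five-stage pipeline (an assumption here; this excerpt only shows the ALU case), an ALU result can be forwarded from the EX/MEM latch straight into the next instruction's EX, costing nothing, while a load's result only exists after MEM, so a dependent instruction in the very next slot still stalls once:

```python
def stalls_with_forwarding(producer_is_load, dist):
    # dist = how many instructions after the producer the consumer sits.
    # ALU results forward EX-to-EX with no stall; a load-use pair at
    # distance 1 still needs one stall (result ready only after MEM).
    if producer_is_load and dist == 1:
        return 1
    return 0

print(stalls_with_forwarding(False, 1))  # 0: ADD then SUB R4,R1,R3 (slide 31)
print(stalls_with_forwarding(True, 1))   # 1: LW then an immediate use
```

Compare with `stalls_without_forwarding`: forwarding turns the 2-stall and 1-stall cases of slide 27 into zero stalls for ALU producers.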

  32. Control Hazards on Branches [Pipeline diagram: program execution order 40 BEQ R1,R3,36 / 44 AND R12,R2,R5 / 48 OR R13,R6,R2 / 52 ADD R14,R2,R2 / 80 LD R4,R7,100; the branch target becomes available only late in the branch’s execution, so the three instructions after the branch shouldn’t be executed when the branch condition is true] • Branch delay = 3 cycles

  33. Control Hazards on Branches • Branch instruction: IF ID EX MEM WB • Branch successor: IF ID EX MEM • Branch successor + 1: IF ID EX • Branch successor + 2: IF ID • During IF we don’t yet know the instruction being executed is a branch, so we fetch the branch successor • After decode we know it is a branch, but we must stall until the branch target address is known • Result: 3 wasted clock cycles for a TAKEN branch
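The cost of those 3 wasted cycles shows up in the average cycles per instruction. A back-of-the-envelope sketch of mine; the 20% branch frequency and 60% taken rate are assumed illustrative numbers, not figures from the lecture:

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, penalty):
    # Every taken branch wastes `penalty` cycles while the pipeline refills.
    return base_cpi + branch_fraction * taken_fraction * penalty

# Assumed mix: 20% branches, 60% of them taken, 3-cycle penalty (slide 33).
print(round(effective_cpi(1.0, 0.20, 0.60, 3), 2))  # 1.36
```

Even modest branch frequencies inflate the CPI noticeably, which is what motivates the alternatives on the next slide.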

  34. Branch Hazard Alternatives • Stall until branch direction is clear • Predict branch NOT TAKEN • Predict branch TAKEN • Delayed branch
