RISC-V Architecture and Pipelining Concepts

Computer Architecture A Quantitative Approach, Fifth Edition Appendix C Pipelining: Basic and Intermediate Concepts Advanced Computer Architecture 2019 @ Utsunomiya University

Outline • A “Typical” RISC ISA (RISC-V) • 5 stage pipelining • Structural Hazards • Data Hazards & Forwarding • Branch Hazards • Handling Exceptions • Floating Point Operations Advanced Computer Architecture 2019 @ Utsunomiya University

A "Typical" RISC ISA • 32-bit fixed format instruction • 32 32-bit GPR (x0 contains zero) • 3-address, reg-reg arithmetic instruction • Single addressing mode for load/store: base + displacement • Simple branch conditions • (Delayed branch) see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3 Advanced Computer Architecture 2019 @ Utsunomiya University

Summary of RISC-V ISA (RV32I) x0 x1 ° ° ° x31 0 Programmable storage 2^32 x bytes 31 x 32-bit GPRs (x0 always zero) PC 32-bit instructions on word boundary PC Arithmetic logical add, sub, and, or, xor, slt, sltu, addi, slti, sltiu, andi, ori, xori, lui, auipc sll, srl, sra, slli, srli, srai Memory Access lb, lbu, lh, lhu, lw sb, sh, sw Control jal, jalr beq, bne, blt, bge, bltu, bgeu Advanced Computer Architecture 2019 @ Utsunomiya University

Instruction Format (RV32I) - Location of each register operand field is fixed - MSB indicates sign of immediate value Advanced Computer Architecture 2019 @ Utsunomiya University

signals Datapath vs Control Datapath Controller • Datapath: Storage, FU, interconnect sufficient to perform the desired functions • Inputs are Control Points • Outputs are signals • Controller: State machine to orchestrate operation on the data path • Based on desired function and signals Control Points Advanced Computer Architecture 2019 @ Utsunomiya University

Adder 4 Address Inst ALU Adder 5 Steps of RISC-V Datapath Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next SEQ PC Next PC MUX Next SEQ PC RS1 Reg File z Memory RS2 Data Memory RD MUX MUX Sign Extend WB Data Imm Advanced Computer Architecture 2019 @ Utsunomiya University

Inst. Set Processor Controller IR <= mem[PC]; PC <= PC + 4 Ifetch A <= Reg[IRrs1]; B <= Reg[IRrs2] opFetch Jump (and link) Branch LD RR RI r <= A + IRimm r <= PC PC <= PC +IRimma if bop(A,b) PC <= PC + IRimm r <= A opIRopIRimm r <= A opIRopB WB <= Mem[r] WB <= r WB <= r WB <= r Reg[IRrd] <= WB Reg[IRrd] <= WB Reg[IRrd] <= WB Reg[IRrd] <= WB Advanced Computer Architecture 2019 @ Utsunomiya University

MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Adder 5 Steps of RISC-V Datapath & Stage Registers Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next SEQ PC Next PC MUX Next SEQ PC RS1 Reg File z Memory RS2 Data Memory MUX MUX Sign Extend WB Data Imm RD RD RD Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Visualizing Pipelining Time (clock cycles) I n s t r. O r d e r Advanced Computer Architecture 2019 @ Utsunomiya University

Pipelining is not quite that easy! • Limits to pipelining:Hazards prevent next instruction from executing during its designated clock cycle • Structural hazards: A required resource is busy • Data hazards: Instruction depends on the result of prior instruction still in the pipeline • Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU ALU ALU One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load DMem Instr 1 Instr 2 Instr 3 Ifetch Instr 4 Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem ALU ALU ALU ALU Bubble Bubble Bubble Bubble Bubble One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load DMem Instr 1 Instr 2 Stall Instr 3 How do you “bubble” the pipe? Advanced Computer Architecture 2019 @ Utsunomiya University

Speed Up Equation for Pipelining For simple RISC pipeline, CPI = 1: Advanced Computer Architecture 2019 @ Utsunomiya University

Example: Dual-port vs. Single-port • Machine A: Dual ported memory (“Harvard Architecture”) • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe/ 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 • Machine A is 1.33 times faster Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem EX WB MEM IF ID/RF I n s t r. O r d e r add x1,x2,x3 sub x4,x1,x3 and x6,x1,x7 or x8,x1,x9 xor x10,x1,x11 Data Hazard on R1 Time (clock cycles) Advanced Computer Architecture 2019 @ Utsunomiya University

Three Generic Data Hazards • Read After Write (RAW) InstrJ tries to read operand before InstrIwrites it • Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add x1,x2,x3 J: sub x4,x1,x3 Advanced Computer Architecture 2019 @ Utsunomiya University

I: sub x4,x1,x3 J: add x1,x2,x3 K: add x6,x1,x7 Three Generic Data Hazards • Write After Read (WAR) InstrJ writes operand beforeInstrIreads it • Called an “anti-dependence” by compiler writers.This results from reuse of the name “x1”. • Can’t happen in RISC-V 5 stage pipeline because: • All instructions take 5 stages, and • Reads are always in stage 2, and • Writes are always in stage 5 Advanced Computer Architecture 2019 @ Utsunomiya University

I: sub x1,x4,x3 J: add x1,x2,x3 K: add x6,x1,x7 Three Generic Data Hazards • Write After Write (WAW) InstrJ writes operand beforeInstrIwrites it. • Called an “output dependence” by compiler writersThis also results from the reuse of name “x1”. • Can’t happen in RISC-V stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • Will see WAR and WAW in more complicated pipes Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem I n s t r. O r d e r add x1,x2,x3 sub x4,x1,x3 and x6,x1,x7 or x8,x1,x9 xor x10,x1,x11 Forwarding to Avoid Data Hazard Time (clock cycles) Advanced Computer Architecture 2019 @ Utsunomiya University

ALU HW Change for Forwarding ID/EX EX/MEM MEM/WR NextPC mux Registers Data Memory mux mux Immediate What circuit detects and resolves this hazard? Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem I n s t r. O r d e r add x1,x2,x3 lwx4, 0(x1) swx4,12(x1) or x8,x6,x9 xor x10,x9,x11 Forwarding to Avoid LW-SW Data Hazard Time (clock cycles) Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg ALU Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU lwx1,0(x2) I n s t r. O r d e r sub x4,x1,x6 and x6,x1,x7 or x8,x1,x9 Data Hazard Even with Forwarding Time (clock cycles) Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem ALU Bubble ALU ALU Reg Reg DMem DMem Bubble Reg Reg Data Hazard Even with Forwarding Time (clock cycles) I n s t r. O r d e r lw x1, 0(x2) sub x4,x1,x6 and x6,x1,x7 Bubble or x8,x1,x9 ALU DMem How is this detected? Advanced Computer Architecture 2019 @ Utsunomiya University

Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd Compiler optimizes for performance. Hardware checks for safety. Advanced Computer Architecture 2019 @ Utsunomiya University

Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem 10: beq x1,x3,36 14: and x2,x3,x5 18: or x6,x1,x7 22: add x8,x1,x9 36: xor x10,x1,x11 Control Hazard on BranchesThree Stage Stall What do you do with the 3 instructions in between? How do you do it? Where is the “commit”? Advanced Computer Architecture 2019 @ Utsunomiya University

Branch Stall Impact • If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! • Two part solution: • Determine branch taken or not sooner, AND • Compute taken branch address earlier • RISC-V Solution: • Move Zero-test to ID stage • Adder to calculate new PC in ID stage • 1 clock cycle penalty for branch versus 3 Advanced Computer Architecture 2019 @ Utsunomiya University

MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Pipelined RISC-V Datapath Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next SEQ PC Next PC MUX Adder Z RS1 Reg File Memory RS2 Data Memory MUX MUX Sign Extend WB Data Imm RD RD RD Advanced Computer Architecture 2019 @ Utsunomiya University

Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken • Execute successor instructions in sequence • “Squash” instructions in pipeline if branch actually taken • Advantage of late pipeline state update • 47% branches not taken on average • PC+4 already calculated, so use it to get next instruction • ARMandRISC-V use this #3: Predict Branch Taken • 53% branches taken on average • But haven’t calculated branch target address in RISC-V • RISC-V still incurs 1 cycle branch penalty Advanced Computer Architecture 2019 @ Utsunomiya University

Four Branch Hazard Alternatives #4: Delayed Branch • Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn target of taken branch • 1 slot delay allows proper decision and branch target address in 5 stage pipeline Branch delay of length n Advanced Computer Architecture 2019 @ Utsunomiya University

becomes becomes becomes if x2=0 then add x1,x2,x3 add x1,x2,x3 if x1=0 then sub x4,x5,x6 Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through • A is the best choice, fills delay slot & reduces instruction count (IC) • In B, the sub instruction may need to be copied, increasing IC • In B and C, must be okay to execute sub when branch fails add x1,x2,x3 if x1=0 then add x1,x2,x3 if x2=0 then sub x4,x5,x6 delay slot delay slot add x1,x2,x3 if x1=0 then or x7,x8,x9 sub x4,x5,x6 delay slot add x1,x2,x3 if x1=0 then or x7,x8,x9 sub x4,x5,x6 Advanced Computer Architecture 2019 @ Utsunomiya University

Delayed Branch • Compiler effectiveness for single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots useful in computation • About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot • Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches • Growth in available transistors has made dynamic approaches relatively cheaper Advanced Computer Architecture 2019 @ Utsunomiya University

Evaluating Branch Alternatives Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline 3 1.60 3.1 1.0 Predict taken 1 1.20 4.2 1.33 Predict not taken 1 1.14 4.4 1.40 Delayed branch 0.5 1.10 4.5 1.45 Advanced Computer Architecture 2019 @ Utsunomiya University

Exception: An unusual event happens to an instruction during its execution • Examples: divide by zero, undefined opcode • Interrupt: Hardware signal to switch the processor to a new instruction stream • Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting) • Problem: It must appear that the exception or interrupt must appear between 2 instructions (Ii and Ii+1) • The effect of all instructions up to and including Ii is totally completed • No effect of any instruction after Ii can take place • The interrupt (exception) handler either aborts program or restarts at instruction Ii+1 Advanced Computer Architecture 2019 @ Utsunomiya University

Precise Exceptions in Static Pipelines Key observation: architected state only change in memory and register write stages. Advanced Computer Architecture 2019 @ Utsunomiya University

Floating Point Operations • FP pipeline will allow for a longer latency • It is impractical to require all FP operations complete in 1 clock cycle • Slow clock rate • Enormous amounts of logic circuit • Multiple FUs (functional units) • Integer unit, FP/Int multiplier, FP adder, FP/Int divider • Two alternatives • Multicycle (unpipelined) FP unit • Pipelined FP unit • Latency and initiation interval (or repeat interval) Advanced Computer Architecture 2019 @ Utsunomiya University

Multicycle (unpipelined) FP units Advanced Computer Architecture 2019 @ Utsunomiya University

Pipelined FP units WAR and WAW hazards may appear because instructions have different lengths Advanced Computer Architecture 2019 @ Utsunomiya University

Latency and initiation interval • Can start every FP-add operationbut each result is obtained after 3 cycles • FP divider is not pipelinedrequires 24 cycles to complete Advanced Computer Architecture 2019 @ Utsunomiya University

RISC-V Architecture and Pipelining Concepts

RISC-V Architecture and Pipelining Concepts

Presentation Transcript

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C: Pipelining

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C

Appendix C