1 / 20

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab.

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology. MemWrite. WBSrc. 0x4. clk. we. clk. rs1. rs2. ALU Control. we. rd1. addr. PC. ws. addr. inst. wd. rd2. ALU. GPRs. z. Inst. Memory. rdata. clk.

kaemon
Download Presentation

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology http://csg.csail.mit.edu/SNU

  2. MemWrite WBSrc 0x4 clk we clk rs1 rs2 ALU Control we rd1 addr PC ws addr inst wd rd2 ALU GPRs z Inst. Memory rdata clk Data Memory Imm Ext wdata zero? OpCode RegDst ExtSel OpSel BSrc Harvard-Style Datapath for MIPS PCSrc RegWrite br rind jabs pc+4 Add Add 31

  3. What problem arises if instructions and data reside in the same memory? At least the instruction fetch and a Load (or Store) cannot be executed in the same cycle Structural hazard

  4. PCSrc RegWrite WBSrc MemWrite PCen 0x4 Add Add clk we clk clk rs1 rs2 rd1 PC 31 we ws addr IR rd2 ALU wd GPRs clk rdata z Data Memory Imm Ext wdata ALU Control IRen OpCode RegDst ExtSel OpSel BSrc zero? AddrSrc Princeton MicroarchitectureDatapath & Control off off off on Fetch phase = PC

  5. AddrSrc=PC IRen=on PCen=off Wen=off AddrSrc=ALU IRen=off PCen=on Wen=on Two-State Controller:Princeton Architecture fetch phase execute phase A flipflop can be used to remember the phase

  6. Hardwired Controller:Princeton Architecture . . . ExtSel, BSrc, OpSel, WBSrc, RegDest, PCsrc1, PCsrc2 old combinational logic (Harvard) op code IR zero? MemWrite RegWrite Wen new combinational logic S PCen IRen AddrSrc 1-bit Toggle FF I-fetch / Execute

  7. Two-Cycle SMIPS Register File stage PC Execute Decode ir +4 Data Memory Inst Memory http://csg.csail.mit.edu/SNU 7

  8. Two-Cycle SMIPS modulemkProc(Proc); Reg#(Addr) pc <- mkRegU; RFilerf<- mkRFile; Memory mem <- mkMemory; PipeReg#(FBundle) ir<- mkPipeReg; Reg#(Bit#(1)) stage <- mkReg(0); letpcir = ir.first(); let pc = pcir.pc; let inst = pcir.inst; ruledoProc; if(stage==0 && ir.notFull) begin //fetch letinstResp <- mem(MemReq{op:Ld, addr:pc, data:?}); ir.enq(FBundle{pc:pc, inst:instResp}); stage <= 1; end http://csg.csail.mit.edu/SNU

  9. Two-Cycle SMIPS cont-1 if(stage==1 && ir.notEmpty) begin //decode letdecInst = decode(inst); Data rVal1 = rf.rd1(decInst.rSrc1); Data rVal2 = rf.rd2(decInst.rSrc2); //execute letexecInst = exec(decInst, pc, rVal1, rVal2); if(execInst.instType==Ld || execInst.instType==St) execInst.data <- mem(MemReq{op:execInst.instType, addr:execInst.addr, data:execInst.data}); pc <= execInst.brTaken ? execInst.addr : pc+4; //writeback http://csg.csail.mit.edu/SNU

  10. Two-Cycle SMIPS cont-2 //writeback if(execInst.instType==Alu|| execInst.instType==Ld) rf.wr(execInst.rDst, execInst.data); ir.deq; stage <= 0; end endrule endmodule; http://csg.csail.mit.edu/SNU

  11. Time = Instructions Cycles Time Program Program * Instruction * Cycle Processor Performance • Instructions per program depends on source code, compiler technology and ISA • Cycles per instructions (CPI) depends upon the ISA and the microarchitecture • Time per cycle depends upon the microarchitecture and the base technology

  12. Single-Cycle Hardwired Control: Harvard architecture We will assume clock period is sufficiently long for all of the following steps to be “completed”: 1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time tC > tIFetch + tRFetch+ tALU+ tDMem+ tRWB At the rising edge of the following clock, the PC, the register file and the memory are updated

  13. Clock Period tC-Princeton > max {tM , tRF+ tALU+ tM + tWB} tC-Princeton > tRF+ tALU+ tM + tWB while in the hardwired Harvard architecture tC-Harvard > tM + tRF + tALU+ tM+ tWB which will execute instructions faster?

  14. Clock Rate vs CPI Suppose tM >> tRF+ tALU+ tWB tC-Princeton = 0.5 * tC-Harvard CPIPrinceton= 2 CPIHarvard= 1 No difference in performance! Is it possible to design a controller for the Princeton architecture with CPI < 2 ? CPI = Clock cycles Per Instruction

  15. 0x4 Add we rs1 rs2 we rd1 we addr PC ws IR addr rdata ALU wd rd2 GPRs rdata Memory Memory wdata Imm Ext wdata fetch phase execute phase Princeton microarchitecture(redrawn) The same (mux not shown) Only one of the phases is active in any cycle  a lot of datapath is not in use at any given time

  16. 0x4 Add we rs1 rs2 we rd1 we addr PC ws IR addr rdata ALU wd rd2 GPRs rdata Memory Memory wdata Imm Ext wdata fetch phase execute phase Princeton MicroarchitectureOverlapped execution Can we overlap instruction fetch and execute? Yes, unless IR contains a Load or Store Which action should be prioritized? Execute Stall it How? What do we do with Fetch?

  17. stall? we rs1 rs2 rd1 we ws addr ALU wd rd2 GPRs rdata Memory Imm Ext wdata Stalling the instruction fetch Princeton Microarchitecture 0x4 Add nop we addr PC IR rdata Memory wdata fetch phase execute phase When stall condition is indicated don’t fetch a new instruction and don’t change the PC insert a nop in the IR set the Memory Address mux to ALU (not shown) What if IR contains a jump or branch instruction?

  18. Jump? we rs1 rs2 rd1 we ws addr ALU wd rd2 GPRs rdata Memory Imm Ext wdata Need to stall on branchesPrinceton Microarchitecture 0x4 Add nop we addr PC IR rdata Memory wdata When IR contains a jump or branch-taken no “structural conflict” for the memory but we do not have the correct PC value in the PC memory cannot be used – Address Mux setting is irrelevant insert a nop in the IR insert the nextPC (branch-target) address in the PC

  19. 0x4 Pipelined Princeton Microarchitecture PCSrc PCSrc2 RegWrite WBSrc MemWrite PCen Add Add clk we nop clk clk rs1 rs2 rd1 PC 31 we ws addr IR rd2 ALU wd GPRs clk rdata z Data Memory Imm Ext wdata ALU Control IRSrc stall? OpCode RegDst ExtSel OpSel BSrc zero? MAddrSrc stall

  20. Pipelined Princeton Architecture Clock:tC-Princeton > tRF+ tALU+ tM CPI: (1- f) + 2f cycles per instruction where f is the fraction of instructions that cause a stall What is a likely value of f?

More Related