
Chapter 4

CprE 381 Computer Organization and Assembly Level Programming, Fall 2013. Chapter 4. The Processor. Zhao Zhang Iowa State University Revised from original slides provided by MKP. Week 9 Overview. Mini Project B CPU Pipelining: Pipelined Data Path and Control



Presentation Transcript


  1. CprE 381 Computer Organization and Assembly Level Programming, Fall 2013 Chapter 4 The Processor Zhao Zhang Iowa State University Revised from original slides provided by MKP

  2. Week 9 Overview • Mini Project B • CPU Pipelining: Pipelined Data Path and Control • ALU Data Hazards and Forwarding Chapter 1 — Computer Abstractions and Technology — 2

  3. Mini-Project B Overview Implement a single-cycle processor (SCP). There will be three parts • Part 1, SCPv1: Implement the nine-instruction ISA • Part 2, SCPv2a: Support all the instructions needed to run bubble sort • With coarse-level modeling of datapath elements • Part 3, SCPv2b: Detailed modeling of datapath elements There is a bonus project

  4. Project A Late Submission • Start working on Project B, ASAP • You may submit Mini-Project A late for three weeks (with 20% late penalty) • Demo those parts that are working • Late penalty only applies to those parts that are actually late • If you demo Project B successfully, you don’t have to demo any late part of Project A

  5. Part 1: SCPv1 Implementing the nine-instruction MIPS ISA • Memory reference: LW and SW • Arithmetic/logic: ADD, SUB, AND, OR, SLT • Branch: BEQ, J The textbook provides almost all implementation details • Datapath and control • The main control unit (9-bit signals w/o Jump) • The ALU control unit

  6. Part 1: SCPv1 Use this diagram as the blueprint for Part 1

  7. SCPv1: Control Signals • Control signal setting for SCPv1 • It is a truth table Note: “R-” means R-format

  8. SCPv1: ALU Control • Truth table for ALU Control Chapter 4 — The Processor — 8

  9. SCPv1 Fast Prototyping You are provided with the following files • mips32.vhd: A VHDL package • regfile.vhd: For the register file • register.vhd: For the PC • alu.vhd: For the ALU • adder.vhd: For the PC-related adders • mem.vhd: The memory, for both instruction memory and data memory

  10. SCPv1 Fast Prototyping Rationale behind Part 1: Focus on the structure/organization of the CPU • The provided components are modeled at a coarse level • We know that efficient circuit designs exist for those components: memory, register file, ALU, adder, mux and so on • Work out the details at a later time

  11. Strongly Structural Modeling • Your CPU composition must be strongly structural • No behavior modeling can be used. No process statement. • Limited dataflow modeling (see next) • Additional requirement: Declare all components in the architecture body of CPU • Only component instantiation, no entity instantiation

  12. Strongly Structural Modeling • Acceptable forms of dataflow modeling
      • Signal copying/splitting
        opcode <= inst(31 downto 26);
      • Signal merging
        j_target <= PC(31 downto 28) & j_offset & "00";
      • One level of basic logic gates
        taken_branch <= branch AND zero;
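
The splitting and merging assignments above can be cross-checked against a small software model. The following is a minimal Python sketch, not part of the project files: the function names `jump_target` and `opcode` are hypothetical, and it assumes a 32-bit PC and a 26-bit `j_offset`, the widths implied by the MIPS J format.

```python
def jump_target(pc: int, j_offset: int) -> int:
    """Model of: j_target <= PC(31 downto 28) & j_offset & "00"."""
    assert 0 <= pc < 2**32 and 0 <= j_offset < 2**26
    upper = pc & 0xF0000000          # PC(31 downto 28), left in bit positions 31..28
    return upper | (j_offset << 2)   # appending "00" is a left shift by two bits


def opcode(inst: int) -> int:
    """Model of: opcode <= inst(31 downto 26)."""
    return (inst >> 26) & 0x3F


print(hex(jump_target(0x40000008, 0x10)))  # upper 4 bits of PC, then offset * 4
print(opcode(0x8C080000))                  # top six bits of an instruction word
```

A model like this is handy for predicting the values a test bench should show on `j_target` and `opcode` before inspecting the waveform.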

  13. Cpu.vhd • This is a partial sample
      -- Control Unit
      CONTROL1 : control
        port map (opcode, reg_dst, alu_src, mem_to_reg,…);
      -- ALU Control unit
      ALU_CTRL1 : alu_ctrl
        port map (alu_op, funct, alu_code);
      -- The mux connected to the dst port of regfile
      DST_MUX : mux2to1
        generic map (M => 5)
        port map (rt, rd, reg_dst, dst);
      …

  14. Datapath and Control Modeling • For datapath elements and control units, you may use any modeling style (in Part 1) • The provided components all use behavior modeling for simplicity

  15. mips32.vhd
      package MIPS32 is
        -- Half Cycle Time of the clock signal
        constant HCT : time := 50 ns;
        -- Clock Cycle Time of the clock signal
        constant CCT : time := 2 * HCT;
        -- MIPS32 logic type
        subtype m32_logic is std_logic;
        -- MIPS32 logic vector type
        subtype m32_vector is std_logic_vector;
      Pre-defined constants and types to make coding simpler and consistent

  16. mips32.vhd
        -- Word type, for …
        subtype m32_word is m32_vector(31 downto 0);
        -- Halfword, byte, and bit fields of varying size
        subtype m32_halfword is m32_vector(15 downto 0);
        subtype m32_byte is m32_vector(7 downto 0);
        subtype m32_1bit is m32_logic;
        subtype m32_2bits is m32_vector(1 downto 0);
        subtype m32_3bits is m32_vector(2 downto 0);
        …
      end MIPS32;
      Pre-defined types shorten the names

  17. Alu.vhd • Why provide the ALU and the other VHDL programs? • Your implementation might have bugs • We don’t want to fight the bugs on two fronts • You shall test those modules • Always test any modules that you will use • The provided modules have been tested • Some test-bench programs are provided • Write your own test-bench or extend the provided test-bench

  18. Alu.vhd
      entity ALU is
        port (rdata1   : in  m32_word;
              rdata2   : in  m32_word;
              alu_code : in  m32_4bits;
              result   : out m32_word;
              zero     : out m32_1bit);
      end entity;

  19. Alu.vhd
      architecture behavior of ALU is
        signal r : m32_word;
      begin
        P_ALU : process (alu_code, rdata1, rdata2)
          variable code, a, b, sum, diff, slt : integer;
        begin
          -- Pre-calculate arithmetic results
          a := to_integer(signed(rdata1));
          b := to_integer(signed(rdata2));
          sum := a + b;
          diff := a - b;
          if (a < b) then
            slt := 1;
          else
            slt := 0;
          end if;

  20. Alu.vhd
          -- Select the result, convert to signal if necessary
          case (alu_code) is
            when "0000" =>     -- AND
              r <= rdata1 AND rdata2;
            when "0010" =>     -- add
              r <= std_logic_vector(to_signed(sum, 32));
            …
          end case;
        end process;
        -- Drive the alu result output
        result <= r;
        -- Drive the zero output
        with r select
          zero <= '1' when x"00000000",
                  '0' when others;
      end behavior;
      Coarse-level modeling is easy and reliable, but may not be synthesized efficiently

  21. Regfile.vhd
      entity regfile is
        port (src1   : in  m32_5bits;
              src2   : in  m32_5bits;
              dst    : in  m32_5bits;
              wdata  : in  m32_word;
              rdata1 : out m32_word;
              rdata2 : out m32_word;
              WE     : in  m32_1bit;
              reset  : in  m32_1bit;
              clock  : in  m32_1bit);
      end regfile;
      Caveat: The clock signal is needed in the single-cycle implementation

  22. Regfile.vhd
      architecture behavior of regfile is
        signal reg_array : m32_regval_array;
      begin
        -- Register reset logic
        P_WRITE : process (clock)
          variable r : integer;
        begin
          -- Write/reset logic
          if (rising_edge(clock)) then
            if (reset = '1') then
              for i in 0 to 31 loop
                reg_array(i) <= X"00000000";
              end loop;

  23. Regfile.vhd
            elsif (WE = '1') then
              r := to_integer(unsigned(dst));
              -- Register 0 is never written
              if not (r = 0) then
                reg_array(r) <= wdata;
              end if;
            end if;
          end if;
        end process;

  24. Regfile.vhd
        P_READ : process (clock, src1, src2)
          variable r1, r2 : integer;
        begin
          -- Read logic
          r1 := to_integer(unsigned(src1));
          r2 := to_integer(unsigned(src2));
          rdata1 <= reg_array(r1);
          rdata2 <= reg_array(r2);
        end process;
      end behavior;
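
The register file behavior shown on the Regfile.vhd slides (synchronous reset with priority over writes, write on the rising edge when WE is set, register 0 never written, unclocked read) can be mirrored in a few lines of software. This is a minimal Python sketch for reasoning about expected values only; the class name `RegFileModel` and its method names are hypothetical and are not part of regfile.vhd.

```python
class RegFileModel:
    """Software mirror of the behavior described for regfile.vhd."""

    def __init__(self):
        self.regs = [0] * 32

    def rising_edge(self, dst: int, wdata: int, we: bool, reset: bool) -> None:
        """What P_WRITE does on one rising clock edge."""
        if reset:
            self.regs = [0] * 32               # reset has priority over a write
        elif we and dst != 0:
            self.regs[dst] = wdata & 0xFFFFFFFF  # register 0 is skipped

    def read(self, src1: int, src2: int):
        """What P_READ drives on rdata1 and rdata2."""
        return self.regs[src1], self.regs[src2]


rf = RegFileModel()
rf.rising_edge(dst=8, wdata=5, we=True, reset=False)
rf.rising_edge(dst=0, wdata=7, we=True, reset=False)  # ignored: register 0
print(rf.read(8, 0))
```

Comparing such a model against the simulated `rdata1` and `rdata2` after each clock edge is one way to extend the provided test bench.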

  25. Demonstration For each of multiple test cases • Trace the program execution • Inspect the register and memory contents at the end of execution Each test case consists of • MIPS binary code, e.g. in imem.txt • Data memory content, e.g. in dmem.txt

  26. Test Bench Inside test bench:
      CPU1 : cpu
        port map (imem_addr, inst, dmem_addr, dmem_read, dmem_write, dmem_wmask, dmem_rdata, dmem_wdata, reset, clock);
      INST_MEM : mem
        generic map (mif_filename => "imem.txt")
        port map (imem_addr(9 downto 2), "0000", clock, x"00000000", '0', inst);
      DATA_MEM : mem
        generic map (mif_filename => "dmem.txt")
        port map (dmem_addr(9 downto 2), dmem_wmask, clock, dmem_wdata, dmem_write, dmem_rdata);
      Note: Treat memories as external datapath elements

  27. Instruction Memory • imem.txt contents (MIF)
      DEPTH=1024;
      WIDTH = 32;
      -- lw $t0, 0($zero)
      -- lw $t1, 4($zero)
      -- beq $t0, $t1, +2
      -- add $t0, $t0, $t1
      -- sw $t0, 8($zero)
      -- noop
      CONTENT
      BEGIN
      -- Instruction formats
      --R ======-----=====-----=====------
      --I ======-----=====----------------
      --J ======--------------------------
      0 : 10001100000010000000000000000000;
      1 : 10001100000010010000000000000100;
      2 : 00010001000010010000000000000010;
      3 : 00000001000010010100000000100000;
      4 : 10101100000010000000000000001000;
      [5..63] : 00000000;
      END;
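
To check that the binary words in imem.txt match the assembly in the comments, the fields can be split according to the R and I format rulers above. The following is a minimal Python sketch under stated assumptions: the `decode` function and `REG_NAMES` table are hypothetical helpers, and only the four instructions that appear in this listing are handled, using the standard MIPS encodings (lw = 0x23, sw = 0x2B, beq = 0x04, add = opcode 0 with funct 0x20).

```python
WORDS = [
    "10001100000010000000000000000000",
    "10001100000010010000000000000100",
    "00010001000010010000000000000010",
    "00000001000010010100000000100000",
    "10101100000010000000000000001000",
]

# Only the registers used in the listing; others print as $<number>.
REG_NAMES = {0: "$zero", 8: "$t0", 9: "$t1"}


def reg(n: int) -> str:
    return REG_NAMES.get(n, f"${n}")


def decode(word: str) -> str:
    bits = int(word, 2)
    op = bits >> 26
    rs = (bits >> 21) & 0x1F
    rt = (bits >> 16) & 0x1F
    rd = (bits >> 11) & 0x1F
    funct = bits & 0x3F
    imm = bits & 0xFFFF
    if imm >= 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    if op == 0x23:
        return f"lw {reg(rt)}, {imm}({reg(rs)})"
    if op == 0x2B:
        return f"sw {reg(rt)}, {imm}({reg(rs)})"
    if op == 0x04:
        return f"beq {reg(rs)}, {reg(rt)}, {imm:+d}"
    if op == 0x00 and funct == 0x20:
        return f"add {reg(rd)}, {reg(rs)}, {reg(rt)}"
    raise ValueError(f"instruction not handled in this sketch: {word}")


for address, word in enumerate(WORDS):
    print(address, decode(word))
```

Running it reproduces the five commented instructions in order, which confirms the encodings in the listing.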

  28. Part 2. SCPv2 Prototyping (SCPv2a) • Support all MIPS instructions used by the bubble sort example • We have studied how to extend the nine-instruction design to support ADDI, SLL, BNE, and JAL • For each new instruction, think about • Datapath: Any new/revised data elements, any new signal connections • The main control: Any new control signals, any extension to the truth table • The ALU control: Any extension to the truth table

  29. Part 3. SCPv2b • SCPv2 Detailed Implementation • Provide detailed modeling for • Register file • ALU • Adder • Use your code from Labs 1-4 and Mini-Project A • You may revise your code • Your final code should be strongly structural • Consult your lab TA if you are not sure

  30. Bonus Project Part 1 • Green MIPS SCP (SCP-G) • Extend SCPv2 to support all integer instructions listed on the green sheet • Bonus Project Part 2 is to do pipelined implementation • The lab bonus can overflow in your overall grade • As said, quiz bonus does not overflow • Partial credit will be given • The grading details will be finalized

  31. Pipelined CPU A natural idea to improve performance. The devil is in the details • Pipelined data path and control • Data hazard from ALU instructions • Data hazard from load instructions • Control hazard from branches • Exception handling in pipelined processor

  32. SCP With Jumps Added

  33. Performance Issues • Longest delay determines clock period • Critical path: load instruction • Instruction memory → register file → ALU → data memory → register file • Now we will improve performance by pipelining

  34. Pipelining Analogy (§4.5 An Overview of Pipelining) • Pipelined laundry: overlapping execution • Parallelism improves performance • Four loads: Speedup = 8/3.5 = 2.3 • Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
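
The two laundry speedup figures follow from one formula. Below is a minimal Python sketch; the function name `laundry_speedup` is hypothetical, and the four stages of half an hour each are inferred from the slide's expression 2n/(0.5n + 1.5) rather than stated on it.

```python
def laundry_speedup(n_loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of overlapped laundry over doing the loads one after another."""
    sequential = n_loads * stages * stage_hours         # 2n hours
    pipelined = (stages + n_loads - 1) * stage_hours    # 0.5n + 1.5 hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))      # four loads: 8 / 3.5
print(round(laundry_speedup(1000), 2))   # many loads: approaches the stage count
```

The pipelined time is the time to fill the pipeline (one pass through all stages) plus one stage time for each additional load, which is why the speedup only approaches 4 as n grows.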

  35. Pipeline Performance Look at this example • In single-cycle implementation, the critical path is 800ps (one cycle @ 1.25 GHz) • The longest component latency is 200ps (one cycle @ 5GHz) Note: Latency of mux, extender and so on ignored

  36. MIPS Pipeline Idea • If we divide the execution into stages, clock frequency can be much faster • Five stages, one step per stage • IF: Instruction fetch from memory • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register

  37. MIPS Pipeline Idea General idea: Split the datapath into stages, with the critical path delay of each stage <= 1 clock cycle

  38. Pipeline Performance First look at performance gain [Figure: instruction timing, single-cycle (Tc = 800ps) vs. pipelined (Tc = 200ps)]

  39. Pipeline Speedup • If all stages are balanced • i.e., all take the same time • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages • Ideal speedup = N for N-stage pipeline • If not balanced, speedup is less • In the example, speedup is up to 4.0 • Speedup due to increased throughput • Latency (time for each instruction) does not decrease, or even increases
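
The "up to 4.0" figure can be reproduced from the two cycle times given earlier (800ps single-cycle, 200ps pipelined) and the five-stage pipeline. This is a minimal Python sketch; the helper names `total_time_ps` and `speedup` are hypothetical.

```python
def total_time_ps(n_instr: int, cycle_ps: int, stages: int = 1) -> int:
    """Time to finish n instructions; a pipeline needs (stages - 1) extra cycles to fill."""
    return (stages + n_instr - 1) * cycle_ps


def speedup(n_instr: int) -> float:
    single_cycle = total_time_ps(n_instr, cycle_ps=800)         # one 800ps cycle each
    pipelined = total_time_ps(n_instr, cycle_ps=200, stages=5)  # five 200ps stages
    return single_cycle / pipelined


print(total_time_ps(3, 800), total_time_ps(3, 200, stages=5))
print(round(speedup(3), 2), round(speedup(1_000_000), 2))
```

The limit is 800/200 = 4 rather than 5 because the stages are not balanced: the clock must fit the slowest stage (200ps), so five stages take 1000ps per instruction, which also illustrates the point that per-instruction latency increases.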

  40. Pipelining and ISA Design • MIPS ISA designed for pipelining • All instructions are 32 bits • Easier to fetch and decode in one cycle • cf. x86: 1- to 17-byte instructions • Few and regular instruction formats • Can decode and read registers in one step

  41. Pipelining and ISA Design • How would you design a pipeline for this instruction format?
      Prefixes (1-4 bytes) | Opcode (1-3 bytes), required | ModR/M (1 byte) | SIB (1 byte) | Addr. Displacement (0, 1, 2, or 4 bytes) | Immediate (0, 1, 2, or 4 bytes)
      • ModR/M: addressing-form specifier, mixing of register numbers, addressing modes, additional opcode bits
      • SIB: Second addressing byte for base-plus-index and scale-plus-index addressing modes
      Source: Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

  42. Pipelining and ISA Design • MIPS ISA designed for pipelining • Load/store addressing • Can calculate address in 3rd stage, access memory in 4th stage • Alignment of memory operands • Memory access takes only one cycle

  43. Pipelining and ISA Design • How would you design a pipeline that works well for the following instructions?
      ADD eax, ebx                 ; add with two registers
      SUB ebx, 100                 ; sub with reg and const
      ADD eax, [0x1000]            ; add reg and memory
      ADD BYTE PTR [0x1000], 100   ; add with mem and const
      SUB [esi+4*ebx], eax         ; sub with reg and mem (array)

  44. MIPS Pipelined Datapath (§4.6 Pipelined Datapath and Control) [Figure: pipelined datapath, with the MEM and WB stages marked] Right-to-left flow leads to hazards

  45. Pipeline registers • Need registers between stages • To hold information produced in previous cycle

  46. Hazards • Situations that prevent starting the next instruction in the next cycle • Structure hazards • A required resource is busy • Data hazard • Need to wait for previous instruction to complete its data read/write • Control hazard • Deciding on control action depends on previous instruction

  47. Hazards There are ways to handle those hazards. Let’s ignore them for now. Assume, for now, no data dependence and no control dependence in the program
      lw  $10, 20($1)
      sub $11, $2, $3
      add $12, $3, $4
      lw  $13, 24($1)
      sub $14, $5, $6
      Can you design a pipeline to run the above instructions correctly?

  48. Hazards Program with data dependence
      sub $2, $1, $3
      and $12, $2, $5
      or  $13, $6, $2
      add $14, $2, $2
      sw  $15, 100($2)
      Program with control dependence
      beq  $1, $3, +4
      addi $2, $2, 1
      addi $4, $4, 1
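
The data dependences in the first program can be found mechanically by tracking which instruction last wrote each register. The following is a minimal Python sketch; the tuple representation and the function name `raw_dependences` are hypothetical, and the destination and source registers are transcribed by hand from the five instructions above (sw writes no register and reads both $15 and $2).

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",   2,    (1, 3)),
    ("and $12, $2, $5",  12,   (2, 5)),
    ("or $13, $6, $2",   13,   (6, 2)),
    ("add $14, $2, $2",  14,   (2, 2)),
    ("sw $15, 100($2)",  None, (15, 2)),
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) for each read-after-write pair."""
    last_writer = {}
    deps = []
    for i, (_, dest, sources) in enumerate(program):
        for r in sorted(set(sources)):
            if r in last_writer:
                deps.append((last_writer[r], i, r))
        if dest is not None and dest != 0:   # register 0 is never written
            last_writer[dest] = i
    return deps


for producer, consumer, r in raw_dependences(PROGRAM):
    print(f"{PROGRAM[consumer][0]!r} reads ${r} written by {PROGRAM[producer][0]!r}")
```

Every later instruction reads $2 from the sub, so all four depend on it. Whether a given dependence actually becomes a hazard depends on how far apart the two instructions are in the pipeline, which this sketch does not model.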

  49. Pipeline Operation • Cycle-by-cycle flow of instructions through the pipelined datapath • “Single-clock-cycle” pipeline diagram • Shows pipeline usage in a single cycle • Highlight resources used • cf. “multi-clock-cycle” diagram • Graph of operation over time • We’ll look at “single-clock-cycle” diagrams for load & store
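
The relation between the two diagram types can be seen by building the multi-clock-cycle table and then reading off one column. This is a minimal Python sketch using the five independent instructions from the earlier Hazards slide; the function name `diagram` and the text layout are hypothetical, and no hazards or stalls are modeled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "lw  $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw  $13, 24($1)",
    "sub $14, $5, $6",
]


def diagram(program):
    """Multi-clock-cycle view: one row per instruction, one column per clock cycle."""
    n_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i in range(len(program)):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage      # instruction i enters IF in cycle i + 1
        rows.append(cells)
    return rows


rows = diagram(PROGRAM)
for inst, cells in zip(PROGRAM, rows):
    print(f"{inst:<18}" + " ".join(f"{c:<4}" for c in cells))

# Single-clock-cycle view of cycle 5: which stage each instruction occupies.
cycle = 5
print({inst: cells[cycle - 1] for inst, cells in zip(PROGRAM, rows)})
```

Each row is the graph of one instruction over time; each column is a single-clock-cycle snapshot. In cycle 5 all five stages are busy, each with a different instruction.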

  50. IF for Load, Store, …
