
Chapter 4 The Processor
§4.1 Introduction
• CPU performance factors
  • Instruction count: determined by ISA and compiler
  • CPI and cycle time: determined by CPU hardware
• We will examine two MIPS implementations
  • A simplified version
  • A more realistic pipelined version
• Simple subset, shows most aspects
  • Memory reference: lw, sw
  • Arithmetic/logical: add, sub, and, or, slt
  • Control transfer: beq, j
Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
  • Use ALU to calculate
    • Arithmetic/logical result
    • Memory address for load/store
    • Branch target address
  • Access data memory for load/store
  • PC ← target address or PC + 4
CPU Overview
Multiplexers
• Can't just join wires together
  • Use multiplexers
Control
§4.2 Logic Design Conventions
Logic Design Basics
• Information encoded in binary
  • Low voltage = 0, High voltage = 1
  • One wire per bit
  • Multi-bit data encoded on multi-wire buses
• Combinational element
  • Operate on data
  • Output is a function of input
• State (sequential) elements
  • Store information
Combinational Elements
• AND-gate
  • Y = A & B
• Adder
  • Y = A + B
• Arithmetic/Logic Unit
  • Y = F(A, B)
• Multiplexer
  • Y = S ? I1 : I0
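The four combinational elements can be sketched as pure functions, since each output depends only on the current inputs. This is a minimal illustration, assuming 32-bit buses; the names `MASK32`, `and_gate`, `adder`, `mux`, and `alu`, and the string operation names passed as `f`, are hypothetical and not taken from the slides.

```python
MASK32 = 0xFFFFFFFF  # assumption: all buses are 32 bits wide


def and_gate(a: int, b: int) -> int:
    """Y = A & B (bitwise across the bus)."""
    return a & b


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapping around at 32 bits like a hardware adder."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0 (select one of two sources)."""
    return i1 if s else i0


def alu(f: str, a: int, b: int) -> int:
    """Y = F(A, B); f names the function (hypothetical string encoding)."""
    if f == "and":
        return a & b
    if f == "or":
        return a | b
    if f == "add":
        return (a + b) & MASK32
    if f == "sub":
        return (a - b) & MASK32
    raise ValueError(f"unknown ALU function: {f}")
```

Because none of these functions keeps any state, calling one twice with the same inputs always gives the same output, which is the defining property of a combinational element.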
State Elements
• Unclocked vs. clocked
• Clocks are used in synchronous logic
  • When should an element that contains state be updated?
[Figure: clock waveform showing rising edge, falling edge, and cycle time]
An unclocked state element
• The set-reset latch
  • Output depends on present inputs and also on past inputs
S  R  |  Q(n+1)  Q'(n+1)  |  Mode
0  0  |  Q(n)    Q'(n)    |  Hold
1  0  |  1       0        |  Set
0  1  |  0       1        |  Reset
1  1  |  *       *        |  Don't use this mode!
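The behaviour of the set-reset latch can be sketched as a next-state function. This is a behavioural model only, not a gate-level one; the function name `sr_latch` is hypothetical, and the sketch assumes active-high S and R inputs and rejects the S = R = 1 mode outright.

```python
def sr_latch(s: int, r: int, q: int) -> int:
    """Return the next Q of a set-reset latch, given inputs and current Q.

    Assumption: active-high inputs. S = R = 1 is the mode the slide warns
    against, so this sketch refuses it instead of guessing an output.
    """
    if s and r:
        raise ValueError("S = R = 1 is not a valid operating mode")
    if s:
        return 1  # set
    if r:
        return 0  # reset
    return q      # S = R = 0: hold, output depends on past inputs


# The stored value survives after the inputs return to 0.
q = 0
q = sr_latch(1, 0, q)  # set
q = sr_latch(0, 0, q)  # hold: q is still 1
```

The hold case is what makes this a state element: the output with S = R = 0 depends on which input was asserted last.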
Latches and Flip-flops
• Output is equal to the stored value inside the element (no need to ask for permission to look at the value)
• Change of state (value) is based on the clock
  • Latch: state changes whenever the inputs change and the clock is asserted (level-triggered methodology)
  • Flip-flop: state changes only on a clock edge (edge-triggered methodology)
• "Asserted" means logically true, which could mean electrically low
• A clocking methodology defines when signals can be read and written
  • You wouldn't want to read a signal at the same time it is being written
D-latch (Transparent Latch)
• Two inputs:
  • The data value to be stored (D)
  • The clock signal (C), indicating when to read and store D
• Two outputs:
  • The value of the internal state (Q) and its complement
[Figure label: propagation delay]
D flip-flop (1-bit register)
• Output changes only on the clock edge
  • Negative (falling) edge in this example
Comparison between D latch and D flip-flop
• A timing diagram for normal operation of a D latch and a D flip-flop (positive edge triggered):
  • Latch: while clock = 1 the latch is open and the input propagates to the output
  • Flip-flop: when the clock goes 0 → 1 (rising edge) it captures the input signal
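The difference between the two elements shows up when D changes while the clock is high. The following sketch steps both through the same sampled waveform; the function names and the sample waveform are illustrative assumptions, and propagation delay is ignored.

```python
def d_latch_trace(clk, d, q0=0):
    """Level-triggered: Q follows D for as long as the clock is 1."""
    q, out = q0, []
    for c, x in zip(clk, d):
        if c == 1:
            q = x
        out.append(q)
    return out


def d_flipflop_trace(clk, d, q0=0):
    """Positive-edge-triggered: Q takes D only when the clock goes 0 -> 1."""
    q, prev, out = q0, 0, []
    for c, x in zip(clk, d):
        if prev == 0 and c == 1:
            q = x
        prev = c
        out.append(q)
    return out


# Hypothetical sampled waveform: D changes in the middle of each high phase.
clk = [0, 1, 1, 0, 0, 1, 1, 0]
d   = [1, 1, 0, 0, 0, 0, 1, 1]

latch_q = d_latch_trace(clk, d)     # [0, 1, 0, 0, 0, 0, 1, 1]
ff_q    = d_flipflop_trace(clk, d)  # [0, 1, 1, 1, 1, 0, 0, 0]
```

At the third and seventh samples the latch output tracks the mid-phase change in D, while the flip-flop keeps the value it captured on the rising edge.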
Sequential Elements
• Register: stores data in a circuit
  • Uses a clock signal to determine when to update the stored value
  • Edge-triggered: update when Clk changes from 0 to 1 (positive edge), or from 1 to 0 (negative edge)
Sequential Elements
• Register with write control
  • Only updates on clock edge when write control input is 1
  • Used when stored value is required several cycles later
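A register with write control can be modelled as an edge-triggered element that ignores D unless the write input is 1. This is a minimal sketch assuming a positive-edge-triggered design; the class name `Register` and its `tick` method are hypothetical.

```python
class Register:
    """Positive-edge-triggered register with a write-enable input."""

    def __init__(self, value: int = 0):
        self.q = value
        self._prev_clk = 0  # last clock level seen, used to detect edges

    def tick(self, clk: int, d: int, write: int = 1) -> int:
        """Present clock, data, and write inputs; return the output Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._prev_clk = clk
        return self.q


r = Register()
a = r.tick(clk=1, d=7, write=0)  # rising edge, but write = 0: stays 0
b = r.tick(clk=0, d=7, write=1)  # no edge: stays 0
c = r.tick(clk=1, d=7, write=1)  # rising edge with write = 1: becomes 7
e = r.tick(clk=1, d=9, write=1)  # clock still high, no new edge: stays 7
```

Holding the write input at 0 lets the register keep a value across as many cycles as needed.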
Clocking Methodology
• Combinational logic transforms data during clock cycles
  • Between clock edges
  • Input from state elements, output to state element
  • Longest delay determines clock period
Register File (read operation)
• Built using D flip-flops
Register File (write operation)
• Note: we still use the real clock to determine when to write
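The read and write behaviour of the register file can be sketched as follows. Reads are combinational, while a write takes effect only at a clock edge when the write control is asserted. The class and method names are hypothetical, and treating register 0 as hardwired to zero is a MIPS convention that the slides do not state.

```python
class RegisterFile:
    """32 x 32-bit register file: two read ports, one write port."""

    def __init__(self, count: int = 32):
        self.regs = [0] * count

    def read(self, rs: int, rt: int):
        """Combinational read: outputs follow the register numbers at once."""
        return self.regs[rs], self.regs[rt]

    def clock_edge(self, reg_write: int, rd: int, data: int) -> None:
        """Called once per clock edge; writes only when reg_write is 1."""
        # Assumption: register 0 ($zero) is hardwired to 0, as in MIPS.
        if reg_write and rd != 0:
            self.regs[rd] = data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(reg_write=1, rd=8, data=42)
rf.clock_edge(reg_write=0, rd=9, data=99)  # write control is 0: ignored
values = rf.read(8, 9)                     # (42, 0)
```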
Simple Implementation
• Include the functional units we need for each instruction
• Why do we need this stuff?
§4.3 Building a Datapath
Building a Datapath
• Datapath
  • Elements that process data and addresses in the CPU
  • Registers, ALUs, muxes, memories, …
• Structure of the datapath
  • Similar to a flow chart
• We will build a MIPS datapath incrementally
  • Refining the overview design
So far:
• Instruction          Meaning (Register Transfer Language, RTL)
  add $s1,$s2,$s3      $s1 = $s2 + $s3
  sub $s1,$s2,$s3      $s1 = $s2 - $s3
  lw $s1,100($s2)      $s1 = Memory[$s2+100]
  sw $s1,100($s2)      Memory[$s2+100] = $s1
  bne $s4,$s5,L        Next instr. is at Label if $s4 ≠ $s5
  beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
  j Label              Next instr. is at Label
• Formats:
  R    op | rs | rt | rd | shamt | funct
  I    op | rs | rt | 16 bit address
  J    op | 26 bit address
Information in Instruction Formats
Format       31:26      25:21  20:16  15:11  10:6   5:0
R-type       0          rs     rt     rd     shamt  funct
Load/Store   35 or 43   rs     rt     address (15:0)
Branch       4          rs     rt     address (15:0)
• 31:26: opcode
• rs (25:21): always read
• rt (20:16): read, except for load
• rd (15:11) for R-type, rt (20:16) for load: write for R-type and load
• address (15:0): sign-extend and add
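The bit ranges in the formats translate directly into shifts and masks. The sketch below extracts every field from a 32-bit instruction word; the function name `decode` and the dictionary keys are hypothetical, and which fields are meaningful depends on the opcode.

```python
def decode(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "op":      (instr >> 26) & 0x3F,  # bits 31:26
        "rs":      (instr >> 21) & 0x1F,  # bits 25:21
        "rt":      (instr >> 16) & 0x1F,  # bits 20:16
        "rd":      (instr >> 11) & 0x1F,  # bits 15:11 (R-type only)
        "shamt":   (instr >> 6) & 0x1F,   # bits 10:6  (R-type only)
        "funct":   instr & 0x3F,          # bits 5:0   (R-type only)
        "address": instr & 0xFFFF,        # bits 15:0  (load/store/branch)
    }


# add $s1,$s2,$s3 with $s1 = 17, $s2 = 18, $s3 = 19 (standard MIPS numbering)
add_word = (0 << 26) | (18 << 21) | (19 << 16) | (17 << 11) | (0 << 6) | 0x20
# lw $s1,100($s2): opcode 35, base register in rs, destination in rt
lw_word = (35 << 26) | (18 << 21) | (17 << 16) | 100

add_fields = decode(add_word)
lw_fields = decode(lw_word)
```

In hardware no shifting takes place: each field is simply a group of wires taken from the instruction bus.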
Instruction Fetch
• PC: 32-bit register
• Adder: increment by 4 for next instruction
R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
Load/Store Instructions
• Read register operands
• Calculate address using 16-bit offset
  • Use ALU, but sign-extend offset
• Load: read memory and update register
• Store: write register value to memory
Branch Instructions
• Read register operands
• Compare operands
  • Use ALU, subtract and check Zero output
• Calculate target address
  • Sign-extend displacement
  • Shift left 2 places (word displacement)
  • Add to PC + 4
    • Already calculated by instruction fetch
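The branch steps can be sketched in a few lines: compare by subtracting and testing for zero, then form the target from PC + 4 and the sign-extended, shifted displacement. The function names are hypothetical, and the sketch assumes a 32-bit PC that wraps around.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, displacement: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    return (pc + 4 + (sign_extend16(displacement) << 2)) & MASK32


def next_pc_beq(pc: int, rs_value: int, rt_value: int, displacement: int) -> int:
    """beq: the ALU subtracts, and the Zero output decides the next PC."""
    zero = ((rs_value - rt_value) & MASK32) == 0
    return branch_target(pc, displacement) if zero else (pc + 4) & MASK32


forward = branch_target(0x100, 3)        # 0x104 + 12 = 0x110
backward = branch_target(0x100, 0xFFFF)  # 0x104 - 4  = 0x100
```

The shift by 2 exists because the displacement counts words and instructions are 4 bytes long.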
Branch Instructions
• Shift left 2: just re-routes wires
• Sign-extend: sign-bit wire replicated
Composing the Elements
• First-cut datapath does an instruction in one clock cycle
  • Each datapath element can only do one function at a time
  • Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
Figure 5.7: Data path for the R-type instructions
• Register access, then arithmetic operations: R3 <- R1 + R2; R3 <- R1 - R2
Figure 5.9: Data path for load/store
• Store path: M[R1+Immed] <- R2
• Load path: R2 <- M[R1+Immed]
R-Type/Load/Store Datapath
Figure 5.14: Data path for ALU instructions, load/store, and branch (single-cycle implementation)
• Figure labels: PC increment (PC++), branch, branch target address, major system state, register reads, PC values
• Mux: multiple sources
• Fan-out: multiple destinations
• Parallelism: speculative execution
Full Datapath
§4.4 A Simple Implementation Scheme
ALU Control Signals (Appendix C)
• ALU used for
  • Load/Store: F = add
  • Branch: F = subtract
  • R-type: F depends on funct field
ALU Control
• Use ALUOp to classify instructions
  • LW/SW/BEQ: ALU function depends solely on the opcode field
  • R-type: ALU function depends on both the opcode and funct fields
opcode   ALUOp   Operation          funct    ALU function       ALU control
lw       00      load word          XXXXXX   add                0010
sw       00      store word         XXXXXX   add                0010
beq      01      branch equal       XXXXXX   subtract           0110
R-type   10      add                100000   add                0010
                 subtract           100010   subtract           0110
                 AND                100100   AND                0000
                 OR                 100101   OR                 0001
                 set-on-less-than   101010   set-on-less-than   0111
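The ALU control table maps directly onto a small lookup function. This is a sketch of the table's logic only; the function name `alu_control` and the dictionary `R_TYPE_FUNCT` are hypothetical, and real hardware would implement this as combinational gates, often with don't-care simplifications that this sketch does not model.

```python
# funct field -> 4-bit ALU control, for R-type instructions (from the table)
R_TYPE_FUNCT = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Return the 4-bit ALU control value for a 2-bit ALUOp and funct field."""
    if alu_op == 0b00:   # lw / sw: always add, funct is ignored
        return 0b0010
    if alu_op == 0b01:   # beq: always subtract, funct is ignored
        return 0b0110
    if alu_op == 0b10:   # R-type: decided by the funct field
        return R_TYPE_FUNCT[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")


lw_ctl = alu_control(0b00, 0b111111)   # funct is a don't-care for loads
slt_ctl = alu_control(0b10, 0b101010)
```

Splitting decoding into ALUOp plus funct keeps the main control unit small, because it only has to classify the instruction rather than name the exact ALU function.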
The Main Control Unit
• Control signals derived from instruction
Format       31:26      25:21  20:16  15:11  10:6   5:0
R-type       0          rs     rt     rd     shamt  funct
Load/Store   35 or 43   rs     rt     address (15:0)
Branch       4          rs     rt     address (15:0)
• 31:26: opcode
• rs (25:21): always read
• rt (20:16): read, except for load
• rd (15:11) for R-type, rt (20:16) for load: write for R-type and load
• address (15:0): sign-extend and add
Datapath With Control
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Implementing Jumps
Format   31:26   25:0
Jump     2       address
• Jump uses word address
• Update PC with concatenation of
  • Top 4 bits of old PC
  • 26-bit jump address
  • 00
• Need an extra control signal decoded from opcode
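The concatenation for the jump target can be sketched with masks and shifts. The function name `jump_target` is hypothetical. The sketch takes the top 4 bits from the already-incremented PC (PC + 4), which is an assumption about where the datapath taps them; the result differs from using the old PC only when the increment crosses a 256 MB boundary.

```python
def jump_target(pc_plus_4: int, instr: int) -> int:
    """New PC = top 4 bits of PC+4, then the 26-bit address, then 00."""
    address26 = instr & 0x03FFFFFF      # bits 25:0 of the instruction
    upper4 = pc_plus_4 & 0xF0000000     # bits 31:28 of the PC value
    return upper4 | (address26 << 2)    # shifting left by 2 appends 00


# j to word address 0x100000, i.e. byte address 0x00400000
j_word = (2 << 26) | 0x100000
target = jump_target(0x00400004, j_word)
```

Storing a word address lets 26 bits cover a 28-bit byte range, since the two low bits of any instruction address are always 00.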
Datapath With Jumps Added
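Single Cycle Implementation
• Calculate cycle time assuming negligible delays except:
  • memory (2ns), ALU and adders (2ns), register file access (1ns)
• Find the critical path by summing the units each instruction class uses:
  • R-type: 2 + 1 + 2 + 1 = 6ns
  • LW: 2 + 1 + 2 + 2 + 1 = 8ns
  • SW: 2 + 1 + 2 + 2 = 7ns
  • BEQ: 2 + 1 + 2 = 5ns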
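The critical-path calculation can be reproduced by listing the units each instruction class passes through and summing the stated delays. The unit sequences in `PATHS` are an assumption based on the single-cycle datapath (a register write-back is counted as one 1ns register file access), and all names are hypothetical.

```python
# Delays from the slide, in nanoseconds
DELAY_NS = {"memory": 2, "alu": 2, "regfile": 1}

# Assumed sequence of units used by each instruction class
PATHS = {
    "R-type": ["memory", "regfile", "alu", "regfile"],
    "lw":     ["memory", "regfile", "alu", "memory", "regfile"],
    "sw":     ["memory", "regfile", "alu", "memory"],
    "beq":    ["memory", "regfile", "alu"],
}


def path_delay(units) -> int:
    """Total delay along one path, ignoring muxes, control, and wires."""
    return sum(DELAY_NS[u] for u in units)


delays = {name: path_delay(units) for name, units in PATHS.items()}
# {'R-type': 6, 'lw': 8, 'sw': 7, 'beq': 5}

# In a single-cycle design every instruction takes the slowest path's time.
cycle_time_ns = max(delays.values())  # 8
```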
The single-cycle datapath of an add instruction
• Instruction at address 100: add rd, rs, rt
• Encoding: 0 | rs | rt | rd | 0 | 0x20
• memory (2ns), ALU and adders (2ns), register file access (1ns)
[Figure: path through the datapath from start to end, using rs, rt, and rd]
The single-cycle datapath of a load instruction
• Instruction at address 104: lw $rt, offset($rs)
• Encoding: 0x23 | rs | rt | offset
Performance Issues
• Longest delay determines clock period
  • Critical path: load instruction
  • Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
  • Making the common case fast
• We will improve performance by pipelining
§4.5 An Overview of Pipelining
Pipelining Analogy
• Pipelined laundry: overlapping execution
  • Parallelism improves performance
• Four loads:
  • Speedup = 8 hr / 3.5 hr = 2.3
• Non-stop:
  • Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages
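Both speedup figures follow from one formula: n loads take 2n hours back to back, and 0.5n + 1.5 hours when overlapped. The sketch below assumes four stages of half an hour each, which is what those two expressions imply; the function name and parameters are hypothetical.

```python
def laundry_speedup(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of pipelined over sequential laundry for a number of loads."""
    sequential = loads * stages * stage_hours       # 2n hours
    # The first load fills the pipeline; each later load adds one stage time.
    pipelined = (stages + loads - 1) * stage_hours  # 0.5n + 1.5 hours
    return sequential / pipelined


four_loads = laundry_speedup(4)      # 8 / 3.5, about 2.3
many_loads = laundry_speedup(10**6)  # approaches 4, the number of stages
```

The speedup stays below the number of stages for any finite number of loads, because the time to fill the pipeline is never recovered.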
MIPS Pipeline
• Five stages, one step per stage
  • IF: Instruction fetch from memory
  • ID: Instruction decode & register read
  • EX: Execute operation or calculate address
  • MEM: Access memory operand
  • WB: Write result back to register
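How the five stages overlap can be shown with a small chart generator. This sketch assumes an ideal pipeline in which one instruction enters per cycle and nothing ever stalls; the function name `pipeline_chart` and the "--" filler for idle cycles are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(instruction_count: int):
    """Return one row per instruction and one column per clock cycle.

    Assumption: ideal pipeline, so instruction i enters IF in cycle i.
    """
    total_cycles = instruction_count + len(STAGES) - 1
    chart = []
    for i in range(instruction_count):
        row = ["--"] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage
        chart.append(row)
    return chart


chart = pipeline_chart(3)
for row in chart:
    print(" ".join(f"{cell:>3}" for cell in row))
```

Reading down any column shows which stage each instruction occupies in that cycle; with three instructions the chart spans 3 + 5 - 1 = 7 cycles.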