History

Nature of Computing
• CPU: program control unit, arithmetic-logic unit
• Main memory
• I/O
• Mechanical era, e.g. Babbage's Difference Engine
• Electronic era
  - First generation: ENIAC, EDVAC (true binary), IAS, Whirlwind I, UNIVAC, IBM 701; assembly language
  - Second generation (transistor era): floating point comes into use; IBM 7094
  - Third generation (ICs): System/360
  - VLSI era: very dense circuits containing thousands or millions of transistors

Design Methodology

Review
• System representation
  - Structure: directed graph, block diagram (e.g. a logic circuit)
  - Behavior (function f): the corresponding output f(a) can be determined for any given input a
• Output f(a) for xor(x1, x2): f(0,0) = 0, f(0,1) = 1, f(1,0) = 1, f(1,1) = 0

Top-down design
1. Specify the processor-level structure of the system
2. Specify the register-level structure of each component type identified in step 1
3. Specify the gate-level structure of each component type identified in step 2
• Gate level
  - Combinational logic: a combinational function, also referred to as a logic or Boolean function, is a mapping from the set of 2^n input combinations of n binary variables onto the output values 0 and 1
  - AND, OR, NAND, NOR, XOR, NOT gates
  - SOP (AND-OR) using minterms, POS (OR-AND) using maxterms
  - Karnaugh map

[Figure: serial adder built from a full adder (inputs x1, x2, Cin; outputs z, Cout) and a D flip-flop (Clear, Clock) that holds the carry]
• Flip-flops: JK, RS, D
• Sequential circuit: consists of a combinational circuit and a set of flip-flops
  - The combinational logic forms the computational or data-processing part of the circuit
  - The flip-flops store information on the circuit's past behavior
• In one clock cycle the serial adder computes c(i)z(i) = x1(i) plus x2(i) plus c(i-1), where c(i) is the carry and z(i) the sum bit
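The serial adder described above can be sketched in a few lines of Python. This is only an illustrative model, not a circuit description: the function names `full_adder` and `serial_add` are made up for this example, bits are passed least significant first, and the `carry` variable plays the role of the D flip-flop.

```python
def full_adder(x, y, cin):
    """Combinational part: return (sum bit, carry out) for three input bits."""
    s = x ^ y ^ cin
    cout = (x & y) | (x & cin) | (y & cin)
    return s, cout


def serial_add(x1_bits, x2_bits):
    """Add two equal-length bit lists, least significant bit first."""
    carry = 0  # flip-flop contents after Clear
    z = []
    for a, b in zip(x1_bits, x2_bits):  # one iteration models one clock cycle
        s, carry = full_adder(a, b, carry)  # c(i)z(i) = x1(i) + x2(i) + c(i-1)
        z.append(s)
    return z, carry
```

For example, `serial_add([1, 1, 0, 0], [1, 0, 1, 0])` adds 3 and 5 and needs four clock cycles, one per bit.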

Truth table of the full adder (previous page): this truth table can be used to build the combinational logic circuit
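As a rough illustration of going from a truth table to a sum-of-products circuit, the sketch below enumerates the full adder's truth table and collects the minterms (input rows where an output is 1). The helper names are hypothetical and the input order (x, y, cin) is an assumption of this example.

```python
from itertools import product


def full_adder(x, y, cin):
    """Return (sum, carry out) for one row of the truth table."""
    total = x + y + cin
    return total & 1, total >> 1


def truth_table():
    """All 8 rows as ((x, y, cin), (sum, cout))."""
    return [(row, full_adder(*row)) for row in product((0, 1), repeat=3)]


def minterms(output_index):
    """Input rows for which the chosen output (0 = sum, 1 = cout) is 1."""
    return [row for row, out in truth_table() if out[output_index] == 1]
```

Each minterm becomes one AND gate and the OR of all minterms gives the SOP form; `minterms(1)` lists the four rows that set the carry output.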

[Figure: multifunction unit with m-bit data inputs x1, x2, …, xn, control lines Function Select F and Enable E, Termination State output S, and m-bit outputs z1, z2]
• Register level
  - Termination state S: control output, active when its value is 0; indicates when and how the unit completes its processing
  - Function select F: specifies which of several possible operations to perform
  - Enable E: specifies the time or condition for the selected operation to be performed; often connected to clock sources

• Let X1, X2, …, Xn be m-bit binary words, Xi = (xi1, xi2, …, xim) for i = 1, 2, …, n
• A Boolean function z is applied to the words bit by bit: z(X1, X2, …, Xn) = [z(x11, x21, …, xn1), z(x12, x22, …, xn2), …, z(x1m, x2m, …, xnm)]
• Ex.: Z = XY (AND of each bit pair): (z1, z2, …, zm) = (x1y1, x2y2, …, xmym)
• MUX
  - Selects one of its inputs to transmit to the output; intended to route data from one of several sources to a common destination
  - Select F: p bits; input X: up to 2^p inputs of m bits each; output Z: m bits. If F = 000…01 then X1 is selected

Decoder
• 1-out-of-2^n or 1/2^n decoder: routes data from a common source to one of several destinations
• Combinational circuit with n input data lines and 2^n output data lines such that each of the 2^n possible input combinations Xi activates (sets to 1) exactly one of the output lines Zi
Encoder
• Used to generate the address or name of an active input line
• 2^k input data lines and k output data lines
• Ex.: k = 3, data input lines x0 x1 x2 x3 x4 x5 x6 x7 = 00000010, data output lines z2 z1 z0 = 110
• Problem: if more than one input is active, use a priority encoder
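The decoder and encoder behavior described above can be mimicked with small Python functions. This is a behavioral sketch only; the function names are invented here, and the rule that the highest-numbered active line wins in the priority encoder is an assumption, since the slide does not state the priority order.

```python
def decode(value, n):
    """1-out-of-2**n decoder: exactly one of the 2**n output lines is 1."""
    return [1 if i == value else 0 for i in range(2 ** n)]


def encode(lines):
    """Encoder: index of the single active input line."""
    if sum(lines) != 1:
        raise ValueError("a plain encoder needs exactly one active input")
    return lines.index(1)


def priority_encode(lines):
    """Priority encoder: index of the highest-numbered active line (assumed rule)."""
    for i in reversed(range(len(lines))):
        if lines[i]:
            return i
    return None  # no input active
```

With the slide's example, `encode([0, 0, 0, 0, 0, 0, 1, 0])` returns 6, which is 110 in binary.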

Arithmetic elements
• Register: an m-bit register is an ordered set of m flip-flops designed to store an m-bit word (z0, z1, …, zm-1); each bit is held in one flip-flop
• Shift register
  - Right shift: (z0, z1, …, zm-1) -> (x, z0, z1, …, zm-2)
  - Left shift: (z0, z1, …, zm-1) -> (z1, z2, …, zm-1, x)
• Register-level design tends to be heuristic and depends heavily on the designer's expertise. The design procedure is as follows:
  1. Define the desired behavior by a set of sequences of register-transfer operations -> algorithm AL to be executed
  2. Analyze AL to determine the types of components and the number of each type required for the datapath DP
  3. Construct a block diagram for DP using the components identified in step 2. Make connections between components so that all data paths implied by AL are present and the given performance-cost constraints are met
  4. Analyze AL and DP to identify the control signals needed
  5. Design a control unit CU for DP that meets all the requirements of AL
  6. Verify, typically by computer simulation, that the final design operates correctly and meets all performance-cost goals

Processor level
• Components: CPU, memory, I/O, interconnection network
• Processor-level design: usually takes a prototype design of known performance and modifies it where necessary to accommodate new technology or meet new performance requirements

Performance Evaluation

• Throughput: total amount of work done in a given time
• Response time (execution time): time between the start and completion of a task
• Goal: reduce response time and increase throughput
• Performance = 1 / execution time
• X is n times faster than Y -> performance_X / performance_Y = n
• T = N / IPS, where T is the execution time, N the number of instructions executed, and IPS the number of instructions per second
• CPI = clock rate / IPS
• T = N × CPI / clock rate
• CPU clock cycles = sum for i = 1 to n of (CPI_i × C_i), where n = number of instruction classes, CPI_i = CPI for class i, C_i = count of instructions of class i
• MFLOPS = number of floating-point operations in a program / (execution time × 10^6)
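The formulas above can be checked with a small calculation. The sketch below is illustrative only: the function names are invented, and the instruction mix and the 500 MHz clock rate are made-up numbers, not measurements.

```python
def cpu_clock_cycles(classes):
    """Sum of CPI_i * C_i over instruction classes given as (CPI_i, C_i) pairs."""
    return sum(cpi * count for cpi, count in classes)


def execution_time(classes, clock_rate_hz):
    """T = CPU clock cycles / clock rate, equivalent to T = N * CPI / clock rate."""
    return cpu_clock_cycles(classes) / clock_rate_hz


def average_cpi(classes):
    """Overall CPI = total cycles / total instruction count N."""
    n = sum(count for _, count in classes)
    return cpu_clock_cycles(classes) / n


# Hypothetical mix: (CPI_i, C_i) for three instruction classes.
mix = [(1, 5_000_000), (2, 1_000_000), (3, 1_000_000)]
cycles = cpu_clock_cycles(mix)            # 10,000,000 cycles
seconds = execution_time(mix, 500e6)      # assumed 500 MHz clock
```

Performance is then `1 / seconds`, so halving the cycle count at the same clock rate doubles performance.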

Instruction

Instruction architecture
Execution cycle: Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction
• Instruction format or encoding
  - how is it decoded?
• Location of operands and result
  - where other than memory?
  - how many explicit operands?
  - how are memory operands located?
  - which can or cannot be in memory?
• Data type and size
• Operations
  - which are supported?
• Successor instruction
  - jumps, conditions, branches

MIPS instruction formats

    Format   31-26    25-21   20-16   15-11   10-6    5-0
    R-type   opcode   rs      rt      rd      shft    funct
    I-type   opcode   rs      rt      immediate field (15-0)
    J-type   opcode   branch address (25-0)

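MIPS addressing modes
• Register (direct): op rs rt rd; the operand is in a register
• Immediate: op rs rt immed; the operand is the immediate field itself
• Base+index: op rs rt immed; the operand is in memory at address register + immed
• PC-relative: op rs rt immed; the target is in memory at address PC + immed
• Pseudodirect addressing: op address; the target address in memory is formed from the address field combined with the PC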

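High-level program -> assembly -> machine code
• A[300] = h + A[300]; suppose $t1 contains the base address of A and $s0 contains h

    lw  $t0, 1200($t1)
    add $t0, $s0, $t0
    sw  $t0, 1200($t1)

• Machine code (decimal field values):

    inst   op   rs   rt   rd   shamt   funct   immediate
    lw     35    9    8                        1200
    add     0   16    8    8     0      32
    sw     43    9    8                        1200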
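To see how the field values pack into 32-bit words, here is a small sketch that encodes the three instructions of the example with the fields in the standard order (op, rs, rt, then rd/shamt/funct or the immediate). The helper names `encode_r` and `encode_i` are made up for this illustration.

```python
def encode_r(rs, rt, rd, shamt, funct, op=0):
    """Pack an R-type instruction: op | rs | rt | rd | shamt | funct."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """Pack an I-type instruction: op | rs | rt | 16-bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


T0, T1, S0 = 8, 9, 16  # register numbers of $t0, $t1, $s0

lw_word = encode_i(35, T1, T0, 1200)     # lw  $t0, 1200($t1)
add_word = encode_r(S0, T0, T0, 0, 32)   # add $t0, $s0, $t0
sw_word = encode_i(43, T1, T0, 1200)     # sw  $t0, 1200($t1)
```

Printing the words with `hex()` shows that lw and sw differ only in the opcode field.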

    Name      Number   Usage
    $zero     0        always contains 0
    $at       1        reserved for the assembler
    $v0-$v1   2-3      results of a function
    $a0-$a3   4-7      arguments to a function
    $t0-$t7   8-15     temporaries (not preserved across calls)
    $s0-$s7   16-23    saved temporaries (preserved across calls)
    $k0-$k1   26-27    reserved for the OS kernel
    $gp       28       pointer to the global area
    $sp       29       stack pointer
    $fp       30       frame pointer
    $ra       31       return address (used by function calls)

Signed and unsigned numbers
• Overflow is checked for signed operations
• Unsigned operations are used for address calculation, where overflow is sometimes ignored
• EPC (exception program counter) register: contains the address of the instruction that caused the overflow
• mfc0 (move from system control): instruction used to copy EPC into a general-purpose register, so that MIPS has the option of returning to the offending instruction via a jump register instruction

Multiplication

Booth's algorithm
• The action depends on the current bit and the previous bit (the bit to its right) of the multiplier
  - 00 or 11: no arithmetic operation
  - 01: add the multiplicand to the left half of the product
  - 10: subtract the multiplicand from the left half of the product
• Shift the product register 1 bit right (arithmetic shift, so the sign is preserved)
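The steps above can be sketched in Python as follows. This is a simplified model, not the hardware: the left half of the product is kept in an ordinary (unbounded, signed) Python integer called `acc`, so fixed-width overflow of the left half is not modeled, and the names `booth_multiply`, `acc`, and `q` are choices made for this example.

```python
def booth_multiply(multiplicand, multiplier, n=32):
    """Multiply two n-bit two's complement integers using Booth's algorithm."""
    mask = (1 << n) - 1
    acc = 0                  # left half of the product register (signed)
    q = multiplier & mask    # right half, initially the multiplier bits
    prev = 0                 # the bit to the right of the current bit
    for _ in range(n):
        cur = q & 1
        if (cur, prev) == (0, 1):      # 01: add the multiplicand
            acc += multiplicand
        elif (cur, prev) == (1, 0):    # 10: subtract the multiplicand
            acc -= multiplicand
        prev = cur
        # Arithmetic shift of the combined (acc, q) register 1 bit right.
        q = (q >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1            # Python's >> keeps the sign of negative ints
    return (acc << n) | q
```

For instance, `booth_multiply(2, -3, n=4)` walks through the multiplier bits 1101 from right to left and yields -6.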

Division

First version
• Registers: 32-bit quotient, 64-bit divisor, 64-bit remainder (initialized with the dividend)
• ALU: 64 bits
• Algorithm
  1. Remainder = Remainder - Divisor
  2. If Remainder < 0: Remainder = Remainder + Divisor (restore), then shift the quotient register 1 bit left and set LSB = 0. Else: shift the quotient register 1 bit left and set LSB = 1
  3. Shift the divisor register 1 bit right
  4. Loop

Final version
• Eliminates the quotient register
• Registers: 32-bit divisor, 64-bit remainder (right half: quotient, initialized with the dividend; left half: remainder)
• ALU: 32 bits
• Algorithm
  1. Shift the remainder register 1 bit left
  2. Remainder(left half) = Remainder(left half) - Divisor
  3. If Remainder < 0: Remainder(left half) = Remainder(left half) + Divisor (restore), then shift the remainder register 1 bit left and set LSB = 0. Else: shift the remainder register 1 bit left and set LSB = 1
  4. Loop (repeat from step 2)
  5. Shift the left half of the remainder register 1 bit right

Signed division
• Dividend = Quotient × Divisor + Remainder must hold
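A minimal sketch of the final version for unsigned operands is shown below. It assumes the loop runs once per quotient bit (n times for n-bit operands) and uses an unbounded Python integer for the remainder register, so it does not model the exact 64-bit register width; the function name `divide` is invented for this example.

```python
def divide(dividend, divisor, n=32):
    """Restoring division with a shared remainder/quotient register (unsigned)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    mask = (1 << n) - 1
    rem = dividend & mask      # right half starts as the dividend, left half is 0
    rem <<= 1                  # initial shift left by 1 bit
    for _ in range(n):
        left = (rem >> n) - divisor          # left half minus the divisor
        if left < 0:
            rem = rem << 1                   # restore (keep old left half), LSB = 0
        else:
            rem = (((left << n) | (rem & mask)) << 1) | 1   # keep result, LSB = 1
    quotient = rem & mask      # right half
    remainder = (rem >> n) >> 1   # left half, shifted 1 bit right
    return quotient, remainder
```

`divide(7, 2, n=4)` returns `(3, 1)`, which satisfies Dividend = Quotient × Divisor + Remainder.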

Floating Point

Float or single precision (one 32-bit register)

    bit 31: S (sign) | bits 30-23: E (exponent) | bits 22-0: F (fraction)

Double or double precision (two 32-bit registers)

    bit 31: S (sign) | bits 30-20: E (exponent) | bits 19-0: F (fraction)
    bits 31-0 of the second word: F (cont.)

• Value: (-1)^S × F × 2^E
• Overflow: the exponent is too large to be represented in the exponent field (E)
• Underflow: the negative exponent is too large to fit in E

IEEE 754 format
• N = (-1)^S × 2^(E-127) × (1.F) for float
• N = (-1)^S × 2^(E-1023) × (1.F) for double
Floating-point addition
• Align the binary point (shift the significand of the number with the smaller exponent)
• Add the significands
• Normalize the sum and check for overflow or underflow
• Round
Floating-point multiplication
• Add the exponents
• Multiply the significands
• Normalize
• Round
• Determine the sign of the product
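The single-precision formula can be tried out with Python's standard `struct` module, which exposes the raw 32 bits of a float. The sketch below is illustrative and covers normalized numbers only (zero, denormals, infinities, and NaN are ignored); the function names are made up.

```python
import struct


def decode_single(x):
    """Split x (rounded to single precision) into the fields (S, E, F)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def value_of(s, e, f):
    """N = (-1)^S * 2^(E-127) * (1.F), valid for normalized numbers (0 < E < 255)."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)
```

For -0.75 = -1.5 × 2^-1 the fields are S = 1, E = 126, and F = 0x400000 (binary .1000…), and `value_of` maps them back to -0.75.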

Floating-point instructions in MIPS
• lwc1, swc1: load and store
• Floating-point (FP) registers: $f0, $f1, …, $f31
• Even-numbered registers are used for single precision
• A pair of FP registers is used for double precision -> the odd-numbered FP register is used to load and store

Arithmetic Logic Unit

• 1-bit ALU: AND gate, OR gate, inverter, MUX
• 32-bit ALU
  - Ripple carry: adder created by directly linking the carries of 1-bit adders; a carry out of the LSB can ripple all the way through the adder, causing a carry out of the MSB
  - Subtraction: add the negated operand (two's complement): a + ~b + 1 = a + (~b + 1) = a + (-b) = a - b
  - slt (set on less than): if rs < rt, set LSB = 1, else LSB = 0; when a - b < 0 the sign bit of the result is 1, so the sign bit is sent to the LSB of the result
  - Branch (equality test): a - b = 0 -> zero = ~(result31 + result30 + … + result0)
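The subtraction, slt, and zero-detect tricks above can be tried on 32-bit words in Python. This is a word-level sketch with invented function names, not a gate-level model, and like the slide's simple rule it takes the sign bit of a - b directly, so it ignores the overflow case.

```python
MASK = 0xFFFFFFFF  # 32-bit word


def alu_sub(a, b):
    """a - b computed as a + ~b + 1 (two's complement) on 32-bit words."""
    return (a + (~b & MASK) + 1) & MASK


def slt(a, b):
    """Set on less than: the sign bit of a - b becomes the LSB of the result."""
    return alu_sub(a, b) >> 31


def zero(result):
    """Zero flag: NOR of all 32 result bits."""
    return int(result & MASK == 0)
```

A beq comparison is then just `zero(alu_sub(a, b))`, which is 1 exactly when the two operands are equal.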

• ALU control lines: MSB (Bnegate), last 2 bits (Operation)
• Carry lookahead
  - generate (gi): gi = ai·bi
  - propagate (pi): pi = ai + bi
• 4-bit adder
  - c1 = g0 + p0c0
  - c2 = g1 + p1g0 + p1p0c0
  - c3 = g2 + p2g1 + p2p1g0 + p2p1p0c0
  - c4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0c0

16-bit adder
• Super propagate (Pi):
  - P0 = p3p2p1p0
  - P1 = p7p6p5p4
  - P2 = p11p10p9p8
  - P3 = p15p14p13p12
• Super generate (Gi):
  - G0 = g3 + p3g2 + p3p2g1 + p3p2p1g0
  - G1 = g7 + p7g6 + p7p6g5 + p7p6p5g4
  - G2 = g11 + p11g10 + p11p10g9 + p11p10p9g8
  - G3 = g15 + p15g14 + p15p14g13 + p15p14p13g12
• Carries into and out of the 4-bit blocks:
  - C1 = G0 + P0c0
  - C2 = G1 + P1G0 + P1P0c0
  - C3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
  - C4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
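The two-level carry-lookahead equations can be checked against ordinary addition with a short script. The sketch below is a software model under the assumption that bit 0 is the LSB; the names `lookahead` and `add16` are invented, and the same four carry equations are reused for both the bit level (g, p) and the block level (G, P).

```python
def lookahead(g, p, c0):
    """Carries c1..c4 from four generate/propagate signals and the carry in c0."""
    c1 = g[0] | p[0] & c0
    c2 = g[1] | p[1] & g[0] | p[1] & p[0] & c0
    c3 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c0
    c4 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1] | p[3] & p[2] & p[1] & g[0]
          | p[3] & p[2] & p[1] & p[0] & c0)
    return [c1, c2, c3, c4]


def add16(a, b, c0=0):
    """16-bit addition with two-level carry lookahead; returns (sum, carry out)."""
    abit = [(a >> i) & 1 for i in range(16)]
    bbit = [(b >> i) & 1 for i in range(16)]
    g = [x & y for x, y in zip(abit, bbit)]   # generate
    p = [x | y for x, y in zip(abit, bbit)]   # propagate
    P, G = [], []
    for k in range(0, 16, 4):                 # super propagate / super generate
        p0, p1, p2, p3 = p[k:k + 4]
        g0, g1, g2, g3 = g[k:k + 4]
        P.append(p3 & p2 & p1 & p0)
        G.append(g3 | p3 & g2 | p3 & p2 & g1 | p3 & p2 & p1 & g0)
    block_c = [c0] + lookahead(G, P, c0)      # c0, C1, C2, C3, C4
    total = 0
    for blk in range(4):
        k = 4 * blk
        carries = [block_c[blk]] + lookahead(g[k:k + 4], p[k:k + 4], block_c[blk])
        for i in range(4):
            total |= (abit[k + i] ^ bbit[k + i] ^ carries[i]) << (k + i)
    return total, block_c[4]
```

`add16(0xFFFF, 1)` returns `(0, 1)`: every block propagates, so the carry in at bit 0 appears directly as C4.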

Single Cycle datapath

How to Design a Processor: step-by-step
1. Analyze the instruction set => datapath requirements
   - the meaning of each instruction is given by the register transfers
   - the datapath must include storage elements for the ISA registers (possibly more)
   - the datapath must support each register transfer
2. Select a set of datapath components and establish a clocking methodology
3. Assemble a datapath meeting the requirements
4. Analyze the implementation of each instruction to determine the setting of control points that effects the register transfer
5. Assemble the control logic

The MIPS Instruction Formats
• All MIPS instructions are 32 bits long. The three instruction formats:

    R-type   op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | rd (5 bits, 15-11) | shamt (5 bits, 10-6) | funct (6 bits, 5-0)
    I-type   op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | immediate (16 bits, 15-0)
    J-type   op (6 bits, 31-26) | target address (26 bits, 25-0)

• The different fields are:
  - op: operation of the instruction
  - rs, rt, rd: the source and destination register specifiers
  - shamt: shift amount
  - funct: selects the variant of the operation in the "op" field
  - address / immediate: address offset or immediate value
  - target address: target address of the jump instruction

Step 1a: The MIPS-lite Subset for today
• ADD and SUB (R-type: op | rs | rt | rd | shamt | funct)
  - addU rd, rs, rt
  - subU rd, rs, rt
• OR Immediate (I-type: op | rs | rt | immediate)
  - ori rt, rs, imm16
• LOAD and STORE Word (I-type)
  - lw rt, rs, imm16
  - sw rt, rs, imm16
• BRANCH (I-type)
  - beq rs, rt, imm16

Logical Register Transfers
• RTL gives the meaning of the instructions
• All start by fetching the instruction:

    op | rs | rt | rd | shamt | funct = MEM[ PC ]
    op | rs | rt | Imm16 = MEM[ PC ]

    inst    Register Transfers
    ADDU    R[rd] <- R[rs] + R[rt]; PC <- PC + 4
    SUBU    R[rd] <- R[rs] - R[rt]; PC <- PC + 4
    ORi     R[rt] <- R[rs] | zero_ext(Imm16); PC <- PC + 4
    LOAD    R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ]; PC <- PC + 4
    STORE   MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt]; PC <- PC + 4
    BEQ     if ( R[rs] == R[rt] ) then PC <- PC + 4 + (sign_ext(Imm16) || 00) else PC <- PC + 4
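One way to read the register transfers is as a tiny interpreter. The sketch below is an illustrative model only: it uses a Python dict as an idealized unified memory keyed by byte address, takes the branch target as PC + 4 + (sign_ext(Imm16) << 2) as in the MIPS architecture, and uses the standard MIPS opcode and funct values; all function names are invented.

```python
MASK = 0xFFFFFFFF


def sign_ext(imm16):
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(regs, mem, pc):
    """Execute the instruction at mem[pc] and return the next PC."""
    inst = mem[pc]
    op, rs, rt = inst >> 26, (inst >> 21) & 31, (inst >> 16) & 31
    rd, funct, imm = (inst >> 11) & 31, inst & 63, inst & 0xFFFF
    next_pc = pc + 4
    if op == 0 and funct == 0x21:        # ADDU
        regs[rd] = (regs[rs] + regs[rt]) & MASK
    elif op == 0 and funct == 0x23:      # SUBU
        regs[rd] = (regs[rs] - regs[rt]) & MASK
    elif op == 0x0D:                     # ORi (zero-extended immediate)
        regs[rt] = regs[rs] | imm
    elif op == 0x23:                     # LOAD
        regs[rt] = mem[(regs[rs] + sign_ext(imm)) & MASK]
    elif op == 0x2B:                     # STORE
        mem[(regs[rs] + sign_ext(imm)) & MASK] = regs[rt]
    elif op == 0x04:                     # BEQ
        if regs[rs] == regs[rt]:
            next_pc = pc + 4 + (sign_ext(imm) << 2)
    else:
        raise ValueError(f"unsupported instruction {inst:#010x}")
    regs[0] = 0                          # register 0 always reads as 0
    return next_pc


def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct


def i_type(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def run(program):
    """Load the program at address 0 and run until PC leaves it."""
    regs, mem, pc = [0] * 32, {4 * i: w for i, w in enumerate(program)}, 0
    while 0 <= pc < 4 * len(program):
        pc = step(regs, mem, pc)
    return regs, mem


demo = [
    i_type(0x0D, 0, 1, 5),      # ori  $1, $0, 5
    i_type(0x0D, 0, 2, 7),      # ori  $2, $0, 7
    r_type(1, 2, 3, 0x21),      # addu $3, $1, $2
    i_type(0x2B, 0, 3, 0x100),  # sw   $3, 0x100($0)
    i_type(0x23, 0, 4, 0x100),  # lw   $4, 0x100($0)
    i_type(0x04, 3, 4, 1),      # beq  $3, $4, skip the next instruction
    i_type(0x0D, 0, 5, 1),      # ori  $5, $0, 1 (skipped when the branch is taken)
    r_type(2, 1, 6, 0x23),      # subu $6, $2, $1
]
```

Running `run(demo)` exercises every instruction of the subset once, including a taken branch.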

Step 1: Requirements of the Instruction Set
• Memory: instruction and data
• Registers (32 x 32)
  - read RS
  - read RT
  - write RT or RD
• PC
• Extender
• Add and Sub of a register or an extended immediate
• Add 4 or an extended immediate to the PC

Step 2: Components of the Datapath • Combinational Elements • Storage Elements • Clocking methodology

Combinational Logic Elements (Basic Building Blocks)
• Adder: 32-bit inputs A and B plus CarryIn; 32-bit output Sum plus Carry
• MUX: 32-bit inputs A and B plus Select; 32-bit output Y
• ALU: 32-bit inputs A and B plus OP; 32-bit output Result

Storage Element: Register (Basic Building Block)
• Register: similar to the D flip-flop except
  - N-bit input (Data In) and output (Data Out)
  - Write Enable input
• Write Enable:
  - negated (0): Data Out will not change
  - asserted (1): Data Out will become Data In
• Clock input (Clk)

Storage Element: Register File
• The register file consists of 32 32-bit registers:
  - Two 32-bit output busses: busA and busB
  - One 32-bit input bus: busW
• A register is selected by (5-bit register numbers):
  - RA (number) selects the register to put on busA (data)
  - RB (number) selects the register to put on busB (data)
  - RW (number) selects the register to be written via busW (data) when Write Enable is 1
• Clock input (CLK)
  - The CLK input is a factor ONLY during the write operation
  - During the read operation, the register file behaves as a combinational logic block: RA or RB valid => busA or busB valid after "access time"
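A behavioral model makes the read/write asymmetry concrete: reads act like combinational logic, while writes happen only on a clock edge with Write Enable asserted. The class below is a sketch with invented names (`RegisterFile`, `read`, `clock_edge`); it does not model access time.

```python
class RegisterFile:
    """32 registers of 32 bits with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Combinational read: returns (busA, busB) for register numbers RA, RB."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw, bus_w, write_enable):
        """On a clock edge, write busW into register RW only if Write Enable is 1."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF
```

Calling `clock_edge(5, 42, 0)` leaves register 5 unchanged, while the same call with Write Enable set to 1 makes 42 visible on the next `read(5, 0)`.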

Storage Element: Idealized Memory
• Memory (idealized)
  - One input bus: Data In (32 bits)
  - One output bus: Data Out (32 bits)
• A memory word is selected by:
  - Address selects the word to put on Data Out
  - Write Enable = 1: Address selects the memory word to be written via the Data In bus
• Clock input (CLK)
  - The CLK input is a factor ONLY during the write operation
  - During the read operation, the memory behaves as a combinational logic block: Address valid => Data Out valid after "access time"

Clocking Methodology
[Figure: clock waveform Clk with Setup and Hold windows around each clock edge; data is "Don't Care" outside those windows]
• All storage elements are clocked by the same clock edge
• Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
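The two timing constraints can be written as simple checks. The sketch below only restates the formulas above; the function names are invented and the delay values in the usage note are made-up numbers in picoseconds.

```python
def min_cycle_time(clk_to_q, longest_path, setup, skew):
    """Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew."""
    return clk_to_q + longest_path + setup + skew


def hold_ok(clk_to_q, shortest_path, skew, hold):
    """Hold constraint: CLK-to-Q + Shortest Delay Path - Clock Skew > Hold Time."""
    return clk_to_q + shortest_path - skew > hold
```

With hypothetical delays of 50 ps (CLK-to-Q), 800 ps (longest path), 30 ps (setup), and 20 ps (skew), the minimum cycle time is 900 ps; a 10 ps shortest path with a 30 ps hold time still passes the hold check.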

Step 3 • Register Transfer Requirements–> Datapath Assembly • Instruction Fetch • Read Operands and Execute Operation

3a: Overview of the Instruction Fetch Unit
[Figure: the PC (clocked by Clk) supplies the Address to the Instruction Memory, which outputs the 32-bit Instruction Word; the Next Address Logic updates the PC]
• The common RTL operations
  - Fetch the instruction: mem[PC]
  - Update the program counter:
    - Sequential code: PC <- PC + 4
    - Branch and jump: PC <- "something else"

3b: Add & Subtract
R-type format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | rd (5 bits, 15-11) | shamt (5 bits, 10-6) | funct (6 bits, 5-0)
• R[rd] <- R[rs] op R[rt]; example: addU rd, rs, rt
  - Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields
  - ALUctr and RegWr: control logic after decoding the instruction
[Figure: the register file (32 32-bit registers; 5-bit Rw, Ra, Rb from Rd, Rs, Rt; RegWr; Clk) drives the 32-bit busA and busB into the ALU (controlled by ALUctr); the 32-bit Result returns to the register file on busW]

Register-Register Timing
[Timing diagram: after the clock edge, PC changes from old to new value after Clk-to-Q; Rs, Rt, Rd, Op, Func become valid after the instruction memory access time; ALUctr and RegWr become valid after the delay through the control logic; busA and busB become valid after the register file access time; busW becomes valid after the ALU delay; the register write occurs at the next clock edge]
