Computer Architecture

Computer Architecture We will use a quantitative approach to analyze architectures and potential improvements and see how well they work (if at all) We study RISC instruction sets to promote instruction-level, block-level and thread-level parallelism Pipelining, superscalar, branch speculation, vector processing, multi-core & parallel processing Out of order completion architectures Compiler optimizations Improving cache performance (and virtual memory performance if time permits) Early on, we concentrate on the MIPS 5-stage pipeline, later we will also look at other approaches including the Pentium processing

Performance Measures Many different values can be used MIPS, MegaFLOPS – misleading values Clock Speed – does not take into account parallelism/pipeline, stalls, cache, etc Execution time – we use this to compare benchmarks but we have to make sure the benchmarks were run equally (loaded system, unloaded system, etc) Throughput – number of programs per unit of time, possibly useful for servers CPU time, user CPU time, system CPU time CPU performance = 1 / execution time What does it really mean for 1 computer to be faster than another? If we use a benchmark suite and 1 computer consistently outperforms the other, this is useful, otherwise we have to take into account the types of programs where one computer was better than the other

SPEC2006 Bench-marks

Design Concepts Take advantage of parallelism There are many opportunities for parallelism in code Use multiple hardware components (ALU functional units, register ports, caches/memory modules, disk drive access, etc) Distribute instructions to hardware components in an overlapped (pipelined) or distributed (parallel) fashion Use hardware and software approaches Principle of locality of reference Design memory systems to support this aspect of program and data access Focus on the common case Amdahl’s Law (next slide) demonstrates that minor improvements to the common case is usually more useful than large improvements to rare cases Find ways to enhance the hardware for common cases over rare cases

Amdahl’s Law When comparing two systems, we view the speedup as CPU time of old system / CPU time of new system E.g., old system takes 10.5, new system takes 5.25, speedup = 10.5 / 5.25 = 2 Speedup of one enhancement 1 / (1 – F + F / S) F = fraction of the time the enhancement can be used S = the speedup of the enhancement itself (that is, how much faster the computer runs when the enhancement is in use) Example: an integer processor performs FP operations in software routines, a benchmark consists of 14% FP operations, a co-processor could perform all FP operations 4 times faster, if we add the co-processor, our speedup is 1 / (1 - .14 + .14 / 4) = 1.12, or a 12% speedup Why does Amdahl’s Law promote the “common case”? Since we have a reciprocal, the smaller the value, the greater the speedup The denominator subtracts F from 1 and adds F / S, so –F will have a larger impact than F / S

Examples Web server enhancements: Faster CPU (10 times faster on computation, 30% CPU operations, 70% I/O) speedup = 1 / (1 - .3 + .3 / 10) = 1.37 (37% speedup) More hard disk space that improves I/O performance by 2.5 speedup = 1 / (1 - .7 + .7 / 2.5) = 1.72 (72% speedup) Select the common case A benchmark has 20% FP square root operations, 50% total FP operations, 50% other Add an FP sqrt unit with a speedup of 10 speedup = 1 / (1 - .2 + .2 / 10) = 1.22 Add a new FP ALU with a speedup of 1.6 for all FP ops speedup = 1 / (1 - .5 + .5 / 1.6) = 1.23 Again, the common case is the better way to go (slightly)

Another Example Architects have suggested a new feature that can be used 20% of the time and offers a speedup of 3 One architect though feels that he can provide a better enhancement that will offer a 7 time speedup for that particular feature What percentage of the time would the second feature have to be used to match the first enhancement? Speedup from feature 1 = 1 / (1 - .2 + .2 / 3) = 1.154 For speedup from feature 2 = 1.154, we need to solve for x where 1.154 = 1 / (1 – x + x / 7) Algebra gives us the following 1 – x + x / 7 = 1 / 1.154 = .867 1 - .867 = x – x / 7  .133 = (7x – x) / 7 = 6x / 7 7 * .133 / 6 = x = .156, so the second feature would have to be used 15.6% of the time to offer the same speedup

CPU Performance Formulae Another way to compute speedup is to compute the CPU’s performance before and after some enhancement(s) We will use the following formulae CPU time = CPU clock cycles * clock cycle time CPU clock cycles = number of elapsed clock cycles CPU clock cycles = instruction count (IC) * clock cycles per instruction (CPI) = IC * CPI not all instructions will have the same CPI, so we might have to compute this as (SCPIi * ICi) for all classes of instructions i For instance, we might have loads, stores, branches, ALU (integer) operations, FP operations with CPIs of 5, 4, 3, 2 and 10 respectively Clock cycle time = 1 / clock rate (we will abbreviate clock cycle time as CCT going forward) Given two enhancements, compute their CPU exeuction time, speedup of machine 2 over machine 1= CPU time machine1 / CPU time machine2

Example Consider that we can either enhance the FP sqrt unit or enhance all FP units IC breakdown: 25% FP operations, 2% of which are FP square root operations, 75% all other instructions CPI: 4.0 for FP operations (on average across all FP operations), 20 for FP sqrt, 1.33 for all other instructions CPI original machine = 25% * 4.0 + 75% * 1.33 = 2.00 If we enhance all FP units, the overall CPI for FP operations becomes 2.5, if we enhance just the FP sqrt, it reduces to 2.0 Compute the CPU time of each (note that IC and clock rate (CCT) remain the same) CPI all FP = 75% * 1.33 + 25% * 2.5 = 1.625 Speedup enhancing all FP = (IC * 2.00 * CCT) / (IC * 1.625 * CCT) = 1.23 CPI FP sqrt= CPI original – 2% * (20 – 2) = 1.64 Speedup enhancing FP sqrt = (IC * 2.00 * CCT) / (IC * 1.64 * CCT) = 1.22 Enhancing all FP is better by 1.64 / 1.625 = 1.01, or about 1%

Another Example Our current machine has a load-store architecture and we want to know whether we should introduce a register-memory mode for ALU operations Assume a benchmark of 21% loads, 12% stores, 43% ALU operations and 24% branches CPI is 2 for all instructions except ALU which is 1 The new mode will lengthen the ALU CPI of 2, and it also, as a side effect, lengthens Branch CPI to 3 The IC will be reduced because we need fewer loads, let’s assume this new mode will be used in 25% of all ALU operations Use the CPU execution time formula to determine the speedup of the new addressing mode

Solution The number of ALU operations that will use this new mode is 25%, or 43% * 25% = 11% This means that we will have 11% fewer instructions so ICnew = 89% * ICold Those dropped instructions will all be loads, so we will have a different breakdown of instruction mix Loads = (21% - 11%) / 89% = 11% Stores = 12% / 89% = 13% ALU = 43% / 89% = 48% Branches = 24% / 89% = 27% CPIold= 43% * 1 + 57% * 2 = 1.57 CPInew = (48% + 11% + 13%) * 2 + 27% * 3 = 1.89 CPU execution time old = IC * 1.57 * CCT CPU execution time new = .89 * IC * 1.89 * CCT Speedup = 1.57 / (.89 * 1.89) = .933, which is actually a slowdown! We would not want to use this enhancement

Which Formula? In a previous example, we solved the problem of FP sqrt or all FP units which we had solved earlier (slide 6) Which approach should we use? Depends on what information we are given, notice in using Amdahl’s law, we know the fraction of time an enhancement can be used and how much speedup that enhancement gives us We can compute the same thing in the CPU time formula Let’s try another example to prove it Benchmark consists of 35% loads, 15% stores, 40% ALU and 10% branch operations with a CPI breakdown of 5 (loads/stores), 4 (ALU branches) Enhancement: since we have separate INT and FP registers and this benchmark does not use the FP registers, can we use a compiler to move values from INT to FP registers and back rather than using the slower loads & stores? Yes. How much speedup will this give us? Assume the compiler can reduce the loads/stores by 20% because of this enhancement

Solution CPI goes down, IC remains the same, CPU clock time is unchanged Solution using CPU Time formula CPIold= 50% * 5 + 50% * 4 = 4.5 20% of the loads/stores now become register moves, so our new breakdown of instructions is 40% load/store and 60% ALU CPInew= 40% * 5 + 60% * 4 = 4.4 Speedup = (4.5 * IC * CPU clock time) / (4.4 * IC * CPU clock time) = 4.5 / 4.4 = 1.023 or 2.3% speedup Amdahl’s Law Speedup of enhancement is 5/4 (5 cycles down to 4) = 1.25 Fraction the enhancement can be used the enhancement is used in 20% of the loads/stores which were 50% of the total instruction mix and these instructions took up 5 cycles of time each, so it is used .20 * .50 * 5 / 4.5 (the original CPI) = .111 Speedup = 1 / (1 - .111 + .111 / 1.25) = 1.023

Instruction Set Principles We studied instruction set design issues in 362 Here, we develop a RISC instruction set to be used throughout the course, called MIPS We want a fixed-length instruction format and a load-store instruction set both of which will help support a pipeline What other issues should we consider? Number of operands (2-operand or 3-operand)? Number of registers and what type (should we differentiate between data and address registers?) Memory issues What addressing modes should we allow? How many bits should we allow for address displacements, for immediate data?

Comparisons

Addressing Modes Data will either be Constants (immediate data) Stored in registers Stored in memory For data stored in memory, there are numerous ways to specify the address Direct, indirect (pointers), register indirect (pointers in registers), base displacement (sum of displacement and value in register) indexed (sum of values in two registers), etc – see the next slide Complex modes can impact CPI because of the time it takes to obtain or compute the address Design issues How many bits should be allowed for an immediate datum or a displacement? Analysis of SPEC benchmarks indicate no more than 15 bits are needed for a displacement (displacements are < 32K) and 8 bits for most immediate data Which modes? Again, an analysis of SPEC benchmarks indicate that immediate and displacement modes are most common (see figure A.7)

Branch Issues Branches typically use PC-relative branching The branch target location is computed as PC  PC + offset rather than PC  offset, this keeps the offset to fewer bits in the instruction also, with PC + offset, we do not need to know absolute memory locations at compile time allowing code to be moved in memory during execution Branches break down into Conditional branches (test a condition first) Unconditional branches Procedure calls and returns (require storing a return address, probably parameter passing as well) register windows are used in some high performance computers for parameter passing (this is explained in the out-of-class notes) Conditional branches make up 75-82% of all branches Distance (offsets) for most branches can be limited to about 8 bits (255 locations) – see figure A.15 on page A-18

Continued For procedure calls/returns, how is the state saved/restored Through memory or register windows What is the form of condition for conditional branches? Complex conditions can be time consuming Using condition codes is problematic with pipelines A simple zero test (value == 0 or value != 0) is the simplest and fastest approach but requires additional instructions e.g., to compare x == y + 1, do x – y + 1 first, then compare the result to 0 When is the comparison performed? With the branch instruction or prior to the branch?

Types of Instructions Arithmetic (integer) and logic operations FP operations (+, -, *, /) and conversion between int and FP We separate FP and integer operations for several reasons they have different CPIs we will use different register files we will use different execution units in the pipeline Data transfer (loads, stores) Control (conditional, unconditional branches, procedure calls, returns, traps) I/O Strings (move, compare, search) OS operations (OS system calls, virtual memory, other) Graphics (pixel operations, compression/decompression, others) In this course we will only concentrate on the first 4 classes although we will briefly consider vector operations as well, which are often used to support graphics See figure A.13 on page A-16 for a breakdown of the SPECInt92 benchmark programs as executed on the Intel 80x86 architecture

Embedded Application Instruction Sets With RISC, instruction sets were being restricted Fewer instructions in the instruction set Fixed length instructions Fewer and simpler addressing modes Load-store instruction sets With the popularity of embedded applications due to handheld devices, new restrictions are being introduced 16-bit and 32-bit instruction sizes to accommodate narrower buses Requires smaller memory addresses, smaller immediate data, fewer registers This also improves cache performance because we can fit more in the caches An alternative is to use compression on instructions, compress an instruction, fetch it, uncompress it in the CPU – IBM follows this

Compiler Optimizations In order to support the increasingly complex hardware, we need compiler support in the form of machine code optimizations, here are some examples: High-level optimizations on source code example: procedure in-lining, loop transformation Local optimizations on single-lines of code example: change the order of references in a block or expression Global optimizations extend local across branches example: loop unrolling Register allocation to optimize the storage of variables in registers and minimize memory fetches Machine-dependent optimizations take advantage of the specific architecture see Figure A.19, page A-25 and A.20, page A-28

Continued Two examples Sub-expression elimination – assume that a particular expression is used in several expressions, the value can be computed one time and stored in a register, later uses can reference the register and not have to re-compute the same expression Graph coloring – an algorithm used to determine the best (or a good) allocation of local variables to registers this is an NP complete problem, so compilers use a graph coloring approximation or heuristic algorithm instead There is a problem with compiler optimizations: phase-ordering Since compiler optimizations are made in a particular order, one optimization might impact and wipe out the gain by an earlier optimization consider for performing register allocation is performed near the end of optimization but sub-expression elimination, performed earlier in the process, needs some registers, so the earlier optimization relies on having access to registers which may be re-allocated later!

Introduction to MIPS Developed in 1985, since then, there have been many versions, here, we examine a subset called MIPS64 RISC architecture designed for pipeline efficiency optimizing compiler essential to improve efficiency General-purpose register set and load-store architecture 32 64-bit general purpose (integer) registers labeled R0, …, R31 R0 is always 0 8-, 16-, 32-bit values are sign extended to become 64 bit values 32 64-bit floating point registers labeled F0, …, F31 where only half the register is used for floats No explicit character or string types characters treated as ints, ala C strings as arrays of ints Arrays are available, using base-displacement addressing

Continued Two addressing modes used: displacement, immediate direct addressing can be accomplished by using R0 as the displacement register register indirect can be accomplished by using a base of 0 displacements of 12-16 bits and immediate data of 8-16 bits memory is byte addressable and 64-bit addresses are used Approximately 100 operations op code requires 7 bits, however we will reduce this to 6 bits by using one op code for all integer ALU operations Fixed length 32-bit instructions 3 instruction formats used (shown on the next slide) I-type for immediate data, used for loads, stores, conditional branches and ALU operations that have an immediate datum as an operand R-type for register type, used for all other ALU operations and FP operations J-type for jump type, used for jump, jump and link (procedure call), trap, return Immediate data and displacements are limited to 16 bits except for Jump instructions in which case displacements are limited to 26 bits

Continued 3 operand instructions are available as long as all operands are in registers (R-type) or 2 registers and immediate datum (I-type) immediate datum (which is also used for displacement offsets) is limited to 16 bits (2’s complement) but extended to 32 bits funct is the specific type of ALU or FP function

MIPS Instructions See Figure A.26, page A-40 for full list, here we look at the instructions we will be using Loads/Stores LD, SD – load/store double word (we could also use LW, SW for word sized data movements) LD R2, 204(R3) – load item from M[204 + R3] in R2 L.S, L.D, S.S, S.D – load and store single and double precision floats (S = single, D = double) L.S F3, 0(R5) – note the use of integer register for the base ALU operations (integer only) DADD/DSUB, DADDI/DSUBI, DMUL, DDIV – add/subtract/multiply/divide with 3 registers (or 2 registers and an immediate datum for add/subtract) DADD R1, R2, R3 DADDI R1, R2, #1 also similar operations for AND, OR, shift, rotate SLT, SLTI – set less than – used for comparison SLT R1, R2, R3 – if R2 < R3, set R1 to 1, else R1 = 0

Continued Branch operations BEQZ, BNEZ – branch if register tested is 0/not zero BEQZ R1, foo – PC = PC + foo if R1 = 0 J – unconditional jump to given location J foo – sets PC = PC + foo JR – unconditional jump where offset is in given register JR R3 – sets PC = PC + R3 Floating Point operations ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D, DIV.S floating point operations, 3 FP registers ADD.S F1, F2, F3 C.__.D, C.__.S – FP comparisons, __ is LT, GT, LE, GE, EQ, NE CVT.__.__ - converts from one type to another using two registers CVT.D.L F2, R4 – convert double in F2 to long in R4

MIPS 5-Stage Architecture See section C.3 pages C-31-C-34

IF & ID Stages IF: PC sent to instruction cache PC incremented by 4, stored in NPC temporarily a MUX in the MEM stage determines if the PC should get the value in NPC or the value computed in EX Instruction stored in IR ID: Bits 6..10 denote one source register (I-type and R-type instructions) Bits 11..15 denote one source register (R-type) Bits 16..32 store immediate datum or displacement, sign extended to 32 bits NPC, A, B and IMM are temporary registers used in later stages

EX Stage This stage computes ALU operations using register A & B or A& IMM, result from ALU placed in ALU output register and passed on to next stage computes effective addresses for loads and stores A + IMM, stored in ALU output and passed onto next stage computes branch target locations and performs the zero test to determine if a branch is taken or not A zero tested ALU adds PC + IMM, value sent to ALU output and passed to next stage

MEM and WB Stages MEM: If load, ALU output stores address, sent to data cache, resulting datum stored in LMD If store, ALU output stores address and B register stores datum, both are sent to data cache If branch, based on condition, MUX either selects NPC or branch target location (as computed in the ALU EX) to send back to PC If ALU, forward result from ALU output directly to LMD WB: If a datum in LMD (load or ALU), store in the register file

Comments on the MIPS Architecture The simplified nature of MIPS means that many tasks will require more than a single assembly/machine operation to complete in CISC instruction sets, some operations can be done in 1 instruction, such as indirect addressing and compare-and-branch operations registers must be pre-loaded with the data before performing an ALU operation two or more instructions to perform scaled or indexed modes The CPI of MIPS operations is less than those in other instruction sets making up for this all operations have a CPI of 4 except Loads and ALU operations which have a CPI of 5 (because they must write their results to registers in the WB stage) The static size of all MIPS operations makes it easier to deal with pre-fetching and pipelining

Continued The architecture requires the following hardware elements to implement: the ALU should have all integer operations (arithmetic, logic) we address floating point operations later in the semester an additional adder is needed for the IF stage (PC increment) several temporary registers are needed IR, A, B, Imm, NPC, ALUOutput, LMD multiplexors to select the following what to do after a condition is evaluated whether a computed value is to be used later in temporary registers A or B whether to use a register value or the immediate datum multiplexors in the ALU to select the output based on the specific ALU operation (not shown in the figure) multiplexors in the register file to select which register to send on to the A or B temporary registers, and a demultiplexor to pass along the LMD value into one of the registers (not shown in the figure)

MIPS Code Example Write a set of code to compute the average of the elements in an int array, assuming the array starts at memory location 50000 and that the number of elements in the array is stored at location 10000 Store the resulting float value at location 10004

Another Example Write a set of code that will find the largest and smallest items in an array, the array’s starting location is stored in R5 and the array contains 500 elements, store the min in R1, the max in R2

Computer Architecture