Chapter 5: ISAs

  1. Chapter 5: ISAs
  • In MARIE, we had simple instructions
    • 4-bit op code followed by either
      • 12-bit address for Load, Store, Add, Subt, Jump
      • 2-bit condition code for Skipcond
      • 12 zeros for instructions that did not need a datum
  • However, most ISAs are much more complex, so there are many more op codes and possibly more than 1 operand
  • How do we specify the operation?
    • Each operation has a unique op code, although op codes might not be of equal length (in MARIE, all were 4 bits; in some ISAs, op codes range from 8 bits to 16 or more)
  • How do we specify the number of operands?
    • This is usually determined by the op code, although it could also be specified in the instruction as an added piece of instruction information
  • How do we specify the location of each operand?
    • We need addressing information (a MARIE decoding sketch follows below)
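
  To make the MARIE layout concrete, here is a minimal Python decoding sketch; the 4-bit/12-bit split is from the slide, the op-code values follow the usual MARIE assignment (Load = 1, Store = 2, Add = 3, Subt = 4, Skipcond = 8, Jump = 9), and the names decode/OPCODES are purely illustrative.

    # Split a 16-bit MARIE instruction into op code and address fields.
    OPCODES = {0x1: "Load", 0x2: "Store", 0x3: "Add", 0x4: "Subt",
               0x8: "Skipcond", 0x9: "Jump"}   # subset, for illustration

    def decode(word):
        op = (word >> 12) & 0xF    # top 4 bits: op code
        addr = word & 0x0FFF       # low 12 bits: address, condition code, or zeros
        return OPCODES.get(op, "?"), addr

    print(decode(0x1105))   # ('Load', 261), i.e. Load from address 0x105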

  2. Instruction Formats
  • PDP-10 – fixed-length instructions, 9-bit op code (512 operations) followed by 2 operands: one operand in a register, the other in memory
  • PDP-11 – variable-length instructions with 13 different formats
    • Varies from a 4-bit op code to a 16-bit op code; 0, 1, 2 and 3 operands can be specified based on the format

  3. 2 More Formats
  • The variable-length Intel (Pentium) format (shown in the slide's figure): instructions can vary from 1 byte to 17 bytes, op codes are 1 or 2 bytes long, and all instructions have up to 2 operands
  • The fixed-length PowerPC format (also shown in the slide's figure): all instructions are 32 bits, but there are five basic forms, with up to 3 operands as long as the 3 operands are stored in registers

  4. Instruction Format Decisions
  • Length decisions:
    • Fixed length
      • makes instruction fetching predictable (which helps out in pipelining)
    • Variable length
      • flexible instructions; can accommodate up to 3 operands, including 3 memory references, and length is determined by need, so it does not waste memory space
  • Number of addressing modes
    • Fewer addressing modes makes things easier on the architect, but possibly harder on the programmer
    • Simple addressing modes make pipelining easier
  • How many registers?
    • Generally, the more the better, but with more registers there is less space available for other circuits or cache (more registers = more expense)

  5. Alignment
  • Another question is what alignment should be used?
  • Recall that most machines today have word sizes of 32 bits or 64 bits and the CPU fetches or stores 1 word at a time
    • Yet memory is organized in bytes
  • Should we allow the CPU to access something smaller than a word?
    • If so, we have to worry about alignment and about the order of bytes within a word
  • Two byte orderings are used
    • Big endian – the most significant byte is stored first (at the lowest address of the word)
    • Little endian – the least significant byte is stored first, so the bytes appear in reverse order
    • The slide's figure shows both layouts for the word 0x12345678 (see the sketch below)
  • Different architectures make different choices: Intel uses little endian, and bitmaps were developed this way, so a bitmap must be converted before it can be viewed on a big-endian machine!
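
  A small Python check of the two orderings for the slide's example word 0x12345678, using the standard-library struct module ('>' packs big endian, '<' packs little endian):

    import struct

    word = 0x12345678
    big    = struct.pack(">I", word)   # big endian: most significant byte first
    little = struct.pack("<I", word)   # little endian: least significant byte first

    print(big.hex())      # 12345678
    print(little.hex())   # 78563412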

  6. Type of CPU Storage
  • Although all architectures today use register storage, other approaches have been tried:
    • Accumulator-based – a single data register, the accumulator (MARIE is like this)
      • This was common in early computers, when register storage was very expensive
    • General-purpose registers – many data registers are available for the programmer's use
      • Most RISC architectures are of this form
    • Special-purpose registers – many data registers, but each has its own implied use (e.g., a counter register for loops, an I/O register for I/O operations, a base register for arrays, etc.)
      • Pentium is of this form
    • Stack-based – instead of general-purpose registers, storage is a stack; operations are rearranged to be performed in postfix order (see the sketch below)
      • An early alternative to accumulator-based architectures, obsolete now
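
  To illustrate the stack-based style, here is a minimal Python sketch that evaluates the postfix form of (A – B) / (C + D * E), the same expression used on slide 9; the tiny operation names (push, sub, add, mul, div) are invented for illustration, not taken from any real stack ISA.

    # Stack-machine sketch: Y = (A - B) / (C + D * E), postfix  A B - C D E * + /
    def run(program, values):
        stack = []
        for op, arg in program:
            if op == "push":
                stack.append(values[arg])
            else:                          # binary ALU op: pop two, push result
                b, a = stack.pop(), stack.pop()
                stack.append({"sub": a - b, "add": a + b,
                              "mul": a * b, "div": a / b}[op])
        return stack.pop()

    prog = [("push", "A"), ("push", "B"), ("sub", None),
            ("push", "C"), ("push", "D"), ("push", "E"),
            ("mul", None), ("add", None), ("div", None)]
    print(run(prog, {"A": 9, "B": 1, "C": 2, "D": 3, "E": 2}))   # (9-1)/(2+3*2) = 1.0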

  7. Load-Store Architectures
  • When deciding on the number of registers to make available, architects also decide whether to support a load-store instruction set
  • In a load-store instruction set, the only operations that are allowed to reference memory are loads and stores
    • All other operations (ALU operations, branches) must reference only values in registers or immediate data (data in the instruction itself)
  • This makes programming more difficult because a simple operation like inc X must now first load X into a register, increment it there, and store the result back to X
    • But it is necessary to support a pipeline, which ultimately speeds up processing!
  • All RISC architectures are load-store instruction sets and require at least 16 registers (hopefully more!)
  • Many CISC architectures permit memory-memory and memory-register ALU operations, so these machines can get by with fewer registers
    • Intel has 4 general-purpose data registers

  8. Number of Operands
  • The number of operands that an instruction can specify has an impact on instruction sizes
  • Consider the instruction Add R1, R2, R3
    • Add op code is 6 bits
    • Assume 32 registers, so each register reference takes 5 bits
    • This instruction is 21 bits long
  • Consider Add X, Y, Z
    • Assume 256 MBytes of memory, so each memory reference is 28 bits
    • This instruction is 90 bits long! (the arithmetic is checked in the sketch below)
  • However, we do not necessarily want to limit our instructions to having 1 or 2 operands, so we must either permit long instructions or find a compromise
  • The load-store instruction set is a compromise – 3 operands can be referenced as long as they are all in registers, and 1 operand can be referenced in memory as long as it is in an instruction by itself (a load or store uses only 1 memory reference)
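
  A quick check of the slide's arithmetic; the 6-bit op code, 32 registers and 256 MBytes of byte-addressable memory are the slide's assumptions.

    opcode_bits = 6
    reg_bits = 5      # 32 registers -> log2(32) = 5 bits per register field
    mem_bits = 28     # 256 MBytes = 2**28 bytes -> 28 bits per memory address

    print(opcode_bits + 3 * reg_bits)   # Add R1, R2, R3 -> 21 bits
    print(opcode_bits + 3 * mem_bits)   # Add X, Y, Z    -> 90 bits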

  9. 1, 2 and 3 Operand Examples
  • Here we compare the length of code if we have one-address, two-address and three-address instructions; each sequence computes Y = (A – B) / (C + D * E) (all three are checked in the sketch below)
  • Using three addresses:
    SUB Y, A, B     Y ← A – B
    MPY T, D, E     T ← D * E
    ADD T, T, C     T ← T + C
    DIV Y, Y, T     Y ← Y / T
  • Using two addresses:
    MOVE Y, A       Y ← A
    SUB Y, B        Y ← Y – B
    MOVE T, D       T ← D
    MPY T, E        T ← T * E
    ADD T, C        T ← T + C
    DIV Y, T        Y ← Y / T
  • Using one address:
    LOAD D          AC ← D
    MPY E           AC ← D * E
    ADD C           AC ← AC + C
    STOR Y          Y ← AC
    LOAD A          AC ← A
    SUB B           AC ← AC – B
    DIV Y           AC ← AC / Y
    STOR Y          Y ← AC
  • Notice: one- and two-address instructions write over a source operand, thus destroying data
  • See pages 206-207 for another example
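
  A minimal Python check that the three sequences compute the same value as Y = (A – B) / (C + D * E); the sample values are arbitrary, and each assignment mirrors one instruction above.

    A, B, C, D, E = 9.0, 1.0, 2.0, 3.0, 2.0
    expected = (A - B) / (C + D * E)

    # three-address sequence
    Y = A - B; T = D * E; T = T + C; Y = Y / T
    three = Y

    # two-address sequence
    Y = A; Y = Y - B; T = D; T = T * E; T = T + C; Y = Y / T
    two = Y

    # one-address (accumulator) sequence
    AC = D; AC = AC * E; AC = AC + C; Y = AC
    AC = A; AC = AC - B; AC = AC / Y; Y = AC
    one = Y

    print(expected, three, two, one)   # all four print 1.0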

  10. Addressing Modes
  • In our instruction, how do we specify the data?
  • We have different modes that specify how to find the data
    • Most modes generate memory addresses; some modes reference registers instead
  • The slide's table lists the most common modes (we have already used Direct and Indirect in our MARIE examples of the last chapter); the following slides work through them

  11. Computing These Modes
  • In Register mode, the operand is stored in a register, and the register is specified in the instruction
    • Example: Add R1, R2, R3
  • In Immediate mode, the operand is in the instruction itself, such as Add #5
    • This is used when the datum is known at compile time – this is the quickest form of addressing
  • In Direct mode, the operand is in memory and the instruction contains a reference to the memory location – because there is a memory access, this method is slower than Register
    • Examples: Add Y (in assembly), Add 110111000 (in machine code)
  • In Indirect mode, the memory reference is to a pointer; this requires two memory accesses and so is the slowest of all addressing modes

  12. Continued
  • Indexed or Based mode is like Direct except that the address referenced is computed as a combination of a base value stored in a register and an offset in the instruction
    • Example: Add R3(300)
    • This is also called Displacement or Base Displacement
  • Register Indirect mode is like Indirect except that the instruction references a pointer in a register, not in memory, so one memory access is saved
  • Notice that Register and Register Indirect can permit shorter instructions because a register specification is shorter than a memory address specification
  • In Stack mode, the operand is at the top of the stack, where the stack is pointed to by a special register called the Stack Pointer – this is like Register Indirect in that it accesses a register followed by memory

  13. Example
  • Assume memory stores the values shown in the slide's figure, and register R1 stores 800
  • Assume our instruction is Load 800
  • The value loaded into the accumulator depends on the addressing mode used:
    • Immediate – the data is 800 itself
    • Direct – the data's location is address 800
    • Indirect – the data's location is pointed to by the value stored at address 800
    • Indexed – the data's location is R1 + 800 (i.e., 1600)
  • The sketch below walks through the same example with made-up memory contents
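
  A small Python sketch of the same example; the memory contents below are made up (the slide's figure is not reproduced in this transcript), but the four mode computations follow the slide.

    memory = {800: 900, 900: 1000, 1600: 700}   # hypothetical contents
    R1 = 800
    operand = 800                          # the 800 in "Load 800"

    immediate = operand                    # the data is the instruction field itself
    direct    = memory[operand]            # one memory access
    indirect  = memory[memory[operand]]    # two memory accesses (pointer chase)
    indexed   = memory[R1 + operand]       # effective address = R1 + 800 = 1600

    print(immediate, direct, indirect, indexed)   # 800 900 1000 700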

  14. Instruction Types
  • Now that we have explored some of the issues in designing an instruction set, let's consider the types of instructions:
    • Data movement (load, store) +
    • I/O +
    • Arithmetic (+, -, *, /, %) *
    • Boolean (AND, OR, NOT, XOR, Compare) *
    • Bit manipulation (rotate, shift) *
    • Transfer of control (conditional branch, unconditional branch, branch and link, trap) *
    • Special purpose (halt, interrupt, others)
  • Those marked with * use the ALU
    • Note: branches add to, subtract from, or otherwise change the PC, so these use the ALU
  • Those marked with + use memory or I/O

  15. Instruction-Level Pipelining
  • We have already covered the fetch-execute process
  • It turns out that, if we are clever about designing our architecture, we can design the fetch-execute cycle so that each phase uses different hardware
    • Then we can overlap instruction execution in a pipeline
    • The CPU becomes like an assembly line: instructions are fetched from memory and sent down the pipeline one at a time
    • The first instruction is at stage 2 when the second instruction is at stage 1; or, instruction j is at stage 1 when instruction j – 1 is at stage 2 and instruction j – 2 is at stage 3, etc.
  • The length of the pipeline determines how many overlapping instructions we can have
    • The longer the pipeline, the greater the overlap and so the greater the potential for speedup
    • It turns out that long pipelines are difficult to keep running efficiently, though, so smaller pipelines are often used

  16. A 6-stage Pipeline
  • Stage 1: Fetch instruction
  • Stage 2: Decode op code
  • Stage 3: Calculate operand addresses
  • Stage 4: Fetch operands (from registers usually)
  • Stage 5: Execute instruction (includes computing the new PC for branches or doing loads and stores)
  • Stage 6: Store result (in a register, if an ALU operation or load)
  • The slide's figure is a pipeline timing diagram showing how instructions overlap; the sketch below prints a similar diagram
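
  A small Python sketch that prints a timing diagram like the one on the slide; the two-letter stage abbreviations are just shorthand for the six stages listed above.

    STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]   # shorthand for stages 1-6

    def timing_diagram(n_instructions):
        for i in range(n_instructions):
            # instruction i+1 enters stage 1 one cycle after instruction i
            row = ["    "] * i + [s.ljust(4) for s in STAGES]
            print(f"I{i + 1}: " + "".join(row))

    timing_diagram(4)
    # I1: FI  DI  CO  FO  EI  WO
    # I2:     FI  DI  CO  FO  EI  WO
    # I3:         FI  DI  CO  FO  EI  WO
    # I4:             FI  DI  CO  FO  EI  WO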

  17. Pipeline Performance
  • Assume a machine has 6 steps in the fetch-execute cycle
    • A non-pipelined machine will take 6 * n clock cycles to execute a program of n instructions
    • A pipelined machine will take n + (6 – 1) clock cycles to execute the same program!
    • If n is 1000, the pipelined machine is 6000 / 1005 times faster, or almost a speedup of 6 times!
  • In general, a pipeline's performance is computed as
    • Time = (k + n – 1) * tp, where k = number of stages, n = number of instructions and tp is the time per stage (plus delays caused by moving instructions down the pipeline)
    • The non-pipelined machine's time is k * n * tp
    • So the speedup is (k * n * tp) / ((k + n – 1) * tp) = k * n / (k + n – 1), as checked in the sketch below
  • However, because a pipeline overlaps the execution of instructions, it faces problems that cause it to slow down
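
  The slide's numbers, checked in Python for an ideal pipeline with no stalls (tp cancels out of the ratio):

    def pipeline_speedup(k, n):
        # speedup = (k * n * tp) / ((k + n - 1) * tp)
        return (k * n) / (k + n - 1)

    print(pipeline_speedup(6, 1000))   # ~5.97, approaching 6 as n grows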

  18. Pipeline Problems
  • Pipelines are impacted by
    • Resource conflicts
      • If it takes more than 1 cycle to perform a stage, then the next instruction cannot move into that stage
      • For instance, floating point operations often take 2-10 cycles to execute rather than the single cycle of most integer operations
    • Data dependencies
      • Consider:
        Load R1, X
        Add R3, R2, R1
      • We want to perform the Add in the 5th stage, but the datum in R1 is not available until the previous instruction reaches the 6th stage, so the Add must be postponed by at least 1 cycle (see the hazard-check sketch below)
    • Branches
      • In a branch, the PC is changed, but in a pipeline we may have already fetched one or more instructions before we reach the stage in the pipeline where the PC is changed!
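
  A minimal Python sketch of the data-dependency check (a read-after-write hazard) for the slide's two-instruction example; representing each instruction as a (mnemonic, destination, sources) tuple is my own simplification for illustration.

    # Each instruction: (mnemonic, destination, list of sources).
    program = [("Load", "R1", ["X"]),
               ("Add",  "R3", ["R2", "R1"])]

    def raw_hazards(program):
        # Report when an instruction reads a register written by the one just before it.
        hazards = []
        for i in range(1, len(program)):
            prev_dest = program[i - 1][1]
            if prev_dest in program[i][2]:
                hazards.append((i - 1, i, prev_dest))
        return hazards

    print(raw_hazards(program))   # [(0, 1, 'R1')] -> the Add must be delayed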

  19. Impact of Branches
  • In our 6-stage pipeline
    • we compute the new PC at the 5th stage, so we would have loaded 4 wrong instructions (these 4 instructions are the branch penalty)
    • thus, every branch slows the pipeline down by 4 cycles because 4 wrong instructions were already fetched
  • Consider the four-stage pipeline below
    • S1 – fetch instruction
    • S2 – decode instruction, compute operand addresses
    • S3 – fetch operands
    • S4 – execute instruction, store result (this includes computing the PC value)
  • So, every branch instruction is followed by 3 incorrectly fetched instructions, or a branch penalty of 3 (the sketch below estimates the cost)
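
  A quick Python estimate of how a branch penalty inflates the total cycle count; the 20% branch frequency is an arbitrary example (not a figure from the slide), and every branch is assumed to pay the full penalty.

    def cycles_with_branches(n, k, branch_frac, penalty):
        # ideal pipelined time plus extra cycles lost to branches
        return (k + n - 1) + n * branch_frac * penalty

    n, k = 1000, 6
    print(cycles_with_branches(n, k, 0.0, 4))   # 1005.0 (no branches)
    print(cycles_with_branches(n, k, 0.2, 4))   # 1805.0 (200 branches, 4 cycles each)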

  20. Other Ideas
  • In order to improve performance, architects have come up with all kinds of interesting ideas to maintain a pipeline's performance
    • Superscalar – have multiple pipelines so that the CPU can fetch, decode and execute 2 or more instructions at a time
    • Branch Prediction – when it comes to branching, try to guess in advance whether the branch is taken and, if so, where, to lower or remove the branch penalty – if you guess wrong, start over from where you guessed incorrectly
    • Compiler Optimizations – let the compiler rearrange your assembly code so that data dependencies are broken up and branch penalties are removed by filling the slots after a branch with neutral instructions
    • SuperPipeline – divide pipeline stages into substages to obtain a greater overlap without necessarily changing the clock speed
  • We study these ideas in 462

  21. Real ISAs
  • Intel – 2 operands, variable length, register-memory operations (but not memory-memory), pipelined superscalar with speculation – but at the microcode level
  • MIPS – fixed length, 3-operand instructions if operands are in registers, load-store otherwise, 8-stage superpipeline, very simple instruction set
