
Hardware structures – Central Processing Unit (CPU).




  1. Introduction to Computer Systems (3) Hardware structures – Central Processing Unit (CPU). Piotr Mielecki Ph. D. http://www.wssk.wroc.pl/~mielecki Piotr.Mielecki@pwr.wroc.pl pmielecki@gmail.com mielecki@wssk.wroc.pl

  2. 1. Basic definitions and terms. DEFINITION: The Central Processing Unit (CPU), or sometimes simply the processor, is the component in a digital (and sequential) computer capable of executing a program (Knott, 1974). • It interprets program instructions and processes data. • A CPU manufactured as a single integrated circuit (IC) is usually known as a microprocessor. Beginning in the mid-1970s, microprocessors of ever-increasing complexity and efficiency gradually replaced other designs, and today the term ”CPU” is usually applied to some type of microprocessor. Modern microprocessors appear in everything from automobiles to cellular phones and children's toys.

  3. The phrase ”central processing unit” describes a certain class of sequential automata that can execute different programs. • This definition can easily be applied to many early computers that existed long before the term ”CPU” ever came into usage (such as ”generation 0” computers). Early CPUs were custom-designed as part of a larger, usually one-of-a-kind, computer. Today’s processors are suited for many purposes and mass-produced, and therefore relatively cheap. • The form, design and implementation of CPUs have changed dramatically since the earliest examples, but their fundamental operation has remained much the same – they work more or less according to von Neumann’s concept, with some modifications increasing their efficiency.

  4. 2. CPU internal structure and functions. MOST IMPORTANT FUNCTIONAL BLOCKS: • Arithmetic & Logic Unit (ALU), which is responsible for executing particular operations (arithmetical and logical calculations first of all). • Control Unit, which supports the CPU’s basic machine cycle: • reads instructions from memory, • decodes these instructions, • reads additional data (arguments) if the operation needs them, • drives the ALU to execute the decoded operation, • writes back the result if it is addressed to memory (instructions can address their arguments and results in different ways – addressing modes).
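The machine cycle above can be sketched as a simple interpreter loop. This is an illustrative model only – the opcodes, the accumulator register and the memory layout below are invented for the example, not taken from any real instruction set:

```python
# Minimal sketch of the Control Unit's machine cycle (hypothetical ISA).
# Each instruction is a (opcode, argument) pair; "acc" plays the role
# of a single internal register (accumulator).

def run(program, memory):
    pc, acc = 0, 0                       # Program Counter and accumulator
    while pc < len(program):
        opcode, arg = program[pc]        # fetch the instruction
        pc += 1                          # PC advances past the fetched word
        if opcode == "LOAD":             # decode + operand fetch from memory
            acc = memory[arg]
        elif opcode == "ADD":            # decode + execute in the "ALU"
            acc = acc + memory[arg]
        elif opcode == "STORE":          # write-back of the result to memory
            memory[arg] = acc
        elif opcode == "HALT":
            break
    return acc, memory

acc, mem = run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)],
               {0: 2, 1: 3, 2: 0})
print(acc, mem[2])   # 5 5
```

Each pass through the loop performs one full instruction cycle: fetch, decode, optional operand fetch, execute, write-back.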

  5. [Diagram: CPU internal structure.] Memory exchanges instruction codes and data with the CPU through the Memory Interface Unit (Address Buffer, Read/Write Buffer). The Control Unit contains the Instruction Fetcher, Program Counter (PC), Instruction Register and Instruction Decoder. The Execution Unit contains the Arithmetic and Logic Unit (ALU), the Status Register and internal registers #1 … #n.

  6. CONTROL UNIT: • The basic function of the Control Unit is to decode the binary-encoded instruction, first read (fetched) from memory at the address pointed to by a special-purpose register, usually called the Program Counter (PC) or Instruction Pointer (IP). • Communication between the Control Unit and memory goes through the Memory Interface Unit, which supports addressing and reading/writing data from/to memory. • The result of decoding may be affected by the results of previously executed instructions – some conditions (like carry, overflow etc.) are marked by setting appropriate bits (flags) in the Status Register. The Instruction Decoder takes them into consideration when decoding instructions of some classes (conditional branches first of all). • Reading arguments from memory for a particular operation (addition, for example) may be needed before execution. The Instruction Decoder then forces additional memory access cycle(s), issuing the appropriate address(es) through the Memory Interface Unit.

  7. CONTROL UNIT: • In simple or older processors the decoding of the instructions defined in the CPU’s Instruction Set Architecture (ISA) is/was implemented directly in the hardware structure of the CPU (hardwired control). • In the mid-1960s IBM used for the first time (in the System/360 series) an alternative solution – a microprogram (or microcode) that assists in translating instruction codes into the various control signals driving the CPU. This kind of CPU is said to use microprogrammed control. • In modern processors (Intel’s i486, Pentium and newer, for example) decoding starts a microcode routine appropriate for each defined instruction. This microcode is built from very simple instructions, much simpler than those defined in the processor’s assembly language. So we can say that a microprogram implements the CPU’s instruction set. Just as a single high-level language statement is compiled into a series of machine instructions (load, store, shift, etc.), in a CPU using microcode each machine instruction is implemented by a series of microinstructions. Microcode in modern processors is sometimes rewritable, so it can be modified to change the way the CPU decodes instructions even after it has been manufactured.
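The idea of microprogrammed control can be sketched as a lookup table that expands each machine instruction into a sequence of micro-operations. The opcodes and micro-op names below are invented for illustration; real microcode drives hardware control signals rather than returning strings:

```python
# Hypothetical microprogram store: each ISA-level opcode maps to a fixed
# sequence of micro-operations (control steps) instead of being decoded
# by dedicated hardwired logic.
MICROCODE = {
    "ADD_MEM": ["issue_address", "read_memory", "alu_add", "write_register"],
    "STORE":   ["issue_address", "read_register", "write_memory"],
}

def decode(opcode):
    """Return the micro-operation sequence implementing one instruction."""
    return MICROCODE[opcode]

print(decode("ADD_MEM"))
```

Changing the table changes how instructions behave – which is why rewritable microcode lets a manufactured CPU be patched.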

  8. EXECUTION UNIT: • In most processors the ALU is supported by additional elements, according to the particular design. All these elements, together with the ALU itself, are often described as the Execution Unit. One of the most important devices co-operating with the ALU is the Status Register, which keeps bits/markers (flags) set by the ALU under some conditions (if the result of an arithmetic operation is equal to zero, for example). Conditional instructions (branches) can check these flags to force a jump in the program. • Internal registers were introduced into most CPU designs to increase the speed of execution – they can exchange data between themselves and supply the ALU with arguments much faster than external memory (see the ”von Neumann bottleneck” problem mentioned in Lecture 2). Assembly language programmers often use the internal registers as temporary variables when implementing more complex data manipulations composed of series of ALU operations (adding 32-bit arguments on an 8- or 16-bit processor, for example).
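The last remark – adding 32-bit values on an 8-bit ALU – can be sketched as a chain of byte-wide additions propagating the carry flag, the way an assembly programmer would chain an add-with-carry instruction. The function name and byte ordering here are illustrative, not tied to any particular CPU:

```python
# Sketch: 32-bit addition performed byte-by-byte on an 8-bit "ALU",
# using the carry flag from the Status Register between steps.
def add32_on_8bit(a_bytes, b_bytes):
    """a_bytes, b_bytes: four bytes each, least-significant byte first."""
    result, carry = [], 0
    for a, b in zip(a_bytes, b_bytes):
        s = a + b + carry                # 8-bit add with the incoming carry
        carry = 1 if s > 0xFF else 0     # carry flag set on 8-bit overflow
        result.append(s & 0xFF)          # keep only the low 8 bits
    return result, carry

# 0x000001FF + 0x00000001 = 0x00000200 (little-endian byte lists)
res, carry = add32_on_8bit([0xFF, 0x01, 0x00, 0x00], [0x01, 0x00, 0x00, 0x00])
print(res, carry)   # [0, 2, 0, 0] 0
```

The final carry value is exactly the overflow flag the Status Register would report after the whole 32-bit addition.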

  9. 3. CPU operation. INSTRUCTION CYCLE (1): The program stored in the computer’s memory is represented by a series of binary codes. There are four basic steps that nearly all von Neumann CPUs use in their operation: fetch, decode, execute, and write-back. These steps are repeated in an endless cycle (the processor’s instruction cycle), which is the basic algorithm implemented in the design of the CPU circuit: • Fetch – the first step involves reading an instruction (represented by a binary code or a sequence of codes – machine words) from memory. The location in memory is determined by the Program Counter (or Instruction Pointer), which stores an address identifying the current position in the program. In other words, the PC register keeps track of the CPU’s place in the current program. After an instruction is fetched, the PC is incremented by the length of the instruction word in terms of memory units (words stored in memory can have a different length than the CPU’s internal registers, the instruction can be composed of more than one memory word, etc.).

  10. INSTRUCTION CYCLE (2): • Decode – the instruction is broken up into parts that have significance to other portions of the CPU. The way the binary instruction value is interpreted is defined by the CPU's Instruction Set Architecture. Usually one group of bits in the instruction, called the Operation Code (opcode), indicates which operation to perform. The remaining parts of the word provide information required for that instruction, such as arguments (operands) for an arithmetic or logic operation. Any operand may be given as a constant value (called an immediate value), or as a place to locate a value: an internal register or a memory address, as determined by some addressing mode. • Operand fetch (optional) – if the instruction needs to read additional data from memory, from an address included in the instruction word or pointed to indirectly, the CPU needs to execute an additional memory access cycle – that means an additional memory read cycle should optionally be performed before execution.

  11. INSTRUCTION CYCLE (3): • Execution – various elements of the CPU are dynamically connected so they can perform the desired operation, or the microcode performs a sequence of micro-operations to complete it. If, for instance, an arithmetic addition is requested, the ALU will be connected to a set of inputs, or values will be sent to the ALU’s inputs. The ALU’s inputs provide the numbers to be added, and the output will contain the final sum. The ALU contains the circuitry to perform simple arithmetical and logical operations (addition, bitwise operations, etc.). If, for instance, the addition produces a result too large for the CPU to handle, an arithmetic overflow flag in the Status Register may also be set. • Write-back – the CPU simply ”writes back” the results of the execute step to some form of memory. Very often the results are written to one of the internal CPU registers for quick access by subsequent instructions. In other cases results may be written to slower, but cheaper and larger, main memory (RAM).

  12. INSTRUCTION CYCLE (4): • Some types of instructions change the value in the Program Counter rather than directly produce result data (this is also a kind of write-back). These are generally called jumps or branches and make possible behavior like loops, conditional program execution (through the use of a conditional jump), and calling subroutines. • Many instructions will also (or only) change the state of bits (flags) in the Status Register. These flags can be used to decide how a program behaves, since they often indicate the outcome of various operations. For example, one type of ”compare” instruction considers two argument values and only sets a flag in the Status Register according to which one is greater (without changing the values of the arguments). This flag could then be used by a later jump instruction to determine program flow.

  13. [Diagram: memory layout for the ”Load Accumulator Direct” (LDA) instruction of the simple 8-bit Intel 8080 processor.]

  Address (16-bit)       Contents           Meaning
  00000001 00000000      0 0 1 1 1 0 1 0    Operation code: 3Ah – LDA
  00000001 00000001      0 0 0 0 0 0 0 0    Low-order address byte: 00h
  00000001 00000010      1 1 1 1 0 0 0 0    High-order address byte: F0h
  00000001 00000011      …                  1st byte of the next instruction
  …
  11110000 00000000      1 0 0 0 1 0 0 1    Data in memory: 89h

  The diagram shows how the instruction LDA (F000h) is decoded and how it works – it loads the internal register A (also called the Accumulator) with the value read from memory at the 16-bit address written next to the operation code. Result: internal register A = 89h.
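The LDA example can be checked with a short sketch. The memory map below mirrors the diagram (opcode 3Ah at 0100h, address bytes 00h and F0h, data 89h at F000h); the decoding logic is a simplification of the real 8080, handling only this one instruction:

```python
# Sketch: decoding the 8080 "LDA addr" instruction (opcode 0x3A).
# The 16-bit operand is stored low byte first, as in the diagram.
memory = {0x0100: 0x3A, 0x0101: 0x00, 0x0102: 0xF0, 0xF000: 0x89}

pc = 0x0100
opcode = memory[pc]                        # fetch
assert opcode == 0x3A                      # decode: LDA direct
low, high = memory[pc + 1], memory[pc + 2]
addr = (high << 8) | low                   # assemble 16-bit address: 0xF000
a = memory[addr]                           # operand fetch + write-back to A
pc += 3                                    # LDA is a 3-byte instruction
print(hex(addr), hex(a))                   # 0xf000 0x89
```

Note the PC is advanced by three memory units, since the instruction occupies the opcode byte plus two address bytes.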

  14. INSTRUCTION CYCLE (5): • After the execution of the instruction and write-back of the resulting data, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the Program Counter. If the completed instruction was a jump, the PC is modified to contain the address of the instruction that was jumped to, and program execution continues normally. • To complete the instruction cycle the CPU (as a synchronized, sequential automaton) needs to pass through a sequence of discrete states, synchronized by a special signal. This electric signal, known as the ”clock”, usually takes the form of a periodic square wave. Each pulse of this wave causes the sequential circuit to pass from the current state to the next one. Processors with microprogrammed control usually have an internal frequency multiplier, which provides the internal CPU circuits with a clock signal much faster (some GHz, for example) than the external, system clock (hundreds of MHz on the mainboard). • Simple processors can execute only one instruction with one or two pieces of data at a time – such a design is usually called a ”subscalar” CPU.

  15. 4. Parallel processor architectures – basic concepts. To read one word from memory, for example, a processor usually needs more than one system clock pulse, so the entire instruction cycle normally takes from a few to over a dozen clock cycles. In more advanced CPU designs multiple instructions can be fetched at the same moment, then decoded and executed simultaneously, so each pulse of the synchronization clock can ”release” one or even more than one completed instruction (scalar and superscalar processors). • The main disadvantage of the subscalar CPU comes from the fact that only one instruction is executed at a time. The entire CPU must wait for that instruction to complete before proceeding to the next one. Even adding a second Execution Unit does not improve performance much; rather than one pathway waiting, now two pathways are waiting. This design, wherein the CPU’s execution resources can operate on only one instruction at a time, can at best reach scalar performance (one instruction on the CPU’s ”output” per clock cycle). In practice, the performance is nearly always subscalar (less than one instruction per cycle).

  16. [Diagram: subscalar CPU timing – each instruction passes sequentially through the stages F (Fetch), D (Decode), MR (Memory read), EX (Execute), WB (Write-back) before the next one starts.] Example of a subscalar CPU – it takes 15 clock cycles to complete 3 instructions (assuming each basic step, like fetch, decode etc., takes exactly 1 clock cycle). • Attempts to achieve scalar and better performance have resulted in a variety of design methods that cause the CPU to work less sequentially and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques: • Instruction-level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources). • Thread-level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU or a set of CPUs can execute simultaneously (multi-CPU computers or multi-core processors, for example).

  17. [Diagram: pipelined execution – instructions 1–5 each pass through the stages F, D, MR, EX, WB, with each instruction starting one cycle after the previous one; 9 CLK cycles complete 5 instructions (still subscalar performance).] One of the simplest methods used to accomplish increased parallelism at the instruction level (ILP) is to begin the first steps of an instruction before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining. Pipelining allows more than one instruction to be in progress at any given time by breaking the execution pathway down into discrete stages. Each stage is handled by a separate unit of the CPU, so while decoding the first instruction the processor can already fetch the next one. [Diagram: pipeline structure – INPUT → F → D → MR → EX → WB → OUTPUT.]

  18. In the last example 9 cycles were used to complete 5 instructions, which is still under the scalar level of efficiency. But from the 5th cycle on, all modules of the CPU are utilized (all 5 stages of the pipeline are busy, none waits). At that moment we can say the CPU has an efficiency of 1 instruction per cycle. If further instructions keep arriving at the input of the pipeline (the unit performing fetch operations), this efficiency can be preserved for the following cycles. [Diagram: fully loaded pipeline – instructions 1–7 overlapping in the stages F, D, MR, EX, WB; 1 CLK cycle – 1 instruction finished (scalar efficiency).]
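The cycle counts in these examples follow a simple pattern: a non-pipelined CPU needs n·k cycles for n instructions on k stages, while an ideal pipeline needs k + (n − 1) cycles (fill the pipeline once, then finish one instruction per cycle). A quick check, as an illustrative sketch:

```python
# Cycle counts for n instructions on a k-stage machine, assuming
# 1 clock per stage and no pipeline stalls.
def subscalar_cycles(n, k):
    return n * k              # each instruction runs to completion alone

def pipelined_cycles(n, k):
    return k + (n - 1)        # fill the pipeline once, then 1 per cycle

print(subscalar_cycles(3, 5))   # 15 cycles for 3 instructions (slide 16)
print(pipelined_cycles(5, 5))   # 9 cycles for 5 instructions (slide 17)
```

For large n the pipelined count approaches n, i.e. one instruction per cycle – the scalar limit.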

  19. The pipeline can be ”broken” when the execution of a particular instruction depends on results of instructions not yet finished (a data dependency conflict). This is possible when: • the instruction has to wait for the result of calculations or other processing on variable(s) before it can proceed, • the instruction is a conditional branch and has to wait for the result of some operation to make its decision (to jump or not to jump?). • The second condition is easier to detect, just by checking the operation codes of the instructions. Predicting the first situation is much harder.
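The first kind of conflict – an instruction reading a value that an earlier, unfinished instruction writes – is known as a read-after-write hazard. A minimal detection sketch, with an invented instruction format of (destination register, source registers):

```python
# Sketch: detect a read-after-write (RAW) data dependency between two
# adjacent instructions, each described as (destination, source registers).
def raw_hazard(first, second):
    dest, _ = first
    _, sources = second
    return dest in sources    # second reads what first has not yet written

add = ("R1", ("R2", "R3"))    # R1 <- R2 + R3
sub = ("R4", ("R1", "R5"))    # R4 <- R1 - R5  (needs R1 from the ADD)
print(raw_hazard(add, sub))   # True: the pipeline would have to stall
```

A real pipeline must compare every in-flight instruction against every new one, which is part of what makes dispatch logic expensive.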

  20. SUPERSCALAR PROCESSORS (1): Processors said to be superscalar include a long instruction pipeline and multiple identical execution units. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so, they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle. Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline be filled as often as possible, so superscalar processors use large amounts of instruction cache memory.

  21. SUPERSCALAR PROCESSORS (2): The dispatcher also uses hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution to reach a high level of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes. Out-of-order execution rearranges, to some extent, the order in which instructions are executed to reduce delays due to data dependencies. The simplest example of a processor that can reach superscalar efficiency is a CPU with two 5-stage pipelines: [Diagram: two parallel pipelines, each INPUT → F → D → MR → EX → WB → OUTPUT.]

  22. SUPERSCALAR PROCESSORS (3): [Diagram: two-pipeline superscalar timing – instructions 1–10 issued in pairs, each pair passing through the stages F, D, MR, EX, WB; 1 CLK cycle – more than 1 instruction finished (superscalar efficiency).]
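The timing in this diagram generalizes the earlier pipeline formula: with issue width w (instructions dispatched per cycle), an ideal machine needs k + ⌈n/w⌉ − 1 cycles for n instructions on k stages. A quick check, as an illustrative sketch that ignores hazards and dispatch limits:

```python
import math

# Ideal cycle count for n instructions on a k-stage pipeline that can
# issue w instructions per cycle (w = 1 is the plain pipeline).
def superscalar_cycles(n, k, w):
    return k + math.ceil(n / w) - 1

print(superscalar_cycles(10, 5, 2))   # 9 cycles for 10 instructions, width 2
print(superscalar_cycles(10, 5, 1))   # 14 cycles on a single pipeline
```

With two pipelines the diagram's 10 instructions finish in 9 cycles – more than one instruction per cycle once the pipelines are full, i.e. superscalar efficiency.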

  23. 5. RISC vs. CISC processors. A complex instruction set computer (CISC) has a processor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store at the end, all in a single instruction. The term was introduced after the concept of the reduced instruction set computer (RISC) was defined, in contrast to RISC. Before the first RISC processors were designed, many computer architects tried to design instruction sets supporting high-level programming languages by providing ”high-level” instructions such as procedure call and return (CALL, RET), loop instructions such as ”decrement counter and jump if non-zero” (DJNZ), and sophisticated addressing modes allowing data structure and array accesses to be combined into single instructions. The compact nature of such a CISC ISA results in smaller program sizes and fewer calls to main memory, which at the time (the 1960s) meant tremendous savings on the cost of a computer.

  24. DISADVANTAGES OF CISC PROCESSORS: It was observed that it was not always possible to reach high performance by implementing more and more complex instructions. For instance, badly designed or low-end versions of complex architectures (which used microcode to implement many hardware functions) could lead to situations where performance could be improved by not using a complex instruction, but instead using a sequence of simpler ones. One reason for this was that such ”high-level” instruction sets, often also highly encoded at the microcode level (for compact executable code), may be very complicated to decode and execute efficiently within the limited number of transistors inside the CPU. Such architectures therefore require a great deal of work on the part of the processor’s hardware designer (or a slower microcoded solution). At a time when transistors were a limited resource, this also left less room on the processor to optimize performance in other ways, which gave room for the ideas that led to the original RISC designs in the mid-1970s (the IBM 801 at IBM’s Watson Research Center).

  25. The terms RISC and CISC have become less meaningful with the evolution of both CISC and RISC designs and implementations. The first highly pipelined, popular ”CISC” implementations, such as the i486 family from Intel, AMD, Cyrix, and IBM, supported every instruction that older Intel processors did, but achieved high efficiency only on a fairly simple x86 subset (resembling a RISC instruction set, but without the load-store limitations of RISC). Today’s Intel and AMD processors also decode and split more complex instructions into series of smaller internal ”micro-operations”, which can thereby be executed in a pipelined (parallel) fashion, thus achieving high performance on a much larger subset of instructions.
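Splitting a complex instruction into micro-operations can be sketched as rewriting it into load-store style steps. The instruction format and micro-op mnemonics below are invented for illustration – they are not the actual internal micro-ops of any real x86 processor:

```python
# Sketch: a memory-operand CISC instruction such as "ADD [addr], reg"
# broken into RISC-like micro-operations (hypothetical mnemonics).
def split_to_micro_ops(instr):
    op, mem, reg = instr                  # e.g. ("ADD", "[1000h]", "EAX")
    return [
        ("LOAD", "tmp", mem),             # read the memory operand
        (op, "tmp", reg),                 # register-to-register ALU operation
        ("STORE", mem, "tmp"),            # write the result back to memory
    ]

print(split_to_micro_ops(("ADD", "[1000h]", "EAX")))
```

Each resulting micro-op touches either memory or the ALU, but not both – which is what makes the sequence easy to schedule in a pipeline.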
