
Computer Hardware and System Software Concepts



  1. Computer Hardware and System Software Concepts • Processor Structure • Von Neumann Machines • Pipelined • Clocked logic systems

  2. Von Neumann Machine • John von Neumann proposed the concept of a stored program computer in 1945. • In a von Neumann machine, the program and the data occupy the same memory. • The machine has a program counter (PC) which points to the current instruction in memory. • The PC is updated on every instruction. • When there are no branches, program instructions are fetched from sequential memory locations. • A branch simply updates the PC to some other location in the program memory.
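
  A minimal sketch of this fetch-execute cycle in Python (the three-field instruction format and the "addi"/"jump" operations are invented for illustration, not a real instruction set):

      # Program and data share one memory; the PC selects the next instruction.
      memory = {
          0x00: ("addi", 1, 5),    # r1 = r1 + 5
          0x01: ("addi", 2, 7),    # r2 = r2 + 7
          0x02: ("jump", 0x00, 0), # a branch simply overwrites the PC
      }
      regs = [0] * 4
      pc = 0x00                            # program counter

      for _ in range(6):                   # run a few instruction cycles
          op, a, b = memory[pc]            # fetch from the address held in the PC
          if op == "addi":
              regs[a] = regs[a] + b
              pc += 1                      # no branch: next sequential location
          elif op == "jump":
              pc = a                       # branch: update the PC directly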

  3. Synchronous Machines • Most machines nowadays are synchronous, that is, they are controlled by a clock. • Datapaths

  4. Synchronous Machines • Registers and combinatorial logic blocks alternate along the datapaths through the machine. • Data advances from one register to the next on each cycle of the global clock: as the clock edge clocks new data into a register, that register's current output (processed by passing through the following combinatorial block) is latched into the next register in the pipeline. • The registers are master-slave flip-flops which allow the input to be isolated from the output, ensuring a "clean" transfer of the new data into the register.

  5. Synchronous Machines • In a synchronous machine, the longest propagation delay, tpd(max), through any combinatorial block must be less than the clock cycle time, tcyc - otherwise a timing hazard will occur and stale data from the previous cycle will be clocked into a register again. • If tcyc < tpd for any operation in any stage of the pipeline, the clock edge will arrive at the register before data has propagated through the combinatorial block.
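
  Equivalently, the clock frequency is bounded by f(max) = 1/tpd(max). A quick sketch of that bound (the per-stage delay figures are invented for illustration):

      # The slowest combinatorial block sets the upper limit on clock frequency.
      stage_delays_ns = [2.5, 3.8, 3.1]         # assumed worst-case delay per stage
      t_pd_max = max(stage_delays_ns)           # longest propagation delay
      f_max_mhz = 1e3 / t_pd_max                # t in ns, so 1000/t gives MHz
      print(f"t_pd(max) = {t_pd_max} ns -> f(max) = {f_max_mhz:.0f} MHz")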

  6. Synchronous Machines... • There may also be feedback loops - in which the output of the current stage is fed back and latched in the same register: a conventional state machine. • This sort of logic is used to determine the next operation (ie the next microcode word or the next address for branching purposes).

  7. Basic Processor Structure • We will consider the basic structure of a simple processor.

  8. Basic Processor Structure... • ALU • Arithmetic Logic Unit - this circuit takes two operands on the inputs (labelled A and B) and produces a result on the output (labelled Y). The operations will usually include, as a minimum: • add, subtract • and, or, not • shift right, shift left • ALUs in more complex processors will execute many more instructions.
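
  A toy model of such an ALU, as a sketch (the operation names and the one-place shifts below are chosen for illustration):

      def alu(op, a, b, width=32):
          """Two-operand ALU: inputs A and B, result Y, truncated to the word width."""
          mask = (1 << width) - 1
          if op == "add":   y = a + b
          elif op == "sub": y = a - b
          elif op == "and": y = a & b
          elif op == "or":  y = a | b
          elif op == "not": y = ~a          # unary: ignores B
          elif op == "shl": y = a << 1      # shift left one place
          elif op == "shr": y = a >> 1      # shift right one place
          else: raise ValueError(op)
          return y & mask                   # Y output, wrapped to the word width

      print(alu("add", 7, 5))               # 12
      print(alu("sub", 3, 5))               # 4294967294 (two's-complement -2)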

  9. Basic Processor Structure... • Register File • A set of storage locations (registers) for storing temporary results. • Early machines had just one register - usually termed an accumulator. • Modern RISC processors will have at least 32 registers. • Instruction Register • The instruction currently being executed by the processor is stored here.

  10. Basic Processor Structure... • Control Unit • The control unit decodes the instruction in the instruction register and sets signals which control the operation of most other units of the processor. • For example, the operation code (opcode) in the instruction will be used to determine the settings of control signals for the ALU which determine which operation (+,-,^,v,~,shift,etc) it performs.
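
  As a sketch, the decode step can be pictured as a lookup from opcode to control signals. The opcode values below are real MIPS encodings (0 for register-type arithmetic, 0x23 for lw, 0x2b for sw), but the control-signal names are invented for illustration:

      # Hypothetical opcode -> control-signal table inside the control unit.
      CONTROL = {
          0b000000: dict(alu_op="add", reg_write=True,  mem_read=False, mem_write=False),
          0b100011: dict(alu_op="add", reg_write=True,  mem_read=True,  mem_write=False),  # lw
          0b101011: dict(alu_op="add", reg_write=False, mem_read=False, mem_write=True),   # sw
      }

      def decode(instruction):
          opcode = (instruction >> 26) & 0x3F       # top six bits, MIPS-style
          return CONTROL[opcode]

      print(decode(0x8C410000))                     # lw: memory read, register write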

  11. Basic Processor Structure... • Clock • The vast majority of processors are synchronous, that is, they use a clock signal to determine when to capture the next data word and perform an operation on it. • In a globally synchronous processor, a common clock needs to be routed (connected) to every unit in the processor.

  12. Basic Processor Structure... • Program counter • The program counter holds the memory address of the next instruction to be executed. • It is updated every instruction cycle to point to the next instruction in the program. • Memory Address Register • This register is loaded with the address of the next data word to be fetched from or stored into main memory.

  13. Basic Processor Structure... • Address Bus • This bus is used to transfer addresses to memory and memory-mapped peripherals. • It is driven by the processor acting as a bus master. • Data Bus • This bus carries data to and from the processor, memory and peripherals. It will be driven by the source of data, ie processor, memory or peripheral device.

  14. Basic Processor Structure... • Multiplexed Bus • Of necessity, high performance processors provide separate address and data buses. • To limit device pin counts and bus complexity, some simple processors multiplex address and data onto the same bus: naturally this has an adverse effect on performance. • When a bus is used for multiple purposes, eg address and data, it's called a multiplexed bus.

  15. Executing Instructions • Let's examine the steps in the execution of a simple memory fetch instruction, eg 101c₁₆: lw $1,0($2) • This instruction tells the processor to take the address stored in register 2, add 0 to it and load the word found at that address in main memory into register 1. • As the next instruction to be executed (our lw instruction) is at memory address 101c₁₆, the program counter contains 101c₁₆.

  16. Execution Steps • The control unit sets the multiplexer to drive the PC onto the address bus. • The memory unit responds by placing 8c410000₁₆ - the lw $1,0($2) instruction as encoded for a MIPS processor - on the data bus from where it is latched into the instruction register. • The control unit decodes the instruction, recognises it as a memory load instruction and directs the register file to drive the contents of register 2 onto the A input of the ALU and the value 0 onto the B input. At the same time, it instructs the ALU to add its inputs.

  17. Execution Steps.... • The output from the ALU is latched into the MAR. The controller ensures that this value is directed onto the address bus by setting the multiplexer. • When the memory responds with the value sought, it is captured on the internal data bus and latched into register 1 of the register file. • The program counter is now updated to point to the next instruction and the cycle can start again.
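
  Putting the steps together, a sketch of the whole lw sequence on a toy model of the datapath (the register contents and memory addresses are invented for illustration):

      # Walk through lw $1,0($2), starting with the PC at 0x101c.
      memory = {0x101c: 0x8C410000,        # the encoded lw instruction
                0x2000: 0xDEADBEEF}        # the data word to be loaded
      regs = {1: 0, 2: 0x2000}
      pc = 0x101c

      ir = memory[pc]                      # PC driven onto the address bus; IR latched
      a, b = regs[2], 0                    # register file drives ALU inputs A and B
      mar = a + b                          # ALU adds; result latched into the MAR
      regs[1] = memory[mar]                # memory responds; value latched into $1
      pc += 4                              # PC updated to the next instruction
      print(hex(regs[1]))                  # 0xdeadbeef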

  18. Another Example • Let's assume the next instruction is an add instruction: 1020₁₆: add $1,$3,$4 • This instruction tells the processor to add the contents of registers 3 and 4 and place the result in register 1. • The control unit sets the multiplexer to drive the PC onto the address bus. • The memory unit responds by placing 00642020₁₆ - the encoded add $1,$3,$4 instruction - on the data bus from where it is latched into the instruction register.

  19. Another Example... • The control unit decodes the instruction, recognizes it as an arithmetic instruction and directs the register file to drive the contents of register 3 onto the A input of the ALU and the contents of register 4 onto the B input. At the same time, it instructs the ALU to add its inputs. • The output from the ALU is latched into the register file at register address 1. • The program counter is now updated to point to the next instruction.

  20. Key Terms • von Neumann machine • A computer which stores its program in memory and steps through instructions in that memory. • Pipeline • A sequence of alternating storage elements (registers or latches) and combinatorial blocks, making up a datapath through the computer. • program counter • A register or memory location which holds the address of the next instruction to be executed. • synchronous (system/machine) • A computer in which instructions and data move from one pipeline stage to the next under control of a single (global) clock.

  21. Performance • Assume that the whole system is driven by a clock at f MHz. This means that each clock cycle takes t = 1/f microseconds. • Generally, a processor will execute one step every cycle; thus, for a memory load instruction, our simple processor needs the following steps:

  22. Performance... • PC to bus: 1 cycle • Memory response: tac • Decode and register access: 1 cycle • ALU operation and latch result to MAR: 1 cycle • Memory response: tac • Increment PC: overlapped with step 3 • Total = 3 cycles + 2·tac

  23. Performance... • If the clock runs at 100MHz (so each cycle takes 10ns) and the memory response time is, say, 100ns, then our simple processor needs 3×10 + 2×100 = 230ns to execute a load instruction. • For the add instruction, a similar count gives 3 cycles plus a single memory access: an add instruction requires 3×10 + 100 = 130ns to execute. • A store operation will also need more than 200ns, so instructions will require, on average, about 150ns.

  24. Performance Measures • One commonly used performance measure is MIPS, or millions of instructions per second. • Our simple processor will achieve: 1/(150×10⁻⁹) ≈ 6.6 × 10⁶ instructions per second ≈ 6.6 MIPS • 100MHz is a very common figure for processors in 1998. • A MIPS rating of 6.6 is very ordinary.
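
  A quick check of these figures, as a sketch (the clock and memory numbers are the ones assumed in the example above):

      # Instruction times for the simple, unpipelined processor.
      t_cyc_ns, t_ac_ns = 10, 100               # 100MHz clock; 100ns memory access

      t_load  = 3 * t_cyc_ns + 2 * t_ac_ns      # fetch + data read  = 230ns
      t_add   = 3 * t_cyc_ns + 1 * t_ac_ns      # fetch only         = 130ns
      t_store = 3 * t_cyc_ns + 2 * t_ac_ns      # fetch + data write = 230ns

      t_avg_ns = 150                            # rough average over a typical mix
      mips = 1e3 / t_avg_ns                     # 1/(150×10⁻⁹ s) = 6.67 million/s
      print(t_load, t_add, t_store, f"{mips:.2f} MIPS")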

  25. Bottlenecks • It will be obvious that access to main memory is a major limiting factor in the performance of a processor. • Management of the memory hierarchy to achieve maximum performance is one of the major challenges for a computer architect. • Unfortunately, the hardware maxim "smaller is faster" conflicts with programmers' and users' desires for more and more capabilities and more elaborate user interfaces in their programs - resulting in programs that require megabytes of main memory to run!

  26. Bottlenecks... • This has led the memory manufacturers to concentrate on density (improving the number of bits stored in a single package) rather than speed. • They have been remarkably successful in this: the growth in capacity of the standard DRAM chips which form the bulk of any computer's semiconductor memory has matched the increase in speed of processors.

  27. Bottlenecks... • However, the increase in DRAM access speeds has been much more modest - even if we consider recent developments in synchronous RAM and FRAM. • Another reason for the manufacturers' concentration on density is that a small improvement in DRAM access time has a negligible effect on the effective access time, which must also include overheads for bus protocols.

  28. Bottlenecks... • Cache memories are the most significant device used to reduce memory overheads. • However, a host of other techniques such as pipelining, pre-fetching, branch prediction, etc are all used to alleviate the impact of memory fetch times on performance.

  29. ALU • The Arithmetic and Logic Unit is the 'core' of any processor: it's the unit that performs the calculations. • A typical ALU will have two input ports (A and B) and a result port (Y). • It will also have a control input telling it which operation (add, subtract, and, or, etc) to perform and additional outputs for condition codes (carry, overflow, negative, zero result).

  30. ALU... • ALUs may be simple and perform only a few operations: integer arithmetic (add, subtract), boolean logic (and, or, complement) and shifts (left, right, rotate). • Such simple ALUs may be found in small 4- and 8-bit processors used in embedded systems.

  31. ALU... • More complex ALUs will support a wider range of integer operations (multiply and divide), floating point operations (add, subtract, multiply, divide) and even mathematical functions (square root, sine, cosine, log, etc). • The largest market for general purpose programmable processors is the commercial one, where the commonest arithmetic operations are addition and subtraction. Historically, integer multiply and all other more complex operations were performed in software - although this takes considerable time (a 32-bit integer multiply needs 32 adds and shifts), the low frequency of these operations meant that their low speed detracted very little from the machine's overall performance.
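
  A sketch of that software technique - multiplying using only shifts and adds, one step per bit of the multiplier:

      def mul32(a, b):
          """32-bit unsigned multiply built from 32 shift-and-add steps."""
          product = 0
          for _ in range(32):
              if b & 1:                        # low bit of the multiplier set?
                  product += a                 # ...then add the shifted multiplicand
              a = (a << 1) & 0xFFFFFFFF        # shift the multiplicand left
              b >>= 1                          # shift the multiplier right
          return product & 0xFFFFFFFF

      print(mul32(6, 7))                       # 42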

  32. ALU... • Thus designers would allocate their valuable silicon area to cache and other devices which had a more direct impact on processor performance in the target marketplace. • More recently, transistor geometries have shrunk to the point where it's possible to get 10⁷ transistors on a single die. • Thus it becomes feasible to include floating point ALUs on every chip - probably more economic than designing separate processors without the floating point capability.

  33. ALU... • In fact, some manufacturers will supply otherwise identical processors with and without floating point capability. • This can be achieved economically by marking chips which have defects only in the region of the floating point unit as "integer-only" processors and selling them at a lower price for the commercial information processing market! • This has the desirable effect of increasing semiconductor yield quite significantly - a floating point unit is quite complex and occupies a considerable area of silicon.

  34. ALU... [figure]

  35. ALU... • In simple processors, the ALU is a large block of combinatorial logic with the A and B operands and the opcode (operation code) as inputs and a result, Y, plus the condition codes as outputs. • Operands and opcode are applied on one clock edge and the circuit is expected to produce a result before the next clock edge. • Thus the propagation delay through the ALU determines a minimum clock period and sets an upper limit to the clock frequency.

  36. ALU... • In advanced processors, the ALU is heavily pipelined to extract higher instruction throughput. • Faster clock speeds are now possible because complex operations (eg floating point operations) are done in multiple stages: each individual stage is smaller and faster.

  37. Software or Hardware? • The question of which instructions should be implemented in hardware and which can be left to software continues to occupy designers. • A high performance processor with 10⁷ transistors is very expensive to design - $10⁸ is probably a minimum! • Thus the trend seems to be to place everything on the die. • However, there is an enormous market for lower capability processors - for embedded systems, primarily.

  38. Note for hackers • A small "industry" has grown up around the phenomenon of "clock-chipping" - the discovery that a processor will generally run at a frequency somewhat higher than its specification. • Of necessity, manufacturers are somewhat conservative about the performance of their products and have to specify performance over a certain temperature range. • For commercial products this is commonly 0°C to 70°C.

  39. Note for hackers... • A reputable computer manufacturer will also be somewhat conservative, ensuring that the temperature inside the case of its computers normally never rises above, say, 45°C. • This allows sufficient margin for error in both directions - chips sometimes degrade with age and computers may encounter unusual environmental conditions - so that systems will continue to function to their specifications.

  40. Note for hackers... • Clock-chippers rely on the fact that propagation delays usually increase with temperature, so that a chip specified at x MHz at 70°C may well run at 1.5x at 45°C. • Needless to say, this is a somewhat reckless strategy: your processor may function perfectly well for a few months in winter - and then start failing, initially occasionally, and then more regularly as summer approaches!

  41. Note for hackers... • The manufacturer may also have allowed for some degradation with age, so that a chip specified for 70°C now will still function at x MHz in two years' time. • Thus a clock-chipped processor may start to fail after a few months at the higher speed - again the failures may be irregular and occasional initially, and start to occur with greater frequency as the effects of age show themselves. • Restoring the original clock speed may be all that's needed to give you back a functional computer!

  42. Key terms • condition codes • A set of bits which stores general information about the result of an operation, eg result was zero, result was negative, overflow occurred, etc.
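
  A sketch of how four common condition codes might be derived from a 32-bit add (the N/Z/C/V naming follows common practice; details vary between instruction sets):

      def add_with_flags(a, b, width=32):
          """Add two words and derive negative, zero, carry and overflow flags."""
          mask = (1 << width) - 1
          sign = 1 << (width - 1)
          raw = a + b
          y = raw & mask
          flags = {
              "N": bool(y & sign),                    # result was negative
              "Z": y == 0,                            # result was zero
              "C": raw > mask,                        # carry out of the top bit
              "V": bool((~(a ^ b)) & (a ^ y) & sign), # signed overflow occurred
          }
          return y, flags

      print(add_with_flags(0x7FFFFFFF, 1))            # wraps negative: N and V set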

  43. Register File • The Register File is the highest level of the memory hierarchy. • In a very simple processor, it consists of a single memory location - usually called an accumulator. • The result of ALU operations is stored here and can be re-used in a subsequent operation or saved into memory. • In a modern processor, it's considered necessary to have at least 32 registers for integer values and often 32 floating point registers as well.

  44. Register File... • Thus the register file is a small, addressable memory at the top of the memory hierarchy. • It's visible to programs (which address registers directly), so that the number and type (integer or floating point) of registers is part of the instruction set architecture (ISA).

  45. Register File... • Registers are built from fast multi-ported memory cells. • They must be fast: a register must be able to drive its data onto an internal bus in a single clock cycle. • They are multi-ported because the register file must be able to supply operands to both the A and B inputs of the ALU and accept a value to be stored from the internal data bus.
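
  A behavioural sketch of such a register file, with two read ports and one write port (the class and method names here are invented for illustration):

      class RegisterFile:
          """Behavioural model: two read ports (A and B) and one write port."""
          def __init__(self, n=32, width=32):
              self.regs = [0] * n
              self.mask = (1 << width) - 1

          def read(self, ra, rb):
              # Both ALU operands are supplied in the same cycle: two read ports.
              return self.regs[ra], self.regs[rb]

          def write(self, rd, value):
              # One value captured from the internal data bus per cycle.
              self.regs[rd] = value & self.mask

      rf = RegisterFile()
      rf.write(2, 0x2000)
      a, b = rf.read(2, 0)                     # drive $2 and $0 onto the ALU inputs
      print(hex(a), b)                         # 0x2000 0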

  46. Register File... [figure]

  47. Register File Capacity • A modern processor will have at least 32 integer registers each capable of storing a word of 32 (or, more recently, 64) bits. • A processor with floating point capabilities will generally also provide 32 or more floating point registers, each capable of holding a double precision floating point word. • These registers are used by programs as temporary storage for values which will be needed for calculations.

  48. Register File Capacity... • Because the registers are "closest" to the processor in terms of access time - able to supply a value within a single clock cycle - an optimising compiler for a high level language will attempt to retain as many frequently used values in the registers as possible. • Thus the size of the register file is an important factor in the overall speed of programs. • Earlier processors with fewer than 32 registers (eg early members of the x86 family) severely hampered the ability of the compiler to keep frequently referenced values close to the processor.

  49. Register File Capacity... • However, it isn't possible to arbitrarily increase the size of the register file. With too many registers: • the capacitive load of too many cells on the data line will reduce its response time, • the resistance of the long data lines needed to connect many cells will combine with the capacitive load to reduce the response time,

  50. Register File Capacity... • the number of bits needed to address the registers will result in longer instructions. A typical RISC instruction has three operands: sub $5, $3, $6 - requiring 15 bits with 32 (= 2⁵) registers (see the sketch below), • the complexity of the address decoder (and thus its propagation delay time) will increase as the size of the register file increases.
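
  A quick sketch of that operand bit budget as the register count grows (the three-operand layout is the RISC form mentioned above):

      import math

      # Bits needed to name three register operands for various register counts.
      for n_regs in (16, 32, 64, 128):
          bits = int(math.log2(n_regs))        # address bits per operand
          print(f"{n_regs:4d} registers -> 3 × {bits} = {3 * bits} operand bits")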
