

  1. Cosc 2150: Computer Organization Chapter 9 Alternative Architectures

  2. Introduction • We have so far studied only the simplest models of computer systems: classical single-processor von Neumann systems. • This chapter presents a number of different approaches to computer organization and architecture. • Some of these approaches are in place in today’s commercial systems. Others may form the basis for the computers of tomorrow.

  3. RISC Machines • The underlying philosophy of RISC machines is that a system is better able to manage program execution when the program consists of only a few different instructions that are the same length and require the same number of clock cycles to decode and execute. • RISC systems access memory only with explicit load and store instructions. • In CISC systems, many different kinds of instructions access memory, making instruction length variable and fetch-execute time unpredictable.

  4. RISC Machines • The difference between CISC and RISC becomes evident through the basic computer performance equation: time/program = (time/cycle) × (cycles/instruction) × (instructions/program). • RISC systems shorten execution time by reducing the number of clock cycles per instruction. • CISC systems improve performance by reducing the number of instructions per program.
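
A quick worked illustration of this equation, as a sketch in C. The instruction counts, CPI values, and clock rates below are hypothetical, chosen only to show the trade-off numerically.

    #include <stdio.h>

    /* time = instructions * cycles-per-instruction * seconds-per-cycle.
       All figures are made-up example values, not measurements.          */
    int main(void) {
        double cisc_time = 50e6  * 6.0 * (1.0 / 400e6);  /* fewer instructions, higher CPI */
        double risc_time = 150e6 * 1.2 * (1.0 / 600e6);  /* more instructions, lower CPI   */
        printf("CISC: %.2f s   RISC: %.2f s\n", cisc_time, risc_time);
        return 0;
    }

Either factor of the equation can be attacked; which design wins depends entirely on the actual counts and cycle times.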

  5. RISC Machines • The simple instruction set of RISC machines enables control units to be hardwired for maximum speed. • The more complex, and variable, instruction set of CISC machines requires microcode-based control units that interpret instructions as they are fetched from memory. This translation takes time. • With fixed-length instructions, RISC lends itself better to pipelining and speculative execution.

  6. RISC Machines • Consider the program fragments:

    RISC:
        mov ax, 0
        mov bx, 10
        mov cx, 5
    Begin:
        add ax, bx
        loop Begin

    CISC:
        mov ax, 10
        mov bx, 5
        mul bx, ax

  • The total clock cycles for the CISC version might be: (2 movs × 1 cycle) + (1 mul × 30 cycles) = 32 cycles • While the clock cycles for the RISC version are: (3 movs × 1 cycle) + (5 adds × 1 cycle) + (5 loops × 1 cycle) = 13 cycles • With the RISC clock cycle also being shorter, RISC gives us much faster execution speeds.

  7. RISC Machines • Because of their load-store ISAs, RISC architectures require a large number of CPU registers. • These registers provide fast access to data during sequential program execution. • They can also be employed to reduce the overhead typically caused by passing parameters to subprograms. • Instead of pulling parameters off of a stack, the subprogram is directed to use a subset of registers. • This approach is called register windows.

  8. RISC Machines • This is how registers can be overlapped in a RISC system. • The current window pointer (CWP) points to the active register window.
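
A minimal C sketch of the idea, assuming 8 windows of 24 registers with an 8-register overlap between a caller's outputs and its callee's inputs; these sizes are illustrative, not taken from any particular processor.

    #include <stdio.h>

    #define NWINDOWS   8    /* windows arranged as a circular buffer        */
    #define WINDOW_SZ  24   /* ins (8) + locals (8) + outs (8) per window   */
    #define OVERLAP    8    /* the caller's outs are the callee's ins       */
    #define PHYS_REGS  (NWINDOWS * (WINDOW_SZ - OVERLAP))

    static int phys[PHYS_REGS];   /* physical register file */
    static int cwp = 0;           /* current window pointer */

    /* Map a window-relative register (0..23) to a physical register.
       Because consecutive windows overlap by 8, register 16+k of window w
       is the same physical register as register k of window w+1.          */
    static int *reg(int r) {
        return &phys[(cwp * (WINDOW_SZ - OVERLAP) + r) % PHYS_REGS];
    }

    int main(void) {
        *reg(16) = 42;                  /* caller writes an "out" (a parameter) */
        cwp = (cwp + 1) % NWINDOWS;     /* procedure call advances the CWP      */
        printf("callee reads in-register r0 = %d\n", *reg(0));   /* prints 42   */
        return 0;
    }

The parameter is never copied; advancing the CWP simply re-labels the same physical registers, which is the overhead reduction described above.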

  9. Overlapping Register Windows

  10. Circular Buffer diagram

  11. RISC Machines • It is becoming increasingly difficult to distinguish RISC architectures from CISC architectures. • Some RISC systems provide more extravagant instruction sets than some CISC systems. • Some systems combine both approaches. • The following two slides summarize the characteristics that traditionally typify the differences between these two architectures.

  12. RISC vs CISC Machines
  CISC:
  • Single register set.
  • One or two register operands per instruction.
  • Parameter passing through memory.
  • Multiple-cycle instructions.
  • Microprogrammed control.
  • Less pipelined.
  RISC:
  • Multiple register sets.
  • Three operands per instruction.
  • Parameter passing through register windows.
  • Single-cycle instructions.
  • Hardwired control.
  • Highly pipelined.
  Continued....

  13. RISC vs CISC Machines
  CISC:
  • Many complex instructions.
  • Variable-length instructions.
  • Complexity in microcode.
  • Many instructions can access memory.
  • Many addressing modes.
  RISC:
  • Simple instructions, few in number.
  • Fixed-length instructions.
  • Complexity in compiler.
  • Only LOAD/STORE instructions access memory.
  • Few addressing modes.

  14. Why CISC (1)? • Compiler simplification? • Disputed… • Complex machine instructions are harder to exploit • Optimization is more difficult • Smaller programs? • A program takes up less memory but… • Memory is now cheap • The program may not occupy fewer bits; it may just look shorter in symbolic form • More instructions require longer opcodes • Register references require fewer bits

  15. Why CISC (2)? • Faster programs? • Bias towards use of simpler instructions • More complex control unit • Microprogram control store larger • thus simple instructions take longer to execute • It is far from clear that CISC is the appropriate solution

  16. Background to IA-64 • Intel & Hewlett-Packard (HP) jointly developed • A new architecture • 64-bit architecture • Not an extension of x86 • Not an adaptation of HP's 64-bit RISC architecture • Exploits vast circuitry and high speeds • Systematic use of parallelism • A departure from superscalar

  17. Motivation • Instruction-level parallelism • Implicit in the machine instruction • Not determined at run time by the processor • Long or very long instruction words (LIW/VLIW) • Branch predication (not the same as branch prediction) • Speculative loading • Intel & HP call this Explicitly Parallel Instruction Computing (EPIC) • IA-64 is an instruction set architecture intended for implementation on EPIC • Itanium was the first Intel product

  18. IA-64 vs Superscalar

  19. Organization • Large number of registers • 256 general and floating-point registers in total • 128 64-bit registers for integer, logical, and general-purpose use • 128 82-bit registers for floating-point and graphics use • Plus 64 1-bit predicate registers for predicated execution (explained later) • Multiple execution units: • a typical superscalar supports 4 • IA-64 has 8 or more • 4 types of execution units • I-unit for integer, logical, and compare instructions • M-unit for load/store instructions • B-unit for branch instructions • F-unit for floating-point instructions

  20. Instruction format • Defined as a 128-bit bundle • contains 3 instructions, called syllables • the processor fetches 1 or more bundles at a time • i.e., a minimum of 3 instructions at a time • A template field in the bundle tells which instructions in the bundle can be executed in parallel • It is up to the compiler to get this right, and to reorder instructions to maximize instruction-level parallelism • Each instruction gets 41 bits • Opcode: 4 bits • middle 31 bits for the general format (7 bits for each register in an add %r1, %r2, %r3, plus 10 bits for modifiers to the opcode) • predicate register: 6 bits
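
A small C sketch of pulling the three fields out of one 41-bit syllable, using only the sizes given above (4-bit opcode, 31 middle bits, 6-bit predicate). The field ordering and the example bit pattern are assumptions for illustration; the real encodings vary by instruction type.

    #include <stdint.h>
    #include <stdio.h>

    /* One 41-bit syllable, split per the sizes on the slide (layout assumed):
       bits 40..37 major opcode, bits 36..6 operand/modifier fields,
       bits 5..0 qualifying predicate register.                               */
    static void decode_syllable(uint64_t s) {
        unsigned opcode = (unsigned)((s >> 37) & 0xF);
        unsigned middle = (unsigned)((s >> 6)  & 0x7FFFFFFF);
        unsigned pred   = (unsigned)( s        & 0x3F);
        printf("opcode=%u  middle=0x%08x  predicate=p%u\n", opcode, middle, pred);
    }

    int main(void) {
        /* A made-up 41-bit pattern, just to exercise the field extraction. */
        uint64_t syllable = ((uint64_t)0x8 << 37) | (0x12345u << 6) | 0x3;
        decode_syllable(syllable);
        return 0;
    }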

  21. Assembly Language Format • [qp] mnemonic [.comp] dest = srcs // • qp - qualifying predicate register • if it is 1 at execution, the instruction executes and the result is committed to hardware • if it is 0, the result is discarded • mnemonic - name of the instruction • comp – one or more instruction completers used to qualify the mnemonic • dest – one or more destination operands • srcs – one or more source operands • // - comment • Instruction groups and stops are indicated by ;; • An instruction group is a sequence with no read-after-write or write-after-write dependencies • so it does not need hardware register dependency checks

  22. Assembly Examples
    ld8 r1 = [r5] ;;    // first group
    add r3 = r1, r4     // second group
  • The second instruction depends on the value in r1 • which is changed by the first instruction • so the two cannot be in the same group for parallel execution

  23. Instruction format diagram

  24. Predicated execution • A technique where the compiler determines which instructions may execute in parallel. • The compiler eliminates branches from the program by using conditional execution • An example is an if-then-else • we've seen this in IAS and ARC, but IA-64 handles it differently: • Both the true and false sides are executed in parallel • The compare sets the predicate registers: one for the true (then) side and one for the false (else) side • The processor discards one outcome based on the predicate value.
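
In C terms, the compiler's transformation looks roughly like the sketch below: the compare produces a predicate, both arms are computed, and one result is kept. This is only an analogy; on IA-64 the squashed instructions are discarded by the hardware rather than selected in software.

    #include <stdio.h>

    /* Branch-free if-then-else: the compare sets a predicate (like p1/p2),
       both arms execute, and the predicate decides which result survives.  */
    static int abs_predicated(int x) {
        int pred      = (x < 0);     /* compare sets the predicate   */
        int then_side = -x;          /* "true" arm, always computed  */
        int else_side =  x;          /* "false" arm, always computed */
        return pred ? then_side : else_side;   /* discard one outcome */
    }

    int main(void) {
        printf("%d %d\n", abs_predicated(-7), abs_predicated(5));   /* 7 5 */
        return 0;
    }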

  25. Predication diagram

  26. Control Speculation • Known as speculative loading • allows the processor to load data from memory before the program needs it • to avoid memory latency delays • Avoids bottlenecks in obtaining data from memory • It also works together with predicated execution
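
A C analogy for the idea (a sketch only; the real mechanism is the ld.s/chk.s instruction pair, which also defers any exception until the value is actually needed):

    #include <stdio.h>
    #include <stddef.h>

    /* The load is hoisted above the branch that guards it, so its memory
       latency overlaps other work; the value is used only if the branch
       turns out to need it, otherwise it is simply discarded.             */
    static int guarded_load(const int *ptr, int flag) {
        int speculative = (ptr != NULL) ? *ptr : 0;  /* loaded before flag is known */
        int other_work  = 40 + 2;                    /* overlaps the load latency   */
        return flag ? speculative + other_work       /* needed: use the value       */
                    : other_work;                    /* not needed: discard it      */
    }

    int main(void) {
        int x = 10;
        printf("%d %d\n", guarded_load(&x, 1), guarded_load(&x, 0));   /* 52 42 */
        return 0;
    }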

  27. Speculative Loading diagram

  28. Data speculation • Used together with speculative loading • Allows loads to be reordered ahead of stores; potential conflicts are tracked with a lookup table (the Advanced Load Address Table) • which determines whether the loaded data depends on stores that had not yet happened when the load was performed • Works well if loads and stores have little chance of overlapping.
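
The same idea in a C sketch: the load is moved above a store that might alias it, and a cheap address check stands in for the ALAT lookup, forcing a reload only when the store really did hit the loaded location. Names and values are illustrative.

    #include <stdio.h>

    /* "Advanced" load hoisted above a possibly-aliasing store, followed by
       an ALAT-style check and a recovery reload in the rare conflict case. */
    static int advanced_load(int *p, int *q, int value) {
        int speculative = *q;   /* load performed early                  */
        *p = value;             /* store that might alias q              */
        if (p == q)             /* check: did the store hit q after all? */
            speculative = *q;   /* recovery: redo the load               */
        return speculative;
    }

    int main(void) {
        int a = 1, b = 2;
        printf("%d\n", advanced_load(&a, &b, 9));   /* no alias: prints 2 */
        printf("%d\n", advanced_load(&a, &a, 9));   /* alias:    prints 9 */
        return 0;
    }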

  29. Software Pipelining • unrolling loops • example: for (int i=0; i<=5; i++) { y[i] = x[i] + c; } • As written, this small, tight loop cannot be issued in parallel: each iteration is tied to the loop counter i, which indexes the x and y arrays. • So the compiler is responsible for "unrolling" the loop (i.e.: y[0] = x[0] + c; y[1] = x[1] + c; y[2] = x[2] + c; etc.), so the iterations can run in parallel, as spelled out below.
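
The unrolled form written out as compilable C (a sketch of the compiler's output; array contents are arbitrary):

    #include <stdio.h>

    int main(void) {
        int x[6] = {1, 2, 3, 4, 5, 6}, y[6], c = 10;

        /* Original loop, serialized by the counter i:
           for (int i = 0; i <= 5; i++) y[i] = x[i] + c;               */

        /* Fully unrolled: six independent adds the hardware can overlap. */
        y[0] = x[0] + c;
        y[1] = x[1] + c;
        y[2] = x[2] + c;
        y[3] = x[3] + c;
        y[4] = x[4] + c;
        y[5] = x[5] + c;

        for (int i = 0; i <= 5; i++) printf("%d ", y[i]);   /* 11 12 13 14 15 16 */
        printf("\n");
        return 0;
    }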

  30. Itanium Organization

  31. Flynn’s Taxonomy • Many attempts have been made to come up with a way to categorize computer architectures. • Flynn’s Taxonomy has been the most enduring of these, despite having some limitations. • Flynn’s Taxonomy takes into consideration the number of processors and the number of data paths incorporated into an architecture. • A machine can have one or many processors that operate on one or many data streams.

  32. Flynn’s Taxonomy • The four combinations of multiple processors and multiple data paths are described by Flynn as: • SISD: Single instruction stream, single data stream. These are classic uniprocessor systems. • SIMD: Single instruction stream, multiple data streams. Execute the same instruction on multiple data values, as in vector processors. • MIMD: Multiple instruction streams, multiple data streams. These are today’s parallel architectures. • MISD: Multiple instruction streams, single data stream.

  33. Flynn’s Taxonomy • Flynn’s Taxonomy falls short in a number of ways: • First, there appears to be no need for MISD machines. • Second, parallelism is not homogeneous. This assumption ignores the contribution of specialized processors. • Third, it provides no straightforward way to distinguish architectures of the MIMD category. • One common approach is to divide these systems into those that share memory, and those that don’t, as well as whether the interconnections are bus-based or switch-based.

  34. Flynn’s Taxonomy • Symmetric multiprocessors (SMP) and massively parallel processors (MPP) are MIMD architectures that differ in how they use memory. • SMP systems share the same memory and MPP systems do not. • An easy way to distinguish SMP from MPP is: MPP → many processors + distributed memory + communication via network SMP → fewer processors + shared memory + communication via memory

  35. Flynn’s Taxonomy • Other examples of MIMD architectures are found in distributed computing, where processing takes place collaboratively among networked computers. • A network of workstations (NOW) uses otherwise idle systems to solve a problem. • A collection of workstations (COW) is a NOW where one workstation coordinates the actions of the others. • A dedicated cluster parallel computer (DCPC) is a group of workstations brought together to solve a specific problem. • A pile of PCs (POPC) is a cluster of (usually) heterogeneous systems that form a dedicated parallel system.

  36. Flynn’s Taxonomy • Flynn’s Taxonomy has been expanded to include SPMD (single program, multiple data) architectures. • Each SPMD processor has its own data set and program memory. Different nodes can execute different instructions within the same program using instructions similar to: If myNodeNum = 1 do this, else do that • Yet another idea missing from Flynn’s is whether the architecture is instruction driven or data driven. The next slide provides a revised taxonomy.
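
The "if myNodeNum = 1 do this, else do that" pattern written as a minimal C program using MPI, which is a common way to run SPMD codes (MPI itself is not mentioned on the slide; it is used here only as a familiar example):

    #include <stdio.h>
    #include <mpi.h>

    /* SPMD: every node runs this same program and branches on its own rank. */
    int main(int argc, char **argv) {
        int myNodeNum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myNodeNum);

        if (myNodeNum == 1)
            printf("node %d: doing this\n", myNodeNum);
        else
            printf("node %d: doing that\n", myNodeNum);

        MPI_Finalize();
        return 0;
    }

Launched with, say, mpirun -np 4, all four copies execute the same program but take different paths through it.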

  37. Flynn’s Taxonomy

  38. Parallel and Multiprocessor Architectures • Parallel processing is capable of economically increasing system throughput while providing better fault tolerance. • The limiting factor is that no matter how well an algorithm is parallelized, there is always some portion that must be done sequentially. • Additional processors sit idle while the sequential work is performed. • Thus, it is important to keep in mind that an n-fold increase in processing power does not necessarily result in an n-fold increase in throughput.
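
This is Amdahl's law: if a fraction p of the work can be parallelized, the best possible speedup on n processors is 1 / ((1 - p) + p/n). A short C sketch with a hypothetical p = 0.90 shows how quickly extra processors stop paying off:

    #include <stdio.h>

    int main(void) {
        double p = 0.90;                      /* assumed parallel fraction */
        for (int n = 1; n <= 64; n *= 2) {
            double speedup = 1.0 / ((1.0 - p) + p / n);
            printf("n = %2d   speedup = %5.2f\n", n, speedup);
        }
        return 0;   /* speedup climbs toward, but never reaches, 1/(1-p) = 10 */
    }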

  39. Parallel and Multiprocessor Architectures • Recall that pipelining divides the fetch-decode-execute cycle into stages that each carry out a small part of the process on a set of instructions. • Ideally, an instruction exits the pipeline during each tick of the clock. • Superpipelining occurs when a pipeline has stages that require less than half a clock cycle to complete. • The pipeline is equipped with a separate clock running at a frequency that is at least double that of the main system clock. • Superpipelining is only one aspect of superscalar design.

  40. Parallel and Multiprocessor Architectures • Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers. • A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory. • A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly. • This architecture also requires compilers that make optimum use of the hardware.

  41. Parallel and Multiprocessor Architectures • Very long instruction word (VLIW) architectures differ from superscalar architectures because the VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. • IA-64, covered earlier in the lecture, is an example. • One could argue that this is the best approach because the compiler can better identify instruction dependencies. • However, compilers tend to be conservative and cannot see the run-time behavior of the code.

  42. Parallel and Multiprocessor Architectures • Vector computers are processors that operate on entire vectors or matrices at once. • These systems are often called supercomputers. • Vector computers are highly pipelined so that arithmetic instructions can be overlapped. • Vector processors can be categorized according to how operands are accessed. • Register-register vector processors require all operands to be in registers. • Memory-memory vector processors allow operands to be sent from memory directly to the arithmetic units.
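
For a sense of scale, the loop below is what a single vector add replaces; a vector machine would express the computation as (roughly) one instruction per fixed-length slice of the arrays, rather than one scalar add per element. The array length of 64 is arbitrary.

    #include <stdio.h>

    #define N 64

    /* Scalar form: N separate fetch-decode-execute cycles of "add".
       A vector processor issues roughly V3 = V1 + V2 instead, streaming
       the elements through its pipelined arithmetic units.              */
    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
        printf("c[10] = %.1f\n", c[10]);   /* 30.0 */
        return 0;
    }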

  43. Parallel and Multiprocessor Architectures • A disadvantage of register-register vector computers is that large vectors must be broken into fixed-length segments so they will fit into the register sets. • Memory-memory vector computers have a longer startup time until the pipeline becomes full. • In general, vector machines are efficient because there are fewer instructions to fetch, and corresponding pairs of values can be prefetched because the processor knows it will have a continuous stream of data.

  44. Parallel and Multiprocessor Architectures • MIMD systems can communicate through shared memory or through an interconnection network. • Interconnection networks are often classified according to their topology, routing strategy, and switching technique. • Of these, the topology is a major determining factor in the overhead cost of message passing. • Message passing takes time owing to network latency and incurs overhead in the processors.

  45. Parallel and Multiprocessor Architectures • Interconnection networks can be either static or dynamic. • Processor-to-memory connections usually employ dynamic interconnections. These can be blocking or nonblocking. • Nonblocking interconnections allow connections to occur simultaneously. • Processor-to-processor message-passing interconnections are usually static, and can employ any of several different topologies, as shown on the following slide.

  46. Parallel and Multiprocessor Architectures

  47. Parallel and Multiprocessor Architectures • Dynamic routing is achieved through switching networks that consist of crossbar switches or 2 × 2 switches.

  48. Parallel and Multiprocessor Architectures • Multistage interconnection (or shuffle) networks are the most advanced class of switching networks. They can be used in loosely-coupled distributed systems, or in tightly-coupled processor-to-memory configurations.

  49. Parallel and Multiprocessor Architectures • There are advantages and disadvantages to each switching approach. • Bus-based networks, while economical, can be bottlenecks. Parallel buses can alleviate bottlenecks, but are costly. • Crossbar networks are nonblocking, but require n² switches to connect n entities. • Omega networks are blocking networks, but exhibit less contention than bus-based networks. They are somewhat more economical than crossbar networks, n nodes needing log₂n stages with n/2 switches per stage.
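
The cost comparison as plain arithmetic, in a small C sketch (n is kept to powers of two so the omega stage count is exact; link with -lm):

    #include <stdio.h>
    #include <math.h>

    /* Crossbar: n * n switches.  Omega: log2(n) stages of n/2 switches each. */
    int main(void) {
        for (long n = 8; n <= 512; n *= 4) {
            long crossbar = n * n;
            long omega    = (long)(log2((double)n)) * (n / 2);
            printf("n = %4ld   crossbar = %7ld   omega = %5ld\n", n, crossbar, omega);
        }
        return 0;
    }

For n = 512 the crossbar needs 262,144 switches while the omega network needs 2,304, which is the economy referred to above; the price is that the omega network can block.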

  50. Parallel and Multiprocessor Architectures • Tightly-coupled multiprocessor systems use the same memory. They are also referred to as shared memory multiprocessors. • The processors do not necessarily have to share the same block of physical memory: • Each processor can have its own memory, but it must share it with the other processors. • Configurations such as these are called distributed shared memory multiprocessors.
