
Computer architecture II



  1. Computer architecture II: Introduction

  2. Today’s overview
  • Why parallel computing?
  • Technology trends
    • Processors
    • Storage
    • Architectural
  • Application trends
    • Challenging computational problems
  • What is a parallel computer?
  • Classical parallel computer classifications
    • Architecture
    • Memory access
  • Cluster and grid computing (definitions)
  • Top 500
  • Parallel architectures and their convergence

  3. Units of Measure in HPC
  • High Performance Computing (HPC) units:
  • Flop: floating point operation
  • Flop/s: floating point operations per second
  • Bytes: size of data (a double-precision floating point number is 8 bytes long)
  • Typical sizes are millions, billions, trillions…
  Mega: Mflop/s = 10^6 flop/sec; Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
  Giga: Gflop/s = 10^9 flop/sec; Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
  Tera: Tflop/s = 10^12 flop/sec; Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
  Peta: Pflop/s = 10^15 flop/sec; Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
  Exa: Eflop/s = 10^18 flop/sec; Ebyte = 10^18 bytes
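The decimal and binary readings of these prefixes can be checked with a few lines of Python (a quick sketch to accompany the table, not part of the original slides):

```python
# Decimal vs. binary interpretations of the HPC prefixes above.
# Rates (flop/s) conventionally use powers of 10; memory sizes are
# often quoted in powers of 2 (2^10 per prefix step).
PREFIXES = {"Mega": 6, "Giga": 9, "Tera": 12, "Peta": 15, "Exa": 18}

for name, exp in PREFIXES.items():
    decimal = 10 ** exp
    binary = 2 ** (10 * (exp // 3))    # Mega -> 2^20, Giga -> 2^30, ...
    drift = binary / decimal - 1.0     # binary units run a few % larger
    print(f"{name}: 10^{exp} = {decimal}  vs  2^{10 * (exp // 3)} = {binary}  (+{drift:.1%})")

# A double-precision float is 8 bytes, so a 1 Tbyte memory holds
# 10^12 / 8 = 125 billion doubles.
doubles_per_tbyte = 10 ** 12 // 8
```

Note that the gap between the two readings widens with each step: about 5% at Mega but over 15% at Exa.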

  4. Why parallel computing?
  • Sequential computer (von Neumann model):
  • One processor (control unit + arithmetic logic unit)
  • One memory
  • One instruction executed at a time
  • Fastest machines: a couple of billion operations per second (GFLOPS)
  [Diagram: processor (control unit, ALU) linked via connecting logic to the I/O system and memory]

  5. Tunnel Vision by Experts
  • “I think there is a world market for maybe five computers.” (Thomas Watson, chairman of IBM, 1943)
  • “There is no reason for any individual to have a computer in their home.” (Ken Olson, president and founder of Digital Equipment Corporation, 1977)
  • “640K [of memory] ought to be enough for anybody.” (Bill Gates, chairman of Microsoft, 1981)
  Slide source: Warfield et al.

  6. Technology Trends: Microprocessor Capacity
  • Moore’s Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months (2X transistors per chip every 1.5 years).
  • Microprocessors have become smaller, denser, and more powerful.
  Slide source: Jack Dongarra
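As a sanity check, the 18-month doubling rule compounds like this (illustrative Python; the helper and inputs are not from the slide):

```python
# Moore's-law projection: transistor count doubling every 18 months.
def projected_transistors(start_count, years, doubling_months=18):
    """Projected count after `years` at one doubling per `doubling_months`."""
    return start_count * 2 ** (years * 12 / doubling_months)

# Over 30 years that is 20 doublings, i.e. a factor of about a million:
growth_30yr = projected_transistors(1, 30)   # 2^20 = 1048576
```

The exponential form makes the sensitivity obvious: stretching the doubling period from 18 to 24 months cuts the 30-year factor from 2^20 to 2^15.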


  8. Impact of Device Shrinkage
  • What happens when the transistor size shrinks by a factor of x?
  • Clock rate goes up by x because wires are shorter
    • actually less than x, because of power consumption
  • Transistors per unit area go up by x^2
  • Die size also tends to increase
    • typically another factor of ~x
  • Raw computing power of the chip goes up by ~x^4!
    • of which x^3 is devoted either to parallelism or locality
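The scaling argument above can be written out as plain arithmetic (a sketch of the slide's first-order estimates, not a device model):

```python
# First-order effects of shrinking feature size by a factor x,
# following the slide's estimates.
def shrink_effects(x):
    return {
        "clock_rate": x,       # shorter wires -> clock up by ~x
        "density": x ** 2,     # transistors per unit area up by x^2
        "die_area": x,         # die size tends to grow another ~x
        "raw_power": x ** 4,   # x (clock) * x^2 (density) * x (area)
    }

e = shrink_effects(2)          # halve the feature size
# Raw compute rises 16x, but only 2x of that comes from the clock: the
# remaining x^3 = 8x must be exploited as parallelism or locality.
```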

  9. Microprocessor Transistors per Chip
  [Figures: growth in transistors per chip; increase in clock rate]

  10. Limiting forces: increased cost and difficulty of manufacturing

  11. How fast can a serial computer be? (James Demmel)
  • Consider a 1 Tflop/s, 1 Tbyte sequential machine:
  • Data must travel some distance, r, to get from memory to CPU.
  • To get 1 data element per cycle means 10^12 fetches per second; at the speed of light, c = 3×10^8 m/s, this requires r < c/10^12 = 0.3 mm.
  • Now put 1 Tbyte of storage in a 0.3 mm × 0.3 mm area:
  • Each word occupies about 3 square Angstroms, the size of a small atom.
  • No choice but parallelism
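Demmel's bound is easy to reproduce numerically (a back-of-envelope sketch using the constants from the slide):

```python
# Speed-of-light limit for a 1 Tflop/s, 1 Tbyte sequential machine.
c = 3.0e8                # speed of light, m/s
cycles = 1.0e12          # one word fetched per cycle at 1 Tflop/s

r = c / cycles           # max memory-to-CPU distance per cycle
# r = 3e-4 m = 0.3 mm

# Pack 10^12 bytes of storage into an r x r square:
area_per_byte = r ** 2 / 1.0e12   # ~9e-20 m^2 per byte
# With 1 Angstrom = 1e-10 m, that is ~9 square Angstroms per byte:
# atomic dimensions, hence "no choice but parallelism".
```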

  12. Storage: Locality and Parallelism
  • Large memories are slow; fast memories are small
  • Storage hierarchies are large and fast on average
  • Parallel processors, collectively, have large, fast caches ($)
  • The slow accesses to “remote” data are what we call “communication”
  • Algorithms should do most work on local data
  [Diagram: conventional storage hierarchy: each processor with its own cache, L2 and L3 caches, potential interconnects, and memory]
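The "large and fast on average" claim can be illustrated with an expected-latency calculation (the hit rates and latencies below are illustrative assumptions, not figures from the slide):

```python
# Effective access time of a storage hierarchy: it is fast *on average*
# because most accesses hit the small, fast levels.
levels = [        # (fraction of accesses served here, latency in cycles)
    (0.90, 2),    # L1-like cache (assumed numbers)
    (0.08, 12),   # L2-like cache (assumed numbers)
    (0.02, 200),  # DRAM (assumed numbers)
]
effective = sum(frac * lat for frac, lat in levels)
# 0.90*2 + 0.08*12 + 0.02*200 = 6.76 cycles on average, despite a
# 200-cycle worst case -- which is why algorithms should do most of
# their work on local data.
```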

  13. Processor-DRAM Gap (latency)
  [Figure, 1980-2000: processor performance (“Moore’s Law”) improves ~60%/yr while DRAM latency improves only ~7%/yr; the processor-memory performance gap grows ~50%/year]
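The ~50%/year figure follows directly from the two growth rates in the figure (a quick check in Python):

```python
# Processor performance improves ~60%/yr, DRAM latency only ~7%/yr,
# so the processor-memory gap compounds at the ratio of the two.
cpu_growth, dram_growth = 1.60, 1.07

gap_per_year = cpu_growth / dram_growth   # ~1.495, i.e. ~50%/year
gap_1980_2000 = gap_per_year ** 20        # over the figure's span: ~3000x
```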

  14. Storage Trends
  • Divergence between memory capacity and speed even more pronounced
    • Capacity increased by 1000x from 1980-95; speed only 2x
    • Gigabit DRAM by c. 2000, but the gap with processor speed much greater
  • Larger memories are slower, while processors get faster
    • Need to transfer more data in parallel
    • Need deeper cache hierarchies
    • How to organize caches?
  • Parallelism increases the effective size of each level of the hierarchy, without increasing access time
  • Disks: parallel disks plus caching

  15. Architectural Trends
  • Resolve the tradeoff between parallelism and locality
    • Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
    • Tradeoffs may change with scale and technology advances
  • Understanding microprocessor architectural trends
    => helps build intuition about design issues of parallel machines
    => shows the fundamental role of parallelism even in “sequential” computers

  16. Phases in “VLSI” Generation

  17. Architectural Trends
  • Greatest trend in VLSI generations is the increase in parallelism
  • Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    • slows after 32-bit
    • adoption of 64-bit now under way (Opteron, Itanium); 128-bit far off (not a performance issue)
  • Mid 80s to mid 90s: instruction-level parallelism (ILP)
    • pipelining and simple instruction sets, plus compiler advances (RISC)
    • on-chip caches and functional units => superscalar execution
    • greater sophistication: out-of-order execution, speculation, prediction
  • Current step:
    • thread-level parallelism
    • multicore

  18. Pipeline of a superscalar processor
  [Diagram: in-order fetch/decode, out-of-order execution, in-order commit]

  19. How far will ILP go?
  • Simulation for discovering the maximum available ILP:
    • infinite fetch bandwidth
    • infinite functional units
    • perfect branch prediction
    • cache misses: 0 cycles

  20. Multithreaded architectures

  21. Multithreaded architectures
  • Examples: Pentium 4 Xeon; UltraSPARC T1 (32 & 64 threads); Itanium Montecito (also dual-core)

  22. Multi-core
  • Intel: Dual Pentium Extreme Edition 840 (first); Quad Core Xeon 5300; an 80-core chip capable of cranking through 1.28 TFlops
  • AMD: Dual Core Opteron, Quad Core FX (3 GHz)
  • Sun: Rock: 16 cores (due 2008)
  • IBM: Power6, 2 cores at 5 GHz

  23. Alternative: Cell
  • A general-purpose Power Architecture core of modest performance, plus coprocessing elements for multimedia and vector processing applications
  • The PowerPC core controls 8 SPEs (Synergistic Processing Elements): SIMD
  • Cache coherent
  • 25.6 GB/s XDR memory controller

  24. Alternative: Cell
  • SPE register hierarchy:
    • 128 × 128-bit single-cycle registers
    • 16K × 128-bit, 6-cycle local store
  • DMA in parallel with SIMD processing

  25. Overview of Cell processor

  26. Application Trends
  • Demand for cycles fuels advances in hardware, and vice versa: more performance enables new applications, which in turn demand more performance
  • This cycle drives the exponential increase in microprocessor performance
  • It drives parallel architecture even harder: the most demanding applications
  • Goal of applications in using parallel machines: speedup
  Speedup (p processors) = Performance (p processors) / Performance (1 processor)
  • For a fixed problem size (input data set), performance = 1/time
  Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
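The two definitions on this slide translate directly into code (a sketch; the sample run times are made up for illustration):

```python
# Speedup definitions from the slide, as executable formulas.
def speedup(perf_p, perf_1):
    """Speedup(p) = Performance(p processors) / Performance(1 processor)."""
    return perf_p / perf_1

def speedup_fixed(time_1, time_p):
    """Fixed problem size: performance = 1/time, so Speedup(p) = T1/Tp."""
    return time_1 / time_p

# e.g. a job taking 120 s on one processor and 10 s on 16 processors:
s = speedup_fixed(120.0, 10.0)   # 12.0 (parallel efficiency 12/16 = 75%)
```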

  27. Improving the speedup of parallel applications
  • AMBER molecular dynamics simulation program
  • Simulates the motion of large biological models (proteins, DNA)
  • 145 Mflop/s on a Cray 90; 406 Mflop/s for the final version on a 128-processor Paragon; 891 Mflop/s on a 128-processor Cray T3D
    • 9/94 version: optimized the balance
    • 8/94 version: optimized the communication
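For context, the Mflop/s figures above imply the following speedups over the Cray 90 baseline (simple arithmetic on the slide's numbers):

```python
# Speedups implied by the AMBER performance numbers on the slide.
cray90 = 145.0         # Mflop/s, single-processor Cray 90 baseline
paragon_128 = 406.0    # Mflop/s, final version, 128-processor Paragon
t3d_128 = 891.0        # Mflop/s, 128-processor Cray T3D

speedup_paragon = paragon_128 / cray90   # ~2.8x
speedup_t3d = t3d_128 / cray90           # ~6.1x
```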

  28. Particularly Challenging Computations
  • Science
    • Global climate modeling
    • Astrophysical modeling
    • Biology: genomics, protein folding, drug design
    • Computational chemistry
    • Computational material sciences and nanosciences
  • Engineering
    • Crash simulation
    • Semiconductor design
    • Earthquake and structural modeling
    • Computational fluid dynamics (airplane design)
    • Combustion (engine design)
  • Business
    • Financial and economic modeling
    • Transaction processing, web services and search engines
  • Defense
    • Nuclear weapons (tested by simulations)
    • Cryptography

  29. $5B Market in Technical Computing
  Source: IDC 2004, from the USA’s National Research Council “Future of Supercomputing” report

  30. Scientific Computing Demand

  31. NRC report on Future of Supercomputing
  • “In climate modeling or plasma physics, there is a broad consensus that up to seven orders of magnitude of performance improvements will be needed to achieve well-defined computational goals.”

  32. What is Parallel Architecture?
  • A parallel computer is a collection of processing elements that cooperate to solve large problems fast
  • Some broad issues:
  • Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  • Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  • Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?

  33. Why Study Parallel Architecture?
  • Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost
  • Parallelism:
    • provides an alternative to a faster clock for performance
    • applies at all levels of system design
    • is a fascinating perspective from which to view architecture
    • is increasingly central in information processing

  34. 1st Architecture classification
  • There are several different methods used to classify computers; no single taxonomy fits all designs
  • Flynn’s taxonomy uses the relationship of program instructions to program data:
    • SISD: Single Instruction, Single Data stream
    • SIMD: Single Instruction, Multiple Data streams
    • MISD: Multiple Instruction, Single Data stream (no practical examples)
    • MIMD: Multiple Instruction, Multiple Data streams

  35. SISD
  • One instruction stream, one data stream
  • One instruction issued on each clock cycle
  • One instruction executed on one element of data (scalar) at a time
  • Traditional von Neumann architecture

  36. SIMD
  • Also von Neumann architectures, but with more powerful instructions
  • Each instruction may operate on more than one data element
  • Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
  • Synchronous (lockstep)
  • Rating how fast these machines can issue instructions is not a good measure of their performance
  • Two major types:
    • Vector SIMD
    • Parallel SIMD

  37. Vector SIMD
  • A single instruction results in multiple operands being updated
  • Scalar processing operates on single data elements; vector processing operates on whole vectors (groups) of data at a time
  • Examples: Cell, Cray 1, NEC SX-2, Fujitsu VP, Hitachi S820
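The vector idea can be sketched in plain Python: one "instruction" updates a whole vector of operands (a semantic illustration only, not real SIMD hardware):

```python
# Vector-SIMD semantics: a single vector instruction operates on whole
# vectors of data instead of one scalar at a time.
def vadd(a, b):
    """One 'vector add' instruction: elementwise add of two vectors."""
    return [x + y for x, y in zip(a, b)]

# Scalar processing would issue one add per element; vector processing
# issues a single vadd for the whole group:
v = vadd([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
# v == [11.0, 22.0, 33.0, 44.0]
```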

  38. Parallel SIMD
  • Several processors execute the same instruction in lockstep
  • Each processor modifies a different element of data
  • Drawback: idle processors
  • Advantage: no explicit synchronization required
  • Examples: Connection Machine CM-2; MasPar MP-1, MP-2

  39. MIMD
  • Several processors executing different instructions on different data
  • Advantages:
    • different jobs can be performed at the same time
    • better utilization can be achieved
  • Drawbacks:
    • explicit synchronization needed
    • difficult to program
  • Examples:
    • MIMD accomplished via parallel SISD machines: Sequent, nCUBE, Intel iPSC/2, IBM RS6000 cluster, all clusters
    • MIMD accomplished via parallel SIMD machines: Cray C90, Cray 2, NEC SX-3, Fujitsu VP 2000, Convex C-2, Intel Paragon, CM-5, KSR-1, IBM SP1, IBM SP2
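A minimal MIMD-style sketch in Python, with threads standing in for processors: each worker runs a different instruction stream on different data (illustrative only; the worker functions are made up, and thread scheduling is not real multiprocessor hardware):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text):      # one instruction stream: word counting
    return len(text.split())

def checksum(data):         # a different stream: a simple checksum
    return sum(data) % 256

# MIMD: different instructions on different data, at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count_words, "multiple instruction multiple data")
    f2 = pool.submit(checksum, [10, 20, 300])
    results = (f1.result(), f2.result())
# results == (4, 74)
```

Note the slide's drawback in miniature: combining the two results is an explicit synchronization point (the `result()` calls).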

  40. 2nd Classification: Memory architectures
  • Shared memory
    • UMA
    • NUMA
    • CC-NUMA
  • Distributed memory
  • COMA

  41. UMA (Uniform Memory Access)
  [Diagram: processors P1…Pn connected through an interconnect to shared memory modules M1…Mk; every processor sees the same access latency]

  42. NUMA (Non-Uniform Memory Access)
  [Diagram: processing elements PE1…PEn, each pairing a processor Pi with a local memory Mi, joined by an interconnect; local memory is faster to reach than remote memory]

  43. CC-NUMA (Cache-Coherent NUMA)
  [Diagram: as NUMA, but each PE adds a cache Ci between processor Pi and memory Mi, kept coherent across the machine]

  44. Distributed memory
  [Diagram: processing elements PE1…PEn, each with a processor Pi and a private memory Mi, communicating over an interconnect]

  45. COMA (Cache-Only Memory Architecture)
  [Diagram: processing elements PE1…PEn, each with a processor Pi and a cache Ci only (no home memory), connected by an interconnect]

  46. Memory architecture (very important!)

                        Logical view
  Physical view         shared              distributed
  shared                UMA                 -
  distributed           NUMA                distributed memory

  • A shared logical view gives “easy” programming
  • A distributed physical view gives scalability
  • NUMA (shared logical view over distributed physical memory): the future!

  47. Generic Parallel Architecture
  • Node: processor(s), memory system, plus communication assist
    • network interface and communication controller
  • Scalable network

  48. Clusters and Cluster Computing
  • Definition of a cluster: “A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.” [Buyya98]
  • Communication infrastructure:
    • high-performance networks, faster than traditional LANs (Myrinet, Infiniband, Gbit Ethernet)
    • low-latency communication protocols
  • Loosely coupled compared to traditional proprietary supercomputers (e.g. IBM SP, Intel Paragon)

  49. Cluster architecture

  50. Clusters and Cluster Computing
  • Cluster networks: Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps), ATM, Myrinet (1.2 Gbps), Fibre Channel, FDDI, Infiniband, etc.
  • Cluster projects:
    • Beowulf (CalTech and NASA), USA
    • Condor, University of Wisconsin-Madison, USA
    • DQS (Distributed Queuing System), Florida State University, USA
    • HPVM (High Performance Virtual Machine), UIUC and now UCSB, USA
    • far, University of Liverpool, UK
    • Gardens, Queensland University of Technology, Australia
    • Kerrighed, INRIA, France
    • MOSIX, Hebrew University of Jerusalem, Israel
    • NOW (Network of Workstations), Berkeley, USA
