Chapter 7

Presentation Transcript


  1. Chapter 7 Multicores, Multiprocessors, and Clusters

  2. Introduction §7.1 Introduction • Goal: connecting multiple computers to get higher performance • Multiprocessors • Scalability, availability, power efficiency • Task-level (process-level) parallelism • Independent tasks running in parallel • High throughput for independent jobs • Parallel processing program • Single program run on multiple processors Chapter 7 — Multicores, Multiprocessors, and Clusters — 2

  3. Introduction §7.1 Introduction • Multicore microprocessors • Processors are called cores in a multicore chip • Chips with multiple processors (cores) • Almost always shared memory processors (SMPs) • Clusters • Microprocessors housed in many independent servers • Search engines, web servers, email servers, and databases • Memory is not shared Chapter 7 — Multicores, Multiprocessors, and Clusters — 3

  4. Introduction §7.1 Introduction • The number of cores is expected to increase with Moore’s Law • The challenge is to create hardware and software that make it easy to write correct parallel processing programs that execute efficiently, in both performance and energy, as the number of cores per chip increases Chapter 7 — Multicores, Multiprocessors, and Clusters — 4

  5. Hardware and Software • Hardware • Serial: e.g., Pentium 4 • Parallel: e.g., Intel Core i7 • Software • Sequential: e.g., matrix multiplication • Concurrent: e.g., operating system • Sequential/concurrent software can run on serial/parallel hardware • Challenge: making effective use of parallel hardware Chapter 7 — Multicores, Multiprocessors, and Clusters — 5

  6. Hardware and Software §7.6 SISD, MIMD, SIMD, SPMD, and Vector Chapter 7 — Multicores, Multiprocessors, and Clusters — 6

  7. Parallel Processing Programs • Difficulties • Not with the hardware • There are too few important application programs written to complete tasks sooner on multiprocessors • It is difficult to write software that uses multiple processors to complete one task faster • Problem gets worse as the number of processors increases Chapter 7 — Multicores, Multiprocessors, and Clusters — 7

  8. Parallel Programming • Parallel software is the problem • Need to get significant performance improvement • Otherwise, just use a faster uniprocessor, since it’s easier! §7.2 The Difficulty of Creating Parallel Processing Programs Chapter 7 — Multicores, Multiprocessors, and Clusters — 8

  9. Parallel Programming • Challenges • Scheduling the workload • Partitioning the work into parallel pieces • Balancing the load evenly between the processors • Synchronizing the work • Communication overhead between the processors §7.2 The Difficulty of Creating Parallel Processing Programs Chapter 7 — Multicores, Multiprocessors, and Clusters — 9

  10. Instruction and Data Streams • An alternate classification §7.6 SISD, MIMD, SIMD, SPMD, and Vector Chapter 7 — Multicores, Multiprocessors, and Clusters — 10

  11. Instruction and Data Streams • Conventional uniprocessor • SISD – Single instruction stream and single data stream • Conventional multiprocessor • MIMD – Multiple instruction streams and multiple data streams • Can be separate programs running on different processors, or • A single program that runs on all processors (SPMD: Single Program, Multiple Data) • Relies on conditional statements when different processors should execute different sections of code (see the sketch after this slide) §7.6 SISD, MIMD, SIMD, SPMD, and Vector Chapter 7 — Multicores, Multiprocessors, and Clusters — 11
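
  A minimal SPMD sketch in C (illustrative only; NPROCS, spmd_body, and the printed messages are made up for this example, not taken from the slides). Every "processor" runs the same program, and a conditional on its ID Pn selects which section of code it executes:

  #include <stdio.h>

  #define NPROCS 4                        /* pretend we have 4 processors */

  /* The same body runs on every processor (Single Program, Multiple Data). */
  static void spmd_body(int Pn)
  {
      /* ... work that every processor performs on its own data ... */

      if (Pn == 0)
          printf("processor %d: gather and report the results\n", Pn);
      else
          printf("processor %d: compute and send my partial result\n", Pn);
  }

  int main(void)
  {
      /* Sequential stand-in for launching the program on NPROCS processors. */
      for (int Pn = 0; Pn < NPROCS; Pn++)
          spmd_body(Pn);
      return 0;
  }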

  12. Instruction and Data Streams • SIMD – Single instruction stream and multiple data streams • Operates on vectors of data, e.g., one instruction adds 64 numbers by sending 64 data streams to 64 ALUs to form 64 sums within a single clock cycle • Works best when dealing with arrays in for loops (see the loop sketch after this slide) • Requires a lot of identically structured data • E.g., graphics processing • MISD – Multiple instruction streams and single data stream • Performs a series of computations on a single data stream in a pipelined fashion • E.g., parse input from the network, decrypt the data, decompress it, search for a match §7.6 SISD, MIMD, SIMD, SPMD, and Vector Chapter 7 — Multicores, Multiprocessors, and Clusters — 12
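
  A sketch of the kind of loop SIMD handles well (vec_add and its argument names are illustrative, not from the slides): the same operation is applied independently to every array element, so a SIMD machine, or a compiler vectorizing for SIMD instructions, can execute one add across many elements at once.

  #include <stddef.h>

  /* Element-wise array addition: identical, independent work per iteration,
   * which is exactly the pattern a SIMD unit (or auto-vectorizer) exploits. */
  void vec_add(float *c, const float *a, const float *b, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }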

  13. Hardware Multithreading • While MIMD relies on multiple processes or threads to keep multiple processors busy, hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion, so the hardware resources are used more efficiently §7.5 Hardware Multithreading Chapter 7 — Multicores, Multiprocessors, and Clusters — 13

  14. Hardware Multithreading • Performing multiple threads of execution in parallel • Processor must duplicate the independent state of each thread • Replicate registers, program counter, etc. • Must have fast switching between threads • A process switch can take thousands of cycles • A thread switch can be instantaneous §7.5 Hardware Multithreading Chapter 7 — Multicores, Multiprocessors, and Clusters — 14

  15. Hardware Multithreading • Fine-grain multithreading • Switch threads after each cycle • Interleave instruction execution • If one thread stalls, others are executed • Instruction-level parallelism • Coarse-grain multithreading • Only switch on a long stall (e.g., L2 cache miss) • Disadvantage – pipeline start-up cost • No instruction-level parallelism §7.5 Hardware Multithreading Chapter 7 — Multicores, Multiprocessors, and Clusters — 15

  16. Simultaneous Multithreading • In a multiple-issue, dynamically scheduled, pipelined processor • Schedule instructions from multiple threads • Instructions from independent threads execute when functional units are available • Within threads, dependencies are handled by scheduling and register renaming • Example: Intel Pentium 4 HT • Two threads: duplicated registers, shared functional units and caches Chapter 7 — Multicores, Multiprocessors, and Clusters — 16

  17. Multithreading Example Chapter 7 — Multicores, Multiprocessors, and Clusters — 17

  18. Multithreading Example • The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support • A major stall, such as an instruction cache miss, can leave the entire processor idle • With coarse-grained multithreading, the long stalls are partially hidden by switching to another thread that uses the resources of the processor • Pipeline start-up overhead still leads to idle cycles §7.5 Hardware Multithreading Chapter 7 — Multicores, Multiprocessors, and Clusters — 18

  19. Multithreading Example • Fine-grained • Instruction-level parallelism • The interleaving of threads mostly eliminates idle clock cycles • SMT • Thread-level parallelism and instruction-level parallelism are both exploited • Multiple threads using the issue slots in a single clock cycle §7.5 Hardware Multithreading Chapter 7 — Multicores, Multiprocessors, and Clusters — 19

  20. Future of Multithreading • Will it survive? In what form? • Power considerations → simplified microarchitectures • Simpler forms of multithreading • Tolerating cache-miss latency • Thread switch may be most effective • Multiple simple cores might share resources more effectively Chapter 7 — Multicores, Multiprocessors, and Clusters — 20

  21. Shared Memory • SMP: shared memory multiprocessor • Hardware provides a single physical address space for all processors • Nearly all current multicore chips are SMPs §7.3 Shared Memory Multiprocessors Chapter 7 — Multicores, Multiprocessors, and Clusters — 21

  22. Shared Memory • Memory access time can be either: • Uniform memory access (UMA) – the access time to a word does not depend on which processor asks for it, or • Nonuniform memory access (NUMA) – some memory accesses are much faster than others, depending on which processor asks for which word • NUMA arises because main memory is divided and attached to different microprocessors or to different memory controllers • Synchronize shared variables using locks (a sketch follows this slide) §7.3 Shared Memory Multiprocessors Chapter 7 — Multicores, Multiprocessors, and Clusters — 22
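
  A minimal sketch of synchronizing a shared variable with a lock (shared_sum, sum_lock, and add_to_shared_sum are illustrative names, not from the slides): any thread that updates the shared variable must hold the lock, so concurrent updates cannot be lost.

  #include <pthread.h>

  static double shared_sum = 0.0;                    /* variable shared by all threads */
  static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

  void add_to_shared_sum(double my_partial_sum)
  {
      pthread_mutex_lock(&sum_lock);    /* acquire the lock              */
      shared_sum += my_partial_sum;     /* critical section: safe update */
      pthread_mutex_unlock(&sum_lock);  /* release the lock              */
  }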

  23. Example: Sum Reduction • Sum 100,000 numbers on a 100-processor UMA machine • Each processor has an ID: 0 ≤ Pn ≤ 99 Chapter 7 — Multicores, Multiprocessors, and Clusters — 23

  24. Example: Sum Reduction • Sum 100,000 numbers on a 100-processor UMA machine • Each processor has an ID: 0 ≤ Pn ≤ 99 • Partition 1000 numbers per processor • Initial summation on each processor:
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
  • Now need to add these partial sums • Reduction: divide and conquer • Half the processors add pairs, then a quarter, … • Need to synchronize between reduction steps Chapter 7 — Multicores, Multiprocessors, and Clusters — 24

  25. Example: Sum Reduction
    half = 100;
    repeat
      synch();
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
      half = half/2; /* dividing line on who sums */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
  Chapter 7 — Multicores, Multiprocessors, and Clusters — 25
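
  A runnable C sketch of the whole example, assuming POSIX threads stand in for the 100 processors (NPROCS, NITEMS, worker, and the barrier are details invented for this sketch; the slides give only the pseudocode above). Each thread first sums its own 1000-number partition, then the threads perform the divide-and-conquer reduction, with a barrier playing the role of synch() between steps:

  #define _POSIX_C_SOURCE 200809L   /* for pthread_barrier_t on some systems */
  #include <pthread.h>
  #include <stdio.h>

  #define NPROCS 100                 /* number of "processors" (threads) */
  #define NITEMS 100000              /* numbers to sum                   */

  static double A[NITEMS];
  static double sum[NPROCS];         /* one partial sum per processor    */
  static pthread_barrier_t barrier;  /* plays the role of synch()        */

  static void *worker(void *arg)
  {
      long Pn = (long)arg;           /* this thread's processor ID       */

      /* Initial summation: each processor sums its 1000-number partition. */
      sum[Pn] = 0;
      for (long i = 1000 * Pn; i < 1000 * (Pn + 1); i = i + 1)
          sum[Pn] = sum[Pn] + A[i];

      /* Divide-and-conquer reduction over the partial sums. */
      long half = NPROCS;
      do {
          pthread_barrier_wait(&barrier);       /* synch(): wait for everyone   */
          if (half % 2 != 0 && Pn == 0)
              sum[0] = sum[0] + sum[half - 1];  /* odd case: P0 takes the
                                                   element with no partner      */
          half = half / 2;                      /* dividing line on who sums    */
          if (Pn < half)
              sum[Pn] = sum[Pn] + sum[Pn + half];
      } while (half > 1);                       /* i.e., until (half == 1)      */

      return NULL;
  }

  int main(void)
  {
      pthread_t t[NPROCS];

      for (long i = 0; i < NITEMS; i = i + 1)   /* test data: 0, 1, 2, ...      */
          A[i] = (double)i;

      pthread_barrier_init(&barrier, NULL, NPROCS);
      for (long p = 0; p < NPROCS; p = p + 1)
          pthread_create(&t[p], NULL, worker, (void *)p);
      for (long p = 0; p < NPROCS; p = p + 1)
          pthread_join(t[p], NULL);
      pthread_barrier_destroy(&barrier);

      printf("total = %.0f\n", sum[0]);         /* expect 4999950000            */
      return 0;
  }

  Compile with something like cc -pthread reduce.c; on a real 100-processor SMP each thread would run on its own processor, but the synchronization pattern is the same.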

  26. Private Memory • Each processor has a private physical address space • Processors must communicate via explicit message passing (send and receive); a sketch follows this slide §7.4 Clusters and Other Message-Passing Multiprocessors Chapter 7 — Multicores, Multiprocessors, and Clusters — 26
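
  A sketch of explicit send/receive message passing. The slides do not name a library, so MPI is used here purely as one concrete example, and my_sum / total are made-up names. Each process owns only private data; the only way processor 0 can see the others' results is for them to send and for it to receive:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's ID, like Pn */
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* number of processes        */

      double my_sum = rank + 1.0;              /* stand-in for a local computation */

      if (rank != 0) {
          /* Every processor except 0 sends its private result to processor 0. */
          MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      } else {
          /* Processor 0 receives one result from each of the others. */
          double total = my_sum, incoming;
          for (int p = 1; p < nprocs; p++) {
              MPI_Recv(&incoming, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              total += incoming;
          }
          printf("total = %f\n", total);
      }

      MPI_Finalize();
      return 0;
  }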

  27. Loosely Coupled Clusters • Network of independent computers • Each has private memory and its own OS • Connected using the I/O system • E.g., Ethernet/switch, Internet • Suitable for applications with independent tasks • Web servers, databases, simulations, … • High availability, scalable, affordable • Problems • Administration cost (prefer virtual machines) • Low interconnect bandwidth • cf. processor/memory bandwidth on an SMP Chapter 7 — Multicores, Multiprocessors, and Clusters — 27

  28. Clusters • Clusters with dedicated high-performance message-passing networks • Many supercomputers today use custom networks • Much more expensive than local area networks Chapter 7 — Multicores, Multiprocessors, and Clusters — 28

  29. Warehouse-Scale Computers • A building to house, power, and cool 100,000 servers with a dedicated network • Like large clusters, but their architecture and operation are more sophisticated • They act as one giant computer Chapter 7 — Multicores, Multiprocessors, and Clusters — 29

  30. Grid Computing • Separate computers interconnected by long-haul networks • E.g., Internet connections • Work units farmed out, results sent back • Can make use of idle time on PCs Chapter 7 — Multicores, Multiprocessors, and Clusters — 30

  31. Networks inside a chip • Thanks to Moore’s Law and the increasing number of cores per chip, we now need networks inside a chip as well. Chapter 7 — Multicores, Multiprocessors, and Clusters — 31

  32. Network Characteristics • Performance • Latency to send and receive a message on an unloaded network • Throughput • The maximum number of messages that can be transmitted in a given time period • Congestion delays caused by contention for a portion of the network • Fault tolerance – the ability to keep working with broken components • Collision detection Chapter 7 — Multicores, Multiprocessors, and Clusters — 32

  33. Network Characteristics • Cost – includes: • Number of switches • Number of links on a switch to connect to the network • The width (number of bits) per link • The length of the links when the network is mapped into silicon • Routability in silicon • Power – energy efficiency Chapter 7 — Multicores, Multiprocessors, and Clusters — 33

  34. Interconnection Networks • Network topologies • Arrangements of processors (black squares), switches (blue dots), and links • Topologies shown: Bus, Ring, N-cube (N = 3), 2D Mesh, Fully connected §7.8 Introduction to Multiprocessor Network Topologies Chapter 7 — Multicores, Multiprocessors, and Clusters — 34

  35. Concluding Remarks • Goal: higher performance by using multiple processors • Difficulties • Developing parallel software • Devising appropriate architectures • Many reasons for optimism • Changing software and application environment • Chip-level multiprocessors with lower latency, higher bandwidth interconnect • An ongoing challenge for computer architects! §7.13 Concluding Remarks Chapter 7 — Multicores, Multiprocessors, and Clusters — 35
