
Chap. 4 Multiprocessors and Thread-Level Parallelism


Presentation Transcript


  1. Chap. 4 Multiprocessors and Thread-Level Parallelism

  2. Uniprocessor performance From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006 • VAX: 25%/year, 1978 to 1986 • RISC + x86: 52%/year, 1986 to 2002 • RISC + x86: ??%/year, 2002 to present

  3. From ILP to TLP & DLP • (Almost) all microprocessor companies are moving to multiprocessor systems • Single processors gain performance by exploiting instruction-level parallelism (ILP) • Multiprocessors exploit either: • Thread-level parallelism (TLP), or • Data-level parallelism (DLP) • What’s the problem?

  4. From ILP to TLP & DLP (cont.) • We’ve got tons of infrastructure for single-processor systems • Algorithms, languages, compilers, operating systems, architectures, etc. • These don’t exactly scale well • Multiprocessor design: not as simple as creating a chip with 1000 CPUs • Task scheduling/division • Communication • Memory issues • Even programming → moving from 1 to 2 CPUs is extremely difficult

  5. Why Multiprocessors? • Slowdown in uniprocessor performance arising from diminishing returns in exploiting ILP, combined with growing concern over power • Growth in data-intensive applications • Databases, file servers, … • Growing interest in servers and server performance • Increasing desktop performance is less important • Outside of graphics • Improved understanding of how to use multiprocessors effectively • Especially servers, where there is significant natural TLP

  6. Multiprocessing • Flynn’s Taxonomy of Parallel Machines • How many instruction streams? • How many data streams? • SISD: Single I Stream, Single D Stream • A uniprocessor • SIMD: Single I, Multiple D Streams • Each “processor” works on its own data • But all execute the same instructions in lockstep • E.g. a vector processor or MMX (see the sketch below) • => Data Level Parallelism
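Where the slide mentions MMX, a minimal SIMD sketch in C can make the "same instructions in lockstep" point concrete. It assumes an x86 machine with SSE2 (a successor of MMX); the single intrinsic _mm_add_epi32 adds four 32-bit data elements at once:

```c
/* SIMD sketch: one instruction operates on four data elements in
   lockstep. Assumes x86 with SSE2; compile with: gcc -msse2 simd.c */
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i a = _mm_set_epi32(4, 3, 2, 1);     /* elements 1, 2, 3, 4     */
    __m128i b = _mm_set_epi32(40, 30, 20, 10); /* elements 10, 20, 30, 40 */
    __m128i c = _mm_add_epi32(a, b);           /* one instruction,
                                                  four additions          */
    int out[4];
    _mm_storeu_si128((__m128i *)out, c);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;   /* prints: 11 22 33 44 */
}
```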

  7. Flynn’s Taxonomy • MISD: Multiple I, Single D Stream • Not used much • MIMD: Multiple I, Multiple D Streams • Each processor executes its own instructions and operates on its own data • This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors) • Includes multi-core processors, clusters, SMP servers • → Thread Level Parallelism • MIMD popular because • Flexible: can run N independent programs, or work on 1 multithreaded program together • Cost-effective: same processor in desktop & MIMD

  8. Back to Basics • “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” • Parallel Architecture = Computer Architecture + Communication Architecture • 2 classes of multiprocessors WRT memory: • Centralized Memory Multiprocessor • < few dozen processor chips (and < 100 cores) in 2006 • Small enough to share a single, centralized memory • Physically Distributed-Memory Multiprocessor • Larger number of chips and cores (beyond ~100) • BW demands → memory distributed among processors

  9. Centralized Shared Memory Multiprocessors

  10. Distributed Memory Multiprocessors

  11. Centralized-Memory Machines • Also “Symmetric Multiprocessors” (SMP) • “Uniform Memory Access” (UMA) • All memory locations have similar latencies • Data sharing through memory reads/writes • P1 can write data to a physical address A, P2 can then read physical address A to get that data • Problem: Memory Contention • All processors share the one memory • Memory bandwidth becomes the bottleneck • Used only for smaller machines • Most often 2, 4, or 8 processors

  12. Shared Memory Pros and Cons • Pros • Communication happens automatically • More natural way of programming • Easier to write correct programs and gradually optimize them • No need to manually distribute data (but can help if you do) • Cons • Needs more hardware support • Easy to write correct, but inefficient programs (remote accesses look the same as local ones)

  13. Distributed-Memory Machines • Two kinds • Distributed Shared-Memory (DSM) • All processors can address all memory locations • Data sharing like in SMP • Also called NUMA (non-uniform memory access) • Latencies of different memory locations can differ (local access faster than remote access) • Message-Passing • A processor can directly address only local memory • To communicate with other processors, must explicitly send/receive messages • Also called multicomputers or clusters • Most accesses local, so less memory contention (can scale to well over 1000 processors)

  14. Message-Passing Machines • A cluster of computers • Each with its own processor and memory • An interconnect to pass messages between them • Producer-Consumer Scenario: • P1 produces data D, uses a SEND to send it to P2 • The network routes the message to P2 • P2 then calls a RECEIVE to get the message • Two types of send primitives • Synchronous: P1 stops until P2 confirms receipt of message • Asynchronous: P1 sends its message and continues • Standard libraries for message passing: most common is MPI – Message Passing Interface (sketched below)
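A minimal sketch of the producer-consumer scenario above using MPI, the library the slide names. Rank 0 plays P1 and rank 1 plays P2; the payload value 42 and the tag 0 are illustrative. Note that MPI_Send may behave synchronously or asynchronously depending on the implementation; MPI_Isend is the explicitly asynchronous variant.

```c
/* Producer-consumer over MPI. Compile: mpicc pc.c ; run: mpirun -np 2 a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* P1: produce data D, then SEND */
        data = 42;
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {          /* P2: RECEIVE the routed message */
        MPI_Recv(&data, 1, MPI_INT, /*source=*/0, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("P2 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}
```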

  15. Message Passing Pros and Cons • Pros • Simpler and cheaper hardware • Explicit communication makes programmers aware of costly (communication) operations • Cons • Explicit communication is painful to program • Requires manual optimization • If you want a variable to be local and accessible via LD/ST, you must declare it as such • If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this

  16. Challenges of Parallel Processing • First challenge is % of program inherently sequential (limited parallelism available in programs) • Suppose 80X speedup from 100 processors. What fraction of original program can be sequential? • 10% • 5% • 1% • <1%

  17. Amdahl’s Law Answers
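A worked sketch of the answer via Amdahl's Law, writing f_seq for the sequential fraction and N = 100 for the processor count:

```latex
\text{Speedup} = \frac{1}{f_{\text{seq}} + \frac{1 - f_{\text{seq}}}{N}} = 80
\;\Rightarrow\; f_{\text{seq}} + \frac{1 - f_{\text{seq}}}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; 0.99\, f_{\text{seq}} = 0.0025
\;\Rightarrow\; f_{\text{seq}} = \frac{0.0025}{0.99} \approx 0.25\%
```

So only about 0.25% of the original program can be sequential: the answer is <1%.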

  18. Challenges of Parallel Processing • Second challenge is long latency to remote memory (high cost of communication) • Delay ranges from 50 to 1000 clock cycles • Suppose a 32-CPU MP at 2 GHz, 200 ns remote memory access, all local accesses hit in the memory hierarchy, and base CPI is 0.5 (remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles) • What is the performance impact if 0.2% of instructions involve a remote access? • 1.5X • 2.0X • 2.5X

  19. CPI Equation • CPI = Base CPI + Remote request rate x Remote request cost • CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3 • The machine with no remote communication is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access

  20. Challenges of Parallel Processing • Application parallelism → improved primarily via new algorithms that have better parallel performance • Long remote latency impact → reduced both by the architect and by the programmer • For example, reduce frequency of remote accesses either by • Caching shared data (HW) • Restructuring the data layout to make more accesses local (SW)

  21. Cache Coherence Problem • Shared memory is easy with no caches • P1 writes, P2 can read • Only one copy of data exists (in memory) • Caches store their own copies of the data • Those copies can easily get inconsistent • Classic example: adding to a sum (sketched in code below) • P1 loads allSum, adds its mySum, stores new allSum • P1’s cache now has dirty data, but memory is not updated • P2 loads allSum from memory, adds its mySum, stores allSum • P2’s cache also has dirty data • Eventually P1’s and P2’s cached data will go to memory • Regardless of write-back order, the final value ends up wrong
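A minimal sketch of the allSum example using POSIX threads (the names allSum and mySum come from the slide; everything else is illustrative). Each thread performs an unsynchronized load-add-store on the shared variable, so one update can overwrite the other; with per-processor caches, the same pattern leaves stale copies that a coherence protocol must reconcile.

```c
/* Lost-update race on a shared sum. Compile with: gcc -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

int allSum = 0;                    /* shared; each CPU caches its own copy */

void *worker(void *arg) {
    int mySum = *(int *)arg;
    allSum = allSum + mySum;       /* load allSum, add, store: NOT atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int s1 = 10, s2 = 20;
    pthread_create(&t1, NULL, worker, &s1);
    pthread_create(&t2, NULL, worker, &s2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 30, but an unlucky interleaving can leave 10 or 20. */
    printf("allSum = %d\n", allSum);
    return 0;
}
```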

  22. Small-Scale Shared Memory • Caches serve to: • Increase bandwidth versus bus/memory • Reduce latency of access • Valuable for both private data and shared data • What about cache consistency? • Consider reads and writes of a single memory location (X) by two processors (A and B) • Assume a write-through cache

  23. Cache coherence problem

  24. Example Cache Coherence Problem • [figure: processors P1, P2, P3, each with a cache ($), share a bus to memory and I/O devices; P1 and P3 read u = 5 into their caches (events 1-2), P3 writes u = 7 (event 3), later reads by P1 and P2 (events 4-5) still see u = 5] • Processors see different values for u after event 3 • With write-back caches, the value written back to memory depends on which cache flushes or writes back its value • Processes accessing main memory may see a very stale value • Unacceptable for programming, and it’s frequent!

  25. Cache Coherence Definition • A memory system is coherent if • A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • Preserves program order • If P1 writes to X and P2 reads X after a sufficient time, with no other writes to X in between, P2’s read returns the value written by P1 • Any write to an address must eventually be seen by all processors • Writes to the same location are serialized: two writes to location X are seen in the same order by all processors • Preserves causality

  26. Maintaining Cache Coherence • Hardware schemes • Shared Caches • Trivially enforces coherence • Not scalable (the L1 cache quickly becomes a bottleneck) • Snooping • Every cache with a copy of data also has a copy of the block’s sharing status, but no centralized state is kept • Needs a broadcast network (like a bus) to enforce coherence • Directory • Sharing status of a block of physical memory is kept in just one location, the directory • Can enforce coherence even with a point-to-point network

  27. Snoopy Cache-Coherence Protocols • [figure: each cache line holds State, Address, and Data fields] • Cache controller “snoops” all transactions on the shared medium (bus or switch) • A transaction is relevant if it is for a block the cache contains • Take action to ensure coherence • Invalidate, update, or supply the value • Action depends on the state of the block and the protocol • Either get exclusive access before a write (write invalidate) or update all copies on a write (write update); a state-machine sketch follows
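A minimal sketch of the snooping logic in C, assuming MSI-style states (the slide does not commit to a specific protocol, so the state names and the write_back helper are illustrative). Each line carries the State/Address/Data fields shown in the figure above:

```c
/* Write-invalidate snooping sketch for one cache line (MSI-style). */
#include <stdint.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } LineState;

typedef struct {
    LineState state;
    uint64_t  addr;
    uint8_t   data[64];            /* one cache block */
} CacheLine;

void write_back(CacheLine *line); /* hypothetical helper: flush dirty data */

/* Called by the cache controller for every transaction it snoops
   on the shared bus. */
void snoop(CacheLine *line, uint64_t addr, bool is_write) {
    if (line->state == INVALID || line->addr != addr)
        return;                    /* not a relevant transaction */

    if (is_write) {
        line->state = INVALID;     /* another CPU wants exclusive access */
    } else if (line->state == MODIFIED) {
        write_back(line);          /* supply the dirty value, then share */
        line->state = SHARED;
    }
}
```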

  28. Example: Write-Through Invalidate • [figure: same three-processor bus system as slide 24; P3’s write u = 7 now invalidates the other cached copies of u] • Must invalidate before step 3 • Write update uses more broadcast medium BW → all recent MPUs use write invalidate
