
Advanced Computer Architecture: Parallel Processing and Communication Models

Explore Flynn's taxonomy, the challenge of parallel processing, coherence and consistency problems, synchronization, and communication models in advanced computer architecture.



Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z033, Multi-Processing 1. Henk Corporaal, TU Eindhoven, 2010. www.ics.ele.tue.nl/~heco/courses/aca, h.corporaal@tue.nl

  2. Topics • Flynn's taxonomy • Why Parallel Processors • Communication models • Challenge of parallel processing • Coherence problem • Consistency problem • Synchronization • Book: Chapter 4, appendix E, H

  3. Flynn's Taxonomy • SISD (Single Instruction, Single Data) • Uniprocessors • SIMD (Single Instruction, Multiple Data) • Vector architectures also belong to this class • Multimedia extensions (MMX, SSE, VIS, AltiVec, …) • Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, … • MISD (Multiple Instruction, Single Data) • Systolic arrays / stream-based processing • MIMD (Multiple Instruction, Multiple Data) • Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin • Flexible • Most widely used • Compare with the classification presented earlier!

  4. Flynn's Taxonomy (overview diagram)

  5. Why parallel processing • Performance drive • Diminishing returns for exploiting ILP and OLP • Multiple processors fit easily on a chip • Cost effective (just connect existing processors or processor cores) • Low power: parallelism may allow lowering Vdd • However: parallel programming is hard

  6. Low power through parallelism (diagram: one CPU vs. CPU1 + CPU2) • Sequential processor: switching capacitance C, frequency f, voltage V, so P1 = f·C·V² • Parallel processor (twice the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P2 = (f/2)·2C·V'² = f·C·V'² < P1
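
A hedged numeric check of the formula above (the value of V' is an illustrative assumption, not from the slide): if halving the frequency allows the supply voltage to drop to V' = 0.7 V, then

\[
\frac{P_2}{P_1} = \frac{f C V'^2}{f C V^2} = \left(\frac{V'}{V}\right)^2 = 0.7^2 = 0.49 ,
\]

i.e. the same throughput at roughly half the power.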

  7. Parallel Architecture • A parallel architecture extends traditional computer architecture with a communication network • abstractions (HW/SW interface) • organizational structure to realize the abstraction efficiently (diagram: processing nodes connected by a communication network)

  8. Communication models: Shared Memory (diagram: processes P1 and P2 both read and write a shared memory) • Coherence problem • Memory consistency issue • Synchronization problem

  9. Communication models: Shared memory • Shared address space • Communication primitives: • load, store, atomic swap • Two varieties: • Physically shared => Symmetric Multi-Processors (SMP) • usually combined with local caching • Physically distributed => Distributed Shared Memory (DSM)

  10. SMP: Symmetric Multi-Processor (diagram: four processors, each with one or more cache levels, connected to main memory and the I/O system by an interconnect that can be one bus, N buses, or any network) • Memory: centralized with uniform access time (UMA), bus interconnect, I/O • Examples: Sun Enterprise 6000, SGI Challenge, Intel

  11. DSM: Distributed Shared Memory (diagram: processor/cache/memory nodes connected by an interconnection network, with I/O) • Non-uniform access time (NUMA) and scalable interconnect (distributed memory)

  12. Shared Address Model Summary • Each processor can name every physical location in the machine • Each process can name all data it shares with other processes • Data transfer via load and store • Data size: byte, word, ... or cache blocks • Memory hierarchy model applies: • communication moves data to local proc. cache

  13. Communication models: Message Passing (diagram: processes P1 and P2 exchange messages via send and receive over FIFO channels) • Communication primitives • e.g., send and receive library calls • standard: MPI, the Message Passing Interface • www.mpi-forum.org • Note that MP can be built on top of SM, and vice versa! A minimal MPI sketch is shown below.
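
A hedged sketch of the send/receive style in standard MPI (the payload and tag values are illustrative; run with at least two processes, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
        if (rank == 0) {
            int amount = 42;                     /* illustrative payload */
            /* blocking send of one int to process 1, tag 0 */
            MPI_Send(&amount, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int amount;
            /* blocking receive of one int from process 0, tag 0 */
            MPI_Recv(&amount, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", amount);
        }
        MPI_Finalize();
        return 0;
    }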

  14. Message Passing Model • Explicit message send and receive operations • Send specifies the local buffer + the receiving process on the remote computer • Receive specifies the sending process on the remote computer + the local buffer in which to place the data • Typically blocking communication, but may use DMA • Message structure: header, data, trailer

  15. Message passing communication (diagram: per node, a processor with cache and memory; each node's network interface uses DMA to transfer messages over the interconnection network)

  16. Communication Models: Comparison • Shared memory • Compatibility with well-understood (language) mechanisms • Ease of programming for complex or dynamic communication patterns • Shared-memory applications; sharing of large data structures • Efficient for small items • Supports hardware caching • Message passing • Simpler hardware • Explicit communication • Implicit synchronization (with every communication)

  17. Network: Performance metrics • Network bandwidth • Need high bandwidth in communication • How does it scale with the number of nodes? • Communication latency • Affects performance, since the processor may have to wait • Affects ease of programming, since it requires more thought to overlap communication and computation • How can a mechanism help hide latency? • overlap message send with computation • prefetch data • switch to another task or thread

  18. Challenges of parallel processing • Q1: Can we get linear speedup? Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)? Answer: f_seq = 0.25% (see the worked derivation below) • Q2: How important is communication latency? Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
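
Hedged worked versions of both answers (plain Amdahl's-law arithmetic; the derivations are added here, not transcribed from the deck):

\[
\text{Speedup} = \frac{1}{f_{seq} + \frac{1 - f_{seq}}{100}} = 80
\;\Rightarrow\; f_{seq} + \frac{1 - f_{seq}}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; f_{seq} = \frac{0.0025}{0.99} \approx 0.25\%
\]

\[
\text{CPI} = 0.5 + 0.002 \times 100 = 0.7 ,
\qquad \frac{0.7}{0.5} = 1.4
\]

so the remote accesses make every instruction 1.4 times slower on average.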

  19. Three fundamental issues for shared-memory multiprocessors • Coherence: do I see the most recent data? • Consistency: when do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)? • Synchronization: how to synchronize processes? • how to protect access to shared data?

  20. Coherence problem in a single-CPU system (diagram) 1) The CPU writes to a: the cache now holds a' = 550 while memory still holds a = 100, so cache and memory are not coherent. 2) I/O writes to b: memory now holds b = 440 while the cache still holds b' = 200, again not coherent.

  21. Coherence problem in a multi-processor system (diagram) After CPU-1 writes to a, its cache holds a' = 550 while CPU-2's cache holds a'' = 100 and memory holds a = 100: the two caches and memory disagree.

  22. What Does Coherency Mean? • Informally: • “Any read must return the most recent write (to the same address)” • Too strict and too difficult to implement • Better: • A write followed by a read by the same processor P, with no writes in between, returns the value written • “Any write must eventually be seen by a read” • If P writes to X and P' reads X, then P' will see the value written by P if the read and write are sufficiently separated in time • Writes to the same location by different processors are seen in the same order by all processors ("serialization") • Suppose P1 writes location X, followed by P2 also writing to X. Without serialization, some processors would read the value written by P1 and others the value written by P2. Serialization guarantees that all processors see the same sequence.

  23. Two rules to ensure coherency • “If P1 writes x and P2 reads it, P1’s write will be seen by P2 if the read and write are sufficiently far apart” • Writes to a single location are serialized: seen in one order • Latest write will be seen • Otherwise could see writes in illogical order (could see older value after a newer value)

  24. Potential HW Coherency Solutions • Snooping solution (snoopy bus): • Send all requests for data to all processors (or local caches) • Processors snoop to see if they have a copy and respond accordingly • Requires broadcast, since caching information is at the processors • Works well with a bus (natural broadcast medium) • Dominates for small-scale machines (most of the market) • Directory-based schemes • Keep track of what is being shared in one centralized place • Distributed memory => distributed directory for scalability (avoids bottlenecks and hot spots) • Scales better than snooping • Actually existed BEFORE snooping-based schemes

  25. Example snooping protocol (diagram: four processors, each with a cache, on a bus to main memory and the I/O system) • 3 states for each cache line: • invalid, shared (read-only), modified (also called exclusive: you may write it) • one FSM per cache, which gets requests from both the processor and the bus

  26. Snooping Protocol 1: Write Invalidate • Get exclusive access to a cache block (invalidate all other copies) before writing it • When a processor reads an invalid cache block, it is forced to fetch a new copy • If two processors attempt to write simultaneously, one of them is first (having a bus helps); the other must obtain a new copy, thereby enforcing serialization • Example: address X in memory initially contains the value '0' (a reconstruction of the example follows below)
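
The example table did not survive transcription; below is a hedged reconstruction of the standard textbook write-invalidate sequence for address X (initially 0):

    Processor activity     Bus activity        CPU A's cache   CPU B's cache   Memory at X
    (initially)                                                                0
    CPU A reads X          cache miss for X    0                               0
    CPU B reads X          cache miss for X    0               0               0
    CPU A writes 1 to X    invalidate for X    1               (invalid)       0
    CPU B reads X          cache miss for X    1               1               1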

  27. Basics of Write Invalidate (diagram: processor, cache, bus, memory) • Use the bus to perform invalidates • To perform an invalidate, acquire bus access and broadcast the address to be invalidated • all processors snoop the bus, listening to addresses • if the address is in my cache, invalidate my copy • Serialization of bus access enforces write serialization • Where is the most recent value? • Easy for write-through caches: in the memory • For write-back caches, again use snooping • Can use the cache tags to implement snooping • Might interfere with cache accesses coming from the CPU • Duplicate the tags, or employ a multilevel cache with inclusion

  28. Snoopy-Cache State Machine I • State machine for CPU requests, for each cache block (states: Invalid, Shared (read-only), Exclusive (read/write)) • Invalid, CPU read: place read miss on bus; go to Shared • Invalid, CPU write: place write miss on bus; go to Exclusive • Shared, CPU read hit: stay in Shared • Shared, CPU read miss: place read miss on bus; stay in Shared • Shared, CPU write: place write miss on bus; go to Exclusive • Exclusive, CPU read hit or CPU write hit: stay in Exclusive • Exclusive, CPU read miss: write back the block, place read miss on bus; go to Shared • Exclusive, CPU write miss: write back the cache block, place write miss on bus; stay in Exclusive

  29. Snoopy-Cache State Machine II • State machine for bus requests, for each cache block • Shared, write miss for this block: go to Invalid • Exclusive, write miss for this block: write back the block (abort memory access); go to Invalid • Exclusive, read miss for this block: write back the block (abort memory access); go to Shared • A combined C sketch of both state machines follows below
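
A hedged C sketch combining the two state machines above into one transition function; the event names and bus-action helpers are illustrative placeholders, not from the slides:

    typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;

    typedef enum {
        CPU_READ, CPU_WRITE,            /* events from the local processor */
        BUS_READ_MISS, BUS_WRITE_MISS   /* events snooped on the bus */
    } event_t;

    /* Illustrative stand-ins for driving the bus. */
    static void place_read_miss_on_bus(void)  { /* drive bus: read miss  */ }
    static void place_write_miss_on_bus(void) { /* drive bus: write miss */ }
    static void write_back_block(void)        { /* flush the dirty block */ }

    /* 'hit' says whether the CPU access found this block in the cache. */
    state_t next_state(state_t s, event_t e, int hit) {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  { place_read_miss_on_bus();  return SHARED;    }
            if (e == CPU_WRITE) { place_write_miss_on_bus(); return EXCLUSIVE; }
            return INVALID;             /* bus traffic doesn't affect an invalid line */
        case SHARED:
            if (e == CPU_READ)  { if (!hit) place_read_miss_on_bus(); return SHARED; }
            if (e == CPU_WRITE) { place_write_miss_on_bus(); return EXCLUSIVE; }
            if (e == BUS_WRITE_MISS) return INVALID;   /* another cache will write */
            return SHARED;
        case EXCLUSIVE:
            if (e == CPU_READ || e == CPU_WRITE) {
                if (hit) return EXCLUSIVE;             /* read/write hit: no change */
                write_back_block();                    /* conflict miss on a dirty block */
                if (e == CPU_READ) { place_read_miss_on_bus();  return SHARED;    }
                else               { place_write_miss_on_bus(); return EXCLUSIVE; }
            }
            if (e == BUS_READ_MISS)  { write_back_block(); return SHARED;  }
            if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
            return EXCLUSIVE;
        }
        return s;
    }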

  30. Responds to events caused by the processor (table)

  31. Responds to events on the bus (table)

  32. Snooping protocol 2: Write update/broadcast • Update all cached copies • To keep the bandwidth requirements under control, track whether words are shared or private • Example: address X in memory initially contains the value '0'

  33. Qualitative Performance Differences • Compare write invalidate versus write update: • Multiple writes to the same word require • multiple write broadcasts in a write-update protocol • one invalidation in write invalidate • When a cache block contains multiple words, each word written to the block requires • multiple write broadcasts in a write-update protocol • one invalidation in write invalidate • write invalidate works on cache blocks, write update on words/bytes • The delay between writing a word on one processor and reading the new value on another is smaller with write update • And the winner is? • write invalidate, because bus bandwidth is the most precious resource

  34. True vs. False Sharing (diagram: words X1 and X2 lie in the same cache block, present in both P1's and P2's caches) • True sharing: the word(s) being read is (are) the same as the word(s) being written • False sharing: the word being read is different from the word being written, but they are in the same cache block (see the C sketch below)
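
A hedged C sketch of false sharing and a common fix (the 64-byte block size and the padding idiom are illustrative assumptions):

    #include <stdint.h>

    /* Two counters, each updated by a different thread. Packed together
     * they share one cache block, so every write by one thread invalidates
     * the other thread's copy (false sharing), although no data is shared. */
    struct counters_bad {
        uint64_t p1_count;    /* written only by thread 1 */
        uint64_t p2_count;    /* written only by thread 2 */
    };

    /* Padding pushes the counters into separate (assumed 64-byte) blocks,
     * so the invalidation ping-pong disappears. */
    struct counters_good {
        uint64_t p1_count;
        char     pad[64 - sizeof(uint64_t)];
        uint64_t p2_count;
    };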

  35. Why This Protocol Works • Every valid cache block is either • in shared state in multiple caches • in exclusive state in one cache • Transition to exclusive requires write miss on the bus • this invalidates all other copies • if another cache has the block exclusively, that cache generates a write back! • If a read miss occurs on the bus to an exclusive block, the owning cache makes its state shared • if the corresponding processor again wants to write, it needs to re-gain exclusive access

  36. Complication: Write Races • Cannot update cache until bus is obtained • Otherwise, another processor may get bus first, and then write the same cache block! • Two step process: • Arbitrate for bus • Place miss on bus and complete operation • If miss occurs to block while waiting for bus, handle miss (invalidate may be needed) and then restart • Split transaction bus: • Bus transaction is not atomic; can have multiple outstanding transactions for a block • Multiple misses can interleave, allowing two caches to grab block in the Exclusive state • Must track and prevent multiple misses for one block

  37. Implementation Simplifications • We place a write miss on the bus even if we have a write hit to a shared block • ownership or upgrade misses • real protocols support invalidate operations • Real protocols really distinguish between shared and private data • don't need to get exclusive access to private data

  38. Three fundamental issues for shared-memory multiprocessors • Coherence: do I see the most recent data? • Synchronization: how to synchronize processes? • how to protect access to shared data? • Consistency: when do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?

  39. What's the Synchronization problem? • Assume the bank's computer system has a credit process (P_c) and a debit process (P_d), both updating the shared variable balance:

    /* Process P_c */            /* Process P_d */
    shared int balance;          shared int balance;
    private int amount;          private int amount;
    balance += amount;           balance -= amount;

    lw  $t0,balance              lw  $t2,balance
    lw  $t1,amount               lw  $t3,amount
    add $t0,$t0,$t1              sub $t2,$t2,$t3
    sw  $t0,balance              sw  $t2,balance

If the two instruction sequences interleave between the loads and the stores, one update is lost: e.g. both load balance = 100, P_c stores 150, then P_d stores 70, and the credit has disappeared.
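
A minimal pthreads sketch exhibiting the same lost-update race (names and iteration counts are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    int balance = 100;                 /* shared, deliberately unprotected */

    void *credit(void *arg) {
        for (int i = 0; i < 100000; i++)
            balance += 1;              /* read-modify-write: not atomic */
        return NULL;
    }

    void *debit(void *arg) {
        for (int i = 0; i < 100000; i++)
            balance -= 1;
        return NULL;
    }

    int main(void) {
        pthread_t pc, pd;
        pthread_create(&pc, NULL, credit, NULL);
        pthread_create(&pd, NULL, debit, NULL);
        pthread_join(pc, NULL);
        pthread_join(pd, NULL);
        printf("balance = %d\n", balance);  /* should be 100; often is not */
        return 0;
    }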

  40. Critical Section Problem • n processes all competing to use some shared data • Each process has a code segment, called the critical section, in which the shared data is accessed • Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section • Structure of a process:

    while (TRUE) {
        entry_section();
        critical_section();
        exit_section();
        remainder_section();
    }

  41. Attempt 1: Strict Alternation • Two problems: • satisfies mutual exclusion, but not progress (works only when both processes strictly alternate) • busy waiting

    /* Process P0 */             /* Process P1 */
    shared int turn;             shared int turn;
    while (TRUE) {               while (TRUE) {
        while (turn != 0) ;          while (turn != 1) ;
        critical_section();          critical_section();
        turn = 1;                    turn = 0;
        remainder_section();         remainder_section();
    }                            }

  42. Attempt 2: Warning Flags • Satisfies mutual exclusion • P0 in its critical section: flag[0] && !flag[1] held when it entered • P1 in its critical section: !flag[0] && flag[1] held when it entered • However, it contains a deadlock (both flags may be set to TRUE!)

    /* Process P0 */             /* Process P1 */
    shared int flag[2];          shared int flag[2];
    while (TRUE) {               while (TRUE) {
        flag[0] = TRUE;              flag[1] = TRUE;
        while (flag[1]) ;            while (flag[0]) ;
        critical_section();          critical_section();
        flag[0] = FALSE;             flag[1] = FALSE;
        remainder_section();         remainder_section();
    }                            }

  43. Software solution: Peterson's Algorithm (combining warning flags and alternation)

    /* Process P0 */                      /* Process P1 */
    shared int flag[2];                   shared int flag[2];
    shared int turn;                      shared int turn;
    while (TRUE) {                        while (TRUE) {
        flag[0] = TRUE;                       flag[1] = TRUE;
        turn = 0;                             turn = 1;
        while (turn==0 && flag[1]) ;          while (turn==1 && flag[0]) ;
        critical_section();                   critical_section();
        flag[0] = FALSE;                      flag[1] = FALSE;
        remainder_section();                  remainder_section();
    }                                     }

• A software solution is slow! • Difficult to extend to more than 2 processes

  44. Hardware solution for Synchronization • For large-scale MPs, synchronization can be a bottleneck; we need techniques to reduce the contention and latency of synchronization • Hardware primitives are needed • all solutions are based on atomically inspecting and updating a memory location • Higher-level synchronization solutions can be built on top

  45. Uninterruptable Instructions to Fetch and Update Memory • Atomic exchange: interchange a value in a register for a value in memory • 0 => synchronization variable is free • 1 => synchronization variable is locked and unavailable • Test-and-set: tests a value and sets it if the value passes the test • similar: compare-and-swap • Fetch-and-increment: returns the value of a memory location and atomically increments it • 0 => synchronization variable is free

  46. Build a 'spin lock' using the exchange primitive • Spin lock: the processor continuously tries to acquire the lock, spinning around a loop:

            LI   R2,#1       ;load immediate
    lockit: EXCH R2,0(R1)    ;atomic exchange
            BNEZ R2,lockit   ;already locked?

• What about an MP with cache coherency? • Want to spin on the cached copy, to avoid the full memory latency • Likely to get cache hits for such variables • Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic • Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

    try:    LI   R2,#1       ;load immediate
    lockit: LW   R3,0(R1)    ;load lock variable
            BNEZ R3,lockit   ;not free => spin
            EXCH R2,0(R1)    ;atomic exchange
            BNEZ R2,try      ;already locked?
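
The same test-and-test-and-set idea in portable C11 atomics (a hedged sketch; <stdatomic.h> is standard, the lock layout is illustrative):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

    void spin_lock(spinlock_t *l) {
        for (;;) {
            /* Test: spin on the locally cached copy; generates no bus writes. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* Test-and-set: one atomic exchange; we hold the lock if we saw 0. */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;
        }
    }

    void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }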

  47. Alternative to Fetch and Update • Hard to have a read and a write in one instruction: use two instead • Load Linked (or load locked) + Store Conditional • load linked returns the initial value • store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise • Example: atomic swap with LL & SC:

    try: OR   R3,R4,R0   ; R3 = R4 (exchange value)
         LL   R2,0(R1)   ; load linked
         SC   R3,0(R1)   ; store conditional
         BEQZ R3,try     ; branch if store fails (R3 = 0)

• Example: fetch & increment with LL & SC:

    try: LL   R2,0(R1)   ; load linked
         ADDI R3,R2,#1   ; increment
         SC   R3,0(R1)   ; store conditional
         BEQZ R3,try     ; branch if store fails (R3 = 0)
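
C has no direct LL/SC, but the same retry pattern is usually written with compare-and-swap; a hedged sketch of fetch-and-increment:

    #include <stdatomic.h>

    /* Retry loop analogous to LL/SC: read, compute, try to commit. */
    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);           /* like LL: observe the current value */
        /* The CAS fails, refreshing 'old', if another store intervened,
         * just as an SC would fail. */
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;                               /* retry until the commit succeeds */
        return old;                         /* value before the increment */
    }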

  48. Three fundamental issues for shared-memory multiprocessors • Coherence: do I see the most recent data? • Synchronization: how to synchronize processes? • how to protect access to shared data? • Consistency: when do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?

  49. Memory Consistency • When is a write observed? (the slide's OpenMP memory-consistency example was not transcribed)

  50. Memory Consistency: The Problem • Observation: if writes take effect immediately (are immediately seen by all processors), it is impossible that both if-statements (in the classic two-processor example sketched below) evaluate to true • But what if the write invalidate is delayed? • Should this be allowed, and if so, under what conditions?
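
The example fragment itself did not survive transcription; below is a hedged reconstruction of the classic two-processor code (in the style of Hennessy & Patterson) that the observation refers to:

    int A = 0, B = 0;          /* shared, both initially 0 */

    void p1(void) {            /* runs on processor P1 */
        A = 1;
        if (B == 0) {
            /* ... e.g. enter a critical region ... */
        }
    }

    void p2(void) {            /* runs on processor P2 */
        B = 1;
        if (A == 0) {
            /* ... */
        }
    }

    /* If every write is seen immediately by the other processor, at most one
     * of the two if-branches can be taken. If write invalidates are delayed,
     * both processors may read the stale 0 and both branches can be taken. */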
