Topic 4 - PowerPoint PPT Presentation


  1. “I think I can safely say that nobody understands Quantum Mechanics” (Richard Feynman). Topic 4: Shared Memory Architectures, Design and Issues. “The great tragedy of science ... the slaying of a beautiful theory by an ugly fact.” (T. H. Huxley) ELEG652-06F

  2. Reading List • Slides: Topic 4x • Hennessy & Patterson: Chapters 5 and 6 • Culler & Singh: Chapter 5 • Other papers as assigned in class or homework

  3. Side Note 3: TLP and SMT • Thread Level Parallelism • Parallelism that arises from running multiple threads at the same time • Multithreading • Multiple threads share functional units in a single processor in an overlapped fashion • Distinct saved state per thread • PC, registers, page table, etc. • Fast switching between threads

  4. Side Note 3: TLP and SMT • Types of Multithreading • Fine grained • Switch between threads on each instruction • Interleaved thread execution • Hides throughput losses in both short and long stalls • Slows down execution of a single thread • Coarse grained • Switch threads only when a high-latency event (a long stall) occurs • Limited ability to overcome throughput losses • Pipeline start-up costs • Simultaneous Multithreading (SMT) • Better utilization of resources • Lifts the lock on the pipeline • Instructions from several threads are inside the pipeline at once • Normal multithreading: the pipeline is “locked” by one thread

  5. [Diagram: issue slots over time for fine-grained MT, coarse-grained MT, SMT, and a superscalar processor] Example: Intel HT animation: http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm

  6. Modern Multiprocessors: Common Extended Memory Hierarchies • [Diagrams: bus-based shared memory (per-processor caches on a shared bus with main memory and I/O devices); shared cache (processors behind a switch sharing one cache above main memory); dance hall (processors with caches and memories on opposite sides of an interconnect); distributed memory (memory attached to each processor, joined by an interconnect)]

  7. Review • Multiprocessors • Multiple-CPU computers with shared memory • Centralized multiprocessor • A group of processors sharing a bus and the same physical memory • Uniform Memory Access (UMA) • Symmetric Multiprocessors (SMP) • Distributed multiprocessors • Memory is distributed across several processors • Memory forms a single logical memory space • Non-Uniform Memory Access multiprocessor (NUMA) • Multicomputers • Disjoint local address spaces for each processor • Asymmetrical multicomputers • Consist of a front end (user interaction and I/O devices) and a back end (parallel tasks) • Symmetrical multicomputers • All components (computers) have identical functionality • Clusters and networks of workstations

  8. Programming Execution Models • A set of rules to create programs • Message Passing Model • De Facto Multicomputer Programming Model • Multiple Address Spaces • Explicit Communication / Implicit Synchronization • Shared Memory Models • De Facto Multiprocessor Programming Model • Single Address Space • Implicit Communication / Explicit Synchronization

  9. Distributed Memory MIMD • Advantages • Less contention • Highly scalable • Simplified synchronization: message passing combines synchronization + communication • Disadvantages • Load balancing • Deadlock / livelock prone • Wasted bandwidth • Overhead of small messages

  10. Shared Memory MIMD • Advantages • No partitioning • No explicit data movement • Minor modifications (or none at all) to toolchains and compilers • Disadvantages • Synchronization • Scalability • Requires a high-throughput, low-latency network • Memory hierarchies (DSM)

  11. Shared Memory Execution Model: the Thread Virtual Machine • Thread Model • A set of rules for thread creation, scheduling, and destruction • Synchronization Model • Rules that deal with access to shared data • Memory Model • A group of rules that deals with data replication, coherency, and memory ordering • Shared data: data that can be accessed by other threads • Private data: data that is not visible to other threads

  12. User Level Shared Memory Support • Shared Address Space Support and Management • Access Control and Management • Memory Consistency Model • Cache Management Mechanism

  13. Grand Challenge Problems • Making shared memory multiprocessors effective at scales of thousands of units • Optimizing and compiling parallel applications • Main areas: assumptions about • Memory coherency • Memory consistency

  14. Review • Memory Coherency • Ensures that a memory op (a write) will become visible to all actors; it imposes no restrictions on when it becomes visible • Per-location consistency • Memory Consistency • Ensures that two or more memory ops have a certain order among them, even when those operations come from different actors

  15. Topic 4a Memory Consistency Foundation for Building Shared Memory Machines

  16. Memory [Cache] Coherency: The Problem • [Diagram: processors P1, P2, P3 with private caches over a shared memory holding u = 5; P1 and P2 load copies of u, then P3 writes u = 7] • What value will P1 and P2 read?

  17. Memory Consistency Problem • P1: B = 0; … ; A = 1; L1: print B • P2: A = 0; … ; B = 1; L2: print A • Assume that L1 and L2 are issued only after the other four instructions have completed. What are the possible values printed on the screen? Is 0, 0 a possible combination? • The MCM: a software and hardware contract

  18. MCM Attributes • Memory Operations • Location of access • Near memory (cache, nearby memory modules, etc.) vs. far memory • Direction of access • Write or read • Value transmitted in access • Size • Causality of access • Check whether two accesses are “causally” related and, if they are, in which order they complete • Category of access • Static property of accesses

  19. MCM: Category of Access (as presented in Mosberger 93) • Memory access: shared vs. private • Shared: competing vs. non-competing • Competing: synchronization vs. non-synchronization • Synchronization: acquire vs. release • Acquire: exclusive vs. non-exclusive • Uniform vs. hybrid models

  20. MCM Myths (Adve 96) • Myth 1: An MCM applies only to systems that allow multiple copies of data • Reality: hardware and aggressive software optimizations may violate the canonical SC model • Myth 2: Most commercial systems are sequentially consistent • Reality: systems like the Cray T3D, AlphaServer 8200/8400, and IBM Power5 use weak memory models

  21. MCM Myths • Myth 3: The MCM only affects hardware design • Reality: it affects programming models and the optimizations that compilers may apply • Myth 4: Cache coherency inherently supports SC • Reality: CC is just a part of a given MCM. Other aspects deal with the atomicity of writes and the order in which memory ops are issued by the processor

  22. MCM Myths • Myth 5: The MCM depends on the cache behavior (invalidate or update) • Reality: MCMs can allow both types of cache behavior • Myth 6: A system’s memory behavior is given (solely) by the processor (or memory) behavior • Reality: both processor and memory contribute to the overall system memory behavior

  23. MCM Myths • Myth 7: Relaxed memory models usually require extra synchronization • Reality: correct labeling of operations, or safety nets, are provided by most models

  24. Conventional MCMs (as presented in Mosberger 93) • Uniform: Atomic Consistency, Sequential Consistency, Causal Consistency, Cache Consistency, PRAM, Slow Memory • Hybrid: Processor Consistency, Weak Consistency, Release Consistency, Entry Consistency

  25. Conventional MCM • Atomic Consistency • Operation interval: memory ops happen only inside this interval • Many operations are allowed in the same interval • Static: reads happen at the beginning and writes happen at the end • Dynamic: operations happen at any point, as long as the result is as if they had run in a serial execution • “Any read to a memory location X returns the value stored by the most recent write operation to X”

  26. Conventional MCM • Sequential Consistency • “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79] • Weaker than atomic consistency • Allows all interleavings of instructions from different processors • Some of these interleavings are not allowed under AC • Example: P1: W(x) = 1 • P2: R(x) = 0, then R(x) = 1 • Legal under SC but not under AC

  27. Conventional MCM • Causal Consistency • Events (writes) that are causally related must be seen in the same order by all processors • Example: W1(x), R2(x), W2(y) are causally related because the value of y might depend on the value written to x • Causally unrelated events can be seen in any order

  28. Conventional MCM • Cache Consistency • Synonymous with cache coherence • Sequential ordering on a per-location basis • SC ensures sequential ordering for all memory locations • Cache consistency is included in SC, but not the other way around • Pipelined RAM (PRAM) • Single processor: writes can be pipelined without stalling • All writes from other processors are considered concurrent • They can be seen in different orders

  29. Conventional MCM • Processor Consistency (Goodman 89) • PRAM and coherence united • Processors agree on the order of writes from a single processor, but might disagree on the order of writes from different processors, as long as they are to different locations • Stronger than cache consistency but weaker than sequential consistency

  30. Conventional MCM • Weak Consistency • The following restrictions apply • All accesses to synchronization variables are SC • No access to a synchronization variable is issued until all previous data accesses have performed • No access is issued by a processor until a previous synchronization access has performed • Synchronization access == fence • A program behaves as SC under WC if • There are no data races • Synchronization is visible to the memory system
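To make the role of a fence concrete, here is a toy store-buffer model in Python. This is our own sketch, not part of the lecture: the names Proc and litmus, and the rule that buffered writes become visible only when a fence flushes them, are illustrative assumptions. Without fences the classic two-processor litmus test can observe 0, 0; with fences it cannot.

```python
class Proc:
    """A processor with a private store buffer over a shared memory dict."""
    def __init__(self, mem):
        self.mem = mem
        self.buf = []                    # pending (addr, value) stores

    def write(self, addr, val):
        self.buf.append((addr, val))     # write is buffered, not yet visible

    def read(self, addr):
        for a, v in reversed(self.buf):  # a processor sees its own buffer first
            if a == addr:
                return v
        return self.mem[addr]            # otherwise read shared memory

    def fence(self):
        for a, v in self.buf:            # drain the buffer: writes perform
            self.mem[a] = v
        self.buf.clear()

def litmus(use_fence):
    """P1: A = 1; read B.  P2: B = 1; read A."""
    mem = {"A": 0, "B": 0}
    p1, p2 = Proc(mem), Proc(mem)
    p1.write("A", 1)
    p2.write("B", 1)
    if use_fence:
        p1.fence()
        p2.fence()
    return p1.read("B"), p2.read("A")

print(litmus(use_fence=False))  # (0, 0): both writes still buffered
print(litmus(use_fence=True))   # (1, 1): fences made both writes visible
```

With the fences in place, both writes have performed before either read is issued, so the forbidden 0, 0 outcome disappears.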

  31. Conventional MCM • Release Consistency • Refinement of WC • Synchronization accesses become acquire, release, and non-synchronization accesses • Acquire: one-sided memory barrier; delays all future memory accesses • Release: one-sided memory barrier; it does not complete until all previous memory accesses have completed • Non-synchronization access: competing accesses with no synchronization purpose • Entry Consistency • Similar to RC, but it associates every shared variable with a synchronization variable (a lock or barrier) • Concurrent access to different critical sections • Refines acquire into exclusive and non-exclusive access
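The acquire/release pairing can be sketched with Python threads, using a threading.Event in the role of the synchronization variable. This is a minimal sketch of the idea; the producer/consumer names and the value 42 are our own illustration, not from the slides.

```python
import threading

data = 0
flag = threading.Event()          # plays the role of the synchronization variable

def producer():
    global data
    data = 42                     # ordinary shared-data write
    flag.set()                    # "release": performed only after prior accesses

def consumer(result):
    flag.wait()                   # "acquire": delays all following accesses
    result.append(data)           # guaranteed to see the producer's write

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t2.start()
t1.start()
t1.join()
t2.join()
print(result)                     # [42]
```

The release (flag.set) cannot complete before the write to data, and the acquire (flag.wait) holds back the read of data, which is exactly the one-sided barrier behavior the slide describes.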

  32. More on SC • [Diagram: processors P1 … Pn each issue an operation stream S1 … Sn; memory sees a single interleaving of {S1, S2, … Sn}]

  33. Memory Consistency Problem • P1: B = 0 (1); … ; A = 1 (2); L1: print B • P2: A = 0 (3); … ; B = 1 (4); L2: print A • Assume that L1 and L2 are issued only after the other four instructions have completed • The legal SC interleavings are (1, 2, 3, 4), (1, 3, 2, 4), (1, 3, 4, 2), (3, 4, 1, 2), (3, 1, 2, 4), (3, 1, 4, 2) • Is 0, 0 a possible combination? The answer: NO under SC, but under weaker models like PRAM it is possible
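The six legal interleavings can be checked mechanically. This short Python sketch (our own illustration) enumerates every permutation of the four writes that respects each processor's program order, and collects the (B, A) values the two prints would observe:

```python
from itertools import permutations

p1 = [("B", 0), ("A", 1)]   # P1's program order: ops (1), (2)
p2 = [("A", 0), ("B", 1)]   # P2's program order: ops (3), (4)
ops = p1 + p2               # indices 0,1 belong to P1; 2,3 to P2

outcomes = set()
for order in permutations(range(4)):
    # keep only interleavings that respect each processor's program order
    if order.index(0) < order.index(1) and order.index(2) < order.index(3):
        mem = {}
        for i in order:
            var, val = ops[i]
            mem[var] = val
        outcomes.add((mem["B"], mem["A"]))   # (L1: print B, L2: print A)

print(sorted(outcomes))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

Exactly the six interleavings from the slide survive the filter, and none of them yields 0, 0, confirming the answer under SC.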

  34. Sufficient Conditions for SC • Every processor issues memory operations in program order • After a write is issued, the processor waits for it to complete • After a read is issued, the processor waits for it to complete, and also waits for the write that produced the value returned by the read • i.e., reads have to wait for the writes on which they depend to have propagated to all processors

  35. An Example Sequential Consistency Compliancy and the Cyclops64 Architecture

  36. Intro • The Cyclops64 Architecture • Next-generation cellular architecture • Composed of several thousand processing units arranged into a 3D mesh • A-Switch connections between nodes • Processing units • 80 dual thread units • 80 floating point units • On-chip (interleaved) SRAM banks • On-chip instruction cache • Off-chip DRAM memory • A crossbar interconnect between DRAM, SRAM, thread units, FP units and the “A-Switch” (which provides communication to the other processing nodes in the system) • Objective: surpass the peta-FLOP barrier

  37. Cyclops64 Architecture • Picture courtesy of Juan del Cuvillo et al., from their paper “Toward a Software Infrastructure for the Cyclops64 Cellular Architecture”

  38. Cyclops64 Simplified Arch. Model • Processors and their issuing buffers: FIFOs are fed with memory requests in program order • Issuing buffers (tcp): requests to the same memory bank M have equal latency regardless of the issuing FIFO / processor • Crossbar (tcm) • Memory buffers (tmb): requests are queued in the memory buffers in the order in which they arrive from the network • A memory request in a memory buffer cannot be served until the one before it has completed

  39. SC under C64 • The Cyclops64 chief designer’s main conjecture: the Cyclops architecture “obeys” the sequential consistency model without the need for special synchronization instructions • Sequential consistency revisited: [A system is sequentially consistent if] “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79] • This implies: • Operations of all processors are executed in some sequential order (total order) • In that total order, operations from an individual processor are executed in program order (Lamport’s order)

  40. Theorem • C64 is sequentially consistent • Key to the proof: redefine (2) from the last slide: • Two operations destined for the same memory module M will be delivered to M’s FIFO queue in the same order in which they enter the network • That order is respected by the Cyclops64 network • What about the case of two memory modules?

  41. Observation • Problem with Lamport’s total order • No overlapping of memory ops’ lifetimes • Possible solution • Concentrate on the “performing” order instead of the “issuing” order • Then work backwards (respecting the ordering) to obtain a Lamport order

  42. C64 and SC • C64 is not SC in a classical sense! • Example: initially x = y = 0 • P1: X = 1 (1); Y = 1 (2) • P2: print Y (3); print X (4) • Classical SC outcomes: 1,2,3,4 → 1,1 • 1,3,2,4 → 0,1 • 1,3,4,2 → 0,1 • 3,4,1,2 → 0,0 • 3,1,2,4 → 0,1 • 3,1,4,2 → 0,1

  43. C64 and SC • Consider that: • Mx and My are the memory banks containing x and y respectively; it is safe to assume that Mx is not equal to My • Tcp(n): the time at which memory op n leaves the issuing FIFO for the network • Tcm(n): the time at which memory op n leaves the network and enters a memory buffer • Tmb(n): the time at which memory op n leaves the memory buffer • C(M): the latency to memory bank M • Wn: the number of operations in the memory buffer prior to the arrival of memory op n • Sn: the average servicing time of the memory unit for a memory access • Example: initially x = y = 0 • P1: X = 1 (1); Y = 1 (2) • P2: print Y (3); print X (4) • Tcp(1) < Tcp(2) • Tcm(1) = Tcp(1) + C(Mx) • Tcm(2) = Tcp(2) + C(My) • Tmb(1) = Tcp(1) + C(Mx) + Wx · Sx • Tmb(2) = Tcp(2) + C(My) + Wy · Sy • What if Tmb(1) is greater than Tmb(2)? Possible? YES! You broke SC!
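The inequality Tmb(1) > Tmb(2) is easy to exhibit with concrete numbers. The cycle counts below are hypothetical, chosen by us purely for illustration; the point is only that a congested bank Mx can hold op (1) in its memory buffer long after op (2) has left My's buffer, even though op (1) was issued first:

```python
def t_mb(t_cp, c_m, w, s):
    """Time an op leaves the memory buffer: FIFO departure time, plus the
    crossbar latency to bank M, plus the backlog already queued at the bank
    (W earlier ops, each taking S cycles on average to service)."""
    return t_cp + c_m + w * s

# Hypothetical cycle counts; Mx and My are different banks.
tmb1 = t_mb(t_cp=0, c_m=10, w=8, s=4)   # op (1): X = 1, bank Mx is congested
tmb2 = t_mb(t_cp=1, c_m=10, w=0, s=4)   # op (2): Y = 1, bank My is empty

print(tmb1, tmb2)   # 42 11: op (1) was issued first, yet performs last
```

With these numbers Tcp(1) < Tcp(2) holds, yet Tmb(1) = 42 > Tmb(2) = 11, which is exactly the scenario the slide flags as breaking classical SC.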

  44. However • This is called Sequential Consistency Compliant • A shared memory system S is SCC when any execution of a program P on S is equivalent to a sequentially consistent execution of P • A stronger mathematical proof is provided by “Lamport Order Revisit: A study on How to Efficiently Achieve Sequential Consistency on a Modern Multichip on a chip Architecture” by Yuan Zhang (2005) • It involves the creation of two sets of instructions, one based on the Lamport order and one based on the true C64 execution • A series of reordering steps that preserve “equivalence” is executed and a mapping is produced

  45. More on Cyclops64 • One processor and one memory bank share one of the 96 ports of the crossbar switch • Messages from both the processor and the memory bank are injected into the crossbar via the injection queue (α) • Messages to both the processor and the memory bank are dejected from the crossbar via the dejection queue (β) • The equal-flying-time property still holds • Still SCC • Further discussion is left for the paper

  46. One More Thing … • Scratch pad • A section of memory private to each processing unit • Non-coherent and therefore non-consistent • Its own read and write ports • High speed access • A separate address space as seen by the processor • Good or bad?

  47. Topic 4b An Intro to Cache Coherence Protocols

  48. Outline • Review of cache architecture / organization • The cache protocol • Bus-based snoopy cache protocol: MESI • Directory-based cache protocol • The DASH architecture

  49. Cache Coherency • The coherency problem • A processor should have exclusive access to a shared variable when writing, and should get the “most recent” value when reading • Solution • When writing: • (1) Invalidate all copies, or • (2) Broadcast the new copy to everyone • When reading: • Find the most recent copy • Can be tricky

  50. Write Update vs. Write Invalidate • [Diagram: three bus-based systems, each with processors P1, P2, P3, their caches, and shared memory, showing X, X′, and I (invalid) states in the caches] • X is a shared variable that has a copy in all caches; then a write occurs • For write invalidate, all the cached copies are marked “invalid” except the most recent one • For write update, all the cached copies are updated with the most recent value • Assume a write-through cache protocol
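The two policies can be sketched as a toy write-through simulation in Python. The function names and the dictionary representation of caches are our own illustrative assumptions; real protocols track per-line states (as MESI does) rather than deleting entries.

```python
def write(caches, memory, writer, value, policy):
    """Write-through: memory always gets the new value; the bus then either
    invalidates or updates the other caches' copies of X."""
    memory["X"] = value
    caches[writer]["X"] = value
    for i, cache in enumerate(caches):
        if i != writer and "X" in cache:
            if policy == "invalidate":
                del cache["X"]            # copy marked invalid
            else:                          # policy == "update"
                cache["X"] = value         # copy refreshed in place

def read(caches, memory, reader):
    cache = caches[reader]
    if "X" not in cache:                   # miss on an invalidated line:
        cache["X"] = memory["X"]           # fetch the most recent value
    return cache["X"]

# X starts shared by all three caches, as in the slide, then P3 writes 7.
memory = {"X": 5}
caches = [{"X": 5}, {"X": 5}, {"X": 5}]
write(caches, memory, writer=2, value=7, policy="invalidate")
print("X" in caches[0], read(caches, memory, reader=0))   # False 7
```

Under invalidate, P1's next read misses and refetches 7 from memory; under update, every cache would already hold 7 and no refetch is needed. This trade-off (extra misses vs. extra bus traffic) is the crux of choosing between the two policies.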