
Topic 5


freya


Presentation Transcript


  1. Topic 5: Synchronization and Costs for Shared Memory “... You will be assimilated. Resistance is futile.” Star Trek ELEG652-06F

  2. Synchronization • The orchestration of two or more threads (or processes) to complete a task correctly and avoid any data races • Data Race or Race Condition • “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write” • Atomicity and / or serializability

  3. Atomicity • Atomic → From the Greek “atomos”, which means indivisible • An “All or None” scheme • An instruction (or a group of them) will appear as if it was (they were) executed in a single try • All side effects of the instruction(s) in the block are seen in their totality or not at all • Side effects → Writes and (causal) reads to the variables inside the atomic block

  4. Atomicity • Word-aligned loads and stores are atomic in almost all architectures • Unaligned and bigger-than-word accesses are usually not atomic • What happens when non-atomic operations go wrong • The final result will be a garbled combination of values • Complete operations might be lost in the process • Strong versus weak atomicity
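
The "complete operations might be lost" point can be illustrated with a shared counter. The sketch below is only an illustration, assuming C11 atomics and POSIX threads are available; the names `run_atomic_counter_demo`, `DEMO_THREADS` and `DEMO_ITERS` are made up for this example. It shows the safe variant only: replacing `atomic_fetch_add` with a plain `counter++` on a non-atomic variable would be a data race, and updates could then be lost.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_THREADS 4
#define DEMO_ITERS   100000

/* Shared counter, updated only through atomic read-modify-write operations. */
static atomic_long safe_counter;

static void *add_atomically(void *arg) {
    (void)arg;
    for (int i = 0; i < DEMO_ITERS; i++)
        atomic_fetch_add(&safe_counter, 1);   /* indivisible increment */
    return NULL;
}

/* Runs DEMO_THREADS incrementing threads and returns the final count. */
long run_atomic_counter_demo(void) {
    pthread_t t[DEMO_THREADS];
    atomic_store(&safe_counter, 0);
    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_create(&t[i], NULL, add_atomically, NULL);
    for (int i = 0; i < DEMO_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&safe_counter);
}
```

Because every increment is atomic, the returned value equals `DEMO_THREADS * DEMO_ITERS` regardless of how the threads interleave.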

  5. Synchronization • Applied to shared variables • Synchronization might enforce ordering or not • High-level synchronization types • Semaphores • Mutex • Barriers • Critical Sections • Monitors • Condition Variables

  6. Semaphores • Intelligent counters of resources • Zero means not available • An abstract data type with two operations • P → probeer te verlagen: “try to decrease”. Waits (busy waits or sleeps) if the resource is not available. • V → verhoog: “increase”. Frees the resource • Binary vs. Blocking vs. Counting Semaphores • Binary: Initial value will allow threads to obtain it • Blocking: Initial value will block the threads • Counting: Initial value is not zero • Note: P and V are atomic operations!
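
A counting semaphore can be sketched with the POSIX `sem_t` API, where `sem_wait` plays the role of P and `sem_post` the role of V. This is a minimal sketch assuming a Linux-like system with unnamed POSIX semaphores; `SEM_SLOTS`, `SEM_THREADS` and `run_semaphore_demo` are hypothetical names. The demo records the largest number of threads that ever held a slot at the same time.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stddef.h>

#define SEM_SLOTS   2   /* initial semaphore value: number of resources */
#define SEM_THREADS 6

static sem_t slots;
static atomic_int in_use, max_in_use;

static void *use_resource(void *arg) {
    (void)arg;
    sem_wait(&slots);                               /* P: wait for a free slot */
    int now  = atomic_fetch_add(&in_use, 1) + 1;
    int seen = atomic_load(&max_in_use);
    /* Record the high-water mark of concurrent holders. */
    while (now > seen &&
           !atomic_compare_exchange_weak(&max_in_use, &seen, now))
        ;
    atomic_fetch_sub(&in_use, 1);
    sem_post(&slots);                               /* V: free the slot */
    return NULL;
}

/* Returns the maximum number of threads observed inside at once. */
int run_semaphore_demo(void) {
    pthread_t t[SEM_THREADS];
    atomic_store(&in_use, 0);
    atomic_store(&max_in_use, 0);
    sem_init(&slots, 0, SEM_SLOTS);
    for (int i = 0; i < SEM_THREADS; i++)
        pthread_create(&t[i], NULL, use_resource, NULL);
    for (int i = 0; i < SEM_THREADS; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&slots);
    return atomic_load(&max_in_use);
}
```

The returned high-water mark never exceeds `SEM_SLOTS`, since a thread only increments `in_use` after a successful P.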

  7. Mutex • Mutual Exclusion Lock • A binary semaphore to ensure that one thread (and only one) will access the resource • P → Lock the mutex • V → Unlock the mutex • It doesn’t enforce ordering • Fine vs. coarse grained

  8. Barriers • A high-level programming construct • Ensures that all participating threads wait at a program point for all other (participating) threads to arrive before they can continue • Types of Barriers • Tree Barriers (Software Assisted) • Centralized Barriers • Tournament Barriers • Fine-grained Barriers • Butterfly-style Barriers • Consistency Barriers (e.g. #pragma omp flush)
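
As an illustration of the centralized barrier listed above, here is a sketch of a sense-reversing barrier built on C11 atomics. It is one common way to build such a barrier, not necessarily the one the course has in mind; all names (`central_barrier_t`, `barrier_wait`, `run_barrier_demo`) are hypothetical. Waiters call `sched_yield()` while spinning so the demo also behaves on machines with few cores.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BAR_THREADS 4
#define BAR_PHASES  3

typedef struct {
    atomic_int  waiting;   /* threads that have arrived in this episode */
    atomic_bool sense;     /* flips every time the barrier opens */
    int         total;     /* number of participating threads */
} central_barrier_t;

static void barrier_wait(central_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_add(&b->waiting, 1) + 1 == b->total) {
        atomic_store(&b->waiting, 0);            /* reset for the next episode */
        atomic_store(&b->sense, *local_sense);   /* last arrival releases all */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            sched_yield();                       /* busy wait, politely */
    }
}

static central_barrier_t bar;
static atomic_int arrivals[BAR_PHASES];
static atomic_int violations;

static void *bar_worker(void *arg) {
    (void)arg;
    bool local_sense = false;
    for (int p = 0; p < BAR_PHASES; p++) {
        atomic_fetch_add(&arrivals[p], 1);
        barrier_wait(&bar, &local_sense);
        /* After the barrier, every thread must have arrived in phase p. */
        if (atomic_load(&arrivals[p]) != BAR_THREADS)
            atomic_fetch_add(&violations, 1);
    }
    return NULL;
}

/* Returns the number of times a thread passed the barrier too early. */
int run_barrier_demo(void) {
    pthread_t t[BAR_THREADS];
    bar.total = BAR_THREADS;
    atomic_store(&bar.waiting, 0);
    atomic_store(&bar.sense, false);
    atomic_store(&violations, 0);
    for (int p = 0; p < BAR_PHASES; p++)
        atomic_store(&arrivals[p], 0);
    for (int i = 0; i < BAR_THREADS; i++)
        pthread_create(&t[i], NULL, bar_worker, NULL);
    for (int i = 0; i < BAR_THREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&violations);
}
```

All threads hammer the same `waiting` and `sense` words, which is exactly the contention that tree, tournament and butterfly barriers try to spread out.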

  9. Critical Sections • A piece of code that is executed by one and only one thread at any point in time • If T1 finds the CS in use, it waits until the CS is free • Special Case: • Conditional Critical Sections: Threads wait on a “given” signal to resume execution. • Better implemented with lock-free techniques (e.g. Transactional Memory)

  10. Monitors and Condition Variables • A monitor consists of: • A set of procedures to work on shared variables • A set of shared variables • An invariant • A lock to protect from access by other threads • Condition Variables • The invariant in a monitor (but it can be used in other schemes) • It is a signal placeholder for other threads’ activities
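
The monitor ingredients listed above (shared variables, procedures, a lock, and a condition to wait on) can be sketched with a POSIX mutex and condition variable. This is a minimal sketch of a one-slot mailbox; `mailbox_t`, `mailbox_put`, `mailbox_take` and `run_monitor_demo` are hypothetical names. The invariant is that `value` is meaningful exactly when `full` is true.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Monitor state: shared variables plus the lock and condition guarding them. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;   /* signalled whenever `full` changes */
    int  value;
    bool full;
} mailbox_t;

static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                         0, false };
static int items_to_send;

/* Monitor procedure: blocks while the slot is occupied. */
static void mailbox_put(mailbox_t *m, int v) {
    pthread_mutex_lock(&m->lock);
    while (m->full)
        pthread_cond_wait(&m->changed, &m->lock);
    m->value = v;
    m->full  = true;
    pthread_cond_broadcast(&m->changed);
    pthread_mutex_unlock(&m->lock);
}

/* Monitor procedure: blocks while the slot is empty. */
static int mailbox_take(mailbox_t *m) {
    pthread_mutex_lock(&m->lock);
    while (!m->full)
        pthread_cond_wait(&m->changed, &m->lock);
    int v   = m->value;
    m->full = false;
    pthread_cond_broadcast(&m->changed);
    pthread_mutex_unlock(&m->lock);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= items_to_send; i++)
        mailbox_put(&box, i);
    return NULL;
}

/* Sends 1..n through the mailbox and returns the sum seen by the consumer. */
long run_monitor_demo(int n) {
    pthread_t p;
    long sum = 0;
    items_to_send = n;
    pthread_create(&p, NULL, producer, NULL);
    for (int i = 0; i < n; i++)
        sum += mailbox_take(&box);
    pthread_join(p, NULL);
    return sum;
}
```

The `while` loops around `pthread_cond_wait` re-check the condition after every wake-up, which guards against spurious wake-ups.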

  11. Much More … • However, all of these are abstractions • Major elements • A synchronization element that ensures atomicity • Locks! • A synchronization element that ensures ordering • Barriers! • Implementations and types • Common types of atomic primitives • Read – Modify – Write Back cycles • Synchronization overhead may break a system • Unnecessary consistency actions • Communication cost between threads • Why do distributed memory machines have “implicit” synchronization?

  12. Topic 5a: Locks

  13. Implementation • Atomic Primitives • Fetch and Φ operations • Read – Modify – Write Cycles • Test and Set • Fetch and Store • Exchange register and memory • Fetch and Add • Compare and Swap • Conditionally exchange the value of a memory location
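
For readers working in C, the primitives above map fairly directly onto C11 `<stdatomic.h>` operations. The sketch below is a single-threaded walk through their return values; `run_primitives_demo` is a hypothetical name, and the mapping to the slide's terms is given in the comments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Returns 1 if every primitive behaved as described in the comments. */
int run_primitives_demo(void) {
    atomic_flag flag = ATOMIC_FLAG_INIT;
    atomic_int  word;
    atomic_init(&word, 5);

    /* Test and Set: sets the flag, returns whether it was already set. */
    bool first  = atomic_flag_test_and_set(&flag);   /* false */
    bool second = atomic_flag_test_and_set(&flag);   /* true  */

    /* Fetch and Store: writes 9, returns the previous value 5. */
    int old = atomic_exchange(&word, 9);

    /* Fetch and Add: adds 1, returns the previous value 9. */
    int before = atomic_fetch_add(&word, 1);

    /* Compare and Swap: succeeds because word == expected (10). */
    int  expected = 10;
    bool swapped  = atomic_compare_exchange_strong(&word, &expected, 42);

    /* Compare and Swap: fails, and `expected` is updated to the current 42. */
    expected = 10;
    bool swapped_again = atomic_compare_exchange_strong(&word, &expected, 7);

    return !first && second && old == 5 && before == 9 &&
           swapped && !swapped_again && expected == 42 &&
           atomic_load(&word) == 42;
}
```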

  14. Implementation • Used by programmers to implement more complex synchronization constructs • Waiting behavior • Scheduler based: The process / thread is de-scheduled and will be scheduled again at a future time • Busy Wait: The process / thread polls on the resource until it is available • Dependent on the Hardware / OS / Scheduler behavior

  15. Types of (Software) Locks: The Spin Lock Family • The Simple Test and Set Lock • Polls a shared Boolean variable: A binary semaphore • Uses Fetch and Φ operations to operate on the binary semaphore • Expensive! • Wastes bandwidth • Generates extra bus transactions • The test-and-test-and-set approach • Poll with plain reads while the lock is in use, and attempt the test-and-set only when the lock appears free
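
A test-and-test-and-set lock can be sketched in C11 as follows. This is an illustrative sketch rather than a tuned implementation: the type and function names are hypothetical, the default sequentially consistent atomics are used for simplicity, and `sched_yield()` stands in for a bare spin so the demo stays well behaved when threads outnumber cores.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TTAS_THREADS 4
#define TTAS_ITERS   10000

typedef struct { atomic_bool held; } ttas_lock_t;

static void ttas_acquire(ttas_lock_t *l) {
    for (;;) {
        /* "Test": read-only polling, served from the local cache copy. */
        while (atomic_load(&l->held))
            sched_yield();
        /* "Test and set": one atomic exchange once the lock looks free. */
        if (!atomic_exchange(&l->held, true))
            return;
    }
}

static void ttas_release(ttas_lock_t *l) {
    atomic_store(&l->held, false);
}

static ttas_lock_t ttas_lock;
static long ttas_shared;   /* protected by ttas_lock */

static void *ttas_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < TTAS_ITERS; i++) {
        ttas_acquire(&ttas_lock);
        ttas_shared++;
        ttas_release(&ttas_lock);
    }
    return NULL;
}

/* Returns the counter value after all threads have finished. */
long run_ttas_demo(void) {
    pthread_t t[TTAS_THREADS];
    atomic_store(&ttas_lock.held, false);
    ttas_shared = 0;
    for (int i = 0; i < TTAS_THREADS; i++)
        pthread_create(&t[i], NULL, ttas_worker, NULL);
    for (int i = 0; i < TTAS_THREADS; i++)
        pthread_join(t[i], NULL);
    return ttas_shared;
}
```

The inner read-only loop is what saves bus traffic: only the exchange needs exclusive ownership of the cache line.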

  16. Types of (Software) Locks: The Spin Lock Family • Delay-based Locks • Spin locks in which a delay has been introduced in testing the lock • Constant delay • Exponential Back-off • Best Results • The test-and-test-and-set scheme is not needed

  17. Types of (Software) Locks: The Spin Lock Family • Pseudocode:

      typedef enum { UNLOCKED, LOCKED } lock_t;

      void acquire_lock(lock_t *L) {
          int delay = 1;
          /* test_and_set stores LOCKED and returns the previous value */
          while (test_and_set(L, LOCKED) == LOCKED) {
              sleep(delay);   /* back off before retrying */
              delay *= 2;     /* exponential back-off */
          }
      }

      void release_lock(lock_t *L) {
          *L = UNLOCKED;
      }

  18. Types of (Software) Locks: The Ticket Lock • Reduces the number of Fetch and Φ operations • Only one per lock acquisition • Strongly fair lock • No starvation • A FIFO service • Implementation: Two counters • A request counter and a release counter

  19. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 0, Release = 0] T1 acquires the lock

  20. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 1, Release = 0] T2 requests the lock

  21. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 2, Release = 0] T3 requests the lock

  22. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 3, Release = 1] T1 releases the lock; T2 gets the lock; T4 requests the lock

  23. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 4, Release = 1] T5 requests the lock

  24. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 5, Release = 1] T1 requests the lock

  25. Types of (Software) Locks: The Ticket Lock [Figure: threads T1 to T5; Request = 5, Release = 2] T2 releases the lock; T3 acquires the lock

  26. Types of (Software) Locks: The Ticket Lock • Reduces the number of Fetch and Φ operations • Only read operations on the release counter • However, still a lot of memory and network bandwidth wasted • Back-off techniques also used • Exponential Back-off • A bad idea • Constant Delay • Minimum time of holding a lock • Proportional Back-off • Dependent on how many are waiting for the lock

  27. Types of (Software) Locks: The Ticket Lock • Pseudocode:

      unsigned int next_ticket = 0;
      unsigned int now_serving = 0;

      void acquire_lock() {
          unsigned int my_ticket = fetch_and_increment(&next_ticket);
          while (true) {
              if (now_serving == my_ticket) return;
              sleep(my_ticket - now_serving);   /* proportional back-off */
          }
      }

      void release_lock() {
          now_serving = now_serving + 1;
      }
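
The ticket lock pseudocode can be turned into compilable C11 along these lines. This is a sketch under a few assumptions: the names are hypothetical, `atomic_fetch_add` stands in for `fetch_and_increment`, and the proportional back-off is replaced by a simple `sched_yield()` to keep the example short.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define TICKET_THREADS 4
#define TICKET_ITERS   10000

typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    /* The only fetch-and-phi operation per acquisition. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* Read-only polling on the release counter until it is our turn. */
    while (atomic_load(&l->now_serving) != my_ticket)
        sched_yield();
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);   /* admit the next ticket (FIFO) */
}

static ticket_lock_t ticket_lock;
static long ticket_shared;   /* protected by ticket_lock */

static void *ticket_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < TICKET_ITERS; i++) {
        ticket_acquire(&ticket_lock);
        ticket_shared++;
        ticket_release(&ticket_lock);
    }
    return NULL;
}

/* Returns the counter value after all threads have finished. */
long run_ticket_demo(void) {
    pthread_t t[TICKET_THREADS];
    atomic_store(&ticket_lock.next_ticket, 0);
    atomic_store(&ticket_lock.now_serving, 0);
    ticket_shared = 0;
    for (int i = 0; i < TICKET_THREADS; i++)
        pthread_create(&t[i], NULL, ticket_worker, NULL);
    for (int i = 0; i < TICKET_THREADS; i++)
        pthread_join(t[i], NULL);
    return ticket_shared;
}
```

Unsigned wrap-around of the two counters is harmless here because tickets are only ever compared for equality.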

  28. Types of (Software) Locks: The Array Based Queue Lock • Contention on the release counter • Cache coherence and memory traffic • Invalidation of the counter variable and requests to a single memory bank • Two elements • An array and a tail pointer that indexes the array • The array is as big as the number of processors • Fetch and store → Address of the array element • Fetch and increment → Tail pointer • FIFO ordering

  29. Types of (Software) Locks: The Array Based Queue Lock [Figure: threads T1 to T5; array slots: Enter, Wait, Wait, Wait, Wait; Tail pointer] The tail pointer points to the beginning of the array. All array elements except the first one are marked to wait.

  30. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Enter, Wait, Wait, Wait, Wait; Tail pointer] T1 gets the lock

  31. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Enter, Wait, Wait, Wait, Wait; Tail pointer] T2 requests

  32. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Enter, Wait, Wait, Wait, Wait; Tail pointer] T3 requests

  33. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Wait, Enter, Wait, Wait, Wait; Tail pointer] T1 releases; T2 gets the lock

  34. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Wait, Enter, Wait, Wait, Wait; Tail pointer] T4 requests

  35. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Wait, Enter, Wait, Wait, Wait; Tail pointer] T1 requests

  36. Types of (Software) Locks: The Array Based Queue Lock [Figure: array slots: Wait, Wait, Enter, Wait, Wait; Tail pointer] T2 releases; T3 gets the lock

  37. Types of (Software) Locks: The Queue Locks • It uses too much memory • Linear space (relative to the number of processors) per lock • Array • Easy to implement • Linked List: QNODE • Cache management
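
The array-based queue lock can be sketched in C11 as below. This is a simplified illustration with hypothetical names: each waiter takes a slot with a fetch-and-increment on the tail and spins only on its own slot. A production version would pad every slot to its own cache line; that is omitted here for brevity. `AQ_SLOTS` is assumed to be a power of two no smaller than the number of contending threads.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define AQ_SLOTS   8      /* power of two, >= number of contending threads */
#define AQ_THREADS 4
#define AQ_ITERS   10000

typedef struct {
    atomic_bool can_enter[AQ_SLOTS];   /* true = "Enter", false = "Wait" */
    atomic_uint tail;                  /* next free slot */
} array_lock_t;

static void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < AQ_SLOTS; i++)
        atomic_store(&l->can_enter[i], i == 0);   /* only slot 0 may enter */
    atomic_store(&l->tail, 0);
}

/* Returns the slot the caller owns; it must be passed to array_release. */
static unsigned array_acquire(array_lock_t *l) {
    unsigned slot = atomic_fetch_add(&l->tail, 1) % AQ_SLOTS;
    while (!atomic_load(&l->can_enter[slot]))
        sched_yield();                 /* spin on our own slot only */
    return slot;
}

static void array_release(array_lock_t *l, unsigned slot) {
    atomic_store(&l->can_enter[slot], false);                  /* re-arm slot */
    atomic_store(&l->can_enter[(slot + 1) % AQ_SLOTS], true);  /* hand off */
}

static array_lock_t aq_lock;
static long aq_shared;   /* protected by aq_lock */

static void *aq_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < AQ_ITERS; i++) {
        unsigned slot = array_acquire(&aq_lock);
        aq_shared++;
        array_release(&aq_lock, slot);
    }
    return NULL;
}

/* Returns the counter value after all threads have finished. */
long run_array_lock_demo(void) {
    pthread_t t[AQ_THREADS];
    array_lock_init(&aq_lock);
    aq_shared = 0;
    for (int i = 0; i < AQ_THREADS; i++)
        pthread_create(&t[i], NULL, aq_worker, NULL);
    for (int i = 0; i < AQ_THREADS; i++)
        pthread_join(t[i], NULL);
    return aq_shared;
}
```

The `can_enter` array is the linear per-lock space cost mentioned above.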

  38. Types of (Software) Locks: The MCS Lock • Characteristics • FIFO ordering • Spins on locally accessible flag variables • Small amount of space per lock • Works equally well on machines with and without coherent caches • Similar to the QNODE implementation of queue locks • QNODEs are assigned to local memory • Threads spin on local memory

  39. MCS: How does it work? • Each processor enqueues its own private lock variable into a queue and spins on it • Key: spin locally • CC model: spin in local cache • DSM model: spin in local private memory • No contention • On lock release, the releaser unlocks the next lock in the queue • Only have bus/network contention on actual unlock • No starvation (order of lock acquisitions defined by the list)

  40. MCS Lock • Requires atomic instructions: • compare-and-swap • fetch-and-store • If there is no compare-and-swap • an alternative release algorithm • extra complexity • loss of strict FIFO ordering • theoretical possibility of starvation • Details: Mellor-Crummey and Scott’s 1991 paper

  41. MCS: Example [Figure: queue nodes with Flag and Next fields (one with F = 1) and the lock's Tail pointer, shown for the states Init, Proc 1 gets, Proc 2 tries; CPU 1 to CPU 4] • CPU 1 holds the “real” lock • CPU 2, CPU 3 and CPU 4 spin on their flags • When CPU 1 releases, it releases the lock and changes the flag variable of the next CPU in the list
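
The MCS scheme described above (enqueue a private node with fetch-and-store, spin on your own flag, hand off on release with compare-and-swap as the fallback) can be sketched in C11 as follows. The names are hypothetical, default sequentially consistent atomics are used for simplicity, and `sched_yield()` replaces a bare spin; the paper's version is the authoritative one.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MCS_THREADS 4
#define MCS_ITERS   10000

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue */
    atomic_bool locked;                /* flag this thread spins on */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

static void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, (mcs_node_t *)NULL);
    atomic_store(&me->locked, true);
    mcs_node_t *prev = atomic_exchange(&l->tail, me);   /* fetch-and-store */
    if (prev != NULL) {
        atomic_store(&prev->next, me);                  /* link behind prev */
        while (atomic_load(&me->locked))
            sched_yield();                              /* spin locally */
    }
}

static void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No visible successor: try to swing tail back to empty. */
        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (mcs_node_t *)NULL))
            return;
        /* Someone is enqueueing; wait until they link themselves in. */
        while ((succ = atomic_load(&me->next)) == NULL)
            sched_yield();
    }
    atomic_store(&succ->locked, false);                 /* hand the lock over */
}

static mcs_lock_t mcs_lock;
static long mcs_shared;   /* protected by mcs_lock */

static void *mcs_worker(void *arg) {
    (void)arg;
    mcs_node_t node;      /* private queue node, one per thread */
    for (int i = 0; i < MCS_ITERS; i++) {
        mcs_acquire(&mcs_lock, &node);
        mcs_shared++;
        mcs_release(&mcs_lock, &node);
    }
    return NULL;
}

/* Returns the counter value after all threads have finished. */
long run_mcs_demo(void) {
    pthread_t t[MCS_THREADS];
    atomic_store(&mcs_lock.tail, (mcs_node_t *)NULL);
    mcs_shared = 0;
    for (int i = 0; i < MCS_THREADS; i++)
        pthread_create(&t[i], NULL, mcs_worker, NULL);
    for (int i = 0; i < MCS_THREADS; i++)
        pthread_join(t[i], NULL);
    return mcs_shared;
}
```

Space per lock is a single tail pointer; the queue nodes belong to the waiting threads.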

  42. Implementation: Modern Alternatives • Fetch and Φ operations • They are restrictive • Not all architectures support all of them • Problem: A single general atomic operation is hard! • Solution: Provide two primitives from which atomic operations can be built • Load Linked and Store Conditional • Remember the PowerPC lwarx and stwcx. instructions

  43. An Example: Swap • Exchange the contents of register R4 with the memory location pointed to by R1 • Not atomic!

      try:  mov  R3, R4
            ld   R2, 0(R1)
            st   R3, 0(R1)
            mov  R4, R2

  44. An Example: Atomic Swap • Swap (fetch and store) using ll and sc • If another processor writes to the location pointed to by R1 before the sc can complete, the reservation (usually kept in a register) is lost. The sc then fails and the code loops back and tries again.

      try:  mov   R3, R4
            ll    R2, 0(R1)
            sc    R3, 0(R1)
            beqz  R3, try
            mov   R4, R2
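
C has no direct ll/sc operations, but the same retry structure can be sketched with `atomic_compare_exchange_weak`, which, like sc, is allowed to fail and must be retried in a loop. On ll/sc architectures compilers typically lower this call to such an instruction pair, though that is an implementation detail and not something the C standard promises. The function name `swap_with_retry` is hypothetical.

```c
#include <stdatomic.h>

/* Atomically stores new_value into *loc and returns the previous contents,
   using a load plus a conditional store that is retried on failure. */
int swap_with_retry(atomic_int *loc, int new_value) {
    int old = atomic_load(loc);          /* plays the role of ll */
    /* Plays the role of sc: succeeds only if *loc still holds `old`.
       On failure `old` is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(loc, &old, new_value))
        ;
    return old;
}
```

In practice `atomic_exchange` does the same job in one call; the loop is spelled out here only to mirror the ll/sc sequence.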

  45. Another Example: Fetch and Increment and Spin Lock • Fetch and increment using ll-sc (the old value stays in R2; R3 holds the new value and then the sc result)

      try:  ll    R2, 0(R1)
            addi  R3, R2, #1
            sc    R3, 0(R1)
            beqz  R3, try

      • Spin lock using ll-sc • The exch instruction is equivalent to the atomic swap instruction block presented earlier • Assume that the lock is not cacheable • Note: 0 → Unlocked; 1 → Locked

              li    R2, #1
      lockit: exch  R2, 0(R1)
              bnez  R2, lockit

  46. Performance Penalty Example • Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won’t matter much!). Determine the performance penalty.

  47. Answer • It takes over 12,000 cycles in total for all processors to pass through the lock! • Note the contention on the lock and the serialization of the bus transactions. • See the example on p. 596, Hennessy and Patterson, 3rd Ed.

  48. Performance Penalty • Assume the same example as before (100 cycles per bus transaction, 10 processors) but consider the case of a queue lock which only updates on a miss (Hennessy and Patterson, p. 603)

  49. Performance Penalty • Answer: • First time: n + 1 • Subsequent accesses: 2(n – 1) • Total: 3n – 1 • 29 bus cycles or 2,900 clock cycles
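
The two totals quoted in the performance examples can be checked with a few lines of arithmetic. For the queue lock the formula 3n – 1 comes straight from the slide. For the plain spin lock, the sketch below assumes the usual textbook cost model, which is not spelled out on the slides: when i processors are contending, one hand-off costs i ll reads, i sc attempts and 1 release store, i.e. 2i + 1 bus transactions. The function names are hypothetical.

```c
/* Spin lock on ll/sc: sum of (2i + 1) for i = 1..n, which is n^2 + 2n.
   ASSUMPTION: 2i + 1 transactions per hand-off with i contenders. */
unsigned spinlock_bus_transactions(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 1; i <= n; i++)
        total += 2 * i + 1;
    return total;
}

/* Queue lock: (n + 1) for the first round plus 2(n - 1) afterwards = 3n - 1. */
unsigned queue_lock_bus_transactions(unsigned n) {
    return (n + 1) + 2 * (n - 1);
}
```

For n = 10 and 100 cycles per transaction this gives 120 transactions (about 12,000 cycles) for the spin lock against 29 transactions (2,900 cycles) for the queue lock, in line with the figures on the slides.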

  50. Implementing Locks Using Coherence • Spin on a cached copy, then exchange:

      lockit: ld    R2, 0(R1)
              bnez  R2, lockit
              li    R2, #1
              exch  R2, 0(R1)
              bnez  R2, lockit

      • The same lock using ll and sc:

      lockit: ll    R2, 0(R1)
              bnez  R2, lockit
              li    R2, #1
              sc    R2, 0(R1)
              beqz  R2, lockit
