
RH Lock: A Scalable Hierarchical Spin Lock


Presentation Transcript


  1. RH Lock: A Scalable Hierarchical Spin Lock
     2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska
     Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART]
     Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se

  2. Synchronization History
     • Busy-wait / backoff spin-locks:
       • test_and_set (TAS), e.g., IBM System/360, ’64
       • test_and_test_and_set (TATAS), Rudolph and Segall, ISCA ’84
       • TATAS with exponential backoff (TATAS_EXP), ’90 – ’91
     [Figure: bus-based SMP; processors P1…Pn spin through their caches on a lock word in shared memory that toggles between BUSY and FREE]
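For reference, here is a minimal C11 sketch of the spin-lock variants named on this slide (TAS, TATAS, TATAS_EXP). The type name, function names, and backoff constants are illustrative assumptions, not code from the talk:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;                    /* 0 = FREE, 1 = BUSY */

    static void tas_acquire(spinlock_t *L)            /* test_and_set (TAS) */
    {
        while (atomic_exchange(L, 1))
            ;                                         /* every retry is a global write */
    }

    static void tatas_exp_acquire(spinlock_t *L)      /* TATAS with exponential backoff */
    {
        unsigned delay = 1;
        for (;;) {
            while (atomic_load(L))                    /* TATAS: spin read-only in the local cache */
                ;
            if (!atomic_exchange(L, 1))               /* looked free: try the atomic swap */
                return;
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                     /* back off after a failed attempt */
            if (delay < 1024)
                delay *= 2;                           /* exponential growth, capped */
        }
    }

    static void spin_release(spinlock_t *L)
    {
        atomic_store(L, 0);                           /* BUSY -> FREE */
    }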

  3. Performance, 12 Years Ago: Traditional Microbenchmark
     (Thanks: Michael L. Scott)
     IF (more contention) THEN less efficient CS …

       for (i = 0; i < iterations; i++) {
           ACQUIRE(lock);
           // Null Critical Section (CS)
           RELEASE(lock);
       }

  4. Making it Scalable: Queues …
     • Spin on your predecessor’s flag
     • First-come, first-served order
     • Queue-based locks:
       • QOLB/QOSB, ’89
       • MCS, ’91
       • CLH, ’93
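The slide only names the queue-based locks, so here is a hedged C11 sketch of one of them, the CLH lock, to illustrate the spin-on-predecessor idea; the type, variable, and function names are illustrative, not taken from the original papers:

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct clh_node {
        _Atomic int locked;                   /* the successor spins on this flag */
    } clh_node;

    typedef struct {
        _Atomic(clh_node *) tail;             /* queue tail, swapped in on acquire */
    } clh_lock;

    static _Thread_local clh_node *my_node;   /* node this thread currently owns */
    static _Thread_local clh_node *my_pred;   /* node of this thread's predecessor */

    void clh_init(clh_lock *L)
    {
        clh_node *dummy = malloc(sizeof *dummy);
        atomic_store(&dummy->locked, 0);      /* an unlocked dummy node seeds the queue */
        atomic_store(&L->tail, dummy);
    }

    void clh_acquire(clh_lock *L)
    {
        if (!my_node)
            my_node = malloc(sizeof *my_node);
        atomic_store(&my_node->locked, 1);
        my_pred = atomic_exchange(&L->tail, my_node);   /* enqueue, remember predecessor */
        while (atomic_load(&my_pred->locked))           /* first-come, first-served spin */
            ;
    }

    void clh_release(clh_lock *L)
    {
        (void)L;
        atomic_store(&my_node->locked, 0);    /* hand the lock to the successor */
        my_node = my_pred;                    /* recycle the predecessor's node */
    }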

  5. Performance, May 2002: Traditional Microbenchmark
     • Sun Enterprise E6000 SMP
     [Figure: lock performance on up to 16 processors]

  6. Synchronization Today
     • Commercial applications use spin-locks (!)
       • usually TATAS & TATAS_EXP, with a timeout for
         • recovery from transaction deadlock
         • recovery from preemption of the lock holder
     • POSIX threads: pthread_mutex_lock / pthread_mutex_unlock
     • HPC: runtime systems, OpenMP, …
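For reference, the POSIX interface named above wraps the same acquire/release pattern; a minimal example (the worker function and counter are illustrative):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);               /* acquire */
        counter++;                            /* critical section */
        pthread_mutex_unlock(&m);             /* release */
        return NULL;
    }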

  7. Non-Uniform Memory Architecture (NUMA)
     • NUMA optimizations:
       • Page migration
       • Page replication
     [Figure: two nodes, each with processors and caches on local memory, connected by a switch; local vs. remote access latency ratio 1 : 2 – 10]

  8. Non-Uniform Communication Architecture (NUCA)
     [Figure: the same two-node, switch-connected machine, annotated with its NUCA ratio (remote vs. local communication latency of 2 – 10 : 1)]
     • NUCA examples (NUCA ratios):
       • 1992: Stanford DASH (~4.5)
       • 1996: Sequent NUMA-Q (~10)
       • 1999: Sun WildFire (~6)
       • 2000: Compaq DS-320 (~3.5)
       • Future: CMP, SMT (~10)
     • Our NUCA …

  9. Our NUCA: Sun WildFire
     • Two E6000s connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
     • 16 UltraSPARC II (250 MHz) CPUs per node
     • 8 GB memory
     • NUCA ratio 6

  10. Performance on Our NUCA
      [Figure: lock performance on the 2-node NUCA, 16 processors per node]

  11. Our Goals
      • Demonstrate that the first-come, first-served nature of queue-based locks is unwanted for NUCAs:
        • a new microbenchmark with “more realistic” behavior, and
        • a real-application study
      • Design a scalable spin lock that exploits NUCAs by:
        • creating controlled unfairness (a stable lock), and
        • reducing the traffic compared with test&set locks

  12. Outline
      • History & Background
      • NUMA vs. NUCA
      • Experimentation Environment
      • The RH Lock
      • Performance Results
      • Application Performance
      • Conclusions

  13. Key Ideas Behind the RH Lock
      • Minimize global traffic at lock handover:
        • only one thread per node will try to acquire a “remote” lock
      • Maximize the node locality of NUCAs:
        • hand the lock over to a neighbor in the same node
        • this creates locality for the critical-section (CS) data as well
        • especially good for large CSs and high contention
      • RH lock in a nutshell:
        • a double TATAS_EXP: one node-local lock + one “global” lock

  14. The RH Lock Algorithm
      Acquire:
        SWAP(my_TID, Lock)
        if (FREE or L_FREE): you’ve got it!
        else: TATAS(my_TID, Lock) until FREE or L_FREE
        if “REMOTE”: spin remotely, CAS(FREE, REMOTE) until FREE (with exponential backoff)
      Release:
        CAS(my_TID, FREE); if the CAS fails, store L_FREE
      IF (more contention) THEN more efficient CS …
      [Figure: two cabinets, each with its own lock word (Lock1, Lock2) in local memory; a word holds FREE, L_FREE, REMOTE, or a thread ID, and the lock is preferentially handed to a processor in the same cabinet]
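The following is a simplified C11 sketch of the acquire/release protocol above for a two-node machine. It is an interpretation of the slide’s pseudocode, not the authors’ implementation: the lock-word constants, the word[2] layout, and the my_tid/my_node variables are assumptions, and the exponential backoff is omitted.

    #include <stdatomic.h>
    #include <stdint.h>

    enum { FREE = 0, L_FREE = -1, REMOTE = -2 };  /* special lock-word values */

    typedef struct {
        /* One lock word per node, placed in that node's local memory.
         * Initially one word is FREE and the other is REMOTE. */
        _Atomic intptr_t word[2];
    } rh_lock;

    extern intptr_t my_tid;    /* this thread's id, > 0 (assumed provided elsewhere) */
    extern int      my_node;   /* this thread's node, 0 or 1 (assumed provided)      */

    void rh_acquire(rh_lock *L)
    {
        _Atomic intptr_t *local  = &L->word[my_node];
        _Atomic intptr_t *remote = &L->word[1 - my_node];
        intptr_t prev = atomic_exchange(local, my_tid);     /* SWAP(my_TID, Lock) */

        for (;;) {
            if (prev == FREE || prev == L_FREE)
                return;                                      /* "You've got it!" */
            if (prev == REMOTE) {
                /* The lock currently lives in the other node: spin remotely,
                 * trying to move it here by CASing that node's word from
                 * FREE to REMOTE (backoff omitted). */
                intptr_t expected;
                do {
                    expected = FREE;
                } while (!atomic_compare_exchange_weak(remote, &expected, REMOTE));
                return;
            }
            /* prev is another local thread's TID: spin locally (TATAS) until
             * our node's word becomes FREE or L_FREE, then swap in again. */
            do {
                prev = atomic_load(local);
            } while (prev != FREE && prev != L_FREE);
            prev = atomic_exchange(local, my_tid);
        }
    }

    void rh_release(rh_lock *L)
    {
        _Atomic intptr_t *local = &L->word[my_node];
        intptr_t expected = my_tid;
        /* No local waiter swapped in behind us: hand the lock back globally
         * (FREE). Otherwise mark it L_FREE so a neighbor in the same node
         * takes it next (the unfair, locality-creating path). */
        if (!atomic_compare_exchange_strong(local, &expected, FREE))
            atomic_store(local, L_FREE);
    }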

  15. Performance Results: Traditional Microbenchmark, 2-node Sun WildFire

  16. Controlling Unfairness …

        void rh_acquire_slowpath(rh_lock *L)
        {
            ...
            /* Roughly one acquire in FAIR_FACTOR asks for a fair release. */
            if ((random() % FAIR_FACTOR) == 0)
                be_fair = TRUE;
            else
                be_fair = FALSE;
            ...
        }

        void rh_release(rh_lock *L)
        {
            if (be_fair)
                *L = FREE;                              /* give other nodes a chance */
            else if (cas(L, my_tid, FREE) != my_tid)
                *L = L_FREE;                            /* hand over within this node */
        }

      [Figure: Cabinet 1’s node-local lock word moving between a thread ID, L_FREE, and FREE as the lock is handed over]

  17. Node Handoffs: Traditional Microbenchmark, 2-node Sun WildFire

  18. New Microbenchmark
      • More realistic node handoffs for queue-based locks
      • Constant number of processors
      • The amount of critical-section (CS) work can be increased:
        • we can control the “amount of contention”

        for (i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            // Critical Section (CS) work
            RELEASE(lock);
            // Non-CS work, STATIC part +
            // Non-CS work, RANDOM part
        }
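A compilable sketch of that loop; the delay() helper, the work parameters, and the shared counter are placeholders for whatever CS and non-CS work the real benchmark performs, and ACQUIRE/RELEASE stand for any of the locks under study:

    #include <stdlib.h>

    extern void ACQUIRE(void *lock);
    extern void RELEASE(void *lock);
    extern volatile long shared_counter;          /* stands in for the shared CS data */

    static void delay(int units)                  /* placeholder "work" loop */
    {
        for (volatile int i = 0; i < units * 1000; i++)
            ;
    }

    void benchmark(void *lock, int iterations, int cs_work, int noncs_static)
    {
        for (int i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            shared_counter++;                     /* Critical Section (CS) work */
            delay(cs_work);
            RELEASE(lock);
            delay(noncs_static);                  /* non-CS work, STATIC part   */
            delay(rand() % (noncs_static + 1));   /* non-CS work, RANDOM part   */
        }
    }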

  19. Performance Results: New Microbenchmark, 2-node Sun WildFire, 28 CPUs
      [Figure: performance with 14 CPUs per WildFire node]

  20. Application Performance (1): Methodology
      • The SPLASH-2 programs (14 apps)
        • We study only the applications with more than 10,000 acquire/release operations: Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
      • Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
      • 2-node Sun WildFire

  21. Application Performance (2): Raytrace Speedup
      [Figure: Raytrace speedup on the 2-node Sun WildFire vs. number of processors (0 – 28) for TATAS, TATAS_EXP, MCS, CLH, and RH]

  22. Single-Processor Results: Traditional Microbenchmark, Null CS

        for (i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            RELEASE(lock);
        }

  23. Performance Results: Traditional Microbenchmark, Single-node E6000
      • Bind all threads to only one of the E6000 nodes
      • As expected, the RH lock performs on par with TATAS_EXP

  24. Conclusions
      • First-come, first-served is not desirable for NUCAs
      • The RH lock exploits NUCAs by:
        • creating locality through controlled unfairness (a stable lock)
        • reducing traffic compared with test&set locks
      • It is the only lock studied that performs better under contention:
        • a critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
        • Raytrace on 30 CPUs: 1.83 – 5.70 times “better”
      • Works best for NUCAs with a few large “nodes”

  25. UART’s Home Page: http://www.it.uu.se/research/group/uart
