
RH Lock: A Scalable Hierarchical Spin Lock


Presentation Transcript


  1. RH Lock: A Scalable Hierarchical Spin Lock
     2nd Annual Workshop on Memory Performance Issues (WMPI 2002), May 25, 2002, Anchorage, Alaska
     Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART]
     Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se

  2. Synchronization History
     • Busy-wait / backoff spin-locks:
       • test_and_set (TAS), e.g., IBM System/360, ’64
       • test_and_test_and_set (TATAS), Rudolph and Segall, ISCA ’84
       • TATAS with exponential backoff (TATAS_EXP), ’90 – ’91
     [Figure: bus-based SMP; processors P1…Pn spin through their caches on a lock word in shared memory that toggles between BUSY and FREE]
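For reference, here is a minimal C11 sketch of the spin-lock variants named on this slide (TAS, TATAS, TATAS_EXP). The type name, function names, and backoff constants are illustrative assumptions, not code from the talk:

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;                    /* 0 = FREE, 1 = BUSY */

    static void tas_acquire(spinlock_t *L)            /* test_and_set (TAS) */
    {
        while (atomic_exchange(L, 1))
            ;                                         /* every retry is a global write */
    }

    static void tatas_exp_acquire(spinlock_t *L)      /* TATAS with exponential backoff */
    {
        unsigned delay = 1;
        for (;;) {
            while (atomic_load(L))                    /* TATAS: spin read-only in the local cache */
                ;
            if (!atomic_exchange(L, 1))               /* looked free: try the atomic swap */
                return;
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                     /* back off after a failed attempt */
            if (delay < 1024)
                delay *= 2;                           /* exponential growth, capped */
        }
    }

    static void spin_release(spinlock_t *L)
    {
        atomic_store(L, 0);                           /* BUSY -> FREE */
    }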

  3. Performance, 12 Years Ago: Traditional Microbenchmark
     (Thanks: Michael L. Scott)
     IF (more contention) THEN less efficient CS …

       for (i = 0; i < iterations; i++) {
           ACQUIRE(lock);
           // Null Critical Section (CS)
           RELEASE(lock);
       }

  4. Making it Scalable: Queues …
     • Spin on your predecessor’s flag
     • First-come, first-served order
     • Queue-based locks:
       • QOLB/QOSB, ’89
       • MCS, ’91
       • CLH, ’93
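The slide only names the queue-based locks, so here is a hedged C11 sketch of one of them, the CLH lock, to illustrate the spin-on-predecessor idea; the type, variable, and function names are illustrative, not taken from the original papers:

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct clh_node {
        _Atomic int locked;                   /* the successor spins on this flag */
    } clh_node;

    typedef struct {
        _Atomic(clh_node *) tail;             /* queue tail, swapped in on acquire */
    } clh_lock;

    static _Thread_local clh_node *my_node;   /* node this thread currently owns */
    static _Thread_local clh_node *my_pred;   /* node of this thread's predecessor */

    void clh_init(clh_lock *L)
    {
        clh_node *dummy = malloc(sizeof *dummy);
        atomic_store(&dummy->locked, 0);      /* an unlocked dummy node seeds the queue */
        atomic_store(&L->tail, dummy);
    }

    void clh_acquire(clh_lock *L)
    {
        if (!my_node)
            my_node = malloc(sizeof *my_node);
        atomic_store(&my_node->locked, 1);
        my_pred = atomic_exchange(&L->tail, my_node);   /* enqueue, remember predecessor */
        while (atomic_load(&my_pred->locked))           /* first-come, first-served spin */
            ;
    }

    void clh_release(clh_lock *L)
    {
        (void)L;
        atomic_store(&my_node->locked, 0);    /* hand the lock to the successor */
        my_node = my_pred;                    /* recycle the predecessor's node */
    }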

  5. Performance, May 2002: Traditional Microbenchmark
     • Sun Enterprise E6000 SMP
     [Figure: lock performance on up to 16 processors]

  6. Synchronization Today
     • Commercial applications use spin-locks (!)
       • usually TATAS & TATAS_EXP, with a timeout for
         • recovery from transaction deadlock
         • recovery from preemption of the lock holder
     • POSIX threads: pthread_mutex_lock / pthread_mutex_unlock
     • HPC: runtime systems, OpenMP, …
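For reference, the POSIX interface named above wraps the same acquire/release pattern; a minimal example (the worker function and counter are illustrative):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);               /* acquire */
        counter++;                            /* critical section */
        pthread_mutex_unlock(&m);             /* release */
        return NULL;
    }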

  7. Non-Uniform Memory Architecture (NUMA)
     • NUMA optimizations:
       • Page migration
       • Page replication
     [Figure: two nodes, each with processors and caches on local memory, connected by a switch; local vs. remote access latency ratio 1 : 2 – 10]

  8. Non-Uniform Communication Architecture (NUCA)
     [Figure: the same two-node, switch-connected machine, annotated with its NUCA ratio (remote vs. local communication latency of 2 – 10 : 1)]
     • NUCA examples (NUCA ratios):
       • 1992: Stanford DASH (~4.5)
       • 1996: Sequent NUMA-Q (~10)
       • 1999: Sun WildFire (~6)
       • 2000: Compaq DS-320 (~3.5)
       • Future: CMP, SMT (~10)
     • Our NUCA …

  9. Our NUCA: Sun WildFire
     • Two E6000s connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction
     • 16 UltraSPARC II (250 MHz) CPUs per node
     • 8 GB memory
     • NUCA ratio 6

  10. Performance on Our NUCA
      [Figure: lock performance on the 2-node NUCA, 16 processors per node]

  11. Our Goals
      • Demonstrate that the first-come, first-served nature of queue-based locks is unwanted for NUCAs:
        • a new microbenchmark with “more realistic” behavior, and
        • a real-application study
      • Design a scalable spin lock that exploits NUCAs by:
        • creating controlled unfairness (a stable lock), and
        • reducing the traffic compared with test&set locks

  12. Outline
      • History & Background
      • NUMA vs. NUCA
      • Experimentation Environment
      • The RH Lock
      • Performance Results
      • Application Performance
      • Conclusions

  13. Key Ideas Behind the RH Lock
      • Minimize global traffic at lock handover:
        • only one thread per node will try to acquire a “remote” lock
      • Maximize the node locality of NUCAs:
        • hand the lock over to a neighbor in the same node
        • this creates locality for the critical-section (CS) data as well
        • especially good for large CSs and high contention
      • RH lock in a nutshell:
        • a double TATAS_EXP: one node-local lock + one “global” lock

  14. The RH Lock Algorithm
      Acquire:
        SWAP(my_TID, Lock)
        if (FREE or L_FREE): you’ve got it!
        else: TATAS(my_TID, Lock) until FREE or L_FREE
        if “REMOTE”: spin remotely, CAS(FREE, REMOTE) until FREE (with exponential backoff)
      Release:
        CAS(my_TID, FREE); if the CAS fails, store L_FREE
      IF (more contention) THEN more efficient CS …
      [Figure: two cabinets, each with its own lock word (Lock1, Lock2) in local memory; a word holds FREE, L_FREE, REMOTE, or a thread ID, and the lock is preferentially handed to a processor in the same cabinet]
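The following is a simplified C11 sketch of the acquire/release protocol above for a two-node machine. It is an interpretation of the slide’s pseudocode, not the authors’ implementation: the lock-word constants, the word[2] layout, and the my_tid/my_node variables are assumptions, and the exponential backoff is omitted.

    #include <stdatomic.h>
    #include <stdint.h>

    enum { FREE = 0, L_FREE = -1, REMOTE = -2 };  /* special lock-word values */

    typedef struct {
        /* One lock word per node, placed in that node's local memory.
         * Initially one word is FREE and the other is REMOTE. */
        _Atomic intptr_t word[2];
    } rh_lock;

    extern intptr_t my_tid;    /* this thread's id, > 0 (assumed provided elsewhere) */
    extern int      my_node;   /* this thread's node, 0 or 1 (assumed provided)      */

    void rh_acquire(rh_lock *L)
    {
        _Atomic intptr_t *local  = &L->word[my_node];
        _Atomic intptr_t *remote = &L->word[1 - my_node];
        intptr_t prev = atomic_exchange(local, my_tid);     /* SWAP(my_TID, Lock) */

        for (;;) {
            if (prev == FREE || prev == L_FREE)
                return;                                      /* "You've got it!" */
            if (prev == REMOTE) {
                /* The lock currently lives in the other node: spin remotely,
                 * trying to move it here by CASing that node's word from
                 * FREE to REMOTE (backoff omitted). */
                intptr_t expected;
                do {
                    expected = FREE;
                } while (!atomic_compare_exchange_weak(remote, &expected, REMOTE));
                return;
            }
            /* prev is another local thread's TID: spin locally (TATAS) until
             * our node's word becomes FREE or L_FREE, then swap in again. */
            do {
                prev = atomic_load(local);
            } while (prev != FREE && prev != L_FREE);
            prev = atomic_exchange(local, my_tid);
        }
    }

    void rh_release(rh_lock *L)
    {
        _Atomic intptr_t *local = &L->word[my_node];
        intptr_t expected = my_tid;
        /* No local waiter swapped in behind us: hand the lock back globally
         * (FREE). Otherwise mark it L_FREE so a neighbor in the same node
         * takes it next (the unfair, locality-creating path). */
        if (!atomic_compare_exchange_strong(local, &expected, FREE))
            atomic_store(local, L_FREE);
    }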

  15. Performance Results: Traditional Microbenchmark, 2-node Sun WildFire

  16. Controlling Unfairness …

        void rh_acquire_slowpath(rh_lock *L)
        {
            ...
            /* Roughly one acquire in FAIR_FACTOR asks for a fair release. */
            if ((random() % FAIR_FACTOR) == 0)
                be_fair = TRUE;
            else
                be_fair = FALSE;
            ...
        }

        void rh_release(rh_lock *L)
        {
            if (be_fair)
                *L = FREE;                              /* give other nodes a chance */
            else if (cas(L, my_tid, FREE) != my_tid)
                *L = L_FREE;                            /* hand over within this node */
        }

      [Figure: Cabinet 1’s node-local lock word moving between a thread ID, L_FREE, and FREE as the lock is handed over]

  17. Node Handoffs: Traditional Microbenchmark, 2-node Sun WildFire

  18. New Microbenchmark
      • More realistic node handoffs for queue-based locks
      • Constant number of processors
      • The amount of critical-section (CS) work can be increased:
        • we can control the “amount of contention”

        for (i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            // Critical Section (CS) work
            RELEASE(lock);
            // Non-CS work, STATIC part +
            // Non-CS work, RANDOM part
        }
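A compilable sketch of that loop; the delay() helper, the work parameters, and the shared counter are placeholders for whatever CS and non-CS work the real benchmark performs, and ACQUIRE/RELEASE stand for any of the locks under study:

    #include <stdlib.h>

    extern void ACQUIRE(void *lock);
    extern void RELEASE(void *lock);
    extern volatile long shared_counter;          /* stands in for the shared CS data */

    static void delay(int units)                  /* placeholder "work" loop */
    {
        for (volatile int i = 0; i < units * 1000; i++)
            ;
    }

    void benchmark(void *lock, int iterations, int cs_work, int noncs_static)
    {
        for (int i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            shared_counter++;                     /* Critical Section (CS) work */
            delay(cs_work);
            RELEASE(lock);
            delay(noncs_static);                  /* non-CS work, STATIC part   */
            delay(rand() % (noncs_static + 1));   /* non-CS work, RANDOM part   */
        }
    }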

  19. Performance Results: New Microbenchmark, 2-node Sun WildFire, 28 CPUs
      [Figure: performance with 14 CPUs per WildFire node]

  20. Application Performance (1): Methodology
      • The SPLASH-2 programs (14 apps)
        • We study only the applications with more than 10,000 acquire/release operations: Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq
      • Synchronization algorithms: TATAS, TATAS_EXP, MCS, CLH, and RH
      • 2-node Sun WildFire

  21. Application Performance (2): Raytrace Speedup
      [Figure: Raytrace speedup on the 2-node Sun WildFire vs. number of processors (0 – 28) for TATAS, TATAS_EXP, MCS, CLH, and RH]

  22. Single-Processor Results: Traditional Microbenchmark, Null CS

        for (i = 0; i < iterations; i++) {
            ACQUIRE(lock);
            RELEASE(lock);
        }

  23. Performance Results: Traditional Microbenchmark, Single-node E6000
      • Bind all threads to only one of the E6000 nodes
      • As expected, the RH lock performs on par with TATAS_EXP

  24. Conclusions
      • First-come, first-served is not desirable for NUCAs
      • The RH lock exploits NUCAs by:
        • creating locality through controlled unfairness (a stable lock)
        • reducing traffic compared with test&set locks
      • It is the only lock studied that performs better under contention:
        • a critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock
        • Raytrace on 30 CPUs: 1.83 – 5.70 times “better”
      • Works best for NUCAs with a few large “nodes”

  25. UART’s Home Page: http://www.it.uu.se/research/group/uart
