Characterization of Commercial Workload Synchronization Behavior for Lock Elision Optimization Study

Lock Behaviour Characterization of Commercial Workloads Jichuan Chang <chang@cs.wisc.edu> Xidong Wang <wxd@cs.wisc.edu> CS757

Outline • Motivation • Methods • Results • Speculative Lock Elision Issues • Conclusions

Motivation • Understanding the Synchronization Behavior of Commercial Workloads (OLTP, Apache, SPECjbb) • Identifying Opportunities for Speculative Lock Elision (performance, ease of programming)

Questions to Answer [Figure: time axis divided into lock-free section, contention (spin/wait), and critical section] • Lock related statistics • Can hardware identify critical sections? • Critical section size • Lock-free section size • Amount of lock contention • Hardware optimizations by speculation • Context switching implications • Resource requirements • Other issues • Realistic timing model • Other synchronization (reader/writer, etc.)

Methods • Benchmarks • OLTP, Apache, JBB, Barnes (for comparison) • Full system simulation (tracing) using Simics • Simple timing model: Simics tracer • Ruby timing model: Simics + Ruby • Using instruction count (#instr), not cycle count (#cycle), as the measurement unit • Set cpu_switch_time to 1, disable STC • Validating our approach • Using micro-benchmarks to compare our stats with the results reported by kernel tools (lockstat) • Tracing into disassembled code (kernel/user)

Lock Identification • Basic idea [from SLE] • Lock acquisition must use one atomic instruction. • Silent store pair: taken as a pair, the stores in the lock acquisition and release operations are silent. • SPARC v9 atomic instructions • ldstub, swap, casa (compare-and-swap)
OLTP: ldstub [%o0 + %g0], %o4 (value 0x0->0xff) ; brnz,pn %o4, <0x10034b98> ; stbar ; … … ; stb %g0, [%o0 + 12] (value 0xff->0x0)
JBB: casa [%l2] 128,%g4,%g3 (value 0x1->0x8410f8bc) ; … … … ; casa [%l2] 128,%l0,%g4 (value 0x8410f8bc->0x1)

Lock Identification Algorithm • Start with an atomic instruction • that writes back a different value to the lock • otherwise the lock acquisition was unsuccessful • Examine each following store made by the same CPU • until we meet a normal store • that completes the silent store pair • usually with the value 0x0 • Other completion patterns • Self-release (by the same CPU) • using an atomic instruction, pair-silently (JBB) • using an atomic instruction, not pair-silently • Cross-release (by a different CPU) • using an atomic instruction • Removed: cases where the lock release can’t be observed (limited 16K window).
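
The identification steps above can be sketched over a simplified memory trace. This is only an illustration, not the tool used in the study: the `Access` record layout, the function name `find_critical_sections`, and the choice to measure the 16K window in trace records are all assumptions. It models the pair-silent releases (normal store or atomic by the same CPU) and the cross-release by an atomic on another CPU; the "not pair-silently" self-release is left out.

```python
from typing import Dict, List, NamedTuple, Tuple

WINDOW = 16 * 1024  # assumption: the 16K window is counted in trace records


class Access(NamedTuple):      # hypothetical trace record
    cpu: int
    kind: str                  # "atomic" or "store"
    addr: int
    old: int                   # memory value before the access
    new: int                   # memory value after the access


class CriticalSection(NamedTuple):
    cpu: int                   # CPU that acquired the lock
    addr: int                  # lock address
    start: int                 # trace index of the acquiring atomic
    end: int                   # trace index of the releasing access
    release: str               # "store", "atomic" or "cross"


def find_critical_sections(trace: List[Access]) -> List[CriticalSection]:
    pending: Dict[int, Tuple[int, int, int]] = {}  # addr -> (cpu, start, free value)
    found: List[CriticalSection] = []
    for i, a in enumerate(trace):
        held = pending.get(a.addr)
        if held is not None and i - held[1] > WINDOW:
            del pending[a.addr]        # release never observed: drop the candidate
            held = None
        if held is None:
            # An atomic that changes the value is a candidate acquisition;
            # an atomic that writes back the same value is a failed attempt.
            if a.kind == "atomic" and a.old != a.new:
                pending[a.addr] = (a.cpu, i, a.old)
            continue
        cpu, start, free_value = held
        if a.new != free_value:
            continue                   # does not restore the pre-acquire value
        if a.cpu == cpu:
            release = a.kind           # self-release completing the silent pair
        elif a.kind == "atomic":
            release = "cross"          # released by a different CPU
        else:
            continue
        found.append(CriticalSection(cpu, a.addr, start, i, release))
        del pending[a.addr]
    return found
```

Fed the OLTP pattern (ldstub writing 0x0->0xff, later a stb writing 0xff->0x0), the sketch reports one critical section released by a normal store; the JBB casa pair is reported as released by an atomic.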

Lock Frequency

Execution Phase Breakdown

Critical Section Size

Lock-free Section Size

Timing Models [Figure panels: Simple Timing, Ruby Timing] • Adding Ruby doesn’t change the size of critical sections and lock-free sections, but removes lock contention. • Why? • “Shrinking” caused by less frequent memory accesses within critical sections • or a simulation effect? • Guess: more shrinking using Ruby and Opal

Lock Contention [Figure labels: 46%, 236%, 70%] • Waiting: measured from the first try to the successful acquisition • Spinning: ignores events that have been waiting for more than 4K instructions.

Distinguishing “wait” and “spin” • Why bother? • Very few long-waiting events make a big difference in the percentage of wasted instructions • Easy if we can identify thread switching • But the identification is not easy • Treat as waiting if spinning lasts for too many instructions • Using 4096 instructions as the limit • 90+% of contention events are shorter than 4K instructions • It makes sense across different timing models.
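
The 4096-instruction cutoff can be illustrated with a small sketch. The function name `classify_contention` and the input format (one duration per contention event, counted in instructions from the first try to the successful acquisition) are assumptions made for illustration, and the sample durations in the usage note are invented.

```python
SPIN_LIMIT = 4096  # instructions; events longer than this count as waiting


def classify_contention(durations):
    """Split contention events into spinning and waiting by duration."""
    spins = [d for d in durations if d <= SPIN_LIMIT]
    waits = [d for d in durations if d > SPIN_LIMIT]
    return {
        "spin_events": len(spins),
        "wait_events": len(waits),
        "spin_instr": sum(spins),
        "wait_instr": sum(waits),
    }
```

For example, `classify_contention([100, 200, 300, 50000])` attributes 600 instructions to spinning and 50000 to waiting: a single long event outweighs all the short ones, which is why the two are separated.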

Lock Contention – Most Contended Locks

SLE on Commercial Workloads • Context switching (later) • Buffering requirement: not much • Small critical sections dominate • Except for Apache user locks (1-8K) • Single shared buffer among threads on the same CPU • Possible performance gain • Not big if only counting the number of instructions (1 - 6%) • Critical section size already small • Contention already infrequent • Can be larger if lock spinning latency increases • Can be smaller if less lock contention happens (as in the Ruby case) • Must throttle speculation (to avoid unnecessary rollbacks)

Context Switch • Why bother? • Needed to precisely quantify the number of instructions spent on lock waiting (process and thread switching) • Needed to correctly implement speculative lock elision (process switching only) • Process Switching Identification • Marker: Demap TLB on context switch • Apache (100 transactions, CPU #3) • Average: ~210K instructions (Max ~360K, Min ~160K) • Process switches are infrequent; the performance implication is negligible • Thread Switching Identification is hard • No simple patterns to observe, no feedback to validate assumptions • Not a good idea to provide a separate buffer for each thread on a single processor: hard to detect conflicts and thread switches, and many buffers are needed.
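
A minimal sketch of how the interval between process switches could be summarized once the TLB-demap markers have been located in a per-CPU trace. The function name `switch_intervals` and the input (instruction counts at which a marker was seen) are hypothetical; the slide reports only the resulting average, maximum, and minimum.

```python
from statistics import mean


def switch_intervals(marker_positions):
    """Summarize the gaps, in instructions, between consecutive demap markers."""
    gaps = [b - a for a, b in zip(marker_positions, marker_positions[1:])]
    if not gaps:
        return None  # fewer than two markers: no interval to measure
    return {"avg": mean(gaps), "max": max(gaps), "min": min(gaps)}
```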

Other Synchronization Algorithms • Hard to recognize complex synchronization • Barriers, reader/writer locks, etc. • Mutual exclusion implementations are composed of small critical sections • pthread_mutex_lock(&lock) acquires 3 locks • Reader/writer locks use locks to maintain data structures (reader/writer queues, number of current readers, etc.) [Figure: serialized execution (maintained by the synchronization algorithm) between writer_enter() and writer_exit(); HW only sees two small critical sections]
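
The point about reader/writer locks can be illustrated with a toy implementation. This is a sketch, not the library code that was traced: the class `RWLock`, its fields, and the `mutex_acquisitions` counter are assumptions added for illustration. The writer's long serialized region is enforced by bookkeeping (`_writer`, `_readers`), while the internal mutex is held only briefly inside `writer_enter()` and `writer_exit()`, so a mechanism that watches lock instructions sees just those two small critical sections.

```python
import threading


class RWLock:
    """Toy reader/writer lock built on one small internal mutex."""

    def __init__(self):
        self._cond = threading.Condition(threading.Lock())
        self._readers = 0
        self._writer = False
        self.mutex_acquisitions = 0   # illustrative counter of small critical sections

    def writer_enter(self):
        with self._cond:              # small critical section 1
            self.mutex_acquisitions += 1
            while self._writer or self._readers:
                self._cond.wait()     # releases the mutex while blocked
            self._writer = True

    def writer_exit(self):
        with self._cond:              # small critical section 2
            self.mutex_acquisitions += 1
            self._writer = False
            self._cond.notify_all()

    def reader_enter(self):
        with self._cond:
            self.mutex_acquisitions += 1
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def reader_exit(self):
        with self._cond:
            self.mutex_acquisitions += 1
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()
```

In an uncontended run, the code between `writer_enter()` and `writer_exit()` executes without holding the internal mutex at all, and the counter ends at 2.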

Conclusion • Commercial workloads lock characterization • Small critical sections dominate • Infrequent lock contention • User and kernel code behave differently • Kernel locks can’t be ignored • (Kernel) contended PCs are predictable • Performance Improvements • SLE won’t help as much

Thank You! Questions?

Backup Slides • Thread switching details • Critical section size using Ruby timing model • SPARC Atomic Instructions • Misc Issues • Acknowledgement

Thread Switch Identification • User thread scheduling • Disassemble the user thread library, observe execution of scheduling methods (_disp, _switch). Not always possible! • Kernel thread scheduling • Involves a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify the starting and ending points of a thread switch • Impossible to identify a kernel thread switch by only observing register window swaps, since they also happen in user thread switches • No feedback from the OS to validate our assumptions • Methodology & Preliminary Observations • Disassemble kernel code to build a VA → kernel method map. Observe the method control flow in the Simics trace. • resume may indicate a kernel thread switch • user_rtt may indicate a user-level thread switch. • Conclusion: Thread switch identification is a hard, unresolved issue

Critical Section Size (Ruby)

SPARC Atomic Instructions • ldstub • Loads a byte and writes all 1s (0xff) into it • swap • Swaps the value of the register and the memory location • Compare-and-swap (casa) • Swaps if (value in the 1st reg == value in mem) • membar/stbar • Usually follows such atomic instructions
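
The semantics of the three atomics can be modeled in a few lines. This is a simplified software model for illustration only: memory is a plain dictionary, the function names mirror the instructions but are not a real API, and atomicity, address space identifiers, and memory ordering (membar/stbar) are not modeled.

```python
def ldstub(mem, addr):
    """Load the old byte and store all 1s (0xff); returns the old value."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old


def swap(mem, addr, reg):
    """Exchange a register value with memory; returns the old memory value."""
    old = mem[addr]
    mem[addr] = reg
    return old


def cas(mem, addr, expected, new):
    """Store `new` only if memory equals `expected`; always returns the old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old
```

With a lock byte initialized to 0, the first `ldstub` returns 0 (acquired) and a second one returns 0xff (already held), which matches the 0x0->0xff transition shown for OLTP.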

Misc. • Why is Apache “strange”? • Locks are more frequent, few user locks (1-2%) • Large percentage of critical section instructions • Nested Locks • Intertwined Locks • Critical sections in Barnes are more clustered • Buffer size ≤ 2^9 * 30% * 1/3 ≈ 51, within 64 Blocks • The same as SLE

Acknowledgement • Project suggested by Prof. Mark Hill • Guiding and supporting • Lots of discussion with and help from • Min Xu, our TA • Carl Mauer, Multifacet simulator expert • Ravi Rajwar, SLE paper author
