Hoard: A Scalable Memory Allocator for Multithreaded Applications

Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson

Presented by Dimitris Prountzos

(Some slides adapted from Emery Berger’s presentation)

Outline
  • Motivation
  • Problems in allocator design
    • False sharing
    • Fragmentation
  • Existing approaches
  • Hoard design
  • Experimental evaluation
Motivation
  • Parallel multithreaded programs prevalent
    • Web servers, search engines, database managers, etc.
    • Run on CMPs/SMPs for high performance
    • Some of them are embarrassingly parallel
  • Memory allocation is a bottleneck
    • Prevents scaling with number of processors
Desired allocator attributes on a multiprocessor system
  • Speed
    • Competitive with uniprocessor allocators on 1 CPU
  • Scalability
    • Performance linear with the number of processors
  • Fragmentation (= max allocated / max in use)
    • High fragmentation ⇒ poor data locality ⇒ paging
  • False sharing avoidance
The problem of false sharing

  • Programs cause false sharing
    • Allocate several objects within one cache line, then pass the objects to different threads
  • Allocators cause false sharing!
    • Actively: malloc satisfies requests from different threads out of the same cache line
    • Passively: free allows a future malloc to produce false sharing

[Diagram: x1 = malloc(s) on processor 1 and x2 = malloc(s) on processor 2 land on the same cache line, so the two processors thrash as the line ping-pongs between their caches.]
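
To make the picture concrete, here is a minimal C sketch of this scenario, assuming POSIX threads (the iteration count is arbitrary). If the allocator carves x1 and x2 out of one cache line, the two threads invalidate that line on every write even though they share no data:

```c
#include <pthread.h>
#include <stdlib.h>

enum { ITERS = 100000000 };

static void *bump(void *arg) {
    volatile int *counter = arg;
    for (int i = 0; i < ITERS; i++)
        ++*counter;                 /* each write invalidates the line */
    return NULL;
}

int main(void) {
    /* Two small objects; a false-sharing-prone allocator may place
       both within the same cache line. */
    int *x1 = malloc(sizeof *x1);
    int *x2 = malloc(sizeof *x2);
    *x1 = *x2 = 0;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, x1);
    pthread_create(&t2, NULL, bump, x2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(x1);
    free(x2);
    return 0;
}
```

Build with `cc -pthread`; on an allocator that keeps the two objects on separate cache lines, the same program runs without coherence traffic.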

The problem of fragmentation
  • Blowup:
    • Increase in memory consumption that occurs when the allocator reclaims memory freed by the program but fails to use it for future requests
    • Mainly a problem of concurrent allocators
    • Unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator

  • Pure private heaps:
    • One heap per processor
    • malloc gets memory from the processor's heap or the system
    • free puts memory on the processor's heap
  • Avoids heap contention
    • Examples: STL, Cilk

[Diagram: processor 1 runs x1 = malloc(s), free(x1), x4 = malloc(s); processor 2 runs x2 = malloc(s), free(x2), x3 = malloc(s); the legend marks blocks allocated by heap 1 vs. free on heap 2.]

How to Break Pure Private Heaps: Fragmentation
  • Pure private heaps:
    • memory consumption can grow without bound!
  • Producer-consumer:
    • processor 1 allocates
    • processor 2 frees
    • Memory always unavailable to producer

[Diagram: processor 1 allocates x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), ... while processor 2 frees each block (free(x1), free(x2), free(x3), ...); every freed block lands on heap 2, where the producer can never reuse it.]
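
A minimal runnable sketch of this pattern, assuming POSIX threads (the one-slot handoff channel and the round count are illustrative). Under a pure-private-heaps allocator, every free() below parks the block on the consumer's heap, so the producer keeps growing its own heap from the system:

```c
#include <pthread.h>
#include <stdlib.h>

enum { ROUNDS = 1000000 };

/* One-slot handoff channel between the two threads. */
static void *slot;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    for (int i = 0; i < ROUNDS; i++) {
        void *p = malloc(64);               /* drawn from heap 1 */
        pthread_mutex_lock(&m);
        while (slot != NULL)
            pthread_cond_wait(&cv, &m);
        slot = p;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&m);
        while (slot == NULL)
            pthread_cond_wait(&cv, &m);
        void *p = slot;
        slot = NULL;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        free(p);                            /* parked on heap 2 */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```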

Example II: Private Heaps with Ownership
  • free puts memory back on the originating processor's heap.
  • Avoids unbounded memory consumption
    • Examples: ptmalloc, LKmalloc

[Diagram: x1 = malloc(s) on processor 1 and x2 = malloc(s) on processor 2; each block is later freed, and ownership returns it to the heap of the processor that allocated it.]

How to Break Private Heaps with Ownership: Fragmentation
  • Memory consumption can blow up by a factor of P.
  • Round-robin producer-consumer:
    • processor i allocates
    • processor i+1 frees
  • The program needs only 1 block (K in general), but the allocator ends up holding 3 (P·K) blocks

[Diagram: processor 1 runs x1 = malloc(s); processor 2 frees x1 and runs x2 = malloc(s); processor 3 frees x2 and runs x3 = malloc(s); processor 1 frees x3. Ownership returns each freed block to its originating heap, so all three heaps end up holding an idle block.]
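
A toy sequential model of this scenario (not real allocator code; P, K, and the round count are arbitrary). It tracks per-heap free lists under the ownership rule and shows the allocator settling at P·K blocks for a program that only ever needs K live blocks:

```c
#include <stdio.h>

/* Toy model of private heaps WITH ownership: a freed block returns
   to the heap that allocated it, but heaps never share memory. */
enum { P = 3, K = 1, ROUNDS = 12 };

int main(void) {
    int cached[P] = {0};  /* idle blocks parked on each owner's heap */
    int from_os = 0;      /* total blocks ever obtained from the OS  */

    for (int r = 0; r < ROUNDS; r++) {
        int p = r % P;    /* processor p allocates K blocks...       */
        for (int b = 0; b < K; b++) {
            if (cached[p] > 0)
                cached[p]--;   /* reuse a block already on heap p    */
            else
                from_os++;     /* heap p is empty: grow from the OS  */
        }
        cached[p] += K;   /* ...and processor p+1 frees them; with
                             ownership they return to heap p         */
    }
    printf("live blocks needed: %d, blocks held by allocator: %d (P*K)\n",
           K, from_os);
    return 0;
}
```

Running it reports 1 live block needed against P·K = 3 blocks held; further rounds reuse the cached blocks, so consumption is bounded, but P-fold.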

Uniprocessor Allocators on Multiprocessors
  • Fragmentation: Excellent
    • Very low for most programs [Wilson & Johnstone]
  • Speed & Scalability: Poor
    • Heap contention
      • A single lock protects the heap
  • Can exacerbate false sharing
    • Different processors can share cache lines
Existing Multiprocessor Allocators
  • Speed:
    • One concurrent heap (e.g., concurrent B-tree):
      • O(log (#size-classes)) cost per memory operation
      • too many locks/atomic updates

⇒ Fast allocators use multiple heaps

  • Scalability:
    • Allocator-induced false sharing
    • Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
  • Fragmentation:
    • P-fold increase or even unbounded
Hoard Overview
  • P per-processor heaps & 1 global heap
  • Each thread accesses only its local heap & the global heap
  • Manages memory in page-sized superblocks of same-sized objects (LIFO free-list)
    • Avoids false sharing by not carving up cache lines
    • Avoids heap contention – local heaps allocate & free small blocks from their superblocks
  • Avoids blowup by
    • Moving superblocks to global heap when fraction of free memory exceeds some threshold
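
A C sketch of the data structures this design implies (all type and field names here are illustrative guesses, not Hoard's actual source):

```c
#include <pthread.h>
#include <stddef.h>

enum { NUM_SIZE_CLASSES = 36, P = 14 };   /* illustrative values */

/* A superblock holds same-sized blocks carved from S bytes. */
typedef struct superblock {
    struct superblock *next;       /* link in the owning heap's list    */
    struct heap       *owner;      /* heap 0 is the global heap         */
    size_t             size_class; /* block size this superblock serves */
    size_t             u;          /* bytes in use in this superblock   */
    void              *free_list;  /* LIFO list of freed blocks         */
    /* ...followed by the blocks themselves...                          */
} superblock_t;

typedef struct heap {
    pthread_mutex_t lock;          /* per-heap lock, not one global one */
    size_t u;                      /* memory in use on this heap  (ui)  */
    size_t a;                      /* memory held by this heap    (ai)  */
    superblock_t *sb[NUM_SIZE_CLASSES]; /* lists kept from most full to least */
} heap_t;

static heap_t global_heap;         /* heap 0 */
static heap_t heaps[P];            /* one heap per processor            */
```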
Superblock management

Emptiness threshold: (ui ≥ (1−f)·ai) ∨ (ui ≥ ai − K·S)

f = ¼

K = 0

  • Multiple heaps ⇒ avoids actively induced false sharing
  • Block coalescing ⇒ avoids passively induced false sharing
  • Superblocks transferred are usually empty and transfer is infrequent
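
As a C sketch, the donation trigger might look like the helper below (the constants follow the talk's f = ¼, K = 0, S = 8K; the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Constants from the talk: f = 1/4 (empty fraction), K = 0, S = 8K. */
enum { F_INVERSE = 4, K_SB = 0, S_BYTES = 8192 };

/* A heap has crossed the emptiness threshold -- and should donate a
   mostly-empty superblock to the global heap -- when BOTH
   u_i < (1-f)*a_i and u_i < a_i - K*S hold (the negation of the
   invariant above). */
static bool crossed_emptiness_threshold(size_t u_i, size_t a_i) {
    return u_i < a_i - a_i / F_INVERSE          /* u_i < (1-f)*a_i */
        && u_i + (size_t)K_SB * S_BYTES < a_i;  /* u_i < a_i - K*S */
}
```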
Hoard pseudo-code

malloc(sz)

  1. If sz > S/2, allocate the superblock from the OS and return it.
  2. i ← hash(current thread)
  3. Lock heap i
  4. Scan heap i's list of superblocks from most full to least (for the size class of sz)
  5. If there is no superblock with free space {
       Check heap 0 (the global heap) for a superblock
       If there is none {
         Allocate S bytes as superblock s & set owner to heap i
       } Else {
         Transfer the superblock s to heap i
         u0 ← u0 − s.u;  ui ← ui + s.u
         a0 ← a0 − S;    ai ← ai + S
       }
     }
  6. ui ← ui + sz;  s.u ← s.u + sz
  7. Unlock heap i
  8. Return a block from the superblock

free(ptr)

  1. If the block is "large" {
       Free the superblock to the OS and return
     }
  2. Find the superblock s this block comes from
  3. Lock s
  4. Lock heap i, the superblock's owner
  5. Deallocate the block from the superblock
  6. ui ← ui − block size;  s.u ← s.u − block size
  7. If i = 0 (the global heap), unlock heap i and superblock s and return
  8. If (ui < ai − K·S) and (ui < (1−f)·ai) {
       Transfer a mostly-empty superblock s1 to heap 0 (the global heap)
       u0 ← u0 + s1.u;  ui ← ui − s1.u
       a0 ← a0 + S;     ai ← ai − S
     }
  9. Unlock heap i and superblock s
Deriving bounds on blowup
  • blowup := O(A(t) / U(t))
  • A(t) = A'(t)
  • Invariant per heap: (ai(t) − K·S ≤ ui(t)) ∨ ((1−f)·ai(t) ≤ ui(t))
  • P << U(t) ⇒ blowup = O(1)
  • Worst-case consumption is a constant-factor overhead that does not grow with the amount of memory required by the program

A(t) = O(U(t) + P)
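
The summation step behind A(t) = O(U(t) + P), reconstructed from the invariant above (a sketch of the argument, not the paper's full proof):

```latex
% Each heap satisfies  u_i(t) \ge a_i(t) - KS  \lor  u_i(t) \ge (1-f)\,a_i(t),
% so either branch bounds a_i(t) from above:
a_i(t) \;\le\; \max\!\Big( u_i(t) + KS,\ \tfrac{u_i(t)}{1-f} \Big)
       \;\le\; \tfrac{u_i(t)}{1-f} + KS
% Summing over the P per-processor heaps plus the global heap:
A(t) \;=\; \sum_i a_i(t)
     \;\le\; \tfrac{U(t)}{1-f} + (P+1)\,KS
     \;=\; O\big(U(t) + P\big)
```

Since f and K·S are constants, the overhead term depends only on P, not on how much memory the program uses.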

Deriving bounds on contention (1)
  • Per-processor heap contention
    • One thread allocates, multiple threads free
      • Inherently unscalable
    • Pairs of producer/consumer threads
      • malloc/free calls serialized
      • At most a 2X slowdown (undesirable but scalable)
    • Empirically, only a small fraction of memory is freed by another thread ⇒ contention expected to be low
Deriving bounds on contention (2)
  • Global heap contention
    • Measure the number of global-heap lock acquisitions as an upper bound
    • Growing phase:
      • Each thread makes at most k/(f·S/s) acquisitions for k mallocs (see the worked instance below)
    • Shrinking phase:
      • Pathological case: the program frees (1−f) of each superblock, then frees every block in a superblock one at a time
    • Empirically: no excessive shrinking, and memory usage grows gradually ⇒ low overall contention
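
A worked instance of the growing-phase bound, plugging in the talk's parameters (the 8-byte object size is borrowed from the threadtest benchmark):

```latex
\frac{k}{f \cdot S / s}
  \;=\; \frac{k}{\tfrac{1}{4} \cdot 8192 / 8}
  \;=\; \frac{k}{256}
```

That is, during growth a thread takes the global-heap lock at most once per 256 mallocs under these parameters.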
Experimental Evaluation
  • Dedicated 14-processor Sun Enterprise
    • 400 MHz UltraSparc processors
    • 2 GB RAM, 4 MB L2 cache
    • Solaris 7
    • Superblock size S = 8K, f = ¼
  • Comparison between
    • Hoard
    • Ptmalloc (GNU libc, multiple heaps & ownership)
    • Mtmalloc (Solaris multithreaded allocator)
    • Solaris (default system allocator)
Speed

Size classes need to be handled more cleverly

Scalability - threadtest

278% faster than Ptmalloc on 14 CPUs

t threads allocate/deallocate 100,000/t 8-byte objects

Scalability – Larson
  • “Bleeding” typical in server applications
  • Mainly stays within empty fraction during execution
  • 18X faster than the next best allocator on 14 CPUs
Scalability - BEMengine
  • Drops below the empty fraction only a few times ⇒ low synchronization
False sharing behavior
  • Active-false: each thread allocates a small object, writes it a few times, frees it
  • Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
  • Illustrates the effects of contention in the coherence mechanism (sketched below)
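
A C sketch in the spirit of the Active-false benchmark, assuming POSIX threads (thread, object, and write counts are arbitrary). An allocator that hands two threads 8-byte blocks from one cache line turns these private writes into coherence traffic:

```c
#include <pthread.h>
#include <stdlib.h>

enum { NTHREADS = 4, OBJS = 10000, WRITES = 1000 };

/* Each thread repeatedly allocates a small object, writes it a few
   times, and frees it. */
static void *worker(void *arg) {
    for (int i = 0; i < OBJS; i++) {
        char *obj = malloc(8);
        for (int w = 0; w < WRITES; w++)
            obj[w % 8] = (char)w;
        free(obj);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```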
Fragmentation results

A large number of size classes remains live for the duration of the program, scattered across blocks

Within 20% of Lea’s allocator

Hoard Conclusions
  • Speed: Excellent
    • As fast as a uniprocessor allocator on one processor
      • amortized O(1) cost
      • 1 lock for malloc, 2 for free
  • Scalability: Excellent
    • Scales linearly with the number of processors
    • Avoids false sharing
  • Fragmentation: Very good
    • Worst-case is provably close to ideal
    • Actual observed fragmentation is low
Discussion Points
  • If we had to re-evaluate Hoard today, which benchmarks would we use?
  • Are there any changes needed to make it work with languages like Java?