Hoard: A Scalable Memory Allocator for Multithreaded Applications

Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson

Presented by Dimitris Prountzos

(Some slides adapted from Emery Berger’s presentation)

Outline
  • Motivation
  • Problems in allocator design
    • False sharing
    • Fragmentation
  • Existing approaches
  • Hoard design
  • Experimental evaluation
Motivation
  • Parallel multithreaded programs prevalent
    • Web servers, search engines, database managers, etc.
    • Run on CMPs/SMPs for high performance
    • Some of them are embarrassingly parallel
  • Memory allocation is a bottleneck
    • Prevents scaling with number of processors
Desired allocator attributes on a multiprocessor system
  • Speed
    • Competitive with uniprocessor allocators on 1 CPU
  • Scalability
    • Performance linear with the number of processors
  • Fragmentation (= max allocated / max in use)
    • High fragmentation ⇒ poor data locality ⇒ paging
  • False sharing avoidance
The problem of false sharing

  • Programs cause false sharing
    • Allocate several objects within one cache line, then pass the objects to different threads
  • Allocators cause false sharing!
    • Actively: malloc satisfies requests from different threads out of the same cache line
    • Passively: free allows a future malloc to produce false sharing

[Diagram: x1 = malloc(s) on processor 1 and x2 = malloc(s) on processor 2 land on the same cache line, so the two processors thrash as the line ping-pongs between their caches.]
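
To make the picture concrete, here is a minimal C sketch of this scenario, assuming POSIX threads (the iteration count is arbitrary). If the allocator carves x1 and x2 out of one cache line, the two threads invalidate that line on every write even though they share no data:

```c
#include <pthread.h>
#include <stdlib.h>

enum { ITERS = 100000000 };

static void *bump(void *arg) {
    volatile int *counter = arg;
    for (int i = 0; i < ITERS; i++)
        ++*counter;                 /* each write invalidates the line */
    return NULL;
}

int main(void) {
    /* Two small objects; a false-sharing-prone allocator may place
       both within the same cache line. */
    int *x1 = malloc(sizeof *x1);
    int *x2 = malloc(sizeof *x2);
    *x1 = *x2 = 0;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, x1);
    pthread_create(&t2, NULL, bump, x2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    free(x1);
    free(x2);
    return 0;
}
```

Build with `cc -pthread`; on an allocator that keeps the two objects on separate cache lines, the same program runs without coherence traffic.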

The problem of fragmentation
  • Blowup:
    • Increase in memory consumption that occurs when the allocator reclaims memory freed by the program but fails to use it for future requests
    • Mainly a problem of concurrent allocators
    • Unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator

  • Pure private heaps:
    • One heap per processor
    • malloc gets memory from the processor's heap or the system
    • free puts memory on the processor's heap
  • Avoids heap contention
    • Examples: STL, Cilk

[Diagram: processor 1 runs x1 = malloc(s), free(x1), x4 = malloc(s); processor 2 runs x2 = malloc(s), free(x2), x3 = malloc(s); the legend marks blocks allocated by heap 1 vs. free on heap 2.]

How to Break Pure Private Heaps: Fragmentation
  • Pure private heaps:
    • memory consumption can grow without bound!
  • Producer-consumer:
    • processor 1 allocates
    • processor 2 frees
    • Memory always unavailable to producer

[Diagram: processor 1 allocates x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), ... while processor 2 frees each block (free(x1), free(x2), free(x3), ...); every freed block lands on heap 2, where the producer can never reuse it.]
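
A minimal runnable sketch of this pattern, assuming POSIX threads (the one-slot handoff channel and the round count are illustrative). Under a pure-private-heaps allocator, every free() below parks the block on the consumer's heap, so the producer keeps growing its own heap from the system:

```c
#include <pthread.h>
#include <stdlib.h>

enum { ROUNDS = 1000000 };

/* One-slot handoff channel between the two threads. */
static void *slot;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    for (int i = 0; i < ROUNDS; i++) {
        void *p = malloc(64);               /* drawn from heap 1 */
        pthread_mutex_lock(&m);
        while (slot != NULL)
            pthread_cond_wait(&cv, &m);
        slot = p;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&m);
        while (slot == NULL)
            pthread_cond_wait(&cv, &m);
        void *p = slot;
        slot = NULL;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
        free(p);                            /* parked on heap 2 */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```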

Example II: Private Heaps with Ownership
  • free puts memory back on the originating processor's heap.
  • Avoids unbounded memory consumption
    • Examples: ptmalloc, LKmalloc

[Diagram: x1 = malloc(s) on processor 1 and x2 = malloc(s) on processor 2; each block is later freed, and ownership returns it to the heap of the processor that allocated it.]

How to Break Private Heaps with Ownership: Fragmentation
  • Memory consumption can blow up by a factor of P.
  • Round-robin producer-consumer:
    • processor i allocates
    • processor i+1 frees
  • The program needs only 1 block (K in general), but the allocator ends up holding 3 (P·K) blocks

[Diagram: processor 1 runs x1 = malloc(s); processor 2 frees x1 and runs x2 = malloc(s); processor 3 frees x2 and runs x3 = malloc(s); processor 1 frees x3. Ownership returns each freed block to its originating heap, so all three heaps end up holding an idle block.]
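
A toy sequential model of this scenario (not real allocator code; P, K, and the round count are arbitrary). It tracks per-heap free lists under the ownership rule and shows the allocator settling at P·K blocks for a program that only ever needs K live blocks:

```c
#include <stdio.h>

/* Toy model of private heaps WITH ownership: a freed block returns
   to the heap that allocated it, but heaps never share memory. */
enum { P = 3, K = 1, ROUNDS = 12 };

int main(void) {
    int cached[P] = {0};  /* idle blocks parked on each owner's heap */
    int from_os = 0;      /* total blocks ever obtained from the OS  */

    for (int r = 0; r < ROUNDS; r++) {
        int p = r % P;    /* processor p allocates K blocks...       */
        for (int b = 0; b < K; b++) {
            if (cached[p] > 0)
                cached[p]--;   /* reuse a block already on heap p    */
            else
                from_os++;     /* heap p is empty: grow from the OS  */
        }
        cached[p] += K;   /* ...and processor p+1 frees them; with
                             ownership they return to heap p         */
    }
    printf("live blocks needed: %d, blocks held by allocator: %d (P*K)\n",
           K, from_os);
    return 0;
}
```

Running it reports 1 live block needed against P·K = 3 blocks held; further rounds reuse the cached blocks, so consumption is bounded, but P-fold.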

Uniprocessor Allocators on Multiprocessors
  • Fragmentation: Excellent
    • Very low for most programs [Wilson & Johnstone]
  • Speed & Scalability: Poor
    • Heap contention
      • A single lock protects the heap
  • Can exacerbate false sharing
    • Different processors can share cache lines
Existing Multiprocessor Allocators
  • Speed:
    • One concurrent heap (e.g., concurrent B-tree):
      • O(log (#size-classes)) cost per memory operation
      • too many locks/atomic updates

⇒ Fast allocators use multiple heaps

  • Scalability:
    • Allocator-induced false sharing
    • Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
  • Fragmentation:
    • P-fold increase or even unbounded
Hoard Overview
  • P per-processor heaps & 1 global heap
  • Each thread accesses only its local heap & the global heap
  • Manages memory in page-sized superblocks of same-sized objects (LIFO free-list)
    • Avoids false sharing by not carving up cache lines
    • Avoids heap contention – local heaps allocate & free small blocks from their superblocks
  • Avoids blowup by
    • Moving superblocks to global heap when fraction of free memory exceeds some threshold
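
A C sketch of the data structures this design implies (all type and field names here are illustrative guesses, not Hoard's actual source):

```c
#include <pthread.h>
#include <stddef.h>

enum { NUM_SIZE_CLASSES = 36, P = 14 };   /* illustrative values */

/* A superblock holds same-sized blocks carved from S bytes. */
typedef struct superblock {
    struct superblock *next;       /* link in the owning heap's list    */
    struct heap       *owner;      /* heap 0 is the global heap         */
    size_t             size_class; /* block size this superblock serves */
    size_t             u;          /* bytes in use in this superblock   */
    void              *free_list;  /* LIFO list of freed blocks         */
    /* ...followed by the blocks themselves...                          */
} superblock_t;

typedef struct heap {
    pthread_mutex_t lock;          /* per-heap lock, not one global one */
    size_t u;                      /* memory in use on this heap  (ui)  */
    size_t a;                      /* memory held by this heap    (ai)  */
    superblock_t *sb[NUM_SIZE_CLASSES]; /* lists kept from most full to least */
} heap_t;

static heap_t global_heap;         /* heap 0 */
static heap_t heaps[P];            /* one heap per processor            */
```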
Superblock management

Emptiness threshold: (ui ≥ (1−f)·ai) ∨ (ui ≥ ai − K·S)

f = ¼

K = 0

  • Multiple heaps ⇒ avoids actively induced false sharing
  • Block coalescing ⇒ avoids passively induced false sharing
  • Superblocks transferred are usually empty and transfer is infrequent
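
As a C sketch, the donation trigger might look like the helper below (the constants follow the talk's f = ¼, K = 0, S = 8K; the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Constants from the talk: f = 1/4 (empty fraction), K = 0, S = 8K. */
enum { F_INVERSE = 4, K_SB = 0, S_BYTES = 8192 };

/* A heap has crossed the emptiness threshold -- and should donate a
   mostly-empty superblock to the global heap -- when BOTH
   u_i < (1-f)*a_i and u_i < a_i - K*S hold (the negation of the
   invariant above). */
static bool crossed_emptiness_threshold(size_t u_i, size_t a_i) {
    return u_i < a_i - a_i / F_INVERSE          /* u_i < (1-f)*a_i */
        && u_i + (size_t)K_SB * S_BYTES < a_i;  /* u_i < a_i - K*S */
}
```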
Hoard pseudo-code

malloc(sz)

  1. If sz > S/2, allocate the superblock from the OS and return it.
  2. i ← hash(current thread)
  3. Lock heap i
  4. Scan heap i's list of superblocks from most full to least (for the size class of sz)
  5. If there is no superblock with free space {
       Check heap 0 (the global heap) for a superblock
       If there is none {
         Allocate S bytes as superblock s & set owner to heap i
       } Else {
         Transfer the superblock s to heap i
         u0 ← u0 − s.u;  ui ← ui + s.u
         a0 ← a0 − S;    ai ← ai + S
       }
     }
  6. ui ← ui + sz;  s.u ← s.u + sz
  7. Unlock heap i
  8. Return a block from the superblock

free(ptr)

  1. If the block is "large" {
       Free the superblock to the OS and return
     }
  2. Find the superblock s this block comes from
  3. Lock s
  4. Lock heap i, the superblock's owner
  5. Deallocate the block from the superblock
  6. ui ← ui − block size;  s.u ← s.u − block size
  7. If i = 0 (the global heap), unlock heap i and superblock s and return
  8. If (ui < ai − K·S) and (ui < (1−f)·ai) {
       Transfer a mostly-empty superblock s1 to heap 0 (the global heap)
       u0 ← u0 + s1.u;  ui ← ui − s1.u
       a0 ← a0 + S;     ai ← ai − S
     }
  9. Unlock heap i and superblock s
Deriving bounds on blowup
  • blowup := O(A(t) / U(t))
  • A(t) = A'(t)
  • Invariant per heap: (ai(t) − K·S ≤ ui(t)) ∨ ((1−f)·ai(t) ≤ ui(t))
  • P << U(t) ⇒ blowup = O(1)
  • Worst-case consumption is a constant-factor overhead that does not grow with the amount of memory required by the program

A(t) = O(U(t) + P)
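
The summation step behind A(t) = O(U(t) + P), reconstructed from the invariant above (a sketch of the argument, not the paper's full proof):

```latex
% Each heap satisfies  u_i(t) \ge a_i(t) - KS  \lor  u_i(t) \ge (1-f)\,a_i(t),
% so either branch bounds a_i(t) from above:
a_i(t) \;\le\; \max\!\Big( u_i(t) + KS,\ \tfrac{u_i(t)}{1-f} \Big)
       \;\le\; \tfrac{u_i(t)}{1-f} + KS
% Summing over the P per-processor heaps plus the global heap:
A(t) \;=\; \sum_i a_i(t)
     \;\le\; \tfrac{U(t)}{1-f} + (P+1)\,KS
     \;=\; O\big(U(t) + P\big)
```

Since f and K·S are constants, the overhead term depends only on P, not on how much memory the program uses.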

Deriving bounds on contention (1)
  • Per-processor heap contention
    • One thread allocates, multiple threads free
      • Inherently unscalable
    • Pairs of producer/consumer threads
      • malloc/free calls serialized
      • At most a 2X slowdown (undesirable but scalable)
    • Empirically, only a small fraction of memory is freed by another thread ⇒ contention expected to be low
Deriving bounds on contention (2)
  • Global heap contention
    • Measure the number of global-heap lock acquisitions as an upper bound
    • Growing phase:
      • Each thread makes at most k/(f·S/s) acquisitions for k mallocs (see the worked instance below)
    • Shrinking phase:
      • Pathological case: the program frees (1−f) of each superblock, then frees every block in a superblock one at a time
    • Empirically: no excessive shrinking, and memory usage grows gradually ⇒ low overall contention
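
A worked instance of the growing-phase bound, plugging in the talk's parameters (the 8-byte object size is borrowed from the threadtest benchmark):

```latex
\frac{k}{f \cdot S / s}
  \;=\; \frac{k}{\tfrac{1}{4} \cdot 8192 / 8}
  \;=\; \frac{k}{256}
```

That is, during growth a thread takes the global-heap lock at most once per 256 mallocs under these parameters.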
Experimental Evaluation
  • Dedicated 14-processor Sun Enterprise
    • 400 MHz UltraSparc processors
    • 2 GB RAM, 4 MB L2 cache
    • Solaris 7
    • Superblock size S = 8K, f = ¼
  • Comparison between
    • Hoard
    • Ptmalloc (GNU libc, multiple heaps & ownership)
    • Mtmalloc (Solaris multithreaded allocator)
    • Solaris (default system allocator)
Speed

Size classes need to be handled more cleverly

Scalability - threadtest

278% faster than Ptmalloc on 14 CPUs

t threads allocate/deallocate 100,000/t 8-byte objects

Scalability – Larson
  • “Bleeding” typical in server applications
  • Mainly stays within empty fraction during execution
  • 18X faster than the next best allocator on 14 CPUs
Scalability - BEMengine
  • Drops below the empty fraction only a few times ⇒ low synchronization
False sharing behavior
  • Active-false: each thread allocates a small object, writes it a few times, frees it
  • Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
  • Illustrates the effects of contention in the coherence mechanism (sketched below)
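
A C sketch in the spirit of the Active-false benchmark, assuming POSIX threads (thread, object, and write counts are arbitrary). An allocator that hands two threads 8-byte blocks from one cache line turns these private writes into coherence traffic:

```c
#include <pthread.h>
#include <stdlib.h>

enum { NTHREADS = 4, OBJS = 10000, WRITES = 1000 };

/* Each thread repeatedly allocates a small object, writes it a few
   times, and frees it. */
static void *worker(void *arg) {
    for (int i = 0; i < OBJS; i++) {
        char *obj = malloc(8);
        for (int w = 0; w < WRITES; w++)
            obj[w % 8] = (char)w;
        free(obj);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```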
Fragmentation results

A large number of size classes remains live for the duration of the program, scattered across blocks

Within 20% of Lea’s allocator

Hoard Conclusions
  • Speed: Excellent
    • As fast as a uniprocessor allocator on one processor
      • amortized O(1) cost
      • 1 lock for malloc, 2 for free
  • Scalability: Excellent
    • Scales linearly with the number of processors
    • Avoids false sharing
  • Fragmentation: Very good
    • Worst-case is provably close to ideal
    • Actual observed fragmentation is low
Discussion Points
  • If we had to re-evaluate Hoard today, which benchmarks would we use?
  • Are there any changes needed to make it work with languages like Java?