Portable, mostly-concurrent, mostly-copying GC for multi-processors

Tony Hosking

Secure Software Systems Lab

Purdue University


Platform assumptions

  • Symmetric multi-processor (SMP/CMP)

  • Multiple mutator threads

  • (Large heaps)


Desirable properties

  • Maximize throughput

  • Minimize collector pauses

  • Scalability


Exploiting parallelism

  • Avoid contention

  • (Mostly-)Concurrent allocation

  • (Mostly-)Concurrent collection


Concurrent allocation

  • Use thread-private allocation “pages”

  • Threads contend for free pages

  • Each thread allocates from its own page

    • multiple small objects per page, or

    • multiple pages per large object
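
A minimal sketch in C of this allocation scheme (names such as gc_alloc, acquire_pages, and PAGE_SIZE are illustrative, not from the original Modula-3 runtime; the shared page pool is modelled here with malloc):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define PAGE_SIZE 8192                       /* hypothetical page size */

    typedef struct {
        char *cursor;                            /* next free byte in this thread's page */
        char *limit;                             /* one past the end of the page */
    } AllocPage;

    static __thread AllocPage my_page;           /* thread-private allocation page */
    static pthread_mutex_t page_pool_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Contended path: take pages from the shared pool (stand-in: malloc). */
    static char *acquire_pages(size_t bytes) {
        pthread_mutex_lock(&page_pool_lock);
        char *p = malloc(bytes);
        pthread_mutex_unlock(&page_pool_lock);
        return p;
    }

    void *gc_alloc(size_t size) {
        if (size > PAGE_SIZE)                    /* large object: whole pages of its own */
            return acquire_pages(((size + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE);
        if (my_page.limit - my_page.cursor < (ptrdiff_t)size) {
            my_page.cursor = acquire_pages(PAGE_SIZE);   /* only this step contends */
            my_page.limit  = my_page.cursor + PAGE_SIZE;
        }
        void *obj = my_page.cursor;              /* uncontended bump allocation */
        my_page.cursor += size;
        return obj;
    }

Only page acquisition takes the shared lock; allocating an individual small object is a private pointer bump.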


Concurrent collection: the tricolour abstraction

  • Black

    • “live”

    • scanned

    • cannot refer to white

  • Grey

    • “live” wavefront

    • still to be scanned

    • may refer to any colour

  • White

    • hypothetical garbage


Garbage collection

  • White = whole heap

  • Shade root targets grey

  • While grey nonempty

    • Shade one grey object black

    • Shade its white children grey

  • At end, white objects are garbage
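
A minimal sketch in C of this marking loop over the tricolour abstraction (the Object layout and the grey worklist are illustrative assumptions, not the paper's data structures):

    #include <stddef.h>

    typedef enum { WHITE, GREY, BLACK } Colour;

    typedef struct Object {
        Colour         colour;
        size_t         n_fields;
        struct Object *fields[];         /* reference fields */
    } Object;

    /* A simple worklist holding the grey objects (illustrative only). */
    static Object *grey_stack[1 << 20];
    static size_t  grey_top;

    static void shade(Object *o) {       /* white -> grey; black/grey unchanged */
        if (o != NULL && o->colour == WHITE) {
            o->colour = GREY;
            grey_stack[grey_top++] = o;
        }
    }

    void collect(Object **roots, size_t n_roots) {
        for (size_t i = 0; i < n_roots; i++)
            shade(roots[i]);             /* shade root targets grey */

        while (grey_top > 0) {           /* while grey is non-empty */
            Object *o = grey_stack[--grey_top];
            for (size_t i = 0; i < o->n_fields; i++)
                shade(o->fields[i]);     /* shade its white children grey */
            o->colour = BLACK;           /* scanned: cannot refer to white */
        }
        /* Anything still white is unreachable and may be reclaimed. */
    }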


Copying collection

  • Partition white from black by copying

  • Reclaim white partition wholesale

  • At next GC, “flip” black to white
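
A sketch of the copying step, assuming a conventional forwarding-pointer scheme (the CopyObject layout and tospace_alloc are assumptions; the slides do not spell out this mechanism):

    #include <stddef.h>
    #include <string.h>

    typedef struct CopyObject {
        struct CopyObject *forward;       /* non-NULL once the object has been copied */
        size_t             size;          /* total size in bytes */
        /* ... object fields ... */
    } CopyObject;

    /* Assumed bump allocator into the surviving (black/grey) partition. */
    extern void *tospace_alloc(size_t size);

    CopyObject *forward_object(CopyObject *o) {
        if (o->forward == NULL) {                    /* not yet copied */
            CopyObject *copy = tospace_alloc(o->size);
            memcpy(copy, o, o->size);                /* move out of the white partition */
            copy->forward = NULL;
            o->forward = copy;                       /* leave a forwarding pointer behind */
        }
        return o->forward;                           /* callers update their references */
    }

Once every reachable object has been copied out, the white partition holds only garbage and can be reclaimed wholesale.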


Incremental collection

[Diagram: timeline of mutator threads interleaved with incremental GC work]


Concurrent collection

[Diagram: timeline of mutator threads running alongside a background GC thread]


Concurrent mutators

  • Mutation changes reachability during GC

  • Losing a reference to a black/grey object is safe

    • A non-white object that loses its last reference simply becomes garbage at the next GC

  • New reference from black to white

    • New reference may make target live

    • Collector may never see new reference

  • Mutations may require compensation


Compensation options

  • Prevent mutator from creating black-to-white references

    • write barrier on black

    • read barrier on grey to prevent mutator obtaining white refs

  • Prevent destruction of any path from a grey object to a white object without telling GC

    • write barrier on grey
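
A sketch in C of the first option, a write barrier on black (incremental update); it reuses the Object type and shade() helper from the marking sketch above. The design in these slides later chooses a read barrier on grey instead, but the contrast is useful:

    /* Hypothetical compiler-emitted barrier on every reference store. */
    void write_barrier(Object *src, size_t field, Object *target) {
        if (src->colour == BLACK && target != NULL && target->colour == WHITE)
            shade(target);               /* compensate: no black-to-white edge survives */
        src->fields[field] = target;     /* the actual store */
    }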


Mostly-copying GC [Bartlett]

  • Copying collection with ambiguous roots

    • Uncooperative compilers

    • Untidy references

    • Explicit pinning

  • Pin ambiguously-referenced objects

    • Shade their page grey without copying

  • Assume heap accuracy

    • Copy remaining heap-referenced objects
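
A sketch of the ambiguous-root scan, with hypothetical helper names (looks_like_heap_pointer, shade_page_grey_in_place) standing in for the runtime's own tests:

    #include <stdint.h>

    extern int  looks_like_heap_pointer(uintptr_t word);   /* assumed range/alignment check */
    extern void shade_page_grey_in_place(uintptr_t word);  /* pin: promote page, no copy */

    /* Every word in the thread's stack (and saved registers) is treated as a
     * possible pointer; anything it might reference is pinned in place. */
    void scan_ambiguous_roots(const uintptr_t *lo, const uintptr_t *hi) {
        for (const uintptr_t *p = lo; p < hi; p++)
            if (looks_like_heap_pointer(*p))
                shade_page_grey_in_place(*p);
    }

Objects referenced only from the (accurate) heap can still be copied as usual, since heap references can be updated.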


Incremental MCGC [DeTreville]

  • Enforce grey mutator invariant

    • STW greys ambiguously-referenced pages

    • Read barrier on grey using VM page protection

  • Read barrier

    • Stop mutator threads

    • Unprotect page

    • Copy white targets to grey

    • Shade page black

    • Restart threads

  • Atomic system-call wrappers unprotect the targets of call parameters (otherwise the trap occurs inside the OS and the call returns an error)
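
A sketch of the virtual-memory read barrier described above, using POSIX page protection (all helper names are assumptions; error handling omitted):

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    extern size_t gc_page_size;
    extern void stop_all_mutators(void);
    extern void restart_all_mutators(void);
    extern void copy_white_targets_of_page(void *page);   /* afterwards the page is black */

    /* Grey pages are kept PROT_NONE, so any mutator access traps here. */
    static void grey_page_trap(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(gc_page_size - 1));
        stop_all_mutators();                                   /* stop mutator threads */
        mprotect(page, gc_page_size, PROT_READ | PROT_WRITE);  /* unprotect the page */
        copy_white_targets_of_page(page);                      /* shade the page black */
        restart_all_mutators();                                /* restart threads */
    }

    void install_grey_trap_handler(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = grey_page_trap;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }

System calls cannot take this trap inside the kernel, hence the atomic wrappers that unprotect any pages a call's parameters refer to before entering the OS.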


Concurrent MCGC?

  • Stopping all threads at each increment is prohibitively expensive on an SMP and impedes concurrency

  • BUT barriers are difficult to place on ambiguous references with uncooperative compilers

  • ALSO preemptive scheduling may break wrapper atomicity


Mostly-concurrent MCGC

  • Enforce black mutator invariant

    • STW blackens ambiguously-referenced pages

    • Read barrier on load of accurate (tidy) grey reference

  • Read barrier:

    • Blacken grey references as they are loaded

  • No system call wrappers: arguments are always black


Read barrier on load of grey

  • Object header bit marks grey objects

  • Inline fast path checks grey bit in target header, calls out to slow path if set

  • Out-of-line slow path:

    • Lock heap meta-data

    • For each (grey) source object in target page

      • Copy white targets to grey

      • Clear grey header bit

    • Shade target page black

    • Unlock heap meta-data
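
A sketch in C of this fast-path/slow-path read barrier (the header layout and helper names are illustrative, not the paper's code):

    #include <pthread.h>

    typedef struct HeapObject {
        unsigned header;                          /* GREY_BIT set while the object is grey */
        /* ... reference fields ... */
    } HeapObject;

    #define GREY_BIT 0x1u

    /* Assumed runtime pieces. */
    extern pthread_mutex_t heap_metadata_lock;
    extern void for_each_object_in_page(HeapObject *any, void (*fn)(HeapObject *));
    extern void copy_white_targets(HeapObject *src);   /* copy white targets to grey space */
    extern void shade_page_black(HeapObject *any);     /* mark the containing page black */

    static void clean_grey_source(HeapObject *src) {
        copy_white_targets(src);                  /* white children become grey copies */
        src->header &= ~GREY_BIT;                 /* clear the grey header bit */
    }

    static void read_barrier_slow(HeapObject *target) {
        pthread_mutex_lock(&heap_metadata_lock);  /* lock heap meta-data */
        for_each_object_in_page(target, clean_grey_source);
        shade_page_black(target);                 /* shade the target's page black */
        pthread_mutex_unlock(&heap_metadata_lock);
    }

    /* Inline fast path, emitted by the compiler at each load of an accurate
     * (tidy) reference. */
    static inline HeapObject *read_barrier(HeapObject *loaded) {
        if (loaded != NULL && (loaded->header & GREY_BIT))
            read_barrier_slow(loaded);            /* rare out-of-line slow path */
        return loaded;
    }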


Coherence for fast path

  • STW phase synchronizes mutators’ views of heap state

  • Grey bits are set only in newly-copied objects (i.e., in newly-allocated grey pages) since the most recent STW

  • Mutators can never see a cleared grey header unless the page is also black

  • Seeing a spurious grey header due to weak ordering is benign: slow path will synchronize


Implementation

  • Modula-3:

    • gcc-based compiler back-end

    • No tricky target-specific stack-maps

    • Compiler front-end emits barriers

    • M3 threads map to preemptively-scheduled POSIX pthreads

    • Stop/start threads: signals + semaphores, or OS primitives if available

    • Simple to port: Darwin (OS X), Linux, Solaris, Alpha/OSF


Experiments

  • Parallelized the GCOld benchmark to permit throughput measurements for multiple mutators

  • Measures steady-state GC throughput

  • 2 platforms:

    • 2 x 2.3GHz PowerPC Macintosh Xserve running OS X 10.4.4

    • 8 x 700MHz Intel Pentium 3 SMP running Linux 2.6


Read Barriers: STW
1 user-level mutator thread, work=1


Elapsed time (s)
1 system-level mutator thread, work=1


Heap size
1 system-level mutator thread


BMU
1 system-level mutator thread, work=1000, ratio=1


Scalability
work=1000, ratio=1, 8xP3


Java HotSpot server
work=1000, 8xP3


Conclusions

  • Mostly-concurrent, mostly-copying collection is feasible for multi-processors (proof-of-existence)

  • Performance is good (scalable)

  • Portable: changes only to compiler front-end to introduce barriers, and to GC run-time system

  • Compiler back-end unchanged: full-blown optimizations enabled, no stack-map overheads


Future work

  • Convert read barrier to “clean” only target object instead of whole page


Scalability
work=10, ratio=1, 8xP3


Java HotSpot server
work=10, 8xP3

