Presentation Transcript

  1. A Parallel, Real-Time Garbage Collector Authors: Perry Cheng, Guy E. Blelloch Presenter: Jun Tao

  2. Outline • Introduction • Background and definitions • Theoretical algorithm • Extended algorithm • Evaluation • Conclusion

  3. Introduction • First garbage collectors: • Non-incremental, non-parallel • Recent collectors: • Incremental • Concurrent • Parallel

  4. Introduction • Scalably parallel and real-time collector • All aspects of the collector are incremental • Parallel • Arbitrary number of application and collector threads • Tight theoretical bounds on • Pause time for any application • Total memory usage • Asymptotically but not practically efficient

  5. Introduction • Extended collector algorithm • Work with generations • Increase the granularity of the incremental steps • Separately handle global variables • Delay the copy on write • Reduce the synchronization cost of copying small objects • Parallelize the processing of large objects • Reduce double allocation during collection • Allow program stacks

  6. Background and Definitions • A Semispace Stop-and-Copy Collector • Divide heap memory into two equally sized regions • From-space and to-space • When from-space is full, suspend the mutator and copy reachable objects to to-space • Update root values and reverse the roles of from-space and to-space
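
The semispace scheme above can be sketched as a small simulation. The object layout (dicts with `ptr`-prefixed pointer fields holding from-space indices) and the `forward` marker are illustrative assumptions, not the paper's implementation; the traversal is a Cheney scan, in which the gray objects are exactly the unscanned suffix of to-space.

```python
# Minimal sketch of a semispace stop-and-copy collection (Cheney scan).
# Objects are dicts; fields whose names start with "ptr" hold from-space
# indices; a "forward" entry marks an already-copied object.

def stop_and_copy(roots, from_space):
    """Copy every object reachable from `roots` into a fresh to-space."""
    to_space = []

    def forward(idx):
        obj = from_space[idx]
        if "forward" in obj:          # already copied: reuse the replica
            return obj["forward"]
        new_idx = len(to_space)
        to_space.append(dict(obj))    # shallow copy into to-space
        obj["forward"] = new_idx      # leave a forwarding pointer behind
        return new_idx

    # Copy the roots, then scan: gray objects are the contiguous,
    # not-yet-scanned suffix of to-space (Cheney's trick).
    new_roots = [forward(r) for r in roots]
    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]
        for field in obj:
            if field.startswith("ptr"):       # fix up pointer fields
                obj[field] = forward(obj[field])
        scan += 1
    return new_roots, to_space
```

Running it on a three-object heap where object 2 is unreachable drops the garbage and rewrites the surviving pointer to a to-space index.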

  7. Background and Definitions • Types of Garbage Collectors

  8. Background and Definitions • Types of Garbage Collectors (continued)

  9. Background and Definitions • Real-time Collector • Maximum pause time • Utilization • The fraction of time that the mutator executes • Minimum Mutator Utilization • A function of window size • Minimum utilization over all windows of that size • = 0 when window size <= maximum pause time
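
The MMU definition above can be made concrete with a small helper. The pause intervals, sliding-window step, and function name are illustrative assumptions, not the paper's benchmark data.

```python
# Sketch of minimum mutator utilization (MMU): for a window size w, the
# minimum over all windows of length w of the fraction of time the
# mutator (rather than the collector) runs.

def mmu(pauses, total_time, w, step=1):
    """pauses: list of (start, end) collector pause intervals."""
    def pause_in(lo, hi):
        # Total pause time overlapping the window [lo, hi).
        return sum(max(0, min(e, hi) - max(s, lo)) for s, e in pauses)

    worst = 1.0
    t = 0
    while t + w <= total_time:
        worst = min(worst, (w - pause_in(t, t + w)) / w)
        t += step
    return worst
```

With a single 2 ms pause in a 20 ms run, a 4 ms window gives an MMU of 0.5, and any window no larger than the pause gives 0, matching the last bullet.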

  10. Theoretical Algorithm • A parallel, incremental and concurrent collector • Based on Cheney’s simple copying collector • All objects are stored in a shared global pool of memory • Two atomic instructions • FetchAndAdd • CompareAndSwap • Collector interfaces with the application when • Allocating space for a new object • Initializing the fields of a new object • Modifying a field of an existing object
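
The two atomic primitives named above can be emulated for illustration; on real hardware each is a single instruction (e.g. x86 `LOCK XADD` and `CMPXCHG`). The lock-based class below is an illustrative stand-in, not the paper's runtime.

```python
import threading

# Sketch of the two atomic primitives the theoretical algorithm relies on,
# emulated with a lock for clarity.
class AtomicCell:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, n):
        """Atomically add n and return the OLD value
        (used e.g. to reserve allocation space or stack slots)."""
        with self._lock:
            old = self._value
            self._value += n
            return old

    def compare_and_swap(self, expected, new):
        """Atomically install `new` iff the cell still holds `expected`;
        return True on success (used e.g. to claim an object for copying)."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False
```

FetchAndAdd returning the old value is what lets each caller claim a disjoint region without further coordination.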

  11. Theoretical Algorithm • Scalable Parallelism • Maintain the set of gray objects • Cheney’s technique • Keeping them in contiguous locations in to-space • Pros • Simple • Cons • Restricts the traversal order to breadth-first • Difficult to implement in a parallel setting

  12. Theoretical Algorithm • Scalable Parallelism (continued) • Explicitly managed local stacks • Each processor maintains a stack • A shared stack of gray objects • Periodically transfer gray objects between the local and shared stacks • Avoids idleness • Pushes (or pops) can proceed in parallel • A target region is reserved before each transfer • Pushes and pops are never concurrent • Room synchronization
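
The "reserve a target region before transfer" idea can be sketched as follows: each processor claims a block of slots on the shared stack with one FetchAndAdd on the cursor, then fills its slots with no further synchronization. The fixed-size array and the single-threaded cursor emulation are illustrative assumptions.

```python
# Sketch of parallel pushes onto the shared gray-object stack.
class SharedStack:
    def __init__(self, size):
        self.slots = [None] * size
        self.cursor = 0               # atomic on real hardware

    def fetch_and_add(self, n):
        old = self.cursor             # would be a single atomic instruction
        self.cursor += n
        return old

    def push_many(self, objs):
        base = self.fetch_and_add(len(objs))   # reserve a region once
        for k, obj in enumerate(objs):
            self.slots[base + k] = obj         # fill without contention
```

Because the reservation is a single atomic step, two processors pushing concurrently always receive disjoint regions.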

  13. Theoretical Algorithm • Scalable Parallelism (continued) • Avoid white objects being copied twice • Exclusive access via atomic instructions • Copy-copy synchronization
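
Copy-copy synchronization can be sketched like this: before copying a white object, a collector thread must win a CompareAndSwap on the object's tag, so the object is copied exactly once and the loser simply uses the winner's replica. Tag values, object layout, and the single-threaded `cas` helper are illustrative assumptions.

```python
# Sketch of copy-copy synchronization via a tag-word CAS.
WHITE, BUSY = "white", "busy"

def cas(obj, key, expected, new):
    # Single-threaded emulation of an atomic compare-and-swap on a field.
    if obj.get(key) == expected:
        obj[key] = new
        return True
    return False

def try_copy(obj, to_space):
    """Return the replica index; only the CAS winner performs the copy."""
    if cas(obj, "tag", WHITE, BUSY):       # we won: copy the object
        replica = len(to_space)
        to_space.append(dict(obj["fields"]))
        obj["forward"] = replica           # publish the forwarding pointer
        obj["tag"] = "forwarded"
        return replica
    while "forward" not in obj:            # loser waits for the winner
        pass
    return obj["forward"]
```

Calling `try_copy` twice on the same object yields the same replica index and leaves exactly one copy in to-space.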

  14. Theoretical Algorithm • Incremental and Replicating Collection • Baker’s incremental collector • Copy k units of data for each unit of data allocated • Bounds the pause time • Mutator can only see copied objects in to-space • A read barrier is needed • Modification to avoid the read barrier • Mutator can only see the original objects in from-space • A write barrier is needed
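
The "copy k units per unit allocated" invariant can be sketched directly: every allocation pays for a proportional amount of collection work, so collection finishes before from-space can be exhausted. The class name and the `copy_one_word` hook are illustrative assumptions.

```python
# Sketch of the incremental pacing invariant: k words of copying work
# per word allocated by the mutator.
class IncrementalAllocator:
    def __init__(self, k, copy_one_word):
        self.k = k                    # words copied per word allocated
        self.copy_one_word = copy_one_word
        self.allocated = 0
        self.copied = 0

    def allocate(self, n_words):
        """Allocate n words and pay for k * n words of copying work."""
        for _ in range(n_words * self.k):
            self.copy_one_word()
            self.copied += 1
        self.allocated += n_words
```

Raising k finishes collection sooner (smaller space bound) at the cost of slower allocation, which is exactly the trade-off in the space bound later in the talk.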

  15. Theoretical Algorithm • Concurrency • Program and collector execute simultaneously • Program manipulates the primary memory graph • Collector manipulates the replica graph • A copy-write synchronization is needed • Replica objects must be modified correspondingly • Avoid race conditions • Mark objects being copied • Mutator updates to a replica must be delayed • A write-write synchronization is needed • Prohibits different mutator threads from modifying the same memory location concurrently
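
The replicating write barrier behind copy-write synchronization can be sketched as: the mutator always writes to the primary (from-space) object, and if a replica already exists, the same write is mirrored into it so the replica graph stays consistent. The object layout reuses the illustrative `fields`/`forward` convention and is an assumption, not the paper's representation.

```python
# Sketch of a replicating write barrier (copy-write synchronization).
def write_field(obj, i, value, to_space):
    obj["fields"][i] = value            # primary update in from-space
    if "forward" in obj:                # a replica exists: mirror the write
        to_space[obj["forward"]][i] = value
```

With this barrier the mutator never reads to-space, which is what lets the scheme drop Baker's read barrier.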

  16. Theoretical Algorithm • Space and Time Bounds • Time bound on each memory operation • c·k • c: a constant • k: the number of words copied per word allocated • Space bound • 2(R(1 + 1.5/k) + N + 5PD) ≈ 2R(1 + 1.5/k) • R: reachable space • N: maximum object count • P: a P-way multiprocessor • D: maximum memory graph depth
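
A worked instance of the space bound makes the trade-off concrete; the numbers below are illustrative, not measurements from the paper.

```python
# The space bound from this slide; the N and 5PD terms are usually
# negligible next to R, giving the 2R(1 + 1.5/k) approximation.
def space_bound(R, k, N=0, P=0, D=0):
    return 2 * (R * (1 + 1.5 / k) + N + 5 * P * D)

# With R = 100 MB of reachable data and k = 3 words copied per word
# allocated, the heap never needs more than 2 * 100 * 1.5 = 300 MB.
```

Doubling k to 6 shrinks the bound to 250 MB, at the cost of more copying work per allocation.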

  17. Extended Algorithm • Globals, Stacks and Stacklets • Globals • Updated when collection ends • Arbitrarily many -> unbounded pause time • Replicate globals like other heap objects • Every global has two locations • A single flag is used for all globals • Stacks and Stacklets • Divide stacks into fixed-size stacklets • At most one stacklet is active, so the others can be replicated safely • Also bounds the wasted space per stack

  18. Extended Algorithm • Granularity • Block Allocation and Free Initialization • Avoid calling FetchAndAdd for every memory allocation • Each processor maintains a local pool in from-space, and one in to-space while the collector is on • A single FetchAndAdd allocates a whole local pool • Write Barrier • Avoid updating copied objects on every write • Record a triple <x, i, y> in a write log and defer the replica update • Invoke the collector when the write log is full • Eliminates frequent context switches
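
The deferred write barrier on this slide can be sketched as a per-processor log: each mutator write appends a triple (x, i, y), and the replicas are only patched when the log fills. The class name, capacity, and `flush_to_collector` hook are illustrative assumptions.

```python
# Sketch of the write-log deferral from this slide.
class WriteLog:
    def __init__(self, capacity, flush_to_collector):
        self.capacity = capacity
        self.entries = []
        self.flush_to_collector = flush_to_collector

    def record(self, x, i, y):
        """Log 'field i of object x was set to y'; replica updates are
        deferred until the log fills and the collector is invoked."""
        self.entries.append((x, i, y))
        if len(self.entries) == self.capacity:
            self.flush_to_collector(self.entries)   # batch the updates
            self.entries = []
```

Batching amortizes the barrier cost: the mutator pays one append per write and one context switch per `capacity` writes, rather than one per write.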

  19. Extended Algorithm • Small and Large Objects • Original Algorithm • One field at a time • Reinterpretation of the tag word • Transferring the object to and from the local stack • Extended Algorithm • Small objects • Locked and copied in one step • Large objects • Divided into segments • Copied one segment at a time

  20. Extended Algorithm • Algorithmic Modifications • Reducing double allocation • One allocation by the mutator and one by the collector • Defer the second allocation • Rooms and Better Rooms • A push room and a pop room • Only one room can be non-empty • Rooms • Enter the pop room, fetch and perform work, transition to the push room, push objects back to the shared stack • Graying objects is time-consuming, so threads wait to enter the push room
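
The room discipline above can be sketched with a condition variable: threads in the pop room take work from the shared stack, then move to the push room to return gray objects, and the invariant is that the two rooms are never occupied at once, so pops and pushes never interleave. This lock-based emulation is an illustrative assumption, not the paper's lock-free protocol.

```python
import threading

# Sketch of room synchronization: at most one of the two rooms is
# occupied at any time.
class Rooms:
    def __init__(self):
        self._cond = threading.Condition()
        self._occupants = {"pop": 0, "push": 0}

    def enter(self, room):
        other = "push" if room == "pop" else "pop"
        with self._cond:
            while self._occupants[other] > 0:   # wait for the other room
                self._cond.wait()               # to empty out
            self._occupants[room] += 1

    def leave(self, room):
        with self._cond:
            self._occupants[room] -= 1
            if self._occupants[room] == 0:
                self._cond.notify_all()         # other room may now open
```

The "better rooms" refinement on the next slide shortens the pop-room residency: a thread leaves as soon as it has fetched work, instead of holding the room while it processes.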

  21. Extended Algorithm • Algorithmic Modifications (continued) • Rooms and Better Rooms (continued) • Better rooms • Leave the pop room immediately after fetching work from the shared stack • Detect that the shared stack is empty by maintaining a borrow counter • Generational Collection • Nursery and tenured space • Trigger a minor collection when the nursery is full • Trigger a major collection when tenured space is full • Tenured references cannot be updated in place during collection • Hold two fields for each mutable pointer • One for the mutator to use, the other for the collector to update

  22.–28. Evaluation • These slides consist of evaluation figures that were not transcribed

  29. Conclusion • Implemented a scalably parallel, concurrent, real-time garbage collector • Thread synchronization is minimized