Outline

Incorporating Generations into a Modern Reference Counting Garbage Collector. Hezi Azatchi. Advisor: Erez Petrank

Outline • Background • Garbage Collection • Reference Counting • Mark&Sweep • Improving Tracing using Generations • On-The-Fly Sliding-View Garbage Collectors • Our Generational on-the-fly Algorithms • Results • Summary

Background – Reference Counting The Reference Counting Algorithm [Collins 1960] • Each object has an RC field. • New objects get o.RC := 1. • When a pointer p that points to o1 is modified to point to o2, we do: o1.RC--, o2.RC++. • If o1.RC == 0: • Delete o1. • Decrement o.RC for all children o of o1. • Recursively delete objects whose RC is decremented to 0. (Figure: pointer p and objects o1 to o4.)
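The counting rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the deck's implementation: the names Obj, update and release are hypothetical, and the creating reference is modeled as the initial RC of 1.

```python
class Obj:
    """Hypothetical heap object: an RC field plus named pointer slots."""
    def __init__(self, name):
        self.name = name
        self.rc = 1          # new objects get RC := 1 (the creating reference)
        self.slots = {}      # slot name -> Obj or None
        self.freed = False

def release(o):
    """Drop one reference to o; reclaim it recursively if RC reaches 0."""
    if o is None:
        return
    o.rc -= 1
    if o.rc == 0:
        o.freed = True
        for child in o.slots.values():
            release(child)   # decrement the RC of every child
        o.slots.clear()

def update(holder, slot, new):
    """Pointer store holder.slot := new with eager count maintenance."""
    old = holder.slots.get(slot)
    if new is not None:
        new.rc += 1          # increment first, so storing the same object is safe
    holder.slots[slot] = new
    release(old)

root, a, b = Obj("root"), Obj("a"), Obj("b")
update(root, "p", a); release(a)      # a is now held only by root.p
update(a, "next", b); release(b)      # b is now held only by a.next
update(root, "p", None)               # a drops to 0, which cascades to b

# A two-object cycle is never reclaimed by these rules.
c, d = Obj("c"), Obj("d")
update(c, "next", d); update(d, "next", c)
release(c); release(d)                # drop the creating references
```

After the last store a and b are reclaimed, while c and d keep RC 1 each because they reference one another.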

Background – Reference Counting 3 years later… • [McBeth 1963] The reference counting algorithm does not reclaim cycles! • But it turns out that “normal” programs do not use too many cycles. • So other methods (such as mark and sweep) are used “seldom” to collect the cycles. (Figure: objects o1 and o2.)

Background – Reference Counting Deferred Reference Counting • Problem: RC algorithms prescribe an action for each pointer operation. • Solution [Deutsch & Bobrow, 1976]: • Don’t update RC for locals. • Put objects with RC=0 in a Zero-Count-Table (ZCT). • “Once in a while”: collect all the objects (in the ZCT) with o.RC=0 that are not referenced from local roots. • Deferred RC reduces overhead by 80%. Used in most modern RC systems.
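A rough Python sketch of the deferred scheme follows. It assumes that only heap stores are counted, so a new object starts with RC 0 and sits in the ZCT until a heap slot refers to it. The class and method names (DeferredRC, heap_store, collect) are hypothetical.

```python
class Obj:
    def __init__(self, name):
        self.name, self.rc, self.slots = name, 0, {}

class DeferredRC:
    def __init__(self):
        self.zct = set()     # Zero-Count-Table: objects whose heap RC is 0
        self.freed = []

    def new(self, name):
        o = Obj(name)        # assumption: locals are not counted, so RC starts at 0
        self.zct.add(o)
        return o

    def _dec(self, o):
        o.rc -= 1
        if o.rc == 0:
            self.zct.add(o)

    def heap_store(self, holder, slot, new):
        """Only stores into heap slots touch reference counts."""
        old = holder.slots.get(slot)
        holder.slots[slot] = new
        if new is not None:
            new.rc += 1
            self.zct.discard(new)
        if old is not None:
            self._dec(old)

    def collect(self, local_roots):
        """Reclaim ZCT entries with RC 0 that no local root refers to."""
        roots = set(local_roots)
        work = [o for o in self.zct if o not in roots]
        while work:
            o = work.pop()
            if o.rc != 0 or o in roots or o not in self.zct:
                continue
            self.zct.discard(o)
            self.freed.append(o.name)
            for child in o.slots.values():
                if child is not None:
                    self._dec(child)
                    if child.rc == 0:
                        work.append(child)

gc = DeferredRC()
a, b, c = gc.new("a"), gc.new("b"), gc.new("c")
gc.heap_store(a, "next", b)      # b leaves the ZCT
gc.collect(local_roots=[a])      # a is protected by a local; c is reclaimed
first_pass = list(gc.freed)
gc.collect(local_roots=[])       # a is no longer rooted; a and then b go
```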

Background – Mark and Sweep The Mark-Sweep algorithm [McCarthy 1960] • Traverse & mark live objects, starting from the roots (including globals). • White (unmarked) objects can be reclaimed.
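For comparison with reference counting, here is a minimal mark-and-sweep sketch in Python. The heap is modeled as a dictionary from object name to the names it points to, which is an assumption made for illustration only.

```python
def mark_sweep(heap, roots):
    """heap: {name: [child names]}; roots: iterable of names.
    Marks everything reachable from the roots, removes the rest from heap,
    and returns the set of reclaimed names."""
    marked = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in marked:
            continue
        marked.add(name)              # mark phase: color the object
        stack.extend(heap[name])
    garbage = set(heap) - marked      # sweep phase: unmarked ("white") objects
    for name in garbage:
        del heap[name]
    return garbage

heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
garbage = mark_sweep(heap, roots=["a"])
```

Unlike plain reference counting, the unreachable cycle between c and d is reclaimed here.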

Background – Generational GC Generational Garbage Collection • [Ungar, 1984] Weak generational hypothesis: “most objects die young” • Segregate objects by age into two or more regions of the heap called generations. • Objects are first allocated in the youngest generation, but are promoted into an older generation if they survive long enough. • Most pauses are short (for young generation GC). • Collection effort is concentrated where there is garbage. • Better locality.

Background – Generational GC Generational GC – Inter-Generational-Pointers • Pointers from the old to the young generation must be part of the root set of the young generation. (Figure: globals, stack, old generation and young generation.)

Background Note Interesting Properties • Mark&sweep is good with a low fraction of live objects, thus it “fits” the young generation, which has a low fraction of live objects. • RC does not depend on the amount of live space, thus it “fits” the old generation, which does have a large amount of live space. • Thus, a combination of RC for the old generation and Mark&sweep for the young may be good! • This is exactly what we tried • On a modern platform (SMP). • With advanced modern on-the-fly collectors.

Background Terminology (Mutators) (Collector Threads)

On the fly Sliding-View Algorithms Levanoni-Petrank OOPSLA 2001

Levanoni Petrank Algorithms - Motivation Motivation for RC • Reference Counting work is proportional to the work on creations and modifications. • Can tracing deal with tomorrow’s huge heaps? • Reference counting has good locality. • The Challenge: • RC write barriers seem too expensive. • RC seems impossible to “parallelize”.

Levanoni Petrank Algorithms - Motivation Multithreaded RC? • Problem 1: ref-count updates must be atomic. • Problem 2: parallel updates confuse counters: • Thread 1: Read A.next (sees B); A.next ← C; B.RC--; C.RC++ • Thread 2: Read A.next (sees B); A.next ← D; B.RC--; D.RC++ (Figure: objects A, B, C, D.)
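The confusion in Problem 2 can be replayed deterministically. The Python sketch below does not use real threads; it simply executes one possible interleaving by hand, with hypothetical starting counts (B referenced once, C and D not referenced from the heap).

```python
rc = {"B": 1, "C": 0, "D": 0}
heap = {"A.next": "B"}

# Both threads read A.next before either one writes it.
old_seen_by_t1 = heap["A.next"]      # sees B
old_seen_by_t2 = heap["A.next"]      # sees B as well

# Thread 1 completes its update.
heap["A.next"] = "C"
rc[old_seen_by_t1] -= 1
rc["C"] += 1

# Thread 2 completes its update, still using its stale read of B.
heap["A.next"] = "D"
rc[old_seen_by_t2] -= 1
rc["D"] += 1
```

A.next ends up pointing to D, so the correct counts would be B=0, C=0, D=1. The interleaving instead leaves B at -1 and C at 1.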

Levanoni Petrank Algorithms - Motivation First Multithreaded RC • [DeTreville]: • Lock the heap for each pointer modification. • Each thread records its updates in a buffer. • Once in a while (snapshot-like): • The GC thread reads all buffers to update ref counts. • It reclaims all objects with rc 0 that are not local.

Levanoni Petrank Algorithms - Motivation To Summarize… • Overhead on the write barrier is considered high. • Even with the deferred RC of Deutsch & Bobrow. • Using reference counting concurrently with program threads seems to bear a high synchronization cost. • Lock or “compare & swap” for each pointer update.

Levanoni Petrank Algorithms On Improving the write-barrier overhead • Consider a pointer p that takes the following values between GCs: O0, O1, O2, …, On. • Out of 2n operations: O0.RC--; O1.RC++; O1.RC--; O2.RC++; O2.RC--; …; On.RC++ • Only two are needed: O0.RC-- and On.RC++ (Figure: pointer p moving across objects O0 to O4.)

The write barrier
Procedure Update(p: Pointer, new: Object)
  prev := *p
  if !Dirty(p) then
    log <p, prev>   // into local log buffer
    Dirty(p) := True
  *p := new
Timeline between collections: • p ← O1 (record p’s previous value O0) • p ← O2 (do nothing) • … • p ← On (do nothing) • Collection time: for each modified slot p: • Read p to get On, read the records to get O0 • O0.RC--, On.RC++
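The barrier and the collection-time adjustment can be sketched in Python as follows. Slot, update and apply_log are hypothetical names, and objects are represented by plain strings.

```python
class Slot:
    """A pointer slot with its dirty flag."""
    def __init__(self, value=None):
        self.value = value
        self.dirty = False

def update(slot, new, log):
    """Write barrier: log the old value only on the first store since the last collection."""
    prev = slot.value
    if not slot.dirty:
        log.append((slot, prev))     # into the (here: single) log buffer
        slot.dirty = True
    slot.value = new

def apply_log(log, rc):
    """Collection time: one decrement and one increment per modified slot."""
    for slot, prev in log:
        current = slot.value
        if prev is not None:
            rc[prev] -= 1
        if current is not None:
            rc[current] += 1
        slot.dirty = False
    log.clear()

rc = {"O0": 1, "O1": 0, "O2": 0, "O3": 0, "O4": 0}
p = Slot("O0")
log = []
for target in ["O1", "O2", "O3", "O4"]:
    update(p, target, log)
entries_logged = len(log)            # four stores, one log record
apply_log(log, rc)
```

Four stores would cost eight count updates with the eager scheme. Here they cost one log record and two count updates.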

Levanoni Petrank Algorithms The “Snapshot” (Concurrent) RC Algorithm: • Use write barrier with program threads. • Take a snapshot: • Stop all threads • Scan roots (locals) • get the buffers with modified slots • Clear all dirty bits. • Resume threads • Then run collector: • For each modified slot: • decrease rc for previous snapshot value (read buffer), • increase rc for current snapshot value (“read heap”), • Reclaim non-local objects with rc 0.

Levanoni Petrank Algorithms The General Picture • Record the modifications made to slots p1 … p7 between the heap at collection k and the heap at collection k+1. • Use the list of modifications to update the reference counts.

Levanoni Petrank Algorithms The “Snapshot” Tracing (Mark&Sweep) Collector • Use write barrier with program threads. • Take a snapshot: • Stop all threads • Scan roots (locals) • get the buffers with modified slots • Clear all dirty bits. • Resume threads • Then run collector: • Mark via current snapshot • foreach reachable slot s • if (!s.dirty) then • “read heap” • else • “read buffer” • recursively mark s value • Sweep all non-local objects which are not marked.
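The "read heap, else read buffer" rule can be sketched as below. This Python model makes two simplifying assumptions: a slot is a (object, field) pair, and a slot counts as dirty exactly when the post-snapshot log holds a record for it. All names are hypothetical.

```python
def snapshot_value(slot, heap, log):
    """Value the slot had at snapshot time.
    A slot dirtied after the snapshot has its snapshot-time value in the log;
    a clean slot still holds it in the heap."""
    return log[slot] if slot in log else heap[slot]

def mark_snapshot(roots, heap, log):
    """heap and log: {(obj, field): target or None}. Returns the marked set."""
    slots_of = {}
    for obj, field in heap:
        slots_of.setdefault(obj, []).append((obj, field))
    marked, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        for slot in slots_of.get(obj, []):
            target = snapshot_value(slot, heap, log)
            if target is not None:
                stack.append(target)
    return marked

# At snapshot time a.next pointed to b; a mutator has since stored c into it.
heap = {("a", "next"): "c", ("b", "next"): None, ("c", "next"): None}
log = {("a", "next"): "b"}           # record written by the write barrier
marked = mark_snapshot(["a"], heap, log)
```

The trace follows the snapshot-time edge from a to b even though the heap now says c. How objects like c are protected from the sweep is outside this sketch.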

Levanoni Petrank Algorithms Intermediate Concurrent Algorithm Properties: • Snapshot oriented, concurrent, (not so bad…) • Pause time: • Stop all threads • clear all dirty bits. • mark roots of all threads. • Pause time goal: • Stop one thread to mark its own local roots! • The goal: an on-the-fly algorithm with a low throughput cost.

Levanoni Petrank Algorithms Collecting On-the-fly - What if we stop one thread at a time? • Take a sliding view: • For each thread t • Stop t • Scan roots (locals) • get the buffers with modified slots • Resume t • Clear all dirty bits. • Then run collector: • For each modified slot: • decrease rc for previous snapshot value (read buffer), • increase rc for current snapshot value (“read heap”), • Reclaim non-local objects with rc 0. • Several problems to be solved…

Levanoni Petrank Algorithms The New Picture – using Sliding-Views • Read information from one thread at a time (while the other threads run): no snapshot. (Figure: slots p1 … p7 of the heap, the list of modifications, the sliding view of the heap at collection k and the sliding view of the heap at collection k+1.)

Levanoni Petrank Algorithms Danger in Sliding Views • Program does: P1 ← O; P2 ← O; P1 ← NULL • At one point during these stores the sliding view reads P2 (NULL); at another point it reads P1 (NULL). • Problem: reachability of O is not noticed! • Solution: if a pointer to O has been stored during the sliding view phase, do not reclaim O (and its descendants). (Figure: slots p1 … p7 of the heap.)

Levanoni Petrank Algorithms The Sliding Views Collector • Take a sliding view: • Start snooping • For each thread t • Stop t • Scan roots (locals) • get the buffers with modified slots • Resume t • Stop snooping • Clear all dirty bits. • Then run collector: • For each modified slot: • decrease rc for previous snapshot value (read buffer), • increase rc for current snapshot value (“read heap”), • Reclaim non-local objects with rc 0.
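The snooping rule can be illustrated with a small single-threaded Python replay of the "Danger in Sliding Views" scenario. SnoopingHeap is a hypothetical name, and the interleaving is chosen by hand to reproduce the problem.

```python
class SnoopingHeap:
    def __init__(self, slots):
        self.slots = dict(slots)
        self.snooping = False
        self.snooped = set()         # objects stored while the view is being taken

    def store(self, slot, new):
        if self.snooping and new is not None:
            self.snooped.add(new)    # remember the target; it must not be reclaimed
        self.slots[slot] = new

heap = SnoopingHeap({"P1": None, "P2": None})
view = {}

heap.snooping = True                 # start snooping
view["P2"] = heap.slots["P2"]        # sliding view reads P2 (NULL)
heap.store("P1", "O")                # program: P1 <- O
heap.store("P2", "O")                # program: P2 <- O
heap.store("P1", None)               # program: P1 <- NULL
view["P1"] = heap.slots["P1"]        # sliding view reads P1 (NULL)
heap.snooping = False                # stop snooping

seen_in_view = {v for v in view.values() if v is not None}
reclaimable = {"O"} - seen_in_view - heap.snooped
```

The view contains no pointer to O although P2 refers to it in the real heap. The snooped set is what keeps O out of the reclaimable set.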

Levanoni Petrank Algorithms Implementation for Java • Based on Sun’s JDK1.2.2 for Windows NT • Main features • 2-bit RC field per object (à la [Wise et al.]) • A custom allocator for on-the-fly RC • Benchmarks: • Server benchmarks • SPECjbb2000 --- simulates business-like transactions in a large firm • MTRT --- a multi-threaded ray tracer • Client benchmarks • SPECjvm98 --- a suite of mostly single-threaded client benchmarks

Levanoni Petrank Algorithms Improved RC - How many RC updates are eliminated?

Levanoni Petrank Algorithms SPECjbb – max pause time

Levanoni Petrank Algorithms SPECjbb Throughput

Levanoni Petrank Algorithms MTRT Throughput

This Work: Sliding Views Algorithms with Generations

This Work - Generational Algorithms Motivation • Investigate how generations integrate with reference-counting on a multiprocessor. • Tracing work is proportional to the amount of live objects, and by the weak generational hypothesis “many objects die young”. • RC does not depend on the amount of live space. The old generation has a high fraction of live objects. • The goal: Get larger throughput • Algorithms match their generations • Work is concentrated where garbage is. • Better locality, working set size is smaller. • Note: similar pauses expected.

This Work - Generational Algorithms Design issues: • Two generations • Two collection types – minor and full • Each object which has survived a collection is promoted • Simplify implementation • Lower overhead for Inter-Generational-Pointers handling. • The heap is partitioned logically • In an on-the-fly collector object copying is very difficult if not impossible. • An object is promoted by marking it as old.

This Work - Generational Algorithms Design issues: • Promotion is done by the collector • Collection triggering • A minor collection is triggered after every X bytes of allocation. • A full collection is triggered when the heap occupancy grows to more than Y%. • Two local buffers for each mutator: • “young-objects” buffer: pointers to new objects. • “old-objects” buffer. • The young generation processed by this cycle: • All local “young-objects” buffers from the previous cycle.
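The triggering rule can be written as a small decision function. The Python sketch below is hypothetical: the slide does not give values for X and Y, so the thresholds are parameters, and the numbers in the example calls are placeholders only. Checking the full-collection condition first is also an assumption.

```python
def collection_to_trigger(bytes_since_minor, heap_used, heap_size,
                          minor_threshold_bytes, full_threshold_percent):
    """Return "full", "minor" or None for the current allocation state.
    minor_threshold_bytes plays the role of X, full_threshold_percent of Y."""
    if 100 * heap_used > full_threshold_percent * heap_size:
        return "full"                # occupancy above Y%
    if bytes_since_minor >= minor_threshold_bytes:
        return "minor"               # X bytes allocated since the last minor cycle
    return None

MIB = 1024 * 1024
# Placeholder thresholds, chosen only to exercise the three outcomes.
quiet = collection_to_trigger(1 * MIB, 10 * MIB, 64 * MIB, 4 * MIB, 75)
minor = collection_to_trigger(4 * MIB, 10 * MIB, 64 * MIB, 4 * MIB, 75)
full = collection_to_trigger(1 * MIB, 60 * MIB, 64 * MIB, 4 * MIB, 75)
```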

This Work - Generational Algorithms Log modified objects instead of modified slots • Update(A.p1, C) • Update(A.p2, C) • Update(A.p2, D) (Figure: heap objects.)
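Logging at object granularity means the dirty flag lives on the object and the first store copies all of its slot values at once. A minimal Python sketch, with hypothetical names and the three updates from the slide (the initial slot values of A are made up for the example):

```python
class Obj:
    def __init__(self, name, **slots):
        self.name = name
        self.slots = dict(slots)
        self.dirty = False           # one flag per object, not per slot

def update(obj, field, new, log):
    """Write barrier that logs the whole object on its first modification."""
    if not obj.dirty:
        log.append((obj.name, dict(obj.slots)))   # copy of every slot value
        obj.dirty = True
    obj.slots[field] = new

log = []
A = Obj("A", p1="B", p2=None)        # assumed initial contents
update(A, "p1", "C", log)
update(A, "p2", "C", log)
update(A, "p2", "D", log)
```

Three stores to two different slots produce a single log record that holds the old value of every slot of A.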

This Work - Generational Algorithms “young-objects” buffer and “old-objects” buffer roles • Mutator code: 1. o1.next := new(256); 2. Update(*o1.next, o1); 3. Update(o1.next, o2). (Figure: timeline over cycles K, K+1 and K+2 showing the heap objects o1, o2 and new, and the mutator’s “old-objects” and “young-objects” buffers.)

This Work - Generational Algorithms Three On-the-fly Generational Algorithms • 1. Reference-Counting for both collections. • 2. Reference-Counting for the young (minor) collection, Tracing for the major collection. • 3. Reference-Counting for the major collection, Tracing for the minor collection. Expected to be the best.

This Work - Generational Algorithms Agenda • No time to present all algorithms • Only the major RC (the best) algorithm will be presented. • Go over several interesting difficulties: • Issues for major RC collections • Efficiently finding the Inter-Generational-Pointers • Preparing the buffers for the major reference-counting. • Issues for minor RC collections • Efficient promotion with minor RC. • Snoop selectively. • No need to accurately update all objects’ RCs.

This Work - Generational Algorithms Reference Counting for the Major collection algorithm Expected to be the best. Uses Mark and sweep for minor collections. Uses RC for the major collections.

The minor collection - mark&sweep • Take a sliding view: • Start snooping • For each thread t • Stop t • Scan roots (locals) • get the buffers with modified slots • Resume t • Stop snooping • Clear all dirty bits. • Then run collector: • Find the inter-generational-pointers, add them to the root set. • Mark via the sliding view • foreach reachable slot s • if (!s.dirty) then • “read heap” • else • “read buffer” • recursively mark s value • Sweep non-local, unmarked objects, promote survivors. • Prepare buffers for major collection

The major collection – RC • Take a sliding view: • Start snooping • For each thread t • Stop t • Scan roots (locals) • get the buffers with modified slots • Resume t • Stop snooping • Clear all dirty bits. • Then run collector: • For each modified slot (those in the current sliding-view buffers or in the prepared major buffers): • decrease rc for previous snapshot value (read buffer), • increase rc for current snapshot value (“read heap”), • Reclaim non-local objects with rc 0, promote survivors.

This Work - Generational Algorithms Issues for: Major RC collections • Young generation: • How do we find inter-generational-pointers (for the mark&sweep of the young generation) efficiently? • Provide the major RC collection with consistent buffers.

This Work - Generational Algorithms Inter-Generational-Pointers are “given for free” • Observation: Old objects that point to young objects - must have been modified since the previous collection, because young objects did not exist before. • Thus: all inter-generational pointers must be logged in “old-objects” local buffers. • Does this get all Inter-Generational Pointers? • Must note some race conditions due to the non-atomic sliding-view.

The first race – “intra sliding-view update” (Timeline figure with columns Collector, Mutator A and Mutator B over cycles K-1, K and K+1.) • Collector: take a sliding view. • One mutator cooperates: Stop, Mark-Roots, Read-Buffers, Resume. • A mutator executes p := new(16) and Update(o.next, *p). That new object is logged to the young generation of the next cycle. • The other mutator cooperates: Stop, Mark-Roots, Read-Buffers, Resume. • The inter-generational pointer is logged in the buffers of cycle K-1; these buffers won’t be available in this cycle!

The second race – “update before clear” (Timeline figure with columns Collector, Mutator A and Mutator B over cycles K-1, K and K+1.) • Collector: take a sliding view; each mutator cooperates: Stop, Mark-Roots, Read-Buffers, Resume. • A mutator executes p := new(16) and Update(o.next, *p); inside the update it reads o.next = x and reads o.Dirty = true. • Collector: Clear-Dirty-Marks. • An inter-generational pointer was created and not logged to any buffer! • If we do not traverse through the inter-generational pointer to the new object, it might be swept mistakenly in this cycle!

This Work - Generational Algorithms Solution to both races: • Record into “IGPs_buffer” all objects that are involved in an update to a young object during the following uncertainty period: • While taking the sliding view and until the end of the clear-dirty-marks. • The true inter-generational set is contained in the following set: • {Union over all mutators’ old-objects buffers} ∪ {IGPs_buffer}
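The containment claim suggests a simple way to build the extra roots for a minor collection: scan every object in the union and keep its pointers into the young generation. The Python sketch below is a hypothetical illustration; igp_roots and the data layout are assumptions, not the deck's code.

```python
def igp_roots(old_object_buffers, igps_buffer, slots, young):
    """Return (candidates, roots).
    candidates: union of all mutators' old-objects buffers and the IGPs_buffer.
    roots: young objects referenced from a candidate."""
    candidates = set(igps_buffer)
    for buffer in old_object_buffers:
        candidates.update(buffer)
    roots = set()
    for obj in candidates:
        for target in slots.get(obj, ()):
            if target in young:
                roots.add(target)    # old -> young pointer found
    return candidates, roots

slots = {"o1": ["y1"], "o2": ["o1"], "o3": ["y2"], "o4": ["o1"]}
young = {"y1", "y2"}
candidates, roots = igp_roots(
    old_object_buffers=[["o1"], ["o2"]],   # one buffer per mutator
    igps_buffer=["o3"],                    # recorded during the uncertainty period
    slots=slots,
    young=young,
)
```

The candidate set may contain objects with no young targets (o2 here), which is harmless: the set only needs to contain the true inter-generational set. The unmodified object o4 is never scanned.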

This Work - Generational Algorithms Full RC collection buffers preparation • The mutators log objects to their local “young-objects” and “old-objects” buffers. • The collector logs some of these logged objects to the “major-new-objects” buffer and to the “major-old-objects” buffer.

This Work - Generational Algorithms Which objects to log to the major buffers? • Only objects which will be alive in the next major collection. • Use an OldDirty flag (for each logged object) to avoid logging the same object multiple times. • Logging to the “major-new-objects” buffer • Log only objects which were promoted (this is known at the young-generation sweep phase). • No object’s children are logged (because the object did not exist in the previous major cycle, thus its children did not reference any object).

This Work - Generational Algorithms Logging to the “major-old-objects” buffer • The parent objects which are logged into the young generation “old-objects” buffers are old. (Why?) • Thus they can be logged to the major buffer. (They will survive.) • Their children may be swept, thus log only children which were promoted (only after the sweep phase).

This Work - Generational Algorithms Issues for: RC minor collection • Efficient promotion with minor RC collections • Reference-Counting for the young generation algorithm. Advantages: • The RC field may be inaccurate. • Selectively snoop only young objects.
