
Caching Considerations for Generational Garbage Collection



  1. Caching Considerations for Generational Garbage Collection Presented By: Felix Gartsman – 306054172 http://www.cs.tau.ac.il/~gartsma/seminar.ppt gartsma@post.tau.ac.il

  2. Introduction • Main theme: the effect of memory caches on GC performance • What is a memory cache? • How do caches work? • How and why do caches and GC interact? • Can we boost GC performance by knowing more about caches?

  3. Motivation • CPU and memory performance don't advance at the same speed • While the CPU waits for memory, it is idle • Solutions: pipelining, speculative execution, and caches • Caches provide fast access to commonly accessed memory

  4. Caches and GC • A two-way relationship: • Improving GC performance through “cache awareness” – minimizing the cache misses incurred by the collector • The GC improving the mutator's memory-access locality, minimizing the mutator's cache misses (not dealt with by the article)

  5. Previous Work (Outdated!) • Dealt mainly with the interaction with virtual memory systems • Paid no special attention to generational GC • Assumed “best/worst cases” or special hardware • Investigated only direct-mapped caches

  6. Article Contribution • Surveys GGC performance on various caches • Examines techniques for improving performance • Main advice: try to keep the youngest generation fully in cache; if that is impossible, prefer associative caches

  7. Roadmap • Caches in depth • The GC memory-reuse cycle • GGC as a better GC • Comparing cache size requirements • Comparing misses across cache types • Conclusions

  8. Cache in-depth • Memory hierarchy = cache hierarchy • A higher level means higher speed and smaller capacity • A miss at one level relays the handling to the level below • The hierarchy: Registers → L1 cache → L2 cache → (L3?) → Main memory → Virtual memory (disk)

  9. Motivation contd. • When a memory word is not in the cache, a “cache miss” occurs • A cache miss stalls the CPU and forces an access to main memory • Cache misses are expensive, and become more expensive with each new generation of CPUs • Memory access penalty on the P4: L1 – 2 cycles, L2 – 7 cycles, a miss – dozens of cycles, depending on the memory type
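
These penalties combine into an average memory access time (AMAT). A small worked example, using the slide's latencies plus assumed (not from the talk) miss rates of 5% in L1 and 20% in L2, and an assumed 100-cycle memory penalty:

    AMAT = t(L1) + m(L1) × (t(L2) + m(L2) × t(mem))
         = 2 + 0.05 × (7 + 0.20 × 100) = 3.35 cycles

Even small miss rates inflate the effective access time, and the memory term only grows as CPUs outpace memory.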

  10. Cache properties • Size (8–64 KB in L1, 128 KB–3 MB in L2, 6–8 MB in L3?) • Layout (block size and sub-blocks) • Placement (an N:M hash function from addresses to blocks) • Associativity • Write strategy: • Write-through or write-back • Fetch-on-write or write-around

  11. Cache Size • The bigger the better: too small a cache can make a fast CPU sluggish (the Intel Celeron, for example) • A bigger cache reduces cache misses • Constraints: • Physical feasibility (proximity, die area, heat) • Money (cost vs. performance)

  12. Cache Layout • Cache memory is divided into blocks called “cache lines” • Each line contains a validity bit, a dirty bit, replacement-policy bits, an address tag, and of course the data • Bigger blocks reduce misses when spatial locality is good, but hurt performance when working on multiple memory regions; they also take longer to fill

  13. Cache Layout contd. • The line-fill cost can be mitigated by dividing lines into sub-blocks and managing their validity separately, as in the sketch below
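
As a concrete picture, here is a minimal C sketch (ours, not the article's) of the per-line metadata described on slides 12–13, assuming a 16-byte line split into four sub-blocks, each with its own validity bit:

    #include <stdint.h>

    #define LINE_BYTES 16
    #define SUBBLOCKS  4                /* assumed sub-block count          */

    struct cache_line {
        uint32_t tag;                   /* address tag                      */
        uint8_t  valid;                 /* one validity bit per sub-block
                                           (low SUBBLOCKS bits used)        */
        uint8_t  dirty;                 /* dirty bit, for write-back caches */
        uint8_t  repl;                  /* replacement-policy bits          */
        uint8_t  data[LINE_BYTES];      /* the cached data itself           */
    };

On a miss, only the needed sub-block has to be filled and marked valid, which is how sub-blocking shortens line fills.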

  14. Cache Placement • Maps a memory address to a block number • Examples: • Address modulo #blocks • Select the middle bits of the address • Select some fixed set of bits • Must be fast and “hardware friendly” • Should map addresses uniformly (see the sketch below)
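
A minimal sketch of the first two placement hashes named above, assuming 16-byte lines and a power-of-two number of blocks; under those assumptions both compute the same index with a single mask, which is exactly what makes them fast and “hardware friendly”:

    #include <stdint.h>

    #define BLOCK_SZ 16                 /* bytes per cache line   */
    #define NBLOCKS  1024               /* assumed, power of two  */

    /* "address modulo #blocks" */
    static uint32_t place_mod(uint32_t addr) {
        return (addr / BLOCK_SZ) % NBLOCKS;
    }

    /* "select middle bits": skip the 4 offset bits, take the next 10 */
    static uint32_t place_mid(uint32_t addr) {
        return (addr >> 4) & (NBLOCKS - 1);
    }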

  15. Cache Associativity • Fully associative – an address can be cached in any block. All tags must be checked – slow or expensive. LRU replacement • Direct mapped – an address can be in only one block. Fast lookup, but no usage history • Set associative – an address can be in any block of a small set (2, 4, 8). A compromise: fast access and limited usage history

  16. Cache Write Strategy • Write-through – write directly to memory and, of course, update the cache (slow, but can use write buffers) • Write-back – write to the cache and mark the line dirty; flush to memory later. Very useful for multiple writes to nearby addresses (object initialization). Can also use write buffers (less useful there)

  17. Cache Write Strategy contd. • What to do on a write miss? • Fetch-on-write / write-allocate – on a miss, fetch the corresponding cache line and then treat the write as a hit • Write-around / write-no-allocate – write directly to memory • The usual pairings: write-back + write-allocate, write-through + write-no-allocate (illustrated below)
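
A toy, self-contained illustration of the two usual pairings, using a one-line “cache” (the model and names are ours, purely for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    static uint32_t mem[1024];            /* "main memory"        */
    static uint32_t line_addr, line_val;  /* a one-line "cache"   */
    static bool     line_valid, line_dirty;

    /* Write-back + write-allocate: a write miss fetches the line first;
       memory is updated only when a dirty line is evicted.              */
    static void wb_alloc_write(uint32_t a, uint32_t v) {
        if (line_valid && line_addr != a) {
            if (line_dirty) mem[line_addr] = line_val;  /* flush on evict */
            line_valid = false;
        }
        if (!line_valid) {                      /* fetch-on-write        */
            line_addr = a; line_val = mem[a];
            line_valid = true; line_dirty = false;
        }
        line_val = v; line_dirty = true;        /* now it is a write hit */
    }

    /* Write-through + write-around: memory is always written;
       the cache is updated only if the line is already present. */
    static void wt_around_write(uint32_t a, uint32_t v) {
        mem[a] = v;
        if (line_valid && line_addr == a) line_val = v;
    }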

  18. Modern memory usage • Object-oriented languages tend to create many small, short-lived objects. For example, the STL uses value semantics, which copies objects on every operation! • Functional languages (Lisp, Scheme) constantly create new objects that replace old ones (cons and friends…)

  19. Modern memory usage contd. • Creation is expensive – allocation comes with a probable write miss (a new address is being used). The article cites sources claiming functional languages write memory in up to 25% of their instructions (around 10% for others)

  20. Memory Recycling Pattern • GC'd systems tend to violate the locality assumptions caches rely on • Cyclic reuse of memory beats any caching policy: the reuse cycle is too long to be captured (see the sketch below) • GC'd systems become memory-bandwidth limited
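
Why cyclic reuse defeats LRU, as a sketch (sizes are illustrative, not the article's): sweep a region just larger than the cache, and every line is evicted shortly before its next use, so every access misses:

    #include <stdint.h>

    #define CACHE_BYTES (64 * 1024)
    #define REGION      (CACHE_BYTES + 4096)   /* just over cache size */

    static uint8_t heap[REGION];

    void reuse_cycle(void) {
        for (;;)                                       /* the cyclic reuse */
            for (uint32_t i = 0; i < REGION; i += 16)  /* one line apart   */
                heap[i]++;   /* under LRU this line was evicted just
                                before we got back to it: always a miss */
    }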

  21. Allocation is to blame, not GC • The locality of the GC process itself is not “the weakest link” • The problem is the fast allocation of memory that will be reclaimed only much later • Main memory fills up very fast. What to do? • Collect more often – too frequent, but avoids paging • Rely on virtual memory – touches many pages and causes paging

  22. Pattern Results • Allocation touches new memory and forces a page-in / page fetch (slow) • Why a fetch? The allocated memory was used previously; the OS doesn't know it is garbage, even though the allocation will overwrite it anyway • Informing the OS that no fetch is required speeds up execution (see the sketch below)
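
The article predates today's interfaces, but on Linux one way to tell the OS that a region is garbage, so its old contents need not be fetched back in, is madvise (a hedged sketch; the range must lie in a page-aligned mapping):

    #include <sys/mman.h>
    #include <stddef.h>

    void discard_dead_region(void *start, size_t len) {
        /* After MADV_DONTNEED, touching an anonymous mapping yields
           fresh zero-filled pages instead of paging the dead contents
           back in from disk.                                           */
        madvise(start, len, MADV_DONTNEED);
    }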

  23. Pattern Results contd. • When main memory is exhausted (or the process isn't allowed more pages), old pages must be evicted • Those pages are probably dirty – they must be written to disk • Even worse – the evicted page is the LRU one, so it is probably garbage! • Worst case: disk bandwidth = 2 × allocation rate, since each allocated page costs one page-in (fetching dead contents) plus one page-out (flushing an evicted dirty page)

  24. Another view • View the GC allocator as a co-process of the mutator • Each has its own locality of reference • The mutator probably has good spatial locality • The allocator marches linearly through memory • Allocation is cyclic (remember LRU)

  25. Compaction and Semi-Spaces • Compaction helps the mutator, but makes little difference to the allocator • It still marches through large memory areas • Semi-spaces bring extra trouble: tospace was probably evicted, and since every object's address changes, the cache is effectively flushed; the entire heap is marched through every second cycle

  26. Solution? • So LRU is bad – can we replace it? • We can, but it won't help much • Too much memory is touched too frequently • Allocator page faults dominate program execution! • Only holding the entire reuse cycle in memory will stop the paging

  27. Generational GC • Solution: touch less memory, less frequently • Divide the heap into generations • Collect only the young generation(s) – touching less memory • This eliminates the vast memory march – the memory reuse cycle is minimized (a sketch follows) • Paging is eliminated; what about the cache?
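
A minimal sketch of the core idea (our construction, not the paper's collector): bump-allocate in a small young generation and collect only it when it fills, shrinking the reuse cycle from the whole heap to the nursery:

    #include <stddef.h>
    #include <stdint.h>

    #define YOUNG_BYTES (256 * 1024)   /* small enough to fit in a cache */

    static uint8_t young[YOUNG_BYTES];
    static size_t  young_top;

    void minor_gc(void);               /* hypothetical: copies survivors
                                          into the old generation        */

    void *gc_alloc(size_t n) {
        n = (n + 7) & ~(size_t)7;      /* 8-byte alignment */
        if (young_top + n > YOUNG_BYTES) {
            minor_gc();                /* touches only the young gen     */
            young_top = 0;             /* nursery reused from the start  */
        }
        void *p = &young[young_top];
        young_top += n;
        return p;
    }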

  28. Generational GC variations • Can use a single space – immediate promotion • Can use semi-spaces – promote at will, at the expense of more memory

  29. Better Generational GC • Ungar: use a pair of semi-spaces plus a separate, dedicated creation space • The creation space is emptied and reused every cycle, while the semi-spaces alternate roles as the destination for survivors • The result: only a small part of each semi-space is touched, and new objects are created in a “hot” space in main memory (and maybe in cache) – see the sketch below
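
A sketch of Ungar's arrangement (layout only, copying elided; the names are ours, the 141 KB figure is from slide 35):

    #include <stdint.h>

    #define SPACE_BYTES (141 * 1024)

    static uint8_t creation[SPACE_BYTES];   /* all new objects born here */
    static uint8_t semi_a[SPACE_BYTES], semi_b[SPACE_BYTES];
    static uint8_t *from_space = semi_a;    /* survivors of past cycles  */
    static uint8_t *to_space   = semi_b;

    void scavenge(void) {
        /* ...copy live objects from creation[] and from_space
           into to_space (elided)... */
        uint8_t *t = from_space;            /* flip the semi-space roles */
        from_space = to_space;
        to_space   = t;
        /* creation[] is now empty and immediately reused, so it stays
           "hot"; only the small live part of a semi-space is touched.  */
    }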

  30. Cache Revised • Cache misses can be categorized as: • Capacity misses – the miss would occur no matter how the cache is organized • Conflict misses – the miss occurs because two (or more) addresses map to the same cache line (set) • Direct-mapped caches suffer from conflict misses the most – every miss evicts the one block with the same mapping

  31. Conflict Misses in-depth • The miss rate of a conflicting pair is roughly a minimum function: the pair misses at the rate of the less frequently accessed address • Example: both addresses map to the same line; the first is accessed every ms, the second every μs. The (double) miss occurs every ms • The rate depends on the usage frequency of the addresses not currently in the cache

  32. Minimizing the Conflict Miss Rate • Most non-GC'd systems are skewed – a few objects are accessed frequently, the rest rarely. If the frequent ones are placed well, the cache is efficient • If many blocks are accessed on an intermediate time scale, there are more misses, and more chances that the blocks interfere with each other • (Over-simplified, to aid understanding)

  33. Example • A program marches through memory while also doing normal activity, with a 16 KB cache • 2-way associative: the most frequently used blocks are never evicted • Direct mapped: a total flush every marching cycle • Conclusion: in DM a given line is remapped only half as often, but when it happens the result is painful (a full flush) • DM can't handle multiple access patterns at once (a toy reenactment follows)
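
A toy reenactment of this slide (our code, not the paper's simulator): a linear march interleaved with one hot block, on 16 KB of cache with 16-byte lines. The hot block is repeatedly evicted under direct mapping but survives under 2-way LRU:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* 16 KB, 16-byte lines: 1024 lines direct-mapped,
       or 512 sets x 2 ways set-associative.            */
    static uint32_t dm_tag[1024];    static bool dm_v[1024];
    static uint32_t sa_tag[512][2];  static bool sa_v[512][2];
    static uint8_t  sa_lru[512];

    static bool dm_access(uint32_t addr) {        /* true on miss */
        uint32_t line = addr >> 4, i = line & 1023, t = line >> 10;
        if (dm_v[i] && dm_tag[i] == t) return false;
        dm_v[i] = true; dm_tag[i] = t; return true;
    }

    static bool sa_access(uint32_t addr) {        /* true on miss */
        uint32_t line = addr >> 4, s = line & 511, t = line >> 9;
        for (int w = 0; w < 2; w++)
            if (sa_v[s][w] && sa_tag[s][w] == t) { sa_lru[s] = 1 - w; return false; }
        int v = sa_lru[s];                        /* evict the LRU way */
        sa_v[s][v] = true; sa_tag[s][v] = t; sa_lru[s] = 1 - v;
        return true;
    }

    int main(void) {
        long dm_hot = 0, sa_hot = 0;
        for (uint32_t march = 16; march < (1u << 22); march += 16) {
            dm_access(march); sa_access(march);   /* the linear march...     */
            dm_hot += dm_access(0);               /* ...interleaved with one */
            sa_hot += sa_access(0);               /* frequently used block   */
        }
        printf("hot-block misses: DM %ld, 2-way %ld\n", dm_hot, sa_hot);
        return 0;
    }

Under direct mapping the hot block misses every time the march passes its line (hundreds of times over the 4 MB march); under 2-way LRU it misses exactly once, because the march always lands in the other, least-recently-used way.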

  34. Experiments • A Scheme compiler instrumented with an integrated cache simulator • Executes millions of instructions, allocates megabytes • We'll present two programs: • The Scheme compiler itself • The Boyer benchmark – its objects live long and tend to be promoted

  35. Experiments contd. • Cache lines are 16 bytes wide • 3 collectors: • GGC with 2 MB spaces per generation – no promotion is ever done • GGC with 141 KB spaces per generation • #2 plus a 141 KB creation space (Ungar)

  36. Results (Capacity) – graph: LRU queue distance distributions for the three collectors

  37. Interpretation • The graph shows the LRU queue distance distribution • What does that mean? • The probability of a block being touched at a given position in the LRU queue • Equivalently: the probability of a block being touched, given how long ago it was last touched • Equivalently: the probability of a block being touched, given how many other blocks have been touched more recently
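
How such a distribution can be gathered, as a minimal sketch (ours, not the paper's instrumentation): for each access, record how many distinct blocks were touched since this block's last access; that count is its position in an LRU queue:

    #include <stdint.h>
    #include <string.h>

    #define MAXQ 4096                 /* assumed maximum depth tracked   */
    static uint32_t queue[MAXQ];      /* most recently used block first  */
    static int      qlen;
    long histogram[MAXQ + 1];         /* [MAXQ] counts first touches     */

    void touch(uint32_t block) {
        int pos = -1;
        for (int i = 0; i < qlen; i++)
            if (queue[i] == block) { pos = i; break; }
        if (pos >= 0) {
            histogram[pos]++;         /* would hit in any LRU cache of
                                         more than pos blocks            */
            memmove(&queue[1], &queue[0], (size_t)pos * sizeof queue[0]);
        } else {
            histogram[MAXQ]++;        /* cold: never seen before         */
            if (qlen < MAXQ) qlen++;
            memmove(&queue[1], &queue[0], (size_t)(qlen - 1) * sizeof queue[0]);
        }
        queue[0] = block;             /* move to the front of the queue  */
    }

The area to the left of position k is then exactly the hit count of a fully associative LRU cache holding k blocks, which is how the next slide reads the curves.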

  38. Interpretation contd. • The fourth queue position corresponds to 128 KB, the eighth to 256 KB (i.e., each position is a 32 KB granule) • For any given position: the area under the curve to its left counts cache hits, the area to its right counts misses • The curve's height at a point is the marginal increase in hits from enlarging the cache at that point

  39. Experiment Meaning • The first queue entries absorb most of the hits • Collector #1: • A dramatic drop • Beyond about the tenth position (320 KB) a larger cache buys nothing • Collectors #2 and #3: • A hump, peaking where memory starts to be recycled

  40. Experiment Meaning contd. • #2: recycling starts after 141 × 2 = 282 KB, so a cache of 300–400 KB should suffice • #3: the creation space is constantly recycled and only a small part of the other spaces is touched, so a cache of 200–300 KB should suffice

  41. Experiment Meaning contd. 2 • Boyer behaves differently • #3 is better than #2 by 30% • Capacity misses disappear once the cache is larger than the youngest generation

  42. Results (Collision) – graph: cache size vs. miss rate, for collector #3

  43. Interpretation • The graph plots cache size vs. miss rate • Results are shown only for collector #3

  44. Experiment Meaning • The associative cache shows a dramatic, almost linear drop in misses up to 256 KB (the cache then contains the whole youngest generation); beyond that, nothing interesting happens • Direct mapped is the same over the 16–90 KB interval, better over 90–135 KB, and much worse later on

  45. Experiment Meaning contd. • Why is DM better over that interval? • The cache is big enough to hold the creation area, but suffers interference from other blocks • In that range the associative cache's LRU evicts lines before they are reused, due to collisions • Later, the associative cache suffers only “re-fill” misses, while DM keeps suffering collision misses as well

  46. More Performance Notes • When the cache is too small, most evicted blocks are dirty and require expensive write-backs • Interference may also cause write-backs

  47. Conclusions • Caches are an important part of the modern computer • Garbage collectors reuse memory in cycles and often march through memory • LRU evicts dirty pages/cache lines, and the needless fetches are costly • GGC reuses a smaller area and reduces paging

  48. Conclusions contd. • The same idea applies to caches: hold the youngest generation in the cache entirely • Ungar's 3-space proposal reduces the required footprint by 30% • Excluding one small interval, associative caches outperform direct-mapped caches, which suffer collision misses

  49. Questions?

  50. The End
