
Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor


Presentation Transcript


  1. Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor • Chen-Yong Cher • Michael Gschwind

  2. Automatic Dynamic Garbage Collection • Offload to a coprocessor • performance benefit • host processor keeps running independently • BDW (Boehm-Demers-Weiser) implementation • offload the mark phase to an SPE • prefetch to avoid cache misses

  3. Contributions • local store (LS) and synchronization scheme for offloading the mark phase • coherency of liveness info between the host (PPE) and the SPU • MFC-based software cache (SW$) to improve performance • hybrid caching schemes for different data types • extensions of the SW$ and Capitulative Loads using MFC DMA

  4. Usefulness • application space is not the right place for such a job • explore an LS-based system • low latency • no overhead from cache coherence protocols

  5. BDW garbage collection • mark-(lazy)sweep • ambiguous roots • DFS over all reachable pointers on the heap • unmarked objects are de-allocated • lazy sweep avoids touching large amounts of cold data • preferable caching behavior
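The mark phase the slides build on is a plain depth-first traversal with an explicit mark stack. A minimal sketch in C, where obj_t, is_heap_ptr, and the fixed slot count are illustrative stand-ins for BDW's actual structures:

```c
/* Depth-first marking with an explicit mark stack, in the style of a
 * mark-(lazy)sweep collector. Sketch only: real BDW objects are not
 * laid out like this. */
#include <stdbool.h>

typedef struct obj { struct obj *slots[4]; bool marked; } obj_t;

extern bool is_heap_ptr(void *p);   /* assumed: ambiguous-root filter */

#define STACK_CAP 8192
static obj_t *mark_stack[STACK_CAP];
static int top;

void mark_from_root(obj_t *root)
{
    if (!is_heap_ptr(root) || root->marked) return;
    root->marked = true;
    mark_stack[top++] = root;
    while (top > 0) {
        obj_t *o = mark_stack[--top];
        for (int i = 0; i < 4; i++) {          /* scan each slot */
            obj_t *c = o->slots[i];
            if (is_heap_ptr(c) && !c->marked) {
                c->marked = true;
                if (top < STACK_CAP) mark_stack[top++] = c;
                /* mark-stack overflow handling omitted in this sketch */
            }
        }
    }
}
```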

  6. Initialization • PPE • offloads the marking phase to the SPE • sends the effective address to the SPE (mailbox) • SPE • indicates it is ready to receive data • syncs on completion of the PPE transfer • the MFC translates the effective address to a real address (segment tables, page tables) • initiates the DMA transfer (preferable) • between system memory and the SPE LS (sketched below)
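A minimal sketch of the SPE side of this handshake, using the Cell SDK channel intrinsics from spu_mfcio.h; the buffer size, tag number, and the two-message address split are illustrative choices, not the paper's exact protocol:

```c
/* SPE side: signal readiness, receive the effective address via the
 * inbound mailbox, then pull a block into local store with a tagged
 * MFC DMA. BLOCK_SIZE and TAG are illustrative. */
#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK_SIZE 16384            /* 16 KB: the maximum single DMA */
#define TAG 3

static volatile uint8_t ls_buf[BLOCK_SIZE] __attribute__((aligned(128)));

int main(void)
{
    spu_write_out_mbox(1);              /* tell the PPE we are ready   */
    uint32_t hi = spu_read_in_mbox();   /* PPE mails the EA in two     */
    uint32_t lo = spu_read_in_mbox();   /* 32-bit halves               */
    uint64_t ea = ((uint64_t)hi << 32) | lo;

    /* The MFC translates ea through the segment/page tables. */
    mfc_get(ls_buf, ea, BLOCK_SIZE, TAG, 0, 0);
    mfc_write_tag_mask(1u << TAG);
    mfc_read_tag_status_all();          /* block until the DMA lands   */

    /* ... mark phase would run over ls_buf here ... */
    return 0;
}
```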

  7. Use of Local Store (256 KB total) • 20 KB instruction image (GC code) • 128 KB of software cache • 40 KB of header cache • 32 KB of local mark stack • small activation record stack

  8. Porting GC • porting BDW to the SPE • application heap traversal (bulk of execution) • sync between PPE and SPE via mailbox (small data) • BDW data structures • mark stack • heap blocks • header blocks • traverse only live references (locality optimization)

  9. Porting GC cont. • pointer chasing • poor locality on hardware caches and prefetchers • operand buffers to improve locality • fetch entire blocks, not just the reference • object cache (for header blocks) • temporary store of records • hashed by system-memory effective address • naive implementation serves as the baseline (sketched below)
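A sketch of such an object cache: direct-mapped, keyed by a hash of the effective address, with a DMA refill on a tag mismatch. HDR_SETS, hdr_t, and dma_fetch are illustrative names rather than the paper's interface, and a valid bit is omitted for brevity:

```c
#include <stdint.h>

#define HDR_SETS 256                 /* illustrative; roughly 40 KB of lines */

typedef struct { uint64_t ea; uint8_t bytes[144]; } hdr_t;

static hdr_t hdr_cache[HDR_SETS];

extern void dma_fetch(void *ls, uint64_t ea, uint32_t size); /* assumed MFC wrapper */

hdr_t *hdr_lookup(uint64_t ea)
{
    uint32_t set = (uint32_t)(ea >> 7) & (HDR_SETS - 1);  /* simple hash */
    hdr_t *line = &hdr_cache[set];
    if (line->ea != ea) {            /* tag mismatch: miss, refill the line */
        dma_fetch(line->bytes, ea, sizeof line->bytes);
        line->ea = ea;
    }
    return line;
}
```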

  10. Software caches • for non-homogeneous and non-contiguous data structures • exploit temporal and spatial locality • significant overheads on the SPU • access latency to compute and compare tags • possible cache miss • locating the data buffer • a poor match for regular and predictable access patterns • useful for large data sets • statistical locality • adjustable in size

  11. Software caches cont. • hybrid design (caching strategy) • SW$ + operand buffer • partition blocks between SW$ and OPB • use the SW$ for small heap blocks • reduces hit latency • keeps references with dense spatial locality out of the cache • intelligent tuning of code generation • take advantage of Cell SPE ISA features • ILP/DLP (not covered)
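The dispatch itself can be as simple as a size test. A hedged sketch, where the cut-off and the two helpers are assumptions rather than the paper's interface:

```c
/* Hybrid caching dispatch: small heap blocks go through the software
 * cache, large blocks are staged whole into an operand buffer. */
#include <stdint.h>

#define SMALL_BLOCK 256              /* bytes; illustrative cut-off */

extern void *swcache_get(uint64_t ea, uint32_t size);   /* assumed SW$ lookup */
extern void *opbuf_fetch(uint64_t ea, uint32_t size);   /* assumed bulk DMA  */

void *fetch_block(uint64_t ea, uint32_t size)
{
    /* Small blocks benefit from the SW$'s short hit path; large blocks
     * with dense spatial locality bypass it via an operand buffer. */
    return (size <= SMALL_BLOCK) ? swcache_get(ea, size)
                                 : opbuf_fetch(ea, size);
}
```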

  12. Prefetching • large sets with poor locality • techniques to hide memory latency • Boehm • uses the mark stack for prefetching • CHV and Capitulative Loads • use 8-16 entry FIFO buffers • CHV • exploits parallel branches • not a strict DFS traversal • Capitulative Loads • change the access ordering • a demand load on a cache hit, a prefetch on a miss
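A sketch of that hit-or-prefetch behavior layered on a software cache; all three helpers are assumed wrappers, not an existing API:

```c
/* Capitulative-load flavor of access reordering: behave as a demand
 * load on a (software-)cache hit, but on a miss start the transfer,
 * defer the object to a small FIFO, and move on. */
#include <stdint.h>
#include <stdbool.h>

extern bool swcache_probe(uint64_t ea, void **ls_out);  /* hit test        */
extern void swcache_prefetch(uint64_t ea);              /* async DMA fetch */
extern void defer_fifo_push(uint64_t ea);               /* 8-16 entry FIFO */

/* Returns the LS copy on a hit, or NULL after scheduling a prefetch. */
void *capitulative_load(uint64_t ea)
{
    void *ls;
    if (swcache_probe(ea, &ls))
        return ls;                  /* hit: an ordinary demand load      */
    swcache_prefetch(ea);           /* miss: start the transfer ...      */
    defer_fifo_push(ea);            /* ... and revisit this object later */
    return 0;
}
```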

  13. Prefetching cont. • advantage of Cell • prefetch initiated under application control by the DMA engine • efficient for regular, dense data sets (e.g., matrix tiling) • GC, with irregular data patterns and unpredictable locality, is the antithesis • prefetch heap blocks for pointer scanning • locality: addresses fall within heap-block bounds
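Because scanned addresses stay within heap-block bounds, whole blocks can be double-buffered: scan block N while block N+1 is already in flight under a second MFC tag. A sketch with Cell SDK intrinsics; the block size, tag numbers, and scan_block hook are illustrative:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define BLK 4096

static volatile uint8_t buf[2][BLK] __attribute__((aligned(128)));

extern void scan_block(volatile uint8_t *b, uint32_t n);  /* assumed marker */

void scan_blocks(uint64_t ea, int nblocks)
{
    mfc_get(buf[0], ea, BLK, 0, 0, 0);                /* prime buffer 0 */
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1;
        if (i + 1 < nblocks)                          /* prefetch next  */
            mfc_get(buf[cur ^ 1], ea + (uint64_t)(i + 1) * BLK,
                    BLK, cur ^ 1, 0, 0);
        mfc_write_tag_mask(1u << cur);                /* wait for current */
        mfc_read_tag_status_all();
        scan_block(buf[cur], BLK);
    }
}
```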

  14. Prefetching cont. • differences between prefetching on conventional processors vs. processors with an LS • addresses are binding • granularity of prefetching • cost of misspeculation • cost of DMA • early/late arrival, replacement • not enough work to overlap and hide the miss latency • suffers from pollution effects and overheads • overall, a virtual tie

  15. Data coherence and consistency management • sync between PPE and SPE is necessary • application and control code on the PPE • bulk of the mark phase on the SPE • data are copied • consistency must be maintained on updates • header blocks • lookup indices • application heap

  16. Data coherence and consistency management cont. • scheme based on data usage • SPE -> PPE only when the SPE has completed its work • data sync necessary for everything except the mark stack • coherency and sync handled via the MFC mailbox • to schedule a mark operation • the PPE sends a descriptor with part of the mark stack and the parameters • the SPE sends the parameters back via DMA (as an ACK) and the mark phase begins (sketched below)
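The PPE side of that handshake might look as follows with libspe2; the descriptor layout here is an illustrative guess, and only spe_in_mbox_write is the real API:

```c
#include <libspe2.h>
#include <stdint.h>

/* Descriptor handed to the SPE; this layout is an illustrative guess. */
typedef struct {
    uint64_t mark_stack_ea;   /* portion of the mark stack to process */
    uint32_t n_entries;
    uint32_t pad;
} __attribute__((aligned(128))) gc_desc_t;

void schedule_mark(spe_context_ptr_t spe, gc_desc_t *desc)
{
    /* Split the 64-bit effective address of the descriptor across two
     * 32-bit mailbox messages; the SPE DMAs the descriptor back as ACK. */
    uint64_t ea = (uint64_t)(uintptr_t)desc;
    unsigned int msg[2] = { (unsigned int)(ea >> 32), (unsigned int)ea };
    spe_in_mbox_write(spe, msg, 2, SPE_MBOX_ALL_BLOCKING);
}
```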

  17. Conclusions • data-reference-optimized strategy • performance gain of 400%-600% • local-store-based data prefetch • exploits the local environment • software-controlled cache for garbage collectors • a viable and competitive solution • offload to a coprocessor, increase utilization

  18. A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures • Jason Mars • Daniel Williams • Dan Upton • Sudeep Ghosh • Kim Hazelwood

  19. Software dynamic prefetchers • identify complex patterns • high application overheads • unobtrusive prefetcher • takes advantage of a neighbor's underutilized cores • snooping, profiling, pattern detection, prefetching

  20. URPref • a neighboring idle core observes miss patterns • analyzes miss patterns • continuously profiles and adapts • uses Sequitur • pattern detection • performs the prefetch • first identifies the prefix of a hot stream • reactive • triggered by a high cache miss rate • targets pointer chasing (complex and difficult for hardware)

  21. URPref cont. • contributions • profile cache misses, detect patterns, prefetch • propose hardware extensions • snooping, etc. • pattern-based approach to detect miss patterns and adapt to phases

  22. URPref Support • FIFO snoopy buffer (sketched below) • profiling on a separate core • two additional ISA instructions • the OS must be aware of URPref • resume, halt, suspend, wait, late start • linear-time pattern detection algorithm over the cache-miss stream • Sequitur
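A software model of the proposed FIFO snoopy buffer: the observed core's miss addresses are pushed into a ring that the profiling core drains. The depth and names are illustrative, not the paper's hardware parameters:

```c
#include <stdint.h>
#include <stdbool.h>

#define SNOOP_CAP 64                 /* illustrative depth */

typedef struct {
    uint64_t addr[SNOOP_CAP];
    uint32_t head, tail;             /* head: next pop, tail: next push */
} snoop_fifo_t;

static bool snoop_push(snoop_fifo_t *f, uint64_t miss_addr)
{
    uint32_t next = (f->tail + 1) % SNOOP_CAP;
    if (next == f->head) return false;      /* full: drop the miss */
    f->addr[f->tail] = miss_addr;
    f->tail = next;
    return true;
}

static bool snoop_pop(snoop_fifo_t *f, uint64_t *out)
{
    if (f->head == f->tail) return false;   /* empty */
    *out = f->addr[f->head];
    f->head = (f->head + 1) % SNOOP_CAP;
    return true;
}
```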

  23. Detecting Hot Streams with Sequitur • hot cache-miss streams • length of the miss stream • frequency in the input sample window • often a small portion accounts for a large percentage of misses • the snoopy buffer sends each new miss to Sequitur • determine whether it forms a prefix • if it matches, fetch the remainder of the hot stream into the neighboring L1 cache

  24. Detecting Hot Streams with Sequitur cont. • profile window • a series of cache misses • data miss stream • a sequence of data-cache miss addresses that repeats in a profile window • hot stream • covers a given percentage of the profiling window • Sequitur builds a context-free grammar of the cache-miss patterns • each production represents some sequence repeated more than once

  25. Sequitur • hottest streams are used for prefetching • hotness = uses x misses covered • # of times a rule is used in the grammar (cold uses) • # of terminal symbols • detecting patterns over • actual cache addresses • deltas

  26. Complex patterns

  27. Using Sequitur • determine the data between adjacent misses • hotness = length x cold uses • length: sum of the # of terminals for a given non-terminal • cold uses: # of times the non-terminal appears in the grammar • prefix • the first few symbols of the hot stream (see the sketch below)
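Ranking rules by that hotness metric is straightforward. A sketch where rule_t is an illustrative stand-in for a Sequitur grammar rule:

```c
/* hotness = length x cold uses, where length is the rule's terminal
 * count and cold uses the number of times the rule appears. */
#include <stdint.h>

typedef struct {
    uint32_t n_terminals;   /* misses covered by this rule        */
    uint32_t cold_uses;     /* appearances of the rule in grammar */
} rule_t;

static uint64_t hotness(const rule_t *r)
{
    return (uint64_t)r->n_terminals * r->cold_uses;
}

/* Pick the hottest rule as the next candidate hot stream. */
static int hottest(const rule_t *rules, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (hotness(&rules[i]) > hotness(&rules[best]))
            best = i;
    return best;
}
```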

  28. Using Sequitur cont. • fill a window of misses from the snoopy buffer • dynamic size -> normalizes the # of hot streams • keep track of the last 4 misses • active hot-stream table • receive misses from the snoopy buffer • create the prefix window • search the active hot-stream table • if a match is found, prefetch the stream (sketched below)
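A sketch of that prefix match: when the window of the last four misses matches a stream's prefix, the remainder of the stream is prefetched. Table sizes, the tail bound, and issue_prefetch are illustrative assumptions:

```c
#include <stdint.h>
#include <string.h>

#define PREFIX_LEN 4
#define MAX_TAIL   60

typedef struct {
    uint64_t prefix[PREFIX_LEN];
    uint64_t tail[MAX_TAIL];        /* remainder to prefetch on a match */
    uint32_t tail_len;
} hot_stream_t;

extern void issue_prefetch(uint64_t addr);   /* assumed prefetch hook */

void on_miss(uint64_t window[PREFIX_LEN],    /* last 4 misses, oldest first */
             hot_stream_t table[], int n_streams)
{
    for (int s = 0; s < n_streams; s++) {
        if (memcmp(window, table[s].prefix, sizeof table[s].prefix) == 0) {
            for (uint32_t i = 0; i < table[s].tail_len; i++)
                issue_prefetch(table[s].tail[i]);
            return;                 /* prefetch the rest of the stream */
        }
    }
}
```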

  29. Prefetching • new hot streams are added to the table • hot streams no longer active are retired • if no hot streams are found • the dynamic prefetcher remains dormant • when hot streams become cold • the dynamic prefetcher stops • avoids a conservative approach • application code doesn't change • no prefetch instructions are added • no overhead • no sync between the prefetcher and the application core is needed • latency is not respected when the tail of a stream is prefetched

  30. Prefetcher: Adaptive Response • phase change • avoid cache pollution • incorrect prefetching • achieve maximum performance • continuous profiling (one window of latency) • prefetching uses prefixes from the previous window • e.g., previous window's hot stream ABCDEF vs. new pattern ABCGHI: the ABC prefix still matches, but DEF would be prefetched incorrectly until the profile adapts
