
Performance and Energy Implications of Many-Core Caches for Throughput Computing


Presentation Transcript


  1. Performance and Energy Implications of Many-Core Caches for Throughput Computing C. J. Hughes, C. Kim, Y. Chen, Intel Labs. IEEE Micro, 2010. Presented by 2013010654 유승요

  2. Throughput Computing
  • Throughput computing
    • Focuses on maximizing the throughput of workloads rather than their latency
    • Performs huge numbers of calculations in parallel
    • A good fit for many-core processors
  • To keep many cores busy
    • The memory system must feed the cores and facilitate efficient core-to-core communication
  • By using caches
    • Hide the latency of lower levels of the memory system
    • Provide fast core-to-core communication
    • Supply sufficient bandwidth

  3. Research Objective
  • Many-core cache design for throughput computing
    • Considering both power and performance
  • Different from traditional CPU cache design
    • More cores  more inter-core communication
    • Latency tolerant  minimizing average access time may not be the best goal

  4. Throughput Applications

  5. Throughput Applications
  • Working set size
    • The model uses a 256-KB L2 cache  benchmarks with larger working sets may be slowed by the L2 cache size
  • Cache miss rate
    • The L1 cache miss rate; a high miss rate means a high access rate to the lower-level caches
  • Prefetch coverage
    • Percentage reduction in L1 misses when a stride prefetcher is added; high coverage indicates a strong streaming access pattern
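
A minimal sketch of how the last two metrics can be derived from simulation event counters; the counter names and example numbers are illustrative assumptions, not from the paper:

```cpp
// Sketch: deriving the slide's workload metrics from simple event
// counters. Counter names and values are illustrative, not the paper's.
#include <cstdint>
#include <cstdio>

struct L1Counters {
    uint64_t accesses;                // total L1 accesses
    uint64_t misses;                  // L1 misses, no prefetcher
    uint64_t misses_with_prefetcher;  // L1 misses with the stride prefetcher on
};

// L1 miss rate: high values mean frequent lower-level cache accesses.
double miss_rate(const L1Counters& c) {
    return static_cast<double>(c.misses) / c.accesses;
}

// Prefetch coverage: fractional reduction in L1 misses when a stride
// prefetcher is added; high coverage indicates streaming access patterns.
double prefetch_coverage(const L1Counters& c) {
    return 1.0 - static_cast<double>(c.misses_with_prefetcher) / c.misses;
}

int main() {
    L1Counters c{1'000'000, 80'000, 12'000};  // made-up example numbers
    std::printf("miss rate: %.3f, prefetch coverage: %.3f\n",
                miss_rate(c), prefetch_coverage(c));
}
```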

  6. Throughput Applications
  Data-sharing characteristics: percentage of L2 cache lines and of L2 cache accesses to data shared by a given number of cores
  • Sharing degree
    • The number of cores that access a line
  • Spatial domain
    • Most data is private: 12 out of 15 benchmarks have more than 70% unshared lines
  • Frequency domain
    • Sharing percentages are larger than in the spatial domain, i.e., shared lines are accessed disproportionately often
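
To make the spatial-versus-frequency distinction concrete, here is a small sketch computing both views from a toy access trace; the trace format and values are assumptions for illustration. The frequency percentages come out larger for shared data because each line is weighted by how often it is accessed rather than counted once:

```cpp
// Sketch: the two sharing-degree views on one trace. Spatial weights
// each cache line once; frequency weights a line by its access count.
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <vector>

struct Access { uint64_t line; int core; };

int main() {
    std::vector<Access> trace = {{0xA,0},{0xA,1},{0xA,1},{0xB,0},{0xB,0}};

    std::map<uint64_t, std::set<int>> sharers;  // line -> cores touching it
    std::map<uint64_t, uint64_t> hits;          // line -> access count
    for (const auto& a : trace) { sharers[a.line].insert(a.core); ++hits[a.line]; }

    std::map<size_t, double> spatial, frequency;  // sharing degree -> fraction
    for (const auto& [line, cores] : sharers) {
        spatial[cores.size()]   += 1.0 / sharers.size();
        frequency[cores.size()] += static_cast<double>(hits[line]) / trace.size();
    }
    for (const auto& [deg, f] : spatial)
        std::printf("degree %zu: %.0f%% of lines, %.0f%% of accesses\n",
                    deg, 100 * f, 100 * frequency[deg]);
}
```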

  7. Cache Design
  • Constraints
    • Two-level caching
      • A private L1 cache for each core
      • A last-level cache (LLC)
    • Directory-based hardware cache coherence
      • Each entry contains a tag and directory state information
    • Tiled processor design
  • Flexibility: the key design criterion
    • Determines where a given line can reside in the LLC
    • Affects the distance an LLC request and reply must travel through the on-die network
  • Flexibility affects
    • Access latency?  a more flexible design is better
    • On-die bandwidth usage?  a more flexible design is better
    • Number of unique lines?  a less flexible design is better
    • Off-die bandwidth usage?  a less flexible design is better
    • Effective cache capacity?  a less flexible design is better
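
A sketch of what a tag-directory entry might look like for a 64-tile design, based on the slide's statement that an entry holds a tag plus directory state; the 64-bit tag, the bit-vector sharer encoding, and the state set are assumptions, not the paper's exact layout:

```cpp
// Sketch: a tag-directory entry with a tag, directory state, and one
// presence bit per tile. Field widths are illustrative assumptions.
#include <bitset>
#include <cstdint>

enum class DirState : uint8_t { Invalid, Shared, Exclusive, Modified };

struct DirectoryEntry {
    uint64_t tag;             // identifies the tracked cache line
    DirState state;           // coherence state of the line
    std::bitset<64> sharers;  // which tiles hold a copy
};

// A line is concurrently shared when more than one presence bit is set.
bool multiply_shared(const DirectoryEntry& e) { return e.sharers.count() > 1; }
```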

  8. Cache Design – Private
  • Places a core's data close to that core
  • Private LLC
    • Fewer unique cache lines
  • Home tile
    • Chosen by an address-hashing function
    • The tag directory in the home tile tracks the line
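
A minimal sketch of the home-tile mapping: the line address is hashed to pick a tile. The specific hash, the 64-byte line size, and the 64-tile count are assumptions, not the paper's exact function:

```cpp
// Sketch: choosing a line's home tile by hashing its address, per the
// slide's "address hashing function".
#include <cstdint>

constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kNumTiles  = 64;

uint64_t home_tile(uint64_t addr) {
    uint64_t line = addr / kLineBytes;                       // drop offset bits
    return (line ^ (line >> 6) ^ (line >> 12)) % kNumTiles;  // spread lines over tiles
}
```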

  9. Private Design Variations
  • Replication policy
    • Controls the number of unique cache lines
    • Uncontrolled replication, controlled replication, or no replication
  • Uncontrolled replication
    • Allows unlimited replication
    • When the unique copy of a line is evicted and it is not in its home tile, it is moved to the home tile (migration)
  • Controlled replication
    • Allows replication, but deprioritizes replicas via the replacement policy
    • Each line carries a reuse bit that follows the line when it is evicted or transferred to another LLC
    • When a new line is inserted, a line whose reuse bit is 0 is evicted first
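
A sketch of the controlled-replication victim choice just described; the set/way layout and the LRU fallback are assumptions, and only the reuse-bit preference comes from the slide:

```cpp
// Sketch: within a full set, prefer to evict a line whose reuse bit
// is still 0, deprioritizing replicas that were never reused.
#include <cstdint>
#include <vector>

struct Line {
    uint64_t tag   = 0;
    bool valid     = false;
    bool reuse_bit = false;  // set on reuse; travels with the line on
                             // eviction or transfer to another LLC
};

// Returns the way to evict when inserting into a set.
size_t pick_victim(const std::vector<Line>& set) {
    for (size_t w = 0; w < set.size(); ++w)
        if (!set[w].valid) return w;      // free slot first
    for (size_t w = 0; w < set.size(); ++w)
        if (!set[w].reuse_bit) return w;  // deprioritized, never-reused line
    return 0;                             // fall back (e.g., LRU way)
}
```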

  10. Private Design Variations
  • No replication
    • Shared lines live in their home tile
    • Private lines live in the accessing core's tile
  • The LLC controller also works as a directory controller
    • Tracks private lines with a Roaming Data Pointer (RDP)
    • When a line becomes shared, the controller removes its RDP entry and migrates the line to its home tile
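
A sketch of the RDP bookkeeping at the directory controller, under the assumption that an RDP entry simply maps a line to the tile currently holding it; the data structures and interface are illustrative:

```cpp
// Sketch: private lines live in the accessing core's tile and are
// tracked by an RDP entry; once a second tile touches the line, the
// entry is dropped and the line migrates to its home tile.
#include <cstdint>
#include <unordered_map>

struct Directory {
    std::unordered_map<uint64_t, int> rdp;  // line -> tile currently holding it

    void on_access(uint64_t line, int tile) {
        auto it = rdp.find(line);
        if (it == rdp.end()) {
            rdp[line] = tile;       // first toucher: private, track it
        } else if (it->second != tile) {
            rdp.erase(it);          // now shared: drop the RDP entry
            migrate_to_home(line);  // and move the line to its home tile
        }
    }
    void migrate_to_home(uint64_t /*line*/) { /* issue migration (not modeled) */ }
};
```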

  11. Shared Design
  • Shared
    • Keeps every line in its home tile
    • Maximizes the number of unique lines
    • Increases average access latency and on-die traffic

  12. Experimental Setup
  • L1 caches with a hardware stride prefetcher
  • Ring interconnect with 64 switches
  • Energy components
    • LLC, tag directory, and RDP accesses
    • On-die data messages
    • On-die coherence messages
    • Off-die accesses
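
A sketch of the kind of per-event energy accounting these components imply: total energy as a weighted sum of event counts. The per-event energies below are placeholder values, not the paper's calibrated numbers:

```cpp
// Sketch: the evaluation's energy model as a weighted sum of event
// counts, one weight per component listed above.
#include <cstdint>

struct EventCounts {
    uint64_t llc_accesses;
    uint64_t tag_dir_accesses;
    uint64_t rdp_accesses;
    uint64_t on_die_data_msgs;
    uint64_t on_die_coherence_msgs;
    uint64_t off_die_accesses;
};

double total_energy_nj(const EventCounts& e) {
    return 0.5  * e.llc_accesses           // LLC data-array access
         + 0.2  * e.tag_dir_accesses       // tag-directory lookup
         + 0.2  * e.rdp_accesses           // RDP lookup
         + 1.0  * e.on_die_data_msgs       // on-die data message
         + 0.3  * e.on_die_coherence_msgs  // on-die coherence message
         + 20.0 * e.off_die_accesses;      // off-die access dominates per event
}
```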

  13. Performance and Energy Consumption
  (a) Performance of the 5 LLC designs relative to shared

  14. Performance
  • The least flexible design performs best
    • Flexible designs are intended to minimize access latency
    • But in throughput computing, cache-miss latency is hidden via multithreading or prefetching
  • Critical path  heavily read-write shared lines
    • Least flexible designs  centralized storage
      • A response from the home tile requires no acknowledgment
    • Flexible designs  private caches only
      • Do not allow multiple simultaneous cache-to-cache transfers
      • The tag directory needs an acknowledgment from the responding tile
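
As a rough illustration of the critical-path difference, the sketch below counts on-die protocol messages for a read to a line cached in another core's region; the exact message sequences are simplified assumptions, not the protocol from the paper:

```cpp
// Sketch: message counts on the critical path of a read under each style.
#include <cstdio>

// Shared design: the home tile holds the data and replies directly.
int messages_shared() { return 2; }   // request -> home tile, data reply

// Private design: the tag directory must forward the request to the
// owning tile and collect an acknowledgment.
int messages_private() { return 4; }  // request -> tag directory,
                                      // forward -> owner, data reply,
                                      // ack -> tag directory

int main() {
    std::printf("shared: %d messages, private: %d messages\n",
                messages_shared(), messages_private());
}
```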

  15. Energy Consumption
  • The more flexible designs are better
    • They save on-die traffic
    • The increased off-die traffic has only a small effect
  • Unique-line policies in the private designs
    • Controlled replication and no replication
    • Reduce off-die accesses
  Energy consumption of the 5 LLC designs relative to shared. P: private, U: uncontrolled replication, C: controlled replication, N: no replication, S: shared, T: tag directory buffer

  16. Designing for Performance and Energy
  • Tag Directory Buffer (TDB)
    • A small fully associative buffer that holds clean lines
    • Handles read requests for clean lines
      • Acts like the shared design for those lines
    • Adds a bit to each line's directory entry
      • To detect concurrent sharing
    • Saves space and traffic
  Tag directory buffer hit rates for different buffer sizes
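
A sketch of a tag directory buffer as a small fully associative structure that serves reads to clean, concurrently shared lines directly; the LRU management and the interface below are assumptions beyond what the slide states:

```cpp
// Sketch: a small fully associative, LRU-managed buffer beside the tag
// directory; a hit answers a clean-line read with no forwarding or ack.
#include <cstdint>
#include <list>
#include <unordered_map>

class TagDirBuffer {
    size_t capacity_;
    std::list<uint64_t> lru_;  // most recently used at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
public:
    explicit TagDirBuffer(size_t capacity) : capacity_(capacity) {}

    // Serve a read for a clean line if buffered.
    bool read_hit(uint64_t line) {
        auto it = map_.find(line);
        if (it == map_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
        return true;
    }

    // Insert a clean line the directory flagged as concurrently shared.
    void insert_clean(uint64_t line) {
        if (read_hit(line)) return;        // already buffered
        if (lru_.size() == capacity_) {    // evict the LRU line
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(line);
        map_[line] = lru_.begin();
    }
};
```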

  17. Alternatives
  • Sharing migration
    • Copies read-shared lines to their home LLC tiles
    • Also needs an acknowledgment from the home tile
  • Parallel reads
    • Modifies the coherence protocol and directory hardware to allow simultaneous transfers
    • Does not increase data traffic
    • Requires changing the protocol and hardware
    • Still needs cache-to-cache transfers
    • Slower than the tag directory buffer

  18. Impact of Increased Read Parallelism
  (a) Performance and (b) energy consumption for designs that attempt to increase read parallelism

  19. Impact of Increased Read Parallelism
  • Tag Directory Buffer
    • Faster, but increases energy
    • Copying data to the home tile means a data reply from the tag directory takes a longer path
  • Alternatives
    • Parallel reads: slower than the TDB, with the same energy consumption as controlled replication
    • Sharing migration: no performance boost and increased energy
  • Increasing read throughput alone isn't sufficient

  20. Conclusion
  • Tag Directory Buffer
    • 10% faster than the private designs
    • 55% energy savings compared to the shared design
  • Future work
    • More complex hierarchies
    • More fundamental changes to the hierarchy

  21. Thank You!
