
Performance and Energy Implications of Many-Core Caches for Throughput Computing


Presentation Transcript


  1. Performance and Energy Implications of Many-Core Caches for Throughput Computing C. J. Hughes, C. Kim, Y. Chen, Intel Labs. IEEE Micro, 2010. Presented by 2013010654 유승요

  2. Throughput Computing
  • Throughput computing
    • Focuses on maximizing the throughput of workloads rather than their latency
    • Performs huge numbers of calculations in parallel
    • A good fit for many-core processors
  • To keep many cores busy
    • The memory system must feed the cores and facilitate efficient core-to-core communication
  • By using caches
    • Hide the latency of lower levels of the memory system
    • Provide fast core-to-core communication
    • Supply sufficient bandwidth

  3. Research Objective
  • Many-core cache design for throughput computing
    • Considering both power and performance
  • Different from traditional CPU cache design
    • More cores  more inter-core communication
    • Latency tolerant  minimizing average access time may not be the best goal

  4. Throughput Applications

  5. Throughput Applications
  • Working set size
    • The model uses a 256-KB L2 cache  benchmarks with larger working sets may be slowed by the L2 cache size
  • Cache miss rate
    • The L1 cache miss rate; a high miss rate means a high access rate to the lower-level caches
  • Prefetch coverage
    • Percentage reduction in L1 misses when a stride prefetcher is added; high coverage indicates a strong streaming access pattern
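
A minimal sketch of how the last two metrics can be derived from simulation event counters; the counter names and example numbers are illustrative assumptions, not from the paper:

```cpp
// Sketch: deriving the slide's workload metrics from simple event
// counters. Counter names and values are illustrative, not the paper's.
#include <cstdint>
#include <cstdio>

struct L1Counters {
    uint64_t accesses;                // total L1 accesses
    uint64_t misses;                  // L1 misses, no prefetcher
    uint64_t misses_with_prefetcher;  // L1 misses with the stride prefetcher on
};

// L1 miss rate: high values mean frequent lower-level cache accesses.
double miss_rate(const L1Counters& c) {
    return static_cast<double>(c.misses) / c.accesses;
}

// Prefetch coverage: fractional reduction in L1 misses when a stride
// prefetcher is added; high coverage indicates streaming access patterns.
double prefetch_coverage(const L1Counters& c) {
    return 1.0 - static_cast<double>(c.misses_with_prefetcher) / c.misses;
}

int main() {
    L1Counters c{1'000'000, 80'000, 12'000};  // made-up example numbers
    std::printf("miss rate: %.3f, prefetch coverage: %.3f\n",
                miss_rate(c), prefetch_coverage(c));
}
```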

  6. Throughput Applications
  Data-sharing characteristics: percentage of L2 cache lines and of L2 cache accesses to data shared by a given number of cores
  • Sharing degree
    • The number of cores that access a line
  • Spatial domain
    • Most data is private: 12 out of 15 benchmarks have more than 70% unshared lines
  • Frequency domain
    • Sharing percentages are larger than in the spatial domain, i.e., shared lines are accessed disproportionately often
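
To make the spatial-versus-frequency distinction concrete, here is a small sketch computing both views from a toy access trace; the trace format and values are assumptions for illustration. The frequency percentages come out larger for shared data because each line is weighted by how often it is accessed rather than counted once:

```cpp
// Sketch: the two sharing-degree views on one trace. Spatial weights
// each cache line once; frequency weights a line by its access count.
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <vector>

struct Access { uint64_t line; int core; };

int main() {
    std::vector<Access> trace = {{0xA,0},{0xA,1},{0xA,1},{0xB,0},{0xB,0}};

    std::map<uint64_t, std::set<int>> sharers;  // line -> cores touching it
    std::map<uint64_t, uint64_t> hits;          // line -> access count
    for (const auto& a : trace) { sharers[a.line].insert(a.core); ++hits[a.line]; }

    std::map<size_t, double> spatial, frequency;  // sharing degree -> fraction
    for (const auto& [line, cores] : sharers) {
        spatial[cores.size()]   += 1.0 / sharers.size();
        frequency[cores.size()] += static_cast<double>(hits[line]) / trace.size();
    }
    for (const auto& [deg, f] : spatial)
        std::printf("degree %zu: %.0f%% of lines, %.0f%% of accesses\n",
                    deg, 100 * f, 100 * frequency[deg]);
}
```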

  7. Cache Design
  • Constraints
    • Two-level caching
      • A private L1 cache for each core
      • A last-level cache (LLC)
    • Directory-based hardware cache coherence
      • Each entry contains a tag and directory state information
    • Tiled processor design
  • Flexibility: the key design criterion
    • Determines where a given line can reside in the LLC
    • Affects the distance an LLC request and reply must travel through the on-die network
  • Flexibility affects
    • Access latency?  a more flexible design is better
    • On-die bandwidth usage?  a more flexible design is better
    • Number of unique lines?  a less flexible design is better
    • Off-die bandwidth usage?  a less flexible design is better
    • Effective cache capacity?  a less flexible design is better
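
A sketch of what a tag-directory entry might look like for a 64-tile design, based on the slide's statement that an entry holds a tag plus directory state; the 64-bit tag, the bit-vector sharer encoding, and the state set are assumptions, not the paper's exact layout:

```cpp
// Sketch: a tag-directory entry with a tag, directory state, and one
// presence bit per tile. Field widths are illustrative assumptions.
#include <bitset>
#include <cstdint>

enum class DirState : uint8_t { Invalid, Shared, Exclusive, Modified };

struct DirectoryEntry {
    uint64_t tag;             // identifies the tracked cache line
    DirState state;           // coherence state of the line
    std::bitset<64> sharers;  // which tiles hold a copy
};

// A line is concurrently shared when more than one presence bit is set.
bool multiply_shared(const DirectoryEntry& e) { return e.sharers.count() > 1; }
```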

  8. Cache Design – Private
  • Places a core's data close to that core
  • Private LLC
    • Fewer unique cache lines
  • Home tile
    • Chosen by an address-hashing function
    • The tag directory in the home tile tracks the line
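
A minimal sketch of the home-tile mapping: the line address is hashed to pick a tile. The specific hash, the 64-byte line size, and the 64-tile count are assumptions, not the paper's exact function:

```cpp
// Sketch: choosing a line's home tile by hashing its address, per the
// slide's "address hashing function".
#include <cstdint>

constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kNumTiles  = 64;

uint64_t home_tile(uint64_t addr) {
    uint64_t line = addr / kLineBytes;                       // drop offset bits
    return (line ^ (line >> 6) ^ (line >> 12)) % kNumTiles;  // spread lines over tiles
}
```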

  9. Private Design Variations
  • Replication policy
    • Controls the number of unique cache lines
    • Uncontrolled replication, controlled replication, or no replication
  • Uncontrolled replication
    • Allows unlimited replication
    • When the unique copy of a line is evicted and it is not in its home tile, it is moved to the home tile (migration)
  • Controlled replication
    • Allows replication, but deprioritizes replicas via the replacement policy
    • Each line carries a reuse bit that follows the line when it is evicted or transferred to another LLC
    • When a new line is inserted, a line whose reuse bit is 0 is evicted first
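
A sketch of the controlled-replication victim choice just described; the set/way layout and the LRU fallback are assumptions, and only the reuse-bit preference comes from the slide:

```cpp
// Sketch: within a full set, prefer to evict a line whose reuse bit
// is still 0, deprioritizing replicas that were never reused.
#include <cstdint>
#include <vector>

struct Line {
    uint64_t tag   = 0;
    bool valid     = false;
    bool reuse_bit = false;  // set on reuse; travels with the line on
                             // eviction or transfer to another LLC
};

// Returns the way to evict when inserting into a set.
size_t pick_victim(const std::vector<Line>& set) {
    for (size_t w = 0; w < set.size(); ++w)
        if (!set[w].valid) return w;      // free slot first
    for (size_t w = 0; w < set.size(); ++w)
        if (!set[w].reuse_bit) return w;  // deprioritized, never-reused line
    return 0;                             // fall back (e.g., LRU way)
}
```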

  10. Private Design Variations
  • No replication
    • Shared lines live in their home tile
    • Private lines live in the accessing core's tile
  • The LLC controller also works as a directory controller
    • Tracks private lines with a Roaming Data Pointer (RDP)
    • When a line becomes shared, the controller removes its RDP entry and migrates the line to its home tile
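
A sketch of the RDP bookkeeping at the directory controller, under the assumption that an RDP entry simply maps a line to the tile currently holding it; the data structures and interface are illustrative:

```cpp
// Sketch: private lines live in the accessing core's tile and are
// tracked by an RDP entry; once a second tile touches the line, the
// entry is dropped and the line migrates to its home tile.
#include <cstdint>
#include <unordered_map>

struct Directory {
    std::unordered_map<uint64_t, int> rdp;  // line -> tile currently holding it

    void on_access(uint64_t line, int tile) {
        auto it = rdp.find(line);
        if (it == rdp.end()) {
            rdp[line] = tile;       // first toucher: private, track it
        } else if (it->second != tile) {
            rdp.erase(it);          // now shared: drop the RDP entry
            migrate_to_home(line);  // and move the line to its home tile
        }
    }
    void migrate_to_home(uint64_t /*line*/) { /* issue migration (not modeled) */ }
};
```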

  11. Shared Design
  • Shared
    • Keeps every line in its home tile
    • Maximizes the number of unique lines
    • Increases average access latency and on-die traffic

  12. Experimental Setup
  • L1 caches with a hardware stride prefetcher
  • Ring interconnect with 64 switches
  • Energy components
    • LLC, tag directory, and RDP accesses
    • On-die data messages
    • On-die coherence messages
    • Off-die accesses
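
A sketch of the kind of per-event energy accounting these components imply: total energy as a weighted sum of event counts. The per-event energies below are placeholder values, not the paper's calibrated numbers:

```cpp
// Sketch: the evaluation's energy model as a weighted sum of event
// counts, one weight per component listed above.
#include <cstdint>

struct EventCounts {
    uint64_t llc_accesses;
    uint64_t tag_dir_accesses;
    uint64_t rdp_accesses;
    uint64_t on_die_data_msgs;
    uint64_t on_die_coherence_msgs;
    uint64_t off_die_accesses;
};

double total_energy_nj(const EventCounts& e) {
    return 0.5  * e.llc_accesses           // LLC data-array access
         + 0.2  * e.tag_dir_accesses       // tag-directory lookup
         + 0.2  * e.rdp_accesses           // RDP lookup
         + 1.0  * e.on_die_data_msgs       // on-die data message
         + 0.3  * e.on_die_coherence_msgs  // on-die coherence message
         + 20.0 * e.off_die_accesses;      // off-die access dominates per event
}
```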

  13. Performance and Energy Consumption
  (a) Performance of the 5 LLC designs relative to shared

  14. Performance
  • The least flexible design performs best
    • Flexible designs are intended to minimize access latency
    • But in throughput computing, cache-miss latency is hidden via multithreading or prefetching
  • Critical path  heavily read-write shared lines
    • Least flexible designs  centralized storage
      • A response from the home tile requires no acknowledgment
    • Flexible designs  private caches only
      • Do not allow multiple simultaneous cache-to-cache transfers
      • The tag directory needs an acknowledgment from the responding tile
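
As a rough illustration of the critical-path difference, the sketch below counts on-die protocol messages for a read to a line cached in another core's region; the exact message sequences are simplified assumptions, not the protocol from the paper:

```cpp
// Sketch: message counts on the critical path of a read under each style.
#include <cstdio>

// Shared design: the home tile holds the data and replies directly.
int messages_shared() { return 2; }   // request -> home tile, data reply

// Private design: the tag directory must forward the request to the
// owning tile and collect an acknowledgment.
int messages_private() { return 4; }  // request -> tag directory,
                                      // forward -> owner, data reply,
                                      // ack -> tag directory

int main() {
    std::printf("shared: %d messages, private: %d messages\n",
                messages_shared(), messages_private());
}
```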

  15. Energy Consumption
  • The more flexible designs are better
    • They save on-die traffic
    • The increased off-die traffic has only a small effect
  • Unique-line policies in the private designs
    • Controlled replication and no replication
    • Reduce off-die accesses
  Energy consumption of the 5 LLC designs relative to shared. P: private, U: uncontrolled replication, C: controlled replication, N: no replication, S: shared, T: tag directory buffer

  16. Designing for Performance and Energy
  • Tag Directory Buffer (TDB)
    • A small fully associative buffer that holds clean lines
    • Handles read requests for clean lines
      • Acts like the shared design for those lines
    • Adds a bit to each line's directory entry
      • To detect concurrent sharing
    • Saves space and traffic
  Tag directory buffer hit rates for different buffer sizes
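
A sketch of a tag directory buffer as a small fully associative structure that serves reads to clean, concurrently shared lines directly; the LRU management and the interface below are assumptions beyond what the slide states:

```cpp
// Sketch: a small fully associative, LRU-managed buffer beside the tag
// directory; a hit answers a clean-line read with no forwarding or ack.
#include <cstdint>
#include <list>
#include <unordered_map>

class TagDirBuffer {
    size_t capacity_;
    std::list<uint64_t> lru_;  // most recently used at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
public:
    explicit TagDirBuffer(size_t capacity) : capacity_(capacity) {}

    // Serve a read for a clean line if buffered.
    bool read_hit(uint64_t line) {
        auto it = map_.find(line);
        if (it == map_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
        return true;
    }

    // Insert a clean line the directory flagged as concurrently shared.
    void insert_clean(uint64_t line) {
        if (read_hit(line)) return;        // already buffered
        if (lru_.size() == capacity_) {    // evict the LRU line
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(line);
        map_[line] = lru_.begin();
    }
};
```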

  17. Alternatives
  • Sharing migration
    • Copies read-shared lines to their home LLC tiles
    • Also needs an acknowledgment from the home tile
  • Parallel reads
    • Modifies the coherence protocol and directory hardware to allow simultaneous transfers
    • Does not increase data traffic
    • Requires changing the protocol and hardware
    • Still needs cache-to-cache transfers
    • Slower than the tag directory buffer

  18. Impact of Increased Read Parallelism
  (a) Performance and (b) energy consumption for designs that attempt to increase read parallelism

  19. Impact of Increased Read Parallelism
  • Tag Directory Buffer
    • Faster, but increases energy
    • Copying data to the home tile means a data reply from the tag directory takes a longer path
  • Alternatives
    • Parallel reads: slower than the TDB, with the same energy consumption as controlled replication
    • Sharing migration: no performance boost and increased energy
  • Increasing read throughput alone isn't sufficient

  20. Conclusion
  • Tag Directory Buffer
    • 10% faster than the private designs
    • 55% energy savings compared to the shared design
  • Future work
    • More complex hierarchies
    • More fundamental changes to the hierarchy

  21. Thank You!
