
CSE 661 PAPER PRESENTATION

Presentation Transcript


  1. CSE 661 PAPER PRESENTATION PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING by C. J. Hughes et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)

  2. OUTLINE OF PRESENTATION • Throughput Computing • Benchmarks Used • Degree of Sharing of L2 Caches in Benchmarks • Cache Designs Considered • Experimental Setup • Results (Performance and Energy) • Possible Improvements • Final Results • Conclusion, Comments and Questions

  3. THROUGHPUT COMPUTING • Performing a huge number of computations with large amounts of parallelism • Also known as GPGPU

  4. BENCHMARKS USED [Figure: L1 miss rate without prefetching] • Working set: 64 KB – 2 MB • 64 threads, each with a private 32 KB cache • 256 KB L2 cache • An L2 smaller than 2 MB may result in poor performance (a quick capacity check follows below)
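
A quick back-of-envelope check of the capacities above (Python; assumes the 32 KB cache is per core and the 256 KB L2 is a per-tile slice, which the slide does not state explicitly):

    # 64 threads/cores with private caches (configuration from the slide)
    num_cores = 64
    aggregate_l1_kb = num_cores * 32    # 2048 KB = 2 MB of total L1
    aggregate_l2_kb = num_cores * 256   # 16384 KB = 16 MB of total L2

    # Working sets span 64 KB - 2 MB, so the largest one already exceeds
    # a single 256 KB L2 slice; how slices pool unique lines matters.
    print(aggregate_l1_kb, aggregate_l2_kb)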

  5. BENCHMARKS USED (2) [Figure: L1 miss rate without prefetching]

  6. DEGREE OF SHARING OF L2 CACHE IN BENCHMARKS • Spatial sharing degree: fraction of each cache line that is accessed. Most data is private, except for svm • Temporal sharing degree: fraction of accesses that go to a line. Shared data is prevalent, e.g., in pcg 0.1% of lines are involved in global reads/writes yet account for 19.2% of L2 cache accesses (a toy sketch of these two metrics follows below)
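
A minimal sketch of how one might measure these two kinds of sharing from an L2 access trace; the toy trace and line-granularity simplification are illustrative (the paper's spatial metric works at sub-line granularity):

    from collections import defaultdict

    # Hypothetical mini-trace: (thread_id, cache_line_address) per L2 access.
    trace = [(0, 0x100), (1, 0x100), (0, 0x140), (0, 0x140), (2, 0x100)]

    threads_per_line = defaultdict(set)
    accesses_per_line = defaultdict(int)
    for tid, line in trace:
        threads_per_line[line].add(tid)
        accesses_per_line[line] += 1

    shared = [l for l in threads_per_line if len(threads_per_line[l]) > 1]

    # Spatial view: fraction of distinct lines touched by more than one thread.
    spatial = len(shared) / len(threads_per_line)
    # Temporal view: fraction of all accesses that go to shared lines.
    temporal = sum(accesses_per_line[l] for l in shared) / len(trace)
    print(spatial, temporal)   # 0.5 and 0.6 for this toy trace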

  7. CACHE DESIGNS CONSIDERED: ASSUMPTIONS • Two-level caching (private L1, varying L2); inclusive caches • Directory-based coherence • Tiled design (tile = core + private caches + switch)
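
A minimal sketch of the assumed tiled organization (names and field choices are illustrative, not from the paper):

    from dataclasses import dataclass, field

    @dataclass
    class Tile:
        core_id: int
        l1_kb: int = 32        # private L1, per the benchmark slide
        l2_kb: int = 256       # L2 slice, managed per the design in use
        switch_links: list = field(default_factory=list)  # on-die network ports

    tiles = [Tile(core_id=i) for i in range(64)]   # one tile per core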

  8. CACHE DESIGNS CONSIDERED (2) 1) PRIVATE LLC • Each tile's core has its own LLC slice • Most flexible design (replicas of a cache line can exist in all LLCs simultaneously) • Fewer unique cache lines => more LLC misses • Each tile contains a tag directory • hash(cache block address) = home tile • The home tile provides information on which LLC(s) hold the required data • A cache-to-cache transfer then takes place (the home-tile lookup is sketched below)
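
A minimal sketch of the home-tile lookup described above, assuming 64-byte lines, 64 tiles, and a simple modulo hash (the paper's actual hash function is not specified here):

    LINE_BYTES = 64
    NUM_TILES = 64

    def home_tile(block_addr: int) -> int:
        # Low-order bits of the block number pick the home tile.
        return (block_addr // LINE_BYTES) % NUM_TILES

    # Per-tile tag directory slice: block number -> tiles holding a replica.
    tag_directory = [dict() for _ in range(NUM_TILES)]

    def lookup_sharers(block_addr: int) -> set:
        # The requester asks the home tile which LLCs hold the line; a
        # cache-to-cache transfer is then started from one of them.
        block = block_addr // LINE_BYTES
        return tag_directory[home_tile(block_addr)].get(block, set())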

  9. CACHE DESIGNS CONSIDERED (3) 2) UNCONTROLLED REPLICATION • Similar to Private LLC • Tries to increase the number of unique lines • Evicting a cache block with only one sharer? Move the block to its home tile • Already in its home tile? Evict it from the chip (the eviction rule is sketched below)
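
The eviction rule above, restated as a small Python sketch (function and argument names are hypothetical):

    def on_llc_eviction(block, tile_id, num_sharers, home):
        if num_sharers == 1:             # this tile holds the only copy
            if tile_id == home:          # already at home: nowhere left on-die
                return "evict off-chip"
            return "move to home tile"   # keep the unique line on-die
        return "drop replica"            # other copies remain elsewhere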

  10. CACHE DESIGNS CONSIDERED (4) 3) CONTROLLED REPLICATION • Builds on Uncontrolled Replication • Tries to further increase the number of unique lines • Each block has a reference bit • Reference bit = 1 => the block is likely part of the working set • Duplicate copies of cache blocks not in active use are favored for LRU eviction (a victim-selection sketch follows below)
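
A sketch of victim selection under this policy (field names are assumed; the real replacement logic is hardware state in the LLC):

    from collections import namedtuple

    Line = namedtuple("Line", "tag is_replica ref_bit lru_age")

    def pick_victim(candidates):
        # Prefer duplicates whose reference bit is clear (not in active use);
        # otherwise fall back to plain LRU over all ways.
        cold_replicas = [l for l in candidates if l.is_replica and not l.ref_bit]
        pool = cold_replicas or candidates
        return max(pool, key=lambda l: l.lru_age)

    ways = [Line(0xA, True, 0, 3), Line(0xB, False, 1, 9), Line(0xC, True, 1, 5)]
    print(pick_victim(ways).tag)   # 10, i.e. tag 0xA: a duplicate not in use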

  11. CACHE DESIGNS CONSIDERED (5) 4) NO REPLICATION • Limited flexibility • Cache lines reside in at most one LLC at a time • Shared lines are held in the line's home tile's LLC (=> easy accessibility) • Private lines are held in the user's LLC (the RDP points to the line's location) • Eviction of a private line, or an increase in the number of sharers, returns the block to its home LLC (a placement sketch follows below)
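
A placement sketch for this design (names are hypothetical; the paper's RDP mechanics are more involved):

    rdp = {}   # home tile's pointer table: block -> tile holding the line

    def place_line(block, home_tile, user_tile, num_sharers):
        if num_sharers > 1:
            rdp.pop(block, None)     # shared: the line lives in its home LLC
            return home_tile
        rdp[block] = user_tile       # private: lives near its user; the RDP
        return user_tile             # records where to find it

    print(place_line(0x40, home_tile=5, user_tile=12, num_sharers=1))   # 12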

  12. CACHE DESIGNS CONSIDERED (6) 5) SHARED • Least flexible design • All cache lines reside in their home tile's LLC • Lines are easy to find • Increased average access latency and on-die traffic for private lines

  13. CACHE DESIGNS CONSIDERED (7) • Summary of the five designs, ordered from most to least flexible: Private, Uncontrolled Replication, Controlled Replication, No Replication, Shared • Along this order, effective cache capacity (number of unique blocks) increases while flexibility decreases • Flexible designs give the larger reduction in on-die bandwidth usage; inflexible designs give the larger reduction in off-die bandwidth usage

  14. EXPERIMENTAL SETUP • A simulator is used • L1 has a hardware stride prefetcher • Energy consumption is broken into: storage energy (tag and cache-line accesses to the LLC, tag directory, and RDP), on-die data messages, on-die coherence messages, and off-die accesses (an accounting sketch follows below)
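
A sketch of this four-part energy accounting; the per-event costs below are made up for illustration (the paper's energy model parameters are not reproduced here):

    ENERGY_PJ = {
        "llc_tag": 5, "llc_data": 25,        # storage: tag + line accesses
        "dir_access": 5, "rdp_access": 3,    # tag directory and RDP lookups
        "on_die_data_msg": 40,               # on-die data messages
        "on_die_coh_msg": 10,                # on-die coherence messages
        "off_die_access": 500,               # off-die (memory) accesses
    }

    def total_energy_pj(event_counts):
        # event_counts: event name -> number of occurrences in the run
        return sum(ENERGY_PJ[e] * n for e, n in event_counts.items())

    print(total_energy_pj({"llc_tag": 1000, "off_die_access": 10}))   # 10000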

  15. RESULTS (PERFORMANCE) • The least flexible designs offer better performance! • Least flexible designs: high throughput for heavily read/written lines (on a miss, the home tile responds directly; no acknowledgement is needed); a single write invalidates all readers (less impact for designs with centralized data, worse for flexible designs) • Flexible designs: no centralized data storage; cache-to-cache transfers are not overlapped, since the directory must receive an acknowledgement from the sending tile before processing another request (a toy latency comparison follows below)
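
A toy comparison of the two behaviors, with made-up cycle counts, just to show why serializing on acknowledgements hurts hot lines:

    def serialized_latency(num_readers, xfer=10, ack=5):
        # Flexible designs: each cache-to-cache transfer must complete
        # (transfer + acknowledgement) before the directory serves the next.
        return num_readers * (xfer + ack)

    def home_tile_latency(num_readers, xfer=10):
        # Centralized data: the home tile answers misses back-to-back,
        # with no acknowledgement round trip.
        return num_readers * xfer

    print(serialized_latency(8), home_tile_latency(8))   # 120 vs 80 cycles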

  16. RESULTS (ENERGY) • Flexible designs consume significantly less energy than the other designs! • Flexible designs minimize on-die traffic because of replication • They see more off-die traffic (fewer unique lines), but most lines have few sharers; see Figure 1 • On-die traffic for No Replication is better than Shared due to data migration • Off-die traffic decreases as we move from Private to Uncontrolled Replication to Controlled Replication

  17. RESULTS SO FAR… • Flexible designs are more energy efficient • Less flexible designs offer better performance • Controlled Replication uses the least energy • Can we improve its parallelism for handling multiple reads of the same cache line?

  18. POSSIBLE IMPROVEMENTS • Tag Directory Buffer: a small, fully associative buffer added to the tag directory to hold clean lines having at least 3 sharing readers (similar to the Shared design) • Tag Directory Buffer All: like Tag Directory Buffer, but all read-shared lines are placed in the tag directory buffer • A four-entry buffer of 256 bytes is used (a sketch of the buffer follows below)
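
A minimal sketch of such a buffer (class shape and thresholds follow the slide; everything else is assumed):

    from collections import OrderedDict

    BUFFER_ENTRIES = 4   # the slide's four-entry, 256-byte buffer

    class TagDirectoryBuffer:
        def __init__(self):
            self.entries = OrderedDict()   # block -> line data, LRU-ordered

        def maybe_insert(self, block, data, num_sharers, dirty):
            if dirty or num_sharers < 3:
                return                     # only clean, widely read lines qualify
            if len(self.entries) >= BUFFER_ENTRIES:
                self.entries.popitem(last=False)   # evict the LRU entry
            self.entries[block] = data

        def serve_read(self, block):
            # On a hit the directory replies with data directly,
            # behaving like the Shared design for that line.
            return self.entries.get(block)

    buf = TagDirectoryBuffer()
    buf.maybe_insert(0x80, b"line data", num_sharers=4, dirty=False)
    print(buf.serve_read(0x80))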

  19. POSSIBLE IMPROVEMENTS (2) • Sharing Migration: similar to Tag Directory Buffer, but uses the home tile's LLC instead of a buffer • Sharing Migration All: similar to Tag Directory Buffer All, again using the home tile's LLC instead of a buffer • Parallel Reads: allows simultaneous (overlapped) cache-to-cache transfers of the same cache line for reads

  20. FINAL RESULTS • Tag Directory Buffer provides the highest performance and close to the lowest energy consumption; see also Figure 3

  21. CONCLUDING REMARKS, COMMENTS AND QUESTIONS THANK YOU
