
Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture


Presentation Transcript


  1. Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser The Technion – Israel Institute of Technology

  2. Caches are a principal challenge in CMP
  • CMPs severely stress on-chip caches: capacity, bandwidth, latency
  • Data sharing complicates our life: contention on shared data, synchronization
  How to organize & handle data in CMP caches?

  3. Outline
  • Caches in CMP: Cache-in-the-Middle layout; application characterization
  • Nahalal solution: overview; results
  • Putting Nahalal into practice: line search; scalability
  • Summary

  4. Tackling Cache Latency via NUCA
  • Due to the growing wire delay, hit time depends on physical location [Agarwal et al., ISCA 2000]
  NUCA – Non-Uniform Cache Architecture [Kim et al., ASPLOS’02; Beckmann and Wood, MICRO’04]
  • Non-uniform access times: closer data => smaller hit time
  • Aim for vicinity of reference: locate data lines closer to their clients
  Dynamic NUCA (DNUCA) [Kim et al., ASPLOS’02; Beckmann and Wood, MICRO’04]: migrate cache lines towards the processors that access them
  [Figure: wire-delay trend across an L2 cache. Source: Keckler et al., ISSCC 2003]
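The NUCA idea can be illustrated with a toy model (a sketch for intuition, not the paper's simulator; the latency constants and one-dimensional bank layout are assumptions): hit time grows with the network distance between a processor and the bank holding the line, and DNUCA's gradual migration moves a line one bank closer to its requester on every hit, so repeated accesses get progressively cheaper.

```python
BASE_LATENCY = 4   # assumed bank access time, in cycles (hypothetical)
HOP_LATENCY = 2    # assumed per-hop network delay, in cycles (hypothetical)

def hit_latency(cpu_bank: int, line_bank: int) -> int:
    """Non-uniform hit time: closer data => smaller hit time."""
    return BASE_LATENCY + HOP_LATENCY * abs(cpu_bank - line_bank)

def dnuca_access(cpu_bank: int, line_bank: int) -> tuple:
    """Serve a hit, then migrate the line one bank toward the requester."""
    lat = hit_latency(cpu_bank, line_bank)
    if line_bank < cpu_bank:
        line_bank += 1
    elif line_bank > cpu_bank:
        line_bank -= 1
    return lat, line_bank

# A line far from its sole client drifts toward it over repeated hits:
bank = 7
for _ in range(3):
    lat, bank = dnuca_access(0, bank)
    print(lat, bank)   # 18 6, then 16 5, then 14 4
```

With a single client this works well; the trouble, as the next slides show, starts when many distant processors pull the same line in different directions.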

  5. Cache-In-the-Middle Layout (CIM)
  [Figure: eight processors (CPU0–CPU7) surrounding a banked L2 (Bank0–Bank7)]
  • Shared L2 cache: higher capacity utilization; single copy => no inter-cache coherence
  • Banked, DNUCA
  • Interconnected using a Network-on-Chip (NoC) [Beckmann et al., MICRO’06; Beckmann and Wood, MICRO’04]

  6. Remoteness of Shared Data
  [Figure: a shared line in the banked L2 lies far from some of its clients (CPU0–CPU7, Bank0–Bank7)]
  • Shared data inevitably resides far from (some of) its clients
  • Long access times

  7. Observations on Memory Accesses
  For many parallel applications (Splash-2, SpecOMP, Apache, SPECjbb, STM, …):
  • Access to shared lines is substantial
  • Shared lines are shared by many processors
  • A small number of lines accounts for a large fraction of the total accesses
  ⇒ The shared-hot-lines effect: a small number of lines, shared by many processors, is accessed numerous times
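The shared-hot-lines observation can be made concrete with a small trace-analysis sketch (the trace, line addresses, and function names here are hypothetical; the paper's characterization used full benchmark runs): for each cache line, count how many distinct processors touch it and what fraction of all accesses go to lines with more than one sharer.

```python
from collections import defaultdict

def sharing_stats(trace):
    """trace: iterable of (cpu, line) access pairs.
    Returns (fraction of accesses to shared lines, line -> sharer set)."""
    sharers = defaultdict(set)
    counts = defaultdict(int)
    for cpu, line in trace:
        sharers[line].add(cpu)
        counts[line] += 1
    shared = sum(c for line, c in counts.items() if len(sharers[line]) > 1)
    return shared / len(trace), sharers

# A synthetic trace where one "hot" line, touched by all 8 cores,
# dominates the access stream:
trace = [(cpu, 0x100) for cpu in range(8)] * 5 + [(1, 0x200), (2, 0x300)]
frac, sharers = sharing_stats(trace)
print(round(frac, 2), len(sharers[0x100]))   # 0.95 8
```

On real workloads the slide reports the same shape: a handful of widely shared lines soak up a large share of the traffic, which is what motivates giving them a dedicated place.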

  8. Shared Data Hinders Cache Performance
  [Figure: CIM layout, CPU0–CPU7 around Bank0–Bank7, with shared lines far from their clients]
  What can be done better?
  • Bring shared data closer to all processors
  • Preserve vicinity of private data

  9. This Has Been Addressed Before
  [Left: overview of the Nahalal cache organization (P0–P7 around a shared center). Right: aerial view of the Nahalal cooperative village]

  10. Nahalal Layout
  [Figure: CPU0–CPU7 with private banks Bank0–Bank7 on the outside and a SharedBank in the center, alongside a more realistic layout]
  • A new architectural differentiation of cache lines, according to how the data is used: private vs. shared
  • Designated area for shared data lines in the center: a small & fast structure, close to all processors
  • Outer rings used for private data: preserves vicinity of private data

  11. Nahalal Cache Management
  Where does the data go?
  • First access: go to the private yard of the requester
  • Accesses by additional cores: go to the middle
  • On eviction from the over-crowded middle, a line can go to any sharer’s private yard
  • In typical workloads, virtually all accesses to shared data are satisfied from the middle
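The placement rules above can be sketched as a small state machine (a minimal sketch of the policy as described on the slide; the class, its dictionaries, and the eviction choice of "first sharer" are illustrative assumptions, whereas the real design operates on cache banks with finite capacity):

```python
class NahalalL2:
    """Toy model of Nahalal line placement: private yards + shared middle."""

    def __init__(self):
        self.location = {}   # line -> "middle" or owning cpu id
        self.sharers = {}    # line -> set of cpus that have accessed it

    def access(self, cpu, line):
        if line not in self.location:
            # First access: place in the requester's private yard.
            self.location[line] = cpu
            self.sharers[line] = {cpu}
        elif self.location[line] not in ("middle", cpu):
            # A second core touched it: promote to the shared middle bank.
            self.sharers[line].add(cpu)
            self.location[line] = "middle"
        else:
            self.sharers[line].add(cpu)
        return self.location[line]

    def evict_from_middle(self, line):
        # On eviction from the over-crowded middle, demote the line
        # to (here, arbitrarily) one of its sharers' private yards.
        self.location[line] = next(iter(self.sharers[line]))
        return self.location[line]

l2 = NahalalL2()
print(l2.access(0, 0xA))   # first touch: CPU0's private yard -> 0
print(l2.access(3, 0xA))   # second core: promoted -> middle
```

Once a line is in the middle, every sharer reaches it at the short central distance, which is why virtually all shared-data accesses are satisfied there.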

  12. Simulations
  • Full system simulation via Simics
  • 8-processor CMP
  • Private 32KByte L1 for each processor
  • 16MByte of shared L2:
  • CIM (Cache-In-the-Middle): 2MB near each processor
  • Nahalal: 1.875MB near each processor + 1MB in the middle
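As a quick sanity check on the comparison being capacity-fair, both configurations total the same 16MB of L2:

```python
# Both simulated layouts hold 16 MB of L2 in total, so the comparison
# isolates the effect of placement, not capacity.
cim_total = 8 * 2.0             # CIM: 2 MB per processor
nahalal_total = 8 * 1.875 + 1.0 # Nahalal: 1.875 MB per processor + 1 MB middle
print(cim_total, nahalal_total)  # 16.0 16.0
```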

  13. 24.2% 41.1% 29.1% 40.53% 29.06% 39.4% 29.35% # clock cycles 8.57% 3.9% Cache Performance • 26.8% improvement in average cache hit time • 41.1% in apache Average Cache Hit Time (cycles)

  14. Average Distance – Shared vs. Private
  [Chart: average relative distance to shared and to private data, CIM vs. Nahalal]
  • Nahalal shortens the distance to shared data
  • Distance to private data remains roughly the same

  15. Putting Nahalal into Practice
  • Line search: how to find a line within the cache
  • Line migration: when and where to move a line between places in the cache
  • Scalability: how far can we take the Nahalal structure
  “The difference between theory and practice is always larger in practice than it is in theory” [Peter H. Salus]
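To make the line-search question concrete, here is one plausible probe order (a hypothetical sketch, not necessarily the mechanism the authors chose): since the small middle bank plus the requester's own yard serve almost all hits, probe those first and fall back to the remaining yards only on a miss.

```python
def search_order(requester: int, num_cpus: int = 8) -> list:
    """One possible (assumed) Nahalal lookup order: shared middle bank
    and the requester's private yard first, then the other yards."""
    order = ["middle", requester]
    order += [cpu for cpu in range(num_cpus) if cpu != requester]
    return order

print(search_order(3))   # ['middle', 3, 0, 1, 2, 4, 5, 6, 7]
```

Whatever the real policy, the layout makes the common-case probes cheap: both of the first two candidate locations are physically close to the requester.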

  16. Summary
  [Figure: Nahalal layout, P0–P7 around a shared center]
  • State-of-the-art caches’ weakness: remoteness of shared data
  • Software behavior: the shared-hot-lines effect => shared data hinders cache performance
  • Nahalal cache architecture: places shared lines closer to all processors; preserves vicinity of private data
  • A new architectural differentiation of cache lines: not all data should be treated equally; a data-usage-aware design
  Questions?

  17. Backup

  18. Scalability Issues
  This has (also) been addressed before
  [Figures: aerial views of the Kfar Yehoshua and Nahalal villages; a clustered Nahalal CMP design; a cluster of Garden-Cities (Ebenezer Howard, 1902)]
