

To Include or Not to Include?

Natalie Enright

Dana Vantrease



Motivation

  • CMP technology affects coherence protocols differently than previously studied MP systems

    • New shared on-chip resources (e.g. L2)

    • Low latency between on-chip caches

    • Need for scalability in design

  • Industry Examples

    • IBM Power 4 – Inclusion

    • Piranha – Exclusion

  • Our goal: determine the design points at which each inclusion protocol (strict inclusion, non-inclusion, and exclusion) is the best choice for CMP performance.


SMP vs CMP Opportunities

[Diagram: cache organization in an SMP, where each processor has its own L1 and L2, vs. a CMP, where multiple per-core L1s share an on-chip L2]



Multilevel Inclusion

  • Protocol given to us with the simulator

    • L1 has Modified, Shared and Invalid States

    • L2 has Modified, Owned, Shared, and Invalid States

  • When an L2 line is replaced, any copies present on the chip must be invalidated (the sharers are given in the directory entry)

  • In a single processor chip, there are only 2 caches (Instruction and Data) connected to a single L2 cache

    • Chip multiprocessors add two more level-1 caches for each additional processor, which could make this forced inclusion harmful (back-invalidation is sketched after this list).
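
The back-invalidation cost of strict inclusion can be made concrete with a toy model. This is only an illustrative sketch, not the simulator's protocol: the class names, and the use of plain address sets in place of real tag arrays and MOSI state machines, are assumptions made for brevity.

```python
class ToyL1:
    """Minimal stand-in for a private L1 cache: just a set of block addresses."""

    def __init__(self):
        self.blocks = set()

    def fill(self, addr):
        self.blocks.add(addr)

    def invalidate(self, addr):
        self.blocks.discard(addr)


class InclusiveL2:
    """Toy inclusive L2: every block cached in an L1 is also tracked here,
    so replacing an L2 line forces invalidation of all on-chip L1 copies."""

    def __init__(self, l1s):
        self.l1s = l1s            # per-core ToyL1 instances
        self.sharers = {}         # addr -> set of L1 indices (the directory entry's sharer list)

    def fill(self, addr, l1_id):
        self.sharers.setdefault(addr, set()).add(l1_id)
        self.l1s[l1_id].fill(addr)

    def replace(self, addr):
        # Strict inclusion: the sharer list in the directory entry names every
        # L1 copy that must be back-invalidated when the L2 victim is evicted.
        for l1_id in self.sharers.pop(addr, set()):
            self.l1s[l1_id].invalidate(addr)


# With many cores, one L2 replacement can wipe the same hot block out of
# several L1s at once, which is the cost the following slides weigh.
l1s = [ToyL1() for _ in range(4)]
l2 = InclusiveL2(l1s)
for core in range(4):
    l2.fill(0x40, core)
l2.replace(0x40)
assert all(0x40 not in l1.blocks for l1 in l1s)
```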



Non-Inclusion

  • Protocol courtesy of Mike

  • L1 now has Owned and Exclusive states

  • Complexity of the on-chip directory increases significantly

    • States are added to indicate local level-1 sharers or a local level-1 owner.

    • L1 directory state must also be visible to external requests from other chips (see the sketch after this list)

  • Increased effective on-chip cache storage
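
A rough sketch of the extra directory state non-inclusion calls for. The entry layout and routing decisions below are assumptions for illustration only, not the actual Ruby protocol; they show just the idea that the directory must record local L1 sharers or an L1 owner and expose that state to requests arriving from other chips.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class NonInclusiveDirEntry:
    """Hypothetical directory entry for a non-inclusive hierarchy: the block
    may live only in some L1, so L1 sharer/owner state must be recorded."""

    in_l2: bool = False                      # under non-inclusion the L2 may not hold the block
    l1_owner: Optional[int] = None           # core whose L1 holds the block Owned/Exclusive
    l1_sharers: Set[int] = field(default_factory=set)

    def route_external_read(self):
        """Decide how to service a read request arriving from another chip."""
        if self.l1_owner is not None:
            return f"forward to the L1 of core {self.l1_owner}"
        if self.in_l2:
            return "respond with the L2 copy"
        if self.l1_sharers:
            return f"forward to a sharing L1 (e.g. core {min(self.l1_sharers)})"
        return "not on chip: fetch from memory"


entry = NonInclusiveDirEntry(in_l2=False, l1_owner=2)
print(entry.route_external_read())   # -> forward to the L1 of core 2
```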



Directory Exclusion

  • No replication of Data between a single L1 and the L2

    • L2 Acts as Large Victim Cache

    • Makes better use of on-chip cache space, lowering required off-chip bandwidth (victim-cache behavior is sketched after this list)

  • L2 is centralized coherency point (tag lookup)

  • L1 States: M, E, I, SC, SM

  • L2 States: M, E, I

  • No ownership – a data request is simply forwarded to the first sharer found in the tag lookup
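
A minimal sketch of the victim-cache behavior described above, assuming a block lives either in one L1 or in the L2 but never in both. The coherence states (M, E, I, SC, SM), L1-to-L1 transfers, and the duplicate-tag lookup are deliberately left out; everything here is illustrative.

```python
class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: the L2 only holds victims evicted from the L1s."""

    def __init__(self, num_cores):
        self.l1 = [set() for _ in range(num_cores)]   # per-core L1 contents
        self.l2 = set()                               # victim blocks only

    def l1_evict(self, core, addr):
        # An L1 victim is allocated into the L2 instead of being dropped,
        # so the L2 acts as a large victim cache.
        self.l1[core].discard(addr)
        self.l2.add(addr)

    def l1_miss(self, core, addr):
        if addr in self.l2:
            # Hit in the victim cache: the block moves up to the requesting L1
            # and leaves the L2, preserving "no replication" between the two.
            self.l2.discard(addr)
            self.l1[core].add(addr)
            return "L2 victim hit"
        # Otherwise fill the L1 directly from off-chip; the L2 is not allocated.
        self.l1[core].add(addr)
        return "off-chip fill"


h = ExclusiveHierarchy(num_cores=4)
h.l1_miss(0, 0x80)         # off-chip fill into core 0's L1
h.l1_evict(0, 0x80)        # victim lands in the L2
print(h.l1_miss(1, 0x80))  # -> L2 victim hit; the block now sits only in core 1's L1
```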


Directory Exclusion

[Diagram: per-core L1 caches grouped with an L2 that also holds duplicate L1 tags for the tag lookup]



Tag Lookup Cache

  • Aids in off-chip coherency and directing on-chip requests

  • Associativity = L1 associativity * # L1s

  • # Sets = #Sets in a single L1

  • # Data Entries = # L1s

  • Data entry = one presence bit per L1, indicating whether that L1 holds the block (1/0); the structure is sketched after this list.

  • Scalability?
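
The sizing rules above translate fairly directly into a data structure. The sketch below is an assumption-laden illustration (dictionaries per set instead of a real tag array, and no replacement policy), but it shows how a single lookup yields the presence bits for every L1.

```python
class TagLookupCache:
    """Sketch of the tag lookup structure: #sets = sets in one L1,
    associativity = L1 associativity * number of L1s, and each entry
    carries one presence bit per L1."""

    def __init__(self, num_l1s, l1_sets, l1_assoc):
        self.num_l1s = num_l1s
        self.assoc = l1_assoc * num_l1s               # ways per set
        self.sets = [dict() for _ in range(l1_sets)]  # set index -> {tag: presence bits}

    def insert(self, set_index, tag, l1_id):
        """Record that the given L1 now holds the block (no eviction modeled)."""
        bits = self.sets[set_index].setdefault(tag, [0] * self.num_l1s)
        bits[l1_id] = 1

    def remove(self, set_index, tag, l1_id):
        """Clear the presence bit when that L1 drops its copy."""
        if tag in self.sets[set_index]:
            self.sets[set_index][tag][l1_id] = 0

    def lookup(self, set_index, tag):
        """Return the list of L1s holding the block; used both to direct
        on-chip requests and to answer off-chip coherence probes."""
        bits = self.sets[set_index].get(tag, [])
        return [l1 for l1, present in enumerate(bits) if present]


tlc = TagLookupCache(num_l1s=8, l1_sets=128, l1_assoc=2)
tlc.insert(set_index=5, tag=0x1A, l1_id=3)
print(tlc.lookup(5, 0x1A))   # -> [3]: forward the data request to that L1
```

Note that both the associativity and the width of each entry grow linearly with the number of L1s, which is the scalability concern raised in the last bullet.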



Methodology

  • Vary the L1 cache size to find the design point at which an inclusive protocol hurts performance.

  • As the number of cores increases, so does the aggregate L1 cache size



Simulation Configuration

  • Configuration

    • 4 processors per chip and 1 chip

    • 2 MB of L2 cache

      • Small, but we wanted to see the effect of changing the ratio of L1 size to L2 size.

    • 16 processors per chip as future work

    • Only simulated one chip to isolate the effects of intra-chip coherence from inter-chip coherence

    • Future work: see how extending the life of a block on chip through non-inclusion or exclusion affects other chips.



Results

Inclusion vs. Non-Inclusion



Results (cont.)

Inclusion vs. Pseudo-Exclusion



Conclusion/Future Work

  • An inclusive protocol is less complex

    • Especially considering inter-chip communication

  • Non-Inclusion performs consistently better than inclusion

    • The additional complexity is only warranted once the total L1 cache size exceeds 25% of the L2 cache size (worked through with hypothetical sizes after this list).

  • Longer runs and more benchmarks would provide more conclusive evidence
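
As a worked example of that 25% threshold, using hypothetical per-core L1 sizes (the slides do not state which L1 sizes were simulated) together with the 2 MB L2 from the configuration slide:

```python
L2_SIZE_KB = 2 * 1024          # 2 MB shared L2 (from the configuration slide)
CORES = 4
L1_PER_CORE_KB = 64            # hypothetical: e.g. 32 KB I + 32 KB D per core

aggregate_l1_kb = CORES * L1_PER_CORE_KB      # 256 KB
ratio = aggregate_l1_kb / L2_SIZE_KB          # 0.125

# Below the 25% threshold reported above, strict inclusion is the simpler,
# adequate choice; doubling the per-core L1s (or the core count) to reach
# 512 KB of aggregate L1 pushes the ratio to 0.25, where the extra
# complexity of non-inclusion starts to pay off.
print(f"aggregate L1 = {aggregate_l1_kb} KB, L1/L2 ratio = {ratio:.1%}")
```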



Future Work

  • Ongoing: get the exclusion protocol working in the Ruby tester and Simics.

    • Current status: runs 500 memory transactions in the Ruby tester.

  • Run tests comparable to those run for non-inclusion

    • Analyze benefits of exclusion over inclusion.

  • Expand to 16 cores and study scalability issues.

