

To Include or Not to Include?

Natalie Enright

Dana Vantrease



Motivation

  • CMP technology affects coherence protocols differently than previously studied MP systems

    • New shared on-chip resources (e.g. L2)

    • Low latency between on-chip caches

    • Need for scalability in design

  • Industry Examples

    • IBM Power 4 – Inclusion

    • Piranha – Exclusion

  • Our goal: determine the design points at which each inclusion protocol (strict inclusion, non-inclusion, and exclusion) is the best choice for CMP performance.


SMP vs CMP Opportunities

[Diagram: cache organization in an SMP, where each processor has its own L1 and L2, vs. a CMP, where multiple per-core L1s share an on-chip L2]



Multilevel Inclusion

  • Protocol given to us with the simulator

    • L1 has Modified, Shared and Invalid States

    • L2 has Modified, Owned, Shared, and Invalid States

  • When an L2 line is replaced, any copies present on the chip must be invalidated (the sharers are given in the directory entry)

  • In a single processor chip, there are only 2 caches (Instruction and Data) connected to a single L2 cache

    • Chip multiprocessors add two more level-1 caches for each additional processor, which could make this forced inclusion harmful (back-invalidation is sketched after this list).
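
The back-invalidation cost of strict inclusion can be made concrete with a toy model. This is only an illustrative sketch, not the simulator's protocol: the class names, and the use of plain address sets in place of real tag arrays and MOSI state machines, are assumptions made for brevity.

```python
class ToyL1:
    """Minimal stand-in for a private L1 cache: just a set of block addresses."""

    def __init__(self):
        self.blocks = set()

    def fill(self, addr):
        self.blocks.add(addr)

    def invalidate(self, addr):
        self.blocks.discard(addr)


class InclusiveL2:
    """Toy inclusive L2: every block cached in an L1 is also tracked here,
    so replacing an L2 line forces invalidation of all on-chip L1 copies."""

    def __init__(self, l1s):
        self.l1s = l1s            # per-core ToyL1 instances
        self.sharers = {}         # addr -> set of L1 indices (the directory entry's sharer list)

    def fill(self, addr, l1_id):
        self.sharers.setdefault(addr, set()).add(l1_id)
        self.l1s[l1_id].fill(addr)

    def replace(self, addr):
        # Strict inclusion: the sharer list in the directory entry names every
        # L1 copy that must be back-invalidated when the L2 victim is evicted.
        for l1_id in self.sharers.pop(addr, set()):
            self.l1s[l1_id].invalidate(addr)


# With many cores, one L2 replacement can wipe the same hot block out of
# several L1s at once, which is the cost the following slides weigh.
l1s = [ToyL1() for _ in range(4)]
l2 = InclusiveL2(l1s)
for core in range(4):
    l2.fill(0x40, core)
l2.replace(0x40)
assert all(0x40 not in l1.blocks for l1 in l1s)
```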



Non-Inclusion

  • Protocol courtesy of Mike

  • L1 now has Owned and Exclusive states

  • Complexity of the on-chip directory increases significantly

    • States are added to indicate local level-1 sharers or a local level-1 owner.

    • L1 directory state must also be visible to external requests from other chips (see the sketch after this list)

  • Increased effective on-chip cache storage
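
A rough sketch of the extra directory state non-inclusion calls for. The entry layout and routing decisions below are assumptions for illustration only, not the actual Ruby protocol; they show just the idea that the directory must record local L1 sharers or an L1 owner and expose that state to requests arriving from other chips.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class NonInclusiveDirEntry:
    """Hypothetical directory entry for a non-inclusive hierarchy: the block
    may live only in some L1, so L1 sharer/owner state must be recorded."""

    in_l2: bool = False                      # under non-inclusion the L2 may not hold the block
    l1_owner: Optional[int] = None           # core whose L1 holds the block Owned/Exclusive
    l1_sharers: Set[int] = field(default_factory=set)

    def route_external_read(self):
        """Decide how to service a read request arriving from another chip."""
        if self.l1_owner is not None:
            return f"forward to the L1 of core {self.l1_owner}"
        if self.in_l2:
            return "respond with the L2 copy"
        if self.l1_sharers:
            return f"forward to a sharing L1 (e.g. core {min(self.l1_sharers)})"
        return "not on chip: fetch from memory"


entry = NonInclusiveDirEntry(in_l2=False, l1_owner=2)
print(entry.route_external_read())   # -> forward to the L1 of core 2
```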



Directory Exclusion

  • No replication of Data between a single L1 and the L2

    • L2 Acts as Large Victim Cache

    • Makes better use of on-chip cache space, lowering required off-chip bandwidth (victim-cache behavior is sketched after this list)

  • L2 is centralized coherency point (tag lookup)

  • L1 States: M, E, I, SC, SM

  • L2 States: M, E, I

  • No ownership – a data request is simply forwarded to the first sharer found in the tag lookup
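
A minimal sketch of the victim-cache behavior described above, assuming a block lives either in one L1 or in the L2 but never in both. The coherence states (M, E, I, SC, SM), L1-to-L1 transfers, and the duplicate-tag lookup are deliberately left out; everything here is illustrative.

```python
class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: the L2 only holds victims evicted from the L1s."""

    def __init__(self, num_cores):
        self.l1 = [set() for _ in range(num_cores)]   # per-core L1 contents
        self.l2 = set()                               # victim blocks only

    def l1_evict(self, core, addr):
        # An L1 victim is allocated into the L2 instead of being dropped,
        # so the L2 acts as a large victim cache.
        self.l1[core].discard(addr)
        self.l2.add(addr)

    def l1_miss(self, core, addr):
        if addr in self.l2:
            # Hit in the victim cache: the block moves up to the requesting L1
            # and leaves the L2, preserving "no replication" between the two.
            self.l2.discard(addr)
            self.l1[core].add(addr)
            return "L2 victim hit"
        # Otherwise fill the L1 directly from off-chip; the L2 is not allocated.
        self.l1[core].add(addr)
        return "off-chip fill"


h = ExclusiveHierarchy(num_cores=4)
h.l1_miss(0, 0x80)         # off-chip fill into core 0's L1
h.l1_evict(0, 0x80)        # victim lands in the L2
print(h.l1_miss(1, 0x80))  # -> L2 victim hit; the block now sits only in core 1's L1
```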


Directory Exclusion

[Diagram: per-core L1 caches grouped with an L2 that also holds duplicate L1 tags for the tag lookup]



Tag Lookup Cache

  • Aids in off-chip coherency and directing on-chip requests

  • Associativity = L1 associativity * # L1s

  • # Sets = #Sets in a single L1

  • # Data Entries = # L1s

  • Data entry = one presence bit per L1, indicating whether that L1 holds the block (1/0); the structure is sketched after this list.

  • Scalability?
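
The sizing rules above translate fairly directly into a data structure. The sketch below is an assumption-laden illustration (dictionaries per set instead of a real tag array, and no replacement policy), but it shows how a single lookup yields the presence bits for every L1.

```python
class TagLookupCache:
    """Sketch of the tag lookup structure: #sets = sets in one L1,
    associativity = L1 associativity * number of L1s, and each entry
    carries one presence bit per L1."""

    def __init__(self, num_l1s, l1_sets, l1_assoc):
        self.num_l1s = num_l1s
        self.assoc = l1_assoc * num_l1s               # ways per set
        self.sets = [dict() for _ in range(l1_sets)]  # set index -> {tag: presence bits}

    def insert(self, set_index, tag, l1_id):
        """Record that the given L1 now holds the block (no eviction modeled)."""
        bits = self.sets[set_index].setdefault(tag, [0] * self.num_l1s)
        bits[l1_id] = 1

    def remove(self, set_index, tag, l1_id):
        """Clear the presence bit when that L1 drops its copy."""
        if tag in self.sets[set_index]:
            self.sets[set_index][tag][l1_id] = 0

    def lookup(self, set_index, tag):
        """Return the list of L1s holding the block; used both to direct
        on-chip requests and to answer off-chip coherence probes."""
        bits = self.sets[set_index].get(tag, [])
        return [l1 for l1, present in enumerate(bits) if present]


tlc = TagLookupCache(num_l1s=8, l1_sets=128, l1_assoc=2)
tlc.insert(set_index=5, tag=0x1A, l1_id=3)
print(tlc.lookup(5, 0x1A))   # -> [3]: forward the data request to that L1
```

Note that both the associativity and the width of each entry grow linearly with the number of L1s, which is the scalability concern raised in the last bullet.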



Methodology

  • Vary the L1 cache size to find the design point at which an inclusive protocol hurts performance.

  • As the number of cores increases, so does the aggregate L1 cache size



Simulation Configuration

  • Configuration

    • 4 processors per chip and 1 chip

    • 2 MB of L2 cache

      • Small, but we wanted to see the effect of changing the ratio of L1 size to L2 size.

    • 16 processors per chip as future work

    • Only simulated one chip to isolate the effects of intra-chip coherence from inter-chip coherence

    • Future work: see how extending the life of a block on chip through non-inclusion or exclusion affects other chips.



Results

Inclusion vs. Non-Inclusion



Results (cont.)

Inclusion vs. Pseudo-Exclusion



Conclusion/Future Work

  • An inclusive protocol is less complex

    • Especially considering inter-chip communication

  • Non-Inclusion performs consistently better than inclusion

    • The additional complexity is only warranted once the total L1 cache size exceeds 25% of the L2 cache size (worked through with hypothetical sizes after this list).

  • Longer runs and more benchmarks would provide more conclusive evidence
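
As a worked example of that 25% threshold, using hypothetical per-core L1 sizes (the slides do not state which L1 sizes were simulated) together with the 2 MB L2 from the configuration slide:

```python
L2_SIZE_KB = 2 * 1024          # 2 MB shared L2 (from the configuration slide)
CORES = 4
L1_PER_CORE_KB = 64            # hypothetical: e.g. 32 KB I + 32 KB D per core

aggregate_l1_kb = CORES * L1_PER_CORE_KB      # 256 KB
ratio = aggregate_l1_kb / L2_SIZE_KB          # 0.125

# Below the 25% threshold reported above, strict inclusion is the simpler,
# adequate choice; doubling the per-core L1s (or the core count) to reach
# 512 KB of aggregate L1 pushes the ratio to 0.25, where the extra
# complexity of non-inclusion starts to pay off.
print(f"aggregate L1 = {aggregate_l1_kb} KB, L1/L2 ratio = {ratio:.1%}")
```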



Future Work

  • Ongoing: get the exclusion protocol working in the Ruby tester and Simics.

    • Current status: runs 500 memory transactions in the Ruby tester.

  • Run tests comparable to those run for non-inclusion

    • Analyze benefits of exclusion over inclusion.

  • Expand to 16 cores and study scalability issues.

