Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Karin Strauss, Xiaowei Shen*, Josep Torrellas. University of Illinois at Urbana-Champaign, *IBM Research. http://iacoma.cs.uiuc.edu

Presentation Transcript


  1. Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
     Karin Strauss, Xiaowei Shen*, Josep Torrellas
     University of Illinois at Urbana-Champaign, *IBM Research
     http://iacoma.cs.uiuc.edu

  2. Motivation
     • CMPs are becoming standard components
     • cheaper to build medium size machines
       • 32 to 128 cores (multi-CMP)
     • shared memory, cache coherent
       • easier to program, easier to manage
     • supporting cache coherence is difficult

  3. Cache coherence solutions

     strategy                 | ordered network? | pros     | cons
     snoopy broadcast bus     | yes              | simple   | difficult to scale
     directory based protocol | no               | scalable | indirection, extra hardware
     snoopy embedded ring     | no               | simple   | long latencies

     • other proposals (e.g. token coherence)

  4. Contributions
     • family of adaptive coherence protocols for rings
     • two were chosen as best options
       • Superset Aggressive: high performance scheme
       • Superset Conservative: energy conscious scheme
     [Figure: performance and energy consumption of each scheme compared to the fastest state-of-the-art scheme]

  5. Multi-CMP multiprocessor
     [Figure: CMP nodes, each with Proc + L1 + L2, memory, and a local network]
     • coherence protocol used: only one supplier if line is cached

  6. Ring in action
     [Figure: a request travelling around the ring from R to S under Oracle, Lazy, and Eager; Oracle uses a supplier predictor; legend: cmp, snoop, request, response, data]

  7. Ring in action
     [Figure: Oracle, Lazy, and Eager compared on latency, snoops, and messages]
     • goal: adaptive schemes that approximate Oracle’s behavior
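The trade-off among Lazy, Eager, and Oracle can be illustrated with a toy cost model. This is only a sketch under assumptions that are not in the slides: the function name `service_miss`, the per-hop and per-snoop costs, and the simplification that Eager sends one trailing response message behind the request are all hypothetical.

```python
def service_miss(scheme, n_nodes, supplier_hops, hop=1, snoop=1):
    """Toy cost model for one miss on a unidirectional ring.

    Assumptions (illustrative, not from the slides):
    - the requester is node 0 and the supplier is `supplier_hops` hops downstream
    - `hop` is the cost of one link traversal, `snoop` the cost of one snoop
    - `latency` is the time until the supplier has finished its snoop
    """
    if not 1 <= supplier_hops < n_nodes:
        raise ValueError("supplier must be another node on the ring")

    if scheme == "lazy":
        # Snoop, then forward: every node up to the supplier snoops on the
        # critical path, and a single combined message circles the ring.
        snoops = supplier_hops
        message_hops = n_nodes
        latency = supplier_hops * (hop + snoop)
    elif scheme == "eager":
        # Forward, then snoop: the request races ahead, every other node
        # snoops, and a separate response message trails the request.
        snoops = n_nodes - 1
        message_hops = n_nodes + (n_nodes - 1)
        latency = supplier_hops * hop + snoop
    elif scheme == "oracle":
        # Perfect knowledge: only the supplier snoops, one message circles.
        snoops = 1
        message_hops = n_nodes
        latency = supplier_hops * hop + snoop
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    return {"snoops": snoops, "message_hops": message_hops, "latency": latency}


if __name__ == "__main__":
    for scheme in ("lazy", "eager", "oracle"):
        print(scheme, service_miss(scheme, n_nodes=8, supplier_hops=3, hop=2, snoop=5))
```

In this model Oracle matches Eager's latency while matching or beating Lazy's snoop and message counts, which is the behavior the adaptive schemes try to approximate.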

  8. Primitive snooping actions
     • snoop and then forward
       + fewer messages
     • forward and then snoop
       + shorter latency
     • forward only
       + fewer snoops
       + shorter latency
       – false negative predictions not allowed

  9. Predictors and algorithms
     [Figure: the set of addresses the node can supply versus the set of addresses in the predictor]

     predictor / algorithm | action on negative prediction | action on positive prediction
     Subset                | forward then snoop            | snoop
     Superset Con          | forward                       | snoop then forward
     Superset Agg          | forward                       | forward then snoop
     Exact                 | forward                       | snoop
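The predictor-to-action mapping on this slide can be written as a small lookup table. The sketch below is one reading of the flattened slide table, so the pairing of actions with algorithms is an assumption; the names `Action`, `ACTIONS`, and `choose_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    SNOOP = "snoop"
    FORWARD = "forward"
    SNOOP_THEN_FORWARD = "snoop then forward"
    FORWARD_THEN_SNOOP = "forward then snoop"


# Assumed reading of the slide: key True = predictor says this node can
# supply the line (positive prediction), key False = negative prediction.
ACTIONS = {
    # Subset may give false negatives, so a negative still has to snoop.
    "Subset": {True: Action.SNOOP, False: Action.FORWARD_THEN_SNOOP},
    # Superset predictors have no false negatives, so a negative can skip the snoop.
    "SupersetCon": {True: Action.SNOOP_THEN_FORWARD, False: Action.FORWARD},
    "SupersetAgg": {True: Action.FORWARD_THEN_SNOOP, False: Action.FORWARD},
    # Exact has neither false positives nor false negatives.
    "Exact": {True: Action.SNOOP, False: Action.FORWARD},
}


def choose_action(algorithm, predicts_supplier):
    """Return the primitive action a node takes for an incoming snoop request."""
    return ACTIONS[algorithm][predicts_supplier]


if __name__ == "__main__":
    for name in ACTIONS:
        print(name, choose_action(name, True).value, "/", choose_action(name, False).value)
```

Under this reading, the two Superset algorithms differ only in the action taken on a positive prediction.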

  10. Algorithms
      [Figure: per miss service, number of snoops plotted against number of messages and snoop message latency for Lazy, Eager, Subset, SupersetCon, SupersetAgg, and Oracle / Exact]

  11. Predictor implementation
      • Subset
        • associative table: subset of addresses that can be supplied by node
      • Superset
        • bloom filter: superset of addresses that can be supplied by node
        • associative table (exclude cache): addresses that recently suffered false positives
      • Exact
        • associative table: all addresses that can be supplied by node
        • downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
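A Superset predictor along these lines might look like the sketch below. The slide only names a bloom filter and an exclude cache. The counting variant (so that lines can be removed), the hash choice, the table sizes, the LRU replacement of the exclude cache, and every class and method name are assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict


class SupersetPredictor:
    """Sketch of a superset predictor: counting Bloom filter plus exclude cache."""

    def __init__(self, n_counters=1024, n_hashes=3, exclude_entries=16):
        self.counters = [0] * n_counters      # counting Bloom filter (assumed variant)
        self.n_hashes = n_hashes
        self.exclude = OrderedDict()          # addresses with recent false positives
        self.exclude_entries = exclude_entries

    def _slots(self, addr):
        # Derive n_hashes counter indices from the address (hash choice is arbitrary).
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % len(self.counters)

    def add(self, addr):
        # The node can now supply addr; drop any stale exclude entry so the
        # predictor never produces a false negative.
        self.exclude.pop(addr, None)
        for slot in self._slots(addr):
            self.counters[slot] += 1

    def remove(self, addr):
        # Call only for addresses previously added.
        for slot in self._slots(addr):
            self.counters[slot] -= 1

    def predicts_supplier(self, addr):
        if addr in self.exclude:
            return False
        return all(self.counters[slot] > 0 for slot in self._slots(addr))

    def record_false_positive(self, addr):
        # Remember addr so the next request for it is filtered (LRU replacement assumed).
        self.exclude[addr] = True
        self.exclude.move_to_end(addr)
        if len(self.exclude) > self.exclude_entries:
            self.exclude.popitem(last=False)
```

A node would call `record_false_positive` after a snoop triggered by a positive prediction finds that it cannot supply the line, so repeated requests for that address are no longer snooped.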

  12. Downgrading
      [Figure: example with cache lines A and B; labels E and S]
      Negative effects:
      • writes by this node need to snoop other nodes
      • reads and writes by other nodes need to fetch line from memory
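The link between a bounded predictor table and downgrading can be sketched as follows. This is a minimal illustration, assuming LRU replacement and a caller-supplied `downgrade` callback; neither is specified in the slides, and the names `ExactPredictor` and `downgrade` are hypothetical.

```python
from collections import OrderedDict


class ExactPredictor:
    """Sketch of an exact predictor: a bounded table of every suppliable address."""

    def __init__(self, capacity, downgrade):
        self.capacity = capacity
        self.downgrade = downgrade    # hypothetical hook into the cache controller
        self.table = OrderedDict()

    def add(self, addr):
        # Track addr as suppliable by this node.
        self.table[addr] = True
        self.table.move_to_end(addr)
        if len(self.table) > self.capacity:
            # The table must stay exact, so the evicted address's line is
            # downgraded and the node stops being its supplier.
            victim, _ = self.table.popitem(last=False)
            self.downgrade(victim)

    def remove(self, addr):
        self.table.pop(addr, None)

    def predicts_supplier(self, addr):
        return addr in self.table


if __name__ == "__main__":
    downgraded = []
    predictor = ExactPredictor(capacity=2, downgrade=downgraded.append)
    for line in ("A", "B", "C"):
        predictor.add(line)
    print(downgraded)
```

Each forced downgrade keeps the predictor exact at the cost of the negative effects listed on the slide.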

  13. Experiments
      • 8 CMPs, 4 out-of-order cores each = 32 cores
        • private L2 caches
        • on-chip bus interconnect
      • off-chip 2D torus interconnect with embedded unidirectional ring
      • per node predictors: latency of 3 processor cycles
      • SESC simulator (sesc.sourceforge.net)
      • SPLASH-2, SPECjbb, SPECweb

  14. Execution time
      [Figure: normalized execution time (axis 0 to 1.2) of Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, and Exact on SPLASH-2, SPECjbb, and SPECweb]
      • performance of most flexible snooping algorithms is similar to Eager
      • the fastest of all algorithms is SupersetAgg

  15. Miss service energy
      [Figure: normalized energy consumption (axis 0 to 2, one value annotated as 3.22) of Lazy, Eager, Oracle, Subset, SupersetCon, SupersetAgg, and Exact on SPLASH-2, SPECjbb, and SPECweb]
      • algorithms that eagerly forward messages use more energy
      • SupersetCon is the least energy-hungry algorithm

  16. Most cost-effective algorithms
      [Figures: the normalized execution time and normalized energy consumption charts from the two previous slides, shown together]
      • high performance: Superset Aggressive
        • faster than Eager at lower energy consumption
      • energy conscious: Superset Conservative
        • slightly slower than Eager at much lower energy consumption

  17. Most cost-effective algorithms
      [Figure: performance and energy consumption of each scheme compared to the fastest state-of-the-art scheme (Eager)]
      • Superset Aggressive: high performance scheme
      • Superset Conservative: energy conscious scheme
      • can be combined by only changing forwarding policy

  18. Conclusions
      • proposed flexible snooping, a family of adaptive protocols for embedded rings
      • two chosen protocols
        • high performance: Superset Aggressive
        • energy conservation: Superset Conservative
        • can be selected dynamically
      • embedded-ring protocols more attractive

  19. Arch map http://iacoma.cs.uiuc.edu/students/archmap/archmap.html Google: architecture conference map (1st hit)

  20. Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
      Karin Strauss, Xiaowei Shen*, Josep Torrellas
      University of Illinois at Urbana-Champaign, *IBM Research
      http://iacoma.cs.uiuc.edu
