Time-based Snoop Filtering in Chip Multiprocessors. Amirali Baniasadi, Iman Faraji. University of Victoria, Victoria, Canada; Amirkabir University of Technology, Tehran, Iran. This work: reducing redundant snoops in chip multiprocessors.


Presentation Transcript


  1. Time-based Snoop Filtering in Chip Multiprocessors. Amirali Baniasadi, Iman Faraji. University of Victoria, Victoria, Canada; Amirkabir University of Technology, Tehran, Iran.

  2. This work: reducing redundant snoops in chip multiprocessors. Our goal: improving the energy efficiency of write-through (WT) based CMPs. Our motivation: there are long time intervals where snooping fails, wasting energy and bandwidth. Our solution: detect such intervals and avoid snoops. Key results: memory energy 18%, snoop traffic 93%, performance 3.8%.

  3. Conventional Snooping. [Figure: four CPUs, each with a data cache (D$), connected through an interconnect and a controller, with numbered steps 1 to 6.] Redundant (miss): ~70%.

  4. WB vs. WT (write-back vs. write-through): relative memory energy consumption.

  5. Previous Work: Snoop Filters. Snoop filters eliminate redundant snoop requests, both local and global. Local: one core fails to provide the data. Global: all cores fail. Examples: RegionScout detects memory regions that are not shared (Moshovos); Selective Snoop Request predicts the supplier (Atoofian & Baniasadi); Serial Snooping requests nodes one by one (Saldanha & Lipasti). A good snoop filter is fast and simple, accurate and effective.

  6. Our Work: Time-based Snoop Filtering. Motivation: there are long intervals where snooping fails consecutively. But how long, and how often?

  7. Our Work (Cont.)

  8. Our Work (Cont.) Global Read Miss (GRM): occurs whenever the last snoop by all processors fails. Local Read Miss (LRM): a redundant snoop, occurring when the snoop by a single processor fails.
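The two miss types on slide 8 can be illustrated with a small sketch. It assumes that each broadcast read snoop yields one hit/miss result per snooped node; the function name `classify_snoop` and the boolean encoding are hypothetical and do not come from the slides.

```python
from typing import Sequence, Tuple


def classify_snoop(hits: Sequence[bool]) -> Tuple[int, bool]:
    """Classify the outcome of one broadcast read snoop.

    hits[i] is True if snooped node i found the requested block in its cache.
    Returns (number of local read misses, True if this is a global read miss).
    Hypothetical helper: the encoding is an assumption, not the authors' design.
    """
    # Every node that fails to supply the data contributes one LRM.
    local_misses = sum(1 for hit in hits if not hit)
    # A GRM requires at least one snooped node and a miss in all of them.
    global_miss = len(hits) > 0 and local_misses == len(hits)
    return local_misses, global_miss


# Quad-core example: the requester snoops the three other caches.
print(classify_snoop([False, True, False]))   # (2, False)
print(classify_snoop([False, False, False]))  # (3, True)
```

Under this reading, every GRM is also a set of LRMs, one per snooped node.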

  9. Distribution: (a) LRM distribution for different processors; (b) GRM distribution. Periods of data scarcity are usually long.

  10. Time-based Global Miss predictor (TGM). TGM goals: detect GRM intervals and shut down snooping in all processors but one (the surviving node). TGM types: TGM-First, where the first processor that has failed snooping survives; TGM-Last, where the last processor that has failed snooping survives.
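The difference between TGM-First and TGM-Last can be sketched as a choice of surviving node. The sketch assumes the predictor records the order in which nodes fail snooping; the function `select_survivor`, its arguments, and the way the order is recorded are hypothetical, since the slides do not describe the detection hardware.

```python
from typing import List, Set, Tuple


def select_survivor(fail_order: List[int], policy: str) -> Tuple[int, Set[int]]:
    """Pick the surviving node for a detected GRM interval.

    fail_order lists node ids in the order they were seen to fail snooping
    (assumed input, not specified in the slides).
    policy is "first" (TGM-First) or "last" (TGM-Last).
    Returns (surviving node, set of nodes whose snooping is shut down).
    """
    if not fail_order:
        raise ValueError("no failed nodes recorded")
    if policy == "first":
        survivor = fail_order[0]
    elif policy == "last":
        survivor = fail_order[-1]
    else:
        raise ValueError(f"unknown policy: {policy}")
    # Every other node stops snooping; only the survivor keeps checking.
    disabled = set(fail_order) - {survivor}
    return survivor, disabled


# Quad-core example in which the nodes failed in the order 2, 0, 3, 1.
print(select_survivor([2, 0, 3, 1], "first"))  # (2, {0, 1, 3})
print(select_survivor([2, 0, 3, 1], "last"))   # (1, {0, 2, 3})
```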

  11. TGM implementation: TGM-enhanced CMP.

  12. TGM: (a) coverage; (b) accuracy.

  13. Time-based Local Miss predictor (TLM). Goal: detect LRMs. How? Count consecutive snoop misses in a node; disable snooping when the count exceeds a threshold; restart snooping after a number of cycles.

  14. TLM implementation: TGM-enhanced CMP. [Figure labels, for each processor: Processing Unit (PU), first-level cache, predictor, Redundant SNoop (RSN) counter, ReStarT (RST) counter.]
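The TLM steps on slides 13 and 14 can be sketched as a per-node state machine with the two counters named on the slide. This is a minimal sketch, not the authors' hardware: the class name `TLMPredictor`, the parameter names, and the example values are hypothetical, and the slides give no concrete threshold or restart interval.

```python
class TLMPredictor:
    """Per-node sketch of the Time-based Local Miss predictor.

    rsn counts consecutive redundant snoops (the RSN counter).
    rst counts the cycles left until snooping restarts (the RST counter).
    miss_threshold and restart_cycles are assumed tuning parameters.
    """

    def __init__(self, miss_threshold: int, restart_cycles: int) -> None:
        self.miss_threshold = miss_threshold
        self.restart_cycles = restart_cycles
        self.rsn = 0
        self.rst = 0

    def snoop_enabled(self) -> bool:
        # Snooping is filtered while the restart counter is running.
        return self.rst == 0

    def tick(self) -> None:
        # Advance one cycle; snooping restarts when rst reaches zero.
        if self.rst > 0:
            self.rst -= 1

    def record_snoop(self, hit: bool) -> None:
        # Update the predictor with the result of a snoop this node performed.
        if not self.snoop_enabled():
            return  # the snoop was filtered, so there is nothing to learn
        if hit:
            self.rsn = 0  # a useful snoop ends the run of consecutive misses
            return
        self.rsn += 1
        if self.rsn > self.miss_threshold:
            # Too many consecutive misses: disable snooping for a while.
            self.rsn = 0
            self.rst = self.restart_cycles


predictor = TLMPredictor(miss_threshold=2, restart_cycles=3)
for _ in range(3):
    predictor.record_snoop(hit=False)
print(predictor.snoop_enabled())  # False
for _ in range(3):
    predictor.tick()
print(predictor.snoop_enabled())  # True
```

A larger threshold makes the predictor more conservative, while a longer restart interval filters more snoops at the risk of skipping one that would have hit.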

  15. TLM features: (a) coverage; (b) accuracy.

  16. Methodology. Simulator: SESC. Benchmarks: SPLASH-2. Energy evaluation: CACTI 6.5. System used: quad-core CMP. [Tables: system parameters; SPLASH-2 benchmarks and input parameters.]

  17. Relative Snoop Traffic Reduction: TGM-F 58%, TGM-L 57%, TLM 77%.

  18. Relative Memory Energy: TGM-F 8%, TGM-L 8.5%, TLM 11%.

  19. Relative Memory Delay: TGM-F 1.1%, TGM-L 2.1%, TLM 1.7%.

  20. Relative Performance: TGM-F no change, TGM-L 0.4%, TLM 0.3%.

  21. Summary
      • We showed:
        • Long data scarcity periods (DSPs) exist during workload runtime.
        • During DSPs, redundant snoops happen frequently and consecutively.
      • Our solutions:
        • TGM: uses snoop behavior on all processors to detect and filter redundant snoops; shuts down snooping on as many processors as possible.
        • TLM: filters redundant snoops in a single node; counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops.
      • Simulation results:
        • Snoop reduction: TGM-F 58%, TGM-L 57%, TLM 77%.
        • Memory energy: TGM-F 8%, TGM-L 8.5%, TLM 11%.
        • Memory delay: TGM-F 1.1%, TGM-L 2.1%, TLM 1.7%.
        • Performance: TGM-F no change, TGM-L 0.4%, TLM 0.3%.

  22. Thanks for your attention

  23. Backup Slides

  24. Discussion: How do the characteristics of the benchmarks affect the memory energy and delay reductions achieved by our solution? (1) True detection of redundant snoops. (2) Share of redundant snoops.

  25. Memory Energy/Delay. Memory energy = energy consumed to provide the requested data. Memory delay = time required to provide the requested data.

  26. Volrend Benchmark. While running, Volrend rarely sends snoop requests. This application renders a three-dimensional volume. It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary only slightly in viewpoint. High temporal locality. Volrend does load distribution very well. High spatial locality.
