
Coarse-Grained Coherence

Presentation Transcript


  1. Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with: Jason Cantin, IBM (Ph.D. ’06) Natalie Enright Jerger Prof. Jim Smith Prof. Li-Shiuan Peh (Princeton) http://www.ece.wisc.edu/~pharm

  2. Motivation • Multiprocessors are commonplace • Historically, glass house servers • Now laptops, soon cell phones • Most common multiprocessor • Symmetric processors w/coherent caches • Logical extension of time-shared uniprocessors • Easy to program, reason about • Not so easy to build Mikko Lipasti-University of Wisconsin

  3. Coherence Granularity [Figure: processors P0–P7] • Track each individual word • Too much overhead • Track larger blocks • 32B – 128B common • Less overhead, exploit spatial locality • Large blocks cause false sharing • Solution: use multiple granularities • Small blocks: manage local read/write permissions • Large blocks: track global behavior Mikko Lipasti-University of Wisconsin

  4. Coarse-Grained Coherence • Initially • Identify non-shared regions • Decouple obtaining coherence permission from data transfer • Filter snoops to reduce broadcast bandwidth • Later • Enable aggressive prefetching • Optimize DRAM accesses • Customize protocol, interconnect to match Mikko Lipasti-University of Wisconsin

  5. Coarse-Grained Coherence • Optimizations lead to • Reduced memory miss latency • Reduced cache-to-cache miss latency • Reduced snoop bandwidth • Fewer exposed cache misses • Elimination of unnecessary DRAM reads • Power savings on bus, interconnect, caches, and in DRAM • World peace and end to global warming Mikko Lipasti-University of Wisconsin

  6. Coarse-Grained Coherence Tracking • Memory is divided into coarse-grained regions • Aligned, power-of-two multiple of cache line size • Can range from two lines to a physical page • A cache-like structure is added to each processor for monitoring coherence at the granularity of regions • Region Coherence Array (RCA) Mikko Lipasti-University of Wisconsin

  7. Region Coherence Arrays • Each entry has an address tag, state, and count of lines cached by the processor • The region state indicates if the processor and/or other processors are sharing/modifying lines in the region • Customize policy/protocol/interconnect to exploit region state Mikko Lipasti-University of Wisconsin
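The RCA entry described above can be sketched in a few lines. This is a minimal software model, not the hardware design from the talk: the class names, the 1KB region size, and the dict-based lookup (a set-associative array in hardware) are all illustrative assumptions; only the per-entry fields (tag, state, line count) come from the slide.

```python
LINE_BITS = 6                                 # assume 64B cache lines
REGION_BITS = 10                              # assume 1KB regions
LINES_PER_REGION = 1 << (REGION_BITS - LINE_BITS)  # 16 lines per region

class RCAEntry:
    def __init__(self, tag):
        self.tag = tag          # region address tag
        self.state = "invalid"  # e.g. "non-shared", "shared", "externally-dirty"
        self.line_count = 0     # lines from this region cached locally

class RCA:
    def __init__(self):
        self.entries = {}       # tag -> RCAEntry (set-associative in hardware)

    def region_tag(self, addr):
        return addr >> REGION_BITS

    def lookup(self, addr):
        # Returns the region entry for this address, or None on an RCA miss.
        return self.entries.get(self.region_tag(addr))

    def on_line_fill(self, addr, state):
        # A cache line from this region was filled: bump the line count
        # and record the region state observed from the snoop response.
        tag = self.region_tag(addr)
        entry = self.entries.setdefault(tag, RCAEntry(tag))
        entry.state = state
        entry.line_count += 1
        return entry
```

Two fills to addresses 0x1000 and 0x1040 land in the same 1KB region, so one entry ends up with a line count of 2; the count lets the hardware know when the last line of a region leaves the cache.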

  8. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  9. Unnecessary Broadcasts Mikko Lipasti-University of Wisconsin

  10. Broadcast Snoop Reduction • Identify requests that don’t need a broadcast • Send data requests directly to memory w/o broadcasting • Reducing broadcast traffic • Reducing memory latency • Avoid sending non-data requests externally Example Mikko Lipasti-University of Wisconsin
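The filtering decision on this slide reduces to a small predicate: a miss to a region the RCA knows no other processor is caching can go straight to memory. A hedged sketch, where the state names and the two-way routing outcome are assumptions layered on the slide's description:

```python
def route_request(region_state):
    """Decide how to route a cache miss.

    region_state: the RCA state for the request's region,
    or None if the region misses in the RCA.
    """
    if region_state == "non-shared":
        # No other processor caches lines from this region:
        # skip the broadcast and request data directly from memory.
        return "direct-to-memory"
    # Unknown or shared region: a snoop broadcast is still required.
    return "broadcast"
```

The win is twofold, matching the slide: broadcast traffic drops (no snoop sent), and memory latency drops (the DRAM request is not delayed behind snoop ordering).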

  11. Simulator Evaluation PHARMsim: near-RTL but written in C • Execution-driven simulator built on top of SimOS-PPC • Four 4-way superscalar out-of-order processors • Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines • Separate address/data networks, similar to Sun Fireplane Mikko Lipasti-University of Wisconsin

  12. Workloads • Scientific • Ocean, Raytrace, Barnes • Multiprogrammed • SPECint2000_rate, SPECint95_rate • Commercial (database, web) • TPC-W, TPC-B, TPC-H • SPECweb99, SPECjbb2000 Mikko Lipasti-University of Wisconsin

  13. Broadcasts Avoided Mikko Lipasti-University of Wisconsin

  14. Execution Time Mikko Lipasti-University of Wisconsin

  15. Summary • Eliminates nearly all unnecessary broadcasts • Reduces snoop activity by 65% • Fewer broadcasts • Fewer lookups • Provides modest speedup Mikko Lipasti-University of Wisconsin

  16. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  17. Prefetching in Multiprocessors • Prefetching • Anticipate future reference, fetch into cache • Many prefetching heuristics possible • Current systems: next-block, stride • Proposed: skip pointer, content-based • Some/many prefetched blocks are not used • Multiprocessor complications • Premature or unnecessary prefetches • Permission thrashing if blocks are shared • Separate study [ISPASS 2006] Mikko Lipasti-University of Wisconsin

  18. Stealth Prefetching Lines from non-shared regions can be prefetched stealthily and efficiently • Without disturbing other processors • Without downgrades, invalidations • Without preventing them from obtaining exclusive copies • Without broadcasting prefetch requests • Fetched from DRAM with low overhead Example Mikko Lipasti-University of Wisconsin

  19. Stealth Prefetching • After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched • These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer) • After accessing the RCA, requests may obtain data from the buffer as they would from memory • To access data, region must be in valid state and a broadcast unnecessary for coherent access Mikko Lipasti-University of Wisconsin
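The trigger logic above can be sketched as a small per-region miss counter. The threshold of 2 is from the slide; the 16-line region size, the class name, and the return-value convention are illustrative assumptions, and the stealth guard (never prefetch from shared regions) reflects the previous slide's requirement.

```python
REGION_LINES = 16   # assumed lines per region (e.g. 1KB region, 64B lines)
THRESHOLD = 2       # slide: prefetch the rest after 2 L2 misses to a region

class StealthPrefetcher:
    def __init__(self):
        self.missed = {}   # region tag -> set of line indices that missed
        self.buffer = {}   # region tag -> lines held in the prefetch buffer

    def on_l2_miss(self, region_tag, line_idx, region_non_shared):
        """Returns the line indices to prefetch (possibly empty)."""
        if not region_non_shared:
            # Stealth requirement: shared regions are never prefetched,
            # so other processors are never disturbed or downgraded.
            return []
        seen = self.missed.setdefault(region_tag, set())
        seen.add(line_idx)
        if len(seen) == THRESHOLD:
            # Piggyback on this demand miss: fetch the remaining lines of
            # the region into the Stealth Data Prefetch Buffer.
            lines = [i for i in range(REGION_LINES) if i not in seen]
            self.buffer[region_tag] = set(lines)
            return lines
        return []
```

A later L2 miss that hits in the buffer (with the region still valid and broadcast-free per the RCA) is satisfied locally, as if it had come from memory.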

  20. L2 Misses Prefetched Mikko Lipasti-University of Wisconsin

  21. Speedup Mikko Lipasti-University of Wisconsin

  22. Summary Stealth Prefetching can prefetch data: • Stealthily: • Only non-shared data prefetched • Prefetch requests not broadcast • Aggressively: • Large regions prefetched at once, 80-90% timely • Efficiently: • Piggybacked onto a demand request • Fetched from DRAM in open-page mode Mikko Lipasti-University of Wisconsin

  23. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  24. Power-Efficient DRAM Speculation • Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response • Trading DRAM bandwidth for latency • Wasting power • Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily [Timeline: Broadcast Req → Snoop Tags → Send Resp, overlapped with DRAM Read → Xmit Block] Mikko Lipasti-University of Wisconsin

  25. DRAM Operations Mikko Lipasti-University of Wisconsin

  26. Power-Efficient DRAM Speculation • Direct memory requests are non-speculative • Lines from externally-dirty regions likely to be sourced from another processor’s cache • Region state can serve as a prediction • Need not access DRAM speculatively • Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches Mikko Lipasti-University of Wisconsin
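The prediction described above is essentially a three-way lookup on region state. A sketch under assumed state names (the decision structure follows the slide; the exact states and the "speculate on unknown" default are illustrative):

```python
def speculate_dram_read(region_state):
    """Should the memory controller read DRAM before the snoop response?

    region_state: RCA state for the request's region, or None on RCA miss.
    """
    if region_state == "non-shared":
        # Direct memory request: no broadcast at all, so the DRAM
        # read is non-speculative and always useful.
        return True
    if region_state == "externally-dirty":
        # Another processor likely holds modified data and will source
        # it cache-to-cache; a speculative DRAM read would be wasted.
        return False
    # Initial request to a region (state unknown): speculate, since the
    # probability of a cache-to-cache transfer is lower but nonzero.
    return True
```

Suppressing only the "externally-dirty" case is what lets the technique cut useless DRAM reads while giving up almost no latency on reads that memory would have served anyway.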

  27. Useless DRAM Reads Mikko Lipasti-University of Wisconsin

  28. Useful DRAM Reads Mikko Lipasti-University of Wisconsin

  29. DRAM Reads Performed/Delayed Mikko Lipasti-University of Wisconsin

  30. Summary Power-Efficient DRAM Speculation: • Can reduce DRAM reads by 20%, with less than 1% degradation in performance • 7% slowdown with non-speculative DRAM • Nearly doubles the interval between DRAM requests, allowing modules to stay in low-power modes longer Mikko Lipasti-University of Wisconsin

  31. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  32. Chip Multiprocessor Interconnect • Options • Buses: don’t scale • Crossbars: too expensive • Rings: too slow • Packet-switched mesh • Attractive for all the same 1990’s DSM reasons • Scalable • Low latency • High link utilization Mikko Lipasti-University of Wisconsin

  33. CMP Interconnection Networks • But… • Cables/traces are now on-chip wires • Fast, cheap, plentiful • Short: 1 cycle per hop • Router latency adds up • 3-4 cycles per hop • Store-and-forward • Lots of activity/power • Is this the right answer? Mikko Lipasti-University of Wisconsin

  34. Circuit-Switched Interconnects • Communication patterns • Spatial locality to memory • Pairwise communication • Circuit-switched links • Avoid switching/routing • Reduce latency • Save power? • Poor utilization! Maybe OK Mikko Lipasti-University of Wisconsin

  35. Router Design • Switches consist of • Configurable crossbar • Configuration memory • 4-stage router pipeline exposes only 1 cycle if CS • Can also act as packet-switched network • Design details in [CA Letters ‘07] Mikko Lipasti-University of Wisconsin
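The latency argument on this slide fits a toy per-hop model. The 1 exposed cycle on an established circuit is from the slide; the 4-cycle packet-switched pipeline matches the "3-4 cycles per hop" range quoted earlier, and the 1-cycle link traversal is an assumption:

```python
PS_CYCLES_PER_HOP = 4   # full router pipeline, packet-switched (assumed 4 of 3-4)
CS_CYCLES_PER_HOP = 1   # pre-configured crossbar: only 1 cycle exposed (slide)
LINK_CYCLES = 1         # assumed wire traversal per hop

def network_latency(hops, circuit_established):
    """Cycles to cross `hops` routers in this toy model."""
    per_hop = CS_CYCLES_PER_HOP if circuit_established else PS_CYCLES_PER_HOP
    return hops * (per_hop + LINK_CYCLES)
```

At 4 hops the model gives 8 cycles on a circuit versus 20 packet-switched, which is why the gap between the two modes widens with physical distance.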

  36. Protocol Optimization • Initial 3-hop miss establishes CS path • Subsequent miss requests • Sent directly on CS path to predicted owner • Also in parallel to home node • Predicted owner sources data early • Directory acks update to sharing list • Benefits • Reduced 3-hop latency • Less activity, less power Mikko Lipasti-University of Wisconsin
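The optimized miss flow above can be summarized as the set of messages a miss emits. A hedged sketch: the message tuples and node naming are hypothetical; the two-destination parallel send (predicted owner on the circuit, home node for the directory update) is what the slide describes.

```python
def miss_messages(has_circuit, predicted_owner, home_node):
    """Messages sent on an L2 miss under the co-designed protocol."""
    if has_circuit:
        return [
            ("request", predicted_owner),  # early data on the CS path
            ("request", home_node),        # directory updates sharers, acks
        ]
    # No circuit yet: ordinary 3-hop miss through the home node,
    # which establishes the circuit for subsequent misses.
    return [("request", home_node)]
```

The predicted owner sourcing data early is what shaves the 3-hop latency; the parallel home-node request keeps the directory correct even when the prediction is wrong.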

  37. Hybrid Circuit Switching (1) • Hybrid Circuit Switching improves performance by up to 7% Mikko Lipasti-University of Wisconsin

  38. Hybrid Circuit Switching (2) • Positive interaction in co-designed interconnect & protocol • More circuit reuse => greater latency benefit Mikko Lipasti-University of Wisconsin

  39. Summary Hybrid Circuit Switching: • Routing overhead eliminated • Still enable high bandwidth when needed • Co-designed protocol • Optimize cache-to-cache transfers • Substantial performance benefits • To do: power analysis Mikko Lipasti-University of Wisconsin

  40. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  41. Server Consolidation on CMPs • CMP as consolidation platform • Simplify system administration • Save power, cost, and physical infrastructure • Study combinations of individual workloads in a full-system environment • Micro-coded hypervisor schedules VMs • See An Evaluation of Server Consolidation Workloads for Multi-Core Designs in IISWC 2007 for additional details • Nugget: shared LLC a big win Mikko Lipasti-University of Wisconsin

  42. Virtual Proximity • Interactions between VM scheduling, placement, and interconnect • Goal: placement-agnostic scheduling • Best workload balance • Evaluate 3 scheduling policies • Gang, Affinity, and Load Balanced • HCS provides virtual proximity Mikko Lipasti-University of Wisconsin

  43. Scheduling Algorithms • Gang Scheduling • Co-schedules all threads of a VM • No idle-cycle stealing • Affinity Scheduling • VMs assigned to neighboring cores • Can steal idle cycles across VMs sharing core • Load Balanced Scheduling • Ready threads assigned to any core • Any/all VMs can steal idle cycles • Over time, VM fragments across chip Mikko Lipasti-University of Wisconsin
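The three policies above can be contrasted with toy assignment functions. This is purely illustrative (the hypervisor in the talk is micro-coded, not Python): the return shapes, contiguous-block affinity mapping, and round-robin load balancing are assumptions that capture each policy's defining property from the slide.

```python
def gang_schedule(vms, n_cores):
    # One VM per timeslice: all of its threads run together on cores 0..k;
    # remaining cores sit idle (no idle-cycle stealing across VMs).
    return [[(v, t, i) for i, t in enumerate(threads)]
            for v, threads in enumerate(vms)]

def affinity_schedule(vms, n_cores):
    # Each VM pinned to a contiguous block of neighboring cores; VMs
    # sharing a core's block can steal each other's idle cycles.
    per_vm = n_cores // len(vms)
    return [(v, t, v * per_vm + i % per_vm)
            for v, threads in enumerate(vms)
            for i, t in enumerate(threads)]

def load_balanced_schedule(vms, n_cores):
    # Any ready thread runs on any core; over time a VM's threads
    # fragment across the chip.
    flat = [(v, t) for v, threads in enumerate(vms) for t in threads]
    return [(v, t, i % n_cores) for i, (v, t) in enumerate(flat)]
```

Note how affinity keeps each VM spatially compact (good when hops are expensive) while load balancing fills every core (best utilization); HCS's virtual proximity is what makes the fragmented load-balanced placement affordable.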

  44. • Load balancing wins with fast interconnect • Affinity scheduling wins with slow interconnect • HCS creates virtual proximity Mikko Lipasti-University of Wisconsin

  45. Virtual Proximity Performance • HCS able to provide virtual proximity Mikko Lipasti-University of Wisconsin

  46. As physical distance (hop count) increases, HCS provides significantly lower latency Mikko Lipasti-University of Wisconsin

  47. Summary Virtual Proximity [in submission] • Enables placement-agnostic hypervisor scheduler • Results: • Up to 17% better than affinity scheduling • Idle cycle reduction: 84% over gang and 41% over affinity • Low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing • L2 misses up by 10%, but execution time reduced by 11% • A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7% Mikko Lipasti-University of Wisconsin

  48. Talk Outline • Motivation • Overview of Coarse-Grained Coherence • Techniques • Broadcast Snoop Reduction [ISCA 2005] • Stealth Prefetching [ASPLOS 2006] • Power-Efficient DRAM Speculation • Hybrid Circuit Switching • Virtual Proximity • Circuit-switched snooping • Research Group Overview Mikko Lipasti-University of Wisconsin

  49. Circuit Switched Snooping (1) • Scalable, efficient broadcasting on an unordered network • Remove latency overhead of directory indirection • Extend point-to-point circuit-switched links to trees • Low-latency multicast via circuit-switched tree • Helps provide performance isolation, as requests do not share the same communication medium Mikko Lipasti-University of Wisconsin

  50. Circuit-Switched Snooping (2) • Extend Coarse Grain Coherence Tracking (CGCT) • Remove unnecessary broadcasts • Convert broadcasts to multicasts • Effective in Server Consolidation Workloads • Very few coherence requests to globally shared data Mikko Lipasti-University of Wisconsin
