Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu
Motivation
• CMPs are ubiquitous
• Shared memory + caches = cache coherence
• Traditional cache coherence solutions:
  • shared bus-based: electrical and layout issues
  • directory-based: indirection and storage overhead
Embedded-ring cache coherence [ISCA 2006]
• Novel snoopy cache coherence for mid-sized machines
• A logical ring is embedded in the network:
  • control messages use the ring
  • data messages use any path
• Simple and inexpensive to implement
• Drawback: snoop requests can have long latencies
Contributions
• Propose an invariant for transaction serialization
• Propose performance enhancements:
  • Uncorq: unconstrained snoop request delivery
    • reduces cache-to-cache transfer latency
  • Simple hardware data prefetching technique
    • reduces memory-to-cache transfer latency
Embedded-ring terminology
• Snoopy, invalidate protocol
• Single-supplier protocol
• Types of messages:
  • snoop request (control)
  • snoop response (control)
  • combined snoop request + response (control)
  • data (any path)
[Figure: logical ring with nodes A and B; a request circulates the ring and returns as a response, marked positive (+) after a positive snoop outcome]
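The message taxonomy above can be sketched with simple data types. This is a hypothetical model for illustration, not the paper's implementation; the `Message` fields are assumptions:

```python
from dataclasses import dataclass
from enum import Enum, auto

class MsgType(Enum):
    SNOOP_REQUEST = auto()      # control: a request circulating toward the supplier
    SNOOP_RESPONSE = auto()     # control: always travels the ring
    COMBINED_REQ_RESP = auto()  # control: request piggybacked on a response
    DATA = auto()               # data: may take any network path

@dataclass
class Message:
    mtype: MsgType
    addr: int           # cache-line address of the transaction
    origin: int         # requesting node id
    positive: bool = False  # set once some node's snoop outcome is positive
```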
Transaction serialization
[Figure: timeline of two concurrent same-address transactions by nodes A and B; invalidations (inv/ack) race with a read (read/data) while cache states move among I, S, and M, so transactions must serialize before any node can see the new value while another still sees the old one]
Serialization enforcement with the embedded ring
• The logical unidirectional ring provides a partial ordering
• A distributed algorithm establishes a global order for same-address transactions
• On simultaneous transactions to the same address:
  • one is declared the "winner" (first to reach the supplier)
  • the others have to retry
[Figure: node A's request and its response circulate the ring]
How to serialize transactions
• A's and B's requests and responses are in flight simultaneously: there is no clear "first" transaction
• Suppose B's request reaches the supplier S first
• The ring guarantees responses are forwarded in the order in which S performed the snoop operations
• A therefore receives B's positive response before its own, so A retries
Enforcing transaction serialization
• The node whose request arrives at the supplier node first is the "winner"
• What we need to enforce transaction serialization:
  Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.
• Given the invariant, the loser node sees the other node's positive response before its own
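The winner/retry decision implied by the ordering invariant can be sketched as a small simulation. This is a toy model under stated assumptions (the supplier's processing order fully determines the outcome), not the paper's hardware algorithm:

```python
def invariant_holds(supplier_order, response_order):
    """Ordering Invariant: responses must travel the ring after the supplier
    in the same order the supplier processed their requests.
    supplier_order: requester ids in supplier processing order.
    response_order: requester ids in the order their responses travel the ring."""
    return response_order == supplier_order

def outcome(supplier_order, me):
    """The first request processed at the supplier wins; every other
    requester sees the winner's positive response before its own and retries."""
    return "winner" if supplier_order[0] == me else "retry"
```

Usage: with `supplier_order = ["B", "A"]`, node A observes B's positive response first and must retry, matching the scenario on the previous slide.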
Uncorq idea
• Baseline: both requests and responses follow the ring
• Uncorq: requests do not have to follow the ring (but responses do)
Benefit of Uncorq
• Reduced cache-to-cache transfer latency: the request reaches the supplier node sooner, so the snoop and the data transfer start earlier
[Figure: timeline of request, snoop, and data phases for Baseline vs. Uncorq; the earlier arrival of the request under Uncorq yields the latency savings]
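The latency benefit can be made concrete with a hop-count sketch on the evaluated topology (a 2D torus with an embedded 64-node ring, per the experimental setup). The node numbering and distance functions here are illustrative assumptions:

```python
def ring_hops(src, dst, n):
    """Hops along a unidirectional ring of n nodes (baseline request path)."""
    return (dst - src) % n

def torus_hops(src, dst, side):
    """Shortest-path hops on a side x side 2D torus (Uncorq requests may
    take any path). Nodes are numbered row-major: id = y * side + x."""
    sx, sy = src % side, src // side
    dx, dy = dst % side, dst // side
    wrap = lambda a, b: min((a - b) % side, (b - a) % side)
    return wrap(sx, dx) + wrap(sy, dy)

# Example: 64-node machine (8x8 torus, 64-node embedded ring).
# A request from node 0 to a supplier at node 63:
baseline = ring_hops(0, 63, 64)   # 63 hops around the ring
uncorq = torus_hops(0, 63, 8)     # 2 hops across the torus
```

The worst-case baseline request traverses nearly the whole ring, while an unconstrained request crosses the torus in a handful of hops; responses still take the ring in both schemes.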
Implications of Uncorq
• Uncorq no longer restricts the order of requests
• Nodes may receive and process requests in any order
• Responses may also get reordered
• Problem: the distributed algorithm relies on the fact that response order reflects the order of requests at the supplier
Example: incorrect transaction ordering
• With unconstrained requests, responses can overtake each other and violate the ordering invariant
• Fix: a node cannot forward any other response while it has an outstanding positive snoop outcome
[Figure: nodes A, S, and B on the ring; two positive (+) responses risk leaving the supplier region in the wrong order]
How Uncorq stalls responses
• Local transaction table (per-node structure):
  • records the messages the node is currently processing
• A response is held until the node's own positive response for that address has been forwarded
[Figure: a node's transaction table tracking requests and responses for addresses A, B, C, …]
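The stalling rule can be sketched as a per-node table keyed by address. This is a simplified model assuming per-address tracking; the class and method names are hypothetical, not the paper's structures:

```python
class Node:
    """Sketch of Uncorq response stalling: a node with an outstanding
    positive snoop outcome for an address holds other responses to that
    address until its own response has been forwarded."""

    def __init__(self):
        self.outstanding_positive = set()  # addrs with a pending positive outcome
        self.stalled = []                  # responses waiting to be forwarded

    def snoop(self, addr, hit):
        # A positive snoop outcome makes this node the supplier for addr.
        if hit:
            self.outstanding_positive.add(addr)

    def on_response(self, resp):
        # resp is (addr, origin). Stall another node's response while our
        # own positive outcome for addr is still outstanding.
        addr, origin = resp
        if addr in self.outstanding_positive and origin != "self":
            self.stalled.append(resp)
            return None          # held back to preserve the ordering invariant
        return resp              # safe to forward immediately

    def own_response_sent(self, addr):
        # Our response left on the ring: release stalled responses for addr.
        self.outstanding_positive.discard(addr)
        ready = [r for r in self.stalled if r[0] == addr]
        self.stalled = [r for r in self.stalled if r[0] != addr]
        return ready
```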
Optimization: prefetching from memory
• Goal: reduce the latency of memory-to-cache transfers
• Predict when no node will supply the data
• Access memory in parallel with the ring snoop
[Figure: unoptimized case, (1) ring snoop then (2) memory access, vs. optimized case, where (1) the snoop and the memory access proceed in parallel]
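The prediction step could be realized with a simple last-outcome scheme. This is a hedged illustration of the idea (predict "no cache supplier", then overlap the memory access with the snoop); the paper's actual predictor may differ:

```python
class SupplierPredictor:
    """Hypothetical last-outcome predictor: per cache-line address,
    remember whether a cache supplied the line last time. If not,
    predict no supplier and start the memory access in parallel
    with the ring snoop."""

    def __init__(self):
        self.last_supplied = {}  # addr -> did a cache supply it last time?

    def should_prefetch(self, addr):
        # Prefetch from memory when we last observed no cache supplier
        # (or have never seen the address).
        return not self.last_supplied.get(addr, False)

    def update(self, addr, cache_supplied):
        # Train on the actual snoop outcome once it is known.
        self.last_supplied[addr] = cache_supplied
```

A mispredicted prefetch only wastes memory bandwidth; correctness is preserved because the snoop still completes and the supplied data, if any, takes priority.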
Experimental setup
• 64 nodes in a single CMP
• Interconnection network: 2D torus with an embedded ring
• SESC simulator (sesc.sourceforge.net)
• SPLASH-2, SPECjbb, and SPECweb workloads
Cache-to-cache transfer latency
[Figure: distribution and cumulative distribution of cache-to-cache transfer latency (0-600 cycles) for Baseline vs. Uncorq; Uncorq shows a substantial reduction in latency]
Execution Time
[Figure: normalized execution time of Baseline, Uncorq, and Uncorq+Pref on SPLASH-2, SPECjbb, and SPECweb]
• Uncorq significantly reduces execution time (reduction: 5-23%)
• Uncorq + Pref performs the best (reduction: 13-26%)
Also in the paper
• Serialization mechanism for the case with no supplier
• System and node forward progress
• Fences and memory consistency issues
• Characterization of the prefetching mechanism
• Comparison against ccHyperTransport
Conclusion
• Proposed an invariant for transaction serialization
• Proposed performance enhancements:
  • Uncorq: unconstrained snoop request delivery
  • Simple hardware data prefetching technique
• Reduced execution time by 13-26%