Coherence Decoupling: Making Use of Incoherence

J. Huh, J. Chang, D. Burger, G. Sohi, ASPLOS 2004

Motivation • Multithreading and multiprocessing have become common • When a cache line is invalidated, very often not all of the data in the line is actually stale • If the data in invalid lines can be used speculatively, much of the coherence-miss latency could be hidden, giving great potential for performance improvement

Background: Cache Coherence Protocol • Used in shared-memory multiprocessors to keep data shared among private caches correct • Vital to the design of multiprocessors, since it is the largest contributor to inter-processor communication latency

Proposed Idea • Split the traditional cache coherence protocol into two parts • Speculative cache lookup (SCL) – supplies a speculative value from an invalid cache line, so the processor can keep executing instead of stalling on the miss • Safe coherence protocol – obtains the correct value, which is then compared with the value supplied by the SCL; a mismatch means the speculation failed and must be rolled back
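A minimal sketch of the two halves, assuming a toy private cache in which invalidation clears a valid bit but keeps the data. The names `Line`, `PrivateCache`, `scl_read` and `decoupled_load` are hypothetical, and a plain dictionary stands in for the safe coherence protocol's source of the correct value; in hardware that value arrives later than the speculative one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Line:
    data: List[int]
    valid: bool


class PrivateCache:
    """Toy private cache whose invalidations keep the (now incoherent) data."""

    def __init__(self) -> None:
        self.lines: Dict[int, Line] = {}

    def fill(self, addr: int, data: List[int]) -> None:
        # Install a coherent copy of the line.
        self.lines[addr] = Line(list(data), valid=True)

    def invalidate(self, addr: int) -> None:
        # A conventional cache would drop the data; here it is retained.
        if addr in self.lines:
            self.lines[addr].valid = False

    def scl_read(self, addr: int, word: int) -> Optional[int]:
        # Speculative cache lookup: read a word from an invalid line, if one exists.
        line = self.lines.get(addr)
        if line is None or line.valid:
            return None
        return line.data[word]


def decoupled_load(cache: PrivateCache, memory: Dict[int, List[int]],
                   addr: int, word: int) -> Tuple[Optional[int], int]:
    """Return (speculative value, coherent value) for one load.

    `memory` stands in for the safe coherence protocol; a mismatch between the
    two returned values is what would force recovery.
    """
    line = cache.lines.get(addr)
    if line is not None and line.valid:
        return None, line.data[word]          # ordinary hit: nothing to speculate on
    speculative = cache.scl_read(addr, word)  # None on a cold miss
    cache.fill(addr, memory[addr])            # safe protocol refills the line
    return speculative, memory[addr][word]


memory = {0: [1, 2, 3, 4]}
cache = PrivateCache()
cache.fill(0, memory[0])
cache.invalidate(0)
memory[0][3] = 99                            # remote write to a different word
print(decoupled_load(cache, memory, 0, 0))   # (1, 1): speculation was correct
```

Because the remote store touched only word 3, the stale word 0 still matches the coherent value, which is exactly the case the motivation slide describes.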

Coherence Decoupling

Related Work • Customized coherence protocols • Speculative coherence operations: dynamic self-invalidation, coherence message prediction, token coherence, etc. • Speculation on the outcome of events in multiprocessor execution

Coherence Decoupling Architecture Must support the following: • Split – a means to split a memory operation into a speculative load and a coherence operation • Compute – mechanisms to support execution with speculative values • Recover – a means to recover and roll back upon mis-speculation
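The split / compute / recover cycle can be sketched in a toy model where "instructions" are Python functions that update a register dictionary. This is an illustration under stated assumptions, not the paper's hardware: checkpointing the whole register file and fully re-executing on a mismatch is the simplest possible recovery scheme, and `run_decoupled`, `add_one` and `double` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Regs = Dict[str, int]
Instr = Callable[[Regs], None]  # an "instruction" updates the register file in place


def run_decoupled(regs: Regs, dest: str, speculative: int, coherent: int,
                  dependents: List[Instr]) -> Tuple[Regs, bool]:
    """Split / compute / recover for one coherence-decoupled load (toy model).

    Returns the final register file and whether a rollback was needed.
    """
    # Split: checkpoint the state, then deliver the speculative value to the
    # load's destination register while the coherence operation proceeds.
    checkpoint = dict(regs)
    regs = dict(regs)
    regs[dest] = speculative
    # Compute: dependent instructions run on the speculative value.
    for instr in dependents:
        instr(regs)
    # The safe protocol's value arrives; if it matches, the work is kept.
    if coherent == speculative:
        return regs, False
    # Recover: restore the checkpoint and re-execute with the coherent value.
    regs = dict(checkpoint)
    regs[dest] = coherent
    for instr in dependents:
        instr(regs)
    return regs, True


def add_one(r: Regs) -> None:
    r["r2"] = r["r1"] + 1


def double(r: Regs) -> None:
    r["r3"] = r["r2"] * 2


start = {"r1": 0, "r2": 0, "r3": 0}
print(run_decoupled(start, "r1", 10, 10, [add_one, double]))
# ({'r1': 10, 'r2': 11, 'r3': 22}, False)
print(run_decoupled(start, "r1", 10, 20, [add_one, double]))
# ({'r1': 20, 'r2': 21, 'r3': 42}, True)
```

When the speculation is correct, the dependent work done during the miss is kept, which is where the latency is hidden; a mismatch costs the wasted work plus the re-execution.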

SCL Protocols for Coherence Decoupling • Use a simple safe coherence protocol and rely on an aggressive SCL protocol to increase performance • An SCL protocol has two components • Read component – obtains the speculative value • Update component – writes new values into invalid cache lines so that subsequent speculative reads can use them (optional; some SCL protocols omit it)
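One way to picture the two components is as an interface with a mandatory read hook and an optional update hook. The class names below are hypothetical, and modelling an SCL protocol as a per-node object holding stale line copies is an assumption made only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class SCLProtocol(ABC):
    """Toy SCL protocol split into a read component and an optional update component."""

    def __init__(self) -> None:
        # line address -> data retained in this node's invalid copy of the line
        self.invalid_lines: Dict[int, List[int]] = {}

    def invalidate(self, addr: int, data: List[int]) -> None:
        # The safe protocol invalidates our copy; the stale data is kept for SCL.
        self.invalid_lines[addr] = list(data)

    @abstractmethod
    def read(self, addr: int, word: int) -> Optional[int]:
        """Read component: return a speculative value, or None if there is none."""

    def update(self, addr: int, word: int, value: int) -> None:
        """Update component: a no-op here, as in SCL protocols that omit it."""


class ReadOnlySCL(SCLProtocol):
    def read(self, addr: int, word: int) -> Optional[int]:
        line = self.invalid_lines.get(addr)
        return None if line is None else line[word]


class UpdatingSCL(ReadOnlySCL):
    def update(self, addr: int, word: int, value: int) -> None:
        # Speculatively write a remote store's value into our invalid copy.
        if addr in self.invalid_lines:
            self.invalid_lines[addr][word] = value


for proto in (ReadOnlySCL(), UpdatingSCL()):
    proto.invalidate(0, [1, 2, 3, 4])  # a remote writer invalidates our copy
    proto.update(0, 0, 42)             # ... and stores 42 to word 0 (true sharing)
    print(type(proto).__name__, proto.read(0, 0))
# ReadOnlySCL 1
# UpdatingSCL 42
```

The read-only variant returns the stale 1 and would mis-speculate on this truly shared word, while the variant with an update component returns the new value.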

Read vs. Update Components • An SCL protocol with only a read component speculates correctly if the word in an invalid block has: • not changed remotely (false sharing) • been changed remotely to the same value (silent stores) • been changed remotely to a different value and then back to the original value (temporally silent stores) • For truly shared data, an update component needs to be added • It speculatively propagates data through the system by writing it into invalid cache lines
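The four cases above can be distinguished by looking at the history of remote stores to a single word since the local copy was invalidated. The function `classify_word` is a hypothetical helper; it assumes the full store history is known, which real hardware would not track.

```python
from typing import List, Tuple


def classify_word(stale_value: int, remote_writes: List[int]) -> Tuple[str, bool]:
    """Classify what happened to one word of an invalid line (toy model).

    stale_value: the word as it sits in the local invalid copy.
    remote_writes: values other processors stored to this word, in order, since
                   the copy was invalidated (empty if they wrote other words only).
    Returns (category, whether a read-only SCL speculation would be correct).
    """
    if not remote_writes:
        return "false sharing", True
    if all(v == stale_value for v in remote_writes):
        return "silent store", True
    if remote_writes[-1] == stale_value:
        return "temporally silent store", True
    return "true sharing", False


for history in ([], [5], [7, 5], [7]):
    print(history, classify_word(5, history))
```

Only the last case needs an update component; in the first three the stale word already equals the coherent one.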

SCL Protocol Read Component • CD – use the locally cached incoherent value on every L2 miss that finds an invalid line; simple, but because it speculates on every such miss it can produce many mis-speculations • CD-F – add a PC-indexed confidence predictor to filter speculations; fewer speculative reads are issued, which cuts mis-speculations and improves average accuracy
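A PC-indexed confidence filter in the spirit of CD-F can be sketched with saturating counters. The table size, counter width, threshold and initial value below are assumptions for illustration, not the paper's parameters, and `ConfidenceFilter` is a hypothetical name. Training on every miss to an invalid line assumes the stale value is still compared with the coherent one even when the filter suppressed speculation.

```python
class ConfidenceFilter:
    """Hypothetical PC-indexed confidence predictor (saturating counters)."""

    def __init__(self, entries: int = 1024, max_count: int = 3, threshold: int = 2) -> None:
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.table = [threshold] * entries  # start just confident enough to speculate

    def _index(self, pc: int) -> int:
        # Direct-mapped on the load PC; aliasing between PCs is ignored.
        return pc % self.entries

    def should_speculate(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc: int, correct: bool) -> None:
        # Called once the coherent value is known and compared with the stale one.
        i = self._index(pc)
        if correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


filt = ConfidenceFilter()
pc = 0x40
print(filt.should_speculate(pc))  # True: counters start at the threshold
filt.train(pc, correct=False)
print(filt.should_speculate(pc))  # False after one mis-speculation
```

Plain CD corresponds to a filter that always answers yes; the counter trades some coverage for accuracy on loads that keep mis-speculating.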

SCL Protocol Update Component • CD-IA – piggyback the new value on invalidation messages to update all invalid blocks • CD-C – piggyback the new value on invalidations only if the value is compressible
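A toy sketch of the two piggyback policies, with "CD" standing for a protocol with no update component. The compressibility test (the value fits in a small signed field) is purely an assumption, since the slide does not say how values are compressed. `is_compressible` and `write_with_piggyback` are hypothetical names.

```python
from typing import Dict, List


def is_compressible(value: int, bits: int = 8) -> bool:
    """Hypothetical compression test: the value fits in a small signed field."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def write_with_piggyback(sharers: Dict[int, List[int]], word: int, value: int,
                         policy: str) -> int:
    """Invalidate every sharer's copy and possibly piggyback the new value.

    sharers: node id -> that node's (now invalid) copy of the line.
    policy: "CD" (no update), "CD-IA" (always piggyback) or "CD-C"
            (piggyback only compressible values).
    Returns how many invalid copies received the new value.
    """
    if policy not in ("CD", "CD-IA", "CD-C"):
        raise ValueError(f"unknown policy: {policy}")
    # The invalidation is sent in every case; only its payload differs.
    piggyback = policy == "CD-IA" or (policy == "CD-C" and is_compressible(value))
    if not piggyback:
        return 0
    for copy in sharers.values():
        copy[word] = value
    return len(sharers)


copies = {1: [0, 0], 2: [0, 0]}
print(write_with_piggyback(copies, 0, 5, "CD-C"))        # 2: small value rides along
print(write_with_piggyback(copies, 1, 100_000, "CD-C"))  # 0: invalidation only
print(copies)  # {1: [5, 0], 2: [5, 0]}
```

No extra messages are sent in either policy; CD-C simply keeps the invalidation payload small.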

SCL Protocol Update Component (cont.) • CD-N – update all sharers after N writes to a block; increases the number of messages (bandwidth) • CD-W – update on every write if any sharers exist • The CD read component is assumed wherever write update is used
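The bandwidth cost of the explicit-update policies can be estimated by counting update messages for one block. This counting model is an assumption: it treats each update as one message per sharer and interprets CD-N as "every N-th write pushes the block to all sharers", which may not match the paper's exact definition. `update_messages` is a hypothetical name.

```python
def update_messages(num_writes: int, num_sharers: int, policy: str, n: int = 4) -> int:
    """Count extra update messages sent for one block (toy bandwidth model).

    "CD-W": every write sends an update to each sharer, if any sharers exist.
    "CD-N": assumed here to send an update to all sharers on every n-th write.
    """
    if policy not in ("CD-W", "CD-N"):
        raise ValueError(f"unknown policy: {policy}")
    if num_sharers == 0:
        return 0
    if policy == "CD-W":
        return num_writes * num_sharers
    return (num_writes // n) * num_sharers


print(update_messages(8, 3, "CD-W"))       # 24
print(update_messages(8, 3, "CD-N", n=4))  # 6
```

Under this model, raising N trades fewer messages for staler data in the sharers' invalid lines.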

Methodology • Simulators: MP-Sauce and SimpleScalar • 16-node SMP system simulated • Coherence protocol: a simple invalidation-based snooping-bus protocol • 3 commercial applications and 5 scientific shared-memory benchmarks from the SPLASH-2 suite simulated

Results – Microbenchmarks • Simple-fs – loads falsely shared data and then executes dependent or independent instructions • Critical-fs – forces a data dependence between two loads by placing consecutive false-sharing misses on the critical path

L2 Miss Profiling Results

Coherence Decoupling Accuracy Results (CD, CD-F, CD-IA, CD-C, CD-N, CD-W)

Timing Results

Bandwidth Requirements

Latency Tolerance Profiles • Instructions executed during coherence decoupling, before the speculative value is verified • The number of control-dependent instructions will grow in future processors

Conclusions • Coherence misses are a significant fraction of L2 misses, ranging from 10% to 80% • Coherence decoupling has the potential to hide the miss latency of 40% to 90% of coherence misses • Mis-speculation occurs 20% of the time
