
Practical Reduction for Store Buffers



Presentation Transcript


  1. Practical Reduction for Store Buffers Ernie Cohen, Microsoft Norbert Schirmer, DFKI

  2. problem • practical reasoning about imperative code is based on state assertions and invariants • such reasoning tacitly assumes sequential consistency (SC) … • … but real MP hardware doesn’t provide SC • needed: a programming discipline that • guarantees SC • is flexible enough to handle real software • is practical to check

  3. x86/x64 hardware model: TSO • FIFO store buffer (SB) between each processor (P) and the (shared, SC) memory • P writes are queued onto its SB • concurrently, writes leave SBs and are applied to memory • a read by P reads from P’s SB if possible; otherwise, it reads from memory (“SB forwarding”) • P can flush its own SB (expensive) • note: TSO != “load-acquire, store-release” • a read can move backward past a write to the same location, turning into a read of a constant • note: UP TSO machines are SC, but …
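The store-buffer model on this slide can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the names `StoreBuffer`, `sb_write`, `sb_read`, `sb_drain_one` and `sb_flush`, and the sizes `MEM_SIZE` and `SB_CAP`, are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_SIZE 8   /* number of shared locations (illustrative) */
#define SB_CAP   16  /* store buffer capacity (illustrative) */

typedef struct { int addr; int val; } Store;

typedef struct {
    Store  q[SB_CAP];  /* FIFO: q[0] is the oldest pending store */
    size_t len;
} StoreBuffer;

int memory[MEM_SIZE];  /* the shared, sequentially consistent memory */

/* P writes: the store is queued on P's own buffer, not applied to memory. */
void sb_write(StoreBuffer *sb, int addr, int val) {
    assert(sb->len < SB_CAP && addr >= 0 && addr < MEM_SIZE);
    sb->q[sb->len].addr = addr;
    sb->q[sb->len].val = val;
    sb->len++;
}

/* P reads: the newest matching store in P's own buffer wins
   (SB forwarding); otherwise the value comes from memory. */
int sb_read(const StoreBuffer *sb, int addr) {
    assert(addr >= 0 && addr < MEM_SIZE);
    for (size_t i = sb->len; i > 0; i--)
        if (sb->q[i - 1].addr == addr)
            return sb->q[i - 1].val;
    return memory[addr];
}

/* The oldest store leaves the buffer and is applied to memory.
   Returns 0 if the buffer was already empty, 1 otherwise. */
int sb_drain_one(StoreBuffer *sb) {
    if (sb->len == 0) return 0;
    memory[sb->q[0].addr] = sb->q[0].val;
    for (size_t i = 1; i < sb->len; i++)
        sb->q[i - 1] = sb->q[i];
    sb->len--;
    return 1;
}

/* Flush: drain every pending store, in FIFO order. */
void sb_flush(StoreBuffer *sb) {
    while (sb_drain_one(sb)) { }
}
```

In this sketch a write by one processor is visible to that processor immediately through forwarding, but other processors see it only after `sb_drain_one` or `sb_flush` has moved it to `memory`.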

  4. TSO is not SC • TSO is not SC, because of the delay in writes becoming visible to other processors, e.g. P0: <x := 1> <y = 0> P1: <y := 1> <x = 0> • both Ps can complete under TSO, but not under SC (whichever thread writes second gets stuck)
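The example on this slide can be replayed with a deliberately tiny model in which each processor has a one-slot store buffer, which is enough for this test. The names `Proc`, `p_write`, `p_read`, `p_flush` and `both_complete` are hypothetical, and the function runs one fixed interleaving rather than exploring all of them.

```c
#include <assert.h>

enum { X = 0, Y = 1 };

/* A processor with a one-slot store buffer (enough for this test). */
typedef struct { int pending; int addr; int val; } Proc;

int mem[2];  /* shared memory holding x and y */

void p_write(Proc *p, int addr, int val) {
    assert(!p->pending);  /* the single slot must be free */
    p->pending = 1;
    p->addr = addr;
    p->val = val;
}

int p_read(const Proc *p, int addr) {
    if (p->pending && p->addr == addr) return p->val;  /* forwarding */
    return mem[addr];
}

void p_flush(Proc *p) {
    if (p->pending) { mem[p->addr] = p->val; p->pending = 0; }
}

/* Runs the schedule: P0 writes x, P1 writes y, then both read.
   If flush_after_write is nonzero, each processor flushes before reading.
   Returns 1 if both processors read 0, i.e. both tests succeed. */
int both_complete(int flush_after_write) {
    Proc p0 = {0, 0, 0}, p1 = {0, 0, 0};
    mem[X] = 0;
    mem[Y] = 0;
    p_write(&p0, X, 1);        /* P0: <x := 1>, still buffered */
    p_write(&p1, Y, 1);        /* P1: <y := 1>, still buffered */
    if (flush_after_write) { p_flush(&p0); p_flush(&p1); }
    int r0 = p_read(&p0, Y);   /* P0: <y = 0> ? */
    int r1 = p_read(&p1, X);   /* P1: <x = 0> ? */
    p_flush(&p0);
    p_flush(&p1);
    return r0 == 0 && r1 == 0;
}
```

With the writes left in the buffers, `both_complete(0)` reports that both processors read 0, the outcome SC forbids. With a flush after each write, `both_complete(1)` reports that this outcome does not occur in this schedule.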

  5. a simple SC discipline • make sure that P reads only when P’s SB is empty • writes dirty the SB; flushes clean it • read allowed only when the SB is clean • (lazy caching uses a similar trick to achieve SC) • proof of SC: • each P simulates a virtual P (that might fall behind) • virtual P takes a write step when that write hits memory • real and virtual P are in sync on read steps • but this discipline isn’t practical • disjoint concurrency shouldn’t require any flushes! • idea: distinguish private and shared memory
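The simple discipline can be phrased as a check over the sequence of operations one processor performs. The sketch below assumes a hypothetical trace representation (`Op`, `obeys_simple_discipline`) and only tracks whether the store buffer is clean or dirty.

```c
#include <stddef.h>

typedef enum { OP_WRITE, OP_READ, OP_FLUSH } Op;

/* Returns 1 if the trace of one processor follows the simple discipline:
   writes dirty the SB, flushes clean it, and a read is allowed only
   when the SB is clean. Returns 0 at the first offending read. */
int obeys_simple_discipline(const Op *trace, size_t n) {
    int dirty = 0;  /* the SB starts out empty, hence clean */
    for (size_t i = 0; i < n; i++) {
        switch (trace[i]) {
        case OP_WRITE: dirty = 1; break;
        case OP_FLUSH: dirty = 0; break;
        case OP_READ:  if (dirty) return 0; break;
        }
    }
    return 1;
}
```

Under this check a write followed directly by a read is rejected even when the two touch unrelated, thread-private data, which is the impracticality the slide points out.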

  6. ownership • each location can be either owned (by a unique processor) or unowned • each access is volatile or nonvolatile • modified discipline: • nonvolatile access requires ownership of the location • volatile writes dirty the SB • volatile reads allowed only when SB is clean • simulation proof is similar, but nonvolatile accesses happen as soon as there are no volatile writes in front of them • they’re guaranteed to see the same values when they hit the SB, because other Ps don’t modify
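The modified discipline can be sketched as a per-processor trace check. The representation is an assumption made for illustration: `Access`, `Kind`, `UNOWNED` and `obeys_ownership_discipline` are hypothetical names, and ownership is held fixed for the duration of the trace.

```c
#include <stddef.h>

#define UNOWNED (-1)

typedef enum { ACC_READ, ACC_WRITE, ACC_FLUSH } Kind;

typedef struct {
    Kind kind;
    int  is_volatile;  /* ignored for ACC_FLUSH */
    int  loc;          /* ignored for ACC_FLUSH */
} Access;

/* Checks the accesses of processor p against the modified discipline.
   owner[loc] is the owning processor of loc, or UNOWNED.
   Returns 1 if every access is allowed, 0 at the first violation. */
int obeys_ownership_discipline(int p, const int *owner,
                               const Access *trace, size_t n) {
    int dirty = 0;  /* set by volatile writes, cleared by flushes */
    for (size_t i = 0; i < n; i++) {
        const Access *a = &trace[i];
        if (a->kind == ACC_FLUSH) {
            dirty = 0;
        } else if (!a->is_volatile) {
            /* nonvolatile access requires ownership of the location */
            if (owner[a->loc] != p) return 0;
        } else if (a->kind == ACC_WRITE) {
            dirty = 1;  /* volatile writes dirty the SB */
        } else if (dirty) {
            return 0;   /* volatile read while the SB is dirty */
        }
    }
    return 1;
}
```

Unlike the simple discipline, nonvolatile accesses to owned locations never force a flush here, so disjoint concurrency passes the check without any flushes.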

  7. moving ownership around • use ghost operations to take and release ownership • P can take ownership of unowned locations • P can release ownership of locations P owns • (this fits with ownership in VCC, where “unowned” means owned by a data object rather than a thread) • discipline in the paper also adds unowned read-only locations, which allows shared non-volatile reading
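The two ghost operations can be sketched as updates to an ownership table. The names `NO_OWNER`, `take_ownership` and `release_ownership` are hypothetical, and the sketch covers only the two basic rules on this slide, not the read-only locations added in the paper.

```c
#define NO_OWNER (-1)

/* P can take ownership of an unowned location.
   Returns 1 on success, 0 if the location already has an owner. */
int take_ownership(int *owner, int loc, int p) {
    if (owner[loc] != NO_OWNER) return 0;
    owner[loc] = p;
    return 1;
}

/* P can release ownership of a location that P owns.
   Returns 1 on success, 0 if P is not the owner. */
int release_ownership(int *owner, int loc, int p) {
    if (owner[loc] != p) return 0;
    owner[loc] = NO_OWNER;
    return 1;
}
```

Because these operations only touch ghost state, a real program would carry them as annotations rather than executable code; they are written as ordinary functions here only so the rules can be exercised.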

  8. ex: spinlocks
       typedef … struct _SPIN_LOCK {
         volatile int Lock;
         _(ghost \object prot_obj;)
         _(invariant !Lock ==> \mine(prot_obj))
       } SPIN_LOCK;
       void Acquire(SPIN_LOCK *SpinLock …) …
       {
         int stop;
         do {
           … { // atomic
             stop = (__interlockedcompareexchange(&SpinLock->Lock, 1, 0) == 0);
             _(if (stop) \giveup_closed_owner(SpinLock->prot_obj, SpinLock);)
           }
         } while (!stop);
       }
     Microsoft confidential
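For readers without VCC, the executable part of the spinlock can be approximated in portable C11. This is a sketch under assumptions, not the code from the talk: it uses `<stdatomic.h>` in place of the interlocked intrinsic, drops all ghost annotations, and the names `SpinLock`, `spin_init`, `spin_try_acquire`, `spin_acquire` and `spin_release` are hypothetical.

```c
#include <stdatomic.h>

typedef struct { atomic_int lock; } SpinLock;  /* 0 = free, 1 = held */

void spin_init(SpinLock *l) {
    atomic_init(&l->lock, 0);
}

/* One interlocked attempt: swap 0 -> 1.
   Returns 1 if the lock was free and is now held by the caller. */
int spin_try_acquire(SpinLock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong(&l->lock, &expected, 1);
}

/* Spin until an attempt succeeds, like the do/while loop in Acquire. */
void spin_acquire(SpinLock *l) {
    while (!spin_try_acquire(l)) { }
}

/* Release with a plain (non-interlocked) store with release ordering. */
void spin_release(SpinLock *l) {
    atomic_store_explicit(&l->lock, 0, memory_order_release);
}
```

The acquire side is an interlocked read-modify-write, while the release side is an ordinary store that may sit in the store buffer for a while before other processors see it.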

  9. key points • discipline follows some basic VCC methodology • discipline expressed in terms of ghost state • ghost code “witnesses” conformance to the discipline (much as ghost code is used to witness simulations) • by replacing proof obligations with programming obligations, we’re more likely to get programmers to do it • when checking the discipline, we get to assume an SC execution, so we never have to think about the SBs.

  10. the only tricky part of the proof • key observation: ownership changes cannot race on their own • if they do, there are executions that violate the discipline • therefore, we can pretend that ownership doesn’t get released until the next volatile write

  11. a note on ghosts • VCC requires lots of ghost code, including racy operations on volatile ghost state • why doesn’t this introduce flushing?
         SC code follows discipline on real data
       => {SC stripped code simulates SC code}
         SC stripped code follows discipline on real data
       => {reduction theorem}
         stripped code simulates SC stripped code
       => {SC stripped code simulates SC code}
         stripped code simulates SC code

  12. how close is this to practice? • discipline followed almost everywhere in the Hv codebase • even non-interlocked volatile writes are fairly rare • exceptions (outside of device ops) are writes where • the write doesn’t race with other writes • racing reads can safely read the old value • ex: releasing a spinlock, broadcasting signals • a solution: introduce a new kind of volatile • one reader, multiple writers • keep track of an upper and lower bound • writes must be above the upper bound • writes raise the upper bound • flush raises the lower bound to the upper bound • reads by other processors raise the lower bound to the value read • this works, but is kind of gross
