Verifying MP Executions against Itanium Orderings using SAT*

Ganesh Gopalakrishnan, Yue Yang, Hemanthkumar Sivaraj
School of Computing, University of Utah
Salt Lake City, UT, 84112

* Work supported in part by SRC Contract 1031.001 and NSF Award 0219805

Efficient Multiprocessors must have Efficient Shared Memory Systems

- Hide the cost of memory operations by postponing updates
- Increasingly important because CPUs are growing faster than memory systems are

How to build Efficient Shared-memory Multiprocessor Systems?

- Employ weak memory models
- They permit global state updates to be postponed

- Employ aggressive shared memory consistency protocols
- Weak memory models permit shared memory consistency protocols to be aggressive without undue complexity (no speculation, etc.)

The focus of this talk is on weak memory models

Weak memory models allow multiple executions...

Program (two CPUs sharing one memory):
CPU 1: st c,1 ; st d,2
CPU 2: ld d ; ld c

One possible execution...
CPU 1: st c,1 ; st d,2
CPU 2: ld d, 2 ; ld c, 0
Impossible under SC, possible under Itanium

Another execution...
CPU 1: st c,1 ; st d,2
CPU 2: ld d, 2 ; ld c, 1
Possible under SC and under Itanium
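The SC verdicts for the two executions can be reproduced by brute force. The following is a minimal Python sketch, not the tool described in the talk: it assumes memory starts at 0, and the names `sc_allows`, `exec1`, and `exec2` are made up for illustration. It tries every interleaving that keeps each CPU's program order and checks whether the loads return the observed values.

```python
from itertools import permutations

# Storing CPU from the slide; memory is assumed to start at 0.
STORES = [("st", "c", 1), ("st", "d", 2)]

def sc_allows(loads):
    """True if some interleaving that keeps each CPU's program order
    makes every load return its observed value (sequential consistency)."""
    ops = [(0, i, op) for i, op in enumerate(STORES)] + \
          [(1, i, op) for i, op in enumerate(loads)]
    for perm in permutations(ops):
        # Discard shuffles that reorder instructions of the same CPU.
        if any(a[0] == b[0] and a[1] > b[1]
               for x, a in enumerate(perm) for b in perm[x + 1:]):
            continue
        mem = {"c": 0, "d": 0}
        ok = True
        for _, _, (kind, var, val) in perm:
            if kind == "st":
                mem[var] = val
            elif mem[var] != val:   # a load that does not see its value
                ok = False
                break
        if ok:
            return True
    return False

exec1 = [("ld", "d", 2), ("ld", "c", 0)]   # first execution on the slide
exec2 = [("ld", "d", 2), ("ld", "c", 1)]   # second execution on the slide
```

Enumeration like this is only practical for a handful of instructions, which is part of the motivation for the SAT-based approach.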

Problems with Weak Memory Models

- Hard to understand (easy to misunderstand)

P: st [x] = 1 ; mf ; ld r1 = [y] <0>
Q: st.rel [y] = 1
R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>

Is this legal under Itanium? (no)

Post-Si verification of MP Orderings today (oversimplified)

- Assembly programs 1 ... n are run repeatedly on the new MP system, to catch one interleaving that might reveal a bug
- The runs yield assembly executions 1 ... n
- Every execution is checked against the ordering rules for compliance

- This is done ad hoc
- How to make this formal and efficient?
- How to capitalize on repeated re-runs?

Explanation of Illegal Executions (p 31 of Itanium App Note – search 251429)

P: us: st [x] = 1 ; mf: mf ; ul1: ld r1 = [y] <0>
Q: sr: st.rel [y] = 1
R: la: ld.acq r2 = [y] <1> ; ul2: ld r3 = [x] <0>

- US >> MF; hence RVr(US) → F(MF)
- MF >> UL1; hence F(MF) → R(UL1)
- …many reasons… hence R(UL1) → RVp(SR)
- If RVr(SR) → R(UL1) and RVr(SR) → UL1 → RVp(SR), WB release atomicity of SR is violated; thus R(UL1) → RVr(SR)
- …five lines of reasons… hence RVr(SR) → R(LA)
- Since LA >> UL2, R(LA) → R(UL2)
- Another para of reasons: LV(Sr2) → R(UL2) → LV(SR1) → RVp(SR1) → RVq(SR1) → F(MF1) → R(UL1) → RVq(SR2) → RVp(SR2). But can’t allow due to atomicity of SR.

Checking Executions and Providing Explanations (present approach)

P: st [x] = 1 ; mf ; ld r1 = [y] <0>
Q: st.rel [y] = 1
R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>

- Published approaches are very labor-intensive paper-and-pencil proofs
- Clearly this can’t scale (a 6-instruction MP program takes a one-page detailed mathematical proof)
- What about the combinatorics of reasoning about 200 instructions?
- Approaches actually used within the industry involve the use of “checkers”
- Details of these checkers are unknown (How complete? How scalable?)

Our Approach

MP execution to be checked:
P: st [x] = 1 ; mf ; ld r1 = [y] <0>
Q: st.rel [y] = 1
R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>

- The Itanium ordering rules are written in Higher Order Logic
- Mechanical program derivation turns the rules into a checker program
- The checker program turns the execution into a satisfiability problem with clauses carrying annotations
- A Sat solver decides the problem
- Sat: the explanation is given in the form of one possible interleaving
- Unsat: unsat core extraction using Zcore, followed by:
- Find Offending Clauses
- Trace their annotations
- Determine “ordering cycle”

Largest example tried to date (courtesy S. Zeisset, Intel)

Proc 2

ld4 r24 = [733a74] <415e304>

st4.rel [175984] = 96ab4e1f

… 67 more instructions…

ld8 r87 = [56460] <b5c113d7ce4783b1>

Proc 1

st8 [12ca20] = 7f869af546f2f14c

ld r25 = [45180] <87b5e547172644a8>

… 58 more instructions…

st2 [7c2a00] = 4bca

- Initially the tool gave a trivial violation
- Diagnosed to be forgotten memory initialization
- Added method to incorporate memory initialization in our tool
- Our tool found the exact same cycle as pointed out by author of test
- Sat generation and Sat solving times need improving

Cycle found through our tool:

st.rel (line 18, P1) → ld (line 22, P2) → mf → ld (line 30, P2) → st (line 11, P1)

Statistics Pertaining to Case Study

- All runs were on a 1.733 GHz Athlon with 1 GB of memory running Red Hat Linux 9
- ~2 minutes to generate Sat instance
- 14,053,390 clauses
- 117,823 variables
- ~1 minute to solve Sat problem - found Unsat
- Unsat Core generation runs fast – gave 23 clauses!
- 23 of the 14M clauses were causing the problem to be Unsat
- Sat time for these 23 clauses … under a second

- Unsat Core’s annotations were traced back to offending instructions and the memory ordering rules that situated them in a “cycle”

The rest of the talk

- Itanium memory model in Higher Order Logic (well, not so high actually… )
- Translating our HOL specs into “sat-generating checker programs”
- Translating the execution to be checked into Sat with the above program
- Each assembly instruction: the clauses it generates + annotations
- When Sat, what interleaving explains?
- When Unsat, how to get “core” (root-cause) + annotations on core
- Translating annotations on core to cycle on original program

- The initial focus of our presentation:
- How to model an execution?
- Why use “split stores” in modeling?

Basic problem-modeling idea:

Find a “shuffle” of the instructions that explains the observations…

P0: st [y] = 1 ; ld reg1 = [y] <1>
P1: ld reg2 = [y] <1>

Explanation…
st [y] = 1 ; ld reg1 = [y] <1> ; ld reg2 = [y] <1>

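The basic idea won’t always work …

P0: st.rel [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
P1: st.rel [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>

- On each processor, the store is ordered before the ld.acq by a data dependence (“Dat. Dep.”), and the ld.acq before the following load by “Ld.Acq Order”
- No shuffle of these sequences respecting “Dat. Dep.” and “Ld.Acq Order” satisfies the read-values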

Problem Modeling…

Idea: Find a shuffle after each store is split into (p+1) copies…
(by the way, this idea has sort of become “standard”)

P0: st [y] = 1, split into a local copy for P0, a “remote” copy for P0, and a “remote” copy for P1
P1: st [x] = 2, with a similar split

Now, arrange the split copies…

Problem Modeling…

P0: st [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
P1: st [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>

Split copies of the stores:
st [y] = 1 “l” ; st [y] = 1 “rp0” ; st [y] = 1 “rp1”
st [x] = 2 “l” ; st [x] = 2 “rp0” ; st [x] = 2 “rp1”

Now, arrange the split copies…

Explanation… (one arrangement, top to bottom, held in place by dependencies and anti-dependencies)
st [y] = 1 “l”
ld.acq r3 = [y] <1>
st [x] = 2 “l”
ld.acq r4 = [x] <2>
st [y] = 1 “rp0”
st [x] = 2 “rp1”
ld reg1 = [x] <0>
st [x] = 2 “rp0”
ld reg2 = [y] <0>
st [y] = 1 “rp1”
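To see why the arrangement of split copies explains the observed values, it can be replayed mechanically. The sketch below is an illustration under simplifying assumptions, not the checker from the talk: it models only the read-value requirement, assumes memory starts at 0, assumes a load on processor p sees the writer's local copy (if p is the writer) or the remote copy addressed to p, and uses the hypothetical name `read_values_hold`.

```python
def read_values_hold(order, init=0):
    """Replay a total order of split-store copies and loads and check
    that every load returns the value it observed."""
    visible = {}  # (processor, variable) -> latest value visible there
    for ev in order:
        if ev[0] == "st":
            _, writer, var_name, value, copy = ev
            # "l" is the writer's local copy; "rpN" is the remote copy for PN.
            target = writer if copy == "l" else int(copy[2:])
            visible[(target, var_name)] = value
        else:
            _, proc, var_name, value = ev
            if visible.get((proc, var_name), init) != value:
                return False
    return True

# The arrangement shown on the slide, top to bottom.
explanation = [
    ("st", 0, "y", 1, "l"),
    ("ld", 0, "y", 1),          # ld.acq r3 = [y] <1>
    ("st", 1, "x", 2, "l"),
    ("ld", 1, "x", 2),          # ld.acq r4 = [x] <2>
    ("st", 0, "y", 1, "rp0"),
    ("st", 1, "x", 2, "rp1"),
    ("ld", 0, "x", 0),          # ld reg1 = [x] <0>
    ("st", 1, "x", 2, "rp0"),
    ("ld", 1, "y", 0),          # ld reg2 = [y] <0>
    ("st", 0, "y", 1, "rp1"),
]
```

Moving the “rp0” copy of st [x] = 2 ahead of ld reg1 = [x] <0> makes the replay fail, which is the anti-dependency the slide points at.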

Back to Itanium memory model in Higher Order Logic through an example

Informal statement: Store-Releases to write-back memory become visible to all processors in the same order

Implementation: All copies of a “split st.rel” are visible atomically (the copies of st.rel [x] = 1 form an atomic set)

One standard way of specifying atomicity: All other events “e” are strictly before or strictly after the atomic set

Another standard way of specifying atomicity: If some event “e” is between two events in the atomic set, then “e” also belongs to the atomic set
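The two ways of stating atomicity can be compared directly on small total orders. The following Python sketch is an illustration only; the function names and the four sample events are hypothetical. It checks both formulations on every ordering of two store copies and two other events.

```python
from itertools import permutations

def atomic_before_or_after(order, atomic):
    """First formulation: every event outside the set is strictly
    before all members or strictly after all members."""
    pos = {e: n for n, e in enumerate(order)}
    lo = min(pos[a] for a in atomic)
    hi = max(pos[a] for a in atomic)
    return all(pos[e] < lo or pos[e] > hi for e in order if e not in atomic)

def atomic_no_intruder(order, atomic):
    """Second formulation: any event lying between two members of the
    set must itself belong to the set."""
    pos = {e: n for n, e in enumerate(order)}
    return all(e in atomic
               for a in atomic for b in atomic for e in order
               if pos[a] < pos[e] < pos[b])

events = ["s_p0", "s_p1", "e1", "e2"]   # two copies of one st.rel, two others
atomic = {"s_p0", "s_p1"}
agree = all(atomic_before_or_after(o, atomic) == atomic_no_intruder(o, atomic)
            for o in permutations(events))
```

The second formulation quantifies over triples of events, which is the shape the HOL rule on the next slide takes.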

atomicWBRelease rule (Section 3.3.7.1 of Intel App Note):

atomicWBRelease(ops,order) =
  Forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

Here i and k carry the same wrID, and j is ordered between them.

We have reduced the ~36 page Intel App Note to ~3 pages of HOL rules (barring a few simple omissions…)

Basic idea behind Intel’s Formal Spec (which we follow in our formal spec)

Make it look like SC so that people have less trouble understanding!

legalItanium(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireWriteOperationOrder ops order
    /\ requireProgramOrder ops order
    /\ requireMemoryDataDependence ops order
    /\ requireDataFlowDependence ops order
    /\ requireCoherence ops order
    /\ requireAtomicWBRelease ops order
    /\ requireSequentialUC ops order
    /\ requireNoUCBypass ops order
    /\ requireReadValue ops order )

SC(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireProgramOrder ops order
    /\ requireReadValue ops order )

The rules between requireStrictTotalOrder and requireReadValue: call them “otherOrder”

But, how do we check executions against such specs?

Execution 1
CPU 1: st c,1 ; st d,2
CPU 2: ld d, 2 ; ld c, 0

Execution 2
CPU 1: st c,1 ; st d,2
CPU 2: ld d, 2 ; ld c, 1

e.g., which execution is legal under which memory model?


Transformation of HOL specs to generate constraints

Initial Spec

atomicWBRelease(ops,order) =
  forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k) ==> (j.wrID = i.wrID)

Applying Contrapositive

atomicWBRelease(ops,order) =
  forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ ~(j.wrID = i.wrID) ==> ~(order(i,j) /\ order(j,k))

After Reducing quantifier Scopes

atomicWBRelease(ops,order) =
  forall (i in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in ops). (i.wrID = k.wrID)
      ==> forall (j in ops). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))
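The three forms have the same propositional body, which a truth table can confirm. In the Python sketch below, the letters are shorthand introduced here and do not appear on the slides: A stands for the three conditions on i, B for `i.wrID = k.wrID`, C for `j.wrID = i.wrID`, and O1, O2 for `order(i,j)` and `order(j,k)`. The check covers only the body for fixed i, j, k; moving the quantifiers inward is sound because each antecedent mentions only variables already bound.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def initial(A, B, C, O1, O2):          # A /\ B /\ O1 /\ O2 ==> C
    return implies(A and B and O1 and O2, C)

def contrapositive(A, B, C, O1, O2):   # A /\ B /\ ~C ==> ~(O1 /\ O2)
    return implies(A and B and not C, not (O1 and O2))

def nested(A, B, C, O1, O2):           # A ==> (B ==> (~C ==> ~(O1 /\ O2)))
    return implies(A, implies(B, implies(not C, not (O1 and O2))))

# Compare the three bodies on all 32 truth assignments.
equivalent = all(initial(*v) == contrapositive(*v) == nested(*v)
                 for v in product([False, True], repeat=5))
```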

Functional (OCaml) Program Derivation from HOL Specs:

Transformed Spec

atomicWBRelease(ops,order) =
  forall (i in ops). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in ops). (i.wrID = k.wrID)
      ==> forall (j in ops). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))

Functional Program that generates the constraints (will be automated)

atomicWBRelease(ops) = forall(i,ops,wb(i))

wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote)) then true
        else forall(k,ops,wb1(i,k))

wb1(i,k) = if ~(i.wrID=k.wrID) then true
           else forall(j,ops,wb2(i,k,j))

wb2(i,k,j) = if (j.wrID=i.wrID) then true
             else ~(order(i,j) & order(j,k))

forall(i,S, e(i)) = for all i in S : e(i)   (* i.e., foldr (&) true (map (fn i -> e(i)) S) *)
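The nested wb / wb1 / wb2 functions map naturally onto three nested loops that emit one clause per surviving (i, k, j) triple. The following is a Python sketch rather than the OCaml program from the talk. It assumes `order(i,j)` is represented by a Boolean variable numbered `i*n + j + 1`, and it stores the memory attribute in an `attr` field instead of calling `attr_of`; the `Op` record, `var`, and the sample ops are all hypothetical.

```python
from collections import namedtuple

# Hypothetical, trimmed-down op record; attr stands in for attr_of(var).
Op = namedtuple("Op", "id op wrType wrID attr")

def var(i, j, n):
    """Assumed numbering of the Boolean variable for order(i, j)."""
    return i * n + j + 1

def atomic_wb_release(ops):
    """Emit one CNF clause for each ~(order(i,j) /\\ order(j,k))."""
    n = len(ops)
    clauses = []
    for i in ops:                                   # wb(i)
        if not (i.attr == "WB" and i.op == "StRel" and i.wrType == "Remote"):
            continue                                # guard fails: trivially true
        for k in ops:                               # wb1(i, k)
            if i.wrID != k.wrID:
                continue
            for j in ops:                           # wb2(i, k, j)
                if j.wrID == i.wrID:
                    continue
                # ~(m_ij /\ m_jk) is the clause (~m_ij \/ ~m_jk)
                clauses.append([-var(i.id, j.id, n), -var(j.id, k.id, n)])
    return clauses

# One st.rel split into three copies, plus one unrelated load.
ops = [
    Op(0, "StRel", "Local", 0, "WB"),
    Op(1, "StRel", "Remote", 0, "WB"),
    Op(2, "StRel", "Remote", 0, "WB"),
    Op(3, "Ld", "DontCare", -1, "WB"),
]
clauses = atomic_wb_release(ops)
```

Each guard that evaluates to "then true" in the functional program corresponds to a `continue` here, so no clause is generated for that branch.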


Have built tool for tuple-generation that addresses many details:

(1) Expansion into tuples with variable address allocation

P1: St a,1; Ld r1,a <1>; St b,r1 <1>;
P2: Ld.acq r2,b <1>; Ld r3,a <0>;

Tuples 1 through 9:

{id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};
{id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};
{id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};
{id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};
{id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};
{id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};
{id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};
{id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};
{id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}
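The expansion rule visible in the listing (each store becomes one Local copy plus one Remote copy per processor, each load becomes a single tuple) can be sketched as follows. This is an illustrative Python reconstruction from the nine tuples above, not the authors' gentuple program; `gen_tuples` and the input format are hypothetical, and variable and register numbers are assumed to have been allocated already (a=0, b=1, r1=0, r2=1, r3=2).

```python
def gen_tuples(program, nprocs):
    """program: one list per processor of (op, var, data, reg), where reg
    is a register index or None for a store of an immediate value."""
    tuples, next_id = [], 0
    for proc, instrs in enumerate(program):
        for pc, (op, var, data, reg) in enumerate(instrs):
            base = dict(proc=proc, pc=pc, op=op, var=var, data=data,
                        reg=-1 if reg is None else reg,
                        useReg=reg is not None)
            if op.startswith("St"):
                # A store splits into a local copy and one remote copy per
                # processor; all copies share the id of the first as wrID.
                wr_id = next_id
                copies = [("Local", proc)] + [("Remote", p) for p in range(nprocs)]
            else:
                wr_id = -1
                copies = [("DontCare", -1)]
            for wr_type, wr_proc in copies:
                tuples.append(dict(base, id=next_id, wrID=wr_id,
                                   wrType=wr_type, wrProc=wr_proc))
                next_id += 1
    return tuples

program = [
    [("St", 0, 1, None), ("Ld", 0, 1, 0), ("St", 1, 1, 0)],   # P1
    [("LdAcq", 1, 1, 1), ("Ld", 0, 0, 2)],                    # P2
]
tuples = gen_tuples(program, nprocs=2)
```

With p processors each store contributes p+1 tuples, which matches the "(p+1) copies" split introduced earlier in the talk.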

How the SAT encoding is achieved...

Example Execution

P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 0

Break it down into “tuples” (8 tuples obtained):

- Store c viewed at P1 for modeling bypassing
- Store c viewed at P1 for modeling global visibility
- Store c viewed at P2 for modeling global visibility
- Store d viewed at P1 for modeling bypassing
- Store d viewed at P1 for modeling global visibility
- Store d viewed at P2 for modeling global visibility
- Ld d viewed at P2 for modeling read value
- Ld c viewed at P2 for modeling read value

legalItanium(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireOtherOrderItanium ops order
    /\ requireReadValue ops order )

SC(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireOtherOrderSC ops order
    /\ requireReadValue ops order )

Constraint Encoding Approach #1

- n log n approach (“small domain” encoding)
- Attach a word w_t of log n bits to each tuple t (2 bits in the 4-tuple illustration)
- Tuple i before Tuple j --> Assert w_i < w_j
- StrictTotalOrder --> Assert that the w_t words are distinct
- Smaller # of Boolean Vars
- Much Harder SAT instances (abandoned for now)

Illustration on 4 tuples

- Boolean variables, two per tuple: x00 x01, x10 x11, x20 x21, x30 x31
- requireStrictTotalOrder ops order: for all i, j with i != j: (xi1, xi0) != (xj1, xj0)
- requireOtherOrder ops order and requireReadValue ops order: a system of constraints with primitive constraint (xi1, xi0) < (xj1, xj0)
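For the 4-tuple illustration, the two primitive constraints are plain Boolean formulas over the two bits of each word. The Python sketch below writes one possible formula for each; the names `lt2` and `ne2` are hypothetical, and the talk does not say which comparator circuit the tool used. Both are checked exhaustively against integer comparison.

```python
from itertools import product

def lt2(x1, x0, y1, y0):
    """(x1, x0) < (y1, y0) as a Boolean formula, x1 and y1 being the high
    bits: either the high bits decide, or they tie and the low bits decide."""
    return ((not x1) and y1) or ((x1 == y1) and (not x0) and y0)

def ne2(x1, x0, y1, y0):
    """(x1, x0) != (y1, y0): the words differ in at least one bit."""
    return (x1 != y1) or (x0 != y0)

# Exhaustive comparison with the integer reading of the two-bit words.
correct = all(
    lt2(x1, x0, y1, y0) == (2 * x1 + x0 < 2 * y1 + y0)
    and ne2(x1, x0, y1, y0) == (2 * x1 + x0 != 2 * y1 + y0)
    for x1, x0, y1, y0 in product([False, True], repeat=4)
)
```

Only n log n variables are needed, but every ordering fact becomes a comparator over several bits, which is one plausible reading of why these instances turned out harder to solve.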

Constraint Encoding Approach #2

- n × n approach (“e_ij” encoding)
- Assign a matrix position m_ij for each pair of tuples t_i and t_j
- Tuple i before Tuple j --> Assert m_ij true
- StrictTotalOrder --> Assert Irreflexivity, Transitivity, Totality
- Larger # of Boolean Vars
- Easier SAT instances (being pursued now)

Illustration on 4 tuples (a 4 × 4 matrix of Boolean variables m_ij)

- requireStrictTotalOrder ops order:
- Forall i : ~m_ii
- Forall i,j with i != j : m_ij \/ m_ji
- Forall i,j,k : m_ij /\ m_jk => m_ik
- requireOtherOrder ops order and requireReadValue ops order: a system of constraints with primitive constraint m_ij
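The three strict-total-order axioms translate directly into CNF clauses over the m_ij variables. The sketch below is a Python illustration with an assumed variable numbering (`i*n + j + 1`) and hypothetical function names; it uses a brute-force model counter instead of a real SAT solver, so it is only usable for very small n.

```python
from itertools import product

def var(i, j, n):
    """Assumed numbering of m_ij ('tuple i before tuple j')."""
    return i * n + j + 1

def strict_total_order_cnf(n):
    cnf = []
    for i in range(n):
        cnf.append([-var(i, i, n)])                    # irreflexivity
    for i in range(n):
        for j in range(i + 1, n):
            cnf.append([var(i, j, n), var(j, i, n)])   # totality (i != j)
    for i, j, k in product(range(n), repeat=3):        # transitivity
        cnf.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])
    return cnf

def count_models(cnf, nvars):
    """Count satisfying assignments by enumeration (tiny instances only)."""
    count = 0
    for bits in product([False, True], repeat=nvars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in cnf):
            count += 1
    return count

cnf3 = strict_total_order_cnf(3)
models3 = count_models(cnf3, 9)
```

For n = 3 the satisfying assignments are exactly the 3! = 6 strict total orders. The transitivity loop emits n^3 clauses, which is consistent with the concluding remark that transitivity is the main source of complexity.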

Table of Results (somewhat dated…)

SAT-instance generation time for n log n method

Tuples   Total Order   Other Order
32       0.2           1.6
64       1.2           17.1
128      5.7           179.0

SAT-instance generation time for n × n method

Tuples   Total Order   Other Order
32       0.5           0.1
64       4.3           0.9
128      34.2          9.0

SAT-checking times

         n log n                         n × n
Tuples   Monolith  TotalOrd  OtherOrd    Monolith  TotalOrd  OtherOrd
32       9.6       0.6       4.3         0.33      0.69      0.05
64       247.17    29.53     37.6        2.73      6.17      0.5
128      abort     1341      abort       164.8     145.6     351.1

Explaining the results of Sat


- Each clause generated by the sat-generating checker program also generates an associated annotation tuple.
- This tuple has information pertaining to the clause’s source.
- Each tuple has the following information:
- The ops involved in generating the clause (up to 4 ops can generate a clause)
- The proc value of the processor whose instructions were used to generate this clause (taken from the tuples generated by the gentuple program)
- The pc value of the instruction that was the source for this tuple
- The name of the memory ordering rule whose application generated this tuple (ReadValue, ProgramOrder, Reflexive, etc.)

- The clause annotation looks as follows:
< proc, pc, op1, op2, op3, op4, RuleName >

P: st [x] = 1 ; mf ; ld r1 = [y] <0>
Q: st.rel [y] = 1
R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>

- The Sat instance generated for the above example is UNSAT.
- Next few slides show automated approach to detect the root cause cycle.
- We will ignore the reflexive and transitive rules in these slides (they are necessary to force unsat, but useless in building a cycle!!)

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule

op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive

op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder

op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder

op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder

op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder

op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder

op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder

op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder

op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder

op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder

op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder

op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

Op numbers (a store has both local and remote ops):

P: st [x] = 1 (ops 1, 2, 3, 4) ; mf (op 5) ; ld r1 = [y] <0> (op 6)
Q: st.rel [y] = 1 (ops 7, 8, 9, 10)
R: ld.acq r2 = [y] <1> (op 11) ; ld r3 = [x] <0> (op 12)

The same litmus test is shown on each of the following slides, with one group of annotations highlighted at a time:

st [x] = 1 (op 4) and mf (op 5):
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder

mf (op 5) and ld r1 = [y] <0> (op 6):
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder

ld r1 = [y] <0> (op 6):
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

st.rel [y] = 1 (op 10):
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

ld.acq r2 = [y] <1> (op 11):
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

ld.acq r2 = [y] <1> (op 11) and ld r3 = [x] <0> (op 12):
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder

ld r3 = [x] <0> (op 12):
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

Good Case-study Illustrating Program Derivation from Formal Specs

- Initial specs: HOL
- Formal derivation of tail-recursive functional programs
- “Code generation” consists of generating Boolean clauses
- Choose Boolean encoding method
- Re-target code generation correspondingly

- Source-level optimizations
- Record known orderings (e.g., “i before j”) – these manifest as unit clauses
- Infer others (e.g., “not j before i”) - generate unit-clauses for these too
- Prevent generating transitivity axioms that depend on “j before i”

- The use of incremental SAT can perhaps be directed by “functional scripts” that are automatically generated
- Use of Unsat cores to pinpoint errors

Concluding Remarks

- Main source of complexity: the transitivity axiom
- “Lazy” methods for handling transitivity must be investigated
- Hybrid Sat encoding (partly n × n and partly n log n) can also help, as was the experience of Lahiri, Seshia, and Bryant
- Analyzing larger programs:
- Somehow view program in terms of “basic blocks”
- Treat each basic block as super instruction
- If super-instruction unordered, no need to descend into basic block

- Exploit incremental Sat when same litmus tests are rerun
- Try modeling another weak memory model

Extra Slides

- The CNF file generated by the sat-generating program is solved using zchaff.
- If SAT, then we get a satisfying assignment.
- First n*n variables in the assignment correspond to the n*n variables in our ordering. Can be used to output a valid ordering of the ops.
- If UNSAT, then need a way to find a “root-cause” for the illegality of the execution.
- We use unsatisfiable core generation to get to the root cause.
- An unsatisfiable core of an unsatisfiable Sat instance is a subset of clauses of the formula such that its conjunction is still UNSAT.
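When the instance is SAT, a valid ordering of the ops can be read off the first n*n variables. The Python sketch below shows one way to do that, assuming variable `i*n + j + 1` means "op i before op j" (the numbering is an assumption; the slides only say that the first n*n variables are the ordering variables). The function name `decode_order` is hypothetical.

```python
def decode_order(assignment, n):
    """assignment maps a 1-based variable number to its truth value.
    Returns the ops listed from first to last."""
    def before(i, j):
        return assignment[i * n + j + 1]
    # In a strict total order, an op's position equals its number of predecessors.
    preds = [sum(before(j, i) for j in range(n)) for i in range(n)]
    return sorted(range(n), key=lambda i: preds[i])

# Toy assignment for n = 3 encoding the order op 2, op 0, op 1.
n = 3
assignment = {v: False for v in range(1, n * n + 1)}
for i, j in [(2, 0), (2, 1), (0, 1)]:
    assignment[i * n + j + 1] = True
order = decode_order(assignment, n)
```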

- Zchaff can be told to generate resolution trace while checking for Sat.
- Zcore – tool that takes as input a CNF file and resolution trace produced by zchaff and produces unsatisfiable core.
- Zcore available as part of zchaff.
- Unsatisfiable core is another CNF file with the reduced set of clauses.
- Can be fed back into zchaff/zcore to generate a potentially smaller unsatisfiable core.
- Process repeated till fixed point reached.
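The "repeat until fixed point" loop can be illustrated without zchaff or zcore. The following Python sketch swaps in a different, much simpler technique: deletion-based shrinking with a brute-force satisfiability check, instead of extracting a core from a resolution trace. All function names are hypothetical, and it only scales to toy formulas, but the outer loop has the same shape as the process described above.

```python
from itertools import product

def satisfiable(cnf):
    """Brute-force SAT check over the variables that occur in cnf."""
    variables = sorted({abs(l) for c in cnf for l in c})
    for bits in product([False, True], repeat=len(variables)):
        val = dict(zip(variables, bits))
        if all(any(val[abs(l)] == (l > 0) for l in c) for c in cnf):
            return True
    return False

def shrink_once(cnf):
    """One pass: drop every clause whose removal leaves the rest UNSAT."""
    core = list(cnf)
    for clause in list(core):
        trial = [c for c in core if c is not clause]
        if not satisfiable(trial):
            core = trial
    return core

def shrink_to_fixed_point(cnf):
    """Repeat the pass until the core stops getting smaller."""
    core = list(cnf)
    while True:
        smaller = shrink_once(core)
        if len(smaller) == len(core):
            return core
        core = smaller

# Toy UNSAT instance: only the first two clauses conflict.
core = shrink_to_fixed_point([[1], [-1], [2, 3], [-2, 3]])
```

A single deletion pass already yields a minimal core in this sketch; the loop is kept to mirror the iterate-until-stable process, where each zchaff/zcore round may or may not shrink the core further.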

- Clauses in the unsatisfiable core contain the ordering violation information
- A tool homes in on the root cause of the violation
- If the root cause is not something trivial, then the cause is usually a cycle of instructions. Each link in the cycle corresponds to an ordering requirement between the instructions involved.
- If a cycle exists, then Transitivity can be applied to show that Irreflexivity is not satisfied.
- Input to the tool to generate root cause:
- The original set of annotated machine instructions for all processors
- The default values stored in memory locations at the beginning of the execution
- Clause annotations for the clauses that form the unsatisfiable core

- Each ReadValue rule generates a set of clauses.
- From the annotations, find the tuples that come from the same ReadValue rule (two different ops will be involved in a rule)
- Extract the ops out of the annotations and get the corresponding instructions (using the proc and pc values)
- From the data being used in the ld instruction and the default data value for the corresponding memory address, it can be seen whether the effect of a store is being reflected in a load.
- This way the dependency between a load and a store is established.
- The above is done for all the ReadValue rules in the annotations
- Ops (and the corresponding instructions) on both sides of a mf that form a link in the cycle are inferred based on ProgramOrder rule annotations and the pc values involved.
- The other missing links in the violating cycle are also inferred based on the remaining ProgramOrder rule annotations.
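Once the links have been inferred, finding the violating cycle is a graph search. The Python sketch below is a generic depth-first cycle finder, not the authors' tool; `find_cycle` is a hypothetical name. The edge list is one reading of the P/Q/R example's annotations: the edge directions, and the treatment of AtomicWBRelease as a link between two copies of the st.rel (ops 8 and 10), are assumptions made for illustration.

```python
def find_cycle(edges):
    """edges: list of (src, dst, rule). Returns one cycle as a list of
    edges, or None if the ordering links are acyclic."""
    graph = {}
    for src, dst, rule in edges:
        graph.setdefault(src, []).append((dst, rule))
    done = set()   # nodes fully explored without finding a cycle

    def dfs(node, path, on_path):
        on_path.add(node)
        for dst, rule in graph.get(node, []):
            step = (node, dst, rule)
            if dst in on_path:
                # Cycle closed: keep the part of the path that starts at dst.
                full = path + [step]
                start = next(n for n, e in enumerate(full) if e[0] == dst)
                return full[start:]
            if dst not in done:
                found = dfs(dst, path + [step], on_path)
                if found:
                    return found
        on_path.discard(node)
        done.add(node)
        return None

    for node in list(graph):
        if node not in done:
            found = dfs(node, [], set())
            if found:
                return found
    return None

# Assumed links between op numbers of the P/Q/R example.
edges = [
    (4, 5, "ProgramOrder"), (5, 6, "ProgramOrder"), (6, 8, "ReadValue"),
    (8, 10, "AtomicWBRelease"), (10, 11, "ReadValue"),
    (11, 12, "ProgramOrder"), (12, 4, "ReadValue"),
]
cycle = find_cycle(edges)
```

Each returned edge carries its rule name, so the cycle can be printed as a chain of instructions together with the ordering rule responsible for each link.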