Explore the journey of improving software reliability through innovative analysis techniques. Learn about the importance of real code, dynamic analysis, and precision vs. efficiency trade-offs in pointer analysis. Discover how frameworks play a crucial role in research and practicality.
The SPA Project: GOLF and ESP • Manuvir Das, Microsoft Research (joint work with Manuel Fahndrich, Jakob Rehof)
Software Productivity Tools • Jim Larus runs the group • research.microsoft.com/spt • SLAM, Vault, Behave, PipelineServer … • Focus on software reliability
What’s wrong with analysis? • A: We don’t write or look at real code • B: We don’t solve real problems
Why does this happen? • Analysis is a mix of theory and practice • But • Math and theory are elegant • experimentation needs infrastructure • engineering is boring
Today we’ll talk about … • Doing analysis research the right way • My day job • Slicing and Partial Evaluation • Pointer analysis • Error detection
Slicing and Partial Evaluation • PE: Which computations depend only on known inputs? Do these early. • Or, which computations may depend on unknown inputs? Don’t do these early. • Insight: If a computation depends on unknown input, there must be an unknown input in its slice.
Forward slicing and BTA • Binding-time analysis • identify static computations • BTA via slicing • mark all unknown input nodes • forward slice from marked nodes and mark • all unmarked nodes are static computations
Why is this interesting? • Slicing incorporates control dependence • Previous work used reaching definitions • read(y); x = 0; while (y != 0) { y--; x++; } z = x; • We can now prove correctness
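The BTA-via-slicing recipe can be sketched as plain graph reachability. This is a minimal illustration, not the talk's implementation: the `DepGraph` type, the function names, and the hand-built dependence edges for the loop example are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* Hypothetical dependence graph: dep[u][v] is true when node v is
   data- or control-dependent on node u. */
typedef struct {
    size_t n;
    bool dep[MAX_NODES][MAX_NODES];
} DepGraph;

/* Forward slice: mark every node reachable from u along dependence edges. */
static void mark_forward(const DepGraph *g, size_t u, bool *dynamic) {
    if (dynamic[u]) return;
    dynamic[u] = true;
    for (size_t v = 0; v < g->n; v++)
        if (g->dep[u][v]) mark_forward(g, v, dynamic);
}

/* Binding-time analysis: on return, dynamic[i] is true for nodes in the
   forward slice of some unknown input; all remaining nodes are static. */
void bta(const DepGraph *g, const bool *unknown_input, bool *dynamic) {
    assert(g->n <= MAX_NODES);
    for (size_t i = 0; i < g->n; i++) dynamic[i] = false;
    for (size_t i = 0; i < g->n; i++)
        if (unknown_input[i]) mark_forward(g, i, dynamic);
}

/* Hand-built graph for the loop example (edges are an assumption):
   0: read(y)  1: x = 0  2: while (y != 0)  3: y--  4: x++  5: z = x */
DepGraph example_graph(void) {
    DepGraph g = { .n = 6 };
    g.dep[0][2] = true; g.dep[0][3] = true;   /* data: y from read(y)   */
    g.dep[3][2] = true; g.dep[3][3] = true;   /* data: y from y--       */
    g.dep[1][4] = true; g.dep[4][4] = true;   /* data: x into x++       */
    g.dep[1][5] = true; g.dep[4][5] = true;   /* data: x into z = x     */
    g.dep[2][3] = true; g.dep[2][4] = true;   /* control: loop body     */
    return g;
}
```

With `read(y)` marked unknown, the control edge from the loop test to `x++` is what pulls `x++` and `z = x` into the dynamic set; following data dependences alone would leave both looking static.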
This project had flaws … • A: We don’t write or look at real code • cubic algorithm, ran on 2k lines in 30 minutes • only one benchmark (ray tracer) • B: We don’t solve real problems • who uses PE in practice? • was the lack of safety critical? • why not use a timer?
Then I visited MSR … • Daniel Weise – 1.5 million lines of real code • Real problems – software reliability • I was hooked! • find buffer overflows using static analysis • oops, need pointer analysis
Papers don’t tell the whole truth! • Implemented Ste96, engineered it • lightning fast, but poor results • Lots of papers on how to improve • structures, signatures, SH97 • Tried it all, nothing worked on real code • Needed Andersen (subtyping) on real code
Frameworks are good • A spectrum from Ste96 to And94 • DGC POPL 98 : unification vs flow • SH POPL 97 : buckets within ECRs • Frameworks • give us a way of tuning precision vs efficiency • help us understand the problem
Frameworks are bad • The real issue: how do you find the best trade-off point in a principled manner? • What if the parameter being varied is not the key concept? • CFA varies control depth rather than data • SH 97 picks random categories • DGC 98 alters the behaviour of the same statement
Back to pointer analysis … • No way to run Andersen on MLOC
So, I hid in my office … • Stared at SPEC code, wrote perl scripts • every feature is used • code is idiomatic • pointers are never assigned, except heap • most pointers arise through parameter passing • some code is just too hard for any analysis • Result: new algorithm driven by real code
Pointer Analysis Landscape • FSCS: Flow-sensitive Context-sensitive • FICS: Flow-insensitive Context-sensitive • FSCI: Flow-sensitive Context-insensitive • FICI: Flow-insensitive Context-insensitive • [Figure: the four categories plotted against precision and cost]
FICI Pointer Analysis • Steensgaard (almost linear): cheap but imprecise; 1.5 MLOC in 1 minute, 100 MB • One level flow (quadratic) • Andersen (cubic): precise but expensive; 500 KLOC in several minutes, 2 GB
Andersen’s Algorithm • p = &q; • p = q; • [Figure: points-to graphs for each statement over nodes p, q, r1, r2, r3]
Andersen’s Algorithm • p = *q; • *p = q; • [Figure: points-to graphs for each statement over nodes p, q, r1, r2, s1, s2, s3]
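The four statement forms on the Andersen slides can be read as subset constraints solved to a fixpoint. The sketch below is a naive iteration over bitsets, written only to illustrate the idea; the `Stmt` encoding, the 32-variable limit, and the function names are assumptions, and this is not the engineered implementation the talk refers to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Statement kinds: p = &q;  p = q;  p = *q;  *p = q; */
enum { ADDR_OF, COPY, LOAD, STORE };

typedef struct { int kind; int lhs; int rhs; } Stmt;

/* Add the bits of src to *dst; report whether *dst grew. */
static bool add_all(uint32_t *dst, uint32_t src) {
    uint32_t old = *dst;
    *dst |= src;
    return *dst != old;
}

/* pts[v] becomes a bitset of the variables v may point to. */
void andersen(const Stmt *s, size_t n_stmts, int n_vars, uint32_t *pts) {
    assert(n_vars >= 0 && n_vars <= 32);
    for (int v = 0; v < n_vars; v++) pts[v] = 0;
    bool changed = true;
    while (changed) {                       /* iterate until nothing grows */
        changed = false;
        for (size_t i = 0; i < n_stmts; i++) {
            int p = s[i].lhs, q = s[i].rhs;
            switch (s[i].kind) {
            case ADDR_OF:                   /* q is in pts(p) */
                changed |= add_all(&pts[p], UINT32_C(1) << q);
                break;
            case COPY:                      /* pts(q) is a subset of pts(p) */
                changed |= add_all(&pts[p], pts[q]);
                break;
            case LOAD:                      /* for r in pts(q): pts(r) into pts(p) */
                for (int r = 0; r < n_vars; r++)
                    if ((pts[q] >> r) & 1) changed |= add_all(&pts[p], pts[r]);
                break;
            case STORE:                     /* for r in pts(p): pts(q) into pts(r) */
                for (int r = 0; r < n_vars; r++)
                    if ((pts[p] >> r) & 1) changed |= add_all(&pts[r], pts[q]);
                break;
            }
        }
    }
}
```

For `p = &s1; p = &s2; q = &s3; q = p;` this yields `p -> {s1, s2}` and `q -> {s1, s2, s3}`: the copy is directional, so `s3` never leaks back into `p`.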
Steensgaard’s Algorithm • p = q; • [Figure: points-to graphs over nodes p and q]
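Steensgaard-style analysis treats `p = q` as an equality between what `p` and `q` point to, which is usually implemented with union-find. The following is a minimal sketch under that reading; the `Steens` structure and every function name are hypothetical, and only address-of and copy statements are handled.

```c
#include <assert.h>

#define MAX_ECR 64

/* Each equivalence class (ECR) has at most one pointee class. */
typedef struct {
    int n;                  /* number of nodes in use (variables + fresh) */
    int parent[MAX_ECR];    /* union-find parent links */
    int pointee[MAX_ECR];   /* pointee node, or -1; meaningful on representatives */
} Steens;

void st_init(Steens *s, int n_vars) {
    assert(n_vars >= 0 && n_vars <= MAX_ECR);
    s->n = n_vars;
    for (int i = 0; i < n_vars; i++) { s->parent[i] = i; s->pointee[i] = -1; }
}

int st_find(Steens *s, int x) {
    while (s->parent[x] != x) {
        s->parent[x] = s->parent[s->parent[x]];   /* path halving */
        x = s->parent[x];
    }
    return x;
}

/* Merge two classes and, recursively, the classes they point to. */
void st_unify(Steens *s, int a, int b) {
    a = st_find(s, a);
    b = st_find(s, b);
    if (a == b) return;
    int pa = s->pointee[a], pb = s->pointee[b];
    s->parent[b] = a;
    if (pa < 0) s->pointee[a] = pb;
    else if (pb >= 0) st_unify(s, pa, pb);
}

/* Pointee class of v, creating a fresh empty class if v has none yet. */
static int st_pointee(Steens *s, int v) {
    int r = st_find(s, v);
    if (s->pointee[r] < 0) {
        assert(s->n < MAX_ECR);
        int fresh = s->n++;
        s->parent[fresh] = fresh;
        s->pointee[fresh] = -1;
        s->pointee[r] = fresh;
    }
    return st_find(s, s->pointee[r]);
}

/* p = &x: x joins the class that p points to. */
void st_addr_of(Steens *s, int p, int x) { st_unify(s, st_pointee(s, p), x); }

/* p = q: the pointee classes of p and q become one class. */
void st_copy(Steens *s, int p, int q) {
    int a = st_pointee(s, p);
    int b = st_pointee(s, q);
    st_unify(s, a, b);
}
```

On `p = &s1; p = &s2; q = &s3; q = p;` the copy merges both pointee classes, so `s1`, `s2` and `s3` end up in a single class even though `p` never receives `&s3`. That bidirectional merging is the source of the speed and of the imprecision.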
Motivation for One Level Flow • foo(&s1); foo(&s2); bar(&s3); • foo(struct s *p) { p->a = 3; bar(p); } • bar(struct s *q) { q->b = 4; }
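Simplified Example • p = &s1; p = &s2; q = &s3; q = p; p->a = 3; q->b = 4; • [Figure: two points-to graphs for p and q, one with separate targets s1, s2, s3 and one with a single merged target s1,s2,s3]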
One Level Flow • p = q; • [Figure: flow graph over nodes p and q]
Simplified Example • p = &s1; p = &s2; q = &s3; q = p; p->a = 3; q->b = 4; • [Figure: one level flow graph over nodes p, q, s1, s2, s3]
OLF: Simple Reachability • Single query: linear • All queries: quadratic
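A single reachability query can be sketched as a backward walk over flow edges that collects address-of targets. This is a deliberately simplified model of one level flow: it keeps only directional edges between top-level pointers and omits the unification OLF applies below that level. The `FlowGraph` type and function names are assumptions for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OLF_MAX 32

/* Hypothetical flow graph over top-level pointer variables. */
typedef struct {
    int n;
    bool flow[OLF_MAX][OLF_MAX];  /* flow[q][p]: an assignment p = q exists */
    uint32_t addr[OLF_MAX];       /* bitset of x such that p = &x exists    */
} FlowGraph;

/* Walk flow edges backward from v, accumulating address-of targets. */
static void visit(const FlowGraph *g, int v, bool *seen, uint32_t *out) {
    if (seen[v]) return;
    seen[v] = true;
    *out |= g->addr[v];
    for (int u = 0; u < g->n; u++)
        if (g->flow[u][v]) visit(g, u, seen, out);
}

/* One query: everything v may point to. Each node is visited once; an
   adjacency-list graph would make this linear in nodes plus edges, the
   matrix here is only for brevity. */
uint32_t olf_query(const FlowGraph *g, int v) {
    assert(g->n <= OLF_MAX && v >= 0 && v < g->n);
    bool seen[OLF_MAX] = { false };
    uint32_t out = 0;
    visit(g, v, seen, &out);
    return out;
}
```

Answering a query for every variable repeats this walk once per node, which is where the quadratic bound for all queries comes from.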
OLF: Cached Reachability • MS Word: from 1 hour to 30 seconds for all queries • [Figure: flow graph with nodes x and y and a label MAX]
This project had flaws too … • B: We don’t solve real problems • solved an open problem in pointer analysis • But • never got around to buffer overflow • didn’t use PTA for optimization • addressed these issues later, but • should have been driven by the problem
Since then … • Others have made And94 fast • Heintze PLDI 01 • suggested by OLF results • But what about context-sensitivity? • crucial for value flow analysis • GOLF (DLFR SAS 01) • combines OLF and one level of instantiation constraints (Rehof’s lecture) • context-sensitive value flow on MLOC
OLF: Call Example • id(r) {return r;} p = id(&x); q = id(&y); *p = 3; • r = &x; p = r; r = &y; q = r; *p = 3;
OLF: Call Example • r = &x; p = r; r = &y; q = r; *p = 3; • [Figure: flow graph over r, p, q, x, y, *r, *p, *q]
GOLF: Call Example • id(r) {return r;} p = id(&x); q = id(&y); *p = 3; • [Figure: flow graph over r, p, q, x, y, *r, *p, *q with edges labeled ( ) and [ ]]
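The bracket labels in the GOLF figure suggest that a flow path only counts when its call and return edges pair up by call site. A minimal sketch of that check follows, assuming a hypothetical encoding in which a positive label k opens call site k, a negative label -k closes it, and 0 is an unlabeled edge. It is an illustration of matched-parenthesis filtering, not GOLF's constraint solver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPTH 64

/* Returns true when every close label that has a pending open label
   matches the most recent one. Unmatched opens or closes are accepted. */
bool golf_path_ok(const int *labels, size_t n) {
    int stack[MAX_DEPTH];
    size_t depth = 0;
    for (size_t i = 0; i < n; i++) {
        int l = labels[i];
        if (l > 0) {
            assert(depth < MAX_DEPTH);
            stack[depth++] = l;               /* entering call site l */
        } else if (l < 0 && depth > 0) {
            if (stack[depth - 1] != -l) return false;  /* wrong call site */
            depth--;
        }
        /* l == 0, or a close with nothing pending, needs no check */
    }
    return true;
}
```

In the `id` example, the path from `x` into `r` and out to `p` uses the same call site and is accepted, while the path from `x` into `r` and out to `q` mixes two call sites and is rejected, so `*p = 3` is not seen as writing `y`.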
We have an analysis that is … • fast enough to run on MLOC • good enough for static optimization • who cares; leave it to the chip makers! • not good enough for dynamic optimization (MDCE PASTE 01) • not good enough to track interesting correctness properties in real code
Correctness: the killer app • Hardware can • speed up programs • enforce correctness at run-time • Hardware cannot • enforce correctness before product is shipped • Testers can • find errors on some paths • Testers cannot • find errors on all paths • So, use static analysis to find errors
ESP Vision • Error Detection via Scalable Program Analysis • Must be driven by real code • Must be sound (report all errors) • Must report few false positives • Use knowledge of tradeoffs in analysis • Let user help the analysis
Step 1: Identify the problem • Solve a realistic problem: • partial correctness • user specified, finite-state properties • Solve a non-trivial problem: • don’t check uninits, NULL pointers • check locking protocols, resource usage
Parameterized Protocol Tracking • User specified • FSM with parameterized actions • patterns • Rest is automatic • [Figure: FSM with states INIT(l), LOCKED(l), ERROR(l) and edges labeled Lock(l), Unlock(l), Ret]
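The locking protocol can be written down as a small transition function for one lock instance `l`. The states and actions follow the slide; the exact edge set is partly an assumption, because the flattened figure does not show what `Unlock` does in `INIT` (treated as an error here) or whether `ERROR` is absorbing (assumed so). All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { LS_INIT, LS_LOCKED, LS_ERROR } LockState;
typedef enum { ACT_LOCK, ACT_UNLOCK, ACT_RET } LockAction;

/* One step of the protocol for a single lock. */
LockState lock_step(LockState s, LockAction a) {
    switch (s) {
    case LS_INIT:
        if (a == ACT_LOCK) return LS_LOCKED;
        if (a == ACT_UNLOCK) return LS_ERROR;  /* assumption: not on the slide */
        return LS_INIT;                        /* Ret without the lock is fine */
    case LS_LOCKED:
        if (a == ACT_UNLOCK) return LS_INIT;
        return LS_ERROR;                       /* double Lock, or Ret while locked */
    default:
        return LS_ERROR;                       /* assumption: ERROR is absorbing */
    }
}

/* Run a whole action trace starting from INIT. */
LockState lock_run(const LockAction *trace, size_t n) {
    assert(trace != NULL || n == 0);
    LockState s = LS_INIT;
    for (size_t i = 0; i < n; i++) s = lock_step(s, trace[i]);
    return s;
}
```

A trace of `Lock, Unlock, Ret` ends in `INIT`, while `Lock, Ret` ends in `ERROR` because the function returns with the lock held.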
Step 2: Examine real code • Find common idioms • Understand level of precision needed • Windows device drivers • mostly control dominated protocols • global data flow needs CS, but not FS/PS • path feasibility seems to matter
Sample driver code • STATUS Initialize(Object o) { Object p = o; if (p->needLock) KeAcquireSpinLock(p); p->data = 0; if (p->needLock) KeReleaseSpinLock(p); return OK; }
Step 3: Break up the problem • Three distinct entities to be tracked • the temporal sequence of actions along a particular control flow path • the data involved in the actions • the data involved in path feasibility • Can use different levels of static analysis to track each entity
Data analysis vs control analysis • RHS 95: Cost is O(ED^3). What is D? • dataflow: D is generally related to program size • program size grows because of pointers, globals • What if there is only a single global FSM? • D is just the #states in the FSM! • Control is cheap, data is expensive
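To make the gap concrete, compare the cubic factor for a three-state FSM (such as INIT, LOCKED, ERROR) with a dataflow domain that grows with the program. The value 10^4 below is purely illustrative and not a measurement from the talk.

```latex
\[
\text{cost} = O\!\left(E \cdot D^{3}\right), \qquad
D = 3 \;\Rightarrow\; D^{3} = 27, \qquad
D = 10^{4} \;\Rightarrow\; D^{3} = 10^{12}.
\]
```

With a fixed FSM the cubic term is a small constant and the cost is effectively linear in E; with a program-sized domain the same bound is out of reach.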
Step 4: Design static analyses • track the temporal sequence of actions along a particular control flow path • cannot use flow-insensitive analysis • RHS95 is too expensive • eliminate the data involved in the actions • use GOLF value flow • now we have a control property, use RHS95 • both analyses are context-sensitive
Data elimination • STATUS Initialize(Object o) { Object p = o; if (p->needLock) KeAcquireSpinLock(p); p->data = 0; if (p->needLock) KeReleaseSpinLock(p); return OK; }
Data elimination • Initialize() { if (*) Lock; if (*) Unlock; } • [Figure: FSM with states I, L and E]
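Once the data is eliminated, the remaining control property can be checked by pushing a set of FSM states through the abstracted body. The sketch below does this for `if (*) Lock; if (*) Unlock;` followed by a return. It assumes the same transitions as the protocol slide, plus two guesses that are not on the slides: `Unlock` in the initial state is an error and the error state is absorbing. All names are hypothetical.

```c
#include <assert.h>

/* FSM states as bits so that a set of states fits in one unsigned. */
enum { S_INIT = 1u, S_LOCKED = 2u, S_ERROR = 4u };
enum { A_LOCK, A_UNLOCK, A_RET };

/* Transition for a single state (exactly one bit set). */
static unsigned step_state(unsigned s, int a) {
    if (s == S_ERROR) return S_ERROR;              /* assumption: absorbing */
    if (s == S_INIT) {
        if (a == A_LOCK) return S_LOCKED;
        if (a == A_UNLOCK) return S_ERROR;         /* assumption */
        return S_INIT;
    }
    return a == A_UNLOCK ? S_INIT : S_ERROR;       /* s == S_LOCKED */
}

/* Image of a whole set of states under one action. */
unsigned apply_action(unsigned set, int a) {
    assert((set & ~7u) == 0);
    unsigned out = 0;
    for (unsigned s = S_INIT; s <= S_ERROR; s <<= 1)
        if (set & s) out |= step_state(s, a);
    return out;
}

/* "if (*) action;" may or may not perform the action. */
unsigned maybe_action(unsigned set, int a) {
    return set | apply_action(set, a);
}

/* Abstracted Initialize(): if (*) Lock; if (*) Unlock; return. */
unsigned analyze_initialize(void) {
    unsigned set = S_INIT;
    set = maybe_action(set, A_LOCK);
    set = maybe_action(set, A_UNLOCK);
    return apply_action(set, A_RET);
}
```

The result contains the error state, because the abstraction admits the path that locks and then skips the unlock. In the original driver code both branches test the same `needLock` flag, so that path cannot happen, which is consistent with the earlier observation that path feasibility seems to matter.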
Do we need context-sensitivity? • What if GOLF cannot provide MUST info? void Initialize(Object o1, Object o2) { LockWrapper(o1); LockWrapper(o2); KeReleaseSpinLock(o1); KeReleaseSpinLock(o2); } void LockWrapper(Object p) { KeAcquireSpinLock(p); }
Interface nodes • Limit scope of value flow to interface nodes • Produce RHS summaries for interface nodes • void LockWrapper(Object p) { KeAcquireSpinLock(p); } • p: INIT -> LOCKED, LOCKED -> ERROR • Copy summaries to callers
Back to our example … • void Initialize(Object o1, Object o2) { i: LockWrapper(o1); j: LockWrapper(o2); KeReleaseSpinLock(o1); KeReleaseSpinLock(o2); } • void LockWrapper(Object p) { KeAcquireSpinLock(p); } • [Figure: value flow graph over o1, o2 and p with edges labeled i and j]
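One way to read the interface-node idea is that `LockWrapper` gets a summary on its parameter `p`, and each call site applies that summary to its own actual argument. The sketch below contrasts this with merging `o1` and `o2` into one abstract object. The summary entries for INIT and LOCKED come from the slide; the ERROR entry, the behaviour of release outside LOCKED, and all identifiers are assumptions.

```c
#include <assert.h>

typedef enum { ST_INIT, ST_LOCKED, ST_ERROR, ST_COUNT } State;

/* Summary of a callee on one interface node: entry state -> exit state. */
typedef struct { State map[ST_COUNT]; } Summary;

/* LockWrapper(p): INIT -> LOCKED, LOCKED -> ERROR (ERROR -> ERROR assumed). */
Summary lock_wrapper_summary(void) {
    Summary s = { { ST_LOCKED, ST_ERROR, ST_ERROR } };
    return s;
}

/* KeReleaseSpinLock on a tracked object (error outside LOCKED is assumed). */
static State release(State s) {
    return s == ST_LOCKED ? ST_INIT : ST_ERROR;
}

/* Call sites i and j each apply the summary to their own argument. */
void initialize_per_site(State *o1, State *o2) {
    assert(*o1 < ST_COUNT && *o2 < ST_COUNT);
    Summary w = lock_wrapper_summary();
    *o1 = w.map[*o1];      /* i: LockWrapper(o1) */
    *o2 = w.map[*o2];      /* j: LockWrapper(o2) */
    *o1 = release(*o1);
    *o2 = release(*o2);
}

/* For contrast: o1 and o2 collapsed into a single abstract object. */
State initialize_merged(State merged) {
    assert(merged < ST_COUNT);
    Summary w = lock_wrapper_summary();
    merged = w.map[merged];    /* first call locks the merged object   */
    merged = w.map[merged];    /* second call looks like a double lock */
    merged = release(merged);
    merged = release(merged);
    return merged;
}
```

Starting from INIT, the per-site version leaves both objects in INIT, while the merged version reports ERROR on the second call, a false alarm that context-sensitive instantiation of the summary avoids.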
Consider the abstraction! • ESP makes an upfront abstraction • interface nodes in the GOLF graph • Plus: linear size, controls overall cost • Minus: may be too coarse • SLAM allows tuning of abstraction • but now we are back in the framework game