An Efficient Inclusion-Based Points-To Analysis for Strictly-Typed Languages

John Whaley, Monica S. Lam
Computer Systems Laboratory, Stanford University
September 18, 2002

Background
• Andersen’s points-to analysis for C (1994)
  • Flow-insensitive, context-insensitive
  • Inclusion-based, more accurate than Steensgaard’s unification-based analysis
  • $O(n^3)$, considered too slow to be practical
• CLA optimization to Andersen’s analysis (Heintze & Tardieu, PLDI’01)
  • Online caching/cycle elimination
  • Field-independent: 1.3M lines of code in 137s
SAS 2002

Doing it for Java
• We want Andersen-level pointers for Java
• Naïve port of CLA algorithm:
  • Spec “compress” benchmark: 2+ hours!
  • Call graph accuracy: same as RTA (terrible)
• Our paper: how to do CLA for Java
  • Spec “compress” benchmark: 5 seconds!
  • JEdit (1371 classes): ~10 minutes!
  • Call graph accuracy: very good

Java vs. C: Virtual calls
• Java has many virtual calls
• Accuracy of analysis strongly affects number of call targets
• More call targets lead to more code being analyzed and longer analysis times
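To make the link between receiver types and call targets concrete, here is a minimal sketch. The class hierarchy (Shape, Circle, Square) and all identifiers are assumptions made up for illustration; none of them come from the talk.

```java
import java.util.*;

// Hypothetical sketch: the targets of a virtual call are determined by the
// concrete types the receiver may have. All names here are made up.
class CallTargets {
    final Map<String, String> superOf = new HashMap<>();        // class -> superclass (null for a root)
    final Map<String, Set<String>> declared = new HashMap<>();  // class -> methods it declares

    void addClass(String name, String superName, String... methods) {
        superOf.put(name, superName);
        declared.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    // Walks up the hierarchy to the first class that declares the method.
    String dispatch(String type, String method) {
        for (String c = type; c != null; c = superOf.get(c))
            if (declared.get(c).contains(method)) return c + "." + method;
        return null;
    }

    // Collects the dispatch target for each possible receiver type.
    Set<String> targets(Collection<String> receiverTypes, String method) {
        Set<String> result = new TreeSet<>();
        for (String t : receiverTypes) {
            String target = dispatch(t, method);
            if (target != null) result.add(target);
        }
        return result;
    }

    public static void main(String[] args) {
        CallTargets h = new CallTargets();
        h.addClass("Shape", null, "draw");
        h.addClass("Circle", "Shape", "draw");
        h.addClass("Square", "Shape", "draw");
        // Receiver types as a precise points-to analysis might report them.
        System.out.println(h.targets(Arrays.asList("Circle"), "draw"));
        // Receiver types if every class in the hierarchy is assumed possible.
        System.out.println(h.targets(Arrays.asList("Shape", "Circle", "Square"), "draw"));
    }
}
```

With a single possible receiver type the call has one target; with three it has three, and each extra target is another method body the analysis has to process.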

Java vs. C: Treatment of Fields
• Field-independent: in o.f, use only o
  • Most C pointer analyses
  • Sound even for non-type-safe languages
• Field-based: in o.f, use only f
  • Very inaccurate, requires type safety
• Field-sensitive: in o.f, use both o, f
  • Strictly more accurate than field-independent or field-based
  • Essential for Java
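One way to read the three treatments is as three choices of key for the abstract location written by a store `o.f = v`. The sketch below is only an illustration of that reading; the class name, the enum, and the key format are assumptions, not the paper's representation.

```java
import java.util.*;

// Hypothetical illustration: how each field treatment names the abstract
// location written by a store "o.f = v". Names are not from the paper.
class FieldTreatment {
    enum Mode { FIELD_INDEPENDENT, FIELD_BASED, FIELD_SENSITIVE }

    // Returns the key under which the stored value is recorded.
    static String locationKey(Mode mode, String object, String field) {
        switch (mode) {
            case FIELD_INDEPENDENT: return object;                // use only o
            case FIELD_BASED:       return "*." + field;          // use only f
            default:                return object + "." + field;  // use both o and f
        }
    }

    // Counts distinct abstract locations for a list of {object, field} stores.
    static int distinctLocations(Mode mode, String[][] stores) {
        Set<String> keys = new HashSet<>();
        for (String[] s : stores) keys.add(locationKey(mode, s[0], s[1]));
        return keys.size();
    }

    public static void main(String[] args) {
        String[][] stores = { {"a", "f"}, {"a", "g"}, {"b", "f"} };
        for (Mode m : Mode.values())
            System.out.println(m + ": " + distinctLocations(m, stores));
    }
}
```

For the stores a.f, a.g, and b.f, the field-independent and field-based keys each merge two of the three locations, while the field-sensitive key keeps all three apart.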

Java vs. C: Local variables
• Local variables/stack locations are reused
• Flow insensitivity causes many false aliases
• Local flow sensitivity is necessary

Our Contribution
• Andersen-style inclusion-based points-to analysis for Java, based on ideas from CLA
• Field sensitivity
  • Tracks separate fields of separate objects
• Uses “method summary graphs”
  • Sparse representation, uses local flow sensitivity
• Optimizations
  • Caching across iterations, reducing redundant ops
• Supports all features of Java

Algorithm Overview
• Intraprocedural: Generate a sparse, flow-insensitive summary graph for each method
  • Based on access paths, uses local flow sensitivity
• Interprocedural: Using summary graphs, build inclusion graph to obtain whole-program result

Method Summaries
• Sparse, flow-insensitive summary of the semantics of each method
  • Stores (writes) in method
  • Calls made by method and their parameters
  • Return values, thrown and caught exceptions
• Use a flow-sensitive technique to generate method summaries
  • Precisely model updates to stack and locals

Method Summary: Example
Code for method foo:
    static void foo(C x, C y) {
        C t = x.f;
        t.g = y;
        x.g = x;
        t.bar(y);
    }
Summary for method foo:
[Figure: summary graph with nodes x, x.f, and y; edges labeled f and g; and the call bar(t, y). Legend: read edge, write edge, parameter map edge.]

Node types
A node represents an object at run time.
• Concrete type nodes
  • Objects that have a known concrete type
  • new statements and constant objects
• Abstract nodes
  • Parameters, return values, dereferences
  • Interprocedural phase maps an abstract node to the set of concrete nodes it can represent

Edge types
• Read edge:
  • Created by load statements
  • Represent dereferences (access paths) of known locations
• Write edge:
  • Created by store statements
  • Represent references created by the method
[Figure: a read edge and a write edge, each labeled f.]

Outgoing parameter map
• Records which nodes are passed as which parameters
• This is used in the interprocedural phase to match call sites to call targets
[Figure: the summary graph for foo (nodes x, x.f, y; edges f and g) with parameter map edges for the call t.bar(y).]
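The pieces of a summary described so far (read edges, write edges, and the outgoing parameter map) can be pictured as a small data structure. The following is a sketch under assumptions: the class, its fields, and the string-based node names are hypothetical and are not the representation used in the paper or in joeq.

```java
import java.util.*;

// Hypothetical data structure for a method summary graph; the class and
// field names are illustrative and not taken from the paper or from joeq.
class MethodSummary {
    static class Edge {
        final String from, field, to;
        Edge(String from, String field, String to) {
            this.from = from; this.field = field; this.to = to;
        }
        @Override public String toString() { return from + " -" + field + "-> " + to; }
    }

    final List<Edge> readEdges = new ArrayList<>();   // created by loads
    final List<Edge> writeEdges = new ArrayList<>();  // created by stores
    // Call site -> nodes passed as parameter 0 (the receiver), 1, 2, ...
    final Map<String, List<String>> outgoingParams = new LinkedHashMap<>();

    // Summary of: static void foo(C x, C y) { C t = x.f; t.g = y; x.g = x; t.bar(y); }
    static MethodSummary forFoo() {
        MethodSummary s = new MethodSummary();
        s.readEdges.add(new Edge("x", "f", "x.f"));    // t = x.f (t names the node x.f)
        s.writeEdges.add(new Edge("x.f", "g", "y"));   // t.g = y
        s.writeEdges.add(new Edge("x", "g", "x"));     // x.g = x
        s.outgoingParams.put("bar", Arrays.asList("x.f", "y")); // t.bar(y)
        return s;
    }

    public static void main(String[] args) {
        MethodSummary s = forFoo();
        System.out.println("read:  " + s.readEdges);
        System.out.println("write: " + s.writeEdges);
        System.out.println("calls: " + s.outgoingParams);
    }
}
```

Note that the local t does not appear as a node: after the intraprocedural phase it is represented by the access path x.f.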

Generating method summary
• Worklist data flow solver (flow-sensitive)
• Strong updates on locals, weak on others
• Detect and close cycles in access paths
• More detail in the paper
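The difference between strong and weak updates can be seen on a reused local. The sketch below handles only straight-line assignments of the form "local = allocation site", which is a simplifying assumption: the solver described above is a worklist algorithm that also handles control flow. All names are hypothetical.

```java
import java.util.*;

// Hypothetical sketch: straight-line assignments "local = allocation site"
// processed with strong updates versus weak updates. This is not the
// paper's worklist solver, only an illustration of the two update kinds.
class LocalUpdates {
    // Each statement is {localName, allocationSite}.
    static Map<String, Set<String>> analyze(String[][] stmts, boolean strong) {
        Map<String, Set<String>> pts = new HashMap<>();
        for (String[] s : stmts) {
            Set<String> set = pts.computeIfAbsent(s[0], k -> new TreeSet<>());
            if (strong) set.clear();  // strong update: the old targets are overwritten
            set.add(s[1]);            // record the new target (a weak update keeps the old ones too)
        }
        return pts;
    }

    public static void main(String[] args) {
        // The local t is reused: t = new A(); ...; t = new B();
        String[][] stmts = { {"t", "A"}, {"t", "B"} };
        System.out.println("strong: " + analyze(stmts, true));
        System.out.println("weak:   " + analyze(stmts, false));
    }
}
```

With weak updates the reused local appears to point to both A and B, which is the kind of false alias that local flow sensitivity removes.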

Review: Andersen’s Points-to
• Points-to is encoded as inclusion relations
• x = y implies $x \supseteq y$
• $x \supseteq y$ is also written in arrow form, as an edge between x and y in the inclusion graph

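Review: Andersen’s Points-to
• Store. If code contains: x.f = e; Apply rule: $x \supseteq \mathit{new}_y \Rightarrow \mathit{new}_y.f \supseteq e$
• Load. If code contains: e = x.f; Apply rule: $x \supseteq \mathit{new}_y \Rightarrow e \supseteq \mathit{new}_y.f$
• Copy. If code contains: e1 = e2; Apply rule: $e_1 \supseteq e_2$
• Transitive closure. Apply rule: $e_1 \supseteq e_2,\ e_2 \supseteq e_3 \Rightarrow e_1 \supseteq e_3$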
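The four rules can be exercised with a deliberately naive solver that reapplies them until nothing changes. This is a sketch for illustration only: it materializes full points-to sets, which is the cost the techniques in this talk avoid, and the class, the string encoding of statements, and the "site.field" key format are assumptions.

```java
import java.util.*;

// Hypothetical naive solver for the store, load, and copy rules. It reapplies
// every rule until a fixed point is reached. Names are illustrative.
class NaiveAndersen {
    // pts.get(v) is the set of allocation sites that v may point to;
    // heap locations are keyed as "site.field".
    final Map<String, Set<String>> pts = new TreeMap<>();
    final List<String[]> stmts = new ArrayList<>();

    Set<String> pt(String v) { return pts.computeIfAbsent(v, k -> new TreeSet<>()); }

    void alloc(String x, String site)        { stmts.add(new String[] {"new", x, site}); }
    void copy(String x, String y)            { stmts.add(new String[] {"copy", x, y}); }
    void load(String e, String x, String f)  { stmts.add(new String[] {"load", e, x, f}); }
    void store(String x, String f, String e) { stmts.add(new String[] {"store", x, f, e}); }

    void solve() {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] s : stmts) {
                switch (s[0]) {
                    case "new":   // x = new site
                        changed |= pt(s[1]).add(s[2]);
                        break;
                    case "copy":  // x = y: x includes y
                        changed |= pt(s[1]).addAll(pt(s[2]));
                        break;
                    case "load":  // e = x.f: for each site o of x, e includes o.f
                        for (String o : new ArrayList<>(pt(s[2])))
                            changed |= pt(s[1]).addAll(pt(o + "." + s[3]));
                        break;
                    case "store": // x.f = e: for each site o of x, o.f includes e
                        for (String o : new ArrayList<>(pt(s[1])))
                            changed |= pt(o + "." + s[2]).addAll(pt(s[3]));
                        break;
                }
            }
        }
    }

    public static void main(String[] args) {
        NaiveAndersen a = new NaiveAndersen();
        a.alloc("x", "C"); a.alloc("d", "D"); a.alloc("y", "E");
        a.store("x", "f", "d");  // x.f = d
        a.load("t", "x", "f");   // t = x.f
        a.store("t", "g", "y");  // t.g = y
        a.store("x", "g", "x");  // x.g = x
        a.solve();
        System.out.println(a.pts);
    }
}
```

Transitive closure is implicit here: copying whole sets until a fixed point is reached has the same effect as chaining inclusions.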

Andersen example
    t = x.f;
    t.g = y;
    x.g = x;
[Figure, built up over six slides: the summary graph (nodes x, x.f, y; edges f and g) is joined by concrete nodes C, D, and E and an f edge. The Load rule (e = x.f; $x \supseteq \mathit{new}_y \Rightarrow e \supseteq \mathit{new}_y.f$) is applied first, then the Store rule (x.f = e; $x \supseteq \mathit{new}_y \Rightarrow \mathit{new}_y.f \supseteq e$) over three steps, each adding a g edge.]

Mapping method calls
    t = x.f;
    t.g = y;
    x.g = x;
    t.bar(y);
[Figure, two slides: the graph from the Andersen example with the call t.bar(y); its arguments are mapped to the callee’s parameter nodes Bar:this and Bar:p1.]

Overall Picture
[Figure: an “Abstract” world and a “Concrete” world, with nodes labeled F, E, C, and D.]

Graph-based Andersen
• Computing full transitive closure is prohibitively expensive
• Store the graph in pre-transitive form, and calculate reachable nodes on demand
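A pre-transitive graph stores only the direct inclusion edges and answers "which concrete nodes can this expression refer to?" with a graph search. The sketch below assumes string node names and a plain depth-first search; the class and method names other than getConcreteNodes are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of a pre-transitive inclusion graph: only direct edges
// are stored, and reachable concrete nodes are found by a search on demand.
class InclusionGraph {
    // includes.get(a) holds every b such that a includes b.
    final Map<String, Set<String>> includes = new HashMap<>();
    final Set<String> concrete = new HashSet<>();

    void addConcrete(String n) { concrete.add(n); }

    void addInclusion(String a, String b) {
        includes.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
    }

    // Depth-first search from e, collecting the concrete nodes reached.
    Set<String> getConcreteNodes(String e) {
        Set<String> result = new TreeSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(e);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (!seen.add(n)) continue;  // already visited, also guards against cycles
            if (concrete.contains(n)) result.add(n);
            for (String m : includes.getOrDefault(n, Collections.<String>emptySet()))
                stack.push(m);
        }
        return result;
    }

    public static void main(String[] args) {
        InclusionGraph g = new InclusionGraph();
        g.addConcrete("C"); g.addConcrete("D");
        g.addInclusion("x", "C");
        g.addInclusion("t", "x");
        g.addInclusion("t", "D");
        g.addInclusion("x", "t");  // a cycle between x and t
        System.out.println(g.getConcreteNodes("t"));
    }
}
```

Adding an edge is a constant-time set insertion; the cost moves to the queries, which is why the slides that follow concentrate on making repeated queries cheap.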

Algorithm
    foreach write edge e1 → e2 (labeled f) do
        foreach n in getConcreteNodes(e1)
            add write edge n.f → e2
    foreach read edge e1 → e2 (labeled f) do
        foreach n in getConcreteNodes(e1)
            add inclusion edge $e_2 \supseteq n.f$
    foreach method call e1.f() do
        foreach n in getConcreteNodes(e1)
            add parameter mappings for target method
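The first two loops of the algorithm can be sketched as follows. Several things here are assumptions: the write edge n.f → e2 is modeled as the inclusion "n.f includes e2" (matching the Store rule), edges are encoded as {e1, field, e2} string triples, the method-call loop is left out, and the whole thing is repeated until no edge is added.

```java
import java.util.*;

// Hypothetical sketch of the write-edge and read-edge loops, repeated until
// no new inclusion edge appears. Method calls are omitted for brevity.
class InterproceduralPass {
    final Map<String, Set<String>> includes = new HashMap<>(); // a -> every b that a includes
    final Set<String> concrete = new HashSet<>();

    boolean include(String a, String b) {
        return includes.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
    }

    Set<String> getConcreteNodes(String e) {
        Set<String> result = new TreeSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(e);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (!seen.add(n)) continue;
            if (concrete.contains(n)) result.add(n);
            for (String m : includes.getOrDefault(n, Collections.<String>emptySet()))
                stack.push(m);
        }
        return result;
    }

    // One pass over all edges; each edge is {e1, field, e2}. Returns true if anything was added.
    boolean iterate(List<String[]> writeEdges, List<String[]> readEdges) {
        boolean changed = false;
        for (String[] w : writeEdges)
            for (String n : getConcreteNodes(w[0]))
                changed |= include(n + "." + w[1], w[2]);  // n.f includes e2
        for (String[] r : readEdges)
            for (String n : getConcreteNodes(r[0]))
                changed |= include(r[2], n + "." + r[1]);  // e2 includes n.f
        return changed;
    }

    public static void main(String[] args) {
        InterproceduralPass p = new InterproceduralPass();
        p.concrete.addAll(Arrays.asList("C", "D", "E"));
        p.include("x", "C"); p.include("y", "E"); p.include("C.f", "D");
        List<String[]> reads = new ArrayList<>();
        reads.add(new String[] {"x", "f", "x.f"});    // t = x.f
        List<String[]> writes = new ArrayList<>();
        writes.add(new String[] {"x.f", "g", "y"});   // t.g = y
        writes.add(new String[] {"x", "g", "x"});     // x.g = x
        while (p.iterate(writes, reads)) { /* repeat until no edge is added */ }
        System.out.println("D.g -> " + p.getConcreteNodes("D.g"));
        System.out.println("C.g -> " + p.getConcreteNodes("C.g"));
    }
}
```

The write edge from x.f only takes effect on the second pass, once the read edge has connected x.f to C.f, which is why the passes have to be repeated.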

Caching reachability queries
• getConcreteNodes(e): transitive closure query on the inclusion graph
• The same queries are repeated many times
• Store the result in a hash table
• Cached result may be stale due to edges added since the last query
• Iterate until convergence
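A minimal sketch of the cached query follows. It assumes the simplest possible policy, discarding the whole cache at the start of each iteration; that policy, the miss counter, and the method name startIteration are illustrative assumptions rather than details from the talk.

```java
import java.util.*;

// Hypothetical sketch: getConcreteNodes results are memoized in a hash table.
// Edges added after a query do not invalidate the cached answer, so a result
// can be stale until the next iteration recomputes it.
class CachedReachability {
    final Map<String, Set<String>> includes = new HashMap<>(); // a -> every b that a includes
    final Set<String> concrete = new HashSet<>();
    final Map<String, Set<String>> cache = new HashMap<>();
    int misses = 0;  // number of queries that actually searched the graph

    void addInclusion(String a, String b) {
        includes.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
    }

    Set<String> getConcreteNodes(String e) {
        Set<String> cached = cache.get(e);
        if (cached != null) return cached;  // possibly stale
        misses++;
        Set<String> result = new TreeSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(e);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (!seen.add(n)) continue;
            if (concrete.contains(n)) result.add(n);
            for (String m : includes.getOrDefault(n, Collections.<String>emptySet()))
                stack.push(m);
        }
        cache.put(e, result);
        return result;
    }

    // Assumed policy: drop all cached answers when a new iteration begins.
    void startIteration() { cache.clear(); }

    public static void main(String[] args) {
        CachedReachability g = new CachedReachability();
        g.concrete.add("C"); g.concrete.add("D");
        g.addInclusion("x", "C");
        System.out.println(g.getConcreteNodes("x"));
        g.addInclusion("x", "D");                     // added after the query
        System.out.println(g.getConcreteNodes("x"));  // served from the cache
        g.startIteration();
        System.out.println(g.getConcreteNodes("x"));  // recomputed
    }
}
```

Because a stale answer only ever misses nodes and never invents them, repeating the outer loop until no edge is added still reaches the full result.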

Online cycle detection
• Inclusion graph includes cycles
• The algorithm collapses cycles as they are traversed
• During traversal, keeps track of current path
• If a node on current path is revisited, collapse all nodes in cycle
• Each node has a “skip” pointer, which is set when collapsed and followed on all accesses
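The bullets above can be sketched as a depth-first traversal that keeps the current path and folds a cycle into one node as soon as the path is re-entered. Everything concrete here is an assumption: the Node class, the choice of the revisited node as the representative, and the way successor sets are merged.

```java
import java.util.*;

// Hypothetical sketch of collapsing cycles during traversal with skip pointers.
class CycleCollapse {
    static class Node {
        final String name;
        final Set<Node> succ = new LinkedHashSet<>();
        Node skip;  // set when this node has been merged into another node
        Node(String name) { this.name = name; }
    }

    // Follows skip pointers to the node that currently represents n.
    static Node resolve(Node n) {
        while (n.skip != null) n = n.skip;
        return n;
    }

    // Visits everything reachable from start, collapsing cycles it walks into.
    static void traverse(Node start) {
        visit(resolve(start), new ArrayList<Node>(), new HashSet<Node>());
    }

    private static void visit(Node n, List<Node> path, Set<Node> done) {
        path.add(n);
        for (Node raw : new ArrayList<>(n.succ)) {
            Node m = resolve(raw);
            if (m == resolve(n)) continue;  // edge inside an already collapsed cycle
            int i = 0;
            while (i < path.size() && resolve(path.get(i)) != m) i++;
            if (i < path.size()) {
                // m is on the current path, so the path from m onward is a cycle: fold it into m.
                for (Node p : path.subList(i, path.size())) {
                    Node rp = resolve(p);
                    if (rp != m) { m.succ.addAll(rp.succ); rp.skip = m; }
                }
            } else if (!done.contains(m)) {
                visit(m, path, done);
            }
        }
        path.remove(path.size() - 1);
        done.add(resolve(n));
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.succ.add(b); b.succ.add(c); c.succ.add(a); c.succ.add(d);
        traverse(a);
        System.out.println(resolve(b).name + " " + resolve(c).name + " " + resolve(d).name);
    }
}
```

After the traversal every access to b or c goes through resolve and lands on a, so later queries treat the three nodes as one.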

Reusing caches
• Concrete node cache values don’t change much between algorithm iterations
• Reallocating and rebuilding them is expensive
• Reuse caches from old iterations
• Keep track of an iteration ‘version’ number for each cache entry

Minimizing set union operations
• Many caches don’t change across iterations
• Avoid set union operations for caches that haven’t changed since the last iteration
• Keep a ‘changed’ flag for each cache entry, recording whether the last computation changed the entry
• If input set hasn’t changed, set union operation is redundant
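The version number and the ‘changed’ tracking can be combined in one cache entry, as in the sketch below. It departs from the slides in one labeled respect: instead of a boolean flag it stores the iteration in which the entry last changed, an assumption that keeps skipping safe when a node is not queried in every iteration. The class, the union counter, and the exact skip condition are hypothetical.

```java
import java.util.*;

// Hypothetical sketch: cache entries survive across iterations, carry a
// version number, and remember when they last changed so that redundant
// set unions can be skipped.
class VersionedCache {
    static class Entry {
        final Set<String> nodes = new TreeSet<>();
        final Set<String> merged = new HashSet<>();  // successors unioned at least once
        int version = -1;      // iteration in which this entry was last recomputed
        int lastChanged = -1;  // iteration in which 'nodes' last grew
    }

    final Map<String, Set<String>> includes = new HashMap<>(); // a -> every b that a includes
    final Set<String> concrete = new HashSet<>();
    final Map<String, Entry> cache = new HashMap<>();
    int iteration = 0;
    int unions = 0;  // counts set-union operations actually performed

    void addInclusion(String a, String b) {
        includes.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
    }

    Entry query(String n) {
        Entry e = cache.computeIfAbsent(n, k -> new Entry());
        if (e.version == iteration) return e;  // already recomputed in this iteration
        int previous = e.version;
        e.version = iteration;                 // set before recursing so cycles terminate
        boolean changed = concrete.contains(n) && e.nodes.add(n);
        for (String s : includes.getOrDefault(n, Collections.<String>emptySet())) {
            Entry se = query(s);
            // Union only if s is new to this entry or has changed since this entry
            // was last recomputed (>= so that changes made late in that same
            // iteration, as can happen in cycles, are still picked up).
            if (e.merged.add(s) || se.lastChanged >= previous) {
                unions++;
                changed |= e.nodes.addAll(se.nodes);
            }
        }
        if (changed) e.lastChanged = iteration;
        return e;
    }

    public static void main(String[] args) {
        VersionedCache c = new VersionedCache();
        c.concrete.add("C");
        c.addInclusion("x", "y");
        c.addInclusion("y", "C");
        for (c.iteration = 0; c.iteration < 3; c.iteration++) {
            c.query("x");
            System.out.println("iteration " + c.iteration + ": unions so far = " + c.unions);
        }
    }
}
```

Once the entries have stopped changing, a further iteration walks the graph but performs no unions; adding a new edge makes only the affected entries pay for unions again.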

Experimental Results
• Concrete type inference
• Static call graph
• Implemented in ~800 lines of Java
• Freely available at: http://joeq.sourceforge.net

Programs
• SpecJVM
  • Standard benchmark suite
• J2EE – Java 2 Enterprise Edition v1.3
  • Massive (1+ million lines) business framework
• joeq
  • Compiler infrastructure, 75K lines
• Cloudscape
  • Database shipped with J2EE, no source code
• JEdit
  • Full-featured editor, 100K lines

Experimental Results
• We analyzed the reachable code for each application
• Results include code in class library
• Analysis was very effective in reducing total program size
• Pentium 4 2GHz, 2GB RAM, Redhat 7.2
• Sun JDK 1.3.1_01 with 512MB heap

[Chart: Analysis Precision vs. RTA]

[Chart: Analysis time: Small benchmarks]

[Chart: Analysis time: Large benchmarks]

[Chart: Analysis time (speedup)]

[Chart: Analysis time (bytecodes/second)]

Related Work
• Original CLA paper
  • Heintze and Tardieu (PLDI 2001)
• Andersen’s analysis for Java
  • Rountev, Milanova, Ryder (OOPSLA 2001)
  • Liang, Pennings, Harrold (PASTE 2001)
  • Many others…
• Concrete type inference
  • CHA, RTA
  • Flow and context sensitivity, 0-CFA

Conclusion
• Improved precision
  • Field sensitivity
  • Local flow sensitivity
• Improved efficiency
  • Reuse reachability cache across iterations
  • Minimize set-union operations
• Scales to the largest Java programs
• A new baseline for Java pointers
  • No reason to use a less precise analysis
