
Adaptive Optimization with On-Stack Replacement



  1. Adaptive Optimization with On-Stack Replacement Stephen J. Fink IBM T.J. Watson Research Center Feng Qian (presenter) Sable Research Group, McGill University http://www.sable.mcgill.ca

  2. Motivation • Modern VMs use adaptive recompilation strategies • The VM replaces an entry in the dispatching table with newly compiled code • Switching to the new code can only happen at the next invocation • On-stack replacement (OSR) allows the transformation to happen in the middle of a method's execution

  3. What is On-Stack Replacement? • Transfer execution from compiled code m1 to compiled code m2, even while m1 runs on some thread's stack [Diagram: a thread's stack before and after OSR; m1's frame and PC are replaced by m2's frame and PC]

  4. Why On-Stack Replacement (OSR)? • Debugging optimized code via dynamic de-optimization [SELF-93] • Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001] • Promotion of long-running activations [SELF-93] • Safe invalidation for speculative optimization [HotSpot, SELF-91]

  5. Related Work • Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94] • HotSpot server compiler [JVM'01] • Partial method compilation [OOPSLA'01]

  6. OSR Challenges • Engineering Complexity • How to minimize disruption to VM code base? • How to constrain optimizations? • Policies for applying OSR • How to make rational decisions for applying OSR? • Effectiveness • How does OSR improve/constrain dataflow optimizations? • How effective are online OSR-based optimizations?

  7. Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion

  8. OSR Mechanism Overview • 1. Extract compiler-independent state from a suspended activation of m1 • 2. Generate specialized code m2 for the suspended activation • 3. Compile and transfer execution to the new code m2 (a sketch of the whole flow follows below) [Diagram: the suspended stack with m1's frame and PC passes through steps 1-3, via the compiler-independent state, to a stack with m2's frame and PC]
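
A minimal sketch of how a VM might tie these three steps together, assuming the mechanism described on the following slides; every type and helper name here is a hypothetical stand-in, not Jikes RVM's actual API:

  // Hypothetical driver for the three OSR steps; all names are stand-ins.
  final class OsrDriver {
      interface Activation {}       // a suspended activation of m1
      interface ScopeDescriptor {}  // compiler-independent state (next slide)
      interface CompiledMethod {}   // machine code produced by the compiler

      ScopeDescriptor extract(Activation a)          { throw new UnsupportedOperationException(); }
      byte[] specialize(ScopeDescriptor d)           { throw new UnsupportedOperationException(); }
      CompiledMethod compile(byte[] bytecode)        { throw new UnsupportedOperationException(); }
      void transfer(Activation a, CompiledMethod m2) { throw new UnsupportedOperationException(); }

      void onStackReplace(Activation suspended) {
          ScopeDescriptor d = extract(suspended); // step 1: extract compiler-independent state
          byte[] m2 = specialize(d);              // step 2: generate specialized bytecode
          transfer(suspended, compile(m2));       // step 3: compile, unwind m1, resume in m2
      }
  }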

  9. JVM Scope Descriptor • Compiler-independent state of a running activation • Based on the Java Virtual Machine architecture • Five components (a class-shaped sketch follows below): • Thread running the activation • Reference to the activation's stack frame • Program counter (as a bytecode index) • Value of each local variable • Value of each stack location
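
As a sketch, the five components could be recorded in a class like the one below; the field types (boxed Object values for locals and stack slots) are assumptions for illustration, not Jikes RVM's actual representation:

  import java.util.List;

  // Sketch of a JVM scope descriptor holding the five components above.
  final class JvmScopeDescriptor {
      final Thread thread;        // thread running the activation
      final long framePointer;    // reference to the activation's stack frame
      final int bytecodeIndex;    // program counter, as a bytecode index
      final List<Object> locals;  // value of each local variable
      final List<Object> stack;   // value of each operand-stack slot

      JvmScopeDescriptor(Thread thread, long framePointer, int bytecodeIndex,
                         List<Object> locals, List<Object> stack) {
          this.thread = thread;
          this.framePointer = framePointer;
          this.bytecodeIndex = bytecodeIndex;
          this.locals = locals;
          this.stack = stack;
      }
  }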

  10. JVM Scope Descriptor Example

  Source:
    class C {
      static int sum(int c) {
        int y = 0;
        for (int i = 0; i < c; i++) {
          y += i;
        }
        return y;
      }
    }

  Bytecode:
     0 iconst_0
     1 istore_1
     2 iconst_0
     3 istore_2
     4 goto 14
     7 iload_1
     8 iload_2
     9 iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

  JVM Scope Descriptor (suspended after 50 loop iterations, i = 50):
    Running thread: MainThread
    Frame pointer: 0xSomeAddress
    Program counter: 16
    Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack expressions: S0 = 50; S1 = 100

  11. Extracting the JVM Scope Descriptor • Trivial from the interpreter • Optimizing compiler: • Insert OSR Point (safe-point) instructions in the initial IR • An OSR Point uses the stack and local state needed to recover the scope descriptor • An OSR Point is treated as a call and transfers control to the exit block • Aggregate OSR points into an OSR map when generating machine instructions [Diagram: step 1 extracts the compiler-independent state from m1's frame and PC]

  12. Specialized Code Generation • Prepend a specialized prologue to the original bytecode • The prologue will: • Save JVM Scope Descriptor values into local variables • Push JVM Scope Descriptor values onto the stack • Jump to the desired program counter [Diagram: step 2 turns the compiler-independent state into specialized code m2]

  13. Transition Example

  JVM Scope Descriptor:
    Running thread: MainThread
    Frame pointer: 0xSomeAddress
    Program counter: 16
    Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack expressions: S0 = 50; S1 = 100

  Original bytecode:
     0 iconst_0
     1 istore_1
     2 iconst_0
     3 istore_2
     4 goto 14
     7 iload_1
     8 iload_2
     9 iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

  Specialized bytecode (prologue prepended to the original):
    ldc 100
    istore_0
    ldc 1225
    istore_1
    ldc 50
    istore_2
    ldc 50
    ldc 100
    goto 16
     0 iconst_0
    ...
    16 if_icmplt 7
    ...
    20 ireturn
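
The prologue above follows mechanically from the descriptor: load and store each local, push each stack expression, then jump to the suspended program counter. A small integer-only sketch of that derivation (class and method names invented for illustration):

  import java.util.ArrayList;
  import java.util.List;

  // Emits the textual specialized prologue of the example above from the
  // descriptor's values. Integer-only; a real VM handles all JVM types.
  final class PrologueSketch {
      static List<String> prologue(int[] locals, int[] stackSlots, int targetBci) {
          List<String> out = new ArrayList<>();
          for (int i = 0; i < locals.length; i++) {  // save locals
              out.add("ldc " + locals[i]);
              out.add("istore_" + i);
          }
          for (int v : stackSlots) {                 // push stack expressions
              out.add("ldc " + v);
          }
          out.add("goto " + targetBci);              // resume at the suspended PC
          return out;
      }

      public static void main(String[] args) {
          // L0(c)=100, L1(y)=1225, L2(i)=50; S0=50, S1=100; PC=16.
          prologue(new int[] {100, 1225, 50}, new int[] {50, 100}, 16)
              .forEach(System.out::println);
      }
  }

Running main prints exactly the nine prologue instructions shown above.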

  14. Transfer Execution to the New Code • Compile m2 as a normal method • The system unwinds the stack frame of m1 • Reschedule the thread to execute m2 • By construction, executing the specialized m2 sets up the target stack frame and continues execution [Diagram: step 3 replaces m1's frame and PC on the stack with m2's]

  15. Recovering from Inlining • Suppose the optimizer inlines A -> B -> C • Extract one JVM Scope Descriptor per inlined scope (A, B, and C) [Diagram: the single frame for the inlined method A is replaced by three specialized frames A', B', and C', with the PC resuming in C']

  16. Inlining Example

  Original methods:
    void foo() {
      bar();
      A: ...
    }
    void bar() {
      ...
      B: ...
    }

  Suspend at B: in the inlined A -> B code. Wipe the stack down to caller C and call foo_prime:
    foo_prime() {
      <specialized foo prologue>
      call bar_prime();
      goto A;
      ...
      bar();
      A: ...
    }
    bar_prime() {
      <specialized bar prologue>
      goto B;
      ...
      B: ...
    }

  [Diagram: caller C's frame, then specialized frames foo' and bar', with the PC in bar']

  17. Implementation Details • Target compiler unmodified, except for: • New pseudo-bytecodes • Load literals (to avoid inserting new constants into the constant pool) • Load an address/bytecode index: JSR return address on the stack • Fix bytecode indices for GC maps, exception tables, and line number tables

  18. Pros and Cons • Advantages • Mostly compiler-independent • Avoids multiple entry points in compiled code • Target compiler can exploit run-time constants • Disadvantage • Must compile the target method twice (once for the transition, once for the next invocation)

  19. Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion

  20. Two OSR Applications • Promotion (see the paper for details) • Recompile a long-running activation • Deferred compilation • Don't compile uncommon paths • Saves compile time

  Deferred compilation example (the uncommon call becomes an OSR trap):
    if (foo is currently final)
      x = 1;
    else
      trap/OSR;   // instead of x = foo();
    return x;

  21. Deferred Compilation • What's "infrequent"? • Static heuristics • Profile data • The adaptive recompilation decision is modified to consider OSR factors (Presenter's note, Feng Qian: class initialization is invoked by the class loader; when do we need OSR for it?)

  22. Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion

  23. Online Experiments • Eager: (the default) no deferred compilation • OSR/static: deferred compilation for CHA-based inlining only • OSR/edge counts: deferred compilation with online profile data and CHA-based inlining

  24. Adaptive System Performance [performance graph; an arrow marks the "better" direction]

  25. Adaptive System Performance [performance graph; an arrow marks the "better" direction]

  26. OSR Activities (SPECjvm98, size 100, first run)

    Benchmark   Promotions  Invalidations
    compress         3           6
    jess             0           0
    db               0           1
    javac            0          10
    mpegaudio        0           1
    mtrt             0           5
    jack             0           1
    total            3          24

  27. Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion

  28. Summary • A new on-stack replacement mechanism • Online profile-directed deferred compilation • Evaluation of OSR applications in Jikes RVM

  29. Conclusion • Should a VM implement OSR? • Can be done with minimal intrusion into the code base • Modest gains from deferred compilation • No benefit for class-hierarchy-based inlining • Debugging with dynamic de-optimization is valuable • TODO: more advanced speculative optimizations The implementation is publicly available in Jikes RVM under the CPL, for Linux/x86, Linux/PPC, and AIX/PPC: http://www-124.ibm.com/developerworks/oss/jikesrvm/

  30. Backup Slides

  31. Compile Rate (Offline Profile) [graph]

  32. Compile Rate (Offline Profile) [graph]

  33. Machine Code Size (Offline Profile) [graph]

  34. Machine Code Size (Offline Profile) [graph]

  35. Code Quality (Offline Profile) [graph]

  36. Code Quality (Offline Profile) [graph; an arrow marks the "better" direction]

  37. Jikes RVM Analytic Recompilation Model • Define • cur, the current optimization level for method m • Tj, the expected future execution time at level j • Cj, the compilation cost at opt level j • Choose the j > cur that minimizes Tj + Cj • If Tj + Cj < Tcur, recompile at level j (a sketch of this decision follows below) • Assumptions • The method will execute for twice its current duration • Compilation cost and speedup are based on offline averages • Sample data determines how long a method has executed
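
As a minimal sketch of the decision above (names hypothetical, arrays indexed by optimization level):

  // Pick the j > cur that minimizes Tj + Cj; recompile only if it beats Tcur.
  final class RecompilationModel {
      /** Returns the level to recompile at, or -1 to keep running at cur. */
      static int chooseLevel(int cur, double tCur, double[] t, double[] c) {
          int best = -1;
          double bestCost = tCur;             // cost of staying at level cur
          for (int j = cur + 1; j < t.length; j++) {
              double cost = t[j] + c[j];      // expected future time + compile cost
              if (cost < bestCost) {
                  bestCost = cost;
                  best = j;
              }
          }
          return best;
      }
  }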

  38. Jikes RVM OSR Promotion Model • Given: an outdated activation A of method m • Define • L, the last optimization level for any compiled version of m • cur, the current optimization level for activation A • Tcur, the expected future execution time of A at level cur • CL, the compilation cost for method m at opt level L • TL, the expected future execution time of A at level L • If TL + CL < Tcur, specialize A at level L (sketched below) • Assumption • The outdated activation will execute for twice its current duration
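
The promotion test is the same cost-benefit comparison applied to a single activation; a one-method sketch (names hypothetical):

  // Specialize the outdated activation A at level L only if the expected
  // speedup pays for the compilation cost.
  final class PromotionModel {
      static boolean shouldPromote(double tCur, double tL, double cL) {
          return tL + cL < tCur;
      }
  }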

  39. Jikes RVM Recompilation Model, with Profile-Driven Deferred Compilation • Define • cur, the current optimization level for method m • Tj, the expected future execution time at level j • Cj, the compilation cost at opt level j • P, the percentage of code in m that profile data indicates was reached • Choose the j > cur that minimizes Tj + P*Cj • If Tj + P*Cj < Tcur, recompile at level j (sketched below) • Assumptions • The method will execute for twice its current duration • Compilation cost and speedup are based on offline averages • Sample data determines how long a method has executed
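
Relative to the base model's sketch, the only change is weighting the compile cost by P (again a sketch with hypothetical names):

  // Deferred compilation skips the unreached fraction (1 - P) of the method,
  // so the compile cost is scaled by P.
  final class DeferredRecompilationModel {
      static int chooseLevel(int cur, double tCur, double[] t, double[] c, double p) {
          int best = -1;
          double bestCost = tCur;
          for (int j = cur + 1; j < t.length; j++) {
              double cost = t[j] + p * c[j];  // P-weighted compile cost
              if (cost < bestCost) {
                  bestCost = cost;
                  best = j;
              }
          }
          return best;
      }
  }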

  40. Offline Profile experiments • Collect "perfect" profile data offline • Mark any block never reached as "uncommon" • Defer compilation of "uncommon" blocks • Four configurations • Ideal: deferred compilation trap keeps no state live • Ideal-OSR: deferred compilation trap is valid OSR point • Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is valid OSR point • Eager: (default) no deferred compilation

  41. Compile Rate (Offline Profile) [graph]

  42. Machine Code Size (Offline Profile) [graph]

  43. Code Quality (Offline Profile) [graph]

  44. OSR Challenges • Engineering Complexity • How to minimize disruption to VM code base? • How to constrain optimizations? • Policies for applying OSR • How to make rational decisions for applying OSR? • Effectiveness • How does OSR improve/constrain dataflow optimizations? • How effective are online OSR-based optimizations?

  45. Recompilation Activities (first run)

                 With OSR                Without OSR
    Benchmark    O0   O1   O2  total     O0   O1   O2  total
    compress      17    7    2     26     13    9    6     28
    jess          49   20    1     70     39   17    4     60
    db             8    4    2     14      8    4    5     17
    javac        171   19    2    192    168   16    3    187
    mpegaudio     68   32    7    107     66   29    6    101
    mtrt          57   14    3     74     61   11    3     75
    jack          59   25    8     92     54   26    5     85
    total        429  121   25    575    409  112   32    553

  46. Summary of Study (1) • Engineering Complexity • How to minimize disruption to VM code base? • Compiler-independent specialized source code to manage transition transparently • How to constrain optimizations? • Model OSR Points like CALLS in standard transformations • Policies for applying OSR • How to make rational decisions for applying OSR? • Simple modifications to cost-benefit analytic model

  47. Summary of Study (2) • Effectiveness • (for an implementation of online profile-directed deferred compilation) • How does OSR improve/constrain dataflow optimizations? • small ideal benefit from dataflow merges (0.5 - 2.2%) • negligible benefit when constraining optimization for potential invalidation • negligible benefit for just CHA-based inlining • patch points + splitting + pre-existence good enough • How effective are online OSR-based optimizations? • average performance improvement of 2.6% on first run SPECjvm98 s=100 • individual benchmarks range from +8% to -4% • negligible impact on steady state performance (best of 10 iterations) • adaptive recompilation model relatively insensitive, compiles 4% more methods

  48. Experimental Details • SPECjvm98, size 100 • Jikes RVM 2.1.1 • FastAdaptiveSemispace configuration • one virtual processor • 500MB heap • separate VM instance for each benchmark • IBM RS/6000 Model F80 • six 500 MHz PowerPC 630's • AIX 4.3.3 • 4 GB memory

  49. Specialized Code Generation • Generate specialized m2 that sets up the new stack frame and continues execution, preserving semantics • Express the transition to the new stack frame in source code (bytecode) [Diagram: step 2 turns the compiler-independent state into specialized code m2]

  50. Deferred Compilation • Don't compile "infrequent" blocks

  Without deferred compilation:
    if (foo is currently final)
      x = 1;
    else
      x = foo();
    return x;

  With deferred compilation (the uncommon path becomes an OSR trap):
    if (foo is currently final)
      x = 1;
    else
      trap/OSR;
    return x;
