Adaptive Optimization with On-Stack Replacement

Stephen J. Fink

IBM T.J. Watson Research Center

Feng Qian (presenter)

Sable Research Group, McGill University

http://www.sable.mcgill.ca


Motivation

  • Modern VMs use adaptive recompilation strategies

  • The VM replaces an entry in the dispatch table with newly compiled code

  • Switching to the new code can only happen at the next invocation

  • On-stack replacement (OSR) allows the transformation to happen in the middle of a method's execution


[Figure: a thread's stack before and after OSR: the frame and PC for m1 are replaced by a frame and PC for m2]

What is On-stack Replacement?

  • Transfer execution from compiled code m1 to compiled code m2 even while m1 runs on some thread’s stack


Why On-Stack Replacement (OSR)?

  • Debugging optimized code via dynamic de-optimization [SELF-93]

  • Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001]

  • Promotion of long-running activations [SELF-93]

  • Safe invalidation for speculative optimization [HotSpot, SELF-91]


Related Work

  • Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA’91, PLDI’92, OOPSLA’94]

  • HotSpot server compiler [JVM’01]

  • Partial method compilation [OOPSLA’01]


OSR Challenges

  • Engineering Complexity

    • How to minimize disruption to VM code base?

    • How to constrain optimizations?

  • Policies for applying OSR

    • How to make rational decisions for applying OSR?

  • Effectiveness

    • How does OSR improve/constrain dataflow optimizations?

    • How effective are online OSR-based optimizations?


Outline

  • Motivation

  • OSR Mechanism

  • Applications

  • Experimental Results

  • Conclusion


OSR Mechanism Overview

  • Extract compiler-independent state from a suspended activation for m1

  • Generate specialized code m2 for the suspended activation

  • Compile and transfer execution to the new code m2

[Figure: the three steps: (1) the compiler-independent state is extracted from m1's suspended frame, (2) specialized code m2 is generated from that state, (3) m1's frame is replaced by m2's and execution resumes in m2]


JVM Scope Descriptor

  • Compiler-independent state of a running activation

  • Based on Java Virtual Machine Architecture

  • Five components:

    • Thread running the activation

    • Reference to the activation's stack frame

    • Program Counter (as a bytecode index)

    • Value of each local variable

    • Value of each stack location
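
To make the five components concrete, here is a minimal Java sketch of such a descriptor; the class and field names (and the flat int[] value arrays) are illustrative assumptions, not Jikes RVM's actual data structures.

    // Sketch only: a compiler-independent snapshot of one suspended
    // activation, mirroring the five components listed above.
    class JvmScopeDescriptor {
        Thread thread;       // thread running the activation
        long framePointer;   // reference to the activation's stack frame
        int bytecodeIndex;   // program counter, as a bytecode index
        int[] locals;        // value of each local variable slot
        int[] stackSlots;    // value of each operand-stack slot
    }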


JVM Scope Descriptor Example

Source:

    class C {
      static int sum(int c) {
        int y = 0;
        for (int i = 0; i < c; i++) {
          y += i;
        }
        return y;
      }
    }

Bytecode:

    0  iconst_0
    1  istore_1
    2  iconst_0
    3  istore_2
    4  goto 14
    7  iload_1
    8  iload_2
    9  iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

JVM Scope Descriptor (suspended after 50 loop iterations, i = 50):

    Running thread: MainThread
    Frame pointer: 0xSomeAddress
    Program counter: 16
    Local variables:
      L0 (c) = 100
      L1 (y) = 1225
      L2 (i) = 50
    Stack expressions:
      S0 = 50
      S1 = 100


Extracting JVM Scope Descriptor

  • Trivial from the interpreter

  • Optimizing compiler:

    • Insert OSR Point (safe-point) instructions in the initial IR

    • An OSR Point uses the stack and local state needed to recover the scope descriptor

    • An OSR Point is treated as a call and transfers control to the exit block

    • Aggregate OSR Points into an OSR map when generating machine instructions (see the sketch below)
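
As a rough illustration of what that aggregated OSR map might record per OSR Point, here is a hedged sketch; the class name, fields, and layout are assumptions, not the actual Jikes RVM encoding.

    // Sketch only: one OSR map entry holds enough information to rebuild
    // the JVM Scope Descriptor when execution is suspended at this point.
    class OsrMapEntry {
        int machineCodeOffset;  // where the OSR Point lies in machine code
        int bytecodeIndex;      // program counter to resume at
        int[] localLocations;   // register or spill slot holding each local
        int[] stackLocations;   // location holding each operand-stack value
    }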

[Figure: step 1: the compiler-independent state is extracted from m1's frame and PC]


Specialized Code Generation

  • Prepend a specialized prologue to the original bytecode

  • Prologue will

    • Save JVM Scope Descriptor values into local variables

    • Push JVM Scope Descriptor values onto the stack

    • Jump to the desired program counter

[Figure: step 2: specialized code m2 is generated from the compiler-independent state]


Transition Example

JVM Scope Descriptor:

    Running thread: MainThread
    Frame pointer: 0xSomeAddress
    Program counter: 16
    Local variables:
      L0 (c) = 100
      L1 (y) = 1225
      L2 (i) = 50
    Stack expressions:
      S0 = 50
      S1 = 100

Original bytecode:

    0  iconst_0
    1  istore_1
    2  iconst_0
    3  istore_2
    4  goto 14
    7  iload_1
    8  iload_2
    9  iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

Specialized bytecode (the prologue restores the descriptor's state, then jumps to program counter 16):

    ldc 100      // L0 (c)
    istore_0
    ldc 1225     // L1 (y)
    istore_1
    ldc 50       // L2 (i)
    istore_2
    ldc 50       // S0
    ldc 100      // S1
    goto 16
    0  iconst_0
    ...
    16 if_icmplt 7
    ...
    20 ireturn


Transfer Execution to the New Code

  • Compile m2 as a normal method

  • The system unwinds the stack frame of m1

  • Reschedule the thread to execute m2

  • By construction, executing the specialized m2 sets up the target stack frame and continues execution

[Figure: step 3: m1's frame is replaced by m2's frame, and the thread's PC now points into m2]


Recovering from Inlining

  • Suppose the optimizer inlines A -> B -> C (B and C are compiled into A's frame):

[Figure: step 1 extracts three JVM Scope Descriptors from the single optimized frame, one each for A, B, and C; step 2 generates specialized methods A', B', and C'; step 3 replaces the one frame with separate frames for A', B', and C', with the PC in C']


Inlining Example

Original methods:

    void foo() {
      bar();
      A: ...
    }

    void bar() {
      ...
      B: ...
    }

Specialized methods, for an activation suspended at B: with bar inlined into foo:

    foo_prime() {
      <specialized foo prologue>
      call bar_prime();
      goto A;
      ...
      bar();
      A: ...
    }

    bar_prime() {
      <specialized bar prologue>
      goto B;
      ...
      B: ...
    }

[Figure: suspended at B: in the inlined frame for A -> B, the stack is wiped back to caller C and foo_prime is called; foo_prime's prologue calls bar_prime, recreating separate frames for foo' and bar']


Implementation Details

  • Target compiler unmodified, except for:

    • New pseudo-bytecodes

      • Load literals (to avoid inserting new constants into the constant pool)

      • Load an address or bytecode index (e.g., a JSR return address on the stack)

    • Fixing bytecode indices for GC maps, exception tables, and line number tables


Pros and Cons

  • Advantages

    • mostly compiler-independent

    • avoids multiple entry points in compiled code

    • the target compiler can exploit run-time constants

  • Disadvantage

    • must compile the target method twice (once for the transition, once for the next invocation)


Outline

  • Motivation

  • OSR Mechanism

  • Applications

  • Experimental Results

  • Conclusion


Two OSR Applications

  • Promotion (see the paper for details)

    • recompile a long-running activation

  • Deferred Compilation (example below)

    • don't compile uncommon paths

    • saves compile time

    if (foo is currently final)
      x = 1;
    else
      trap/OSR;    // the uncommon path x = foo() is left uncompiled
    return x;


Deferred Compilation

  • What's "infrequent"?

    • static heuristics

    • profile data

  • The adaptive recompilation decision is modified to consider OSR factors



Outline

  • Motivation

  • OSR Mechanism

  • Applications

  • Experimental Results

  • Conclusion


Online Experiments

  • Eager: (default) no deferred compilation

  • OSR/static: deferred compilation for CHA-based inlining only

  • OSR/edge counts: deferred compilation with online profile data and CHA-based inlining




OSR Activities (SPECjvm98 size 100, first run)

    Benchmark     Promotions   Invalidations
    compress           3              6
    jess               0              0
    db                 0              1
    javac              0             10
    mpegaudio          0              1
    mtrt               0              5
    jack               0              1
    total              3             24


Outline

  • Motivation

  • OSR Mechanism

  • Applications

  • Experimental Results

  • Conclusion


Summary

  • A new on-stack replacement mechanism

  • Online profile-directed deferred compilation

  • Evaluation of OSR applications in JikesRVM


Conclusion

  • Should a VM implement OSR?

    • Can be done with minimal intrusion to code base

    • Modest gains from deferred compilation

    • No benefit for class-hierarchy-based inlining

    • Debugging with dynamic de-optimization is valuable

    • TODO: more advanced speculative optimizations

      The implementation is publicly available in Jikes RVM under the CPL, for Linux/x86, Linux/PPC, and AIX/PPC:

      http://www-124.ibm.com/developerworks/oss/jikesrvm/



[Chart: Compile Rate (offline profile)]

[Chart: Machine Code Size (offline profile)]

[Chart: Code Quality (offline profile; "better" marks the preferred direction)]


Jikes RVM Analytic Recompilation Model

  • Define

    • cur, the current optimization level for method m

    • Tj, the expected future execution time at level j

    • Cj, the compilation cost at optimization level j

  • Choose j > cur that minimizes Tj + Cj

  • If Tj + Cj < Tcur, recompile at level j (see the sketch below)

  • Assumptions

    • The method will execute for twice its current duration

    • Compilation cost and speedup are based on offline averages

    • Sample data determines how long a method has executed
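
The decision rule above amounts to a small cost-benefit search. Here is a minimal Java sketch; the class, method, and array names are illustrative assumptions (T and C would be estimated from sample data and offline averages), not the Jikes RVM API.

    // Sketch only: pick the best recompilation level for a method
    // currently at level cur.
    final class RecompilationModel {
        // T[j] = expected future execution time at level j;
        // C[j] = compilation cost at level j.
        static int chooseRecompileLevel(int cur, double[] T, double[] C) {
            int best = cur;
            double bestCost = T[cur];          // cost of staying at level cur
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + C[j] < bestCost) {  // recompiling must pay for itself
                    bestCost = T[j] + C[j];
                    best = j;
                }
            }
            return best;                       // best == cur means "do nothing"
        }
    }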


Jikes RVM OSR Promotion Model

  • Given: an outdated activation A of method m

  • Define

    • L, the last optimization level for any compiled version of m

    • cur, the current optimization level for activation A

    • Tcur, the expected future execution time of A at level cur

    • CL, the compilation cost for method m at optimization level L

    • TL, the expected future execution time of A at level L

  • If TL + CL < Tcur, specialize A at level L (see the sketch below)

  • Assumption

    • The outdated activation will execute for twice its current duration
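
The promotion decision itself is a single comparison; a sketch with parameters named after the definitions above (illustrative, not the Jikes RVM API):

    // Sketch only: decide whether to OSR-promote an outdated activation A.
    final class PromotionModel {
        // Tcur = expected future time of A at its current level;
        // TL   = expected future time of A at the last compiled level L;
        // CL   = compilation cost for m at level L.
        static boolean shouldPromote(double Tcur, double TL, double CL) {
            return TL + CL < Tcur;  // specialize A at level L only if it pays off
        }
    }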


Jikes RVM Recompilation Model with Profile-Driven Deferred Compilation

  • Define

    • cur, the current optimization level for method m

    • Tj, the expected future execution time at level j

    • Cj, the compilation cost at optimization level j

    • P, the percentage of code in m that profile data indicates was reached

  • Choose j > cur that minimizes Tj + P*Cj

  • If Tj + P*Cj < Tcur, recompile at level j (see the sketch below)

  • Assumptions

    • The method will execute for twice its current duration

    • Compilation cost and speedup are based on offline averages

    • Sample data determines how long a method has executed
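
The only change from the base model is discounting compile cost by P; a sketch, under the same illustrative conventions as RecompilationModel above:

    // Sketch only: as chooseRecompileLevel, but compile cost is scaled by
    // P, the fraction of the method's code that profiling saw executed,
    // since "uncommon" blocks will not be compiled.
    final class DeferredRecompilationModel {
        static int chooseRecompileLevel(int cur, double[] T,
                                        double[] C, double P) {
            int best = cur;
            double bestCost = T[cur];
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + P * C[j] < bestCost) {
                    bestCost = T[j] + P * C[j];
                    best = j;
                }
            }
            return best;
        }
    }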


Offline Profile experiments

  • Collect "perfect" profile data offline

  • Mark any block never reached as "uncommon"

  • Defer compilation of "uncommon" blocks

  • Four configurations

    • Ideal: deferred compilation trap keeps no state live

    • Ideal-OSR: deferred compilation trap is valid OSR point

    • Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is valid OSR point

    • Eager: (default) no deferred compilation


[Chart: Compile Rate (offline profile)]

[Chart: Machine Code Size (offline profile)]

[Chart: Code Quality (offline profile)]


OSR Challenges

  • Engineering Complexity

    • How to minimize disruption to VM code base?

    • How to constrain optimizations?

  • Policies for applying OSR

    • How to make rational decisions for applying OSR?

  • Effectiveness

    • How does OSR improve/constrain dataflow optimizations?

    • How effective are online OSR-based optimizations?


Recompilation Activities (first run)

                  With OSR               Without OSR
    Benchmark     O0   O1   O2  total    O0   O1   O2  total
    compress      17    7    2     26    13    9    6     28
    jess          49   20    1     70    39   17    4     60
    db             8    4    2     14     8    4    5     17
    javac        171   19    2    192   168   16    3    187
    mpegaudio     68   32    7    107    66   29    6    101
    mtrt          57   14    3     74    61   11    3     75
    jack          59   25    8     92    54   26    5     85
    total        429  121   25    575   409  112   32    553


Summary of Study (1)

  • Engineering Complexity

    • How to minimize disruption to VM code base?

      • Compiler-independent specialized source code to manage transition transparently

    • How to constrain optimizations?

      • Model OSR Points like CALLS in standard transformations

  • Policies for applying OSR

    • How to make rational decisions for applying OSR?

      • Simple modifications to cost-benefit analytic model


Summary of Study (2)

  • Effectiveness

  • (for an implementation of online profile-directed deferred compilation)

    • How does OSR improve/constrain dataflow optimizations?

      • small ideal benefit from dataflow merges (0.5 - 2.2%)

      • negligible benefit when constraining optimization for potential invalidation

      • negligible benefit for just CHA-based inlining

        • patch points + splitting + pre-existence good enough

    • How effective are online OSR-based optimizations?

      • average performance improvement of 2.6% on first run SPECjvm98 s=100

      • individual benchmarks range from +8% to -4%

      • negligible impact on steady state performance (best of 10 iterations)

      • adaptive recompilation model relatively insensitive, compiles 4% more methods


Experimental Details

  • SPECjvm98, size 100

  • Jikes RVM 2.1.1

    • FastAdaptiveSemispace configuration

    • one virtual processor

    • 500MB heap

  • separate VM instance for each benchmark

  • IBM RS/6000 Model F80

    • six 500 MHz PowerPC 630's

    • AIX 4.3.3

    • 4 GB memory


Specialized Code Generation

  • Generate a specialized m2 that sets up the new stack frame and continues execution, preserving semantics

  • Express the transition to the new stack frame in source code (bytecode)

[Figure: step 2: specialized code m2 is generated from the compiler-independent state]


Deferred Compilation

  • Don't compile "infrequent" blocks

Without deferred compilation:

    if (foo is currently final)
      x = 1;
    else
      x = foo();
    return x;

With deferred compilation:

    if (foo is currently final)
      x = 1;
    else
      trap/OSR;
    return x;


Experimental Results

  • Online profile-directed deferred compilation

  • Evaluation

    • How much do OSR points improve optimization by eliminating merges?

    • How much do OSR points constrain optimization?

    • How effective is online profile-directed deferred compilation?




Online Experiments

  • Before optimizing, collect intraprocedural edge counters

  • Defer compilation of blocks that profile data says were not reached

  • If deferred block reached

    • Trigger OSR and deoptimize

    • Invalidate compiled code

  • Modify analytic recompilation model

    • Promotion from baseline to optimized

    • Compile-time cost estimate modified according to profile data

