HLL VM Implementation (chap. 6.5–6.7) 김정기 Kim, Jung ki October 11th, 2006 System Programming special lecture, 2006
Contents • Basic Emulation • High-performance Emulation • Optimization Framework • Optimizations • Case study: The Jikes Research Virtual Machine
Basic Emulation • The emulation engine in a JVM can be implemented in a number of ways • interpretation • just-in-time (JIT) compilation • JIT compilation • methods are compiled at the time they are first invoked • JIT compilation is feasible because the instructions belonging to a method are easy to discover in the Java ISA
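Compile-on-first-invocation can be sketched in plain Java. This is a minimal illustration, not the mechanism of any real JVM: the method table, the stub, and the names (`LazyJit`, `invoke`) are all hypothetical. Each method slot starts as a stub; on the first call the stub "compiles" the method and patches the table, so later calls run the compiled version directly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Sketch of lazy (first-invocation) compilation: each method slot starts
// as a stub; the first call "compiles" the method and overwrites the slot.
public class LazyJit {
    final Map<String, IntUnaryOperator> table = new HashMap<>();
    int compileCount = 0;  // how many times "compilation" happened

    LazyJit() {
        // Install a stub for a hypothetical method "square".
        table.put("square", x -> {
            compileCount++;                       // simulate one-time compile cost
            IntUnaryOperator compiled = y -> y * y;
            table.put("square", compiled);        // patch the method table
            return compiled.applyAsInt(x);
        });
    }

    int invoke(String name, int arg) {
        return table.get(name).applyAsInt(arg);   // indirect call through the table
    }

    public static void main(String[] args) {
        LazyJit vm = new LazyJit();
        System.out.println(vm.invoke("square", 3));   // triggers compilation: 9
        System.out.println(vm.invoke("square", 4));   // runs compiled code: 16
        System.out.println(vm.compileCount);          // compiled exactly once: 1
    }
}
```

The point of the sketch is that discovery and translation of a method happen once, at the first invocation, after which the call site pays only the table-lookup cost.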
JIT vs. conventional compiler • a JIT compiler has no frontend for parsing and syntax checking; it starts from bytecode rather than source • different intermediate form before optimization • optimization strategy • multiple optimization levels, driven by profiling • optimizations applied selectively to hot spots (not to the entire method) • Examples • interpretation plus JIT: Sun HotSpot, IBM DK • compile-only: Jikes RVM
High-Performance Emulation • Two challenges for HLL VMs • offsetting run-time optimization overhead with execution-time improvement • making object-oriented programs fast despite their frequent use of addressing indirection and many small methods
Optimization Framework [Figure] Bytecodes enter an interpreter, which gathers profile data; a simple (baseline) compiler produces compiled code; guided by the profile data, an optimizing compiler turns translated code into optimized code. All stages run on the host platform.
Contents • Basic Emulation • High-performance Emulation • Optimization Framework • Optimizations • Code Relayout • Method Inlining • Optimizing Virtual Method Calls • Multiversioning and Specialization • On-Stack Replacement • Optimization of Heap-Allocated Objects • Low-Level Optimizations • Optimizing Garbage Collection • Case study: The Jikes Research Virtual Machine
Code Relayout • places the most commonly followed control-flow paths in contiguous locations in memory • improves locality and conditional branch predictability
Code Relayout [Figure] A control-flow graph with basic blocks A–G annotated with edge profile counts (e.g., A branches to D 70 times vs. B 30 times). The hot path is laid out contiguously, cold blocks (B, C, F) are moved out of line, and branch conditions are inverted (e.g., "Br cond1 == false") so the common case falls through.
Method Inlining • Benefits • calling overhead decreases, which matters especially in object-oriented code • passing parameters • managing the stack frame • control transfer • the scope of code analysis expands • more optimizations become applicable • Effects differ with method size • small methods: beneficial in most cases • large methods: the calling sequence is a small fraction of the cost, so a sophisticated cost-benefit analysis is needed; otherwise code explosion may occur, causing poor cache behavior and performance losses
Method Inlining • Processing sequence 1. profile via instrumentation 2. construct a call graph at certain intervals 3. invoke the dynamic optimization system when a call count exceeds a threshold • Reducing analysis overhead • a profile counter is kept in each stack frame • when a counter reaches the threshold, "walk" backward through the stack instead of building a full call graph
Method Inlining [Figure] Two ways to find inlining candidates: (left) a full call graph annotated with call counts (e.g., MAIN calls A 900 times and X 100 times; A's callees B, C, etc. have counts such as 1000 and 25), with edges exceeding the threshold selected; (right) walking backward through the stack frames when a frame's profile counter reaches the threshold.
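The payoff of inlining a tiny method can be shown at the source level. This is an illustrative sketch (the `Point`/`getX` names are hypothetical, not from the slides): the hand-inlined variant computes the same result while eliminating the per-element call, its parameter passing, and its stack frame.

```java
// Hand-inlined version of a small accessor: the call, parameter passing,
// and stack-frame management disappear; only the field access remains.
public class InlineDemo {
    static class Point {
        int x;
        Point(int x) { this.x = x; }
        int getX() { return x; }          // tiny method: an ideal inlining target
    }

    static int sumWithCall(Point[] ps) {
        int s = 0;
        for (Point p : ps) s += p.getX(); // one call per element
        return s;
    }

    static int sumInlined(Point[] ps) {
        int s = 0;
        for (Point p : ps) s += p.x;      // body of getX() inlined at the call site
        return s;
    }

    public static void main(String[] args) {
        Point[] ps = { new Point(1), new Point(2), new Point(3) };
        System.out.println(sumWithCall(ps));  // 6
        System.out.println(sumInlined(ps));   // 6
    }
}
```

Inlining also widens the analysis scope: in `sumInlined` the compiler sees a plain field load inside the loop and can apply further loop optimizations that the opaque call would have blocked.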
Optimizing Virtual Method Calls
• virtual calls are the most common case in Java: which code to run is determined at run time via a dynamic method table lookup
• guarded inlining replaces the lookup with a type test for the predicted receiver:

      if (a instanceof Square) {
          // inlined code for Square's perimeter()
      } else {
          invokevirtual <perimeter>
      }
Optimizing Virtual Method Calls
• even when inlining is not worthwhile, just removing the method table lookup is also helpful
• polymorphic inline caching (PIC): the call site (... invokevirtual perimeter ...) becomes a call to a PIC stub that dispatches on the receiver type:

      PIC stub:
          if type == circle:      jump to circle perimeter code
          else if type == square: jump to square perimeter code
          else:                   call lookup   // method table lookup code; update PIC stub
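A source-level analogue of guarded dispatch can be sketched in Java. The `Shape`/`Square`/`Circle` classes here are hypothetical stand-ins: the type tests play the role of the PIC stub, the inlined arithmetic replaces the method bodies, and the virtual call remains as the fallback.

```java
// Guarded inlining with explicit type tests: predicted receiver types run
// inlined bodies; anything else falls back to the virtual (table) dispatch.
public class PicDemo {
    interface Shape { double perimeter(); }

    static class Square implements Shape {
        double side;
        Square(double s) { side = s; }
        public double perimeter() { return 4 * side; }
    }

    static class Circle implements Shape {
        double r;
        Circle(double r) { this.r = r; }
        public double perimeter() { return 2 * Math.PI * r; }
    }

    static double perimeterPic(Shape s) {
        if (s instanceof Square) {
            return 4 * ((Square) s).side;        // inlined Square body
        } else if (s instanceof Circle) {
            return 2 * Math.PI * ((Circle) s).r; // inlined Circle body
        }
        return s.perimeter();                    // fallback: dynamic dispatch
    }

    public static void main(String[] args) {
        System.out.println(perimeterPic(new Square(3)));  // 12.0
        System.out.println(perimeterPic(new Circle(1)));
    }
}
```

A real PIC stub is generated machine code that rewrites itself as new receiver types are observed; the fixed `if`/`else` chain above only models its steady state.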
Multiversioning and Specialization
• multiversioning by specialization
• if some variables or references are always assigned data values or types known to be constant (or drawn from a limited range), simplified, specialized code can be used

      General code:
          for (int i = 0; i < 1000; i++) {
              if (A[i] < 0) B[i] = -A[i] * C[i];
              else          B[i] =  A[i] * C[i];
          }

      Specialized code (profiling shows A[i] is almost always 0):
          for (int i = 0; i < 1000; i++) {
              if (A[i] == 0) B[i] = 0;
              else { if (A[i] < 0) B[i] = -A[i] * C[i]; else B[i] = A[i] * C[i]; }
          }
Multiversioning and Specialization
• deferred compilation of the general case: only the specialized version is compiled up front; the general path traps into the dynamic compiler if it is ever taken

      for (int i = 0; i < 1000; i++) {
          if (A[i] == 0) B[i] = 0;
          else jump to dynamic compiler for deferred compilation
          // general case, compiled only if reached:
          //   if (A[i] < 0) B[i] = -A[i] * C[i]; else B[i] = A[i] * C[i];
      }
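The two-version scheme above can be made runnable. This is a sketch under stated assumptions: the guard is an explicit up-front scan (a real VM would use profile data instead), and the general method stands in for code whose compilation could be deferred.

```java
import java.util.Arrays;

// Multiversioning sketch: a specialized fast loop for the common case
// (all elements of A are zero), guarded by a cheap check; the general
// version models the code path a VM could compile lazily.
public class Multiversion {
    static void computeGeneral(int[] a, int[] c, int[] b) {
        for (int i = 0; i < a.length; i++)
            b[i] = a[i] < 0 ? -a[i] * c[i] : a[i] * c[i];
    }

    static void computeSpecialized(int[] b) {
        Arrays.fill(b, 0);                       // A[i] == 0 everywhere, so B[i] = 0
    }

    static void compute(int[] a, int[] c, int[] b) {
        boolean allZero = true;
        for (int v : a) if (v != 0) { allZero = false; break; }
        if (allZero) computeSpecialized(b);      // specialized version
        else computeGeneral(a, c, b);            // general ("deferred") version
    }

    public static void main(String[] args) {
        int[] c = {7, 8, 9};
        int[] b = new int[3];
        compute(new int[]{0, 0, 0}, c, b);       // specialized path
        System.out.println(Arrays.toString(b));  // [0, 0, 0]
        compute(new int[]{-2, 0, 3}, c, b);      // general path
        System.out.println(Arrays.toString(b));  // [14, 0, 27]
    }
}
```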
On-Stack Replacement • a newly optimized version of a method yields no benefit until the method is next invoked, so for an already-running method the implementation stack frame must be modified on the fly: on-stack replacement (OSR) • OSR is needed for • inlining into a long-running method • deferred compilation • debugging (the user expects to observe the architected instruction sequence)
On-Stack Replacement [Figure] OSR replaces a stack frame on the fly: 1. extract the architected (bytecode-level) state from implementation frame A, which belongs to the method code at optimization level x; 2. generate a new implementation frame B for the method code at optimization level y (optimizing or de-optimizing it); 3. replace the current implementation stack frame with the new one and resume.
On-Stack Replacement • OSR is a complex operation • if the initial stack frame is maintained by an interpreter or a nonoptimizing compiler, extracting the architected stack state is straightforward • otherwise, the compiler may define a set of program points where OSR can potentially occur and ensure that the architected values are live at those points in the execution
On-Stack Replacement • Significance of OSR • its direct performance benefits are small • it allows the implementation of debuggers for optimized code • it reduces start-up time through deferred compilation • it improves cache performance
Optimization of Heap-Allocated Objects • creating objects and garbage-collecting them have high cost • the code for heap allocation and object initialization can be inlined for frequently allocated objects • scalar replacement, guided by escape analysis, is effective for reducing object access delays
Optimization of Heap-Allocated Objects
• scalar replacement: if an object does not escape its method, its fields can be replaced with local scalars, so allocation and access delays are removed entirely

      Before:
          class Square { int side; int area; }
          void calculate() {
              Square a = new Square();
              a.side = 3;
              a.area = a.side * a.side;
              System.out.println(a.area);
          }

      After scalar replacement:
          void calculate() {
              int t1 = 3;
              int t2 = t1 * t1;
              System.out.println(t2);
          }
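The before/after pair above can be packaged as a runnable comparison. This is an illustrative sketch (class and method names are hypothetical); both variants return the same value, but the second performs no allocation and no field accesses.

```java
// Runnable version of the scalar-replacement example: the Square object
// never escapes its method, so its fields can become plain locals.
public class ScalarReplacement {
    static class Square { int side; int area; }

    static int calculateWithObject() {
        Square a = new Square();     // heap allocation
        a.side = 3;
        a.area = a.side * a.side;    // field stores and loads
        return a.area;
    }

    static int calculateScalarReplaced() {
        int t1 = 3;                  // a.side becomes a local
        int t2 = t1 * t1;            // a.area becomes a local; no allocation
        return t2;
    }

    public static void main(String[] args) {
        System.out.println(calculateWithObject());      // 9
        System.out.println(calculateScalarReplaced());  // 9
    }
}
```

In a real VM the transformation is performed automatically once escape analysis proves the object is method-local; writing it by hand here only makes the effect visible.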
Optimization of Heap-Allocated Objects
• field ordering for data usage patterns: lay out frequently used fields together to improve data cache performance
• redundant getfield (load) removal: a load can be replaced by a value already known to be in a register

      Before:
          a = new Square(); b = new Square(); c = a;
          ...
          a.side = 5;
          b.side = 10;
          z = c.side;      // c aliases a, so this reloads the value 5

      After:
          a = new Square(); b = new Square(); c = a;
          ...
          t1 = 5;
          a.side = t1;
          b.side = 10;
          z = t1;          // redundant load removed
Low-Level Optimizations • array-range and null-reference checking is a significant cost in object-oriented HLL VMs • the checks, required for throwing precise exceptions, cause two kinds of performance loss • the overhead of performing each check itself • optimizations that must be inhibited to preserve a precise state
Low-Level Optimizations
• removing redundant null checks: once p has been checked, later accesses through p (or an alias such as r) need no check

      Before:
          p = new Z; q = new Z; r = p;
          ...
          p.x = ...    <null check p>
          ... = p.x    <null check p>
          q.x = ...    <null check q>
          r.x = ...    <null check r (same object as p)>

      After:
          p = new Z; q = new Z; r = p;
          ...
          p.x = ...    <null check p>
          ... = p.x
          r.x = ...
          q.x = ...    <null check q>
Low-Level Optimizations
• hoisting an invariant check: the range check can be hoisted outside the loop

      Before:
          for (int i = 0; i < j; i++) {
              sum += A[i];    <range check A>
          }

      After:
          if (j < A.length)
              for (int i = 0; i < j; i++) {
                  sum += A[i];                  // no check needed
              }
          else
              for (int i = 0; i < j; i++) {
                  sum += A[i];    <range check A>
              }
Low-Level Optimizations
• loop peeling: peel off the first iteration; after p has been null-checked once, the check is not needed for the remaining iterations

      Before:
          for (int i = 0; i < 100; i++) {
              r = A[i];
              B[i] = r * 2;
              p.x += A[i];    <null check p>
          }

      After peeling:
          r = A[0];
          B[0] = r * 2;
          p.x += A[0];        <null check p>
          for (int i = 1; i < 100; i++) {
              r = A[i];
              B[i] = r * 2;
              p.x += A[i];    // no null check needed
          }
Optimizing Garbage Collection • Compiler support • the compiler provides the garbage collector with "yield points" at regular intervals in the code; at these points a thread can guarantee a consistent heap state, and control can be yielded to the collector • the compiler can also generate support code tailored to the specific garbage-collection algorithm
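A yield point is just a cheap poll inserted where the thread's state is known to be consistent. The sketch below models this in plain Java with a hypothetical flag and counter (`gcRequested`, `yieldsTaken`); in a real VM the check is a compiler-inserted test at loop back-edges and method entries, and "yielding" hands the thread to the collector.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of compiler-inserted yield points: the mutator polls a flag at
// each loop back-edge; when a collection is requested, the thread stops
// at a point where its stack and heap state are consistent.
public class YieldPointDemo {
    static final AtomicBoolean gcRequested = new AtomicBoolean(false);
    static int yieldsTaken = 0;

    static void yieldPoint() {
        if (gcRequested.get()) {       // cheap check; almost always false
            yieldsTaken++;             // here control would pass to the collector
            gcRequested.set(false);    // collector done; resume the mutator
        }
    }

    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
            yieldPoint();              // inserted at the loop back-edge
        }
        return sum;
    }

    public static void main(String[] args) {
        gcRequested.set(true);
        System.out.println(work(10));     // 45
        System.out.println(yieldsTaken);  // 1
    }
}
```

Because the poll sits at known program points, the compiler can emit stack maps describing exactly which registers and slots hold references there, which is what makes exact garbage collection possible.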
The Jikes Research Virtual Machine • an open-source research Java virtual machine, itself written in Java • the original version was developed by IBM (the Jalapeño project)
The Jikes Research Virtual Machine • compile-only: there is no interpretation step; every method is compiled before it first executes • the baseline compiler translates bytecodes into native code that simply emulates the Java stack; optimization is applied later • the dynamic optimizing compiler chooses what to recompile based on a cost-benefit estimate • multithreaded implementation • quasi-preemptive thread scheduling via a control bit tested at yield points
The Jikes Research Virtual Machine • Adaptive Optimization System (AOS) • runtime measurement subsystem: gathers raw performance data by sampling at yield points • recompilation subsystem • controller: coordinates the other subsystems and determines the optimization level for recompilation • cost-benefit function for a given method j • Cj = cost of recompiling, Tj = execution time if recompiled, Ti = execution time if not recompiled • recompile when Cj + Tj < Ti
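The recompilation criterion from the slide is a one-line inequality; the sketch below just evaluates it. The class name and the sample numbers are illustrative, not Jikes RVM code or measured data.

```java
// The AOS cost-benefit test: recompile method j at a higher optimization
// level only when compile cost plus the new execution time beats staying
// at the current level, i.e. Cj + Tj < Ti.
public class CostBenefit {
    static boolean shouldRecompile(double cj, double tj, double ti) {
        return cj + tj < ti;
    }

    public static void main(String[] args) {
        // Hypothetical numbers (milliseconds), not measurements.
        System.out.println(shouldRecompile(5.0, 40.0, 100.0)); // true: 45 < 100
        System.out.println(shouldRecompile(5.0, 98.0, 100.0)); // false: 103 >= 100
    }
}
```

The interesting part in practice is estimating Tj and Ti: the controller extrapolates them from the sampled profile and a model of each optimization level's speedup, so a method is promoted only when it is expected to run long enough to amortize Cj.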
The Jikes Research Virtual Machine [Figure] AOS architecture: the runtime measurement subsystem collects samples from the executing code and passes method samples to a hot-method organizer and the AOS database (profile data); the controller reads events from an event queue and issues instrumentation/compilation plans; in the recompilation subsystem, a compilation thread takes plans from a compilation queue and invokes the optimizing compiler, which installs new instrumented/optimized code.
The Jikes Research Virtual Machine Optimization levels • Level 0 • copy/constant propagation, branch optimization, etc. • inlining of trivial methods • simple code relayout; register allocation by simple linear scan • Level 1 • higher-level code restructuring: more aggressive inlining and code relayout • Level 2 • uses a static single-assignment (SSA) intermediate form • SSA enables global optimizations, e.g., loop unrolling and elimination of loop-closing branches
The Jikes Research Virtual Machine [Figure] Start-up performance comparison (graph not reproduced).
The Jikes Research Virtual Machine [Figure] Steady-state performance comparison (graph not reproduced).