Context threading a flexible and efficient dispatch technique for virtual machine interpreters
This presentation is the property of its rightful owner.
Sponsored Links
1 / 25

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters PowerPoint PPT Presentation


  • 48 Views
  • Uploaded on
  • Presentation posted in: General

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters. Marc Berndl Benjamin Vitale Mathew Zaleski Angela Demke Brown. Research supported by IBM CAS, NSERC, CITO. Interpreter performance. Why not just JIT? High performance JVMs still interpret

Download Presentation

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Context threading a flexible and efficient dispatch technique for virtual machine interpreters

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

  • Marc Berndl

  • Benjamin Vitale

  • Mathew Zaleski

  • Angela Demke Brown

Research supported by IBM CAS, NSERC, CITO


Interpreter performance

Interpreter performance

  • Why not just JIT?

    • High performance JVMs still interpret

    • People use interpreted languages that don’t yet have JITs

    • They still want performance!

  • 30-40% of execution time is due to branch misprediction

  • Our technique eliminates 95% of branch mispredictions


Overview

Overview

  • Motivation

  • Background: The Context Problem

  • Existing Solutions

  • Our Approach

  • Inlining

  • Results


Context threading a flexible and efficient dispatch technique for virtual machine interpreters

Loaded

Program

A Tale of Two Machines

Virtual Machine Interpreter

Execution Cycle

Virtual

Program

load

Bytecode

Bodies

Pipeline

Execution Cycle

Real

Machine

CPU

Target Address

(Indirect)

Predictors

Return Address

Wayness

(Conditional)


Interpreter

execute

Loaded

Program

dispatch

Interpreter

fetch

Load

Parms

Internal

Representation

Bytecode

bodies

Execution Cycle


Running java example

Javac

compiler

Running Java Example

Java Bytecode

Java Source

0: iconst_0

1: istore_1

2: iload_1

3: iload_1

4: iadd

5: istore_1

6: iload_1

7: bipush 64

9: if_icmplt 2

12: return

void foo(){

int i=1;

do{

i+=i;

} while(i<64);

}


Switched interpreter

Switched Interpreter

while(1){

opcode = *vPC++;

switch(opcode){

//and many more..

}

};

case iload_1:

..

break;

case iadd:

..

break;

4slow. burdened by switch and loop overhead


Threading dispatch

iload_1:

..

goto *vPC++;

iadd:

..

goto *vPC++;

istore:

..

goto *vPC++;

“Threading” Dispatch

0: iconst_0

1: istore_1

2: iload_1

3: iload_1

4: iadd

5: istore_1

6: iload_1

7: bipush 64

9: if_icmplt 2

12: return

execution of virtual program “threads” through bodies

(as in needle & thread)

  • No switch overhead. Data driven indirect branch.


Context problem

iload_1:

..

goto *vPC++;

iadd:

..

goto *vPC++;

istore:

..

goto *vPC++;

Context Problem

0: iconst_0

1: istore_1

2: iload_1

3: iload_1

4: iadd

5: istore_1

6: iload_1

7: bipush 64

9: if_icmplt 2

12: return

indirect branch predictor

(micro-arch)

  • Data driven indirect branches hard to predict


Direct threaded interpreter

vPC

iload_1:

..

goto *vPC++;

iadd:

..

goto *vPC++;

istore:

..

goto *vPC++;

Direct Threaded Interpreter

&&iload_1

iload_1

iload_1

iadd

istore_1

iload_1

bipush 64

if_icmplt 2

&&iload_1

&&iadd

&&istore_1

&&iload_1

&&bipush

64

&&if_icmplt

-7

DTT - Direct

Threading Table

C implementation

of each body

Virtual

Program

4Target of computed goto is data-driven


Existing solutions

1

iload_1

1

Body

Body

1

Body

Body

2

iload_1

Body

2

GOTO *PC

2

????

Existing Solutions

Replicate

Super Instruction

goto *pc

goto *pc

Ertl & Gregg:

Bodies and Dispatch Replicated

Piumarta & Ricardi :

Bodies Replicated

4Limited to relocatable virtual instructions


Overview1

Overview

  • Motivation

  • Background: The Context Problem

  • Existing Solutions

  • Our Approach

  • Inlining

  • Results


Key observation

Key Observation

  • Virtual and native control flow similar

    • Linear or straight-line code

    • Conditional branches

    • Calls and Returns

    • Indirect branches

  • Hardware has predictors for each type

    • Direct uses indirect branch for everything!

  • Solution: Leverage hardware predictors


Essence of our solution

iload_1:

..

ret;

iadd:

..

ret;

Essence of our Solution

CTT - Context

Threading Table

(generated code)

iload_1

iload_1

iadd

istore_1

iload_1

bipush 64

if_icmplt 2

Bytecode bodies

(ret terminated)

call iload_1

call iload_1

call iadd

call istore_1

call iload_1

..

Return Branch Predictor Stack

4Package bodies as subroutines and call them


Subroutine threading

vPC

iload_1

iload_1

iadd

istore_1

iload_1

bipush 64

if_icmplt 2

iload_1:

ret;

iadd:

ret;

64

-7

if_cmplt:

goto *vPC++;

DTT contains

addresses in CTT

Subroutine Threading

Bytecode bodies

(ret terminated)

call iload_1

call iload_1

call iadd

call istore_1

call iload_1

call bipush

call if_icmplt

CTT load time

generated code

4virtual branch instructions as before


The context threading table

The Context Threading Table

  • A sequence of generated call instructions

  • Good alignment of virtual and hardware control flow for straight-line code.

  • Can virtual branches go into the CTT?


Specialized branch inlining

vPC

if(icmplt)

goto target:

Specialized Branch Inlining

target:

5

call iload_1

Conditional Branch Predictor now mobilized

call …

target:

Branch Inlined

Into the CTT

DTT

4Inlining conditional branches provides context


Tiny inlining

Tiny Inlining

  • Context Threading is a dispatch technique

    • But, we inline branches

  • Some non-branching bodies are very small

    • Why not inline those?

  • Inline all tiny linear bodies into the CTT


Overview2

Overview

  • Motivation

  • Background: The Context Problem

  • Existing Solutions

  • Our Approach

  • Inlining

  • Results


Experimental setup

Experimental Setup

  • Two Virtual Machines on two hardware architectures.

    • VM: Java/SableVM, OCaml interpreter

      • Compare against direct threaded SableVM

      • SableVM distro uses selective inlining

    • Arch: P4, PPC

  • Branch Misprediction

  • Execution Time

  • Is our technique effective and general?


Mispredicted taken branches

Mispredicted Taken Branches

Normalized to

Direct Threading

SableVm/Java Pentium 4

495% mispredictions eliminated on average


Execution time

Execution time

Pentium 4

Normalized to

Direct Threading

427% average reduction in execution time


Execution time geomean

Execution Time (geomean)

Normalized to

Direct Threading

4Our technique is effective and general


Conclusions

Conclusions

  • Context Problem: branch mispredictions due to mismatch between native and virtual control flow

  • Solution: Generate control flow code into the Context Threading Table

  • Results

    • Eliminate 95% of branch mispredictions

    • Reduce execution time by 30-40%

  • recent, post CGO 2005, work follows


What about scripting languages

What about Scripting Languages?

  • Recently ported context threading to TCL.

  • 10x cycles executed per bytecode dispatched.

  • Much lower dispatch overhead.

  • Speedup due to subroutine threading, approx. 5%.

  • TCL conference 2005

Cycles per virtual instruction


  • Login