1 / 25

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters. Marc Berndl Benjamin Vitale Mathew Zaleski Angela Demke Brown. Research supported by IBM CAS, NSERC, CITO. Interpreter performance. Why not just JIT? High performance JVMs still interpret

salene
Download Presentation

Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters • Marc Berndl • Benjamin Vitale • Mathew Zaleski • Angela Demke Brown Research supported by IBM CAS, NSERC, CITO

  2. Interpreter performance • Why not just JIT? • High performance JVMs still interpret • People use interpreted languages that don’t yet have JITs • They still want performance! • 30-40% of execution time is due to branch misprediction • Our technique eliminates 95% of branch mispredictions

  3. Overview • Motivation • Background: The Context Problem • Existing Solutions • Our Approach • Inlining • Results

  4. Loaded Program A Tale of Two Machines Virtual Machine Interpreter Execution Cycle Virtual Program load Bytecode Bodies Pipeline Execution Cycle Real Machine CPU Target Address (Indirect) Predictors Return Address Wayness (Conditional)

  5. execute Loaded Program dispatch Interpreter fetch Load Parms Internal Representation Bytecode bodies Execution Cycle

  6. Javac compiler Running Java Example Java Bytecode Java Source 0: iconst_0 1: istore_1 2: iload_1 3: iload_1 4: iadd 5: istore_1 6: iload_1 7: bipush 64 9: if_icmplt 2 12: return void foo(){ int i=1; do{ i+=i; } while(i<64); }

  7. Switched Interpreter while(1){ opcode = *vPC++; switch(opcode){ //and many more.. } }; case iload_1: .. break; case iadd: .. break; 4slow. burdened by switch and loop overhead

  8. iload_1: .. goto *vPC++; iadd: .. goto *vPC++; istore: .. goto *vPC++; “Threading” Dispatch 0: iconst_0 1: istore_1 2: iload_1 3: iload_1 4: iadd 5: istore_1 6: iload_1 7: bipush 64 9: if_icmplt 2 12: return execution of virtual program “threads” through bodies (as in needle & thread) • No switch overhead. Data driven indirect branch.

  9. iload_1: .. goto *vPC++; iadd: .. goto *vPC++; istore: .. goto *vPC++; Context Problem 0: iconst_0 1: istore_1 2: iload_1 3: iload_1 4: iadd 5: istore_1 6: iload_1 7: bipush 64 9: if_icmplt 2 12: return indirect branch predictor (micro-arch) • Data driven indirect branches hard to predict

  10. vPC iload_1: .. goto *vPC++; iadd: .. goto *vPC++; istore: .. goto *vPC++; Direct Threaded Interpreter &&iload_1 … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 … &&iload_1 &&iadd &&istore_1 &&iload_1 &&bipush 64 &&if_icmplt -7 DTT - Direct Threading Table C implementation of each body Virtual Program 4Target of computed goto is data-driven

  11. 1 iload_1 1 Body Body 1 Body Body 2 iload_1 Body 2 GOTO *PC 2 ???? Existing Solutions Replicate Super Instruction goto *pc goto *pc Ertl & Gregg: Bodies and Dispatch Replicated Piumarta & Ricardi : Bodies Replicated 4Limited to relocatable virtual instructions

  12. Overview • Motivation • Background: The Context Problem • Existing Solutions • Our Approach • Inlining • Results

  13. Key Observation • Virtual and native control flow similar • Linear or straight-line code • Conditional branches • Calls and Returns • Indirect branches • Hardware has predictors for each type • Direct uses indirect branch for everything! • Solution: Leverage hardware predictors

  14. iload_1: .. ret; iadd: .. ret; Essence of our Solution CTT - Context Threading Table (generated code) … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 … Bytecode bodies (ret terminated) call iload_1 call iload_1 call iadd call istore_1 call iload_1 .. Return Branch Predictor Stack 4Package bodies as subroutines and call them

  15. vPC … iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2 … iload_1: … ret; iadd: … ret; 64 -7 if_cmplt: … goto *vPC++; DTT contains addresses in CTT Subroutine Threading Bytecode bodies (ret terminated) call iload_1 call iload_1 call iadd call istore_1 call iload_1 call bipush call if_icmplt CTT load time generated code 4virtual branch instructions as before

  16. The Context Threading Table • A sequence of generated call instructions • Good alignment of virtual and hardware control flow for straight-line code. • Can virtual branches go into the CTT?

  17. vPC if(icmplt) goto target: Specialized Branch Inlining … … target: … 5 call iload_1 Conditional Branch Predictor now mobilized call … … … … target: Branch Inlined Into the CTT DTT 4Inlining conditional branches provides context

  18. Tiny Inlining • Context Threading is a dispatch technique • But, we inline branches • Some non-branching bodies are very small • Why not inline those? • Inline all tiny linear bodies into the CTT

  19. Overview • Motivation • Background: The Context Problem • Existing Solutions • Our Approach • Inlining • Results

  20. Experimental Setup • Two Virtual Machines on two hardware architectures. • VM: Java/SableVM, OCaml interpreter • Compare against direct threaded SableVM • SableVM distro uses selective inlining • Arch: P4, PPC • Branch Misprediction • Execution Time • Is our technique effective and general?

  21. Mispredicted Taken Branches Normalized to Direct Threading SableVm/Java Pentium 4 495% mispredictions eliminated on average

  22. Execution time Pentium 4 Normalized to Direct Threading 427% average reduction in execution time

  23. Execution Time (geomean) Normalized to Direct Threading 4Our technique is effective and general

  24. Conclusions • Context Problem: branch mispredictions due to mismatch between native and virtual control flow • Solution: Generate control flow code into the Context Threading Table • Results • Eliminate 95% of branch mispredictions • Reduce execution time by 30-40% • recent, post CGO 2005, work follows

  25. What about Scripting Languages? • Recently ported context threading to TCL. • 10x cycles executed per bytecode dispatched. • Much lower dispatch overhead. • Speedup due to subroutine threading, approx. 5%. • TCL conference 2005 Cycles per virtual instruction

More Related