
Targeting Dynamic Compilation for Embedded Systems




Presentation Transcript


  1. Targeting Dynamic Compilation for Embedded Systems. Michael Chen and Kunle Olukotun, Computer Systems Laboratory, Stanford University

  2. Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions

  3. Challenges of Running Java on Embedded Devices • J2ME (micro edition) on CDC (connected device configuration) • PDAs, thin clients, and high-end cellphones • Highly resource constrained • 30MHz - 200MHz embedded processors • 2MB - 32MB RAM • < 4MB ROM • Differences from running Java on desktop machines • Satisfying performance requirements is difficult with slower processors • Virtual machine footprint matters • Limited dynamic memory available for the runtime system [Figure: Java platform spectrum, from embedded to server: J2ME/CLDC, J2ME/CDC, J2SE, J2EE]

  4. Java Execution Models • Interpretation • Decode and execute bytecodes in software • Incurs a high performance penalty • Fast code generators • Dynamic compilation without aggressive optimization • Sacrifices code quality for compilation speed • Lazy compilation • Interpret bytecodes and translate frequently executed methods with an optimizing compiler • Adds complexity, and the combined ROM footprint of interpreter + compiler is large • Alternative approach?

  5. microJIT: An Efficient Optimizing Compiler • Minimize major compiler passes while optimizing aggressively • Perform several optimizations concurrently • Pipeline information from one pass to drive optimizations in subsequent passes • Budget overheads for dataflow analysis • Efficient implementations of straightforward optimizations • Use good heuristics for difficult optimizations • Manage compiler dynamic memory requirements • Efficient dataflow representation

  6. Using microJIT in Embedded Systems • Configuration • Compile everything to native code • Potential advantages over other execution models • Lower total system cost • Multiple execution engines require more ROM • Reduced complexity • Only need to maintain one compiler • Doesn't sacrifice long- or short-running performance • Generates fast code while minimizing overheads

  7. Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions

  8. microJIT Compiler Overview [Diagram: three passes in sequence: CFG Construction → DFG Generation → Native Code Generation. Dataflow information flows forward between passes: locals & field accesses and loop identification from pass 1; IR expression optimizations and IR expression use counts from pass 2; the register allocator and instruction scheduler in pass 3. ISA-dependent optimizations plug in along the way: register reservations, assembler macros, instruction delays, and machine idioms]

  9. Pass 1: CFG Construction • Quickly scan bytecodes in one pass • Partially decode bytecodes to extract desired information • Decompose method into extended basic blocks (EBBs) • Build blocks and arcs as branches and targets are encountered • Compute block-level dataflow information • Identify loops • Record local and field accesses for blocks and loops
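
A minimal sketch of the block-boundary discovery this pass performs, assuming a toy bytecode format (the Insn record and opcode names are hypothetical, not microJIT's): in a single scan, branch targets and the fall-through points after branches become block leaders.

import java.util.*;

// One-pass basic-block discovery over a toy bytecode: each instruction
// is a pc, an opcode, and an optional branch target.
public class CfgScan {
    static final int GOTO = 0, IF = 1, RETURN = 2, OTHER = 3;

    record Insn(int pc, int opcode, int target) {}

    // Returns the program counters that start a new block:
    // branch targets plus the instructions that follow branches.
    static SortedSet<Integer> blockStarts(List<Insn> code) {
        SortedSet<Integer> starts = new TreeSet<>();
        starts.add(0);                           // method entry
        for (int i = 0; i < code.size(); i++) {
            Insn ins = code.get(i);
            if (ins.opcode() == GOTO || ins.opcode() == IF) {
                starts.add(ins.target());        // branch target starts a block
                if (i + 1 < code.size())
                    starts.add(code.get(i + 1).pc()); // fall-through block
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<Insn> code = List.of(
            new Insn(0, OTHER, -1),
            new Insn(1, IF, 4),      // conditional branch to pc 4
            new Insn(2, OTHER, -1),
            new Insn(3, GOTO, 5),
            new Insn(4, OTHER, -1),
            new Insn(5, RETURN, -1));
        System.out.println(blockStarts(code));   // prints [0, 2, 4, 5]
    }
}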

  10. Pass 2: DFG Generation • Intermediate representation (IR) • Closer to machine instructions than bytecodes (LIR) • Triples representation – unnamed destination • Source arguments are pointers to other IR expression nodes • Complex bytecodes decompose into several IR expressions • Example, computing -([L0] + 1):
[1] const 1
[2] add [1] [L0]
[3] neg [2]
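
A sketch of what such a triples node might look like, assuming illustrative field names and an id counter standing in for the bracketed labels above (this is not microJIT's actual layout):

import java.util.*;

// A triples-style IR node: an opcode plus pointers to source
// expressions. There is no named destination; the node itself is the
// value, and other expressions simply point at it.
class IrNode {
    static int nextId = 1;
    final int id = nextId++;         // [1], [2], ... as on the slide
    final String op;                 // "const", "add", "neg", "local"
    final int value;                 // constant payload or local index
    final IrNode[] sources;          // pointers to operand nodes

    IrNode(String op, int value, IrNode... sources) {
        this.op = op; this.value = value; this.sources = sources;
    }

    public String toString() {
        StringBuilder sb = new StringBuilder("[" + id + "] " + op);
        if (sources.length == 0) sb.append(' ').append(value);
        for (IrNode s : sources) sb.append(" [").append(s.id).append(']');
        return sb.toString();
    }

    public static void main(String[] args) {
        // The slide's example: neg(add(const 1, local L0))
        IrNode l0  = new IrNode("local", 0);
        IrNode one = new IrNode("const", 1);
        IrNode add = new IrNode("add", 0, one, l0);
        IrNode neg = new IrNode("neg", 0, add);
        for (IrNode n : List.of(one, add, neg)) System.out.println(n);
    }
}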

  11. Block-local Optimizations (Pass 2: DFG Generation) • Maintain a mimic stack when translating into IR expressions • Manipulate pointers in place of locals and stack accesses, which do not generate IR expressions • Immediately eliminates copy expressions • Optimizations immediately applied to newly created IR expressions • Check source arguments for constant propagation and algebraic simplifications • Search backwards in the EBB for an available matching expression (CSE) • Example: the Java source L0.count++; compiles to bytecodes 0 aload_0, 1 dup, 2 getfield count, 4 iconst_1, 5 iadd, 6 putfield count, which translate to the IR:
[1] load @ [L0]+16
[2] const 1
[3] add [1] [2]
[4] store [3] @ [L0]+16
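
A minimal sketch of the mimic-stack idea (the Node record and opcode strings are illustrative): stack-manipulation bytecodes move pointers without emitting IR, and constant folding happens at node-creation time.

import java.util.*;

// Abstract interpretation of the operand stack: bytecodes push and pop
// pointers to IR nodes instead of values, so dup/load bytecodes cost
// nothing, and each new node is checked for folding before it is emitted.
public class MimicStack {
    record Node(String op, int k, Node a, Node b) {}

    static final Deque<Node> stack = new ArrayDeque<>();
    static final List<Node> emitted = new ArrayList<>();

    static Node emit(String op, int k, Node a, Node b) {
        // fold const+const at creation time instead of emitting an add
        if (op.equals("add") && a.op().equals("const") && b.op().equals("const"))
            return new Node("const", a.k() + b.k(), null, null);
        Node n = new Node(op, k, a, b);
        emitted.add(n);
        return n;
    }

    public static void main(String[] args) {
        stack.push(new Node("const", 2, null, null)); // iconst_2: no IR emitted
        stack.push(stack.peek());                     // dup: pointer copy only
        Node b = stack.pop(), a = stack.pop();
        stack.push(emit("add", 0, a, b));             // iadd: folded to const 4
        System.out.println(stack.peek());             // Node[op=const, k=4, ...]
        System.out.println(emitted.size() + " IR nodes emitted"); // 0
    }
}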

  12. Global Optimizations (Pass 2: DFG Generation) • Global optimizations also immediately applied to newly created IR expressions • Global forward flow information available for every new IR expression • Blocks processed in reverse post-order (predecessors first) • Use loop field and locals access statistics from the previous pass to calculate a fixed-point solution at loop headers • Restricted to dataflow optimizations that rely primarily on forward flow information • Global constant propagation, copy propagation, and CSE [Figure: example CFG with blocks B1–B7 and a loop locals access table]
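
A small sketch of the reverse post-order traversal that makes this work, assuming a simple adjacency-list CFG: popping a post-order stack yields an order in which every forward-edge predecessor precedes its successors, so forward flow facts are ready when each block is translated.

import java.util.*;

// Reverse post-order block numbering over an adjacency-list CFG.
public class Rpo {
    static void postorder(int b, List<List<Integer>> succ,
                          boolean[] seen, Deque<Integer> out) {
        seen[b] = true;
        for (int s : succ.get(b))
            if (!seen[s]) postorder(s, succ, seen, out);
        out.push(b);   // pushed after successors, so popping gives RPO
    }

    public static void main(String[] args) {
        // B0 -> B1, B2; B1 -> B3; B2 -> B3 (a simple diamond)
        List<List<Integer>> succ = List.of(
            List.of(1, 2), List.of(3), List.of(3), List.of());
        Deque<Integer> order = new ArrayDeque<>();
        postorder(0, succ, new boolean[4], order);
        System.out.println(order); // [0, 2, 1, 3]: both predecessors before B3
    }
}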

  13. Loop Invariant Code Motion (Pass 2: DFG Generation) • Check loop statistics to make sure source arguments are not redefined in the loop • Can perform code motion on dependent instructions without iterating • Hoisted IR expressions immediately communicated to successive instructions and blocks in the loop [Figure: loop with preheader (PH), header (H), and exit (E); header expressions [1] add [L0] [L1], [2] const 1, [3] sub [1] [2]; consulting the loop locals access table, [1] → [G0] and [3] → [G1] are hoisted into the preheader]
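
A sketch of the invariance test, assuming the pass-1 access table boils down to the set of locals written inside the loop (types and names are illustrative): marking hoisted nodes immediately lets dependent expressions hoist in the same sweep, without iterating.

import java.util.*;

// An expression hoists if none of the locals it reads are stored to in
// the loop, and every IR source it depends on has itself been hoisted.
public class Licm {
    record Expr(int id, Set<Integer> localsRead, Set<Integer> deps) {}

    static List<Integer> hoistable(List<Expr> loopBody,
                                   Set<Integer> localsWrittenInLoop) {
        Set<Integer> hoisted = new HashSet<>();
        List<Integer> order = new ArrayList<>();
        for (Expr e : loopBody) {
            boolean inv = Collections.disjoint(e.localsRead(), localsWrittenInLoop)
                       && hoisted.containsAll(e.deps());
            if (inv) { hoisted.add(e.id()); order.add(e.id()); }
        }
        return order;
    }

    public static void main(String[] args) {
        // The slide's example: [1] add L0,L1  [2] const 1  [3] sub [1],[2]
        List<Expr> body = List.of(
            new Expr(1, Set.of(0, 1), Set.of()),
            new Expr(2, Set.of(), Set.of()),
            new Expr(3, Set.of(), Set.of(1, 2)));
        // The loop writes only local 2, never L0 or L1, so all three hoist;
        // [3] hoists in the same sweep because [1] and [2] are already marked.
        System.out.println(hoistable(body, Set.of(2)));  // [1, 2, 3]
    }
}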

  14. Inlining (Pass 2: DFG Generation) • Optimized for small methods • Handles nested inlining • Important for object initializers with deep subclassing • Can inline non-final public virtual and interface methods with only one target found at runtime • Protected with a class check
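
Conceptually, the class-check guard is equivalent to the following Java (Shape and Circle are hypothetical example classes, not from the paper): the inlined body runs behind a cheap type test, and virtual dispatch survives only on the fallback path.

abstract class Shape { abstract double area(); }

final class Circle extends Shape {
    double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

public class GuardedInline {
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            if (s.getClass() == Circle.class) {      // guard: class check
                Circle c = (Circle) s;
                sum += Math.PI * c.r * c.r;          // inlined Circle.area()
            } else {
                sum += s.area();                     // fallback: virtual call
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(totalArea(new Shape[] { new Circle(1), new Circle(2) }));
    }
}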

  15. Pass 3: Code Generation • Registers allocated dynamically as code is generated • Instruction scheduling within a basic block • Use standard list scheduling techniques • Fills load and branch delay slots • Successfully ported to three different ISAs • MIPS, SPARC, StrongARM • Ports took only a few weeks to implement • Plans to port to x86
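
A compact sketch of list scheduling within a block, assuming illustrative latencies and a toy dependence graph (a real scheduler ranks ready instructions by critical-path length; this sketch picks first-fit): instructions issue once their predecessors' results are available, so an independent instruction naturally fills a load delay slot.

import java.util.*;

// Greedy list scheduling: each cycle, issue a ready instruction whose
// dependence predecessors have completed; emit a nop if none is ready.
public class ListSched {
    record Insn(String name, int latency, List<Integer> preds) {}

    static List<String> schedule(List<Insn> insns) {
        int n = insns.size();
        int[] readyAt = new int[n];              // earliest issue cycle per insn
        boolean[] done = new boolean[n];
        List<String> out = new ArrayList<>();
        for (int cycle = 0, issued = 0; issued < n; cycle++) {
            int pick = -1;
            for (int i = 0; i < n && pick < 0; i++) {
                if (done[i] || readyAt[i] > cycle) continue;
                boolean ok = true;
                for (int p : insns.get(i).preds()) ok &= done[p];
                if (ok) pick = i;
            }
            if (pick < 0) { out.add("nop"); continue; } // nothing ready this cycle
            done[pick] = true; issued++;
            out.add(insns.get(pick).name());
            // successors cannot start until this result is available
            for (int i = 0; i < n; i++)
                if (insns.get(i).preds().contains(pick))
                    readyAt[i] = Math.max(readyAt[i],
                                          cycle + insns.get(pick).latency());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Insn> block = List.of(
            new Insn("ld  r1,[r0]", 2, List.of()),   // load: one delay slot
            new Insn("add r2,r1,1", 1, List.of(0)),  // uses the load result
            new Insn("mov r3,7",    1, List.of()));  // independent: fills the slot
        System.out.println(schedule(block)); // [ld  r1,[r0], mov r3,7, add r2,r1,1]
    }
}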

  16. Fast Optimization of Machine Idioms (Pass 3: Code Generation) • Traditionally done using a peephole optimizer • Requires an additional pass over generated code • Compiler features allow optimization of machine idioms without an additional pass • Machine-specific code can be invoked in both passes • Configurable IR expressions • Deferred code generation of IR expressions • Optimized machine idioms • Register calling conventions • Mapping branch implementations • Immediate operands • Different addressing modes
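
As one example of deferred code generation, a sketch of the immediate-operand idiom on SPARC (the emit helper is hypothetical): a const node emits nothing when created, and each consumer folds the value into its immediate field when it fits simm13, materializing a register only when it does not.

// Deferred constant materialization for SPARC's 13-bit signed immediates.
public class ImmIdiom {
    static boolean fitsSimm13(int v) { return v >= -4096 && v <= 4095; }

    // add rd, rs, (const k): the consumer folds the constant if it fits
    static String emitAddConst(String rd, String rs, int k) {
        if (fitsSimm13(k))
            return "add " + rs + "," + k + "," + rd;  // immediate form, no move
        // too wide: materialize the constant first (sethi/or pair on SPARC)
        return "sethi %hi(" + k + "),%g1\n"
             + "or %g1,%lo(" + k + "),%g1\n"
             + "add " + rs + ",%g1," + rd;
    }

    public static void main(String[] args) {
        System.out.println(emitAddConst("%o0", "%o1", 1));       // folds to immediate
        System.out.println(emitAddConst("%o0", "%o1", 1 << 20)); // needs sethi/or
    }
}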

  17. Code Generation Example (Pass 3: Code Generation) • IR expressions from DFG generation:
[1] load @ [L0]+16
[2] const 5
[3] const &newarray
[4] call [3] ([2] [1]) → [L1]
[5] const 1
[6] add [1] [5]
[7] store [6] @ [L0]+16
• Generated SPARC code:
ldw [%l0+16],%o1
mov 5,%o0
mov %o1,%l1
call newarray
add %l1,1,%g1
stw %g1,[%l0+16]
• Register conventions: %ln call-preserved reg, %on argument reg, %gn temp reg • The allocator tracks {block, global} use counts and the last use of each expression; a register is freed once its remaining uses reach zero, and [1], still live across the call, is copied from argument register %o1 to call-preserved %l1 before the call

  18. Global Register Allocation (Pass 3: Code Generation) [Figure: CFG with blocks B0–B5 joined at points J0–J2; at each join, the registers flowing out of predecessor blocks are matched against those flowing into successor blocks, and blocks reserve outgoing registers so that values arrive in agreed locations]

  19. Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions

  20. Experiment Setup • SPARC VMs chosen for comparison • Large number of VMs with source code available • Required for timing and memory use instrumentation • Neutral RISC ISA • No embedded JITs available for comparison • Variety of benchmarks chosen • Benchmark suites – SPECjvm98, Java Grande, jBYTEmark • Other significant applications – MipsSimulator, h263 Decoder, jLex, jpeg2000

  21. Comparisons to Other Dynamic Compilers

  22. Compilation Speed • 30% faster than Sun-client • 2.5x faster than the nearest dataflow compiler (LaTTe) • Test platform: UltraSPARC II @ 200MHz, Sun Solaris 8

  23. Time spent in each compiler pass • CFG construction consistently < 10% of compile time • DFG generation grows proportionally for large methods • CSE time grows with increasing code size • Can improve code generation time for large methods by limiting optimizations whose costs grow with method size

  24. Performance on Long Running Benchmarks • Compilation time is proportionally smaller relative to execution time • Collected times also include the Sun interpreter • Good performance for numerical programs • Performance suffers on object-oriented code [Chart: speedup normalized to microJIT]

  25. Performance on Short Running Benchmarks • Compilation time is proportionally larger relative to execution time • A fast optimizing compiler can compete with lazy compilation on total run time [Chart: speedup normalized to microJIT]

  26. Factors limiting microJIT performance • Sun-client and Sun-server support speculative inlining • Inline non-final public virtual and interface calls that have only one target • Decompile and fix if class loading adds new targets • Garbage collection overheads are higher for our system, which impacted object-oriented programs

  27. Dynamic Memory Usage • microJIT compiler requires 2x the memory of Sun-client, but less than ¼ that of the dataflow compilers • 250KB is sufficient to compile a 1KB method • Can reduce memory requirements for compiling large methods by building the DFG and generating code for only subsections of the CFG per pass • A 300KB native code buffer is sufficient for the largest benchmark applications (pizza compiler and jpeg2000)

  28. Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions

  29. Conclusions • Proposed Java dynamic compilation scheme for embedded devices • Compile all code • Fast compiler which performs aggressive optimizations • Results show potential of this approach • Small dynamic and static memory footprint • Good compilation speed and generated code performance • Possible improvements • Memory usage and compilation performance on large methods • Implement additional optimizations • Aggressive array bounds check removal from loops
