
Using Platform-Specific Performance Counters for Dynamic Compilation


Presentation Transcript


  1. Using Platform-Specific Performance Counters for Dynamic Compilation Florian Schneider and Thomas Gross ETH Zurich

  2. Introduction & Motivation
     • Dynamic compilers are a common execution platform for OO languages (Java, C#)
     • Properties of OO programs are difficult to analyze at compile time
     • A JIT compiler can immediately use information obtained at run time

  3. Introduction & Motivation
     Types of information:
     • Profiles: e.g. execution frequency of methods / basic blocks
     • Hardware-specific properties: cache misses, TLB misses, branch mispredictions

  4. Outline
     • Introduction
     • Requirements
     • Related work
     • Implementation
     • Results
     • Conclusions

  5. Requirements
     • Infrastructure flexible enough to measure different execution metrics
     • Hide machine-specific details from the VM
     • Keep changes to the VM/compiler minimal
     • Keep the runtime overhead of collecting information from the CPU low
     • Information must be precise enough to be useful for online optimization

  6. Related work
     • Profile-guided optimization
       • Code positioning [Pettis PLDI 1990]
     • Hardware performance monitors
       • Relating HPM data to basic blocks [Ammons PLDI 1997]
       • "Vertical profiling" [Hauswirth OOPSLA 2004]
     • Dynamic optimization
       • Mississippi Delta [Adl-Tabatabai PLDI 2004]
       • Object reordering [Huang OOPSLA 2004]
     • Our work:
       • No instrumentation
       • Uses profile data + hardware info
       • Targets fully automatic dynamic optimization

  7. Hardware performance monitors
     • Sampling-based counting: the CPU reports its state every n events
     • Precision is platform-dependent (pipelines, out-of-order execution)
     • Sampling provides method-, basic-block-, or instruction-level information
     • Newer CPUs support precise sampling (e.g. Pentium 4, Itanium)

  8. Hardware performance monitors
     • A way to localize performance bottlenecks
     • The sampling interval determines how fine-grained the information is
     • Smaller sampling interval → more data
     • Trade-off: precision vs. runtime overhead
     • Need enough samples for a representative picture of program behavior

  9. Implementation
     Main parts:
     • Kernel module: low-level access to the hardware, per-process counting
     • User-space library: hides kernel & device-driver details from the VM
     • Java VM thread: collects samples periodically and maps them to Java code (sketched below)
     • Implemented on top of Jikes RVM
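The deck gives only this component breakdown; as a point of reference, here is a minimal sketch of what the VM-side collector thread could look like. The SampleSource and SampleMapper interfaces are hypothetical stand-ins for the user-space library and the PC-mapping logic, not names from the paper.

    import java.util.List;

    // Hypothetical stand-ins: SampleSource wraps the user-space library that
    // drains the kernel module's sample buffer; SampleMapper resolves a raw
    // sample (PC + register values) to a Java method / bytecode instruction.
    interface SampleSource { List<long[]> drain(); }
    interface SampleMapper { void record(long[] sample); }

    final class CollectorThread extends Thread {
        private final SampleSource source;
        private final SampleMapper mapper;
        private final long pollMillis;           // fixed polling interval
        private volatile boolean running = true;

        CollectorThread(SampleSource source, SampleMapper mapper, long pollMillis) {
            this.source = source;
            this.mapper = mapper;
            this.pollMillis = pollMillis;
            setDaemon(true);                     // must not keep the VM alive
        }

        @Override public void run() {
            while (running) {
                // Drain whatever samples accumulated since the last poll and
                // hand them to the mapper; this runs concurrently with the
                // application, so the mapping step must stay cheap.
                for (long[] sample : source.drain()) {
                    mapper.record(sample);
                }
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    return;                      // shut down when interrupted
                }
            }
        }

        void shutdown() { running = false; interrupt(); }
    }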

  10. System overview

  11. Implementation
      • Supported events:
        • L1 and L2 cache misses
        • DTLB misses
        • Branch mispredictions
      • Parameters of the monitoring module:
        • Buffer size (fixed)
        • Polling interval (fixed)
        • Sampling interval (adaptive)
      • The runtime overhead is kept roughly constant by adjusting the sampling interval automatically at run time (see the sketch below)
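The slide states the goal, constant overhead via an adaptive sampling interval, but not the control law. A minimal sketch of one plausible scheme (multiplicative adjustment toward a fixed per-poll cycle budget; the names and the policy itself are assumptions, not taken from the paper):

    // Called once per fixed polling interval; doubles the sampling interval
    // (fewer samples) when the observed cost exceeds the budget, and halves
    // it (more samples) when the cost drops well below the budget.
    final class AdaptiveSamplingInterval {
        private final long budgetCyclesPerPoll;  // target overhead per poll
        private long interval;                   // events between two samples

        AdaptiveSamplingInterval(long budgetCyclesPerPoll, long initialInterval) {
            this.budgetCyclesPerPoll = budgetCyclesPerPoll;
            this.interval = initialInterval;
        }

        long adjust(long observedCyclesPerPoll) {
            if (observedCyclesPerPoll > budgetCyclesPerPoll) {
                interval *= 2;                            // too expensive
            } else if (observedCyclesPerPoll < budgetCyclesPerPoll / 2) {
                interval = Math.max(1, interval / 2);     // headroom left
            }
            return interval;   // new value would be written to the kernel module
        }
    }

Multiplicative steps let the interval converge quickly after a phase change, while the dead band between budget/2 and budget avoids oscillating around the target.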

  12. From raw data to Java
      • Determine the method + bytecode instruction
      • Build a sorted method table
      • Map the machine-code offset to a bytecode

      Example machine code from the slide, with the bytecodes it implements:

        0x080485e1: mov  0x4(%esi),%esi
        0x080485e4: mov  $0x4,%edi
        0x080485e9: mov  (%esi,%edi,4),%esi
        0x080485ec: mov  %ebx,0x4(%esi)
        0x080485ef: mov  $0x4,%ebx
        0x080485f4: push %ebx
        0x080485f5: mov  $0x0,%ebx
        0x080485fa: push %ebx
        0x080485fb: mov  0x8(%ebp),%ebx
        0x080485fe: push %ebx
        0x080485ff: mov  (%ebx),%ebx
        0x08048601: call *0x4(%ebx)
        0x08048604: add  $0xc,%esp
        0x08048607: mov  0x8(%ebp),%ebx
        0x0804860a: mov  0x4(%ebx),%ebx

      Annotated bytecodes: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL

  13. From raw data to Java
      • A sample gives the PC + register contents
      • PC → machine code → compiled Java code → bytecode instruction
      • For the data address: use the registers + the machine code to compute the target address (sketched below)
      • Example: GETFIELD → indirect load

          mov 12(%eax), %eax   // 12 = offset of the field
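A minimal sketch of that address computation for the two load shapes in the listing above ("mov 12(%eax),%eax" for a field access, "mov (%esi,%edi,4),%esi" for an array access). Decoding the instruction at the sampled PC is elided, and the class and method names are hypothetical:

    final class EffectiveAddress {
        // Field access: base register value + decoded displacement,
        // e.g. sampled EAX + 12 for "mov 12(%eax),%eax".
        static long fieldAccess(long baseReg, long displacement) {
            return baseReg + displacement;
        }

        // Array access: base + index * scale,
        // e.g. sampled ESI + sampled EDI * 4 for "mov (%esi,%edi,4),%esi".
        static long arrayAccess(long baseReg, long indexReg, int scale) {
            return baseReg + indexReg * scale;
        }
    }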

  14. Engineering issues
      • Lookup of the PC to get the method / bytecode instruction must be efficient
        • Runs in parallel with the user program
        • Use binary search / a hash table (see the sketch below)
        • Update the table at recompilation and GC
      • Identify 100% of instructions (PCs):
        • Include samples from application, VM, and library code
        • Deal with native parts
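A minimal sketch of the binary-search variant of that lookup, assuming a table of non-overlapping code ranges sorted by start address. The types are hypothetical; the real system must also translate the offset within the method to a bytecode index and rebuild the table at recompilation and GC:

    final class MethodTable {
        static final class Entry {
            final long start, end;      // [start, end) of the compiled code
            final Object method;        // stands in for the VM's method object
            Entry(long start, long end, Object method) {
                this.start = start; this.end = end; this.method = method;
            }
        }

        private final Entry[] entries;  // sorted by start address, non-overlapping

        MethodTable(Entry[] sortedEntries) { this.entries = sortedEntries; }

        // Returns the entry whose code range contains pc, or null for PCs in
        // native/VM code that is not in the table.
        Entry lookup(long pc) {
            int lo = 0, hi = entries.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                Entry e = entries[mid];
                if (pc < e.start)      hi = mid - 1;
                else if (pc >= e.end)  lo = mid + 1;
                else                   return e;   // start <= pc < end
            }
            return null;
        }
    }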

  15. Infrastructure
      • Jikes RVM 2.3.5 on a Linux 2.4 kernel as the runtime platform
      • Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
      • The measured data show:
        • Runtime overhead
        • Extraction of meaningful information

  16. Runtime overhead
      • Experiment setup: monitor L2 cache misses

  17. Runtime overhead: SPECjbb
      Total cost per sample: ~3000 cycles

  18. Measurements
      • Measure which instructions produce the most events (cache misses, branch mispredictions)
      • Potential for data-locality and control-flow optimizations
      • Compare different SPEC benchmarks
      • Find "hot spots": the instructions that produce 80% of all measured events (see the sketch below)
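A minimal sketch of that hot-spot computation: sort instructions by event count and take the smallest prefix covering 80% of all events, which is what the "80% quantile = k instructions" numbers on the following slides report. The helper is an illustration, not the paper's code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    final class HotSpots {
        // Returns the PCs of the smallest set of instructions whose event
        // counts sum to at least `fraction` (e.g. 0.8) of all events.
        static List<Long> find(Map<Long, Long> eventsPerPc, double fraction) {
            long total = 0;
            for (long c : eventsPerPc.values()) total += c;

            List<Map.Entry<Long, Long>> sorted = new ArrayList<>(eventsPerPc.entrySet());
            sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // hottest first

            List<Long> hot = new ArrayList<>();
            long covered = 0;
            for (Map.Entry<Long, Long> e : sorted) {
                if (covered >= fraction * total) break;
                hot.add(e.getKey());
                covered += e.getValue();
            }
            return hot;
        }
    }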

  19. L1/L2 Cache misses [charts]
      80% quantile = 21 instructions (N = 571); 80% quantile = 13 instructions (N = 295)

  20. L1/L2 Cache misses [charts]
      80% quantile = 477 instructions (N = 8526); 80% quantile = 76 instructions (N = 2361)

  21. L1/L2 Cache misses [charts]
      80% quantile = 1296 instructions (N = 3172); 80% quantile = 153 instructions (N = 672)

  22. Branch prediction [charts]
      80% quantile = 307 instructions (N = 4193); 80% quantile = 1575 instructions (N = 7478)

  23. Summary
      • The distribution of events over the program differs significantly between benchmarks
      • Challenge: are the data precise enough to guide optimizations in a dynamic compiler?

  24. Further work
      • Apply the information in the optimizer
        • Data: access-path expressions (p.x.y)
        • Control flow: inlining, I-cache locality
      • Investigate flexible sampling intervals
      • Further optimizations of the monitoring system
        • Replace expensive JNI calls
        • Avoid copying of samples

  25. Concluding remarks
      • Precise performance-event monitoring is possible with low overhead (~2%)
      • The monitoring infrastructure is tied into the Jikes RVM compiler
      • Instruction-level information allows optimizations to focus on "hot spots"
      • A good platform for studying how to couple compiler decisions to hardware-specific platform properties
