
Microarchitectural Characterization of Production JVMs and Java Workload (work in progress)


Presentation Transcript


  1. Microarchitectural Characterization of Production JVMs and Java Workload (work in progress) Jungwoo Ha (UT Austin), Magnus Gustafsson (Uppsala Univ.), Stephen M. Blackburn (Australian Nat'l Univ.), Kathryn S. McKinley (UT Austin)

  2. Challenges of JVM Performance Analysis • Controlling nondeterminism • Just-In-Time compilation driven by nondeterministic sampling • Garbage collectors • Other helper threads • Production JVMs are not created equal • Thread model (kernel vs. user threads) • Types of helper threads • Need a solid measurement methodology! • Isolate each part of the JVM

  3. Forest and Trees • Which performance metrics explain performance differences and bottlenecks? • Cache misses? L1 or L2? • TLB misses? • Number of instructions? • Inspecting one or two metrics is not always enough • Hardware performance counters expose only a small number of events at a time • Multiple invocations of the workload are therefore inevitable for a full measurement

  4. Case Study: jython • Application performance (Cycles)

  5. Case Study: jython • L1 Instruction cache miss/cyc

  6. Case Study: jython • L1 Data cache miss/cyc

  7. Case Study: jython • Total instructions executed (retired)

  8. Case Study: jython • L2 Data cache miss/cycle

  9. Project Status • Established a methodology to characterize application code performance • Large number of metrics (40+) measured with hardware performance counters • Apples-to-apples comparison of JVMs using standard interfaces (JVMTI, JNI) • Simulator data for detailed analysis • Limit studies • What if the L1 cache had no misses? • More performance metrics • e.g., µop mix

  10. Performance Counter Methodology • Collecting n metrics • x warmup iterations (x = 10) • p performance counters (can measure at most p metrics per iteration) • n/p iterations needed for measurement • k redundant measurements for statistical validation (k = 1) • Need to hold the workload constant across multiple measurements • [Timeline figure, labels: Warmup JVM (1st–xth iteration); Stop JIT ((x+1)th iteration); Measured Run, Full Heap GC, change metric ((x+2)th–(x+2+(n/p)·k)th iteration); Invoke JVM y times]
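As a concrete instance of this schedule, using the numbers that appear later in the experiments (n = 40 metrics, p = 2 hardware counters, x = 10 warmups, k = 1), each JVM invocation would run 10 warmup iterations followed by roughly (n/p)·k = 40/2 × 1 = 20 measured iterations, with the pair of counted events changed between measured iterations.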

  11. Performance Counter Methodology • Stop-the-world garbage collector • No concurrent marking • One perfctr instance per pthread • JVM internal threads run in different pthreads from the application • JVMTI callbacks • Thread start – start counters • Thread finish – stop counters • GC start – pause counters (needed only for user-level threads) • GC stop – resume counters (needed only for user-level threads)
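A minimal sketch of how these callbacks could drive per-thread counting through JVMTI and PAPI. The event choices, the pause/resume approximation, and the bookkeeping here are illustrative assumptions, not the authors' actual agent:

```c
/* Illustrative JVMTI agent sketch: start/stop a PAPI event set per thread
 * and pause it around stop-the-world GC.  Error handling is omitted. */
#include <jvmti.h>
#include <papi.h>
#include <pthread.h>
#include <string.h>

static __thread int event_set = PAPI_NULL;   /* one PAPI/perfctr event set per pthread */

static void JNICALL thread_start(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread) {
    PAPI_register_thread();
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_CYC); /* example pair; rotated between iterations */
    PAPI_add_event(event_set, PAPI_L1_ICM);
    PAPI_start(event_set);                   /* thread start -> start counters */
}

static void JNICALL thread_end(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread) {
    long long values[2];
    PAPI_stop(event_set, values);            /* thread finish -> stop counters */
    /* ... record `values` for this thread ... */
    PAPI_unregister_thread();
}

static void JNICALL gc_start(jvmtiEnv *jvmti) {
    /* GC start -> "pause": snapshot application-only counts.  Only matters
     * when the collector runs on the same pthread as the application
     * (user-level threads). */
    long long values[2];
    PAPI_read(event_set, values);
    /* ... accumulate `values` into the application total ... */
}

static void JNICALL gc_finish(jvmtiEnv *jvmti) {
    PAPI_reset(event_set);                   /* GC stop -> discard GC-period counts */
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
    jvmtiEnv *jvmti;
    (*vm)->GetEnv(vm, (void **)&jvmti, JVMTI_VERSION_1_0);

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_thread_init((unsigned long (*)(void))pthread_self);

    jvmtiCapabilities caps;
    memset(&caps, 0, sizeof(caps));
    caps.can_generate_garbage_collection_events = 1;
    (*jvmti)->AddCapabilities(jvmti, &caps);

    jvmtiEventCallbacks cb;
    memset(&cb, 0, sizeof(cb));
    cb.ThreadStart = thread_start;
    cb.ThreadEnd = thread_end;
    cb.GarbageCollectionStart = gc_start;
    cb.GarbageCollectionFinish = gc_finish;
    (*jvmti)->SetEventCallbacks(jvmti, &cb, sizeof(cb));

    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_THREAD_START, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_THREAD_END, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_GARBAGE_COLLECTION_START, NULL);
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_GARBAGE_COLLECTION_FINISH, NULL);
    return JNI_OK;
}
```

Reading at GC start and resetting at GC finish approximates pause/resume, so that collector work sharing the application's pthread is excluded from the application counts.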

  12. Methodology Limitations • Cannot factor out memory barrier overhead • Use the garbage collector with the least application overhead • If a helper thread runs in the same pthread as the application (user-level threads), it causes perturbation • No evidence of this in J9, HotSpot, or JRockit • Instrumented code overhead • Must be included in the measurement

  13. Experiment • Performance Counter Experiment • Pentium-M uniprocessor • 32KB 8-way L1 cache (data & instruction) • 2MB 4-way L2 cache • 2 hardware counters (18 if multiplexed) • 1GB memory • 32-bit Linux 2.6.20 with the perfctr patch • PAPI 3.5.0 library • Simulator Experiment • PTLsim (http://www.ptlsim.org) x86 simulator • 64-bit AMD Athlon
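With only two programmable counters on the Pentium-M, counting up to 18 events in one run relies on PAPI's software multiplexing. A minimal sketch of enabling it follows; the particular event selection is an assumption for illustration:

```c
#include <papi.h>

/* Sketch: build a multiplexed PAPI event set so that more events than the
 * two physical Pentium-M counters can be sampled in a single measured run. */
static int make_multiplexed_eventset(void) {
    int es = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_multiplex_init();              /* enable software multiplexing support */
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC);   /* add one event first to bind the set */
    PAPI_set_multiplex(es);             /* time-slice the physical counters */
    PAPI_add_event(es, PAPI_TOT_INS);   /* now more events than physical counters */
    PAPI_add_event(es, PAPI_L1_DCM);
    PAPI_add_event(es, PAPI_L2_DCM);
    PAPI_add_event(es, PAPI_TLB_IM);
    return es;
}
```

Multiplexing trades accuracy for coverage, since each event is only counted for a fraction of the run.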

  14. Experiment • 3 production JVMs × 2 versions • IBM J9, Sun HotSpot JVM, JRockit (perfctr only) • 1.5 and 1.6 • Heap size = max(16MB, 4 × minimum heap size) • 18 benchmarks • 9 DaCapo benchmarks • 8 SPECjvm98 • 1 PseudoJBB

  15. Experiment • 40+ metrics • 40 distinct metrics from hardware performance counters • L1 and L2 cache misses (instruction, data, read, write) • Instruction TLB misses • Branch prediction • Resource stalls • Richer metrics from the simulator • Micro-operation (µop) mix • Load to store
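Most of the counter-based metrics listed here correspond to standard PAPI preset events. The sketch below shows the kind of event table that could be rotated, two at a time, through the measured iterations; this particular selection is an assumption, not the authors' exact list:

```c
#include <papi.h>

/* Illustrative subset of the counter metrics, expressed as PAPI presets.
 * Pairs of these would be rotated through the measured iterations. */
static const int METRIC_EVENTS[] = {
    PAPI_L1_ICM,   /* L1 instruction cache misses */
    PAPI_L1_DCM,   /* L1 data cache misses        */
    PAPI_L2_ICM,   /* L2 instruction cache misses */
    PAPI_L2_DCM,   /* L2 data cache misses        */
    PAPI_TLB_IM,   /* instruction TLB misses      */
    PAPI_BR_MSP,   /* mispredicted branches       */
    PAPI_RES_STL,  /* cycles stalled on resources */
    PAPI_TOT_INS,  /* instructions retired        */
    PAPI_TOT_CYC,  /* total cycles                */
};
```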

  16. Performance Counter Results (Cycle Counts) • PseudoJBB • pmd • jython • jess

  17. Performance Counter Results (Cycle Counts) • jack • hsqldb • compress • db

  18. Performance Counter Results • IBM J9 1.6 performed better than Sun HotSpot 1.6 on average • JRockit has the most variation in performance • Full results • ~800 graphs • Full jython results in the paper • http://z.cs.utexas.edu/users/habals/jvmcmp • or Google my name (Jungwoo Ha)

  19. Future Work • JVM activity characterization • Garbage collector • JIT • Statistical analysis of performance metrics • Metric correlation • Methodology to identify performance bottlenecks • Multicore performance analysis

  20. Conclusions • Methodology for production JVM comparison • Performance evaluation data • Simulator results for deeper analysis

  21. Thank you!

  22. Simulation Result

  23. Perfect Cache - compress

  24. Perfect Cache - db
