Online Performance Auditing: Using Hot Optimizations Without Getting Burned
Jeremy Lau (UCSD, IBM), Matthew Arnold (IBM), Michael Hind (IBM), Brad Calder (UCSD)
Problem
• Trend: Increasing complexity of computer systems
  • Hardware: more speculation and parallelism
  • Software: more abstraction layers and virtualization
• Increasing complexity makes it more difficult to reason about performance
  • Will optimization X improve performance?
Increasing Complexity
• Increasing distance between application and raw performance
• Modern stack (Application / Application Server / Java VM / OS / Hypervisor / Hardware) vs. the classic Application-OS-Hardware stack
• Hard to predict how all layers will react to an application-level optimization
Heuristics
• When should I use optimization X?
• Common solution: Use heuristics
  • Example: Apply optimization X if code size < N
  • "We believe X will improve performance when code size < N"
  • Determine N by running benchmarks and tuning to maximize average performance
• But heuristics will miss opportunities to improve performance
  • Because they are tuned for the average case
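A minimal sketch of such a size-based heuristic, assuming a hypothetical `shouldInline` hook and a tuned threshold N; the names are illustrative only, not J9's actual inliner interface.

```java
// Hypothetical size-based inlining heuristic: N is tuned offline to maximize
// average performance across a benchmark suite, so individual methods can
// still be hurt (or miss speedups) at the chosen threshold.
final class SizeBasedInliningHeuristic {
    private final int maxCalleeSize; // the tuned threshold "N"

    SizeBasedInliningHeuristic(int maxCalleeSize) {
        this.maxCalleeSize = maxCalleeSize;
    }

    boolean shouldInline(int calleeBytecodeSize) {
        // "We believe X will improve performance when code size < N"
        return calleeBytecodeSize < maxCalleeSize;
    }
}
```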
Experiment
• Aggressive inlining: 4x inlining thresholds
  • Allows much larger methods to be inlined
• Apply aggressive inlining to one hot method at a time
• Calculate per-method speedups vs. the default inlining policy
• Use the cycle counter to measure performance
Experiment Results
• Per-method speedups: aggressive inlining vs. default inlining
• Using J9, IBM's high-performance Java VM
• [Per-method speedup chart not reproduced here]
Experiment Analysis
• Aggressive inlining: mixed results
  • More slowdowns than speedups
  • But there are significant speedups!
Wishful Thinking
• Dream: A world without slowdowns
• Default inlining heuristics miss these opportunities to improve performance
• Goal: Be aggressive only when it produces a speedup
Approach
• Determine whether an optimization improves or degrades performance as the program executes
  • For general-purpose applications
  • Using VM support (dynamic compilation)
• Plan:
  • Compile two versions of the code: with and without the optimization
  • Measure the performance of both versions
  • Use the best-performing version
Benefits
• Defense: Avoid slowdowns due to poor optimization decisions
  • Sometimes O3 is slower than O2. Detect and correct
• Offense: Find speedups by searching the optimization space
  • Try high-risk optimizations without fear of long-term slowdowns
Challenge
• Which implementation is fastest?
  • Decide online, without stopping and restarting the program
• Can't just invoke each version once and compare times
  • Changing inputs, global state, etc.
  • Example: Sorting routine, where input size determines run time
    • SortVersionA(10 entries) vs. SortVersionB(1,000,000 entries)
    • Invocation timings don't reflect the performance of A and B
    • Unless we know that input size correlates with runtime
    • But that requires a high-level understanding of program behavior
• Solution: Collect multiple timing samples for each version
  • Use statistics to determine how many samples to collect
Timing Infrastructure
• On each invocation of Sort(): randomly choose Version A or B, start the timer, run the chosen version, stop the timer, and record the timing at method exit
• Can generalize:
  • Doesn't have to be method granularity
  • Can use more than two versions
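A minimal sketch of this timing harness in Java. It assumes two compiled versions of a method are reachable behind a common entry point; in the real system the VM patches the method entry and each sample comes from a normal invocation by the running application, so the class and method names here are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative timing harness: randomly pick a version, time one invocation,
// and record the sample for later statistical analysis.
final class TimingHarness {
    private final Random rng = new Random();
    final List<Long> timingsA = new ArrayList<>();
    final List<Long> timingsB = new ArrayList<>();

    // Each Runnable stands in for one compiled version of the audited method.
    void timedInvoke(Runnable versionA, Runnable versionB) {
        boolean pickA = rng.nextBoolean();          // randomly choose Version A or B
        long start = System.nanoTime();             // start timer (the prototype uses the cycle counter)
        if (pickA) versionA.run(); else versionB.run();
        long elapsed = System.nanoTime() - start;   // stop timer at method exit
        (pickA ? timingsA : timingsB).add(elapsed); // record timing
    }
}
```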
Statistical Analysis
• Input: Two sets of method timings (Version A timings, Version B timings)
• Is A faster than B? How confident are we?
• Use standard statistical hypothesis testing (t-test)
• If low confidence, collect more timing data
• Output: A is faster (or slower) than B by X% with Y% confidence
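A sketch of the comparison step, assuming roughly normal timing distributions and a few dozen samples per version so that a fixed ~95% critical value is a reasonable stand-in for the full t-distribution. The real analysis reports an explicit confidence level and speedup percentage; this simplification only returns a three-way verdict.

```java
// Illustrative Welch-style t-test over two sets of timing samples.
final class TimingComparison {
    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double variance(double[] xs, double mean) {
        double s = 0;
        for (double x : xs) s += (x - mean) * (x - mean);
        return s / (xs.length - 1);
    }

    /** Returns +1 if A is confidently faster, -1 if confidently slower, 0 if undecided. */
    static int compare(double[] a, double[] b) {
        double meanA = mean(a), meanB = mean(b);
        double se = Math.sqrt(variance(a, meanA) / a.length + variance(b, meanB) / b.length);
        double t = (meanB - meanA) / se;   // positive when A's timings are lower (A is faster)
        double critical = 1.96;            // ~95% confidence, large-sample approximation
        if (t > critical)  return +1;
        if (t < -critical) return -1;
        return 0;                          // low confidence: collect more timing data
    }
}
```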
Time to Converge
• How long will it take to reach a confident conclusion?
  • Any speedup can be detected with enough timing data
• Time to converge depends on:
  • Variance in the timing data
    • Easy to detect a speedup if the method always does the same amount of work
  • Speedup due to the optimization
    • Easy to detect big speedups
• Fastest convergence for low-variance methods with high speedup
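As a rough rule of thumb (a standard two-sample sample-size formula, not a result from the paper), the number of samples needed per version to detect a mean timing difference δ when the per-sample standard deviation is σ grows with (σ/δ)²:

```latex
n \;\approx\; 2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\left(\frac{\sigma}{\delta}\right)^{2}
```

Here the z terms are normal quantiles fixed by the desired false-positive rate α and detection power 1-β. Halving the speedup roughly quadruples the samples required, which is why low-variance, high-speedup methods converge fastest.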
Fixed Number of Samples
• Why not just collect 100 samples?
• Experiment: Try to detect an X% speedup with 100 samples
  • How often do the samples indicate a slowdown?
  • Each slowdown detected is a false positive
    • The samples do not accurately represent the population
Fixed Number of Samples
• The number of samples needed depends on the speedup
  • More speedup → fewer samples
• Fixed sampling is inefficient
  • Suppose we want to maintain a 5% false-positive rate
  • We could always collect 10k samples, but that wastes time
• The statistical approach collects only as many samples as needed to reach a confident conclusion
Prototype Implementation
• Prototype online performance auditing system implemented in IBM's J9 Java VM
• Currently audits a single optimization
  • Experiments use aggressive inlining
  • The infrastructure is not tied to aggressive inlining; it can evaluate any single optimization
• When a method reaches the highest optimization level:
  • Compile two versions of the method (with and without aggressive inlining), collect timing data, and run the statistical analysis
  • If aggressive inlining produces a quickly detectable speedup, use it; otherwise fall back to default inlining
  • A timeout occurs when a confident conclusion is not reached within 5 seconds
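A sketch of the per-method audit loop, reusing the illustrative TimingHarness and TimingComparison helpers above. Only the 5-second timeout and the fall-back-to-default policy come from the prototype description; everything else (the names, the 30-sample warm-up, driving invocations from a loop rather than from the running application) is a hypothetical simplification.

```java
// Illustrative audit loop. In the real system samples arrive as the running
// application happens to invoke the method; here a driver loop stands in.
final class PerformanceAuditor {
    enum Verdict { USE_AGGRESSIVE_INLINING, USE_DEFAULT_INLINING }

    static Verdict audit(Runnable aggressive, Runnable defaults) {
        TimingHarness harness = new TimingHarness();
        long deadline = System.nanoTime() + 5_000_000_000L;          // 5-second timeout
        while (System.nanoTime() < deadline) {
            harness.timedInvoke(aggressive, defaults);               // collect one more sample
            if (harness.timingsA.size() < 30 || harness.timingsB.size() < 30) {
                continue;                                            // too few samples to test yet
            }
            int verdict = TimingComparison.compare(
                    toArray(harness.timingsA), toArray(harness.timingsB));
            if (verdict > 0) return Verdict.USE_AGGRESSIVE_INLINING; // confident speedup
            if (verdict < 0) return Verdict.USE_DEFAULT_INLINING;    // confident slowdown
            // not confident yet: keep collecting timing data
        }
        return Verdict.USE_DEFAULT_INLINING;                         // timeout: stay conservative
    }

    private static double[] toArray(java.util.List<Long> samples) {
        double[] out = new double[samples.size()];
        for (int i = 0; i < samples.size(); i++) out[i] = samples.get(i);
        return out;
    }
}
```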
Timeouts
• Good news: few incorrect decisions
• Timeouts happen because only one timing sample is collected per method invocation
  • Most methods are not invoked frequently enough to converge before the timeout
• Future work: reduce timeouts by reducing convergence time
  • Collect multiple timings per invocation: use loop iteration times instead of invocation times
Future Work
• Audit multiple optimizations and settings
  • Search the optimization space online, as the program executes
  • The exponential search space is both a challenge and an opportunity
• Apply prior work in offline optimization space search
• Use the Performance Auditor to tune the optimization strategy for each method
Summary
• It is not easy to predict performance
  • Should I apply optimization X?
• Online Performance Auditing
  • Measure code performance as the program executes
• Detect slowdowns
  • Due to poor optimization decisions
• Find speedups
  • Use high-risk optimizations without long-term slowdown
  • Enable online optimization space search