Friends Don’t Let Friends Tune Code Jeffrey K. Hollingsworth University of Maryland hollings@cs.umd.edu
Why Automate Performance Tuning? • Too many parameters impact performance. • Optimal performance for a given system depends on: • Details of the processor • Details of the inputs (workload) • Which nodes are assigned to the program • Other things running on the system • Parameters come from: • User code • Libraries • Compiler choices • Automated parameter tuning can be used for adaptive tuning in complex software.
Automated Performance Tuning • Goal: Maximize achieved performance • Problems: • Large number of parameters to tune • Shape of objective function unknown • Multiple libraries and coupled applications • Analytical model may not be available • Shape of objective function could change during execution! • Requirements: • Runtime tuning for long running programs • Don’t run too many configurations • Don’t try really bad configurations
I really mean don’t let people hand tune • Example: a dense matrix multiply kernel • Various Options: • Original program: 30.1 sec • Hand tuned (by developer): 11.4 sec • Auto-tuned version of the hand-tuned code: 15.9 sec • Auto-tuned original program: 8.5 sec • What Happened? • Hand tuning obscured the code's structure and prevented analysis • The auto-tuner's transformations were then not possible
Active Harmony • Runtime performance optimization • Can also support training runs • Automatic library selection (code) • Monitor library performance • Switch library if necessary • Automatic performance tuning (parameter) • Monitor system performance • Adjust runtime parameters • Hooks for Compiler Frameworks • Integrated with USC/ISI Chill
Basic Client API Overview • Only six core functions needed • Initialization/Teardown (4 functions) • Critical loop analysis (2 functions) • Available for C, C++, Fortran, and Chapel hdesc_t *harmony_init(appName, iomethod) int harmony_connect(hdesc, host, port) int harmony_bundle_file(hdesc, filename) harmony_bind_int(hdesc, varName, &ptr) int harmony_fetch(hdesc) int harmony_report(hdesc, performance)
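The client calls listed above form a simple fetch/report loop around the critical section. Below is a minimal stand-alone sketch of that loop; the Harmony functions here are stubs so the example compiles without the library (the stub bodies, the host name, the port, and the fixed configuration value 8 are all assumptions for illustration — a real build links against Active Harmony and the server chooses the values).

```c
#include <assert.h>

/* Stub declarations matching the slide's API.  In a real build these
 * come from Active Harmony and talk to the tuning server; the stubs
 * just hand back a fixed configuration so the sketch runs stand-alone. */
typedef struct { int connected; } hdesc_t;
static hdesc_t session;
static int tile;                        /* tunable variable bound below */

static hdesc_t *harmony_init(const char *app, int iomethod)
{ (void)app; (void)iomethod; return &session; }
static int harmony_connect(hdesc_t *h, const char *host, int port)
{ (void)host; (void)port; h->connected = 1; return 0; }
static int harmony_bind_int(hdesc_t *h, const char *name, int *ptr)
{ (void)h; (void)name; *ptr = 8; return 0; }   /* stub: server picks 8 */
static int harmony_fetch(hdesc_t *h) { (void)h; return 1; }
static int harmony_report(hdesc_t *h, double perf)
{ (void)h; (void)perf; return 0; }

/* Typical client structure: bind tunables once, then fetch a new
 * configuration and report its performance each trip around the
 * critical loop. */
static int tuned_kernel(void)
{
    hdesc_t *h = harmony_init("demo", 0);
    harmony_connect(h, "localhost", 1979);     /* host/port assumed */
    harmony_bind_int(h, "tile", &tile);

    for (int step = 0; step < 3; step++) {
        harmony_fetch(h);                /* get next configuration  */
        double perf = (double)tile;      /* run kernel, measure time */
        harmony_report(h, perf);         /* feed result to the search */
    }
    return tile;
}
```

With the real library the only change is removing the stubs; the loop structure is the point.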
Harmony Bundle Definitions • External bundle definition file • Equivalent client API routines available

[rchen@brood00 code_generation]$ cat bundle.cfg
harmonyApp gemm {
  { harmonyBundle TI { int {2 500 2} global } }
  { harmonyBundle TJ { int {2 500 2} global } }
  { harmonyBundle TK { int {2 500 2} global } }
  { harmonyBundle UI { int {1 8 1} global } }
  { harmonyBundle UJ { int {1 8 1} global } }
}

Each entry specifies: bundle name, value class, value range, bundle scope
A Bit More About Harmony Search • Pre-execution • Sensitivity discovery phase • Used to order, not eliminate, search dimensions • Online Algorithm • Use Parallel Rank Order search • Variation of the Rank Order algorithm • Part of the class of Generating Set algorithms • Different configurations on different nodes
Parallel Rank Ordering Algorithm • All but the best point of the simplex move. • Computations can be done in parallel.
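The move described above can be sketched in a few lines: every simplex point except the current best is reflected through the best point, and improvements are kept. This is a minimal one-dimensional sketch, not the full algorithm (the quadratic objective, the acceptance rule, and the sequential evaluation stand in for what Harmony evaluates in parallel across nodes).

```c
#include <assert.h>

/* Hypothetical 1-D objective for illustration: minimize (x - 3)^2. */
static double objective(double x) { return (x - 3.0) * (x - 3.0); }

/* One Parallel Rank Order-style step: every simplex point except the
 * current best is reflected through the best point.  In Active Harmony
 * each reflected point would be evaluated on a different node in
 * parallel; here they are evaluated in a loop.  Returns how many
 * points improved. */
static int pro_step(double pts[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (objective(pts[i]) < objective(pts[best]))
            best = i;

    int improved = 0;
    for (int i = 0; i < n; i++) {
        if (i == best) continue;             /* best point stays fixed */
        double reflected = 2.0 * pts[best] - pts[i];
        if (objective(reflected) < objective(pts[i])) {
            pts[i] = reflected;              /* keep the better point */
            improved++;
        }
    }
    return improved;
}
```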
Performance for Two Programs • [charts: High Performance Linpack and POP Ocean Code results, illustrating the benefits of searching in parallel]
Concern: Performance Variability • Performance has two components: • Deterministic: Function of parameters • Stochastic: Function of parameters and high priority job throughput. • More of a problem in clusters
Performance variability impact on optimization • We do not need the exact performance value. • We only need to compare different configurations. • Approach: Take multiple samples and use a simple operator. • Goal: Come up with an appropriate operator that results in lower variance.
Auto-tuning Compiler Transformations • Described as a series of Recipes • Recipes consist of a sequence of operations • permute([stmt],order): change the loop order • tile(stmt,loop,size,[outer-loop]): tile loop at level loop • unroll(stmt,loop,size): unroll stmt's loop at level loop • datacopy(stmt,loop,array,[index]): • Make a local copy of the data • split(stmt,loop,condition): split stmt's loop level loop into multiple loops • nonsingular(matrix)
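To make the tile and unroll operations concrete, here is a hand-written illustration of what such a recipe does to a simple loop (the loop, the tile size, and the unroll factor of 4 are assumptions for this sketch; CHiLL generates the transformed code automatically and the divisibility constraints below would be handled by cleanup loops).

```c
#include <assert.h>

#define N 64
#define TILE 16   /* a tunable parameter, like TI/TJ/TK in the recipes */

/* Reference: the untransformed loop. */
static void scale_ref(double *a)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;
}

/* The same computation after tile(...,TILE) and unroll(...,4), the
 * kind of code a recipe produces: correctness is unchanged, only the
 * iteration order and instruction mix differ.  Assumes TILE divides N
 * and 4 divides TILE. */
static void scale_tiled(double *a)
{
    for (int ii = 0; ii < N; ii += TILE)       /* tiled outer loop */
        for (int i = ii; i < ii + TILE; i += 4) {
            a[i]     = a[i]     * 2.0 + 1.0;   /* unrolled by 4 */
            a[i + 1] = a[i + 1] * 2.0 + 1.0;
            a[i + 2] = a[i + 2] * 2.0 + 1.0;
            a[i + 3] = a[i + 3] * 2.0 + 1.0;
        }
}
```

The auto-tuner's job is to pick TILE and the unroll factor; the transformation itself must preserve the results, as the test below checks.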
CHiLL+ Active Harmony Generate and evaluate different optimizations that would have been prohibitively time consuming for a programmer to explore manually. Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, Jeffrey K. Hollingsworth, “A Scalable Auto-tuning Framework for Compiler Optimization,” IPDPS 2009, Rome, May 2009.
SMG2000 Optimization

Outlined code:
for (si = 0; si < stencil_size; si++)
  for (kk = 0; kk < hypre__mz; kk++)
    for (jj = 0; jj < hypre__my; jj++)
      for (ii = 0; ii < hypre__mx; ii++)
        rp[((ri+ii)+(jj*hypre__sy3))+(kk*hypre__sz3)] -=
            Ap_0[((ii+(jj*hypre__sy1))+(kk*hypre__sz1)) + (((A->data_indices)[i])[si])] *
            xp_0[((ii+(jj*hypre__sy2))+(kk*hypre__sz2)) + ((*dxp_s)[si])];

CHiLL transformation recipe: permute([2,3,1,4]), tile(0,4,TI), tile(0,3,TJ), tile(0,3,TK), unroll(0,6,US), unroll(0,7,UI)
Constraints on search: 0 ≤ TI, TJ, TK ≤ 122; 0 ≤ UI ≤ 16; 0 ≤ US ≤ 10; compiler ∈ {gcc, icc}
Search space: 122³ × 16 × 10 × 2 ≈ 581M points
SMG2000 Search and Results • Parallel search evaluates 490 points and converges in 20 steps • Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, Comp=gcc • Performance gain on residual computation: 2.37x • Performance gain on full app: 27.23% improvement
Compiling New Code Variants at Runtime • [diagram: at each search step (SS1 … SSN), Active Harmony sends code transformation parameters for the outlined code section to a code server; code generation tools compile the new variants (v1s.so … vNs.so) and signal READY; the application loads them, with a brief stall phase, and returns performance measurements (PM1 … PMN) along its execution timeline]
Online Code Generation Results • Three platforms: • umd-cluster (64 Intel Xeon dual-core nodes) – Myrinet interconnect • Carver (1,120 compute nodes, Intel Nehalem, two quad-core processors) – InfiniBand interconnect • Hopper (5,312 cores, two quad-core processors per node, Cray XT5) – SeaStar interconnect • Code servers: • umd-cluster – local idle machines • Carver & Hopper – outsourced to a machine at UMD • Codes: • PES – Poisson Solver (from the KeLP distribution) • PMLB – Parallel Multi-block Lattice Boltzmann
Runtime Code Generation Results • All cases used 8 code servers • Net is the speedup factoring in the overhead of the code-generation cores • Post-harmony is a second run using the best configuration found in the first • X-axis is problem size • PES – 128 cores, UMD cluster • PMLB – 512 cores, Carver
Machine-Specific Optimization • Optimize for one machine, then run on others • Results show speedups compared to the base version • Program is PES; all runs used 64 cores
Auto-Tuning for Communication • Optimize overlap • Hide communication behind as much computation as possible • Tune how often to call poll routines • Messages don’t move unless call into MPI library • Calling too often results in slowdown • Portability • No support from special hardware for overlap • Optimize local computation • Improve cache performance with loop tiling • Parameterize & Auto-Tuning • Cope with the complex trade-off regarding the optimization techniques
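The polling trade-off above can be sketched with a tunable poll interval: messages only progress when the application calls into the MPI library, so the compute loop calls a progress routine every poll_interval iterations. The progress routine below is a stand-in counter, not a real MPI call (something like MPI_Testall would sit there in practice); the interval is the parameter the auto-tuner would search over.

```c
#include <assert.h>

/* Stand-in for a call into the MPI library that drives message
 * progress (e.g. a test on outstanding nonblocking requests). */
static int polls;
static void progress_poll(void) { polls++; }

/* Compute loop with a tunable polling frequency: too small an interval
 * adds call overhead, too large delays message progress.  The "work"
 * here is a trivial sum standing in for local computation. */
static int overlap_compute(int iters, int poll_interval)
{
    polls = 0;
    int work = 0;
    for (int i = 0; i < iters; i++) {
        work += i;                         /* local computation */
        if ((i + 1) % poll_interval == 0)
            progress_poll();               /* drive communication */
    }
    return work;
}
```

An auto-tuner varies poll_interval per platform and message size; the computation's result is unaffected, only the overlap changes.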
Parallel 3-D FFT Performance • NEW: our method with overlap • TH: another method with overlap • FFTW: a popular library without overlap • 2048³ complex numbers, 256 cores on Hopper • Our NEW method is 1.76x faster than FFTW; the TH approach is not as fast
Generalized Auto-Tuning • [diagram: the client application FETCHes candidate points from the Active Harmony search strategy through plug-in layers 1–3, and REPORTs evaluated performance back through layers 3–1]
Onion Model Workflow • Allows for paired functionality, but either hook is optional. • Fetch hooks executed in ascending order. • Report hooks executed in descending order. • Point values cannot be modified (directly). REPORT Search Strategy 3 2 1 FETCH
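The hook ordering above can be made concrete with a toy plug-in chain: fetch hooks fire in ascending layer order on the way out, report hooks in descending order on the way back, so each layer sees reports in the reverse of the order it saw fetches. This sketch only records the ordering (the layer count and trace string are assumptions for illustration; real hooks would transform or log the point).

```c
#include <assert.h>
#include <string.h>

#define NLAYERS 3
static char trace[64];

/* Each layer's hooks just record which layer ran; real plug-ins would
 * inspect or log the candidate point here.  Either hook is optional in
 * the real workflow; this sketch gives every layer both. */
static void fetch_hook(int layer)  { char c = '1' + layer; strncat(trace, &c, 1); }
static void report_hook(int layer) { char c = '1' + layer; strncat(trace, &c, 1); }

static const char *run_workflow(void)
{
    trace[0] = '\0';
    for (int i = 0; i < NLAYERS; i++)        /* FETCH: ascending   */
        fetch_hook(i);
    strncat(trace, "-", 1);                  /* strategy's point is
                                                evaluated here      */
    for (int i = NLAYERS - 1; i >= 0; i--)   /* REPORT: descending */
        report_hook(i);
    return trace;
}
```

The symmetric ordering is what lets a layer pair setup work in its fetch hook with cleanup in its report hook, like nested scopes.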
Plug-in Example: Point Logger

static FILE *outfd;

logger_init()
{
    outfd = fopen("performance.log", "w");
}

logger_report(flow_t *flow, point_t *pt, double perf)
{
    fprintf(outfd, "%s -> %lf\n", string(pt), perf);
    flow->type = ACCEPT;
}

logger_fini()
{
    fclose(outfd);
}
Plug-in Example: Constraints • Support for non-rectangular parameter spaces. • Implemented as plug-in #2 using the REJECT workflow. • Uses the Omega library.
Easier Auto-tuning: Lower Barriers to Use • Problem: • Hard to get codes ready for auto-tuning • Tools can be complex • Tuna: auto-tuning shell • Hides details of auto-tuning • Use: • Program prints performance as its final output line • Auto-tuner invokes programs via command-line arguments • Program can be a shell script • including compilation • config file generation
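A program shaped for that contract is tiny: tunable parameters arrive as command-line arguments and the performance value is printed as the final output line. The objective below is synthetic (distance from a hypothetical sweet spot at tile=6, unroll=4) so the sketch is deterministic; a real program would print a measured time instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical objective standing in for measured run time: smaller is
 * better, minimized at tile=6, unroll=4. */
static int tuna_objective(int tile, int unroll)
{
    return (tile - 6) * (tile - 6) + (unroll - 4) * (unroll - 4);
}

/* The Tuna contract: read parameters from argv, run the work, and make
 * the performance number the final line of output.  Factored out of
 * main so it can be exercised directly. */
static int run_for_tuna(int argc, char **argv)
{
    int tile   = argc > 1 ? atoi(argv[1]) : 1;
    int unroll = argc > 2 ? atoi(argv[2]) : 1;
    int perf = tuna_objective(tile, unroll);
    printf("%d\n", perf);        /* final output line = performance */
    return perf;
}
```

Tuna then substitutes its candidate values for the % placeholders on the command line and reads this last line back as the objective.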
Easier Auto-tuning: Tuna

> ./tuna -i=tile,1,10,1 -i=unroll,2,12,2 -n=25 matrix_mult -t % -u %

Parameter 1: low 1, high 10, step 1 • Parameter 2: low 2, high 12, step 2 • The remainder is the command line to run; each % is replaced by a parameter value
Easier Auto-tuning: Language Help • Problem: • How to express tunable options • Chapel • Already supports configuration variables • Sort of like a constant • Defined in the program • Given a value by the programmer • But • User can change the value at program launch • Proposed Extensions to the Language • Add a range to configuration variables • Include the range in --help output • Include a machine-readable version of --help
Tuna + Chapel

> ./quicksort --help | tail -5
quicksort configvars:
  n: int(64)
  thresh: int(64)
  verbose: int(64)
  timing: bool

Auto-tuning parameters: • Threads, Tasks: defaults of the Chapel runtime • Thresh: configuration variable from the program
Conclusions and Future Work • Ongoing Work • More end-to-end application studies • Continued evaluation of online code generation • Communication optimization • Re-using data from prior runs • Conclusions • Auto-tuning can be done at many levels • Offline – using training runs (choices fixed for an entire run) • Compiler options • Programmer-supplied per-run tunable parameters • Compiler transformations • Online – training or production (choices change during execution) • Programmer-supplied per-timestep parameters • Compiler transformations • It works! • Real programs run faster