
Friends Don’t Let Friends Tune Code


Presentation Transcript


  1. Friends Don’t Let Friends Tune Code. Jeffrey K. Hollingsworth, University of Maryland, hollings@cs.umd.edu

  2. About the Title

  3. Why Automate Performance Tuning?
  • Too many parameters that impact performance.
  • Optimal performance for a given system depends on:
    • Details of the processor
    • Details of the inputs (workload)
    • Which nodes are assigned to the program
    • Other things running on the system
  • Parameters come from:
    • User code
    • Libraries
    • Compiler choices
  • Automated parameter tuning can be used for adaptive tuning in complex software.

  4. Automated Performance Tuning
  • Goal: maximize achieved performance
  • Problems:
    • Large number of parameters to tune
    • Shape of the objective function is unknown
    • Multiple libraries and coupled applications
    • An analytical model may not be available
    • The shape of the objective function could change during execution!
  • Requirements:
    • Runtime tuning for long-running programs
    • Don't run too many configurations
    • Don't try really bad configurations

  5. I really mean don’t let people hand tune
  • Example: a dense matrix multiply kernel
  • Various options:
    • Original program: 30.1 sec
    • Hand tuned (by developer): 11.4 sec
    • Auto-tuned version of the hand-tuned code: 15.9 sec
    • Auto-tuned original program: 8.5 sec
  • What happened?
    • Hand tuning prevented analysis
    • The auto-tuner's transformations were then not possible

  6. Active Harmony
  • Runtime performance optimization
    • Can also support training runs
  • Automatic library selection (code)
    • Monitor library performance
    • Switch library if necessary
  • Automatic performance tuning (parameter)
    • Monitor system performance
    • Adjust runtime parameters
  • Hooks for compiler frameworks
    • Integrated with USC/ISI CHiLL

  7. Basic Client API Overview
  • Only six core functions needed (a short usage sketch follows):
    • Initialization/teardown (4 functions)
    • Critical loop analysis (2 functions)
  • Available for C, C++, Fortran, and Chapel

    hdesc_t *harmony_init(appName, iomethod)
    int harmony_connect(hdesc, host, port)
    int harmony_bundle_file(hdesc, filename)
    harmony_bind_int(hdesc, varName, &ptr)
    int harmony_fetch(hdesc)
    int harmony_report(hdesc, performance)
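  To make the flow concrete, here is a minimal sketch of a client loop built only from the signatures listed above. The header name, I/O-method constant, host, port, and the run_kernel routine are assumptions for illustration, not part of the documented API.

    /* Minimal sketch of a Harmony client loop (assumptions noted below). */
    #include "hclient.h"              /* assumed name of the client header */

    int tile, unroll;                 /* tunable parameters bound to Harmony */

    extern double run_kernel(int tile, int unroll);  /* hypothetical code under test */

    int main(void)
    {
        hdesc_t *hd = harmony_init("gemm", HARMONY_IO_POLL);  /* io method assumed */
        harmony_connect(hd, "localhost", 1979);               /* host/port assumed */
        harmony_bind_int(hd, "tile",   &tile);
        harmony_bind_int(hd, "unroll", &unroll);

        for (int step = 0; step < 50; step++) {
            harmony_fetch(hd);                       /* get the next candidate point */
            double secs = run_kernel(tile, unroll);  /* run the critical loop */
            harmony_report(hd, secs);                /* report measured performance */
        }
        return 0;
    }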

  8. Harmony Bundle Definitions
  • External bundle definition file
  • Equivalent client API routines available

    [rchen@brood00 code_generation]$ cat bundle.cfg
    harmonyApp gemm {
      { harmonyBundle TI { int {2 500 2} global } }
      { harmonyBundle TJ { int {2 500 2} global } }
      { harmonyBundle TK { int {2 500 2} global } }
      { harmonyBundle UI { int {1 8 1} global } }
      { harmonyBundle UJ { int {1 8 1} global } }
    }

  • Each entry gives the bundle name (e.g., TI), value class (int), value range ({min max step}), and bundle scope (global).

  9. A Bit More About Harmony Search
  • Pre-execution
    • Sensitivity discovery phase
    • Used to order, not eliminate, search dimensions
  • Online algorithm
    • Uses Parallel Rank Order search
      • A variation of the rank-order algorithm
      • Part of the class of generating set algorithms
    • Different configurations on different nodes

  10. Parallel Rank Ordering Algorithm
  • All but the best point of the simplex move.
  • Computations can be done in parallel.
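  The core move is easy to sketch: every simplex point except the current best is reflected through the best point, and the reflected candidates can then be evaluated simultaneously, one per node. The snippet below illustrates that single step under made-up dimensions; it is not Active Harmony's actual implementation.

    /* One reflection step of a parallel rank-order style search: all points
       except the current best move.  DIM and NPOINTS are arbitrary values
       chosen for illustration. */
    #define DIM     3
    #define NPOINTS 4

    void reflect_step(int simplex[NPOINTS][DIM], int best)
    {
        for (int p = 0; p < NPOINTS; p++) {
            if (p == best)
                continue;                          /* the best point stays put */
            for (int d = 0; d < DIM; d++)
                simplex[p][d] = 2 * simplex[best][d] - simplex[p][d];  /* x' = b + (b - x) */
        }
        /* The NPOINTS-1 reflected configurations can now be evaluated in
           parallel and ranked by the performance each node reports. */
    }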

  11. Performance for Two Programs
  [Charts: High Performance Linpack and the POP ocean code, illustrating the benefits of searching in parallel.]

  12. Concern: Performance Variability
  • Performance has two components:
    • Deterministic: a function of the parameters
    • Stochastic: a function of the parameters and high-priority job throughput
  • More of a problem in clusters

  13. Performance variability impact on optimization
  • We do not need the exact performance value.
  • We only need to compare different configurations.
  • Approach: take multiple samples and combine them with a simple operator (sketched below).
  • Goal: come up with an appropriate operator that results in lower variance.
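  One plausible operator (an assumption here, not necessarily the one used in this work) is the minimum of several timed samples: interference from other jobs only ever inflates a measurement, so the fastest sample is typically the least disturbed and varies less than a single reading.

    /* Sketch: compare configurations by the minimum of several samples.
       measure_once() is a hypothetical routine that runs the tuned code
       once and returns elapsed seconds. */
    #include <float.h>

    extern double measure_once(void);

    double measure_config(int samples)
    {
        double best = DBL_MAX;
        for (int i = 0; i < samples; i++) {
            double t = measure_once();
            if (t < best)
                best = t;            /* keep the least-disturbed sample */
        }
        return best;                 /* value handed to the search for comparison */
    }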

  14. Integrating Compiler Transformations

  15. Auto-tuning Compiler Transformations
  • Described as a series of recipes
  • Recipes consist of a sequence of operations:
    • permute([stmt],order): change the loop order
    • tile(stmt,loop,size,[outer-loop]): tile loop at level loop
    • unroll(stmt,loop,size): unroll stmt's loop at level loop
    • datacopy(stmt,loop,array,[index]): make a local copy of the data
    • split(stmt,loop,condition): split stmt's loop level loop into multiple loops
    • nonsingular(matrix)

  16. CHiLL + Active Harmony
  • Generate and evaluate different optimizations that would have been prohibitively time consuming for a programmer to explore manually.
  • Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, Jeffrey K. Hollingsworth, “A Scalable Auto-tuning Framework for Compiler Optimization,” IPDPS 2009, Rome, May 2009.

  17. SMG2000 Optimization

  Outlined code:

    for (si = 0; si < stencil_size; si++)
      for (kk = 0; kk < hypre__mz; kk++)
        for (jj = 0; jj < hypre__my; jj++)
          for (ii = 0; ii < hypre__mx; ii++)
            rp[((ri+ii)+(jj*hypre__sy3))+(kk*hypre__sz3)] -=
              ((Ap_0[((ii+(jj*hypre__sy1))+(kk*hypre__sz1)) +
                     (((A->data_indices)[i])[si])]) *
               (xp_0[((ii+(jj*hypre__sy2))+(kk*hypre__sz2)) + ((*dxp_s)[si])]));

  CHiLL transformation recipe:

    permute([2,3,1,4])
    tile(0,4,TI)
    tile(0,3,TJ)
    tile(0,3,TK)
    unroll(0,6,US)
    unroll(0,7,UI)

  Constraints on search:
    0 ≤ TI, TJ, TK ≤ 122
    0 ≤ UI ≤ 16
    0 ≤ US ≤ 10
    compilers ∈ {gcc, icc}

  Search space: 122³ × 16 × 10 × 2 ≈ 581M points

  18. SMG2000 Search and Results
  • Parallel search evaluates 490 points and converges in 20 steps
  • Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, Comp=gcc
  • Performance gain on the residual computation: 2.37x
  • Performance gain on the full application: 27.23% improvement

  19. Compiling New Code Variants at Runtime
  [Diagram: at each search step (SS1 ... SSN), Active Harmony sends code-transformation parameters for the outlined code section to a code server, where code-generation tools (compilers) build new variants (v1.so ... vN.so); a READY signal tells the running application to load them, performance measurements (PM1 ... PMN) flow back to Harmony, and the application's execution timeline includes a stall phase while variants are switched.]

  20. Online Code Generation Results
  • Three platforms:
    • UMD cluster (64 nodes, Intel Xeon dual-core) – Myrinet interconnect
    • Carver (1,120 compute nodes, Intel Nehalem, two quad-core processors per node) – InfiniBand interconnect
    • Hopper (5,312 cores, two quad-core processors per node, Cray XT5) – SeaStar interconnect
  • Code servers:
    • UMD cluster – local idle machines
    • Carver & Hopper – outsourced to a machine at UMD
  • Codes:
    • PES – Poisson solver (from the KeLP distribution)
    • PMLB – Parallel Multi-block Lattice Boltzmann

  21. Runtime Code Generation Results
  • All cases used 8 code servers
  • "Net" speedup factors include the overhead of the code-generation cores
  • "Post-harmony" is a second run using the best configuration found in the first
  • X-axis is problem size
  • PES – 128 cores, UMD cluster; PMLB – 512 cores, Carver

  22. Machine Specific Optimization
  • Optimize for one machine, then run on others
  • Results are speedups compared to the base version
  • Program is PES; all runs used 64 cores

  23. Auto-Tuning for Communication
  • Optimize overlap
    • Hide communication behind as much computation as possible
    • Tune how often to call poll routines (see the sketch after this list)
      • Messages don't move unless the program calls into the MPI library
      • Calling too often results in slowdown
  • Portability
    • No support from special hardware for overlap
  • Optimize local computation
    • Improve cache performance with loop tiling
  • Parameterize & auto-tune
    • Cope with the complex trade-offs among these optimization techniques
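  As a rough illustration of the polling trade-off, the sketch below overlaps a nonblocking MPI exchange with chunked local work and polls the MPI library every poll_interval chunks; poll_interval is the kind of parameter the auto-tuner would search over. The chunking scheme and the compute_chunk routine are assumptions, not code from this work.

    /* Sketch: overlap a nonblocking exchange with chunked computation and
       poll MPI every poll_interval chunks.  compute_chunk and the chunking
       scheme are assumptions for illustration. */
    #include <mpi.h>

    extern void compute_chunk(int c);

    void overlapped_exchange(double *sendbuf, double *recvbuf, int n,
                             int peer, int nchunks, int poll_interval)
    {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        for (int c = 0; c < nchunks; c++) {
            compute_chunk(c);                       /* local computation */
            if (c % poll_interval == 0) {           /* tuned: how often to poll */
                int done;
                MPI_Testall(2, req, &done, MPI_STATUSES_IGNORE);
            }
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* finish the exchange */
    }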

  24. Parallel 3-D FFT Performance
  • NEW: our method with overlap
  • TH: another method with overlap
  • FFTW: a popular library without overlap
  • 2048³ complex numbers, 256 cores on Hopper
  • Our NEW method is 1.76x faster than FFTW; the TH approach is not as fast

  25. Generalized Auto-Tuning
  [Diagram: the client application FETCHes candidate points from the search strategy and REPORTs evaluated performance back to it, with each fetch and report passing through a stack of plug-in layers (1, 2, 3).]

  26. Onion Model Workflow
  • Allows for paired functionality, but either hook is optional.
  • Fetch hooks are executed in ascending order.
  • Report hooks are executed in descending order.
  • Point values cannot be modified (directly).

  27. Plug-in Example: Point Logger

    static FILE *outfd;              /* log file shared by the hooks */

    /* Called when the plug-in is loaded: open the log file. */
    logger_init()
    {
        outfd = fopen("performance.log", "w");
    }

    /* Report hook: record the point and its performance, then let the
       point continue through the workflow. */
    logger_report(flow_t *flow, point_t *pt, double perf)
    {
        fprintf(outfd, "%s -> %lf\n", string(pt), perf);
        flow->type = ACCEPT;
    }

    /* Called when the plug-in is unloaded: close the log file. */
    logger_fini()
    {
        fclose(outfd);
    }

  28. Plug-in Example: Constraints
  • Support for non-rectangular parameter spaces.
  • Implemented as plug-in #2 using the REJECT workflow.
  • Uses the Omega library.

  29. Easier Auto-tuning: Lower Barriers to Use
  • Problem:
    • Hard to get codes ready for auto-tuning
    • Tools can be complex
  • Tuna: an auto-tuning shell
    • Hides the details of auto-tuning
  • Use:
    • The program prints its performance as the final line of output
    • The auto-tuner invokes the program via command-line arguments
    • The program can be a shell script, including compilation and config-file generation

  30. Easier Auto-tuning: Tuna

    > ./tuna -i=tile,1,10,1 -i=unroll,2,12,2 -n=25 matrix_mult -t % -u %

  • Parameter 1 (tile): low 1, high 10, step 1
  • Parameter 2 (unroll): low 2, high 12, step 2
  • matrix_mult -t % -u % is the command line to run; each % is replaced by a parameter value
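  For illustration, a program driven by that command line only has to accept the substituted values and print its performance as the last line of output. The skeleton below is a hypothetical matrix_mult written to that convention; the run_multiply routine is assumed.

    /* Hypothetical skeleton matching Tuna's convention: read the substituted
       parameter values from the command line and print elapsed seconds as
       the final line of output.  run_multiply is assumed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    extern void run_multiply(int tile, int unroll);

    int main(int argc, char **argv)
    {
        int tile = 1, unroll = 1;
        for (int i = 1; i + 1 < argc; i++) {
            if (strcmp(argv[i], "-t") == 0) tile   = atoi(argv[i + 1]);
            if (strcmp(argv[i], "-u") == 0) unroll = atoi(argv[i + 1]);
        }

        clock_t start = clock();
        run_multiply(tile, unroll);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("%f\n", secs);        /* last line: the value Tuna minimizes */
        return 0;
    }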

  31. Easier Auto-tuning: Language Help
  • Problem: how to express tunable options
  • Chapel already supports configuration variables
    • Sort of like a constant: defined in the program and given a value by the programmer
    • But the user can change the value at program launch
  • Proposed extensions to the language:
    • Add a range to a configuration variable
    • Include the range in --help output
    • Include a machine-readable version of --help

  32. Tuna + Chapel

    > ./quicksort --help | tail -5
    quicksort config vars:
      n: int(64)
      thresh: int(64)
      verbose: int(64)
      timing: bool

  • Auto-tuning parameters:
    • Threads, Tasks: defaults of the Chapel runtime
    • Thresh: configuration variable from the program

  33. Conclusions and Future Work
  • Ongoing work:
    • More end-to-end application studies
    • Continued evaluation of online code generation
    • Communication optimization
    • Re-using data from prior runs
  • Conclusions:
    • Auto-tuning can be done at many levels
      • Offline – using training runs (choices fixed for an entire run): compiler options, programmer-supplied per-run tunable parameters, compiler transformations
      • Online – during training or production (choices change during execution): programmer-supplied per-timestep parameters, compiler transformations
    • It works! Real programs run faster.
