Online Adaptive Code Generation and Tuning


Presentation Transcript

  1. Online Adaptive Code Generation and Tuning Jeffrey K. Hollingsworth and Ananta Tiwari

  2. Why Automated Performance Tuning? • Software parameters impact performance. • Optimal parameter values are variable and unpredictable. • Parameters come from: • User code • Libraries • Compiler choices • For long-running programs at scale: • Machine performance changes during a single “execution” • Automated parameter tuning can be used for adaptive tuning in complex software.

  3. The framework – Active Harmony • Search-based, feedback-driven empirical optimization • Provides a mechanism for applications to become adaptable by exporting tuning options • Monitors program performance and suggests adaptation decisions • Decisions are made by a central controller
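
The feedback loop on this slide can be sketched as follows. This is a minimal illustration and not the Active Harmony API: `ToyController`, `suggest`, `report`, and the cost model inside `measure` are hypothetical names, and plain random search stands in for the real search strategy.

```python
import random

def measure(tile):
    """Stand-in for timing one application iteration (hypothetical cost model)."""
    return (tile - 48) ** 2 + 10.0

class ToyController:
    """Hypothetical central controller: proposes values and records feedback."""

    def __init__(self, candidates, seed=0):
        self.candidates = list(candidates)
        self.rng = random.Random(seed)
        self.best = None  # (cost, value) of the best configuration seen so far

    def suggest(self):
        # Random search is used here only to keep the sketch short.
        return self.rng.choice(self.candidates)

    def report(self, value, cost):
        if self.best is None or cost < self.best[0]:
            self.best = (cost, value)

def tune(steps=50):
    """The application exports one tunable (a tile size) and reports timings."""
    controller = ToyController(range(8, 129, 8))
    for _ in range(steps):
        tile = controller.suggest()             # controller suggests a decision
        controller.report(tile, measure(tile))  # application feeds back performance
    return controller.best
```

Calling `tune()` returns the lowest-cost `(cost, tile)` pair observed during the run.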

  4. Online tuning challenges & requirements • Minimal overhead • Avoid “bad” regions • Precise specification of search space – Constraint Specification Language (CSL) • Express relationships between tunable parameters

  5. Parallel Rank Ordering Algorithm • All but the best point of the simplex move • Computations can be done in parallel • N parallel evaluations for an (N+1)-point simplex
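
A simplified sketch of one such iteration, assuming a continuous search space and the usual reflect/expand/shrink moves taken through the best point. The coefficients, the acceptance rule, and the helper names `evaluate_all` and `pro_step` are assumptions made for illustration; the slides do not spell them out.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(points, f):
    """Evaluate independent points concurrently (they do not depend on each other)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(f, points))

def pro_step(simplex, f):
    """One iteration: every point except the best one moves."""
    simplex = sorted(simplex, key=f)
    best, rest = simplex[0], simplex[1:]
    best_cost = f(best)

    # Reflect all non-best points through the best point.
    reflected = [tuple(2 * b - x for b, x in zip(best, p)) for p in rest]
    reflected_costs = evaluate_all(reflected, f)

    if min(reflected_costs) < best_cost:
        # Reflection helped, so try moving further in the same directions.
        expanded = [tuple(3 * b - 2 * x for b, x in zip(best, p)) for p in rest]
        if min(evaluate_all(expanded, f)) < min(reflected_costs):
            return [best] + expanded
        return [best] + reflected

    # Reflection did not help, so shrink the simplex toward the best point.
    shrunk = [tuple((b + x) / 2 for b, x in zip(best, p)) for p in rest]
    return [best] + shrunk
```

Because the best point is always retained, the best cost in the simplex never gets worse from one step to the next.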

  6. Parameter tuning algorithm • Initial simplex construction • User-guided (via CSL construct) • Exploratory iterations • Constraining to allowable regions – penalization technique • Add penalty factor to the configurations that violate constraints
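
The penalization idea can be sketched as below. The configuration keys (`tile_i`, `tile_j`, `unroll`), the two example constraints, and the size of `PENALTY` are hypothetical; they are not taken from the talk or from CSL.

```python
PENALTY = 1.0e6  # hypothetical penalty factor, chosen to dwarf any real timing

def violations(cfg):
    """Count how many (hypothetical) constraints a configuration violates."""
    count = 0
    if cfg["tile_i"] * cfg["tile_j"] > 4096:  # example: tile footprint budget
        count += 1
    if cfg["unroll"] > cfg["tile_i"]:         # example: unroll must fit in a tile
        count += 1
    return count

def penalized_cost(cfg, run):
    """Return the measured time for feasible configurations, a penalty otherwise.

    `run` is a callable that executes the configuration and returns its time.
    It is only invoked when no constraint is violated.
    """
    broken = violations(cfg)
    if broken:
        return PENALTY * broken
    return run(cfg)
```

A search that minimizes `penalized_cost` is pushed back toward the allowable region without needing special handling for infeasible points.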

  7. System design overview • Run-time management of: • Cost of search • Generating and compiling a set of code-variants • Multiple code-sections to tune • Code-generation utility • Allow selection of external code generation tools • Overhead reduction • Non-blocking relationship between code-generation and application execution
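
The non-blocking relationship between code generation and application execution might look like the sketch below. A single worker thread stands in for a code server, and `generate_variant`, `run_iteration`, the variant names, and the sleep times are all hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_variant(name, compile_seconds=0.05):
    """Hypothetical code-server job: pretend to generate and compile a variant."""
    time.sleep(compile_seconds)
    return name

def run_iteration(variant):
    """Stand-in for one application time step using the given code variant."""
    time.sleep(0.01)
    return variant

def run_application():
    current = "v0"  # variant the application already has
    stall_iterations = 0
    with ThreadPoolExecutor(max_workers=1) as code_server:
        pending = code_server.submit(generate_variant, "v1")
        # Keep doing useful work with the current variant while the new one builds.
        while True:
            run_iteration(current)
            stall_iterations += 1
            if pending.done():  # plays the role of the READY signal
                break
        current = pending.result()
    run_iteration(current)  # continue with the freshly built variant
    return {"variant": current, "stall_iterations": stall_iterations}
```

The application never waits idle for the compiler; it only switches variants once the new one is ready.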

  8. System design [Diagram. Labels: Active Harmony and the Harmony timeline with search steps SS1, SS2, …, SSN; code transformation parameters for the outlined code-section; code server with code generation tools and compilers producing variants v1s, v2s, …, vNs and shared objects v1s.so, v2s.so, …, vNs.so; READY signal; application execution timeline with stall_phase; performance measurements PM1, PM2, …, PMN.]

  9. Code generation utility – Utah’s CHiLL • Polyhedral representation of loop nests • Built upon Omega Library Plus • CHiLL features • Provides a rich set of loop transformations • High-level script interface that lets programmers and compilers describe a set of code transformations • Recipe library • Recipes can be evolved and reused over time

  10. Experimental Results • Two platforms • umd-cluster (64 nodes, Intel Xeon dual-core nodes) – Myrinet interconnect • Carver (1120 compute nodes, Intel Nehalem, two quad-core processors) – InfiniBand interconnect • Code servers • umd-cluster – local idle machines • Carver – outsourced to a machine at UMD • Three programs • SMG2000 • Poisson’s equation solver (PES) • PMLB

  11. SMG2000 benchmark • Semi-coarsening multigrid on structured grids • Residual computation contains sparse matrix-vector multiply bottleneck, expressed in 4-deep loop nest • Key computation identified by HPCToolkit and outlined by ROSE Compiler

      for si = 0 to NS-1
        for k = 0 to NZ-1
          for j = 0 to NY-1
            for i = 0 to NX-1
              r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]] * x[i + j*JX + k*KX + Sx[si]]

      • 46% of execution time • Part of PERI auto-tuning effort.
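
For readers who want to run the loop nest, here is a direct Python transcription over flat lists. The function name `smg_residual` and the argument order are choices made for this sketch; the index expressions follow the slide.

```python
def smg_residual(r, A, x, SA, Sx, NX, NY, NZ, JR, KR, JA, KA, JX, KX):
    """Update r in place: r -= A * x, one stencil entry (si) at a time.

    r, A, x are flat lists; J* and K* are the strides for the j and k
    dimensions; SA[si] and Sx[si] are per-stencil-entry offsets.
    """
    for si in range(len(SA)):
        for k in range(NZ):
            for j in range(NY):
                for i in range(NX):
                    r[i + j * JR + k * KR] -= (
                        A[i + j * JA + k * KA + SA[si]]
                        * x[i + j * JX + k * KX + Sx[si]]
                    )
    return r
```

The innermost body performs one multiply and one subtract per grid point per stencil entry, which is why loop order, tiling, and unrolling matter for this kernel.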

  12. SMG2000 results • Online tuning for smg_residual function • Tiling and unrolling • Up to 1.4x speedup within a single run

  13. Poisson’s equation solver (PES) • Uses the red-black successive over-relaxation method • Optimization method • Relaxation function: 7-point stencil • Triply nested loop – tiling the outermost two loops • Error function: sweeps through the local grid to calculate the L2-norm • Triply nested loop – tiling all loops and unrolling the innermost
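
To make "tiling all loops and unrolling the innermost" concrete, the sketch below applies both to a triply nested sweep that computes an L2-norm over a flat n×n×n grid. It is an illustration only, not the PES source: the grid layout, the function names, and the fixed unroll factor of 2 are assumptions.

```python
import math

def l2_norm_naive(grid, n):
    total = 0
    for k in range(n):
        for j in range(n):
            for i in range(n):
                total += grid[(k * n + j) * n + i] ** 2
    return math.sqrt(total)

def l2_norm_tiled(grid, n, tile):
    """Same sum, with all three loops tiled and the innermost unrolled by 2."""
    total = 0
    for kk in range(0, n, tile):
        for jj in range(0, n, tile):
            for ii in range(0, n, tile):
                i_end = min(ii + tile, n)
                for k in range(kk, min(kk + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        base = (k * n + j) * n
                        i = ii
                        while i + 1 < i_end:  # unrolled body: two elements per trip
                            total += grid[base + i] ** 2 + grid[base + i + 1] ** 2
                            i += 2
                        if i < i_end:         # leftover element for odd tile widths
                            total += grid[base + i] ** 2
    return math.sqrt(total)
```

The tile size (and, in a real code generator, the unroll factor) is exactly the kind of parameter an online tuner would search over, since the best value depends on the cache hierarchy of the machine.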

  14. Parallel Multi-block Lattice Boltzmann • Lattice Boltzmann method • Widely used method to study fluid dynamic systems • Six kernels • Initialization, collision, communication, streaming and physical • Streaming kernel accounts for more than 75% of computation time • Consists of five triply nested loops • Loop-nest outlining • Lots of memory copy operations

  15. PMLB Optimization • Two phases of tuning • Loop fusion • First few iterations of the application evaluate different possibilities • Tiling and unrolling • Comparison of the original untuned version (compiled with -O3) against: • Harmonized application • Post-harmony runs

  16. Code server sensitivity • How many parallel code-servers are needed to ensure that the stall_phase is not dominant? • Control parameters: problem-size (1024³), number of processors (128) • Up to 128 new variants are generated at each search step • A typical setup for remaining experiments presented in this talk • Future plans for more robust study
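
A back-of-the-envelope model for the question on this slide, under strong assumptions that the talk does not state: every variant takes the same time to generate and compile, and work is spread evenly across code servers. The function names and the example numbers are hypothetical.

```python
import math

def stall_time(variants, servers, compile_seconds):
    """Time to build all variants for one search step under the assumptions above."""
    rounds = math.ceil(variants / servers)
    return rounds * compile_seconds

def min_servers(variants, compile_seconds, budget_seconds):
    """Smallest server count whose stall time fits the budget, or None."""
    for servers in range(1, variants + 1):
        if stall_time(variants, servers, compile_seconds) <= budget_seconds:
            return servers
    return None
```

For example, with 128 variants per step and a hypothetical 10-second build per variant, `stall_time(128, 8, 10.0)` gives 160 seconds, and keeping the stall under 40 seconds would need `min_servers(128, 10.0, 40.0)` = 32 servers.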

  17. Search evolution

  18. PMLB (Carver Results) Harmonized application runs, on average, 1.14 times faster than the original. Best speedup for all PMLB runs on Carver: 1.48. Post-harmony runs are, on average, 1.37 times faster.

  19. Cross-platform studies • Study how the tuned parameters differ between the two systems • Take Harmony-suggested parameters from one system and use them for a post-harmony run on the other

  20. Future work • Exploit the spatial locality of PRO • A priori generation of code-variants • Retire “unreachable” code variants • Reachability measured in terms of the number of search steps • Run experiments on larger processor counts • Hopper (Cray XT5 machine at NERSC) • Auto-tuning for larger applications • PFLOTRAN (collaboration with PERI folks)