Solving the Convolution Problem in Performance Modeling

Allan Snavely, Laura Carrington, Mustafa Tikir (PMaC Lab); Roy Campbell (ARL); Tzu-Yi Chen (Pomona College)

Some Questions • Do supercomputers double in speed every 18 months? • How can one meaningfully rank supercomputers? • How can one reasonably procure supercomputers? • How can one design supercomputers to run real applications faster? • How well can simple benchmarks represent the performance of real applications?

The Convolution Hypothesis • The performance of HPC applications on supercomputers can be explained by some combination of low-level benchmarks combined with knowledge about the applications • Note that a hypothesis is something that can be tested and could be true or false!

A Framework for Performance Modeling
• Machine Profile: the rates at which a machine can perform different operations (collecting: rate of op1, op2, op3)
• Application Signature: the operations the application needs to carry out (collecting: number of op1, op2, and op3)
• Convolution: a mapping of a machine's performance (rates) onto an application's needed operations:
Execution time = operation1 / rate op1 ‘+’ operation2 / rate op2 ‘+’ operation3 / rate op3
where the ‘+’ operator could be + or MAX, depending on operation overlap
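The convolution formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the PMaC convolver: the function name `convolve`, the operation names, and all numbers are assumptions made up for the example, and overlap is modeled as all-or-nothing (pure sum or pure MAX).

```python
def convolve(signature, profile, overlap=False):
    """Predict execution time from operation counts and machine rates.

    signature: dict mapping op name -> number of operations (application signature)
    profile:   dict mapping op name -> operations per second (machine profile)
    overlap:   if True, the '+' operator is MAX (operations fully overlap);
               otherwise it is an ordinary sum.
    """
    times = [count / profile[op] for op, count in signature.items()]
    return max(times) if overlap else sum(times)


# Hypothetical numbers, chosen only for illustration.
signature = {"mem_load": 2.0e9, "fp": 1.0e9}   # operation counts
profile = {"mem_load": 1.0e9, "fp": 4.0e9}     # operations per second

serial_time = convolve(signature, profile)                    # 2.0 s + 0.25 s
overlapped_time = convolve(signature, profile, overlap=True)  # max(2.0 s, 0.25 s)
```

Real applications presumably fall somewhere between the two extremes, which is why the choice of operator is left open on the slide.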

Example: Convolving MEMbench data with MetaSim Tracer Data • A memory bandwidth benchmark measures memory rates (MB/s) for different levels of cache, and tracing reveals different access patterns • [Figure: regions labeled stride-one access from L1 cache, L1/L2 cache, L2/L3 cache, and L3 cache/main memory]

Formally, let P(m,n) = A(m,p) • M(p,n), i.e. $p_{ij} = \sum_{k=1}^{p} a_{ik} m_{kj}$
• P is a matrix of runtimes: the i's (rows) are applications and the j's (columns) are machines; an entry of P is the (real or predicted) runtime of application i on machine j
• The rows of A are applications and the columns are operation counts; read a row to get an application signature
• The rows of M are benchmark-measured bandwidths and the columns are machines; read a column to get a machine profile
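As a small worked instance of the matrix form, the sketch below computes P = A • M entry by entry. It assumes, for the units to work out as runtimes, that each entry of M is stored as time per operation (the reciprocal of a measured rate); that choice, the helper name `matmul`, and all numbers are illustrative assumptions rather than details from the slides.

```python
def matmul(A, M):
    """Return P = A * M, where p_ij = sum over k of a_ik * m_kj."""
    p = len(M)
    assert all(len(row) == p for row in A), "inner dimensions must agree"
    n = len(M[0])
    return [[sum(row[k] * M[k][j] for k in range(p)) for j in range(n)]
            for row in A]


# Rows of A: applications; columns: operation counts (hypothetical values).
A = [[2.0e9, 1.0e9],
     [5.0e8, 4.0e9]]

# Rows of M: operation types; columns: machines.
# Assumption: entries are seconds per operation, i.e. 1 / rate.
M = [[1.0e-9, 0.5e-9],
     [0.25e-9, 0.5e-9]]

# P[i][j] is the predicted runtime (seconds) of application i on machine j.
P = matmul(A, M)
```

Reading across a row of P compares one application on every machine; reading down a column compares every application on one machine.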

Investigating the Convolution Method • We have a multi-pronged investigation to find the limits of accuracy of this approach • How accurately can P be measured directly? • “Real” runtimes can vary 10% or more • Symbiotic job scheduling (Jon Weinberg) • How well can P be computed empirically as A • M? • We use a Linear Optimization approach (Roy Campbell) as well as a Least Squares fit (Tzu-Yi Chen) • How well can P be computed ab initio from trace data and judicious convolving? (Laura Carrington) • Can the ab initio approach guide the empirical one (and vice versa)?

How well can P be computed empirically as A • M? • De-convolution gives A = P/M • The big picture: we are trying to discover whether any linear combination of simple benchmarks can represent an HPC application from a performance standpoint • If YES, difficult full-application benchmarking can be replaced by easy, simple benchmarks, and low-level performance characteristics of machines can be related to expected application performance

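Solving for A using Least Squares • Consider solving the matrix equality P = M A for A (written here in transposed form relative to P = A • M, so that each column of A holds one application's op counts and each row of M is one machine) • We can solve for each column of A individually (i.e. P_i = M A_i), given the assumption that the op counts of an application do not depend on other applications • We compute op counts that minimize the 2-norm of the residual P_i – M A_i • Nonnegative least squares (nonneglsq) in Matlab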
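The per-application fit can be sketched without Matlab. The example below is not the solver used in the study: it swaps in a plain projected-gradient iteration for nonnegative least squares, chosen only because it fits in a few lines of standard-library Python. The function name, the iteration count, and the data are assumptions; M is taken as machines (rows) by operation types (columns), in seconds per giga-operation.

```python
def nnls_projected_gradient(M, p, iterations=5000):
    """Approximately minimise ||p - M a||_2 subject to a >= 0.

    M: list of rows, one per machine; columns are operation types.
    p: measured runtimes of one application, one entry per machine.
    """
    rows, cols = len(M), len(M[0])
    # Step size 1/L, where L = ||M||_F^2 bounds the largest eigenvalue of M^T M.
    step = 1.0 / sum(v * v for row in M for v in row)
    a = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(M[i][k] * a[k] for k in range(cols)) - p[i]
                    for i in range(rows)]
        gradient = [sum(M[i][k] * residual[i] for i in range(rows))
                    for k in range(cols)]
        # Gradient step followed by projection onto the nonnegative orthant.
        a = [max(0.0, a[k] - step * gradient[k]) for k in range(cols)]
    return a


# Hypothetical data: three machines, two operation types.
M = [[1.0, 0.25],
     [0.5, 0.5],
     [2.0, 1.0]]
p = [2.25, 1.5, 5.0]   # runtimes generated from op counts of (2.0, 1.0) giga-ops

op_counts = nnls_projected_gradient(M, p)
```

With more machines than operation types the system is overdetermined, so on real measurements the residual will generally be nonzero and its size indicates how well a linear combination of benchmarks explains the application.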

Solving for A using LP

Ab Initio: MetaSim Tracer, memory and FP trace with “processing” to get hit rates on the PREDICTED MACHINE
• Collecting the trace: the application runs on a parallel machine (CPU #1, CPU #2, ..., CPU #N) and its address stream is captured
• Processing the trace: the entire address stream is processed through a cache simulator configured with a user-specified memory structure for the PREDICTED MACHINE (Power 4, Power 3, Alpha, Itanium)
• The result is the expected cache hit rates for the application on the user-specified memory structure; the final product is a table of average hit rates for each basic block of the entire application
• Less processing than a cycle-accurate simulator: saves time and is still accurate enough for predictions
• From a sample application signature:
BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP
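To make the cache-simulation step concrete, here is a toy stand-in: a single-level, direct-mapped cache that turns an address stream into a hit rate. It is only a sketch of the idea and not the MetaSim cache simulator; the cache size, line size, function name, and address streams are all assumed for illustration.

```python
def hit_rate(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Fraction of accesses that hit in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # memory line currently held in each slot
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_bytes      # memory line containing this address
        slot = line % num_lines        # direct-mapped placement
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line          # miss: install the line
        total += 1
    return hits / total if total else 0.0


# Two synthetic address streams over an array much larger than the cache.
stride_one = [8 * i for i in range(100_000)]     # consecutive 8-byte words
line_stride = [64 * i for i in range(100_000)]   # one access per cache line

stride_one_rate = hit_rate(stride_one)     # 7 of every 8 accesses hit
line_stride_rate = hit_rate(line_stride)   # every access touches a new line
```

A per-basic-block table like the sample signature would come from running such a simulation separately on the addresses issued by each basic block, with one simulated level per cache level of the predicted machine.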

Convolver rules trained by a human expert • Pick 40 loops from nature • Measure performance on 5 machines • Tune rules to predict performance accurately • Predict 10,000 loops from 5 apps, 2 inputs, 3 CPU counts, 9 machines (270 predictions)

Rules training

Scientific applications used in this study • AVUS was developed by the Air Force Research Laboratory (AFRL) to determine the fluid flow and turbulence of projectiles and air vehicles. Its standard test case calculates 400 time-steps of fluid flow and turbulence for a wing, flap, and end plates using 7 million cells. Its large test case calculates 150 time-steps of fluid flow and turbulence for an unmanned aerial vehicle using 24 million cells. • The Naval Research Laboratory (NRL), Los Alamos National Laboratory (LANL), and the University of Miami developed HYCOM as an upgrade to MICOM (both well-known ocean modeling codes) by enhancing the vertical layer definitions within the model to better capture the underlying science. HYCOM's standard test case models all of the world's oceans as one global body of water at a resolution of one-fourth of a degree when measured at the Equator. • OVERFLOW-2 was developed by NASA Langley and NASA Ames to solve CFD equations on a set of overlapping, adaptive grids, such that the grid resolution near an obstacle is higher than that of other portions of the scene. This approach allows computation of both laminar and turbulent fluid flows over geometrically complex, non-stationary boundaries. The standard test case of OVERFLOW-2 models fluid flowing over five spheres of equal radius and calculates 600 time-steps using 30 million grid points.

More scientific apps used in this study • Sandia National Laboratories (SNL) developed CTH to model complex multidimensional, multiple-material scenarios involving large deformations or strong shock physics. RFCTH is a non-export-controlled version of CTH. The standard test case of RFCTH models a ten-material rod impacting an eight-material plate at an oblique angle, using adaptive mesh refinement with five levels of enhancement. • The WRF model is being developed as a collaborative effort among the NCAR Mesoscale and Microscale Meteorology Division (MMM), NCEP’s Environmental Modeling Center (EMC), FSL’s Forecast Research Division (FRD), the DoD Air Force Weather Agency (AFWA), the Center for the Analysis and Prediction of Storms (CAPS) at the University of Oklahoma, and the Federal Aviation Administration (FAA), along with the participation of a number of university scientists. Primary funding for MMM participation in WRF is provided by the NSF/USWRP, AFWA, FAA, and the DoD High Performance Computing Modernization Office. With this model, we seek to improve the forecast accuracy of significant weather features across scales ranging from cloud to synoptic, with priority emphasis on horizontal grids of 1-10 kilometers.

Most recent results LS & LP methods

Most Recent Results: ab initio

How can one reasonably procure supercomputers? • Assistance to DoD HPCMO, SDSC Petascale, and DOE NERSC procurements • Form performance models of strategic applications, verify them against existing HPC assets, and use them to predict the performance of proposed systems • Of course, performance is just one criterion (others include price, power, size, cooling, reliability, diversity, etc.)

Different machines are better at different things and the space is complicated

What one needs is performance sensitivities of applications - how much faster my app for:

Performance Sensitivities
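One way to picture a performance sensitivity is to ask how much the predicted runtime improves when a single machine rate is improved and everything else is held fixed. The sketch below does this with the simple sum-of-ratios model; the operation names, the doubling factor, and the numbers are hypothetical and are not taken from the study.

```python
def predicted_time(counts, rates):
    """Sum-of-ratios runtime model: each op count divided by its rate."""
    return sum(counts[op] / rates[op] for op in counts)


def sensitivities(counts, rates, factor=2.0):
    """Speedup of the predicted time when each rate alone is scaled by factor."""
    base = predicted_time(counts, rates)
    result = {}
    for op in rates:
        improved = dict(rates)             # copy so other rates stay unchanged
        improved[op] = rates[op] * factor
        result[op] = base / predicted_time(counts, improved)
    return result


# Hypothetical application signature and machine profile.
counts = {"mem_load": 2.0e9, "fp": 1.0e9}
rates = {"mem_load": 1.0e9, "fp": 4.0e9}

speedups = sensitivities(counts, rates)
```

In this made-up case doubling the memory rate helps far more than doubling the floating-point rate, which is the kind of application-specific answer a sensitivity study is meant to give.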

Pieces of Performance Prediction Framework • Each model consists of: • Machine Profile - characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstract from the particular application. • Application Signature - detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine. • Combine the Machine Profile and Application Signature using: • Convolution Methods - algebraic mappings of the Application Signatures onto the Machine Profiles to arrive at a performance prediction.

Pieces of Performance Prediction Framework: Parallel Processor Prediction
Single-Processor Model: Exe. time = Memory op / Mem. rate ‘+’ FP op / FP rate
• Machine Profile (Machine A): characterization of the memory performance capabilities of Machine A
• Application Signature (Application B): characterization of the memory operations needed to be performed by Application B
• Convolution Method: mapping the memory usage needs of Application B to the capabilities of Machine A
Communication Model: Exe. time = comm. op1 / op1 rate ‘+’ comm. op2 / op2 rate
• Machine Profile (Machine A): characterization of the network performance capabilities of Machine A
• Application Signature (Application B): characterization of the network operations needed to be performed by Application B
• Convolution Method: mapping the network usage needs of Application B to the capabilities of Machine A
Result: performance prediction of Application B on Machine A

Machine Profile – Single processor model: collecting rates for Memory operations and FP operations • Tables of a machine’s performance/rates for different operations, collected via benchmarks • Sample: MAPS data • Currently set to theoretical peak

Application Signature* - Single processor model: collecting the type and number of Memory and FP operations, then simulating in a cache simulator
• Trace of operations performed on the processor by an application (memory and FP ops)
• The trace of the application is collected and processed by the MetaSim Tracer
• Sample: cache hit rates for the PREDICTED MACHINE for each basic block of the application. This additional information requires “processing” by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components
BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP
• The format is: Basic-block #: # memory ref., type, hit rates, access stride
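The sample lines follow a regular enough layout that they can be read mechanically. The parser below is a sketch written against only the four sample lines shown on the slide; the function name and the dictionary keys are assumptions, and the real MetaSim output format may contain fields not covered here.

```python
def parse_signature_line(line):
    """Parse one line such as 'BB#202: 2.0E9, load, 99%, 100%, stride-one'."""
    block, _, rest = line.partition(":")
    fields = [f.strip() for f in rest.split(",")]
    record = {
        "basic_block": int(block.split("#")[1]),
        "count": float(fields[0]),      # number of references or FP operations
        "type": fields[1],              # e.g. 'load' or 'FP'
    }
    if len(fields) > 2:
        # Memory entry: hit rates (one per cache level) followed by the stride.
        record["hit_rates"] = [float(f.rstrip("%")) / 100.0 for f in fields[2:-1]]
        record["stride"] = fields[-1]
    return record


sample = [
    "BB#202: 2.0E9, load, 99%, 100%, stride-one",
    "BB#202: 1.9E3, FP",
    "BB#303: 2.2E10, load, 52%, 63%, random",
    "BB#303: 1.1E2, FP",
]
records = [parse_signature_line(line) for line in sample]
```

FP entries carry only a count and a type, so they come back without the `hit_rates` and `stride` keys.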

How can one meaningfully rank supercomputers? • Thresholded Inversions is a metric for evaluating rankings • Basically, when a machine higher on the list runs an application slower than a machine lower on the list, that is an inversion • We showed the Top500 list is rife with such inversions, 78% suboptimal compared to… • A best list obtainable by brute force for any set of applications • We used the framework to approach the quality of the best list by combining these simple HPC Challenge benchmarks (Random Access, STREAM, and HPL), as guided by application traces
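Counting inversions for a given ranking is straightforward once measured runtimes are available. The sketch below assumes one particular reading of "thresholded": an inversion is counted only when the higher-ranked machine is slower by more than a relative tolerance. That interpretation, the function name, and the data are assumptions for illustration and may differ from the definition used in the actual study.

```python
def count_thresholded_inversions(ranking, runtimes, threshold=0.0):
    """Count (application, machine pair) inversions in a ranking.

    ranking:   list of machine names, best first.
    runtimes:  dict mapping application -> dict of machine -> runtime (seconds).
    threshold: relative slack; the higher-ranked machine must be slower by
               more than this fraction for the pair to count as an inversion.
    """
    inversions = 0
    for app_times in runtimes.values():
        for hi in range(len(ranking)):
            for lo in range(hi + 1, len(ranking)):
                t_high = app_times[ranking[hi]]   # machine higher on the list
                t_low = app_times[ranking[lo]]    # machine lower on the list
                if t_high > (1.0 + threshold) * t_low:
                    inversions += 1
    return inversions


# Hypothetical ranking and runtimes.
ranking = ["X", "Y", "Z"]
runtimes = {
    "app1": {"X": 10.0, "Y": 12.0, "Z": 11.0},
    "app2": {"X": 20.0, "Y": 15.0, "Z": 21.0},
}

strict = count_thresholded_inversions(ranking, runtimes)                   # no slack
tolerant = count_thresholded_inversions(ranking, runtimes, threshold=0.1)  # 10% slack
```

A threshold of this kind keeps run-to-run noise in measured runtimes from being counted as a ranking error.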
