Solving the Convolution Problem in Performance Modeling

### Solving the Convolution Problem in Performance Modeling

Allan Snavely, Laura Carrington, Mustafa Tikir

PMaC Lab

Roy Campbell, ARL

Tzu-Yi Chen, Pomona College

Some Questions

- Do supercomputers double in speed every 18 months?
- How can one meaningfully rank supercomputers?
- How can one reasonably procure supercomputers?
- How can one design supercomputers to run real applications faster?
- How well can simple benchmarks represent the performance of real applications?

The Convolution Hypothesis

- The performance of HPC applications on supercomputers can be explained by a combination of low-level benchmark results and knowledge about the applications
- Note that a hypothesis is something that can be tested and could be true or false!

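$$\text{Execution time} = \frac{\text{operation1}}{\text{rate op1}} \;\text{‘+’}\; \frac{\text{operation2}}{\text{rate op2}} \;\text{‘+’}\; \frac{\text{operation3}}{\text{rate op3}}$$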

A Framework for Performance Modeling

Machine Profile: Rate at which a machine can perform different operations. Collecting: rate op1, op2, op3

Application Signature: Operations needed to be carried out by the application. Collecting: number of op1, op2, and op3

Convolution: Mapping of a machine’s performance (rates) to the application’s needed operations

where the ‘+’ operator in the execution-time formula could be + or MAX, depending on operation overlap
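The execution-time formula can be sketched in a few lines of Python. This is only an illustration: the function name `predict_time`, the `overlap` flag, and all numbers are hypothetical, and the sketch treats overlap as all-or-nothing (pure sum or pure MAX), which is a simplification.

```python
def predict_time(op_counts, op_rates, overlap=False):
    """Predict execution time from operation counts and machine rates.

    op_counts: dict mapping operation name -> number of operations
    op_rates:  dict mapping operation name -> operations per second
    overlap:   if True, assume the operations fully overlap (MAX);
               if False, assume they run back to back (+).
    """
    times = [op_counts[op] / op_rates[op] for op in op_counts]
    return max(times) if overlap else sum(times)


# Hypothetical numbers, for illustration only.
counts = {"mem": 2.0e9, "fp": 1.0e9}
rates = {"mem": 1.0e9, "fp": 2.0e9}

serial_time = predict_time(counts, rates)                    # 2.0 s + 0.5 s
overlapped_time = predict_time(counts, rates, overlap=True)  # max(2.0 s, 0.5 s)
```

With these made-up inputs the memory term dominates, so the overlapped prediction equals the memory time alone.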

Example: Convolving MEMbench data with MetaSim Tracer Data

- Memory bandwidth benchmark measures memory rates (MB/s) for different levels of cache and tracing reveals different access patterns

[Figure labels: L1 cache, L1/L2 cache, L2/L3 cache, L3 cache/Main Memory; stride-one access]

Formally, let P(m,n) = A(m,p) • M(p,n), that is,

$$p_{ij} = \sum_{k=1}^{p} a_{ik}\, m_{kj}$$

- P is a matrix of runtimes: the rows (i) are applications and the columns (j) are machines; an entry of P is the (real or predicted) runtime of application i on machine j
- The rows of A are applications and the columns are operation counts; read a row to get an application signature
- The rows of M are benchmark-measured bandwidths and the columns are machines; read a column to get a machine profile
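As a small worked instance of P = A • M, the sketch below multiplies a toy signature matrix by a toy machine matrix. All values are invented, and the entries of `M` are assumed to be seconds per operation (the inverse of a measured rate) so that the product has units of time; the slides do not state this scaling.

```python
def matmul(A, M):
    """Multiply A (m x p) by M (p x n): P[i][j] = sum over k of A[i][k] * M[k][j]."""
    inner = len(M)
    assert all(len(row) == inner for row in A), "inner dimensions must match"
    return [[sum(a_row[k] * M[k][j] for k in range(inner))
             for j in range(len(M[0]))]
            for a_row in A]


# Hypothetical data: 2 applications, 2 operation types, 3 machines.
# A[i][k]: count of operation k in application i (one row = one signature).
A = [[4.0, 2.0],
     [1.0, 8.0]]
# M[k][j]: assumed seconds per operation k on machine j (one column = one profile).
M = [[0.5, 0.25, 1.0],
     [1.0, 0.5, 0.25]]

# P[i][j]: predicted runtime of application i on machine j.
P = matmul(A, M)
```

Reading across a row of `P` compares machines for one application; reading down a column compares applications on one machine.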

Investigating the Convolution Method

- We have a multi-pronged investigation to find the limits of accuracy of this approach
- How accurately can P be measured directly?
  - “Real” runtimes can vary 10% or more
  - Symbiotic job scheduling (Jon Weinberg)
- How well can P be computed empirically as A • M?
  - We use a Linear Optimization approach (Roy) as well as a Least Squares Fit (Yi Chen)
- How well can P be computed ab initio from trace data and judicious convolving? (Laura Carrington)
- Can the ab initio approach guide the empirical one (and vice versa)?

How well can P be computed empirically as A • M?

- De-convolution gives A=P/M
- The big picture:
  - We are trying to discover whether any linear combination of simple benchmarks can represent an HPC application from a performance standpoint
  - If YES, difficult full-application benchmarking can be replaced by easy simple benchmarks, and low-level performance characteristics of machines can be related to expected application performance

Solving for A using Least Squares

- Consider solving the matrix equality P = M A for A (the transpose of P = A • M, so here the columns of P and A correspond to applications)
- We can solve for each column of A individually (i.e., Pi = M Ai), on the assumption that the op counts of one application do not depend on other applications
- We compute the op counts that minimize the 2-norm of the residual Pi – M Ai
- nonneglsq in Matlab
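The slide points to a non-negative least-squares routine in Matlab. As a dependency-free stand-in, the sketch below solves the same problem (minimize the 2-norm of Pi – M Ai subject to Ai ≥ 0) with projected gradient descent. This is a swapped-in technique, not the authors' solver; the function name, step count, and data are all hypothetical.

```python
def nnls_projected_gradient(M, p, steps=2000):
    """Approximately minimize ||p - M a||_2 subject to a >= 0.

    M: list of rows (machines x operation types)
    p: list of runtimes of one application, one entry per machine
    """
    n_rows, n_cols = len(M), len(M[0])
    # The squared Frobenius norm bounds the largest eigenvalue of M^T M,
    # so 1 / (that bound) is a safe step size for gradient descent.
    step = 1.0 / sum(v * v for row in M for v in row)
    a = [0.0] * n_cols
    for _ in range(steps):
        residual = [sum(M[r][c] * a[c] for c in range(n_cols)) - p[r]
                    for r in range(n_rows)]
        grad = [sum(M[r][c] * residual[r] for r in range(n_rows))
                for c in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        a = [max(0.0, a[c] - step * grad[c]) for c in range(n_cols)]
    return a


# Hypothetical machine matrix (3 machines x 2 operation types) and runtimes
# generated from the "true" op counts [2, 3].
M_toy = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
p_toy = [2.0, 3.0, 5.0]
a_est = nnls_projected_gradient(M_toy, p_toy)
```

On this well-conditioned toy problem the iteration recovers op counts close to [2, 3]. A production fit would more likely call a library non-negative least-squares routine, which converges far more reliably on ill-conditioned benchmark data.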

Ab Initio: MetaSim Tracer - Memory & FP Trace with “processing” to get hit rates on PREDICTED MACHINE

[Diagram: applications running on the CPUs (CPU #1, CPU #2) of a parallel machine produce an address stream (collecting trace), which is fed to a cache simulator (processing trace) configured with the user-specified memory structure of the PREDICTED MACHINE (Power 4, Power 3, Alpha, Itanium). The output is the expected cache hit rates for the application on the user-specified memory structure.]

- The entire address stream is processed through the cache simulator.
- The final product is a table of average hit rates for each basic block of the entire application.
- Less processing than a cycle-accurate simulator: saves time and is still accurate enough for predictions.
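To make the idea of pushing an address stream through a cache simulator concrete, here is a deliberately tiny sketch. It models a single direct-mapped cache level, which is an assumption made for brevity; the slides do not describe the internals of the MetaSim cache simulator, and the function name and cache sizes are hypothetical.

```python
def direct_mapped_hit_rate(addresses, cache_bytes=1024, line_bytes=64):
    """Return the fraction of accesses that hit in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines       # memory block currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes      # memory block containing this address
        index = block % num_lines       # cache line that block maps to
        if resident[index] == block:
            hits += 1
        else:
            resident[index] = block     # miss: load the block into the line
    return hits / len(addresses) if addresses else 0.0


# Stride-one access over 8-byte words: 8 words share each 64-byte line,
# so the first access to a line misses and the next 7 hit.
stride_one_rate = direct_mapped_hit_rate([8 * i for i in range(1024)])

# Striding by a full cache line touches a new line on every access.
line_stride_rate = direct_mapped_hit_rate([64 * i for i in range(100)])
```

Running the simulator once per basic block, with one address stream per block, would yield a per-block hit-rate table of the kind the slide describes.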

From sample application signature:

BB#202: 2.0E9, load, 99%, 100%, stride-one

BB#202: 1.9E3, FP

BB#303: 2.2E10, load, 52%, 63%, random

BB#303: 1.1E2, FP

Convolver rules trained by a human expert

- Pick 40 loops from nature
- Measure performance on 5 machines
- Tune rules to predict performance accurately
- Predict 10,000 loops from 5 apps, 2 inputs, 3 CPU counts, 9 machines (270 predictions)

Scientific applications used in this study

- AVUS was developed by the Air Force Research Laboratory (AFRL) to determine the fluid flow and turbulence of projectiles and air vehicles. Its standard test case calculates 400 time-steps of fluid flow and turbulence for a wing, flap, and end plates using 7 million cells. Its large test case calculates 150 time-steps of fluid flow and turbulence for an unmanned aerial vehicle using 24 million cells.
- The Naval Research Laboratory (NRL), Los Alamos National Laboratory (LANL), and the University of Miami developed HYCOM as an upgrade to MICOM (both well-known ocean modeling codes) by enhancing the vertical layer definitions within the model to better capture the underlying science. HYCOM's standard test case models all of the world's oceans as one global body of water at a resolution of one-fourth of a degree when measured at the Equator.
- OVERFLOW-2 was developed by NASA Langley and NASA Ames to solve CFD equations on a set of overlapping, adaptive grids, such that the grid resolution near an obstacle is higher than that of other portions of the scene. This approach allows computation of both laminar and turbulent fluid flows over geometrically complex, non-stationary boundaries. The standard test case of OVERFLOW-2 models fluid flowing over five spheres of equal radius and calculates 600 time-steps using 30 million grid points.

More scientific apps used in this study

- Sandia National Laboratories (SNL) developed CTH to model complex multidimensional, multiple-material scenarios involving large deformations or strong shock physics. RFCTH is a non-export-controlled version of CTH. The standard test case of RFCTH models a ten-material rod impacting an eight-material plate at an oblique angle, using adaptive mesh refinement with five levels of enhancement.
- The WRF model is being developed as a collaborative effort among the NCAR Mesoscale and Microscale Meteorology Division (MMM), NCEP’s Environmental Modeling Center (EMC), FSL’s Forecast Research Division (FRD), the DoD Air Force Weather Agency (AFWA), the Center for the Analysis and Prediction of Storms (CAPS) at the University of Oklahoma, and the Federal Aviation Administration (FAA), along with the participation of a number of university scientists. Primary funding for MMM participation in WRF is provided by the NSF/USWRP, AFWA, FAA and the DoD High Performance Computing Modernization Office. With this model, we seek to improve the forecast accuracy of significant weather features across scales ranging from cloud to synoptic, with priority emphasis on horizontal grids of 1-10 kilometers.

How can one reasonably procure supercomputers?

- Assistance to DoD HPCMO, SDSC Petascale, DOE NERSC procurements
- Form performance models of strategic applications, verify against existing HPC assets, use to predict performance of proposed systems
- Of course, performance is just one criterion among several (price, power, size, cooling, reliability, diversity, etc.)

Different machines are better at different things and the space is complicated

What one needs is performance sensitivities of applications - how much faster my app for:

Pieces of Performance Prediction Framework

Each model consists of:

- Machine Profile - characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations abstract from the particular application.
- Application Signature - detailed summaries of the fundamental operations to be carried out by the application independent of any particular machine.

Combine Machine Profile and Application Signature using:

- Convolution Methods - algebraic mappings of the Application Signatures on to the Machine profiles to arrive at a performance prediction.

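$$\text{Exe. time} = \frac{\text{memory ops}}{\text{Mem. rate}} \;\text{‘+’}\; \frac{\text{FP ops}}{\text{FP rate}}$$

$$\text{Exe. time} = \frac{\text{comm. op1}}{\text{op1 rate}} \;\text{‘+’}\; \frac{\text{comm. op2}}{\text{op2 rate}}$$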

Pieces of Performance Prediction Framework: Parallel Processor Prediction

Single-Processor Model

- Machine Profile (Machine A): characterization of memory performance capabilities of Machine A
- Application Signature (Application B): characterization of memory operations needed to be performed by Application B
- Convolution Method: mapping memory usage needs of Application B to the capabilities of Machine A

Communication Model

- Machine Profile (Machine A): characterization of network performance capabilities of Machine A
- Application Signature (Application B): characterization of network operations needed to be performed by Application B
- Convolution Method: mapping network usage needs of Application B to the capabilities of Machine A

The two models combine into the performance prediction of Application B on Machine A.

Machine Profile – Single processor model: collecting rates for Memory operations and FP operations

- Tables of a machine’s performance/rates for different operations collected via benchmarks. Sample:

[Sample table not reproduced; its callouts read “MAPS data” and “Currently set to theoretical peak”.]

Application Signature* – Single processor model: collecting type and number of Memory and FP operations, then simulating in cache simulator

- Trace of operations on the processor performed by an application (memory and FP ops on processor). Sample:

BB#202: 2.0E9, load, 99%, 100%, stride-one

BB#202: 1.9E3, FP

BB#303: 2.2E10, load, 52%, 63%, random

BB#303: 1.1E2, FP

Where the format is as follows:

Basic-block #: # memory ref., type, hit rates, access stride

The hit rates are the cache hit rates on the PREDICTED MACHINE for each basic block of the application. This additional information requires “processing” by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components.

- The trace of the application is collected and processed by the MetaSim Tracer.
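The signature format is regular enough to parse mechanically. The sketch below is a hypothetical reader for lines in the form shown on the slide; the function name and dictionary keys are invented, and it is not the MetaSim Tracer's own file format handling. The slide does not say which cache levels the two percentages refer to, so they are kept as an ordered list.

```python
def parse_signature_line(line):
    """Parse one line such as 'BB#202: 2.0E9, load, 99%, 100%, stride-one'."""
    label, _, rest = line.partition(":")
    fields = [field.strip() for field in rest.split(",")]
    record = {
        "basic_block": int(label.split("#")[1]),
        "count": float(fields[0]),   # number of memory references or FP ops
        "type": fields[1],           # e.g. 'load' or 'FP'
    }
    if len(fields) > 2:
        # Memory lines also carry hit rates (as percentages) and an access stride.
        record["hit_rates"] = [float(f.rstrip("%")) / 100.0 for f in fields[2:-1]]
        record["stride"] = fields[-1]
    return record


memory_record = parse_signature_line("BB#202: 2.0E9, load, 99%, 100%, stride-one")
fp_record = parse_signature_line("BB#202: 1.9E3, FP")
```

Records parsed this way could then be paired with machine rates for the matching stride and hit-rate regime to form the per-block terms of the convolution.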

How can one meaningfully rank supercomputers?

- Thresholded Inversions is a metric for evaluating rankings
- Basically, when a machine higher on the list runs an application slower than a machine lower on the list, that is an inversion
- We showed the Top500 list is rife with such inversions, 78% suboptimal compared to…
- A best list obtainable by brute force for any set of applications
- We used the framework to approach the quality of the best list by combining three simple HPC Challenge Benchmarks (Random Access, STREAM, and HPL) as guided by application traces
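Counting inversions for a candidate ranking can be sketched as follows. The slides do not define the threshold, so this sketch assumes it is a relative slack: a pair counts as an inversion only when the higher-ranked machine is slower by more than that fraction. The function name, machine names, and runtimes are hypothetical.

```python
def count_thresholded_inversions(ranking, runtimes, threshold=0.0):
    """Count inversions in a machine ranking.

    ranking:   list of machine names, best first
    runtimes:  dict mapping application -> {machine: runtime in seconds}
    threshold: assumed relative slack; the higher-ranked machine must be
               slower by more than this fraction for the pair to count
    """
    inversions = 0
    for app_times in runtimes.values():
        for hi in range(len(ranking)):
            for lo in range(hi + 1, len(ranking)):
                t_high_ranked = app_times[ranking[hi]]
                t_low_ranked = app_times[ranking[lo]]
                if t_high_ranked > t_low_ranked * (1.0 + threshold):
                    inversions += 1
    return inversions


# Hypothetical ranking and measured runtimes.
ranking = ["A", "B", "C"]
runtimes = {
    "app1": {"A": 10.0, "B": 12.0, "C": 11.0},
    "app2": {"A": 20.0, "B": 15.0, "C": 30.0},
}

strict_count = count_thresholded_inversions(ranking, runtimes)
tolerant_count = count_thresholded_inversions(ranking, runtimes, threshold=0.1)
```

A nonzero threshold keeps small runtime differences, which may be within run-to-run variation, from being counted against a ranking.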
