
The Politics and Economics of Parallel Computing Performance

Allan Snavely

UCSD Computer Science Dept. & SDSC


Computnik

  • Not many of us (not even me) are old enough to remember Sputnik

  • But recently U.S. technology received a similar shock


Japanese Earth Simulator

  • The world’s most powerful computer


Top500.org

  • HIGHLIGHTS FROM THE TOP 10

    • The Earth Simulator, built by NEC, remains the unchallenged #1 at > 30 TFlop/s

    • Its cost is conservatively estimated at $500M


Top500.org (continued)

  • ASCI Q at Los Alamos is #2 at 13.88 TFlop/s.

    • The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. This cluster is built with Apple G5 machines as building blocks and is often referred to as the 'SuperMac'.

    • The fourth system is also a cluster. The Tungsten cluster at NCSA is a Dell PowerEdge-based system using a Myrinet interconnect. It just missed the 10 TFlop/s mark with a measured 9.82 TFlop/s.


More top 500

More top 500

  • The list of clusters in the TOP10 continues with the upgraded Itanium2-based Hewlett-Packard system, located at DOE's Pacific Northwest National Laboratory, which uses a Quadrics interconnect.

  • #6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at the Los Alamos National Laboratory and also uses a Myrinet interconnect.

  • With the exception of the leading Earth Simulator, all other TOP10 systems are installed in the U.S.

  • The performance of the #10 system jumped to 6.6 TFlop/s.


The fine print

  • But how is performance measured?

  • Linpack is very compute intensive, not very memory or communications intensive, and it scales perfectly! (A sketch of why that matters follows below.)
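As a rough illustration (not from the original slides) of why a compute-intensive benchmark rewards peak flops, here is a minimal roofline-style sketch; the machine numbers and arithmetic intensities are invented for the example.

```python
# Roofline-style bound: attainable FLOP rate is limited either by peak
# floating-point throughput or by memory bandwidth times arithmetic intensity.
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)

peak, bw = 8.0, 4.0  # hypothetical machine: 8 GFLOP/s peak, 4 GB/s sustained bandwidth

# A Linpack-like dense kernel has high arithmetic intensity, so it runs near peak:
print(attainable_gflops(peak, bw, flops_per_byte=16.0))   # -> 8.0 (compute bound)

# A memory-bound kernel (e.g. sparse matrix-vector) sees only a fraction of peak:
print(attainable_gflops(peak, bw, flops_per_byte=0.25))   # -> 1.0 (bandwidth bound)
```

The second kind of kernel is exactly the behavior Linpack's single number says nothing about, which is the point the following slides build on.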


Axiom: You get what you ask for (or what you measure for)

Measures of goodness (the SUV):

  • Macho image
  • Big gas tank
  • Cargo space
  • Drive it offroad
  • Arnold drives one

Measures of goodness (the European compact):

  • Trendy Euro image
  • Fuel efficiency
  • Parking space
  • Drive it on narrow streets
  • Herr Schroeder drives one


HPC Users Forum and metrics

  • From the beginning we dealt with

    • Political issues

      • You get what you ask for (Top500 Macho Flops)

      • Policy makers need a number (Macho Flops)

      • You measure what makes you look good (Macho Flops)

    • Technical issues

      • Recent reports (HECRTF, SCALES) echo our earlier consensus that time-to-solution (TTS) is the HPC metric

      • But TTS is complicated and problem dependent (and policy makers need a number)

      • Is it even technically feasible to encompass TTS in one or a few low-level metrics?


A science of performance

  • A model is a calculable explanation of why a {program, application, input, …} tuple performs as it does

  • Should yield a prediction (quantifiable objective)

    • Accurate predictions of observable performance points give you some confidence in methods (as for example to allay fears of perturbation via intrusion)

  • Performance models embody understanding of the factors that affect performance

    • Inform the tuning process (of application and machine)

    • Guide applications to the best machine

    • Enable applications driven architecture design

    • Extrapolate to the performance of future systems (see the sketch after this list)
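To make the "confidence from observable points" and "extrapolate" bullets concrete, here is a minimal sketch; all of the timings below are invented for illustration, not measurements from the talk.

```python
# Compare model predictions against runs we can actually measure; bounded error
# at those points lends confidence to predictions at points we cannot measure.
observed  = {16: 410.0, 32: 215.0, 64: 118.0}                 # CPUs -> measured seconds
predicted = {16: 395.0, 32: 208.0, 64: 112.0, 128: 61.0}      # CPUs -> modeled seconds

for cpus, measured in observed.items():
    error = abs(predicted[cpus] - measured) / measured
    print(f"{cpus:4d} CPUs: predicted {predicted[cpus]:6.1f} s, "
          f"measured {measured:6.1f} s, error {error:5.1%}")

# The 128-CPU configuration was never run; the validated model supplies the estimate.
print(f" 128 CPUs: extrapolated {predicted[128]:.1f} s")
```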

Goals for performance modeling tools and methods

  • Performance should map back to a small set of orthogonal benchmarks

  • Generation of performance models should be automated, or at least as regular and systematized as possible

  • Performance models must be time-tractable

  • Error is acceptable if it is bounded and allows meeting these objectives

  • Taking these principles to extremes would allow dynamic, automatic performance improvement via adaptation (this is open research)

A useful framework

  • Machine Profiles - characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application

  • Application Signature - detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine

    Combine Machine Profile and Application Signature using:

  • Convolution Methods - algebraic mappings of the Application Signatures onto the Machine Profiles to arrive at a performance prediction (a minimal sketch follows below)
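A minimal sketch of how these three pieces fit together, assuming a machine profile is a table of sustained operation rates and an application signature is a table of operation counts; this illustrates the framework only and is not the PMaC implementation, and all names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class MachineProfile:
    rates: dict  # sustained rates (ops/s) for fundamental operations, application-independent

@dataclass
class ApplicationSignature:
    counts: dict  # counts of fundamental operations, machine-independent

def convolve(profile: MachineProfile, signature: ApplicationSignature) -> float:
    """Algebraic mapping of the signature onto the profile: predicted seconds."""
    return sum(count / profile.rates[op] for op, count in signature.counts.items())

machine_a = MachineProfile({"flop": 2.0e9, "l1_load": 1.0e9, "mem_load": 2.0e8})
app_1 = ApplicationSignature({"flop": 6.0e11, "l1_load": 4.0e11, "mem_load": 3.0e10})
print(f"predicted time for App. #1 on Machine A: {convolve(machine_a, app_1):.0f} s")
```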

PMaC HPC Benchmark Suite

  • The goal is to develop means to infer the execution time of full applications at scale from low-level metrics taken on (smaller) prototype systems

    • To do this in a systematic, even automated way

      • To be able to compare apples and oranges

      • To enable wide workload characterizations

    • To keep number of metrics compact

      • Add metrics only to increase resolution

        Go to web page www.sdsc.edu/PMaC


Machine Profiles Single Processor Component – MAPS

  • Machine Profiles are useful for:

    • revealing the underlying capability of the machine

    • comparing machines

  • Machine Profiles are produced by:

    MAPS (Memory Access Pattern Signature), which, along with the rest of the PMaC HPC Benchmark Suite, is available at www.sdsc.edu/PMaC (a rough sketch of the kind of measurement it makes follows below)
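For orientation, here is a rough sketch of the kind of measurement a MAPS-style probe makes: achievable memory bandwidth as the working set grows past successive cache levels. This is illustrative Python, not the actual MAPS code (which is compiled and far more careful); interpreter overhead dominates here, so the absolute numbers are meaningless.

```python
import array
import time

def sweep_bandwidth(n_words, repeats=5):
    """Approximate read bandwidth (MB/s) for a unit-stride sweep over n_words doubles."""
    data = array.array("d", [0.0] * n_words)
    checksum = 0.0
    start = time.perf_counter()
    for _ in range(repeats):
        for x in data:          # unit-stride read of the whole working set
            checksum += x
    elapsed = time.perf_counter() - start
    return (repeats * n_words * 8) / elapsed / 1e6

for n_words in (1 << 12, 1 << 15, 1 << 18, 1 << 21):   # working sets spanning cache levels
    print(f"{n_words * 8 // 1024:6d} KiB working set: ~{sweep_bandwidth(n_words):9.1f} MB/s")
```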


Convolutions put the two together: modeling deep memory hierarchies

MetaSim trace collected on PETSc Matrix-Vector code, 4 CPUs, with user-supplied memory parameters for PSC’s TCSini

  • Single-processor or per-processor performance:

  • Machine profile for processor (Machine A)

  • Application Signature for application (App. #1)

  • The relative “per-processor” performance of App. #1 on Machine A is represented as the MetaSim Number
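The formula itself did not survive in this transcript. As a hedged reconstruction of the general idea only (the field names and the bandwidth lookup are assumptions, not the published definition), the MetaSim Number combines the traced per-basic-block memory references with the bandwidth the machine sustains, per MAPS, for each block's access pattern and working-set size:

```python
def metasim_number(basic_blocks, maps_bandwidth):
    """Estimated per-processor memory time (s): bytes moved / achievable bandwidth, per block."""
    total = 0.0
    for block in basic_blocks:
        # block: {"refs": ..., "bytes_per_ref": ..., "pattern": ..., "working_set": ...}
        bw = maps_bandwidth(block["pattern"], block["working_set"])  # bytes/s, from the MAPS profile
        total += block["refs"] * block["bytes_per_ref"] / bw
    return total
```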


MetaSim: CPU events convolver (pick simple models to apply to each basic block)

Output: 5 different convolutions

  • Meta1: memory time
  • Meta2: memory time + FP time
  • Meta3: max(memory time, FP time)
  • Meta4: 0.5 × memory time + 0.5 × FP time
  • Meta5: 0.9 × memory time + 0.1 × FP time
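A minimal sketch of the five convolutions as listed above, applied to one basic block; the 3.0 s memory time and 1.0 s floating-point time are arbitrary example values.

```python
def convolutions(mem_time, fp_time):
    """The five per-basic-block time estimates listed on the slide."""
    return {
        "Meta1": mem_time,                        # memory time only
        "Meta2": mem_time + fp_time,              # no overlap of memory and FP work
        "Meta3": max(mem_time, fp_time),          # perfect overlap
        "Meta4": 0.5 * mem_time + 0.5 * fp_time,  # equal-weighted blend
        "Meta5": 0.9 * mem_time + 0.1 * fp_time,  # memory-dominated blend
    }

for name, t in convolutions(3.0, 1.0).items():
    print(f"{name}: {t:.1f} s")
```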


Dimemas: communications events convolver

Simple communication models are applied to each communication event (a minimal sketch is below)
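A minimal sketch of the kind of simple model such a convolver applies to each event: a fixed per-message latency plus size over bandwidth. The parameter values and the trace are invented for illustration and are not Dimemas's actual model or defaults.

```python
def message_time(size_bytes, latency_s=5e-6, bandwidth_bytes_per_s=250e6):
    """Time to deliver one point-to-point message under a latency + size/bandwidth model."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

trace = [8, 1024, 65536, 1_000_000]   # hypothetical message sizes from a communication trace
total = sum(message_time(size) for size in trace)
print(f"modeled communication time: {total * 1e3:.3f} ms")
```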


POP results graphically

  • [Chart: POP results plotted as seconds per simulation day]

Quality of model predictions for POP

Explaining Relative Performance of POP


POP Performance Sensitivity

[Charts: POP performance sensitivity, plotting 1/execution time against normalized processor performance, latency, and bandwidth]


Practical uses

  • DoD HPCMO procurement cycle

    • Identify strategic applications

    • Identify candidate machines

    • Run PMaC HPC Benchmark Probes on (prototypes of) machines

    • Use tools to model applications on exemplary inputs

    • Generate performance expectations

    • Input to a solver that factors in performance, cost, architectural diversity, and the whim of the program director

  • DARPA HPCS program

    • Help vendors evaluate performance impacts of proposed architectural features


Acknowledgments

  • This work was sponsored in part by the Department of Energy Office of Science through the SciDAC award “High-End Computer System Performance: Science and Engineering”, by the Department of Defense High Performance Computing Modernization Program office through the award “HPC Applications Benchmarking”, and by DARPA through the award “HEC Metrics”.

  • This research was supported in part by NSF cooperative agreement ACI-9619020 through computing resources provided by the National Partnership for Advanced Computational Infrastructure at the San Diego Supercomputer Center.

  • Computer time was provided by the Pittsburgh Supercomputing Center, the Texas Advanced Computing Center, Oak Ridge National Laboratory, and ERDC.

  • We would like to thank Francesc Escale of CEPBA for all his help with Dimemas, and Pat Worley for all his help with POP.

