Adaptive MPI: Intelligent runtime strategies  and performance prediction via simulation

Laxmikant Kale

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana Champaign


PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

    • Performance: scalable to thousands of processors

    • Productivity: of human programmers

    • Complex: irregular structure, dynamic variations

  • Approach: application-oriented yet CS-centered research

    • Develop enabling technology for a wide collection of apps.

    • Develop, use and test it in the context of real applications

  • How?

    • Develop novel Parallel programming techniques

    • Embody them into easy to use abstractions

    • So that application scientists can use advanced techniques with ease

    • Enabling technology: reused across many apps


Develop abstractions in context of full-scale applications

Protein Folding

Quantum Chemistry (QM/MM)

Molecular Dynamics

Computational Cosmology

Parallel Objects,

Adaptive Runtime System

Libraries and Tools

Crack Propagation

Dendritic Growth

Space-time meshes

Rocket Simulation

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE


Migratable Objects (aka Processor Virtualization)

Benefits

  • Software engineering

    • Number of virtual processors can be independently controlled

    • Separate VPs for different modules

  • Message driven execution

    • Adaptive overlap of communication

    • Predictability:

      • Automatic out-of-core

    • Asynchronous reductions

  • Dynamic mapping

    • Heterogeneous clusters

      • Vacate, adjust to speed, share

    • Automatic checkpointing

    • Change set of processors used

    • Automatic dynamic load balancing

    • Communication optimization

System implementation

User View

Migratable Objects (aka Processor Virtualization)

Programmer: [Over] decomposition into virtual processors

Runtime: Assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI


Outline

Adaptive MPI

Load Balancing

Fault tolerance


Performance analysis

Performance prediction




AMPI: MPI with Virtualization

  • MPI “processes” are implemented as virtual processes: user-level migratable threads, each embedded in a Charm object and mapped by the runtime onto the real processors


Making AMPI work

  • Multiple user-level threads per processor:

    • Problems with global variables

    • Solution 1:

      • Automatic: switch the GOT pointer at context switch

      • Available on most machines

    • Solution 2: manual: replace global variables

    • Solution 3: automatic: via compiler support (AMPIzer)

  • Migrating Stacks

    • Use isomalloc technique (Mehaut et al.)

    • Use memory files and mmap()

  • Heap data:

    • Isomalloc heaps

    • OR user supplied pack/unpack functions for the heap data


ELF and global variables

  • The Executable and Linking Format (ELF)

    • Executable has a Global Offset Table containing global data

    • GOT pointer is stored in the %ebx register

    • Switch this pointer when switching between threads

    • Supported on Linux, Solaris 2.x, and more

  • Integrated in Charm++/AMPI

    • Invoke by compile time option -swapglobal


Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995


Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.


Comparison with Native MPI

  • Performance: slightly worse without optimizations; being improved

  • AMPI runs on any # of PEs (e.g. 19, 33, 105); native MPI needs a cube #

    • Useful when only a small number of PEs is available, or when the algorithm has special requirements

  • Problem setup: 3D stencil calculation of size 240³ run on Lemieux.


AMPI Extensions

  • Automatic load balancing

  • Non-blocking collectives

  • Checkpoint/restart mechanism

  • Multi-module programming


Load Balancing in AMPI

  • Automatic load balancing: MPI_Migrate()

    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.


Load Balancing Steps

  • Regular timesteps

  • Instrumented timesteps

  • Detailed, aggressive load balancing: object migration

  • Refinement load balancing


Load Balancing

Aggressive Load Balancing

Refinement Load Balancing

Processor utilization against time on 128 and 1024 processors.

On 128 processors, a single load balancing step suffices, but on 1024 processors a “refinement” step is needed.


Shrink/Expand

  • Problem: Availability of computing platform may change

  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.


Radix Sort

Optimized All-to-all “Surprise”

Completion time vs. computation overhead for a 76-byte all-to-all on Lemieux (AAPC completion time in ms against message size in bytes).

The CPU is free during most of the time taken by a collective operation. This observation led to the development of asynchronous collectives, now supported in AMPI.




Asynchronous Collectives

  • Our implementation is asynchronous

    • Collective operation posted

    • Test/wait for its completion

    • Meanwhile useful computation can utilize CPU

      MPI_Ialltoall( … , &req);

      /* other useful computation */

      MPI_Wait(&req, &status);



Fault Tolerance


Motivation

  • Applications need fast, low cost and scalable fault tolerance support

  • As machines grow in size

    • MTBF decreases

    • Applications have to tolerate faults

  • Our research

    • Disk based Checkpoint/Restart

    • In Memory Double Checkpointing/Restart

    • Sender based Message Logging

    • Proactive response to fault prediction

      • (impending fault response)


Checkpoint/Restart Mechanism

  • Automatic Checkpointing for AMPI and Charm++

    • Migrate objects to disk!

    • Automatic fault detection and restart

    • Now available in distribution version of AMPI and Charm++

  • Blocking Co-ordinated Checkpoint

    • States of chares are checkpointed to disk

    • Collective call MPI_Checkpoint(DIRNAME)

  • The entire job is restarted

    • Virtualization allows restarting on different # of processors

    • Runtime option

      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME

  • Simple but effective for common cases


In-memory Double Checkpoint

  • In-memory checkpoint

    • Faster than disk

  • Co-ordinated checkpoint

    • Simple MPI_MemCheckpoint(void)

    • User can decide what makes up useful state

  • Double checkpointing

    • Each object maintains 2 checkpoints:

      • on the local physical processor

      • on a remote “buddy” processor

  • For jobs with large memory

    • Use local disks!

32 processors with 1.5GB memory each


Restart

  • A “Dummy” process is created:

    • Need not have application data or checkpoint

    • Necessary for runtime

    • Starts recovery on all other processors

  • Other processors:

    • Remove all chares

    • Restore checkpoints lost on the crashed PE

    • Restore chares from local checkpoints

  • Load balance after restart


Restart Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps


Scalable Fault Tolerance

  • How?

    • Sender-side message logging

    • Asynchronous Checkpoints on buddy processors

    • Latency tolerance mitigates costs

    • Restart can be sped up by spreading out objects from the failed processor

  • Current progress

    • Basic scheme idea implemented and tested in simple programs

    • General purpose implementation in progress

Motivation: When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”


Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)


Projections: a performance visualization and analysis tool


An Introduction to Projections

  • Performance Analysis tool for Charm++ based applications.

  • Automatic trace instrumentation.

  • Post-mortem visualization.

  • Multiple advanced features that support:

    • Data volume control

    • Generation of additional user data.


Trace Generation

  • Automatic instrumentation by runtime system

  • Detailed

    • In the log mode each event is recorded in full detail (including timestamp) in an internal buffer.

  • Summary

    • Reduces the size of output files and memory overhead.

    • Produces a few lines of output data per processor.

    • This data is recorded in bins corresponding to intervals of size 1 ms by default.

  • Flexible

    • APIs and runtime options for instrumenting user events and data generation control.


The Summary View

  • Provides a view of the overall utilization of the application.

  • Very quick to load.


Graph View

  • Features:

    • Selectively view Entry points.

    • Convenient means to switch between axis data types.


Timeline

  • The most detailed view in Projections.

  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.


Animations



Time Profile

  • Identified, via the Time Profile tool, a portion of CPAIMD (a quantum chemistry code) that ran too early.

Solved by prioritizing entry methods



Overview: one line for each processor, time on the X-axis

White: busy; red: intermediate


A boring but good-performance overview


An interesting but pathetic overview


Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on Y-axis).

Before optimizations: over 16 large stretched calls.

After optimizations: about 5 large stretched calls remain, the largest of them much smaller, and almost all calls take less than 3.2 ms.


Miscellaneous Features: Color Selection

  • Colors are automatically supplied by default.

  • We allow users to select their own colors and save them.

  • These colors can then be restored the next time Projections loads.


User APIs

  • Controlling trace generation

    • void traceBegin()

    • void traceEnd()

  • Tracing user events

    • int traceRegisterUserEvent(char *, int)

    • void traceUserEvent(char *)

    • void traceUserBracketEvent(int, double, double)

    • double CmiWallTimer()

  • Runtime options:

    • +traceoff

    • +traceroot <directory>

    • Projections mode only:

      • +logsize <# entries>

      • +gz-trace

    • Summary mode only:

      • +bincount <# of intervals>

      • +binsize <interval time quanta (us)>


Performance Prediction


Parallel Simulation


BigSim: Performance Prediction

  • Extremely large parallel machines are already around or about to be available:

    • ASCI Purple (12k, 100TF)

    • Bluegene/L (64k, 360TF)

    • Bluegene/C (1M, 1PF)

  • How to write a petascale application?

  • What will the performance be like?

  • Would existing parallel applications scale?

    • The machines are not there yet

    • Parallel performance is hard to model without actually running the program


Objectives and Simulation Model

  • Objectives:

    • Develop techniques to facilitate the development of efficient peta-scale applications

    • Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

    • Focus on Charm++ and AMPI programming models

    • Performance prediction based on PDES

    • Supports varying levels of fidelity

      • processor prediction, network prediction.

    • Modes of execution:

      • online and post-mortem mode


Blue Gene Emulator/Simulator

  • Actually: BigSim, for simulation of any large machine using smaller parallel machines

  • Emulator:

    • Allows development of programming environment and algorithms before the machine is built

    • Allowed us to port Charm++ to real BG/L in 1-2 days

  • Simulator:

    • Allows performance analysis of programs running on large machines, without access to the large machines

    • Uses Parallel Discrete Event Simulation


Architecture of BigNetSim


Simulation Details

  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million-way parallelism (implemented using user-level threads).

    • Consider memory and stack-size limitations

  • Ensure time-stamp correction

  • Emulator layer API is built on top of machine layer

  • Charm++/AMPI implemented on top of emulator, like any other machine layer

  • Emulator layer supports all Charm features:

    • Load-balancing

    • Communication optimizations


Performance Prediction

  • Usefulness of performance prediction:

    • Application developer (making small modifications)

      • Difficult to get runtimes on huge current machines

      • For future machines, simulation is the only possibility

      • Performance debugging cycle can be considerably reduced

      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc

    • Machine architecture designer

      • Knowledge of how target applications behave on it can help identify problems with the machine design early

  • Record traces during parallel emulation

  • Run trace-driven simulation (PDES)


Performance Prediction (contd.)

  • Predicting time of sequential code:

    • User supplied time for every code block

    • Wall-clock measurements on the simulating machine can be used via a suitable multiplier

    • Hardware performance counters to count floating point, integer, branch instructions, etc

      • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio

    • Instruction level simulation (not implemented)

  • Predicting Network performance:

    • No contention: time based on topology & other network parameters

    • Back-patching: modifies communication time using the amount of communication activity

    • Network simulation: models the network entirely


Performance Prediction Validation

  • 7-point stencil program with 3D decomposition

    • Run on 32 real processors, simulating 64, 128,... PEs

  • NAMD benchmark: Apo-Lipoprotein A1 dataset with 92k atoms, running for 15 timesteps

    • For large processor counts, the predicted value diverges from the actual value because of cache and memory effects


Performance on Large Machines

  • Problem:

    • How to predict performance of applications on future machines? (E.g. BG/L)

    • How to do performance tuning without continuous access to large machine?

  • Solution:

    • Leverage virtualization

    • Develop machine emulator

    • Simulator: accurate time modeling

    • Run program on “100,000 processors” using only hundreds of processors

  • Analysis:

    • Use performance viz. suite (projections)

  • Molecular Dynamics Benchmark

  • ER-GRE: 36573 atoms

    • 1.6 million objects

    • 8 step simulation

    • 16k BG processors


Projections: Performance Visualization


Network Simulation

  • Detailed implementation of interconnection networks

  • Configurable network parameters

    • Topology / Routing

    • Input / Output VC selection

    • Bandwidth / Latency

    • NIC parameters

    • Buffer / Message size, etc

  • Support for hardware collectives in network layer


Higher level programming

  • Orchestration language

    • Allows expressing global control flow in a Charm++ program

    • HPF-like flavor, but with Charm++-like processor virtualization and explicit communication

  • Multiphase Shared Arrays

    • Provides a disciplined use of shared address space

    • Each array can be accessed only in one of the following modes:

      • ReadOnly, Write-by-One-Thread, Accumulate-only

    • Access mode can change from phase to phase

    • Phases delineated by per-array “sync”


Other Projects

  • Faucets: flexible cluster scheduler, resource management across clusters

  • Multi-cluster applications

  • Load balancing strategies

  • Communication optimization

  • POSE: parallel discrete event simulation

  • Parallel framework for unstructured meshes

  • Invite collaborations:

    • Virtualization of other languages and libraries

    • New load balancing strategies


Some Active Collaborations

  • Biophysics: Molecular Dynamics (NIH, ..)

    • Long standing (1991-): Klaus Schulten, Bob Skeel

    • Gordon Bell award in 2002

    • Production program used by biophysicists

  • Quantum Chemistry (NSF)

    • QM/MM via Car-Parrinello method +

    • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrelas, Laxmikant Kale

  • Material simulation (NSF)

    • Dendritic growth, quenching, space-time meshes, QM/FEM

    • R. Haber, D. Johnson, J. Dantzig, +

  • Rocket simulation (DOE)

    • DOE-funded ASCI center

    • Mike Heath, +30 faculty

  • Computational Cosmology (NSF, NASA)

  • Scalable Visualization

  • Simulation of Plasma


Summary

  • We are pursuing a broad agenda

    • aimed at productivity and performance in parallel programming

    • Intelligent Runtime System for adaptive strategies

    • Charm++/AMPI are production level systems

    • Support dynamic load balancing, communication optimizations

    • Performance prediction capabilities based on simulation

    • Basic Fault tolerance, performance viz tools are part of the suite

    • Application-oriented yet Computer Science centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC