
Adaptive MPI: Intelligent runtime strategies  and performance prediction via simulation

Laxmikant Kale

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

[email protected]

PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

    • Performance: scalable to thousands of processors

    • Productivity: of human programmers

    • Complex: irregular structure, dynamic variations

  • Approach: Application Oriented yet CS centered research

    • Develop enabling technology, for a wide collection of apps.

    • Develop, use and test it in the context of real applications

  • How?

    • Develop novel Parallel programming techniques

    • Embody them into easy to use abstractions

    • So, application scientists can use advanced techniques with ease

    • Enabling technology: reused across many apps


Develop abstractions in the context of full-scale applications

Protein Folding

Quantum Chemistry (QM/MM)

Molecular Dynamics

Computational Cosmology

Parallel Objects,

Adaptive Runtime System

Libraries and Tools

Crack Propagation

Dendritic Growth

Space-time meshes

Rocket Simulation

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE


Benefits

  • Software engineering

    • Number of virtual processors can be independently controlled

    • Separate VPs for different modules

  • Message driven execution

    • Adaptive overlap of communication

    • Predictability:

      • Automatic out-of-core

    • Asynchronous reductions

  • Dynamic mapping

    • Heterogeneous clusters

      • Vacate, adjust to speed, share

    • Automatic checkpointing

    • Change set of processors used

    • Automatic dynamic load balancing

    • Communication optimization

System implementation

User View

Migratable Objects (aka Processor Virtualization)

Programmer: [Over] decomposition into virtual processors

Runtime: assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI

Outline

Adaptive MPI

Load Balancing

Fault tolerance


Performance analysis

Performance prediction




AMPI: MPI with Virtualization

[Figure: MPI “processes” implemented as virtual processes (user-level migratable threads), mapped onto real processors]

  • Each virtual process implemented as a user-level thread embedded in a Charm object

Making AMPI work

  • Multiple user-level threads per processor:

    • Problems with global variables

    • Solution 1:

      • Automatic: switch GOT pointer at context switch

      • Available on most machines

    • Solution 2: Manual: replace global variables

    • Solution 3: Automatic: via compiler support (AMPIzer)

  • Migrating Stacks

    • Use the isomalloc technique (Mehaut et al.)

    • Use memory files and mmap()

  • Heap data:

    • Isomalloc heaps

    • OR user supplied pack/unpack functions for the heap data

ELF and global variables

  • The Executable and Linking Format (ELF)

    • Executable has a Global Offset Table containing global data

    • The GOT pointer is stored in the %ebx register

    • Switch this pointer when switching between threads

    • Support on Linux, Solaris 2.x, and more

  • Integrated in Charm++/AMPI

    • Invoked via the compile-time option -swapglobal

Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratio of 1 and 8.

Performance

Slightly worse w/o optimization

Being improved


Small number of PEs available

Special requirement by algorithm

Comparison with Native MPI

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

AMPI runs on any # of PEs (e.g. 19, 33, 105). Native MPI needs a cube number.

AMPI Extensions

  • Automatic load balancing

  • Non-blocking collectives

  • Checkpoint/restart mechanism

  • Multi-module programming

Load Balancing in AMPI

  • Automatic load balancing: MPI_Migrate()

    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.

Load Balancing Steps

Regular Timesteps

Detailed, aggressive Load Balancing: object migration

Instrumented Timesteps

Refinement Load Balancing

Load Balancing

Aggressive Load Balancing

Refinement Load Balancing

Processor Utilization against Time on 128 and 1024 processors

On 128 processors, a single load balancing step suffices, but

On 1024 processors, we need a “refinement” step.

Shrink/Expand

  • Problem: Availability of computing platform may change

  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster. An additional 16 nodes become available at step 600.

Radix Sort

Optimized All-to-all “Surprise”

Completion time vs. computation overhead: for a 76-byte all-to-all on Lemieux, the CPU is free during most of the time taken by the collective operation.

This observation led to the development of the Asynchronous Collectives now supported in AMPI.

[Figure: AAPC completion time (ms) vs. message size (bytes)]

Asynchronous Collectives

  • Our implementation is asynchronous

    • Collective operation posted

    • Test/wait for its completion

    • Meanwhile useful computation can utilize CPU

      MPI_Ialltoall(sendbuf, cnt, type, recvbuf, cnt, type, comm, &req);

      /* ... other useful computation overlaps the collective ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);


Fault Tolerance

Motivation

  • Applications need fast, low cost and scalable fault tolerance support

  • As machines grow in size

    • MTBF decreases

    • Applications have to tolerate faults

  • Our research

    • Disk based Checkpoint/Restart

    • In Memory Double Checkpointing/Restart

    • Sender based Message Logging

    • Proactive response to fault prediction

      • (impending fault response)

Checkpoint/Restart Mechanism

  • Automatic Checkpointing for AMPI and Charm++

    • Migrate objects to disk!

    • Automatic fault detection and restart

    • Now available in distribution version of AMPI and Charm++

  • Blocking Co-ordinated Checkpoint

    • States of chares are checkpointed to disk

    • Collective call MPI_Checkpoint(DIRNAME)

  • The entire job is restarted

    • Virtualization allows restarting on different # of processors

    • Runtime option

      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME

  • Simple but effective for common cases
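
Putting the pieces together, a checkpointing run might look like this (the command lines are illustrative; DIR is whatever checkpoint directory you pass to MPI_Checkpoint):

```shell
# initial run: 4 physical processors, 16 virtual processors
./charmrun pgm +p4 +vp16
# (inside pgm, all ranks periodically make the collective call
#  MPI_Checkpoint("DIR"), writing object state to disk)

# after a failure: restart the whole job from the last checkpoint;
# virtualization lets the restart use a different number of PEs
./charmrun pgm +p8 +vp16 +restart DIR
```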

In-memory Double Checkpoint

  • In-memory checkpoint

    • Faster than disk

  • Co-ordinated checkpoint

    • Simple MPI_MemCheckpoint(void)

    • User can decide what makes up useful state

  • Double checkpointing

    • Each object maintains 2 checkpoints:

    • Local physical processor

    • Remote “buddy” processor

  • For jobs with large memory

    • Use local disks!

32 processors with 1.5GB memory each

Restart

  • A “Dummy” process is created:

    • Need not have application data or checkpoint

    • Necessary for runtime

    • Starts recovery on all other Processors

  • Other processors:

    • Remove all chares

    • Restore checkpoints lost on the crashed PE

    • Restore chares from local checkpoints

  • Load balance after restart

Restart Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps

Scalable Fault Tolerance

  • How?

    • Sender-side message logging

    • Asynchronous Checkpoints on buddy processors

    • Latency tolerance mitigates costs

    • Restart can be sped up by spreading out objects from the failed processor

  • Current progress

    • Basic scheme idea implemented and tested in simple programs

    • General purpose implementation in progress

Motivation: When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”

Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)

Projections: Performance Visualization and Analysis Tool

An Introduction to Projections

  • Performance Analysis tool for Charm++ based applications.

  • Automatic trace instrumentation.

  • Post-mortem visualization.

  • Multiple advanced features that support:

    • Data volume control

    • Generation of additional user data.

Trace Generation

  • Automatic instrumentation by runtime system

  • Detailed

    • In the log mode each event is recorded in full detail (including timestamp) in an internal buffer.

  • Summary

    • Reduces the size of output files and memory overhead.

    • Produces a few lines of output data per processor.

    • This data is recorded in bins corresponding to intervals of size 1 ms by default.

  • Flexible

    • APIs and runtime options for instrumenting user events and data generation control.

The Summary View

  • Provides a view of the overall utilization of the application.

  • Very quick to load.

Graph View

  • Features:

    • Selectively view Entry points.

    • Convenient means to switch between axis data types.

Timeline

  • The most detailed view in Projections.

  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.

Animations


Time Profile

  • Identified a portion of CPAIMD (Quantum Chemistry code) that ran too early via the Time Profile tool.

Solved by prioritizing entry methods


Overview: one line for each processor, time on X-axis

White: busy; Red: intermediate

A boring but good-performance overview

An interesting but pathetic overview

Stretch Removal

Histogram Views: number of function executions vs. their granularity (note: log scale on Y-axis)

After Optimizations

About 5 large stretched calls, largest of them much smaller, and almost all calls take less than 3.2 ms

Before Optimizations

Over 16 large stretched calls

Miscellaneous Features - Color Selection

  • Colors are automatically supplied by default.

  • We allow users to select their own colors and save them.

  • These colors can then be restored the next time Projections loads.

User APIs

  • Controlling trace generation

    • void traceBegin()

    • void traceEnd()

  • Tracing User Events

    • int traceRegisterUserEvent(char *, int)

    • void traceUserEvent(char *)

    • void traceUserBracketEvent(int, double, double)

    • double CmiWallTimer()

  • Runtime options:

    • +traceoff

    • +traceroot <directory>

    • Projections mode only:

      • +logsize <# entries>

      • +gz-trace

    • Summary mode only:

      • +bincount <# of intervals>

      • +binsize <interval time quanta (us)>

Performance Prediction


Parallel Simulation

BigSim: Performance Prediction

  • Extremely large parallel machines are already around or about to be available:

    • ASCI Purple (12k, 100TF)

    • Bluegene/L (64k, 360TF)

    • Bluegene/C (1M, 1PF)

  • How to write a petascale application?

  • What will the performance be like?

  • Would existing parallel applications scale?

    • The machines are not there yet

    • Parallel Performance is hard to model without actually running the program

Objectives and Simulation Model

  • Objectives:

    • Develop techniques to facilitate the development of efficient peta-scale applications

    • Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

    • Focus on Charm++ and AMPI programming models

    • Performance prediction based on PDES

    • Supports varying levels of fidelity

      • processor prediction, network prediction.

    • Modes of execution :

      • online and post-mortem modes

Blue Gene Emulator/Simulator

  • Actually: BigSim, for simulation of any large machine using smaller parallel machines

  • Emulator:

    • Allows development of programming environment and algorithms before the machine is built

    • Allowed us to port Charm++ to real BG/L in 1-2 days

  • Simulator:

    • Allows performance analysis of programs running on large machines, without access to the large machines

    • Uses Parallel Discrete Event Simulation

Architecture of BigNetSim

Simulation Details

  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million way parallelism (implemented using user-threads).

    • Consider memory and stack-size limitations

  • Ensure time-stamp correction

  • Emulator layer API is built on top of machine layer

  • Charm++/AMPI implemented on top of emulator, like any other machine layer

  • Emulator layer supports all Charm features:

    • Load-balancing

    • Communication optimizations

Performance Prediction

  • Usefulness of performance prediction:

    • Application developer (making small modifications)

      • Difficult to get runtimes on huge current machines

      • For future machines, simulation is the only possibility

      • Performance debugging cycle can be considerably reduced

      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc

    • Machine architecture designer

      • Knowledge of how target applications behave on it can help identify problems with the machine design early

  • Record traces during parallel emulation

  • Run trace-driven simulation (PDES)

Performance Prediction (contd.)

  • Predicting time of sequential code:

    • User supplied time for every code block

    • Wall-clock measurements on simulating machine can be used via a suitable multiplier

    • Hardware performance counters to count floating point, integer, branch instructions, etc

      • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio

    • Instruction level simulation (not implemented)

  • Predicting Network performance:

    • No contention: time based on topology and other network parameters

    • Back-patching: modifies communication time using the amount of communication activity

    • Network simulation: models the network entirely

Performance Prediction Validation

  • 7-point stencil program with 3D decomposition

    • Run on 32 real processors, simulating 64, 128,... PEs

  • NAMD benchmark Apo-Lipoprotein A1 atom dataset with 92k atoms, running for 15 timesteps

    • For large processor counts, because of cache and memory effects, the predicted value seems to diverge from the actual value

Performance on Large Machines

  • Problem:

    • How to predict performance of applications on future machines? (E.g. BG/L)

    • How to do performance tuning without continuous access to large machine?

  • Solution:

    • Leverage virtualization

    • Develop machine emulator

    • Simulator: accurate time modeling

    • Run program on “100,000 processors” using only hundreds of processors

  • Analysis:

    • Use performance viz. suite (projections)

  • Molecular Dynamics Benchmark

  • ER-GRE: 36573 atoms

    • 1.6 million objects

    • 8 step simulation

    • 16k BG processors

Projections: Performance visualization

Network Simulation

  • Detailed implementation of interconnection networks

  • Configurable network parameters

    • Topology / Routing

    • Input / Output VC selection

    • Bandwidth / Latency

    • NIC parameters

    • Buffer / Message size, etc

  • Support for hardware collectives in network layer

Higher level programming

  • Orchestration language

    • Allows expressing global control flow in a charm program

    • HPF-like flavor, but with Charm++-like processor virtualization and explicit communication

  • Multiphase Shared Arrays

    • Provides a disciplined use of shared address space

    • Each array can be accessed only in one of the following modes:

      • ReadOnly, Write-by-One-Thread, Accumulate-only

    • Access mode can change from phase to phase

    • Phases delineated by per-array “sync”

Other Projects

Faucets

Flexible cluster scheduler

resource management across clusters

Multi-cluster applications

Load balancing strategies

Communication optimization

POSE: Parallel discrete event simulation


Parallel framework for unstructured meshes

Invite collaborations:

Virtualization of other languages and libraries

New load balancing strategies



Some Active Collaborations

Biophysics: Molecular Dynamics (NIH, ..)

Long-standing (1991-): Klaus Schulten, Bob Skeel

Gordon Bell award in 2002

Production program used by biophysicists

Quantum Chemistry (NSF)

QM/MM via Car-Parrinello method +

Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman,

Nick Nystrom, Josep Torrellas, Laxmikant Kale

Material simulation (NSF)

Dendritic growth, quenching, space-time meshes, QM/FEM

R. Haber, D. Johnson, J. Dantzig, +

Rocket simulation (DOE)

DOE, funded ASCI center

Mike Heath, +30 faculty

Computational Cosmology (NSF, NASA)


Scalable Visualization:


Simulation of Plasma



Summary

  • We are pursuing a broad agenda

    • aimed at productivity and performance in parallel programming

    • Intelligent Runtime System for adaptive strategies

    • Charm++/AMPI are production level systems

    • Support dynamic load balancing, communication optimizations

    • Performance prediction capabilities based on simulation

    • Basic Fault tolerance, performance viz tools are part of the suite

    • Application-oriented yet Computer Science centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC
