
Adaptive MPI: Intelligent runtime strategies  and performance prediction via simulation

Laxmikant Kale

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

[email protected]


PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

    • Performance: scalable to thousands of processors

    • Productivity: of human programmers

    • Complex: irregular structure, dynamic variations

  • Approach: Application Oriented yet CS centered research

    • Develop enabling technology, for a wide collection of apps.

    • Develop, use and test it in the context of real applications

  • How?

    • Develop novel Parallel programming techniques

    • Embody them into easy to use abstractions

    • So, application scientist can use advanced techniques with ease

    • Enabling technology: reused across many apps

[email protected]

Slide3 l.jpg

Develop abstractions in context of full-scale applications

Protein Folding

Quantum Chemistry (QM/MM)

Molecular Dynamics

Computational Cosmology

Parallel Objects,

Adaptive Runtime System

Libraries and Tools

Crack Propagation

Dendritic Growth

Space-time meshes

Rocket Simulation

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE

[email protected]

Migratable objects aka processor virtualization l.jpg


  • Software engineering

    • Number of virtual processors can be independently controlled

    • Separate VPs for different modules

  • Message driven execution

    • Adaptive overlap of communication

    • Predictability :

      • Automatic out-of-core

    • Asynchronous reductions

  • Dynamic mapping

    • Heterogeneous clusters

      • Vacate, adjust to speed, share

    • Automatic checkpointing

    • Change set of processors used

    • Automatic dynamic load balancing

    • Communication optimization

System implementation

User View

Migratable Objects (aka Processor Virtualization)

Programmer: [Over] decomposition into virtual processors

Runtime: assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI

[email protected]

Outline

Adaptive MPI

Load Balancing

Fault tolerance


Performance analysis

Performance prediction



[email protected]

Ampi mpi with virtualization l.jpg

[Figure: MPI "processes" implemented as virtual processes (user-level migratable threads), mapped onto real processors]

AMPI: MPI with Virtualization

  • Each virtual process implemented as a user-level thread embedded in a Charm object

[email protected]

Making ampi work l.jpg

Making AMPI work

  • Multiple user-level threads per processor:

    • Problems with global variables

    • Solution 1:

      • Automatic: switch GOT pointer at context switch

      • Available on most machines

    • Solution 2: Manual: replace global variables

    • Solution 3: Automatic, via compiler support (AMPIzer)

  • Migrating Stacks

    • Use isomalloc technique (Mehaut et al)

    • Use memory files and mmap()

  • Heap data:

    • Isomalloc heaps

    • OR user supplied pack/unpack functions for the heap data
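The last option above, user-supplied pack/unpack functions for heap data, can be sketched as follows. This is a toy illustration only; the class and method names are hypothetical and not AMPI's actual PUP API.

```python
# Toy sketch of user-supplied pack/unpack for migratable heap data.
# Names (Migratable, pack, unpack, migrate) are illustrative, not AMPI's PUP API.
import pickle

class Migratable:
    """A 'virtual processor' whose heap state can be packed for migration."""
    def __init__(self, rank):
        self.rank = rank
        self.heap = {"iteration": 0, "field": [0.0] * 8}

    def step(self):
        self.heap["iteration"] += 1
        self.heap["field"] = [x + 1.0 for x in self.heap["field"]]

    def pack(self):
        # Serialize only the state needed to rebuild the object elsewhere.
        return pickle.dumps((self.rank, self.heap))

    @classmethod
    def unpack(cls, blob):
        rank, heap = pickle.loads(blob)
        obj = cls(rank)
        obj.heap = heap
        return obj

def migrate(obj):
    """Simulate migration: pack on the source PE, unpack on the destination."""
    return Migratable.unpack(obj.pack())

vp = Migratable(rank=3)
for _ in range(5):
    vp.step()
moved = migrate(vp)
assert moved.rank == 3 and moved.heap["iteration"] == 5
```

The key design point is that the user decides what state is essential, so the runtime never needs to understand the application's heap layout.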

[email protected]

Elf and global variables l.jpg

ELF and global variables

  • The Executable and Linking Format (ELF)

    • Executable has a Global Offset Table containing global data

    • GOT pointer stored in the %ebx register

    • Switch this pointer when switching between threads

    • Support on Linux, Solaris 2.x, and more

  • Integrated in Charm++/AMPI

    • Invoke by compile time option -swapglobal
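The -swapglobal idea can be sketched in miniature: each user-level thread owns a private copy of the "global" table, and the scheduler swaps the active table pointer at every context switch, just as AMPI swaps the ELF GOT pointer. Everything here is a conceptual analogy, not Charm++/AMPI internals.

```python
# Conceptual sketch of GOT-pointer swapping for thread-private globals.
# current_globals plays the role of the GOT pointer held in %ebx.

current_globals = None

def make_thread(rank, steps):
    """A generator standing in for a user-level thread."""
    def body():
        for _ in range(steps):
            # The thread reads/writes 'globals' only through the active table.
            current_globals["counter"] += 1
            yield
    return {"table": {"rank": rank, "counter": 0}, "body": body()}

def run(threads):
    global current_globals
    live = list(threads)
    while live:
        for t in list(live):
            current_globals = t["table"]   # context switch: swap the pointer
            try:
                next(t["body"])
            except StopIteration:
                live.remove(t)

t0, t1 = make_thread(0, 3), make_thread(1, 5)
run([t0, t1])
assert t0["table"]["counter"] == 3 and t1["table"]["counter"] == 5
```

Because each thread sees only its own table, identical "global" names no longer collide when many threads share one processor.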

[email protected]

Adaptive overlap and modules l.jpg

Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr. 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries. Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.

[email protected]

Benefit of adaptive overlap l.jpg

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.

[email protected]

Comparison with native mpi l.jpg


Slightly worse w/o optimization

Being improved


Small number of PE available

Special requirement by algorithm

Comparison with Native MPI

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

AMPI runs on any # of PEs (e.g., 19, 33, 105); native MPI needs a cube number.

[email protected]

Ampi extensions l.jpg

AMPI Extensions

  • Automatic load balancing

  • Non-blocking collectives

  • Checkpoint/restart mechanism

  • Multi-module programming

[email protected]

Load balancing in ampi l.jpg

Load Balancing in AMPI

  • Automatic load balancing: MPI_Migrate()

    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.

[email protected]

Load balancing steps l.jpg

Load Balancing Steps

Regular Timesteps

Detailed, aggressive Load Balancing : object migration

Instrumented Timesteps

Refinement Load Balancing
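The two flavors of load balancing named above can be sketched as follows: an aggressive step that greedily remaps every object, and a cheap refinement step that only moves objects off overloaded processors. This is a hypothetical illustration, not Charm++'s actual strategy implementations.

```python
# Sketch of aggressive (greedy remap) vs. refinement load balancing.
import heapq

def aggressive(loads, npes):
    """Greedy: place heaviest objects first on the least-loaded PE."""
    heap = [(0.0, pe) for pe in range(npes)]
    heapq.heapify(heap)
    mapping = {}
    for obj, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, pe = heapq.heappop(heap)
        mapping[obj] = pe
        heapq.heappush(heap, (total + load, pe))
    return mapping

def refine(loads, mapping, npes, threshold=1.05):
    """Move objects off PEs loaded above threshold * average."""
    pe_load = [0.0] * npes
    for obj, pe in mapping.items():
        pe_load[pe] += loads[obj]
    avg = sum(pe_load) / npes
    for obj, pe in sorted(mapping.items(), key=lambda kv: -loads[kv[0]]):
        if pe_load[pe] > threshold * avg:
            dest = min(range(npes), key=lambda p: pe_load[p])
            if pe_load[dest] + loads[obj] < pe_load[pe]:
                mapping[obj] = dest
                pe_load[pe] -= loads[obj]
                pe_load[dest] += loads[obj]
    return mapping

loads = {f"vp{i}": 1.0 + (i % 4) for i in range(16)}
m = refine(loads, aggressive(loads, 4), 4)
pe_load = [sum(loads[o] for o, p in m.items() if p == pe) for pe in range(4)]
assert max(pe_load) - min(pe_load) <= max(loads.values())
```

Refinement touches far fewer objects than a full remap, which is why it scales better on large processor counts.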

[email protected]

Slide15 l.jpg

Load Balancing

Aggressive Load Balancing

Refinement Load Balancing

Processor Utilization against Time on 128 and 1024 processors

On 128 processors, a single load balancing step suffices, but

On 1024 processors, we need a “refinement” step.

[email protected]

Shrink/Expand


  • Problem: Availability of computing platform may change

  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster. An additional 16 nodes become available at step 600.

[email protected]

Slide17 l.jpg

Radix Sort

Optimized All-to-all “Surprise”

Completion time vs. computation overhead

76 bytes all-to-all on Lemieux

CPU is free during most of the time taken by a collective operation





Led to the development of Asynchronous Collectives now supported in AMPI

[Chart: AAPC completion time (ms) vs. message size (bytes)]



[email protected]

Asynchronous collectives l.jpg

Asynchronous Collectives

  • Our implementation is asynchronous

    • Collective operation posted

    • Test/wait for its completion

    • Meanwhile useful computation can utilize CPU

      MPI_Ialltoall( … , &req);

      /* other useful computation */

      MPI_Wait(&req, &status);
[email protected]

Slide19 l.jpg

Fault Tolerance

[email protected]

Motivation


  • Applications need fast, low cost and scalable fault tolerance support

  • As machines grow in size

    • MTBF decreases

    • Applications have to tolerate faults

  • Our research

    • Disk based Checkpoint/Restart

    • In Memory Double Checkpointing/Restart

    • Sender based Message Logging

    • Proactive response to fault prediction

      • (impending fault response)

[email protected]

Checkpoint restart mechanism l.jpg

Checkpoint/Restart Mechanism

  • Automatic Checkpointing for AMPI and Charm++

    • Migrate objects to disk!

    • Automatic fault detection and restart

    • Now available in distribution version of AMPI and Charm++

  • Blocking Co-ordinated Checkpoint

    • States of chares are checkpointed to disk

    • Collective call MPI_Checkpoint(DIRNAME)

  • The entire job is restarted

    • Virtualization allows restarting on different # of processors

    • Runtime option

      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME

  • Simple but effective for common cases

[email protected]

In memory double checkpoint l.jpg

In-memory Double Checkpoint

  • In-memory checkpoint

    • Faster than disk

  • Co-ordinated checkpoint

    • Simple MPI_MemCheckpoint(void)

    • User can decide what makes up useful state

  • Double checkpointing

    • Each object maintains 2 checkpoints:

    • Local physical processor

    • Remote “buddy” processor

  • For jobs with large memory

    • Use local disks!

32 processors with 1.5GB memory each
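The double-checkpoint scheme above can be sketched simply: each object keeps one checkpoint on its own PE and one on a "buddy" PE, so a single PE failure never loses the latest checkpoint. The buddy assignment and function names here are illustrative assumptions, not the Charm++ implementation.

```python
# Sketch of in-memory double checkpointing with a buddy processor.
import copy

NPES = 4
checkpoints = [dict() for _ in range(NPES)]   # per-PE in-memory stores

def buddy(pe):
    # Hypothetical buddy assignment: the next PE in a ring.
    return (pe + 1) % NPES

def checkpoint(obj_id, pe, state):
    snap = copy.deepcopy(state)
    checkpoints[pe][obj_id] = snap          # local copy
    checkpoints[buddy(pe)][obj_id] = snap   # remote buddy copy

def recover(obj_id, failed_pe):
    """After failed_pe crashes, restore from the buddy's surviving copy."""
    return copy.deepcopy(checkpoints[buddy(failed_pe)][obj_id])

state = {"step": 42, "u": [1.0, 2.0]}
checkpoint("objA", pe=2, state=state)
checkpoints[2].clear()                       # PE 2 crashes; local copy lost
restored = recover("objA", failed_pe=2)
assert restored == {"step": 42, "u": [1.0, 2.0]}
```

Keeping both copies in memory is what makes this much faster than disk-based checkpointing, at the cost of halving usable memory for large jobs.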

[email protected]

Restart


  • A “Dummy” process is created:

    • Need not have application data or checkpoint

    • Necessary for runtime

    • Starts recovery on all other processors

  • Other processors:

    • Remove all chares

    • Restore checkpoints lost on the crashed PE

    • Restore chares from local checkpoints

  • Load balance after restart

[email protected]

Restart performance l.jpg

Restart Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps

[email protected]

Scalable fault tolerance l.jpg

Scalable Fault Tolerance

  • How?

    • Sender-side message logging

    • Asynchronous Checkpoints on buddy processors

    • Latency tolerance mitigates costs

    • Restart can be sped up by spreading out objects from the failed processor

  • Current progress

    • Basic scheme idea implemented and tested in simple programs

    • General purpose implementation in progress

Motivation: When one processor out of 100,000 fails, the other 99,999 shouldn't have to run back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”
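The sender-based message-logging idea can be sketched as follows: senders keep copies of the messages they sent, so after a crash only the failed process re-executes, consuming replayed messages while everyone else continues. The classes here are a hypothetical miniature, not the actual protocol implementation.

```python
# Sketch of sender-based message logging for localized recovery.

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.sent_log = []        # (dest_rank, msg) pairs kept at the sender
        self.received = []

    def send(self, dest, msg):
        self.sent_log.append((dest.rank, msg))   # log before delivery
        dest.received.append(msg)

    def replay_to(self, dest):
        """Re-deliver logged messages to a restarted process."""
        for d, msg in self.sent_log:
            if d == dest.rank:
                dest.received.append(msg)

p0, p1 = Process(0), Process(1)
p0.send(p1, "m1")
p0.send(p1, "m2")
p1_restarted = Process(1)     # p1 fails and restarts from its checkpoint
p0.replay_to(p1_restarted)    # p0 keeps running; it only replays its log
assert p1_restarted.received == ["m1", "m2"]
```

A real protocol must also record receive order and suppress duplicate sends during replay; the sketch shows only the core idea of logging at the sender.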

[email protected]

Recovery performance l.jpg

Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)

[email protected]

Slide27 l.jpg


Performance visualization and analysis tool

[email protected]

An introduction to projections l.jpg

An Introduction to Projections

  • Performance Analysis tool for Charm++ based applications.

  • Automatic trace instrumentation.

  • Post-mortem visualization.

  • Multiple advanced features that support:

    • Data volume control

    • Generation of additional user data.

[email protected]

Trace generation l.jpg

Trace Generation

  • Automatic instrumentation by runtime system

  • Detailed

    • In log mode, each event is recorded in full detail (including timestamp) in an internal buffer.

  • Summary

    • Reduces the size of output files and memory overhead.

    • Produces a few lines of output data per processor.

    • This data is recorded in bins corresponding to intervals of size 1 ms by default.

  • Flexible

    • APIs and runtime options for instrumenting user events and data generation control.
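Summary mode's binning can be sketched directly: instead of logging each event, busy time is accumulated into fixed-width bins (1 ms here, matching the default above). This is illustrative only, not Projections' actual data format.

```python
# Sketch of summary-mode tracing: accumulate busy time into 1 ms bins.

BIN_US = 1000   # bin width in microseconds (1 ms default)

def summarize(events, nbins):
    """events: (start_us, end_us) busy intervals; returns busy us per bin."""
    bins = [0] * nbins
    for start, end in events:
        t = start
        while t < end:
            b = t // BIN_US
            # Charge this bin up to its right edge or the event end.
            edge = min(end, (b + 1) * BIN_US)
            bins[b] += edge - t
            t = edge
    return bins

events = [(0, 500), (800, 1300), (2100, 2200)]
bins = summarize(events, nbins=3)
assert bins == [700, 300, 100]
```

The output size depends only on the bin count, not the event count, which is what keeps memory overhead and file sizes small.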

[email protected]

The summary view l.jpg

The Summary View

  • Provides a view of the overall utilization of the application.

  • Very quick to load.

[email protected]

Graph view l.jpg

Graph View

  • Features:

    • Selectively view Entry points.

    • Convenient means to switch between axis data types.

[email protected]

Timeline


  • The most detailed view in Projections.

  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.

[email protected]

Animations


[email protected]

Slide34 l.jpg

[email protected]

Slide35 l.jpg

[email protected]

Slide36 l.jpg

[email protected]

Time profile l.jpg

Time Profile

  • The Time Profile tool identified a portion of CPAIMD (a quantum chemistry code) that ran too early.

Solved by prioritizing entry methods

[email protected]

Slide38 l.jpg

[email protected]

Slide39 l.jpg

[email protected]

Slide40 l.jpg

Overview: one line for each processor, time on X-axis

White: busy; red: intermediate.

[email protected]

Slide41 l.jpg

A boring but good-performance overview

[email protected]

Slide42 l.jpg

An interesting but pathetic overview

[email protected]

Stretch removal l.jpg

Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on Y-axis)

Before optimizations: over 16 large stretched calls.

After optimizations: about 5 large stretched calls, the largest much smaller, and almost all calls take less than 3.2 ms.

[email protected]

Miscellaneous features color selection l.jpg

Miscellaneous Features: Color Selection

  • Colors are automatically supplied by default.

  • We allow users to select their own colors and save them.

  • These colors can then be restored the next time Projections loads.

[email protected]

User apis l.jpg

User APIs

  • Controlling trace generation

    • void traceBegin()

    • void traceEnd()

  • Tracing User Events

    • int traceRegisterUserEvent(char *, int)

    • void traceUserEvent(char *)

    • void traceUserBracketEvent(int, double, double)

    • double CmiWallTimer()

  • Runtime options:

    • +traceoff

    • +traceroot <directory>

    • Projections mode only:

      • +logsize <# entries>

      • +gz-trace

    • Summary mode only:

      • +bincount <# of intervals>

      • +binsize <interval time quanta (us)>

[email protected]

Slide46 l.jpg

Performance Prediction via Parallel Simulation

[email protected]

Bigsim performance prediction l.jpg

BigSim: Performance Prediction

  • Extremely large parallel machines are already here or about to be available:

    • ASCI Purple (12k, 100TF)

    • Bluegene/L (64k, 360TF)

    • Bluegene/C (1M, 1PF)

  • How to write a petascale application?

  • What will their performance be like?

  • Would existing parallel applications scale?

    • Machines are not there

    • Parallel Performance is hard to model without actually running the program

[email protected]

Objectives and simualtion model l.jpg

Objectives and Simulation Model

  • Objectives:

    • Develop techniques to facilitate the development of efficient peta-scale applications

    • Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

    • Focus on Charm++ and AMPI programming models

    • Performance prediction based on PDES

    • Supports varying levels of fidelity

      • processor prediction, network prediction.

    • Modes of execution:

      • online and post-mortem mode

[email protected]

Blue gene emulator simulator l.jpg

Blue Gene Emulator/Simulator

  • Actually: BigSim, for simulation of any large machine using smaller parallel machines

  • Emulator:

    • Allows development of programming environment and algorithms before the machine is built

    • Allowed us to port Charm++ to real BG/L in 1-2 days

  • Simulator:

    • Allows performance analysis of programs running on large machines, without access to the large machines

    • Uses Parallel Discrete Event Simulation

[email protected]

Architecture of bignetsim l.jpg

Architecture of BigNetSim

[email protected]

Simulation details l.jpg

Simulation Details

  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million way parallelism (implemented using user-threads).

    • Consider memory and stack-size limitations

  • Ensure time-stamp correction

  • Emulator layer API is built on top of machine layer

  • Charm++/AMPI implemented on top of emulator, like any other machine layer

  • Emulator layer supports all Charm features:

    • Load-balancing

    • Communication optimizations
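The time-stamp correction mentioned above can be sketched in miniature: each emulated message carries a virtual send time, and the receiver's virtual clock advances to at least the message's arrival time before execution time is charged. This is an illustrative toy; BigSim's actual PDES machinery is far more elaborate.

```python
# Sketch of virtual-time correction during emulation.

LATENCY = 5.0   # assumed fixed message latency, for illustration

class EmulatedPE:
    def __init__(self):
        self.vclock = 0.0   # this PE's virtual clock

    def execute(self, recv_time, duration):
        # Correction: work cannot start before the message arrives,
        # nor before previously charged work on this PE has finished.
        self.vclock = max(self.vclock, recv_time) + duration
        return self.vclock

pe = EmulatedPE()
t1 = pe.execute(recv_time=0.0 + LATENCY, duration=10.0)   # msg sent at t=0
t2 = pe.execute(recv_time=2.0 + LATENCY, duration=10.0)   # msg sent at t=2
assert (t1, t2) == (15.0, 25.0)
```

The second message arrives at virtual time 7 but cannot start until 15, so its completion time is corrected to 25 even though the emulation may have executed it much earlier in wall-clock time.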

[email protected]

Performance prediction l.jpg

Performance Prediction

  • Usefulness of performance prediction:

    • Application developer (making small modifications)

      • Difficult to get runtimes on huge current machines

      • For future machines, simulation is the only possibility

      • Performance debugging cycle can be considerably reduced

      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc

    • Machine architecture designer

      • Knowledge of how target applications behave on it can help identify problems with machine design early

  • Record traces during parallel emulation

  • Run trace-driven simulation (PDES)

[email protected]

Performance prediction contd l.jpg

Performance Prediction (contd.)

  • Predicting time of sequential code:

    • User supplied time for every code block

    • Wall-clock measurements on simulating machine can be used via a suitable multiplier

    • Hardware performance counters to count floating point, integer, branch instructions, etc

      • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio

    • Instruction level simulation (not implemented)

  • Predicting network performance:

    • No contention: time based on topology and other network parameters

    • Back-patching: modifies communication time using the amount of communication activity

    • Network simulation: models the network entirely
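The simplest mode, a no-contention prediction, can be sketched as a hop-count-plus-bandwidth model. The topology, latency, and bandwidth values below are made-up assumptions for illustration, not BigNetSim's configured parameters.

```python
# Sketch of no-contention network time prediction on a 3D torus.

ALPHA = 2e-6      # assumed per-hop latency (s)
BETA = 1e-9       # assumed per-byte time (s), i.e. 1 GB/s bandwidth

def hops_3d_torus(src, dst, dims):
    """Manhattan hop count on a 3D torus, taking the shorter way around."""
    return sum(min(abs(s - d), n - abs(s - d))
               for s, d, n in zip(src, dst, dims))

def predict_time(src, dst, nbytes, dims=(8, 8, 8)):
    # Time = per-hop latency * hops + per-byte cost * message size.
    return hops_3d_torus(src, dst, dims) * ALPHA + nbytes * BETA

t = predict_time((0, 0, 0), (4, 7, 1), nbytes=1024)
# 4 + 1 + 1 = 6 hops on an 8x8x8 torus
assert abs(t - (6 * ALPHA + 1024 * BETA)) < 1e-15
```

Back-patching and full network simulation refine this estimate by accounting for how much other traffic shares the links.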

[email protected]

Performance prediction validation l.jpg

Performance Prediction Validation

  • 7-point stencil program with 3D decomposition

    • Run on 32 real processors, simulating 64, 128,... PEs

  • NAMD benchmark Apo-Lipoprotein A1 atom dataset with 92k atoms, running for 15 timesteps

    • For large processor counts, the predicted value diverges from the actual value because of cache and memory effects

[email protected]

Performance on large machines l.jpg

Performance on Large Machines

  • Problem:

    • How to predict performance of applications on future machines? (E.g. BG/L)

    • How to do performance tuning without continuous access to large machine?

  • Solution:

    • Leverage virtualization

    • Develop machine emulator

    • Simulator: accurate time modeling

    • Run program on “100,000 processors” using only hundreds of processors

  • Analysis:

    • Use performance viz. suite (projections)

  • Molecular Dynamics Benchmark

  • ER-GRE: 36573 atoms

    • 1.6 million objects

    • 8 step simulation

    • 16k BG processors

[email protected]

Projections performance visualization l.jpg

Projections: Performance visualization

[email protected]

Network simulation l.jpg

Network Simulation

  • Detailed implementation of interconnection networks

  • Configurable network parameters

    • Topology / Routing

    • Input / Output VC selection

    • Bandwidth / Latency

    • NIC parameters

    • Buffer / Message size, etc

  • Support for hardware collectives in network layer

[email protected]

Higher level programming l.jpg

Higher level programming

  • Orchestration language

    • Allows expressing global control flow in a Charm++ program

    • HPF-like flavor, but with Charm++-like processor virtualization and explicit communication

  • Multiphase Shared Arrays

    • Provides a disciplined use of shared address space

    • Each array can be accessed only in one of the following modes:

      • ReadOnly, Write-by-One-Thread, Accumulate-only

    • Access mode can change from phase to phase

    • Phases delineated by per-array “sync”
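The multiphase discipline above can be sketched as a small state machine: the array is usable in exactly one access mode per phase, and a sync call ends the phase. The API names here are hypothetical; the real MSA library is a Charm++/C++ template.

```python
# Sketch of a multiphase shared array: one access mode per phase.

class MSA:
    MODES = {"read", "write", "accumulate"}

    def __init__(self, n):
        self.data = [0] * n
        self.mode = None

    def sync(self, new_mode):
        # Phase boundary: all threads agree on the next access mode.
        assert new_mode in self.MODES
        self.mode = new_mode

    def read(self, i):
        assert self.mode == "read", "read outside a read-only phase"
        return self.data[i]

    def write(self, i, v):
        assert self.mode == "write", "write outside a write phase"
        self.data[i] = v

    def accumulate(self, i, v):
        assert self.mode == "accumulate", "accumulate outside its phase"
        self.data[i] += v

a = MSA(4)
a.sync("write"); a.write(0, 10)
a.sync("accumulate"); a.accumulate(0, 5); a.accumulate(0, 5)
a.sync("read")
assert a.read(0) == 20
try:
    a.write(1, 1)        # illegal: still in the read phase
    raise SystemExit("mode check failed to fire")
except AssertionError:
    pass
```

Restricting each phase to one mode is what lets the runtime avoid races and coherence traffic without general-purpose locking.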

[email protected]

Other projects l.jpg


Flexible cluster scheduler

resource management across clusters

Multi-cluster applications

Load balancing strategies

Communication optimization

POSE: Parallel discrete event simulation


Parallel framework for unstructured meshes

Invite collaborations:

Virtualization of other languages and libraries

New load balancing strategies


Other projects

[email protected]

Some active collaborations l.jpg

Biophysics: Molecular Dynamics (NIH, ..)

Long-standing (since 1991), with Klaus Schulten and Bob Skeel

Gordon Bell award in 2002

Production program used by biophysicists

Quantum Chemistry (NSF)

QM/MM via the Car-Parrinello method +

Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman,

Nick Nystrom, Josep Torrellas, Laxmikant Kale

Material simulation (NSF)

Dendritic growth, quenching, space-time meshes, QM/FEM

R. Haber, D. Johnson, J. Dantzig, +

Rocket simulation (DOE)

DOE, funded ASCI center

Mike Heath, +30 faculty

Computational Cosmology (NSF, NASA)


Scalable Visualization:


Simulation of Plasma


Some Active Collaborations

[email protected]

Summary


  • We are pursuing a broad agenda

    • aimed at productivity and performance in parallel programming

    • Intelligent Runtime System for adaptive strategies

    • Charm++/AMPI are production level systems

    • Support dynamic load balancing, communication optimizations

    • Performance prediction capabilities based on simulation

    • Basic Fault tolerance, performance viz tools are part of the suite

    • Application-oriented yet Computer Science centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC

[email protected]

  • Login