
Adaptive MPI: Intelligent runtime strategies and performance prediction via simulation

Laxmikant Kale

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign


PPL Mission and Approach
  • To enhance Performance and Productivity in programming complex parallel applications
    • Performance: scalable to thousands of processors
    • Productivity: of human programmers
    • Complex: irregular structure, dynamic variations
  • Approach: application-oriented yet CS-centered research
    • Develop enabling technology for a wide collection of applications
    • Develop, use, and test it in the context of real applications
  • How?
    • Develop novel parallel programming techniques
    • Embody them into easy-to-use abstractions
    • So that application scientists can use advanced techniques with ease
    • Enabling technology: reused across many applications



Develop Abstractions in Context of Full-Scale Applications

[Figure: application areas — Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time Meshes, Rocket Simulation — arranged around the common core of Parallel Objects, Adaptive Runtime System, Libraries and Tools.]

The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE.


Migratable Objects (aka Processor Virtualization)

Programmer: [over]decomposition into virtual processors

Runtime: assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI

[Figure: the user's view of many virtual processors vs. the system implementation mapping them onto physical processors.]

Benefits:
  • Software engineering
    • Number of virtual processors can be independently controlled
    • Separate VPs for different modules
  • Message-driven execution
    • Adaptive overlap of communication
    • Predictability:
      • Automatic out-of-core execution
    • Asynchronous reductions
  • Dynamic mapping
    • Heterogeneous clusters
      • Vacate, adjust to speed, share
    • Automatic checkpointing
    • Change the set of processors used
    • Automatic dynamic load balancing
    • Communication optimization


Outline:
  • Adaptive MPI
  • Load balancing
  • Fault tolerance
  • Performance analysis
  • Performance prediction


AMPI: MPI with Virtualization
  • Each virtual process implemented as a user-level thread embedded in a Charm++ object

[Figure: MPI "processes" implemented as virtual processes (user-level migratable threads) mapped onto real processors.]


Making AMPI Work
  • Multiple user-level threads per processor:
    • Problem: global variables
    • Solution 1: automatic: switch the GOT pointer at context switch
      • Available on most machines
    • Solution 2: manual: replace global variables (see the sketch below)
    • Solution 3: automatic: via compiler support (AMPIzer)
  • Migrating stacks
    • Use the isomalloc technique (Mehaut et al.)
    • Use memory files and mmap()
  • Heap data:
    • Isomalloc heaps
    • OR user-supplied pack/unpack functions for the heap data
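
A hedged sketch of Solution 2 (manually replacing global variables): move each global into a per-rank structure that every thread allocates and passes explicitly. The struct and function names are illustrative, not AMPI API.

    /* Before: a mutable global is unsafe with many        */
    /* user-level threads ("ranks") sharing one processor. */
    /* int iteration_count;                                */

    /* After: former globals live in a per-rank struct.    */
    typedef struct {
        int    iteration_count;  /* formerly a global */
        double residual;         /* formerly a global */
    } RankGlobals;

    void solver_step(RankGlobals *g) {
        g->iteration_count++;    /* every access goes through g */
    }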


ELF and Global Variables
  • The Executable and Linking Format (ELF)
    • Executables have a Global Offset Table (GOT) containing global data
    • The GOT pointer is stored in the %ebx register
    • Switch this pointer when switching between threads
    • Supported on Linux, Solaris 2.x, and more
  • Integrated in Charm++/AMPI
    • Invoked by the compile-time option -swapglobal


Adaptive Overlap and Modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995


Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.


Comparison with Native MPI

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

  • Performance is slightly worse without optimization (being improved).
  • AMPI runs on any number of PEs (e.g., 19, 33, 105); native MPI needs a cube number. This matters when only a small number of PEs is available, or when an algorithm has special requirements.


AMPI Extensions
  • Automatic load balancing
  • Non-blocking collectives
  • Checkpoint/restart mechanism
  • Multi-module programming


Load Balancing in AMPI
  • Automatic load balancing: MPI_Migrate()
    • A collective call informing the load balancer that the thread is ready to be migrated, if needed (a usage sketch follows below).
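
A minimal sketch of placing MPI_Migrate() at an iteration boundary, assuming a simple timestep loop; do_compute() and LB_PERIOD are hypothetical application details, not part of AMPI.

    #include <mpi.h>

    #define LB_PERIOD 20               /* hypothetical interval */

    static void do_compute(int step) { /* application work */ }

    void timestep_loop(int nsteps) {
        for (int step = 1; step <= nsteps; step++) {
            do_compute(step);
            if (step % LB_PERIOD == 0)
                MPI_Migrate();         /* collective: threads may move
                                          to rebalance, then resume here */
        }
    }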


Load Balancing Steps

[Figure: timeline of regular timesteps, instrumented timesteps, a detailed/aggressive load balancing step (object migration), and a later refinement load balancing step.]

Load Balancing

[Figure: processor utilization against time on 128 and 1024 processors, showing an aggressive load balancing step followed by a refinement load balancing step.]

On 128 processors a single load balancing step suffices, but on 1024 processors we need a "refinement" step.


Shrink/Expand
  • Problem: availability of the computing platform may change
  • Fit applications onto the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.



Optimized All-to-All "Surprise"

  • Completion time vs. computation overhead
  • 76-byte all-to-all on Lemieux
  • The CPU is free during most of the time taken by a collective operation
  • This led to the development of the asynchronous collectives now supported in AMPI

[Figure: AAPC completion time (ms) against message size (bytes) for a radix sort benchmark.]


Asynchronous Collectives
  • Our implementation is asynchronous:
    • The collective operation is posted
    • Test/wait for its completion
    • Meanwhile, useful computation can utilize the CPU

MPI_Ialltoall( … , &req);

/* other computation */
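
A fuller, hedged version of the snippet above showing the post/compute/wait pattern; the buffers and the overlapped work are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p;
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *sendbuf = (int *)malloc(p * sizeof(int));
        int *recvbuf = (int *)malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) sendbuf[i] = i;

        MPI_Request req;
        MPI_Ialltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                      MPI_COMM_WORLD, &req);  /* post the collective */

        /* ... useful computation overlaps with the all-to-all ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* block only when the
                                                 result is needed */
        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }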



Fault Tolerance
  • Applications need fast, low-cost, and scalable fault tolerance support
  • As machines grow in size:
    • MTBF decreases
    • Applications have to tolerate faults
  • Our research:
    • Disk-based checkpoint/restart
    • In-memory double checkpoint/restart
    • Sender-based message logging
    • Proactive response to fault prediction
      • (impending fault response)


Checkpoint/Restart Mechanism
  • Automatic checkpointing for AMPI and Charm++
    • Migrate objects to disk!
    • Automatic fault detection and restart
    • Now available in the distribution version of AMPI and Charm++
  • Blocking coordinated checkpoint
    • States of chares are checkpointed to disk
    • Collective call MPI_Checkpoint(DIRNAME)
  • The entire job is restarted
    • Virtualization allows restarting on a different # of processors
    • Runtime option:
      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME
  • Simple but effective for common cases (a usage sketch follows below)
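
A minimal sketch of periodic disk checkpointing with the collective named above; the checkpoint interval and directory name are illustrative assumptions.

    #include <mpi.h>

    #define CKPT_PERIOD 100                      /* hypothetical interval */

    void run(int nsteps) {
        for (int step = 1; step <= nsteps; step++) {
            /* ... application work ... */
            if (step % CKPT_PERIOD == 0)
                MPI_Checkpoint((char *)"ckpt");  /* collective: job state is
                                                    written under ./ckpt and
                                                    restored with +restart */
        }
    }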


In-Memory Double Checkpoint
  • In-memory checkpoint
    • Faster than disk
  • Coordinated checkpoint
    • Simple collective: MPI_MemCheckpoint(void)
    • User can decide what makes up useful state
  • Double checkpointing
    • Each object maintains 2 checkpoints:
      • on the local physical processor
      • on a remote "buddy" processor
  • For jobs with large memory
    • Use local disks!

[Figure: checkpoint time on 32 processors with 1.5GB memory each.]


Restart
  • A "dummy" process is created:
    • Need not have application data or checkpoints
    • Necessary for the runtime
    • Starts recovery on all other processors
  • Other processors:
    • Remove all chares
    • Restore checkpoints lost on the crashed PE
    • Restore chares from local checkpoints
  • Load balance after restart


Restart Performance
  • 10 crashes
  • 128 processors
  • Checkpoint every 10 time steps


Scalable Fault Tolerance
  • How?
    • Sender-side message logging
    • Asynchronous checkpoints on buddy processors
    • Latency tolerance mitigates costs
    • Restart can be sped up by spreading out objects from the failed processor
  • Current progress:
    • Basic scheme implemented and tested in simple programs
    • General-purpose implementation in progress

Motivation: when one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!

Only the failed processor's objects recover from checkpoints, playing back their logged messages, while the others "continue."


Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)




Projections: performance visualization and analysis tool


An Introduction to Projections
  • Performance analysis tool for Charm++-based applications
  • Automatic trace instrumentation
  • Post-mortem visualization
  • Multiple advanced features that support:
    • Data volume control
    • Generation of additional user data


Trace Generation
  • Automatic instrumentation by the runtime system
  • Detailed (log mode)
    • Each event is recorded in full detail (including timestamp) in an internal buffer
  • Summary mode
    • Reduces the size of output files and memory overhead
    • Produces a few lines of output data per processor
    • Data is recorded in bins corresponding to intervals of 1 ms by default
  • Flexible
    • APIs and runtime options for instrumenting user events and controlling data generation


The Summary View
  • Provides a view of the overall utilization of the application.
  • Very quick to load.


Graph View
  • Features:
    • Selectively view entry points
    • Convenient means to switch between axis data types

Timeline View
  • The most detailed view in Projections
  • Useful for understanding critical-path issues or unusual entry-point behaviors at specific times




Time Profile
  • The Time Profile tool identified a portion of CPAIMD (a quantum chemistry code) that ran too early
  • Solved by prioritizing entry methods



Overview: one line for each processor, time on the X-axis.

White: busy; red: intermediate.

Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis).

Before optimizations: over 16 large stretched calls.

After optimizations: about 5 large stretched calls remain, the largest of them much smaller, and almost all calls take less than 3.2 ms.


Miscellaneous Features: Color Selection
  • Colors are automatically supplied by default
  • We allow users to select their own colors and save them
  • These colors can then be restored the next time Projections loads


User APIs
  • Controlling trace generation:
    • void traceBegin()
    • void traceEnd()
  • Tracing user events (see the sketch below):
    • int traceRegisterUserEvent(char *, int)
    • void traceUserEvent(char *)
    • void traceUserBracketEvent(int, double, double)
    • double CmiWallTimer()
  • Runtime options:
    • +traceoff
    • +traceroot <directory>
    • Projections mode only:
      • +logsize <# entries>
      • +gz-trace
    • Summary mode only:
      • +bincount <# of intervals>
      • +binsize <interval time quanta (us)>
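
A hedged sketch of bracketing a user event with the APIs listed above; the event name, the chosen id, and doWork() are illustrative assumptions.

    #include "charm++.h"    /* declares the trace APIs and CmiWallTimer */

    static int kernelEvent; /* id under which the event is recorded */

    static void doWork() { /* hypothetical application kernel */ }

    void initTracing() {
        /* Register a human-readable name for event id 42. */
        kernelEvent = traceRegisterUserEvent((char *)"solver kernel", 42);
    }

    void tracedStep() {
        double t0 = CmiWallTimer();
        doWork();
        /* Bracket the work so it appears as a user event in the
           Projections timeline. */
        traceUserBracketEvent(kernelEvent, t0, CmiWallTimer());
    }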



Performance Prediction via Parallel Simulation


BigSim: Performance Prediction
  • Extremely large parallel machines are already here or about to be available:
    • ASCI Purple (12k processors, 100 TF)
    • BlueGene/L (64k processors, 360 TF)
    • BlueGene/C (1M processors, 1 PF)
  • How do we write a petascale application?
  • What will its performance be like?
  • Would existing parallel applications scale?
    • The machines are not there yet
    • Parallel performance is hard to model without actually running the program


Objectives and Simulation Model
  • Objectives:
    • Develop techniques to facilitate the development of efficient petascale applications
    • Based on performance prediction of applications on large simulated parallel machines
  • Simulation-based performance prediction:
    • Focus on the Charm++ and AMPI programming models
    • Performance prediction based on PDES
    • Supports varying levels of fidelity:
      • processor prediction, network prediction
    • Modes of execution:
      • online and post-mortem


Blue Gene Emulator/Simulator
  • Actually: BigSim, for simulation of any large machine using smaller parallel machines
  • Emulator:
    • Allows development of programming environment and algorithms before the machine is built
    • Allowed us to port Charm++ to real BG/L in 1-2 days
  • Simulator:
    • Allows performance analysis of programs running on large machines, without access to the large machines
    • Uses Parallel Discrete Event Simulation


Simulation Details
  • Emulate large parallel machines on smaller existing parallel machines: run a program with multi-million-way parallelism (implemented using user-level threads)
    • Consider memory and stack-size limitations
  • Ensure time-stamp correction
  • The emulator layer API is built on top of the machine layer
  • Charm++/AMPI are implemented on top of the emulator, like any other machine layer
  • The emulator layer supports all Charm++ features:
    • Load balancing
    • Communication optimizations


Performance Prediction
  • Usefulness of performance prediction:
    • Application developer (making small modifications):
      • Difficult to get runtimes on huge current machines
      • For future machines, simulation is the only possibility
      • The performance-debugging cycle can be considerably shortened
      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc.
    • Machine architecture designer:
      • Knowledge of how target applications behave can help identify problems with a machine design early
  • Record traces during parallel emulation
  • Run trace-driven simulation (PDES)


Performance Prediction (contd.)
  • Predicting the time of sequential code:
    • User-supplied time for every code block
    • Wall-clock measurements on the simulating machine, scaled by a suitable multiplier
    • Hardware performance counters to count floating-point, integer, branch instructions, etc.
      • Cache performance and memory footprint are approximated by the percentage of memory accesses and the cache hit/miss ratio
    • Instruction-level simulation (not implemented)
  • Predicting network performance (a first-order model is sketched below):
    • No contention: time based on topology and other network parameters
    • Back-patching: modifies communication time using the amount of communication activity
    • Network simulation: models the network entirely
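
As a hedged illustration of the no-contention mode, a common first-order model (an assumption here, not necessarily BigSim's exact formula) charges a fixed software overhead, a per-hop latency, and serialization time:

    T_{msg} = \alpha + h \cdot t_{hop} + N/\beta

where \alpha is the per-message software overhead, h the hop count given by the topology and routing, t_{hop} the per-hop latency, N the message size, and \beta the link bandwidth.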


Performance Prediction Validation
  • 7-point stencil program with 3D decomposition
    • Run on 32 real processors, simulating 64, 128, ... PEs
  • NAMD benchmark: Apo-Lipoprotein A1 dataset with 92k atoms, running for 15 timesteps
    • At large processor counts, because of cache and memory effects, the predicted value diverges from the actual value


Performance on Large Machines
  • Problem:
    • How do we predict the performance of applications on future machines (e.g., BG/L)?
    • How do we do performance tuning without continuous access to a large machine?
  • Solution:
    • Leverage virtualization
    • Develop a machine emulator
    • Simulator: accurate time modeling
    • Run the program on "100,000 processors" using only hundreds of processors
  • Analysis:
    • Use the performance visualization suite (Projections)
  • Molecular dynamics benchmark
    • ER-GRE: 36,573 atoms
    • 1.6 million objects
    • 8-step simulation
    • 16k BG processors


Network Simulation
  • Detailed implementation of interconnection networks
  • Configurable network parameters:
    • Topology / routing
    • Input / output VC selection
    • Bandwidth / latency
    • NIC parameters
    • Buffer / message size, etc.
  • Support for hardware collectives in the network layer


Higher-Level Programming
  • Orchestration language
    • Allows expressing global control flow in a Charm++ program
    • HPF-like flavor, but with Charm++-like processor virtualization and explicit communication
  • Multiphase Shared Arrays (a usage sketch follows below)
    • Provide disciplined use of a shared address space
    • Each array can be accessed only in one of the following modes:
      • read-only, write-by-one-thread, accumulate-only
    • The access mode can change from phase to phase
    • Phases are delineated by a per-array "sync"
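
To make the phase discipline concrete, a hedged C++ sketch follows; MSA1D and its methods (enroll, accumulate, get, sync) are illustrative stand-ins for the concepts above, not the actual MSA API.

    // Hypothetical MSA-style shared array used in two phases.
    template <typename T> class MSA1D {   // illustrative stand-in
    public:
        void enroll();                    // join the array's thread group
        void accumulate(int i, T v);      // accumulate-only mode
        T    get(int i) const;            // read-only mode
        void sync();                      // phase boundary
    };

    void worker(MSA1D<double> &arr, int myRank, int n) {
        arr.enroll();

        // Phase 1: accumulate-only -- every thread adds contributions.
        for (int i = 0; i < n; i++)
            arr.accumulate(i, 1.0 / (myRank + i + 1));
        arr.sync();                       // ends the accumulate phase

        // Phase 2: read-only -- all threads may now read the sums.
        double v = arr.get(0);
        (void)v;
        arr.sync();                       // ends the read phase
    }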


Other Projects
  • Flexible cluster scheduler
    • Resource management across clusters
    • Multi-cluster applications
  • Load balancing strategies
  • Communication optimization
  • POSE: parallel discrete event simulation
  • Parallel framework for unstructured meshes
  • We invite collaborations on:
    • Virtualization of other languages and libraries
    • New load balancing strategies


Some Active Collaborations
  • Biophysics: molecular dynamics (NIH, ...)
    • Long-standing (1991-): Klaus Schulten, Bob Skeel
    • Gordon Bell award in 2002
    • Production program used by biophysicists
  • Quantum chemistry (NSF)
    • QM/MM via the Car-Parrinello method
    • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrelas, Laxmikant Kale, et al.
  • Material simulation (NSF)
    • Dendritic growth, quenching, space-time meshes, QM/FEM
    • R. Haber, D. Johnson, J. Dantzig, et al.
  • Rocket simulation (DOE)
    • DOE-funded ASCI center
    • Mike Heath, +30 faculty
  • Computational cosmology (NSF, NASA)
  • Scalable visualization
  • Simulation of plasma


Summary
  • We are pursuing a broad agenda:
    • Aimed at productivity and performance in parallel programming
    • Intelligent runtime system for adaptive strategies
    • Charm++/AMPI are production-level systems
    • They support dynamic load balancing and communication optimizations
    • Performance prediction capabilities based on simulation
    • Basic fault tolerance and performance visualization tools are part of the suite
    • Application-oriented yet computer-science-centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC