
Adaptive MPI: Intelligent Runtime Strategies and Performance Prediction via Simulation

Laxmikant Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana Champaign

ICL@UTK


PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

    • Performance: scalable to thousands of processors

    • Productivity: of human programmers

    • Complex: irregular structure, dynamic variations

  • Approach: Application-oriented yet CS-centered research

    • Develop enabling technology for a wide collection of apps

    • Develop, use, and test it in the context of real applications

  • How?

    • Develop novel parallel programming techniques

    • Embody them into easy-to-use abstractions

    • So application scientists can use advanced techniques with ease

    • Enabling technology: reused across many apps



Develop Abstractions in the Context of Full-Scale Applications

Applications: Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time Meshes, Rocket Simulation

Common core: Parallel Objects, Adaptive Runtime System, Libraries and Tools

The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE



Migratable Objects (aka Processor Virtualization)

Benefits

  • Software engineering

    • Number of virtual processors can be independently controlled

    • Separate VPs for different modules

  • Message driven execution

    • Adaptive overlap of communication

    • Predictability:

      • Automatic out-of-core execution

    • Asynchronous reductions

  • Dynamic mapping

    • Heterogeneous clusters

      • Vacate, adjust to speed, share

    • Automatic checkpointing

    • Change set of processors used

    • Automatic dynamic load balancing

    • Communication optimization

User view vs. system implementation: the programmer [over]decomposes the computation into virtual processors, and the runtime assigns VPs to physical processors. This separation enables adaptive runtime strategies.

Implementations: Charm++, AMPI



Outline

  • Adaptive MPI

  • Load balancing

  • Fault tolerance

  • Projections: performance analysis

  • Performance prediction: BigSim


AMPI: MPI with Virtualization

MPI “processes” are implemented as virtual processes (user-level migratable threads) mapped onto real processors.

  • Each virtual process is implemented as a user-level thread embedded in a Charm++ object
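For example (an illustrative invocation, using the +p/+vp runtime options that appear on the checkpoint/restart slide below), a job over-decomposed into 16 VPs can run on 4 physical PEs:

      > ./charmrun pgm +p4 +vp16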



Making AMPI Work

  • Multiple user-level threads per processor:

    • Problem: global variables

    • Solution 1: automatic: switch the GOT pointer at context switch

      • Available on most machines

    • Solution 2: manual: replace global variables (see the sketch after this list)

    • Solution 3: automatic: via compiler support (AMPIzer)

  • Migrating stacks

    • Use the isomalloc technique (Mehaut et al.)

    • Use memory files and mmap()

  • Heap data:

    • Isomalloc heaps

    • Or user-supplied pack/unpack functions for the heap data
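A minimal sketch of the manual approach (Solution 2); the struct and function names are illustrative, not an AMPI API:

      /* Before: a plain global is unsafe when many user-level threads
         share one process.                                             */
      /* int iteration; */

      /* After: per-VP state is gathered into a struct that is passed
         explicitly, so each virtual process owns its own copy.         */
      typedef struct {
        int iteration;        /* formerly a global variable */
      } VpState;

      void timestep(VpState *vp) {
        vp->iteration++;      /* updates this VP's private copy */
      }

Because the state travels with the thread, it also migrates correctly when the runtime moves the VP to another processor.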



ELF and Global Variables

  • The Executable and Linking Format (ELF)

    • The executable has a Global Offset Table (GOT) containing global data

    • The GOT pointer is stored in the %ebx register

    • Switch this pointer when switching between threads (a conceptual sketch follows this list)

    • Supported on Linux, Solaris 2.x, and more

  • Integrated in Charm++/AMPI

    • Invoked by the compile-time option -swapglobals
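A conceptual sketch of the GOT swap, assuming hypothetical register-access helpers (the real mechanism lives inside the Charm++ thread scheduler):

      /* Hypothetical helpers: read/write the GOT base register
         (%ebx under x86 ELF).                                   */
      extern void *read_got_reg(void);
      extern void  write_got_reg(void *got);

      typedef struct Thread {
        void *got_base;   /* this thread's private GOT pointer */
        /* ... stack pointer, saved registers, etc. ... */
      } Thread;

      /* Called at every user-level context switch. */
      void switch_got(Thread *from, Thread *to) {
        from->got_base = read_got_reg();  /* save outgoing thread's GOT */
        write_got_reg(to->got_base);      /* install incoming thread's GOT */
      }

Each thread then resolves globals through its own GOT, privatizing them without source changes.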



Adaptive Overlap and Modules

SPMD and message-driven modules

(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994.)

See also: Modularity, Reuse, and Efficiency with Message-Driven Libraries, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.



Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.



Comparison with Native MPI

  • Performance

    • Slightly worse without optimization; being improved

  • Flexibility

    • Usable when only a small number of PEs is available

    • Meets special requirements imposed by the algorithm

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

AMPI runs on any number of PEs (e.g., 19, 33, 105); native MPI needs a cube number.



AMPI Extensions

  • Automatic load balancing

  • Non-blocking collectives

  • Checkpoint/restart mechanism

  • Multi-module programming



Load Balancing in AMPI

  • Automatic load balancing: MPI_Migrate()

    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.
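A minimal usage sketch; the loop structure, helper name, and period are illustrative, while MPI_Migrate() is the AMPI collective described above:

      /* Illustrative AMPI timestep loop with periodic load balancing. */
      for (int step = 0; step < nsteps; step++) {
        do_timestep();              /* hypothetical application work */
        if (step % LB_PERIOD == 0)
          MPI_Migrate();            /* collective: threads may migrate here */
      }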



Load Balancing Steps

Timeline of a run: regular timesteps, then instrumented timesteps, then detailed, aggressive load balancing (object migration), followed by refinement load balancing.



Load Balancing

[Charts: processor utilization against time on 128 and 1024 processors, with aggressive and refinement load-balancing steps marked.]

On 128 processors, a single load-balancing step suffices; on 1024 processors, a “refinement” step is also needed.



Shrink/Expand

  • Problem: Availability of computing platform may change

  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600



Optimized All-to-all “Surprise”

  • Radix sort benchmark: completion time vs. computation overhead

  • 76-byte all-to-all on Lemieux

  • The CPU is free during most of the time taken by a collective operation

  • This led to the development of asynchronous collectives, now supported in AMPI

[Chart: AAPC completion time (ms, 0 to 900) vs. message size (100B to 8KB) for the Mesh and Direct strategies.]



Asynchronous Collectives

  • Our implementation is asynchronous

    • The collective operation is posted

    • Test/wait for its completion

    • Meanwhile, useful computation can utilize the CPU

      /* Illustrative buffers and counts; the point is the split-phase pattern. */
      MPI_Request req;
      MPI_Ialltoall(sendbuf, n, MPI_INT, recvbuf, n, MPI_INT, comm, &req);

      /* ... other useful computation ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);



Fault Tolerance



Motivation

  • Applications need fast, low-cost, and scalable fault-tolerance support

  • As machines grow in size

    • MTBF decreases

    • Applications have to tolerate faults

  • Our research

    • Disk based Checkpoint/Restart

    • In Memory Double Checkpointing/Restart

    • Sender based Message Logging

    • Proactive response to fault prediction

      • (impending fault response)



Checkpoint/Restart Mechanism

  • Automatic Checkpointing for AMPI and Charm++

    • Migrate objects to disk!

    • Automatic fault detection and restart

    • Now available in distribution version of AMPI and Charm++

  • Blocking coordinated checkpoint

    • States of chares are checkpointed to disk

    • Collective call MPI_Checkpoint(DIRNAME)

  • The entire job is restarted

    • Virtualization allows restarting on different # of processors

    • Runtime option

      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME

  • Simple but effective for common cases
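A minimal sketch of periodic disk checkpointing; the period, directory name, and surrounding loop are illustrative, while MPI_Checkpoint is the collective named above:

      /* Illustrative: every thread calls the collective every
         CKPT_PERIOD timesteps.                                 */
      if (step % CKPT_PERIOD == 0)
        MPI_Checkpoint("ckpt_dir");   /* chare states written to disk */

After a crash, the job is restarted from that directory with the +restart option shown above, possibly on a different number of processors.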



In-memory Double Checkpoint

  • In-memory checkpoint

    • Faster than disk

  • Co-ordinated checkpoint

    • Simple collective call: MPI_MemCheckpoint(void) (see the sketch below)

    • User can decide what makes up useful state

  • Double checkpointing

    • Each object maintains 2 checkpoints:

      • one on the local physical processor

      • one on a remote “buddy” processor

  • For jobs with large memory

    • Use local disks!

[Figure: 32 processors with 1.5 GB of memory each.]
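A correspondingly minimal sketch for the in-memory variant (the loop details are again illustrative):

      if (step % CKPT_PERIOD == 0)
        MPI_MemCheckpoint();   /* collective: checkpoint to local and buddy memory */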



Restart

  • A “Dummy” process is created:

    • Need not have application data or checkpoint

    • Necessary for runtime

    • Starts recovery on all other Processors

  • Other processors:

    • Remove all chares

    • Restore checkpoints lost on the crashed PE

    • Restore chares from local checkpoints

  • Load balance after restart



Restart Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps



Scalable Fault Tolerance

  • How?

    • Sender-side message logging

    • Asynchronous Checkpoints on buddy processors

    • Latency tolerance mitigates costs

    • Restart can be sped up by spreading out objects from the failed processor

  • Current progress

    • Basic scheme idea implemented and tested in simple programs

    • General purpose implementation in progress

Motivation: When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”



Recovery Performance

Execution time with an increasing number of faults on 8 processors (checkpoint period: 30 s)



Projections: Performance Visualization and Analysis Tool



An Introduction to Projections

  • Performance analysis tool for Charm++-based applications.

  • Automatic trace instrumentation.

  • Post-mortem visualization.

  • Multiple advanced features that support:

    • Data volume control

    • Generation of additional user data.



Trace Generation

  • Automatic instrumentation by runtime system

  • Detailed

    • In log mode, each event is recorded in full detail (including timestamp) in an internal buffer.

  • Summary

    • Reduces the size of output files and memory overhead.

    • Produces a few lines of output data per processor.

    • This data is recorded in bins corresponding to intervals of size 1 ms by default.

  • Flexible

    • APIs and runtime options for instrumenting user events and data generation control.



The Summary View

  • Provides a view of the overall utilization of the application.

  • Very quick to load.



Graph View

  • Features:

    • Selectively view entry points.

    • Convenient means to switch between axis data types.



Timeline

  • The most detailed view in Projections.

  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.



Animations





Time Profile

  • The Time Profile tool identified a portion of CPAIMD (a quantum chemistry code) that ran too early.

  • Solved by prioritizing entry methods.





Overview

One line for each processor, time on the X-axis. White: busy; black: idle; red: intermediate.


A boring but good-performance overview



An interesting but pathetic overview



Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis).

Before optimizations: over 16 large stretched calls.

After optimizations: about 5 large stretched calls remain, the largest of them much smaller, and almost all calls take less than 3.2 ms.



Miscellaneous Features: Color Selection

  • Colors are automatically supplied by default.

  • We allow users to select their own colors and save them.

  • These colors can then be restored the next time Projections loads.



User APIs

  • Controlling trace generation

    • void traceBegin()

    • void traceEnd()

  • Tracing user events

    • int traceRegisterUserEvent(char *, int)

    • void traceUserEvent(char *)

    • void traceUserBracketEvent(int, double, double)

    • double CmiWallTimer()

  • Runtime options:

    • +traceoff

    • +traceroot <directory>

    • Projections mode only:

      • +logsize <# entries>

      • +gz-trace

    • Summary mode only:

      • +bincount <# of intervals>

      • +binsize <interval time quanta (us)>
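A usage sketch combining the tracing calls listed above; the event name, kernel function, and the -1 event-ID argument are illustrative assumptions:

      /* Register a user event once (assumption: -1 asks the runtime to
         assign an ID), then bracket a code region with timestamps.     */
      int ev = traceRegisterUserEvent("my kernel", -1);

      traceBegin();                    /* turn tracing on for this region */
      double t0 = CmiWallTimer();
      my_kernel();                     /* hypothetical application code */
      traceUserBracketEvent(ev, t0, CmiWallTimer());
      traceEnd();                      /* turn tracing off again */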



Performance Prediction via Parallel Simulation



BigSim: Performance Prediction

  • Extremely large parallel machines are already here or about to be available:

    • ASCI Purple (12k, 100TF)

    • Bluegene/L (64k, 360TF)

    • Bluegene/C (1M, 1PF)

  • How to write a petascale application?

  • What will the performance be like?

  • Would existing parallel applications scale?

    • The machines are not there yet

    • Parallel Performance is hard to model without actually running the program



Objectives and Simulation Model

  • Objectives:

    • Develop techniques to facilitate the development of efficient peta-scale applications

    • Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

    • Focus on Charm++ and AMPI programming models

    • Performance prediction based on PDES

    • Supports varying levels of fidelity

      • processor prediction, network prediction.

    • Modes of execution:

      • online and post-mortem



Blue Gene Emulator/Simulator

  • Actually: BigSim, for simulation of any large machine using smaller parallel machines

  • Emulator:

    • Allows development of programming environment and algorithms before the machine is built

    • Allowed us to port Charm++ to real BG/L in 1-2 days

  • Simulator:

    • Allows performance analysis of programs running on large machines, without access to the large machines

    • Uses Parallel Discrete Event Simulation



Architecture of BigNetSim

[Architecture diagram.]



Simulation Details

  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million-way parallelism (implemented using user-level threads).

    • Consider memory and stack-size limitations

  • Ensure time-stamp correction

  • Emulator layer API is built on top of machine layer

  • Charm++/AMPI implemented on top of emulator, like any other machine layer

  • Emulator layer supports all Charm features:

    • Load-balancing

    • Communication optimizations



Performance Prediction

  • Usefulness of performance prediction:

    • Application developer (making small modifications)

      • Difficult to get runtimes on huge current machines

      • For future machines, simulation is the only possibility

      • Performance debugging cycle can be considerably reduced

      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc

    • Machine architecture designer

      • Knowledge of how target applications behave can help identify problems with the machine design early

  • Record traces during parallel emulation

  • Run trace-driven simulation (PDES)



Performance Prediction (contd.)

  • Predicting the time of sequential code (a toy sketch follows this list):

    • User-supplied time for every code block

    • Wall-clock measurements on the simulating machine, scaled by a suitable multiplier

    • Hardware performance counters counting floating-point, integer, branch instructions, etc.

      • Cache performance and memory footprint are approximated by the percentage of memory accesses and the cache hit/miss ratio

    • Instruction-level simulation (not implemented)
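A toy illustration of the multiplier approach above; the single speed-ratio parameter is a simplifying assumption:

      /* Predicted sequential time on the target machine = time measured
         on the simulating machine, scaled by a suitable multiplier
         (e.g., the ratio of simulating-machine to target-machine speed). */
      double predict_seq_time(double measured_sec, double speed_ratio) {
        return measured_sec * speed_ratio;
      }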

  • Predicting network performance:

    • No contention: time based on topology and other network parameters

    • Back-patching: modifies communication time using the amount of communication activity

    • Network simulation: models the network entirely



Performance Prediction Validation

  • 7-point stencil program with 3D decomposition

    • Run on 32 real processors, simulating 64, 128,... PEs

  • NAMD benchmark: Apolipoprotein A1 dataset with 92k atoms, running for 15 timesteps

    • At large processor counts, the predicted value diverges from the actual value because of cache and memory effects



Performance on Large Machines

  • Problem:

    • How to predict performance of applications on future machines? (E.g. BG/L)

    • How to do performance tuning without continuous access to large machine?

  • Solution:

    • Leverage virtualization

    • Develop machine emulator

    • Simulator: accurate time modeling

    • Run program on “100,000 processors” using only hundreds of processors

  • Analysis:

    • Use performance viz. suite (projections)

  • Molecular Dynamics Benchmark

  • ER-GRE: 36573 atoms

    • 1.6 million objects

    • 8 step simulation

    • 16k BG processors



Projections: Performance Visualization



Network Simulation

  • Detailed implementation of interconnection networks

  • Configurable network parameters

    • Topology / routing

    • Input/output VC selection

    • Bandwidth / latency

    • NIC parameters

    • Buffer / message size, etc.

  • Support for hardware collectives in network layer



Higher-Level Programming

  • Orchestration language

    • Allows expressing global control flow in a Charm++ program

    • HPF-like flavor, but with Charm++-style processor virtualization and explicit communication

  • Multiphase Shared Arrays

    • Provides a disciplined use of shared address space

    • Each array can be accessed only in one of the following modes:

      • ReadOnly, Write-by-One-Thread, Accumulate-only

    • Access mode can change from phase to phase

    • Phases delineated by per-array “sync”



Other Projects

  • Faucets: flexible cluster scheduler; resource management across clusters

  • Multi-cluster applications

  • Load balancing strategies

  • Communication optimization

  • POSE: parallel discrete event simulation

  • ParFUM: parallel framework for unstructured meshes

  • We invite collaborations on:

    • Virtualization of other languages and libraries

    • New load balancing strategies

    • Applications



Some Active Collaborations

  • Biophysics: molecular dynamics (NIH, ...)

    • Long-standing collaboration (1991-): Klaus Schulten, Bob Skeel

    • Gordon Bell award in 2002

    • Production program used by biophysicists

  • Quantum chemistry (NSF)

    • QM/MM via the Car-Parrinello method

    • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale

  • Material simulation (NSF)

    • Dendritic growth, quenching, space-time meshes, QM/FEM

    • R. Haber, D. Johnson, J. Dantzig, and others

  • Rocket simulation (DOE)

    • DOE-funded ASCI center

    • Mike Heath, 30+ faculty

  • Computational cosmology (NSF, NASA)

    • Simulation; scalable visualization

  • Others: simulation of plasma, electromagnetics



Summary

  • We are pursuing a broad agenda

    • Aimed at productivity and performance in parallel programming

    • Intelligent runtime system for adaptive strategies

    • Charm++/AMPI are production-level systems

    • They support dynamic load balancing and communication optimizations

    • Performance prediction capabilities based on simulation

    • Basic fault tolerance and performance visualization tools are part of the suite

    • Application-oriented yet computer-science-centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC

http://charm.cs.uiuc.edu


