Adaptive MPI: Intelligent runtime strategies  and performance prediction via simulation

Laxmikant Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

ICL@UTK

PPL Mission and Approach
  • To enhance Performance and Productivity in programming complex parallel applications
    • Performance: scalable to thousands of processors
    • Productivity: of human programmers
    • Complex: irregular structure, dynamic variations
  • Approach: Application Oriented yet CS centered research
    • Develop enabling technology for a wide collection of applications
    • Develop, use and test it in the context of real applications
  • How?
    • Develop novel Parallel programming techniques
    • Embody them into easy to use abstractions
    • So that application scientists can use advanced techniques with ease
    • Enabling technology: reused across many apps


Develop abstractions in context of full-scale applications

[Figure: applications — Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time meshes, Rocket Simulation — built around a common core of Parallel Objects, Adaptive Runtime System, Libraries and Tools]

The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE


Migratable Objects (aka Processor Virtualization)

Programmer: [over]decomposition into virtual processors

Runtime: assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI

[Figure: user view of virtual processors vs. the system implementation mapping them onto physical processors]

Benefits
  • Software engineering
    • Number of virtual processors can be independently controlled
    • Separate VPs for different modules
  • Message-driven execution
    • Adaptive overlap of communication
    • Predictability:
      • Automatic out-of-core execution
    • Asynchronous reductions
  • Dynamic mapping
    • Heterogeneous clusters
      • Vacate, adjust to speed, share
    • Automatic checkpointing
    • Change the set of processors used
    • Automatic dynamic load balancing
    • Communication optimization

Outline
  • Adaptive MPI
  • Load balancing
  • Fault tolerance
  • Projections: performance analysis
  • Performance prediction: BigSim


AMPI: MPI with Virtualization
  • Each virtual process implemented as a user-level thread embedded in a Charm++ object

[Figure: MPI "processes" implemented as virtual processors (user-level migratable threads) mapped onto real processors]
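For illustration, a hedged sketch of building and running an AMPI program with more virtual processors than physical processors (the compiler wrapper is an assumption; the +p/+vp launcher options follow the convention used later in these slides):

> ampicc -o pgm pgm.c
> ./charmrun ./pgm +p4 +vp16

Here +p4 requests 4 physical processors and +vp16 creates 16 virtual MPI ranks, so each processor hosts 4 user-level threads whose communication and computation the runtime can overlap and rebalance.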

Making AMPI work
  • Multiple user-level threads per processor:
    • Problems with global variables
    • Solution 1: Automatic: switch the GOT pointer at context switch
      • Available on most machines
    • Solution 2: Manual: replace global variables (see the sketch after this list)
    • Solution 3: Automatic: via compiler support (AMPIzer)
  • Migrating stacks
    • Use the isomalloc technique (Mehaut et al.)
    • Use memory files and mmap()
  • Heap data:
    • Isomalloc heaps
    • OR user-supplied pack/unpack functions for the heap data
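A minimal sketch of the manual approach (Solution 2), with hypothetical names: former globals are gathered into a per-thread structure that each AMPI thread owns privately, instead of a process-wide variable shared by all user-level threads on a processor.

    /* Before: a true global, incorrectly shared by all AMPI threads on a processor */
    /*   int iteration_count;                                                       */

    /* After: former globals grouped into a per-thread structure (hypothetical names) */
    typedef struct {
        int iteration_count;
    } ModuleState;

    void timestep(ModuleState *s) {
        s->iteration_count++;   /* each AMPI thread updates only its own copy */
    }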

ELF and global variables
  • The Executable and Linking Format (ELF)
    • The executable has a Global Offset Table (GOT) containing global data
    • The GOT pointer is stored in the %ebx register
    • Switch this pointer when switching between threads
    • Supported on Linux, Solaris 2.x, and more
  • Integrated in Charm++/AMPI
    • Invoked by the compile-time option -swapglobal (see the example below)
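For example (a hedged build line; the flag name is taken from this slide and the compiler wrapper is an assumption):

> ampicc -swapglobal -o pgm pgm.c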

Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratio of 1 and 8.

Comparison with Native MPI
  • Performance
    • Slightly worse without optimization
    • Being improved
  • Flexibility
    • Small number of PEs available
    • Special requirements of the algorithm
    • AMPI runs on any # of PEs (e.g. 19, 33, 105); native MPI needs a cube number of PEs.

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

AMPI Extensions
  • Automatic load balancing
  • Non-blocking collectives
  • Checkpoint/restart mechanism
  • Multi-module programming

Load Balancing in AMPI
  • Automatic load balancing: MPI_Migrate()
    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.
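A minimal sketch of how this is typically used (the loop structure, migration period, and helper function are assumptions; MPI_Migrate() is the AMPI collective described above):

    #define LB_PERIOD 20                /* hypothetical: offer migration every 20 steps */

    for (int step = 0; step < nsteps; step++) {
        do_timestep();                  /* application computation and communication */
        if (step % LB_PERIOD == 0)
            MPI_Migrate();              /* AMPI collective: this thread may be moved by the load balancer */
    }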

Load Balancing Steps

[Figure: timeline of regular timesteps, instrumented timesteps, a detailed/aggressive load balancing step (object migration), and a later refinement load balancing step]

Load Balancing

[Figure: processor utilization against time on 128 and 1024 processors, showing the aggressive and refinement load balancing steps]

On 128 processors, a single load balancing step suffices, but on 1024 processors we need a "refinement" step.

Shrink/Expand
  • Problem: Availability of computing platform may change
  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.


Optimized All-to-all "Surprise"

[Figure: AAPC completion time (ms) vs. message size (100B-8KB) for a 76-byte all-to-all on Lemieux, comparing Mesh and Direct strategies; and Radix Sort completion time vs. computation overhead]

The CPU is free during most of the time taken by a collective operation.

This led to the development of the asynchronous collectives now supported in AMPI.

Asynchronous Collectives
  • Our implementation is asynchronous
    • Collective operation posted
    • Test/wait for its completion
    • Meanwhile useful computation can utilize CPU

MPI_Ialltoall( … , &req);

/* other computation */

MPI_Wait(&req, &status);
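A slightly fuller sketch of the same pattern (the buffer names, count, and the surrounding work function are assumptions; the argument list follows the usual non-blocking all-to-all form):

    MPI_Request req;
    MPI_Status status;

    /* post the collective; it proceeds in the background */
    MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    do_local_work();            /* useful computation overlaps with the collective */

    MPI_Wait(&req, &status);    /* block only when the result is actually needed */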

Motivation
  • Applications need fast, low cost and scalable fault tolerance support
  • As machines grow in size
    • MTBF decreases
    • Applications have to tolerate faults
  • Our research
    • Disk based Checkpoint/Restart
    • In Memory Double Checkpointing/Restart
    • Sender based Message Logging
    • Proactive response to fault prediction
      • (impending fault response)

Checkpoint/Restart Mechanism
  • Automatic Checkpointing for AMPI and Charm++
    • Migrate objects to disk!
    • Automatic fault detection and restart
    • Now available in distribution version of AMPI and Charm++
  • Blocking Co-ordinated Checkpoint
    • States of chares are checkpointed to disk
    • Collective call MPI_Checkpoint(DIRNAME) (see the sketch after this list)
  • The entire job is restarted
    • Virtualization allows restarting on different # of processors
    • Runtime option
      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME
  • Simple but effective for common cases
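A minimal sketch of periodic checkpointing (the loop structure, period, and directory name are assumptions; the collective call and the restart command come from this slide):

    #define CKPT_PERIOD 10              /* hypothetical: checkpoint every 10 steps */

    for (int step = 0; step < nsteps; step++) {
        do_timestep();
        if (step % CKPT_PERIOD == 0)
            MPI_Checkpoint("ckptdir");  /* AMPI collective: all threads write their state to disk */
    }

After a failure the whole job is restarted, possibly on a different number of processors, using the +restart runtime option shown above.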

In-memory Double Checkpoint
  • In-memory checkpoint
    • Faster than disk
  • Co-ordinated checkpoint
    • Simple MPI_MemCheckpoint(void)
    • User can decide what makes up useful state
  • Double checkpointing
    • Each object maintains 2 checkpoints:
      • on the local physical processor
      • on a remote "buddy" processor
  • For jobs with large memory
    • Use local disks!

[Figure: 32 processors with 1.5 GB of memory each]

Restart
  • A “Dummy” process is created:
    • Need not have application data or checkpoint
    • Necessary for runtime
    • Starts recovery on all other Processors
  • Other processors:
    • Remove all chares
    • Restore checkpoints lost on the crashed PE
    • Restore chares from local checkpoints
  • Load balance after restart

Restart Performance
  • 10 crashes
  • 128 processors
  • Checkpoint every 10 time steps

Scalable Fault Tolerance
  • How?
    • Sender-side message logging
    • Asynchronous Checkpoints on buddy processors
    • Latency tolerance mitigates costs
    • Restart can be sped up by spreading out objects from the failed processor
  • Current progress
    • Basic scheme idea implemented and tested in simple programs
    • General purpose implementation in progress

Motivation: When a processor out of 100,000 fails, all 99,999 shouldn’t have to run back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”

Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)


Projections: performance visualization and analysis tool

An Introduction to Projections
  • Performance Analysis tool for Charm++ based applications.
  • Automatic trace instrumentation.
  • Post-mortem visualization.
  • Multiple advanced features that support:
    • Data volume control
    • Generation of additional user data.

Trace Generation
  • Automatic instrumentation by runtime system
  • Detailed
    • In the log mode each event is recorded in full detail (including timestamp) in an internal buffer.
  • Summary
    • Reduces the size of output files and memory overhead.
    • Produces a few lines of output data per processor.
    • This data is recorded in bins corresponding to intervals of size 1 ms by default.
  • Flexible
    • APIs and runtime options for instrumenting user events and data generation control.

The Summary View
  • Provides a view of the overall utilization of the application.
  • Very quick to load.

Graph View
  • Features:
    • Selectively view Entry points.
    • Convenient means to switch between axis data types.

Timeline
  • The most detailed view in Projections.
  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.

Animations

Time Profile
  • Identified a portion of CPAIMD (Quantum Chemistry code) that ran too early via the Time Profile tool.

Solved by prioritizing entry methods


Overview: one line for each processor, time on the X-axis (white: busy, black: idle, red: intermediate).

Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis).

Before optimizations: over 16 large stretched calls.

After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms.

Miscellaneous Features - Color Selection
  • Colors are automatically supplied by default.
  • We allow users to select their own colors and save them.
  • These colors can then be restored the next time Projections loads.

User APIs
  • Controlling trace generation
    • void traceBegin()
    • void traceEnd()
  • Tracing user events (see the sketch after this list)
    • int traceRegisterUserEvent(char *, int)
    • void traceUserEvent(char *)
    • void traceUserBracketEvent(int, double, double)
    • double CmiWallTimer()
  • Runtime options:
    • +traceoff
    • +traceroot <directory>
    • Projections mode only:
      • +logsize <# entries>
      • +gz-trace
    • Summary mode only:
      • +bincount <# of intervals>
      • +binsize <interval time quanta (us)>
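For illustration, a minimal sketch of bracketing a user event with the calls listed above (the event name, event ID, and the application function are hypothetical; signatures follow this slide's listing):

    /* register a user event once, e.g. at startup */
    int forceEvent = traceRegisterUserEvent("force computation", 100);

    /* later, bracket the region of interest with wall-clock timestamps */
    double t0 = CmiWallTimer();
    compute_forces();                                  /* application work (assumed) */
    traceUserBracketEvent(forceEvent, t0, CmiWallTimer());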


Performance Prediction via Parallel Simulation

BigSim: Performance Prediction
  • Extremely large parallel machines are already around, or about to be available:
    • ASCI Purple (12k, 100TF)
    • Bluegene/L (64k, 360TF)
    • Bluegene/C (1M, 1PF)
  • How to write a petascale application?
  • What will the performance be like?
  • Would existing parallel applications scale?
    • Machines are not there
    • Parallel Performance is hard to model without actually running the program

Objectives and Simulation Model
  • Objectives:
    • Develop techniques to facilitate the development of efficient peta-scale applications
    • Based on performance prediction of applications on large simulated parallel machines
  • Simulation-based performance prediction:
    • Focus on Charm++ and AMPI programming models
    • Performance prediction based on PDES
    • Supports varying levels of fidelity
      • processor prediction, network prediction
    • Modes of execution:
      • online and post-mortem mode

Blue Gene Emulator/Simulator
  • Actually: BigSim, for simulation of any large machine using smaller parallel machines
  • Emulator:
    • Allows development of programming environment and algorithms before the machine is built
    • Allowed us to port Charm++ to real BG/L in 1-2 days
  • Simulator:
    • Allows performance analysis of programs running on large machines, without access to the large machines
    • Uses Parallel Discrete Event Simulation

Simulation Details
  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million way parallelism (implemented using user-threads).
    • Consider memory and stack-size limitations
  • Ensure time-stamp correction
  • Emulator layer API is built on top of machine layer
  • Charm++/AMPI implemented on top of emulator, like any other machine layer
  • Emulator layer supports all Charm features:
    • Load-balancing
    • Communication optimizations

Performance Prediction
  • Usefulness of performance prediction:
    • Application developer (making small modifications)
      • Difficult to get runtimes on huge current machines
      • For future machines, simulation is the only possibility
      • Performance debugging cycle can be considerably reduced
      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc
    • Machine architecture designer
      • Knowledge of how target applications behave on it, can help identify problems with machine design early
  • Record traces during parallel emulation
  • Run trace-driven simulation (PDES)

Performance Prediction (contd.)
  • Predicting time of sequential code:
    • User-supplied time for every code block
    • Wall-clock measurements on the simulating machine can be used via a suitable multiplier (see the example after this list)
    • Hardware performance counters to count floating point, integer, branch instructions, etc.
      • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio
    • Instruction-level simulation (not implemented)
  • Predicting network performance:
    • No contention: time based on topology and other network parameters
    • Back-patching: modifies communication time using the amount of communication activity
    • Network simulation: models the network entirely
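As a hedged illustration of the multiplier approach (the numbers are invented): if the target processor is estimated to run a given code block at half the speed of the simulating host, a block measured at 1.0 ms of host wall-clock time is charged 2.0 ms of virtual time in the simulation, i.e. predicted time = measured time × (host speed / target speed).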

Performance Prediction Validation
  • 7-point stencil program with 3D decomposition
    • Run on 32 real processors, simulating 64, 128,... PEs
  • NAMD benchmark Apo-Lipoprotein A1 atom dataset with 92k atoms, running for 15 timesteps
    • For large numbers of processors, the predicted value seems to diverge from the actual value because of cache and memory effects

Performance on Large Machines
  • Problem:
    • How to predict performance of applications on future machines? (E.g. BG/L)
    • How to do performance tuning without continuous access to large machine?
  • Solution:
    • Leverage virtualization
    • Develop machine emulator
    • Simulator: accurate time modeling
    • Run program on “100,000 processors” using only hundreds of processors
  • Analysis:
    • Use performance viz. suite (projections)
  • Molecular Dynamics Benchmark
  • ER-GRE: 36573 atoms
    • 1.6 million objects
    • 8 step simulation
    • 16k BG processors

Network Simulation
  • Detailed implementation of interconnection networks
  • Configurable network parameters
    • Topology / Routing
    • Input / Output VC selection
    • Bandwidth / Latency
    • NIC parameters
    • Buffer / Message size, etc
  • Support for hardware collectives in network layer

Higher level programming
  • Orchestration language
    • Allows expressing global control flow in a Charm++ program
    • HPF-like flavor, but with Charm++-like processor virtualization and explicit communication
  • Multiphase Shared Arrays
    • Provides a disciplined use of shared address space
    • Each array can be accessed only in one of the following modes:
      • ReadOnly, Write-by-One-Thread, Accumulate-only
    • Access mode can change from phase to phase
    • Phases delineated by per-array “sync”

Other Projects
  • Faucets: flexible cluster scheduler
    • Resource management across clusters
    • Multi-cluster applications
  • Load balancing strategies
  • Communication optimization
  • POSE: parallel discrete event simulation
  • ParFUM: parallel framework for unstructured meshes
  • We invite collaborations:
    • Virtualization of other languages and libraries
    • New load balancing strategies
    • Applications

Some Active Collaborations
  • Biophysics: Molecular Dynamics (NIH, ...)
    • Long-standing collaboration (1991-) with Klaus Schulten and Bob Skeel
    • Gordon Bell award in 2002
    • Production program used by biophysicists
  • Quantum Chemistry (NSF)
    • QM/MM via the Car-Parrinello method +
    • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale
  • Material simulation (NSF)
    • Dendritic growth, quenching, space-time meshes, QM/FEM
    • R. Haber, D. Johnson, J. Dantzig, +
  • Rocket simulation (DOE)
    • DOE-funded ASCI center
    • Mike Heath, +30 faculty
  • Computational Cosmology (NSF, NASA)
    • Simulation
    • Scalable visualization
  • Others
    • Simulation of plasma
    • Electromagnetics

Summary
  • We are pursuing a broad agenda
    • aimed at productivity and performance in parallel programming
    • Intelligent Runtime System for adaptive strategies
    • Charm++/AMPI are production level systems
    • Support dynamic load balancing, communication optimizations
    • Performance prediction capabilities based on simulation
    • Basic Fault tolerance, performance viz tools are part of the suite
    • Application-oriented yet Computer Science centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC

http://charm.cs.uiuc.edu
