
Adaptive MPI: Intelligent runtime strategies  and performance prediction via simulation

Laxmikant Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana Champaign


PPL Mission and Approach

  • To enhance Performance and Productivity in programming complex parallel applications

    • Performance: scalable to thousands of processors

    • Productivity: of human programmers

    • Complex: irregular structure, dynamic variations

  • Approach: Application Oriented yet CS centered research

    • Develop enabling technology for a wide collection of apps.

    • Develop, use and test it in the context of real applications

  • How?

    • Develop novel Parallel programming techniques

    • Embody them into easy to use abstractions

    • So that application scientists can use advanced techniques with ease

    • Enabling technology: reused across many apps


Develop abstractions in context of full-scale applications

Applications: Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time Meshes, Rocket Simulation

Enabling layer: Parallel Objects, Adaptive Runtime System, Libraries and Tools

The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE


Benefits

  • Software engineering

    • Number of virtual processors can be independently controlled

    • Separate VPs for different modules

  • Message driven execution

    • Adaptive overlap of communication

    • Predictability:

      • Automatic out-of-core

    • Asynchronous reductions

  • Dynamic mapping

    • Heterogeneous clusters

      • Vacate, adjust to speed, share

    • Automatic checkpointing

    • Change set of processors used

    • Automatic dynamic load balancing

    • Communication optimization

[Figure: user view vs. system implementation of migratable objects]

Migratable Objects (aka Processor Virtualization)

Programmer: [Over] decomposition into virtual processors

Runtime: assigns VPs to processors

Enables adaptive runtime strategies

Implementations: Charm++, AMPI


Outline

  • Adaptive MPI

  • Load balancing

  • Fault tolerance

  • Projections: performance analysis

  • Performance prediction: BigSim


[Figure: MPI “processes” implemented as virtual processes (user-level migratable threads), mapped by the runtime onto real processors]

AMPI: MPI with Virtualization

  • Each virtual process implemented as a user-level thread embedded in a Charm object


Making AMPI work

  • Multiple user-level threads per processor:

    • Problems with global variables

    • Solution 1:

      • Automatic: switch the GOT pointer at context switch

      • Available on most machines

    • Solution 2: Manual: replace global variables (see the sketch after this list)

    • Solution 3: Automatic: via compiler support (AMPIzer)

  • Migrating Stacks

    • Use isomalloc technique (Mehaut et al)

    • Use memory files and mmap()

  • Heap data:

    • Isomalloc heaps

    • OR user supplied pack/unpack functions for the heap data
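A minimal sketch of Solution 2 (manual privatization), assuming a hypothetical application whose original code kept its state in file-scope globals. The idea is to gather that state into a per-rank struct that every routine receives explicitly, so each AMPI virtual processor (user-level thread) owns its own copy and the data migrates with the thread; the struct name and fields below are illustrative only.

      #include <mpi.h>
      #include <stdio.h>

      struct RankState {            /* formerly file-scope globals */
        int iteration;
        double residual;
      };

      static void do_step(RankState *s) {   /* every routine takes the state explicitly */
        s->iteration++;
        s->residual *= 0.5;
      }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        RankState state = {0, 1.0};         /* one copy per MPI "process" (virtual processor) */
        for (int i = 0; i < 10; i++) do_step(&state);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: iter=%d residual=%g\n", rank, state.iteration, state.residual);
        MPI_Finalize();
        return 0;
      }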


ELF and global variables

  • The Executable and Linking Format (ELF)

    • Executable has a Global Offset Table containing global data

    • GOT pointer stored in the %ebx register

    • Switch this pointer when switching between threads

    • Supported on Linux, Solaris 2.x, and more

  • Integrated in Charm++/AMPI

    • Invoked by the compile-time option -swapglobal


Adaptive overlap and modules

SPMD and Message-Driven Modules

(From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D Thesis, Apr 1994.)

Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995


Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.


Comparison with Native MPI

  • Performance: slightly worse without optimizations; being improved

  • Flexibility: AMPI runs on any number of PEs (e.g. 19, 33, 105), while native MPI needs a cube number

    • Useful when only a small number of PEs is available, or when the algorithm has special requirements

Problem setup: 3D stencil calculation of size 240³ run on Lemieux.


AMPI Extensions

  • Automatic load balancing

  • Non-blocking collectives

  • Checkpoint/restart mechanism

  • Multi-module programming


Load Balancing in AMPI

  • Automatic load balancing: MPI_Migrate()

    • Collective call informing the load balancer that the thread is ready to be migrated, if needed.
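A minimal sketch (not from the slides) of how an AMPI program might invoke this: the collective is called periodically from the timestep loop so the runtime can migrate threads when the load balancer decides to. The loop bounds, interval, and timestep() routine are assumptions for illustration.

      #include <mpi.h>

      static void timestep(void) { /* application work (assumed) */ }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
          timestep();
          if (step % 50 == 0)   /* periodically offer this thread for migration */
            MPI_Migrate();      /* AMPI extension named above; collective over all threads */
        }
        MPI_Finalize();
        return 0;
      }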


Load Balancing Steps

Phases along the timeline: regular timesteps → instrumented timesteps → detailed, aggressive load balancing (object migration) → refinement load balancing


Load Balancing

Aggressive Load Balancing

Refinement Load Balancing

Processor Utilization against Time on 128 and 1024 processors

On 128 processors, a single load balancing step suffices, but

On 1024 processors, we need a “refinement” step.


Shrink/Expand

  • Problem: Availability of computing platform may change

  • Fitting applications on the platform by object migration

Time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600


Radix Sort

Optimized All-to-all “Surprise”

Completion time vs. computation overhead

76 bytes all-to-all on Lemieux

CPU is free during most of the time taken by a collective operation

Led to the development of Asynchronous Collectives, now supported in AMPI

[Figure: AAPC completion time (ms) vs. message size (100B–8KB) on Lemieux for the Mesh and Direct all-to-all strategies]


Asynchronous Collectives

  • Our implementation is asynchronous

    • Collective operation posted

    • Test/wait for its completion

    • Meanwhile, useful computation can utilize the CPU

      MPI_Ialltoall( … , &req);

      /* other computation */

      MPI_Wait(&req, MPI_STATUS_IGNORE);
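A slightly fuller, self-contained sketch of the same pattern, assuming an integer all-to-all and a made-up useful_work() routine; AMPI provided MPI_Ialltoall before it became a standard MPI-3 call, so the same code also runs on a recent MPI library.

      #include <mpi.h>
      #include <vector>

      static double useful_work(int n) {           /* computation overlapped with the collective */
        double s = 0.0;
        for (int i = 0; i < n; i++) s += 1.0 / (i + 1);
        return s;
      }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        std::vector<int> sendbuf(p, 1), recvbuf(p, 0);

        MPI_Request req;
        MPI_Ialltoall(sendbuf.data(), 1, MPI_INT,
                      recvbuf.data(), 1, MPI_INT, MPI_COMM_WORLD, &req);

        double s = useful_work(1000000);            /* CPU stays busy while the AAPC progresses */

        MPI_Wait(&req, MPI_STATUS_IGNORE);          /* completion of the collective */
        MPI_Finalize();
        return s > 0.0 ? 0 : 1;
      }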


Fault Tolerance


Motivation

  • Applications need fast, low cost and scalable fault tolerance support

  • As machines grow in size

    • MTBF decreases

    • Applications have to tolerate faults

  • Our research

    • Disk based Checkpoint/Restart

    • In Memory Double Checkpointing/Restart

    • Sender based Message Logging

    • Proactive response to fault prediction

      • (impending fault response)


Checkpoint/Restart Mechanism

  • Automatic Checkpointing for AMPI and Charm++

    • Migrate objects to disk!

    • Automatic fault detection and restart

    • Now available in distribution version of AMPI and Charm++

  • Blocking Co-ordinated Checkpoint

    • States of chares are checkpointed to disk

    • Collective call MPI_Checkpoint(DIRNAME); a usage sketch follows this list

  • The entire job is restarted

    • Virtualization allows restarting on different # of processors

    • Runtime option

      • > ./charmrun pgm +p4 +vp16 +restart DIRNAME

  • Simple but effective for common cases
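A minimal sketch of periodic disk checkpointing using the collective named above; the directory name, checkpoint interval, and timestep() routine are assumptions for illustration.

      #include <mpi.h>

      static void timestep(void) { /* application work (assumed) */ }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
          timestep();
          if (step % 100 == 0)
            MPI_Checkpoint("ckpt_dir");   /* collective: all threads save their state to disk */
        }
        MPI_Finalize();
        return 0;
      }

To restart, possibly on a different number of physical processors, use the runtime option shown above, e.g. ./charmrun pgm +p4 +vp16 +restart ckpt_dir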


In-memory Double Checkpoint

  • In-memory checkpoint

    • Faster than disk

  • Co-ordinated checkpoint

    • Simple collective: MPI_MemCheckpoint(void) (sketched below)

    • User can decide what makes up useful state

  • Double checkpointing

    • Each object maintains 2 checkpoints:

    • Local physical processor

    • Remote “buddy” processor

  • For jobs with large memory

    • Use local disks!

32 processors with 1.5GB memory each
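The in-memory variant looks the same from the application's point of view; a brief sketch, again with an assumed timestep loop and interval:

      #include <mpi.h>

      static void timestep(void) { /* application work (assumed) */ }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
          timestep();
          if (step % 10 == 0)
            MPI_MemCheckpoint();   /* each object keeps one local and one buddy copy in memory */
        }
        MPI_Finalize();
        return 0;
      }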


Restart

  • A “Dummy” process is created:

    • Need not have application data or checkpoint

    • Necessary for runtime

    • Starts recovery on all other Processors

  • Other processors:

    • Remove all chares

    • Restore checkpoints lost on the crashed PE

    • Restore chares from local checkpoints

  • Load balance after restart


Restart Performance

  • 10 crashes

  • 128 processors

  • Checkpoint every 10 time steps


Scalable Fault Tolerance

  • How?

    • Sender-side message logging

    • Asynchronous Checkpoints on buddy processors

    • Latency tolerance mitigates costs

    • Restart can be sped up by spreading out objects from the failed processor

  • Current progress

    • Basic scheme idea implemented and tested in simple programs

    • General purpose implementation in progress

Motivation: when one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!

Only failed processor’s objects recover from checkpoints, playing back their messages, while others “continue”


Recovery Performance

Execution Time with increasing number of faults on 8 processors

(Checkpoint period 30s)


Projections:

Performance visualization and analysis tool


An Introduction to Projections

  • Performance Analysis tool for Charm++ based applications.

  • Automatic trace instrumentation.

  • Post-mortem visualization.

  • Multiple advanced features that support:

    • Data volume control

    • Generation of additional user data.


Trace Generation

  • Automatic instrumentation by runtime system

  • Detailed

    • In the log mode each event is recorded in full detail (including timestamp) in an internal buffer.

  • Summary

    • Reduces the size of output files and memory overhead.

    • Produces a few lines of output data per processor.

    • This data is recorded in bins corresponding to intervals of size 1 ms by default.

  • Flexible

    • APIs and runtime options for instrumenting user events and data generation control.


The Summary View

  • Provides a view of the overall utilization of the application.

  • Very quick to load.


Graph View

  • Features:

    • Selectively view Entry points.

    • Convenient means to switch between axis data types.


Timeline

  • The most detailed view in Projections.

  • Useful for understanding critical path issues or unusual entry point behaviors at specific times.


Animations


Time Profile

  • Identified a portion of CPAIMD (Quantum Chemistry code) that ran too early via the Time Profile tool.

Solved by prioritizing entry methods


Overview: one line for each processor, time on X-axis

White: busy; Black: idle; Red: intermediate


A boring but good-performance overview


An interesting but pathetic overview


Stretch Removal

Histogram views: number of function executions vs. their granularity (note: log scale on Y-axis)

Before Optimizations: over 16 large stretched calls

After Optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms


Miscellaneous Features -Color Selection

  • Colors are automatically supplied by default.

  • We allow users to select their own colors and save them.

  • These colors can then be restored the next time Projections loads.


User APIs

  • Controlling trace generation

    • void traceBegin()

    • void traceEnd()

  • Tracing user events (usage sketched after this list)

    • int traceRegisterUserEvent(char *, int)

    • void traceUserEvent(char *)

    • void traceUserBracketEvent(int, double, double)

    • double CmiWallTimer()

  • Runtime options:

    • +traceoff

    • +traceroot <directory>

    • Projections mode only:

      • +logsize <# entries>

      • +gz-trace

    • Summary mode only:

      • +bincount <# of intervals>

      • +binsize <interval time quanta (us)>
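A short sketch of bracketing a user event with the calls listed above, in an AMPI program; the event name, the solve() routine, the use of -1 to let the runtime pick an event id, and the extern declarations (normally supplied by the Charm++/AMPI headers) are assumptions for illustration.

      #include <mpi.h>

      extern "C" {                       /* prototypes as listed on this slide */
        void traceBegin();
        void traceEnd();
        int traceRegisterUserEvent(char *name, int eventId);
        void traceUserBracketEvent(int eventId, double startTime, double endTime);
        double CmiWallTimer();
      }

      static void solve(void) { /* application work (assumed) */ }

      int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        traceBegin();                                    /* start tracing this region */
        char name[] = "solver phase";
        int ev = traceRegisterUserEvent(name, -1);       /* register a named user event */
        double t0 = CmiWallTimer();
        solve();
        traceUserBracketEvent(ev, t0, CmiWallTimer());   /* log the bracketed interval */
        traceEnd();                                      /* stop tracing */
        MPI_Finalize();
        return 0;
      }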


Performance Prediction via Parallel Simulation


BigSim: Performance Prediction

  • Extremely large parallel machines are already here or about to be available:

    • ASCI Purple (12k, 100TF)

    • Bluegene/L (64k, 360TF)

    • Bluegene/C (1M, 1PF)

  • How to write a petascale application?

  • What will the performance be like?

  • Would existing parallel applications scale?

    • Machines are not there

    • Parallel Performance is hard to model without actually running the program


Objectives and Simulation Model

  • Objectives:

    • Develop techniques to facilitate the development of efficient peta-scale applications

    • Based on performance prediction of applications on large simulated parallel machines

  • Simulation-based Performance Prediction:

    • Focus on Charm++ and AMPI programming models; performance prediction based on PDES

    • Supports varying levels of fidelity

      • processor prediction, network prediction.

    • Modes of execution :

      • online and post-mortem mode


Blue Gene Emulator/Simulator

  • Actually: BigSim, for simulation of any large machine using smaller parallel machines

  • Emulator:

    • Allows development of programming environment and algorithms before the machine is built

    • Allowed us to port Charm++ to real BG/L in 1-2 days

  • Simulator:

    • Allows performance analysis of programs running on large machines, without access to the large machines

    • Uses Parallel Discrete Event Simulation


Architecture of BigNetSim


Simulation Details

  • Emulate large parallel machines on smaller existing parallel machines – run a program with multi-million way parallelism (implemented using user-threads).

    • Consider memory and stack-size limitations

  • Ensure time-stamp correction

  • Emulator layer API is built on top of machine layer

  • Charm++/AMPI implemented on top of emulator, like any other machine layer

  • Emulator layer supports all Charm features:

    • Load-balancing

    • Communication optimizations


Performance Prediction

  • Usefulness of performance prediction:

    • Application developer (making small modifications)

      • Difficult to get runtimes on huge current machines

      • For future machines, simulation is the only possibility

      • Performance debugging cycle can be considerably reduced

      • Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc

    • Machine architecture designer

      • Knowledge of how target applications behave on it, can help identify problems with machine design early

  • Record traces during parallel emulation

  • Run trace-driven simulation (PDES)


Performance Prediction (contd.)

  • Predicting time of sequential code:

    • User supplied time for every code block

    • Wall-clock measurements on simulating machine can be used via a suitable multiplier

    • Hardware performance counters to count floating point, integer, branch instructions, etc

      • Cache performance and memory footprint are approximated by percentage of memory accesses and cache hit/miss ratio

    • Instruction level simulation (not implemented)

  • Predicting Network performance:

    • No contention: time based on topology and other network parameters

    • Back-patching: modifies communication time using the amount of communication activity

    • Network simulation: models the network entirely


Performance Prediction Validation

  • 7-point stencil program with 3D decomposition

    • Run on 32 real processors, simulating 64, 128,... PEs

  • NAMD benchmark: Apolipoprotein A1 dataset with 92k atoms, run for 15 timesteps

    • For large processor counts, the predicted value diverges from the actual value because of cache and memory effects


Performance on Large Machines

  • Problem:

    • How to predict performance of applications on future machines? (E.g. BG/L)

    • How to do performance tuning without continuous access to large machine?

  • Solution:

    • Leverage virtualization

    • Develop machine emulator

    • Simulator: accurate time modeling

    • Run program on “100,000 processors” using only hundreds of processors

  • Analysis:

    • Use performance viz. suite (projections)

  • Molecular Dynamics Benchmark

  • ER-GRE: 36573 atoms

    • 1.6 million objects

    • 8 step simulation

    • 16k BG processors


Projections: Performance visualization


Network Simulation

  • Detailed implementation of interconnection networks

  • Configurable network parameters

    • Topology / Routing

    • Input / output VC selection

    • Bandwidth / Latency

    • NIC parameters

    • Buffer / Message size, etc

  • Support for hardware collectives in network layer


Higher level programming

  • Orchestration language

    • Allows expressing global control flow in a Charm++ program

    • HPF-like flavor, but with Charm++-style processor virtualization and explicit communication

  • Multiphase Shared Arrays

    • Provides a disciplined use of shared address space

    • Each array can be accessed only in one of the following modes:

      • ReadOnly, Write-by-One-Thread, Accumulate-only

    • Access mode can change from phase to phase

    • Phases delineated by per-array “sync”
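A hypothetical sketch (not the actual MSA API; the class and method names are invented for illustration) of the phase discipline just described: each array is used in a single mode within a phase, and a per-array sync ends the phase so the mode may change.

      #include <cstddef>

      template <typename T>
      class SharedArray {                 /* stand-in for a multiphase shared array */
       public:
        explicit SharedArray(std::size_t n) : n_(n), data_(new T[n]()) {}
        ~SharedArray() { delete[] data_; }
        T read(std::size_t i) const { return data_[i]; }         /* read-only mode */
        void write(std::size_t i, T v) { data_[i] = v; }         /* write-by-one-thread mode */
        void accumulate(std::size_t i, T v) { data_[i] += v; }   /* accumulate-only mode */
        void sync() { /* all threads wait; the array's mode may change after this */ }
       private:
        std::size_t n_;
        T *data_;
      };

      void example(SharedArray<double> &a, SharedArray<double> &sum, int myRow) {
        a.write(myRow, 1.0);                /* phase 1: 'a' in write-by-one-thread mode */
        a.sync();
        sum.accumulate(0, a.read(myRow));   /* phase 2: 'a' read-only, 'sum' accumulate-only */
        sum.sync();
        a.sync();
      }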


Other Projects

  • Faucets: flexible cluster scheduler; resource management across clusters; multi-cluster applications

  • Load balancing strategies

  • Communication optimization

  • POSE: parallel discrete event simulation

  • ParFUM: parallel framework for unstructured meshes

  • We invite collaborations on:

    • Virtualization of other languages and libraries

    • New load balancing strategies

    • Applications


Some Active Collaborations

  • Biophysics: molecular dynamics (NIH)

    • Long-standing collaboration (1991–) with Klaus Schulten and Bob Skeel

    • Gordon Bell award in 2002; production program used by biophysicists

  • Quantum chemistry (NSF): QM/MM via the Car-Parrinello method

    • Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale

  • Material simulation (NSF): dendritic growth, quenching, space-time meshes, QM/FEM

    • R. Haber, D. Johnson, J. Dantzig, and others

  • Rocket simulation (DOE): DOE-funded ASCI center; Mike Heath and 30+ faculty

  • Computational cosmology (NSF, NASA): simulation and scalable visualization

  • Others: simulation of plasma, electromagnetics


Summary

  • We are pursuing a broad agenda

    • aimed at productivity and performance in parallel programming

    • Intelligent Runtime System for adaptive strategies

    • Charm++/AMPI are production level systems

    • Support dynamic load balancing, communication optimizations

    • Performance prediction capabilities based on simulation

    • Basic fault tolerance and performance visualization tools are part of the suite

    • Application-oriented yet Computer Science centered research

Workshop on Charm++ and Applications: Oct 18-20, UIUC

http://charm.cs.uiuc.edu
