bigsim simulating petaflops supercomputers n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
BigSim: Simulating PetaFLOPS Supercomputers PowerPoint Presentation
Download Presentation
BigSim: Simulating PetaFLOPS Supercomputers

Loading in 2 Seconds...

play fullscreen
1 / 21

BigSim: Simulating PetaFLOPS Supercomputers - PowerPoint PPT Presentation


  • 130 Views
  • Uploaded on

BigSim: Simulating PetaFLOPS Supercomputers. Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign. Introduction. Petascale machines are being built It is hard to understand parallel application without actually running it

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'BigSim: Simulating PetaFLOPS Supercomputers' - zola


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
bigsim simulating petaflops supercomputers

BigSim: Simulating PetaFLOPS Supercomputers

Gengbin Zheng

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

introduction
Introduction
  • Petascale machines are being built
  • It is hard to understand parallel application without actually running it
  • Tools that allow one to predict the performance of applications before machines are built, also help debugging and tuning
  • Allows easier "offline" experimentation
  • Provides a feedback for machine designers

5th Annual Workshop on Charm++ and its Applications

overview
Overview
  • Prediction accuracy
    • Sequential performance
    • Parallel performance (message passing)
    • Flexibility - different levels of fidelity/tradeoffs
  • Challenges for usability
    • Memory constraints
    • Time cost
    • How to analyze result - scalable performance analysis tools
    • Portability
    • Scalability
  • BigSim system

5th Annual Workshop on Charm++ and its Applications

summary of our approach
Summary of Our Approach
  • Parallel Discrete Event Simulation (PDES)
    • the operation of a system is represented as a chronological sequence of events
  • Event-driven simulation method
  • Two phase simulation
    • BigSim Emulator, capable of running application
      • Charm++, Adaptive MPI / MPI
      • Predict sequential execution blocks
      • Generate trace logs
    • Parallel Discrete Event Simulator (POSE-based)
      • Predict message passing performance
  • Implemented on Charm++

5th Annual Workshop on Charm++ and its Applications

bigsim system architecture
BigSim System Architecture

Performance visualization (Projections)

Offline PDES

Network Simulator

BigSim Simulator

Simulation output trace logs

Charm++ Runtime

Load Balancing Module

Performance counters

Instruction Sim (RSim, IBM, ..)

Simple Network Model

BigSim Emulator

Charm++ and MPI applications

5th Annual Workshop on Charm++ and its Applications

bigsim emulator the first step
BigSim Emulator – the first step
  • Actual execution of real applications on much smaller machines
    • Model each target processor as virtual processor
    • Using light-weight (Converse) threads simulating each target processor
  • Charm++/AMPI applications without change
    • No global variables/swap-globals/copy-globals

5th Annual Workshop on Charm++ and its Applications

emulation on a parallel machine

Hardware thread

Simulating (Host) Processor

Emulation on a Parallel Machine

Target Nodes

5th Annual Workshop on Charm++ and its Applications

predicting sequential performance
Predicting Sequential Performance
  • Different level of fidelity:
    • Direct mapping
    • Performance counter
    • Instruction-level simulator
  • Tradeoff between time cost and accuracy
  • Exploit inherent determinacy of Charm++ application
    • Out of order messages do not change the computation behavior (structured dagger)

5th Annual Workshop on Charm++ and its Applications

trace logs
[22] name:msgep (srcnode:0 msgID:21) ep:1

[[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]]

backward:

forward: [23]

[23] name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0

[[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]]

msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208

msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208

backward: [22]

forward: [24]

[24] name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0

[[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]]

backward: [0x80a7af0 23]

forward: [25] [28]

Event logs are generated for each target processor

Predicted time on execution blocks with timestamps

Event dependencies

Message sending events

Trace Logs

5th Annual Workshop on Charm++ and its Applications

predicting message passing performance second step
Predicting message passing performance – second step
  • Parallel Discrete Event Simulation
    • Needed for correcting causality errors due to out-of-order messages
    • Timestamps are re-adjusted without rerunning the actual application
  • Simulate network behaviours: packetization, routing, contention, etc
    • simple latency based network model, and contention-based network model

5th Annual Workshop on Charm++ and its Applications

pdes and charm
PDES and Charm++
  • PDES is a naturally message-driven activity
  • Large number of entities naturally maps to virtual processors
  • Fine grained PDES
  • Migratable PDES entities and load balancing

5th Annual Workshop on Charm++ and its Applications

network simulator design
Network Simulator Design

5th Annual Workshop on Charm++ and its Applications

network simulator data flow
Network Simulator: Data Flow

5th Annual Workshop on Charm++ and its Applications

network communication pattern analysis
Network Communication Pattern Analysis
  • NAMD with apoa1
  • 15 timestep

5th Annual Workshop on Charm++ and its Applications

network communication pattern analysis1
Network Communication Pattern Analysis

Data transferred (KB) in a single time step

5th Annual Workshop on Charm++ and its Applications

namd network simulation scalability
NAMD network simulation (scalability)
  • Simulating NAMD on 2048 processors

with 3D Torus topology

  • Tests ran on Turing Apple cluster

with Myrinet

Terry L. Wilmarth, Gengbin Zheng, Eric J. Bohm, Yogesh Mehta, Praveen Jagadishprasad, Laxmikant V. Kalé,

``Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE'', PADS 2005

5th Annual Workshop on Charm++ and its Applications

leanmd performance analysis
LeanMD Performance Analysis
        • Benchmark 3-away ER-GRE
          • 36573 atoms
        • 1.6 million objects
        • 8 step simulation
        • 64k BG processors
  • Running on PSC Lemieux

5th Annual Workshop on Charm++ and its Applications

integration with instruction level simulators
Integration with Instruction level simulators
  • Instruction level simulators typically are
    • Standalone applications
    • Sequential execution
    • Very slow
  • An interpolation-based scheme
    • Use result of a smaller scale instruction level simulation to interpolate for large dataset
      • do a least-squares fit to determine the coefficients of an approximation polynomial function

5th Annual Workshop on Charm++ and its Applications

case study bigsim mambo
Case study: BigSim / Mambo

void func( )

{

StartBigSim( )

EndBigSim( )

}

Mambo

Prediction

for

Target System

Cycle-accurate prediction

of sequential blocks on

POWER7 processor

BigSim

Parallel

Emulation

BigSim

Parallel

Simulation

Interpolation

+

Replace sequential timing

Trace files

Parameter files for

sequential blocks

Adjusted trace files

5th Annual Workshop on Charm++ and its Applications

future work
Future Work
  • Use performance counter to predict sequential timing
  • Out-of-core execution for memory bound applications

5th Annual Workshop on Charm++ and its Applications

out of core emulation
Out-of-core Emulation
  • Motivation
    • Physical memory is shared
    • VM system would not handle well
  • Message driven execution
    • Peek msg queue => what execute next? (prefetch)

5th Annual Workshop on Charm++ and its Applications