slide1
Download
Skip this Video
Download Presentation
ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers

Loading in 2 Seconds...

play fullscreen
1 / 27

ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers - PowerPoint PPT Presentation


  • 109 Views
  • Uploaded on

Christopher D. Carothers Department of Computer Science Rensselaer Polytechnic Institute [email protected] ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers. Outline. Motivation for PDES Overview of HPC Platforms ROSS Implementation Performance Results

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers' - tacy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1
Christopher D. Carothers

Department of Computer Science

Rensselaer Polytechnic Institute

[email protected]

ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
outline
Outline
  • Motivation for PDES
  • Overview of HPC Platforms
  • ROSS Implementation
  • Performance Results
  • Summary
motivation
Motivation

Why Parallel Discrete-Event Simulation (DES)?

  • Large-scale systems are difficult to understand
  • Analytical models are often constrained

Parallel DES simulation offers:

  • Dramatically shrinks model’s execution-time
  • Prediction of future “what-if” systems performance
  • Potential for real-time decision support
    • Minutes instead of days
    • Analysis can be done right away
  • Example models: national air space (NAS), ISP backbone(s), distributed content caches, next generation supercomputer systems.
model a 10 pf supercomputer
Model a 10 PF Supercomputer
  • Suppose we want to model a 10 PF supercomputer at the MPI message level
  • How long excute DES model?
    • 10% flop rate  1 PF sustained
    • @ .2 bytes/sec per flop @ 1% usage  2 TB/sec
    • @ 1K size MPI msgs  2 billion msgs per simulated second
    • @ 8 hops per msg  16 billion “events” per simulated second
    • @ 1000 simulated seconds  16 trillion events for DES model
    • No I/O included !!
    • Nominal seq. DES simulator  100K events/sec
  • 16 trillion events @ 100K ev/sec

5+ years!!!

Need massively parallel simulation to make tractable

blue gene l layout
Blue Gene /L Layout
  • CCNI “fen”
  • 32K cores/ 16 racks
  • 12 TB / 8 TB usable RAM
  • ~1 PB of disk over GPFS
  • Custom OS kernel
blue gene p layout
Blue Gene /P Layout
  • ALCF/ANL “Intrepid”
  • 163K cores/ 40 racks
  • ~80TB RAM
  • ~8 PB of disk over GPFS
  • Custom OS kernel
slide8

parallel time-stepped simulation:

lock-step execution

parallel discrete-event simulation:

must allow for sparse, irregular

event computations

barrier

Problem: events arriving

in the past

Solution: Time Warp

Virtual

Time

Virtual

Time

PE 2

PE 3

PE 1

PE 2

PE 3

PE 1

How to Synchronize Parallel Simulations?

processed event

“straggler” event

slide9

Massively Parallel Discrete-Event Simulation Via Time Warp

Local Control Mechanism:

error detection and rollback

Global Control Mechanism:

compute Global Virtual Time (GVT)

V

i

r

t

u

a

l

T

i

m

e

V

i

r

t

u

a

l

T

i

m

e

collect versions

of state / events

& perform I/O

operations

that are < GVT

(1) undo

state D’s

(2) cancel

“sent” events

GVT

LP 2

LP 3

LP 1

LP 2

LP 3

LP 1

unprocessed event

processed event

“straggler” event

“committed” event

our solution reverse computation
Our Solution: Reverse Computation...
  • Use Reverse Computation (RC)
    • automatically generate reverse code from model source
    • undo by executing reverse code
  • Delivers better performance
    • negligible overhead for forward computation
    • significantly lower memory utilization

Original Code

Compiler

Modified Code

Reverse Code

ex simple network switch

Forward

Reverse

if( qlen < B )

b1 = 1

qlen++

delays[qlen]++

else

b1 = 0

lost++

if( b1 == 1 )

delays[qlen]--

qlen--

else

lost--

Ex: Simple Network Switch

Original

N

if( qlen < B )

qlen++

delays[qlen]++

else

lost++

B

on packet arrival...

beneficial application properties
Beneficial Application Properties

1. Majority of operations are constructive

  • e.g., ++, --, etc.

2. Size of control state< size of data state

  • e.g., size of b1 < size of qlen, sent, lost, etc.

3. Perfectly reversible high-level operations

gleaned from irreversible smaller operations

  • e.g., random number generation
destructive assignment
Destructive Assignment...
  • Destructive assignment (DA):
    • examples: x = y; x %= y;
    • requires all modified bytes to be saved
  • Caveat:
    • reversing technique forDA’s can degenerate to traditional incremental state saving
  • Good news:
    • certain collections of DA’s are perfectly reversible!
    • queueing network models contain collections of easily/perfectly reversible DA’s
      • queue handling (swap, shift, tree insert/delete, … )
      • statistics collection (increment, decrement, …)
      • random number generation (reversible RNGs)
rc applications

Original

if( qlen < B )

qlen++

delays[qlen]++

else

lost++

B

packet arrival...

Forward

Reverse

if( qlen < B )

b1 = 1

qlen++

delays[qlen]++

else

b1 = 0

lost++

if( b1 == 1 )

delays[qlen]--

qlen--

else

lost--

RC Applications
  • PDES applications include:
    • Wireless telephone networks
    • Distributed content caches
    • Large-scale Internet models –
      • TCP over AT&T backbone
      • Leverges RC “swaps”
    • Hodgkin-Huxley neuron models
    • Plasma physics models using PIC
    • Pose -- UIUC
  • Non-DES include:
    • Debugging
    • PISA – Reversible instruction set architecture for low power computing
    • Quantum computing
local control implementation
Local Control Implementation

Local Control Mechanism:

error detection and rollback

V

i

r

t

u

a

l

T

i

m

e

  • MPI_ISend/MPI_Irecv used to send/recv off core events
  • Event & Network memory is managed directly.
    • Pool is allocated @ startup
  • Event list keep sorted using a Splay Tree (logN)
  • LP-2-Core mapping tables are computed and not stored to avoid the need for large global LP maps.

(1) undo

state D’s

(2) cancel

“sent” events

LP 2

LP 3

LP 1

global control implementation
Global Control Implementation

Global Control Mechanism:

compute Global Virtual Time (GVT)

V

i

r

t

u

a

l

T

i

m

e

GVT (kicks off when memory is low):

  • Each core counts #sent, #recv
  • Recv all pending MPI msgs.
  • MPI_Allreduce Sum on (#sent - #recv)
  • If #sent - #recv != 0 goto 2
  • Compute local core’s lower bound time-stamp (LVT).
  • GVT = MPI_Allreduce Min on LVTs

Algorithms needs efficient MPI collective

LC/GC can be very sensitive to OS jitter

collect versions

of state / events

& perform I/O

operations

that are < GVT

GVT

LP 2

LP 3

LP 1

So, how does this translate into Time Warp performance on BG/L & BG/P?

performance results setup
Performance Results: Setup
  • PHOLD
    • Synthetic benchmark model
    • 1024x1024 grid of LPs
    • Each LP has 10 initial events
    • Event routed randomly among all LPs based on a configurable “percent remote” parameter
    • Time stamps are exponentially distributed with a mean of 1.0 (i.e., lookahead is 0).
  • TLM – Tranmission Line Matrix
    • Discrete electromagnetic propagation wave model
    • Used model the physical layer of MANETs
    • As accurate as previous “ray tracing” models, but dramatically faster…
    • Considers wave attenuation effects
    • Event populations grows cubically outward from the single “radio” source.
  • ROSS parameters
    • GVT_Interval number of times thru “scheduler” loop before computing GVT.
    • Batch number of local events to process before “check” network for new events.
      • Batch X GVT_Interval events processed per GVT epoch
    • KPs kernel processes that hold the aggregated processed event lists for LPs to lower search overheads for fossil collection of “old” events.
    • Send/Recv Buffers – number of network events for “sending” or “recv’ing”. Used as a flow control mechanism.
slide18

7.5 billion ev/sec for 10% remote on 32,768 cores!!

2.7 billion ev/sec for 100% remote on 32,768 cores!!

Stable performance across processor configurations attributed to near noiseless OS…

slide20

12.27 billion ev/sec for 10% remote on 65,536 cores!!

4 billion ev/sec for 100% remote on 65,536 cores!!

slide23

Model a 10 PF Supercomputer (revisited)

  • Suppose we want to model a 10 PF supercomputer at the MPI message level
  • How long excute parallel DES model?
  • 16 trillion events @ 10 billion ev/sec

~27 mins

observations
Observations…
  • ROSS on Blue Gene indicates billion-events per second model are feasible today!
    • Yields significant TIME COMPRESSION of current models..
  • LP to PE mapping less of a concern…
    • Past systems where very sensitive to this
  • ~90 TF systems can yield “Giga-scale” event rates.
  • Tera-event models require teraflop systems.
    • Assumes most of event processing time is spent in event-list management (splay tree enqueue/dequeue).
  • Potential: 10 PF supercomputers will be able to model near peta-event systems
    • 100 trillion to 1 quadrillion events in less than 1.4 to 14 hours
    • Current “testbed” emulators don’t come close to this for Network Modeling and Simulation..
future models enabled by x scale computing
Future Models Enabled by X-Scale Computing
  • Discrete “transistor” level models for whole multi-core architectures…
    • Potential for more rapid improvements in processor technology…
  • Model nearly whole U.S. Internet at packet level…
    • Potential to radically improve overall QoS for all
  • Model all C4I network/systems for a whole theatre of war faster than real-time many time over..
    • Enables the real-time“active” network control..
slide26

Future Models Enabled by X-Scale Computing

  • Realistic discrete model the human brain
    • 100 billion neurons w/ 100 trillion synapes (e.g. connections – huge fan-out)
    • Potential for several exa-events per run
  • Detailed “discrete” agent-based model for every human on the earth for..
    • Global economic modeling
    • pandemic flu/disease modeling
    • food / water / energy usage modeling…

But to get there investments must be made in code that are COMPLETELY parallel from start to finish!!

thank you
Thank you!!
  • Additional Acknowledgments
    • David Bauer – HPTi
    • David Jefferson – LLNL for helping us get discretionary access to “Intrepid” @ ALCF
    • Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL) and Adam Todorski (CCNI)
  • ROSS Sponsers
    • NSF PetaApps, NeTS & CAREER programs
    • ALFC/ANL
ad