
Christopher D. Carothers

Department of Computer Science

Rensselaer Polytechnic Institute

[email protected]

ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers


Outline

  • Motivation for PDES

  • Overview of HPC Platforms

  • ROSS Implementation

  • Performance Results

  • Summary


Motivation

Why Parallel Discrete-Event Simulation (DES)?

  • Large-scale systems are difficult to understand

  • Analytical models are often constrained

    Parallel DES offers:

  • Dramatically shorter model execution times

  • Prediction of future “what-if” system performance

  • Potential for real-time decision support

    • Minutes instead of days

    • Analysis can be done right away

  • Example models: national air space (NAS), ISP backbones, distributed content caches, next-generation supercomputer systems.


Model a 10 PF Supercomputer

  • Suppose we want to model a 10 PF supercomputer at the MPI message level

  • How long to execute the DES model?

    • 10% flop rate → 1 PF sustained

    • @ 0.2 bytes/sec per flop @ 1% usage → 2 TB/sec

    • @ 1 KB MPI msgs → 2 billion msgs per simulated second

    • @ 8 hops per msg → 16 billion “events” per simulated second

    • @ 1000 simulated seconds → 16 trillion events for the DES model

    • No I/O included!!

    • Nominal sequential DES simulator → 100K events/sec

  • 16 trillion events @ 100K ev/sec → 5+ years!!!

    Need massively parallel simulation to make this tractable
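The final step of the estimate, as a worked equation (assuming roughly 3.15 x 10^7 seconds per year):

    \frac{16 \times 10^{12}\ \text{events}}{10^{5}\ \text{events/sec}} = 1.6 \times 10^{8}\ \text{sec} \approx 5.1\ \text{years}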


Blue Gene/L Layout

  • CCNI “fen”

  • 32K cores/ 16 racks

  • 12 TB / 8 TB usable RAM

  • ~1 PB of disk over GPFS

  • Custom OS kernel


Blue Gene/P Layout

  • ALCF/ANL “Intrepid”

  • 163K cores/ 40 racks

  • ~80TB RAM

  • ~8 PB of disk over GPFS

  • Custom OS kernel



How to Synchronize Parallel Simulations?

  • Parallel time-stepped simulation: lock-step execution, with a barrier between steps

  • Parallel discrete-event simulation: must allow for sparse, irregular event computations

  • Problem: events arriving in the past (“straggler” events)

  • Solution: Time Warp

[Figure: virtual-time diagrams of PE 1-3 contrasting the two approaches, with processed and “straggler” events marked]


Massively Parallel Discrete-Event Simulation Via Time Warp

  • Local Control Mechanism: error detection and rollback

    • (1) undo state Δ’s

    • (2) cancel “sent” events

  • Global Control Mechanism: compute Global Virtual Time (GVT)

    • collect versions of state/events and perform I/O operations that are < GVT

[Figure: virtual-time diagrams over LP 1-3 showing unprocessed, processed, “straggler”, and “committed” events relative to GVT]


Our Solution: Reverse Computation...

  • Use Reverse Computation (RC)

    • automatically generate reverse code from model source

    • undo by executing reverse code

  • Delivers better performance

    • negligible overhead for forward computation

    • significantly lower memory utilization

[Diagram: Original Code → Compiler → Modified (Forward) Code + Reverse Code]


Ex: Simple Network Switch

Original (on packet arrival, queue of capacity B):

    if (qlen < B) {
        qlen++;
        delays[qlen]++;
    } else {
        lost++;
    }

Forward (instrumented with one branch bit, b1):

    if (qlen < B) {
        b1 = 1;
        qlen++;
        delays[qlen]++;
    } else {
        b1 = 0;
        lost++;
    }

Reverse (undo in the opposite order, using b1 to pick the branch):

    if (b1 == 1) {
        delays[qlen]--;
        qlen--;
    } else {
        lost--;
    }
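As a sanity check, here is a minimal self-contained C program (not from the slides) that runs the forward handler and then the reverse handler, verifying that the switch state is restored exactly:

    #include <assert.h>
    #include <stdio.h>

    #define B 8                     /* queue capacity */

    static int qlen, lost, b1;      /* switch state + branch bit */
    static int delays[B + 1];       /* delays[i]: packets that saw queue length i */

    static void forward(void) {     /* process one packet arrival */
        if (qlen < B) { b1 = 1; qlen++; delays[qlen]++; }
        else          { b1 = 0; lost++; }
    }

    static void reverse(void) {     /* undo one packet arrival */
        if (b1 == 1)  { delays[qlen]--; qlen--; }
        else          { lost--; }
    }

    int main(void) {
        qlen = 3;                   /* arbitrary starting state */
        forward();
        assert(qlen == 4 && delays[4] == 1);
        reverse();                  /* rollback */
        assert(qlen == 3 && delays[4] == 0 && lost == 0);
        printf("state restored after rollback\n");
        return 0;
    }

In a real Time Warp run one bit per event is needed rather than a single global b1; in ROSS that bit lives in the per-event bitfield so each event can be rolled back independently.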


Beneficial Application Properties

  1. Majority of operations are constructive

    • e.g., ++, --, etc.

  2. Size of control state < size of data state

    • e.g., size of b1 < size of qlen, sent, lost, etc.

  3. Perfectly reversible high-level operations gleaned from irreversible smaller operations

    • e.g., random number generation
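Property 3 in action: a linear congruential generator step is perfectly reversible because an odd multiplier has a modular inverse, so the RNG can be stepped backward during rollback with no state saving. A minimal sketch, assuming the Numerical Recipes LCG constants (ROSS itself uses a combined LCG, but the principle is the same):

    #include <stdint.h>
    #include <stdio.h>

    /* Forward LCG step: x' = A*x + C (mod 2^32). */
    #define A 1664525u
    #define C 1013904223u

    static uint32_t lcg_next(uint32_t x) { return A * x + C; }

    /* Modular inverse of odd A mod 2^32 via Newton iteration:
       inv = A is already correct mod 8, and each step doubles
       the number of correct low-order bits (3 -> 6 -> 12 -> 24 -> 48). */
    static uint32_t inverse_mod_2_32(uint32_t a) {
        uint32_t inv = a;
        for (int i = 0; i < 4; i++)
            inv *= 2u - a * inv;
        return inv;
    }

    /* Reverse step: x = A^{-1} * (x' - C) (mod 2^32) -- no bytes saved. */
    static uint32_t lcg_prev(uint32_t x) { return inverse_mod_2_32(A) * (x - C); }

    int main(void) {
        uint32_t seed = 42u, x = lcg_next(seed);
        printf("forward: %u, reversed: %u\n", x, lcg_prev(x)); /* reversed == 42 */
        return 0;
    }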


Destructive Assignment...

  • Destructive assignment (DA):

    • examples: x = y; x %= y;

    • requires all modified bytes to be saved

  • Caveat:

    • the reversing technique for DA’s can degenerate to traditional incremental state saving

  • Good news:

    • certain collections of DA’s are perfectly reversible!

    • queueing network models contain collections of easily/perfectly reversible DA’s

      • queue handling (swap, shift, tree insert/delete, … )

      • statistics collection (increment, decrement, …)

      • random number generation (reversible RNGs)
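For instance, the queue-handling “swap” above is a collection of destructive assignments that is nonetheless perfectly reversible, because a swap is its own inverse. A small illustrative sketch (not ROSS code):

    /* Each assignment below is destructive, yet the swap as a whole
       needs no state saving: running it a second time undoes it. */
    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    void forward_handler(int *head, int *tail) { swap(head, tail); }
    void reverse_handler(int *head, int *tail) { swap(head, tail); } /* identical */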



RC Applications

  • PDES applications include:

    • Wireless telephone networks

    • Distributed content caches

    • Large-scale Internet models:

      • TCP over AT&T backbone

      • Leverages RC “swaps”

    • Hodgkin-Huxley neuron models

    • Plasma physics models using PIC

    • POSE (UIUC)

  • Non-DES include:

    • Debugging

    • PISA: reversible instruction-set architecture for low-power computing

    • Quantum computing


Local Control Implementation

Local Control Mechanism: error detection and rollback

  • MPI_Isend/MPI_Irecv are used to send/recv off-core events

  • Event & network memory is managed directly

    • Pool is allocated @ startup

  • Event list is kept sorted using a splay tree (O(log N))

  • LP-to-core mapping tables are computed, not stored, to avoid the need for large global LP maps (see the sketch below)

[Figure: rollback on LP 1-3: (1) undo state Δ’s, (2) cancel “sent” events]
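A minimal sketch of the computed (rather than stored) mapping, assuming a block-linear layout of LPs onto cores; the name g_nlp_per_pe is illustrative, not ROSS’s actual variable:

    #include <stdint.h>

    typedef uint64_t tw_lpid;            /* global LP id, as in ROSS */

    static tw_lpid g_nlp_per_pe = 1024;  /* LPs per core (model configuration) */

    /* Owning core and local offset follow from arithmetic alone, so no
       global table covering every LP ever needs to be stored. */
    static unsigned lp_to_core(tw_lpid gid)     { return (unsigned)(gid / g_nlp_per_pe); }
    static tw_lpid  lp_local_index(tw_lpid gid) { return gid % g_nlp_per_pe; }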


Global Control Implementation

Global Control Mechanism: compute Global Virtual Time (GVT)

GVT (kicks off when memory is low):

  1. Each core counts #sent, #recv

  2. Recv all pending MPI msgs

  3. MPI_Allreduce sum on (#sent - #recv)

  4. If #sent - #recv != 0, goto step 2

  5. Compute the local core’s lower-bound time-stamp (LVT)

  6. GVT = MPI_Allreduce min on the LVTs

    The algorithm needs efficient MPI collectives (a sketch in MPI follows the figure below)

    LC/GC can be very sensitive to OS jitter

[Figure: GVT cut across LP 1-3; versions of state/events that are < GVT are collected and their I/O operations performed]
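A condensed sketch of steps 1-6 in MPI (illustrative, not ROSS’s actual source; drain_pending_msgs and local_min_timestamp are hypothetical helpers):

    #include <mpi.h>

    extern long   g_sent, g_recv;            /* per-core event counters (step 1) */
    extern void   drain_pending_msgs(void);  /* hypothetical: recv all pending MPI msgs */
    extern double local_min_timestamp(void); /* hypothetical: local lower bound (LVT) */

    double compute_gvt(void)
    {
        long diff;
        do {                                 /* steps 2-4: settle in-flight events */
            drain_pending_msgs();
            long local = g_sent - g_recv;
            MPI_Allreduce(&local, &diff, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        } while (diff != 0);                 /* repeat until no events are in flight */

        double lvt = local_min_timestamp();  /* step 5 */
        double gvt;
        MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD); /* step 6 */
        return gvt;
    }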

So, how does this translate into Time Warp performance on BG/L & BG/P?


Performance Results: Setup

  • PHOLD

    • Synthetic benchmark model

    • 1024x1024 grid of LPs

    • Each LP has 10 initial events

    • Events are routed randomly among all LPs based on a configurable “percent remote” parameter (a sketch of a PHOLD-style handler follows this list)

    • Time stamps are exponentially distributed with a mean of 1.0 (i.e., lookahead is 0)

  • TLM – Transmission Line Matrix

    • Discrete electromagnetic wave propagation model

    • Used to model the physical layer of MANETs

    • As accurate as previous “ray tracing” models, but dramatically faster…

    • Considers wave attenuation effects

    • Event population grows cubically outward from the single “radio” source

  • ROSS parameters

    • GVT_Interval: number of times through the “scheduler” loop before computing GVT

    • Batch: number of local events to process before “checking” the network for new events

      • Batch x GVT_Interval events are processed per GVT epoch

    • KPs: kernel processes that hold the aggregated processed-event lists for LPs, lowering search overheads during fossil collection of “old” events

    • Send/Recv Buffers: number of network events for “sending” or “recv’ing”; used as a flow-control mechanism
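To make the PHOLD configuration concrete, here is a sketch of a forward event handler written in the style of the ROSS API (the tw_* names follow ROSS, but treat the exact signatures and the PERCENT_REMOTE/g_total_lps names as assumptions for illustration):

    #include "ross.h"   /* ROSS types: tw_lp, tw_bf, tw_lpid, tw_stime, ... */

    #define PERCENT_REMOTE 0.10        /* configurable "percent remote" */
    #define MEAN_DELAY     1.0         /* exponential mean => zero lookahead */

    extern tw_lpid g_total_lps;        /* 1024 x 1024 grid of LPs */

    typedef struct { int dummy; } phold_state;
    typedef struct { int dummy; } phold_msg;

    void phold_event_handler(phold_state *s, tw_bf *bf, phold_msg *m, tw_lp *lp)
    {
        /* Route to a random LP with probability PERCENT_REMOTE, else to self. */
        tw_lpid dest = (tw_rand_unif(lp->rng) < PERCENT_REMOTE)
                     ? (tw_lpid) tw_rand_integer(lp->rng, 0, g_total_lps - 1)
                     : lp->gid;

        /* Exponentially distributed time-stamp offset, mean 1.0. */
        tw_stime offset = tw_rand_exponential(lp->rng, MEAN_DELAY);
        tw_event_send(tw_event_new(dest, offset, lp));
    }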


7.5 billion ev/sec for 10% remote on 32,768 cores!!

2.7 billion ev/sec for 100% remote on 32,768 cores!!

Stable performance across processor configurations attributed to near noiseless OS…



12.27 billion ev/sec for 10% remote on 65,536 cores!!

4 billion ev/sec for 100% remote on 65,536 cores!!


Rollback Efficiency = 1 - E_rb / E_net (rolled-back events E_rb as a fraction of net committed events E_net)


Model a 10 PF Supercomputer (revisited)

  • Suppose we want to model a 10 PF supercomputer at the MPI message level

  • How long to execute the parallel DES model?

  • 16 trillion events @ 10 billion ev/sec → ~27 mins
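The same worked equation at the measured parallel rate:

    \frac{16 \times 10^{12}\ \text{events}}{10^{10}\ \text{events/sec}} = 1600\ \text{sec} \approx 27\ \text{minutes}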


Observations…

  • ROSS on Blue Gene indicates billion-events-per-second models are feasible today!

    • Yields significant TIME COMPRESSION of current models…

  • LP-to-PE mapping is less of a concern…

    • Past systems were very sensitive to this

  • ~90 TF systems can yield “Giga-scale” event rates.

  • Tera-event models require teraflop systems.

    • Assumes most of event processing time is spent in event-list management (splay tree enqueue/dequeue).

  • Potential: 10 PF supercomputers will be able to model near peta-event systems

    • 100 trillion to 1 quadrillion events in less than 1.4 to 14 hours

    • Current “testbed” emulators don’t come close to this for Network Modeling and Simulation..


Future Models Enabled by X-Scale Computing

  • Discrete “transistor”-level models of whole multi-core architectures…

    • Potential for more rapid improvements in processor technology…

  • Model nearly the whole U.S. Internet at the packet level…

    • Potential to radically improve overall QoS for all

  • Model all C4I networks/systems for a whole theatre of war faster than real time, many times over…

    • Enables real-time “active” network control…


Future Models Enabled by X-Scale Computing (cont.)

  • Realistic discrete model of the human brain

    • 100 billion neurons w/ 100 trillion synapses (i.e., connections, with huge fan-out)

    • Potential for several exa-events per run

  • Detailed “discrete” agent-based model of every human on Earth, for…

    • Global economic modeling

    • Pandemic flu/disease modeling

    • Food / water / energy usage modeling…

But to get there, investments must be made in codes that are COMPLETELY parallel from start to finish!!


Thank You!!

  • Additional Acknowledgments

    • David Bauer – HPTi

    • David Jefferson – LLNL for helping us get discretionary access to “Intrepid” @ ALCF

    • Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL) and Adam Todorski (CCNI)

  • ROSS Sponsors

    • NSF PetaApps, NeTS & CAREER programs

    • ALCF/ANL

