
Efficient Optimistic Parallel Simulations Using Reverse Computation

Chris Carothers

Department of Computer Science

Rensselaer Polytechnic Institute

Kalyan Perumalla

and

Richard M. Fujimoto

College of Computing

Georgia Institute of Technology


Why Parallel/Distributed Simulation?

  • Goal: speed up discrete-event simulation programs using multiple processors

  • Enabling technology for…

    • intractable simulation models → tractable

    • off-line decision aids → on-line aids for time-critical situation analysis

  • DPAT: A distributed simulation success story

    • simulation model of the National Airspace System

    • developed @ MITRE using Georgia Tech Time Warp (GTW)

    • simulates 50,000 flights in < 1 minute; this used to take 1.5 hours

    • web based user-interface

    • to be used in the FAA Command Center for on-line “what if” planning

  • Parallel/distributed simulation has the potential to improve how “what if” planning strategies are evaluated


How to Synchronize Distributed Simulations?

parallel time-stepped simulation: lock-step execution (barrier between time steps)

parallel discrete-event simulation: must allow for sparse, irregular event computations

Problem: events arriving in the past

Solution: Time Warp

[Figure: two virtual-time diagrams over PE 1, PE 2, and PE 3, contrasting barrier-synchronized lock-step execution with processed and "straggler" events]

Time Warp...

Local Control Mechanism: error detection and rollback

  (1) undo state Δ's

  (2) cancel "sent" events

Global Control Mechanism: compute Global Virtual Time (GVT)

  collect versions of state/events and perform I/O operations that are < GVT

[Figure: virtual-time diagrams over LP 1, LP 2, and LP 3 showing unprocessed, processed, "straggler", and "committed" events relative to GVT]
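To make the local control mechanism concrete, here is a minimal sketch in C of one LP's optimistic event loop (our illustration with hypothetical names like lp_deliver and rollback_event, not the GTW implementation): events execute speculatively, and a straggler with a timestamp in the LP's past forces the most recently processed events to be undone first.

    #include <stdio.h>

    typedef struct { double ts; } event_t;   /* timestamp only, for brevity */

    #define MAX_PROCESSED 1024
    static event_t processed[MAX_PROCESSED]; /* processed events, oldest first */
    static int n_processed = 0;
    static double lp_now = 0.0;              /* this LP's local virtual time */

    static void execute_event(event_t *ev)  { (void)ev; /* forward model code here */ }
    static void rollback_event(event_t *ev) { (void)ev; /* undo state, cancel sent events */ }

    /* Deliver one event to the LP, rolling back first if it is a straggler. */
    static void lp_deliver(event_t ev) {
        if (ev.ts < lp_now) {
            /* roll back, newest first, everything past the straggler; a real
               system would also reinsert these events into the pending queue */
            while (n_processed > 0 && processed[n_processed - 1].ts > ev.ts) {
                n_processed--;
                rollback_event(&processed[n_processed]);
            }
        }
        execute_event(&ev);                  /* optimistic (speculative) execution */
        processed[n_processed++] = ev;
        lp_now = ev.ts;
    }

    int main(void) {
        event_t a = {1.0}, b = {3.0}, straggler = {2.0};
        lp_deliver(a); lp_deliver(b);
        lp_deliver(straggler);               /* forces rollback of the ts=3.0 event */
        printf("lp_now = %g, processed = %d\n", lp_now, n_processed);
        return 0;
    }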


Time Warp

[Figure: multiple processors (P) connected by shared memory or a high-speed network]

Challenge: Efficient Implementation?

  • Advantages:

  • automatically finds available parallelism

  • makes development easier

  • outperforms conservative schemes by a factor of N

  • Disadvantages:

  • Large memory requirements to support rollback operation

  • State-saving incurs high overheads for fine-grain event computations

  • Time Warp is outside the "performance envelope" for many applications

Our Solution: Reverse Computation


Outline...

  • Reverse Computation

    • Example: ATM Multiplexor

    • Beneficial Application Properties

    • Rules for Automation

    • Reversible Random Number Generator

  • Experimental Results

  • Conclusions

  • Future Work


Our Solution: Reverse Computation...

[Figure: a compiler transforms the Original Code into Modified (forward) Code plus Reverse Code]

  • Use Reverse Computation (RC)

    • automatically generate reverse code from model source

    • undo by executing reverse code

  • Delivers better performance

    • negligible overhead for forward computation

    • significantly lower memory utilization


Example: ATM Multiplexor

[Figure: N input links feeding a multiplexor with buffer capacity B]

Original (on cell arrival):

    if (qlen < B) {
        qlen++;
        delays[qlen]++;
    } else {
        lost++;
    }

Forward Computation (instrumented):

    if (qlen < B) {
        b1 = 1;
        qlen++;
        delays[qlen]++;
    } else {
        b1 = 0;
        lost++;
    }

Reverse:

    if (b1 == 1) {
        delays[qlen]--;
        qlen--;
    } else {
        lost--;
    }
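A self-contained sketch of the same example (our illustration; the variable names b1, qlen, lost, and delays follow the slide) shows how the single control bit b1 replaces copying qlen, lost, and delays[] before each event:

    #include <stdio.h>

    #define B 100                 /* buffer capacity */
    static int qlen = 0, lost = 0;
    static int delays[B + 1];     /* delays[i]: cells that arrived at queue length i */

    /* forward event: returns the 1 control bit that makes it reversible */
    static int arrival_forward(void) {
        int b1;
        if (qlen < B) { b1 = 1; qlen++; delays[qlen]++; }
        else          { b1 = 0; lost++; }
        return b1;
    }

    /* reverse event: consumes the saved control bit */
    static void arrival_reverse(int b1) {
        if (b1 == 1) { delays[qlen]--; qlen--; }
        else         { lost--; }
    }

    int main(void) {
        int b1 = arrival_forward();          /* speculative execution */
        printf("after forward: qlen=%d lost=%d\n", qlen, lost);
        arrival_reverse(b1);                 /* rollback */
        printf("after reverse: qlen=%d lost=%d\n", qlen, lost);
        return 0;
    }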


Gains…

  • State size reduction

    • from B+2 words to 1 word

    • e.g. B=100 => 100x reduction!

  • Negligible overhead in forward computation

    • state-saving work is removed from the forward computation

    • and moved to the rollback (reverse) phase

  • Result

    • significant increase in speed

    • significant decrease in memory

  • How?...


Beneficial Application Properties

1. Majority of operations are constructive

  • e.g., ++, --, etc.

2. Size of control state < size of data state

  • e.g., size of b1 < size of qlen, sent, lost, etc.

3. Perfectly reversible high-level operations can be gleaned from irreversible smaller operations

  • e.g., random number generation


Rules for Automation...

[Table: generation rules and upper bounds on bit requirements for various statement types]
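The table itself did not survive the transcript. To convey the flavor of such rules (our reconstruction, hand-written here, not the paper's exact table): an if/else needs 1 bit to record the branch taken, and a loop needs enough bits to record its iteration count.

    /* Illustration of two generation rules (hypothetical variable names;
       the compiler emits instrumentation of this shape automatically). */
    int x = 3, y = 0, b = 0, n = 0;

    /* Rule: an if/else needs 1 bit recording the branch taken. */
    if (x > 0) { b = 1; x--; }          /* forward */
    else       { b = 0; y--; }

    if (b == 1) { x++; }                /* reverse */
    else        { y++; }

    /* Rule: a loop needs enough bits for its iteration count. */
    n = 0;
    while (x > 0) { x--; y++; n++; }    /* forward: count iterations */

    while (n > 0) { n--; y--; x++; }    /* reverse: undo n iterations */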


Destructive Assignment...

  • Destructive assignment (DA):

    • examples: x = y; x %= y;

    • requires all modified bytes to be saved

  • Caveat:

    • the reversing technique for DA's can degenerate to traditional incremental state saving

  • Good news:

    • certain collections of DA’s are perfectly reversible!

    • queueing network models contain collections of easily/perfectly reversible DA’s

      • queue handling (swap, shift, tree insert/delete, … )

      • statistics collection (increment, decrement, …)

      • random number generation (reversible RNGs)
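As a sketch of one such perfectly reversible collection (our example, in the spirit of the queue-handling operations listed above): a swap is built from destructive assignments, yet as a unit it is its own inverse, so no bytes need to be saved.

    /* A swap is composed of destructive assignments, but as a whole it is
       perfectly reversible: applying it again undoes it. */
    static void swap(int *a, int *b) {
        int t = *a;   /* destructive, viewed in isolation */
        *a = *b;
        *b = t;
    }

    /* forward: swap(&x, &y);   reverse: swap(&x, &y);   -- zero saved state */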


Reversing an RNG?

double RNGGenVal(Generator g)
{
    long k, s;
    double u;

    u = 0.0;

    s = Cg[0][g];
    k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];
    k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];
    k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];
    k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return (u);
}

Observation: k = s / 46693 is a destructive assignment. Result: RC degrades to classic state saving… can we do better?


RNGs: A Higher Level View

The previous RNG is based on the following recurrence:

x_{i,n} = a_i · x_{i,n-1} mod m_i

where x_{i,n} is one of the four seed values in the n-th set, m_i is one of the four largest primes less than 2^31, and a_i is a primitive root of m_i.

Now, the above recurrence is in fact reversible: since m_i is prime, the inverse of a_i modulo m_i is defined,

b_i = a_i^{m_i - 2} mod m_i

Using b_i, we can generate the reverse recurrence as follows:

x_{i,n-1} = b_i · x_{i,n} mod m_i
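A minimal sketch of this reversal in C (our illustration; the constants are the multiplier and modulus of the first component in RNGGenVal above, whose k = s / 46693 lines implement Schrage's algorithm for a·s mod m without overflow). Here we use 64-bit arithmetic for brevity and compute b = a^(m-2) mod m once by square-and-multiply:

    #include <stdint.h>
    #include <stdio.h>

    /* modular exponentiation; valid for m < 2^31 (products fit in 64 bits) */
    static int64_t powmod(int64_t a, int64_t e, int64_t m) {
        int64_t r = 1;
        a %= m;
        while (e > 0) {
            if (e & 1) r = (r * a) % m;
            a = (a * a) % m;
            e >>= 1;
        }
        return r;
    }

    int main(void) {
        const int64_t a = 45991, m = 2147483647;   /* first MLCG component above */
        int64_t b = powmod(a, m - 2, m);           /* b = a^(m-2) mod m = a^{-1}, m prime */

        int64_t s0 = 12345;                        /* some seed */
        int64_t s1 = (a * s0) % m;                 /* forward step of the recurrence */
        int64_t back = (b * s1) % m;               /* reverse step: recovers s0 */

        printf("s0=%lld forward=%lld reversed=%lld\n",
               (long long)s0, (long long)s1, (long long)back);
        return 0;
    }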


Reverse Code Efficiency...

  • Future RNGs may result in even greater savings.

    • Consider the MT19937 Generator...

    • Has a period of 2^19937 − 1

    • Uses 2496 bytes for a single “generator”

  • Property...

    • Non-reversibility of individual steps does not imply that the computation as a whole is not reversible.

    • Can we automatically find this “higher-level” reversibility?

  • Other Reversible Structures Include...

    • Circular shift operation

    • Insertion & deletion operations on trees (i.e., priority queues).

Reverse computation is well-suited for queuing network models!


Performance Study

Platform:

  • SGI Origin 2000, 16 processors (R10000), 4 GB RAM

Model:

  • 3 levels of multiplexors, fan-in N

  • N^3 sources ⇒ N^3 + N^2 + N + 1 entities in total

  • e.g., N=4 ⇒ 85 entities; N=64 ⇒ 266,305 entities


Why the large increase in parallel performance?

[Chart: throughput in million events/second]


Cache Performance...

Misses/faults at 12 PEs, state saving (SS) vs. reverse computation (RC):

          TLB faults     Primary cache     Secondary cache
  SS      43,966,018     1,283,032,615     162,449,694
  RC      11,595,326     590,555,715       94,771,426


Related Work...

  • Reverse computation used in

    • low power processors, debugging, garbage collection, database recovery, reliability, etc.

  • All previous work either

    • prohibits irreversible constructs, or

    • uses a copy-on-write implementation for every modification (corresponding to incremental state saving)

  • Many operate at a coarse, virtual-page level


Contributions

We identify that

  • RC makes Time Warp usable for fine-grain models!

    • disproved the previous belief that "fine-grain models can't be optimistically simulated efficiently"

    • less memory consumption, more speed, without extra user effort

  • RC generalizes state saving

    • e.g., incremental state saving, copy state saving

  • For certain data types, RC is more memory efficient than SS

    • e.g., priority queues


Future Work

  • Develop state minimization algorithms, by

    • State compression: bit size for reversibility < bit size of data variables

    • State reuse: same state bits for different statements

      • based on liveness, analogous to register allocation

  • Complete the RC automation algorithm design, avoiding the straightforward incremental state-saving approach, for:

    • Lossy integer and floating point arithmetic (see the sketch after this list)

    • Jump statements

    • Recursive functions
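To illustrate why lossy arithmetic needs special treatment (our example, not from the slides): integer division destroys low-order bits, so the generated reverse code must save exactly the bits destroyed rather than the whole variable.

    /* Integer division is lossy: x /= 2 destroys the low bit of x.
       Reversibility requires saving only that one destroyed bit. */
    int x = 7;
    int lost_bit = x & 1;     /* forward instrumentation: save the bit */
    x /= 2;                   /* forward: x == 3 */

    x = 2 * x + lost_bit;     /* reverse: x == 7 again */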


Geronimo! System Architecture

[Figure: a high-performance simulation application running on the Geronimo! distributed compute server: a multiprocessor plus rack-mounted CPUs (not in demonstration), connected by Myrinet]

Geronimo! Features: (1) "risky" or "speculative" processing of object computations, (2) reverse computation to support the "undo" operation, (3) "Active Code", all in a combined, heterogeneous, shared-memory/message-passing environment...


Geronimo!: "Risky" Processing...

  • Execution Framework:

    • Objects schedule Threads / Tasks at some "virtual time"

  • Applications:

    • discrete-event simulations

    • scientific computing applications

[Figure: error detection and rollback in virtual time over Objects 1-3, showing processed, unprocessed, and "straggler" threads; rollback (1) undoes state Δ's and (2) cancels "scheduled" tasks]

CAVEAT: Good performance relies on (cost of recovery × probability of failure) being less than the cost of being "safe"!


Geronimo!: Efficient "Undo"

  • Traditional approach: State Saving

    • save byte-copies of modified items

      • high overhead for fine-granularity computations

      • memory utilization is large

    • need alternative for large-scale, fine-grain simulations

  • Our approach: Reverse Computation

    • automatically generate reverse code from model source

    • utilize reverse code to do rollback

      • negligible overhead for forward computation

      • significantly lower memory utilization

    • joint with Kalyan Perumalla and Richard Fujimoto

Observation: "reverse" computation treats "code" as "state". This results in a code-state duality. Can we generalize this notion?


Geronimo!: Active Code

  • Key idea: allow object methods/code to be dynamically changed during run-time.

    • objects can schedule, at a future virtual time, new methods or redefine old methods of other objects and of themselves.

    • objects can erase/delete methods on themselves or other objects.

    • new methods can contain “Active Code” which can re-specialize itself or other objects.

    • work in a heterogeneous environment.

  • How is this useful?

    • increase performance by allowing the program to consistently “execute the common case fast”.

    • adaptive, perturbation-free, monitoring of distributed systems.

    • potential for increasing a language’s “expressive power”.

  • Our approach?

    • Java…no, need higher performance…maybe used in the future...

    • special compiler…no, can’t keep up with changes to microprocessors.


Geronimo!: Active Code Implementation

  • Runtime infrastructure

    • modifies source code tree

    • starts a rebuild of the executable on another existing machine

    • uses the system's native compiler

  • Re-exec system call

    • reloads only the new text (code) segment of the new executable

    • fixes up the old stack to reflect new code changes

    • fixes up pointers to functions

    • will run in "user space" for portability across platforms

  • Language preprocessor

    • instruments code to support stack and function pointer fix-up

    • instruments code to support stack reconstruction and re-start process
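As a rough illustration of the run-time method re-binding that Active Code requires (our sketch using dlopen on a POSIX system; this is not the re-exec mechanism described above, which replaces the text segment wholesale, and "new_code.so" / "handler_v2" are hypothetical names):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Objects dispatch through a function-pointer slot, so a method can be
       re-bound at run time without touching callers. */
    typedef void (*method_fn)(void *self);

    struct object {
        method_fn handle_event;   /* the "active" method slot */
        /* ... object state ... */
    };

    /* Re-bind a method slot to a symbol in a freshly built shared library. */
    static int rebind(struct object *obj, const char *lib, const char *sym) {
        void *h = dlopen(lib, RTLD_NOW);
        if (!h) { fprintf(stderr, "dlopen: %s\n", dlerror()); return -1; }
        obj->handle_event = (method_fn)dlsym(h, sym);
        return obj->handle_event ? 0 : -1;
    }

    /* usage: rebind(&obj, "./new_code.so", "handler_v2"); obj.handle_event(&obj); */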


Research Issues

  • Software architecture for the heterogeneous, shared-memory, message passing environment.

  • Development of distributed algorithms that are fully optimized for this “combination” environment.

  • What language to use for development, C or C++ or both?

  • Geronimo! API.

  • Active Code Language and Systems Support.

  • Mapping relevant application types to this framework

Homework Problem: Can you find specific applications/problems where we can apply Geronimo!?

