
A Practical Efficiency Criterion For The Null Message Algorithm



  1. A Practical Efficiency Criterion For The Null Message Algorithm András Varga (1), Y. Ahmet Şekerciuğlu (2), Gregory K. Egan (2) (1) Omnest Global, Inc. (2) CTIE, Monash University, Melbourne, Australia

  2. Parallel Simulation • Research on telecommunication networks extensively uses simulation • Several problems require large-scale simulations that cannot be done on any single computer (scalable routing, scalable multicasting, node mobility, etc.) • Parallel distributed simulation may be the only solution • makes use of cluster computing • it may allow for running very large models (by distributing memory requirements) • it may also produce speedup • Some parallel and distributed network simulation software projects: SSFNet, Parsec/Qualnet, OMNeT++

  3. Conservative PDES Algorithms • We consider conservative PDES algorithms, specifically the Null Message Algorithm (NMA; Chandy-Misra-Bryant, 1979) • In parallel simulation, the model is partitioned into several logical processes (LPs), simulated on different CPUs [Figure: a model partitioned into LP1 (on CPU1), LP2 (on CPU2) and LP3 (on CPU3)] • Basic problem in PDES: if one of the LPs runs ahead, it might receive an event in its past from other LPs => this must not happen! • NMA's solution: LPs constantly keep their neighbors updated (via null messages) on how far they can safely proceed into the future (lookahead) – the lookahead is extracted from the model (i.e. link delays); see the sketch below
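For readers who prefer code, here is a minimal, self-contained sketch of the null-message idea with just two LPs, run in a single thread purely for illustration. The class, its fields, and the numeric values are assumptions made for this example; they are not the OMNeT++ or MPI API.

```python
import heapq

LOOKAHEAD = 2.0   # link delay extracted from the model (simsecs); assumed value
END_TIME = 10.0   # simulated seconds to run; assumed value

class LP:
    """One logical process with a single link to a peer LP."""
    def __init__(self, name):
        self.name = name
        self.clock = 0.0   # local simulation time
        self.events = []   # future event list (min-heap of timestamps)
        self.eit = 0.0     # earliest input time promised by the peer so far
        self.peer = None

    def send_null(self):
        # Null message: "I will send nothing with a timestamp below clock + lookahead."
        self.peer.eit = max(self.peer.eit, self.clock + LOOKAHEAD)

    def process_safe_events(self):
        progressed = False
        # Only events with timestamp <= EIT are provably safe to process.
        while self.events and self.events[0] <= min(self.eit, END_TIME):
            self.clock = heapq.heappop(self.events)
            # Toy workload: each event schedules one event at the peer, which
            # cannot arrive earlier than clock + link delay (= the lookahead).
            heapq.heappush(self.peer.events, self.clock + LOOKAHEAD)
            progressed = True
        return progressed

lp1, lp2 = LP("LP1"), LP("LP2")
lp1.peer, lp2.peer = lp2, lp1
heapq.heappush(lp1.events, 0.5)          # seed one initial event
while True:
    lp1.send_null(); lp2.send_null()     # exchange lookahead promises
    if not (lp1.process_safe_events() | lp2.process_safe_events()):
        break                            # nothing safe left before END_TIME
print(lp1.name, lp1.clock, lp2.name, lp2.clock)
```

Because the lookahead is positive, each round of null messages pushes the earliest input times forward, so the two LPs keep alternating progress instead of deadlocking.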

  4. NMA Performance • Two observations can be made about the performance of the NMA: • if the lookahead is too small, performance is poor because CPUs have to wait for or deal with null messages too often (this is extensively discussed in the literature). • Note that since the lookahead is provided by the model (e.g. as link delays), one cannot do much about it! • if the communication latency between LPs is high, performance is also poor because null messages don't arrive in time and CPUs are forced to block. • This is discussed in only a few papers, but there is significant empirical/anecdotal evidence; e.g. it is well known that shared-memory multiprocessors can produce much better speedups than clusters • other factors, such as throughput, tend to have a less dramatic effect on performance

  5. Performance Criteria • The present work shows that lookahead and communication latency are related and can be traded off against each other, and it provides quantitative criteria to determine how large a lookahead and how small a communication latency are needed for satisfactory NMA performance. • Earlier works considered lookahead and communication latency independently, and were not able to quantify the requirements on them.

  6. Input Variables • In addition to the lookahead L and the communication latency τ, we will use three quantities that can be easily measured by any simulator: • E: event density, the number of events per simulated second (does not depend on hardware!) • P: performance, the number of events processed per second • R: relative speed, the simulation time (simsecs) advanced per second; R = P/E • In large simulation models, P, E and R usually stay relatively constant (little fluctuation) [Figure: the OMNeT++ gauge bar displaying these quantities]
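As a quick sanity check of these definitions, using the round numbers that appear later on slide 16 (quoted here only for illustration):

```latex
R \;=\; \frac{P}{E}
  \;=\; \frac{100{,}000\ \text{ev/sec}}{5{,}000\ \text{ev/simsec}}
  \;=\; 20\ \text{simsec/sec}
```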

  7. Performance of NMA • The Null Message Algorithm may perform poorly* for two main reasons: (a) too frequent null messages • performance is consumed by sending and receiving null messages instead of processing events • cause: poor lookahead compared to the event density (b) null messages take too long to arrive and CPUs run out of work • result: CPUs spend too much time waiting for null messages instead of processing events • cause: too tight coupling of the LPs, as a consequence of too little workload per LP, combined with poor lookahead and/or long communication latencies and/or poor load balancing • We will examine and quantify both (a) and (b) * we use the Ideal Simulation Protocol (ISP; Bagrodia et al, 2000) as the baseline for comparison

  8. Performance Criterion (a) • (a) if the lookahead L is too small compared to the event density E (ev/simsec), several rounds of null messages will be needed to advance simulation time over otherwise event-free periods • example: in a queuing system with few jobs, events may occur only about every 3 simulated seconds, but if the lookahead is only 0.5 sec --> at least 6 rounds of null messages are needed to advance to the next event! (worked out below) • condition for good performance: L >> 1/E • Note: L and E are properties of the model --> some models will always perform poorly under NMA, no matter the hardware • remedy: use an alternative synchronization method, e.g. the Conditional Event Algorithm (which relies on calculating the global virtual time)
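The queuing example above, worked out; the 3 simsec event spacing and the 0.5 simsec lookahead are the slide's own numbers:

```latex
E \approx \tfrac{1}{3}\ \text{ev/simsec}
\;\Rightarrow\; \tfrac{1}{E} \approx 3\ \text{simsec} \;\gg\; L = 0.5\ \text{simsec},
\qquad
\left\lceil \frac{1/E}{L} \right\rceil
  = \left\lceil \frac{3}{0.5} \right\rceil
  = 6 \ \text{null-message rounds per event}
```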

  9. Performance Criterion (b) • (b) null messages should reach the target LP before it runs out of work (i.e. before it finishes processing the L amount of simulation time the previous null message allowed it to advance) • condition*: τ < L/R • more convenient form (using R = P/E): τP < LE • the formula can be generalized to any number of LPs (in a nutshell, τP < LE must hold for every LP pair) * the paper gives a more detailed derivation of this formula
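A compact restatement of the argument above for the two-LP case, writing τ for the one-way communication latency (the paper's derivation is more detailed):

```latex
% wall-clock time needed to consume the granted L simsecs of work: L / R
% the next null message must arrive before that work runs out:
\tau < \frac{L}{R}
% substituting R = P/E:
\tau < \frac{L}{P/E} = \frac{LE}{P}
\quad\Longleftrightarrow\quad
\tau P < LE
```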

  10. Interpreting the Criterion τP < LE • LE shows the model's potential for efficient PDES under NMA: • "how many events does the lookahead cover" • its value characterizes the model: • small values (say LE < 1) mean poor lookahead -- stronger, shared-memory-type hardware is needed • large values (say LE > 1) mean good lookahead -- the model may execute well on cluster-type hardware, too • Note that a small lookahead L can be compensated by a large event density E! • τP depends only on properties of the hardware and the simulation environment: • "how many events are processed during τ time" • its value characterizes the hardware: • small values (say τP < 1) indicate fast communication compared to processing speed: shared-memory-type hardware • large values (say τP > 1) indicate strong processing power and slower communication: cluster-type hardware • e.g. Linux cluster with MPI and switched Fast Ethernet: τP = 22 μs × 100,000 ev/sec = 2.2
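The cluster figure quoted above, written out with units (seconds times events per second gives a plain event count):

```latex
\tau P \;=\; 22\,\mu\text{s} \times 100{,}000\ \text{ev/sec}
       \;=\; 22\times10^{-6}\ \text{s} \times 10^{5}\ \text{ev/s}
       \;=\; 2.2\ \text{events}
```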

  11. The λ Coupling Factor • τP < LE only guarantees that there is no blocking if simulation time advances completely evenly (i.e. the relative speed R = P/E is constant) • if τP ≈ LE, fluctuations in P and E will lead to frequent blocking because the LPs are too tightly coupled, resulting in lost performance • Let us introduce the "coupling factor" λ = LE/(τP) • for potentially good performance, λ should be sufficiently large • the actual required λ value depends on how much P and E fluctuate, but experiments show that λ < 10 is usually too small, and λ > 100 is almost always enough (see the sketch below)
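A minimal helper for applying these thresholds in practice. The function name, the example numbers and the verdict strings are assumptions made for illustration; only the < 10 and > 100 thresholds come from the slide.

```python
def coupling_factor(L, E, tau, P):
    """lambda = LE / (tau * P); L in simsec, E in ev/simsec, tau in sec, P in ev/sec."""
    return (L * E) / (tau * P)

# Illustrative numbers only: L = 1 simsec, E = 5,000 ev/simsec,
# tau = 22 microseconds (MPI over Fast Ethernet), P = 100,000 ev/sec.
lam = coupling_factor(L=1.0, E=5_000.0, tau=22e-6, P=100_000.0)
if lam > 100:
    verdict = "coupling is loose enough; good NMA performance is possible"
elif lam >= 10:
    verdict = "borderline; depends on how much P and E fluctuate"
else:
    verdict = "too tightly coupled; expect frequent blocking"
print(f"lambda = {lam:.1f}: {verdict}")
```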

  12. Experimental Verification • Model: Closed Queuing Network (CQN) [Figure: the CQN model partitioned across CPU0, CPU1 and CPU2] • L lookahead: delay of the links that leave the switches • CQN was chosen because it allows tuning of n, L and E

  13. CQN Experiments • Several simulation runs: • 16 tandem queues • number of processors: 2, 4, 8, 16 • number of queues per tandem: 5, 10, 20, 50, 100, 200, 500 • link delay (= lookahead): 1, 2, 5, 10, 20, 50, 100, 500 (simsecs) • Simulation environment: • 8-PC Linux cluster; switched Fast Ethernet; LAM-MPI communications library; OMNeT++ • sequential performance of the CQN model is Pseq ~ 120,000 ev/sec; MPI latency is ~22 μs • Measured: • P performance (ev/sec) and E event density (ev/simsec), per processor and in total • performance under the Ideal Simulation Protocol (ISP; Bagrodia et al, 2000)

  14. Experimental Results [Chart] Efficiency (speedup compared to the maximum achievable speedup*) vs. the λ = LE/(τP) coupling factor * speedup under the Ideal Simulation Protocol (ISP)

  15. Increasing the Number of Processors • How does NMA scale with the number of processors used? • as we partition the model, the per-LP event density diminishes: E1 + E2 + ... + En = Eseq, so E ≈ Eseq/n, where Eseq is the event density of the whole model and n is the number of LPs • consequently λn ≈ λseq/n --> with heavy partitioning it is more difficult to achieve potentially good performance (see the example below) • explanation: with heavy partitioning, some processors might not get enough work to do until the null messages arrive
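A quick numerical illustration of this scaling; the value λseq = 200 is assumed purely for the example:

```latex
\lambda_n \approx \frac{\lambda_{\mathrm{seq}}}{n}:
\qquad \lambda_{\mathrm{seq}} = 200
\;\Rightarrow\;
\lambda_{4} \approx 50,\quad
\lambda_{16} \approx 12.5,\quad
\lambda_{64} \approx 3.1
```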

  16. Applying the Results in Practice • How to apply the criteria to find out whether a model has potential for parallel execution with NMA (see the calculator sketch below): • when using OMNeT++: run the sequential simulation under Tkenv, and read the P and E values from the status bar • say P = 100,000 ev/sec and E = 5,000 ev/simsec • measure τ using the provided test programs, or guess an approximate τ value from the hardware components used • e.g. 2GHz PCs with switched Fast Ethernet and MPI: τ ≈ 20 μs • look at the model topology for partitioning hints, and determine the likely lookahead L from the link delays • calculate approximate LE and λ values for different n = #LPs, using E' = E/n --> if good LE and λ values are present, the model has good potential for NMA --> one can get a feel for the parallel potential of the model in minutes, even without a pocket calculator!
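A minimal sketch of this recipe as a calculator script. The P, E and τ values are the slide's own examples; the lookahead L = 0.1 simsec and the choice of processor counts are assumptions made for illustration.

```python
P = 100_000    # ev/sec, read from the OMNeT++ status bar (slide's example)
E_seq = 5_000  # ev/simsec, read from the OMNeT++ status bar (slide's example)
tau = 20e-6    # sec, guessed for 2GHz PCs + switched Fast Ethernet + MPI
L = 0.1        # simsec, assumed lookahead from link delays on the partition boundary

for n in (2, 4, 8, 16):
    E_n = E_seq / n          # event density per LP after partitioning
    LE = L * E_n             # "how many events does the lookahead cover"
    tauP = tau * P           # "how many events are processed during tau"
    lam = LE / tauP          # coupling factor
    print(f"n={n:2d}  LE={LE:7.2f}  tauP={tauP:4.2f}  lambda={lam:6.1f}")
```

With these assumed values, λ drops from about 125 on 2 LPs to about 16 on 16 LPs, i.e. heavy partitioning moves the same model toward the tightly coupled regime, as slide 15 predicts.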

  17. Ongoing Work • Further experimental verification of the criteria: • larger numbers of LPs, different hardware architectures, different models • Probabilistic approach: quantify the fluctuations in P and E, and relate them to the λ coupling factor
