
Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations

  1. Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations Qi (Jacky) Liu and Gabriel Wainer Department of Systems and Computer Engineering Carleton University Ottawa, Canada

  2. Outline Motivation & Background Fine-Grained Event Parallelism Event Processing Kernel Parallel DEVS Simulation on Cell Experimental Results Conclusion & Future Work

  3. Motivation • Accelerate general-purpose DEVS-based simulations on heterogeneous CMP architectures like the Cell processor • Develop new parallelization strategies based on fine-grained event-level parallelism inherent in the simulation process • Exploit multi-grained parallelism simultaneously at different levels of the system • Allow general users to gain performance transparently w/o being distracted by multicore programming details • Provide some generalizable methods & insight for PDES on emerging CMP architectures

  4. Cell Processor Overview • Nine-core heterogeneous CMP with two distinct ISAs • Software-managed LS with explicitly-addressed DMA transfer • Low-latency EIB channels – 32-bit mailbox & signal messages

  5. Parallel DEVS (P-DEVS) Formalism • Discrete-EVent System Specification (DEVS) • Cell-DEVS Formalism

  6. Layered View of M&S

  7. Structured Simulation Process • Parallel Simulation with CD++ • Flat LP Structure • (I) → LP and model init. • (@) → model output • (*) → model state trans. • (D) → model sync. • (X) → model input data • (Y) → model output data
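The (I)/(@)/(*) message flow above can be sketched as a minimal coordinator loop; `Model`, `run_step`, and their members are illustrative names only, not the CD++ API:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Illustrative P-DEVS atomic model: (@) output and (*) internal transition.
struct Model {
    double next_event;   // virtual time of the next internal event
    double state = 0.0;
    double output() const { return state; }                     // (@) -> (Y)
    void transition(double now) { state += 1.0; next_event = now + 1.0; } // (*)
};

// One coordinator step of the flat LP structure: advance to the minimum
// next-event time, then send (@) followed by (*) to every imminent model.
double run_step(std::vector<Model>& models) {
    double t = std::numeric_limits<double>::infinity();
    for (const auto& m : models) t = std::min(t, m.next_event);
    for (auto& m : models)
        if (m.next_event == t) { m.output(); m.transition(t); }
    return t;  // the virtual time that was just processed
}
```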

  8. Fine-Grained Event Parallelism • Event-embarrassing parallelism • Independent events within a step • Executed in an arbitrary order • Event-streaming parallelism • Causally-related events between consecutive steps • Executed in a pipelined fashion • Phase-changing events • Exchanged between NC & FC • Natural fork & join points • Data-flow oriented parallelization
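A minimal sketch of event-embarrassing parallelism, assuming the events within one step are fully independent: the batch is split across two threads and executed concurrently in arbitrary order (`process` is a stand-in for a compute-intensive (*) transition):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

int process(int e) { return e * e; }  // stand-in for a costly (*) transition

// Events within one step are independent, so their order does not matter:
// split the batch and run the two halves on two threads concurrently.
std::vector<int> run_batch(const std::vector<int>& events) {
    std::vector<int> out(events.size());
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) out[i] = process(events[i]);
    };
    std::size_t mid = events.size() / 2;
    std::thread t(worker, std::size_t{0}, mid);  // first half on a worker thread
    worker(mid, events.size());                  // second half on this thread
    t.join();
    return out;
}
```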

  9. Event Processing Kernel (SEK) • Hydrological Watershed Simulation • 320×320×2 with 204,800 Simulators • Compute-intensive state transitions • Over 300 million events across 663 phases • Cell-DEVS model defined in CD++ spec. lang. • Simulation Profile on the PPE • Concurrent exec. across SPEs – 98.02% (event-embarrassing parallelism) • Pipelined exec. between PPE & SPEs – 1.15% (event-streaming parallelism)

  10. Parallel DEVS Simulation on Cell – Overview • THREAD PARALLELISM • COMPUTE-I/O PARALLELISM • EVENT-EMBARRASSING PARALLELISM • EVENT-STREAMING PARALLELISM (TWO-STAGE PIPELINE) • DATA-STREAMING PARALLELISM (DOUBLE-BUFFERED DMA AT THREE LAYERS) • VECTOR PARALLELISM (SPE SIMD)
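The double-buffered DMA idea can be sketched in plain C++: while one buffer is being computed on, the next chunk is already being brought in. Here an ordinary copy stands in for the asynchronous `mfc_get` transfer an SPE would issue against its local store:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t CHUNK = 4;  // local-store tile size (illustrative)

// Double buffering: compute on buf[cur] while buf[nxt] is being filled
// with the next chunk from main memory (on the Cell, an async DMA).
double sum_stream(const std::vector<double>& main_mem) {
    double buf[2][CHUNK];
    double total = 0.0;
    std::size_t nchunks = main_mem.size() / CHUNK;
    if (nchunks == 0) return 0.0;
    std::memcpy(buf[0], main_mem.data(), CHUNK * sizeof(double)); // prefetch chunk 0
    for (std::size_t c = 0; c < nchunks; ++c) {
        std::size_t cur = c % 2, nxt = 1 - cur;
        if (c + 1 < nchunks)  // start the "transfer" of the next chunk first
            std::memcpy(buf[nxt], main_mem.data() + (c + 1) * CHUNK,
                        CHUNK * sizeof(double));
        for (std::size_t i = 0; i < CHUNK; ++i) total += buf[cur][i]; // compute
    }
    return total;
}
```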

  11. Parallel DEVS Simulation on Cell – LP Virtualization • Purpose • Map active Simulators to a limited group of SPE threads • Fit into the small on-chip LS • Assign each SPE a reusable task operating on a stream of data • Facilitate fine-grained dynamic load-balancing between SPEs • Solution • Turn Simulators (and associated atomic models) into virtual LPs • Separate event-processing logic (wrapped in SPE threads) from state data (maintained in main memory buffers) • Match the states of active Simulators to available SPE threads dynamically at each virtual time – SEK job scheduling
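A toy sketch of the LP-virtualization idea, with illustrative names (`SimState`, `transition_kernel`): event-processing logic is a single reusable kernel, while per-simulator state lives in main-memory buffers, and only the active subset is streamed through the kernel at each virtual time:

```cpp
#include <vector>

// Per-simulator state kept in main-memory buffers, separate from code.
struct SimState { int id; double value; };

// Reusable task: the same transition kernel serves every virtual LP,
// so a few SPE threads can host many thousands of Simulators.
void transition_kernel(SimState& s) { s.value *= 2.0; }

// One virtual-time round: stream the states of the *active* simulators
// through the fixed kernel pool, independent of total model size.
void run_round(std::vector<SimState>& states, const std::vector<int>& active) {
    for (int id : active) transition_kernel(states[id]);
}
```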

  12. Parallel DEVS Simulation on Cell – More Details • Virtual Simulator State Mgmt. • Decentralized Event Mgmt.

  13. Parallel DEVS Simulation on Cell – More Details • Rule Evaluation on SPEs • SEK Job Scheduling

  14. Platform and Configuration • IBM BladeCenter QS22 • 3.2GHz PowerXCell 8i × 2 • 32GB RAM • Red Hat Enterprise Linux 5.2 • IBM SDK for Multicore Acceleration 3.1 • Parallel DEVS simulator on Cell → CD++/Cell • SEK job scheduling policy → round-robin or shortest-queue-first • CD++ event-logging turned off → minimize the impact of file I/O
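The two SEK scheduling policies named above can be sketched as queue-selection functions (hypothetical helper names; queue contents are pending job counts per SPE):

```cpp
#include <algorithm>
#include <vector>

// Round-robin: cycle through SPEs regardless of their current load.
int pick_round_robin(int last, int n_spes) { return (last + 1) % n_spes; }

// Shortest-queue-first: dispatch the next job to the least-loaded SPE,
// giving finer-grained dynamic load balancing at some bookkeeping cost.
int pick_shortest_queue(const std::vector<int>& queue_len) {
    return static_cast<int>(
        std::min_element(queue_len.begin(), queue_len.end()) - queue_len.begin());
}
```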

  15. Total Simulation Time with Watershed Model • Performance gain with just one SPE → 5.84× • OO C++ code on PPE vs. SIMD-aware C code on SPEs • memory latency & cache miss vs. data locality & double-buffered DMA • Low-level optimizations on SPEs (LS data alignment, call stack usage, branch minimization, loop unrolling, in-line substitution, pipelined event execution) • Overall performance with 8 SPEs → 33.06×

  16. Speedups over (PPE with 1 SPE) Version • Speedup grows more slowly as SPEs are added • Higher overhead for SEK job scheduling and orchestration • Increased DMA contention & channel stalls

  17. Conclusion • Formalism-Based Design Methodology • Facilitate model reuse & portability • Reduce validation & verification cost • Performance-Centric Approach • Accelerate event processing for compute-intensive DEVS models • Minimize communication & synchronization overhead • Achieve fine-grained dynamic load balancing • New Parallelization Strategy for PDES • Exploit fine-grained event parallelism from a data-flow perspective • Combine multi-grained parallelism at different system levels • Break LP boundaries with LP virtualization • Insight for PDES on Heterogeneous CMP Architectures • Match workload characteristics to functional specialization of cores • Address data locality, memory latency, & code optimization issues

  18. Future Work • Porting different types of models to Cell → performance testing • Transparency • Minimal knowledge (and learning curve) required from users • Integrating with existing conservative/optimistic approaches • Combine with cluster-level LP-based conservative simulation → using both synchronous & asynchronous algorithms • Combine with cluster-level Time Warp optimistic simulation → using Lightweight Time Warp (DS-RT 2008, PADS 2009) • Testing on large-scale hybrid supercomputers • Using the Cell processor in new ways

  19. Questions? This research was supported in part by the MITACS Accelerate Ontario program, Canada, and by the IBM T. J. Watson Research Center, NY. liuqi@sce.carleton.ca http://www.sce.carleton.ca/~liuqi/ ARS Lab: http://cell-devs.sce.carleton.ca/ars/

  20. Some Applications • Defense & Emergency Planning • Battlefield Simulations • Crowd Behavior & Evacuation Analysis

  21. Some Applications • Biomedical & Environmental Analysis • Deformable Membrane • Presynaptic Nerve • Krebs Cycle in living organisms • Forest fire propagation • Watershed formation
