
Switching to High Gear: Opportunities for Grand-scale Real-time Parallel Simulations


Presentation Transcript


  1. IEEE DS-RT, Singapore, Oct 26, 2009. Switching to High Gear: Opportunities for Grand-scale Real-time Parallel Simulations. Kalyan S. Perumalla, Ph.D., Senior Research Staff Member, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology

  2. Main Theme: Computational power offers unprecedented potential to exploit; simulation scale should stretch the imagination toward new scopes. “Think Big… Really Big”

  3. Confluence of Opportunities, Needs [slide diagram not recoverable from transcript]

  4. Parallel Computing Power: It’s Coming. High-end computing… coming soon to a center near you! Access to 1000s of cores for every parallel simulation researcher, in just 2-3 years from now.

  5. Evidence of Growth in 10³-Core Systems [chart not reproduced in transcript]

  6. Now, all Top 500 are 10³-Core or More!

  7. Switching Gears

      Gear  Decade  Processors
      1     1980    10¹
      2     1990    10²
      3     2000    10³
      4     2010    10⁴
      5     2010    10⁵–10⁶
      R     2020

  8. Potential Areas for Discrete Event Execution on 10⁵–10⁶ Scale: cyber infrastructure simulations (Internet protocols, peer-to-peer designs, …); epidemiological simulations (disease spread models, mitigation strategies, …); social dynamics simulations (pre- and post-operations campaigns, foreign policy, …); vehicular mobility simulations (regional- or nation-scale, …); agent-based simulations (behavioral exploration, complex compositions, …); sensor network simulations (wide-area monitoring, situational awareness, …); organization simulations (command and control, business processes, …); logistics simulations (supply chain processes, contingency analyses, …). Initial models scaling to 10³–10⁴ cores.

  9. If only we look harder… Many nation-scale and world-scale questions are becoming relevant; new methods and methodologies are waiting to be discovered.

  10. Slippery Slopes. The starting point for an experimental study sits between abstractions and gory detail; with evolving needs of accuracy and detail, the tendency is to slide toward gory detail. [slide diagram: slope from “Abstractions” down to “Gory detail”]

  11. How do we abstract immense complexity? Answer: it is very difficult until we experiment with the system at scale.

  12. What do we mean by Gory Detail? A cyber security example: the network at large (topologies, bandwidths, latencies, link types, MAC protocols, TCP/IP, BGP, …); core systems (routers, databases, service level agreements, inter-AS relationships, …); end systems (processor traits, disk traits, OS instances, daemons, services, S/W bugs, …); “heavy” applications and traffic (video (YouTube, …), VOIP, live streams; foreground, background); behavioral infusion (social nets: topologies, dynamics, agencies, advertisers; peer-to-peer).

  13. Example: Epidemiology or Computer Worm Propagation. The typical dynamics model (multiple variants exist, but qualitatively similar) gives an excellent fit to the collected data, but only post-facto (!); it is difficult to use as a predictive model because a great amount of detail is buried in α. Gory detail is needed for better predictive power: interaction topology, resource limitations. (A reconstruction sketch follows.)
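The “typical dynamics model” itself is not reproduced in the transcript; a plausible reconstruction, assuming the classic logistic form used for both epidemics and random-scanning worms, folds all contact structure into the single rate constant α:

    \[ \frac{dI}{dt} = \alpha\, I(t)\,\bigl(N - I(t)\bigr), \qquad I(t) = \frac{N}{1 + \left(\tfrac{N}{I_0} - 1\right) e^{-\alpha N t}} \]

where N is the population size, I(t) the number infected, and I₀ the initial infections. The closed form fits observed data well after the fact, but everything that determines α (who contacts whom, for how long, with what resources) is exactly the “gory detail” the slide argues for.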

  14. Slippery Slope: Cost and Time. [slide charts: cost to realize experimentation capability; time to reach experimentation capability]

  15. Our Research Organization in Discrete Event Runtimes and Applications (stacked layers):
      Applications: evacuation decision support, automated detection/tracking, design & analysis of communication effects, … (customization, scenario generation, experimentation, visualization), built on transportation network simulations, sensor network simulations, …
      Models: vehicular, communication network, logistics, enterprise, social network, and asynchronous scientific simulations, … (core models, feasibility demonstration, extensible frameworks, novel modeling methods, trade-offs: memory-computation, speed-accuracy)
      Engines: parallel/distributed discrete event simulation engines (model execution, synchronization, data integration, interoperability, multi-scale, …), “enabling” scalability, efficiency, correctness, robustness, usability, extensibility, integration
      Platforms: supercomputers, clusters, multi-cores, GPGPUs, PDAs, …

  16. A Few of Our Current Areas, Projects: state-level mobility (multi-million intersections and links); epidemiological analyses (detailed, billion-entity dynamics); wireless radio signal estimation (multi-million-cell cluttered terrains); supercomputer design (designing next architectures by simulating on current); Internet security, protocol design (as-is instantiation of nodes and routers); populace’s cognitive behaviors (large-population cognition with connectionist networks). Projects: GARFIELD-EVAC (10⁶–10⁷-link scenarios of FL, LA, …); RCREDIF (10⁹-individual infection scenarios); RCTLM (3-D, 10⁷ cells simulated on 10⁴ cores); µπ (performance prediction of 10⁶-core MPI programs on 10⁴ cores); NetWarp (hi-fi Internet test-bed).

  17. Scalable Experimentation for Cyber Security: NetWarp is our novel test-bed technology for highly scalable, detailed, rapid experimentation on cyber security and cyber infrastructures.

  18. Real-Time or Faster Cyber Experimentation Approaches. [slide chart: fidelity versus scalability (roughly 10² to 10⁸) for hardware testbeds, emulation systems, fully virtualized systems (NetWarp), packet-level simulation (sequential and parallel), mixed-abstraction simulation, and aggregate models; execution ranges from real-time to as-fast-as-possible]

  19. NetWarp Architecture [architecture diagram not reproduced in transcript]

  20. DOE-Sponsored Institute for Advanced Architectures and Algorithms: we need highly scalable simulation methods and methodologies to simulate next-generation architectures and algorithms on future supercomputing platforms… “…catalyst for the co-design and development of architectures, algorithms, and applications to create synergy in their respective evolutions…”

  21. μπ (MUPI) Performance Investigation System. μπ = micro parallel performance investigator: performance prediction for MPI, Portals, and other parallel applications. Actual application code is executed on the real hardware; the platform is simulated at large virtual scale, with timing customized by a user-defined machine. Scale is the key differentiator: target 150,000 virtual cores, e.g., 150,000 virtual MPI ranks in a simulated scenario. Based on µsik (micro simulator kernel), a scalable PDES engine with TCP- or MPI-connected simulation kernels.

  22. Example: MPI application over μπ. Modify the MPI include and recompile: change #include <mpi.h> to #include <mupi.h>. Relink to the mupi library: instead of -lmpi, use -lmupi. Run the modified MPI application (now a μπ simulation): mpirun -np 4 test -nvp 32 runs test with 32 virtual MPI ranks; the simulation uses 4 real cores. μπ itself uses multiple real cores to run in parallel. (A sketch follows.)
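A minimal sketch of what the slide describes, using an ordinary MPI hello-world as the stand-in application; the program body is illustrative (not from the talk), and only the include and link line differ from plain MPI:

    /* test.c: an ordinary MPI program. Per the slide, running it
     * under the mupi simulator needs only two changes: swap the
     * include, and link with -lmupi instead of -lmpi. */
    #include <mupi.h>   /* was: #include <mpi.h> */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Under "mpirun -np 4 test -nvp 32", size reports 32 virtual
         * MPI ranks even though only 4 real cores run the simulation. */
        printf("virtual rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }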

  23. Epidemic Disease Propagation • Can be an extremely challenging simulation problem • Asymptotic behaviors are relatively well understood • Transients are poorly understood, hard to predict well • Defined and characterized by many interlinked processes • “Gory Detail” necessary

  24. Epidemic Disease Propagation • Reaction-diffusion processes • Probability based on interaction times, vulnerabilities, thresholds • Short- and long-distance mobility, sojourn times • Probabilistic state transitions, infections, recoveries • A Supercomputing ’08 model reported scalability only to 400 cores: synchronization costs become prohibitive, and synchronous execution is our prime suspect • Our discrete event execution relieves synchronization costs, scaling to tens of thousands of cores and up to 1 billion affected entities. (A formula sketch follows.)
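The slides do not give the transition rule itself; a common form in such reaction-diffusion epidemic models, and one consistent with “probability based on interaction times, vulnerabilities, thresholds”, makes the chance of infection grow with contact duration:

    \[ P(\text{infection}) \;=\; 1 - e^{-r\,\tau} \]

where τ is the interaction (sojourn/contact) time and r folds together the infector’s transmissibility and the susceptible’s vulnerability; a threshold on accumulated exposure can gate the state transition. This is an assumed illustration, not necessarily the model’s actual rule.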

  25. PDES Scaling Needs: anticipate impending opportunities in multiple application areas of grand-scale PDES scenarios; prepare to capitalize on increasing computational power (300K+ cores); aim to achieve the computational capability to enable new PDES-based scientific solutions.

  26. Jaguar Petascale System [Cray XT5]

  27. Jaguar: NCCS’ Cray XT5 (data and images from http://nccs.gov)

  28. Technological Upgrade: 10⁵-Scalable PDES Frameworks. To realize scale with any of the PDES models and applications, we need the core frameworks to scale.

  29. Recent Attempts at 10⁵-Core PDES Frameworks: Bauer et al. (Jun ’09) on Blue Gene/P (Argonne); Perumalla & Tipparaju (Jan ’09) on Cray XT5 (ORNL). Degradation beyond 64K cores was observed by us as well as others, and in more than one metric (rollback efficiency, speedup).

  30. Implications for Discrete Event Execution on High Performance Computing Platforms

  31. Some of our Objectives: scale from 10⁴ cores (current) to 10⁵–10⁶ cores (new); realize very large-scale (multi-billion entity) scenarios in cyber infrastructures, social computing, epidemiology, logistics; aid projects in simulation-based design of future-generation supercomputers; fill the technological gap by achieving the highest scaling capabilities of parallel discrete event simulations; ultimately, enable formulation of grand-scale solutions with non-traditional supercomputing simulations.

  32. Electromagnetic (EM) Wave Propagation: predict the receiver signal, accounting for reflectivity, transmissivity, and multi-path effects; the power level (voltage) is modeled per face of each grid cell. (An illustrative update rule follows.)
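The slide does not show the update rule; assuming a transmission-line-matrix (TLM) formulation like the RCTLM code mentioned on slide 16, each cell face carries an incident voltage that is scattered into reflected voltages at every step. For the standard lossless 2-D shunt node, for instance:

    \[ V_k^{r} \;=\; \frac{1}{2}\sum_{j=1}^{4} V_j^{\,i} \;-\; V_k^{\,i}, \qquad k = 1,\dots,4 \]

with per-boundary reflection and transmission coefficients supplying the reflectivity and transmissivity the slide lists. This is a textbook TLM form offered as an illustration, not necessarily the project’s actual discretization.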

  33. PHOLD Benchmark: relatively fine grained, ~5 microseconds of computation per event. 10 “juggler” entities per processor core (analogous to grid cells, road intersections, or such), with a total of 1000 “juggling balls” per core (analogous to state updates exchanged among cells). Upon receipt of a ball event, a juggler throws it back a random (exponential) time into the future to a random juggler; 1 in every 1000 juggling exchanges is constrained to be intra-core, the rest inter-core. (A code sketch follows.)
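A sketch of the juggler logic as described on the slide, assuming a generic PDES engine; busy_work_us() and schedule_ball() are hypothetical placeholders, not actual µsik calls:

    #include <stdlib.h>
    #include <math.h>

    #define JUGGLERS_PER_CORE 10
    #define INTRA_RATIO       1000  /* 1 in 1000 exchanges stay intra-core */

    /* Hypothetical engine hooks; not the actual µsik API. */
    extern void busy_work_us(int microseconds);
    extern void schedule_ball(int core, int juggler, double timestamp);

    /* Exponentially distributed delay with unit mean. */
    static double rand_exp(void) {
        return -log(1.0 - drand48());
    }

    /* Invoked when a juggler on this core receives a ball event. */
    void on_ball_event(int my_core, int num_cores, double now) {
        busy_work_us(5);                      /* ~5 us of model computation */
        double ts = now + rand_exp();         /* throw back into the future */
        int dest_core = (rand() % INTRA_RATIO == 0)
                      ? my_core               /* rare intra-core exchange   */
                      : rand() % num_cores;   /* usually goes off-core      */
        int dest_juggler = rand() % JUGGLERS_PER_CORE;
        schedule_ball(dest_core, dest_juggler, ts);
    }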

  34. Radio Propagation: Speedup on Cray XT4

  35. Radio Propagation: Speedup on Cray XT4

  36. Radio Propagation: Runtime Costs on Cray XT4

  37. Epidemic Propagation: Performance on Cray XT5

  38. Epidemic Propagation – Parallel Run time on Cray XT5

  39. PHOLD: Performance on Cray XT5

  40. Scalability: Observations. Scalability problems with current approaches were not evident previously: fine until 10⁴ cores, but poor thereafter. Even with discrete event, implementation is key: semi-asynchronous execution scales poorly; fully asynchronous execution is needed.

  41. Algorithm Design and Development for Scalable Discrete Event Execution: design algorithms optimized for Cray XT5 and Blue Gene P/Q; design a new virtual-time synchronization algorithm; design novel rollback control schemes; design discrete event-specific flow control. [slide figure: current synchronization algorithm]

  42. Additional Important Algorithmic Aspects: novel separation of event communication from synchronization (prioritization support in our communication layer, “QoS” support for fast synchronization); novel timestamp-aware buffering (exploit near vs. far timestamps, coordinated with virtual-time synchronization); efficient flow control (highly unstructured inter-processor communication); optimized rollback dynamics (stability and throttling mechanisms, cancelback protocols; an example is the “transient event” problem). (A buffering sketch follows.)
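The slides leave these mechanisms at the bullet level; purely as an illustration of timestamp-aware buffering, events far ahead of the current virtual-time estimate might be batched while near-timestamp events are flushed eagerly. Every name below is a hypothetical placeholder:

    /* Hypothetical sketch; all names stand in for the engine's own
     * communication layer, not an actual interface. */
    #define NEAR_WINDOW 10.0  /* virtual-time horizon for "near" events */
    #define BATCH_LIMIT 64    /* bound on batching-induced latency */

    typedef struct { double timestamp; /* payload elided */ } event_t;
    typedef struct { int count; /* storage elided */ } buffer_t;

    extern double gvt_estimate(void);          /* current virtual-time bound */
    extern void   flush_buffer(buffer_t *buf); /* transmit batched events    */
    extern void   send_now(const event_t *ev); /* immediate transmission     */
    extern void   append(buffer_t *buf, const event_t *ev);

    void send_event(const event_t *ev, buffer_t *buf) {
        double horizon = gvt_estimate() + NEAR_WINDOW;
        if (ev->timestamp <= horizon) {
            flush_buffer(buf);    /* near events may affect GVT soon...  */
            send_now(ev);         /* ...so bypass batching entirely      */
        } else {
            append(buf, ev);      /* far events tolerate extra latency   */
            if (buf->count >= BATCH_LIMIT)
                flush_buffer(buf);
        }
    }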

  43. Data Integration Interface Development: an Application Programming Interface (API) to incorporate streaming input into discrete event execution, with runtime efficiency as an important consideration. Novel concepts supporting latency-hiding: permit maximal concurrency without violating time-ordering between the live simulation and real-time inputs; reuse optimistic synchronization for latency-hiding of unpredictable data input from external sources. (A hypothetical signature follows.)
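The API itself is not shown in the transcript; a hypothetical signature illustrating the idea, where an external feed stamps each sample with a virtual time and optimistic rollback repairs any time-order violation after the fact:

    #include <stddef.h>

    /* Hypothetical interface sketch, not the actual API. A feed thread
     * stamps each external sample with a virtual time on arrival and
     * injects it as an event; if the simulation has optimistically
     * advanced past that time, rollback repairs the time-ordering, so
     * input latency is hidden rather than waited out. */
    typedef struct des_engine des_engine_t;

    int des_inject_external(des_engine_t *eng,
                            double        virtual_time, /* stamped on arrival */
                            const void   *payload,
                            size_t        len);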

  44. Software Implementation: runtime algorithms and data integration interfaces realized in software, primarily in C/C++, building on our current software (which scales to 10⁴ cores) and optimized for performance on Cray XT5 and Blue Gene/P. Communication is to be structured flexibly: use MPI or Portals or a combination, and explore potentially new layers such as non-blocking collectives (MPI-3) and the Chapel language, alongside our current scalable data structures and existing layered software. (A non-blocking-collective sketch follows.)
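As one concrete possibility for the MPI-3 direction, a non-blocking reduction lets a virtual-time lower bound (GVT-style) be computed in the background while event processing continues. A minimal sketch, with compute_local_lower_bound() as an engine-specific placeholder; a real GVT algorithm must also account for in-flight messages, which this omits:

    #include <mpi.h>

    extern double compute_local_lower_bound(void);  /* engine-specific */

    static double      local_lvt, global_lvt;
    static MPI_Request gvt_req;

    /* Start a background min-reduction of each rank's local bound. */
    void start_gvt_round(void) {
        local_lvt = compute_local_lower_bound();
        MPI_Iallreduce(&local_lvt, &global_lvt, 1, MPI_DOUBLE,
                       MPI_MIN, MPI_COMM_WORLD, &gvt_req);
    }

    /* Poll between events; when it returns true, global_lvt holds
     * the new global bound. */
    int gvt_round_done(void) {
        int done;
        MPI_Test(&gvt_req, &done, MPI_STATUS_IGNORE);
        return done;
    }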

  45. Performance Metrics: efficiency and speedup are measured using event rates, where event rate ≡ number of events processed per wall-clock second. Weak scaling: ideal speedup keeps events/second/processor invariant with the number of processors. Strong scaling: ideal speedup means aggregate events/second increases linearly with the number of processors. (In symbols, see below.)
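In symbols, writing E(P) for the aggregate events per second on P processors (notation introduced here, not on the slide):

    \[ \text{weak scaling (model grows with } P\text{):}\quad \frac{E(P)}{P} \approx \text{const}, \qquad \text{strong scaling (model fixed):}\quad E(P) \approx P\,E(1) \]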

  46. Application Benchmarking and Demonstration: the entire runtime and data integration frameworks are to be exercised with an at-scale simulation from each area: epidemiological simulations, human behavioral simulations, cyber infrastructure simulations, logistics simulations. Instantiate scenarios scaled up from smaller-scale scenarios in the literature; experiment with strong scaling as well as weak scaling, as appropriate for each application area. [slide figures: probability of infection in the epidemiological model; example inter-entity networks]

  47. Status: showed preliminary evidence that PDES is feasible even at the largest core-counts and adequately scalable to over 100,000 cores, but it should be improved much, much more. Applications can now move beyond “if” and begin to contemplate “how” to use petascale discrete event execution.

  48. Methodological Alternatives: sometimes, new modeling formulations may better suit scaling needs! Redefine and refine the model to suit the computing platform. Example: ultra-scale vehicular mobility simulations on GPUs…

  49. Example: Ultra-scale Vehicular Mobility Simulations E.g., National Evacuation Conference www.nationalevacuationconference.org

  50. Our GARFIELD Simulation & Visualization System. [slide diagram: GPU architecture with texture memory banks feeding arrays of fragment processors (FP = fragment processor); live demo]
