1 / 21

BigSim: Simulating PetaFLOPS Supercomputers

BigSim: Simulating PetaFLOPS Supercomputers. Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign. Introduction. Petascale machines are being built It is hard to understand parallel application without actually running it

zola
Download Presentation

BigSim: Simulating PetaFLOPS Supercomputers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BigSim: Simulating PetaFLOPS Supercomputers Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

  2. Introduction • Petascale machines are being built • It is hard to understand parallel application without actually running it • Tools that allow one to predict the performance of applications before machines are built, also help debugging and tuning • Allows easier "offline" experimentation • Provides a feedback for machine designers 5th Annual Workshop on Charm++ and its Applications

  3. Overview • Prediction accuracy • Sequential performance • Parallel performance (message passing) • Flexibility - different levels of fidelity/tradeoffs • Challenges for usability • Memory constraints • Time cost • How to analyze result - scalable performance analysis tools • Portability • Scalability • BigSim system 5th Annual Workshop on Charm++ and its Applications

  4. Summary of Our Approach • Parallel Discrete Event Simulation (PDES) • the operation of a system is represented as a chronological sequence of events • Event-driven simulation method • Two phase simulation • BigSim Emulator, capable of running application • Charm++, Adaptive MPI / MPI • Predict sequential execution blocks • Generate trace logs • Parallel Discrete Event Simulator (POSE-based) • Predict message passing performance • Implemented on Charm++ 5th Annual Workshop on Charm++ and its Applications

  5. BigSim System Architecture Performance visualization (Projections) Offline PDES Network Simulator BigSim Simulator Simulation output trace logs Charm++ Runtime Load Balancing Module Performance counters Instruction Sim (RSim, IBM, ..) Simple Network Model BigSim Emulator Charm++ and MPI applications 5th Annual Workshop on Charm++ and its Applications

  6. BigSim Emulator – the first step • Actual execution of real applications on much smaller machines • Model each target processor as virtual processor • Using light-weight (Converse) threads simulating each target processor • Charm++/AMPI applications without change • No global variables/swap-globals/copy-globals 5th Annual Workshop on Charm++ and its Applications

  7. Hardware thread Simulating (Host) Processor Emulation on a Parallel Machine Target Nodes 5th Annual Workshop on Charm++ and its Applications

  8. Predicting Sequential Performance • Different level of fidelity: • Direct mapping • Performance counter • Instruction-level simulator • Tradeoff between time cost and accuracy • Exploit inherent determinacy of Charm++ application • Out of order messages do not change the computation behavior (structured dagger) 5th Annual Workshop on Charm++ and its Applications

  9. [22] name:msgep (srcnode:0 msgID:21) ep:1 [[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]] backward: forward: [23] [23] name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]] msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208 msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208 backward: [22] forward: [24] [24] name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]] backward: [0x80a7af0 23] forward: [25] [28] Event logs are generated for each target processor Predicted time on execution blocks with timestamps Event dependencies Message sending events Trace Logs 5th Annual Workshop on Charm++ and its Applications

  10. Predicting message passing performance – second step • Parallel Discrete Event Simulation • Needed for correcting causality errors due to out-of-order messages • Timestamps are re-adjusted without rerunning the actual application • Simulate network behaviours: packetization, routing, contention, etc • simple latency based network model, and contention-based network model 5th Annual Workshop on Charm++ and its Applications

  11. PDES and Charm++ • PDES is a naturally message-driven activity • Large number of entities naturally maps to virtual processors • Fine grained PDES • Migratable PDES entities and load balancing 5th Annual Workshop on Charm++ and its Applications

  12. Network Simulator Design 5th Annual Workshop on Charm++ and its Applications

  13. Network Simulator: Data Flow 5th Annual Workshop on Charm++ and its Applications

  14. Network Communication Pattern Analysis • NAMD with apoa1 • 15 timestep 5th Annual Workshop on Charm++ and its Applications

  15. Network Communication Pattern Analysis Data transferred (KB) in a single time step 5th Annual Workshop on Charm++ and its Applications

  16. NAMD network simulation (scalability) • Simulating NAMD on 2048 processors with 3D Torus topology • Tests ran on Turing Apple cluster with Myrinet Terry L. Wilmarth, Gengbin Zheng, Eric J. Bohm, Yogesh Mehta, Praveen Jagadishprasad, Laxmikant V. Kalé, ``Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE'', PADS 2005 5th Annual Workshop on Charm++ and its Applications

  17. LeanMD Performance Analysis • Benchmark 3-away ER-GRE • 36573 atoms • 1.6 million objects • 8 step simulation • 64k BG processors • Running on PSC Lemieux 5th Annual Workshop on Charm++ and its Applications

  18. Integration with Instruction level simulators • Instruction level simulators typically are • Standalone applications • Sequential execution • Very slow • An interpolation-based scheme • Use result of a smaller scale instruction level simulation to interpolate for large dataset • do a least-squares fit to determine the coefficients of an approximation polynomial function 5th Annual Workshop on Charm++ and its Applications

  19. Case study: BigSim / Mambo void func( ) { StartBigSim( ) … EndBigSim( ) } Mambo Prediction for Target System Cycle-accurate prediction of sequential blocks on POWER7 processor BigSim Parallel Emulation BigSim Parallel Simulation Interpolation + Replace sequential timing Trace files Parameter files for sequential blocks Adjusted trace files 5th Annual Workshop on Charm++ and its Applications

  20. Future Work • Use performance counter to predict sequential timing • Out-of-core execution for memory bound applications 5th Annual Workshop on Charm++ and its Applications

  21. Out-of-core Emulation • Motivation • Physical memory is shared • VM system would not handle well • Message driven execution • Peek msg queue => what execute next? (prefetch) 5th Annual Workshop on Charm++ and its Applications

More Related