
Parallelizing Spacetime Discontinuous Galerkin Methods

Parallelizing Spacetime Discontinuous Galerkin Methods. Jonathan Booth University of Illinois at Urbana/Champaign In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan This research made possible via NSF grant DMR 01-21695. http://charm.cs.uiuc.edu. Parallel Programming Lab.


Presentation Transcript


  1. Parallelizing Spacetime Discontinuous Galerkin Methods Jonathan Booth University of Illinois at Urbana/Champaign In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan This research made possible via NSF grant DMR 01-21695 http://charm.cs.uiuc.edu

  2. Parallel Programming Lab • Led by Professor Laxmikant Kale • Application-oriented • Research is driven by the needs of real applications • NAMD • CSAR Rocket Simulation (Roc*) • Spacetime Discontinuous Galerkin • Petaflops Performance Prediction (Blue Gene) • Focus on scalable performance for real applications

  3. Charm++ Overview • In development for roughly ten years • Based on C++ • Runs on many platforms • Desktops • Clusters • Supercomputers • Overlays a C layer called Converse • Allows multiple languages to work together

  4. Charm++: Programmer View • System of objects • Asynchronous communication via method invocation • Use an object identifier to refer to an object • User sees each object execute its methods atomically • As if on its own processor [Figure labels: Processor, Object/Task]

  5. Charm++: System View • Set of objects invoked by messages • Set of processors of the physical machine • Keeps track of object-to-processor mapping • Routes messages between objects [Figure labels: Processor, Object/Task]
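The two views on slides 4 and 5 can be illustrated with a toy message-driven runtime. This is only a sketch, not the Charm++ API: the names `Runtime`, `Relay`, `register`, `send`, and `run` are hypothetical, and a single queue stands in for the real message routing. Objects address each other by identifier, calls are queued rather than executed immediately, and each method runs to completion before the next message is delivered.

```python
from collections import deque


class Runtime:
    """Toy stand-in for a message-driven runtime (hypothetical, not Charm++)."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.objects = {}     # object id -> object
        self.location = {}    # object id -> processor index (system view)
        self.queue = deque()  # pending (object id, method name, args)

    def register(self, obj_id, obj):
        # Round-robin placement; the programmer never sees this mapping.
        self.location[obj_id] = len(self.location) % self.num_processors
        self.objects[obj_id] = obj

    def send(self, obj_id, method, *args):
        # Asynchronous invocation: enqueue the call and return immediately.
        self.queue.append((obj_id, method, args))

    def run(self):
        # Deliver one message at a time, so each method executes atomically.
        while self.queue:
            obj_id, method, args = self.queue.popleft()
            getattr(self.objects[obj_id], method)(*args)


class Relay:
    """Object that records a visit and forwards a token to its neighbor."""

    def __init__(self, runtime, my_id, next_id, log):
        self.runtime, self.my_id, self.next_id, self.log = runtime, my_id, next_id, log

    def receive(self, hops_left):
        self.log.append(self.my_id)
        if hops_left > 0:
            self.runtime.send(self.next_id, "receive", hops_left - 1)


log = []
rt = Runtime(num_processors=2)
for i in range(3):
    rt.register(i, Relay(rt, i, (i + 1) % 3, log))
rt.send(0, "receive", 4)
rt.run()
```

Three objects run on two "processors" here, which is the situation processor virtualization is meant to allow: the object code refers only to identifiers, never to processor numbers.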

  6. Charm++ Benefits • Program is not tied to a fixed number of processors • No problem if program needs 128 processors and only 45 are available • Called processor virtualization • Load balancing accomplished automatically • User writes a short routine to transfer object between processors
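The "short routine to transfer object between processors" amounts to a pack/unpack pair for the object's state. The sketch below shows the idea in Python under stated assumptions: the `Patch` class, its fields, and the byte layout are hypothetical and are not the presenter's code or the Charm++ interface.

```python
import struct


class Patch:
    """Hypothetical migratable object: an id plus a list of coordinates."""

    def __init__(self, patch_id, coords):
        self.patch_id = patch_id
        self.coords = list(coords)

    def pack(self):
        # Header: patch id and coordinate count as two little-endian int32s.
        header = struct.pack("<ii", self.patch_id, len(self.coords))
        # Body: the coordinates as little-endian float64s.
        body = struct.pack(f"<{len(self.coords)}d", *self.coords)
        return header + body

    @classmethod
    def unpack(cls, data):
        # Rebuild the object on the receiving side from the same layout.
        patch_id, count = struct.unpack_from("<ii", data, 0)
        coords = struct.unpack_from(f"<{count}d", data, 8)
        return cls(patch_id, coords)


original = Patch(7, [0.0, 0.5, 1.0])
wire_bytes = original.pack()
restored = Patch.unpack(wire_bytes)
```

Because the packed form is just bytes, the same routine could write the object to a file instead of a network buffer.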

  7. Load Balancing - Green Process Starts Heavy Computation [Figure labels: A, B, C]

  8. Yellow Processes Migrate Away – System Handles Message Routing [Figure labels: A, B, C]

  9. Load Balancing • Load balancing isn’t solely dependent on CPU usage • Balancers consider network usage as well • Can move objects to lessen network bandwidth usage • Migrating an object to disk instead of another processor gives checkpoint/restart and out-of-core execution

  10. Parallel Spacetime Discontinuous Galerkin • Mesh generation is an advancing front algorithm • Adds an independent set of elements called patches to the mesh • Spacetime methods are set up in such a way that they are easy to parallelize • Each patch depends only on inflow elements • Cone constraint ensures no other dependencies • Amount of data per patch is small • Inexpensive to send a patch and its inflow elements to another processor

  11. Mesh Generation Unsolved Patches

  12. Mesh Generation Unsolved Patches Solved Patches

  13. Mesh Generation Refinement Unsolved Patches Solved Patches

  14. Parallelization Method (1D) • Master-Slave method • Centralized mesh generation • Distributed physics solver code • Simplistic implementation • But fast to get running • Provides object migration sanity check • No “time-step” • As soon as a patch returns, the master generates any new patches it can and sends them off to be solved
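The master loop described on slide 14 can be sketched as follows. This is a minimal illustration, assuming a precomputed dependency table in place of the real advancing-front mesh generator and a thread pool in place of slave processors; `solve_patch`, `run_master`, and the `dependencies` table are hypothetical names. The point is the absence of a global time step: whenever any patch comes back, the master immediately hands out every patch whose inflow patches are now solved.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def solve_patch(patch_id):
    """Hypothetical stand-in for the physics solver running on a slave."""
    return patch_id


def run_master(dependencies, num_workers=4):
    """dependencies maps patch id -> set of patch ids that must be solved first."""
    solved, submitted, pending, order = set(), set(), set(), []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:

        def submit_ready():
            # Send off every patch whose inflow patches are all solved.
            for patch_id, deps in dependencies.items():
                if patch_id not in submitted and deps <= solved:
                    submitted.add(patch_id)
                    pending.add(pool.submit(solve_patch, patch_id))

        submit_ready()
        while pending:
            # Wake up as soon as any single patch returns; no barrier.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                pending.discard(future)
                solved.add(future.result())
                order.append(future.result())
            submit_ready()
    return order


# Patches 0 and 1 are independent; 2 needs both; 3 needs 2.
example_dependencies = {0: set(), 1: set(), 2: {0, 1}, 3: {2}}
order = run_master(example_dependencies)
```

Patches 0 and 1 may complete in either order, but patch 2 is never submitted before both have returned.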

  15. Results - Patches / Second

  16. Scaling Problems • Speedup is ideal up to 4 slave processors • After 4 slaves, diminishing speedup occurs • Possible sources: • Network bandwidth overload • Charm++ system overhead (grainsize control) • Mesh generator overload • Problem doesn’t scale down • More processors don’t slow the computation down

  17. Network Bandwidth • Size of a patch to send both ways is 2048 bytes (very conservative estimate) • Can compute 36 patches/(second*CPU) • Each CPU needs 72 kbytes/second • 100 Mbit Ethernet provides ~10 Mbyte/sec • Network can support ~130 CPUs • So the bottleneck is not a lack of network bandwidth
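The bandwidth estimate on slide 17 can be reproduced with a few lines of arithmetic. The constants are the slide's figures; reading "10 Mbyte/sec" as 10,000,000 bytes per second is an assumption, and the variable names are illustrative.

```python
PATCH_BYTES = 2048                  # round-trip size of one patch (slide figure)
PATCHES_PER_SEC_PER_CPU = 36        # solver rate per CPU (slide figure)
NETWORK_BYTES_PER_SEC = 10_000_000  # assumption: 10 Mbyte/sec, decimal megabytes

# Bandwidth one CPU consumes when solving at full rate.
bytes_per_cpu = PATCH_BYTES * PATCHES_PER_SEC_PER_CPU

# Number of CPUs the network could feed before saturating.
max_cpus = NETWORK_BYTES_PER_SEC // bytes_per_cpu

print(f"{bytes_per_cpu / 1024:.0f} kbytes/sec per CPU, about {max_cpus} CPUs supported")
```

This gives 72 kbytes/sec per CPU and roughly 135 CPUs, consistent with the slide's "~130", and far more processors than the point at which speedup was observed to fall off.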

  18. Charm++ System Overhead (Grainsize Control) • Grainsize is a measure of the smallest unit of work • Too small and overhead dominates • Network latency overhead • Object creation overhead • Each patch takes 1.7 ms to set up the connection to send (both ways) • Can send ~550 patches/sec to remote processors • Again, higher than observed patch/second rate • Grainsize can be increased by sending multiple patches at once • Speeds up the computation, but speedup still flattens out after 8 processors
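The per-message overhead bound on slide 18 follows from the 1.7 ms setup cost. The sketch below assumes, as a simplification, that setup cost is paid once per message regardless of how many patches it carries and that transfer time is negligible; `max_patches_per_sec` is an illustrative name.

```python
SETUP_SEC = 1.7e-3  # connection setup per message, both ways (slide figure)


def max_patches_per_sec(patches_per_message=1):
    """Upper bound on patch throughput when only setup latency is counted."""
    return patches_per_message / SETUP_SEC


single = max_patches_per_sec(1)   # one patch per message
batched = max_patches_per_sec(4)  # four patches share one setup cost
```

With one patch per message the bound is about 588 patches/sec, in line with the slide's rounder "~550". Batching raises the bound in proportion to the batch size under this model, which matches the observation that batching helps but does not remove the plateau, since the plateau has another cause.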

  19. Mesh Generation • With 0 slave processors, 31 ms/patch • With 1 slave processor, 27 ms/patch • Geometry code takes 4 ms to generate a patch • Mesh generator needs a bit more time due to Charm++ message sending overhead • Leads to less than 250 patches/second • Can’t trivially speed this up • Would have to parallelize mesh generation • Parallel mesh generation also would lighten network load if the mesh were fully distributed to slave nodes
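A simple bottleneck model makes the mesh-generator limit concrete. It assumes throughput is the smaller of what the slaves can solve and what the single master can generate, and it ignores the message-sending overhead mentioned on the slide, so it is an optimistic bound rather than a fit to the measured results; `predicted_rate` and `saturation_slaves` are illustrative names.

```python
import math

SOLVE_RATE = 36.0    # patches/sec one slave can solve (slide 17 figure)
GEOMETRY_SEC = 4e-3  # time for the master to generate one patch (slide figure)


def predicted_rate(num_slaves):
    """Patches/sec under the model: slaves in parallel, one serial generator."""
    generator_rate = 1.0 / GEOMETRY_SEC  # about 250 patches/sec
    return min(num_slaves * SOLVE_RATE, generator_rate)


# Smallest number of slaves at which the generator becomes the limit.
saturation_slaves = math.ceil((1.0 / GEOMETRY_SEC) / SOLVE_RATE)

for n in (1, 2, 4, 8, 16):
    print(n, predicted_rate(n))
```

Under these assumptions the generator saturates at about 7 slaves, which is in the same range as the flattening reported after 8 processors.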

  20. Testing the Mesh Generator Bottleneck • Does speeding up the mesh generator give better results? • Leaves the question of how to speed up the mesh generator • The cluster used has 500 MHz P3 Xeon processors • So run the mesh generator on something faster (a 2.8 GHz P4) • Everything still on 100 Mbit network

  21. Fast Mesh Generator Results

  22. Future Directions • Parallelize geometry/mesh generation • Easy to do in theory • More complex in practice with refinement, coarsening • Lessens network bandwidth consumption • Only have to send border elements of all meshes • Compared to all elements sent right now • Better cache performance

  23. More Future Directions • Send only necessary data • Currently send everything, needed or not • Use migration to balance load rather than slaves • Means we’ll also get checkpoint/restart and out-of-core execution for free • Also means we can load balance away some of the network communication • Integrate 2D mesh generation/physics code • Nothing in the parallel code knows the dimensionality
