
Optimizing N-body Simulations for Multi-core Compute Clusters





Presentation Transcript


  1. Optimizing N-body Simulations for Multi-core Compute Clusters Ammar Ahmad Awan (BIT-6) Advisor: Dr. Aamir Shafi Co-Advisor: Mr. Ali Sajjad Member: Dr. Hafiz Farooq Member: Mr. Tahir Azim

  2. Presentation Outline • Introduction • Design & Implementation • Performance Evaluation • Conclusions and Future Work

  3. Introduction • A sea change in basic computer architecture, driven by: • Power consumption • Heat dissipation • Emergence of multiple energy-efficient processing cores instead of a single power-hungry core • Moore’s law will now be realized by increasing core count instead of increasing clock speeds • Impact on software applications: • Change of focus from Instruction Level Parallelism (higher clock frequency) to Thread Level Parallelism (increasing core count) • Huge impact on the High Performance Computing (HPC) community: • 70% of the TOP500 supercomputers are based on multi-core processors

  4. Source: Google Images

  5. Source: www.intel.com

  6. SMP vs Multi-core (diagram: Symmetric Multi-Processor vs Multi-core Processor)

  7. HPC and Multi-core • Message Passing Interface (MPI) is the de facto standard for programming today’s supercomputers • Alternatives include OpenMP (for SMP machines) and Unified Parallel C (UPC) • With existing approaches, there are two ways to run MPI codes on multi-core processors: • One MPI process per core – we call this the “Pure MPI” approach • OpenMP threads inside each MPI process – we call this the “MPI+threads” approach • We expect the “MPI+threads” approach to perform better because • Communication between threads is cheaper than between processes • Threads are light-weight • We evaluated this hypothesis by comparing both approaches (a minimal sketch of the hybrid approach follows below)
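
  A minimal C sketch of the “MPI+threads” idea (illustrative only; not Gadget-2’s actual code): one MPI process is launched per node, and OpenMP threads fan out across that node’s cores, so intra-node sharing happens through memory rather than message passing.

      /* Hybrid "MPI+threads" skeleton: one MPI process per node,
         OpenMP threads across its cores. Illustrative sketch only. */
      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>

      int main(int argc, char *argv[])
      {
          int provided, rank;

          /* Ask MPI for thread support so OpenMP can be used inside */
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          #pragma omp parallel
          {
              /* Threads share the process address space, so no
                 intra-node messages are needed */
              printf("rank %d: thread %d of %d\n", rank,
                     omp_get_thread_num(), omp_get_num_threads());
          }

          MPI_Finalize();
          return 0;
      }

  Under the “Pure MPI” approach the same node would instead run one MPI process per core, and all sharing would go through MPI messages.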

  8. Pure MPI vs “MPI+threads” Approach (diagram comparing the MPI+Threads and Pure MPI approaches)

  9. Sample Application: N-body Simulations • To demonstrate the usefulness of our “MPI+threads” approach, we chose an N-body simulation code • The N-body or “many-body” method is used for simulating the evolution of a system consisting of n bodies • It has found widespread use in the fields of • Astrophysics • Molecular Dynamics • Computational Biology

  10. Summation Approach to Solving N-body Problems • The most compute-intensive part of any N-body method is the “force calculation” phase • The simplest expression for a far-field force f(i) on particle i is

      for i = 1 to n
          f(i) = sum[ j = 1..n, j != i ] f(i,j)
      end for

  where f(i,j) is the force on particle i due to particle j • The cost of this calculation is O(n²)
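
  As a concrete illustration, the summation loop above might be coded in C as follows; the particle layout and the softening constant EPS are assumptions for the sketch, not taken from the slides.

      /* O(n^2) direct summation of gravitational accelerations.
         Data layout and softening are illustrative assumptions. */
      #include <math.h>

      #define EPS 1e-3   /* softening length: avoids the r -> 0 singularity */

      typedef struct { double pos[3], mass, acc[3]; } Particle;

      void compute_forces(Particle *p, int n)
      {
          for (int i = 0; i < n; i++) {
              p[i].acc[0] = p[i].acc[1] = p[i].acc[2] = 0.0;
              for (int j = 0; j < n; j++) {   /* sum over all j != i */
                  if (j == i) continue;
                  double dx = p[j].pos[0] - p[i].pos[0];
                  double dy = p[j].pos[1] - p[i].pos[1];
                  double dz = p[j].pos[2] - p[i].pos[2];
                  double r2 = dx*dx + dy*dy + dz*dz + EPS*EPS;
                  double inv_r3 = 1.0 / (r2 * sqrt(r2));
                  /* G folded into mass units for brevity */
                  p[i].acc[0] += p[j].mass * dx * inv_r3;
                  p[i].acc[1] += p[j].mass * dy * inv_r3;
                  p[i].acc[2] += p[j].mass * dz * inv_r3;
              }
          }
      }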

  11. Barnes-Hut Tree • The Barnes-Hut algorithm is divided into 3 steps: • Building the tree – O(n log n) • Computing cell centers of mass – O(n) • Computing forces – O(n log n) • Other popular methods are • Fast Multipole Method • Particle Mesh Method • TreePM Method • Symplectic Methods • (A sketch of a possible tree-cell structure follows below)
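
  To make the three steps concrete, here is a hypothetical C layout for a Barnes-Hut octree cell together with the standard opening criterion; the field names and the theta threshold are illustrative, not Gadget-2’s actual structures.

      /* Illustrative Barnes-Hut octree cell; not Gadget-2's real layout. */
      #include <math.h>

      typedef struct BHNode {
          double center[3];        /* geometric center of the cubic cell */
          double size;             /* side length of the cell            */
          double mass;             /* total mass of contained particles  */
          double com[3];           /* center of mass (step 2, O(n))      */
          struct BHNode *child[8]; /* eight octants; NULL where empty    */
          int body;                /* particle index if leaf, else -1    */
      } BHNode;

      /* Opening criterion used in the O(n log n) force phase: a distant
         cell is treated as one pseudo-particle when size/dist < theta
         (theta is typically around 0.5 to 1.0). */
      int can_approximate(const BHNode *node, const double pos[3], double theta)
      {
          double dx = node->com[0] - pos[0];
          double dy = node->com[1] - pos[1];
          double dz = node->com[2] - pos[2];
          return node->size < theta * sqrt(dx*dx + dy*dy + dz*dz);
      }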

  12. Sample Application: Gadget-2 • Cosmological simulation code • Simulates a system of n bodies • Implements the Barnes-Hut algorithm • Written in C and parallelized with MPI • As part of this project, we: • Studied the Gadget-2 code and how it is used in production mode • Modified the C code to use threads in the Barnes-Hut tree algorithm • Added performance counters to the code for measuring cache utilization

  13. Presentation Outline • Introduction • Design & Implementation • Performance Evaluation • Conclusions and Future Work

  14. Gadget-2 Architecture

  15. Code Analysis

  Original Code – force calculation and particle export are interleaved in a single serial loop:

      for (i = 0; i < NumParticles && n < BufferSize; i++) {   /* n tracks export-buffer usage */
          calculate_force(i);
          for (j = 0; j < NumTasks; j++)
              export_particles(j);
      }

  Modified Code – the force calculation is hoisted into its own loop so it can run across threads (e.g. an OpenMP parallel for), while the export loop stays serial:

      #pragma omp parallel for
      for (i = 0; i < n; i++)
          calculate_force(i);

      for (i = 0; i < NumParticles && n < BufferSize; i++)
          for (j = 0; j < NumTasks; j++)
              export_particles(j);

  16. Presentation Outline • Introduction • Design & Implementation • Performance Evaluation • Conclusions and Future Work

  17. Evaluation Testbed • Our cluster, called Chenab, consists of nine nodes • Each node contains: • An Intel Xeon Quad-Core Kentsfield processor • 2.4 GHz with 1066 MHz FSB • 4 MB L2 cache per two cores • 32 KB L1 cache per core • 2 GB main memory

  18. Performance Evaluation • Performance evaluation is based on two main parameters • Execution Time • Calculated directly from MPI wall-clock timings • Cache Utilization • We patched the Linux kernel with the perfctr patch • We selected the Performance API (PAPI) for hardware performance counting • Used PAPI_L2_TCM (total L2 cache misses) and PAPI_L2_TCA (total L2 cache accesses) to calculate the cache miss ratio (a usage sketch follows below) • Results are shown on the upcoming slides • Execution Time for Colliding Galaxies • Execution Time for Cluster Formation • Execution Time for Custom Simulation • Cache Utilization for Cluster Formation
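
  A minimal sketch of how the miss ratio can be obtained through PAPI’s low-level API (error handling trimmed; the instrumented region is a placeholder):

      /* L2 miss ratio = PAPI_L2_TCM / PAPI_L2_TCA. Sketch only. */
      #include <papi.h>
      #include <stdio.h>

      void measure_l2_miss_ratio(void)
      {
          int evset = PAPI_NULL;
          long_long counts[2];

          PAPI_library_init(PAPI_VER_CURRENT);
          PAPI_create_eventset(&evset);
          PAPI_add_event(evset, PAPI_L2_TCM);  /* total L2 cache misses   */
          PAPI_add_event(evset, PAPI_L2_TCA);  /* total L2 cache accesses */

          PAPI_start(evset);
          /* ... instrumented region, e.g. the force-calculation phase ... */
          PAPI_stop(evset, counts);

          printf("L2 miss ratio: %.3f\n", (double)counts[0] / counts[1]);
      }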

  19. Execution Time for Colliding Galaxies

  20. Execution Time for Cluster Formation

  21. Execution Time for Custom Simulation

  22. Cache Utilization for Cluster Formation • Cache utilization has been measured using hardware counters, provided via the kernel patch (perfctr) and the Performance API (PAPI)

  23. Presentation Outline • Introduction • Design & Implementation • Performance Evaluation • Conclusions and Future Work

  24. Conclusion • We optimized our sample application, Gadget-2 • The “MPI+threads” approach performs better • The optimized code offers scalable performance • We are witnessing dramatic changes in core designs for multi-core systems • Heterogeneous and homogeneous designs • Targeting a 1000-core processor will require scalable frameworks and tools for programming

  25. Conclusion • Towards many-core computing • Multi-core: 2x cores every 2 years → roughly 64 cores in 8 years (e.g. four doublings from a quad-core chip: 4 × 2⁴ = 64) • Many-core: 8x to 16x the core count of multi-core Source: Dave Patterson, Overview of the Parallel Laboratory

  26. Future Work • Scalable frameworks that provide programmer-friendly high-level constructs are very important • PeakStream provides GPU and hybrid CPU+GPU programming • Cilk++ augments the C++ compiler with three new keywords (cilk_for, cilk_sync, cilk_spawn) • The Research Accelerator for Multiple Processors (RAMP) can be used to simulate a 1000-core processor • Gadget-2 can be ported to GPUs using Nvidia’s CUDA framework • IBM’s ‘xlc’ compiler can be used to program the STI Cell processor

  27. The Timeline

  28. Barnes-Hut Tree
