MPI vs POSIX Threads

A Comparison

Overview
  • MPI allows you to run multiple processes on 1 host
    • How would running MPI on one host compare with a POSIX threads solution?
  • Attempting to compare MPI vs POSIX run times
  • Hardware
    • Dual 6-core (2 threads per core), 12 logical
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
    • Intel Xeon CPU E5-2667 (show schematic)
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
    • 2.96 GHz
    • 15 MB L3 Cache
  • All code / output / analysis available here:
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/
Specifics
  • Going to compare run times of code written with MPI vs code written using POSIX threads and shared memory
    • Try to make the code as similar as possible so we're comparing apples with oranges and not apples with monkeys
    • Since we are on one machine, the bus carries all the communication traffic, which should make the POSIX and MPI versions comparable (i.e., the network doesn't get involved)
  • Only makes sense with 1 machine
  • Set up test bed
    • Try each step individually, check results, then automate
  • Use Matrix Matrix multiply code we developed over the semester
    • Everyone is familiar with the code and can make observations
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/pthread_matrix_21.c
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_3.c
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_no_mp.c
  • Use square matrices
  • Vary matrix sizes from 500 -> 10,000 elements per side (plus a couple of big ones)
  • Matrix A will be filled with 1 to n, left to right and top down
  • Matrix B will be the identity matrix
    • Can then check our results easily, since A*B = A when B is the identity matrix (a minimal sketch of this setup follows this list)
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
    • Ran every step (compile / output result / parsing) by hand many times and checked the results before writing the final scripts to do the processing
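
As a point of reference, here is a minimal serial sketch of the setup just described (the variable names and fixed size are illustrative; the actual course code is in the linked matmat/pthread files): A is filled with increasing values left to right and top down, B is the identity, C = A*B is computed with the plain triple loop, and the result is checked against A.

/* Minimal serial sketch of the test setup described above.  Illustrative
 * only; the actual course code (matmat_no_mp.c etc.) may differ in detail. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 500;                                /* one of the tested sizes */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i*n + j] = (double)(i*n + j + 1); /* filled left to right, top down */
            B[i*n + j] = (i == j) ? 1.0 : 0.0;  /* identity matrix */
        }

    /* straightforward triple loop: C = A * B */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }

    /* with B = identity, C must equal A exactly */
    int ok = 1;
    for (int i = 0; i < n*n; i++)
        if (C[i] != A[i]) { ok = 0; break; }
    printf("check %s\n", ok ? "passed" : "FAILED");

    free(A); free(B); free(C);
    return 0;
}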
Matrix Sizes

Third column: just the number of calculations inside the loop for calculating the matrix elements.
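Assuming the standard triple-nested loop shown earlier, that count grows as n^3 multiply-add pairs for an n x n matrix (about 2n^3 floating-point operations): n = 500 gives 1.25 x 10^8 multiply-adds, while n = 10,000 gives 10^12, which is why run times climb so steeply with size.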

Specifics cont.
  • About the runs
    • For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
    • Vary thread count 2-12 (POSIX)
    • Vary Processes 2-12 (MPI)
    • Run 10 trials of each and take the average (the machine is mostly idle when not running tests, but this smooths spikes in run times caused by the system doing routine tasks)
  • Make observations about anomalies in the run times where appropriate
  • Caveats
    • All initial runs with no optimization for testing, but hey, this is a class about performance
    • Second set of runs with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference); example build lines follow this list
      • First-level optimization made a huge difference: > 3x improvement
    • GNU optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
    • Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated)
    • Not all optimizations are flag-controlled
    • Regardless of whether the code is written in the most efficient fashion (and it's not), the similarity of the versions means we can still make some runs and observations
  • Oh No moment **
    • Huge improvement in performance with optimized code. Why?
    • What if the improvement in performance (from compiler optimization) was due to the identity matrix? Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was.
    • Came back and made matrix B non-identity: same performance. Whew.
      • I now believe the main performance improvement came from loop unrolling.
    • Ready to make the runs
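
The builds were presumably along these lines (the output names here are illustrative; only the -O1 flag and the source file names come from the slides):

mpicc -O1 matmat_3.c -o matmat_mpi
gcc -O1 -pthread pthread_matrix_21.c -o matmat_pthread
gcc -O1 matmat_no_mp.c -o matmat_serial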
Discussion
  • Please chime in as questions come up.
  • Process Explanation: (After initial testing and verification)
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
  • Attempted a 25,000 x 25,000 matrix
    • Compile error for MPI (exceeded the 2 GB MPI_Bcast limit on matrices of this size); a size sketch follows this list
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
    • Not an issue for POSIX threads (until you run out of memory on the machine and start to swap)
  • Settled on 12 Processes / Threads because of the number of cores available
    • Do you get enhanced or degraded performance by exceeding that number?
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
  • Example of process space / top output (10,000 x 10,000)
    • Early testing, before runs started. Pre Optimization
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
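
A back-of-the-envelope sketch of the size problem above (just the arithmetic, not the project's actual MPI code):

/* Why a 25,000 x 25,000 broadcast blows past 2 GB (sketch only). */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long n        = 25000;
    long long elements = n * n;                                 /* 625,000,000 doubles  */
    long long bytes    = elements * (long long)sizeof(double);  /* 5,000,000,000 bytes  */

    printf("elements = %lld (INT_MAX = %d)\n", elements, INT_MAX);
    printf("bytes    = %lld (about %.1f GB)\n", bytes, bytes / 1e9);

    /* The element count itself still fits in MPI_Bcast's int count argument,
     * but the message is about 5 GB, far past the 2 GB (2^31 - 1 byte) range
     * that int-based MPI internals can describe in one call, so broadcasting
     * a whole matrix of this size in a single MPI_Bcast fails. */
    return 0;
}

The POSIX-threads version only has to allocate the memory (roughly 15 GB for three such matrices), which fits the observation above that it keeps working until the machine runs out of RAM and starts to swap.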
Time Comparison (still boring…)

In all these cases, the times for 5, 4, 3, and 2 processes were much longer than for 6, so they are left off for comparison.

POSIX doesn't "catch" back up until 9 processes

MPI doesn't "catch" back up until 11 processes

1600 x 1600 case
  • Straight C runs long enough to see top output (here I can see the memory usage)
    • Threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix
  • Suspect some kind of boundary issue here, possibly "false sharing"?
  • Process fits entirely in the shared L3 cache (15 MB x 2 = 30 MB)
  • Do the same number of calculations but make the initial array allocations larger (shown below; a code sketch of the change follows the output)

~/SUNY$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs

~/SUNY$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
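
A sketch of the allocation change behind the two runs above (variable names are illustrative; the real code is in the linked pthread/matmat files): the multiply is still 1600 x 1600, but the arrays get a leading dimension of 1601, so consecutive rows are no longer exactly 1600 x 8 = 12,800 bytes apart. Padding the leading dimension like this is a common way to break up cache-set conflicts on the strided column accesses to B, which is one plausible explanation for the roughly 22 s to 13 s drop, whether or not "false sharing" in the strict sense is involved.

/* Sketch of the padded-allocation experiment shown above (names illustrative).
 * The computation is still N x N; only the allocated leading dimension changes. */
#include <stdio.h>
#include <stdlib.h>

#define N   1600   /* logical matrix size used in the multiply                     */
#define LDA 1601   /* allocated leading dimension (set to 1600 for the slow runs)  */

int main(void)
{
    double *A = calloc((size_t)LDA * LDA, sizeof(double));
    double *B = calloc((size_t)LDA * LDA, sizeof(double));
    double *C = calloc((size_t)LDA * LDA, sizeof(double));

    /* fill only the N x N part; the padding row/column is never touched */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i*LDA + j] = (double)(i*N + j + 1);
            B[i*LDA + j] = (i == j) ? 1.0 : 0.0;
        }

    /* same number of calculations as the unpadded version */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i*LDA + k] * B[k*LDA + j];
            C[i*LDA + j] = sum;
        }

    printf("C[0][0] = %.1f  C[N-1][N-1] = %.1f\n", C[0], C[(N-1)*LDA + (N-1)]);
    free(A); free(B); free(C);
    return 0;
}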

Future Directions
  • POSIX Threads with Network memory? (NFS)
  • Combo MPI and POSIX Threads?
    • MPI to multiple machines, then POSIX threads? (see the sketch after this list)
    • http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
    • POSIX threads that launch MPI?
  • Couldn't get MPE running with MPICH (would like to re-investigate why)
  • Investigate optimization techniques
    • Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
    • Rerun with non-identity B matrix and compare times <- DONE
  • Try different languages, e.g. Chapel
  • Try different algorithms
  • Want to add OpenMP to the mix
    • Found this paper on OpenMP vs direct POSIX programming (similar tests)
    • http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
  • For < 6 processes, look at thread affinity and the assignment of threads to physical processors
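
A minimal sketch of the "MPI to multiple machines, then POSIX threads" idea above (hypothetical structure, not working project code): each MPI rank asks for threading support and then fans its share of the work out to local pthreads, so ranks handle the between-machine traffic and threads handle the cores within a machine.

/* Minimal hybrid sketch: one MPI rank per machine, POSIX threads inside
 * each rank.  Hypothetical structure only. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS_PER_RANK 6

static void *worker(void *arg)
{
    long id = (long)arg;
    /* each thread would compute its share of the rank's local rows here */
    printf("thread %ld working\n", id);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;

    /* ask for FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* main thread: receive/broadcast this rank's block of the matrices,
       then hand the local work to threads */
    pthread_t tid[THREADS_PER_RANK];
    for (long t = 0; t < THREADS_PER_RANK; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < THREADS_PER_RANK; t++)
        pthread_join(tid[t], NULL);

    /* main thread would gather the local results back to rank 0 here */
    MPI_Finalize();
    return 0;
}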