
MPI vs POSIX Threads


Presentation Transcript


  1. MPI vs POSIX Threads: A Comparison

  2. Overview
  • MPI allows you to run multiple processes on 1 host
  • How would running MPI on 1 host compare with a similar POSIX thread solution?
  • Attempting to compare MPI vs POSIX run times
  • Hardware
    • Dual 6-core (2 threads per core), 12 logical
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
    • Intel Xeon CPU E5-2667 (show schematic)
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
    • 2.96 GHz
    • 15 MB L3 cache, shared, 2.5 MB per core
  • All code / output / analysis available here:
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/

  3. About the Time Trials
  • Going to compare run times of code written in MPI vs code written using POSIX threads and shared memory
  • Try to make the code as similar as possible, so we're comparing apples with oranges and not apples with monkeys
  • Since we are on 1 machine, the bus is doing all the comm traffic; that should make the POSIX and MPI versions similar (i.e. network latency isn't the weak link)
    • So this analysis only makes sense on 1 machine
  • Use the matrix-matrix multiply code we developed over the semester
    • Everyone is familiar with the code and can make observations
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/pthread_matrix_21.c
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_3.c
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_no_mp.c
  • Use square matrices
    • Not necessary, but it made things more convenient
  • Vary matrix sizes from 500 -> 10,000 square (plus a couple of bigger ones)
  • Matrix A will be filled with 1-n, left to right and top down
  • Matrix B will be the identity matrix
    • Can then check our results easily, since A*B = A when B is the identity matrix
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
  • Ran all processes (i.e. compile / output result / parsing) many times and checked before writing final scripts to do the processing
  • Set up test bed
    • Try each step individually, check results, then automate

  4. Specifics cont.
  • About the runs
    • For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
    • Vary thread count 2-12 (POSIX)
    • Vary process count 2-12 (MPI)
    • Run 10 trials of each and take the average (the machine is mostly idle when not running tests, but we want to smooth spikes in run times caused by the system doing routine tasks)
    • With later runs I ran 12 trials, dropped the high and low, then took the average
    • Try to make observations about anomalies in the run times where appropriate
  • Caveats
    • All initial runs with no optimization, for testing (but hey, this is a class about performance)
    • Second set of runs with optimization turned on: -O1 (note: -O2 and -O3 made no appreciable difference)
    • First-level optimization made a huge difference: > 3x improvement
    • GNU optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
    • Built with just the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated)
    • Not all optimizations are flag-controlled
    • Regardless of whether the code is written in the most efficient fashion (and it's not), because of the similarity we can make some runs and observations
  • "Oh no" moment
    • Huge improvement in performance with optimized code. Why?
    • I now believe the main performance improvement came from loop unrolling.
    • Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was?
    • Came back and made matrix B non-identity: same performance. Whew.
  • OK, ready to make the runs

  5. Discussion
  • Please chime in as questions come up.
  • Process explanation (after initial testing and verification):
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
    • top -d .1 (tap 1 to show the CPU list, tap H to show threads)
  • Attempted a 25,000 x 25,000 matrix
    • Compiler error for MPI (exceeded the MPI_Bcast 2 GB limit on matrices)
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
    • Not an issue for POSIX threads (until you run out of memory on the machine and hit swap)
  • Settled on 12 processes / threads because of the number of cores available
    • Do you get enhanced or degraded performance by exceeding that number?
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
  • Example of process space / top output (10,000 x 10,000)
    • Early testing, before runs started, pre-optimization
      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
    • Use top -d t (t in floating-point secs; Linux), hit the "1" key to see the list of cores
  • Take a look at some numbers
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optmized-400-3000_ave.xlsx
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optimized-4000-10000_ave.xlsx
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optmized-400-3000_ave.xlsx
    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optimized-4000-8000_ave.xlsx

  6. Time Comparison

  7. Time Comparison
  • In all these cases, the times for 5, 4, 3, and 2 processes were much longer than for 6, so they were left off for comparison
  • POSIX doesn't "catch" back up until 9 processes
  • MPI doesn't "catch" back up until 11 processes

  8. MPI Time Curve

  9. POSIX Time Curve

  10. POSIX Threads vs MPI Processes Run Times, Matrix Sizes 4000 x 4000 – 10,000 x 10,000

  11. POSIX Threads 1500 x 1500 – 2500 x 2500

  12. MPI 1500 x 1500 – 1800 x 1800
  • Notice MPI didn't exhibit the same problem at size 1600 as the POSIX and no-MP cases

  13. POSIX & no-MP 1600 x 1600 case
  • Straight C runs long enough to see top output (here I can see the memory usage)
  • The threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix
  • Suspect some kind of boundary issue here, possibly "false sharing"?
    • Process fits entirely in the shared L3 cache: 15 MB x 2 = 30 MB
  • Do the same number of calculations but make the initial array allocations larger (shown below):

    [rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
    foreach? ./a.out
    foreach? end
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
    Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs

    [rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
    foreach? ./a.out
    foreach? end
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
    Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs

  14. Notes / Future Directions
  • Start the MPI timer after communication. Is comms the sole source of difference? <- TESTED: NO
    • At the boundary conditions, the driving force is the amount of memory allocated on the heap,
      not the number of calculations being performed
  • Intel had a nice article about false sharing:
    • https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
    • Links to a product they sell for detecting false sharing on their processors
  • Combine MPI and POSIX threads?
    • MPI to multiple machines, then POSIX threads within each?
      • http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
  • Found this paper on OpenMP vs direct POSIX programming (similar tests):
    • http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
  • Couldn't get MPE running with MPICH (would like to re-investigate why)
  • Investigate optimization techniques
    • Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
    • Rerun with a non-identity B matrix and compare times <- DONE
  • Try different languages, e.g. Chapel
  • Try different algorithms
  • For < 6 processes, look at thread affinity and the assignment of threads to a physical processor
    • There is no guarantee that with 6 or fewer processes they will all reside on the same physical processor
    • Noticed CPU switching occasionally
    • Setting the affinity can mitigate this: the thread is assigned and not "allowed" to move

  15. Notes / Future Directions cont.
  • Notice the shape of the curves for both the MPI and POSIX solutions. There is definitely a point of diminishing returns: around 6 in this particular case.
  • Instead of using 12 cores, could we cut the problem set in half and launch 2 independent 6-process solutions by declaring thread affinity?
    • Would this produce better results?
    • How would we merge the 2 process spaces?
