
MPI vs POSIX Threads

A Comparison


Overview

  • MPI allows you to run multiple processes on 1 host

    • How would running MPI on 1 host compare with a POSIX threads solution?

  • Attempting to compare MPI vs POSIX run times

  • Hardware

    • Dual 6-core (2 threads per core), 12 logical

      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt

    • Intel Xeon CPU E5-2667 (show schematic)

      • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf

    • 2.96 GHz

    • 15 MB L3 Cache

  • All code / output / analysis available here:

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/


Specifics

  • Going to compare runtimes of code in MPI vs code written using POSIX threads and shared memory

    • Try to make the code as similar as possible, so we're at least comparing apples with oranges and not apples with monkeys

    • Since we are on one machine, the bus carries all the communication traffic; that should make the POSIX and MPI versions comparable (i.e., the network doesn't get involved)

  • Only makes sense with 1 machine

  • Set up test bed

    • Try each step individually, check results, then automate

  • Use the matrix-matrix multiply code we developed over the semester

    • Everyone is familiar with the code and can make observations

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/pthread_matrix_21.c

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_3.c

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_no_mp.c

  • Use square matrices

  • Vary matrix sizes from 500 -> 10,000 square (plus a couple of big ones)

  • Matrix A will be filled with 1 to n², left to right and top down

  • Matrix B will be the identity matrix

    • Can then check our results easily, since A×B = A when B is the identity matrix (see the sketch after this list)

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt

    • Ran all the steps (compile / output results / parsing) many times and checked them by hand before writing the final scripts to do the processing
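
For reference, a minimal standalone sketch of the fill-and-check scheme just described (illustrative only; the real code is in the linked pthread_matrix_21.c / matmat_3.c files, and the variable names here are mine). A gets 1 through n² left to right, top down; B is the identity; the product is verified element for element against A:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 500;                                /* smallest test size */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;
    /* A: 1..n*n left to right, top down; B: identity */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * n + j] = (double)(i * n + j + 1);
            B[i * n + j] = (i == j) ? 1.0 : 0.0;
        }
    /* Naive triple-loop multiply: C = A * B */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    /* With B = identity, C must equal A exactly */
    int ok = 1;
    for (int i = 0; i < n * n; i++)
        if (C[i] != A[i]) { ok = 0; break; }
    printf("check %s\n", ok ? "PASSED" : "FAILED");
    free(A); free(B); free(C);
    return ok ? 0 : 1;
}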


Matrix Sizes

(Table of matrix sizes. Third column: just the number of calculations inside the loop for calculating the matrix elements.)
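
As a sanity check on that column (assuming the naive triple-loop multiply): each of the n² output elements takes n multiplies and n adds, so the loop work grows as roughly 2n³. For n = 500 that is about 2.5 × 10⁸ operations; for n = 10,000 it is about 2 × 10¹², an 8000-fold increase for a 20-fold increase in n.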


Specifics cont.

  • About the runs

    • For each matrix size (500 -> 3000, then 4000, 5000, 6000, 7000, 8000, 9000, 10000)

    • Vary thread count 2-12 (POSIX)

    • Vary Processes 2-12 (MPI)

    • Run 10 trials of each and take the average (the machine was mostly idle when not running tests, but we want to smooth out spikes in run times caused by the system doing routine tasks)

  • Make observations about anomalies in the run times where appropriate

  • Caveats

    • All initial runs were done with no optimization, for testing; but hey, this is a class about performance

    • Second set of runs with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference)

      • First-level optimization made a huge difference: better than a 3× improvement

    • GNU Optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

    • Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated)

    • Not all optimizations are flag controlled

    • Regardless of whether the code is written in the most efficient fashion (and it's not), the versions are similar enough that we can make some runs and observations

  • Oh No moment **

    • Huge improvement in performance with optimized code. Why?

    • What if the improvement in performance (from compiler optimization) was due to the identity matrix? Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was?

    • Came back and made matrix B non-identity: same performance. Whew.

      • I now believe the main performance improvement came from loop unrolling (sketched below).

    • Ready to make the runs
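
To make the loop-unrolling suspicion concrete, here is a hand-written sketch of the transformation (what I believe the compiler is doing; this is not actual GCC output). The unrolled version executes a quarter of the loop branches, and its four partial sums are independent, which keeps the floating-point pipeline busier:

#include <stdio.h>

/* Inner product for one element of C, as written in the source: */
double dot_plain(const double *A, const double *B, int n, int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += A[i * n + k] * B[k * n + j];
    return sum;
}

/* Same loop unrolled by 4 (assumes n % 4 == 0 for brevity): */
double dot_unrolled(const double *A, const double *B, int n, int i, int j) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int k = 0; k < n; k += 4) {
        s0 += A[i * n + k]     * B[k * n + j];
        s1 += A[i * n + k + 1] * B[(k + 1) * n + j];
        s2 += A[i * n + k + 2] * B[(k + 2) * n + j];
        s3 += A[i * n + k + 3] * B[(k + 3) * n + j];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    int n = 8;
    double A[64], B[64];
    /* B: i % 9 == 0 marks (0,0), (1,1), ... (7,7) -- an 8x8 identity */
    for (int i = 0; i < 64; i++) { A[i] = i + 1; B[i] = (i % 9 == 0); }
    printf("plain %.1f  unrolled %.1f\n",
           dot_plain(A, B, n, 2, 3), dot_unrolled(A, B, n, 2, 3));
    return 0;
}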


Discussion

  • Please chime in as questions come up.

  • Process Explanation: (After initial testing and verification)

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt

  • Attempted a 25,000 x 25,000 matrix

    • Compiler error for MPI (the matrix exceeded MPI_Bcast's 2 GB limit; see the MPI sketch after this list)

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt

    • Not an issue for POSIX threads (until you run out of memory on the machine and hit swap)

  • Settled on 12 Processes / Threads because of the number of cores available

    • Do you get enhanced or degraded performance by exceeding that number?

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt

  • Example of process space / top output (10,000 x 10,000)

    • Early testing, before the runs started; pre-optimization

    • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
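
For discussion, a minimal sketch of a row-block MPI structure (my reconstruction, not the actual matmat_3.c; it assumes n is divisible by the number of ranks). Note that MPI_Bcast's count argument is a signed 32-bit int, which is where the 2 GB ceiling above comes from:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs, n = 1200;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int rows = n / nprocs;                 /* rows of A/C per rank */
    double *B    = malloc((size_t)n * n * sizeof *B);
    double *Ablk = malloc((size_t)rows * n * sizeof *Ablk);
    double *Cblk = calloc((size_t)rows * n, sizeof *Cblk);
    double *A = NULL, *C = NULL;
    if (rank == 0) {                       /* root fills A (1..n*n), B (identity) */
        A = malloc((size_t)n * n * sizeof *A);
        C = malloc((size_t)n * n * sizeof *C);
        for (int i = 0; i < n * n; i++) A[i] = (double)(i + 1);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                B[i * n + j] = (i == j) ? 1.0 : 0.0;
    }
    /* Every rank needs all of B; the count n*n must fit in a signed int */
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* Each rank gets a contiguous block of rows of A */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, Ablk, rows * n, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                Cblk[i * n + j] += Ablk[i * n + k] * B[k * n + j];
    MPI_Gather(Cblk, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(B); free(Ablk); free(Cblk);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}

Built with mpicc and run with, e.g., mpirun -np 12 ./a.out to match the 12-process runs above.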


Time Comparison (Boring)


Time Comparison (still boring…)

In all these cases, the times for 5, 4, 3, and 2 processes were much longer than for 6, so they were left off for comparison

POSIX doesn't "catch" back up until 9 threads

MPI doesn't "catch" back up until 11 processes


MPI Time Curve


POSIX Time Curve


POSIX Threads vs MPI Processes Run Times, Matrix Sizes 4000×4000 – 10,000×10,000


POSIX Threads, Matrix Sizes 1500×1500 – 2500×2500


1600 x 1600 case

  • Straight C runs long enough to see top output (here I can see the memory usage)

    • The threaded, MPI, and non-MP code share the same basic structure for calculating the C matrix

  • Suspect some kind of boundary issue here, possibly “false sharing”?

  • Process fits entirely in the shared L3 cache (15 MB × 2 = 30 MB)

  • Do the same number of calculations but make the initial array allocations larger (shown below; a code sketch follows the output)

[rahnb1@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)

foreach? ./a.out

foreach? end

Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs

Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs

Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs

Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs

Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs

[rahnb1@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)

foreach? ./a.out

foreach? end

Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs

Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs

Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs

Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs

Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs

[rahnb1@rage ~/SUNY]$
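
A minimal sketch of the padding trick behind the two runs above (illustrative; pad = 1 reproduces the 1601-wide allocation). The computation still covers only n × n elements; only the row stride changes. With stride = 1600 the column walk through B advances 12,800 bytes per step, which repeatedly maps accesses onto the same cache sets; stride = 1601 staggers them. Whether the slow case was false sharing or plain set conflicts is the open question in the bullet above:

#include <stdlib.h>

/* Allocate an (n + pad) x (n + pad) buffer for n x n data */
double *alloc_padded(int n, int pad) {
    return calloc((size_t)(n + pad) * (n + pad), sizeof(double));
}

/* n x n multiply, rows laid out with the given stride */
void matmul_strided(const double *A, const double *B, double *C,
                    int n, int stride) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * stride + k] * B[k * stride + j];
            C[i * stride + j] = sum;
        }
}

int main(void) {
    int n = 1600, pad = 1;          /* pad = 0 reproduces the slow run */
    int stride = n + pad;
    double *A = alloc_padded(n, pad), *B = alloc_padded(n, pad),
           *C = alloc_padded(n, pad);
    if (!A || !B || !C) return 1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i * stride + j] = (double)(i * n + j + 1);
            B[i * stride + j] = (i == j) ? 1.0 : 0.0;
        }
    matmul_strided(A, B, C, n, stride);
    free(A); free(B); free(C);
    return 0;
}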


Future Directions

  • POSIX Threads with Network memory? (NFS)

  • Combo MPI and POSIX Threads?

    • MPI to multiple machines, then POSIX threads?

    • http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads

    • POSIX threads that launch MPI?

  • Couldn't get MPE running with MPICH (would like to re-investigate why)

  • Investigate optimization techniques

    • Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO

    • Rerun with non-identity B matrix and compare times <- DONE

  • Try different languages, e.g., Chapel

  • Try different algorithms

  • Want to add OpenMP to the mix

    • Found this paper on OpenMP vs direct POSIX programming (similar tests)

    • http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf

  • For < 6 threads, look at thread affinity and the assignment of threads to physical processors (see the sketch below)
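
For that last bullet, a minimal Linux sketch using glibc's nonportable pthread_setaffinity_np (illustrative, not from the project code; how logical CPU numbers map to physical cores is topology-dependent, so the even-CPU mapping here is an assumption):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 6

void *worker(void *arg) {
    long id = (long)arg;
    /* ... this thread's slice of the matrix work would go here ... */
    printf("thread %ld on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) {
        pthread_create(&tid[i], NULL, worker, (void *)i);
        /* Pin thread i to an even-numbered logical CPU: on many 2-way
         * SMT layouts this puts each thread on its own physical core.
         * (Pinning after creation leaves a brief unpinned window.) */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2 * (int)i, &set);
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}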

