
Parallel Programming with PVM




  1. Parallel Programming with PVM
     Chris Harper

  2. The Choice of PVM
     • An existing implementation of the most obvious way to set up a parallel cluster
       • Every node runs a daemon, and any daemon can execute a program on the cluster by messaging the other daemons
     • Reasonably well supported and widespread

  3. Setup Problems
     • PVM was difficult to set up
     • Unix novices
     • Remote shell through PVM was blocked for unknown reasons
     • Group schedule and lab access limitations

  4. Solutions
     • Use SSH instead of RSH
       • Less convenient and harder to roll out the installation
       • But it worked and is more secure (not that security mattered)
     • Install Fedora and PVM on a home computer under VMware to increase access to the platform
       • More time to write code and troubleshoot problems

  5. as root (su -):
       install pvm:
         > yum install pvm.i386
       set env vars: in /etc/profile, append:
         PVM_ROOT=/usr/share/pvm3
         PVM_ARCH=LINUX
         PVM_RSH=/usr/bin/ssh
         export PVM_ROOT PVM_ARCH PVM_RSH
       in /root/.bashrc, append:
         PVM_ROOT=/usr/share/pvm3
       log out and back in for the settings to take effect
       test env vars:
         > echo $PVM_ROOT
       test start pvm:
         > pvm
       create a public key for ssh (on the master machine):
         > ssh-keygen -t rsa    (use default settings)
       copy the key to the slaves:
         > scp /root/.ssh/id_rsa.pub root@machine_name:/root/.ssh/authorized_keys
         (or, if there are multiple masters, scp to a temp file and then cat it onto authorized_keys)
       might need to restart
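
To verify the whole installation from the console, it helps to compile and run a tiny PVM program. The sketch below is a minimal check (not part of the presentation) that uses only standard pvm3.h calls; the compile line is an assumption and may need adjusting to the local install paths.

  /* hello_pvm.c - minimal check that the local pvmd is reachable.
   * Possible build line (paths depend on the install):
   *   gcc hello_pvm.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3 -o hello_pvm
   */
  #include <stdio.h>
  #include <pvm3.h>

  int main(void)
  {
      int mytid = pvm_mytid();            /* enroll this process with the local pvmd */
      if (mytid < 0) {
          pvm_perror("pvm_mytid");        /* e.g. pvmd not running or env vars unset */
          return 1;
      }

      int nhost, narch;
      struct pvmhostinfo *hosts;
      pvm_config(&nhost, &narch, &hosts); /* query the virtual machine configuration */
      printf("Task t%x sees %d host(s) in the virtual machine\n", mytid, nhost);

      pvm_exit();                         /* leave PVM cleanly */
      return 0;
  }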

  6. The Program: Parallel Merge/Quicksort Hybrid
     • Why use a mergesort to test parallel execution?
       • Simple
       • The size of the data partitions can be controlled (though they are usually just divided into equal parts)
       • Any kind of sort can be used on the data partitions that the slaves receive

  7. Algorithm
     • The random list is partitioned into N parts, where N is the greatest power of two less than the number of available (or selected) nodes
     • The list parts are sent to N slaves in sequence
     • Each slave quicksorts the received part and returns it to the master
     • The master merges pairs of sorted parts until the final sorted list is achieved
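
A rough master-side sketch of this scheme follows. It is illustrative only: the slave executable name ("slave1"), the message tag, and the packing order are assumptions rather than the code that produced the output on the later slides, and the final pairwise merge runs serially on the master exactly as described above.

  /* Master-side sketch of the partition / quicksort / merge scheme.
   * Assumes a slave binary "slave1" (hypothetical name) that receives a length
   * and that many ints, quicksorts them, and sends them back with the same tag.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <pvm3.h>

  #define TAG_DATA 1

  /* merge two sorted runs a[0..na) and b[0..nb) into dst */
  static void merge(int *dst, const int *a, int na, const int *b, int nb)
  {
      int i = 0, j = 0, k = 0;
      while (i < na && j < nb) dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
      while (i < na) dst[k++] = a[i++];
      while (j < nb) dst[k++] = b[j++];
  }

  int main(void)
  {
      int n = 10000000, nparts = 4;              /* n divides evenly in this sketch */
      int len = n / nparts;
      int *data = malloc(n * sizeof(int));
      int *tmp  = malloc(n * sizeof(int));
      for (int i = 0; i < n; i++) data[i] = rand();

      int tids[4];
      if (pvm_spawn("slave1", NULL, PvmTaskDefault, "", nparts, tids) < nparts) {
          fprintf(stderr, "failed to spawn %d slaves\n", nparts);
          return 1;
      }

      for (int p = 0; p < nparts; p++) {         /* send each slave its partition */
          pvm_initsend(PvmDataDefault);
          pvm_pkint(&len, 1, 1);
          pvm_pkint(data + p * len, len, 1);
          pvm_send(tids[p], TAG_DATA);
      }

      for (int p = 0; p < nparts; p++) {         /* collect the sorted partitions */
          pvm_recv(tids[p], TAG_DATA);
          pvm_upkint(data + p * len, len, 1);
      }

      /* serial pairwise merges on the master until one sorted list remains */
      for (int width = len; width < n; width *= 2) {
          for (int pos = 0; pos < n; pos += 2 * width) {
              int na = (pos + width <= n) ? width : n - pos;
              int nb = (pos + 2 * width <= n) ? width : n - pos - na;
              if (nb > 0) merge(tmp + pos, data + pos, na, data + pos + na, nb);
              else        memcpy(tmp + pos, data + pos, na * sizeof(int));
          }
          int *swap = data; data = tmp; tmp = swap;
      }

      pvm_exit();
      free(data); free(tmp);
      return 0;
  }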

  8. Problems with Algorithm
     • The final merge is still executed serially
       • As the parallel part of the algorithm decreases execution time, the serial part's share of the runtime grows
       • As such, this algorithm does not scale up well
     • Requires a power-of-2 number of nodes
     • Ignores multiple processors on a node
       • Unless running on a multiprocessor machine with no other machines in the PVM
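
The scaling limit is the usual Amdahl's law bound: if a fraction p of the run (the per-partition sorts) is spread over N nodes while the remaining 1 - p (the final merge) stays serial, the speedup is at most S(N) = 1 / ((1 - p) + p/N), which can never exceed 1 / (1 - p) however many nodes are added.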

  9. [root@gemini LINUX]# ./master1 10000000 4
     PARALLEL SORT
     Usage: executable [# items [# tasks]]
       # items - length of array to sort or -1 for default
       # tasks - force this many tasks, -1 for default, or 0 for serial
     Alloc 10000000 items ... OK
     Memory Use: 19.07 MB
     Randomizing ... OK
     1 nodes available
     4 tasks selected
     Using 4 parallel parts
       part: 0, pos: 0, len: 2500000, left: 7500000
       part: 1, pos: 2500000, len: 2500000, left: 5000000
       part: 2, pos: 5000000, len: 2500000, left: 2500000
       part: 3, pos: 7500000, len: 2500000, left: 0
     Spawning 4 worker tasks ... OK
     Sending data to slave tasks ... OK
     Task 262152 (gemini), Part 0 returned. Took 1.668 secs.
     Task 262153 (gemini), Part 1 returned. Took 1.667 secs.
     Task 262154 (gemini), Part 2 returned. Took 1.709 secs.
     Task 262155 (gemini), Part 3 returned. Took 1.688 secs.
     Elapsed Spawn Time: 0.003 secs
     Elapsed Tx Overhead Time: 0.328 secs
     Elapsed Rx Overhead Time: 0.258 secs
     Elapsed Total Comm Overhead Time: 0.585 secs
     Elapsed Parallel Time: 2.238 secs
     Elapsed Program Time: 2.241 secs

  10. Second Attempt: Recursion in Parallel
      • Make a list of all the available tasks and their associated nodes
      • Every slave divides the list of available tasks (nodes) in half, keeping the first half and giving the second half to the first node in the second half
      • Similarly, every slave divides its received unsorted list in half, keeping the first half and giving the second half to that same node
        • Until the node pool is depleted
      • After each task finishes sorting its half of the list, it waits to receive the second half back (if it split earlier) and merges it with the first half
      • The merged, sorted list is returned to the parent task
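
A rough slave-side sketch of this recursive split is below. To keep it short it assumes every task was pre-spawned and the list of remaining task ids travels along with the data, whereas the presented master2/slave2 spawn tasks as they recurse (as the output on the next slides shows); the message tags and packing order are likewise assumptions.

  /* Slave-side sketch of the recursive scheme (illustrative, not the presented code).
   * Assumed message layout: remaining-task count, their tids, item count, items.
   * The slave keeps the first half of both the task pool and the data, forwards the
   * second half, quicksorts its part, merges in the returned half, and sends it back.
   */
  #include <stdlib.h>
  #include <pvm3.h>

  #define TAG_WORK   1
  #define TAG_RESULT 2

  static int cmp_int(const void *a, const void *b)
  {
      int x = *(const int *)a, y = *(const int *)b;
      return (x > y) - (x < y);
  }

  int main(void)
  {
      int bufid = pvm_recv(-1, TAG_WORK);        /* work arrives from master or another slave */
      int bytes, tag, src;
      pvm_bufinfo(bufid, &bytes, &tag, &src);    /* remember who sent the work */

      int ntids, n;
      pvm_upkint(&ntids, 1, 1);
      int *tids = malloc(ntids * sizeof(int));
      pvm_upkint(tids, ntids, 1);
      pvm_upkint(&n, 1, 1);
      int *data = malloc(n * sizeof(int));
      pvm_upkint(data, n, 1);

      int child = -1, keep = n;
      if (ntids > 1) {                           /* split task pool and data in half */
          int mid = ntids / 2, rest = ntids - mid;
          int rlen = n - n / 2;
          child = tids[mid];
          keep  = n / 2;
          pvm_initsend(PvmDataDefault);
          pvm_pkint(&rest, 1, 1);
          pvm_pkint(tids + mid, rest, 1);
          pvm_pkint(&rlen, 1, 1);
          pvm_pkint(data + keep, rlen, 1);
          pvm_send(child, TAG_WORK);
      }

      qsort(data, keep, sizeof(int), cmp_int);   /* local sort of the kept half */

      if (child >= 0) {                          /* wait for the child's half, then merge */
          int rlen = n - keep;
          int *other  = malloc(rlen * sizeof(int));
          int *merged = malloc(n * sizeof(int));
          pvm_recv(child, TAG_RESULT);
          pvm_upkint(other, rlen, 1);
          int i = 0, j = 0, k = 0;
          while (i < keep && j < rlen)
              merged[k++] = (data[i] <= other[j]) ? data[i++] : other[j++];
          while (i < keep) merged[k++] = data[i++];
          while (j < rlen) merged[k++] = other[j++];
          data = merged;
      }

      pvm_initsend(PvmDataDefault);              /* return the sorted block upward */
      pvm_pkint(data, n, 1);
      pvm_send(src, TAG_RESULT);
      pvm_exit();
      return 0;
  }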

  11. More Problems
      • Latency while one task waits for the second half of the list to return
        • Fixed by giving the other task less than half of the list
        • Unfortunately, this reduces parallelism
      • The first recursive task has to duplicate the list in memory; the recursion should start within the spawning (master) task
      • Still works best with a power-of-2 number of nodes

  12. [root@sagitarius LINUX]# ./master2 10000000 4
      PARALLEL SORT
      Usage: executable [# items [# tasks]]
        # items - length of array to sort or -1 for default
        # tasks - force this many tasks, -1 for default, or 0 for serial
      Alloc 10000000 items ... OK
      Memory Use: 19.07 MB
      Randomizing ... OK
      4 nodes available
      4 tasks selected
      Task 0 Host: sagitarius
      Task 1 Host: virgo
      Task 2 Host: pisces-r
      Task 3 Host: leo
      Spawning root slave task ... OK
      Sending data to slave ... OK
      [sagitarius (0)]: Spawning Task: 2 (pisces-r)
      [sagitarius (0)]: Spawning Task: 1 (virgo)
      [pisces-r (2)]: Spawning Task: 3 (leo)
      [leo (3)]: Done. Returning. Times - Wait: 0.000 Comm: 0.010 Calc: 0.522 Slave: 0.784
      [virgo (1)]: Done. Returning. Times - Wait: 0.000 Comm: 0.034 Calc: 1.882 Slave: 3.172
      [pisces-r (2)]: Done. Returning. Times - Wait: 0.008 Comm: 0.054 Calc: 1.936 Slave: 3.320
      [sagitarius (0)]: Done. Returning. Times - Wait: 0.064 Comm: 0.277 Calc: 6.775 Slave: 7.905
      Elapsed Comm Overhead Time: 0.498 secs
      Elapsed Wait Overhead Time: 0.064 secs
      Elapsed Calculation Time: 8.657 secs
      Elapsed Program Time: 8.713 secs

  13. [root@localhost LINUX]# ./master2 25 2
      PARALLEL SORT
      Usage: executable [# items [# tasks]]
        # items - length of array to sort or -1 for default
        # tasks - force this many tasks, -1 for default, or 0 for serial
      Alloc 25 items ... OK
      Memory Use: 0.00 MB
      Randomizing ... OK
      13602 20334 6641 11971 9234 4357 23162 31256 26410 19358 1770 9206 7125
      32040 14798 1441 19603 11567 30520 21757 28494 18730 11100 16631 4147
      1 nodes available
      2 tasks selected
      Task 0 Host: localhost
      Task 1 Host: localhost
      Spawning root slave task ... OK
      Sending data to slave ... OK
      [localhost (0)]: Spawning Task: 1 (localhost)
      [localhost (1)]: Done. Returning. Times - Wait: 0.000 Comm: 0.000 Calc: 0.000 Slave: 0.007
      [localhost (0)]: Done. Returning. Times - Wait: 0.008 Comm: 0.000 Calc: 0.000 Slave: 0.025
      1441 1770 4147 4357 6641 7125 9206 9234 11100 11567 11971 13602 14798
      16631 18730 19358 19603 20334 21757 23162 26410 28494 30520 31256 32040
      Elapsed Comm Overhead Time: 0.000 secs
      Elapsed Wait Overhead Time: 0.008 secs
      Elapsed Calculation Time: 0.000 secs
      Elapsed Program Time: 0.050 secs

  14. Future Optimizations
      • Implement a method to detect how many processors a single machine has and treat each processor as a viable node
      • Use new threads to wait for data returned from spawned tasks
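
For the first item, one common way to count processors on Linux is sysconf(); the snippet below is only a sketch of that idea and is not part of the presented code.

  /* Count the processors visible on this node (illustrative; Linux/glibc). */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      long ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* online processors on this host */
      if (ncpu < 1) ncpu = 1;                      /* fall back to one on error */
      printf("This node could host %ld worker task(s)\n", ncpu);
      return 0;
  }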

  15. Conclusion
      • Sorting data is not a good use of parallel hardware
      • PVM is better for:
        • Algorithms that run with almost completely independent parts
        • Algorithms that require a lot of computation for not much data
