
CS4402 – Parallel Computing


Presentation Transcript


  1. CS4402 – Parallel Computing Lecture 7 - Simple Parallel Sorting. - Parallel Merge Sort.

  2. Sorting Rearrange the elements of a = (a[i], i = 0, 1, …, n-1) into increasing or decreasing order. Several sequential algorithms are available: - Internal sorting swaps elements in place and uses no extra memory. - External sorting uses extra memory, e.g. linear (counting) sort. - Complexity O(n^2) for simple, non-optimal algorithms: counting, bubble, sequential insertion, etc. - Complexity O(n log n) for optimal algorithms: quick, merge, binary insertion, etc. - Linear complexity O(n) when the array has special properties.

  3. Rank Sort – Naive O(n^2) Description
  for (i = 0; i < n; i++) {
      for (rank[i] = 0, j = 0; j < n; j++) {
          // strict comparison assumes distinct elements; equal elements
          // would collide on the same rank without an index tie-break
          if (a[i] > a[j]) rank[i]++;
      }
  }
  for (i = 0; i < n; i++) {
      b[rank[i]] = a[i];
  }

  4. Rank Sort – MPI Implementation Using a parallel machine with size processors. Some remarks: • Each processor must know the whole array. • The counting process is partitioned across the processors. • Each processor ranks only a chunk of the array. • Processor rank generates the array ranking = (ranking[i], i = 0, …, n/size-1), where ranking[i] is the rank of a[rank*n/size + i] in a. • The ranking arrays are gathered and then the sorted array b is restored.

  5. MPI_Rank_sort(int n, int * a, int root, MPI_Comm comm) This MPI function must perform the following steps: • Bcast the whole array to all processors. • Generate the array ranking = (ranking[i], i = 0, …, n/size-1). • Gather the ranking arrays on processor root. • If root then generate/restore the sorted array b. Question: can we avoid the serial restore step? If yes, at what price?

  6. Linear Sort: Suppose that the array a = (a[i], i = 0, …, n-1) contains only integers in {0, 1, …, m-1}. In this case we can count how many times each value j = 0, 1, …, m-1 occurs in a, then reuse these counts to regenerate the array in order. Example: a = (2,1,3,2,1,3,0,1,1,2,0,3,1) gives count[0] = 2, count[1] = 5, count[2] = 3, count[3] = 3, so a is restored with two 0s, five 1s, three 2s and three 3s: a = (0,0,1,1,1,1,1,2,2,2,3,3,3).

  7. Linear Sort:
  // reset the counters
  for (j = 0; j < m; j++) count[j] = 0;
  // generate the counters
  for (i = 0; i < n; i++) count[a[i]]++;
  // restore the array in order based on the counters
  i = 0;  // reset the write index before the restore pass
  for (j = 0; j < m; j++)
      for (k = 0; k < count[j]; k++)
          a[i++] = j;
  Complexity is O(n + m).

  8. MPI_Linear_sort(int n, int * a, int m, int root, MPI_Comm comm) The MPI routine should perform the following steps: • Scatter the array a onto the processors. • Count on the scattered sub-arrays. • Sum-reduce the count arrays across the processors. • If root then restore the array. The linear complexity makes this computation perhaps unsuitable for parallel computation: there is little work to save relative to the communication cost.

  9. Bucket Sort: Suppose that the array a = (a[i], i = 0, …, n-1) has all its elements in the interval [0, amax]. Use m buckets / collectors to partition the elements by range, then sort each bucket.

  10. Bucket Sort:
  // empty the buckets
  for (j = 0; j < m; j++) bucket[j] = empty;
  // sweep the array and collect the elements in the correct bucket;
  // with values in [0, amax] and m buckets of width amax/m,
  // the bucket index is a[i] / (amax/m)
  for (i = 0; i < n; i++) {
      bucket_id = (int) (a[i] / (amax / m));
      push(a[i], bucket[bucket_id]);
  }
  // sort all buckets
  for (j = 0; j < m; j++) sort(bucket[j]);
  // append all buckets
  for (j = 0; j < m; j++) push(a, bucket[j]);

  11. MPI_Bucket_sort(int n, int * a, int m, int root, MPI_Comm comm) The MPI routine should perform the following steps: • Bcast the array a to all processors. • Processor rank collects the elements of bucket rank from the array. • Bucket rank is sorted. • The sorted buckets are then gathered to root.

  12. Strategies for || Algorithms: Divide and Conquer

  13. Main Strategies to Develop Parallel Algorithms Partitioning, Divide and Conquer, Pipelining, etc. Partitioning is the most popular strategy: - the problem is split into several parts; - the results are combined to obtain the final result. Data is partitioned  domain decomposition. Computation is partitioned  functional decomposition. Remarks: - Embarrassingly parallel computation uses partitioning. - The simplest partitioning is when # processors = # parts. Divide and Conquer is recursive partitioning that continues until the parts are small enough.

  14. The Summation Problem Find the sum of the array x[0], x[1], …, x[n-1] using m processors. The sequential solution uses n-1 additions. The array is divided into m sub-arrays. Each processor computes a partial sum. All the partial sums are collected by the master to find the final sum. Important question: how to make the communication efficient? 1. send / receive routines. 2. scatter / reduce routines. 3. divide and conquer?

  15. More Elements Suppose that there are p = 2^q processors. The D&C tree has the following elements: • It has q+1 = log(p)+1 levels. • The active nodes on level l are the ranks divisible by p/2^l. • The receiver nodes on level l are the active nodes whose quotient rank / (p/2^l) is odd. • An active node sends half of its data to rank + p/2^(l+1).

  16. For the processor P(rank) we work with: - P(rank) is active on level l if rank % (p/pow(2,l)) == 0. - P(rank) is a receiver on level l if it is active && (rank / (p/pow(2,l))) is odd. - If active, P(rank) sends half of its data to P(rank + p/pow(2,l+1)).

  17. The Algorithm
  Step 1. Top-Bottom: for l = 0, 1, 2, …, q-1
      if rank is a receiver then receive n/pow(2,l) data from rank - p/pow(2,l)
      if rank is active then send n/pow(2,l+1) data to rank + p/pow(2,l+1)
  Step 2. Computation: find the summation of the local array.
  Step 3. Bottom-Top: for l = q-1, …, 2, 1, 0
      if rank is active then
          receive received_sum from rank + p/pow(2,l)
          find sum = sum + received_sum
      if rank is a sender then send sum to rank - p/pow(2,l)

  18. The Program
  /*----------------------------- STAGE 1 - TOP->DOWN --------------------*/
  // using the D&C tree, scatter the array onto the processors
  for (level = 0; level <= q; level++) {
      // receivers get their chunk from the parent
      if (isReceiver(rank, size, level) && level > 0)
          MPI_Recv(a, n/(int)pow(2, level), MPI_DOUBLE,
                   rank - size/(int)pow(2, level), 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      // active nodes pass the upper half of their chunk to the child
      if (isActive(rank, size, level) && level < q)
          MPI_Send(&a[n/(int)pow(2, level+1)], n/(int)pow(2, level+1),
                   MPI_DOUBLE, rank + size/(int)pow(2, level+1), 0,
                   MPI_COMM_WORLD);
  }

  19. The Program
  /*----------------------------- STAGE 3 - DOWN->TOP --------------------------*/
  for (level = q; level >= 0; level--) {
      // active nodes receive a partial sum from the right-hand child
      if (isActive(rank, size, level) && level < q) {
          MPI_Recv(&tmpSum, 1, MPI_DOUBLE,
                   rank + size/(int)pow(2, level+1), 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          s += tmpSum;
      }
      // senders pass the accumulated sum up to the parent
      if (isSender(rank, size, level) && level > 0)
          MPI_Send(&s, 1, MPI_DOUBLE,
                   rank - size/(int)pow(2, level), 0, MPI_COMM_WORLD);
  }

  20. Parallel Merge Sort (1) Parallel Merge Sort uses the D&C tree to sort in parallel. The stages of the D&C computation are as follows: Stage 1. The array is scattered / communicated through the tree from root to leaves. Stage 2. The leaves sort their small sub-arrays. Stage 3. The sorted arrays are gathered / merged / communicated through the tree from leaves to root. Each interior node of the tree: a. receives an array from its right-hand-side child; b. merges the received array with its local array; c. sends the new array to its father, if it is a sender.

  21. Parallel Merge Sort (2)
  Step 1. Top-Bottom: for l = 0, 1, 2, …, q-1
      if rank is a receiver then receive n/pow(2,l) data from rank - p/pow(2,l)
      if rank is active then send n/pow(2,l+1) data to rank + p/pow(2,l+1)
  Step 2. Computation: sort the local array of n/size elements.
  Step 3. Bottom-Top: for l = q-1, …, 2, 1, 0
      if rank is active then
          receive the array from rank + p/pow(2,l)
          merge the local array with the received array
      if rank is a sender then send the local array over to rank - p/pow(2,l)
