By Sumit Malhotra Computer Science, Florida Tech 767050340 Dr. Charles Fulton

Optimization of Loop Unrolling on Dense Vector-Matrix Multiplication - Parallel Processing

Aim of project
• To find the best loop unrolling parameters for different numbers of processors on a 5120 X 5120 matrix.

Algorithm for Matrix Multiplication

    m = n = 5120;
    for (i = 0; i < local_m; i += UNROLL2) {
        for (j = 0; j < n; j += UNROLL) {
            matrix multiplication;
        }
    }

Where UNROLL2 and UNROLL are the loop unrolling parameters for the row loop (i) and the column loop (j), local_m = m/p, and p = number of processors. Therefore the size of the matrix on each processor is local_m x n.

Size of matrix on each processor
Size of matrix when p=1 : 5120 X 5120
Size of matrix when p=2 : 2560 X 5120
Size of matrix when p=4 : 1280 X 5120
Size of matrix when p=8 : 640 X 5120
Where p = number of processors.
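The block sizes above follow directly from local_m = m/p. A minimal sketch in C, assuming p divides m exactly (true for m = 5120 and p = 1, 2, 4, 8). The function name local_rows is hypothetical and does not come from the original code.

```c
#include <assert.h>

/* Hypothetical helper: number of matrix rows owned by each of p processes
   when an m x n matrix is split into equal row blocks.
   Assumes p > 0 and that p divides m exactly. */
int local_rows(int m, int p)
{
    assert(p > 0 && m % p == 0);
    return m / p;
}
```

For example, local_rows(5120, 8) gives 640, so each process holds a 640 X 5120 block.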

Sample Code

    UNROLL2 = UNROLL = 2;
    for (i = 0; i < local_m; i += UNROLL2) {
        for (j = 0; j < n; j += UNROLL) {
            y[i]   += local_A[i][j]   * x[j] + local_A[i][j+1]   * x[j+1];
            y[i+1] += local_A[i+1][j] * x[j] + local_A[i+1][j+1] * x[j+1];
        }
    }
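One way to check an unrolled kernel is to compare it against a plain, non-unrolled loop on a small input. The sketch below is an illustration under stated assumptions, not the original program: it stores the local block as a flat row-major array (the slides index a 2-D array local_A[i][j]), accumulates in local variables instead of y[i] directly, and uses the hypothetical names matvec_plain and matvec_unroll_2x2. Note that the slide code uses +=, so it relies on y being zero before the loops start; the sketch writes each y[i] explicitly instead.

```c
#include <assert.h>
#include <stddef.h>

/* Reference kernel: y = A * x with no unrolling.
   A is a flat row-major array of local_m * n doubles (an assumption). */
void matvec_plain(size_t local_m, size_t n,
                  const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < local_m; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}

/* 2 x 2 unrolled kernel (UNROLL2 = UNROLL = 2).
   Requires local_m and n to be even. */
void matvec_unroll_2x2(size_t local_m, size_t n,
                       const double *A, const double *x, double *y)
{
    assert(local_m % 2 == 0 && n % 2 == 0);
    for (size_t i = 0; i < local_m; i += 2) {
        const double *r0 = A + i * n;        /* row i     */
        const double *r1 = A + (i + 1) * n;  /* row i + 1 */
        double s0 = 0.0, s1 = 0.0;           /* running sums for the two rows */
        for (size_t j = 0; j < n; j += 2) {
            s0 += r0[j] * x[j] + r0[j + 1] * x[j + 1];
            s1 += r1[j] * x[j] + r1[j + 1] * x[j + 1];
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
}
```

Because the unrolled version adds terms in a different order, results on general floating-point data may differ from the plain loop in the last bits; comparing with a small tolerance is the safer check there.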

Sample Code

    UNROLL2 = 2, UNROLL = 4.
    for (i = 0; i < local_m; i += UNROLL2) {
        for (j = 0; j < n; j += UNROLL) {
            y[i]   += local_A[i][j]     * x[j]
                    + local_A[i][j+1]   * x[j+1]
                    + local_A[i][j+2]   * x[j+2]
                    + local_A[i][j+3]   * x[j+3];
            y[i+1] += local_A[i+1][j]   * x[j]
                    + local_A[i+1][j+1] * x[j+1]
                    + local_A[i+1][j+2] * x[j+2]
                    + local_A[i+1][j+3] * x[j+3];
        }
    }
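The sample code works as written because 5120 and every local_m in this project are multiples of the unrolling factors. For other sizes an unrolled kernel needs cleanup loops for the leftover rows and columns. The following is a sketch of that idea for UNROLL2 = 2 and UNROLL = 4, assuming a flat row-major array; the name matvec_unroll_2x4 and the cleanup loops are illustrative additions, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* 2 x 4 unrolled y = A * x with cleanup loops, so local_m and n
   need not be multiples of the unrolling factors.
   A is a flat row-major array of local_m * n doubles (an assumption). */
void matvec_unroll_2x4(size_t local_m, size_t n,
                       const double *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    size_t m_main = local_m - local_m % 2;  /* rows handled two at a time     */
    size_t n_main = n - n % 4;              /* columns handled four at a time */

    for (size_t i = 0; i < m_main; i += 2) {
        const double *r0 = A + i * n;       /* row i     */
        const double *r1 = r0 + n;          /* row i + 1 */
        double s0 = 0.0, s1 = 0.0;
        size_t j;
        for (j = 0; j < n_main; j += 4) {
            s0 += r0[j]     * x[j]     + r0[j + 1] * x[j + 1]
                + r0[j + 2] * x[j + 2] + r0[j + 3] * x[j + 3];
            s1 += r1[j]     * x[j]     + r1[j + 1] * x[j + 1]
                + r1[j + 2] * x[j + 2] + r1[j + 3] * x[j + 3];
        }
        for (; j < n; j++) {                /* leftover columns */
            s0 += r0[j] * x[j];
            s1 += r1[j] * x[j];
        }
        y[i] = s0;
        y[i + 1] = s1;
    }

    for (size_t i = m_main; i < local_m; i++) {  /* leftover row, at most one */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        y[i] = s;
    }
}
```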

Time Calculation

    Start = clock();
    Multiplication code;   // Computation.
    MPI_Gather();          // Communication.
    End = clock();
    Total Computation + Communication Time = End - Start;

    Start = clock();
    MPI_Gather();          // Communication.
    End = clock();
    Communication Time = End - Start;

    Start = clock();
    MPI_Scatter();         // Communication.
    End = clock();
    Scatter Time = End - Start;

clock() returns processor time in clock ticks; divide each difference by CLOCKS_PER_SEC to obtain seconds.
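The timing pattern above can be wrapped in small helpers. This is a minimal sketch using only the standard C clock() function; the names elapsed_seconds and time_call are hypothetical. Keep in mind that clock() measures processor time used by the calling process rather than wall-clock time, so time spent waiting inside a communication call may be under-counted; MPI also provides MPI_Wtime() for wall-clock seconds, which the slides do not use.

```c
#include <assert.h>
#include <time.h>

/* Convert two clock() readings into elapsed CPU seconds (end minus start). */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical helper: run work(arg) once and return the CPU seconds it took.
   work could wrap the multiplication code, a gather, or a scatter. */
double time_call(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    clock_t start = clock();
    work(arg);
    clock_t end = clock();
    return elapsed_seconds(start, end);
}
```

With these helpers, the computation time alone can be estimated as the total (computation plus gather) time minus the separately measured gather time.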

Conclusion - Result Table
