
MA/CS 471



Presentation Transcript


  1. MA/CS 471 Lecture 16, Fall 2002 Introduction to PSPASES

  2. Review Team Project Continued • Now we are ready to progress towards making the serial Poisson solver work in parallel. • This task divides into a number of steps: • Conversion of umDriver, umMESH, umStartUp, umMatrix and umSolve • Adding a routine to read in a partition file (or call parMetis to obtain a partition vector)

  3. Review umDriver modification • This code should now initialize MPI • This code should call the umPartition routine • It should be modified to find the number of processors and the local processor ID (stored in your struct/class) • This code should finalize MPI
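A minimal sketch of this umDriver flow follows; the um* calls are left as comments since their exact signatures are up to your own struct/class design.

/* Minimal umDriver skeleton: initialize MPI, record rank/size, finalize. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
  int rank, size;

  MPI_Init(&argc, &argv);                 /* initialize MPI        */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* local processor ID    */
  MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processors  */

  /* store rank/size in your mesh struct, then follow the serial
     calling sequence with the parallel versions, e.g.:
     umPartition(...); umMESH(...); umStartUp(...); umSolve(...);   */

  printf("process %d of %d finished\n", rank, size);

  MPI_Finalize();                         /* finalize MPI          */
  return 0;
}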

  4. Review umPartition • This code should read in a partition from a file • The input should be the name of the partition file, the current process ID (rank) and the number of processes (size) • The output should be the list of elements belonging to this process
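A sketch of umPartition, under the assumption that the partition file simply lists one integer per line, namely the rank that owns each element; your partitioner's file format may differ, so adjust the parsing accordingly.

/* umPartition sketch: return the global element numbers owned by 'rank'. */
#include <stdio.h>
#include <stdlib.h>

int *umPartition(const char *filename, int rank, int size,
                 int Nelements, int *Nlocal){
  FILE *fp = fopen(filename, "r");
  int *mine = (int*) malloc(Nelements*sizeof(int));
  int e, owner, cnt = 0;

  if(fp == NULL){ fprintf(stderr, "cannot open %s\n", filename); exit(1); }

  for(e = 0; e < Nelements; ++e){
    if(fscanf(fp, "%d", &owner) != 1) break;   /* one owner rank per element */
    if(owner == rank) mine[cnt++] = e;         /* keep elements owned here   */
  }
  fclose(fp);

  *Nlocal = cnt;      /* number of elements resident on this process */
  return mine;        /* list of global element numbers owned here   */
}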

  5. Review umMESH Modifications • This routine should now be fed a partition file determining which elements it should read in from the .neu input mesh file • You should replace the elmttoelmt part with a piece of code that goes through the .neu file, reads in which element/face lies on the boundary, and uses this to mark whether a node is known or unknown • Each process should send a list of its “known” vertices’ global numbers to every other process so all nodes can be correctly identified as lying on the boundary or not
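One way to do that exchange is with an all-gather of the known vertices' global numbers. The sketch below assumes each process has its list in knownGlobal; the routine name and argument layout are placeholders for your own data structures.

/* Gather every process's "known" (boundary) global vertex numbers onto all processes. */
#include <mpi.h>
#include <stdlib.h>

void exchangeKnown(int Nknown, int *knownGlobal,
                   int *NtotalKnown, int **allKnown){
  int size, p;
  int *counts, *displs;

  MPI_Comm_size(MPI_COMM_WORLD, &size);

  counts = (int*) malloc(size*sizeof(int));
  displs = (int*) malloc(size*sizeof(int));

  /* everyone learns how many known vertices each process has */
  MPI_Allgather(&Nknown, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

  displs[0] = 0;
  for(p = 1; p < size; ++p) displs[p] = displs[p-1] + counts[p-1];
  *NtotalKnown = displs[size-1] + counts[size-1];

  /* gather all known global vertex numbers onto every process */
  *allKnown = (int*) malloc((*NtotalKnown)*sizeof(int));
  MPI_Allgatherv(knownGlobal, Nknown, MPI_INT,
                 *allKnown, counts, displs, MPI_INT, MPI_COMM_WORLD);

  free(counts); free(displs);
}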

  6. Review umStartUp modification • Remains largely unchanged (depending on how you read in umVertX, umVertY, elmttonode).

  7. Review umMatrix modification • This routine should be modified so that instead of creating the mat matrix it is fed a vector vec and returns mat*vec • IT SHOULD NOT STORE THE GLOBAL MATRIX AT ALL!! • I strongly suggest creating a new routine (umMatrixOP) and, as a debugging check, comparing its output against the result of using umMatrix to build the matrix and multiply the same vector
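A minimal sketch of that debugging comparison, assuming a dense mat[i][j] array produced by the serial umMatrix and an output vector opResult already computed by your umMatrixOP; the function name and argument layout here are illustrative only.

/* Compare an explicit mat*vec against the matrix-free result; return the worst entry-wise difference. */
#include <math.h>
#include <stdlib.h>

double compareMatVec(int Nnodes, double **mat, double *vec, double *opResult){
  double *ref = (double*) calloc(Nnodes, sizeof(double));
  double maxdiff = 0.0;
  int i, j;

  for(i = 0; i < Nnodes; ++i)
    for(j = 0; j < Nnodes; ++j)
      ref[i] += mat[i][j]*vec[j];            /* reference mat*vec */

  for(i = 0; i < Nnodes; ++i)
    if(fabs(ref[i]-opResult[i]) > maxdiff)
      maxdiff = fabs(ref[i]-opResult[i]);    /* worst entry-wise difference */

  free(ref);
  return maxdiff;
}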

  8. Review umSolve modification • The major change here is the replacement of umAinvB with a call to your own conjugate gradient solver • Note – the rhs vector is filled up here with a global gather of the elemental contributions, so this will have to be modified to account for the elements resident on other processes.

  9. Review umCG modification • umCG is the routine which should take a rhs and return an approximate solution using CG. • Each step of the CG algorithm needs to be analyzed to determine its inter-process data dependencies • For the matrix*vector steps a certain amount of data swapping is required • For the dot products an allreduce is required. • I strongly suggest creating the exchange sequence before the iterations start.
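A sketch of what umCG might look like, assuming each process owns a disjoint block of Nlocal vector entries (so local partial sums add up to the global dot product) and that a matvec routine performs the element-wise product plus the neighbour data swap; the signatures are assumptions, not a required interface.

/* Distributed conjugate gradient sketch: allreduce for dot products, user-supplied matvec. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

double parallelDot(int Nlocal, const double *a, const double *b){
  double local = 0.0, global = 0.0;
  int i;
  for(i = 0; i < Nlocal; ++i) local += a[i]*b[i];
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}

void umCG(int Nlocal, const double *rhs, double *x,
          void (*matvec)(const double*, double*), double tol, int maxit){
  double *r  = (double*) malloc(Nlocal*sizeof(double));
  double *p  = (double*) malloc(Nlocal*sizeof(double));
  double *Ap = (double*) malloc(Nlocal*sizeof(double));
  double alpha, beta, rdotr, rdotrOld;
  int i, it;

  /* zero initial guess: x = 0, r = p = rhs */
  for(i = 0; i < Nlocal; ++i){ x[i] = 0.0; r[i] = rhs[i]; p[i] = rhs[i]; }
  rdotr = parallelDot(Nlocal, r, r);                 /* allreduce            */

  for(it = 0; it < maxit && sqrt(rdotr) > tol; ++it){
    matvec(p, Ap);                                   /* needs the data swap  */
    alpha = rdotr/parallelDot(Nlocal, p, Ap);        /* allreduce            */
    for(i = 0; i < Nlocal; ++i){ x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
    rdotrOld = rdotr;
    rdotr    = parallelDot(Nlocal, r, r);            /* allreduce            */
    beta     = rdotr/rdotrOld;
    for(i = 0; i < Nlocal; ++i) p[i] = r[i] + beta*p[i];
  }

  free(r); free(p); free(Ap);
}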

  10. Review Work Partition • Here’s the deal – there are approximately six unequal chunks of work to be done. • I suggest the following split of the code: • umDriver, umCG • umPartition, umSolve • umMESH, umStartUp • umMatrixOP • However, you are free to choose. • Try to minimize the amount of data stored on multiple processes (but do not make the task too difficult by sharing nothing at all)

  11. Review Discussion and Project Write-Up • This is a little tricky, so now is the time to form a plan and to ask any questions. • This will be due on Tuesday 22nd October • As usual I need a complete write-up. • This should include parallel timings and speed-up tests (i.e. for a fixed grid, find the wall-clock time of umCG for Nprocs = 2, 4, 6, 8, 10, 12, 14, 16 and compare in a graph) • Test the code to make sure it is giving the same results (up to convergence tolerance) as the serial code • Profile your code using upshot • Include pictures showing the partition (use a different colour per partition) and the parallel solution.
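A minimal sketch of how the wall-clock measurement could be taken, assuming the solve is wrapped in a parameterless function; the wrapper and its name are illustrative, not part of the class code.

/* Time a solve with MPI_Wtime and report the slowest process (the true wall-clock time). */
#include <mpi.h>
#include <stdio.h>

double timeSolve(void (*solve)(void)){
  int rank, size;
  double t0, elapsed, maxElapsed = 0.0;

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Barrier(MPI_COMM_WORLD);   /* start everyone together              */
  t0 = MPI_Wtime();
  solve();                       /* e.g. your umCG call on a fixed grid  */
  elapsed = MPI_Wtime() - t0;

  /* wall-clock time = slowest process */
  MPI_Reduce(&elapsed, &maxElapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if(rank == 0)
    printf("Nprocs=%d  wall clock = %g s\n", size, maxElapsed);
  return maxElapsed;
}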

  12. New Approach • In the project we ditched the idea of using a direct solver (i.e. Cholesky or LU factorization) • However, there has been a certain amount of effort directed towards creating automated, parallel factorization routines. • One example is the PSPASES library by Joshi, Karypis and Kumar from CS, UMN and Gupta and Gustavson from IBM. http://www-users.cs.umn.edu/~mjoshi/pspases/ http://www-users.cs.umn.edu/~mjoshi/pspases/download.html

  13. PSPASES: An Efficient and Scalable Parallel Sparse Direct Solver • Their ideas encompass: • Suppose we wish to solve Ax=B • Where A is a sparse matrix • Suppose A is sparse enough to let us store all its entries • In addition suppose A is symmetric, positive definite • First create a permutation matrix P such that the matrix A' = PAP^T has a Cholesky factorization A' = LL^T, where L has the minimum number of non-zero entries in addition to the original lower-triangular portion of A' • Create a tree sequence for back solving in parallel
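Written out in the same notation for a single right hand side b, the factor-and-solve steps just listed amount to:

\begin{align*}
  A' &= P A P^{T} = L L^{T} && \text{fill-reducing ordering and Cholesky factorization}\\
  L\,y &= P\,b              && \text{parallel forward solve}\\
  L^{T} z &= y              && \text{parallel back solve}\\
  x &= P^{T} z              && \text{undo the permutation}
\end{align*}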

  14. Example – Showing Possible Parallelism Suppose we wish to back-solve the following lower-triangular system Ly = b:
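For concreteness, one possible such system is shown below. It is an illustration only: the entries and the process assignment (P0 owning rows 1 and 3, P1 owning rows 2 and 4) are assumptions chosen to match the discussion on the next slide.

\[
  \begin{pmatrix}
    \ell_{11} & 0         & 0         & 0\\
    0         & \ell_{22} & 0         & 0\\
    \ell_{31} & 0         & \ell_{33} & 0\\
    \ell_{41} & \ell_{42} & 0         & \ell_{44}
  \end{pmatrix}
  \begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4 \end{pmatrix}
  =
  \begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4 \end{pmatrix}
\]

Here y_1 and y_2 are decoupled, y_3 = (b_3 - \ell_{31} y_1)/\ell_{33} needs only data already resident on P0, and y_4 = (b_4 - \ell_{41} y_1 - \ell_{42} y_2)/\ell_{44} needs y_1 to be sent from P0 to P1.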

  15. Modified Example – Showing Possible Parallelism Notice that we are able to solve for y1 and y2 at the same time since they are decoupled in the system. So we could have two processes crunching at the same time for this phase. In the second phase, Process 0 (P0) has to send y1 to P1 in order for P1 to complete the computation of y4. However, P1 can crunch while it is waiting for y1 and complete y4 when it arrives.

  16. Solving Sequence (figure: elimination tree split between Proc. 0 and Proc. 1 – Phase 1 solves for unknowns 1 and 2, Phase 2 for unknown 3, Phase 3 for unknown 4)

  17. In Practice • In practice this appears to be quite a complicated method. • However, since the hard work has already been done we are going to use the PSPASES library • Idea: we are going to compare the time it takes to solve the finite element Poisson problem using CG versus the time it takes for PSPASES to do the same. • As part of this benchmarking we should also compare memory usage of the two methods and the scaling with increasing numbers of processes

  18. Project Part 3 • Change of emphasis: • We will restrict ourselves to 1, 2, 4, 8 or 16 processors to accommodate the restriction of PSPASES to power-of-two process counts • Next – a new version of the class code should be created, with modifications to be outlined in the next slides

  19. Project Part 3 • If the .neu file has N nodes then each process will load in a segment of N/Nprocs nodes (one process should mop up the remaining nodes if N is not divisible by Nprocs) • Each process will load in all the elements containing the nodes in its list • Each process will construct a sparse representation of the rows of the matrix corresponding to its node list • Then go through the calling sequence in PSPASES to construct the parallel factorization of the matrix (PSPASES will automatically partition the data using parmetis itself) • Next each process constructs the right hand side (rhs) for its resident nodes. • Then each process calls the PSPASES backsolve routines.
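A sketch of the N/Nprocs node split, assuming the usual convention that rowdist[p] is the first global node owned by process p and rowdist[Nprocs] = N; check the PSPASES manual for its exact requirements on this array.

/* Split N global nodes into contiguous blocks; the last process mops up the remainder. */
#include <stdlib.h>

int *buildRowDist(int N, int Nprocs){
  int *rowdist = (int*) malloc((Nprocs+1)*sizeof(int));
  int chunk = N/Nprocs;          /* nodes per process (integer division) */
  int p;

  for(p = 0; p < Nprocs; ++p) rowdist[p] = p*chunk;
  rowdist[Nprocs] = N;           /* last process absorbs the remainder   */

  return rowdist;                /* process p owns nodes rowdist[p] .. rowdist[p+1]-1 */
}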

  20. PSPASES Calling Sequence

// computes the Cholesky factorization with a minimized-fill ordering
DPSPACEF(rowdist, aptrs, ainds, avals, options, doptions, &pspcommF, &comm);

// performs the Cholesky back solve for nrhs right hand sides
DPSPACET(rowdistbx, &nrhs, b, &ldb, x, &ldx, options, &pspcommF, &comm);

// for details see the user manual: PSPASES – Scalable Parallel Direct Solver
// Library for Sparse Symmetric Positive Definite Linear Systems

  21. Sparse Matrix Storage Format • Go over this in class – exact details in user manual
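Purely to fix ideas before class: below is a generic compressed-row layout for the locally owned rows of a sparse matrix. This is a sketch of the general idea only; the exact conventions for the PSPASES arrays (rowdist, aptrs, ainds, avals) are spelled out in the user manual.

/* Generic compressed-row storage for a local 3x4 block:
       [ 2 0 1 0 ]
       [ 0 3 0 0 ]
       [ 0 0 4 5 ]                                                  */
#include <stdio.h>

int main(void){
  int    rowStart[4] = {0, 2, 3, 5};       /* first nonzero of each row */
  int    colIndex[5] = {0, 2, 1, 2, 3};    /* column of each nonzero    */
  double values[5]   = {2.0, 1.0, 3.0, 4.0, 5.0};
  int r, k;

  for(r = 0; r < 3; ++r)
    for(k = rowStart[r]; k < rowStart[r+1]; ++k)
      printf("row %d, col %d, value %g\n", r, colIndex[k], values[k]);

  return 0;
}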
