
Introduction to ScaLAPACK



  1. Introduction to ScaLAPACK

  2. ScaLAPACK • ScaLAPACK is for solving dense linear systems and computing eigenvalues for dense matrices

  3. ScaLAPACK • Scalable Linear Algebra PACKage • Development team from the University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc. • Support in Commercial Packages • NAG Parallel Library • IBM PESSL • CRAY Scientific Library • VNI IMSL • Fujitsu, HP/Convex, Hitachi, NEC

  4. Description • A Scalable Linear Algebra Package (ScaLAPACK) • A library of linear algebra routines for MPP (massively parallel processor) and NOW (network of workstations) machines • Language: Fortran • Dense Matrix Problem Solvers • Linear Equations • Least Squares • Eigenvalue

  5. Technical Information • ScaLAPACK web page • http://www.netlib.org/scalapack • ScaLAPACK User’s Guide • Technical Working Notes • http://www.netlib.org/lapack/lawn

  6. Packages • LAPACK • Routines are single-PE linear algebra solvers, utilities, etc. • BLAS • Single-PE work-horse routines (vector, matrix-vector & matrix-matrix) • PBLAS • Parallel counterparts of the BLAS. Relies on the BLACS for blocking and moving data.

  7. Packages (cont.) • BLACS • Constructs 1-D & 2-D processor grids for matrix decomposition and nearest neighbor communication patterns. Uses message passing libraries (MPI, PVM, MPL, NX, etc) for data movement.

  8. Dependencies & Parallelism of Component Packages [diagram: the serial components (LAPACK, BLAS) and the parallel components (ScaLAPACK, PBLAS, BLACS)]

  9. Data Partitioning • Initialize BLACS routines • Set up 2-D mesh (mp, np) • Set column- or row-major order (CorR) • Call returns context handle (icon) • Get row & column through gridinfo (myrow, mycol) • blacs_gridinit( icon, CorR, mp, np ) • blacs_gridinfo( icon, mp, np, myrow, mycol )
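
A minimal sketch of the grid-setup sequence above, assuming an MPI-based BLACS and an illustrative 2 x 3 grid (run with at least 6 processes; the program name and grid shape are placeholders for this example, not part of the original slides):

  program blacs_grid_demo
  ! Minimal BLACS setup: build a 2 x 3 process grid in row-major order,
  ! query this process's coordinates, then shut the grid down.
  implicit none
  integer :: icon, iam, nprocs, myrow, mycol
  integer, parameter :: mp = 2, np = 3       ! grid rows, grid columns

  call blacs_pinfo( iam, nprocs )            ! my process id, total processes
  call blacs_get( 0, 0, icon )               ! get the default system context
  call blacs_gridinit( icon, 'R', mp, np )   ! 'R' = row-major ordering (CorR)
  call blacs_gridinfo( icon, mp, np, myrow, mycol )

  if (myrow /= -1) then                      ! -1 means this process is not in the grid
     print *, 'process', iam, ' is at grid position (', myrow, ',', mycol, ')'
     call blacs_gridexit( icon )
  end if
  call blacs_exit( 0 )
  end program blacs_grid_demo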

  10. Data Partitioning (cont.) • Block-cyclic Array Distribution. • Block: (groups of sequential elements) • Cyclic: (round-robin assignment to PEs) • Example 1-D Partitioning: • Global 9-element array distributed on a 3-PE grid with 2 element blocks: block(2)-cyclic(3)
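
To make the 1-D mapping concrete, here is a small plain-Fortran sketch (no library calls; the sizes are the ones from the example above) that prints which PE owns each element of the 9-element array under a block(2)-cyclic(3) distribution:

  program block_cyclic_1d
  ! Owner of global element i (1-based) under a block-cyclic distribution:
  !    pe = mod( (i-1)/nb, npes )
  implicit none
  integer, parameter :: n = 9, nb = 2, npes = 3
  integer :: i

  do i = 1, n
     print *, 'global element', i, ' -> PE', mod( (i-1)/nb, npes )
  end do
  end program block_cyclic_1d

Elements 1-2 land on PE 0, 3-4 on PE 1, 5-6 on PE 2, 7-8 wrap back to PE 0, and 9 on PE 1.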

  11. Block-Cyclic Partitioning

  12. 2-D Partitioning • Example 2-D Partitioning • A(9,9) mapped onto PE(2,3) grid block(2,2) • Use 1-D example to distribute column elements in each row, a block(2)-cyclic(3) distribution. • Assign rows to first PE dimension by block(2)-cyclic(2) distribution.

  13. [Figure: a 9 x 9 global matrix with 2 x 2 blocks distributed block(2,2)-cyclically onto a 2 x 3 PE grid, showing which global elements end up on each of PE(0,0) through PE(1,2)]
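
The figure's layout can be reproduced by applying the 1-D rule to rows and columns independently. A short plain-Fortran sketch, assuming the figure's sizes (9 x 9 matrix, 2 x 2 blocks, 2 x 3 PE grid):

  program block_cyclic_2d
  ! Owning PE (pr,pc) of each global element a(i,j): apply the 1-D
  ! block-cyclic rule to the row index and the column index separately.
  implicit none
  integer, parameter :: m = 9, n = 9, mb = 2, nb = 2, nprow = 2, npcol = 3
  integer :: i, j, pr, pc

  do i = 1, m
     do j = 1, n
        pr = mod( (i-1)/mb, nprow )   ! grid row owning global row i
        pc = mod( (j-1)/nb, npcol )   ! grid column owning global column j
        print *, 'a(', i, ',', j, ') -> PE(', pr, ',', pc, ')'
     end do
  end do
  end program block_cyclic_2d

For example, a(1,1) and a(2,2) land on PE(0,0), a(1,3) on PE(0,1), and a(3,1) on PE(1,0), matching the figure.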

  14. Syntax for descinit
  descinit( idesc, m, n, mb, nb, i, j, icon, mla, ier )
  IorO  arg    Description
  OUT   idesc  Descriptor
  IN    m      Row Size (Global)
  IN    n      Column Size (Global)
  IN    mb     Row Block Size
  IN    nb     Column Block Size
  IN    i      Starting Grid Row
  IN    j      Starting Grid Column
  IN    icon   BLACS context
  IN    mla    Leading Dimension of Local Matrix
  OUT   ier    Error number
  The starting grid location is usually (i=0, j=0).
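
A sketch of descinit used together with NUMROC to size the local array, again assuming the 9 x 9 / 2 x 2-block / 2 x 3-grid example from the earlier slides (program and variable names here are illustrative):

  program descinit_demo
  ! Build a ScaLAPACK descriptor for a 9 x 9 global matrix with 2 x 2
  ! blocks on a 2 x 3 process grid, starting at grid position (0,0).
  implicit none
  integer, parameter :: m = 9, n = 9, mb = 2, nb = 2, nprow = 2, npcol = 3
  integer :: icon, iam, nprocs, myrow, mycol
  integer :: mla, locc, idesc(9), ier
  integer, external :: numroc

  call blacs_pinfo( iam, nprocs )
  call blacs_get( 0, 0, icon )
  call blacs_gridinit( icon, 'R', nprow, npcol )
  call blacs_gridinfo( icon, nprow, npcol, myrow, mycol )

  if (myrow /= -1) then
     ! NUMROC gives the number of local rows/columns this PE stores;
     ! the local row count also serves as the local leading dimension (mla).
     mla  = numroc( m, mb, myrow, 0, nprow )
     locc = numroc( n, nb, mycol, 0, npcol )

     call descinit( idesc, m, n, mb, nb, 0, 0, icon, max(1,mla), ier )
     print *, 'PE(', myrow, ',', mycol, ') stores', mla, ' x', locc, '  info =', ier

     call blacs_gridexit( icon )
  end if
  call blacs_exit( 0 )
  end program descinit_demo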

  15. Application Program Interfaces (API) • Drivers • Solves a Complete Problem • Computational Components • Performs Tasks: LU factorization, etc. • Auxiliary Routines • Scaling, Matrix Norm, etc. • Matrix Redistribution/Copy Routine • Matrix on PE grid1 -> Matrix on PE grid2

  16. API (cont.) • Prepend a P to the LAPACK equivalent name: PXYYZZZ, where X = data type, YY = matrix type, ZZZ = computation performed • Data types (X): S = real, D = double, C = complex, Z = double complex • Example: PSGESV = P + S (real) + GE (general matrix) + SV (linear equation solve)

  17. Matrix Types (YY) in PXYYZZZ
  DB  General Band (diagonally dominant-like)
  DT  General Tridiagonal (diagonally dominant-like)
  GB  General Band
  GE  GEneral matrices (e.g., unsymmetric, rectangular, etc.)
  GG  General matrices, Generalized problem
  HE  Complex HErmitian
  OR  ORthogonal Real
  PB  Positive definite Banded (symmetric or Hermitian)
  PO  POsitive definite (symmetric or Hermitian)
  PT  Positive definite Tridiagonal (symmetric or Hermitian)
  ST  Symmetric Tridiagonal Real
  SY  SYmmetric
  TR  TRiangular (possibly quasi-triangular)
  TZ  TrapeZoidal
  UN  UNitary complex

  18. Routine types: Drivers (ZZZ) in PXYYZZZ
  SL   Linear Equations (SVX*)
  SV   LU Solver
  VD   Singular Value
  EV   Eigenvalue (EVX**)
  GVX  Generalized Eigenvalue
  *  Expert driver: estimates condition numbers, provides error bounds, equilibrates the system.
  ** Computes selected eigenvalues.

  19. Rules of Thumb • Use a square processor grid, mp = np • Use block size 32 (Cray) or 64 for efficient matrix multiply • Number of PEs to use = M x N / 10**6 (e.g., about 16 PEs for a 4000 x 4000 matrix) • Bandwidth/node (MB/sec) > peak rate (MFLOPS)/10 • Latency < 500 usecs • Use the vendor's BLAS

  20. Example program: Linear Equation Solution Ax=b (page 1)

  program Sample_ScaLAPACK_program
  implicit none
  include "mpif.h"
  REAL :: A(1000,1000)
  REAL, DIMENSION(:,:), ALLOCATABLE :: A_local
  INTEGER, PARAMETER :: M_A=1000, N_A=1000
  INTEGER :: DESC_A(1:9)
  REAL, DIMENSION(:,:), ALLOCATABLE :: b_local
  REAL :: b(1000,1)
  INTEGER, PARAMETER :: M_B=1000, N_B=1
  INTEGER :: DESC_b(1:9), i, j
  INTEGER :: ldb_y
  INTEGER :: ipiv(1:1000), info
  INTEGER :: myrow, mycol   ! row and column # of PE on PE grid
  INTEGER :: ictxt          ! context handle.....like an MPI communicator
  INTEGER :: lda_y, lda_x   ! leading dimensions of the local array A_local
  ! row and column # of PE grid
  INTEGER, PARAMETER :: nprow=2, npcol=2
  !
  ! RSRC...the PE row that owns the first row of A
  ! CSRC...the PE column that owns the first column of A

  21. Example program (page 2)

  INTEGER, PARAMETER :: RSRC = 0, CSRC = 0
  INTEGER, PARAMETER :: MB = 32, NB = 32
  !
  ! iA...the global row index of A, which points to the beginning
  ! of the submatrix that will be operated on.
  !
  INTEGER, PARAMETER :: iA = 1, jA = 1
  INTEGER, PARAMETER :: iB = 1, jB = 1
  DOUBLE PRECISION :: starttime, endtime   ! MPI_WTIME returns double precision
  INTEGER :: taskid, numtask, ierr
  !**********************************
  ! NOTE: ON THE NPACI MACHINES YOU DO NOT NEED TO CALL INITBUFF
  ! ON OTHER MACHINES YOU MAY NEED TO CALL INITBUFF
  !*************************************
  !
  ! This is where you should define any other variables
  !
  INTEGER, EXTERNAL :: NUMROC
  !
  ! Initialization of the processor grid. This routine
  ! must be called
  !
  ! start MPI initialization routines

  22. Example program (page 3)

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numtask, ierr)
  ! start timing call on the 0,0 processor
  IF (taskid==0) THEN
     starttime = MPI_WTIME()
  END IF
  CALL BLACS_GET(0, 0, ictxt)
  CALL BLACS_GRIDINIT(ictxt, 'r', nprow, npcol)
  !
  ! This call returns processor grid information that will be
  ! used to distribute the global array
  !
  CALL BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol)
  !
  ! define lda_x and lda_y with NUMROC
  ! lda_x...the leading dimension of the local array storing the
  ! ........local blocks of the distributed matrix A
  !
  lda_x = NUMROC(M_A, MB, myrow, 0, nprow)
  lda_y = NUMROC(N_A, NB, mycol, 0, npcol)
  !
  ! resizing A so we don't waste space
  !

  23. Example program (page 4)

  ALLOCATE( A_local(lda_x, lda_y) )
  if (N_B < NB) then
     ALLOCATE( b_local(lda_x, N_B) )
  else
     ldb_y = NUMROC(N_B, NB, mycol, 0, npcol)
     ALLOCATE( b_local(lda_x, ldb_y) )
  end if
  !
  ! defining global matrix A
  !
  do i = 1, 1000
     do j = 1, 1000
        if (i == j) then
           A(i,j) = 1000.0*i*j
        else
           A(i,j) = 10.0*(i+j)
        end if
     end do
  end do
  ! defining the global array b
  do i = 1, 1000
     b(i,1) = 1.0
  end do
  !

  24. Example program (page 5)

  ! subroutine setarray maps the global arrays to the local arrays
  !
  ! YOU MUST HAVE DEFINED THE GLOBAL MATRIX BEFORE THIS POINT
  !
  CALL SETARRAY(A, A_local, b, b_local, myrow, mycol, lda_x, lda_y)
  !
  ! initialize descriptor vectors for ScaLAPACK
  !
  CALL DESCINIT(DESC_A, M_A, N_A, MB, NB, RSRC, CSRC, ictxt, lda_x, info)
  CALL DESCINIT(DESC_b, M_B, N_B, MB, N_B, RSRC, CSRC, ictxt, lda_x, info)
  !
  ! call ScaLAPACK routines
  !
  CALL PSGESV(N_A, 1, A_local, iA, jA, DESC_A, ipiv, b_local, iB, jB, DESC_b, info)
  !
  ! check ScaLAPACK returned ok
  !

  25. Example program (page 6)

  if (info .ne. 0) then
     print *, 'unsuccessful return...something is wrong! info=', info
  else
     print *, 'Successful return'
  end if
  CALL BLACS_GRIDEXIT( ictxt )
  CALL BLACS_EXIT( 1 )   ! nonzero: we still need MPI for timing and MPI_FINALIZE
  ! call timing routine on processor 0,0 to get endtime
  IF (taskid==0) THEN
     endtime = MPI_WTIME()
     print *, 'wall clock time in seconds = ', endtime - starttime
  END IF
  ! end MPI
  CALL MPI_FINALIZE(ierr)
  !
  ! end of main
  !
  END

  26. Example program (page 7)

  SUBROUTINE SETARRAY(AA, A, BB, B, myrow, mycol, lda_x, lda_y)
  ! copies the global arrays AA and BB into the
  ! block-cyclically distributed local arrays A and B
  !
  implicit none
  REAL :: AA(1000,1000)
  REAL :: BB(1000,1)
  INTEGER :: lda_x, lda_y
  REAL :: A(lda_x, lda_y)
  REAL :: ll, mm, Cr, Cc
  INTEGER :: ii, jj, I, J, myrow, mycol, pr, pc, h, g
  INTEGER, PARAMETER :: nprow = 2, npcol = 2
  INTEGER, PARAMETER :: N=1000, M=1000, NB=32, MB=32, RSRC=0, CSRC=0
  REAL :: B(lda_x, 1)
  INTEGER, PARAMETER :: N_B = 1

  27. Example program (page 8)

  do I = 1, M
     do J = 1, N
        ! finding out which PE gets this I,J element
        Cr = real( (I-1)/MB )
        h = RSRC + aint(Cr)
        pr = mod( h, nprow )
        Cc = real( (J-1)/NB )      ! use the column block size NB here
        g = CSRC + aint(Cc)
        pc = mod( g, npcol )       ! wrap over the grid columns, npcol
        !
        ! check if on this PE and then set A
        !
        if (myrow == pr .and. mycol == pc) then
           ! ii,jj coordinates of the local array element
           ! ii = x + l*MB
           ! jj = y + m*NB
           !
           ll = real( (I-1)/(nprow*MB) )
           mm = real( (J-1)/(npcol*NB) )
           ii = mod(I-1, MB) + 1 + aint(ll)*MB
           jj = mod(J-1, NB) + 1 + aint(mm)*NB
           A(ii,jj) = AA(I,J)
        end if
     end do
  end do

  28. Example Program (page 9)

  ! reading in and distributing the B vector
  !
  do I = 1, M
     J = 1
     ! finding out which PE gets this I,J element
     Cr = real( (I-1)/MB )
     h = RSRC + aint(Cr)
     pr = mod( h, nprow )
     Cc = real( (J-1)/NB )          ! use the column block size NB here
     g = CSRC + aint(Cc)
     pc = mod( g, npcol )           ! wrap over the grid columns, npcol
     ! check if on this PE and then set B
     if (myrow == pr .and. mycol == pc) then
        ll = real( (I-1)/(nprow*MB) )
        mm = real( (J-1)/(npcol*NB) )
        jj = mod(J-1, NB) + 1 + aint(mm)*NB
        ii = mod(I-1, MB) + 1 + aint(ll)*MB
        B(ii,jj) = BB(I,J)
     end if
  end do
  end subroutine
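
Note that the example hard-wires a 2 x 2 process grid (nprow = npcol = 2), so it expects to be run on exactly 4 MPI processes. A typical build-and-run sequence uses an MPI Fortran wrapper such as mpif90 followed by mpirun -np 4, linking against the system's ScaLAPACK, BLACS, and BLAS libraries; the exact wrapper and library names vary by machine, so check the local documentation.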

  29. [Figure: PxGEMM performance (Mflops) vs. matrix size N = 2000 to 10000 on 2x2, 4x4, and 8x8 process grids, for the Cray T3E, IBM SP2, and the Berkeley NOW]

  30. [Figure: PxGESV performance (Mflops) vs. matrix size N = 2000 to 15000 on 1x4, 2x8, and 4x16 process grids, for the Cray T3E, IBM SP2, and the Berkeley NOW]
