
Introduction to ScaLAPACK



  1. Introduction to ScaLAPACK

  2. ScaLAPACK • ScaLAPACK is for solving dense linear systems and computing eigenvalues for dense matrices

  3. ScaLAPACK • Scalable Linear Algebra PACKage • Development team from the University of Tennessee, University of California Berkeley, ORNL, Rice U., UCLA, UIUC, etc. • Support in Commercial Packages • NAG Parallel Library • IBM PESSL • CRAY Scientific Library • VNI IMSL • Fujitsu, HP/Convex, Hitachi, NEC

  4. Description • A Scalable Linear Algebra Package (ScaLAPACK) • A library of linear algebra routines for MPP (massively parallel processor) and NOW (network of workstations) machines • Language: Fortran • Dense Matrix Problem Solvers • Linear Equations • Least Squares • Eigenvalue

  5. Technical Information • ScaLAPACK web page • http://www.netlib.org/scalapack • ScaLAPACK User’s Guide • Technical Working Notes • http://www.netlib.org/lapack/lawn

  6. Packages • LAPACK • Routines are single-PE linear algebra solvers, utilities, etc. • BLAS • Single-PE work-horse routines (vector, matrix-vector & matrix-matrix) • PBLAS • Parallel counterparts of the BLAS. Relies on the BLACS for blocking and moving data.

  7. Packages (cont.) • BLACS • Constructs 1-D & 2-D processor grids for matrix decomposition and nearest neighbor communication patterns. Uses message passing libraries (MPI, PVM, MPL, NX, etc) for data movement.

  8. Dependencies & Parallelism of Component Packages [diagram: the serial components (LAPACK, BLAS) and the parallel components (ScaLAPACK, PBLAS, BLACS)]

  9. Data Partitioning • Initialize BLACS routines • Set up 2-D mesh (mp, np) • Set column- or row-major order (CorR) • Call returns context handle (icon) • Get row & column through gridinfo (myrow, mycol) • blacs_gridinit( icon, CorR, mp, np ) • blacs_gridinfo( icon, mp, np, myrow, mycol )
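
A minimal sketch of the grid-setup sequence above, assuming an MPI-based BLACS and an illustrative 2 x 3 grid (run with at least 6 processes; the program name and grid shape are placeholders for this example, not part of the original slides):

  program blacs_grid_demo
  ! Minimal BLACS setup: build a 2 x 3 process grid in row-major order,
  ! query this process's coordinates, then shut the grid down.
  implicit none
  integer :: icon, iam, nprocs, myrow, mycol
  integer, parameter :: mp = 2, np = 3       ! grid rows, grid columns

  call blacs_pinfo( iam, nprocs )            ! my process id, total processes
  call blacs_get( 0, 0, icon )               ! get the default system context
  call blacs_gridinit( icon, 'R', mp, np )   ! 'R' = row-major ordering (CorR)
  call blacs_gridinfo( icon, mp, np, myrow, mycol )

  if (myrow /= -1) then                      ! -1 means this process is not in the grid
     print *, 'process', iam, ' is at grid position (', myrow, ',', mycol, ')'
     call blacs_gridexit( icon )
  end if
  call blacs_exit( 0 )
  end program blacs_grid_demo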

  10. Data Partitioning (cont.) • Block-cyclic Array Distribution. • Block: (groups of sequential elements) • Cyclic: (round-robin assignment to PEs) • Example 1-D Partitioning: • Global 9-element array distributed on a 3-PE grid with 2 element blocks: block(2)-cyclic(3)
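
To make the 1-D mapping concrete, here is a small plain-Fortran sketch (no library calls; the sizes are the ones from the example above) that prints which PE owns each element of the 9-element array under a block(2)-cyclic(3) distribution:

  program block_cyclic_1d
  ! Owner of global element i (1-based) under a block-cyclic distribution:
  !    pe = mod( (i-1)/nb, npes )
  implicit none
  integer, parameter :: n = 9, nb = 2, npes = 3
  integer :: i

  do i = 1, n
     print *, 'global element', i, ' -> PE', mod( (i-1)/nb, npes )
  end do
  end program block_cyclic_1d

Elements 1-2 land on PE 0, 3-4 on PE 1, 5-6 on PE 2, 7-8 wrap back to PE 0, and 9 on PE 1.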

  11. Block-Cyclic Partitioning

  12. 2-D Partitioning • Example 2-D Partitioning • A(9,9) mapped onto PE(2,3) grid block(2,2) • Use 1-D example to distribute column elements in each row, a block(2)-cyclic(3) distribution. • Assign rows to first PE dimension by block(2)-cyclic(2) distribution.

  13. [Figure: a 9 x 9 global matrix with 2 x 2 blocks distributed block(2,2)-cyclically onto a 2 x 3 PE grid, showing which global elements end up on each of PE(0,0) through PE(1,2)]
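
The figure's layout can be reproduced by applying the 1-D rule to rows and columns independently. A short plain-Fortran sketch, assuming the figure's sizes (9 x 9 matrix, 2 x 2 blocks, 2 x 3 PE grid):

  program block_cyclic_2d
  ! Owning PE (pr,pc) of each global element a(i,j): apply the 1-D
  ! block-cyclic rule to the row index and the column index separately.
  implicit none
  integer, parameter :: m = 9, n = 9, mb = 2, nb = 2, nprow = 2, npcol = 3
  integer :: i, j, pr, pc

  do i = 1, m
     do j = 1, n
        pr = mod( (i-1)/mb, nprow )   ! grid row owning global row i
        pc = mod( (j-1)/nb, npcol )   ! grid column owning global column j
        print *, 'a(', i, ',', j, ') -> PE(', pr, ',', pc, ')'
     end do
  end do
  end program block_cyclic_2d

For example, a(1,1) and a(2,2) land on PE(0,0), a(1,3) on PE(0,1), and a(3,1) on PE(1,0), matching the figure.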

  14. Syntax for descinit
  descinit( idesc, m, n, mb, nb, i, j, icon, mla, ier )
  IorO  arg    Description
  OUT   idesc  Descriptor
  IN    m      Row Size (Global)
  IN    n      Column Size (Global)
  IN    mb     Row Block Size
  IN    nb     Column Block Size
  IN    i      Starting Grid Row
  IN    j      Starting Grid Column
  IN    icon   BLACS context
  IN    mla    Leading Dimension of Local Matrix
  OUT   ier    Error number
  The starting grid location is usually (i=0, j=0).
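
A sketch of descinit used together with NUMROC to size the local array, again assuming the 9 x 9 / 2 x 2-block / 2 x 3-grid example from the earlier slides (program and variable names here are illustrative):

  program descinit_demo
  ! Build a ScaLAPACK descriptor for a 9 x 9 global matrix with 2 x 2
  ! blocks on a 2 x 3 process grid, starting at grid position (0,0).
  implicit none
  integer, parameter :: m = 9, n = 9, mb = 2, nb = 2, nprow = 2, npcol = 3
  integer :: icon, iam, nprocs, myrow, mycol
  integer :: mla, locc, idesc(9), ier
  integer, external :: numroc

  call blacs_pinfo( iam, nprocs )
  call blacs_get( 0, 0, icon )
  call blacs_gridinit( icon, 'R', nprow, npcol )
  call blacs_gridinfo( icon, nprow, npcol, myrow, mycol )

  if (myrow /= -1) then
     ! NUMROC gives the number of local rows/columns this PE stores;
     ! the local row count also serves as the local leading dimension (mla).
     mla  = numroc( m, mb, myrow, 0, nprow )
     locc = numroc( n, nb, mycol, 0, npcol )

     call descinit( idesc, m, n, mb, nb, 0, 0, icon, max(1,mla), ier )
     print *, 'PE(', myrow, ',', mycol, ') stores', mla, ' x', locc, '  info =', ier

     call blacs_gridexit( icon )
  end if
  call blacs_exit( 0 )
  end program descinit_demo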

  15. Application Program Interfaces (API) • Drivers • Solves a Complete Problem • Computational Components • Performs Tasks: LU factorization, etc. • Auxiliary Routines • Scaling, Matrix Norm, etc. • Matrix Redistribution/Copy Routine • Matrix on PE grid1 -> Matrix on PE grid2

  16. API (cont.) • Prepend a P to the LAPACK equivalent name: PXYYZZZ, where X = data type, YY = matrix type, ZZZ = computation performed • Data types (X): S = real, D = double, C = complex, Z = double complex • Example: PSGESV = P + S (real) + GE (general matrix) + SV (linear equation solve)

  17. Matrix Types (YY) in PXYYZZZ
  DB  General Band (diagonally dominant-like)
  DT  General Tridiagonal (diagonally dominant-like)
  GB  General Band
  GE  GEneral matrices (e.g., unsymmetric, rectangular, etc.)
  GG  General matrices, Generalized problem
  HE  Complex HErmitian
  OR  ORthogonal Real
  PB  Positive definite Banded (symmetric or Hermitian)
  PO  POsitive definite (symmetric or Hermitian)
  PT  Positive definite Tridiagonal (symmetric or Hermitian)
  ST  Symmetric Tridiagonal Real
  SY  SYmmetric
  TR  TRiangular (possibly quasi-triangular)
  TZ  TrapeZoidal
  UN  UNitary complex

  18. Routine types: Drivers (ZZZ) in PXYYZZZ
  SL   Linear Equations (SVX*)
  SV   LU Solver
  VD   Singular Value
  EV   Eigenvalue (EVX**)
  GVX  Generalized Eigenvalue
  *  Expert driver: estimates condition numbers, provides error bounds, equilibrates the system.
  ** Computes selected eigenvalues.

  19. Rules of Thumb • Use a square processor grid, mp = np • Use block size 32 (Cray) or 64 for efficient matrix multiply • Number of PEs to use = M x N / 10**6 (e.g., about 16 PEs for a 4000 x 4000 matrix) • Bandwidth/node (MB/sec) > peak rate (MFLOPS)/10 • Latency < 500 usecs • Use the vendor's BLAS

  20. Example program: Linear Equation Solution Ax=b (page 1)

  program Sample_ScaLAPACK_program
  implicit none
  include "mpif.h"
  REAL :: A(1000,1000)
  REAL, DIMENSION(:,:), ALLOCATABLE :: A_local
  INTEGER, PARAMETER :: M_A=1000, N_A=1000
  INTEGER :: DESC_A(1:9)
  REAL, DIMENSION(:,:), ALLOCATABLE :: b_local
  REAL :: b(1000,1)
  INTEGER, PARAMETER :: M_B=1000, N_B=1
  INTEGER :: DESC_b(1:9), i, j
  INTEGER :: ldb_y
  INTEGER :: ipiv(1:1000), info
  INTEGER :: myrow, mycol   ! row and column # of PE on PE grid
  INTEGER :: ictxt          ! context handle.....like an MPI communicator
  INTEGER :: lda_y, lda_x   ! leading dimensions of the local array A_local
  ! row and column # of PE grid
  INTEGER, PARAMETER :: nprow=2, npcol=2
  !
  ! RSRC...the PE row that owns the first row of A
  ! CSRC...the PE column that owns the first column of A

  21. Example program (page 2)

  INTEGER, PARAMETER :: RSRC = 0, CSRC = 0
  INTEGER, PARAMETER :: MB = 32, NB = 32
  !
  ! iA...the global row index of A, which points to the beginning
  ! of the submatrix that will be operated on.
  !
  INTEGER, PARAMETER :: iA = 1, jA = 1
  INTEGER, PARAMETER :: iB = 1, jB = 1
  DOUBLE PRECISION :: starttime, endtime   ! MPI_WTIME returns double precision
  INTEGER :: taskid, numtask, ierr
  !**********************************
  ! NOTE: ON THE NPACI MACHINES YOU DO NOT NEED TO CALL INITBUFF
  ! ON OTHER MACHINES YOU MAY NEED TO CALL INITBUFF
  !*************************************
  !
  ! This is where you should define any other variables
  !
  INTEGER, EXTERNAL :: NUMROC
  !
  ! Initialization of the processor grid. This routine
  ! must be called
  !
  ! start MPI initialization routines

  22. Example program (page 3)

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numtask, ierr)
  ! start timing call on the 0,0 processor
  IF (taskid==0) THEN
     starttime = MPI_WTIME()
  END IF
  CALL BLACS_GET(0, 0, ictxt)
  CALL BLACS_GRIDINIT(ictxt, 'r', nprow, npcol)
  !
  ! This call returns processor grid information that will be
  ! used to distribute the global array
  !
  CALL BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol)
  !
  ! define lda_x and lda_y with NUMROC
  ! lda_x...the leading dimension of the local array storing the
  ! ........local blocks of the distributed matrix A
  !
  lda_x = NUMROC(M_A, MB, myrow, 0, nprow)
  lda_y = NUMROC(N_A, NB, mycol, 0, npcol)
  !
  ! resizing A so we don't waste space
  !

  23. Example program (page 4)

  ALLOCATE( A_local(lda_x, lda_y) )
  if (N_B < NB) then
     ALLOCATE( b_local(lda_x, N_B) )
  else
     ldb_y = NUMROC(N_B, NB, mycol, 0, npcol)
     ALLOCATE( b_local(lda_x, ldb_y) )
  end if
  !
  ! defining global matrix A
  !
  do i = 1, 1000
     do j = 1, 1000
        if (i == j) then
           A(i,j) = 1000.0*i*j
        else
           A(i,j) = 10.0*(i+j)
        end if
     end do
  end do
  ! defining the global array b
  do i = 1, 1000
     b(i,1) = 1.0
  end do
  !

  24. Example program (page 5)

  ! subroutine setarray maps the global arrays to the local arrays
  !
  ! YOU MUST HAVE DEFINED THE GLOBAL MATRIX BEFORE THIS POINT
  !
  CALL SETARRAY(A, A_local, b, b_local, myrow, mycol, lda_x, lda_y)
  !
  ! initialize descriptor vectors for ScaLAPACK
  !
  CALL DESCINIT(DESC_A, M_A, N_A, MB, NB, RSRC, CSRC, ictxt, lda_x, info)
  CALL DESCINIT(DESC_b, M_B, N_B, MB, N_B, RSRC, CSRC, ictxt, lda_x, info)
  !
  ! call ScaLAPACK routines
  !
  CALL PSGESV(N_A, 1, A_local, iA, jA, DESC_A, ipiv, b_local, iB, jB, DESC_b, info)
  !
  ! check ScaLAPACK returned ok
  !

  25. Example program (page 6)

  if (info .ne. 0) then
     print *, 'unsuccessful return...something is wrong! info=', info
  else
     print *, 'Successful return'
  end if
  CALL BLACS_GRIDEXIT( ictxt )
  CALL BLACS_EXIT( 1 )   ! nonzero: we still need MPI for timing and MPI_FINALIZE
  ! call timing routine on processor 0,0 to get endtime
  IF (taskid==0) THEN
     endtime = MPI_WTIME()
     print *, 'wall clock time in seconds = ', endtime - starttime
  END IF
  ! end MPI
  CALL MPI_FINALIZE(ierr)
  !
  ! end of main
  !
  END

  26. Example program (page 7)

  SUBROUTINE SETARRAY(AA, A, BB, B, myrow, mycol, lda_x, lda_y)
  ! copies the global arrays AA and BB into the
  ! block-cyclically distributed local arrays A and B
  !
  implicit none
  REAL :: AA(1000,1000)
  REAL :: BB(1000,1)
  INTEGER :: lda_x, lda_y
  REAL :: A(lda_x, lda_y)
  REAL :: ll, mm, Cr, Cc
  INTEGER :: ii, jj, I, J, myrow, mycol, pr, pc, h, g
  INTEGER, PARAMETER :: nprow = 2, npcol = 2
  INTEGER, PARAMETER :: N=1000, M=1000, NB=32, MB=32, RSRC=0, CSRC=0
  REAL :: B(lda_x, 1)
  INTEGER, PARAMETER :: N_B = 1

  27. Example program (page 8)

  do I = 1, M
     do J = 1, N
        ! finding out which PE gets this I,J element
        Cr = real( (I-1)/MB )
        h = RSRC + aint(Cr)
        pr = mod( h, nprow )
        Cc = real( (J-1)/NB )      ! use the column block size NB here
        g = CSRC + aint(Cc)
        pc = mod( g, npcol )       ! wrap over the grid columns, npcol
        !
        ! check if on this PE and then set A
        !
        if (myrow == pr .and. mycol == pc) then
           ! ii,jj coordinates of the local array element
           ! ii = x + l*MB
           ! jj = y + m*NB
           !
           ll = real( (I-1)/(nprow*MB) )
           mm = real( (J-1)/(npcol*NB) )
           ii = mod(I-1, MB) + 1 + aint(ll)*MB
           jj = mod(J-1, NB) + 1 + aint(mm)*NB
           A(ii,jj) = AA(I,J)
        end if
     end do
  end do

  28. Example Program (page 9)

  ! reading in and distributing the B vector
  !
  do I = 1, M
     J = 1
     ! finding out which PE gets this I,J element
     Cr = real( (I-1)/MB )
     h = RSRC + aint(Cr)
     pr = mod( h, nprow )
     Cc = real( (J-1)/NB )          ! use the column block size NB here
     g = CSRC + aint(Cc)
     pc = mod( g, npcol )           ! wrap over the grid columns, npcol
     ! check if on this PE and then set B
     if (myrow == pr .and. mycol == pc) then
        ll = real( (I-1)/(nprow*MB) )
        mm = real( (J-1)/(npcol*NB) )
        jj = mod(J-1, NB) + 1 + aint(mm)*NB
        ii = mod(I-1, MB) + 1 + aint(ll)*MB
        B(ii,jj) = BB(I,J)
     end if
  end do
  end subroutine
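
Note that the example hard-wires a 2 x 2 process grid (nprow = npcol = 2), so it expects to be run on exactly 4 MPI processes. A typical build-and-run sequence uses an MPI Fortran wrapper such as mpif90 followed by mpirun -np 4, linking against the system's ScaLAPACK, BLACS, and BLAS libraries; the exact wrapper and library names vary by machine, so check the local documentation.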

  29. [Figure: PxGEMM performance (Mflops) vs. matrix size N = 2000 to 10000 on 2x2, 4x4, and 8x8 process grids, for the Cray T3E, IBM SP2, and the Berkeley NOW]

  30. [Figure: PxGESV performance (Mflops) vs. matrix size N = 2000 to 15000 on 1x4, 2x8, and 4x16 process grids, for the Cray T3E, IBM SP2, and the Berkeley NOW]
