
Mixed Precision Iterative Refinement Solver Procedure Using High Performance Linpack (HPL)

H. Che², E. D'Azevedo¹, M. Sekachev³, K. Wong³
¹Oak Ridge National Laboratory, ²Chinese University of Hong Kong, ³National Institute for Computational Sciences

Motivation

• The High Performance Linpack (HPL) benchmark often outperforms ScaLAPACK in solving a large dense system of equations, but it is commonly used only as a performance benchmark in double precision. => Use HPL for the LU decomposition.
• The Single Instruction Multiple Data (SIMD) capability in most processors can achieve higher performance in 32-bit than in 64-bit operations. => Use an iterative refinement procedure.

Goals

• Deliver a user library by extending HPL to perform single, double, complex, double complex, and mixed precision calculations.
• Deliver an interface compatible with the ScaLAPACK calling convention, which allows simple code modifications to achieve top performance using this modified HPL library (libmhpl.a).
• Use a mixed precision solver by performing the costly LU factorization in 32-bit arithmetic while achieving 64-bit accuracy by the iterative refinement method.

High Performance Linpack (HPL)

• HPL is written in portable C to evaluate the parallel performance of Top500 computers by solving a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers.
• HPL uses a right-looking variant of the LU factorization of a random matrix with row partial pivoting; a 2-D block-cyclic data distribution is employed.
• It has tunable parameters to implement multiple look-ahead depths, recursive panel factorization with pivot search and column broadcast combined, various virtual panel broadcast topologies, a bandwidth-reducing swap broadcast algorithm, and a ring broadcast algorithm in the backward substitution.

Methodology

• To obtain three additional precisions, the original HPL source codes are rewritten by modifying data types and function names using the following convention (the same as ScaLAPACK) in naming files and functions:
  's' – SINGLE REAL
  'd' – DOUBLE REAL
  'c' – SINGLE COMPLEX
  'z' – DOUBLE COMPLEX
• Data-type changes:
  Variable data type: double A → double complex A
  MPI communication data type: MPI_Send(A, ..., MPI_DOUBLE, ...) → MPI_Send(A, ..., MPI_DOUBLE_COMPLEX, ...)
  Function return type: double HPL_rand → double complex HPL_zrand
• Unchanged:
  Timing variables: /hpl/testing/ptimer, /timer
  Norm variables: Anorm1 in HPL_pztest.c
  Residue variables: resid0 in HPL_pztest.c
  Function return data type: double HPL_pzlange
• For example, HPL_pdlange.c becomes HPL_pslange.c with the following changes in content: double* becomes float*, the return type double HPL_pdlange becomes float HPL_pslange, and HPL_rzero is cast as (float)HPL_rzero. The original double precision fragment reads:

    /*
     * Find norm_1( A ).
     */
    if( nq > 0 )
    {
       work = (double*)malloc( nq * sizeof( double ) );
       if( work == NULL )
       {
          HPL_pabort( __LINE__, "HPL_pdlange", "Memory allocation failed" );
       }
       for( jj = 0; jj < nq; jj++ )
       {
          s = HPL_rzero;
          for( ii = 0; ii < mp; ii++ ) { s += Mabs( *A ); A++; }
          work[jj] = s;
          A += LDA - mp;
       }
       /* ... */
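Based on the substitutions listed above, the same fragment in HPL_pslange.c would read roughly as follows (a sketch, assuming the declarations of work and s elsewhere in the function have likewise been changed to float):

    /*
     * Find norm_1( A ).
     */
    if( nq > 0 )
    {
       work = (float*)malloc( nq * sizeof( float ) );       /* was (double*) */
       if( work == NULL )
       {
          HPL_pabort( __LINE__, "HPL_pslange", "Memory allocation failed" );
       }
       for( jj = 0; jj < nq; jj++ )
       {
          s = (float)HPL_rzero;                             /* cast to float */
          for( ii = 0; ii < mp; ii++ ) { s += Mabs( *A ); A++; }
          work[jj] = s;
          A += LDA - mp;
       }
       /* ... */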
Mixed Precision Using HPL

• HPL updates the upper triangular matrix only.
• The lower triangular matrix must also be updated to prepare for the iterative refinement.
• The global pivot vector is obtained from HPL and used to swap the rows of the lower triangular matrix.
• The lower triangular matrix is updated to be compatible with ScaLAPACK.
• The data structure code was modified to assemble and return the pivot vector.

Iterative Refinement Methodology

• Better performance is gained by performing the LU factorization in single precision (much faster than in double precision), then using ScaLAPACK routines for the triangular solves, matrix-vector multiplies, and iterative refinement to recover double precision accuracy.
• Solve the matrix with 'call hpl_psgesv(...)' (HPL) instead of 'call psgetrf(...)' (ScaLAPACK).
• The major computational cost is solving the matrix: O(N³), where N × N is the size of the matrix.
• A minimal serial sketch of this procedure is given at the end of this transcript.

ScaLAPACK Based Application Code Modifications

• With simple modifications of ScaLAPACK source codes, one can embed calls to the more efficient HPL functions by simply linking with the libmhpl.a library provided here.
• For example, the test program provided by ScaLAPACK (pdludriver.f) should be modified as follows (pdhpl_driver.f):
  1. Two extra integers are declared:
       integer hpl_lld, hpl_ineed
  2. The extra HPL setup routines are called:
       call blacs_barrier( ictxt, 'A' )
       call hpl_dblacsinit( ictxt )
       call hpl_dmatinit( n, NB, hpl_lld, hpl_ineed )
       call descinit( descA, n, n+1, NB, NB, 0, 0, ICTXT, hpl_lld, ierr(1) )
  3. The original ScaLAPACK call
       CALL PDGETRF( M, N, MEM(IPA), 1, 1, DESCA, MEM(IPPIV), INFO )
     is replaced by the HPL call
       call hpl_pdgesv( n, mem(IPA), descA, mem(ippiv), info )

Numerical Experiments

• The numerical experiments were performed on the Athena Cray XT4 supercomputer at NICS.
• Athena nodes consist of a quad-core 2.3 GHz AMD Opteron processor with 4 GB of memory. Using Streaming SIMD Extensions (SSE), each core has a peak performance of 9.2 Gflops in 64-bit arithmetic and 18.4 Gflops in 32-bit arithmetic.
• Peak rate of 16K cores = 150.73 TFLOPS in double precision (16,384 cores × 9.2 Gflops/core).
• Peak rate of 16K cores = 301.46 TFLOPS in single precision (16,384 cores × 18.4 Gflops/core).

Results

Table 1: Performance comparison: HPL vs. ScaLAPACK
Table 2: Performance of HPL Mixed Precision – Real Matrix
Table 3: Performance of HPL Mixed Precision – Complex Matrix

Summary

• The HPL-based dense LU solver is more efficient than standard ScaLAPACK and achieved about 75% of the peak performance.
• HPL performs the parallel LU factorization in double precision but uses a hybrid left/right-looking panel method and look-ahead algorithms.
• The application interface is compatible with ScaLAPACK.
• The solver has been integrated into the AORSA fusion INCITE application.

Acknowledgements

• This research is partially sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy.
• This research used resources of the National Institute for Computational Sciences (NICS), which is supported by the National Science Foundation (NSF).
• Summer internships for H. Che, T. Chan, D. Lee, and R. Wong were supported by the Department of Mathematics, The Chinese University of Hong Kong (CUHK). The internship opportunity was provided by the Joint Institute for Computational Sciences (JICS), the University of Tennessee, and Oak Ridge National Laboratory.

Contact information
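As a closing illustration of the Iterative Refinement Methodology section above, here is a minimal serial sketch of the procedure. It uses LAPACKE and CBLAS routines (sgetrf/sgetrs, dgemv) in place of the distributed HPL and ScaLAPACK calls; the function name mp_ir_solve and the stopping tolerance are illustrative assumptions, not part of the library described in this poster.

    #include <stdlib.h>
    #include <math.h>
    #include <lapacke.h>
    #include <cblas.h>

    /* Solve A x = b (n x n, column-major, 64-bit data) to double precision
     * accuracy using a single precision LU factorization.
     * Returns the LAPACK info code from the factorization (0 on success). */
    int mp_ir_solve( int n, const double *A, const double *b, double *x, int maxit )
    {
       float      *As   = malloc( (size_t)n * n * sizeof( float ) ); /* 32-bit copy of A      */
       float      *cs   = malloc( (size_t)n * sizeof( float ) );     /* 32-bit rhs/correction */
       double     *r    = malloc( (size_t)n * sizeof( double ) );    /* 64-bit residual       */
       lapack_int *ipiv = malloc( (size_t)n * sizeof( lapack_int ) );
       lapack_int  info;
       int         i, it;

       for( i = 0; i < n * n; i++ ) As[i] = (float)A[i];   /* round A to 32-bit */

       /* O(N^3) cost: the LU factorization, done entirely in 32-bit */
       info = LAPACKE_sgetrf( LAPACK_COL_MAJOR, n, n, As, n, ipiv );
       if( info == 0 )
       {
          /* initial solve in single precision */
          for( i = 0; i < n; i++ ) cs[i] = (float)b[i];
          LAPACKE_sgetrs( LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n );
          for( i = 0; i < n; i++ ) x[i] = (double)cs[i];

          for( it = 0; it < maxit; it++ )
          {
             /* O(N^2) per step: residual r = b - A x in 64-bit arithmetic */
             for( i = 0; i < n; i++ ) r[i] = b[i];
             cblas_dgemv( CblasColMajor, CblasNoTrans, n, n,
                          -1.0, A, n, x, 1, 1.0, r, 1 );

             /* correction solve reuses the 32-bit LU factors */
             for( i = 0; i < n; i++ ) cs[i] = (float)r[i];
             LAPACKE_sgetrs( LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, cs, n );

             /* update the 64-bit iterate; stop once the correction is tiny */
             double cmax = 0.0;
             for( i = 0; i < n; i++ )
             {
                x[i] += (double)cs[i];
                if( fabs( (double)cs[i] ) > cmax ) cmax = fabs( (double)cs[i] );
             }
             if( cmax <= 1.0e-14 ) break;  /* illustrative tolerance */
          }
       }
       free( As ); free( cs ); free( r ); free( ipiv );
       return (int)info;
    }

The sketch shows why the procedure pays off: the O(N³) factorization runs entirely in 32-bit arithmetic, while each refinement step costs only O(N²), so the solver approaches single precision speed yet delivers double precision accuracy.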
