
Lapack / ScaLAPACK / Multi-core and more



  1. Lapack / ScaLAPACK / Multi-core and more Julie Langou @ UCSB - April 16th 2007

  2. My background • Education • Bachelor degree in Applied Mathematics, Toulouse, France • Master degree in Computer Science related to the space and aircraft industries, Toulouse, France • Work • Software Engineer for CDC, the French leader in insurance and financial services (comparable to AIG, ING Direct, etc.), Paris, France • Principal Software Engineer on HMIS (homeless shelters) in Knoxville, TN • Research Associate at ICL - Innovative Computing Laboratory, Knoxville, TN, since 2004 • Julien (CERFACS / ICL / CU Denver), research interests: Numerical Linear Algebra * Iterative methods for finding eigenvalues of Hermitian matrices with applications in nanotechnology * Iterative methods for multiple right-hand sides * LAPACK, ScaLAPACK: working on the new release, due for 2007 * FTLA: Fault Tolerant Linear Algebra * Stability of numerical schemes, in particular numerical stability of orthogonalization methods * Allreduce algorithms - QR factorization of a tall and skinny matrix

  3. Plan • My occupation at ICL • Software Engineering LAPACK • Single-double Iterative Refinement Routines for LAPACK • Windows support for LAPACK/ScaLAPACK/ATLAS • Multicore • PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) • Benchmarking of Neumann • Surprise…

  4. Plan • My occupation at ICL • Software Engineering LAPACK • Single-double Iterative Refinement Routines for LAPACK • Windows support for LAPACK/ScaLAPACK/ATLAS • Multicore • PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) • Benchmarking of Neumann

  5. LAPACK – 3.1 (released for SC06)
  == LAPACK 3.1: What's new ==
  1) Hessenberg QR algorithm with the small-bulge multi-shift QR algorithm together with aggressive early deflation. This is an implementation of the 2003 SIAM SIAG/LA Prize winning algorithm of Braman, Byers and Mathias, which significantly speeds up the nonsymmetric eigenproblem.
  2) Improvements of the Hessenberg reduction subroutines. These accelerate the first phase of the nonsymmetric eigenvalue problem. See the reference by G. Quintana-Orti and van de Geijn below.
  3) New MRRR eigenvalue algorithms that also support subset computations. These implementations of the 2006 SIAM SIAG/LA Prize winning algorithm of Dhillon and Parlett are also significantly more accurate than the version in LAPACK 3.0.
  4) Mixed precision iterative refinement subroutines for exploiting fast single precision hardware. On platforms like the Cell processor that do single precision much faster than double, linear systems can be solved many times faster. Even on commodity processors there is a factor of 2 in speed between single and double precision. These are prototype routines in the sense that their interfaces might change based on user feedback.
  5) New partial column norm updating strategy for QR factorization with column pivoting. This fixes a subtle numerical bug dating back to LINPACK that can give completely wrong results. See the reference by Drmac and Bujanovic below.
  6) Thread safety: removed all the SAVE and DATA statements (or provided alternate routines without those statements), increasing reliability on SMPs.
  7) Additional support for matrices with NaN/subnormal elements and optimization of the balancing subroutines, improving reliability.
  8) Several bug fixes.

  6. LAPACK-3.1: faster • xGEHRD (reduction to Hessenberg form) contributor: Daniel Kressner ref: Gregorio Quintana-Orti and Robert van de Geijn, Improving the Performance of Reduction to Hessenberg Form, ACM Transactions on Mathematical Software, 32(2):180-194, June 2006. • xHSEQR (QR algorithm for Hessenberg matrices) contributors: Karen Braman and Ralph Byers ref: K. Braman, R. Byers and R. Mathias, The Multishift QR Algorithm Part I: Maintaining Well-Focused Shifts and Level 3 Performance, SIAM Journal on Matrix Analysis and Applications, volume 23, pages 929-947, 2002. ref: K. Braman, R. Byers and R. Mathias, The Multishift QR Algorithm Part II: Aggressive Early Deflation, SIAM Journal on Matrix Analysis and Applications, volume 23, pages 948-973, 2002.

  7. Some results for the full nonsymmetric eigenvalue problem. ARCH: Intel Pentium 4 (3.4 GHz); F77: GNU Fortran (GCC) 3.4.4; BLAS: libgoto_prescott32p-r1.00.so (one thread)

  8. Some results for the full nonsymmetric eigenvalue problem. ARCH: Intel Pentium 4 (3.4 GHz); F77: GNU Fortran (GCC) 3.4.4; BLAS: libgoto_prescott32p-r1.00.so (one thread). Panels: eigenvalues only; eigenvalues and eigenvectors.

  9. To see how to hack Matlab to get this now, see: http://icl.cs.utk.edu/lapack-forum/viewtopic.php?t=294
  Experiments using Matlab on Pentium IV 3.00GHz, 512KB cache, running Linux
     n    Matlab    hacked   speedup
   100      0.06      0.05      1.20
   150      0.17      0.12      1.41
   200      0.38      0.21      1.80
   250      0.78      0.36      2.16
   300      1.35      0.58      2.32
   350      2.15      0.85      2.52
   400      3.09      1.14      2.71
   450      4.65      1.51      3.07
   500      6.13      1.99      3.08
   550      8.45      2.70      3.12
   600     10.43      3.43      3.04
   650     13.25      4.09      3.23
   700     16.33      4.73      3.45
   750     20.40      5.55      3.67
   800     24.55      6.26      3.92
   850     29.57      7.27      4.06
   900     34.15      8.25      4.13
   950     40.56      9.40      4.31
  1000     47.66     10.66      4.47
  1050     55.20     12.11      4.55
  1100     61.73     13.50      4.57
  1150     71.57     15.03      4.76
  1200     83.23     16.63      5.00
  1250     93.10     18.32      5.08
  1300    104.99     20.39      5.14
  1350    116.85     22.50      5.19
  1400    128.84     24.21      5.32
  1450    146.04     26.95      5.41
  1500    192.96     29.46      6.54
  1550    ---.--     32.28      -.--
  1600    ---.--     35.38      -.--
  1650    ---.--     37.82      -.--
  1700    ---.--     40.76      -.--
  1750    ---.--     44.23      -.--
  1800    ---.--     47.82      -.--
  1850    ---.--     51.47      -.--
  1900    ---.--     55.31      -.--
  1950    ---.--     60.74      -.--
  2000    ---.--     63.55      -.--

  10. Promises made • Goal 1 – Better Numerics • MRRR algorithm for SEP • Up to 5x faster HQR • Up to 10x faster HQZ • Recursive • Jacobi for SVD and SEP • Iterative refinement for Ax=b, least squares • Goal 2 – Expanded Content • Make content of ScaLAPACK mirror LAPACK as much as possible: driver for complex SVD (Du), driver for NEP • Add new functions (examples): – Updating / downdating of factorizations – More generalized SVDs – More generalized Sylvester/Lyapunov equations – Structured eigenproblems • O(n²) version of roots(p) • Selected matrix polynomials • Arbitrary precision, quads • New (not in the proposal) • Single-double iterative refinement routines • Drmac: QR column pivoting • Gustavson: RFP (Rectangular Full Packed) format

  11. ScaLAPACK: Symmetric Eigenvalue Problem • SYEV: symmetric QR: O(n³) • SYEVD: divide and conquer: between O(n²) and O(n³) • SYEVX: bisection and inverse iteration: between O(n²) and O(n³) • SYEVR: MRRR (Multiple Relatively Robust Representations): O(n²) • Experiments done with random matrices: this advantages SYEVX and SYEVD (a lot)!
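A minimal sketch (not part of the talk) of how one might reproduce this comparison today: SciPy's eigh exposes the same four LAPACK symmetric eigensolvers through its driver argument, so timing them on a random symmetric matrix illustrates the trade-off described on this slide. The matrix size and the random test matrix are arbitrary choices.

```python
# Compare the LAPACK symmetric eigensolvers named on this slide via SciPy:
# 'ev' -> SYEV (QR), 'evd' -> SYEVD (divide and conquer),
# 'evx' -> SYEVX (bisection + inverse iteration), 'evr' -> SYEVR (MRRR).
import time
import numpy as np
from scipy.linalg import eigh

n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # make the test matrix symmetric

for driver in ("ev", "evd", "evx", "evr"):
    t0 = time.perf_counter()
    w = eigh(A, eigvals_only=True, driver=driver)
    print(f"{driver}: {time.perf_counter() - t0:.2f} s, lambda_max = {w[-1]:.3f}")
```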

  12. LAPACK / ScaLAPACK release • Releases • [November 11, 2006] LAPACK 3.1.0 • [February 26, 2007] LAPACK 3.1.1 • LAPACK 4.0 … September 2007? • [Feb 01, 2006] scalapack-1.7.1 • [Feb 23, 2006] scalapack-1.7.2 • [March 22, 2006] scalapack-1.7.3 • [May 26, 2006] scalapack-1.7.4 • [January 16, 2007] scalapack-1.7.5 • [April 5, 2007] scalapack-1.8.0 • Software Engineering • Sca/LAPACK program style • Trac website: web-based software project management • Automated testing system: buildbot • LAPACK forum: working better than expected • Still lots of email on lapack@cs.utk.edu as well • New search engine

  13. Plan • My occupation at ICL • Software Engineering LAPACK • Single-double Iterative Refinement Routines for LAPACK • Windows support for LAPACK/ScaLAPACK/ATLAS • Multicore • PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) • Benchmarking of Neumann

  14. Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems) Julie Langou, Piotr Luszczek, Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra

  15. Architecture of the Cell • Cell is a heterogeneous chip multiprocessor that consists of an IBM 64-bit Power Architecture™ core, augmented with eight specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture. • Clock: 3.2 GHz • SIMD architecture: Synergistic Processor Unit (SPU) • Vector: 128 bits • Fused add-multiply • Latency for double precision operations: 6 cycles • IBM 64-bit Power Architecture™ core • AltiVec/VMX vector unit • 4 multiply-adds/cycle for single, 1 multiply-add/cycle for double • http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html • http://www.research.ibm.com/cell/cell_chip.html • http://www-128.ibm.com/developerworks/power/library/pa-cellperf/

  16. Performance of the Cell • This is how single precision performance breaks down for the CBE, by Jakub: • The vector unit of one SPE can do one vector operation each cycle. A vector is 128 bits and can hold 4 floats. So, with fused add-multiply, you can have 8 single-precision floating point operations per cycle. The target clock is 3.2 GHz (our Cell runs at 2.1 GHz), so you get 3.2 * 8 = 25.6 GFLOPS on a single SPU. You have 8 SPUs, so in total you can get 8 * 25.6 = 204.8 GFLOPS from the 8 SPUs. However, the PPU has an AltiVec/VMX vector unit which can also do 4 multiply-add operations per cycle. So if you add it all together, the total single-precision performance of the whole CBE is 230.4 GFLOPS. • Double precision can also be processed in a vectorized fashion on an SPE, but a vector can hold only 2 64-bit doubles, so performance is cut in half. Additionally, the double precision operations are not fully pipelined: one operation has a latency of 6 cycles, so you are 6 times slower. In total you are 6*2 = 12 times slower for double precision operations. Also, the AltiVec/VMX unit on the PPU cannot process doubles in a vectorized fashion, so on the PPU you rely on the standard floating point unit, which can still do a single multiply-add per cycle. • To sum up, for double precision on a 3.2 GHz chip, you have 2.13 GFLOPS on a single SPU, so 2.13 * 8 ≈ 17 in total, plus 6.4 on the PPU = 23.47 total. • So, around 230 GFLOPS single precision and around 23 GFLOPS double precision • Peak performance for single precision: ~230 GFlops • Peak performance for double precision: ~24 GFlops
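The arithmetic above can be restated in a few lines; this is just a back-of-the-envelope check of the numbers quoted on the slide (the clock rate, unit counts, and per-cycle throughputs are taken from the text, not measured).

```python
# Reproduce the peak GFLOPS estimates quoted on the slide.
clock = 3.2e9                     # Hz, target clock from the slide
spes  = 8

# Single precision: 4-wide vector fused multiply-add = 8 flops/cycle per SPE,
# plus the PPU's AltiVec/VMX unit (also 4 multiply-adds = 8 flops/cycle).
sp_spe  = clock * 8
sp_peak = spes * sp_spe + clock * 8
print(f"single: {sp_spe / 1e9:.1f} GFLOPS/SPE, {sp_peak / 1e9:.1f} GFLOPS total")

# Double precision: 2-wide vectors (2x slower) and a 6-cycle, non-pipelined
# operation (6x slower) per SPE; the PPU FPU does one multiply-add per cycle.
dp_spe  = sp_spe / (2 * 6)
dp_peak = spes * dp_spe + clock * 2
print(f"double: {dp_spe / 1e9:.2f} GFLOPS/SPE, {dp_peak / 1e9:.2f} GFLOPS total")
```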

  17. The problem on the Cell • Single precision is ~10 times faster than double precision! • But single precision carries only half the digits of double precision • Can we get double precision accuracy with performance not too far from the single precision peak?

  18. Use iterative refinement! • Factorize the matrix in single precision: O(n³) flops • Then use this factorization as a preconditioner in a double precision iterative method: O(n²) flops • The single precision LU factorization is a POOR double precision LU factorization but an EXCELLENT preconditioner; typically the iterative method converges in a few steps • O(n³) vs O(n²) flops: the single precision computations dominate the flop count • Speed of single precision • Convergence in a double precision iterative solver • Accuracy in double precision

  19. Introduction to Iterative Refinement Algorithm
  L U = lu(A)       % LU factorization    (SINGLE) O(n³)
  x = L\(U\b)       % Solve               (SINGLE) O(n²)
  r = b - Ax        % Residual            (DOUBLE) O(n²)
  while ( ||r|| not small enough ),       % stopping criterion
      z = L\(U\r)   % Solve with the single precision LU factors (SINGLE) O(n²)
      x = x + z     % New solution        (DOUBLE) O(n²)
      r = b - Ax    % New residual        (DOUBLE) O(n²)
  end
  COST: (SINGLE) O(n³) + #ITER * (DOUBLE) O(n²)
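For illustration, here is a minimal NumPy/SciPy sketch of the same loop. It is not the LAPACK routine (that is DSGESV, described later); the helper name sgesv_refine and the simple stopping tolerance are choices made for this example.

```python
# Mixed precision iterative refinement: LU and inner solves in float32,
# residual and solution update in float64.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def sgesv_refine(A, b, tol=None, itermax=30):
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = A.shape[0]
    if tol is None:
        tol = np.sqrt(n) * np.finfo(np.float64).eps    # simple generic tolerance

    lu, piv = lu_factor(A.astype(np.float32))          # LU in single, O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)

    for _ in range(itermax):
        r = b - A @ x                                  # residual in double, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(A, np.inf) * np.linalg.norm(x):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))  # solve in single, O(n^2)
        x = x + z.astype(np.float64)                   # update in double
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = sgesv_refine(A, b)
print("backward error:", np.linalg.norm(b - A @ x) /
      (np.linalg.norm(A, np.inf) * np.linalg.norm(x)))
```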

  20. Limitation: c(n) · ε_s · κ(A) < 1

  21. Limitation: c(n) · ε_s · κ(A) < 1 • The condition number of A cannot be too large (typically κ(A) < 10⁸ for this scheme, as opposed to κ(A) < 10¹⁶ for regular double precision LU) • There is also a limitation on the range of the numbers used (overflow occurs sooner in single precision)
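A quick way to check where a given matrix stands relative to this bound (purely illustrative; the constant c(n) is ignored and the test matrix is arbitrary):

```python
import numpy as np

eps_s = np.finfo(np.float32).eps          # single precision unit roundoff, ~1.2e-7
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))       # arbitrary test matrix

kappa = np.linalg.cond(A)
print(f"kappa(A) = {kappa:.2e}, eps_s * kappa(A) = {eps_s * kappa:.2e}")
# The single/double refinement scheme is expected to converge only while
# eps_s * kappa(A) stays well below 1 (roughly kappa(A) < 1e8).
```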

  22. Number of steps • Approximately … • double precision: ε_d = 10⁻¹⁶ → t_d = -log10(ε_d) = 16 • single precision: ε_s = 10⁻⁸ → t_s = -log10(ε_s) = 8 • condition number: κ(A) = 10⁴ → t_κ = log10(κ(A)) = 4 • Max # of steps = t_d / (t_s - t_κ) = 16 / (8 - 4) = 4
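The same estimate, spelled out as a tiny computation (a sketch of the heuristic only; the digit counts and condition number are the ones assumed on the slide):

```python
import math

t_d = 16                      # double precision: ~16 decimal digits
t_s = 8                       # single precision: ~8 decimal digits
kappa = 1e4                   # assumed condition number of A
t_k = math.log10(kappa)       # digits lost to conditioning per step

max_steps = math.ceil(t_d / (t_s - t_k))
print(max_steps)              # 16 / (8 - 4) = 4
```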

  23. Single vs Double
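A quick way to see the single vs. double gap on your own machine, in the spirit of the comparisons on this and the following slides (illustrative only; the result depends entirely on the BLAS that NumPy is linked against):

```python
# Time a float32 and a float64 matrix multiply (SGEMM vs DGEMM) of the same size.
import time
import numpy as np

n = 2000
rng = np.random.default_rng(0)
for dtype in (np.float32, np.float64):
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    t0 = time.perf_counter()
    C = A @ B
    dt = time.perf_counter() - t0
    print(f"{np.dtype(dtype).name}: {2 * n**3 / dt / 1e9:.1f} GFLOPS")
```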

  24. Operations per cycle • MMX: set of "MultiMedia eXtensions" to the x86 ISA. Mainly new instructions for integer performance, and maybe some prefetch. For Intel, all chips starting with the Pentium MMX possess these extensions. For AMD, all chips starting with the K6. • SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions. SSE is a superset of MMX. These instructions are used to speed up single precision (32-bit) floating point arithmetic. For Intel, all chips starting with the Pentium III possess SSE. For AMD, all chips starting with the Athlon 4. • 3DNow!: AMD's extension to MMX that does almost exactly what SSE does, except the single precision arithmetic is not IEEE compliant. It is also a superset of MMX (but not of SSE; 3DNow! was released before SSE). It is supported only on AMD, starting with the K6-2. • SSE2: additional instructions that perform double precision floating point arithmetic, allowing 2 double precision FLOPs every cycle. For Intel, supported on the Pentium 4; for AMD, on the Opteron.

  25. AMD Opteron(tm) Processor 240 (1.4 GHz), Goto BLAS (1 thread)

  26. Sun UltraSPARC-IIe (502 MHz), SunPerf

  27. Intel Pentium III CopperMine (0.9 GHz), Goto BLAS

  28. Intel Pentium Xeon Northwood (2.4 GHz), Goto BLAS (1 thread)

  29. Intel Pentium IV Prescott (3.4 GHz), Goto BLAS (1 thread)

  30. CRAY X1, libsci

  31. Power PC G5, vecLib (2 threads)

  32. Compaq Alpha EV6, CXML

  33. SGI Octane (270 MHz), ATLAS 3.6 (1 thread)

  34. IBM SP RS/6000, Power 3 (1.5 GHz), ESSL

  35. Final results: very effective on a number of architectures, but not all of them.

  36. Run on parallel machines • The cost of the iterative refinement, O(n²), becomes negligible with respect to PDGETRF, O(n³). • Using PDSGESV is almost twice as fast (1.83x) as using PDGESV, for the same accuracy.

  37. Quad / double • Expected accuracy: 10⁻³² • No more than 3 steps of iterative refinement are needed. • The speedup goes from 10 (n=100) to close to 100 (n=1000). • Intel Xeon 3.2 GHz • ifort -O3 • Reference BLAS

  38. Implementation and Testing • DSGESV computes the solution to a real system of linear equations A * X = B, where A is an N-by-N matrix and X and B are N-by-NRHS matrices. • DSGESV first tries to factorize the matrix in SINGLE PRECISION and uses this factorization within an iterative refinement procedure to produce a solution with DOUBLE PRECISION normwise backward error quality. • If the approach fails, the method switches to a DOUBLE PRECISION factorization and solve. • The iterative refinement process is stopped if ITER > ITERMAX = 30 or Backward error = RNRM / (XNRM * ANRM) < MIN(4, SQRT(N/6)) * EPS, where • ITER is the number of iterations in the iterative refinement process • RNRM is the 2-norm of the residual • XNRM is the 2-norm of the solution • ANRM is the Frobenius norm of the matrix A • EPS is the relative machine precision returned by DLAMCH
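A sketch of that stopping test in NumPy (illustrative only: the norms are the ones named on the slide, and the threshold constant is as transcribed from the slide rather than taken from the LAPACK source):

```python
import numpy as np

def converged(A, x, r, n):
    eps  = np.finfo(np.float64).eps          # relative machine precision (DLAMCH)
    rnrm = np.linalg.norm(r)                 # 2-norm of the residual
    xnrm = np.linalg.norm(x)                 # 2-norm of the solution
    anrm = np.linalg.norm(A, 'fro')          # Frobenius norm of A
    # Normwise backward error test as described on the slide.
    return rnrm / (xnrm * anrm) < min(4.0, np.sqrt(n / 6.0)) * eps
```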

  39. Related work / Originality • Iterative refinement is not a new subject; there is lots of literature. • It had never been done for speed before, only for accuracy. • Iterative refinement was used to • cope with unstable LU factorizations (SuperLU, pivot growth) • improve the forward error (cf. the Berkeley group) • Most of the theorems on mixed-precision iterative refinement ask: "what is the SINGLE precision accuracy I can get with single/double iterative refinement?" • Our problem is: "what is the DOUBLE precision accuracy I can get using single/double iterative refinement?" • See the theoretical analysis at the end of the technical report: http://www.cs.utk.edu/~library/TechReports/2006/ut-cs-06-574.pdf

  40. Extensions • More algorithms: Cholesky / QR / eigenvalue / singular value • More data formats • GB: general band matrix, GE: general matrix (already in LAPACK 3.1.0), GT: general tridiagonal matrix, PB: symmetric positive definite band matrix, PO: symmetric positive definite matrix, PP: symmetric positive definite matrix in packed storage, PT: symmetric positive definite tridiagonal matrix, SP: symmetric matrix in packed storage, SY: symmetric matrix • Various precisions • quad/double • Change the outer iterative method • Richardson -> GCR, GMRES • Change the inner solve (SPARSE) • Instead of the backward/forward solves with the single precision LU, any solver (e.g. an iterative method) will do. See Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Piotr Luszczek, and Stanimire Tomov, Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy, UT-CS-06-584, November 2006.

  41. IBM Cell 3.2 GHz, Ax=b Performance

  42. Plan • My occupation at ICL • Software Engineering LAPACK • Single-double Iterative Refinement Routines for LAPACK • Windows support for LAPACK/ScaLAPACK/ATLAS • Multicore • PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) • Benchmarking of Neumann

  43. Windows support – 100% MS native • ATLAS: Automatically Tuned Linear Algebra Software, a portable BLAS • LAPACK: Linear Algebra PACKage • ScaLAPACK: dense linear algebra package like LAPACK, but for clusters • SUA (Subsystem for UNIX-based Applications) • PowerShell • Microsoft Visual Studio

  44. Plan • My occupation at ICL • Software Engineering LAPACK • Single-double Iterative Refinement Routines for LAPACK • Windows support for LAPACK/ScaLAPACK/ATLAS • Multicore • PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) • Benchmarking of Neumann

  45. Future Scaling without Innovation • If we scale current peak performance numbers for various architectures, allowing system peak to double every 18 months, there is trouble ahead. Slide adapted from Rick Stevens, ANL

  46. Power is an Industry Wide Problem • "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, The New York Times, June 14, 2006 • Google facilities • leveraging hydroelectric power • old aluminum plants • >500,000 servers worldwide • New Google plant in The Dalles, Oregon (from NYT, June 14, 2006)

  47. Free lunch is over • Problem • Physical barriers to further increases in clock speed: too much heat, too much power consumption, and too much current leakage • Heat dissipation • Physical limits on the number of pins and the bandwidth of a single chip: the gap between processor performance and memory performance, which was already bad, will get increasingly worse • The design trade-offs made to address the previous two factors will render commodity processors, absent any further augmentation, inadequate for tera- and petascale systems for advanced applications • Solution • Reduce the clock and increase the number of execution units = multicore • Result • Non-parallel software won't run any faster. A new approach to programming is required. • Programmers will not be able to consider these cores independently (i.e., multicore is not "the new SMP") because they share on-chip resources in ways that separate processors do not.

  48. Power Cost of Frequency • Power ∝ Voltage² × Frequency (V²F) • Frequency ∝ Voltage • Therefore Power ∝ Frequency³
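A tiny illustration (assumed idealized scaling, not from the talk) of why this relation favors multicore: doubling the clock of one core costs 8x the power, while doubling the core count at a fixed clock costs only 2x for the same peak operation rate.

```python
# Normalized units: power ~ cores * frequency**3, throughput ~ cores * frequency.
def power(freq, cores=1):
    return cores * freq ** 3

print(power(2.0, cores=1))   # 1 core at 2x clock: 8.0x power, 2x throughput
print(power(1.0, cores=2))   # 2 cores at 1x clock: 2.0x power, 2x throughput
```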

  49. Change is coming • Free lunch for traditional software: it just runs twice as fast every 18 months with no change to the code (one core at 3 GHz, then 6 GHz, 12 GHz, 24 GHz). • No free lunch for traditional software: without highly concurrent software it won't get any faster (one core at 3 GHz, then 2, 4, 8 cores, all at 3 GHz). • Operations per second for serial code stay flat; the additional operations per second are available only if the code can take advantage of concurrency.

  50. What is a multicore processor, BTW? • "a processor that combines two or more independent processors into a single package" (Wikipedia) • Single chip • Multiple distinct processing engines • Multiple, independent threads of control (or program counters: MIMD) • What about: • types of cores? • homogeneous (AMD Opteron, Intel Woodcrest...) • heterogeneous (STI Cell, Sun Niagara...) • memory? • how is it arranged? • bus? • is it going to be fast enough? • cache? • shared? (Intel/AMD) • not present at all? (STI Cell) • communications?
