
Trends in Sparse Linear Research and Software Developments in France




  1. Trends in Sparse Linear Research and Software Developments in France. P. Amestoy (1), M. Daydé (1), I. Duff (2), L. Giraud (1), A. Haidar (3), S. Lanteri (4), J.-Y. L’Excellent (5), P. Ramet (6). Affiliations: (1) IRIT – INPT / ENSEEIHT; (2) CERFACS & RAL; (3) CERFACS; (4) INRIA, nachos project-team; (5) INRIA / LIP-ENSL; (6) INRIA Futurs / LaBRI. 1st French-Japanese Workshop « Petascale Applications, Algorithms and Programming (PAAP) », Riken, Tokyo, Nov. 1-2, 2007

  2. Scope of this overview • Mainly teams involved in the development of computational kernels / software for scientific libraries • Not intended to be exhaustive • Some aspects already covered by talks of: • Serge Petiton • Nahid Emad • Stéphane Lanteri • Franck Cappello

  3. Overview of sparse linear algebra techniques (from Iain Duff, CERFACS & RAL)

  4. Introduction • Solution of Ax = b • where A is large (of order 10^6 or greater) and sparse

  5. Direct methods • Key idea: factorize the matrix into a product of matrices that are easy to invert (triangular), with possibly some permutations to preserve sparsity and maintain numerical stability • E.g. Gaussian elimination: PAQ = LU, where • P and Q are row / column permutations, • L: lower triangular (sparse), • U: upper triangular (sparse) • Then forward / backward substitution: Ly = Pb, U Q^T x = y • Good points • Can solve very large problems, even in 3D • Generally fast ... even on fancy machines • Robust and well packaged • Bad points • Need to have A, either element-wise or assembled: can have very high storage requirements • Can be costly
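To make the triangular-solve step concrete, here is a minimal dense sketch of forward and backward substitution once L and U are available. It is illustrative only: the permutations are omitted, names and sizes are ours, and a real sparse solver exploits the sparsity of the factors instead of looping over full rows.

```c
#include <stddef.h>

/* Illustrative dense triangular solves: given unit lower-triangular L and
 * upper-triangular U (both n x n, row-major), solve L y = b then U x = y. */
void forward_backward(size_t n, const double *L, const double *U,
                      const double *b, double *y, double *x)
{
    /* Forward substitution: L y = b (L assumed to have a unit diagonal). */
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= L[i * n + j] * y[j];
        y[i] = s;
    }
    /* Backward substitution: U x = y. */
    for (size_t i = n; i-- > 0; ) {
        double s = y[i];
        for (size_t j = i + 1; j < n; j++)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}
```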

  6. Iterative methods • Generate a sequence of vectors converging towards the solution of the linear system • E.g. iterative methods based on Krylov subspaces: K_k(A; r0) = span{r0, A r0, ..., A^(k-1) r0}, where r0 = b − A x0 • The idea is then to choose a suitable xk ∈ x0 + K_k(A; r0) • For example, so that it minimizes || b − A xk ||_2 (GMRES) • There are many Krylov methods, depending on the criteria used for choosing xk • Good points • May not require A to be formed explicitly • Usually very low storage requirements • Can be efficient on 3D problems • Bad points • May require a lot of iterations to converge • May require preconditioning
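As an illustration of a Krylov method, here is a minimal unpreconditioned conjugate gradient solver, chosen instead of GMRES only because it fits in a few lines (it applies to symmetric positive definite A). The matrix is accessed solely through a user-supplied matrix-vector product, which also illustrates that the matrix need not be formed explicitly; all names are ours, not from the slides.

```c
#include <math.h>
#include <stddef.h>

/* Minimal unpreconditioned Conjugate Gradient for SPD A. The matrix is only
 * seen through the matvec callback; r, p, Ap are caller-provided work arrays. */
typedef void (*matvec_fn)(size_t n, const double *x, double *y, void *ctx);

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

int cg_solve(size_t n, matvec_fn A, void *ctx, const double *b,
             double *x, double *r, double *p, double *Ap,
             double tol, int maxit)
{
    A(n, x, Ap, ctx);                                   /* r = b - A x0 */
    for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    for (int k = 0; k < maxit && sqrt(rr) > tol; k++) {
        A(n, p, Ap, ctx);
        double alpha = rr / dot(n, p, Ap);              /* step length */
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;                      /* new search direction */
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return sqrt(rr) <= tol ? 0 : 1;                     /* 0 = converged */
}
```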

  7. Hybrid methods • Attempt to retain the good properties of both direct and iterative methods • Not just preconditioning: in ILU, sparse approximate inverse, or incomplete Cholesky preconditioners, core techniques of direct methods are already used • Cf. the MERAM method (talk by N. Emad) • We focus on using a direct method/code combined with an iterative method.

  8. Hybrid methods (cont'd) • Generic examples of hybrid methods are: • Domain decomposition ... using a direct method on local subdomains and/or a direct preconditioner on the interface (cf. talk by Stéphane Lanteri) • Block iterative methods ... direct solver on subblocks • Factorization of a nearby problem as a preconditioner

  9. Sparse direct methods

  10. Sparse direct solvers • Usually three steps • Pre-processing: symbolic factorization • Numerical factorization • Forward-backward substitution • [Slide figure: simulate a phenomenon / solve a sparse linear system / factorize the matrix; MUMPS: 71 GFlops on 512 processors of a T3E, ~10 GB / ~100 MB]

  11. MUMPS and sparse direct methods. MUMPS team: http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr

  12. History • Main contributors since 1996: Patrick Amestoy, Iain Duff, Abdou Guermouche, Jacko Koster, Jean-Yves L’Excellent, Stéphane Pralet • Current development team: • Patrick Amestoy, ENSEEIHT-IRIT • Abdou Guermouche, LaBRI-INRIA • Jean-Yves L’Excellent, INRIA • Stéphane Pralet, now working for SAMTECH • PhD students: • Emmanuel Agullo, ENS Lyon • Tzvetomila Slavova, CERFACS.

  13. Users • Around 1000 users, 2 requests per day • Academic and industrial users • Types of applications: • Structural mechanics, CAD • Fluid dynamics, magnetohydrodynamics, physical chemistry • Wave propagation and seismic imaging, ocean modelling • Acoustics and electromagnetic propagation • Biology • Finite element analysis, numerical optimization, simulation • . . .

  14. MUMPS: A MUltifrontal Massively Parallel Solver • MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T • Symmetric or unsymmetric matrices (partial pivoting) • Parallel factorization and solution phases (uniprocessor version also available) • Iterative refinement and backward error analysis • Various matrix input formats • assembled format • distributed assembled format • sum of elemental matrices • Partial factorization and Schur complement matrix • Version for complex arithmetic • Several orderings interfaced: AMD, AMF, PORD, METIS, SCOTCH

  15. The multifrontal method (Duff, Reid ’83) • Memory is divided into two parts (that can overlap in time): • the factors • the active memory • The elimination tree represents task dependencies
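A toy sketch of this memory model, under the usual multifrontal assumptions: the elimination tree is processed in postorder, each front leaves behind its factors plus a contribution block that sits on a stack until its parent is assembled. The tree shape and sizes below are invented purely for illustration.

```c
#include <stdio.h>

/* Factors accumulate monotonically; the active memory (current front plus the
 * stack of pending contribution blocks) grows and shrinks during the traversal. */
enum { NNODE = 5 };
/* parent[i] = parent of node i in the elimination tree (-1 for the root);
 * nodes are listed in postorder, so children always come before parents.   */
static const int    parent[NNODE] = { 2, 2, 4, 4, -1 };
static const double front[NNODE]  = { 4.0, 3.0, 6.0, 2.0, 8.0 };  /* frontal sizes  */
static const double factor[NNODE] = { 2.0, 1.5, 3.0, 1.0, 8.0 };  /* kept as factors */

int main(void)
{
    double factors = 0.0, stack = 0.0, active_peak = 0.0;
    for (int i = 0; i < NNODE; i++) {
        double active = stack + front[i];   /* pending blocks + current front */
        if (active > active_peak) active_peak = active;
        factors += factor[i];               /* part of the front becomes factors  */
        stack += front[i] - factor[i];      /* the rest is a contribution block   */
        for (int c = 0; c < i; c++)         /* children's blocks are consumed     */
            if (parent[c] == i) stack -= front[c] - factor[c];
    }
    printf("factors = %.1f, active memory peak = %.1f\n", factors, active_peak);
    return 0;
}
```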

  16. Implementation • Distributed multifrontal solver (MPI / Fortran 90 based) • Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments • Use of BLAS, ScaLAPACK • [Figure: VAMPIR trace on 8 processors of the fully asynchronous distributed solver]

  17. MUMPS • 3 main steps (plus initialization and termination): • JOB=-1: initialize solver type (LU, LDL^T) and default parameters • JOB=1: analyse the matrix, build an ordering, prepare the factorization • JOB=2: (parallel) numerical factorization A = LU • JOB=3: (parallel) solution step, forward and backward substitutions (Ly = b, Ux = y) • JOB=-2: termination, deallocate all MUMPS data structures
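A minimal sketch of this JOB sequence through the MUMPS C interface, modeled on the c_example driver distributed with MUMPS. The toy 2x2 system is ours, and the field names used here (n, nz, irn, jcn, a, rhs, comm_fortran) should be checked against the users' guide of the installed MUMPS version (recent releases also offer 64-bit variants of some fields).

```c
#include <stdio.h>
#include <mpi.h>
#include "dmumps_c.h"

#define JOB_INIT        -1
#define JOB_END         -2
#define USE_COMM_WORLD  -987654   /* MUMPS convention for MPI_COMM_WORLD */

int main(int argc, char **argv)
{
    /* Toy 2x2 system: [2 1; 1 3] x = [3; 4], whose solution is x = [1; 1]. */
    MUMPS_INT n = 2, nz = 4;
    MUMPS_INT irn[] = {1, 1, 2, 2}, jcn[] = {1, 2, 1, 2};   /* 1-based indices */
    double a[]   = {2.0, 1.0, 1.0, 3.0};
    double rhs[] = {3.0, 4.0};

    MPI_Init(&argc, &argv);

    DMUMPS_STRUC_C id;
    id.job = JOB_INIT;                 /* JOB=-1: initialize solver */
    id.par = 1;                        /* host participates in computations */
    id.sym = 0;                        /* unsymmetric matrix: LU factorization */
    id.comm_fortran = USE_COMM_WORLD;
    dmumps_c(&id);

    id.n = n; id.nz = nz;              /* centralized assembled matrix on host */
    id.irn = irn; id.jcn = jcn; id.a = a;
    id.rhs = rhs;

    id.job = 1; dmumps_c(&id);         /* JOB=1: analysis (ordering, symbolic) */
    id.job = 2; dmumps_c(&id);         /* JOB=2: numerical factorization A = LU */
    id.job = 3; dmumps_c(&id);         /* JOB=3: forward/backward substitutions */

    int myid; MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0)
        printf("solution: %g %g\n", rhs[0], rhs[1]);

    id.job = JOB_END; dmumps_c(&id);   /* JOB=-2: free MUMPS data structures */
    MPI_Finalize();
    return 0;
}
```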

  18. AVAILABILITY • MUMPS is available free of charge • It is used on a number of platforms (CRAY, SGI, IBM, Linux, …) and is downloaded once a day on average (applications in chemistry, aeronautics, geophysics, ...) • If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page • Some MUMPS users: Boeing, BRGM, CEA, Dassault, EADS, EDF, MIT, NASA, SAMTECH, ... • [Figure: car body model, 148,770 unknowns and 5,396,386 nonzeros, MSC.Software]

  19. COMPETITIVE PERFORMANCE • Comparison with SuperLU, extracted from ACM TOMS 2001 and obtained with S. Li • Recent performance results (ops = number of operations) • [Table: factorization time in seconds of large matrices on the CRAY T3E; 1 proc: not enough memory]

  20. Functionalities, Features • Recent features • Symmetric indefinite matrices : preprocessing and 2-by-2 pivots • Hybrid scheduling • 2D cyclic distributed Schur complement • Sparse, multiple right-hand sides • Singular matrices with detection of null pivots • Interfaces to MUMPS : Fortran, C, Matlab (S. Pralet, while at ENSEEIHT-IRIT) and Scilab (A. Fèvre, INRIA) • Future functionalities • Out-of-core execution • Parallel analysis phase • Rank revealing algorithms • Hybrid direct-iterative solvers (with Luc Giraud)

  21. On-going research on out-of-core solvers (Ph.D. E. Agullo, ENS Lyon, and Ph.D. M. Slavova, CERFACS) • Use disk storage to solve very large problems • Parallel out-of-core factorization • Preprocessing to minimize the volume of I/O • Scheduling for the out-of-core solution • [Figure: ratio of active to total memory peak on different numbers of processors for several large problems]

  22. Hybrid scheduling • Both memory and workload information are used to obtain better behaviour in terms of estimated memory, memory used and factorization time in the context of parallel factorization algorithms • Estimated memory much closer to the memory effectively used • [Table: estimated and effective memory (millions of reals) for the factorization on 64 processors; Max: maximum amount of memory, Avg: average memory per processor]

  23. Memory-minimizing schedules • Multifrontal methods can use a large amount of temporary data • By decoupling task allocation and task processing, we can reduce the amount of temporary data: a new optimal schedule was proposed in this context (Guermouche, L'Excellent, ACM TOMS) • Memory gains: [Figure: active memory ratio, new algorithm vs Liu's ordering] • Remark: gains relative to Liu's algorithm are equal to 27.1, 17.5 and 19.6 for matrices 8, 9, and 10 (gupta matrices), respectively

  24. Preprocessing for symmetric matrices (S. Pralet, ENSEEIHT-IRIT) • Preprocessing: new scaling available, symmetric weighted matching and automatic tuning of the preprocessing strategies • Pivoting strategy (2-by-2 pivots and static pivoting) • Improvements: • factorization time • robustness, in particular on KKT systems arising from optimization • memory estimation • [Table: factorization time on a Linux PC (Pentium 4, 2.80 GHz)]

  25. Scotch and PaStiX. PaStiX team, INRIA Futurs / LaBRI

  26. PaStiX solver: functionalities • LL^T, LDL^T, LU factorization (symmetric pattern) with a supernodal implementation • Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES • 1D/2D block distribution + full BLAS3 • Supports an external ordering library (Scotch ordering provided) • MPI/threads implementation (SMP node / cluster / multi-core / NUMA) • Simple/double precision + real/complex operations • Requires only C + MPI + POSIX threads • Multiple RHS (direct factorization) • Incomplete factorization ILU(k) preconditioner

  27. PaStiX solver (cont'd) • Available on INRIA Gforge • All-in-one source code • Easy to install on Linux or AIX systems • Simple API (WSMP-like) • Thread safe (can be called from multiple threads in multiple MPI communicators) • Current work • Use of parallel ordering (PT-Scotch) and parallel symbolic factorization • Dynamic scheduling inside SMP nodes (static mapping) • Out-of-core implementation • Generic finite element assembly (domain decomposition associated with the matrix distribution)

  28. Direct solver chain (in PaStiX) • [Diagram: graph → partition → symbol matrix → distributed solver matrix → distributed factorized matrix → distributed solution, through Scotch (ordering & amalgamation), Fax (block symbolic factorization), Blend (refinement & mapping) and Sopalin (factorizing & solving); the analysis steps are sequential, factorization and solve are parallel]

  29. Direct solver chain (in PaStiX): ordering (Scotch) • Sparse matrix ordering (minimizes fill-in) • Scotch: a hybrid algorithm • incomplete nested dissection • the resulting subgraphs being ordered with an approximate minimum degree method under constraints (HAMD)
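A toy recursive nested dissection on a 1-D chain graph, only to illustrate the principle behind Scotch's (far more general) ordering: pick a separator, recurse on the two halves, and number the separator last so that fill-in stays confined to the halves plus the separator. The example and its cut-off are ours.

```c
#include <stdio.h>

/* Toy nested dissection on a path graph with vertices lo..hi.
 * A real package switches to a local ordering (HAMD in Scotch) on small
 * subgraphs; here we simply number them directly. */
static void nested_dissection(int lo, int hi, int *perm, int *next)
{
    if (lo > hi) return;
    if (hi - lo < 2) {                  /* small subgraph: number it directly */
        for (int v = lo; v <= hi; v++)
            perm[(*next)++] = v;
        return;
    }
    int sep = (lo + hi) / 2;            /* middle vertex separates the chain  */
    nested_dissection(lo, sep - 1, perm, next);
    nested_dissection(sep + 1, hi, perm, next);
    perm[(*next)++] = sep;              /* separator is eliminated last       */
}

int main(void)
{
    enum { N = 15 };
    int perm[N], next = 0;
    nested_dissection(0, N - 1, perm, &next);
    for (int i = 0; i < N; i++)
        printf("%d ", perm[i]);         /* elimination order of the vertices  */
    printf("\n");
    return 0;
}
```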

  30. Direct solver chain (in PaStiX): the symbolic block factorization (Fax) • Q(G,P) → Q(G,P)* = Q(G*,P) ⇒ linear in the number of blocks! • Dense block structures → only a few extra pointers to store the matrix

  31. Direct solver chain (in PaStiX): refinement & mapping (Blend) • CPU time prediction • Exact memory resources • [Diagram: example mapping of the column blocks onto 8 processors]

  32. Direct solver chain (in PaStiX): factorizing & solving (Sopalin) • Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors in the same SMP node work directly in shared memory) • Fewer MPI communications and lower parallel memory overhead

  33. Incomplete factorization in PaStiX • Start from the observation that it is difficult to build a generic and robust preconditioner for • large-scale 3D problems • high performance computing • Derive preconditioners from direct solver techniques • What’s new: (dense) block formulation • Incomplete block symbolic factorization: • remove blocks using algebraic criteria • use an amalgamation algorithm to get dense blocks • Provides incomplete LDL^T, Cholesky and LU factorizations (with static pivoting, for symmetric patterns)
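For reference, a scalar ILU(0) on a CSR matrix is sketched below. It only illustrates the "incomplete" idea (keep exactly the pattern of A, drop all fill-in); PaStiX instead works with dense blocks and an ILU(k) pattern obtained by block symbolic factorization. Assumptions (sorted column indices, nonzero diagonal in every row) are noted in the comments.

```c
/* Toy scalar ILU(0) on a CSR matrix. The factors overwrite `a`: the strictly
 * lower part holds L (unit diagonal implicit), the rest holds U.
 * Assumes: column indices sorted increasingly within each row, and every
 * row contains its diagonal entry. */
static int find(const int *colind, int start, int end, int col)
{
    for (int p = start; p < end; p++)
        if (colind[p] == col) return p;
    return -1;                               /* (row, col) not in the pattern */
}

void ilu0(int n, const int *rowptr, const int *colind, double *a)
{
    for (int i = 1; i < n; i++) {
        for (int pk = rowptr[i]; pk < rowptr[i + 1]; pk++) {
            int k = colind[pk];
            if (k >= i) continue;            /* only columns k < i contribute */
            int dk = find(colind, rowptr[k], rowptr[k + 1], k);
            a[pk] /= a[dk];                  /* L(i,k) = A(i,k) / U(k,k)      */
            for (int pj = rowptr[i]; pj < rowptr[i + 1]; pj++) {
                int j = colind[pj];
                if (j <= k) continue;
                int pkj = find(colind, rowptr[k], rowptr[k + 1], j);
                if (pkj >= 0)                /* update only inside the pattern */
                    a[pj] -= a[pk] * a[pkj]; /* fill-in entries are dropped    */
            }
        }
    }
}
```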

  34. Numerical experiments (TERA1) • Successful approach for a large collection of industrial test cases (PARASOL, Harwell-Boeing, CEA) on an IBM SP3 • TERA1 supercomputer of CEA Ile-de-France (ES45 SMP nodes, 4 processors) • COUPOLE40000: 26.5 × 10^6 unknowns, 1.5 × 10^10 nonzeros in L and 10.8 Tflops • 356 procs: 34 s • 512 procs: 27 s • 768 procs: 20 s • (> 500 Gflop/s, about 35% of peak performance)

  35. Numerical experiments (TERA10) • Successful approach on a 3D mesh problem with about 30 million unknowns on the TERA10 supercomputer • But memory is the bottleneck! • ODYSSEE code of the French CEA/CESTA • Electromagnetics code (finite element method + integral equation) • Complex double precision, Schur complement

  36. Industrial applications • OSSAU code of the French CEA/CESTA • 2D / 3D structural mechanics code • ODYSSEE code of the French CEA/CESTA • Electromagnetics code (finite element method + integral equation) • Complex double precision, Schur complement • Fluid mechanics • LU factorization with static pivoting (SuperLU-like approach)

  37. Other parallel sparse direct codes • Shared-memory codes

  38. Other parallel sparse direct codes • Distributed-memory codes

  39. Sparse solver for Ax = b: only a black box? • Preprocessing and postprocessing: • Symmetric permutations to reduce fill-in (Ax = b ⇒ (PAP^T)(Px) = Pb) • Numerical pivoting, scaling to preserve numerical accuracy • Maximum transversal (set large entries on the diagonal) • Preprocessing for parallelism (influence of task mapping on parallelism) • Iterative refinement, error analysis • Default (often automatic/adaptive) settings of the options are often available. However, a better knowledge of the options can help the user to further improve: • memory usage, • time to solution, • numerical accuracy.
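As an example of the postprocessing mentioned above, a minimal iterative refinement loop around a direct solver might look as follows. The callback names are hypothetical placeholders for the user's matrix-vector product and the solver's solution phase (e.g. a MUMPS JOB=3 call); once A = LU is available, each refinement step costs one residual computation and one cheap forward/backward solve.

```c
#include <stddef.h>
#include <math.h>

/* Hypothetical callbacks: spmv computes y = A x; lu_solve overwrites its
 * right-hand side with the solution of A d = rhs using the existing factors. */
typedef void (*spmv_fn)(size_t n, const double *x, double *y, void *ctx);
typedef void (*solve_fn)(size_t n, double *rhs_in_sol_out, void *ctx);

void iterative_refinement(size_t n, spmv_fn spmv, solve_fn lu_solve, void *ctx,
                          const double *b, double *x, double *r,
                          double tol, int maxit)
{
    for (int it = 0; it < maxit; it++) {
        spmv(n, x, r, ctx);                              /* r = A x            */
        double nrm = 0.0;
        for (size_t i = 0; i < n; i++) { r[i] = b[i] - r[i]; nrm += r[i] * r[i]; }
        if (sqrt(nrm) <= tol) break;                     /* residual small enough */
        lu_solve(n, r, ctx);                             /* solve A d = r      */
        for (size_t i = 0; i < n; i++) x[i] += r[i];     /* x = x + d          */
    }
}
```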

  40. The GRID-TLSE Project: a web expert site for sparse linear algebra

  41. Overview of GRID-TLSE: http://gridtlse.org • Supported by: • ANR LEGO project • ANR SOLSTICE project • CNRS / JST program: REDIMPS project • ACI GRID-TLSE project • Partners: [logos on slide]

  42. Sparse matrices expert site? • Expert site: help users in choosing the right solver and its parameters for a given problem • Chosen approach: expert scenarios which answer common user requests • Main goal: provide a friendly test environment for expert and non-expert users of sparse linear algebra software.

  43. Sparse Matrices Expert Site ? • Easy access to: • Software and tools; • A wide range of computer architectures; • Matrix collections; • Expert Scenarios. • Also : Provide a testbed for sparse linear algebra software

  44. Why use the grid? • Sparse linear algebra software makes use of sophisticated algorithms for (pre-/post-)processing the matrix. • Multiple parameters affect the efficient execution of a sparse direct solver: • ordering; • amount of memory; • architecture of the computer; • available libraries. • Determining the best combination of parameter values is a multi-parametric problem. • Well suited for execution over a grid.

  45. • Websolve: a web interface to start services on the grid • Weaver: high-level administrator for the deployment and exploitation of services on the grid • Middleware: DIET, developed within GRID-ASP (LIP, Loria Resedas, LIFC-SDRP), and soon ITBL (?) • [Diagram: users submit requests such as “How do software X and Y compare in terms of memory and CPU on my favourite matrix A?” through Websolve; Weaver maps them onto software components through the DIET / ITBL / AEGIS middleware]

  46. Hybrid Solvers

  47. Parallel hybrid iterative/direct solver for the solution of large sparse linear systems arising from 3D elliptic discretization. L. Giraud (1), A. Haidar (2), S. Watson (3). (1) ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France; (2) CERFACS, Parallel Algorithm Project, 42 Avenue Coriolis, 31057 Toulouse, France; (3) Departments of Computer Science and Mathematics, Virginia Polytechnic Institute, USA

  48. Non-overlapping domain decomposition • Natural approach for PDEs, extends to general sparse matrices • Partition the problem into subdomains / subgraphs • Use a direct solver on the subdomains (MUMPS package) • Robust, algebraically preconditioned iterative solver on the interface (algebraic additive Schwarz preconditioner, possibly with sparsified and mixed-arithmetic variants)
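In block form (our notation, not taken from the slides), eliminating the interior unknowns with the direct solver leaves a reduced system on the interface unknowns, which is what the preconditioned iterative solver works on:

```latex
% Splitting the unknowns into interior (I) and interface (\Gamma) parts:
\[
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} x_I \\ x_\Gamma \end{pmatrix}
=
\begin{pmatrix} b_I \\ b_\Gamma \end{pmatrix}.
\]
% Eliminating x_I with the direct solver (factorization of A_{II}) leaves the
% interface (Schur complement) system, solved by the preconditioned iterative method:
\[
S \, x_\Gamma = b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I,
\qquad
S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma},
\]
% after which the interior unknowns are recovered by one more direct solve:
\[
x_I = A_{II}^{-1} \bigl( b_I - A_{I\Gamma} \, x_\Gamma \bigr).
\]
```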

  49. Numerical behaviour of the preconditioners • [Figure: convergence history for a 43 million dof problem on 1000 processors of System X, Virginia Tech]

  50. Parallel scaled (weak) scalability study • [Figure: parallel elapsed time with fixed sub-problem size (43,000 dof) when the number of processors varies from 27 (1.1 × 10^6 dof) up to 1000 (43 × 10^6 dof)]
