
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures


Presentation Transcript


  1. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures Ahmed Sameh and Ananth Grama Computer Science Department Purdue University. http://www.cs.purdue.edu/people/{sameh/ayg} Linear Solvers Grant Kickoff Meeting, 9/26/06.

  2. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures: Project Overview
     • Objectives and Methodology
       • Design scalable sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
       • Evaluate architectural features and their impact on scalable solver performance.
       • Evaluate performance and productivity aspects of programming models: PGAS (CAF, UPC) and MPI.
     • Challenges and Impact
       • Generalizing the space of linear solvers.
       • Implementation and analysis on parallel platforms.
       • Performance projection to the petascale.
       • Guidance for architecture and programming model design / performance envelope.
       • Benchmarks and libraries for HPCS.
     • Milestones / Schedule
       • Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) solvers.
       • Six-month target: comparative performance of solvers on multicore SMPs and clusters.
       • Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on the Cray X1, BlueGene, and JS20/21.

  3. Introduction • A critical aspect of high productivity is identifying points/regions in the algorithm/architecture/programming-model space that are well suited to petascale systems. • This project aims to identify such points in the context of commonly used sparse linear system solvers and to develop novel solvers. • These novel solvers emphasize reducing memory and remote accesses at the expense of (possibly) higher FLOP counts, yielding much better realized performance.

  4. Project Rationale • Sparse solvers are among the most commonly used kernels on HPC machines. • The design of HPC architectures and programming models must be influenced by their suitability for these (and related) kernels. • The extreme need for concurrency and novel architectural models require a fundamental re-examination of conventional solvers.

  5. Project Goals • Develop a generalization of direct and iterative solvers: the Spike polyalgorithm. • Implement this generalization on various architectures (multicore, multicore SMP, multicore SMP aggregates) and programming models (PGAS, messaging APIs). • Analytically quantify performance and project it to petascale platforms. • Compare relative performance, identify architecture/programming-model features, and guide algorithm/architecture/programming-model co-design.

  6. Background • Personnel: • Ahmed Sameh, Samuel Conte Chair in Computer Science, has worked on the development of parallel sparse solvers for four decades. • Ananth Grama, Professor and University Scholar, has worked on both the numerical aspects of sparse solvers and analytical frameworks for parallel systems. • (Postdoctoral Researcher, to be named)* will be primarily responsible for implementation and benchmarking. *We have identified three candidates for this position and will shortly hire one of them.

  7. Background • Technical • We have built extensive infrastructure for parallel sparse solvers, including the Spike parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners. • We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 blade servers, BlueGene/L, and the Cray X1.

  8. Background • Technical (continued) • We have begun installing Co-Array Fortran and Unified Parallel C on our machines and porting our toolkits to these PGAS languages. • We have extensive experience in analyzing the performance and scalability of parallel algorithms, including the development of the isoefficiency metric for scalability.

  9. Technical Highlights • The SPIKE Toolkit • (Dr. Sameh, could you include a few slides here).
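Since the SPIKE slides above are still a placeholder, here is a minimal two-partition sketch of the SPIKE idea for a banded system: each partition factors its diagonal block independently, the off-diagonal coupling blocks generate "spike" columns, and a small reduced system at the partition interface ties the pieces together. The function and test problem below are illustrative only; dense NumPy solves stand in for the banded, parallel kernels of the actual toolkit, and none of the names reflect the toolkit's API.

```python
# Illustrative two-partition SPIKE sketch (hypothetical helper, not the toolkit API).
import numpy as np

def spike_two_partitions(A, f, n1, b):
    """Solve A x = f, where A is banded with half-bandwidth b and is split
    into two diagonal partitions of sizes n1 and n - n1."""
    A1, A2 = A[:n1, :n1], A[n1:, n1:]
    A12, A21 = A[:n1, n1:], A[n1:, :n1]           # coupling blocks (corner nonzeros only)
    f1, f2 = f[:n1], f[n1:]

    # Spikes: the only nonzero columns of A1^{-1} A12 and A2^{-1} A21.
    V = np.linalg.solve(A1, A12[:, :b])           # n1 x b
    W = np.linalg.solve(A2, A21[:, -b:])          # n2 x b
    g1 = np.linalg.solve(A1, f1)
    g2 = np.linalg.solve(A2, f2)

    # Reduced system: couples the bottom b unknowns of partition 1 with the
    # top b unknowns of partition 2 (the only inter-partition coupling).
    R = np.block([[np.eye(b), V[-b:, :]],
                  [W[:b, :],  np.eye(b)]])
    y = np.linalg.solve(R, np.concatenate([g1[-b:], g2[:b]]))
    x1_b, x2_t = y[:b], y[b:]

    # Back-substitute through the spikes to recover the remaining unknowns.
    x1 = g1 - V @ x2_t
    x2 = g2 - W @ x1_b
    return np.concatenate([x1, x2])

# Small banded test problem (made up), checked against the exact solution.
rng = np.random.default_rng(0)
n, b = 12, 2
A = np.triu(np.tril(rng.standard_normal((n, n)), b), -b) + 10.0 * np.eye(n)
f = rng.standard_normal(n)
x = spike_two_partitions(A, f, n1=n // 2, b=b)
print("residual norm:", np.linalg.norm(A @ x - f))   # should be near machine precision
```

In the toolkit the per-partition solves are distributed across processors and the small reduced system is the main point of inter-partition communication, which is what makes the scheme attractive at scale.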

  10. Technical Highlights • Analysis of Scaling Properties • In early work, we developed the Isoefficiency metric for scalability. • With the likely scenario of utilizing up to 100K processing cores, this work becomes critical. • Isoefficiency quantifies the performance of a parallel system (a parallel program and the underlying architecture) as the number of processors is increased.

  11. Technical Highlights • Isoefficiency Analysis • The efficiency of any parallel program on a fixed problem instance decreases as the number of processors increases. • For a family of parallel programs (formally referred to as scalable programs), increasing the problem size increases efficiency.

  12. Technical Highlights • Isoefficiency is the rate at which problem size must be increased w.r.t. number of processors, to maintain constant efficiency. • This rate is critical, since it is ultimately limited by total memory size. • Isoefficiency is a key indicator of a program’s ability to scale to very large machine configurations. • Isoefficiency analysis will be used extensively for performance projection and scaling properties.
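In the standard formulation of the metric (the notation below is introduced here, not taken from the slides), the relationship can be written as:

```latex
% W       : problem size, measured as serial work
% T_o(W,p): total overhead (communication, idling, excess work) on p processors
\[
  T_P = \frac{W + T_o(W,p)}{p},
  \qquad
  E = \frac{W}{p\,T_P} = \frac{1}{1 + T_o(W,p)/W}.
\]
% Holding E constant forces the problem size to grow with p as
\[
  W = K\,T_o(W,p), \qquad K = \frac{E}{1-E},
\]
% which is the isoefficiency function. For example, adding n numbers on p
% processors has T_o(W,p) = \Theta(p \log p), so W must grow as
% \Theta(p \log p) to keep efficiency constant.
```

A small isoefficiency function (such as the $\Theta(p\log p)$ in the example) indicates a highly scalable program, while a large one means that scaling to very large machines requires impractically large problems, which is the sense in which the rate is ultimately limited by total memory size.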

  13. Architecture • We target the following currently available architectures • IBM JS20/21 and BlueGene/L platforms • Cray X1/XT3 • AMD Opteron multicore SMP and SMP clusters • Intel Xeon multicore SMP and SMP clusters • These platforms represent a wide range of currently available architectural extremes.

  14. Implementation • Current implementations are MPI-based. • The Spike toolkit (iterative as well as direct solvers) will be ported to • POSIX threads and OpenMP • UPC and CAF • Titanium and X10 (if releases are available) • These implementations will be comprehensively benchmarked across platforms.

  15. Benchmarks/Metrics • We aim to formally specify a number of benchmark problems (sparse systems arising in structures, CFD, and fluid-structure interaction) • We will abstract architecture characteristics – processor speed, memory bandwidth, link bandwidth, bisection bandwidth. • We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and projected performance to petascale systems.
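As a concrete (hypothetical) illustration of how the last of these quantities might be tabulated from raw measurements, the sketch below converts wall-clock timings at several processor counts into speedup and parallel efficiency; the function name and sample numbers are made up.

```python
# Hypothetical helper for the benchmarking plan above: tabulate speedup and
# parallel efficiency from measured wall-clock times. Sample data is invented.

def efficiency_table(timings, base_procs=None):
    """timings: dict mapping processor count -> wall-clock seconds."""
    procs = sorted(timings)
    p0 = base_procs if base_procs is not None else procs[0]
    t0 = timings[p0]
    rows = []
    for p in procs:
        speedup = (t0 * p0) / timings[p]      # true speedup when the base run uses 1 processor
        rows.append((p, timings[p], speedup, speedup / p))
    return rows

if __name__ == "__main__":
    sample = {1: 640.0, 4: 170.0, 16: 48.0, 64: 15.5}   # illustrative timings (seconds)
    print(f"{'p':>4} {'time (s)':>10} {'speedup':>9} {'efficiency':>11}")
    for p, t, s, e in efficiency_table(sample):
        print(f"{p:>4} {t:>10.1f} {s:>9.2f} {e:>11.2f}")
```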

  16. Progress/Accomplishments • Implementation of the parallel Spike polyalgorithm toolkit • Incorporation of a number of direct solvers (SuperLU, MUMPS) and iterative solvers (preconditioned Krylov subspace methods) into Spike • Evaluation of Spike on IBM SP and Intel multicore platforms, and its integration into the Intel MKL library.

  17. Milestones • Final deliverable: comprehensive evaluation of the scaling properties of existing (and new) solvers. • Six-month target: comparative performance of solvers on multicore SMPs and clusters. • Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on the Cray X1, BlueGene, and JS20/21.

  18. Financials • The total cost of this project is approximately $150K for its one-year duration. • The budget primarily covers a post-doctoral researcher's salary/benefits and partial summer support for the PIs. • Together, these three project personnel are responsible for accomplishing project milestones and reporting.

  19. Concluding Remarks • This project takes a comprehensive view of sparse linear system solvers and their suitability for petascale HPC systems. • Its results will directly influence the ongoing and future development of HPC systems. • A number of major challenges are likely to emerge, both from this project and from impending architectural innovations.

  20. Concluding Remarks • Architectural features include • Scalable multicore platforms: 64 to 128 cores on the horizon • Heterogeneous multicore: cores are likely to be heterogeneous, some with floating-point units, others with vector units, and yet others with programmable hardware (such chips are already common in cell phones) • Significantly higher pressure on the memory subsystem

  21. Concluding Remarks • Impact of architectural features on algorithms and programming models. • Affinity scheduling is important for performance – need to specify tasks that must be co-scheduled (suitable programming abstractions needed). • Programming constructs for utilizing heterogeneity.

  22. Concluding Remarks • Impact of architectural features on algorithms and programming models. • FLOPs are cheap and memory references are expensive; explore new families of algorithms that minimize the latter. • Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency). • Many of these optimizations are likely to be beyond the technical reach of application programmers, creating a need for scalable library support. • Increased emphasis on scalability analysis.
