
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures


Presentation Transcript


  1. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures Ahmed Sameh and Ananth Grama Computer Science Department Purdue University. http://www.cs.purdue.edu/people/{sameh/ayg} Linear Solvers Grant Kickoff Meeting, 9/26/06.

  2. Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures: Project Overview
     • Objectives and Methodology
       • Design scalable sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics.
       • Evaluate architectural features and their impact on scalable solver performance.
       • Evaluate performance and productivity aspects of programming models: PGAS (CAF, UPC) and MPI.
     • Challenges and Impact
       • Generalizing the space of linear solvers.
       • Implementation and analysis on parallel platforms.
       • Performance projection to the petascale.
       • Guidance for architecture and programming model design / performance envelope.
       • Benchmarks and libraries for HPCS.
     • Milestones / Schedule
       • Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) solvers.
       • Six-month target: comparative performance of solvers on multicore SMPs and clusters.
       • Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on the Cray X1, BlueGene, and JS20/21.

  3. Introduction • A critical aspect of high productivity is identifying points/regions in the algorithm/architecture/programming-model space that are well suited to petascale systems. • This project aims to identify such points in the context of commonly used sparse linear system solvers and to develop novel solvers. • These novel solvers emphasize reducing memory and remote accesses at the expense of (possibly) higher FLOP counts, yielding much better realized performance.

  4. Project Rationale • Sparse solvers are among the most commonly used kernels on HPC machines. • The design of HPC architectures and programming models must be influenced by their suitability for these (and related) kernels. • The extreme need for concurrency and novel architectural models require a fundamental re-examination of conventional solvers.

  5. Project Goals • Develop a generalization of direct and iterative solvers: the Spike polyalgorithm. • Implement this generalization on various architectures (multicore, multicore SMP, multicore SMP aggregates) and programming models (PGAS, messaging APIs). • Analytically quantify performance and project it to petascale platforms. • Compare relative performance, identify architecture/programming-model features, and guide algorithm/architecture/programming-model co-design.

  6. Background • Personnel: • Ahmed Sameh, Samuel Conte Chair in Computer Science, has worked on the development of parallel sparse solvers for four decades. • Ananth Grama, Professor and University Scholar, has worked on both the numerical aspects of sparse solvers and analytical frameworks for parallel systems. • (Postdoctoral Researcher, to be named)* will be primarily responsible for implementation and benchmarking. *We have identified three candidates for this position and will shortly hire one of them.

  7. Background • Technical • We have built extensive infrastructure for parallel sparse solvers, including the Spike parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners. • We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 blade servers, BlueGene/L, and the Cray X1.

  8. Background • Technical (continued) • We have begun installing Co-Array Fortran and Unified Parallel C on our machines and porting our toolkits to these PGAS languages. • We have extensive experience in analyzing the performance and scalability of parallel algorithms, including the development of the isoefficiency metric for scalability.

  9. Technical Highlights • The SPIKE Toolkit • (Dr. Sameh, could you include a few slides here).
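Since the SPIKE slides above are still a placeholder, here is a minimal two-partition sketch of the SPIKE idea for a banded system: each partition factors its diagonal block independently, the off-diagonal coupling blocks generate "spike" columns, and a small reduced system at the partition interface ties the pieces together. The function and test problem below are illustrative only; dense NumPy solves stand in for the banded, parallel kernels of the actual toolkit, and none of the names reflect the toolkit's API.

```python
# Illustrative two-partition SPIKE sketch (hypothetical helper, not the toolkit API).
import numpy as np

def spike_two_partitions(A, f, n1, b):
    """Solve A x = f, where A is banded with half-bandwidth b and is split
    into two diagonal partitions of sizes n1 and n - n1."""
    A1, A2 = A[:n1, :n1], A[n1:, n1:]
    A12, A21 = A[:n1, n1:], A[n1:, :n1]           # coupling blocks (corner nonzeros only)
    f1, f2 = f[:n1], f[n1:]

    # Spikes: the only nonzero columns of A1^{-1} A12 and A2^{-1} A21.
    V = np.linalg.solve(A1, A12[:, :b])           # n1 x b
    W = np.linalg.solve(A2, A21[:, -b:])          # n2 x b
    g1 = np.linalg.solve(A1, f1)
    g2 = np.linalg.solve(A2, f2)

    # Reduced system: couples the bottom b unknowns of partition 1 with the
    # top b unknowns of partition 2 (the only inter-partition coupling).
    R = np.block([[np.eye(b), V[-b:, :]],
                  [W[:b, :],  np.eye(b)]])
    y = np.linalg.solve(R, np.concatenate([g1[-b:], g2[:b]]))
    x1_b, x2_t = y[:b], y[b:]

    # Back-substitute through the spikes to recover the remaining unknowns.
    x1 = g1 - V @ x2_t
    x2 = g2 - W @ x1_b
    return np.concatenate([x1, x2])

# Small banded test problem (made up), checked against the exact solution.
rng = np.random.default_rng(0)
n, b = 12, 2
A = np.triu(np.tril(rng.standard_normal((n, n)), b), -b) + 10.0 * np.eye(n)
f = rng.standard_normal(n)
x = spike_two_partitions(A, f, n1=n // 2, b=b)
print("residual norm:", np.linalg.norm(A @ x - f))   # should be near machine precision
```

In the toolkit the per-partition solves are distributed across processors and the small reduced system is the main point of inter-partition communication, which is what makes the scheme attractive at scale.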

  10. Technical Highlights • Analysis of Scaling Properties • In early work, we developed the Isoefficiency metric for scalability. • With the likely scenario of utilizing up to 100K processing cores, this work becomes critical. • Isoefficiency quantifies the performance of a parallel system (a parallel program and the underlying architecture) as the number of processors is increased.

  11. Technical Highlights • Isoefficiency Analysis • The efficiency of any parallel program on a fixed problem instance decreases as the number of processors increases. • For a family of parallel programs (formally referred to as scalable programs), increasing the problem size increases efficiency.

  12. Technical Highlights • Isoefficiency is the rate at which problem size must be increased w.r.t. number of processors, to maintain constant efficiency. • This rate is critical, since it is ultimately limited by total memory size. • Isoefficiency is a key indicator of a program’s ability to scale to very large machine configurations. • Isoefficiency analysis will be used extensively for performance projection and scaling properties.
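In the standard formulation of the metric (the notation below is introduced here, not taken from the slides), the relationship can be written as:

```latex
% W       : problem size, measured as serial work
% T_o(W,p): total overhead (communication, idling, excess work) on p processors
\[
  T_P = \frac{W + T_o(W,p)}{p},
  \qquad
  E = \frac{W}{p\,T_P} = \frac{1}{1 + T_o(W,p)/W}.
\]
% Holding E constant forces the problem size to grow with p as
\[
  W = K\,T_o(W,p), \qquad K = \frac{E}{1-E},
\]
% which is the isoefficiency function. For example, adding n numbers on p
% processors has T_o(W,p) = \Theta(p \log p), so W must grow as
% \Theta(p \log p) to keep efficiency constant.
```

A small isoefficiency function (such as the $\Theta(p\log p)$ in the example) indicates a highly scalable program, while a large one means that scaling to very large machines requires impractically large problems, which is the sense in which the rate is ultimately limited by total memory size.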

  13. Architecture • We target the following currently available architectures • IBM JS20/21 and BlueGene/L platforms • Cray X1/XT3 • AMD Opteron multicore SMP and SMP clusters • Intel Xeon multicore SMP and SMP clusters • These platforms represent a wide range of currently available architectural extremes.

  14. Implementation • Current implementations are MPI-based. • The Spike toolkit (iterative as well as direct solvers) will be ported to • POSIX threads and OpenMP • UPC and CAF • Titanium and X10 (if releases are available) • These implementations will be comprehensively benchmarked across platforms.

  15. Benchmarks/Metrics • We aim to formally specify a number of benchmark problems (sparse systems arising in structures, CFD, and fluid-structure interaction) • We will abstract architecture characteristics – processor speed, memory bandwidth, link bandwidth, bisection bandwidth. • We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and projected performance to petascale systems.
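As a concrete (hypothetical) illustration of how the last of these quantities might be tabulated from raw measurements, the sketch below converts wall-clock timings at several processor counts into speedup and parallel efficiency; the function name and sample numbers are made up.

```python
# Hypothetical helper for the benchmarking plan above: tabulate speedup and
# parallel efficiency from measured wall-clock times. Sample data is invented.

def efficiency_table(timings, base_procs=None):
    """timings: dict mapping processor count -> wall-clock seconds."""
    procs = sorted(timings)
    p0 = base_procs if base_procs is not None else procs[0]
    t0 = timings[p0]
    rows = []
    for p in procs:
        speedup = (t0 * p0) / timings[p]      # true speedup when the base run uses 1 processor
        rows.append((p, timings[p], speedup, speedup / p))
    return rows

if __name__ == "__main__":
    sample = {1: 640.0, 4: 170.0, 16: 48.0, 64: 15.5}   # illustrative timings (seconds)
    print(f"{'p':>4} {'time (s)':>10} {'speedup':>9} {'efficiency':>11}")
    for p, t, s, e in efficiency_table(sample):
        print(f"{p:>4} {t:>10.1f} {s:>9.2f} {e:>11.2f}")
```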

  16. Progress/Accomplishments • Implementation of the parallel Spike polyalgorithm toolkit • Incorporation of a number of direct solvers (SuperLU, MUMPS) and iterative solvers (preconditioned Krylov subspace methods) into Spike • Evaluation of Spike on IBM SP and Intel multicore platforms, and its integration into the Intel MKL library.

  17. Milestones • Final deliverable: comprehensive evaluation of the scaling properties of existing (and new) solvers. • Six-month target: comparative performance of solvers on multicore SMPs and clusters. • Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on the Cray X1, BlueGene, and JS20/21.

  18. Financials • The total cost of this project is approximately $150K for its one-year duration. • The budget primarily covers a post-doctoral researcher's salary/benefits and partial summer support for the PIs. • Together, these three project personnel are responsible for accomplishing project milestones and reporting.

  19. Concluding Remarks • This project takes a comprehensive view of sparse linear system solvers and their suitability for petascale HPC systems. • Its results will directly influence the ongoing and future development of HPC systems. • A number of major challenges are likely to emerge, both from this project and from impending architectural innovations.

  20. Concluding Remarks • Architectural features include • Scalable multicore platforms: 64 to 128 cores on the horizon • Heterogeneous multicore: cores are likely to be heterogeneous, some with floating-point units, others with vector units, and yet others with programmable hardware (such chips are already common in cell phones) • Significantly higher pressure on the memory subsystem

  21. Concluding Remarks • Impact of architectural features on algorithms and programming models. • Affinity scheduling is important for performance – need to specify tasks that must be co-scheduled (suitable programming abstractions needed). • Programming constructs for utilizing heterogeneity.

  22. Concluding Remarks • Impact of architectural features on algorithms and programming models. • FLOPs are cheap and memory references are expensive; explore new families of algorithms that minimize the latter. • Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency). • Many of these optimizations are likely to be beyond the technical reach of application programmers, creating a need for scalable library support. • Increased emphasis on scalability analysis.
