
What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?

Shengxin Zhu, The University of Oxford; Prof. Xingping Liu and Prof. Tongxiang Gu, National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics.


Presentation Transcript


  1. What is the most important kernel of sparse linear solvers for heterogeneous supercomputers?
  Shengxin Zhu, The University of Oxford
  Prof. Xingping Liu and Prof. Tongxiang Gu, National Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics
  SNSCC'12, shengxin.zhu@maths.ox.ac.uk

  2. Outline
  • Brief introduction to heterogeneous supercomputers
  • Computational kernels of Krylov methods
  • Influence of communication
  • Case study: GPBiCG(m,l)
  • Challenging problems
  • Conclusion

  3. Introduction to heterogeneous supercomputers
  • Dawning5000A
  • Nodes:
  • Bandwidth:
  • Memory:

  4. Computational kernels of Krylov methods
  • Vector update: parallel in nature.
  • Mat-vec: computation-intensive; handled with multi-core technology (CUDA/OpenMP).
  • Inner product: communication-intensive (CPU/MPI). A sketch follows.
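
  To make the contrast concrete, here is a minimal C/MPI sketch (an illustration, not code from the slides) of the two extreme kernels on a block-row-distributed vector: the vector update touches only local data, while the inner product ends in a global reduction that synchronizes all processes. The mat-vec (not shown) sits in between, needing only neighbour-to-neighbour exchanges.

    #include <mpi.h>

    /* Vector update (axpy): purely local, no communication at all. */
    void axpy(int n_loc, double alpha, const double *x, double *y) {
        for (int i = 0; i < n_loc; ++i)
            y[i] += alpha * x[i];
    }

    /* Inner product: a local partial sum followed by a global
     * reduction -- every process must stop and synchronize here. */
    double dot(int n_loc, const double *x, const double *y, MPI_Comm comm) {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_loc; ++i)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }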

  5. Influence of communication: a first glance
  Computation is cheap; communication is expensive.
  Based on Aztec, by Prof. Tuminaro et al. at Sandia. S. Zhu, MSc thesis, CAEP, 2010.

  6. The real reason communication is time-consuming
  • A small workshop: focused, so little preparation time.
  • A conference: diverse, so much more preparation time.
  (Local, point-to-point exchanges are the workshops; global reductions are the conferences.)

  7. Strategies for minimizing communication
  • Replace the dot product with something else (semi-Chebyshev): workshops only, no conferences if possible. Inner-product-free methods: Gu, Liu and Mo (2002).
  • Reorganize the algorithm (reduce the number of conferences and let each conference accept more talks): residual replacement strategies due to van der Vorst (2000s); CA-KSMs, Demmel et al. (2008).
  • Overlap communication with computation (see the sketch after this list).
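
  For the third strategy, a minimal sketch of one way to realize the overlap, using the MPI-3 nonblocking all-reduce (the slides do not fix an API; MPI_Iallreduce is one natural choice): start the reduction, do any computation that does not depend on its result, then wait.

    #include <mpi.h>

    /* local_work() stands for any computation independent of the
     * reduction result, e.g. part of the next mat-vec. */
    double overlapped_dot(double local, MPI_Comm comm,
                          void (*local_work)(void)) {
        double global;
        MPI_Request req;
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);
        local_work();                      /* hides the reduction latency */
        MPI_Wait(&req, MPI_STATUS_IGNORE); /* result is needed from here  */
        return global;
    }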

  8. A case study: parallelizing GPBiCG(m,l) (S. Fujino, 2002)
  • GPBiCG(1,0) reduces to BiCGSTAB
  • GPBiCG(0,1) reduces to GPBiCG
  • GPBiCG(1,1) reduces to BiCGSTAB2
  The family could be used to design breakdown-free BiCGSTAB methods.

  9. GPBiCG(m,l) (S. Fujino, 2002)

  10. GPBiCG(m,l) (S. Fujino, 2002)

  11. Algorithm Design of the PGPBiCG(m,l) Method

  12. The PGPBiCG(m,l) method (reducing the number of global communications)
  Algorithm reconstruction: the three global communications (global synchronization points) per iteration are merged into a single one.
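
  The reconstruction idea can be sketched as follows (variable names are illustrative, not taken from the algorithm): compute all the partial sums the iteration needs, pack them into one buffer, and pay for a single all-reduce instead of three.

    #include <mpi.h>

    /* Three inner products fused into one global synchronization. */
    void fused_dots(int n_loc, const double *r, const double *s,
                    const double *t, double out[3], MPI_Comm comm) {
        double local[3] = {0.0, 0.0, 0.0};
        for (int i = 0; i < n_loc; ++i) {
            local[0] += r[i] * r[i];   /* (r, r) */
            local[1] += r[i] * s[i];   /* (r, s) */
            local[2] += s[i] * t[i];   /* (s, t) */
        }
        /* One all-reduce carries all three results. */
        MPI_Allreduce(local, out, 3, MPI_DOUBLE, MPI_SUM, comm);
    }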

  13. Performance
  Based on Aztec, by Prof. R. S. Tuminaro et al. at Sandia.

  14. Convergence analysis
  • Residual replacement strategies
  • Backward-stability analysis

  15. Challenging problem: computing dot products accurately
  • Why: "Mindless" by Kahan.
  • Accurate inner-product computation: Ogita, Rump and Oishi, "Accurate Sum and Dot Product", SIAM J. Sci. Comput., 2005; cited 188 times. (but) ....
  • The PLASMA team.
  • Backward-stability analysis of residual replacement methods: Carson and Demmel, "A residual replacement strategy for improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods", April 20, 2012.
  • Reliable dot-product computation algorithms (a sketch follows).
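
  In the spirit of the Ogita-Rump-Oishi Dot2 algorithm cited above, here is a sketch of a compensated sequential dot product (an illustration of the technique, not the paper's exact code): each product and each addition is split into its rounded value plus an exact error term, and the accumulated errors are added back at the end, roughly doubling the working precision.

    #include <math.h>

    double dot2(int n, const double *x, const double *y) {
        double p = 0.0, e = 0.0;              /* running sum, error term */
        for (int i = 0; i < n; ++i) {
            double h  = x[i] * y[i];
            double r1 = fma(x[i], y[i], -h);  /* exact product error     */
            double s  = p + h;                /* TwoSum of p and h:      */
            double z  = s - p;
            double r2 = (p - (s - z)) + (h - z); /* exact addition error */
            p  = s;
            e += r1 + r2;                     /* accumulate corrections  */
        }
        return p + e;
    }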

  16. Conclusion
  • Avoiding communication.
  • Reliable computation.
  • Inner-product computation is very likely to be the most challenging kernel for heterogeneous HPC, while mat-vec is important for both...
  • Software abstraction and thread programming are helpful; together with redesigned algorithms they will do better.
  Math/algorithm, CS/performance, and applications interfaces: Aztec; pOSKI (Parallel Optimized Sparse Kernel Interface library, v1.0, May 2, 2012); Hypre, PETSc, Trilinos.

  17. Thanks!

  18. Initial study of communication complexity
  When more than ten thousand processors are connected by a network, global communication becomes a more and more serious bottleneck.

  19. Methods in the literature (based on the former two strategies)
  • de Sturler and van der Vorst: parallel GMRES(m) and CG methods (1995)
  • Bücker and Sauren: a parallel QMR method (1997)
  • Yang and Brent: improved CGS, BiCG and BiCGSTAB methods (2002-03)
  • Gu, Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and PQMRCGSTAB methods (2004-2010)
  • Demmel et al.: CA-KSMs (2008-)
  • Gu, Liu and Mo: MSD-CG, the multiple-search-direction conjugate gradient method (2004), which replaces the inner-product computations by solving small linear systems and thereby eliminates global inner products completely. The idea was generalized to MPCG by Greif and Bridson (2006).

  20. Comparison of the computational counts of the two algorithms

  21. Comparison of the computational counts of the two algorithms

  22. Mathematical model of the time consumption
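
  The transcript does not reproduce this slide's formulas. As an assumption, a standard latency-bandwidth model of the per-iteration time on p processors has the shape

    T(p) = \frac{T_{\mathrm{flop}}}{p} + k\,(t_s + t_w m)\,\log_2 p

  where T_flop is the local arithmetic work, t_s the message startup (latency) cost, t_w the per-word transfer cost, m the message length, and k the number of global reductions per iteration. The computation term shrinks with p while the communication term grows, which is exactly why reducing k, as PGPBiCG(m,l) does, pays off at scale.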

  23. Scalability analysis

  24. P_opt: the optimal number of processors
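
  Under the model sketched after slide 22 (again an assumption about the slide's content, not a reconstruction of it), the optimal processor count follows by minimizing T(p) for fixed m:

    \frac{dT}{dp} = -\frac{T_{\mathrm{flop}}}{p^2} + \frac{k\,(t_s + t_w m)}{p \ln 2} = 0
    \quad\Longrightarrow\quad
    P_{\mathrm{opt}} = \frac{T_{\mathrm{flop}} \ln 2}{k\,(t_s + t_w m)}

  so a smaller communication coefficient k (fewer global synchronizations per iteration) pushes the optimum toward more processors.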

  25. Convergence Analysis

  26. Numerical Experiments: timing and improvements

  27. Numerical Experiments: Speedup

  28. Conclusions
  • The PGPBiCG(m,l) method is more scalable and more parallel for solving large sparse unsymmetric linear systems on distributed parallel architectures.
  • Performance and isoefficiency analyses and numerical experiments have been carried out for the PGPBiCG(m,l) and GPBiCG(m,l) methods.
  • The parallel communication performance can be improved by a factor larger than 3.
  • The PGPBiCG(m,l) method has better parallel speedup than the GPBiCG(m,l) method.
  • For further performance improvements: overlapping computation with communication; numerical stability.
