
Benchmarks for Parallel Systems



  1. Benchmarks for Parallel Systems Sources/Credits: • “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps • http://www.top500.org (courtesy: Jack Dongarra, Top500) • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html • “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

  2. LINPACK (Dongarra: 1979) • Dense system of linear equations • Initially appeared as an appendix to the users’ guide for the LINPACK package • LINPACK – 1979 • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark

  3. LINPACK benchmark • Implemented on top of BLAS1 • 2 main operations – DGEFA (LU factorization by Gaussian elimination, O(n³)) and DGESL (solving Ax = b by substitution, O(n²)) • Major operation (97% of the time) – DAXPY: y = y + α·x • Performs n³/3 + n² multiply-add pairs, hence about 2n³/3 + 2n² flops • 64-bit floating point arithmetic

  4. LINPACK • N=100: a 100x100 system of equations. No changes to the code are allowed; the user only supplies a timing routine called SECOND. No hand optimization of the code • N=1000: a 1000x1000 system – the user may implement any algorithm, provided it delivers the required accuracy: Toward Peak Performance (TPP). The driver program always charges 2n³/3 + 2n² operations • “Highly Parallel Computing” benchmark – any software, and the matrix size can be chosen freely. Used in Top500 • All based on 64-bit floating point arithmetic

  5. LINPACK • 100x100 – inner loop optimization • 1000x1000 – three-loop/whole program optimization • Scalable parallel program – Largest problem that can fit in memory

  6. HPL (High Performance LINPACK)

  7. HPL Algorithm • 2-D block-cyclic data distribution • Right-looking LU factorization • Panel factorization: various options – Crout, left-looking, or right-looking recursive variants based on matrix multiply – number of sub-panels – recursive stopping criterion – pivot search and broadcast by binary exchange

  8. HPL algorithm • Panel broadcast: several ring-based topologies to choose from • Update of trailing matrix: look-ahead pipeline • Validity check: the scaled residual of the computed solution should be O(1)

  9. Top500 (www.top500.org) • Top500 – started in 1993 • Updated twice a year – June and November • For each system, Top500 reports Nmax (problem size giving the best performance), Rmax (performance achieved at Nmax), N1/2 (problem size achieving half of Rmax), and Rpeak (theoretical peak performance)

  10. 24th List: The TOP 5

  11. 24th List: India

  12. Manufacturer

  13. Architecture

  14. Processor Generation

  15. System Processor Count

  16. NAS Parallel Benchmarks - NPB • Also used for evaluation of supercomputers • A set of 8 programs derived from CFD (computational fluid dynamics) applications • 5 kernels, 3 pseudo-applications • NPB 1 – original benchmarks • NPB 2 – NAS’s MPI implementation; NPB 2.4 Class D has more work and more I/O • NPB 3 – OpenMP, HPF, and Java versions • GridNPB3 – for computational grids • NPB 3 multi-zone – for hybrid parallelism

  17. NPB 1.0 (March 1994) • Defines Class A and Class B versions • “Paper and pencil” algorithmic specifications • Generic benchmarks, as compared to the MPI-based LINPACK • General rules for implementations – Fortran 90 or C, 64-bit arithmetic, etc. • Sample implementations provided

  18. Kernel Benchmarks • EP – embarrassingly parallel • MG – multigrid. Regular communication • CG – conjugate gradient. Irregular long-distance communication • FT – a 3-D PDE solved using FFTs. Rigorous test of long-distance communication • IS – large integer sort • Detailed rules regarding: – brief statement of the problem – algorithm to be used – validation of results – where to insert timing calls – method for generating random numbers – submission of results

  19. Pseudo applications / Synthetic CFDs • Benchmark 1 – perform a few iterations of the approximate factorization algorithm (SP) • Benchmark 2 – perform a few iterations of the diagonal form of the approximate factorization algorithm (BT) • Benchmark 3 – perform a few iterations of SSOR (LU)

  20. Class A and Class B [table of problem sizes for Class A, Sample Code, and Class B omitted]

  21. NPB 2.0 (1995) • MPI and Fortran 77 implementations • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT) • Class C – larger problem sizes • Benchmark rules based on how much source code is changed – 0%, up to 5%, or more than 5%

  22. NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003) • EP and IS added • FT rewritten • NPB 2.4 – Class D and a rationale for the Class D sizes • 2.4 I/O – a new benchmark problem based on BT (BTIO) to test output capabilities • An MPI implementation of the same (MPI-IO) – different options, e.g., with or without collective buffering

  23. Game, Set & Match!
