PARALLEL PROCESSING

The NAS Parallel Benchmarks
Daniel Gross, Chen Haiout

NASA (NAS Division) Aims
• NAS: the NASA Advanced Supercomputing Division
• Develop, demonstrate, and deliver innovative computing capabilities to enable NASA projects and missions
• Demonstrate by the next millennium an operational computing system capable of simulating, in one to several hours, an entire aerospace vehicle system throughout its mission and life cycle

NPB Introduction
• The NAS Parallel Benchmarks (NPB) suite has been widely used to evaluate modern parallel systems
• It aims to measure objectively the performance of highly parallel computers and to compare it with that of conventional supercomputers
• NPB is based on Fortran 77 and the MPI message-passing standard
• It consists of eight benchmark problems derived from important classes of aerophysics applications

Benchmark Problems
• EP: Embarrassingly Parallel
• IS: Integer Sort
• CG: Conjugate Gradient
• MG: Multigrid method for the Poisson equation
• FT: Spectral method (FFT) for the Laplace equation
• BT: ADI; Block-Tridiagonal systems
• SP: ADI; Scalar Pentadiagonal systems
• LU: Lower-Upper symmetric Gauss-Seidel

The Embarrassingly Parallel Benchmark (EP)
In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers. The problem requires almost no communication, so in some sense it provides an estimate of the upper achievable limit for floating-point performance on a particular system.
• Scalar Pentadiagonal (SP) Benchmark: multiple independent systems of non-diagonally dominant, scalar pentadiagonal equations are solved. A complete solution of SP requires 400 iterations.
• MultiGrid (MG) Benchmark: MG uses a multigrid method to compute the solution of the three-dimensional scalar Poisson equation. This code is a good test of both short- and long-distance highly structured communication.
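To make the "almost no communication" point concrete, here is a minimal sketch of an EP-style computation. It is not the NPB reference code: Python's `random` module stands in for the benchmark's own generator, the polar (acceptance-rejection) method is one common way to obtain Gaussian pairs, and the names `ep_chunk` and `ep_combine`, the bin count, and the chunk sizes are all illustrative assumptions.

```python
import math
import random

def ep_chunk(n_pairs, seed, n_bins=10):
    """Work done by one (hypothetical) processor: no communication needed."""
    rng = random.Random(seed)  # stand-in generator, not the NPB one
    counts = [0] * n_bins
    sum_x = 0.0
    sum_y = 0.0
    for _ in range(n_pairs):
        # Uniform point in the square (-1, 1) x (-1, 1).
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if t > 1.0 or t == 0.0:
            continue  # reject points outside the unit circle
        # Polar method: turn the accepted point into two Gaussian deviates.
        scale = math.sqrt(-2.0 * math.log(t) / t)
        gx, gy = x * scale, y * scale
        sum_x += gx
        sum_y += gy
        # Two-dimensional statistic: which square annulus the pair falls in.
        ring = int(max(abs(gx), abs(gy)))
        if ring < n_bins:
            counts[ring] += 1
    return counts, sum_x, sum_y

def ep_combine(partials):
    """The only 'communication' step: one reduction over the partial results."""
    n_bins = len(partials[0][0])
    counts = [sum(p[0][i] for p in partials) for i in range(n_bins)]
    return counts, sum(p[1] for p in partials), sum(p[2] for p in partials)

# Four independent chunks, simulated sequentially here; each could run
# on its own processor with a different seed.
partials = [ep_chunk(25_000, seed=rank) for rank in range(4)]
counts, sum_x, sum_y = ep_combine(partials)
print(counts, sum_x, sum_y)
```

Each chunk touches only its own data, so the run time is dominated by floating-point work; the single reduction at the end is the only place where processors would have to exchange anything.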

3-D FFT PDE (FT) Benchmark
FT contains the computational kernel of a three-dimensional FFT-based spectral method.
• BT Simulated CFD Benchmark: BT solves systems of equations resulting from an approximately factored finite-difference discretization of the Navier-Stokes equations.

Class Benchmarks
• Since the 1991 specification of NPB 1.0, computer speeds and memory sizes have grown, and representative problem sizes have grown correspondingly.
• NPB 1.0 specifies two problem sizes for each benchmark: class “A” and a larger class “B”.
• The class A benchmarks can now be run on a moderately powerful workstation, and class B benchmarks on high-end workstations or small parallel systems.
• To retain the focus on high-end supercomputing, we now add a class “C” for all of the NAS benchmarks.

Weak Points
• Implementations of the NAS benchmarks are usually highly tuned by computer vendors
• The largest problems (class B) no longer reflect the largest problems being run on present-day supercomputers

Why 8 Different Benchmarks?

Comparing World Wide Clusters
• Loki and Hyglac
In September 1996, two medium-scale parallel systems called “Loki” and “Hyglac” were installed. Each consisted of sixteen Pentium Pro (200 MHz) PCs with 16 Mbytes of memory and 3.2 and 2.5 Gbytes of disk per node, respectively. Each system was integrated using two Fast Ethernet NICs in each node. Both sites ran a complex N-body gravitational simulation of 2 million particles using an advanced tree-code algorithm. The two systems achieved sustained performance of 1.19 Gflops and 1.26 Gflops, respectively. When the systems were connected together, the same code was run again and sustained over 2 Gflops without further optimization for the new configuration.
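The combined result can be put in perspective with a little arithmetic. The sketch below only uses the figures quoted above; since the combined rate is given as "over 2 Gflops", the efficiency it prints is a lower bound, and the variable names are illustrative.

```python
# Sustained rates reported for the two clusters, in Gflops.
loki_gflops = 1.19
hyglac_gflops = 1.26

# "Over 2 Gflops" when connected: treat 2.0 as a lower bound.
combined_lower_bound = 2.0

# Ideal case: the joined system delivers the sum of both rates.
ideal_gflops = loki_gflops + hyglac_gflops

# Fraction of the ideal rate that was at least sustained.
efficiency = combined_lower_bound / ideal_gflops

print(f"ideal: {ideal_gflops:.2f} Gflops")
print(f"efficiency: at least {efficiency:.0%}")
```

In other words, joining the clusters kept at least roughly four fifths of the ideal summed rate even though the code was not retuned for the larger configuration.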

Berkeley NOW
The hardware configuration of the Berkeley NOW (Network of Workstations) system comprises 105 Sun Ultra 170 workstations connected by Myricom networks. Each node includes a 167 MHz Ultra 1 microprocessor with 512 KB of cache, 128 MB of RAM, and two 2.3 GB disks.

Cray T3E
The Cray T3E-1200 is a scalable shared-memory multiprocessor based on the DEC Alpha 21164 microprocessor. It provides a shared physical address space for up to 2048 processors over a 3D torus interconnect. Each node of the system contains one Alpha 21164 processor capable of 1200 Mflops. The system logic runs at 75 MHz, and the processor runs at a multiple of this, 600 MHz in the case of the Cray T3E-1200. Torus links provide a raw bandwidth of 650 MBps in each direction to maintain system balance with the faster processors and memory.
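As an illustration of what a 3D torus interconnect means, the sketch below computes a node's six neighbours and the minimum hop count between two nodes, with wraparound in every dimension. The 8 x 16 x 16 shape is an assumption chosen only because it multiplies out to 2048; the source does not state the actual dimensions, and the function names are hypothetical.

```python
def torus_neighbors(node, dims):
    """Return the six direct neighbours of a node in a 3D torus."""
    x, y, z = node
    nx, ny, nz = dims
    # The modulo provides the wraparound links that turn a mesh into a torus.
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

def torus_hops(a, b, dims):
    """Minimum number of link traversals between nodes a and b."""
    # In each dimension, go whichever way around the ring is shorter.
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))

dims = (8, 16, 16)  # hypothetical shape: 8 * 16 * 16 = 2048 nodes
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (7, 15, 15), dims))
print(torus_hops((0, 0, 0), (4, 8, 8), dims))
```

Thanks to the wraparound links, the "opposite corner" of the machine is only a few hops away, and the worst case is half the ring length in each dimension.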

NPB Graph Results

The Dwarves – Hardware
• Old PII processors at 300 MHz (will be removed soon)
• 8 PIII processors at 450 MHz
• 4 PIII processors at 733 MHz
• The new machines: dual AMD Athlon(tm) MP 2000+ at 1,666 MHz with 1 GB of memory

In The Next 2 Weeks
• Install NPB 2.2 on the Dwarves cluster
• Run the benchmark tests on the Dwarves cluster
• Run tests on several different configurations (different numbers of dwarves)
• Estimate network bandwidth and latency
• Compare the Dwarves cluster performance to that of similar clusters around the world
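One common way to estimate bandwidth and latency is to time messages of several sizes and fit the simple model time = latency + size / bandwidth. The sketch below shows only the fitting step, under that assumed model; the timings are synthetic values generated from made-up parameters (100 microseconds, 10 MB/s), not measurements from the Dwarves cluster, and `fit_latency_bandwidth` is a hypothetical helper name.

```python
def fit_latency_bandwidth(sizes_bytes, times_s):
    """Least-squares fit of time = latency + size / bandwidth."""
    n = len(sizes_bytes)
    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_bytes, times_s))
    slope = sxy / sxx                      # seconds per byte
    latency = mean_y - slope * mean_x      # intercept: time for a 0-byte message
    bandwidth = 1.0 / slope                # bytes per second
    return latency, bandwidth

# Synthetic one-way timings built from assumed parameters; real input would
# come from a ping-pong test between two nodes.
true_latency = 100e-6      # assumed, in seconds
true_bandwidth = 10e6      # assumed, in bytes per second
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [true_latency + s / true_bandwidth for s in sizes]

latency, bandwidth = fit_latency_bandwidth(sizes, times)
print(f"latency ~ {latency * 1e6:.1f} us, bandwidth ~ {bandwidth / 1e6:.2f} MB/s")
```

With real measurements the points will not sit exactly on a line, so it helps to repeat each message size many times and fit against the minimum or median time.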

Questions will not be answered !!! GOOD NIGHT
