
Linux Cluster Production Readiness

Egan Ford

IBM

egan@us.ibm.com

egan@sense.net

Agenda
  • Production Readiness
  • Diagnostics
  • Benchmarks
  • STAB
  • Case Study
  • SCAB
What is Production Readiness?
  • Production readiness is a series of tests to help determine if a system is ready for use.
  • Production readiness falls into two categories:
    • diagnostic
    • benchmark
  • The purpose is to confirm that all hardware is good and identical (per class).
  • The search for consistency and predictability.
What are diagnostics?
  • Diagnostic tests are usually pass/fail and include but are not limited to
    • simple version checks
      • OS, BIOS versions
    • inventory checks
      • Memory, CPU, etc…
    • configuration checks
      • Is HT off?
    • vendor supplied diagnostics
      • DOS on a CD
Why benchmark?
  • Diagnostics are usually pass/fail.
    • Thresholds may be undocumented.
    • ‘Why’ is difficult to answer.
  • Diagnostics may be incomplete.
    • They may not test all subsystems.
  • Other issues with diagnostics:
    • False positives.
    • Inconsistent from vendor to vendor.
    • Do no real work, cannot check for accuracy.
    • Usually hardware based.
      • What about software?
      • What about the user environment?
Why benchmark?
  • Benchmarks can be checked for accuracy.
  • Benchmarks can stress all used subsystems.
  • Benchmarks can stress all used software.
  • Benchmarks can be measured and you can determine the thresholds.
Benchmark or diagnostics?
  • Do both.
  • All diagnostics should pass first.
  • Benchmarks will be inconsistent if diagnostics fail.
WARNING!
  • The following slides will contain the word ‘statistics’.
  • Statistics cannot prove anything.
  • Exercise common sense.
A few words on statistics
  • Statistics increases human knowledge through the use of empirical data.
  • "There are three kinds of lies: lies, damned lies and statistics." -- Benjamin Disraeli (1804-1881)
  • "There are three kinds of lies: lies, damned lies and linpack."
What is STAB?
  • STatistical Analysis of Benchmarks
  • A systematic way of running a series of increasingly complex benchmarks to find avoidable inconsistencies.
  • Avoidable inconsistencies may lead to performance problems.
  • GOAL: consistent, repeatable, accurate results.
What is STAB?
  • Each benchmark is run one or more times per node, then the best representative of each node (ignored for multi-node tests) is grouped together and analyzed as a single population.  The results themselves are not as interesting as the shape of their distribution.  Empirical evidence for all the benchmarks in the STAB HOWTO suggests that they should all form a normal distribution.
  • A normal distribution is the classic bell curve that appears so frequently in statistics.  It is the sum of smaller, independent (may be unobservable), identically-distributed variables or random events.
Uniform Distribution
  • The plot below is of 20,000 random dice rolls.
Normal Distribution
  • Sum of 5 dice thrown 10000 times.
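This effect is easy to reproduce. A minimal Python sketch (an illustration only, not part of the STAB tools) that sums 5 dice 10,000 times; the totals pile up around the mean instead of spreading uniformly:

```python
import random
from collections import Counter

random.seed(1)  # reproducible throws

# Sum of 5 dice thrown 10000 times: totals range from 5 to 30, mean 17.5.
totals = [sum(random.randint(1, 6) for _ in range(5)) for _ in range(10000)]
hist = Counter(totals)

# Crude text histogram: one '#' per 25 throws in each bin.
for total in range(5, 31):
    print(f"{total:2d} {'#' * (hist[total] // 25)}")
```

A single die gives a flat (uniform) histogram; summing five of them already produces a recognizable bell curve.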
Normal Distribution
  • Benchmarks also have many small independent (may be unobservable) identically-distributed variables that may affect performance, e.g.:
    • Competing processes
    • Context switching
    • Hardware interrupts
    • Software interrupts
    • Memory management
    • Process/Thread scheduling
    • Cosmic rays
  • The above may be unavoidable, but it is, in part, the source of a normal distribution.
Non-normal Distribution
  • Benchmarks may also have non-identically-distributed observable variables that may affect performance, e.g.:
    • Memory configuration
    • BIOS Version
    • Processor speed
    • Operating system
    • Kernel type (e.g. NUMA vs SMP vs UNI)
    • Kernel version
    • Bad memory (e.g. excessive ECCs)
    • Chipset revisions
    • Hyper-Threading or SMT
    • Non-uniform competing processes (e.g. httpd running on some nodes, but not others)
    • Shared library versions
    • Bad cables
    • Bad administrators
    • Users
  • The above are avoidable, and finding them is the purpose of the STAB HOWTO.  Avoidable inconsistencies may lead to multimodal or non-normal distributions.
STAB Toolkit
  • The STAB Tools are a collection of scripts to help run selected benchmarks and to analyze their results.
    • Some of the tools are specific to a particular benchmark.
    • Others are general and operate on the data collected by the specific tools.
  • Benchmark-specific tools comprise benchmark launch scripts, accuracy validation scripts, miscellaneous utilities, and analysis scripts that collect the data, report some basic descriptive statistics, and create input files to be used with the general STAB tools for additional analysis.
STAB Toolkit
  • With a goal of consistent, repeatable, accurate results, it is best to start with as few variables as possible.  Start with single-node benchmarks, e.g., STREAM.  If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies.  Next, work your way up to processor and disk benchmarks, then two-node (parallel) benchmarks, then multi-node (parallel) benchmarks.  After each more complicated benchmark, check for consistent, repeatable, accurate results before continuing.
The STAB Benchmarks
  • Single Node (serial) Benchmarks:
    • STREAM (memory MB/s)
    • NPB Serial (uni-processor FLOP/s and memory)
    • NPB OpenMP (multi-processor FLOP/s and memory)
    • HPL MPI Shared Memory (multi-processor FLOP/s and memory)
    • IOzone (disk MB/s, memory, and processor)
  • Parallel Benchmarks (for MPI systems only):
    • Ping-Pong (interconnect µsec and MB/s)
    • NAS Parallel (multi-node FLOP/s, memory, and interconnect)
    • HPL Parallel (multi-node FLOP/s, memory, and interconnect)
Getting STAB
  • http://sense.net/~egan/bench
    • bench.tgz
      • Code with source (all script)
    • bench-oss.tgz
      • OSS code (e.g. Gnuplot)
    • bench-examples.tgz
      • 1GB of collected data (all text, 186000+ files)
    • stab.pdf (currently 150 pages)
      • Documentation (WIP, check back before 11/30/2005)
Install STAB
  • Extract bench*.tgz into home directory:
    cd ~
    tar zxvf bench.tgz
    tar zxvf bench-oss.tgz
    tar zxvf bench-examples.tgz
  • Add STAB tools to PATH:
    export PATH=~/bench/bin:$PATH
  • Append to .bashrc:
    export PATH=~/bench/bin:$PATH
Install STAB
  • STAB requires Gnuplot 4 and it must be built a specific way:
    cd ~/bench/src
    tar zxvf gnuplot-4.0.0.tar.gz
    cd gnuplot-4.0.0
    ./configure --prefix=$HOME/bench --enable-thin-splines
    make
    make install
STAB Benchmark Tools
  • Each benchmark supported in this document contains an anal (short for analysis) script.  This script is usually run from an output directory, e.g.:
    cd ~/bench/benchmark/output
    ../anal
    benchmark nodes low high % mean median std dev
    bt.A.i686 4 615.77 632.08 2.65 627.85 632.02 8.06
    cg.A.i686 4 159.78 225.08 40.87 191.05 193.16 26.86
    ep.A.i686 4 11.51 11.53 0.17 11.52 11.52 0.01
    ft.A.i686 4 448.05 448.90 0.19 448.63 448.81 0.39
    lu.A.i686 4 430.60 436.59 1.39 433.87 434.72 2.51
    mg.A.i686 4 468.12 472.54 0.94 470.86 472.12 2.00
    sp.A.i686 4 449.01 449.87 0.19 449.58 449.72 0.39
  • The anal scripts produce statistics about the results to help find anomalies.  The theory is that if you have identical nodes then you should be able to obtain identical results (not always true).  The anal scripts will also produce plot.* files for use by dplot to graphically represent the distribution of the results, and by cplot to plot 2D correlations.
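The descriptive statistics in that output are easy to reproduce. A Python sketch (my reconstruction, not the actual anal script) in which % is the high-to-low spread as a percentage of the low result, which matches the sample rows above:

```python
import statistics

def describe(results):
    """anal-style descriptive statistics for one benchmark's per-node results."""
    low, high = min(results), max(results)
    return {
        "nodes": len(results),
        "low": low,
        "high": high,
        "%": round((high - low) / low * 100, 2),  # spread relative to the low
        "mean": round(statistics.mean(results), 2),
        "median": round(statistics.median(results), 2),
        "std dev": round(statistics.stdev(results), 2),
    }

# Hypothetical 4-node population; low, high, and % mirror the bt.A.i686 row.
print(describe([615.77, 631.96, 632.02, 632.08]))
```

As a sanity check, the cg.A.i686 row's 40.87% falls out of the same formula: (225.08 - 159.78) / 159.78 * 100.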
Rant: % vs. normal distribution
  • % is good?
    • % variability can tell you something about the data with respect to itself, without knowing anything about the data.
    • It is non-dimensional with a range (usually 0-100) that has meaning to anyone.
    • IOW, management understands percentages.
  • % is not good?
    • It minimizes the amount of useful empirical data.
    • It hides the truth.
% is not good, exhibit A
  • Clearly this is a normal distribution, but the variability is 500%.  This is an extreme case where all the possible values exist for a predetermined range.
% is not good, exhibit B
  • Low variability can hide a skewed distribution.  Variability is low, only 1.27%.  But the distribution is clearly skewed to the right.
% is not good, exhibit C
  • A 5.74% variability hides a bimodal distribution.  Bimodal distributions are clear indicators that there is an observable difference between two different sets of nodes.
STAB General Analysis Tools
  • dplot is for plotting distributions.
    • All the graphical output used as illustrations in this document up to this point was created with dplot.
    • dplot provides a number of options for binning the data and analyzing the distribution.
  • cplot is for correlating the results between two different sets of results.
    • E.g., does poor memory performance correlate to poor application performance?
  • danal is very similar to the output provided by the custom anal scripts provided with each benchmark, but has additional output options.
    • You can safely discard any anal screen output because it can be recreated with danal and the resulting plot.benchmark file.
  • Each script will require one or more plot.benchmark files.
    • dplot and danal are less strict and will work with any file of numbers as long as the numbers are in the first column; subsequent columns are ignored.
    • cplot however requires the 2nd column; it is impossible to correlate two sets of results without an index.
dplot
  • The first argument to dplot must be the number of bins, auto, or whole.  auto (or a) will use the square root of the number of results to determine the bin sizes and is usually the best place to start.  whole (or w) should only be used if your results are whole numbers and if the data contains all possible values between low and high.  This is only useful for creating plots like the dice examples at the beginning of this document.
  • The second argument is the plotfile.  The plotfile must contain one value per line in the first column, subsequent columns are ignored.  The order of the data is unimportant.
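The auto rule is straightforward; a Python sketch of the binning idea (an illustration only, not dplot itself):

```python
import math

def auto_bin(values):
    """'auto' binning: use sqrt(number of results) bins between low and high."""
    nbins = max(1, round(math.sqrt(len(values))))
    low, high = min(values), max(values)
    width = (high - low) / nbins or 1.0  # guard against identical values
    counts = [0] * nbins
    for v in values:
        i = min(int((v - low) / width), nbins - 1)  # clamp the high endpoint
        counts[i] += 1
    return counts

# 100 results -> 10 bins; a flat input gives a flat histogram.
print(auto_bin(list(range(100))))
```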
dplot a plot.c.ppc64 --text

[dplot --text histogram of plot.c.ppc64: counts (0-108) on the left axis, relative frequency (0.00-0.22) on the right, results binned from 2023 to 2156 MB/s.  A tall peak sits near the low end with a smaller second cluster near the high end, i.e. a bimodal distribution.]

Abusing chi-squared

$ findn plot.c_omp.ppc64
X^2: 26.75, scale: 0.43, bins: 21, normal distribution probability: 14.30%
X^2: 13.29, scale: 0.25, bins: 12, normal distribution probability: 27.50%
X^2: 24.34, scale: 0.45, bins: 22, normal distribution probability: 27.70%
X^2: 22.04, scale: 0.41, bins: 20, normal distribution probability: 28.20%
X^2: 4.65, scale: 0.12, bins: 6, normal distribution probability: 46.00%
X^2: 8.68, scale: 0.21, bins: 10, normal distribution probability: 46.70%
X^2: 16.79, scale: 0.37, bins: 18, normal distribution probability: 46.90%
X^2: 12.52, scale: 0.29, bins: 14, normal distribution probability: 48.50%
X^2: 16.77, scale: 0.39, bins: 19, normal distribution probability: 53.90%
X^2: 8.55, scale: 0.23, bins: 11, normal distribution probability: 57.50%
X^2: 12.33, scale: 0.31, bins: 15, normal distribution probability: 58.00%
X^2: 13.25, scale: 0.33, bins: 16, normal distribution probability: 58.30%
X^2: 2.84, scale: 0.1, bins: 5, normal distribution probability: 58.40%
X^2: 10.22, scale: 0.27, bins: 13, normal distribution probability: 59.70%
X^2: 6.27, scale: 0.19, bins: 9, normal distribution probability: 61.70%
X^2: 1.36, scale: 0.08, bins: 4, normal distribution probability: 71.60%
X^2: 11.28, scale: 0.35, bins: 17, normal distribution probability: 79.20%
X^2: 3.36, scale: 0.17, bins: 8, normal distribution probability: 85.00%
X^2: 2.27, scale: 0.14, bins: 7, normal distribution probability: 89.30%
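As I read it, findn bins the results at several bin counts and runs a chi-squared goodness-of-fit test against a normal curve fitted with the sample mean and standard deviation. A hedged pure-Python sketch of that idea (not the actual findn script, which also reports a scale and a probability):

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chi2_vs_normal(data, bins):
    """Chi-squared statistic of a histogram of data against a fitted normal."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    low, high = min(data), max(data)
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        counts[min(int((x - low) / width), bins - 1)] += 1  # clamp the max
    chi2 = 0.0
    for i, observed in enumerate(counts):
        lo, hi = low + i * width, low + (i + 1) * width
        expected = len(data) * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Near-normal data (sums of dice) should give a modest statistic.
random.seed(1)
near_normal = [sum(random.randint(1, 6) for _ in range(5)) for _ in range(1000)]
print(round(chi2_vs_normal(near_normal, 7), 2))
```

Converting the statistic into findn's "normal distribution probability" would additionally require the chi-squared survival function for the appropriate degrees of freedom.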

cplot
  • cplot or correlation plot is a perl front-end to Gnuplot to graphically represent the correlation between any two sets of indexed numbers.
  • Correlation measures the relationship between two sets of results, e.g. processor performance and memory throughput.
  • Correlations are often expressed as a correlation coefficient; a numerical value with a range from -1 to +1.
  • A positive correlation would indicate that if one set of results increased, the other set would increase, e.g. better memory throughput increases processor performance.
  • A negative correlation would indicate that if one set of results increases, the other set would decrease, e.g. better processor performance decreases latency.
  • A correlation of zero would indicate that there is no relationship at all, IOW, they are independent.
  • Any two sets of results with a non-zero correlation are considered dependent; however, a check should be performed to determine whether a dependent set of results is statistically significant.
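The coefficient itself is Pearson's r; a short Python sketch with hypothetical per-node numbers (cplot's own calculation may differ in detail):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two indexed result sets."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-node results: STREAM MB/s vs. an application score.
stream = [2031.4, 2050.2, 2069.0, 2101.7, 2148.0]
cg = [40.9, 41.2, 41.4, 43.8, 45.3]
print(round(pearson_r(stream, cg), 2))  # strongly positive
```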
cplot
  • A strong correlation between two sets of results should produce more questions, not quick answers.
  • It is possible for two unrelated results to have a strong correlation because they share something in common.
    • E.g., you can show a positive correlation between the sales of skis and snowboards.  It is unlikely that increased ski sales increased snowboard sales; the most likely cause is an increase in the snow depth (or a decrease in temperature) at your local resort, i.e., something they have in common.  The correlation is valid, but it does not prove the cause of the correlation.
Case Study
  • 484 JS20 blades
    • dual PPC970
    • 2GB RAM
  • Myrinet D
    • Full Bisection Switch
  • Cisco GigE
    • 14:1 oversubscribed
Diagnostics
  • Vendor supplied (passed)
  • BIOS versions (failed)
  • Inventory
    • Number of CPUs (passed)
    • Total Memory (failed)
  • OS/Kernel Versions (passed)
BIOS Versions (failed)
  • All nodes but node443 have BIOS dated 10/21/04. node443 is dated 09/02/2004.
  • Inconsistent BIOS versions can affect performance.
    Command output:
    # rinv compute all | tee /tmp/foo
    # cat /tmp/foo | grep BIOS | awk '{print $4}' | sort | uniq
    09/02/2004
    10/21/2004
    # cat /tmp/foo | grep BIOS | grep 09/02/2004
    node433: VPD BIOS: 09/02/2004
Memory quantity (failed)
  • All nodes except node224 have 2GB RAM.
    Command output:
    # psh compute free | grep Mem | awk '{print $3}' | sort | uniq
    1460116
    1977204
    1977208
    # psh compute free | grep Mem | grep 1460116
    node224: Mem: 1460116 ...
STREAM
  • The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
  • STREAM C, FORTRAN, and C OMP are run 10 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
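The best-of-N reduction can be sketched in a few lines of Python (hypothetical numbers; the real tools parse the files under the output.raw directories):

```python
# Each node runs the benchmark several times; only its best MB/s result
# represents it in the population that anal and dplot analyze.
raw = {  # node -> list of per-run results (hypothetical)
    "node001": [2060.1, 2071.9, 2069.4],
    "node002": [2031.4, 2045.0, 2040.2],
    "node003": [2147.9, 2139.6, 2144.1],
}

best = {node: max(runs) for node, runs in raw.items()}
for node, mbs in sorted(best.items()):
    print(node, mbs)
```

Taking the best run per node filters out transient noise (context switches, competing daemons) so that what remains reflects the hardware.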
STREAM validation results
  • node483 failed OMP test 3 of 10 for accuracy. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/stream/output.raw
    # ../checkresults
    checking stream_c_omp.ppc64.node483.3...failed
STREAM consistency results
# cd ~/bench/stream/output
# ../anal
stream results
benchmark nodes low high % mean median std dev
c.ppc64 484 2031.43 2147.98 5.74 2077.03 2069.02 23.20
c_omp.ppc64 484 1993.49 2124.24 6.56 2050.00 2050.51 22.86
f.ppc64 484 2007.16 2092.68 4.26 2039.20 2034.63 17.87

NAS Serial
  • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
  • The NAS Serial Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been taken out and they run on one processor.
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are run 5 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
NAS Serial validation results
  • node483 failed a number of tests. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-SER/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking cg.B.ppc64.node483.4...failed
    checking ep.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking lu.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed
NAS Serial consistency results
# cd ~/bench/NPB3.2/NPB3.2-SER/output
# ../anal
NPB Serial
benchmark nodes low high % mean median std dev
bt.B.ppc64 484 1077.69 1099.28 2.00 1087.60 1087.67 4.67
cg.B.ppc64 484 40.93 45.30 10.68 41.94 41.38 1.31
ep.B.ppc64 484 9.88 10.07 1.92 9.96 9.96 0.04
ft.B.ppc64 484 480.87 503.33 4.67 487.07 486.23 3.71
lu.B.ppc64 484 516.88 579.25 12.07 543.08 542.88 12.46
mg.B.ppc64 484 618.16 654.23 5.84 638.31 638.85 6.76
sp.B.ppc64 484 530.48 556.67 4.94 541.01 540.77 3.99

Statistically significant?
  • Command output:
    $ findc plot* | grep plot.c.ppc64
    0.13 0.13 00 plot.bt.B.ppc64 plot.c.ppc64
    0.62 0.62 00 plot.c.ppc64 plot.c_omp.ppc64
    0.93 0.93 00 plot.c.ppc64 plot.cg.B.ppc64
    0.19 0.19 00 plot.c.ppc64 plot.ep.B.ppc64
    0.89 0.89 00 plot.c.ppc64 plot.f.ppc64
    0.17 0.17 00 plot.c.ppc64 plot.ft.B.ppc64
    0.11 0.11 02 plot.c.ppc64 plot.lu.B.ppc64
    0.50 -0.50 00 plot.c.ppc64 plot.mg.B.ppc64
    0.05 -0.05 27 plot.c.ppc64 plot.sp.B.ppc64

NAS OMP
  • The NAS OpenMP Benchmarks are the same as the NAS Parallel Benchmarks except that the MPI calls have been replaced with OpenMP calls to run on multiple processors on a shared memory system (SMP).
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are run 5 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
NAS OMP validation results
  • node483 failed a number of tests. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-OMP/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking ft.B.ppc64.node483.5...failed
    checking lu.B.ppc64.node483.1...failed
    checking lu.B.ppc64.node483.3...failed
    checking lu.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.2...failed
    checking mg.B.ppc64.node483.3...failed
    checking mg.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.5...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed

NAS OMP consistency results
# cd ~/bench/NPB3.2/NPB3.2-OMP/output
# ../anal
NPB OpenMP
benchmark nodes low high % mean median std dev
bt.B.ppc64 484 1850.99 1898.65 2.57 1871.41 1870.45 9.25
cg.B.ppc64 484 67.31 73.30 8.90 68.96 68.44 1.49
ep.B.ppc64 484 19.69 20.36 3.40 19.88 19.88 0.09
ft.B.ppc64 484 593.39 615.77 3.77 604.74 604.61 4.06
lu.B.ppc64 484 739.30 820.71 11.01 773.09 772.05 16.76
mg.B.ppc64 484 751.40 819.38 9.05 792.03 797.10 15.26
sp.B.ppc64 484 722.73 824.39 14.07 745.99 747.33 8.51

Statistically significant?
  • Command output:
    $ findc plot* | grep plot.f.ppc64
    0.37 0.37 00 plot.bt.B.ppc64 plot.f.ppc64
    0.89 0.89 00 plot.c.ppc64 plot.f.ppc64
    0.64 0.64 00 plot.c_omp.ppc64 plot.f.ppc64
    0.77 0.77 00 plot.cg.B.ppc64 plot.f.ppc64
    0.07 -0.07 12 plot.ep.B.ppc64 plot.f.ppc64
    0.20 -0.20 00 plot.f.ppc64 plot.ft.B.ppc64
    0.29 -0.29 00 plot.f.ppc64 plot.lu.B.ppc64
    0.81 -0.81 00 plot.f.ppc64 plot.mg.B.ppc64
    0.65 -0.65 00 plot.f.ppc64 plot.sp.B.ppc64
    $ findc plot* | grep plot.c_omp.ppc64
    0.29 0.29 00 plot.bt.B.ppc64 plot.c_omp.ppc64
    0.62 0.62 00 plot.c.ppc64 plot.c_omp.ppc64
    0.54 0.54 00 plot.c_omp.ppc64 plot.cg.B.ppc64
    0.03 -0.03 51 plot.c_omp.ppc64 plot.ep.B.ppc64
    0.64 0.64 00 plot.c_omp.ppc64 plot.f.ppc64
    0.06 -0.06 19 plot.c_omp.ppc64 plot.ft.B.ppc64
    0.20 -0.20 00 plot.c_omp.ppc64 plot.lu.B.ppc64
    0.56 -0.56 00 plot.c_omp.ppc64 plot.mg.B.ppc64
    0.44 -0.44 00 plot.c_omp.ppc64 plot.sp.B.ppc64

HPL
  • HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
  • xhpl is run 10 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
  • NOTE: nodes 215 and 224 were excluded from this test. node215 would not boot up. node224 only had 1.5GB of RAM. This test used 1.8GB RAM.
HPL validation test
  • node483 failed every test. Try replacing memory, processors, and then the system board, in that order.
  • Command output:
    # cd ~/bench/hpl/output.raw.single
    # ../checkresults
    checking xhpl.ppc64.node483.1...failed
    checking xhpl.ppc64.node483.10...failed
    checking xhpl.ppc64.node483.2...failed
    checking xhpl.ppc64.node483.3...failed
    checking xhpl.ppc64.node483.4...failed
    checking xhpl.ppc64.node483.5...failed
    checking xhpl.ppc64.node483.6...failed
    checking xhpl.ppc64.node483.7...failed
    checking xhpl.ppc64.node483.8...failed
    checking xhpl.ppc64.node483.9...failed

HPL consistency and correlation
# cd ~/bench/hpl/output
# ../anal
HPL results
benchmark nodes low high % mean median std dev
xhpl.ppc64 482 11.62 12.04 3.61 11.89 11.89 0.08

Ping-Pong
  • Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes.
  • Ping-Pong benchmarks should be run for each network (e.g. Myrinet and GigE).  First run the serial Ping-Pongs and then the parallel Ping-Pongs.  The purpose of the serial benchmarks is to find any single node or set of nodes that is not performing as well as the other nodes. The purpose of the parallel benchmarks is to help calculate bisectional bandwidth and test that system wide MPI jobs can be run.
  • There are four patterns, 3 deterministic and 1 random. The purpose for all four is to help isolate poor performing nodes and possibly poor performing routes or trunks (e.g. bad uplink cable).
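Judging from the sample output later in this deck, the three deterministic patterns appear to pair a sorted node list as neighbors (sort), first half against second half (cut), and ends inward (fold), with shuffle pairing at random. A hedged Python sketch of that reading (the actual pairing scripts may differ):

```python
import random

def pairs(nodes, pattern):
    """Pair an even-length, sorted node list for a Ping-Pong run."""
    n = len(nodes)
    if pattern == "sort":     # neighbors: (0,1), (2,3), ...
        return [(nodes[i], nodes[i + 1]) for i in range(0, n, 2)]
    if pattern == "cut":      # first half against second half
        return [(nodes[i], nodes[i + n // 2]) for i in range(n // 2)]
    if pattern == "fold":     # ends inward: (0, n-1), (1, n-2), ...
        return [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
    if pattern == "shuffle":  # random pairing, reshuffled between runs
        mixed = random.sample(nodes, n)
        return [(mixed[i], mixed[i + 1]) for i in range(0, n, 2)]
    raise ValueError(pattern)

nodes = [f"node{i:03d}" for i in range(1, 9)]
for p in ("sort", "cut", "fold"):
    print(p, pairs(nodes, p))
```

Because each pattern routes traffic differently (within a switch, across halves, across the ends), a node or trunk that is bad shows up as the low outlier under every pattern.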
Ping-Pong
  • Sorted
Myrinet consistency check
# cd ~/bench/PMB2.2.1/output.gm
# ../anal spp sort bw
spp sort bw results
bytes pairs low high % mean median std dev
1 242 0.08 0.11 37.50 0.11 0.11 0.00
...
4194304 242 87.62 234.93 168.12 232.49 233.43 9.38
# ../anal spp cut bw
...
4194304 242 87.13 234.99 169.70 232.16 233.15 9.40
# ../anal spp fold bw
...
4194304 242 87.17 235.04 169.63 232.13 233.16 9.39
# ../anal spp shuffle bw
...
4194304 242 87.61 234.77 167.97 232.14 232.70 9.36

For the 4194304-byte results the mean and median are very close together, and also close to the high, indicating that only one or a few nodes performed poorly.

Myrinet consistency
# head -5 plot.spp.*.bw.4194304
==> plot.spp.cut.bw.4194304 <==
87.13 node164-node406
230.95 node107-node349
231.36 node147-node389
231.41 node091-node333
231.43 node045-node287
==> plot.spp.fold.bw.4194304 <==
87.17 node079-node406
227.58 node214-node271
229.34 node010-node475
231.40 node091-node394
231.48 node177-node308
==> plot.spp.shuffle.bw.4194304 <==
87.61 node024-node406
231.47 node091-node166
231.51 node227-node003
231.55 node110-node293
231.57 node013-node231
==> plot.spp.sort.bw.4194304 <==
87.62 node405-node406
228.64 node039-node040
231.64 node231-node232
231.66 node091-node092
231.66 node481-node482

node406 appears in the worst pair under every pattern, so node406 is the suspect node.

Bisectional Bandwidth
ppp cut bw results
bytes pairs low high % mean median std dev
4194304 242 60.28 233.44 287.26 138.94 137.92 36.87

Demonstrated BW = 242 * 138.94 = 33623.48 MB/s ~= 32.8 GB/s (262.4 Gb/s)
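The demonstrated-bandwidth figure is just pairs × mean with unit conversions; in Python (note the slide rounds to 32.8 GB/s before converting to Gb/s, which is how it arrives at 262.4):

```python
pairs, mean_mb_s = 242, 138.94  # from the ppp cut 4194304-byte row

total_mb_s = pairs * mean_mb_s   # 33623.48 MB/s
total_gb_s = total_mb_s / 1024   # ~32.8 GB/s
total_gbit_s = total_gb_s * 8    # ~262.7 Gb/s unrounded

print(f"{total_mb_s:.2f} MB/s ~= {total_gb_s:.1f} GB/s ({total_gbit_s:.1f} Gb/s)")
```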

IP consistency check
# cd ~/bench/PMB2.2.1/output.ip
# ../anal spp sort bw
spp sort bw results
bytes pairs low high % mean median std dev
1 241 0.01 0.01 0.00 0.01 0.01 0.00
...
4194304 241 60.76 101.76 67.48 99.91 100.26 3.53
# ../anal spp cut bw
...
4194304 241 45.54 89.88 97.36 86.96 88.60 6.58
# ../anal spp fold bw
...
4194304 241 50.91 100.60 97.60 87.33 88.48 6.30
# ../anal spp shuffle bw
...
4194304 241 49.31 100.71 104.24 87.26 88.53 6.72

IP consistency check
  • The sorted pair output will be the easiest to analyze for problems since each pair is restricted to a single switch within each BladeCenter. The other tests run across the network and may have higher variability.
  • Running the following command reveals that the first three pairs performed poorly:
    # head -5 plot.spp.sort.bw.4194304
    ==> plot.spp.sort.bw.4194304 <==
    60.76 node025-node026
    68.97 node023-node024
    79.97 node325-node326
    98.83 node067-node068
    98.85 node071-node072
    98.94 node337-node338
    98.98 node175-node176
    99.02 node031-node032
    99.11 node401-node402
    99.16 node085-node086
  • This may or may not be a problem. The uplink performance will be less than 60MB/s/node because BC can at best provide an average of 35MB/s per blade (with a 4 cable trunk). Many Myrinet-based clusters only use GigE for management and NFS; both have greater bottlenecks elsewhere.
  • You may want to check the switch logs and consider reseating the switches and blades.
IP consistency check
Running the following command reveals that there may be an uplink problem with nodes in BC #2, i.e. node015-node028.
# head -20 plot.spp.cut.bw.4194304 plot.spp.fold.bw.4194304 plot.spp.shuffle.bw.4194304
==> plot.spp.cut.bw.4194304 <==
45.54 node025-node268
50.47 node026-node269
54.85 node024-node267
56.27 node002-node245
57.08 node022-node265
58.50 node023-node266
62.74 node020-node263
69.37 node016-node259
69.48 node015-node258
69.56 node021-node264
69.73 node018-node261
71.06 node028-node271
71.42 node019-node262
71.45 node042-node285
72.06 node027-node270
72.31 node017-node260
84.69 node224-node465
86.40 node225-node466
87.10 node001-node244
87.54 node084-node327

IP consistency check
==> plot.spp.fold.bw.4194304 <==
50.91 node026-node459
51.72 node023-node462
55.32 node002-node483
58.39 node025-node460
60.24 node024-node461
65.66 node018-node467
68.09 node022-node463
68.28 node020-node465
69.96 node021-node464
70.23 node015-node470
70.27 node016-node469
70.61 node019-node466
71.12 node027-node458
71.50 node017-node468
74.35 node028-node457
84.75 node235-node252
85.02 node236-node251
85.79 node237-node250
85.94 node238-node249
87.19 node118-node367

IP consistency check
==> plot.spp.shuffle.bw.4194304 <==
49.31 node001-node126
49.46 node029-node026
51.25 node024-node063
56.34 node274-node025
58.14 node023-node100
68.00 node019-node248
68.67 node443-node015
68.88 node018-node228
69.29 node020-node091
69.38 node028-node240
70.68 node022-node102
70.80 node027-node106
71.63 node021-node423
71.96 node291-node017
72.52 node460-node411
72.66 node016-node040
78.61 node031-node011
83.85 node041-node050
84.82 node407-node393
85.08 node420-node399

The cut, fold, and shuffle tests run from BC to BC, and the nodes in BC #2 repeatedly show up. Consider checking the uplink cables, ports, and the BC switch.

Bisectional Bandwidth
ppp cut bw results
bytes pairs low high % mean median std dev
4194304 241 6.18 17.36 180.91 7.95 7.28 1.82

Demonstrated BW = 241 * 7.95 = 1915.95 MB/s ~= 1.87 GB/s (14.96 Gb/s)

NAS MPI (8 node, 2ppn)
  • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
  • bt.B, cg.B, ep.B, ft.B, is.B, lu.B, mg.B, and sp.B are run 10 times on each set of 8 unique nodes using 2 different node set methods: sorted and shuffle.
    • Sorted. Sets of 8 nodes are selected from a sorted list and assigned adjacently, e.g. node001-node008, node009-node016, etc…, this is used to find consistency within the same set of nodes.
    • Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are reshuffled between runs.
  • Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of 8 are run at the same time, then all the shuffle sets are run at the same time.
  • NOTE: node215 and node446 were not included in the shuffle and sorted tests. node215 failed to boot, node446 failed to startup Myrinet.
nas mpi verification
NAS MPI verification

Verification command output:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle

This command will find the failed results and place the result filenames into the file ../failed:

# ../checkresults ../failed

This command will find the common nodes in all failed results in the file ../failed and sort them by number of occurrences (occurrences are counted by processor, not node):

# xcommon ../failed | tail

node395 12

node440 12

node056 12

node464 12

node043 12

node429 14

node297 14

node391 20

node174 22

node483 96
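The xcommon tally above can be reproduced with a simple occurrence count over the failed result files. A hedged sketch (xcommon's actual implementation is not shown in this deck; the file layout — one result filename per line in ../failed, each result file listing one node name per processor — is an assumption):

```python
import re
from collections import Counter
from pathlib import Path

def common_nodes(failed_list):
    """Count how often each node appears across the failed results.

    `failed_list` names a file containing one failed-result filename
    per line; nodes are counted once per processor occurrence, which
    is why counts come in multiples of the processors per node.
    """
    counts = Counter()
    for result in Path(failed_list).read_text().split():
        counts.update(re.findall(r"node\d+", Path(result).read_text()))
    return counts

# Printing counts.most_common() in ascending order mimics
# `xcommon ../failed | tail`: the node present in the most
# failures (node483 above) sorts last.
```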

nas mpi consistency check
NAS MPI Consistency check
  • Consistency check command output:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle

# ../analm

NPB MPI

benchmark runs low high % mean median std dev

bt.B.16 600 9089.46 10415.15 14.58 10204.94 10217.94 143.14

cg.B.16 600 1095.60 1685.61 53.85 1570.48 1575.38 57.70

ep.B.16 600 155.81 160.64 3.10 158.48 158.37 0.59

ft.B.16 600 2102.39 3232.49 53.75 3052.71 3066.45 130.37

is.B.16 600 87.06 185.29 112.83 155.97 154.39 12.94

lu.B.16 600 5069.36 5892.62 16.24 5529.00 5531.17 111.84

mg.B.16 600 3265.89 3898.99 19.39 3737.80 3739.77 74.91

sp.B.16 600 2156.46 2404.05 11.48 2340.00 2340.05 26.89
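The analm columns above (low, high, %, mean, median, std dev) can be derived with standard statistics; note that the % column is the high-to-low spread relative to the low value. A minimal sketch (analm itself is a STAB script whose internals are an assumption here):

```python
import statistics

def summarize(samples):
    """Return (low, high, %, mean, median, stdev) for one benchmark,
    matching the analm columns: % = (high - low) / low * 100."""
    low, high = min(samples), max(samples)
    return (low,
            high,
            (high - low) / low * 100.0,
            statistics.mean(samples),
            statistics.median(samples),
            statistics.stdev(samples))

# e.g. for xhpl.16.15000 below: low 51.14, high 60.66 -> 18.62% spread
```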

nas mpi consistency
NAS MPI Consistency

The leading cause of variability on a stable system is switch contention. The only way to determine what is normal is to run the same set of benchmarks multiple times on an isolated set of stable nodes (nodes that passed the single-node tests) with the rest of the switch idle. I did not have time to run a full series of serialized parallel tests, but this is close:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.sort

# ../analm $(nr -l node001-node080)

NPB MPI

benchmark runs low high % mean median std dev

bt.B.16 100 10025.30 10266.00 2.40 10129.42 10120.54 44.30

cg.B.16 100 1678.27 1787.76 6.52 1714.04 1712.43 15.39

ep.B.16 100 150.45 160.02 6.36 158.49 158.38 1.03

ft.B.16 100 3248.41 3694.40 13.73 3563.50 3575.43 81.22

is.B.16 100 159.31 168.14 5.54 163.91 164.22 1.98

lu.B.16 100 5156.19 5522.79 7.11 5346.95 5350.06 87.51

mg.B.16 100 3491.76 3685.78 5.56 3613.65 3614.44 37.25

sp.B.16 100 2259.08 2308.16 2.17 2289.66 2290.30 9.55

The above results are from the first 80 nodes run sorted. Each set of 8 nodes was isolated to a single Myrinet line card, reducing switch contention (however, every 2 sets of nodes did share a single line card). Also, to avoid possible variability due to differing memory performance, I limited the report to the first 80 nodes.

nas mpi correlation bt stream vs perf1
NAS MPI Correlation BT STREAM vs. Perf

$ CPLOTOPTS="-dy ," findc plot* | grep plot.c.ppc64

0.09 -0.09 05 plot.c.ppc64 plot.cg.B.16

0.00 0.00 100 plot.c.ppc64 plot.ep.B.16

0.14 -0.14 00 plot.c.ppc64 plot.ft.B.16

0.22 -0.22 00 plot.c.ppc64 plot.is.B.16

0.21 -0.21 00 plot.c.ppc64 plot.lu.B.16

0.41 -0.41 00 plot.c.ppc64 plot.mg.B.16

0.42 -0.42 00 plot.c.ppc64 plot.sp.B.16
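findc reports the correlation between a single-node metric (here STREAM copy on plot.c.ppc64) and each parallel benchmark. A Pearson correlation sketch, assuming each plot file holds one value per node (the exact meaning of findc's columns is not documented in this deck):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# |r| near 0 means the single-node metric does not explain the
# parallel result; sp.B above shows the strongest relationship (0.42).
```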

hpl mpi
HPL MPI
  • HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
  • xhpl is run 10 times (15 times for sorted) on each set of 8 unique nodes using 2 different node set methods: sorted and shuffle.
    • Sorted. Sets of 8 nodes are selected from a sorted list and assigned adjacently, e.g. node001-node008, node009-node016, etc. This is used to find consistency within the same set of nodes.
    • Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are reshuffled between runs.
  • Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of 8 are run at the same time, then all the shuffle sets are run at the same time.
hpl mpi verification
HPL MPI verification

# cd ~/bench/hpl/output.raw.shuffle

This command will find the failed results and place the result filenames into the file ../failed:

# ../checkresults ../failed

This command will find the common nodes in all failed results in the file ../failed and sort them by number of occurrences (occurrences are counted by processor, not node):

# xcommon ../failed | tail

node073 2

node121 2

node090 2

node406 2

node308 2

node276 2

node103 2

node199 2

node435 4

node483 20

hpl mpi consistency
HPL MPI consistency

# cd ~/bench/hpl/output.raw.shuffle

# ../analm

HPL results

benchmark runs low high % mean median std dev

xhpl.16.15000 600 51.14 60.66 18.62 59.31 59.48 1.00

xhpl.16.30000 600 69.34 78.48 13.18 77.16 77.35 1.08

summary
Summary
  • node483 has accuracy issues.
  • node406 has weak Myrinet performance.
  • BC2 has a switch or uplink issue.
  • nodes 1-84 have a different memory configuration that does correlate with application performance.
  • Applications at large scales may experience no performance anomalies.
what is scab
What is SCAB?
  • SCalability Analysis of Benchmarks
  • The purpose of the SCAB HOWTO is to verify that the cluster you just built actually can do work at scale.  This can be accomplished by running a few industry accepted benchmarks.
  • The STAB/SCAB suite provides tools to plot scalability for visual analysis.
  • The STAB HOWTO should be completed first to rule out any inconsistencies that may appear as scaling issues.
the benchmarks
The Benchmarks
  • PMB (Pallas MPI Benchmark)
  • NPB (NAS Parallel Benchmark)
  • HPL (High Performance Linpack)
slide103
PMB
  • The Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.
  • NOTE:  Pallas has been acquired by Intel.  Intel has released the IMB (Intel MPI Benchmark).  The IMB is a minor update of the PMB.  The IMB was not used because it failed to execute properly with all MPI implementations that I tested.
  • IMPORTANT:  Consistent PMB Ping-Ping should be achieved before running this benchmark (STAB Lab).  Unresolved inconsistencies in the interconnect may appear as scaling issues.
  • The main purpose of this test is as a diagnostic to answer the following questions:
    • Are my MPI implementation's basic functions complete?
    • Does my MPI implementation scale?
slide104
PMB
  • Example plot from larger BC cluster.
  • Very impressive.  For the Sendrecv benchmark this cluster scales from 2 nodes to 240!  Could this be a non-blocking GigE configuration?  Another benchmark can help answer that question.
slide105
PMB
  • Example plot from larger BC cluster.
  • Quite revealing.  Sorted, the 4M message size performs at ~115MB/s for all node counts; shuffled, it falls gradually to ~10MB/s as the number of nodes increases.  Why?
slide106
PMB
  • This cluster is partitioned into 14 nodes/BladeCenter chassis.  Each chassis has a GigE switch with only 4 uplinks; 3 of the 4 uplinks are bonded together to form a single 3Gbit uplink to a stacked SMC GigE core switch.  Assuming no blocking within the core switch, this solution blocks at 14:3.
  • The Sendrecv benchmark is based on MPI_Sendrecv, the processes form a periodic communication chain. Each process sends to the right and receives from the left neighbor in the chain.
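The periodic chain described above pairs each rank with its neighbors modulo the job size. A sketch of the neighbor computation only (the actual PMB source is not reproduced here; this merely illustrates the communication pattern):

```python
def ring_neighbors(rank, size):
    """For MPI_Sendrecv in a periodic chain: each process sends to
    its right neighbor and receives from its left neighbor, with
    wraparound at the ends of the chain."""
    left = (rank - 1) % size    # receive from
    right = (rank + 1) % size   # send to
    return left, right

# With a sorted node list, rank and rank+1 usually share a chassis
# switch; with a shuffled list, the right neighbor is likely on
# another chassis and the traffic must cross the 14:3 blocking uplink.
```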
slide107
PMB
  • Based on the previous illustration it is easy to see why the sorted list performed so well.  Most of the traffic was isolated to good performing local switches and the jump from chassis to chassis through the SMC core switch only requires the bandwidth of a single link (1Gb full duplex).
  • With the shuffled list, the odds are small that a process's left neighbor (receive from) and right neighbor (send to) are on the same switch.  This was illustrated in the second plot.
  • Moral of the story.
    • Don’t trust interconnect vendors that do not provide the node list.
    • Ask for sorted and shuffled benchmarks.
questions w answers
Questions w/ Answers
  • Egan Ford, IBMegan@us.ibm.comegan@sense.net