
Linux Cluster Production Readiness

Egan Ford

IBM

egan@us.ibm.com

egan@sense.net

Agenda
  • Production Readiness
  • Diagnostics
  • Benchmarks
  • STAB
  • Case Study
  • SCAB
What is Production Readiness?
  • Production readiness is a series of tests to help determine if a system is ready for use.
  • Production readiness falls into two categories:
    • diagnostic
    • benchmark
  • The purpose is to confirm that all hardware is good and identical (per class).
  • The search for consistency and predictability.
What are diagnostics?
  • Diagnostic tests are usually pass/fail and include but are not limited to
    • simple version checks
      • OS, BIOS versions
    • inventory checks
      • Memory, CPU, etc…
    • configuration checks
      • Is HT off?
    • vendor supplied diagnostics
      • DOS on a CD
Why benchmark?
  • Diagnostics are usually pass/fail.
    • Thresholds may be undocumented.
    • ‘Why’ is difficult to answer.
  • Diagnostics may be incomplete.
    • They may not test all subsystems.
  • Other issues with diagnostics:
    • False positives.
    • Inconsistent from vendor to vendor.
    • Do no real work, cannot check for accuracy.
    • Usually hardware based.
      • What about software?
      • What about the user environment?
Why benchmark?
  • Benchmarks can be checked for accuracy.
  • Benchmarks can stress all used subsystems.
  • Benchmarks can stress all used software.
  • Benchmarks can be measured and you can determine the thresholds.
Benchmark or diagnostics?
  • Do both.
  • All diagnostics should pass first.
  • Benchmarks will be inconsistent if diagnostics fail.
WARNING!
  • The following slides will contain the word ‘statistics’.
  • Statistics cannot prove anything.
  • Exercise common sense.
A few words on statistics
  • Statistics increases human knowledge through the use of empirical data.
  • "There are three kinds of lies: lies, damned lies and statistics." -- Benjamin Disraeli (1804-1881)
  • "There are three kinds of lies: lies, damned lies and linpack."
What is STAB?
  • STatistical Analysis of Benchmarks
  • A systematic way of running a series of increasingly complex benchmarks to find avoidable inconsistencies.
  • Avoidable inconsistencies may lead to performance problems.
  • GOAL: consistent, repeatable, accurate results.
What is STAB?
  • Each benchmark is run one or more times per node, then the best representative of each node (ignored for multi-node tests) is grouped together and analyzed as a single population.  The results themselves are not as interesting as the shape of their distribution.  Empirical evidence for all the benchmarks in the STAB HOWTO suggests that they should all form a normal distribution.
  • A normal distribution is the classic bell curve that appears so frequently in statistics.  It is the sum of smaller, independent (may be unobservable), identically-distributed variables or random events.
Uniform Distribution
  • The plot below is of 20,000 random dice rolls.
Normal Distribution
  • Sum of 5 dice thrown 10000 times.
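This effect is easy to reproduce. A minimal Python sketch (an illustration only, not part of the STAB tools) that sums 5 dice 10,000 times; the totals pile up around the mean instead of spreading uniformly:

```python
import random
from collections import Counter

random.seed(1)  # reproducible throws

# Sum of 5 dice thrown 10000 times: totals range from 5 to 30, mean 17.5.
totals = [sum(random.randint(1, 6) for _ in range(5)) for _ in range(10000)]
hist = Counter(totals)

# Crude text histogram: one '#' per 25 throws in each bin.
for total in range(5, 31):
    print(f"{total:2d} {'#' * (hist[total] // 25)}")
```

A single die gives a flat (uniform) histogram; summing five of them already produces a recognizable bell curve.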
Normal Distribution
  • Benchmarks also have many small independent (may be unobservable) identically-distributed variables that may affect performance, e.g.:
    • Competing processes
    • Context switching
    • Hardware interrupts
    • Software interrupts
    • Memory management
    • Process/Thread scheduling
    • Cosmic rays
  • The above may be unavoidable, but it is, in part, the source of a normal distribution.
Non-normal Distribution
  • Benchmarks may also have non-identically-distributed observable variables that may affect performance, e.g.:
    • Memory configuration
    • BIOS Version
    • Processor speed
    • Operating system
    • Kernel type (e.g. NUMA vs SMP vs UNI)
    • Kernel version
    • Bad memory (e.g. excessive ECCs)
    • Chipset revisions
    • Hyper-Threading or SMT
    • Non-uniform competing processes (e.g. httpd running on some nodes, but not others)
    • Shared library versions
    • Bad cables
    • Bad administrators
    • Users
  • The above are avoidable, and finding them is the purpose of the STAB HOWTO.  Avoidable inconsistencies may lead to multimodal or non-normal distributions.
STAB Toolkit
  • The STAB Tools are a collection of scripts to help run selected benchmarks and to analyze their results.
    • Some of the tools are specific to a particular benchmark.
    • Others are general and operate on the data collected by the specific tools.
  • Benchmark-specific tools comprise benchmark launch scripts, accuracy validation scripts, miscellaneous utilities, and analysis scripts that collect the data, report some basic descriptive statistics, and create input files to be used with the general STAB tools for additional analysis.
STAB Toolkit
  • With a goal of consistent, repeatable, accurate results, it is best to start with as few variables as possible.  Start with single-node benchmarks, e.g., STREAM.  If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies.  Next, work your way up to processor and disk benchmarks, then two-node (parallel) benchmarks, then multi-node (parallel) benchmarks.  After each more complicated benchmark, check for consistent, repeatable, accurate results before continuing.
The STAB Benchmarks
  • Single Node (serial) Benchmarks:
    • STREAM (memory MB/s)
    • NPB Serial (uni-processor FLOP/s and memory)
    • NPB OpenMP (multi-processor FLOP/s and memory)
    • HPL MPI Shared Memory (multi-processor FLOP/s and memory)
    • IOzone (disk MB/s, memory, and processor)
  • Parallel Benchmarks (for MPI systems only):
    • Ping-Pong (interconnect µsec and MB/s)
    • NAS Parallel (multi-node FLOP/s, memory, and interconnect)
    • HPL Parallel (multi-node FLOP/s, memory, and interconnect)
Getting STAB
  • http://sense.net/~egan/bench
    • bench.tgz
      • Code with source (all script)
    • bench-oss.tgz
      • OSS code (e.g. Gnuplot)
    • bench-examples.tgz
      • 1GB of collected data (all text, 186000+ files)
    • stab.pdf (currently 150 pages)
      • Documentation (WIP, check back before 11/30/2005)
Install STAB
  • Extract bench*.tgz into home directory:
    cd ~
    tar zxvf bench.tgz
    tar zxvf bench-oss.tgz
    tar zxvf bench-examples.tgz
  • Add STAB tools to PATH:
    export PATH=~/bench/bin:$PATH
  • Append to .bashrc:
    export PATH=~/bench/bin:$PATH
Install STAB
  • STAB requires Gnuplot 4 and it must be built a specific way:
    cd ~/bench/src
    tar zxvf gnuplot-4.0.0.tar.gz
    cd gnuplot-4.0.0
    ./configure --prefix=$HOME/bench --enable-thin-splines
    make
    make install
STAB Benchmark Tools
  • Each benchmark supported in this document contains an anal (short for analysis) script.  This script is usually run from an output directory, e.g.:
    cd ~/bench/benchmark/output
    ../anal
    benchmark nodes low high % mean median std dev
    bt.A.i686 4 615.77 632.08 2.65 627.85 632.02 8.06
    cg.A.i686 4 159.78 225.08 40.87 191.05 193.16 26.86
    ep.A.i686 4 11.51 11.53 0.17 11.52 11.52 0.01
    ft.A.i686 4 448.05 448.90 0.19 448.63 448.81 0.39
    lu.A.i686 4 430.60 436.59 1.39 433.87 434.72 2.51
    mg.A.i686 4 468.12 472.54 0.94 470.86 472.12 2.00
    sp.A.i686 4 449.01 449.87 0.19 449.58 449.72 0.39
  • The anal scripts produce statistics about the results to help find anomalies.  The theory is that if you have identical nodes then you should be able to obtain identical results (not always true).  The anal scripts will also produce plot.* files for use by dplot to graphically represent the distribution of the results, and by cplot to plot 2D correlations.
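The descriptive statistics in that output are easy to reproduce. A Python sketch (my reconstruction, not the actual anal script) in which % is the high-to-low spread as a percentage of the low result, which matches the sample rows above:

```python
import statistics

def describe(results):
    """anal-style descriptive statistics for one benchmark's per-node results."""
    low, high = min(results), max(results)
    return {
        "nodes": len(results),
        "low": low,
        "high": high,
        "%": round((high - low) / low * 100, 2),  # spread relative to the low
        "mean": round(statistics.mean(results), 2),
        "median": round(statistics.median(results), 2),
        "std dev": round(statistics.stdev(results), 2),
    }

# Hypothetical 4-node population; low, high, and % mirror the bt.A.i686 row.
print(describe([615.77, 631.96, 632.02, 632.08]))
```

As a sanity check, the cg.A.i686 row's 40.87% falls out of the same formula: (225.08 - 159.78) / 159.78 * 100.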
Rant: % vs. normal distribution
  • % is good?
    • % variability can tell you something about the data with respect to itself, without knowing anything about the data.
    • It is non-dimensional with a range (usually 0-100) that has meaning to anyone.
    • IOW, management understands percentages.
  • % is not good?
    • It minimizes the amount of useful empirical data.
    • It hides the truth.
% is not good, exhibit A
  • Clearly this is a normal distribution, but the variability is 500%.  This is an extreme case where all the possible values exist for a predetermined range.
% is not good, exhibit B
  • Low variability can hide a skewed distribution.  Variability is low, only 1.27%.  But the distribution is clearly skewed to the right.
% is not good, exhibit C
  • A 5.74% variability hides a bimodal distribution.  Bimodal distributions are clear indicators that there is an observable difference between two different sets of nodes.
STAB General Analysis Tools
  • dplot is for plotting distributions.
    • All the graphical output used as illustrations in this document up to this point was created with dplot.
    • dplot provides a number of options for binning the data and analyzing the distribution.
  • cplot is for correlating the results between two different sets of results.
    • E.g., does poor memory performance correlate to poor application performance?
  • danal is very similar to the output provided by the custom anal scripts provided with each benchmark, but has additional output options.
    • You can safely discard any anal screen output because it can be recreated with danal and the resulting plot.benchmark file.
  • Each script will require one or more plot.benchmark files.
    • dplot and danal are less strict and will work with any file of numbers as long as the numbers are in the first column; subsequent columns are ignored.
    • cplot however requires the 2nd column; it is impossible to correlate two sets of results without an index.
dplot
  • The first argument to dplot must be the number of bins, auto, or whole.  auto (or a) will use the square root of the number of results to determine the bin sizes and is usually the best place to start.  whole (or w) should only be used if your results are whole numbers and if the data contains all possible values between low and high.  This is only useful for creating plots like the dice examples at the beginning of this document.
  • The second argument is the plotfile.  The plotfile must contain one value per line in the first column, subsequent columns are ignored.  The order of the data is unimportant.
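The auto rule is straightforward; a Python sketch of the binning idea (an illustration only, not dplot itself):

```python
import math

def auto_bin(values):
    """'auto' binning: use sqrt(number of results) bins between low and high."""
    nbins = max(1, round(math.sqrt(len(values))))
    low, high = min(values), max(values)
    width = (high - low) / nbins or 1.0  # guard against identical values
    counts = [0] * nbins
    for v in values:
        i = min(int((v - low) / width), nbins - 1)  # clamp the high endpoint
        counts[i] += 1
    return counts

# 100 results -> 10 bins; a flat input gives a flat histogram.
print(auto_bin(list(range(100))))
```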
dplot a plot.c.ppc64 --text

[dplot --text histogram of plot.c.ppc64: counts (0-108) on the left axis, relative frequency (0.00-0.22) on the right, results binned from 2023 to 2156 MB/s.  A tall peak sits near the low end with a smaller second cluster near the high end, i.e. a bimodal distribution.]

Abusing chi-squared

$ findn plot.c_omp.ppc64
X^2: 26.75, scale: 0.43, bins: 21, normal distribution probability: 14.30%
X^2: 13.29, scale: 0.25, bins: 12, normal distribution probability: 27.50%
X^2: 24.34, scale: 0.45, bins: 22, normal distribution probability: 27.70%
X^2: 22.04, scale: 0.41, bins: 20, normal distribution probability: 28.20%
X^2: 4.65, scale: 0.12, bins: 6, normal distribution probability: 46.00%
X^2: 8.68, scale: 0.21, bins: 10, normal distribution probability: 46.70%
X^2: 16.79, scale: 0.37, bins: 18, normal distribution probability: 46.90%
X^2: 12.52, scale: 0.29, bins: 14, normal distribution probability: 48.50%
X^2: 16.77, scale: 0.39, bins: 19, normal distribution probability: 53.90%
X^2: 8.55, scale: 0.23, bins: 11, normal distribution probability: 57.50%
X^2: 12.33, scale: 0.31, bins: 15, normal distribution probability: 58.00%
X^2: 13.25, scale: 0.33, bins: 16, normal distribution probability: 58.30%
X^2: 2.84, scale: 0.1, bins: 5, normal distribution probability: 58.40%
X^2: 10.22, scale: 0.27, bins: 13, normal distribution probability: 59.70%
X^2: 6.27, scale: 0.19, bins: 9, normal distribution probability: 61.70%
X^2: 1.36, scale: 0.08, bins: 4, normal distribution probability: 71.60%
X^2: 11.28, scale: 0.35, bins: 17, normal distribution probability: 79.20%
X^2: 3.36, scale: 0.17, bins: 8, normal distribution probability: 85.00%
X^2: 2.27, scale: 0.14, bins: 7, normal distribution probability: 89.30%
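As I read it, findn bins the results at several bin counts and runs a chi-squared goodness-of-fit test against a normal curve fitted with the sample mean and standard deviation. A hedged pure-Python sketch of that idea (not the actual findn script, which also reports a scale and a probability):

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chi2_vs_normal(data, bins):
    """Chi-squared statistic of a histogram of data against a fitted normal."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    low, high = min(data), max(data)
    width = (high - low) / bins
    counts = [0] * bins
    for x in data:
        counts[min(int((x - low) / width), bins - 1)] += 1  # clamp the max
    chi2 = 0.0
    for i, observed in enumerate(counts):
        lo, hi = low + i * width, low + (i + 1) * width
        expected = len(data) * (normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma))
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Near-normal data (sums of dice) should give a modest statistic.
random.seed(1)
near_normal = [sum(random.randint(1, 6) for _ in range(5)) for _ in range(1000)]
print(round(chi2_vs_normal(near_normal, 7), 2))
```

Converting the statistic into findn's "normal distribution probability" would additionally require the chi-squared survival function for the appropriate degrees of freedom.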

cplot
  • cplot or correlation plot is a perl front-end to Gnuplot to graphically represent the correlation between any two sets of indexed numbers.
  • Correlation measures the relationship between two sets of results, e.g. processor performance and memory throughput.
  • Correlations are often expressed as a correlation coefficient; a numerical value with a range from -1 to +1.
  • A positive correlation would indicate that if one set of results increased, the other set would increase, e.g. better memory throughput increases processor performance.
  • A negative correlation would indicate that if one set of results increases, the other set would decrease, e.g. better processor performance decreases latency.
  • A correlation of zero would indicate that there is no relationship at all, IOW, they are independent.
  • Any two sets of results with a non-zero correlation are considered dependent; however, a check should be performed to determine whether a dependent set of results is statistically significant.
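The coefficient itself is Pearson's r; a short Python sketch with hypothetical per-node numbers (cplot's own calculation may differ in detail):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two indexed result sets."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-node results: STREAM MB/s vs. an application score.
stream = [2031.4, 2050.2, 2069.0, 2101.7, 2148.0]
cg = [40.9, 41.2, 41.4, 43.8, 45.3]
print(round(pearson_r(stream, cg), 2))  # strongly positive
```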
cplot
  • A strong correlation between two sets of results should produce more questions, not quick answers.
  • It is possible for two unrelated results to have a strong correlation because they share something in common.
    • E.g., you can show a positive correlation between the sales of skis and snowboards.  It is unlikely that increased ski sales increased snowboard sales; the most likely cause is an increase in the snow depth (or a decrease in temperature) at your local resort, i.e., something they have in common.  The correlation is valid, but it does not prove the cause of the correlation.
Case Study
  • 484 JS20 blades
    • dual PPC970
    • 2GB RAM
  • Myrinet D
    • Full Bisection Switch
  • Cisco GigE
    • 14:1 oversubscribed
Diagnostics
  • Vendor supplied (passed)
  • BIOS versions (failed)
  • Inventory
    • Number of CPUs (passed)
    • Total Memory (failed)
  • OS/Kernel Versions (passed)
BIOS Versions (failed)
  • All nodes but node443 have BIOS dated 10/21/04. node443 is dated 09/02/2004.
  • Inconsistent BIOS versions can affect performance.
    Command output:
    # rinv compute all | tee /tmp/foo
    # cat /tmp/foo | grep BIOS | awk '{print $4}' | sort | uniq
    09/02/2004
    10/21/2004
    # cat /tmp/foo | grep BIOS | grep 09/02/2004
    node433: VPD BIOS: 09/02/2004
Memory quantity (failed)
  • All nodes except node224 have 2GB RAM.
    Command output:
    # psh compute free | grep Mem | awk '{print $3}' | sort | uniq
    1460116
    1977204
    1977208
    # psh compute free | grep Mem | grep 1460116
    node224: Mem: 1460116 ...
STREAM
  • The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
  • STREAM C, FORTRAN, and C OMP are run 10 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
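The best-of-N reduction can be sketched in a few lines of Python (hypothetical numbers; the real tools parse the files under the output.raw directories):

```python
# Each node runs the benchmark several times; only its best MB/s result
# represents it in the population that anal and dplot analyze.
raw = {  # node -> list of per-run results (hypothetical)
    "node001": [2060.1, 2071.9, 2069.4],
    "node002": [2031.4, 2045.0, 2040.2],
    "node003": [2147.9, 2139.6, 2144.1],
}

best = {node: max(runs) for node, runs in raw.items()}
for node, mbs in sorted(best.items()):
    print(node, mbs)
```

Taking the best run per node filters out transient noise (context switches, competing daemons) so that what remains reflects the hardware.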
STREAM validation results
  • node483 failed OMP test 3 of 10 for accuracy. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/stream/output.raw
    # ../checkresults
    checking stream_c_omp.ppc64.node483.3...failed
STREAM consistency results
# cd ~/bench/stream/output
# ../anal
stream results
benchmark nodes low high % mean median std dev
c.ppc64 484 2031.43 2147.98 5.74 2077.03 2069.02 23.20
c_omp.ppc64 484 1993.49 2124.24 6.56 2050.00 2050.51 22.86
f.ppc64 484 2007.16 2092.68 4.26 2039.20 2034.63 17.87

NAS Serial
  • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
  • The NAS Serial Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been taken out and they run on one processor.
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are run 5 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
NAS Serial validation results
  • node483 failed a number of tests. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-SER/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking cg.B.ppc64.node483.4...failed
    checking ep.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking lu.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed
NAS Serial consistency results
# cd ~/bench/NPB3.2/NPB3.2-SER/output
# ../anal
NPB Serial
benchmark nodes low high % mean median std dev
bt.B.ppc64 484 1077.69 1099.28 2.00 1087.60 1087.67 4.67
cg.B.ppc64 484 40.93 45.30 10.68 41.94 41.38 1.31
ep.B.ppc64 484 9.88 10.07 1.92 9.96 9.96 0.04
ft.B.ppc64 484 480.87 503.33 4.67 487.07 486.23 3.71
lu.B.ppc64 484 516.88 579.25 12.07 543.08 542.88 12.46
mg.B.ppc64 484 618.16 654.23 5.84 638.31 638.85 6.76
sp.B.ppc64 484 530.48 556.67 4.94 541.01 540.77 3.99

Statistically significant?
  • Command output:
    $ findc plot* | grep plot.c.ppc64
    0.13 0.13 00 plot.bt.B.ppc64 plot.c.ppc64
    0.62 0.62 00 plot.c.ppc64 plot.c_omp.ppc64
    0.93 0.93 00 plot.c.ppc64 plot.cg.B.ppc64
    0.19 0.19 00 plot.c.ppc64 plot.ep.B.ppc64
    0.89 0.89 00 plot.c.ppc64 plot.f.ppc64
    0.17 0.17 00 plot.c.ppc64 plot.ft.B.ppc64
    0.11 0.11 02 plot.c.ppc64 plot.lu.B.ppc64
    0.50 -0.50 00 plot.c.ppc64 plot.mg.B.ppc64
    0.05 -0.05 27 plot.c.ppc64 plot.sp.B.ppc64

NAS OMP
  • The NAS OpenMP Benchmarks are the same as the NAS Parallel Benchmarks except that the MPI calls have been replaced with OpenMP calls to run on multiple processors on a shared memory system (SMP).
  • bt.B, cg.B, ep.B, ft.B, lu.B, mg.B, and sp.B are run 5 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
NAS OMP validation results
  • node483 failed a number of tests. Try replacing memory, processors, and then the system board, in that order.
    Command output:
    # cd ~/bench/NPB3.2/NPB3.2-OMP/output.raw
    # ../checkresults
    checking bt.B.ppc64.node483.1...failed
    checking bt.B.ppc64.node483.2...failed
    checking bt.B.ppc64.node483.3...failed
    checking bt.B.ppc64.node483.4...failed
    checking bt.B.ppc64.node483.5...failed
    checking ft.B.ppc64.node483.1...failed
    checking ft.B.ppc64.node483.2...failed
    checking ft.B.ppc64.node483.3...failed
    checking ft.B.ppc64.node483.4...failed
    checking ft.B.ppc64.node483.5...failed
    checking lu.B.ppc64.node483.1...failed
    checking lu.B.ppc64.node483.3...failed
    checking lu.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.1...failed
    checking mg.B.ppc64.node483.2...failed
    checking mg.B.ppc64.node483.3...failed
    checking mg.B.ppc64.node483.4...failed
    checking mg.B.ppc64.node483.5...failed
    checking sp.B.ppc64.node483.1...failed
    checking sp.B.ppc64.node483.2...failed
    checking sp.B.ppc64.node483.3...failed
    checking sp.B.ppc64.node483.4...failed
    checking sp.B.ppc64.node483.5...failed

NAS OMP consistency results
# cd ~/bench/NPB3.2/NPB3.2-OMP/output
# ../anal
NPB OpenMP
benchmark nodes low high % mean median std dev
bt.B.ppc64 484 1850.99 1898.65 2.57 1871.41 1870.45 9.25
cg.B.ppc64 484 67.31 73.30 8.90 68.96 68.44 1.49
ep.B.ppc64 484 19.69 20.36 3.40 19.88 19.88 0.09
ft.B.ppc64 484 593.39 615.77 3.77 604.74 604.61 4.06
lu.B.ppc64 484 739.30 820.71 11.01 773.09 772.05 16.76
mg.B.ppc64 484 751.40 819.38 9.05 792.03 797.10 15.26
sp.B.ppc64 484 722.73 824.39 14.07 745.99 747.33 8.51

Statistically significant?
  • Command output:
    $ findc plot* | grep plot.f.ppc64
    0.37 0.37 00 plot.bt.B.ppc64 plot.f.ppc64
    0.89 0.89 00 plot.c.ppc64 plot.f.ppc64
    0.64 0.64 00 plot.c_omp.ppc64 plot.f.ppc64
    0.77 0.77 00 plot.cg.B.ppc64 plot.f.ppc64
    0.07 -0.07 12 plot.ep.B.ppc64 plot.f.ppc64
    0.20 -0.20 00 plot.f.ppc64 plot.ft.B.ppc64
    0.29 -0.29 00 plot.f.ppc64 plot.lu.B.ppc64
    0.81 -0.81 00 plot.f.ppc64 plot.mg.B.ppc64
    0.65 -0.65 00 plot.f.ppc64 plot.sp.B.ppc64
    $ findc plot* | grep plot.c_omp.ppc64
    0.29 0.29 00 plot.bt.B.ppc64 plot.c_omp.ppc64
    0.62 0.62 00 plot.c.ppc64 plot.c_omp.ppc64
    0.54 0.54 00 plot.c_omp.ppc64 plot.cg.B.ppc64
    0.03 -0.03 51 plot.c_omp.ppc64 plot.ep.B.ppc64
    0.64 0.64 00 plot.c_omp.ppc64 plot.f.ppc64
    0.06 -0.06 19 plot.c_omp.ppc64 plot.ft.B.ppc64
    0.20 -0.20 00 plot.c_omp.ppc64 plot.lu.B.ppc64
    0.56 -0.56 00 plot.c_omp.ppc64 plot.mg.B.ppc64
    0.44 -0.44 00 plot.c_omp.ppc64 plot.sp.B.ppc64

HPL
  • HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
  • xhpl is run 10 times on each node, then the best result from each node is taken to be used to compare consistency. Each result is also tested for accuracy.
  • NOTE: nodes 215 and 224 were excluded from this test. node215 would not boot up. node224 only had 1.5GB of RAM. This test used 1.8GB RAM.
HPL validation test
  • node483 failed every test. Try replacing memory, processors, and then the system board, in that order.
  • Command output:
    # cd ~/bench/hpl/output.raw.single
    # ../checkresults
    checking xhpl.ppc64.node483.1...failed
    checking xhpl.ppc64.node483.10...failed
    checking xhpl.ppc64.node483.2...failed
    checking xhpl.ppc64.node483.3...failed
    checking xhpl.ppc64.node483.4...failed
    checking xhpl.ppc64.node483.5...failed
    checking xhpl.ppc64.node483.6...failed
    checking xhpl.ppc64.node483.7...failed
    checking xhpl.ppc64.node483.8...failed
    checking xhpl.ppc64.node483.9...failed

HPL consistency and correlation
# cd ~/bench/hpl/output
# ../anal
HPL results
benchmark nodes low high % mean median std dev
xhpl.ppc64 482 11.62 12.04 3.61 11.89 11.89 0.08

Ping-Pong
  • Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes.
  • Ping-Pong benchmarks should be run for each network (e.g. Myrinet and GigE).  First run the serial Ping-Pongs and then the parallel Ping-Pongs.  The purpose of the serial benchmarks is to find any single node or set of nodes that is not performing as well as the other nodes. The purpose of the parallel benchmarks is to help calculate bisectional bandwidth and test that system wide MPI jobs can be run.
  • There are four patterns, 3 deterministic and 1 random. The purpose for all four is to help isolate poor performing nodes and possibly poor performing routes or trunks (e.g. bad uplink cable).
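Judging from the sample output later in this deck, the three deterministic patterns appear to pair a sorted node list as neighbors (sort), first half against second half (cut), and ends inward (fold), with shuffle pairing at random. A hedged Python sketch of that reading (the actual pairing scripts may differ):

```python
import random

def pairs(nodes, pattern):
    """Pair an even-length, sorted node list for a Ping-Pong run."""
    n = len(nodes)
    if pattern == "sort":     # neighbors: (0,1), (2,3), ...
        return [(nodes[i], nodes[i + 1]) for i in range(0, n, 2)]
    if pattern == "cut":      # first half against second half
        return [(nodes[i], nodes[i + n // 2]) for i in range(n // 2)]
    if pattern == "fold":     # ends inward: (0, n-1), (1, n-2), ...
        return [(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)]
    if pattern == "shuffle":  # random pairing, reshuffled between runs
        mixed = random.sample(nodes, n)
        return [(mixed[i], mixed[i + 1]) for i in range(0, n, 2)]
    raise ValueError(pattern)

nodes = [f"node{i:03d}" for i in range(1, 9)]
for p in ("sort", "cut", "fold"):
    print(p, pairs(nodes, p))
```

Because each pattern routes traffic differently (within a switch, across halves, across the ends), a node or trunk that is bad shows up as the low outlier under every pattern.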
Ping-Pong
  • Sorted
Myrinet consistency check
# cd ~/bench/PMB2.2.1/output.gm
# ../anal spp sort bw
spp sort bw results
bytes pairs low high % mean median std dev
1 242 0.08 0.11 37.50 0.11 0.11 0.00
...
4194304 242 87.62 234.93 168.12 232.49 233.43 9.38
# ../anal spp cut bw
...
4194304 242 87.13 234.99 169.70 232.16 233.15 9.40
# ../anal spp fold bw
...
4194304 242 87.17 235.04 169.63 232.13 233.16 9.39
# ../anal spp shuffle bw
...
4194304 242 87.61 234.77 167.97 232.14 232.70 9.36

For the 4194304-byte results the mean and median are very close together, and also close to the high, indicating that only one or a few nodes performed poorly.

Myrinet consistency
# head -5 plot.spp.*.bw.4194304
==> plot.spp.cut.bw.4194304 <==
87.13 node164-node406
230.95 node107-node349
231.36 node147-node389
231.41 node091-node333
231.43 node045-node287
==> plot.spp.fold.bw.4194304 <==
87.17 node079-node406
227.58 node214-node271
229.34 node010-node475
231.40 node091-node394
231.48 node177-node308
==> plot.spp.shuffle.bw.4194304 <==
87.61 node024-node406
231.47 node091-node166
231.51 node227-node003
231.55 node110-node293
231.57 node013-node231
==> plot.spp.sort.bw.4194304 <==
87.62 node405-node406
228.64 node039-node040
231.64 node231-node232
231.66 node091-node092
231.66 node481-node482

node406 appears in the worst pair under every pattern, so node406 is the suspect node.

Bisectional Bandwidth
ppp cut bw results
bytes pairs low high % mean median std dev
4194304 242 60.28 233.44 287.26 138.94 137.92 36.87

Demonstrated BW = 242 * 138.94 = 33623.48 MB/s ~= 32.8 GB/s (262.4 Gb/s)
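The demonstrated-bandwidth figure is just pairs × mean with unit conversions; in Python (note the slide rounds to 32.8 GB/s before converting to Gb/s, which is how it arrives at 262.4):

```python
pairs, mean_mb_s = 242, 138.94  # from the ppp cut 4194304-byte row

total_mb_s = pairs * mean_mb_s   # 33623.48 MB/s
total_gb_s = total_mb_s / 1024   # ~32.8 GB/s
total_gbit_s = total_gb_s * 8    # ~262.7 Gb/s unrounded

print(f"{total_mb_s:.2f} MB/s ~= {total_gb_s:.1f} GB/s ({total_gbit_s:.1f} Gb/s)")
```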

IP consistency check
# cd ~/bench/PMB2.2.1/output.ip
# ../anal spp sort bw
spp sort bw results
bytes pairs low high % mean median std dev
1 241 0.01 0.01 0.00 0.01 0.01 0.00
...
4194304 241 60.76 101.76 67.48 99.91 100.26 3.53
# ../anal spp cut bw
...
4194304 241 45.54 89.88 97.36 86.96 88.60 6.58
# ../anal spp fold bw
...
4194304 241 50.91 100.60 97.60 87.33 88.48 6.30
# ../anal spp shuffle bw
...
4194304 241 49.31 100.71 104.24 87.26 88.53 6.72

IP consistency check
  • The sorted pair output will be the easiest to analyze for problems since each pair is restricted to a single switch within each BladeCenter. The other tests run across the network and may have higher variability.
  • Running the following command reveals that the first three pairs performed poorly:
    # head -5 plot.spp.sort.bw.4194304
    ==> plot.spp.sort.bw.4194304 <==
    60.76 node025-node026
    68.97 node023-node024
    79.97 node325-node326
    98.83 node067-node068
    98.85 node071-node072
    98.94 node337-node338
    98.98 node175-node176
    99.02 node031-node032
    99.11 node401-node402
    99.16 node085-node086
  • This may or may not be a problem. The uplink performance will be less than 60MB/s/node because BC can at best provide an average of 35MB/s per blade (with a 4 cable trunk). Many Myrinet-based clusters only use GigE for management and NFS; both have greater bottlenecks elsewhere.
  • You may want to check the switch logs and consider reseating the switches and blades.
IP consistency check
Running the following command reveals that there may be an uplink problem with nodes in BC #2, i.e. node015-node028.
# head -20 plot.spp.cut.bw.4194304 plot.spp.fold.bw.4194304 plot.spp.shuffle.bw.4194304
==> plot.spp.cut.bw.4194304 <==
45.54 node025-node268
50.47 node026-node269
54.85 node024-node267
56.27 node002-node245
57.08 node022-node265
58.50 node023-node266
62.74 node020-node263
69.37 node016-node259
69.48 node015-node258
69.56 node021-node264
69.73 node018-node261
71.06 node028-node271
71.42 node019-node262
71.45 node042-node285
72.06 node027-node270
72.31 node017-node260
84.69 node224-node465
86.40 node225-node466
87.10 node001-node244
87.54 node084-node327

IP consistency check
==> plot.spp.fold.bw.4194304 <==
50.91 node026-node459
51.72 node023-node462
55.32 node002-node483
58.39 node025-node460
60.24 node024-node461
65.66 node018-node467
68.09 node022-node463
68.28 node020-node465
69.96 node021-node464
70.23 node015-node470
70.27 node016-node469
70.61 node019-node466
71.12 node027-node458
71.50 node017-node468
74.35 node028-node457
84.75 node235-node252
85.02 node236-node251
85.79 node237-node250
85.94 node238-node249
87.19 node118-node367

IP consistency check
==> plot.spp.shuffle.bw.4194304 <==
49.31 node001-node126
49.46 node029-node026
51.25 node024-node063
56.34 node274-node025
58.14 node023-node100
68.00 node019-node248
68.67 node443-node015
68.88 node018-node228
69.29 node020-node091
69.38 node028-node240
70.68 node022-node102
70.80 node027-node106
71.63 node021-node423
71.96 node291-node017
72.52 node460-node411
72.66 node016-node040
78.61 node031-node011
83.85 node041-node050
84.82 node407-node393
85.08 node420-node399

The cut, fold, and shuffle tests run from BC to BC, and the nodes in BC #2 repeatedly show up. Consider checking the uplink cables, ports, and the BC switch.

Bisectional Bandwidth
ppp cut bw results
bytes pairs low high % mean median std dev
4194304 241 6.18 17.36 180.91 7.95 7.28 1.82

Demonstrated BW = 241 * 7.95 = 1915.95 MB/s ~= 1.87 GB/s (14.96 Gb/s)

NAS MPI (8 node, 2ppn)
  • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
  • bt.B, cg.B, ep.B, ft.B, is.B, lu.B, mg.B, and sp.B are run 10 times on each set of 8 unique nodes using 2 different node set methods: sorted and shuffle.
    • Sorted. Sets of 8 nodes are selected from a sorted list and assigned adjacently, e.g. node001-node008, node009-node016, etc…, this is used to find consistency within the same set of nodes.
    • Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are reshuffled between runs.
  • Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of 8 are run at the same time, then all the shuffle sets are run at the same time.
  • NOTE: node215 and node446 were not included in the shuffle and sorted tests. node215 failed to boot, node446 failed to startup Myrinet.
nas mpi verification
NAS MPI verification

Verification command output:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle

This command will find the failed results and place the result filenames into the file ../failed:

# ../checkresults ../failed

This command will find the common nodes in all failed results in the file ../failed and sort them by number of occurrences (occurrences are counted by processor, not node):

# xcommon ../failed | tail

node395 12

node440 12

node056 12

node464 12

node043 12

node429 14

node297 14

node391 20

node174 22

node483 96
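The xcommon tally above can be reproduced with a simple occurrence count over the failed result files. A hedged sketch (xcommon's actual implementation is not shown in this deck; the file layout — one result filename per line in ../failed, each result file listing one node name per processor — is an assumption):

```python
import re
from collections import Counter
from pathlib import Path

def common_nodes(failed_list):
    """Count how often each node appears across the failed results.

    `failed_list` names a file containing one failed-result filename
    per line; nodes are counted once per processor occurrence, which
    is why counts come in multiples of the processors per node.
    """
    counts = Counter()
    for result in Path(failed_list).read_text().split():
        counts.update(re.findall(r"node\d+", Path(result).read_text()))
    return counts

# Printing counts.most_common() in ascending order mimics
# `xcommon ../failed | tail`: the node present in the most
# failures (node483 above) sorts last.
```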

nas mpi consistency check
NAS MPI Consistency check
  • Consistency check command output:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.shuffle

# ../analm

NPB MPI

benchmark runs low high % mean median std dev

bt.B.16 600 9089.46 10415.15 14.58 10204.94 10217.94 143.14

cg.B.16 600 1095.60 1685.61 53.85 1570.48 1575.38 57.70

ep.B.16 600 155.81 160.64 3.10 158.48 158.37 0.59

ft.B.16 600 2102.39 3232.49 53.75 3052.71 3066.45 130.37

is.B.16 600 87.06 185.29 112.83 155.97 154.39 12.94

lu.B.16 600 5069.36 5892.62 16.24 5529.00 5531.17 111.84

mg.B.16 600 3265.89 3898.99 19.39 3737.80 3739.77 74.91

sp.B.16 600 2156.46 2404.05 11.48 2340.00 2340.05 26.89
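The analm columns above (low, high, %, mean, median, std dev) can be derived with standard statistics; note that the % column is the high-to-low spread relative to the low value. A minimal sketch (analm itself is a STAB script whose internals are an assumption here):

```python
import statistics

def summarize(samples):
    """Return (low, high, %, mean, median, stdev) for one benchmark,
    matching the analm columns: % = (high - low) / low * 100."""
    low, high = min(samples), max(samples)
    return (low,
            high,
            (high - low) / low * 100.0,
            statistics.mean(samples),
            statistics.median(samples),
            statistics.stdev(samples))

# e.g. for xhpl.16.15000 below: low 51.14, high 60.66 -> 18.62% spread
```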

nas mpi consistency
NAS MPI Consistency

The leading cause of variability on a stable system is switch contention. The only way to determine what is normal is to run the same set of benchmarks multiple times on an isolated set of stable nodes (nodes that passed the single-node tests) with the rest of the switch idle. I did not have time to run a full series of serialized parallel tests, but this is close:

# cd ~/bench/NPB3.2/NPB3.2-MPI/output.raw.sort

# ../analm $(nr -l node001-node080)

NPB MPI

benchmark runs low high % mean median std dev

bt.B.16 100 10025.30 10266.00 2.40 10129.42 10120.54 44.30

cg.B.16 100 1678.27 1787.76 6.52 1714.04 1712.43 15.39

ep.B.16 100 150.45 160.02 6.36 158.49 158.38 1.03

ft.B.16 100 3248.41 3694.40 13.73 3563.50 3575.43 81.22

is.B.16 100 159.31 168.14 5.54 163.91 164.22 1.98

lu.B.16 100 5156.19 5522.79 7.11 5346.95 5350.06 87.51

mg.B.16 100 3491.76 3685.78 5.56 3613.65 3614.44 37.25

sp.B.16 100 2259.08 2308.16 2.17 2289.66 2290.30 9.55

The above results are from the first 80 nodes run sorted. Each set of 8 nodes was isolated to a single Myrinet line card, reducing switch contention (however, every 2 sets of nodes did share a single line card). Also, to avoid possible variability due to differing memory performance, I limited the report to the first 80 nodes.

nas mpi correlation bt stream vs perf1
NAS MPI Correlation BT STREAM vs. Perf

$ CPLOTOPTS="-dy ," findc plot* | grep plot.c.ppc64

0.09 -0.09 05 plot.c.ppc64 plot.cg.B.16

0.00 0.00 100 plot.c.ppc64 plot.ep.B.16

0.14 -0.14 00 plot.c.ppc64 plot.ft.B.16

0.22 -0.22 00 plot.c.ppc64 plot.is.B.16

0.21 -0.21 00 plot.c.ppc64 plot.lu.B.16

0.41 -0.41 00 plot.c.ppc64 plot.mg.B.16

0.42 -0.42 00 plot.c.ppc64 plot.sp.B.16
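findc reports the correlation between a single-node metric (here STREAM copy on plot.c.ppc64) and each parallel benchmark. A Pearson correlation sketch, assuming each plot file holds one value per node (the exact meaning of findc's columns is not documented in this deck):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# |r| near 0 means the single-node metric does not explain the
# parallel result; sp.B above shows the strongest relationship (0.42).
```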

hpl mpi
HPL MPI
  • HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
  • xhpl is run 10 times (15 times for sorted) on each set of 8 unique nodes using 2 different node set methods: sorted and shuffle.
    • Sorted. Sets of 8 nodes are selected from a sorted list and assigned adjacently, e.g. node001-node008, node009-node016, etc. This is used to find consistency within the same set of nodes.
    • Shuffle. Sets of 8 nodes are selected from a shuffled list. Nodes are reshuffled between runs.
  • Both sorted and shuffle sets are run in parallel, i.e. all the sorted sets of 8 are run at the same time, then all the shuffle sets are run at the same time.
hpl mpi verification
HPL MPI verification

# cd ~/bench/hpl/output.raw.shuffle

This command will find the failed results and place the result filenames into the file ../failed:

# ../checkresults ../failed

This command will find the common nodes in all failed results in the file ../failed and sort them by number of occurrences (occurrences are counted by processor, not node):

# xcommon ../failed | tail

node073 2

node121 2

node090 2

node406 2

node308 2

node276 2

node103 2

node199 2

node435 4

node483 20

hpl mpi consistency
HPL MPI consistency

# cd ~/bench/hpl/output.raw.shuffle

# ../analm

HPL results

benchmark runs low high % mean median std dev

xhpl.16.15000 600 51.14 60.66 18.62 59.31 59.48 1.00

xhpl.16.30000 600 69.34 78.48 13.18 77.16 77.35 1.08

summary
Summary
  • node483 has accuracy issues.
  • node406 has weak Myrinet performance.
  • BC2 has a switch or uplink issue.
  • nodes 1-84 have a different memory configuration that does correlate with application performance.
  • Applications at large scales may experience no performance anomalies.
what is scab
What is SCAB?
  • SCalability Analysis of Benchmarks
  • The purpose of the SCAB HOWTO is to verify that the cluster you just built actually can do work at scale.  This can be accomplished by running a few industry accepted benchmarks.
  • The STAB/SCAB suite provides tools to plot scalability for visual analysis.
  • The STAB HOWTO should be completed first to rule out any inconsistencies that may appear as scaling issues.
the benchmarks
The Benchmarks
  • PMB (Pallas MPI Benchmark)
  • NPB (NAS Parallel Benchmark)
  • HPL (High Performance Linpack)
slide103
PMB
  • The Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.
  • NOTE:  Pallas has been acquired by Intel.  Intel has released the IMB (Intel MPI Benchmark).  The IMB is a minor update of the PMB.  The IMB was not used because it failed to execute properly with all MPI implementations that I tested.
  • IMPORTANT:  Consistent PMB Ping-Ping should be achieved before running this benchmark (STAB Lab).  Unresolved inconsistencies in the interconnect may appear as scaling issues.
  • The main purpose of this test is as a diagnostic to answer the following questions:
    • Are my MPI implementation's basic functions complete?
    • Does my MPI implementation scale?
slide104
PMB
  • Example plot from larger BC cluster.
  • Very impressive.  For the Sendrecv benchmark this cluster scales from 2 nodes to 240!  Could this be a non-blocking GigE configuration?  Another benchmark can help answer that question.
slide105
PMB
  • Example plot from larger BC cluster.
  • Quite revealing.  Sorted, the 4M message size performs at ~115MB/s for all node counts; shuffled, it falls gradually to ~10MB/s as the number of nodes increases.  Why?
slide106
PMB
  • This cluster is partitioned into 14 nodes/BladeCenter chassis.  Each chassis has a GigE switch with only 4 uplinks; 3 of the 4 uplinks are bonded together to form a single 3Gbit uplink to a stacked SMC GigE core switch.  Assuming no blocking within the core switch, this solution blocks at 14:3.
  • The Sendrecv benchmark is based on MPI_Sendrecv, the processes form a periodic communication chain. Each process sends to the right and receives from the left neighbor in the chain.
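The periodic chain described above pairs each rank with its neighbors modulo the job size. A sketch of the neighbor computation only (the actual PMB source is not reproduced here; this merely illustrates the communication pattern):

```python
def ring_neighbors(rank, size):
    """For MPI_Sendrecv in a periodic chain: each process sends to
    its right neighbor and receives from its left neighbor, with
    wraparound at the ends of the chain."""
    left = (rank - 1) % size    # receive from
    right = (rank + 1) % size   # send to
    return left, right

# With a sorted node list, rank and rank+1 usually share a chassis
# switch; with a shuffled list, the right neighbor is likely on
# another chassis and the traffic must cross the 14:3 blocking uplink.
```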
slide107
PMB
  • Based on the previous illustration it is easy to see why the sorted list performed so well.  Most of the traffic was isolated to good performing local switches and the jump from chassis to chassis through the SMC core switch only requires the bandwidth of a single link (1Gb full duplex).
  • With the shuffled list, the odds are small that a process's left neighbor (receive from) and right neighbor (send to) are on the same switch.  This was illustrated in the second plot.
  • Moral of the story.
    • Don’t trust interconnect vendors that do not provide the node list.
    • Ask for sorted and shuffled benchmarks.
questions w answers
Questions w/ Answers
  • Egan Ford, IBMegan@us.ibm.comegan@sense.net