BENCHMARKS
Ramon Zatarain

INDEX
- Benchmarks and benchmarking
- Relation of benchmarks to empirical methods
- Benchmark definition
- Types of benchmarks
- Benchmark suites
- Measuring performance (CPU, comparing performance, etc.)
- Common system benchmarks
- Examples of software benchmarks
“… community have no generally accepted methods or benchmarks for measuring and comparing the quality and utility of our research results”.
Some definitions are:
Real programs. Examples: compilers, text-processing software, etc.
Kernels. Examples: Livermore Loops and Linpack.
Toy benchmarks. Examples: Sieve of Eratosthenes, Puzzle, and Quicksort.
Synthetic benchmarks. Examples: Whetstone and Dhrystone.
Example: SPEC92 benchmark suite (20 programs)
Benchmark    Source    Lines of code    Description
Espresso     C                13,500    Minimizes Boolean functions
Li           C                 7,413    Lisp interpreter (9-queens problem)
Eqntott      C                 3,376    Translates Boolean equations
Compress     C                 1,503    Data compression
Sc           C                 8,116    Computations in a spreadsheet
Gcc          C                83,589    GNU C compiler
Spice2g6     Fortran          18,476    Circuit simulation package
Doduc        Fortran           5,334    Simulation of a nuclear reactor
Mdljdp2      Fortran           4,458    Chemical application
Wave5        Fortran           7,628    Electromagnetic simulation
Tomcatv      Fortran             195    Mesh-generation program
Ora          Fortran             535    Traces rays through an optical system
Alvinn       C                   272    Neural-network simulation
Ear          C                 4,483    Inner-ear model
Note: Sometimes MIPS can mean “meaningless indicators of performance for salesmen”.
                     Computer A   Computer B   Computer C
Program P1 (secs)             1           10           20
Program P2 (secs)          1000          100           20
Total time (secs)          1001          110           40

Execution times of two programs, and their totals, on three machines
CPU PERFORMANCE MEASURES
TOTAL EXECUTION TIME:
An average of the execution times that tracks total execution time is the arithmetic mean:

    Arithmetic mean = (1/n) * Σ_{i=1..n} Time_i

where Time_i is the execution time of the i-th program of a total of n programs in the workload.

When performance is expressed as a rate, we use the harmonic mean:

    Harmonic mean = n / Σ_{i=1..n} (1/Rate_i)

where Rate_i = 1/Time_i, the execution rate of the i-th of n programs in the workload. The harmonic mean is used when performance is measured in MIPS or MFLOPS.
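To make the two formulas concrete, here is a minimal C sketch (illustrative only; the function names and the MIPS figures are invented for this example). It computes the arithmetic mean of the Computer A times from the table above and the harmonic mean of two hypothetical MIPS ratings:

    #include <stdio.h>

    /* Arithmetic mean of n execution times (seconds). */
    double arithmetic_mean(const double *time, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += time[i];
        return sum / n;
    }

    /* Harmonic mean of n rates (e.g. MIPS or MFLOPS): n / sum(1/Rate_i). */
    double harmonic_mean(const double *rate, int n)
    {
        double sum_inv = 0.0;
        for (int i = 0; i < n; i++)
            sum_inv += 1.0 / rate[i];
        return n / sum_inv;
    }

    int main(void)
    {
        double times[] = { 1.0, 1000.0 };   /* Computer A, programs P1 and P2 (secs) */
        double mips[]  = { 50.0, 2.0 };     /* hypothetical MIPS ratings, for illustration */

        printf("arithmetic mean of times: %.2f s\n", arithmetic_mean(times, 2));
        printf("harmonic mean of rates:   %.2f MIPS\n", harmonic_mean(mips, 2));
        return 0;
    }

The harmonic mean is the appropriate average for rates because it corresponds to the total work divided by the total time; averaging MIPS figures arithmetically overstates the contribution of the fastest programs.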
WEIGHTED EXECUTION TIME
A question arises: what is the proper mixture of programs for the workload? In the arithmetic mean we assume that programs P1 and P2 run equally often in the workload.

A weighted arithmetic mean is given by

    Weighted arithmetic mean = Σ_{i=1..n} Weight_i * Time_i

where Weight_i is the frequency of the i-th program in the workload and Time_i is the execution time of program i.
                          Comp A    Comp B    Comp C      W1      W2      W3
Program P1 (secs)              1        10        20     .50    .909    .999
Program P2 (secs)           1000       100        20     .50    .091    .001
Arithmetic mean (W1)      500.50      55.0      20.0
Arithmetic mean (W2)       91.91     18.19      20.0
Arithmetic mean (W3)         2.0     10.09      20.0

Weighted arithmetic mean execution times using three weightings
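The rows of the table can be reproduced with a short C sketch (illustrative only; weighted_mean is an invented helper, and the weights and times are the ones listed above):

    #include <stdio.h>

    /* Weighted arithmetic mean: sum of Weight_i * Time_i. */
    double weighted_mean(const double *weight, const double *time, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += weight[i] * time[i];
        return sum;
    }

    int main(void)
    {
        double times_a[] = { 1.0, 1000.0 };    /* Computer A: P1, P2 (secs) */
        double w1[] = { 0.50,  0.50  };
        double w2[] = { 0.909, 0.091 };
        double w3[] = { 0.999, 0.001 };

        printf("W1: %.2f\n", weighted_mean(w1, times_a, 2));   /* 500.50 */
        printf("W2: %.2f\n", weighted_mean(w2, times_a, 2));   /*  91.91 */
        printf("W3: %.2f\n", weighted_mean(w3, times_a, 2));   /*   2.00 */
        return 0;
    }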
OO7
Designed to simulate a CAD/CAM environment. Tests:
- Pointer traversals over cached data; disk-resident data; sparse traversals; and dense traversals
- Updates: indexed and unindexed object fields; repeated updates; sparse updates; updates of cached data; and creation and deletion of objects
- Queries: exact-match lookup; ranges; collection scan; path-join; ad-hoc join; and single-level make
Originator: University of Wisconsin
Versions: unknown
Availability of source: free from ftp.cs.wisc.edu:/007
Availability of results: free from ftp.cs.wisc.edu:/007
Entry last updated: Thursday April 15 15:08:07 1993
AIM
AIM Technology, Palo Alto. Two suites (III and V).

Suite III: simulation of applications (task- or device-specific)
- Task-specific routines (word processing, database management, accounting)
- Device-specific routines (memory, disk, MFLOPs, I/Os)
- All measurements represent a percentage of VAX 11/780 performance (100%)
In general, Suite III gives an overall indication of performance.

Suite V: measures throughput in a multitasking workstation environment by testing:
- Incremental system loading
- Multiple aspects of system performance
The graphically displayed results plot the workload level versus time. Several different models characterize various user environments (financial, publishing, software engineering). The published reports are copyrighted.
An example of AIM benchmark results (in .pdf format)
Dhrystone
Short synthetic benchmark program intended to be representative of system (integer) programming. Based on published statistics on the use of programming language features; see the original publication in CACM 27,10 (Oct. 1984), 1013-1030. Originally published in Ada, now mostly used in C. Version 2 (in C) was published in SIGPLAN Notices 23,8 (Aug. 1988), 49-62, together with measurement rules. Version 1 is no longer recommended, since state-of-the-art compilers can eliminate too much "dead code" from the benchmark (however, quoted MIPS numbers are often based on Version 1).
Problems: Due to its small size (100 HLL statements, 1-1.5 KB code), the memory system outside the cache is not tested; compilers can too easily optimize for Dhrystone; and string operations are somewhat over-represented.
Recommendation: Use it for controlled experiments only; don't blindly trust single Dhrystone MIPS numbers quoted somewhere (as a rule, don't do this for any benchmark).
Originator: Reinhold Weicker, Siemens Nixdorf (email@example.com)
Versions in C: 1.0, 1.1, 2.0, 2.1 (final version, minor corrections compared with 2.0)
See also: R. P. Weicker, "A Detailed Look ..." (see Publications, 4.3)
Availability of source: firstname.lastname@example.org, ftp.nosc.mil:pub/aburto
Availability of results (no guarantee of correctness): same as above
Multipurpose benchmark used in various periodicals.
Originator: Workstation Labs
Versions: unknown
Availability of source: not free
Availability of results: UNIX Review
LINPACK
Kernel benchmark developed from the "LINPACK" package of linear algebra routines. Originally written and commonly used in FORTRAN; a C version also exists. Almost all of the benchmark's time is spent in a subroutine ("saxpy" in the single-precision version, "daxpy" in the double-precision version) that performs the inner loop of frequent matrix operations:

    y(i) = y(i) + a * x(i)

The standard version operates on 100x100 matrices; there are also versions for sizes 300x300 and 1000x1000, with different optimization rules.
Problems: The code is representative only of this type of computation, and LINPACK is easily vectorizable on most systems.
Originator: Jack Dongarra, Computer Science Department, University of Tennessee, email@example.com
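In C, the dominant inner loop corresponds roughly to the following daxpy sketch (illustrative; the real LINPACK routines also handle non-unit strides and loop unrolling):

    /* daxpy: y := a*x + y -- the loop that accounts for nearly all of
     * LINPACK's run time in the double-precision version. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }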
MUSBUS
Designed by Ken J. McDonell at Monash University in Australia, this is a very good benchmark of disk throughput and multi-user simulation. It compiles the test programs, creates the directories and the workload for the simulated users, and executes the simulation three times, measuring CPU and elapsed time. The workload consists of 11 commands (cc, rm, ed, ls, cp, spell, cat, mkdir, export, chmod, and an nroff-like spooler) and 5 programs (syscall, randmem, hanoi, pipe, and fstime). This is a very complete test that gives a significant measurement of CPU speed, C compiler and UNIX quality, file system performance, multi-user capabilities, disk throughput, and memory management implementation.
Nhfsstone
A benchmark intended to measure the performance of file servers that follow the NFS protocol. The work in this area continued within the LADDIS group and finally within SPEC. The SPEC benchmark 097.LADDIS is intended to replace Nhfsstone; it is superior to Nhfsstone in several aspects (multi-client capability, less ...).
SPEC
SPEC stands for Standard Performance Evaluation Corporation, a non-profit organization whose goal is to "establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high performance computers" (from SPEC's bylaws). The SPEC benchmarks and more information can be obtained from:

    SPEC [Standard Performance Evaluation Corporation]
    c/o NCGA [National Computer Graphics Association]
    2722 Merrilee Drive, Suite 200
    Fairfax, VA 22031, USA
    Phone: +1-703-698-9600 Ext. 325
    FAX: +1-703-560-2752
    E-Mail: firstname.lastname@example.org
The current SPEC benchmark suites are:
- CINT92 (CPU-intensive integer benchmarks)
- CFP92 (CPU-intensive floating point benchmarks)
- SDM (UNIX software development workloads)
- SFS (system-level file server (NFS) workload)
SSBA
The SSBA is the result of the studies of the AFUU (French Association of UNIX Users) Benchmark Working Group. This group, consisting of some 30 active members of varied origins (universities, public and private research, manufacturers, end users), has taken on the task of assessing the performance of data-processing systems: collecting as many tests as possible from throughout the world, dissecting their code and results, discussing their utility, fixing versions, and supplying them with various comments and procedures.
A sample output of the SSBA suite of UNIX benchmark tests
Sieve of Eratosthenes
An integer program that generates prime numbers using a method known as the Sieve of Eratosthenes.
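As a point of reference, here is a minimal C sketch of the algorithm (illustrative only; published versions of the benchmark differ in array size, iteration count, and in whether even numbers are skipped):

    #include <stdio.h>

    #define LIMIT 8190                  /* classic benchmark size; illustrative */

    int main(void)
    {
        static char composite[LIMIT + 1];       /* static, so zero-initialized */
        int count = 0;

        for (int i = 2; i <= LIMIT; i++) {
            if (!composite[i]) {
                count++;                                /* i is prime */
                for (int j = 2 * i; j <= LIMIT; j += i)
                    composite[j] = 1;                   /* mark its multiples */
            }
        }
        printf("%d primes up to %d\n", count, LIMIT);
        return 0;
    }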
TPC-A
TPC-A is a standardization of the Debit/Credit benchmark, first published in DATAMATION in 1985. It is based on a single, simple, update-intensive transaction which performs three updates and one insert across four tables. Transactions originate from terminals, with a requirement of 100 bytes in and 200 bytes out. There is a fixed scaling between tps rate, terminals, and database size. TPC-A requires an external RTE (remote terminal emulator) to drive the SUT (system under test).

TPC-C
The system performs five kinds of transactions: entering a new order, delivering orders, posting customer payments, retrieving a customer's most recent order, and monitoring the inventory level of recently ordered items.
Whetstone
The first major synthetic benchmark program, intended to be representative of numerical (floating-point intensive) programming. Based on statistics gathered at the National Physical Laboratory in England, using an Algol 60 compiler which translated Algol into instructions for the imaginary Whetstone machine. The compilation system was named after the small town of Whetstone, outside the city of Leicester, England, where it was designed.
Problems: Due to the small size of its modules, the memory system outside the cache is not tested; compilers can too easily optimize for Whetstone; and mathematical library functions are over-represented.
Originator: Brian Wichmann, NPL
One of the first and most popular benchmarks, WHETSTONE was originally published in 1976 by Curnow and Wichmann in Algol and subsequently translated into FORTRAN. This synthetic mix of elementary Whetstone instructions is modeled with statistics from about 1,000 scientific and engineering applications. WHETSTONE is rather small and, due to its straightforward coding, may be prone to particular (and unintentional) treatment by intelligent compilers. It is very sensitive to the handling of transcendental and trigonometric functions, and depends heavily on a fast floating-point unit or mathematics coprocessor. WHETSTONE is a good predictor for engineering and scientific applications.
SYSmark93
SYSmark93 provides benchmarks that can be used to measure the performance of IBM PC-compatible hardware on the tasks users perform on a regular basis. The SYSmark93 benchmarks represent the workloads of popular programs in application areas such as word processing, spreadsheets, databases, and desktop graphics.
Stanford
A collection of C routines developed in 1988 at Stanford University (J. Hennessy, P. Nye). Its two modules, Stanford Integer and Stanford Floating Point, provide a baseline for comparisons between Reduced Instruction Set (RISC) and Complex Instruction Set (CISC) processors.

Stanford Integer:
- Eight applications (integer matrix multiplication, sorting algorithms [quick, bubble, tree], permutation, hanoi, 8-queens puzzle)

Stanford Floating Point:
- Two applications (Fast Fourier Transform [FFT] and matrix multiplication)

The characteristics of the programs vary, but most of them have array accesses. There seems to be no official publication (only a printing in a performance report). Secondly, there is no defined weighting of the results (Sun and MIPS compute the geometric mean).
Bonnie
This is a file system benchmark that attempts to study bottlenecks. Specifically, it exercises the types of filesystem activity that have been observed to be bottlenecks in I/O-intensive applications, in particular the text-database work done in connection with the New Oxford English Dictionary Project at the University of Waterloo. It performs a series of tests on a file of known size. By default, that size is 100 MB (but that's not enough - see below). For each test, Bonnie reports the bytes processed per elapsed second, per CPU second, and the percent CPU usage (user and system). In each case, an attempt is made to keep optimizers from noticing that it's all bogus; the idea is to make sure that these are real transfers between user space and the physical disk.
IOBENCH
IOBENCH is a multi-stream benchmark that uses a controlling process (iobench) to start, coordinate, and measure a number of "user" processes (iouser); the Makefile parameters used for the SPEC version of IOBENCH cause ioserver to be built as a "do nothing" process.
IOZONE
This test writes an X-MB sequential file in Y-byte chunks, then rewinds it and reads it back. (The size of the file should be big enough to factor out the effect of any disk cache.) Finally, IOZONE deletes the temporary file.

The file is written (filling any cache buffers) and then read. If the cache is >= X MB, then most if not all of the reads will be satisfied from the cache. However, if the cache is <= .5X MB, then NONE of the reads will be satisfied from the cache. This is because, after the file is written, a .5X-MB cache will contain the upper .5X MB of the test file, but reading starts from the beginning of the file (data which is no longer in the cache). For this to be a fair test, the length of the test file must be AT LEAST 2X the amount of disk cache memory in your system; otherwise you are really testing the speed at which your CPU can read blocks out of the cache (not a fair test).
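A stripped-down sketch of that write/rewind/read procedure might look as follows in C (illustrative only, not the IOZONE source; FILE_MB and CHUNK_SIZE are made-up parameters, and POSIX clock_gettime is assumed for elapsed-time measurement):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define FILE_MB    64          /* pick at least 2x your disk cache */
    #define CHUNK_SIZE 4096        /* the Y-byte chunk size */

    static double now(void)        /* elapsed (wall-clock) seconds */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const char *path = "iozone_tmp.dat";
        const long chunks = (long)FILE_MB * 1024 * 1024 / CHUNK_SIZE;
        static char buf[CHUNK_SIZE];
        memset(buf, 'x', sizeof buf);

        /* Phase 1: sequential write in CHUNK_SIZE-byte pieces. */
        FILE *f = fopen(path, "wb");
        if (!f) { perror("fopen"); return 1; }
        double t0 = now();
        for (long i = 0; i < chunks; i++)
            fwrite(buf, 1, CHUNK_SIZE, f);
        fclose(f);
        double wsec = now() - t0;

        /* Phase 2: reopen ("rewind") and read the whole file back. */
        f = fopen(path, "rb");
        if (!f) { perror("fopen"); return 1; }
        t0 = now();
        while (fread(buf, 1, CHUNK_SIZE, f) == CHUNK_SIZE)
            ;                           /* just stream through the file */
        fclose(f);
        double rsec = now() - t0;

        remove(path);                   /* delete the temporary file */
        printf("write: %.2f MB/s  read: %.2f MB/s\n",
               FILE_MB / wsec, FILE_MB / rsec);
        return 0;
    }

As the text above explains, FILE_MB must be at least twice the machine's disk cache for the read phase to measure the disk rather than the cache.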
Byte
This famous test, taken from Byte (1984) and originally targeted at microcomputers, is a benchmark suite similar in spirit to SPEC, except that it is smaller and contains mostly things like the "Sieve of Eratosthenes" and "Dhrystone". If you are comparing different UNIX machines for performance, it gives fairly good numbers.

Netperf
A networking performance benchmark/tool. Includes throughput (bandwidth) and request/response (latency) tests for TCP and UDP using the BSD sockets API, DLPI, UNIX Domain Sockets, the Fore ATM API, and HP HiPPI Link Level Access. See ftp://ftp.cup.hp.com/dist/networking/benchmarks and ftp://sgi.com

Nettest
A network performance analysis tool developed at Cray; it measures network performance between two systems.

TTCP
TTCP times the transmission and reception of data between two systems using the UDP or TCP protocols. It differs from common "blast" tests, which tend to measure the remote Internet daemon (inetd) as much as the network performance, and which usually do not allow measurements at the remote end of a UDP transmission. This program was created at the US Army Ballistics Research Laboratory (BRL).

CPU2
The CPU2 benchmark was invented by Digital Review
(now Digital News and Review). To quote DEC, describing DN&R's benchmark,
CPU2 "...is a floating point intensive series of FORTRAN programs and consists
of thirty-four separate tests. The benchmark is most relevant in predicting the
performance of engineering and scientific applications. Performance is
expressed as a multiple of MicroVAX II Units of Performance.
The CPU2 benchmark is available via anonymous ftp from
swedishchef.lerc.nasa.gov in the drlabs/cpu directory.
Get cpu2.unix.tar.Z for unix systems or cpu2.vms.tar.Z for VMS systems."
Hartstone
Hartstone is a benchmark for measuring various aspects of hard real-time systems, from the Software Engineering Institute at Carnegie Mellon.

PC Bench / WinBench / NetBench
PC Bench 9.0, WinBench 95 Version 1.0, Winstone 95 Version 1.0, MacBench 2.0, NetBench 3.01, and ServerBench 2.0 are the current names and versions of the benchmarks available from the Ziff-Davis Benchmark Operation (ZDBOp).

Sim
An integer program that compares DNA segments for similarity.

Fhourstones
A small integer-only program that solves positions in the game of connect-4 using exhaustive search with a very large transposition table. Written in C.

Heapsort
An integer program that uses the "heap sort" method of sorting a random array of long integers up to 2 MB in size.
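For reference, a compact C sketch of the heap sort itself (illustrative; the benchmark's own source, array sizes, and timing harness differ):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sift the element at 'root' down to its place in a max-heap of size n. */
    static void sift_down(long *a, int root, int n)
    {
        long tmp = a[root];
        int child;
        while ((child = 2 * root + 1) < n) {
            if (child + 1 < n && a[child + 1] > a[child])
                child++;                        /* take the larger child */
            if (tmp >= a[child])
                break;
            a[root] = a[child];
            root = child;
        }
        a[root] = tmp;
    }

    void heap_sort(long *a, int n)
    {
        for (int i = n / 2 - 1; i >= 0; i--)    /* build the heap */
            sift_down(a, i, n);
        for (int i = n - 1; i > 0; i--) {       /* pull out the maximum */
            long tmp = a[0]; a[0] = a[i]; a[i] = tmp;
            sift_down(a, 0, i);
        }
    }

    int main(void)
    {
        enum { N = 100000 };                    /* illustrative size */
        long *a = malloc(N * sizeof *a);
        for (int i = 0; i < N; i++)
            a[i] = rand();                      /* random long integers */
        heap_sort(a, N);
        printf("first %ld, last %ld\n", a[0], a[N - 1]);
        free(a);
        return 0;
    }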
Hanoi
An integer program that solves the Towers of Hanoi puzzle using recursive function calls.

Flops C
Estimates the MFLOPS rating for specific floating point add, subtract, multiply, and divide (FADD, FSUB, FMUL, and FDIV) instruction mixes. Four distinct MFLOPS ratings are provided, based on FDIV weightings from 25% to 0% and using register-to-register operations. Works with both scalar and vector machines.

C LINPACK
The LINPACK floating point program converted to C.

TFFTDP
This program performs FFTs using the Duhamel-Hollman method, for FFTs from 32 to 262,144 points in size.

Matrix Multiply (MM)
This program contains nine different algorithms for doing matrix multiplication (500 x 500 standard size). Results illustrate the effects of cache thrashing versus algorithm, machine, compiler, and compiler options.
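The cache-thrashing point can be illustrated with a small sketch in C, showing how loop order alone changes the memory access pattern (illustrative only; not taken from the benchmark's source, though the 500 x 500 size matches the standard above):

    #include <stdio.h>

    #define N 500

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;                /* arbitrary operand data */
                b[i][j] = i - j;
            }

        /* i-k-j loop order: the inner loop walks b and c row by row, so
         * consecutive accesses hit consecutive memory.  The naive i-j-k
         * order strides down a column of b and thrashes the cache. */
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double aik = a[i][k];
                for (int j = 0; j < N; j++)
                    c[i][j] += aik * b[k][j];
            }

        printf("c[0][0] = %g\n", c[0][0]);
        return 0;
    }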
The different aspects to test are: CPU, I/O, and the file system. Benchmarking also involves setting up the experiments, documenting these setups, measuring results, and documenting the results (goal: maximize the comparability of experimental results).