an introduction to statistical analysis of simulation outputs
Skip this Video
Download Presentation

Loading in 2 Seconds...

play fullscreen
1 / 81


  • Uploaded on

AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS. Sample Statistics. If x 1 , x 2 , …, x n are n observations of the value of an unknown quantity X , they constitute a sample of size n for the population on which X is defined. Sample mean Sample variance.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
sample statistics
Sample Statistics
  • If x1, x2, …, xn are n observations of the value of an unknown quantity X, they constitute a sample of size n for the population on which X is defined.
  • Sample mean
  • Sample variance
central limit theorem
Central Limit Theorem
  • If the n mutually independent random variables x1, x2, …, xn have the same distribution, and if their mean m and their variance s2 exist then …
central limit theorem4
Central Limit Theorem
  • The random variable

is distributed according to the standard normal distribution (zero mean and unit variance).

estimating a mean
Estimating a mean
  • Assume that we have a sample x1, x2, …, xn consisting of n independent * observations of a given population
  • The sample mean xbar is an unbiased estimator of the mean m of the population

*This is the critical assumptionWithout it, we cannot apply the formula

large populations
Large populations
  • For large values of n, the (1-)% confidence interval for m is given by
  • withfor the standard normal distribution
  • 1 – a expressed in percent is the level of the confidence interval
    • 90% means a = 0.10
    • 95% means a = 0.05
    • 99% means a = 0.01
  • a is the error probability
    • Probability that the true mean falls outside the confidence interval
95 confidence intervals
95% Confidence Intervals
  • For =.05, za/2=1.96
  • Example:
    • If = 35, s= 4 and n = 100
    • The 95% confidence interval for m is

35 ± 1.96x4/10 = 35 ± 7.84

when s is not known
When s is not known
  • We can replace s in the preceding formula by the standard-deviation s of the sample
    • When n < 30, we must read the value of za/2 from a table of Student\'s t-distribution with n - 1 degrees of freedom
    • When n  30, we can use the standard values
confidence intervals in csim
Confidence Intervals in CSIM
  • CSIM can automatically compute confidence intervals for the mean value of any table, qtable and so on.
    • For everything but boxes
      • xyx->confidence();
    • For the elapsed times in a box
      • bd->time_confidence();
    • For the population of a box
      • bd->_number_confidence();
  • We get 90, 95 and 99 percent confidence intervals
confidence intervals in csim11
Confidence Intervals in CSIM
  • We get 90, 95 and 98 percent confidence intervals
    • Computed using batch means method
      • See next section
    • But only for the mean
estimating a proportion
Estimating a proportion
  • A proportion p represents the probabilityP(X) for some fixed threshold 
    • 97% of our customers have to wait less than one minute
  • Confidence intervals for proportions are much easier to compute than confidence intervals for quantiles
  • Assume that we have n independent observations x1, x2, …, xn of a given population variable X and that this variable has a continuous distribution
  • Let p represent the proportion we want to estimate, say P(X)
  • Let k represent the number of observations that are .
basic property
Basic property
  • If p represent the proportion we want to estimate, say P(X), and k represent the number of observations that are 
    • The rv k is distributed according to a binomial distribution with mean np and variance p(1 – p)
    • The rv k/n is distributed according to a binomial distribution with mean p and variance p(1 – p)/n
the formula
The formula
  • When n > 29.the distribution of k/n is approximately normal
    • We then have
the problem
The problem
  • All statistical analysis techniques we have discussed assume that sample values are mutually independent
  • Generally false for quantities such as
    • Waiting times, response times, …
  • Tend instead to be autocorrelated
    • When the waiting lines are long, everybody wait a long time
traditional solution
Traditional solution
  • Keep the measurements sufficiently apart
    • Sample them every T minutes apart
  • Not practical
    • Would require very long simulations
three good solutions
Three good solutions
  • Batch means
  • Regenerative method
  • Time series analysis
batch means
Batch means
  • We group consecutive observations into “batches”
  • We compute the means of these batches
  • We observe that autocorrelation among batch means decreases with size of batches
    • When size increases, each batch includes more observations that are far apart
  • We collected the following values:
    • 4, 3, 3, 4, 5, 5, 3, 2, 2, 3
  • We group them into two batches of five observations:
    • 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
  • The batch means are:
    • 3.8 and 3
batch means in csim
Batch means in CSIM
  • CSIM uses fixed-size batches
    • To compute confidence intervals
    • To control the duration of a simulation(run-length control)
regenerative method
Regenerative method
  • Most systems with queues go through states that return it to an state identical to its original state
    • The system regenerates itself
  • Examples:
    • Whenever a disk array is brought back to its original state
    • Whenever a camper rental agency has all its campers available
key idea
Key idea
  • Define a regeneration interval as an interval between two consecutive regeneration points:
    • Observations collected during the same regeneration interval can be correlated
    • Observations collected during different regeneration intervals are independent
  • We group together observations that occur within the same regeneration interval
  • We compute the means of these groups of observations
  • These group means are independent from each other
limitations of the approach
Limitations of the approach
  • Not general:
    • System must go through regeneration points
      • System must be idle
  • Leads to complex computations
    • We rarely have exactly the same number of observations in two different regenerations intervals
time series analysis
Time series analysis
  • Treats consecutive observations as elements of a time series
  • Estimates autocorrelation among the elements of a time series
  • Includes this autocorrelation in the computation of all confidence intervals
  • Accuracy of confidence intervals increases with duration of simulation
    • The 1/n factor
  • We would like to be able to stop the simulation once a given accuracy level has been reached for the confidence interval of a specific measurement
the csim solution i
The CSIM solution (I)
  • Specify
    • Quantity of interest
      • Mean value from a table, a qtable, …
    • A relative accuracy
      • Maximum relative error
    • A confidence level (say, 0.90 to 0.99)
    • A maximum simulation duration
      • In seconds of CPU time
the csim solution ii
The CSIM solution (II)
  • void table::run_length(double accuracy,double conf_level, double max_time)
  • void qtable::run_length(double accuracy,double conf_level, double max_time)
  • void meter::run_length(double accuracy,double conf_level, double max_time)
  • void box::time_run_length(double accuracy,double conf_level, double max_time)
  • void box::number_run_length(double accuracy,double conf_level, double max_time)
the csim solution iii
The CSIM solution (III)
  • Example:
    • thistable->run_length(.01, .95, 500)
      • Specifies than we want an error less than 1 percent for 95% confidence interval of mean of thistable
      • Stops simulation after 500 seconds
    • thisbox->:time_run_length(.01, .99, 500)
      • Same for time average of a box
the csim solution iv
The CSIM solution (IV)
  • Replace termination test by
    • converged.wait();
example i
Example (I)
  • The campers
  • We want
    • A maximum error of 1 percent (0.01)
    • For the 95 percent confidence interval
    • Of average number of rented campers
    • A maximum simulation time of 100 s
  • Will use number aspect of agency box
example ii
Example (II)
  • Add
    • agency->number_confidence()

after activation of box agency in csim process

example iii
Example (III)
  • Put main loop
    • while(simtime() < DURATION) { hold(exponential(MIART); customer();}

in a separate arrivals process

  • Make it an infinite loop
example iv
Example (IV)
  • Add
    • converged.wait();

after call to arrivals process

    • arrivals():converged.wait();
  • Best way to let sim process generate customers and wait for terminationin parallel
example v
Example (V)
  • The new arrivals process

void arrivals() { process(“arrivals”); // REQUIRED for(;;) { // forever loop hold(exponential(MIART)); customer(); } // forever} // arrivals

  • Confidence intervals do not take into account model inaccuracies
  • While the batch means method eliminates most effects of measurement autocorrelation, it is not always 100% effective
  • The max_time parameter of the run_length() will not necessarily stop the simulation just after the specified CPU time
    • Like the emergency brake of a train
  • Partition processes into different classes
    • Low priority
    • High priority
  • Obtain separate statistics for each process priority class
declaring and initializing process classes
Declaring and initializing process classes
  • To declare a dynamic process class:
    • process_class *c;
  • To initialize a process class before it can be used in any other statement.
    • c = new process_class("low priority")
to assign a priority class
To assign a priority class
  • Add inside the process
    • c->set_process_class();
  • Processes that have not been assigned a process class belong to the “default” process class
  • Can use
    • report()
    • report_classes()
other options
Other options
  • Can change the name of a process class:
    • c->set_name("high priority");
  • Can reset statistics associated with a process
    • c ->reset();
  • Can do the same for all process classes:
    • reset_process_classes();
  • Can delete a dynamic process class:
    • delete c;
  • We need to distinguish between
    • Truly random numbers
      • Obtained through observations of a physical random process
        • Rolling dices, roulette
        • Atomic decay
    • Pseudo-random numbers
      • Obtained through arithmetic operations
  • Linear Congruential Generators (LCG)
    • Easy to implement and fast
    • Defined by the recurrence relation:

rn+1 = (a rn + c ) mod m


      • r1, r2, … are the random values
      • m is the "modulus“
      • r0 is the seed
two realizations
Two realizations
  • GCC family of compilers
    • m = 232a = 69069 c = 5
  • Microsoft Visual/Quick C/C++
    • m = 232a = 214013 c = 2531011
problems with pseudorandom number generators
Problems with pseudorandom number generators
  • Much shorter periods for some seed states
  • Lack of uniformity of distribution
  • Correlation of successive values
better rngs
Better RNGs
  • Use the Mersenne twister
    • Period is 219937 - 1
  • Blum-Blum-Schub
a quote
A quote
  • "Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number– there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.“John von Neumann
csim rngs
  • BY default, CSIM uses a single stream of random numbers
  • Can reset the seed using
    • void reseed(stream *s, long n)

as in

    • reseed(NIL, 13579)
continuous distributions supported by csim i
Continuous distributions supported by CSIM (I)
  • double uniform(double min, double max)
  • double triangular(double min, double max,double mode)
  • double beta(double min, double max, doubleshape1, double shape2)
  • double exponential(double mean)
  • double gamma(double mean, double stddev)
  • double erlang(double mean, double var)
continuous distributions supported by csim ii
Continuous distributions supported by CSIM (II)
  • double hyperx(double mean, double var)
  • double weibull(double shape, double scale)
  • double normal(double mean, double stddev)
  • double lognormal(double mean, double stddev)
  • double cauchy(double alpha, double beta)
  • double hypoexponential(double mn, double var)
continuous distributions supported by csim iii
Continuous distributions supported by CSIM (III)
  • double pareto(double a)
  • double zipf(long n)
  • double zipf_sum(long n, double *sum)
discrete distributions supported by csim
Discrete distributions supported by CSIM
  • long uniform_int(long min, long max)
  • long bernoulli(double prob_success)
  • long binomial(double prob_success, longnum_trials)
  • long geometric(double prob_success)
  • long negative_binomial(long success_num,double prob_success)
  • long poisson(double mean)
using multiple streams
Using multiple streams
  • In campers example, sequence of RNs used to generate arrivals is affected by the numbers of campers
    • If agency has less campers
      • More customers will be lost
      • Lost customers do no generate any RN
  • Better to have separate random number streams for arrivals and service times
declaring and initialing random number streams
Declaring and initialing random number streams
  • Can use:
    • stream *s;s = new stream();
  • By default, streams are created with seeds that are spaced 100,000 values apart
reseeding a stream
Reseeding a stream
  • Use:
    • s->reseed(24680);

where the new seed is a long integer

other functions
Other functions
  • Inspect the current state of a stream
    • i = s->state();
  • Delete a stream
    • delete s;
using a specific seed
Using a specific seed
  • Prefix RNG function with name of seed:
    • s->uniform (3.0, 7.0)
raid array revisited
RAID array revisited
  • Reliability of a RAID array
  • Reliability R(t) of a system is the probability that will remain operational over a time interval [0. t] given that it was operational at time t = 0
    • Not the same as availability
    • Our focus is evaluating the risk of a data loss during array lifetime
executing multiple runs
Executing multiple runs
  • Multiple runs provide statistically independent repetitions of original simulation
    • Useful for
      • Collecting more accurate results
      • Constructing confidence intervals
  • Use rerun() function within a loop
    • create("sim") call must be inside that loop
overview of sim function
Overview of sim function

extern "C" void sim() { // sim process

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make it a process

// usual contents of sim process



} // while

report_hdr(); // produce statistics report

} // sim

global includes and defines
Global includes and defines

#include <cpp.h> // CSIM C++ header file

#define NDISKS 5 // number of disks in array

#define NYEARS 5 // useful lifetime of array

#define MTTF 300000.0 // disk MTTF

#define MTTR 24.0 // disk MTTR

#define NRUNS 100000 // no of runs

void disk(int i);

global declarations
Global declarations

int nfailed; // number of failed disks

int runcount = 0; // number of runs

int ndatalosses = 0; //counter

double lifetime = NYEARS * 365 * 24;

// simulation duration

sim function i
Sim function (I)

extern "C" void sim() { // sim process

int i;

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make this a process

dataloss = new event("dataloss");


nfailed = 0;

sim function ii
Sim function (II)

// create NDISKS disk processes

for (i=0; i < NDISKS; i++){


} // for


if (simtime() < lifetime) {


} // if



} // while

sim function iii
Sim function (III)

report_hdr(); // produce statistics report

printf("Array lifetime %fd years",


printf(“or %f hours\n", lifetime);

printf("Sim time %f\n", simtime());

printf("Disk MTTF %f hours\n", MTTF);

printf("Disk MTTR %ff hours\n", MTTR);

printf(“Completed runs %d\n", runcount);

printf(“Data lossese%d\n", ndatalosses);

} // sim

disk process i
Disk process (I)

void disk(int i) {


while(simtime() < lifetime || dataloss) {

hold(exponential(MTTF)); // disk failed


disk process ii
Disk process (II)

if (nfailed == 2) {

dataloss = 1;

failtime = simtime();



} // if

hold(MTTR); // repair process

nfailed--; // disk is replaced

} // while

} // disk

simulation outcome
Simulation Outcome

C++/CSIM Simulation Report (Version 19.0 for Linux x86)

Tue Apr 22 10:13:14 2008

Ending simulation time: 0.000

Elapsed simulation time: 0.000

CPU time used (seconds): 0.000

Array lifetime 5 years

which corresponds to 43800 hours

Simulated time 0

Disk MTTF 300000 hours

Disk MTTR 24 hours

Number of runs completed 100000

Number in runs ending in data loss 17

  • Whole program took less than 98 seconds on linux03 server
  • Simulated time should be equal to43,800 hours, not zero.
    • rerun() artifact?
  • We observe 17 failures out of 100,000 runs
    • Data survival rate is 99.983 percent
      • Three nines
confidence intervals
Confidence intervals
  • Data loss rate and survival rate are distributed according to a binomial law
  • Since n = 100,000,the distributions of both proportions are approximately normal
  • Should use
ci for data survival rate s
CI for data survival rate s
  • ŝ =99.983% or 0.99983
  • We compute

(ŝ(1 – ŝ)/n) =

(0.00017 x 0.99987)/100000) =0.0000412

  • 95% CI is

0.99983 ± 1.96 x 0.0000412 =

0.99983 ± 0.000081

[0.999749, 0.99911]

ci for data loss rate
CI for data loss rate

0.000017 ± 0.000081

[0.000089, 0.000251]

extensions i
Extensions (I)
  • Other repair time distributions
    • Exponential ( to compare with results of stochastic analysis)
    • Ad hoc (80% of repairs within one day, 20% within two days)
    • Taking accounts of differences between day and night, weekdays and weekends
      • Will map simtime() into a calendar
extensions ii
Extensions (II)
  • Other disk arrays
    • Mirrored disks: NDISKS = 2
    • RAID level 6: tolerate two disk failures
    • Triplicate disks: tolerate failure of two out of three disks
    • Two-dimensional RAID arrays
      • See next slide
      • Will require more work