AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS

1 / 81

# AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS - PowerPoint PPT Presentation

AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS. Sample Statistics. If x 1 , x 2 , …, x n are n observations of the value of an unknown quantity X , they constitute a sample of size n for the population on which X is defined. Sample mean Sample variance.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS' - elina

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### AN INTRODUCTION TO STATISTICAL ANALYSISOF SIMULATION OUTPUTS

Sample Statistics
• If x1, x2, …, xn are n observations of the value of an unknown quantity X, they constitute a sample of size n for the population on which X is defined.
• Sample mean
• Sample variance
Central Limit Theorem
• If the n mutually independent random variables x1, x2, …, xn have the same distribution, and if their mean m and their variance s2 exist then …
Central Limit Theorem
• The random variable

is distributed according to the standard normal distribution (zero mean and unit variance).

Estimating a mean
• Assume that we have a sample x1, x2, …, xn consisting of n independent * observations of a given population
• The sample mean xbar is an unbiased estimator of the mean m of the population

*This is the critical assumptionWithout it, we cannot apply the formula

Large populations
• For large values of n, the (1-)% confidence interval for m is given by
• withfor the standard normal distribution
Explanations
• 1 – a expressed in percent is the level of the confidence interval
• 90% means a = 0.10
• 95% means a = 0.05
• 99% means a = 0.01
• a is the error probability
• Probability that the true mean falls outside the confidence interval
95% Confidence Intervals
• For =.05, za/2=1.96
• Example:
• If = 35, s= 4 and n = 100
• The 95% confidence interval for m is

35 ± 1.96x4/10 = 35 ± 7.84

When s is not known
• We can replace s in the preceding formula by the standard-deviation s of the sample
• When n < 30, we must read the value of za/2 from a table of Student\'s t-distribution with n - 1 degrees of freedom
• When n  30, we can use the standard values
Confidence Intervals in CSIM
• CSIM can automatically compute confidence intervals for the mean value of any table, qtable and so on.
• For everything but boxes
• xyx->confidence();
• For the elapsed times in a box
• bd->time_confidence();
• For the population of a box
• bd->_number_confidence();
• We get 90, 95 and 99 percent confidence intervals
Confidence Intervals in CSIM
• We get 90, 95 and 98 percent confidence intervals
• Computed using batch means method
• See next section
• But only for the mean
Estimating a proportion
• A proportion p represents the probabilityP(X) for some fixed threshold 
• 97% of our customers have to wait less than one minute
• Confidence intervals for proportions are much easier to compute than confidence intervals for quantiles
Assumptions
• Assume that we have n independent observations x1, x2, …, xn of a given population variable X and that this variable has a continuous distribution
• Let p represent the proportion we want to estimate, say P(X)
• Let k represent the number of observations that are .
Basic property
• If p represent the proportion we want to estimate, say P(X), and k represent the number of observations that are 
• The rv k is distributed according to a binomial distribution with mean np and variance p(1 – p)
• The rv k/n is distributed according to a binomial distribution with mean p and variance p(1 – p)/n
The formula
• When n > 29.the distribution of k/n is approximately normal
• We then have

### AUTOCORRELATION

The problem
• All statistical analysis techniques we have discussed assume that sample values are mutually independent
• Generally false for quantities such as
• Waiting times, response times, …
• Tend instead to be autocorrelated
• When the waiting lines are long, everybody wait a long time
• Keep the measurements sufficiently apart
• Sample them every T minutes apart
• Not practical
• Would require very long simulations
Three good solutions
• Batch means
• Regenerative method
• Time series analysis
Batch means
• We group consecutive observations into “batches”
• We compute the means of these batches
• We observe that autocorrelation among batch means decreases with size of batches
• When size increases, each batch includes more observations that are far apart
Example
• We collected the following values:
• 4, 3, 3, 4, 5, 5, 3, 2, 2, 3
• We group them into two batches of five observations:
• 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
• The batch means are:
• 3.8 and 3
Batch means in CSIM
• CSIM uses fixed-size batches
• To compute confidence intervals
• To control the duration of a simulation(run-length control)
Regenerative method
• Most systems with queues go through states that return it to an state identical to its original state
• The system regenerates itself
• Examples:
• Whenever a disk array is brought back to its original state
• Whenever a camper rental agency has all its campers available
Key idea
• Define a regeneration interval as an interval between two consecutive regeneration points:
• Observations collected during the same regeneration interval can be correlated
• Observations collected during different regeneration intervals are independent
Application
• We group together observations that occur within the same regeneration interval
• We compute the means of these groups of observations
• These group means are independent from each other
Limitations of the approach
• Not general:
• System must go through regeneration points
• System must be idle
• We rarely have exactly the same number of observations in two different regenerations intervals
Time series analysis
• Treats consecutive observations as elements of a time series
• Estimates autocorrelation among the elements of a time series
• Includes this autocorrelation in the computation of all confidence intervals

### RUN LENGTH CONTROL

Objective
• Accuracy of confidence intervals increases with duration of simulation
• The 1/n factor
• We would like to be able to stop the simulation once a given accuracy level has been reached for the confidence interval of a specific measurement
The CSIM solution (I)
• Specify
• Quantity of interest
• Mean value from a table, a qtable, …
• A relative accuracy
• Maximum relative error
• A confidence level (say, 0.90 to 0.99)
• A maximum simulation duration
• In seconds of CPU time
The CSIM solution (II)
• void table::run_length(double accuracy,double conf_level, double max_time)
• void qtable::run_length(double accuracy,double conf_level, double max_time)
• void meter::run_length(double accuracy,double conf_level, double max_time)
• void box::time_run_length(double accuracy,double conf_level, double max_time)
• void box::number_run_length(double accuracy,double conf_level, double max_time)
The CSIM solution (III)
• Example:
• thistable->run_length(.01, .95, 500)
• Specifies than we want an error less than 1 percent for 95% confidence interval of mean of thistable
• Stops simulation after 500 seconds
• thisbox->:time_run_length(.01, .99, 500)
• Same for time average of a box
The CSIM solution (IV)
• Replace termination test by
• converged.wait();
Example (I)
• The campers
• We want
• A maximum error of 1 percent (0.01)
• For the 95 percent confidence interval
• Of average number of rented campers
• A maximum simulation time of 100 s
• Will use number aspect of agency box
Example (II)
• agency->number_confidence()

after activation of box agency in csim process

Example (III)
• Put main loop
• while(simtime() < DURATION) { hold(exponential(MIART); customer();}

in a separate arrivals process

• Make it an infinite loop
Example (IV)
• converged.wait();

after call to arrivals process

• arrivals():converged.wait();
• Best way to let sim process generate customers and wait for terminationin parallel
Example (V)
• The new arrivals process

void arrivals() { process(“arrivals”); // REQUIRED for(;;) { // forever loop hold(exponential(MIART)); customer(); } // forever} // arrivals

Warnings
• Confidence intervals do not take into account model inaccuracies
• While the batch means method eliminates most effects of measurement autocorrelation, it is not always 100% effective
• The max_time parameter of the run_length() will not necessarily stop the simulation just after the specified CPU time
• Like the emergency brake of a train
Objective
• Partition processes into different classes
• Low priority
• High priority
• Obtain separate statistics for each process priority class
Declaring and initializing process classes
• To declare a dynamic process class:
• process_class *c;
• To initialize a process class before it can be used in any other statement.
• c = new process_class("low priority")
To assign a priority class
• c->set_process_class();
• Processes that have not been assigned a process class belong to the “default” process class
Reporting
• Can use
• report()
• report_classes()
Other options
• Can change the name of a process class:
• c->set_name("high priority");
• Can reset statistics associated with a process
• c ->reset();
• Can do the same for all process classes:
• reset_process_classes();
• Can delete a dynamic process class:
• delete c;

### RANDOM NUMBERS

Background
• We need to distinguish between
• Truly random numbers
• Obtained through observations of a physical random process
• Rolling dices, roulette
• Atomic decay
• Pseudo-random numbers
• Obtained through arithmetic operations
Example
• Linear Congruential Generators (LCG)
• Easy to implement and fast
• Defined by the recurrence relation:

rn+1 = (a rn + c ) mod m

where

• r1, r2, … are the random values
• m is the "modulus“
• r0 is the seed
Two realizations
• GCC family of compilers
• m = 232a = 69069 c = 5
• Microsoft Visual/Quick C/C++
• m = 232a = 214013 c = 2531011
Problems with pseudorandom number generators
• Much shorter periods for some seed states
• Lack of uniformity of distribution
• Correlation of successive values
Better RNGs
• Use the Mersenne twister
• Period is 219937 - 1
• Blum-Blum-Schub
A quote
• "Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number– there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.“John von Neumann
CSIM RNGs
• BY default, CSIM uses a single stream of random numbers
• Can reset the seed using
• void reseed(stream *s, long n)

as in

• reseed(NIL, 13579)
Continuous distributions supported by CSIM (I)
• double uniform(double min, double max)
• double triangular(double min, double max,double mode)
• double beta(double min, double max, doubleshape1, double shape2)
• double exponential(double mean)
• double gamma(double mean, double stddev)
• double erlang(double mean, double var)
Continuous distributions supported by CSIM (II)
• double hyperx(double mean, double var)
• double weibull(double shape, double scale)
• double normal(double mean, double stddev)
• double lognormal(double mean, double stddev)
• double cauchy(double alpha, double beta)
• double hypoexponential(double mn, double var)
Continuous distributions supported by CSIM (III)
• double pareto(double a)
• double zipf(long n)
• double zipf_sum(long n, double *sum)
Discrete distributions supported by CSIM
• long uniform_int(long min, long max)
• long bernoulli(double prob_success)
• long binomial(double prob_success, longnum_trials)
• long geometric(double prob_success)
• long negative_binomial(long success_num,double prob_success)
• long poisson(double mean)
Using multiple streams
• In campers example, sequence of RNs used to generate arrivals is affected by the numbers of campers
• If agency has less campers
• More customers will be lost
• Lost customers do no generate any RN
• Better to have separate random number streams for arrivals and service times
Declaring and initialing random number streams
• Can use:
• stream *s;s = new stream();
• By default, streams are created with seeds that are spaced 100,000 values apart
Reseeding a stream
• Use:
• s->reseed(24680);

where the new seed is a long integer

Other functions
• Inspect the current state of a stream
• i = s->state();
• Delete a stream
• delete s;
Using a specific seed
• Prefix RNG function with name of seed:
• s->uniform (3.0, 7.0)

### A CASE STUDY

RAID array revisited
• Reliability of a RAID array
• Reliability R(t) of a system is the probability that will remain operational over a time interval [0. t] given that it was operational at time t = 0
• Not the same as availability
• Our focus is evaluating the risk of a data loss during array lifetime
Executing multiple runs
• Multiple runs provide statistically independent repetitions of original simulation
• Useful for
• Collecting more accurate results
• Constructing confidence intervals
• Use rerun() function within a loop
• create("sim") call must be inside that loop
Overview of sim function

extern "C" void sim() { // sim process

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make it a process

// usual contents of sim process

rerun();

runcount++;

} // while

report_hdr(); // produce statistics report

} // sim

Global includes and defines

#include <cpp.h> // CSIM C++ header file

#define NDISKS 5 // number of disks in array

#define NYEARS 5 // useful lifetime of array

#define MTTF 300000.0 // disk MTTF

#define MTTR 24.0 // disk MTTR

#define NRUNS 100000 // no of runs

void disk(int i);

Global declarations

int nfailed; // number of failed disks

int runcount = 0; // number of runs

int ndatalosses = 0; //counter

double lifetime = NYEARS * 365 * 24;

// simulation duration

Sim function (I)

extern "C" void sim() { // sim process

int i;

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make this a process

dataloss = new event("dataloss");

dataloss->clear();

nfailed = 0;

Sim function (II)

// create NDISKS disk processes

for (i=0; i < NDISKS; i++){

disk(i);

} // for

ndatalosses++;

} // if

rerun();

runcount++;

} // while

Sim function (III)

report_hdr(); // produce statistics report

NYEARS);

printf("Sim time %f\n", simtime());

printf("Disk MTTF %f hours\n", MTTF);

printf("Disk MTTR %ff hours\n", MTTR);

printf(“Completed runs %d\n", runcount);

printf(“Data lossese%d\n", ndatalosses);

} // sim

Disk process (I)

void disk(int i) {

create("disk");

while(simtime() < lifetime || dataloss) {

hold(exponential(MTTF)); // disk failed

nfailed++;

Disk process (II)

if (nfailed == 2) {

dataloss = 1;

failtime = simtime();

finish->set();

terminate();

} // if

hold(MTTR); // repair process

nfailed--; // disk is replaced

} // while

} // disk

Simulation Outcome

C++/CSIM Simulation Report (Version 19.0 for Linux x86)

Tue Apr 22 10:13:14 2008

Ending simulation time: 0.000

Elapsed simulation time: 0.000

CPU time used (seconds): 0.000

which corresponds to 43800 hours

Simulated time 0

Disk MTTF 300000 hours

Disk MTTR 24 hours

Number of runs completed 100000

Number in runs ending in data loss 17

Discussion
• Whole program took less than 98 seconds on linux03 server
• Simulated time should be equal to43,800 hours, not zero.
• rerun() artifact?
• We observe 17 failures out of 100,000 runs
• Data survival rate is 99.983 percent
• Three nines
Confidence intervals
• Data loss rate and survival rate are distributed according to a binomial law
• Since n = 100,000,the distributions of both proportions are approximately normal
• Should use
CI for data survival rate s
• ŝ =99.983% or 0.99983
• We compute

(ŝ(1 – ŝ)/n) =

(0.00017 x 0.99987)/100000) =0.0000412

• 95% CI is

0.99983 ± 1.96 x 0.0000412 =

0.99983 ± 0.000081

[0.999749, 0.99911]

CI for data loss rate

0.000017 ± 0.000081

[0.000089, 0.000251]

Extensions (I)
• Other repair time distributions
• Exponential ( to compare with results of stochastic analysis)
• Ad hoc (80% of repairs within one day, 20% within two days)
• Taking accounts of differences between day and night, weekdays and weekends
• Will map simtime() into a calendar
Extensions (II)
• Other disk arrays
• Mirrored disks: NDISKS = 2
• RAID level 6: tolerate two disk failures
• Triplicate disks: tolerate failure of two out of three disks
• Two-dimensional RAID arrays
• See next slide
• Will require more work