Loading in 2 Seconds...

AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS

Loading in 2 Seconds...

- By
**elina** - Follow User

- 233 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'AN INTRODUCTION TO STATISTICAL ANALYSIS OF SIMULATION OUTPUTS' - elina

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Sample Statistics

- If x1, x2, …, xn are n observations of the value of an unknown quantity X, they constitute a sample of size n for the population on which X is defined.
- Sample mean
- Sample variance

Central Limit Theorem

- If the n mutually independent random variables x1, x2, …, xn have the same distribution, and if their mean m and their variance s2 exist then …

Central Limit Theorem

- The random variable

is distributed according to the standard normal distribution (zero mean and unit variance).

Estimating a mean

- Assume that we have a sample x1, x2, …, xn consisting of n independent * observations of a given population
- The sample mean xbar is an unbiased estimator of the mean m of the population

*This is the critical assumptionWithout it, we cannot apply the formula

Large populations

- For large values of n, the (1-)% confidence interval for m is given by
- withfor the standard normal distribution

Explanations

- 1 – a expressed in percent is the level of the confidence interval
- 90% means a = 0.10
- 95% means a = 0.05
- 99% means a = 0.01
- a is the error probability
- Probability that the true mean falls outside the confidence interval

95% Confidence Intervals

- For =.05, za/2=1.96
- Example:
- If = 35, s= 4 and n = 100
- The 95% confidence interval for m is

35 ± 1.96x4/10 = 35 ± 7.84

When s is not known

- We can replace s in the preceding formula by the standard-deviation s of the sample
- When n < 30, we must read the value of za/2 from a table of Student\'s t-distribution with n - 1 degrees of freedom
- When n 30, we can use the standard values

Confidence Intervals in CSIM

- CSIM can automatically compute confidence intervals for the mean value of any table, qtable and so on.
- For everything but boxes
- xyx->confidence();
- For the elapsed times in a box
- bd->time_confidence();
- For the population of a box
- bd->_number_confidence();
- We get 90, 95 and 99 percent confidence intervals

Confidence Intervals in CSIM

- We get 90, 95 and 98 percent confidence intervals
- Computed using batch means method
- See next section
- But only for the mean

Estimating a proportion

- A proportion p represents the probabilityP(X) for some fixed threshold
- 97% of our customers have to wait less than one minute
- Confidence intervals for proportions are much easier to compute than confidence intervals for quantiles

Assumptions

- Assume that we have n independent observations x1, x2, …, xn of a given population variable X and that this variable has a continuous distribution
- Let p represent the proportion we want to estimate, say P(X)
- Let k represent the number of observations that are .

Basic property

- If p represent the proportion we want to estimate, say P(X), and k represent the number of observations that are
- The rv k is distributed according to a binomial distribution with mean np and variance p(1 – p)
- The rv k/n is distributed according to a binomial distribution with mean p and variance p(1 – p)/n

The formula

- When n > 29.the distribution of k/n is approximately normal
- We then have

The problem

- All statistical analysis techniques we have discussed assume that sample values are mutually independent
- Generally false for quantities such as
- Waiting times, response times, …
- Tend instead to be autocorrelated
- When the waiting lines are long, everybody wait a long time

Traditional solution

- Keep the measurements sufficiently apart
- Sample them every T minutes apart
- Not practical
- Would require very long simulations

Three good solutions

- Batch means
- Regenerative method
- Time series analysis

Batch means

- We group consecutive observations into “batches”
- We compute the means of these batches
- We observe that autocorrelation among batch means decreases with size of batches
- When size increases, each batch includes more observations that are far apart

Example

- We collected the following values:
- 4, 3, 3, 4, 5, 5, 3, 2, 2, 3
- We group them into two batches of five observations:
- 4, 3, 3, 4, 5 and 5, 3, 2, 2, 3
- The batch means are:
- 3.8 and 3

Batch means in CSIM

- CSIM uses fixed-size batches
- To compute confidence intervals
- To control the duration of a simulation(run-length control)

Regenerative method

- Most systems with queues go through states that return it to an state identical to its original state
- The system regenerates itself
- Examples:
- Whenever a disk array is brought back to its original state
- Whenever a camper rental agency has all its campers available

Key idea

- Define a regeneration interval as an interval between two consecutive regeneration points:
- Observations collected during the same regeneration interval can be correlated
- Observations collected during different regeneration intervals are independent

Application

- We group together observations that occur within the same regeneration interval
- We compute the means of these groups of observations
- These group means are independent from each other

Limitations of the approach

- Not general:
- System must go through regeneration points
- System must be idle
- Leads to complex computations
- We rarely have exactly the same number of observations in two different regenerations intervals

Time series analysis

- Treats consecutive observations as elements of a time series
- Estimates autocorrelation among the elements of a time series
- Includes this autocorrelation in the computation of all confidence intervals

Objective

- Accuracy of confidence intervals increases with duration of simulation
- The 1/n factor
- We would like to be able to stop the simulation once a given accuracy level has been reached for the confidence interval of a specific measurement

The CSIM solution (I)

- Specify
- Quantity of interest
- Mean value from a table, a qtable, …
- A relative accuracy
- Maximum relative error
- A confidence level (say, 0.90 to 0.99)
- A maximum simulation duration
- In seconds of CPU time

The CSIM solution (II)

- void table::run_length(double accuracy,double conf_level, double max_time)
- void qtable::run_length(double accuracy,double conf_level, double max_time)
- void meter::run_length(double accuracy,double conf_level, double max_time)
- void box::time_run_length(double accuracy,double conf_level, double max_time)
- void box::number_run_length(double accuracy,double conf_level, double max_time)

The CSIM solution (III)

- Example:
- thistable->run_length(.01, .95, 500)
- Specifies than we want an error less than 1 percent for 95% confidence interval of mean of thistable
- Stops simulation after 500 seconds
- thisbox->:time_run_length(.01, .99, 500)
- Same for time average of a box

The CSIM solution (IV)

- Replace termination test by
- converged.wait();

Example (I)

- The campers
- We want
- A maximum error of 1 percent (0.01)
- For the 95 percent confidence interval
- Of average number of rented campers
- A maximum simulation time of 100 s
- Will use number aspect of agency box

Example (III)

- Put main loop
- while(simtime() < DURATION) { hold(exponential(MIART); customer();}

in a separate arrivals process

- Make it an infinite loop

Example (IV)

- Add
- converged.wait();

after call to arrivals process

- arrivals():converged.wait();
- Best way to let sim process generate customers and wait for terminationin parallel

Example (V)

- The new arrivals process

void arrivals() { process(“arrivals”); // REQUIRED for(;;) { // forever loop hold(exponential(MIART)); customer(); } // forever} // arrivals

Warnings

- Confidence intervals do not take into account model inaccuracies
- While the batch means method eliminates most effects of measurement autocorrelation, it is not always 100% effective
- The max_time parameter of the run_length() will not necessarily stop the simulation just after the specified CPU time
- Like the emergency brake of a train

Objective

- Partition processes into different classes
- Low priority
- High priority
- Obtain separate statistics for each process priority class

Declaring and initializing process classes

- To declare a dynamic process class:
- process_class *c;
- To initialize a process class before it can be used in any other statement.
- c = new process_class("low priority")

To assign a priority class

- Add inside the process
- c->set_process_class();
- Processes that have not been assigned a process class belong to the “default” process class

Reporting

- Can use
- report()
- report_classes()

Other options

- Can change the name of a process class:
- c->set_name("high priority");
- Can reset statistics associated with a process
- c ->reset();
- Can do the same for all process classes:
- reset_process_classes();
- Can delete a dynamic process class:
- delete c;

Background

- We need to distinguish between
- Truly random numbers
- Obtained through observations of a physical random process
- Rolling dices, roulette
- Atomic decay
- Pseudo-random numbers
- Obtained through arithmetic operations

Example

- Linear Congruential Generators (LCG)
- Easy to implement and fast
- Defined by the recurrence relation:

rn+1 = (a rn + c ) mod m

where

- r1, r2, … are the random values
- m is the "modulus“
- r0 is the seed

Two realizations

- GCC family of compilers
- m = 232a = 69069 c = 5
- Microsoft Visual/Quick C/C++
- m = 232a = 214013 c = 2531011

Problems with pseudorandom number generators

- Much shorter periods for some seed states
- Lack of uniformity of distribution
- Correlation of successive values
- …

Better RNGs

- Use the Mersenne twister
- Period is 219937 - 1
- Blum-Blum-Schub

A quote

- "Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number– there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.“John von Neumann

CSIM RNGs

- BY default, CSIM uses a single stream of random numbers
- Can reset the seed using
- void reseed(stream *s, long n)

as in

- reseed(NIL, 13579)

Continuous distributions supported by CSIM (I)

- double uniform(double min, double max)
- double triangular(double min, double max,double mode)
- double beta(double min, double max, doubleshape1, double shape2)
- double exponential(double mean)
- double gamma(double mean, double stddev)
- double erlang(double mean, double var)

Continuous distributions supported by CSIM (II)

- double hyperx(double mean, double var)
- double weibull(double shape, double scale)
- double normal(double mean, double stddev)
- double lognormal(double mean, double stddev)
- double cauchy(double alpha, double beta)
- double hypoexponential(double mn, double var)

Continuous distributions supported by CSIM (III)

- double pareto(double a)
- double zipf(long n)
- double zipf_sum(long n, double *sum)

Discrete distributions supported by CSIM

- long uniform_int(long min, long max)
- long bernoulli(double prob_success)
- long binomial(double prob_success, longnum_trials)
- long geometric(double prob_success)
- long negative_binomial(long success_num,double prob_success)
- long poisson(double mean)

Empirical distributions supported by CSIM

- Not clear enough

Using multiple streams

- In campers example, sequence of RNs used to generate arrivals is affected by the numbers of campers
- If agency has less campers
- More customers will be lost
- Lost customers do no generate any RN
- Better to have separate random number streams for arrivals and service times

Declaring and initialing random number streams

- Can use:
- stream *s;s = new stream();
- By default, streams are created with seeds that are spaced 100,000 values apart

Other functions

- Inspect the current state of a stream
- i = s->state();
- Delete a stream
- delete s;

Using a specific seed

- Prefix RNG function with name of seed:
- s->uniform (3.0, 7.0)

RAID array revisited

- Reliability of a RAID array
- Reliability R(t) of a system is the probability that will remain operational over a time interval [0. t] given that it was operational at time t = 0
- Not the same as availability
- Our focus is evaluating the risk of a data loss during array lifetime

Executing multiple runs

- Multiple runs provide statistically independent repetitions of original simulation
- Useful for
- Collecting more accurate results
- Constructing confidence intervals
- Use rerun() function within a loop
- create("sim") call must be inside that loop

Overview of sim function

extern "C" void sim() { // sim process

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make it a process

// usual contents of sim process

rerun();

runcount++;

} // while

report_hdr(); // produce statistics report

} // sim

Global includes and defines

#include <cpp.h> // CSIM C++ header file

#define NDISKS 5 // number of disks in array

#define NYEARS 5 // useful lifetime of array

#define MTTF 300000.0 // disk MTTF

#define MTTR 24.0 // disk MTTR

#define NRUNS 100000 // no of runs

void disk(int i);

Global declarations

int nfailed; // number of failed disks

int runcount = 0; // number of runs

int ndatalosses = 0; //counter

double lifetime = NYEARS * 365 * 24;

// simulation duration

Sim function (I)

extern "C" void sim() { // sim process

int i;

runcount = 0;

while(runcount < NRUNS) {

create("sim"); // make this a process

dataloss = new event("dataloss");

dataloss->clear();

nfailed = 0;

Sim function (II)

// create NDISKS disk processes

for (i=0; i < NDISKS; i++){

disk(i);

} // for

dataloss->timed_wait(lifetime);

if (simtime() < lifetime) {

ndatalosses++;

} // if

rerun();

runcount++;

} // while

Sim function (III)

report_hdr(); // produce statistics report

printf("Array lifetime %fd years",

NYEARS);

printf(“or %f hours\n", lifetime);

printf("Sim time %f\n", simtime());

printf("Disk MTTF %f hours\n", MTTF);

printf("Disk MTTR %ff hours\n", MTTR);

printf(“Completed runs %d\n", runcount);

printf(“Data lossese%d\n", ndatalosses);

} // sim

Disk process (I)

void disk(int i) {

create("disk");

while(simtime() < lifetime || dataloss) {

hold(exponential(MTTF)); // disk failed

nfailed++;

Disk process (II)

if (nfailed == 2) {

dataloss = 1;

failtime = simtime();

finish->set();

terminate();

} // if

hold(MTTR); // repair process

nfailed--; // disk is replaced

} // while

} // disk

Simulation Outcome

C++/CSIM Simulation Report (Version 19.0 for Linux x86)

Tue Apr 22 10:13:14 2008

Ending simulation time: 0.000

Elapsed simulation time: 0.000

CPU time used (seconds): 0.000

Array lifetime 5 years

which corresponds to 43800 hours

Simulated time 0

Disk MTTF 300000 hours

Disk MTTR 24 hours

Number of runs completed 100000

Number in runs ending in data loss 17

Discussion

- Whole program took less than 98 seconds on linux03 server
- Simulated time should be equal to43,800 hours, not zero.
- rerun() artifact?
- We observe 17 failures out of 100,000 runs
- Data survival rate is 99.983 percent
- Three nines

Confidence intervals

- Data loss rate and survival rate are distributed according to a binomial law
- Since n = 100,000,the distributions of both proportions are approximately normal
- Should use

CI for data survival rate s

- ŝ =99.983% or 0.99983
- We compute

(ŝ(1 – ŝ)/n) =

(0.00017 x 0.99987)/100000) =0.0000412

- 95% CI is

0.99983 ± 1.96 x 0.0000412 =

0.99983 ± 0.000081

[0.999749, 0.99911]

Extensions (I)

- Other repair time distributions
- Exponential ( to compare with results of stochastic analysis)
- Ad hoc (80% of repairs within one day, 20% within two days)
- Taking accounts of differences between day and night, weekdays and weekends
- Will map simtime() into a calendar

Extensions (II)

- Other disk arrays
- Mirrored disks: NDISKS = 2
- RAID level 6: tolerate two disk failures
- Triplicate disks: tolerate failure of two out of three disks
- Two-dimensional RAID arrays
- See next slide
- Will require more work

Download Presentation

Connecting to Server..