This presentation is the property of its rightful owner.
1 / 40

# Probability Plots PowerPoint PPT Presentation

Probability Plots. Jake Blanchard Spring 2010. Introduction. Probability plots allow us to assess the degree to which a set of data fits a particular distribution

Probability Plots

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## Probability Plots

Jake Blanchard

Spring 2010

Uncertainty Analysis for Engineers

### Introduction

• Probability plots allow us to assess the degree to which a set of data fits a particular distribution

• The idea is to scale the x-axis of a CDF such that the result would be a straight line if the data conforms to the assumed distribution

Uncertainty Analysis for Engineers

### An Example

• Suppose we have a set of data that we suspect is normal.

• First, we form an empirical cdf

• [f,xx]=ecdf(x)

• Then scale cdf so that each unit on the axis corresponds to 1 standard deviation

• z=norminv(ff);

• Then we plot the data (sorted) against this new axis

• figure, plot(y,z,'+')

Uncertainty Analysis for Engineers

### The Script

n=100;

x=normrnd(10,3,n,1);

y=sort(x);

[f, xx]=ecdf(x)

for i=1:n

ff(i)=(f(i)+f(i+1))/2;

end

z=norminv(ff);

figure, plot(y,z,'+')

Uncertainty Analysis for Engineers

Uncertainty Analysis for Engineers

### Matlab has an alternative

• probplot('norm',x)

• Options are

• exponential

• extreme value

• lognormal

• normal

• rayleigh

• weibull

Uncertainty Analysis for Engineers

Vertical axis here is cdf, not number of standard deviations

Uncertainty Analysis for Engineers

### Look at some lognormal data

Normal probability plot

Normal probability plot of log of data

Lognormal probability plot

Uncertainty Analysis for Engineers

### Now some exponential data

Lognormal probability plot

Exponential probability plot

Uncertainty Analysis for Engineers

### Facts

• On normal probability plots, the intercept is the mean

• On exponential paper, the slope is 1/

• Results at the extremes are expected to deviate from the straight line more than those in the middle

• On the other hand, for some data, multiple distributions will fit in the center, but not in the tails

Uncertainty Analysis for Engineers

### Results

• We are dealing with samples, so our conclusions tend to be one of

• The model appears to be adequate

• The model is questionable

• The model is not adequate

Uncertainty Analysis for Engineers

### Comparison

• Take some wind data (maximum measured wind velocity over a given period)

• 20 data points taken over 20 years

• Compare all 6 Matlab probability plots

• Compare looking at CDFs

• Compare other error measures

Uncertainty Analysis for Engineers

Uncertainty Analysis for Engineers

Uncertainty Analysis for Engineers

### Goodness of Fit Statistics

• For discrete and continuous sampled data distributions

• Chi-square statistic

• Kolmogorov-Smirnoff (K-S) statistic

• Anderson-Darling (A-D) statistic

• Root Mean Square Error (RMS).

• Value is limited if there are fewer than about 30 data points.

• The lower the value, the closer the distribution appears to fit the data. But they do not provide a measure that the data actually come from the distribution.

### Chi-square statistic

• This goodness-of-fit statistic measures

• The oldest, most commonly used

• Data are grouped into frequency cells and compared to the expected number of observations based on the proposed distribution.

• Definition

• Where O(i) is the observed frequency of the ith histogram bar and

• E(i) is the expected frequency from the fitted distribution of x values falling within the x range of the ith histogram bar.

• It can be overly sensitive to large errors

### Chi-Squared Tests

• First we divide values into groups; suggestion is

• For example, if we have n=500 data points, then this gives us about 45 groups (I would use 50 for convenience)

### Example (cont.)

• Sort data and divide into 50 cells

• Find upper and lower bound of values in each cell

• Calculate expected number of data points in each cell by subractingcdf of lower bound from cdf of upper bound

### Example (cont.)

• Compare this value to the value of the chi-squared distribution for k-np-1 degrees of freedom and a desired confidence level, where np is the number of parameters in the model (eg, 2 for a normal distribution)

### Example (cont.)

• In Matlab, we can get this chi-squared distribution from chi2inv(p,v), where p is the confidence level (0<p<1) and v is the number of DOF

• If our calculated value for chi-squared is less than chi-square distribution, then the fit is OK

### Kolmogorov-Smirnov Test

• Compare measured cumulative frequency with CDF of assumed theoretical distribution

• Compare the maximum discrepancy between these two with a critical value of a test statistic and reject fit if former exceeds latter

• Good when we don’t have many data points

### Kolmogorov-Smirnoff Statistic

• Where Dn is the K-S distance,

• n is the total number of data points,

• F(x) is the distribution function of the fitted distribution,

• and Fn(x)=i/n and i is the cumulative rank of the data point.

• K-S is better than χ2because data are assessed at all points—

• avoids problem of number of bars (bins).

• But value determined by the one largest discrepancy

• So it takes no account of lack of fit across entire distribution

• ### More on K-S statistic

• The position of Dn along the x-axis is more likely to occur away from the low probability tails.

• This insensitivity to lack of fit at the extremes is corrected for in the Anderson-Darling statistic.

• Some statistical literature is critical about distribution fitting software that use this statistic as a goodness-of-fit test.

• Because the statistic assumes the fitted distribution is fully specified so that the critical region of the curve can be checked.

### Process

• Sort n data points

• Make a step-wise cdf

• (cdfplot in Matlab)

• Fit data to a model to obtain model cdf

• Find maximum difference between these two cdf’s over each of the steps in the first cdf

• Look up comparison data in tables

### What is the Significance Level?

• Our hypothesis is that the fit is a good fit.

• If difference in cdf’s exceeds that of the test statistic, we reject the hypothesis

• There are two possibilities:

• Fit really is bad, or

• We are rejecting a good fit

• Significance level is probability that we are rejecting a good fit

### Fit Tests in Matlab

• chi2gof(x) – normal distribution only

• kstest(x,CDF)

### Matlab Code

w1=data(:,4); n=numel(w1);

hist(w1,25); ncells=128; npoints=51;

nused=ncells*npoints;

datasort=sort(w1(1:nused));

a=4.52997; b=2.72931; %from dfittol

chisqr=0;

upperbound=min(datasort);

for i=1:ncells

indx=(i-1)*(npoints)+1;

thisset=datasort(indx:indx+npoints-1);

lowerbound=upperbound;

upperbound=max(thisset);

ee=n*(gamcdf(upperbound,a,b)-gamcdf(lowerbound,a,b));

chisqr=chisqr+(npoints-ee)^2/ee;

end

chisqr

param=2; v=ncells-param-1;

chpdf=chi2inv(0.95,v)

### KS Test

a=13.9761; %from dfittool

b=2.34506; %from dfittol

cdfvals=wblcdf(xvals,a,b);

kstest(datasort,[xvals' cdfvals'])

### Anderson-Darling Statistic

• This is a more sophisticated and complex version of the K-S,

• It is more powerful because

• The f(x) weights the observed distances by the probability that the value will be generated at that x value.

• This helps focus the difference measure more equitably.

• The vertical distances are integrated over ALL values of x rather than just looking at the maximum.

• This makes maximum use of the observed data

### Root Mean Square Error

• RMS error is available as a test statistic in BESTFIT for expert data that is sampled using percentiles.

• Measures the area between the distribution fit and the data.

• The smaller the better. Does not provide fine distinction.

### Back to Example - Tests

• All pass KS test except exponential and rayleigh at 5% significance level

• Same holds at 2% level

Uncertainty Analysis for Engineers

### Script

Rayleigh and Exponential fail test (kx=kr=1)

vel=[78.2 75.8 81.8 85.2 75.9 78.2 72.3 69.3 76.1 74.8 ...

78.4 76.4 72.9 76.0 79.3 77.4 77.1 80.8 70.6 73.5];

sortv=sort(vel);

subplot(2,3,1),probplot('ev',vel)

subplot(2,3,2),probplot('norm',vel)

subplot(2,3,3),probplot('lognorm',vel)

subplot(2,3,4),probplot('weibull',vel)

subplot(2,3,5),probplot('exponential',vel)

subplot(2,3,6),probplot('rayleigh',vel)

x=68:0.1:86;

alpha=0.02;

zev=evfit(vel);

sev=evcdf(x,zev(1),zev(2));

kev=kstest(sortv,[x' sev'],alpha)

[an bn]=normfit(vel);

sn=normcdf(x,an,bn);

kn=kstest(sortv,[x' sn'],alpha)

zx=expfit(vel);

sx=expcdf(x,zx);

kx=kstest(sortv,[x' sx'],alpha)

zl=lognfit(vel);

sl=logncdf(x,zl(1),zl(2));

kl=kstest(sortv,[x' sl'],alpha)

zw=wblfit(vel);

sw=wblcdf(x,zw(1),zw(2));

kw=kstest(sortv,[x' sw'],alpha)

zr=raylfit(vel);

sr=raylcdf(x,zr);

kr=kstest(sortv,[x' sr'],alpha)

figure,plot(x,sev,'o',x,sn,'r',x,sx,'b',x,sl,'+',x,sw,'m',x,sr,'c',...

'LineWidth',2)

legend('EV','Norm','Exp','LogN','Weibull','Rayleigh')

hold on

cdfplot(vel)

Uncertainty Analysis for Engineers

### A Second Example

• A new controller was installed on 96 diesel locomotives

• The mileage at failure for each was recorded

• 37 failed at less than 135,000 miles

• All we know about the others is that each lasted beyond 135k miles

• Thus, we have to “censor” the data

• We assume we have 96 data points, but only plot 37

• This is important when we compute CDF

• Goal is 80,000 mile warranty

Uncertainty Analysis for Engineers

### Censoring in Matlab

• Tell Matlab which data points are censored and how many of each there are

• We’ll use a lognormal plot in the example because failure mechanism indicates this is appropriate

Uncertainty Analysis for Engineers

### Uncensored “Normal” probability plot

Uncertainty Analysis for Engineers

### Un-Censored “Lognormal” Plot

Uncertainty Analysis for Engineers

### Censored “Lognormal” Plot

Uncertainty Analysis for Engineers

### Conclusions

• The lognormal distribution looks like a good fit

• Probability of failure is approximately 15%

Uncertainty Analysis for Engineers

### Script

d=[22.2 37.5 46.0 48.5 51.5 ...

53 54.5 57.5 66.5 68 ...

69.5 76.5 77 78.5 80 ...

81.5 82 83 84 91.5 93.5 ...

102.5 107 108.5 112.5 113.7 ...

116 117 118.5 119 120 ...

122.5 123 127.5 131.0 132.5 134]

data=[d'; 135];

bools=zeros(size(data));

bools(end)=1;

freq=bools*59+1;

probplot('logn',data)

figure, probplot('logn',data,bools,freq)

Uncertainty Analysis for Engineers