# Probability and distributions

#### Presentation Transcript

1. Probability and distributions

2. Outline: Random sampling; Probability calculations and combinatorics; Discrete distributions; Continuous distributions; The central limit theorem; Exploratory data analysis

3. Random sampling – The sample function

   Simulating a random experiment with the sample function. Choose 5 numbers at random from the set {1, 2, …, 40}:

       > sample(1:40,5)
       [1] 7 24 35 5 1

   Note: by default, sample draws without replacement; pass replace = TRUE to sample with replacement.

   Simulating 10 coin tosses:

       > sample(c("H","T"),10,replace=TRUE)
       [1] "T" "T" "T" "T" "H" "H" "H" "T" "H" "H"

   Simulating with unequal probabilities:

       > sample(c("H","T"),10,replace=TRUE,prob = c(0.9,0.1))
       [1] "H" "H" "H" "H" "H" "H" "H" "H" "H" "H"
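One way to see what the prob argument does is to draw many tosses and tabulate the proportions. The sketch below is only illustrative: the seed (123) and the number of tosses (10000) are assumptions, not values from the slides.

```r
# Assumed seed and sample size, chosen only for illustration
set.seed(123)
n.tosses <- 10000

# Biased coin: P(H) = 0.9, P(T) = 0.1
tosses <- sample(c("H", "T"), n.tosses, replace = TRUE, prob = c(0.9, 0.1))

# Empirical proportions should be close to 0.9 and 0.1
prop <- table(tosses) / n.tosses
print(prop)

# Without replacement, no value can appear twice
draw <- sample(1:40, 5)
print(anyDuplicated(draw) == 0)   # TRUE
```

With only 10 tosses, as on the slide, a run of ten heads is not surprising; the long-run proportions are what prob controls.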

4. Random sampling – The sample function

       # Roll a die
       > sample(1:6,10,replace=TRUE)
       [1] 5 1 5 3 3 4 5 4 2 1 # no sixes!
       ## pick 6 of 54 (a lottery)
       > sample(1:54,6) # no replacement
       [1] 6 39 23 35 25 26
       ## pick a card. (Fancy! Uses paste, rep)
       > cards = paste(rep(c("A",2:10,"J","Q","K"),4),c("H","D","S","C"))
       > sample(cards,5) # a pair of jacks, no replacement
       [1] "J D" "5 C" "A S" "2 D" "J H"
       ## roll 2 dice. Even fancier
       > dice = as.vector(outer(1:6,1:6,paste))
       > sample(dice,5,replace=TRUE) # replace when rolling dice
       [1] "1 1" "4 1" "6 3" "4 4" "2 6"

5. Probability calculations and combinatorics

   Factorials and permutations: prod(x), factorial(x). Combinations: choose(n,x).

   Example: a printed circuit board has eight different locations in which a component can be loaded. If four different components are to be placed on the board, how many different designs are possible?

       > factorial(8)/factorial(8-4)
       [1] 1680

   If five identical components are to be placed on the board, how many different designs are possible?

       > choose(8,5)
       [1] 56
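The two counts in the circuit board example can be cross-checked in other ways. The sketch below recomputes the ordered count as "choose the locations, then arrange the components", and enumerates the unordered designs with combn; the variable names are illustrative only.

```r
# Ordered placement of 4 different components in 8 locations:
# choose the 4 locations, then arrange the 4 components among them
n.perm <- choose(8, 4) * factorial(4)
print(n.perm)                      # 1680

# Same count via the product 8 * 7 * 6 * 5
print(prod(8:5))                   # 1680

# Five identical components: enumerate all 5-location subsets with combn()
subsets <- combn(8, 5)             # one subset per column
print(ncol(subsets))               # 56
```

Enumeration with combn is only practical for small n, but it makes the meaning of choose(n,x) concrete.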

6. Discrete distributions

   For a discrete random variable X with possible values $x_1, x_2, \ldots, x_n$, a probability mass function is a function $f$ such that

   - $f(x_i) \ge 0$
   - $\sum_{i=1}^{n} f(x_i) = 1$
   - $f(x_i) = P(X = x_i)$

   The cumulative distribution function of a discrete random variable X, denoted as F(x), is

   $$F(x) = P(X \le x) = \sum_{x_i \le x} f(x_i)$$

   For a discrete random variable X, F(x) satisfies the following properties:

   - $0 \le F(x) \le 1$
   - If $x < y$, then $F(x) \le F(y)$
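These properties can be checked numerically for any built-in discrete distribution. The sketch below uses a binomial with n = 10 and p = 0.4 purely as an assumed example, building the cumulative distribution function as a running sum of the mass function.

```r
# Assumed example: X ~ Binomial(n = 10, p = 0.4)
x <- 0:10
pmf <- dbinom(x, 10, 0.4)    # probability mass function f(x)
cdf <- cumsum(pmf)           # cumulative distribution function F(x)

print(all(pmf >= 0))                                # masses are non-negative
print(isTRUE(all.equal(sum(pmf), 1)))               # masses sum to 1
print(isTRUE(all.equal(cdf, pbinom(x, 10, 0.4))))   # running sum matches pbinom
print(all(diff(cdf) >= 0))                          # F is non-decreasing
```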

7. Discrete distributions: Bernoulli, Binomial, Poisson, Geometric, Hypergeometric, Multinomial, Negative binomial

8. Discrete distributions – R functions

   R has a wide range of built-in probability distributions, for each of which four functions are available:

   - Probability mass (or density) function: d prefix.
   - Cumulative probability function: p prefix.
   - Quantiles of the distribution: q prefix.
   - Random numbers generated from the distribution: r prefix.

   Each letter (d, p, q or r) is prefixed to R's name for the distribution, for example dbinom, pbinom, qbinom and rbinom for the binomial.
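As a small illustration of how the four prefixes relate, the sketch below applies them to one distribution. The Poisson with mean 2 and the seed are assumptions made for the example, not values from the slides.

```r
# Assumed example distribution: Poisson with mean 2 (R name: pois)
lambda <- 2

d <- dpois(3, lambda)          # P(X = 3)
p <- ppois(3, lambda)          # P(X <= 3)
q <- qpois(0.5, lambda)        # smallest x with P(X <= x) >= 0.5
set.seed(1)                    # arbitrary seed
r <- rpois(5, lambda)          # five random draws

# The p function is the running sum of the d function
print(isTRUE(all.equal(p, sum(dpois(0:3, lambda)))))   # TRUE
print(q)                                               # 2
```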

9. Discrete distributions – summary R functions

10. Discrete distributions – Binomial distribution

    Let X have a binomial distribution with n = 10 and p = 0.4. Calculate P(X = 5) and P(X ≥ 3), and find the smallest x such that P(X ≤ x) ≥ 0.975.

        > dbinom(5,10,0.4)
        [1] 0.2006581
        > 1-pbinom(2,10,0.4)
        [1] 0.8327102
        > sum(dbinom(3:10,10,0.4))
        [1] 0.8327102
        > qbinom(0.975,10,0.4)
        [1] 7

11. Discrete distributions – Binomial distribution

        > rbinom(25,10,0.4)
        [1] 7 3 4 5 4 4 6 5 5 2 4 5 4 5 3 4 3 2 6 1 6 2 5 1 5
        > barplot(dbinom(0:10,10,0.4))
        > plot(0:10,dbinom(0:10,10,0.4),type='h')
        > points(0:10,dbinom(0:10,10,0.4))

12. Discrete distributions – Poisson distribution

    Simulating 600 values from a Poisson distribution with a mean of 0.90:

        > count <- rpois(600,0.9)
        > table(count)
        count
          0   1   2   3   4   5
        250 204 101  36   5   4
        > hist(count,breaks=-0.5:6,main="")

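13. Continuous distributions

    For a continuous random variable X, a probability density function is a function $f$ such that

    - $f(x) \ge 0$
    - $\int_{-\infty}^{\infty} f(x)\,dx = 1$
    - $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

    The cumulative distribution function of a continuous random variable X, denoted as F(x), is

    $$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for } -\infty < x < \infty$$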
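For a continuous variable, probabilities are areas under the density, which can be checked with numerical integration. The sketch below is a rough illustration: it uses the standard normal for the total area, and borrows the normal height parameters (mean 170, sd 8) from the later example as an assumed test case.

```r
# Total area under the standard normal density
total <- integrate(dnorm, -Inf, Inf)$value

# Assumed example: X ~ Normal(mean = 170, sd = 8)
f <- function(x) dnorm(x, 170, 8)

# P(160 <= X <= 170) as an integral of the density ...
area <- integrate(f, 160, 170)$value
# ... and as a difference of the cumulative distribution function
diff.cdf <- pnorm(170, 170, 8) - pnorm(160, 170, 8)

print(c(total = total, area = area, diff.cdf = diff.cdf))
```

The integral and the pnorm difference should agree up to the numerical tolerance of integrate.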

14. Continuous distributions: Normal, Exponential, Uniform, Gamma, Student’s t, Fisher’s F, Chi-squared

15. Continuous distributions – R functions

16. Continuous distributions – summary R functions

17. Continuous distributions – Normal distribution

    Suppose the heights of people in a city have a normal distribution with mean 170 cm and standard deviation 8 cm. What is the probability that a randomly selected individual will be:

    - shorter than 165 cm
    - taller than 185 cm
    - between 160 and 170 cm

    Also generate 100 values of height from this distribution.

18. Continuous distributions – Normal distribution

        > mu = 170; sd = 8
        > pnorm(165,mu,sd)
        [1] 0.2659855
        > 1 - pnorm(185,mu,sd)
        [1] 0.03039636
        > pnorm(170,mu,sd) - pnorm(160,mu,sd)
        [1] 0.3943502
        > height <- rnorm(100,mu,sd)
        > hist(height)
        > curve(dnorm(x,mu,sd),140,200)

19. Continuous distributions – Standard normal distribution

    From a normal distribution $X \sim N(\mu, \sigma^2)$, we can transform X to a standard normal variable Z by

    $$Z = \frac{X - \mu}{\sigma}$$

    Z has mean 0 and variance 1. So, for the height example, $P(X < 165) = P(Z < (165 - 170)/8) = P(Z < -0.625)$:

        > pnorm(-0.625)
        [1] 0.2659855
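The equivalence between the two scales can be checked directly. The sketch below reuses the height parameters (mean 170, standard deviation 8) and the cutoff 165 from the earlier slides; the variable names are illustrative.

```r
mu <- 170; sigma <- 8          # parameters from the height example

z <- (165 - mu) / sigma        # standardized value
print(z)                       # -0.625

# P(X < 165) computed on the original and on the standard scale
p.x <- pnorm(165, mu, sigma)
p.z <- pnorm(z)
print(isTRUE(all.equal(p.x, p.z)))   # TRUE
```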

20. Continuous distributions – Standard normal distribution

        > rnorm(1,100,16) #an IQ score
        [1] 115.2561
        > x <- rnorm(100)
        > hist(x,prob=TRUE,col='gray',main="normal mu=0,sigma=1")
        > curve(dnorm,add=TRUE,col='red')

21. Continuous distributions – Exponential distribution

        > x=rexp(100,1/2500)
        > hist(x,probability=TRUE,col=gray(.9),main="exponential mean=2500")
        > curve(dexp(x,1/2500),add=TRUE)

22. The central limit theorem

    If $X_1, \ldots, X_n$ are drawn independently from a population where $\mu$ and $\sigma$ are known, then the standardized average

    $$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

    is asymptotically normal with mean 0 and variance 1. That is, if n is large enough, the average $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. How can we check this? Simulation is an excellent way.
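A simulation along these lines might look like the sketch below. Everything in the setup is an assumption made for illustration: an exponential population with mean 2 (so its standard deviation is also 2), samples of size 50, 2000 repetitions, and an arbitrary seed.

```r
# Assumed setup (not from the slides): exponential population with mean 2,
# samples of size n = 50, repeated 2000 times
set.seed(42)
mu <- 2; sigma <- 2            # an exponential has sd equal to its mean
n <- 50; reps <- 2000

# Standardized sample means: (xbar - mu) / (sigma / sqrt(n))
z <- replicate(reps, (mean(rexp(n, rate = 1 / mu)) - mu) / (sigma / sqrt(n)))

# Mean near 0 and sd near 1 if the normal approximation is working
print(c(mean = mean(z), sd = sd(z)))

# Visual check against the standard normal density
hist(z, probability = TRUE, main = "standardized means, n = 50")
curve(dnorm(x), add = TRUE)
```

Even though the exponential population is strongly skewed, the histogram of standardized means should already be close to the bell curve at this sample size.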

23. The central limit theorem – Binomial dist.

    If $S_n$ has a binomial distribution with parameters n and p, then

    $$\frac{S_n - np}{\sqrt{np(1-p)}}$$

    is approximately normal (0,1).

        > n=10;p=.25;S= rbinom(1,n,p)
        > (S - n*p)/sqrt(n*p*(1-p))
        [1] 0.3651484
        > n = 10;p = .25;S = rbinom(100,n,p)
        > X = (S - n*p)/sqrt(n*p*(1-p)) # has 100 random numbers
        > hist(X,prob=TRUE)
        > curve(dnorm,add=TRUE)

24. The central limit theorem – Binomial dist.

        # breaks start at -0.5 and run to 20.5 so that every value 0..20 falls in a bin
        > hist(rbinom(10000,20,0.5),xlim=c(0,20),probability=TRUE,breaks=seq(-0.5,20.5,1))
        > lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),10,sqrt(5)))
        # Non symmetric binomial
        > hist(rbinom(10000,20,0.3),xlim=c(0,20),probability=TRUE,breaks=seq(-0.5,20.5,1))
        > lines(seq(0,20,0.1),dnorm(seq(0,20,0.1),6,sqrt(4.2)))

25. Normal plots

    A better plot than the histogram for deciding whether random data is approximately normal is the so-called "normal probability" plot. The basic idea is to graph the quantiles of your data against the corresponding quantiles of the normal distribution. If the graph looks like a straight line, the data is approximately normal. Curvature away from the line tells you that the distribution has short or long tails. The reference line is drawn through the points formed by the first and third quartiles. The relevant functions are qqnorm, qqplot and qqline.
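To make "quantiles of your data against quantiles of the normal" concrete, the sketch below rebuilds the coordinates that qqnorm uses, with ppoints supplying the plotting positions. The seed and the simulated data are assumptions for illustration.

```r
set.seed(7)                       # arbitrary seed
x <- rnorm(100)

# Theoretical quantiles: normal quantiles at the plotting positions
theo <- qnorm(ppoints(length(x)))

# qqnorm() pairs the sorted data with these theoretical quantiles
qq <- qqnorm(x, plot.it = FALSE)
print(isTRUE(all.equal(sort(qq$x), theo)))      # TRUE
print(isTRUE(all.equal(sort(qq$y), sort(x))))   # TRUE

# For roughly normal data the points lie near a straight line,
# so their correlation is close to 1
print(cor(theo, sort(x)))
```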

26. Normal plots

        > x <- rnorm(100)
        > y <- rnorm(100,10,15)
        > z <- rexp(100,1/10)
        > t <- runif(100,0,1)
        > par(mfrow=c(2,2))
        > qqnorm(x,main='normal(0,1)');qqline(x)
        > qqnorm(y,main='normal(10,15)');qqline(y)
        > qqnorm(z,main='exponential mu=10');qqline(z)
        > qqnorm(t,main='unif(0,1)');qqline(t)

27. Exploratory Data Analysis

28. Example – Home data

    The dataset homedata contains assessed values for Maplewood, NJ for the year 1970 and the year 2000. What is the shape of the distribution?

        data(homedata) # from simple package
        attach(homedata)
        hist(y1970);hist(y2000) # make two histograms
        detach(homedata) # clean up

    At first glance, the 1970 data looks more normal and the year 2000 data has a heavier tail. Neither looks particularly normal: both are heavy-tailed and skewed. Any analysis will want to consider the medians or a transformation.

29. Example – CEO salaries

    The data set exec.pay gives the total direct compensation for CEOs at 200 large publicly traded companies in the U.S. for the year 2000 (in units of \$100,000). What can we say about this distribution, besides that it looks like good work if you can get it? Using simple.eda:

        data(exec.pay) # or read in from file
        simple.eda(exec.pay)

    From the figures, we see a heavily skewed distribution, as we might expect. A transformation is called for, so let's try the logarithmic transformation (base 10). Since some values are 0 (these CEOs are directly compensated less than \$100,000, or perhaps were forced to return all profits in a plea arrangement to stay out of jail), we exclude them.

30. Example – CEO salaries

        log.exec.pay = log(exec.pay[exec.pay >0])/log(10) # 0 is a problem
        simple.eda(log.exec.pay)

    After the transformation, the data is very symmetric, which gives good insight into the actual distribution. (It is almost log-normal, meaning that after taking a logarithm it looks normal.) Any analysis will want to use resistant measures such as the median, or apply a transformation first.
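The effect of the log transformation can be tried without the exec.pay data. The sketch below uses simulated log-normal values as a stand-in, which is an assumption: the parameters, the seed and the two appended zeros are made up for illustration and are not the real compensation figures.

```r
# Simulated stand-in for exec.pay (assumption: log-normal data, with a few
# zeros appended to mimic the zero-compensation entries)
set.seed(2000)
pay <- c(rlnorm(200, meanlog = 3, sdlog = 1), 0, 0)

pos <- pay[pay > 0]            # drop zeros, since log(0) is -Inf
log.pay <- log10(pos)          # same as log(pos) / log(10)

# Mean well above median signals right skew; after the log they nearly agree
print(c(mean = mean(pos), median = median(pos)))
print(c(mean = mean(log.pay), median = median(log.pay)))
```

Here log10(x) is used as a shorthand for the log(x)/log(10) form on the slide; both give base-10 logarithms.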

31. Example – Taxi time at EWR

    The dataset ewr contains taxi in and taxi out times at Newark airport (EWR). Let's see what the trends are.

        #Taxi time at EWR
        data(ewr)
        names(ewr) # only 3-10 are raw data
        airnames = names(ewr) # store them for later
        ewr.actual = ewr[,3:10] # get the important columns
        boxplot(ewr.actual)

32. Example – Taxi time at EWR

    From the figures, we see that all of them look skewed. Let's see if there is a difference between taxi in and taxi out times.

        par(mfrow=c(2,4)) # 2 rows 4 columns
        attach(ewr)
        for(i in 3:10) boxplot(ewr[,i] ~ as.factor(inorout),main=airnames[i])
        detach(ewr)
        par(mfrow=c(1,1)) # return graphics as is (or close window)

    Notice the taxi in times are more or less symmetric with little variation (except for HP, America West, with a 10-minute-plus average). The taxi out times have a heavy tail. At EWR, when the airport is busy, the planes can really back up, and a 30-minute wait is not unusual. The times for Northwest (NW) seem to be lower. We can compare this using statistical tests. Since the distributions are skewed, we may wish to compare the medians. (In general, be careful when applying statistical tests to summarized data.)

33. Example – Symmetric or skewed, long or short?

    For unimodal data, there are 6 basic possibilities: the data is either symmetric or skewed, and the tails are short, regular or long. Here are some examples with random data from known distributions.

        ## symmetric: short, regular then long
        X=runif(100);boxplot(X,horizontal=TRUE,bty="n")
        X=rnorm(100);boxplot(X,horizontal=TRUE,bty="n")
        X=rt(100,2);boxplot(X,horizontal=TRUE,bty="n")
        ## skewed: short, regular then long
        # triangle distribution
        X=sample(1:6,100,prob=7-(1:6),replace=TRUE);boxplot(X,horizontal=TRUE,bty="n")
        X=abs(rnorm(200));boxplot(X,horizontal=TRUE,bty="n")
        X=rexp(200);boxplot(X,horizontal=TRUE,bty="n")