
Understanding Variation and Uncertainty in Data Analysis

Learn how to transform knowledge of data variation into statements about the uncertainty surrounding model parameters, using confidence intervals and statistical techniques.

gwinkler


Presentation Transcript


  1. Chapter 3: Uncertainty. "Variation arises in data generated by a model"; "how to transform knowledge of this variation into statements about the uncertainty surrounding the model parameters." Confidence intervals via the frequentist / repeated-sampling / classical approach.

  2. Parameter θ. T(Y1,...,Yn) is an estimate of θ. Var(T) = τ²/n; nV → τ² in probability as n → ∞; V^{1/2} is the standard error (estimated s.d.) for T. Average: Ȳ has mean μ and variance σ²/n; S² = Σ(Yi − Ȳ)²/(n−1) estimates σ²; V^{1/2} = n^{-1/2}S is the s.e. for Ȳ.
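
A minimal sketch of the slide's standard-error calculation (the helper name is ours, not the book's):

```python
import math

def standard_error(sample):
    """V^{1/2} = n^{-1/2} S, the estimated s.d. of the sample mean."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)  # S^2 estimates sigma^2
    return math.sqrt(s2 / n)

print(standard_error([1.0, 2.0, 3.0, 4.0]))  # sqrt((5/3)/4) ≈ 0.6455
```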

  3. Pivot: a function of the data and the parameter whose distribution is known; the distribution of Z(θ0) does not depend on θ0. Exponential: Pr(Yj/θ0 ≤ u) = 1 − exp(−u), u > 0; Z(θ0) = Σ Yj/θ0 is gamma with parameters 1 and n, a sum. Approximate: Z(θ0) = (T − θ0)/V^{1/2} → N(0,1) in distribution; Pr(T − V^{1/2}z_{1−α} ≤ θ0 ≤ T − V^{1/2}z_α) ≈ 1 − 2α, where Φ(z_α) = α. Approximate (1 − 2α)·100% CI for θ0: an interval estimate.

  4. Birth data: approximate 95% CI for θ0 based on the normal pivot Z(θ0) = n^{1/2}(Ȳ − θ0)/s. Day 1 data: n = 16, ȳ = 8.77, s² = 18.46, s = 4.30; α = .025, z_{.025} = −1.96. Then Pr(T − V^{1/2}z_{1−α} ≤ θ0 ≤ T − V^{1/2}z_α) ≈ 1 − 2α gives (6.66, 10.87) hours of labor.
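
The slide's interval can be reproduced from the quoted summaries; a minimal sketch with z_{.975} taken as 1.96, as on the slide:

```python
import math

n, ybar, s = 16, 8.77, 4.30        # day 1 birth data summaries
half = 1.96 * s / math.sqrt(n)     # z_{.975} times the s.e. of the mean
ci = (ybar - half, ybar + half)
print(ci)  # about (6.66, 10.88); the slide rounds the upper end to 10.87
```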

  5. Binomial distribution: parameters m, π; observation R. Estimate π̂ = R/m; var(π̂) = π(1 − π)/m; s.e. {π̂(1 − π̂)/m}^{1/2}. Pivotal quantity (π̂ − π)/{π̂(1 − π̂)/m}^{1/2} ≈ N(0,1). Suppose m = 1000 and π̂ = .35: approximate 95% CI .35 ± 1.96 × .015. Margin of error.
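
A quick sketch of the margin-of-error arithmetic with the slide's numbers:

```python
import math

m, pihat = 1000, 0.35
se = math.sqrt(pihat * (1 - pihat) / m)   # estimated s.e., about 0.015
moe = 1.96 * se                           # margin of error
ci = (pihat - moe, pihat + moe)
print(se, ci)
```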

  6. Delta method (Gauss): method of linearization. Tn is available, but we are interested in h(Tn); an estimate of θ is available, but we are interested in h(θ). (Tn − θ)/var(Tn)^{1/2} → Z in distribution; n·var(Tn) → τ² in probability; so Tn ≈ θ + n^{-1/2}τZn. (Continued.)

  7. For h(·) smooth, h(y) ≈ h(x) + (y − x)h'(x) for y near x. Then h(Tn) = h(θ + Tn − θ) ≈ h(θ) + (Tn − θ)h'(θ) = h(θ) + n^{-1/2}τZn h'(θ), so h(Tn) ≈ N(h(θ), h'(θ)² var(Tn)).

  8. Poisson, Y: mean λ, variance λ. Many techniques "expect" constant variance, normality, a linear model. Seek h(·) such that var(h(Y)) ≈ 1: h'(λ)²λ = 1, so h'(λ) = 1/λ^{1/2} and h(λ) = 2λ^{1/2}. Work with Y^{1/2} ≈ N(λ^{1/2}, 1/4). Approximate 95% CI for λ^{1/2}: (Y^{1/2} − z_{.975}/2, Y^{1/2} − z_{.025}/2). Square up: y = 16 births on day 1; CI for λ^{1/2} is 4 ± 1.96/2; squaring up gives (9.1, 24.8).
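
The "square up" step, sketched with the slide's numbers:

```python
import math

y = 16                                               # births on day 1
z = 1.96                                             # z_{.975}
lo, hi = math.sqrt(y) - z / 2, math.sqrt(y) + z / 2  # CI for lambda^{1/2}
ci = (lo ** 2, hi ** 2)                              # square up to a CI for lambda
print(ci)  # about (9.12, 24.80); the slide quotes (9.1, 24.8)
```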

  9. Tests. Null hypothesis H0: suppose average labor time θ0 = 6 hours. Alternative HA: θ > 6 hours. Oxford: ȳ = 8.77 hours, n = 16. Is this extreme? Is the average time longer in Oxford? Pivot t = (Ȳ − θ0)/(s/n^{1/2}); tobs = 2.58. Pr_{θ0}(T ≥ tobs) ≈ 1 − Φ(2.58) = .005: the P-value; choice of significance level.
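
A sketch of the test statistic and its normal-approximation P-value, using the identity 1 − Φ(x) = erfc(x/√2)/2:

```python
import math

n, ybar, s, theta0 = 16, 8.77, 4.30, 6.0
tobs = (ybar - theta0) / (s / math.sqrt(n))   # observed pivot, about 2.58
pval = 0.5 * math.erfc(tobs / math.sqrt(2))   # 1 - Phi(tobs), about .005
print(tobs, pval)
```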

  10. Normal model: N(μ, σ²), mean μ, variance σ². Standard normal Z = (Y − μ)/σ ~ N(0,1), density φ(z), cdf Φ(z); Y = μ + σZ.

  11. Chi-squared distribution: Z1,...,Zν ~ IN(0,1), W = Z1² + ... + Zν²; ν degrees of freedom; additive. In R: qchisq(); qchisq(.975, 14) = 26.119. A (1 − 2α) CI for σ²: ((n−1)S²/c_{n−1}(1−α), (n−1)S²/c_{n−1}(α)). Cross-fertilized maize: n1 = 15, s1² = 837.3, α = .025; (14 × 837.3/26.119, 14 × 837.3/5.629) = (449, 2082), in eighths of inches squared.
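
A sketch of the variance interval; the Python standard library has no chi-squared quantile function, so the two qchisq values quoted on the slide are reused directly:

```python
n, s2 = 15, 837.3            # maize: sample size and sample variance
lo_q, hi_q = 5.629, 26.119   # qchisq(.025, 14), qchisq(.975, 14), from the slide
ci = ((n - 1) * s2 / hi_q, (n - 1) * s2 / lo_q)
print(ci)  # about (449, 2082), in eighths of inches squared
```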

  12. [Figure: left, chi-squared densities; right, Student's t densities.]

  13. Student's t distribution. Maize data, differences: n = 15, d̄ = 20.93, s² = 1424.6; 95% CI 20.93 ± (1424.6/15)^{1/2} × 2.14, i.e. (0.03, 41.84). Is H0: μ = 0 plausible? No: 0 is not in the 95% confidence interval.
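
The same interval, sketched with the t quantile t_{14}(.975) = 2.145 (the slide rounds it to 2.14):

```python
import math

n, dbar, s2 = 15, 20.93, 1424.6   # maize differences: size, mean, variance
t975 = 2.145                      # t_{14}(.975)
half = t975 * math.sqrt(s2 / n)
ci = (dbar - half, dbar + half)
print(ci)  # about (0.03, 41.83); zero falls outside the interval
```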

  14. F distribution: F = (W/ν)/(W'/ν'), with W, W' independent chi-squared; F ~ F_{ν,ν'}; F_{ν,∞} ~ χ²_ν/ν; F_{1,ν} ~ T². Maize: n1 = n2 = 15, s1² = 837.3, s2² = 269.4; variances σ1², σ2². CI for the ratio ψ: (F_{n1−1,n2−1}(α) s2²/s1², F_{n1−1,n2−1}(1−α) s2²/s1²); with α = .025 this gives (0.108, 0.958). H0: ψ = 1.
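
A sketch of the ratio interval; the quantile value below is an assumption (qf(.975, 14, 14) ≈ 2.979), chosen to be consistent with the interval quoted on the slide:

```python
s1sq, s2sq = 837.3, 269.4
f975 = 2.979            # assumed qf(.975, 14, 14); F(.025) is its reciprocal
ratio = s2sq / s1sq     # observed s2^2 / s1^2, about 0.322
ci = (ratio / f975, ratio * f975)
print(ci)  # about (0.108, 0.958); psi = 1 lies just outside
```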

  15. Normal random sample: Ȳ = μ + n^{-1/2}σZ, S² = (n−1)^{-1}σ²W, with Z ~ N(0,1) and W ~ χ²_{n−1} independently. T = Z/{W/(n−1)}^{1/2} is Student's t with n−1 df. T is a pivotal quantity for μ; 100(1 − 2α)% CI: Ȳ ± n^{-1/2}s t_{n−1}(1−α).

  16. Bivariate data

  17. Bivariate distribution: cov(Y1, Y2) = E[(Y1 − μ1)(Y2 − μ2)] = ω12 = cov(Y2, Y1). Collect into a square array: cov(Y, Y) = the covariance matrix Ω, 2 by 2, with variances ω11 and ω22 on the diagonal and covariances ω12 and ω21 off the diagonal. Correlation ρ = ω12/(ω11ω22)^{1/2}.

  18. [Figure: bivariate samples with correlations −0.7, 0, 0.7.]

  19. [Figure: yahoo.com shares.]

  20. Multivariate normal: p-variate Y = (Y1,...,Yp)^T, built from p linear combinations of IN(0,1) variates; linear combinations of normals are normal. If it exists, the density function is f(y; μ, Ω), with E(Y) = μ and cov(Y, Y) = Ω; these are vectors and matrices. Curves of constant density are ellipses.

  21. Properties: marginals are also (multivariate) normal; conditionals are (multivariate) normal. Bivariate case: E(Y1) = E(Y2) = 0, var(Y1) = var(Y2) = 1, cov(Y1, Y2) = ρ; Y1 and Y2 are N(0,1). Conditional distribution: Y1 given Y2 is N(ρY2, 1 − ρ²). If Y1 and Y2 are uncorrelated, they are independent.

  22. If Y is Np(μ, Ω) then a + B^T Y ~ Nq(a + B^T μ, B^T Ω B). A surprise: (Y − μ)^T Ω^{-1}(Y − μ) ~ χ²_p. Another surprise: Ȳ and S² are statistically independent.

  23. Proof: S² is based on the Yi − Ȳ. These are uncorrelated with Ȳ, and all are normal; hence the Yi − Ȳ are independent of Ȳ. Use: suppose we have samples of size ni from IN(μi, σi²). Then Ȳ1 − Ȳ2 is normal with mean μ1 − μ2 and variance σ1²/n1 + σ2²/n2.

  24. Pooled estimate of σ²: S² = {(n1 − 1)S1² + (n2 − 1)S2²}/(n1 + n2 − 2); νS²/σ² ~ χ²_ν, independently of Ȳ1 − Ȳ2. Confidence interval for μ1 − μ2: (Ȳ1 − Ȳ2) ± {S²(n1^{-1} + n2^{-1})}^{1/2} t_ν(1 − α), with ν = n1 + n2 − 2. Maize: 20.9 ± 553.3^{1/2}(1/15 + 1/15)^{1/2} × 2.05; 95% CI (3.34, 38.53). Doesn't include 0.
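
A sketch of the pooled two-sample interval, matching the slide up to rounding (t_{28}(.975) = 2.048, which the slide rounds to 2.05):

```python
import math

n1, n2, dbar = 15, 15, 20.9
s1sq, s2sq = 837.3, 269.4
s2p = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # pooled S^2
t975 = 2.048                                               # t_{28}(.975)
half = t975 * math.sqrt(s2p * (1 / n1 + 1 / n2))
ci = (dbar - half, dbar + half)
print(s2p, ci)  # pooled variance 553.35; CI roughly (3.3, 38.5), excluding 0
```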

  25. Simulation: computer generation of artificial data. Uses: how much variability to expect; adequacy of an approximation; sensitivity of conclusions; to provide insight. How variable are normal probability plots? What does bivariate normal data look like? Based on pseudo-random numbers, e.g. approximately IN(0,1).

  26. Tiger Woods 20%, Lance Armstrong 30%, Serena Williams 50%: pictures in cereal boxes with these percentages. How many boxes do you expect to have to buy to get all 3? X = 3, 4, 5, ...

  27. Assume the pictures are distributed randomly: Pr{X = Tig} = .2, Pr{X = Lan} = .3, Pr{X = Ser} = .5. Simulate 10000 times; counts of the number of boxes needed:

  boxes:  3     4     5     6     7    8    9    10   11   12   13   14
  count:  1806  1762  1445  1191  863  669  517  417  322  206  179  125

  boxes:  15   16  17  18  19  20  21  22  23  24  25  26  27  28  29
  count:  119  76  65  52  43  29  23  19  16  11  8   5   5   8   5

  boxes:  30  31  32  33  34  35  36  37  38
  count:  4   3   2   2   1   1   0   0   1

  summary(): Min. 3.00, 1st Qu. 4.00, Median 5.00, Mean 6.64, 3rd Qu. 8.00, Max. 38.00
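
A sketch of the simulation (function name and seed are ours); with 10000 replicates the mean should land near the slide's 6.64:

```python
import random

def boxes_to_complete(rng, probs=(0.2, 0.3, 0.5)):
    """Buy boxes until all three pictures have appeared; return the count."""
    seen, bought = set(), 0
    while len(seen) < len(probs):
        seen.add(rng.choices(range(len(probs)), weights=probs)[0])
        bought += 1
    return bought

rng = random.Random(2024)                    # fixed seed for reproducibility
draws = [boxes_to_complete(rng) for _ in range(10000)]
print(min(draws), sum(draws) / len(draws))   # min should be 3; mean near 6.64
```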

  28. Linear congruential generator: X_{j+1} = (aX_j + c) mod M, U_j = X_j/M; M = 2^48, a = 5^17, c = 1. Study by simulation!
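
A minimal sketch of the slide's generator with its stated constants:

```python
# M = 2^48, a = 5^17, c = 1, as on the slide
M, A, C = 2 ** 48, 5 ** 17, 1

def lcg(seed, n):
    """Return n pseudo-uniform variates U_j = X_j / M."""
    x, us = seed, []
    for _ in range(n):
        x = (A * x + C) % M
        us.append(x / M)
    return us

us = lcg(1, 5)
print(us[0])  # (5^17 + 1) / 2^48
```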

  29. Other distributions. Continuous cdf F with inverse F^{-1}: Y = F^{-1}(U) ~ F. N(0,1): Z = Φ^{-1}(U), then Y = μ + σZ. Exponential: −log(1 − U)/λ. In R: qnorm, qgamma, qchisq, qt, qf. Discrete: lay out segments of lengths pi along [0,1].
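
The inversion recipe for the exponential, sketched end to end (function name is ours):

```python
import math
import random

def exp_inverse_cdf(u, lam):
    """Inversion method: F^{-1}(u) = -log(1 - u) / lambda."""
    return -math.log(1 - u) / lam

rng = random.Random(0)
sample = [exp_inverse_cdf(rng.random(), 2.0) for _ in range(10000)]
print(sum(sample) / len(sample))  # should be near the exponential mean 1/2
```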

  30. Birth data. Poisson arrivals, λ = 12.9/day: N ~ λ^y e^{−λ}/y!, y = 0, 1, 2, 3, ... (2.6). Arrival times uniform during the day: V1,...,VN with density 1/24, 0 < v < 24. Women remain for a gamma time, shape κ = 3.15, mean μ = 7.93 hours: G1,...,GN with density λ^κ y^{κ−1} exp{−λy}/Γ(κ), y > 0, λ = κ/μ (2.7). Departure times V1 + G1, ..., VN + GN. Record how many women are present at each arrival/departure.
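
A sketch of one simulated day under these assumptions (function names and seed are ours; the Poisson sampler uses Knuth's method since the standard library has none, and the gamma scale is mean/shape = μ/κ):

```python
import math
import random

def rpoisson(rng, lam):
    """Poisson sampler via Knuth's product-of-uniforms method."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod < limit:
            return k
        k += 1

def simulate_day(rng, rate=12.9, kappa=3.15, mean_stay=7.93):
    n = rpoisson(rng, rate)                                  # arrivals, (2.6)
    arrivals = sorted(rng.uniform(0, 24) for _ in range(n))  # uniform over the day
    stays = [rng.gammavariate(kappa, mean_stay / kappa) for _ in range(n)]  # (2.7)
    departures = [a + g for a, g in zip(arrivals, stays)]
    # number of women present just after each arrival
    present = [sum(1 for j in range(i + 1) if departures[j] > arrivals[i])
               for i in range(n)]
    return arrivals, departures, present

arrivals, departures, present = simulate_day(random.Random(3))
```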
