
Statistics and Data Analysis

Statistics and Data Analysis, Part 21


Presentation Transcript


    1. Statistics and Data Analysis. Professor William Greene, Stern School of Business, IOMS Department / Department of Economics

    2. Statistics and Data Analysis

    3. Statistical Inference: Point Estimates and Confidence Intervals. Topics: statistical inference; the estimation concept; sampling distributions; point estimates and the law of large numbers; uncertainty in estimation; interval estimation.

    4. Application: Credit Modeling. A 1992 American Express analysis of: the application process (acceptance or rejection) and cardholder behavior (loan default, average monthly expenditure, general credit usage/behavior). 13,444 applications in November, 1992. We treat this as a population for our discussion. (It's not; this is just November, 1992. There is also December, 1992, 1993, ...)

    5. Modeling Fair Isaac's Acceptance Rate

    6. The Question They Are Really Interested In: Default

    7. The data contained many covariates. Do these help explain the interesting variables?

    8. Variables Typically Used By Credit Scorers

    9. Possibly Useful Data

    10. Sample Statistics. The population has characteristics: mean, variance, median, percentiles. A random sample is a "slice" of the population.

    11. Populations and Samples. Population features of a random variable: Mean = µ = the expected value of the random variable. Standard deviation = σ = the square root of the expected squared deviation of the random variable from the mean. Percentiles, such as the median = the value that divides the population in half (a value such that 50% of the population is below it). Sample statistics that describe the data: Sample mean = x̄ = the average value in the sample. Sample standard deviation = s, which tells us where the sample values will be (using our empirical rule, for example). Sample median, which helps to locate the sample data on a figure that displays the data, such as a histogram.
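The sample statistics on this slide can be computed directly; a minimal Python sketch using the standard library, with expenditure values invented purely for illustration:

```python
import statistics

# Hypothetical sample of monthly expenditures (values invented for illustration)
sample = [210.0, 185.5, 402.3, 150.0, 275.8, 198.4, 330.1, 220.9]

sample_mean = statistics.mean(sample)      # estimates the population mean mu
sample_sd = statistics.stdev(sample)       # estimates sigma (N-1 denominator)
sample_median = statistics.median(sample)  # locates the center of the data

print(sample_mean, sample_sd, sample_median)
```

Note that `statistics.stdev` uses the N-1 (sample) denominator, matching the s defined on the slide.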

    12. The Overriding Principle in Statistical Inference. The characteristics of a random sample will mimic (resemble) those of the population: the mean, median, standard deviation, etc., and the histogram. The resemblance becomes closer as the number of observations in the (random) sample becomes larger. (The law of large numbers.)

    13. Point Estimation. We use sample features to estimate population characteristics. The mean of a sample from the population is an estimate of the mean of the population: x̄ is an estimator of µ. The standard deviation of a sample from the population is an estimator of the standard deviation of the population: s is an estimator of σ.

    14. Point Estimator. A formula, used with the sample data to estimate a characteristic of the population (a parameter). It provides a single value, such as x̄.

    15. Sampling Distribution The random sample is itself random, since each member is random. Statistics computed from random samples will vary as well. For some statistics, the distributions of the elements in the sample will induce a distribution of the statistic.

    16. Estimating Fair Isaac's Acceptance Rate

    17. The Estimator

    18. 100 Samples of 100 Observations
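The "100 samples of 100 observations" experiment can be simulated; a sketch assuming a hypothetical true acceptance rate of 0.78 (an invented value, not the figure from the slides):

```python
import random

random.seed(42)
TRUE_RATE = 0.78  # assumed acceptance rate (hypothetical, not from the slides)

# Draw 100 random samples of 100 applications each; each sample's
# acceptance rate is one draw from the estimator's sampling distribution.
sample_means = []
for _ in range(100):
    accepted = sum(1 for _ in range(100) if random.random() < TRUE_RATE)
    sample_means.append(accepted / 100)

grand_mean = sum(sample_means) / len(sample_means)
print(grand_mean)  # close to TRUE_RATE; the 100 means scatter around it
```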

    19. The Role of the LLN The LLN informs us about what to expect for the deviations from the parameter that we are estimating.

    21. The Mean Is a Good Estimator

    22. What Makes it a Good Estimator? The average of the averages will hit the true mean (on average) The mean is UNBIASED (No moral connotations)

    23. What Does the Law of Large Numbers Say? The sampling variability in the estimator gets smaller as N gets larger. If N gets large enough, we should hit the target exactly: the mean is CONSISTENT.
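Consistency can be demonstrated by simulation: the spread of the sample means shrinks as N grows. A sketch, again assuming a hypothetical acceptance probability of 0.78:

```python
import random

random.seed(0)
TRUE_RATE = 0.78  # assumed acceptance probability (hypothetical)

def spread_of_means(n, reps=200):
    """Standard deviation of the sample mean across reps samples of size n."""
    means = []
    for _ in range(reps):
        means.append(sum(1 for _ in range(n) if random.random() < TRUE_RATE) / n)
    m = sum(means) / reps
    return (sum((x - m) ** 2 for x in means) / (reps - 1)) ** 0.5

small = spread_of_means(25)    # sampling variability with N = 25
large = spread_of_means(2500)  # much smaller variability with N = 2500
print(small, large)
```

The spread falls roughly like 1/√N, consistent with the Var[mean] = σ²/N formula that appears a few slides later.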

    25. Uncertainty in Estimation How to quantify the variability in the proportion estimator

    26. Range of Uncertainty. The point estimate will be off (high or low). Quantify the uncertainty as ± a sampling error. Look ahead: if I draw a sample of 100, what value(s) should I expect? Based on unbiasedness, I should expect the mean to hit the true value. Based on my empirical rule, the value should be within plus or minus 2 standard deviations 95% of the time. What should I use for the standard deviation?

    27. Estimating the Variance of the Distribution of Means. Use the variances of the 100 observed samples? No; in practice, we will have only one sample! Use what we know about the variance of the mean: Var[mean] = σ²/N. Estimate σ² using s² from the data; then divide s² by N.
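With only one sample in hand, the estimate of Var[mean] = σ²/N is s²/N. A short sketch with illustrative data:

```python
import statistics

# One observed sample (illustrative values); in practice this is all we have
sample = [0.9, 1.1, 0.8, 1.3, 1.0, 0.7, 1.2, 1.1, 0.9, 1.0]

n = len(sample)
s2 = statistics.variance(sample)  # sample variance s^2, which estimates sigma^2
var_of_mean = s2 / n              # estimate of Var[mean] = sigma^2 / N
std_error = var_of_mean ** 0.5    # standard error of the sample mean
print(s2, var_of_mean, std_error)
```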

    28. The Sampling Distribution. For sampling from the population and using the sample mean to estimate the population mean: the expected value of x̄ will equal µ; the standard deviation of x̄ will equal σ/√N; the CLT suggests a normal distribution.

    30. Accommodating Sampling Variability. To describe the center of the distribution of sample means, use the sample mean to estimate the population expected value. To describe the variability, use the sample standard deviation, s, divided by the square root of N. To accommodate the distribution, use the empirical rule: 95% within 2 standard deviations.

    31. Estimating the Sampling Distribution. For the 2nd sample, the mean was 0.849 and s was 0.358, so s/√N = 0.0358. Forming the distribution, I use 0.849 ± 2 × 0.0358. For a different sample, the mean was 0.750 and s was 0.433, so s/√N = 0.0433. Forming the distribution, I use 0.750 ± 2 × 0.0433.
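The two intervals on this slide can be reproduced directly; both samples have N = 100:

```python
# Reproducing the slide's two intervals (both samples have N = 100)
def approx_interval(mean, s, n, k=2):
    se = s / n ** 0.5  # estimated standard error s / sqrt(N)
    return mean - k * se, mean + k * se

lo1, hi1 = approx_interval(0.849, 0.358, 100)  # the 2nd sample
lo2, hi2 = approx_interval(0.750, 0.433, 100)  # a different sample
print((lo1, hi1), (lo2, hi2))
```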

    33. Will the Interval Contain the True Value? Uncertain: the midpoint is random; it may be very high or low, in which case, no. Sometimes it will contain the true value. The degree of certainty depends on the width of the interval. Very narrow interval: very uncertain (1 standard error). Wide interval: much more certain (2 standard errors). Extremely wide interval: nearly perfectly certain (2.5 standard errors). Infinitely wide interval: absolutely certain.

    34. The Degree of Certainty The interval is a “Confidence Interval” The degree of certainty is the degree of confidence. The standard in statistics is 95% certainty (about two standard errors).

    35. 66⅔% and 95% Confidence Intervals

    36. Average Monthly Spending

    37. Estimating the Mean. Given a sample of N = 225 observations with x̄ = 241.242 and s = 276.894, estimate the population mean. Point estimate: 241.242. 66⅔% confidence interval: 241.242 ± 1 × 276.894/√225 = 222.78 to 259.70. 95% confidence interval: 241.242 ± 2 × 276.894/√225 = 204.32 to 278.16. 99% confidence interval: 241.242 ± 2.5 × 276.894/√225 = 195.09 to 287.39.
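These intervals are easy to verify; note that the 66⅔% lower bound works out to 241.242 - 18.4596 ≈ 222.78:

```python
import math

# N = 225, sample mean 241.242, sample standard deviation 276.894 (from the slide)
n, xbar, s = 225, 241.242, 276.894
se = s / math.sqrt(n)  # 276.894 / 15 = 18.4596

for k in (1, 2, 2.5):  # empirical-rule multipliers: ~66 2/3%, 95%, 99%
    print(round(xbar - k * se, 2), "to", round(xbar + k * se, 2))
```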

    38. Where Did the Interval Widths Come From? Empirical rule of thumb: 2/3 = 66⅔% is contained in an interval that is the mean plus and minus 1 standard deviation; 95% is contained in a 2 standard deviation interval; 99% is contained in a 2.5 standard deviation interval. Based exactly on the normal distribution, the values would be 0.9674 standard deviations for 2/3 (rather than 1.00), 1.9600 standard deviations for 95% (rather than 2.00), and 2.5758 standard deviations for 99% (rather than 2.50).
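The exact normal multipliers can be recovered with the standard library's NormalDist; the printed values should agree with the slide's figures to about three decimal places:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

# Exact two-sided multipliers behind the empirical rule of thumb
for conf in (2 / 3, 0.95, 0.99):
    print(round(z.inv_cdf(0.5 + conf / 2), 4))
```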

    39. Normally Distributed Data. If the following are true: the data come from a normal population, and the true variance, σ², is known, then use the normal distribution values instead of the empirical rule. (Neither assumption is met for our two applications.)

    40. What Became of the Central Limit Theorem? The CLT describes the pattern of observed means we can expect if we draw samples and compute means repeatedly. The distribution of the sample mean(s) will begin to resemble normality as the sample size increases.

    41. Large Sample If the sample is moderately large (over 30), one can use the normal distribution values instead of the empirical rule. The empirical rule is easier to remember. The values will be very close to each other.

    42. Refinements – 1 (Minor). When estimating a proportion (like the acceptance rate), we used P = Σᵢxᵢ/N and (it can be shown) P(1-P)/(N-1) for the mean and variance. Researchers suggest using, instead, the Agresti-Coull correction: P* = (Σᵢxᵢ + 2)/(N + 4) and P*(1-P*)/(N + 4). This will not change your conclusions, but it will make you look like a polished expert. There are lots of other refinements described in the background notes for this session, for experts and serious practitioners.
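The Agresti-Coull adjustment is a one-line change; a sketch using hypothetical counts (78 acceptances out of 100 applications):

```python
# Agresti-Coull adjustment for a proportion (counts are hypothetical: 78 of 100)
successes, n = 78, 100

p = successes / n              # plain estimator P
var_p = p * (1 - p) / (n - 1)  # P(1-P)/(N-1)

p_star = (successes + 2) / (n + 4)          # P* = (sum of x_i + 2) / (N + 4)
var_star = p_star * (1 - p_star) / (n + 4)  # P*(1-P*)/(N + 4)
print(p, p_star)
```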

    43. Refinements 2 (More Important). When you have a fairly small sample (under 30) and you have to estimate σ using s, then both the empirical rule and the normal distribution can be a bit misleading: the interval you are using is a bit too narrow. You will find the appropriate widths for your interval in the "t table." The values depend on the sample size (more specifically, on N-1 = the degrees of freedom).

    44. Critical Values For 95% and 99% using a sample of 15: Normal: 1.960 and 2.576 Empirical rule: 2.000 and 2.500 T[14] table: 2.145 and 2.977 Note that the interval based on t is noticeably wider. The values from “t” converge to the normal values (from above) as N increases. What should you do in practice? Unless the sample is quite small, you can usually rely safely on the empirical rule. If the sample is very small, use the t distribution.

    46. Application. A sports training center is examining the endurance of athletes. A sample of 17 observations on the number of hours for a specific task produces the following sample: 4.86, 6.21, 5.29, 4.11, 6.19, 3.58, 4.38, 4.70, 4.66, 5.64, 3.77, 2.11, 4.81, 3.31, 6.27, 5.02, 6.12. This being a biological measurement, we are confident that the underlying population is normal. Form a 95% confidence interval for the mean of the distribution. The sample mean is 4.766. The sample standard deviation, s, is 1.160. The standard error of the mean is 1.160/√17 = 0.281. Since this is a small sample from the normal distribution, we use the critical value from the t distribution with N-1 = 16 degrees of freedom. From the t table, the value of t[.025,16] is 2.120. The confidence interval is 4.766 ± 2.120(0.281) = [4.170, 5.362].
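The worked example can be reproduced end to end; the critical value 2.120 is taken from the slide's t table rather than computed:

```python
import math
import statistics

hours = [4.86, 6.21, 5.29, 4.11, 6.19, 3.58, 4.38, 4.70, 4.66,
         5.64, 3.77, 2.11, 4.81, 3.31, 6.27, 5.02, 6.12]

n = len(hours)                 # 17 observations
xbar = statistics.mean(hours)  # about 4.766
s = statistics.stdev(hours)    # about 1.160
se = s / math.sqrt(n)          # about 0.281

t_crit = 2.120                 # t[.025, 16] from the t table
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(round(lo, 3), round(hi, 3))
```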

    48. Confidence Interval for a Regression Coefficient. Coefficient on OwnRent: estimate = +0.040923, standard error = 0.007141. Confidence interval: 0.040923 ± 1.96 × 0.007141 = 0.040923 ± 0.013996 = 0.02693 to 0.05492. Form a confidence interval for the coefficient on SelfEmpl. (Left for the reader.)
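Recomputing the OwnRent interval from the estimate and standard error on the slide:

```python
# OwnRent coefficient and standard error taken from the slide
estimate, std_err = 0.040923, 0.007141

half_width = 1.96 * std_err  # 95% multiplier from the normal table
lo, hi = estimate - half_width, estimate + half_width
print(round(lo, 5), "to", round(hi, 5))
```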

    49. Some Common Notation. Degree of confidence = the probability that the constructed interval will contain the parameter. Alpha level = α = the probability that the constructed interval will not contain the parameter. α/2 = the probability in the "upper tail" of the distribution (the other α/2 is in the lower tail). zα/2 is the value from the normal table such that P[Z > zα/2] = α/2; e.g., 1.96 for α/2 = 0.025 and 2.58 for α/2 = 0.005. Authors differ on how to handle the t values: I use tα/2[N-1], for example, t.025[15] = 2.131. Your textbook and Gary Simon use tα/2;N-1.

    50. Summary. Methodology: statistical inference. Application to credit scoring. Sample statistics as estimators. Point estimation. Sampling variability. The law of large numbers. Unbiasedness and consistency. Sampling distributions. Confidence intervals: proportion, mean, regression coefficient. Using the normal and t distributions instead of the empirical rule for the width of the interval.
