Alternative Forecasting Methods: Bootstrapping Bryce Bucknell Jim Burke Ken Flores Tim Metts
Agenda Scenario Obstacles Regression Model Bootstrapping Applications and Uses Results
Scenario You have been recently hired as the statistician for the University of Notre Dame football team. You are tasked with performing a statistical analysis for the first year of the Charlie Weis era. Specifically, you have been asked to develop a regression model that explains the relationship between key statistical categories and the number of points scored by the offense. You have a limited number of data points, so you must also find a way to ensure that the regression results generated by the model are reliable and significant. Problems/Obstacles: • Central Limit Theorem • Replication of data • Sampling • Variance of error terms
Constrained by the Central Limit Theorem In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̄ can be approximated by a normal probability distribution as the sample size becomes large. It is generally accepted that the sample size must be 30 or greater to satisfy the large-sample condition of the theorem. [Figure: panels labeled Sample N = 1, Sample N = 2, Sample N = 3, Sample N = 4] 1. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm
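The effect of sample size on the sampling distribution of the mean can be sketched with a small simulation. This is only an illustration: the exponential population (mean 1, standard deviation 1), the function name `sample_means`, the replication count, and the seed are all assumptions, not part of the original analysis.

```python
import random
import statistics

def sample_means(n, reps=5000, seed=0):
    """Draw `reps` samples of size n from an exponential(1) population
    (an assumed, strongly skewed population) and return the sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

# As n grows, the means cluster around 1 and their spread shrinks
# toward sigma / sqrt(n), as the theorem predicts.
for n in (1, 2, 4, 30):
    means = sample_means(n)
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

With n = 30 the spread of the means should be close to 1/√30, while with n = 1 the "means" simply reproduce the skewed population.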
Central Limit Theorem The Central Limit Theorem is the foundation for many statistical procedures: the distribution of the phenomenon under study does NOT have to be normal, because its average WILL tend to be normal. Why is the assumption of a normal distribution important? • A normal distribution allows for the application of the empirical rule: about 68%, 95% and 99.7% of values lie within 1, 2 and 3 standard deviations of the mean • Chebyshev’s Theorem holds for any distribution but gives only weaker bounds: no more than 1/4 of the values are more than 2 standard deviations away from the mean, no more than 1/9 are more than 3 standard deviations away, no more than 1/25 are more than 5 standard deviations away, and so on • The assumption of normally distributed data allows descriptive statistics to be used to explain the nature of the population
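The gap between the empirical rule and Chebyshev's bound can be checked numerically. The sketch below uses Python's standard `statistics.NormalDist`; the helper names `normal_within` and `chebyshev_min_within` are made up for illustration.

```python
from statistics import NormalDist

def normal_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

def chebyshev_min_within(k):
    """Chebyshev's guaranteed minimum share within k standard deviations,
    valid for any distribution with a finite variance."""
    return 1 - 1 / k ** 2

# Compare the two for 1, 2 and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(normal_within(k), 4), round(chebyshev_min_within(k), 4))
```

At 2 standard deviations, normality gives roughly 95% coverage while Chebyshev guarantees only 75%, which is why the normality assumption is so useful when it can be justified.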
Not enough data available? Monte Carlo simulation, a type of spreadsheet simulation, is used to randomly generate values for uncertain variables over and over to simulate a model. • Monte Carlo methods randomly select values to create scenarios • The random selection process is repeated many times to create multiple scenarios • Through the random selection process, the scenarios give a range of possible solutions, some of which are more probable and some less probable • As the process is repeated multiple times, 10,000 or more, the average solution will give an approximate answer to the problem • The accuracy can be improved by increasing the number of scenarios selected
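The repeat-and-average idea above can be sketched in a few lines. The "model" here is deliberately trivial and entirely hypothetical (the total of two fair dice, whose true average is 7); the function name, scenario count, and seed are assumptions for illustration.

```python
import random

def simulate_two_dice(n_scenarios=10_000, seed=1):
    """Monte Carlo estimate of the average total of two fair dice.
    Each scenario randomly generates both uncertain values and records the total."""
    rng = random.Random(seed)
    totals = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_scenarios)]
    return sum(totals) / n_scenarios

# The estimate should land close to the true value of 7;
# more scenarios give a tighter approximation.
print(simulate_two_dice())
```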
Sampling without Replacement Simple Random Sampling • A simple random sample from a population is a sample chosen randomly, so that each possible sample has the same probability of being chosen. • In small populations such sampling is typically done “without replacement” • Sampling without replacement results in deliberate avoidance of choosing any member of the population more than once • This process should be used when a unit cannot occur more than once in a sample, e.g. the cards in a poker hand
Sampling with Replacement • Initial data set is not large enough to use simple random sampling without replacement • Through Monte Carlo simulation we have been able to replicate the original population • Units are sampled from the population one at a time, with each unit being replaced before the next is sampled. • One outcome does not affect the other outcomes • Allows a greater number of potential outcomes than sampling without replacement • If observations were not replaced there would not be enough independent observations to create a sample size of n ≥ 30
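The difference between the two sampling schemes can be shown with Python's standard `random` module. The 12 observation IDs are hypothetical placeholders, and the seed is arbitrary.

```python
import random

rng = random.Random(42)
observations = list(range(1, 13))  # 12 hypothetical observation IDs

# Without replacement: every ID appears at most once,
# so the sample can never be larger than the data set.
without_replacement = rng.sample(observations, k=12)

# With replacement: each draw is independent of the others, IDs may repeat,
# and the sample size (here 30) may exceed the 12 available observations.
with_replacement = rng.choices(observations, k=30)

print(sorted(without_replacement))
print(sorted(with_replacement))
```

Asking `rng.sample` for 30 items from 12 observations raises a `ValueError`, which is the code-level version of the last bullet above.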
Heteroscedasticity vs. Homoscedasticity Homoscedasticity (constant variance): • All random variables have the same finite variance • Simplifies mathematical and computational treatment • Leads to good estimation results in data mining and regression Heteroscedasticity (nonconstant variance): • Random variables may have different variances • Standard errors of regression coefficients may be understated • T-ratios may be larger than actual • More common with cross sectional data [Figures: residuals plotted against X for each case]
Regression Model For ND Points Scored ND Points = 38.54 + 0.079*b1 - 0.170*b2 - 0.662*b3 - 3.16*b4 where b1 = Total Yards Gained, b2 = Penalty Yards, b3 = Total Plays, b4 = Turnovers
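The fitted equation can be written as a small function to make the role of each term explicit. The coefficients come from the slide; the function name and the game statistics in the example call are hypothetical.

```python
def predicted_nd_points(total_yards, penalty_yards, total_plays, turnovers):
    """Points predicted by the slide's regression equation."""
    return (38.54
            + 0.079 * total_yards    # b1: Total Yards Gained
            - 0.170 * penalty_yards  # b2: Penalty Yards
            - 0.662 * total_plays    # b3: Total Plays
            - 3.16 * turnovers)      # b4: Turnovers

# Hypothetical game: 450 yards, 50 penalty yards, 70 plays, 1 turnover.
print(round(predicted_nd_points(450, 50, 70, 1), 2))
```

Holding yards fixed, each additional play lowers the prediction, so the model rewards gaining the same yardage in fewer plays.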
4 Checks of a Regression Model 1. Do the coefficients have the correct sign? 2. Are the slope terms statistically significant? 3. How well does the model fit the data? 4. Is there any serial correlation?
4 Checks of a Regression Model 1. Do the coefficients have the correct sign? The coefficient on Total Plays is negative (-0.662). Could this represent a big play factor?
4 Checks of a Regression Model 2. Are the slope terms statistically significant? 3. How well does the model fit the data? 4. Is there any serial correlation?
4 Checks of a Regression Model 3. How well does the model fit the data? Adjusted R2 = 74.22%
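For reference, the standard adjustment from R2 to adjusted R2 is sketched below. The slide reports only the adjusted figure; the raw R2 and the number of observations are not given, so the inputs in the example call are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors
    relative to the number of observations."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical inputs: raw R^2 of 0.80, 12 observations, 4 predictors.
print(round(adjusted_r2(0.80, 12, 4), 4))
```

With few observations and several predictors the penalty is large, which is one reason a small data set calls for an extra reliability check.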
4 Checks of a Regression Model 4. Is there any serial correlation? The data is cross sectional, so serial correlation is not expected to be an issue. With limited data points, how useful is this regression in describing how well the model fits the actual data? Is there a way to test its reliability?
How to test the significance of the analysis What happens when the sample size is not large enough (n < 30)? Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample. • Commonly used statistical significance tests determine the likelihood of a result given a random sample of size n. • If the sample is not random, or the population does not allow a large enough sample to be drawn, the normal approximation from the central limit theorem cannot be relied on • Thus, the statistical significance of the results would not hold • Bootstrapping uses replication of the original data to simulate a larger population, thus allowing many samples to be drawn and statistical tests to be calculated
How It Works Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample. • The bootstrap procedure is a means of estimating the statistical accuracy . . . from the data in a single sample. • Bootstrapping is used to mimic the process of selecting many samples when the population is too small to do otherwise • The samples are generated from the data in the original sample by copying it many times (Monte Carlo simulation) • Samples can then be selected at random and descriptive statistics calculated or regressions run for each sample • The results generated from the bootstrap samples can be treated as if they were the result of actual sampling from the original population
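The procedure described above can be sketched for the simplest case, the standard error of a mean. The data values, the function name `bootstrap_statistic`, the number of resamples, and the seed are all assumptions for illustration; they are not the team's actual figures.

```python
import random
import statistics

def bootstrap_statistic(data, stat, n_resamples=2000, seed=0):
    """Resample `data` with replacement (same size as the original sample)
    and return the statistic computed on every resample."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical points-per-game values for a 12-game sample.
points = [21, 35, 28, 42, 17, 31, 38, 24, 45, 27, 33, 30]

boot_means = bootstrap_statistic(points, statistics.fmean)
# The spread of the resampled means estimates the standard error of the mean.
bootstrap_se = statistics.stdev(boot_means)
print(round(statistics.fmean(points), 2), round(bootstrap_se, 2))
```

Passing a different `stat` function (a median, a regression coefficient, an R2) gives the corresponding bootstrap distribution without any new formula.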
Characteristics of Bootstrapping [Diagram labels: Full Sample; Sampling with Replacement]
Bootstrapping Example Original Data Set (limited number of observations): Pittsburgh, Michigan, Michigan State, Washington, Purdue, USC, BYU, Tennessee, Navy, Syracuse, Stanford, Ohio State 1st Random Sample: Navy, Ohio State, USC, Washington, Ohio State, USC, BYU, Stanford, Pittsburgh, Ohio State, Stanford, Michigan • 109 copies of each observation create a much larger sample with which to work • Random sampling with replacement can be employed to create multiple independent samples for analysis
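The replicate-then-sample approach on this slide can be sketched as follows. The opponent names and the figure of 109 copies come from the slide; the seed is arbitrary, so the draw will not match the slide's 1st Random Sample.

```python
import random
from collections import Counter

opponents = ["Pittsburgh", "Michigan", "Michigan State", "Washington",
             "Purdue", "USC", "BYU", "Tennessee", "Navy", "Syracuse",
             "Stanford", "Ohio State"]

# Replicate every observation 109 times to build the enlarged pool.
pool = opponents * 109

# Draw one 12-game sample from the pool. Because each game appears 109 times,
# the same opponent can be drawn more than once, which closely approximates
# sampling with replacement from the original 12 games.
rng = random.Random(2005)
sample_from_pool = rng.sample(pool, k=len(opponents))

print(Counter(sample_from_pool).most_common())
```

Drawing directly with `rng.choices(opponents, k=12)` would give true sampling with replacement without building the pool; the pool version is shown because it mirrors what a spreadsheet workflow would do.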
When it should be used Bootstrapping is especially useful in situations when no analytic formula for the sampling distribution is available. • Traditional forecasting methods, like exponential smoothing, work well when demand is constant – patterns easily recognized by software • In contrast, when demand is irregular, patterns may be difficult to recognize. • Therefore, when faced with irregular demand, bootstrapping may be used to provide more accurate forecasts, making some important assumptions…
Assumptions and Methodology • Bootstrapping makes no distributional assumptions regarding the population • No normality of error terms required • No equal variance required • Allows for accurate forecasts of intermittent demand • If the sample is a good approximation of the population, the sampling distribution may be estimated by generating a large number of new samples • For small data sets, taking a small representative sample of the data and replicating it will yield superior results
Applications and Uses Criminology • Statistical significance testing is important in criminology and criminal justice • Six of the most popular journals in criminology and criminal justice are dominated by quantitative methods that rely on statistical significance testing • However, it poses two potential problems: tautology and violations of assumptions
Applications and Uses Criminology • Tautology: the null hypothesis is always false because virtually all null hypotheses may be rejected at some sample size • Violation of assumptions of regression: the assumptions that errors have constant variance and are normally distributed often do not hold • Bootstrapping provides a user-friendly alternative to cross-validation and jackknife to augment statistical significance testing
Applications and Uses Actuarial Practice • The process of developing an actuarial model begins with the creation of probability distributions of input variables • Input variables are generally asset-side generated cash flows (financial) or cash flows generated from the liabilities side (underwriting) • Traditional actuarial methodologies are rooted in parametric approaches, which fit a prescribed distribution of losses to the data
Applications and Uses Actuarial Practice • However, experience from the last two decades has shown greater interdependence of loss variables with asset variables • Increased complexity has been accompanied by increased competitive pressures and more frequent insolvencies • There is a need to use nonparametric methods in modeling loss distributions • Bootstrap standard errors and confidence intervals are used to derive the distribution
Applications and Uses Classifications Used by Ecologists • Ecologists often use cluster analysis as a tool in the classification and mapping of entities such as communities or landscapes • However, the researcher has to choose an adequate group partition level and in addition, cluster analysis techniques will always reveal groups • Use bootstrap to test statistically for fuzziness of the partitions in cluster analysis • Partitions found in bootstrap samples are compared to the observed partition by the similarity of the sampling units that form the groups.
Applications and Uses Human Nutrition • Inverse regression used to estimate vitamin B-6 requirement of young women • Standard statistical methods were used to estimate the mean vitamin B-6 requirement • Used bootstrap procedure as a further check for the mean vitamin B-6 requirement by looking at the standard error estimates and confidence intervals
Applications and Uses Outsourcing • Agilent Technologies determined it was time to transfer manufacturing of its 3070 in-circuit test systems from Colorado to Singapore • Major concern was the change in environmental test conditions (dry vs humid) • Because Agilent tests to tighter factory limits (“guard banding”), they needed to adjust the guard band for Singapore • Bootstrap was used to determine the appropriate guard band for the Singapore facility
An Alternative to the Bootstrap Jackknife • A statistical method for estimating and removing bias* and for deriving robust estimates of standard errors and confidence intervals • Created by systematically dropping out subsets of data one at a time and assessing the resulting variation *Bias: A statistical sampling or testing error caused by systematically favoring some outcomes over others
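A minimal sketch of the jackknife, assuming the common delete-one version (each "subset" dropped is a single observation). The data values and the function name `jackknife_se` are hypothetical.

```python
import statistics

def jackknife_se(data, stat=statistics.fmean):
    """Delete-one jackknife estimate of the standard error of `stat`."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(leave_one_out)
    # Scale the variation among the leave-one-out values by (n - 1) / n.
    return ((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)) ** 0.5

# Hypothetical points-per-game values.
points = [21, 35, 28, 42, 17, 31, 38, 24, 45, 27, 33, 30]
print(round(jackknife_se(points), 4))
```

No random numbers are involved, so repeating the calculation on the same data always gives the same answer. For the mean, the result coincides with the textbook standard error s/√n.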
A Comparison of the Bootstrap & Jackknife Bootstrap: • Yields slightly different results when repeated on the same data (when estimating the standard error) • Not bound to theoretical distributions Jackknife: • Less general technique • Explores sample variation differently • Yields the same result each time • Similar data requirements
Another Alternative Method Cross-Validation • The practice of partitioning a sample of data into sub-samples such that the initial analysis is conducted on a single sub-sample (training data), while further sub-samples (test or validation data) are retained “blind” for subsequent use in confirming and validating the initial analysis
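The partitioning step can be sketched as a simple holdout split. The function name `holdout_split`, the 75/25 proportion, and the seed are assumptions for illustration; the observations are just hypothetical row numbers.

```python
import random

def holdout_split(observations, train_fraction=0.75, seed=0):
    """Shuffle the observations, then keep the first part for the initial
    analysis (training) and hold the rest back "blind" for validation."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 12 hypothetical row numbers: 9 go to training, 3 are held back.
training, validation = holdout_split(range(12))
print(sorted(training), sorted(validation))
```

With only 12 observations the validation set holds just 3 rows, which illustrates why this approach needs far more data than the bootstrap.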
Bootstrap vs. Cross-Validation Bootstrap: • Requires only a small amount of data • More complex technique – time consuming Cross-Validation: • Not a resampling technique • Requires large amounts of data • Extremely useful in data mining and artificial intelligence
Methodology for ND Points Model • Use bootstrapping on the ND points scored regression model • Goal: determine the reliability of the model • Replication, random sampling, and numerous independent regressions • Calculation of a confidence interval for adjusted R2
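The methodology above can be sketched end to end under simplifying assumptions. To stay within the standard library, the regression here has a single predictor rather than the four used in the ND model, and the yards and points data are synthetic. The function names, the 24 resamples, and the seeds are also assumptions.

```python
import random
import statistics

def adjusted_r2_simple(xs, ys):
    """Adjusted R^2 of a one-predictor least-squares line."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r2 = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r2) * (n - 1) / (n - 2)

def bootstrap_adjusted_r2(xs, ys, n_resamples=24, seed=0):
    """Resample whole games (rows) with replacement, rerun the regression
    on each resample, and collect the adjusted R^2 values."""
    rng = random.Random(seed)
    n = len(xs)
    results = []
    while len(results) < n_resamples:
        idx = rng.choices(range(n), k=n)
        bx, by = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # skip a degenerate resample with no variation
        results.append(adjusted_r2_simple(bx, by))
    return results

# Synthetic 12-game data set: points loosely driven by total yards.
gen = random.Random(7)
yards = [gen.uniform(300, 600) for _ in range(12)]
points = [0.08 * y - 5 + gen.gauss(0, 5) for y in yards]

r2_values = bootstrap_adjusted_r2(yards, points)
print(round(statistics.fmean(r2_values), 4), round(statistics.stdev(r2_values), 4))
```

The mean and standard deviation of `r2_values` are the inputs needed for the confidence interval. Extending this to several predictors would require a multiple-regression routine in place of `adjusted_r2_simple`.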
Bootstrapping Results R2 Data The mean, standard deviation, and 95% and 99% confidence intervals are then calculated in Excel from the 24 observations
Bootstrapping Results R2 Data • Mean: 0.8046 • STDEV: 0.1131 • 95% confidence: ± 0.0453, or 75.93% - 84.98% • 99% confidence: ± 0.0595, or 74.51% - 86.41% So what does this mean for the results of the regression? Can we rely on this model to help predict the number of points per game that will be scored by the 2006 team?
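The reported margins can be approximately reproduced from the mean, standard deviation, and the 24 observations, assuming a normal-based interval of the form z × s / √n. The slides do not state the exact formula used, so that form, and the function name `normal_margin`, are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def normal_margin(stdev, n_obs, confidence):
    """Half-width of a normal-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return z * stdev / sqrt(n_obs)

mean_r2, sd_r2, n_obs = 0.8046, 0.1131, 24  # values reported on the slide

for level in (0.95, 0.99):
    margin = normal_margin(sd_r2, n_obs, level)
    print(level, round(margin, 4), round(mean_r2 - margin, 4), round(mean_r2 + margin, 4))
```

The margins come out close to the 0.0453 and 0.0595 on the slide; small differences in the last digit are consistent with the slide's inputs having been rounded. A t-based interval would be somewhat wider with only 24 observations.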