
Introduction to Regression and Forecasting Models

Learn about regression models and their applications in forecasting. Explore the main groups of forecasting methods: judgmental methods, regression methods, and extrapolation methods.


Presentation Transcript


  1. Chapter 14 Regression and Forecasting Models

  2. Introduction • Many decision-making applications depend on a forecast of some quantity. • Here are some examples: • When a company plans its ordering or production schedule for a product, it must forecast the customer demand for this product so that it can stock appropriate quantities—neither too much nor too little. • When an organization plans to invest in stocks, bonds, or other financial instruments, it typically attempts to forecast movements in stock prices and interest rates.

  3. Introduction continued • Many forecasting methods are available, and all practitioners have their favorites. • To say the least, there is little agreement among practitioners or theoreticians as to the best forecasting method. • The methods can generally be divided into three groups: • Judgmental methods, • Regression methods, and • Extrapolation methods.

  4. Introduction continued • Regression models, also called causal models, forecast a variable by estimating its relationship with other variables. • The technique of regression is extremely popular, due to its flexibility and power. • Regression can estimate relationships between time series variables or cross-sectional variables (those that are observed at a single point in time), and it can estimate linear or nonlinear relationships.

  5. Introduction continued • Extrapolation methods, also called time series methods, use past data of a time series variable - and nothing else - to forecast future values of the variable. • Many extrapolation methods are available, including moving averages and exponential smoothing. • All extrapolation methods search for patterns in the historical series and then attempt to extrapolate these patterns into the future.

  6. Overview of regression models • Regression analysis is the study of relationships between variables. • It is one of the most useful tools for a business analyst because it applies to so many situations. • Some potential uses of regression analysis in business address the following questions: • How do wages of employees depend on years of experience, years of education, and gender? • How does the current price of a stock depend on its own past values, as well as the current and past values of a market index?

  7. Overview of regression models continued • Each of these questions asks how a single variable, such as selling price or employee wages, depends on other relevant variables. • If you can estimate this relationship, you can better understand how the world operates and also do a better job of predicting the variable in question.

  8. Overview of regression models continued • Regression analysis can be categorized in several ways. • One categorization is based on the type of data being analyzed. • There are two basic types: cross-sectional data and time series data. • Cross-sectional data are usually data gathered from approximately the same period of time from a cross section of a population. • In contrast, time series studies involve one or more variables that are observed at several, usually equally spaced, points in time.

  9. Overview of regression models continued • In every regression study, the goal is to explain or predict a particular variable. This is called the dependent variable (or the response variable) and is often denoted generically as Y. • To help explain or predict the dependent variable, one or more explanatory variables are used. • These variables are also called independent variables or predictor variables, and they are often denoted generically as Xs.

  10. Overview of regression models continued • A second categorization of regression analysis involves the number of explanatory variables in the analysis. • If there is a single explanatory variable, the analysis is called simple regression. • If there are several explanatory variables, it is called multiple regression. • There are important differences between simple and multiple regression. • The primary difference, as the name implies, is that simple regression is simpler, from calculations to interpretation.

  11. The least-squares line • The basis for regression is a fairly simple idea. If you create a scatterplot of one variable Y versus another variable X, you obtain a swarm of points that indicates any possible relationship between these two variables. • The terms scatterplot, scatter chart, and XY chart are all used to describe the same thing. • To quantify this relationship, you try to find the best-fitting line (or curve) through the points in the graph.

  12. The least-squares line continued • Consider the scatterplot below. • The line shown is one possible fit. It appears to be a reasonably good fit, but a numerical measure of goodness-of-fit is needed so that this fit can be compared with the fits of other possible lines.

  13. The least-squares line continued • The measure commonly used is the sum of squared residuals. • Here, a residual is defined as the vertical distance from a point to the line, as illustrated for points A and B. • Put differently, a residual is a prediction error. It is the difference between an observed Y and the predicted Y from the regression line. • The least-squares regression line minimizes the sum of squared residuals.
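The least-squares idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, and the data points below are made up, not taken from the chapter:

```python
# Minimal sketch of the least-squares line; the data are illustrative only.
def least_squares(xs, ys):
    """Return intercept a and slope b minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # actual minus fitted
sse = sum(r * r for r in residuals)  # the quantity least squares minimizes
```

Any other line through these points would produce a larger sum of squared residuals, which is exactly what "least squares" means.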

  14. Prediction and fitted values • After you find the least-squares line, you can use it for prediction. • Geometrically, this is easy. Given any value of X, you predict the corresponding value of Y to be the height of the line above this X. • The predicted Y value is called the fitted value. • In contrast, the height of any point is the actual value of Y for this point. • This implies that: • Residual = Actual value - Fitted value

  15. Measures of goodness-of-fit • Besides the sum of squared residuals, other measures of goodness-of-fit typically are quoted in regression analyses. • The standard error of estimate is obtained by averaging the squared residuals and then taking the square root: it is the square root of the sum of squared residuals divided by n - k - 1, where n is the number of observations and k is the number of explanatory variables. • The standard error of estimate is useful because it provides an estimate of the magnitude of the prediction errors you are likely to make.

  16. Measures of goodness-of-fit continued • Another goodness-of-fit measure is called the multiple R, defined as the correlation between the actual Y values and the fitted Y values. • In general, a correlation is a number between -1 and +1 that measures the goodness-of-fit of the linear relationship between two variables. • A correlation close to -1 or +1 indicates a tight linear fit, whereas a correlation close to 0 tends to indicate no linear fit—usually a shapeless swarm of points.

  17. Measures of goodness-of-fit continued • In regression, you want the fitted Y values to be close to the actual Y values, so you want a scatterplot of the actual values versus the fitted values to be close to a 45° line, with the multiple R close to +1.

  18. Measures of goodness-of-fit continued • If you square the multiple R, you get a measure that has a more direct interpretation. • This measure is known simply as R-square. It represents the percentage of the variation of the Y values explained by the Xs included in the regression equation.
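These goodness-of-fit measures can all be computed directly from the actual and fitted values. The sketch below uses illustrative numbers and the conventional divisor n - k - 1 for the standard error of estimate:

```python
import math

def goodness_of_fit(actual, fitted, k=1):
    """Return (standard error of estimate, multiple R, R-square).

    k is the number of explanatory variables in the regression.
    """
    n = len(actual)
    residuals = [a - f for a, f in zip(actual, fitted)]
    sse = sum(r * r for r in residuals)
    std_err = math.sqrt(sse / (n - k - 1))
    mean_a = sum(actual) / n
    mean_f = sum(fitted) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, fitted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    multiple_r = cov / math.sqrt(var_a * var_f)  # correlation of actual vs. fitted
    return std_err, multiple_r, multiple_r ** 2

# Illustrative data: fitted values from a least-squares line, not from the text.
actual = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]
std_err, multiple_r, r_square = goodness_of_fit(actual, fitted)
```

For least-squares fitted values, R-square computed this way agrees with the "percentage of variation explained" interpretation on the slide.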

  19. Simple regression models • In this section, we discuss how to estimate the regression equation for a dependent variable Y based on a single explanatory variable X. • The common terminology is that “Y is regressed on X.” • This is the equation of the least-squares line passing through the scatterplot of Y versus X. • Because we are estimating a straight line, the regression equation is of the form Y = a + bX, where, as in basic algebra, a is called the intercept and b is called the slope.

  20. Regression-based trend models • A special case of simple regression is when the only explanatory variable is time, usually labeled t (rather than X). • In this case, the dependent variable Y is a time series variable, such as a company’s monthly sales, and the purpose of the regression is to see whether this dependent variable follows a trend through time. • With a linear trendline, the variable changes by a constant amount each period. • With an exponential trendline, the variable changes by a constant percentage each period. • The following example demonstrates how easily trends can be estimated with Excel.
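Excel fits an exponential trendline by regressing the logarithm of the series on time. Both trend fits can be sketched in Python; the series below is synthetic, constructed to grow exactly 5% per period:

```python
import math

def linear_trend(ys):
    """Fit y = a + b*t by least squares, with t = 1, 2, ..., n."""
    n = len(ys)
    ts = list(range(1, n + 1))
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / \
        sum((t - mean_t) ** 2 for t in ts)
    return mean_y - b * mean_t, b          # intercept a, slope b

def exponential_trend(ys):
    """Fit y = c*exp(g*t) by regressing ln(y) on t; g is the growth rate."""
    a, g = linear_trend([math.log(y) for y in ys])
    return math.exp(a), g

# Synthetic series growing exactly 5% per period (not the example's data).
ys = [100 * math.exp(0.05 * t) for t in range(1, 11)]
c, g = exponential_trend(ys)
```

A fitted growth rate g near 0.05 corresponds to roughly 5% growth per period, the same interpretation given to the exponent coefficient in the example that follows.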

  21. Example 14.1:Exponential Growth.xlsx • The Best Chips Company produces and sells potato chips throughout the country. • Its sales have been growing steadily over the past 10 years, as shown on the next slide and in this file. • The company wants to predict its sales for the next couple of years, assuming that the upward trend it has observed in the past 10 years will continue in the future. • How should the company proceed?

  22. Example 14.1 continued:Historical sales at Best Chips

  23. Example 14.1 continued:Solution • We begin by creating the scatterplot of Sales versus Year shown on the next slide. • Sales are clearly increasing through time, but it is not absolutely clear whether they are increasing at a constant rate, which would favor a linear trendline, or at an increasing rate, which would favor an exponential trendline. • Therefore, we try fitting both of these.

  24. Example 14.1 continued:Time series plot of sales

  25. Example 14.1 continued:Solution • To superimpose a linear trendline on any scatterplot, right-click on any point on the chart and then select the Add Trendline menu item. • This brings up the dialog box shown here.

  26. Example 14.1 continued: Solution • You can select any of six types of trendlines. • For now, select the default Linear option. Also, click on the Options tab and check the Display equation box. • The result is shown on the next slide. • This figure shows the best-fitting straight line to the points, and it indicates that the equation of this straight line is y = 92,091x + 1,168,200.

  27. Example 14.1 continued:Plot with superimposed linear trend line

  28. Example 14.1 continued:Solution • Here, y corresponds to sales and x corresponds to year. • The most important part of this equation is the coefficient of x, 92,091. It implies that sales are increasing by $92,091 per year—if we believe that the linear trendline provides a good fit. • To obtain an exponential trendline, we go through the same procedure except that we select the Exponential option in the dialog box.

  29. Example 14.1 continued:Solution • The resulting curve is shown below. • The equation for the curve is y = 1,227,762e^(0.0541x). • The most important part of this equation is the coefficient in the exponent, 0.0541. • It implies that sales are increasing by approximately 5.4% per year.

  30. Example 14.1 continued:Solution • Which of these trendlines provides the better fit? • We can proceed in two ways. • First, we can “eyeball” it. Looking at the superimposed trendlines, it appears that the exponential fit is slightly better. • The typical way to measure fits to a trendline through time is to calculate the historical predictions from each curve and the corresponding absolute percentage errors (APEs). • We find the predictions by plugging the year indexes (1 to 10) into the trendline equations.

  31. Example 14.1 continued:Solution • We then calculate the APE for each year as APE = |Actual - Predicted| / Actual × 100%. • A measure of goodness-of-fit is then the average of these APE values, denoted by MAPE (mean absolute percentage error). • This measure is quite intuitive.
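The APE and MAPE calculations are simple enough to sketch directly. The numbers here are illustrative, not the chip-sales data:

```python
def mape(actuals, predictions):
    """Mean absolute percentage error, expressed in percent."""
    apes = [abs(a - p) / a * 100 for a, p in zip(actuals, predictions)]
    return sum(apes) / len(apes)

# Two forecasts that are 10% and 5% off give a MAPE of 7.5%.
error = mape([100, 200], [110, 190])
```

A lower MAPE means a better fit, which is how the linear and exponential trendlines are compared in this example.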

  32. Example 14.1 continued:Solution • All of this is implemented and shown on the next slide. • To create the predictions, APEs, and MAPEs, proceed as follows: • Predictions. Calculate the predictions from the linear trendline by entering the formula =1168200+92091*A3 in cell D3 and copying it down to cell D14. Similarly, calculate the predictions from the exponential trendline by entering the formula =1227762*EXP(0.0541*A3) in cell E3 and copying it down to cell E14.

  33. Example 14.1 continued:Evaluating the goodness-of-fit

  34. Example 14.1 continued:Solution • APE values. Calculate all of the APE values at once by entering the formula =ABS(($B3-D3)/$B3) in cell F3 and copying it to the range F3:G12. • MAPE values. Calculate the MAPE for each trendline by entering the formula =AVERAGE(F3:F12) in cell F16 and copying it to cell G16. • The MAPE values confirm that the exponential trendline is slightly better than the linear trendline.

  35. Caution about exponential trend lines • Exponential trendlines are often used in predicting sales and other economic quantities. • However, we urge caution with such predictions. It is difficult for any company to sustain a given percentage increase year after year.

  36. Using an explanatory variable other than time • You are not restricted to using time as the explanatory variable in simple regression. • Any variable X that is related to the dependent variable Y is a candidate. • The following example illustrates one such possibility. • It shows how you can still take advantage of Excel’s Add Trendline option, even though the resulting trendline is not what you usually think of with trend—a trend through time.

  37. Example 14.2:Cost Regression 1.xlsx • Consider a company that produces a single product. • For each of the past 16 months, the company has kept track of the number of units produced as well as the total cost of production. • These data are shown on the next slide and in the file Cost Regression 1.xlsx. • What can simple regression tell us about the relationship between these two variables? • How can it be used to predict future production costs?

  38. Example 14.2 continued:Cost and production data for a single product

  39. Example 14.2 continued:Solution • When we try to relate two variables with regression, it is always a good idea to create a scatterplot of the two variables first, just to see whether there is any relationship worth pursuing. • This can be done with Excel’s Chart Wizard in the usual way, or it can be done easily with Palisade’s StatTools add-in. • StatTools is used for the rest of the statistical analysis in this chapter, and it is very easy to use. • To get started with any statistical analysis on any Excel data set, you must first use Data Set Manager from the StatTools ribbon.

  40. StatTools

  41. Example 14.2 continued:StatTools • Once you have designated a StatTools data set (again, the first step in any analysis), you can then create a scatterplot from the Summary Graphs dropdown list. • This leads to the dialog box shown here.

  42. Example 14.2 continued:StatTools • Select the X and Y variables. (Actually, you can select multiple Ys and Xs; you will then get a scatterplot of each Y-X pair.) • The resulting scatterplot for this example appears on the next slide. • This plot indicates a clear linear relationship, where Total Cost increases as Units Produced increases. • Although this chart was created with StatTools, it (like other StatTools charts) is a regular Excel chart, so you can modify it just as you can modify any other Excel chart.

  43. Example 14.2 continued:Scatterplot of total cost vs. units produced

  44. Example 14.2 continued:Solution • The equation of the straight line has a slope, 198.47, and an intercept, 23,643. For this example, both numbers have a natural interpretation. • The slope corresponds to the unit variable cost of production. We estimate that each extra unit produced contributes $198.47 to total cost. • The intercept corresponds to the fixed cost of production. We estimate that the fixed cost is $23,643, regardless of the production level.
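Using the slope and intercept quoted on this slide, predicting total cost for a planned production level is one line of arithmetic:

```python
# Estimates taken from the example: intercept = fixed cost, slope = unit cost.
FIXED_COST = 23643       # dollars per month, regardless of production level
UNIT_COST = 198.47       # dollars contributed by each extra unit produced

def predict_total_cost(units):
    """Predicted total cost from the estimated regression line."""
    return FIXED_COST + UNIT_COST * units
```

For example, a month with 500 units produced would be predicted to cost 23,643 + 198.47 × 500 = 122,878 dollars.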

  45. Example 14.2 continued:Solution • We can now find the fitted values, the residuals, and the standard error of estimate. The results appear on the next slide. • There are two steps, where the first is required only if you want to make future predictions: • Designate the data for future months as a second StatTools data set with the Data Set Manager. We called it Data for Prediction, and it is the range E1:G3. • Select Regression from the StatTools Regression and Classification dropdown list, and fill in the resulting dialog box as shown on the next slide.

  46. Example 14.2 continued:Solution

  47. Example 14.2 continued:Solution • The scatterplot of residuals versus fitted values on the previous slide is a diagnostic tool used to see if there are peculiar points or patterns. • The shapeless swarm seen here is an indication that no regression assumptions are violated. • The predictions below are also found by plugging the known X values into the regression equation, but no residuals are possible because the actual Total Cost values are not yet known for these months.

  48. Multiple regression models • When you try to explain a dependent variable Y with regression, there are often a multitude of explanatory variables to choose from. • In this section, we explore multiple regression, where the regression equation for Y includes a number of explanatory variables, the Xs. • The general form of this equation is Y = a + b1X1 + b2X2 + ... + bkXk.

  49. Multiple regression models continued • In the previous equation, a is again the Y-intercept, and b1 through bk are the slopes. • Collectively, a and the bs are called the regression coefficients. • Each slope coefficient is the expected change in Y when that particular X increases by one unit and the other Xs in the equation remain constant. • For example, b1 is the expected change in Y when X1 increases by one unit and the other Xs in the equation, X2 through Xk, remain constant. • We illustrate these ideas in the following example.
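A multiple regression with two explanatory variables can be sketched with numpy's least-squares solver. The data below are synthetic, constructed so that Y = 1000 + 3·X1 + 5·X2 holds exactly, which lets the solver recover the coefficients:

```python
import numpy as np

# Synthetic data (not from the example file): Y = 1000 + 3*X1 + 5*X2 exactly.
X = np.array([[10.0, 5.0], [20.0, 7.0], [15.0, 12.0],
              [30.0, 9.0], [25.0, 14.0], [18.0, 6.0]])
Y = 1000 + 3.0 * X[:, 0] + 5.0 * X[:, 1]

# Prepend a column of ones so the solver also estimates the intercept a.
design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b1, b2 = coefs  # intercept and the two slope coefficients
```

Each slope coefficient is interpreted as described on the slide: b1 estimates the change in Y per one-unit change in X1, holding X2 constant.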

  50. Example 14.3:Cost Regression 2.xlsx • Suppose the company from Example 14.2 now produces three different products, A, B, and C. • The company has kept track of the number of units produced of each product and the total production cost for the past 15 months. • These data are shown on the next slide and in the file Cost Regression 2.xlsx. • What can multiple regression tell us about the relationship between these variables? • How can multiple regression be used to predict future production costs?
