
Spreadsheet Modeling & Decision Analysis

This chapter introduces regression analysis, a statistical method used to estimate the relationship between a dependent variable and one or more independent variables. It covers model specification, best fit determination, and evaluating the fit of the regression function using the R-squared statistic.



Presentation Transcript


  1. Spreadsheet Modeling & Decision Analysis A Practical Introduction to Management Science 4th edition Cliff T. Ragsdale

  2. Chapter 9 Regression Analysis

  3. Introduction to Regression Analysis (RA) • Regression Analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables. Y = f(X1, X2, X3,…, Xn)+e Note: • f( ) describes systematic variation in the relationship. • e represents the unsystematic variation (or random error) in the relationship.

  4. In other words, the observations we are interested in can be separated into two parts: Y = f(X1, X2, X3,…, Xn) + e Observations = Model + Error Observations = Signal + Noise Ideally, the noise should be very small compared to the model. (Note: observed value = model + error.)

  5. Signal to Noise • What we observe ("what we see") can be divided into two parts: the signal and the noise.

  6. Model specification If the true function is: yi = B0 + B1Xi + B2Zi And we fit: yi = b0 + b1Xi + b2Zi + ei Our model is exactly specified and we obtain an unbiased and efficient estimate.

  7. Model specification If, however, the true function is: yi = B0 + B1Xi + B2Zi + B3XiZi + B4Zi² And we fit: yi = b0 + b1Xi + b2Zi + ei Our model is underspecified; we excluded some necessary terms, and we obtain a biased estimate.

  8. Model specification On the other hand, if the true function is: yi = B0 + B1Xi + B2Zi And we fit: yi = b0 + b1Xi + b2Zi + b3XiZi + ei Our model is overspecified, we included some unnecessary terms, and we obtain an inefficient estimate.

  9. Model specification • If you specify the model exactly, there is no bias • If you overspecify the model (add more terms than needed), the result is unbiased, but inefficient • If you underspecify the model (omit one or more necessary terms), the result is biased • Overall Strategy • The best option is to exactly specify the true function • We would prefer to err by overspecifying our model, because that only leads to inefficiency • Therefore, start with a likely overspecified model and reduce it
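
The bias from underspecification can be illustrated with a small simulation. This is an illustrative sketch (not from the text), written in Python with numpy; the data-generating process below is hypothetical, with Z deliberately correlated with X so that omitting Z biases the slope on X:

```python
import numpy as np

# Hypothetical data-generating process: the true function is
# y = 1 + 2x + 3z + noise, and z is correlated with x.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
z = x + rng.normal(size=n)          # z moves with x
y = 1 + 2 * x + 3 * z + rng.normal(size=n)

def ols(design, y):
    """Ordinary least squares: minimize the error sum of squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Exactly specified model: estimates land near the true (1, 2, 3).
b_full = ols(np.column_stack([np.ones(n), x, z]), y)

# Underspecified model (z omitted): the slope on x is biased upward,
# because it absorbs part of z's effect.
b_under = ols(np.column_stack([np.ones(n), x]), y)
```

Re-running with different seeds shows the full model's slope hovering around 2 while the underspecified model's slope stays near 5 — bias that does not shrink as n grows.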

  10. An Example • Consider the relationship between advertising (X1) and sales (Y) for a company. • There probably is a relationship... ...as advertising increases, sales should increase. • But how would we measure and quantify this relationship? See file Fig9-1.xls

  11. A Scatter Plot of the Data • Sales (in $1,000s) plotted against Advertising (in $1,000s).

  12. The Nature of a Statistical Relationship • Regression curve of Y against X, with probability distributions for Y at different levels of X.

  13. A Simple Linear Regression Model • The scatter plot shows a linear relation between advertising and sales. • So the following regression model is suggested by the data: Y = B0 + B1X1 + e • This refers to the true relationship between the entire population of advertising and sales values. • The estimated regression function (based on our sample) will be represented as: Ŷ = b0 + b1X1

  14. Determining the Best Fit • Numerical values must be assigned to b0 and b1. • The method of “least squares” selects the values of b0 and b1 that minimize the error sum of squares: ESS = Σ(Yi − Ŷi)² • If ESS = 0, our estimated function fits the data perfectly. • We could solve this problem using Solver...
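
Minimizing ESS has a closed-form solution, which the sketch below implements in Python (the three-point dataset is made up for illustration; it is not the chapter's advertising data):

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing ESS = sum((y - (b0 + b1*x))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the fitted line passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Points lying exactly on y = 2 + 3x, so ESS = 0 (a perfect fit).
b0, b1 = least_squares([1, 2, 3], [5, 8, 11])
```

Solver reaches the same answer numerically; the closed form is what Excel's regression tool computes directly.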

  15. Estimation – Linear Regression • Formula for a straight line: y = b0 + b1x + e • b1 = slope = Δy/Δx • b0 = intercept • We want to solve for b0 and b1.

  16. Using Solver... See file Fig9-4.xls

  17. The Estimated Regression Function • The estimated regression function is: Ŷ = 36.342 + 5.550X1

  18. Using the Regression Tool • Excel also has a built-in tool for performing regression that: • is easier to use • provides a lot more information about the problem See file Fig9-1.xls

  19. The TREND() Function TREND(Y-range, X-range, X-value for prediction) where: Y-range is the spreadsheet range containing the dependent Y variable, X-range is the spreadsheet range containing the independent X variable(s), X-value for prediction is a cell (or cells) containing the values for the independent X variable(s) for which we want an estimated value of Y. Note: The TREND( ) function is dynamically updated whenever any inputs to the function change. However, it does not provide the statistical information provided by the regression tool. It is best to use these two different approaches to doing regression in conjunction with one another.

  20. Evaluating the “Fit” • Scatter plot of Sales (in $000s) against Advertising (in $000s) with the fitted trendline; R2 = 0.9691.

  21. The R2 Statistic • The R2 statistic indicates how well an estimated regression function fits the data. • 0 ≤ R2 ≤ 1 • It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation. • To understand this better, consider the following graph...

  22. Error Decomposition • For each observation, the deviation of the actual value Yi from the mean Ȳ splits into an explained part and an error part: Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi) where Ŷi = b0 + b1X is the estimated value and Yi is the actual value.

  23. Partition of the Total Sum of Squares • Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)² or, TSS = ESS + RSS where TSS is the total sum of squares, ESS = Σ(Yi − Ŷi)² is the error sum of squares, and RSS = Σ(Ŷi − Ȳ)² is the regression sum of squares.

  24. Degree of Linear Correlation • R2 = 1 = perfect linear correlation; R2 = 0 = no correlation • High R2 = good fit only if a linear model is appropriate; always check with a scatter plot • Correlation does not prove causation; x and y may both be correlated to a third (possibly unidentified) variable • A more popular (but less meaningful) measure is the “correlation coefficient”: R2 = RSQ([y-range],[x-range]) r = CORREL([y-range],[x-range])

  25. • Four very different scatter plots, each with R2 = 0.67 — the same R2 value can arise from quite different data patterns, so always check a scatter plot.

  26. Testing for Significance: F Test • Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0 • Test Statistic: F = MSR/MSE • Rejection Rule: Reject H0 if F > Fα, where Fα is based on an F distribution with 1 d.f. in the numerator and n − 2 d.f. in the denominator.

  27. Some Cautions about theInterpretation of Significance Tests • Rejecting H0: b1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y. • Just because we are able to reject H0: b1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.

  28. An Example of Inappropriate Interpretation • A study shows that, in elementary schools, spelling ability is stronger for students with larger feet. Could we conclude that foot size influences spelling ability? Or is there another factor that influences both foot size and spelling ability?

  29. Making Predictions • Suppose we want to estimate the average level of sales expected if $65,000 is spent on advertising. • Estimated Sales = 36.342 + 5.550 × 65 = 397.092 • So when $65,000 is spent on advertising, we expect the average sales level to be $397,092.
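
The arithmetic of this point prediction, using the estimates quoted in the text (both X and Y measured in $1,000s):

```python
# Estimated regression function from the text: Sales = 36.342 + 5.550 * Advertising
b0, b1 = 36.342, 5.550
advertising = 65                            # $65,000, in $1,000s
estimated_sales = b0 + b1 * advertising     # about 397.092, i.e. $397,092
```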

  30. The Standard Error • The standard error Se measures the scatter in the actual data around the estimated regression line: Se = sqrt(ESS / (n − k − 1)) where k = the number of independent variables. • For our example, Se = 20.421. • This is helpful in making predictions...
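
The standard-error formula can be sketched directly in Python (illustrative values; `fitted` is assumed to come from a least-squares fit):

```python
def standard_error(ys, fitted, k):
    """Se = sqrt(ESS / (n - k - 1)), where ESS is the error sum of squares
    and k is the number of independent variables."""
    n = len(ys)
    ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return (ess / (n - k - 1)) ** 0.5

# Four observations, one independent variable: ESS = 1.8, so Se = sqrt(0.9).
se = standard_error([2, 3, 5, 4], [2.3, 3.1, 3.9, 4.7], 1)
```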

  31. Excel Functions • Scatterplot: Chart / Add Trendline / Linear / Display equation, R2 • Menu-driven tools: Tools / Data Analysis / Regression; Tools / Data Analysis Plus / Stepwise Regression; Tools / OLS Regression • Excel functions: a = INTERCEPT([y-range],[x-range]) b = SLOPE([y-range],[x-range]) se = STEYX([y-range],[x-range])

  32. An Approximate Prediction Interval • An approximate 95% prediction interval for a new value of Y when X1 = X1h is given by: Ŷh ± 2Se where Ŷh = b0 + b1X1h. • Example: If $65,000 is spent on advertising: • 95% lower prediction limit = 397.092 − 2 × 20.421 = 356.250 • 95% upper prediction limit = 397.092 + 2 × 20.421 = 437.934 • If we spend $65,000 on advertising, we are approximately 95% confident actual sales will be between $356,250 and $437,934.
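
The approximate interval is just the point estimate plus or minus two standard errors, using the values from the text's example:

```python
y_hat = 397.092   # point prediction at $65,000 of advertising
se = 20.421       # standard error from the text

lower = y_hat - 2 * se   # 356.250
upper = y_hat + 2 * se   # 437.934
```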

  33. An Exact Prediction Interval • A (1 − α)% prediction interval for a new value of Y when X1 = X1h is given by: Ŷh ± t(1−α/2, n−2) Sp where Sp = Se × sqrt(1 + 1/n + (X1h − X̄1)² / Σ(X1i − X̄1)²) is the standard prediction error.

  34. Example • If $65,000 is spent on advertising: 95% lower prediction interval = 397.092 - 2.306*21.489 = 347.556 95% upper prediction interval = 397.092 + 2.306*21.489 = 446.666 • If we spend $65,000 on advertising we are 95% confident actual sales will be between $347,556 and $446,666. • This interval is only about $20,000 wider than the approximate one calculated earlier but was much more difficult to create. • The greater accuracy is not always worth the trouble.
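
The exact interval uses the t value and standard prediction error quoted in the text; recomputing from those rounded inputs lands within a few hundredths of the slide's $347,556 and $446,666:

```python
y_hat = 397.092
t_value = 2.306    # t with n - 2 = 8 d.f. at 95%, as quoted in the text
sp = 21.489        # standard prediction error from the text

lower = y_hat - t_value * sp   # close to 347.556 (differs slightly by rounding)
upper = y_hat + t_value * sp   # close to 446.666

# The exact interval is wider than the approximate one (width 4 * 20.421).
exact_width, approx_width = upper - lower, 4 * 20.421
```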

  35. Comparison of Prediction Interval Techniques • Plot of Sales against Advertising Expenditures showing the regression line, prediction intervals created using the standard error Se, and prediction intervals created using the standard prediction error Sp.

  36. Confidence Intervals for the Mean • A (1 − α)% confidence interval for the true mean value of Y when X1 = X1h is given by: Ŷh ± t(1−α/2, n−2) Se × sqrt(1/n + (X1h − X̄1)² / Σ(X1i − X̄1)²)

  37. A Note About Extrapolation • Predictions made using an estimated regression function may have little or no validity for values of the independent variables that are substantially different from those represented in the sample.

  38. What Does “Regression” Mean?

  39. What Does “Regression” Mean? • Draw “best-fit” line free hand • Find mother’s height = 60”, find average daughter’s height • Repeat for mother’s height = 62”, 64”… 70”; draw “best-fit” line for these points • Draw line daughter’s height = mother’s height • For a given mother’s height, daughter’s height tends to be between mother’s height and mean height: “regression toward the mean”

  40. What Does “Regression” Mean?

  41. Residual Analysis • Residual for observation i: yi − ŷi • Standardized residual for observation i: (yi − ŷi) / s(yi − ŷi) where s(yi − ŷi) is the standard deviation of residual i.


  43. Residual Analysis • Detecting Outliers • An outlier is an observation that is unusual in comparison with the other data. • Minitab classifies an observation as an outlier if its standardized residual value is < -2 or > +2. • This standardized residual rule sometimes fails to identify an unusually large observation as being an outlier. • This rule’s shortcoming can be circumvented by using studentized deleted residuals. • The |i th studentized deleted residual| will be larger than the |i th standardized residual|.

  44. Multiple Regression Analysis • Most regression problems involve more than one independent variable. • If each independent variable varies in a linear manner with Y, the estimated regression function in this case is: Ŷ = b0 + b1X1 + b2X2 + … + bkXk • The optimal values for the bi can again be found by minimizing the ESS. • The resulting function fits a hyperplane to our sample data.
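
Fitting the hyperplane by minimizing ESS can be sketched with numpy's least-squares solver (the two-variable sample below is made up, chosen to lie exactly on a known plane so the recovered coefficients are unambiguous):

```python
import numpy as np

def fit_hyperplane(X, y):
    """Least-squares estimates (b0, b1, ..., bk) for
    Y = b0 + b1*X1 + ... + bk*Xk, found by minimizing the ESS."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Hypothetical sample lying exactly on the plane Y = 10 + 2*X1 + 3*X2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 10 + 2 * X[:, 0] + 3 * X[:, 1]
b = fit_hyperplane(X, y)
```

With noisy data the same call returns the ESS-minimizing plane rather than an exact fit.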

  45. Example Regression Surface for Two Independent Variables • Three-dimensional plot of sample data points scattered about a regression plane fitted over the (X1, X2) space.

  46. Multiple Regression Example:Real Estate Appraisal • A real estate appraiser wants to develop a model to help predict the fair market values of residential properties. • Three independent variables will be used to estimate the selling price of a house: • total square footage • number of bedrooms • size of the garage • See data in file Fig9-17.xls

  47. Selecting the Model • We want to identify the simplest model that adequately accounts for the systematic variation in the Y variable. • Arbitrarily using all the independent variables may result in overfitting. • A sample reflects two kinds of characteristics: those representative of the population and those specific to that particular sample. • We want to avoid fitting sample-specific characteristics -- that is, overfitting the data.

  48. Models with One Independent Variable • With simplicity in mind, suppose we fit three simple linear regression functions. Key regression results are:

  Variables in the Model    R2       Adjusted R2    Se        Parameter Estimates
  X1                        0.870    0.855          10.299    b0 = 9.503,  b1 = 56.394
  X2                        0.759    0.731          14.030    b0 = 78.290, b2 = 28.382
  X3                        0.793    0.770          12.982    b0 = 16.250, b3 = 27.607

  • The model using X1 accounts for 87% of the variation in Y, leaving 13% unaccounted for.
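
Adjusted R² penalizes extra independent variables, which is why it is the better column for comparing models of different sizes. A Python sketch of the standard formula; note that the sample size n = 11 below is inferred from the slide's R²/adjusted-R² pairs, not stated in the text:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the
    number of observations and k the number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Reproducing the table's pairs under the assumption n = 11, k = 1:
adj_x1 = adjusted_r2(0.870, 11, 1)   # close to the table's 0.855
adj_x3 = adjusted_r2(0.793, 11, 1)   # close to the table's 0.770
```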

  49. Important Software Note When using more than one independent variable, all variables for the X-range must be in one contiguous block of cells (that is, in adjacent columns).
