# Lecture 9: Diagnostics & Review

### Lecture 9: Diagnostics & Review

February 10, 2014

A least squares regression line is determined from a sample of values for variables x and y where

x = size of a listed home (in sq feet)

y = selling price of the home (in \$)

Which of the following is true about the model b0 + b1x?

• If there is positive correlation r between x and y, then b1 must be positive

• The units of the intercept and slope will be the same as the response variable, y.

• If r² = 0.85, then it is appropriate to conclude that a change in x will cause a change in y

• None of the above, more than one of the above, or not enough information to tell.

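• If there is positive correlation r between x and y, then b1 must be positive

This statement is true. The least squares slope is b1 = r * sy / sx, where sx and sy are the sample standard deviations of x and y.

So if r > 0, then b1 is positive, because sy and sx are both > 0.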
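The identity b1 = r * sy / sx can be checked numerically. The sketch below uses made-up listings (the `sqft` and `price` values are hypothetical, not data from the lecture) and compares the slope computed from r with the slope computed directly by least squares.

```python
import math
import statistics as st

# Hypothetical listings (made-up numbers, not data from the lecture).
sqft = [1100, 1400, 1600, 1900, 2300, 2800]
price = [155000, 190000, 205000, 250000, 290000, 340000]

def pearson_r(x, y):
    """Sample correlation between x and y."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_slope(x, y):
    """Slope that minimizes the sum of squared residuals of y on x."""
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

r = pearson_r(sqft, price)
b1_from_r = r * st.stdev(price) / st.stdev(sqft)  # b1 = r * sy / sx
b1_direct = least_squares_slope(sqft, price)

print(f"r = {r:.4f}")
print(f"b1 from r * sy / sx = {b1_from_r:.4f} dollars per sq ft")
print(f"b1 from least squares = {b1_direct:.4f} dollars per sq ft")
```

The two slopes agree up to rounding, and both carry the sign of r. Note that the slope's units are dollars per square foot, not dollars.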

• Problem set 4 due (9am)

• How was it?

• Next week: Multiple Regression

• Exam Wednesday

• Sample question

• Taken from Exam 1 - #37 last year

• What did we talk about?

• Outliers

• Sensitivity analysis

• Heteroscedasticity

Common problems and fixes:

Say we’re estimating price of a lease by the size of the house:

Price = β0 + β1 * SqFt + ε

Interpretation of the estimates?

• β0 would be fixed costs and

• β1 would be marginal costs

Common Problems: Heteroscedasticity

Heteroscedasticity: What does that mean for your analysis?

• Point estimates for β’s?

• Still OK. No bias.

• Prediction and Confidence intervals?

• Not reliable; too narrow or too wide.

• Hypothesis tests regarding β0 and β1 are not reliable.

Common Problems: Heteroscedasticity

Fixing the problem:

• Revise the model: how will depend on the substance.

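• Try revising the model to estimate Price/SqFt by dividing the original equation by SqFt: Price/SqFt = β1 + β0 * (1/SqFt) + ε/SqFt

• Notice the change in the intercept and slope: the marginal cost β1 is now the intercept, and the fixed cost β0 is now the slope on 1/SqFt.

• Don’t be locked into thinking the intercept is fixed cost

• How to interpret them depends on the form of the model.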

Common Problems: Heteroscedasticity

Fixing the problem:

Price/SqFt = M + F * (1/SqFt) + ε

• Revise by thinking about the substance

• Here, that meant predicting price per square foot directly.

• Don’t revise by doing weird things

• Use theory!

• After revising, check whether the residuals have similar variances.

• Sometimes they won’t.

• In this case they do.
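A small simulation can illustrate why the revised model helps. The sketch below assumes, purely for illustration, that the error in price grows in proportion to SqFt; the values `F_TRUE` and `M_TRUE`, the sample size, and the noise level are all hypothetical. It fits both models and compares the residual spread for larger homes against smaller homes.

```python
import random
import statistics as st

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx, my = st.mean(x), st.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def spread_ratio(size, resid):
    """SD of residuals for the larger half of homes over SD for the smaller half."""
    pairs = sorted(zip(size, resid))
    half = len(pairs) // 2
    small = [e for _, e in pairs[:half]]
    large = [e for _, e in pairs[half:]]
    return st.stdev(large) / st.stdev(small)

# Simulated data under an ASSUMED error structure: noise proportional to SqFt.
random.seed(0)
F_TRUE, M_TRUE = 50_000.0, 150.0  # hypothetical fixed and marginal costs
sqft = [random.uniform(1000, 4000) for _ in range(200)]
price = [F_TRUE + M_TRUE * s + s * random.gauss(0, 20) for s in sqft]

# Original model: Price = b0 + b1 * SqFt
b0, b1 = fit_line(sqft, price)
resid_orig = [p - (b0 + b1 * s) for s, p in zip(sqft, price)]

# Revised model: Price/SqFt = M + F * (1/SqFt)
inv_sqft = [1.0 / s for s in sqft]
price_per_sqft = [p / s for s, p in zip(sqft, price)]
m_hat, f_hat = fit_line(inv_sqft, price_per_sqft)
resid_rev = [y - (m_hat + f_hat * x) for x, y in zip(inv_sqft, price_per_sqft)]

ratio_orig = spread_ratio(sqft, resid_orig)
ratio_rev = spread_ratio(sqft, resid_rev)
print(f"original model: large/small residual SD ratio = {ratio_orig:.2f}")
print(f"revised model:  large/small residual SD ratio = {ratio_rev:.2f}")
print(f"revised intercept (marginal cost) = {m_hat:.1f}")
```

A ratio well above 1 for the original model signals that large homes vary more, while a ratio near 1 for the revised model suggests similar variances. With real data the error structure is unknown, so the residual plot still has to be checked after revising.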

Common Problems: Heteroscedasticity

Comparing the revised and original model:

• Revised model may have different (and smaller) R².

• Again, so? R² is great but it’s only one notion of fit.

• In the example, the revised model provides a narrower (hence better) confidence interval for fixed and variable costs:

[Figure: confidence intervals for fixed and variable costs, Original Model vs. Revised Model]

Common Problems: Heteroscedasticity

Comparing the revised and original model:

• It also provides a more sensible prediction interval

• The data originally indicated that large homes varied in price more:

Common Problems: Heteroscedasticity

How do you know how to remodel the problem?

• Practice

• Creativity; try different things.

• There is no magic bullet; sometimes you can’t.

Common Problems: Correlated Errors

Problem: Dependence between residuals (autocorrelation)

• The amount of error (detected by the size of the residual) you make at observation x + 1 is related to the amount of error you make at observation x.

• Why is this a problem?

• SRM assumes that the errors, ε, are independent.

• Common problem for time series data, but not just a time series problem.

• Recall the u-shaped pattern in one of the residual plots before

Common Problems: Correlated Errors

Detecting the problem:

• Easier with time series data:

• Plot the residuals versus time and look for a pattern (is the residual at t + 1 related to the residual at t?). Not guaranteed to find it but often helpful.

• Use the Durbin-Watson statistic to test for correlation between adjacent residuals (also known as serial correlation or autocorrelation)

• With time series data adjacency is temporal.

• In non-time-series data, we’re still talking about errors next to one another being related.

• For things like spatial autocorrelation, there are more advanced tools, such as mapping the residuals and running formal tests.

Durbin-Watson Statistic

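• Tests whether the correlation between adjacent residuals is 0

• Null hypothesis: H0: ρε = 0

• It’s calculated from the residuals e_1, …, e_n, taken in order, as:

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

• From the Durbin-Watson statistic, D, and the sample size, you can calculate the p-value for the hypothesis test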

• You’ll see this more in multiple regression and forecasting
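The statistic itself is easy to compute by hand from residuals taken in order. Here is a minimal sketch; the function name `durbin_watson` and the two toy residual sequences are illustrative, and the p-value step is left out because it requires tables or statistical software. D always falls between 0 and 4, and is roughly 2(1 − r), where r is the correlation between adjacent residuals, so values near 2 are consistent with no autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared differences between adjacent residuals,
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Toy residuals that drift steadily upward: adjacent values are similar,
# which is the signature of positive autocorrelation (D well below 2).
trending = [-2, -1, 0, 1, 2]

# Toy residuals that flip sign every step: negative autocorrelation (D above 2).
alternating = [1, -1, 1, -1]

print(durbin_watson(trending))
print(durbin_watson(alternating))
```

In practice the residuals come from the fitted regression and must be kept in their time (or other adjacency) order before computing D.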

Common Problems: Correlated Errors

Consequences of Dependence:

• With autocorrelation in the errors, the estimated standard errors are too small

• Estimated slope and intercept are less precise than the output indicates

Common Problems: Correlated Errors

How do you fix it?

• Try to model it directly or transform the data.

• Example: number of mobile phone users:

• Growth rate isn’t linear; try different transformations

[Figure: original data vs. transformed data]

Common Problems: Correlated Errors

Does this fix the problem?

• Linear pattern looks better

• You still need to check the other SRM conditions!!

• Omitted variables?

• Analysis of residuals. Might still be a problem.

[Figure: original data vs. transformed data]

• Regress price on weight

• Are the residuals normally distributed?

• Yes

• No

• Maybe?

• I have no idea how to verify that

• Using your regression model from the last slide, predict the price of a diamond that weighs 0.44 carats

• What is the approximate 95% confidence interval?

• [\$877.75, \$1558.61]

• [\$2324.80, \$3014.69]

• [\$-97.97, \$184.95]

• [\$2330.41, \$3009.09]

• I have no idea

• Using your regression model from the last slide, predict the price of a diamond that weighs 0.28 carats

• What is the prediction interval?

• [\$877.75, \$1558.61]

• [\$452.57, \$1129.46]

• [\$764.38, \$1058.25]

• [\$345.61, \$678.34]

• I have no idea
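The difference between the two kinds of interval in these questions can be sketched in code. The example below uses made-up diamond data (the `carats` and `prices` values are hypothetical, so its output will not match any of the answer choices) and approximates a 95% interval as the estimate plus or minus 2 standard errors. With a sample this small the exact t multiplier is larger than 2, so treat the numbers as rough.

```python
import math
import statistics as st

# Hypothetical diamond data (made up for illustration; not the lecture's data set).
carats = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
prices = [520.0, 690.0, 800.0, 1010.0, 1120.0, 1330.0, 1400.0]

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx, my = st.mean(x), st.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def approx_intervals(x, y, x0, multiplier=2.0):
    """Approximate 95% intervals at x0 using +/- 2 standard errors
    (a rough stand-in for the exact t quantile)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    mx = st.mean(x)
    sxx = sum((a - mx) ** 2 for a in x)
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))  # standard deviation of the residuals
    y_hat = b0 + b1 * x0
    # Confidence interval: uncertainty about the mean price at x0.
    se_mean = s_e * math.sqrt(1 / n + (x0 - mx) ** 2 / sxx)
    # Prediction interval: adds the variation of one new diamond around that mean.
    se_pred = s_e * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
    ci = (y_hat - multiplier * se_mean, y_hat + multiplier * se_mean)
    pi = (y_hat - multiplier * se_pred, y_hat + multiplier * se_pred)
    return y_hat, ci, pi

y_hat, ci, pi = approx_intervals(carats, prices, x0=0.44)
print(f"predicted price: {y_hat:.2f}")
print(f"approx. 95% confidence interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"approx. 95% prediction interval: [{pi[0]:.2f}, {pi[1]:.2f}]")
```

Both intervals are centered on the same predicted price, but the prediction interval is always wider because it covers a single new observation rather than the average.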

• Again, no magic bullet. Try different ones.

• How do you decide if you transform the X or Y?

• Often depends on the substance.

• Transformations

• A common mistake is to forget to convert back to the appropriate units.

• Say your data and your interest are in km/l, and you transform the response to liters per 100 km. Don’t forget to transform predictions back to km/l. Similarly, undo a natural log with the exponential [in Excel, e^x is =EXP(x)].
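A short sketch of converting back, assuming a model has already produced a prediction on the transformed scale. The function names and the predicted values are hypothetical.

```python
import math

def km_per_l_to_l_per_100km(km_per_l):
    """Fuel economy (km/l) to fuel consumption (liters per 100 km)."""
    return 100.0 / km_per_l

def l_per_100km_to_km_per_l(l_per_100km):
    """Fuel consumption (liters per 100 km) back to fuel economy (km/l)."""
    return 100.0 / l_per_100km

# Hypothetical prediction from a model fit on the liters-per-100-km scale.
predicted_l_per_100km = 8.0
predicted_km_per_l = l_per_100km_to_km_per_l(predicted_l_per_100km)
print(f"{predicted_l_per_100km} l/100 km is {predicted_km_per_l} km/l")

# Hypothetical prediction from a model fit on the natural-log scale.
predicted_log_y = 6.5
predicted_y = math.exp(predicted_log_y)  # same role as =EXP() in Excel
print(f"ln(y) = {predicted_log_y} corresponds to y = {predicted_y:.2f}")
```

The same conversion applies to the endpoints of an interval: transform each endpoint back, and for the reciprocal transform remember that the lower and upper endpoints swap.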

• Conditions for the SRM

• Know them.

• Don’t be hesitant to try to fit a model if they are violated; just be cautious.

• Some of you might think a regression model is inappropriate if you don’t see a pattern in the data.

• Totally fine to try to fit a model

• The slope will probably be 0.

Check list:

• Is the association between y and x linear?

• Maybe one could exist but you don’t obviously see it (much more common in multiple regression)

• Have omitted/lurking variables been ruled out?

• In the exam, I’ll try to give you the necessary info.

• Are the errors evidently independent?

• How do you verify this?

• Are the variances of the residuals similar?

• How do you verify this?

• Are the residuals nearly normal?

• How do you verify this?
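For the last item, the usual check is a normal quantile plot of the residuals. A numeric companion to that plot, sketched below, is the correlation between the sorted residuals and the normal quantiles they should line up with; values close to 1 mean the plot is nearly a straight line. The function name and the simulated residuals are illustrative, and this is a rough screen rather than a formal test.

```python
import math
import random
import statistics as st

def normal_plot_correlation(residuals):
    """Correlation between the sorted residuals and the standard normal
    quantiles at plotting positions (i + 0.5) / n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = st.NormalDist()
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mo, mq = st.mean(ordered), st.mean(quantiles)
    cov = sum((a - mo) * (b - mq) for a, b in zip(ordered, quantiles))
    var_o = sum((a - mo) ** 2 for a in ordered)
    var_q = sum((b - mq) ** 2 for b in quantiles)
    return cov / math.sqrt(var_o * var_q)

# Simulated residuals for illustration only.
random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(200)]
skewed = [random.expovariate(1.0) - 1.0 for _ in range(200)]

normal_score = normal_plot_correlation(normal_like)
skewed_score = normal_plot_correlation(skewed)
print(f"nearly normal residuals: {normal_score:.3f}")
print(f"right-skewed residuals:  {skewed_score:.3f}")
```

The other two "how do you verify this?" items are checked in the same spirit: plot residuals against x (or time) to look for changing spread or patterns, and use the Durbin-Watson statistic for adjacent residuals.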

• What do you need to know?

• Everything from chapters 19 through 22…

• No CAPM; we’ll come back to it.

• What do you need to know from last semester?

• Statistics builds on itself. I’ll assume you’re comfortable with some basic concepts (confidence intervals, hypothesis tests, z-scores, means, etc., etc.)

• Will there be decision problems like those on Quiz 1? Maybe, but probably not. I want this to be more applied data analysis.

• Types of Questions?

• Possibly homework-like.

• Some business-related decision making