Lecture 9: Diagnostics & Review

February 10, 2014

A least squares regression line is determined from a sample of values for variables x and y where

x = size of a listed home (in sq feet)

y = selling price of the home (in $)

Which of the following is true about the model b0 + b1x?

- If there is positive correlation r between x and y, then b1 must be positive
- The units of the intercept and slope will be the same as the response variable, y.
- If r² = 0.85, then it is appropriate to conclude that a change in x will cause a change in y
- None of the above, more than one of the above, or not enough information to tell.

- If there is positive correlation r between x and y, then b1 must be positive

True: b1 = r * sy / sx

So if r > 0, then b1 is positive, because the standard deviations sy and sx are both positive.
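
The identity b1 = r * sy / sx can be checked numerically. The sketch below uses a small made-up sample (the sizes and prices are hypothetical, not data from the lecture) and compares the slope obtained from r, sy and sx with the slope computed directly from the least squares formula.

```python
import math

# Hypothetical sample: home sizes (sq ft) and selling prices ($).
sqft = [1200, 1500, 1700, 2100, 2400, 3000]
price = [150000, 185000, 199000, 260000, 275000, 345000]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_yy = sum((y - y_bar) ** 2 for y in price)

# Sample correlation and sample standard deviations.
r = s_xy / math.sqrt(s_xx * s_yy)
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

b1_from_r = r * s_y / s_x      # slope via the identity
b1_direct = s_xy / s_xx        # slope via least squares
b0 = y_bar - b1_direct * x_bar

print(f"r = {r:.4f}, b1 (from r) = {b1_from_r:.4f}, b1 (direct) = {b1_direct:.4f}")
```

The two slopes agree up to rounding, and the sign of the slope always matches the sign of r.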

- Problem set 4 due (9am)
- How was it?

- Next week: Multiple Regression
- Exam Wednesday
- Sample question
- Taken from Exam 1 - #37 last year

- Sample question

- What did we talk about?
- Outliers
- Sensitivity analysis

- Heteroscedasticity

Say we’re estimating the price of a lease from the size of the house:

Price = β0 + β1 * SqFt + ε

Interpretation of the estimates?

- β0 would be fixed costs and
- β1 would be marginal costs

Heteroscedasticity: What does that mean for your analysis?

- Point estimates for β’s?
- Still OK. No bias.

- Prediction and Confidence intervals?
- Not reliable; too narrow or too wide.
- Hypothesis tests regarding β0 and β1 are not reliable.

Fixing the problem:

- Revise the model: how will depend on the substance.
- Try revising the model to estimate Price/SqFt by dividing the original eq by SqFt:

- Notice the change in the intercept and slope:
- Don’t be locked into thinking the intercept is fixed cost
- How to interpret them depends
- Think about the data!
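
Dividing the original equation through by SqFt shows why the roles of intercept and slope swap. A short derivation, using the β's from the original model:

```latex
\begin{align*}
\text{Price} &= \beta_0 + \beta_1\,\text{SqFt} + \varepsilon \\
\frac{\text{Price}}{\text{SqFt}} &= \beta_1 + \beta_0\,\frac{1}{\text{SqFt}} + \frac{\varepsilon}{\text{SqFt}}
\end{align*}
```

In the revised model the intercept is β1 (marginal cost per sq ft) and the slope on 1/SqFt is β0 (fixed cost). If the spread of ε grows roughly in proportion to SqFt (an assumption made here for illustration), then ε/SqFt has roughly constant spread.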

Fixing the problem:

Price/SqFt = M + F * (1/SqFt) + ε

- Revise by thinking about the substance
- Here that meant predicting price per sq ft directly.

- Don’t revise by doing weird things
- Use theory!

- After revising, check whether the residuals have similar variances.
- Sometimes they won’t.
- In this case they do.
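
The effect of the revision can be sketched with simulated data. Everything below is hypothetical: the fixed cost, the marginal cost and the assumption that the error's standard deviation is proportional to SqFt are chosen for illustration and are not the lecture's data. The sketch fits both models and compares the residual spread for the larger half of homes with that for the smaller half.

```python
import random
import statistics as st

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    x_bar, y_bar = st.mean(x), st.mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

def spread_ratio(sqft, resid):
    """Residual SD of the larger half of homes divided by that of the smaller half."""
    ordered = [e for _, e in sorted(zip(sqft, resid))]
    half = len(ordered) // 2
    return st.pstdev(ordered[half:]) / st.pstdev(ordered[:half])

rng = random.Random(0)
FIXED, MARGINAL = 500.0, 1.2   # hypothetical: $ and $ per sq ft
sqft = [rng.uniform(500, 3000) for _ in range(400)]
# Assumed error structure: SD of the error is proportional to SqFt.
price = [FIXED + MARGINAL * s + rng.gauss(0, 0.1 * s) for s in sqft]

# Original model: Price on SqFt.
b0, b1 = fit_line(sqft, price)
resid_orig = [p - (b0 + b1 * s) for s, p in zip(sqft, price)]

# Revised model: Price/SqFt on 1/SqFt.
inv_sqft = [1 / s for s in sqft]
price_per_sqft = [p / s for s, p in zip(sqft, price)]
m, f = fit_line(inv_sqft, price_per_sqft)   # m: marginal cost, f: fixed cost
resid_rev = [y - (m + f * x) for x, y in zip(inv_sqft, price_per_sqft)]

ratio_orig = spread_ratio(sqft, resid_orig)
ratio_rev = spread_ratio(sqft, resid_rev)
print(f"original model spread ratio: {ratio_orig:.2f}")
print(f"revised model spread ratio:  {ratio_rev:.2f}")
```

A ratio well above 1 signals that residuals fan out for large homes; a ratio near 1 is what similar variances look like.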

Comparing the revised and original model:

- Revised model may have different (and smaller) R².
- Again, so? R² is great but it’s only one notion of fit.

- In the example, the revised model provides a narrower (hence better) confidence interval for fixed and variable costs:

[Figure: original model vs. revised model]

Comparing the revised and original model:

- It also provides a more sensible prediction interval
- The data originally indicated that large homes varied in price more:

How do you know how to remodel the problem?

- Practice
- Creativity; try different things.
- There is no magic bullet; sometimes you can’t.

Problem: Dependence between residuals (autocorrelation)

- The amount of error (detected by the size of the residual) you make at observation x + 1 is related to the amount of error you make at observation x.
- Why is this a problem?
- SRM assumes that the errors, ε, are independent.
- Common problem for time series data, but not just a time series problem.
- Recall the U-shaped pattern in one of the residual plots before

Detecting the problem:

- Easier with time series data:
- plot the residuals versus time and look for a pattern (is t+1 related to t?). Not guaranteed to find it but often helpful.

- Use the Durbin-Watson statistic to test for correlation between adjacent residuals (aka serial correlation or autocorrelation)
- With time series data adjacency is temporal.
- In non time series data, we’re still talking about errors next to one another being related.
- For things like spatial autocorrelation, there are more advanced things like mapping the residuals and tests we can do

- Tests to see if the correlation between the residuals is 0
- Null hypothesis: H0: ρε = 0

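- It’s calculated as: D = Σ (e_t − e_{t−1})² / Σ e_t², where e_t is the residual at observation t, the numerator sums over t = 2, …, n, and the denominator over t = 1, …, n.
- From the Durbin-Watson statistic, D, and the sample size you can calculate the p-value for the hypothesis test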
- You’ll see this more in multiple regression and forecasting
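
A minimal sketch of the Durbin-Watson calculation, using its standard definition (sum of squared differences between adjacent residuals, divided by the sum of squared residuals). The two residual series are made up for illustration. Values of D near 2 are consistent with no autocorrelation, values near 0 point to positive autocorrelation, and values near 4 point to negative autocorrelation.

```python
def durbin_watson(residuals):
    """D = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, listed in time order.
trending = [-3, -2, -1, 0, 1, 2, 3]       # neighbours are alike
alternating = [1, -1, 1, -1, 1, -1, 1]    # neighbours flip sign

d_trending = durbin_watson(trending)        # 6/28, close to 0
d_alternating = durbin_watson(alternating)  # 24/7, close to 4
print(f"trending: D = {d_trending:.3f}, alternating: D = {d_alternating:.3f}")
```

The p-value step is not shown here; it depends on tables or software for the distribution of D.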

Consequences of Dependence:

- With autocorrelation in the errors the estimated standard errors are too small
- Estimated slope and intercept are less precise than the output indicates

How do you fix it?

- Try to model it directly or transform the data.
- Example: number of mobile phone users:
- Growth rate isn’t linear; try different transformations

[Figure: original data vs. transformed data]

Does this fix the problem?

- Linear pattern looks better
- You still need to check the other SRM conditions!!
- Omitted variables?
- Analysis of residuals. Might still be a problem.
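
As one illustration of trying a transformation, the sketch below assumes roughly exponential growth and takes the natural log of the response. Both the series and the choice of the log are assumptions made for this example; the slides do not say which transformation was used for the mobile phone data. The correlation with time is compared before and after transforming.

```python
import math

def correlation(x, y):
    """Sample correlation between two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical series: users grow about 40% per period, with a small wiggle.
periods = list(range(12))
users = [5.0 * 1.4 ** t * (1 + 0.05 * (-1) ** t) for t in periods]

r_raw = correlation(periods, users)
r_log = correlation(periods, [math.log(u) for u in users])
print(f"r with users: {r_raw:.3f}, r with ln(users): {r_log:.3f}")
```

A higher correlation after the transformation only says the pattern is more linear; the residuals of the transformed fit still need to be checked.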

[Figure: original data vs. transformed data]

- Download diamonds.xlsx
- Regress price on weight
- Are the residuals normally distributed?
- Yes
- No
- Maybe?
- I have no idea how to verify that

- Using your regression model from the last slide, predict the price of a diamond that weighs 0.44 carats
- What is the approximate 95% confidence interval?
- [$877.75, $1558.61]
- [$2324.80, $3014.69]
- [$-97.97, $184.95]
- [$2330.41, $3009.09]
- I have no idea

- Using your regression model from the last slide, predict the price of a diamond that weighs 0.28 carats
- What is the prediction interval?
- [$877.75, $1558.61]
- [$452.57, $1129.46]
- [$764.38, $1058.25]
- [$345.61, $678.34]
- I have no idea
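
The difference between a confidence interval for the average price and a prediction interval for a single diamond can be sketched as follows. The weights and prices are hypothetical and are not the diamonds.xlsx data, so the output does not answer the questions above. The multiplier of 2 is an approximation to the t quantile; for a sample this small the exact quantile is noticeably larger.

```python
import math
import statistics as st

# Hypothetical (weight in carats, price in $) pairs; NOT the diamonds.xlsx data.
weights = [0.30, 0.35, 0.40, 0.41, 0.43, 0.45, 0.48, 0.50]
prices = [1300, 1500, 1700, 1800, 1900, 2000, 2100, 2300]

n = len(weights)
x_bar, y_bar = st.mean(weights), st.mean(prices)
s_xx = sum((x - x_bar) ** 2 for x in weights)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, prices)) / s_xx
b0 = y_bar - b1 * x_bar

# Standard deviation of the residuals, with n - 2 degrees of freedom.
resid = [y - (b0 + b1 * x) for x, y in zip(weights, prices)]
s_e = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

def intervals(x0, mult=2.0):
    """Approximate 95% intervals at x0, using mult in place of the t quantile."""
    y_hat = b0 + b1 * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = s_e * math.sqrt(leverage)       # for the average price at x0
    se_pred = s_e * math.sqrt(1 + leverage)   # for one new diamond at x0
    ci = (y_hat - mult * se_mean, y_hat + mult * se_mean)
    pi = (y_hat - mult * se_pred, y_hat + mult * se_pred)
    return y_hat, ci, pi

y_hat, ci, pi = intervals(0.44)
print(f"prediction: {y_hat:.2f}")
print(f"approx. 95% confidence interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"approx. 95% prediction interval: [{pi[0]:.2f}, {pi[1]:.2f}]")
```

The prediction interval is always the wider of the two, because it also has to cover the variation of an individual diamond around the line.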

- Question about transformations:
- Again, no magic bullet. Try different ones.
- How do you decide if you transform the X or Y?
- Often depends on the substance.

- Transformations
- A common mistake is to forget to convert back to the appropriate units.
- Say your data and your interest are in km/l, and you transform the response to liters / 100 km. Don’t forget to transform the predictions back to km/l. The same goes for ln(x): undo the log with the exponential [in Excel, =EXP()].
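
A small sketch of converting predictions back to the units of interest. The function names and the example numbers are hypothetical.

```python
import math

def km_per_l_to_l_per_100km(km_per_l):
    """Transform fuel economy into consumption (liters per 100 km)."""
    return 100.0 / km_per_l

def l_per_100km_to_km_per_l(l_per_100km):
    """Undo the transformation: back to km per liter."""
    return 100.0 / l_per_100km

# Suppose the model, fit on liters / 100 km, predicts 8.0 for some car.
predicted_consumption = 8.0
predicted_economy = l_per_100km_to_km_per_l(predicted_consumption)   # 12.5 km/l

# Suppose another model, fit on ln(price), predicts 7.0 for some item.
predicted_log_price = 7.0
predicted_price = math.exp(predicted_log_price)   # back to dollars

print(f"{predicted_economy} km/l, ${predicted_price:.2f}")
```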

- Conditions for the SRM
- Know them.
- Don’t be hesitant to try to fit a model if they are violated; just be cautious.
- Some of you might think a regression model is inappropriate if you don’t see a pattern in the data:
- Totally fine to try to fit a model
- The slope will probably be 0.

Check list:

- Is the association between y and x linear?
- Maybe one could exist but you don’t obviously see it (much more common in multiple regression)

- Have omitted/lurking variables been ruled out?
- In the exam, I’ll try to give you the necessary info.

- Are the errors evidently independent?
- How do you verify this?

- Are the variances of the residuals similar?
- How do you verify this?

- Are the residuals nearly normal?
- How do you verify this?

- What do you need to know?
- Everything from chapters 19 through 22…
- No CAPM; we’ll come back to it.

- What do you need to know from last semester?
- Statistics builds on itself. I’ll assume you’re comfortable with some basic concepts (confidence intervals, hypothesis tests, z-scores, means, etc.)
- Will there be decision problems like those on Quiz 1? Maybe, but probably not. I want this to be more applied data analysis.

- Types of Questions?
- Possibly homework-like.
- Some business-related decision making
- Some non-business-related analysis

- Best way to study?
- Do the problems. Then do more.