
Statistics and Data Analysis. Professor William Greene Stern School of Business IOMS Department Department of Economics. Statistics and Data Analysis. Part 15 – Regression Models. 1/49. Linear Regression Models. Analyzing residuals Violations of assumptions


Statistics and Data Analysis

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics

Statistics and Data Analysis

Part 15 – Regression Models

1/49

- Analyzing residuals
- Violations of assumptions
- Unusual data points
- Hints for improving the model

- Model building
- Linear models – cost functions
- Semilog models – growth models
- Logs and elasticities

2/49

- Assumptions about disturbances (noise)
- Zero mean
- Constant variance
- No correlation across observations
- Normality

- Disturbances are assumed to be pure noise. Residuals should appear that way also.
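A minimal sketch of computing residuals from a simple least squares fit, assuming made-up data and hypothetical helper names (`fit_line`, `residuals`) that are not from the lecture. Note that when the model has a constant term, least squares residuals average to zero by construction, so the useful check is for patterns in the residuals rather than for their mean.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(x, y):
    """Observed y minus fitted a + b*x for each observation."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up illustrative data (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
e = residuals(x, y)
print(e)
```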

3/49

Graphics show relative sizes of the two works.

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

4/49

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Log of $price = a + b log surface area + e

5/49

Monet. Le Pont d'Argenteuil, 1874. Modified, Musée d’Orsay, Paris, October 7, 2007, anon. vandal.

6/49

Speaking of owner modified $100,000,000 paintings…

7/49

8/49

9/49

- How do you know the model is “good?”
- Various diagnostics to be developed over the semester.
- But, the first place to look is at the residuals.

10/49

- Standard application: Cost function for output of a production process.
- Compare linear equation to a quadratic model (in logs)
- (124 American Electric Utilities)

11/49

Log c = a + b log q + e

Most of the points in this area are above the regression line.

Most of the points in this area are above the regression line.

Most of the points in this area are below the regression line.

12/49

Residuals from the (log)linear cost model
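The pattern described above (points above the line at both ends and below it in the middle) is what a straight-line fit to curved data tends to produce. A small sketch, assuming synthetic noise-free data with made-up coefficients rather than the utility data:

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Synthetic data: log cost is an exact quadratic in log output (assumed values).
log_q = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
log_c = [1.0 + 0.4 * v + 0.05 * v ** 2 for v in log_q]

a, b = fit_line(log_q, log_c)
resid = [c - (a + b * q) for q, c in zip(log_q, log_c)]
signs = "".join("+" if r > 0 else "-" for r in resid)
print(signs)
```

The residual signs run positive, then negative, then positive again, which is the visual cue that a squared term is missing.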

13/49

log Cost = α + β₁ log Output + β₂ [log Output]² + ε

(Developed more fully after the midterm)

14/49

The quadratic equation is the appropriate model.

log c = a + b₁ log q + b₂ (log q)² + e
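One way the quadratic model could be fitted is by solving the least squares normal equations for the three coefficients. The sketch below is only an illustration under stated assumptions: the data are synthetic and noise free, and `solve` and `fit_quadratic` are hypothetical helper names, not the software used in the lecture.

```python
def solve(A, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - s) / M[r][r]
    return sol


def fit_quadratic(x, y):
    """Least squares a, b1, b2 for y = a + b1*x + b2*x**2 (normal equations)."""
    cols = [[1.0] * len(x), list(x), [xi ** 2 for xi in x]]
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * yi for u, yi in zip(ci, y)) for ci in cols]
    return solve(A, rhs)


# Synthetic data with assumed coefficients 1.0, 0.4, 0.05.
log_q = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
log_c = [1.0 + 0.4 * v + 0.05 * v ** 2 for v in log_q]
a, b1, b2 = fit_quadratic(log_q, log_c)
print(a, b1, b2)
```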

15/49

Residuals from the quadratic cost model

16/49

- Hetero - differences
- Scedastic - function, variation around the mean
- Arises when y is “proportional” to x
- Arises sometimes when there are natural, heterogeneous groups

17/49

Residuals from a regression of salaries on years of experience.

Standard deviation of the residuals seems not to be constant.

18/49

This usually suggests the model should be defined in terms of logs of the variable.
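A rough sketch of why logs can help, assuming synthetic salaries generated with a multiplicative disturbance (all numbers are made up). It compares the residual standard deviation in the upper half of the experience range with that in the lower half, first in levels and then in logs; `spread_ratio` is a hypothetical helper and expects x sorted in increasing order.

```python
import math
import statistics


def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def spread_ratio(x, y):
    """SD of residuals in the upper half of x divided by SD in the lower half."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    half = len(resid) // 2
    return statistics.stdev(resid[half:]) / statistics.stdev(resid[:half])


# Synthetic data: salary = 20000 * exp(0.05 * t) * exp(noise), noise = +/- 0.2.
years = list(range(1, 21))
noise = [0.2 if t % 2 else -0.2 for t in years]
salary = [20000 * math.exp(0.05 * t) * math.exp(e) for t, e in zip(years, noise)]

ratio_levels = spread_ratio(years, salary)
ratio_logs = spread_ratio(years, [math.log(s) for s in salary])
print(ratio_levels, ratio_logs)
```

In this constructed example the spread grows with experience in levels, while in logs the two halves have essentially the same spread.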

19/49

Residuals from a regression of logs of salaries on years of experience.

Salary = α e^(βt) e^ε

We will explore this model below.

20/49

Countries are ordered by the standard deviation of their 19 residuals.

Regression of log of per capita gasoline use on log of per capita income for 18 OECD countries for 19 years. The standard deviation varies by country. The “solution” is “weighted least squares.” (See text, page 659.)
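A minimal sketch of weighted least squares for a single regressor, assuming the common choice of weights proportional to the inverse of each observation's disturbance variance. The data, the weights, and the function names are made up for illustration; this is not the procedure from the text.

```python
def fit_line_weighted(x, y, w):
    """Intercept and slope minimizing sum of w_i * (y_i - a - b*x_i)**2."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Made-up data around the line y = 3 + 2x: the first three points are exact,
# the last three are noisy (assumed to come from a high-variance group).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.0, 7.0, 9.0, 14.0, 10.0, 18.0]
w = [100.0, 100.0, 100.0, 1.0, 1.0, 1.0]   # assumed inverse variances

a_ols, b_ols = fit_line_weighted(x, y, [1.0] * len(x))  # equal weights = OLS
a_wls, b_wls = fit_line_weighted(x, y, w)
print(b_ols, b_wls)
```

Down-weighting the noisy group pulls the estimated slope closer to the value used to build the data.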

21/49

- Not a problem for using least squares to estimate α or β.
- But, there is a better method than least squares.
- Assessment of the uncertainty of the least squares estimates may be too optimistic.
- (Not contagious)

22/49

- Auto – self
- Correlation – correlation
- Correlated with itself? Obviously?
- Noise in one observation is correlated with noise in other observations.
- Usually a feature of time series data
- Residuals correlated with recent past residuals
- Typically streaks of unusually high or low observations (measured against the regression)
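The "streaks" idea can be quantified with the lag-1 autocorrelation of the residuals. The sketch below assumes the residuals already have mean zero (as least squares residuals do when the model has a constant) and uses made-up sequences; `lag1_autocorrelation` is a hypothetical helper name.

```python
def lag1_autocorrelation(resid):
    """Sum of e_t * e_(t-1) divided by the sum of e_t**2 (mean-zero residuals)."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(r * r for r in resid)
    return num / den


# Made-up residual sequences, ordered in time.
streaky = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1, -1, -1]
alternating = [1, -1] * 6

r_streaky = lag1_autocorrelation(streaky)
r_alternating = lag1_autocorrelation(alternating)
print(r_streaky, r_alternating)
```

Runs of same-signed residuals give a positive value; a value near zero is what pure noise would tend to produce.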

23/49

Regression of log Gasoline on log Income (both per capita), U.S., 1953-2004. Residuals are highly autocorrelated. Same problems as heteroscedasticity. Autocorrelation can (also) be cured. Not by taking logs, however.

24/49

Outliers have (what appear to be) very large disturbances, ε

[Figure panels: Wolf weight vs. tail length; The 500 most successful movies]

25/49

Remember the empirical rule: about 99.7% of observations will lie within mean ± 3 standard deviations. (We show (a+bx) ± 3se below.)

Titanic is 8.1 standard deviations from the regression!

Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.)

These observations might deserve a close look.
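A sketch of flagging observations that fall outside (a+bx) ± k·se, assuming se is the usual standard error of the regression with n − 2 degrees of freedom. The data are synthetic with one planted outlier, and `flag_large_residuals` is a hypothetical helper, not a routine from any statistics package.

```python
import math


def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def flag_large_residuals(x, y, k=3.0):
    """Indices of observations whose residual exceeds k standard errors."""
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (len(x) - 2))
    return [i for i, r in enumerate(resid) if abs(r) > k * s]


# Synthetic data: an exact line with one planted outlier at index 15.
x = list(range(30))
y = [2.0 * xi for xi in x]
y[15] += 100.0

flagged = flag_large_residuals(x, y)
print(flagged)
```

Because the outlier itself inflates se, this kind of screen is conservative in small samples; passing a smaller `k` makes it stricter.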

26/49

(1) Examine the data

(2) Are they due to mismeasurement error or obvious “coding errors?” Delete the observations.

(3) Are they just unusual observations? Do nothing.

(4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large. 10 wolves is not.)

(5) Question why you think it is an outlier. Is it really?

27/49

“High leverage” points have unusual values of x.

Problem? The regression slope is strongly influenced by these points. Response: Unless you are strongly convinced that these are bad data, strongly resist the temptation to pay any attention to these observations.

This phenomenon is extremely hard to detect in a moderate to large sample. It is also extremely elusive when there is more than one variable in the model.


28/49

- High leverage outliers (unusual x and unusual y)
- With Titanic: 6.693 + 1.051 Domestic
- Without Titanic: 20.774 + 0.930 Domestic
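The with/without comparison can be reproduced in miniature. The sketch below uses made-up data (not the movie data) with one point far out in x, computes the standard simple-regression leverage h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², and refits without that point. The helper names are hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def leverages(x):
    """Leverage of each observation in a simple regression with a constant."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Made-up data: five ordinary points and one with an unusual x and y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 10.0]

h = leverages(x)
a_with, b_with = fit_line(x, y)
a_without, b_without = fit_line(x[:-1], y[:-1])
print(h[-1], b_with, b_without)
```

In this toy example the single distant point has leverage close to 1 and cuts the fitted slope by more than half.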

29/49

30/49

31/49

32/49

Minitab uses ± 2S to flag “large” residuals.

33/49

Be careful about singling out particular observations this way.

The resulting model might be a product of your opinions

Removing outliers might create new outliers that were not outliers before.

Statistical inferences from the model will be incorrect.

34/49

35/49

Graph -> Probability Plots …
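For readers without Minitab, the coordinates behind a normal probability plot can be sketched by pairing the sorted residuals with standard normal quantiles. The plotting positions (i − 0.5)/n used here are one simple convention and an assumption on my part; statistical packages may use a slightly different formula. The residuals are made up.

```python
from statistics import NormalDist


def normal_plot_points(resid):
    """Pairs of (standard normal quantile, sorted residual)."""
    n = len(resid)
    ordered = sorted(resid)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Made-up residuals for illustration.
resid = [-1.3, 0.4, -0.2, 0.9, 0.1, -0.6, 1.5, -0.8]
points = normal_plot_points(resid)
for q, r in points:
    print(round(q, 3), r)
```

If the residuals are roughly normal, the plotted points should lie close to a straight line.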

36/49

- Interpreting the linear model
- Semilog and growth models
- Log-log model and elasticities

37/49

The units of the LHS and RHS must be the same.

$M cost = a + b MKWH

Y = $ cost

a = $ cost = 2.444 $M

b = $M /MKWH = 0.005291 $M/MKWH

So,…..

a = fixed cost = total cost if MKWH = 0

b = marginal cost = dCost/dMKWH

b * MKWH = variable cost

Generation cost ($M) and output (Millions of KWH) for 124 American electric utilities. (1970).
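As a quick illustration of the interpretation above, the estimated coefficients can be used to split predicted cost into fixed and variable parts. The output level of 1,000 MKWH is a hypothetical value chosen for the example, and `cost_breakdown` is a hypothetical helper name.

```python
A_FIXED = 2.444        # $M, estimated intercept from the slide
B_MARGINAL = 0.005291  # $M per MKWH, estimated slope from the slide


def cost_breakdown(mkwh):
    """Return (fixed, variable, total) predicted cost in $M for a given output."""
    variable = B_MARGINAL * mkwh
    return A_FIXED, variable, A_FIXED + variable


fixed, variable, total = cost_breakdown(1000.0)  # hypothetical output level
print(fixed, variable, total)
```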

38/49

LogSalary = 9.84 + 0.05 Years + e
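A short sketch of how the semilog coefficients can be read, assuming the logs are natural logs (the slide does not say which base is used). Under that assumption each additional year multiplies predicted salary by e^0.05.

```python
import math

A = 9.84  # intercept from the slide
B = 0.05  # coefficient on Years from the slide

# Assumption: natural logs. Each extra year scales salary by exp(B).
growth_factor = math.exp(B)
pct_per_year = 100.0 * (growth_factor - 1.0)

# Predicted salary at zero years of experience (ignoring the disturbance).
salary_at_zero = math.exp(A)

print(round(pct_per_year, 3), round(salary_at_zero))
```

The implied growth is a little over 5% per year, slightly more than 100·b, because exp(b) − 1 exceeds b for b > 0.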

39/49

40/49

Frequent Flyer Flights for 72 Months. (Text, Ex. 11.1, p. 508)

41/49

logFlights = α + β Months + ε

a = 2.770, b = 0.03710, s = 0.06102
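Assuming natural logs again (an assumption, since the base is not stated), the slope b can be turned into a monthly growth rate, an annualized growth rate, and a doubling time:

```python
import math

B = 0.03710  # estimated slope from the slide (per month)

# Assumption: logFlights is a natural log.
monthly_growth = math.exp(B) - 1.0       # proportional growth per month
annual_growth = math.exp(12 * B) - 1.0   # compounded over 12 months
doubling_months = math.log(2.0) / B      # months needed to double

print(monthly_growth, annual_growth, doubling_months)
```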

42/49

- logY = α + βlogx + ε
- Elasticities
- Gasoline income elasticity
- The linear and loglinear models give similar answers
- Price elasticity

43/49

- The “responsiveness” of one variable to changes in another
- E.g., in economics demand elasticity = (%ΔQ) / (%ΔP)
- Math: Ratio of percentage changes
- %ΔQ / %ΔP = {100%[(ΔQ)/Q]} / {100%[(ΔP)/P]}
- Units of measurement and the 100% fall out of this eqn.
- Elasticity = (ΔQ/ΔP)*(P/Q)
- Elasticities are units free
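The ratio-of-percentage-changes formula in the list above can be checked numerically. The prices and quantities below are hypothetical; the second call rescales price from dollars to cents to illustrate that the elasticity does not depend on units.

```python
def elasticity(p0, q0, p1, q1):
    """(dQ/dP) * (P/Q), evaluated at the starting point (p0, q0)."""
    return ((q1 - q0) / (p1 - p0)) * (p0 / q0)


# Hypothetical example: price rises 10% (2.00 -> 2.20), quantity falls 2%.
e_dollars = elasticity(2.00, 100.0, 2.20, 98.0)

# Same change with price measured in cents instead of dollars.
e_cents = elasticity(200.0, 100.0, 220.0, 98.0)

print(e_dollars, e_cents)
```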

44/49

45/49

Q = α P^β e^ε, so

log Q = a + β log P + ε, where a = log α.

Then β = d log Q / d log P is the elasticity.
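The step from the multiplicative model to the elasticity can be written out as follows. This is a standard calculus argument added for completeness, using the same symbols as the slide.

```latex
Q = \alpha P^{\beta} e^{\varepsilon}
\;\Longrightarrow\;
\log Q = \log\alpha + \beta \log P + \varepsilon ,
\qquad
\frac{d\log Q}{d\log P}
  = \frac{dQ/Q}{dP/P}
  = \frac{dQ}{dP}\cdot\frac{P}{Q}
  = \beta .
```

The middle expression is the ratio of proportional changes, so the log-log slope matches the definition Elasticity = (ΔQ/ΔP)·(P/Q) for small changes.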

46/49

Using Logs

Regression Analysis: Log-Gas_t versus LogPG_t

The regression equation is

Log-Gas_t = 0.372 - 0.169 LogPG_t

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 0.372140 | 0.008433 | 44.13 | 0.000 |
| LogPG_t | -0.16949 | 0.03827 | -4.43 | 0.000 |

S = 0.0608113, R-Sq = 28.2%

Using Levels

Regression Analysis: Gas_t versus PGas_t

The regression equation is

Gas_t = 1.66 - 0.199 PGas_t

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 1.65874 | 0.05803 | 28.58 | 0.000 |
| PGas_t | -0.19928 | 0.05516 | -3.61 | 0.001 |

S = 0.0941783, R-Sq = 20.7%

47/49

Elasticity in the loglinear model is b = -0.1695. Elasticity in the linear model at the mean of G of 1.4545 and mean of PG of 1.0251 is -0.1993(1.0251/1.4545) = -0.1404.
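The comparison can be reproduced from the reported regression output. The sketch below uses only the coefficients and sample means quoted in the slides; the variable names are my own.

```python
# Coefficients from the two reported regressions.
b_loglinear = -0.16949   # slope on LogPG_t in the log-log model
b_linear = -0.19928      # slope on PGas_t in the levels model

# Sample means quoted on the slide.
mean_g = 1.4545
mean_pg = 1.0251

# In the log-log model the slope is the elasticity itself.
elasticity_loglinear = b_loglinear

# In the linear model, elasticity = slope * (P / Q), evaluated at the means.
elasticity_linear = b_linear * (mean_pg / mean_g)

print(elasticity_loglinear, elasticity_linear)
```

The two estimates are of similar size, which is the sense in which the linear and loglinear models "give similar answers" here.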

48/49

49/49

- Residual analysis
- Consistent with model assumptions?
- Suggest missing elements in the model

- Building the regression model
- Interpreting the model – cost function
- Growth model – semilog
- Double log and estimating elasticities