
Presentation Transcript



Statistics and Data Analysis

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics



Statistics and Data Analysis

Part 15 – Regression Models



Linear Regression Models

  • Analyzing residuals

    • Violations of assumptions

    • Unusual data points

    • Hints for improving the model

  • Model building

    • Linear models – cost functions

    • Semilog models – growth models

    • Logs and elasticities



Model Assumptions

  • Assumptions about disturbances (noise)

    • Zero mean

    • Constant variance

    • No correlation across observations

    • Normality

  • Disturbances are assumed to be pure noise. Residuals should appear that way also.



An Enduring Art Mystery

Graphics show relative sizes of the two works.

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931



Monet in Large and Small

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

log Price ($) = a + b log(Surface Area) + e



Speaking of Monet…

Monet. Le Pont d'Argenteuil, 1874. Modified, Musée d’Orsay, Paris, October 7, 2007, anon. vandal.



Speaking of owner-modified $100,000,000 paintings…



The Data



Monet Regression



Using the Residuals

  • How do you know the model is “good?”

  • Various diagnostics to be developed over the semester.

  • But the first place to look is at the residuals; a short sketch follows below.
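To make the point concrete, here is a minimal sketch of the kind of residual check the slide describes. The data are simulated for illustration only, not the lecture's data, and the variable names are assumptions.

```python
import numpy as np

# Simulated illustration (not the lecture's data): fit a line by least squares
# and inspect the residuals, the "first place to look".
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, 200)

b, a = np.polyfit(x, y, 1)          # slope, intercept
residuals = y - (a + b * x)

# If the model is adequate, residuals have mean ~0, roughly constant spread,
# and no remaining relationship with x.
print("mean residual :", residuals.mean())
print("residual sd   :", residuals.std(ddof=2))
print("corr(resid, x):", np.corrcoef(residuals, x)[0, 1])
```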



Residuals Can Signal a Flawed Model

  • Standard application: Cost function for output of a production process.

  • Compare linear equation to a quadratic model (in logs)

  • (124 American Electric Utilities)



Candidate Model for Cost

Log c = a + b log q + e

Callouts on the plot mark two regions where most of the points lie above the regression line and a third region where most of the points lie below it.



A Missing Variable?

Residuals from the (log)linear cost model



A Better Model?

log Cost = α + β1 log Output + β2 (log Output)² + ε

(Developed more fully after the midterm)
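A minimal sketch of the comparison the slides make, using simulated cost data rather than the 124-utility dataset; the coefficients and sample size below are invented for illustration.

```python
import numpy as np

# Simulated cost data (not the 124 utilities): true log cost is quadratic in log output.
rng = np.random.default_rng(1)
log_q = rng.uniform(0.0, 5.0, 150)
log_c = 1.0 + 0.4 * log_q + 0.08 * log_q**2 + rng.normal(0.0, 0.2, 150)

# Candidate 1: log c = a + b log q
b1, a1 = np.polyfit(log_q, log_c, 1)
resid_linear = log_c - (a1 + b1 * log_q)

# Candidate 2: log c = a + b1 log q + b2 (log q)^2
c2, c1, c0 = np.polyfit(log_q, log_c, 2)
resid_quad = log_c - (c0 + c1 * log_q + c2 * log_q**2)

# The first model leaves curvature behind: its residuals still move with (log q)^2.
print("corr(resid_linear, (log q)^2):", np.corrcoef(resid_linear, log_q**2)[0, 1])
print("corr(resid_quad,   (log q)^2):", np.corrcoef(resid_quad, log_q**2)[0, 1])
```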



Candidate Models for Cost

The quadratic equation is the appropriate model.

log c = a + b1 log q + b2 (log q)² + e



Missing Variable Included

Residuals from the quadratic cost model



Heteroscedasticity

  • Hetero - differences

  • Scedastic - scatter, or variation around the mean

  • Arises when the spread of y grows with its level, e.g., when y is roughly "proportional" to x

  • Arises sometimes when there are natural, heterogeneous groups



Heteroscedasticity

Residuals from a regression of salaries on years of experience.

Standard deviation of the residuals seems not to be constant.
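A small sketch of the same diagnostic on simulated salary data (not the data behind the slide): the spread of the residuals grows with experience, so the residual standard deviation differs across the range of x.

```python
import numpy as np

# Simulated salary data (illustration only): noise grows with years of experience.
rng = np.random.default_rng(2)
years = rng.uniform(1.0, 30.0, 300)
salary = 30_000 + 2_000 * years + 500 * years * rng.normal(0.0, 1.0, 300)

b, a = np.polyfit(years, salary, 1)
resid = salary - (a + b * years)

# Compare residual spread in the lower and upper halves of the experience range.
cut = np.median(years)
print("residual sd, low experience :", resid[years < cut].std(ddof=1))
print("residual sd, high experience:", resid[years >= cut].std(ddof=1))
```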



Problem with the Model?

This usually suggests that the model should be specified in terms of the log of the dependent variable.



Sometimes Heteroscedasticity Can Be Cured By Taking Logs

Residuals from a regression of logs of salaries on years of experience.

Salary = α e^(βt) e^ε, so log Salary = log α + βt + ε

We will explore this model below.



Sometimes Not …

Countries are ordered by the standard deviation of their 19 residuals.

Regression of log of per capita gasoline use on log of per capita income for 18 OECD countries for 19 years. The standard deviation varies by country. The “solution” is “weighted least squares.” (See text, page 659.)
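As a rough illustration of the weighted-least-squares idea mentioned here (not the textbook's exact procedure on the OECD panel), the sketch below assumes each observation's noise standard deviation is known and reweights the data by it; everything in it is simulated.

```python
import numpy as np

# Weighted least squares sketch. Assumption (for illustration): each observation's
# noise sd, sigma_i, is known, so the weight is 1 / sigma_i^2.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0.0, 10.0, n)
sigma = 0.5 + 0.3 * x                        # spread grows with x (heteroscedastic)
y = 1.0 + 2.0 * x + sigma * rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS = ordinary least squares applied to the data divided by sigma_i.
beta_wls, *_ = np.linalg.lstsq(X / sigma[:, None], y / sigma, rcond=None)

print("OLS (intercept, slope):", beta_ols)   # still unbiased, just less precise
print("WLS (intercept, slope):", beta_wls)
```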



Should I Worry About Heteroscedasticity?

  • Not a problem for using least squares to estimate α or β.

  • But, there is a better method than least squares.

  • Assessment of the uncertainty of the least squares estimates may be too optimistic.

  • (Not contagious)



Autocorrelation

  • Auto – self

  • Correlation – correlation

  • Correlated with itself? Obviously?

  • Noise in one observation is correlated with noise in other observations.

  • Usually a feature of time series data

    • Residuals correlated with recent past residuals

    • Typically streaks of unusually high or low observations (measured against the regression)



Time Series Regression

Regression of log Gasoline on log Income (both per capita), U.S., 1953-2004. The residuals are highly autocorrelated, which causes the same problems as heteroscedasticity. Autocorrelation can (also) be cured, though not by taking logs.
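A short sketch of how one might check for this kind of autocorrelation: compute the lag-1 correlation of the residuals and the Durbin-Watson statistic. The residual series below is simulated as an AR(1) process, not the gasoline residuals.

```python
import numpy as np

# Simulated residual series with positive autocorrelation (AR(1)), for illustration.
rng = np.random.default_rng(4)
T = 52
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.8 * e[t - 1] + rng.normal(0.0, 1.0)

r1 = np.corrcoef(e[:-1], e[1:])[0, 1]            # lag-1 residual correlation
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)    # Durbin-Watson: ~2 means no autocorrelation

print("lag-1 correlation:", r1)
print("Durbin-Watson    :", dw)
```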



Unusual Data Points

Outliers have (what appear to be) very large disturbances, ε

Two examples: wolf weight vs. tail length, and the 500 most successful movies.



Outliers (?)

Remember the empirical rule: about 99.7% of observations lie within the mean ± 3 standard deviations. The plot shows the bands (a + bx) ± 3se.

Titanic is 8.1 standard deviations from the regression!

Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.)

These observations might deserve a close look.
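The ±3 standard-deviation screen is easy to automate. The sketch below uses simulated data with one planted extreme point, not the 466-movie dataset.

```python
import numpy as np

# Flag observations outside the (a + b x) +/- 3 s_e band.
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 100.0, 466)
y = 5.0 + 1.0 * x + rng.normal(0.0, 10.0, 466)
y[0] += 90.0                                  # plant one unusual observation

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s_e = resid.std(ddof=2)                       # residual standard deviation

outside = np.abs(resid) > 3.0 * s_e
print("observations outside the band:", int(outside.sum()))
print("share outside the band       :", outside.mean())
```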



What to Do About Outliers

(1) Examine the data

(2) Are they due to measurement errors or obvious coding errors? Delete those observations.

(3) Are they just unusual observations? Do nothing.

(4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large; 10 wolves is not.)

(5) Question why you think it is an outlier. Is it really?



High Leverage Points

“High leverage” points have unusual values of x.

Problem? The regression slope is strongly influenced by these points. Response: unless you are strongly convinced that these are bad data, resist the temptation to pay special attention to these observations.

This phenomenon is extremely hard to detect in a moderate to large sample. It is also extremely elusive when there is more than one variable in the model.




Highly Influential Points

  • High leverage outliers (unusual x and unusual y)

  • With Titanic: 6.693 + 1.051 Domestic

  • Without Titanic: 20.774 + 0.930 Domestic



Regression Options



Save Residuals



Residuals



Minitab’s Opinions

Minitab uses ± 2S to flag “large” residuals.



On Removing Outliers

Be careful about singling out particular observations this way.

The resulting model might be a product of your opinions rather than of the data.

Removing outliers might create new outliers that were not outliers before.

Statistical inferences from the model will be incorrect.



Normal Distribution of ei?



Probability Plot

Graph -> Probability Plots …
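Outside Minitab, the same kind of normal probability plot of residuals can be produced with scipy; the residuals below are simulated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Normal probability plot of residuals (same idea as Minitab's probability plot).
rng = np.random.default_rng(6)
resid = rng.normal(0.0, 1.0, 300)             # simulated residuals for illustration

stats.probplot(resid, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```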



Using and Interpreting the Model

  • Interpreting the linear model

  • Semilog and growth models

  • Log-log model and elasticities



Statistical Cost Analysis

The units of the LHS and RHS must be the same.

$M Cost = a + b · MKWH

Y = cost in $M

a is measured in $M: a = 2.444 $M

b is measured in $M per MKWH: b = 0.005291 $M/MKWH

So…

a = fixed cost = total cost if MKWH = 0

b = marginal cost = dCost/dMKWH

b * MKWH = variable cost

Generation cost ($M) and output (Millions of KWH) for 124 American electric utilities. (1970).
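Plugging the slide's estimates into the cost equation gives a quick feel for the numbers; the output level used below is hypothetical.

```python
# Cost components from the slide's fitted equation, $M Cost = a + b * MKWH.
a = 2.444        # fixed cost, $M
b = 0.005291     # marginal cost, $M per million KWH

mkwh = 1000.0                     # hypothetical output: 1,000 million KWH
variable_cost = b * mkwh          # 5.291 $M
total_cost = a + variable_cost    # 7.735 $M

print(f"variable cost: {variable_cost:.3f} $M")
print(f"total cost   : {total_cost:.3f} $M")
```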



Semilog Models and Growth Rates

log Salary = 9.84 + 0.05 Years + e
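In a semilog model the slope is (approximately) the proportional growth per year; a one-line check using the slide's coefficient:

```python
import numpy as np

# Growth-rate reading of the semilog model log Salary = 9.84 + 0.05 Years + e.
b = 0.05
print(f"approximate growth per year: {b:.2%}")               # the coefficient itself
print(f"exact growth per year      : {np.exp(b) - 1:.2%}")   # e^b - 1
```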



Growth in a Semilog Model



Using Semilog Models for Trends

Frequent Flyer Flights for 72 Months. (Text, Ex. 11.1, p. 508)



Regression Approach

logFlights = α + β Months + ε

a = 2.770, b = 0.03710, s = 0.06102
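A sketch of how the fitted trend could be used for a rough forecast, assuming natural logs and a naive exp() retransformation (the text may apply a correction that is omitted here); the forecast month is chosen only for illustration.

```python
import numpy as np

# Trend extrapolation from logFlights = a + b * Month, with a = 2.770, b = 0.03710.
a, b = 2.770, 0.03710
month = 73                                   # one month past the 72-month sample

log_forecast = a + b * month
print(f"forecast of log(flights) : {log_forecast:.3f}")
print(f"naive forecast of flights: {np.exp(log_forecast):.1f}")
print(f"implied growth per month : {np.exp(b) - 1:.2%}")
```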



Loglinear Models

  • log Y = α + β log X + ε

  • Elasticities

  • Gasoline income elasticity

  • The linear and loglinear models give similar answers

  • Price elasticity



Elasticity and Loglinear Models

  • The “responsiveness” of one variable to changes in another

  • E.g., in economics demand elasticity = (%ΔQ) / (%ΔP)

  • Math: Ratio of percentage changes

    • %ΔQ / %ΔP = {100% · (ΔQ/Q)} / {100% · (ΔP/P)}

    • Units of measurement and the 100% fall out of this eqn.

    • Elasticity = (ΔQ/ΔP)*(P/Q)

    • Elasticities are units free



Linear Demand Curves



Loglinear Demand Curves

Q = α P^β e^ε, so

log Q = log α + β log P + ε

Then β = d log Q / d log P is the elasticity.
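A one-step derivation, using only the definitions from the previous slide, of why this slope is the elasticity:

```latex
\[
Q = \alpha P^{\beta} e^{\varepsilon}
\;\Rightarrow\;
\log Q = \log\alpha + \beta \log P + \varepsilon
\;\Rightarrow\;
\beta = \frac{d\log Q}{d\log P}
      = \frac{dQ/Q}{dP/P}
      = \frac{\%\Delta Q}{\%\Delta P}.
\]
```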



Demand Models

Using Logs

Regression Analysis: Log-Gas_t versus LogPG_t

The regression equation is
Log-Gas_t = 0.372 - 0.169 LogPG_t

Predictor    Coef       SE Coef    T       P
Constant     0.372140   0.008433   44.13   0.000
LogPG_t     -0.16949    0.03827    -4.43   0.000

S = 0.0608113   R-Sq = 28.2%

Using Levels

Regression Analysis: Gas_t versus PGas_t

The regression equation is
Gas_t = 1.66 - 0.199 PGas_t

Predictor    Coef       SE Coef    T       P
Constant     1.65874    0.05803    28.58   0.000
PGas_t      -0.19928    0.05516    -3.61   0.001

S = 0.0941783   R-Sq = 20.7%



Linear and Loglinear Models

Elasticity in the loglinear model is b = -0.1695. Elasticity in the linear model, evaluated at the mean of G (1.4545) and the mean of PG (1.0251), is -0.1993 × (1.0251/1.4545) = -0.1404.
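The arithmetic behind this comparison, reproduced directly from the numbers on the slides:

```python
# Elasticity comparison using the two fitted demand equations above.
b_loglinear = -0.16949            # slope of Log-Gas_t on LogPG_t: already an elasticity
b_linear = -0.19928               # slope of Gas_t on PGas_t (levels)
mean_G, mean_PG = 1.4545, 1.0251  # sample means reported on the slide

elasticity_linear = b_linear * (mean_PG / mean_G)
print("loglinear elasticity                :", round(b_loglinear, 4))
print("linear-model elasticity at the means:", round(elasticity_linear, 4))  # about -0.1404
```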



Income Elasticity



Summary

  • Residual analysis

    • Consistent with model assumptions?

    • Suggest missing elements in the model

  • Building the regression model

    • Interpreting the model – cost function

    • Growth model – semilog

    • Double log and estimating elasticities

