
Statistics and Data Analysis

Professor William Greene

Stern School of Business

IOMS Department

Department of Economics


Part 15 – Regression Models

1/49 Linear Regression Models
  • Analyzing residuals
    • Violations of assumptions
    • Unusual data points
    • Hints for improving the model
  • Model building
    • Linear models – cost functions
    • Semilog models – growth models
    • Logs and elasticities
2/49 Model Assumptions
  • Assumptions about disturbances (noise)
    • Zero mean
    • Constant variance
    • No correlation across observations
    • Normality
  • Disturbances are assumed to be pure noise. Residuals should appear that way also.
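The assumptions above can be checked informally on the residuals. The following is a minimal sketch in plain Python on simulated data; the data, the seed, and the helper names `fit_line`, `residuals`, and `std` are illustrative assumptions, not part of the course materials. Least squares residuals average to zero by construction when the model has an intercept, so the informative checks are on spread and pattern.

```python
import random

def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def residuals(x, y, a, b):
    """Observed y minus fitted a + b*x."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def std(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

# Simulated data that satisfies the assumptions (hypothetical values).
random.seed(0)
x = [i / 10 for i in range(100)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]

a, b = fit_line(x, y)
e = residuals(x, y, a, b)

mean_e = sum(e) / len(e)      # zero up to rounding, by construction
half = len(e) // 2
sd_low_x = std(e[:half])      # spread of residuals at small x
sd_high_x = std(e[half:])     # spread of residuals at large x

print("slope:", round(b, 3))
print("mean residual:", mean_e)
print("residual sd, low x vs high x:", round(sd_low_x, 3), round(sd_high_x, 3))
```

If the two standard deviations differ greatly, the constant variance assumption deserves a closer look.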
3/49 An Enduring Art Mystery

Graphics show relative sizes of the two works.

The Persistence of Statistics. Hildebrand, Ott and Gray, 2005

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

4/49 Monet in Large and Small

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Log of $price = a + b log surface area + e

5/49 Speaking of Monet…

Monet. Le Pont d'Argenteuil, 1874. Modified, Musée d’Orsay, Paris, October 7, 2007, anon. vandal.


Speaking of owner modified $100,000,000 paintings…

9/49 Using the Residuals
  • How do you know the model is “good?”
  • Various diagnostics to be developed over the semester.
  • But, the first place to look is at the residuals.
10/49 Residuals Can Signal a Flawed Model
  • Standard application: Cost function for output of a production process.
  • Compare linear equation to a quadratic model (in logs)
  • (124 American Electric Utilities)
11/49 Candidate Model for Cost

Log c = a + b log q + e

Most of the points in this area are above the regression line.

Most of the points in this area are above the regression line.

Most of the points in this area are below the regression line.

12/49 A Missing Variable?

Residuals from the (log)linear cost model
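The pattern of points above and below the regression line can be reproduced with a small sketch. The data here are hypothetical and exactly quadratic (they are not the utility data), and `fit_line` is an illustrative helper. Fitting a straight line to curved data leaves residuals that are positive at both ends and negative in the middle.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def mean(values):
    return sum(values) / len(values)

# Hypothetical curved relationship: the squared term is the "missing variable".
x = [float(i) for i in range(30)]
y = [1.0 + 0.5 * xi + 0.05 * xi ** 2 for xi in x]

a, b = fit_line(x, y)
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Average residual in the lowest, middle, and highest third of x.
low, mid, high = mean(e[:10]), mean(e[10:20]), mean(e[20:])
print("mean residual by third of x:", round(low, 3), round(mid, 3), round(high, 3))
```

A systematic sign pattern like this one is the signal that a term is missing from the model.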

13/49 A Better Model?

$\log \text{Cost} = \alpha + \beta_1 \log \text{Output} + \beta_2 [\log \text{Output}]^2 + \varepsilon$

(Developed more fully after the midterm)

14/49 Candidate Models for Cost

The quadratic equation is the appropriate model.

$\log c = a + b_1 \log q + b_2 \log^2 q + e$

15/49 Missing Variable Included

Residuals from the quadratic cost model

Heteroscedasticity
  • Hetero - differences
  • Scedastic - function, variation around the mean
  • Arises when y is “proportional” to x
  • Arises sometimes when there are natural, heterogeneous groups

Residuals from a regression of salaries on years of experience.

Standard deviation of the residuals seems not to be constant.

18/49 Problem with the Model?

This usually suggests the model should be defined in terms of logs of the variable.

19/49 Sometimes Heteroscedasticity Can Be Cured By Taking Logs

Residuals from a regression of logs of salaries on years of experience.

$\text{Salary} = \alpha e^{\beta t} e^{\varepsilon}$

We will explore this model below.

20/49 Sometimes Not …

Countries are ordered by the standard deviation of their 19 residuals.

Regression of log of per capita gasoline use on log of per capita income for 18 OECD countries for 19 years. The standard deviation varies by country. The “solution” is “weighted least squares.” (See text, page 659.)
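A minimal sketch of weighted least squares for a single regressor follows. It assumes the standard choice of weights, inversely proportional to each observation's disturbance variance; the data, the standard deviations, and the function name `fit_wls` are hypothetical and are not taken from the text or from the gasoline data.

```python
def fit_wls(x, y, w):
    """Weighted least squares for y = a + b*x.

    Minimizes sum(w_i * (y_i - a - b*x_i)**2). With equal weights this
    is ordinary least squares.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Hypothetical data: the first three observations come from a low-noise
# group, the last three from a high-noise group.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.1, 7.9, 11.1, 13.0, 18.0, 19.0]
sd = [0.1, 0.1, 0.1, 1.0, 1.0, 1.0]      # assumed group standard deviations
w = [1.0 / s ** 2 for s in sd]           # weights = 1 / variance

a_ols, b_ols = fit_wls(x, y, [1.0] * len(x))
a_wls, b_wls = fit_wls(x, y, w)
print("OLS slope:", round(b_ols, 4))
print("WLS slope:", round(b_wls, 4))
```

The weighted fit leans on the observations that are measured with less noise, which is why it can be more precise than unweighted least squares when the variances differ.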

21/49 Should I Worry About Heteroscedasticity?
  • Not a problem for using least squares to estimate α or β.
  • But, there is a better method than least squares.
  • Assessment of the uncertainty of the least squares estimates may be too optimistic.
  • (Not contagious)
Autocorrelation
  • Auto – self
  • Correlation – correlation
  • Correlated with itself? Obviously?
  • Noise in one observation is correlated with noise in other observations.
  • Usually a feature of time series data
    • Residuals correlated with recent past residuals
    • Typically streaks of unusually high or low observations (measured against the regression)
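One simple way to quantify "residuals correlated with recent past residuals" is the lag-1 sample autocorrelation. The sketch below uses two hypothetical residual series, and the function name `lag1_autocorr` is illustrative; it is one common summary, not necessarily the diagnostic used in the course.

```python
def lag1_autocorr(e):
    """Lag-1 sample autocorrelation of a residual series."""
    n = len(e)
    m = sum(e) / n
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in e)
    return num / den

# Hypothetical residuals: a streak of highs followed by a streak of lows.
streaky = [1, 1, 1, 1, -1, -1, -1, -1]
# Hypothetical residuals that flip sign every period.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

r_streaky = lag1_autocorr(streaky)
r_alternating = lag1_autocorr(alternating)
print(r_streaky)       # 0.625
print(r_alternating)   # -0.875
```

Streaks produce a positive value; values near zero are what pure noise would show.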
23/49 Time Series Regression

Regression of log Gasoline on log Income (both per capita), U.S., 1953-2004. Residuals are highly autocorrelated. Same problems as heteroscedasticity. Autocorrelation can (also) be cured. Not by taking logs, however.

24/49 Unusual Data Points

Outliers have (what appear to be) very large disturbances, ε

Wolf weight vs. tail length

The 500 most successful movies

25/49 Outliers (?)

Remember the empirical rule: about 99.7% of observations will lie within mean ± 3 standard deviations. (We show (a + bx) ± 3s_e below.)

Titanic is 8.1 standard deviations from the regression!

Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.)

These observations might deserve a close look.
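The (a + bx) ± 3s_e bounds can be sketched as follows. The data are hypothetical (a clean line with one planted outlier, not the movie data), and `fit_line` is an illustrative helper. Here s_e is computed as the square root of the sum of squared residuals divided by n - 2.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Hypothetical data: y = 2 + x with small alternating noise.
x = [float(i) for i in range(20)]
y = [2.0 + xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
y[10] += 5.0   # plant one unusually large disturbance

a, b = fit_line(x, y)
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]

n = len(x)
s_e = (sum(v ** 2 for v in e) / (n - 2)) ** 0.5   # standard error of regression

# Observations lying outside (a + b*x) +/- 3*s_e.
flagged = [i for i, v in enumerate(e) if abs(v) > 3 * s_e]
print("flagged observations:", flagged)
```

Note that a large outlier inflates s_e itself, so the bounds widen as the outlier grows; a flagged point is a prompt to examine the data, not a verdict.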

26/49 What to Do About Outliers

(1) Examine the data

(2) Are they due to measurement error or obvious “coding errors”? If so, delete the observations.

(3) Are they just unusual observations? Do nothing.

(4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large. 10 wolves is not.)

(5) Question why you think it is an outlier. Is it really?

27/49 High Leverage Points

“High leverage” points have unusual values of x.

Problem? The regression slope is strongly influenced by these points.

Response: Unless you are strongly convinced that these are bad data, strongly resist the temptation to pay any attention to these observations.

This phenomenon is extremely hard to detect in a moderate to large sample. It is also extremely elusive when there is more than one variable in the model.



28/49 Highly Influential Points
  • High leverage outliers (unusual x and unusual y)
  • With Titanic: 6.693 + 1.051 Domestic
  • Without Titanic: 20.774 + 0.930 Domestic
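The effect of a single point with unusual x and unusual y can be seen by fitting the line with and without it. The numbers below are hypothetical and are not the movie data; `fit_line` is an illustrative helper.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Five hypothetical points on the line y = x, plus one point far out in
# both x and y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 40.0]

a_with, b_with = fit_line(x, y)
a_without, b_without = fit_line(x[:-1], y[:-1])

print("with the point:    intercept %.3f, slope %.3f" % (a_with, b_with))
print("without the point: intercept %.3f, slope %.3f" % (a_without, b_without))
```

One observation out of six more than doubles the slope in this example, which is what "highly influential" means in practice.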
32/49 Minitab’s Opinions

Minitab uses ± 2S to flag “large” residuals.

33/49 On Removing Outliers

Be careful about singling out particular observations this way.

The resulting model might be a product of your opinions.

Removing outliers might create new outliers that were not outliers before.

Statistical inferences from the model will be incorrect.

35/49 Probability Plot

Graph -> Probability Plots …

36/49 Using and Interpreting the Model
  • Interpreting the linear model
  • Semilog and growth models
  • Log-log model and elasticities
37/49 Statistical Cost Analysis

The units of the LHS and RHS must be the same.

$M cost = a + b MKWH

Y = $ cost

a = $ cost = 2.444 $M

b = $M /MKWH = 0.005291 $M/MKWH


a = fixed cost = total cost if MKWH = 0

b = marginal cost = dCost/dMKWH

b * MKWH = variable cost

Generation cost ($M) and output (Millions of KWH) for 124 American electric utilities. (1970).
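The interpretation of a and b can be worked through with the estimates shown above. In this sketch the output level of 1,000 MKWH is a hypothetical value chosen for illustration, and the function name `cost_breakdown` is an assumption.

```python
A_FIXED = 2.444         # a: fixed cost, in $M (total cost if MKWH = 0)
B_MARGINAL = 0.005291   # b: marginal cost, in $M per MKWH

def cost_breakdown(mkwh):
    """Split predicted generation cost ($M) into fixed and variable parts."""
    variable = B_MARGINAL * mkwh          # b * MKWH, in $M
    return {"fixed": A_FIXED, "variable": variable, "total": A_FIXED + variable}

c = cost_breakdown(1000.0)   # hypothetical output of 1,000 MKWH
print(c)
```

The units check out on both sides: ($M per MKWH) times MKWH gives $M, which can be added to the fixed cost in $M.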

38/49 Semilog Models and Growth Rates

LogSalary = 9.84 + 0.05 Years + e
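In a semilog model the slope is approximately the proportional change in y per unit change in x. The sketch below assumes natural logarithms (the slide does not state the base) and uses the fitted coefficients shown above; the function name `predicted_salary` is illustrative.

```python
import math

a, b = 9.84, 0.05   # fitted intercept and slope from the slide

def predicted_salary(years):
    """Predicted salary implied by log(Salary) = a + b * Years (natural log assumed)."""
    return math.exp(a + b * years)

# Exact proportional change in predicted salary for one more year.
growth = math.exp(b) - 1
# The same quantity computed from two consecutive predictions.
ratio = predicted_salary(11) / predicted_salary(10)

print("slope b:", b)
print("exact growth per year:", round(growth, 5))
```

The exact figure is slightly above the slope itself; for small slopes the two are close, which is why b is usually read directly as a growth rate.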

40/49 Using Semilog Models for Trends

Frequent Flyer Flights for 72 Months. (Text, Ex. 11.1, p. 508)

41/49 Regression Approach

logFlights = α + β Months + ε

a = 2.770, b = 0.03710, s = 0.06102
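The fitted trend implies a constant growth rate, from which a doubling time follows. This sketch assumes natural logarithms and that Months is a simple month counter, neither of which is stated on the slide; the function name `fitted_flights` is illustrative.

```python
import math

a, b = 2.770, 0.03710   # fitted coefficients from the slide

def fitted_flights(month):
    """Fitted value implied by log(Flights) = a + b * Months (natural log assumed)."""
    return math.exp(a + b * month)

doubling_months = math.log(2) / b      # months for the fitted trend to double
annual_growth = math.exp(12 * b) - 1   # growth compounded over 12 months

print("doubling time (months):", round(doubling_months, 2))
print("implied annual growth:", round(annual_growth, 4))
```

Extrapolating such a trend far beyond the 72 observed months assumes the growth rate stays constant, which the data alone cannot confirm.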

42/49 Loglinear Models
  • logY = α + βlogx + ε
  • Elasticities
  • Gasoline income elasticity
  • The linear and loglinear models give similar answers
  • Price elasticity
43/49 Elasticity and Loglinear Models
  • The “responsiveness” of one variable to changes in another
  • E.g., in economics demand elasticity = (%ΔQ) / (%ΔP)
  • Math: Ratio of percentage changes
    • %ΔQ / %ΔP = {100%[(ΔQ)/Q]} / {100%[(ΔP)/P]}
    • Units of measurement and the 100% fall out of this eqn.
    • Elasticity = (ΔQ/ΔP)*(P/Q)
    • Elasticities are units free
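The arithmetic in the bullets above can be checked with a short sketch. The prices and quantities are hypothetical, and both function names are illustrative. The example computes the elasticity as a ratio of percentage changes, then in the slope form (ΔQ/ΔP)*(P/Q), then again after changing the units of measurement.

```python
def elasticity(q0, q1, p0, p1):
    """Ratio of percentage changes: %dQ / %dP."""
    pct_q = 100 * (q1 - q0) / q0
    pct_p = 100 * (p1 - p0) / p0
    return pct_q / pct_p

def elasticity_slope_form(q0, q1, p0, p1):
    """Equivalent form: (dQ/dP) * (P/Q)."""
    return ((q1 - q0) / (p1 - p0)) * (p0 / q0)

# Hypothetical change: price rises 10%, quantity falls 3%.
e_pct = elasticity(100.0, 97.0, 2.00, 2.20)
e_slope = elasticity_slope_form(100.0, 97.0, 2.00, 2.20)

# Same change with quantity in different units (x1000) and price in cents.
e_rescaled = elasticity(100000.0, 97000.0, 200.0, 220.0)

print(round(e_pct, 4), round(e_slope, 4), round(e_rescaled, 4))
```

All three values agree, which illustrates both the algebraic identity and the fact that the elasticity does not depend on the units chosen.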
45/49 Loglinear Demand Curves

$Q = \alpha P^{\beta} e^{\varepsilon}$, so $\log Q = \log \alpha + \beta \log P + \varepsilon$.

Then $\beta = d \log Q / d \log P$ is the elasticity.

Demand Models

Using Logs. Regression Analysis: Log-Gas_t versus LogPG_t

The regression equation is
Log-Gas_t = 0.372 - 0.169 LogPG_t

Predictor       Coef   SE Coef       T      P
Constant    0.372140  0.008433   44.13  0.000
LogPG_t     -0.16949   0.03827   -4.43  0.000

S = 0.0608113   R-Sq = 28.2%

Using Levels. Regression Analysis: Gas_t versus PGas_t

The regression equation is
Gas_t = 1.66 - 0.199 PGas_t

Predictor       Coef   SE Coef       T      P
Constant     1.65874   0.05803   28.58  0.000
PGas_t      -0.19928   0.05516   -3.61  0.001

S = 0.0941783   R-Sq = 20.7%

47/49 Linear and Loglinear Models

Elasticity in the loglinear model is b = -0.1695. Elasticity in the linear model, evaluated at the mean of G (1.4545) and the mean of PG (1.0251), is -0.1993 × (1.0251/1.4545) = -0.1404.
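The comparison can be reproduced directly from the two regression outputs. This sketch uses the reported coefficients and sample means; the variable names are illustrative.

```python
# Slope from the regression in logs: this is the elasticity itself.
b_loglinear = -0.16949

# Slope from the regression in levels, and the sample means reported above.
b_linear = -0.19928
mean_g = 1.4545     # mean of G
mean_pg = 1.0251    # mean of PG

# Elasticity in the linear model: (dG/dPG) * (PG/G), evaluated at the means.
elasticity_linear = b_linear * mean_pg / mean_g

print("loglinear elasticity:", b_loglinear)
print("linear elasticity at the means:", round(elasticity_linear, 4))
```

In the linear model the elasticity changes along the demand curve, so the value at the means is only one summary; in the loglinear model it is the same at every point.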

  • Residual analysis
    • Consistent with model assumptions?
    • Suggest missing elements in the model
  • Building the regression model
    • Interpreting the model – cost function
    • Growth model – semilog
    • Double log and estimating elasticities