Statistical Inference and Regression Analysis: GB.3302.30

1 / 97

# Statistical Inference and Regression Analysis: GB.3302.30 - PowerPoint PPT Presentation

Statistical Inference and Regression Analysis: GB.3302.30. Professor William Greene Stern School of Business IOMS Department Department of Economics. Statistics and Data Analysis. Part 7 – Regression Model-1 Regression Diagnostics. Using the Residuals.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Statistical Inference and Regression Analysis: GB.3302.30' - marcy

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Statistical Inference and Regression Analysis: GB.3302.30

Professor William Greene

IOMS Department

Department of Economics

### Statistics and Data Analysis

Part 7 – Regression Model-1 Regression Diagnostics

Using the Residuals
• How do you know the model is “good?”
• Various diagnostics to be developed over the semester.
• But, the first place to look is at the residuals.
Residuals Can Signal a Flawed Model
• Standard application: Cost function for output of a production process.
• Compare linear equation to a quadratic model (in logs)
• (123 American Electric Utilities)
Candidate Model for Cost

Log c = a + b log q + e

Most of the points in this area are above the regression line.

Most of the points in this area are above the regression line.

Most of the points in this area are below the regression line.

A Better Model?

Log Cost = α + β1 logOutput + β2 [logOutput]2 + ε

Candidate Models for Cost

The quadratic equation is the appropriate model.

Logc = a + b1 logq + b2 log2q + e

Missing Variable Included

Residuals from the quadratic cost model

Residuals from the linear cost model

Unusual Data Points

Outliers have (what appear to be) very large disturbances, ε

Wolf weight vs. tail length The 500 most successful movies

Outliers

99.5% of observations will lie within mean ± 3 standard deviations. We show (a+bx) ± 3se below.)

Titanic is 8.1 standard deviations from the regression!

Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.)

These observations might deserve a close look.

logPrice = a + b logArea + e

Not an outlier: Monet chose to paint a small painting.

Possibly an outlier: Why was the price so low?

(1) Examine the data

(2) Are they due to mismeasurement error or obvious “coding errors?” Delete the observations.

(3) Are they just unusual observations? Do nothing.

(4) Generally, resist the temptation to remove outliers.Especially if the sample is large. (500 movies islarge. 10 wolves is not.)

(5) Question why you think it is an outlier. Is it really?

On Removing Outliers

Be careful about singling out particular observations this way.

The resulting model might be a product of your opinions, not the real relationship in the data.

Removing outliers might create new outliers that were not outliers before.

Statistical inferences from the model will be incorrect.

### Statistics and Data Analysis

Part 7 – Regression Model-2 Statistical Inference

b As a Statistical Estimator
• What is the interest in b?
•  = dE[y|x]/dx
• Effect of a policy variable on the expectation of a variable of interest.
• Effect of medication dosage on disease response
• … many others
Application: Health Care Data

German Health Care Usage Data, There are altogether 27,326 observations on German households, 1984-1994.

DOCTOR = 1(Number of doctor visits > 0) HOSPITAL = 1(Number of hospital visits > 0)

HSAT =  health satisfaction, coded 0 (low) - 10 (high)

DOCVIS =  number of doctor visits in last three months HOSPVIS =  number of hospital visits in last calendar yearPUBLIC =  insured in public health insurance = 1; otherwise = 0 ADDON =  insured by add-on insurance = 1; otherswise = 0

INCOME =  household nominal monthly net income in German marks / 10000.HHKIDS = children under age 16 in the household = 1; otherwise = 0 EDUC =  years of schooling

AGE = age in years

MARRIED = marital status

EDUC = years of education

Regression?
• Population relationshipIncome =  + Health + 
• (For this population,Income = .31237+ .00585 Health + E[Income | Health] = .31237 + .00585 Health
Average Income | Health

Health

Nj = 447 255 642 1173 1390 4233 2570 4191 6172 3061 3192

b is a statistic
• Random because it is a sum of the ’s.
• It has a distribution, like any sample statistic
Sampling Experiment
• 500 samples of N=52 drawn from the 27,326 (using a random number generator to simulate N observation numbers from 1 to 27,326)
• Compute b with each sample
• Histogram of 500 values
Conclusions
• Sampling variability
• Seems to center on 
• Appears to be normally distributed
Distribution of slope estimator, b
• Assumptions:
• (Model) Regression: yi =  + xi + i
• (Crucial) Exogenous data: data x and noise  are independent; E[|x]=0 or Cov(,x)=0
• (Temporary) Var[|x] = 2, not a function of x(Homoscedastic)
• Results: What are the properties of b?
(2) b is efficient
• Gauss – Markov Theorem: Like Rao Blackwell. (Proof in Greene)
• Variance of b is smallest among linear unbiased estimators.
• Have derived expected value and variance of b.
• b is a ‘point’ estimator
• Looking for a way to form a confidence interval.
• Need a distribution and a pivotal statistic to use.
Usable Confidence Interval
• Use s instead of s.
• Use t distribution instead of normal.
• Critical t depends on degrees of freedom
• b - ts <  < b + ts
Regression Results

-----------------------------------------------------------------------------

Ordinary least squares regression ............

LHS=BOX Mean = 20.72065

Standard deviation = 17.49244

---------- No. of observations = 62 DegFreedom Mean square

Regression Sum of Squares = 7913.58 1 7913.57745

Residual Sum of Squares = 10751.5 60 179.19235

Total Sum of Squares = 18665.1 61 305.98555

---------- Standard error of e = 13.38627 Root MSE 13.16860

Fit R-squared = .42398 R-bar squared .41438

Model test F[ 1, 60] = 44.16247 Prob F > F* .00000

--------+--------------------------------------------------------------------

| Standard Prob. 95% Confidence

BOX| Coefficient Error t |t|>T* Interval

--------+--------------------------------------------------------------------

Constant| -14.3600** 5.54587 -2.59 .0121 -25.2297 -3.4903

CNTWAIT3| 72.7181*** 10.94249 6.65 .0000 51.2712 94.1650

--------+--------------------------------------------------------------------

Note: ***, **, * ==> Significance at 1%, 5%, 10% level.

-----------------------------------------------------------------------------

• Outside the confidence interval is the rejection for hypothesis tests about 
• For the internet buzz regression, the confidence interval is 51.2712 to 94.1650
• The hypothesis that  equals zero is rejected.

### Statistics and Data Analysis

Part 7-3 – Prediction

Predicting y Using the Regression
• Actual y0 is  + x0 + 0
• Prediction is y0^ = a + bx0 + 0
• Error is y0 – y0^ = (a-) + (b-)x0 + 0
• Variance of the error isVar[a] + x02 Var[b] + 2x0 Cov[a,b] + Var[0]
Quantum of Solace
• Actual Box = \$67.528882M
• a=-14.36, b=72.7181, N=62, sb =10.94249, s2 = 13.38632
• buzz = 0.76, prediction = 40.906
• Mean buzz = 0.4824194
• (buzz – mean)2 = 1.49654
• Sforecast = 13.8314252
• Confidence interval = 40.906 +/- 2.003(13.831425) = 13.239 to 68.527(Note: The confidence interval contains the value)
Forecasting Out of Sample

Regression Analysis: G versus Income

The regression equation is

G = 1.93 + 0.000179 Income

Predictor Coef SE Coef T P

Constant 1.9280 0.1651 11.68 0.000

Income 0.00017897 0.00000934 19.17 0.000

S = 0.370241 R-Sq = 88.0% R-Sq(adj) = 87.8%

How to predict G for 2012? You would need first to predict Income for 2012.

How should we do that?

Per Capita Gasoline Consumption vs. Per Capita Income, 1953-2004.

The Extrapolation Penalty

The interval is narrowest at x* = , the center of our experience. The interval widens as we move away from the center of our experience to reflect the greater uncertainty.(1) Uncertainty about the prediction of x(2) Uncertainty that the linear relationship will continue to exist as we move farther from the center.

Normality
• Necessary for t statistics and confidence intervals
• Residuals reveal whether disturbances are normal?
• Standard tests and devices
Normally Distributed Residuals?

--------------------------------------

Kolmogorov-Smirnov test of F(E )

vs. Normal[ .00000, 13.27610^2]

******* K-S test statistic = .1810063

******* 95% critical value = .1727202

******* 99% critical value = .2070102

Normality hyp. should be rejected.

--------------------------------------

Nonnormal Disturbances
• Appeal to the central limit theorem
• Use standard normal instead of t
• t is essentially normal if N > 50.

### Statistics and Data Analysis

Part 7-4 – Multiple Regression

An Enduring Art Mystery

Graphics show relative sizes of the two works.

The Persistence of Statistics. Rice, 2007

Why do larger paintings command higher prices?

The Persistence of Memory. Salvador Dali, 1931

The Data

Note: Using logs in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes. (E.g., what is the % premium when the painting is 10% larger?)

Monet in Large and Small

Sale prices of 328 signed Monet paintings

The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.

Log of \$price = a + b log surface area + e

How much for the signature?
• The sample also contains 102 unsigned paintings

Average Sale Price

Signed \$3,364,248

Not signed \$1,832,712

• Average price of signed Monet’s is almost twice that of unsigned
Can we separate the two effects?

Average Prices

Small Large

Unsigned 346,845 5,795,000

Signed 689,422 5,556,490

What do the data suggest?

(1) The size effect is huge

(2) The signature effect is confined to the small paintings.

A Multiple Regression

b2

Ln Price = a + b1 ln Area + b2 (0 if unsigned, 1 if signed) + e

Monet Multiple Regression

Regression Analysis: ln (US\$) versus ln (SurfaceArea), Signed

The regression equation is

ln (US\$) = 4.12 + 1.35 ln (SurfaceArea) + 1.26 Signed

Predictor Coef SE Coef T P

Constant 4.1222 0.5585 7.38 0.000

ln (SurfaceArea) 1.3458 0.08151 16.51 0.000

Signed 1.2618 0.1249 10.11 0.000

S = 0.992509 R-Sq = 46.2% R-Sq(adj) = 46.0%

Interpretation (to be explored as we develop the topic):(1) Elasticity of price with respect to surface area is 1.3458 – very large

(2) The signature multiplies the price by exp(1.2618) (about 3.5), for any given size.

Ceteris Paribus in Theory
• Demand for gasoline: G = f(price,income)
• Demand (price) elasticity:eP = %change in G given %change in P holding income constant.
• How do you do that in the real world?
• The “percentage changes”
• How to change price and hold income constant?
A Thought Experiment
• The main driver of gasoline consumption is income not price
• Income is growing over time.
• We are not holding income constant when we change price!
• How do we do that?
How to Hold Income Constant?

Multiple Regression Using Price and Income

Regression Analysis: G versus GasPrice, Income

The regression equation is

G = 0.134 - 0.00163 GasPrice + 0.000026 Income

Predictor Coef SE Coef T P

Constant 0.13449 0.02081 6.46 0.000

GasPrice -0.0016281 0.0004152 -3.92 0.000

Income 0.00002634 0.00000231 11.43 0.000

It looks like the theory works.

### Statistics and Data Analysis

Linear Multiple Regression Model

Classical Linear Regression Model
• The model is y = f(x1,x2,…,xK,1,2,…K) + 

= a multiple regression model. Important examples:

• Marginal cost in a multiple output setting
• Separate age and education effects in an earnings equation.
• Denote (x1,x2,…,xK) as x. Boldface symbol = vector.
• Form of the model – E[y|x] = a linear function of x.
• ‘Dependent’ and ‘independent’ variables.
• Independent of what? Think in terms of autonomous variation.
• Can y just ‘change?’ What ‘causes’ the change?
Model Assumptions: Generalities
• Linearity means linear in the parameters. We’ll return to this issue shortly.
• Identifiability. It is not possible in the context of the model for two different sets of parameters to produce the same value of E[y|x] for allx vectors. (It is possible for some x.)
• Conditional expected value of the deviation of an observation from the conditional mean function is zero
• Form of the variance of the random variable around the conditional mean is specified
• Nature of the process by which x is observed.
• Assumptions about the specific probability distribution.
Linearity of the Model
• f(x1,x2,…,xK,1,2,…K) = x11 + x22 + … + xKK
• Notation: x11 + x22 + … + xKK = x.
• Boldface letter indicates a column vector. “x” denotes a variable, a function of a variable, or a function of a set of variables.
• There are K “variables” on the right hand side of the conditional mean “function.”
• The first “variable” is usually a constant term. (Wisdom: Models should have a constant term unless the theory says they should not.)
• E[y|x] = 1*1 + 2*x2 + … + K*xK.

(1*1 = the intercept term).

Linearity
• Linearity means linear in the parameters, not in the variables
• E[y|x] = 1 f1(…) + 2 f2(…) + … + K fK(…).

fk() may be any function of data.

• Examples:
• Logs and levels in economics
• Time trends, and time trends in loglinear models – rates of growth
• Dummy variables
Linearity
• Simple linear model, E[y|x] =x’β
• Quadratic model: E[y|x] = α + β1x+ β2x2
• Loglinear model, E[lny|x] = α + Σk lnxkβk
• Semilog, E[y|x] = α + Σk lnxkβk
• All are “linear.” An infinite number of variations.
Notation

Define column vectors of Nobservations on y and the K x variables.

The assumption means that the rank of the matrix X is K.

No linear dependencies => FULL COLUMN RANK of the matrix X.

Uniqueness of the Conditional Mean

The conditional mean relationship must hold for any set of N observations, i = 1,…,N. Assume, that NK (justified later)

E[y1|x] = x1

E[y2|x] = x2

E[yn|x] = xn

All N observations at once: E[y|X] = X = E.

Uniqueness of E[y|X]

Now, suppose there is a  that produces the same expected value,

E[y|X] = X = E.

Let  =  - . Then,

X = X - X = E - E = 0.

Is this possible? X is an NK matrix (N rows, K columns). What does X= 0 mean? We assume this is not possible. This is the ‘full rank’ assumption. Ultimately, it will imply that we can ‘estimate’ . This requires NK .

An Unidentified (But Valid)Theory of Art Appreciation

Enhanced Monet Area Effect Model: Height and Width Effects

Log(Price) = β1 + β2 log Area +

β3log Aspect Ratio +

β4log Height +

β5Signature +

ε

(Aspect Ratio = Height/Width)

Conditional Homoscedasticity and Nonautocorrelation

Disturbances provide no information about each other.

• Var[i | X ] = 2
• Cov[i, j|X] = 0
Heteroscedasticity

Countries are ordered by the standard deviation of their 19 residuals.

Regression of log of per capita gasoline use on log of per capita income, gasoline price and number of cars per capita for 18 OECD countries for 19 years. The standard deviation varies by country.

Normal Distribution of ε
• Used to facilitate finite sample derivations of certain test statistics.
• Observations are independent
• Assumption will be unnecessary – we will use the central limit theorem for the statistical results we need.
The Linear Model
• y = X+ε, N observations, K columns in X, (usually including a column of ones for the intercept).
• E[ε|X]=0, E[ε]=0 and Cov[ε,x]=0
• Regression: E[y|X] = X

### Statistics and Data Analysis

Least Squares

Vocabulary
• Some terms to be used in the discussion.
• Population characteristics and entities vs. sample quantities and analogs
• Residuals and disturbances
• Population regression line and sample regression
• Objective: Learn about the conditional mean function. Estimate  and 2
• First step: Mechanics of fitting a line to a set of data