Notes 6: Multiple Linear Regression


Notes 6: Multiple Linear Regression

1. The Multiple Linear Regression Model

2. Estimates and Plug-in Prediction

3. Confidence Intervals and Hypothesis Tests

4. Fits, residuals, R-squared, and the overall F-test

5. Categorical Variables as Regressors

6. Issues with Regression: Outliers, Nonlinearities,

and Omitted Variables

The data, regression output, and some of the plots in these slides can be found in the file MidCity_reg.xls.

1. The Multiple Linear Regression Model

The plug-in predictive interval for the price of a house given its size is quite large.

Is this “bad”? Not necessarily. There is a lot of variation in this relationship. Put differently, you can’t accurately predict the price of a house just based on its size. The width of our predictive interval reflects this.

How can we predict the price of a house more accurately? If we know more about a house, we should have a better idea of its price!!

Our data has more variables than just size and price (price and size /1000):

The first 7 rows are shown in the spreadsheet, with columns x1 (nbed), x2 (nbath), x3 (size), and y (price).

Before we tried to predict price given size.

Suppose we also know the number of bedrooms and bathrooms a house has. What is prediction for price?

Let xij = the value of the jth explanatory variable associated with observation i. So xi1 = # of bedrooms in the ith house.

In the spreadsheet, xij is the ith row of the jth column.

The Multiple Linear Regression Model

Yi = α + β1 xi1 + β2 xi2 + … + βk xik + ei,   ei ~ N(0, σ²)

Y is a linear combination of the x variables + error. The error works exactly the same way as in simple linear regression!! We assume the errors e are independent of all the x's.

xij is the value of the jth explanatory variable (or regressor) associated with observation i. There are k regressors.

How do we interpret this model? We can’t plot the line anymore, but…

α is still the intercept, our “guess” for Y when all the x’s = 0.

There are now k slope coefficients βi, one for each x.

Big difference: Each coefficient βi now describes the change in Y when xi increases by one unit, holding all of the other x’s fixed. You can think of this as “controlling” for all of the other x’s.

Another way to think about the model: the conditional distribution of Y given all of the x’s is normal, with the mean depending on the x’s through a linear combination:

Y | x1, …, xk ~ N(α + β1x1 + … + βkxk, σ²)

The mean α + β1x1 + … + βkxk is the “guess” we would make for Y given values for each x1, x2, …, xk. The variance σ² says “how wrong” our guess can be. Notice that σ² has the same interpretation here as it did in the simple linear regression model.

Suppose we model price as depending on nbed, nbath, and size. Then we have:

Pricei = α + β1 nbedi + β2 nbathi + β3 sizei + ei

If we knew α, β1, β2, β3, and σ, could we predict price?

Predict the price of a house that has 3 bedrooms, 2 bathrooms, and total size of 2200 square feet. Our “guess” for price would be:

α + β1(3) + β2(2) + β3(2200)

Now since we know that e ~ N(0, σ²), with 95% probability

Price ≈ α + β1(3) + β2(2) + β3(2200) ± 2σ

But again, we don’t know the parameters α, β1, β2, β3, or σ. We have to estimate them from some data. Given data, we have estimates of α, βi, and σ:

a is our estimate of α.
b1, b2, b3 are our estimates of β1, β2, and β3.
se is our estimate of σ.

2. Estimates and Plug-in Prediction

Here is the output from the regression of price on size (SqFt), nbed (Bedrooms), and nbath (Bathrooms) in StatPro. The labels se, a, b1, b2, and b3 in the output mark the standard error of estimate, the intercept estimate, and the three slope estimates.

For simple linear regression we wrote down formulas for a and b. We could do that for multiple regression, too, but we’d need matrix algebra.

Once we have our intercept and slope estimates, though, we estimate se just like before.

Define the residual ei as:

ei = yi - a - b1xi1 - b2xi2 - … - bkxik

As before, the residual is the difference between our “guess” for Y and the actual value yi we observed.

Also as before, these estimates a, b1, …, bk are the least squares estimates: they minimize the sum of squared residuals, e1² + e2² + … + en².

Our estimate of σ is just the sample standard deviation of the residuals ei:

se = sqrt( (e1² + e2² + … + en²) / (n - k - 1) )

Remember for simple regression, k = 1. So this is really the same formula. We divide by n - k - 1 for the same reason. se just asks, “on average, how far are our observed values yi away from the line we fitted?”
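The whole computation can be sketched in a few lines of Python (an illustrative toy example with made-up data, not the MidCity spreadsheet; numpy's least-squares solver stands in for StatPro):

```python
import numpy as np

# Toy data (made up for illustration): n observations, k regressors,
# plus a column of ones for the intercept.
rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Least squares estimates [a, b1, ..., bk].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals e_i = y_i - a - b1*x_i1 - ... - bk*x_ik.
resid = y - X @ coef

# se: sum of squared residuals divided by n - k - 1, then square root.
se = np.sqrt(np.sum(resid**2) / (n - k - 1))
print(se)  # should land near the true error sd of 0.3
```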

Our estimated relationship is:

Price = -5.64 + 10.46*nbed + 13.55*nbath + 35.64*size,   +/- 2(20.36)

Interpret (remember UNITS!): With size and nbath held fixed, how does adding one bedroom affect the value of the house? Answer: price increases by 10.46 thousands of dollars.

With nbed and nbath held fixed, adding 1000 square feet increases the price of the house by $35,640.

Suppose a house has size = 2.2, 3 bedrooms, and 2 bathrooms. What is your (estimated) prediction for its price?

-5.64 + 10.46*3 + 13.55*2 + 35.64*2.2 = 131.248

With 2se = 40.72, the interval is 131.248 +/- 40.72.

This is our multiple regression “plug-in” predictive interval.

We just plug in our estimates a, b1, b2, b3, and se in place of the unknown parameters α, β1, β2, β3, and σ.
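The plug-in interval can be computed directly from the estimates; a minimal Python sketch using the numbers from the output above:

```python
# Plug-in predictive interval, using the estimates from the output above.
a, b1, b2, b3 = -5.64, 10.46, 13.55, 35.64  # intercept, nbed, nbath, size
se = 20.36                                  # standard error of estimate

def predict_interval(nbed, nbath, size):
    """Return the 95% plug-in interval: guess +/- 2*se."""
    guess = a + b1 * nbed + b2 * nbath + b3 * size
    return guess - 2 * se, guess + 2 * se

lo, hi = predict_interval(3, 2, 2.2)  # 3 bed, 2 bath, 2200 sq ft
print(lo, hi)  # roughly 90.5 to 172.0 (thousands of dollars)
```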

Note (1): When we regressed price on size alone, the coefficient for size was about 70. Now the coefficient for size is about 36. WHY?

Without nbath and nbed in the regression, an increase in size can be associated with an increase in nbath and nbed in the background.

If all I know is that one house is a lot bigger than another, I might expect the bigger house to have more beds and baths! With nbath and nbed held fixed, the effect of size is smaller.

Example:

Suppose I build a 1000 square foot addition to my house. This addition includes two bedrooms and one bathroom.

How does this affect the value of my house?

10.46*2 + 13.55*1 + 35.64*1 = 70.11

The value of the house goes up by $70,110. This is almost exactly the relationship we estimated before!

But now we can say, if the 1000 square foot addition is only a basement or screened-in porch, the increase in value is much smaller. This is a much more realistic model!!

With just size, the +/- of our predictive interval was 2*22.467 = 44.934. With nbath and nbed added to the model the +/- is 2*20.36 = 40.72.

The additional information makes our prediction more precise (but not by a whole lot in this case; we still need a “better model”).

And we can do this! We have more info in our sample, and we might be able to use this info more efficiently.

3. Confidence Intervals and Hypothesis Tests

95% confidence interval for α: a +/- 2 standard errors. AGAIN!!!!!!!

95% confidence interval for βi: bi +/- 2 se(bi).

(recall that k is the number of explanatory variables in the model)

For example, b2 = 13.55 and its standard error is 4.22. The 95% CI for β2 is 13.55 +/- 2(4.22).

Again, StatPro (and nearly every other software package)

prints out the 95% confidence intervals for the intercept

and each slope coefficient.
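The standard errors behind those intervals can be sketched with a small linear-algebra computation (toy data; the se² · diag((XᵀX)⁻¹) formula is the standard OLS one, not anything specific to StatPro):

```python
import numpy as np

# Toy data with known coefficients (for illustration only).
rng = np.random.default_rng(1)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([5.0, 1.5, -2.0]) + rng.normal(size=n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
se2 = np.sum(resid**2) / (n - k - 1)          # estimated error variance
std_err = np.sqrt(se2 * np.diag(np.linalg.inv(X.T @ X)))

# 95% CI: estimate +/- 2 standard errors, for intercept and each slope.
ci = np.column_stack([coef - 2 * std_err, coef + 2 * std_err])
print(ci)
```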

Hypothesis tests on coefficients:

To test the null hypothesis H0: α = α0 vs. Ha: α ≠ α0, we compute the "t statistic"

t = (a - α0) / se(a)

If n > 30 or so, we reject at level .05 if the t statistic is bigger than 2 in absolute value!! Otherwise, we fail to reject.

Intuitively, we reject if the estimate is more than 2 se's away from the proposed value.

Same for the slopes (gee, this looks familiar): To test the null hypothesis H0: βi = β0 vs. Ha: βi ≠ β0, we reject at level .05 if |(bi - β0) / se(bi)| > 2. Otherwise, we fail to reject.

Again, we reject if the estimate is more than 2 se's away from the proposed value.

Example

StatPro automatically prints out the t-statistics for testing whether the intercept = 0 and whether each slope = 0, as well as the associated p-values.

e.g., t = 35.64/10.67 = 3.34 => reject H0: β3 = 0
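In code the check is one line; a sketch using the numbers from the output above:

```python
# t-statistic for H0: beta3 = 0, using b3 and its standard error
# from the regression output above.
b3, se_b3 = 35.64, 10.67
t = b3 / se_b3
reject = abs(t) > 2  # the rough n > 30 rule from these notes
print(round(t, 2), reject)  # 3.34 True
```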

What does this mean?

In this sample, we have evidence that each of our explanatory variables has a significant impact on the price of a house.

Even so, adding two variables doesn’t help our predictions that much! Our predictive interval is still pretty wide.

In many applications we will have LOTS (sometimes hundreds) of x’s. We will want to ask, “which x’s really belong in our model?”. This is called model selection.

One way to answer this is to throw out all the x’s whose coefficients have t-stats less than 2. But this isn’t necessarily the BEST way… more on this later.

Be careful interpreting these tests.

Example: 1993 data on 50 states and D.C.

vcrmrate_93i = Violent crimes per 100,000 population

black_93i = proportion of black people in population

Increase the proportion of black people in a state’s population by 1% and I predict 28.5 more violent crimes per 100,000 population…

unem_93i = unemployment rate in state i

pcpolice_93i = avg size of police force per capita in state i

prison_93i = prison inmates per 100,000 population

When I control for these other factors, black_93 is no longer significant!

More importantly,

Correlation does not imply causation!!!

We should not conclude that police “cause” crime!

4. Fits, residuals, R-squared, and the overall F-test

In multiple regression the fit is ŷi = a + b1xi1 + … + bkxik, "the part of y related to the x's"; as before, the residual is the part left over: ei = yi - ŷi.

Just like for simple regression we would like to “split up” Y into two parts and ask “how much can be explained by the x’s?”

In multiple regression, the residuals ei have sample mean 0 and are uncorrelated with each of the x's and the fitted values:

yi = ŷi + ei

where ŷi is the part of y that is explained by the x’s and ei is the part of y that has nothing to do with the x’s.

Here is a plot of the residuals from the regression of price on size, nbath, nbed vs. the fitted values. We can see the 0 correlation. Scatterplots of residuals vs. each of the x’s would look similar.

So, just as with one x we have:

total variation in y =

variation explained by x + unexplained variation

R-squared: the closer R-squared is to 1, the better the fit. R² is also the square of the correlation between the fitted values, ŷ, and the observed values, y.

Regression finds the linear combination of the x's which is most correlated with y.

(Recall that with just size, the correlation between fits and y was .553.)

So R² here is just (.663)² = 0.439569. The “Multiple R” in the StatPro output is the correlation between y and the fits.
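This identity (R² equals the squared correlation between the fits and y) can be checked numerically on toy data:

```python
import numpy as np

# Toy regression (data made up for illustration).
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fits = X @ coef

# R^2 from the variance decomposition...
r2 = 1 - np.sum((y - fits) ** 2) / np.sum((y - y.mean()) ** 2)
# ...equals the squared correlation between fits and y ("Multiple R" squared).
mult_r = np.corrcoef(fits, y)[0, 1]
print(r2, mult_r**2)
```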

Aside: Model Selection

In general I might have a LOT of x’s and I’ll want to ask, “which of these variables belong in my model?”

THERE IS NO ONE RIGHT ANSWER TO THIS.

One way is to ask, “which of the coefficients are significant?” But we know significant coefficients don’t necessarily mean we will do better predicting.

Another way might be to ask, “what happens to R2 when I add more variables?” CAREFUL, though, it turns out that when you add variables R2 will NEVER go down!!

The overall F-test

The p-value beside "F" is a test of the null hypothesis H0: β1 = β2 = … = βk = 0 (all the slopes are 0). We reject the null; at least some of the slopes are not 0.

What does this mean?

I sometimes call the “overall F-test” the “kitchen sink test”. Notice that if the null hypothesis H0: β1 = … = βk = 0 is true, then NONE of the x’s have ANY explanatory power in our linear model!!

We’ve thrown “everything but the kitchen sink” at Y. We want to know, can ANY of our x’s predict Y??

That’s fine. But in practice this test is VERY sensitive. You’re being “maximally skeptical” here, so you will usually reject H0. If on the other hand we don’t reject the null in this test, we probably need to rethink things!!
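A sketch of the F statistic itself, computed from R² via the standard form F = (R²/k) / ((1−R²)/(n−k−1)), on made-up data where the null is false:

```python
import numpy as np

# Toy data where the null (all slopes zero) is false.
rng = np.random.default_rng(6)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5, 0.3]) + rng.normal(size=n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fits = X @ coef
r2 = 1 - np.sum((y - fits) ** 2) / np.sum((y - y.mean()) ** 2)

# Overall F statistic: explained variation per slope over
# unexplained variation per residual degree of freedom.
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F)  # a large F means we reject H0 that all slopes are 0
```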

5. Categorical Variables as Regressors

Here, again, are the first 7 rows of our housing data:

Does whether a house is brick or not affect the price of the house? This is a categorical variable. Can we use multiple regression with categorical x's?!

What about the neighborhood? (location, location, location!!)

Here’s the price/size scatterplot again. In this one, brick houses are in pink.

What kind of model would you like to fit here?

Adding a Binary Categorical x

To add "brick" as an explanatory variable in our regression we create the “brick dummy”: the dummy variable which is 1 if the house is brick and 0 otherwise.

Note:

I created the dummy by using the Excel formula:

=IF(Brick="Yes",1,0)

but we'll see that StatPro has a nice utility for creating dummies.
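The same step can be sketched in pandas (the column names here are assumptions for illustration, not the actual spreadsheet's):

```python
import pandas as pd

# Mimic Excel's =IF(Brick="Yes",1,0) on a toy column.
df = pd.DataFrame({"Brick": ["Yes", "No", "No", "Yes"]})
df["brickdum"] = (df["Brick"] == "Yes").astype(int)
print(df["brickdum"].tolist())  # [1, 0, 0, 1]
```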

As a simple first example, let's regress price on size and brickdum. Here is our model:

Pricei = α + β1 sizei + β2 brickdumi + ei

How do you interpret β2?

What is the expected price of a brick house of a given size, s?

(α + β2) + β1 s    (intercept α + β2, slope β1)

What is the expected price of a non-brick house given its size?

α + β1 s    (intercept α, slope β1)

β2 is the expected difference in price between a brick and non-brick house controlling for size.

Let's try it!!

+/- 2se = 39.3; this is the best we've done!

What is the “brick effect”? b2 +/- 2 se(b2): 23.4 +/- 2(3.7) = 23.4 +/- 7.4

We can see the effect of the dummy by plotting the fitted values vs size. (StatPro does this for you.) The upper line is for the brick houses and the lower line is for the non-brick houses.

One more scatterplot with fitted and actual values.

In this case we can really visualize our model—we just fit two lines with different intercepts!!

The blue line (non-brick houses) has intercept a; the pink line (brick houses) has intercept a + b2.

Note:

You could also create a dummy which was 1 if a house was non-brick and 0 if brick. That would be fine, but the meaning of b2 would change. (Here it would just change sign.)

IMPORTANT: You CANNOT put both dummies in! Given one, the information in the other is redundant. You will get nasty error messages if you try to do this!!

We can interpret β2 as a shift in the intercept. But note that our model still assumes that the price difference between a brick and non-brick house does not depend on the size! In other words, we are fitting two lines with different intercepts, but the slopes are still the same. The two variables do not “interact”.

Sometimes we expect variables to interact. We’ll get to this next week.

Now let's add brick to the regression of price on size, nbath, and nbed:

+/- 2se = 35.2

Adding brick seems to be a good idea!!

This is a really useful technique! Suppose my model is:

Bwghti = α + β1 Faminci + β2 Cigsi + ei

Bwghti = birthweight of ith newborn, in ounces
Faminci = annual income of family i, in thousands of $
Cigsi = 1 if mother smoked during pregnancy, 0 otherwise

What is this telling me about the impact of smoking on infant health?

Why do you think it is important to have family income in the model?

Adding a Categorical x in General

Let's regress price on size and neighborhood. This time let's use StatPro's data utility for creating dummies:

StatPro / Data Utilities / Create Dummy Variable(s)

I used StatPro to create one dummy for each of the neighborhoods. E.g., Nbhd_1 indicates whether the house is in neighborhood 1 or not.

Here’s the scatterplot broken out by neighborhoods. Again we might want to fit lines with different intercepts, particularly for neighborhood 3.

Now we add any TWO of the three dummies. Given any two, the information in the third is redundant.

Let's first do price on size and neighborhood. Our model:

Pricei = α + β1 sizei + β2 N2i + β3 N3i + ei

where now I've used N2 to denote the dummy for neighborhood 2 and same for 3.

β2: difference in intercepts between neighborhoods 1 and 2
β3: difference in intercepts between neighborhoods 1 and 3

The neighborhood corresponding to the dummy we leave out becomes the "base case" we compare to.
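A sketch of the dummies-with-a-base-case idea in pandas (toy column; `drop_first=True` drops the neighborhood-1 dummy, so neighborhood 1 becomes the base case):

```python
import pandas as pd

# Toy neighborhood column.
df = pd.DataFrame({"Nbhd": [1, 2, 3, 1, 3]})

# One dummy per level, then drop the first: neighborhood 1
# becomes the base case the other coefficients compare to.
dummies = pd.get_dummies(df["Nbhd"], prefix="Nbhd", drop_first=True).astype(int)
print(list(dummies.columns))       # ['Nbhd_2', 'Nbhd_3']
print(dummies["Nbhd_3"].tolist())  # [0, 0, 1, 0, 1]
```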

Let's try it!

For two houses in neighborhoods 1 and 3 of equal size, the house in neighborhood 3 is worth between $34K and $48K more!

+/- 2se = 30.52 !! “Neighborhood effects” are both significant.

Here is fits vs size. Which line corresponds to which neighborhood? Where do you want to live?

Again we assume size and nbhd do not “interact”; i.e., the slopes are the same.

Aside: Omitted Variables

When we just regress price on size, we got a coefficient of 70. With neighborhood effects included, the coefficient on size drops to 46. Why??

Here’s the line we would have fitted if we had ignored neighborhood effects.

With just size, our data says a bigger house costs more, but this is partly because a bigger house is more likely to be in the nicer neighborhood. If we include the neighborhood in the regression then we are controlling for neighborhood effects, so the effect of size looks smaller.

This happens because size and neighborhood are correlated!!

More on this later…

OK, let's try price on size, nbed, nbath, brick, and neighborhood.

Notice, though, that our estimate for the bedrooms coefficient is no longer significantly different from zero. This will often happen in multiple regression.

+/- 2se = 24 !!

Controlling for brick and nbhd makes our predictions much more accurate.

Maybe we don't need bedrooms: dropping bedrooms did not increase se or decrease R-squared, so no need to bother with it.

Summary: Adding a Categorical x

In general to add a categorical x, you can create dummies, one for each possible category (or level as we sometimes call it), and use all but one of the dummies. It does not matter which one you drop for the fit, but the interpretation of the coefficients will depend on which one you choose to drop.

Summary of multiple regression:

Regression finds a linear combination of the x variables that is like y. Compare the plot of price vs. the fitted combination of size, nbath, brick, and nbhd with the plot of price vs. size alone: with more information we can often make more accurate predictions!

The residuals are the part of y not related to the x's.

6. Outliers, Nonlinearities, and Omitted Variables

Regression of murder rates on unemployment rate and the presence of capital punishment (50 states over 3 years).

Murder rate doesn’t seem correlated with capital punishment…

…but seems strongly related to bad economic conditions.

Here’s the scatterplot. The three huge outliers are for Washington DC (1987, 1990, and 1993).

The fit looks better… but be careful interpreting this!! Remember, correlation does not imply causation!!!

Same regression, but without the DC observations:

Murder rates now seem significantly higher in states with capital punishment. Why did this change?

DC did not employ capital punishment.

Whether or not we throw these points out, we need to understand how they affect our results!

The Log Transformation

Suppose we have a multiplicative relationship:

Y = c X^b (1 + ν)

Here (1 + ν) is a multiplicative error; ν is the percentage error. Often we see this: the size of the error is a percentage of the expected response.

This is obviously a nonlinear relationship. Can we estimate this model using linear regression? Yes! Just take the log of both sides:

log(Y) = a + b log(X) + e

where a = log(c) and e = log(1 + ν).

We can estimate a multiplicative relationship by regressing the log of Y on the log of X.

Key Idea: Taking the logs turns these nonlinear relationships into linear ones in terms of the transformed variables. It also takes a multiplicative (percentage) error and turns it into the additive error of the regression model.

In practice, logging y and/or x can also often be a good cure for heteroskedasticity. And you can always do several different transformations and compare them.

FACT: When we do a linear regression, Y = a + bX + e, b tells us: “when X changes by one UNIT, by how many UNITS does Y change?”

When we do the same regression in logs, log(Y) = a + b log(X) + e, b tells us: “when X changes by one PERCENT, what is the PERCENTAGE change in Y?”

One reason we might want to do this is to avoid making negative predictions (e.g., housing prices).
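A quick simulation sketch of the log-log trick, on toy data with a known exponent (not the text's dataset):

```python
import numpy as np

# Multiplicative model Y = c * X^b * (1 + nu) with c = 2, b = 0.75
# and a small percentage error nu.
rng = np.random.default_rng(3)
x = rng.uniform(1, 100, size=200)
y = 2.0 * x**0.75 * (1 + rng.normal(scale=0.05, size=200))

# Regress log(y) on log(x): the slope estimates the elasticity b.
A = np.column_stack([np.ones_like(x), np.log(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(b_hat)  # should land close to the true b = 0.75
```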

Example

Goal: relate the brain weight of a mammal to its body weight.

Each observation corresponds to a mammal species.

y: brain weight (grams)

x: body weight (grams)

Does a linear model make sense? Plotting log y vs. log x looks pretty nice!! Let's try a linear regression of log brain weight on log body weight…

In the plot of standardized residuals vs. fits, the big residual is the chinchilla. Very few people know that the chinchilla is a master race of supreme intelligence.

No. The book I got this from had the chinchilla at 64 grams instead of 6.4 grams (which I found in another book).

The next biggest positive residual is man, which is what we would have expected. (Well, maybe before reality TV.)

We can also model nonlinearities by fitting a polynomial: y = polynomial + error. For example with two x's we might have:

y = α + β1 x1 + β2 x2 + β3 x1² + β4 x2² + β5 x1x2 + e

With many x's there are a lot of possibilities!

Note that product terms give us interaction: it is no longer true that the effect of changing one x does not depend on the value of the others.

Example: Suppose y=salary, x1=education, x2=age

Including x1x2 in the regression allows the returns to education to depend on your age.
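A simulation sketch of that idea (all numbers invented for illustration): include the product educ*age as an extra regressor, and the estimated effect of education then depends on age.

```python
import numpy as np

# Toy salary model with a true education-age interaction of 0.1.
rng = np.random.default_rng(5)
n = 300
educ = rng.uniform(8, 20, size=n)
age = rng.uniform(20, 60, size=n)
salary = 10 + 2 * educ + 0.5 * age + 0.1 * educ * age + rng.normal(size=n)

# Include the product term educ*age as an extra regressor.
X = np.column_stack([np.ones(n), educ, age, educ * age])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)

# With the interaction, the effect of one more year of education
# is coef[1] + coef[3] * age -- it depends on age.
effect_at_30 = coef[1] + coef[3] * 30
effect_at_50 = coef[1] + coef[3] * 50
print(effect_at_30, effect_at_50)
```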

Example: Interactions with Dummy Variables

The housing data again.

(It makes no sense to square or log a dummy!!!)

y: price
s: size
N2: dummy for neighborhood 2
N3: dummy for neighborhood 3

model:

price = α + β1 s + β2 N2 + β3 N3 + β4 (s × N2) + β5 (s × N3) + e

interpret: the dummies shift the intercept (α in neighborhood 1, α + β2 in neighborhood 2, α + β3 in neighborhood 3), and the interaction terms shift the slope (β1 in neighborhood 1, β1 + β4 in neighborhood 2, β1 + β5 in neighborhood 3).

Fits vs size: now we see that the lines don't have to be parallel! But it does not seem that there is much interaction.

On the other hand the lower slope for the "worst" neighborhood makes sense !!

here is the regression output:

What happens if you throw out each variable with t-statistic less than 2?

Omitted Variables

(Why do regression coefficients change?)

Regress price on size:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -10091.130 18966.104 -0.532 0.596

SqFt 70.226 9.426 7.450 1.30e-11 ***

Regress price on size and two dummies for neighborhood:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 21241.174 13133.642 1.617 0.10835

SqFt 46.386 6.746 6.876 2.67e-10 ***

Nbhd2 10568.698 3301.096 3.202 0.00174 **

Nbhd3 41535.306 3533.668 11.754 < 2e-16 ***

When we add the neighborhood, the coefficient for size drops from 70 to 46. Why?

Here the neighborhoods are plotted with different colors and symbols. We also have the regression lines fit with just the points in each neighborhood.

Here’s the line we would have fitted if we had ignored neighborhood effects.

With just size, our data says a bigger house costs more, but this is partly because a bigger house is more likely to be in the nicer neighborhood. If we include the neighborhood in the regression then we are controlling for neighborhood effects, so the effect of size looks smaller.

This happens because size and neighborhood are correlated!!

Key idea:

When we add explanatory variables (x’s) to a regression model, the regression coefficients change. How they change depends on how the x’s are correlated with each other.

The Sales Data (Omitted Variable Bias)

In this data we have weekly observations on:

sales: firm’s sales in units (in excess of base level)
p1 = price charged by our firm: $ (in excess of base)
p2 = competitor’s price: $ (in excess of base)

p1        p2        Sales
5.13567   5.2042    144.49
3.49546   8.0597    637.25
7.27534   11.6760   620.79
4.66282   8.3644    549.01
...

(each row corresponds to a week)

If we regress Sales on own price (p1), we obtain the very surprising conclusion that a higher price is associated with more sales.

This means we can raise prices and expect to sell more!

WE’LL BE RICH!!

The regression line has a positive slope!!

No. We’ve left something out.

Sales on own price:

The regression equation is

Sales = 211 + 63.7 p1

A multiple regression of Sales on own price (p1) and competitor's price (p2) yields more sensible results. The regression equation is

Sales = 116 - 97.7 p1 + 109 p2

Remember: -97.7 is the effect on sales of a change in p1 with p2 held fixed.

Demand for our product depends on our price AND our competitor’s price, and these prices are correlated!!

We MUST control for our competitor’s price when we ask how changing the price we charge affects sales.

How can we see what is going on?

First look at the scatterplot of p1 versus p2: note the strong relationship between p1 and p2!! In weeks 82 and 99 of our sample, p2 is roughly the same. When we look at the relationship between sales and p1 with p2 held fixed (as in weeks 82 and 99), it is strongly negative!!

Here we select a subset of points where p1 varies and p2 is held (approximately) constant. Looking at those same points on the Sales vs p1 plot, we see that for a fixed level of p2, variations in p1 are negatively correlated with sales!!

In the plot, different colors indicate different ranges of p2: for each fixed level of p2 there is a negative relationship between sales and p1, and larger p1 are associated with larger p2.

Because p1 and p2 are correlated, when we leave p2 out of the regression we get a very misleading answer!

This is known as omitted variable bias.

In this example, since a large p1 is associated with a large p2, when we regress Sales on just p1 it looks like higher p1 makes sales higher.

BUT, with p2 fixed, a larger p1 leads to lower sales!!

Assuming linearity, the multiple regression can “figure out” the effect of p1 with p2 fixed.
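The whole story can be reproduced in a small simulation (coefficients invented for illustration, not estimated from the text's sales data): sales fall in p1 with p2 held fixed, rise in p2, and p1 tracks p2.

```python
import numpy as np

# Simulated prices: our price p1 follows the competitor's price p2.
rng = np.random.default_rng(4)
n = 500
p2 = rng.normal(size=n)
p1 = 0.8 * p2 + 0.2 * rng.normal(size=n)
sales = 100 - 5 * p1 + 10 * p2 + rng.normal(size=n)

# Short regression (p2 omitted): the slope on p1 comes out positive.
X_short = np.column_stack([np.ones(n), p1])
b_short = np.linalg.lstsq(X_short, sales, rcond=None)[0][1]

# Long regression (p2 included): the slope on p1 is near the true -5.
X_long = np.column_stack([np.ones(n), p1, p2])
b_long = np.linalg.lstsq(X_long, sales, rcond=None)[0][1]
print(b_short, b_long)
```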

LOOK AT YOUR DATA !!!

THINK ABOUT WHAT YOUR MODEL IS SAYING !!!

ALL of the statistical tools we’ve talked about are based on models, which make ASSUMPTIONS about the data we see.

If these model assumptions are violated, your results can be highly misleading!!

Looking at your data and thinking carefully about what your model is saying can tell you whether your results are reasonable!!