
Notes 5: Simple Linear Regression

1. The Simple Linear Regression Model

2. Estimates and Plug-in Prediction

3. Confidence Intervals and Hypothesis Tests

4. Fits, residuals, and R-squared

Slide 2

1. The Simple Linear Regression Model

price: thousands of dollars
size: thousands of square feet

[Scatterplot: price vs. size from the housing data we looked at before.]

Two numeric variables. We want to build a formal probability model for them.

Slide 3

With one numeric variable, we had data { y1, y2, …, yn }.

We used an iid normal model:

p(y1, y2, …, yn) = p(y1) p(y2) ⋯ p(yn),

where each p(y) is the density for Y ~ N(μ, σ²).

  • Why did we do this?
  • We have a story for what the next y will be, so we can predict the next y.
  • We can do statistical inference for the parameters of interest, for example the mean E(y) = μ.

Slide 4

This time, instead of observations on one numeric variable { y1, y2, … , yn } we observe pairs ( xi , yi )

So our data looks like: { (x1,y1), (x2,y2), … , (xn,yn) }

In our housing example, this is a fancy way of saying we observe a sample of n houses and:

xi = square footage of the ith house

yi = price of the ith house

Note: By “simple” regression I mean we are looking at a relationship between two variables. In the next set of notes we do multiple regression (one y, lots of x’s).

Slide 5

To model a single numeric variable we came up with a story for the marginal distribution p(yi).

To model two numeric variables we could come up with a story for the joint distribution p(xi, yi), thinking of our data as iid draws from it. There is something called the "bivariate normal distribution", and people do this, but regression takes a different tack.

Slide 6

Regression looks at the conditional distribution of Y given X.

Instead of coming up with a story for p(x,y) (what do I think the next (x,y) pair will be?), regression just talks about p(y|x): given a value for x, what will y be?

We will often call y the dependent variable and x the independent variable, explanatory variable, or sometimes just the regressor.

Slide 7

What kind of model should we use?

In the housing data, the "overall linear relationship" is striking. Given x, y is approximately a linear function of x:

yi = linear function of xi + "error"

This is the starting point for the linear regression model. We need the “errors” because this linear relationship is not exact. y depends on other things besides x that we don’t see in our sample.

Slide 8

Why? Lots of reasons, but three would be:

(i) Sometimes you know x and just need to predict y, as in the housing price problem from the homework.

(ii) As we discussed before, the conditional distribution is an excellent way to think about the relationship between two variables.

(iii) Linear relationships are easy to work with and are a good approximation in lots of real world problems.

Slide 9

Aside: What are the potential disadvantages?

(i) The relationship might not be linear.

This isn't so bad really. We'll do multiple regression next time, where

yi = c0 + c1x1i + c2x2i + … + ckxki + "error"

If we want we can define x2i = x1i², x3i = x1i³, etc.

(ii) Which variable is y and which is x?

There is no right answer!! Knowing some theory helps. It also depends on what you want to do.
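To make the polynomial trick concrete, here is a small Python sketch; the data and variable names are made up for illustration:

```python
# Sketch: turning a single regressor x into polynomial regressors
# for multiple regression (x2i = x1i^2, x3i = x1i^3, ...).
x = [1.0, 2.0, 3.0, 4.0]

# Each row holds the regressors for one observation.
X = [[xi, xi**2, xi**3] for xi in x]
print(X[1])  # the row for x = 2.0: [2.0, 4.0, 8.0]
```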

Slide 10

The Simple Linear Regression Model

Yi = α + βxi + εi,  εi ~ N(0, σ²) iid

This is just a linear relationship:

α is the intercept (the value of y when x = 0)

β is the slope (the change in y when x is increased by 1)

We use the normal distribution to describe the errors!

Slide 11

Here is a picture of our model. We are "drawing a line" through our data. But of course our observations don't all lie exactly along a straight line!

[Figure: the "true" relationship between Y and X, a line with intercept α and slope β; the ε for each observation is its vertical distance from the line.]

The "true" relationship between Y and X is described by the parameters α and β.

Slide 12

How do we get y1 from x1?

y1 = α + βx1 + ε1

[Figure: the "true" linear relationship; the realized ε1, also called the "residual", is the vertical distance from y1 to the line.]
Slide 13

Of course, the model is "behind the curtain"; all we see are the data. We'll have to estimate or guess the "true" model parameters α, β, and σ from the data.

Slide 14

The role of σ

[Figure: the same regression line with σ large and with σ small.]

Technically, E(Y|X) = α + βX and Var(Y|X) = σ².

In practice, α and β describe the "regression line", and σ describes how close the relationship is to linear, i.e. on average, how big the errors are.

Slide 15

Another way to think about the model:

Y = α + βx + ε,  ε ~ N(0, σ²),  ε independent of X

Given a value of x, Y is normal with

mean: E(Y|x) = α + βx

and variance: Var(Y|x) = σ²

Note that we dropped the subscripts (Y instead of Yi); here we just write Y and x. We must assume that the model applies to all (x,y) pairs we have seen (the data) and those we wish to think about in the future.

Slide 16

If we knew α, β, and σ, and had a value in mind for x, can we make a prediction for y?

The "guess" we'd make for y is just α + βx.

Since ε ~ N(0, σ²), there's a 95% chance y is within ±2σ of that guess. Use the "empirical rule"!

(α + βx) ± 2σ

[Figure: the line y = α + βx with bands at α + βx + 2σ and α + βx − 2σ; say x = 2.2.]

Slide 17

Given the model, and a value for x, can we predict Y? This is just the "empirical rule" yet again!

Your guess: α + βx.

How wrong could you be? ±2σ (with 95% probability).

Of course we don't know α, β, or σ. We have to estimate them from the data!!

Slide 18

Interpreting α, β, and σ.

Given a value x, we describe what we think we will see for Y as normal with mean α + βx and variance σ².

β tells us: If the value we saw for X was one unit bigger, how much would our prediction for Y change?

α tells us: What would we predict for Y if x = 0?

σ tells us: If α + βx is our prediction for Y given x, how big is the error associated with this prediction?

Slide 19

2. Estimates and Plug-in Prediction

Regression assumes there is a linear relationship between Y and X:

Y = α + βX + ε,  ε ~ N(0, σ²),  ε independent of X

But in practice, we don't know α, β, and σ. They are unknown parameters in our model. We have to estimate them using the data we see!

Fortunately, any stats software package will compute these estimates for you.

Slide 20

Here is the output from the regression of price on size in StatPro:

[StatPro regression output]

a is our estimate of α

b is our estimate of β

se is our estimate of σ

The rest of this stuff we'll get to later.

Slide 21

Think of the fitted regression line as an estimate of the true linear relationship between X and Y.

If the fitted line is

y = a + bx

then a is our estimate of α and b is our estimate of β. Basically we want to pick a and b to give us the line that "best fits" our data.

The StatPro item "StErr of Est" is our estimate of σ. We'll denote this by se.

I've tried to use the same notation as the AWZ book.

Slide 22

The formulas for the slope (b) and the intercept (a) are:

b = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

a = ȳ − b x̄

Remember, you can think of b as taking the covariance and "standardizing" it so its units are ( units of y )/( units of x ).

The formula for a just makes sure our fitted line passes through the point (x̄, ȳ).
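As a sanity check, these formulas are easy to compute directly. This Python sketch uses a made-up toy data set:

```python
# Least squares slope and intercept from the formulas above
# (data is illustrative, not from the notes).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# b = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
# a = ybar - b * xbar, so the fitted line passes through (xbar, ybar)
a = ybar - b * xbar
print(round(a, 4), round(b, 4))  # 1.3 0.9
```

Note that a + b·x̄ equals ȳ exactly, which is the point of the intercept formula.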

Slide 23

a and b are often called the least squares estimates of α and β. (See AWZ, p. 564.)

Define the "residual" ei = yi − (a + bxi) as the distance between the observed yi and the corresponding point on our fitted line, a + bxi.

Remember mean squared error (MSE) from the homework? Using calculus we can show that the least squares line ŷi = a + bxi minimizes the sum of squared residuals:

Σ ei² = Σ (yi − (a + bxi))²

To get a feel for this, check out this website.

Slide 24

Our estimate of σ is just the sample standard deviation of the residuals ei:

se = √( Σ ei² / (n − 2) )

Here we divide by n − 2 instead of n − 1 for the same technical reasons (to get an unbiased estimator).

se just asks, "on average, how far are our observed values yi away from the line we fitted?"
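As an illustration on made-up data (a = 1.3 and b = 0.9 are the least squares estimates for this particular toy set):

```python
# se: sample standard deviation of the residuals, dividing by n - 2.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
a, b = 1.3, 0.9  # least squares estimates for this toy data

resids = [y - (a + b * x) for x, y in zip(xs, ys)]
se = math.sqrt(sum(e * e for e in resids) / (len(xs) - 2))
print(round(se, 4))  # 0.7958
```

The residuals from a least squares fit always average to zero, which the test below checks.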

Slide 25

If we plug in our estimates (a, b, se) for the true values (α, β, σ), then a "plug-in" predictive interval given x is:

y = a + bx ± 2se

Suppose we know a + bx = 144.41 and 2se = 44.95. Then the interval for y is 144.41 ± 44.95.

Of course, our prediction for y depends on x!
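Plugging in the numbers from this slide (price in thousands of dollars):

```python
# Plug-in predictive interval: (a + b*x) - 2*se to (a + b*x) + 2*se,
# using the slide's values a + b*x = 144.41 and 2*se = 44.95.
fit = 144.41
half_width = 44.95
interval = (fit - half_width, fit + half_width)
print(round(interval[0], 2), round(interval[1], 2))  # 99.46 189.36
```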



Slide 26

[Figure: the fitted line with plug-in predictive bands; σ is estimated by se.]

plug-in predictive interval given a value for x: a + bx ± 2se

Slide 27

3. Confidence Intervals and Hypothesis Tests

I randomly picked 10 of the houses out of our data set. With just those 10 observations, I get the solid line as my estimated regression line. The dashed line uses all the data. (The solid red points are the selected houses.)

Which line would you rather use to predict?

Slide 28

With more data we expect we have a better chance that our estimates will be close to the true (or "population") values. The "true line" is the one that "generalizes" to the size and price of future houses, not just the ones in our current data.

How big is our error?

We can show that a and b are unbiased estimates of α and β, and both have normal sampling distributions. We have standard errors and confidence intervals for our estimates of the true slope and intercept.

Slide 29

Let sa denote the standard error associated with the estimate a.

Let sb denote the standard error associated with the estimate b.

We could write down the formulas for sa and sb, but let's use StatPro.

Notation: it might make more sense to use se(b) instead of sb, but I am following the book.

Slide 30

95% confidence interval for α: a ± tval × sa (tval computed in Excel)

95% confidence interval for β: b ± tval × sb (tval computed in Excel)

This is the estimate plus or minus about 2 standard errors. Remember, if n is bigger than 30 or so, tval is about 2.

Slide 31

Our estimates a and b both involve averages of x and y. Similar to the sample mean, if n is large and the data are iid, the sampling distributions of a and b will look normal. (We can use a fancier version of the CLT to prove this.)

So even if ε is NOT normal, the CIs will still be approximately right:

95% CI for α: a ± 2sa

95% CI for β: b ± 2sb

But normality of ε is still CRUCIAL for prediction!!

Slide 32


For the housing data (using all n=128 observations)

the 95% CI for the slope is:

70.23 +/- 2(9.43) = 70.23 +/- 18.86

big !! (what are the units?)
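In Python, with the estimates from the slide. (On the units question: price is in thousands of dollars and size in thousands of square feet, so the slope is in thousands of dollars per thousand square feet.)

```python
# 95% CI for the slope using all n = 128 houses: b ± 2*sb.
b, sb = 70.23, 9.43
ci = (b - 2 * sb, b + 2 * sb)
print(round(ci[0], 2), round(ci[1], 2))  # 51.37 89.09
```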

Slide 33

With only 10 observations b=135.50 and sb = 49.77.

Note how much bigger the standard error is than

with all 128 observations.

135.5 +/- 2.3*(50)

really big difference !!

Slide 34

If the CIs for the slope and intercept are big, the plug-in predictive interval can be misleading!!

There are ways to correct for plugging in estimates (just like we talked about for iid normal prediction), but the formulas get hairy so we won't cover this. The predictive interval just gets bigger.

AND as before, when n is large we are ok just using

a + bx ± 2se

Slide 35

Note: StatPro prints out the 95% CIs for α and β.

Slide 36

Hypothesis tests on coefficients:

To test the null hypothesis Ho: α = α0, the "t statistic" is

t = (a − α0) / sa

We reject at level .05 if |t| > tval (computed in Excel). Otherwise, we fail to reject. For n > 30 or so, this just means "reject if |t| > 2"!

Intuitively, we reject if the estimate is more than 2 se's away from the proposed value.

Slide 37

Same thing for the slope: to test the null hypothesis Ho: β = β0, the t statistic is

t = (b − β0) / sb

We reject at level .05 if |t| > tval (computed in Excel). Otherwise, we fail to reject.

Intuitively, we reject if the estimate is more than 2 se's away from the proposed value.

Slide 38

The hypothesis Ho: β = 0 plays a VERY important role and is often tested.

If the slope = 0, then the conditional distribution of Y does not depend on x. They are independent!! (under the assumptions of our model)

This is particularly important in multiple regression.

Slide 39

StatPro automatically prints out the t-statistics for testing whether the intercept = 0 and whether the slope = 0.

To test β = 0, the t-statistic is

(b − 0)/sb = 70.2263/9.4265 = 7.45

We reject the null at level 5% because the t-stat is bigger than 2 (in absolute value).
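The same t-statistic in Python:

```python
# t-statistic for Ho: beta = 0 with the housing estimates from the slide.
b, sb = 70.2263, 9.4265
t = (b - 0) / sb
print(round(t, 2))  # 7.45
```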

Slide 40

Most regression packages automatically print out the p-values for the hypotheses that the intercept = 0 and that the slope = 0. That's the p-value column in the StatPro output.

Is the intercept 0? p-value = .59, fail to reject.

Is the slope 0? p-value = .0000, reject.

From a practical standpoint, what does this mean? Rejecting Ho: β = 0 means that we find evidence that square footage does significantly impact housing price!

Slide 41

For n greater than about 30, the t-stat can be interpreted as a z-value. Thus we can compute the p-value.

For the intercept:

[Figure: standard normal pdf with the tails beyond ±0.53 shaded.]

2 × (the standard normal cdf at −0.53) = .596,

which is the p-value given by StatPro.
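This normal-approximation p-value can be checked with Python's standard library:

```python
# Two-sided p-value for the intercept's t-stat (about -0.53), using the
# standard normal cdf; fine here since n = 128 > 30.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t = -0.53
p = 2 * norm_cdf(-abs(t))
print(round(p, 3))  # 0.596
```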

Slide 42

Example (the market model)

In finance, a popular model is to regress stock returns against returns on some market index, such as the S&P 500.

The slope of the regression line, referred to as “beta”, is a measure of how sensitive a stock is to movements in the market.

Usually, a beta less than 1 means the stock is less risky than the market; equal to 1, the same risk as the market; and greater than 1, riskier than the market.

Slide 43

We will examine the market model for the stock General Electric, using the S&P 500 as a proxy for the market.

Three years of monthly data give 36 observations.

Slide 44

Minitab output:





The regression equation is

ge = 0.00301 + 1.20 sp500

Predictor Coef Stdev t-ratio p

Constant 0.003013 0.006229 0.48 0.632

sp500 1.1995 0.1895 6.33 0.000

s = 0.03454 R-sq = 54.1% R-sq(adj) = 52.7%


We can test the hypothesis that the slope is zero: that is, are GE returns related to the market?

Ho: b = 0

Slide 45

The test statistic is

t = (b − 0)/sb = 1.1995/0.1895 = 6.33

so we reject the null hypothesis at level .05. We could have looked at the p-value (which is smaller than .05) and said the same thing right away.

Slide 46

We might also test the hypothesis that GE has the same risk as the market: that is, the slope equals 1, or Ho: β = 1.

The t statistic is:

t = (b − 1)/sb = (1.1995 − 1)/0.1895 ≈ 1.055

Here we subtract 1 from b because we are testing Ho: β = 1.

Now, 1.055 is less than 2.03 so we fail to reject.

What is the p-value ??
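A quick check in Python. The rounded Minitab values give t ≈ 1.053 rather than the slide's 1.055 (which presumably comes from unrounded values), and the normal approximation answers the p-value question:

```python
# t-statistic and approximate p-value for Ho: beta = 1 (GE has the
# same risk as the market), using the rounded Minitab estimates.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

b, sb = 1.1995, 0.1895
t = (b - 1) / sb          # subtract 1 because we are testing beta = 1
p = 2 * norm_cdf(-abs(t)) # two-sided normal-approximation p-value
print(round(t, 3), round(p, 2))  # t about 1.053, p about 0.29
```

Since p is well above .05 (and |t| < 2.03), we fail to reject.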

Slide 47

What is the confidence interval for the GE beta?

1.2 +\- 2(.19) = [ .82 , 1.58 ]

Question: what does this interval tell us about our level of certainty about the beta for GE?

Slide 48

4. Fits, residuals, and R-squared

Our model is:

yi = α + βxi + εi

We think of each (xi, yi) as having been generated by α + βxi, the part of y that depends on x, plus εi, the part of y that has nothing to do with x.

Slide 49

We want to ask, "How well does X explain Y?" We could think about this by "breaking up" Y into two parts:

α + βxi (the part that's explained by x)

εi (the part that's NOT explained)

But remember, we don't see α or β!!

So let's suppose we have some data

{ (x1,y1), (x2,y2), … , (xn,yn) }

and we've "run a regression". That is, we've computed, by hand or using some computer package, the estimates: a, b, se.

Slide 50

It turns out to be useful to estimate these two parts for each observation in our sample.

For each (xi, yi) in the data we have:

ŷi = a + bxi, the "fitted value" for the ith observation; the part of y that's "explained" by x

ei = yi − ŷi, the residual for the ith observation.

Slide 51

Fits and residuals for the housing data.

[Figure: fitted values and residuals plotted for the housing data.]

Slide 52

Note: Regression chooses the "best fitting line", i.e. a and b, so that the residuals behave like the model's errors.

[Figure: y vs. x with the fitted line; each residual is the distance off the line, plotted vs. x.]

Model: E(ε) = 0, cor(x, ε) = 0. We pick a and b so that the residuals {ei} also have these properties: they average to zero and have zero sample correlation with x.

Slide 53

The sample correlation between the residuals and the fitted values is also zero.

So we have the decomposition:

yi = ŷi + ei

Slide 54

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ ei²

(The cross terms drop out because the residuals have 0 sample average, and because the residuals and fits have 0 sample correlation.)

total variation in y = variation explained by x + unexplained variation
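The decomposition can be verified numerically on a made-up toy data set (a = 1.3 and b = 0.9 are its least squares estimates):

```python
# Check: total variation = explained variation + unexplained variation.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]
a, b = 1.3, 0.9  # least squares estimates for this toy data

ybar = sum(ys) / len(ys)
fits = [a + b * x for x in xs]
resids = [y - f for y, f in zip(ys, fits)]

sst = sum((y - ybar) ** 2 for y in ys)    # total variation in y
ssr = sum((f - ybar) ** 2 for f in fits)  # explained by x
sse = sum(e ** 2 for e in resids)         # unexplained
print(round(sst, 6), round(ssr + sse, 6))  # 10.0 10.0
```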

Slide 55

R² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)² = explained variation / total variation

The closer R-squared is to 1, the better the fit.

Slide 56


R-squared =28036.3627/(28036.3627 + 63648.8516) = 0.3057894
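In Python, with the slide's sums of squares:

```python
# R-squared = explained variation / total variation.
ssr = 28036.3627   # explained sum of squares
sse = 63648.8516   # unexplained (residual) sum of squares
r2 = ssr / (ssr + sse)
print(round(r2, 7))  # 0.3057894
```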

Slide 57

Note: For simple linear regression, R² is also equal to the square of the correlation between y and x:

.553² = 0.305809

[Figure: y plotted against the fitted values; the line has intercept 0 and slope 1.]

R² = the square of the correlation between y and the fits !!
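A numerical check on a made-up toy data set (its least squares fit is a = 1.3, b = 0.9; both squared correlations equal the R² of 0.81 for this data):

```python
# In simple regression, R^2 = corr(y, x)^2 = corr(y, fits)^2.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]

def corr(u, v):
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    cov = sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))
    return cov / math.sqrt(sum((ui - ub) ** 2 for ui in u) *
                           sum((vi - vb) ** 2 for vi in v))

a, b = 1.3, 0.9  # least squares estimates for this toy data
fits = [a + b * x for x in xs]

r2_from_x = corr(xs, ys) ** 2
r2_from_fits = corr(fits, ys) ** 2
print(round(r2_from_x, 4), round(r2_from_fits, 4))  # 0.81 0.81
```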

Slide 58

Housing data from a different neighborhood.

price: thousands of dollars

size: thousands of square feet

The correlation is .974.

R2 = .974^2= 0.948676

Slide 59

Minitab regression output:

The regression equation is
price = 5.76 + 14.8 size

Predictor      Coef    Stdev   t-ratio       p
Constant      5.763    1.633      3.53   0.003
size        14.8159   0.8829     16.78   0.000

s = 2.210   R-sq = 94.9%   R-sq(adj) = 94.6%

Analysis of Variance

Source       DF       SS       MS        F       p
Regression    1   1374.7   1374.7   281.58   0.000
Error        15     73.2      4.9
Total        16   1447.9

Fit     Stdev.Fit   95% C.I.            95% P.I.
38.358      0.669   (36.932, 39.783)    (33.436, 43.279)

Minitab's s = our se

Slide 60

For any x, the plug-in predictive interval has error

+/- 2se = +/-4.4 thousands of dollars:


Even though R2 is big, we still have a lot

of predictive uncertainty !!!

I think people over-emphasize R2.

I always look at how big se is, or equivalently how

wide the predictive interval will be.

Slide 61

Another example using StatPro:

Regression using the beer data.

(Regress nbeer on weight)

Here’s the same file with the regression results color-coded.
