Loading in 5 sec....

Notes 5: Simple Linear RegressionPowerPoint Presentation

Notes 5: Simple Linear Regression

- By
**keira** - Follow User

- 190 Views
- Updated On :

Notes 5: Simple Linear Regression. 1. The Simple Linear Regression Model 2. Estimates and Plug-in Prediction 3. Confidence Intervals and Hypothesis Tests 4. Fits, residuals, and R-squared. 1. The Simple Linear Regression Model. price: thousands of dollars

Related searches for Notes 5: Simple Linear Regression

Download Presentation
## PowerPoint Slideshow about 'Notes 5: Simple Linear Regression' - keira

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Notes 5: Simple Linear Regression

1. The Simple Linear Regression Model

2. Estimates and Plug-in Prediction

3. Confidence Intervals and Hypothesis Tests

4. Fits, residuals, and R-squared

1. The Simple Linear Regression Model

price: thousands of dollars

size: thousands of square feet

price vs size

from the housing

data we looked

at before.

Two numeric

variables.

We want to

build a formal

probability model

for the variables.

With one numeric variable, we had data { y1, y2, …, yn }

We used an iid normal model:

where each p(y) is the density for Y ~ N(m,s2).

- Why did we do this?
- We have a story for what the next y will be,
- so we can predict the next y.
- We can do statistical inference for the parameters
- of interest, for example the mean E(y) = m

This time, instead of observations on one numeric variable { y1, y2, … , yn } we observe pairs ( xi , yi )

So our data looks like: { (x1,y1), (x2,y2), … , (xn,yn) }

In our housing example, this is a fancy way of saying we observe a sample of n houses and:

xn = square footage of the nth house

yn = price of the nth house

Note: By “simple” regression I mean we are looking at a relationship between two variables. In the next set of notes we do multiple regression (one y, lots of x’s).

To model a single numeric variable we came up with a story for the marginal distribution p(yi).

To model two numeric variables we could come up with a story for the joint distribution p(xi,yi).

If we think of our data as iid draws from p(xi,yi),

There is something called the "bivariate normal distribution"

and people do this, but regression takes a different tack.

Regression looks at the for the marginal distribution p(yconditional distribution of

Y given X.

Instead of coming up with a story for p(x,y):

what do I think the next (x,y) pair will be?

regression just talks about p(y|x):

given a value for x, what will y be ?

We will often call y the dependent variable and x the independent variable, explanatory variable, or sometimes just the regressor.

What kind of model should we use? for the marginal distribution p(y

In the housing data, the "overall linear relationship"

is striking.

Given x, y is approximately a linear function of x.

yi = linear function of xi + “error”

This is the starting point for the linear regression model. We need the “errors” because this linear relationship is not exact. y depends on other things besides x that we don’t see in our sample.

Why? Lots of reasons, but three would be: for the marginal distribution p(y

(i) Sometimes you know x and just need to predict y

as in the housing price problem from the homework.

(ii) As we discussed before, the conditional distribution

is an excellent way to think about the relationship

between two variables.

(iii) Linear relationships are easy to work with and are

a good approximation in lots of real world problems.

Aside for the marginal distribution p(y: What are the potential disadvantages?

- Relationship might not be linear.

This isn’t so bad really. We’ll do multiple regression next time, where

yi = c0 + c1x1i + c2x2i + … + ckxki + “error”

If we want we can define x2i = x1i2, x3i = x1i3, etc.

(II) Which variable is y and which is x??

There is no right answer!! Knowing some theory helps. It also depends on what you want to do.

The Simple Linear Regression Model for the marginal distribution p(y

This is just a linear relationship:

a is the intercept (the value of y when x=0)

b is the slope (change in y when x is

We use the normal distribution to describe the errors!

increased by 1)

Here is a picture of our model. for the marginal distribution p(y

We are “drawing a line” through our data.

But of course our observations don’t all lie

exactly along a straight line!

Y

e for this

observation

Slope

= b

“True" relationship between Y and X.

Intercept

= a

It is described by the parameters a and b.

X

1

How do we get y for the marginal distribution p(y1 from x1 ?

"true" linear

relationship

y1

= a+bx1+ e1

e for this

observation

Realized e1

Also called

the “residual”

a+bx1

X

x1

Of course, the model is "behind the curtain", for the marginal distribution p(y

all we see are the data.

Y

We'll have to estimate

or guess the "true" model

parameters a, b, and s

from the data.

X

The role of for the marginal distribution p(ys

s large

s small

Technically, E(Y|X) = a + bX

and Var(Y|X) = s2

In practice, a and b describe the “regression line”,

and s describes how close the relationship

is to linear, i.e. on average, how big the errors are.

Note that we dropped for the marginal distribution p(y

the subscripts.

(Y instead of Yi).

Here we just write

Y and x.

We must assume that

the model applies

to all (x,y) pairs we have

seen (the data) and

those we wish to think

about in the future.

Another way to think about the model

e independent of X

is:

Given a value of x, Y is normal with

mean: E(Y|X) = a+bx

and variance: Var(Y|X) = s2

If we for the marginal distribution p(yknewa, b, and s, and had a value in mind for x,

can we make a prediction for y?

The “guess” we’d make for y is just a + bx

Since e ~ N(0,s2) there’s a 95% chance y is within +/-2s

y = a + bx

a+bx +2s

a+bx

a+bx -2s

Use the “empirical rule”!

(a + bx) 2s

X

Let’s say: x = 2.2

Given the model, and a value for x, can we predict Y? for the marginal distribution p(y

This is just the “empirical rule” yet again!

Your guess:

How wrong could you be?

(with 95% probability)

Of course we don't know a, b, or s

We have to estimate them from the data!!

Important for the marginal distribution p(y

Interpreting a, b, and s.

Given a value x, we describe what we think we will see for Y as normal with mean a + bx and variance s2.

b tells us: If the value we saw for X was one unit bigger, how much would our prediction for Y change?

a tells us: What would we predict for Y if x=0?

s tells us: If a + bx is our prediction for Y given x, how big is the error associated with this prediction?

2. Estimates and Plug-in Prediction for the marginal distribution p(y

Regression assumes there is a linear relationship between Y and X:

e independent of X

But in practice, we don’t know a, b and s. They are unknown parameters in our model. We have to estimate them using the data we see!

Fortunately, any stats software package will compute these estimates for you.

Here is the output from the regression of price on size for the marginal distribution p(y

in StatPro:

a is our estimate of a

b is our estimate of b

se is our estimate of s.

se

The rest of this stuff we’ll get to later.

a

b

Think of the fitted regression line as an estimate for the marginal distribution p(y

of the true linear relationship between X and Y.

If the fitted line is

y = a + bx

then a is our estimate of a and b is our estimate of b.

Basically we want to pick a and b to give us the line

that “best fits” our data.

The StatPro item "StErr of Est" is our estimate of s.

We'll denote this by se.

I've tried to use the same notation as the AWZ book.

The formula for the slope ( for the marginal distribution p(yb) and the intercept (a) are

Remember, you can think of b as taking covariance and “standardizing” it so its units are ( units of y )/( unit of x )

The formula for a just makes sure our fitted line passes through the point

a for the marginal distribution p(y and b are often called the least squaresestimates of a and b. (See AWZ, p. 564)

Remember mean squared error (MSE) from the homework? Using calculus we can show that these minimize

Define the “residual” ei = yi - (a + bxi)

as the distance between the observed yi and the corresponding point on our fitted line, a + bxi

The least squares line yi = a + bxi minimizes

To get a feel for this, check out this website.

Our estimate of for the marginal distribution p(ys is just the sample standard deviation of the residuals ei

Here we divide by n-2 instead of n-1 for the same technical reasons (to get an unbiased estimator).

se just asks, “on average, how far are our observed values yi away from the line we fitted?”

If we plug in our estimates for the true values for the marginal distribution p(y

then a "plug-in" predictive interval given x is:

y=a+bx+2se

y = a + bx +/- 2se

Suppose we know

x=2.2.

a+bx = 144.41

2se = 44.95

interval for y =

144.41 +/- 44.95

Of course, our

prediction for y depends on x!

y=a+bx

y=a+bx-2se

summary: for the marginal distribution p(y

parameterestimate

aa

bb

s se

plug-in predictive interval given a value for x:

a+bx +/- 2se

3. Confidence Intervals and Hypothesis Tests for the marginal distribution p(y

I randomly picked

10 of the houses

out of our data set.

With just those

10 observations,

I get the solid line

as my estimated

line.

The dashed line

uses all the data.

Which line would you rather use to predict?

(the sold red

points are the selected

subset)

With more data we expect we have a better for the marginal distribution p(y

chance that our estimates will be close to the

true (or "population" values).

The "true line" is the one that "generalizes"

to the size and price of future houses,

not just the ones in our current data.

How big is our error?

We can show that a and b are unbiased estimates

of a and b, and both have normal sampling distributions.

We have standard errors and confidence intervals

for our estimates of the true slope and intercept.

Let s for the marginal distribution p(ya denote the standard error associated with the estimate a.

Let sb denote the standard error associated with the estimate b.

We could write down the formulas for sa and sb but let’s use StatPro.

sa

sb

Notation: it might make more sense to use se(b)

instead of sb, but I am following the book.

95% confidence interval for for the marginal distribution p(ya:

estimate

+/-

2 standard

errors

AGAIN!!!

(in excel)

95% confidence interval for b:

(in excel)

Remember if n is bigger than 30 or so, tval is about 2.

Aside for the marginal distribution p(y:

Our estimates a and b both involve averages of x and y.

Similar to the sample mean, if n is large and the data are

iid, the sampling distributions of a and b will look normal.

(We can use a fancier version of the CLT to prove this.)

So even if e is NOT normal, the CI’s will still be

95% CI for a:a 2sa

95% CI for b: b 2sb

But normality of is still CRUCIAL for prediction!!

Example for the marginal distribution p(y

For the housing data (using all n=128 observations)

the 95% CI for the slope is:

70.23 +/- 2(9.43) = 70.23 +/- 18.86

big !! (what are the units?)

With only 10 observations b=135.50 and s for the marginal distribution p(yb = 49.77.

Note how much bigger the standard error is than

with all 128 observations.

135.5 +/- 2.3*(50)

really big difference !!

Note for the marginal distribution p(y:

It the CI for slope and intercept are big the

plug-in predictive interval can be misleading!!

There are ways to correct for plugging in estimates

(just like we talked about for IID normal prediction),

but the formulas get hairy so we won’t cover this.

The predictive interval just gets bigger.

AND as before, when n is large we are ok just using

a + bx 2se

Note for the marginal distribution p(y: StatPro prints out the 95% CI’s for a and b

Hypothesis tests on coefficients: for the marginal distribution p(y

To test the null hypothesis

vs.

t is the

"t statistic"

Remember

for n > 30

or so, this

just means

“reject if

t > 2” !

We reject at level .05 if

(in Excel)

Otherwise, we fail to reject.

Intuitively, we reject if estimate is more than 2 se's away

from proposed value.

Same thing for the slope: for the marginal distribution p(y

To test the null hypothesis

vs.

We reject at level .05 if

(in Excel)

Otherwise, we fail to reject.

Intuitively, we reject if estimate is more than 2 se's away

from proposed value.

Note for the marginal distribution p(y:

The hypothesis: Ho: b = 0

plays a VERY important role and is often tested.

Why?

If the slope = 0, then the conditional distribution

of Y does not depend on x They are independent !!

(under the assumptions of our model)

This is particularly important in multiple regression.

Example for the marginal distribution p(y

StatPro automatically prints

out the t-statistics for testing

whether the intercept=0 and

whether the slope=0.

To test b=0, the t-statistic is

(b-0)/sb = 70.2263/9.4265 = 7.45

We reject the null at level 5% because the t-stat is bigger

than 2 (in absolute value).

p-values for the marginal distribution p(y

Most regression packages automatically print out

the p-values for the hypotheses that the intercept=0

and that the slope is 0.

That's the p-value column in the StatPro output.

Is the intercept 0?, p-value = .59, fail to reject

Is the slope 0?, p-value = .0000, reject

From a practical standpoint, what does this mean?

Rejecting Ho: b = 0 means that we find evidence that square footage does significantly impact housing price!

Note: for the marginal distribution p(y

For n greater than about 30, the t-stat can be interpreted

as a z-value. Thus we can compute the p-value.

For the intercept:

standard normal pdf

2*(the standard normal cdf at -.53) = .596

which is the p-value given by StatPro.

Example (the market model) for the marginal distribution p(y

In finance, a popular model is to regress stock returns against returns on some market index, such as the S&P 500.

The slope of the regression line, referred to as “beta”, is a measure of how sensitive a stock is to movements in the market.

Usually, a beta less than 1 means the stock is less risky than the market, equal to 1 same risk as the market and greater than 1, riskier than the market.

We will examine the market model for the stock General Electric, using the S&P 500 as a proxy for the market.

Three years of monthly data give 36 observations.

Minitab ouput: Electric, using the S&P 500 as a proxy for the market.

a

sa

b

sb

The regression equation is

ge = 0.00301 + 1.20 sp500

Predictor Coef Stdev t-ratio p

Constant 0.003013 0.006229 0.48 0.632

sp500 1.1995 0.1895 6.33 0.000

s = 0.03454 R-sq = 54.1% R-sq(adj) = 52.7%

se

We can test the hypothesis that the slope is zero: that is, are GE returns related to the market?

Ho: b = 0

The test statistic is Electric, using the S&P 500 as a proxy for the market.

and

so we reject the null hypothesis at level .05. We could have looked at the p-value (which is smaller than .05) and said the same thing right away.

We might also test the hypothesis that GE has the same risk as the market: that is, the slope equals 1.

or Ho: b = 1

Here we subtract 1 from b because we are testing Ho: b=1

The t statistic is:

Now, 1.055 is less than 2.03 so we fail to reject.

What is the p-value ??

What is the confidence interval for the GE beta? as the market: that is, the slope equals 1.

1.2 +\- 2(.19) = [ .82 , 1.58 ]

Question: what does this interval tell us about our level of certainty about the beta for GE?

4. Fits, residuals, and R-squared as the market: that is, the slope equals 1.

Our model is:

We think of each (xi,yi) as having been generated by

part of y that depends on x

part of y that has nothing to do with x

We want to ask, “How well does X explain Y”? We could think about this by “breaking up” Y into two parts:

a + bxi (part that’s explained by x)

ei (part that’s NOT explained)

But remember, we don’t see a or b !!

So let’s suppose we have some data

{ (x1,y1), (x2,y2), … (xn,yn) }

And we’ve “run a regression”. That is, we’ve computed, by hand or using some computer package, the estimates: a, b, se

It turns out to be useful to estimate these two think about this by “breaking up” Y into two parts:

parts for each observation in our sample.

For each (xi,yi) in the data:

y=a+bx

Define:

We have:

“fitted value” for ith observation;

the part of y that’s “explained” by x

residual for ith observation.

Note think about this by “breaking up” Y into two parts:: Regression chooses the “best fitting line”, i.e.

a,b so that:

y vs x

resid off line vs x

reg

line

Intuition:

Model: E(e)=0, cor(x,e)=0

pick a and b so that the

residuals {ei} also have

these properties.

slope

too

big

Note think about this by “breaking up” Y into two parts::

The sample correlation between residuals and fitted values is zero.

So we have:

(1)

(2)

(3)

because resids have 0 sample average think about this by “breaking up” Y into two parts:

because resids and fits have 0 sample correlation.

total variation in y =

variation explained by x + unexplained variation

R-squared think about this by “breaking up” Y into two parts:

the closer R-squared is to 1, the better the fit.

R think about this by “breaking up” Y into two parts:2

R-squared =28036.3627/(28036.3627 + 63648.8516) = 0.3057894

Note think about this by “breaking up” Y into two parts:: For simple linear regression,

R2 is also equal to the square of the correlation between y and x.

line has intercept 0 and slope 1

.553^2 = 0.305809

Note:

R2 = the square of the correlation between

y and the fits !!

Example think about this by “breaking up” Y into two parts:

Housing data from

a different neighborhood.

price: thousands of dollars

size: thousands of square feet

The correlation is .974.

R2 = .974^2= 0.948676

Minitab regression output: think about this by “breaking up” Y into two parts:

The regression equation is

price = 5.76 + 14.8 size

Predictor Coef Stdev t-ratio p

Constant5.763 1.633 3.53 0.003

size14.8159 0.8829 16.78 0.000

s = 2.210 R-sq = 94.9% R-sq(adj) = 94.6%

Analysis of Variance

SOURCE DF SS MS F p

Regression 1 1374.7 1374.7 281.58 0.000

Error 15 73.2 4.9

Total 16 1447.9

Fit Stdev.Fit 95% C.I. 95% P.I.

38.358 0.669 ( 36.932, 39.783) ( 33.436, 43.279)

mintab s = our se

For any x, the plug-in predictive interval has error think about this by “breaking up” Y into two parts:

+/- 2se = +/-4.4 thousands of dollars:

big!!!

Even though R2 is big, we still have a lot

of predictive uncertainty !!!

I think people over-emphasize R2.

I always look at how big se is, or equivalently how

wide the predictive interval will be.

Another example using StatPro: think about this by “breaking up” Y into two parts:

Regression using the beer data.

(Regress nbeer on weight)

Here’s the same file with the regression results color-coded.

Download Presentation

Connecting to Server..