
STATS 330: Lecture 6

Inference for the

Multiple Regression Model

330 Lecture 6


Inference for the Regression model

Aim of today’s lecture:

To discuss how we assess the significance of variables in the regression

Key concepts:

  • Standard errors

  • Confidence intervals for the coefficients

  • Tests of significance

    Reference: Coursebook Section 3.2

Variability of the regression coefficients

  • Imagine that we keep the x’s fixed, but resample the errors and refit the plane. How much would the plane (estimated coefficients) change?

  • This gives us an idea of the variability (accuracy) of the estimated coefficients as estimates of the coefficients of the true regression plane.

The regression model (cont)

  • The data is scattered above and below the plane:

  • The size of the “sticks” is random, controlled by σ², and doesn’t depend on x1, x2

Variability of coefficients (2)

  • Variability depends on

    • The arrangement of the x’s (the more correlation, the more change, see Lecture 8)

    • The error variance (the more scatter about the true plane, the more the fitted plane changes)

  • Measure variability by the standard error of the coefficients

Cherries

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

The standard errors of the coefficients are in the "Std. Error" column.
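The cherry data in these slides are the classic black cherry tree data, available in R as the built-in `trees` dataset (the same 31 trees, with `Girth` in place of `diameter`), so the output above can be reproduced without the course files. A minimal sketch:

```r
# Refit the cherry-tree regression using R's built-in trees data
# (Girth = diameter in inches, Height in feet, Volume in cubic feet)
cherry.lm <- lm(Volume ~ Girth + Height, data = trees)

# The "Std. Error" column of the coefficient table
std.errs <- summary(cherry.lm)$coefficients[, "Std. Error"]
print(round(std.errs, 4))   # matches the Std. Error column above
```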

Confidence interval

Confidence interval (2)

A 95% confidence interval for a regression coefficient is of the form

    estimated coefficient ± standard error × t

where t is the 97.5% point of the appropriate t-distribution. The degrees of freedom are n − k − 1, where n is the number of cases (observations) in the regression and k is the number of variables (assuming we have a constant term).
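For the cherry data this can be checked by hand from the numbers printed in the summary output (diameter estimate 4.7082, standard error 0.2643, n − k − 1 = 31 − 2 − 1 = 28 degrees of freedom). A sketch:

```r
# 97.5% point of the t-distribution with 28 degrees of freedom
t.mult <- qt(0.975, df = 28)            # about 2.048

# estimated coefficient +/- standard error x t
ci <- 4.7082 + c(-1, 1) * t.mult * 0.2643
print(round(ci, 4))                     # about (4.1668, 5.2495)
```

This agrees (up to rounding of the printed estimate) with the `confint` output on the next slide.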

Example: cherry trees

Use the function confint:

> confint(cherry.lm)
                   2.5 %      97.5 %
(Intercept) -75.68226247 -40.2930554
diameter      4.16683899   5.2494820
height        0.07264863   0.6058538

(cherry.lm is the object created by lm.)

Hypothesis test

  • Often we ask “do we need a particular variable, given the others are in the model?”

  • Note that this is not the same as asking “is a particular variable related to the response?”

  • Can test the former by examining the ratio of the coefficient to its standard error

Hypothesis test (2)

  • The test statistic is the t-statistic

    t = estimated coefficient / standard error

  • The bigger t is (in absolute value), the more we need the variable

  • Equivalently, the smaller the p-value, the more we need the variable
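Using the height coefficient from the cherry output (estimate 0.3393, standard error 0.1302, 28 degrees of freedom), the t-statistic and its two-sided p-value can be computed directly. A sketch:

```r
# t-statistic: coefficient divided by its standard error
t.stat <- 0.3393 / 0.1302               # about 2.607

# two-sided p-value: area in both tails of t with 28 df
p.val <- 2 * pt(-abs(t.stat), df = 28)  # about 0.0145
print(round(c(t.stat, p.val), 4))
```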

Cherries

The t-values and p-values are the "t value" and "Pr(>|t|)" columns:

Call:
lm(formula = volume ~ diameter + height, data = cherry.df)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

All variables are required, since the p-values are small (< 0.05).

P-value

[Figure: density curve of the t-distribution with 28 degrees of freedom. The p-value 0.0145 is the total area in the two tails beyond −2.607 and 2.607.]

Other hypotheses

  • Overall significance of the regression: do none of the variables have a relationship with the response?

  • Use the F statistic: the bigger F, the more evidence that at least one variable has a relationship

    • equivalently, the smaller the p-value, the more evidence that at least one variable has a relationship

Cherries

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
diameter      4.7082     0.2643  17.816  < 2e-16 ***
height        0.3393     0.1302   2.607   0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

The F-value and its p-value appear in the last line of the output.
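The overall F-statistic can be recovered from R² in the same output: with k = 2 variables and n − k − 1 = 28 residual degrees of freedom, F = (R²/k) / ((1 − R²)/(n − k − 1)). A sketch using the printed R² = 0.948:

```r
# Overall F from R-squared: (R^2/k) / ((1 - R^2)/(n - k - 1))
R2 <- 0.948; k <- 2; df.res <- 28
F.stat <- (R2 / k) / ((1 - R2) / df.res)        # about 255
p.val  <- pf(F.stat, k, df.res, lower.tail = FALSE)
print(F.stat)   # p.val is essentially zero (< 2.2e-16)
```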

Testing if a subset is required

  • Often we want to test if a subset of variables is unnecessary

  • Terminology:

    Full model: model with all the variables

    Sub-model: model with a set of variables deleted.

  • The test is based on comparing the residual sum of squares (RSS) of the submodel with the RSS of the full model. The full model RSS is always smaller (why?)

Testing if a subset is adequate (2)

  • If the full model RSS is not much smaller than the submodel RSS, the submodel is adequate: we don’t need the extra variables.

  • To do the test, we

    • Fit both models, get RSS for both.

    • Calculate test statistic (see next slide)

    • If the test statistic is large (equivalently, if the p-value is small), the submodel is not adequate

Test statistic

  • The test statistic is

    F = [ (RSS(submodel) − RSS(full model)) / d ] / s²

  • d is the number of variables dropped

  • s² is the estimate of σ² from the full model (the residual mean square)

  • R has a function anova to do the calculations

P-values

  • When the smaller model is correct, the test statistic has an F-distribution with d and n-k-1 degrees of freedom

  • We assess if the value of F calculated from the sample is a plausible value from this distribution by means of a p-value

  • If the p-value is too small, we reject the hypothesis that the submodel is ok

P-values (cont)

[Figure: density curve of the F-distribution; the p-value is the area under the curve beyond the observed value of F.]

Example

  • Free fatty acid data: use physical measures to model a biochemical parameter in overweight children

  • Variables are

    • FFA: free fatty acid level in blood (response)

    • Age (months)

    • Weight (pounds)

    • Skinfold thickness (inches)

Data

  ffa age weight skinfold
0.759 105     67     0.96
0.274 107     70     0.52
0.685 100     54     0.62
0.526 103     60     0.76
0.859  97     61     1.00
0.652 101     62     0.74
0.349  99     71     0.76
1.120 101     48     0.62
1.059 107     59     0.56
1.035 100     51     0.44
... (20 observations in all)

Analysis (1)

> model.full <- lm(ffa ~ age + weight + skinfold, data = fatty.df)
> summary(model.full)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.95777    1.40138   2.824  0.01222 *
age         -0.01912    0.01275  -1.499  0.15323
weight      -0.02007    0.00613  -3.274  0.00478 **
skinfold    -0.07788    0.31377  -0.248  0.80714

This suggests that

  • age is not required if weight, skinfold retained,

  • skinfold is not required if weight, age retained

    Can we get away with just weight?

Analysis (2)

> model.sub <- lm(ffa ~ weight, data = fatty.df)
> anova(model.sub, model.full)
Analysis of Variance Table

Model 1: ffa ~ weight
Model 2: ffa ~ age + weight + skinfold
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     18 0.91007
2     16 0.79113  2   0.11895 1.2028 0.3261

The small F and large p-value suggest that weight alone is adequate. But the test should be interpreted with caution, since we "pretested" by first looking at the individual coefficients.
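The F and p-value in this anova table can be reproduced from the two residual sums of squares alone, using the test statistic defined earlier (d = 2 variables dropped; s² is the residual mean square of the full model). A sketch:

```r
rss.sub  <- 0.91007   # RSS of the submodel (18 residual df)
rss.full <- 0.79113   # RSS of the full model (16 residual df)
d  <- 2               # variables dropped (age, skinfold)
s2 <- rss.full / 16   # residual mean square of the full model

# F = [ (RSS(sub) - RSS(full)) / d ] / s^2
F.stat <- ((rss.sub - rss.full) / d) / s2
p.val  <- pf(F.stat, d, 16, lower.tail = FALSE)
print(round(c(F.stat, p.val), 4))   # about 1.2028, 0.3261
```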

Testing a combination of coefficients

  • Cherry trees: our model is V = c D^β1 H^β2, or

    log(V) = β0 + β1 log(D) + β2 log(H)

  • Dimensional analysis suggests β1 + β2 = 3

  • How can we test this?

  • The test statistic is

    t = (b1 + b2 − 3) / se(b1 + b2)

    where b1, b2 are the estimated coefficients

  • The p-value is the area under the t-curve beyond ±t

Testing a combination (cont)

  • We can use the “R330” function test.lc to compute the value of t:

> cherry.lm = lm(log(volume)~log(diameter)+log(height),data=cherry.df)

> cc = c(0,1,1)

> c = 3

> test.lc(cherry.lm,cc,c)

$est

[1] 3.099773

$std.err

[1] 0.1765222

$t.stat

[1] 0.5652165

$df

[1] 28

$p.val

[1] 0.5764278
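The t.stat and p.val in this output follow from $est and $std.err exactly as on the previous slide; we can check by hand using only the printed numbers, assuming nothing about test.lc's internals:

```r
est <- 3.099773; se <- 0.1765222; c0 <- 3

# t = (estimate - hypothesised value) / standard error
t.stat <- (est - c0) / se                # about 0.5652
p.val  <- 2 * pt(-abs(t.stat), df = 28)  # about 0.5764
print(round(c(t.stat, p.val), 4))
```

Since the p-value is large, the data are consistent with β1 + β2 = 3.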

The “R330” package

  • A set of functions written for the course, in the form of an R package

  • Install the package using the R packages menu (see coursebook for details). Then type

    library(R330)

Testing a combination (cont)

  • In general, we might want to test

    c0 β0 + c1 β1 + c2 β2 = c

    (in our example c0 = 0, c1 = 1, c2 = 1, c = 3)

  • The estimate is the same combination of the fitted coefficients:

    est = c0 b0 + c1 b1 + c2 b2

  • The test statistic is

    t = (est − c) / se(est)

    referred to a t-distribution with n − k − 1 degrees of freedom
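The general test can be carried out with vcov(): if V is the estimated covariance matrix of the coefficients, then est = c'b and se(est) = sqrt(c' V c). A sketch reproducing the test.lc result, using R's built-in trees dataset (the cherry data) in place of cherry.df:

```r
# Log-log cherry model; trees is R's built-in cherry-tree data
fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

cc <- c(0, 1, 1)   # coefficients of the linear combination
c0 <- 3            # hypothesised value

est    <- sum(cc * coef(fit))                     # c'b
se     <- sqrt(drop(t(cc) %*% vcov(fit) %*% cc))  # sqrt(c'Vc)
t.stat <- (est - c0) / se
p.val  <- 2 * pt(-abs(t.stat), df = df.residual(fit))
print(round(c(est, se, t.stat, p.val), 4))
```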
