Stats 330: Lecture 19

Models with many continuous and categorical explanatory variables


Plan of the day

In today’s lecture, we look at some general strategies for choosing models that have many continuous and categorical explanatory variables, and discuss an example.


General Principle

  • For a problem with both continuous and categorical explanatory variables, the most general model fits a separate regression for each possible combination of the factor levels.

  • That is, we allow the categorical variables to interact with each other and with the continuous variables.


Illustration

  • Two factors A and B, two continuous explanatory variables X and Z

  • General model is

    y ~ A*B*X + A*B*Z

  • Suppose A has a levels and B has b levels, so there are a × b factor level combinations

  • Each combination has a separate regression with 3 parameters

    • Constant term

    • Coefficient of X

    • Coefficient of Z
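
As a quick check of the parameter count, the sketch below fits the general model to simulated data and counts the coefficients. The data frame sim.df, the choice a = 3 and b = 2, and the five observations per combination are assumptions made purely for illustration.

```r
# Simulated data (hypothetical): A has a = 3 levels, B has b = 2 levels,
# with 5 observations in each of the a x b factor level combinations.
set.seed(330)
a <- 3
b <- 2
sim.df <- expand.grid(A = factor(1:a), B = factor(1:b), rep = 1:5)
n <- nrow(sim.df)
sim.df$X <- rnorm(n)
sim.df$Z <- rnorm(n)
sim.df$y <- rnorm(n)

# General model: a separate plane for each A-B combination
general.lm <- lm(y ~ A*B*X + A*B*Z, data = sim.df)

# One constant, one X coefficient and one Z coefficient per combination
n.coef <- length(coef(general.lm))
n.coef == 3 * a * b
```

With these settings the fit has 18 coefficients, i.e. 3 for each of the 6 combinations.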


Illustration (Cont)

  • There are a × b constant terms, which we can arrange in a table

  • Can split the table up into main effects and interactions, as in two-way anova

  • Listed in output as Intercept, A, B and A:B


Illustration (Cont)

  • There are a × b X-coefficients, which we can also arrange in a table

  • Again, we can split the table up into main effects and interactions, as in two-way anova

  • Listed in output as X, A:X, B:X and A:B:X

  • Ditto for Z

  • If all the A:X, B:X and A:B:X interactions are zero, the coefficient of X is the same for all a × b regressions
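
One way to check whether a common X coefficient is adequate is to compare the model with separate slopes against the model with a single slope. The sketch below does this on simulated data; sim.df and its true coefficients are made up for illustration, and only one continuous variable is used to keep it short.

```r
set.seed(330)
a <- 3
b <- 2
sim.df <- expand.grid(A = factor(1:a), B = factor(1:b), rep = 1:10)
sim.df$X <- rnorm(nrow(sim.df))
# The simulated truth has a common slope of 2 for X in every combination
sim.df$y <- 1 + as.numeric(sim.df$A) + 2 * sim.df$X + rnorm(nrow(sim.df))

separate.lm <- lm(y ~ A*B*X, data = sim.df)    # a x b different slopes
common.lm   <- lm(y ~ A*B + X, data = sim.df)  # one common slope

# F test that the A:X, B:X and A:B:X terms are all zero (a*b - 1 = 5 df)
slope.test <- anova(common.lm, separate.lm)
slope.test
```

A large p-value in the second row gives no evidence against the common-slope model.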


Model selection

  • In these situations, the number of possible models is large

  • Need variable selection techniques

    • Anova

    • stepwise

  • Don’t include a high-order interaction unless you also include the lower-order interactions and main effects it contains


Caution

  • Sometimes we don’t have enough data to fit a separate regression to each factor level combination (need at least one more data point than number of continuous variables per combination)

  • In this case we drop out the higher level interactions, forcing coefficients to have common values.


Example: Risk factors for low birthweight

These data were collected at Baystate Medical Center, Springfield, Mass. during 1986, as part of a study to identify risk factors for low-birthweight babies.

The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables.


Variables

age: mother's age in years, continuous

lwt: mother's weight in pounds, continuous

race: mother's race (1 = white, 2 = black, 3 = other), factor

smoke: smoking during pregnancy (1 = smoked, 0 = didn’t smoke), factor

ht: history of hypertension (0=No, 1=Yes), factor

ui: presence of uterine irritability (0=No, 1=Yes), factor

bwt: birth weight in grams, continuous, response

Must be a factor!!
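
The sketch below shows why the conversion matters for a variable coded with numbers, such as race. The data frame toy.df and its values are made up for illustration only.

```r
# Tiny hypothetical data frame; values are made up for illustration only
toy.df <- data.frame(bwt  = c(2500, 3100, 2900, 3300, 2700, 3000),
                     race = c(1, 2, 3, 1, 2, 3))

# Left numeric, race gets a single slope (treated as a continuous variable)
numeric.lm <- lm(bwt ~ race, data = toy.df)

# Converted to a factor, race gets one coefficient per non-baseline level
toy.df$race <- factor(toy.df$race, levels = 1:3,
                      labels = c("white", "black", "other"))
factor.lm <- lm(bwt ~ race, data = toy.df)

names(coef(numeric.lm))  # "(Intercept)" "race"
names(coef(factor.lm))   # "(Intercept)" "raceblack" "raceother"
```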


Preliminary plots


Plotting conclusions

Some relationships between bwt and the covariates:

  • Slight relationship with lwt

  • Small effects due to the categorical variables

    On to fitting models……


Factor level combinations

  • There are 2 continuous explanatory variables and 4 categorical explanatory variables: race (3 levels), smoke (2 levels), ht (2 levels) and ui (2 levels). There are 3 × 2 × 2 × 2 = 24 factor level combinations.

  • 24 regressions in all!!
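
The count can be confirmed by listing the combinations. This is a minimal sketch; the level codes follow the variable list, and the commented-out line assumes the lecture's data frame births.df is available.

```r
# All combinations of the four factors (level codes as in the variable list)
combos <- expand.grid(race = 1:3, smoke = 0:1, ht = 0:1, ui = 0:1)
nrow(combos)   # 24

# With the real data, a cross-tabulation shows how many observations fall
# in each combination (births.df is the data frame used in this lecture):
# with(births.df, table(race, smoke, ht, ui))
```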


Models

  • The most general model would fit separate regression surfaces to each of the 24 combinations

  • Assuming planes are appropriate, this means 24 × 3 = 72 parameters. There are only 189 observations (about 2.6 per parameter), so this is rather a lot of parameters: usually we want at least 5 observations per parameter. In fact, not all factor level combinations have enough data to fit a plane (need at least 3 points).

  • The model fitting separate planes to each combination is

    bwt ~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui
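
To see that this formula really does expand to 72 parameters, the sketch below builds the design matrix for a made-up complete design. The data frame design.df, its four observations per combination and the ranges used for age and lwt are all assumptions for illustration; the real data do not fill every combination.

```r
set.seed(330)
# Hypothetical complete design: 4 made-up observations in each of the 24 cells
design.df <- expand.grid(race = factor(1:3), smoke = factor(0:1),
                         ht = factor(0:1), ui = factor(0:1), rep = 1:4)
design.df$age <- runif(nrow(design.df), 15, 45)
design.df$lwt <- runif(nrow(design.df), 80, 250)

# Design matrix for the model that fits a separate plane to each combination
X <- model.matrix(~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui,
                  data = design.df)
ncol(X)   # 72 = 24 combinations x 3 parameters
```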


Fitting

  • Can fit the model and use the anova function to reduce number of variables

    > births.lm <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht,
    +                 data = births.df)

    > anova(births.lm)

  • Can also use the step function with the forward option

    > null.lm <- lm(bwt ~ 1, data = births.df)

    > step(null.lm, scope = formula(births.lm), direction = "forward")
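
The same two-step recipe can be tried on simulated data where the true model is known. In the sketch below, sim.df, its variables g, x and z, and the true coefficients are all hypothetical.

```r
set.seed(330)
n <- 100
sim.df <- data.frame(g = factor(sample(c("a", "b"), n, replace = TRUE)),
                     x = rnorm(n),
                     z = rnorm(n))
# y depends on g and x only; z is pure noise
sim.df$y <- 2 + 2 * (sim.df$g == "b") + 3 * sim.df$x + rnorm(n)

full.lm <- lm(y ~ g*x*z, data = sim.df)
null.lm <- lm(y ~ 1, data = sim.df)

# Forward selection: start from the null model, add terms up to the full model
forward.lm <- step(null.lm, scope = formula(full.lm),
                   direction = "forward", trace = 0)

# direction = "both" also allows terms to be dropped again at each step
both.lm <- step(null.lm, scope = formula(full.lm),
                direction = "both", trace = 0)

formula(forward.lm)
```

Setting trace = 0 suppresses the step-by-step output; leave it at the default to see tables like the one on the stepwise results slide.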


Results: anova

Analysis of Variance Table

             Df  Sum Sq Mean Sq F value    Pr(>F)
age           1  806927  806927  2.0610  0.153251
race          2 4456772 2228386  5.6916  0.004167 **
smoke         1 7098861 7098861 18.1314 3.674e-05 ***
ui            1 6513795 6513795 16.6370 7.414e-05 ***
ht            1 2458238 2458238  6.2786  0.013317 *
lwt           1 2779537 2779537  7.0993  0.008579 **
age:race      2  368694  184347  0.4708  0.625420
age:smoke     1 2220991 2220991  5.6727  0.018520 *
race:smoke    2 1085210  542605  1.3859  0.253374
age:ui        1  187617  187617  0.4792  0.489886
race:ui       2  774013  387006  0.9885  0.374625
smoke:ui      1   43060   43060  0.1100  0.740641
age:ht        1 1573461 1573461  4.0188  0.046844 *
race:ht       2  318415  159207  0.4066  0.666639
smoke:ht      1  115215  115215  0.2943  0.588322
race:lwt      2 1008962  504481  1.2885  0.278798
smoke:lwt     1   86923   86923  0.2220  0.638215


Results: anova (cont)

Analysis of Variance Table

                   Df   Sum Sq Mean Sq F value   Pr(>F)
ui:lwt              1   196810  196810  0.5027 0.479457
ht:lwt              1  1145508 1145508  2.9258 0.089300 .
age:race:smoke      2  1063946  531973  1.3587 0.260218
age:race:ui         2   108742   54371  0.1389 0.870455
age:smoke:ui        1      533     533  0.0014 0.970632
race:smoke:ui       1   617235  617235  1.5765 0.211272
age:race:ht         2  1220320  610160  1.5584 0.213948
age:smoke:ht        1   406773  406773  1.0389 0.309752
race:smoke:lwt      2  1052747  526373  1.3444 0.263898
race:ui:lwt         2   786735  393367  1.0047 0.368668
smoke:ui:lwt        1  1128102 1128102  2.8813 0.091744 .
race:ht:lwt         1   435519  435519  1.1124 0.293310
age:race:smoke:ui   1  2544108 2544108  6.4980 0.011832 *
race:smoke:ui:lwt   1   150811  150811  0.3852 0.535806
Residuals         146 57162471  391524


Results: stepwise (forward/both)

Step: AIC=2451.34
bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

             Df Sum of Sq      RSS  AIC
<none>                    73000256 2451
- race:smoke  2   1657370 74657625 2452
+ ui:lwt      1    304152 72696104 2453
+ smoke:ht    1    168685 72831571 2453
- ht:lwt      1   1397486 74397742 2453
+ age         1    149901 72850355 2453
+ smoke:lwt   1     11843 72988412 2453
+ race:ht     2    497275 72502981 2454
+ race:lwt    2    441336 72558920 2454
- ui          1   6968046 79968302 2467


Comparisons

  • 3 models to compare

    • Full model

    • Model indicated by anova (model 2)

      bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke

    • Model chosen by stepwise (model 3)

      bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke


extractAIC(model3.lm)
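
A comparison of several candidate models might be tabulated along the following lines. This is only a sketch on simulated data: sim.df, the three candidate formulas and the list name candidates are hypothetical stand-ins for the full model, model 2 and model 3.

```r
set.seed(330)
n <- 120
sim.df <- data.frame(f = factor(sample(c("0", "1"), n, replace = TRUE)),
                     x = rnorm(n))
sim.df$y <- 1 + 2 * sim.df$x + (sim.df$f == "1") + rnorm(n)

candidates <- list(full     = lm(y ~ f*x, data = sim.df),
                   additive = lm(y ~ f + x, data = sim.df),
                   x.only   = lm(y ~ x, data = sim.df))

# extractAIC returns c(equivalent df, AIC); smaller AIC is better
comparison <- data.frame(
  df     = sapply(candidates, function(m) extractAIC(m)[1]),
  AIC    = sapply(candidates, function(m) extractAIC(m)[2]),
  adj.R2 = sapply(candidates, function(m) summary(m)$adj.r.squared))
comparison
```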


Deleting?

  • Point 133 seems influential: big covariance ratio and hat matrix diagonal (HMD)

  • Refitting without point 133 now makes model 3 the best, so we will go with model 3

  • Could also just use a purely additive model (i.e. parallel planes), but adjusted R² and AIC are slightly worse.
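
The influence checks and the refit can be sketched as follows. The data here are simulated, with one unusual point planted deliberately as the last observation; sim.df and the planted values are assumptions for illustration, not the birthweight data.

```r
set.seed(330)
n <- 50
sim.df <- data.frame(x = rnorm(n))
sim.df$y <- 1 + 2 * sim.df$x + rnorm(n)
# Plant one unusual point (hypothetical): extreme x, far from the line
sim.df$x[n] <- 10
sim.df$y[n] <- 0

fit.lm <- lm(y ~ x, data = sim.df)

hmd <- hatvalues(fit.lm)   # hat matrix diagonals (leverage)
cr  <- covratio(fit.lm)    # covariance ratios
which.max(hmd)             # index of the highest-leverage point

# Refit without the suspect point and compare the coefficients
refit.lm <- lm(y ~ x, data = sim.df, subset = -n)
rbind(all.points = coef(fit.lm), without = coef(refit.lm))
```

For the lecture's models, the analogous refit would drop row 133 of births.df.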


Summary Model 3

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   3158.801    267.867  11.792  < 2e-16 ***
ui1           -548.459    133.567  -4.106 6.12e-05 ***
race2         -561.784    187.680  -2.993 0.003152 **
race3         -500.440    133.004  -3.763 0.000228 ***
smoke1        -529.973    133.865  -3.959 0.000109 ***
ht1          -1978.134    711.642  -2.780 0.006026 **
lwt              2.426      1.788   1.357 0.176520
ht1:lwt         10.236      4.535   2.257 0.025217 *
race2:smoke1   255.066    300.258   0.849 0.396750
race3:smoke1   510.755    244.031   2.093 0.037768 *


Interpretation (cont)

Other things being equal:

  • Uterine irritability associated with lower birthweight

  • Smoking associated with lower birthweight, but differently for different races

  • Hypertension associated with lower birthweight

  • Race associated with lower birthweight

    • Black lower than white

    • “Other” lower than white

  • Higher mother’s weight associated with higher birthweight, for hypertension group

  • Smoking lowers birthweight more for race 1 (white).

  • These effects significant but small compared to variability.


Interpretation of interactions

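Expected difference between a black smoker and a white non-smoker (other variables equal): smoke1 + race2 + race2:smoke1 = -530 - 562 + 255 = -837 grams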
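
The full race by smoking table of effects can be rebuilt from the model 3 coefficients. In the sketch below the estimates are copied from the summary table; the object names (b, race.eff, smoke.eff, inter, effects) are illustrative, and all effects are relative to white non-smokers.

```r
# Estimates copied from the model 3 summary
b <- c(race2 = -561.784, race3 = -500.440, smoke1 = -529.973,
       "race2:smoke1" = 255.066, "race3:smoke1" = 510.755)

race.eff  <- c(white = 0, black = b[["race2"]], other = b[["race3"]])
smoke.eff <- c(nonsmoker = 0, smoker = b[["smoke1"]])

# Interaction terms apply only to smokers of race 2 (black) and race 3 (other)
inter <- matrix(0, 3, 2, dimnames = list(names(race.eff), names(smoke.eff)))
inter["black", "smoker"] <- b[["race2:smoke1"]]
inter["other", "smoker"] <- b[["race3:smoke1"]]

# Effect of each race/smoking combination relative to white non-smokers
effects <- outer(race.eff, smoke.eff, "+") + inter
round(effects)

# Effect of smoking within each race
effects[, "smoker"] - effects[, "nonsmoker"]
```

The last line shows the smoking effect is largest in magnitude for race 1 (white), which matches the interpretation slide.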


Diagnostics for model 2

Point 133 !!

Check for high-influence etc

