Overview of our study of the multiple linear regression model

Regression models with more than one slope parameter

Example 1 Are brain size and body size predictive of intelligence?
• Sample of n = 38 college students
• Response (y): intelligence based on PIQ (performance) scores from the (revised) Wechsler Adult Intelligence Scale
• Potential predictor (x1): brain size based on MRI scans (given as count/10,000)
• Potential predictor (x2): height in inches
• Potential predictor (x3): weight in pounds

Example 1 Scatter matrix plot

Scatter matrix plot
• Illustrates the marginal relationship between each pair of variables without regard to the other variables.
• The challenge is to determine how the response y relates to all three predictors simultaneously.

Example 1 A multiple linear regression model with three quantitative predictors
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$
where
• $y_i$ is the intelligence (PIQ) of student i
• $x_{i1}$ is the brain size (MRI) of student i
• $x_{i2}$ is the height (Height) of student i
• $x_{i3}$ is the weight (Weight) of student i
and the independent error terms $\epsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.
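As a rough illustration of what fitting a model of this form involves, the sketch below estimates the coefficients of a three-predictor model by ordinary least squares. The data are synthetic and the coefficient values (`beta_true`) are hypothetical; they are not the PIQ data from this example, and NumPy is an assumed tool rather than the software used in these slides.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the PIQ data).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                    # three quantitative predictors
beta_true = np.array([1.0, 2.0, -3.0, 0.5])    # hypothetical beta_0..beta_3

# Design matrix: a column of ones for the intercept, then the predictors.
design = np.column_stack([np.ones(n), X])

# Independent normal errors with mean 0 and equal variance (sd = 0.1).
y = design @ beta_true + rng.normal(scale=0.1, size=n)

# Least squares estimates of beta_0, beta_1, beta_2, beta_3.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Estimate of sigma: square root of SSE / (n - number of parameters).
residuals = y - design @ beta_hat
s = np.sqrt(residuals @ residuals / (n - design.shape[1]))

print("estimated coefficients:", beta_hat)
print("estimated sigma:", s)
```

With real data, the columns of `X` would hold brain size, height, and weight, and `y` would hold the PIQ scores.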

Example 1 Some research questions
• Which predictors – brain size, height, or weight – explain some variation in PIQ?
• What is the effect of brain size on PIQ, after taking into account height and weight?
• What is the PIQ of an individual with a given brain size, height, and weight?

Example 1
The regression equation is
PIQ = 111 + 2.06 Brain - 2.73 Height + 0.001 Weight

Predictor     Coef  SE Coef      T      P
Constant    111.35    62.97   1.77  0.086
Brain       2.0604   0.5634   3.66  0.001
Height      -2.732    1.229  -2.22  0.033
Weight      0.0006   0.1971   0.00  0.998

S = 19.79   R-Sq = 29.5%   R-Sq(adj) = 23.3%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       3   5572.7  1857.6  4.74  0.007
Residual Error  34  13321.8   391.8
Total           37  18894.6

Source  DF  Seq SS
Brain    1  2697.1
Height   1  2875.6
Weight   1     0.0
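The summary quantities in this output are tied together by the same formulas used in simple linear regression. The sketch below recomputes them from the table entries as a consistency check; all numbers are copied from the output above, and the variable names are chosen here for illustration.

```python
import math

# Values copied from the Example 1 output.
ssr, sse, ssto = 5572.7, 13321.8, 18894.6
df_reg, df_err, df_tot = 3, 34, 37
coef = {
    "Constant": (111.35, 62.97),
    "Brain": (2.0604, 0.5634),
    "Height": (-2.732, 1.229),
    "Weight": (0.0006, 0.1971),
}

# T = Coef / SE Coef for each row of the coefficient table.
t_stats = {name: b / se for name, (b, se) in coef.items()}

msr = ssr / df_reg                 # mean square for regression
mse = sse / df_err                 # mean square error
s = math.sqrt(mse)                 # S, the estimate of sigma
r_sq = ssr / ssto                  # R-Sq
r_sq_adj = 1 - (sse / df_err) / (ssto / df_tot)   # R-Sq(adj)
f_stat = msr / mse                 # overall F statistic

for name, t in t_stats.items():
    print(f"{name:9s} T = {t:6.2f}")
print(f"S = {s:.2f}, R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, F = {f_stat:.2f}")
```

The recomputed values agree with the printed output up to rounding.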

Example 2 Baby bird breathing habits in burrows?
• Experiment with n = 120 nestling bank swallows
• Response (y): % increase in “minute ventilation”, Vent, i.e., total volume of air breathed per minute
• Potential predictor (x1): percentage of oxygen, O2, in the air the baby birds breathe
• Potential predictor (x2): percentage of carbon dioxide, CO2, in the air the baby birds breathe

Example 2 Three-dimensional scatter plot

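Example 2 A first order model with two quantitative predictors
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
where
• $y_i$ is the percentage increase in minute ventilation of nestling i
• $x_{i1}$ is the percentage of oxygen in the air nestling i breathes
• $x_{i2}$ is the percentage of carbon dioxide in the air nestling i breathes
and the independent error terms $\epsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.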

Example 2 Some research questions
• Is oxygen related to minute ventilation, after taking into account carbon dioxide?
• Is carbon dioxide related to minute ventilation, after taking into account oxygen?
• What is the mean minute ventilation of all nestling bank swallows whose breathing air is composed of 15% oxygen and 5% carbon dioxide?

Example 2
The regression equation is
Vent = 86 - 5.33 O2 + 31.1 CO2

Predictor    Coef  SE Coef      T      P
Constant     85.9    106.0   0.81  0.419
O2         -5.330    6.425  -0.83  0.408
CO2        31.103    4.789   6.50  0.000

S = 157.4   R-Sq = 26.8%   R-Sq(adj) = 25.6%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        2  1061819  530909  21.44  0.000
Residual Error  117  2897566   24766
Total           119  3959385

Source  DF   Seq SS
O2       1    17045
CO2      1  1044773
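The fitted equation gives a point estimate for the research question about air composed of 15% oxygen and 5% carbon dioxide. The sketch below plugs those values into the estimated coefficients from the output; the function name `predicted_vent` is made up for illustration. It produces only the point estimate, since a confidence interval for the mean response needs more than the coefficient table shows.

```python
# Estimated coefficients copied from the Example 2 output.
b0, b_o2, b_co2 = 85.9, -5.330, 31.103

def predicted_vent(o2_pct, co2_pct):
    """Estimated mean % increase in minute ventilation at the given gas levels."""
    return b0 + b_o2 * o2_pct + b_co2 * co2_pct

# Research question: 15% oxygen and 5% carbon dioxide.
estimate = predicted_vent(15, 5)
print(f"Estimated mean Vent at O2 = 15, CO2 = 5: {estimate:.1f}")
```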

Example 3 Is a baby’s birth weight related to smoking during pregnancy?
• Sample of n = 32 births
• Response (y): birth weight of baby in grams
• Potential predictor (x1): length of gestation in weeks
• Potential predictor (x2): smoking status of mother (yes or no)

Example 3 A first order model with one binary predictor
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$
where
• $y_i$ is the birth weight of baby i
• $x_{i1}$ is the length of gestation of baby i
• $x_{i2} = 1$ if the mother smokes, and $x_{i2} = 0$ if not
and the independent error terms $\epsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.

Example 3 Estimated first order model with one binary predictor
The regression equation is
Weight = - 2390 + 143 Gest - 245 Smoking
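Because Smoking takes only the values 0 and 1, the estimated equation can be read as two parallel lines, one for each group. The worked substitution below uses the rounded coefficients from the regression equation, with Smoking = 1 for a mother who smokes.

```latex
\begin{align*}
\text{Nonsmoker } (\text{Smoking} = 0):\quad
  \widehat{\text{Weight}} &= -2390 + 143\,\text{Gest} \\
\text{Smoker } (\text{Smoking} = 1):\quad
  \widehat{\text{Weight}} &= (-2390 - 245) + 143\,\text{Gest} = -2635 + 143\,\text{Gest}
\end{align*}
```

The two lines share the slope 143 and differ only in their intercepts, by the Smoking coefficient of 245 grams.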

Example 3 Some research questions
• Is a baby’s birth weight related to smoking during pregnancy?
• How is birth weight related to gestation, after taking into account smoking status?

Example 3
The regression equation is
Weight = - 2390 + 143 Gest - 245 Smoking

Predictor     Coef  SE Coef      T      P
Constant   -2389.6    349.2  -6.84  0.000
Gest       143.100    9.128  15.68  0.000
Smoking    -244.54    41.98  -5.83  0.000

S = 115.5   R-Sq = 89.6%   R-Sq(adj) = 88.9%

Analysis of Variance
Source          DF       SS       MS       F      P
Regression       2  3348720  1674360  125.45  0.000
Residual Error  29   387070    13347
Total           31  3735789

Source   DF   Seq SS
Gest      1  2895838
Smoking   1   452881

Example 4 Compare three treatments (A, B, C) for severe depression
• Random sample of n = 36 severely depressed individuals
• y = measure of treatment effectiveness
• x1 = age (in years)
• x2 = 1 if patient received treatment A, and 0 if not
• x3 = 1 if patient received treatment B, and 0 if not

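Example 4 A second order model with one quantitative predictor, a three-group qualitative variable, and interactions
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_{12} x_{i1} x_{i2} + \beta_{13} x_{i1} x_{i3} + \epsilon_i$
where
• $y_i$ is the treatment effectiveness for patient i
• $x_{i1}$ is the age of patient i
• $x_{i2} = 1$ if treatment A, and $x_{i2} = 0$ if not
• $x_{i3} = 1$ if treatment B, and $x_{i3} = 0$ if not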

Example 4 The estimated regression function
Regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3 - 0.703 agex2 - 0.510 agex3
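Substituting the indicator values for each treatment turns the estimated function into three lines, one per treatment. The derivation below uses the rounded coefficients from the regression equation. Treatment C is the group with x2 = 0 and x3 = 0, and agex2 and agex3 are read as the products of age with x2 and x3.

```latex
\begin{align*}
\text{Treatment C } (x_2 = 0,\ x_3 = 0):\quad
  \hat{y} &= 6.21 + 1.03\,\text{age} \\
\text{Treatment A } (x_2 = 1,\ x_3 = 0):\quad
  \hat{y} &= (6.21 + 41.3) + (1.03 - 0.703)\,\text{age} = 47.51 + 0.327\,\text{age} \\
\text{Treatment B } (x_2 = 0,\ x_3 = 1):\quad
  \hat{y} &= (6.21 + 22.7) + (1.03 - 0.510)\,\text{age} = 28.91 + 0.520\,\text{age}
\end{align*}
```

The interaction terms are what allow the three lines to have different slopes, so the estimated effect of age on effectiveness differs by treatment.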

Example 4 Potential research questions
• Does the effectiveness of the treatment depend on age?
• Is one treatment superior to the other treatments for all ages?
• What is the effect of age on the effectiveness of the treatment?

Example 4
Regression equation is
y = 6.21 + 1.03 age + 41.3 x2 + 22.7 x3 - 0.703 agex2 - 0.510 agex3

Predictor     Coef  SE Coef      T      P
Constant     6.211    3.350   1.85  0.074
age        1.03339  0.07233  14.29  0.000
x2          41.304    5.085   8.12  0.000
x3          22.707    5.091   4.46  0.000
agex2      -0.7029   0.1090  -6.45  0.000
agex3      -0.5097   0.1104  -4.62  0.000

S = 3.925   R-Sq = 91.4%   R-Sq(adj) = 90.0%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       5  4932.85  986.57  64.04  0.000
Residual Error  30   462.15   15.40
Total           35  5395.00

Source  DF   Seq SS
age      1  3424.43
x2       1   803.80
x3       1     1.19
agex2    1   375.00
agex3    1   328.42

Example 5 How is the length of a bluegill fish related to its age?
• In 1981, n = 78 bluegills were randomly sampled from Lake Mary in Minnesota.
• y = length (in mm)
• x1 = age (in years)

Example 5 Scatter plot

Example 5 A second order polynomial model with one quantitative predictor
$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \epsilon_i$
where
• $y_i$ is the length of bluegill (fish) i (in mm)
• $x_i$ is the age of bluegill (fish) i (in years)
and the independent error terms $\epsilon_i$ follow a normal distribution with mean 0 and equal variance $\sigma^2$.

Example 5 Estimated regression function

Example 5 Potential research questions
• How is the length of a bluegill fish related to its age?
• What is the length of a randomly selected five-year-old bluegill fish?

Example 5
The regression equation is
length = 148 + 19.8 c_age - 4.72 c_agesq

Predictor     Coef  SE Coef       T      P
Constant   147.604    1.472  100.26  0.000
c_age       19.811    1.431   13.85  0.000
c_agesq    -4.7187   0.9440   -5.00  0.000

S = 10.91   R-Sq = 80.1%   R-Sq(adj) = 79.6%

Analysis of Variance
Source          DF     SS     MS       F      P
Regression       2  35938  17969  151.07  0.000
Residual Error  75   8921    119
Total           77  44859

...

Predicted Values for New Observations
New     Fit  SE Fit          95.0% CI          95.0% PI
1    165.90    2.77  (160.39, 171.42)  (143.49, 188.32)

Values of Predictors for New Observations
New  c_age  c_agesq
1     1.37     1.88
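The fit and the two intervals in this output can be approximately reproduced from the other numbers shown. The sketch below does so under two assumptions: the critical value `t_crit = 1.992` is an approximate 97.5th percentile of the t distribution with 75 degrees of freedom taken from standard tables (it does not appear in the output), and the prediction interval uses the usual standard error sqrt(S^2 + (SE Fit)^2). Small discrepancies are expected because the printed predictor values are rounded.

```python
import math

# Coefficients and summary values copied from the Example 5 output.
b0, b1, b2 = 147.604, 19.811, -4.7187
c_age, c_agesq = 1.37, 1.88          # predictor values for the new observation
reported_fit, se_fit, s = 165.90, 2.77, 10.91

# Assumed value, not from the output: approximate t(0.975, 75 df).
t_crit = 1.992

# Point estimate from the fitted equation (close to the reported Fit).
fit = b0 + b1 * c_age + b2 * c_agesq

# 95% confidence interval for the mean response: fit +/- t * SE Fit.
ci = (reported_fit - t_crit * se_fit, reported_fit + t_crit * se_fit)

# 95% prediction interval for a new response: adds the error variance S^2.
se_pred = math.sqrt(s**2 + se_fit**2)
pi = (reported_fit - t_crit * se_pred, reported_fit + t_crit * se_pred)

print(f"fit = {fit:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is much wider than the confidence interval because it accounts for the variation of an individual fish's length around the mean, not just the uncertainty in the estimated mean.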

The good news!
• Everything you learned about the simple linear regression model extends, with at most minor modification, to the multiple linear regression model:
  • same assumptions, same model checking
  • (adjusted) $R^2$
  • t-tests and t-intervals for one slope
  • prediction (confidence) intervals for (mean) response

New things we need to learn!
• The above research scenarios (models) and a few more
• The “general linear test”, which helps to answer many research questions
• F-tests for more than one slope
• Interactions between two or more predictor variables
• Identifying influential data points
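As a preview of the general linear test, the sketch below uses its standard form, which compares the error sum of squares of a reduced model with that of the full model. The slides do not state this formula, so treat it as an assumption; the function name `general_linear_f` is made up for illustration. All numbers come from the Example 1 output, where the sequential sum of squares for Brain gives the regression sum of squares of the Brain-only model.

```python
def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """F = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

# Example 1 values: total SS, and error SS and df of the full model.
ssto, df_tot = 18894.6, 37
sse_full, df_full = 13321.8, 34

# Reduced model with no predictors: SSE equals the total sum of squares.
# This reproduces the overall F statistic in the Analysis of Variance table.
f_overall = general_linear_f(ssto, df_tot, sse_full, df_full)

# Reduced model with Brain only: SSE = SSTO - Seq SS(Brain), error df = 36.
# This tests the Height and Weight slopes together (more than one slope).
sse_brain_only = ssto - 2697.1
f_height_weight = general_linear_f(sse_brain_only, 36, sse_full, df_full)

print(f"overall F = {f_overall:.2f}")
print(f"F for Height and Weight given Brain = {f_height_weight:.2f}")
```

The first result matches the F of 4.74 in the Example 1 output; the second would be compared with an F distribution with 2 and 34 degrees of freedom.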

New things we need to learn!
• Detection of correlated predictors (“multicollinearity”) using “variance inflation factors”, and the limitations that multicollinearity causes
• Selection of variables from a large set of variables for inclusion in a model (“stepwise regression” and “best subsets regression”)
