Multiple Regression continued…

STAT E-150 Statistical Methods

When we discussed simple linear regression, we briefly introduced prediction intervals and confidence intervals.

Confidence Intervals and Prediction Intervals

Let x* be a specific value of x. The predicted value of y is ŷ = b0 + b1x*. We can create two different intervals:
a prediction interval for an individual value of y at x*
a confidence interval for the mean predicted value at x*

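The basic format for an interval is estimate ± (critical value) × (standard error), here ŷ ± t* · SE. The two intervals differ only in the standard error. When we want to find a mean predicted value, SE is the standard error of the estimated mean response at the chosen x-value. When we want to find an individual predicted value, SE is larger, because it also includes the variability of individual observations around that mean.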
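For reference, the two standard errors can be written out in the usual textbook form for simple linear regression. This is a sketch in conventional notation, which is an assumption here and may not match the symbols on the original slide: x* is the chosen predictor value, s_e is the standard error of the regression, n is the sample size, and t* is the critical value from the t distribution with n - 2 degrees of freedom.

```latex
\hat{y} = b_0 + b_1 x^{*}, \qquad \hat{y} \pm t^{*} \cdot SE

SE_{\text{mean}} = s_e \sqrt{\frac{1}{n} + \frac{(x^{*} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}

SE_{\text{individual}} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x^{*} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}
```

The extra 1 under the second square root accounts for the scatter of an individual observation around the mean, so the interval for an individual value is always the wider of the two.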

Let us return to our earlier discussion of the age of adolescent mothers and the weight of their babies. We found that there was a linear relationship between these variables: weight = 245.15 age – 1163.45 How can we use this model to make predictions?

Suppose we want to predict the weight of a baby born to a mother who is 16 years old. When we analyze the data, we can choose to save the predicted values, the confidence interval and the prediction interval for each predictor value. The results will appear in the datasheet, in columns for the x-value, the predicted y-value, the 95% confidence interval, and the 95% prediction interval.

What weight is expected for a baby of a 16 year old mother? 2759 g
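The predicted value comes directly from the fitted line, weight = 245.15 age - 1163.45. A minimal sketch of that arithmetic follows; the function and constant names are illustrative, not part of the original analysis.

```python
# Fitted line from the text: weight = 245.15 * age - 1163.45 (weight in grams).
SLOPE = 245.15
INTERCEPT = -1163.45


def predicted_weight(age_years):
    """Return the predicted birth weight in grams for a mother of the given age."""
    return SLOPE * age_years + INTERCEPT


weight_at_16 = predicted_weight(16)
print(round(weight_at_16, 2))
```

Rounded to the nearest gram, this agrees with the 2759 g read from the saved predicted values.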

What is the prediction interval estimate for the weight of a baby of a 16 year old mother? 2251.24 to 3266.66 g

What does it tell you? We are 95% confident that the birthweight of a baby born to a 16 year old mother is between 2251.24 and 3266.66 g.

What is the confidence interval estimate for the mean weight of babies of 16 year old mothers? 2575.59 to 2942.31 g

What does it tell you? We are 95% confident that the mean birthweight of babies born to 16 year old mothers is between 2575.59 and 2942.31 g.

The 95% confidence interval is (2575.59, 2942.31)
The 95% prediction interval is (2251.24, 3266.66)

Which interval is wider? Why? The prediction interval is wider, because means vary less than individual values: the prediction interval has to allow for the scatter of individual babies around the mean as well as the uncertainty in estimating that mean.
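A quick check on the reported endpoints makes the comparison concrete. The sketch below uses only the interval limits quoted above, with illustrative helper names. It shows that both intervals are centered on the same predicted value, 245.15(16) - 1163.45 = 2758.95 g, and differ only in width.

```python
# Interval endpoints reported in the text (grams), for a 16-year-old mother.
confidence_interval = (2575.59, 2942.31)   # for the mean weight
prediction_interval = (2251.24, 3266.66)   # for an individual baby


def midpoint(interval):
    """Center of an interval given as (low, high)."""
    low, high = interval
    return (low + high) / 2


def width(interval):
    """Length of an interval given as (low, high)."""
    low, high = interval
    return high - low


ci_mid, pi_mid = midpoint(confidence_interval), midpoint(prediction_interval)
ci_width, pi_width = width(confidence_interval), width(prediction_interval)

print(f"midpoints: CI {ci_mid:.2f}, PI {pi_mid:.2f}")
print(f"widths:    CI {ci_width:.2f}, PI {pi_width:.2f}")
```

The shared midpoint is the point estimate. The prediction interval is a little under three times as wide as the confidence interval in this example.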

In the data concerning body fat percentages in men, the predictor variables were waist and height, and we found a regression equation which we can now use to make predictions: %BodyFat = 1.773 waist - .601 height - 3.110. We can find prediction intervals and confidence intervals as we did when we used a single predictor.

Suppose we want to predict the body fat percentage associated with a waist size of 34 inches and a height of 6 feet. We can proceed as we did with a single predictor, by entering these values in the data window, and then saving the results of the linear regression analysis.

When you scroll to the right, you will see these results: What is the predicted body fat %? 13.874%
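The same value can be approximated by hand from the fitted equation. The sketch below assumes that waist and height are both measured in inches, so that 6 feet is entered as 72. That unit is an assumption, supported only by how closely the result matches the reported value.

```python
# Fitted equation from the text: %BodyFat = 1.773*waist - 0.601*height - 3.110.
# Assumption: both waist and height are measured in inches.
INCHES_PER_FOOT = 12


def predicted_body_fat(waist_in, height_in):
    """Predicted body fat percentage from the rounded coefficients in the text."""
    return 1.773 * waist_in - 0.601 * height_in - 3.110


height_in = 6 * INCHES_PER_FOOT          # 6 feet = 72 inches
estimate = predicted_body_fat(34, height_in)
print(round(estimate, 3))
```

The hand calculation comes out near 13.9 rather than exactly 13.874, presumably because the software works with unrounded coefficients while the equation above shows them to three decimal places.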

What is the prediction interval? What does it tell you? The 95% prediction interval is (5.05, 22.69). We are 95% confident that a man who is 6 feet tall and has a 34 inch waist will have a body fat percentage between 5.05 and 22.69.

What is the confidence interval? What does it tell you? The 95% confidence interval is (13.10, 14.65). We are 95% confident that the mean body fat percentage for men who are 6 feet tall and have a 34 inch waist is between 13.10 and 14.65.

Models with Categorical Predictors

Categorical (or qualitative) variables can also be included in multiple regression models. These variables are coded as numbers so that we can employ the methods we have discussed. The coded variables are called indicator variables or dummy variables. They are often coded using 0 and 1, where
0 = absence, or "no"
1 = presence, or "yes"
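The 0/1 coding can be sketched in a few lines. The responses below are hypothetical and the helper name is illustrative; statistical software usually offers its own recoding tools.

```python
# Hypothetical yes/no responses, used only to illustrate 0/1 coding.
responses = ["yes", "no", "no", "yes", "no"]


def to_indicator(labels, present):
    """Code each label as 1 if it equals `present`, otherwise 0."""
    return [1 if label == present else 0 for label in labels]


indicator = to_indicator(responses, "yes")
print(indicator)
```

Because the indicator is numeric, it can be entered into the regression like any other predictor.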

Example: One way colleges measure success is by graduation rates. The Education Trust publishes 6-year graduation rates along with other college characteristics on its website, www.collegeresults.org.

Here is a sample of the data, which represents a random sample of 22 colleges selected from the 1037 colleges in the United States with enrollments under 5000 students:

We define these variables:
y = 6-year graduation rate
x1 = median SAT score of students accepted to the college
x2 = student-related expense per full-time student (in dollars)
x3 = indicator for the type of college (1 = single-sex, 0 = coeducational)

The regression model is y = β0 + β1x1 + β2x2 + β3x3 + ε

For single-sex colleges (x3 = 1):
Rate = β0 + β1 SAT + β2 Expense + β3(1) + ε = (β0 + β3) + β1 SAT + β2 Expense + ε

For coeducational colleges (x3 = 0):
Rate = β0 + β1 SAT + β2 Expense + β3(0) + ε = β0 + β1 SAT + β2 Expense + ε

In either case, the slopes are determined using data from both types of colleges. The intercept is β0 + β3 for single-sex colleges and β0 for coeducational colleges. In other words, the coefficient of the indicator variable represents the difference in intercepts for the regression lines for the two types of colleges.

What are the hypotheses?
H0: β1 = β2 = β3 = 0
Ha: The coefficients are not all zero

Here is part of the SPSS analysis:

What is your conclusion? Since F is large and p is close to 0, the null hypothesis is rejected. We can conclude that there is a linear relationship between the 6-year graduation rate and at least one of the predictors: the median SAT score, the student-related expense per full-time student, and whether the college is single-sex or coeducational.

What is the regression equation? y = .001x1 + .00000697x2 + .125x3 - .391

For single-sex colleges: y = .001x1 + .00000697x2 + .125(1) - .391, which simplifies to y = .001x1 + .00000697x2 - .266

For coed colleges: y = .001x1 + .00000697x2 - .391

What is the meaning of the coefficient β3? Its estimate, .125, is the difference in intercepts for the two types of colleges. We can interpret it as the "correction" we would make to the predicted graduation rate to incorporate the difference associated with having only male or only female students: for two colleges with the same median SAT score and the same expense per student, the predicted graduation rate is .125 higher for the single-sex college.
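The interpretation can be checked numerically with the fitted equation y = .001x1 + .00000697x2 + .125x3 - .391. In the sketch below the SAT scores and expenses are hypothetical values chosen only for illustration, and the function name is not from the original analysis.

```python
# Fitted equation from the text:
#   y = .001*x1 + .00000697*x2 + .125*x3 - .391
# x3 = 1 for a single-sex college, 0 for a coeducational college.
def predicted_rate(sat, expense, single_sex):
    """Predicted 6-year graduation rate from the rounded fitted coefficients."""
    x3 = 1 if single_sex else 0
    return 0.001 * sat + 0.00000697 * expense + 0.125 * x3 - 0.391


# Hypothetical (median SAT, expense per student) pairs, for illustration only.
examples = [(1000, 8000), (1100, 10000), (1250, 20000)]

gaps = [
    predicted_rate(sat, expense, True) - predicted_rate(sat, expense, False)
    for sat, expense in examples
]
print([round(gap, 3) for gap in gaps])
```

Whatever SAT score and expense are plugged in, the gap between the two predictions is the same, which is what it means for the indicator to shift only the intercept.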

Interaction and Collinearity

If the change in the mean y-value associated with a 1-unit increase in one predictor variable depends on the value of a second predictor variable, there is interaction between the two predictor variables. If we represent the variables as x1 and x2, the interaction can be modeled by including their product, x1x2, as a predictor variable.

The regression model for two predictor variables would now include a cross-product term:
Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
where
β1 + β3x2 represents the change in Y for every one-unit increase in x1, keeping x2 fixed
β2 + β3x1 represents the change in Y for every one-unit increase in x2, keeping x1 fixed

If you find that there is a linear association, be sure to check the coefficient of the interaction term.
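The claim that the effect of x1 depends on x2 can be verified with a small sketch. The coefficients below are hypothetical, chosen only so the arithmetic is easy to follow. The error term is left out because the statement concerns the mean of Y.

```python
# Hypothetical coefficients, chosen only to illustrate the algebra.
B0, B1, B2, B3 = 1.0, 2.0, 3.0, 0.5


def mean_response(x1, x2):
    """Mean of Y under the interaction model b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2


def effect_of_x1(x2, x1=1.0):
    """Change in mean Y when x1 increases by one unit with x2 held fixed."""
    return mean_response(x1 + 1, x2) - mean_response(x1, x2)


for x2 in (0.0, 4.0, 10.0):
    # The two printed numbers agree: the observed change and b1 + b3*x2.
    print(x2, effect_of_x1(x2), B1 + B3 * x2)
```

With no interaction (b3 = 0) the change would be b1 at every value of x2. With an interaction term it grows or shrinks as x2 changes.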

We determine collinearity by examining a correlation matrix:

What is the correlation between
Pct BF and Height? -.029. Is this value significant? No; p = .322
Pct BF and Waist? .824. Is this value significant? Yes; p = .000
Height and Waist? .187. Is this value significant? Yes; p = .002

It is important to note that this information only refers to the pair of variables in question, without regard to the influences of other variables.

Another way to assess collinearity: VIF is the Variance Inflation Factor, which indicates whether a predictor has a strong linear relationship with the other predictors. There is reason for concern if the largest VIF is greater than 5. The Tolerance statistic is the reciprocal of the VIF. There is a serious problem if this value is less than .2.
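As a rough illustration, the VIF for the body fat model can be worked out from the Height and Waist correlation of .187. The sketch relies on the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. That formula is not stated in the text and is used here as an assumption about what the software computes. With exactly two predictors, R² is simply the squared correlation between them.

```python
# Correlation between the two predictors, as reported in the text.
r_height_waist = 0.187


def vif_two_predictors(r):
    """VIF when there are exactly two predictors: 1 / (1 - r**2)."""
    return 1.0 / (1.0 - r ** 2)


vif = vif_two_predictors(r_height_waist)
tolerance = 1.0 / vif   # tolerance is the reciprocal of the VIF

print(f"VIF = {vif:.3f}, tolerance = {tolerance:.3f}")
print("cause for concern" if vif > 5 else "no sign of a collinearity problem")
```

Both rules of thumb point the same way here: the VIF is far below 5 and the tolerance is far above .2, so the significant but weak correlation between height and waist does not indicate a collinearity problem.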
