Welcome to Chapter 14 MBA 541

Welcome to Chapter 14MBA 541 BENEDICTINEUNIVERSITY • Regression and Correlation • Multiple Regression and Correlation Analysis • Chapter 14

Chapter 14 Please, Read Chapter 14 in Lind before viewing this presentation. Statistical Techniques in Business & Economics Lind

Goals When you have completed this chapter, you will be able to: • ONE • Describe the relationship between two or more independent variables and the dependent variable using a multiple regression equation. • TWO • Compute and interpret the multiple standard error of estimate and the coefficient of determination. • THREE • Interpret a correlation matrix.

Goals When you have completed this chapter, you will be able to: • FOUR • Set up and interpret an ANOVA table. • FIVE • Conduct a test of hypothesis to determine if any of the set of regression coefficients differ from zero. • SIX • Conduct a test of hypothesis on each of the regression coefficients.

Multiple Regression Analysis • The general multiple regression with k independent variables is given by: where: a is the Y-intercept, and X1 to Xk are the independent variables. • Greek letters are used for a (α) and b (β) when denoting population parameters.

Multiple Regression Analysis • In the general multiple regression equation, bj is the net change in Y for each unit change in Xj (holding all other values constant) where j = 1 to k. • Each bj is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient. • The least squares criterion is used to develop the multiple regression equation. • Because determining b1, b2, etc. is very tedious, a software package such as Excel or MINITAB is recommended.

Multiple Standard Errorof Estimate • The Multiple Standard Error of Estimate is a measure of the effectiveness of the regression equation. • The formula is: • It is measured in the same units as the dependent variable.

Multiple Regression and Correlation Assumptions • Assumptions about Multiple Regression and Correlation: • The independent variables and the dependent variable have a linear relationship. • The dependent variable is continuous and at least interval scale. • The variation in (Y-Y’) must be approximately the same for all values of Y’. When this is the case, differences exhibit homoscedasticity. • The residuals should follow the normal distribution with a mean of 0. • Successive values of the dependent variable must be uncorrelated.

ANOVA Table • SSR = Regression variation. Variation accounted for by the set of independent variables. • SSE = Error variation. Variation not accounted for by the independent variables. • SS Total = Total Variation.

Correlation Matrix • A Correlation Matrix is used to show all the correlation coefficients between all pairs of the variables. • The matrix is useful for locating correlated independent variables. • It shows how strongly each independent variable is correlated with the dependent variable.

Global Test • The Global Test is used to investigate whether any of the independent variables have significant coefficients. • The hypotheses are: H0: β1 = β2 = … = βk = 0 H1: Not all βs are equal to 0 • The test statistic is the F distribution. • The numerator degrees of freedom is k, the number of independent variables. • The denominator degrees of freedom is n-(k+1), where n is the sample size.

Test for Individual Variables • The Test of Individual Variables is used to determine which independent variables have nonzero regression coefficients. • The variables that have zero regression coefficients are usually dropped from the analysis. • The test statistic is the t distribution with n-(k+1) degrees of freedom.

Example 1 • A market researcher for Super Dollar Super Markets is studying the yearly amount that families of four or more spend on food. • Three independent variables are thought to be related to yearly food expenditures (Food). • These variables are: total family income (Income) in $00, size of the family (Size), and whether the family has children in college (College).

Example 1 (Continued) • Notice that the variable, College, is called a dummy or indicator variable. It can take only one of two possible outcomes. That is, a child is a college student or not. • We usually code one value of the dummy variable as “1” and the other “0”. • Other examples of dummy variables include: gender, the part is acceptable or unacceptable, the voter will or will not vote for the incumbent governor.

Example 1 (Continued)

Example 1 (Continued) • Use a computer software package, such as MINITAB or Excel, to develop a correlation matrix. • From the analysis provided by MINITAB (as presented on the next slide), write out the regression equation: • This regression equation can be used to estimate the food expenditures for families of different sizes, with or without college students, and with various incomes.

Example 1 (Continued) The regression equation is Food = 954 + 1.09 Income + 748 Size + 565 College Predictor Coef SE Coef T P Constant 954 1581 0.60 0.563 Income 1.092 3.153 0.35 0.738 Size 748.4 303.0 2.47 0.039 Student 564.5 495.1 1.14 0.287 S = 572.7 R-Sq = 80.4% R-Sq(adj) = 73.1% Analysis of Variance Source DF SS MS F P Regression 3 10762903 3587634 10.94 0.003 Residual Error 8 2623764 327970 Total 11 13386667

Example 1 (Continued) • This regression equation indicates that: • Each additional $100 of income per year will increase the amount spent on food by $1.09 per year. • An additional family member will increase the amount spent per year on food by $748. • A family with a college student will spend $565 more per year on food than those without a college student. • For a family of 4, with no college students, and an income of $50,000 (which is input as 500), this regression equation predicts the food expenditure will be $4,491.

Example 1 (Continued) • Regarding the Coefficient of Determination: • From the regression output (shown on Slide 17), we note that the coefficient of determination is 80.4%. • This means that more than 80% of the variation in the amount spent on food is accounted for by the variables of income, family size, and student status.

Example 1 (Continued) • Regarding the Correlation Matrix: • The strongest correlation between the dependent variable and an independent variable is between family size and amount spent on food. • None of the correlations among the independent variables should cause problems. All are between -0.70 and +0.70.

Example 1 (Continued) • Conduct a Global Test of Hypothesis to determine if any of the regression coefficients are not zero. • The hypotheses are: H0: β1 = β2 = β3 = 0 H1: At least one β ≠ 0 • The test statistic is the F distribution. • H0 is rejected if F > 4.07. • From the MINITAB output, the computed value of F is 10.94. • Decision: H0 is rejected. • Not all the regression coefficients are zero.

Example 1 (Continued) • Conduct an Individual Test to determine which coefficients are not zero. • For the independent variable, family size, the hypotheses are: H0: β2 = 0 H1: β2 ≠ 0 • Using the 5% level of significance, reject H0 if the p-value < 0.05. • From the MINITAB output, the only significant variable is Size (family size) using the p-values. The other variables can be omitted from the model.

Example 1 (Continued) • We rerun the analysis using only the significant independent variable, family size. • The new regression equation is: • As shown on the next slide, which presents the MINITAB output, the coefficient of determination is 76.8%. • We dropped two independent variables, and the R-squared term was reduced by only 3.6%.

Example 1 (Continued) Regression Analysis: Food versus Size The regression equation is Food = 340 + 1031 Size Predictor Coef SE Coef T P Constant 339.7 940.7 0.36 0.726 Size 1031.0 179.4 5.75 0.000 S = 557.7 R-Sq = 76.8% R-Sq(adj) = 74.4% Analysis of Variance Source DF SS MS F P Regression 1 10275977 10275977 33.03 0.000 Residual Error 10 3110690 311069 Total 11 13386667

Analysis of Residuals • A Residual is the difference between the actual value of Y and the predicted value Y’. • Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement. • A plot of the residuals and their corresponding Y’ values is used for showing that there are no trends or patterns in the residuals.

Residual Plot Residual Plot Against Estimated Values of Y 1000 500 Residuals 0 -500 4500 6000 7500 Y’

Histogram of Residuals 8 7 6 5 4 3 2 1 0 Frequency -600 -200 200 600 1000 Residuals

Welcome to Chapter 14 MBA 541