Multiple Regression – Assumptions and Outliers

• Multiple Regression and Assumptions • Multiple Regression and Outliers • Strategy for Solving Problems • Practice Problems

Multiple Regression and Assumptions • Multiple regression is most effective at identifying the relationship between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each of the metric variables is normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic. • Failing to satisfy the assumptions does not mean that our answer is wrong. It means that our solution may under-report the strength of the relationships.

Multiple Regression and Outliers • Outliers can distort the regression results. When an outlier is included in the analysis, it pulls the regression line towards itself. This can result in a solution that is more accurate for the outlier, but less accurate for all of the other cases in the data set. • We will check for univariate outliers on the dependent variable and multivariate outliers on the independent variables.

Relationship between assumptions and outliers • The problems of satisfying assumptions and detecting outliers are intertwined. For example, if a case has a value on the dependent variable that is an outlier, it will affect the skew, and hence, the normality of the distribution. • Removing an outlier may improve the distribution of a variable. • Transforming a variable may reduce the likelihood that the value for a case will be characterized as an outlier.

Order of analysis is important • The order in which we check assumptions and detect outliers will affect our results because we may get a different subset of cases in the final analysis. • In order to maximize the number of cases available to the analysis, we will evaluate assumptions first. We will substitute any transformations of variables that enable us to satisfy the assumptions. • We will use any transformed variables that are required in our analysis to detect outliers.

Strategy for solving problems Our strategy for solving problems about violations of assumptions and outliers will include the following steps: 1. Run the type of regression specified in the problem statement on the variables using the full data set. 2. Test the dependent variable for normality. If it does not satisfy the criteria for normality unless transformed, substitute the transformed variable in the remaining tests that call for the use of the dependent variable. 3. Test for normality, linearity, and homoscedasticity using scripts. Decide which transformations should be used. 4. Substitute transformations and run the regression entering all independent variables, saving studentized residuals and Mahalanobis distance scores. Compute probabilities for D². 5. Remove the outliers (studentized residual greater than 3 in absolute value, or Mahalanobis D² with p <= 0.001), and run the regression with the method and variables specified in the problem. 6. Compare R² for the analysis using transformed variables and omitting outliers (step 5) to R² obtained for the model using all data and original variables (step 1).

Transforming dependent variables We will use the following logic to transform variables: • If the dependent variable is not normally distributed, try log, square root, and inverse transformations. Use the first transformed variable that satisfies the normality criteria. • If no transformation satisfies the normality criteria, use the untransformed variable and add a caution for violation of the assumption. • If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.

Transforming independent variables - 1 • If independent variable is normally distributed and linearly related to dependent variable, use as is. • If independent variable is normally distributed but not linearly related to dependent variable: • Try log, square root, square, and inverse transformation. Use first transformed variable that satisfies linearity criteria and does not violate normality criteria • If no transformation satisfies linearity criteria and does not violate normality criteria, use untransformed variable and add caution for violation of assumption

Transforming independent variables - 2 • If independent variable is linearly related to dependent variable but not normally distributed: • Try log, square root, and inverse transformation. Use first transformed variable that satisfies normality criteria and does not reduce correlation. • If none does, try log, square root, and inverse transformation again, and use first transformed variable that satisfies normality criteria and has significant correlation. • If no transformation satisfies normality criteria with a significant correlation, use untransformed variable and add caution for violation of assumption

Transforming independent variables - 3 • If independent variable is not linearly related to dependent variable and not normally distributed: • Try log, square root, square, and inverse transformation. Use first transformed variable that satisfies normality criteria and has significant correlation. • If no transformation satisfies normality criteria with a significant correlation, use untransformed variable and add caution for violation of assumption
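
The selection rule shared by the transformation slides above (try the candidates in order, keep the first one that meets the criteria, otherwise keep the original variable with a caution) can be sketched in Python. This is only an illustration, not the SPSS scripts themselves: the function name `choose_transformation`, the `is_acceptable` callback, the use of the natural logarithm, and the assumption that every value is positive are all introduced here for the example.

```python
import math

# Candidate transformations named in the slides.
# Assumptions: values are all positive, and "log" is the natural logarithm.
TRANSFORMATIONS = {
    "log": math.log,
    "square root": math.sqrt,
    "square": lambda v: v * v,
    "inverse": lambda v: 1.0 / v,
}


def choose_transformation(values, candidate_names, is_acceptable):
    """Return (name, values, caution) for the first acceptable form of a variable.

    is_acceptable is a caller-supplied test (hypothetical), e.g. a check of
    the normality criteria or of normality plus a significant correlation.
    """
    if is_acceptable(values):
        return "none", list(values), False
    for name in candidate_names:
        transformed = [TRANSFORMATIONS[name](v) for v in values]
        if is_acceptable(transformed):
            return name, transformed, False
    # No candidate worked: keep the original variable and flag a caution.
    return "none", list(values), True
```

The order of `candidate_names` carries the order of preference, so passing `["log", "square root", "inverse"]` mirrors the rule for a dependent variable, while adding `"square"` mirrors the lists used for linearity.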

Impact of transformations and omitting outliers • We evaluate the regression assumptions and detect outliers with a view toward strengthening the relationship. • This may not happen. The regression may be the same, it may be weaker, or it may be stronger. We cannot be certain of the impact until we run the regression again. • In the end, we may opt not to exclude outliers and not to employ transformations; the analysis informs us of the consequences of doing either.

Notes • Whenever you start a new problem, make sure you have removed variables created for previous analysis and have included all cases back into the data set. • I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality. • If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.

Problem 1 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98]. After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Dissecting problem 1 - 1 The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.01 as alpha for the regression analysis as well as for testing assumptions.

Dissecting problem 1 - 2 The method for selecting variables is derived from the research question. In this problem we are asked to identify the best subset of predictors, so we do a stepwise multiple regression.

Dissecting problem 1 - 3 The purpose of testing for assumptions and outliers is to identify a stronger model. The main question to be answered in this problem is whether or not the use of transformed variables to satisfy assumptions and the removal of outliers improves the overall relationship between the independent variables and the dependent variable, as measured by R². Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than the R² for a regression analysis using the original format for all variables and including all cases.

R² before transformations or removing outliers To start out, we run a stepwise multiple regression analysis with income98 as the dependent variable and sex, earnrs, and rincom98 as the independent variables. We select stepwise as the method to select the best subset of predictors.

R² before transformations or removing outliers Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%. This is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
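
As a rough illustration of the benchmark, R² can be computed from observed and predicted values, and the change between two runs expressed in percentage points. The function names are hypothetical, and reading the "increase" in the problem statement as a difference in percentage points is an assumption of this sketch.

```python
def r_squared(observed, predicted):
    """Proportion of variance in the observed values explained by the predictions."""
    mean = sum(observed) / len(observed)
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


def gain_in_points(r2_before, r2_after):
    """Change in R² between two analyses, in percentage points (assumed reading)."""
    return (r2_after - r2_before) * 100.0
```

Here `r2_before` would be the benchmark of 0.511 reported above; `r2_after` comes from the regression rerun with transformed variables and without outliers.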

R² before transformations or removing outliers For this particular question, we are not interested in the statistical significance of the overall relationship prior to transformations and removing outliers. In fact, it is possible that the relationship is not statistically significant due to variables that are not normal, relationships that are not linear, and the inclusion of outliers.

Normality of the dependent variable: total family income In evaluating assumptions, the first step is to examine the normality of the dependent variable. If it is not normally distributed, or cannot be normalized with a transformation, it can affect the relationships with all other variables. To test the normality of the dependent variable, run the script: NormalityAssumptionAndTransformations.SBS First, move the dependent variable INCOME98 to the list box of variables to test. Second, click on the OK button to produce the output.

Normality of the dependent variable: total family income The dependent variable "total family income" [income98] satisfies the criteria for a normal distribution. The skewness (-0.628) and kurtosis (-0.248) were both between -1.0 and +1.0. No transformation is necessary.
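
The normality rule used here (skewness and kurtosis both between -1.0 and +1.0) can be sketched as follows. The sample-adjusted skewness and excess-kurtosis formulas are an assumption about what the statistics package reports, and the function names are made up for the example.

```python
import math


def _standardized(values):
    """Z-scores using the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def skewness(values):
    """Sample-adjusted skewness (requires at least 3 values)."""
    n = len(values)
    z = _standardized(values)
    return n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)


def kurtosis(values):
    """Sample-adjusted excess kurtosis (requires at least 4 values)."""
    n = len(values)
    z = _standardized(values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * sum(zi ** 4 for zi in z) - correction


def satisfies_normality(skew, kurt, limit=1.0):
    """Rule of thumb from the slides: both statistics within -limit..+limit."""
    return -limit <= skew <= limit and -limit <= kurt <= limit
```

With the values reported for income98, `satisfies_normality(-0.628, -0.248)` is `True`, so no transformation is called for.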

Linearity and independent variable: how many in family earned money To evaluate the linearity of the relationship between number of earners and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, EARNRS, to the list box for independent variables. Third, click on the OK button to produce the output.

Linearity and independent variable: how many in family earned money The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "how many in family earned money" [earnrs] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.505). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

Normality of independent variable: how many in family earned money After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable. To test the normality of number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS First, move the independent variable EARNRS to the list box of variables to test. Second, click on the OK button to produce the output.

Normality of independent variable: how many in family earned money The independent variable "how many in family earned money" [earnrs] satisfies the criteria for the assumption of linearity with the dependent variable "total family income" [income98], but does not satisfy the assumption of normality. In evaluating normality, the skewness (0.742) was between -1.0 and +1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.

Normality of independent variable: how many in family earned money The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536. The square root transformation also has values of skewness and kurtosis in the acceptable range. However, by our order of preference for which transformation to use, the logarithm is preferred.

Transformation for how many in family earned money • The independent variable, how many in family earned money, had a linear relationship to the dependent variable, total family income. • The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. • We will substitute the logarithmic transformation of how many in family earned money in the regression analysis.

Normality of independent variable: respondent’s income After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable. To test the normality of respondent’s income, run the script: NormalityAssumptionAndTransformations.SBS First, move the independent variable RINCOM98 to the list box of variables to test. Second, click on the OK button to produce the output.

Normality of independent variable: respondent’s income The independent variable "income" [rincom98] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "total family income" [income98]. In evaluating normality, the skewness (-0.686) and kurtosis (-0.253) were both within the range of acceptable values from -1.0 to +1.0.

Linearity and independent variable: respondent’s income To evaluate the linearity of the relationship between respondent’s income and total family income, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, RINCOM98, to the list box for independent variables. Third, click on the OK button to produce the output.

Linearity and independent variable: respondent’s income The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
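
The linearity check rests on the correlation coefficient and its significance. The sketch below computes Pearson's r and the t statistic that tests r = 0. Turning t into a probability needs the t distribution with n - 2 degrees of freedom, which this sketch leaves to the statistics package. All function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for the null hypothesis r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def is_linear(p_value, alpha=0.01):
    """Decision rule from the slides: reject r = 0 when p <= alpha."""
    return p_value <= alpha
```

A reported probability below 0.001 passes `is_linear` at the 0.01 level, which is the conclusion drawn for both earnrs and rincom98.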

Homoscedasticity: sex To evaluate the homoscedasticity of the relationship between sex and total family income, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS First, move the dependent variable INCOME98 to the text box for the dependent variable. Second, move the independent variable, SEX, to the list box for independent variables. Third, click on the OK button to produce the output.

Homoscedasticity: sex Based on the Levene Test, the variance in "total family income" [income98] is homogeneous for the categories of "sex" [sex]. The probability associated with the Levene Statistic (0.031) is greater than the level of significance (0.01), so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
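
For readers who want to see what the Levene statistic measures, here is a minimal sketch that computes it from absolute deviations around each group's mean. Centering on the mean rather than the median is an assumption of this sketch. The probability itself comes from the F distribution with k - 1 and N - k degrees of freedom, which is left to the statistics package. Function names are illustrative.

```python
def levene_statistic(groups):
    """Levene's W for a list of groups, each a list of numeric values."""
    # Absolute deviation of every value from its own group mean.
    deviations = []
    for group in groups:
        mean = sum(group) / len(group)
        deviations.append([abs(v - mean) for v in group])

    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = sum(sum(d) for d in deviations) / n_total
    group_means = [sum(d) / len(d) for d in deviations]

    between = sum(len(d) * (gm - grand_mean) ** 2
                  for d, gm in zip(deviations, group_means))
    within = sum((z - gm) ** 2
                 for d, gm in zip(deviations, group_means) for z in d)
    return (n_total - k) / (k - 1) * between / within


def is_homoscedastic(p_value, alpha=0.01):
    """Decision rule from the slides: the assumption holds when p > alpha."""
    return p_value > alpha
```

With the probability reported above, `is_homoscedastic(0.031)` is `True` at the 0.01 level.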

Adding a transformed variable The analysis calls for the logarithmic transformation of "how many in family earned money" [earnrs]. We can use a script, such as the normality script, to add a transformed variable to the data set. First, move the variable that we want to transform to the list box of variables to test. Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes. Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable. Fourth, click on the OK button to produce the output.

The transformed variable in the data editor If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set. Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

The regression to identify outliers We use the regression procedure to identify both univariate and multivariate outliers. We start with the same dialog we used for the last analysis, in which income98 was the dependent variable and sex, earnrs, and rincom98 were the independent variables. First, we substitute the logarithmic transformation of earnrs, logearn, into the list of independent variables. Second, we change the method of entry from Stepwise to Enter so that all variables will be included in the detection of outliers. Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.

Saving the measures of outliers First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set. Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables. Third, click on the OK button to complete the specifications.

The variables for identifying outliers The values for identifying univariate outliers for the dependent variable are in a column which SPSS has named sre_1. The values for identifying multivariate outliers for the independent variables are in a column which SPSS has named mah_1.
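
To make the mah_1 column less of a black box, the sketch below computes a squared Mahalanobis distance for each case from the centroid of the independent variables, using the sample covariance matrix. It is an illustration only: the function names are invented here, the linear solver is a plain Gaussian elimination, and whether SPSS arrives at its saved values in exactly this way is not confirmed by the slides.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (aug[i][size] - known) / aug[i][i]
    return x


def mahalanobis_d2(rows):
    """Squared Mahalanobis distance of each row from the centroid of all rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (n - 1 denominator); must be non-singular.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # D² = d' * inverse(cov) * d, computed by solving cov * s = d.
    return [sum(d * s for d, s in zip(c, solve(cov, c))) for c in centered]
```

Each row passed to `mahalanobis_d2` holds one case's values on the independent variables, for example (sex, logearn, rincom98) in this problem.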

Computing the probability for Mahalanobis D² To compute the probability of D², we will use an SPSS function in a Compute command. First, select the Compute… command from the Transform menu.

Formula for probability for Mahalanobis D² First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score. Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3. Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution, giving the numeric expression 1 - CDF.CHISQ(mah_1, 3). Third, click on the OK button to signal completion of the Compute Variable dialog.
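
The same upper-tail probability can be checked outside SPSS. For exactly 3 degrees of freedom the chi-square distribution has a closed form, which the sketch below uses with only the Python standard library. The function name is made up for this example, and the formula does not apply to other numbers of variables.

```python
import math


def chi2_upper_tail_df3(d_squared):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom.

    Closed form valid only for df = 3:
        P(X > x) = erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    This plays the role of 1 - CDF.CHISQ(mah_1, 3).
    """
    half = d_squared / 2.0
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * d_squared / math.pi) * math.exp(-half))
```

A case is treated as a multivariate outlier when this probability is 0.001 or less, which for three independent variables corresponds to a D² of roughly 16.27 or more.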

Multivariate outliers Using the probabilities computed in p_mah_1 to identify outliers, scroll down through the list of cases to see if we can find cases with a probability less than 0.001. There are no outliers for the set of independent variables.

Univariate outliers Similarly, we can scroll down the values of sre_1, the studentized residual, to find any cases with a value larger than 3.0 or smaller than -3.0. Based on these criteria, there are 4 outliers: 4 cases have a score on the dependent variable that is sufficiently unusual to be considered outliers (case 20000357: studentized residual=3.08; case 20000416: studentized residual=3.57; case 20001379: studentized residual=3.27; case 20002702: studentized residual=-3.23).

Omitting the outliers To omit the outliers from the analysis, we select the cases that are not outliers. First, select the Select Cases… command from the Data menu.

Specifying the condition to omit outliers First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases. Second, click on the If… button to specify the criteria for inclusion in the analysis.

The formula for omitting outliers To eliminate the outliers, we request the cases that are not outliers. The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3 and the probability for Mahalanobis D² is higher than the level of significance, 0.001. After typing in the formula, click on the Continue button to close the dialog box.
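
The inclusion condition can be written as a small filter. The case records below are hypothetical values invented for illustration, not cases from GSS2000.sav. The field names simply reuse the SPSS variable names sre_1 and p_mah_1.

```python
def keep_case(studentized_residual, p_mahalanobis,
              residual_limit=3.0, alpha=0.001):
    """Include a case only if it is neither a univariate nor a multivariate outlier."""
    return abs(studentized_residual) < residual_limit and p_mahalanobis > alpha


# Hypothetical cases, made up for this example.
cases = [
    {"id": "A", "sre_1": 0.42, "p_mah_1": 0.6100},   # kept
    {"id": "B", "sre_1": 3.57, "p_mah_1": 0.4500},   # residual above 3
    {"id": "C", "sre_1": -3.23, "p_mah_1": 0.3000},  # residual below -3
    {"id": "D", "sre_1": 1.10, "p_mah_1": 0.0004},   # Mahalanobis p too small
]

retained = [c["id"] for c in cases if keep_case(c["sre_1"], c["p_mah_1"])]
```

Taking the absolute value handles the "regardless of sign" part of the rule, so large negative residuals are excluded along with large positive ones.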

Completing the request for the selection To complete the request, we click on the OK button.

The omitted outliers SPSS identifies the excluded cases by drawing a slash mark through the case number. Most of the slashes are for cases with missing data, but we also see that the cases identified as outliers are included in those that will be omitted.

Running the regression without outliers We run the regression again, excluding the outliers. Select the Regression | Linear command from the Analyze menu.

Opening the save options dialog We specify the dependent and independent variables, substituting any transformed variables required by assumptions. When we used regression to detect outliers, we entered all variables. Now we are testing the relationship specified in the problem, so we change the method to Stepwise. On our last run, we instructed SPSS to save studentized residuals and Mahalanobis distance. To prevent these values from being calculated again, click on the Save… button.

Clearing the request to save outlier data First, clear the checkbox for Studentized residuals. Second, clear the checkbox for Mahalanobis distance. Third, click on the OK button to complete the specifications.
