**1. **SW388R7
Data Analysis & Computers II
Slide 1 Multiple Regression ? Assumptions and Outliers
Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems

**2. **SW388R7
Data Analysis & Computers II
Slide 2 Multiple Regression and Assumptions Multiple regression is most effect at identifying relationship between a dependent variable and a combination of independent variables when its underlying assumptions are satisfied: each of the metric variables are normally distributed, the relationships between metric variables are linear, and the relationship between metric and dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that our answer is wrong. It means that our solution may under-report the strength of the relationships.

**3. **SW388R7
Data Analysis & Computers II
Slide 3 Multiple Regression and Outliers Outliers can distort the regression results. When an outlier is included in the analysis, it pulls the regression line towards itself. This can result in a solution that is more accurate for the outlier, but less accurate for all of the other cases in the data set.
We will check for univariate outliers on the dependent variable and multivariate outliers on the independent variables.

**4. **SW388R7
Data Analysis & Computers II
Slide 4 Relationship between assumptions and outliers The problems of satisfying assumptions and detecting outliers are intertwined. For example, if a case has a value on the dependent variable that is an outlier, it will affect the skew, and hence, the normality of the distribution.
Removing an outlier may improve the distribution of a variable.
Transforming a variable may reduce the likelihood that the value for a case will be characterized as an outlier.

**5. **SW388R7
Data Analysis & Computers II
Slide 5 Order of analysis is important The order in which we check assumptions and detect outliers will affect our results because we may get a different subset of cases in the final analysis.
In order to maximize the number of cases available to the analysis, we will evaluate assumptions first. We will substitute any transformations of variable that enable us to satisfy the assumptions.
We will use any transformed variables that are required in our analysis to detect outliers.

**6. **SW388R7
Data Analysis & Computers II
Slide 6 Strategy for solving problems Run type of regression specified in problem statement on variables using full data set.
Test the dependent variable for normality. If it does not satisfy the criteria for normality unless transformed, substitute the transformed variable in the remaining tests that call for the use of the dependent variable.
Test for normality, linearity, homoscedasticity using scripts. Decide which transformations should be used.
Substitute transformations and run regression entering all independent variables, saving studentized residuals and Mahalanobis distance scores. Compute probabilities for D?.
Remove the outliers (studentized residual greater than 3 or Mahalanobis D? with p <= 0.001), and run regression with the method and variables specified in the problem.
Compare R? for analysis using transformed variables and omitting outliers (step 5) to R? obtained for model using all data and original variables (step 1).

**7. **SW388R7
Data Analysis & Computers II
Slide 7 Transforming dependent variables If dependent variable is not normally distributed:
Try log, square root, and inverse transformation. Use first transformed variable that satisfies normality criteria.
If no transformation satisfies normality criteria, use untransformed variable and add caution for violation of assumption.
If a transformation satisfies normality, use the transformed variable in the tests of the independent variables.

**8. **SW388R7
Data Analysis & Computers II
Slide 8 Transforming independent variables - 1 If independent variable is normally distributed and linearly related to dependent variable, use as is.
If independent variable is normally distributed but not linearly related to dependent variable:
Try log, square root, square, and inverse transformation. Use first transformed variable that satisfies linearity criteria and does not violate normality criteria
If no transformation satisfies linearity criteria and does not violate normality criteria, use untransformed variable and add caution for violation of assumption

**9. **SW388R7
Data Analysis & Computers II
Slide 9 Transforming independent variables - 2 If independent variable is linearly related to dependent variable but not normally distributed:
Try log, square root, and inverse transformation. Use first transformed variable that satisfies normality criteria and does not reduce correlation.
Try log, square root, and inverse transformation. Use first transformed variable that satisfies normality criteria and has significant correlation.
If no transformation satisfies normality criteria with a significant correlation, use untransformed variable and add caution for violation of assumption

**10. **SW388R7
Data Analysis & Computers II
Slide 10 Transforming independent variables - 3 If independent variable is not linearly related to dependent variable and not normally distributed:
Try log, square root, square, and inverse transformation. Use first transformed variable that satisfies normality criteria and has significant correlation.
If no transformation satisfies normality criteria with a significant correlation, used untransformed variable and add caution for violation of assumption

**11. **SW388R7
Data Analysis & Computers II
Slide 11 Impact of transformations and omitting outliers We evaluate the regression assumptions and detect outliers with a view toward strengthening the relationship.
This may not happen. The regression may be the same, it may be weaker, and it may be stronger. We cannot be certain of the impact until we run the regression again.
In the end, we may opt not to exclude outliers and not to employ transformations; the analysis informs us of the consequences of doing either.

**12. **SW388R7
Data Analysis & Computers II
Slide 12 Notes Whenever you start a new problem, make sure you have removed variables created for previous analysis and have included all cases back into the data set.
I have added the square transformation to the checkboxes for transformations in the normality script. Since this is an option for linearity, we need to be able to evaluate its impact on normality.
If you change the options for output in pivot tables from labels to names, you will get an error message when you use the linearity script. To solve the problem, change the option for output in pivot tables back to labels.

**13. **SW388R7
Data Analysis & Computers II
Slide 13 Problem 1 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**14. **SW388R7
Data Analysis & Computers II
Slide 14 Dissecting problem 1 - 1 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**15. **SW388R7
Data Analysis & Computers II
Slide 15 Dissecting problem 1 - 2 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**16. **SW388R7
Data Analysis & Computers II
Slide 16 Dissecting problem 1 - 3 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**17. **SW388R7
Data Analysis & Computers II
Slide 17 R? before transformations or removing outliers

**18. **SW388R7
Data Analysis & Computers II
Slide 18 R? before transformations or removing outliers

**19. **SW388R7
Data Analysis & Computers II
Slide 19 R? before transformations or removing outliers

**20. **SW388R7
Data Analysis & Computers II
Slide 20 Normality of the dependent variable: total family income

**21. **SW388R7
Data Analysis & Computers II
Slide 21 Normality of the dependent variable: total family income

**22. **SW388R7
Data Analysis & Computers II
Slide 22 Linearity and independent variable: how many in family earned money

**23. **SW388R7
Data Analysis & Computers II
Slide 23 Linearity and independent variable: how many in family earned money

**24. **SW388R7
Data Analysis & Computers II
Slide 24 Normality of independent variable:how many in family earned money

**25. **SW388R7
Data Analysis & Computers II
Slide 25 Normality of independent variable:how many in family earned money

**26. **SW388R7
Data Analysis & Computers II
Slide 26 Normality of independent variable:how many in family earned money

**27. **SW388R7
Data Analysis & Computers II
Slide 27 Transformation for how many in family earned money The independent variable, how many in family earned money, had a linear relationship to the dependent variable, total family income.
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98].
We will substitute the logarithmic transformation of how many in family earned money in the regression analysis.

**28. **SW388R7
Data Analysis & Computers II
Slide 28 Normality of independent variable:respondent?s income

**29. **SW388R7
Data Analysis & Computers II
Slide 29 Normality of independent variable: respondent?s income

**30. **SW388R7
Data Analysis & Computers II
Slide 30 Linearity and independent variable: respondent?s income

**31. **SW388R7
Data Analysis & Computers II
Slide 31 Linearity and independent variable: respondent?s income

**32. **SW388R7
Data Analysis & Computers II
Slide 32 Homoscedasticity: sex

**33. **SW388R7
Data Analysis & Computers II
Slide 33 Homoscedasticity: sex

**34. **SW388R7
Data Analysis & Computers II
Slide 34 Adding a transformed variable

**35. **SW388R7
Data Analysis & Computers II
Slide 35 The transformed variable in the data editor

**36. **SW388R7
Data Analysis & Computers II
Slide 36 The regression to identify outliers

**37. **SW388R7
Data Analysis & Computers II
Slide 37 Saving the measures of outliers

**38. **SW388R7
Data Analysis & Computers II
Slide 38 The variables for identifying outliers

**39. **SW388R7
Data Analysis & Computers II
Slide 39 Computing the probability for Mahalanobis D?

**40. **SW388R7
Data Analysis & Computers II
Slide 40 Formula for probability for Mahalanobis D?

**41. **SW388R7
Data Analysis & Computers II
Slide 41 Multivariate outliers

**42. **SW388R7
Data Analysis & Computers II
Slide 42 Univariate outliers

**43. **SW388R7
Data Analysis & Computers II
Slide 43 Omitting the outliers

**44. **SW388R7
Data Analysis & Computers II
Slide 44 Specifying the condition to omit outliers

**45. **SW388R7
Data Analysis & Computers II
Slide 45 The formula for omitting outliers

**46. **SW388R7
Data Analysis & Computers II
Slide 46 Completing the request for the selection

**47. **SW388R7
Data Analysis & Computers II
Slide 47 The omitted multivariate outlier

**48. **SW388R7
Data Analysis & Computers II
Slide 48 Running the regression without outliers

**49. **SW388R7
Data Analysis & Computers II
Slide 49 Opening the save options dialog

**50. **SW388R7
Data Analysis & Computers II
Slide 50 Clearing the request to save outlier data

**51. **SW388R7
Data Analysis & Computers II
Slide 51 Opening the statistics options dialog

**52. **SW388R7
Data Analysis & Computers II
Slide 52 Requesting descriptive statistics

**53. **SW388R7
Data Analysis & Computers II
Slide 53 Requesting the output

**54. **SW388R7
Data Analysis & Computers II
Slide 54 Sample size requirement

**55. **SW388R7
Data Analysis & Computers II
Slide 55 Significance of regression relationship

**56. **SW388R7
Data Analysis & Computers II
Slide 56 Increase in proportion of variance

**57. **SW388R7
Data Analysis & Computers II
Slide 57 Problem 2 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**58. **SW388R7
Data Analysis & Computers II
Slide 58 Dissecting problem 2 - 1 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**59. **SW388R7
Data Analysis & Computers II
Slide 59 Dissecting problem 2 - 2 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

**60. **SW388R7
Data Analysis & Computers II
Slide 60 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic Dissecting problem 2 - 3

**61. **SW388R7
Data Analysis & Computers II
Slide 61 R? before transformations or removing outliers

**62. **SW388R7
Data Analysis & Computers II
Slide 62 R? before transformations or removing outliers

**63. **SW388R7
Data Analysis & Computers II
Slide 63 Normality of the dependent variable

**64. **SW388R7
Data Analysis & Computers II
Slide 64 Normality of the dependent variable

**65. **SW388R7
Data Analysis & Computers II
Slide 65 Normality of independent variable: Age

**66. **SW388R7
Data Analysis & Computers II
Slide 66 Normality of independent variable: Age

**67. **SW388R7
Data Analysis & Computers II
Slide 67 Linearity and independent variable: Age

**68. **SW388R7
Data Analysis & Computers II
Slide 68 Linearity and independent variable: Age

**69. **SW388R7
Data Analysis & Computers II
Slide 69 Transformation for Age The independent variable age satisfied the criteria for normality.
The independent variable age did not have a linear relationship to the dependent variable occupational prestige. However, none of the transformations linearized the relationship.
No transformation will be used - it would not help linearity and is not needed for normality.

**70. **SW388R7
Data Analysis & Computers II
Slide 70 Linearity and independent variable: Highest year of school completed

**71. **SW388R7
Data Analysis & Computers II
Slide 71 Linearity and independent variable: Highest year of school completed

**72. **SW388R7
Data Analysis & Computers II
Slide 72 Normality of independent variable: Highest year of school completed

**73. **SW388R7
Data Analysis & Computers II
Slide 73 Normality of independent variable: Highest year of school completed

**74. **SW388R7
Data Analysis & Computers II
Slide 74 Transformation for highest year of school The independent variable, highest year of school, had a linear relationship to the dependent variable, occupational prestige.
The independent variable, highest year of school, did not satisfy the criteria for normality. None of the transformations for normalizing the distribution of "highest year of school completed" [educ] were effective.
No transformation will be used - it would not help normality and is not needed for linearity. A caution should be added to any findings.

**75. **SW388R7
Data Analysis & Computers II
Slide 75 Homoscedasticity: sex

**76. **SW388R7
Data Analysis & Computers II
Slide 76 Homoscedasticity: sex

**77. **SW388R7
Data Analysis & Computers II
Slide 77 Adding a transformed variable

**78. **SW388R7
Data Analysis & Computers II
Slide 78 The transformed variable in the data editor

**79. **SW388R7
Data Analysis & Computers II
Slide 79 The regression to identify outliers

**80. **SW388R7
Data Analysis & Computers II
Slide 80 Saving the measures of outliers

**81. **SW388R7
Data Analysis & Computers II
Slide 81 The variables for identifying outliers

**82. **SW388R7
Data Analysis & Computers II
Slide 82 Computing the probability for Mahalanobis D?

**83. **SW388R7
Data Analysis & Computers II
Slide 83 Formula for probability for Mahalanobis D?

**84. **SW388R7
Data Analysis & Computers II
Slide 84 The multivariate outlier

**85. **SW388R7
Data Analysis & Computers II
Slide 85 The univariate outlier

**86. **SW388R7
Data Analysis & Computers II
Slide 86 Omitting the outliers

**87. **SW388R7
Data Analysis & Computers II
Slide 87 Specifying the condition to omit outliers

**88. **SW388R7
Data Analysis & Computers II
Slide 88 The formula for omitting outliers

**89. **SW388R7
Data Analysis & Computers II
Slide 89 Completing the request for the selection

**90. **SW388R7
Data Analysis & Computers II
Slide 90 The omitted multivariate outlier

**91. **SW388R7
Data Analysis & Computers II
Slide 91 Running the regression without outliers

**92. **SW388R7
Data Analysis & Computers II
Slide 92 Opening the save options dialog

**93. **SW388R7
Data Analysis & Computers II
Slide 93 Clearing the request to save outlier data

**94. **SW388R7
Data Analysis & Computers II
Slide 94 Opening the statistics options dialog

**95. **SW388R7
Data Analysis & Computers II
Slide 95 Requesting descriptive statistics

**96. **SW388R7
Data Analysis & Computers II
Slide 96 Requesting the output

**97. **SW388R7
Data Analysis & Computers II
Slide 97 Sample size requirement

**98. **SW388R7
Data Analysis & Computers II
Slide 98 Significance of regression relationship

**99. **SW388R7
Data Analysis & Computers II
Slide 99 Increase in proportion of variance

**100. **SW388R7
Data Analysis & Computers II
Slide 100 Impact of assumptions and outliers - 1

**101. **SW388R7
Data Analysis & Computers II
Slide 101 Impact of assumptions and outliers - 2

**102. **SW388R7
Data Analysis & Computers II
Slide 102 Impact of assumptions and outliers - 3

**103. **SW388R7
Data Analysis & Computers II
Slide 103 Impact of assumptions and outliers - 4

**104. **SW388R7
Data Analysis & Computers II
Slide 104 Impact of assumptions and outliers - 5