Multiple Regression

Multiple Regression – Split Sample Validation General criteria for split sample validation Sample problems

General criteria for split sample validation • It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following as our criteria that the validation verified our analysis and supports the generalizability of our findings: • First, the overall relationship between the dependent variable and the set of independent variable must be statistically significant for both validation analyses. • Second, the R² for each validation must be within 5% (plus or minus) of the R² for the model using the full sample.

General criteria for split sample validation - 2 • Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e. the same variables are statistically significant or not significant. • For stepwise multiple regression, we require that the same variables be significant, but it is not required that they enter in exactly the same order. • For hierarchical multiple regression, the R² change for the validation must be statistically significant.

Notes • Findings are stated on the results for the analysis of the full data set, not the validations. • If our validation analysis does not support the findings of the analysis on the full data set, we will declare the answer to the problem to be false. There is, however, another common option, which is to only cite statistical significance for independent variables supported by the validation analysis as well as the full data set analysis. All other variables are considered to be non-significant. Generally it is the independent variables with the weakest individual relationship to the dependent variable which fail to validate.

Problem 1 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 552804 as the random number seed. The variables "age" [age], "highest year of school completed" [educ], and "sex" [sex] have a moderate relationship to the variable "occupational prestige score" [prestg80]. Survey respondents who were older had more prestigious occupations. Survey respondents who had completed more years of school had more prestigious occupations. The variable sex did not have a relationship to "occupational prestige score" [prestg80]. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Steps prior to the validation analysis • Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. • Next, we must solve the regression problem to make certain the that findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. • When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1 The probability of the F statistic (36.639) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2 The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.554, which would be characterized as moderate using the rule of thumb than a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 30.7%.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 For the independent variable age, the probability of the t statistic (2.749) for the b coefficient is 0.006 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and occupational prestige score. The b coefficient associated with age (0.125) is positive, indicating a direct relationship in which higher numeric values for age are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who were older had more prestigious occupations.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 For the independent variable highest year of school completed, the probability of the t statistic (10.352) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with highest year of school completed is equal to zero (b = 0) and conclude that there is a statistically significant relationship between highest year of school completed and occupational prestige score. The b coefficient associated with highest year of school completed (2.810) is positive, indicating a direct relationship in which higher numeric values for highest year of school completed are associated with higher numeric values for occupational prestige score. Therefore, the positive value of b implies that survey respondents who had completed more years of school had more prestigious occupations.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 For the independent variable sex, the probability of the t statistic (-0.917) for the b coefficient is 0.360 which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and occupational prestige score.

Setting the random number seed To set the random number seed, select the Random Number Seed… command from the Transform menu.

Set the random number seed First, click on the Set seed to option button to activate the text box. Second, type in the random seed stated in the problem. Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

Select the compute command To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.

The formula for the split variable First, type the name for the new variable, split, into the Target Variable text box. Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false. Third, click on the OK button to complete the dialog box.

The split variable in the data editor In the data editor, the split variable shows a random pattern of zero’s and one’s. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

Repeat the regression with first validation sample To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

Using "split" as the selection variable First, scroll down the list of variables and highlight the variable split. Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Setting the value of split to select cases When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt up to enter a specific value for split. Click on the Rule… button to enter a value for split.

Completing the value selection First, type the value for the first half of the sample, 0, into the Value text box. Second, click on the Continue button to complete the value entry.

Requesting output for the first validation sample Click on the OK button to request the output. When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable. Since the validation analysis requires us to compare the results of the analysis using the two split sample, we will request the output for the second sample before doing any comparison.

Repeat the regression with second validation sample To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

Setting the value of split to select cases Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule… button to enter a different value for split.

Requesting output for the second validation sample Click on the OK button to request the output. When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

SPLIT-SAMPLE VALIDATION - 1 In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant. In the first validation, the probability for the F statistic testing overall relationship was <0.001. For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

SPLIT-SAMPLE VALIDATION - 2 The proportion of variance in the relationship utilizing the full data set was 30.7% compared to 24.5% for the first split sample validation and 39.6% for the second split sample validation. In both of the split-sample validation analyses, the proportion of variance in the dependent variable explained by the independent variables was not within 5% of the variance explained in the model using the full data set (30.7%). The strength of the relationship between the independent variables and the dependent variable is not supported by the validation. The answer to the question is false, because the validation did not verify the strength of the relationship in the analysis of the full data set.

Table of validation results: standard regression It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis. This table shows us that the validation failed in the strength of relationship row (R²), where the R² for the validations did not fall within 5% if the R² for the full data set. Had we satisfied this criteria, the validation would have failed in the Significant Coefficients row, where age of respondent was not statistically significant in the first validation analysis.

Problem 2 1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Validate the results of your regression analysis by splitting the sample in two, using 397155 as the random number seed. After controlling for the effects of the variables "age" [age] and "sex" [sex], the addition of the variable "happiness of marriage" [hapmar] reduces the error in predicting "general happiness" [happy] by 37.0%. After controlling for age and sex, the variable happiness of marriage makes an individual contribution to reducing the error in predicting general happiness. Survey respondents who were less happy with their marriages were less happy overall. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

Steps prior to the validation analysis • Prior to the split sample validation analysis, we must test for conformity to assumptions and examine outliers, making whatever transformations are needed and removing outliers. • Next, we must solve the regression problem to make certain the that findings (existence, strength, direction, and importance of relationships) stated in the problem are correct and, therefore, in need of validation before final interpretation. • When we do the validation, we include whatever transformations and omission of outliers were present in the model we want to validate. • For hierarchical regression, the R² change statistics must be statistically significant in order for the model to be validated.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 1 The probability of the F statistic (25.476) for the overall regression relationship for all indpendent variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of all independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of all independent variables and the dependent variable.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 2 The R Square Change statistic for the increase in R² associated with the added variables (happiness of marriage) is 0.370. Using a proportional reduction in error interpretation for R², information provided by the added variables reduces our error in predicting general happiness by 37.0%. The R Square statistic is used in the validation analysis. In this example, the proportion of variance in the dependent variable explained by all of the independent variables is 37.2%.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES - 3 The probability of the F statistic (75.972) for the change in R² associated with the addition of the predictor variables to the regression analysis containing the control variables is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no improvement in the relationship between the set of independent variables and the dependent variable when the predictors are added (R² Change = 0). We support the research hypothesis that there is a statistically significant improvement in the relationship between the set of independent variables and the dependent variable.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1 For the independent variable happiness of marriage, the probability of the t statistic (8.716) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with happiness of marriage is equal to zero (b = 0) and conclude that there is a statistically significant relationship between happiness of marriage and general happiness.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2 The b coefficient associated with happiness of marriage (0.658) is positive, indicating a direct relationship in which higher numeric values for happiness of marriage are associated with higher numeric values for general happiness. The independent variable happiness of marriage is an ordinal variable that is coded so that higher numeric values are associated with survey respondents who were less happy with their marriages. The dependent variable general happiness is also an ordinal variable. It is coded so that higher numeric values are associated with survey respondents who were less happy overall. Therefore, the positive value of b implies that survey respondents who were less happy with their marriages were less happy overall.

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3 Age and sex are variables whose effects are controlled for, so there is no requirement that they be statistically significant.