Satisfying Assumptions of Linear Regression


### Satisfying Assumptions of Linear Regression

• Correcting violations of assumptions
• Detecting outliers
• Transforming variables
• Sample problem
• Solving problems with the script
• Other features of the script
• Logic for homework problems
Consequences of failing to satisfy assumptions
• When a regression fails to meet the assumptions, the probabilities that we base our findings on lose their accuracy. Generally, we fail to detect relationships for which we might otherwise have found support, increasing our chances of making a type II error.
• If we are using the regression to model expected values for the dependent variable, our predictions may be biased in that we are systematically making non-random errors for subsets of our population.
Correcting violations of assumptions - 1
• There are three strategies available to us to correct our violations of assumptions:
• 1. we can exclude outliers from our analysis
• 2. we can transform our variables
• 3. we can add a polynomial term (square, cube, etc.) for an independent variable.
• Employing one strategy generally has an impact on the other strategies. For example, transforming a variable may change a case’s status as an outlier, and excluding an outlier reduces the skew of the distribution, thereby improving normality.
Correcting violations of assumptions - 2
• The availability of multiple strategies creates the opportunity to report our findings for a relationship in different ways, requiring us to choose to report one that we can defend.
• Unless we test all possible combinations, we cannot be certain that we are reporting the optimal relationship.
• When we utilize these remedies, we are required to report them with our findings.
Outliers
• Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.
• Outliers are important because they can change the results of our data analysis.
• Whether we include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.
Different types of outliers
• A case can be an outlier because it has an unusual value for the dependent variable, the independent variable, or both.
• A case is an outlier on the dependent variable if it has a very large studentized residual.
• A case is an outlier on the independent variable if it has high leverage.
• A case is an outlier on both if it has a large value for Cook’s distance.
Detecting outliers
• If the absolute value of a case’s studentized residual is larger than 2.0, the case is an outlier on the dependent variable.
• A leverage value identifies an outlier if it is greater than 2 x (number of IVs + 1) / number of cases.
• A Cook’s distance identifies an outlier if it is greater than 4 / (number of cases – number of IVs – 1).
• If a case’s values exceed twice the cutoffs listed above, we will identify it as an extreme outlier.
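The cutoffs above can be wired together in a few lines. A minimal NumPy sketch (the function name outlier_flags is ours, and SPSS actually saves externally studentized residuals, so values can differ slightly near the cutoffs):

```python
import numpy as np

def outlier_flags(x, y):
    """Flag outliers for a one-IV regression using the deck's cutoffs.

    A sketch, not SPSS itself: SPSS saves *externally* studentized
    residuals (SRE_1); for brevity this uses the internally studentized
    version, so values can differ slightly near the cutoffs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    k = X.shape[1] - 1                                # number of IVs
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverage values
    s2 = resid @ resid / (n - k - 1)                  # residual variance
    stud = resid / np.sqrt(s2 * (1 - h))              # studentized residuals
    cook = stud ** 2 * h / ((k + 1) * (1 - h))        # Cook's distance
    cut_stud, cut_lev, cut_cook = 2.0, 2 * (k + 1) / n, 4 / (n - k - 1)
    n_extreme = ((np.abs(stud) > 2 * cut_stud).astype(int)
                 + (h > 2 * cut_lev).astype(int)
                 + (cook > 2 * cut_cook).astype(int))
    return {
        "outlier": (np.abs(stud) > cut_stud) | (h > cut_lev) | (cook > cut_cook),
        "remove": n_extreme >= 2,  # deck's rule: extreme on two or more criteria
    }
```

A case far off the regression line trips the studentized-residual and Cook’s distance cutoffs at once, so the "remove" rule (extreme on two or more criteria) fires for it.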
Removing outliers
• In our problems, we will remove a case from the analysis if it exceeds two or more of the criteria for extreme outliers.
• When one extreme outlier is removed, the resulting analysis may reveal one or more additional extreme outliers, which could subsequently be removed until the analysis no longer indicates the presence of any extreme outliers.
• In our problems, we will remove extreme outliers only once. This allows us to identify a more accurate model while still accommodating unusual cases.
Detecting and removing outliers in SPSS
• The SPSS regression command allows us to save studentized residuals, Cook’s distances, and leverage values to the data editor.
• To remove them from the analysis, we use the Select Cases command. Since we select cases to be included in the analysis, we should write the command to identify cases that are not outliers.
Transformations
• Transformations change the shape of a distribution, generally by reducing the skewness to more closely approximate a normal distribution.
• We are accustomed to numbers on a decimal scale, but mathematically there are many scales for numbers, e.g. binary, octal, and hexadecimal, as well as geometric scales.
• Transformations are legitimate as long as they preserve the numeric properties of the numbers.
Transformations change the measurement scale

In the diagram to the right, the values of 5 through 20 are plotted on the different scales used in the transformations. These scales would be used in plotting the horizontal axis of the histogram depicting the distribution.

When comparing values measured on the decimal scale to which we are accustomed, we see that each transformation changes the distance between the benchmark measurements. All of the transformations increase the distance between small values and decrease the distance between large values. This has the effect of moving the positively skewed values to the left, reducing the effect of the skewing and producing a distribution that more closely resembles a normal distribution.
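The compression described above is easy to verify numerically. A small NumPy illustration (our own, not from the deck), using the benchmark values 5, 10, 15, 20, which are equally spaced on the decimal scale:

```python
import numpy as np

vals = np.array([5.0, 10.0, 15.0, 20.0])

print(np.diff(vals))              # decimal scale: equal gaps of 5
print(np.diff(np.log10(vals)))    # log gaps shrink: ~0.301, 0.176, 0.125
print(np.diff(np.sqrt(vals)))     # square-root gaps shrink as values grow
print(np.diff(-1.0 / vals))       # inverse gaps shrink fastest
```

On each transformed scale, the gap between successive benchmarks shrinks as the values grow, which is exactly what pulls in the long right tail of a positively skewed distribution.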

Transformations: Computing transformations in SPSS
• In SPSS, transformations are obtained by computing a new variable. SPSS functions are available for the logarithmic (LG10) and square root (SQRT) transformations. The inverse transformation uses a formula which divides minus one by the original value for each case.
• For each of these calculations, there may be data values which are not mathematically permissible. For example, the log of zero is not defined mathematically, division by zero is not permitted, and the square root of a negative number results in an “imaginary” value. We will adjust the values passed to the function to make certain that these illegal operations do not occur.
Transformations: Two forms for computing transformations
• There are two forms for each of the transformations to induce normality, depending on whether the distribution is skewed negatively to the left or skewed positively to the right.
• Both forms use the same SPSS functions and formula to calculate the transformations.
• The two forms differ in the value or argument passed to the functions and formula. The argument to the functions is an adjustment to the original value of the variable to make certain that all of the calculations are mathematically correct.
Transformations: Functions and formulas for transformations
• Symbolically, if we let x stand for the argument passed to the function or formula, the calculations for the transformations are:
• Logarithmic transformation: compute log = LG10(x)
• Square root transformation: compute sqrt = SQRT(x)
• Inverse transformation: compute inv = -1 / (x)
• Square transformation: compute s2 = x * x
• For all transformations, the argument must be greater than zero to guarantee that the calculations are mathematically legitimate.
• For positively skewed variables, the argument is an adjustment to the original value based on the minimum value for the variable.
• If the minimum value for a variable is zero, the adjustment requires that we add one to each value, e.g. x + 1.
• If the minimum value for a variable is a negative number (e.g., –6), the adjustment requires that we add the absolute value of the minimum value (e.g. 6) plus one (e.g. x + 6 + 1, which equals x +7).
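The adjustment rules above can be captured in one small helper. A sketch in NumPy (the function name skew_argument is ours; the deck performs this step by hand in the Compute dialog):

```python
import numpy as np

def skew_argument(values):
    """Shift a positively skewed variable so every value is > 0, per the
    rules above: minimum of 0 -> add 1; negative minimum -> add |min| + 1."""
    values = np.asarray(values, float)
    m = values.min()
    if m > 0:
        return values          # already strictly positive; no adjustment needed
    return values + abs(m) + 1
```

For a minimum of –6 this adds 7 to every value, matching the x + 6 + 1 example above.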
Transformations: Example of positively skewed variable
• Suppose our dataset contains the number of books read (books) for 5 subjects: 1, 3, 0, 5, and 2, and the distribution is positively skewed.
• The minimum value for the variable books is 0. The adjustment for each case is books + 1.
• The transformations would be calculated as follows:
• Compute logBooks = LG10(books + 1)
• Compute sqrBooks = SQRT(books + 1)
• Compute invBooks = -1 / (books + 1)
• If the distribution of a variable is negatively skewed, the adjustment of the values reverses, or reflects, the distribution so that it becomes positively skewed. The transformations are then computed on the values in the positively skewed distribution.
• Reflection is computed by subtracting all of the values for a variable from one plus the absolute value of the maximum value for the variable. This results in a positively skewed distribution with all values larger than zero.
• When an analysis uses a transformation involving reflection, we must remember that this will reverse the direction of all of the relationships in which the variable is involved.
• Our interpretation of relationships must be reversed if reflection has been used, or we can apply a second reflection to the transformed values so that the direction of the transformed variables matches that of the original variables. This is the approach that we will follow.
Transformations: Example of negatively skewed variable
• Suppose our dataset contains the number of books read (books) for 5 subjects: 1, 3, 0, 5, and 2, and the distribution is negatively skewed.
• The maximum value for the variable books is 5. The adjustment for each case is 6 - books.
• The transformations would be calculated as follows:
• Compute logBooks = LG10(6 - books)
• Compute sqrBooks = SQRT(6 - books)
• Compute invBooks = -1 / (6 - books)
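The negatively skewed example, including the second reflection that restores the original direction of relationships, can be worked through numerically. A NumPy sketch (the constant 1 added before the second reflection is our choice; any constant keeping the values positive would do):

```python
import numpy as np

books = np.array([1.0, 3.0, 0.0, 5.0, 2.0])

# Reflect: subtract each value from (max + 1), i.e. 6 - books,
# turning the negative skew into a positive one with all values > 0.
reflected = (books.max() + 1.0) - books      # [5, 3, 6, 1, 4]

logBooks = np.log10(reflected)
sqrBooks = np.sqrt(reflected)
invBooks = -1.0 / reflected

# Reflection reverses every relationship the variable is involved in.
# The deck's approach is to reflect the transformed values a second time
# so directions match the original variable again.
logBooks2 = (logBooks.max() + 1.0) - logBooks
```

After the second reflection, the case that read the most books again has the largest transformed value, so interpretations no longer need to be reversed.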
Transformations: The Square Transformation for Linearity
• The square transformation is computed by multiplying the value for the variable by itself.
• It does not matter whether the distribution is positively or negatively skewed.
• It does matter if the variable has negative values, since we would not be able to distinguish their squares from the square of a comparable positive value (e.g. the square of -4 is equal to the square of +4). If the variable has negative values, we add the absolute value of the minimum value to each score before squaring it.
Transformations: Example of the square transformation
• Suppose our dataset contains change scores (chg) for 5 subjects that indicate the difference between test scores at the end of a semester and test scores at mid-term: -10, 0, 10, 20, and 30.
• The minimum score is -10. The absolute value of the minimum score is 10.
• The transformation would be calculated as follows:
• Compute squarChg = (chg + 10) * (chg + 10)
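The same computation in NumPy form (a sketch; squarChg is the deck’s own variable name):

```python
import numpy as np

chg = np.array([-10.0, 0.0, 10.0, 20.0, 30.0])

shift = abs(chg.min())                 # 10: makes every value non-negative
squarChg = (chg + shift) ** 2          # same as (chg + 10) * (chg + 10)
```

The shift guarantees that a change of –10 and a change of +10 do not collapse onto the same squared value.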
Which transformation to use

The recommendation of which transform to use is often summarized in a pictorial chart like the above. In practice, it is difficult to determine which distribution is most like your variable. It is often more efficient to compute all transformations and examine the statistical properties of each.

Both the histogram and the normality plot for Total Time Spent on the Internet (netime) indicate that the variable is not normally distributed.

Computing transformations in SPSS: Determine whether reflection is required

Skewness, in the table of Descriptive Statistics, indicates whether or not reflection (reversing the values) is required in the transformation.

If Skewness is positive, as it is in this problem, reflection is not required. If Skewness is negative, reflection is required.

In this problem, the minimum value is 0, so 1 will be added to each value in the formula, i.e. the argument to the SPSS functions and formula for the inverse will be:

netime + 1.

To compute the transformation, select the Compute… command from the Transform menu.

Computing transformations in SPSS: Specifying the transform variable name and function

First, in the Target Variable text box, type a name for the log transformation variable, e.g. "lgnetime".

Second, scroll down the list of functions to find LG10, which calculates logarithmic values using a base of 10. (The logarithmic value is the power to which 10 is raised to produce the original number.)

Third, click on the up arrow button to move the highlighted function to the Numeric Expression text box.

Computing transformations in SPSS: Adding the variable name to the function

First, scroll down the list of variables to locate the variable we want to transform. Click on its name so that it is highlighted.

Second, click on the right arrow button. SPSS will replace the highlighted text in the function (?) with the name of the variable.

Following the rules stated for determining the constant that needs to be included in the function either to prevent mathematical errors, or to do reflection, we include the constant in the function argument. In this case, we add 1 to the netime variable.

Click on the OK button to complete the compute request.

Computing transformations in SPSS: The transformed variable

The transformed variable which we requested SPSS compute is shown in the data editor in a column to the right of the other variables in the dataset.

To compute the transformation, select the Compute… command from the Transform menu.

Computing transformations in SPSS: Specifying the transform variable name and function

First, in the Target Variable text box, type a name for the square root transformation variable, e.g. "sqnetime".

Second, scroll down the list of functions to find SQRT, which calculates the square root of a variable.

Third, click on the up arrow button to move the highlighted function to the Numeric Expression text box.

Computing transformations in SPSS: Adding the variable name to the function

First, scroll down the list of variables to locate the variable we want to transform. Click on its name so that it is highlighted.

Second, click on the right arrow button. SPSS will replace the highlighted text in the function (?) with the name of the variable.

Following the rules stated for determining the constant that needs to be included in the function either to prevent mathematical errors, or to do reflection, we include the constant in the function argument. In this case, we add 1 to the netime variable.

Click on the OK button to complete the compute request.

Computing transformations in SPSS: The transformed variable

The transformed variable which we requested SPSS compute is shown in the data editor in a column to the right of the other variables in the dataset.

Computing transformations in SPSS: Computing the inverse transformation

To compute the transformation, select the Compute… command from the Transform menu.

Computing transformations in SPSS: Specifying the transform variable name and formula

First, in the Target Variable text box, type a name for the inverse transformation variable, e.g. "innetime".

Second, there is not a function for computing the inverse, so we type the formula directly into the Numeric Expression text box.

Third, click on the OK button to complete the compute request.

Computing transformations in SPSS: The transformed variable

The transformed variable which we requested SPSS compute is shown in the data editor in a column to the right of the other variables in the dataset.

Computing transformations in SPSS: Adjustment to the argument for the square transformation

It is mathematically correct to square a value of zero, so the adjustment to the argument for the square transformation is different. What we need to avoid are negative numbers, since the square of a negative number produces the same value as the square of a positive number.

In this problem, the minimum value is 0, so no adjustment is needed for computing the square. If the minimum were a number less than zero, we would add the absolute value of the minimum (dropping the sign) as an adjustment to the variable.

Computing transformations in SPSS: Computing the square transformation

To compute the transformation, select the Compute… command from the Transform menu.

Computing transformations in SPSS: Specifying the transform variable name and formula

First, in the Target Variable text box, type a name for the square transformation variable, e.g. "s2netime".

Second, there is not a function for computing the square, so we type the formula directly into the Numeric Expression text box.

Third, click on the OK button to complete the compute request.

Computing transformations in SPSS: The transformed variable

The transformed variable which we requested SPSS compute is shown in the data editor in a column to the right of the other variables in the dataset.

Sample homework problem

Based on information from the data set 2001WorldFactbook.sav, select the best answer from the list below. Use .05 for alpha in the regression analysis and .01 for the diagnostic tests.

A simple linear regression between "population growth rate" [pgrowth] and "birth rate" [birthrat] will satisfy the regression assumptions if we choose to interpret which of the following models.

1 The original variables including all cases

2 The original variables excluding extreme outliers

3 The transformed variables including all cases

4 The transformed variables excluding extreme outliers

5 The quadratic model including all cases

6 The quadratic model excluding extreme outliers

7 None of the proposed models satisfies the assumptions

This week’s problems have a different format.

The task is to work through the different solutions in the order shown in the problem until one of them satisfies the regression assumptions, or none of them satisfies the regression assumptions.

Run the script - 1

We will use a second script to solve this week’s problems.

Select Run Script from the Utilities menu.

Run the script - 2

Navigate to the folder where you downloaded the script.

Highlight the script (.SBS) file to run.

Click on the Run button to run the script.

Assumption of linearity - 1

Click on the arrow button to move the variable to the text box for the dependent variable.

Highlight the dependent variable in the list of variables.

Assumption of linearity - 2

Highlight the independent variable in the list of variables.

Click on the arrow button to move the variable to the list box for the independent variable.

Initial test of conformity to assumptions - 1

Run the regression with all cases to test the initial conformity to the assumptions.

Initial test of conformity to assumptions - 2

The Durbin-Watson statistic (1.93) fell within the acceptable range from 1.50 to 2.50, indicating that the assumption of independence of errors was satisfied.
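The Durbin-Watson statistic the script reports is simple to compute from the residuals. A NumPy sketch (our helper, not the script’s code; the 1.50–2.50 acceptable range is the deck’s rule of thumb):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences of
    the residuals over their total sum of squares. Values near 2 indicate
    little autocorrelation; the deck treats 1.50-2.50 as acceptable."""
    resid = np.asarray(resid, float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```

Residuals that alternate in sign push the statistic toward 4, while residuals that trend together push it toward 0; independent errors land near 2.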

Initial test of conformity to assumptions - 3

The lack of fit test (F(157, 59) = 1.78, p = .006) indicated that the assumption of linearity was violated.

Initial test of conformity to assumptions - 4

The Breusch-Pagan test (Breusch-Pagan(1) = 679.27, p < .001) indicated that the assumption of homogeneity of error variance was violated.
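The Breusch-Pagan statistic can be sketched directly: in its Lagrange-multiplier form, it regresses the squared residuals on the predictor. A NumPy version (our own sketch; the script computes this internally, and we omit the chi-square p-value lookup):

```python
import numpy as np

def breusch_pagan_lm(x, resid):
    """Lagrange-multiplier form of the Breusch-Pagan test: regress the
    squared residuals on the predictor and report LM = n * R^2, which is
    chi-square distributed with 1 df for a single IV."""
    x, resid = np.asarray(x, float), np.asarray(resid, float)
    n = len(resid)
    X = np.column_stack([np.ones(n), x])
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    ss_res = np.sum((u2 - X @ beta) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)
```

When the spread of the residuals grows with the predictor, the squared residuals track the predictor, R² rises, and the LM statistic exceeds the chi-square critical value, signalling heteroscedasticity.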

Initial test of conformity to assumptions - 5

The Shapiro-Wilk test of studentized residuals (Shapiro-Wilk(218) = 0.81, p < .001) indicated that the assumption of normality of errors was violated.

Initial test of conformity to assumptions - 6

One extreme outlier was found in the data. Montserrat was an extreme outlier: its Cook's distance (21.295252) was larger than the cutoff value of 0.037037, its leverage (0.331496) was larger than the cutoff value of 0.036697, and its studentized residual (-9.173) was smaller than the cutoff value of -4.0.

Initial test of conformity to assumptions - 7

We could exclude the cases one at a time by selecting each case in the list of included cases and clicking on the arrow button, or we can have the script do it for us by clicking on the Exclude extreme outliers button.

Initial test of conformity to assumptions - 8

Case number 136, Montserrat, is added to the list of cases to exclude.

Model: original variables, excluding outliers

To see whether or not removing the outlier resolves the violation of assumptions, run the regression again, this time excluding the extreme outlier.

Removing the one extreme outlier solved the violation of the assumption of linearity.

The lack of fit test (F(156, 59) = 0.94, p = .617) indicated that the assumption of linearity was satisfied.

The Durbin-Watson statistic (2.01) fell within the acceptable range from 1.50 to 2.50, indicating that the assumption of independence of errors was satisfied.

The Breusch-Pagan test (Breusch-Pagan(1) = 29.24, p < .001) indicated that the assumption of homogeneity of error variance was violated.

The Shapiro-Wilk test of studentized residuals (Shapiro-Wilk(217) = 0.97, p < .001) indicated that the assumption of normality of errors was violated.

Selecting transformations

Since removing outliers did not solve all of our violations, we will try transformations of the variables.

We restore all of the cases to the analysis by clicking on the Include all cases button.

Test the normality of the dependent variable - 1

First, click on the dependent variable to select it.

Click on the Test normality button.

Test the normality of the dependent variable - 2

There is a statistical procedure named the Box-Cox transformation which SPSS does not compute and which I have not added to the script.

However, we can use the test of normality as a surrogate: larger values of the Shapiro-Wilk statistic are associated with higher probabilities, i.e. with distributions closer to normal.

We will select the transformation with the largest Shapiro-Wilk statistic as the transformation which best “normalizes” the variable, provided it is at least 0.01 larger than the statistical value for the untransformed variable.

For this variable, we would choose the Logarithmic transformation.

Choosing one transformation does not mean that it is particularly effective, only that it is better than the others.
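The selection rule above can be sketched with SciPy’s Shapiro-Wilk test (assuming SciPy is available; the function name best_normalizing_transform is ours, and the 0.01 margin is the deck’s rule of thumb):

```python
import numpy as np
from scipy import stats

def best_normalizing_transform(v):
    """Surrogate for Box-Cox per the deck's rule: compute each transform,
    take the one with the largest Shapiro-Wilk W, and keep the original
    unless the winner beats it by at least 0.01. v must already be shifted
    so all values are > 0 (see the adjustment rules earlier)."""
    v = np.asarray(v, float)
    candidates = {
        "none": v,
        "log": np.log10(v),
        "sqrt": np.sqrt(v),
        "inverse": -1.0 / v,
    }
    w = {name: stats.shapiro(col)[0] for name, col in candidates.items()}
    best = max(w, key=w.get)
    return best if w[best] >= w["none"] + 0.01 else "none"
```

For data whose logarithm is normal by construction, the log transform wins by a wide margin, mirroring the deck’s choice for this variable.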

Test the normality of the independent variable - 1

First, click on the independent variable to select it.

Click on the Test normality button.

Test the normality of the independent variable - 2

For this variable, we would also choose the Logarithmic transformation.

Substituting the transformed variables - 1

For both the dependent and independent variables, the log transform was the most promising. We will now substitute the transformed versions of both variables in the analysis.

First, select the variable we want to transform, birthrat.

Mark the option button for Logarithm.

Click on the Apply transformation button.

Substituting the transformed variables - 2

If we look in the data editor, we see that the log transformed variable has been added to the data set.

Substituting the transformed variables - 3

… and the log transformed variable has been substituted in the text box for the dependent variable.

Substituting the transformed variables - 4

The process for transforming the independent variable is the same.

First, select the variable we want to transform, pgrowth.

Mark the option button for Logarithm.

Click on the Apply transformation button.

Substituting the transformed variables - 5

If we look in the data editor, we see that the log transformed variable has been added to the data set.

Substituting the transformed variables - 6

… and the log transformed independent variable has been substituted in the text box for the independent variable.

Run the regression with all cases to test the regression with transformed variables.

Evaluating assumptions for transformed variables - 1

The lack of fit test (F(157, 59) = 1.38, p = .080) indicated that the assumption of linearity was satisfied.

Evaluating assumptions for transformed variables - 2

The Durbin-Watson statistic (1.94) fell within the acceptable range from 1.50 to 2.50, indicating that the assumption of independence of errors was satisfied.

Evaluating assumptions for transformed variables - 3

The Breusch-Pagan test (Breusch-Pagan(1) = 29.02, p < .001) indicated that the assumption of homogeneity of error variance was violated.

Evaluating assumptions for transformed variables - 4

The Shapiro-Wilk test of studentized residuals (Shapiro-Wilk(218) = 0.96, p < .001) indicated that the assumption of normality of errors was violated.

Evaluating assumptions for transformed variables - 5

Substituting the transformations did not satisfy all of the assumptions.

Since extreme outliers were present in this solution, we will exclude them in our next test.

Substituting the transformed variables - 7

To remove the extreme outliers, click on the Exclude extreme outliers button.

Substituting the transformed variables - 8

The two extreme outliers, Dominica and Montserrat were added to the list of Cases excluded.

Run the regression again to test the regression with transformed variables, excluding extreme outliers.

The Breusch-Pagan test (Breusch-Pagan(1) = .3367, p = .562) indicated that the assumption of homogeneity of error variance was satisfied.

The lack of fit test (F(155, 59) = 1.06, p = .413) indicated that the assumption of linearity was satisfied.

The Shapiro-Wilk test of studentized residuals (Shapiro-Wilk(216) = 0.99, p = .104) indicated that the assumption of normality of errors was satisfied.

The Durbin-Watson statistic (2.01) fell within the acceptable range from 1.50 to 2.50, indicating that the assumption of independence of errors was satisfied.

The regression using the transformed variables and excluding extreme outliers met all of the assumptions of simple linear regression.

This is the analysis that we should use to report our findings.

Testing for a quadratic relationship - 1

When the model with transformed variables does not satisfy the assumptions, we will test the quadratic model with a squared independent variable term.

First, highlight the dependent variable so that we can change it.

Second, mark the option button No transformation to return the variable to its original form.

Third, click on the Apply transformation button to complete the transformation.
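The quadratic model being set up here adds the square of the IV as a second predictor, as the script’s Add Quadratic IV button does. A NumPy sketch of that fit (the function name fit_quadratic is ours):

```python
import numpy as np

def fit_quadratic(x, y):
    """Quadratic model: the square of the IV joins the model as a second
    predictor. Returns the least-squares coefficients in the order
    (intercept, b_x, b_x_squared)."""
    x = np.asarray(x, float)
    X = np.column_stack([np.ones(len(x)), x, x ** 2])
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return beta
```

Because the squared term enters as just another column of the design matrix, the model stays linear in its coefficients and ordinary least squares applies unchanged.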

Testing for a quadratic relationship - 2

First, highlight the independent variable so that we can change it.

Second, mark the option button No transformation to return the variable to its original form.

Third, click on the Apply transformation button to complete the transformation.

Testing for a quadratic relationship - 3

Click on the Include all cases button to restore all of the cases to the analysis.

Testing for a quadratic relationship - 4

Click on the Add Quadratic IV button to add the square of pgrowth to the list of independent variables.

Testing for a quadratic relationship - 5

The quadratic or squared variable is added to the list of independent variables.

Run the regression to test the quadratic equation.

Other features of the script - 1

The default order of the variables in the variables list box follows the order in which they appear in the data editor.

The Sort variables button will sort the variable list alphabetically. Each time it is clicked, it switches the order from alphabetical order to variable order.

Other features of the script - 2

The default options are to Delete output from previous commands and to Delete variables created in this analysis. Either option can be turned off by clearing its check box.

Other features of the script - 3

To clear all of the previous selections and start a new problem, click on Reset.

To close the script, click on the Cancel button or click on the window close box.

Logic for satisfying the assumptions of simple linear regression - 1

Run the regression using the original variables, including all cases.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? A Yes at each check leads to the next check; if all four are Yes, the correct answer is 1. A No at any check moves on to the next model.]

Logic for satisfying the assumptions of simple linear regression - 2

Extreme outliers? If there are none, this is the same model as the previous slide, so move on to the next model. Otherwise, run the regression using the original variables, excluding extreme outliers.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? If all four are Yes, the correct answer is 2. A No at any check moves on to the next model.]

Logic for satisfying the assumptions of simple linear regression - 3

Run the regression using the transformed variables, including all cases.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? If all four are Yes, the correct answer is 3. A No at any check moves on to the next model.]

Logic for satisfying the assumptions of simple linear regression - 4

Extreme outliers? If there are none, this is the same model as the previous slide, so move on to the next model. Otherwise, run the regression using the transformed variables, excluding extreme outliers.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? If all four are Yes, the correct answer is 4. A No at any check moves on to the next model.]

Logic for satisfying the assumptions of simple linear regression - 5

Run the regression using the quadratic model, including all cases.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? If all four are Yes, the correct answer is 5. A No at any check moves on to the next model.]

Logic for satisfying the assumptions of simple linear regression - 6

Extreme outliers? If there are none, no further model remains, and the correct answer is 7. Otherwise, run the regression using the quadratic model, excluding extreme outliers.

[Flowchart: Linear fit? → Homoscedasticity? → Residuals normal? → Independent errors? If all four are Yes, the correct answer is 6; a No at any check means the correct answer is 7.]