Detecting Outliers

Outliers are cases that have an atypical score either on a single variable (univariate outliers) or on a combination of variables (multivariate outliers). Outliers can have a large impact on the solution: a single outlying case can conceivably change the value or score that we would predict for every other case in the study. Our concern is whether the analysis is more valid with the outlier case included or with it excluded. To answer this question, we need methods for detecting and assessing outliers.

To detect univariate outliers, we convert the scores on the variable to standard scores and scan for very large positive and negative values. We will normally apply this strategy to a metric dependent variable. To detect multivariate outliers, that is, cases that are unusual on the combined set of metric independent variables, we use a multivariate distance measure analogous to the standard-score distance from the sample mean.

The decision to exclude or retain an outlier case is based on our understanding of the cause of the outlier and the impact it has on the results. If the outlier is a data entry error or an obvious misstatement by a respondent, it should probably be excluded. If it is an unusual but plausible value, it should be retained. We can gauge the impact of the outlier by running the analysis twice, once with the outlier included and once with it excluded.
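The "run it twice" comparison can be sketched outside SPSS as well. The following minimal Python example fits a simple least-squares line with and without a suspect case. The data and the helper name `ols_slope_intercept` are hypothetical and serve only to illustrate the idea; they are not part of the original slides.

```python
def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept for a simple regression of y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data: the last case departs sharply from the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 3.0]

with_outlier = ols_slope_intercept(x, y)
without_outlier = ols_slope_intercept(x[:-1], y[:-1])

print("with outlier:    slope=%.2f intercept=%.2f" % with_outlier)
print("without outlier: slope=%.2f intercept=%.2f" % without_outlier)
```

If the two sets of estimates differ substantially, the outlier is influential, and the decision to keep or drop it deserves a substantive justification.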

1. Detecting Univariate Outliers

To detect univariate outliers, we convert our numeric variables to their standard score equivalents. Outliers are the cases with large standard scores (z-scores), e.g. smaller than -2.5 or larger than +2.5. Standardizing variables converts them to a standard deviation unit of measurement, so that the distance from the mean for any case on any variable is expressed in comparable units.

The Descriptives procedure can create standard scores for our variables and add them to our data. SPSS names the z-score variables by preceding the variable name with the letter z, so the standard score equivalent of x1 is named zx1. To locate the outliers for each variable, we can either sort the data set by the z-score variable or use the SPSS Explore procedure (the EXAMINE command in syntax) to print the highest and lowest values of the z-score variables to the output window.

The use of standard scores to detect outliers presumes that the variable is normally distributed. When a variable is not normally distributed, a boxplot may be more effective in identifying outliers. A boxplot identifies outliers using a somewhat different criterion: cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are identified as outliers, and cases more than 3 box lengths away are marked as extreme values. The box length is the interquartile range, i.e. the difference between the values at the 25th and 75th percentiles.
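As a rough illustration of the boxplot criterion, the sketch below flags cases by their distance from the box edges, measured in box lengths. It uses Python's `statistics.quantiles` to obtain the quartiles, which is an assumption: SPSS may compute the box edges slightly differently, so borderline cases can be classified differently. The data and the function name `boxplot_flags` are hypothetical.

```python
from statistics import quantiles


def boxplot_flags(values):
    """Map case index -> 'outlier' or 'extreme' using the box-length rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box = q3 - q1  # box length = interquartile range
    flags = {}
    for i, v in enumerate(values):
        # Distance beyond the nearer box edge (0 if the value is inside the box).
        if v < q1:
            dist = q1 - v
        elif v > q3:
            dist = v - q3
        else:
            dist = 0
        if dist > 3 * box:
            flags[i] = "extreme"
        elif dist > 1.5 * box:
            flags[i] = "outlier"
    return flags


# Hypothetical scores with one moderate and one severe outlier.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 11, 10, 13, 17, 40]
flags = boxplot_flags(scores)
print(flags)
```

Because the rule depends only on the quartiles, it is far less sensitive to the outliers themselves than the mean and standard deviation used for z-scores.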

Compute Standard Scores for the Metric Variables
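For readers who want to see the computation itself, here is a minimal Python sketch of what standardizing a variable amounts to. It assumes the sample standard deviation (n - 1 denominator) and the ±2.5 cutoff mentioned earlier; the data and function names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw scores to standard scores using the sample SD (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def univariate_outliers(values, cutoff=2.5):
    """Return the indices of cases whose |z| exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]


# Hypothetical variable x1; the last case is far from the rest.
x1 = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 11, 10, 13, 12, 40]
zx1 = z_scores(x1)
outlier_cases = univariate_outliers(x1)
print(outlier_cases)
```

Note that in a very small sample no case can reach a large |z|, because the outlier inflates the standard deviation it is measured against; the cutoff is only meaningful with a reasonable number of cases.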

The Standard Scores in the SPSS Data Editor

Use the Explore Procedure to Locate Large Standard Scores Indicating Outliers

Specify Outliers as the Desired Statistics

Extreme Values as Outliers
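The extreme-values listing can be approximated with a short script that reports the highest and lowest standard scores together with their case numbers. This is only a sketch of the idea, assuming five values at each end and 1-based case numbers; the function name `extreme_values` and the data are hypothetical.

```python
from statistics import mean, stdev


def extreme_values(values, k=5):
    """Return (highest, lowest): k (case_number, z) pairs from each end."""
    m, s = mean(values), stdev(values)
    scored = [(case, (v - m) / s) for case, v in enumerate(values, start=1)]
    ranked = sorted(scored, key=lambda pair: pair[1])  # ascending by z
    highest = ranked[-k:][::-1]  # largest z first
    lowest = ranked[:k]          # smallest z first
    return highest, lowest


x1 = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 11, 10, 13, 12, 40]
highest, lowest = extreme_values(x1)

for case, z in highest:
    print("highest: case %2d  z = %6.2f" % (case, z))
for case, z in lowest:
    print("lowest:  case %2d  z = %6.2f" % (case, z))
```

Scanning these two short lists for values beyond the chosen cutoff is usually quicker than sorting and scrolling through the whole data set.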

2. Detecting Multivariate Outliers

Standard scores measure the statistical distance of a data point from the mean for all cases, measured in standard deviation units along the horizontal axis of a normal distribution plot. There is a similar measure of statistical distance in multidimensional space, known as Mahalanobis D² (d-squared). This statistic measures the distance of a case's set of scores (or vector) on the independent variables included in the analysis from the centroid, the multidimensional equivalent of a mean.

The larger the value of Mahalanobis D² for a case, and the smaller its corresponding probability value, the more likely the case is to be a multivariate outlier. The probability value lets us make a decision about the null hypothesis that the vector of scores for a case is equal to the centroid of the distribution for all cases.

Mahalanobis D² can be computed in SPSS with the regression procedure for a set of independent variables. The Save option will add the D² values to the data set. SPSS does not report the probability of Mahalanobis D², so we compute it ourselves. D² is distributed approximately as a chi-square statistic with degrees of freedom equal to the number of independent variables in the analysis. The SPSS cumulative distribution function (CDF.CHISQ) computes the area under the chi-square curve from the left end of the distribution to the point corresponding to our D² value. The right-tail probability of obtaining a D² value this large is one minus the cumulative distribution function value.

We use the probability values to identify the cases that are most distant, or different, from the other cases in the sample. We decide whether to omit or include extreme cases by re-running the analysis without them and comparing the two sets of results to determine whether our results are more representative with or without the extreme cases.
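To make the distance measure concrete, the sketch below computes Mahalanobis D² by hand for the special case of two independent variables, where the covariance matrix can be inverted in closed form. The restriction to two variables, the data, and the function name `mahalanobis_d2` are assumptions made for illustration; a real analysis with more variables would use a matrix library or the SPSS Save option.

```python
from statistics import mean


def mahalanobis_d2(xs, ys):
    """Mahalanobis D^2 of each (x, y) case from the sample centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)  # the centroid
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d' S^-1 d, written out with the closed-form 2x2 inverse.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical data: the last case breaks the positive relationship.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 2.0]
d2 = mahalanobis_d2(x1, x2)

for case, value in enumerate(d2, start=1):
    print("case %d: D2 = %.3f" % (case, value))
```

The last case is not extreme on either variable taken alone; it stands out only because its combination of scores runs against the correlation between the two variables, which is exactly what D² is designed to capture.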

Request a Multiple Regression to Compute Mahalanobis Distance Statistics

Specify the Variables to Include in the Analysis

Add the Mahalanobis Distance Statistic to the Data Set

The Mahalanobis Distance Statistics in the Data Editor

Compute the Probability Values for the Mahalanobis D² Statistics
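The right-tail probability, one minus the chi-square cumulative distribution function, can also be computed without SPSS. The sketch below uses a standard recurrence for the chi-square upper tail with integer degrees of freedom, built only on Python's `math` module. The D² values, the choice of three independent variables, and the 0.001 significance level are hypothetical assumptions for illustration, as is the function name `chi2_right_tail`.

```python
import math


def chi2_right_tail(d2, df):
    """Right-tail chi-square probability (1 - CDF) for integer df >= 1."""
    half = d2 / 2.0
    if df % 2:
        # Odd df: start from the exact result for df = 1.
        k, prob = 1, math.erfc(math.sqrt(half))
    else:
        # Even df: start from the exact result for df = 2.
        k, prob = 2, math.exp(-half)
    # Step up two degrees of freedom at a time until df is reached.
    while k < df:
        prob += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return prob


# Hypothetical D2 values for four cases, three independent variables.
d2_values = [1.2, 0.8, 18.5, 2.5]
n_predictors = 3
alpha = 0.001  # assumed significance level, not taken from the slides

p_values = [chi2_right_tail(v, n_predictors) for v in d2_values]
flagged = [case for case, p in enumerate(p_values, start=1) if p < alpha]
print(flagged)
```

Sorting the cases by `p_values` in ascending order puts the candidate multivariate outliers at the top of the list.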

Sorting the Data Set to Locate Statistically Significant D² Scores

Highlight Cases with Statistically Significant Mahalanobis D² Scores

The Case IDs for the Multivariate Outliers
