Detecting outliers

Detecting Outliers - PowerPoint PPT Presentation



PowerPoint Slideshow about 'Detecting Outliers' - kerry


Presentation Transcript
Detecting Outliers

Outliers are cases that have an atypical score either for a single variable (univariate outliers) or for a combination of variables (multivariate outliers).  Outliers generally have a large impact on the solution, i.e. the outlier case can conceivably change the value or score that we would predict for every other case in the study.  Our concern with outliers is to answer the question of whether our analysis is more valid with the outlier case included or more valid with the outlier case excluded.

To answer this question, we must have methods for detecting and assessing outliers.  To detect univariate outliers, we convert the scores on the variable to standard scores and scan for very large positive and negative values.  We will normally apply this strategy to a metric dependent variable.  To detect multivariate outliers, we look for unusual cases on the combined set of metric independent variables, using a multivariate distance measure analogous to a standard score's distance from the sample mean.

The decision to exclude or retain the outlier case is based on our understanding of the cause of the outlier and the impact it is having on the results.  If the outlier is a data entry error or an obvious misstatement by a respondent, it probably should be excluded.  If the outlier is an unusual but plausible value, it should be retained.  We can improve our understanding of the impact of the outlier by running the analysis twice, once with the outlier included and once with it excluded.
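The "run it twice" comparison can be sketched in a few lines. This is only an illustration, not the SPSS workflow the slides use: the data values and the helper name `ols_slope_intercept` are hypothetical, and a simple one-predictor least-squares fit stands in for whatever analysis is actually being run.

```python
def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: the last case breaks an otherwise exact y = 2x pattern.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

# Fit once with every case, then again with the suspect case dropped.
slope_all, intercept_all = ols_slope_intercept(x, y)
slope_trimmed, intercept_trimmed = ols_slope_intercept(x[:-1], y[:-1])

print(f"with outlier:    slope={slope_all:.2f}, intercept={intercept_all:.2f}")
print(f"without outlier: slope={slope_trimmed:.2f}, intercept={intercept_trimmed:.2f}")
```

A large shift between the two fits shows that a single case is driving the solution. Whether to drop it still depends on the cause of the outlier, as described above.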

1. Detecting Univariate Outliers

To detect univariate outliers, we convert our numeric variables to their standard score equivalents. Outliers are the cases with z-scores that are large in absolute value, e.g. below -2.5 or above +2.5. Standardizing variables converts them to a standard deviation unit of measurement so that the distance from the mean for any case on any variable is expressed in comparable units.

The Descriptives procedure can create standard scores for our variables and add them to our data. SPSS names the z-score variables by preceding the variable name with the letter z. The name for the standard score equivalent for x1 is zx1.

To locate the outliers for each variable, we can either sort the data set by the z-score variable or use the SPSS Examine procedure to print out the highest and lowest values for the z-score variables to the output window.
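Outside SPSS, the same screening can be sketched with the Python standard library. The scores and the function names below are hypothetical. The 2.5 cutoff is the example value from the text, and the sample (n - 1) standard deviation is assumed to be the one wanted.

```python
import statistics

def z_scores(values):
    """Convert raw scores to standard scores (sample SD, n - 1 denominator)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def univariate_outliers(values, cutoff=2.5):
    """Return (case index, raw value, z-score) for cases with |z| > cutoff."""
    zs = z_scores(values)
    return [(i, v, z) for i, (v, z) in enumerate(zip(values, zs)) if abs(z) > cutoff]

# Hypothetical scores on a variable x1; the last case is deliberately extreme.
x1 = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
flagged = univariate_outliers(x1)

# List the flagged cases, largest |z| first, much like sorting by zx1.
for index, value, z in sorted(flagged, key=lambda row: -abs(row[2])):
    print(f"case {index}: x1={value}, z={z:.2f}")
```

In a small sample one extreme case inflates the standard deviation and so shrinks its own z-score, which is one reason to check a boxplot as well.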

The use of standard scores to detect outliers presumes that the variable is normally distributed. When a variable is not normally distributed, a boxplot may be more effective in identifying outliers.  A boxplot identifies outliers using a somewhat different criterion. Cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box are identified as outliers; cases more than 3 box lengths away are flagged as extreme values. The box length is the interquartile range, i.e. the difference between the values at the 25th and 75th percentiles.
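The boxplot rule can be sketched as follows. The data and the name `boxplot_flags` are hypothetical. Quartiles can be computed in several ways, so the "inclusive" method used here is an assumption and may give box edges slightly different from those SPSS draws.

```python
import statistics

def boxplot_flags(values):
    """Split cases into boxplot outliers (1.5 to 3 box lengths) and extremes (> 3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # interquartile range
    outliers, extremes = [], []
    for v in values:
        # Distance beyond the nearer edge of the box; zero or negative inside the box.
        distance = max(q1 - v, v - q3)
        if distance > 3 * box_length:
            extremes.append(v)
        elif distance > 1.5 * box_length:
            outliers.append(v)
    return outliers, extremes

# Hypothetical scores: 17 sits moderately far from the box, 40 very far.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 17, 40]
outliers, extremes = boxplot_flags(scores)
print("outliers:", outliers)
print("extreme values:", extremes)
```

Because the rule depends only on quartiles, a single extreme case does not stretch the yardstick the way it stretches a standard deviation.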

Use the Explore Procedure to Locate Large Standard Scores Indicating Outliers

Specify Outliers as the Desired Statistics

Extreme Values as Outliers

2. Detecting Multivariate Outliers

Standard scores measure the statistical distance of a data point from the mean for all cases, measured in standard deviation units along the horizontal axis of a normal distribution plot. There is a similar measure of statistical distance in multidimensional space, known as Mahalanobis D² (d-squared). This statistic measures the distance of a case's set of scores (its vector) on the independent variables included in the analysis from the centroid (the multidimensional equivalent of a mean). The larger the value of the Mahalanobis D² for a case, and the smaller its corresponding probability value, the more likely the case is to be a multivariate outlier. The probability value enables us to make a decision about the statistical test of the null hypothesis, which is that the vector of scores for a case is equal to the centroid of the distribution for all cases.

Mahalanobis D² can be computed in SPSS with the regression procedure for a set of independent variables. The Save option will add the D² values to the data set. SPSS does not report the probability of Mahalanobis D² directly.  Mahalanobis D² is approximately distributed as a chi-square statistic with degrees of freedom equal to the number of independent variables in the analysis.  The SPSS cumulative distribution function for chi-square (CDF.CHISQ) computes the area under the chi-square curve from the left end of the distribution to the point corresponding to our statistical value.  The right-tail probability of obtaining a D² value this size is equal to one minus the cumulative distribution function value.
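The D² and right-tail probability calculation can be sketched without SPSS. This is a deliberately narrow illustration with hypothetical data and function names. It handles exactly two independent variables, so the covariance matrix can be inverted by hand and the chi-square right-tail probability has the closed form exp(-D²/2), which holds only for 2 degrees of freedom. With more variables a library routine such as `scipy.stats.chi2.sf` would be needed.

```python
import math
import statistics

def mahalanobis_d2_two_vars(x, y):
    """D² of each (x, y) case from the centroid, using the sample covariance matrix."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    d2 = []
    for a, b in zip(x, y):
        dx, dy = a - mean_x, b - mean_y
        # d' S^-1 d written out for the 2x2 case
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

def chi2_right_tail_df2(d2):
    """One minus the chi-square CDF at d2, valid for 2 degrees of freedom only."""
    return math.exp(-d2 / 2)

# Hypothetical predictors: the last case breaks the joint pattern of x and y
# even though neither of its scores is unusual on its own.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 6.0]

d2_values = mahalanobis_d2_two_vars(x, y)
p_values = [chi2_right_tail_df2(d) for d in d2_values]

# Rank cases from most to least distant from the centroid.
for case in sorted(range(len(x)), key=lambda i: p_values[i]):
    print(f"case {case}: D2={d2_values[case]:.3f}, p={p_values[case]:.4f}")
```

The smallest probabilities mark the cases to examine first. The cutoff used to call a case an outlier is a separate decision that this sketch does not make.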

We use the probability values to identify the cases which are most distant, or different, from the other cases in the sample.  We would make our decision about omitting or including extreme cases by re-running the analysis without them and comparing the results we obtain with and without them to determine whether our results are more representative with or without the extreme cases.

Request a Multiple Regression to Compute Mahalanobis Distance Statistics

Specify the Variables to Include in the Analysis

Add the Mahalanobis Distance Statistic to the Data Set

The Mahalanobis Distance Statistics in the Data Editor

The Case IDs for the Multivariate Outliers
