Psych 5510/6510. Chapter Nine: Outliers and Data Having Undue Influence. Spring, 2009.
Effect of Outliers
Set 1: 1, 3, 5, 9, 14
Sample Mean = est. μ = 6.4, MSE = S² = 26.8
Confidence interval for the mean: 0 ≤ μ ≤ 12.8
Set 2: 1, 3, 5, 9, 140
Sample Mean = est. μ = 31.6, MSE = S² = 3680.8
Confidence interval for the mean: -43.7 ≤ μ ≤ 106.9
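The two summaries can be reproduced with a short Python sketch. It assumes the intervals are 95% confidence intervals built from the t distribution with n - 1 = 4 degrees of freedom (critical value 2.776 from a standard t table); the slide does not state the confidence level, but the numbers match under that assumption. The function name `summarize` is just an illustrative helper.

```python
from math import sqrt
from statistics import mean, variance

# Assumption: 95% interval, so the two-tailed .05 critical t for df = 4.
T_CRIT_DF4 = 2.776


def summarize(scores, t_crit=T_CRIT_DF4):
    """Return (mean, variance, CI lower bound, CI upper bound)."""
    n = len(scores)
    m = mean(scores)
    s2 = variance(scores)              # sample variance (n - 1 denominator)
    half_width = t_crit * sqrt(s2 / n)  # t times the standard error of the mean
    return m, s2, m - half_width, m + half_width


set1 = summarize([1, 3, 5, 9, 14])
set2 = summarize([1, 3, 5, 9, 140])

for label, (m, s2, lower, upper) in (("Set 1", set1), ("Set 2", set2)):
    print(f"{label}: mean={m:.1f}  S2={s2:.1f}  CI=({lower:.1f}, {upper:.1f})")
```

Changing a single score from 14 to 140 multiplies the variance by more than a hundred and stretches the interval well below zero.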
An error of this kind (such as a mistyped score) is less likely to be caught when entering a series of data for the computer to analyze than when computing the analysis yourself with a calculator.
Discovering that more than one kind of thing is being measured (e.g. that more than one population is appearing in the group) can be very interesting.
Rather than simply getting rid of outliers, we need to identify them so that they can be examined and, possibly, the existence of two different populations can be incorporated into the model.
(The taller curve with thinner tails is normally distributed)
Thick tails lead to more frequent extreme scores and greater error variance than sampling from a normal distribution.
Note the error in data entry for student #6. PRE and F don’t change much, but there is a huge difference in the parameters, including a reversal in the direction of the slope.
Note the reversal of slope and the big change in intercept, yet the model is still statistically significant. Without looking at the graph you might think that you have a pretty good model.
Leverage involves determining whether any particular observation has unusual values for X.
The approach we will use involves looking at the ‘lever’ that goes with each observation.
Buried within the regression equation is the fact that all of the X and all of the Y scores in the data set go into computing the values of the b’s in the regression equation. This means that for any one observation the X scores for all of the observations influence its predicted value of Y.
In a much more complicated but equivalent version of the regression equation you plug in all of the X scores for all of the observations to predict a particular value of Y.
Alternative regression equation:
Ŷi= a complicated formula that includes all of the X scores in the data set, not just the X score for observation ‘i’.
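For reference, the "complicated formula" has a standard textbook form that the slide does not show. The sketch below assumes a model with one predictor and an intercept; the symbols $h_{ij}$, $\bar{X}$ and $SS_X$ are introduced here for illustration.

```latex
\hat{Y}_i = \sum_{j=1}^{n} h_{ij}\, Y_j ,
\qquad
h_{ij} = \frac{1}{n} + \frac{(X_i - \bar{X})(X_j - \bar{X})}{SS_X},
\qquad
SS_X = \sum_{j=1}^{n} (X_j - \bar{X})^2 .
```

Every weight $h_{ij}$ depends on the X scores of the whole data set. The weight that observation $i$ places on itself, $h_{ii}$, grows as $X_i$ moves away from $\bar{X}$.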
If an observation has unusual X scores then its predicted value of Y is not very strongly influenced by the X scores of the other observations, instead its prediction is influenced mainly by its own X scores.
If, on the other hand, an observation has X scores similar to those of the other observations, then its predicted value of Y is influenced not only by its own X scores but also by those of the other observations.
This gives us a way of determining whether an observation has unusual X scores. If its predicted value of Y is heavily influenced by its own X scores then those X scores must have been unusual. If its predicted value of Y is influenced by the X scores of other observations then its X scores must have been similar to those of the other observations.
A ‘lever’ (symbolized as h_ii) is a measure of how much an observation’s own X scores influence its predicted value of Y (it measures how much leverage its own X scores had in the prediction). If the observation has unusual X scores (compared to other observations) then its lever is a large value; if the observation has X scores similar to other observations then its lever is a smaller value.
Levers will always have a value between 0 and 1. The mean (expected) value of a lever, if its values of X conform to those of the other observations, is PA/n (the number of parameters in Model A divided by the number of observations). If an observation has a lever much greater than that, it is a flag that it has some unusual X values.
The bigger the lever, the more its X scores stood out as different from the rest. How big does a lever have to be to draw our attention?
Expected (mean) value of the levers is PA/n = 2/13 = .15
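A minimal sketch of how levers could be computed for a one-predictor model with an intercept, using the standard closed form h_ii = 1/n + (X_i - mean of X)² / SSx. The X scores below are hypothetical values made up for illustration (they are not the lecture's data); only n = 13 and PA = 2 come from the example above.

```python
def levers(x):
    """Lever h_ii for each observation in a one-predictor model with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]


# Hypothetical X scores for 13 observations; the last one is deliberately unusual.
x = [10, 12, 15, 18, 20, 22, 25, 27, 30, 32, 35, 38, 90]

h = levers(x)
mean_lever = sum(h) / len(h)  # always PA/n; here 2/13

for i, h_ii in enumerate(h, start=1):
    flag = "  <-- much larger than PA/n" if h_ii > 3 * mean_lever else ""
    print(f"obs {i:2d}: X={x[i - 1]:3d}  lever={h_ii:.3f}{flag}")
print(f"mean lever = {mean_lever:.3f}")
```

The cutoff of three times the mean lever is only an illustrative choice for the printout, not a rule taken from the slides.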
Discrepancy involves determining whether any particular observation has unusual values for Y.
Q: Yi is unusual with respect to what?
A: To the model.
In other words, look for observations that differ greatly from the regression line (giving them a large error). An examination of error terms is referred to as an ‘analysis of the residuals’.
Approach: if an observation is unusual (way off the regression line that would fit all of the other observations) then creating a parameter just to handle it should greatly reduce error (visually, think of the original regression line being freed from the pull of the outlier; look back at the original scatter plots with and without the outlier). We will start by looking at how that works with our outlier.
Model C is the original Model, in this case:
For Model A, add another variable (X2) that has a score of X2 = 0 everywhere but at the outlier, where X2 = 1.
If PRE is significant, then it was worthwhile to handle that one outlier individually in the model, i.e. it doesn’t belong with the other scores.
The dummy variable is there to handle observation #6.
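Written out, the comparison looks like the following sketch. It assumes the usual notation of $\beta$'s for parameters and $\varepsilon_i$ for error, and uses $k$ for the index of the suspected outlier ($k = 6$ in this example).

```latex
\text{Model C:}\quad Y_i = \beta_0 + \beta_1 X_{i1} + \varepsilon_i
\\[4pt]
\text{Model A:}\quad Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i ,
\qquad
X_{i2} =
\begin{cases}
1 & \text{if } i = k \\
0 & \text{otherwise}
\end{cases}
\\[4pt]
PRE = \frac{SSE(C) - SSE(A)}{SSE(C)}
```

Because $X_{i2}$ is nonzero for only one observation, $\beta_2$ absorbs that observation's entire residual, and the remaining parameters are fit to the other scores alone.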
Model C: SAT = 96.55 - .50(HSRANK)
Model A: SAT = 6.71 + .50(HSRANK) + 55.49(Dummy)
PRE = 0.68, F*(1,10) = 21.4, t* = 4.6, p < .01
Thus it was worthwhile to introduce a dummy variable to account for the outlier.
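The link between PRE, F* and t* can be checked with a few lines of arithmetic. This sketch uses the rounded PRE of 0.68 with n = 13, PA = 3 and PC = 2, which is why it lands slightly below the reported F* = 21.4 (presumably computed from an unrounded PRE). The function name `f_from_pre` is illustrative.

```python
from math import sqrt


def f_from_pre(pre, n, pa, pc):
    """F* = [PRE / (PA - PC)] / [(1 - PRE) / (n - PA)]."""
    return (pre / (pa - pc)) / ((1 - pre) / (n - pa))


f_star = f_from_pre(0.68, n=13, pa=3, pc=2)
t_star = sqrt(f_star)  # with one added parameter, t* is the square root of F*

print(f"F*(1,10) = {f_star:.2f}, t* = {t_star:.2f}")
```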
Stat programs will do this for each observation, one at a time, to determine whether or not it is worthwhile to create a dummy variable to handle just that observation. They report this as the Studentized Deleted Residual, which is the square root of the value of F* above.
For each observation, compare a model that has just HSRANK (X1) to one that has a dummy variable (X2) just to handle that one value of Y, and report the t value for the PRE.
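A rough sketch of that one-at-a-time procedure in plain Python. It relies on a standard fact: the model with a dummy variable for observation k has the same SSE as the one-predictor model fit with observation k left out, so no multiple regression routine is needed. The data are hypothetical values made up for illustration (observation #6 plays the role of the data-entry error); they are not the lecture's HSRANK and SAT scores, and the helper names are illustrative.

```python
from math import copysign, sqrt


def fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def sse(x, y, b0, b1):
    """Sum of squared errors of the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def studentized_deleted_residuals(x, y):
    n = len(x)
    sse_c = sse(x, y, *fit(x, y))            # Model C: all observations, PC = 2
    result = []
    for k in range(n):
        x_k, y_k = x[:k] + x[k + 1:], y[:k] + y[k + 1:]
        b0_k, b1_k = fit(x_k, y_k)
        sse_a = sse(x_k, y_k, b0_k, b1_k)    # same SSE as the dummy-variable Model A
        f_star = max(0.0, sse_c - sse_a) / (sse_a / (n - 3))  # PA = 3
        deleted_resid = y[k] - (b0_k + b1_k * x[k])
        result.append(copysign(sqrt(f_star), deleted_resid))
    return result


# Hypothetical data, 13 observations; observation #6 is an intentional outlier.
x = [42, 48, 55, 58, 63, 20, 67, 70, 74, 78, 82, 88, 93]
y = [28, 31, 34, 35, 39, 68, 40, 41, 45, 46, 48, 52, 53]

t_values = studentized_deleted_residuals(x, y)
for i, t in enumerate(t_values, start=1):
    print(f"obs {i:2d}: studentized deleted residual = {t:7.2f}")
```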
Problem: if each t test has a .05 chance of making a Type I error, then the overall error rate across all of the tests is too large with this approach.
One remedy is to divide alpha by the number of tests (alpha = .05/13 = .0038 for this example), but note that p values are not provided by SPSS, so,
The third approach to identifying unusual scores is to see if dropping the score would dramatically change the model; this is known as influence.
Procedure: compare the estimates of the parameters in the model with the outlier, to the estimates of the parameters in the model without the outlier.
We want to compare the following:
If deleting the k’th observation greatly changes the values of the b’s, then it must have been having a large influence on the values when it was included. As can be seen in the previous 2 slides, omitting the outlier greatly changes the b’s (notice how the slope and intercept both changed).
If the values of the b’s change in the two models then the predictions made by the two models will also change. The easiest way to see if the models differ is by comparing their predictions, specifically looking at (Ŷi - Ŷi,[k]) for each observation. As you might guess, to see the total difference between the two models across all observations we will use:
Cook’s D (distance)
There are only informal guidelines for when Cook’s D is considered large:
Again, this will be used simply to draw your attention to where to look for outliers.
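A minimal sketch of Cook's D computed directly from the idea above: refit the model without observation k, sum the squared differences between the two sets of predictions over all observations, and scale by PA times the MSE of the full model. This is the standard definition, D_k = Σ(Ŷi - Ŷi,[k])² / (PA · MSE), which the slide text does not show. The data are the same kind of hypothetical values used for illustration only, not the lecture's data.

```python
def fit(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


def cooks_d(x, y, pa=2):
    """Cook's D for each observation in a one-predictor model (PA = 2)."""
    n = len(x)
    b0, b1 = fit(x, y)
    mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - pa)
    distances = []
    for k in range(n):
        b0_k, b1_k = fit(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        # Squared change in every prediction when observation k is dropped.
        shift = sum(((b0 + b1 * xi) - (b0_k + b1_k * xi)) ** 2 for xi in x)
        distances.append(shift / (pa * mse))
    return distances


# Hypothetical data, 13 observations; observation #6 is an intentional outlier.
x = [42, 48, 55, 58, 63, 20, 67, 70, 74, 78, 82, 88, 93]
y = [28, 31, 34, 35, 39, 68, 40, 41, 45, 46, 48, 52, 53]

d = cooks_d(x, y)
for i, d_k in enumerate(d, start=1):
    print(f"obs {i:2d}: Cook's D = {d_k:.3f}")
```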
Identify what is exemplified by ‘A’, ‘B’, and ‘C’
Leverage: leads us to falsely think we have found something interesting. Unusual X scores inflate SSx, which leads to smaller confidence intervals, making it easier to reject H0.
Discrepancy: shooting ourselves in the foot by causing us to miss something interesting. Scores off the regression line add to SSE(A), which reduces SSR, making PRE smaller.
Influence: all bets are off; the model just doesn’t fit the majority of scores.
The following four scatter plots all have the same model, PRE, and significance!
Ŷi=3.0 + .5Xi
PRE=.666 F*=17.95 p<.01
the regression line would better fit the other scores.
determining the slope, what the heck is X doing?
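The equation and PRE on this slide match Anscombe's (1973) well-known quartet of data sets. Assuming that is what the four scatter plots show (the slide text does not name the source), the sketch below fits each set and confirms that all four give essentially the same intercept, slope, PRE and F*.

```python
def fit_summary(x, y):
    """Return (b0, b1, PRE, F*) for a one-predictor model versus the mean-only model."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse_c = sum((yi - y_bar) ** 2 for yi in y)                      # Model C: mean only
    sse_a = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # Model A: with X
    pre = (sse_c - sse_a) / sse_c
    f_star = pre / ((1 - pre) / (n - 2))
    return b0, b1, pre, f_star


# Anscombe's quartet (assumed to be the data behind the four plots).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_summary(x, y) for name, (x, y) in quartet.items()}
for name, (b0, b1, pre, f_star) in results.items():
    print(f"Set {name:>3}: Y-hat = {b0:.2f} + {b1:.3f}X   PRE = {pre:.3f}   F* = {f_star:.2f}")
```

Identical summary statistics say nothing about whether the line is a sensible description of the scores, which is why each plot has to be looked at.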
Partial regression plots can be of help in visually identifying outliers when there is more than one predictor variable.