# Outliers and influential data points

## Outliers and influential data points

### The distinction

• An outlier is a data point whose response y does not follow the general trend of the rest of the data.

• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.

### Any outliers? Any influential data points?

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor Coef SE Coef T P

Constant 2.958 2.009 1.47 0.157

x 5.0373 0.3633 13.86 0.000

S = 4.711 R-Sq = 91.0% R-Sq(adj) = 90.5%

### Any outliers? Any influential data points?

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor Coef SE Coef T P

Constant 2.468 1.076 2.29 0.033

x 4.9272 0.1719 28.66 0.000

S = 2.709 R-Sq = 97.7% R-Sq(adj) = 97.6%

### Any outliers? Any influential data points?

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor Coef SE Coef T P

Constant 8.505 4.222 2.01 0.058

x 3.3198 0.6862 4.84 0.000

S = 10.45 R-Sq = 55.2% R-Sq(adj) = 52.8%

### Impact on regression analyses

• Not every outlier strongly influences the regression analysis.

• Always determine if the regression analysis is unduly influenced by one or a few data points.

• Simple plots for simple linear regression.

• Summary measures for multiple linear regression.

## The leverages hii

### The leverages hii

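The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

$$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n$$

where the weights hi1, hi2, …, hii, …, hin depend only on the predictor values.

For example:

$$\hat{y}_1 = h_{11}y_1 + h_{12}y_2 + \cdots + h_{1n}y_n$$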

### The leverages hii

Because the predicted response can be written as:

$$\hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n$$

the leverage, hii, quantifies the influence that the observed response yi has on its predicted value $\hat{y}_i$.
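In matrix notation, the weights are the entries of a single matrix, and the leverages hii are its diagonal elements. The symbols X (the n × p matrix of predictor values, including a column of ones for the intercept) and H (often called the hat matrix) are assumptions introduced here, since the slides do not name them:

```latex
\hat{\mathbf{y}} = \mathbf{H}\mathbf{y},
\qquad
\mathbf{H} = \mathbf{X}(\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}},
\qquad
\hat{y}_i = \sum_{j=1}^{n} h_{ij}\, y_j
```

Because H is built from X alone, the weights cannot depend on the observed responses.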

### Properties of the leverages hii

• The leverage hii is:

• a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.

• a number between 0 and 1, inclusive.

• The sum of the hii equals p, the number of parameters.
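For simple linear regression with an intercept, the leverages have the closed form hii = 1/n + (xi − x̄)² / Σ(xj − x̄)², so the properties above can be checked numerically. The sketch below assumes that model only; the function name `leverages` is illustrative, and the x values 1, 2, 3, 4 are those of the small four-point example used for the residuals in this deck.

```python
def leverages(x):
    """Leverages h_ii for simple linear regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Each leverage grows with the squared distance of x_i from the mean of x.
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


x = [1, 2, 3, 4]
h = leverages(x)
print([round(hi, 3) for hi in h])   # leverages, largest at the extreme x values
print(round(sum(h), 6))             # should equal p = 2 (intercept and slope)
```

The two end points, which lie farthest from the mean of x, receive the largest leverages.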

### Any high leverages hii?

HI1

0.176297 0.157454 0.127014 0.119313 0.086145

0.077744 0.065028 0.061276 0.050974 0.049628

0.048147 0.049313 0.051829 0.055760 0.069311

0.072580 0.109616 0.127489 0.140453 0.141136

0.163492

Sum of HI1 = 2.0000

### Any high leverages hii?

HI1

0.153481 0.139367 0.116292 0.110382 0.084374

0.077557 0.066879 0.063589 0.050033 0.052121

0.047632 0.048156 0.049557 0.055893 0.057574

0.078121 0.088549 0.096634 0.096227 0.110048

0.357535

Sum of HI1 = 2.0000

## Identifying data points whose x values are extreme … and therefore potentially influential

### Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value, that is, greater than

$$3\left(\frac{p}{n}\right)$$

or greater than 0.99, whichever is smaller.

x y HI1

14.00 68.00 0.357535

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 14.0 68.00 71.449 1.620 -3.449 -1.59 X

X denotes an observation whose X value gives it large influence.

x y HI2

13.00 15.00 0.311532

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 13.0 15.00 51.66 5.83 -36.66 -4.23 RX

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large influence.
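The flagging rule can be sketched in a few lines. Because the leverages sum to p, their mean is p/n, so the cutoff is the smaller of 3p/n and 0.99. The function name `leverage_cutoff` is hypothetical; the leverages are the 21 HI1 values listed earlier for the data set containing the point x = 14, y = 68, and p = 2 is assumed (intercept and slope).

```python
def leverage_cutoff(n, p):
    """Cutoff from the rule above: 3 times the mean leverage p/n, capped at 0.99."""
    return min(3 * p / n, 0.99)


# HI1 values for the data set that includes the point x = 14, y = 68.
hi1 = [
    0.153481, 0.139367, 0.116292, 0.110382, 0.084374,
    0.077557, 0.066879, 0.063589, 0.050033, 0.052121,
    0.047632, 0.048156, 0.049557, 0.055893, 0.057574,
    0.078121, 0.088549, 0.096634, 0.096227, 0.110048,
    0.357535,
]
p = 2  # assumed: intercept and slope
cutoff = leverage_cutoff(len(hi1), p)

# Report (observation number, leverage) for every flagged observation.
flagged = [(obs, h) for obs, h in enumerate(hi1, start=1) if h > cutoff]
print(f"cutoff = {cutoff:.4f}")
print("flagged:", flagged)
```

With n = 21 the cutoff is 6/21, about 0.286, so only observation 21 is flagged. The leverage 0.311532 of the point x = 13, y = 15 exceeds the same cutoff.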

### Important distinction!

• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.

• The leverage depends only on the predictor values.

• Whether the data point is influential or not depends on the observed value yi.

## Identifying outliers (unusual y values)

### Identifying outliers

• Residuals

• Standardized residuals

• also called internally studentized residuals

### Residuals

Ordinary residuals defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

x y FITS1 RESI1

1 2 2.2 -0.2

2 5 4.4 0.6

3 6 6.6 -0.6

4 9 8.8 0.2

### Standardized residuals

Standardized residuals defined for each observation, i = 1, …, n:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

MSE1 0.400000

x y FITS1 RESI1 HI1 SRES1

1 2 2.2 -0.2 0.7 -0.57735

2 5 4.4 0.6 0.3 1.13389

3 6 6.6 -0.6 0.3 -1.13389

4 9 8.8 0.2 0.7 0.57735
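The SRES1 column can be reproduced by dividing each residual by the square root of MSE(1 − hii). The sketch below assumes simple linear regression with an intercept (p = 2); the function names are illustrative and not Minitab's.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def standardized_residuals(x, y, p=2):
    """Residuals divided by sqrt(MSE * (1 - h_ii))."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]


x = [1, 2, 3, 4]
y = [2, 5, 6, 9]
print([round(r, 5) for r in standardized_residuals(x, y)])
```

For the first row, MSE = 0.4 and hii = 0.7 give −0.2 / √(0.4 × 0.3), which is the tabulated −0.57735.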

### Standardized residuals

• Standardized residuals quantify how large the residuals are in standard deviation units.

• An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.

• Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).

### An outlier?

S = 4.711

x y FITS1 HI1 s(e) RESI1 SRES1

0.10000 -0.0716 3.4614 0.176297 4.27561 -3.5330 -0.82635

0.45401 4.1673 5.2446 0.157454 4.32424 -1.0774 -0.24916

1.09765 6.5703 8.4869 0.127014 4.40166 -1.9166 -0.43544

1.27936 13.8150 9.4022 0.119313 4.42103 4.4128 0.99818

2.20611 11.4501 14.0706 0.086145 4.50352 -2.6205 -0.58191

...

8.70156 46.5475 46.7904 0.140453 4.36765 -0.2429 -0.05561

9.16463 45.7762 49.1230 0.163492 4.30872 -3.3468 -0.77679

4.00000 40.0000 23.1070 0.050974 4.58936 16.8930 3.68110

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 4.00 40.00 23.11 1.06 16.89 3.68 R

R denotes an observation with a large standardized residual.

### Why should we care? (Regression of y on x with outlier)

The regression equation is y = 2.95763 + 5.03734 x

S = 4.71075 R-Sq = 91.0 % R-Sq(adj) = 90.5 %

Analysis of Variance

Source DF SS MS F P

Regression 1 4265.82 4265.82 192.230 0.000

Error 19 421.63 22.19

Total 20 4687.46

### Why should we care? (Regression of y on x without outlier)

The regression equation is y = 1.73217 + 5.11687 x

S = 2.5919 R-Sq = 97.3 % R-Sq(adj) = 97.2 %

Analysis of Variance

Source DF SS MS F P

Regression 1 4386.07 4386.07 652.841 0.000

Error 18 120.93 6.72

Total 19 4507.00
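The effect of the outlier on the summary statistics follows directly from the two analysis of variance tables: MSE = SSE / DF(Error), S = √MSE, R-Sq = SSR / SSTO and F = MSR / MSE. The sketch below recomputes them from the printed sums of squares; the function name `anova_summary` is hypothetical, and small differences from the Minitab output are expected because the inputs are rounded.

```python
from math import sqrt


def anova_summary(ss_regression, ss_error, ss_total, df_error, df_regression=1):
    """Recompute MSE, S, R-Sq and F from an ANOVA table."""
    mse = ss_error / df_error
    msr = ss_regression / df_regression
    return {
        "MSE": mse,
        "S": sqrt(mse),
        "R-Sq": ss_regression / ss_total,
        "F": msr / mse,
    }


with_outlier = anova_summary(4265.82, 421.63, 4687.46, df_error=19)
without_outlier = anova_summary(4386.07, 120.93, 4507.00, df_error=18)

for label, result in [("with", with_outlier), ("without", without_outlier)]:
    print(label, {name: round(value, 3) for name, value in result.items()})
```

The single outlier more than triples the MSE (from about 6.72 to about 22.19), which widens every confidence and prediction interval based on it.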

## Identifying influential data points

### Identifying influential data points

• Deleted residuals

• Deleted t residuals

• also called studentized deleted residuals

• also called externally studentized residuals

• Difference in fits, DFITS

• Cook’s distance measure

### Basic idea of these four measures

• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.

• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.

### Deleted residuals

yi = the observed response for the ith observation

$\hat{y}_{(i)}$ = the predicted response for the ith observation based on the estimated model with the ith observation deleted

Deleted residual:

$$d_i = y_i - \hat{y}_{(i)}$$
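The delete-and-refit idea can be sketched directly: drop one observation, refit the line on the remaining n − 1 points, and compare the deleted observation with its prediction. The sketch assumes simple linear regression with an intercept; the function names are illustrative, and the data are the four points (x = 1, 2, 3, 10) that appear with the deleted t residuals in this deck.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def deleted_residuals(x, y):
    """d_i = y_i minus the prediction from the model fitted without observation i."""
    result = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        result.append(y[i] - (b0 + b1 * x[i]))
    return result


x = [1, 2, 3, 10]
y = [2.1, 3.8, 5.2, 2.1]
print([round(d, 4) for d in deleted_residuals(x, y)])
```

The fourth point has an ordinary residual of only −0.42 but a deleted residual of about −14, because the full fit is pulled toward it. The same values can be obtained without refitting as ei / (1 − hii).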

### Deleted t residuals

A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.

x y RESI1 TRES1

1 2.1 -1.59 -1.7431

2 3.8 0.24 0.1217

3 5.2 1.77 1.6361

10 2.1 -0.42 -19.7990

Do any of the deleted t residuals stick out like a sore thumb?
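The TRES1 column for the four-point example can be reproduced without refitting, using the standard identity SSE(i) = SSE − ei² / (1 − hii) for the error sum of squares with the ith observation deleted. This is a minimal sketch assuming simple linear regression with an intercept (p = 2); the function name is illustrative.

```python
from math import sqrt


def deleted_t_residuals(x, y, p=2):
    """Deleted t residuals: e_i / sqrt(MSE_(i) * (1 - h_ii))."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(e ** 2 for e in resid)
    result = []
    for e, h in zip(resid, lev):
        # MSE of the fit without observation i, on (n - 1) - p degrees of freedom.
        mse_deleted = (sse - e ** 2 / (1 - h)) / (n - 1 - p)
        result.append(e / sqrt(mse_deleted * (1 - h)))
    return result


x = [1, 2, 3, 10]
y = [2.1, 3.8, 5.2, 2.1]
print([round(t, 4) for t in deleted_t_residuals(x, y)])
```

The fourth value, about −19.8, is the one that sticks out, even though the ordinary residual of that point is the second smallest in absolute value.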

Row x y RESI1 SRES1 TRES1

1 0.10000 -0.0716 -3.5330 -0.82635 -0.81916

2 0.45401 4.1673 -1.0774 -0.24916 -0.24291

3 1.09765 6.5703 -1.9166 -0.43544 -0.42596

...

19 8.70156 46.5475 -0.2429 -0.05561 -0.05413

20 9.16463 45.7762 -3.3468 -0.77679 -0.76837

21 4.00000 40.0000 16.8930 3.68110 6.69012

Do any of the deleted t residuals stick out like a sore thumb?

### DFITS

The difference in fits:

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}}$$

is the number of standard deviations that the fitted value changes when the ith case is omitted.
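A direct way to compute the difference in fits is to refit the model without each observation in turn. The sketch below does this for simple linear regression with an intercept (p = 2 assumed), using the four points (x = 1, 2, 3, 10) from the deleted t residuals example. The function names are illustrative. `dfits_from_t` uses the standard identity DFITS = t × √(hii / (1 − hii)), which is not stated in the slides.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def dfits(x, y, p=2):
    """(fit with all data - fit with i deleted) / sqrt(MSE_(i) * h_ii)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    result = []
    for i in range(n):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0_i, b1_i = fit_line(x_rest, y_rest)
        sse_i = sum((yj - (b0_i + b1_i * xj)) ** 2 for xj, yj in zip(x_rest, y_rest))
        mse_i = sse_i / (n - 1 - p)
        h_ii = 1 / n + (x[i] - x_bar) ** 2 / sxx
        change = (b0 + b1 * x[i]) - (b0_i + b1_i * x[i])
        result.append(change / sqrt(mse_i * h_ii))
    return result


def dfits_from_t(t, h):
    """Shortcut from a deleted t residual and a leverage."""
    return t * sqrt(h / (1 - h))


print([round(v, 3) for v in dfits([1, 2, 3, 10], [2.1, 3.8, 5.2, 2.1])])
print(round(dfits_from_t(6.69012, 0.050974), 4))
```

The second call uses the deleted t residual (6.69012) and leverage (0.050974) reported in this deck for the point x = 4, y = 40, and should land close to its tabulated DFIT3 value of 1.5505.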

### Using DFITS

An observation is deemed influential …

… if the absolute value of its DFIT value is greater than:

… or if the absolute value of its DFIT value sticks out like a sore thumb from the other DFIT values.

x y DFIT1

14.00 68.00 -1.23841

Row x y DFIT1

1 0.1000 -0.0716 -0.52503

2 0.4540 4.1673 -0.08388

3 1.0977 6.5703 -0.18232

4 1.2794 13.8150 0.75898

5 2.2061 11.4501 -0.21823

6 2.5006 12.9554 -0.20155

7 3.0403 20.1575 0.27774

8 3.2358 17.5633 -0.08230

9 4.4531 26.0317 0.13865

10 4.1699 22.7573 -0.02221

11 5.2847 26.3030 -0.18487

12 5.5924 30.6885 0.05523

13 5.9209 33.9402 0.19741

14 6.6607 30.9228 -0.42449

15 6.7995 34.1100 -0.17249

16 7.9794 44.4536 0.29918

17 8.4154 46.5022 0.30960

18 8.7161 50.0568 0.63049

19 8.7016 46.5475 0.14948

20 9.1646 45.7762 -0.25094

21 14.0000 68.0000 -1.23841

x y DFIT2

13.00 15.00 -11.4670

Row x y DFIT2

1 0.1000 -0.0716 -0.4028

2 0.4540 4.1673 -0.2438

3 1.0977 6.5703 -0.2058

4 1.2794 13.8150 0.0376

5 2.2061 11.4501 -0.1314

6 2.5006 12.9554 -0.1096

7 3.0403 20.1575 0.0405

8 3.2358 17.5633 -0.0424

9 4.4531 26.0317 0.0602

10 4.1699 22.7573 0.0092

11 5.2847 26.3030 0.0054

12 5.5924 30.6885 0.0782

13 5.9209 33.9402 0.1278

14 6.6607 30.9228 0.0072

15 6.7995 34.1100 0.0731

16 7.9794 44.4536 0.2805

17 8.4154 46.5022 0.3236

18 8.7161 50.0568 0.4361

19 8.7016 46.5475 0.3089

20 9.1646 45.7762 0.2492

21 13.0000 15.0000 -11.4670

x y DFIT3

4.00 40.00 1.5505

Row x y DFIT3

1 0.10000 -0.0716 -0.37897

2 0.45401 4.1673 -0.10501

3 1.09765 6.5703 -0.16248

4 1.27936 13.8150 0.36737

5 2.20611 11.4501 -0.17547

6 2.50064 12.9554 -0.16377

7 3.04030 20.1575 0.10670

8 3.23583 17.5633 -0.09265

9 4.45308 26.0317 0.03061

10 4.16990 22.7573 -0.05850

11 5.28474 26.3030 -0.16025

12 5.59238 30.6885 -0.02183

13 5.92091 33.9402 0.05988

14 6.66066 30.9228 -0.34036

15 6.79953 34.1100 -0.18835

16 7.97943 44.4536 0.10017

17 8.41536 46.5022 0.09771

18 8.71607 50.0568 0.29275

19 8.70156 46.5475 -0.02188

20 9.16463 45.7762 -0.33969

21 4.00000 40.0000 1.55050

### Cook’s distance

Cook’s distance:

$$D_i = \frac{e_i^2}{p \times MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

• Di depends on both residual ei and leverage hii.

• Di summarizes how much the estimated coefficients, taken together, change when the ith observation is deleted.

• A large Di indicates yi has a strong influence on the estimated coefficients.
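Because Di depends only on the residual, the leverage, MSE and p, it can be computed from quantities already reported in this deck. The sketch below uses the usual formula Di = ei² / (p × MSE) × hii / (1 − hii)²; the function name `cooks_distance` is illustrative, and p = 2 is assumed (intercept and slope). The inputs are the residual (16.8930), leverage (0.050974) and S (4.71075) listed for the point x = 4, y = 40.

```python
def cooks_distance(residual, leverage, mse, p):
    """Cook's distance from an ordinary residual and a leverage."""
    return (residual ** 2 / (p * mse)) * (leverage / (1 - leverage) ** 2)


s = 4.71075  # S from the regression that includes the point x = 4, y = 40
d21 = cooks_distance(residual=16.8930, leverage=0.050974, mse=s ** 2, p=2)
print(round(d21, 4))
```

The result should agree with the COOK3 value of 0.363914 reported for that point, up to rounding of the inputs. A large residual combined with a small leverage, as here, yields only a moderate Di.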

### Using Cook’s distance

• If Di is greater than 1, then the ith data point is worthy of further investigation.

• If Di is greater than 4, then the ith data point is most certainly influential.

• Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.

x y COOK1

14.00 68.00 0.701960

Row x y COOK1

1 0.1000 -0.0716 0.134156

2 0.4540 4.1673 0.003705

3 1.0977 6.5703 0.017302

4 1.2794 13.8150 0.241688

5 2.2061 11.4501 0.024434

6 2.5006 12.9554 0.020879

7 3.0403 20.1575 0.038414

8 3.2358 17.5633 0.003555

9 4.4531 26.0317 0.009944

10 4.1699 22.7573 0.000260

11 5.2847 26.3030 0.017379

12 5.5924 30.6885 0.001605

13 5.9209 33.9402 0.019747

14 6.6607 30.9228 0.081345

15 6.7995 34.1100 0.015290

16 7.9794 44.4536 0.044621

17 8.4154 46.5022 0.047961

18 8.7161 50.0568 0.173897

19 8.7016 46.5475 0.011657

20 9.1646 45.7762 0.032320

21 14.0000 68.0000 0.701960

x y COOK2

13.00 15.00 4.04801

Row x y COOK2

1 0.1000 -0.0716 0.08172

2 0.4540 4.1673 0.03076

3 1.0977 6.5703 0.02198

4 1.2794 13.8150 0.00075

5 2.2061 11.4501 0.00901

6 2.5006 12.9554 0.00629

7 3.0403 20.1575 0.00086

8 3.2358 17.5633 0.00095

9 4.4531 26.0317 0.00191

10 4.1699 22.7573 0.00004

11 5.2847 26.3030 0.00002

12 5.5924 30.6885 0.00320

13 5.9209 33.9402 0.00848

14 6.6607 30.9228 0.00003

15 6.7995 34.1100 0.00280

16 7.9794 44.4536 0.03958

17 8.4154 46.5022 0.05229

18 8.7161 50.0568 0.09180

19 8.7016 46.5475 0.04809

20 9.1646 45.7762 0.03194

21 13.0000 15.0000 4.04801

x y COOK3

4.00 40.00 0.36391

Row x y COOK3

1 0.10000 -0.0716 0.073075

2 0.45401 4.1673 0.005801

3 1.09765 6.5703 0.013793

4 1.27936 13.8150 0.067493

5 2.20611 11.4501 0.015960

6 2.50064 12.9554 0.013909

7 3.04030 20.1575 0.005955

8 3.23583 17.5633 0.004498

9 4.45308 26.0317 0.000494

10 4.16990 22.7573 0.001799

11 5.28474 26.3030 0.013191

12 5.59238 30.6885 0.000251

13 5.92091 33.9402 0.001886

14 6.66066 30.9228 0.056276

15 6.79953 34.1100 0.018263

16 7.97943 44.4536 0.005272

17 8.41536 46.5022 0.005020

18 8.71607 50.0568 0.043959

19 8.70156 46.5475 0.000253

20 9.16463 45.7762 0.058966

21 4.00000 40.0000 0.363914

### A strategy for dealing with problematic data points

• Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.

• First, check for obvious data errors.

• If a data entry error, simply correct it.

• If not representative of the population, delete it.

• If a procedural error invalidates the measurement, delete it.

### A comment about deleting data points

• Do not delete data just because they do not fit your preconceived regression model.

• You must have a good, objective reason for deleting data points.

• If you delete any data after you’ve collected it, justify and describe it in your reports.

• If not sure what to do about a data point, analyze the data twice, once with and once without the data point, and report both results.
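Analyzing the data twice can be as simple as fitting the model with and without the questionable point and reporting both sets of estimates. The sketch below assumes simple linear regression with an intercept; the function names are hypothetical, and the data are the four points (x = 1, 2, 3, 10) from the deleted t residuals example, with the fourth point treated as the questionable one.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def fit_with_and_without(x, y, index):
    """Return (b0, b1) for all data and for the data without one observation."""
    full = fit_line(x, y)
    reduced = fit_line(x[:index] + x[index + 1:], y[:index] + y[index + 1:])
    return full, reduced


x = [1, 2, 3, 10]
y = [2.1, 3.8, 5.2, 2.1]
full, reduced = fit_with_and_without(x, y, index=3)  # index 3 is the fourth point
print("all data:      y = %.2f + %.2f x" % full)
print("point removed: y = %.2f + %.2f x" % reduced)
```

Here the slope changes sign when the fourth point is removed, which is exactly the kind of difference a report should disclose.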

### A strategy for dealing with problematic data points (cont’d)

• Then, consider model misspecification.

• Any important variables missing?

• Any nonlinearity that needs to be modeled?

• Any missing interaction terms?

• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model.