# Outliers and influential data points


### Outliers and influential data points

• An outlier is a data point whose response y does not follow the general trend of the rest of the data.

• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.

No outliers? No influential data points?

Any outliers? Any influential data points?

Without the blue data point, the regression equation is y = 1.73 + 5.12 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 1.732 | 1.121 | 1.55 | 0.140 |
| x | 5.1169 | 0.2003 | 25.55 | 0.000 |

S = 2.592, R-Sq = 97.3%, R-Sq(adj) = 97.2%

With the blue data point, the regression equation is y = 2.96 + 5.04 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 2.958 | 2.009 | 1.47 | 0.157 |
| x | 5.0373 | 0.3633 | 13.86 | 0.000 |

S = 4.711, R-Sq = 91.0%, R-Sq(adj) = 90.5%

Any outliers? Any influential data points?

Without the blue data point, the regression equation is y = 1.73 + 5.12 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 1.732 | 1.121 | 1.55 | 0.140 |
| x | 5.1169 | 0.2003 | 25.55 | 0.000 |

S = 2.592, R-Sq = 97.3%, R-Sq(adj) = 97.2%

With the blue data point, the regression equation is y = 2.47 + 4.93 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 2.468 | 1.076 | 2.29 | 0.033 |
| x | 4.9272 | 0.1719 | 28.66 | 0.000 |

S = 2.709, R-Sq = 97.7%, R-Sq(adj) = 97.6%

Any outliers? Any influential data points?

Without the blue data point, the regression equation is y = 1.73 + 5.12 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 1.732 | 1.121 | 1.55 | 0.140 |
| x | 5.1169 | 0.2003 | 25.55 | 0.000 |

S = 2.592, R-Sq = 97.3%, R-Sq(adj) = 97.2%

With the blue data point, the regression equation is y = 8.50 + 3.32 x

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 8.505 | 4.222 | 2.01 | 0.058 |
| x | 3.3198 | 0.6862 | 4.84 | 0.000 |

S = 10.45, R-Sq = 55.2%, R-Sq(adj) = 52.8%

• Not every outlier strongly influences the regression analysis.

• Always determine if the regression analysis is unduly influenced by one or a few data points.

• In simple linear regression, simple plots are usually enough to spot such data points.

• In multiple linear regression, summary measures are needed.
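One way to check whether a single data point unduly influences the analysis is to fit the line with and without it and compare the coefficients. The sketch below does this with NumPy. The 20 base points and the three extra points are transcribed from the x and y columns of the tables later in this transcript; pairing each extra point with one of the Minitab outputs above is an inference, and the helper names `fit_line` and `fit_with_extra_point` are illustrative only.

```python
import numpy as np

# 20 base data points, transcribed from the tables later in this transcript.
x = np.array([0.10000, 0.45401, 1.09765, 1.27936, 2.20611, 2.50064, 3.04030,
              3.23583, 4.45308, 4.16990, 5.28474, 5.59238, 5.92091, 6.66066,
              6.79953, 7.97943, 8.41536, 8.71607, 8.70156, 9.16463])
y = np.array([-0.0716, 4.1673, 6.5703, 13.8150, 11.4501, 12.9554, 20.1575,
              17.5633, 26.0317, 22.7573, 26.3030, 30.6885, 33.9402, 30.9228,
              34.1100, 44.4536, 46.5022, 50.0568, 46.5475, 45.7762])


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope


def fit_with_extra_point(x, y, x_new, y_new):
    """Refit the line after appending one extra observation."""
    return fit_line(np.append(x, x_new), np.append(y, y_new))


baseline = fit_line(x, y)

# Candidate 21st points (assumed to be the "blue" points of the three examples).
candidates = {"(4, 40)": (4.0, 40.0), "(14, 68)": (14.0, 68.0), "(13, 15)": (13.0, 15.0)}
results = {label: fit_with_extra_point(x, y, *pt) for label, pt in candidates.items()}

print(f"without extra point: intercept={baseline[0]:.3f}, slope={baseline[1]:.4f}")
for label, (b0, b1) in results.items():
    print(f"with {label}: intercept={b0:.3f}, slope={b1:.4f}, "
          f"slope change={b1 - baseline[1]:+.4f}")
```

A large change in slope or intercept after adding one point suggests that the point is influential; a small change suggests it is not, even if the point is an outlier.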

### The leverages hii

The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n \quad \text{for } i = 1, \ldots, n$$

where the weights hi1, hi2, …, hii, …, hin depend only on the predictor values.

For example, for the first observation:

$$\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n$$

Because the predicted response can be written this way, the leverage, hii, quantifies the influence that the observed response yi has on its own predicted value ŷi.

• The leverage hii is:

• a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.

• a number between 0 and 1, inclusive.

• The sum of the hii equals p, the number of parameters.
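These properties can be checked numerically. The sketch below builds the hat matrix H = X(X'X)⁻¹X' for a straight-line model and reads the leverages off its diagonal. For simple linear regression the diagonal reduces to hii = 1/n + (xi − x̄)² / Σ(xj − x̄)², which is why leverage grows with the distance of xi from the mean. The four x and y values are an illustrative toy data set, and the function name `leverages` is hypothetical.

```python
import numpy as np


def leverages(x):
    """Return (diagonal of H, H) for a straight-line model with intercept."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]
    H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
    return np.diag(H), H


x = np.array([1.0, 2.0, 3.0, 4.0])   # toy predictor values
y = np.array([2.0, 5.0, 6.0, 9.0])   # toy responses

h, H = leverages(x)
fitted = H @ y                        # each fitted value is a weighted sum of all y

# Closed form for simple linear regression.
h_closed = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("leverages:", h)
print("sum of leverages:", h.sum())   # equals p = 2 (intercept and slope)
print("fitted values:", fitted)
```

The leverages depend only on x: changing y changes the fitted values but leaves `h` untouched.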

0.176297 0.157454 0.127014 0.119313 0.086145

0.077744 0.065028 0.061276 0.050974 0.049628

0.048147 0.049313 0.051829 0.055760 0.069311

0.072580 0.109616 0.127489 0.140453 0.141136

0.163492

Sum of HI1 = 2.0000

0.153481 0.139367 0.116292 0.110382 0.084374

0.077557 0.066879 0.063589 0.050033 0.052121

0.047632 0.048156 0.049557 0.055893 0.057574

0.078121 0.088549 0.096634 0.096227 0.110048

0.357535

Sum of HI1 = 2.0000

### Identifying data points whose x values are extreme … and therefore potentially influential

Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value, p/n, or greater than 0.99, whichever is smaller.

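For the data point x = 14.00, y = 68.00, the leverage is 0.357535:

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 14.0 | 68.00 | 71.449 | 1.620 | -3.449 | -1.59 X |

X denotes an observation whose X value gives it large influence.

For the data point x = 13.00, y = 15.00, the leverage is 0.311532:

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 13.0 | 15.00 | 51.66 | 5.83 | -36.66 | -4.23 RX |

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large influence.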

• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.

• The leverage depends only on the predictor values.

• Whether the data point is influential or not depends on the observed value yi.
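The leverage flagging rule described in this section can be sketched as follows. The 21 leverage values are the ones listed earlier in this section whose last entry is 0.357535; the function name `flag_high_leverage` is hypothetical, and the mean leverage is taken to be p/n because the leverages sum to p.

```python
import numpy as np


def flag_high_leverage(h, p):
    """Return (indices of flagged observations, cutoff used).

    Cutoff: 3 times the mean leverage p/n, or 0.99, whichever is smaller.
    """
    h = np.asarray(h, dtype=float)
    n = h.size
    cutoff = min(3 * p / n, 0.99)
    return np.flatnonzero(h > cutoff), cutoff


# Leverages transcribed from this section (21 observations, p = 2).
h_values = [0.153481, 0.139367, 0.116292, 0.110382, 0.084374,
            0.077557, 0.066879, 0.063589, 0.050033, 0.052121,
            0.047632, 0.048156, 0.049557, 0.055893, 0.057574,
            0.078121, 0.088549, 0.096634, 0.096227, 0.110048,
            0.357535]

flagged, cutoff = flag_high_leverage(h_values, p=2)
print(f"cutoff = {cutoff:.4f}")
print("flagged observation numbers:", (flagged + 1).tolist())
```

A flag here only says the x value is extreme. Whether the point actually changes the fit still depends on its y value.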

### Identifying outliers (unusual y values)

• Residuals

• Standardized residuals

• also called internally studentized residuals

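Ordinary residuals defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

| x | y | FITS1 | RESI1 |
|---|---|---|---|
| 1 | 2 | 2.2 | -0.2 |
| 2 | 5 | 4.4 | 0.6 |
| 3 | 6 | 6.6 | -0.6 |
| 4 | 9 | 8.8 | 0.2 |

Standardized residuals defined for each observation, i = 1, …, n:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

MSE1 = 0.400000

| x | y | FITS1 | RESI1 | HI1 | SRES1 |
|---|---|---|---|---|---|
| 1 | 2 | 2.2 | -0.2 | 0.7 | -0.57735 |
| 2 | 5 | 4.4 | 0.6 | 0.3 | 1.13389 |
| 3 | 6 | 6.6 | -0.6 | 0.3 | -1.13389 |
| 4 | 9 | 8.8 | 0.2 | 0.7 | 0.57735 |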

• Standardized residuals quantify how large the residuals are in standard deviation units.

• An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.

• Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).
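The four-point example above can be reproduced in a few lines. This is a minimal sketch with NumPy, assuming a straight-line model with intercept (p = 2); the variable names mirror the Minitab column labels but are otherwise arbitrary.

```python
import numpy as np

# Four-point example from the tables above.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])               # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares coefficients
fits = X @ beta                                    # FITS1
resid = y - fits                                   # RESI1: e_i = y_i - yhat_i
mse = resid @ resid / (n - p)                      # MSE1
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # HI1: leverages
sres = resid / np.sqrt(mse * (1 - h))              # SRES1: standardized residuals

# The two guidelines quoted in the bullets above.
deemed_outliers = np.flatnonzero(np.abs(sres) > 3)
minitab_flagged = np.flatnonzero(np.abs(sres) > 2)

for row in zip(x, y, fits, resid, h, sres):
    print("  ".join(f"{v:9.5f}" for v in row))
print("MSE =", round(mse, 6))
```

Note that the two points with the larger raw residuals (0.6 and -0.6) also have the larger standardized residuals here, but that need not hold in general, because the divisor depends on each point's leverage.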

| x | y | FITS1 | HI1 | s(e) | RESI1 | SRES1 |
|---|---|---|---|---|---|---|
| 0.10000 | -0.0716 | 3.4614 | 0.176297 | 4.27561 | -3.5330 | -0.82635 |
| 0.45401 | 4.1673 | 5.2446 | 0.157454 | 4.32424 | -1.0774 | -0.24916 |
| 1.09765 | 6.5703 | 8.4869 | 0.127014 | 4.40166 | -1.9166 | -0.43544 |
| 1.27936 | 13.8150 | 9.4022 | 0.119313 | 4.42103 | 4.4128 | 0.99818 |
| 2.20611 | 11.4501 | 14.0706 | 0.086145 | 4.50352 | -2.6205 | -0.58191 |
| ... | | | | | | |
| 8.70156 | 46.5475 | 46.7904 | 0.140453 | 4.36765 | -0.2429 | -0.05561 |
| 9.16463 | 45.7762 | 49.1230 | 0.163492 | 4.30872 | -3.3468 | -0.77679 |
| 4.00000 | 40.0000 | 23.1070 | 0.050974 | 4.58936 | 16.8930 | 3.68110 |

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 4.00 | 40.00 | 23.11 | 1.06 | 16.89 | 3.68 R |

R denotes an observation with a large standardized residual.

Why should we care? (Regression of y on x with outlier)

The regression equation is y = 2.95763 + 5.03734 x

S = 4.71075, R-Sq = 91.0%, R-Sq(adj) = 90.5%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 4265.82 | 4265.82 | 192.230 | 0.000 |
| Error | 19 | 421.63 | 22.19 | | |
| Total | 20 | 4687.46 | | | |

Why should we care? (Regression of y on x without outlier)

The regression equation is y = 1.73217 + 5.11687 x

S = 2.5919, R-Sq = 97.3%, R-Sq(adj) = 97.2%

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 4386.07 | 4386.07 | 652.841 | 0.000 |
| Error | 18 | 120.93 | 6.72 | | |
| Total | 19 | 4507.00 | | | |

### Identifying influential data points

• Deleted residuals

• Deleted t residuals

• also called studentized deleted residuals

• also called externally studentized residuals

• Difference in fits, DFITS

• Cook’s distance measure

• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.

• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.

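yi = the observed response for the ith observation

ŷ(i) = the predicted response for the ith observation, based on the estimated model with the ith observation deleted

Deleted residual:

$$d_i = y_i - \hat{y}_{(i)}$$

Deleted t residuals

A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

where MSE(i) is the mean square error of the model fitted with the ith observation deleted.

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.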

| x | y | RESI1 | TRES1 |
|---|---|---|---|
| 1 | 2.1 | -1.59 | -1.7431 |
| 2 | 3.8 | 0.24 | 0.1217 |
| 3 | 5.2 | 1.77 | 1.6361 |
| 10 | 2.1 | -0.42 | -19.7990 |
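The delete-and-refit procedure can be carried out literally. The sketch below uses the four data points in the table above, reading its first two columns as x and y, refits the line n times with one observation left out each time, and standardizes each deleted residual using s(di)² = MSE(i) / (1 − hii). The function names are hypothetical.

```python
import numpy as np


def fit(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def deleted_t_residuals(x, y):
    """Deleted t residuals obtained by refitting without each observation."""
    n, p = len(x), 2
    X = np.column_stack([np.ones(n), x])
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages from the full data
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = fit(X[keep], y[keep])              # model without observation i
        resid_i = y[keep] - X[keep] @ beta_i
        mse_i = resid_i @ resid_i / (n - 1 - p)     # MSE with observation i deleted
        d_i = y[i] - X[i] @ beta_i                  # deleted residual
        t[i] = d_i / np.sqrt(mse_i / (1 - h[i]))    # standardized deleted residual
    return t


x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
t = deleted_t_residuals(x, y)
print(np.round(t, 4))
```

The ordinary residual of the fourth point is small (-0.42) because that point pulls the line toward itself; its deleted t residual is very large in magnitude because the line fitted to the other three points misses it badly.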

Do any of the deleted t residuals stick out like a sore thumb?

| Obs | x | y | RESI1 | SRES1 | TRES1 |
|---|---|---|---|---|---|
| 1 | 0.10000 | -0.0716 | -3.5330 | -0.82635 | -0.81916 |
| 2 | 0.45401 | 4.1673 | -1.0774 | -0.24916 | -0.24291 |
| 3 | 1.09765 | 6.5703 | -1.9166 | -0.43544 | -0.42596 |
| ... | | | | | |
| 19 | 8.70156 | 46.5475 | -0.2429 | -0.05561 | -0.05413 |
| 20 | 9.16463 | 45.7762 | -3.3468 | -0.77679 | -0.76837 |
| 21 | 4.00000 | 40.0000 | 16.8930 | 3.68110 | 6.69012 |

The difference in fits:

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}}$$

is the number of standard deviations that the fitted value changes when the ith case is omitted.

An observation is deemed influential if the absolute value of its DFITS value is greater than:

$$2\sqrt{\frac{p+1}{n-p-1}}$$

or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
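The difference in fits does not require n separate refits: it can be computed from the ordinary residuals and leverages of the full fit as DFITS = t · sqrt(h / (1 − h)), where t is the deleted t residual. The sketch below applies this to the 21 data points whose last point is (14, 68), transcribed from the tables in this transcript. The cutoff 2·sqrt((p+1)/(n−p−1)) used in the code is an assumption about the guideline intended on this slide; other texts quote different cutoffs.

```python
import numpy as np

# 20 base points plus the extra point (14, 68), transcribed from this transcript.
x = np.array([0.10000, 0.45401, 1.09765, 1.27936, 2.20611, 2.50064, 3.04030,
              3.23583, 4.45308, 4.16990, 5.28474, 5.59238, 5.92091, 6.66066,
              6.79953, 7.97943, 8.41536, 8.71607, 8.70156, 9.16463, 14.0])
y = np.array([-0.0716, 4.1673, 6.5703, 13.8150, 11.4501, 12.9554, 20.1575,
              17.5633, 26.0317, 22.7573, 26.3030, 30.6885, 33.9402, 30.9228,
              34.1100, 44.4536, 46.5022, 50.0568, 46.5475, 45.7762, 68.0])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)       # leverages
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                                    # ordinary residuals
sse = e @ e

# MSE of the model with observation i deleted, without refitting.
mse_del = (sse - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(mse_del * (1 - h))                  # deleted t residuals
dfits = t * np.sqrt(h / (1 - h))                    # difference in fits

cutoff = 2 * np.sqrt((p + 1) / (n - p - 1))         # assumed guideline, see lead-in
influential = np.flatnonzero(np.abs(dfits) > cutoff)

print(f"cutoff = {cutoff:.4f}")
print("DFITS of observation 21:", round(dfits[-1], 5))
print("observations above the cutoff:", (influential + 1).tolist())
```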

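For the data point x = 14.00, y = 68.00: DFITS = -1.23841

| Obs | x | y | DFITS |
|---|---|---|---|
| 1 | 0.1000 | -0.0716 | -0.52503 |
| 2 | 0.4540 | 4.1673 | -0.08388 |
| 3 | 1.0977 | 6.5703 | -0.18232 |
| 4 | 1.2794 | 13.8150 | 0.75898 |
| 5 | 2.2061 | 11.4501 | -0.21823 |
| 6 | 2.5006 | 12.9554 | -0.20155 |
| 7 | 3.0403 | 20.1575 | 0.27774 |
| 8 | 3.2358 | 17.5633 | -0.08230 |
| 9 | 4.4531 | 26.0317 | 0.13865 |
| 10 | 4.1699 | 22.7573 | -0.02221 |
| 11 | 5.2847 | 26.3030 | -0.18487 |
| 12 | 5.5924 | 30.6885 | 0.05523 |
| 13 | 5.9209 | 33.9402 | 0.19741 |
| 14 | 6.6607 | 30.9228 | -0.42449 |
| 15 | 6.7995 | 34.1100 | -0.17249 |
| 16 | 7.9794 | 44.4536 | 0.29918 |
| 17 | 8.4154 | 46.5022 | 0.30960 |
| 18 | 8.7161 | 50.0568 | 0.63049 |
| 19 | 8.7016 | 46.5475 | 0.14948 |
| 20 | 9.1646 | 45.7762 | -0.25094 |
| 21 | 14.0000 | 68.0000 | -1.23841 |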

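For the data point x = 13.00, y = 15.00: DFITS = -11.4670

| Obs | x | y | DFITS |
|---|---|---|---|
| 1 | 0.1000 | -0.0716 | -0.4028 |
| 2 | 0.4540 | 4.1673 | -0.2438 |
| 3 | 1.0977 | 6.5703 | -0.2058 |
| 4 | 1.2794 | 13.8150 | 0.0376 |
| 5 | 2.2061 | 11.4501 | -0.1314 |
| 6 | 2.5006 | 12.9554 | -0.1096 |
| 7 | 3.0403 | 20.1575 | 0.0405 |
| 8 | 3.2358 | 17.5633 | -0.0424 |
| 9 | 4.4531 | 26.0317 | 0.0602 |
| 10 | 4.1699 | 22.7573 | 0.0092 |
| 11 | 5.2847 | 26.3030 | 0.0054 |
| 12 | 5.5924 | 30.6885 | 0.0782 |
| 13 | 5.9209 | 33.9402 | 0.1278 |
| 14 | 6.6607 | 30.9228 | 0.0072 |
| 15 | 6.7995 | 34.1100 | 0.0731 |
| 16 | 7.9794 | 44.4536 | 0.2805 |
| 17 | 8.4154 | 46.5022 | 0.3236 |
| 18 | 8.7161 | 50.0568 | 0.4361 |
| 19 | 8.7016 | 46.5475 | 0.3089 |
| 20 | 9.1646 | 45.7762 | 0.2492 |
| 21 | 13.0000 | 15.0000 | -11.4670 |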

For the data point x = 4.00, y = 40.00: DFITS = 1.5505

| Obs | x | y | DFITS |
|---|---|---|---|
| 1 | 0.10000 | -0.0716 | -0.37897 |
| 2 | 0.45401 | 4.1673 | -0.10501 |
| 3 | 1.09765 | 6.5703 | -0.16248 |
| 4 | 1.27936 | 13.8150 | 0.36737 |
| 5 | 2.20611 | 11.4501 | -0.17547 |
| 6 | 2.50064 | 12.9554 | -0.16377 |
| 7 | 3.04030 | 20.1575 | 0.10670 |
| 8 | 3.23583 | 17.5633 | -0.09265 |
| 9 | 4.45308 | 26.0317 | 0.03061 |
| 10 | 4.16990 | 22.7573 | -0.05850 |
| 11 | 5.28474 | 26.3030 | -0.16025 |
| 12 | 5.59238 | 30.6885 | -0.02183 |
| 13 | 5.92091 | 33.9402 | 0.05988 |
| 14 | 6.66066 | 30.9228 | -0.34036 |
| 15 | 6.79953 | 34.1100 | -0.18835 |
| 16 | 7.97943 | 44.4536 | 0.10017 |
| 17 | 8.41536 | 46.5022 | 0.09771 |
| 18 | 8.71607 | 50.0568 | 0.29275 |
| 19 | 8.70156 | 46.5475 | -0.02188 |
| 20 | 9.16463 | 45.7762 | -0.33969 |
| 21 | 4.00000 | 40.0000 | 1.55050 |

Cook’s distance

$$D_i = \frac{(y_i - \hat{y}_i)^2}{p \times MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

• Di depends on both the residual ei and the leverage hii.

• Di summarizes how much the estimated coefficients change when the ith observation is deleted.

• A large Di indicates yi has a strong influence on the estimated coefficients.

• If Di is greater than 1, then the ith data point is worthy of further investigation.

• If Di is greater than 4, then the ith data point is most certainly influential.

• Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.
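Cook's distance can be computed from a single fit using the standard formula Di = ei² / (p · MSE) · hii / (1 − hii)². The sketch below applies it to the 20 base points plus each of the three extra points in turn, all transcribed from the tables in this transcript, and then applies the two guidelines quoted above (greater than 1, greater than 4). The function name `cooks_distance` is hypothetical.

```python
import numpy as np

# 20 base data points, transcribed from the tables in this transcript.
x20 = np.array([0.10000, 0.45401, 1.09765, 1.27936, 2.20611, 2.50064, 3.04030,
                3.23583, 4.45308, 4.16990, 5.28474, 5.59238, 5.92091, 6.66066,
                6.79953, 7.97943, 8.41536, 8.71607, 8.70156, 9.16463])
y20 = np.array([-0.0716, 4.1673, 6.5703, 13.8150, 11.4501, 12.9554, 20.1575,
                17.5633, 26.0317, 22.7573, 26.3030, 30.6885, 33.9402, 30.9228,
                34.1100, 44.4536, 46.5022, 50.0568, 46.5475, 45.7762])


def cooks_distance(x, y):
    """Cook's distance for every observation of a straight-line fit."""
    n, p = len(x), 2
    X = np.column_stack([np.ones(n), x])
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                                # ordinary residuals
    mse = e @ e / (n - p)
    return e**2 / (p * mse) * h / (1 - h) ** 2


extra_points = [(14.0, 68.0), (13.0, 15.0), (4.0, 40.0)]
cook_of_extra = {}
for x_new, y_new in extra_points:
    d = cooks_distance(np.append(x20, x_new), np.append(y20, y_new))
    cook_of_extra[(x_new, y_new)] = d[-1]           # distance of the added point
    if d[-1] > 4:
        verdict = "greater than 4"
    elif d[-1] > 1:
        verdict = "greater than 1"
    else:
        verdict = "not greater than 1"
    print(f"extra point ({x_new}, {y_new}): D = {d[-1]:.5f} ({verdict})")
```

Only the point (13, 15), which combines an extreme x value with a large residual, produces a large Cook's distance; the other two extra points stay well below 1.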

For the data point x = 14.00, y = 68.00: COOK1 = 0.701960

| Obs | x | y | COOK1 |
|---|---|---|---|
| 1 | 0.1000 | -0.0716 | 0.134156 |
| 2 | 0.4540 | 4.1673 | 0.003705 |
| 3 | 1.0977 | 6.5703 | 0.017302 |
| 4 | 1.2794 | 13.8150 | 0.241688 |
| 5 | 2.2061 | 11.4501 | 0.024434 |
| 6 | 2.5006 | 12.9554 | 0.020879 |
| 7 | 3.0403 | 20.1575 | 0.038414 |
| 8 | 3.2358 | 17.5633 | 0.003555 |
| 9 | 4.4531 | 26.0317 | 0.009944 |
| 10 | 4.1699 | 22.7573 | 0.000260 |
| 11 | 5.2847 | 26.3030 | 0.017379 |
| 12 | 5.5924 | 30.6885 | 0.001605 |
| 13 | 5.9209 | 33.9402 | 0.019747 |
| 14 | 6.6607 | 30.9228 | 0.081345 |
| 15 | 6.7995 | 34.1100 | 0.015290 |
| 16 | 7.9794 | 44.4536 | 0.044621 |
| 17 | 8.4154 | 46.5022 | 0.047961 |
| 18 | 8.7161 | 50.0568 | 0.173897 |
| 19 | 8.7016 | 46.5475 | 0.011657 |
| 20 | 9.1646 | 45.7762 | 0.032320 |
| 21 | 14.0000 | 68.0000 | 0.701960 |

For the data point x = 13.00, y = 15.00: COOK2 = 4.04801

| Obs | x | y | COOK2 |
|---|---|---|---|
| 1 | 0.1000 | -0.0716 | 0.08172 |
| 2 | 0.4540 | 4.1673 | 0.03076 |
| 3 | 1.0977 | 6.5703 | 0.02198 |
| 4 | 1.2794 | 13.8150 | 0.00075 |
| 5 | 2.2061 | 11.4501 | 0.00901 |
| 6 | 2.5006 | 12.9554 | 0.00629 |
| 7 | 3.0403 | 20.1575 | 0.00086 |
| 8 | 3.2358 | 17.5633 | 0.00095 |
| 9 | 4.4531 | 26.0317 | 0.00191 |
| 10 | 4.1699 | 22.7573 | 0.00004 |
| 11 | 5.2847 | 26.3030 | 0.00002 |
| 12 | 5.5924 | 30.6885 | 0.00320 |
| 13 | 5.9209 | 33.9402 | 0.00848 |
| 14 | 6.6607 | 30.9228 | 0.00003 |
| 15 | 6.7995 | 34.1100 | 0.00280 |
| 16 | 7.9794 | 44.4536 | 0.03958 |
| 17 | 8.4154 | 46.5022 | 0.05229 |
| 18 | 8.7161 | 50.0568 | 0.09180 |
| 19 | 8.7016 | 46.5475 | 0.04809 |
| 20 | 9.1646 | 45.7762 | 0.03194 |
| 21 | 13.0000 | 15.0000 | 4.04801 |

For the data point x = 4.00, y = 40.00: COOK3 = 0.36391

| Obs | x | y | COOK3 |
|---|---|---|---|
| 1 | 0.10000 | -0.0716 | 0.073075 |
| 2 | 0.45401 | 4.1673 | 0.005801 |
| 3 | 1.09765 | 6.5703 | 0.013793 |
| 4 | 1.27936 | 13.8150 | 0.067493 |
| 5 | 2.20611 | 11.4501 | 0.015960 |
| 6 | 2.50064 | 12.9554 | 0.013909 |
| 7 | 3.04030 | 20.1575 | 0.005955 |
| 8 | 3.23583 | 17.5633 | 0.004498 |
| 9 | 4.45308 | 26.0317 | 0.000494 |
| 10 | 4.16990 | 22.7573 | 0.001799 |
| 11 | 5.28474 | 26.3030 | 0.013191 |
| 12 | 5.59238 | 30.6885 | 0.000251 |
| 13 | 5.92091 | 33.9402 | 0.001886 |
| 14 | 6.66066 | 30.9228 | 0.056276 |
| 15 | 6.79953 | 34.1100 | 0.018263 |
| 16 | 7.97943 | 44.4536 | 0.005272 |
| 17 | 8.41536 | 46.5022 | 0.005020 |
| 18 | 8.71607 | 50.0568 | 0.043959 |
| 19 | 8.70156 | 46.5475 | 0.000253 |
| 20 | 9.16463 | 45.7762 | 0.058966 |
| 21 | 4.00000 | 40.0000 | 0.363914 |

• Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.

• First, check for obvious data errors.

• If a data entry error, simply correct it.

• If not representative of the population, delete it.

• If a procedural error invalidates the measurement, delete it.

A comment about deleting data points

• Do not delete data just because they do not fit your preconceived regression model.

• You must have a good, objective reason for deleting data points.

• If you delete any data after you’ve collected it, justify and describe it in your reports.

• If you are not sure what to do about a data point, analyze the data twice, once with and once without the data point, and report both results.

• Then, consider model misspecification.

• Any important variables missing?

• Any nonlinearity that needs to be modeled?

• Any missing interaction terms?

• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model.