Outliers and influential data points



The distinction

  • An outlier is a data point whose response y does not follow the general trend of the rest of the data.

  • A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.



No outliers? No influential data points?



Any outliers? Any influential data points?



Any outliers? Any influential data points?


Outliers and influential data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor Coef SE Coef T P

Constant 2.958 2.009 1.47 0.157

x 5.0373 0.3633 13.86 0.000

S = 4.711 R-Sq = 91.0% R-Sq(adj) = 90.5%



Any outliers? Any influential data points?



Any outliers? Any influential data points?


Outliers and influential data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor Coef SE Coef T P

Constant 2.468 1.076 2.29 0.033

x 4.9272 0.1719 28.66 0.000

S = 2.709 R-Sq = 97.7% R-Sq(adj) = 97.6%



Any outliers? Any influential data points?



Any outliers? Any influential data points?


Outliers and influential data points

Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor Coef SE Coef T P

Constant 1.732 1.121 1.55 0.140

x 5.1169 0.2003 25.55 0.000

S = 2.592 R-Sq = 97.3% R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor Coef SE Coef T P

Constant 8.505 4.222 2.01 0.058

x 3.3198 0.6862 4.84 0.000

S = 10.45 R-Sq = 55.2% R-Sq(adj) = 52.8%



Impact on regression analyses

  • Not every outlier strongly influences the regression analysis.

  • Always determine if the regression analysis is unduly influenced by one or a few data points.

  • Simple plots for simple linear regression.

  • Summary measures for multiple linear regression.



The leverages hii



The leverages hii

The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:
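In standard notation, the fitted value for the ith observation is:

\[
\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n
\]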

where the weights hi1, hi2, …, hii, …, hin depend only on the predictor values.

For example:
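Written out for the first observation (a generic instance of the formula above):

\[
\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n
\]

so each fitted value is a weighted combination of all n observed responses, with weights determined entirely by the x values.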



The leverages hii

Because the predicted response ŷi can be written as this linear combination of the observed responses, the leverage hii quantifies the influence that the observed response yi has on its own predicted value ŷi.



Properties of the leverages hii

  • The leverage hii is:

    • a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.

    • a number between 0 and 1, inclusive.

  • The sum of the hii equals p, the number of parameters.
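For simple linear regression with one predictor, these properties follow from the explicit form of the leverage:

\[
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2}
\]

and, because the leverages sum to p, their mean is p/n.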



Any high leverages hii?


Outliers and influential data points

HI1

0.176297 0.157454 0.127014 0.119313 0.086145

0.077744 0.065028 0.061276 0.050974 0.049628 0.048147 0.049313 0.051829 0.055760 0.069311 0.072580 0.109616 0.127489 0.140453 0.141136 0.163492

Sum of HI1 = 2.0000



Any high leverages hii?


Outliers and influential data points

HI1

0.153481 0.139367 0.116292 0.110382 0.084374

0.077557 0.066879 0.063589 0.050033 0.052121

0.047632 0.048156 0.049557 0.055893 0.057574

0.078121 0.088549 0.096634 0.096227 0.110048

0.357535

Sum of HI1 = 2.0000



Identifying data points whose x values are extreme .... and therefore potentially influential



Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value (that is, greater than 3(p/n)), or greater than 0.99, whichever cutoff is smaller.
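For the example data used here, n = 21 and p = 2 (intercept and slope), so the flagging cutoff works out to:

\[
3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) \approx 0.286
\]

Any leverage above 0.286 (the 0.99 cap does not bind here) is flagged with an X in the Minitab output that follows.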


Outliers and influential data points

x y HI1

14.00 68.00 0.357535

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 14.0 68.00 71.449 1.620 -3.449 -1.59 X

X denotes an observation whose X value gives it large influence.


Outliers and influential data points

x y HI2

13.00 15.00 0.311532

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 13.0 15.00 51.66 5.83 -36.66 -4.23RX

R denotes an observation with a large standardized residual.

X denotes an observation whose X value gives it large influence.



Important distinction!

  • The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.

  • The leverage depends only on the predictor values.

  • Whether the data point is influential or not depends on the observed value yi.



Identifying outliers (unusual y values)



Identifying outliers

  • Residuals

  • Standardized residuals

    • also called internally studentized residuals



Residuals

Ordinary residuals defined for each observation, i = 1, …, n:
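In symbols, the ith ordinary residual is the observed response minus the fitted value:

\[
e_i = y_i - \hat{y}_i
\]

For the first row below, e1 = 2 − 2.2 = −0.2, which matches the RESI1 column.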

x y FITS1 RESI1

1 2 2.2 -0.2

2 5 4.4 0.6

3 6 6.6 -0.6

4 9 8.8 0.2



Standardized residuals

Standardized residuals defined for each observation, i = 1, …, n:
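In symbols, using the mean squared error MSE and the leverage hii:

\[
r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}
\]

For the first row below, r1 = −0.2 / √(0.4 × (1 − 0.7)) ≈ −0.577, matching the SRES1 column.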

MSE1 0.400000

x y FITS1 RESI1 HI1 SRES1

1 2 2.2 -0.2 0.7 -0.57735

2 5 4.4 0.6 0.3 1.13389

3 6 6.6 -0.6 0.3 -1.13389

4 9 8.8 0.2 0.7 0.57735



Standardized residuals

  • Standardized residuals quantify how large the residuals are in standard deviation units.

    • An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.

    • Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).



An outlier?


Outliers and influential data points

S = 4.711

x y FITS1 HI1 s(e) RESI1 SRES1

0.10000 -0.0716 3.4614 0.176297 4.27561 -3.5330 -0.82635

0.45401 4.1673 5.2446 0.157454 4.32424 -1.0774 -0.24916

1.09765 6.5703 8.4869 0.127014 4.40166 -1.9166 -0.43544

1.27936 13.8150 9.4022 0.119313 4.42103 4.4128 0.99818

2.20611 11.4501 14.0706 0.086145 4.50352 -2.6205 -0.58191

...

8.70156 46.5475 46.7904 0.140453 4.36765 -0.2429 -0.05561

9.16463 45.7762 49.1230 0.163492 4.30872 -3.3468 -0.77679

4.00000 40.0000 23.1070 0.050974 4.58936 16.8930 3.68110

Unusual Observations

Obs x y Fit SE Fit Residual St Resid

21 4.00 40.00 23.11 1.06 16.89 3.68R

R denotes an observation with a large standardized residual.



Why should we care? (Regression of y on x with outlier)

The regression equation is y = 2.95763 + 5.03734 x

S = 4.71075 R-Sq = 91.0 % R-Sq(adj) = 90.5 %

Analysis of Variance

Source DF SS MS F P

Regression 1 4265.82 4265.82 192.230 0.000

Error 19 421.63 22.19

Total 20 4687.46



Why should we care? (Regression of y on x without outlier)

The regression equation is y = 1.73217 + 5.11687 x

S = 2.5919 R-Sq = 97.3 % R-Sq(adj) = 97.2 %

Analysis of Variance

Source DF SS MS F P

Regression 1 4386.07 4386.07 652.841 0.000

Error 18 120.93 6.72

Total 19 4507.00



Identifying influential data points



Identifying influential data points

  • Deleted residuals

  • Deleted t residuals

    • also called studentized deleted residuals

    • also called externally studentized residuals

  • Difference in fits, DFITS

  • Cook’s distance measure



Basic idea of these four measures

  • Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.

  • Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis (see the code sketch below).
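The slides carry out these computations in Minitab. As a rough, non-authoritative sketch of the same leave-one-out diagnostics in Python (statsmodels is assumed as the tooling, and the data values are made up for illustration):

```python
# Sketch only: the original slides use Minitab; this shows the same
# leave-one-out diagnostics with statsmodels on made-up data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])   # hypothetical predictor values
y = np.array([2.1, 3.8, 5.2, 8.9, 10.1, 2.1])   # hypothetical responses

X = sm.add_constant(x)              # design matrix with an intercept column
results = sm.OLS(y, X).fit()        # ordinary least squares fit of y on x

infl = OLSInfluence(results)        # per-observation influence diagnostics
print(infl.hat_matrix_diag)              # leverages h_ii
print(infl.resid_studentized_internal)   # standardized (internally studentized) residuals
print(infl.resid_studentized_external)   # deleted t (externally studentized) residuals
print(infl.dffits[0])                    # DFITS values (property returns (values, threshold))
print(infl.cooks_distance[0])            # Cook's distance D_i for each observation
```

Each of these quantities corresponds to one of the measures discussed on the following slides; infl.summary_frame() collects them all in a single table.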



Deleted residuals

yi = the observed response for the ith observation

ŷ(i) = the predicted response for the ith observation, based on the model estimated with the ith observation deleted

Deleted residual
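In symbols, the ith deleted residual is:

\[
d_i = y_i - \hat{y}_{(i)}
\]

It can also be computed without refitting the model, since di = ei / (1 − hii).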



Deleted t residuals

A deleted t residual is just a standardized deleted residual:
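In standard notation:

\[
t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}
\]

where MSE(i) is the mean squared error from the fit with the ith observation deleted.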

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.


Outliers and influential data points

x y RESI1 TRES1

1 2.1 -1.59 -1.7431

2 3.8 0.24 0.1217

3 5.2 1.77 1.6361

10 2.1 -0.42 -19.7990


Outliers and influential data points

Do any of the deleted t residuals stick out like a sore thumb?


Outliers and influential data points

Row x y RESI1 SRES1 TRES1

1 0.10000 -0.0716 -3.5330 -0.82635 -0.81916

2 0.45401 4.1673 -1.0774 -0.24916 -0.24291

3 1.09765 6.5703 -1.9166 -0.43544 -0.42596

...

19 8.70156 46.5475 -0.2429 -0.05561 -0.05413

20 9.16463 45.7762 -3.3468 -0.77679 -0.76837

21 4.00000 40.0000 16.8930 3.68110 6.69012


Outliers and influential data points

Do any of the deleted t residuals stick out like a sore thumb?



DFITS

The difference in fits, DFITSi, is the number of standard deviations that the fitted value ŷi changes when the ith case is omitted.
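In standard notation:

\[
DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}}
\]

where ŷ(i) and MSE(i) come from the fit with the ith observation deleted.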



Using DFITS

An observation is deemed influential if the absolute value of its DFITS value is greater than a suggested cutoff, or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
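The numeric cutoff from the original slide is not reproduced here; one commonly used rule of thumb (an assumption, not necessarily the slide's exact cutoff) is:

\[
|DFITS_i| > 2\sqrt{\frac{p+1}{n-p-1}}
\]

For the n = 21, p = 2 example used throughout, this works out to roughly 2√(3/18) ≈ 0.82.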


Outliers and influential data points

x y DFIT1

14.00 68.00 -1.23841


Outliers and influential data points

Row x y DFIT1

1 0.1000 -0.0716 -0.52503

2 0.4540 4.1673 -0.08388

3 1.0977 6.5703 -0.18232

4 1.2794 13.8150 0.75898

5 2.2061 11.4501 -0.21823

6 2.5006 12.9554 -0.20155

7 3.0403 20.1575 0.27774

8 3.2358 17.5633 -0.08230

9 4.4531 26.0317 0.13865

10 4.1699 22.7573 -0.02221

11 5.2847 26.3030 -0.18487

12 5.5924 30.6885 0.05523

13 5.9209 33.9402 0.19741

14 6.6607 30.9228 -0.42449

15 6.7995 34.1100 -0.17249

16 7.9794 44.4536 0.29918

17 8.4154 46.5022 0.30960

18 8.7161 50.0568 0.63049

19 8.7016 46.5475 0.14948

20 9.1646 45.7762 -0.25094

21 14.0000 68.0000 -1.23841


Outliers and influential data points

x y DFIT2

13.00 15.00 -11.4670


Outliers and influential data points

Row x y DFIT2

1 0.1000 -0.0716 -0.4028

2 0.4540 4.1673 -0.2438

3 1.0977 6.5703 -0.2058

4 1.2794 13.8150 0.0376

5 2.2061 11.4501 -0.1314

6 2.5006 12.9554 -0.1096

7 3.0403 20.1575 0.0405

8 3.2358 17.5633 -0.0424

9 4.4531 26.0317 0.0602

10 4.1699 22.7573 0.0092

11 5.2847 26.3030 0.0054

12 5.5924 30.6885 0.0782

13 5.9209 33.9402 0.1278

14 6.6607 30.9228 0.0072

15 6.7995 34.1100 0.0731

16 7.9794 44.4536 0.2805

17 8.4154 46.5022 0.3236

18 8.7161 50.0568 0.4361

19 8.7016 46.5475 0.3089

20 9.1646 45.7762 0.2492

21 13.0000 15.0000 -11.4670


Outliers and influential data points

x y DFIT3

4.00 40.00 1.5505


Outliers and influential data points

Row x y DFIT3

1 0.10000 -0.0716 -0.37897

2 0.45401 4.1673 -0.10501

3 1.09765 6.5703 -0.16248

4 1.27936 13.8150 0.36737

5 2.20611 11.4501 -0.17547

6 2.50064 12.9554 -0.16377

7 3.04030 20.1575 0.10670

8 3.23583 17.5633 -0.09265

9 4.45308 26.0317 0.03061

10 4.16990 22.7573 -0.05850

11 5.28474 26.3030 -0.16025

12 5.59238 30.6885 -0.02183

13 5.92091 33.9402 0.05988

14 6.66066 30.9228 -0.34036

15 6.79953 34.1100 -0.18835

16 7.97943 44.4536 0.10017

17 8.41536 46.5022 0.09771

18 8.71607 50.0568 0.29275

19 8.70156 46.5475 -0.02188

20 9.16463 45.7762 -0.33969

21 4.00000 40.0000 1.55050



Cook’s distance

  • Di depends on both the residual ei and the leverage hii.

  • Di summarizes how much each of the estimated coefficients changes when the ith observation is deleted.

  • A large Di indicates yi has a strong influence on the estimated coefficients.
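The standard formula, which combines the residual and the leverage as described in the bullets above, is:

\[
D_i = \frac{e_i^2}{p \times MSE}\left[\frac{h_{ii}}{(1 - h_{ii})^2}\right]
\]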



Effect on estimates of removing each data point one at a time?



Effect on estimates of removing each data point one at a time?



Effect on estimates of removing each data point one at a time?



Effect on estimates of removing each data point one at a time?



Using Cook’s distance

  • If Di is greater than 1, then the ith data point is worthy of further investigation.

  • If Di is greater than 4, then the ith data point is most certainly influential.

  • Or, if Di sticks out like a sore thumb from the other Di values, it is most certainly influential.


Outliers and influential data points

x y COOK1

14.00 68.00 0.701960


Outliers and influential data points

Row x y COOK1

1 0.1000 -0.0716 0.134156

2 0.4540 4.1673 0.003705

3 1.0977 6.5703 0.017302

4 1.2794 13.8150 0.241688

5 2.2061 11.4501 0.024434

6 2.5006 12.9554 0.020879

7 3.0403 20.1575 0.038414

8 3.2358 17.5633 0.003555

9 4.4531 26.0317 0.009944

10 4.1699 22.7573 0.000260

11 5.2847 26.3030 0.017379

12 5.5924 30.6885 0.001605

13 5.9209 33.9402 0.019747

14 6.6607 30.9228 0.081345

15 6.7995 34.1100 0.015290

16 7.9794 44.4536 0.044621

17 8.4154 46.5022 0.047961

18 8.7161 50.0568 0.173897

19 8.7016 46.5475 0.011657

20 9.1646 45.7762 0.032320

21 14.0000 68.0000 0.701960


Outliers and influential data points

x y COOK2

13.00 15.00 4.04801


Outliers and influential data points

Row x y COOK2

1 0.1000 -0.0716 0.08172

2 0.4540 4.1673 0.03076

3 1.0977 6.5703 0.02198

4 1.2794 13.8150 0.00075

5 2.2061 11.4501 0.00901

6 2.5006 12.9554 0.00629

7 3.0403 20.1575 0.00086

8 3.2358 17.5633 0.00095

9 4.4531 26.0317 0.00191

10 4.1699 22.7573 0.00004

11 5.2847 26.3030 0.00002

12 5.5924 30.6885 0.00320

13 5.9209 33.9402 0.00848

14 6.6607 30.9228 0.00003

15 6.7995 34.1100 0.00280

16 7.9794 44.4536 0.03958

17 8.4154 46.5022 0.05229

18 8.7161 50.0568 0.09180

19 8.7016 46.5475 0.04809

20 9.1646 45.7762 0.03194

21 13.0000 15.0000 4.04801


Outliers and influential data points

x y COOK3

4.00 40.00 0.36391


Outliers and influential data points

Row x y COOK3

1 0.10000 -0.0716 0.073075

2 0.45401 4.1673 0.005801

3 1.09765 6.5703 0.013793

4 1.27936 13.8150 0.067493

5 2.20611 11.4501 0.015960

6 2.50064 12.9554 0.013909

7 3.04030 20.1575 0.005955

8 3.23583 17.5633 0.004498

9 4.45308 26.0317 0.000494

10 4.16990 22.7573 0.001799

11 5.28474 26.3030 0.013191

12 5.59238 30.6885 0.000251

13 5.92091 33.9402 0.001886

14 6.66066 30.9228 0.056276

15 6.79953 34.1100 0.018263

16 7.97943 44.4536 0.005272

17 8.41536 46.5022 0.005020

18 8.71607 50.0568 0.043959

19 8.70156 46.5475 0.000253

20 9.16463 45.7762 0.058966

21 4.00000 40.0000 0.363914



A strategy for dealing with problematic data points

  • Don’t forget that the above methods are just statistical tools. It’s okay to use common sense and knowledge about the situation.

  • First, check for obvious data errors.

    • If a data entry error, simply correct it.

    • If not representative of the population, delete it.

    • If a procedural error invalidates the measurement, delete it.



A comment about deleting data points

  • Do not delete data just because they do not fit your preconceived regression model.

  • You must have a good, objective reason for deleting data points.

  • If you delete any data after you’ve collected it, justify and describe it in your reports.

  • If you are not sure what to do about a data point, analyze the data twice, once with and once without the point, and report both sets of results.



A strategy for dealing with problematic data points (cont’d)

  • Then, consider model misspecification.

    • Any important variables missing?

    • Any nonlinearity that needs to be modeled?

    • Any missing interaction terms?

  • If nonlinearity is an issue, one possibility is to reduce the scope of the model to a narrower range of the predictors and fit a linear model over that range.

