
# Linear Regression - PowerPoint PPT Presentation

Ecole Nationale Vétérinaire de Toulouse. Linear Regression. Didier Concordet, ECVPT Workshop, April 2011. Can be downloaded at http://www.biostat.envt.fr/.



Presentation Transcript

Ecole Nationale Vétérinaire de Toulouse

### Linear Regression

Didier Concordet

ECVPT Workshop April 2011

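[Figure, "About the straight line": three plots of Y against x showing the intercept a and the slope b, for b > 0, b = 0 and b < 0, with one case where a = 0.]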

Y= a + b x

a = intercept

b = slope

• How to obtain the best straight line?

• Is this straight line the best curve to use?

• How to use this straight line?

Proceed in three main steps:

• write a (statistical) model

• estimate the parameters

• inspect the data graphically

A statistical model

Mean model: functional relationship

Variance model: assumptions on the residuals

Mean model

$Y_i = a + b x_i + \varepsilon_i$, where $\varepsilon_i$ = residual (error term)

• the $x_i$'s are not random variables: they are known with a high precision

• the $\varepsilon_i$'s have a constant variance (homoscedasticity)

• the $\varepsilon_i$'s are independent

• the $\varepsilon_i$'s are normally distributed (normality)

[Figure: two panels, one labelled "homoscedasticity" and one labelled "heteroscedasticity".]

A criterion is needed to estimate parameters

A statistical model

A criterion

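Intuitive criterion: $\sum_i \left(Y_i - (a + b x_i)\right)$ minimum. Problem: positive and negative deviations compensate each other.

Reasonable criterion: $\sum_i \left(Y_i - (a + b x_i)\right)^2$ minimum.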

Under a linear model with homoscedasticity and normality, use the least squares criterion (L.S.)
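The difference between the intuitive criterion and the least squares criterion can be checked numerically. The sketch below is only an illustration: the five data points are made up (they are not the HPLC data of the workshop), and the helper names `least_squares`, `signed_sum` and `squared_sum` are hypothetical.

```python
# Illustrative data only (not taken from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(x, y):
    """Return (a_hat, b_hat) minimising sum((y_i - a - b * x_i) ** 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def signed_sum(a, b, x, y):
    """Intuitive criterion: sum of signed deviations from the line."""
    return sum(yi - (a + b * xi) for xi, yi in zip(x, y))


def squared_sum(a, b, x, y):
    """Least squares criterion: sum of squared deviations from the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a_hat, b_hat = least_squares(x, y)
y_bar = sum(y) / len(y)

# A horizontal line through the mean of y is a poor fit, yet its signed
# deviations cancel out ("compensation"), so the intuitive criterion
# cannot tell it apart from the least squares line.
print("signed sum, flat line:  ", signed_sum(y_bar, 0.0, x, y))
print("signed sum, L.S. line:  ", signed_sum(a_hat, b_hat, x, y))
print("squared sum, flat line: ", squared_sum(y_bar, 0.0, x, y))
print("squared sum, L.S. line: ", squared_sum(a_hat, b_hat, x, y))
```

Both lines give a signed sum of about zero, but only the squared sum separates the poor fit from the good one.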

$\hat a$ and $\hat b$ change with samples: $\hat a$ and $\hat b$ are random variables
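That the estimated intercept and slope are random variables can be seen by simulation. The sketch below is an assumption-laden illustration: the "true" values `A_TRUE`, `B_TRUE`, `SIGMA` and the design (six x levels, three replicates, N = 18) are hypothetical and are not the workshop data.

```python
import random


def least_squares(x, y):
    """Return (a_hat, b_hat) of the least squares straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    return y_bar - b_hat * x_bar, b_hat


random.seed(0)  # fixed seed so that the run is reproducible

# Hypothetical true line and residual standard deviation.
A_TRUE, B_TRUE, SIGMA = 20.0, 3.0, 8.0
# Hypothetical design: six x levels, each replicated three times.
x = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0] * 3

estimates = []
for _ in range(1000):
    # Each new sample has new normally distributed residuals ...
    y = [A_TRUE + B_TRUE * xi + random.gauss(0.0, SIGMA) for xi in x]
    # ... and therefore gives different estimates of a and b.
    estimates.append(least_squares(x, y))

mean_a = sum(a for a, _ in estimates) / len(estimates)
mean_b = sum(b for _, b in estimates) / len(estimates)

print("first three samples:", estimates[:3])
print("average estimates over 1000 samples:", mean_a, mean_b)
```

The estimates differ from one sample to the next, while their averages stay close to the true values used in the simulation.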

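True mean straight line: $E(Y) = a + b x$

Estimated straight line: $\hat Y = \hat a + \hat b x$

Mean predicted value for the ith observation: $\hat Y_i = \hat a + \hat b x_i$

ith residual: $\hat\varepsilon_i = Y_i - \hat Y_i$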

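Dep Var: HPLC, N: 18

| Effect | Coefficient | Std Error | t | P (2 Tail) |
|---|---|---|---|---|
| CONSTANT | 20.046 | 3.682 | 5.444 | 0.000 |
| CONCENT | 2.916 | 0.069 | 42.030 | 0.000 |

Intercept: $\hat a$ = 20.046

Slope: $\hat b$ = 2.916

Estimated straight line: HPLC = 20.046 + 2.916 × CONCENT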

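By construction, the residuals sum to zero, $\sum_i \hat\varepsilon_i = 0$, but $\sum_i \hat\varepsilon_i^2$ is generally not zero.

The residual variance is defined by $\hat\sigma^2 = \frac{1}{n-2} \sum_i \hat\varepsilon_i^2$

Its square root $\hat\sigma$ is the standard error of estimate.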
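A minimal sketch of how the residual variance, the standard error of estimate and the columns of a regression output relate to each other. It assumes the usual n − 2 denominator for a straight line and uses made-up data (not the HPLC data); all variable names are illustrative.

```python
import math

# Illustrative data only (not taken from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = sxy / sxx              # slope
a_hat = y_bar - b_hat * x_bar  # intercept

residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)

# Residual variance with n - 2 degrees of freedom (two estimated parameters).
residual_variance = sse / (n - 2)
std_error_of_estimate = math.sqrt(residual_variance)

# Standard errors of the coefficients and their t statistics.
se_b = std_error_of_estimate / math.sqrt(sxx)
se_a = std_error_of_estimate * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
t_a = a_hat / se_a
t_b = b_hat / se_b

print("sum of residuals:", sum(residuals))  # zero up to rounding
print("standard error of estimate:", std_error_of_estimate)
print("intercept:", a_hat, "std error:", se_a, "t:", t_a)
print("slope:", b_hat, "std error:", se_b, "t:", t_b)
```

The t column of such an output is the coefficient divided by its standard error; for the HPLC output, 20.046 / 3.682 ≈ 5.444.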

Dep Var: HPLC, N: 18

Multiple R: 0.996, Squared multiple R: 0.991

Standard error of estimate: 8.282

| Effect | Coefficient | Std Error | t | P (2 Tail) |
|---|---|---|---|---|
| CONSTANT | 20.046 | 3.682 | 5.444 | 0.000 |
| CONCENT | 2.916 | 0.069 | 42.030 | 0.000 |

• How to obtain the best straight line?

• Is this straight line the best curve to use?

• How to use this straight line?

• Tools to check the mean model:

• scatterplot of residuals vs fitted values

• test(s)

• Tools to check the variance model:

• scatterplot of residuals vs fitted values

• probability plot (Pplot)

Scatterplot of residuals vs fitted values

• Structure in the residuals: change the mean model

• No structure in the residuals: OK

Two cases

• No replication: try a polynomial model

• Replications: test of lack of fit

Try another mean model and test the improvement

Example: $Y_i = a + b x_i + c x_i^2 + \varepsilon_i$

If the test on c is significant (c ≠ 0), then keep this model

Dep Var: HPLC, N: 18

Multiple R: 0.996, Squared multiple R: 0.991

Standard error of estimate: 8.539

| Effect | Coefficient | Std Error | t | P (2 Tail) |
|---|---|---|---|---|
| CONSTANT | 21.284 | 6.649 | 3.201 | 0.006 |
| CONCENT | 2.842 | 0.335 | 8.486 | 0.000 |
| CONCENT*CONCENT | 0.001 | 0.003 | 0.227 | 0.824 |

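With replications

Perform a test of lack of fit

Principle: compare the residual sum of squares of the straight line, $SS_{res}$, to the pure error sum of squares, $SS_{pe}$, computed from the replicates. If

$\dfrac{(SS_{res} - SS_{pe}) / (df_{res} - df_{pe})}{SS_{pe} / df_{pe}} > F_{1-\alpha}(df_{res} - df_{pe},\ df_{pe})$

then change the model

Three steps

1) Linear regression: gives $SS_{res}$ and $df_{res}$

2) One way ANOVA: gives the pure error $SS_{pe}$ and $df_{pe}$

3) If the lack-of-fit ratio exceeds its critical value, then change the model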

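Three steps: example

1) Linear regression

2) One way ANOVA

Dep Var: HPLC, N: 18

Analysis of Variance

| Source | Sum-of-Squares | df | Mean-Square | F-ratio | P |
|---|---|---|---|---|---|
| CONCENT | 121251.776 | 5 | 24250.355 | 289.434 | 0.000 |
| Error | 1005.427 | 12 | 83.786 | | |

3) The lack-of-fit ratio does not exceed its critical value: we keep the straight line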
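The three steps can be retraced from the numbers printed in the two outputs. The sketch below makes two assumptions that are not stated in the slides: the residual sum of squares of the straight line is recovered from the rounded standard error of estimate (8.282 with 18 − 2 degrees of freedom), and the test is run at the 5% level with the tabulated critical value of F(4, 12).

```python
# Step 1: linear regression output (values reported in the presentation).
n = 18
std_error_of_estimate = 8.282
df_regression = n - 2
# Residual sum of squares recovered from the rounded standard error,
# so it is only approximate.
ss_regression = std_error_of_estimate ** 2 * df_regression

# Step 2: one way ANOVA, "Error" row = pure error.
ss_pure_error = 1005.427
df_pure_error = 12

# Step 3: lack-of-fit F ratio.
ss_lack_of_fit = ss_regression - ss_pure_error
df_lack_of_fit = df_regression - df_pure_error
f_ratio = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)

# Assumption: alpha = 0.05; 3.26 is the tabulated critical value of F(4, 12).
F_CRITICAL_5_PERCENT = 3.26
change_model = f_ratio > F_CRITICAL_5_PERCENT

print("lack-of-fit SS:", round(ss_lack_of_fit, 3), "df:", df_lack_of_fit)
print("F ratio:", round(f_ratio, 3))
print("change the model?", change_model)
```

With these numbers the ratio is well below the critical value, which agrees with the decision to keep the straight line.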

Scatterplot of residuals vs fitted values

• No structure in the residuals but heteroscedasticity: change the model (criterion)

• Homoscedasticity: OK

Scatterplot of residuals vs fitted values: model the dispersion.

The standard deviation of the residuals increases with the fitted value $\hat Y$: it increases with x.

Estimate the slope and the intercept again, but with weights $w_i$ inversely proportional to the variance of the residuals, by minimising $\sum_i w_i \left(Y_i - (a + b x_i)\right)^2$, and check that the weighted residuals $\sqrt{w_i}\,\hat\varepsilon_i$ are homoscedastic.
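A minimal sketch of such a weighted fit. It assumes, purely for illustration, that the standard deviation of the residuals is proportional to x, so that the weights are 1 / x²; the data are made up and the function name `weighted_least_squares` is hypothetical.

```python
import math


def weighted_least_squares(x, y, w):
    """Return (a_hat, b_hat) minimising sum(w_i * (y_i - a - b * x_i) ** 2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Illustrative data whose scatter grows with x (not the HPLC data).
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [5.2, 7.7, 14.5, 24.0, 54.0, 92.0]

# Assumption: SD of the residuals proportional to x, hence variance
# proportional to x ** 2 and weights inversely proportional to it.
w = [1.0 / xi ** 2 for xi in x]

a_hat, b_hat = weighted_least_squares(x, y, w)

# Weighted residuals: these, not the raw residuals, should look
# homoscedastic when plotted against the fitted values.
weighted_residuals = [
    math.sqrt(wi) * (yi - (a_hat + b_hat * xi)) for wi, xi, yi in zip(w, x, y)
]

print("intercept:", a_hat, "slope:", b_hat)
print("weighted residuals:", weighted_residuals)
```

With equal weights the function reduces to the ordinary least squares fit.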

[Figure: two probability plots (Pplot) of the residuals; axis label: expected value for normal distribution.]

• No curvature: normality

• Curvature: non-normality

Is non-normality so important?

Try to model the distribution of the residuals. In general, this is difficult with few observations.

If enough observations are available, non-normality does not affect the result much.
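The coordinates of a probability plot can be computed with the standard library. This sketch assumes Blom's plotting positions (i − 0.375) / (n + 0.25), which is one common convention and not necessarily the one used by the software in the slides; the residuals are made up.

```python
from statistics import NormalDist

# Illustrative residuals only (not taken from the presentation).
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]


def normal_scores(n):
    """Expected values for a standard normal sample of size n.

    Assumption: Blom's plotting positions (i - 0.375) / (n + 0.25).
    """
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        standard_normal.inv_cdf((i - 0.375) / (n + 0.25))
        for i in range(1, n + 1)
    ]


ordered = sorted(residuals)
scores = normal_scores(len(ordered))

# Plot `ordered` against `scores`: points close to a straight line
# (no curvature) are compatible with normality.
for expected, observed in zip(scores, ordered):
    print(f"expected {expected:+.3f}   residual {observed:+.3f}")
```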

R² = squared correlation coefficient

= % of dispersion of the $Y_i$'s explained by the straight line (the model)

0 ≤ R² ≤ 1

If R² = 1, all the residuals $\hat\varepsilon_i$ are 0: the straight line explains all the variation of the $Y_i$'s

If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s

R² and R (correlation coefficient) are not designed to measure linearity!

Example:

Multiple R: 0.990

Squared multiple R: 0.980
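A small sketch of why a high R² does not prove linearity. The data are deliberately curved (Y = x², a hypothetical example, not the data behind the slide): the straight line still reaches a high R², while the residuals show an obvious structure.

```python
# Deliberately non-linear, noise-free data: y = x ** 2 (illustrative only).
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# For a straight line, R squared is the squared correlation coefficient.
r_squared = sxy ** 2 / (sxx * syy)

residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

print("R squared:", round(r_squared, 3))
# Positive at both ends, negative in the middle: structure in the
# residuals, so the mean model should be changed despite the high R squared.
print("residuals:", [round(r, 1) for r in residuals])
```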

• How to obtain the best straight line?

• Is this straight line the best curve to use?

• How to use this straight line?

• Direct use : for a given x

• predict the mean Y

• construct a confidence interval of the mean Y

• construct a prediction interval of Y

• Reverse use calibration (approximate results): for a given Y

• predict the mean x

• construct a confidence interval of the mean x

• construct a prediction interval of X

Example:

Confidence interval of the mean Y: there is a probability 1 − α that a + b x belongs to this interval.

Prediction interval of Y: 100(1 − α)% of the measurements carried out for this x belong to this interval.
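A sketch of both intervals for a given x, using the standard formulas for a straight line: half-width t·s·√(1/n + (x0 − x̄)²/Sxx) for the mean Y, and t·s·√(1 + 1/n + (x0 − x̄)²/Sxx) for a new measurement. The data are made up, the helper names are hypothetical, and the Student quantile is taken from a table (an assumption: α = 0.05, 3 degrees of freedom).

```python
import math

# Illustrative data only (not taken from the presentation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Tabulated 97.5% Student quantile for n - 2 = 3 degrees of freedom.
T_QUANTILE = 3.182


def fit(x, y):
    """Return (a_hat, b_hat, s, x_bar, sxx) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    sse = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of estimate
    return a_hat, b_hat, s, x_bar, sxx


def intervals(x0, x, y, t_quantile):
    """Return (predicted mean, confidence interval, prediction interval)."""
    n = len(x)
    a_hat, b_hat, s, x_bar, sxx = fit(x, y)
    y0_hat = a_hat + b_hat * x0
    spread = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    half_ci = t_quantile * s * math.sqrt(spread)        # for the mean Y
    half_pi = t_quantile * s * math.sqrt(1.0 + spread)  # for one new Y
    return (
        y0_hat,
        (y0_hat - half_ci, y0_hat + half_ci),
        (y0_hat - half_pi, y0_hat + half_pi),
    )


mean_y, ci, pi = intervals(3.0, x, y, T_QUANTILE)
print("predicted mean Y:", mean_y)
print("confidence interval of the mean Y:", ci)
print("prediction interval of Y:", pi)
```

The prediction interval is always wider than the confidence interval, and both are narrowest at the mean of the x's.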

Reverse use: for a given Y = y0, predict the mean X

Example: $\hat X = (y_0 - \hat a) / \hat b$
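For the reverse use, the point prediction is obtained by inverting the estimated straight line. The sketch below uses the coefficients reported in the HPLC output (20.046 and 2.916); the response value `y0 = 200.0` and the function name `predict_x` are hypothetical.

```python
# Coefficients reported in the regression output of the presentation.
A_HAT = 20.046  # intercept (CONSTANT)
B_HAT = 2.916   # slope (CONCENT)


def predict_x(y0, a_hat=A_HAT, b_hat=B_HAT):
    """Invert the fitted line y = a_hat + b_hat * x for a given response y0."""
    if b_hat == 0:
        raise ValueError("the slope is zero: the line cannot be inverted")
    return (y0 - a_hat) / b_hat


y0 = 200.0  # hypothetical HPLC response
x0_hat = predict_x(y0)
print("predicted mean CONCENT:", round(x0_hat, 3))
```

This only gives the point estimate; the confidence interval [L, U] requires the raw data.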

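For a given Y = y0, a confidence interval of the mean X

[Figure: plot with Y0 marked on the Y axis and the limits L and U marked on the X axis.]

There is a probability 1 − α that the mean X belongs to [L, U]

L and U are the values of x at which the confidence limits of the mean Y are equal to y0.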

Common misconceptions:

• One can fit the straight line by inverting x and Y

• If the correlation coefficient is high, the straight line is the best model

• Normality of the $x_i$'s is required to perform a regression

• Normality of the $\varepsilon_i$'s is essential to perform a good regression