Ecole Nationale Vétérinaire de Toulouse
Linear Regression
Didier Concordet
ECVPT Workshop April 2011
Can be downloaded at http://www.biostat.envt.fr/
An example
About the straight line
[Figure: straight lines of Y against x with intercept a and slope b; panels labelled b > 0, b = 0, b < 0 and a = 0]
Proceed in three main steps
A statistical model
Mean model: functional relationship
Variance model: assumptions on the residuals
A criterion is needed to estimate parameters
A statistical model
A criterion
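Intuitive criterion: $\sum_i \left(Y_i - (a + b x_i)\right)$ minimum (compensation between positive and negative deviations)
Reasonable criterion: $\sum_i \left(Y_i - (a + b x_i)\right)^2$ minimum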
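Linear model: $Y_i = a + b x_i + \varepsilon_i$
Homoscedasticity: $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for every i
Normality: $\varepsilon_i \sim N(0, \sigma^2)$
Least squares criterion (L.S.)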
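True mean straight line: $E(Y) = a + b x$
Estimated straight line: $\hat{Y} = \hat{a} + \hat{b} x$
Mean predicted value for the ith observation: $\hat{Y}_i = \hat{a} + \hat{b} x_i$
ith residual: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$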
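The least squares estimates, fitted values and residuals can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fit_line` and the five data points are hypothetical and are not the HPLC data of the slides.

```python
from statistics import mean


def fit_line(x, y):
    """Return (a_hat, b_hat) minimizing sum((y_i - a - b * x_i) ** 2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx                 # estimated slope
    a_hat = y_bar - b_hat * x_bar     # estimated intercept
    return a_hat, b_hat


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = fit_line(x, y)
fitted = [a_hat + b_hat * xi for xi in x]            # mean predicted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # ith residual = Y_i - fitted_i
print(a_hat, b_hat)
```

With an intercept in the model, the residuals sum to zero up to rounding error.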
Dep Var: HPLC N: 18
Effect Coefficient Std Error t P(2 Tail)
CONSTANT 20.046 3.682 5.444 0.000
CONCENT 2.916 0.069 42.030 0.000
Intercept: 20.046 (CONSTANT)
Slope: 2.916 (CONCENT)
Estimated straight line: HPLC = 20.046 + 2.916 × CONCENT
$\sum_i \hat{\varepsilon}_i = 0$ by construction
but $\sum_i \hat{\varepsilon}_i^2 \neq 0$ in general
The residual variance is defined by $s^2 = \frac{1}{n-2} \sum_i \hat{\varepsilon}_i^2$
Its square root $s$ is the standard error of estimate
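As a small illustration of this definition, the sketch below computes the standard error of estimate from a list of residuals. The helper name `standard_error_of_estimate` and the residual values are hypothetical; the divisor n - 2 assumes a straight line with two estimated parameters.

```python
from math import sqrt


def standard_error_of_estimate(residuals, n_params=2):
    """Square root of the residual variance, sum(e_i ** 2) / (n - n_params)."""
    n = len(residuals)
    residual_variance = sum(e ** 2 for e in residuals) / (n - n_params)
    return sqrt(residual_variance)


# Hypothetical residuals from a straight-line fit on five points.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
s = standard_error_of_estimate(residuals)
print(round(s, 3))
```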
Dep Var: HPLC N: 18
Multiple R: 0.996 Squared multiple R: 0.991
Adjusted squared multiple R: 0.991
Standard error of estimate : 8.282
Effect Coefficient Std Error t P(2 Tail)
CONSTANT 20.046 3.682 5.444 0.000
CONCENT 2.916 0.069 42.030 0.000
Scatterplot of residuals vs fitted values
Structure in the residuals: change the mean model
No structure in the residuals: OK
Two cases
No replication: try a polynomial model (quadratic first)
Replications: test of lack of fit
Try another mean model and test the improvement
Example: $Y_i = a + b x_i + c x_i^2 + \varepsilon_i$
If the test on c is significant (c ≠ 0) then keep this model
Dep Var: HPLC N: 18
Multiple R: 0.996 Squared multiple R: 0.991
Adjusted squared multiple R: 0.991
Standard error of estimate: 8.539
Effect Coefficient Std Error t P(2 Tail)
CONSTANT 21.284 6.649 3.201 0.006
CONCENT 2.842 0.335 8.486 0.000
CONCENT*CONCENT 0.001 0.003 0.227 0.824
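Perform a test of lack of fit
Pure error: the variability among replicates measured at the same x
Principle: compare the dispersion of the observations around the straight line to the pure error
if $F = \dfrac{(SS_{res} - SS_{pe}) / (df_{res} - df_{pe})}{SS_{pe} / df_{pe}} > f_{1-\alpha}(df_{res} - df_{pe},\ df_{pe})$
then change the model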
Three steps
1) Linear regression
2) One way ANOVA
3) if the lack-of-fit F ratio exceeds its critical value then change the model
Three steps
1) Linear regression
2) One way ANOVA
Dep Var: HPLC N: 18
Analysis of Variance
Source Sum of Squares df Mean Square F ratio P
CONCENT 121251.776 5 24250.355 289.434 0.000
Error 1005.427 12 83.786
3) the lack-of-fit F ratio is below its critical value: we keep the straight line
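The three steps can be retraced with the numbers printed in the two outputs. The sketch below assumes the usual lack-of-fit F ratio (lack-of-fit mean square over pure-error mean square) and a 5% level; the critical value 3.26 is the tabulated 95% quantile of F(4, 12), not a value given in the slides.

```python
# Step 1: linear regression output (N = 18, standard error of estimate 8.282).
n = 18
s_reg = 8.282
df_reg = n - 2                      # residual degrees of freedom of the line
ss_reg = s_reg ** 2 * df_reg        # residual sum of squares of the line

# Step 2: "Error" line of the one-way ANOVA = pure error.
ss_pe, df_pe = 1005.427, 12

# Step 3: lack-of-fit F ratio.
ss_lof = ss_reg - ss_pe             # part of the residual SS not due to pure error
df_lof = df_reg - df_pe
f_ratio = (ss_lof / df_lof) / (ss_pe / df_pe)

F_CRIT_5PCT = 3.26                  # assumption: alpha = 0.05, F(4, 12) table value
keep_straight_line = f_ratio <= F_CRIT_5PCT
print(round(f_ratio, 2), keep_straight_line)
```

The ratio comes out well below 1, so the conclusion does not depend on the exact level chosen.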
Scatterplot of residuals vs fitted values
No structure in the residuals but heteroscedasticity: change the model (criterion)
Homoscedasticity: OK
Scatterplot of residuals vs fitted values: model the dispersion.
The standard deviation of the residuals increases with the fitted value $\hat{Y}$: it increases with x
Estimate the slope and the intercept again, but with weights inversely proportional to the variance:
minimize $\sum_i w_i \left(Y_i - (a + b x_i)\right)^2$ with $w_i \propto 1 / \mathrm{Var}(\varepsilon_i)$
and check that the weighted residuals $\sqrt{w_i}\,\hat{\varepsilon}_i$ are homoscedastic
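A minimal sketch of such a weighted fit follows. It assumes a variance model in which the standard deviation is proportional to x, hence weights 1/x²; this variance model, the function name `fit_weighted_line` and the data are hypothetical choices for illustration only.

```python
from math import sqrt


def fit_weighted_line(x, y, w):
    """Return (a_hat, b_hat) minimizing sum(w_i * (y_i - a - b * x_i) ** 2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


# Hypothetical data whose scatter grows with x.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.1, 4.8, 9.4, 16.5, 34.9]

# Assumed variance model: SD proportional to x, so weights are 1 / x^2.
w = [1.0 / xi ** 2 for xi in x]

a_hat, b_hat = fit_weighted_line(x, y, w)
raw_residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
weighted_residuals = [sqrt(wi) * ei for wi, ei in zip(w, raw_residuals)]
print(a_hat, b_hat)
```

Plotting `weighted_residuals` against the fitted values is then the check for homoscedasticity.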
Normal probability plot of the residuals (x axis: expected value for normal distribution)
No curvature: normality
Curvature: non normality
Is it so important?
Try to model the distribution of the residuals
In general, it is difficult with few observations
If enough observations are available, non normality does not affect the result too much.
R² = squared correlation coefficient
= % of the dispersion of the $Y_i$'s explained by the straight line (the model)
0 ≤ R² ≤ 1
If R² = 1, all the residuals $\hat{\varepsilon}_i$ = 0: the straight line explains all the variation of the $Y_i$'s
If R² = 0, the slope is 0: the straight line does not explain any variation of the $Y_i$'s
R² and R (correlation coefficient) are not designed to measure linearity!
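To illustrate the last point, the sketch below computes R² for data that are deliberately curved. The data (y = x² on x = 1, ..., 10) and the helper name `r_squared` are hypothetical; they are not the example shown in the slides.

```python
def r_squared(x, y):
    """Squared correlation coefficient between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical, exactly quadratic data: clearly not a straight line.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

r2 = r_squared(x, y)
print(round(r2, 3))
```

R² is about 0.95 here although the relationship is a parabola, which is why a plot of the residuals is needed to judge linearity.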
Example :
Multiple R: 0.990
Squared multiple R: 0.980
Adjusted squared multiple R: 0.980
Example :
There is a probability 1 − α that a + bx belongs to this interval
100(1 − α)% of the measurements carried out for this x belong to this interval
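The two intervals described above can be sketched with the standard formulas for a straight line: the interval for the mean a + bx and the wider interval for a single new measurement at the same x. The data, the function name `line_intervals` and the level (α = 0.05, using the tabulated t quantile 3.182 for 3 degrees of freedom) are assumptions made for illustration.

```python
from math import sqrt


def line_intervals(x, y, x0, t_quantile):
    """Return (fitted value, interval for the mean, interval for a new measurement) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    # Standard error of estimate with n - 2 degrees of freedom.
    s = sqrt(sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y0 = a_hat + b_hat * x0
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    half_mean = t_quantile * s * sqrt(leverage)         # for the mean a + b * x0
    half_pred = t_quantile * s * sqrt(1.0 + leverage)   # for one new measurement
    return y0, (y0 - half_mean, y0 + half_mean), (y0 - half_pred, y0 + half_pred)


# Hypothetical data (n = 5, so 3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_975_DF3 = 3.182   # tabulated 97.5% quantile of Student's t with 3 df

y0, mean_interval, prediction_interval = line_intervals(x, y, 3.0, T_975_DF3)
print(y0, mean_interval, prediction_interval)
```

Both intervals are centred on the fitted value and are narrowest at the mean of the x values.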
Example :
There is a probability 1 − α that the mean X belongs to [ L , U ]
L and U are so that
One can fit the straight line by inverting x and Y
If the correlation coefficient is high, the straight line is the best model
Normality of the $x_i$'s is required to perform a regression
Normality of the $\varepsilon_i$'s is essential to perform a good regression