Mathematical description of the relationship between two variables using regression



### Mathematical description of the relationship between two variables using regression

Lecture 13

BIOL2608 Biometrics

Regression Analysis
• A simple mathematical expression to provide an estimate of one variable from another
• It is possible to predict the likely outcome of events given sufficient quantitative knowledge of the processes involved
Regression models

• Model 1
• “Controlled” parameter (independent variable) vs. measured parameter (dependent variable)
• The independent variable (on the x-axis) must be measured with a high degree of accuracy and is not subject to random variation
• Other influential factors must be kept constant
• The dependent variable (on the y-axis) may vary randomly, and its ‘error’ should follow a normal distribution
Normally distributed populations of y values

• The population of y values at each x is normally distributed, and
• The variances of the different populations of y values corresponding to different individual x values are similar
Regression models

• Model 2
• Both parameters are measured (hence x1 & x2, not x & y) and cannot be controlled
• Both are subject to random variation and are called random-effects factors
• Common in field studies where conditions are difficult to control
• Correlation, rather than regression, is required for bivariately normal distributions
• e.g. measurements of human arm and leg lengths
Example for model 1
• Study the rate of disappearance of a pesticide in a seawater sample
• Time (independent) vs. Concentration (dependent)
• Other factors such as pH, salinity must be kept constant
• Study the growth rate of fish at different fixed water temperatures
• Temp (independent) vs. Growth rate (dependent)
• Other factors such as diet, feeding frequency must be kept constant
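[Figure: a straight line on X-Y axes through a point (x, y), with intercept a and a slope triangle of height c and base d]

Model: $y = a + bx$

Slope: coefficient $b = c/d$

Intercept: coefficient $a$

[Figure: lines on X-Y axes with negative slope (b = -ve), positive slope (b = +ve) and zero slope (b = 0)]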

The concept of least squares

$\sum d_i^2$ measures the deviations of the points from the regression line, where $d_i$ is the vertical distance from point $i$ to the line.

The best-fit line is the one with the minimum sum of squared deviations ($\sum d_i^2$).

Calculation for a regression
• $y = a + bx$
• $b = \dfrac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n}$
• $a = \bar{y} - b\bar{x}$

$b = [514.8 - (130)(44.4)/13]/[1562 - (130)^2/13] = 70.8/262$

$b = 0.270$ cm/day

$a = 3.415 - (0.270)(10) = 0.715$ cm

The simple linear regression equation is

$\hat{Y} = 0.715 + 0.270X$
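The calculation above can be reproduced from the raw sums alone. The following is a minimal sketch, not part of the lecture; the function name `fit_line_from_sums` is hypothetical, and the inputs are the sums quoted on the slide (n = 13, Σx = 130, Σy = 44.4, Σxy = 514.8, Σx² = 1562).

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares intercept a and slope b for y = a + bx, from raw sums.

    Hypothetical helper for illustration; it implements the slide formulas
    b = [Sxy - Sx*Sy/n] / [Sx2 - Sx^2/n] and a = mean(y) - b*mean(x).
    """
    sxy = sum_xy - sum_x * sum_y / n   # corrected sum of cross-products
    sxx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of x
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    return a, b


# Sums taken from the worked example on the slide
a, b = fit_line_from_sums(n=13, sum_x=130, sum_y=44.4, sum_xy=514.8, sum_x2=1562)
print(f"slope b = {b:.3f} cm/day, intercept a = {a:.3f} cm")
```

Because the code keeps the slope unrounded, the intercept it prints differs in the third decimal place from the slide's 0.715, which was obtained with b rounded to 0.270.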

Residual $= y - \hat{y} = y - (a + bx)$

[Figure: residual plots, with points marked as positive (+), zero (0) or negative (-) residuals]

Testing the significance of a regression

| Source of variation | Sum of squares (SS) | DF | Mean square (MS) |
|---|---|---|---|
| Total $(Y_i - \bar{Y})$ | $\sum Y_i^2 - (\sum Y_i)^2/n$ | $n - 1$ | |
| Linear regression $(\hat{Y}_i - \bar{Y})$ | $\dfrac{[\sum X_iY_i - (\sum X_i)(\sum Y_i)/n]^2}{\sum X_i^2 - (\sum X_i)^2/n}$ | 1 | regression SS / regression DF |
| Residual $(Y_i - \hat{Y}_i)$ | total SS - regression SS | $n - 2$ | residual SS / residual DF |

Compute F = regression MS / residual MS and compare it with the critical value $F_{\alpha(1),\,1,\,(n-2)}$

Coefficient of determination $r^2$ = regression SS / total SS

$r^2$ indicates the proportion (or %) of the total variation in Y that is explained or accounted for by the fitted regression:

e.g. $r^2 = 0.81$, i.e. the regression can explain 81% of the total variation

ANOVA testing of $H_0$: $\beta = 0$

Total SS $= 171.3 - (44.4)^2/13 = 19.66$

Regression SS $= [514.8 - (130)(44.4)/13]^2/[1562 - (130)^2/13] = 19.13$

Residual SS $= 19.66 - 19.13 = 0.53$

Total DF $= n - 1 = 12$

| Source of variation | SS | DF | MS | F | Crit F | P |
|---|---|---|---|---|---|---|
| Total | 19.66 | 12 | | | | |
| Linear regression | 19.13 | 1 | 19.13 | 401.1 | 4.84 | <0.001 |
| Residual | 0.53 | 11 | 0.048 | | | |

Residual MS $= (s_{Y \cdot X})^2$

Thus, reject $H_0$. $r^2 = 19.13/19.66 = 0.97$
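The ANOVA quantities can be checked with a few lines of arithmetic. This is a sketch under the assumption that the sums on the slides (including ΣY² = 171.3) describe the same 13 observations; `regression_anova` is a hypothetical name, and the critical F and P value are not computed here because they come from tables.

```python
def regression_anova(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Sums of squares, F ratio and r^2 for a simple linear regression.

    Hypothetical helper; follows the ANOVA layout on the slide.
    """
    sxy = sum_xy - sum_x * sum_y / n
    sxx = sum_x2 - sum_x ** 2 / n
    total_ss = sum_y2 - sum_y ** 2 / n
    regression_ss = sxy ** 2 / sxx
    residual_ss = total_ss - regression_ss
    residual_ms = residual_ss / (n - 2)      # this is (s_Y.X)^2
    f_ratio = regression_ss / residual_ms    # regression DF is 1
    r2 = regression_ss / total_ss
    return {
        "total_ss": total_ss,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "residual_ms": residual_ms,
        "F": f_ratio,
        "r2": r2,
    }


table = regression_anova(n=13, sum_x=130, sum_y=44.4,
                         sum_xy=514.8, sum_x2=1562, sum_y2=171.3)
for name, value in table.items():
    print(f"{name}: {value:.4f}")
```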

How to report the results in text?

The wing length of the sparrow increases significantly with increasing age ($F_{1,11} = 401.1$, $r^2 = 0.97$, $p < 0.001$). The regression on age explains 97% of the total variance in the wing length data.

Confidence intervals of the regression coefficients
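• If the slope estimates b are assumed to be normally distributed, confidence intervals (CI) can be fitted to the slope:
• 95% CI $= b \pm t_{\alpha(2),\,(n-2)}\, s_b$

where $s_b = \sqrt{(s_{Y \cdot X})^2 / \left(\sum X_i^2 - (\sum X_i)^2/n\right)}$

• CI for the estimated value $\hat{Y}_i$ at a given $X_i$ (the intercept is the special case $X_i = 0$): $\hat{Y}_i \pm t_{\alpha(2),\,(n-2)}\, s_{\hat{Y}_i}$

where $s_{\hat{Y}_i} = \sqrt{(s_{Y \cdot X})^2 \left[1/n + (X_i - \bar{X})^2 / \left(\sum X_i^2 - (\sum X_i)^2/n\right)\right]}$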

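$\hat{Y} = 0.715 + 0.270X$, mean $\bar{X} = 10$

If $x = 13$ days, then $\hat{y} = 4.225$ cm

• $s_{\hat{Y}_i} = \sqrt{(s_{Y \cdot X})^2 \left[1/n + (X_i - \bar{X})^2 / \left(\sum X_i^2 - (\sum X_i)^2/n\right)\right]}$

$= \sqrt{(0.0477)[1/13 + (13 - 10)^2/(1562 - 130^2/13)]}$

$= 0.073$ cm

• CI for the estimated value $\hat{Y}_i = \hat{Y}_i \pm t_{0.05(2),\,11}\, s_{\hat{Y}_i}$

$= 4.225 \pm (2.201)(0.073)$

$= 4.225 \pm 0.161$ cm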
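Both intervals follow the same pattern: estimate ± critical t × standard error. The sketch below assumes the rounded values from the slides (residual MS 0.0477, corrected Σx² of 262, t = 2.201) and takes the critical t as an input rather than computing it; the function names are hypothetical.

```python
import math


def slope_ci(b, resid_ms, sxx, t_crit):
    """Confidence interval for the slope: b +/- t * sqrt(resid_ms / sxx)."""
    s_b = math.sqrt(resid_ms / sxx)
    return b - t_crit * s_b, b + t_crit * s_b


def estimated_y_ci(a, b, x, x_mean, n, resid_ms, sxx, t_crit):
    """Estimated Y at x, its standard error, and the CI half-width."""
    y_hat = a + b * x
    s_y_hat = math.sqrt(resid_ms * (1 / n + (x - x_mean) ** 2 / sxx))
    return y_hat, s_y_hat, t_crit * s_y_hat


# Rounded values quoted on the slides; t_crit is t 0.05(2), 11 from tables
lower_b, upper_b = slope_ci(b=0.270, resid_ms=0.0477, sxx=262, t_crit=2.201)
y_hat, s_y_hat, half_width = estimated_y_ci(
    a=0.715, b=0.270, x=13, x_mean=10, n=13,
    resid_ms=0.0477, sxx=262, t_crit=2.201)

print(f"slope CI: {lower_b:.3f} to {upper_b:.3f} cm/day")
print(f"Y-hat at x = 13: {y_hat:.3f} +/- {half_width:.3f} cm")
```

Passing the critical t in as a parameter keeps the sketch free of any statistics library; a table lookup or a library call could supply it instead.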

Inverse prediction (follow the same example)

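$\hat{Y} = 0.715 + 0.270X$ and mean $\bar{Y} = 3.415$; if $Y_i = 4.5$, then

$\hat{X} = (Y_i - 0.715)/0.270 = 14.019$ days

To compute the 95% CI:

$t_{0.05(2),\,11} = 2.201$

$K = b^2 - t^2 s_b^2 = 0.270^2 - (2.201)^2\left[(s_{Y \cdot X})^2 / \left(\sum X_i^2 - (\sum X_i)^2/n\right)\right]$

$= 0.270^2 - (2.201)^2[(0.0477)/(262)] = 0.0720$

95% CI:

$= \bar{X} + \dfrac{b(Y_i - \bar{Y})}{K} \pm \dfrac{t}{K}\sqrt{(s_{Y \cdot X})^2\left\{\dfrac{(Y_i - \bar{Y})^2}{\sum X_i^2 - (\sum X_i)^2/n} + K\left(1 + \dfrac{1}{n}\right)\right\}}$

$= 14.069 \pm (2.201/0.072)\sqrt{(0.0477)\{[(4.5 - 3.415)^2/262] + 0.072(1 + 1/13)\}}$

$= 14.069 \pm 1.912$ days

Lower limit = 12.157 days; upper limit = 15.981 days

Note that the interval is centred on $\bar{X} + b(Y_i - \bar{Y})/K = 14.069$, not on $\hat{X} = 14.019$.

Important for toxicity tests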
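The inverse-prediction interval is easy to get wrong by hand, so a short numerical check may help. This sketch assumes the rounded slide values (a = 0.715, b = 0.270, residual MS 0.0477, corrected Σx² of 262, t = 2.201); `inverse_prediction_ci` is a hypothetical name.

```python
import math


def inverse_prediction_ci(a, b, y_i, x_mean, y_mean, n, resid_ms, sxx, t_crit):
    """Point estimate of X for a given Y, with the CI formula from the slide."""
    x_hat = (y_i - a) / b
    k = b ** 2 - t_crit ** 2 * resid_ms / sxx        # K = b^2 - t^2 * s_b^2
    centre = x_mean + b * (y_i - y_mean) / k         # the CI is centred here
    half_width = (t_crit / k) * math.sqrt(
        resid_ms * ((y_i - y_mean) ** 2 / sxx + k * (1 + 1 / n)))
    return x_hat, centre - half_width, centre + half_width


x_hat, lower, upper = inverse_prediction_ci(
    a=0.715, b=0.270, y_i=4.5, x_mean=10, y_mean=3.415, n=13,
    resid_ms=0.0477, sxx=262, t_crit=2.201)
print(f"X-hat = {x_hat:.3f} days, 95% CI {lower:.3f} to {upper:.3f} days")
```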

Regression with replication

See p. 345 – 357, example 17.8 (Zar, 1999)

Example 17.8

Total SS (DF = 20 - 1 = 19)

$= 383346 - (2744)^2/20 = 6869.2$

Regression SS (DF = 1)

$= [149240 - (1050)(2744)/20]^2/[59100 - (1050)^2/20] = (5180)^2/3975 = 6750.29$

Among-groups SS (DF = k - 1 = 5 - 1 = 4)

$= 383228.73 - (2744)^2/20 = 6751.93$

Within-groups SS (DF = total - among-groups = 19 - 4 = 15)

$= 6869.2 - 6751.93 = 117.27$

Deviations-from-linearity SS (DF = among-groups - regression = 4 - 1 = 3)

$= 6751.93 - 6750.29 = 1.64$

$b = [149240 - (1050)(2744)/20]/[59100 - (1050)^2/20] = 5180/3975 = 1.303$ mm Hg/yr

Mean x = 52.5; mean y = 137.2

$a = 137.2 - (1.303)(52.5) = 68.79$ mm Hg

$\hat{Y} = 68.79 + 1.303x$

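$H_0$: The population regression is linear.

| Source of variation | SS | DF | MS | F | P |
|---|---|---|---|---|---|
| Total | 6869.2 | 19 | | | |
| Among groups | 6751.93 | 4 | | | |
| Linear regression | 6750.29 | 1 | | | |
| Deviations from linearity | 1.64 | 3 | 0.55 | 0.07 | >0.25 |
| Within groups | 117.27 | 15 | 7.82 | | |

Thus, accept $H_0$.

$H_0$: $\beta = 0$

| Source of variation | SS | DF | MS | F | P |
|---|---|---|---|---|---|
| Total | 6869.2 | 19 | | | |
| Linear regression | 6750.29 | 1 | 6750.29 | 1021.2 | <0.001 |
| Residual (within groups + deviations from linearity) | 118.91 | 18 | 6.61 | | |

Thus, reject $H_0$. $r^2 = 6750.29/6869.2 = 0.98$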
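The partition of the sums of squares can be sketched as follows. The code assumes that 383228.73 is the sum over the k = 5 groups of (group total of Y)² / group size, which is how the among-groups SS is usually built; the slides do not state this explicitly. The function name is hypothetical.

```python
def replicated_regression_anova(n, k, sum_x, sum_y, sum_xy, sum_x2, sum_y2,
                                groups_term):
    """Sums of squares and F ratios for regression with replicate Y values.

    groups_term is ASSUMED to be the sum over groups of
    (group total of Y)^2 / group size.
    """
    correction = sum_y ** 2 / n
    total_ss = sum_y2 - correction
    among_ss = groups_term - correction
    within_ss = total_ss - among_ss
    regression_ss = (sum_xy - sum_x * sum_y / n) ** 2 / (sum_x2 - sum_x ** 2 / n)
    deviation_ss = among_ss - regression_ss
    # Test of linearity: deviations MS (DF = k - 2) over within MS (DF = n - k)
    f_linearity = (deviation_ss / (k - 2)) / (within_ss / (n - k))
    # Test of slope = 0: regression MS over pooled residual MS (DF = n - 2)
    residual_ss = total_ss - regression_ss
    f_slope = regression_ss / (residual_ss / (n - 2))
    return {
        "total_ss": total_ss, "among_ss": among_ss, "within_ss": within_ss,
        "regression_ss": regression_ss, "deviation_ss": deviation_ss,
        "F_linearity": f_linearity, "residual_ss": residual_ss,
        "F_slope": f_slope,
    }


result = replicated_regression_anova(
    n=20, k=5, sum_x=1050, sum_y=2744, sum_xy=149240,
    sum_x2=59100, sum_y2=383346, groups_term=383228.73)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The F ratio for the slope comes out slightly above the tabulated 1021.2 because the slide divides by the residual MS rounded to 6.61.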

Model 2 regression
• The regression coefficient b′ (b prime) is obtained as $s_{X_1}/s_{X_2}$
• e.g. field observations of PCB (polychlorinated biphenyl) concentration per gram of fish, compared with kidney biomass

$b' = 1.2074/2.2033 = 0.548$, taken as $-0.548$ because the relationship is negative

(sign assigned by inspection of the scatter-graph)

$a' = \bar{X}_1 - b'\bar{X}_2$

$= 9.739 - (-0.548)(5.835) = 12.94$

$X_1 = 12.94 - 0.548X_2$
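A minimal sketch of the Model 2 calculation, using the standard deviations and means quoted above. The function name and the `negative_trend` flag are hypothetical; the flag stands in for the visual inspection of the scatter-graph, which the code cannot perform.

```python
def model2_line(s_x1, s_x2, mean_x1, mean_x2, negative_trend):
    """Model 2 slope b' = s_x1 / s_x2 and intercept a' = mean_x1 - b' * mean_x2.

    The ratio of standard deviations is always positive, so the sign has to
    be supplied by the analyst (here via the hypothetical negative_trend flag).
    """
    b_prime = s_x1 / s_x2
    if negative_trend:
        b_prime = -b_prime
    a_prime = mean_x1 - b_prime * mean_x2
    return a_prime, b_prime


a_prime, b_prime = model2_line(s_x1=1.2074, s_x2=2.2033,
                               mean_x1=9.739, mean_x2=5.835,
                               negative_trend=True)
print(f"X1 = {a_prime:.2f} + ({b_prime:.3f}) X2")
```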

Key notes
• There are two regression models
• Model 1 regression is used when one of the variables is fixed, so that it is measured with negligible error
• In model 1, the fixed variable is called the independent variable x and the dependent variable is designated y.
• The regression equation is written y = a + bx where a and b are the regression coefficients
• Model 2 regression is used where neither variable is fixed and both are measured with error.

### Comparing simple linear regression equations & Multiple regression analysis

Comparing Two Slopes
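• Use of Student’s t

$t = (b_1 - b_2)/s_{b_1 - b_2}$

$s_{b_1 - b_2} = \sqrt{\dfrac{(s_{Y \cdot X})_p^2}{(\sum x^2)_1} + \dfrac{(s_{Y \cdot X})_p^2}{(\sum x^2)_2}}$

• Pooled residual mean square: $(s_{Y \cdot X})_p^2 = \dfrac{(\text{residual SS})_1 + (\text{residual SS})_2}{(\text{residual DF})_1 + (\text{residual DF})_2}$
• Critical t: DF = $n_1 + n_2 - 4$
• Test $H_0$: equal slopes

[Figure: three X-Y panels showing pairs of regression lines]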

Comparing Two Slopes - Example

[Figure: two regression lines on X-Y axes, with slopes b = 2.97 and b = 2.17]

$\sum x^2 = \sum X_i^2 - (\sum X_i)^2/n$

$\sum xy = \sum X_iY_i - (\sum X_i)(\sum Y_i)/n$

$\sum y^2 = \sum Y_i^2 - (\sum Y_i)^2/n$

X: temperature (°C); Y: volume (ml)

| Quantity | Sample 1 | Sample 2 |
|---|---|---|
| $\sum x^2$ | 1470.8712 | 2272.4750 |
| $\sum xy$ | 4363.1627 | 4928.8100 |
| $\sum y^2$ | 13299.5296 | 10964.0947 |
| n | 26 | 30 |
| $b = \sum xy/\sum x^2$ | 4363.1627/1470.8712 = 2.97 | 4928.8100/2272.4750 = 2.17 |
| Residual SS $= \sum y^2 - (\sum xy)^2/\sum x^2$ | 356.7317 | 273.9142 |
| Residual DF $= n - 2$ | 24 | 28 |

$(s_{Y \cdot X})_p^2 = (356.7317 + 273.9142)/(24 + 28) = 12.1278$

$s_{b_1 - b_2} = \sqrt{\dfrac{12.1278}{1470.8712} + \dfrac{12.1278}{2272.4750}} = 0.1165$

$t = (2.97 - 2.17)/0.1165 = 6.867$ ($p < 0.001$)

DF = 24 + 28 = 52, critical $t_{0.05(2),\,52} = 2.007$

Reject $H_0$.

Therefore, there is a significant difference between the two slopes ($t_{52} = 6.867$, $p < 0.001$)
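The slope comparison can be scripted directly from the corrected sums. This is a sketch with a hypothetical function name; it takes Σx², Σxy, Σy² and n for each sample, as tabulated in the example.

```python
import math


def compare_slopes(sxx1, sxy1, syy1, n1, sxx2, sxy2, syy2, n2):
    """Student's t for H0: equal slopes, from corrected sums of each sample."""
    b1, b2 = sxy1 / sxx1, sxy2 / sxx2
    rss1 = syy1 - sxy1 ** 2 / sxx1           # residual SS, sample 1
    rss2 = syy2 - sxy2 ** 2 / sxx2           # residual SS, sample 2
    df = (n1 - 2) + (n2 - 2)                 # equals n1 + n2 - 4
    pooled_ms = (rss1 + rss2) / df           # pooled residual mean square
    se = math.sqrt(pooled_ms / sxx1 + pooled_ms / sxx2)
    return (b1 - b2) / se, df, se, pooled_ms


t_stat, df, se, pooled_ms = compare_slopes(
    sxx1=1470.8712, sxy1=4363.1627, syy1=13299.5296, n1=26,
    sxx2=2272.4750, sxy2=4928.8100, syy2=10964.0947, n2=30)
print(f"pooled MS = {pooled_ms:.4f}, SE = {se:.4f}, t = {t_stat:.3f}, DF = {df}")
```

The t value printed here is marginally smaller than 6.867 because the slide rounds both slopes to two decimals before subtracting; the conclusion is unchanged.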

Testing for a difference between points on two nonparallel regression lines. We test whether the volumes (Y) differ between the two groups at X = 12. $H_0$: same value

We also need to know:

$a_1 = 10.57$ and $a_2 = 24.91$;

Mean $\bar{X}_1 = 22.93$ and mean $\bar{X}_2 = 18.95$

Then, from $\hat{Y} = a + bX$:

Estimated $\hat{Y}_1 = 46.21$; $\hat{Y}_2 = 50.95$

$s_{\hat{Y}_1 - \hat{Y}_2} = \sqrt{(s_{Y \cdot X})_p^2\left[\dfrac{1}{n_1} + \dfrac{1}{n_2} + \dfrac{(X - \bar{X}_1)^2}{(\sum x^2)_1} + \dfrac{(X - \bar{X}_2)^2}{(\sum x^2)_2}\right]}$

$= \sqrt{(12.1278)\left[\dfrac{1}{26} + \dfrac{1}{30} + \dfrac{(12 - 22.93)^2}{1470.8712} + \dfrac{(12 - 18.95)^2}{2272.4750}\right]}$

$= 1.45$ ml

$t = (46.21 - 50.95)/1.45 = -3.269$ ($0.001 < p < 0.002$)

DF = 26 + 30 - 4 = 52, critical $t_{0.05(2),\,52} = 2.007$

Reject $H_0$.
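A sketch of the same test in code, using the rounded intercepts, slopes, means and pooled residual mean square quoted in the example. The function name `compare_points` is hypothetical.

```python
import math


def compare_points(x, a1, b1, x_mean1, sxx1, n1,
                   a2, b2, x_mean2, sxx2, n2, pooled_ms):
    """t statistic for H0: the two lines give the same Y at a chosen X."""
    y1 = a1 + b1 * x
    y2 = a2 + b2 * x
    se = math.sqrt(pooled_ms * (1 / n1 + 1 / n2
                                + (x - x_mean1) ** 2 / sxx1
                                + (x - x_mean2) ** 2 / sxx2))
    return y1, y2, se, (y1 - y2) / se


y1, y2, se_diff, t_points = compare_points(
    x=12,
    a1=10.57, b1=2.97, x_mean1=22.93, sxx1=1470.8712, n1=26,
    a2=24.91, b2=2.17, x_mean2=18.95, sxx2=2272.4750, n2=30,
    pooled_ms=12.1278)
print(f"Y1 = {y1:.2f}, Y2 = {y2:.2f}, SE = {se_diff:.3f}, t = {t_points:.3f}")
```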


Comparing Two Elevations

• If there is no significant difference between the slopes, you are required to compare the two elevations

For the common regression:

Sum of squares of X $= A_c = (\sum x^2)_1 + (\sum x^2)_2$

Sum of cross-products $= B_c = (\sum xy)_1 + (\sum xy)_2$

Sum of squares of Y $= C_c = (\sum y^2)_1 + (\sum y^2)_2$

Residual $SS_c = C_c - B_c^2/A_c$

Residual $DF_c = n_1 + n_2 - 3$

Residual MS $= (s_{Y \cdot X})_c^2 = SS_c/DF_c$

$b_c = B_c/A_c$

$t = \dfrac{(\bar{Y}_1 - \bar{Y}_2) - b_c(\bar{X}_1 - \bar{X}_2)}{\sqrt{(s_{Y \cdot X})_c^2\left[\dfrac{1}{n_1} + \dfrac{1}{n_2} + \dfrac{(\bar{X}_1 - \bar{X}_2)^2}{A_c}\right]}}$

Comparing slopes and elevations: An example

| Quantity | Sample 1 | Sample 2 |
|---|---|---|
| $\sum x^2$ | 1012.1923 | 1659.4333 |
| $\sum xy$ | 1585.3385 | 2475.4333 |
| $\sum y^2$ | 2618.3077 | 3848.9333 |
| n | 13 | 15 |
| Mean X | 54.65 | 56.93 |
| Mean Y | 170.23 | 162.93 |
| $b = \sum xy/\sum x^2$ | 1.57 | 1.49 |
| RSS $= \sum y^2 - (\sum xy)^2/\sum x^2$ | 136.2230 | 156.2449 |
| RDF $= n - 2$ | 11 | 13 |

Test $H_0$: equal slopes

DF for t test = 11 + 13 = 24

$s_{b_1 - b_2} = 0.1392$

$t = (1.57 - 1.49)/0.1392 = 0.575$ < critical $t_{0.05(2),\,24} = 2.064$, $p > 0.50$

Accept $H_0$

For the common regression, summing $\sum x^2$, $\sum xy$ and $\sum y^2$ over the two samples gives $A_c = 2671.6256$, $B_c = 4060.7718$ and $C_c = 6467.2410$.

$b_c = 4060.7718/2671.6256 = 1.520$

$SS_c = 6467.2410 - (4060.7718)^2/2671.6256 = 295.0185$

$DF_c = 13 + 15 - 3 = 25$

Residual MS $= (s_{Y \cdot X})_c^2 = SS_c/DF_c = 11.8007$

Test $H_0$: equal elevation

$t = \dfrac{(170.23 - 162.93) - 1.520(54.65 - 56.93)}{\sqrt{11.8007\left[\dfrac{1}{13} + \dfrac{1}{15} + \dfrac{(54.65 - 56.93)^2}{2671.6256}\right]}} = 8.218$

> critical $t_{0.05(2),\,25} = 2.060$, $p < 0.001$

Reject $H_0$
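The elevation test can be sketched from the per-sample sums and means given in the example. The function name is hypothetical; the formulas are those listed for the common regression.

```python
import math


def compare_elevations(sxx1, sxy1, syy1, n1, x_mean1, y_mean1,
                       sxx2, sxy2, syy2, n2, x_mean2, y_mean2):
    """t statistic for H0: equal elevations, assuming a common slope."""
    a_c = sxx1 + sxx2                      # common sum of squares of X
    b_c = sxy1 + sxy2                      # common sum of cross-products
    c_c = syy1 + syy2                      # common sum of squares of Y
    common_slope = b_c / a_c
    ss_c = c_c - b_c ** 2 / a_c            # common residual SS
    df_c = n1 + n2 - 3
    ms_c = ss_c / df_c                     # common residual mean square
    se = math.sqrt(ms_c * (1 / n1 + 1 / n2 + (x_mean1 - x_mean2) ** 2 / a_c))
    t = ((y_mean1 - y_mean2) - common_slope * (x_mean1 - x_mean2)) / se
    return t, df_c, common_slope, ms_c


t_elev, df_c, common_slope, ms_c = compare_elevations(
    sxx1=1012.1923, sxy1=1585.3385, syy1=2618.3077, n1=13,
    x_mean1=54.65, y_mean1=170.23,
    sxx2=1659.4333, sxy2=2475.4333, syy2=3848.9333, n2=15,
    x_mean2=56.93, y_mean2=162.93)
print(f"b_c = {common_slope:.3f}, MS_c = {ms_c:.4f}, t = {t_elev:.3f}, DF = {df_c}")
```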


Comparing more than two slopes
• See Section 18.4 (Zar, 1999) for details
• You can also perform an Analysis of Covariance (ANCOVA) using SPSS
ANCOVA - example
• The covariate MUST be a continuous variable on a ratio or interval scale (e.g. Temp)
• The dependent variable(s) is/are dependent on the covariate (increase or decrease with it)
• Plot the figure
• Test $H_0$: equal slopes
• Then test $H_0$: equal elevations

[Figure: regression lines for groups A and B]

ANCOVA in SPSS
• Heart beat as dependent variable
• Temp as covariate
• Species as fixed factor
• Test $H_0$: equal slopes using a model with an interaction term: Species x Temp
• A significant interaction indicates different slopes
• If the interaction is not significant, remove this term from the model; a significant Species effect then indicates different elevations

Double log transformation

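[Figure: Y against X, and log Y against log X]

Fit the linear model $Y = a + bX$ to the log-transformed variables:

$\log Y = k + m \log X$

$\log Y = \log(10^k) + \log X^m$

$\log Y = \log(10^k X^m)$

$Y = 10^k X^m$

$Y = CX^m$, where $C = 10^k$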
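The back-transformation can be illustrated with a short sketch. The data below are hypothetical and noise-free, generated from Y = 3X² purely to show that regressing log Y on log X recovers C and m; `fit_power_law` is not a name from the lecture.

```python
import math


def fit_power_law(xs, ys):
    """Fit Y = C * X**m by least squares on log10-transformed data."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((u - mean_lx) ** 2 for u in log_x)
    sxy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    m = sxy / sxx                  # slope of the log-log line
    k = mean_ly - m * mean_lx      # intercept of the log-log line
    return 10 ** k, m              # C = 10^k


# Hypothetical data: exact power law with C = 3 and m = 2
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 2 for x in xs]
c, m = fit_power_law(xs, ys)
print(f"C = {c:.3f}, m = {m:.3f}")
```

With real, noisy data the recovered C and m are estimates on the log scale, and all X and Y values must be positive for the logarithms to exist.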

Double log transformation

[Figure: weight (wt) against length, before and after log transformation]

Then $\ln W = a + b \ln L$ (natural log can also be used)

$W = e^{a} L^{b}$

Similarly, physiological response (e.g. E = ammonia excretion rate) vs. body wt:

$E = e^{a} W^{-b}$

Multiple regression analysis
• Simple linear regression: $Y = a + bX$

Multiple regression models:

• $Y = a + b_1X_1 + b_2X_2$
• $Y = a + b_1X_1 + b_2X_2 + b_3X_3$
• $Y = a + b_1X_1 + b_2X_2 + b_3X_3 + b_4X_4$
• In general: $Y = a + \sum b_iX_i$
• Similar to simple linear regression, there is only one effect or dependent variable (Y). However, there are several cause or independent variables chosen by the experimenter.
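To make the two-predictor model concrete, here is a minimal sketch that fits $Y = a + b_1X_1 + b_2X_2$ by solving the least-squares normal equations in plain Python. The data are hypothetical and noise-free (generated from a = 1, b1 = 2, b2 = 0.5), and both function names are illustrative; in practice a statistics package would be used.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):          # back-substitution
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_multiple_regression(x_rows, y):
    """Return [a, b1, b2, ...] from the normal equations (X'X) beta = X'y."""
    design = [[1.0] + list(row) for row in x_rows]   # leading 1 for intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: Y = 1 + 2*X1 + 0.5*X2 exactly
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1.0 + 2.0 * x1 + 0.5 * x2 for x1, x2 in x_rows]
a, b1, b2 = fit_multiple_regression(x_rows, y)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The normal equations become unstable when the predictors are strongly correlated with each other, which is one practical reason for the independence requirement among the cause variables.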
Multiple regression analysis
• The same assumptions as for linear regression apply, so each of the cause (independent) variables must be measured without error.
• AND each of these cause variables must be independent of the others
• If they are not independent, partial correlation should be used.
Multiple regression analysis
• Works in exactly the same way as linear regression, except that the best-fit line is made up of a separate slope for each of the cause variables;
• There is a single intercept, which is the value of the effect variable when all cause variables are zero
Multiple regression analysis
• A multiple regression using just two cause variables can be visualized using a 3D diagram
• With more than two cause variables, the relationships can no longer be displayed in a single diagram
• Can be done easily with SPSS – Stepwise multiple regression analysis will help you to determine the most important cause factor(s)