Regression: (1) Simple Linear Regression


Hal Whitehead


### Regression: (1) Simple Linear Regression

BIOL4062 / 5062

Regression
• Purposes of regression
• Simple linear regression
• Formula
• Assumptions
• If assumptions hold, what can we do?
• Testing assumptions
• When assumptions do not hold
Regression

One Dependent Variable Y

Independent Variables X1,X2,X3,...

Purposes of Regression

1. Relationship between Y and X's

2. Quantitative prediction of Y

3. Relationship between Y and X controlling for C

4. Which of X's are most important?

5. Best mathematical model

6. Compare regression relationships: Y1 on X, Y2 on X

7. Assess interactive effects of X's

• Simple regression: one X
• Multiple regression: two or more X's
Simple linear regression

Y = β0 + β1X + Error

Assumptions of simple linear regression

1. Existence

2. Independence

3. Linearity

4. Homoscedasticity

5. Normality

6. X measured without error

Assumptions of simple linear regression

1. Existence: for any fixed value of X, Y is a random variable with a certain probability distribution having finite mean and variance

[Figure: axes X, Y, Prob of Y]

2. Independence: the Y values are statistically independent of one another

3. Linearity: the mean value of Y given X is a straight line function of X

[Figure: axes X, Y, Prob of Y]

4. Homoscedasticity: the variance of Y is the same for all X

[Figure: axes X, Y, Prob of Y]

5. Normality: for any fixed value of X, Y has a normal distribution

[Figure: axes X, Y, Prob of Y]

6. X measured without error: there are no measurement errors in X


If assumptions hold, what can we do?

1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty

2. Describe quality of fit (variation of data around straight line) by estimate of σ² or r²

3. Tests of slope and intercept

4. Prediction and prediction bands

5. ANOVA Table

Parameters estimated using least-squares
• Find the line which minimizes the sum of squares of the residuals

[Figure: Age-specific pregnancy rates of female sperm whales (from Best et al. 1984 Rep. int. Whal. Commn. Spec. Issue)]
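For a single X, the least-squares line has a closed-form solution. The sketch below is an illustration only: the function name `least_squares` and the data are hypothetical and are not the sperm whale data from the slide.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of X and sum of cross-products of X and Y about their means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate: line passes through the means
    return b0, b1


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```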

1. Estimate β0 (intercept), β1 (slope), together with measures of uncertainty
• β0 = 0.230 (SE 0.028); 95% c.i.: 0.164; 0.296
• β1 = -0.0035 (SE 0.0009); 95% c.i.: -0.0056; -0.0013
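The standard errors and confidence intervals follow from the residual standard deviation S: SE(β1) = S/√Sxx and SE(β0) = S√(1/n + X̄²/Sxx), with the interval given by estimate ± t × SE. A minimal sketch, assuming hypothetical data and function names; the critical value 3.182 is the 97.5% point of the t-distribution with 3 degrees of freedom, which matches this toy data set only.

```python
import math


def estimates_with_se(x, y):
    """Least-squares intercept and slope with their standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                       # residual standard deviation
    se_b1 = s / math.sqrt(sxx)
    se_b0 = s * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    return b0, se_b0, b1, se_b1


def confidence_interval(estimate, se, t_crit):
    """Symmetric interval: estimate plus or minus t_crit standard errors."""
    return estimate - t_crit * se, estimate + t_crit * se


# Hypothetical data for illustration (n = 5, so n - 2 = 3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, se_b0, b1, se_b1 = estimates_with_se(x, y)
T_CRIT = 3.182  # t(0.975, df = 3); must be changed for other sample sizes
ci_b1 = confidence_interval(b1, se_b1, T_CRIT)
print(f"slope = {b1:.3f} (SE {se_b1:.3f}), 95% c.i.: {ci_b1[0]:.3f}; {ci_b1[1]:.3f}")
```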

2. Describe quality of fit by estimate of σ² or r²

• σ² = 0.0195
• r² = 0.679 (proportion of variance accounted for by regression)
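Both measures come from the same sums of squares: σ² is estimated by the residual sum of squares divided by n - 2, and r² is one minus the ratio of residual to total sum of squares. A small sketch under the same caveats as any toy example: the function name `fit_quality` and the data are hypothetical.

```python
def fit_quality(x, y):
    """Return (estimate of sigma^2, r^2) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total SS
    sigma2 = sse / (n - 2)
    r2 = 1.0 - sse / sst
    return sigma2, r2


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma2, r2 = fit_quality(x, y)
print(f"sigma^2 = {sigma2:.4f}, r^2 = {r2:.4f}")
```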

3. Tests of slope and intercept

a) Slope = 0 {Equivalent to r=0}

b) Slope = Predetermined constant

c) Intercept = 0

d) Intercept = Predetermined constant

e) Compare slopes

f) Compare intercepts {Assume same slope}

(tests use t-distribution)

3a) Slope = 0 {Equivalent to r=0}

Does pregnancy rate change with age?

H0: β1 = 0

H1: β1 ≠ 0

P=0.006

Does pregnancy rate decline with age?

H0: β1 = 0

H1: β1 < 0

P=0.003

3b) Slope = Predetermined constant

Does shape change with length? (weight ∝ length³)

β1 = 2.868 (SE 0.058)

95% c.i.: 2.752; 2.984

H0: β1 = 3

H1: β1 ≠ 3

P<0.05

[Figure: Weights and Lengths of Cetacean Species (Whitehead & Mann, in Cetacean Societies, 2000)]
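The tests of slope use t = (estimate - hypothesized value) / SE. The sketch below applies this to the estimates reported for 3a and 3b. The degrees of freedom are not given on the slides, so the critical value is backed out from the reported 95% c.i. under the assumption that the interval is estimate ± t × SE; the function name is hypothetical.

```python
def t_statistic(estimate, se, hypothesized=0.0):
    """t statistic for H0: parameter = hypothesized."""
    return (estimate - hypothesized) / se


# 3a) H0: slope = 0, using the reported pregnancy-rate slope and SE.
t_3a = t_statistic(-0.0035, 0.0009)

# 3b) H0: slope = 3, using the reported weight-length slope and SE.
t_3b = t_statistic(2.868, 0.058, hypothesized=3.0)

# Critical value implied by the reported interval 2.752; 2.984
# (assumption: the interval is symmetric, estimate +/- t * SE).
t_crit_implied = (2.984 - 2.752) / 2 / 0.058
reject_3b = abs(t_3b) > t_crit_implied

print(f"3a: t = {t_3a:.2f}")
print(f"3b: t = {t_3b:.2f}, implied critical t = {t_crit_implied:.2f}, reject H0: {reject_3b}")
```

With these numbers |t| for 3b exceeds the implied critical value, which agrees with the reported P<0.05 and with 3 lying outside the confidence interval.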

3c) Intercept = 0

β0 = 0.436 (SE 0.080)

95% c.i.: 0.276; 0.596

Is birth length proportional to length?

H0: β0 = 0

H1: β0 ≠ 0

P<0.001

3e) Compare slopes

Does shape change differently with length for odontocetes and mysticetes?

β1 (m) = 2.528 (SE 0.409)

β1 (o) = 2.962 (SE 0.094)

H0: β1 (m) = β1 (o)

H1: β1 (m) ≠ β1 (o)

P = 0.146

[Figure: Weights and Lengths of Cetacean Species]

3f) Compare intercepts {Assume same slope}

Are odontocetes and mysticetes equally fat?

β0 (m) = 2.528 (SE 0.409)

β0 (o) = 2.962 (SE 0.094)

H0: β0 (m) = β0 (o)

H1: β0 (m) ≠ β0 (o)

P = 0.781

[Figure: Log(Weight) against Log(Length), points marked by ORDER (m, o)]

4. Prediction and prediction bands

[Figure: 95% Confidence Bands for Regression Line, and 95% Prediction Bands. From: http://www.tufts.edu/~gdallal/slr.htm]
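The two kinds of band differ by one term. The confidence band for the regression line at X0 has half-width t × S × √(1/n + (X0 - X̄)²/Sxx); the prediction band for a new observation adds 1 under the square root. A minimal sketch, assuming hypothetical data, a hypothetical function name, and a critical value that suits this toy sample only:

```python
import math


def band_half_widths(x, y, x0, t_crit):
    """Return (fitted value, confidence half-width, prediction half-width) at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    distance_term = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * s * math.sqrt(distance_term)         # band for the mean of Y
    pi_half = t_crit * s * math.sqrt(1.0 + distance_term)   # band for a new Y
    return b0 + b1 * x0, ci_half, pi_half


# Hypothetical data; 3.182 is t(0.975, df = 3) for n = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
at_mean = band_half_widths(x, y, x0=3.0, t_crit=3.182)
at_edge = band_half_widths(x, y, x0=5.0, t_crit=3.182)
print("at X = 3:", at_mean)
print("at X = 5:", at_edge)
```

Both bands are narrowest at the mean of X and widen away from it, and the prediction band is always the wider of the two.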

5. ANOVA Table

Analysis of Variance

| Source | Sum-of-Squares | df | Mean-Square | F-ratio | P |
|---|---|---|---|---|---|
| Regression | 286.27 | 1 | 286.27 | 2475.07 | 0.00 |
| Residual | 5.32 | 46 | 0.12 | | |
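The table entries are linked by simple arithmetic: each mean square is a sum of squares divided by its df, and the F-ratio is the regression mean square divided by the residual mean square. The sketch below recomputes them from the rounded values in the table, so the F-ratio agrees with the reported 2475.07 only approximately.

```python
# Values as reported in the ANOVA table (rounded to two decimals).
ss_regression, df_regression = 286.27, 1
ss_residual, df_residual = 5.32, 46

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual        # estimate of sigma^2
f_ratio = ms_regression / ms_residual
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"MS residual = {ms_residual:.2f}")
print(f"F = {f_ratio:.1f}")
print(f"r^2 = {r_squared:.3f}")
```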


Testing assumptions: diagnostics
• Use residuals to look at assumptions of regression:

e(i) = Y(i) - (β0 + β1X(i))

[Figure: Observed and Expected values of Y]

Residuals
• Residual: e(i) = Y(i) - (β0 + β1X(i))
• Standardized residuals: e(i) / S

{S is the standard deviation of the residuals}

• Studentized residuals: e(i) / [S √(1 - h(i))]

{h(i) is the "leverage value" of observation i:

h(i) = 1/n + (X(i) - ΣX(i)/n)² / [(n-1) S(X)²], where S(X) is the standard deviation of X}

• Jackknifed residuals: e(i) / [S(-i) √(1 - h(i))]

{The residual standard deviation S(-i) is calculated separately with each observation deleted}
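The four kinds of residual can be computed together. In the sketch below, the function name and data are hypothetical, and S(-i) is obtained from the standard shortcut S(-i)² = (SSE - e(i)²/(1 - h(i))) / (n - 3) rather than by refitting the line n times; this shortcut is a choice made here, not something stated on the slides.

```python
import math


def residual_diagnostics(x, y):
    """Raw, standardized, studentized and jackknifed residuals of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)    # equals (n - 1) * S(X)^2
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    s = math.sqrt(sse / (n - 2))
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]   # leverage values
    standardized = [ei / s for ei in e]
    studentized = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]
    # Residual SD with observation i deleted (shortcut, no refitting).
    s_deleted = [math.sqrt((sse - ei ** 2 / (1 - hi)) / (n - 3))
                 for ei, hi in zip(e, h)]
    jackknifed = [ei / (sd * math.sqrt(1 - hi))
                  for ei, sd, hi in zip(e, s_deleted, h)]
    return {"raw": e, "leverage": h, "standardized": standardized,
            "studentized": studentized, "jackknifed": jackknifed}


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
diagnostics = residual_diagnostics(x, y)
for name, values in diagnostics.items():
    print(name, [round(v, 3) for v in values])
```

The leverage values sum to 2, the number of fitted parameters, which is a quick check on the calculation.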

Use Residuals to:

a) look for outliers which we may wish to remove

b) examine normality

c) check for linearity

d) check for homoscedasticity

e) check for some kinds of non-independence

Should outliers be removed?
• Yes, if “outlier” was probably not produced by the process being studied (measurement error, different species, ...)
• No, if “outlier” was probably produced by the process being studied (extreme specimen)
b) Using residuals to examine normality
• Lilliefors test for normality: P=0.62
• Lilliefors test for normality (excluding Bowhead whale): P=0.68

e) Using residuals to check for some kinds of non-independence
• Data: days spent following sperm whales
• Durbin-Watson D statistic: 1.48
• low values (<2) indicate positive autocorrelation
• First-order autocorrelation: 0.26
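The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, and it is approximately 2(1 - r), where r is the first-order autocorrelation. The reported values are consistent with this: 2 × (1 - 0.26) = 1.48. A short sketch with a hypothetical function name and made-up residual series:

```python
def durbin_watson(residuals):
    """Durbin-Watson D for residuals taken in their time (or collection) order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residual series for illustration.
d_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])        # strong positive autocorrelation
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])   # strong negative autocorrelation

# Approximation D = 2 * (1 - r) applied to the reported autocorrelation.
d_from_reported_r = 2 * (1 - 0.26)

print(d_constant, d_alternating, round(d_from_reported_r, 2))
```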


When assumptions do not hold:

1. Existence:

Forget it!


2. Independence:

• collect data differently
• reduce the size of the data set
• add terms to the model (e.g. autocorrelation term, species effect)

More a problem for testing than prediction

3. Linearity:

• Transform either X or Y or both variables. e.g.:

Log(Y) = β0 + β1 Log(X) + E

• Polynomial regression:

Y = β0 + β1X + β2X² + ... + E

• Non-linear regression. e.g.:

Y = c + EXP(β0 + β1X) + E

• Piecewise linear regression:

Y = β0 + β1X [X>XK] + E

where [X>XK]=0 if X<XK and [X>XK]=1 if X>XK.

[Figure panels: piecewise linear, log-log, polynomial and non-linear fits]
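The log-log transformation turns a power law into a straight line, so ordinary least squares on the logged variables recovers the exponent as the slope. A minimal sketch, assuming strictly positive data; the function name and the exact cubic data are hypothetical, chosen to echo the weight and length example.

```python
import math


def fit_log_log(x, y):
    """Fit Log(Y) = b0 + b1 Log(X) by least squares; x and y must be positive."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    b1 = sxy / sxx
    b0 = mean_ly - b1 * mean_lx
    return b0, b1


# Hypothetical, noise-free data following weight = 2 * length^3.
lengths = [1.0, 2.0, 4.0, 8.0]
weights = [2.0 * length ** 3 for length in lengths]
b0, b1 = fit_log_log(lengths, weights)
print(f"exponent = {b1:.3f}, multiplier = {math.exp(b0):.3f}")
```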
4. Homoscedasticity:

• Transformations of the Y variable
• Weighted regressions (if we know that some observations are more accurate than others)
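A weighted regression minimizes the weighted sum of squared residuals, so that more accurate observations count for more. The sketch below assumes the weights are known in advance (for example, inverse variances); the function name, data and weights are hypothetical.

```python
def weighted_least_squares(x, y, w):
    """Return (b0, b1) minimizing sum of w * (y - b0 - b1 * x)^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w   # weighted means
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: the last observation is treated as less accurate.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 9.0]
weights = [4.0, 4.0, 4.0, 1.0]
b0_w, b1_w = weighted_least_squares(x, y, weights)
b0_u, b1_u = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights = ordinary fit
print(f"weighted slope = {b1_w:.3f}, unweighted slope = {b1_u:.3f}")
```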

5. Normality:

• Transformations of the Y variable
• Non-normal error structures (e.g. Poisson)

Small departures from normality are not especially important, unless doing a test


6. X measured without error:

• Major axis regression
• Reduced major axis, or geometric mean, regression
Major axis regression:
• Minimize sum of squares of perpendicular distances from observations to regression line
• Only if variables are in same units

{First principal component of covariance matrix}

Reduced major axis regression:
• Each of the two variables is transformed to have a mean of zero and a standard deviation of 1
• Then, minimize sum of squares of perpendicular distances from observations to regression line
• Its slope cannot be sensibly tested against zero

{first principal component using the correlation matrix}
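Both slopes can be written in terms of the variances and covariance of X and Y. The major axis slope is that of the first principal component of the covariance matrix; the reduced major axis slope is the ratio of standard deviations, carrying the sign of the covariance. A minimal sketch with a hypothetical function name and data, assuming the covariance is not zero:

```python
import math


def axis_slopes(x, y):
    """Return (major axis slope, reduced major axis slope) for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    # Slope of the first principal component of the covariance matrix
    # (undefined when cov_xy is zero).
    diff = var_y - var_x
    ma = (diff + math.sqrt(diff ** 2 + 4 * cov_xy ** 2)) / (2 * cov_xy)
    # First principal component of the correlation matrix, back in original units.
    rma = math.copysign(math.sqrt(var_y / var_x), cov_xy)
    return ma, rma


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ma_slope, rma_slope = axis_slopes(x, y)
print(f"major axis slope = {ma_slope:.3f}, reduced major axis slope = {rma_slope:.3f}")
```

Because the reduced major axis slope depends only on the two standard deviations and the sign of the correlation, it is never zero, which is why it cannot be sensibly tested against zero.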

Regression
• Extremely useful technique!
• Check assumptions using residuals
• Can be extended in several ways
  • multiple regression
  • non-linear regression
  • non-normal errors
  • piecewise regression
  • ...