# Regression: (2) Multiple Linear Regression and Path Analysis

### Regression: (2) Multiple Linear Regression and Path Analysis

BIOL4062/5062

Multiple Linear Regression and Path Analysis
• Multiple linear regression
  • assumptions
  • parameter estimation
  • hypothesis tests
  • selecting independent variables
  • collinearity
  • polynomial regression
• Path analysis
Regression

One Dependent Variable Y

Independent Variables X1, X2, X3, ...

Purposes of Regression

1. Relationship between Y and X's

2. Quantitative prediction of Y

3. Relationship between Y and X, controlling for other variables C

4. Which of X's are most important?

5. Best mathematical model

6. Compare regression relationships: Y1 on X, Y2 on X

7. Assess interactive effects of X's

• Simple regression: one X
• Multiple regression: two or more X's

Y = β0 + β1⋅X(1) + β2⋅X(2) + β3⋅X(3) + ... + βk⋅X(k) + E

Multiple linear regression: assumptions (1)
• For any specific combination of X's, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance (Existence)
• Y values are statistically independent of one another (Independence)
• Mean value of Y given the X's is a straight linear function of the X's (Linearity)
Multiple linear regression: assumptions (2)
• The variance of Y is the same for any fixed combination of X's (Homoscedasticity)
• For any fixed combination of X's, Y has a normal distribution (Normality)
• There are no measurement errors in the X's (X's measured without error)
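The linearity and homoscedasticity assumptions can be checked informally from the residuals of a fitted model. A minimal sketch with simulated data (all numbers and variable names are hypothetical, chosen so the assumptions hold by construction), using plain NumPy:

```python
import numpy as np

# Simulated data that satisfies the assumptions by construction.
rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.4, size=n)

# Fit by least squares (the column of ones is the intercept).
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
fitted = A @ beta

# Linearity: residuals average to zero (exactly, for OLS with an intercept).
print(round(resid.mean(), 10))

# Homoscedasticity: residual spread should be similar across the range of
# fitted values; here we compare the low and high halves.
order = np.argsort(fitted)
lo, hi = resid[order[: n // 2]], resid[order[n // 2:]]
print(round(lo.std(), 2), round(hi.std(), 2))  # the two spreads should be close
```

In practice one would plot residuals against fitted values rather than just compare two halves, but the split makes the idea concrete without a plotting library.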
Multiple linear regression: parameter estimation

Y = β0 + β1⋅X(1) + β2⋅X(2) + β3⋅X(3) + ... + βk⋅X(k) + E

• Estimate the β's in multiple regression using least squares
• Sizes of the coefficients are not good indicators of the importance of the X variables
• Number of data points in multiple regression:
  • at least one more than the number of X's
  • preferably 5 times the number of X's
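The least-squares step can be sketched in a few lines of NumPy. The data below are simulated with known β's (purely illustrative), just to show that `lstsq` on a design matrix with a column of ones for the intercept recovers them:

```python
import numpy as np

# Illustrative data: n = 30 observations, k = 3 predictors (the slides' rule of
# thumb: at least k+1 points, preferably 5k). All values here are made up.
rng = np.random.default_rng(0)
n, k = 30, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, 2.0, -1.5, 0.5])  # [b0, b1, b2, b3]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Least squares: augment X with a column of ones so b0 is estimated too.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # should be close to beta_true
```

With small noise and n well above 5k, the estimates land close to the generating coefficients.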
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)

Multiple regression of Y [Log(CNS)] on:

| X | β | SE(β) |
|---|---|---|
| Log(Mass) | -0.49 | (0.70) |
| Log(Fat) | -0.07 | (0.10) |
| Log(Muscle) | 1.03 | (0.54) |
| Log(Heart) | 0.42 | (0.22) |
| Log(Bone) | -0.07 | (0.30) |

N = 39

Multiple linear regression: hypothesis tests

Usually test:

H0: Y = β0 + β1⋅X(1) + β2⋅X(2) + ... + βj⋅X(j) + E

H1: Y = β0 + β1⋅X(1) + β2⋅X(2) + ... + βj⋅X(j) + ... + βk⋅X(k) + E

F-test with (k-j, n-k-1) degrees of freedom ("partial F-test")

H0: variables X(j+1), ..., X(k) do not help explain variability in Y
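The partial F-test above can be computed directly from the residual sums of squares of the two nested fits. A sketch with simulated data (all names and coefficients hypothetical), in which the extra predictors genuinely matter, so H0 should be rejected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, j = 50, 4, 2
X = rng.normal(size=(n, k))
# X(3) and X(4) genuinely help explain Y here, so H0 should be rejected.
y = 1.0 + X @ np.array([0.5, -0.5, 1.0, 1.0]) + rng.normal(scale=0.5, size=n)

def rss(Xmat, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

rss_full = rss(X, y)        # H1: all k predictors
rss_red = rss(X[:, :j], y)  # H0: only the first j predictors
F = ((rss_red - rss_full) / (k - j)) / (rss_full / (n - k - 1))
p = stats.f.sf(F, k - j, n - k - 1)
print(F, p)
```

Setting j = 0 (reduced model = intercept only) gives the overall test of the regression described on the next slide as a special case.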

Multiple linear regression: hypothesis tests

e.g. Test significance of the overall multiple regression:

H0: Y = β0 + E

H1: Y = β0 + β1⋅X(1) + β2⋅X(2) + ... + βk⋅X(k) + E

• Or test the significance of deleting an independent variable
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)

Multiple regression of Y [Log(CNS)] on:

| X | β | SE(β) | P |
|---|---|---|---|
| Log(Mass) | -0.49 | (0.70) | 0.49 |
| Log(Fat) | -0.07 | (0.10) | 0.52 |
| Log(Muscle) | 1.03 | (0.54) | 0.07 |
| Log(Heart) | 0.42 | (0.22) | 0.06 |
| Log(Bone) | -0.07 | (0.30) | 0.83 |

Each P value tests whether removal of that variable reduces the fit.
Multiple linear regression: selecting independent variables
• Reasons for selecting a subset of independent variables (X's):
  • cost (financial and other)
  • simplicity
  • improved prediction
  • improved explanation
Multiple linear regression: selecting independent variables
• Partial F-test
  • predetermined forward selection
  • forward selection based upon improvement in fit
  • backward selection based upon improvement in fit
  • stepwise (backward/forward)
• Mallows' C(p)
• AIC
Multiple linear regression: selecting independent variables
• Partial F-test
  • predetermined forward selection
    • e.g. Mass, Bone, Heart, Muscle, Fat
  • forward selection based upon improvement in fit
  • backward selection based upon improvement in fit
  • stepwise (backward/forward)
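Forward selection based on improvement in fit can be sketched as a loop: at each step, add the candidate that most reduces the residual sum of squares, provided its partial F-test passes an α-to-enter threshold. The data and threshold below are hypothetical (only x0 and x2 truly drive y):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
# Hypothetical predictors; only x0 and x2 truly drive y.
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def rss(cols):
    """RSS of an OLS fit of y on the listed columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

selected, remaining, alpha_enter = [], [0, 1, 2, 3], 0.15
while remaining:
    base = rss(selected)
    # Candidate giving the biggest drop in RSS when added to the current model.
    best = min(remaining, key=lambda c: rss(selected + [c]))
    new = rss(selected + [best])
    df_resid = n - (len(selected) + 1) - 1
    F = (base - new) / (new / df_resid)          # partial F with 1 numerator df
    if stats.f.sf(F, 1, df_resid) < alpha_enter:
        selected.append(best)
        remaining.remove(best)
    else:
        break
print(selected)  # x0 should enter first; x2 should also be selected
```

A full stepwise procedure would also re-test already-entered variables against an α-to-remove threshold after each addition; this sketch shows only the forward half.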
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
• Complete model (r² = 0.97)
• Forward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  • 1. Constant (r² = 0.00)
  • 2. Constant + Muscle (r² = 0.97)
  • 3. Constant + Muscle + Heart (r² = 0.97)
  • 4. Constant + Muscle + Heart + Mass (r² = 0.97)

Final model: Log(CNS) = -0.18 - 0.82⋅Log(Mass) + 1.24⋅Log(Muscle) + 0.39⋅Log(Heart)

Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
• Complete model (r² = 0.97)
• Backward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
  • 1. All (r² = 0.97)
  • 2. Remove Bone (r² = 0.97)
  • 3. Remove Fat (r² = 0.97)

Final model: Log(CNS) = -0.18 - 0.82⋅Log(Mass) + 1.24⋅Log(Muscle) + 0.39⋅Log(Heart)

Comparing models
• Mallows' C(p)
  • C(p) = (k-p)⋅F(p) + (2p-k+1)
  • k parameters in the full model; p parameters in the restricted model
  • F(p) is the F value comparing the fit of the restricted model with that of the full model
  • Lowest C(p) indicates the best model
• Akaike Information Criterion (AIC)
  • AIC = n⋅log(σ̂²) + 2p
  • Lowest AIC indicates the best model
  • Can compare models not nested in one another
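Both criteria are cheap to compute once each candidate model's RSS is known. A sketch on simulated data (names and coefficients hypothetical; only x0 truly matters), using the algebraically equivalent form C(p) = RSS_p/σ̂² - n + 2p, under which the full model scores C(p) = k exactly:

```python
import numpy as np

# Simulated data where only x0 truly matters (all values hypothetical).
rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

def fit_rss(cols):
    """RSS of an OLS fit of y on the listed columns (plus intercept)."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

full = [0, 1, 2]
k = len(full) + 1             # parameters in the full model (incl. intercept)
s2 = fit_rss(full) / (n - k)  # residual variance estimate from the full model

def cp(cols):
    # Mallows' C(p) in the form RSS_p / s^2 - n + 2p; equals k for the full model.
    p = len(cols) + 1
    return fit_rss(cols) / s2 - n + 2 * p

def aic(cols):
    # AIC = n*log(sigma_hat^2) + 2p, with sigma_hat^2 = RSS/n (constants dropped)
    p = len(cols) + 1
    return n * np.log(fit_rss(cols) / n) + 2 * p

for cols in ([0], [0, 1], [0, 1, 2]):
    print(cols, round(cp(cols), 2), round(aic(cols), 2))
```

For a well-fitting restricted model, C(p) stays near p; models missing important variables score much higher.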
Collinearity
• If two (or more) X's are linearly related, e.g.
  • X(3) = 5⋅X(2) + 16, or
  • X(2) = 4⋅X(1) + 16⋅X(4)
• then:
  • they are collinear
  • the regression problem is indeterminate
• If they are nearly linearly related (near collinearity), coefficients and tests are very inaccurate
• Possible remedies:
  • Centering (mean = 0)
  • Scaling (SD = 1)
  • Regression on the first few Principal Components
  • Ridge Regression
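One common diagnostic for near collinearity (not named on the slide, but standard) is the variance inflation factor: regress each X on the other X's and compute 1/(1-R²). A sketch with hypothetical data in which x1 and x2 are nearly identical:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor: 1/(1 - R^2) from regressing X_j on the rest."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])
# x1 and x2 should show very large VIFs; x3 should be near 1
```

Large VIFs flag exactly the situation the slide warns about: the coefficients on x1 and x2 become very imprecisely determined even though their joint effect may be well estimated.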
Curvilinear (Polynomial) Regression
• Y = β0 + β1⋅X + β2⋅X² + β3⋅X³ + ... + βk⋅Xᵏ + E
• Used to fit fairly complex curves to data
• β's estimated using least squares
• Use sequential partial F-tests, or AIC, to find how many terms to use
• k > 3 is rare in biology
• Better to transform data and use simple linear regression, when possible
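Choosing the number of polynomial terms by AIC can be sketched with NumPy's polynomial fitting. The data are simulated from a quadratic curve (all coefficients hypothetical), so the criterion should favour degree 2:

```python
import numpy as np

# Simulated data from a quadratic curve (coefficients hypothetical).
rng = np.random.default_rng(5)
n = 40
x = np.linspace(0, 10, n)
y = 0.2 - 0.01 * x + 0.006 * x**2 + rng.normal(scale=0.05, size=n)

# Fit polynomials of increasing degree by least squares and compare with AIC.
aics = []
for deg in range(1, 5):
    coefs = np.polyfit(x, y, deg)           # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)
    rss = resid @ resid
    aics.append(n * np.log(rss / n) + 2 * (deg + 1))
    print(deg, round(aics[-1], 1))
# AIC should drop sharply at degree 2 and improve little beyond it
```

Sequential partial F-tests between consecutive degrees would lead to the same kind of decision, since the models are nested.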
Curvilinear (Polynomial) Regression

Y = 0.066 + 0.00727⋅X

Y = 0.117 + 0.00085⋅X + 0.00009⋅X²

Y = 0.201 - 0.01371⋅X + 0.00061⋅X² - 0.000005⋅X³

From Sokal and Rohlf


Path Analysis
• Models with causal structure
• Represented by path diagram
• All variables quantitative
• All path relationships assumed linear
• (transformations may help)

[Path diagram: variables A, B, C, D, E and residual variable U]

Path Analysis
• All paths are one-way:
  • A ⇒ C, but then not also C ⇒ A
• No loops
• Some variables may not be directly observed:
  • residual variables (U)
• Some variables not observed but known to exist:
  • latent variables (D)


Path Analysis
• Path coefficients and other statistics are calculated using multiple regressions
• Variables are:
  • centered (mean = 0), so no constants in the regressions
  • often standardized (SD = 1)
• So: path coefficients usually between -1 and +1
• Paths with coefficients not significantly different from zero may be eliminated
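The regression step behind path coefficients can be sketched for a small hypothetical diagram with paths A ⇒ C and B ⇒ C (simulated with known standardized effects): standardize each variable, then the coefficients from regressing C on A and B are the path coefficients, and the residual standard deviation is the coefficient on the residual variable U.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200

def standardize(v):
    """Center (mean = 0) and scale (SD = 1)."""
    return (v - v.mean()) / v.std()

# Hypothetical causal structure A -> C and B -> C, with known effects.
A = rng.normal(size=n)
B = rng.normal(size=n)
C = 0.6 * A + 0.3 * B + rng.normal(scale=0.5, size=n)

# After standardizing, no intercept is needed in the regression.
As, Bs, Cs = standardize(A), standardize(B), standardize(C)
X = np.column_stack([As, Bs])
path, *_ = np.linalg.lstsq(X, Cs, rcond=None)
print(path)  # path coefficients for A->C and B->C, each within [-1, 1] here

# The residual path (the diagram's U) has coefficient sqrt(residual variance).
resid = Cs - X @ path
u = np.sqrt(resid @ resid / n)
print(u)
```

In a larger diagram there is one such regression per endogenous variable, each regressing it on its direct causes only.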
Path Analysis: an example
• Isaak and Hubert. 2001. “Production of stream habitat gradients by montane watersheds: hypothesis tests based on spatially explicit path analyses” Can. J. Fish. Aquat. Sci.

Legend: dashed lines = predicted negative interaction; solid lines = predicted positive interaction