
# CHAPTER 7 Linear Correlation & Regression Methods


## PowerPoint Slideshow about ' CHAPTER 7 Linear Correlation & Regression Methods' - cameron-mays


### CHAPTER 7 Linear Correlation & Regression Methods

7.1 - Motivation

7.2 - Correlation / Simple Linear Regression

7.3 - Extensions of Simple Linear Regression

Parameter Estimation via SAMPLE DATA …

Testing for association between two POPULATION variables X and Y…

• Categorical variables: Chi-squared Test

Examples:

X = Disease status (D+, D–), Y = Exposure status (E+, E–)

X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

• Numerical variables: ???????

PARAMETERS

• Means: $\mu_X = E[X]$, $\mu_Y = E[Y]$
• Variances: $\sigma_X^2 = E[(X - \mu_X)^2]$, $\sigma_Y^2 = E[(Y - \mu_Y)^2]$
• Covariance: $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$

Parameter Estimation via SAMPLE DATA …

• Numerical variables: ???????

Each population PARAMETER is estimated by the corresponding sample STATISTIC:

• Means: $\mu_X$, $\mu_Y$ are estimated by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
• Variances: $\sigma_X^2$, $\sigma_Y^2$ are estimated by $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$
• Covariance: $\sigma_{XY}$ is estimated by $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ (can be +, –, or 0)

[Figure: scatterplot of Y versus X (n data points). JAMA. 2003;290:1486-1493]

Does this suggest a linear trend between X and Y?

If so, how do we measure it?

Testing for LINEAR association between two population variables X and Y…

• Numerical variables

PARAMETERS

• Means: $\mu_X$, $\mu_Y$
• Variances: $\sigma_X^2$, $\sigma_Y^2$
• Covariance: $\sigma_{XY}$
• Linear Correlation Coefficient: $\rho = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$

Always between –1 and +1

Parameter Estimation via SAMPLE DATA …

• Numerical variables
• Linear Correlation Coefficient: the PARAMETER $\rho = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$ is estimated by the STATISTIC $r = \dfrac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations

Always between –1 and +1

Parameter Estimation via SAMPLE DATA …

• Numerical variables

Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)

> x = sort(sample(pop, 10))

1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1

> y = sample(pop, 10)

13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0

• Means:

> c(mean(x), mean(y))

7.05 12.08

• Variances:

> var(x)

29.48944

> var(y)

43.76178

Scatterplot (n = 10 data points):

> plot(x, y, pch = 19)

• Covariance (can be +, –, or 0):

> cov(x, y)

-25.86667

• Linear Correlation Coefficient (always between –1 and +1):

> cor(x, y)

-0.7200451
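As a rough check on what `cov()` and `cor()` compute, the sketch below rebuilds both statistics from their definitions. The data are the sampled values printed above, typed in by hand; the names `s_xy` and `r` are chosen here for illustration only.

```r
# Sampled values from the example above, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample correlation: covariance rescaled by the two sample standard deviations.
r <- s_xy / (sd(x) * sd(y))

# Compare the hand computations with the built-in functions.
print(c(by_hand = s_xy, builtin = cov(x, y)))
print(c(by_hand = r, builtin = cor(x, y)))
```

Because the covariance is divided by both standard deviations, r is unit-free, which is why it can be read on a fixed –1 to +1 scale.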

Parameter Estimation via SAMPLE DATA …

• Numerical variables
• Linear Correlation Coefficient: $r = \dfrac{s_{xy}}{s_x s_y}$

Always between –1 and +1

r measures the strength of linear association

–1 ≤ r < 0: negative linear correlation

0 < r ≤ +1: positive linear correlation

> cor(x, y)

-0.7200451

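Testing for linear association between two numerical population variables X and Y…

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ

• Linear Correlation Coefficient

$H_0: \rho = 0$ (no linear association) versus $H_A: \rho \neq 0$

Test Statistic for p-value

$T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$ under $H_0$

With r = –0.7200451 and n = 10: $T = \dfrac{-0.7200451\sqrt{8}}{\sqrt{1-(-0.7200451)^2}} = -2.935$ on 8 degrees of freedom

> 2 * pt(-2.935, 8)

p-value = .0189 < .05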
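The test can be reproduced in R as a minimal sketch, assuming the usual t statistic for a correlation, $T = r\sqrt{n-2}/\sqrt{1-r^2}$ on n – 2 degrees of freedom. The data are the example values typed in by hand; `cor.test()` from base R's stats package is included only as a cross-check.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)
r <- cor(x, y)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value: twice the lower-tail area beyond -|t|.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Cross-check with the built-in Pearson correlation test.
ct <- cor.test(x, y)
print(c(t_by_hand = t_stat, t_builtin = unname(ct$statistic)))
print(c(p_by_hand = p_value, p_builtin = ct$p.value))
```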

Parameter Estimation via SAMPLE DATA …

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

If such an association between X and Y exists, then it follows that for some intercept β0 and slope β1, we have…

$Y = \beta_0 + \beta_1 X + \varepsilon$

“Response = Model + Error”

Find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the “best” line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

in what sense???

i.e., that minimizes the sum of squared residuals

SSErr = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

“Least Squares Regression Line”

The minimizing estimates are $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

For each of the n data points:

• predictor: $x_i$
• observed response: $y_i$
• fitted response: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• residuals: $e_i = y_i - \hat{y}_i$

• Linear Correlation Coefficient: r measures the strength of linear association

> cor(x, y)

-0.7200451
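A minimal sketch of the least squares calculation, assuming the standard closed-form solution (slope = sample covariance / sample variance of x; intercept = ȳ minus slope times x̄). The data are the example values typed in by hand; `b0`, `b1`, `tab`, and `sse()` are illustrative names, not part of the slides.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

# Closed-form least squares estimates.
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Predictor, observed response, fitted response, and residual for each point.
fitted_y <- b0 + b1 * x
tab <- data.frame(predictor = x, observed = y,
                  fitted = fitted_y, residual = y - fitted_y)
print(round(tab, 3))

# Sum of squared residuals for an arbitrary line a + b * x.
sse <- function(a, b) sum((y - (a + b * x))^2)

# The least squares line should beat nearby lines on this criterion.
print(c(best = sse(b0, b1), shifted = sse(b0 + 0.5, b1), tilted = sse(b0, b1 + 0.05)))

# Cross-check against lm().
lsreg <- lm(y ~ x)
print(rbind(by_hand = c(b0, b1), lm = unname(coef(lsreg))))
```

Perturbing either coefficient increases the sum of squared residuals, which is the sense in which this line is "best".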

Testing for linear association between two numerical population variables X and Y…

Now that we have these, we can conduct HYPOTHESIS TESTING on β0 and β1

• Linear Regression Coefficients

“Response = Model + Error”: $Y = \beta_0 + \beta_1 X + \varepsilon$

$H_0: \beta_1 = 0$ (no linear association) versus $H_A: \beta_1 \neq 0$

Test Statistic for p-value

$T = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim t_{n-2}$ under $H_0$, where $SE(\hat{\beta}_1) = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2 / (n-2)}{\sum (x_i - \bar{x})^2}}$

p-value = .0189

Same t-score as H0: ρ = 0!

> plot(x, y, pch = 19)

> lsreg = lm(y ~ x) # or lsfit(x,y)

> abline(lsreg)

> summary(lsreg)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-8.6607 -3.2154 0.8954 3.4649 5.7742

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 18.2639 2.6097 6.999 0.000113 ***

x -0.8772 0.2989 -2.935 0.018857 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom

Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583

F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886
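The sketch below pulls the slope's row out of the coefficient table and rebuilds its standard error and t value by hand, then compares the result with the correlation-based t statistic. It assumes the usual simple-regression formula SE(slope) = sqrt(MSErr / Σ(xᵢ – x̄)²); the data are the example values typed in by hand.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

lsreg <- lm(y ~ x)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- coef(summary(lsreg))
t_slope <- coefs["x", "t value"]

# Standard error of the slope, rebuilt from the residuals.
ms_err <- sum(resid(lsreg)^2) / (n - 2)
se_b1  <- sqrt(ms_err / sum((x - mean(x))^2))
t_by_hand <- unname(coef(lsreg)["x"]) / se_b1

# t statistic for H0: rho = 0, computed from the correlation alone.
r <- cor(x, y)
t_rho <- r * sqrt(n - 2) / sqrt(1 - r^2)

print(c(t_slope = t_slope, t_by_hand = t_by_hand, t_rho = t_rho))
```

All three values agree, which is why the regression output and the correlation test report the same p-value.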

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???

Because this second method generalizes…

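Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

“Response = Model + Error”

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \varepsilon$ (“main effects”)

For now, assume the “additive model,” i.e., main effects only.

[Figure: 3-D plot with axes X1, X2, and Y. At each pair of predictors (x1i, x2i), the residual is the vertical distance between the true response yi and the fitted response.]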

Multilinear Regression

Least Squares calculation of regression coefficients is computer-intensive. Formulas require Linear Algebra (matrices)!
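As a hedged illustration of the matrix formulation, the sketch below solves the normal equations $(X^{T}X)\hat{\beta} = X^{T}y$ directly. It uses the single-predictor example data (typed in by hand) because the slides give no multi-predictor data set; further predictors would simply be added as extra columns of the design matrix `X`, a name chosen here for illustration.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X <- cbind(intercept = 1, x = x)

# Solve the normal equations (X'X) beta = X'y for the coefficient vector.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
print(drop(beta_hat))

# Cross-check against lm(), which uses a QR decomposition internally.
print(coef(lm(y ~ x)))
```

Solving the normal equations is fine for a small, well-conditioned example like this one; decomposition-based solvers are generally preferred when predictors are nearly collinear.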

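Once calculated, how do we then test the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = 0$ (no linear association between Y and the predictors)?

ANOVA

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

• Regression: df = k – 1 = 1, SSReg = ?
• Error: df = n – k = 8, SSErr = ?
• Total: df = n – 1 = 9, SSTot = ?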

Parameter Estimation via SAMPLE DATA …

STATISTICS

• Means: $\bar{x}$, $\bar{y}$
• Variances: $s_x^2$, $s_y^2$

SSTot = $\sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)\,s_y^2$ is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

SSReg = $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

SSErr = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).
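The three sums of squares can be computed directly from a fitted model. This is a minimal sketch using the example data (typed in by hand) and the usual definitions: squared deviations of the observed responses, the fitted responses, and the residuals, the first two taken about ȳ.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg <- lm(y ~ x)
y_hat <- fitted(lsreg)

ss_tot <- sum((y - mean(y))^2)       # variability in observed responses
ss_reg <- sum((y_hat - mean(y))^2)   # variability in fitted responses
ss_err <- sum((y - y_hat)^2)         # variability left in the residuals

print(c(SSTot = ss_tot, SSReg = ss_reg, SSErr = ss_err))

# The total splits exactly into the regression and error parts.
print(isTRUE(all.equal(ss_tot, ss_reg + ss_err)))

# SSTot is also (n - 1) times the sample variance of y.
print(isTRUE(all.equal(ss_tot, (length(y) - 1) * var(y))))
```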

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

Summing squared deviations over the n = 10 data points (predictor, observed response, fitted response, residuals):

• SSReg = 204.2
• SSErr = 189.656 (minimum)
• SSTot = 9 (43.76178) = 393.856

SSTot = SSReg + SSErr

• Regression: df = k – 1 = 1, SS = 204.2, MS = 204.2
• Error: df = n – k = 8, SS = 189.656, MS = 23.707
• Total: df = n – 1 = 9, SS = 393.856

F = MSReg / MSErr = 204.2 / 23.707 = 8.61 on 1 and 8 degrees of freedom, p-value = 0.01886

Same as before! (The p-value matches the t-test on the slope, and F = t² = (–2.935)².)

> summary(aov(lsreg))

Df Sum Sq Mean Sq F value Pr(>F)

x 1 204.20 204.201 8.6135 0.01886 *

Residuals 8 189.66 23.707
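The F value and p-value in this table can be rebuilt from the sums of squares. The sketch below assumes the degrees of freedom k – 1 and n – k with k = 2 and n = 10, as stated in the slides; the data are the example values typed in by hand.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(y)   # 10 data points
k <- 2           # regression coefficients: intercept and slope

lsreg  <- lm(y ~ x)
ss_reg <- sum((fitted(lsreg) - mean(y))^2)
ss_err <- sum(resid(lsreg)^2)

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_reg <- ss_reg / (k - 1)
ms_err <- ss_err / (n - k)

# F statistic and its upper-tail p-value on (k - 1, n - k) degrees of freedom.
f_stat  <- ms_reg / ms_err
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
print(c(MSReg = ms_reg, MSErr = ms_err, F = f_stat, p = p_value))

# With a single predictor, F is the square of the slope's t statistic.
t_slope <- coef(summary(lsreg))["x", "t value"]
print(c(F = f_stat, t_squared = t_slope^2))
```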

Coefficient of Determination

r² = SSReg / SSTot = 204.2 / 393.856 = 0.5185

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Moreover, r² is the square of the linear correlation coefficient:

> cor(x, y)

-0.7200451

(–0.7200451)² = 0.5185, which summary(lsreg) reports as “Multiple R-squared: 0.5185”.
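A short sketch comparing three routes to the coefficient of determination: the ratio of sums of squares, the squared correlation, and the value stored in the `lm` summary. The data are the example values typed in by hand.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

lsreg  <- lm(y ~ x)
ss_tot <- sum((y - mean(y))^2)
ss_reg <- sum((fitted(lsreg) - mean(y))^2)

r2_from_ss  <- ss_reg / ss_tot             # proportion of variability explained
r2_from_cor <- cor(x, y)^2                 # squared correlation coefficient
r2_from_lm  <- summary(lsreg)$r.squared    # "Multiple R-squared" in the summary

print(c(from_ss = r2_from_ss, from_cor = r2_from_cor, from_lm = r2_from_lm))
```

The agreement between the first two holds for simple linear regression with an intercept; with several predictors, R² is no longer the square of any single pairwise correlation.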

Summary of Linear Correlation and Simple Linear Regression

Given: n data points on X and Y, with

• Means: $\bar{x}$, $\bar{y}$
• Variances: $s_x^2$, $s_y^2$
• Covariance: $s_{xy}$

• Linear Correlation Coefficient: $r = \dfrac{s_{xy}}{s_x s_y}$, –1 ≤ r ≤ +1, measures the strength of linear association

• Least Squares Regression Line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, with $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, minimizes SSErr = $\sum (y_i - \hat{y}_i)^2$ = SSTot – SSReg (ANOVA)

• Coefficient of Determination: r² = SSReg / SSTot, the proportion of total variability modeled by the regression line’s variability.

All point estimates can be upgraded to CIs for hypothesis testing, etc.
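As one illustration of upgrading a point estimate to a confidence interval, the sketch below builds a 95% CI for the slope, assuming the usual form estimate ± t-quantile × standard error on n – 2 degrees of freedom, and compares it with base R's `confint()`. The data are the example values typed in by hand.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
n <- length(x)

lsreg <- lm(y ~ x)

# Built-in 95% confidence intervals for both coefficients.
ci <- confint(lsreg, level = 0.95)
print(ci)

# The slope's interval by hand: estimate +/- t quantile * standard error.
coefs <- coef(summary(lsreg))
est <- coefs["x", "Estimate"]
se  <- coefs["x", "Std. Error"]
half_width <- qt(0.975, df = n - 2) * se
by_hand <- c(est - half_width, est + half_width)
print(by_hand)
```

The slope's interval lies entirely below zero, in line with rejecting H0: β1 = 0 at the .05 level.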

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

“Response = Model + Error”

“main effects”: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$

R code example: lsreg = lm(y ~ x1 + x2 + x3)

(“polynomial regression”): $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$

R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))

“interactions”: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_1 X_2 + \varepsilon$

R code example: lsreg = lm(y ~ x1 + x2 + x1:x2)

R code example: lsreg = lm(y ~ x1*x2) (equivalent shorthand)
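One detail worth checking when writing polynomial terms: inside an R model formula, `^` is a formula operator (crossing terms up to a given degree), not arithmetic exponentiation, so `x^2` collapses to `x` unless it is wrapped in `I()`. The sketch below uses the example data (typed in by hand) to show the difference; the names `fit_plain`, `fit_cubic`, and `fit_poly` are illustrative.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

# Without I(), the powers are read as formula operators and reduce to y ~ x.
fit_plain <- lm(y ~ x + x^2 + x^3)
print(names(coef(fit_plain)))

# With I(), the powers are evaluated arithmetically: a genuine cubic.
fit_cubic <- lm(y ~ x + I(x^2) + I(x^3))
print(names(coef(fit_cubic)))

# poly(..., raw = TRUE) builds the same raw powers in one term.
fit_poly <- lm(y ~ poly(x, 3, raw = TRUE))
print(isTRUE(all.equal(unname(fitted(fit_cubic)), unname(fitted(fit_poly)))))
```

The first fit has only an intercept and a slope, while the other two have four coefficients each.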

Recall…

Multiple Linear Regression with interaction

Suppose these are actually two subgroups, requiring two distinct linear regressions!

Example in R (reformatted for brevity):

with an indicator (“dummy”) variable:

> I = c(1,1,1,1,1,0,0,0,0,0)

> lsreg = lm(y ~ x*I)

> summary(lsreg)

Coefficients:

Estimate

(Intercept) 6.56463

x 0.00998

I 6.80422

x:I 1.60858

I = 1: $\hat{y} = (6.56463 + 6.80422) + (0.00998 + 1.60858)\,x = 13.36885 + 1.61856\,x$

I = 0: $\hat{y} = 6.56463 + 0.00998\,x$
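To see that the interaction model really does encode two separate lines, the sketch below fits each subgroup on its own and compares the results with the combined fit. The indicator is renamed `grp` here (an assumption made for readability, to avoid confusion with R's `I()` function), so the coefficient labels read `grp` and `x:grp` instead of `I` and `x:I`.

```r
# Example data, entered directly.
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)

# Indicator: 1 for the first five points, 0 for the last five.
grp <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)

# Combined model with main effects and interaction.
lsreg <- lm(y ~ x * grp)
b <- coef(lsreg)
print(round(b, 5))

# Separate simple regressions within each subgroup.
fit0 <- lm(y ~ x, subset = grp == 0)
fit1 <- lm(y ~ x, subset = grp == 1)

# Subgroup 0: intercept and slope are the baseline coefficients.
line0 <- unname(c(b["(Intercept)"], b["x"]))
# Subgroup 1: add the indicator and interaction coefficients.
line1 <- unname(c(b["(Intercept)"] + b["grp"], b["x"] + b["x:grp"]))

print(rbind(combined = line0, separate = unname(coef(fit0))))
print(rbind(combined = line1, separate = unname(coef(fit1))))
```

The indicator coefficient shifts the intercept and the interaction coefficient shifts the slope, so one model recovers both subgroup lines.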