CHAPTER 7  Linear Correlation & Regression Methods

7.1 - Motivation

7.2 - Correlation / Simple Linear Regression

7.3 - Extensions of Simple Linear Regression


Parameter Estimation via SAMPLE DATA …

Testing for association between two POPULATION variables X and Y…

  • Categorical variables → Chi-squared Test

    Examples:
    X = Disease status (D+, D–), Y = Exposure status (E+, E–)
    X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

  • Numerical variables → ???????

    POPULATION PARAMETERS
    • Means: μ_X, μ_Y
    • Variances: σ_X², σ_Y²
    • Covariance: σ_XY = Cov(X, Y)
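A minimal sketch in R of the Chi-squared Test for the categorical case mentioned above, using hypothetical 2×2 counts for the disease/exposure example (the counts and the name tbl are ours, not from the slides):

> tbl = matrix(c(30, 20, 15, 35), nrow = 2, byrow = TRUE)
> dimnames(tbl) = list(Disease = c("D+", "D-"), Exposure = c("E+", "E-"))
> chisq.test(tbl)   # Chi-squared Test of association between two categorical variables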


Parameter Estimation via SAMPLE DATA …

  • Numerical variables → ???????

[Scatterplot: n sample data points (x_i, y_i), X on the horizontal axis, Y on the vertical axis; JAMA. 2003;290:1486-1493]

POPULATION PARAMETERS

  • Means: μ_X, μ_Y
  • Variances: σ_X², σ_Y²
  • Covariance: σ_XY

SAMPLE STATISTICS (n data points)

  • Means: x̄, ȳ
  • Variances: s_x², s_y²
  • Covariance: s_xy = Σ (x_i – x̄)(y_i – ȳ) / (n – 1)   (can be +, –, or 0)

Does the scatterplot suggest a linear trend between X and Y? If so, how do we measure it?


Testing for association between two population variables X and Y…

  • Numerical variables → LINEAR correlation

POPULATION PARAMETERS

  • Means: μ_X, μ_Y
  • Variances: σ_X², σ_Y²
  • Covariance: σ_XY
  • Linear Correlation Coefficient: ρ = σ_XY / (σ_X σ_Y)

Always between –1 and +1.


Parameter Estimation via SAMPLE DATA …

  • Numerical variables

SAMPLE STATISTICS (n data points)

  • Means: x̄, ȳ
  • Variances: s_x², s_y²
  • Covariance: s_xy   (can be +, –, or 0)
  • Linear Correlation Coefficient: r = s_xy / (s_x s_y)

Always between –1 and +1.

Example in R (reformatted for brevity), with n = 10:

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)
13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0

> c(mean(x), mean(y))
7.05 12.08
> var(x)
29.48944
> var(y)
43.76178
> cov(x, y)
-25.86667
> cor(x, y)
-0.7200451

> plot(x, y, pch = 19)   # scatterplot of the n = 10 data points
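As a quick check (a minimal sketch using the x and y above), r is just the covariance divided by the product of the standard deviations:

> cov(x, y) / (sd(x) * sd(y))   # -0.7200451, same as cor(x, y)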


Parameter Estimation via SAMPLE DATA …

  • Numerical variables

  • Linear Correlation Coefficient: r = s_xy / (s_x s_y)

Always between –1 and +1: r measures the strength of linear association.

      –1 ························ 0 ························ +1
      negative linear correlation | positive linear correlation

In our example:

> cor(x, y)
-0.7200451

[Scatterplot of the n data points; JAMA. 2003;290:1486-1493]


Testing for linear association between two numerical population variables X and Y…

  • Linear Correlation Coefficient

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ:

  H_0: ρ = 0 (no linear association)  vs.  H_A: ρ ≠ 0

Test Statistic for p-value:  t = r √(n – 2) / √(1 – r²), with n – 2 degrees of freedom under H_0.

In our example, t = –0.7200451 √8 / √(1 – 0.7200451²) = –2.935 on 8 df:

> 2 * pt(-2.935, 8)

p-value = .0189 < .05, so H_0 is rejected at the .05 level.
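A minimal sketch of this test in R, using the x and y from the example above (the names r, n, and tstat are ours; the last line uses the built-in cor.test):

> r = cor(x, y); n = length(x)
> tstat = r * sqrt(n - 2) / sqrt(1 - r^2)   # -2.935
> 2 * pt(-abs(tstat), df = n - 2)           # p-value = .0189
> cor.test(x, y)                            # built-in equivalent: same t statistic and p-value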


Parameter Estimation via SAMPLE DATA …

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

If such an association between X and Y exists (here r = cor(x, y) = –0.7200451), then for some intercept β_0 and slope β_1 we have

  “Response = Model + Error”:   Y = β_0 + β_1 X + ε

For each data point, the predictor x_i has an observed response y_i, a fitted response ŷ_i = β̂_0 + β̂_1 x_i on the line, and a residual e_i = y_i – ŷ_i.

Find estimates β̂_0 and β̂_1 for the “best” line. Best in what sense??? The “Least Squares Regression Line” is the line that minimizes the sum of the squared residuals,

  SSErr = Σ (y_i – ŷ_i)².

The resulting least squares estimates are

  β̂_1 = s_xy / s_x²   and   β̂_0 = ȳ – β̂_1 x̄.

Check: in our example, β̂_1 = –25.86667 / 29.48944 = –0.8772 and β̂_0 = 12.08 – (–0.8772)(7.05) = 18.264, matching the R output below.
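A minimal sketch, assuming the x and y above (the names b1 and b0 are ours): the least squares estimates computed directly from the sample statistics.

> b1 = cov(x, y) / var(x)       # -25.86667 / 29.48944 = -0.8772
> b0 = mean(y) - b1 * mean(x)   # 12.08 - (-0.8772)*7.05 = 18.264
> coef(lm(y ~ x))               # same estimates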


Testing for linear association between two numerical population variables X and Y…

  • Linear Regression Coefficients

“Response = Model + Error”

Now that we have these estimates, we can conduct HYPOTHESIS TESTING on β_0 and β_1:

  H_0: β_1 = 0 (no linear association)  vs.  H_A: β_1 ≠ 0

Test Statistic for p-value:  t = β̂_1 / SE(β̂_1), with n – 2 degrees of freedom under H_0.

In our example, t = –0.8772 / 0.2989 = –2.935 on 8 df, so p-value = .0189.

Same t-score as the test of H_0: ρ = 0!
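A minimal sketch, assuming x, y, b0, and b1 from above (the names yhat, s.e, and se.b1 are ours): the slope's standard error and t statistic by hand, matching the summary(lsreg) output below.

> yhat = b0 + b1 * x
> s.e = sqrt(sum((y - yhat)^2) / (length(x) - 2))   # residual standard error = 4.869
> se.b1 = s.e / sqrt(sum((x - mean(x))^2))          # SE of the slope = 0.2989
> b1 / se.b1                                        # t = -2.935
> 2 * pt(-2.935, 8)                                 # p-value = .0189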


> plot(x, y, pch = 19)

> lsreg = lm(y ~ x) # or lsfit(x,y)

> abline(lsreg)

> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???

Because this second method generalizes…


Testing for linear association between a population response variable Y and multiple predictor variables X_1, X_2, X_3, … etc.

Multilinear Regression

“Response = Model + Error”:   Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + … + ε   (“main effects”)

For now, assume the “additive model,” i.e., main effects only.


[Figure: regression plane fitted over two predictors X_1 and X_2 (axes X_1, X_2, Y). For each pair of predictors (x_1i, x_2i), the figure marks the true response y_i, the fitted response ŷ_i on the plane, and the residual y_i – ŷ_i.]

Multilinear Regression

Least Squares calculation of the regression coefficients is computer-intensive; the closed-form formulas require Linear Algebra (matrices)!

Once calculated, how do we then test the null hypothesis?

ANOVA
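A minimal sketch of the matrix form of the Least Squares calculation just mentioned, β̂ = (XᵀX)⁻¹Xᵀy, using hypothetical predictors x1 and x2 and response y2 that are ours, not the slides' data (lm() itself uses a numerically stabler QR decomposition):

> set.seed(1)
> x1 = runif(10, 0, 20); x2 = runif(10, 0, 20)
> y2 = 5 + 2*x1 - 0.5*x2 + rnorm(10)
> X = cbind(1, x1, x2)                # design matrix with a column of 1s for the intercept
> solve(t(X) %*% X, t(X) %*% y2)      # beta-hat = (X'X)^(-1) X'y
> coef(lm(y2 ~ x1 + x2))              # same estimates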


ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

  Source        df          SS     MS                F              p-value
  Regression    k – 1 = 1   ?      SSReg / (k – 1)   MSReg / MSErr  ?
  Error         n – k = 8   ?      SSErr / (n – k)
  Total         n – 1 = 9   ?


Parameter Estimation via SAMPLE DATA …

[Scatterplot of the n data points with the fitted least squares line; JAMA. 2003;290:1486-1493]

  • SSTot = Σ (y_i – ȳ)² is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting). Note that SSTot = (n – 1) s_y².

  • SSReg = Σ (ŷ_i – ȳ)² is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

  • SSErr = Σ (y_i – ŷ_i)² is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).




SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

In our example (with r = cor(x, y) = –0.7200451):

  SSReg = Σ (ŷ_i – ȳ)² = 204.200
  SSErr = Σ (y_i – ŷ_i)² = 189.656   (the minimum possible over all candidate lines)
  SSTot = Σ (y_i – ȳ)² = 9 (43.76178) = 393.856

  SSTot = SSReg + SSErr   (393.856 = 204.200 + 189.656)
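A minimal sketch, assuming x, y, and lsreg = lm(y ~ x) from the example above (the names yhat, SSTot, SSReg, and SSErr are ours): the three sums of squares and their decomposition.

> yhat = fitted(lsreg)
> SSTot = sum((y - mean(y))^2)      # 393.856  (= 9 * var(y))
> SSReg = sum((yhat - mean(y))^2)   # 204.20
> SSErr = sum((y - yhat)^2)         # 189.656
> all.equal(SSTot, SSReg + SSErr)   # TRUE: the decomposition holds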


ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

  Source        df   SS        MS        F        p-value
  Regression    1    204.20    204.201   8.6135   0.01886
  Error         8    189.656    23.707
  Total         9    393.856

Same as before!  (Same p-value = .0189 as the t-tests above.)


> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707
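The F statistic can be recovered by hand from the table above (a minimal sketch; the names MSReg and MSErr are ours):

> MSReg = 204.201 / 1                    # SSReg / df
> MSErr = 189.656 / 8                    # SSErr / df = 23.707
> MSReg / MSErr                          # F = 8.6135
> pf(8.6135, 1, 8, lower.tail = FALSE)   # p-value = 0.01886, same as the t-test on the slope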


Coefficient of Determination

  r² = SSReg / SSTot = 204.20 / 393.856 = 0.5185

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Moreover, r² is literally the square of the linear correlation coefficient:

> cor(x, y)
-0.7200451

  (–0.7200451)² = 0.5185
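A minimal sketch, assuming x, y, and lsreg from above: the coefficient of determination agrees with the squared correlation and with the lm() output.

> cor(x, y)^2                # 0.5185
> 204.20 / 393.856           # SSReg / SSTot = 0.5185
> summary(lsreg)$r.squared   # 0.5185, the "Multiple R-squared" in summary(lsreg)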




Summary of Linear Correlation and Simple Linear Regression

Given: n sample data points (x_i, y_i) on two numerical variables X and Y, with

  Means: x̄, ȳ     Variances: s_x², s_y²     Covariance: s_xy

[Scatterplot of the n data points; JAMA. 2003;290:1486-1493]

  • Linear Correlation Coefficient: r = s_xy / (s_x s_y), with –1 ≤ r ≤ +1;
    measures the strength of linear association.

  • Least Squares Regression Line: ŷ = β̂_0 + β̂_1 x;
    minimizes SSErr = Σ (y_i – ŷ_i)² = SSTot – SSReg   (ANOVA).

  • Coefficient of Determination: r² = SSReg / SSTot, the
    proportion of total variability modeled by the regression line’s variability.

All point estimates can be upgraded to CIs for hypothesis testing, etc.


Testing for linear association between a population response variable Y and multiple predictor variables X_1, X_2, X_3, … etc.

Multilinear Regression

“Response = Model + Error”

  • “main effects”:   Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ε
    R code example: lsreg = lm(y ~ x1 + x2 + x3)

  • quadratic terms, etc. (“polynomial regression”):   Y = β_0 + β_1 X + β_2 X² + β_3 X³ + ε
    R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))
    (Inside a model formula, powers must be protected with I(); equivalently, lm(y ~ poly(x, 3, raw = TRUE)).)

  • “interactions”:   Y = β_0 + β_1 X_1 + β_2 X_2 + β_12 X_1 X_2 + ε
    R code example: lsreg = lm(y ~ x1 + x2 + x1:x2)
    R code example: lsreg = lm(y ~ x1*x2)   (equivalent shorthand)


Recall…

Multiple Linear Regression with interaction, using an indicator (“dummy”) variable

Suppose the data points are actually two subgroups (labeled I = 1 and I = 0 on the scatterplot), requiring two distinct linear regressions!

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858
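A minimal sketch, assuming the fitted model lsreg from this slide (the name b is ours): the interaction model implies one regression line per subgroup, read off from the coefficients.

> b = coef(lsreg)                 # (Intercept), x, I, x:I
> c(b[1], b[2])                   # I = 0 line:  y-hat =  6.565 + 0.010 x
> c(b[1] + b[3], b[2] + b[4])     # I = 1 line:  y-hat = 13.369 + 1.619 x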


Logistic Reg, Transformations

