CHAPTER 7: Linear Correlation & Regression Methods

7.1 - Motivation

7.2 - Correlation / Simple Linear Regression

7.3 - Extensions of Simple Linear Regression

slide2

Parameter Estimation via SAMPLE DATA …

Testing for association between two POPULATION variables X and Y…

  • Categorical variables → Chi-squared Test (see the sketch below)

    Examples:
    X = Disease status (D+, D–), Y = Exposure status (E+, E–)
    X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

  • Numerical variables → ???????

    PARAMETERS
    • Means: μ_X, μ_Y
    • Variances: σ_X², σ_Y²
    • Covariance: σ_XY
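For the categorical case, a minimal R sketch (the 2×2 table of counts below is hypothetical, not from the slides) shows how such an association would be tested:

# Hypothetical 2x2 table of counts for Disease status vs. Exposure status
tab <- matrix(c(30, 20,      # D+ row: E+, E-
                15, 35),     # D- row: E+, E-
              nrow = 2, byrow = TRUE,
              dimnames = list(Disease = c("D+", "D-"), Exposure = c("E+", "E-")))
chisq.test(tab)              # tests H0: X and Y are independent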

slide3

Parameter Estimation via SAMPLE DATA …

  • Numerical variables → ???????

                  PARAMETERS (population)      STATISTICS (sample)
  Means:          μ_X, μ_Y                     x̄, ȳ
  Variances:      σ_X², σ_Y²                   s_x², s_y²
  Covariance:     σ_XY                         s_xy

  (The covariance can be +, –, or 0.)

slide4

Parameter Estimation via SAMPLE DATA …

  • Numerical variables → ???????

                  PARAMETERS (population)      STATISTICS (sample)
  Means:          μ_X, μ_Y                     x̄, ȳ
  Variances:      σ_X², σ_Y²                   s_x², s_y²
  Covariance:     σ_XY                         s_xy

  (The covariance can be +, –, or 0.)

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

slide5

Parameter Estimation via SAMPLE DATA …

  • Numerical variables → ???????

                  PARAMETERS (population)      STATISTICS (sample)
  Means:          μ_X, μ_Y                     x̄, ȳ
  Variances:      σ_X², σ_Y²                   s_x², s_y²
  Covariance:     σ_XY                         s_xy

  (The covariance can be +, –, or 0.)

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

Does this suggest a linear trend between X and Y?

If so, how do we measure it?

slide6

Testing for LINEAR association between two population variables X and Y…

  • Numerical variables → ???????

PARAMETERS

  • Means: μ_X, μ_Y
  • Variances: σ_X², σ_Y²
  • Covariance: σ_XY
  • Linear Correlation Coefficient: ρ = σ_XY / (σ_X σ_Y)

Always between –1 and +1

slide7

Parameter Estimation via SAMPLE DATA …

  • Numerical variables → ???????

                  PARAMETERS (population)      STATISTICS (sample)
  Means:          μ_X, μ_Y                     x̄, ȳ
  Variances:      σ_X², σ_Y²                   s_x², s_y²
  Covariance:     σ_XY                         s_xy   (can be +, –, or 0)

  • Linear Correlation Coefficient: ρ = σ_XY / (σ_X σ_Y), estimated by r = s_xy / (s_x s_y)

Always between –1 and +1

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

slide8

Parameter Estimation via SAMPLE DATA …

  • Numerical variables

Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
1.1 1.8 2.1 3.7 4.0 7.3 9.1 11.9 12.4 17.1
> y = sample(pop, 10)
13.1 18.3 17.6 19.1 19.3 3.2 5.6 13.6 8.0 3.0

Sample statistics (n = 10 data points):

> c(mean(x), mean(y))
7.05 12.08
> var(x)
29.48944
> var(y)
43.76178
> cov(x, y)
-25.86667            (can be +, –, or 0)
> cor(x, y)
-0.7200451           (always between –1 and +1)

> plot(x, y, pch = 19)

[Scatterplot of y vs. x, n = 10 data points; JAMA. 2003;290:1486-1493]
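As a quick check (a minimal sketch, not part of the original slides), the sample correlation is just the covariance rescaled by both standard deviations:

# r = s_xy / (s_x * s_y): the covariance rescaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual                         # -0.7200451
all.equal(r_manual, cor(x, y))   # TRUE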

slide9

Parameter Estimation via SAMPLE DATA …

  • Numerical variables
  • Linear Correlation Coefficient: r = s_xy / (s_x s_y)

Always between –1 and +1

r measures the strength of linear association.

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

slide10

Parameter Estimation via SAMPLE DATA …

  • Numerical variables
  • Linear Correlation Coefficient: r = s_xy / (s_x s_y)

Always between –1 and +1

r measures the strength of linear association: it ranges from –1 (negative linear correlation) through 0 to +1 (positive linear correlation).

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]


slide13

Parameter Estimation via SAMPLE DATA …

  • Numerical variables
  • Linear Correlation Coefficient: r = s_xy / (s_x s_y)

Always between –1 and +1

r measures the strength of linear association: it ranges from –1 (negative linear correlation) through 0 to +1 (positive linear correlation). For the example data:

> cor(x, y)
-0.7200451

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

slide14

Testing for linear association between two numerical population variables X and Y…

  • Linear Correlation Coefficient

Now that we have r, we can conduct HYPOTHESIS TESTING on ρ:

  H0: ρ = 0 (no linear association)  vs.  HA: ρ ≠ 0

Test statistic for the p-value:

  t = r √(n – 2) / √(1 – r²),  with n – 2 degrees of freedom

Here t = –0.7200451 · √8 / √(1 – 0.7200451²) = –2.935 on 8 degrees of freedom, so

p-value = 2 * pt(-2.935, 8) = .0189 < .05, and H0 is rejected at the .05 level.
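A minimal R check of this calculation (not from the original slides); the built-in cor.test() carries out the same test:

# t statistic for H0: rho = 0, computed from r and n
r <- cor(x, y)                       # -0.7200451
n <- length(x)                       # 10
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                               # approximately -2.935
2 * pt(-abs(t_stat), df = n - 2)     # two-sided p-value, approximately 0.0189

cor.test(x, y)                       # same t and p-value from the built-in test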

slide15

Parameter Estimation via SAMPLE DATA …

  • Linear Correlation Coefficient:

> cor(x, y)
-0.7200451

r measures the strength of linear association.

If such an association between X and Y exists, then it follows that for any intercept β0 and slope β1, we have

  Y = β0 + β1 X + ε          “Response = Model + Error”

where ε represents the residual error. Find estimates β̂0 and β̂1 for the “best” line. Best in what sense???

slide16

Parameter Estimation via SAMPLE DATA …

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

If such an association between X and Y exists, then it follows that for any intercept β0 and slope β1, we have

  Y = β0 + β1 X + ε          “Response = Model + Error”

> cor(x, y)
-0.7200451

Find estimates β̂0 and β̂1 for the “best” line, the “Least Squares Regression Line”

  ŷ = β̂0 + β̂1 x,

i.e., the line that minimizes the sum of the squared residuals

  SSErr = Σ (yi – ŷi)².

slide17

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

Model:  Y = β0 + β1 X + ε          “Response = Model + Error”

> cor(x, y)
-0.7200451

Find estimates β̂0 and β̂1 for the line ŷ = β̂0 + β̂1 x that minimizes SSErr = Σ (yi – ŷi)².

Setting the partial derivatives of SSErr with respect to β̂0 and β̂1 equal to zero and solving gives

  β̂1 = s_xy / s_x²   and   β̂0 = ȳ – β̂1 x̄.

Check ✓ (these values do minimize SSErr).
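A brief sketch (not from the slides) verifying these formulas against R's built-in fit for the example data:

# Least squares estimates computed directly from the sample statistics
b1 <- cov(x, y) / var(x)        # slope:     -0.8772
b0 <- mean(y) - b1 * mean(x)    # intercept: 18.2639
c(b0, b1)

coef(lm(y ~ x))                 # lm() returns the same estimates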

slide18

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

Terminology, for each data point (xi, yi), i = 1, …, n:

  • predictor: xi
  • observed response: yi
  • fitted response: ŷi = β̂0 + β̂1 xi
  • residual: ei = yi – ŷi

> cor(x, y)
-0.7200451

The least squares line is the one that minimizes the sum of the squared residuals, SSErr = Σ (yi – ŷi)².


slide23

Testing for linear association between two numerical population variables X and Y…

  • Linear Regression Coefficients

  Y = β0 + β1 X + ε          “Response = Model + Error”

Now that we have these estimates, we can conduct HYPOTHESIS TESTING on β0 and β1, e.g.

  H0: β1 = 0 (no linear association)  vs.  HA: β1 ≠ 0

Test statistic for the p-value?

slide25

Testing for linear association between two numerical population variables X and Y…

  • Linear Regression Coefficients

Now that we have these estimates, we can conduct HYPOTHESIS TESTING on β0 and β1.

Test statistic for the p-value:

  t = β̂1 / SE(β̂1),  with n – 2 degrees of freedom under H0: β1 = 0

Here t = –0.8772 / 0.2989 = –2.935 on 8 degrees of freedom, so p-value = .0189.

Same t-score as for H0: ρ = 0!
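A small sketch (not from the slides) confirming that equivalence numerically for the example data:

# t statistic for H0: beta1 = 0, taken from the regression output
lsreg <- lm(y ~ x)
t_beta <- coef(summary(lsreg))["x", "t value"]     # -2.935

# t statistic for H0: rho = 0, computed from r alone
r <- cor(x, y)
t_rho <- r * sqrt(length(x) - 2) / sqrt(1 - r^2)   # -2.935

c(t_beta, t_rho)                                   # identical (up to rounding)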

slide26

> plot(x, y, pch = 19)

> lsreg = lm(y ~ x) # or lsfit(x,y)

> abline(lsreg)

> summary(lsreg)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-8.6607 -3.2154 0.8954 3.4649 5.7742

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 18.2639 2.6097 6.999 0.000113 ***

x -0.8772 0.2989 -2.935 0.018857 *

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom

Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583

F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???

Because this second method generalizes…

slide27

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

  Y = β0 + β1 X1 + β2 X2 + β3 X3 + … + ε          “Response = Model + Error”

The βi Xi terms are the “main effects.” For now, assume the “additive model,” i.e., main effects only.

slide28

[Figure: regression plane in (X1, X2, Y) space, showing for predictors (x1i, x2i) the true response yi, the fitted response ŷi on the plane, and the residual between them]

Multilinear Regression

Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)!

Once calculated, how do we then test the null hypothesis?

ANOVA
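To illustrate the linear-algebra remark, a hedged sketch (the predictors x1, x2 and response y2 below are hypothetical, not the slide data): the least squares coefficients solve the normal equations (XᵀX) β̂ = Xᵀ y.

# Hypothetical data for a two-predictor regression
set.seed(1)
x1 <- runif(10, 0, 20)
x2 <- runif(10, 0, 20)
y2 <- 3 + 2 * x1 - x2 + rnorm(10)

X <- cbind(1, x1, x2)                       # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y2)  # solves the normal equations (X'X) b = X'y
drop(beta_hat)

coef(lm(y2 ~ x1 + x2))                      # lm() gives the same estimates (via QR, more stably)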

slide31

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

  Source        df        SS      MS      F       p-value
  Regression    k – 1     ?       ?       ?       ?
  Error         n – k     ?       ?
  Total         n – 1     ?

slide32

Parameter Estimation via SAMPLE DATA …

STATISTICS

  • Means: x̄, ȳ
  • Variances: s_x², s_y²

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

SSTot = Σ (yi – ȳ)² = (n – 1) s_y² is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).

slide33

Parameter Estimation via SAMPLE DATA …

STATISTICS

  • Means: x̄, ȳ
  • Variances: s_x², s_y²

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

SSReg = Σ (ŷi – ȳ)² is a measure of the total amount of variability in the fitted responses (i.e., after model-fitting).

slide34

Parameter Estimation via SAMPLE DATA …

STATISTICS

  • Means: x̄, ȳ
  • Variances: s_x², s_y²

[Scatterplot of Y vs. X, n data points; source: JAMA. 2003;290:1486-1493]

SSErr = Σ (yi – ŷi)² is a measure of the total amount of variability in the resulting residuals (i.e., after model-fitting).

slide35

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

  Source        df        SS        MS      F       p-value
  Regression    k – 1     SSReg     ?       ?       ?
  Error         n – k     SSErr     ?
  Total         n – 1     SSTot

slide36

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

slide37

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

For the example data (predictor x, observed responses y, fitted responses ŷ, residuals y – ŷ):

> cor(x, y)
-0.7200451

  SSReg = Σ (ŷi – ȳ)²  = 204.2
  SSErr = Σ (yi – ŷi)² = 189.656
  SSTot = Σ (yi – ȳ)²  = 9 (43.76178) = 393.856

slide38

SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

For the example data:

  SSReg = 204.2
  SSErr = 189.656   (the minimum possible value; this is exactly what least squares minimizes)
  SSTot = 393.856

> cor(x, y)
-0.7200451

  SSTot = SSReg + SSErr
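A short sketch (not from the slides) reproducing this decomposition from the fitted model:

# Sum-of-squares decomposition for the fitted simple linear regression
lsreg  <- lm(y ~ x)
y_hat  <- fitted(lsreg)

SS_reg <- sum((y_hat - mean(y))^2)    # 204.2
SS_err <- sum((y - y_hat)^2)          # 189.656, the same as sum(resid(lsreg)^2)
SS_tot <- sum((y - mean(y))^2)        # 393.856, the same as (length(y) - 1) * var(y)

c(SS_reg, SS_err, SS_tot, SS_reg + SS_err)   # SSTot = SSReg + SSErr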

slide39

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points.

slide40

ANOVA Table

In our example, k = 2 regression coefficients and n = 10 data points:

  Source        df    SS         MS         F        p-value
  Regression    1     204.2      204.201    8.614    0.0189
  Error         8     189.656    23.707
  Total         9     393.856

Same p-value as before!

slide41

> summary(aov(lsreg))

Df Sum Sq Mean Sq F value Pr(>F)

x 1 204.20 204.201 8.6135 0.01886 *

Residuals 8 189.66 23.707
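As a check (a sketch, not from the slides), the F statistic is the ratio of mean squares; for simple linear regression it equals the square of the earlier t statistic, (–2.935)² ≈ 8.614:

# F statistic and p-value computed by hand from the sums of squares
lsreg  <- lm(y ~ x)
SS_reg <- sum((fitted(lsreg) - mean(y))^2)
SS_err <- sum(resid(lsreg)^2)

F_stat <- (SS_reg / 1) / (SS_err / 8)              # MSReg / MSErr = 8.614
pf(F_stat, df1 = 1, df2 = 8, lower.tail = FALSE)   # 0.0189, matching Pr(>F) above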

slide42

Coefficient of Determination

  R² = SSReg / SSTot = 204.2 / 393.856 = 0.5185

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

slide43

> cor(x, y)
-0.7200451

Coefficient of Determination

  R² = SSReg / SSTot = 0.5185

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Moreover, r² = (–0.7200451)² = 0.5185 = R²: in simple linear regression, the coefficient of determination is the square of the linear correlation coefficient.
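A one-line check in R (not from the slides):

cor(x, y)^2                      # 0.5185
summary(lm(y ~ x))$r.squared     # 0.5185, the "Multiple R-squared" reported by summary()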

slide44

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)
> abline(lsreg)
> summary(lsreg)

(full output as on the earlier slide; the relevant lines are)

Multiple R-squared: 0.5185, Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF, p-value: 0.01886

Coefficient of Determination

The “Multiple R-squared” reported by summary(lsreg) is exactly this quantity: the least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining.

Summary of Linear Correlation and Simple Linear Regression

Given: a sample of n data points (x, y), with

  • Means: x̄, ȳ
  • Variances: s_x², s_y²
  • Covariance: s_xy

  • Linear Correlation Coefficient: r = s_xy / (s_x s_y), with –1 ≤ r ≤ +1;
    measures the strength of linear association

  • Least Squares Regression Line: ŷ = β̂0 + β̂1 x, with β̂1 = s_xy / s_x² and β̂0 = ȳ – β̂1 x̄;
    minimizes SSErr = Σ (yi – ŷi)² = SSTot – SSReg   (ANOVA)

  • Coefficient of Determination: R² = SSReg / SSTot, the proportion of total variability modeled by the regression line’s variability; in simple linear regression, R² = r²

All point estimates can be upgraded to CIs for hypothesis testing, etc.

[Scatterplot of Y vs. X with fitted line; source: JAMA. 2003;290:1486-1493]
slide47

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

  Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε          “Response = Model + Error”

The βi Xi terms are the “main effects.”

R code example: lsreg = lm(y ~ x1 + x2 + x3)

slide48

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

  Y = β0 + β1 X + β2 X² + β3 X³ + ε          “Response = Model + Error”

The model may also include quadratic terms, cubic terms, etc. (“polynomial regression”); it is still linear in the coefficients.

R code example: lsreg = lm(y ~ x1 + x2 + x3)

R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))
(The I() wrappers are needed: inside a model formula, x^2 on its own is not interpreted as the square of x.)
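A brief hedged sketch of a polynomial fit (the data below are hypothetical, not the slide data), showing two equivalent ways to specify the powers:

set.seed(2)
xp <- runif(30, 0, 10)                                # hypothetical predictor
yp <- 1 + 0.5*xp - 0.2*xp^2 + 0.01*xp^3 + rnorm(30)   # hypothetical response

fit1 <- lm(yp ~ xp + I(xp^2) + I(xp^3))               # explicit powers via I()
fit2 <- lm(yp ~ poly(xp, 3, raw = TRUE))              # equivalent fit via poly()
all.equal(unname(coef(fit1)), unname(coef(fit2)))     # TRUE: same coefficients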

slide49

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression

  Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε          “Response = Model + Error”

Besides the “main effects” and quadratic terms, etc. (“polynomial regression”), the model may include “interactions” such as X1 X2 (see the sketch below).

R code example: lsreg = lm(y ~ x + I(x^2) + I(x^3))

R code example: lsreg = lm(y ~ x1 + x2 + x1:x2)

R code example: lsreg = lm(y ~ x1*x2)
(x1*x2 is shorthand that expands to x1 + x2 + x1:x2.)
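A quick check of that shorthand (a sketch reusing the hypothetical x1, x2, y2 defined in the earlier matrix-algebra sketch):

m1 <- lm(y2 ~ x1 + x2 + x1:x2)
m2 <- lm(y2 ~ x1 * x2)           # x1*x2 expands to x1 + x2 + x1:x2
all.equal(coef(m1), coef(m2))    # TRUE: identical fitted coefficients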

slide54

Recall the example data… Suppose these are actually two subgroups, requiring two distinct linear regressions!

Multiple Linear Regression with interaction, using an indicator (“dummy”) variable.

Example in R (reformatted for brevity):

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)  6.56463
x            0.00998
I            6.80422
x:I          1.60858

[Figure: scatterplot with separate fitted lines for the I = 1 and I = 0 subgroups]
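Reading the two subgroup lines off the interaction model (a sketch, not from the slides; it uses the x, y, and I already defined above): for the I = 0 subgroup the intercept and slope are the base coefficients, and for the I = 1 subgroup the I and x:I coefficients are added on.

lsreg <- lm(y ~ x*I)
b <- coef(lsreg)

rbind(
  "I = 0" = c(intercept = unname(b["(Intercept)"]),           slope = unname(b["x"])),
  "I = 1" = c(intercept = unname(b["(Intercept)"] + b["I"]),  slope = unname(b["x"] + b["x:I"]))
)
# For the slide's data: the I = 0 line is about 6.565 + 0.010 x, the I = 1 line about 13.369 + 1.619 x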
