Lecture #9

  • Bivariate data
  • Correlation
  • Coefficient of Determination
  • Regression
  • One-way Analysis of Variance (ANOVA)

Bivariate Data
  • Bivariate data are just what they sound like – data with measurements on two variables; let's call them X and Y.
  • Here, we will look at two continuous variables.
  • We want to explore the relationship between the two variables.
  • Example: fasting blood glucose and ventricular shortening velocity.
Scatterplot
  • We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram)
  • Plots values of one variable on the horizontal axis and values of the other on the vertical axis
  • Can be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)
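To make this concrete, here is a minimal Python sketch of a scatterplot; the glucose and velocity values below are invented for illustration:

```python
# A minimal sketch of a scatterplot; the data values are invented.
import matplotlib.pyplot as plt

glucose = [5.3, 6.1, 7.8, 9.2, 10.5, 12.0]       # fasting blood glucose (X)
velocity = [1.10, 1.25, 1.30, 1.45, 1.50, 1.60]  # ventricular shortening velocity (Y)

plt.scatter(glucose, velocity)
plt.xlabel("Fasting blood glucose")
plt.ylabel("Ventricular shortening velocity")
plt.title("Scatterplot of bivariate data")
plt.show()
```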
Numerical Summary
  • Typically, a bivariate data set is summarized numerically with 5 summary statistics
  • These provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse
  • We can summarize each variable separately: X mean, X SD; Y mean, Y SD
  • But these four numbers don't tell us how the values of X and Y vary together – the fifth statistic, Pearson's correlation r, does that (see the sketch below)
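A sketch of computing these summary statistics in Python, using the same invented data as above; r itself is introduced on the next slide:

```python
# A sketch of the five summary statistics for a bivariate data set.
import numpy as np

x = np.array([5.3, 6.1, 7.8, 9.2, 10.5, 12.0])
y = np.array([1.10, 1.25, 1.30, 1.45, 1.50, 1.60])

print(x.mean(), x.std(ddof=1))   # X mean, X SD (sample SD)
print(y.mean(), y.std(ddof=1))   # Y mean, Y SD
print(np.corrcoef(x, y)[0, 1])   # the fifth statistic: Pearson's r
```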
Pearson's Correlation Coefficient "r"
  • "r" indicates…
    • strength of relationship (strong, weak, or none)
    • direction of relationship
      • positive (direct) – variables move in same direction
      • negative (inverse) – variables move in opposite directions
  • r ranges in value from –1.0 to +1.0

–1.0 (strong negative) … 0.0 (no relationship) … +1.0 (strong positive)

Correlation (cont)

Correlation quantifies the strength and direction of the linear relationship between two variables. The usual (Pearson) formula is

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / [(n - 1) sₓ s_y]

where sₓ and s_y are the sample standard deviations of X and Y.

What r is...
  • r is a measure of LINEAR ASSOCIATION
  • The closer r is to –1 or 1, the more tightly the points on the scatterplot are clustered around a line
  • The sign of r (+ or –) is the same as the sign of the slope of the line
  • When r = 0, the points are not LINEARLY ASSOCIATED – this does NOT mean there is NO ASSOCIATION
...and what r is not
  • r is a measure of LINEAR ASSOCIATION
  • r does NOT tell us if Y is a function of X
  • r does NOT tell us if X causes Y
  • r does NOT tell us if Y causes X
  • r does NOT tell us what the scatterplot looks like
Correlation is NOT causation
  • You cannot infer that since X and Y are highly correlated (r close to –1 or 1) that X is causing a change in Y
  • Y could be causing X
  • X and Y could both be varying along with a third, possibly unknown factor (either causal or not)
Reading Correlation Matrix

r = -.904

p = .013 – the probability of getting a correlation this size by sheer chance. Reject H0 if p ≤ .05.

The number in parentheses is the degrees of freedom (n - 2), which reflects the sample size. Reported in standard form:

r(4) = -.904, p < .05
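A sketch of reproducing a correlation-matrix entry in Python; the six (x, y) pairs are hypothetical, chosen only so that df = n - 2 = 4, matching the r(4) report above:

```python
# A sketch of computing r and its p-value as a correlation matrix reports them.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.9, 7.1, 7.4, 6.0, 5.2])   # hypothetical values

r, p = stats.pearsonr(x, y)
print(f"r({len(x) - 2}) = {r:.3f}, p = {p:.3f}")  # reject H0 if p <= .05
```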

Interpretation of Correlation

Correlations

  • from 0 to 0.25 (0 to -0.25) = little or no relationship;
  • from 0.25 to 0.50 (-0.25 to -0.50) = fair degree of relationship;
  • from 0.50 to 0.75 (-0.50 to -0.75) = moderate to good relationship;
  • greater than 0.75 (or less than -0.75) = very good to excellent relationship.
Limitations of Correlation
  • linearity:
    • can't describe non-linear relationships
    • e.g., the relation between anxiety & performance
  • truncation of range:
    • underestimates the strength of the relationship if you can't see the full range of x values
  • no proof of causation
    • third variable problem:
      • a 3rd variable could be causing the change in both variables
      • directionality: can't be sure which way causality "flows"
Coefficient of Determination r²
  • The square of the correlation, r², is the proportion of variation in the values of y that is explained by the regression model with x.
  • Amount of variance accounted for in y by x
  • Percentage increase in accuracy you gain by using the regression line to make predictions
  • 0 ≤ r² ≤ 1.
  • The larger r², the stronger the linear relationship.
  • The closer r² is to 1, the more confident we are in our prediction.
Linear Regression
  • Correlation measures the direction and strength of the linear relationship between two quantitative variables
  • A regression line
    • summarizes the relationship between two variables if the form of the relationship is linear.
    • describes how a response variable y changes as an explanatory variable x changes.
    • is often used as a mathematical model to predict the value of a response variable y based on a value of an explanatory variable x.
(Simple) Linear Regression
  • Refers to drawing a (particular, special) line through a scatterplot
  • Used for 2 broad purposes:
    • Estimation
    • Prediction
Formula for Linear Regression

y = bx + a

  • b = slope, the change in y for every unit change in x
  • a = y-intercept, the value of y when x = 0
  • y = the variable plotted on the vertical axis
  • x = the variable plotted on the horizontal axis

Interpretation of parameters
  • The regression slope is the average change in Y when X increases by 1 unit
  • The intercept is the predicted value for Y when X = 0
  • If the slope = 0, then X does not help in predicting Y (linearly)
Which line?
  • There are many possible lines that could be drawn through the cloud of points in the scatterplot.
Least Squares
  • Q: Where does this equation come from?
  • A: It is the line that is 'best' in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction

[Figure: scatterplot with the fitted line; the vertical gaps between the points and the line are the errors]
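A sketch of the closed-form least-squares estimates, using the same invented data as earlier; scipy's linregress is shown only as a cross-check, and its rvalue squared gives the coefficient of determination r² from the earlier slide:

```python
# A sketch of least squares: b and a minimize sum((y_i - (b*x_i + a))**2),
# the squared vertical errors. Data values are invented.
import numpy as np
from scipy import stats

x = np.array([5.3, 6.1, 7.8, 9.2, 10.5, 12.0])
y = np.array([1.10, 1.25, 1.30, 1.45, 1.50, 1.60])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

fit = stats.linregress(x, y)                     # cross-check
print(b, a)
print(fit.slope, fit.intercept, fit.rvalue ** 2)  # rvalue**2 is r^2
```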

Linear Regression

Question: What is the relationship between U.K. and U.S. stock returns?

  • y variable: U.K. monthly return
  • x variable: U.S. monthly return

Linear Regression

A regression creates a model of the relationship between x and y. It fits a line to the scatter plot by minimizing the sum of the squared vertical distances between the y values and the line.

If the correlation is significant, then create a regression analysis.

Linear Regression

The slope is calculated as:

b = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

It tells you the change in the dependent variable for every unit change in the independent variable.

The coefficient of determination or R-square measures the variation explained by the best-fit line as a percent of the total variation:

R² = SSR / SST  (explained variation / total variation)

Regression Graphic – Regression Line

[Graph of the regression line: reading off the line, if x = 18 then y' ≈ 47; if x = 24 then y' ≈ 20]
Regression Equation
  • y' = bx + a
    • y' = predicted value of y
    • b = slope of the line
    • x = value of x that you plug in
    • a = y-intercept (where the line crosses the y-axis)
  • In this case…
    • y' = -4.263(x) + 125.401
  • So if the distance is 20 feet
    • y' = -4.263(20) + 125.401
    • y' = -85.26 + 125.401
    • y' = 40.141
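A tiny sketch that plugs the slide's fitted coefficients into the prediction equation:

```python
# A sketch using the slide's example coefficients in y' = bx + a.
def predict(x, b=-4.263, a=125.401):
    """Predicted y' for a given x, using the slide's fitted coefficients."""
    return b * x + a

print(predict(20))  # -85.26 + 125.401 = 40.141
```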
SPSS Regression Set-up
  • "Criterion": the y-axis variable, what you're trying to predict
  • "Predictor": the x-axis variable, what you're basing the prediction on
Getting Regression Info from SPSS

[SPSS coefficients output: b is the coefficient for the predictor; a is the constant]

y' = b(x) + a
y' = -4.263(20) + 125.401

Extrapolation
  • Interpolation: Using a model to estimate Y for an X value within the range on which the model was based.
  • Extrapolation: Estimating based on an X value outside the range.
  • Interpolation Good, Extrapolation Bad.
Nixon's Graph: Economic Growth

[Graph: economic growth plotted from the start of the Nixon administration to "now"]

Nixon's Graph: Economic Growth (continued)

[The same graph with a projection extended past "now" – extrapolating beyond the observed range]

Conditions for regression
  • “Straight enough” condition (linearity)
  • Errors are mostly independent of X
  • Errors are mostly independent of anything else you can think of
  • Errors are more-or-less normally distributed
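One common way to eyeball these conditions is a residual plot; the sketch below uses the same invented data as earlier:

```python
# A sketch of a residual plot as a quick visual check: residuals should
# scatter evenly around zero with no pattern in x.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([5.3, 6.1, 7.8, 9.2, 10.5, 12.0])
y = np.array([1.10, 1.25, 1.30, 1.45, 1.50, 1.60])

fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```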
General ANOVA Setting: Comparisons of 2 or More Means
  • Investigator controls one or more independent variables
    • Called factors (or treatment variables)
    • Each factor contains two or more levels (or groups or categories/classifications)
  • Observe effects on the dependent variable
    • Response to levels of the independent variable
  • Experimental design: the plan used to collect the data
Logic of ANOVA
  • Each observation is different from the Grand (total sample) Mean by some amount
  • There are two sources of variance from the mean:
    • 1) That due to the treatment or independent variable
    • 2) That which is unexplained by our treatment
One-Way Analysis of Variance
  • Evaluate the difference among the means of two or more groups
    • Examples: accident rates for 1st, 2nd, and 3rd shift; expected mileage for five brands of tires
  • Assumptions
    • Populations are normally distributed
    • Populations have equal variances
    • Samples are randomly and independently drawn
Hypotheses of One-Way ANOVA
  • H0: all population means are equal
    • i.e., no treatment effect (no variation in means among groups)
  • H1: at least one population mean is different
    • i.e., there is a treatment effect
    • does not mean that all population means are different (some pairs may be the same)
One-Factor ANOVA

All means are the same: the null hypothesis is true (no treatment effect).

One-Factor ANOVA (continued)

At least one mean is different: the null hypothesis is NOT true (a treatment effect is present).

Partitioning the Variation
  • Total variation can be split into two parts:

SST = SSA + SSW

  • SST = Total Sum of Squares (total variation)
  • SSA = Sum of Squares Among Groups (among-group variation)
  • SSW = Sum of Squares Within Groups (within-group variation)

Partitioning the Variation (continued)

SST = SSA + SSW

  • Total variation = the aggregate dispersion of the individual data values across the various factor levels (SST)
  • Among-group variation = dispersion between the factor sample means (SSA)
  • Within-group variation = dispersion that exists among the data values within a particular factor level (SSW)

Partition of Total Variation

Total Variation (SST), d.f. = n - 1
  = Variation Due to Factor (SSA), d.f. = c - 1
  + Variation Due to Random Sampling (SSW), d.f. = n - c

SSA is commonly referred to as: Sum of Squares Between, Sum of Squares Among, Sum of Squares Explained, or Among-Groups Variation.

SSW is commonly referred to as: Sum of Squares Within, Sum of Squares Error, Sum of Squares Unexplained, or Within-Group Variation.
Total Sum of Squares

SST = SSA + SSW

SST = Σⱼ Σᵢ (Xij - X̄)²   (summing over groups j = 1…c and observations i = 1…nj)

  • Where:
  • SST = total sum of squares
  • c = number of groups (levels or treatments)
  • nj = number of observations in group j
  • Xij = ith observation from group j
  • X̄ = grand mean (mean of all data values)

Among-Group Variation

SSA = Σⱼ nj (X̄j - X̄)²

  • Where:
  • SSA = sum of squares among groups
  • c = number of groups
  • nj = sample size from group j
  • X̄j = sample mean from group j
  • X̄ = grand mean (mean of all data values)
Among-Group Variation (continued)

Variation due to differences among groups.

Mean Square Among: MSA = SSA / (c - 1)

Within-Group Variation

SSW = Σⱼ Σᵢ (Xij - X̄j)²

  • Where:
  • SSW = sum of squares within groups
  • c = number of groups
  • nj = sample size from group j
  • X̄j = sample mean from group j
  • Xij = ith observation in group j
Within-Group Variation (continued)

Summing the variation within each group and then adding over all groups:

Mean Square Within: MSW = SSW / (n - c)

One-Way ANOVA Table

Source of Variation | SS              | df    | MS (Variance)     | F ratio
--------------------|-----------------|-------|-------------------|------------
Among Groups        | SSA             | c - 1 | MSA = SSA/(c - 1) | F = MSA/MSW
Within Groups       | SSW             | n - c | MSW = SSW/(n - c) |
Total               | SST = SSA + SSW | n - 1 |                   |

c = number of groups
n = sum of the sample sizes from all groups
df = degrees of freedom

One-Way ANOVA F Test Statistic

H0: μ1 = μ2 = … = μc
H1: at least two population means are different

  • Test statistic: F = MSA / MSW
    • MSA is mean squares among groups
    • MSW is mean squares within groups
  • Degrees of freedom
    • df1 = c - 1 (c = number of groups)
    • df2 = n - c (n = sum of sample sizes from all populations)

Interpreting the One-Way ANOVA F Statistic
  • The F statistic is the ratio of the among estimate of variance and the within estimate of variance
    • The ratio must always be positive
    • df1 = c - 1 will typically be small
    • df2 = n - c will typically be large

Decision rule: reject H0 if F > FU, otherwise do not reject H0, where FU is the upper-tail critical value of the F distribution at α = .05.

One-Way ANOVA F Test Example

You want to see if cholesterol level is different in three groups. You randomly select five patients from each group and measure their cholesterol levels. At the 0.05 significance level, is there a difference in mean cholesterol?

Gp 1   Gp 2   Gp 3
254    234    200
263    218    222
241    235    197
237    227    206
251    216    204

One-Way ANOVA Example: Scatter Diagram

[Scatter diagram: cholesterol (y-axis, 190 to 270) plotted against group (1, 2, 3) for the data above]

One-Way ANOVA Example Computations

Group means: X̄1 = 249.2, X̄2 = 226.0, X̄3 = 205.8; grand mean X̄ = 227.0
n1 = n2 = n3 = 5, n = 15, c = 3

SSA = 5(249.2 - 227)² + 5(226 - 227)² + 5(205.8 - 227)² = 4716.4
SSW = (254 - 249.2)² + (263 - 249.2)² + … + (204 - 205.8)² = 1119.6

MSA = 4716.4 / (3 - 1) = 2358.2
MSW = 1119.6 / (15 - 3) = 93.3
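A sketch verifying the slide's sums of squares with numpy, using the same cholesterol data:

```python
# A sketch reproducing the slide's ANOVA computations.
import numpy as np

groups = [
    np.array([254, 263, 241, 237, 251]),  # Gp 1
    np.array([234, 218, 235, 227, 216]),  # Gp 2
    np.array([200, 222, 197, 206, 204]),  # Gp 3
]
grand = np.concatenate(groups).mean()    # grand mean, 227.0
c = len(groups)                          # 3 groups
n = sum(len(g) for g in groups)          # 15 observations

ssa = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # 4716.4
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # 1119.6
msa = ssa / (c - 1)                                          # 2358.2
msw = ssw / (n - c)                                          # 93.3
print(ssa, ssw, msa, msw, msa / msw)                         # F ≈ 25.275
```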

One-Way ANOVA Example Solution

H0: μ1 = μ2 = μ3
H1: the μj are not all equal
α = 0.05
df1 = 2, df2 = 12

Test statistic: F = MSA / MSW = 2358.2 / 93.3 = 25.275
Critical value: FU = 3.89

Decision: reject H0 at α = 0.05, since F = 25.275 > FU = 3.89.
Conclusion: there is evidence that at least one μj differs from the rest.
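The whole test can also be run in one call; the sketch below uses scipy's f_oneway on the same data and looks up the critical value FU:

```python
# A sketch of the same test in one call, plus the critical value FU.
from scipy import stats

gp1 = [254, 263, 241, 237, 251]
gp2 = [234, 218, 235, 227, 216]
gp3 = [200, 222, 197, 206, 204]

f_stat, p_value = stats.f_oneway(gp1, gp2, gp3)  # F ≈ 25.275, p < .05
f_upper = stats.f.ppf(0.95, dfn=2, dfd=12)       # FU ≈ 3.89 at alpha = .05
print(f_stat, p_value, f_upper)
```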

Significant and Non-significant Differences

  • Non-significant: within-group variation > between-group variation
  • Significant: between-group variation > within-group variation

ANOVA (summary)
  • The null hypothesis is that there is no difference between the means.
  • The alternate hypothesis is that at least two means differ.
  • Use the F statistic as your test statistic. It tests the between-sample variance (difference between the means) against the within-sample variance (variability within the sample). The larger F is, the more likely it is that the means differ.
  • Degrees of freedom for the numerator is k - 1 (k is the number of treatments)
  • Degrees of freedom for the denominator is n - k (n is the number of responses)
  • If the test F is larger than the critical F, then reject the null.
  • If the p-value is less than alpha, then reject the null.
ANOVA (summary)

Assumptions:

  • All k population probability distributions are normal.
  • The k population variances are equal.
  • The samples from each population are random and independent.
ANOVA

WHEN YOU REJECT THE NULL

For a one-way ANOVA, after you have rejected the null you may want to determine which treatment yielded the best results. You must do follow-on analysis to determine if the difference between each pair of means is significant.
One-way ANOVA (example)

The study described here is about measuring cortisol levels in 3 groups of subjects:

  • Healthy (n = 16)
  • Depressed – non-melancholic depressed (n = 22)
  • Depressed – melancholic depressed (n = 18)
Results
  • Results were obtained as follows:

Source   DF     SS     MS     F      P
Grp.      2  164.7   82.3  6.61  0.003
Error    53  660.0   12.5
Total    55  824.7

Individual 95% CIs for the means, based on pooled StDev:

Level   N    Mean   StDev
1      16   9.200   2.931
2      22  10.700   2.758
3      18  13.500   4.674

Pooled StDev = 3.529

Multiple Comparison of the Means - 1
  • Several methods are available, depending upon whether one wishes to compare means with a control mean (Dunnett) or make an overall comparison (Tukey and Fisher)

Dunnett's comparisons with a control
Critical value = 2.27
Control = level (1) of Grp.

Intervals for treatment mean minus control mean:

Level   Lower   Center   Upper
2      -1.127    1.500   4.127
3       1.553    4.300   7.047

Multiple Comparison of Means - 2

Tukey's pairwise comparisons
Intervals for (column level mean) - (row level mean):

       vs 1                vs 2
2   (-4.296, 1.296)
3   (-7.224, -1.376)    (-5.504, -0.096)

Fisher's pairwise comparisons
Intervals for (column level mean) - (row level mean):

       vs 1                vs 2
2   (-3.826, 0.826)
3   (-6.732, -1.868)    (-5.050, -0.550)
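A sketch of running Tukey's comparisons in Python with statsmodels' pairwise_tukeyhsd. It needs raw observations, which the slides do not give, so the values below are random draws matching the reported group means, SDs, and sizes; the output will therefore only approximate the intervals above:

```python
# A sketch of Tukey's pairwise comparisons with statsmodels.
# The raw cortisol values are simulated placeholders, since the slide
# reports only group summaries.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(0)
values = np.concatenate([
    np.random.normal(9.2, 2.931, 16),   # level 1: Healthy
    np.random.normal(10.7, 2.758, 22),  # level 2: Non-melancholic depressed
    np.random.normal(13.5, 4.674, 18),  # level 3: Melancholic depressed
])
levels = np.repeat([1, 2, 3], [16, 22, 18])

print(pairwise_tukeyhsd(values, levels, alpha=0.05))
```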

The End
