Bivariate data Correlation Coefficient of Determination Regression One-way Analysis of Variance (ANOVA)

## PowerPoint Slideshow about 'Bivariate data Correlation Coefficient of Determination Regression One-way Analysis of Variance (ANOVA)' - osgood


Presentation Transcript

Bivariate data

Correlation

Coefficient of Determination

Regression

One-way Analysis of Variance (ANOVA)

Bivariate Data

- Bivariate data are just what they sound like – data with measurements on two variables; let’s call them X and Y
- Here, we will look at two continuous variables
- Want to explore the relationship between the two variables
- Example: Fasting blood glucose and ventricular shortening velocity

Scatterplot

- We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram)
- Plots values of one variable on the horizontal axis and values of the other on the vertical axis
- Can be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)

Numerical Summary

- Typically, a bivariate data set is summarized numerically with 5 summary statistics
- These provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse
- We can summarize each variable separately : X mean, X SD; Y mean, Y SD
- But these numbers don’t tell us how the values of X and Y vary together

Pearson’s Correlation Coefficient “r”

- “r” indicates…
- strength of relationship (strong, weak, or none)
- direction of relationship
- positive (direct) – variables move in same direction
- negative (inverse) – variables move in opposite directions
- r ranges in value from –1.0 to +1.0

- –1.0 = strong negative relationship
- 0.0 = no relationship
- +1.0 = strong positive relationship
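As a minimal sketch of how r behaves at the ends of this scale, the Python below computes it from its definition. The data values and the function name `pearson_r` are made up for illustration and do not come from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: co-deviation of x and y, scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 5, 4, 5])   # y tends to rise with x
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # y falls exactly along a line

print(round(r_pos, 3), round(r_neg, 3))
```

The first data set gives a fairly strong positive r; the second lies exactly on a downward-sloping line, so r sits at the negative end of the scale.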

Correlation (cont)

Correlation is the relationship between two variables.

What r is...

- r is a measure of LINEAR ASSOCIATION
- The closer r is to –1 or 1, the more tightly the points on the scatterplot are clustered around a line
- The sign of r (+ or -) is the same as the sign of the slope of the line
- When r = 0, the points are not LINEARLY ASSOCIATED– this does NOT mean there is NO ASSOCIATION
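The last point can be checked with a small sketch. The data below are hypothetical: y is exactly x squared, so Y is completely determined by X, yet r comes out as 0 because the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect but non-linear (quadratic) relationship
x = [-2, -1, 0, 1, 2]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(r)
```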

...and what r is not

- r is a measure of LINEAR ASSOCIATION
- r does NOT tell us if Y is a function of X
- r does NOT tell us if X causes Y
- r does NOT tell us if Y causes X
- r does NOT tell us what the scatterplot looks like

[Figure: scatterplots showing the effect of outliers on r]

Correlation is NOT causation

- You cannot infer that since X and Y are highly correlated (r close to –1 or 1) that X is causing a change in Y
- Y could be causing X
- X and Y could both be varying along with a third, possibly unknown factor (either causal or not)

Reading Correlation Matrix

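r = -.904

p = .013: the probability of getting a correlation at least this large by chance alone if there were no relationship in the population. Reject H0 if p ≤ .05.

The matrix also lists the sample size for each correlation.

Reported as: r(4) = -.904, p < .05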
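The slides do not show how the p value was obtained. One standard route, sketched below, converts r to a t statistic and compares it with a critical value. Reading the 4 in r(4) as the degrees of freedom (n – 2) is an assumption, though it is consistent with the reported p; the critical value 2.776 is taken from a standard t table, not from the slides.

```python
from math import sqrt

r = -0.904
df = 4            # assumed: degrees of freedom = n - 2, i.e. 6 pairs of values

# t statistic for testing H0: the population correlation is 0
t = r * sqrt(df / (1 - r ** 2))

T_CRITICAL = 2.776  # two-tailed critical t for alpha = .05 with 4 df (t table)
significant = abs(t) > T_CRITICAL

print(round(t, 3), significant)
```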

Interpretation of Correlation

Correlations

- from 0 to 0.25 (or 0 to -0.25) = little or no relationship;
- from 0.25 to 0.50 (or -0.25 to -0.50) = fair degree of relationship;
- from 0.50 to 0.75 (or -0.50 to -0.75) = moderate to good relationship;
- greater than 0.75 (or below -0.75) = very good to excellent relationship.

Limitations of Correlation

- linearity:
- can’t describe non-linear relationships
- e.g., relation between anxiety & performance
- truncation of range:
- you underestimate the strength of the relationship if you can’t see the full range of x values
- no proof of causation
- third variable problem:
- could be 3rd variable causing change in both variables
- directionality: can’t be sure which way causality “flows”

Coefficient of Determination $r^2$

- The square of the correlation, $r^2$, is the proportion of variation in the values of y that is explained by the regression model with x.
- Amount of variance accounted for in y by x
- Percentage increase in accuracy you gain by using the regression line to make predictions
- $0 \le r^2 \le 1$.
- The larger $r^2$, the stronger the linear relationship.
- The closer $r^2$ is to 1, the more confident we are in our prediction.
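A short sketch of the "proportion of variation explained" reading, using hypothetical data: the least-squares line is fitted, and one minus the ratio of unexplained to total variation in y is compared with the square of r. All names and numbers here are illustrative assumptions.

```python
from math import sqrt

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in y

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation left unexplained by the line (sum of squared errors)
sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)
r_squared = 1 - sse / syy   # proportion of variation explained

print(round(r_squared, 3), round(r ** 2, 3))
```

For these numbers the line explains 60% of the variation in y, and the two ways of computing the value agree.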

Linear Regression

- Correlation measures the direction and strength of the linear relationship between two quantitative variables
- A regression line
- summarizes the relationship between two variables if the form of the relationship is linear.
- describes how a response variable y changes as an explanatory variable x changes.
- is often used as a mathematical model to predict the value of a response variable y based on a value of an explanatory variable x.

(Simple) Linear Regression

- Refers to drawing a (particular, special) line through a scatterplot
- Used for 2 broad purposes:
- Estimation
- Prediction

Formula for Linear Regression

y = bx + a

- b = slope, or the change in y for every unit change in x
- a = y-intercept, or the value of y when x = 0
- The Y variable is plotted on the vertical axis; the X variable is plotted on the horizontal axis.

Interpretation of parameters

- The regression slope is the average change in Y when X increases by 1 unit
- The intercept is the predicted value for Y when X = 0
- If the slope = 0, then X does not help in predicting Y (linearly)

Which line?

- There are many possible lines that could be drawn through the cloud of points in the scatterplot:

Least Squares

- Q: Where does this equation come from?

A: It is the line that is ‘best’ in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction

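[Figure: scatterplot of Y against X with the fitted line; the errors are the vertical distances between the points (*) and the line]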
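To make "best" concrete, here is a minimal sketch with hypothetical data. It fits the least-squares line and shows that another plausible line through the same points has a larger sum of squared vertical errors. The function names and the comparison line are illustrative assumptions.

```python
def fit_least_squares(xs, ys):
    """Return (slope b, intercept a) of the least-squares line y = bx + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, a = fit_least_squares(x, y)
sse_best = sum_squared_errors(x, y, b, a)
sse_other = sum_squared_errors(x, y, 1.0, 1.0)  # an arbitrary competing line

print(round(b, 3), round(a, 3), round(sse_best, 3), round(sse_other, 3))
```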

Linear Regression

U.K. monthly return is the y variable.

U.S. monthly return is the x variable.

Question: What is the relationship between U.K. and U.S. stock returns?

Correlation tells the strength of relationship between x and y.

Relationship may not be linear.

Linear Regression

A regression creates a model of the relationship between x and y.

It fits a line to the scatter plot by minimizing the sum of the squared vertical distances between the observed y values and the line.

If the correlation is significant, then run a regression analysis.

Linear Regression

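The slope is calculated as:

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$$

It tells you the change in the dependent variable for every unit change in the independent variable.

The coefficient of determination, or R-square, measures the variation explained by the best-fit line as a percent of the total variation, where $y'_i$ is the value predicted by the line:

$$R^2 = \frac{\sum_i (y'_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$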

Regression Equation

- y’= bx + a
- y’ = predicted value of y
- b = slope of the line
- x = value of x that you plug in
- a = y-intercept (where the line crosses the y-axis)
- In this case:
- y’ = -4.263(x) + 125.401
- So if the distance is 20 feet
- y’ = -4.263(20) + 125.401
- y’ = -85.26 + 125.401
- y’ = 40.141
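The same plug-in calculation can be written as a small function. The slope and intercept are the ones given above; the function name `predict` is an illustrative choice.

```python
SLOPE = -4.263       # b, from the regression equation above
INTERCEPT = 125.401  # a, from the regression equation above

def predict(x):
    """Predicted value y' = bx + a for a given x."""
    return SLOPE * x + INTERCEPT

print(round(predict(20), 3))  # 40.141
```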

SPSS Regression Set-up

- “Criterion,”
- y-axis variable,
- what you’re trying to predict

- “Predictor,”
- x-axis variable,
- what you’re basing the prediction on

Extrapolation

- Interpolation: Using a model to estimate Y for an X value within the range on which the model was based.
- Extrapolation: Estimating based on an X value outside the range.
- Interpolation Good, Extrapolation Bad.
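One way to guard against accidental extrapolation is to carry the fitted x range along with the model and flag predictions outside it. The sketch below reuses the slope and intercept from the regression equation slide; the range of 5 to 25 feet is purely hypothetical, since the slides do not state the range of the data.

```python
SLOPE = -4.263
INTERCEPT = 125.401
X_MIN, X_MAX = 5.0, 25.0  # hypothetical range of x values used to fit the line

def predict_checked(x):
    """Return (predicted y, True if x lies outside the fitted range)."""
    is_extrapolation = not (X_MIN <= x <= X_MAX)
    return SLOPE * x + INTERCEPT, is_extrapolation

y_in, flag_in = predict_checked(20)    # inside the assumed range: interpolation
y_out, flag_out = predict_checked(40)  # outside the assumed range: extrapolation

print(round(y_in, 3), flag_in)
print(round(y_out, 3), flag_out)
```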

Conditions for regression

- “Straight enough” condition (linearity)
- Errors are mostly independent of X
- Errors are mostly independent of anything else you can think of
- Errors are more-or-less normally distributed

General ANOVA Setting: Comparisons of 2 or more means

- Investigator controls one or more independent variables
- Called factors (or treatment variables)
- Each factor contains two or more levels (or groups or categories/classifications)
- Observe effects on the dependent variable
- Response to levels of independent variable
- Experimental design: the plan used to collect the data

Logic of ANOVA

- Each observation is different from the Grand (total sample) Mean by some amount
- There are two sources of variance from the mean:
- 1) That due to the treatment or independent variable
- 2) That which is unexplained by our treatment

One-Way Analysis of Variance

- Evaluate the difference among the means of two or more groups

Examples: Accident rates for 1st, 2nd, and 3rd shift

Expected mileage for five brands of tires

- Assumptions
- Populations are normally distributed
- Populations have equal variances
- Samples are randomly and independently drawn

Hypotheses of One-Way ANOVA

- H0: μ1 = μ2 = … = μc (all population means are equal)
- i.e., no treatment effect (no variation in means among groups)
- H1: Not all of the population means are equal (at least one population mean is different)
- i.e., there is a treatment effect
- Does not mean that all population means are different (some pairs may be the same)

One-Factor ANOVA (continued)

At least one mean is different: the null hypothesis is NOT true (a treatment effect is present).

Partitioning the Variation

- Total variation can be split into two parts:

SST = SSA + SSW

SST = Total Sum of Squares

(Total variation)

SSA = Sum of Squares Among Groups

(Among-group variation)

SSW = Sum of Squares Within Groups

(Within-group variation)

Partitioning the Variation

(continued)

SST = SSA + SSW

Total Variation = the aggregate dispersion of the individual data values across the various factor levels (SST)

Among-Group Variation = dispersion between the factor sample means (SSA)

Within-Group Variation = dispersion that exists among the data values within a particular factor level (SSW)

Within-group variation (SSW) is commonly referred to as:

- Sum of Squares Within
- Sum of Squares Error
- Sum of Squares Unexplained
- Within-Group Variation
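The identity SST = SSA + SSW can be checked numerically. The sketch below uses three small hypothetical groups (not data from the slides) and computes each sum of squares directly from its description.

```python
# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def mean(values):
    return sum(values) / len(values)

all_values = [x for group in groups for x in group]
grand_mean = mean(all_values)

# Total variation: every observation against the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Among-group variation: each group mean against the grand mean,
# weighted by the group's sample size
ssa = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)

# Within-group variation: each observation against its own group mean
ssw = sum((x - mean(group)) ** 2 for group in groups for x in group)

print(sst, ssa, ssw)
```

Here the groups are well separated, so most of the total variation is among groups rather than within them.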

Partition of Total Variation

Total Variation (SST, d.f. = n – 1) = Variation Due to Factor (SSA, d.f. = c – 1) + Variation Due to Random Sampling (SSW, d.f. = n – c)

Variation due to factor (SSA) is commonly referred to as:

- Sum of Squares Between
- Sum of Squares Among
- Sum of Squares Explained
- Among Groups Variation

Total Sum of Squares

SST = SSA + SSW

$$SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( X_{ij} - \bar{\bar{X}} \right)^2$$

- Where:
- SST = Total sum of squares
- c = number of groups (levels or treatments)
- $n_j$ = number of observations in group j
- $X_{ij}$ = ith observation from group j
- $\bar{\bar{X}}$ = grand mean (mean of all data values)

Among-Group Variation

SST = SSA + SSW

$$SSA = \sum_{j=1}^{c} n_j \left( \bar{X}_j - \bar{\bar{X}} \right)^2$$

- Where:
- SSA = Sum of squares among groups
- c = number of groups
- $n_j$ = sample size from group j
- $\bar{X}_j$ = sample mean from group j
- $\bar{\bar{X}}$ = grand mean (mean of all data values)

Among-Group Variation (continued)

SSA is the variation due to differences among groups.

Mean Square Among = SSA/degrees of freedom:

$$MSA = \frac{SSA}{c - 1}$$

Within-Group Variation

SST = SSA + SSW

$$SSW = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( X_{ij} - \bar{X}_j \right)^2$$

- Where:
- SSW = Sum of squares within groups
- c = number of groups
- $n_j$ = sample size from group j
- $\bar{X}_j$ = sample mean from group j
- $X_{ij}$ = ith observation in group j

Within-Group Variation (continued)

SSW is found by summing the variation within each group and then adding over all groups.

Mean Square Within = SSW/degrees of freedom:

$$MSW = \frac{SSW}{n - c}$$

| Source of Variation | SS | df | MS (Variance) | F ratio |
|---|---|---|---|---|
| Among Groups | SSA | c - 1 | MSA = SSA / (c - 1) | F = MSA / MSW |
| Within Groups | SSW | n - c | MSW = SSW / (n - c) | |
| Total | SST = SSA + SSW | n - 1 | | |

c = number of groups

n = sum of the sample sizes from all groups

df = degrees of freedom

One-Way ANOVA F Test Statistic

- Test statistic

$$F = \frac{MSA}{MSW}$$

MSA is mean squares among groups

MSW is mean squares within groups

- Degrees of freedom
- df1 = c – 1 (c = number of groups)
- df2 = n – c (n = sum of sample sizes from all populations)

H0: μ1 = μ2 = … = μc

H1: At least two population means are different

Interpreting One-Way ANOVA F Statistic

- The F statistic is the ratio of the among estimate of variance and the within estimate of variance
- The ratio can never be negative, because both estimates are variances
- df1 = c -1 will typically be small
- df2 = n - c will typically be large

Decision Rule:

- Reject H0 if F > FU, otherwise do not reject H0

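[Figure: F distribution starting at 0, with the rejection region of area α = .05 to the right of the critical value FU; do not reject H0 to the left of FU, reject H0 to the right]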

One-Way ANOVA F Test Example

You want to see if cholesterol level is different in three groups. You randomly select five patients from each group and measure their cholesterol levels. At the 0.05 significance level, is there a difference in mean cholesterol?

| Gp 1 | Gp 2 | Gp 3 |
|---|---|---|
| 254 | 234 | 200 |
| 263 | 218 | 222 |
| 241 | 235 | 197 |
| 237 | 227 | 206 |
| 251 | 216 | 204 |

One-Way ANOVA Example: Scatter Diagram

[Figure: scatter diagram of cholesterol (vertical axis, 190 to 270) by group (horizontal axis, groups 1 to 3), plotting the 15 data values]

One-Way ANOVA Example Computations

$\bar{X}_1 = 249.2$, $\bar{X}_2 = 226.0$, $\bar{X}_3 = 205.8$

$\bar{\bar{X}} = 227.0$

$n_1 = 5$, $n_2 = 5$, $n_3 = 5$

n = 15

c = 3

$SSA = 5(249.2 - 227)^2 + 5(226 - 227)^2 + 5(205.8 - 227)^2 = 4716.4$

$SSW = (254 - 249.2)^2 + (263 - 249.2)^2 + \dots + (204 - 205.8)^2 = 1119.6$

$MSA = 4716.4 / (3 - 1) = 2358.2$

$MSW = 1119.6 / (15 - 3) = 93.3$
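These computations can be reproduced with a few lines of Python. The data and the critical value FU = 3.89 are taken from the slides; the variable names are illustrative.

```python
# Cholesterol values for the three groups, from the example data
groups = {
    "Gp 1": [254, 263, 241, 237, 251],
    "Gp 2": [234, 218, 235, 227, 216],
    "Gp 3": [200, 222, 197, 206, 204],
}
F_CRITICAL = 3.89  # FU for alpha = 0.05 with df1 = 2 and df2 = 12 (from the slides)

def mean(values):
    return sum(values) / len(values)

all_values = [x for values in groups.values() for x in values]
n = len(all_values)   # total number of observations
c = len(groups)       # number of groups
grand_mean = mean(all_values)

ssa = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ssw = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

msa = ssa / (c - 1)   # mean square among groups
msw = ssw / (n - c)   # mean square within groups
f_stat = msa / msw
reject_h0 = f_stat > F_CRITICAL

print(round(ssa, 1), round(ssw, 1), round(msa, 1), round(msw, 1))
print(round(f_stat, 3), reject_h0)
```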

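One-Way ANOVA Example Solution

H0: μ1 = μ2 = μ3

H1: μj not all equal

α = 0.05

df1 = 2, df2 = 12

Critical Value: FU = 3.89

Test Statistic: F = MSA / MSW = 2358.2 / 93.3 = 25.275

Decision: Reject H0 at α = 0.05, because F = 25.275 > FU = 3.89

Conclusion: There is evidence that at least one μj differs from the rest

[Figure: F distribution with the rejection region of area α = .05 to the right of FU = 3.89; the test statistic F = 25.275 falls in the rejection region]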

Significant and Non-significant Differences

Non-significant:

Within > Between

Significant:

Between > Within

ANOVA (summary)

- Null hypothesis is that there is no difference between the means.
- Alternate hypothesis is that at least two means differ.
- Use the F statistic as your test statistic. It compares the between-sample variance (differences between the group means) with the within-sample variance (variability within the samples). The larger the F ratio, the stronger the evidence that the means are different.
- Degrees of freedom for numerator is k-1 (k is the number of treatments)
- Degrees of freedom for the denominator is n-k (n is the number of responses)
- If test F is larger than critical F, then reject the null.
- If p-value is less than alpha, then reject the null.

ANOVA (summary)

Assumptions:

- All k population probability distributions are normal.
- The k population variances are equal.
- The samples from each population are random and independent.

ANOVA

WHEN YOU REJECT THE NULL

For a one-way ANOVA, after you have rejected the null, you may want to determine which treatment yielded the best results.

You must do a follow-on analysis to determine whether the difference between each pair of means is significant.

One-way ANOVA (example)

The study described here is about measuring cortisol levels in 3 groups of subjects :

- Healthy (n = 16)
- Depressed – Non-melancholic depressed (n = 22)
- Depressed – Melancholic depressed (n = 18)

Results

- Results were obtained as follows

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Grp. | 2 | 164.7 | 82.3 | 6.61 | 0.003 |
| Error | 53 | 660.0 | 12.5 | | |
| Total | 55 | 824.7 | | | |

Individual 95% CIs for mean, based on pooled StDev (pooled StDev = 3.529):

| Level | N | Mean | StDev |
|---|---|---|---|
| 1 | 16 | 9.200 | 2.931 |
| 2 | 22 | 10.700 | 2.758 |
| 3 | 18 | 13.500 | 4.674 |

[Figure: 95% confidence intervals for the three level means, plotted on a scale from 7.5 to 15.0]
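The ANOVA table can be rebuilt, to rounding, from the per-level summary statistics alone. The sketch below assumes only the N, mean, and StDev values listed in the output; the variable names are illustrative.

```python
from math import sqrt

# (n, mean, standard deviation) for levels 1, 2, 3, from the output above
levels = [(16, 9.200, 2.931), (22, 10.700, 2.758), (18, 13.500, 4.674)]

n_total = sum(n for n, _, _ in levels)
c = len(levels)
grand_mean = sum(n * m for n, m, _ in levels) / n_total

# Among groups: group means against the grand mean, weighted by group size
ss_group = sum(n * (m - grand_mean) ** 2 for n, m, _ in levels)
# Within groups (error): (n - 1) times each group's sample variance
ss_error = sum((n - 1) * s ** 2 for n, _, s in levels)

ms_group = ss_group / (c - 1)
ms_error = ss_error / (n_total - c)
f_ratio = ms_group / ms_error
pooled_sd = sqrt(ms_error)

print(round(ss_group, 1), round(ss_error, 1))
print(round(f_ratio, 2), round(pooled_sd, 3))
```

The results agree with the reported SS, F, and pooled StDev up to the rounding of the summary statistics.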

Multiple Comparison of the Means - 1

- Several methods are available depending upon whether one wishes to compare means with a control mean (Dunnett) or just overall comparison (Tukey and Fisher)

Dunnett's comparisons with a control

Critical value = 2.27

Control = level (1) of Grp.

Intervals for treatment mean minus control mean

| Level | Lower | Center | Upper |
|---|---|---|---|
| 2 | -1.127 | 1.500 | 4.127 |
| 3 | 1.553 | 4.300 | 7.047 |

[Figure: the two intervals plotted on a scale from -1.0 to 7.0]
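As a rough check, these intervals can be approximated from the critical value, the pooled StDev, and the group sizes, assuming the usual form: difference in means plus or minus critical value times pooled StDev times the square root of (1/n_treatment + 1/n_control). That formula is an assumption about how the software built the intervals; because the critical value and pooled StDev are rounded, the result matches the output only to about two decimal places.

```python
from math import sqrt

CRITICAL_VALUE = 2.27  # Dunnett critical value, from the output above
POOLED_SD = 3.529      # pooled StDev, from the ANOVA output
CONTROL = (16, 9.200)  # (n, mean) of level 1, the control
TREATMENTS = {2: (22, 10.700), 3: (18, 13.500)}  # (n, mean) per level

def dunnett_interval(n_t, mean_t, n_c, mean_c):
    """Return (lower, center, upper) for treatment mean minus control mean."""
    center = mean_t - mean_c
    half_width = CRITICAL_VALUE * POOLED_SD * sqrt(1 / n_t + 1 / n_c)
    return center - half_width, center, center + half_width

intervals = {
    level: dunnett_interval(n, m, *CONTROL)
    for level, (n, m) in TREATMENTS.items()
}

for level, (lower, center, upper) in intervals.items():
    print(level, round(lower, 2), round(center, 2), round(upper, 2))
```

The interval for level 2 contains 0 and the interval for level 3 does not, so only level 3 differs significantly from the control.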

Multiple Comparison of Means - 2

Tukey's pairwise comparisons

Intervals for (column level mean) − (row level mean)

| Row level | Column level 1 | Column level 2 |
|---|---|---|
| 2 | (-4.296, 1.296) | |
| 3 | (-7.224, -1.376) | (-5.504, -0.096) |

Fisher's pairwise comparisons

Intervals for (column level mean) − (row level mean)

| Row level | Column level 1 | Column level 2 |
|---|---|---|
| 2 | (-3.826, 0.826) | |
| 3 | (-6.732, -1.868) | (-5.050, -0.550) |
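A pair of means is declared significantly different when its interval excludes 0. The sketch below applies that reading to the Tukey and Fisher intervals reported above; the dictionary layout and function names are illustrative.

```python
# Intervals for (column level mean) - (row level mean), keyed by (column, row)
tukey = {(1, 2): (-4.296, 1.296), (1, 3): (-7.224, -1.376), (2, 3): (-5.504, -0.096)}
fisher = {(1, 2): (-3.826, 0.826), (1, 3): (-6.732, -1.868), (2, 3): (-5.050, -0.550)}

def excludes_zero(interval):
    """True when 0 lies outside the interval, i.e. the pair differs."""
    lower, upper = interval
    return lower > 0 or upper < 0

def significant_pairs(intervals):
    """Sorted list of level pairs whose interval excludes 0."""
    return sorted(pair for pair, ci in intervals.items() if excludes_zero(ci))

print("Tukey: ", significant_pairs(tukey))
print("Fisher:", significant_pairs(fisher))
```

Both methods lead to the same reading here: level 3 differs from levels 1 and 2, while levels 1 and 2 do not differ significantly from each other.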

The End
