Bivariate data Correlation Coefficient of Determination Regression One-way Analysis of Variance (ANOVA)

Lecture #9 Bivariate data Correlation Coefficient of Determination Regression One-way Analysis of Variance (ANOVA)

Bivariate Data • Bivariate data are just what they sound like – data with measurements on two variables; let’s call them X and Y • Here, we will look at two continuous variables • Want to explore the relationship between the two variables • Example: Fasting blood glucose and ventricular shortening velocity

Scatterplot • We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram) • Plots values of one variable on the horizontal axis and values of the other on the vertical axis • Can be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)

Scatterplot: positive correlation

Scatterplot: negative correlation

Scatterplot: real data example

Numerical Summary • Typically, a bivariate data set is summarized numerically with 5 summary statistics • These provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse • We can summarize each variable separately: X mean, X SD; Y mean, Y SD • But these four numbers don’t tell us how the values of X and Y vary together

Pearson’s Correlation Coefficient “r” • “r” indicates… • strength of relationship (strong, weak, or none) • direction of relationship • positive (direct) – variables move in same direction • negative (inverse) – variables move in opposite directions • r ranges in value from –1.0 to +1.0 • –1.0 = strong negative; 0.0 = no relationship; +1.0 = strong positive
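As a minimal sketch of how r can be computed by hand, the function below uses the usual sums of squares and cross-products. The function name `pearson_r` and the small data sets are made up for illustration; they are not taken from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect direct and a perfect inverse relationship
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # variables move in the same direction
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # variables move in opposite directions
print(r_pos, r_neg)
```

The two calls land at the two ends of the scale: the first at +1.0 and the second at –1.0.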

Correlation (cont) Correlation is the relationship between two variables.

What r is... • r is a measure of LINEAR ASSOCIATION • The closer r is to –1 or 1, the more tightly the points on the scatterplot are clustered around a line • The sign of r (+ or -) is the same as the sign of the slope of the line • When r = 0, the points are not LINEARLY ASSOCIATED – this does NOT mean there is NO ASSOCIATION

...and what r is not • r is a measure of LINEAR ASSOCIATION • r does NOT tell us if Y is a function of X • r does NOT tell us if X causes Y • r does NOT tell us if Y causes X • r does NOT tell us what the scatterplot looks like

r ≈ 0: curved relation

r ≈ 0: outliers

r ≈ 0: parallel lines

r ≈ 0: different linear trends

r ≈ 0: random scatter
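The curved-relation case can be checked numerically. In the sketch below, Y is an exact function of X (Y = X²), yet r comes out as zero because the relationship is not linear. The data and the helper name `pearson_r` are hypothetical, chosen only to illustrate the point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: Y is completely determined by X, but along a curve
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(r_curved)
```

This is why a scatterplot should always accompany r: the number alone cannot distinguish this pattern from random scatter.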

Correlation is NOT causation • You cannot infer that since X and Y are highly correlated (r close to –1 or 1) that X is causing a change in Y • Y could be causing X • X and Y could both be varying along with a third, possibly unknown factor (either causal or not)

Correlation matrix

Reading Correlation Matrix • r = -.904 • p = .013 – probability of getting a correlation at least this large by sheer chance if there is truly no relationship • Reject Ho if p ≤ .05 • N = sample size • Reported as: r (4) = -.904, p < .05
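One common way such a p-value is obtained is a t-test on r. The worked line below is a sketch that assumes the 4 in r (4) is the degrees of freedom, n – 2 (a reporting convention the slide does not state explicitly):

```latex
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{-0.904\,\sqrt{4}}{\sqrt{1-(-0.904)^{2}}}
  \approx -4.23
\]
```

With 4 degrees of freedom the two-tailed 5% critical value of t is 2.776, so |t| ≈ 4.23 exceeds it, which agrees with rejecting Ho at p ≤ .05.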

Interpretation of Correlation • Correlations from 0 to 0.25 (or 0 to -0.25) = little or no relationship; • from 0.25 to 0.50 (-0.25 to -0.50) = fair degree of relationship; • from 0.50 to 0.75 (-0.50 to -0.75) = moderate to good relationship; • greater than 0.75 (or beyond -0.75) = very good to excellent relationship.

Limitations of Correlation • linearity: • can’t describe non-linear relationships • e.g., relation between anxiety & performance • truncation of range: • underestimate strength of relationship if you can’t see the full range of x values • no proof of causation • third variable problem: • could be a 3rd variable causing change in both variables • directionality: can’t be sure which way causality “flows”

Coefficient of Determination r² • The square of the correlation, r², is the proportion of variation in the values of y that is explained by the regression model with x. • Amount of variance accounted for in y by x • Percentage increase in accuracy you gain by using the regression line to make predictions • 0 ≤ r² ≤ 1. • The larger r², the stronger the linear relationship. • The closer r² is to 1, the more confident we are in our prediction.
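The "proportion of variation explained" reading can be verified directly. The sketch below fits a least-squares line to a small made-up data set, computes 1 − SSE/SST, and compares it with the square of r. All names (`sse`, `sst`, `r_squared_from_fit`, `r_squared_from_r`) and the data are hypothetical.

```python
import math

# Hypothetical data, chosen only so the arithmetic is easy to follow
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Least-squares line y' = b*x + a
b = sxy / sxx
a = mean_y - b * mean_x

# Unexplained variation (around the line) and total variation (around the mean)
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
sst = syy

r_squared_from_fit = 1 - sse / sst               # proportion of variation explained
r_squared_from_r = (sxy / math.sqrt(sxx * syy)) ** 2  # square of Pearson's r

print(r_squared_from_fit, r_squared_from_r)
```

For simple linear regression with one predictor the two quantities agree; here both come to 0.6, meaning 60% of the variation in y is accounted for by x.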

Age vs. Height: r² = 0.9888.

Age vs. Height: r² = 0.849.

Linear Regression • Correlation measures the direction and strength of the linear relationship between two quantitative variables • A regression line • summarizes the relationship between two variables if the form of the relationship is linear. • describes how a response variable y changes as an explanatory variable x changes. • is often used as a mathematical model to predict the value of a response variable y based on a value of an explanatory variable x.

(Simple) Linear Regression • Refers to drawing a (particular, special) line through a scatterplot • Used for 2 broad purposes: • Estimation • Prediction

Formula for Linear Regression • y = bx + a • b = slope, or the change in y for every unit change in x • a = y-intercept, or the value of y when x = 0 • Y variable plotted on vertical axis • X variable plotted on horizontal axis

Interpretation of parameters • The regression slope is the average change in Y when X increases by 1 unit • The intercept is the predicted value for Y when X = 0 • If the slope = 0, then X does not help in predicting Y (linearly)

Which line? • There are many possible lines that could be drawn through the cloud of points in the scatterplot:

Least Squares • Q: Where does this equation come from? • A: It is the line that is ‘best’ in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction • [Figure: scatterplot of Y against X with the vertical errors between the points and the line marked]
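To make "best" concrete, the sketch below fits the least-squares line to a small made-up data set and compares its sum of squared vertical errors with a few other candidate lines. The function names, the data, and the candidate lines are all hypothetical.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y' = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, b, a)

# A few other lines one might draw through the same cloud of points
other_lines = [(0.5, 2.5), (0.7, 2.0), (0.6, 2.5)]
other_sses = [sum_squared_errors(xs, ys, ob, oa) for ob, oa in other_lines]

print(b, a, best_sse, other_sses)
```

Every alternative line tried here has a larger sum of squared errors than the fitted one, which is what the least-squares criterion guarantees.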

Linear Regression • Question: What is the relationship between U.K. and U.S. stock returns? • U.K. monthly return is the y variable • U.S. monthly return is the x variable

Correlation tells the strength of relationship between x and y. Relationship may not be linear.

Linear Regression • A regression creates a model of the relationship between x and y. • It fits a line to the scatter plot by minimizing the squared vertical distance between y and the line, that is, the sum of squared errors $\sum (y - y')^2$. • If the correlation is significant, then carry out a regression analysis.

Linear Regression The slope is calculated as: Tells you the change in the dependent variable for every unit change in the independent variable.
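One standard way to write the least-squares slope and intercept is shown below, using $\bar{x}$ and $\bar{y}$ for the sample means and $s_x$, $s_y$ for the sample standard deviations. This notation is assumed here and may differ from the form shown in the lecture.

```latex
\[
b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^{2}}
  = r\,\frac{s_y}{s_x},
\qquad
a = \bar{y} - b\,\bar{x}
\]
```

The second form makes the link to correlation explicit: the slope has the same sign as r and is zero exactly when r is zero.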

The coefficient of determination or R-square measures the variation explained by the best-fit line as a percent of the total variation:
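Written out with $y'_i$ for the predicted values and $\bar{y}$ for the mean of y (notation assumed here), a standard form of this ratio is:

```latex
\[
R^{2} = \frac{\sum_{i}(y'_i - \bar{y})^{2}}{\sum_{i}(y_i - \bar{y})^{2}}
      = 1 - \frac{\sum_{i}(y_i - y'_i)^{2}}{\sum_{i}(y_i - \bar{y})^{2}}
\]
```

The numerator of the first fraction is the variation explained by the line, and the denominator is the total variation in y.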

Regression Graphic – Regression Line • if x = 18 then y’ = 47 • if x = 24 then y’ = 20

Regression Equation • y’ = bx + a • y’ = predicted value of y • b = slope of the line • x = value of x that you plug in • a = y-intercept (where the line crosses the y-axis) • In this case… • y’ = -4.263(x) + 125.401 • So if the distance is 20 feet • y’ = -4.263(20) + 125.401 • y’ = -85.26 + 125.401 • y’ = 40.141
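The same plug-in calculation can be sketched in code. The slope and intercept are the values given above; the function name `predict` and the constant names are hypothetical.

```python
# Coefficients from the regression equation y' = -4.263(x) + 125.401
SLOPE = -4.263
INTERCEPT = 125.401

def predict(x, b=SLOPE, a=INTERCEPT):
    """Predicted value y' for a given x, using y' = b*x + a."""
    return b * x + a

y_at_20 = predict(20)
print(round(y_at_20, 3))
```

Calling `predict(20)` reproduces the hand calculation, y’ = 40.141.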

SPSS Regression Set-up • “Criterion,” • y-axis variable, • what you’re trying to predict • “Predictor,” • x-axis variable, • what you’re basing the prediction on

Getting Regression Info from SPSS • y’ = b (x) + a • b = -4.263 (slope) • a = 125.401 (y-intercept) • y’ = -4.263(20) + 125.401

Extrapolation • Interpolation: Using a model to estimate Y for an X value within the range on which the model was based. • Extrapolation: Estimating based on an X value outside the range. • Interpolation Good, Extrapolation Bad.
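One way to guard against accidental extrapolation is to refuse predictions outside the observed X range. The sketch below does this; the function name `predict_in_range` and the range of 10 to 30 are hypothetical, since the range of the lecture's data is not given.

```python
def predict_in_range(x, b, a, x_min, x_max):
    """Predict y' = b*x + a, but only for x inside the range the model was fit on."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the fitted range [{x_min}, {x_max}]; "
            "this would be extrapolation"
        )
    return b * x + a

# Hypothetical fitted range; slope and intercept reuse the lecture's example values
X_MIN, X_MAX = 10, 30
inside = predict_in_range(20, -4.263, 125.401, X_MIN, X_MAX)   # interpolation

try:
    predict_in_range(50, -4.263, 125.401, X_MIN, X_MAX)        # extrapolation
    refused = False
except ValueError:
    refused = True

print(inside, refused)
```

Raising an error is a deliberately strict choice; a warning would also work where extrapolated values are still wanted but need to be flagged.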

Nixon’s Graph: Economic Growth • [Figure, built up over four slides, with the labels “Start of Nixon Adm.”, “Now”, and “Projection”]

Conditions for regression • “Straight enough” condition (linearity) • Errors are mostly independent of X • Errors are mostly independent of anything else you can think of • Errors are more-or-less normally distributed

General ANOVA Setting: Comparisons of 2 or more means • Investigator controls one or more independent variables • Called factors (or treatment variables) • Each factor contains two or more levels (or groups or categories/classifications) • Observe effects on the dependent variable • Response to levels of independent variable • Experimental design: the plan used to collect the data

Logic of ANOVA • Each observation is different from the Grand (total sample) Mean by some amount • There are two sources of variance from the mean: • 1) That due to the treatment or independent variable • 2) That which is unexplained by our treatment
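The two sources of variance correspond to a standard partition of the total sum of squares. In the notation assumed here, $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ is the mean of group $j$, $n_j$ is its size, and $\bar{y}$ is the grand mean:

```latex
\[
\underbrace{\sum_{j}\sum_{i}(y_{ij} - \bar{y})^{2}}_{\text{total}}
=
\underbrace{\sum_{j} n_j(\bar{y}_j - \bar{y})^{2}}_{\text{due to treatment (between groups)}}
+
\underbrace{\sum_{j}\sum_{i}(y_{ij} - \bar{y}_j)^{2}}_{\text{unexplained (within groups)}}
\]
```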

One-Way Analysis of Variance • Evaluate the difference among the means of two or more groups • Examples: • Accident rates for 1st, 2nd, and 3rd shift • Expected mileage for five brands of tires • Assumptions • Populations are normally distributed • Populations have equal variances • Samples are randomly and independently drawn
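A minimal sketch of the one-way ANOVA F statistic follows: it splits the variation into between-group and within-group sums of squares and takes the ratio of their mean squares. The function name `one_way_anova_f` and the three small groups are made up for illustration; they are not real accident-rate or mileage data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean (treatment effect)
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean (unexplained)
    ss_within = sum(
        sum((v - m) ** 2 for v in g) for g, m in zip(groups, group_means)
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the within-group scatter would suggest; the value is then compared with an F distribution on (df_between, df_within) degrees of freedom to get a p-value.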
