## Bivariate Data, Correlation, Coefficient of Determination, Regression, One-way Analysis of Variance (ANOVA)

**Lecture #9**

Bivariate data, Correlation, Coefficient of Determination, Regression, One-way Analysis of Variance (ANOVA)

**Bivariate Data**

• Bivariate data are just what they sound like: data with measurements on two variables; let’s call them X and Y
• Here, we will look at two continuous variables
• We want to explore the relationship between the two variables
• Example: fasting blood glucose and ventricular shortening velocity

**Scatterplot**

• We can graphically summarize a bivariate data set with a scatterplot (also sometimes called a scatter diagram)
• It plots values of one variable on the horizontal axis and values of the other on the vertical axis
• It can be used to see how values of 2 variables tend to move with each other (i.e. how the variables are associated)

**Numerical Summary**

• Typically, a bivariate data set is summarized numerically with 5 summary statistics
• These provide a fair summary for scatterplots with the same general shape as we just saw, like an oval or an ellipse
• We can summarize each variable separately: X mean, X SD; Y mean, Y SD
• But these numbers don’t tell us how the values of X and Y vary together

**Pearson’s Correlation Coefficient “r”**

• “r” indicates…
  • strength of relationship (strong, weak, or none)
  • direction of relationship
    • positive (direct): variables move in same direction
    • negative (inverse): variables move in opposite directions
• r ranges in value from -1.0 to +1.0
• Scale labels at -1.0, 0.0, and +1.0: Strong Negative, No Relationship,
Strong Positive

**Correlation (cont)**

Correlation is the relationship between two variables.

**What r is...**

• r is a measure of LINEAR ASSOCIATION
• The closer r is to -1 or 1, the more tightly the points on the scatterplot are clustered around a line
• The sign of r (+ or -) is the same as the sign of the slope of the line
• When r = 0, the points are not LINEARLY ASSOCIATED; this does NOT mean there is NO ASSOCIATION

**...and what r is not**

• r is a measure of LINEAR ASSOCIATION
• r does NOT tell us if Y is a function of X
• r does NOT tell us if X causes Y
• r does NOT tell us if Y causes X
• r does NOT tell us what the scatterplot looks like

**r 0: outliers**

[Figure: outliers]

**Correlation is NOT causation**

• You cannot infer that because X and Y are highly correlated (r close to -1 or 1), X is causing a change in Y
• Y could be causing X
• X and Y could both be varying along with a third, possibly unknown factor (either causal or not)

**Reading Correlation Matrix**

• r = -.904
• p = .013: the probability of getting a correlation this size by sheer chance. Reject Ho if p ≤ .05.
• sample size
• r (4) = -.904, p < .05

**Interpretation of Correlation**

Correlations
• from 0 to 0.25 (0 to -0.25) = little or no relationship;
• from 0.25 to 0.50 (-0.25 to -0.50) = fair degree of relationship;
• from 0.50 to 0.75 (-0.50 to -0.75) = moderate to good relationship;
• greater than 0.75 (or below -0.75) = very good to excellent relationship.

**Limitations of Correlation**

• linearity:
  • can’t describe non-linear relationships
  • e.g., relation between anxiety & performance
• truncation of range:
  • underestimates strength of relationship if you can’t see the full range of x values
• no proof of causation
  • third variable problem: there could be a 3rd variable causing change in both variables
  • directionality: can’t be sure which way causality “flows”

**Coefficient of Determination r²**

• The square of the correlation, r², is the proportion of variation in the values of y that is explained by the regression model with x.
• Amount of variance accounted for in y by x
• Percentage increase in accuracy you gain by using the regression line to make predictions
• 0 ≤ r² ≤ 1.
• The larger r², the stronger the linear relationship.
• The closer r² is to 1, the more confident we are in our prediction.

**Linear Regression**

• Correlation measures the direction and strength of the linear relationship between two quantitative variables
• A regression line
  • summarizes the relationship between two variables if the form of the relationship is linear.
  • describes how a response variable y changes as an explanatory variable x changes.
  • is often used as a mathematical model to predict the value of a response variable y based on a value of an explanatory variable x.

**(Simple) Linear Regression**

• Refers to drawing a (particular, special) line through a scatterplot
• Used for 2 broad purposes:
  • Estimation
  • Prediction

**Formula for Linear Regression**

y = bx + a

• b: slope, or the change in y for every unit change in x
• a: y-intercept, or the value of y when x = 0
• Y variable plotted on vertical axis
• X variable plotted on horizontal axis

**Interpretation of parameters**

• The regression slope is the average change in Y when X increases by 1 unit
• The intercept is the predicted value for Y when X = 0
• If the slope = 0, then X does not help in predicting Y (linearly)

**Which line?**

• There are many possible lines that could be drawn through the cloud of points in the scatterplot.

**Least Squares**

• Q: Where does this equation come from?
• A: It is the line that is ‘best’ in the sense that it minimizes the sum of the squared errors in the vertical (Y) direction

[Figure: points (*) with vertical errors to the fitted line; axes X and Y]

**Linear Regression**

• U.K. monthly return is the y variable
• U.S. monthly return is the x variable
• Question: What is the relationship between U.K. and U.S. stock returns?

Correlation tells the strength of relationship between x and y. Relationship may not be linear.

**Linear Regression**

A regression creates a model of the relationship between x and y.
It fits a line to the scatter plot by minimizing the distance between y and the line. If the correlation is significant then create a regression analysis.

**Linear Regression**

The slope is calculated as:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

It tells you the change in the dependent variable for every unit change in the independent variable.

The coefficient of determination or R-square measures the variation explained by the best-fit line as a percent of the total variation:

$$R^2 = \frac{\sum (y' - \bar{y})^2}{\sum (y - \bar{y})^2}$$

**Regression Graphic: Regression Line**

• if x = 18 then y’ = 47
• if x = 24 then y’ = 20

**Regression Equation**

• y’ = bx + a
  • y’ = predicted value of y
  • b = slope of the line
  • x = value of x that you plug in
  • a = y-intercept (where the line crosses the y-axis)
• In this case…
  • y’ = -4.263(x) + 125.401
• So if the distance is 20 feet
  • y’ = -4.263(20) + 125.401
  • y’ = -85.26 + 125.401
  • y’ = 40.141

**SPSS Regression Set-up**

• “Criterion”: y-axis variable, what you’re trying to predict
• “Predictor”: x-axis variable, what you’re basing the prediction on

**Getting Regression Info from SPSS**

• y’ = b(x) + a
• y’ = -4.263(20) + 125.401

[Figure: SPSS output with b and a marked]

**Extrapolation**

• Interpolation: using a model to estimate Y for an X value within the range on which the model was based.
• Extrapolation: estimating based on an X value outside the range.
• Interpolation good, extrapolation bad.

**Nixon’s Graph: Economic Growth**

Graph labels: Start of Nixon Adm.

**Nixon’s Graph: Economic Growth**

Graph labels: Start of Nixon Adm., Now

**Nixon’s Graph: Economic Growth**

Graph labels: Start of Nixon Adm.,
Projection, Now

**Conditions for regression**

• “Straight enough” condition (linearity)
• Errors are mostly independent of X
• Errors are mostly independent of anything else you can think of
• Errors are more-or-less normally distributed

**General ANOVA Setting: Comparisons of 2 or more means**

• Investigator controls one or more independent variables
  • Called factors (or treatment variables)
  • Each factor contains two or more levels (or groups or categories/classifications)
• Observe effects on the dependent variable
  • Response to levels of independent variable
• Experimental design: the plan used to collect the data

**Logic of ANOVA**

• Each observation is different from the Grand (total sample) Mean by some amount
• There are two sources of variance from the mean:
  • 1) That due to the treatment or independent variable
  • 2) That which is unexplained by our treatment

**One-Way Analysis of Variance**

• Evaluate the difference among the means of two or more groups
  • Examples: accident rates for 1st, 2nd, and 3rd shift; expected mileage for five brands of tires
• Assumptions
  • Populations are normally distributed
  • Populations have equal variances
  • Samples are randomly and independently drawn
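
The logic of ANOVA described above, splitting the variation around the grand mean into a part due to the treatment and a part left unexplained, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own procedure: the function name `one_way_anova` and the accident counts per shift are hypothetical, and the sketch stops at the F statistic and its degrees of freedom rather than looking up a p-value.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Variation due to the treatment: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unexplained variation: each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F is the ratio of the two mean squares.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical accident counts for 1st, 2nd, and 3rd shift
# (made-up numbers for illustration, not data from the lecture).
shifts = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_stat, df_b, df_w = one_way_anova(shifts)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With these made-up numbers the group means are 5, 7, and 9, so the between-group mean square is 12 and the within-group mean square is 1, giving F = 12 on 2 and 6 degrees of freedom. A large F suggests the group means differ by more than the unexplained variation would account for, provided the assumptions listed above hold.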