
Correlation and Regression


Presentation Transcript


  1. Correlation and Regression

  2. Correlation • A measure of the degree to which 2 (generally continuous) variables are related. • Measure of relationship or association. • Time studying–grades: more studying, higher grades. • Wage–job satisfaction: higher salaries, more job satisfaction. • Anxiety–grades: higher anxiety, lower grades. • These relationships tell us nothing about causality.

  3. CORRELATION does not equal CAUSATION! • 1) Does A cause B or does B cause A? • 2) Could be a third variable: • More interesting or engaging jobs might pay better. • So, higher salaries might not produce more satisfaction. • Job satisfaction may result from having more interesting jobs, which also tend to pay better. • Let’s say there is a correlation between colds and sleep (more colds, less sleep). • Think of a 3rd variable that might be responsible for this relationship.

  4. Scatterplots • Useful way to look at the relationship between two variables: • A figure in which the individual data points are plotted in two-dimensional space • Every individual is represented by a point in 2 dimensional space. • Ex. Salary (X) and Job Satisfaction (Y) • Predictor Variable – variable from which a prediction is made (X axis). • Criterion Variable – variable to be predicted (Y axis). • We likely want to predict Job Satisfaction from our knowledge of Salary.
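
A minimal sketch of how such a scatterplot could be drawn (the salary and satisfaction numbers below are invented purely for illustration, not data from the slides):

```python
# Hypothetical data: salary as the predictor (X), job satisfaction as the criterion (Y).
import matplotlib.pyplot as plt

salary = [32, 41, 45, 50, 58, 63, 70]                 # X, in $1000s (made up)
satisfaction = [4.1, 5.0, 4.8, 5.9, 6.4, 6.2, 7.1]    # Y, on a 1-10 scale (made up)

plt.scatter(salary, satisfaction)                     # one point per individual
plt.xlabel("Salary (predictor, X)")
plt.ylabel("Job satisfaction (criterion, Y)")
plt.title("Scatterplot: each individual is a point in 2-dimensional space")
plt.show()
```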

  5. Nature of the relationship • Positive relationship: As X increases, Y increases. • Negative relationship: As X increases, Y decreases. • No relationship: As X increases, Y neither increases nor decreases. • If we draw a line through the points in a scatterplot that best fits all the points, we can get an idea about the nature of the relationship.

  6. Very Small Relationship

  7. Positive Relationship

  8. Negative Relationship

  9. Measuring Relationships • Correlation coefficient • The most common is Pearson’s r, or just r for short. • This measures the relationship between 2 continuous variables. • Ranges from –1 to +1 • Positive slope (r = +.01 to +1): as one variable increases, the other also increases. • In other words, the variables are varying in the same direction. • If there is no relationship between the variables, the correlation would be 0.0. • Negative slope (r = –.01 to –1.00): as one variable increases, the other decreases.

  10. r • We look at the sign of the correlation to determine its direction • We look at the absolute value to determine its magnitude. The closer the absolute value of the correlation is to 1, the stronger the relationship between the two variables.

  11. Types of relationship • Correlation coefficients measure the degree of linear relationship between two variables. • Of course, 2 variables can be related in other ways. • For example, we could have curvilinear relationships (U or inverted U, for example). • I like to call this the beer-fun curve. • If you do not obtain a big r, this just means you do not have a linear relationship between the 2 variables. • The two variables might still be related in some other way. • This is why scatterplots are handy… you can get a feel for the data just by looking at it sometimes.

  12. Pearson Product-Moment Correlation Coefficient • Based on covariance • Degree to which 2 variables vary together • Covariance (ranges from negative to positive infinity) • High pos. cov.: Very + scores on X paired with very + scores on Y • Small pos. cov.: Very + scores on X paired with somewhat + scores on Y • High neg. cov.: Very + scores on X paired with very – scores on Y • Small neg. cov.: Very + scores on X paired with somewhat – scores on Y • No cov.: High + scores on X paired with both + and – scores on Y

  13. Covariance • For each person, we look at how much each score deviates from the mean. • If both variables deviate from the mean by the same amount, they are likely related. • Variance tells us how much scores on one variable deviate from the mean for that variable. • Covariance is very similar. • It tells us by how much scores on two variables differ from their respective means.

  14. Covariance • Calculate the error between the mean and each subject’s score for the first variable (X). • Calculate the error between the mean and their score for the second variable (Y). • Multiply these error values. • Add these values and you get the sum of cross-product deviations. • The covariance is the average of the cross-product deviations.
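
A minimal sketch of that recipe (sample covariance, dividing by N − 1; the two lists are hypothetical scores, and any paired X, Y data would do):

```python
# Covariance, following the steps above: deviations from each mean,
# multiplied pairwise, summed (cross-product deviations), then averaged.
def covariance(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross_products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    return sum(cross_products) / (n - 1)   # "average" cross-product deviation

print(covariance([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))   # 2.0 (a positive covariance)
```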

  15. Note the similarity between these equations

  16. Note the similarity between these equations
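
For reference, the two formulas being compared are the sample variance and the sample covariance (written in the notation used elsewhere in these slides):

Variance: sx² = Ʃ(X − X̄)(X − X̄) / (N − 1)

Covariance: covxy = Ʃ(X − X̄)(Y − Ȳ) / (N − 1)

Put side by side like this, the variance is simply the covariance of a variable with itself.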

  17. Covariance: why not stop there? • Covariance depends upon the units of measurement. • E.g., the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to km, the covariance is about 11. • Solution: standardization • Divide by the standard deviations of both variables. • The standardised version of covariance is known as the.... • Correlation coefficient!!!!!!

  18. Pearson product moment Correlation coefficient:
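
In standard form (the standardized covariance arrived at on the previous slide):

r = covxy / (sx · sy) = Ʃ(X − X̄)(Y − Ȳ) / [(N − 1) · sx · sy]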

  19. Compute r, to what end? • What can we make of a correlation = .09? • Is that still positive? Is that meaningful? • Just like we will always get a difference between two means, we will always get some sort of correlation between two variables just due to random variability in our sample. • The question is whether our obtained correlation is due to error or whether it represents some real relationship. • What do we need? Hypothesis testing!!!!!!!

  20. Hypotheses • ρ, or rho, is the POPULATION correlation. • Ho: ρ = 0 • H1: ρ ≠ 0 • So, the null hypothesis is saying that there is no relationship between our two variables, or that the population correlation is zero. • The alternative hypothesis is saying that there IS a relationship between our two variables, or that the population correlation is NOT zero. • This is a non-directional example, but we can have directional predictions too. • Ho: ρ ≥ 0 OR ρ ≤ 0 • H1: ρ < 0 OR ρ > 0

  21. Set up your criterion • Need the df = n − 2 and α • rcrit tells you: If your calculated correlation exceeds this critical correlation, you can conclude there is a relationship between your two variables in the population, such that… • Otherwise you retain the null and conclude that there is no relationship between your two variables.

  22. An example! • Is there a relationship between individuals’ O-Span (X) and their need for cognition (Y)? • Measure both variables on continuous scales. • ƩX = 36 • ƩX² = 218 • ƩY = 37 • ƩY² = 225 • ƩXY = 219 • Calculation Time!!!
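
A quick sketch of the calculation from these raw sums using the computational (raw-score) formula for r; n = 7 is taken from the df = n − 2 = 5 on the next slide:

```python
# r = (n*ΣXY - ΣX*ΣY) / sqrt((n*ΣX² - (ΣX)²) * (n*ΣY² - (ΣY)²))
from math import sqrt

n, SX, SX2, SY, SY2, SXY = 7, 36, 218, 37, 225, 219

numerator = n * SXY - SX * SY                               # 7*219 - 36*37 = 201
denominator = sqrt((n * SX2 - SX**2) * (n * SY2 - SY**2))   # sqrt(230 * 206)
r = numerator / denominator
print(round(r, 3))                                          # 0.923
```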

  23. Decision criteria • 1- or 2-tailed prediction? • Let’s go 2-tailed, so • Ho: ρ = 0 • H1: ρ ≠ 0 • α = .05 • df = n − 2 = 7 − 2 = 5 • rcrit = .754

  24. OK, what do we conclude about the Null?

  25. R² • The variability in Y that can be accounted for by X • R² = just square our correlation • R² = .923² = .8519 • Interpretation? • A proportion of .8519 of the variability in need for cognition scores is accounted for by O-span. • OR, 85.19% of the variability in need for cognition scores is accounted for by O-span • OR, O-span accounts for 85.19% of the variability in need for cognition scores.

  26. Is one r different from another r? • We can also test the significance of r by converting it to a z (which is normally distributed). • We can also take the zs for 2 correlations and compare them using a t-test. • Field explains how to do this. • There are a lot of calculations; it is mechanical, and I am not going to spend time on this. • You can also do this online quickly: • http://faculty.vassar.edu/lowry/rdiff.html
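
As a sketch of what such a comparison involves, here is the independent-samples version based on Fisher’s r-to-z transformation (the r and n values below are hypothetical):

```python
# Compare two correlations from independent samples via Fisher's r-to-z.
from math import log, sqrt

def fisher_z(r):
    return 0.5 * log((1 + r) / (1 - r))          # Fisher's r-to-z transformation

def compare_independent_rs(r1, n1, r2, n2):
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))       # SE of the difference in z's
    return (fisher_z(r1) - fisher_z(r2)) / se    # refer to the normal distribution

print(round(compare_independent_rs(0.60, 50, 0.30, 50), 2))   # ≈ 1.86
```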

  27. Factors that influence correlation • 1) Range restrictions • Range over which X or Y varies is restricted • Ex. S.A.T. and G.P.A. • With range restriction, the correlation could go up or down, but usually it decreases. • 2) Nonlinearity of relationship • Usually get a weaker relationship. • Mathematically, correlation is meant to measure linear relationships. • With range restrictions, the correlation could go up if you eliminate the curvilinear aspect of the relationship. • 3) The effect of heterogeneous samples • Sample observations could be subdivided into 2 distinct sets on the basis of some other variable • Ex. Movies and dexterity with male and female subgroups (draw) • Really there is no relationship between movies and dexterity; however, because females score higher than males on both variables, there is a positive correlation.

  28. Types of Correlations • 1) Pearson’s r: • Both variables are continuous • 2) Spearman’s correlation coefficient for ranked data (rs) • Use the same formula as you do for the Pearson correlation coefficient • Ex. 1st in graduating class, 2nd, 3rd, etc., ranked by IQ. • 3) Point-biserial correlation (rpb) • Correlation coefficient when one of your two variables is dichotomous and the other is continuous. • Ex. M/F and liking for romance movies. • 4) Phi • Correlation coefficient when both of the variables are dichotomous. • Ex. M/F and whether they have traveled out of the US.

  29. Partial and Semi-Partial Correlations • Partial correlation: • Measures the relationship between two variables, controlling for the effect that a third variable has on them both. • Semi-partial correlation: • Measures the relationship between two variables controlling for the effect that a third variable has on only one of the others.

  30. [Diagram (Andy Field): overlapping circles for Revision and Exam Anxiety, illustrating partial correlation vs. semi-partial correlation]

  31. REGRESSION • (Simple) Linear Regression is a model to predict the value of one variable from another. • Used to predict values of an outcome from one predictor. • Predict NC with O-span • Multiple Regression is an extension: • Used to predict values of an outcome from several predictors. • Predict NC with O-span and IQ • It is a hypothetical model of the relationship between several variables. • We can use regression to predict specific values of our outcome variable given specific values of our predictor variables.

  32. How do we make our predictions? • With a regression line (simple in this case): Ŷi = b0 + b1Xi • Ŷi (“Y-hat”) = Predicted value of Y • Value we will estimate with the regression equation • b1 = Slope of the regression line. • The amount of difference in Y associated with a 1-unit difference in X • Regression coefficient for the predictor • Direction/strength of relationship • b0 = Intercept (the predicted value of Y when X = 0; where the line intercepts the Y axis) • Xi = Value of the predictor (O-span) variable.

  33. What does this line do? • Method of least squares • When drawing our line, we want the line that best goes through our data points. • We do this by minimizing error. • The line gives Ŷi = predicted values of Y • The data points are the observed Y values • Let’s draw this out for the O-span and NC data • (Y − Ŷi) = Error in prediction: the residual • The regression equation minimizes squared error, or variance around the line. • Error = Ʃ(Y − Ŷi)²

  34. Calculating the regression line
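
The calculation itself, sketched with the standard least-squares formulas and the sums from slide 22 (the slope and intercept values in the comments are my arithmetic, not numbers given on the slides):

```python
# b1 = SP / SSx  (slope),  b0 = mean(Y) - b1 * mean(X)  (intercept)
n, SX, SX2, SY, SXY = 7, 36, 218, 37, 219

SP  = SXY - SX * SY / n       # sum of cross-product deviations ≈ 28.714
SSx = SX2 - SX**2 / n         # sum of squared deviations of X ≈ 32.857

b1 = SP / SSx                 # slope ≈ 0.874
b0 = SY / n - b1 * SX / n     # intercept ≈ 0.791
print(round(b1, 3), round(b0, 3))
```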

  35. The line is a model • This line will fit the data as best as possible, but that does not mean it fits the data well. • As we discussed, we could have non-linear relationships. • How do we test the fit of the model to the data? • By assessing the degree to which the line captures variability (or minimizes error). • How do we get there? • With our old friends, SUM OF SQUARES!!

  36. What are we trying to predict? • Y! • The mean of Y is one type of model that predicts Y. • If we were to guess someone’s Y, we would use the mean of Y. • If X is not related to Y at all (i.e., if X does not predict Y at all), the line predicting Y with X would be what? • Parallel with the X-axis (i.e., slope = 0!) • What would the Y-intercept of that line be? • The mean of Y! • So, we are going to start by measuring the total variability in Y. • How do we do this? Look at the sum of the squared deviations of Ys from the mean (Y predicted from the “mean” model). • That is, calculate SSTOTAL for Y!!! • You can do it!

  37. The alternative Model • Another way to predict Y is by using X. If X IS related to Y in a linear fashion, the slope of that line should NOT = 0. • We can create a line that predicts Y with X. • THIS line, or regression equation, minimizes squared error, or variance around the line. • Of course 999,999,999 times out of a billion, this regression line will not perfectly predict Y. We will have…error. • What is another word for error? • Chance. • What do we do with chance when computing test statistics? • That’s right, we stick it in the denominator of whatever ratio we compute. • So, we need to know how much variability there is between the model line (regression equation) and the Y values. • That is, we need a measure of variability around the model line. • Each Y deviates a little around the line, right? So, what should we call the SS that measures these deviations from the line? • SSRESIDUAL !!!

  38. The alternative Model continued • We calculate a regression line that minimizes error. • We can measure that error in terms of variability. • Is this model line different from the mean line? • That is, does the model line account for more error (or significantly reduce the amount of unexplained variability relative to the mean model)? • That is, does the slope of the model line differ from the slope of the mean model (0)? • To get at this, we measure the variability of the model line from the mean model. • At each value of X, how different is the Y predicted by the model line and the mean model? • Square all those deviations, add them up and what do you have? • SSMODEL !!! 

  39. Sums of Squares
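
In the notation of the surrounding slides, the three sums of squares are:

SStotal = Ʃ(Y − Ȳ)² — deviations of the data from the mean model

SSresidual = Ʃ(Y − Ŷ)² — deviations of the data from the regression line

SSmodel = Ʃ(Ŷ − Ȳ)² — deviations of the regression line from the mean model

and SStotal = SSmodel + SSresidual.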

  40. Summary of SS • SST: Total variability • OR, variability between Y values and the Y mean. • SSR: Residual/error variability • OR, variability between the regression model and the Y values. • SSM: Model variability • OR, difference in variability between the regression model and the Y mean. • OR, variability between the model and the Y mean.

  41. Testing the Model: ANOVA • [Diagram: SST (total variance in the Y data) split into SSM (improvement due to the model) and SSR (error in model)] • If the model results in better prediction than using the mean, SSM should be greater than SSR

  42. Testing the Model: ANOVA • Mean Squares • Sums of Squares / respective df (as with ANOVA) • This gives us average variability, VARIANCE, or Mean Squares (MS). • The Mean Square terms can be used to calculate F!

  43. How to Calculate SS • SSTotal is what we know. • SSResidual = Ʃ(Y − Ŷ)², conceptually • Standard error of the estimate = the average deviation of Y from the line, or: • Calculation-wise, st. error of the estimate = sY √[(1 − r²)(N − 1)/(N − 2)] • Square that and you have MSResidual • Multiply by df (N − 2) and you have SSresidual

  44. SS continued • SSModel? • How do we find that? • SSmodel = SStotal – SSresidual • Using our O-span and NC data, we can calculate a regression line and test whether it is “significant.” • SStotal (in Y) = 29.429 • St. error of estimate ≈ .931 • MSresidual = .867; SSresidual = .867 x (7 − 2) = 4.335 • SSmodel = 29.429 – 4.335 = 25.094
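
These values can be checked directly from the raw sums on slide 22; a sketch is below (small differences from the slide values reflect rounding of r and MSresidual along the way):

```python
# Reproduce SStotal, SSmodel, SSresidual, and F from the raw sums.
n, SX, SX2, SY, SY2, SXY = 7, 36, 218, 37, 225, 219

SS_total = SY2 - SY**2 / n                      # ≈ 29.429
SP, SSx  = SXY - SX * SY / n, SX2 - SX**2 / n
SS_model = SP**2 / SSx                          # ≈ 25.094
SS_resid = SS_total - SS_model                  # ≈ 4.335

MS_model = SS_model / 1                         # df_model = number of predictors = 1
MS_resid = SS_resid / (n - 2)                   # df_residual = n - 2 = 5
F = MS_model / MS_resid                         # ≈ 28.94
print(round(SS_total, 3), round(SS_model, 3), round(SS_resid, 3), round(F, 2))
```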

  45. F • MSmodel = 25.094 / 1 = 25.094 • F = MSmodel / MSresidual = 25.094 / .867 ≈ 28.94 • Interpretation? • The line predicting NC with O-span accounts for significantly more variance in Y than a line with no slope intersecting the Y-axis at the mean of Y. • The line predicts more variance in Y than would be expected from chance alone. • How MUCH more?

  46. Testing the Model: R2 • R2 • The proportion of variance accounted for by the regression model. • The Pearson Correlation Coefficient Squared
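
Using the ANOVA numbers above: R² = SSmodel / SStotal = 25.094 / 29.429 ≈ .85, which matches squaring the correlation (r = .923, r² ≈ .852) up to rounding.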

  47. Testing regression coefficients • Is the slope of our regression line significantly different from 0? • We have one predictor, and our ANOVA is significant, so in this case we know the answer is yes. • How can we test the coefficient anyway? • With a t-test.
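
For the single-predictor case, the test is t = b1 / SE(b1) with df = N − 2, and that t is simply the square root of the model F (here √28.94 ≈ 5.4), so the t-test on the slope and the ANOVA necessarily agree.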
