Correlation & Regression

Correlation • t-tests and ANOVA examine mean differences on a DV between two or more levels of one or more IVs • e.g., differences between males and females (2 levels of the IV “gender”) on exam scores • What if, instead of average differences, we were more interested in the relationship between two variables? • “Relationship” = how one variable changes as a function of another variable

Correlation • e.g., the relationship between anxiety prior to a medical procedure and the patient’s post-op recovery • This type of question concerns what is called a correlation • Correlation = relationship between two variables • NOTE: if we were comparing average post-op recovery (the DV) between groups high and low in pre-op anxiety (2 levels of the IV “anxiety”), we would be looking at mean differences, and a t-test or ANOVA would be more appropriate than correlation

Correlation • The easiest way to represent this relationship/correlation is with a scatterplot • Scatterplot = a graph in which each individual data point is plotted in two dimensions, one variable per axis

Correlation • Predictor Variable = traditionally the variable on the x-axis (in this case “Depression”) • Criterion Variable = traditionally the variable on the y-axis (in this case “Pessimism”) • Best-Fit Line/Regression Line = the straight line that lies as close as possible to all of the data points, i.e., the line that best represents the data

Correlation • Regression Line • “Best” fit = line that minimizes the total squared vertical distance from all data points (i.e., the sum of squared residuals) • Residual = amount by which a data point deviates from this line (observed Y minus predicted Y)
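
The idea of a best-fit line and its residuals can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual least-squares definition of "best fit"; the depression and pessimism scores are invented for the example and are not the data behind the slides.

```python
from statistics import mean

def best_fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Total squared vertical distance between the points and a given line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Hypothetical scores, invented for illustration only
depression = [2, 10, 18, 25, 33, 41]
pessimism = [1, 2, 3, 4, 4, 6]

slope, intercept = best_fit_line(depression, pessimism)
# Residual = observed value minus the value predicted by the line
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(depression, pessimism)]

best = sum_squared_residuals(depression, pessimism, slope, intercept)
tilted = sum_squared_residuals(depression, pessimism, slope + 0.01, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"squared error: best-fit={best:.4f}, slightly tilted line={tilted:.4f}")
```

Any other line, such as the slightly tilted one here, has a larger sum of squared residuals than the fitted line.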

Correlation • Although the predictor is usually the variable on the x-axis and the criterion the variable on the y-axis, these definitions are often not adhered to, and the labels are assigned arbitrarily • Also, calling one variable the predictor does not mean that it “predicts” the criterion in the sense of telling you what the criterion will be before it occurs • e.g., to say that depression predicts pessimism does not mean that depression comes first and causes you to be pessimistic!

Correlation • Correlation does not equal causation! • The only way you can say that one variable predicts another in time is through the design of your study • If depression were assessed in January and pessimism in December, and the two were found to be related, then you could say that one predicts the other in time • Statistical prediction ≠ “prediction” in time • If the two variables were measured at the same time, we do not know which one caused the other, or whether either did

Correlation • To determine causation (that one variable caused another) we need to show several things: • that the predictor preceded the criterion in time (this also shows that the criterion did not cause the predictor) • that other variables did not cause both the criterion and the predictor at the same time, resulting in their relationship • [Diagram labels: IV, DV, Var 1]

Correlation • e.g., suppose we were studying the relationship (correlation) between two variables: the length of grass and ice cream consumption • If they were measured simultaneously, it would be impossible to tell which caused which • If both were measured at two time points, July and December, we would find that they increase and decrease together (i.e., one does not seem to cause the other): no causation • If we measured temperature as well, we would find that the two are correlated because increases in temperature cause both, which explains why they increase and decrease at the same time

Correlation • Correlation is represented by the Pearson Product-Moment Correlation Coefficient (r) • r can range from -1 to 1, where 1 represents a perfect positive relationship, -1 a perfect negative relationship, and 0 no linear relationship between the two variables • Strong positive and strong negative relationships are both robust and generally meaningful; a negative relationship is not “bad” • Pearson’s r is only used when the two variables are continuous/dimensional

Correlation • Positive Relationship (r = .82) • As BDI2TOT increases, MASQGDD also increases

Correlation • Negative Relationship (r = -.679) • As MASQAD increases, TMMSREP decreases

Correlation • No Relationship (r = .00) • Information about Explanatory Flexibility tells you nothing about Emotional Insight

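Correlation • Pearson’s r is heavily reliant on the covariance • $cov_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{N - 1}$ • If variance = $s_x^2 = \frac{\sum (x - \bar{x})^2}{N - 1}$ • …then cov is just the average variability shared by x and y (the variance is the covariance of a variable with itself)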
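
A short sketch of the covariance, assuming the usual sample formula with N - 1 in the denominator. The numbers are made up for illustration; the point is that the covariance of a variable with itself is just its variance.

```python
from statistics import mean, variance

def covariance(x, y):
    """Sample covariance: average cross-product of deviations, using N - 1."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
cov_xx = covariance(x, x)  # covariance of x with itself
var_x = variance(x)        # sample variance from the standard library

print(cov_xy, cov_xx, var_x)
```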

Correlation • Standard error of the estimate ($s_{y.x}$) = typical amount each point deviates from the best-fit line • $s_{y.x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}}$ • If Ŷ is the point on the best-fit line (predicted value of Y), then $s_{y.x}$ = standard deviation of the residuals, and its square ($s_{y.x}^2$) = variance of the residuals = error variance
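
The standard error of the estimate can be computed directly from the residuals. The sketch below assumes the common textbook form with N - 2 in the denominator and uses invented data.

```python
from math import sqrt
from statistics import mean

def standard_error_of_estimate(x, y):
    """Standard deviation of the residuals around the least-squares line (N - 2 denominator)."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    # Residual = observed Y minus predicted Y (Y-hat)
    squared_residuals = [(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)]
    return sqrt(sum(squared_residuals) / (len(x) - 2))

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_yx = standard_error_of_estimate(x, y)
error_variance = s_yx ** 2  # squaring the standard error gives the error variance
print(f"s_y.x = {s_yx:.4f}, error variance = {error_variance:.4f}")
```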

Correlation • Pearson’s $r = \frac{cov_{xy}}{s_x s_y}$ • Correlation = amount of shared variability relative to total variability • Because it works like a proportion, the size of r ranges from 0 to 1.00, with the sign giving the direction (so r runs from -1.00 to 1.00) • In fact, squaring r (r²) gives the % of variability that is shared between x and y • Previous example of BDI2 and MASQGDD: r = .82; r² = .67 → 67% of the variance in BDI2 is predicted by MASQGDD
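
As a small worked example, r can be computed as the covariance divided by the product of the two standard deviations, and then squared to get the proportion of shared variance. The data below are invented; only the r = .82 figure comes from the slides.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson's r = cov_xy / (s_x * s_y), using sample (N - 1) statistics."""
    mx, my = mean(x), mean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov_xy / (stdev(x) * stdev(y))

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2  # proportion of variability shared by x and y
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")

# The slide's example: r = .82 means about 67% shared variance
slide_r_squared = round(0.82 ** 2, 2)
print(slide_r_squared)
```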

Correlation • Hypotheses in Correlation: • H0: ρ = 0 • ρ (rho) = correlation in population (parameter) • H1: ρ ≠ 0

Correlation • Assumptions of Correlation (Pearson’s r) • Linearity (violated by nonlinear/curvilinear relationships) • If the relationship between the two variables is not linear, and is instead U-shaped or inverted-U-shaped (like the outline of our normal distribution), no straight line will fit the data well, and it will seem as though the two variables are unrelated (r will be near 0) when in fact a relationship exists but is nonlinear

Correlation • In this example of a curvilinear relationship, the two variables are clearly related, yet their correlation is only r = -.205 • Note how the best-fit line does not represent the data points well
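
A deliberately extreme illustration of the same point: in the made-up data below, y is completely determined by x (y = x²), yet Pearson's r is zero because the relationship is U-shaped rather than linear. These are not the data from the slide.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's r from sums of cross-products and squared deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical U-shaped data: y depends perfectly on x, but not linearly
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curved = pearson_r(x, y)
print(f"r for a perfect U-shaped relationship: {r_curved:.3f}")
```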

Correlation • Assumptions of Correlation (Pearson’s r) • Normality • Both variables should be approximately normally distributed; otherwise the correlation can be distorted, typically appearing smaller than it is • If our data are non-normal, correlation coefficients other than r can be used

Correlation • We can also calculate a correlation if our data are ordinal instead of continuous/dimensional • Remember: data on an ordinal scale are ranked, which means that we can tell that one number is higher than another, but not how much higher (interval scales have this), and there is no true zero point (ratio scales have this) – e.g., 1st place, 2nd place, etc. = ordinal data • Correlation here is represented by Spearman’s rs • Difference between r and rs: rs only requires that the relationship be monotonic (consistently rising or falling), not linear – once data are arranged in rank order they can only go up or down; you can’t go from 1st place to 9th place to 2nd place if the places are arranged in order
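
One common way to obtain Spearman's rs is to convert each variable to ranks and then compute Pearson's r on the ranks. The sketch below assumes that approach (with tied values sharing their average rank) and uses invented data that rise monotonically but not linearly.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rs(x, y):
    """Spearman's rs computed as Pearson's r on the ranked data."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y always rises with x, but along a curve (y = x cubed)
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson_r(x, y)
rs = spearman_rs(x, y)
print(f"Pearson r = {r:.3f}, Spearman rs = {rs:.3f}")
```

Because the ranks of x and y agree perfectly, rs reaches 1 even though the raw relationship is curved and Pearson's r falls short of 1.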

Correlation • Other correlation coefficients • The Point Biserial Correlation coefficient (rpb) - If one variable is continuous/dimensional and the other dichotomous (a nominal scale where the variable can take only two possible values) • Dichotomous variables – e.g. Gender (Male/Female), Yes/No answers, Race (if it is coded as Caucasian or Minority), etc.

Correlation • Other correlation coefficients • Phi (Φ) – when both variables are dichotomous
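
Both the point-biserial coefficient and phi can be obtained by coding each dichotomous variable as 0/1 and computing Pearson's r as usual. The sketch below checks that against the standard shortcut formulas; the group codes and scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Point-biserial: one dichotomous variable (coded 0/1), one continuous variable
group = [0, 0, 0, 0, 1, 1, 1, 1]   # hypothetical two-category variable
score = [2, 3, 4, 3, 5, 6, 7, 6]   # hypothetical continuous scores
r_pb = pearson_r(group, score)

# Shortcut formula: (M1 - M0) / s * sqrt(p * q), with s the population SD of all scores
m0 = mean(s for g, s in zip(group, score) if g == 0)
m1 = mean(s for g, s in zip(group, score) if g == 1)
p = mean(group)
r_pb_formula = (m1 - m0) / pstdev(score) * sqrt(p * (1 - p))

# Phi: both variables dichotomous (coded 0/1)
a_var = [0, 0, 0, 0, 1, 1, 1, 1]
b_var = [0, 0, 0, 1, 0, 1, 1, 1]
phi = pearson_r(a_var, b_var)

# Shortcut formula from the 2 x 2 table of counts
n00 = sum(1 for u, v in zip(a_var, b_var) if (u, v) == (0, 0))
n01 = sum(1 for u, v in zip(a_var, b_var) if (u, v) == (0, 1))
n10 = sum(1 for u, v in zip(a_var, b_var) if (u, v) == (1, 0))
n11 = sum(1 for u, v in zip(a_var, b_var) if (u, v) == (1, 1))
phi_formula = (n00 * n11 - n01 * n10) / sqrt(
    (n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)
)

print(f"r_pb = {r_pb:.4f} (formula {r_pb_formula:.4f})")
print(f"phi  = {phi:.4f} (formula {phi_formula:.4f})")
```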

Correlation • Factors that bias correlation coefficients: • Range Restriction • Typically, restricting range reduces correlations • [Figure panels: Full Dataset (r = .82); Only BDI > 30 (r = .490)]

Correlation • However, restricting range increases correlations if the relationship is curvilinear, because the relationship is closer to linear within the restricted range • [Figure panels: Full Dataset (r = -.205); Only Var1 ≥ 5 (r = -.982)]
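
Both effects of range restriction can be reproduced with toy data. The sketch below is a simulation under stated assumptions, not a reanalysis of the slide's dataset: the linear case uses randomly generated scores (the variable names, slope, and noise level are arbitrary choices), and the curvilinear case uses an invented inverted-U pattern.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Case 1: linear relationship, simulated with a fixed seed for repeatability
rng = random.Random(0)
bdi = [rng.uniform(0, 50) for _ in range(500)]       # hypothetical scores, 0 to 50
other = [0.1 * b + rng.gauss(0, 1) for b in bdi]     # linear trend plus noise
r_full = pearson_r(bdi, other)

# Keep only the cases with high scores (restricted range)
high = [(b, o) for b, o in zip(bdi, other) if b > 30]
r_restricted = pearson_r([b for b, _ in high], [o for _, o in high])
print(f"linear: full r = {r_full:.2f}, restricted r = {r_restricted:.2f}")

# Case 2: curvilinear (inverted-U) relationship, invented for illustration
var1 = list(range(11))
var2 = [-(v - 5) ** 2 for v in var1]
r_curve_full = pearson_r(var1, var2)

# Keeping only the right half leaves a steadily falling pattern
kept = [(a, b) for a, b in zip(var1, var2) if a >= 5]
r_curve_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])
print(f"curved: full r = {r_curve_full:.2f}, restricted r = {r_curve_restricted:.2f}")
```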

Correlation • Problems of range restriction are common in psychological research, because researchers want their groups to be as different from each other as possible to increase the effect sizes that they obtain • Remember: the effect size for a two-group comparison (Cohen’s d) is the mean for Group 1 minus the mean for Group 2, divided by the pooled standard deviation (sp) • To get highly different groups, researchers sample those high and low on a particular variable • e.g., comparing those highest on aggression to those lowest on aggression • This is like looking only at BDI2 scores higher than 30: correlations will be more accurate when the full range of scores is examined

Correlation • Factors that bias correlation coefficients: • Heterogeneous Subsamples • This is a problem when there is an interaction present (e.g., our age by gender interaction mentioned in the discussion of Factorial ANOVA)

If males’ performance increases as they age, and women’s performance remains the same, then when the two genders are combined and age and performance are correlated regardless of gender, the correlation will be smaller than it is for males alone • Strong correlation of age and performance for males + weak correlation of age and performance for females = biased correlation when the two subsamples are combined
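
A toy version of this situation, with invented ages and performance scores: the relationship is perfect for males, essentially absent for females, and somewhere in between when the two subsamples are pooled. The female scores vary slightly because r is undefined for a variable with no variance at all.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
ages = [20, 30, 40, 50, 60]
male_performance = [50, 60, 70, 80, 90]     # rises steadily with age
female_performance = [70, 71, 69, 70, 70]   # essentially flat across age

r_male = pearson_r(ages, male_performance)
r_female = pearson_r(ages, female_performance)

# Pool both subsamples and ignore gender
r_pooled = pearson_r(ages + ages, male_performance + female_performance)

print(f"males r = {r_male:.2f}, females r = {r_female:.2f}, pooled r = {r_pooled:.2f}")
```

The pooled coefficient describes neither subsample well, which is why a correlation computed over heterogeneous groups can mislead.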

Correlation • Factors that bias correlation coefficients: • Outliers • [Figure panels: No Outliers (r = .989); Outlier (r = .522)]
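
The influence of a single outlier is easy to demonstrate with a small invented dataset: eight points lying close to a straight line, followed by one extreme point far from the trend.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data lying close to a straight line
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2]
r_clean = pearson_r(x, y)

# Add one point far from the trend (high on x, very low on y)
r_with_outlier = pearson_r(x + [9], y + [1.0])

print(f"without outlier r = {r_clean:.3f}, with outlier r = {r_with_outlier:.3f}")
```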

Correlation • Testing correlations for significance • just like t- and F-statistics, r-statistics can be tested for significance • just like t- and F-statistics, with increasing sample size (n), smaller correlations (r’s) will be significant • with 25 people, r ≥ .396 is significant at p < .05, with 1000 people you only need an r ≥ .062 (see Table E.2, page 515 in your text)
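
The critical values quoted above can be recovered from the usual t-test for a correlation, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The sketch below assumes a two-tailed test at p < .05 and takes the critical t values from a standard t table (they are hard-coded, not computed); the function names are illustrative.

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def critical_r(t_crit, n):
    """Smallest |r| that reaches significance, given the critical t for n - 2 df."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

# Two-tailed .05 critical t values, taken from a standard t table (assumed inputs)
T_CRIT_DF_23 = 2.069    # n = 25
T_CRIT_DF_998 = 1.962   # n = 1000

r_needed_25 = critical_r(T_CRIT_DF_23, 25)
r_needed_1000 = critical_r(T_CRIT_DF_998, 1000)

print(f"n = 25:   r must be at least {r_needed_25:.3f}")
print(f"n = 1000: r must be at least {r_needed_1000:.3f}")
```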

Correlation • Testing correlations for significance • the r-statistic is also its own, built-in effect size statistic • Cohen’s conventions for r: .1 = small, .3 = medium, and .5 = large effects • by squaring r (r²), you also get a relatively unbiased effect size estimate that is interpreted identically to η² and ω² • Remember: η² and ω² represent the percent of variability in one variable accounted for by the other

Correlation • Testing correlations for significance • Therefore, if: • r = .5, p = .00001, you can state that your two variables are strongly (effect size) and reliably (p-value) related • r = .5, p = .65, the relationship in your sample is strong, but you cannot conclude that it is reliable; you probably didn’t have enough subjects for it to show up in your p-value • r = .1, p = .00001, the relationship is reliable but weak; a large sample size probably made a small effect statistically significant, so it may be of little practical importance • r = .1, p = .65, you can conclude that your two variables are neither strongly nor reliably related

Regression • The best-fit line allows us to make educated guesses about what a score is on one variable given a score on the other • Extrapolate = make an educated guess about a score that falls outside the range of scores actually obtained (higher or lower than any obtained score) • Interpolate = make an educated guess about a score that falls within the range of the scores obtained, but that was not itself obtained

Regression • Range of scores on Depression = 0 – 49 • Range of scores on Pessimism = 1 – 7 • Extrapolation – What pessimism score would be associated with a depression score of 50? (~6.8) • Interpolation – What pessimism score would be associated with a depression score of 45? (~5.5)

Regression • Interested in linear relationship between 2 variables = use correlation • Interested in linear relationship(s) between 3+ dimensional variables = regression • DV = Symptoms of paranoia • IV = Treatment vs. Control groups → ANOVA • IV discrete (dichotomous/polychotomous) • IV = # of sessions of treatment → Regression • IV dimensional/continuous

Regression • DV = Criterion, IVs = Predictors • Criterion = b1x1 + b2x2 + b3x3 + … + a • x1 = predictor #1; b1 = slope relating x1 to the DV; a = intercept • Slope = rate of change • b = .75 → 1 pt. increase in IV associated with .75 pt. increase in DV • e.g., for every 1 pt. increase in pessimism, Dep increases .75 pt.
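
The regression equation can be written as a small function. This is only a sketch: the `predict` helper is a hypothetical name, and the intercept and slope values are placeholders chosen to echo the b = .75 example rather than estimates from real data.

```python
def predict(intercept, slopes, predictors):
    """Criterion = b1*x1 + b2*x2 + ... + a, for any number of predictors."""
    return intercept + sum(b * x for b, x in zip(slopes, predictors))

# One predictor with hypothetical values a = 3 and b = .75
at_4 = predict(3.0, [0.75], [4])
at_5 = predict(3.0, [0.75], [5])
change_per_point = at_5 - at_4  # a 1-point increase in the IV changes the DV by b

# Two predictors with hypothetical slopes
two_predictor_score = predict(1.0, [0.5, 2.0], [4, 3])

print(at_4, at_5, change_per_point, two_predictor_score)
```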

Regression • Slope • Slope w/ raw data = b • e.g., b = .45 in prediction of GPA from IQ → 1 pt. increase in IQ associated with a .45 pt. (roughly ½ pt.) increase in GPA • Slope w/ standardized data = β • Standardize data (i.e., convert to z-scores) to compare slopes between experiments • $\beta = b \cdot \frac{s_x}{s_y}$, where $s_x$ and $s_y$ are the standard deviations of the IV and the DV • e.g., β = .53 → 1 s.d. increase in IQ associated with a .53 s.d. (roughly ½ s.d.) increase in GPA • b is more interpretable if the scale of the variables is meaningful • Intercept = value of DV when IV = 0 • In the previous example, Pess = ~3 when Dep = 0, so a = ~3
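
The link between the raw and standardized slopes can be checked numerically. The sketch below assumes the standard conversion β = b · s_x / s_y and a single predictor, in which case β equals Pearson's r; the data are invented.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

b_raw = sxy / sxx                      # raw slope: change in y per 1-point change in x
beta = b_raw * stdev(x) / stdev(y)     # standardized slope: change in SDs of y per SD of x
r = sxy / sqrt(sxx * syy)              # Pearson's r for comparison

print(f"b = {b_raw:.3f}, beta = {beta:.3f}, r = {r:.3f}")
```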

Regression • Regression can test: • The overall ability of all of your IVs to predict your criterion (overall model/omnibus R²) • The ability of each IV to predict your criterion (b or β) • Each of these statistics is associated with a p-value & tested for significance • Can also be used to make predictions based on best-fit/regression line (less common)
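
For a sense of what the omnibus statistic is, the sketch below computes R² as one minus the ratio of residual to total variability, along with the usual F ratio for testing it, F = (R²/k) / ((1 − R²)/(N − k − 1)). It assumes a single predictor (k = 1) and invented data, and it stops short of a p-value, which would require an F table or a statistics package.

```python
from statistics import mean

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 1  # k = number of predictors

mx, my = mean(x), mean(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

predicted = [slope * a + intercept for a in x]
ss_residual = sum((b - p) ** 2 for b, p in zip(y, predicted))  # unexplained variability
ss_total = sum((b - my) ** 2 for b in y)                       # total variability in the DV

r_squared = 1 - ss_residual / ss_total
f_ratio = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(f"R^2 = {r_squared:.3f}, F({k}, {n - k - 1}) = {f_ratio:.2f}")
```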

Regression • Hypotheses in Regression: • H0: b, β, or R² (in population) = 0 • H1: b, β, or R² (in population) ≠ 0

Regression • Assumptions of Regression • Linearity of Regression • Variables linearly related to one another • Normality in Arrays • Actual values of DV normally distributed around predicted values (i.e. regression line) – AKA regression line is good approximation of population parameter • Homogeneity of Variance in Arrays • Assumes that variance of criterion is equal for all levels of predictor(s) • Sound familiar? • Variance of DV equal for all levels of IV(s)

Correlation/Regression • Correlation & Regression can also answer other kinds of questions: • Can test the difference between 2 independent r’s or b’s • $r_{ab} > r_{cd}$ • Is the correlation between depression and anxiety using the BDI and BAI larger than the same correlation using the MASQ-AD and MASQ-AA subscales?
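
The slides do not name a method for comparing two independent correlations. One widely used option is Fisher's r-to-z transformation, sketched below under the assumption that the two r's come from separate, independent samples; the correlations and sample sizes are invented, and the function name is illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist

def compare_independent_rs(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations.

    Returns (z, two-tailed p-value).
    """
    z1, z2 = atanh(r1), atanh(r2)                 # Fisher's transformation of each r
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))        # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical values: r = .50 in one sample of 53, r = .30 in another sample of 53
z_stat, p_value = compare_independent_rs(0.50, 53, 0.30, 53)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```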

Correlation/Regression • Can test the difference between 2 dependent r’s or b’s • $r_{ab} > r_{bc}$ • Is the correlation between rumination and depression as high as between rumination and generalized anxiety? • Is the correlation between rumination and depression at Time 1 the same at Time 2, 4 weeks later? • Don’t worry about how to do calculations by hand
