Correlation

Correlation Review and Extension

Questions to be asked… • Is there a linear relationship between x and y? • What is the strength of this relationship? • Pearson Product Moment Correlation Coefficient (r) • Can we describe this relationship and use this to predict y from x? • y=bx+a • Is the relationship we have described statistically significant? • Not a very interesting one if tested against a null of r = 0

Other stuff • Check scatterplots to see whether a Pearson r makes sense • Use both r and r² to understand the situation • If data is non-metric or non-normal, use “non-parametric” correlations • Correlation does not prove causation • True relationship may be in opposite direction, co-causal, or due to other variables • However, correlation is the primary statistic used in making an assessment of causality • ‘Potential’ Causation

Possible outcomes • r ranges from -1 to +1 • As one variable increases/decreases, the other variable increases/decreases • Positive covariance • As one variable increases/decreases, the other decreases/increases • Negative covariance • No relationship (independence) • r = 0 • Non-linear relationship

Covariance • The variance shared by two variables • When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative) • cov(x,y) is positive • When X and Y move in opposite directions • cov(x,y) is negative • When there is no consistent relationship • cov(x,y) = 0

Covariance is not very meaningful on its own and cannot be compared across different scales of measurement • Solution: standardize this measure • Pearson’s r: r = cov(x,y) / (s_x s_y), where s_x and s_y are the standard deviations of x and y
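
To make the standardization concrete, here is a minimal sketch in plain Python. The scores are hypothetical and the function names are only illustrative; the point is that rescaling x changes the covariance but leaves r untouched.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]  # the same variable on a different scale

cov_xy = covariance(x, y)
cov_rescaled = covariance(x_rescaled, y)
r_xy = pearson_r(x, y)
r_rescaled = pearson_r(x_rescaled, y)

print(cov_xy, cov_rescaled)  # covariance depends on the scale of x
print(r_xy, r_rescaled)      # r does not
```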

Factors affecting Pearson r • Linearity • Heterogeneous subsamples • Range restrictions • Outliers

Linearity • Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship

Heterogeneous subsamples • Sub-samples may artificially increase or decrease overall r. • Solution - calculate r separately for sub-samples & overall, look for differences
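
A small sketch of the suggested check, using two made-up subsamples (the group labels and values are assumptions chosen for illustration). Within each group the relationship is perfectly negative, yet pooling the groups yields a positive r.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Two hypothetical subsamples with different means on both variables
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_overall = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b)    # within-group correlations
print(r_overall)   # pooled correlation has the opposite sign
```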

Range restriction • Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r. • Common example occurs with Likert scales • E.g. 1 - 4 vs. 1 - 9 • However it is also the case that restricting the range can actually increase r if by doing so, highly influential data points would be kept out
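
The attenuation can be seen in a toy example. The data below are invented (a 1 to 9 scale where y tracks x with a small alternating error); keeping only the cases with x between 1 and 4 lowers r noticeably.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical responses on a 1-9 scale
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10]

# Restrict the range: keep only cases where x is between 1 and 4
restricted = [(a, b) for a, b in zip(x, y) if a <= 4]
x_restricted = [a for a, _ in restricted]
y_restricted = [b for _, b in restricted]

r_full = pearson_r(x, y)
r_restricted = pearson_r(x_restricted, y_restricted)

print(r_full, r_restricted)  # the restricted-range r is smaller
```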

Effect of Outliers • Outliers can artificially and dramatically increase or decrease r • Options • Compute r with and without outliers • Compute a robust version of r • For example, recode outliers as having more conservative scores (winsorize) • Transform variables
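
As an illustration of the first option, computing r with and without a suspect point, here is a sketch on invented data. Five points with no linear relationship acquire a very large r once a single extreme case is added.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with no linear relationship
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 1, 3]

# The same data plus one extreme case at (20, 20)
x_with_outlier = x + [20]
y_with_outlier = y + [20]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with_outlier, y_with_outlier)

print(r_without)  # essentially zero
print(r_with)     # driven almost entirely by the single outlier
```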

What else? • r is the starting point for any regression and related method • Both the slope and magnitude of residuals are reflective of r • r = 0 means slope = 0 • As such, r alone doesn’t really provide much more than a starting point for understanding the relationship between two variables
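
The link between r, the slope, and the residuals can be checked numerically. This sketch uses hypothetical data and standard simple-regression identities: the least-squares slope equals r times s_y / s_x, and r² equals one minus the ratio of residual to total sum of squares.

```python
from math import sqrt

# Hypothetical scores, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)

# Least-squares line y = bx + a
b = sxy / sxx
a = my - b * mx

# Slope recovered from r and the two standard deviations
slope_from_r = r * sqrt(syy / sxx)

# r squared recovered from the size of the residuals
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared_from_residuals = 1 - residual_ss / syy

print(b, slope_from_r)
print(r ** 2, r_squared_from_residuals)
```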

Robust Approaches to Correlation • Rank approaches • Winsorized • Percentage Bend

Rank approaches: Spearman’s rho and Kendall’s tau • Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables in the form of ranks • Simply rank the data available • X = 10 15 5 35 25 becomes • X = 2 3 1 5 4 • Do this for X and Y and calculate r as normal • Kendall’s tau is another rank-based approach, but the details of its calculation are different • For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data
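
Both rank approaches can be sketched in a few lines. The X values are the ones from the slide; the Y values are made up for illustration. The ranking helper assumes no ties, and the tau shown is the simple concordant-minus-discordant version, also without a tie correction.

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """(concordant - discordant) / number of pairs; no tie correction."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [10, 15, 5, 35, 25]   # from the slide
y = [12, 20, 9, 30, 40]   # hypothetical

x_ranks = ranks(x)
rho = spearman_rho(x, y)
tau = kendall_tau(x, y)

print(x_ranks)
print(rho, tau)
```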

Winsorized Correlation • As mentioned before, Winsorizing data involves changing some decided-upon percentage of extreme scores (high and low) to the value of the most extreme score that is not Winsorized • X = 1 2 3 4 5 6 becomes • X = 2 2 3 4 5 5 • Winsorize both X and Y values (without regard to each other) and compute Pearson’s r • This has an advantage over rank-based approaches in that the nature of the scales of measurement remains unchanged • For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming • Though trimming is preferable for group comparisons
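
A minimal sketch of the procedure follows. The X values are the slide's example; the Y values, including the extreme score of 30, are invented. The helper takes the number of scores to pull in at each tail (an assumption made for simplicity) rather than a percentage.

```python
from math import sqrt

def winsorize(values, k):
    """Replace the k lowest and k highest scores with the nearest retained score."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]    # from the slide
y = [2, 1, 4, 3, 6, 30]   # hypothetical, with one extreme score

# Winsorize each variable separately, one score per tail
x_w = winsorize(x, 1)
y_w = winsorize(y, 1)

r_raw = pearson_r(x, y)
r_winsorized = pearson_r(x_w, y_w)

print(x_w, y_w)
print(r_raw, r_winsorized)
```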

Methods Related to M-estimators • The percentage bend correlation (rpb) utilizes the median and a generalization of MAD • A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data, and the rpb gets around that • While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general • With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r (all are zero in the population) • With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not

Problem • While these alternative methods help us in some sense, an issue remains • When dealing with correlation, we are not considering the variables in isolation • Outliers on one or the other variable, might not be a bivariate outlier • Conversely what might be a bivariate outlier may not contain values that are outliers for X or Y themselves

Global measures of association • Measures are available that take into account the bivariate nature of the situation • Minimum Volume Ellipsoid Estimator (MVE) • Minimum Covariance Determinant Estimator (MCD)

Minimum Volume Ellipsoid Estimator • Robust elliptic plot (relplot) • Relplots are like boxplots for a scatterplot: the inner ellipse contains half the values, and anything outside the dotted outer ellipse would be considered an outlier • A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points • Those points are then used to calculate the correlation; this is the MVE approach

Minimum Covariance Determinant Estimator • The MCD is another alternative we might use, and it involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points • For the more adventurous, see my /6810 page for info on matrices and their determinants • The determinant of the covariance matrix is the generalized variance • For the two-variable situation: generalized variance = s_x² s_y² (1 - r²) • As we can see, as r is a measure of linear association, the more tightly the points are packed around a line the larger r would be, and subsequently the smaller the generalized variance would be • The MCD picks that half of the data which produces the smallest generalized variance, and calculates r from that
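
The idea can be sketched by brute force on a tiny invented data set: compute the generalized variance as the determinant of the 2 x 2 covariance matrix, try every half-sized subset, and keep the one with the smallest determinant. This is only an illustration of the principle as described on the slide. Robust libraries typically use fast approximate searches, a slightly different subset size, and further corrections, none of which appear here.

```python
from itertools import combinations
from math import sqrt

def cov_terms(points):
    """Sample variances of x and y and their covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, syy, sxy

def generalized_variance(points):
    """Determinant of the 2 x 2 covariance matrix."""
    sxx, syy, sxy = cov_terms(points)
    return sxx * syy - sxy ** 2

def pearson_r(points):
    sxx, syy, sxy = cov_terms(points)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: six points near a line plus two bivariate outliers
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8),
          (5, 5.1), (6, 6.2), (7, 1.0), (8, 0.5)]

# Check the two-variable identity: determinant = s_x^2 * s_y^2 * (1 - r^2)
sxx, syy, _ = cov_terms(points)
gv_all = generalized_variance(points)
gv_from_r = sxx * syy * (1 - pearson_r(points) ** 2)

# Brute-force search over every subset containing half the data
h = len(points) // 2
best_half = min(combinations(points, h), key=generalized_variance)

r_all = pearson_r(points)
r_mcd = pearson_r(best_half)

print(gv_all, gv_from_r)
print(best_half)
print(r_all, r_mcd)
```

Exhaustive search is only feasible for very small samples, since the number of subsets grows combinatorially with n.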

Global measures of association • Note that both the MVE and MCD can be extended to situations with more than two variables • We’d just be dealing with a larger matrix • Example using the Robust library in S-Plus • OMG! Drop down menus even!

Remaining issues: Curvature • The fact is that straight lines may not capture the true story • We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship • There may still be a relationship, and a strong one, just more complex
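
A quick illustration with made-up data: y is completely determined by x through a quadratic, yet Pearson r is zero because the relationship has no linear component.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect but purely curvilinear relationship
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curved = pearson_r(x, y)
print(r_curved)  # no linear association despite perfect dependence
```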

Summary • Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables • One data point can render it a useless measure, as it is not robust to outliers • Measures which are robust are available, and some take into account the bivariate nature of the data • However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable
