## Correlation

**Correlation: Review and Extension**

**Questions to be asked…**
• Is there a linear relationship between x and y?
• What is the strength of this relationship?
  • Pearson Product Moment Correlation Coefficient (r)
• Can we describe this relationship and use it to predict y from x?
  • y = bx + a
• Is the relationship we have described statistically significant?
  • Not a very interesting question if tested against a null of r = 0

**Other stuff**
• Check scatterplots to see whether a Pearson r makes sense
• Use both r and r² to understand the situation
• If data are non-metric or non-normal, use "non-parametric" correlations
• Correlation does not prove causation
  • True relationship may be in the opposite direction, co-causal, or due to other variables
• However, correlation is the primary statistic used in making an assessment of causality
  • 'Potential' causation

**Possible outcomes**
• r ranges from -1 to +1
• As one variable increases/decreases, the other variable increases/decreases
  • Positive covariance
• As one variable increases/decreases, the other decreases/increases
  • Negative covariance
• No relationship (independence)
  • r = 0
• Non-linear relationship

**Covariance**
• The variance shared by two variables
• When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)
  • cov(x, y) is positive
• When X and Y move in opposite directions
  • cov(x, y) is negative
• When there is no consistent relationship
  • cov(x, y) = 0
• Covariance is not very meaningful on its own and cannot be compared across different scales of measurement
• Solution: standardize this measure
  • Pearson's r: $r = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$, where $s_x$ and $s_y$ are the standard deviations of x and y

**Factors affecting Pearson r**
• Linearity
• Heterogeneous subsamples
• Range restrictions
• Outliers

**Linearity**
• Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship

**Heterogeneous subsamples**
• Sub-samples may artificially increase or decrease the overall r.
• Solution: calculate r separately for the sub-samples and overall, and look for differences

**Range restriction**
• Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.
• A common example occurs with Likert scales
  • E.g. 1 - 4 vs. 1 - 9
• However, restricting the range can actually increase r if doing so keeps highly influential data points out

**Effect of Outliers**
• Outliers can artificially and dramatically increase or decrease r
• Options
  • Compute r with and without outliers
  • Compute a robustified r!
    • For example, recode outliers as having more conservative scores (winsorize)
  • Transform variables

**What else?**
• r is the starting point for any regression and related method
• Both the slope and the magnitude of the residuals reflect r
  • r = 0 implies slope = 0
• As such, a lone r doesn't really provide much more than a starting point for understanding the relationship between two variables

**Robust Approaches to Correlation**
• Rank approaches
• Winsorized
• Percentage Bend

**Rank approaches: Spearman's rho and Kendall's tau**
• Spearman's rho is calculated using the same formula as Pearson's r, but with the variables in the form of ranks
• Simply rank the data available
  • X = 10 15 5 35 25 becomes
  • X = 2 3 1 5 4
• Do this for X and Y and calculate r as normal
• Kendall's tau is another rank-based approach, but the details of its calculation are different
• For theoretical reasons it may be preferable to Spearman's, but both should be consistent for the most part and perform better than Pearson's r when dealing with non-normal data

**Winsorized Correlation**
• As mentioned before, Winsorizing data involves changing some decided-upon percentage of extreme scores (high and low) to the value of the most extreme score that is not Winsorized
  • X = 1 2 3 4 5 6 becomes
  • X = 2 2 3 4 5 5
• Winsorize both X and Y values (without regard to each other) and compute Pearson's r
• This has an advantage over rank-based approaches in that the scale of measurement remains unchanged
• For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming
  • Though trimming is preferable for group comparisons

**Methods Related to M-estimators**
• The percentage bend correlation ($r_{pb}$) utilizes the median and a generalization of MAD
• A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; $r_{pb}$ gets around that
• While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general
• With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r
• With non-normal data, the robust approaches described guard against outliers on the respective X and Y variables, while Pearson's r does not

**Problem**
• While these alternative methods help us in some sense, an issue remains
• When dealing with correlation, we are not considering the variables in isolation
• A point that is an outlier on one variable or the other might not be a bivariate outlier
• Conversely, a bivariate outlier may not contain values that are outliers for X or Y themselves

**Global measures of association**
• Measures are available that take into account the bivariate nature of the situation
  • Minimum Volume Ellipsoid Estimator (MVE)
  • Minimum Covariance Determinant Estimator (MCD)

**Minimum Volume Ellipsoid Estimator**
• Robust elliptic plot (relplot)
  • Relplots are like boxplots for scatterplot data: the inner ellipse contains half the values, and anything outside the dotted ellipse would be considered an outlier
• A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points
• Those points are then used to calculate the correlation
  • The MVE

**Minimum Covariance Determinant Estimator**
• The MCD is another alternative we might use. It involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points
• For the more adventurous, see my /6810 page for info on matrices and their determinants
• The determinant of the covariance matrix is the generalized variance
• For the two-variable situation: $|S| = s_x^2 s_y^2 (1 - r^2)$
• As we can see, since r is a measure of linear association, the more tightly the points are packed around a line, the larger r would be in magnitude and the smaller the generalized variance would be
• The MCD picks the half of the data that produces the smallest generalized variance, and calculates r from that

**Global measures of association**
• Note that both the MVE and MCD can be extended to situations with more than two variables
  • We'd just be dealing with a larger matrix
• Example using the Robust library in S-Plus
  • OMG! Drop down menus even!

**Remaining issues: Curvature**
• The fact is that straight lines may not capture the true story
• We may often fail to find noticeable relationships because our r, whichever "Pearsonesque" method we choose, is trying to specify a linear relationship
• There may still be a relationship, and a strong one, just a more complex one

**Summary**
• Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables
• One data point can render it a useless measure, as it is not robust to outliers
• Measures that are robust are available, and some take into account the bivariate nature of the data
• However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable
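
**A small sketch of the rank and Winsorized approaches**

The slides describe Spearman's rho as "rank the data, then compute r as normal" and the Winsorized correlation as "Winsorize X and Y separately, then compute r". The Python below is one minimal way to sketch those two recipes with the standard library only. The function names (`pearson_r`, `ranks`, `winsorize`, `spearman_rho`, `winsorized_r`) and the parameter `k` (number of scores changed in each tail) are assumptions made for illustration, not anything taken from the slides or from S-Plus. The percentage bend, MVE, and MCD estimators are not covered here.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson r as standardized covariance: cov(x, y) / (s_x * s_y)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # The (n - 1) divisors of the covariance and the variances cancel.
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg_rank
        i = j + 1
    return result


def winsorize(values, k=1):
    """Pull the k lowest and k highest scores in to the nearest kept score.

    Assumes len(values) > 2 * k (hypothetical helper, fixed k per tail).
    """
    s = sorted(values)
    low, high = s[k], s[-k - 1]
    return [min(max(v, low), high) for v in values]


def spearman_rho(x, y):
    """Pearson r computed on the ranks of x and y."""
    return pearson_r(ranks(x), ranks(y))


def winsorized_r(x, y, k=1):
    """Pearson r after Winsorizing x and y independently of each other."""
    return pearson_r(winsorize(x, k), winsorize(y, k))


# The two worked examples from the slides.
print(ranks([10, 15, 5, 35, 25]))
print(winsorize([1, 2, 3, 4, 5, 6]))

# Made-up data (not from the slides): y rises with x except for one wild point.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 5, 6, 8, 9, 11, -40]
print(pearson_r(x, y), spearman_rho(x, y), winsorized_r(x, y))
```

Fixing `k` in advance is exactly the criticism the slides raise against the Winsorized correlation, so in practice the amount of Winsorizing is a choice to justify rather than a default to trust.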