# Correlation - PowerPoint PPT Presentation

## PowerPoint Slideshow about 'Correlation' - Audrey

Presentation Transcript

### Correlation

Review and Extension

• Is there a linear relationship between x and y?

• What is the strength of this relationship?

• Pearson Product Moment Correlation Coefficient (r)

• Can we describe this relationship and use this to predict y from x?

• y = bx + a

• Is the relationship we have described statistically significant?

• Not a very interesting one if tested against a null of r = 0

• Check scatterplots to see whether a Pearson r makes sense

• Use both r and r² to understand the situation

• If data is non-metric or non-normal, use “non-parametric” correlations
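
The questions above (strength of the linear relationship, then prediction via y = bx + a) can be sketched in a few lines. This is a minimal illustration in plain Python; the function names and the five data points are made up for the example and are not from the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def least_squares_line(x, y):
    """Slope b and intercept a of the least-squares line y = bx + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return b, a

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
b, a = least_squares_line(x, y)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, line: y = {b:.2f}x + {a:.2f}")
```

Here r is about 0.77 while r² is 0.60, which is one reason to look at both: r gives direction and strength, r² the proportion of variance in y accounted for by the line.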

• Correlation does not prove causation

• True relationship may be in opposite direction, co-causal, or due to other variables

• However, correlation is the primary statistic used in making an assessment of causality

• ‘Potential’ Causation

• r ranges from -1 to +1

• As one variable increases/decreases, the other variable increases/decreases

• Positive covariance

• As one variable increases/decreases, another decreases/increases

• Negative covariance

• No relationship (independence)

• r = 0

• Non-linear relationship (r can be near 0 even though the variables are related)

• The variance shared by two variables

• When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)

• cov(x, y) is positive

• When X and Y move in opposite directions

• cov(x, y) is negative

• When there is no consistent relationship

• cov(x, y) = 0
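
The three covariance cases can be checked numerically. The sketch below uses the usual sample covariance (dividing by n - 1); the data sets `same`, `opposite` and `flat` are hypothetical and were picked so each case comes out cleanly.

```python
def covariance(x, y):
    """Sample covariance: average cross-product of deviations (n - 1 denominator)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
same = [2, 4, 6, 8, 10]      # rises with x: deviations share a sign
opposite = [10, 8, 6, 4, 2]  # falls as x rises: deviations have opposite signs
flat = [3, 1, 4, 1, 3]       # symmetric pattern: cross-products cancel out

for label, y in [("same", same), ("opposite", opposite), ("flat", flat)]:
    print(label, round(covariance(x, y), 6))
```

Dividing a covariance by the product of the two standard deviations rescales it to the -1 to +1 range, which is Pearson's r.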

#### Factors affecting Pearson r

• Linearity

• Heterogeneous subsamples

• Range restrictions

• Outliers

#### Linearity

• Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
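
A small illustration of that adverse effect, using made-up data in which y is completely determined by x (y = x²) and yet Pearson's r is zero:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect but purely quadratic relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The positive and negative halves of the curve cancel in the cross-products, so a measure built to detect a straight line reports nothing.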

#### Heterogeneous subsamples

• Sub-samples may artificially increase or decrease overall r.

• Solution: calculate r separately for each sub-sample and for the overall sample, then look for differences
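
The suggested check can be sketched as follows. The two groups are invented for the example: within each group the relationship is perfectly negative, but pooling them produces a strong positive r because the groups differ in level on both variables.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical sub-samples (e.g. two intact groups pooled into one data set).
group_a = ([1, 2, 3], [3, 2, 1])
group_b = ([6, 7, 8], [8, 7, 6])

r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)
r_overall = pearson_r(group_a[0] + group_b[0], group_a[1] + group_b[1])

print(f"group A: {r_a:.2f}, group B: {r_b:.2f}, overall: {r_overall:.2f}")
```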

#### Range restriction

• Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.

• Common example occurs with Likert scales

• E.g. a 1-4 scale vs. a 1-9 scale

• However, restricting the range can also increase r if doing so excludes highly influential data points
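
The attenuation can be seen with a toy data set. The values below are hypothetical scores on a 1-9 scale; the same pairs are then cut down to the cases with x between 1 and 4.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: y tracks x with some local scatter.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 4, 3, 5, 7, 6, 9, 8]

r_full = pearson_r(x, y)

# Keep only the pairs whose x falls in the restricted range 1-4.
restricted = [(a, b) for a, b in zip(x, y) if a <= 4]
r_restricted = pearson_r([a for a, _ in restricted], [b for _, b in restricted])

print(f"full range: {r_full:.3f}, restricted range: {r_restricted:.3f}")
```

The local scatter is the same in both cases, but over the short range it is large relative to the overall trend, so r drops from about 0.93 to 0.60.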

#### Effect of Outliers

• Outliers can artificially and dramatically increase or decrease r

• Options

• Compute r with and without outliers

• Compute a robustified r

• For example, recode outliers as having more conservative scores (winsorize)

• Transform variables
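
The first option, computing r with and without the outlier, is easy to sketch. The data are hypothetical: five well-behaved points plus one extreme case appended at the end.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data; the last pair (6, -10) is the outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 5, -10]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])

print(f"with outlier: {r_with:.3f}, without outlier: {r_without:.3f}")
```

A single case is enough to turn a clearly positive correlation (0.8) into a negative one.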

#### What else?

• r is the starting point for any regression and related method

• Both the slope and magnitude of residuals are reflective of r

• r = 0 implies slope = 0

• As such, r on its own doesn’t really provide much more than a starting point for understanding the relationship between two variables
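
How the slope and the residuals reflect r can be made concrete with two standard identities for simple regression: the slope is b = r · (s_y / s_x), and the residual sum of squares is (1 - r²) times the total sum of squares of y. The sketch below checks both on made-up data; the variable names are illustrative only.

```python
from math import sqrt

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)

# Least-squares slope and intercept, computed directly.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# The same slope recovered from r and the two standard deviations.
slope_from_r = r * sqrt(syy / sxx)

# Residual sum of squares versus the (1 - r^2) identity.
residual_ss = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
residual_ss_from_r = (1 - r * r) * syy

print(slope, slope_from_r)
print(residual_ss, residual_ss_from_r)
```

With r = 0 both identities collapse as the slide says: the slope is zero and the residuals are as large as the raw deviations of y.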

#### Robust Approaches to Correlation

• Rank approaches

• Winsorized

• Percentage Bend

#### Rank approaches: Spearman’s rho and Kendall’s tau

• Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables converted to ranks

• Simply rank the data available

• X = 10 15 5 35 25 becomes

• X = 2 3 1 5 4

• Do this for X and Y and calculate r as normal

• Kendall’s tau is another rank-based approach, but the details of its calculation are different

• For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data
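
Both rank approaches can be sketched in plain Python. The X values are the ones from the slide; the Y values are hypothetical. The Kendall function below is the simple tau-a form (concordant minus discordant pairs over all pairs) with no correction for ties, which is an assumption made to keep the example short.

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [10, 15, 5, 35, 25]  # from the slide
y = [7, 3, 1, 40, 9]     # hypothetical

print(ranks(x))
print(spearman_rho(x, y), kendall_tau_a(x, y))
```

Note that the extreme value 40 in Y has no special influence here: once ranked it is simply "the largest", which is where the resistance to outliers comes from.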

#### Winsorized Correlation

• As mentioned before, Winsorizing data involves replacing a chosen percentage of the most extreme scores (high and low) with the most extreme remaining score at that end

• X = 1 2 3 4 5 6 becomes

• X = 2 2 3 4 5 5

• Winsorize both X and Y values (without regard to each other) and compute Pearson’s r

• This has an advantage over rank-based approaches in that the nature of the scales of measurement remains unchanged

• For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming

• Though trimming is preferable for group comparisons
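
A minimal sketch of the Winsorized correlation follows. For simplicity the amount of Winsorizing is given as a count `k` of scores per tail rather than a percentage; that parameter, the function names and the Y values are assumptions for the example. The X values reproduce the slide's 1 2 3 4 5 6 case.

```python
from math import sqrt

def winsorize(values, k):
    """Replace the k lowest and k highest scores with the nearest remaining score."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def winsorized_r(x, y, k):
    """Winsorize X and Y separately, then compute Pearson's r."""
    return pearson_r(winsorize(x, k), winsorize(y, k))

x = [1, 2, 3, 4, 5, 6]     # from the slide
y = [2, 1, 4, 3, 5, -10]   # hypothetical, with one extreme score

print(winsorize(x, 1))
print(pearson_r(x, y), winsorized_r(x, y, 1))
```

Pulling in the extreme score moves r from about -0.48 to about +0.26. It does not recover the 0.8 seen in the other five points, because X and Y are Winsorized without regard to each other.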

#### Methods Related to M-estimators

• The percentage bend correlation utilizes the median and a generalization of the median absolute deviation (MAD)

• A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; the percentage bend correlation (rpb) gets around that

• While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general

• With independent X and Y variables, the values of robust approaches to correlation will match the Pearson r

• With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not

#### Problem

• While these alternative methods help us in some sense, an issue remains

• When dealing with correlation, we are not considering the variables in isolation

• An outlier on one variable or the other might not be a bivariate outlier

• Conversely, a bivariate outlier may not contain values that are outliers on X or Y individually
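
The second point can be illustrated with squared Mahalanobis distances, which measure how far a point lies from the centre of the data cloud relative to its shape. This sketch uses the ordinary (non-robust) mean and covariance purely for illustration, and the data are invented: nine points on a line plus the point (3, 7), whose X and Y values are both unremarkable on their own.

```python
from math import sqrt

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for a, b in zip(x, y):
        dx, dy = a - mean_x, b - mean_y
        # Quadratic form with the inverse of [[sxx, sxy], [sxy, syy]].
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def z_scores(values):
    """Ordinary z-scores, used here to judge each variable in isolation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical data: the last pair (3, 7) breaks the pattern of the others.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 3]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 7]

d2 = mahalanobis_sq(x, y)
zx, zy = z_scores(x), z_scores(y)

print(f"last point: z_x = {zx[-1]:.2f}, z_y = {zy[-1]:.2f}, d^2 = {d2[-1]:.2f}")
print(f"largest d^2 among the other points: {max(d2[:-1]):.2f}")
```

The last point sits within one standard deviation of the mean on each variable, yet its bivariate distance is more than three times that of any other point.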

#### Global measures of association

• Measures are available that take into account the bivariate nature of the situation

• Minimum Volume Ellipsoid Estimator (MVE)

• Minimum Covariance Determinant Estimator (MCD)

#### Minimum Volume Ellipsoid Estimator

• Robust elliptic plot (relplot)

• Relplots are like boxplots for scatterplots: the inner ellipse contains half the values, and anything outside the dotted ellipse would be considered an outlier

• A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points

• Those points are then used to calculate the correlation; this is the MVE

#### Minimum Covariance Determinant Estimator

• The MCD is another alternative we might use, and involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points

• For the more adventurous, see my /6810 page for info on matrices and their determinants

• The determinant of the covariance matrix is the generalized variance

• For the two-variable situation, the generalized variance is s²x · s²y · (1 − r²)

• Since r is a measure of linear association, the more tightly the points are packed around a line, the larger |r| would be and, in turn, the smaller the generalized variance would be

• The MCD picks that half of the data which produces the smallest generalized variance, and calculates r from that
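
For a very small data set the MCD idea can be sketched by brute force: try every subset containing half the points, keep the one whose covariance matrix has the smallest determinant, and compute r from it. This is only an illustration under simplifying assumptions. Practical implementations use a subset size slightly above half, apply correction factors and rely on fast approximate search rather than enumeration. The data and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cov_parts(points):
    """Sample variances and covariance (n - 1 denominator) of (x, y) pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return sxx, syy, sxy

def generalized_variance(points):
    """Determinant of the 2x2 covariance matrix."""
    sxx, syy, sxy = cov_parts(points)
    return sxx * syy - sxy ** 2

def pearson_r(points):
    sxx, syy, sxy = cov_parts(points)
    return sxy / sqrt(sxx * syy)

def mcd_subset(points, h):
    """Indices of the h points with the smallest generalized variance (exhaustive)."""
    return min(
        combinations(range(len(points)), h),
        key=lambda idx: generalized_variance([points[i] for i in idx]),
    )

# Hypothetical data: seven points near the line y = x, plus (2, 7.0) far off it.
points = [(1, 1.2), (2, 1.9), (3, 3.3), (4, 3.8),
          (5, 5.1), (6, 5.7), (7, 7.0), (2, 7.0)]

h = len(points) // 2  # "half of the data"
best = mcd_subset(points, h)

r_full = pearson_r(points)
r_mcd = pearson_r([points[i] for i in best])

print(f"Pearson r on all points: {r_full:.3f}")
print(f"r from the minimum-determinant half {best}: {r_mcd:.3f}")
```

The off-pattern point cannot belong to a tightly packed half, so it is left out and the resulting r reflects the bulk of the data. The number of subsets grows very quickly with n, which is why exhaustive search is only feasible for toy examples like this one.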

#### Global measures of association

• Note that both the MVE and MCD can be extended to situations with more than two variables

• We’d just be dealing with a larger matrix

• Example using the Robust library in S-Plus

• OMG! Drop down menus even!

#### Remaining issues: Curvature

• The fact is that straight lines may not capture the true story

• We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship

• There may still be a relationship, and a strong one, just more complex

#### Summary

• Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables

• One data point can render it a useless measure, as it is not robust to outliers

• Measures which are robust are available, and some take into account the bivariate nature of the data

• However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable