correlation n.
Correlation - PowerPoint PPT Presentation

Correlation. Review and Extension. Questions to be asked…. Is there a linear relationship between x and y? What is the strength of this relationship? Pearson Product Moment Correlation Coefficient (r) Can we describe this relationship and use this to predict y from x? y=bx+a


PowerPoint Slideshow about 'Correlation' - Audrey

Presentation Transcript


Review and Extension

Questions to be asked…
  • Is there a linear relationship between x and y?
  • What is the strength of this relationship?
    • Pearson Product Moment Correlation Coefficient (r)
  • Can we describe this relationship and use this to predict y from x?
    • y=bx+a
  • Is the relationship we have described statistically significant?
    • Not a very interesting test if the null hypothesis is r = 0
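The questions above can be walked through on a small example. This is a minimal sketch in plain Python; the data values are made up for illustration and are not from the slides.

```python
from math import sqrt

# Illustrative data (invented for this sketch).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of squares and cross-products of the deviations from the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)               # strength of the linear relationship
b = sxy / sxx                           # slope in y = bx + a
a = mean_y - b * mean_x                 # intercept in y = bx + a
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)  # test statistic for H0: r = 0 (df = n - 2)

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}, t = {t:.3f}")
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom.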
Other stuff
  • Check scatterplots to see whether a Pearson r makes sense
  • Use both r and r² to understand the situation
  • If data is non-metric or non-normal, use “non-parametric” correlations
  • Correlation does not prove causation
    • True relationship may be in opposite direction, co-causal, or due to other variables
  • However, correlation is the primary statistic used in making an assessment of causality
    • ‘Potential’ Causation
Possible outcomes
  • -1 to +1
    • As one variable increases/decreases, the other variable increases/decreases
      • Positive covariance
    • As one variable increases/decreases, another decreases/increases
      • Negative covariance
  • No linear relationship
    • r = 0 (independent variables have r = 0, but r = 0 alone does not imply independence)
  • Non-linear relationship
Covariance
  • The variance shared by two variables
  • When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)
    • cov(x, y) > 0
  • When X and Y move in opposite directions
    • cov(x, y) < 0
  • When there is no consistent linear relationship
    • cov(x, y) = 0
Covariance is not very meaningful on its own and cannot be compared across different scales of measurement
  • Solution: standardize this measure
  • Pearson’s r: r = cov(x, y) / (s_x s_y), the covariance divided by the product of the two standard deviations
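The scale dependence of the covariance, and the way standardizing removes it, can be checked numerically. A minimal sketch with invented height and weight values: changing the units of one variable rescales the covariance but leaves r unchanged.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: average cross-product of deviations (n - 1 denominator)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sd(x):
    """Sample standard deviation, written as the square root of cov(x, x)."""
    return sqrt(covariance(x, x))

# Illustrative data (invented): the same heights in metres and in centimetres.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
height_cm = [h * 100 for h in height_m]
weight_kg = [50, 58, 66, 75, 90]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)

# Standardizing by the two standard deviations gives Pearson's r.
r_m = cov_m / (sd(height_m) * sd(weight_kg))
r_cm = cov_cm / (sd(height_cm) * sd(weight_kg))

print(cov_m, cov_cm)  # differ by a factor of 100
print(r_m, r_cm)      # agree up to rounding error
```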
Factors affecting Pearson r
  • Linearity
  • Heterogeneous subsamples
  • Range restrictions
  • Outliers
  • Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
Heterogeneous subsamples
  • Sub-samples may artificially increase or decrease overall r.
  • Solution - calculate r separately for sub-samples & overall, look for differences
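As an illustration of the suggested solution, the sketch below computes r separately for two sub-samples and for the pooled data. The two groups are invented so that the effect is extreme: within each group the relationship is perfectly negative, yet the pooled r is strongly positive because the groups differ in level.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Two invented sub-samples with different means on both variables.
group_a_x, group_a_y = [1, 2, 3, 4], [4, 3, 2, 1]
group_b_x, group_b_y = [11, 12, 13, 14], [14, 13, 12, 11]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_all = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: {r_a:.3f}, group B: {r_b:.3f}, pooled: {r_all:.3f}")
```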
Range restriction
  • Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.
  • Common example occurs with Likert scales
    • E.g. 1 - 4 vs. 1 - 9
  • However, restricting the range can also increase r if doing so excludes highly influential data points
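The attenuation from range restriction can be seen on constructed data. In this sketch (values invented, with a fixed alternating disturbance instead of random noise so the result is reproducible), r is computed on the full range of x and again on a narrow middle slice.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# x runs from 1 to 20; y follows x with an alternating disturbance of +/- 3.
x = list(range(1, 21))
y = [xi + (3 if xi % 2 == 0 else -3) for xi in x]

r_full = pearson_r(x, y)

# Keep only the cases with 8 <= x <= 13 (a restricted range).
kept = [(xi, yi) for xi, yi in zip(x, y) if 8 <= xi <= 13]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"full range: {r_full:.3f}, restricted range: {r_restricted:.3f}")
```

The disturbance is the same size in both cases; only the variability in x has been reduced.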
Effect of Outliers
  • Outliers can artificially and dramatically increase or decrease r
  • Options
    • Compute r with and without outliers
    • Compute a robust version of r
      • For example, recode outliers as having more conservative scores (winsorize)
    • Transform variables
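The first option, computing r with and without the outlier, is easy to sketch. The data are invented: eight points with a clear positive trend plus one extreme case, which is enough to flip the sign of r.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Eight well-behaved points (invented) and one outlier appended at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 4, 3, 6, 5, 8, 7, -20]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])

print(f"with outlier: {r_with:.3f}, without outlier: {r_without:.3f}")
```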
What else?
  • r is the starting point for any regression and related method
  • Both the slope and the magnitude of the residuals reflect r
    • r = 0 implies slope = 0
  • As such, r alone provides little more than a starting point for understanding the relationship between two variables
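The link between r, the slope, and the residuals can be verified directly: the least-squares slope equals r times s_y / s_x, and the proportion of residual variation equals 1 - r². A minimal sketch on invented data:

```python
from math import sqrt

# Illustrative data (invented for this sketch).
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)

# Least-squares slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# The same slope obtained from r and the two standard deviations.
slope_from_r = r * sqrt(syy / sxx)

# Residual sum of squares relative to the total sum of squares.
ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
unexplained = ss_res / syy

print(slope, slope_from_r)     # both routes give the same slope
print(unexplained, 1 - r ** 2)  # both equal the unexplained proportion
```

If r were 0, slope_from_r would be 0 as well, which is the bullet's point.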
Robust Approaches to Correlation
  • Rank approaches
  • Winsorized
  • Percentage Bend
Rank approaches: Spearman’s rho and Kendall’s tau
  • Spearman’s rho is calculated using the same formula as Pearson’s r, but applied to the ranks of the data rather than the raw scores
    • Simply rank the data available
    • X = 10 15 5 35 25 becomes
    • X = 2 3 1 5 4
    • Do this for X and Y and calculate r as normal
  • Kendall’s tau is another rank-based approach, but the details of its calculation are different
  • For theoretical reasons it may be preferable to Spearman’s rho, but the two should be consistent for the most part, and both perform better than Pearson’s r when dealing with non-normal data
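Both rank approaches can be sketched in a few lines. The X values are the ones from the ranking example above; the Y values are invented. The sketch assumes there are no tied scores: with ties, average ranks and a tie-corrected tau would be needed. The tau computed here is the simple concordant-minus-discordant version.

```python
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / total pairs; assumes no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [10, 15, 5, 35, 25]  # from the ranking example
y = [1, 4, 2, 100, 9]    # invented, includes one extreme score

print(ranks(x))
print(spearman_rho(x, y), kendall_tau(x, y))
```

The extreme Y score of 100 has no special influence here because only its rank enters the calculation.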
Winsorized Correlation
  • As mentioned before, Winsorizing replaces a chosen percentage of the most extreme scores in each tail (high and low) with the most extreme score that is not Winsorized
    • X = 1 2 3 4 5 6 becomes
    • X = 2 2 3 4 5 5
  • Winsorize both X and Y values (without regard to each other) and compute Pearson’s r
  • This has an advantage over rank-based approaches in that the scale of measurement remains unchanged
  • For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means) a Winsorized correlation would be preferable to trimming
    • Though trimming is preferable for group comparisons
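A minimal sketch of the procedure: each variable is Winsorized on its own, then Pearson's r is computed on the result. The helper takes the number of scores to replace in each tail (k) rather than a percentage, which is a simplification made for this example. The bivariate data are invented.

```python
from math import sqrt

def winsorize(values, k):
    """Replace the k lowest and k highest scores with the nearest remaining score."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def winsorized_r(x, y, k):
    """Winsorize X and Y without regard to each other, then compute Pearson's r."""
    return pearson_r(winsorize(x, k), winsorize(y, k))

print(winsorize([1, 2, 3, 4, 5, 6], 1))  # the example from the slide

# Invented data with one extreme Y score.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 4, 3, 6, 5, 8, 7, -20]
print(pearson_r(x, y), winsorized_r(x, y, 1))
```

The scores stay on their original scale after Winsorizing, unlike with ranking.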
Methods Related to M-estimators
  • The percentage bend correlation utilizes the median and a generalization of MAD
  • A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data; the percentage bend correlation (r_pb) gets around that
  • While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general
  • When X and Y are independent, the robust correlations agree with Pearson’s r: all of them estimate zero
  • With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not
  • While these alternative methods help us in some sense, an issue remains
  • When dealing with correlation, we are not considering the variables in isolation
  • An outlier on one variable or the other might not be a bivariate outlier
  • Conversely, a bivariate outlier may not have values that are outliers on X or Y individually
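For readers who want to see the percentage bend correlation mentioned above in concrete form, here is a rough sketch. It follows the general recipe as it is usually described: deviations from the median are scaled by a chosen quantile of the absolute deviations (the generalization of MAD), then bent to lie between -1 and 1. The bend constant beta = 0.2 and the exact index used for the quantile are assumptions of this sketch, and published implementations may differ in those details. The data are invented.

```python
from math import floor, sqrt
from statistics import median

def _scale_and_location(values, beta):
    """Return (omega, phi): a MAD-like scale and a percentage bend location."""
    n = len(values)
    med = median(values)
    abs_dev = sorted(abs(v - med) for v in values)
    # Scale: the (1 - beta) quantile of the absolute deviations from the median.
    # The index convention here is an assumption of this sketch.
    omega = abs_dev[floor((1 - beta) * n) - 1]
    psi = [(v - med) / omega for v in values]
    n_low = sum(1 for p in psi if p < -1)
    n_high = sum(1 for p in psi if p > 1)
    inner_sum = sum(v for v, p in zip(values, psi) if -1 <= p <= 1)
    phi = (inner_sum + omega * (n_high - n_low)) / (n - n_low - n_high)
    return omega, phi

def percentage_bend_r(x, y, beta=0.2):
    """Percentage bend correlation between x and y (sketch)."""
    omega_x, phi_x = _scale_and_location(x, beta)
    omega_y, phi_y = _scale_and_location(y, beta)
    # Standardize, then bend anything beyond +/- 1 back to the boundary.
    a = [max(-1.0, min(1.0, (v - phi_x) / omega_x)) for v in x]
    b = [max(-1.0, min(1.0, (v - phi_y) / omega_y)) for v in y]
    numerator = sum(ai * bi for ai, bi in zip(a, b))
    return numerator / sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))

# Invented data: a positive trend with one extreme Y score.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 4, 3, 6, 5, 8, 7, -20]
print(percentage_bend_r(x, y))
```

The sketch will fail with a division by zero if so many scores are tied at the median that the scale omega is 0.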
Global measures of association
  • Measures are available that take into account the bivariate nature of the situation
  • Minimum Volume Ellipsoid Estimator (MVE)
  • Minimum Covariance Determinant Estimator (MCD)
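To see why the bivariate nature of the data matters, the sketch below flags a point that is unremarkable on X and on Y separately but far from the rest of the cloud. It uses the ordinary (non-robust) squared Mahalanobis distance, which for two variables can be written with z-scores and r. This is only an illustration of the idea, not the MVE or MCD. The data are invented.

```python
from math import sqrt

# Nine points close to the line y = x, plus one point (2, 8) off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2]
y = [1.2, 1.8, 3.1, 4.2, 4.9, 6.1, 6.8, 8.2, 9.0, 8.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))

# z-scores on each variable separately (the univariate view).
zx = [(a - mean_x) / sd_x for a in x]
zy = [(b - mean_y) / sd_y for b in y]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Squared Mahalanobis distance for the two-variable case (the bivariate view).
d2 = [(a * a - 2 * r * a * b + b * b) / (1 - r * r) for a, b in zip(zx, zy)]

worst = d2.index(max(d2))
print(f"most distant point: ({x[worst]}, {y[worst]})")
print(f"its z-scores: {zx[worst]:.2f}, {zy[worst]:.2f}; squared distance: {d2[worst]:.2f}")
```

Because the mean and covariance used here are themselves affected by the unusual point, several such points could mask one another, which is part of the motivation for the robust estimators listed above.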
Minimum Volume Ellipsoid Estimator
  • Robust elliptic plot (relplot)
  • Relplots are like boxplots for bivariate data: the inner ellipse contains half the values, and anything outside the dotted ellipse would be considered an outlier
  • A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points
  • Those points are then used to calculate the correlation
    • The MVE
Minimum Covariance Determinant Estimator
  • The MCD is another alternative we might use. It involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points
    • For the more adventurous, see my /6810 page for info on matrices and their determinants
      • The determinant of the covariance matrix is the generalized variance
  • For the two-variable situation, the generalized variance is s_x² s_y² (1 − r²)
  • As we can see, because r is a measure of linear association, the more tightly the points are packed around a line, the larger r is and the smaller the generalized variance
  • The MCD picks that half of the data which produces the smallest generalized variance, and calculates r from that
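The MCD idea can be sketched by brute force on a tiny data set: try every subset of about half the points, keep the one with the smallest generalized variance, and compute r from it. Checking every subset is only feasible for very small n, and practical implementations use approximate search instead. The subset size h = n // 2 + 1 is an assumption of this sketch, and the data are invented.

```python
from itertools import combinations
from math import sqrt

def variances_and_covariance(points):
    """Sample variances of x and y and their covariance for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return var_x, var_y, cov_xy

def generalized_variance(points):
    """Determinant of the 2 x 2 covariance matrix."""
    var_x, var_y, cov_xy = variances_and_covariance(points)
    return var_x * var_y - cov_xy ** 2

def correlation(points):
    var_x, var_y, cov_xy = variances_and_covariance(points)
    return cov_xy / sqrt(var_x * var_y)

def mcd_correlation(points, h=None):
    """Brute-force MCD sketch: r from the h-point subset with the smallest determinant."""
    n = len(points)
    if h is None:
        h = n // 2 + 1  # "half the data"; this choice of h is an assumption
    best = min(combinations(range(n), h),
               key=lambda idx: generalized_variance([points[i] for i in idx]))
    return correlation([points[i] for i in best]), best

# Invented data: a positive trend plus one extreme point at the end.
points = list(zip([1, 2, 3, 4, 5, 6, 7, 8, 9],
                  [2.1, 0.9, 4.2, 3.1, 5.8, 5.2, 8.1, 6.9, -20.0]))

r_all = correlation(points)
r_mcd, subset = mcd_correlation(points)
print(f"Pearson r on all points: {r_all:.3f}")
print(f"MCD-based r: {r_mcd:.3f}, using points {subset}")
```

For any subset, generalized_variance agrees with the two-variable expression s_x² s_y² (1 − r²).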
Global measures of association
  • Note that both the MVE and MCD can be extended to situations with more than two variables
    • We’d just be dealing with a larger matrix
  • Example using the Robust library in S-Plus
    • OMG! Drop down menus even!
Remaining issues: Curvature
  • The fact is that straight lines may not capture the true story
  • We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship
  • There may still be a relationship, and a strong one, just more complex
  • Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables
  • One data point can render it a useless measure, as it is not robust to outliers
  • Measures which are robust are available, and some take into account the bivariate nature of the data
  • However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable
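A small constructed example of a curvilinear relationship: y is completely determined by x, yet Pearson's r between them is zero because the relationship is not linear. Correlating y with x² instead recovers the perfect association.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of squares and cross-products of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# y is an exact function of x, but the function is a parabola.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_linear = pearson_r(x, y)                         # linear association between x and y
r_squared_term = pearson_r([xi ** 2 for xi in x], y)  # association between x squared and y

print(r_linear, r_squared_term)
```

A scatterplot of x against y would show the curve immediately, which is why examining the data comes first.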