
# Correlation

Correlation. Rizal Maulana. Basic Concepts of Correlation Scatter Diagrams One Sample Hypothesis Testing for Correlation. Outline.


### Correlation

Rizal Maulana

Outline

• Basic Concepts of Correlation

• Scatter Diagrams

• One Sample Hypothesis Testing for Correlation

Basic Concepts of Correlation

• Definition: The covariance between two sample random variables x and y is a measure of the linear association between the two variables, and is defined by the formula

$$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

• Observation: The covariance is similar to the variance, except that the covariance is defined for two variables (x and y above) whereas the variance is defined for only one variable. In fact, cov(x, x) = var(x).

• The covariance can be thought of as the sum of matches and mismatches among the pairs of data elements for x and y.

• A match occurs when both elements in the pair are on the same side of their mean; a mismatch occurs when one element in the pair is above its mean and the other is below its mean.

• The covariance is positive when the matches outweigh the mismatches and is negative when the mismatches outweigh the matches.

• The stronger the linear relationship, the larger the absolute value of the covariance will be.

• The size of the covariance is also influenced by the scale of the data elements, and so in order to eliminate the scale factor the correlation coefficient is used as the scale-free metric of linear relationship.

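• Definition: The correlation coefficient between two sample variables x and y is a scale-free measure of linear association between the two variables, and is given by the formula

$$r = \frac{\text{cov}(x, y)}{s_x s_y}$$

where s_x and s_y are the sample standard deviations of x and y.

• Observation: The covariance can be calculated as

$$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{n - 1}$$

As a result, we can also calculate the correlation coefficient as

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n \bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n \bar{y}^2\right)}}$$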
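The definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the data are made up for the example and are not part of the presentation. It computes the sample covariance both from the definition and from the shortcut formula, and shows that rescaling x changes the covariance but leaves r unchanged.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_cov_shortcut(x, y):
    """Same quantity via (sum of x*y - n * mean_x * mean_y) / (n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum(a * b for a, b in zip(x, y)) - n * mx * my) / (n - 1)

def sample_corr(x, y):
    """Correlation coefficient r = cov(x, y) / (s_x * s_y)."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Made-up illustrative data (not from the presentation)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = sample_cov(x, y)
r_xy = sample_corr(x, y)

# Rescaling x (e.g. a change of units) changes the covariance but not r
x_scaled = [100 * a for a in x]
cov_scaled = sample_cov(x_scaled, y)
r_scaled = sample_corr(x_scaled, y)

print(cov_xy, cov_scaled)
print(r_xy, r_scaled)
```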

• Property 1: -1 ≤ r ≤ 1

• If r is close to 1 then x and y are positively correlated. A positive linear correlation means that high values of x are associated with high values of y and low values of x are associated with low values of y.

• If r is close to -1 then x and y are negatively correlated. A negative linear correlation means that high values of x are associated with low values of y, and low values of x are associated with high values of y.

• When r is close to 0 there is little linear relationship between x and y.

• Definition: The covariance between two random variables x and y for a population with discrete or continuous pdf is

$$\text{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$$

• Definition: The (Pearson’s product moment) correlation coefficient for two variables x and y for a population with discrete or continuous pdf is

$$\rho = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}$$

• If x and y are independent then cov(x, y) = 0

• The following is true for both the sample and the population:

Proof:

• Observation: It turns out that r is not an unbiased estimate of ρ. A relatively unbiased estimate of ρ is given by the adjusted correlation coefficient radj:

$$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(n - 1)}{n - 2}}$$

While radj is a better estimate of the population correlation, especially for small values of n, for large values of n it is easy to see that radj ≈ r.
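As a rough illustration of how the adjustment fades as n grows, here is a minimal sketch. It assumes the commonly used form radj = sqrt(1 - (1 - r²)(n - 1)/(n - 2)); the function name and the clipping at zero are choices made for this example, not something stated in the slides.

```python
from math import sqrt

def adjusted_r(r, n):
    """Adjusted correlation, assuming sqrt(1 - (1 - r^2)(n - 1)/(n - 2)).

    Assumption: the value under the square root is clipped at 0, since it
    can go negative when both r and n are small. The result is a magnitude;
    the sign of r is not carried over.
    """
    if n <= 2:
        raise ValueError("n must be greater than 2")
    inner = 1 - (1 - r * r) * (n - 1) / (n - 2)
    return sqrt(max(inner, 0.0))

# The gap between r and the adjusted value shrinks as n grows
for n in (10, 100, 10000):
    print(n, adjusted_r(0.6, n))
```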

• For constant a and random variables x, y and z, the following are true both for the sample and population definitions of covariance:

• cov(x, y) = cov(y, x)

• cov(x, x) = var(x)

• cov(a, y) = 0

• cov(ax, y) = a · cov(x, y)

• cov(x+z, y) = cov(x, y)+ cov(z, y)
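The properties listed above can be spot-checked numerically. The following sketch uses the sample definition of covariance with made-up data; it is a sanity check on one data set, not a proof.

```python
import statistics

def cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((p - mx) * (q - my) for p, q in zip(x, y)) / (n - 1)

def close(p, q, tol=1e-9):
    return abs(p - q) < tol

# Made-up illustrative data
x = [2.0, 4.0, 4.0, 7.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0, 5.0]
z = [5.0, 1.0, 4.0, 2.0, 3.0]
a = 3.0

checks = {
    "cov(x, y) = cov(y, x)": close(cov(x, y), cov(y, x)),
    "cov(x, x) = var(x)": close(cov(x, x), statistics.variance(x)),
    "cov(a, y) = 0": close(cov([a] * len(y), y), 0.0),
    "cov(ax, y) = a * cov(x, y)": close(cov([a * v for v in x], y), a * cov(x, y)),
    "cov(x+z, y) = cov(x, y) + cov(z, y)": close(
        cov([p + q for p, q in zip(x, z)], y), cov(x, y) + cov(z, y)
    ),
}
for name, ok in checks.items():
    print(name, ok)
```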

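• If x and y are random variables and z = ax + b where a and b are constants with a > 0, then the correlation coefficient between x and y is the same as the correlation coefficient between z and y (for a < 0 the sign of the correlation flips).

Proof: By the covariance properties above, cov(z, y) = cov(ax + b, y) = a · cov(x, y). Similarly var(z) = a² · var(x), and so stdev(z) = a · stdev(x). Thus corr(z, y) = a · cov(x, y) / (a · stdev(x) · stdev(y)) = corr(x, y).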

• Property 2:

Proof

Since ti and ei are independent, cov(t, e) = 0, and so

Thus

• Excel Functions:

• COVAR(R1, R2) = the population covariance between the data in arrays R1 and R2. If R1 contains data {x1,…,xn}, R2 contains {y1,…,yn}, x̄ = AVERAGE(R1) and ȳ = AVERAGE(R2), then COVAR(R1, R2) has the value

$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$

This is the same as the formula given in the definition of the sample covariance, with n – 1 replaced by n. Versions of Excel before Excel 2010 don’t have a sample version of the covariance, although it can be calculated using the formula:

n * COVAR(R1, R2) / (n – 1)

• CORREL(R1, R2) = the correlation coefficient of data in arrays R1 and R2. This function can be used for both the sample and population versions of the correlation coefficient. Note that:

• CORREL(R1, R2) = COVAR(R1, R2) / (STDEVP(R1) * STDEVP(R2)) = the population version of the correlation coefficient

• CORREL(R1, R2) = n * COVAR(R1, R2) / (STDEV(R1) * STDEV(R2) * (n – 1)) = the sample version of the correlation coefficient

• Excel 2010 and later also provide COVARIANCE.S(R1, R2) to compute the sample covariance as well as COVARIANCE.P(R1, R2), which is equivalent to COVAR(R1, R2). Also, the Real Statistics supplemental functions COVARP(R1, R2) and COVARS(R1, R2) compute the population and sample covariances respectively.
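For readers working outside Excel, the relationship between the two covariance versions can be mirrored in plain Python. The helper names below (covar_p, covar_s, correl) are hypothetical stand-ins chosen for this sketch, and the data are made up. The point it illustrates is why a single CORREL serves both versions: the factors of n and n - 1 cancel in the ratio.

```python
from math import sqrt

def covar_p(x, y):
    """Population covariance (divisor n), the quantity COVAR / COVARIANCE.P return."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def covar_s(x, y):
    """Sample covariance (divisor n - 1), computed as n * COVAR / (n - 1)."""
    n = len(x)
    return n * covar_p(x, y) / (n - 1)

def correl(cov, x, y):
    """Correlation built from whichever covariance function is passed in."""
    return cov(x, y) / sqrt(cov(x, x) * cov(y, y))

# Made-up illustrative data
x = [3.0, 5.0, 6.0, 8.0, 11.0, 12.0]
y = [9.0, 7.0, 8.0, 4.0, 3.0, 1.0]

r_population = correl(covar_p, x, y)
r_sample = correl(covar_s, x, y)
print(covar_p(x, y), covar_s(x, y))
print(r_population, r_sample)
```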

Scatter Diagrams

• To better visualize the association between two data sets {x1, …, xn} and {y1, …, yn} we can employ a chart called a scatter diagram (also called a scatter plot). This is done in Excel by highlighting the data in the two data sets and selecting Insert > Charts|Scatter.

• This figure illustrates the relationship between a scatter diagram and the correlation coefficient (or covariance).

One Sample Hypothesis Testing for Correlation

• As we do in Sampling Distributions, we can consider the distribution of r over repeated samples of x and y.

• We require that x and y have a joint bivariate normal distribution or that samples are sufficiently large.

• We can think of a bivariate normal distribution as the three-dimensional version of the normal distribution, in which any vertical slice through the surface which graphs the distribution results in an ordinary bell curve.

• The sampling distribution of r is only symmetric when ρ = 0 (which, for bivariate normal x and y, is equivalent to x and y being independent).

• If ρ ≠ 0, then the sampling distribution is asymmetric and so the following theorem does not apply, and other methods of inference must be used.

• Theorem 1: Suppose ρ = 0. If x and y have a bivariate normal distribution or if the sample size n is sufficiently large, then r has a normal distribution with mean 0, and t = r/sr ~ T(n – 2) where

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Here the numerator r of the random variable t is the estimate of ρ = 0 and sr is the standard error of r.

• Observation: If we solve the equation in Theorem 1 for r, we get

$$r = \frac{t}{\sqrt{t^2 + n - 2}}$$

• Observation: The theorem can be used to test the hypothesis that population random variables x and y are independent, i.e. ρ = 0.

Example 1

• A study is designed to check the relationship between smoking and longevity. A sample of 15 men 50 years and older was taken and the average number of cigarettes smoked per day and the age at death were recorded, as summarized in the table. Can we conclude from the sample that longevity is independent of smoking?

The scatter diagram for this data is as follows. We have also included the linear trend line that seems to best match the data.

• Next we calculate the correlation coefficient of the sample using the CORREL function:

r = CORREL(R1, R2) = -.713

From the scatter diagram and the correlation coefficient, it is clear that the population correlation is likely to be negative.

• The absolute value of the correlation coefficient looks high, but is it high enough? To determine this, we establish the following null hypothesis:

H0: ρ = 0

• Recall that, for bivariate normal variables, ρ = 0 means that the two population variables are independent.

• We use t = r/sr as the test statistic where sr is as in Theorem 1. Based on the null hypothesis, ρ = 0, we can apply Theorem 1, provided x and y have a bivariate normal distribution.

• It is difficult to check for bivariate normality, but we can at least check to make sure that each variable is approximately normal via QQ plots.

• Both samples appear to be normal, and so by Theorem 1, we know that t has approximately a t-distribution with n – 2 = 13 degrees of freedom. We now calculate

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - (-.713)^2}{13}} = .194, \qquad t = \frac{r}{s_r} = -3.67$$

• Finally, we perform either one of the following tests:

p-value = TDIST(ABS(-3.67), 13, 2) = .00282 < .05 = α (two-tail)

tcrit = TINV(.05, 13) = 2.16 < 3.67 = |tobs|

• And so we reject the null hypothesis, and conclude there is a non-zero correlation between smoking and longevity. In fact, it appears from the data that increased levels of smoking are associated with reduced longevity.
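The arithmetic in Example 1 can be reproduced outside Excel. The sketch below assumes SciPy is available and uses scipy.stats.t as the counterpart of TDIST and TINV; only the reported summary values r = -.713 and n = 15 are used, since the raw data table is not part of this transcript.

```python
from math import sqrt
from scipy import stats

r, n = -0.713, 15   # summary values reported in Example 1
alpha = 0.05
df = n - 2

s_r = sqrt((1 - r**2) / df)   # standard error of r under H0: rho = 0
t_obs = r / s_r

# Two-tail p-value, the counterpart of TDIST(ABS(t), df, 2)
p_two_tail = 2 * stats.t.sf(abs(t_obs), df)

# Two-tail critical value, the counterpart of TINV(alpha, df)
t_crit = stats.t.ppf(1 - alpha / 2, df)

reject = p_two_tail < alpha
print(t_obs, p_two_tail, t_crit, reject)
```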

Example 2

• The US Census Bureau collects statistics comparing the 50 states. The following table shows the poverty rate (% of population below the poverty level) and infant mortality rate (per 1,000 live births) by state. Based on this data, can we conclude the poverty and infant mortality rates by state are correlated?

• The correlation coefficient of the sample is given by

r = CORREL(R1, R2) = .564

• Where R1 is the range containing the poverty data and R2 is the range containing the infant mortality data.

• From the scatter diagram and the correlation coefficient, it is clear that the population correlation is likely to be positive, and so this time we use the following one-tail null hypothesis:

H0: ρ ≤ 0

• Based on the null hypothesis we will assume that ρ = 0 (best case), and so as in Example 1, with n = 50,

$$t = \frac{r}{s_r} = r \sqrt{\frac{n - 2}{1 - r^2}} = 4.737$$

• Finally, we perform either one of the following tests:

p-value = TDIST(4.737, 48, 1) = 9.8E-06 < .05 = α (one-tail)

tcrit = TINV(2*.05, 48) = 1.677 < 4.737 = tobs (TINV is a two-tail function, so the one-tail critical value uses 2α)

• And so we reject the null hypothesis, and conclude there is a positive correlation between poverty and infant mortality.
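A one-tail version of the same check can be sketched as follows, again assuming SciPy is available. It starts from the rounded value r = .564, so the t statistic it produces differs slightly in the third decimal from the 4.737 quoted above, which was presumably computed from the unrounded correlation. The critical value here is the one-tail value, i.e. the 1 - α quantile of T(n - 2).

```python
from math import sqrt
from scipy import stats

r, n = 0.564, 50   # rounded summary values reported in Example 2
alpha = 0.05
df = n - 2

t_obs = r / sqrt((1 - r**2) / df)

# One-tail p-value for H0: rho <= 0, the counterpart of TDIST(t, df, 1)
p_one_tail = stats.t.sf(t_obs, df)

# One-tail critical value: the 1 - alpha quantile of T(df)
t_crit = stats.t.ppf(1 - alpha, df)

reject = t_obs > t_crit
print(t_obs, p_one_tail, t_crit, reject)
```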

• Observation: For samples of any given size n it turns out that r is not normally distributed when ρ ≠ 0 (even when the population has a normal distribution), and so we can’t use Theorem 1.

• There is a simple transformation of r, however, that gets around this problem, and allows us to test whether ρ = ρ0 for some value of ρ0 ≠ 0.

• Definition 1: For any r define the Fisher transformation of r as follows:

$$r' = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)$$

• Theorem 2: If x and y have a joint bivariate normal distribution or n is sufficiently large, then the Fisher transformation r′ of the correlation coefficient r for samples of size n has distribution N(ρ′, sr′) where ρ′ is the Fisher transformation of ρ and

$$s_{r'} = \frac{1}{\sqrt{n - 3}}$$

• Corollary 1: Suppose r1 and r2 are as in the theorem where r1 and r2 are based on independent samples of sizes n1 and n2, and further suppose that ρ1 = ρ2. If z is defined as follows, then z ~ N(0,1)

$$z = \frac{r_1' - r_2'}{s}$$

where

$$s = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$$
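To make the corollary concrete, here is a minimal sketch of a two-sample comparison. It assumes the usual form of the statistic, z = (r1′ - r2′)/s with s = sqrt(1/(n1 - 3) + 1/(n2 - 3)); the function name and the input values are hypothetical and chosen only for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """z statistic and two-tail p-value for H0: rho1 = rho2 (independent samples)."""
    s = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / s   # atanh is the Fisher transformation
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical inputs for illustration only
z, p = compare_correlations(0.6, 100, 0.4, 80)
print(z, p)
```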

• Excel Functions: Excel provides functions that calculate the Fisher transformation and its inverse.

FISHER(r) = .5 * LN((1 + r) / (1 – r))

FISHERINV(z) = (EXP(2 * z) – 1) / (EXP(2 * z) + 1)
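The two Excel formulas translate directly into Python. The sketch below (function names are hypothetical) also checks that they agree with the inverse hyperbolic tangent, math.atanh, and that the two functions undo each other.

```python
from math import atanh, exp, log

def fisher(r):
    """Fisher transformation, mirroring FISHER(r)."""
    return 0.5 * log((1 + r) / (1 - r))

def fisher_inv(z):
    """Inverse Fisher transformation, mirroring FISHERINV(z)."""
    return (exp(2 * z) - 1) / (exp(2 * z) + 1)

for r in (-0.9, -0.5, 0.0, 0.6, 0.7):
    z = fisher(r)
    assert abs(z - atanh(r)) < 1e-12        # same function as atanh
    assert abs(fisher_inv(z) - r) < 1e-12   # round trip recovers r
    print(r, round(z, 3))
```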

• Observation: We can use Theorem 2 to test the null hypothesis H0: ρ = ρ0. This test is very sensitive to outliers. If outliers are present it may be better to use the Spearman rank correlation test or Kendall’s tau test.

Example 3

• Suppose we calculate r = .6 for a sample of size n = 100. Test the following null hypothesis and find the 95% confidence interval.

H0: ρ = .7

• Observe that

r′ = FISHER(r) = FISHER(.6) = 0.693

ρ′ = FISHER(ρ) = FISHER(.7) = 0.867

sr′ = 1 / SQRT(n – 3) = 1 / SQRT(100 – 3) = 0.102

• Since r′ < ρ′ we are looking at the left tail of a two-tail test

p-value = NORMDIST(r′, ρ′, sr′, TRUE) = NORMDIST(.693, .867, .102, TRUE) = .0432 > 0.025 = α/2

r′-crit = NORMINV(α/2, ρ′, sr′) = NORMINV(.025, .867, .102) = .668 < .693 = r′

In either case, we cannot reject the null hypothesis.

• The 95% confidence interval for ρ′ is

r′ ± zcrit ∙ sr′ = 0.693 ± 1.96 ∙ 0.102 = (0.494, 0.892)

• Here zcrit = ABS(NORMSINV(.025)) = 1.96.

• The 95% confidence interval for ρ is therefore (FISHERINV(0.494), FISHERINV(0.892)) = (.457, .712).

• Note that .7 lies in this interval, confirming our conclusion not to reject the null hypothesis.
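The whole of Example 3 can be retraced with the Python standard library, using statistics.NormalDist in place of NORMDIST, NORMINV and NORMSINV and math.atanh / math.tanh in place of FISHER / FISHERINV. This is a sketch of the same calculation, not a general-purpose routine.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

r, rho0, n, alpha = 0.6, 0.7, 100, 0.05   # values from Example 3

r_prime = atanh(r)        # Fisher transformation of the sample correlation
rho_prime = atanh(rho0)   # Fisher transformation of the hypothesized value
s = 1 / sqrt(n - 3)       # standard error of r'

sampling = NormalDist(mu=rho_prime, sigma=s)
p_left = sampling.cdf(r_prime)         # counterpart of NORMDIST(r', rho', s, TRUE)
r_crit = sampling.inv_cdf(alpha / 2)   # counterpart of NORMINV(alpha/2, rho', s)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ci_prime = (r_prime - z_crit * s, r_prime + z_crit * s)   # interval for rho'
ci_rho = (tanh(ci_prime[0]), tanh(ci_prime[1]))           # interval for rho

reject = p_left < alpha / 2
print(p_left, r_crit, ci_rho, reject)
```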

Effect Size And Power

Example 4

• A market research team is conducting a study in which they believe the correlation between increases in product sales and marketing expenditures is 0.35. What is the power of the one-tail test if they use a sample of size 40 with α = .05? How big does their sample need to be to carry out the study with α = .05 and power = .80?
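The transcript does not include the worked solution for this example. One common way to approach such questions is the normal approximation based on the Fisher transformation, sketched below. The approach, the function names, and the rounding rule are assumptions made for this illustration; the original slides may use a different method and arrive at somewhat different numbers.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

def power_one_tail(rho, n, alpha=0.05):
    """Approximate power of the one-tail test of H0: rho = 0.

    Assumption: Fisher-transformed r is treated as normal with mean
    atanh(rho) and standard deviation 1 / sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(atanh(rho) * sqrt(n - 3) - z_crit)

def sample_size_one_tail(rho, power=0.80, alpha=0.05):
    """Smallest n whose approximate power reaches the target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(rho)) ** 2 + 3)

print(power_one_tail(0.35, 40))
print(sample_size_one_tail(0.35))
```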