Statistics for the Behavioral Sciences (5th ed.), Gravetter & Wallnau

Chapter 16: Correlations and Regression
University of Guelph, Psychology 3320, Dr. K. Hennig, Winter 2003 Term

Overview of chapter
• Correlations
• Pearson r
• For non-linear (non-scalar) data:
• Spearman r (with non-linear data)
• Point-biserial (where one variable is dichotomous)
• Phi-coefficient (where both variables are dichotomous)
• Regression

CORRELATIONS: Figure 16-1 (p. 522). The relationship between exam grade and time needed to complete the exam. Notice the general trend in these data: students who finish the exam early tend to have better grades.

Figure 16-2 (p. 523) The same set of n = 6 pairs of scores (X and Y values) is shown in a table and in a scatterplot. Notice that the scatterplot allows you to see the relationship between X and Y.

Three characteristics
1. Direction: examples of positive and negative relationships. (a) Beer sales are positively related to temperature. (b) Coffee sales are negatively related to temperature.

2. Form: Examples of relationships that are not linear: (a) relationship between reaction time and age; (b) relationship between mood and drug dose.

3. Degree: Examples of different values for linear correlations: (a) shows a strong positive relationship, approximately +0.90; (b) shows a relatively weak negative correlation, approximately –0.40; (c) shows a perfect negative correlation, –1.00; (d) shows no linear trend, 0.00.

Pearson (product-moment) correlation
• Sum of products of deviations: $SP = \sum (X - M_X)(Y - M_Y)$, where $M_X$ is the mean of the X scores and $M_Y$ is the mean of the Y scores.
• Recall: $SS = \sum (X - M)^2 = \sum (X - M)(X - M)$

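Pearson (product-moment) correlation
• $r = \dfrac{\text{degree to which X and Y vary together}}{\text{degree to which X and Y vary separately}} = \dfrac{SP}{\sqrt{SS_X \, SS_Y}}$
• Computational formula: $SP = \sum XY - \dfrac{\sum X \sum Y}{n}$
• Expressed in z-scores: $r = \dfrac{\sum z_X z_Y}{n}$ (note: the z-scores must be computed with the population standard deviation, $\sigma$)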
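The formulas on this slide can be checked against each other numerically. The following is a minimal Python sketch, not the textbook's procedure: the function names are illustrative, and the data (X = 1, 3, 5; Y = 4, 9, 8) are borrowed from the regression example later in these notes.

```python
import math

def sum_of_products(xs, ys):
    """Definitional formula: SP = sum of (X - M_X)(Y - M_Y)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def sum_of_products_computational(xs, ys):
    """Computational formula: SP = sum(XY) - sum(X) * sum(Y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def pearson_r(xs, ys):
    """r = SP / sqrt(SS_X * SS_Y); SS is the SP of a variable with itself."""
    sp = sum_of_products(xs, ys)
    ss_x = sum_of_products(xs, xs)
    ss_y = sum_of_products(ys, ys)
    return sp / math.sqrt(ss_x * ss_y)

def pearson_r_from_z(xs, ys):
    """r = sum(z_X * z_Y) / n, with z-scores based on the population SD."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)  # divide by n, not n - 1
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sigma_x) * ((y - my) / sigma_y)
               for x, y in zip(xs, ys)) / n

xs = [1, 3, 5]
ys = [4, 9, 8]
print(sum_of_products(xs, ys), sum_of_products_computational(xs, ys))
print(pearson_r(xs, ys), pearson_r_from_z(xs, ys))
```

For these data both SP formulas give 8, and both routes to r agree at about 0.76. Dividing by n - 1 instead of n inside the z-score version would make the two values of r disagree, which is why the population formula is required.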

Understanding and interpreting r
• Correlations do not prove causation, but they can disprove causation.
• The value of a correlation can be greatly affected by the range of scores in the data.
• Outliers can have a dramatic effect.
• Do not interpret a correlation as a proportion (e.g., r = 0.50 does not mean 50%); rather, $r^2$ = 0.25, so 25% of the total variability is accounted for. $r^2$ is called the coefficient of determination.

The effect of range
(a) In this example, the full range of X and Y values shows a strong, positive correlation, but the restricted range of scores produces a correlation near zero. (b) An example in which the full range of X and Y values shows a correlation near zero, but the scores in the restricted range produce a strong, positive correlation.

Outliers
A demonstration of how one extreme data point (an outlier) can influence the value of a correlation.
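The outlier effect is easy to reproduce. The sketch below uses made-up scores (they are not the data behind the textbook figure) and an illustrative `pearson_r` helper: five points with almost no linear trend, then the same points plus one extreme pair.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical scores chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# Add a single extreme point far from the rest of the data.
r_with = pearson_r(xs + [14], ys + [12])

print(round(r_without, 2), round(r_with, 2))
```

With these numbers the correlation moves from roughly 0.14 to roughly 0.93 on the strength of one added point, which is why a scatterplot should be inspected before r is interpreted.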

Hypothesis testing
• H0: ρ = 0 (there is no population correlation)
• H1: ρ ≠ 0 (there is a real correlation)
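The slide states only the hypotheses. One standard way to evaluate them, which may differ from the table-lookup procedure used in the textbook, is to convert r to a t statistic with df = n - 2. The sketch below assumes that approach; the function name is illustrative, and the values r = 0.80 and n = 9 are taken from the nine-point example later in these notes.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with df = n - 2 (requires |r| < 1)."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r ** 2))
    return t, df

t, df = t_for_correlation(0.80, 9)
print(round(t, 2), df)
```

Here t is about 3.53 with df = 7. The two-tailed critical value of t at α = .05 for df = 7 is 2.365, so under this approach H0 would be rejected for these values.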

CORRELATIONS: For non-linear relations
Relationship between practice and performance. Although this relationship is not linear, there is a consistent positive relationship. An increase in performance tends to accompany an increase in practice.

Spearman r: Scatterplots showing (a) the scores and (b) the ranks for the data in Example 16.8. Notice that there is a consistent, positive relationship between the X and Y scores, although it is not a linear relationship. Also notice that the scatterplot of the ranks shows a perfect linear relationship.
Steps:
1. Rank-order the X scores and the Y scores separately.
2. Apply the Pearson r formula to the ranks, or use the special Spearman formula.
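Both routes in step 2 can be sketched in a few lines of Python. The data are hypothetical (Y = X cubed, not the scores from Example 16.8), the helper names are illustrative, and the ranking function assumes there are no tied scores. The special formula used is $r_s = 1 - \dfrac{6 \sum D^2}{n(n^2 - 1)}$, where D is the difference between the two ranks of each individual.

```python
import math

def ranks(values):
    """Rank scores from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

def spearman_via_pearson(xs, ys):
    """Step 1: rank each variable. Step 2: Pearson r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

def spearman_special(xs, ys):
    """Special formula: 1 - 6 * sum(D^2) / (n * (n^2 - 1)); no ties allowed."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores: consistently increasing, but not linear.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]

print(pearson_r(xs, ys))             # below 1: the raw scores are not linear
print(spearman_via_pearson(xs, ys))  # 1.0: the ranks line up perfectly
print(spearman_special(xs, ys))      # 1.0: same answer from the special formula
```

Because the ranks of a consistently increasing relationship match exactly, the Spearman correlation is 1.00 even though the Pearson correlation of the raw scores is not.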

Other measures of relationship
• Point-biserial: where one variable is dichotomous (has only two values: male vs. female, first-born vs. later-born, etc.)
• Phi-coefficient: where both variables are dichotomous (e.g., a variable from above such as birth order, coded as first-born vs. later-born)
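Both measures can be obtained by coding each dichotomous variable as 0 or 1 and then applying the ordinary Pearson formula. The sketch below assumes that coding; the scores and category assignments are invented for illustration, and `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Point-biserial: one dichotomous variable (coded 0/1) and one numeric score.
# Hypothetical coding: 0 = first-born, 1 = later-born.
birth_order = [0, 0, 0, 1, 1, 1]
score = [2, 3, 4, 6, 7, 8]
r_point_biserial = pearson_r(birth_order, score)

# Phi-coefficient: both variables dichotomous (each coded 0/1).
var_a = [0, 0, 1, 1, 0, 1, 1, 0]
var_b = [0, 0, 1, 1, 0, 1, 0, 1]
phi = pearson_r(var_a, var_b)

print(round(r_point_biserial, 3), round(phi, 3))
```

Which category receives the 1 is arbitrary: swapping the codes flips the sign of the result but not its size, so the sign of these coefficients should be read with the coding in mind.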

Introduction to regression
SAT scores and GPA: a regression line drawn through the data points. The regression line defines a precise, one-to-one relationship between each X value (SAT score) and its corresponding Y value (GPA).

Relationship between total cost and number of hours playing tennis. The tennis club charges a $25 membership fee plus $5 per hour. The relationship is described by a linear equation: Total cost = $5 × (number of hours) + $25, which has the form Y = bX + a. The statistical technique for finding a best-fit line is called regression.

The distance between the actual data point (Y) and the predicted point on the line (Ŷ) is defined as Y – Ŷ. The goal of regression is to find the equation for the line that minimizes these distances (specifically, the sum of their squares).

Best-fit straight line. The predicted Y values (Ŷ) are on the regression line. Unless the correlation is perfect (+1.00 or –1.00), there will be some error between the actual Y values and the predicted Y values. The closer the correlation is to +1.00 or –1.00, the smaller the error will be.

(a) Scatterplot showing data points that perfectly fit the regression equation Ŷ = 1.6X – 2. Note that the correlation is r = 1.00. (b) Scatterplot for the data from Example 16.14. Notice that there is error between the actual data points and the predicted Y values of the regression line.
• Total squared error = $\sum (Y - \hat{Y})^2$
• The line that minimizes the total squared error is the least-squares solution.
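The total squared error can be computed for any candidate line, which makes the least-squares idea concrete. In the sketch below the function name is illustrative. The first data set is generated from Ŷ = 1.6X – 2 purely for illustration; the second reuses X = 1, 3, 5 and Y = 4, 9, 8 from the regression example in these notes, for which the formulas b = SP/SS_X and a = M_Y – bM_X give the line Ŷ = X + 4.

```python
def total_squared_error(xs, ys, b, a):
    """Sum of (Y - Y_hat)^2 for the line Y_hat = b * X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# Points generated to lie exactly on Y_hat = 1.6X - 2 (illustrative X values).
xs_perfect = [0, 5, 10]
ys_perfect = [1.6 * x - 2 for x in xs_perfect]
error_perfect = total_squared_error(xs_perfect, ys_perfect, 1.6, -2)

# Data from the regression example; the least-squares line is Y_hat = X + 4.
xs = [1, 3, 5]
ys = [4, 9, 8]
error_best = total_squared_error(xs, ys, 1, 4)

# Any other line does worse on the same data.
error_steeper = total_squared_error(xs, ys, 1.1, 4)
error_higher = total_squared_error(xs, ys, 1, 4.5)

print(error_perfect, error_best, error_steeper, error_higher)
```

The perfectly fitting line has zero error, the least-squares line for the example data has a total squared error of 6, and nudging either the slope or the intercept away from the least-squares values increases the error.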

Regression (contd.)
• The regression equation for Y is the linear equation $\hat{Y} = bX + a$.
• The goal is to find the values of a and b that give the best-fit line:
• $b = SP / SS_X$ and $a = M_Y - bM_X$
• $SP = \sum (X - M_X)(Y - M_Y)$
• $SS_X = \sum (X - M_X)^2$
• Example: X = 1, 3, 5; Y = 4, 9, 8 (from text p. 559)
• What are the predicted values for X = 5, 7, 9?
• SPSS
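The example on this slide can be worked through directly from the formulas for b and a. This is a minimal Python sketch with an illustrative function name, not the SPSS procedure mentioned above, and it reads "5, 7, 9" as X values.

```python
def fit_line(xs, ys):
    """Return (b, a) for Y_hat = b * X + a, with b = SP / SS_X and a = M_Y - b * M_X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    b = sp / ss_x
    a = my - b * mx
    return b, a

b, a = fit_line([1, 3, 5], [4, 9, 8])
predictions = [b * x + a for x in (5, 7, 9)]

print(b, a)         # 1.0 4.0
print(predictions)  # [9.0, 11.0, 13.0]
```

Here M_X = 3, M_Y = 7, SP = 8 and SS_X = 8, so b = 1 and a = 4. The regression equation is Ŷ = X + 4, and the predicted values for X = 5, 7, 9 are 9, 11 and 13. Note that X = 7 and X = 9 lie outside the range of the original X scores, so those two predictions are extrapolations.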

A set of 9 data points (X and Y values) with a correlation of r = 0.80. The colored lines in part (a) show deviations from the mean for Y. For these data, $SS_Y$ = 240 (total variability). In part (b) the colored lines show deviations from the regression line. For these data, $SS_{error}$ = 86.4. The regression line reduces the SS value by $r^2$ = 0.64, or 64%. The proportion of error that remains is $1 - r^2$.
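The arithmetic behind these values can be written out as a short worked equation. The label $SS_{\text{regression}}$ for the predicted portion of the variability is an assumed name that does not appear on the slide; the numbers come from r = 0.80 and $SS_Y$ = 240.

```latex
\begin{align*}
SS_{\text{error}} &= (1 - r^2)\,SS_Y = (1 - 0.64)(240) = 86.4 \\
SS_{\text{regression}} &= r^2\,SS_Y = (0.64)(240) = 153.6 \\
SS_Y &= SS_{\text{regression}} + SS_{\text{error}} = 153.6 + 86.4 = 240
\end{align*}
```

The two parts add back up to the total variability, which is what it means for the regression line to account for 64% of the variability and leave 36% as error.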
