Research Methods of Applied Linguistics and Statistics (11)

Correlation and multiple regression. By Qin Xiaoqing

Pearson Correlation • The Pearson correlation allows us to establish the strength of relationships between continuous variables. • To show the relationship, the first step is to draw a scatterplot or scattergram, which can help us to obtain a preliminary understanding of this relationship. • The scatterplot can be described in terms of direction, strength and linearity.

Correlation and SPSS • Pearson product-moment coefficient is designed for interval level (continuous) variables. It can also be used if you have one continuous variable (e.g., scores on a measure of self-esteem) and one dichotomous variable (e.g., sex: M/F). • Spearman rank order correlation is designed for use with ordinal level or ranked data. • SPSS will calculate two types of correlation. First, it will give a simple bivariate correlation (which just means between two variables), also known as zero-order correlation. SPSS will also explore the relationship between two variables, while controlling for another variable. This is known as partial correlation.

Direction • Positive relationships represent relationships in which an increase in one variable is associated with an increase in a second. • Negative relationships represent relationships in which an increase in one variable is associated with a decrease in a second.

Strength • Strong relationships appear as those in which the dots are very close to a straight line. • Weak relationships appear as those in which the dots are more scattered about a straight line, or farther away from that line.

Linearity • Linear relationships are indicated when the pattern of dots on the scatter diagram appears to be straight, or if the points could be represented by drawing a straight line through them.

Steps for computation • List the scores for each subject (S) in parallel columns on a data sheet. • Square each score and enter these values in the columns labeled X² and Y². • Multiply each pair of scores and enter the product in the XY column. • Add the values in each column. • Insert the sums into the formula for the correlation coefficient.
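The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard raw-score formula for r; the function name and the five pairs of scores are made up for the example.

```python
import math

def pearson_r(x, y):
    """Pearson r from raw scores, following the column-sum steps."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)              # totals of the X and Y columns
    sum_x2 = sum(v * v for v in x)             # total of the X² column
    sum_y2 = sum(v * v for v in y)             # total of the Y² column
    sum_xy = sum(a * b for a, b in zip(x, y))  # total of the XY column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores for five subjects
x = [1, 2, 3, 4, 5]
y = [5, 3, 4, 1, 2]
r = pearson_r(x, y)
print(round(r, 2))  # -0.8, a strong negative relationship
```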

Example

Scatterplot

Interpretation of scatterplot • Checking for outliers • Inspecting the distribution of data points: Are the data points spread all over the place? This suggests a very low correlation. Are all the points neatly arranged in a narrow cigar shape? This suggests quite a strong correlation. Could you draw a straight line through the main cluster of points, or would a curved line better represent the points? If a curved line is evident (suggesting a curvilinear relationship), then Pearson correlation should not be used. What is the shape of the cluster? Is it even from one end to the other? Or does it start off narrow and then get fatter? If so, the data may be violating the assumption of homogeneity of variance (homoscedasticity). • Determining the direction of the relationship between the variables

Formula of r for raw score
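The formula itself is missing from this slide. The standard raw-score form of Pearson's r is given here on the assumption that it is what the slide showed; it uses exactly the column totals built in the computation steps (N is the number of subjects).

```latex
r = \frac{N\sum XY - \sum X \sum Y}
         {\sqrt{\left[N\sum X^{2} - \left(\sum X\right)^{2}\right]
                \left[N\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```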

Assumptions underlying Pearson correlation • The data are measured as scores or ordinal scales that are truly continuous. • The scores on the two variables, X and Y, are independent. • The data should be normally distributed through their range. • The relationship between X and Y must be linear.

Interpreting the correlation coefficient When r=.60, the variance overlap between the 2 measures is .36 (r² = .60 × .60). The overlap tells us the extent to which the 2 measures provide similar information. In other words, the magnitude of r² indicates the proportion of variance in X that is accounted for by Y, or vice versa.

Correlation coefficient • If you hope 2 tests measure basically the same thing, .71 isn’t very strong; .80 or .90 may be desirable. • A correlation of .30 or lower may appear weak, but in educational research such a correlation might be very important. • Significance level: p<.05 or p<.01, with df=N-2

• r=.10 to .29 or r=–.10 to –.29 small • r=.30 to .49 or r=–.30 to –.49 medium • r=.50 to 1.0 or r=–.50 to –1.0 large

Presenting the results from correlation

Comparing the correlation coefficients for two groups • Sometimes when doing correlational research you may want to compare the strength of the correlation coefficients for two separate groups.
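One common way to make this comparison is to convert each r to a z value (Fisher's transformation) and test the difference. The slide names no method, so the sketch below rests on that assumption and on the two groups being independent; the function name and figures are hypothetical.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1 = math.atanh(r1)  # Fisher r-to-z transformation
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error

# Hypothetical result: r = .50 for one group and r = .30 for the other, 103 cases each
z_obs = compare_correlations(0.50, 103, 0.30, 103)
print(round(z_obs, 2))
# If z_obs falls between -1.96 and +1.96, the two correlations do not
# differ significantly at the .05 level (two-tailed).
```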

Factors affecting correlation • If you have a restricted range of scores on either of the variables, this will reduce the value of r, e.g., age (18-20) and success on an exam. • The existence of extreme outliers in the data. • The presence of extremely high and extremely low scores on a variable with little in the middle. • Reliability of the data. • Non-linear relationship. Always check the scatterplot, particularly if you obtain low values of r.

Correlation versus causality • Correlation provides an indication that there is a relationship between two variables. It does not, however, indicate that one variable causes the other. The correlation between two variables (A and B) could be due to the fact that A causes B, that B causes A, or (just to complicate matters) that an additional variable (C) causes both A and B. The possibility of a third variable that influences both of your observed variables should always be considered.

Statistical vs practical significance • Don’t get too excited if your correlation coefficients are ‘significant’. With large samples, even quite small correlation coefficients can reach statistical significance. Although statistically significant, the practical significance of a correlation of .2 is very limited. You should focus on the actual size of Pearson’s r and the amount of shared variance between the two variables. To interpret the strength of your correlation coefficient you should also take into account other research that has been conducted in your particular topic area. If other researchers in your area have only been able to predict 9 per cent of the variance (a correlation of .3) in a particular outcome (e.g., anxiety), then your study that explains 25 per cent would be impressive in comparison. In other topic areas, 25 per cent of the variance explained may seem small and irrelevant.
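A small sketch can make the point concrete. It assumes the usual t test for a correlation coefficient (df = N − 2); the sample sizes are hypothetical. The same r of .20 gives a t of roughly 1.1 with 30 cases and roughly 4.6 with 500, while the shared variance stays at 4 per cent.

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether r differs from zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.20
for n in (30, 500):
    print(f"N={n}: t={t_for_r(r, n):.2f}, df={n - 2}, shared variance={r ** 2:.2f}")
# The two-tailed critical t at p < .05 is about 2.05 for df = 28 and about
# 1.96 for df = 498, so only the large sample reaches significance.
```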

Linear regression and multiple regression

Understanding regression • Regression is a way of predicting performance on the dependent variable via one or more independent variables. • In simple regression, we predict scores on one variable on the basis of scores on a second. • In multiple regression, we expand the possible sources of prediction and test to see which of many variables and which combination of variables allow us to make the best prediction.

Linear regression • Regression and correlation are related procedures. The correlation coefficient is central to simple linear regression. While we can’t make causal claims on the basis of correlation, we can use correlation to predict one variable from another. • We can’t just throw variables into a multiple regression and hope that, magically, answers will appear. • We should have a sound theoretical or conceptual reason for the analysis and, in particular, for the order of variables entering the equation.

Uses of multiple regression Multiple regression can tell you: • how well a set of variables is able to predict a particular outcome; • which variable in a set of variables is the best predictor of an outcome; and • whether a particular predictor variable is still able to predict an outcome when the effects of another variable are controlled for.

Assumptions of multiple regression • Sample size. Stevens (1996) recommends that ‘for social science research, about 15 subjects per predictor are needed for a reliable equation’. Tabachnick and Fidell (1996, p. 132) give a formula for calculating sample size requirements, taking into account the number of independent variables that you wish to use: N > 50 + 8m (where m = number of independent variables). If you have five independent variables you would need more than 90 cases. More cases are needed if the dependent variable is skewed. For stepwise regression there should be a ratio of forty cases for every independent variable.
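The three rules of thumb quoted above are easy to put side by side. A minimal sketch; the function name and dictionary keys are made up for the example.

```python
def cases_needed(m):
    """Sample-size rules of thumb for m independent variables."""
    return {
        "stevens_15_per_predictor": 15 * m,
        "tabachnick_fidell_exceed": 50 + 8 * m,  # N must be greater than this
        "stepwise_40_per_predictor": 40 * m,
    }

print(cases_needed(5))
# {'stevens_15_per_predictor': 75, 'tabachnick_fidell_exceed': 90, 'stepwise_40_per_predictor': 200}
```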

• Multicollinearity. This exists when the independent variables are highly correlated (r=.9 and above). Multiple regression doesn’t like multicollinearity, and it certainly doesn’t contribute to a good regression model, so always check for this problem before you start. • Outliers. Multiple regression is very sensitive to outliers (very high or very low scores). • Normality, linearity

MLAT and language learning The closer r is to ±1, the smaller the error will be in predicting performance on one variable from performance on the other. The closer r is to 0, the greater the error.

Predicting scores using regression Four pieces of information are needed: • the mean for scores on one variable; • the mean for scores on the second variable; • the S’s score on X; and • the slope of the best-fitting straight line of the joint distribution. With this information, we can predict the S’s score on Y from X on a mathematical basis. ‘Regressing’ Y on X makes it possible to predict Y from X.

Regression line • Vertical lines drawn from each dot to the straight line in the scatterplot show the amount of error. Suppose we square each of these errors and then find the mean of these squared errors. The best-fitting straight line is called the regression line and is technically defined as the line that results in the smallest mean of the squared errors. • We can think of the regression line as being that which is closest to all the dots but, more precisely, it is the one that results in a mean of the squared errors that is smaller than that of any other line we might produce.

Determining the slope • Convert the MLAT and language learning scores to z scores for comparability. • Then plot the intersection of each S’s z score on the MLAT and on the test. As the z scores on the MLAT increase, they form a ‘run’, the horizontal line of a triangle. At the same time, the z scores on the test increase to form a ‘rise’, the vertical line. • The slope (b) of the regression line is shown as we connect these 2 lines to form the third side of the triangle.

Regression coefficient with known r and SD • In the diagram, an increase of say 6 units on the run (MLAT) would equal 2 units of increase on the rise. • The slope is the rise divided by the run. The result is a fraction. That fraction is the correlation coefficient. • The correlation coefficient is the same as the slope of the best-fitting line in a z-score scatterplot. In the triangle, the slope of the regression line was 2÷6, and so r for the two is .33. • Suppose the SDs are 8 and 10 respectively for Y and X. To obtain the slope for raw scores, we multiply the correlation coefficient by the standard deviation of Y over the standard deviation of X: b = r × (SDy ÷ SDx) = .33 × (8 ÷ 10) ≈ .26.

Regression coefficient with raw data • With r and SD, it is very easy to find the slope. With raw data, the formula for slope follows:
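The formula is missing from this slide. The sketch below assumes the standard least-squares form, b = (NΣXY − ΣXΣY) ÷ (NΣX² − (ΣX)²), with the intercept taken as mean of Y minus b times mean of X. The data and function names are hypothetical. It also checks that the resulting line has a smaller mean squared error than slightly shifted lines, which is what defines the regression line.

```python
def least_squares_line(x, y):
    """Intercept a and slope b of the regression of Y on X from raw scores."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

def mean_squared_error(x, y, a, b):
    """Mean of the squared vertical distances from the dots to the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

# Hypothetical raw scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))  # 2.2 0.6

best = mean_squared_error(x, y, a, b)
# Any other line, here with a nudged slope or intercept, gives a larger error
print(best < mean_squared_error(x, y, a, b + 0.1))  # True
print(best < mean_squared_error(x, y, a + 0.1, b))  # True
```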

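Example: using TSE to predict TOEFL • Mean on TOEFL=540, SD=40. Mean on TSE=30, SD=4. r=.80, so b=.80×(40÷4)=8.0. • A student achieved 36 on the TSE, 6 higher than the mean. Multiplying that by the slope, we get 8×6=48. So our prediction of TOEFL is mean Y (540) +48=588. The formula follows: predicted Y = mean of Y + b × (X − mean of X). • Another regression equation is: predicted Y = a + bX, where the intercept a = mean of Y − b × mean of X (here a = 540 − 8×30 = 300, so predicted Y = 300 + 8×36 = 588).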
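The TSE/TOEFL example can be reproduced in a few lines. This is a sketch only; the function names are made up, and the figures are those given on the slide.

```python
def slope_from_r(r, sd_y, sd_x):
    """Slope of the regression of Y on X from r and the two standard deviations."""
    return r * sd_y / sd_x

def predict_y(x, mean_x, mean_y, b):
    """Predicted Y for a given X: mean of Y plus slope times distance of X from its mean."""
    return mean_y + b * (x - mean_x)

b = slope_from_r(0.80, sd_y=40, sd_x=4)                    # TOEFL is Y, TSE is X
predicted = predict_y(36, mean_x=30, mean_y=540, b=b)
print(round(b, 1), round(predicted))                        # 8.0 588
```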

Standard error of estimate • There is some overlap in the variance of the two variables. When we square the value of r, we find the degree of shared variance. • Of the original 100% of the variance, with an r=.50, we have accurately accounted for 25% of the variance using the straight line as the basis for prediction. The error variance is now reduced to 75%. • In regression, the standard error of estimate (SEE) shows the dispersion of scores away from the straight line. If all the data are tightly clustered on the line, little error is made in prediction. • SEE tells us how much error is likely to occur in prediction.

Error variance • To compute SEE, we need to know the error variance, which is the sum of the squared differences between actual and predicted scores, divided by N-2. • The square root of this variance is the SEE. In the example it is 1.35, where the mean for X=8, SD=4.47; the mean for Y=10.8, SD=2.96; r=.89.
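A minimal sketch of the computation, using hypothetical actual and predicted scores. The last lines use a common shortcut, SEE ≈ SD of Y × √(1 − r²), which is an assumption here rather than something stated on the slide; with the slide's figures it gives about 1.35.

```python
import math

def standard_error_of_estimate(actual, predicted):
    """Square root of the error variance: sum of squared errors over N - 2."""
    n = len(actual)
    sum_squared_errors = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(sum_squared_errors / (n - 2))

# Hypothetical actual Y scores and the scores predicted from a regression line
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(standard_error_of_estimate(actual, predicted), 2))  # 0.89

# Shortcut with the slide's figures (SD of Y = 2.96, r = .89)
see_shortcut = 2.96 * math.sqrt(1 - 0.89 ** 2)
print(round(see_shortcut, 2))  # 1.35
```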

Confidence interval • 68% confidence interval: ±1 SEE (e.g., ±1.35): 68% of actual Y scores would fall within ±1.35 of the predicted Y score. • 95% confidence interval: ±1.96×SEE • 99% confidence interval: ±2.58×SEE • Suppose the estimated score is 11.98. Then: • 95% confidence interval: between 9.33 (11.98-1.35×1.96) and 14.63 (11.98+1.35×1.96) • 99% confidence interval: between 8.50 (11.98-3.48) and 15.46 (11.98+3.48)
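The intervals above can be checked with a short sketch. The function name is made up; the multipliers 1.96 and 2.58 and the figures 11.98 and 1.35 come from the slide.

```python
def confidence_interval(predicted_y, see, multiplier):
    """Lower and upper bounds: predicted score plus or minus multiplier × SEE."""
    margin = multiplier * see
    return predicted_y - margin, predicted_y + margin

for label, multiplier in (("68%", 1.0), ("95%", 1.96), ("99%", 2.58)):
    low, high = confidence_interval(11.98, 1.35, multiplier)
    print(f"{label}: {low:.2f} to {high:.2f}")
```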

Estimated L2 scores predicted from class hours

Goodness of fit for regression model: R² • R², the squared multiple correlation, also called the coefficient of multiple determination, is the percent of the variance in the dependent variable explained uniquely or jointly by the independent variables. • Adjusted R² is an adjustment for the fact that when one has a large number of independents, it is possible that R² will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. • The greater the number of independents, the more the researcher is expected to report the adjusted coefficient.
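The slide gives no formula for the adjustment. The sketch below assumes the widely used form, adjusted R² = 1 − (1 − R²)(N − 1) ÷ (N − k − 1), where k is the number of independents; the R², N and k values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n cases and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R² of .50 shrinks more as independents are added to a sample of 50
for k in (1, 5, 10):
    print(k, round(adjusted_r_squared(0.50, n=50, k=k), 3))
```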

T-test • t-tests are used to assess the significance of individual b coefficients, specifically testing the null hypothesis that the regression coefficient is zero.

F test • The F test is used to test the significance of R, which is the same as testing the significance of R², which is the same as testing the significance of the regression model as a whole. • If prob(F) < .05, then the model is considered significantly better than would be expected by chance and we reject the null hypothesis of no linear relationship of y to the independents.
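For reference, the F statistic for the model as a whole can be computed from R² alone. The sketch assumes the standard formula F = (R² ÷ k) ÷ ((1 − R²) ÷ (N − k − 1)), with k and N − k − 1 degrees of freedom; the figures are hypothetical, and prob(F) would still be read from SPSS output or an F table.

```python
def f_for_model(r2, n, k):
    """F statistic for the whole regression model (df = k and n - k - 1)."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

f_value = f_for_model(0.50, n=50, k=5)
print(round(f_value, 2), "with df =", (5, 50 - 5 - 1))  # 8.8 with df = (5, 44)
```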

Multicollinearity • Multicollinearity is the intercorrelation of independent variables. When an independent variable is regressed on the others, an R² near 1 violates the assumption of no perfect collinearity, while a high R² increases the standard error of the beta coefficients and makes assessment of the unique role of each independent difficult or impossible.

Tolerance or VIF • To assess multivariate multicollinearity, one uses tolerance or VIF (variance inflation factor), which build in the regressing of each independent on all the others: tolerance is 1 − R² from that regression, and VIF is its reciprocal. • As a rule of thumb, if tolerance is less than .20, a problem with multicollinearity is indicated. • When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. • The greater the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients.
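A minimal sketch of the two indices, assuming the usual definitions (tolerance = 1 − R² of an independent regressed on all the others; VIF = 1 ÷ tolerance). The function name and the R² values are hypothetical; the .20 cut-off is the rule of thumb quoted above.

```python
def tolerance_and_vif(r2_with_other_independents):
    """Tolerance and VIF for one independent variable."""
    tolerance = 1 - r2_with_other_independents
    vif = 1 / tolerance
    return tolerance, vif

# Hypothetical R² values from regressing each independent on the others
for name, r2 in (("aptitude", 0.81), ("motivation", 0.30)):
    tolerance, vif = tolerance_and_vif(r2)
    flag = "possible multicollinearity" if tolerance < 0.20 else "acceptable"
    print(f"{name}: tolerance={tolerance:.2f}, VIF={vif:.2f} ({flag})")
```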

Selecting method for predicting variables: Forward selection • This method starts with a model containing none of the explanatory variables. In the first step, the procedure considers variables one by one for inclusion and selects the variable that results in the largest increase in R². In the second step, the procedure considers variables for inclusion in a model that only contains the variable selected in the first step. In each step, the variable with the largest increase in R² is selected until, according to an F-test, further additions are judged not to improve the model.
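The procedure can be sketched in plain Python. This is an illustration under stated assumptions, not SPSS's implementation: the stopping rule is an F-to-enter threshold of 4.0 chosen only for the example, the data and variable names are hypothetical, and R² is obtained by ordinary least squares solved with simple Gaussian elimination.

```python
def solve(matrix, vector):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    m = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R² of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum((y[i] - sum(bj * xj for bj, xj in zip(beta, design[i]))) ** 2
                 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def forward_select(predictors, y, f_to_enter=4.0):
    """Add, one at a time, the variable giving the largest increase in R²."""
    n = len(y)
    selected, current_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        trial = {name: r_squared([predictors[s] for s in selected] + [predictors[name]], y)
                 for name in remaining}
        best = max(trial, key=trial.get)
        k = len(selected) + 1
        # F for the change in R² when one variable is added (df = 1, n - k - 1)
        f_change = (trial[best] - current_r2) / ((1.0 - trial[best]) / (n - k - 1))
        if f_change < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current_r2 = trial[best]
    return selected, current_r2

# Hypothetical data for 12 learners
predictors = {
    "aptitude":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "motivation": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8],
    "anxiety":    [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5],
}
score = [3.8, 4.3, 8.1, 8.1, 12.7, 16.5, 14.9, 19.3, 20.2, 21.7, 24.3, 28.1]
selected, final_r2 = forward_select(predictors, score)
print(selected, round(final_r2, 3))
```

Backward and stepwise selection would reuse the same R² and F-change calculations, removing variables instead of, or as well as, adding them.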

Backward selection • This method starts with a model containing all the variables and eliminates variables one by one, at each step choosing the variable for exclusion as that leading to the smallest decrease in R². Again, the procedure is repeated until, according to an F-test, further exclusions would represent a deterioration of the model.

Stepwise selection • This method is, essentially, a combination of the previous two approaches. Starting with no variables in the model, variables are added as with the forward selection method. In addition, after each inclusion step, a backward elimination process is carried out to remove variables that are no longer judged to improve the model.

Interpretation of the results from multiple regression • Checking the assumptions • Evaluating the model • Evaluating each of the independent variables

Presenting the results of multiple regression • It would be a good idea to look for examples of the presentation of different statistical analysis in the journals relevant to your topic area. Different journals have different requirements and expectations. Given the severe space limitations in journals these days, often only a brief summary of the results is presented and readers are encouraged to contact the author for a copy of the full results.
