## Data Analysis: Correlation and regression analysis

**Correlation**

• Correlation measures to what extent two (or more) variables are related
• Correlation expresses a relationship that is not necessarily precise (e.g. height and weight)
• Positive correlation indicates that the two variables move in the same direction
• Negative correlation indicates that they move in opposite directions

Research Methods & Data Analysis

**Covariance**

• Covariance measures the "joint variability" of two variables:

$$\mathrm{Cov}(X, Y) = E\big[(X - E(X))\,(Y - E(Y))\big]$$

• Here E(…) indicates the expected value (i.e. the average value)
• If two variables are independent, then the covariance is zero (however, Cov = 0 does not mean that the two variables are independent)

**Correlation coefficient**

• The correlation coefficient r gives a measure (in the range −1 to +1) of the relationship between two variables
• r = 0 means no (linear) correlation
• r = +1 means perfect positive correlation
• r = −1 means perfect negative correlation
• Perfect correlation indicates that y is an exact linear function of x, so a given change in x always corresponds to a proportional change in y

**Correlation coefficient and covariance**

The Pearson correlation coefficient is the covariance rescaled by the two standard deviations.

Population:

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

Sample:

$$r = \frac{s_{xy}}{s_x \, s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

**Bivariate and multivariate correlation**

• Bivariate correlation
  • 2 variables
  • Pearson correlation coefficient
• Partial correlation
  • The correlation between two variables after allowing for the effect of other "control" variables

**Significance level in correlation**

• Level of correlation (value of the correlation coefficient): indicates to what extent the two variables "move together"
• Significance of correlation (p value): given that the correlation coefficient is computed on a sample, indicates whether the relationship appears to be statistically significant
• Examples
  • Correlation is 0.50, but not significant: the sampling error is so high that the actual correlation could even be 0
  • Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident in the estimated value of the correlation

**Correlation and covariance in SPSS**

• Choose between bivariate and partial

**Bivariate correlation**

• Select the variables you want to analyse
• Request the significance level (two-tailed)
• Ask for additional statistics (if necessary)

**Bivariate correlation output**

[Figure: SPSS bivariate correlation output]

**Partial correlations**

• List of variables to be analysed
• Control variables

**Partial correlation output**

Partial correlation coefficients, controlling for SIZE and STYLE. Each cell shows coefficient / (D.F.) / 2-tailed significance; " . " is printed if a coefficient cannot be computed.

| | AMTSPENT | USECOUP | ORG |
|---|---|---|---|
| AMTSPENT | 1.0000 / (0) / P = . | .2677 / (775) / P = .000 | -.0116 / (775) / P = .746 |
| USECOUP | .2677 / (775) / P = .000 | 1.0000 / (0) / P = . | .0500 / (775) / P = .164 |
| ORG | -.0116 / (775) / P = .746 | .0500 / (775) / P = .164 | 1.0000 / (0) / P = . |

Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables, i.e. the correlations are in effect computed on consumers shopping in stores of identical size and with the same shopping style.

**Bivariate and partial correlations**

• Correlation between Amount spent and Use of coupon
  • Bivariate correlation: 0.291 (p value 0.00)
  • Partial correlation: 0.268 (p value 0.00)
• The amount spent is positively correlated with the use of coupons (0=no use, 1=from newspaper, 2=from mailing, 3=both)
• The level of correlation does not change much after accounting for different shop sizes and shopping styles

**Linear regression analysis**

$$y_i = a + b\,x_i + e_i$$

• y: dependent variable
• a: intercept
• b: regression coefficient
• x: independent variable (explanatory variable, regressor…)
• e: error

**Regression analysis**

[Figure: plot of y against x]

**Example**

• We want to investigate whether there is a relationship between cholesterol and age in a sample of 18 people
• The dependent variable is the cholesterol level
• The explanatory variable is age

**What regression analysis does**

• Determines whether a relationship exists between the dependent and explanatory variables
• Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)
• Allows the values of the dependent variable to be predicted

**Regression and correlation**

• Correlation: there is no causal relationship assumed
• Regression: we assume that the explanatory variables "cause" the dependent variable
  • Bivariate: one explanatory variable
  • Multivariate: two or more explanatory variables

**How to estimate the regression coefficients**

• The objective is to estimate the population parameters a and b from our data sample
• A good way to estimate them is by minimising the error $e_i$, which represents the difference between the actual observation and the estimated (predicted) one
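The estimation idea above (choose a and b so that the errors e_i are as small as possible) can be sketched in a few lines. This is a minimal illustration, assuming the criterion is the sum of squared errors (ordinary least squares) and using made-up age and cholesterol figures; the 18-person sample from the slides is not reproduced here, and `fit_line` is a hypothetical helper name.

```python
import numpy as np

def fit_line(x, y):
    """Estimate a (intercept) and b (slope) of y = a + b*x + e by least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    a = y_mean - b * x_mean
    # Errors e_i: actual observation minus predicted value.
    residuals = y - (a + b * x)
    return a, b, residuals

# Hypothetical data, for illustration only (not the sample from the slides).
age = [25, 32, 41, 47, 53, 60, 68]
chol = [180, 195, 210, 205, 230, 245, 250]

a, b, e = fit_line(age, chol)
print(f"intercept a = {a:.2f}, slope b = {b:.3f}")
print(f"sum of squared errors = {np.sum(e ** 2):.2f}")
```

Any other choice of a and b gives a larger sum of squared errors on the same data, which is what makes these estimates the "best fit" under this criterion.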
The objective is to identify the line (i.e. the a and b coefficients) that minimises the distance between the actual points and the fitted line.

**The least squares method**

• This is based on minimising the sum of the squared distances (errors) rather than the distances themselves

**Bivariate regression in SPSS**

[Figure: SPSS menu for bivariate regression]

**Regression dialog box**

• Dependent variable
• Explanatory variable
• Leave this unchanged!

**Regression output**

• Statistical significance: is the coefficient different from 0?
• Value of the coefficients

**Model diagnostics: goodness of fit**

The value of R-square lies between 0 and 1 and represents the proportion of total variation that is explained by the regression model.

**R-square**

Total variation = variation explained by regression + residual variation

$$R^2 = \frac{\text{variation explained by regression}}{\text{total variation}} = 1 - \frac{\text{residual variation}}{\text{total variation}}$$

**Multivariate regression**

• The principle is identical to bivariate regression, but there are more explanatory variables
• The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables

**Multivariate regression in SPSS**

• Analyze / Regression / Linear
• Simply select more than one explanatory variable

**Output**

[Figure: SPSS multivariate regression output]

**Coefficient interpretation**

• The constant represents the amount spent when all other variables are 0 (£296.5)
• The coefficients for health food stores, size of store and being vegetarian are not significantly different from 0
• Gender coefficient = −69.6: on average, being a woman (G=1) implies spending £69.6 less
• Shopping style coefficient = +22.8 × S
  • S=1 (shops for himself) = +22.8
  • S=2 (shops for himself & spouse) = +45.6
  • S=3 (shops for himself & family) = +68.4
• Coupon use coefficient = +30.4 × C
  • C=1 (does not use coupons) = +30.4
  • C=2 (coupon from newspapers) = +60.8
  • C=3 (coupon from mailings) = +91.2
  • C=4 (coupon from both) = +121.6
• Categorization problems?

**Prediction**

• On average, how much will someone with the following characteristics spend?
  • Male (G=0)
  • Shopping for family (S=3)
  • Not using coupons (C=1)

**How good is the model?**

• The regression model explains less than 19% of the total variation in the amount spent

**Task A**

• Examine the relationship between the amount spent and the following customer characteristics:
  • Being male/female
  • Being vegetarian
  • Shopping for himself / for himself and others
  • Shopping style (weekly, bi-weekly, etc.)
• Potential methods:
  • Battery of hypothesis tests & analysis of variance
  • Regression analysis

**Task B**

• Examine the relationship between the amount spent and the following customer characteristics:
  • Hypothesis: the average amount spent in health-oriented shops is higher than that in other shops. True or false?
  • Test the same hypothesis accounting for different shop sizes
• Potential methods:
  • Battery of hypothesis tests & analysis of variance
  • Regression analysis

**Task C**

• Find a relationship between the average amount spent per store and the following store characteristics:
  • Size of store
  • Health-oriented store
  • Store organisation
• Potential methods:
  • Transform the customer data set into a store data set
  • Battery of ANOVA
  • Regression analysis

**Task D**

• Hypothesis: is the amount spent by those who use coupons significantly higher?
• What is the most effective way of distributing coupons?
  • By mail
  • In newspapers
  • Both
• Potential methods:
  • Recode the variable into 1=not using coupon and 2=using coupon
  • Hypothesis testing
  • Analysis of variance
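As a companion to the correlation slides, the three measures discussed there (covariance, the Pearson coefficient, and partial correlation) can be sketched as follows. This is an illustration on synthetic data, not the supermarket data set; the function names are hypothetical, and the partial correlation is computed by one common route (correlating the residuals left after regressing each variable on the controls), which may differ in detail from what SPSS does internally.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: average product of deviations (n - 1 divisor)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def pearson_r(x, y):
    """Covariance rescaled by the two standard deviations."""
    return covariance(x, y) / np.sqrt(covariance(x, x) * covariance(y, y))

def partial_r(x, y, controls):
    """Correlation of x and y after removing the linear effect of the controls."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Design matrix: a constant column plus one column per control variable.
    Z = np.column_stack([np.ones(len(x))] + [np.asarray(c, dtype=float) for c in controls])
    # Residuals of x and y once the controls have been regressed out.
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return pearson_r(res_x, res_y)

# Synthetic example: z drives both x and y, so they correlate only through z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

print(f"bivariate r(x, y)       = {pearson_r(x, y):.3f}")
print(f"partial r(x, y | z)     = {partial_r(x, y, [z]):.3f}")
```

In this constructed case the bivariate correlation is sizeable while the partial correlation is close to zero. The coupon example in the slides shows the opposite situation, where controlling for shop size and shopping style barely changes the coefficient (0.291 versus 0.268).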
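The question on the Prediction slide can be worked through by plugging the quoted coefficients into the regression equation. The sketch below is one possible reading, not an answer given in the slides: it assumes the three non-significant terms (health food store, size of store, vegetarian) are left out, since their coefficient values are not reported, and `predict_spend` is a hypothetical helper name.

```python
# Coefficients quoted on the "Coefficient interpretation" slide.
CONSTANT = 296.5
COEF = {"gender": -69.6, "style": 22.8, "coupon": 30.4}

def predict_spend(gender, style, coupon):
    """Predicted amount spent in pounds.

    Assumption: the non-significant terms are dropped, because their
    coefficient values are not given in the slides.
    """
    return (CONSTANT
            + COEF["gender"] * gender
            + COEF["style"] * style
            + COEF["coupon"] * coupon)

# Male (G=0), shopping for family (S=3), not using coupons (C=1).
print(round(predict_spend(gender=0, style=3, coupon=1), 1))  # 395.3
```

That is 296.5 + 0 + 68.4 + 30.4. Because the model explains less than 19% of the variation in the amount spent, individual customers can be expected to differ widely from this average.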