Multivariate Data Analysis Using SPSS

Multivariate Data Analysis Using SPSS John Zhang ARL, IUP

Topics • A Guide to Multivariate Techniques • Preparation for Statistical Analysis • Review: ANOVA • Review: ANCOVA • MANOVA • MANCOVA • Repeated Measure Analysis • Factor Analysis • Discriminant Analysis • Cluster Analysis

Guide-1 • Correlation: 1 IV – 1 DV; relationship • Regression: 1+ IV – 1 DV; relation/prediction • T test: 1 IV (Cat.) – 1 DV; group diff. • One-way ANOVA: 1 IV (2+ cat.) – 1 DV; group diff. • One-way ANCOVA: 1 IV (2+ cat.) – 1 DV – 1+ covariates; group diff. • One-way MANOVA: 1 IV (2+ cat.) – 2+ DVs; group diff.

Guide-2 • One-way MANCOVA: 1 IV (2+cat.) – 2+ DVs – 1+ covariate; group diff. • Factorial MANOVA: 2+ IVs (2+cat.) – 2+ DVs; group diff. • Factorial MANCOVA: 2+ IVs (2+cat.) – 2+ DVs – 1+ covariate; group diff. • Discriminant Analysis: 2+ IVs – 1 DV (cat.); group prediction • Factor Analysis: explore the underlying structure

Preparation for Stat. Analysis-1 • Screen data • SPSS Utility procedures • Frequency procedure • Missing data analysis (missing data should be random) • Check if patterns exist • Drop data case-wise • Drop data variable-wise • Impute missing data

Preparation for Stat. Analysis-2 • Outliers (generally, statistical procedures are sensitive to outliers) • Univariate case: boxplot • Multivariate case: Mahalanobis distance (a chi-square statistic); a point is an outlier when its p-value is < .001 • Treatment: • Drop the case • Report two analyses (one with the outlier, one without)
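
The multivariate outlier rule on this slide can be sketched in a few lines. This is a minimal illustration for two variables only, not the SPSS procedure itself: the function names are hypothetical, the 2x2 covariance inverse is written out by hand, and the critical values are the standard chi-square table entries for p = .001.

```python
# Chi-square critical values at p = .001, keyed by degrees of freedom
# (standard table values; df = number of variables).
CHI2_CRIT_001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467}


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (divide by n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(S) * [dx dy]^T with the 2x2 inverse expanded.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def flag_outliers(points, df=2):
    """True for every point whose squared distance exceeds the p = .001 cutoff."""
    crit = CHI2_CRIT_001[df]
    return [d2 > crit for d2 in mahalanobis_sq_2d(points)]
```

With two variables the cutoff is 13.816; a case is flagged only when its squared distance exceeds that value, which matches the "p-value < .001" rule above.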

Preparation for Stat. Analysis-3 • Normality • Testing univariate normality: • Q-Q plot • Skewness and kurtosis: both should be 0 for a normal distribution; treat the variable as non-normal when the p-value is < .01 or .001 • Kolmogorov-Smirnov statistic: a significant result means not normal • Testing multivariate normality: • Scatterplots should be elliptical • Each variable must be normal
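
As a rough illustration of the skewness and kurtosis check, the sketch below computes both from sample moments and divides each by a large-sample standard error. It is an assumption-laden simplification: SPSS reports bias-corrected estimates with exact standard errors, so its numbers will differ slightly, and the function names are hypothetical.

```python
import math


def skew_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


def z_scores(values):
    """Approximate z statistics using the large-sample standard errors
    sqrt(6/n) for skewness and sqrt(24/n) for kurtosis."""
    n = len(values)
    skew, kurt = skew_kurtosis(values)
    return skew / math.sqrt(6.0 / n), kurt / math.sqrt(24.0 / n)
```

A z value beyond roughly 2.58 in absolute size corresponds to the two-sided .01 level and one beyond roughly 3.29 to the .001 level mentioned on the slide.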

Preparation for Stat. Analysis-4 • Linearity • Linear combinations of variables make sense • Two variables (or combinations of variables) are linearly related • Check for linearity • Residual plot in regression • Scatterplots

Preparation for Stat. Analysis-5 • Homoscedasticity: the covariance matrices are equal across groups • Box’s M test: tests the equality of the covariance matrices across groups • Sensitive to departures from normality • Levene’s test: tests the equality of variances across groups • Not sensitive to departures from normality
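
Levene's test can be read as a one-way ANOVA on each case's absolute deviation from its group mean. The sketch below, with hypothetical function names, computes only the F statistic under that reading; it does not produce a p-value, which would need the F distribution with (k - 1, N - k) degrees of freedom.

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA; groups is a list of lists of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def levene_statistic(groups):
    """Levene's W: the ANOVA F computed on absolute deviations from group means."""
    abs_deviations = []
    for g in groups:
        mean = sum(g) / len(g)
        abs_deviations.append([abs(x - mean) for x in g])
    return one_way_f(abs_deviations)
```

Groups with the same spread give a statistic near 0 even when their means differ a lot, which is why the test speaks to variances rather than means.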

Preparation for Stat. Analysis-Example-1 • Steps in preparation for stat. analysis: • Check variable coding, recode if necessary • Examine missing data • Check for univariate outliers, normality, homogeneity of variances (Explore) • Test for homogeneity of variances (ANOVA) • Check for multivariate outliers (Regression>Save>Mahalanobis) • Check for linearity (scatterplots; residual plots in regression)

Preparation for Stat. Analysis-Example-2 • Use dataset gssft.sav • Objective: we are interested in investigating group differences (satjob2) in income (rincome91), age (age_2) and education (educ) • Check the coding: need to recode rincome91 into rincome_2 (set 22, 98, and 99 to system-missing) • Transform>Recode>Into Different Variables
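
Outside SPSS, the same recode amounts to mapping the three codes to a missing marker and copying every other value. A minimal sketch, assuming the income codes arrive as a plain list and using `None` for system-missing; the function name is hypothetical.

```python
# Codes treated as missing on this slide.
MISSING_CODES = {22, 98, 99}


def recode_missing(values, missing_codes=MISSING_CODES):
    """Return a new list with the listed codes replaced by None."""
    return [None if v in missing_codes else v for v in values]
```

Writing to a new list mirrors "Into Different Variables": the original variable stays untouched.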

Preparation for Stat. Analysis-Example-3 • Check for missing values • Use Frequencies for categorical variables • Use Descriptive Stat. for measurement variables • For categorical variables: • If missing values are < 5%, use the List-wise option • If >= 5%, define the missing value as a new category • For measurement variables: • If missing values are < 5%, use the List-wise option • If between 5% and 15%, use Transform>Replace Missing Values; replacing less than 15% of the data has little effect on the outcome • If greater than 15%, consider dropping the variable or subject
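
The thresholds above can be written as a small decision helper. This is only a sketch of the rule of thumb as stated on the slide; the function names and the returned labels are hypothetical, and missing values are assumed to be stored as `None`.

```python
def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def suggest_treatment(values, categorical):
    """Apply the 5% / 15% rule of thumb to one variable."""
    share = missing_share(values)
    if share < 0.05:
        return "listwise deletion"
    if categorical:
        return "code missing as its own category"
    if share <= 0.15:
        return "replace missing values"
    return "consider dropping the variable or the cases"
```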

Preparation for Stat. Analysis-Example-4 • Check missing values for satjob2 • Analyze>Descriptive Statistics>Frequencies • Check missing values for rincome_2 • Analyze>Descriptive Statistics>Descriptives • Replace the missing values in rincome_2 • Transform>Replace Missing Values

Preparation for Stat. Analysis-Example-5 • Check for univariate outliers, normality, homogeneity of variances • Analyze>Descriptive Statistics>Explore • Put rincome_2, age_2, and educ into the Dependent List box; satjob2 into the Factor List box • There are outliers in rincome_2; let's change those outliers to the acceptable min or max value • Transform>Recode>Into Different Variables • Put rincome_2 into the Original Variable box, type rincome_3 as the new name • Replace all values <= 3 by 4; all other values remain the same

Preparation for Stat. Analysis-Example-6 • Explore rincome_3 again: not normal • Transform rincome_3 into rincome_4 by ln or sqrt • Explore rincome_4 • Check for multivariate outliers • Analyze>Regression>Linear • Put id (dummy variable) into the Dependent box, put rincome_4, age_2, and educ into the Independent box • Click Save, then check the Mahalanobis box • Compare each Mahalanobis distance with the chi-square critical value at p = .001 and df = number of independent variables

Preparation for Stat. Analysis-Example-7 • Check for multivariate normality: • Each variable must be univariate normal • Construct a scatterplot matrix; each scatterplot should have an elliptical shape • Check for homoscedasticity • Univariate (ANOVA, Levene’s test) • Multivariate (MANOVA, Box’s M test, use the .01 level of significance)

Review: ANOVA -1 • One-way ANOVA tests the equality of group means • Assumptions: independent observations; normality; homogeneity of variance • Two-way ANOVA tests three hypotheses simultaneously: • Test the interaction of the levels of the two independent variables • Interaction occurs when the effect of one factor depends on the level of the second factor • Test the two independent variables separately

Review: ANCOVA -1 • Idea: the difference on a DV often does not depend on just one or two IVs; it may also depend on other measurement variables. ANCOVA takes such dependency into account • i.e. it removes the effect of one or more covariates • Assumptions: in addition to the regular ANOVA assumptions, we need: • Linear relationship between the DV and the covariates • The slope of the regression line is the same for each group • The covariates are reliable and measured without error

Review: ANCOVA -2 • Homogeneity of slopes = homogeneity of regression = there is no interaction between the IVs and the covariate • If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted • Example: determine if hours worked per week (hrs2) differ by gender (sex) and between those satisfied and dissatisfied with their job (satjob2), after adjusting for income (i.e. equalizing income)
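
A quick informal look at homogeneity of slopes is to fit the DV-on-covariate regression separately in each group and compare the slopes. The sketch below does only that descriptive step, with hypothetical function names and rows given as (group, covariate, dv) tuples; the formal check remains the covariate-by-factor interaction test in the GLM.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def slopes_by_group(rows):
    """Map each group label to the slope of dv on covariate within that group."""
    grouped = {}
    for group, covariate, dv in rows:
        xs, ys = grouped.setdefault(group, ([], []))
        xs.append(covariate)
        ys.append(dv)
    return {group: ols_slope(xs, ys) for group, (xs, ys) in grouped.items()}
```

Roughly parallel lines (similar slopes, possibly different intercepts) are what the ANCOVA assumption asks for; clearly different slopes point to an interaction.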

Review: ANCOVA -3 • Analyze>GLM>Univariate • Move hrs2 into the DV box; move sex and satjob2 into the Fixed Factor box; move rincome_2 into the Covariate box • Click Model>Custom • Highlight all variables and move them to the Model box • Make sure the Interaction option is selected • Click Options • Move sex and satjob2 into the Display Means box • Click Descriptive Stat.; Estimates of effect size; and Homogeneity tests • This tests the homogeneity of regression slopes

Review: ANCOVA -4 • If there is no interaction found by the previous step, then repeat the previous step except click at Model>Factorial instead of Model>Custom

Review: ANOVA -2 • A significant interaction means the two IVs in combination have a significant effect on the DV; thus, it does not make sense to interpret the main effects • Assumptions: the same as one-way ANOVA • Example: the impact of gender (sex) and age (agecat4) on income (rincome_2) • Explore (omitted) • Analyze>GLM>Univariate • Click Model>click Full factorial>Cont. • Click Options>click Descriptive Stat; Estimates of effect size; Homogeneity test • Click Post Hoc>click LSD; Bonferroni; Scheffe; Cont. • Click Plots>put one IV into Horizontal Axis and the other into Separate Lines

MANOVA-1 • Characteristics • Similar to ANOVA • Multiple DVs • The DVs are correlated and a linear combination makes sense • It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance • The idea of MANOVA is to find a linear combination that separates the groups ‘optimally’, and perform ANOVA on the linear combination

MANOVA-2 • Advantages • The chance of discovering what actually changed as a result of the different treatments increases • May reveal differences not shown in separate ANOVAs • No inflation of Type I error • The use of multiple ANOVAs ignores some very important info (the fact that the DVs are correlated)

MANOVA-3 • Disadvantages • More complicated • ANOVA is often more powerful • Assumptions: • Independent random samples • Multivariate normal distribution in each group • Homogeneity of covariance matrix • Linear relationship among DVs

MANOVA-4 • Steps in carrying out MANOVA • Check the assumptions • If MANOVA is not significant, stop • If MANOVA is significant, carry out univariate ANOVAs • If a univariate ANOVA is significant, do Post Hoc tests • If the covariance matrices are homogeneous, use Wilks’ Lambda; if not, use Pillai’s Trace. In general, all 4 statistics should give similar results
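
To make Wilks' Lambda less of a black box: it is the ratio det(W) / det(T), where W is the pooled within-group sums-of-squares-and-cross-products (SSCP) matrix and T is the total SSCP matrix. The sketch below works this out for exactly two DVs so the determinants stay simple; the function names are hypothetical and no significance test is attached.

```python
def _mean2(rows):
    """Mean of each of the two DVs."""
    n = len(rows)
    return sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n


def _sscp(rows, center):
    """SSCP entries (sxx, sxy, syy) of the rows around the given center."""
    sxx = sum((r[0] - center[0]) ** 2 for r in rows)
    sxy = sum((r[0] - center[0]) * (r[1] - center[1]) for r in rows)
    syy = sum((r[1] - center[1]) ** 2 for r in rows)
    return sxx, sxy, syy


def _det(m):
    """Determinant of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]."""
    return m[0] * m[2] - m[1] ** 2


def wilks_lambda(groups):
    """groups: list of groups, each a list of (dv1, dv2) pairs."""
    all_rows = [row for group in groups for row in group]
    total = _sscp(all_rows, _mean2(all_rows))
    within = (0.0, 0.0, 0.0)
    for group in groups:
        part = _sscp(group, _mean2(group))
        within = tuple(a + b for a, b in zip(within, part))
    return _det(within) / _det(total)
```

Values near 1 mean the group centroids barely differ; values near 0 mean most of the variation lies between groups.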

MANOVA-5 • Example: An experiment looking at the memory effects of different instructions: 3 groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group of subjects was instructed to like or dislike the syllables as they were presented (to generate affect). A second group was told that they would be tested (induce anxiety?). The 3rd group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory

MANOVA-6 • How to do it? • File>Open Data • Open the file As9.por in Instruct>Zhang Multivariate Short Course folder • Analyze>GLM>Multivariate • Move recall and recog into the Dependent Variable box; move group into the Fixed Factors box • Click Options; move group into the Display Means box (this will display the marginal means predicted by the model; these means may differ from the observed means if there are covariates or the model is not factorial); the Compare main effects box tests every pair of the estimated marginal means for the selected factors • Click Estimates of effect size and Homogeneity of variance

MANOVA-7 • Push buttons: • Plots: create a profile plot for each DV displaying group means • Post Hoc: Post Hoc tests for marginal means • Save: save predicted values, etc. • Contrast: perform planned comparisons • Model: specify the model • Options: • Display Means for: display the estimated means predicted by the model • Compare main effects: test for significant difference between every pair of estimated marginal means for each of the main effects

MANOVA-8 • Observed power: produce a statistical power analysis for your study • Parameter estimate: check this when you need a predictive model • Spread vs. level plot: visual display of homogeneity of variance

MANOVA-9 • Example 2: Check for the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav) • Screen data: transform educ to educ2 to eliminate cases with ‘6 or less’ • Check for assumptions: explore • MANOVA

MANCOVA-1 • Objective: Test for mean differences among groups on a linear combination of DVs after adjusting for the covariate • Example: to test if there are differences in productivity (measured by income and hours worked) for individuals in different age groups after adjusting for education level

MANCOVA-2 • Assumptions: similar to ANCOVA • SPSS how to: • Analyze>GLM>Multivariate • Move rincome_2 and educ2 to the DV box; move sex and satjob into the Fixed Factors box; move age to the Covariate box • Check for homogeneity of regression • Click Model>Custom; highlight all variables and move them to the Model box • If the covariate-IV interaction is not significant, repeat the process but select Full factorial under Model

Repeated Measure Analysis-1 • Objective: test for significant differences in means when the same observation appears in multiple levels of a factor • Examples of repeated measure studies: • Marketing – compare customers’ ratings on 4 different brands • Medicine – compare test results before, immediately after, and six months after a procedure • Education – compare performance test scores before and after an intervention program

Repeated Measure Analysis-2 • The logic of repeated measure: SPSS performs repeated measure ANOVA by computing contrasts (differences) across the repeated measures factor’s levels for each subject, then testing if the means of the contrasts are significantly different from 0; any between subject tests are based on the means of the subjects.
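
The logic on this slide can be illustrated for a single contrast: compute the contrast score for each subject, then test whether its mean differs from 0 with a one-sample t statistic. This is a simplified sketch with hypothetical names, covering one contrast only; it returns the t statistic and its degrees of freedom, not a p-value.

```python
import math


def contrast_t(subject_scores, weights):
    """One-sample t statistic for a within-subject contrast.

    subject_scores: one row per subject, one column per repeated measure.
    weights: contrast coefficients, e.g. (-1, 1) for "second minus first".
    """
    contrasts = [sum(w * s for w, s in zip(weights, row)) for row in subject_scores]
    n = len(contrasts)
    mean = sum(contrasts) / n
    variance = sum((c - mean) ** 2 for c in contrasts) / (n - 1)
    t_value = mean / math.sqrt(variance / n)
    return t_value, n - 1
```

With two levels this reduces to the familiar paired t test; with more levels, several such contrasts are tested jointly.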

Repeated Measure Analysis-3 • Assumptions: • Independent observations • Normality • Homogeneity of variances • Sphericity: if two or more contrasts are to be pooled (the test of the main effect is based on this pooling), then the contrasts should be equally weighted and uncorrelated (equal variances and uncorrelated contrasts); this assumption is equivalent to the covariance matrix of the contrasts being diagonal with equal diagonal elements

Repeated Measure Analysis-4 • Example 1: A study in which 5 subjects were tested in each of 4 drug conditions • Open data file: • File>Open…Data; select Repmeas1.por • SPSS repeated measure procedure: • Analyze>GLM>Repeated Measure • Within-Subject Factor Name (the name of the repeated measure factor): a repeated measure factor is expressed as a set of variables • Replace factor1 with Drug • Number of levels: the number of repeated measurements • Type 4

Repeated Measure Analysis-5 • The Measure pushbutton has two functions • For multiple dependent measures (e.g. we recorded 4 measures of physiological stress under each of the drug conditions) • To label the factor levels • Click Measure; type memory in the Measure Name box; click Add • Click Define: here we link the repeated measure factor levels to variable names and define between-subject factors and covariates • Move drug1 – drug4 to the Within-Subject box • You can move a selected variable with the up and down buttons

Repeated Measure Analysis-6 • Model button: by default a complete model • Contrast button: specify particular contrasts • Plot button: create profile plots that graph factor level estimated marginal means for up to 3 factors at a time • Post Hoc: provide Post Hoc tests for between subject factors • Save button: allow you to save predicted values, residuals, etc. • Options: similar to MANOVA • Click Descriptive; click at Transformation Matrix (it provides the contrasts)

Repeated Measure Analysis-7 • Interpret the results • Look at the descriptive statistics • Look at the test for sphericity • If the sphericity test is significant, use the Multivariate results (test on the contrasts). It tests whether all of the contrast variables are zero in the population • If the sphericity test is not significant, use the Sphericity Assumed result • Look at the tests for within-subject contrasts: they test the linear trend, the quadratic trend… • These may not make sense in some applications, as in this example (but they make sense in terms of time and dosage)

Repeated Measure Analysis-8 • The transformation matrix provides info on what the linear contrast is, etc. • The first table is for the average across the repeated measure factor (here the coefficients are all .5, which means each variable is weighted equally; normalization requires that the sum of the squares equals 1) • The second table defines the contrasts for the repeated measure factor • Linear – increases by a constant, etc. • Linear and quadratic are orthogonal, etc. • Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others
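
The two properties mentioned here, unit length and orthogonality, are easy to verify numerically. The sketch below assumes the standard orthogonal polynomial coefficients for four equally spaced levels, scaled to unit length; the constant names are hypothetical, and the values should be compared against the transformation matrix SPSS actually prints.

```python
import math


def normalize(weights):
    """Scale a contrast so that the sum of its squared coefficients is 1."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


def dot(a, b):
    """Inner product of two contrasts; 0 means they are orthogonal."""
    return sum(x * y for x, y in zip(a, b))


# Average across the four levels: 4 * 0.5**2 = 1, so it is already normalized.
AVERAGE = [0.5, 0.5, 0.5, 0.5]
# Orthogonal polynomial contrasts for four equally spaced levels.
LINEAR = normalize([-3, -1, 1, 3])
QUADRATIC = normalize([1, -1, -1, 1])
CUBIC = normalize([-1, 3, -3, 1])
```

The linear coefficients rise by a constant step from level to level, which is the sense in which the contrast captures a straight-line trend.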

Repeated Measure Analysis-9 • Repeat the analysis, except under the Options button, move ‘drug’ into Display Means, click Compare main effects and select the Bonferroni adjustment • Transformation Coefficients (M Matrix): it shows how the variables are created for comparison. Here, we compare the drug conditions, so the M matrix is an identity matrix • Suppose we want to test each adjacent pair of means: drug1 vs. drug2; drug2 vs. drug3; drug3 vs. drug4: • Repeated measure>Define>Contrast>Select Repeated

Repeated Measure Analysis-10 • Example 2: A marketing experiment was devised to evaluate whether viewing a commercial produces improved ratings for a specific brand. Ratings on 3 brands were obtained from subjects before and after viewing the commercial. Since the hope was that the commercial would improve ratings of only one brand (A), researchers expected a significant brand by pre-post commercial interaction. There are two between-subjects factors: sex and brand used by the subject

Repeated Measure Analysis-11 • SPSS how to: • Analyze>GLM>Repeated Measures • Replace factor1 with prepost in the Within-Subject Factor box; type 2 in the Number of level box; click add • Type brand in the Within-Subject Factor box; type 3 in the Number of level box; click add • Click measure; type measure in Measure Name box; click add • Note: SPSS expects 2 between-subject factors

Repeated Measure Analysis-12 • Click Define button; move the appropriate variable into place; move sex and user into Between-Subject Factor box • Click Options button; move sex, user, prepost and brand into the Display means box • Click Homogeneity tests and descriptive boxes • Click Plot; move user into Horizontal Axis box and brand into Separate Lines box • Click continue; OK

Factor Analysis-1 • The main goal of factor analysis is data reduction. A typical use of factor analysis is in survey research, where a researcher wishes to represent a number of questions with a smaller number of factors • Two questions in factor analysis: • How many factors are there, and what do they represent (interpretation)? • Two technical aids: • Eigenvalues • Percentage of variance accounted for

Factor Analysis-2 • Two types of factor analysis: • Exploratory: introduced here • Confirmatory: SPSS AMOS • Theoretical basis: • Correlations among variables are explained by underlying factors • An example of a mathematical one-factor model for two variables: V1 = L1*F1 + E1; V2 = L2*F1 + E2

Factor Analysis-3 • Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2 – the lambdas or factor loadings) plus a random component • V1 and V2 correlate because they share the common factor, and their correlation is related to the factor loadings; thus, the factor loadings can be estimated from the correlations • A set of correlations can yield different factor loadings (i.e. the solutions are not unique) • One should pick the simplest solution
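
The non-uniqueness point can be shown with simple arithmetic. Under added assumptions that the slide does not state (V1, V2 and F1 are standardized, and the random components are uncorrelated with F1 and with each other), the one-factor model implies corr(V1, V2) = L1 * L2. The sketch below, with hypothetical function names, lists several loading pairs that reproduce the same correlation.

```python
def implied_correlation(l1, l2):
    """Correlation between V1 and V2 implied by a one-factor model
    (assumes standardized variables and uncorrelated random components)."""
    return l1 * l2


def candidate_loadings(r, l1_grid):
    """For each trial L1, the L2 that reproduces correlation r, kept only
    when L2 is a valid loading (absolute value at most 1)."""
    pairs = []
    for l1 in l1_grid:
        l2 = r / l1
        if abs(l2) <= 1.0:
            pairs.append((l1, round(l2, 3)))
    return pairs
```

For example, an observed correlation of .4 is reproduced by loadings (.5, .8), (.8, .5) and (1.0, .4) alike, so two variables alone cannot pin down the loadings.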

Factor Analysis-4 • A factor solution needs to be confirmed: • By a different factor method • By a different sample • More on terminology • Factor loading: interpreted as the Pearson correlation between the variable and the factor • Communality: the proportion of variability for a given variable that is explained by the factors • Extraction: the process by which the factors are determined from a large set of variables
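
Given a loading matrix, communalities and the variance accounted for by each factor follow from sums of squared loadings. The sketch below uses a made-up loading matrix in its usage and hypothetical function names; it assumes orthogonal factors and standardized variables, so that a factor's sum of squared loadings plays the role of its eigenvalue and the total variance equals the number of variables.

```python
def communalities(loadings):
    """Row-wise sum of squared loadings: variance of each variable
    explained by the factors together."""
    return [sum(l * l for l in row) for row in loadings]


def variance_explained(loadings):
    """Column-wise sum of squared loadings per factor, and the same
    figure as a percentage of the total variance (number of variables)."""
    n_vars = len(loadings)
    n_factors = len(loadings[0])
    sums = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
    percents = [100.0 * s / n_vars for s in sums]
    return sums, percents
```

With the illustrative matrix `[[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]]`, the first variable has communality 0.65 and the first factor accounts for 38% of the total variance.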
