Multivariate Data Analysis Using SPSS

John Zhang
ARL, IUP

Topics

  • A Guide to Multivariate Techniques

  • Preparation for Statistical Analysis

  • Review: ANOVA

  • Review: ANCOVA

  • MANOVA

  • MANCOVA

  • Repeated Measure Analysis

  • Factor Analysis

  • Discriminant Analysis

  • Cluster Analysis

Guide 1


  • Correlation: 1 IV – 1 DV; relationship

  • Regression: 1+ IV – 1 DV; relation/prediction

  • T test: 1 IV (Cat.) – 1 DV; group diff.

  • One-way ANOVA: 1 IV (2+ cat.) – 1 DV; group diff.

  • One-way ANCOVA: 1 IV (2+ cat.) – 1 DV – 1+ covariates; group diff.

  • One-way MANOVA: 1 IV (2+ cat.) – 2+ DVs; group diff.

Guide 2


  • One-way MANCOVA: 1 IV (2+cat.) – 2+ DVs – 1+ covariate; group diff.

  • Factorial MANOVA: 2+ IVs (2+cat.) – 2+ DVs; group diff.

  • Factorial MANCOVA: 2+ IVs (2+cat.) – 2+ DVs – 1+ covariate; group diff.

  • Discriminant Analysis: 2+ IVs – 1 DV (cat.); group prediction

  • Factor Analysis: explore the underlying structure

Preparation for Stat. Analysis-1

  • Screen data

    • SPSS Utility procedures

    • Frequency procedure

  • Missing data analysis (missing data should be random)

    • Check if patterns exist

    • Drop data case-wise

    • Drop data variable-wise

    • Impute missing data

Preparation for Stat. Analysis-2

  • Outliers (generally, statistical procedures are sensitive to outliers)

    • Univariate case: boxplot

    • Multivariate case: Mahalanobis distance (a chi-square statistic); a point is an outlier when its p-value is < .001

    • Treatment:

      • Drop the case

      • Report two analyses (one with the outlier, one without)
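
As a rough illustration of the multivariate case, the sketch below computes squared Mahalanobis distances for two variables in plain Python and flags cases beyond the chi-square critical value at p = .001. The function name, the constant, and the toy data are assumptions made for this example. With df = 2 the critical value has the closed form -2 ln(.001), which keeps the sketch free of statistical libraries; other df values would need a chi-square table.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# For df = 2 the chi-square survival function is exp(-x / 2),
# so the critical value at p = .001 is -2 * ln(.001), about 13.82.
CRITICAL_P001_DF2 = -2 * math.log(0.001)

# Hypothetical data: a 5 x 5 grid of ordinary cases plus one extreme case.
cases = [(i % 5, i // 5) for i in range(25)] + [(20, -15)]
distances = mahalanobis_sq_2d(cases)
outliers = [c for c, d2 in zip(cases, distances) if d2 > CRITICAL_P001_DF2]
print(outliers)
```

Note that with very small samples no case can exceed the critical value, because the largest possible squared distance is (n - 1)² / n.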

Preparation for Stat. Analysis-3

  • Normality

    • Testing univariate normal:

      • Q-Q plot

      • Skewness and kurtosis: both should be 0 under normality; the variable is not normal when the p-value is < .01 or .001

      • Kolmogorov-Smirnov statistic: a significant result means not normal

    • Testing multivariate normal:

      • Scatterplots should be elliptical

      • Each variable must be normal
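
The skewness and kurtosis check can be sketched in a few lines. This is a minimal illustration, assuming the simple moment estimators and the large-sample standard errors sqrt(6/n) and sqrt(24/n); SPSS reports bias-corrected versions, so its numbers will differ slightly. The function name and sample data are hypothetical.

```python
import math

def skew_kurtosis_z(values):
    """Return (skewness, excess kurtosis, z for skewness, z for kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    # Approximate large-sample standard errors.
    z_skew = skew / math.sqrt(6 / n)
    z_kurt = excess_kurt / math.sqrt(24 / n)
    return skew, excess_kurt, z_skew, z_kurt

# Hypothetical right-skewed sample.
print(skew_kurtosis_z([1, 1, 1, 2, 10]))
```

An absolute z value above 2.58 corresponds to a two-tailed p < .01, and above 3.29 to p < .001.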

Preparation for Stat. Analysis-4

  • Linearity

    • A linear combination of the variables makes sense

    • Two variables (or combinations of variables) are linearly related

    • Check for linearity

      • Residual plot in regression

      • Scatterplots

Preparation for Stat. Analysis-5

  • Homoscedasticity: the covariance matrices are equal across groups

    • Box’s M test: tests the equality of the covariance matrices across groups

      • Sensitive to departures from normality

    • Levene’s test: tests the equality of variances across groups

      • Not sensitive to departures from normality
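
To make the idea behind Levene's test concrete, the sketch below computes the statistic as a one-way ANOVA F ratio on absolute deviations from each group mean. This is an illustrative, mean-based version with hypothetical function names; it returns only the statistic, and a p-value would still need an F table or a statistics package.

```python
def one_way_f(groups):
    """F ratio of a one-way ANOVA for a list of groups (lists of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def levene_w(groups):
    """Levene statistic: ANOVA F on absolute deviations from the group means."""
    abs_dev = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    return one_way_f(abs_dev)

# Hypothetical groups with clearly different spread.
print(levene_w([[1, 2, 3], [0, 10, 20]]))
```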

Preparation for Stat. Analysis-Example-1

  • Steps in preparation for stat. analysis:

    • Check variable coding; recode if necessary

    • Examine missing data

    • Check for univariate outliers, normality, and homogeneity of variances (Explore)

    • Test for homogeneity of variances (ANOVA)

    • Check for multivariate outliers (Regression>Save> Mahalanobis)

    • Check for linearity (scatterplots; residual plots in regression)

Preparation for Stat. Analysis-Example-2

  • Use dataset gssft.sav

  • Objective: we are interested in investigating group differences (satjob2) in income (rincome91), age (age_2) and education (educ)

  • Check for coding: need to recode rincome91 into rincome_2 (22, 98, and 99 become system-missing)

    • Transform>Recode>Into Different Variable
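
Outside SPSS, the same recode amounts to mapping the listed codes to a missing marker. The sketch below is a minimal Python illustration; the function name and the sample values are hypothetical, and `None` stands in for SPSS's system-missing value.

```python
# Codes that the slide says should become system-missing.
MISSING_CODES = {22, 98, 99}

def recode_income(values):
    """Return a new list with the listed codes replaced by None (missing)."""
    return [None if v in MISSING_CODES else v for v in values]

# Hypothetical raw income codes.
print(recode_income([5, 22, 13, 98, 99, 21]))
```

Writing the result to a new list mirrors the Into Different Variable choice: the original values stay available for checking.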

Preparation for Stat. Analysis-Example-3

  • Check for missing value

    • Use Frequency for categorical variable

    • Use Descriptive Stat. for measurement variable

    • For categorical variables:

      • If missing value is < 5%, use List-wise option

      • If >=5%, define the missing value as a new category

    • For measurement variables:

      • If missing value is < 5%, use List-wise option

      • If between 5% and 15%, use Transform>Replace Missing Values. Replacing less than 15% of the data has little effect on the outcome

      • If greater than 15%, consider dropping the variable or the subject
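
The rules of thumb above can be written down as a small decision function. This is only a sketch of the thresholds stated on the slide; the function name, the returned labels, and the use of `None` for a missing value are assumptions for illustration.

```python
def missing_strategy(values, categorical):
    """Suggest a treatment from the percentage of missing (None) values."""
    pct = 100.0 * sum(v is None for v in values) / len(values)
    if pct < 5:
        return "listwise deletion"
    if categorical:
        return "treat missing as its own category"
    if pct <= 15:
        return "replace missing values"
    return "consider dropping the variable or the cases"

# Hypothetical measurement variable with 10% missing values.
print(missing_strategy([1] * 9 + [None], categorical=False))
```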

Preparation for Stat. Analysis-Example-4

  • Check for missing values in satjob2

    • Analyze>Descriptive Statistics>Frequencies

  • Check for missing values in rincome_2

    • Analyze>Descriptive Statistics>Descriptives

  • Replace the missing values in rincome_2

    • Transform>Replace Missing Values

Preparation for Stat. Analysis-Example-5

  • Check for univariate outliers, normality, and homogeneity of variances

    • Analyze>Descriptive Statistics>Explore

      • Put rincome_2, age_2, and educ into the Dependent List box; satjob2 into the Factor List box

    • There are outliers in rincome_2; let's change those outliers to the acceptable min or max value

      • Transform>Recode>Into Different Variable

        • Put rincome_2 into the Original Variable box, type rincome_3 as the new name

        • Replace all values <= 3 by 4; all other values remain the same

Preparation for Stat. Analysis-Example-6

  • Explore rincome_3 again: not normal

    • Transform rincome_3 into rincome_4 by ln or sqrt

  • Explore rincome_4

  • Check for multivariate outliers

    • Analyze>Regression>Linear

      • Put id (dummy variable) into the Dependent box; put rincome_4, age_2, and educ into the Independent box

      • Click Save, then check the Mahalanobis box

      • Compare the Mahalanobis distance with the chi-square critical value at p = .001 and df = number of independent variables

Preparation for Stat. Analysis-Example-7

  • Check for multivariate normality:

    • Each variable must be univariate normal

    • Construct a scatterplot matrix; each scatterplot should have an elliptical shape

  • Check for homoscedasticity

    • Univariate (ANOVA, Levene’s test)

    • Multivariate (MANOVA, Box’s M test; use the .01 significance level)

Review: ANOVA -1

  • One-way ANOVA tests the equality of group means

    • Assumptions: independent observations; normality; homogeneity of variance

  • Two-way ANOVA tests three hypotheses simultaneously:

    • Test the interaction of the levels of the two independent variables

      • Interaction occurs when the effect of one factor depends on the level of the second factor

    • Test each of the two independent variables separately (main effects)
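
For a 2 x 2 design, the notion of interaction can be made concrete as a difference of differences between cell means. The sketch below is descriptive only (it does not test significance), and the function name and cell means are hypothetical.

```python
def interaction_contrast(cell_means):
    """cell_means[a][b]: mean of the DV at level a of factor A and level b of factor B."""
    effect_of_b_at_a0 = cell_means[0][1] - cell_means[0][0]
    effect_of_b_at_a1 = cell_means[1][1] - cell_means[1][0]
    # Zero means the effect of B is the same at both levels of A (no interaction).
    return effect_of_b_at_a1 - effect_of_b_at_a0

# Hypothetical cell means: B raises the DV by 2 at A = 0 but by 10 at A = 1.
print(interaction_contrast([[10, 12], [20, 30]]))
```

A value near zero corresponds to parallel lines in a profile plot; a large value means the effect of one factor changes with the level of the other.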

Review: ANCOVA -1

  • Idea: the difference on a DV often does not depend on just one or two IVs; it may also depend on other measurement variables. ANCOVA takes such dependency into account.

    • I.e., it removes the effect of one or more covariates

  • Assumptions: in addition to the regular ANOVA assumptions, we need:

    • Linear relationship between DV and covariates

    • The slope of the regression line is the same for each group

    • The covariates are reliable and measured without error

Review: ANCOVA -2

  • Homogeneity of slopes = homogeneity of regression = there is no interaction between the IVs and the covariate

    • If the interaction between the covariate and the IVs is significant, ANCOVA should not be conducted

  • Example: determine whether hours worked per week (hrs2) differ by gender (sex) and between those satisfied and dissatisfied with their job (satjob2), after adjusting for income

    Review: ANCOVA -3

    • Analyze>GLM>Univariate

      • Move hrs2 into DV box; move sex and satjob2 into Fixed Factor box; move rincome_2 into Covariate box

      • Click Model>Custom

        • Highlight all variables and move them to the Model box

        • Make sure the Interaction option is selected

      • Click Options

        • Move sex and satjob2 into Display Means box

        • Click Descriptive Stat.; Estimates of effect size; and Homogeneity tests

      • This tests the homogeneity of regression slopes
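
A quick way to see what homogeneity of regression slopes means is to fit the DV on the covariate separately in each group and compare the slopes. The sketch below does this with ordinary least squares in plain Python. It is a descriptive aid, not the interaction test that SPSS performs, and the function names and data are hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_by_group(rows):
    """rows: (group, covariate, dv) tuples -> {group: slope of dv on covariate}."""
    by_group = {}
    for group, cov, dv in rows:
        xs, ys = by_group.setdefault(group, ([], []))
        xs.append(cov)
        ys.append(dv)
    return {g: ols_slope(xs, ys) for g, (xs, ys) in by_group.items()}

# Hypothetical data: the slope is 2 in one group and 1 in the other.
rows = [("m", 1, 2), ("m", 2, 4), ("m", 3, 6),
        ("f", 1, 1), ("f", 2, 2), ("f", 3, 3)]
print(slopes_by_group(rows))
```

Clearly different slopes, as in this toy data, are what a significant covariate-by-factor interaction would indicate.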

    Review: ANCOVA -4

    • If there is no interaction found by the previous step, then repeat the previous step except click at Model>Factorial instead of Model>Custom

    Review: ANOVA -2

    • A significant interaction means that the two IVs in combination have a significant effect on the DV; thus, it does not make sense to interpret the main effects.

    • Assumptions: the same as One-way ANOVA

    • Example: the impact of gender (sex) and age (agecat4) on income (rincome_2)

      • Explore (omitted)

      • Analyze>GLM>Univariate

        • Click model>click Full factorial>Cont.

        • Click Options>Click Descriptive Stat; Estimates of effect size; Homogeneity test

        • Click Post Hoc>click LSD; Bonferroni; Scheffe; Cont.

        • Click Plots>put one IV into Horizontal and the other into Separate line

    MANOVA-1


    • Characteristics

      • Similar to ANOVA

      • Multiple DVs

      • The DVs are correlated and linear combination makes sense

      • It tests whether mean differences among k groups on a combination of DVs are likely to have occurred by chance

      • The idea of MANOVA is to find a linear combination that separates the groups ‘optimally’, and perform ANOVA on the linear combination

    MANOVA-2


    • Advantages

      • The chance of discovering what actually changed as a result of the different treatments increases

      • May reveal differences not shown in separate ANOVAs

      • No inflation of the Type I error rate

      • The use of multiple ANOVAs ignores some very important info (the fact that the DVs are correlated)

    MANOVA-3


    • Disadvantages

      • More complicated

      • ANOVA is often more powerful

    • Assumptions:

      • Independent random samples

      • Multivariate normal distribution in each group

      • Homogeneity of covariance matrix

      • Linear relationship among DVs

    MANOVA-4


    • Steps in carrying out MANOVA

      • Check for assumptions

      • If MANOVA is not significant, stop

      • If MANOVA is significant, carry out univariate ANOVA

      • If univariate ANOVA is significant, do Post Hoc

    • If homoscedasticity holds, use Wilks’ Lambda; if not, use Pillai’s Trace. In general, all 4 statistics should be similar.
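
To show what Wilks' Lambda measures, the sketch below computes it for two DVs as the ratio of the determinant of the within-groups SSCP matrix to that of the total SSCP matrix. Restricting to two DVs lets the determinants be written out by hand; the function names and data are hypothetical, and no significance test is attempted.

```python
def sscp(pairs, cx, cy):
    """Sums of squares and cross-products about the point (cx, cy)."""
    sxx = sum((x - cx) ** 2 for x, _ in pairs)
    syy = sum((y - cy) ** 2 for _, y in pairs)
    sxy = sum((x - cx) * (y - cy) for x, y in pairs)
    return sxx, syy, sxy

def wilks_lambda(groups):
    """groups: list of lists of (dv1, dv2) pairs. Returns |W| / |T|."""
    everyone = [p for g in groups for p in g]
    n = len(everyone)
    gx = sum(x for x, _ in everyone) / n
    gy = sum(y for _, y in everyone) / n
    txx, tyy, txy = sscp(everyone, gx, gy)      # total SSCP
    wxx = wyy = wxy = 0.0
    for g in groups:                             # pooled within-groups SSCP
        mx = sum(x for x, _ in g) / len(g)
        my = sum(y for _, y in g) / len(g)
        a, b, c = sscp(g, mx, my)
        wxx, wyy, wxy = wxx + a, wyy + b, wxy + c
    return (wxx * wyy - wxy ** 2) / (txx * tyy - txy ** 2)

# Hypothetical, widely separated groups: Lambda is close to 0.
far_apart = [[(1, 2), (2, 1), (3, 4)], [(101, 102), (102, 101), (103, 104)]]
print(wilks_lambda(far_apart))
```

Lambda lies between 0 and 1; values near 1 mean the group centroids barely differ, and values near 0 mean most of the variation is between groups.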

    MANOVA-5


    • Example: An experiment looking at the memory effects of different instructions: 3 groups of human subjects learned nonsense syllables as they were presented and were administered two memory tests: recall and recognition. The first group of subjects was instructed to like or dislike the syllables as they were presented (to generate affect). A second group was told that they would be tested (to induce anxiety?). The third group was told to count the syllables as they were presented (interference). The objective is to assess group differences in memory

    MANOVA-6


    • How to do it?

      • File>Open Data

        • Open the file As9.por in Instruct>Zhang Multivariate Short Course folder

      • Analyze>GLM>Multivariate

        • Move recall and recog into Dependent Variable box; move group into Fixed Factors box

        • Click Options; move group into the Display Means box (this displays the marginal means predicted by the model; these means may differ from the observed means if there are covariates or the model is not factorial); the Compare main effects box tests every pair of the estimated marginal means for the selected factors

        • Check Estimates of effect size and Homogeneity tests

    MANOVA-7


    • Push buttons:

      • Plots: create a profile plot for each DV displaying group means

      • Post Hoc: Post Hoc tests for marginal means

      • Save: save predicted values, etc.

      • Contrast: perform planned comparisons

      • Model: specify the model

      • Options:

        • Display Means for: display the estimated means predicted by the model

          • Compare main effects: test for significant difference between every pair of estimated marginal means for each of the main effects

    MANOVA-8


    • Observed power: produce a statistical power analysis for your study

    • Parameter estimate: check this when you need a predictive model

    • Spread vs. level plot: visual display of homogeneity of variance

    MANOVA-9


    • Example 2: Check for the impact of job satisfaction (satjob) and gender (sex) on income (rincome_2) and education (educ) (in gssft.sav)

      • Screen data: transform educ to educ2 to eliminate cases with ‘6 or less’

      • Check for assumptions: explore

      • MANOVA

    MANCOVA-1


    • Objective: test for mean differences among groups on a linear combination of DVs after adjusting for the covariate

    • Example: test whether there are differences in productivity (measured by income and hours worked) for individuals in different age groups after adjusting for education level

    MANCOVA-2


    • Assumptions: similar to ANCOVA

    • SPSS how to:

      • Analyze>GLM>Multivariate

        • Move rincome_2 and educ2 to DV box; move sex and satjob into IV box; move age to Covariate box

        • Check for homogeneity of regression

          • Click Model>Custom; highlight all variables and move them to the Model box

        • If the covariate-IVs interaction is not significant, repeat the process but select Full factorial under Model

    Repeated Measure Analysis-1

    • Objective: test for significant differences in means when the same observation appears in multiple levels of a factor

    • Examples of repeated measure studies:

      • Marketing – compare customers’ ratings of 4 different brands

      • Medicine – compare test results before, immediately after, and six months after a procedure

      • Education – compare performance test scores before and after an intervention program

    Repeated Measure Analysis-2

    • The logic of repeated measure: SPSS performs repeated measure ANOVA by computing contrasts (differences) across the repeated measures factor’s levels for each subject, then testing if the means of the contrasts are significantly different from 0; any between subject tests are based on the means of the subjects.

    Repeated Measure Analysis-3

    • Assumptions:

      • Independent observations

      • Normality

      • Homogeneity of variances

      • Sphericity: if two or more contrasts are to be pooled (the test of the main effect is based on this pooling), then the contrasts should be equally weighted and uncorrelated (equal variances and uncorrelated contrasts); this assumption is equivalent to the covariance matrix of the contrasts being diagonal with equal diagonal elements
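
An equivalent and easier-to-inspect way to state sphericity is that the difference scores between every pair of repeated-measures levels have the same variance. The sketch below computes those variances with the standard library; the function name and the toy scores are assumptions for illustration, and no formal test (such as Mauchly's) is performed.

```python
from itertools import combinations
from statistics import variance

def difference_variances(scores):
    """scores: one row per subject, one column per repeated-measures level.

    Returns {(i, j): sample variance of the differences level i - level j}.
    """
    n_levels = len(scores[0])
    result = {}
    for i, j in combinations(range(n_levels), 2):
        diffs = [row[i] - row[j] for row in scores]
        result[(i, j)] = variance(diffs)
    return result

# Hypothetical scores for 3 subjects under 3 conditions.
scores = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(difference_variances(scores))
```

Markedly unequal variances, as in this toy data, suggest that sphericity is violated.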

    Repeated Measure Analysis-4

    • Example 1: A study in which 5 subjects were tested in each of 4 drug conditions

    • Open data file:

      • File>Open…Data; select Repmeas1.por

    • SPSS repeated measure procedure:

      • Analyze>GLM>Repeated Measures

        • Within-Subject Factor Name (the name of the repeated measure factor): a repeated measure factor is expressed as a set of variables

          • Replace factor1 with Drug

        • Number of levels: the number of repeated measurements

          • Type 4

    Repeated Measure Analysis-5

    • The Measure pushbutton has two functions

      • For multiple dependent measures (e.g. we recorded 4 measures of physiological stress under each of the drug conditions)

      • To label the factor levels

        • Click Measure; type memory in Measure name box; click add

      • Click Define: here we link the repeated measure factor level to variable names; define between subject factors and covariates

        • Move drug1 – drug4 to the Within-Subject box

          • You can move a selected variable with the up and down buttons

    Repeated Measure Analysis-6

    • Model button: by default a complete model

    • Contrast button: specify particular contrasts

    • Plot button: create profile plots that graph factor level estimated marginal means for up to 3 factors at a time

    • Post Hoc: provide Post Hoc tests for between subject factors

    • Save button: allow you to save predicted values, residuals, etc.

    • Options: similar to MANOVA

      • Click Descriptive; click at Transformation Matrix (it provides the contrasts)

    Repeated Measure Analysis-7

    • Interpret the results

      • Look at the descriptive statistics

      • Look at the test for Sphericity

        • If Sphericity is significant, use the Multivariate results (test on the contrasts). It tests whether all of the contrast variables are zero in the population

        • If Sphericity is not significant, use the Sphericity Assumed result

      • Look at the tests for within-subject contrasts: they test the linear trend, the quadratic trend, …

        • These may not make sense in some applications, as in this example (but they do make sense for factors such as time or dosage)

    Repeated Measure Analysis-8

    • The transformation matrix shows how the linear contrast, etc., are defined

      • The first table is for the average across the repeated measure factor (here the weights are all .5, which means each variable is weighted equally; normalization requires that the sum of the squares equals 1)

      • The second table defines the contrasts for the repeated measure factor

        • Linear – increase by a constant, etc.

        • Linear and quadratic are orthogonal, etc.
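
The normalization and orthogonality properties of such a transformation matrix can be checked numerically. The sketch below uses the standard orthogonal polynomial coefficients for four equally spaced levels as an assumption about what the table contains; the helper names are hypothetical.

```python
import math

def normalize(weights):
    """Scale the weights so that the sum of their squares equals 1."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]

def dot(u, v):
    """Inner product; 0 means the two contrasts are orthogonal."""
    return sum(a * b for a, b in zip(u, v))

average = normalize([1, 1, 1, 1])        # every weight becomes 0.5
linear = normalize([-3, -1, 1, 3])       # increases by a constant step
quadratic = normalize([1, -1, -1, 1])

print(average)
print(dot(linear, quadratic))
```

With four levels, equal weights of .5 give 4 × .5² = 1, which is why .5 appears in the table for the average.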

  • Having concluded there are memory differences due to drug condition, we want to know which conditions differ from which others

    Repeated Measure Analysis-9

    • Repeat the analysis, except under the Options button, move ‘drug’ into Display Means, click Compare main effects and select the Bonferroni adjustment

      • Transformation Coefficients (M Matrix): it shows how the variables are created for comparison. Here, we compare the drug conditions, so the M matrix is an identity matrix

    • Suppose we want to test each adjacent pair of means: drug1 vs. drug2; drug2 vs. drug3; drug3 vs. drug4:

      • Repeated measure>Define>Contrast>Select Repeated
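
The Bonferroni adjustment itself is simple arithmetic: each p-value is multiplied by the number of comparisons and capped at 1. The sketch below lists the pairwise comparisons among four drug conditions and applies the adjustment to made-up p-values; the function name and the p-values are purely illustrative.

```python
from itertools import combinations

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

conditions = ["drug1", "drug2", "drug3", "drug4"]
pairs = list(combinations(conditions, 2))   # all 6 pairwise comparisons

# Hypothetical unadjusted p-values, one per pair.
raw = dict(zip(pairs, [0.004, 0.03, 0.2, 0.0005, 0.6, 0.01]))
for pair, p_adj in bonferroni(raw).items():
    print(pair, round(p_adj, 4))
```

Testing only the three adjacent pairs, as with the Repeated contrast, would mean multiplying by 3 rather than 6.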

    Repeated Measure Analysis-10

    • Example 2: A marketing experiment was devised to evaluate whether viewing a commercial produces improved ratings for a specific brand. Ratings on 3 brands were obtained from subjects before and after viewing the commercial. Since the hope was that the commercial would improve ratings of only one brand (A), researchers expected a significant brand by pre-post commercial interaction. There are two between-subjects factors: sex and brand used by the subject

    Repeated Measure Analysis-11

    • SPSS how to:

      • Analyze>GLM>Repeated Measures

        • Replace factor1 with prepost in the Within-Subject Factor box; type 2 in the Number of Levels box; click Add

        • Type brand in the Within-Subject Factor box; type 3 in the Number of Levels box; click Add

        • Click measure; type measure in Measure Name box; click add

        • Note: SPSS expects 2 between-subject factors

    Repeated Measure Analysis-12

    • Click Define button; move the appropriate variable into place; move sex and user into Between-Subject Factor box

    • Click Options button; move sex, user, prepost and brand into the Display means box

    • Click Homogeneity tests and descriptive boxes

    • Click Plot; move user into Horizontal Axis box and brand into Separate Lines box

    • Click continue; OK

    Factor Analysis-1

    • The main goal of factor analysis is data reduction. A typical use of factor analysis is in survey research, where a researcher wishes to represent a number of questions with a smaller number of factors

    • Two questions in factor analysis:

      • How many factors are there, and what do they represent (interpretation)?

    • Two technical aids:

      • Eigenvalues

      • Percentage of variance accounted for

    Factor Analysis-2

    • Two types of factor analysis:

      • Exploratory: introduced here

      • Confirmatory: SPSS AMOS

    • Theoretical basis:

      • Correlations among variables are explained by underlying factors

      • An example of a mathematical one-factor model for two variables:
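
A one-factor model of this kind can be written as follows. This is a sketch using the notation of the next slide (V1 and V2 are the observed variables, F1 the common factor, L1 and L2 the loadings); the symbols E1 and E2 for the random components are chosen here for illustration. The correlation result additionally assumes standardized variables, a factor with unit variance, and random components that are uncorrelated with the factor and with each other.

```latex
\begin{aligned}
V_1 &= L_1 F_1 + E_1 \\
V_2 &= L_2 F_1 + E_2
\end{aligned}
\qquad\Longrightarrow\qquad
\operatorname{corr}(V_1, V_2) = L_1 L_2
```

Under those assumptions the observed correlation equals the product of the two loadings, which is why loadings can be estimated from correlations and why many different pairs of loadings reproduce the same correlation.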



    Factor Analysis-3

    • Each variable is composed of a common factor (F1) multiplied by a loading coefficient (L1, L2 – the lambdas or factor loadings) plus a random component

    • V1 and V2 correlate because of the common factor, and their correlation is related to the factor loadings; thus, the factor loadings can be estimated from the correlations

    • A set of correlations can derive different factor loadings (i.e. the solutions are not unique)

    • One should pick the simplest solution

    Factor Analysis-4

    • A factor solution needs to be confirmed:

      • By a different factor method

      • By a different sample

  • More on terminology

    • Factor loading: interpreted as the Pearson correlation between the variable and the factor

    • Communality: the proportion of variability for a given variable that is explained by the factor

    • Extraction: the process by which the factors are determined from a large set of variables

    Factor Analysis-5

    • Principal components: one of the extraction methods

      • A principal component is a linear combination of observed variables that is independent of (orthogonal to) the other components

      • The first component accounts for the largest amount of variance in the input data; the second component accounts for the largest amount of the remaining variance…

      • Orthogonal components are uncorrelated

    Factor Analysis-6

    • Possible application of principal components:

      • E.g., in survey research, it is common to have many questions addressing one issue (e.g. customer service). These questions are likely to be highly correlated, which makes it problematic to use them in some statistical procedures (e.g. regression). One can instead use factor scores, computed from the factor loadings on each orthogonal component

    Factor Analysis-7

    • Principal components vs. other extraction methods:

      • Principal components focus on accounting for the maximum amount of variance (the diagonal of a correlation matrix)

      • Other extraction methods (e.g. principal axis factoring) focus more on accounting for the correlations between variables (off-diagonal correlations)

      • Principal components can be defined as a unique combination of variables, but the other factor methods cannot

      • Principal components are used for data reduction but are more difficult to interpret

    Factor Analysis-8

    • Number of factors:

      • Eigenvalues are often used to determine how many factors to take

        • Take as many factors as there are eigenvalues greater than 1

          • An eigenvalue represents the amount of standardized variance in the variables accounted for by a factor

          • The amount of standardized variance in a variable is 1

          • The sum of the retained eigenvalues divided by the number of variables is the proportion of variance accounted for
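
The eigenvalue-greater-than-1 rule can be sketched as a small helper. It relies on the fact that the eigenvalues of a correlation matrix sum to the number of variables, so dividing by that number gives a proportion of variance. The function name and the eigenvalues below are made up for illustration and are not taken from any dataset in these slides.

```python
def kaiser_summary(eigenvalues):
    """Return (number of factors kept, proportion of variance they explain)."""
    n_vars = len(eigenvalues)          # eigenvalues sum to n_vars
    kept = [ev for ev in eigenvalues if ev > 1.0]
    return len(kept), sum(kept) / n_vars

# Hypothetical eigenvalues for 10 standardized variables (they sum to 10).
eigenvalues = [3.5, 2.0, 1.5, 0.9, 0.7, 0.5, 0.4, 0.3, 0.1, 0.1]
n_factors, proportion = kaiser_summary(eigenvalues)
print(n_factors, proportion)
```

A scree plot of the same eigenvalues is a common complementary check on the number of factors.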

    Factor Analysis-9

    • Rotation

      • Objective: to facilitate interpretation

      • Orthogonal rotation: done when data reduction is the objective and factors need to be orthogonal

        • Varimax: attempts to simplify interpretation by maximizing the variance of the variable loadings on each factor

        • Quartimax: simplifies the solution by finding a rotation that produces high and low loadings across factors for each variable

      • Oblique rotation: used when there is reason to allow the factors to be correlated

        • Oblimin and Promax (promax runs fast)

    Factor Analysis-10

    • Factor scores: if you are satisfied with a factor solution

      • You can request that a new set of variables be created that represents the scores of each observation on the factors (difficult to interpret)

      • You can use the lambda coefficients to judge which variables are highly related to a factor, then compute the sum or the mean of these variables for further analysis (easy to interpret)

    Factor Analysis-11

    • Sample size: the sample size should be about 10 to 15 times the number of variables (as in other multivariate procedures)

    • Number of methods: there are 8 factoring methods, including principal components

      • Principal axis: accounts for correlations between the variables

      • Unweighted least-squares: minimizes the residual between the observed and the reproduced correlation matrix

    Factor Analysis-12

    • Generalized least-squares: similar to unweighted least-squares but gives more weight to the variables with stronger correlations

    • Maximum likelihood: generates the solution that is most likely to have produced the correlation matrix

    • Alpha factoring: considers the variables as a sample; does not use factor loadings

    • Image factoring: decomposes the variables into a common part and a unique part, then works with the common part

    Factor Analysis-13

    • Recommendations:

      • Principal components and principal axis are the most commonly used methods

      • When there is multicollinearity, use principal components

      • Rotations are often done. Try to use Varimax

    Factor Analysis-14

    • Example 1: whether a small number of athletic skills account for performance in the ten separate decathlon events

      • File>Open>Data…; select Olymp88.por

      • Looking at correlation:

        • Analyze>Correlate>Bivariate

      • Principal components with orthogonal rotation

        • Analyze>Data Reduction>Factor

          • Select all variables except score

          • Click Extract button>click Scree Plot

          • Check off Unrotated factor solution

          • Click continue

    Factor Analysis-15

    • Click Rotation button>click Varimax; Loading plots; click continue

    • Click Options button>click Sorted by size; click Suppress absolute values box; change .1 to .3; click continue

    • Click Descriptive>Univariate descriptive; KMO and Bartlett’s test of sphericity (KMO measures how well the sample data are suited for factor analysis: .9 is great and less than .5 is not acceptable; Bartlett’s test tests the sphericity of the correlation matrix); click continue

    • Click OK

    Factor Analysis-16

    • Try to validate the first factor solution using a different method

      • Analyze>Data Reduction>Factor Analysis

        • Click Extraction>select Principal axis factoring; click continue

        • Click Rotation>select Direct Oblimin (leave delta value at 0, most oblique value possible); type 50 in the Max Iteration box; click continue

        • Click Scores button>click Save as variables (this involves solving a system of equations for the factors; regression is one of the methods for solving the equations); click continue

        • Click OK

    Factor Analysis-17

    • Note: the Pattern matrix gives the standardized linear weights and the Structure matrix gives the correlations between the variables and the factors (in principal component analysis, the component matrix gives both the factor loadings and the correlations)

    Discriminant Analysis-1

    • Discriminant analysis characterizes the relationship between a set of IVs and a categorical DV with relatively few categories

      • It creates a linear combination of the IVs that best characterizes the differences among the groups

      • Predictive discriminant analysis focuses on creating a rule to predict group membership

      • Descriptive DA studies the relationship between the DV and the IVs.

    Discriminant Analysis-2

    • Possible applications:

      • Whether a bank should offer a loan to a new customer?

      • Which customer is likely to buy?

      • Identify patients who may be at high risk for problems after surgery

    Discriminant Analysis-3

    • How does it work?

      • Assume the population of interest is composed of distinct populations

      • Assume the IVs follow a multivariate normal distribution

      • DA seeks a linear combination of the IVs that best separates the populations

      • If we have k groups, we need k-1 discriminant functions

      • A discriminant score is computed for each function

      • This score is used to classify cases into one of the categories

    Discriminant Analysis-4

    • There are three methods to classify group memberships:

      • Maximum likelihood method: assign a case to group k if the probability of membership is greater in group k than in any other group

      • Fisher (linear) classification functions: assign a membership to group k if its score on the function for group k is greater than any other function scores

      • Distance function: assign membership to group k if its distance to the centroid of the group is minimum

      • Note: SPSS uses Maximum likelihood method
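
Of the three rules, the distance rule is the easiest to sketch by hand. The example below assigns a case to the group with the nearest centroid, using plain Euclidean distance as a simplifying assumption (a discriminant analysis would normally measure distance in a way that accounts for the covariance of the IVs). The function names, group labels, and data are hypothetical.

```python
import math

def centroids(training):
    """training: {group: [(x1, x2, ...), ...]} -> {group: centroid tuple}."""
    return {
        group: tuple(sum(col) / len(rows) for col in zip(*rows))
        for group, rows in training.items()
    }

def classify(case, group_centroids):
    """Assign the case to the group whose centroid is closest (Euclidean)."""
    return min(group_centroids, key=lambda g: math.dist(case, group_centroids[g]))

# Hypothetical training cases on two IVs for two groups.
training = {
    "buy": [(1, 1), (2, 2), (3, 3)],
    "no": [(8, 8), (9, 9), (10, 10)],
}
group_centroids = centroids(training)
print(classify((2.5, 2), group_centroids))
```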

    Discriminant Analysis-5

    • Basic steps in DA:

      • Identify the variables

      • Screen data: look for outliers, variables that may not be good predictors, etc.

      • Run DA

      • Check for the correct prediction rate

      • Check for the importance of individual predictors

      • Validate the model

    Discriminant Analysis-6

    • Assumptions:

      • IVs are either dichotomous or measurement

      • Normality

      • Homogeneity of variances

    Discriminant Analysis-7

    • Example 1: VCR buyers filled out a survey; we want to determine which set of demographic information and attitudes best predicts which customers may buy another VCR

      • File>Open Data…>CSM.por

      • Explore the data

      • Analyze>Classify>Discriminant

        • Move age, complain, educ, fail, pinnovat, preliabl, puse, qual, use, and value into Independent box

        • Move buyyes into Grouping box

        • Click Define Range; type 1 for Min and 2 for Max

        • Click continue

    Discriminant Analysis-8

    • Click Statistics>click Box’s M and Fisher’s; continue

    • Click Classify button>click Summary table; Separate groups; Continue

    • Click Save button>click on Discriminant Scores; continue

    • Click OK

  • How are the original variables related to the discriminant score?

    • Graphs>Scatter>Click Define

      • Move pinnovat into X and dis1_1 into Y; move buyyes into Set Markers by box

    Discriminant Analysis-9

    • Since Box’s M test was significant, one can ask SPSS to run DA using ‘separate covariances’ option (under Classify) and compare the results

    • From the 1st analysis, we see that ‘age’ was not important; one can redo the analysis without ‘age’ and compare the results

    Discriminant Analysis-10

    • Validate the model: leave-one-out classification

      • Repeat the analysis, click on Classify>click leave-one-out classification; Click continue

    • Example 2: predict smoking and drinking habits

      • Analyze>Classify>Discriminant

        • Move smkdrnk into Grouping Variable box; move age, attend, black, class, educ, sex and white into IV list

        • Click Statistics>Select Fisher’s and Box M; Continue

        • Click Classify>Summary table, Combine-groups; Territorial map; Continue

        • Click OK
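
The leave-one-out validation mentioned on this slide can be sketched as a loop: each case is held out, the classifier is rebuilt from the remaining cases, and the held-out case is classified. As a simplifying assumption, a nearest-centroid rule stands in for the full discriminant analysis, and the data are hypothetical.

```python
import math


def centroid(rows):
    """Mean of each variable over a list of cases."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows))


def leave_one_out_rate(cases, labels):
    """Share of cases classified correctly when each case is held out in turn."""
    hits = 0
    for i, (case, label) in enumerate(zip(cases, labels)):
        groups = {}
        for j, (other, other_label) in enumerate(zip(cases, labels)):
            if j != i:  # rebuild the group centroids without the held-out case
                groups.setdefault(other_label, []).append(other)
        centroids = {g: centroid(rows) for g, rows in groups.items()}
        predicted = min(centroids, key=lambda g: math.dist(case, centroids[g]))
        hits += predicted == label
    return hits / len(cases)


# Hypothetical data: two well-separated groups measured on two variables.
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [1, 1, 1, 2, 2, 2]
rate = leave_one_out_rate(cases, labels)
```

Because each case is classified by a rule that never saw it, the resulting rate is usually a less optimistic estimate than the ordinary classification table.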

    Cluster Analysis-1

    • Cluster analysis is an exploratory data analysis technique designed to reveal groups

    • How?

      • By distance: observations that are close together should be in the same group, and observations in different groups should be far apart

    • Applications:

      • Plants and animals into ecological groups

      • Companies for product usage

    Cluster Analysis-2

    • Two types of method

      • Hierarchical: requires observations to remain together once they have been joined in a cluster

        • Complete linkage

        • Between groups average linkage

        • Ward’s method

      • Nonhierarchical: no such requirement

        • The researcher must pick the number of clusters to run (K-means algorithm)
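
A minimal sketch of the K-means idea, assuming Euclidean distance and starting centers supplied by the analyst. The points and starting centers are hypothetical, and this is not the exact procedure SPSS runs.

```python
import math


def k_means(points, centers, n_iter=10):
    """Alternate between assigning points to the nearest center and updating centers."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # A center with no members keeps its previous position.
        centers = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centers, final_clusters = k_means(points, centers=[(0, 0), (10, 10)])
```

Unlike the hierarchical methods, a point here can move to a different cluster on any pass.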

    Cluster Analysis-3

    • Recommendations:

      • For relatively small samples (fewer than a few hundred cases), use hierarchical

      • For large samples, use K-means

    • Example 1: evaluating 20 types of beer

      • File>Open>Data; select beer.por

      • Analyze>Descriptive Stat>Descriptive

        • Move cost, calories, sodium, and alcohol into variable list

        • Click at Save standardized values; OK
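
Saving standardized values puts every variable on the same scale before distances are computed. A minimal sketch of that transformation, assuming the sample standard deviation (n-1 denominator); the values are hypothetical, not the beer data.

```python
import statistics


def z_scores(values):
    """Standardize a variable: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(value - mean) / sd for value in values]


# Hypothetical variable with three cases.
standardized = z_scores([1, 2, 3])
```

Without this step, a variable measured in large units (such as calories) would dominate the distance between cases.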

    Cluster Analysis-4

    • Analyze>Classify>Hierarchical Cluster

      • Move cost, calories, sodium, and alcohol into Variable list box

      • Move Beer into label cases by box

      • Click Plots>click Dendrogram; click none in Icicle area; continue

      • Click Method>select Z-score from the standardize drop-down list; Continue

      • Click Save>Click range of solutions; range 2-5 clusters; continue

      • OK

    Cluster Analysis-5

    • Additional analysis

      • Look at the last 4 columns of the data (clu5_1 to clu2_1); they contain the cluster memberships for each solution between 5 and 2 clusters

      • Analyze>Descriptive>Frequencies

        • Move clu2_1 to clu5_1 to Variable box

        • OK

      • Obtain mean profile for clusters

        • Graph>Line>summary of separate variables

          • Click Define>move zcost, zcalorie, zsodium, and zalcohol to Lines Rep. Box

          • Click clu4_1 and move it to Category box

    Path Analysis-1

    • Path analysis is a technique based on regression to establish causal relationships

      • Start with a diagram with causal flow

      • Direct causal effects model (regression)

        • The direct causal effect of an IV on a DV is the regression coefficient (the number of units the DV changes for a 1-unit change in the IV)

      • Building on the DCEM

    • Two forms of causal model:

      • Diagram

      • Equation (structural equation)

    Path Analysis-2

    • An example of a causal model

      • Structural equation:

        • Z4=p41Z1+p42Z2+p43Z3+e4

          • P: path coefficient

          • e: disturbance

          • Z4, endogenous variable

          • Z1: exogenous variable

      • Path diagram

        • Indirect effect is the multiplication of the path coefficients
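
As a small illustration of the last point, the indirect effect along one path is the product of its path coefficients, and a total effect adds the direct effect to every indirect path. The coefficients below are hypothetical, not estimates from the presentation.

```python
import math


def indirect_effect(path_coefficients):
    """Product of the path coefficients along one indirect path."""
    return math.prod(path_coefficients)


def total_effect(direct, indirect_paths):
    """Direct effect plus the sum of all indirect effects."""
    return direct + sum(indirect_effect(path) for path in indirect_paths)


# Hypothetical model: Z1 -> Z4 directly (0.3) and via Z3 (Z1 -> Z3 = 0.5, Z3 -> Z4 = 0.4).
via_z3 = indirect_effect([0.5, 0.4])
total = total_effect(0.3, [[0.5, 0.4]])
```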

    Path Analysis-3

    • Steps in path analysis:

      • Create a path diagram

      • Use regression to estimate structural equation coefficients

      • Assess the model:

        • Compare the observed and reproduced correlations (reproduced correlations will be computed by hand)

    Path Analysis-4

    • Research questions:

      • Is our model, which describes the causal effects among the variables ‘region of the world’, ‘status as a developing nation’, ‘number of doctors’, and ‘male life expectancy’, consistent with the observed correlations among these variables?

      • If our model is consistent, what are the estimated direct, indirect, and total causal effects among the variables?

    Path Analysis-5

    • Legal path:

      • No path may pass through the same variable more than once

      • No path may go backward on an arrow after going forward on another arrow

      • No path may include more than one double headed curve arrow

    Path Analysis-6

    • Component labels:

      • D: direct effect (just one straight arrow)

      • I: indirect effect (more than one straight arrow)

      • S: spurious effect (there is a backward arrow)

      • U: effect is uncertain (starts with a double-headed curved arrow)

    Path Analysis-7

    • If the model is in question (some of the reproduced correlations differ from the observed correlations by more than .05)

      • Test all missing paths (run additional regressions and check the significance of the coefficients)

      • Reduce the existing paths if their coefficients are not significant
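
The .05 rule above can be sketched as a simple comparison of observed and reproduced correlations. The variable pairs and correlation values are hypothetical.

```python
def flag_discrepancies(observed, reproduced, tolerance=0.05):
    """Return the variable pairs whose reproduced correlation is off by more than the tolerance."""
    return [pair for pair in observed
            if abs(observed[pair] - reproduced[pair]) > tolerance]


# Hypothetical correlations for two pairs of variables.
observed = {("z1", "z4"): 0.60, ("z2", "z4"): 0.30}
reproduced = {("z1", "z4"): 0.58, ("z2", "z4"): 0.42}
questionable = flag_discrepancies(observed, reproduced)
```

Any pair that is flagged suggests a missing or misspecified path between those two variables.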

    Logistic regression - Motivations

    • When the dependent variable is dichotomous, regular regression is not appropriate

      • We want to predict probability

      • OLS regression predictions could be any numbers, not just numbers between 0 and 1

      • When dealing with proportions, the variance depends on the mean, so the equal variance assumption of OLS is violated

    Motivations continue


    • Fit an S-shaped curve to the data

    What is Logistic Regression?

    • Regressions of the form ln(Odds) = a + b1X1 + … + bkXk

    • ln(Odds) is called a logit

    • Odds = Prob/(1-Prob)
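
The definitions above translate directly into code. This sketch also includes the inverse transformation, which maps any logit value back to a probability between 0 and 1 and produces the S-shaped curve.

```python
import math


def odds(p):
    """Odds = Prob / (1 - Prob)."""
    return p / (1 - p)


def logit(p):
    """The logit is the natural log of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a logit back to a probability; always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))
```

For example, a probability of .5 corresponds to odds of 1 and a logit of 0.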

    Application of Logistic Regression

    • When to use it?

      • When the dependent variable is dichotomous

    • Objectives:

      • Run a logistic regression

      • Apply a stepwise logistic regression

      • Use the ROC (receiver operating characteristic) curve to assess the model

    Assumptions of logistic regression

    • The indep. variables are interval or dichotomous

    • All relevant predictors are included, no irrelevant predictors are included, and the relationship between the predictors and the logit is linear

    • The expected value of the error term is zero

    • There is no autocorrelation

    Assumptions of logistic regression – Cont.

    • There is no correlation between the error and the independent variables

    • There is an absence of perfect multicollinearity between the independent variables

    • Need to have a large sample (rule of thumb: n should be more than 30 times the number of parameters)

    Note on assumptions

    • No need for normality of errors

    • No need for equal variance



    • Objective: to predict low birth weight babies

    • Variables:

      • Low: 1: <=2500 grams, 0: >2500 grams

      • LWT: weight at last menstrual cycle

      • Age

      • Smoke

      • PTL: # of premature deliveries

      • HT: History of Hypertension

      • UI: uterine irritability

      • FTV: # of physician visits during first trimester

      • Race: 1=white, 2=black, 3=other



    • File > Open > Data > Select SPSS Portable type > select Birthwt (in Regression)

    • Analyze > Regression > Binary Logistic

      • Move ‘low’ to the Dependent list box

      • Move ‘age’, ‘ftv’, ‘ht’, ‘ptl’, ‘race’, ‘smoke’, and ‘ui’ into the Covariate list box

    Example (cont.)

    • Click the Categorical button

      • Place ‘race’ in the Categorical Covariates box

    • Click Continue, click Save

      • Click the Probability and Group Membership check boxes

    • Click Continue and then the Options button

    Example (cont.)

    • Click on the Classification plots and Hosmer-Lemeshow goodness of fit checkboxes

    • Click Continue, then OK

    Logistic outputs

    • Initial summary output: info on dependent and categorical variables

    • Block 0: based on a model that includes only a constant – provides baseline info

    • Block 1: Method Enter – includes the model info

      • Chi-square tests if all the coeffs are 0 (similar to ‘F’ in regression)

    Logistic outputs (cont.)

    • The Model chi-square value is the difference between the initial and final –2LL (a small value of –2LL indicates a good fit; –2LL=0 indicates a perfect fit)

    • The Step and Block rows display the results of the last step and block (they are the same here because we are not using stepwise regression)

    Logistic outputs (cont.)

    • The goodness of fit statistics –2LL is 203.554

    • Cox & Snell R square – similar to R-square in OLS

    • Nagelkerke R square (preferred because it can reach 1)

    • Hosmer and Lemeshow test: tests the hypothesis that “there is no difference between expected and observed counts”, i.e. we prefer a non-significant result

    Logistic outputs (cont.)

    • Classification table: can our model predict accurately?

      • Overall accuracy is 73%

      • We do much better on higher birth weight

      • Does a poor job on lower birth weight

      • A significant model doesn’t mean having high predictability

    Interpretation of the coefficients

    • E.g. HT (hypertension)

      • B=1.736 – hypertension in the mother increases the log odds by 1.736

      • Exp(B)=5.831 - hypertension in the mother increases the odds of having a low birth weight baby by a factor of 5.831

      • What is the prob. change?

        • If the original odds are 1:100 (p=.0099), they change to 5.831:100 (p=.0551); if the original odds are 1:1 (p=.5), they change to 5.831:1 (p=.854)
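
The probability change depends on the starting point because Exp(B) multiplies the odds, not the probability. A short sketch of the conversion, using the factor 5.831 from this slide; the function names are illustrative.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


def shifted_probability(p, odds_ratio):
    """Probability after the odds are multiplied by Exp(B)."""
    return odds_to_prob(prob_to_odds(p) * odds_ratio)


# Starting odds of 1:100, multiplied by Exp(B) = 5.831.
p_after = shifted_probability(1 / 101, 5.831)
print(round(p_after, 4))  # 0.0551
```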

    Interpretation of the coefficients (cont.)

    • Categorical variable Race:

      • First an overall effect

      • Race(1) – white: the effect of being white is significant, acting to decrease the odds compared to those of ‘other’ by a factor of .4

      • The effect of being black is not significant compared with ‘other’

    Making prediction

    • Suppose a mother;

      • Age 20

      • Weighs 130 pounds

      • Smokes

      • No hypertension or premature labor

      • Has uterine irritability

      • White

      • Two visits to her doctor

    Making prediction (cont.)

    • P(event) = 1/(1+exp(-(a+b1X1+…+bkXk)))

    • P=.397

    • Predicted not to have a low birth weight baby because the prob. is less than .5
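
The prediction formula can be sketched as below. The slides do not list the fitted coefficients, so the function takes them as arguments; any intercept and coefficient values passed in are the reader's own assumptions.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """P(event) = 1 / (1 + exp(-(a + b1*X1 + ... + bk*Xk)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, cutoff=0.5):
    """Predict the event (1) when the probability reaches the cutoff, otherwise 0."""
    return 1 if probability >= cutoff else 0


# The probability reported on the slide falls below the default cutoff.
predicted_group = classify(0.397)
```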

    Checking classification

    • Need to study the characteristics of mispredicted cases

      • Transform>Compute> Pred_err=1 if…

      • Analyze>Compare Means (LWT vs Pred_err)

        • The mean LWT for mispredicted is much lower than the correctly predicted

    Residual Analysis

    • Analyze>Regression>Logistic>Click Save >Click Cook’s, Leverage, Unstandardized, Logit, and Standardized

    • Examining data

      • Cook’s and Leverage should be small (if a case has no influence on the regression result, the values would be 0)

      • Res_1 is the residual of probability (e.g. the 1st case has a predicted prob. of .29804 and an actual ‘low’ value of 0, so res_1=0-.29804=-.29804)

      • Zre_1 is the standardized residual of the probs

      • lre_1 is the residual in terms of logit
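
A sketch of the three residuals for a single case. The standardized and logit residual formulas assume the usual definitions (residual divided by sqrt(p(1-p)) and by p(1-p), respectively); check the software documentation before relying on them.

```python
import math


def residual(observed, p):
    """Residual of probability: observed outcome minus predicted probability."""
    return observed - p


def standardized_residual(observed, p):
    """Assumed definition: residual divided by sqrt(p * (1 - p))."""
    return (observed - p) / math.sqrt(p * (1 - p))


def logit_residual(observed, p):
    """Assumed definition: residual divided by p * (1 - p)."""
    return (observed - p) / (p * (1 - p))


# First case from the slide: predicted prob. .29804, actual 'low' value 0.
res_1 = residual(0, 0.29804)
```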

    ROC curve (Receiver Operating Characteristic)

    • Sensitivity: true positive rate

    • Specificity: true negative rate

    • Changing the cut off point (default .5) changes both the sensitivity and specificity

    • ROC can help us to select an ‘optimal’ cut off point

    • Graph>ROC Curve>move pre_1 to ‘Test Variable’, low to ‘State Variable’, type ‘1’ in the ‘Value of State Variable’, click ‘with diagonal reference line’ and ‘Coordinate points of the ROC Curve’

    ROC curve interpretation

    • Vertical axis: sensitivity (true positive rate)

    • Horizontal axis: false positive rate (1 - specificity)

    • Diagonal: reference

    • Gives the trade off between sensitivity and false positive rates

    • Pay attention to the area where the curve rises rapidly

    • The 1st column of ‘coordinate of the curve’ gives the cut off prob.
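
Each point on the ROC curve comes from one cut off. A minimal sketch, assuming both outcome classes are present in the data; the probabilities and outcomes are hypothetical.

```python
def sensitivity_specificity(probabilities, actual, cutoff=0.5):
    """True positive rate and true negative rate at a given cut off."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == 1 and y == 1)
    fn = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == 0 and y == 1)
    tn = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == 0 and y == 0)
    fp = sum(1 for y_hat, y in zip(predicted, actual) if y_hat == 1 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(probabilities, actual, cutoffs):
    """(false positive rate, sensitivity) pairs, one per cut off."""
    points = []
    for cutoff in cutoffs:
        sens, spec = sensitivity_specificity(probabilities, actual, cutoff)
        points.append((1 - spec, sens))
    return points


# Hypothetical predicted probabilities and actual outcomes.
probabilities = [0.9, 0.6, 0.4, 0.2]
actual = [1, 0, 1, 0]
points = roc_points(probabilities, actual, cutoffs=[0.5, 0.3])
```

Lowering the cut off from .5 to .3 in this toy data raises sensitivity without changing the false positive rate, which is the kind of trade off the curve makes visible.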

    Residual Analysis – Cont.

    • Examine the distribution of zre_1

      • Graph>Interactive>Histogram>drag zre_1 to X axis, click Histogram, click Normal Curve

        • Note: this plot need not show normality

    • Finding influential cases

      • Graph>Scatterplot>Define>Move id to X axis, coo_1 to Y axis

    • Multicollinearity

      • Use OLS regression to check (?)

    Multinomial Logistic Regression

    • The dependent variable is categorical with two or more categories

    • It is an extension of the logistic regression

    • The assumptions are those of logistic regression plus ‘the dependent variable has a multinomial distribution’



    • Objective: predict credit risk (3 categories) based on financial and demographic variables

    • Variables:

      • Age

      • Income

      • Gender

      • Marital (single, married, divsepwid)

      • Numkids: # of dependent children

    Example Cont.

    • Numcards: #of credit cards

    • Howpaid: how often paid (weekly, monthly)

    • Mortgage: have a mortgage (y, n)

    • Storecar: # of store credit cards

    • Loans: # of other loans

    • Risk: 1=bad risk-loss, 2=bad risk-profit, 3=good risk

    How does it work?

    • Let f(j) be the probability of being in outcome category j

      • f(1)=P(bad risk-loss)

      • f(2)=P(bad risk-profit)

      • f(3)=P(good risk)

      • g(1)=f(1)/f(3)

      • g(2)=f(2)/f(3)

      • g(3)=f(3)/f(3)=1

    How does it work? – Cont.

    • Fit the model:

      • ln(g(1))= A1+B11X1+…+B1kXk

      • ln(g(2))= A2+B21X1+…+B2kXk

      • ln(g(3))= ln(1)=0=A3+B31X1+…+B3kXk

    How does it work? – Cont.

    Example – Cont.

    • File > Open > Data > Select Risk > Open

    • Analyze > Regression > Multinomial Logistic

    • Move risk into dependent list box

    • Move marital and mortgage into the Factor(s) list box

    • Move income and numkids into the Covariate(s) list box

    • Click model button

      • Click cancel button

    Example (Cont.)

    • Click Statistics button

      • Check the Classification table check box

      • Click Continue

    • Click Save

      • The Multinomial Logistic regression in SPSS version 10 will only save model info in an XML (Extensible Markup Language) format

      • Click cancel

    • Click OK

    Multinomial output

    • Model Fit and Pseudo R-square, Likelihood ratio test are similar to logistic regression

    • Parameter estimates table is different

      • There are two sets of parameters

        • One for the probability ratio of (bad risk-loss)/(good risk)

        • Another set for the probability ratio of (bad risk-profit)/(good risk)

    Interpretation of coefficients

    • Income in the ‘bad loss’ section

      • It is significant

      • Exp(B)=.962: the expected probability ratio is decreased a little (by a factor of .962) for one unit increase of income

    How to predict?

    • F(1) – the chance in ‘bad loss’ group

    • F(2) – the chance in ‘bad profit’ group

    • F(3) – the chance in ‘good risk’ group

    • F(j)=g(j)/sum(g(i))

    • g(j)=exp(modelj)

    How to predict? (cont.)

    • Suppose an individual

      • Single, has a mortgage

      • No children

      • Income of 35,000 pounds

    • g(1)=.218

    • g(2)=.767

    • g(3)=1

    How to predict?

    • F(1)=.218/(.218+.767+1)=.110

    • F(2)=.386

    • F(3)=.504

    • The individual is classified as good risk
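
The calculation on this slide follows F(j)=g(j)/sum(g(i)). A short sketch using the g values given for this individual; the function and variable names are illustrative.

```python
def category_probabilities(g):
    """F(j) = g(j) / sum of all g(i)."""
    total = sum(g)
    return [value / total for value in g]


# g(1), g(2), g(3) for the individual on the previous slide.
g = [0.218, 0.767, 1.0]
probabilities = category_probabilities(g)

# Classify into the category with the largest probability (categories numbered from 1).
predicted = max(range(len(probabilities)), key=probabilities.__getitem__) + 1
print([round(p, 3) for p in probabilities])  # [0.11, 0.386, 0.504]
```

Category 3 (good risk) has the largest probability, which matches the classification stated above.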

    Multinomial Logistic Reg. With Interaction

    • Analyze>Regression>Multinomial Logistic>Click at Model, select custom>specify your model (all main effects and the interaction between Marital and Mortgage)

    • Interpret the results as usual

    Interaction effects in logistic Regression

    • It is similar to OLS regression:

      • Add interaction terms to the model as crossproducts

      • In SPSS, highlighting two variables (holding down the Ctrl key) and moving them into the variable box creates the interaction term
