Analyzing Patterns of Missing Data

While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.

The first set of tasks in the missing data analysis involves creating diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether the original variable has a valid or a missing value; and third, a single pattern variable for each case that summarizes the missing or valid status of all of the variables in the analysis.

Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis. After we remove specific cases with large numbers of missing variables, we do a frequency distribution for the remaining cases to see if any variables have so many missing cases that the variable should be considered a candidate for exclusion.

Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, indicating a problematic missing data process.

Next, using each valid/missing variable in turn as a grouping variable, we examine whether the cases with missing data are statistically different from the cases with valid data on all of the other variables in the analysis. If the variable being compared is metric, we do a t-test for group differences; if it is non-metric, we do a chi-square test of independence to detect group differences.

Finally, we do a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.

Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.

One of the major information items we need for the missing data analysis is the number of variables that have missing data for each case in the sample.

We will create a new variable, num_miss, that contains the number of variables with missing data among the first ten in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text.

The SPSS function NMISS counts the number of variables that have missing values. We will use this function to calculate the value of num_miss for each case.
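
The counting logic behind num_miss can be sketched outside SPSS as well. The Python below is only an illustration under stated assumptions: the three sample cases are made up, None stands in for an SPSS missing value, and the helper names count_missing and make_case are hypothetical. In SPSS syntax the same computation is usually written along the lines of COMPUTE num_miss = NMISS(x1 TO x10).

```python
# Illustrative sketch only: made-up cases, with None standing in for a missing value.
VARIABLES = [f"x{i}" for i in range(1, 11)]  # x1 through x10


def count_missing(case, variables=VARIABLES):
    """Return how many of the named variables are missing (None) for one case."""
    return sum(1 for name in variables if case.get(name) is None)


def make_case(case_id, values):
    """Build a case dict from an id and a list of ten values for x1..x10."""
    return {"id": case_id, **dict(zip(VARIABLES, values))}


cases = [
    make_case(1, [4.1, 0.6, 6.9, 4.7, 2.4, 2.3, 5.2, 0, 32.0, 4.2]),
    make_case(2, [None, 3.0, None, 6.6, 2.5, 4.0, 8.4, 1, 43.0, 4.3]),
    make_case(3, [None, None, None, None, None, 2.8, None, 0, None, None]),
]

# Add the diagnostic variable to every case.
for case in cases:
    case["num_miss"] = count_missing(case)
    print(case["id"], case["num_miss"])
```

Note that a legitimate value of 0 (as in x8 above) is counted as valid; only None is treated as missing.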

To determine whether or not the pattern of missing data is random, we create a diagnostic variable for each original variable that indicates whether that variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.
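
As a rough illustration of the 1 = Valid, 0 = Missing coding, the following Python sketch builds one flag per variable for a single made-up case. The function name valid_missing_flag, the sample values, and the use of None for a missing value are assumptions for the example, not part of the SPSS procedure.

```python
def valid_missing_flag(value):
    """Return 1 if the value is valid, 0 if it is missing (None)."""
    return 0 if value is None else 1


# One made-up case; None stands in for a missing value.
case = {"x1": 4.1, "x2": None, "x3": 0.0}

# One dichotomous diagnostic variable per original variable,
# named here by appending an underscore to the original name.
flags = {name + "_": valid_missing_flag(value) for name, value in case.items()}
print(flags)
```

A value of 0.0 is still a valid response, so the check is against None rather than against "falsy" values.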

Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable. If the original variable name has fewer than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_. If the original variable name has eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_. If replacing the last character with an underscore duplicates a name already assigned to another diagnostic variable, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_ for a diagnostic variable.
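
The naming rules above can be written down as a small function. This is a sketch, assuming variable names of at most eight characters and using sequence digits 1 through 9 only; the function name diagnostic_name and the set of already-used names are hypothetical.

```python
def diagnostic_name(original, used):
    """Derive a diagnostic variable name following the convention above.

    `used` is a set of diagnostic names already assigned; the chosen
    name is added to it. Assumes `original` has at most eight characters.
    """
    if len(original) < 8:
        # Short name: append an underscore.
        candidate = original + "_"
    else:
        # Eight-character name: replace the last character with an underscore.
        candidate = original[:7] + "_"
        if candidate in used:
            # Collision: drop the last two characters, append "_" and a digit.
            for digit in "123456789":
                candidate = original[:6] + "_" + digit
                if candidate not in used:
                    break
            else:
                raise ValueError(f"no free diagnostic name for {original!r}")
    used.add(candidate)
    return candidate


used_names = {"respons_"}  # pretend respons_ was already taken
print(diagnostic_name("race", used_names))      # race_
print(diagnostic_name("response", used_names))  # respon_1
```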

When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a missing/valid diagnostic variable, e.g. the variable label for the diagnostic variable that had an original variable label of Grade Level could be Grade Level (Valid/Missing).

We will demonstrate the process of creating dichotomous Valid/Missing variables for diagnosing missing data using the variables in the HATMISS.SAV data set. If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tips box, a variable label has been added for that variable.

Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, this task is facilitated by the creation of a pattern variable. The pattern variable is a string variable containing one character for each variable in the data set. Each character in the pattern variable is set to a character indicating missing data or a character indicating valid data. To make the pattern more visually intuitive, the characters selected should have the same width when printed. If we do not use same width characters, we cannot scan down values to compare them because the column alignment of the characters is not the same from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full width characters.

To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS 'CONCAT' function to concatenate the string variables into a single variable.
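
The same idea, one character per variable joined into a single string, can be sketched in Python. The sample case is made up, None stands in for a missing value, and the function name missing_pattern is hypothetical; the X and ~ characters follow the choice described above.

```python
VARIABLES = [f"x{i}" for i in range(1, 11)]  # x1 through x10


def missing_pattern(case, variables=VARIABLES, missing="X", valid="~"):
    """Return a string with one character per variable: X = missing, ~ = valid."""
    return "".join(
        missing if case.get(name) is None else valid for name in variables
    )


# Made-up case with x1 and x3 missing.
case = dict(zip(VARIABLES, [None, 3.0, None, 6.6, 2.5, 4.0, 8.4, 1, 43.0, 4.3]))
pattern = missing_pattern(case)
print(pattern)  # X~X~~~~~~~
```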

To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set.

Once we have located the cases that we want to eliminate, we specify a filter condition to eliminate the cases from further analysis. The cases are not deleted from the data set, so we can include them in later analysis should we desire to do so.
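
A minimal sketch of the sort-then-filter step, assuming each case already carries a num_miss count over ten variables. The four cases are made up, and the 50% cutoff and the flag name keep are assumptions for the example. As in SPSS filtering, excluded cases are only flagged, not deleted.

```python
N_VARIABLES = 10
THRESHOLD = 0.5  # assumed cutoff: exclude cases with 50% or more missing

# Made-up cases that already have a num_miss count.
cases = [
    {"id": 1, "num_miss": 0},
    {"id": 2, "num_miss": 7},
    {"id": 3, "num_miss": 2},
    {"id": 4, "num_miss": 5},
]

# Sort descending so the candidates for elimination appear first.
ranked = sorted(cases, key=lambda c: c["num_miss"], reverse=True)

# Flag rather than delete, so excluded cases can be brought back later.
for case in cases:
    case["keep"] = case["num_miss"] / N_VARIABLES < THRESHOLD

retained = [c for c in cases if c["keep"]]
print([c["id"] for c in ranked])
print([c["id"] for c in retained])
```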

Filtering cases with 50% or more missing data removed six cases from the data set, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see if any variables have such a high proportion of missing data that they should be considered candidates for removal from the analysis.

We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We will use a Frequencies command instead of a Descriptives command, because the Frequencies command will provide a count of the remaining missing cases for each variable.
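
The per-variable counts that the Frequencies output reports (valid, missing, percent missing) can be approximated with a short Python sketch. The data are made up, None stands in for a missing value, and the function name missing_summary is hypothetical.

```python
def missing_summary(cases, variables):
    """Return valid count, missing count and percent missing for each variable."""
    summary = {}
    n_cases = len(cases)
    for name in variables:
        n_missing = sum(1 for c in cases if c.get(name) is None)
        summary[name] = {
            "valid": n_cases - n_missing,
            "missing": n_missing,
            "pct_missing": 100.0 * n_missing / n_cases,
        }
    return summary


# Four made-up cases with two variables.
cases = [
    {"x1": 4.1, "x2": 0.6},
    {"x1": None, "x2": 3.0},
    {"x1": 2.0, "x2": None},
    {"x1": 3.5, "x2": None},
]
summary = missing_summary(cases, ["x1", "x2"])
for name, row in summary.items():
    print(name, row)
```

A variable with a high pct_missing would be the kind of candidate for removal that this step is meant to surface.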

In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contained a single string of ten characters representing valid or missing data for the first ten variables in the data set. To create Table 2.4 on page 58, we run a frequency distribution on the pattern variable. This frequency distribution will tell us whether one or two patterns of missing data occur with sufficient frequency to require further investigation.
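
A frequency distribution of pattern strings is a simple tally. The sketch below uses Python's collections.Counter on six made-up pattern values; it illustrates the idea only and does not reproduce the table in the text.

```python
from collections import Counter

# Made-up pattern values: X = missing, ~ = valid, ten variables per case.
patterns = [
    "~" * 10,
    "X" + "~" * 9,
    "~" * 10,
    "X~X" + "~" * 7,
    "X" + "~" * 9,
    "~" * 10,
]

counts = Counter(patterns)

# List patterns from most to least frequent.
for pattern, n in counts.most_common():
    print(pattern, n)
```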

In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data were valid and a 0 if the data were missing. We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set. If the missing and valid groups do not differ significantly on the other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to another variable, it suggests that there is a missing data process that requires further understanding.

The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity. We will follow this practice.
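
For readers who want to see what the two tests compute, here is a sketch of the test statistics in plain Python: the separate-variance (Welch) t statistic with its approximate degrees of freedom, and the Pearson chi-square statistic for a table of counts. The function names welch_t and chi_square and the sample numbers are assumptions for the example. The sketch stops at the statistic and degrees of freedom; SPSS additionally reports the p-value.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t, df) for the separate-variance two-sample t-test."""
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error, group a
    vb = variance(group_b) / nb  # squared standard error, group b
    t = (mean(group_a) - mean(group_b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def chi_square(table):
    """Return (statistic, df) for a Pearson chi-square test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Made-up metric values for the valid group and the missing group.
t_stat, t_df = welch_t([4.1, 3.8, 5.0, 4.6], [2.9, 3.1, 3.6])

# Made-up 2 x 2 table: rows = valid/missing group, columns = firm size.
chi_stat, chi_df = chi_square([[20, 15], [5, 10]])
print(t_stat, t_df, chi_stat, chi_df)
```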

When this analysis is conducted, there are usually a large number of statistical relationships tested. Using an alpha level of 0.05 means that, when there is in fact no difference between the groups, we can expect to make an incorrect inference in about one out of every twenty tests. With a large number of tests, we will therefore get some statistically significant relationships even when there is no serious problem with our data. We are less concerned with the individual test results than with the overall pattern of relationships.
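
The arithmetic behind this caution can be made concrete. The sketch below assumes that the tests are independent and that every null hypothesis is true, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 9, 20, 90):
    print(n, expected_false_positives(n), round(prob_at_least_one(n), 3))
```

With twenty such tests, one spurious significant result is expected on average, which is why the overall pattern matters more than any single test.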

NOTE. I cannot reconcile the findings on these tests with the discussion of findings on page 58 of the text. The statistical results are consistent with Table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.

We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10). In each statistical test, we are testing the null hypothesis of no relationship associated with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than did persons who did provide Delivery Speed.

The variable 'Firm Size' (x8) is a nonmetric variable and we will do a chi-square test of independence for this variable.

The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) are all metric and we will do t-tests for these variables.

To continue our missing data analysis, we run a correlation matrix for the dichotomous grouping variables: 'Delivery Speed (Valid/Missing)', 'Price Level (Valid/Missing)', 'Price Flexibility (Valid/Missing)', 'Manufacturer Image (Valid/Missing)', 'Service (Valid/Missing)', 'Salesforce Image (Valid/Missing)', 'Product Quality (Valid/Missing)', 'Usage Level (Valid/Missing)', and 'Satisfaction Level (Valid/Missing)'.

We examine the pattern of correlations to see if there are large correlations among multiple pairs of variables that do not have an obvious explanation. An obvious explanation would be that subjects only answered these questions if their answer to another question was a particular value, e.g. only answer the question about job satisfaction if you are employed.
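
Correlating 0/1 valid/missing flags is an ordinary Pearson correlation applied to dichotomous variables. The Python sketch below uses made-up flags for six cases and three variables; the function name pearson and the data are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # undefined when a flag never varies
    return sxy / sqrt(sxx * syy)


# Made-up valid/missing flags (1 = valid, 0 = missing) for six cases.
flags = {
    "x1_": [1, 0, 1, 0, 1, 1],
    "x2_": [1, 0, 1, 0, 1, 1],  # missing on exactly the same cases as x1_
    "x3_": [1, 1, 0, 1, 1, 1],
}

matrix = {a: {b: pearson(flags[a], flags[b]) for b in flags} for a in flags}
for name, row in matrix.items():
    print(name, row)
```

Here x1_ and x2_ correlate perfectly because they are missing on the same cases, which is the kind of concentration the correlation matrix is meant to reveal. A variable with no missing data has a constant flag, so its correlations are undefined.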

If there are variables that show a strong pattern of systematic missing data without an obvious explanation, we should evaluate the impact that this pattern has on our research questions, and make our decision about including, eliminating, or substituting for these variables.
