Analyzing Patterns of Missing Data

While SPSS contains a rich set of procedures for analyzing patterns of missing data, they are not included in the set of tools licensed by the University. However, we can replicate much of the analysis with other SPSS procedures.

The first set of tasks in the missing data analysis involves the creation of diagnostic variables that support the analysis: first, a variable that counts the number of variables with missing data for each case; second, one new dichotomous variable for each original variable that indicates whether or not the original variable had a missing data value; and third, a single pattern variable for each case that summarizes the missing or valid status of values for all of the variables in the analysis.

Using the diagnostic variable that counts the missing values for each case, we can identify cases with large concentrations of missing data as candidates for elimination from the analysis. After we remove specific cases with large numbers of missing values, we run a frequency distribution for the remaining cases to see whether any variable has so many missing cases that it should be considered a candidate for exclusion. Next, we compute a frequency distribution for the pattern variable to identify patterns that occur often in the data, which would indicate a problematic missing data process. Next, using each valid/missing variable as a grouping variable, we examine whether or not the missing cases are statistically different from the valid cases on all of the other variables in the analysis. If the variable being compared is metric, we do a t-test for group differences; if it is nonmetric, we do a chi-square test of independence to detect group differences. Finally, we compute a correlation matrix of the valid/missing variables to detect concentrations of missing data across multiple variables.

1. Download the Data Set

Download the HATMISS data set from the course web page and save it in your C:\SW388R7 folder.

2. Tallying the Number of Missing Variables

One of the major information items we need for the missing data analysis is the number of variables that have missing data for each case in the sample. We will create a new variable, num_miss, that will contain the number of variables with missing values among the first ten in the data set, x1 through x10. We include only the first ten variables in this calculation to maintain consistency with the text. The SPSS function NMISS counts the number of its arguments that have missing values. We will use this function to calculate the value of num_miss for each case.
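For readers who want to check the logic outside SPSS, the per-case count can be sketched in plain Python. This is only an illustration of what the count represents, not the SPSS procedure itself: the data values are made up, `None` is assumed to mark a missing value, and the function name `num_miss` simply mirrors the variable name used above.

```python
# Variables included in the count: x1 through x10.
VARS = [f"x{i}" for i in range(1, 11)]

def num_miss(case, variables=VARS):
    """Count how many of the listed variables are missing (None) for one case."""
    return sum(1 for name in variables if case.get(name) is None)

# Two hypothetical cases (illustrative values, not taken from HATMISS).
cases = [
    {"x1": 4.1, "x2": 0.6, "x3": 6.9, "x4": 4.7, "x5": 2.4,
     "x6": 2.3, "x7": 5.2, "x8": 0, "x9": 32.0, "x10": 4.2},
    {"x1": None, "x2": 3.0, "x3": None, "x4": 6.6, "x5": 2.5,
     "x6": 4.0, "x7": 8.4, "x8": 1, "x9": None, "x10": 4.3},
]

# Store the count on each case, as the num_miss variable does in the data set.
for case in cases:
    case["num_miss"] = num_miss(case)
    print(case["num_miss"])
```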

Computing the Number Missing by Case

Specifying the Variables in the Function

3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data

To determine whether or not the pattern of missing data is random, we create a special diagnostic variable for each original variable that indicates whether that variable is missing or valid for each case in the data set. Each diagnostic variable is dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.

Since we may need to refer back to the original variables in the course of the missing data analysis, I recommend a naming convention for the diagnostic variables that makes it easy to identify the original variable. If the original variable name has fewer than eight characters, an underscore is appended to the end of the original variable name, e.g. the diagnostic variable for race would be race_. If the original variable name has eight characters, the last character is replaced with an underscore, e.g. the diagnostic variable name for response would be respons_. If replacing the last character with an underscore duplicates the name already assigned to the diagnostic variable for another eight-character variable name, we drop the last two characters from the original name and append an underscore followed by a sequence letter or digit, e.g. the diagnostic variable name for response would be respon_1 if we had already used the name respons_ for a diagnostic variable.

When we assign variable labels to the diagnostic variables, we can add a keyword to the original variable label to designate it as a missing/valid diagnostic variable, e.g. the variable label for the diagnostic variable that had an original variable label of Grade Level could be Grade Level (Valid/Missing).

We will demonstrate the process of creating dichotomous Valid/Missing variables for diagnosing missing data using the variables in the HATMISS.SAV data set. If the copy of HATMISS.SAV that you are working with does not have variable labels and value labels, do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is to position the mouse over a variable name in the data editor. If a variable label appears in a yellow tips box, a variable label has been added for that variable.
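The naming convention for diagnostic variables can be written out as a small Python function to make the three rules explicit. This is a sketch under assumptions: the function name `diagnostic_name` and the `used` argument are hypothetical, a digit is used as the sequence character, and cases the convention does not cover (names longer than eight characters, or a duplicate for a shorter name) raise an error rather than guess.

```python
def diagnostic_name(original, used=()):
    """Return a diagnostic variable name for `original` under the convention above.

    `used` holds diagnostic names that have already been assigned.
    """
    if len(original) > 8:
        raise ValueError("convention only covers names of up to eight characters")
    if len(original) < 8:
        # Rule 1: append an underscore, e.g. race -> race_
        candidate = original + "_"
        if candidate in used:
            raise ValueError("duplicate not covered by the convention")
        return candidate
    # Rule 2: replace the last character with an underscore, e.g. response -> respons_
    candidate = original[:7] + "_"
    if candidate not in used:
        return candidate
    # Rule 3: drop the last two characters, then add an underscore and a sequence digit
    for digit in "123456789":
        candidate = original[:6] + "_" + digit
        if candidate not in used:
            return candidate
    raise ValueError("no unused sequence digit left")

print(diagnostic_name("race"))
print(diagnostic_name("response"))
print(diagnostic_name("response", used={"respons_"}))
```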

Recoding Diagnostic Variables for Missing Data

Opening the Dialog for Old and New Values

Add the Value for Missing Data

Add the Value for Valid Data

Completing the Values Dialog Box

Adding Diagnostic Variables for the Remaining Variables

Adding Value Labels to the Diagnostic Variables

Adding the Value Label for Missing

Add the Value Label for Valid

Apply the Value Labels

Displaying the Value Labels for the Variables

The Diagnostic Variables
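The result of the recoding steps can be sketched in Python as a check on the intended coding (1 for Valid, 0 for Missing). The data values are made up, `None` is assumed to stand for a missing value, and the helper name `add_valid_missing_flags` is hypothetical; the diagnostic names follow the trailing-underscore convention (x1_ for x1).

```python
VARS = [f"x{i}" for i in range(1, 11)]

# Value labels attached to every diagnostic variable.
VALUE_LABELS = {0: "Missing", 1: "Valid"}

def add_valid_missing_flags(case, variables=VARS):
    """Add a flag named <variable>_ for each variable: 1 = Valid, 0 = Missing."""
    for name in variables:
        case[name + "_"] = 0 if case.get(name) is None else 1
    return case

# Hypothetical case with x1 and x9 missing (illustrative values only).
case = {"x1": None, "x2": 3.0, "x3": 6.9, "x4": 6.6, "x5": 2.5,
        "x6": 4.0, "x7": 8.4, "x8": 1, "x9": None, "x10": 4.3}
add_valid_missing_flags(case)

for name in VARS:
    flag = case[name + "_"]
    print(name + "_", flag, VALUE_LABELS[flag])
```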

4. Adding a Pattern Variable to the Data Set

Another indication of a problematic missing data process is the frequent occurrence of the same pattern of missing data among the variables. While patterns can be detected by sorting and scanning the data set, this task is made easier by creating a pattern variable. The pattern variable is a string variable containing one character for each variable in the analysis. Each character in the pattern variable is set either to a character indicating missing data or to a character indicating valid data.

To make the pattern more visually intuitive, the characters selected should have the same width when printed. If we do not use same-width characters, we cannot scan down the values to compare them, because the column alignment of the characters is not the same from one value to the next. We will use an X for missing data and a tilde, ~, for valid data, because both are full-width characters.

To create the pattern variable, we first create a one-character string variable for each of the original variables. Then, we use the SPSS 'CONCAT' function to join the string variables into a single variable.

Recode the Original Variables into String Variables

Opening the Dialog for Old and New Values

Add the Value for Missing Data

Add the Value for Valid Data

Completing the Values Dialog Box

Adding String Variables for the Other Original Variables

The String Variables

Create the Variable Containing the Concatenated Data

Enter the Formula for the Concatenated Variable

The Missing Data Pattern Variable
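As a cross-check on what the pattern variable should contain, the two steps (one character per variable, then concatenation) can be sketched in Python. The function name `missing_pattern` is hypothetical, `None` is assumed to mark a missing value, and the data are made up.

```python
VARS = [f"x{i}" for i in range(1, 11)]

def missing_pattern(case, variables=VARS, missing_char="X", valid_char="~"):
    """Build one character per variable, then join them into a single pattern string."""
    chars = [missing_char if case.get(name) is None else valid_char
             for name in variables]
    return "".join(chars)

# Hypothetical case with x1, x3, and x9 missing (illustrative values only).
case = {"x1": None, "x2": 3.0, "x3": None, "x4": 6.6, "x5": 2.5,
        "x6": 4.0, "x7": 8.4, "x8": 1, "x9": None, "x10": 4.3}

pattern = missing_pattern(case)
print(pattern)
```

Because every pattern has the same length and each position always refers to the same variable, two cases share a missing data pattern exactly when their pattern strings are equal.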

5. Removing Cases with a Large Proportion of Missing Variables

To identify the cases that we should consider removing, we will sort the data set in descending order by the number of missing variables. The candidates for elimination will appear at the top of the data set. Once we have located the cases that we want to eliminate, we specify a filter condition to exclude those cases from further analysis. The cases are not deleted from the data set, so we can include them in later analysis should we desire to do so.

Sorting the Cases

The Cases Sorted by Number Missing

Excluding the Cases

Specifying the If Condition

Specify Filtering for Unselected Cases

The Data Set with Filtered Cases
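The sort-then-filter logic can be sketched in Python. The cutoff of 50% or more missing among ten variables is the one used in this exercise; the case identifiers, the counts, and the `selected` flag name are hypothetical. As in SPSS filtering, excluded cases are only flagged, not deleted.

```python
N_VARS = 10  # x1 through x10

# Hypothetical cases with a precomputed num_miss count (illustrative values only).
cases = [
    {"id": 1, "num_miss": 0},
    {"id": 2, "num_miss": 7},
    {"id": 3, "num_miss": 2},
    {"id": 4, "num_miss": 5},
]

# Sort in descending order so candidates for elimination appear first.
ranked = sorted(cases, key=lambda c: c["num_miss"], reverse=True)

# Flag rather than delete: 1 = keep in the analysis, 0 = filtered out.
for c in cases:
    c["selected"] = 0 if c["num_miss"] / N_VARS >= 0.5 else 1

# Later analyses use only the selected cases; the full list is still available.
active = [c for c in cases if c["selected"] == 1]
print([c["id"] for c in ranked])
print([c["id"] for c in active])
```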

6. Summary Statistics for the Unfiltered Cases

Filtering cases with 50% or more missing data removed six cases from the analysis, reducing our effective sample size to 64 cases. We next look at a frequency distribution for each variable to see if any variables have such a high proportion of missing data that they should be considered candidates for removal from the analysis. We can see the distribution of missing data on each of our variables by using the Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page 56 of the text. We use the Frequencies command instead of the Descriptives command because Frequencies provides a count of the remaining missing cases for each variable.

Requesting the Frequency Distributions

Requesting Specific Statistics

The Frequencies Output

Changing the Orientation of the Table

The Transposed Frequencies Table
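The key numbers in this table are the valid and missing counts per variable. A minimal Python sketch of that tally follows; the function name `missing_summary` is hypothetical, `None` is assumed to mark a missing value, and the four cases are made up rather than taken from HATMISS.

```python
def missing_summary(cases, variables):
    """Return valid count, missing count, and percent missing for each variable."""
    n = len(cases)
    summary = {}
    for name in variables:
        n_missing = sum(1 for c in cases if c.get(name) is None)
        summary[name] = {
            "valid": n - n_missing,
            "missing": n_missing,
            "pct_missing": 100.0 * n_missing / n if n else 0.0,
        }
    return summary

# Hypothetical remaining cases after filtering (illustrative values only).
cases = [
    {"x1": 4.1, "x2": 0.6},
    {"x1": None, "x2": 3.0},
    {"x1": 3.4, "x2": None},
    {"x1": 2.7, "x2": None},
]

summary = missing_summary(cases, ["x1", "x2"])
for name, row in summary.items():
    print(name, row)
```

A variable whose percent missing stands well above the others is the kind of candidate for exclusion described above.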

7. Tabulating Missing Data Patterns

In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern variable that contained a single string of ten characters representing valid or missing data for the first ten variables in the data set. To create Table 2.4 on page 58 of the text, we run a frequency distribution on the pattern variable. This frequency distribution will tell us if there are one or two patterns of missing data that occur with sufficient frequency to require further investigation.

Request a Frequency Distribution for the Pattern Variable

The Frequency of Different Patterns
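Tabulating the pattern variable amounts to counting identical strings. A short Python sketch, using made-up pattern values (X for missing, ~ for valid) rather than the HATMISS results:

```python
from collections import Counter

# Hypothetical pattern values for six cases (illustrative only).
patterns = [
    "~~~~~~~~~~",
    "X~~~~~~~~~",
    "~~~~~~~~~~",
    "X~~~~~~~~~",
    "~~X~~~~~~~",
    "~~~~~~~~~~",
]

freq = Counter(patterns)

# List patterns from most to least frequent, with their share of all cases.
for pattern, count in freq.most_common():
    print(pattern, count, f"{100.0 * count / len(patterns):.1f}%")
```

Patterns other than the all-valid one that account for a noticeable share of cases are the ones to investigate further.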

8. T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data

In previous exercises, we created dichotomous grouping variables for the variables X1 through X10, where the grouping variable was assigned a 1 if the data was valid and a 0 if the data was missing. We will use these grouping variables to determine whether the valid and missing groups differ in their relationship to other variables in the data set. If the missing and valid groups are statistically equivalent on other variables, then the missing cases can be characterized as random, and of no consequence to our analysis. If the missing group shows a statistically significant relationship to the other variable, it suggests that there is a missing data process that requires further understanding.

The statistical tests that we use in this analysis are chi-square tests of independence, if the variable to be tested is nonmetric, or t-tests for two independent samples, if the variable to be tested is metric. The authors use the separate variance output for all t-tests instead of examining individual tests of homogeneity of variance. We will follow this practice.

When this analysis is conducted, there are usually a large number of statistical relationships tested. We know that using an alpha level of 0.05 in these tests implies that, even when there is no real difference between the groups, we can expect about one out of every twenty tests to be statistically significant by chance. With a large number of tests, we will get some statistically significant relationships even when there is no serious problem with our data. We are not looking at the individual test results so much as we are concerned with an overall pattern of relationships.

NOTE. I cannot reconcile the findings on these tests to the discussion of findings on page 58 of the text. The statistical results are consistent with Table 2.5 on page 59, while the text discussion appears to be a carryover from the fourth edition of the text, which does not contain the same statistical results as the fifth edition.
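The point about many tests can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that each of ten grouping variables is tested against the nine other variables (90 tests), that no real differences exist, and that the tests are independent; the last assumption is a simplification that will not hold exactly for tests run on the same cases.

```python
ALPHA = 0.05

# Assumption for illustration: 10 grouping variables x 9 other variables each.
n_tests = 10 * 9

# Expected number of tests that come out significant purely by chance.
expected_false_positives = n_tests * ALPHA

# Probability that at least one test is significant by chance,
# assuming the tests are independent (a simplification).
p_at_least_one = 1 - (1 - ALPHA) ** n_tests

print(expected_false_positives)
print(round(p_at_least_one, 3))
```

Under these assumptions a handful of significant results is expected even when the missing data are random, which is why the overall pattern matters more than any single test.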

The Statistical Tests to Be Computed

We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore differences among the next nine variables in the data set, 'Price Level' through 'Satisfaction Level' (X2 through X10). In each statistical test, we are testing the null hypothesis of no relationship with the grouping variable, 'Delivery Speed (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who did not answer the question on Delivery Speed had a different pattern of responses than did persons who did provide Delivery Speed.

The variable 'Firm Size' (X8) is a nonmetric variable, and we will do a chi-square test of independence for this variable. The variables 'Price Level' (X2), 'Price Flexibility' (X3), 'Manufacturer Image' (X4), 'Service' (X5), 'Salesforce Image' (X6), 'Product Quality' (X7), 'Usage Level' (X9), and 'Satisfaction Level' (X10) are all metric, and we will do t-tests for these variables.
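For the metric variables, the separate variance t-test (often called Welch's t-test) can be sketched in Python to show what is being computed. The helper name `welch_t` and the data are hypothetical; the sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom, and leaves the p-value to SPSS or a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Separate variance t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb                        # squared standard error
    t = (mean(group_a) - mean(group_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical cases: x1_ is the valid/missing flag for x1, x2 is the tested variable.
cases = [
    {"x1_": 1, "x2": 1.0}, {"x1_": 1, "x2": 2.0}, {"x1_": 1, "x2": 3.0},
    {"x1_": 1, "x2": 4.0}, {"x1_": 1, "x2": 5.0}, {"x1_": 1, "x2": None},
    {"x1_": 0, "x2": 2.0}, {"x1_": 0, "x2": 4.0}, {"x1_": 0, "x2": 6.0},
    {"x1_": 0, "x2": 8.0}, {"x1_": 0, "x2": 10.0},
]

# Split x2 by the grouping variable, dropping cases where x2 itself is missing.
valid_group = [c["x2"] for c in cases if c["x1_"] == 1 and c["x2"] is not None]
missing_group = [c["x2"] for c in cases if c["x1_"] == 0 and c["x2"] is not None]

t, df = welch_t(valid_group, missing_group)
print(round(t, 3), round(df, 3))
```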

The Chi-square Test of Independence

Requesting the Chi-square Test

Specifying Cell Contents
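The Pearson chi-square statistic for a cross-tabulation of the valid/missing flag by a nonmetric variable can be sketched in Python. The counts below are made up, the function name `chi_square` is hypothetical, and no continuity correction is applied. The p-value shortcut shown is valid only for one degree of freedom, as in a 2 x 2 table.

```python
from math import erfc, sqrt

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical 2 x 2 counts: rows = flag (Missing, Valid), columns = two categories.
table = [[10, 20],
         [20, 10]]

stat, df = chi_square(table)

# With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
p_value = erfc(sqrt(stat / 2)) if df == 1 else None
print(round(stat, 3), df, p_value)
```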
