Saeid Shahraz MD, PhD Student Heller School of Social Policy and Management

Improving the quality of data through imputing missing values (Part One: Introduction to types of missing data)

Presentation Transcript


  1. Improving the quality of data through imputing missing values (Part One: Introduction to types of missing data). Saeid Shahraz, MD, PhD Student, Heller School of Social Policy and Management

  2. Basic questions
     • What does ‘missing data’ mean?
     • What does ‘imputation’ mean?
     • What does ‘data improvement’ mean?
     • How much missingness is acceptable?
     • Is missing data a common problem?
     • Is ‘imputation’ always the right solution?

  3. What does “missing data” mean? Look at Table 1 on the next slide. This very small data set has 5 observations, and observations 3 and 5 have missing values on the variable “number of follow-up rehabilitation visits”.

  4. Table 1: Two values are missing

  5. What does “imputation” mean? If we work out plausible values for the missing entries and put them in the empty cells, we have done imputation. Look at Table 2, in which the missing values have been imputed. Do not worry for now about how the imputation was carried out; in fact, I simply put in some arbitrary numbers.

  6. Table 2: Two values imputed

  7. What does “data improvement” mean? Look at Table 3. It has three columns for the number of visits. The left column is the actual (complete) variable, the middle column is the same variable with missing values, and the rightmost column is the one with imputed values. The last row of the table shows the average number of visits for the actual data, the data with missing values, and the imputed data. The average of the imputed column is clearly closer to that of the actual data than the average of the column with missing values is. In this sense, imputation improved the quality of the data.

  8. Table 3: Data improvement
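
The comparison behind Table 3 can be sketched in a few lines of Python. The numbers below are invented for illustration and are not the values from the table; `None` marks a missing value, and the imputed values are arbitrary, as in Table 2.

```python
from statistics import mean

# Hypothetical visit counts for 5 observations (NOT the values in Table 3).
actual = [4, 6, 9, 5, 11]
# The same variable with observations 3 and 5 missing.
with_missing = [4, 6, None, 5, None]
# Arbitrary imputed values placed in the two empty cells.
imputed = [4, 6, 8, 5, 10]


def observed_mean(values):
    """Mean of the non-missing entries only."""
    return mean(v for v in values if v is not None)


mean_actual = mean(actual)
mean_missing = observed_mean(with_missing)
mean_imputed = mean(imputed)

# Distance of each estimate from the mean of the actual data.
error_missing = abs(mean_missing - mean_actual)
error_imputed = abs(mean_imputed - mean_actual)

print("actual:", mean_actual)
print("with missing values:", mean_missing)
print("imputed:", mean_imputed)
print("imputed mean is closer:", error_imputed < error_missing)
```

With other imputed values the result could just as easily go the other way, which is why the method used to impute matters.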

  9. How much missingness is acceptable? As with the significance threshold for p-values, there is no single empirically derived answer to this question. Leong and Austin (2006), for instance, suggested 5%. I have personally seen social science and health services research in which 10% missingness was accepted. So, for now, let us agree on a tolerance level of 5%.
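
As a minimal sketch of how the 5% tolerance could be checked for a single variable, the snippet below computes the share of missing entries and compares it with a threshold. The function names and the sample values are hypothetical; `None` is assumed to mark a missing value.

```python
def missing_fraction(values):
    """Share of entries that are missing (None)."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v is None for v in values) / len(values)


def within_tolerance(values, tolerance=0.05):
    """True if the share of missing entries does not exceed the tolerance."""
    return missing_fraction(values) <= tolerance


# Hypothetical variable: 2 of 5 values are missing.
visits = [4, 6, None, 5, None]

print(missing_fraction(visits))
print(within_tolerance(visits))                   # default tolerance of 5%
print(within_tolerance(visits, tolerance=0.10))   # the looser 10% level
```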

  10. Is missing data a common problem? Yes. In most of the administrative data sets I have worked with, a considerable number of values on the variables I needed were missing. We need to take the possibility of substantial missingness seriously even when the data have a reputation for being clean and complete. An example of the latter is the Demographic and Health Surveys, better known as DHS. These data sets carry a lot of invaluable information, but missing data is sometimes a prohibitive obstacle for researchers using them.

  11. Is imputation always the right solution? With some exceptions, yes. But I would like you to answer this question yourself once we are done with the whole series of presentations.

  12. TYPES OF MISSING DATA (RUBIN’S TYPOLOGY)
      • MISSING COMPLETELY AT RANDOM (MCAR)
      • MISSING AT RANDOM (MAR)
      • MISSING NOT AT RANDOM (MNAR)

  13. Missing Completely At Random (MCAR)
      • The cause of missingness cannot be predicted from the other observed variables.
      • The cause of missingness is independent of the values of the variable that has missing data.
      NO-NO condition: missingness depends neither on observed values nor on the missing values themselves.

  14. MCAR: EXAMPLE ONE: Lab samples thrown out. Imagine that blood samples from a randomly selected population, taken to test fasting blood sugar, have been sent to 3 labs. One of the labs reports that all of its samples have been accidentally thrown out. So, a portion of the data on the variable blood sugar level will be missing from the final data set. Here, the event that causes the missingness is exogenous to the data-gathering process and to the characteristics of the population (the likelihood of a value being missing is independent of the observed information). The missingness is also independent of whether the blood sugar was high or low.

  15. MCAR-1
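
The lab example can be simulated to see what MCAR implies in practice. This is a sketch under stated assumptions, not a description of any real study: the blood sugar distribution is invented, and samples are assumed to have been allocated to the three labs at random, which the slide does not state. Under these assumptions, the mean of the surviving samples should stay close to the mean of all samples.

```python
import random
from statistics import mean


def simulate_lab_loss(n=30000, lost_lab=3, seed=0):
    """Simulate n blood samples sent to 3 labs, one of which loses everything."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        sugar = rng.gauss(95, 12)   # hypothetical fasting blood sugar (mg/dL)
        lab = rng.randint(1, 3)     # assumed: lab chosen at random, unrelated to sugar
        samples.append((sugar, lab))
    all_values = [sugar for sugar, _ in samples]
    # Values from the lab that threw its samples out are missing.
    observed = [sugar for sugar, lab in samples if lab != lost_lab]
    return all_values, observed


all_values, observed = simulate_lab_loss()
missing_share = 1 - len(observed) / len(all_values)

print("share missing:", round(missing_share, 2))
print("mean of all samples:", round(mean(all_values), 1))
print("mean of observed samples:", round(mean(observed), 1))
```

About a third of the values are lost, which is far above the 5% tolerance, yet the observed mean is not systematically shifted; what is lost is precision, not accuracy.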

  16. MCAR: EXAMPLE TWO: Coin tossing. This example is the familiar coin toss used in sport to decide which team gets the ball first. There are two possible results: heads and tails. Imagine that our data set records the age of the referee and the type of sport, and that some of the values for the result of the coin toss are missing. Obviously, whether a result is missing depends neither on the observed variables (age of the referee and type of sport) nor on the missing (unobserved) values. To elaborate on the latter: if 70% of the recorded coin tosses came up heads, this does not imply that 70%, or even the majority, of the missing values must be heads.

  17. MCAR-2

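  18. Missing At Random (MAR)
      • The cause of the missing values is independent of the missing (unobservable) values.
      • But it can be predicted from other observed values.
      NO-YES condition: missingness does not depend on the missing values themselves, but it does depend on observed values.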

  19. MAR: EXAMPLE ONE: Females and kidney donation. This example is a study of the effect of kidney donation on the donor’s household income. Suppose that, during the study, female donors turn out to be more likely than male donors to refuse to answer the income question. The missing pattern on the income variable is then called Missing At Random, or MAR, provided that women with low and high incomes answer the income question with the same probability (and likewise for men). In other words, once sex is taken into account, the missingness is independent of the missing (unobserved) values.

  20. MAR-1
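
A small simulation can show why the observed variable matters under MAR. Everything numeric here is invented: the refusal probabilities, the income distributions, and in particular the income gap between the two groups, which is included only so that the effect is visible. Refusal depends on sex alone, never on income itself.

```python
import random
from statistics import mean


def simulate_income_refusal(n=40000, seed=1):
    """Return (sex, income, answered) rows where refusal depends only on sex."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sex = rng.choice(["female", "male"])
        # Invented incomes (in thousands); the gap is an assumption for illustration.
        income = rng.gauss(50, 10) if sex == "female" else rng.gauss(60, 10)
        # Invented refusal rates: higher for female donors, unrelated to income.
        p_refuse = 0.5 if sex == "female" else 0.1
        answered = rng.random() >= p_refuse
        rows.append((sex, income, answered))
    return rows


def group_means(rows, sex=None):
    """True mean and observed-only mean of income, optionally within one sex."""
    true_vals = [inc for s, inc, _ in rows if sex in (None, s)]
    obs_vals = [inc for s, inc, ans in rows if ans and sex in (None, s)]
    return mean(true_vals), mean(obs_vals)


rows = simulate_income_refusal()
for label in (None, "female", "male"):
    true_mean, obs_mean = group_means(rows, label)
    print(label or "all", round(true_mean, 1), round(obs_mean, 1))
```

In this sketch the overall observed mean drifts away from the true mean because one group is under-represented among the answers, while the observed mean within each sex stays close to the truth. That is the practical meaning of missingness that "can be predicted by other observed values".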

  21. MAR: EXAMPLE TWO: Attitudes toward having social insurance. This is a study of attitudes toward implementing a universal social welfare insurance program. It was found that people affiliated with a particular type of political party tended not to respond to the insurance question. In this example, the pattern of missingness on the insurance question is MAR because at least one observed variable (political party) partly determined the likelihood of the response being missing. A positive or negative attitude toward social insurance was assumed to be independent of the missing pattern. This means that the probability of a missing answer to the insurance question was the same for people who would have answered negatively and for those who would have answered positively.

  22. MAR-2

  23. Missing Not At Random (MNAR)
      • The cause of the missing values depends on the missing (unobservable) values.
      • And it can usually also be predicted from other observed values.
      YES-YES condition: missingness depends on the missing values themselves and, usually, on observed values as well.
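
The three conditions are often written more compactly. The notation below is not in the slides and is introduced here only for illustration: $R$ indicates which values are missing, $Y_{\text{obs}}$ denotes the observed values, and $Y_{\text{mis}}$ the missing ones.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}
                    \text{ even after conditioning on } Y_{\text{obs}}
\end{align*}
```

Read this way, MCAR is the special case of MAR in which the observed values carry no information about missingness either.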

  24. MNAR: EXAMPLE ONE: Synthetic insulin and blood sugar reduction time. The first scenario is a research study of the effect of a new type of synthetic insulin on blood sugar reduction time in humans. The protocol mandates that if the reduction time is greater than one third of the standard reduction time (defined in the protocol), the researchers should stop the treatment and refer the patient to the emergency department. These patients leave the study, and their final reduction time is missing. In this example, the likelihood of a value being missing depends directly on the unobserved (missing) values. In other words, the reduction time itself (the variable with a considerable number of missing cases) determines whether or not its value is missing.

  25. MNAR-1
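
The consequence of this kind of protocol-driven dropout can be sketched with a short simulation. The distribution of reduction times and the cutoff of 40 minutes are hypothetical and are not taken from the protocol described above; the only feature carried over is that a value goes missing exactly when it exceeds a cutoff.

```python
import random
from statistics import mean


def simulate_protocol_dropout(n=10000, cutoff=40.0, seed=2):
    """Reduction times above the cutoff are never recorded (patient leaves the study)."""
    rng = random.Random(seed)
    # Hypothetical reduction times in minutes (log-normal, so always positive).
    times = [rng.lognormvariate(3.4, 0.3) for _ in range(n)]
    # Missingness is decided by the value itself: this is what makes it MNAR.
    observed = [t for t in times if t <= cutoff]
    return times, observed


times, observed_times = simulate_protocol_dropout()

print("share missing:", round(1 - len(observed_times) / len(times), 2))
print("true mean:", round(mean(times), 1))
print("observed mean:", round(mean(observed_times), 1))
```

Because only the longest times are lost, the observed mean is necessarily lower than the true mean, and no other recorded variable can fully repair this without assumptions about the values that were never seen.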

  26. MNAR: EXAMPLE TWO: A new painkiller and experience with pain. The second scenario is a study in which a new painkiller is administered to patients with migraine headache, and they are asked the next day how much their pain was reduced. It was found that missing values on the variable ‘how much pain was reduced’ were much more frequent among patients who experienced severe pain. If the severity of pain is tied to how much the pain was reduced, the likelihood of a missing answer depends on the very value that is missing.

  27. MNAR-2

  28. Thank you. I look forward to seeing you at the next session. Please email your questions to sshahraz@yahoo.com
