

  1. Data Quality / Data Heterogeneity: An evolving mission
  Kent Bailey, Susan Welch, Jeff Tarlowe

  2. What is “Data Quality”?
  • Reliability? Accuracy?
  • Reproducibility?
  • Information content?
  • Presence in all patients? Missingness?
  • Lack of bias?
  • Suitability to the question?
  • All of the above?

  3. What is “Data Quality”?
  • Individual datum level
    • Obvious errors (illustrated in the sketch below)
    • Non-obvious errors
    • Not “errors,” but a poor reflection of a characteristic of an individual
  • Batch level
    • Bias, suitability to the question, information content, e.g. “signal-to-noise ratio”
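A minimal sketch of the datum-level distinction, in Python. The thresholds for a glucose-like lab value are illustrative assumptions, not project rules: physically impossible readings are treated as obvious errors, while rare-but-possible readings are non-obvious and merely flagged for review.

    # Hypothetical screen for a glucose-like lab value (mg/dL).
    # All thresholds are illustrative assumptions.

    def screen_glucose(values):
        """Partition readings into obvious errors, suspect outliers, and plausible values."""
        obvious, suspect, plausible = [], [], []
        for v in values:
            if v is None or v <= 0 or v > 2000:   # physically impossible: obvious error
                obvious.append(v)
            elif v < 20 or v > 800:               # possible but rare: flag for review
                suspect.append(v)
            else:
                plausible.append(v)
        return obvious, suspect, plausible

    print(screen_glucose([95, 110, -4, 1250, 12, None, 180]))
    # -> ([-4, None], [1250, 12], [95, 110, 180])

The “not errors, but a poor reflection” category is the hard one: no single-value rule catches it, which is what motivates the batch-level and signal-to-noise work in Projects 1 and 4.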

  4. Projects under Data Quality
  • Project 1: Data Heterogeneity Study
  • Project 2: Difficult data elements
    • Body Mass Index
    • Smoking
  • Project 3: Comparison of computer algorithm vs. manual review for treatment cohort selection (a.k.a. the “John Henry” study)
  • Project 4: Measuring information in quantitative data

  5. Project 1: Data Heterogeneity
  • Purpose: compare EHR data between institutions in terms of characteristics (not “quality”)
  • Institutions: Mayo and Intermountain
  • Methods: extract data relevant to Type 2 Diabetes from the EHR at each institution: diagnoses, labs, meds
  • Analysis:
    • Descriptive (compare frequencies and distributions; see the sketch below)
    • Tweak selection parameters and study the effects
    • Study within-institution heterogeneity / bias
    • Study differences in institutional source datasets
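As a rough sketch of the descriptive comparison step (not the study’s actual pipeline), one could tabulate the relative frequency of each diagnosis code in the two institutions’ extracts and rank the largest gaps. The function and toy data below are hypothetical.

    # Illustrative only: compare relative frequencies of diagnosis codes
    # between two institutions' DM2 extracts.
    from collections import Counter

    def code_frequency_diff(codes_a, codes_b, top=10):
        """Return the codes with the largest absolute difference in relative frequency."""
        fa, fb = Counter(codes_a), Counter(codes_b)
        na, nb = sum(fa.values()), sum(fb.values())
        diffs = {c: fa[c] / na - fb[c] / nb for c in set(fa) | set(fb)}
        return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

    # Hypothetical toy inputs; real inputs would be de-identified ICD-9 code lists.
    site_a = ["250.00", "250.00", "250.02", "401.9"]
    site_b = ["250.00", "250.02", "250.02", "272.4"]
    print(code_frequency_diff(site_a, site_b))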

  6. Project 1: Data Heterogeneity
  • Current status / milestones
    • IRB and data-sharing approval
    • Initial DM2 datasets exist at each institution
    • De-identification (home-grown)
    • Initial exchange of de-identified data!
    • Analysis proceeding
      • Comparative analysis of ICD-9 codes
      • Comparison of data sources, missingness

  7. Project 1: Data Heterogeneity
  • Next steps / future directions
    • Compare and contrast Mayo and Intermountain data
    • Compare and elucidate idiosyncrasies of the data sources
    • Draw generalizations about the heterogeneities
    • Assess the impact of these heterogeneities on secondary use
    • White paper?

  8. Project 2: Difficult Data Elements
  • Purpose: characterize quality aspects of difficult data elements (BMI, smoking, …) and develop mitigations or warnings
  • Method:
    • Extract data (height / weight / BMI, smoking) within the Data Heterogeneity study at Intermountain and Mayo
    • Detect errors / missingness
    • Develop mitigations if possible (one candidate BMI check is sketched below)
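The sketch below illustrates one plausible BMI mitigation of the kind this slide describes: recompute BMI from height and weight and flag records whose stored BMI disagrees, including the characteristic signature of a pounds-for-kilograms entry. The tolerances and heuristics are assumptions, not the project’s code.

    # Hypothetical BMI plausibility check; thresholds and tolerance are assumptions.

    def check_bmi(height_cm, weight_kg, stored_bmi=None, tol=1.0):
        if not (100 <= height_cm <= 250) or not (20 <= weight_kg <= 350):
            return "implausible height/weight"
        computed = weight_kg / (height_cm / 100) ** 2
        if stored_bmi is None:
            return f"missing BMI (computed {computed:.1f})"
        if abs(computed - stored_bmi) > tol:
            # A stored BMI near computed * 2.2046 suggests weight entered in pounds
            # but recorded as kilograms.
            if abs(stored_bmi - computed * 2.2046) <= tol:
                return "suspected lb-for-kg weight entry"
            return f"discrepant BMI: stored {stored_bmi}, computed {computed:.1f}"
        return "ok"

    print(check_bmi(175, 80, 26.1))   # -> ok (computed ~26.1)
    print(check_bmi(175, 80, 57.6))   # -> suspected lb-for-kg weight entry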

  9. P2: Difficult Data Elements
  • Current status:
    • Data related to BMI have been shared and are being analyzed
  • Next steps / future directions:
    • Comparative analysis of BMI data; data quality / absence issues
    • Smoking status derived from a cTAKES-based algorithm, about to be validated by chart review (a naive stand-in is sketched below)
    • Develop widgets? White paper?
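For orientation only: the project’s extraction is cTAKES-based, but a deliberately naive rule-based stand-in conveys the shape of the task. The patterns below are assumptions, and they ignore the negation scope and temporality that a real clinical NLP pipeline must handle.

    # Naive rule-based stand-in for smoking-status classification.
    # Not the project's cTAKES pipeline; patterns are illustrative.
    import re

    def smoking_status(note: str) -> str:
        text = note.lower()
        if re.search(r"\b(never smoked|non-?smoker|denies smoking)\b", text):
            return "never"
        if re.search(r"\b(former smoker|quit|ex-?smoker)\b", text):
            return "former"
        if re.search(r"\b(current smoker|smokes|pack[- ]?years?)\b", text):
            return "current"
        return "unknown"

    print(smoking_status("Patient is an ex-smoker, quit 2005."))  # -> former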

  10. Project 3: Algorithm vs. human review (the “John Henry” study)
  • Purpose: demonstrate and quantify the cost/benefit of developing and implementing a computer algorithm to derive a high-risk Type 2 Diabetes cohort, compared with manual review to derive such a cohort. Analyze discordances between the two methods.
  • Method: after phased preliminary comparative studies of 20 and 50 potential cases, with refinement of the algorithm, analyze 200 cases by the two approaches. Analyze the discordances, and also the cumulative costs associated with both methods. Extrapolate to other target sample sizes (a toy break-even model follows this slide).
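The cost extrapolation can be framed as a simple break-even calculation: manual review scales linearly per chart, while the algorithm pays a one-time development cost plus a small per-chart cost. All dollar figures in this sketch are made-up assumptions, not study results.

    # Toy break-even model; every number here is a placeholder assumption.
    import math

    def breakeven_charts(dev_cost, alg_per_chart, manual_per_chart):
        """Smallest cohort size at which the algorithm becomes cheaper than manual review."""
        return math.ceil(dev_cost / (manual_per_chart - alg_per_chart))

    # Hypothetical: $20,000 to develop, $0.50/chart to run, $15/chart for manual review.
    print(breakeven_charts(20_000, 0.50, 15.00))  # -> 1380 charts

Under these toy numbers the algorithm pays for itself above roughly a thousand charts; the study’s actual cost analysis (slide 11) would supply the real parameters.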

  11. P3: John Henry
  • Current status / milestones
    • First phases complete
    • Final contest (200 charts) imminent
  • Next steps / future directions
    • Analyze costs under various assumptions
    • Report results
    • Generalize to other settings?

  12. Project 4: Measuring information in quantitative data
  • Purpose: develop methods to quantify the signal-to-noise ratio in quantitative data, to inform the choice of (or weights applied to) different candidate variables related to the same underlying phenotype
  • Methods: apply ANOVA to estimate between-subject and within-subject components of variance, along with other methods for estimating signal and noise components; example: random capillary glucose vs. HbA1c (a minimal sketch follows this slide)
  • Current status: a gleam in the eye; preliminary proof of concept
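Under the simplest balanced design (every subject measured the same number of times), the ANOVA decomposition described above reduces to a few lines. A sketch, reading the between-subject variance component as signal and the within-subject mean square as noise; their ratio of signal to total variance is the intraclass correlation:

    # Balanced one-way random-effects ANOVA: sigma_b^2 = (MSB - MSW) / n,
    # where n is the number of repeated measurements per subject.

    def variance_components(data):
        """data: list of per-subject measurement lists, each of the same length n."""
        k, n = len(data), len(data[0])
        grand = sum(sum(s) for s in data) / (k * n)
        means = [sum(s) / n for s in data]
        ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
        ms_within = sum((x - m) ** 2 for s, m in zip(data, means) for x in s) / (k * (n - 1))
        sigma2_b = max((ms_between - ms_within) / n, 0.0)   # between-subject: "signal"
        icc = sigma2_b / (sigma2_b + ms_within)             # fraction of variance that is signal
        return sigma2_b, ms_within, icc

    # Toy example: 3 subjects, 2 repeated glucose-like measurements each.
    print(variance_components([[100, 104], [140, 150], [90, 88]]))

A variable like HbA1c would be expected to show a higher between-subject fraction than random capillary glucose, which is exactly the comparison the project proposes to formalize.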

  13. Questions/suggestions
