

  1. Data Quality / Data Heterogeneity: An evolving mission
  Kent Bailey, Susan Welch, Jeff Tarlowe

  2. What is “Data Quality”?
  • Reliability? Accuracy?
  • Reproducibility?
  • Information content?
  • Presence in all patients? Missingness?
  • Lack of bias?
  • Suitability to the question?
  • All of the above?

  3. What is “Data Quality”?
  • Individual datum level
    • Obvious errors (illustrated in the sketch below)
    • Non-obvious errors
    • Not “errors,” but a poor reflection of a characteristic of an individual
  • Batch level
    • Bias, suitability to the question, information content, e.g. “signal-to-noise ratio”
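A minimal sketch of the datum-level distinction, in Python. The thresholds for a glucose-like lab value are illustrative assumptions, not project rules: physically impossible readings are treated as obvious errors, while rare-but-possible readings are non-obvious and merely flagged for review.

    # Hypothetical screen for a glucose-like lab value (mg/dL).
    # All thresholds are illustrative assumptions.

    def screen_glucose(values):
        """Partition readings into obvious errors, suspect outliers, and plausible values."""
        obvious, suspect, plausible = [], [], []
        for v in values:
            if v is None or v <= 0 or v > 2000:   # physically impossible: obvious error
                obvious.append(v)
            elif v < 20 or v > 800:               # possible but rare: flag for review
                suspect.append(v)
            else:
                plausible.append(v)
        return obvious, suspect, plausible

    print(screen_glucose([95, 110, -4, 1250, 12, None, 180]))
    # -> ([-4, None], [1250, 12], [95, 110, 180])

The “not errors, but a poor reflection” category is the hard one: no single-value rule catches it, which is what motivates the batch-level and signal-to-noise work in Projects 1 and 4.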

  4. Projects under Data Quality
  • Project 1: Data Heterogeneity Study
  • Project 2: Difficult data elements
    • Body Mass Index
    • Smoking
  • Project 3: Comparison of computer algorithm vs. manual review for treatment cohort selection (a.k.a. the “John Henry” study)
  • Project 4: Measuring information in quantitative data

  5. Project 1: Data Heterogeneity
  • Purpose: compare EHR data between institutions in terms of characteristics (not “quality”)
  • Institutions: Mayo and Intermountain
  • Methods: extract data relevant to Type 2 Diabetes from the EHR at each institution: diagnoses, labs, meds
  • Analysis:
    • Descriptive (compare frequencies and distributions; see the sketch below)
    • Tweak selection parameters and study the effects
    • Study within-institution heterogeneity / bias
    • Study differences in institutional source datasets
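As a rough sketch of the descriptive comparison step (not the study’s actual pipeline), one could tabulate the relative frequency of each diagnosis code in the two institutions’ extracts and rank the largest gaps. The function and toy data below are hypothetical.

    # Illustrative only: compare relative frequencies of diagnosis codes
    # between two institutions' DM2 extracts.
    from collections import Counter

    def code_frequency_diff(codes_a, codes_b, top=10):
        """Return the codes with the largest absolute difference in relative frequency."""
        fa, fb = Counter(codes_a), Counter(codes_b)
        na, nb = sum(fa.values()), sum(fb.values())
        diffs = {c: fa[c] / na - fb[c] / nb for c in set(fa) | set(fb)}
        return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

    # Hypothetical toy inputs; real inputs would be de-identified ICD-9 code lists.
    site_a = ["250.00", "250.00", "250.02", "401.9"]
    site_b = ["250.00", "250.02", "250.02", "272.4"]
    print(code_frequency_diff(site_a, site_b))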

  6. Project 1: Data Heterogeneity
  • Current status / milestones
    • IRB and data-sharing approval
    • Initial DM2 datasets exist at each institution
    • De-identification (home-grown)
    • Initial exchange of de-identified data!
    • Analysis proceeding
      • Comparative analysis of ICD-9 codes
      • Comparison of data sources, missingness

  7. Project 1: Data Heterogeneity
  • Next steps / future directions
    • Compare and contrast Mayo and Intermountain data
    • Compare and elucidate idiosyncrasies of the data sources
    • Draw generalizations about the heterogeneities
    • Assess the impact of these heterogeneities on secondary use
    • White paper?

  8. Project 2: Difficult Data Elements
  • Purpose: characterize quality aspects of difficult data elements (BMI, smoking, …) and develop mitigations or warnings
  • Method:
    • Extract data (height / weight / BMI, smoking) within the Data Heterogeneity study at Intermountain and Mayo
    • Detect errors / missingness
    • Develop mitigations if possible (one candidate BMI check is sketched below)
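The sketch below illustrates one plausible BMI mitigation of the kind this slide describes: recompute BMI from height and weight and flag records whose stored BMI disagrees, including the characteristic signature of a pounds-for-kilograms entry. The tolerances and heuristics are assumptions, not the project’s code.

    # Hypothetical BMI plausibility check; thresholds and tolerance are assumptions.

    def check_bmi(height_cm, weight_kg, stored_bmi=None, tol=1.0):
        if not (100 <= height_cm <= 250) or not (20 <= weight_kg <= 350):
            return "implausible height/weight"
        computed = weight_kg / (height_cm / 100) ** 2
        if stored_bmi is None:
            return f"missing BMI (computed {computed:.1f})"
        if abs(computed - stored_bmi) > tol:
            # A stored BMI near computed * 2.2046 suggests weight entered in pounds
            # but recorded as kilograms.
            if abs(stored_bmi - computed * 2.2046) <= tol:
                return "suspected lb-for-kg weight entry"
            return f"discrepant BMI: stored {stored_bmi}, computed {computed:.1f}"
        return "ok"

    print(check_bmi(175, 80, 26.1))   # -> ok (computed ~26.1)
    print(check_bmi(175, 80, 57.6))   # -> suspected lb-for-kg weight entry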

  9. P2: Difficult Data Elements
  • Current status:
    • Data related to BMI have been shared and are being analyzed
  • Next steps / future directions:
    • Comparative analysis of BMI data; data quality / absence issues
    • Smoking status derived from a cTAKES-based algorithm, about to be validated by chart review (a naive stand-in is sketched below)
    • Develop widgets? White paper?
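For orientation only: the project’s extraction is cTAKES-based, but a deliberately naive rule-based stand-in conveys the shape of the task. The patterns below are assumptions, and they ignore the negation scope and temporality that a real clinical NLP pipeline must handle.

    # Naive rule-based stand-in for smoking-status classification.
    # Not the project's cTAKES pipeline; patterns are illustrative.
    import re

    def smoking_status(note: str) -> str:
        text = note.lower()
        if re.search(r"\b(never smoked|non-?smoker|denies smoking)\b", text):
            return "never"
        if re.search(r"\b(former smoker|quit|ex-?smoker)\b", text):
            return "former"
        if re.search(r"\b(current smoker|smokes|pack[- ]?years?)\b", text):
            return "current"
        return "unknown"

    print(smoking_status("Patient is an ex-smoker, quit 2005."))  # -> former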

  10. Project 3: Algorithm vs. human review (the “John Henry” study)
  • Purpose: demonstrate and quantify the cost/benefit of developing and implementing a computer algorithm to derive a high-risk Type 2 Diabetes cohort, compared with manual review to derive such a cohort. Analyze discordances between the two methods.
  • Method: after phased preliminary comparative studies of 20 and 50 potential cases, with refinement of the algorithm, analyze 200 cases by the two approaches. Analyze the discordances, and also the cumulative costs associated with both methods. Extrapolate to other target sample sizes (a toy break-even model follows this slide).
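The cost extrapolation can be framed as a simple break-even calculation: manual review scales linearly per chart, while the algorithm pays a one-time development cost plus a small per-chart cost. All dollar figures in this sketch are made-up assumptions, not study results.

    # Toy break-even model; every number here is a placeholder assumption.
    import math

    def breakeven_charts(dev_cost, alg_per_chart, manual_per_chart):
        """Smallest cohort size at which the algorithm becomes cheaper than manual review."""
        return math.ceil(dev_cost / (manual_per_chart - alg_per_chart))

    # Hypothetical: $20,000 to develop, $0.50/chart to run, $15/chart for manual review.
    print(breakeven_charts(20_000, 0.50, 15.00))  # -> 1380 charts

Under these toy numbers the algorithm pays for itself above roughly a thousand charts; the study’s actual cost analysis (slide 11) would supply the real parameters.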

  11. P3: John Henry
  • Current status / milestones
    • First phases complete
    • Final contest (200 charts) imminent
  • Next steps / future directions
    • Analyze costs under various assumptions
    • Report results
    • Generalize to other settings?

  12. Project 4: Measuring information in quantitative data
  • Purpose: develop methods to quantify the signal-to-noise ratio in quantitative data, to inform the choice of (or weights applied to) different candidate variables related to the same underlying phenotype
  • Methods: apply ANOVA to estimate between-subject and within-subject components of variance, along with other methods for estimating signal and noise components; example: random capillary glucose vs. HbA1c (a minimal sketch follows this slide)
  • Current status: a gleam in the eye; preliminary proof of concept
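Under the simplest balanced design (every subject measured the same number of times), the ANOVA decomposition described above reduces to a few lines. A sketch, reading the between-subject variance component as signal and the within-subject mean square as noise; their ratio of signal to total variance is the intraclass correlation:

    # Balanced one-way random-effects ANOVA: sigma_b^2 = (MSB - MSW) / n,
    # where n is the number of repeated measurements per subject.

    def variance_components(data):
        """data: list of per-subject measurement lists, each of the same length n."""
        k, n = len(data), len(data[0])
        grand = sum(sum(s) for s in data) / (k * n)
        means = [sum(s) / n for s in data]
        ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
        ms_within = sum((x - m) ** 2 for s, m in zip(data, means) for x in s) / (k * (n - 1))
        sigma2_b = max((ms_between - ms_within) / n, 0.0)   # between-subject: "signal"
        icc = sigma2_b / (sigma2_b + ms_within)             # fraction of variance that is signal
        return sigma2_b, ms_within, icc

    # Toy example: 3 subjects, 2 repeated glucose-like measurements each.
    print(variance_components([[100, 104], [140, 150], [90, 88]]))

A variable like HbA1c would be expected to show a higher between-subject fraction than random capillary glucose, which is exactly the comparison the project proposes to formalize.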

  13. Questions/suggestions
