Standardization of Data Dictionaries: A Case Study from the Pharmacogenomics Research Network
- Pharmacogenomics is a multidisciplinary science that has also become data-intensive. As such, it requires increasingly clear annotation and representation of phenotypes to support data integration and cross-database analyses. However, traditional biomedical research studies and clinical trials are conducted independently, and common, standardized representations for data are seldom used. This leads to heterogeneity in the collected data and hinders data reuse, integration, and meta-analyses across multiple datasets. We propose a codification of standardized phenotype definitions and relationships, in coordination with other established government-funded efforts.
- In this paper, we report the results from a preliminary standardization effort for pharmacogenomics metadata from the Pharmacogenomics Research Network (PGRN) [1]:
- Significant heterogeneity in data representation
- Methods for normalization and semantic annotation
- Results from the initial analysis of dictionary harmonization
- Identification of overlaps among dictionaries and gaps in standard terminologies and metadata
- Demonstration of the need for pharmacogenomics data standards
- Data dictionaries are collections of variables
- Often constructed for a particular study
- May contain both common and study-specific variables
- There is no consistent format for data dictionaries
- Differences in data collection or representation can complicate (or prevent) data integration and analysis
- Differences in format (require transformations)
- Differences in semantic meaning
- Differences in data values
- Data standardization is often achieved by harmonizing data dictionaries and by using standard elements to represent data
- Standardized representations also enable secondary use
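The harmonization idea in the points above can be illustrated with a minimal sketch. It assumes two hypothetical sites that record the same concept under different variable names and value codings. The site names, variable names, codings, and the common element `participant_sex` are all invented for illustration and do not come from the PGRN dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One entry in a site-specific data dictionary (hypothetical structure)."""
    site: str
    name: str
    description: str
    permissible_values: tuple


# Two hypothetical sites recording the same concept differently.
site_a = Variable("site_a", "SEX", "Sex of participant", ("M", "F"))
site_b = Variable("site_b", "gender_cd", "Participant gender code", ("1", "2"))

# Hypothetical common element plus one value map per site variable.
COMMON_ELEMENT = "participant_sex"
VALUE_MAPS = {
    ("site_a", "SEX"): {"M": "male", "F": "female"},
    ("site_b", "gender_cd"): {"1": "male", "2": "female"},
}


def harmonize(variable, raw_value):
    """Translate a site-specific value into the common element and value."""
    value_map = VALUE_MAPS[(variable.site, variable.name)]
    return COMMON_ELEMENT, value_map[raw_value]
```

Once both sites are expressed through the same element and value set, records from either site can be pooled without per-study transformation code.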
- 1. http://pgrn.org/display/pgrnwebsite/PGRN+Home
- 2. http://www.ihtsdo.org/snomed-ct/
- 3. S.H. Brown, P.L. Elkin, S.T. Rosenbloom, C. Husser, B.A. Bauer, M.J. Lincoln et al. VA national drug file reference terminology: a cross-institutional content coverage study. Medinfo, 2004 (2004), pp. 477–481
- 5. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33
- The authors thank Zonghui Lian and Scott Bauer for IT support, and Donna Ihrke for participation in the initial data analysis.
- The authors thank the Pharmacogenomics Research Network (PGRN) for supporting this work (PHONT; U19-GM061388).
Qian Zhu, PhD; Robert R. Freimuth, PhD; Matthew J. Durski, MA; Jyotishman Pathak, PhD; Christopher G. Chute, MD, DrPH
Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Data and metadata standards help to mitigate the problems that arise from semantic and syntactic differences between research groups, which are major barriers that hinder effective communication among scientists and that slow the pace of advancement and discovery.
- This work describes the initial standardization of data dictionaries from pharmacogenomics research sites. Our results demonstrate that there is a significant amount of variability in how data is represented among PGRN sites.
- Existing standards, including both structured data elements and controlled terminologies, contain many of the concepts that are needed to represent pharmacogenomic data, although some extensions to those terminologies are required.
- This pilot study demonstrates that a broader harmonization effort would be successful in reducing the redundancy and heterogeneity in how data is represented within the PGRN.
- Future work
- The data dictionaries used in this study did not contain genetic data. This represents an important area for data integration; data and terminology standards are needed.
- Standard clinical data models will be evaluated for use as PGRN standards.
- Data normalization results
- Variables mapped to caDSR Data Elements (DEs):
- Semantic Annotations from NCBO Bioportal:
- Of 629 variables, 595 contained sufficient descriptions to enable semantic annotation
- Semantic annotations were reviewed to determine if the annotations captured the semantics of the variable completely, partially, or not at all
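The review step above can be sketched as a simple tally. The counts 629 and 595 are taken from the text; the review labels and the structure of the `reviews` mapping are assumptions made for illustration, and no per-category results are implied.

```python
from collections import Counter

TOTAL_VARIABLES = 629   # from the text
ANNOTATABLE = 595       # variables with sufficient descriptions (from the text)

REVIEW_LEVELS = ("complete", "partial", "none")


def annotatable_fraction(annotatable, total):
    """Share of variables that could be semantically annotated."""
    return annotatable / total


def tally_reviews(reviews):
    """Count reviewed variables per coverage level.

    `reviews` maps a variable name to one of REVIEW_LEVELS
    (a hypothetical structure, not the authors' format).
    """
    counts = Counter(reviews.values())
    return {level: counts.get(level, 0) for level in REVIEW_LEVELS}
```

With the figures given above, `annotatable_fraction(ANNOTATABLE, TOTAL_VARIABLES)` is roughly 0.946, i.e. about 95% of the variables could be annotated.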
The variety of disease phenotypes that are studied in the PGRN, as well as differences in the clinical systems in use at each PGRN site, leads to data that is heterogeneous, non-standardized, and institution-specific. This not only hinders aggregation of data among sites that are collaborating on a given study, but also complicates or prevents secondary use of the data (e.g., in meta-analyses).
- Harmonization process for PGRN data dictionaries
- Variables were manually grouped into 5 categories
- Variables were manually mapped, when possible, to existing elements in the caDSR
- Variables were annotated with terms from 5 standard terminologies
- Approximately 1/3 of the variables were selected for the initial analysis
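One way to record the outcome of the three harmonization steps above (grouping, caDSR mapping, annotation) is a small per-variable record. This is a sketch under stated assumptions: the class name, field names, and category labels are hypothetical, and only the five terminology names come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

# The five terminologies named in this work.
TERMINOLOGIES = ("SNOMED-CT", "NDF-RT", "NCI Thesaurus", "LOINC", "RxNorm")


@dataclass
class HarmonizedVariable:
    """Result of harmonizing one dictionary variable (hypothetical structure)."""
    name: str
    category: str                          # one of the 5 manual categories (placeholder label)
    cadsr_element: Optional[str] = None    # None when no caDSR element matched
    annotations: dict = field(default_factory=dict)  # terminology -> list of terms

    def annotate(self, terminology, term):
        """Attach a term from one of the five terminologies."""
        if terminology not in TERMINOLOGIES:
            raise ValueError(f"unsupported terminology: {terminology}")
        self.annotations.setdefault(terminology, []).append(term)


def mapped_fraction(variables):
    """Share of variables that were mapped to an existing caDSR element."""
    if not variables:
        return 0.0
    mapped = sum(v.cadsr_element is not None for v in variables)
    return mapped / len(variables)
```

Keeping the caDSR mapping optional reflects the "when possible" qualifier: variables without a matching element stay in the record and mark gaps in the standard.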
Example: Medication History, Drug List
- Different permissible values
- Content and meaning of values
- Local meanings ("other", "unknown")
- Could also have different representations (numbers/codes vs. text)
- Can use standard drug ontologies (RxNorm)
- Standardize drug names and classes (NDF-RT)
- Auto-classify agents (more flexible, less error prone)
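The medication example above can be sketched as a normalization step that maps free-text drug entries to a standard name and class. The two lookup tables below are small hand-written stand-ins for queries against RxNorm and NDF-RT. They are assumptions for illustration, not the real services or their APIs.

```python
# Hypothetical stand-in for a drug-name lookup (the role RxNorm would play).
SYNONYMS = {
    "coumadin": "warfarin",
    "warfarin": "warfarin",
    "plavix": "clopidogrel",
    "clopidogrel": "clopidogrel",
}

# Hypothetical stand-in for a drug-class lookup (the role NDF-RT would play).
DRUG_CLASS = {
    "warfarin": "anticoagulant",
    "clopidogrel": "antiplatelet agent",
}


def normalize_drug(raw):
    """Return (standard_name, drug_class), or None if the entry cannot be mapped."""
    key = raw.strip().lower()
    name = SYNONYMS.get(key)
    if name is None:
        return None  # includes local values such as "other" or "unknown"
    return name, DRUG_CLASS[name]


def normalize_list(raw_values):
    """Split a site's drug list into mapped entries and entries needing review."""
    mapped, unresolved = [], []
    for raw in raw_values:
        result = normalize_drug(raw)
        if result is None:
            unresolved.append(raw)
        else:
            mapped.append(result)
    return mapped, unresolved
```

Deriving the class from the standardized name, rather than asking each site to record it, is what makes the classification automatic and less error prone. Locally defined values such as "other" end up in the review list instead of being silently merged.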
- PGRN Data Dictionaries
- Data dictionaries were collected from PGRN research sites
- Multiple formats: xls, pdf, txt, html, doc
- Dictionaries from three sites were chosen for this pilot study
- Terminology Standards and Formalized Metadata
- To maximize the utility of the PGRN data dictionary harmonization effort and the potential reuse of data elements, we leveraged several existing terminology and metadata standards
- SNOMED-CT [2], NDF-RT [3], NCI Thesaurus [4], LOINC [5] and RxNorm [6]
- Cancer Data Standards Repository (caDSR) [7]