Standardization of Data Dictionaries: A Case Study from the Pharmacogenomics Research Network

  • Pharmacogenomics is a multidisciplinary science that has also become data-intensive. As such, it requires increasingly clear annotation and representation of phenotypes to support data integration and cross-database analyses. However, traditional biomedical research studies and clinical trials are conducted independently, and common, standardized representations for data are seldom used. This leads to heterogeneity in the collected data and hinders data reuse, integration, and meta-analyses across multiple datasets. We propose a codification of standardized phenotype definitions and relationships, in coordination with other established government-funded efforts.
  • In this paper, we report the results from a preliminary standardization effort for pharmacogenomics metadata from the Pharmacogenomics Research Network (PGRN) [1]:
  • Significant heterogeneity in data representation
  • Methods for normalization and semantic annotation
  • Results from the initial analysis of dictionary harmonization
  • Identification of overlaps among dictionaries and gaps in standard terminologies and metadata
  • Demonstration of the need for pharmacogenomics data standards
  • Data dictionaries are collections of variables
    • Often constructed for a particular study
    • May contain both common and study-specific variables
    • There is no consistent format for data dictionaries
  • Differences in data collection or representation can complicate (or prevent) data integration and analysis
    • Differences in format (require transformations)
    • Differences in semantic meaning
    • Differences in data values
  • Data standardization is often achieved by harmonizing data dictionaries and by using standard elements to represent data
    • Standardized representations also enable secondary use
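The format differences listed above can be illustrated with a minimal Python sketch. It assumes two hypothetical sites that describe the same variable with different column names and value delimiters. The site names, the column names, and the `Variable` record are illustrative assumptions, not the PGRN dictionary format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Common record for one data dictionary entry (hypothetical schema)."""
    site: str
    name: str
    description: str
    permissible_values: tuple


# Hypothetical per-site column names, mapped onto the common schema.
COLUMN_MAPS = {
    "site_a": {"name": "var_name", "description": "var_desc", "values": "codes"},
    "site_b": {"name": "Field", "description": "Label", "values": "Allowed Values"},
}


def normalize_entry(site, row):
    """Transform one site-specific dictionary row into a Variable."""
    cols = COLUMN_MAPS[site]
    raw_values = row.get(cols["values"], "")
    # Accept either ';' or ',' as the delimiter and trim whitespace.
    values = tuple(
        v.strip() for v in raw_values.replace(";", ",").split(",") if v.strip()
    )
    return Variable(
        site=site,
        name=row[cols["name"]].strip().lower(),
        description=row[cols["description"]].strip(),
        permissible_values=values,
    )


row_a = {"var_name": "SEX", "var_desc": "Sex of participant", "codes": "1; 2"}
row_b = {"Field": "sex", "Label": "Patient sex", "Allowed Values": "M, F, Unknown"}

var_a = normalize_entry("site_a", row_a)
var_b = normalize_entry("site_b", row_b)
```

After this step both entries share one structure, but their permissible values still differ (`1`/`2` versus `M`/`F`/`Unknown`). That remaining gap is a difference in data values and semantic meaning, which a format transformation alone does not resolve.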


  • 1.
  • 2.
  • 3. Brown SH, Elkin PL, Rosenbloom ST, Husser C, Bauer BA, Lincoln MJ, et al. VA National Drug File Reference Terminology: a cross-institutional content coverage study. Medinfo. 2004:477–481.
  • 5. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33.


  • The authors thank Zonghui Lian and Scott Bauer for IT support, and Donna Ihrke for participation in the initial data analysis.
  • The authors thank the Pharmacogenomics Research Network (PGRN) for supporting this work (PHONT; U19-GM061388).

Qian Zhu, PhD; Robert R. Freimuth, PhD; Matthew J. Durski, MA; Jyotishman Pathak, PhD; Christopher G. Chute, MD, DrPH

Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN


  • Data and metadata standards help to mitigate the problems that arise from semantic and syntactic differences between research groups, which are major barriers that hinder effective communication among scientists and that slow the pace of advancement and discovery.
  • This work describes the initial standardization of data dictionaries from pharmacogenomics research sites. Our results demonstrate that there is a significant amount of variability in how data is represented among PGRN sites.
  • Existing standards, including both structured data elements and controlled terminologies, contain many of the concepts that are needed to represent pharmacogenomic data, although some extensions to those terminologies are required.
  • This pilot study demonstrates that a broader harmonization effort would be successful in reducing the redundancy and heterogeneity in how data is represented within the PGRN.
  • Future work
    • The data dictionaries used in this study did not contain genetic data. This represents an important area for data integration; data and terminology standards are needed.
    • Standard clinical data models will be evaluated for use as PGRN standards.
  • Data normalization results
  • Variables mapped to caDSR Data Elements (DEs):
  • Semantic Annotations from NCBO Bioportal:
    • Of 629 variables, 595 contained sufficient descriptions to enable semantic annotation
      • Semantic annotations were reviewed to determine if the annotations captured the semantics of the variable completely, partially, or not at all
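As a quick worked check of the counts above, the following Python sketch computes the share of variables that could be annotated. Only the two counts (629 and 595) come from the text; the variable names are illustrative.

```python
total_variables = 629   # variables considered for semantic annotation
annotatable = 595       # variables with sufficient descriptions

not_annotatable = total_variables - annotatable
coverage = annotatable / total_variables

print(f"{annotatable} of {total_variables} variables ({coverage:.1%}) could be annotated")
print(f"{not_annotatable} variables lacked a sufficient description")
```

This says nothing about how many annotations were judged complete, partial, or absent on review; it only covers the variables that were eligible for annotation.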

The variety of disease phenotypes studied in the PGRN, as well as differences in the clinical systems in use at each PGRN site, leads to data that is heterogeneous, non-standardized, and institution-specific. This not only hinders aggregation of data among sites that are collaborating on a given study, but also complicates or prevents secondary use of the data (e.g., in meta-analyses).

  • Harmonization process for PGRN data dictionaries
    • Variables were manually grouped into 5 categories
    • Variables were manually mapped, when possible, to existing elements in the caDSR
    • Variables were annotated with terms from 5 standard terminologies
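Once variables have been mapped to shared data elements, overlaps among dictionaries and gaps in the standards can be read off the mapping. The Python sketch below shows one way to do that, assuming the manual mapping is available as a simple lookup. The site names, variable names, and `DE:` identifiers are hypothetical placeholders, not real caDSR identifiers.

```python
from collections import defaultdict

# Hypothetical mappings: (site, local variable name) -> common data element ID.
# None marks a variable with no matching element (a gap in the standard).
mappings = {
    ("site_a", "sex"): "DE:0001",
    ("site_a", "dob"): "DE:0002",
    ("site_a", "study_arm"): None,
    ("site_b", "gender"): "DE:0001",
    ("site_b", "birth_date"): "DE:0002",
    ("site_c", "patient_sex"): "DE:0001",
    ("site_c", "smoking"): "DE:0003",
}


def find_overlaps_and_gaps(mappings):
    """Return data elements shared by more than one site, plus unmapped variables."""
    sites_by_element = defaultdict(set)
    gaps = []
    for (site, variable), element in mappings.items():
        if element is None:
            gaps.append((site, variable))
        else:
            sites_by_element[element].add(site)
    # An overlap is a data element used by at least two sites.
    overlaps = {
        element: sorted(sites)
        for element, sites in sites_by_element.items()
        if len(sites) > 1
    }
    return overlaps, gaps


overlaps, gaps = find_overlaps_and_gaps(mappings)
```

In this toy data, `DE:0001` is shared by all three sites even though each site uses a different local name, which is the kind of redundancy that harmonization aims to surface.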

~ 1/3 of the variables were selected for the initial analysis

Example: Medication History, Drug List

  • Different permissible values
    • Content and meaning of values
    • Local meanings ("other", "unknown")
    • Could also have different representations (numbers/codes vs. text)
  • Can use standard drug ontologies (RxNorm)
    • Standardize drug names and classes (NDF-RT)
    • Auto-classify agents (more flexible, less error prone)
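The drug list example can be sketched in Python as follows. This is a minimal illustration that assumes small hard-coded lookup tables; a real pipeline would draw drug names from RxNorm and classes from NDF-RT. The table contents, the function name `normalize_drug`, and the status labels are all assumptions made for this example.

```python
# Hypothetical lookup tables standing in for RxNorm (names) and NDF-RT (classes).
SYNONYM_TO_INGREDIENT = {
    "coumadin": "warfarin",
    "warfarin": "warfarin",
    "zocor": "simvastatin",
    "simvastatin": "simvastatin",
}
INGREDIENT_TO_CLASS = {
    "warfarin": "anticoagulant",
    "simvastatin": "statin",
}
# Values whose meaning is local to a site and cannot be mapped to a drug.
LOCAL_NON_DRUG_VALUES = {"other", "unknown", "none"}


def normalize_drug(value):
    """Map a free-text drug value to a standard ingredient and drug class."""
    key = value.strip().lower()
    if key in LOCAL_NON_DRUG_VALUES:
        return {"raw": value, "ingredient": None, "drug_class": None,
                "status": "local_value"}
    ingredient = SYNONYM_TO_INGREDIENT.get(key)
    if ingredient is None:
        # Not in the lookup table: flag for manual review instead of guessing.
        return {"raw": value, "ingredient": None, "drug_class": None,
                "status": "unmapped"}
    return {"raw": value, "ingredient": ingredient,
            "drug_class": INGREDIENT_TO_CLASS.get(ingredient),
            "status": "mapped"}


site_values = [" Coumadin ", "warfarin", "Zocor", "Other", "aspirin"]
normalized = [normalize_drug(v) for v in site_values]
```

Deriving the class from the ingredient, rather than having each site record it by hand, is what makes classification automatic: updating one table reclassifies every dataset that uses it.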


  • PGRN Data Dictionaries
    • Data dictionaries were collected from PGRN research sites
    • Multiple formats: xls, pdf, txt, html, doc
    • Dictionaries from three sites were chosen for this pilot study
  • Terminology Standards and Formalized Metadata
  • To maximize the utility of the PGRN data dictionary harmonization effort and the potential reuse of data elements, we leveraged several existing terminology and metadata standards
  • SNOMED-CT [2], NDF-RT [3], NCI Thesaurus [4], LOINC [5] and RxNorm [6]
  • Cancer Data Standards Repository (caDSR) [7]