
Qian Zhu, PhD; Robert R. Freimuth, PhD; Jyotishman Pathak, PhD; Matthew J. Durski, MA;

Development of infrastructure, data standards, and best practices for the curation of pharmacogenomics data



Qian Zhu, PhD; Robert R. Freimuth, PhD; Jyotishman Pathak, PhD; Matthew J. Durski, MA; Zonghui Lian, MS; H. Scott Bauer; Christopher G. Chute, MD, DrPH
Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN

References
1. http://pgrn.org/display/pgrnwebsite/PGRN+Home
2. http://www.ihtsdo.org/snomed-ct/
3. Brown SH, Elkin PL, Rosenbloom ST, Husser C, Bauer BA, Lincoln MJ, et al. VA national drug file reference terminology: a cross-institutional content coverage study. Medinfo. 2004:477–81
4. http://ncit.nci.nih.gov/
5. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33
6. http://www.nlm.nih.gov/research/umls/rxnorm/

Tooling
• Semantic Annotation Reviewer
• To facilitate review of the annotation results, a web application was developed that allowed the curator to select the best term(s) for annotation. This tooling will enable the standardization of PGRN data dictionaries to be completed more efficiently than with the more labor-intensive approaches that are used today.

Limitation and Future work
• Harmonization of value sets was out of scope for this initial study. Future work will include value set harmonization, and we anticipate adding support for this function to the infrastructure and workflows that are being developed.
• Six categories were identified in this pilot study. Additional categories are expected as data dictionaries from other PGRN sites are included in the harmonization effort.
• The harmonized data elements from this work will be mapped to existing standards when possible, or new standards will be proposed if necessary.
• This study focused on clinical data related to lab tests, medications, and disease diagnoses. Notably, none of the dictionaries included in this pilot contained elements to represent genomic data, which is clearly an area for future work.
• The standardization of genomics data is an active topic in many organizations. We are working with the HL7 Clinical Genomics work group, CDISC, and the NCI Information Representation work group to leverage existing efforts.
• Future releases of the harmonization infrastructure will include role-based functions and support for curation workflows.

Abstract
• The Pharmacogenomics Research Network (PGRN) [1] is a collaborative partnership of research groups funded by the U.S. National Institutes of Health to discover and understand how the genome contributes to an individual’s response to medication. It is contributing significantly to the scientific base of knowledge in pharmacogenomics, a trend that is expected to continue. However, because traditional biomedical research studies and clinical trials are conducted independently, common and standardized representations for data are seldom used. This leads to heterogeneity in the collected data and hinders data reuse, integration, and meta-analyses across multiple datasets.
• Curation of pharmacogenomics data sets requires that standards exist for a given terminology, ontology, or other standard representation of data. This study addresses this need by:
  • identifying core entities common to pharmacogenomics studies
  • proposing standards for representing those data
  • developing a workflow and supporting infrastructure to enable the efficient curation of metadata related to pharmacogenomics studies
• To support the curation workflow and assist the PGRN community in managing their data and related standards, a new software application has been developed. This tooling will enable the standardization of PGRN data dictionaries to be completed more efficiently than with the more labor-intensive approaches that are used today.
Methods
• The curation workflow was composed of the following steps:
  • Data pre-processing: each dictionary was reformatted, missing data were filled in, and the result was loaded into a MySQL database for harmonization.
  • Decomposition and normalization
    • Variable descriptions were split into single words, which were then reassembled into phrases.
    • Phrases were normalized using a stop word list and the UMLS Specialist Lexicon [7].
  • Semantic annotation using controlled terminologies
  • Manual review of the annotations using the Semantic Annotation Reviewer
  • Categorization based on UMLS semantic types and domain knowledge
• Decomposition and Normalization results

Conclusions
Data and metadata standards help to mitigate the problems that arise from semantic and syntactic differences between research groups. These differences are major barriers that hinder effective communication among scientists and slow the pace of advancement and discovery. The standard representations for pharmacogenomics data proposed by this study will enable the PGRN community to share and reuse their data more effectively. Furthermore, the workflow and best practices for the curation of pharmacogenomics data that are being developed for this project are generalizable to other curation efforts that require the harmonization of disparate data elements from a broad community of investigators.
Results
• Semantic Annotation Results
• Variable description decomposition and normalization pipeline
  • The length of each phrase reassembled from the mapping components (MCs) was limited to a maximum of six single words.
  • All words contained in the stop word list were removed, and MCs that consisted of more than 50% stop words were removed.
  • Verb tenses were converted to a common base form, plural nouns to singular form, and possessive nouns to base form using the LRAGR lexicon.
  • Verbs, adjectives, and adverbs were converted to nouns using the LRNOM lexicon.
• Categorization Results for 797 variables with complete annotations
• UMLS Semantic Types used for categorization

        Natural Phenomenon or Process
            Biologic Function
                Physiologic Function
                    Organism Function
                        Mental Process
                    Organ or Tissue Function
                    Cell Function
                    Molecular Function
                        Genetic Function
                Pathologic Function
                    Disease or Syndrome
                        Mental or Behavioral Dysfunction
                        Neoplastic Process
                    Cell or Molecular Dysfunction
                    Experimental Model of Disease
        Injury or Poisoning

Materials
• PGRN Data Dictionaries
  • Data dictionaries were collected from PGRN research sites.
  • Multiple formats: xls, pdf, txt, html, doc
  • Dictionaries from three sites were chosen for this pilot study.
• Terminology Standards and Formalized Metadata
  • To maximize the utility of the PGRN data dictionary harmonization effort and the potential reuse of data elements, we leveraged several existing terminologies: SNOMED-CT [2], NDF-RT [3], NCI Thesaurus [4], LOINC [5], and RxNorm [6].

7. http://www.ncbi.nlm.nih.gov/books/NBK9680/

Acknowledgements
This work was supported by the NIH/NIGMS (U19 GM61388; the Pharmacogenomic Research Network).
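The decomposition and normalization pipeline described above (phrases of at most six words, stop words removed, and phrases with more than 50% stop words discarded) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that "reassembled into phrases" means contiguous word sequences, it uses a small hypothetical stop word list, and all function names are invented for illustration. The lexicon-based steps (LRAGR and LRNOM) are not reproduced because they depend on the UMLS Specialist Lexicon data files.

```python
import re

# Hypothetical stop word list for illustration only; the list used in the
# study is not given in the poster.
STOP_WORDS = frozenset(
    {"a", "an", "and", "at", "for", "in", "of", "on", "or", "the", "to", "with"}
)
MAX_PHRASE_WORDS = 6      # maximum phrase length stated in the poster
MAX_STOP_FRACTION = 0.5   # phrases with more than 50% stop words are dropped


def split_words(description):
    """Split a variable description into lowercase single words."""
    return re.findall(r"[a-z0-9]+", description.lower())


def candidate_phrases(words, max_len=MAX_PHRASE_WORDS):
    """Reassemble words into contiguous phrases of at most max_len words.

    Treating phrases as contiguous word sequences is an assumption.
    """
    phrases = []
    for start in range(len(words)):
        last = min(start + max_len, len(words))
        for end in range(start + 1, last + 1):
            phrases.append(words[start:end])
    return phrases


def normalize(phrases, stop_words=STOP_WORDS, max_stop_fraction=MAX_STOP_FRACTION):
    """Drop stop-word-heavy phrases, strip stop words, and de-duplicate."""
    seen = set()
    kept = []
    for phrase in phrases:
        stop_count = sum(1 for word in phrase if word in stop_words)
        if stop_count / len(phrase) > max_stop_fraction:
            continue  # more than half of the phrase is stop words
        content = tuple(word for word in phrase if word not in stop_words)
        if content and content not in seen:
            seen.add(content)
            kept.append(" ".join(content))
    return kept


def decompose(description):
    """Run the sketch end to end on one variable description."""
    return normalize(candidate_phrases(split_words(description)))
```

For example, `decompose("Date of diagnosis")` returns `['date', 'date diagnosis', 'diagnosis']`. In a workflow like the one described, each resulting phrase would then be a candidate for annotation against the controlled terminologies.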
