Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Richard H. Scheuermann, Ph.D., Department of Pathology, Division of Biomedical Informatics, U.T. Southwestern Medical Center. NIAID Bioinformatics Resource Centers: www.pathogenportal.net.

Presentation Transcript


  1. Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects. Richard H. Scheuermann, Ph.D., Department of Pathology, Division of Biomedical Informatics, U.T. Southwestern Medical Center

  2. NIAID Bioinformatics Resource Centers: www.pathogenportal.net

  3. Influenza Research Database: www.fludb.org

  4. NIAID Genome Sequencing Centers

  5. Metadata Inconsistencies
     • Each project was providing different types of metadata
     • No consistent nomenclature being used
     • Impossible to perform reliable comparative genomics analysis

  6. Dengue Clinical Metadata

  7. Complex Query Interface

  8. Additional Clinical Characteristics

  9. GSC-BRC Metadata Standards Working Group
     • NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and its five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
     • Goal: develop metadata standards for pathogen isolate sequencing projects

  10. GSC-BRC Metadata Working Groups

  11. Metadata Standards Process
     • Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
     • Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
     • Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
     • For each data field, provide:
        • definitions
        • synonyms
        • allowed value sets, preferably using controlled vocabularies
        • expected syntax
        • examples
        • data categories
        • data providers
     • Merge subgroup core elements into a common set of core metadata fields and attributes
     • Assemble metadata fields into a semantic network
     • Harmonize semantic network with the Ontology of Biomedical Investigation (OBI)
     • Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
     • Establish policies and procedures for metadata submission workflows and GenBank linkage
     • Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
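The per-field attributes listed on this slide (definition, synonyms, allowed values, expected syntax, example, data category, data provider) map naturally onto a small record type. The following is a minimal sketch under stated assumptions, not the working group's actual schema: the class name `FieldSpec`, the two example fields, their value sets, and the date pattern are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """One metadata field, described by the attributes listed on the slide."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # None means any value is allowed
    syntax: Optional[str] = None               # regular expression, if any
    example: str = ""
    category: str = ""
    provider: str = ""

    def validate(self, value: str) -> bool:
        """Return True if value passes the allowed-value and syntax checks."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Hypothetical field definitions (illustrative only, not from the slides).
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected.",
    synonyms=["sampling date", "date of isolation"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
    category="specimen isolation",
    provider="specimen collector",
)

specimen_type = FieldSpec(
    name="specimen_type",
    definition="Kind of material collected from the specimen source.",
    allowed_values={"blood", "serum", "nasal swab"},
    example="serum",
    category="specimen isolation",
    provider="specimen collector",
)

print(collection_date.validate("2011-03-15"))  # matches the date pattern
print(specimen_type.validate("saliva"))        # not in the allowed value set
```

Keeping the allowed value set and the syntax pattern on the field definition itself means a submission spreadsheet can be checked row by row against the same specification that documents the field.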

  12. Core Sample Metadata: 30 core sample metadata fields

  13. Core Project Metadata: 16 core project metadata fields

  14. Metadata Standards Process
     • Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
     • Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
     • Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
     • For each data field, provide:
        • definitions
        • synonyms
        • allowed value sets, preferably using controlled vocabularies
        • expected syntax
        • examples
        • data categories
        • data providers
     • Merge subgroup core elements into a common set of core metadata fields and attributes
     • Assemble metadata fields into a semantic network (Scheuermann)
     • Harmonize semantic network with the Ontology of Biomedical Investigation (OBI) (Stoeckert, Zheng)
     • Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
     • Establish policies and procedures for metadata submission workflows and GenBank linkage
     • Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
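The step of separating core fields from project-specific ones can be illustrated with a simple count of how many projects supply each field. This is only a sketch of the idea, assuming that "core" means "present in every project of the subgroup"; the slides do not state the criterion the working group used. The function name `split_core_fields`, the project names, and the field names are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def split_core_fields(
    projects: Dict[str, Set[str]], min_fraction: float = 1.0
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split fields into a shared core set and per-project leftovers.

    A field counts as core when at least `min_fraction` of the projects
    supply it (1.0 means every project must supply it).
    """
    counts = Counter(f for fields in projects.values() for f in fields)
    threshold = min_fraction * len(projects)
    core = {f for f, n in counts.items() if n >= threshold}
    specific = {name: fields - core for name, fields in projects.items()}
    return core, specific


# Hypothetical example metadata sets for three virus projects.
virus_projects = {
    "dengue_project": {
        "specimen_id", "collection_date", "host_species", "disease_severity",
    },
    "influenza_project": {
        "specimen_id", "collection_date", "host_species", "vaccination_status",
    },
    "arbovirus_project": {
        "specimen_id", "collection_date", "host_species", "vector_species",
    },
}

core, specific = split_core_fields(virus_projects)
print(sorted(core))
print({name: sorted(fields) for name, fields in specific.items()})
```

Lowering `min_fraction` would admit fields that most, but not all, projects report, which is one way to surface candidates for discussion before the subgroup core sets are merged.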

  15. Specimen Isolation [semantic network diagram; layout lost in extraction]
     • Node labels: temporal interval, date/time, spatial region, GPS location, temporal-spatial region, geographic location, Comments, organism, ID, specimen source role, environment, environmental material, specimen isolation procedure X, specimen X, specimen type, specimen capture role, equipment, organism part, hypothesis, specimen collector role, person, isolation protocol, IRB/IACUC approval, specimen isolation procedure type, affiliation name, and one node marked "????"
     • Relation labels: denotes, has_part, located_in, has_quality, plays, has_input, has_output, instance_of, is_about, has_specification, has_authorization, has_affiliation
     • Field markers: v2, v3-4, v5-6, v7, v8, v9, v15, v16, v17, v18, v19; b18, b22, b23, b24, b25, b26, b27, b28, b29, b30

  16. Metadata Processes [semantic network diagram; layout lost in extraction]
     • Process stages: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing
     • Node labels: temporal-spatial region, type, ID, qualities, specimen source (organism or environmental), specimen isolation process, specimen, specimen collector, isolation protocol, sample processing, enriched NA sample, microorganism genomic NA, microorganism, sequencing assay, input sample, reagents, technician, equipment, primary data, data transformations (image processing, assembly), sequence data, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID
     • Relation labels: located_in, denotes, instance_of, has_quality, has_input, has_output, has_specification, has_part, is_about
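A semantic network of this kind can be stored as subject-predicate-object triples and queried by following `has_input` and `has_output` links. The sketch below is an assumption-laden illustration: the node and relation names come from the slide, but the pairing of subjects with objects is read off the flattened diagram and may not match the original figure, and the functions `producer_of`, `inputs_of`, and `trace_provenance` are hypothetical helpers. For simplicity it follows only the first input of each process, although the diagram also lists reagents as inputs.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumed process chain; the subject/object pairing is a reading of the slide.
TRIPLES = [
    Triple("specimen isolation process", "has_input", "specimen source"),
    Triple("specimen isolation process", "has_output", "specimen"),
    Triple("sample processing", "has_input", "specimen"),
    Triple("sample processing", "has_output", "enriched NA sample"),
    Triple("sequencing assay", "has_input", "enriched NA sample"),
    Triple("sequencing assay", "has_output", "primary data"),
    Triple("assembly", "has_input", "primary data"),
    Triple("assembly", "has_output", "sequence data"),
]


def producer_of(entity: str) -> Optional[str]:
    """Return the process whose output is `entity`, if there is one."""
    for t in TRIPLES:
        if t.predicate == "has_output" and t.obj == entity:
            return t.subject
    return None


def inputs_of(process: str) -> List[str]:
    """Return every entity the process takes as input."""
    return [
        t.obj for t in TRIPLES
        if t.subject == process and t.predicate == "has_input"
    ]


def trace_provenance(entity: str) -> List[str]:
    """Walk backwards from an entity to its origin via output/input links."""
    chain = [entity]
    seen = {entity}
    current = entity
    while True:
        process = producer_of(current)
        if process is None or process in seen:
            break
        chain.append(process)
        seen.add(process)
        inputs = inputs_of(process)
        if not inputs or inputs[0] in seen:
            break
        current = inputs[0]  # simplification: follow the first input only
        chain.append(current)
        seen.add(current)
    return chain


for step in trace_provenance("sequence data"):
    print(step)
```

A query like this is the kind of inferential analysis a semantic representation makes possible: the link from a sequence back to its specimen source is never stored directly but is recovered from the chain of processes.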

  17. Core-Project

  18. Core-Specimen

  19. Generic Assay [semantic network diagram; layout lost in extraction]
     • Node labels: analyte X, sample type, sample material X, sample ID, input sample material X, target role, quality x, species, reagent type, material X, lot #, reagent role, assay X, assay type, assay protocol, objectives, run ID, primary data, person X, name, technician role, equipment X, equipment type, serial #, signal detection role, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region
     • Relation labels: instance_of, has_part, denotes, plays, has_quality, located_in, has_input, has_output, is_about, has_specification

  20. Generic Material Transformation [semantic network diagram; layout lost in extraction]
     • Node labels: sample type, sample material X, sample ID, target role, quality x, species, reagent type, material X, lot #, reagent role, material transformation X, material transformation type, material transformation protocol, objectives, run ID, output material X, material type, person X, name, technician role, equipment X, equipment type, serial #, signal detection role, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region
     • Relation labels: instance_of, denotes, plays, has_quality, located_in, has_input, has_output, has_part, has_specification

  21. Generic Data Transformation [semantic network diagram; layout lost in extraction]
     • Node labels: input data, output data, data transformation X, data transformation type, algorithm, software, run ID, person X, name, data analyst role, material X, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region
     • Relation labels: denotes, located_in, has_part, has_input, has_output, has_specification, instance_of, plays, is_about

  22. Generic Material (IC) [semantic network diagram; layout lost in extraction]
     • Node labels: material X, material Y, material Z, material type, ID, quality x, quality y, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region
     • Relation labels: denotes, located_in, has_part, has_quality, instance_of

  23. Conclusions
     • Utility of semantic representation
        • Identified gaps in data field list (e.g. temporal components)
        • Identified gaps in ontology data standards (use case-driven standard development)
        • Identified commonalities in data structures (reusable)
        • Support for semantic queries and inferential analysis in future
     • Two flavors of MIBBI
        • Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
     • OBI-based framework is re-usable
        • Sequencing => "omics"
     • Practical issues about implementation strategies
     • Challenge of using ontologies for preferred value sets
        • Can be large
        • May not directly match common language
