Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers. Richard H. Scheuermann, Ph.D., Department of Pathology, Division of Biomedical Informatics, U.T. Southwestern Medical Center. N01AI2008038

Presentation Transcript


  1. Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers. Richard H. Scheuermann, Ph.D., Department of Pathology, Division of Biomedical Informatics, U.T. Southwestern Medical Center. N01AI2008038, N01AI40041

  2. Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects and their Implementation in NIAID Bioinformatics Resource Centers. Richard H. Scheuermann, Ph.D., Director of Informatics, J. Craig Venter Institute. N01AI2008038, N01AI40041

  3. Genome Sequencing Centers for Infectious Disease (GSCID) • Bioinformatics Resource Centers (BRC) • www.viprbrc.org • www.fludb.org

  4. High Throughput Sequencing
     • Enabling technology
        • Epidemiology of outbreaks
        • Pathogen evolution
        • Host range restriction
        • Genetic determinants of virulence and pathogenicity
     • Metadata requirements
        • Temporal-spatial information about isolates
        • Selective pressures
        • Host species of specimen source
        • Disease severity and clinical manifestations

  5. Metadata Submission Spreadsheets [figure omitted]

  6. Complex Query Interface

  7. Metadata Inconsistencies • Each project was providing different types of metadata • No consistent nomenclature being used • Impossible to perform reliable comparative genomics analysis • Required extensive custom bioinformatics system development

  8. GSC-BRC Metadata Standards Working Group • NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and its five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) • Develop metadata standards for pathogen isolate sequencing projects • Bottom-up approach • Assemble into a semantic framework

  9. GSC-BRC Metadata Working Groups

  10. Metadata Standards Process
     • Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
     • Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
     • Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project-specific
     • For each data field, provide a common set of attributes, including definitions, synonyms, allowed value sets (preferably using controlled vocabularies) and expected syntax
     • Merge subgroup core elements into a common set of core metadata fields and attributes
     • Assemble the set of pathogen-specific and project-specific metadata fields to be used in conjunction with the core fields
     • Compare, harmonize and map to other relevant initiatives, including OBI, MIGS, MIxS, BioProjects, BioSamples (ongoing)
     • Assemble all metadata fields into a semantic network (ongoing)
     • Harmonize the semantic network with the Ontology for Biomedical Investigations (OBI)
     • Draft data submission spreadsheets to be used for all white paper and BRC-associated projects
     • Finalize the version 1.0 metadata standard and version 1.0 data submission spreadsheet
     • Beta test the version 1.0 standard with new white paper projects, collecting feedback
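The per-field attributes listed above (definition, synonyms, allowed value set, expected syntax, core versus project-specific) can be sketched as a small data structure with a validation step. This is only an illustration under stated assumptions: the class `FieldSpec`, the function `validate_record`, the field names, the allowed values and the date pattern are all hypothetical and are not taken from the working group's standard or its submission spreadsheets.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One metadata field and its attributes (hypothetical structure)."""
    name: str
    definition: str
    synonyms: Tuple[str, ...] = ()
    allowed_values: Optional[FrozenSet[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                     # regular expression, if any
    core: bool = True                                # core vs. project-specific

    def validate(self, value: str) -> List[str]:
        """Return a list of problems found for a single value (empty if none)."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{value!r} is not an allowed value")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{value!r} does not match the expected syntax")
        return problems


# Illustrative field definitions only; names, values and patterns are assumptions.
SPECS = [
    FieldSpec(
        name="specimen_collection_date",
        definition="Date on which the specimen was collected",
        synonyms=("collection date",),
        syntax=r"\d{4}-\d{2}-\d{2}",
    ),
    FieldSpec(
        name="host_health_status",
        definition="Health status of the host at the time of collection",
        allowed_values=frozenset({"healthy", "diseased", "unknown"}),
    ),
]


def validate_record(record: Dict[str, str],
                    specs: List[FieldSpec]) -> Dict[str, List[str]]:
    """Check one submitted row against the field specs.

    Returns a mapping from field name to problems; fields without problems
    are left out, so an empty dict means the record passed.
    """
    report: Dict[str, List[str]] = {}
    for spec in specs:
        if spec.name not in record:
            if spec.core:
                report[spec.name] = ["missing core field"]
            continue
        problems = spec.validate(record[spec.name])
        if problems:
            report[spec.name] = problems
    return report
```

A submission row would be passed as a plain dictionary, for example `validate_record({"specimen_collection_date": "2012-03-15", "host_health_status": "healthy"}, SPECS)`. Keeping the vocabulary and syntax in the field definition rather than in the checking code is one way to let pathogen-specific or project-specific fields be added without touching the validator.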

  11. Data Fields: Core Project Core Sample Attributes

  12. Specimen Isolation [Diagram: semantic network linking the core sample fields (CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18) to entities including specimen, specimen type, specimen source role, specimen capture role, specimen collector role, organism, organism part, environmental material, geographic location, GPS location, temporal-spatial region, specimen isolation procedure, isolation protocol, IRB/IACUC approval and affiliation, through the relations denotes, has_part, has_quality, has_disposition, located_in, has_input, has_output, has_specification, has_authorization, has_affiliation, is_about, plays and instance_of.]

  13. Metadata Processes [Diagram: semantic network spanning the Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing and Quality Assessment stages. Entities include specimen source (organism or environmental), specimen isolation process, specimen, sample processing, enriched NA sample, microorganism genomic NA, sequencing assay, reagents, equipment, technician, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record and GenBank ID, connected through has_input, has_output, has_part, has_quality, has_specification, located_in, denotes, is_about and instance_of.]
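The process chain on this slide can be sketched as subject-predicate-object triples, with a simple walk back from a sequence data record to its specimen source. This is a minimal sketch under loud assumptions: the entity and relation names come from the slide, but the specific edges below are one reading of the flattened diagram, simplified to a single input per process, and the helper `trace_provenance` is hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed wiring of the diagram; the transcript does not preserve the edges.
TRIPLES: List[Triple] = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def trace_provenance(entity: str, triples: List[Triple]) -> List[Tuple[str, str]]:
    """Walk backwards from an entity to whatever it was derived from.

    Each step finds the process whose has_output is the current entity, then
    moves to that process's has_input. Only the first producer and the first
    input are followed, which is enough for the linear chain assumed here.
    Returns (process, upstream entity) pairs, nearest first.
    """
    chain: List[Tuple[str, str]] = []
    seen = {entity}
    current = entity
    while True:
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        if not producers:
            break
        process = producers[0]
        inputs = [o for s, p, o in triples if s == process and p == "has_input"]
        if not inputs or inputs[0] in seen:  # stop on dead ends and cycles
            break
        current = inputs[0]
        seen.add(current)
        chain.append((process, current))
    return chain
```

Calling `trace_provenance("sequence data record", TRIPLES)` walks the archiving, transformation, sequencing, processing and isolation steps in turn. A real implementation would more likely store the network in an RDF or OWL toolkit and query it there; plain tuples are used here only to keep the sketch self-contained.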

  14. Outcome of Metadata Standards WG • Consistent metadata captured across GSCID • Guidance to collaborators regarding metadata expectations for sequencing and analysis services • Support more standardized BRC interface development • Harmonization with related stakeholders: Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioSample • Represented in the context of an extensible semantic framework

  15. Conclusions
     • Metadata standards for microorganism sequencing projects
        • Bottom-up approach focuses the standard on important features
        • Harmonizing with related standards from the Genomic Standards Consortium, OBO Foundry and NCBI
        • Being beta-tested by GSCIDs for adoption by all NIAID-sponsored sequencing projects
     • Utility of semantic representation
        • Identified gaps in data field list (e.g. temporal components)
        • Includes logical structure for other, project-specific data fields (extensible)
        • Identified gaps in ontology data standards (use case-driven standard development)
        • Identified commonalities in data structures (reusable)
        • Support for semantic queries and inferential analysis in future
     • Ontology-based framework is extensible
        • Sequencing => "omics"
