
GSC-BRC Metadata Standards

GSC-BRC Metadata Standards. Richard H. Scheuermann, U.T. Southwestern Medical Center. Metadata Inconsistencies: each project was providing different types of metadata; no consistent nomenclature was being used; reliable comparative genomics analysis was impossible. Dengue Clinical Metadata.


Presentation Transcript


  1. GSC-BRC Metadata Standards. Richard H. Scheuermann, U.T. Southwestern Medical Center

  2. Metadata Inconsistencies
     • Each project was providing different types of metadata
     • No consistent nomenclature was being used
     • Impossible to perform reliable comparative genomics analysis

  3. Dengue Clinical Metadata

  4. Virus Isolate Information

  5. Complex Query Interface

  6. Additional Clinical Characteristics

  7. GSC-BRC Metadata Standards Working Group
     • NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
     • Goal: develop metadata standards for pathogen isolate sequencing projects

  8. Metadata Standards Process
     • Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
     • Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
     • Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
     • For each data field, provide definitions, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, examples, data categories and data providers
     • Merge subgroup core elements into a common set of core metadata fields and attributes
     • Assemble metadata fields into a semantic network
     • Harmonize the semantic network with the Ontology of Biomedical Investigation (OBI)
     • Compare, harmonize and map to other relevant initiatives, including MIGS, MIMS, BioProjects and BioSamples
     • Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
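The per-field attributes listed in the process above (definition, synonyms, allowed values, expected syntax, example, category, provider) can be pictured as a small record type. The following is only a sketch under stated assumptions: the class name `MetadataField`, the `validate` method, and the two example fields with their vocabularies are hypothetical illustrations, not the working group's actual field definitions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataField:
    """One data field, carrying the attributes named in the process slide."""
    name: str
    definition: str
    category: str                               # data category, e.g. "Specimen Isolation"
    provider: str                               # who supplies the value
    synonyms: tuple = ()
    allowed_values: Optional[frozenset] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                # expected syntax as a regular expression
    example: str = ""

    def validate(self, value: str) -> list:
        """Return a list of problems; an empty list means the value is accepted."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} is not an allowed value")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match the expected syntax")
        return problems


# Hypothetical field definitions, for illustration only.
COLLECTION_DATE = MetadataField(
    name="collection_date",
    definition="Date on which the specimen was collected",
    category="Specimen Isolation",
    provider="specimen collector",
    synonyms=("date collected",),
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
)

HOST_GENDER = MetadataField(
    name="host_gender",
    definition="Gender of the host organism the specimen came from",
    category="Host/Source Characterization",
    provider="specimen collector",
    allowed_values=frozenset({"male", "female", "unknown"}),
    example="female",
)

print(COLLECTION_DATE.validate("2011-03-15"))   # accepted: no problems reported
print(HOST_GENDER.validate("F"))                # rejected: not in the value set
```

Keeping the allowed value set and the syntax pattern on the field itself means a submission spreadsheet could be checked row by row against the same definitions that document the field.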

  9. GSC-BRC Metadata Working Groups

  10. Example Metadata

  11. Virus Core Metadata Sheet

  12. Metadata Merge
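The merge step from the process slide (find fields common to all projects in a subgroup, then combine subgroup cores) might be sketched with plain set operations. This is an illustration, not the working group's procedure: the function names, the project names, and the field lists are hypothetical, "common" is assumed to mean present in every project, and combining subgroup cores by union is likewise an assumption.

```python
def split_core_and_specific(projects):
    """Split field names into those shared by every project (core)
    and those left over for each project (project specific)."""
    field_sets = [set(fields) for fields in projects.values()]
    core = set.intersection(*field_sets) if field_sets else set()
    specific = {name: set(fields) - core for name, fields in projects.items()}
    return core, specific


def merge_subgroup_cores(subgroup_cores):
    """Combine the core sets of all pathogen subgroups (assumed: union)."""
    merged = set()
    for core in subgroup_cores.values():
        merged |= core
    return merged


# Hypothetical example projects within one pathogen subgroup.
VIRUS_PROJECTS = {
    "project_a": ["specimen ID", "collection date", "GPS location"],
    "project_b": ["specimen ID", "collection date", "symptom"],
}

virus_core, virus_specific = split_core_and_specific(VIRUS_PROJECTS)
print(sorted(virus_core))
print({name: sorted(fields) for name, fields in virus_specific.items()})
```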

  13. Network Overview
     [Diagram: semantic network overview. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations. Nodes: temporal-spatial region, type, ID, qualities, specimen source (organism or environmental), specimen isolation process, specimen, specimen collector, isolation protocol, sample processing, enriched NA sample, microorganism, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly), sequence data, data archiving process, sequence data record, GenBank ID, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data. Relations: located_in, denotes, instance_of, has_quality, has_input, has_output, has_specification, has_part, is_about.]

  14. Network Overview (continued)
     [Diagram: the same nodes and relations as slide 13, with the added labels Investigation, Specimen Isolation, Material Processing, Sequencing Assay and Data Processing.]

  15. Metadata Categories
     • Investigation
     • Host/Source Characterization
     • Specimen Isolation
     • Pathogen Detection
     • Pathogen Isolation
     • Pathogen Characterization
     • Specimen Processing
     • Sample Shipment
     • Sequencing Sample Preparation
     • Sequencing Assay
     • Data Transformation

  16. Host/Source Characterization
     [Diagram. Legend: vX = row X in virus sheet; node types independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations. Row labels: v10–v13; b14–b17, b19, b20. Nodes: temporal interval, date/time, spatial region, GPS location, geographic location, temporal-spatial region, common name, species/strain, organism, organism ID, age, gender, symptom, specimen source role, environmental material, specimen isolation procedure X. Relations: denotes, has_part, located_in, instance_of, has_quality, plays, has_input.]

  17. Specimen Isolation
     [Diagram. Row labels: v2, v3-4, v5-6, v7–v9, v15–v19; b18, b22–b30. Nodes: temporal interval, date/time, spatial region, GPS location, geographic location, temporal-spatial region, Comments, organism, ID, specimen source role, environment (marked "????"), environmental material, specimen isolation procedure X, specimen isolation procedure type, specimen X, specimen type, specimen capture role, equipment, organism part, hypothesis, specimen collector role, person, name, affiliation, isolation protocol, IRB/IACUC approval. Relations: located_in, denotes, has_part, has_quality, plays, has_input, has_output, instance_of, is_about, has_specification, has_authorization, has_affiliation.]

  18. Pathogen Detection
     [Diagram. Row labels: v15, v16, v27, v28; b21. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, pathogen detection protocol, pathogen detection method, pathogen detection process X, specimen X, specimen type, ID, amount, microorganism X, species/strain, data about pathogen presence. Relations: denotes, located_in, has_part, has_specification, instance_of, has_input, has_output, has_quality, is_about.]

  19. Pathogen Isolation
     [Diagram. Row labels: v15, v16, v26, v34. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, pathogen isolate X, pathogen type, ID, amount, pathogen isolation method, pathogen isolation protocol, pathogen isolation process X, specimen X, specimen type, microorganism X, species/strain. Relations: denotes, located_in, instance_of, has_quality, has_part, has_output, has_specification, has_input.]

  20. Pathogen Characterization
     [Diagram. Row labels: v15, v16, v27, v29–v32, v34; b2–b13. Nodes: ID, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, pathogen type, amount, pathogen isolate X, pathogen isolation method, pathogen isolation protocol, pathogen isolation process X, specimen X, specimen type, microorganism X, species/strain. Assays and their characteristics, as listed: genus/species/strain determination assay X and genus/species/strain characteristic; biological characteristic assay X and biovar characteristic; antigenic characteristic assay X and serovar characteristic; pathologic characteristic assay X and pathovar characteristic; genetic characteristic assay X and genotype characteristic; chromosome/plasmid assay X and chromosome/plasmid characteristic; antibiotic sensitivity assay X and antibody sensitivity characteristic. Relations: denotes, located_in, instance_of, has_quality, has_part, is_about, has_input, has_output, has_specification.]

  21. Specimen Processing
     [Diagram. Row labels: v15, v16, v20, v22, v23, v27; b40–b43. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, specimen X, aliquot Y, aliquoting process X, aliquoting protocol, sample set assembly process X, sample set assembly protocol, sample set X, repository deposition process X, repository specimen X, specimen repository, information record, ID, specimen type, amount, microorganism X, species/strain; example instances specimen T/aliquot U, specimen M/aliquot N, specimen A/aliquot B. Relations: denotes, located_in, has_part, instance_of, has_input, has_output, has_specification, has_quality.]

  22. Sample Shipment
     [Diagram. Row labels: v21, v24, v25. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, sample shipment protocol, sample receipt protocol, sample shipment process X, sample receipt process X, sample set X, sample set X in transit, sample set X at GSC, sample X at GSC, ID, sample type, sample set type, amount. Relations: denotes, located_in, has_part, has_specification, instance_of, has_quality, has_input, has_output.]

  23. Sequencing Sample Preparation
     [Diagram. Row labels: v15, v16, v27, v33, v35–v39; b31–b33. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, specimen X, aliquoting process X, aliquoting protocol, specimen aliquot X, NA enrichment process X, NA enrichment protocol, enriched NA sample X, NA amplification process X, NA amplification protocol, NA amplified sample X, library construction protocol, microorganism X, microorganism genomic NA, species/strain, ID, specimen type, amount. Relations: denotes, located_in, has_part, instance_of, has_input, has_output, has_specification, has_quality.]

  24. Sequencing Assay
     [Diagram. Row labels: v14, v40, v41; b34, b38. Nodes: sample material X, sample type, sample ID, template role, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, material X, reagent type, lot #, reagent role, species, person X, name, sequencing tech. role, equipment X, equipment type, serial #, signal detection role, sequencing assay X, sequencing assay type, sequencing protocol, run ID, primary data, objectives (coverage, genome type targeted, finishing). Relations: instance_of, denotes, plays, located_in, has_input, has_output, has_part, has_specification.]

  25. Data Transformations
     [Diagram. Row labels: v29–v32, v42–v47; b35–b37, b39. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, algorithm, finishing status, run ID, software, primary data, data transformations (image processing, assembly) X, sequence data, data archiving process, sequence data record, GenBank ID, data transfer protocol, species, person X, name, bioinformatics tech. role, data transformations (variant detection, serotype marker detection, gene detection), genotype data, serotype data, gene data, microorganism genomic NA, microorganism X, species/strain. Relations: denotes, located_in, has_part, has_specification, has_quality, has_input, has_output, instance_of, plays, is_about, part_of.]

  26. Generic Assay
     [Diagram. Nodes: analyte X, sample material X, sample type, sample ID, target role, quality x, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, material X, reagent type, lot #, reagent role, species, input sample material X, assay X, assay type, assay protocol, run ID, primary data, person X, name, technician role, equipment X, equipment type, serial #, signal detection role, objectives. Relations: instance_of, has_part, denotes, plays, has_quality, located_in, has_input, has_output, is_about, has_specification.]

  27. Generic Material Transformation
     [Diagram. Nodes: sample material X, sample type, sample ID, target role, quality x, GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, material X, reagent type, lot #, reagent role, species, output material X, material type, material transformation X, material transformation type, material transformation protocol, run ID, person X, name, technician role, equipment X, equipment type, serial #, signal detection role, objectives. Relations: instance_of, denotes, plays, has_quality, located_in, has_input, has_output, has_part, has_specification.]

  28. Generic Data Transformation
     [Diagram. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, software, input data, output data, data transformation X, data transformation type, algorithm, run ID, person X, name, data analyst role, material X. Relations: denotes, located_in, has_part, has_input, has_output, has_specification, instance_of, plays, is_about.]
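A semantic network of this kind can be held as (subject, relation, object) triples. The sketch below uses node and relation names that appear on the Generic Data Transformation slide, but the specific edges are assumptions: the transcript does not preserve which node connects to which, and the helper names `build_index` and `objects_of` are hypothetical.

```python
from collections import defaultdict

# Assumed edges; only the node and relation names come from the slide.
TRIPLES = [
    ("data transformation X", "instance_of", "data transformation type"),
    ("data transformation X", "has_input", "input data"),
    ("data transformation X", "has_input", "software"),
    ("data transformation X", "has_output", "output data"),
    ("data transformation X", "has_specification", "algorithm"),
    ("run ID", "denotes", "data transformation X"),
]


def build_index(triples):
    """Index triples by (subject, relation) for direct lookup."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def objects_of(index, subject, relation):
    """Return every object linked to `subject` through `relation`, in insertion order."""
    return list(index.get((subject, relation), []))


index = build_index(TRIPLES)
print(objects_of(index, "data transformation X", "has_input"))
print(objects_of(index, "data transformation X", "has_output"))
```

Storing the network as triples keeps each metadata field tied to a named relation, which is the form that an ontology such as OBI also uses.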

  29. Generic Material (IC)
     [Diagram. Nodes: GPS location, geographic location, date/time, spatial region, temporal interval, temporal-spatial region, material X, material type, ID, quality x, quality y, material Y, material Z. Relations: denotes, located_in, has_part, has_quality, instance_of.]

  30. OBI specimen creation
     [Diagram. Labels: e14–e27, e29–e33, e35–e47, e50. Nodes: specimen creation, specimen creation objective, specimen, CRID symbol, protocol, organism (for 'collecting specimen from an organism'), material entity (for 'environmental material collection'), anatomical entity ('portion of body substance' or 'portion of tissue'), individual organism identifier, infectious agent, infectious agent measurement datum, genetic characteristics information, quality, quality information content entity, growth environment, geographic location, time measurement datum, treatment, organization, human being, written name, synonym, textual entity, document. Relations: denotes, has_quality, is_about, located_in, has_participant, is_a, unfolds_in, is_duration_of, has_specified_input, has_specified_output, has_supplier, realizes, has_agent, achieves_planned_objective, is_member_of_organization, is_quality_measured_as.]

  31. Status
     • Core metadata merge process nearly complete
     • Comprehensive semantic networks developed
     • Begun the OBI harmonization process
     • Begun the MIGS/MIMS harmonization process
     • Still need to:
        • Compare, harmonize and map with BioProjects and BioSamples
        • Decide what to do about metadata fields that appear to be project specific
        • Develop metadata submission templates
        • Report process and results
