BRC2011 Session #5 – Data Standards and Metadata

Session chair: Richard Scheuermann (ViPR & IRD)


Session #5 - Outline

  • Motivation

  • Opportunities, Challenges and Talking Points

    • minimum information checklists

    • ontology-based value sets

    • use cases for metadata

    • SOPs for data & metadata acquisition

  • Ontology for Biomedical Investigations (OBI) – Bjoern Peters

  • Infectious Disease Ontology and extensions – Lindsay Cowell

  • GSCID-BRC Metadata Working Group efforts

  • Open discussion


Why Data Standards

Interoperability - the ability to exchange information between people, organizations, machines

Comparability - the ability to ascertain the equivalence of data from different sources

Data Quality – assess the completeness, accuracy and precision of the data

Dependability – ensures that you get what you expect from a database query

Accurate Statistical Analysis

Inference


What Data Standards

  • Minimum Information Sets – what needs to be described

  • Structured Vocabulary/Ontology – how to describe them

    • Term strings – unique identifiers

    • Definitions - what terms mean

    • Syntax - how terms are used

  • Semantics - how the components relate to each other


Session #5 – Challenges

  • Status of relevant data standards

    • Few data standards have been widely adopted by the infectious diseases community

    • Some standards are being developed without engagement of all relevant stakeholders

    • If we drive standards development, how do we get broad adoption?

  • Adoption of data standards by data providers

    • Even if vocabulary standards are available, how do we get the broader community to use them?

    • How do we educate them to use the data standards accurately?

    • How do we keep the barrier low for getting required metadata in a standard format?

  • Technical challenges

    • Usability is constrained by the spreadsheet interface

    • Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists

    • While web-based GUI smart forms are good for single submissions, it is difficult to design them to scale

  • Need for quality control and curation

    • If data standards are not enforced, mapping to standards may be required

    • Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)

    • Not all tasks in metadata collection lend themselves to automation

    • Data entry quality control mechanisms are especially limited because of spreadsheet functionality

    • Quality control and curation could require 1-2 FTEs, which are not budgeted

  • Compliance with HIPAA and other privacy regulations

    • PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues

  • Special cases

    • Metadata for genomes for NCBI bulk submission and non-unique taxon IDs

    • Metadata for growth conditions to be used with transcript datasets

    • Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions

  • How do we effectively exploit standardized data and metadata?
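
The homonym and synonym problems listed under quality control and curation (Turkey vs turkey, Puerto Rico and PR) are commonly handled by mapping free-text entries to one preferred term per metadata field. The sketch below is a minimal illustration under stated assumptions: the lookup tables, the `VOCABULARIES` dictionary and the `normalize` helper are hypothetical and are not taken from any BRC or GSCID system.

```python
from typing import Optional

# Hypothetical synonym table: each accepted spelling maps to one preferred term.
COUNTRY_SYNONYMS = {
    "puerto rico": "Puerto Rico",
    "pr": "Puerto Rico",
    "turkey": "Turkey",
}

# Hypothetical host table; here "turkey" is the bird, not the country.
HOST_SYNONYMS = {
    "turkey": "Meleagris gallopavo",
    "wild turkey": "Meleagris gallopavo",
}

# The field name selects the vocabulary, which is what resolves the homonym.
VOCABULARIES = {"country": COUNTRY_SYNONYMS, "host": HOST_SYNONYMS}


def normalize(field: str, raw_value: str) -> Optional[str]:
    """Return the preferred term for raw_value in the given field, or None.

    None signals that the value is not in the vocabulary and needs curation.
    """
    vocabulary = VOCABULARIES[field]
    # Lowercase and collapse whitespace so "  PR " and "pr" match the same key.
    key = " ".join(raw_value.strip().lower().split())
    return vocabulary.get(key)
```

Because the lookup is keyed by field, the same string can map to a country in one column and to a host species in another. Values that return `None` would still need manual curation, which is consistent with the point that not all metadata tasks lend themselves to automation.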


Session #5 – Opportunities

  • Existing relevant ontologies are in decent shape – GO, IDO, OBI

  • Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets

  • GSCID-BRC Metadata Working Group

  • Leverage and harmonize with MIGS/MIMS

  • We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable.

  • We are in the position to drive standards adoption

  • The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?

  • Ontology-driven integration (GMOD, Population biology)

  • Small sequencing centers

    • Offer community a standard metadata template for isolates

    • Bring your own data and metadata to PATRIC for annotation, analysis, long-term metadata storage and dissemination

  • Develop additional metadata standards and collect, store, and share additional metadata

  • More efficient encoding of things like alignments


Presentations

Ontology for Biomedical Investigations (OBI) – Bjoern Peters

Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell

GSCID-BRC Metadata Working Group


GSCID-BRC Metadata Working Group

  • Working group established to define common metadata standard for pathogen isolate sequencing projects

  • Collaboration between BRCs, GSCIDs and NIAID

  • Process

    • Collect spreadsheets, metadata examples, and previous submissions from sequencing projects

    • Core metadata fields collected by virus, bacteria and eukaryote subgroups

    • For each metadata field, propose:

      • preferred term

      • definition

      • synonyms

      • allowed values based on controlled vocabularies

      • preferred syntax

      • responsible provider

      • data category

      • examples

    • Merge recommendations from subgroups into a common core metadata using an OBI-based semantic framework

    • Develop recommendations for project-specific and pathogen-specific metadata fields

    • Harmonize with other relevant standards (MIGS/MIMS, IDO)

    • Establish policies and procedures for metadata submission workflows and GenBank linkage
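
The per-field attributes proposed in the process above (preferred term, definition, synonyms, allowed values, preferred syntax, responsible provider, data category, examples) can be captured in a small record type. This is only a sketch under stated assumptions: the `MetadataField` class, its `accepts` method and the example entry are hypothetical, and the example term, definition and value set are illustrative rather than the working group's actual recommendations.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataField:
    """One core metadata field, described by the attributes listed above."""

    preferred_term: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    # None means free text; a list means a controlled vocabulary.
    allowed_values: Optional[List[str]] = None
    preferred_syntax: str = ""
    responsible_provider: str = ""
    data_category: str = ""
    examples: List[str] = field(default_factory=list)

    def accepts(self, value: str) -> bool:
        """Check a submitted value against the controlled vocabulary, if any."""
        if self.allowed_values is None:
            return True
        return value in self.allowed_values


# Hypothetical entry; every value below is illustrative only.
specimen_type = MetadataField(
    preferred_term="specimen type",
    definition="The kind of material collected from the specimen source.",
    synonyms=["sample type"],
    allowed_values=["blood", "nasal swab", "serum"],
    preferred_syntax="lowercase term from the value set",
    responsible_provider="specimen collector",
    data_category="Specimen Isolation",
    examples=["nasal swab"],
)
```

With such a record, `specimen_type.accepts("nasal swab")` returns `True`, while a value outside the value set is rejected and can be routed to curation.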


Core Metadata Examples


Network Overview

[Diagram: semantic network of a pathogen sequencing project. Nodes: specimen source (organism or environmental), specimen isolation process, specimen, specimen collector, isolation protocol, sample processing, enriched NA sample, microorganism, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly), sequence data, data archiving process, sequence data record, GenBank ID, data transformations (variant detection, serotype marker detection, gene detection), genotype/serotype/gene data; each with type, ID and qualities. Relations: located_in, instance_of, has_quality, has_input, has_output, has_part, has_specification, is_about, denotes. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; relations in italics.]


[Diagram: the network overview partitioned into the categories Investigation, Specimen Isolation, Material Processing, Sequencing Assay and Data Processing.]


[Diagram: the detailed network annotated with core metadata fields, where vX refers to row X in the virus sheet. Rows shown: v2, v3-4, v5-6, v7, v8, v10, v11, v12, v13, v15, v16, v22, v23, v24, v25, v27, v29, v30, v31, v32, v40, v42, v43, v44, v45, v46. Nodes added relative to the overview: organism with common name, species/strain, ID, age, gender, symptom; environmental material; specimen source role, specimen capture role, specimen collector role; person with affiliation and name; NA enrichment process and protocol; cDNA synthesis process and protocol; cDNA sample; sample material with template role; material with reagent role; sequencing tech. role; signal detection role; sequencing protocol; data transfer protocol; software; algorithm; GPS location, geographic location, date/time, amount. Additional relations: plays, has_affiliation. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; relations in italics.]


Metadata Categories

Investigation

Specimen Isolation

Specimen Processing

Sample Shipment

Pathogen Detection & Isolation

Sequencing Sample Preparation

Sequencing Assay

Data Transformation


Specimen Isolation

[Diagram: specimen isolation procedure X (instance of a specimen isolation procedure type), where vX refers to row X in the virus sheet. Rows shown: v2, v3-4, v5-6, v7, v8, v9, v10, v11, v12, v13, v15, v16, v17, v18, v19, v27. Nodes: organism (ID, common name, species/strain, age, gender, symptom), organism part, environmental material, specimen source role, specimen X, specimen type, equipment with specimen capture role, person with specimen collector role, affiliation, name, isolation protocol, IRB/IACUC approval, hypothesis, microorganism, comments, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. One node is labelled "????" in the original. Relations: has_input, has_output, has_part, has_specification, has_authorization, has_affiliation, has_quality, instance_of, is_about, plays, located_in, denotes.]


Specimen Processing

[Diagram: aliquoting process X and sample set assembly process X, each with its protocol. Rows shown: v15, v16, v20, v22, v23, v24, v27. Nodes: specimen X, aliquot Y, sample set X (with example members specimen A, aliquot B, specimen M, aliquot N, specimen T, aliquot U), microorganism X, species/strain, specimen type, amount, ID, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, has_quality, instance_of, located_in, denotes.]


Sample Shipment

[Diagram: sample shipment process X and sample receipt process X, each with its protocol. Rows shown: v21, v23, v24, v25. Nodes: sample set X, sample set X in transit, sample set X at GSC, sample X at GSC, sample type, sample set type, amount, ID, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, has_quality, instance_of, located_in, denotes.]


Pathogen Detection & Isolation

[Diagram: pathogen detection process X and pathogen isolation process X. Rows shown: v15, v16, v26, v27, v28, v34. Nodes: specimen X, specimen type, pathogen detection protocol, pathogen detection method, pathogen isolation method, pathogen isolate X, pathogen type, microorganism X, species/strain, data about pathogen presence, amount, ID, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, has_quality, instance_of, is_about, located_in, denotes.]


Sequencing Sample Preparation

[Diagram: aliquoting process X, NA enrichment process X and cDNA synthesis process X, each with its protocol. Rows shown: v15, v16, v27, v33, v35, v36, v37, v38, v39. Nodes: specimen X, specimen aliquot X, enriched NA sample X, cDNA sample X, microorganism genomic NA, microorganism X, species/strain, specimen type, amount, ID, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, has_quality, instance_of, located_in, denotes.]


Sequencing Assay

[Diagram: sequencing assay X (instance of a sequencing assay type). Rows shown: v14, v40, v41. Nodes: sample material X with sample ID, sample type and template role; reagent material X with lot #, reagent type and reagent role; person X with name and sequencing tech. role; equipment X with serial #, equipment type and signal detection role; sequencing protocol; objectives (coverage, genome type targeted); run ID; primary data; species; plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, instance_of, plays, located_in, denotes.]


Data Transformations

[Diagram: data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection) and the data archiving process. Rows shown: v29, v30, v31, v32, v42, v43, v44, v45, v46, v47. Nodes: primary data, sequence data, sequence data record, GenBank ID, genotype data, serotype data, gene data, software, algorithm, run ID, data transfer protocol, person X with name and bioinformatics tech. role, microorganism genomic NA, microorganism X, species, species/strain, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, part_of, has_specification, instance_of, is_about, plays, located_in, denotes.]


Investigation

[Diagram: part hierarchy of an investigation. Nodes: investigation, study design, documenting, study design execution, objective specification, information content entity, specimen creation, specimen preparation for assay, sequencing assay, data transformation. Relations: has_part, has_specified_input. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; relations in italics.]


Generic Assay

[Diagram: generic pattern for assay X (instance of an assay type). Nodes: input sample material X with sample ID, sample type and target role; analyte X with quality x; reagent material X with lot #, reagent type and reagent role; person X with name and technician role; equipment X with serial #, equipment type and signal detection role; assay protocol; objectives; run ID; primary data; species; plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_quality, has_specification, instance_of, is_about, plays, located_in, denotes.]


Generic Material Transformation

[Diagram: generic pattern for material transformation X (instance of a material transformation type). Nodes: sample material X with sample ID, sample type, target role and quality x; reagent material X with lot #, reagent type and reagent role; output material X with sample ID, material type and quality x; person X with name and technician role; equipment X with serial #, equipment type and signal detection role; material transformation protocol; objectives; run ID; species; plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_quality, has_specification, instance_of, plays, located_in, denotes.]


Generic Data Transformation

[Diagram: generic pattern for data transformation X (instance of a data transformation type). Nodes: input data, output data, material X, software, algorithm, run ID, person X with name and data analyst role, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_input, has_output, has_part, has_specification, instance_of, is_about, plays, located_in, denotes.]


Generic Material (IC)

[Diagram: generic pattern for material X (instance of a material type). Nodes: ID, quality x, quality y, parts material Y and material Z, plus GPS location, geographic location, date/time, spatial region, temporal interval and temporal-spatial region. Relations: has_part, has_quality, instance_of, located_in, denotes.]


Discussion Points

  • MIBBI may not be sufficient

    • Does not distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database

    • Lacks a semantic framework

  • OBI-based framework is re-usable

    • Sequencing => “omics”

  • Challenge of using ontologies for preferred value sets

    • Can be large

    • May not directly match common language

  • Value of defining the semantic framework

    • Appropriate relations are retained

    • How can we take advantage of the framework for semantic query and inferential analysis?

  • Practical issues about implementation strategies
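
One way to picture the semantic query question is to store the relations used in the network diagrams (has_input, has_output, denotes) as subject-relation-object triples and follow them backwards from a GenBank ID. The sketch below is a minimal illustration, not the working group's implementation: the instance names, the exact chain of edges, and the `upstream` and `denoted_by` helpers are assumptions made for this example.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical instance data; node names and edges are illustrative only.
TRIPLES: List[Triple] = [
    ("specimen isolation process X", "has_input", "organism X"),
    ("specimen isolation process X", "has_output", "specimen X"),
    ("sample processing X", "has_input", "specimen X"),
    ("sample processing X", "has_output", "enriched NA sample X"),
    ("sequencing assay X", "has_input", "enriched NA sample X"),
    ("sequencing assay X", "has_output", "primary data X"),
    ("assembly X", "has_input", "primary data X"),
    ("assembly X", "has_output", "sequence data X"),
    ("data archiving process X", "has_input", "sequence data X"),
    ("data archiving process X", "has_output", "sequence data record X"),
    ("GenBank ID X", "denotes", "sequence data record X"),
]


def denoted_by(identifier: str, triples: List[Triple]) -> Set[str]:
    """Return the entities that an identifier denotes."""
    return {o for s, p, o in triples if s == identifier and p == "denotes"}


def upstream(entity: str, triples: List[Triple]) -> Set[str]:
    """Return every entity the given entity was derived from.

    Inference rule (an assumption of this sketch): E is derived from I when
    some process has_output E and has_input I, applied transitively.
    """
    found: Set[str] = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        # Processes that produced the current entity.
        producers = {s for s, p, o in triples if p == "has_output" and o == current}
        for s, p, o in triples:
            if s in producers and p == "has_input" and o not in found:
                found.add(o)
                frontier.append(o)
    return found
```

Starting from `denoted_by("GenBank ID X", TRIPLES)` and passing the result to `upstream` walks the record back through the sequence data, the primary data and the samples to the source organism. A production system would more likely use an RDF store and a query language rather than a Python list, but the retained relations are what make this kind of inference possible.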