Ontologies and Biomedicine

Ontologies and Biomedicine What is the "right" amount of semantics?

Ontologies and Biomedicine The “right” amount of semantics depends on what you want to do with it

Ontologies and Biomedicine Research is based on inference from what is known, and therefore it demands rigor

Ontologies and Biomedicine Without rigor, we won’t—know what we know, or where to find it, or what to infer from it.

Semantic Spectrum Ambiguous Logical and precise Natural Language Computable Ontology Highly expressive Less expressive

Ad hoc tagging approach • Let the users defined words and phrases • Foregoes the use of an expertly curated vocabulary or ontology. • Fast and distributed approach yields a vast amount of content • No recruitment and training of people to maintain the ontology is required. • No recruitment and training of annotators to interpret the material is required.

Ad hoc tagging approach • Tagging approach places the burden of interpretation and classification on every end user • Overall this is more costly and wasteful • Is inappropriate in the scientific domain • The problem is not about people communicating. It is about computers and HCI.

Build, apply, and use • Ontology captures current scientific theory that seeks to explain all of the existing evidence and is used to draw inferences and make predictions • Acts like a review • Requires curators who are experts in both the science and logic • Ontology application is the real bottleneck • But overall is less costly and wasteful

Univocity: Terms should have the same meanings on every occasion of use • Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes. • Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds. • Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level • Intelligible Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined • Reality Based: When building or maintaining an ontology, always think carefully at how classes relate to instances in reality • Distinguish Classes and Instances: What is necessarily true for instances is not necessarily true for classes

Annotation bottleneck • An active lab can easily generate 10-100GB of data per month, and it is very difficult to manage data on this scale. • Even the best analytic schemes will be for naught if we cannot find our data. • And the data is complex • Yet, the annotation effort required will be utterly wasted if it cannot be reliably computed upon.

3-dimensions Protein function Cell type Tissue Stage Cellular component Organism And more… Implies numerous “light” ontologies

3-dimensions Protein function Cell type Tissue Stage Cellular anatomy Organism And more… Plus all of the relations between these elements Or it implies a single complex one

Practicalities • The ontology should be robust or the annotator’s time is wasted • Research won’t wait, data must be annotated at the rate at which it is generated • Complex ontologies are much more difficult to get right than lighter ones • Light ontologies are easier to build and maintain • Complex ontologies can be built from lighter ones

A “successful” case study Gene Ontology

The aims of GO • To develop comprehensive shared vocabularies of terms describing aspects of molecular biology. • To describe the gene products held in each contributing model organism database. • To provide a scientific resource for access to the vocabularies, the annotations, and associated data. • To provide a software resource to assist in curation of GO term assignments to biological objects.

The primary strength of the GO • The GO covers three domains of biology • Molecular Function • Biological Process • Cellular Component • These are “precisely defined” axes of classification

The breakdown of work • Task 1 • Building the ontology: a computable description of the biological world • Task 2 • Describing your gene product—annotation • Biological process • Molecular function • Cellular localization

The early key decisions • The vocabulary itself requires a serious and ongoing effort. • Carefully define every concept • Initially keep things as simple as possible and only use a minimally sufficient data representation. • Focus initially on molecular aspects that are shared between many organisms.

GO databases: distributed and centralized • Support cross-database queries • By having a mutual understanding of the definition and meaning of any word used to describe a gene product • Provide database access to a common repository of annotations • By submitting a summary of gene products that have been annotated

GO data FTP GO CVS Scripts HTTPD Anonymous CVS

Many Scripts GO CVS AmiGO GO Database

GODatabase.org • Hits = 77,012 • Visits = 14,063 • Sites = 6,638 • Averages per week

Number of links to a site: as reported by Google www.geneontology.org 7,240 www.godatabase.org 33 obo.sourceforge.net 10 song.sourceforge.net 6 genome.ucsc.edu 3,670 www.ncbi.nih.gov 12,000 www.ebi.ac.uk 14,900 sciencemag.org 14,900 www.ncbi.nlm.nih.gov 34,500

Most Common GOIDs accessed via AmiGO 72020 GO:0006810 transport 56862 GO:0005524 ATP binding 53622 GO:0019012 virion 47773 GO:0006955 immune response 46943 GO:0003677 DNA binding 41474 GO:0006508 proteolysis and peptidolysis 41126 GO:0006355 regulation of transcription, DNA-dependent 40427 GO:0004872 receptor activity 34943 GO:0005215 transporter activity 30890 GO:0007186 G-protein coupled receptor protein signaling pathway 30001 GO:0003700 transcription factor activity 28127 GO:0006118 electron transport 26636 GO:0005509 calcium ion binding 24007 GO:0006968 cellular defense response 21250 GO:0016486 peptide hormone processing 20440 GO:0008152 metabolism 19742 GO:0005515 protein binding 19316 GO:0007155 cell adhesion 18254 GO:0005198 structural molecule activity

Taxon covered by the GO (some) Arabidopsis: TAIR, taxon:3702 Caenorhabditis: WormBase, taxon:6239 Candida albicans: CGD, taxon:5476 Danio: ZFIN, taxon:7955 Dictyostelium: DictyBase, taxon:5782 Drosophila: FlyBase, taxon:7227 Mus: MGI, taxon:10090 Oryza sativa: Gramene, taxon:39947 = Oryza sativa (japonica cultivar-group); Rattus: RGD, taxon:10116 Saccharomyces: SGD, taxon:4932 Leishmania major: GeneDB, taxon:5664 Plasmodium falciparum: GeneDB, taxon:5833 Schizosaccharomyces pombe: GeneDB, taxon:4896 Trypanosoma brucei: GeneDB, taxon:185431 Bacillus anthracis: TIGR, taxon:198094 Coxiella burnetii: TIGR, taxon:227377 Geobacter sulfurreducens: TIGR, taxon:243231 Listeria monocytogenes: TIGR, taxon:265669 Methylococcus capsulatus: TIGR, taxon:243233 Pseudomonas syringae: TIGR, taxon:223283 Shewanella oneidensis: TIGR, taxon:211586 Vibrio cholerae: TIGR, taxon:686

National Institute on Aging (NIA) National Institute of Allergy and Infectious Diseases (NIAID) National Cancer Institute (NCI) National Institute on Drug Abuse (NIDA) National Institute on Deafness and Other Communication Disorders (NIDCD) National Institute of Dental & Craniofacial Research (NIDCR) National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) National Institute of Biomedical Imaging and Bioengineering (NIBIB) National Institute of Environmental Health Sciences (NIEHS) National Eye Institute (NEI) National Institute of General Medical Sciences (NIGMS) National Institute of Child Health and Human Development (NICHD) National Human Genome Research Institute (NHGRI) National Heart, Lung and Blood Institute (NHLBI) National Library of Medicine (NLM) National Institute of Neurological Disorders and Stroke (NINDS) National Center for Research Resources (NCRR) NIH-funded experimental research that uses the GO

Other funded experimental projects that use the GO • Public Heath Service • Walter Reed Army Medical Center • United States Department of Agriculture • Department of Defense • USAID • National Science Foundation

A “successful” case study There are still challenges to meet

Building upon (sharing) light, axiomatic ontologies eliminates: • Spelling mistakes or differences • oesinophil vs. eosinophil • Differences in synonyms, names or naming conventions • Spermatazoon, sperm cell, spermatozoid, sperm • Differences in definitions • pericardial cell develops_frommesodermal cell vs.Nothingdevelops_from pericardial cell • Inconsistent structure

Inconsistent structure CL GO hemocyte hemocyte differentiation (sensu Arthropoda) plasmocyte plasmatocyte differentiation lamellocyte differentiation lamellocyte

GO immune cell activation, migration, chemotaxis… erythrocyte differentiation is_a myeloid blood cell differentiation” CL no such term: “immune cell” no such term: “myeloid blood cell” Finer granularity in the GO

GO neuroblast proliferation is_a cell proliferation CL neuroblast is_a neuronal stem cell is_a stem cell is_a cell Courser granularity in the GO

Even a “light” ontology like the GO is difficult enough • A methodology that enforces clear, coherent definitions: • Promotes quality assurance • intent is not hard-coded into software • Meaning of relationships is defined, not inferred • Guarantees automatic reasoning across ontologies and across data at different granularities • Consequences of inconsistencies • Hard to synchronize manually • Inconsistent user-search results

Meeting the goal: Drawing inferences PMID:5555 PMID:4444 Direct evidence Direct evidence ? SP:1234 SP:8723 SP:19345 A B C D Human human Indirect evidence SP:48392 PMID:8976 B Xenopus toad Indirect evidence SP:48291 SP:38921 B C PMID:3924 Drosophila PMID:9550 yeast

Chris Mungall Sima Misra Thank you NCBO Reactome GO SO

Ontologies and Biomedicine

Ontologies and Biomedicine

Presentation Transcript

-Ontologies: Bio-Ontologies: Their Creation and Design

Ontologies

Ontologies in Biomedicine What is the “right” amount of semantics?

Ontologies and Databases

Ontologies

Biomedicine and Big Data

Ontologies

Reference Ontologies, Application Ontologies, Terminology Ontologies

Ontologies

Ontologies and CToL

Top Level Ontologies Ontologies and Ontology Engineering

Ontologies

Ontologies in Biomedicine: The Good, The Bad and The Ugly

Ontologies and Simulations

HUMAN RIGHTS AND BIOMEDICINE

Biomedicine and Big Data

Ontologies and Classifications

Ontologies and Biomedicine

Modern Biomedicine and Bioethical Movement

Ontologies in Biomedicine What is the “right” amount of semantics?

Ontologies and CToL

Reference Ontologies, Application Ontologies, Terminology Ontologies