Biomedical Ontologies. Bio-Trac 40 (Protein Bioinformatics) October 9, 2008 Zhang-Zhi Hu, M.D. Research Associate Professor Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center. Overview. What is ontology ?
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Biomedical Ontologies Bio-Trac 40 (Protein Bioinformatics) October 9, 2008 Zhang-Zhi Hu, M.D. Research Associate Professor Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center
Overview • What is ontology? • What is biomedical ontology? • What is gene ontology? • How is it generated? • How is it used for annotation? • What is protein ontology? • Why is it necessary? • How to use it? • ……
Tree of Porphyry with Aristotle’s Categories Aristotle, 384 BC – 322 BC
Ontology: onto-, of being or existence; -logy, study. Greek origin; Latin, ontologia,1606 • In philosophy, it seeks to describe basic categories and relationships of being or existence to define entities and types of entities within its framework: • What do you know? How do you know it? • What is existence? What is a physical object? • What constitutes the identity of an object? …… • Central goal is to have a definitive and exhaustive classification of all entities. “The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality” – Barry Smith, U Buffalo
is_a In computer and information science • Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain. Most ontologies describe individuals (instances), classes (concepts), attributes, and relations Classes Relations Attributes Classes (concepts) e.g. color, engine, door… Individuals (instances) your Ford, my Ford, his Ford…
What are ontology useful for? Ontology is a form of knowledge representation about the world or some part of it. • Terminology management • Integration, interoperability, and sharing of data • promote precise communication between scientists • enable information retrieval across multiple resources • Knowledge reuse and decision support • extend the power of computational approaches to perform data exploration, inference, and mining Biomedical Terminologyvs. Biomedical Ontology • UMLS (unified medical language system) • MeSH (medical subject heading) • NCI Thesaurus • SNOMED / SNODENT • Medical WordNet
Ontology Enables Large-Scale Biomedical Science • Structured representation of biomedicine: • For different types of entities and relations to describe biomedicine (ontology content curation). • Annotation: using ontologies to summarize and describe biomedical experimental results to enable: • Integration of their data with other researchers’ results • Cross-species analyses The center of two major activities currently in biomedical research:
Gene Ontology (GO) what makes it so wildly successful ?
GO Consortium • The Gene Ontology was originally constructed in 1998 by a consortium of researchers studying the genome of three model organisms: • Drosophila melanogaster (fruit fly) (FlyBase) • Mus musculus (mouse) (MGD) • Saccharomyces cerevisiae (yeast) (SGD) • Many other model organism databases have joined the GO consortium, contributing: • development of the ontologies • annotations for the genes of one or more organisms http://www.geneontology.org/
Need for annotation of genome sequences • What is Gene Ontology? GO provides controlled vocabulary to describe gene and gene product attributes in any organism – how gene products behave in a cellular context Three key concepts: [Currently total 25804 GO terms] (Oct. 2008) • Biological process: series of events accomplished by one or more ordered assemblies of molecular functions, e.g. signal transduction, or pyrimidine metabolism, and alpha-glucoside transport. [total: 15161] • Molecular function: describes activities, such as catalytic or binding activities, that occur at the molecular level. Activities that can be performed by individual gene products, or by assembled complexes of gene products; e.g. catalytic activity, transporter activity. [total: 8425] • Cellular component: a component of a cell that it is part of some larger object, maybe an anatomical structure (e.g. ER or nucleus) or a gene product group (e.g. ribosome, or a protein dimer). [total: 2218] • GO annotation • - Characterization of gene products using GO terms • - Members submit their data which are available at GO website.
root A C C has two parents, A and B B Relations: is_a, or part_of C Leaf node GO Representation: Tree or Network? GO is a network structure Node, a concept or a term
GO search and display tool Leaf node GO term (GO:0006366): mRNA transcription from RNA polymerase II promoter
Human p53 – GO annotation (UniProtKB:P04637) GO:0006289:nucleotide-excision repair [PMID:7663514; evidence:IMP]
GO annotation of gene products • Science basis of the GO: trained experts use the experimental observations from literature to associate GO terms with gene products (to annotate the entities represented in the gene/protein databases) • Enabling data integration across databases and making them available to semantic search http://www.geneontology.org/GO.current.annotations.shtml ~46 Human, mouse, plant, worm, yeast …
What GO is NOT…… • Ontology of gene products: e.g. cytochrome c is not in GO, but attributes of cytochrome c are, e.g. oxidoreductase activity. • Processes, functions and component unique to mutants or diseases: e.g. oncogenesis is not a valid GO. • Protein domains or structural features. • Protein-protein interactions. • Environment, evolution and expression. • Anatomical or histological features above the level of cellular components, including cell types. Neither GO is Ontology of Genes!! – a misnomer
Missing GO nodes… not deep enough…not broad enough…
Estrogen receptor Lack of connections among GOs
GO: A Common Standard for Omics Data Analysis what molecular function? what biological process? what cellular component?
need more… • need to improve the quality of GO to support more rigorous logic-based reasoning across the data annotated in its terms • need to extend the GO by engaging ever broader community support for addition of new terms and for correction of errors • need to extend the methodology to other domains, including clinical domains, such as: • disease ontology • immunology ontology • symptom (phenotype) ontology • clinical trial ontology • ...
http://www.obofoundry.org/ • Establish common rules governing best practices for creating ontologies and for using these in annotations • Apply these rules to create a complete suite of orthogonal interoperable biomedical reference ontologies National Center for Biomedical Ontology (NCBO) http://bioontology.org/
The OBO Foundry • A family of interoperable gold standard biomedical reference ontologies to serve annotation of: • scientific literature • model organism databases • clinical trial data … • tight connection to the biomedical basic sciences • compatibility, interoperability, common relations • support for logic-based reasoning OBO Foundry = a subset of OBO ontologies, whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure: OBO Foundry Principles: http://www.obofoundry.org/crit.shtml
RELATION TO TIME GRANULARITY Rationale of OBO Foundry coverage
OBO Relation Ontology e.g.: A is_a B =def. every instance of A is an instance of B “rose is_a plant all instances of rose is_a plant”
What is Protein Ontology? Why? PRO http://pir.georgetown.edu/pro/
Human PRLR The Need for Representation of Various Proteins Forms Glucocorticoid receptor (GR) and PTMs…
lysosome Extracelluar, e.g. LDL binding M6P Sphingomyelin phosphodiesterase (SMPD1) (ASM_HUMAN) • Cleavage sites: • lysosomal: the enzyme is transported from the Golgi apparatus to the lysosome after additions of mannose-6-phosphate moieties (M6P) and binding to M6P receptor. • secreted: the shorter cleaved form is not modified with M6P and is targeted for secretion to the extracellular space, with different functions such as LDL binding and oxidized LDL catabolism.
Alternative splicing a single new contact between Phe32 (F32) of FGF8b and a hydrophobic groove within Ig domain 3 of FGFR2c • Only FGF8b can transform midbrain to cerebellum whereas FGF8a causes an overgrowth of midbrain. Olsen et al., Genes Dev. 2006 FGF8a, 8b – differ in their ability to pattern embryonic brain FGF8a FGF8b FGF8_HUMAN alternative splicing
OVOL2_MOUSE (Q8CIV7) GOA for Transcription factor Ovo-like 2 Form 1 - long:GO:0045892 IDA - negative regulation of transcription, DNA-dependent Form 2 – short: GO:0045893 IDA - positive regulation of transcription, DNA-dependent - Gene. 2004 336:47-58. PMID:15225875 274 aa
The Need for Protein Classes Representing Protein Evolutionary Relationships • Genes/proteins identified in model organisms, such as mouse, yeast, fly, may have important functional implications in human. • Gene function in model organism may not applied to human • Animal models for human diseases: such as mouse models for diabetes, arthritis, and tumor. • Essential genes may be redundant and nonessential in another species due to functional compensation, e.g.: • mutation of Rb1 causes retinoblastoma in early childhood • Rb1 knock-out mouse did not develop retinoblastoma because of compensation from a functional homolog p107. • Close examination of proteins in phylogenetic classes and their functional convergence and divergence in a ontological structure is important for application of disease models.
Implications of Protein Evolution B.subtilis Human Mouse Chimp Worm Yeast E.coli Rat Fly • Conclusions from experiments performed on proteins from one organism are often applicable to the homologous protein from another organism. • Information learned about existing proteins allows us to infer theproperties of ancestral proteins. Common ancestor
Protein Evolution Sequence changes Domain shuffling With enough similarity, one can trace back to a common origin What about these?
Functional convergence • Protein classes of the same function derived from different evolutionary origins, e.g. carbonate dehydratase (or carbonic anhydrase EC 184.108.40.206), which has three independent gene families with functional convergence. Animal and prokaryotic type Plant and prokaryotic type Archaea type
Functional divergence Gene Duplication (TGM3/EPB42 split) Speciation (Human/mouse split) Human TGM3 branch TGM3 (Human) Mouse TGM3 (Mouse) Human EPB42 (Human) EPB42 branch Mouse EPB42 (Mouse) TGM3 (Human) TGM3 (Mouse) EPB42 (Human) EPB42 (Mouse) TGM3 = Protein-glutamine gamma-glutamyltransferase (Transglutaminase; involved in protein modification) EBP42 = Erythrocyte membrane protein band 4.2 (Constituent of cytoskeleton; involved in cell shape)
The Need for Protein Ontology • Data integration and knowledge management for -omics work. • A gap exists in OBO for gene products. • Protein Ontology (PRO) will contain two connected components (or subontologies): • ProEvocaptures the protein classes represented by protein families at fold, domain and full length levels that reflect evolutional relationship • ProForm captures the specific protein objects of a specific gene resulting from alternative splicing, posttranslational modification, genetic variations. • ProEvo and ProMod is connected through the “reference” (canonical) protein sequence currently annotated in UniProtKB. • PRO formalization of these detailed protein objects and classes will allow accurate and consistent proteomics experimental design and data analysis/integration.
PRO Framework • PRO is designed to be a formal and well-principled OBO Foundry ontology for protein entities. • Attributes of objects will take the form of links to other ontologies, such as gene (GO), sequence (SO), modification (PSI-MOD) and disease (DO) ontologies. • A PRO prototype for TGF-beta signaling proteins was built based on this framework. • In this way, PRO aims at providing an ontological framework to define protein entities and evolutionary-related classes that community can adopt for different purposes, e.g. • annotation of entities attributes, • mapping of objects in pathways, and • modeling of biological system dynamics and disease.
Pfam Domain PRO protein domain protein Root Level is_a has_part GOGene Ontology • Family-Level Distinction • Derivation: common ancestor • Source: PIRSF family translation product of an evolutionarily-related gene molecular function is_a has_function • Gene-Level Distinction • Derivation: specific gene • Sources: PIRSF subfamily, Panther subfamily biological process translation product of a specific gene participates_in is_a cellular component • Sequence-Level Distinction • Derivation: specific allele or splice variant • Source: UniProtKB part_of (for complexes) located_in(for compartments) translation product of a specific mRNA derives_from OMIM Disease • Modification-Level Distinction • Derived from post-translational modification • Source: UniProtKB disease cleaved/modified translation product agent_in SOSequence Ontology Example: TGF-beta receptor phosphorylated smad2 isoform1 is a phosphorylated smad2 isoform1 derives_from smad2 isoform 1 is a smad2 is a TGF-b receptor-regulated smad is a smad is aprotein ProForm Modification Level sequence change Sequence Level has_agent(sequence change) agent_of (effect on function) Gene Level ProEvo PSI-MODModification Family Level protein modification Root Level has_modification Protein Ontology (PRO) http://pir.georgetown.edu/pro/
Mothers against decapentaplegic homolog 2 Smad 2 GO annotation of SMAD2_HUMAN: Cellular Component: - nucleus Molecular Function:- protein bindingBiological Process: - signal transduction - regulation of transcription, DNA-dependent
TGF-b TGF-beta receptor II I Smad 2 1 phosphorylation Smad 4 Smad 2 P P P CAMK2 ERK1 2 complex formation Smad 2 P P P P Smad 2 P P P P Smad 4 P Cytoplasm 3 nuclear translocation Smad 2 P P P Smad 4 Nucleus P 4 DNA binding ++ Transcription Regulation
Smad 2 P P Smad 2 P P P Smad 2 P P P Smad 2 P P x Smad 2 Smad2 gene products Forms Location ID SMAD2_HUMAN Smad 2 SMAD2_HUMAN SMAD2_HUMAN SMAD2_HUMAN SMAD2_HUMAN Smad 2 SMAD2_HUMAN SMAD2_HUMAN
PRO hierarchy in Obo Edit Representing evolutionary-related protein classes. In this example, children of TGF-beta-like cysteine-knot cytokine have a common architecture consisting of a signal peptide, a variable propeptide region and a transforming growth factor beta-like domain that is a cysteine-knot domain. Pfam:PF00019 "has_part Transforming growth factor beta like domain". ProEvo: Representing multiple protein products of a gene. Only forms with experimental data are included. When common protein forms exist in human and mouse, a single node is created (See details below). ProForm: OBO relations: is_a, derives_from
Summary • The vision of the biomedical ontology community is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that they are semantically interoperable and useful for improving biomedical science and clinical care. • The scope extends to all knowledge and data that is relevant to the understanding or improvement of human biology and health. • Knowledge and data are semantically interoperable when they enable predictable, meaningful, computation across knowledge sources developed independently to meet diverse needs. • Principled ontologies are ones that follow NCBO-recommended formats and methodologies for ontology development, maintenance, and use.