1 / 47

Gene Annotation Databases

Genes. Diseases. Diseases. Diseases. Physiology. Diseases. Physiology. Genes. Genes. Anatomy. Diseases. Physiology. Anatomy. Diseases. Physiology. Anatomy. Diseases. Physiology. Anatomy. Diseases. Physiology. Anatomy. Diseases. Physiology. Anatomy. Diseases. Anatomy.

prestonj
Download Presentation

Gene Annotation Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genes Diseases Diseases Diseases Physiology Diseases Physiology Genes Genes Anatomy Diseases Physiology Anatomy Diseases Physiology Anatomy Diseases Physiology Anatomy Diseases Physiology Anatomy Diseases Physiology Anatomy Diseases Anatomy Genes Genes Genes Genes Genes Genes Novel relationships & Deeper insights Medical Informatics Genomics and Bioinformatics Gene Annotation Databases Gene Annotation Databases Gene Annotation Databases Gene Annotation Databases

  2. Identification and Prioritization of Novel Disease Candidate GenesSystems Biology Based Integrative Approaches Bioinformatics to Systems Biology November 16, 2007 Anil Jegga Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC) Department of Pediatrics, University of Cincinnati Cincinnati, Ohio - 45229 Anil.Jegga@cchmc.org http://anil.cchmc.org

  3. Acknowledgements • All the publicly available gene annotation resources especially NCBI, MGI and UCSC • Jing Chen • Eric Bardes • Bruce Aronow Support Cincinnati Children’s Hospital Medical Center Computational Medical Center, Cincinnati Mouse Models of Human Cancers Consortium University of Cincinnati College of Medicine

  4. Two Separate Worlds….. Disease World Genome Variome Transcriptome Regulome miRNAome • Name • Synonyms • Related/Similar Diseases • Subtypes • Etiology • Predisposing Causes • Pathogenesis • Molecular Basis • Population Genetics • Clinical findings • System(s) involved • Lesions • Diagnosis • Prognosis • Treatment • Clinical Trials…… Interactome Pharmacogenome Metabolome Physiome Pathome Medical Informatics Bioinformatics & the “omes PubMed Proteome Disease Database Patient Records OMIM Clinical Synopsis Clinical Trials 382 “omes” so far……… and there is “UNKNOME” too - genes with no function known http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as on November 15, 2007) With Some Data Exchange…

  5. the Ultimate Goal……. Disease World Medical Informatics Bioinformatics Genome Variome Transcriptome Regulome Disease Database • Personalized Medicine • Decision Support System • Course/Outcome Predictor • Diagnostic Test Selector • Clinical Trials Design • Hypothesis Generator • Novel Gene/Drug Targets….. Proteome • Name • Synonyms • Related/Similar Diseases • Subtypes • Etiology • Predisposing Causes • Pathogenesis • Molecular Basis • Population Genetics • Clinical findings • System(s) involved • Lesions • Diagnosis • Prognosis • Treatment • Clinical Trials…… Interactome Patient Records Metabolome Physiome Pathome Clinical Trials Integrative Genomics - Biomedical Informatics PubMed miRNAome Pharmacogenome OMIM

  6. Gene World Biomedical World No Integrative Genomics is Complete without Ontologies • Gene Ontology (GO) • Unified Medical Language System (UMLS)

  7. The 3 Gene Ontologies • Molecular Function = elemental activity/task • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity • What a product ‘does’, precise activity • Biological Process = biological goal or objective • broad biological goals, such as dna repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Biological objective, accomplished via one or more ordered assemblies of functions • Cellular Component= location or complex • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme • ‘is located in’ (‘is a subcomponent of’ ) http://www.geneontology.org

  8. Example: Gene Product = hammer Function (what)Process (why) Drive a nail - into wood Carpentry Drive stake - into soilGardening Smash a bugPest Control A performer’s juggling objectEntertainment http://www.geneontology.org

  9. Unified Medical Language System Knowledge Server– UMLSKS • The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems. • The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain. • The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word. http://umlsks.nlm.nih.gov/kss

  10. Unified Medical Language SystemMetathesaurus • about over 1 million biomedical concepts • About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems. • The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together. • Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition. • Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus. • Uses: • linking between different clinical or biomedical vocabularies • information retrieval from databases with human assigned subject index terms and from free-text information sources • linking patient records to related information in bibliographic, full-text, or factual databases • natural language processing and automated indexing research

  11. Open biomedical ontologies http://obo.sourceforge.net/

  12. Mammalian Phenotype Ontology The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease. Each node in MPO represents a category of phenotypes and each MP ontology term has a unique identifier, a definition, synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments. In the current version of MPO, there are >4250 terms associated to >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes). http://www.informatics.jax.org

  13. Disease Gene Identification and Prioritization Hypothesis: Majority of genes that impact or cause disease share membership in any of several functional relationships OR Functionally similar or related genes cause similar phenotype. • Functional Similarity – Common/shared • Gene Ontology term • Pathway • Phenotype • Chromosomal location • Expression • Cis regulatory elements (Transcription factor binding sites) • miRNA regulators • Interactions • Other features…..

  14. Background, Problems & Issues • Most of the common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors. • High-throughput genome-wide studies like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.

  15. Background, Problems & Issues Since multiple genes are associated with same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in the finding of novel disease genes. For e.g., genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconianaemia have been shown to be caused by mutations in different interacting proteins.

  16. Ellinor et al. J Am CollCardiol 2006. dilated cardiomyopathy Linkage, gene expression Linkage analysis Potential candidate genes (too many!) Locus region 10q25-26 Fine mapping Hand/cherry picking Prioritization approach ~9.5Mb with 68 genes 7 candidates selected by experts ADRB1 missing Background, Problems & Issues Disease candidate gene studies Biological experiments (expensive, time consuming)

  17. Approach without training Input: Multiple locus regions Enriched functions Prioritize genes based on the functions Background, Problems & Issues Current candidate gene prioritization tools Assumption: genes involved in the same complex disease will have similar functions dilated cardiomyopathy Approach with training Training: Known disease genes (10 from OMIM) Test: 68 genes at 10q25-26 Score test genes based on their similarity to training set

  18. TOPPGeneTranscriptome Ontology Pathway based Prioritization of Geneshttp://toppgene.cchmc.org Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print] Applications: For functional enrichment For candidate gene prioritization Why another gene prioritization method?

  19. Comparison with other related approaches

  20. Comparison with other related approaches Feature Details

  21. Mammalian Phenotype Ontology We do not check whether the human orthologous gene of a mouse gene causes similar phenotype. Rather, we assume that orthologous genes cause “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes

  22. Mammalian Phenotype Ontology 77 human genes explicitly associated with “heart development” (GO:0007507) Mouse orthologs cause various types of cardiac phenotype (MPO)

  23. ToppGene – General Schema

  24. TOPPGene - Data Sources • Gene Ontology: GO and NCBI Entrez Gene • Mouse Phenotype: MGI (used for the first time for human disease gene prioritization) • Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB • Domains: UniProt (Pfam, Interpro,etc.) • Interactions: NCBI Entrez Gene (Biogrid, Reactome, BIND, HPRD, etc.) • Pubmed IDs: NCBI Entrez Gene • Expression: GEO • Cytoband: MSigDB • Cis-Elements: MSigDB • miRNA Targets: MSigDB New features added

  25. TOPPGene - Validation • Random-gene cross-validation • Disease-gene relations from OMIM and GAD databases • Training set: disease genes with one gene (“target”) removed • Test set: 100 genes = “target” gene + 99 random genes • Rank of “target” gene • Control: random training sets • AUC and Sensitivity/Specificity

  26. Disease genes ATM BARD1 BRCA1 BRCA2 BRIP1 CASP8 CHEK2 KRAS PALB2 PIK3CA PPM1D RAD51 RB1CC1 SLC22A18 TP53 Training set BARD1 BRCA1 BRCA2 BRIP1 CASP8 CHEK2 KRAS PALB2 PIK3CA PPM1D RAD51 RB1CC1 SLC22A18 TP53 Test set KIAA1333 PQLC3 RBMY2OP ZNF133 LOC402643 FBL SLEB4 FAM32A AACSL ATM NDUFB5 DENND4A C14orf106 … … KCNJ16 • Ranked list • ATM • KIAA1333 • PQLC3 • RBMY2OP • ZNF133 • LOC402643 • FBL • SLEB4 • FAM32A • AACSL • NDUFB5 • DENND4A • C14orf106 • … • … • KCNJ16 prioritization 99 random genes TOPPGene - Validation Random-gene cross-validation: breast cancer example

  27. Training:19 diseases with 693 genes • Control: 20 random sets of 35 genes each • Sensitivity/Specificity: 77/90 • AUC: 0.916 • Sensitivity: frequency of “target” genes that are ranked above a particular threshold position • Specificity: the percentage of genes ranked below the threshold Random-gene cross-validation result

  28. Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization Random-gene cross-validation with only one feature

  29. Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization Random-gene cross-validation by leaving one feature out Overall performance All features: 0.913 All – MP: 0.893 All – MP – PubMed: 0.888 All All – MP Sensitivity: true positive rate at a cutoff score Specificity: true negative rate at the same cutoff All – MP - Pubmed

  30. Locus-region cross-validation using different feature sets

  31. ToppGene web server (http://toppgene.cchmc.org) For functional enrichment analysis

  32. ToppGene web server (http://toppgene.cchmc.org) For functional enrichment analysis

  33. ToppGene web server (http://toppgene.cchmc.org) For functional enrichment analysis

  34. ToppGene web server (http://toppgene.cchmc.org) For functional enrichment analysis

  35. PPI - Predicting Disease Genes Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes. Hypothesis: Interacting proteins lead to same or similar disease phenotypes when mutated. Several genetically heterogeneous hereditary diseases are shown to be caused by mutations in different interacting proteins. For e.g. Hermansky-Pudlak syndrome and Fanconianaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.

  36. Which of these interactants are potential new candidates? 7 Known Disease Genes 66 HPRD BioGrid Mining human interactome 778 Direct Interactants of Disease Genes Indirect Interactants of Disease Genes • Prioritize candidate genes in the interacting partners of the disease-related genes • Training sets: disease related genes • Test sets: interacting partners of the training genes

  37. Example: Breast cancer 15 342 2469

  38. ToppGene web server (http://toppgene.cchmc.org) For candidate gene prioritization

  39. ToppGene web server (http://toppgene.cchmc.org) For candidate gene prioritization

  40. ToppGene web server (http://toppgene.cchmc.org) For candidate gene prioritization

  41. Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27. Prioritization result:

  42. Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.

  43. ToppGene Prioritization Example: Breast cancer Ranked Interactants

  44. Limitations General limitations of any training-test strategy: • Prior knowledge of disease-gene associations. • Assumption that the disease genes yet to discover will be consistent with what is already known about a disease. • Depend on the accuracy and completeness of the functional annotations. • Only one-fifth of the known human genes have pathway or phenotype annotations and there are still more than 40% genes whose functions are not defined! Chen et al., 2007; BMC Bioinformatics

  45. Mouse Phenotype - Limitations MP is not a disease-centric ontology and the phenotype of a same gene mutation can vary depending on specific mouse strains or their genetic backgrounds. Orthologous genes need not necessarily result in orthologous phenotypes. Possible Solutions - Future Directions More efficient cross-species phenome extrapolation where in the mouse phenotype terms are mapped to human phenotype concepts (from UMLS) semantically (“orthologous phenotype”) and the resultant orthologous genes associated with an orthologous phenotype are identified. Chen et al., 2007; BMC Bioinformatics

  46. PPIs for disease gene identification Limitations • Noisy interactome data • In vitro Vs in vivo (for e.g. only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD) • Extrapolation of interactions from one species to another • Bias towards “well-studied” genes/proteins • Too many interactants! Hub proteins • Two interacting proteins need not lead to similar phenotype when mutated • Disease proteins may lie at different points in a pathway and need not interact directly • Lastly, disease mutations need not always involve proteins Oti et al., 2006; J Med Gen

  47. http://anil.cchmc.org (under presentations) And PRIORITIZATION too! Thank You! http://sbw.kgi.edu/

More Related