Lecture 6: Gene Ontology and Gene Annotation. June 19, 2014.
What is gene annotation? • Process of assigning descriptions to a known gene that represent: • Assigned gene name • Molecular function, process and cellular location • Protein features: domains, functional elements such as nuclear localization signals
What is the Gene Ontology? • Set of standard biological phrases (terms) which are applied to genes/proteins: • protein kinase • apoptosis • membrane • Standardizes the representation of gene and gene product attributes across species and databases
Who annotates the genes? • Curators at the major databases • NCBI, EBI, MGI, model organism databases • UniProt • Protein domain databases (Pfam, SMART, InterPro) • Older sources (Swiss-Prot, PIR) • Gene ontology groups
Why use gene ontology? • Allows biologists to make queries across large numbers of genes without researching each one individually • Can find all the PI3 kinases in a given genome or find all proteins involved in oxidative stress response without prior knowledge of every gene
From the Ex 1 gene list • vha-6 • C. elegans gene called vacuolar H+ ATPase • What is its role in the cell? • GO biological process: • body morphogenesis; determination of adult lifespan; lipid storage • GO molecular function: • H+ ion transmembrane transporter • GO cellular component: • apical plasma membrane; vacuolar ATPase complex
A long list of genes... how do you make sense of them? • By using gene ontology • [Figure: gene groups labeled asparagine utilization, lysine biosynthesis, cell wall catabolism, oxidative stress response, glucose repression, aging, ribose metabolism, protein folding, ubiquinone biosynthesis] • Eisen, Michael B. et al. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868
GO structure • GO isn’t just a flat list of biological terms • Terms are related within a hierarchy • Example: DNA binding is_a nucleic acid binding is_a binding (DNA binding is a type of nucleic acid binding; nucleic acid binding is a type of binding)
GO structure • A single gene (gene A in the diagram) associated with a particular term is automatically annotated to all of the parent terms
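The parent-term rule can be sketched in a few lines of Python. This is only an illustration built on the three-term binding hierarchy from the GO structure slide; the `IS_A` table and the function names are made up for the example and are not part of any GO software.

```python
from collections import deque

# Toy is_a hierarchy from the slide: child term -> list of parent terms.
IS_A = {
    "DNA binding": ["nucleic acid binding"],
    "nucleic acid binding": ["binding"],
    "binding": [],
}

def ancestors(term, parents=IS_A):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen

def propagate(direct_annotations, parents=IS_A):
    """Expand each gene's direct terms to include all ancestor terms."""
    expanded = {}
    for gene, terms in direct_annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded

# "gene A" is the placeholder gene from the slide.
annotations = propagate({"gene A": {"DNA binding"}})
print(sorted(annotations["gene A"]))
```

Because a real term can have more than one parent, the traversal keeps a `seen` set so that shared ancestors are visited only once.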
GO structure • This means genes can be grouped according to user-defined levels • Allows broad overview of gene set or genome
How does GO work? • What information might we want to capture about a gene product? • What does the gene product do? • Where and when does it act? • Why does it perform these activities?
GO structure • GO terms divided into three parts: • cellular component • molecular function • biological process
Cellular Component • where a gene product acts • Example: mitochondria
Cellular Component • The cellular components of a virus are different from those of a cell
Cellular Component Enzyme complexes in the component ontology refer to places, not activities.
Molecular Function • activities or “jobs” of a gene product • Example: glucose-6-phosphate isomerase activity
Molecular Function • Examples: insulin binding; insulin receptor activity
Molecular Function • A gene product may have several functions • Sets of functions make up a biological process.
Biological Process • a commonly recognized series of events • Example: cell division
Biological Process • Further examples: transcription; regulation of gluconeogenesis; limb development; courtship behavior
Ontology Structure • Terms are linked by two relationships • is-a • part-of
Ontology Structure • [Diagram: the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, connected by is-a and part-of links]
GO terms • Each concept has: • a name • an ID number • a definition • Example: term: transcription initiation • id: GO:0006352 • definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
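As a rough illustration of how the three fields of a term can be read by a program, the sketch below parses one stanza laid out in the OBO-style text form in which the ontology is commonly distributed. The stanza is hand-written from the slide's example, the empty `[]` stands in for the cross-reference list, and `parse_term_stanza` is a hypothetical helper, not a function from any GO library.

```python
def parse_term_stanza(text):
    """Parse one OBO-style [Term] stanza into a dict keyed by tag name."""
    term = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        key, _, value = line.partition(": ")
        if key == "def":
            # The definition is the quoted part; what follows is a reference list.
            value = value.split('"')[1]
        term[key] = value
    return term

# Hand-written stanza using the term shown on the slide.
stanza = """
[Term]
id: GO:0006352
name: transcription initiation
def: "Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter." []
"""

term = parse_term_stanza(stanza)
print(term["id"], "-", term["name"])
```

A real ontology file holds many stanzas and more tags per term, so a production parser would need to handle repeated tags and escaped quotes, which this sketch does not.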
GO terms • Where do GO terms come from? • GO terms are added by editors at EBI and annotating databases • new terms are usually only added when they are asked for by annotators • GO editors work with experts to make major ontology developments • metabolism • pathogenesis • cell cycle
Species coverage • All major eukaryotic model organism species • Human via the Gene Ontology Annotation (GOA) group at UniProt • Several bacterial and parasite species through TIGR and GeneDB at Sanger • ~80 species in the Gene Ontology database
Anatomy of a GO annotation • Three key parts: • gene name/id • GO term(s) • evidence for association
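The three parts of an annotation map naturally onto a small record type. The sketch below assumes a tab-separated line laid out like the GAF annotation files distributed by GO (gene ID in column 2, symbol in column 3, GO ID in column 5, evidence code in column 7, aspect in column 9); check the column positions against the current format specification before relying on them. The example line is invented and does not describe a real gene.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    gene_id: str
    gene_symbol: str
    go_id: str
    evidence_code: str
    aspect: str  # F = molecular function, P = biological process, C = cellular component

def parse_gaf_line(line):
    """Pick the key fields out of one tab-separated annotation line."""
    cols = line.rstrip("\n").split("\t")
    return Annotation(
        gene_id=cols[1],
        gene_symbol=cols[2],
        go_id=cols[4],
        evidence_code=cols[6],
        aspect=cols[8],
    )

# Invented record for illustration only (GO:0003677 is the ID for DNA binding).
example_line = "\t".join(
    ["ExampleDB", "X0001", "geneA", "", "GO:0003677", "PMID:0000000", "IDA", "", "F"]
)

annotation = parse_gaf_line(example_line)
print(annotation)
```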
Example annotation Human BRCA1 protein – molecular function GO terms
Types of evidence codes • Experimental codes • Computational codes • Other evidence codes
Manual annotation • Source text: “In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…” • Molecular function: serine/threonine kinase activity • Cellular component: plasma membrane; integral membrane protein • Biological process: wound response
Electronic Annotation • Annotation derived without human validation • Mapping files, e.g. interpro2go, ec2go • BLAST search ‘hits’ • Lower ‘quality’ than manual codes • Used in non-model organisms
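Because electronic annotations carry their own evidence code (IEA, "inferred from electronic annotation"), they are easy to separate from the rest before an analysis. The sketch below is a minimal example under stated assumptions: the list of experimental codes reflects the commonly documented GO codes and should be checked against the current evidence code documentation, and the annotation records are invented.

```python
# Experimental evidence codes as commonly documented by GO (verify before use).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC_CODE = "IEA"

def split_by_evidence(annotations):
    """Split (gene, go_id, evidence_code) tuples into three evidence groups."""
    groups = {"experimental": [], "electronic": [], "other": []}
    for record in annotations:
        code = record[2]
        if code in EXPERIMENTAL_CODES:
            groups["experimental"].append(record)
        elif code == ELECTRONIC_CODE:
            groups["electronic"].append(record)
        else:
            groups["other"].append(record)
    return groups

# Invented records for illustration only.
records = [
    ("geneA", "GO:0003677", "IDA"),
    ("geneB", "GO:0003677", "IEA"),
    ("geneC", "GO:0006915", "ISS"),
]

groups = split_by_evidence(records)
manual_only = groups["experimental"] + groups["other"]
print(len(manual_only), "non-electronic annotations kept")
```

For a non-model organism, dropping the electronic group may remove most of the available annotation, so whether to filter is a judgment call rather than a default.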
GO & microarray analysis • Many tools exist that use GO to find common biological functions from a list of genes • GoMiner, GOstat, Onto-Express, FatiGO and GSEA to name a few • We’ll use the DAVID Bioinformatics Resource
GO tools • input a gene list • shows which GO categories have most genes associated with them or are “enriched” • provides a statistical measure to determine whether enrichment is significant
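One simple way to express "enriched" is the ratio between a category's share of the gene list and its share of the background. The sketch below shows that calculation only; the function name and the numbers are hypothetical, and individual tools may define their enrichment scores differently.

```python
def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Ratio of the category's frequency in the gene list to its background frequency."""
    list_fraction = hits_in_list / list_size
    background_fraction = hits_in_background / background_size
    return list_fraction / background_fraction

# Hypothetical counts: 20 of 100 listed genes carry the term,
# compared with 50 of 1000 genes in the background.
ratio = fold_enrichment(20, 100, 50, 1000)
print(f"fold enrichment: {ratio:.1f}")
```

A ratio near 1 means the category is about as common in the list as in the background; the ratio alone says nothing about significance, which is why the tools also report a statistical measure.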
Traditional analysis
• Gene 1: apoptosis; cell-cell signaling; protein phosphorylation; mitosis; …
• Gene 2: growth control; mitosis; oncogenesis; protein phosphorylation; …
• Gene 3: growth control; mitosis; oncogenesis; protein phosphorylation; …
• Gene 4: nervous system; pregnancy; oncogenesis; mitosis; …
• Gene 100: positive control of cell proliferation; mitosis; oncogenesis; glucose transport; …
Using GO annotations • But by using GO annotations, this work has already been done for you! • Example: GO:0006915 : apoptosis
Grouping by process
• Mitosis: Gene 2, Gene 5, Gene 45, Gene 7, Gene 35, …
• Glucose transport: Gene 7, Gene 3, Gene 6, …
• Apoptosis: Gene 1, Gene 53
• Positive control of cell proliferation: Gene 7, Gene 3, Gene 12, …
• Growth: Gene 5, Gene 2, Gene 6, …
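Turning a per-gene list into a per-process grouping is a simple inversion of the mapping. The sketch below uses only the terms listed for Gene 1 to Gene 4 on the traditional-analysis slide (the slide's lists are truncated, so the result is not the full picture), and `group_by_process` is a name chosen for this example.

```python
from collections import defaultdict

# Terms listed for the first four genes on the traditional-analysis slide.
gene_to_processes = {
    "Gene 1": ["apoptosis", "cell-cell signaling", "protein phosphorylation", "mitosis"],
    "Gene 2": ["growth control", "mitosis", "oncogenesis", "protein phosphorylation"],
    "Gene 3": ["growth control", "mitosis", "oncogenesis", "protein phosphorylation"],
    "Gene 4": ["nervous system", "pregnancy", "oncogenesis", "mitosis"],
}

def group_by_process(gene_to_terms):
    """Invert a gene -> terms mapping into a term -> genes mapping."""
    groups = defaultdict(list)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            groups[term].append(gene)
    return dict(groups)

groups = group_by_process(gene_to_processes)

# Show the largest groups first, breaking ties alphabetically.
for process, genes in sorted(groups.items(), key=lambda item: (-len(item[1]), item[0])):
    print(f"{process}: {', '.join(genes)}")
```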
GO for microarray analysis • Annotations give ‘function’ label to genes • Ask meaningful questions of microarray data: • Do the genes involved in the same process have the same or different expression patterns?
Using GO in practice • Microarray: 1000 genes • Experiment: 100 genes differentially regulated • mitosis – 80/100 • apoptosis – 40/100 • cell proliferation – 30/100 • glucose transport – 20/100 • Statistical measure: how likely it is that your differentially regulated genes fall into that category by chance
Using GO in practice • However, when you look at the distribution of all genes on the microarray:
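A common way to put a number on "by chance" is a one-sided hypergeometric test, sketched below with the standard library only. The list counts (80/100 and 20/100) come from the previous slide, but the background counts used here (800 and 50 of 1000) are assumptions made up for the example, since the slide's own background figures are not reproduced in this text. This is not necessarily the exact statistic that DAVID or any other tool reports.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the category."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

N = 1000  # genes on the microarray (from the slide)
n = 100   # differentially regulated genes (from the slide)

# Background counts below are ASSUMED for illustration, not taken from the slide.
p_mitosis = hypergeom_pvalue(k=80, n=n, K=800, N=N)
p_glucose = hypergeom_pvalue(k=20, n=n, K=50, N=N)

print(f"mitosis:           p = {p_mitosis:.3g}")
print(f"glucose transport: p = {p_glucose:.3g}")
```

Under these assumed backgrounds, the category with the most genes in the list (mitosis) is not surprising, because it is equally common on the whole array, while the smaller glucose transport group is strongly over-represented. When many categories are tested at once, the p-values also need a multiple-testing correction.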
Other sources of annotation • UniProt (Swiss-Prot) keywords • Protein domain databases • Pfam, PANTHER, PDB, PROSITE, etc. • GeneDB summaries from NCBI • Protein-protein interaction databases • Pathway databases • KEGG, BioCarta, BBID, Reactome • DAVID incorporates annotation from all of these and clusters the redundant terms
Limitations of GO analysis • ~40% of the C. neoformans predicted proteins are similar only to other C. neoformans proteins and have no identifiable protein domain • It is difficult to do enrichment analysis when only ~60% of the encoded proteins can be annotated
Today in computer lab • Tutorial on using DAVID for GO enrichment analysis • Analyze the gene lists from Exercise 1 and 2 • Create a sub-list that you will use in Exercise 7