Protein function: Where to find it. How to predict it. How to classify it. Stuart Rison, Department of Biochemistry, UCL, rison@biochem.ucl.ac.uk
Outline • Collecting functional information: • Small scale (single gene) • Large scale (sets of genes) • Function annotation schemes • Problems with functional assignments • [Comparing current schemes]
Collecting information for single genes • from 1° databases • from 2° databases • from Genome Databases (Model organisms) • by homology • not by homology
Annotation in databases: 1° and 2° databases • Some information can be found in 'primary' databases (sequence and structure databases) • Usually limited although sometimes can be quite informative (e.g. SwissProt) • Core data: sequence, citation information and taxonomic data • Annotation: Protein function; post-translational modifications; domains and sites; Associated diseases; Sequence conflicts/Variant • Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases) which are often rich in information
Annotation in 1° databases: SwissProt
ID   HEM3_HUMAN STANDARD; PRT; 361 AA.
AC   P08397; P08396; Q16012; …
DE   PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE   (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC   COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC   PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC   ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC   ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC   DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC   AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC   DYSFUNCTION…
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
… (Sequence variations/Sequence)
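SwissProt flat-file entries are line-oriented: each line starts with a two-letter code (ID, AC, CC, KW and so on). The following is a minimal sketch of pulling the annotation out of an abbreviated copy of the HEM3_HUMAN entry. The function names (`parse_entry`, `comment_topics`, `keywords`) are hypothetical, and the topic detection assumes the simplified CC layout shown on the slide; for real work a maintained parser such as the one in Biopython would be a safer choice.

```python
import re
from collections import defaultdict

# Abbreviated lines from the HEM3_HUMAN entry shown on the slide.
ENTRY = """\
ID   HEM3_HUMAN STANDARD; PRT; 361 AA.
AC   P08397; P08396; Q16012;
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""

# Assumption: a CC line that opens a new topic starts with "TOPIC NAME: ".
TOPIC = re.compile(r"^([A-Z][A-Z ]*): (.*)$")


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue
        fields[line[:2]].append(line[2:].strip())
    return fields


def comment_topics(cc_lines):
    """Merge CC lines into a {topic: text} dict, joining continuation lines."""
    topics = {}
    current = None
    for line in cc_lines:
        match = TOPIC.match(line)
        if match:
            current = match.group(1)
            topics[current] = match.group(2)
        elif current is not None:
            topics[current] += " " + line
    return topics


def keywords(kw_lines):
    """Split the KW lines into a flat list of keywords."""
    joined = " ".join(kw_lines).rstrip(".")
    return [k.strip() for k in joined.split(";") if k.strip()]


fields = parse_entry(ENTRY)
topics = comment_topics(fields["CC"])
kws = keywords(fields["KW"])
print(topics["FUNCTION"])
print(kws)
```

The keyword list and the FUNCTION comment are often the quickest route from an accession number to a first functional description.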
Annotation in Motif databases: INTERPRO http://interpro.ebi.ac.uk/servlet/IEntry?ac=IPR000860
Genome databases • Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis) • Some deal with multiple genomes (e.g. TIGR microbial genomes database) • The level of annotation can be extensive • Many are much more than sequence repositories extending the sequence with tons of information (e.g. mutants; strains; complementation plasmids etc.) • If you are working with a model organism, chances of obtaining reliable functional annotations are improved
Genome database: YPD http://www.proteome.com/databases/YPD/reports/HEM3.html
Function assignment by homology I • Useful if you just have a sequence • The most common bioinformatics procedure • Search your protein of interest against primary databases; if you find a homologue with high sequence identity, it probably performs a similar function • Many tools are available (BLAST, FASTA, Smith-Waterman search) • Beware of annotation by homology: • the relationship between sequence similarity and function is not straightforward • there is a danger of propagating incorrect functional information
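To make the idea of a similarity search concrete, here is a minimal sketch of the Smith-Waterman local alignment score, the exhaustive method behind "S-W search". The scoring values (match, mismatch, linear gap penalty) are illustrative assumptions; real search tools use substitution matrices such as BLOSUM or PAM and affine gap penalties, and BLAST and FASTA are faster heuristics rather than this exact algorithm.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a simple match/mismatch score and a linear gap penalty
    (illustrative values, not those of any particular tool).
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences sharing the local region "ACDE".
print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

A high score only says that two sequences are similar; whether the function can be transferred is a separate question, as the warnings on the slide point out.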
Function assignment by homology II • Consider databases which distinguish experimental function assignments from homology based ones (e.g. YPD/WormPD, EcoCyc) • Or use databases which employ more rigorous automated annotation tools (e.g. HAMAP @ SwissProt) “Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”
Genome database: YPD http://www.proteome.com/databases/YPD/reports/HEM3.html
Functional assignment “without homology” • Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches • They exploit other relationships between proteins which are used as indicators of shared function • Phylogenetic profiles • “Rosetta stone genes”
Phylogenetic profiles Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
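In the phylogenetic profile approach, each protein is described by the pattern of genomes in which a homologue is present or absent, and proteins with matching patterns are taken as candidates for a shared function. The sketch below illustrates this under stated assumptions: the genome names, protein names and presence data are all hypothetical, and grouping by identical profile is the simplest possible comparison.

```python
from collections import defaultdict

# Hypothetical genomes and homologue presence data, for illustration only.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]
HOMOLOGS = {
    "P1": {"genome_A", "genome_C"},
    "P2": {"genome_A", "genome_C"},
    "P3": {"genome_B", "genome_C", "genome_D"},
    "P4": {"genome_A", "genome_B", "genome_C", "genome_D"},
    "P5": {"genome_B", "genome_C", "genome_D"},
}


def profile(protein):
    """Presence (1) or absence (0) of a homologue in each genome, in order."""
    return tuple(int(g in HOMOLOGS[protein]) for g in GENOMES)


def group_by_profile(proteins):
    """Return groups of two or more proteins that share an identical profile."""
    groups = defaultdict(list)
    for p in proteins:
        groups[profile(p)].append(p)
    return {prof: sorted(m) for prof, m in groups.items() if len(m) > 1}


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(x != y for x, y in zip(profile(p), profile(q)))


linked = group_by_profile(HOMOLOGS)
print(linked)
```

The `hamming` helper hints at a looser criterion: profiles that differ in only a few genomes could also be treated as linked, at the cost of more false positives.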
More methods… Marcotte EM, et al., Nature (1999) 402:83-86 Enright AJ, et al., Nature (1999) 402:86-90
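The “Rosetta stone” (gene fusion) idea can be sketched in a few lines: two separate proteins in one genome are predicted to be functionally linked if their counterparts occur fused into a single protein in another genome. In this illustration every gene name, domain label and the `rosetta_links` function are hypothetical, and each protein is reduced to a set of domain families, which is a strong simplification of the sequence comparisons used in practice.

```python
from itertools import combinations

# Hypothetical data: each protein is described by the domain families it contains.
QUERY = {"geneA": {"dom1"}, "geneB": {"dom2"}, "geneC": {"dom3"}}
REFERENCE = {"fusedAB": {"dom1", "dom2"}, "other": {"dom3", "dom4"}}


def rosetta_links(query, reference):
    """Find pairs of query proteins whose domains co-occur in one reference protein.

    Returns a list of (query_gene_1, query_gene_2, reference_protein) tuples.
    """
    links = []
    for g1, g2 in combinations(sorted(query), 2):
        d1, d2 = query[g1], query[g2]
        if d1 & d2:
            # Overlapping domains suggest homologues, not fusion partners.
            continue
        for ref_name in sorted(reference):
            ref_domains = reference[ref_name]
            if d1 <= ref_domains and d2 <= ref_domains:
                links.append((g1, g2, ref_name))
    return links


print(rosetta_links(QUERY, REFERENCE))
```

Domains that appear in many unrelated proteins would generate spurious links here, so a real analysis would need to filter out such promiscuous domains.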
Functional assignment “without homology” • Some access over the WWW • but experimental • and only for certain organisms (Yeast, E. coli, M. tuberculosis) • many proprietary methods • Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects
Collecting information for many genes • Usually for “large-scale biology” (e.g. micro-array experiments) • Genome Databases • Functional classification schemes
Genome Databases • Genome sequencing projects are now the primary driving force for extensive functional annotation • We have the genes (ORFs), we want the functions: FUNCTIONAL GENOMICS
Functional classification schemes I • Dealing with large sets of genes requires functional classification schemes • Tentative schemes as early as 1983; use driven by genome sequencing projects • First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)] • The majority of current schemes are heavily influenced by the ‘Riley scheme’ • ‘2nd generation’ schemes are now being developed
Functional classification schemes II • Most schemes can be thought of as trees • Progression along the tree (root to leaves) represents increasingly specific functions • ORFs are generally associated with leaf nodes (but of course, they are also associated with intermediary nodes) • Examples of use: • create gene sets linked by functionality (e.g. to detect functional motifs) • validate a functional connection between genes (e.g. gene expression studies)
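A tree-like scheme of this kind maps naturally onto nested dictionaries, with ORF lists at the leaves. The sketch below shows how a gene set linked by functionality could be collected from any node; the category names are only illustrative, and the ORF identifiers and helper names (`orfs_under`, `find`) are hypothetical.

```python
# Hypothetical fragment of a tree-like scheme: dicts are categories,
# lists are the ORFs attached to a leaf category.
SCHEME = {
    "Metabolism of small molecules": {
        "Amino acids": {"Alanine": ["orf1", "orf2"]},
        "Energy metabolism": {
            "Glycolysis": ["orf3", "orf4", "orf5"],
            "Fermentation": ["orf6"],
        },
    },
}


def orfs_under(node):
    """Collect every ORF attached at or below a node of the scheme."""
    if isinstance(node, list):
        return list(node)
    collected = []
    for child in node.values():
        collected.extend(orfs_under(child))
    return collected


def find(node, name):
    """Depth-first search for a category by name; returns its subtree or None."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == name:
                return child
            found = find(child, name)
            if found is not None:
                return found
    return None


energy_set = orfs_under(find(SCHEME, "Energy metabolism"))
print(sorted(energy_set))
```

Moving from a leaf towards the root simply widens the gene set, which is what makes such schemes convenient for checking whether co-expressed genes share a functional class.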
An example scheme… GenProtEC
Metabolism of small molecules (900 ORFs)
  Amino acids (112 ORFs)
    Alanine: 2 ORFs
    etc.
  Central intermediary metabolism
    Amino sugars: 8 ORFs
    etc.
  Energy metabolism
    Aerobic respiration: 32 ORFs
    Fermentation: 22 ORFs
    Glycolysis: 18 ORFs
    etc.
Issues • Functions: Apples and Oranges • Multi-dimensionality • Multi-functionality
Issues: Apples and Oranges • Function is an umbrella catch-all term • Schemes do not distinguish between aspects of functions • Most commonly they mix gene product type (T), activity (A) and cellular role (R) Cell division (R) : DNA replication (A) Osmotic adaptation (R) : Ion channel (T,A)
Issues - Multi-dimensionality I • Human trypsin functions: • Biochemical: peptide bond hydrolysis • Molecular: proteolytic enzyme • Cellular: protein degradation • Physiological: digestion • Could conceive a number of other dimensions • Cellular location • Regulation
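One way to picture these dimensions is as separate fields of a single annotation record, so that two proteins can agree on one dimension while differing on another. This is only a sketch: the class and function names are hypothetical, and the second protein is an invented example rather than a real annotation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FunctionAnnotation:
    """One protein described along several independent functional dimensions."""
    protein: str
    biochemical: str
    molecular: str
    cellular: str
    physiological: str


# Values taken from the trypsin example on the slide.
trypsin = FunctionAnnotation(
    protein="human trypsin",
    biochemical="peptide bond hydrolysis",
    molecular="proteolytic enzyme",
    cellular="protein degradation",
    physiological="digestion",
)

# Hypothetical second protease, invented purely for comparison.
other = FunctionAnnotation(
    protein="hypothetical protease X",
    biochemical="peptide bond hydrolysis",
    molecular="proteolytic enzyme",
    cellular="protein degradation",
    physiological="blood clotting",
)


def shared_dimensions(a, b):
    """Names of the functional dimensions on which two annotations agree."""
    return [
        f.name
        for f in fields(a)
        if f.name != "protein" and getattr(a, f.name) == getattr(b, f.name)
    ]


print(shared_dimensions(trypsin, other))
```

A scheme with a single "function" field would have to call these two proteins either the same or different, and either answer loses information.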
Issues - Multi-dimensionality II • Why differentiate function and process? • Figure of cell cycle-dependent Yeast gene expression clusters (Pat Brown lab - Stanford)
Issues - Multi-functionality • Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection • Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure • Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme
Gene Ontology - a collaboration • Drosophila (fruit fly) - FlyBase • Saccharomyces (yeast) - Saccharomyces Genome Database (SGD) • Mus (mouse) - Mouse Genome Database (MGD)
Gene Ontology - the next generation • Multi-dimensional: • functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase) • process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism) • cellular component • Extensive: depth 11; nearly 4000 terms • More complex organisation: away from tree structure • Theoretically applicable to all species (designed for multicellular eukaryotes)
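The move "away from tree structure" means a term may have more than one parent, so the scheme becomes a directed acyclic graph. The sketch below shows the practical consequence, collecting every ancestor of a term across all of its parents. The term names and the edges between them are hypothetical and are not taken from the actual Gene Ontology.

```python
# Hypothetical term graph: each term maps to its list of parent terms.
# "purine metabolism" has two parents, which a strict tree cannot express.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "nucleotide metabolism": ["metabolism"],
    "heterocycle metabolism": ["metabolism"],
    "purine metabolism": ["nucleotide metabolism", "heterocycle metabolism"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("purine metabolism")))
```

A gene product annotated to a term is implicitly annotated to all of that term's ancestors, so this traversal is the basic operation behind building gene sets at any level of the graph.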
Gene Ontology - current status http://www.geneontology.org/
Where to look for functional information - single protein • With 1 or a few genes: • Primary databases (e.g. SwissProt) • Model organism databases (e.g. GenProtEC; SGD; WormPD) • Metabolic/Pathway databases (e.g. KEGG) • Value-added databases (e.g. Motif databases; Disease databases) • By homology • Not by homology
Where to look for functional information - protein sets • Need some sort of functional classification scheme: • Tree-like schemes (e.g. TIGR, GenProtEC) • Gene Ontology (FlyBase, MGD, SGD) • For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR) • Currently, greatest genome coverage is by PEDANT (but it is not manually curated)
Conclusions • Functional information is available but it is rarely centralised • Function is a very broad term; it is hard to know whether the information you need will be available at the level you need it • New schemes (e.g. GO) are emerging which try to cope with functional annotation better • And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based) • You still need to validate predictions experimentally
A survey of (some) current schemes • 1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL) • 2) SubtiList: Bacillus subtilis scheme (Institut Pasteur) • 3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences) • 4) TIGR: microbial genomes scheme (The Institute for Genomic Research) • 5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes) • 6) WIT: multi-organism scheme (metabolic reconstruction) (What Is There; ANL) • 7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)
Conclusions - Scheme comparison I • Similar in the coverage of function (although with widely varying ‘granularity’) • ...yet different enough that direct comparison is complex • Most deal essentially with unicellular microbial organisms (MIPS is tackling this) • Certain ‘niche’ schemes (e.g. WIT/KEGG) • ...or user community tailored schemes (e.g. SubtiList)
WWW sites I • Primary databases (Sequence): • SwissProt: • http://www.expasy.ch/sprot • PIR: • http://www-nbrf.Georgetown.edu/ • NCBI databases: • http://www.ncbi.nlm.nih.gov/Database/index.html • Primary databases (Structure) • Protein Data Bank: • http://www.rcsb.org/ • Macromolecular Structure Database: • http://msd.ebi.ac.uk/ • Value added: • INTERPRO: • http://interpro.ebi.ac.uk/
WWW sites II • Single genome databases: • SubtiList: • http://genolist.pasteur.fr/SubtiList/ • Saccharomyces Genome Database: • http://genome-www.stanford.edu/Saccharomyces/ • EcoCyc: • http://ecocyc.pangeasystems.com/ • GenProtEC: • http://genprotec.mdbl.edu/ • FlyBase: • http://flybase.bio.indiana.edu/ • Mouse Genome Database (MGD): • http://www.informatics.jax.org/ • Yeast Protein Database (YPD) and WormPD: • http://www.proteome.com/
WWW sites III • Multiple genome databases • The Institute for Genomic Research: • http://www.tigr.org/microbialdb • MIPS/PEDANT: • http://pedant.mips.biochem.mpg.de/ • HAMAP: • http://www.expasy.ch/sprot/hamap/ • Pathway databases • KEGG: • http://www.genome.ad.jp/kegg/ • WIT: • http://igweb.integratedgenomics.com/IGwit/ • Non-homology based function prediction • Mycobacterium tuberculosis: • http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html • Yeast: • http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html • A relevant paper • http://www.biochem.ucl.ac.uk/~rison/Publications/index.html