Protein function Where to find it. How to predict it. How to classify it.

Protein function Where to find it. How to predict it. How to classify it. Stuart Rison, Department of Biochemistry, UCL, rison@biochem.ucl.ac.uk

Outline • Collecting functional information: • Small scale (single gene) • Large scale (sets of genes) • Function annotation schemes • Problems with functional assignments • [Comparing current schemes]

Collecting information for single genes • from 1° databases • from 2° databases • from Genome Databases (Model organisms) • by homology • not by homology

Annotation in databases: 1° and 2° databases • Some information can be found in 'primary' databases (sequence and structure databases) • Usually limited, although it can sometimes be quite informative (e.g. SwissProt) • Core data: sequence, citation information and taxonomic data • Annotation: protein function; post-translational modifications; domains and sites; associated diseases; sequence conflicts/variants • Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases) which are often rich in information

Annotation in 1° databases: SwissProt
ID HEM3_HUMAN STANDARD; PRT; 361 AA.
AC P08397; P08396; Q16012; …
DE PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN HMBS OR PBGD.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC DYSFUNCTION…
CC SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW Alternative splicing; Disease mutation.
… (Sequence variations/Sequence)
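Because each line of a SwissProt-style record starts with a two-letter line code, the annotation can be pulled out with very little code. The following is a minimal sketch only: the record is an abbreviated copy of the HEM3_HUMAN example, the `TOPICS` set lists just the comment topics seen on the slide (an assumption, not the full format), and the function names are made up for illustration. For real records a maintained parser (e.g. the Bio.SwissProt module in Biopython) is likely a safer choice.

```python
from collections import defaultdict

# Comment topics seen on the slide; an assumption, not the complete list.
TOPICS = {"FUNCTION", "CATALYTIC ACTIVITY", "COFACTOR", "PATHWAY",
          "ALTERNATIVE PRODUCTS", "DISEASE", "SIMILARITY"}

# Abbreviated copy of the record shown on the slide.
RECORD = """\
ID   HEM3_HUMAN STANDARD; PRT; 361 AA.
AC   P08397; P08396; Q16012;
GN   HMBS OR PBGD.
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""


def group_by_line_code(text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code.strip() and value:
            fields[code].append(value)
    return fields


def comment_topics(cc_lines):
    """Turn CC lines into {topic: text}; other lines continue the last topic."""
    topics, current = {}, None
    for value in cc_lines:
        head, sep, rest = value.partition(":")
        if sep and head in TOPICS:
            current = head
            topics[current] = rest.strip()
        elif current is not None:
            topics[current] += " " + value
    return topics


def keywords(kw_lines):
    """Split the KW lines into a flat list of keywords."""
    joined = " ".join(kw_lines).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


fields = group_by_line_code(RECORD)
entry_name = fields["ID"][0].split()[0]
topics = comment_topics(fields["CC"])
print(entry_name)
print(topics["FUNCTION"])
print(keywords(fields["KW"]))
```

Keeping the topics in a dictionary makes it easy to ask only for the FUNCTION or DISEASE text of a record, which is usually what a functional survey needs.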

Annotation in Motif databases: INTERPRO http://interpro.ebi.ac.uk/servlet/IEntry?ac=IPR000860

Genome databases • Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis) • Some deal with multiple genomes (e.g. TIGR microbial genomes database) • The level of annotation can be extensive • Many are much more than sequence repositories extending the sequence with tons of information (e.g. mutants; strains; complementation plasmids etc.) • If you are working with a model organism, chances of obtaining reliable functional annotations are improved

Genome database: YPD http://www.proteome.com/databases/YPD/reports/HEM3.html

Function assignment by homology I • If you just have a sequence • The most common bioinformatics procedure • Search your protein of interest against primary databases; if you find a homologue with high sequence identity, chances are it performs a similar function • Many, many tools (BLAST, FASTA, S-W Search) • Beware of annotation by homology • relationship between seq. similarity and function not straightforward • danger of propagation of incorrect functional information
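The caution above can be made concrete with a small sketch. It computes percent identity from a pairwise alignment that has already been produced by some search tool, and only proposes a function when the best hit is itself experimentally characterised, which is one simple way to limit the propagation of incorrect annotation. The 40% cut-off, the hit dictionaries and the function names are illustrative assumptions, not recommended values or a real tool's output format.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def propose_function(hits, min_identity=40.0):
    """Pick the most similar hit that has experimental evidence.

    The threshold is an arbitrary illustrative value; high identity does
    not guarantee shared function.
    """
    usable = [h for h in hits
              if h["identity"] >= min_identity
              and h["evidence"] == "experimental"]
    if not usable:
        return None
    best = max(usable, key=lambda h: h["identity"])
    # Record how the assignment was made so it is not mistaken for an
    # experimental result further down the line.
    return {"function": best["function"],
            "evidence": "by homology",
            "source": best["id"]}


# Hypothetical hits for one query protein.
hits = [
    {"id": "hit1", "function": "function X", "identity": 72.0,
     "evidence": "experimental"},
    {"id": "hit2", "function": "function Y", "identity": 95.0,
     "evidence": "by homology"},
    {"id": "hit3", "function": "function Z", "identity": 25.0,
     "evidence": "experimental"},
]

print(percent_identity("ACDE-FG", "ACDQWFG"))
print(propose_function(hits))
```

Here the 95% hit is ignored because its own annotation was inferred by homology; the proposal comes from the 72% hit with experimental support.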

Function assignment by homology II • Consider databases which distinguish experimental function assignments from homology based ones (e.g. YPD/WormPD, EcoCyc) • Or use databases which employ more rigorous automated annotation tools (e.g. HAMAP @ SwissProt) “Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”


Functional assignment “without homology” • Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches • They exploit other relationships between proteins which are used as indicators of shared function • Phylogenetic profiles • “Rosetta stone genes”

Phylogenetic profiles Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
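The idea behind phylogenetic profiles is that proteins present and absent in the same set of genomes are likely to be functionally linked. A minimal sketch follows, using made-up protein and genome names; the published method works on many genomes and uses a similarity threshold to decide presence, which is reduced here to a given presence set.

```python
from itertools import combinations

# Hypothetical genomes and the proteins that have a homologue in each.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]
PRESENT_IN = {
    "P1": {"genome_A", "genome_C"},
    "P2": {"genome_A", "genome_C"},
    "P3": {"genome_B", "genome_D"},
    "P4": {"genome_A", "genome_B", "genome_C", "genome_D"},
}


def build_profile(present_in, genomes):
    """Presence (1) / absence (0) of one protein across the genome list."""
    return tuple(1 if g in present_in else 0 for g in genomes)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_diff=0):
    """Pairs of proteins whose profiles differ in at most max_diff genomes."""
    # Proteins found everywhere or nowhere carry no signal, so drop them.
    informative = {name: p for name, p in profiles.items()
                   if 0 < sum(p) < len(p)}
    pairs = []
    for (n1, p1), (n2, p2) in combinations(sorted(informative.items()), 2):
        if hamming(p1, p2) <= max_diff:
            pairs.append((n1, n2))
    return pairs


profiles = {name: build_profile(found, GENOMES)
            for name, found in PRESENT_IN.items()}
print(profiles["P1"])
print(linked_pairs(profiles))
```

P1 and P2 share the profile (1, 0, 1, 0) and are reported as linked; P4 is present in every genome and is therefore excluded as uninformative.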

Rosetta Stone method
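In outline, the Rosetta Stone approach predicts that two separate proteins are functionally linked when their homologues are found fused into a single protein in another genome. A minimal sketch under stated assumptions: the similarity hits are given as (query, target, start, end) tuples with coordinates on the target protein, all names are hypothetical, and two queries are linked when they match non-overlapping regions of the same target.

```python
from collections import defaultdict
from itertools import combinations


def rosetta_links(hits):
    """Return (query_1, query_2, fused_target) for queries that match
    non-overlapping regions of the same target protein."""
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))

    links = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                first, second = sorted((q1, q2))
                links.add((first, second, target))
    return sorted(links)


# Hypothetical hits: geneA and geneB cover different halves of fusedX,
# geneC overlaps both of them, geneD hits an unrelated protein.
hits = [
    ("geneA", "fusedX", 1, 200),
    ("geneB", "fusedX", 210, 450),
    ("geneC", "fusedX", 150, 300),
    ("geneD", "otherY", 1, 100),
]

print(rosetta_links(hits))
```

Only geneA and geneB are linked here. A real analysis would also need to check that the two queries are not themselves homologous to each other and to filter out promiscuous domains, which this sketch does not attempt.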

More methods… Marcotte EM, et al., Nature (1999) 402:83-86 Enright AJ, et al., Nature (1999) 402:86-90

Functional assignment “without homology”

Functional assignment “without homology” • Some access over the WWW • but experimental • and only for certain organisms (Yeast, E. coli, M. tuberculosis) • many proprietary methods • Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects

Collecting information for many genes • Usually for “large-scale biology” (e.g. micro-array experiments) • Genome Databases • Functional classification schemes

Genome Databases • Genome sequencing projects are now the primary driving force for extensive functional annotation • We have the genes (ORFs), we want the functions: FUNCTIONAL GENOMICS

(… more ’omes)

Functional classification schemes I • Dealing with large sets of genes requires functional classification schemes • Tentative schemes as early as 1983; use driven by genome sequencing projects • First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)] • The majority of current schemes are heavily influenced by the ‘Riley scheme’ • ‘2nd generation’ schemes are now being developed

Functional classification schemes II • Most schemes can be thought of as trees • Progression along the tree (root to leaves) represents increasingly specific functions • ORFs are generally associated with leaf nodes (but of course, they are also associated with intermediary nodes) • Examples of use: • create gene sets linked by functionality (e.g. to detect functional motifs) • validate a functional connection between genes (e.g. gene expression studies)

An example scheme… GenProtEC
Metabolism of small molecules (900 ORFs)
  Amino Acids (112 ORFs)
    Alanine: 2 ORFs
    etc.
  Central Intermediary Metabolism
    Amino sugars: 8 ORFs
    etc.
  Energy Metabolism
    Aerobic respiration: 32 ORFs
    Fermentation: 22 ORFs
    Glycolysis: 18 ORFs
    etc.
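A tree-like scheme of this kind maps naturally onto a small recursive data structure. The sketch below uses category names from the example, but the ORF identifiers are invented and the class is an illustration, not how any of the databases store their schemes. It shows the two uses mentioned earlier: ORFs can hang off leaf or intermediary nodes, and the gene set for any node is everything beneath it.

```python
class SchemeNode:
    """One category in a tree-like functional classification scheme."""

    def __init__(self, name, orfs=()):
        self.name = name
        self.orfs = set(orfs)      # ORFs assigned directly to this node
        self.children = []

    def add(self, name, orfs=()):
        """Create a child category and return it."""
        child = SchemeNode(name, orfs)
        self.children.append(child)
        return child

    def all_orfs(self):
        """ORFs assigned to this node or to any node below it."""
        found = set(self.orfs)
        for child in self.children:
            found |= child.all_orfs()
        return found

    def find(self, name):
        """Depth-first search for a category by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# Hypothetical ORF identifiers attached to categories from the example.
root = SchemeNode("Metabolism of small molecules")
amino = root.add("Amino acids")
amino.add("Alanine", ["orf001", "orf002"])
energy = root.add("Energy metabolism", ["orf010"])   # intermediary node
energy.add("Aerobic respiration", ["orf011", "orf012"])
energy.add("Fermentation", ["orf013"])

print(sorted(root.find("Energy metabolism").all_orfs()))
print(len(root.all_orfs()))
```

Moving from the root towards the leaves narrows the gene set, which mirrors the increasingly specific functions along the tree.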

Issues • Functions: Apples and Oranges • Multi-dimensionality • Multi-functionality

Issues: Apples and Oranges • Function is an umbrella catch-all term • Schemes do not distinguish between aspects of functions • Most commonly they mix gene product type (T), activity (A) and cellular role (R) • e.g. Cell division (R) : DNA replication (A) • e.g. Osmotic adaptation (R) : Ion channel (T, A)

Issues - Multi-dimensionality I • Human trypsin functions: • Biochemical: peptide bond hydrolysis • Molecular: proteolytic enzyme • Cellular: protein degradation • Physiological: digestion • Could conceive a number of other dimensions • Cellular location • Regulation

Issues - Multi-dimensionality II • Why differentiate function and process? • Figure of cell cycle-dependent Yeast gene expression clusters (Pat Brown lab - Stanford)

Issues - Multi-functionality • Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection • Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure • Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme

Gene Ontology - a collaboration • Drosophila (fruit fly) - FlyBase • Saccharomyces Genome Database (SGD) • Mus (mouse) - Mouse Genome Database (MGD)

Gene Ontology - the next generation • Multi-dimensional: • functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase) • process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism) • cellular component • Extensive: depth 11; nearly 4000 terms • More complex organisation: away from tree structure • Theoretically applicable to all species (designed for multicellular eukaryotes)
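Moving "away from tree structure" means a term may have more than one parent, so the ontology is a directed acyclic graph rather than a tree. A minimal sketch of what that changes in practice: the term names and parent links below are a toy example made up for illustration and are not taken from the actual Gene Ontology files.

```python
# Toy process ontology: each term maps to its list of parent terms.
# "purine metabolism" deliberately has two parents.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "nucleotide metabolism": ["metabolism"],
    "heterocycle metabolism": ["metabolism"],
    "purine metabolism": ["nucleotide metabolism", "heterocycle metabolism"],
}


def ancestors(term, parents):
    """All terms reachable by following parent links from the given term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def genes_for_term(term, annotations, parents):
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(gene for gene, annotated in annotations.items()
                  if annotated == term or term in ancestors(annotated, parents))


# Hypothetical gene products, each annotated to its most specific term.
ANNOTATIONS = {
    "geneA": "purine metabolism",
    "geneB": "nucleotide metabolism",
    "geneC": "metabolism",
}

print(sorted(ancestors("purine metabolism", PARENTS)))
print(genes_for_term("heterocycle metabolism", ANNOTATIONS, PARENTS))
```

In a strict tree geneA could sit under only one of its two parent categories; in the graph it is found from both, which is the practical gain of the more complex organisation.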

Gene Ontology - Process

Gene Ontology - current status http://www.geneontology.org/

Where to look for functional information - single protein • With 1 or a few genes: • Primary databases (e.g. SwissProt) • Model organism databases (e.g. GenProtEC; SGD; WormPD) • Metabolic/Pathway databases (e.g. KEGG) • Value-added databases (e.g. Motif databases; Disease databases) • By homology • Not by homology

Where to look for functional information - protein sets • Need some sort of functional classification scheme: • Tree-like schemes (e.g. TIGR, GenProtEC) • Gene Ontology (FlyBase, MGD, SGD) • For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR) • Currently, greatest genome coverage is by PEDANT (but it is not manually curated)

Conclusions • Functional information is available but it is rarely centralised • ‘Function’ is a very broad term; hard to know if the information you need will be available at the level you need it • New schemes (e.g. GO) are emerging which try to cope with functional annotation better • And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based) • You still need to validate predictions experimentally

A survey of (some) current schemes • 1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL) • 2) SubtiList: Bacillus subtilis scheme (Institut Pasteur) • 3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences) • 4) TIGR: microbial genomes scheme (The Institute for Genomic Research) • 5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes) • 6) WIT: multi-organism scheme (metabolic reconstruction) (What Is There; ANL) • 7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)

FuncWheel for the Combination Scheme

Conclusions - Scheme comparison I • Similar in the coverage of function (although ‘granularity’ varies widely) • ...yet different enough that direct comparison is complex • Essentially deal with unicellular microbial organisms (MIPS is tackling this) • Certain ‘niche’ schemes (e.g. WIT/KEGG) • ...or user community tailored schemes (e.g. SubtiList)

WWW sites I • Primary databases (Sequence): • SwissProt: • http://www.expasy.ch/sprot • PIR: • http://www-nbrf.Georgetown.edu/ • NCBI databases: • http://www.ncbi.nlm.nih.gov/Database/index.html • Primary databases (Structure) • Protein Data Bank: • http://www.rcsb.org/ • Macromolecular Structure Database: • http://msd.ebi.ac.uk/ • Value added: • INTERPRO: • http://interpro.ebi.ac.uk/

WWW sites II • Single genome databases: • SubtiList: • http://genolist.pasteur.fr/SubtiList/ • Saccharomyces Genome Database: • http://genome-www.stanford.edu/Saccharomyces/ • EcoCyc: • http://ecocyc.pangeasystems.com/ • GenProtEC: • http://genprotec.mdbl.edu/ • FlyBase: • http://flybase.bio.indiana.edu/ • Mouse Genome Database (MGD): • http://www.informatics.jax.org/ • Yeast Protein Database (YPD) and WormPD: • http://www.proteome.com/

WWW sites III • Multiple genome databases • The Institute for Genomic Research: • http://www.tigr.org/microbialdb • MIPS/PEDANT: • http://pedant.mips.biochem.mpg.de/ • HAMAP: • http://www.expasy.ch/sprot/hamap/ • Pathway databases • KEGG: • http://www.genome.ad.jp/kegg/ • WIT: • http://igweb.integratedgenomics.com/IGwit/ • Non-homology based function prediction • Mycobacterium tuberculosis: • http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html • Yeast: • http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html • A relevant paper • http://www.biochem.ucl.ac.uk/~rison/Publications/index.html
