Predicting Active Site Residue Annotations in the Pfam Database

[ Publication date: 9 August 2007 ] Authors: Jaina Mistry; Alex Bateman; Robert D Finn [authors of this paper and of the Pfam database] [ BMC Bioinformatics ] Presentation by: KEYUR MALAVIYA

TOPICS COVERED
• Introduction
• Background
• Construction and content
• Data output and file formats (Utility)
• Transfer of experimental data within Pfam alignments
  • UniProtKB data
  • CSA data
• Assessing sensitivity and specificity
• Comparison:
  • Comparing Pfam to PROSITE
  • Comparing Pfam with MEROPS
  • PROSITE with MEROPS
• Conclusion


Introduction
• Goal of this paper: to increase the number of active site annotations in Pfam
• Approach: a strict set of rules is chosen to reduce the rate of false positives; these rules enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family
• Results:
  • Only 3% of predicted sequences are false positives
  • Predicted 606110 active site residues, of which 94% are not found in UniProtKB
  • The tool developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download
  • This tool is useful in proteome annotation, comparative genomics, protein evolution and active site characterization


Background: Predicting Active Site Residue Annotations in the Pfam Database

Background: Pfam Database
• Pfam is a collection of protein families and domains
• Pfam contains multiple protein alignments and profile-HMMs of these families
• Function: to view the domain organization of proteins
• 74% of protein sequences have at least one match to Pfam (sequence coverage is 74%)
• 5% of Pfam families are enzymatic
• Within these families, only a small fraction (<0.5%) of sequences have had the residues responsible for catalysis determined
• The structure and chemical properties of these residues (the active site) determine the chemistry of the enzyme

Background: Active site
• Active site: the active site of an enzyme contains the catalytic and binding sites
• Binding site: a region on a protein (also DNA or RNA) to which specific other molecules and ions, called ligands, bind
• Ligand: binds to and forms a complex with a biomolecule to serve a biological purpose, i.e. it is an effector molecule binding to a site on a target protein
• Enzymes: control the flow of metabolites within a cell and catalyze virtually all reactions that make or modify molecules

TO DO: Information about other databases: • NCBI BLAST: • Catalytic Site Atlas (CSA): • UniProtKB: • PROSITE: • SMART and MEROPS:

The problem and the solution
• Pfam [1] release 20.0: 8296 protein families
• Active site residues experimentally determined: only ~0.4% of sequences in enzymatic Pfam families
• Need to overcome the lack of experimental data. HOW?
• Computationally predict active sites in protein sequences
• Two broad categories:
  1) computational methods that transfer experimentally characterized active site data by similarity
  2) those that predict active site residues ab initio
• ab initio methods exploit known properties of active sites:
  • they are usually found buried within a cleft of a protein
  • mutations in them can often increase the stability of an enzyme
  • active site residues are highly conserved

TO DO: ab initio methods
• Geometry data, stability profiles and sequence conservation
• Evolutionary trace (ET)
• Neural networks [19] and support vector machines [20, 21]
• All have a relatively high rate of false positives (FPs)

TO DO: Similarity transfer based methods:


Construction and content
• The Pfam database is renowned for having no known false positives in its alignments
• Conservative prediction is achieved by a set of rules that allows transfer of active site annotation from one protein to another protein in the same Pfam alignment
• To predict active site residues:
  • identify sequences with experimentally verified active site residues
  • use this information to predict active site residues in other members of that family
• The algorithm steps of this rule-based methodology follow; steps 1 and 2 are already present in Pfam

Logic of the rule-based methodology (step 1): find a homologous set of proteins and generate a protein alignment

Logic of the rule-based methodology (step 2): identify the positions of all experimentally verified active sites in the alignment
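Locating annotated residues in an alignment means converting a residue number in the ungapped sequence into an alignment column. The sketch below is one minimal way to do that; the function names, the gap symbols and the toy sequence are illustrative assumptions, not taken from the paper or from Pfam's code.

```python
from typing import Dict, Iterable

GAP_CHARS = "-."  # assumed gap symbols; adjust to the alignment format in use


def residue_to_column(aligned_seq: str, residue_pos: int) -> int:
    """Return the 1-based alignment column holding the residue_pos-th residue (1-based)."""
    seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char not in GAP_CHARS:
            seen += 1
            if seen == residue_pos:
                return column
    raise ValueError(f"sequence has only {seen} residues, asked for {residue_pos}")


def active_site_columns(aligned_seq: str, residue_positions: Iterable[int]) -> Dict[int, str]:
    """Map annotated residue positions to {alignment column: residue letter}."""
    columns = {}
    for pos in residue_positions:
        column = residue_to_column(aligned_seq, pos)
        columns[column] = aligned_seq[column - 1]
    return columns


# Hypothetical aligned sequence: residues M1 K2 D3 L4 E5 H6 with three gap columns.
toy = "MK--DLE-H"
print(active_site_columns(toy, [3, 5, 6]))  # {5: 'D', 7: 'E', 9: 'H'}
```

Working in alignment columns rather than residue numbers is what lets annotations from one sequence be compared with any other sequence in the same alignment.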


Logic of the rule-based methodology
• Seq1 contains 3 experimental active sites (D, E & H)
• Seq2 contains 2 experimentally defined active site residues (D & E)
• Apply step 3: H in seq2 is predicted to be an active site residue
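Step 3 can be sketched as follows. The sketch assumes the rule is: for two sequences that both carry experimental annotation, a column annotated in one is transferred to the other when the other has the identical residue in that column. The function name and the five-column toy alignment are hypothetical and chosen only to mirror the seq1/seq2 example.

```python
from typing import Dict, Set


def transfer_between_annotated(
    alignment: Dict[str, str], experimental: Dict[str, Set[int]]
) -> Dict[str, Set[int]]:
    """For each annotated sequence, return the columns (0-based) predicted from other annotated sequences."""
    predicted: Dict[str, Set[int]] = {name: set() for name in experimental}
    for name, own_cols in experimental.items():
        for other, other_cols in experimental.items():
            if other == name:
                continue
            # Only consider columns annotated in the other sequence but not in this one.
            for col in other_cols - own_cols:
                if alignment[name][col] == alignment[other][col]:
                    predicted[name].add(col)
    return predicted


# Toy alignment: D, E and H sit in columns 1, 3 and 4 (0-based).
alignment = {"seq1": "ADKEH", "seq2": "GDREH"}
experimental = {"seq1": {1, 3, 4}, "seq2": {1, 3}}

print(transfer_between_annotated(alignment, experimental))
# {'seq1': set(), 'seq2': {4}}
```

Here seq2 gains a prediction for the H column because seq1 has experimental evidence there and both sequences carry H at that position.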

Logic of the rule-based methodology
• Active site pattern: D in column 13, E in column 43 and H in column 45

Logic of the rule-based methodology
• Each unannotated sequence in the alignment is analyzed to see if it contains an exact match to the active site pattern
• Seq1 and seq2 now each contain 3 annotated active site residues (D, E & H)
• Seq3 contains residues D, E & H in the active site residue columns
• Apply step 5: D, E & H in seq3 are predicted to be active site residues
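The exact-match rule of step 5 can be sketched in a few lines. This is a minimal illustration under the assumption that a pattern is a mapping from alignment column to required residue; the names, the short toy alignment and the 0-based columns are hypothetical stand-ins for the columns 13, 43 and 45 of the slide example.

```python
from typing import Dict, List, Set


def predict_by_exact_match(
    alignment: Dict[str, str], pattern: Dict[int, str], annotated: Set[str]
) -> Dict[str, List[int]]:
    """Return predicted active site columns for unannotated sequences that match the whole pattern."""
    predictions: Dict[str, List[int]] = {}
    for name, aligned_seq in alignment.items():
        if name in annotated:
            continue  # already annotated, nothing to predict
        # Every pattern column must carry exactly the required residue.
        if all(aligned_seq[col] == residue for col, residue in pattern.items()):
            predictions[name] = sorted(pattern)
    return predictions


alignment = {
    "seq1": "ADKEH",
    "seq2": "GDREH",
    "seq3": "SDQEH",  # matches D, E and H in all pattern columns
    "seq4": "SDQEN",  # N instead of H, so no prediction is made
}
pattern = {1: "D", 3: "E", 4: "H"}

print(predict_by_exact_match(alignment, pattern, annotated={"seq1", "seq2"}))
# {'seq3': [1, 3, 4]}
```

Requiring every column to match, rather than most of them, is what keeps the transfer conservative: a single substitution in a pattern column, as in seq4, blocks the prediction.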

TO DO: Logic of the rule-based methodology
• When there are two distinct experimentally determined active site patterns within a family, each unannotated sequence is compared against each pattern as before
• There are cases where an unannotated sequence matches more than one active site pattern
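Comparing a sequence against several patterns can be sketched as below. The slide does not say how a match to more than one pattern is resolved, so this sketch only reports every pattern that matches and leaves the choice to the caller; the pattern names and toy sequences are hypothetical.

```python
from typing import Dict, List


def matching_patterns(aligned_seq: str, patterns: Dict[str, Dict[int, str]]) -> List[str]:
    """Return the names of all patterns whose columns (0-based) are matched exactly."""
    return [
        name
        for name, pattern in patterns.items()
        if all(aligned_seq[col] == residue for col, residue in pattern.items())
    ]


# Two hypothetical active site patterns observed within one family.
patterns = {
    "pattern_A": {1: "D", 3: "E", 4: "H"},
    "pattern_B": {1: "D", 4: "H"},
}

print(matching_patterns("SDQEH", patterns))  # ['pattern_A', 'pattern_B']
print(matching_patterns("SDQKH", patterns))  # ['pattern_B']
print(matching_patterns("SAQKH", patterns))  # []
```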

TO DO: Testing the rule-based methodology
• 8296 alignments from Pfam 20.0 and experimentally verified active site residues from two different databases, UniProtKB and CSA, were used
• The results were compared to each database's predictions

TO DO: Data output and file formats

Transfer of UniProtKB experimental data within Pfam alignments
• Input: 2735 experimentally determined active site annotations from UniProtKB 8.0 and the alignments in Pfam 20.0
• Pfam predicts 606110 active site residues (A-S-R)
• UniProtKB predicts 45685 A-S-R
• Overlap of predicted A-S-R annotation between Pfam and UniProtKB: Pfam reproduces 77% of the UniProtKB predictions
• Why is Pfam unable to predict the remaining 23% (10312 residues)?
• 55% (5601) of these 23% were found in Pfam alignments that did not contain an experimental UniProtKB A-S-R at that position
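The overlap between the two sets of predictions follows from the counts on this slide. The short calculation below only re-derives it from the 45685 UniProtKB predictions and the 10312 residues Pfam does not predict; the variable names are illustrative.

```python
uniprot_predicted = 45685       # A-S-R predicted by UniProtKB
not_predicted_by_pfam = 10312   # UniProtKB A-S-R that Pfam does not predict

# Residues predicted by both UniProtKB and Pfam.
overlap = uniprot_predicted - not_predicted_by_pfam
overlap_pct = round(100 * overlap / uniprot_predicted)
missed_pct = round(100 * not_predicted_by_pfam / uniprot_predicted)

print(overlap)      # 35373
print(overlap_pct)  # 77
print(missed_pct)   # 23
```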

Transfer of UniProtKB experimental data within Pfam alignments
• 55% (5601) of the 10312 residues that Pfam could not predict were found in Pfam alignments that did not contain an experimental UniProtKB A-S-R at that position
• Pfam predictions are based on transferring known experimental data within a Pfam alignment, so these residues cannot be predicted

Transfer of UniProtKB experimental data within Pfam alignments
TO DO:
• A substantial proportion (94%, 570765 residues) of our active site predictions are not present in UniProtKB
• Reason: UniProtKB only makes predictions for sequences in UniProtKB/Swiss-Prot, whereas we also make predictions for the automatically generated UniProtKB/TrEMBL entries
• For UniProtKB/Swiss-Prot alone, our methodology predicts 48943 residues compared with the 45685 predicted by UniProtKB
• Thus, we have 12570 additional active site predictions for the sequences in UniProtKB/Swiss-Prot
• In the reverse comparison of UniProtKB against Pfam, UniProtKB contains only 6% of the active site information contained within Pfam
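The share of Pfam predictions that are absent from UniProtKB can be checked from two counts given in this presentation: 606110 predicted residues in total and 570765 not present in UniProtKB. The variable names below are illustrative.

```python
pfam_predicted = 606110   # active site residues predicted by the Pfam method
not_in_uniprot = 570765   # of those, residues not present in UniProtKB

share_new = 100 * not_in_uniprot / pfam_predicted
share_shared = 100 - share_new

print(f"{share_new:.1f}%")     # 94.2%
print(f"{share_shared:.1f}%")  # 5.8%
print(pfam_predicted - not_in_uniprot)  # 35345 residues found in both
```

The complement, about 6%, is the portion of the Pfam active site information that UniProtKB already contains.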

Transfer of CSA experimental data within Pfam alignments
TO DO:
• CSA predicts 5517 active site annotations
• Pfam predicts 3523 active site annotations
• Analysis revealed: for 1376 residues (49% of the cases) there were no CSA experimental active sites within the Pfam alignments
• http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=SUFS_ECOLI&pager.offset=null
