Predicting Active Site Residue Annotations in the Pfam Database

[ Publication date: 9 August 2007 ]. Authors: Jaina Mistry; Alex Bateman; Robert D Finn [authors of this paper & the Pfam database]. [ BMC Bioinformatics ]. Presentation by: KEYUR MALAVIYA.

Presentation Transcript


  1. [ Publication date: 9 August 2007 ] Authors: Jaina Mistry; Alex Bateman; Robert D Finn [authors of this paper & the Pfam database] [ BMC Bioinformatics ] Predicting Active Site Residue Annotations in the Pfam Database. Presentation by: KEYUR MALAVIYA

  2. TOPICS COVERED • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  3. TOPICS COVERED • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  4. Background: Predicting Active Site Residue Annotations in the Pfam Database

  5. Background: Pfam Database • Pfam is a collection of protein families and domains • Pfam contains multiple sequence alignments and profile HMMs of these families • Function: to view the domain organization of proteins • 5% of Pfam families are enzymatic • Of the sequences in these families, only a small fraction (<0.5%) have had the residues responsible for catalysis determined • The structure and chemical properties of these residues (the active site) determine the chemistry of the enzyme

  6. Background: • Active site: the active site of an enzyme contains the catalytic and binding sites • Binding site: a region on a protein (also DNA or RNA) to which specific other molecules and ions, called ligands, bind • Ligand: binds to and forms a complex with a biomolecule to serve a biological purpose, i.e. it is an effector molecule binding to a site on a target protein • Enzymes: control the flow of metabolites within a cell and catalyze virtually all reactions that make or modify molecules

  7. Information about other databases: • NCBI BLAST: finds regions of local similarity between sequences (homologs). BLAST can be used to infer functional and evolutionary relationships between sequences

  8. Information about other databases: NCBI BLAST: • UniProtKB: curated protein sequence database containing literature-collated active site residues (A-S-R) and predicted A-S-R. It predicts A-S-R by similarity only for sequences in UniProtKB/Swiss-Prot

  9. Information about other databases: UniProtKB: • UniProtKB: curated protein sequence database containing literature-collated A-S-R and predicted A-S-R. It predicts A-S-R by similarity only for sequences in UniProtKB/Swiss-Prot

  10. Information about other databases: UniProtKB: • PROSITE: consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

  11. Information about other databases: • SMART and MEROPS: collate active site data from the literature and use sequence similarity based transfer to annotate active site residues onto the sequences in their protein families • Catalytic Site Atlas (CSA): documents enzyme active sites and catalytic residues in enzymes of known 3D structure • Collates A-S-R from the literature for proteins with known structure • Predicts A-S-R for proteins with a known structure, inferred on the basis of PSI-BLAST hits • One of the largest resources for catalytic sites

  12. UniProt – Universal protein knowledgebase:

  13. UniProt – Universal protein knowledgebase: • Pfam and UniProtKB: 74% of protein sequences in UniProtKB have at least one match to Pfam (sequence coverage is 74%)

  14. TOPICS COVERED • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  15. Introduction: • Goal of this paper: to increase the number of active site annotations • Problem: few active site annotations • Approach: a strict set of rules, designed to reduce the rate of false positives (FPs), transfers experimentally determined active site residue data to other sequences within the same Pfam family • Results: • Only 3% of predicted sequences are false positives • Predicted 606110 active site residues, of which 94% are not found in UniProtKB • The tool developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download • This tool is useful in proteome annotation, comparative genomics, protein evolution and active site characterization

  16. The problem and the solution • Pfam [1] release 20.0: 8296 protein families • Sequences in enzymatic Pfam families with experimentally determined active site residues: only ~0.4% • To do better, we need to overcome the lack of experimental data. How? • Computationally predict active sites in protein sequences • Two broad categories: 1) computational methods that transfer experimentally characterized active site data by similarity 2) methods that predict active site residues ab initio

  17. ab initio methods: Exploit known properties such as: • Active sites are usually found buried within a cleft of a protein • Mutations in them increase the stability of an enzyme • Active site residues are highly conserved • Methods use geometry data, stability profiles and sequence conservation for active site prediction

  18. ab initio methods: Evolutionary trace (ET): - Identify the most highly conserved residues in related sequences, - Map them onto the structure of the protein, - Then examine the structure for clusters of residues which could correspond to active sites or other functional sites. - Successful prediction in 60-80% of test cases • Other methods: neural networks and support vector machines • Problem: these methods are hard to compare to each other in terms of accuracy • All have a relatively high rate of false positives

  19. Similarity transfer based methods: Transfer A-S-R from characterized sequences to uncharacterized sequences • First identify homologous sequences, using tools such as BLAST searches, hidden Markov models (HMMs), pattern matching and structural templates • Then transfer the A-S-R • Pfam with this rule based methodology = Pfam+

  20. Where we are: • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  21. Construction and content: • The Pfam database is renowned for having no known false positives in its alignments • The active site Pfam families contain both active and inactive homologues • Known active site residues from UniProtKB/Swiss-Prot in a Pfam alignment are conserved in many of the sequences without active site annotation • Construction: a set of rules that allows conservative transfer of active site annotation from one protein to another protein in the same Pfam alignment • To predict active site residues: identify sequences with experimentally verified active site residues, then use this information to predict active site residues in other members of that family

  22. Logic of the rule based methodology: Find a homologous set of proteins and generate a protein alignment

  23. Logic of the rule based methodology: Identify the positions of all experimentally verified active sites in the alignment

  24. Logic of the rule based methodology

  25. Logic of the rule based methodology: • Seq1 contains 3 experimental active sites (D, E & H) • Seq2 contains 2 experimentally defined active site residues (D & E) • Apply step 3: H in Seq2 is predicted to be an active site residue
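
The step 3 rule on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' tool: the function name `extend_partial`, the toy sequences and the 0-based column numbers are all hypothetical, and the rule is read here as "extend a partly annotated sequence only when its known residues are contained in a fuller experimental pattern and it carries the same residues in the remaining columns".

```python
from typing import Dict

# An active site pattern maps an alignment column (0-based) to a residue.
Pattern = Dict[int, str]


def extend_partial(aligned_seq: str, own: Pattern, fuller: Pattern) -> Pattern:
    """Return the residues newly predicted for a partly annotated sequence."""
    # Every experimentally known residue of this sequence must also be
    # part of the fuller experimental pattern.
    if any(fuller.get(col) != res for col, res in own.items()):
        return {}
    # Columns annotated in the fuller pattern but not yet in this sequence.
    missing = {col: res for col, res in fuller.items() if col not in own}
    # Predict only if the sequence carries the same residue in each of them.
    if all(aligned_seq[col] == res for col, res in missing.items()):
        return missing
    return {}


# Toy data (hypothetical): Seq1 has D, E and H experimentally verified,
# Seq2 only D and E.
seq1_pattern = {1: "D", 3: "E", 5: "H"}
seq2 = "TDKEAHLS"
seq2_experimental = {1: "D", 3: "E"}

predicted = extend_partial(seq2, seq2_experimental, seq1_pattern)
print(predicted)
```

In this toy case the only residue added for Seq2 is the H, mirroring the slide; a sequence with a different residue in that column would get no prediction.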

  26. Logic of the rule based methodology: Active site residue columns: D in column 13, E in column 43 and H in column 45

  27. Logic of the rule based methodology

  28. Logic of the rule based methodology: • Each unannotated sequence in the alignment is analyzed to see if it contains an exact match to the active site pattern • Seq1 and Seq2 now each contain 3 active site residues (D, E & H) • Seq3 contains residues D, E & H in the active site residue columns • Apply step 5: D, E & H in Seq3 are predicted to be active site residues • IS THIS ENOUGH? WILL THIS WORK? If the prediction is wrong, there will be false positives, BUT they will not be “KNOWN” false positives
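
The exact-match step described on this slide can be sketched as follows. This is an illustrative sketch only: the names `matches` and `transfer_by_exact_match`, the toy alignment and the 0-based columns are assumptions, not part of the published tool.

```python
from typing import Dict

# An active site pattern maps an alignment column (0-based) to a residue.
Pattern = Dict[int, str]


def matches(aligned_seq: str, pattern: Pattern) -> bool:
    """True if the sequence has the pattern's residue in every pattern column."""
    return all(
        col < len(aligned_seq) and aligned_seq[col] == res
        for col, res in pattern.items()
    )


def transfer_by_exact_match(
    unannotated: Dict[str, str], pattern: Pattern
) -> Dict[str, Pattern]:
    """Copy the pattern onto every unannotated sequence that matches it exactly."""
    return {
        name: dict(pattern)
        for name, seq in unannotated.items()
        if matches(seq, pattern)
    }


# Toy data (hypothetical): the experimental pattern is D, E and H.
pattern = {1: "D", 3: "E", 5: "H"}
unannotated = {
    "seq3": "SDREAHLV",  # D, E and H in all three columns
    "seq4": "SDRQAHLV",  # Q instead of E, so no transfer
}

predictions = transfer_by_exact_match(unannotated, pattern)
print(predictions)
```

Requiring every column to match is what keeps the transfer conservative: a sequence that differs in even one active site column receives no annotation at all.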

  29. Logic of the rule based methodology

  30. Logic of the rule based methodology: Special cases: two distinct experimentally determined active site patterns within a family, and an unannotated sequence that matches more than one active site pattern • Seq5 experimentally verified active site residues: H (col 9), E (col 42) • Seq6 experimentally verified active site residues: T (col 11), E (col 42) • Should the true active site pattern for the family be the union of the active sites of Seq5 and Seq6, i.e. should H (col 9) be predicted for Seq6 and T (col 11) for Seq5? • NO. Do not combine them, since the union of the two active site patterns has not been experimentally observed

  31. Logic of the rule based methodology: Special cases: two distinct experimentally determined active site patterns within a family, and an unannotated sequence that matches more than one active site pattern • Seq5 experimentally verified active site residues: H (col 9), E (col 42) • Seq6 experimentally verified active site residues: T (col 11), E (col 42) • What about Seq7? • Seq7 contains the active site patterns found in both Seq5 and Seq6 • Seq7 has a higher % identity to Seq6 than to Seq5, so T in column 11 and E in column 42 of Seq7 are predicted to be A-S-R
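
A small Python sketch of this tie-break is given below. It keeps the two experimental patterns separate and picks the one whose source sequence is most similar to the query. The helper names, the toy sequences, the 0-based columns and the particular definition of percent identity (identical residues over columns where neither sequence has a gap) are assumptions made for illustration; the slides do not specify how identity is computed.

```python
from typing import Dict, List, Optional, Tuple

# An active site pattern maps an alignment column (0-based) to a residue.
Pattern = Dict[int, str]


def matches(aligned_seq: str, pattern: Pattern) -> bool:
    """True if the sequence has the pattern's residue in every pattern column."""
    return all(
        col < len(aligned_seq) and aligned_seq[col] == res
        for col, res in pattern.items()
    )


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def choose_pattern(
    query: str, annotated: List[Tuple[str, Pattern]]
) -> Optional[Pattern]:
    """Pick the matching pattern whose source sequence is closest to the query."""
    candidates = [
        (percent_identity(query, seq), pattern)
        for seq, pattern in annotated
        if matches(query, pattern)
    ]
    if not candidates:
        return None
    # The patterns are never merged: exactly one observed pattern is returned.
    return max(candidates, key=lambda item: item[0])[1]


# Toy data (hypothetical), playing the roles of Seq5, Seq6 and Seq7.
seq5, pattern5 = "AHGSLKEV", {1: "H", 6: "E"}
seq6, pattern6 = "MHGTLREI", {3: "T", 6: "E"}
seq7 = "MHGTLKEI"  # matches both patterns

chosen = choose_pattern(seq7, [(seq5, pattern5), (seq6, pattern6)])
print(chosen)
```

Here the query is closer to the Seq6-like sequence, so only that pattern is transferred, and the H column from the other pattern is left unannotated.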

  32. Data source: UniProtKB was chosen as the preferred source of experimental active sites for Pfam. Why? • Using UniProtKB gives a low false positive rate • UniProtKB experimental active sites are more comprehensive than those in the CSA (they cover sequences with both known and unknown structure)

  33. Where we are: • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  34. Transfer of UniProtKB experimental data within Pfam alignments • Used the 2735 experimentally determined active site annotations in UniProtKB 8.0 together with the alignments in Pfam 20.0 • Pfam+ predicts 606110 active site residues • UniProtKB predicts 45685 A-S-R • Overlap between the Pfam+ predicted and UniProtKB predicted A-S-R annotations: Pfam+ is unable to predict 23% (10312 residues) of the UniProtKB predictions. Why? • 55% (5601) of these 23% were found in Pfam alignments that did not contain experimental UniProtKB A-S-R at that position

  35. Transfer of UniProtKB experimental data within Pfam alignments • 55% (5601) of the 10312 UniProtKB predicted residues that Pfam+ missed were found in Pfam alignments that did not contain experimental UniProtKB A-S-R at that position • Pfam+ predictions are based on transferring known experimental data within a Pfam alignment, so these residues, which have no experimental data to transfer, cannot be predicted

  36. Transfer of UniProtKB experimental data within Pfam alignments • 94% (570765 residues) of the Pfam+ active site predictions are not present in UniProtKB. Why? • UniProtKB makes predictions only for sequences in UniProtKB/Swiss-Prot, whereas Pfam+ also makes predictions for the automatically generated UniProtKB/TrEMBL entries • Considering A-S-R predictions for UniProtKB/Swiss-Prot sequences alone, Pfam+ predicts 12570 more residues than UniProtKB • Reverse comparison, UniProtKB against Pfam: UniProtKB contains only 6% of the active site information contained within Pfam
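
The shares quoted for these counts can be re-derived with a few lines of arithmetic. The sketch below uses only the residue counts given on slides 34 and 36; the variable names are hypothetical.

```python
# Residue counts as quoted in the slides.
pfam_predicted = 606110          # active site residues predicted by Pfam+
pfam_not_in_uniprot = 570765     # Pfam+ predictions absent from UniProtKB
uniprot_predicted = 45685        # A-S-R predicted by UniProtKB
uniprot_missed_by_pfam = 10312   # UniProtKB predictions Pfam+ did not make


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# Share of Pfam+ predictions that are new relative to UniProtKB.
novel_share = percent(pfam_not_in_uniprot, pfam_predicted)
# Share of Pfam+ predictions that UniProtKB also contains.
shared_share = percent(pfam_predicted - pfam_not_in_uniprot, pfam_predicted)
# Share of UniProtKB predictions that Pfam+ could not reproduce.
missed_share = percent(uniprot_missed_by_pfam, uniprot_predicted)

print(f"new in Pfam+: {novel_share:.1f}%")
print(f"shared with UniProtKB: {shared_share:.1f}%")
print(f"UniProtKB predictions missed by Pfam+: {missed_share:.1f}%")
```

With these counts the first share rounds to 94%, which agrees with the figure in the Introduction slide, and the second rounds to the 6% quoted for the reverse comparison.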

  37. Where we are: • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  38. Transfer of CSA experimental data within Pfam alignments • CSA predicts 5517 active site annotations • Pfam predicts 3523 active site annotations • Analysis revealed: for 1376 residues (49% of the cases) there were no CSA experimental active sites within the Pfam alignments, because the experimental CSA active site sequence and the CSA predicted active site sequence are too divergent for both to belong to the same Pfam family • Even after removing the CSA predicted active site sequences whose alignments did not contain experimental active sites, Pfam still failed to predict 1446 CSA predicted active sites • Why: the criteria did not match, because CSA uses a broader definition of an active site residue * In UniProtKB, sequence “P77444” has residue 364 annotated as an A-S-R and residue 226 as a binding site for pyridoxal phosphate * CSA defines both residues 226 & 364 as A-S-R

  39. Where we are: • Background • Introduction • Construction and content • Transfer of experimental data within Pfam alignments • UniProtKB data • CSA data • Conclusion

  40. Conclusion: • The automated rule based methodology accurately transfers active site annotation between sequences within a Pfam alignment, i.e. to other members of the same Pfam family • It substantially increased the number of active site annotations in Pfam • The source of experimental data (different for UniProtKB & CSA) determines the success and coverage of any method that uses similarity for transferring active site information • Comparing Pfam+ data to PROSITE patterns: this methodology detects three times more active site sequences • Comparison with the MEROPS data showed the methodology to have a low FP rate (3%), a good specificity (82%), and a reasonable sensitivity (62%) • The automated methodology predicts a substantial number of active site residues at the expense of losing some sensitivity

  41. Conclusion: The forthcoming release, Pfam 22.0, contains 100,000 more Pfam active sites than Pfam 20.0. This active site dataset is the largest single resource of active site annotation currently available

  42. THANK YOU. Questions?
