1 / 1

50%, guessing 100%, all correct

Prediction of catalytic residues in proteins using machine learning techniques. Natalia V. Petrova and Cathy H. Wu Protein Information Resource, Georgetown University, Washington, DC 20007. INTRODUCTION. METHODS & RESULTS.

galatea
Download Presentation

50%, guessing 100%, all correct

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Prediction of catalytic residues in proteins using machine learning techniques Natalia V. Petrova and Cathy H. Wu Protein Information Resource, Georgetown University, Washington, DC 20007 INTRODUCTION METHODS & RESULTS The growing gap between experimentally characterized and uncharacterized proteins necessitates the development of new computational methods for functional prediction. Although computational methods to predict catalytic residues/active sites are rapidly developing, their accuracy remains low (60 - 70%), with a significant number of false positives. We present a novel method for the prediction of catalytic sites, using a machine learning approach, and analyze the results in a case study of a large evolutionarily diverse group of proteins. We used a dataset of enzymes with experimentally identified catalytic sites (79 enzymes) from the CATRES database as our benchmarking dataset [Table 1 ] for the initial analysis. In the 10-fold cross-validation analysis, the best result was achieved with SMO [Figure 1 ] – a support vector machine algorithm that builds an optimal hyperplane in the multidimensional space of attributes in order to achieve maximal separation of the positively and negatively labeled samples. The Scorecons conservation score is a key attribute in the prediction, as shown by the performance of the SMO algorithm using individual attributes [Figure 2 ]. Seven out of 24 attributes were chosen by the Wrapper Subset Selection algorithm as an optimal subset of attributes for the SMO algorithm, and no further reduction of the set is possible [Table 2 ]. Table 1 Benchmarking Dataset Table 2 Final Attribute Set # proteins 79 # residues 23,664 # catalytic residues 254 PIR curated 100% Algorithm Performance Measurements PDB X-ray crystallography 100% SCOP all a 10.1% all b 10.1% a + b 30.4% a/b 48.1% small proteins 1.3% CASE STUDY – a/b hydrolases (Further Optional Analytical Step) EC number oxidoreductases 20.3% transferases 26.6% hydrolases 27.8% lyases 17.7% The prediction capabilities of our method were tested on a diverse superfamily of hydrolytic enzymes with a/b hydrolase fold and different catalytic functions (Pfam domain – PF00561). All enzymes have a catalytic triad and conserved structural features [Figure 3A ]. Even though the algorithm predicted a large number of false positives for each individual protein [Figure 3C ], further improvement can be achieved by merging the results for a group of related proteins. For 16 out of 17 enzymes, the method correctly predicted all 3 residues of the triad with 3 false positives (1.06%) out of 282 residues on average. For one protein, 1cv2, the method missed a substituted catalytic residue of the triad [Figure 3 ]. Two out of 3 false positive residues (His and Gly ) are believed to be important for enzymatic activity, while all three of them are essential for protein structural stability. isomerases 2.5% ligases 5.1% Figure 1 Performance of each algorithm measured by MCC 0, guessing 1, all correct MCC = Figure 3Prediction Results for 9 Curated Protein Families of a/b Hydrolases Figure 2 Predictive Accuracy of SMO algorithm using each attribute separately LID A B Consensus of False Positives (FP) Asp/Asn Catalytic Triad, True Positives (TP) 50%, guessing 100%, all correct Accuracy = His Gly His Asp C Sep/Asp Glu Aligned Regions Not Aligned Regions – length (FP) False Negative (FN) CONCLUSIONS 1. The prediction accuracy of our method is > 86% [Table 2 ].2. An additional analytical step correctly identified the catalytic triad of a/b hydrolases and reduced false positives to 1.06%. 3. The method can be used to identify candidate catalytic residues for proteins with known structure but unknown function. ACKNOWLEDGEMENTS: This work would not have been complete without the wise help and guidance that was provided by our colleagues at PIR: W. C. Barker, H. Huang, A. Nikolskaya, S. Vasudevan, and C.R. Vinayaka. CITATION: Petrova NV, Wu CH: Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics, 7:312, 2006 Contact np6@georgetown.edu http://pir.georgetown.edu

More Related