
Christian Kramer , Peter Gedeck Novartis Institutes for Biomedical Research, Basel, Switzerland

Leave-cluster-out crossvalidation is appropriate for scoring functions derived on diverse protein datasets






[Figure: leave-cluster-out crossvalidation scheme. Each of clusters 1-5 is held out in turn (Validation 1-5) while the model is trained on the remaining clusters (Train 1-5); performance is reported for each cluster after leave-cluster-out crossvalidation.]

Introduction

Empirical rescoring functions for predicting protein-ligand interaction energies can be trained on large, diverse collections of crystal-structure geometries augmented with binding data, such as the PDBbind or the Binding MOAD database. A recent publication demonstrated remarkable success in predicting the free energy of interaction from atom counts within a 12 Å radius around the ligand.[1] However, the quality of prediction depends strongly on the composition of the training and validation sets. We suggest a generally applicable validation strategy that is not prone to protein-family-recognition pitfalls.

Leave-cluster-out crossvalidation

The range of activities within a protein family is smaller than the total range of activities. To avoid predictions that benefit from protein-family recognition, we suggest performing leave-cluster-out crossvalidation.

Dataset & methods

Dataset: PDBbind07[2] was used to reproduce the RF-Score results. The PDBbind09 refined set was used to demonstrate leave-cluster-out crossvalidation.

Descriptors: The RF-Score descriptors as published by Ballester and Mitchell were used for all models. For every ligand atom type [C, N, O, F, P, S, Cl, Br, I], all protein atoms of type [C, N, O, S] within 12 Å are counted and summed up, giving 4x9 atom-pair descriptors.

Learning algorithm: The Random Forest as implemented in R was used with default settings.

Composition of the PDBbind09 database

The PDBbind09 refined set consists of 1741 complexes in 561 clusters (90% BLAST similarity).
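The occurrence-count descriptors described above can be sketched as follows. This is a minimal illustration, not the original RF-Score implementation: the function name, the tuple-based atom representation, and the use of plain Euclidean distances are assumptions for the sketch; only the element lists and the 12 Å cutoff come from the poster.

```python
from itertools import product
import numpy as np

# Element types counted by the RF-Score descriptors (per the poster)
LIGAND_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
PROTEIN_TYPES = ["C", "N", "O", "S"]
CUTOFF = 12.0  # Angstrom

def rf_score_descriptors(ligand_atoms, protein_atoms):
    """Count protein-ligand atom pairs within CUTOFF.

    ligand_atoms / protein_atoms: lists of (element, (x, y, z)) tuples
    (a hypothetical input format chosen for this sketch).
    Returns a dict mapping (protein_element, ligand_element) -> count,
    i.e. the 4 x 9 = 36 occurrence counts.
    """
    counts = {pair: 0 for pair in product(PROTEIN_TYPES, LIGAND_TYPES)}
    for l_elem, l_xyz in ligand_atoms:
        if l_elem not in LIGAND_TYPES:
            continue  # other ligand elements are ignored
        for p_elem, p_xyz in protein_atoms:
            if p_elem not in PROTEIN_TYPES:
                continue  # other protein elements are ignored
            dist = np.linalg.norm(np.array(l_xyz) - np.array(p_xyz))
            if dist <= CUTOFF:
                counts[(p_elem, l_elem)] += 1
    return counts
```

In practice the coordinates would come from parsed PDB files; the resulting 36-element count vector is the feature vector fed to the Random Forest.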
The distribution of cluster population is shown below. [Figure: distribution of cluster populations]

The PDBbind09 cluster alphabet

For the leave-cluster-out crossvalidation we suggest the following clustering scheme: all clusters with more than nine members are kept as individual clusters (A-W); clusters with four to nine members are united (X); clusters with two or three members are united (Y); and all singletons are united (Z).

Table 1: Leave-cluster-out crossvalidation results on the PDBbind09 refined set. Average R2 = 0.21, average RMSE = 1.60.

PDBbind core set & RF-Score performance

The PDBbind07 core set can be predicted with RMSE = 1.58, R2 = 0.59, and R = 0.77. It was assembled from a clustering of the PDBbind07 database according to BLAST similarities: the most active complex, the least active complex, and the complex closest to the average activity were extracted from each cluster with at least four members. This means that for every validation-set entry there is at least one entry from the same protein family in the training set.

Target-specific scoring functions

If crystal structures with corresponding activities are available, target-specific scoring functions can be generated. We generated scoring functions within the four largest clusters using standard out-of-bag crossvalidation.

Cluster proximities

Multidimensional scaling of the RF-Score descriptor space shows that complexes from the same protein family indeed cluster together. A flexible learning algorithm should therefore be well able to recognize protein-family membership.

Conclusion

• The advent of large, diverse datasets of protein-ligand complexes makes it possible to generate scoring functions with a QSAR-type fitting procedure.
• Global scoring functions must be validated with protein-ligand complexes from protein families that are not present in the training set; otherwise the validation will look overoptimistic (R2 = 0.59 vs. R2 = 0.21).
• Target-specific scoring functions can be much more predictive than global scoring functions, even when trained with the same descriptors.
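The leave-cluster-out procedure itself can be sketched as below. The poster used the Random Forest implementation in R; this sketch substitutes scikit-learn's RandomForestRegressor as a testable stand-in, and reporting R2 as the squared Pearson correlation between observed and predicted affinities is an assumption about how the poster's statistics were computed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def leave_cluster_out_cv(X, y, clusters, **rf_kwargs):
    """Hold out each protein-family cluster in turn, train on the rest,
    and report per-cluster R2 (squared Pearson r, an assumption) and RMSE.

    X: (n_samples, n_features) descriptor matrix
    y: (n_samples,) binding affinities
    clusters: (n_samples,) cluster label per complex
    """
    X, y, clusters = np.asarray(X), np.asarray(y), np.asarray(clusters)
    results = {}
    for c in np.unique(clusters):
        held_out = clusters == c
        # Train only on complexes from *other* clusters, so the model
        # cannot benefit from recognizing the held-out protein family.
        model = RandomForestRegressor(random_state=0, **rf_kwargs)
        model.fit(X[~held_out], y[~held_out])
        pred = model.predict(X[held_out])
        rmse = float(np.sqrt(np.mean((y[held_out] - pred) ** 2)))
        r2 = float(np.corrcoef(y[held_out], pred)[0, 1] ** 2)
        results[c] = {"R2": r2, "RMSE": rmse}
    return results
```

Averaging the per-cluster values over the A-Z cluster alphabet would give summary numbers comparable to those in Table 1.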
References

[1] Ballester, P.J. & Mitchell, J.B.O. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26, 1169-1175 (2010).
[2] Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. Comparative Assessment of Scoring Functions on a Diverse Test Set. Journal of Chemical Information and Modeling 49, 1079-1093 (2009).

Acknowledgments

CK thanks the Novartis Education Office for a Presidential Postdoc Fellowship.
