Hongjian Li, Department of Computer Science and Engineering, Chinese University of Hong Kong, 17 January 2013 • hjli@cse.cuhk.edu.hk • http://www.cse.cuhk.edu.hk/~hjli
Protein–Ligand Complex, e.g. 1HCL • Intermolecular interactions • Van der Waals forces • Hydrogen bonds • π–π interactions • etc. • Binding affinity • Kd, Ki, IC50: the lower, the stronger the binding • −log10 K (e.g. pKd): the higher, the stronger the binding
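The two affinity conventions on this slide are related by a simple log transform; a minimal sketch (the function name is illustrative):

```python
import math

def p_affinity(k_molar):
    """Convert a binding constant (Kd, Ki, or IC50, in molar units)
    to its negative log form, e.g. pKd = -log10(Kd).

    A lower Kd means tighter binding, so a higher pKd means tighter binding.
    """
    return -math.log10(k_molar)

# A nanomolar binder (Kd = 1e-9 M) has pKd = 9.
print(p_affinity(1e-9))
```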
PDBbind v2012 • A: experimentally determined structures as of 2012 • B: protein–ligand, DNA/RNA–ligand, protein–DNA/RNA, and protein–protein complexes • C: complexes with measured Kd, Ki, or IC50 • 7,121 protein–ligand complexes • D: protein–ligand complexes with Kd or Ki • E: clustered at 90% BLAST sequence identity into 67 non-redundant clusters • each cluster represented by its highest-, median-, and lowest-affinity complexes
Docking’s Two Purposes • Redocking, i.e. pose identification • success: RMSD < 2 Å between predicted and reference (crystal) pose • 78% success rate • Scoring • hard to recover Kd, Ki, IC50 • typically R ∈ [0.2, 0.5] • R = 0.531 for Vina, R = 0.528 for idock
Existing Scoring Functions • Assume a predetermined theory-inspired functional form, e.g. Vina/idock
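A theory-inspired scoring function of this kind is a fixed weighted sum of physically motivated, distance-dependent pair terms. The sketch below is illustrative only: the term shapes and weights are simplified assumptions, not Vina's or idock's actual fitted parameters.

```python
import math

# Illustrative weights (assumed, NOT Vina's fitted values).
WEIGHTS = {"gauss": -0.04, "repulsion": 0.8, "hydrophobic": -0.03, "hbond": -0.6}

def pair_terms(d_surface):
    """Physically motivated terms for one protein-ligand atom pair, as a
    function of surface distance (interatomic distance minus vdW radii)."""
    return {
        "gauss": math.exp(-(d_surface / 0.5) ** 2),             # attractive steric term
        "repulsion": d_surface ** 2 if d_surface < 0 else 0.0,  # penalize clashes
        "hydrophobic": 1.0 if d_surface < 0.5 else 0.0,         # simplified step function
        "hbond": 1.0 if d_surface < -0.3 else 0.0,              # simplified step function
    }

def score(surface_distances):
    """Sum the weighted terms over all protein-ligand atom pairs."""
    return sum(WEIGHTS[name] * value
               for d in surface_distances
               for name, value in pair_terms(d).items())
```

The key point for the slides that follow: the functional form and its physics are fixed in advance, and only the weights are fitted.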
Previous Work • Non-parametric machine learning • implicitly captures binding effects that are hard to model explicitly • Deng et al., 2004 • distance-dependent interaction frequencies • kernel partial least squares • small external test sets (6 or 10 compounds) • Amini et al., 2007 • support vector regression (SVR) • family-specific scoring functions • 26 to 72 complexes, cross-validation
RF-Score • The first application of Random Forests (RFs) to predicting protein–ligand binding affinity • 9 common atom types for both protein and ligand • each feature xj,i counts the occurrences of a particular j–i atom type pair • dcutoff = 12 Å, Z(C) = 6, Z(N) = 7, Z(O) = 8, Z(P) = 15, Z(S) = 16
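The feature vector above can be sketched as a pair-counting loop; a minimal version, assuming atoms are given as `(atomic_number, x, y, z)` tuples and that atoms of other elements have already been filtered out:

```python
from itertools import product

# The 9 common element types, as atomic numbers: C, N, O, F, P, S, Cl, Br, I.
ELEMENTS = [6, 7, 8, 9, 15, 16, 17, 35, 53]
D_CUTOFF = 12.0  # Angstroms

def rf_score_features(protein_atoms, ligand_atoms):
    """Count j-i atom-type pair occurrences within the distance cutoff.

    Each atom is a tuple (atomic_number, x, y, z). Returns a dict mapping
    (protein_Z, ligand_Z) -> occurrence count. Of the 81 possible pairs,
    the paper keeps 36 as features.
    """
    counts = {pair: 0 for pair in product(ELEMENTS, repeat=2)}
    cutoff_sq = D_CUTOFF ** 2
    for zp, px, py, pz in protein_atoms:
        for zl, lx, ly, lz in ligand_atoms:
            d_sq = (px - lx) ** 2 + (py - ly) ** 2 + (pz - lz) ** 2
            if d_sq <= cutoff_sq and (zp, zl) in counts:
                counts[(zp, zl)] += 1
    return counts
```

Note the features are purely occurrence counts: no distance weighting and no physics, which is exactly what the RF model is asked to learn implicitly.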
RF-Score • RF trains binary trees using the CART algorithm • each tree is grown without pruning from a bootstrap sample of the training data • at each node the best split is selected from a typically small number (mtry) of randomly chosen features, mtry ∈ {2,…,36} • splitting stops at nodes with ≤ 5 samples • an individual tree predicts the arithmetic mean of its training samples in the reached leaf node • P = 500 trees; the RF prediction is the arithmetic mean over all trees
RF-Score • Out-of-bag (OOB) data as internal validation • candidate mtry values cover all the feature subset sizes, i.e. {2,…,36} • training set: PDBbind v2007 refined set minus core set, N = 1105
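The training protocol on these two slides can be sketched with scikit-learn (an assumption: the original work used R's randomForest package, and the data below is synthetic, not PDBbind):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 36))                        # 36 atom-pair count features
y = 5.0 * X[:, 0] + rng.normal(0.0, 0.3, 200)    # synthetic pKd-like targets

best = None
for mtry in (2, 9, 18, 36):       # paper scans all mtry in {2, ..., 36}
    rf = RandomForestRegressor(
        n_estimators=100,         # paper uses P = 500 unpruned CART trees
        max_features=mtry,        # mtry features considered at each split
        min_samples_split=6,      # do not split nodes with <= 5 samples
        oob_score=True,           # out-of-bag data as internal validation
        random_state=0,
    ).fit(X, y)
    if best is None or rf.oob_score_ > best.oob_score_:
        best = rf                 # keep the mtry with the best OOB score

pred = best.predict(X[:5])        # RF prediction = mean over all trees
```

Selecting mtry on the OOB score means no separate validation split is needed, which matters when only ~1100 training complexes are available.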
Prediction Accuracy • Training set: R = 0.953, RMSE = 0.74 • OOB: ROOB = 0.699, RMSEOOB = 1.52 • Test set: R = 0.776, RMSE = 1.58 • close to the OOB estimates
y-Scrambling Validation • To rule out chance correlation • random permutation of the y-data • over 10 independent trials on the test set • R = −0.018 with standard deviation SR = 0.095 • RMSE = 2.42 with SRMSE = 0.04 • conclusion: chance correlation is negligible
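The y-scrambling procedure can be sketched as follows (synthetic data and scikit-learn are assumptions here): refit after randomly permuting the training labels; a model with no chance correlation should score near zero on the test set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train, X_test = rng.random((150, 36)), rng.random((50, 36))
y_train = 5.0 * X_train[:, 0] + rng.normal(0.0, 0.3, 150)
y_test = 5.0 * X_test[:, 0] + rng.normal(0.0, 0.3, 50)

r_values = []
for trial in range(10):                        # 10 independent trials
    y_scrambled = rng.permutation(y_train)     # destroy the X-y relationship
    rf = RandomForestRegressor(n_estimators=50, random_state=trial)
    rf.fit(X_train, y_scrambled)
    r = np.corrcoef(rf.predict(X_test), y_test)[0, 1]
    r_values.append(r)

mean_r, sd_r = float(np.mean(r_values)), float(np.std(r_values))
```

A mean R near zero across trials, as on this slide, shows the real model's accuracy is not an artifact of fitting noise.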
Data Size Matters • the internal estimate RMSEOOB gets closer to the test RMSE as the training set size Ntrain increases
Variable Importance • measured by x-scrambling (feature permutation) • features with %IncMSE > 20: • x6,6 (C–C): hydrophobic interactions • x7,6, x8,6, x16,6, x6,8: polar–non-polar contacts • x8,8, x7,8, x7,7, x8,7: hydrogen bonds • Z(C) = 6, Z(N) = 7, Z(O) = 8, Z(P) = 15, Z(S) = 16
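Permutation-based importance can be sketched with scikit-learn's `permutation_importance`, which is analogous to %IncMSE in R's randomForest (synthetic data; only one feature carries signal by construction):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((200, 36))
y = 4.0 * X[:, 5] + rng.normal(0.0, 0.3, 200)   # only feature 5 is informative

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Shuffle each feature in turn and measure the drop in model score.
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
top_feature = int(np.argmax(imp.importances_mean))  # expect index 5
```

On the real model this is how x6,6 (carbon–carbon contacts) emerges as the dominant feature, consistent with hydrophobic burial driving binding.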
Less Coverage of Atom Types • using only the 3 most common atom types {C, N, O} • 36 features → 9 features • Conclusion • no large performance decrease • test cases containing {F, P, S, Cl, Br, I} account for the difference
Comparison w/ the State of the Art • Test set: PDBbind v2007 core set, N = 195 • same training and test sets used for the top 3 scoring functions
Conclusion • RF-Score via non-parametric machine learning • circumvents the need for modelling assumptions • high correlation on a diverse test set • Drawback • limited interpretability of features in terms of physical interactions • Future work • distance-dependent features • atom hybridization state & bonding environment
Non-Standard AAs & Metal Ions • [Figure: reference pose vs. Vina and idock predicted poses]