
Informed by Informatics?


Presentation Transcript


  1. Informed by Informatics? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.

  2. Modelling in Chemistry [diagram: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning and CoMFA arranged between PHYSICS-BASED and EMPIRICAL, and between ATOMISTIC and NON-ATOMISTIC]

  3. [the same method diagram, now labelled from LOW THROUGHPUT to HIGH THROUGHPUT]

  4. [the same method diagram, divided into THEORETICAL CHEMISTRY and INFORMATICS regions, with the caveat NO FIRM BOUNDARIES!]

  5. [the same method diagram, shown without category labels]

  6. Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • The relationship between inputs and output is an empirically parameterised equation or a more elaborate mathematical model. • They do not attempt to simulate reality. • Usually high throughput.

  7. QSPR • Quantitative Structure–Property Relationship • A physical property is related to more than one other variable • Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s). • Property–property relationships date from the 1860s • General form (for non-linear relationships): y = f (descriptors)

  8. QSPR Y = f (X1, X2, ... , XN) • Optimisation of Y = f(X1, X2, ... , XN) is called regression. • The model is optimised upon N “training molecules” and then tested upon M “test” molecules.
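A minimal sketch of this optimise-on-training, test-on-held-out workflow in Python (scikit-learn assumed; the descriptor matrix X and property values y are random placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # placeholder data: 1000 "molecules", 52 descriptors each
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 52))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)

    # N "training molecules" and M "test molecules"
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # optimise Y = f(X1, ..., XN) by multi-linear regression
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)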

  9. QSPR • Quality of the model is judged by three parameters:
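The three parameters are not spelled out in the transcript, but the results slide later reports RMSE, r² and bias, so presumably these are meant. A minimal sketch of computing them for a set of predictions:

    import numpy as np

    def rmse(y_obs, y_pred):
        # root mean squared error
        return float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2)))

    def r_squared(y_obs, y_pred):
        # squared correlation between observed and predicted values
        return float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)

    def bias(y_obs, y_pred):
        # mean signed error: systematic over- or under-prediction
        return float(np.mean(np.asarray(y_pred) - np.asarray(y_obs)))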

  10. QSPR • Different methods for carrying out regression: • LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. • NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
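A compact sketch of swapping one linear and one non-linear method from this list in and out with scikit-learn (placeholder data; PLS and Random Forest chosen arbitrarily):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 52))                        # placeholder descriptors
    y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)    # placeholder property

    linear_model = PLSRegression(n_components=10).fit(X, y)              # LINEAR
    nonlinear_model = RandomForestRegressor(n_estimators=100,
                                            random_state=0).fit(X, y)    # NON-LINEAR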

  11. QSPR • However, this does not guarantee a good predictive model….

  12. QSPR • Problems with experimental error. • A QSPR model is only as accurate as the data it is trained upon. • Therefore, we need accurate experimental data.

  13. QSPR • Problems with “chemical space”. • The “sample” molecules must be representative of the “population”. • Prediction results will be most accurate for molecules similar to the training set. • Global or local models?

  14. 1. Solubility • Dave Palmer • Pfizer Institute for Pharmaceutical Materials Science • http://www.msm.cam.ac.uk/pfizer D.S. Palmer, et al., J. Chem. Inf. Model., 47, 150-158 (2007) D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)

  15. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.

  16. Datasets • Phase 1 – Literature Data ● Compiled from the Huuskonen dataset and the AquaSol database – pharmaceutically relevant molecules ● All molecules solid at room temperature ● n = 1000 molecules • Phase 2 – Our Own Experimental Data ● Measured by Toni Llinàs using the CheqSol machine ● Pharmaceutically relevant molecules ● n = 135 molecules ● Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)

  17. Diversity-Conserving Partitioning ● MACCS Structural Key fingerprints ● Tanimoto coefficient ● MaxMin Algorithm Full dataset n = 1000 molecules Training set n = 670 molecules Test set n = 330 molecules
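A sketch of this kind of diversity-conserving split with RDKit (MACCS keys, Tanimoto distance, MaxMin picker); the molecules and subset size are toy placeholders rather than the 1000-compound dataset:

    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

    smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "CCCCCCO"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]      # MACCS structural key fingerprints

    # MaxMin picking on 1 - Tanimoto distances between the bit vectors
    picker = MaxMinPicker()
    diverse_subset = list(picker.LazyBitVectorPick(fps, len(fps), 2))
    remainder = [i for i in range(len(fps)) if i not in set(diverse_subset)]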

  18. Diversity Analysis

  19. Structures & Descriptors ● 3D structures from Concord ● Minimised with MMFF94 ● MOE descriptors 2D/3D ● Separate analysis of 2D and 3D descriptors ● QuaSAR Contingency Module (MOE) ● 52 descriptors selected
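The pipeline on the slide used Concord and MOE; as an open-source stand-in, a sketch of the same steps (3D structure generation, MMFF94 minimisation, descriptor calculation) with RDKit:

    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a placeholder
    AllChem.EmbedMolecule(mol, randomSeed=42)    # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)            # minimise with the MMFF94 force field

    # 2D descriptors (RDKit's own list, standing in for the MOE descriptor set)
    descriptors = {name: fn(mol) for name, fn in Descriptors.descList[:10]}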

  20. Random Forest Machine Learning Method

  21. Random Forest ● Introduced by Breiman and Cutler (2001) ● Development of Decision Trees (Recursive Partitioning): ● Dataset is partitioned into consecutively smaller subsets ● Each partition is based upon the value of one descriptor ● The descriptor used at each split is selected so as to optimise splitting ● Bootstrap sample of N objects chosen from the N available objects with replacement

  22. Random Forest ● Random Forest is a collection of Decision or Regression Trees grown with the CART algorithm. ● Standard Parameters: ● 500 decision trees ● No pruning back: minimum node size = 5 ● “mtry” descriptors (the square root of the total number of descriptors) tried at each split Important features: ● Incorporates descriptor selection ● Incorporates “out-of-bag” validation – using those molecules not in the bootstrap samples
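A sketch of these standard parameters mapped onto scikit-learn's RandomForestRegressor (the original work used Breiman and Cutler's implementation; the data here are random placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(670, 52))                              # placeholder descriptors
    y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=670)     # placeholder log S values

    rf = RandomForestRegressor(
        n_estimators=500,       # 500 decision trees
        min_samples_leaf=5,     # minimum node size = 5, no pruning back
        max_features="sqrt",    # "mtry" = square root of the number of descriptors
        oob_score=True,         # out-of-bag validation on molecules left out of each bootstrap
        random_state=0,
    ).fit(X, y)

    print("out-of-bag r2:", rf.oob_score_)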

  23. Random Forest for Solubility Prediction A Forest of Regression Trees • Dataset is partitioned into consecutively smaller subsets (of similar solubility) • Each partition is based upon the value of one descriptor • The descriptor used at each split is selected so as to minimise the MSE Leo Breiman, “Random Forests”, Machine Learning 45, 5-32 (2001).

  24. Random Forest for Predicting Solubility • A Forest of Regression Trees • Each tree grown until terminal nodes contain specified number of molecules • No need to prune back • High predictive accuracy • Includes method for descriptor selection • No training problems – largely immune from overfitting.

  25. Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 RMSE(tr)=0.27 r2(tr)=0.98 Bias(tr)=0.005 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)

  26. Drug Discovery Today, 10 (4), 289 (2005)

  27. Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?

  28. ΔGsub from lattice energy & an entropy term; ΔGhydr from a semi-empirical solvation model (i.e., different kinds of theoretical/computational methods)
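A minimal numerical sketch of this cycle; the free-energy values are placeholders, and the final conversion from a solution free energy to log S is a standard relation added here (standard-state conventions glossed over):

    import math

    R = 8.314462618e-3   # gas constant, kJ/(mol K)
    T = 298.15           # K

    # placeholder free energies for one molecule, in kJ/mol
    dG_sub = 100.0       # crystal -> gas, from lattice energy plus an entropy term
    dG_hydr = -60.0      # gas -> aqueous solution, from a solvation model

    dG_sol = dG_sub + dG_hydr                    # crystal -> solution, via the cycle
    logS = -dG_sol / (R * T * math.log(10))      # convert to a log10 solubility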

  29. … or this alternative cycle?

  30. ΔGsub from lattice energy (DMAREL) plus entropy; ΔGsolv from SCRF using the B3LYP DFT functional; ΔGtr from ClogP (i.e., different kinds of theoretical/computational methods)

  31. ● Nice idea, but didn’t quite work – errors larger than QSPR ● “Why not add a correction factor to account for the difference between the theoretical methods?” …

  32. ● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors
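A sketch of what such a hybrid model looks like in practice: theoretical energy terms fed into an ordinary regression as descriptors. The columns and numbers below are illustrative placeholders, not the published model:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # hypothetical descriptor table: one row per molecule,
    # columns = theoretical energies (kJ/mol), y = measured log S
    X = np.array([[100.0, -20.0, -15.0],    # dG_sub, dG_solv, dG_tr
                  [ 85.0, -18.0, -10.0],
                  [120.0, -25.0, -22.0],
                  [ 95.0, -15.0, -12.0]])
    y = np.array([-3.1, -2.4, -4.8, -3.0])

    hybrid = LinearRegression().fit(X, y)   # a QSPR with theoretical energies as descriptors
    print(hybrid.intercept_, hybrid.coef_)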

  33. This regression equation gave r2=0.77 and RMSE=0.71

  34. Solubility by TD Cycle: Conclusions ● We have a hybrid part-theoretical, part-empirical method. ● An interesting idea, but relatively low throughput as a crystal structure is needed. ● Slightly more accurate than pure QSPR for a druglike set. ● Instructive to compare with literature of theoretical solubility studies.

  35. 2. Bioactivity Florian Nigsch F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)

  36. Feature Space - Chemical Space A molecule is represented as a vector m = (f1, f2, …, fn) in a feature space of high dimensionality [figure: feature spaces with axes f1, f2, f3 for targets such as COX2, CDK2, CDK1 and DHFR]

  37. Properties of Drugs ● High affinity to the protein target ● Soluble ● Permeable ● Absorbable ● High bioavailability ● Specific rate of metabolism ● Renal/hepatic clearance? ● Volume of distribution? ● Low toxicity ● Plasma protein binding? ● Blood-Brain-Barrier penetration? ● Dosage (once/twice daily?) ● Synthetic accessibility ● Formulation (important in development)

  38. Multiobjective Optimisation ● Synthetic accessibility ● Bioactivity ● Solubility ● Toxicity ● Permeability ● Metabolism A huge number of candidates …

  39. Multiobjective Optimisation ● Synthetic accessibility ● Bioactivity ● Solubility ● Toxicity ● Permeability ● Metabolism [diagram: one candidate labelled Drug, the rest labelled USELESS] A huge number of candidates, most of which are useless!

  40. Spam • Unsolicited (commercial) email • Approx. 90% of all email traffic is spam • Where are the legitimate messages? • Filtering

  41. Analogy to Drug Discovery • Huge number of possible candidates • Virtual screening to help in selection process

  42. Winnow Algorithm • Invented in late 1980s by Nick Littlestone to learn Boolean functions • Name from the verb “to winnow” • High-dimensional input data • Natural Language Processing (NLP), text classification, bioinformatics • Different varieties (regularised, Sparse Network Of Winnow - SNOW, …) • Error-driven, linear threshold, online algorithm
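A minimal sketch of the basic Winnow update rule (the multiplicative promotion/demotion variant) over binary feature vectors; the promotion factor and threshold are textbook defaults, not necessarily the settings used in this work:

    class Winnow:
        """Error-driven, linear-threshold, online learner."""

        def __init__(self, n_features, alpha=2.0):
            self.w = [1.0] * n_features      # all weights start at 1
            self.theta = float(n_features)   # decision threshold
            self.alpha = alpha               # promotion/demotion factor

        def predict(self, x):                # x: binary feature vector (0/1 values)
            return sum(w * xi for w, xi in zip(self.w, x)) >= self.theta

        def update(self, x, label):          # online: update only when a mistake is made
            if self.predict(x) == label:
                return
            factor = self.alpha if label else 1.0 / self.alpha
            self.w = [w * factor if xi else w for w, xi in zip(self.w, x)]

Training is simply one or more passes over labelled examples, calling update for each; weights change only on mistakes, which is what makes the algorithm error-driven and fast on high-dimensional sparse data.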

  43. Winnow (“Molecular Spam Filter”) Machine Learning Methods

  44. Training Example

  45. Features of Molecules Based on circular fingerprints
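The transcript does not say which circular fingerprint implementation was used; as one plausible stand-in, a sketch of extracting Morgan/ECFP-like circular atom environments with RDKit:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as a placeholder

    # circular atom environments up to radius 2, hashed to integer feature identifiers
    fp = AllChem.GetMorganFingerprint(mol, 2)
    feature_ids = sorted(fp.GetNonzeroElements().keys())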

  46. Combinations of Features Combinations of molecular features to account for synergies.

  47. Pairing Molecular Features Gives Bigrams

  48. Orthogonal Sparse Bigrams • Technique used in text classification/spam filtering • Sliding window process • Sparse - not all combinations • Orthogonal - non-redundancy
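One plausible reading of this sliding-window construction, sketched in Python; the window size and the exact pairing scheme are assumptions rather than the published settings:

    def orthogonal_sparse_bigrams(features, window=5):
        """Pair each feature with each earlier feature inside a sliding window.

        Sparse: only pairs involving the newest element of the window are kept,
        not all combinations. The skip distance is recorded so that pairs made
        at different separations remain distinct (non-redundant) features.
        """
        bigrams = []
        for i, current in enumerate(features):
            for skip in range(1, window):
                j = i - skip
                if j < 0:
                    break
                bigrams.append((features[j], current, skip))
        return bigrams

    # e.g. on a short sequence of molecular feature identifiers
    print(orthogonal_sparse_bigrams(["f1", "f2", "f3", "f4"], window=3))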

  49. Workflow

  50. Protein Target Prediction • Which proteins does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology
