
Informed by Informatics?


Presentation Transcript


  1. Informed by Informatics? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.

  2. Modelling in Chemistry [diagram: ab initio, Density Functional Theory, Car-Parrinello, Fluid Dynamics, AM1, PM3 etc., Molecular Dynamics, DPD, Monte Carlo, Docking, 2-D QSAR/QSPR, Machine Learning and CoMFA arranged between PHYSICS-BASED and EMPIRICAL, and between ATOMISTIC and NON-ATOMISTIC]

  3. [the same method diagram, now labelled from LOW THROUGHPUT to HIGH THROUGHPUT]

  4. [the same method diagram, divided into THEORETICAL CHEMISTRY and INFORMATICS regions, with the caveat NO FIRM BOUNDARIES!]

  5. [the same method diagram, shown without category labels]

  6. Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • The relationship between inputs and output is an empirically parameterised equation or a more elaborate mathematical model. • They do not attempt to simulate reality. • Usually high throughput.

  7. QSPR • Quantitative Structure–Property Relationship • A physical property is related to more than one other variable • Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s). • Property–property relationships date from the 1860s • General form (for non-linear relationships): y = f (descriptors)

  8. QSPR Y = f (X1, X2, ... , XN) • Optimisation of Y = f(X1, X2, ... , XN) is called regression. • The model is optimised upon N “training molecules” and then tested upon M “test” molecules.
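A minimal sketch of this optimise-on-training, test-on-held-out workflow in Python (scikit-learn assumed; the descriptor matrix X and property values y are random placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # placeholder data: 1000 "molecules", 52 descriptors each
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 52))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)

    # N "training molecules" and M "test molecules"
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # optimise Y = f(X1, ..., XN) by multi-linear regression
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)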

  9. QSPR • Quality of the model is judged by three parameters:
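The three parameters are not spelled out in the transcript, but the results slide later reports RMSE, r² and bias, so presumably these are meant. A minimal sketch of computing them for a set of predictions:

    import numpy as np

    def rmse(y_obs, y_pred):
        # root mean squared error
        return float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2)))

    def r_squared(y_obs, y_pred):
        # squared correlation between observed and predicted values
        return float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)

    def bias(y_obs, y_pred):
        # mean signed error: systematic over- or under-prediction
        return float(np.mean(np.asarray(y_pred) - np.asarray(y_obs)))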

  10. QSPR • Different methods for carrying out regression: • LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. • NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
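A compact sketch of swapping one linear and one non-linear method from this list in and out with scikit-learn (placeholder data; PLS and Random Forest chosen arbitrarily):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 52))                        # placeholder descriptors
    y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)    # placeholder property

    linear_model = PLSRegression(n_components=10).fit(X, y)              # LINEAR
    nonlinear_model = RandomForestRegressor(n_estimators=100,
                                            random_state=0).fit(X, y)    # NON-LINEAR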

  11. QSPR • However, this does not guarantee a good predictive model….

  12. QSPR • Problems with experimental error. • A QSPR model is only as accurate as the data it is trained upon. • Therefore, we need accurate experimental data.

  13. QSPR • Problems with “chemical space”. • The “sample” molecules must be representative of the “population”. • Prediction results will be most accurate for molecules similar to the training set. • Global or local models?

  14. 1. Solubility • Dave Palmer • Pfizer Institute for Pharmaceutical Materials Science • http://www.msm.cam.ac.uk/pfizer D.S. Palmer, et al., J. Chem. Inf. Model., 47, 150-158 (2007) D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)

  15. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.

  16. Datasets • Phase 1 – Literature Data ● Compiled from the Huuskonen dataset and the AquaSol database – pharmaceutically relevant molecules ● All molecules solid at room temperature ● n = 1000 molecules • Phase 2 – Our Own Experimental Data ● Measured by Toni Llinàs using the CheqSol machine ● Pharmaceutically relevant molecules ● n = 135 molecules ● Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)

  17. Diversity-Conserving Partitioning ● MACCS Structural Key fingerprints ● Tanimoto coefficient ● MaxMin Algorithm Full dataset n = 1000 molecules Training set n = 670 molecules Test set n = 330 molecules
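A sketch of this kind of diversity-conserving split with RDKit (MACCS keys, Tanimoto distance, MaxMin picker); the molecules and subset size are toy placeholders rather than the 1000-compound dataset:

    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

    smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "CCCCCCO"]
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]      # MACCS structural key fingerprints

    # MaxMin picking on 1 - Tanimoto distances between the bit vectors
    picker = MaxMinPicker()
    diverse_subset = list(picker.LazyBitVectorPick(fps, len(fps), 2))
    remainder = [i for i in range(len(fps)) if i not in set(diverse_subset)]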

  18. Diversity Analysis

  19. Structures & Descriptors ● 3D structures from Concord ● Minimised with MMFF94 ● MOE descriptors 2D/3D ● Separate analysis of 2D and 3D descriptors ● QuaSAR Contingency Module (MOE) ● 52 descriptors selected
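The pipeline on the slide used Concord and MOE; as an open-source stand-in, a sketch of the same steps (3D structure generation, MMFF94 minimisation, descriptor calculation) with RDKit:

    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a placeholder
    AllChem.EmbedMolecule(mol, randomSeed=42)    # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol)            # minimise with the MMFF94 force field

    # 2D descriptors (RDKit's own list, standing in for the MOE descriptor set)
    descriptors = {name: fn(mol) for name, fn in Descriptors.descList[:10]}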

  20. Random Forest Machine Learning Method

  21. Random Forest ● Introduced by Breiman and Cutler (2001) ● Development of Decision Trees (Recursive Partitioning): ● Dataset is partitioned into consecutively smaller subsets ● Each partition is based upon the value of one descriptor ● The descriptor used at each split is selected so as to optimise splitting ● Bootstrap sample of N objects chosen from the N available objects with replacement

  22. Random Forest ● Random Forest is a collection of Decision or Regression Trees grown with the CART algorithm. ● Standard Parameters: ● 500 decision trees ● No pruning back: minimum node size = 5 ● “mtry” descriptors (the square root of the total number of descriptors) tried at each split Important features: ● Incorporates descriptor selection ● Incorporates “out-of-bag” validation – using those molecules not in the bootstrap samples
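A sketch of these standard parameters mapped onto scikit-learn's RandomForestRegressor (the original work used Breiman and Cutler's implementation; the data here are random placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(670, 52))                              # placeholder descriptors
    y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=670)     # placeholder log S values

    rf = RandomForestRegressor(
        n_estimators=500,       # 500 decision trees
        min_samples_leaf=5,     # minimum node size = 5, no pruning back
        max_features="sqrt",    # "mtry" = square root of the number of descriptors
        oob_score=True,         # out-of-bag validation on molecules left out of each bootstrap
        random_state=0,
    ).fit(X, y)

    print("out-of-bag r2:", rf.oob_score_)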

  23. Random Forest for Solubility Prediction A Forest of Regression Trees • Dataset is partitioned into consecutively smaller subsets (of similar solubility) • Each partition is based upon the value of one descriptor • The descriptor used at each split is selected so as to minimise the MSE Leo Breiman, “Random Forests”, Machine Learning 45, 5-32 (2001).

  24. Random Forest for Predicting Solubility • A Forest of Regression Trees • Each tree grown until terminal nodes contain specified number of molecules • No need to prune back • High predictive accuracy • Includes method for descriptor selection • No training problems – largely immune from overfitting.

  25. Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 RMSE(tr)=0.27 r2(tr)=0.98 Bias(tr)=0.005 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)

  26. Drug Discovery Today, 10 (4), 289 (2005)

  27. Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?

  28. ΔGsub from lattice energy & an entropy term; ΔGhydr from a semi-empirical solvation model (i.e., different kinds of theoretical/computational methods)
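A minimal numerical sketch of this cycle; the free-energy values are placeholders, and the final conversion from a solution free energy to log S is a standard relation added here (standard-state conventions glossed over):

    import math

    R = 8.314462618e-3   # gas constant, kJ/(mol K)
    T = 298.15           # K

    # placeholder free energies for one molecule, in kJ/mol
    dG_sub = 100.0       # crystal -> gas, from lattice energy plus an entropy term
    dG_hydr = -60.0      # gas -> aqueous solution, from a solvation model

    dG_sol = dG_sub + dG_hydr                    # crystal -> solution, via the cycle
    logS = -dG_sol / (R * T * math.log(10))      # convert to a log10 solubility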

  29. … or this alternative cycle?

  30. ΔGsub from lattice energy (DMAREL) plus entropy; ΔGsolv from SCRF using the B3LYP DFT functional; ΔGtr from ClogP (i.e., different kinds of theoretical/computational methods)

  31. ● Nice idea, but didn’t quite work – errors larger than QSPR ● “Why not add a correction factor to account for the difference between the theoretical methods?” …

  32. ● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors
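A sketch of what such a hybrid model looks like in practice: theoretical energy terms fed into an ordinary regression as descriptors. The columns and numbers below are illustrative placeholders, not the published model:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # hypothetical descriptor table: one row per molecule,
    # columns = theoretical energies (kJ/mol), y = measured log S
    X = np.array([[100.0, -20.0, -15.0],    # dG_sub, dG_solv, dG_tr
                  [ 85.0, -18.0, -10.0],
                  [120.0, -25.0, -22.0],
                  [ 95.0, -15.0, -12.0]])
    y = np.array([-3.1, -2.4, -4.8, -3.0])

    hybrid = LinearRegression().fit(X, y)   # a QSPR with theoretical energies as descriptors
    print(hybrid.intercept_, hybrid.coef_)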

  33. This regression equation gave r2=0.77 and RMSE=0.71

  34. Solubility by TD Cycle: Conclusions ● We have a hybrid part-theoretical, part-empirical method. ● An interesting idea, but relatively low throughput as a crystal structure is needed. ● Slightly more accurate than pure QSPR for a druglike set. ● Instructive to compare with literature of theoretical solubility studies.

  35. 2. Bioactivity Florian Nigsch F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)

  36. Feature Space - Chemical Space A molecule is represented as a vector m = (f1, f2, …, fn) in a feature space of high dimensionality [figure: feature spaces with axes f1, f2, f3 for targets such as COX2, CDK2, CDK1 and DHFR]

  37. Properties of Drugs ● High affinity to the protein target ● Soluble ● Permeable ● Absorbable ● High bioavailability ● Specific rate of metabolism ● Renal/hepatic clearance? ● Volume of distribution? ● Low toxicity ● Plasma protein binding? ● Blood-Brain-Barrier penetration? ● Dosage (once/twice daily?) ● Synthetic accessibility ● Formulation (important in development)

  38. Multiobjective Optimisation ● Synthetic accessibility ● Bioactivity ● Solubility ● Toxicity ● Permeability ● Metabolism A huge number of candidates …

  39. Multiobjective Optimisation ● Synthetic accessibility ● Bioactivity ● Solubility ● Toxicity ● Permeability ● Metabolism [diagram: one candidate labelled Drug, the rest labelled USELESS] A huge number of candidates, most of which are useless!

  40. Spam • Unsolicited (commercial) email • Approx. 90% of all email traffic is spam • Where are the legitimate messages? • Filtering

  41. Analogy to Drug Discovery • Huge number of possible candidates • Virtual screening to help in selection process

  42. Winnow Algorithm • Invented in late 1980s by Nick Littlestone to learn Boolean functions • Name from the verb “to winnow” • High-dimensional input data • Natural Language Processing (NLP), text classification, bioinformatics • Different varieties (regularised, Sparse Network Of Winnow - SNOW, …) • Error-driven, linear threshold, online algorithm
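A minimal sketch of the basic Winnow update rule (the multiplicative promotion/demotion variant) over binary feature vectors; the promotion factor and threshold are textbook defaults, not necessarily the settings used in this work:

    class Winnow:
        """Error-driven, linear-threshold, online learner."""

        def __init__(self, n_features, alpha=2.0):
            self.w = [1.0] * n_features      # all weights start at 1
            self.theta = float(n_features)   # decision threshold
            self.alpha = alpha               # promotion/demotion factor

        def predict(self, x):                # x: binary feature vector (0/1 values)
            return sum(w * xi for w, xi in zip(self.w, x)) >= self.theta

        def update(self, x, label):          # online: update only when a mistake is made
            if self.predict(x) == label:
                return
            factor = self.alpha if label else 1.0 / self.alpha
            self.w = [w * factor if xi else w for w, xi in zip(self.w, x)]

Training is simply one or more passes over labelled examples, calling update for each; weights change only on mistakes, which is what makes the algorithm error-driven and fast on high-dimensional sparse data.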

  43. Winnow (“Molecular Spam Filter”) Machine Learning Methods

  44. Training Example

  45. Features of Molecules Based on circular fingerprints
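The transcript does not say which circular fingerprint implementation was used; as one plausible stand-in, a sketch of extracting Morgan/ECFP-like circular atom environments with RDKit:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin as a placeholder

    # circular atom environments up to radius 2, hashed to integer feature identifiers
    fp = AllChem.GetMorganFingerprint(mol, 2)
    feature_ids = sorted(fp.GetNonzeroElements().keys())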

  46. Combinations of Features Combinations of molecular features to account for synergies.

  47. Pairing Molecular Features Gives Bigrams

  48. Orthogonal Sparse Bigrams • Technique used in text classification/spam filtering • Sliding window process • Sparse - not all combinations • Orthogonal - non-redundancy
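One plausible reading of this sliding-window construction, sketched in Python; the window size and the exact pairing scheme are assumptions rather than the published settings:

    def orthogonal_sparse_bigrams(features, window=5):
        """Pair each feature with each earlier feature inside a sliding window.

        Sparse: only pairs involving the newest element of the window are kept,
        not all combinations. The skip distance is recorded so that pairs made
        at different separations remain distinct (non-redundant) features.
        """
        bigrams = []
        for i, current in enumerate(features):
            for skip in range(1, window):
                j = i - skip
                if j < 0:
                    break
                bigrams.append((features[j], current, skip))
        return bigrams

    # e.g. on a short sequence of molecular feature identifiers
    print(orthogonal_sparse_bigrams(["f1", "f2", "f3", "f4"], window=3))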

  49. Workflow

  50. Protein Target Prediction • Which proteins does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology
