
Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules

Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules. Dr John Mitchell University of St Andrews. 1. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the pharma industry.


Presentation Transcript


  1. Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules Dr John Mitchell University of St Andrews

  2. 1. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the pharma industry. A good model for predicting the solubility of druglike molecules would be very valuable.

  3. How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

  4. Theoretical Chemistry • Calculations and simulations based on real physics. • Calculations are either quantum mechanical or use parameters derived from quantum mechanics. • Attempt to model or simulate reality. • Usually Low Throughput.

  5. Dataset • The thermodynamically most stable polymorph was selected where possible. • All have experimental crystal structures. • All have experimental logS. • 10 have experimental ΔGsub and ΔGhydr (circled in red).

  6. CheqSol Method • First precipitation from the supersaturated solution gives the Kinetic Solubility (not in equilibrium). • Thermodynamic Solubility is obtained through “Chasing Equilibrium”, giving the Intrinsic Solubility (in equilibrium). • Supersaturation Factor SSF = Skin – S0. • We continue “Chasing Equilibrium” until a specified number of crossing points have been reached. • A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution: no change in pH, gradient zero, no re-dissolving nor precipitating. The solution is in equilibrium (“CheqSol”). • 8 intrinsic solubility values; repeatability better than 0.05 log units. * A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983

  7. Thermodynamic Cycle

  8. Thermodynamic Cycle Gas Solution Crystal

  9. Sublimation Free Energy Gas Crystal

  10. Sublimation Free Energy Gas (rigid molecule approximation) Crystal

  11. Sublimation Free Energy Gas Crystal Calculating ΔGsub is a standard procedure in crystal structure prediction

  12. Theoretical method for crystal: Gsub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.

  13. Results for ΔGsub Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

  14. A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.

  15. Thermodynamic Cycle Gas Solution Crystal

  16. Hydration Free Energy We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc.

  17. Theoretical method for aqueous solution: Ghydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).

  18. Reference Interaction Site Model (RISM) • Combines features of explicit and implicit solvent models. • Solvent density is modelled, but no explicit molecular coordinates or dynamics. ~45 CPU mins per compound

  19. Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.

  20. Results for ΔGhyd Perhaps surprisingly, error in Ghyd is smaller than in Gsub.

  21. logS from Thermodynamic Cycle Gas Solution Crystal Add the two terms to get ΔGsol and hence logS.
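
The step from the two free energies to logS can be sketched as below. This is a minimal illustration, assuming a 1 mol/L standard state so that ΔGsol = −RT ln S; the slides do not state the standard-state convention used. The function name and the input values are hypothetical and are not taken from the talk.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def log_s_from_cycle(dg_sub, dg_hyd, temperature=298.15):
    """Return (dG_sol, logS) from sublimation and hydration free energies (kJ/mol).

    Assumption: 1 mol/L standard state, so dG_sol = -RT ln(S).
    """
    dg_sol = dg_sub + dg_hyd  # add the two legs of the thermodynamic cycle
    log_s = -dg_sol / (R * temperature * math.log(10))
    return dg_sol, log_s


# Hypothetical inputs for illustration only.
dg_sol, log_s = log_s_from_cycle(dg_sub=60.0, dg_hyd=-45.0)
print(f"dG_sol = {dg_sol:.1f} kJ/mol, logS = {log_s:.2f}")
```

Because the two terms have opposite signs and similar magnitudes, a modest error in either one carries straight through to ΔGsol and hence to logS.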

  22. Results for ΔGsol

  23. Conclusions: Solubility from Theory • Must calculate Gsub & Ghyd separately; • RISM is efficient & fairly accurate for Ghyd; • Experimental data for Gsub & Ghyd sparse and errors may be large; • Dataset size and composition make comparisons of methods hard; • Not yet matched accuracy of informatics.

  24. Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. • Do not attempt to simulate reality. • Usually High Throughput.

  25. What Error is Acceptable? • For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. • An RMSE > 1.0 logS unit is probably unacceptable. • This range of 0.7 to 1.0 logS units corresponds to an error of 4.0 to 5.7 kJ/mol in Gsol.
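
The conversion between logS units and kJ/mol quoted on this slide can be checked with a short calculation. The sketch assumes T = 298.15 K (the slide does not state a temperature) and uses the relation that one log10 unit corresponds to RT ln 10 in free energy; the function name is hypothetical.

```python
import math

R = 8.314462618  # gas constant in J/(mol K)


def logs_error_to_kj_per_mol(error_log_units, temperature=298.15):
    """Convert an error in logS units to a free energy error in kJ/mol."""
    return error_log_units * R * temperature * math.log(10) / 1000.0


for rmse in (0.7, 1.0):
    print(f"{rmse:.1f} logS units ~ {logs_error_to_kj_per_mol(rmse):.1f} kJ/mol")
```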

  26. What Error is Acceptable? • A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; • The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
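
The "useless model" baseline can be illustrated with a small sketch: a model that always predicts the mean logS of the test set has an RMSE equal to the (population) standard deviation of that set. The logS values below are hypothetical and serve only to show the identity.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical test-set logS values.
log_s = [-1.2, -2.5, -3.1, -4.8, -0.6, -3.9]

mean_log_s = statistics.fmean(log_s)
baseline_rmse = rmse(log_s, [mean_log_s] * len(log_s))  # always predict the mean
sd = statistics.pstdev(log_s)  # population standard deviation

print(f"baseline RMSE = {baseline_rmse:.3f}, SD = {sd:.3f}")
```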

  27. Random Forest Machine Learning Method

  28. Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 Ntrain = 658; Ntest = 300 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
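
The three statistics reported here can be computed as in the following sketch. It assumes that r2 is the squared Pearson correlation and that bias is the mean of (predicted − observed); the slides do not give either definition, and the function name and example values are hypothetical.

```python
import math


def regression_stats(observed, predicted):
    """Return (RMSE, r2, bias) for paired observed and predicted values.

    Assumptions: r2 is the squared Pearson correlation coefficient and
    bias is the mean signed error (predicted - observed).
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)
    return rmse, r2, bias


# Hypothetical data: every prediction is 0.5 log units too high.
stats = regression_stats([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print("RMSE=%.2f r2=%.2f bias=%.2f" % stats)
```

The example shows why bias is worth reporting alongside r2: a constant offset leaves the correlation perfect while the RMSE is non-zero.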

  29. Support Vector Machine

  30. SVM: Solubility Results et al., RMSE(te)=0.94 r2(te)=0.79 Ntrain = 150 + 50; Ntest = 87

  31. 100 Compound Cross-Validation Theoretical energies don’t seem to improve descriptor models.

  32. Replicating Solubility Challenge (post hoc) CDK descriptors; methods: RF, RF, PLS, SVM. RMSE(te) = 1.09; 1.00; 0.89; 1.08. r2(te) = 0.39; 0.49; 0.58; 0.41. 10; 12; 12; 13 of 28 correct within 0.5 logS units. Ntrain = 94; Ntest = 28
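
The "correct within 0.5 logS units" count can be sketched as follows. The function name, the inclusive comparison and the example values are assumptions made for illustration; they are not taken from the Solubility Challenge data.

```python
def count_within(observed, predicted, tolerance=0.5):
    """Count predictions whose absolute error is at most `tolerance` log units."""
    return sum(1 for o, p in zip(observed, predicted) if abs(p - o) <= tolerance)


# Hypothetical observed and predicted logS values.
observed = [-2.0, -3.0, -4.0, -1.0]
predicted = [-2.3, -3.6, -4.4, -1.0]

n_correct = count_within(observed, predicted)
print(f"{n_correct} of {len(observed)} correct within 0.5 logS units")
```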

  33. Replicating Solubility Challenge (post hoc) CDK descriptors; methods: RF, RF, PLS, SVM. Although the test dataset is small, it is a standard set. Ntrain = 94; Ntest = 28

  34. Conclusions: Solubility from Informatics • Experimental data: errors unknown, but limit possible accuracy of models; • CheqSol - step in right direction; • Dataset size and composition hinder comparisons of methods; • Solubility Challenge – step in right direction.

  35. 2. Protein Target Prediction • Which protein does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology

  36. Predicted Protein Targets Actually we are predicting closely target-related MDDR classes • Selection of 233 classes from the MDL Drug Data Report • ~90,000 molecules • 15 independent 50%/50% splits into training/test set

  37. Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
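
A "correct within the three top-ranking predictions" rate can be computed as in this sketch. The function name and the class labels are hypothetical; the ranking method itself is not described on the slide and is not reproduced here.

```python
def top_k_hit_rate(ranked_predictions, true_classes, k=3):
    """Fraction of queries whose true class appears in the top k ranked classes."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_classes)
        if truth in ranked[:k]
    )
    return hits / len(true_classes)


# Hypothetical ranked class lists for three query molecules.
ranked = [
    ["class_A", "class_B", "class_C", "class_D"],
    ["class_C", "class_A", "class_D", "class_B"],
    ["class_B", "class_C", "class_A", "class_D"],
]
truth = ["class_A", "class_B", "class_A"]

print(f"top-1: {top_k_hit_rate(ranked, truth, k=1):.2f}")
print(f"top-3: {top_k_hit_rate(ranked, truth, k=3):.2f}")
```

In the talk this figure is averaged over 15 independent 50%/50% splits, which is where the quoted ±0.5% comes from.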

  38. Protein Target Prediction • Given a specific compound, is it possible to predict computationally its biological interactions with protein targets? • Very important for • In silico screening (time and money efficient) • off-target prediction (side effects) • Can be used for identifying substances with performance-enhancing potential. Drug discovery: Predicting promiscuity, Andrew L. Hopkins, Nature 462, 167-168 (12 November 2009), doi:10.1038/462167a

  39. Substances Prohibited in Sports • WADA publishes and maintains a prohibited list of banned compounds, updated every 6 months • Substances are split into three main categories: Substances prohibited at all times (in and out of competition): S0. Non-Approved substances; S1. Anabolic Agents; S2. Peptide hormones, Growth Factors and Related Substances; S3. Beta-2 Agonists; S4. Hormone Antagonists and Modulators; S5. Diuretics and Other Masking Agents. Substances prohibited in competition: S6. Stimulants; S7. Narcotics; S8. Cannabinoids; S9. Glucocorticosteroids. Substances prohibited in particular sports: P1. Alcohol with a violation threshold of 0.10 g/L (Archery, Karate etc.); P2. Beta-Blockers prohibited In-Competition only (Bridge, Curling, Darts, Wrestling, Archery etc.)

  40. Methodology Query: Stanozolol, searched against the Database. Ranked classes: Rank 1, Anabolic Agents; Rank 2, Vitamin D; Rank 3, Glucocorticoids.

  41. ChEMBL Activities • Each compound has experimental data for a number of targets • Activity data based on IC50, EC50, Ki, Kd etc. • Some activities just labelled “inactive” or “active” • Each compound can have more than one record for a given target

  42. Filtering the ChEMBL Families • Each of the 8,845 targets has a number of compounds assigned • Not all compounds have actual data on the target or are active • We filtered each of the families according to rules defining “active” and “inactive” • The rules were decided by visual inspection of distributions • Rules • IC50: ≤ 50000 nM active & > 50000 nM inactive • Ki: < 20000 nM active & ≥ 20000 nM inactive • Kd: ≤ 10000 nM active & > 10000 nM inactive • EC50: ≤ 40000 nM active & > 40000 nM inactive • ED50: ≤ 10000 nM active & > 10000 nM inactive • Potency: ≤ 10000 nM active & > 10000 nM inactive • Activity: ≥ 40% active & < 40% inactive • Inhibition: ≥ 45% active & < 45% inactive
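
The rules on this slide can be expressed as a small lookup table, as in the sketch below. The thresholds and comparison directions are copied from the slide (note that Ki uses a strict "<" where the other concentration rules use "≤"); the names `RULES` and `classify` are hypothetical and do not come from the authors' code or from the ChEMBL interface.

```python
import operator

# Measurement type -> (comparison that defines "active", threshold).
# Concentrations are in nM; Activity and Inhibition are percentages.
RULES = {
    "IC50": (operator.le, 50000.0),
    "Ki": (operator.lt, 20000.0),
    "Kd": (operator.le, 10000.0),
    "EC50": (operator.le, 40000.0),
    "ED50": (operator.le, 10000.0),
    "Potency": (operator.le, 10000.0),
    "Activity": (operator.ge, 40.0),
    "Inhibition": (operator.ge, 45.0),
}


def classify(measurement_type, value):
    """Label a single activity record as 'active' or 'inactive'."""
    is_active, threshold = RULES[measurement_type]
    return "active" if is_active(value, threshold) else "inactive"


# Hypothetical records for illustration.
for record in [("IC50", 1200.0), ("Ki", 20000.0), ("Inhibition", 52.0)]:
    print(record, "->", classify(*record))
```

How multiple records for the same compound and target are combined is not stated on the slide, so the sketch classifies one record at a time.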

  43. Example: Distributions of Ki (three plots; x-axis: Index)

  44. Refined Families • Filtered families consist of compounds with significant experimental activities against the relevant targets. • Many targets have distinct groups of ligands with different scaffolds. • Maybe because there is more than one binding site, or because different scaffolds can fit the same site. • Splitting such a family into smaller groups based on ligand structure will allow us to identify the different sets of ligands.

  45. Refined Families - PFClust We selected the PFClust algorithm because it is a parameter free clustering algorithm and does not require any kind of parameter tuning. PFClust: A novel parameter free clustering algorithm. Mavridis L, Nath N, Mitchell JBO. BMC Bioinformatics 2013, 14:213.

  46. Database Refinement Original database: 5443 Families, 1366460 Compounds. After Rule Filtering: 3563 Families, 783690 Compounds. After Clustering (Refined Families): 19639 Families, 616600 Compounds. Predicting the protein targets for athletic performance-enhancing substances. Mavridis L, Mitchell JBO. J Cheminformatics 2013, 5:31.

  47. Database Refinement - Validation • Monte Carlo Cross-Validation • The three versions of the database were examined (Original, Filtered and Refined) • 10% of each family were randomly removed and used as queries • If the top prediction was the family that the query was a member of, a TP would be counted; if not, a FP • Average Matthews Correlation Coefficient (MCC) • Original: 0.02 • Filtered: 0.03 • Refined: 0.66 • Top Hit (Top four): Original 2.58% (6.61%); Filtered 3.18% (7.21%); Refined 66.98% (87.25%)
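
The validation procedure can be sketched in two pieces: holding out 10% of a family as queries, and computing the Matthews Correlation Coefficient from a confusion matrix. The slide defines only how TP and FP are counted, so the way TN and FN are obtained is left to the caller here. All names are hypothetical, and keeping at least one query per family is an assumption of this sketch.

```python
import math
import random


def hold_out_queries(family, fraction=0.1, rng=None):
    """Randomly split one family into (queries, remaining members).

    Members must be hashable (for example compound identifiers).
    """
    rng = rng or random.Random()
    n_query = max(1, round(len(family) * fraction))  # assumption: at least one query
    queries = rng.sample(family, n_query)
    query_set = set(queries)
    remaining = [member for member in family if member not in query_set]
    return queries, remaining


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denominator


# Hypothetical family of 100 compound identifiers.
queries, remaining = hold_out_queries(list(range(100)), rng=random.Random(0))
print(len(queries), "queries;", len(remaining), "compounds left in the family")
print("MCC for a perfect classifier:", matthews_corrcoef(50, 0, 50, 0))
```

Repeating the random hold-out many times and averaging the MCC gives the Monte Carlo estimate described on the slide.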
