
Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships. QSAR/QSPR modeling. Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE. QSAR/QSPR models. Development Validation Application. Development QSAR models.


Presentation Transcript


  1. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships QSAR/QSPR modeling Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE

  2. QSAR/QSPR models: Development, Validation, Application

  3. Development of QSAR models
     • Selection and curation of experimental data
     • Preparation of training and test sets (optionally)
     • Selection of an initial set of descriptors and their normalisation
     • Variables selection
     • Selection of a machine-learning method
     Validation of models
     • Training/test set
     • Cross-validation (internal, external)
     Application of the models
     • Models' applicability domain

  4. Development of QSAR models: Experimental Data, Descriptors, Mathematical techniques, Statistical criteria

  5. Preparation of training and test sets
     • Splitting of an initial data set into a training set and a test set (10–15 % of the initial data set)
     • Training set: building of structure-property models
     • Selection of the best models according to statistical criteria
     • Test set: “prediction” calculations using the best structure-property models
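
The splitting step on this slide can be sketched as follows. This is a minimal illustration, not the procedure used by the author: the random shuffle, the function name `split_train_test`, the default `test_fraction=0.15` (taken from the upper end of the 10–15 % range) and the toy data are all assumptions.

```python
import random

def split_train_test(compounds, test_fraction=0.15, seed=0):
    """Randomly split a data set into (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set: (compound name, property value) pairs.
data = [(f"compound_{i}", 0.1 * i) for i in range(40)]
train, test = split_train_test(data, test_fraction=0.15)
print(len(train), len(test))
```

A purely random split ignores the recommendations on activity range and active/inactive balance; a stratified split would be needed to respect them.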

  6. Recommendations to prepare a test set
     • (i) experimental methods for determination of activities in the training and test sets should be similar;
     • (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%;
     • (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.
     References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215

  7. Selection of descriptors for QSAR model
     QSAR models should be reduced to a set of descriptors which is as information-rich but as small as possible. Rules of thumb: good “spread”, 5-6 structure points per descriptor.
     • Objective selection (independent variable only)
       • Statistical criteria of correlations
       • Pairwise selection (Forward or Backward Stepwise selection)
       • Principal Component Analysis
       • Partial Least Squares analysis
       • Genetic Algorithm
       • ……………….
     • Subjective selection
       • Descriptors selection based on mechanistic studies

  8. Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs)
     1. identify a subset of columns (variables) with significant correlation to the response;
     2. remove columns (variables) with small variance;
     3. remove columns (variables) with no unique information;
     4. identify a subset of variables on which to construct a model;
     5. address the problem of chance correlation.
     D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
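
Steps 2 and 3 of this strategy can be illustrated with a short sketch. It is not the algorithm from the cited paper: the thresholds `min_variance` and `max_correlation`, the use of pairwise Pearson correlation as the test for "no unique information", and the descriptor names are assumptions chosen for illustration.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def filter_descriptors(columns, min_variance=1e-6, max_correlation=0.95):
    """columns maps a descriptor name to its values; returns the kept names."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # step 2: (nearly) constant column
        if any(abs(pearson(values, columns[k])) > max_correlation for k in kept):
            continue  # step 3: duplicates information of a kept column
        kept.append(name)
    return kept

# Hypothetical descriptor table for five compounds.
columns = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
    "logP_copy": [2.0, 4.0, 6.0, 8.0, 10.0],
    "n_rings": [1.0, 3.0, 2.0, 5.0, 4.0],
}
kept = filter_descriptors(columns)
print(kept)
```

Here `constant` is dropped for its zero variance and `logP_copy` for being perfectly correlated with `logP`.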

  9. Machine-Learning Methods

  10. Fitting models’ parameters: $Y = F(a_i, X_i)$, where $X_i$ are the descriptors (independent variables) and $a_i$ are the fitted parameters. The goal is to minimize the Residual Sum of Squares (RSS).

  11. Multiple Linear Regression: $Y_i = a_0 + a_1 X_{i1}$ [Figure: plot of Y against X]

  12. Multiple Linear Regression: $y = ax + b$ [Figure: Residual Sum of Squares (RSS) as a function of the parameters a and b]

  13. Multiple Linear Regression: $Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$
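
A multiple linear regression of this form can be fitted by minimizing the RSS. The sketch below solves the normal equations with Gauss-Jordan elimination in plain Python; the function names and the toy data are assumptions, and a numerical library would normally be used instead.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = a0 + a1*x1 + ... + am*xm; returns [a0, ..., am]."""
    rows = [[1.0] + list(r) for r in X]            # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]        # partial pivoting
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coeffs, x):
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))

def rss(coeffs, X, y):
    """Residual Sum of Squares of the fitted model."""
    return sum((yi - predict(coeffs, xi)) ** 2 for xi, yi in zip(X, y))

# Toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, -2.0, 0.0, 2.0]
coeffs = fit_mlr(X, y)
print(coeffs, rss(coeffs, X, y))
```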

  14. kNN (k Nearest Neighbors): the activity Y of a compound is assessed by calculating a weighted mean of the activities Yi of its k nearest neighbors in the chemical space. [Figure: training set plotted in the Descriptor 1 / Descriptor 2 plane] A. Tropsha, A. Golbraikh, 2003
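
A minimal sketch of such a kNN estimate is given below. The slide does not specify the distance measure or the weighting scheme, so the Euclidean distance and the inverse-distance weights used here are assumptions, as are the function name and the toy data.

```python
import math

def knn_predict(query, train_X, train_y, k=3):
    """Weighted mean of the activities of the k nearest training compounds."""
    nearest = sorted((math.dist(query, x), y) for x, y in zip(train_X, train_y))[:k]
    for d, y in nearest:
        if d == 0.0:
            return y                       # query coincides with a training point
    weights = [1.0 / d for d, _ in nearest]  # assumed inverse-distance weighting
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical training set in a two-descriptor space.
train_X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
train_y = [1.0, 2.0, 3.0, 50.0]
prediction = knn_predict((1.0, 0.0), train_X, train_y, k=2)
print(prediction)
```

The query lies midway between the first two training points, so both receive the same weight and the prediction is their plain average.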

  15. Biological and Artificial Neuron

  16. Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer – to properties being predicted, neurons in the hidden layer – to nonlinear latent variables

  17. QSAR/QSPR models: Development, Validation, Application

  18. Validating the QSAR Equation
     • How well does the model predict the activity of known compounds?
     • For a perfect model:
       • All data points would reside on the diagonal of the predicted vs. actual plot.
       • All variance existing in the original data is explained by the model.
     • r2 is the fraction of the total variation in the dependent variable that is explained by the regression equation.

  19. Calculating r2
     • Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around regression line)
     [Figure: original variance compared with the variance around the regression line]

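  20. Calculating r2 (with $\hat{y}_i$ the value calculated by the equation and $\bar{y}$ the mean of the observed values)
     • Original variance: $TSS = \sum_i (y_i - \bar{y})^2$
     • Explained variance: $ESS = \sum_i (\hat{y}_i - \bar{y})^2$
       • Improvement in predicting y from just using the mean of y
     • Variance around regression line: $RSS = \sum_i (y_i - \hat{y}_i)^2$
     • $r^2 = ESS / TSS = 1 - RSS / TSS$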
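
The decomposition of the original variance can be checked numerically. The sketch below uses the standard sums of squares (the name TSS for the original variance is an assumption; ESS and RSS follow the F-test slide) and a small invented data set whose calculated values come from the least-squares line y = 1.9 x.

```python
def variance_decomposition(y_obs, y_calc):
    """Return (TSS, ESS, RSS) for observed and calculated values."""
    y_mean = sum(y_obs) / len(y_obs)
    tss = sum((y - y_mean) ** 2 for y in y_obs)             # original variance
    ess = sum((yc - y_mean) ** 2 for yc in y_calc)          # explained variance
    rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))  # residual variance
    return tss, ess, rss

# Invented example: x = 1, 2, 3, 4 and the least-squares line y = 1.9 * x.
y_obs = [2.0, 4.0, 5.0, 8.0]
y_calc = [1.9, 3.8, 5.7, 7.6]
tss, ess, rss = variance_decomposition(y_obs, y_calc)
r2 = 1.0 - rss / tss
print(tss, ess, rss, r2)
```

The identity TSS = ESS + RSS holds here because the calculated values come from a least-squares fit with an intercept; for arbitrary predictions it need not hold.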

  21. F-test
     • Tests the assumption that a significant portion of the original variance has been explained by the model.
     • In statistical terms, tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0. A ratio of 0 would imply ESS = 0, i.e., the model did not explain any of the variance.

  22. F-distribution
     • As N and k decrease, the probability of getting large r2 values purely by chance increases.
     • Thus, as N and k decrease, a larger F-value is required for the test to be significant.
     [Figure: F-distribution plot with axes k and N]

  23. Calculating F Values: $F = \dfrac{ESS/k}{RSS/(N-k-1)} = \dfrac{r^2\,(N-k-1)}{k\,(1-r^2)}$
     • Calculate F according to the above equation.
     • Select a significance level (e.g., 0.05).
     • Look up the F-value from an F-distribution derived for the correct number of N and k at the selected significance level.
     • If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.
     • Example: r2 = 0.89, N = 7, k = 1, F = 40.46
       • For an F-distribution with N = 7, k = 1, a value of 40.46 corresponds to a significance level of 0.9997. Thus, the equation is significant at this level. The probability that the correlation is fortuitous is < 0.03%
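
The F-value of the example can be reproduced from r2, N and k, since ESS/RSS = r2/(1 - r2). In the sketch below the function name is an assumption, and the critical value 6.61 is the tabulated F for the 0.05 level with 1 and 5 degrees of freedom, hard-coded rather than computed.

```python
def f_value(r2, n, k):
    """F = (ESS/k) / (RSS/(N-k-1)), written in terms of r2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

F = f_value(r2=0.89, n=7, k=1)

# Tabulated critical value at the 0.05 level for k = 1 and N - k - 1 = 5
# degrees of freedom (taken from a standard F table, not computed here).
F_CRIT_005 = 6.61
significant = F > F_CRIT_005
print(round(F, 2), significant)
```

The result differs from the quoted 40.46 only in the second decimal, presumably because r2 = 0.89 is a rounded value.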

  24. Validation of Models 5-fold external cross-validation procedure

  25. Cross Validation
     • Q2 is a measure of the predictive ability of the model (as opposed to the measure of fit produced by r2).
     • r2 always increases as more descriptors are added.
     • Q2 initially increases as more parameters are added but then starts to decrease, indicating data overfitting. Thus Q2 is a better indicator of the model quality.
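
A cross-validated Q2 can be sketched as follows for a one-descriptor linear model with 5 folds. The slides do not give the formula, so the common definition Q2 = 1 - PRESS/TSS (PRESS being the sum of squared errors on the held-out compounds) is an assumption, as are the function names and the invented data.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for one descriptor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def q2_kfold(x, y, n_folds=5, seed=0):
    """Q2 = 1 - PRESS/TSS, each compound predicted by a model not trained on it."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    press = 0.0
    for fold in folds:
        held_out = set(fold)
        x_tr = [x[i] for i in idx if i not in held_out]
        y_tr = [y[i] for i in idx if i not in held_out]
        a0, a1 = fit_line(x_tr, y_tr)
        press += sum((y[i] - (a0 + a1 * x[i])) ** 2 for i in fold)
    y_mean = sum(y) / len(y)
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)

# Invented data: y = 2x + 1 plus small deviations.
x = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)]

a0, a1 = fit_line(x, y)
y_mean = sum(y) / len(y)
r2 = 1.0 - (sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
            / sum((yi - y_mean) ** 2 for yi in y))
q2 = q2_kfold(x, y)
print(r2, q2)
```

For a least-squares model the held-out errors are never smaller than the fitted residuals, so Q2 does not exceed r2.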

  26. Other Model Validation Parameters
     • s is the standard deviation about the regression line, $s = \sqrt{RSS/(N-k-1)}$. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR.
     • N is the number of observations and k is the number of variables.
     • Scrambling of y.

  27. Statistical tests for « chance correlations »
     • Scrambling: randomly mix
       • Y values (Y-scrambling), or
       • X values (X-scrambling), or
       • simultaneously Y and X values (X,Y-scrambling)
     • Randomization: generate random numbers
       • from Ymin to Ymax (Y-randomization),
       • from Xmin to Xmax (X-randomization),
       • or do this simultaneously for Y and X (X,Y-randomization)
     Calculate statistical parameters of correlations and compare them with those obtained for the model.

  28. Scrambling [Figure: the property values Pro.1 … Pro.n, initially paired with the structures Struc.1 … Struc.n, are randomly reassigned among the structures]
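
A Y-scrambling test can be sketched as below for a one-descriptor model: the property values are shuffled among the compounds many times, and the r2 of each scrambled correlation is compared with the r2 of the real one. The number of trials, the function names and the data are assumptions made for illustration.

```python
import random

def r_squared(x, y):
    """r2 of the one-descriptor least-squares fit (squared Pearson r)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_scrambling(x, y, n_trials=200, seed=0):
    """Return the r2 of the real model and the r2 values after shuffling Y."""
    rng = random.Random(seed)
    scrambled = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)                # break the structure-property pairing
        scrambled.append(r_squared(x, y_perm))
    return r_squared(x, y), scrambled

# Invented data: a nearly linear structure-property relationship.
x = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)]

r2_model, r2_scrambled = y_scrambling(x, y)
mean_scrambled = sum(r2_scrambled) / len(r2_scrambled)
print(r2_model, mean_scrambled)
```

If the scrambled correlations were as good as the real one, the model would be suspected of being a chance correlation.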

  29. QSAR/QSPR models: Development, Validation, Application

  30. Prediction performance of QSPR models for a test compound depends on:
     • Robustness of QSPR models: descriptors type; descriptors selection; machine-learning methods; validation of models.
     • Applicability domain of models: is a test compound similar to the training set compounds?

  31. Applicability domain of QSAR models
     The new compound will be predicted by the model only if $D_i \le \langle D_k \rangle + Z \times s_k$, with Z an empirical parameter (0.5 by default).
     • INSIDE THE DOMAIN: will be predicted
     • OUTSIDE THE DOMAIN: will not be predicted
     [Figure: test compound and training set plotted in the Descriptor 1 / Descriptor 2 plane]
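
The distance criterion can be sketched as follows. The slide does not define its symbols, so this sketch assumes that the mean and standard deviation are taken over the Euclidean distances between each training compound and its k nearest training neighbors, and that the distance of the test compound is measured to its nearest training compound; the function names and data are hypothetical.

```python
import math
from statistics import mean, stdev

def domain_threshold(train_X, k=1, z=0.5):
    """Threshold <D_k> + Z * s_k from nearest-neighbor distances in the training set."""
    dists = []
    for i, xi in enumerate(train_X):
        d = sorted(math.dist(xi, xj) for j, xj in enumerate(train_X) if j != i)
        dists.extend(d[:k])
    return mean(dists) + z * stdev(dists)

def in_domain(query, train_X, threshold):
    """True if the test compound is close enough to the training set."""
    d_i = min(math.dist(query, x) for x in train_X)   # assumed definition of D_i
    return d_i <= threshold

# Hypothetical training set in a two-descriptor space.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 0.0)]
threshold = domain_threshold(train_X, k=1, z=0.5)
print(threshold, in_domain((0.5, 0.5), train_X, threshold),
      in_domain((6.0, 6.0), train_X, threshold))
```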

  32. Applicability domain of QSAR models: Range-based methods
     • Bounding Box (BB)
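
A bounding box check is the simplest range-based method: a test compound is inside the domain if each of its descriptor values lies between the minimum and maximum of that descriptor in the training set. The sketch below is a minimal illustration with hypothetical data and function names.

```python
def bounding_box(train_X):
    """Per-descriptor minimum and maximum over the training set."""
    columns = list(zip(*train_X))
    return [min(c) for c in columns], [max(c) for c in columns]

def inside_box(query, lower, upper):
    """True if every descriptor value of the query lies within the training range."""
    return all(lo <= q <= hi for q, lo, hi in zip(query, lower, upper))

# Hypothetical training set with two descriptors.
train_X = [(0.5, 10.0), (1.5, 12.0), (2.0, 8.0), (3.5, 15.0)]
lower, upper = bounding_box(train_X)
print(lower, upper)
print(inside_box((1.0, 9.0), lower, upper), inside_box((4.0, 9.0), lower, upper))
```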

  33. Should one use only one individual model or many models? Ensemble modeling

  34. Hunting season … Single hunter

  35. Hunting season … Many hunters

  36. Ensemble modelling [Figure: Model 1, Model 2, Model 3, Model 4]

  37. Property (Y) predictions using best fit models
     | Compound   | model 1 | model 2 | … | mean ± s    |
     |------------|---------|---------|---|-------------|
     | Compound 1 | Y11     | Y12     | … | <Y1> ± ΔY1  |
     | Compound 2 | Y21     | Y22     | … | <Y2> ± ΔY2  |
     | …          | …       | …       |   |             |
     | Compound m | Ym1     | Ym2     | … | <Ym> ± ΔYm  |
     Grubbs statistics is used to exclude the outliers.
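
One row of this table can be computed as in the sketch below: the predictions of the individual models are averaged after a single pass of the Grubbs test. How the test is applied in the original workflow is not stated, so the single-pass scheme, the function name and the example predictions are assumptions; the critical value is passed in by the caller and, in the usage example, is the tabulated two-sided value for five values at the 0.05 level.

```python
from statistics import mean, stdev

def consensus_prediction(predictions, g_crit):
    """Mean and standard deviation of model predictions after one Grubbs pass."""
    values = list(predictions)
    if len(values) >= 3:
        m, s = mean(values), stdev(values)
        if s > 0:
            suspect = max(values, key=lambda v: abs(v - m))
            # Grubbs statistic G = |suspect - mean| / s
            if abs(suspect - m) / s > g_crit:
                values.remove(suspect)
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread, values

# Hypothetical predictions of five models for one compound.
predictions = [5.1, 5.0, 5.2, 4.9, 9.0]
# 1.715: tabulated two-sided Grubbs critical value for N = 5, alpha = 0.05.
y_mean, y_spread, kept = consensus_prediction(predictions, g_crit=1.715)
print(y_mean, y_spread, kept)
```

Because the critical value depends on the number of values, repeating the test after a removal would require a new table entry.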

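  38. Calculation of Descriptors: ISIDA FRAGMENTOR converts the data set into the pattern matrix of fragment counts. [Figure: fragments C-C-C-C-C-C, C-C-C-N-C-C, C-N-C-C*C, C-C-C-N, C=O, etc.; example rows of the pattern matrix: 0 10 1 5 0; 0 8 1 4 0; 0 4 1 2 4]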

  39. PATTERN MATRIX and PROPERTY VALUES → LEARNING STAGE: building of models → VALIDATION STAGE: QSAR models filtering, selection of the most predictive ones → QSAR models [Figure: example values -0.222, +0.973, -0.066]

  40. Example: linear QSPR model. $\mathrm{PROPERTY}_{calc} = -0.36\,N_{\text{C-C-C-N-C-C}} + 0.27\,N_{\text{C=O}} + 0.12\,N_{\text{C-N-C*C}} + \dots$
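
Applying such a fragment-based linear model amounts to a weighted sum of fragment counts. The sketch below uses only the three terms shown on the slide (the remaining terms are elided there and are not reconstructed here); the fragment counts of the example compound are hypothetical.

```python
# Coefficients of the three terms shown on the slide; the model has further
# terms ("+ ...") that are not given and are therefore omitted here.
COEFFICIENTS = {"C-C-C-N-C-C": -0.36, "C=O": 0.27, "C-N-C*C": 0.12}

def property_calc(fragment_counts, coefficients=COEFFICIENTS):
    """Partial value of the property: sum of coefficient * fragment count."""
    return sum(c * fragment_counts.get(frag, 0) for frag, c in coefficients.items())

# Hypothetical fragment counts for one compound.
counts = {"C-C-C-N-C-C": 2, "C=O": 1, "C-N-C*C": 3}
value = property_calc(counts)
print(round(value, 2))
```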

  41. Virtual screening with QSAR/QSPR models

  42. Virtual Screening: Database → QSPR model (screening and hits selection) → Hits → Experimental Tests; useless compounds are discarded.

  43. Combinatorial Library Design

  44. Generation of Virtual Combinatorial Libraries [Figure: a Markush structure with substituents R1, R2, R3 and the structures generated from it]

  45. The types of variation in Markush structures (R1 = Me, Et, Pr; R2 = NH2; R3 = alkyl or heterocycle; n = 1–3):
     • Substituent variation (R1)
     • Position variation (R2)
     • Frequency variation
     • Homology variation (R3) (only for patent search)
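
Enumerating a virtual library from such a Markush structure is a Cartesian product over the allowed variations. The sketch below takes R1 and n from the slide; the two R3 examples are hypothetical stand-ins for "alkyl or heterocycle", the position variation of R2 is left out, and library members are represented as plain dictionaries rather than chemical structures.

```python
from itertools import product

R1 = ["Me", "Et", "Pr"]     # substituent variation (from the slide)
N_REPEAT = [1, 2, 3]        # frequency variation, n = 1-3 (from the slide)
R3 = ["Me", "pyridyl"]      # hypothetical examples of "alkyl or heterocycle"

# One dictionary per enumerated library member.
library = [{"R1": r1, "n": n, "R3": r3}
           for r1, n, r3 in product(R1, N_REPEAT, R3)]
print(len(library), library[0])
```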

  46. IN SILICO design of new compounds

  47. « In silico » design of new compounds: acquisition of data; acquisition of knowledge; exploitation of knowledge

  48. ISIDA workflow. The combinatorial module generates virtual libraries based on the Markush structures. [Figure: seven numbered steps connecting the Markush structure, the ISIDA combinatorial module (1000 molecules/second), the database, filtering, similarity search, QSAR models with applicability domains, QSAR models for assessment of properties, hits selection, and synthesis and experimental tests]

  49. COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of UO$_2^{2+}$ by monoamides (R = H, alkyl). Distribution coefficient: $D = \dfrac{[U]_{\text{organic phase}}}{[U]_{\text{aqueous phase}}}$. A. Varnek, D. Fourches, V. Solov’ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N°4

  50. SOLVENT EXTRACTION OF METALS [Figure: species labelled M2+, An-, M1+ and L]
