
QSAR Modeling Development and Validation: An Overview

Learn about the development, validation, and application of Quantitative Structure-Activity Relationships (QSAR) models. Understand the importance of dataset curation, cleaning, and normalization for accurate modeling. Explore the process of preparing training/test sets, selecting descriptors, and applying machine learning methods for QSAR modeling.


Presentation Transcript


  1. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships SAR/QSAR/QSPR modeling Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE

  2. SAR/QSAR/QSPR models Development Validation Application

  3. Classification and Regression models Development Validation Application

  4. Development of the models • Selection and curation of experimental data • Preparation of training and test sets (optionally) • Selection of an initial set of descriptors and their normalisation • Variable selection (optionally) • Selection of a machine-learning method Validation of models • Training/test set • Cross-validation: internal, external Application of the models • Models’ Applicability Domain

  5. Development of the models • Experimental data: selection and cleaning • Descriptors • Mathematical techniques • Statistical criteria

  6. Data selection: Congenericity problem • The congenericity principle is the assumption that « similar compounds give similar responses ». This was the basic requirement of QSAR and concerns structurally homogeneous data sets. • Nowadays, experimentalists mostly produce structurally diverse (non-congeneric) data sets.

  7. Data cleaning: • Similar experimental conditions • Duplicates • Structure standardization • Removal of mixtures • …

  8. The importance of Chemical Data Curation Dataset curation is crucial for any cheminformatics analysis (QSAR modeling, clustering, similarity search, etc.). Currently, research papers rarely describe the curation procedures used, and procedures are implemented or employed differently in different groups. We wish to emphasize the need to create and popularize a standardized curation strategy, applicable to any ensemble of compounds.

  9. What about these structures? (real examples)

  10. Why are duplicates unsafe for QSAR? Duplicates are identical compounds present in a given dataset (e.g. entries ID = 256, ID = 879 and ID = 2346). Manual identification of duplicates is practically impossible, especially when the dataset is large. Activity analysis of duplicates is also highly important, to identify cases where one occurrence is labeled ‘active’ and another ‘weakly active’ or ‘inactive’.
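
The duplicate check above can be sketched in plain Python. This is a minimal sketch: the IDs echo the slide, but the SMILES strings and activity labels are illustrative, and a real pipeline would first canonicalize structures with a cheminformatics toolkit so that identical compounds share one key.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group records by a structure key (a standardized SMILES string,
    assumed already canonicalized) and report groups whose activity
    labels disagree."""
    groups = defaultdict(list)
    for rec_id, smiles, activity in records:
        groups[smiles].append((rec_id, activity))
    return {s: m for s, m in groups.items()
            if len(m) > 1 and len({a for _, a in m}) > 1}

# IDs from the slide; the structures and labels are illustrative.
records = [
    (256, "c1ccccc1O", "ACTIVE"),
    (879, "c1ccccc1O", "INACTIVE"),
    (2346, "CCO", "ACTIVE"),
]
conflicts = find_duplicates(records)
```

Conflicting groups like the one flagged here need a curation decision (keep one record, average, or discard) before modeling.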

  11. Structural standardization For a given dataset, chemical groups have to be written in a standardized way, taking into account critical properties (like pH) of the modeled system. Aromatic compounds These two different representations of the same compound will lead to different descriptors, especially with certain fingerprint or fragmental approaches. Carboxylic acids, nitro groups etc. For a given dataset, these functional groups have to be written in a consistent way to avoid different descriptor values for the same chemical group.

  12. Normalization of carboxylic, nitro groups, etc.

  13. removal of inorganics All inorganic compounds must be removed, since our QSAR modeling strategy includes the calculation of molecular descriptors for organic compounds only. This is an obvious limitation of the approach; however, the total fraction of inorganics in most available datasets is relatively small. To detect inorganics, several solutions are available: • Automatic identification, combining JChem (ChemAxon, the cxcalc program) to output the empirical formula of all compounds with simple scripts to remove compounds containing no carbon; • Manual inspection of compounds possessing no carbon atom, using Notepad++ tools.
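
The “simple script” step mentioned above can be sketched as follows; the compound names and empirical formulas are illustrative, and in practice the formulas would come from the cxcalc output.

```python
import re

def has_carbon(formula):
    """True if an empirical formula contains carbon: match a 'C' that is
    not the first letter of a two-letter element symbol (Cl, Ca, Cu, ...)."""
    return re.search(r"C(?![a-z])", formula) is not None

# Hypothetical cxcalc-style output: compound name -> empirical formula.
formulas = {"aspirin": "C9H8O4", "salt": "NaCl", "silica": "SiO2"}
organics = {name for name, f in formulas.items() if has_carbon(f)}
```

The negative lookahead matters: a naive `"C" in formula` test would wrongly keep NaCl.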

  14. removal of mixtures Fragments can be removed according to the number of constituent atoms or the molecular weight.
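
One simple fragment filter, sketched under the assumption that mixtures appear as dot-disconnected SMILES; string length is used here as a crude stand-in for atom count or molecular weight, which a real pipeline would compute with a toolkit.

```python
def strip_small_fragments(smiles):
    """Keep only the largest fragment of a dot-disconnected SMILES.
    Fragment size is approximated by string length; a real pipeline
    would compare atom counts or molecular weights instead."""
    return max(smiles.split("."), key=len)

# Aspirin with an HCl counter-ion (illustrative record).
parent = strip_small_fragments("CC(=O)Oc1ccccc1C(=O)O.Cl")
```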

  15. removal of mixtures However, some cases are particularly difficult to treat. Examples from the DILI - BIOWISDOM dataset: for ID = 172, the two compounds eliminated from the initial form by the ChemAxon cleaning could be active; for ID = 1700, the cleaned form is correct. Manual inspection/validation is still crucial.

  16. removal of salts The options Remove Fragments, Neutralize and Transform of the ChemAxon Standardizer have to be used simultaneously for best results.

  17. Aromatization and 2D cleaning ChemAxon Standardizer offers two ways to aromatize benzene rings, both of them based on Hückel’s rules: “general style” and “basic style”. Most descriptor calculation packages recognize the “basic style” only. http://www.chemaxon.com/jchem/marvin/help/sci/aromatization-doc.html

  18. Preparation of training and test sets The initial data set is split into a training set and a test set (typically 10 – 15 % of the compounds). Structure - property models are built on the training set, the best models are selected according to statistical criteria, and “prediction” calculations on the test set use the best models.
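
The random split above can be sketched as follows; the 15 % test fraction follows the slide, and the seed is an arbitrary choice for reproducibility.

```python
import random

def split_dataset(items, test_fraction=0.15, seed=42):
    """Randomly hold out ~10-15 % of the data as an external test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(100))
```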

  19. Recommendations to prepare a test set • (i) experimental methods for determination of activities in the training and test sets should be similar; • (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%; • (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data. References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215

  20. Descriptors • Variable selection • Normalization The pattern matrix has one row per molecule and one column per descriptor.

  21. Selection of descriptors for QSAR model QSAR models should be reduced to a set of descriptors which is as information-rich but as small as possible. • Objective selection (independent variables only) • Statistical criteria of correlations • Pairwise selection (forward or backward stepwise selection) • Principal Component Analysis • Partial Least Squares analysis • Genetic Algorithm • … • Subjective selection • Descriptor selection based on mechanistic studies

  22. Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs) 1. identify a subset of columns (variables) with significant correlation to the response; 2. remove columns (variables) with zero (small) variance; 3. remove columns (variables) with no unique information; 4. identify a subset of variables on which to construct a model; 5. address the problem of chance correlation. D. C. Whitley, M. G. Ford, D. J. Livingstone J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
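
Steps 2 and 3 of this strategy (removing near-constant columns, then columns carrying no unique information) can be sketched in plain Python; the variance and correlation thresholds, and the demo pattern matrix, are illustrative.

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_descriptors(columns, var_tol=1e-8, corr_tol=0.95):
    """Drop near-constant columns (step 2), then drop columns nearly
    collinear with an already-kept column (step 3)."""
    kept = {}
    for name, col in columns.items():
        if variance(col) <= var_tol:
            continue
        if any(abs(pearson(col, other)) >= corr_tol for other in kept.values()):
            continue
        kept[name] = col
    return kept

# Illustrative pattern matrix: 'b' duplicates 'a', 'c' is constant.
cols = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 5, 5, 5], "d": [1, 0, 1, 0]}
kept = filter_descriptors(cols)
```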

  23. Descriptors normalisation Normalisation 1 (Unit Variance scaling): x′ = (x − x̄) / s. Normalisation 2 (Mean Centring scaling): x′ = x − x̄. The pattern matrix has one row per molecule and one column per descriptor.
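
Both scalings, assuming the usual definitions (subtract the column mean; for unit variance, also divide by the column standard deviation), can be sketched as:

```python
def mean_center(col):
    """Normalisation 2 (mean centring): x' = x - mean(x)."""
    m = sum(col) / len(col)
    return [x - m for x in col]

def unit_variance(col):
    """Normalisation 1 (unit-variance scaling / autoscaling):
    x' = (x - mean) / sd, giving mean 0 and variance 1."""
    centered = mean_center(col)
    sd = (sum(c * c for c in centered) / len(centered)) ** 0.5
    return [c / sd for c in centered]

mc = mean_center([1.0, 2.0, 3.0])
uv = unit_variance([1.0, 2.0, 3.0, 4.0])
```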

  24. Data Normalisation (comparison of the initial descriptor distributions with Normalisation 1 and Normalisation 2)

  25. Machine-Learning Methods

  26. Fitting models’ parameters Y = F(ai , Xi ), where the Xi are descriptors (independent variables) and the ai are fitted parameters. The goal is to minimize the Residual Sum of Squares (RSS).

  27. Multiple Linear Regression The simplest case has one descriptor: Yi = a0 + a1 Xi1.

  28. Multiple Linear Regression y = ax + b; the parameters a and b are fitted by minimizing the Residual Sum of Squares (RSS).

  29. Multiple Linear Regression Yi = a0 + a1 Xi1 + a2 Xi2 +…+ am Xim
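
For the one-descriptor case (y = a0 + a1 x) the RSS minimum has a closed form; a minimal sketch with illustrative data:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a0 + a1*x: the closed-form solution
    that minimizes the residual sum of squares (RSS)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - a1 * mx, a1

a0, a1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lie on y = 1 + 2x
```

The multi-descriptor case is solved the same way in matrix form (normal equations), which numerical libraries handle directly.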

  30. kNN (k Nearest Neighbors) The activity Y of a compound is assessed by calculating a weighted mean of the activities Yi of its k nearest neighbors of the training set in the chemical (descriptor) space. A. Tropsha, A. Golbraikh, 2003
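
A weighted-mean kNN prediction can be sketched as follows; inverse-distance weighting is one common choice, not necessarily the scheme used in the cited work, and the training points are illustrative.

```python
def knn_predict(train, query, k=3):
    """Predict activity as the inverse-distance-weighted mean of the
    activities of the k nearest training compounds."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    weights = [1.0 / (dist(x, query) + 1e-9) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Two illustrative descriptors per compound, paired with an activity.
train = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0),
         ((0.0, 1.0), 2.0), ((5.0, 5.0), 10.0)]
pred = knn_predict(train, (0.0, 0.0))
```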

  31. Biological and Artificial Neuron

  32. Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer – to properties being predicted, neurons in the hidden layer – to nonlinear latent variables

  33. SVM: Support Vector Machine Support Vector Classification (SVC)

  34. SVM: Margins The margin is the minimal distance of any training point to the separating hyperplane Margin

  35. Support Vector Regression ε-Insensitive Loss Function Only the points outside the ε-tube are penalized in a linear fashion
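
The ε-insensitive loss itself is a one-liner; ε = 0.1 here is an illustrative tube width.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """epsilon-insensitive loss: zero inside the eps-tube around the
    target, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)
```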

  36. Kernel Trick Any non-linear problem (classification, regression) in the original low-dimensional input space can be converted into a linear one by a non-linear mapping Φ into a feature space of higher dimension.

  37. QSAR/QSPR models Development Validation Application

  38. Preparation of training and test sets The initial data set is split into a training set and a test set (typically 10 – 15 % of the compounds). Structure - property models are built on the training set, the best models are selected according to statistical criteria, and “prediction” calculations on the test set use the best models.

  39. Validation Estimation of the model’s predictive performance. 5-Fold Cross-Validation: the dataset is split into five folds (Fold 1 … Fold 5); all compounds of the dataset are predicted.
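
Assigning compounds to five folds so that each one is predicted exactly once can be sketched as (the dataset size and seed are illustrative):

```python
import random

def five_fold_indices(n, seed=0):
    """Shuffle compound indices and deal them into 5 folds, so every
    compound appears in exactly one fold and is predicted once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

folds = five_fold_indices(23)
```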

  40. Leave-One-Out Cross-Validation (N-fold internal cross-validation) • Cross-validation is performed AFTER variable selection on the entire dataset. • In each fold, the “test” set contains only 1 molecule.

  41. Statistical parameters for Regression

  42. Fitting vs validation Stabilities (logK) of Sr2+ L complexes in water; LogKcalc / LogKpred plotted against LogKexp. Fit: R2 = 0.886, RMSE = 0.97 (all molecules were used for the model preparation). LOO: R2 = 0.826, RMSE = 1.20 (each molecule was “predicted” in internal CV). 5-CV: R2 = 0.682, RMSE = 1.62 (each molecule was predicted in external CV).

  43. Regression Error Characteristic (REC) REC curves are widely used to compare the performance of different models. The gray line corresponds to the average-value model (AM). For a given model, the area between the AM curve and the corresponding calculated curve reflects its quality.

  44. Statistical parameters for Classification Confusion Matrix

  45. Classification Evaluation sensitivity = true positive rate (TPR) = hit rate = recall TPR = TP / P = TP / (TP + FN) false positive rate (FPR) FPR = FP / N = FP / (FP + TN) specificity (SPC) = True Negative Rate SPC = TN / N = TN / (FP + TN) = 1 − FPR positive predictive value (PPV) = precision PPV = TP / (TP + FP) negative predictive value (NPV) NPV = TN / (TN + FN) accuracy (ACC) ACC = (TP + TN) / (P + N) balanced accuracy (BAC) BAC = (sensitivity + specificity) / 2 = (TP / (TP + FN) + TN / (FP + TN)) / 2
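
These definitions can be collected into one small function; the confusion-matrix counts below are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics above from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)           # sensitivity / recall
    spc = tn / (tn + fp)           # specificity
    return {
        "sensitivity": tpr,
        "specificity": spc,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (tpr + spc) / 2,
    }

m = classification_metrics(tp=40, fp=10, tn=30, fn=20)
```

Balanced accuracy averages sensitivity and specificity, so it stays informative when active and inactive classes are imbalanced.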

  46. Receiver Operating Characteristic (ROC) Plot of the sensitivity vs (1 − specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting the fraction of true positives (TPR = true positive rate) vs the fraction of false positives (FPR = false positive rate). Ideally, the Area Under Curve (AUC) approaches 1.
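
The AUC can be computed directly from its probabilistic interpretation (a randomly chosen positive outscores a randomly chosen negative, ties counting one half); a minimal sketch with illustrative scores:

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via its probabilistic meaning: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.8], [0.1, 0.2])  # perfect separation
```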

  47. ROC (Receiver Operating Characteristics) Example ROC plots (TP% vs FP%) for ranked compounds: an ideal model reaches AUC = 1.00, a typical model AUC = 0.84, and a useless (random) model AUC = 0.50.

  48. When is a model accepted? Regression models: determination coefficient R2 > R02 (here, R02 = 0.5). Classification models: BA > 1/q for q classes (e.g. 3 classes).

  49. The “chance correlation” problem (figure: trend over the years 1965 – 1980)
