
Model Development and Validation in Chemometrics


Presentation Transcript


  1. Model Development and Validation in Chemometrics Bahram Hemmateenejad Chemistry Department, Shiraz University Shiraz, Iran E-mail: hemmatb@sums.ac.ir

  2. Relationships between variables: Regression/Correlation? • Correlation problem • We have a collection of measures • All are of interest in their own right • We wish to see how, and how strongly, they are related • Regression problem • We have a collection of measures • One measure is of special interest • We wish to explore its relationship with the others

  3. Mathematical Model • Y = f(X) • Y: dependent variables • X: independent variables • One Y, one X • One Y, many X • Many Y, many X • Hard modeling (fitting data to the model) • Soft modeling (fitting model to the data)

  4. Hard Modeling • A pre-defined model is available • y = b0 + b1x1 + b2x2 + … • y = b0 + b1x + b2x² + … • y = b0·10^(b1x) + b2x² + … • Our task • Getting data (by our own experiments, or reported data from previous studies) • Fitting the data to the model and calculating the model constants (or coefficients)
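As an illustration of hard modeling, here is a minimal sketch (not from the original slides) that fits synthetic data to the quadratic model above and recovers the coefficients; the data, seed and coefficient values are invented for the example, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.optimize import curve_fit

# Pre-defined (hard) model: y = b0 + b1*x + b2*x^2
def hard_model(x, b0, b1, b2):
    return b0 + b1 * x + b2 * x ** 2

# Synthetic "experimental" data, invented for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + 0.1 * x ** 2 + rng.normal(0, 0.2, x.size)

# Fitting the data to the model: only the coefficients are unknown
coeffs, cov = curve_fit(hard_model, x, y)
print("b0, b1, b2 =", coeffs)
```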

  5. Hard Modeling • Advantages • The procedure is very simple • Both the dependent and independent variables are known • Only the coefficients are unknown • No feature selection is needed • Disadvantages • It requires a deep insight into the chemical system • It is restricted to some simple chemical phenomena

  6. Hard Modeling • Beer–Lambert law • One-component systems • A = A0 + εbc • Multi-component systems • A = A0 + Σi εi bi ci • A = A0 + Σi εi bi ci + Ax • Ax: non-additive absorbance problems and complicated matrix effects that cannot be described by a simple mathematical model

  7. Soft Modeling • No prior information about the chemical model is available • We know some chemical facts about the system • Data are taken, and then different reasonable models are tried against the data • Many models may fit. Which is better? The one with: • Deeper grounding in the chemical facts • Better prediction • Lower modeling error

  8. Soft Modeling • Descriptive model • Describes the chemistry of the system • Choosing useful independent variables • Chemically meaningful variables • The smallest number of independent variables • Very high statistical quality is not required • They must be evaluated for correct modeling • Be careful about homogeneity and heterogeneity of the data

  9. Soft Modeling • Predictive model • The ultimate goal is predicting y for future samples • Use as many predictor variables as possible • Feature selection becomes important • Chemical meaning is not essential for predictors • Very high statistical quality is required • Model validation is an essential part of modeling • Predictive-descriptive model • A high-quality chemical model

  10. Modeling Purposes • Development of new algorithms and methods • A new modeling method, a new scoring function, a new validation procedure, … • Simulated data or previously reported data • Comparison with existing methods • Validation of the results • Application of models to new chemical systems • The chemical system is novel • Be familiar with the system under study, or read carefully about it • Examine the results for accuracy

  11. Modeling Purposes • Comparative studies • Comparing existing algorithms on an individual chemical system • Comparing various types of independent variables for a chemical system • Applying an individual modeling method to different chemical systems

  12. Steps in Chemical Modeling • Select the modeling purpose • Study the chemical or mathematical system carefully • Select the kind of model (predictive or descriptive?) • Data preparation • Plot the data • Data splitting (calibration, validation, prediction) • Model development (MLR, PCR, PLS, ANN) • Calculate the model coefficients • Validate its performance • Final model validation

  13. Data Splitting • At least two sets of data are necessary • One for the model development step • One for the final model validation step • In many cases two sets are also used within the model development step • A calibration set to calculate the model coefficients • A validation set to test the accuracy of the calculated constants • Calibration-validation • Calibration-validation-prediction

  14. Data Splitting • Selection of appropriate training and test sets is critically important in model building • All data sets must span the same space with regard to • Diversity in the dependent variable • Diversity in the independent variables • Diversity in both dependent and independent variables • The training set should contain about two thirds of the total data
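A minimal random-splitting sketch, assuming scikit-learn and a synthetic data set with the same shape as the numerical example later in the deck (40 samples, 7 variables); the 26/8/6 sizes follow the two-thirds rule used there.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 7))   # 40 samples, 7 independent variables
y = rng.normal(size=40)

# Two thirds for calibration; split the rest into validation and prediction
X_cal, X_rest, y_cal, y_rest = train_test_split(X, y, train_size=2 / 3,
                                                random_state=0)
X_val, X_pred, y_val, y_pred_set = train_test_split(X_rest, y_rest,
                                                    test_size=6 / 14,
                                                    random_state=0)
print(len(y_cal), len(y_val), len(y_pred_set))  # 26 8 6
```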

  15. Splitting methods J.T. Leonard, K. Roy, On selection of training and test sets for the development of predictive QSAR models. QSAR & Combinatorial Science, 2006, in press.

  16. Splitting methods • Random splitting • It is not a good choice • A homogeneous subset drawn from one side of the total data may end up as the test set • The final model performance will then depend heavily on the training/test partition • Ranking the data by the value of the dependent variable (y) • It may be a good choice • Diversity in the dependent variable is ensured • Structural similarity is not considered • There is a risk that the training set contains chemical structures different from those in the validation/prediction data
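A ranking-based split along the lines described above can be sketched as follows; ysorted_split and its every parameter are hypothetical names introduced for this illustration.

```python
import numpy as np

def ysorted_split(X, y, every=3):
    """Rank the samples by y; every `every`-th ranked sample goes to the
    test set, so both sets span the full range of the dependent variable."""
    order = np.argsort(y)
    test_idx = order[::every]
    train_idx = np.setdiff1d(order, test_idx)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```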

  17. Splitting methods

  18. Splitting methods • Selection on the basis of the independent-variable space • Multivariate design • Principal component analysis • Clustering methods
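The slide lists families of X-space methods without naming an algorithm; one widely used choice in chemometrics is Kennard-Stone selection, sketched here under that assumption.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone selection: pick a training set that spans the
    independent-variable space by maximin Euclidean distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two most distant samples
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_train:
        # Add the sample farthest from its nearest already-selected neighbour
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        nxt = remaining[int(np.argmax(d_min))]
        selected.append(nxt)
        remaining.remove(nxt)
    return np.array(selected)
```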

  19. Model Development • Preliminary considerations • Simple models are preferred • Linear or nonlinear modeling • Produce a linear model if possible • First examine MLR, then PCA-based methods • MLR is more predictive • Choose ANN as the final trial • Variable collinearity • Feature selection/feature extraction

  20. Model Development • Collinear variables • Degree of collinearity • R² > 0.95, 0.9 or 0.85 • Criteria for deciding which variable to keep: • Correlation with y • Chemical relevance • Correlation with the other variables • Noise content • Cost of computation • Calculation accuracy
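A simple screening sketch that flags collinear predictor pairs against the R² thresholds above; collinear_pairs is a hypothetical helper, not something from the slides.

```python
import numpy as np

def collinear_pairs(X, r2_threshold=0.95):
    """Flag pairs of predictor columns whose squared pairwise correlation
    exceeds the threshold (candidates for removal)."""
    r2 = np.corrcoef(X, rowvar=False) ** 2
    n = r2.shape[0]
    return [(i, j, r2[i, j]) for i in range(n) for j in range(i + 1, n)
            if r2[i, j] > r2_threshold]
```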

  21. Model development • Select the regression method • MLR • PCR • PLS • ANN • Select the features (variables) • Stepwise selection • Genetic algorithms • Chance correlation • Support vector machines • Ant colony optimization
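A hedged comparison of three of the listed regression methods (MLR, PCR, PLS) on synthetic data, assuming scikit-learn; the component counts are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 7))
y = X @ rng.normal(size=7) + rng.normal(0, 0.5, 40)

candidates = {
    "MLR": LinearRegression(),
    "PCR": make_pipeline(PCA(n_components=4), LinearRegression()),
    "PLS": PLSRegression(n_components=4),
}
for name, model in candidates.items():
    q2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R2 = {q2:.3f}")
```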

  22. Model development • Calibrate the model (calculate the model coefficients from the training data) • Evaluate the resulting model • Internal validation • Cross-validation • External validation • Calculate the goodness of fit • Standard error (SE) • Correlation coefficient (R²) • Cross-validated correlation coefficient (Q²) • Root mean square error (RMSE) • PRESS • Variance ratio (F-value)

  23. Under/Over-fitting D.M. Hawkins, The Problem of Overfitting, J. Chem. Inf. Comput. Sci. 2004, 44, 1-12. • Under-fitting • Includes fewer terms than are necessary • Uses less complicated approaches than are necessary • Over-fitting • Includes more terms than are necessary • Uses more complicated approaches than are necessary

  24. Under/Over fitting

  25. Under/Over-fitting • Under-fitting: model performance is low • Low calibration statistics • Low generalization • Low predictivity • Over-fitting • Unstable model • Inaccurate coefficients • High calibration statistics • Low prediction statistics

  26. Overfitting • Two types of overfitting • Using a model that is more flexible than it needs to be • Using a model that includes irrelevant components • Why overfitting is undesirable • Worse decisions • Worse predictions • Wasted time • Results that others cannot reproduce

  27. Assessing model fit • The use of calibration statistics alone generally leads to overfitting • Cross-validation tests on the calibration data • Use of a separate validation set

  28. Better predictive model? The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. Tropsha et al., QSAR Comb. Sci. 2003, 22, 69. The better predictive model: high q² for the training set or low root mean square error of prediction for the test set? Aptula et al., QSAR Comb. Sci. 2005, 24, 385. Assessing model fit by cross-validation. Hawkins et al., J. Chem. Inf. Comput. Sci. 2003, 43, 579. Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). Mevik and Cederkvist, Journal of Chemometrics, 2004, 18, 422-429.

  29. Cross-Validation (CV) • Why CV? • Model stability • Model predictivity • Degree of over-fitting • CV methods • Leave-one-out (LOO-CV) • Leave-many-out (LMO-CV) • k-fold CV
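A comparison sketch of the listed CV schemes with scikit-learn, where leave-many-out is approximated by repeated random 20%-out splits and k = 5 is an assumed fold count.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (LeaveOneOut, ShuffleSplit, KFold,
                                     cross_val_score)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 7))
y = X @ rng.normal(size=7) + rng.normal(0, 0.5, 40)
model = LinearRegression()

schemes = {
    "LOO-CV": LeaveOneOut(),
    "LMO-CV (20% out, 25 repeats)": ShuffleSplit(n_splits=25, test_size=0.2,
                                                 random_state=0),
    "5-fold CV": KFold(n_splits=5, shuffle=True, random_state=0),
}
for name, cv in schemes.items():
    mse = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: RMSECV = {np.sqrt(mse):.3f}")
```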

  30. Final Model Validation • Separate prediction set • Cross-validation • Bootstrapping • Y-randomization (Chance correlation)

  31. Cross-validation or separate test set? • It is a challenging problem • However, use of a final prediction set is essential • In the model development step • The choice depends heavily on the sample size • Always perform cross-validation • If the data size allows, use a separate validation set as well • Never use a validation set of very small size (e.g., 3 or 4 samples)

  32. Bootstrap Re-sampling • Another approach related to cross-validation • The basic premise is that each data set should be representative of the population from which it was drawn • K groups of size n are generated by repeated random selection of n objects • Some objects can be included in many groups • Others may never be selected • The model obtained on the n randomly selected objects is used to predict the target properties of the remaining objects
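A minimal bootstrap-validation sketch: refit on each bootstrap sample and score on the objects never drawn in that round (the out-of-bag objects); bootstrap_validate is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap_validate(X, y, n_boot=100, seed=0):
    """Refit on each bootstrap sample and return the test RMSE obtained
    on the out-of-bag objects of each round."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rmses = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, n)             # n objects, with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # objects never selected
        if oob.size == 0:
            continue
        model = LinearRegression().fit(X[boot], y[boot])
        resid = y[oob] - model.predict(X[oob])
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return np.array(rmses)
```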

  33. Y-randomization (Y-scrambling, chance correlation) • Some models may be obtained by chance • Especially when the number of samples is small or the model has a high number of constants (coefficients) • Y-randomization is a widely used technique to check the robustness of a model • The dependent vector is randomly shuffled and a new model is developed using the original predictor variables • The resulting models must have low statistical quality for both calibration and prediction samples
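A Y-randomization sketch: shuffle y, refit, and compare the chance R² values against the real model's R²; y_randomization is a hypothetical name, and MLR stands in for whichever model is being checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def y_randomization(X, y, n_shuffles=100, seed=0):
    """Refit the model on randomly shuffled y; the chance R² values
    should be much lower than the real model's R²."""
    rng = np.random.default_rng(seed)
    real_r2 = LinearRegression().fit(X, y).score(X, y)
    chance_r2 = []
    for _ in range(n_shuffles):
        y_perm = rng.permutation(y)
        chance_r2.append(LinearRegression().fit(X, y_perm).score(X, y_perm))
    return real_r2, np.array(chance_r2)
```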

  34. Goodness of fit (Scoring Function) • The most frequently quoted quantity, but the least significant one (R or R²) • Total sum of squares: SST = Σ(yi − ȳ)² • Residual sum of squares: SSR = Σ(yi − ŷi)² • Regression (model) sum of squares: SSM = SST − SSR • R² = SSM/SST = 1 − (SSR/SST)

  35. Goodness of fit (Scoring Function) • Some caveats in using R² • Homogeneity or diversity of the data • High sample diversity gives a high SST, and therefore a high R², even if the model is not actually predictive • High data homogeneity gives a low SST, and therefore a low R², even if the model is actually predictive • Adding a random variable will increase SSM and therefore increase R² • Relying on R² alone leads to over-fitted models

  36. Goodness of fit (Scoring Function) • Cross-validated correlation coefficient (q² or Q²) • Correlation coefficient for prediction samples (R²p) • Root mean square errors (RMSE) for calibration, prediction and cross-validation • RMSE = standard deviation of the residuals (y − ŷ) • Prediction residual error sum of squares (PRESS) for calibration, prediction and cross-validation • PRESS = Σ(y − ŷ)² • Relative error of prediction (REP) • REP = Σ[(y − ŷ)/y] × 100
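These scoring functions follow directly from the observed and predicted values; a sketch, where the REP line uses a root-mean-square relative error in percent as one common convention for the slide's formula.

```python
import numpy as np

def scoring_functions(y, y_hat):
    """RMSE, PRESS and REP from observed (y) and predicted (y_hat) values."""
    resid = y - y_hat
    press = np.sum(resid ** 2)                      # sum of squared deviations
    rmse = np.sqrt(press / len(y))                  # std. deviation of residuals
    rep = 100 * np.sqrt(np.mean((resid / y) ** 2))  # relative error, percent
    return rmse, press, rep
```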

  37. Goodness of fit (Scoring Function) • Differences among R², RMSE and PRESS • These quantities are closely correlated with one another • R² measures the percentage of the total variance in the original data that is described by the selected model • RMSE describes the reproducibility of the model in predicting y for different samples • PRESS and REP measure total model accuracy

  38. Important notes • Data splitting • Random splitting • Diversity in the y-variable • Diversity in the X-variables • Diversity in both y and X • Model development • Calibrate the model with the training set • Validate the model either by cross-validation or with a separate test set • Final model validation • Separate validation set • Cross-validation • Bootstrapping • Y-randomization

  39. Numerical Example • 40 samples • 7 independent variables • 1 dependent variable • Finding a linear relationship between y and X

  40. Data matrix

  41. Correlation matrix

  42. Stepwise regression

  43. Data splitting • Calibration, prediction • Calibration, validation, prediction • Calibration: two thirds of the total data = 26 samples • Remaining: 14 samples • What is the decision? • Select a separate test set in model development

  44. Data splitting • Validation: 8 samples • Final prediction: 6 samples • How to split the data? • Random? • Y-sorting • PCA on X or on [X y]

  45. Random splitting
