CHEMOMETRIC METHODOLOGIES FOR THE MODELLING OF HETEROGENEOUS CHEMICALS TOXICITY: DATASET REPRESENTATIVITY AS THE ABSOLUTE ESSENTIAL
Paola Gramatica1, Viviana Consonni2, Manuela Pavan2, Pamela Pilutti1 and Ester Papa1
1QSAR and Environmental Chemistry Research Unit - INSUBRIA University (Varese - ITALY)
2Milano Chemometrics & QSAR Research Group - Milano Bicocca University (Milano - ITALY)
e-mail: firstname.lastname@example.org; web: http://dipbsf.uninsubria.it/qsar/
The BEAM EU research project focuses on the risk assessment of mixture toxicity. A data set of 124 heterogeneous chemicals of high concern as environmental pollutants has been studied for toxicity on Scenedesmus vacuolatus. Several chemometric techniques were applied to the experimental toxicity data with the aim of developing a “universal” QSAR able to describe and predict the toxicity of structurally heterogeneous and dissimilarly acting chemicals. The chemical structure of the compounds was described with several types of theoretical molecular descriptors calculated by the software DRAGON. The Genetic Algorithm approach was used as the Variable Subset Selection method applied to OLS regression. In order to verify the predictive capability of the developed QSAR models, a training set was selected by Experimental Design. OLS models were developed on the 76 chemicals selected as the training set for the two parameters “a” (correlated with EC50 values) and “b” (steepness) of the Weibull model. Counter Propagation Artificial Neural Network (CP-ANN) approaches were also used to verify the utility of non-linear techniques. The methodologies used, applied to the overall data set of 124 chemicals, performed unsatisfactorily in validation, demonstrating that a “universal” QSAR model is not possible when chemicals differ significantly in structure and mode of action. This highlights the essential need for data set representativity for the successful application of QSAR. Moreover, QSAR models built on smaller data sets of compounds that are more similar in both structure and mode of action show high predictive performance.
MATERIALS and METHODS
Multiple Linear Regression analysis and Variable Selection were performed with the software MOBY-DIGS, using the Ordinary Least Squares (OLS) regression method and Genetic Algorithm Variable Subset Selection (GA-VSS). In order to verify the predictive capability of the developed QSAR models, a test set was selected by an Experimental Design procedure, using the software DOLPHIN. Regression diagnostic tools, such as residual plots and Williams plots, were used to check the quality of the best models and to define their applicability with regard to the chemical domain. Counter Propagation Artificial Neural Network (CP-ANN) approaches were also used to verify the utility of non-linear techniques. For a stronger evaluation of model applicability for prediction on new chemicals, external validation (verified by Q2ext) of all models is also recommended and was performed here.
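The two validation statistics mentioned above can be sketched as follows. This is a minimal illustration and not the MOBY-DIGS implementation: the formulas are assumed to be the common ones (leave-one-out Q2 from the PRESS, and Q2ext computed against the training-set mean), the function names are hypothetical, and the data are synthetic rather than the BEAM data.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predicted responses for the descriptor matrix X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 (assumed: 1 - PRESS / SS_total)."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # drop compound i
        coef = fit_ols(X[keep], y[keep])       # refit without it
        press += (y[i] - predict_ols(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_ext(X_train, y_train, X_test, y_test):
    """External Q2 (assumed: test residuals vs. deviation from training mean)."""
    coef = fit_ols(X_train, y_train)
    resid = y_test - predict_ols(coef, X_test)
    return 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_train.mean()) ** 2)

# Synthetic descriptors and responses, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

q2 = q2_loo(X_tr, y_tr)
q2e = q2_ext(X_tr, y_tr, X_te, y_te)
print(f"Q2(LOO) = {q2:.3f}, Q2ext = {q2e:.3f}")
```

A model that fits the training set well but gives a low Q2ext on the held-out chemicals would be judged not predictive, which is the situation reported for the overall data set.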
The molecular descriptors were calculated by the software DRAGON. A total of 1500 molecular descriptors of different kinds were used to describe compound chemical diversity. The descriptor typology is:
0D: Constitutional descriptors.
1D: Empirical, Functional groups, Properties, Atom-centred fragments descriptors.
2D: Autocorrelations, Topological, Molecular walk counts, Galvez topological charge indices, BCUT descriptors.
3D: Geometrical, Randic molecular profiles, WHIM, GETAWAY, RDF, 3D-MoRSE, Charge descriptors.
In addition, five quantum-chemical descriptors (HOMO, LUMO and HOMO-LUMO gap energies, heat of formation and ionization potential Ei,v), calculated by MOPAC (PM3 method), and experimental log Kow were always added as molecular descriptors.
The QSAR models have been developed on the EC50 values of 124 chemicals (with defined mode of action, tested experimentally for toxicity on Scenedesmus vacuolatus by the research group of Prof. Grimme, Bremen University, EU project: BEAM EVK1-1999-00012) and on the two parameters “a” and “b” of the Weibull model (the first parameter, “a”, expresses the location of the sigmoidal toxicity curve and is tightly correlated with the EC50 value, while parameter “b” expresses the steepness of the toxicity curve). The chemicals in this data set are currently in common use: antifouling agent, antioxidant, bactericide, chemotherapeutic, disinfectant, fungicide, herbicide, insecticide, tool in physiological research and industrial chemical.
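The roles of the two Weibull parameters can be illustrated with a short sketch. The functional form is not stated here, so the parameterisation below, E(c) = 1 - exp(-exp(a + b log10 c)), is an assumption (a form commonly used for concentration-response curves); the function names and numerical values are hypothetical.

```python
import math

def weibull_effect(conc, a, b):
    """Fractional effect (0 to 1) at concentration conc, assumed Weibull form."""
    return 1.0 - math.exp(-math.exp(a + b * math.log10(conc)))

def ec50_from_weibull(a, b):
    """Concentration giving a 50% effect under the assumed form."""
    # 0.5 = 1 - exp(-exp(a + b*log10 c))  =>  a + b*log10 c = ln(ln 2)
    return 10.0 ** ((math.log(math.log(2.0)) - a) / b)

# Hypothetical parameter values, for illustration only
a, b = 1.0, 2.0
ec50 = ec50_from_weibull(a, b)
print(f"EC50 = {ec50:.4f}, effect at EC50 = {weibull_effect(ec50, a, b):.2f}")
```

Under this form, changing “a” at fixed “b” shifts the curve along the concentration axis (and hence the EC50), while “b” controls how sharply the effect rises around that location.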
QSAR MODELLING ON THE OVERALL DATASET
Ordinary Least Squares regression by Genetic Algorithm Variable Selection (OLS - GA)
Not satisfactory model
Unfortunately, the obtained models were found to be unsatisfactory due to their low predictive capability (even after the elimination of some outliers). A “universal” QSAR model is not possible when the chemicals are significantly different in both structure and mode of action. For this reason, we decided to model the EC50 data for a reduced data set of 101 chemicals, including only the chemicals with the most represented modes of action: amino acid biosynthesis, DNA synthesis and function, lipid biosynthesis, photosynthetic electron transport, steroid biosynthesis and unspecific action.
Regression by Counter Propagation Artificial Neural Networks (CP-ANN)
Not satisfactory model
A CP-ANN approach was applied to the experimental EC50 toxicity values of a selected training set of 70 chemicals in order to develop a QSAR regression model with a non-linear technique. The 13 significant principal components of the molecular descriptors were used as predictive variables. The best model was obtained with a map of 8x8 neurons and 50 learning epochs. The obtained model turned out to be unsatisfactory due to its low predictive power.
QSAR MODELLING ON A MORE REPRESENTATIVE DATASET
OLS model obtained on selected training set
The best models with good predictive power, on the 101 chemicals and on the split training set, are based on the same molecular descriptors: counts of different nitrogen-containing groups (nCONR2-nCONN-nNHRPh), calculated LogKow (KLOGKow), a 3D shape descriptor (PJI3) and a 3D GETAWAY autocorrelation descriptor (HATS3u). The regression line of the externally validated model is reported (outliers for the training and test set chemicals are highlighted).
Regression by Counter Propagation Artificial Neural Networks (CP-ANN)
Satisfactory predictive power
The CP-ANN approach was applied to the experimental EC50 toxicity values of a reduced data set of 101 chemicals, which includes only the chemicals with the most represented modes of action. As predictive variables we used the four descriptors most frequently present in the population of OLS models. The best model was obtained with a map of 8x8 neurons and 100 learning epochs.
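The principle of a counter-propagation network used for regression can be sketched as follows. This is a simplified illustration and not the software used in this work: the learning-rate schedule, the triangular neighbourhood and the initialisation are assumptions, the function names are hypothetical, and the data are synthetic. Only the map size (8x8), the number of epochs (100) and the number of input variables (four) mirror the settings reported above.

```python
import numpy as np

def train_cpann(X, y, side=8, epochs=100, seed=0):
    """Train a minimal counter-propagation network on a side x side map."""
    rng = np.random.default_rng(seed)
    n_neurons = side * side
    # Grid coordinates of each neuron, used by the neighbourhood function
    grid = np.array([(r, c) for r in range(side) for c in range(side)], dtype=float)
    # Kohonen (input) weights start at randomly chosen training samples
    w_in = X[rng.integers(0, len(X), n_neurons)].astype(float)
    # Output weights start at the mean response
    w_out = np.full(n_neurons, y.mean(), dtype=float)
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        rate = 0.5 * (1.0 - frac) + 0.01 * frac            # decaying learning rate
        radius = (side / 2.0) * (1.0 - frac) + 0.5 * frac  # shrinking neighbourhood
        for i in rng.permutation(len(X)):
            # Winner: neuron whose input weights are closest to the sample
            winner = np.argmin(np.sum((w_in - X[i]) ** 2, axis=1))
            dist = np.sqrt(np.sum((grid - grid[winner]) ** 2, axis=1))
            # Triangular neighbourhood: maximal at the winner, zero beyond the radius
            h = rate * np.clip(1.0 - dist / radius, 0.0, 1.0)
            w_in += h[:, None] * (X[i] - w_in)   # move input weights towards the sample
            w_out += h * (y[i] - w_out)          # move output weights towards the response
    return w_in, w_out

def predict_cpann(w_in, w_out, X):
    """Predict by reading the output weight of each sample's winning neuron."""
    d = ((X[:, None, :] - w_in[None, :, :]) ** 2).sum(axis=2)
    return w_out[d.argmin(axis=1)]

# Synthetic stand-in for 101 chemicals described by four variables
rng = np.random.default_rng(1)
X = rng.normal(size=(101, 4))
y = X[:, 0] - 0.5 * X[:, 1]
w_in, w_out = train_cpann(X, y, side=8, epochs=100)
pred = predict_cpann(w_in, w_out, X)
```

Because each prediction is read from the output layer at the winning neuron, the predicted values are piecewise constant over the map, which is one reason external validation remains necessary for this kind of model.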
SUBSETS OF CHEMICALS WITH THE SAME MODE OF ACTION
OLS model on steroid biosynthesis inhibitors (17 chemicals)
OLS model on photosynthetic electron transport inhibitors (49 chemicals)
OLS model on compounds with unspecific mode of action (18 chemicals)
The QSAR models obtained on reduced data sets, selected for representativity and for similarity of mode of action, are all of good quality. Their predictive performance and stability have been verified by internal validation (Q2 and Q2LMO). The chemical domain of applicability of the proposed models for new chemicals must always be verified by the leverage approach, taking into account that some of these models have been developed on relatively small data sets.
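The leverage check for new chemicals can be sketched as follows. This is a minimal illustration under stated assumptions: the leverage is taken as the diagonal of the hat matrix built from the training descriptors plus an intercept, and the warning value h* = 3p'/n (with p' the number of model parameters) is a commonly used convention that is not stated here. Function names and data are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage of each query compound with respect to the training descriptors."""
    A = np.column_stack([np.ones(len(X_train)), X_train])  # intercept + descriptors
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.pinv(A.T @ A)
    # h_i = q_i^T (A^T A)^-1 q_i for every row q_i of Q
    return np.einsum("ij,jk,ik->i", Q, xtx_inv, Q)

def warning_leverage(X_train):
    """Assumed warning value h* = 3 * (descriptors + 1) / n."""
    n, p = X_train.shape
    return 3.0 * (p + 1) / n

# Synthetic example: a small training set with two descriptors
rng = np.random.default_rng(0)
X_train = rng.normal(size=(17, 2))
X_new = np.array([[0.0, 0.0],     # close to the training compounds
                  [10.0, 10.0]])  # far from the training compounds

h_train = leverages(X_train, X_train)
h_new = leverages(X_train, X_new)
h_star = warning_leverage(X_train)
outside_domain = h_new > h_star
print(h_new, h_star, outside_domain)
```

With a small training set the warning value is relatively high, so a prediction for a compound flagged as outside the domain should be treated as an extrapolation.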
All the proposed models are based on different molecular descriptors, mainly theoretical, encoding different features of the chemical structures related to the modelled end-points. The logKow parameter is selected only in the models for unspecific mode of action (probably because it is related to baseline toxicity) and in the global models, thus demonstrating that other molecular descriptors more closely related to the chemical structure are able to describe and predict the toxicity.
Financially supported by the Commission of the European Union (BEAM EVK1-1999-00012)
 Todeschini R., Consonni V. and Pavan M. DRAGON, version 2.1-2002 (WINDOWS/PC); Milano, Italy. Program for the calculation of molecular descriptors from HyperChem, Tripos, MDL file, SYBYLmolfile formats from ChemOffice and Tripos molecular design software. Free download available at: http://www.disat.unimib.it/chm
 Todeschini R. Moby Digs /Evolution, rel 2.0, Talete Milano, Italy.
Todeschini R. and Mauri A. 2000. DOLPHIN - Software for Optimal Distance-Based Experimental Design, rel. 1.1 for Windows, Talete srl, Milano, Italy.
 Tropsha A., Gramatica P. and Gombar V.K. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. Quant. Struct.-Act. Relat. 22.