(Q)SAR and (Q)AAR analysis of ToxCast Dataset Using PASS and GUSAR approaches. Vladimir Poroikov , Dmitry Filimonov, Alexey Zakharov, Alexey Lagunin, Sergey Novikov*.
and GUSAR approaches
Vladimir Poroikov, Dmitry Filimonov, Alexey Zakharov, Alexey Lagunin, Sergey Novikov*
Institute of Biomedical Chemistry of Rus. Acad. Med. Sci.; *A.N. Sysin Institute of Human Ecology and Environmental Health of Rus. Acad. Med. Sci., Moscow, Russia.
Analysis of Chemical Space
The aim of the study:
(1) To estimate the possibility of prediction of ToxCast Phase 1 (TC1) in vivo data on the basis of structural formulae, physical-chemical properties and in vitrodata from TC1 dataset.
(2) To estimate the possibility of prioritization of molecules from the TC1 dataset for the toxicological testing using the integral parameter.
We calculated the integral parameter ToxDose for twenty end-points of carcinogenicity for mouse and rat. ToxDose values are correlated with other carcinogenicity end-points and may be use for prioritization of molecules from the TC1 dataset for toxicological testing. The data for cholinesterase inhibition differs significantly from all other end-points; thus they were excluded from the further analysis.
The data on in vivo and in vitro assays of chemical compounds were used for (quantitative) structure-activity relationships ((Q)SAR) and (quantitative) activity-activity relationships ((Q)AAR) analysis from the ToxCast Phase 1 dataset.
The data from CPDB dataset (CPDB, 25 October 2007) was used as the training set for the carcinogenicity prediction of in vivo ToxCast assays. The data were extracted from EPA Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network . We used 1397 compounds that were tested in the standard two-year rodent carcinogenicity bioassay. Small inorganic compounds (e.g. NO2), oils, paraffins and mixtures of compounds were excluded from the set.
PASS program. (Prediction of Activity Spectra for Substances) is a computer program for evaluation of general biological potential in a molecule on the basis of its structural formulae . MNA ("Multilevel Neighbourhoods of Atoms") descriptors are used for presentation of a compound’s structure. The list of predictable biological activities contains 3750 types (PASS 2009.1 version) including main and side pharmacological effects (antihypertensive, hepatoprotective, anti-inflammatory etc.), mechanisms of action (5-hydroxytryptamine agonist, cyclooxygenase inhibitor, adenosine uptake inhibitor, etc.), specific toxicities (mutagenicity, carcinogenicity, teratogenicity, etc.) and metabolic terms (CYP1A substrate, CYP3A4 inhibitor, CYP2C9 inducer, etc.). The mean accuracy calculated by leave-one-out cross-validation procedure is 95%. PASS predictions for TC1 molecules are presented at the ToxCast web-site, and can be used as parameters characterizing these compounds in biological space.
QNA descriptors. A molecular structure is described as a set of QNA (Quantitative Neighborhoods of Atoms) descriptors . QNA descriptors are based on values of ionization potential (IP) and electron affinity (EA) of each atom in the molecule. QNA descriptors are calculated as following:
Pi = Σk Bi-½(Exp(–½C))ikBk-½,
Qi = Σk Bi-½(Exp(–½C))ikBk-½Ak,
where Ak = ½(IPk + EAk), Bk = IPk – EAk, C is a molecular connectivity matrix.
Thus, each atom of molecule is described by two values, P and Q. Since any molecule has different number of atoms, P and Q are proportional to the number of atoms in molecule, but for regression analysis it is necessary to describe the molecular structure as a vector with the fixed length. Therefore, Chebyshev polynomial’s are used for vector’s presentation of a molecular structure:
where Tn is nth degree of Chebyshev polynomial, P` and Q` are the orthonormalized representation of P and Q values (zero mean values of P` and Q`, unit variance and absence of correlation of P` and Q`). The Tn(P,Q) values are calculated for each atom of a molecule. A whole molecule is presented as an average value of Chebyshev polynomials for all atoms; therefore, the length of the vector is defined by the numbers of Chebyshev polynomials - m. On one hand the large number of Chebyshev polynomials may describe complex structure-activity relationships; on the other hand the large length of the vector that represents the structure may provide overtraining in regression analysis. Therefore, the initial value of m is determined as a half of number of molecules in the training set.
Self-Consistent Regression (SCR). Self-consistent regression can obtain the best QSAR/QSPR model for the training set with a large number of descriptors. SCR is based on least-squares regularized method. The main feature of SCR method is a removal of variables, which are worse for description of an appropriate value .
Integral parameter ToxDose. Using dosage characteristics from all 75 end-points, experimental data for which were obtained in vivo, we calculated the integral parameter - ToxDose for 283 compounds:
We compared the distribution of molecules from ToxCast Phase 1 dataset with compounds from the PASS Training set (10_MF), MDDR 2003, and RoadMap 2008. The compounds from ToxCast Phase 1 dataset contain less non-hydrogen atoms than typical drug-like molecules, and RoadMap dataset. In PASS training set average MW=416 Dalton; The average molecular weight of compounds from ToxCast Phase 1 database is 302 Dalton that smaller than those for drug-like compounds.
Results of PASS Training for Predicting Different Categories of ToxDose
Analysis of Biological Parameters
Biological activities predicted by PASS can be directly compared to TC1 in vivo data only in a few cases (carcinogenicity and cholinesterase inhibitors).
Comparison of TC1 in vivo data for the same species and between the species lead to the following conclusions:
1) Correlation coefficients between the in vivo data for same species varies from 0.75 to 0.95.
2) Correlation coefficients for the same tissues between the species less than 0.10.
Comparison of in vivo data presented as integral parameter (ToxDose) with in vitro data demonstrated that the maximal value of correlation coefficient is 0.26 (ToxDose vs. ToxCast Novascreen data). Thus, no significant correlation between in vivo and in vitro data is found.
PASS prediction of the carcinogenicity
To predict the carcinogenicity with PASS, we used compounds from CPDB as a training set. After the training procedure average accuracy of prediction (LOO CV) equals to 74%. With the trained version of PASS we predicted carcinogenicity for 306 compounds from ToxRefDB. Four compounds that have two components were excluded from the prediction. Accuracy of prediction for ToxRefDB is given below.
Here: Data1 is 301 chemicals from ToxCast in vivo dataset with 100 drug-like chemical compounds; Data2 is 301 chemicals from ToxCast in vivo dataset with 60620 chemical compounds from PASS training set.
The ninety five percent of ToxCast compounds are discriminated from drug-like molecules. For different toxicity grouping PASS accuracy of recognition varies from 75.0% to 59.1%; and the most toxic compounds are predicted better. Thus, PASS prediction could be applied for selection of priorities in testing of the most probable toxic compounds.
NA - data not available; TP - true positive; TN - true negative; FP - false positive; FN - false negative;
Sensitivity – TP/(TP+FN); Specificity – TN/(TN+FP); Accuracy – (TP+TN)/(TP+TN+FP+FN)
Accuracy of prediction varies from 0.57 (CHR_Mouse_LiverTumors and CHR_Rat_TesticularTumors) to 0.85 (CHR_Rat_ThyroidTumors). Sensetivity varies from 0.12 (CHR_Mouse_LungTumors) to 0.63 (CHR_Rat_TesticularTumors).
Specificity varies from 0.57 (CHR_Rat_TesticularTumors) to 0.93 (CHR_Mouse_LungTumors). Rat activities were predicted more accurately than mouse activities.
1) Compounds from TC1 dataset are smaller than typical drug-like structures and molecules from RoadMap set.
2) No significant correlation between the in vivo and in vitro data from TC1 set was observed.
3) Despite the chemical dissimilarity between the TC1 compounds and drug-like molecules, PASS-based prediction of carcinogenicity could be obtain with reasonable accuracy.
4) It is shown that integral parameter characterizing general toxicity ToxDose can be predicted by PASS with reasonable accuracy. Thus, such approach could be recommended for prioritization in chemicals testing.
QSAR Models for Rat’s Cholinesterase Inhibitors
We collected 45 inhibitors of Rat’s Cholinesterase (CHR_Rat_CholinesteraseInhibition) from TC1. The toxicity end-point based on the EC50 (mg/kg) values was used for building of QSAR models, using QNA descriptors and SCR. The eighteen QSAR models were created by QNA/SCR approach. Only four QSAR models have Q2 > 0.50 and R2 > 0.60. These models were additionally validated by leave-10%-out cross validation. Leave-10%-out cross validation procedure was repeated 20 times and average R2 of prediction was calculated. Results of validation are presented below.
Acknowledgements. We gratefully acknowledge Prof. Alex Tropsha for kindly assistance in presentation of the results at the ToxCast Poster Session. The work was supported in part by the FP7 project 200787 (OpenTox) and ISTC project # 3777.
Apologies. I am sorry for not obtaining the US visa in time and, therefore, inability to take part in the ToxCast Workshop on May 14-15, 2009. In case, if you will have any questions/suggestions, please, do not hesitate to contact me: [email protected]; tel: 7 499 246-0920; fax: 7 499 245-0857.
where: D is the LEL value; m is the number of end-points for a particular compound; n is the total number of tests.
Three from four QSAR models have average R2 of prediction more then 0.60. Therefore, the obtained models are robust and predictive.
2) Poroikov V, Filimonov D. 2005. PASS: Prediction of Biological Activity Spectra for Substances. In: Predictive Toxicology (Christoph Helma, eds). LLC, Boca Raton, Taylor & Francis Group, 459-478.
3) Lagunin, A.; Zakharov, A.; Filimonov, D.; Poroikov, V. A new approach to QSAR modelling of acute toxicity. SAR and QSAR in Environmental Research 2007, 18, 285-298.
4) Filimonov, D.; Akimov, D.; Poroikov, V. The Method of Self-Consistent Regression for the Quantitative Analysis of Relationships Between Structure and Properties of Chemicals. Pharm.Chem. J. 2004, 1, 21-24.