A Study on Feature Selection for Toxicity Prediction*

Gongde Guo¹, Daniel Neagu¹ and Mark Cronin²
¹Department of Computing, University of Bradford
²School of Pharmacy and Chemistry, Liverpool John Moores University

Presentation Transcript

  1. A Study on Feature Selection for Toxicity Prediction*
  Gongde Guo¹, Daniel Neagu¹ and Mark Cronin²
  ¹Department of Computing, University of Bradford
  ²School of Pharmacy and Chemistry, Liverpool John Moores University
  *EPSRC Project: PYTHIA – Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant Reference: GR/T02508/01

  2. Outline of Presentation
  • Predictive Toxicology
  • Feature Selection Methods
  • Relief Family: Relief, ReliefF
  • kNNMFS Feature Selection
  • Evaluation Criteria
  • Toxicity Dataset: Phenols
  • Evaluation I: Toxicity
  • Evaluation II: Mechanism of Action
  • Conclusions

  3. Predictive Toxicology
  • The goal of predictive toxicology is to describe the relations between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationships, SAR), and to use these relations to predict the behaviour of new, unknown chemical compounds.
  • Predictive toxicology data mining comprises the steps of data preparation; data reduction (including feature selection); data modelling; prediction (classification, regression); and evaluation of the results, followed by further knowledge discovery tasks. A schematic of this pipeline is sketched below.
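Purely as an illustration of how these steps chain together (scikit-learn is our choice here, not the project's actual tooling):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# data preparation -> data reduction (feature selection) -> modelling/prediction
model = Pipeline([
    ("scale", MinMaxScaler()),                    # data preparation
    ("select", SelectKBest(f_regression, k=20)),  # data reduction / feature selection
    ("regress", LinearRegression()),              # data modelling
])
# model.fit(X_train, y_train)  # prediction and evaluation then follow on held-out data
```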

  4. Feature Selection Methods
  • Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible.
  • Seven feature selection methods (Witten and Frank, 2000) are involved in our study:
  • GR – Gain Ratio feature evaluator;
  • IG – Information Gain ranking filter;
  • Chi – Chi-squared ranking filter;
  • ReliefF – ReliefF feature selection;
  • SVM – SVM feature evaluator;
  • CS – Consistency Subset evaluator;
  • CFS – Correlation-based Feature Selection.
  • In this work, however, we focus on the drawbacks of the ReliefF feature selection method and propose the kNNMFS feature selection method.

  5. Relief Feature Selection Method
  The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same class (the nearest hit) and from the opposite class (the nearest miss). The feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature.
  [Figure: a sampled instance with its nearest hit and nearest miss, k = 1.]
  Open issues: with k = 1 the estimate is sensitive to noise; how should the number of sampled instances m be set, and how should the individual m instances be chosen?

  6. Relief Feature Selection Method
  Algorithm Relief
  Input: for each training instance, a vector of attribute values and the class value
  Output: the vector W of estimations of the qualities of the attributes

  set all weights W[Ai] := 0.0, i = 1, 2, ..., p;
  for j := 1 to m do begin
      randomly select an instance Xj;
      find its nearest hit Hj and nearest miss Mj;
      for k := 1 to p do
          W[Ak] := W[Ak] - diff(Ak, Xj, Hj)/m + diff(Ak, Xj, Mj)/m;
  end;
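A minimal NumPy rendering of this pseudocode, assuming two classes and numeric features scaled to [0, 1] so that diff(A, I1, I2) reduces to |I1[A] − I2[A]| (all names in the sketch are ours, not the published code):

```python
import numpy as np

def relief(X, y, m, rng=None):
    """Relief weights for a two-class dataset X (n x p) with features in [0, 1]."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(m):
        j = rng.integers(n)                    # randomly select an instance Xj
        d = np.abs(X - X[j]).sum(axis=1)       # Manhattan distance to every instance
        d[j] = np.inf                          # exclude Xj itself
        same = (y == y[j])
        hit = int(np.argmin(np.where(same, d, np.inf)))    # nearest hit Hj
        miss = int(np.argmin(np.where(~same, d, np.inf)))  # nearest miss Mj
        # W -= diff(hit)/m, W += diff(miss)/m, done for all features at once
        w += (np.abs(X[j] - X[miss]) - np.abs(X[j] - X[hit])) / m
    return w
```

Higher weights indicate features that separate the classes better; near-zero or negative weights indicate irrelevant features.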

  7. ReliefF Feature Selection Method
  [Figure: a sampled instance with its k = 3 nearest hits and nearest misses; a noisy instance is marked X.]
  ReliefF reduces Relief's sensitivity to noise by averaging over the k nearest hits and misses, but open questions remain: how should k be set, how large should m be, and how should the m instances be chosen?

  8. ReliefF Feature Selection Method
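The body of this slide (the ReliefF pseudocode) is not recoverable from the transcript. As a stand-in, here is a sketch of the standard multi-class ReliefF update, under the same assumptions and naming as the Relief sketch above: k nearest hits, and for every other class C, k nearest misses weighted by the prior P(C)/(1 − P(class(Xj))):

```python
import numpy as np

def relieff(X, y, m, k=3, rng=None):
    """Sketch of standard multi-class ReliefF; X scaled to [0, 1], names are ours."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    w = np.zeros(p)
    for _ in range(m):
        j = rng.integers(n)
        d = np.abs(X - X[j]).sum(axis=1)
        d[j] = np.inf                                # never pick Xj as its own neighbour
        hit_idx = np.where(y == y[j])[0]
        hits = hit_idx[np.argsort(d[hit_idx])[:k]]   # k nearest hits
        w -= np.abs(X[j] - X[hits]).mean(axis=0) / m
        for c in classes:
            if c == y[j]:
                continue
            miss_idx = np.where(y == c)[0]
            misses = miss_idx[np.argsort(d[miss_idx])[:k]]  # k nearest misses of class c
            coef = prior[c] / (1.0 - prior[y[j]])           # class-prior weighting
            w += coef * np.abs(X[j] - X[misses]).mean(axis=0) / m
    return w
```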

  9. kNN Model-based Classification Method (Guo et al., 2003)
  The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. kNNModel generates such a set of optimal representatives by learning inductively from the dataset.

  10. An Example of kNNModel
  Each representative di is stored as a tuple <Cls(di), Sim(di), Num(di), Rep(di)>, whose elements are, respectively: the class label of di; the similarity of di to the furthest instance among the instances covered by its neighbourhood Ni; the number of instances covered by Ni; and a representation of the instance di itself. A sketch of this structure follows.
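The slide's figure is not in the transcript; the sketch below shows one way such representatives could be built and stored. It is our paraphrase of the greedy covering idea of kNNModel (Guo et al., 2003), not the published algorithm, and Sim(di) is stored here as a distance radius rather than a similarity:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Representative:
    cls: object       # Cls(di): class label of di
    sim: float        # Sim(di): here a distance radius to the furthest covered instance
    num: int          # Num(di): number of instances covered by Ni
    rep: np.ndarray   # Rep(di): the instance di itself

def knn_model(X, y):
    """Greedy covering sketch: repeatedly pick the instance whose largest
    pure same-class neighbourhood covers the most not-yet-covered instances."""
    n = X.shape[0]
    covered = np.zeros(n, dtype=bool)
    reps = []
    while not covered.all():
        best = None
        for i in np.where(~covered)[0]:
            d = np.abs(X - X[i]).sum(axis=1)
            order = np.argsort(d)                  # order[0] is i itself (distance 0)
            other = y[order] != y[i]
            stop = int(np.argmax(other)) if other.any() else n
            members = order[:stop]                 # largest neighbourhood of class y[i] only
            gain = int((~covered[members]).sum())  # how many new instances it would cover
            if best is None or gain > best[0]:
                best = (gain, i, members, float(d[members].max()))
        gain, i, members, radius = best
        covered[members] = True
        reps.append(Representative(y[i], radius, len(members), X[i]))
    return reps
```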

  11. kNNMFS: kNN Model-based Feature Selection
  kNNMFS takes the output of kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative of each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. The k of ReliefF is varied in our algorithm: its value depends on the number of instances covered by each nearest representative used in the weight calculation. The M of kNNMFS is the number of representatives output by kNNModel.

  12. kNNMFS Feature Selection Method
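The algorithm on this slide is not in the transcript. The sketch below is our hedged reconstruction of the update described on slide 11, reusing knn_model and Representative from the example above: the M representatives replace ReliefF's m random samples, and each hit/miss comparison is scaled by Num(·) of the representative involved, so the effective k varies with neighbourhood size. Details beyond the slide text are our assumptions:

```python
import numpy as np

def knnmfs(X, y):
    """kNNMFS sketch: representatives from knn_model act as the seeds."""
    reps = knn_model(X, y)                 # seeds = output of kNNModel
    M = len(reps)                          # M = number of representatives
    w = np.zeros(X.shape[1])
    for r in reps:
        others = [s for s in reps if s is not r]
        # nearest representative of the same class (hit) ...
        hits = [s for s in others if s.cls == r.cls]
        if hits:
            h = min(hits, key=lambda s: np.abs(s.rep - r.rep).sum())
            w -= h.num * np.abs(r.rep - h.rep) / M     # weight by Num(h)
        # ... and the nearest representative of every other class (misses)
        for c in {s.cls for s in others if s.cls != r.cls}:
            m_c = min((s for s in others if s.cls == c),
                      key=lambda s: np.abs(s.rep - r.rep).sum())
            w += m_c.num * np.abs(r.rep - m_c.rep) / M
    return w
```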

  13. Toxicity Dataset: Phenols
  The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contained 250 compounds. A total of 173 descriptors were calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label.
  [Scatter plots: CX-EMP20 vs. toxicity and TS_QuadXX vs. toxicity, illustrating descriptors that correlate poorly with the endpoint.]
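A quick way to surface such weakly correlated descriptors (a sketch; the file name and column layout are hypothetical placeholders, not the actual data files):

```python
import pandas as pd

# Hypothetical layout: one row per compound, 173 descriptor columns plus a
# "Toxicity" column; "phenols.csv" is a placeholder name.
df = pd.read_csv("phenols.csv")
corr = df.drop(columns="Toxicity").corrwith(df["Toxicity"]).abs().sort_values()
print(corr.head(10))   # weakest descriptors, e.g. CX-EMP20 or TS_QuadXX
```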

  14. Evaluation Measure for Continuous Class Values Prediction
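The slide's formulas are not recoverable from the transcript. For reference, the standard measures for continuous class values are the mean absolute error, the root mean squared error, and the correlation coefficient between predicted values ŷ and observed values y (our reconstruction, not necessarily the slide's exact choice):

```latex
% Standard error measures for continuous predictions (our reconstruction).
\begin{align*}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert\\[4pt]
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\\[4pt]
r &= \frac{\sum_{i}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}
          {\sqrt{\sum_{i}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}
\end{align*}
```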

  15. Endpoint I: Toxicity
  Table 1. Performance of the linear regression algorithm on different phenols subsets

  16. Endpoint II: Mechanism of Action
  Table 2. Performance of the wkNN algorithm on different phenols subsets

  17. Conclusions and Future Research Directions
  • Using kNNModel as the starting point selects a set of more meaningful representatives to replace the original data for feature selection;
  • kNNMFS uses a more reasonable difference-function calculation, based on the inductive information held in each representative obtained by kNNModel;
  • Better performance is obtained by kNNMFS on the subsets of the phenols dataset with the two different endpoints;
  • Future work: investigating the effectiveness of boundary data or cluster-centre data chosen as seeds for kNNMFS;
  • More comprehensive experiments on benchmark data will be carried out.

  18. References
  • Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
  • Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In: Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)
  • Schultz, T.W.: TETRATOX: The Tetrahymena pyriformis Population Growth Impairment Endpoint – A Surrogate for Fish Lethality. Toxicol. Methods, 7, 289-309 (1997)

  19. Thank you very much!
