
Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction


Presentation Transcript


1. IEEE CBMS’06: DM Track, Salt Lake City, Utah, USA, June 21-23, 2006. Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction. Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland; Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland; Seppo Puuronen & Oleksandr Pechenizkiy, Dept. of CS and IS, University of Jyväskylä, Finland.

2. Outline
• DM and KDD background: KDD as a process, DM strategy
• Supervised Learning (SL)
• Noise in data: types and sources of noise
• Feature Extraction (FE) approaches used:
  • conventional Principal Component Analysis (PCA)
  • class-conditional FE: parametric and non-parametric
• Experiment design: impact of class noise on SL and the effect of FE
• Dataset characteristics
• Results and conclusions

3. Knowledge discovery as a process
[Figure: the KDD process (Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997), annotated with the components used in this study: class noise is introduced in the training datasets, PCA and LDA are applied for feature extraction, and kNN, Naïve Bayes, and C4.5 are used as learners.]

4. The task of classification
Given n training instances (x_i, y_i), where x_i are the attribute values and y_i is the class label (J classes, n training observations, p features).
Goal: given a new instance x_0, predict its class y_0.
[Figure: a training set and a new instance are fed into the classifier, which outputs the class membership of the new instance.]
Examples: diagnosis of thyroid diseases, heart attack prediction, etc. A minimal sketch of this setting is shown below.
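The following is a minimal illustrative sketch of the classification task using scikit-learn's kNN; the synthetic data and all names here are assumptions for illustration only, not the datasets or WEKA setup used in the talk.

```python
# Fit a learner on labeled training instances (x_i, y_i) and predict the
# class of a new instance x_0. The data is synthetic and only mirrors the
# setting described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# n = 100 training observations, p = 4 features, J = 2 classes
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + X_train[:, 2] > 0).astype(int)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# New instance x_0 to be classified
x_0 = rng.normal(size=(1, 4))
print("predicted class y_0:", clf.predict(x_0)[0])
```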

5. Noise in data
Data may contain various types of errors: random or systematic. Random errors are often referred to as noise; some authors regard as noise both mislabeled examples and outliers, which are correctly labeled but relatively rare instances (exceptions).
The quality of a dataset in SL is characterized by two information parts of the instances:
• the quality of the attributes indicates how well they characterize instances for classification purposes;
• the quality of the class labels indicates the correctness of the class label assignments.
Noise is correspondingly divided into two major categories:
• class noise (misclassifications or mislabelings): contradictory instances (instances with the same attribute values but different class labels, forming the so-called irreducible or Bayes error) and wrongly labeled instances (mislabelings);
• attribute noise (errors introduced into attribute values): erroneous attribute values, missing or so-called ‘don’t know’ values, and incomplete or so-called ‘don’t care’ values.
A small sketch contrasting the two categories follows this slide.
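A hedged sketch of the distinction between the two noise categories, written as two corruption routines; the rates, data types, and function names are assumptions made for illustration, not part of the original study.

```python
# Two ways of corrupting a dataset, mirroring the two noise categories above.
import numpy as np

def add_class_noise(y, rate, n_classes, rng):
    """Class noise: flip the labels of a random fraction `rate` of instances."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    for i in idx:
        wrong = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(wrong)
    return y_noisy

def add_attribute_noise(X, rate, rng):
    """Attribute noise: replace a random fraction `rate` of attribute values."""
    X_noisy = X.copy()
    mask = rng.random(X.shape) < rate
    X_noisy[mask] = rng.normal(size=mask.sum())
    return X_noisy
```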

6. Sources of class noise
The major factors that affect the amount of mislabeled instances in a dataset:
• data-entry errors;
• errors of devices used for automatic classification;
• the subjectivity and the inadequacy of the information used to label each instance.
Domains in which medical experts may disagree are natural candidates for subjective labeling errors:
• if the absolute ground truth is unknown, experts must provide labels subjectively, and mislabeled instances naturally appear;
• if an observation needs to be ranked according to disease severity.
The information used to label an instance may differ from the information the learning algorithm will have access to:
• e.g., if an expert relies on visual input rather than the numeric values of the attributes.
The results of some tests (attribute values) may be unknown, i.e. impossible or difficult to obtain:
• e.g., because of cost or time considerations.

7. Handling class noise
• Noise-tolerant techniques – try to avoid overfitting the possibly noisy training set during SL:
  • handle noise implicitly;
  • the noise-handling mechanism is often embedded into the search heuristics and stopping criteria used in model construction, into post-processing such as decision tree post-pruning, or into a model selection mechanism based e.g. on the MDL principle.
• Filtering techniques – detect and eliminate noisy instances before SL:
  • handle noise explicitly;
  • the noise-handling mechanism is often implemented as a filter applied before SL;
  • this results in a reduced training set if the noisy instances are deleted rather than corrected;
  • single-algorithm filters and ensemble filters (a minimal filtering sketch follows this slide).
A brief review of these approaches and their pros and cons can be found in the paper; we omit this discussion here due to time constraints.
It is often hard to distinguish noise from exceptions (outliers) without the help of an expert, especially if the noise is systematic.
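A hedged sketch of a single-algorithm classification filter, one common form of the filtering approach mentioned above (this is background illustration, not the method evaluated in the talk); the base learner and fold count are arbitrary choices.

```python
# Remove training instances that a learner misclassifies under
# cross-validation, treating them as likely class noise.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def classification_filter(X, y, n_folds=10):
    """Return the subset of (X, y) kept after filtering suspected noisy instances."""
    base = DecisionTreeClassifier(random_state=0)
    y_pred = cross_val_predict(base, X, y, cv=n_folds)
    keep = y_pred == y              # keep instances the filter agrees with
    return X[keep], y[keep]
```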

8. Focus of this study
• To apply Feature Extraction (FE) techniques to eliminate the effect of class noise on SL.
• This approach fits better into the category of noise-tolerant techniques, as it helps to avoid overfitting implicitly within the learning techniques.
• However, it also has some similarity with the filtering approach, as it has a clearly separate dimensionality reduction phase that is undertaken before the SL process.
• A brief background on the FE techniques used in this study is given on the next few slides.

9. Feature Extraction
• Feature extraction (FE) is a dimensionality reduction technique that extracts a set of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).
• Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques; it is based on extracting the axes on which the data shows the highest variability (Jolliffe 1986). PCA has the following properties: (1) it maximizes the variance of the extracted features; (2) the extracted features are uncorrelated; (3) it finds the best linear approximation in the mean-squares sense; (4) it maximizes the information contained in the extracted features. A minimal PCA sketch is given below.
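A minimal sketch of PCA-based feature extraction with a variance-coverage threshold, using scikit-learn rather than the authors' WEKA implementation; the data here is random and purely illustrative.

```python
# Project the data onto the principal components that together cover a
# chosen fraction of the total variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))          # illustrative data: 150 instances, 10 features

# Keep the smallest number of principal components covering 85% of the variance.
pca = PCA(n_components=0.85)
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance covered:", pca.explained_variance_ratio_.sum())
```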

10. FE example: “Heart Disease” dataset
The extracted features are linear combinations of the original attributes, e.g.:
• -0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
• -0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
• 0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate
Covering 100% of the variance gives 60% classification accuracy, while covering 87% of the variance gives 67%.

11. PCA- and LDA-based Feature Extraction
• The use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or even deteriorates it.
• Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS’02; Pechenizkiy et al., AI’05.
• There is no single superior technique, but the non-parametric approaches are more stable with respect to various dataset characteristics.
A sketch contrasting variance-based and class-conditional FE is given below.
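A hedged sketch contrasting variance-based FE (PCA) with class-conditional, parametric FE (Fisher's LDA); it illustrates the distinction on this slide but is not the exact parametric/non-parametric formulation used in the paper, and the toy data is an assumption.

```python
# Two classes that differ along a direction of low overall variance:
# PCA (ignores labels) keeps the high-variance direction, LDA (uses labels)
# keeps the discriminative one.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([rng.normal(scale=5.0, size=n),   # high-variance, class-irrelevant
                     rng.normal(scale=0.5, size=n)])  # low-variance, class-relevant
y = (rng.random(n) < 0.5).astype(int)
X[:, 1] += y * 2.0                                     # class separation on feature 2

X_pca = PCA(n_components=1).fit_transform(X)                             # no class info
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)   # class-conditional
```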

12. Experiment design
• WEKA 3 environment: Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka/
• 10 medical datasets (next slide).
• Classification algorithms: kNN, Naïve Bayes, C4.5.
• Feature extraction techniques: PCA, PAR (parametric), NPAR (non-parametric) – 0.85 variance coverage threshold.
• Artificially imputed class noise: 0%–20%, in 2% steps.
• Evaluation: accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample; 30% of the instances form the test set, and 70% are used to form the training set, in which 0%–20% of the instances receive an artificially corrupted class label.
A sketch of this evaluation loop follows this slide.
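A hedged sketch of the evaluation loop described above, using scikit-learn instead of the authors' WEKA setup; the dataset (a scikit-learn stand-in, not one of the 10 medical datasets), the Naïve Bayes learner, and the helper names are assumptions for illustration.

```python
# Monte-Carlo cross-validation with a 70/30 train/test split, class noise
# injected only into the training part, and optional PCA-based feature
# extraction (0.85 variance coverage) before learning.
import numpy as np
from sklearn.datasets import load_breast_cancer   # stand-in medical dataset
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
X, y = load_breast_cancer(return_X_y=True)

def flip_labels(y, rate, rng):
    """Corrupt the class labels of a random fraction `rate` of instances."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[idx] = 1 - y[idx]                            # binary labels assumed here
    return y

def mc_accuracy(noise_rate, use_pca, runs=30):
    accs = []
    for run in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=run)
        y_tr = flip_labels(y_tr, noise_rate, rng)
        if use_pca:
            pca = PCA(n_components=0.85).fit(X_tr)
            X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)
        accs.append(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))
    return np.mean(accs)

for rate in np.arange(0.0, 0.21, 0.02):            # 0%..20% noise in 2% steps
    print(f"{rate:.2f}  plain={mc_accuracy(rate, False):.3f}  "
          f"pca={mc_accuracy(rate, True):.3f}")
```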

13. Dataset characteristics
Further information on these datasets, and the datasets themselves, are available at http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm.

14. Classification accuracy vs. imputed class noise
[Figure: classification accuracy as a function of the imputed class noise level.]

15. Classification error increase due to class noise
[Figure: error increase for kNN (k Nearest Neighbour), Naïve Bayes, and the C4.5 decision tree.]

16. Typical behaviour of SL with and without FE (laryngeal1 dataset)
[Figure: accuracy curves for k Nearest Neighbour, Naïve Bayes, and the C4.5 decision tree on the laryngeal1 dataset.]

17. Summary and conclusions
• Class noise affects SL on most of the considered datasets.
• FE can significantly increase the accuracy of SL by producing a better feature space and fighting “the curse of dimensionality”.
• In this study we showed that applying FE for SL decreases the negative effect of class noise in the data.
• Directions of further research:
  • the comparison of FE techniques with other dimensionality reduction and instance selection techniques;
  • the comparison of FE with filter approaches for class noise elimination.

18. Contact info
MS PowerPoint slides of this and other recent talks, and full texts of selected publications, are available online at http://www.cs.jyu.fi/~mpechen
Mykola Pechenizkiy, Department of Mathematical Information Technology, University of Jyväskylä, FINLAND
E-mail: mpechen@cs.jyu.fi
www.cs.jyu.fi/~mpechen
THANK YOU!
