IEEE CBMS’06: DM Track Salt Lake City, Utah, USA June 21-23, 2006

Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction

Mykola Pechenizkiy

Dept. of Mathematical IT

University of Jyväskylä Finland

Alexey Tsymbal

Department of Computer Science, Trinity College Dublin, Ireland

Seppo Puuronen &

Oleksandr Pechenizkiy

Dept. of CS and IS

University of Jyväskylä

Finland

“Class Noise and Supervised Learning in Medical Domains: The Effect of Feature Extraction”

by M. Pechenizkiy, S. Puuronen, A. Tsymbal and O. Pechenizkiy

Outline
  • DM and KDD background
    • KDD as a process, DM strategy
    • Supervised Learning (SL)
  • Noise in data
    • Types and sources of noise
  • Feature Extraction approaches used:
    • Conventional Principal Component Analysis
    • Class-conditional FE: parametric and non-parametric
  • Experiment design
    • Impact of class noise on SL and the effect of FE
    • Dataset characteristics
  • Results and Conclusion


Knowledge discovery as a process

[Figure: the KDD process diagram, annotated with the elements of this study: class noise is introduced in the training datasets; PCA and LDA are used for feature extraction; kNN, Naïve Bayes and C4.5 are used as classifiers.]

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.


The task of classification

J classes, n training observations, p features

Given n training instances (x_i, y_i), where x_i are the values of the attributes and y_i is the class.

Goal: given a new instance x_0, predict its class y_0.

[Figure: the training set and the new instance to be classified are the inputs to CLASSIFICATION; the output is the class membership of the new instance.]

Examples:
  • diagnosis of thyroid diseases;
  • heart attack prediction, etc.
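
As a rough illustration of the classification task, the sketch below implements a k-nearest-neighbour classifier (kNN is one of the three learners used in the study) in plain Python. The toy training set, its attribute values and class names, and the function name `knn_predict` are assumptions made for illustration; they do not come from the talk.

```python
from collections import Counter
import math

def knn_predict(train, x0, k=3):
    """Predict the class of x0 by majority vote among its k nearest training instances.

    train: list of (x_i, y_i) pairs, where x_i is a tuple of numeric attribute values.
    """
    # Sort the training instances by Euclidean distance to the new instance x0.
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x0))[:k]
    # Majority vote over the class labels of the k closest instances.
    votes = Counter(y for _, y in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set: n = 6 instances, p = 2 features, J = 2 classes.
train = [
    ((1.0, 1.2), "healthy"),
    ((0.9, 0.8), "healthy"),
    ((1.1, 1.0), "healthy"),
    ((3.0, 3.1), "sick"),
    ((3.2, 2.9), "sick"),
    ((2.8, 3.3), "sick"),
]

y0 = knn_predict(train, (2.9, 3.0), k=3)
print(y0)
```

Because the prediction depends directly on the labels of nearby training instances, a mislabeled neighbour can change the vote, which is why class noise matters for this kind of learner.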

Data may contain various types of errors:
  • random or systematic
    • random errors are often referred to as noise;
    • some authors regard as noise both mislabeled examples and outliers, which are correctly classified but relatively rare instances (exceptions).
  • The quality of a dataset in SL is characterized by two kinds of information about its instances:
    • the quality of the attributes indicates how well they characterize instances for classification purposes, and
    • the quality of the class labels indicates the correctness of the class label assignments.
  • Noise is similarly divided into two major categories:
    • class noise (misclassifications or mislabelings):
      • contradictory instances (instances with the same attribute values but different class labels, forming the so-called irreducible or Bayes error) and wrongly labeled instances.
    • attribute noise (errors introduced to attribute values):
      • erroneous attribute values, missing or so-called ‘don’t know’ values, and incomplete or so-called ‘don’t care’ values.
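
Contradictory instances are the one form of class noise that can be detected mechanically. The following sketch, with a hypothetical helper `find_contradictory` and made-up data, groups instances by their attribute values and reports the groups that carry more than one class label. It only shows that a conflict exists; it cannot tell which of the labels is wrong.

```python
from collections import defaultdict

def find_contradictory(instances):
    """Return {attribute_values: set_of_labels} for attribute vectors
    that occur with more than one distinct class label."""
    labels_by_x = defaultdict(set)
    for x, y in instances:
        labels_by_x[tuple(x)].add(y)
    return {x: labels for x, labels in labels_by_x.items() if len(labels) > 1}

# Hypothetical instances: (attribute values, class label).
data = [
    ((37.0, 1), "negative"),
    ((37.0, 1), "positive"),   # same attribute values, different label
    ((39.5, 0), "positive"),
    ((39.5, 0), "positive"),   # exact duplicate, consistent label
]

contradictions = find_contradictory(data)
print(contradictions)
```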

Sources of Class Noise:
  • The major factors that affect the number of mislabeled instances in a dataset:
    • data-entry errors;
    • errors of the devices used for automatic classification;
    • the subjectivity and inadequacy of the information used to label each instance.
  • Domains in which medical experts may disagree are natural ones for subjective labeling errors:
    • if the absolute ground truth is unknown, experts must provide labels subjectively, and mislabeled instances naturally appear;
    • if an observation needs to be ranked according to disease severity.
  • The information used to label an instance may differ from the information to which the learning algorithm has access:
    • e.g. if an expert relies on visual input rather than on the numeric values of the attributes.
  • The results of some tests (attribute values) may be unknown because they are impossible or difficult to obtain:
    • e.g. because of cost or time considerations.

Handling class noise
  • noise-tolerant techniques – try to avoid overfitting the possibly noisy training set during SL:
    • handle noise implicitly;
    • the noise-handling mechanism is often embedded into one of the following:
      • search heuristics and stopping criteria used in model construction;
      • post-processing, such as decision tree post-pruning; or
      • a model selection mechanism based e.g. on the MDL principle.
  • filtering techniques – detect and eliminate noisy instances before SL:
    • handle noise explicitly;
    • the noise-handling mechanism is often implemented as a filter that is applied before SL;
    • result in a reduced training set if the noisy instances are not corrected but deleted;
    • include single-algorithm filters and ensemble filters.
  • a brief review of these approaches and their pros and cons can be found in the paper; we omit this discussion due to time constraints.

It is often hard to distinguish noise from exceptions (outliers) without the help of an expert, especially if the noise is systematic
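
To make the filtering idea concrete, here is a minimal sketch of a single-algorithm filter built on kNN: an instance is deleted when its label disagrees with the majority label of its k nearest neighbours. This is only one simple way to build such a filter and is not necessarily one of the filters reviewed in the paper; the function name `knn_filter` and the toy data are assumptions.

```python
from collections import Counter
import math

def knn_filter(train, k=3):
    """Keep only instances whose label agrees with the majority label
    of their k nearest neighbours (the instance itself is left out)."""
    kept = []
    for i, (x, y) in enumerate(train):
        others = train[:i] + train[i + 1:]
        neighbours = sorted(others, key=lambda pair: math.dist(pair[0], x))[:k]
        majority = Counter(label for _, label in neighbours).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept

# Hypothetical training set with one suspicious label.
train = [
    ((1.0, 1.0), "A"), ((1.1, 0.9), "A"), ((0.9, 1.1), "A"),
    ((1.05, 1.0), "B"),  # surrounded by class A: flagged by the filter
    ((5.0, 5.0), "B"), ((5.1, 4.9), "B"), ((4.9, 5.1), "B"), ((5.0, 5.2), "B"),
]

filtered = knn_filter(train, k=3)
print(len(train), "->", len(filtered))
```

The filter deletes the flagged instance whether it is a genuine mislabeling or a correctly labeled exception, which is exactly the difficulty of telling noise from outliers without expert help.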

Focus of this study:
  • to apply Feature Extraction (FE) techniques to eliminate the effect of class noise on SL.
  • This approach fits best into the category of noise-tolerant techniques, as
    • it helps to avoid overfitting implicitly within learning techniques.
  • However, this approach also has some similarity with the filtering approach, as
    • it clearly has a separate phase of dimensionality reduction, which is undertaken before the SL process.
  • Brief background on the FE techniques used in this study follows in the next few slides.

Feature Extraction
  • Feature extraction (FE) is a dimensionality reduction technique that extracts a set of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).
  • Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques; it is based on extracting the axes on which the data shows the highest variability (Jolliffe 1986).

PCA has the following properties:

(1) it maximizes the variance of the extracted features;

(2) the extracted features are uncorrelated;

(3) it finds the best linear approximation in the mean-squares sense;

(4) it maximizes the information contained in the extracted features.
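
The properties above can be checked on a small example. The sketch below is a generic eigendecomposition-based PCA written with NumPy, not the WEKA implementation used in the study. The synthetic data, the function name `pca`, and the default threshold of 0.85 (chosen to mirror the variance threshold reported in the experiment design) are assumptions.

```python
import numpy as np

def pca(X, var_threshold=0.85):
    """Return (components, explained_ratio) for the fewest principal components
    whose cumulative share of the total variance reaches var_threshold."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    cov = np.cov(Xc, rowvar=False)               # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()
    # Smallest number of components whose cumulative ratio reaches the threshold.
    m = int(np.searchsorted(np.cumsum(ratio), var_threshold)) + 1
    m = min(m, X.shape[1])
    return eigvecs[:, :m], ratio[:m]

# Synthetic data: 4 features built from 2 underlying factors plus small noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
X = np.column_stack([
    z[:, 0], z[:, 0] + 0.1 * rng.normal(size=200),
    z[:, 1], z[:, 1] + 0.1 * rng.normal(size=200),
])

components, ratio = pca(X)
Z = (X - X.mean(axis=0)) @ components            # extracted features
print(components.shape, ratio.round(3))
print(np.cov(Z, rowvar=False).round(6))
```

The off-diagonal entries of the covariance matrix of `Z` are zero up to rounding, which corresponds to property (2), and the components are ordered by decreasing variance, in line with property (1).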


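FE example “Heart Disease”

Extracted features (linear combinations of the original attributes):
  • -0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
  • -0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
  • 0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate

Variance covered: 100% (original features) vs. 87% (extracted features)

Classification accuracy: 60% (original features) vs. 67% (extracted features)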
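
Each extracted feature in this example is a weighted sum of the original attributes. The sketch below applies the three sets of coefficients from the example to a single patient. The patient values are hypothetical, and it is assumed here that the attributes have already been standardised; the slide does not state how the inputs were scaled.

```python
# Coefficients of the three extracted features, as listed in the example.
WEIGHTS = [
    {"Age": -0.7,  "Sex": 0.1,  "RestBP": -0.43, "MaxHeartRate": 0.57},
    {"Age": -0.01, "Sex": 0.78, "RestBP": -0.42, "MaxHeartRate": -0.47},
    {"Age": 0.1,   "Sex": -0.6, "RestBP": -0.73, "MaxHeartRate": -0.33},
]

def extract_features(patient):
    """Map the original attributes to the extracted features (weighted sums)."""
    return [sum(w[name] * patient[name] for name in w) for w in WEIGHTS]

# Hypothetical patient, attribute values assumed to be standardised.
patient = {"Age": 0.5, "Sex": 1.0, "RestBP": -0.2, "MaxHeartRate": 1.0}

features = extract_features(patient)
print([round(f, 3) for f in features])
```

A classifier trained after FE sees only these three values per patient instead of the four original attributes.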

PCA- and LDA-based Feature Extraction
  • The use of class information in the FE process is crucial for many datasets:
    • class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or even deteriorates it.
  • Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS’02; Pechenizkiy et al., AI’05

No technique is superior overall, but nonparametric approaches are more stable across various dataset characteristics.

Experiment design
  • WEKA3 environment (Data Mining Software in Java):
    • http://www.cs.waikato.ac.nz/ml/weka/
  • 10 medical datasets:
    • see the next slide.
  • Classification algorithms:
    • kNN, Naïve Bayes, C4.5.
  • Feature Extraction techniques:
    • PCA, PAR (parametric class-conditional FE), NPAR (nonparametric class-conditional FE), each with a 0.85 (i.e. 85%) variance threshold.
  • Artificially imputed class noise:
    • 0% to 20%, in 2% steps.
  • Evaluation:
    • accuracy averaged over 30 test runs of Monte-Carlo cross validation for each sample;
    • 30% of the instances form the test set; the remaining 70% form the training set, in which 0% to 20% of the instances have an artificially corrupted class label.
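
The evaluation protocol can be sketched in plain Python as follows. The study itself was run in WEKA, so this is only an approximation: the way a label is corrupted (replaced by a different class chosen uniformly at random), the helper names `split_and_corrupt`, `monte_carlo_accuracy` and `one_nn`, and the toy one-dimensional dataset are all assumptions.

```python
import random

def split_and_corrupt(instances, noise_rate, rng, test_share=0.3):
    """One Monte-Carlo split: hold out test_share of the data for testing,
    then corrupt the class label of a noise_rate fraction of the training part."""
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_test = round(test_share * len(shuffled))
    test, train = shuffled[:n_test], shuffled[n_test:]
    classes = sorted({y for _, y in instances})
    noisy_idx = set(rng.sample(range(len(train)), round(noise_rate * len(train))))
    noisy_train = []
    for i, (x, y) in enumerate(train):
        if i in noisy_idx:
            # Assumption: the corrupted label is a different class picked at random.
            y = rng.choice([c for c in classes if c != y])
        noisy_train.append((x, y))
    return noisy_train, test

def monte_carlo_accuracy(instances, fit_predict, noise_rate, runs=30, seed=0):
    """Average test accuracy over `runs` random splits; the test labels stay clean."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train, test = split_and_corrupt(instances, noise_rate, rng)
        predictions = fit_predict(train, [x for x, _ in test])
        correct = sum(p == y for p, (_, y) in zip(predictions, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def one_nn(train, xs):
    """Toy learner: 1-nearest-neighbour on a single numeric attribute."""
    return [min(train, key=lambda pair: abs(pair[0][0] - x[0]))[1] for x in xs]

# Hypothetical dataset: 100 instances, one attribute, two classes.
data = [((float(i),), "low" if i < 50 else "high") for i in range(100)]

for rate in (0.0, 0.1, 0.2):
    print(rate, round(monte_carlo_accuracy(data, one_nn, rate), 3))
```

Only the training labels are corrupted; accuracy is always measured against the clean labels of the held-out 30%, so any drop in accuracy reflects what the learner absorbed from the noisy training set.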

Datasets Characteristics

Further information on these datasets and the datasets themselves are available at http://www.informatics.bangor.ac.uk/~kuncheva/activities/real_data.htm.

Classification Accuracy vs. Imputed Class Noise

Classification Error Increase due to Class Noise

[Figure panels: kNN (k Nearest Neighbour); Naïve Bayes; C4.5 decision tree]

Typical behavior of SL with(-out) FE

[Figure panels, laryngeal1 dataset: k Nearest Neighbour; Naïve Bayes; C4.5 decision tree]

Summary and Conclusions
  • Class noise affects SL on most of the considered datasets.
  • FE can significantly increase the accuracy of SL
    • by producing a better feature space and fighting “the curse of dimensionality”.
  • In this study we showed that applying FE for SL
    • decreases the negative effect of class noise in the data.
  • Directions of further research:
    • the comparison of FE techniques with other dimensionality reduction and instance selection techniques;
    • the comparison of FE with filter approaches for class noise elimination.

Contact Info

MS PowerPoint slides of this and other recent talks and full texts of selected publications are available online at: http://www.cs.jyu.fi/~mpechen

Mykola Pechenizkiy

Department of Mathematical Information Technology,

University of Jyväskylä, FINLAND

E-mail: mpechen@cs.jyu.fi

www.cs.jyu.fi/~mpechen

THANK YOU!
