Inductive Learning from Imbalanced Data Sets
Nathalie Japkowicz, Ph.D.
School of Information Technology and Engineering
University of Ottawa
Inductive Learning: Definition
Given a sequence of input/output pairs of the form <x_i, y_i>, where x_i is a possible input and y_i is the output associated with x_i:
Learn a function f such that f(x_i) = y_i for all i.
[If f has only two possible outputs, f is called a concept and learning is called concept learning.]
         |-------------- Attributes --------------|
Patient  Temperature  Cough  Sore Throat  Sinus Pain  Class
1        37           yes    no           no          no flu
2        39           no     yes          yes         flu
3        38.4         no     no           no          no flu
4        36.8         no     yes          no          no flu
5        38.5         yes    no           yes         flu
6        39.2         no     no           yes         flu
Inductive Learning: Example
Goal: Learn how to predict whether a new patient with
a given set of symptoms does or does not have the flu.
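To make the definition concrete, here is a minimal sketch of inductive learning on the six patients in the table. It is not the method used in the talk: it assumes a deliberately tiny hypothesis space (one yes/no attribute predicts the class) and picks the hypothesis f with the fewest training errors. The names `PATIENTS`, `learn_stump`, and `predict` are hypothetical.

```python
# Training pairs <x_i, y_i> transcribed from the patient table.
# Each x_i is a dict of attributes; y_i is True for "flu", False for "no flu".
PATIENTS = [
    ({"temperature": 37.0, "cough": True,  "sore_throat": False, "sinus_pain": False}, False),
    ({"temperature": 39.0, "cough": False, "sore_throat": True,  "sinus_pain": True},  True),
    ({"temperature": 38.4, "cough": False, "sore_throat": False, "sinus_pain": False}, False),
    ({"temperature": 36.8, "cough": False, "sore_throat": True,  "sinus_pain": False}, False),
    ({"temperature": 38.5, "cough": True,  "sore_throat": False, "sinus_pain": True},  True),
    ({"temperature": 39.2, "cough": False, "sore_throat": False, "sinus_pain": True},  True),
]


def learn_stump(examples):
    """Return (attribute, training_errors) for the yes/no attribute
    whose value disagrees with the class label least often."""
    first_input = examples[0][0]
    boolean_attrs = [a for a, v in first_input.items() if isinstance(v, bool)]
    best = None
    for attr in boolean_attrs:
        # Hypothesis under test: f(x) = x[attr]
        errors = sum(1 for x, y in examples if x[attr] != y)
        if best is None or errors < best[1]:
            best = (attr, errors)
    return best


def predict(attr, x):
    """Apply the learned concept f to a new patient x."""
    return x[attr]


if __name__ == "__main__":
    attr, errors = learn_stump(PATIENTS)
    print(attr, errors)
```

On these six patients the search settles on sinus pain, which agrees with every training label, so f(x_i) = y_i holds for all i. Six examples are of course far too few to trust the rule on new patients.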
What domain characteristics aggravate the problem?
Class imbalances or small disjuncts?
Are all classifiers sensitive to class imbalances?
Which proposed solutions to the class imbalance problem are more appropriate?
New Approaches
Specialized Resampling: within-class versus between-class imbalances
One-class versus two-class learning
Multiple Resampling
My Contributions
What domain characteristics aggravate the problem?
Class Imbalances or Small Disjuncts?
Are all classifiers sensitive to class imbalances?
Which proposed solutions to the class imbalance problem are more appropriate?
I.I What domain characteristics aggravate the problem?
To answer this question, I generated artificial domains that vary along three different axes:
[Figure: degree of imbalance, from imbalanced to full balance]
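The transcript does not preserve how the artificial domains were built, so the following is only a sketch under stated assumptions. It assumes a one-dimensional input in [0, 1) cut into 2^c equal intervals with alternating class labels (c standing for concept complexity, as in "c=5"), and it assumes the other two axes are training-set size and degree of imbalance. The function names and parameters are hypothetical.

```python
import random


def true_label(x, c):
    """Class of x in [0, 1): the unit interval is cut into 2**c equal
    pieces with alternating labels (assumption: odd pieces are positive)."""
    piece = int(x * 2 ** c)
    return piece % 2 == 1


def make_domain(c, n_negative, imbalance_ratio, seed=0):
    """Draw a training set with n_negative negative examples and roughly
    n_negative / imbalance_ratio positive examples.

    c               -- concept complexity (2**c alternating intervals)
    n_negative      -- controls the training-set size
    imbalance_ratio -- 1 means full balance; larger means more imbalance
    """
    rng = random.Random(seed)
    wanted = {True: max(1, n_negative // imbalance_ratio), False: n_negative}
    data = []
    # Rejection sampling: keep drawing until both class quotas are filled.
    while wanted[True] or wanted[False]:
        x = rng.random()
        y = true_label(x, c)
        if wanted[y]:
            wanted[y] -= 1
            data.append((x, y))
    return data


if __name__ == "__main__":
    data = make_domain(c=5, n_negative=80, imbalance_ratio=8)
    positives = sum(1 for _, y in data if y)
    print(len(data), positives)
```

Varying `c`, `n_negative`, and `imbalance_ratio` independently is what allows the effect of each factor on a classifier's error rate to be studied in isolation.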
I.II Class Imbalances or Small Disjuncts?
High Concept Complexity: c=5
[Figure: error rate, previous experiment versus this experiment]
Decision Tree (C5.0)
[Figure: example decision tree with tests on attributes A5, A2, and A7 and leaves labeled by class]
Support Vector Machines (SVMs)
[Figure: scatter of positive (+) and negative examples]
I.III Are all classifiers sensitive to class imbalances?
Specialized Resampling: within-class versus between-class imbalances
One-class versus two-class learning
Multiple Resampling
Idea:
Results:
Idea:
Idea (Continued):
[Figure: Oversampling Expert over a set of oversampling classifiers (sampled at different rates); Undersampling Expert over a set of undersampling classifiers (sampled at different rates)]
II.III Multiple Resampling
F-Measure
In all cases, the mixture scheme is superior to AdaBoost. However, although it helps both recall and precision, it helps recall more.
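The resampling step behind the oversampling and undersampling classifiers can be sketched as follows. This is an illustration, not the scheme from the talk: it assumes random duplication of minority examples and random removal of majority examples, and it assumes a "rate" is the fraction of the class-size gap that is closed. How the experts combine their classifiers is not preserved in the transcript and is not sketched here. All function names are hypothetical.

```python
import random


def oversample(minority, majority, rate, rng):
    """Duplicate random minority examples; rate=1.0 closes the whole gap."""
    gap = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(round(rate * gap))]
    return minority + extra, majority


def undersample(minority, majority, rate, rng):
    """Drop random majority examples; rate=1.0 closes the whole gap."""
    gap = len(majority) - len(minority)
    keep = len(majority) - round(rate * gap)
    return minority, rng.sample(majority, keep)


def resampled_training_sets(minority, majority, n_rates=5, seed=0):
    """Build one training set per rate for each resampling strategy.
    Each set would train one classifier in the corresponding group."""
    rng = random.Random(seed)
    rates = [(k + 1) / n_rates for k in range(n_rates)]
    over = [oversample(minority, majority, r, rng) for r in rates]
    under = [undersample(minority, majority, r, rng) for r in rates]
    return over, under


if __name__ == "__main__":
    minority = list(range(10))
    majority = list(range(100, 150))
    over, under = resampled_training_sets(minority, majority, n_rates=4)
    print([len(m) for m, _ in over])    # minority sizes after oversampling
    print([len(M) for _, M in under])   # majority sizes after undersampling
```

Training one classifier per resampled set avoids committing to a single resampling rate in advance, which is the motivation for sampling at several rates.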
                     Pos   Neg
Classified as  Pos   a     c
               Neg   b     d
A Summary of the Various Measures Used
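The measures named in the talk (precision, recall, F-measure, error rate) can be computed from the four cells a, b, c, d. The sketch below assumes that the unlabeled columns of the matrix are the true classes, so that a counts true positives, c false positives, b false negatives, and d true negatives. It also assumes the balanced F-measure (precision and recall weighted equally); the talk may have used a different weighting. The function name is hypothetical.

```python
def summary_measures(a, b, c, d):
    """Measures from confusion-matrix counts.

    Assumed layout: rows are the predicted class, columns the true class.
      a = classified Pos, truly Pos     c = classified Pos, truly Neg
      b = classified Neg, truly Pos     d = classified Neg, truly Neg
    """
    total = a + b + c + d
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    error_rate = (b + c) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "error_rate": error_rate,
    }


if __name__ == "__main__":
    print(summary_measures(a=40, b=10, c=20, d=30))
```

Unlike the error rate, precision and recall ignore the d cell, which is why they are more informative when the negative class dominates the data.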