Inductive Learning from Imbalanced Data Sets. Nathalie Japkowicz, Ph.D. School of Information Technology and Engineering University of Ottawa. Inductive Learning: Definition.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Inductive Learning from Imbalanced Data Sets
Nathalie Japkowicz, Ph.D.
School of Information Technology and Engineering
University of Ottawa
Inductive Learning: Definition
Given a sequence of input/output pairs of the form <xi, yi>, where xi is a possible input, and yi is the output associated with xi:
Learn a function f such that:
[If f has only 2 possible outputs, f is called a concept and learning is called conceptlearning.]
Patient Attributes Class
Temperature Cough Sore Throat Sinus Pain
1 37 yes no no no flu
2 39 no yes yes flu
3 38.4 no no no no flu
4 36.8 no yes no no flu
5 38.5 yes no yes flu
6 39.2 no no yes flu
Inductive Learning: Example
Goal: Learn how to predict whether a new patient with
a given set of symptoms does or does not have the flu.
But What is the Problem?
Several Common Approaches
Fundamental
What domain characteristics aggravate the problem?
Class imbalances or small disjuncts?
Are all classifiers sensitive to class imbalances?
Which proposed solutions to the class imbalance problem are more appropriate?
New Approaches
Specialized Resampling: withinclass versus betweenclass imbalances
One class versus twoclass learning
Multiple Resampling
Part I: Fundamentals
What domain characteristics aggravate the problem?
Class Imbalances or Small Disjuncts?
Are all classifiers sensitive to class imbalances?
Which proposed solutions to the class imbalance problem are more appropriate?
+  +  +  + 
1
0
To answer this question, I generated artificial domains that vary along three different axes:
Large
Imbal.
Full
balance
I. I What domain characteristics aggravate the Problem?
Error
rate
S= 1
Large
Imbal.
Full
balance
I.I What domain characteristics
aggravate the Problem?
Error
rate
S= 5
Large
Imbal.
Full
balance
I.II: Clas Imbalances or Small
Disjuncts?
High Concept Complexity: c=5
Error
rate
Previous
Experiment
This
Experiment
Neural Net (MLPs)
Decision Tree (C5.0)
A5
T
F
A2 A7
< 5
T
> 5
5
F
+ A5  +
T
F
+ 
Support Vector Machines (SVMs)

+

+
+




+
+
Large
Imbal.
Full
balance
Error rate
S = 1; C = 3
Large
Imbal.
Full
Bal.
Err. rate
S = 1; C = 3
I.IV Which Solution is Best?
Part II: New Approaches
Specialized Resampling: withinclass versus betweenclass imbalances
One class versus twoclass learning
Multiple Resampling
Idea:
5
II.I: Withinclass versus Betweenclass Imbalances
Symmetric
Case
Asymmetric
Case
Results:
RMLP
DMLP
Decision Tree (C5.0)
A5
T
F
A2 A7
< 5
T
> 5
5
F
+ A5  +
T
F
+ 
Error
Rate
Idea:
II.III Multiple Resampling
Idea (Continued):
Output
Oversampling Expert
Undersampl. Expert
…
…
Oversampling Classifiers
(sampled at different rates)
Undersampling Classifiers
(sampled at diff.rates)
F
Measure
In all cases, the mixture scheme is superior to Adaboost.
However, though it helps both recall and precision, it
helps recall more.
Actual Class
Pos
Neg
Pos
a
c
Classi
Fied as
Neg
b
d