Overview: concepts, instances, and the hypothesis space; decision trees; decision rules.
Decision Trees and Rule Induction
Kurt Driessens
with slides stolen from Evgueni Smirnov and Hendrik Blockeel
How to represent information about instances?

Example instance 1: head = triangle, body = round, color = blue, legs = short, holding = balloon, smiling = false
Example instance 2: head = round, body = square, color = red, legs = long, holding = knife, smiling = true

Values can be symbolic or numeric.
In this course: attribute-value representation.
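As a minimal sketch, the two instances above can be written in an attribute-value representation as plain dictionaries (the attribute names are taken from the slide; the dict encoding itself is an illustrative choice):

```python
# Each instance maps attribute names to symbolic or numeric values.
instance1 = {"head": "triangle", "body": "round", "color": "blue",
             "legs": "short", "holding": "balloon", "smiling": False}

instance2 = {"head": "round", "body": "square", "color": "red",
             "legs": "long", "holding": "knife", "smiling": True}

# Values may be symbolic (strings) or numeric; booleans encode true/false.
print(instance1["color"], instance2["holding"])
```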
Each internal node tests an attribute; each branch corresponds to an attribute value; each leaf assigns a classification.

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rainy → Wind
    ├─ Strong → No
    └─ Weak → Yes
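The "play tennis" tree above can be sketched in code, assuming a nested (attribute, branches) tuple encoding for internal nodes and plain labels for leaves; this encoding is an illustrative choice, not part of the slides:

```python
# Internal node = (attribute, {value: subtree}); leaf = class label.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rainy":    ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, instance):
    """Follow the branch matching the instance at each internal node
    until a leaf is reached; the leaf is the classification."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```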
Leaves can also store class distributions and probabilities instead of a single class (tree reconstructed from the slide fragments):

Fetal_Presentation
├─ 1 → Previous_Csection
│   ├─ 0 → Primiparous → …
│   └─ 1 → + [55+, 35−] (.61+, .39−)
├─ 2 → − [3+, 29−] (.11+, .89−)
└─ 3 → − [8+, 22−] (.27+, .73−)
A
├─ true → true
└─ false → B
    ├─ true → true
    └─ false → false
Most (but not all) decision tree research in data mining focuses on classification trees
Basic algorithm for TDIDT (based on ID3; a more formal version follows later).
The data is recursively partitioned so that examples with the same class, or otherwise similar examples, are put together.
Main questions:
Is this drink going to make me ill, or not?
[Figure: candidate tests on the drinks — splitting on Shape, or on Colour (orange / non-orange).]
Find test for which children are as “pure” as possible
Given a set S whose instances belong to class i with probability p_i:

Entropy(S) = − Σ_i p_i log2(p_i)
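As a small runnable sketch, the entropy measure can be computed directly from the class probabilities (the example values match the [9+, 5−] set used on the following slides):

```python
from math import log2

def entropy(class_probabilities):
    """Entropy(S) = -sum_i p_i * log2(p_i); the 0*log2(0) term is taken as 0."""
    return -sum(p * log2(p) for p in class_probabilities if p > 0)

print(round(entropy([9/14, 5/14]), 2))  # 0.94 (the full set S)
print(round(entropy([0.5, 0.5]), 2))    # 1.0  (maximally impure)
print(round(entropy([1.0]), 2))         # 0.0  (pure)
```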
[Figure: entropy as a function of p, for 2 classes — 0 at p = 0 and p = 1, maximum 1 at p = 0.5.]
Worked example, S: [9+, 5−], Entropy(S) = 0.940.

Humidity split:
  High:   [3+, 4−], E = 0.985
  Normal: [6+, 1−], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind split:
  Weak:   [6+, 2−], E = 0.811
  Strong: [3+, 3−], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
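The gain numbers above can be reproduced with a short sketch that works on (positive, negative) example counts; up to rounding of the intermediate entropies, it matches the slide's 0.151 and 0.048:

```python
from math import log2

def entropy2(pos, neg):
    """Binary entropy of a node holding pos positive and neg negative examples."""
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

def gain(parent, children):
    """Gain(S, A) = Entropy(S) - sum_j |S_j|/|S| * Entropy(S_j)."""
    n = sum(parent)
    return entropy2(*parent) - sum((p + q) / n * entropy2(p, q)
                                   for p, q in children)

# S = [9+, 5-]; child counts taken from the slide:
print(round(gain((9, 5), [(3, 4), (6, 1)]), 2))  # Humidity: 0.15
print(round(gain((9, 5), [(6, 2), (3, 3)]), 2))  # Wind: 0.05
```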
Note: for e.g. Boolean attributes, H is complete: each concept can be represented!
So what about inductive bias?
In this case: preference for short trees with informative attributes at the top
Phenomenon of overfitting:
Cf. fitting a curve with too many parameters
[Figure: an overfitted decision boundary on a 2-D scatter of + and − examples; the overly specific regions are areas with probably wrong predictions.]
[Figure: accuracy vs. size of tree — accuracy on training data keeps increasing, while accuracy on unseen data peaks and then declines; overfitting starts about where the two curves diverge.]
Use a validation set = data not considered for choosing the best test.
Pre-pruning: when accuracy goes down on the validation set, stop adding nodes to this branch.
Post-pruning: after learning the tree, start pruning branches away.
Constitutes a second search in the hypothesis space
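A minimal sketch of reduced-error (post-)pruning, assuming the nested (attribute, branches) tuple tree encoding and a held-out validation set of (instance, label) pairs; for brevity the replacement leaf uses the majority class of the validation examples reaching the node, where real implementations typically use the training distribution:

```python
from collections import Counter

def classify(tree, instance):
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def prune(tree, validation):
    """Bottom-up: replace a subtree by a majority-class leaf whenever the
    leaf classifies the validation examples at least as well."""
    if not isinstance(tree, tuple) or not validation:
        return tree
    attribute, branches = tree
    pruned = (attribute, {
        value: prune(sub, [(x, y) for x, y in validation
                           if x.get(attribute) == value])
        for value, sub in branches.items()})
    majority = Counter(y for _, y in validation).most_common(1)[0][0]
    if accuracy(majority, validation) >= accuracy(pruned, validation):
        return majority          # the leaf does at least as well: prune
    return pruned
```

On a validation set where the Humidity test below the Strong branch only adds errors, that subtree collapses to a leaf while the useful Wind test survives.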
[Figure: accuracy vs. size of tree — the effect of pruning: accuracy on training data drops, accuracy on unseen data improves.]
Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rainy → Wind
    ├─ Strong → No
    └─ Weak → Yes
if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
…
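Reading one rule off per leaf, with one condition per branch on the path, can be sketched as follows (using the illustrative nested-tuple tree encoding assumed earlier, not the slides' own notation):

```python
def tree_to_rules(tree, conditions=()):
    """Return one 'if ... then ...' rule per leaf of the tree."""
    if not isinstance(tree, tuple):                       # leaf reached
        ifpart = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"if {ifpart} then {tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():               # one path per branch
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rainy":    ("Wind", {"Strong": "No", "Weak": "Yes"}),
})
for rule in tree_to_rules(tree):
    print(rule)
```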
Problem: a question with many possible answers looks more informative than a yes/no question, so plain information gain is biased towards many-valued attributes. A possible correction:

Gain Ratio: GR(S, A) = Gain(S, A) / SI(S, A)

with SI(S, A) = − Σ_i |S_i|/|S| log2(|S_i|/|S|), where i ranges over the different results of test A.
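A sketch of the split-information penalty: a balanced binary split has SI = 1, while a test splitting 14 examples into 14 singletons has SI = log2(14) ≈ 3.81, so even a large gain is heavily penalised:

```python
from math import log2

def split_info(subset_sizes):
    """SI(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|),
    i ranging over the different results of test A."""
    n = sum(subset_sizes)
    return -sum(s / n * log2(s / n) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    """GR(S, A) = Gain(S, A) / SI(S, A)."""
    return gain / split_info(subset_sizes)

print(round(split_info([7, 7]), 2))    # 1.0
print(round(split_info([1] * 14), 2))  # 3.81
```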
function TDIDT(E: set of examples) returns tree;
  T' := grow_tree(E);
  T := prune(T');
  return T;

function grow_tree(E: set of examples) returns tree;
  T := generate_tests(E);
  t := best_test(T, E);
  P := partition induced on E by t;
  if stop_criterion(E, P)
  then return leaf(info(E))
  else
    for all E_j in P: t_j := grow_tree(E_j);
    return node(t, {(j, t_j) | E_j in P});
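The grow_tree pseudocode above can be turned into a runnable sketch, under some assumptions the slides leave open: categorical attributes, best_test = lowest weighted entropy (equivalent to maximal gain), stop_criterion = pure node or no tests left, and info(E) = the majority class. No pruning is done here.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def grow_tree(examples, attributes):
    """examples: list of (instance_dict, class_label) pairs."""
    labels = [label for _, label in examples]
    # stop_criterion: pure node, or no tests left -> leaf(info(E))
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # best_test: attribute whose partition has the lowest weighted entropy
    def weighted_entropy(attr):
        parts = {}
        for inst, label in examples:
            parts.setdefault(inst[attr], []).append(label)
        return sum(len(p) / len(examples) * label_entropy(p)
                   for p in parts.values())
    best = min(attributes, key=weighted_entropy)
    # partition induced on E by the chosen test; one subtree per value
    partition = {}
    for example in examples:
        partition.setdefault(example[0][best], []).append(example)
    rest = [a for a in attributes if a != best]
    return (best, {v: grow_tree(sub, rest) for v, sub in partition.items()})

# Tiny data set invented for this sketch:
data = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
        ({"Outlook": "Overcast", "Wind": "Weak"}, "Yes"),
        ({"Outlook": "Rainy", "Wind": "Strong"}, "No"),
        ({"Outlook": "Rainy", "Wind": "Weak"}, "Yes")]
print(grow_tree(data, ["Outlook", "Wind"]))
```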
Popular systems: C4.5 (Quinlan 1993), C5.0
[Figure: two candidate tests on example set {1,3,4,7,8,12} — A1 partitions it into {1,4,12} and {3,7,8}; A2 partitions it into {1,3,7} and {4,8,12}.]
Another popular representation for concept definitions: if-then rules

IF <conditions> THEN belongs to concept
How can we learn such rules?

[Figure: 2-D scatter of + and − examples; each rule covers a region containing mostly + examples.]
if A and B then pos
if C and D then pos
Sequential covering: find a rule with high accuracy (its coverage may be anything), remove the examples it covers ("these have been taken care of"), and now focus on the rest.
function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold, do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := Examples \ {examples classified correctly by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  sort LearnedRules according to performance
  return LearnedRules
To learn one rule:
function LearnOneRule(Target, Attrs, Examples):
  NewRule := "IF true THEN pos"
  NewRuleNeg := Neg
  while NewRuleNeg not empty, do
    // add a new literal to the rule
    Candidates := generate candidate literals
    BestLit := argmax_{L ∈ Candidates} performance(Specialise(NewRule, L))
    NewRule := Specialise(NewRule, BestLit)
    NewRuleNeg := {x ∈ Neg | x covered by NewRule}
  return NewRule
function Specialise(Rule, Lit):
let Rule = “IF conditions THEN pos”
return “IF conditions and Lit THEN pos”
Top-down search specializes a rule step by step:

IF true THEN pos → IF A THEN pos → IF A & B THEN pos

[Figure: the covered region shrinks with each added literal; a second rule, IF C & D THEN pos, covers another group of + examples.]

Bottom-up search generalizes a rule step by step:

IF A & B THEN pos → IF C THEN pos → IF true THEN pos

[Figure: the covered region grows with each generalization step.]

Bottom-up: typically more specific rules. Top-down: typically more general rules.
Looking for a good rule in the format "IF A = ... THEN pos", where attribute A has values a, b, c, d.

[Figure: the examples grouped by their value of A; the candidate rules IF A=a THEN pos, IF A=b THEN pos, IF A=c THEN pos and IF A=d THEN pos each cover one group. The group with A = b contains the seed "+" and mostly + examples.]
Try only rules that cover the seed "+", which has A = b. Hence A = b is a reasonable test, and A = a is not. We do not try all 4 alternatives in this case, just one.
Approaches to Avoiding Overfitting
- Post-pruning
- Incremental Reduced Error Pruning

[Figure: post-pruning — in a tree with subtrees D1, D2 (children D21, D22) and D3, the subtree D2 is replaced by a leaf.]
Summary Points