Decision Trees Chapter 18 From Data to Knowledge
Concerns • Representational Bias: • Hyperrectangles – do they match the domain? • Generalization Accuracy • Is the learned concept correct? • Comprehensibility • e.g. medical diagnosis • Efficiency of Learning • Efficiency of the Learned Procedure
Simple Example: Weather Data • Four features: outlook, windy (nominal); temperature, humidity (numeric); class: play • Learned tree (J48-style output):
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
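To make the tree concrete, here is a minimal Python sketch that hard-codes the learned tree above as nested if/else tests; the function name classify_weather and the dict-based example format are illustrative assumptions, not part of the slides.

    # Hard-coded version of the learned weather tree (illustrative only).
    def classify_weather(example):
        # example is a dict, e.g. {"outlook": "sunny", "humidity": 70, "windy": False}
        if example["outlook"] == "sunny":
            return "yes" if example["humidity"] <= 75 else "no"
        elif example["outlook"] == "overcast":
            return "yes"
        else:  # rainy
            return "no" if example["windy"] else "yes"

    print(classify_weather({"outlook": "rainy", "humidity": 80, "windy": False}))  # -> yes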
Dumb DT Algorithm Build tree (discrete features only): If all examples below a node are homogeneous (all one class), stop. Else pick a feature at random, create a node for that feature, and form a subtree for each of the feature's values. Recurse on each subtree. Will this work?
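A minimal Python sketch of the dumb algorithm for discrete features; the representation (examples as dicts with a "class" key, tree as a nested dict) and the function names are assumptions made for illustration.

    import random

    def homogeneous(examples):
        # True if every example has the same class label.
        return len({ex["class"] for ex in examples}) <= 1

    def dumb_tree(examples, features):
        # Stop when the node is pure (or no features remain to split on).
        if homogeneous(examples) or not features:
            return examples[0]["class"] if examples else None
        f = random.choice(features)            # pick a feature at random
        rest = [g for g in features if g != f]
        values = {ex[f] for ex in examples}
        # One subtree per observed value of the chosen feature.
        return {f: {v: dumb_tree([ex for ex in examples if ex[f] == v], rest)
                    for v in values}}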
Properties of Dumb Algorithm • Complexity • Homogeneity test costs O(DataSize) • Splitting costs O(DataSize) • Times the number of nodes in the tree = bound on total work • Accuracy on training set • Perfect • Accuracy on test set • Not great – close to random
Many DT models • Random selection built a consistent tree – but many different trees fit the data • With n binary features the number of trees grows roughly as n · 2(n−1) · 2(n−2) · … = O(2^n · n!) UGH! • Which trees are best? • Occam’s razor: small ones (testable?) • Exhaustive search is impossible, so maybe heuristic search. But what heuristic? • Goal: replace random selection with heuristic selection
Heuristic DT algorithm • Entropy of a set S with mixed classes c1, c2, …, ck • Entropy(S) = − sum pi · lg(pi), where pi is the probability of class ci • Score a split by the weighted sum of the entropies of its subtrees, where each weight is the proportion of examples in that subtree • This defines a quality measure on features
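A small Python helper for the entropy formula above; the function name entropy and the list-of-labels input are illustrative assumptions.

    from math import log2
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum p_i * log2(p_i) over the classes present in S.
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())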
Heuristic score of a feature • Say a split on feature f yields (4+, 4−) and (1+, 3−): quality of f = 8/12·E({4+, 4−}) + 4/12·E({1+, 3−}) = 8/12·1 + 4/12·(−1/4·lg(1/4) − 3/4·lg(3/4)) ≈ 0.94 • Do this for every feature; the lower the weighted entropy, the better the split • J48 is roughly dumb + entropy heuristic
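A quick numerical check of the split score above, reusing the entropy helper from the previous slide; the weighted_entropy name is an assumption for illustration.

    def weighted_entropy(subsets):
        # Weight each subtree's entropy by its share of the examples.
        total = sum(len(s) for s in subsets)
        return sum(len(s) / total * entropy(s) for s in subsets)

    left  = ["+"] * 4 + ["-"] * 4   # (4+, 4-): entropy 1.0
    right = ["+"] * 1 + ["-"] * 3   # (1+, 3-): entropy ~0.811
    print(weighted_entropy([left, right]))   # ~0.937; lower is better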
Shannon Entropy • Entropy is the only function that: • Is 0 when only 1 class is present • Is k when 2^k classes are equally present • Is “additive”, i.e. E(X,Y) = E(X) + E(Y) if X and Y are independent • Entropy is sometimes called uncertainty and sometimes information • The uncertainty is defined on a random variable whose “draws” are from the set of classes
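A quick numerical check of these properties, again using the entropy helper defined earlier (a sketch, not part of the slides):

    # 0 for a pure set; k bits for 2^k equally likely classes.
    print(entropy([0, 0, 0, 0]))     # 0.0  (one class)
    print(entropy([0, 1]))           # 1.0  (2 classes)
    print(entropy([0, 1, 2, 3]))     # 2.0  (4 classes)

    # Additivity for independent X, Y: E(X, Y) = E(X) + E(Y).
    pairs = [(x, y) for x in [0, 1] for y in [0, 1, 2, 3]]
    print(entropy(pairs))            # 3.0 = 1.0 + 2.0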
Majority Function • Suppose 2n boolean features • The class is true when n or more of the features are on • How big is the tree? • At least (2n choose n) leaves • Prototype function: “at least k of n are true” is a common medical concept • Prototypical concepts do not match the representational bias of DTs
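To see how fast the (2n choose n) lower bound on leaves grows, a short Python check (a sketch; uses math.comb):

    from math import comb

    # Lower bound on the number of leaves for the n-of-2n majority concept.
    for n in [5, 10, 20, 50]:
        print(n, comb(2 * n, n))
    # 5 -> 252, 10 -> 184756, 20 -> ~1.4e11, 50 -> ~1e29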
DTs with real-valued attributes • Idea: convert to a solved problem • For each real-valued attribute f with sorted values v1, v2, …, vn, introduce binary features: f1: f < (v1+v2)/2, f2: f < (v2+v3)/2, etc. • Other approaches possible • E.g. fi: f < vj for any vj, so no sorting is needed
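A minimal Python sketch of the midpoint-threshold conversion for one numeric attribute; the function name and the sample values are illustrative assumptions.

    def candidate_thresholds(values):
        # Sort the observed values and take midpoints of adjacent distinct values.
        vs = sorted(set(values))
        return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

    print(candidate_thresholds([85, 80, 83, 70, 68]))
    # -> [69.0, 75.0, 81.5, 84.0]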
DTs -> Rules (PART) • For each leaf, make a rule by collecting the tests along the path to that leaf • Number of rules = number of leaves • Simplification: test each condition of a rule and see if dropping it harms accuracy • Can we go from rules back to DTs? • Not easily. Hint: no single root.
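A minimal Python sketch of the leaf-to-rule conversion, using the nested-dict tree representation from the earlier dumb-algorithm sketch (an assumption, not the slides' own notation):

    def tree_to_rules(tree, conditions=()):
        # A leaf is anything that is not a dict: emit one rule per leaf.
        if not isinstance(tree, dict):
            return [(list(conditions), tree)]
        (feature, branches), = tree.items()
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, conditions + ((feature, value),))
        return rules

    toy = {"outlook": {"sunny": {"windy": {True: "no", False: "yes"}},
                       "overcast": "yes"}}
    for conds, label in tree_to_rules(toy):
        print(" AND ".join(f"{f}={v}" for f, v in conds), "->", label)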
Summary • Comprehensible if the tree is not large • Effective if a small number of features suffices (bias) • Handles multi-class problems naturally • Easily generates rules (expert systems) • And measures of confidence (leaf counts) • Can be extended to regression • Easy to implement, with low complexity