
Decision Trees

Presentation Transcript


  1. Decision Trees

  2. Outline • What is a decision tree ? • How to construct a decision tree ? • What are the major steps in decision tree induction ? • How to select the attribute to split the node ? • What are the other issues ?

  3. Age? 30 >40 31…40 Credit? Student? YES YES YES NO NO excellent no yes fair Classification by Decision Tree Induction • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of test • Leaf nodes represent class labels or class distribution

  4. Training Dataset

  5. Output: A Decision Tree for “buy_computer” [Figure: the induced tree. Root tests Age? with branches <=30, 31…40, >40; the <=30 branch tests Student? (no / yes), the 31…40 branch is a YES leaf, and the >40 branch tests Credit? (excellent / fair); the remaining leaves carry YES/NO labels.]

  6. Outline • What is a decision tree ? • How to construct a decision tree ? • What are the major steps in decision tree induction ? • How to select the attribute to split the node ? • What are the other issues ?

  7. Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Attributes are categorical (if continuous-valued, they are discretized in advance) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all training examples are at the root • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on selected attributes
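
As a concrete companion to the outline above, here is a minimal sketch of the greedy, top-down, recursive, divide-and-conquer procedure. It is not the textbook's exact pseudocode: examples are assumed to be dicts of attribute -> value, attributes are assumed categorical, and the selection heuristic is passed in as choose_best_attribute (e.g. information gain, defined later in the deck).

```python
# A sketch of greedy top-down decision tree induction, as outlined above.
from collections import Counter

def build_tree(examples, attributes, target, choose_best_attribute):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # all examples in one class: a pure leaf
        return labels[0]
    if not attributes:                        # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_attribute(examples, attributes, target)   # heuristic selection
    tree = {best: {}}
    for value in sorted({ex[best] for ex in examples}):          # partition on the chosen attribute
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target, choose_best_attribute)
    return tree                               # internal nodes are nested dicts, leaves are labels
```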

  8. Training Dataset

  9. Construction of A Decision Tree for “buy_computer” [Figure: the root node holds all tuples [P1,…P14] (Yes: 9, No: 5); the splitting attribute is still to be chosen.]

  10. Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Attributes are categorical (if continuous-valued, they are discretized in advance) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all training examples are at the root • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on selected attributes

  11. Construction of A Decision Tree for “buy_computer” [Figure: the root node [P1,…P14] (Yes: 9, No: 5) is split on Age? into branches <=30, 31…40, >40.]

  12. Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Attributes are categorical (if continuous-valued, they are discretized in advance) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all training examples are at the root • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on selected attributes

  13. Training Dataset

  14. Construction of A Decision Tree for “buy_computer” [Figure: root [P1,…P14] Yes: 9, No: 5, split on Age?. Branch <=30 → [P1,P2,P8,P9,P11] Yes: 2, No: 3 (still to be split); branch 31…40 → [P3,P7,P12,P13] Yes: 4, No: 0 (YES leaf); branch >40 → [P4,P5,P6,P10,P14] Yes: 3, No: 2 (still to be split).]

  15. Construction of A Decision Tree for “buy_computer” [Figure: the tree from slide 14, with the <=30 branch now split on Student?: the yes branch → [P9,P11] Yes: 2, No: 0 (YES leaf); the no branch → [P1,P2,P8] Yes: 0, No: 3 (NO leaf). The >40 branch is still to be split.]

  16. Construction of A Decision Tree for “buy_computer” [Figure: the complete tree. The >40 branch is now split on Credit? (excellent / fair): one branch → [P4,P5,P10] Yes: 3, No: 0 (YES leaf), the other → [P6,P14] Yes: 0, No: 2 (NO leaf). All leaves are now pure, so construction stops. The per-branch Yes/No tallies are the bookkeeping sketched below.]
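
The Yes/No tallies that drive this trace are simple per-branch class counts. The helper below shows that bookkeeping only; the buy_computer tuples themselves are not reproduced in this transcript, so the usage line uses made-up tuples of the same shape, and the function name is illustrative.

```python
# Tally class counts for each branch produced by splitting on an attribute.
from collections import Counter

def class_counts_per_branch(examples, attribute, target):
    """Map each value of `attribute` to a Counter of class labels in that partition."""
    counts = {}
    for ex in examples:
        counts.setdefault(ex[attribute], Counter())[ex[target]] += 1
    return counts

# Made-up tuples, just to show the shape of the output:
toy = [{"age": "<=30", "buys_computer": "no"},
       {"age": "<=30", "buys_computer": "yes"},
       {"age": "31...40", "buys_computer": "yes"}]
print(class_counts_per_branch(toy, "age", "buys_computer"))
# {'<=30': Counter({'no': 1, 'yes': 1}), '31...40': Counter({'yes': 1})}
```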

  17. Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Attributes are categorical (if continuous-valued, they are discretized in advance) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all training examples are at the root • Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain) • Examples are partitioned recursively based on selected attributes

  18. Outline • What is a decision tree ? • How to construct a decision tree ? • What are the major steps in decision tree induction ? • How to select the attribute to split the node ? • What are the other issues ?

  19. Which Attribute is the Best? • The attribute most useful for classifying examples • Information gain • An information-theoretic approach • Measures how well an attribute separates the training examples • Use the attribute with the highest information gain to split • Minimizes the expected number of tests needed to classify a new tuple (How useful is the attribute? How well does it separate the examples? How pure is the splitting result? Information gain answers all three questions.)

  20. Choosing an attribute • Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" • In the restaurant example, Patrons? is a better choice than Type? (see the worked computation on slide 31)

  21. Information theory • If there are n equally probable possible messages, then the probability p of each is 1/n • The information conveyed by a message is -log2(p) = log2(n) bits • E.g., if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify/send each message • In general, if we are given a probability distribution P = (p1, p2, …, pn), then the information conveyed by the distribution (a.k.a. the entropy of P) is: I(P) = -(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn))

  22. Information theory II • Information conveyed by a distribution (a.k.a. the entropy of P): I(P) = -(p1*log2(p1) + p2*log2(p2) + … + pn*log2(pn)) • Examples: • If P is (0.5, 0.5) then I(P) is 1 • If P is (0.67, 0.33) then I(P) is about 0.92 • If P is (1, 0) then I(P) is 0 • The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred • Entropy is the average number of bits per message needed to represent a stream of messages
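
The three example values above can be checked directly. The sketch below is a straightforward transcription of I(P) with base-2 logarithms; the function name is illustrative.

```python
# A direct transcription of I(P) above, with base-2 logs; checks the example values.
import math

def info(p):
    """I(P) = -(p1*log2(p1) + ... + pn*log2(pn)), treating 0*log2(0) as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(info([0.5, 0.5]))             # 1.0
print(round(info([2/3, 1/3]), 2))   # 0.92  (the slide's (0.67, 0.33) case)
print(info([1.0, 0.0]))             # 0.0   (printed as -0.0, a floating-point artifact)
```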

  23. Information for classification • If a set S of records is partitioned into disjoint, exhaustive classes (C1, C2, …, Ck) on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is Info(S) = I(P), where P is the probability distribution of the partition (C1, C2, …, Ck): P = (|C1|/|S|, |C2|/|S|, …, |Ck|/|S|) [Figure: two partitions of S into classes C1, C2, C3: one dominated by a single class (low information) and one with the classes evenly mixed (high information).]

  24. Information, Entropy, and Information Gain • S contains si tuples of class Ci, for i = 1, …, m • Information measures “the amount of info” required to classify an arbitrary tuple: I(s1, s2, …, sm) = -(p1*log2(p1) + … + pm*log2(pm)), where pi = si/|S| is the probability that an arbitrary tuple belongs to Ci • Example: S contains 100 tuples, 25 belong to class C1 and 75 belong to class C2, so I(25, 75) = -(0.25*log2(0.25) + 0.75*log2(0.75)) ≈ 0.811
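
A quick check of the example, working directly from the class counts rather than probabilities (the helper name info_counts is illustrative):

```python
# I(25, 75): the information needed to classify a tuple drawn from S,
# computed from per-class counts with pi = si / |S|.
import math

def info_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(info_counts([25, 75]), 3))   # 0.811
```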

  25. Information, Entropy, and Information Gain • Information reflects the “purity” of the data set • A low information value indicates high purity • A high information value indicates high diversity • Example: S contains 100 tuples • If 0 belong to class C1 and 100 belong to class C2, then I(0, 100) = 0 (perfectly pure) • If 50 belong to class C1 and 50 belong to class C2, then I(50, 50) = 1 (maximally diverse)

  26. Information for classification II • If we partition S w.r.t. attribute X into sets {T1, T2, …, Tn}, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of Ti, i.e. the weighted average of Info(Ti): Info(X, S) = Σi (|Ti|/|S|) * Info(Ti) [Figure: two partitions: one whose subsets are each dominated by a single class (low information) and one whose subsets remain mixed (high information).]
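
The weighted-average formula above translates directly into code. In the sketch below, examples are again dicts of attribute -> value with the class label stored under `target`; both function names are illustrative.

```python
# Info(S) and the weighted information after partitioning S on an attribute.
import math
from collections import Counter

def info_of(examples, target):
    """Info(S): information needed to identify the class of an element of S."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def info_after_split(examples, attribute, target):
    """Info(X, S) = sum over i of (|Ti| / |S|) * Info(Ti), Ti being the subset with the i-th value of X."""
    total = len(examples)
    weighted = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        weighted += (len(subset) / total) * info_of(subset, target)
    return weighted
```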

  27. Information gain • Consider the quantity Gain(X, S) defined as Gain(X, S) = Info(S) - Info(X, S) • This is the difference between • the information needed to identify an element of S, and • the information needed to identify an element of S after the value of attribute X has been obtained That is, it is the gain in information due to attribute X • We can use this to rank attributes and to build decision trees in which each node tests the attribute with the greatest gain among the attributes not yet considered on the path from the root • The intent of this ordering is: • To create small decision trees, so that records can be classified after only a few questions • To match a hoped-for minimality of the process that generated the records (Occam’s Razor)
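
Gain is then just the difference of the two quantities. This short sketch reuses info_of and info_after_split from the sketch above (so it is not standalone); the name gain is illustrative.

```python
# Gain(X, S) = Info(S) - Info(X, S).
def gain(examples, attribute, target):
    """The reduction in information needed once the value of `attribute` is known."""
    return info_of(examples, target) - info_after_split(examples, attribute, target)

# Attribute selection at a node then becomes:
#   best = max(attributes, key=lambda a: gain(examples, a, target))
```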

  28. Information, Entropy, and Information Gain • S contains si tuples of class Ci, for i = 1, …, m • Attribute A has values {a1, a2, …, av} • Let sij be the number of tuples which • belong to class Ci, and • have value aj in attribute A • The entropy of attribute A is E(A) = Σj ((s1j + … + smj)/|S|) * I(s1j, …, smj) • The information gained by branching on attribute A is Gain(A) = I(s1, s2, …, sm) - E(A)

  29. Information, Entropy, and Information Gain • Let Tj be the set of tuples having value aj in attribute A • s1j + … + smj = |Tj| • I(s1j, …, smj) = I(Tj) • So the entropy of attribute A can be written E(A) = Σj (|Tj|/|S|) * I(Tj), where |Tj|/|S| is the proportion of S falling into the j-th subset and I(Tj) is the information of Tj

  30. Information, Entropy, and Information Gain • S contains 100 tuples: 40 belong to class C1 (red) and 60 belong to class C2 (blue), so I(40, 60) = 0.971 • A = a1: 20 tuples (10 in C1, 10 in C2), I(10, 10) = 1 • A = a2: 30 tuples (10 in C1, 20 in C2), I(10, 20) = 0.918 • A = a3: 50 tuples (20 in C1, 30 in C2), I(20, 30) = 0.971
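
The slide lists the per-branch information values; the weighted entropy E(A) and the resulting gain are not quoted there, so the sketch below computes them from the counts above (the branch-to-count mapping follows the slide).

```python
# Reproducing the per-branch values above and completing the calculation of E(A) and Gain(A).
import math

def info_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

branches = {"a1": [10, 10], "a2": [10, 20], "a3": [20, 30]}   # (C1, C2) counts per value of A
n = sum(sum(c) for c in branches.values())                    # 100 tuples in S

e_a = sum((sum(c) / n) * info_counts(c) for c in branches.values())
print(round(info_counts([40, 60]), 3))        # 0.971  -> I(40, 60)
print(round(e_a, 3))                          # 0.961  -> E(A)
print(round(info_counts([40, 60]) - e_a, 3))  # 0.010  -> Gain(A)
```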

  31. Computing information gain [Figure: the restaurant-waiting training set, 12 examples (6 Yes, 6 No), grouped by Patrons? (Empty / Some / Full) and by Type (French / Italian / Thai / Burger).] • I(S) = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 0.5 + 0.5 = 1 • I(Pat, S) = 1/6*(0) + 1/3*(0) + 1/2*(-(2/3*log2(2/3) + 1/3*log2(1/3))) ≈ 1/2*(2/3*0.6 + 1/3*1.6) ≈ 0.47 • I(Type, S) = 1/6*(1) + 1/6*(1) + 1/3*(1) + 1/3*(1) = 1 • Gain(Pat, S) = 1 - 0.47 = 0.53 • Gain(Type, S) = 1 - 1 = 0
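
The same figures can be re-derived with exact base-2 logarithms; the slide rounds log2(3) to 1.6 and log2(3/2) to 0.6, which is why it reports 0.47 and 0.53. In the sketch below the per-branch class counts are inferred from the fractions in the slide's calculation, not read from the original table.

```python
# Re-deriving the restaurant-example gains above with exact base-2 logs.
import math

def info_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

n = 12                                                        # 6 positive, 6 negative examples
i_s = info_counts([6, 6])                                     # I(S) = 1
patrons = {"Empty": [0, 2], "Some": [4, 0], "Full": [2, 4]}   # (yes, no) per Patrons? value
types = {"French": [1, 1], "Italian": [1, 1], "Thai": [2, 2], "Burger": [2, 2]}

i_pat = sum((sum(c) / n) * info_counts(c) for c in patrons.values())
i_type = sum((sum(c) / n) * info_counts(c) for c in types.values())
print(round(i_s - i_pat, 2))    # Gain(Pat, S)  ~ 0.54 (0.53 with the slide's rounding)
print(round(i_s - i_type, 2))   # Gain(Type, S) = 0.0
```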

  32. Regarding the Definition of Entropy… • The textbook defines entropy on page 134 (Eq. 3.6) and again on page 287 (Eq. 7.2) • This is a kind of polymorphism • When entropy is defined on a set of tuples, use Eq. 3.6 • When entropy is defined on an attribute, use Eq. 7.2

  33. How well does it work? Many case studies have shown that decision trees are at least as accurate as human experts. • In a study on diagnosing breast cancer, humans correctly classified the examples 65% of the time; the decision tree classified 72% correctly • British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system • Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example

  34. Outline • What is a decision tree ? • How to construct a decision tree ? • What are the major steps in decision tree induction ? • How to select the attribute to split the node ? • What are the other issues ?

  35. Extracting Classification Rules from Trees • Represent knowledge in the form of IF-THEN rules • One rule is created for each path from root to a leaf • Each attribute-value pair along a path forms a conjunction • Leaf node holds class prediction • Rules are easier for humans to understand
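
One IF-THEN rule per root-to-leaf path is easy to generate mechanically. The sketch below uses the nested-dict tree format from the earlier induction sketch (so the format is an assumption, not the slides'), and the example tree mirrors the rules listed on the next slide.

```python
# Sketch of rule extraction: one rule per root-to-leaf path, conditions conjoined with AND.
def extract_rules(tree, target="buys_computer", conditions=()):
    if not isinstance(tree, dict):                       # leaf: emit the rule for this path
        ifs = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        return [f'IF {ifs} THEN {target} = "{tree}"']
    rules = []
    (attribute, branches), = tree.items()                # internal node: a single test attribute
    for value, subtree in branches.items():              # one set of rules per outgoing branch
        rules += extract_rules(subtree, target, conditions + ((attribute, value),))
    return rules

# Example tree matching the rules on the next slide:
tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31...40": "yes",
                ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}
for rule in extract_rules(tree):
    print(rule)   # e.g. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
```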

  36. Age? 30 >40 31…40 Credit? Student? YES YES YES NO NO excellent no yes fair Examples of Classification Rules • Classification rules: 1. IF age = “<=30” AND student = “no” THEN buys_computer = “no” 2. IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” 3. IF age = “31…40” THEN buys_computer = “yes” 4. IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” 5. IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

  37. Avoid Over-fitting in Classification • The generated tree may over-fit the training data • Too many branches, some of which may reflect anomalies due to noise or outliers • The result is poor accuracy on unseen samples • Two approaches to avoiding over-fitting • Pre-pruning: halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold • It is difficult to choose an appropriate threshold • Post-pruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree”

  38. Enhancements to basic decision tree induction • Dynamic discretization of continuous-valued attributes • Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals • Handling missing attribute values • Assign the most common value of the attribute • Assign a probability to each of the possible values • Attribute construction • Create new attributes based on existing ones that are sparsely represented • Reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)

  39. Classification in Large Databases • Classification—a classical problem extensively studied by statisticians and machine learning researchers • Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed • Why decision tree induction in data mining? • relatively faster learning speed (than other classification methods) • convertible to simple and easy to understand classification rules • can use SQL queries for accessing databases • comparable classification accuracy with other methods

  40. Scalable Decision Tree Induction Methods • SLIQ (EDBT’96 — Mehta et al.) • Build an index for each attribute and only class list and the current attribute list reside in memory • SPRINT (VLDB’96 — J. Shafer et al.) • constructs an attribute list data structure • PUBLIC (VLDB’98 — Rastogi & Shim) • integrates tree splitting and tree pruning: stop growing the tree earlier • RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) • separates the scalability aspects from the criteria that determine the quality of the tree • builds an AVC-list (attribute, value, class label)

  41. Summary • What is a decision tree ? • A flow-chart-like tree: internal nodes, branches, and leaf nodes • How to construct a decision tree ? • What are the major steps in decision tree induction ? • Test attribute selection • Sample partition • How to select the attribute to split the node ? • Select the attribute with the highest information gain • Calculate the information of the node • Calculate the entropy of the attribute • Calculate the difference between the information and entropy • What are the other issues ?
