
Decision Trees


Presentation Transcript


  1. Decision Trees Greg Grudic (Notes borrowed from Thomas G. Dietterich and Tom Mitchell) [Edited by J. Wiebe]

  2. Outline • Decision Tree Representations • ID3 and C4.5 learning algorithms (Quinlan 1986) • CART learning algorithm (Breiman et al. 1984) • Entropy, Information Gain • Overfitting

  3. Training Data Example: Goal is to predict when this player will play tennis
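
The table on this slide is presumably the classic PlayTennis training data (the deck borrows from Tom Mitchell's notes). For the code sketches added throughout this transcript, assume the examples are stored as a list of dicts; the rows below are illustrative only, not the slide's full table:

    # A few illustrative PlayTennis-style examples (the slide shows the full table).
    rows = [
        {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
        {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
        {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
        {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    ]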

  4. Decision Trees

  5. Decision Trees

  6. Decision Trees

  7. Decision Trees

  8. Learning Algorithm for Decision Trees
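
The algorithm on this slide is not transcribed above. As a rough stand-in, here is a minimal sketch of the top-down, ID3-style recursion; the attribute-selection criterion is passed in as a function because it is only developed on the following slides, and all names are my own rather than the slide's:

    from collections import Counter

    def grow_tree(rows, attributes, target, choose_attribute):
        """Grow a tree top-down; leaves are class labels, internal nodes are dicts."""
        labels = [row[target] for row in rows]
        # Stop when the node is pure or no attributes remain: return a majority-label leaf.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = choose_attribute(rows, attributes, target)
        subtree = {}
        for value in set(row[best] for row in rows):
            subset = [row for row in rows if row[best] == value]
            remaining = [a for a in attributes if a != best]
            subtree[value] = grow_tree(subset, remaining, target, choose_attribute)
        return {best: subtree}

With the illustrative rows after slide 3 and the gain-based chooser sketched after slide 21, grow_tree(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis", best_attribute) would return a nested dict of tests.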

  9. Choosing the Best Attribute • A1 and A2 are “attributes” (i.e. features or inputs). Count the + and – examples before and after a split. • Many different frameworks for choosing BEST have been proposed! • We will look at Entropy Gain (code sketches follow slides 19 and 21 below).

  10. Entropy

  11. Entropy is like a measure of purity…

  12. Entropy
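
The formula on these slides is not transcribed above. For reference, here is a minimal Python helper (my own sketch, not code from the deck) that computes the entropy of a list of class labels, matching the standard definition H = -sum_i p_i * log2(p_i):

    # A minimal entropy helper over a list of class labels (sketch, not deck code).
    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = sum over classes of p * log2(1/p)   (equivalently -sum of p * log2 p)."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    print(entropy(["+"] * 9 + ["-"] * 5))   # mixed 9+/5- set: about 0.940 bits
    print(entropy(["+"] * 14))              # pure set: 0.0 bits (no uncertainty)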

  13. Entropy & Bits
  • You are watching a sequence of independent random samples of X
  • X has 4 possible values: P(X=A)=1/4, P(X=B)=1/4, P(X=C)=1/4, P(X=D)=1/4
  • You get a string of symbols ACBABBCDADDC…
  • To transmit the data over a binary link you can encode each symbol with two bits (A=00, B=01, C=10, D=11)
  • You need 2 bits per symbol
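
A quick check of the 2-bits-per-symbol claim, computed directly from the entropy definition (my own snippet):

    >>> from math import log2
    >>> probs = [1/4, 1/4, 1/4, 1/4]          # P(X=A) ... P(X=D)
    >>> sum(p * log2(1/p) for p in probs)     # expected bits per symbol
    2.0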

  14. Fewer Bits – example
  • Now someone tells you the probabilities are not equal: P(X=A)=1/2, P(X=B)=1/4, P(X=C)=1/8, P(X=D)=1/8
  • Now it is possible to find an encoding that averages 1.75 bits per symbol. In Python (math.log2 gives the base-2 logarithm):
    >>> from math import log2
    >>> -log2(0.5)
    1.0
    >>> -log2(0.25)
    2.0
    >>> -log2(1/8.0)
    3.0
    >>> 1*0.5 + 2*0.25 + 3*(1/8.0) + 3*(1/8.0)
    1.75
  • Use more bits for the less probable symbols; we expect those to appear less often.

  15. Reality
  • Of course, we can’t use partial bits, so the specific numbers are theoretical numbers only
  • Common encoding method: Huffman coding (from a 1951 class project at MIT!)
  • In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman’s professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer’s memory.
  • Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.”
  • Huffman says he might never have tried his hand at the problem—much less solved it at the age of 25—if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says.
  • An optimal encoding exists whose expected code length per symbol is between H(X) and H(X) + 1.

  16. A simple example • Suppose we have a message built from an alphabet of 5 symbols, e.g. [►♣♣♠☻►♣☼►☻] • How can we code this message using 0/1 so the coded message will have minimum length (for transmission or saving!)? • 5 distinct symbols → at least 3 bits per symbol (2 bits can only distinguish 4 symbols) • For a simple fixed-length encoding, the length of the coded message (10 symbols) is 10*3 = 30 bits

  17. A simple example – cont. • Intuition: symbols that are more frequent should get shorter codes, yet since the code lengths are not all the same, there must still be a way of distinguishing each code • With a Huffman code, the length of the encoded message ►♣♣♠☻►♣☼►☻ is 3*2 + 3*2 + 2*2 + 1*3 + 1*3 = 22 bits
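
Slide 18 below notes that the algorithm itself isn't covered here; purely as a sanity check on the 22-bit count, a minimal Huffman sketch using Python's heapq could look like the following (the letters stand in for the card symbols, and all names are mine):

    import heapq
    from collections import Counter

    def huffman_code_lengths(freqs):
        """Return {symbol: code length in bits} for a dict of symbol frequencies."""
        # Heap entries: (total frequency, tie-breaker, {symbol: depth in the tree}).
        heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            # Merging two subtrees pushes every symbol in them one level deeper.
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    message = "RCCSBRCURB"   # stand-ins: R=►, C=♣, S=♠, B=☻, U=☼
    lengths = huffman_code_lengths(Counter(message))
    print(sum(lengths[sym] for sym in message))   # 22 bits in total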

  18. Huffman Coding • We won’t cover the algorithm here (perhaps you covered it in a systems course?) • This was to give you an idea • Information theory comes up in many (all?) areas of CS

  19. Information Gain
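
The slide body is not transcribed above. A minimal sketch of information gain over the list-of-dicts representation assumed after slide 3 (the entropy helper is repeated so the snippet stands alone; names are mine, not the deck's):

    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = sum over classes of p * log2(1/p)   (equivalently -sum of p * log2 p)."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def information_gain(rows, attribute, target):
        """Parent entropy minus the size-weighted entropy of the children after splitting."""
        parent_labels = [row[target] for row in rows]
        gain = entropy(parent_labels)
        for value in set(row[attribute] for row in rows):
            child_labels = [row[target] for row in rows if row[attribute] == value]
            gain -= (len(child_labels) / len(rows)) * entropy(child_labels)
        return gain

For reference, on Mitchell's full 14-example PlayTennis table this gives roughly 0.246 for Outlook, the highest of the four attributes, which is why Outlook ends up at the root.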

  20. Training Example

  21. Selecting the Next Attribute
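
Selecting the next attribute then amounts to computing the gain of every remaining attribute and taking the argmax. A short sketch reusing information_gain from the block above (the attribute and target names are assumptions based on the PlayTennis example):

    def best_attribute(rows, attributes, target):
        """Pick the attribute with the highest information gain at this node."""
        return max(attributes, key=lambda a: information_gain(rows, a, target))

    # e.g. best_attribute(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis")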

  22. Decision Trees

  23. Non-Boolean Features
  • Features with multiple discrete values: use multi-way splits
  • Real-valued features: use thresholds
  • Regression: Segaran scores a node by the variance from the mean:
    mean = sum(data) / len(data)
    return sum([(d - mean) ** 2 for d in data]) / len(data)
  • Idea: a high variance means the numbers are widely dispersed; a low variance means the numbers are close together. We’ll look at how this is used in his code later.
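
To make the regression case concrete, here is a small sketch (my own helper names, not Segaran's code) that scores a threshold split on a real-valued feature by the weighted variance of the target values on each side; lower is better:

    # Sketch: score a threshold split for regression by weighted variance.
    # `rows` are (feature_value, target) pairs; names are illustrative only.
    def variance(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def threshold_split_score(rows, threshold):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        n = len(rows)
        # Weighted average of the two sides' variances (lower is better).
        return (len(left) / n) * variance(left) + (len(right) / n) * variance(right)

    data = [(2.0, 10.0), (3.5, 11.0), (6.0, 30.0), (8.0, 33.0)]
    print(threshold_split_score(data, 5.0))   # small: each side's targets are homogeneous
    print(threshold_split_score(data, 3.0))   # larger: one side mixes low and high targets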

  24. Hypothesis Space Search • You do not get the globally optimal tree! • The search space is exponential.

  25. Overfitting

  26. Overfitting in Decision Trees

  27. Development Data is Used to Control Overfitting
  • Prune the tree to reduce error on a validation set
  • Segaran: start at the leaves
  • Create a combined data set from sibling leaves; suppose there are two, called tb and fb
  • delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
  • If delta < mingain (a parameter): merge the branches
  • Then return up one level of recursion and consider further merging of branches
  • Note: Segaran just uses the training data (a code sketch follows below)
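
A minimal sketch of the merge-if-low-gain pruning described above, assuming a node object with tb/fb branches and, at leaves, a results dict of label counts (this mirrors the description, not Segaran's verbatim code; the entropy helper is repeated so the snippet stands alone):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def prune(node, mingain):
        """Merge sibling leaves whose split gains less than `mingain` entropy."""
        # Recurse first so that pruning proceeds bottom-up from the leaves.
        if node.tb.results is None:
            prune(node.tb, mingain)
        if node.fb.results is None:
            prune(node.fb, mingain)

        if node.tb.results is not None and node.fb.results is not None:
            # Rebuild the label lists that reached each sibling leaf.
            tb = list(Counter(node.tb.results).elements())
            fb = list(Counter(node.fb.results).elements())
            # Combined entropy minus the average entropy of the two leaves.
            delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
            if delta < mingain:
                # The split buys too little: collapse it into a single leaf.
                node.tb, node.fb = None, None
                node.results = dict(Counter(tb + fb))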

  28. Segaran’s Trees • Same ideas, but a different structure • Each node corresponds to a test: Attr_i == Value_j? Yes / No • In lecture: examples of his trees and how they are built (a node sketch follows below)
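
A sketch of what such a binary-test node and its row-splitting helper might look like (my own names, mirroring the Attr_i == Value_j idea above rather than Segaran's verbatim code):

    # Node with a single column/value test, a true branch and a false branch.
    class Node:
        def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
            self.col = col          # index of the attribute being tested
            self.value = value      # the value the test compares against
            self.results = results  # at a leaf: dict of label -> count; else None
            self.tb = tb            # branch followed when the test is true
            self.fb = fb            # branch followed when the test is false

    def divide_rows(rows, col, value):
        """Split rows by the test rows[col] == value (>= value for numeric values)."""
        if isinstance(value, (int, float)):
            test = lambda row: row[col] >= value
        else:
            test = lambda row: row[col] == value
        true_rows = [row for row in rows if test(row)]
        false_rows = [row for row in rows if not test(row)]
        return true_rows, false_rows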
