
Decision Trees


Presentation Transcript


  1. Decision Trees Greg Grudic (Notes borrowed from Thomas G. Dietterich and Tom Mitchell) [Edited by J. Wiebe]

  2. Outline • Decision Tree Representations • ID3 and C4.5 learning algorithms (Quinlan 1986) • CART learning algorithm (Breiman et al. 1984) • Entropy, Information Gain • Overfitting

  3. Training Data Example: Goal is to predict when this player will play tennis
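
The table on this slide is presumably the classic PlayTennis training data (the deck borrows from Tom Mitchell's notes). For the code sketches added throughout this transcript, assume the examples are stored as a list of dicts; the rows below are illustrative only, not the slide's full table:

    # A few illustrative PlayTennis-style examples (the slide shows the full table).
    rows = [
        {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
        {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
        {"Outlook": "Rain",     "Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
        {"Outlook": "Rain",     "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    ]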

  4. Decision Trees

  5. Decision Trees

  6. Decision Trees

  7. Decision Trees

  8. Learning Algorithm for Decision Trees
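
The algorithm on this slide is not transcribed above. As a rough stand-in, here is a minimal sketch of the top-down, ID3-style recursion; the attribute-selection criterion is passed in as a function because it is only developed on the following slides, and all names are my own rather than the slide's:

    from collections import Counter

    def grow_tree(rows, attributes, target, choose_attribute):
        """Grow a tree top-down; leaves are class labels, internal nodes are dicts."""
        labels = [row[target] for row in rows]
        # Stop when the node is pure or no attributes remain: return a majority-label leaf.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = choose_attribute(rows, attributes, target)
        subtree = {}
        for value in set(row[best] for row in rows):
            subset = [row for row in rows if row[best] == value]
            remaining = [a for a in attributes if a != best]
            subtree[value] = grow_tree(subset, remaining, target, choose_attribute)
        return {best: subtree}

With the illustrative rows after slide 3 and the gain-based chooser sketched after slide 21, grow_tree(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis", best_attribute) would return a nested dict of tests.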

  9. Choosing the Best Attribute • A1 and A2 are “attributes” (i.e. features or inputs). Count the + and – examples before and after a split. • Many different frameworks for choosing BEST have been proposed! • We will look at Entropy Gain (code sketches follow slides 19 and 21 below).

  10. Entropy

  11. Entropy is like a measure of purity…

  12. Entropy
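
The formula on these slides is not transcribed above. For reference, here is a minimal Python helper (my own sketch, not code from the deck) that computes the entropy of a list of class labels, matching the standard definition H = -sum_i p_i * log2(p_i):

    # A minimal entropy helper over a list of class labels (sketch, not deck code).
    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = sum over classes of p * log2(1/p)   (equivalently -sum of p * log2 p)."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    print(entropy(["+"] * 9 + ["-"] * 5))   # mixed 9+/5- set: about 0.940 bits
    print(entropy(["+"] * 14))              # pure set: 0.0 bits (no uncertainty)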

  13. Entropy & Bits
  • You are watching a sequence of independent random samples of X
  • X has 4 possible values: P(X=A)=1/4, P(X=B)=1/4, P(X=C)=1/4, P(X=D)=1/4
  • You get a string of symbols ACBABBCDADDC…
  • To transmit the data over a binary link you can encode each symbol with two bits (A=00, B=01, C=10, D=11)
  • You need 2 bits per symbol
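
A quick check of the 2-bits-per-symbol claim, computed directly from the entropy definition (my own snippet):

    >>> from math import log2
    >>> probs = [1/4, 1/4, 1/4, 1/4]          # P(X=A) ... P(X=D)
    >>> sum(p * log2(1/p) for p in probs)     # expected bits per symbol
    2.0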

  14. Fewer Bits – example
  • Now someone tells you the probabilities are not equal: P(X=A)=1/2, P(X=B)=1/4, P(X=C)=1/8, P(X=D)=1/8
  • Now it is possible to find an encoding that averages 1.75 bits per symbol. In Python (math.log2 gives the base-2 logarithm):
    >>> from math import log2
    >>> -log2(0.5)
    1.0
    >>> -log2(0.25)
    2.0
    >>> -log2(1/8.0)
    3.0
    >>> 1*0.5 + 2*0.25 + 3*(1/8.0) + 3*(1/8.0)
    1.75
  • Use more bits for the less probable symbols; we expect those to appear less often.

  15. Reality
  • Of course, we can’t use partial bits, so the specific numbers are theoretical numbers only
  • Common encoding method: Huffman coding (from a 1951 class project at MIT!)
  • In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman’s professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer’s memory.
  • Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.”
  • Huffman says he might never have tried his hand at the problem—much less solved it at the age of 25—if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says.
  • An optimal encoding exists whose expected code length per symbol is between H(X) and H(X) + 1.

  16. A simple example • Suppose we have a message built from an alphabet of 5 symbols, e.g. [►♣♣♠☻►♣☼►☻] • How can we code this message using 0/1 so the coded message will have minimum length (for transmission or saving!)? • 5 distinct symbols → at least 3 bits per symbol (2 bits can only distinguish 4 symbols) • For a simple fixed-length encoding, the length of the coded message (10 symbols) is 10*3 = 30 bits

  17. A simple example – cont. • Intuition: symbols that are more frequent should get shorter codes, yet since the code lengths are not all the same, there must still be a way of distinguishing each code • With a Huffman code, the length of the encoded message ►♣♣♠☻►♣☼►☻ is 3*2 + 3*2 + 2*2 + 1*3 + 1*3 = 22 bits
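
Slide 18 below notes that the algorithm itself isn't covered here; purely as a sanity check on the 22-bit count, a minimal Huffman sketch using Python's heapq could look like the following (the letters stand in for the card symbols, and all names are mine):

    import heapq
    from collections import Counter

    def huffman_code_lengths(freqs):
        """Return {symbol: code length in bits} for a dict of symbol frequencies."""
        # Heap entries: (total frequency, tie-breaker, {symbol: depth in the tree}).
        heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)
            f2, _, d2 = heapq.heappop(heap)
            # Merging two subtrees pushes every symbol in them one level deeper.
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    message = "RCCSBRCURB"   # stand-ins: R=►, C=♣, S=♠, B=☻, U=☼
    lengths = huffman_code_lengths(Counter(message))
    print(sum(lengths[sym] for sym in message))   # 22 bits in total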

  18. Huffman Coding • We won’t cover the algorithm here (perhaps you covered it in a systems course?) • This was to give you an idea • Information theory comes up in many (all?) areas of CS

  19. Information Gain
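
The slide body is not transcribed above. A minimal sketch of information gain over the list-of-dicts representation assumed after slide 3 (the entropy helper is repeated so the snippet stands alone; names are mine, not the deck's):

    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = sum over classes of p * log2(1/p)   (equivalently -sum of p * log2 p)."""
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def information_gain(rows, attribute, target):
        """Parent entropy minus the size-weighted entropy of the children after splitting."""
        parent_labels = [row[target] for row in rows]
        gain = entropy(parent_labels)
        for value in set(row[attribute] for row in rows):
            child_labels = [row[target] for row in rows if row[attribute] == value]
            gain -= (len(child_labels) / len(rows)) * entropy(child_labels)
        return gain

For reference, on Mitchell's full 14-example PlayTennis table this gives roughly 0.246 for Outlook, the highest of the four attributes, which is why Outlook ends up at the root.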

  20. Training Example

  21. Selecting the Next Attribute
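
Selecting the next attribute then amounts to computing the gain of every remaining attribute and taking the argmax. A short sketch reusing information_gain from the block above (the attribute and target names are assumptions based on the PlayTennis example):

    def best_attribute(rows, attributes, target):
        """Pick the attribute with the highest information gain at this node."""
        return max(attributes, key=lambda a: information_gain(rows, a, target))

    # e.g. best_attribute(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis")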

  22. Decision Trees

  23. Non-Boolean Features
  • Features with multiple discrete values: use multi-way splits
  • Real-valued features: use thresholds
  • Regression: Segaran scores a node by the variance from the mean:
    mean = sum(data) / len(data)
    return sum([(d - mean) ** 2 for d in data]) / len(data)
  • Idea: a high variance means the numbers are widely dispersed; a low variance means the numbers are close together. We’ll look at how this is used in his code later.
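
To make the regression case concrete, here is a small sketch (my own helper names, not Segaran's code) that scores a threshold split on a real-valued feature by the weighted variance of the target values on each side; lower is better:

    # Sketch: score a threshold split for regression by weighted variance.
    # `rows` are (feature_value, target) pairs; names are illustrative only.
    def variance(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def threshold_split_score(rows, threshold):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        n = len(rows)
        # Weighted average of the two sides' variances (lower is better).
        return (len(left) / n) * variance(left) + (len(right) / n) * variance(right)

    data = [(2.0, 10.0), (3.5, 11.0), (6.0, 30.0), (8.0, 33.0)]
    print(threshold_split_score(data, 5.0))   # small: each side's targets are homogeneous
    print(threshold_split_score(data, 3.0))   # larger: one side mixes low and high targets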

  24. Hypothesis Space Search • You do not get the globally optimal tree! • The search space is exponential.

  25. Overfitting

  26. Overfitting in Decision Trees

  27. Development Data is Used to Control Overfitting
  • Prune the tree to reduce error on a validation set
  • Segaran: start at the leaves
  • Create a combined data set from sibling leaves; suppose there are two, called tb and fb
  • delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
  • If delta < mingain (a parameter): merge the branches
  • Then return up one level of recursion and consider further merging of branches
  • Note: Segaran just uses the training data (a code sketch follows below)
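
A minimal sketch of the merge-if-low-gain pruning described above, assuming a node object with tb/fb branches and, at leaves, a results dict of label counts (this mirrors the description, not Segaran's verbatim code; the entropy helper is repeated so the snippet stands alone):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def prune(node, mingain):
        """Merge sibling leaves whose split gains less than `mingain` entropy."""
        # Recurse first so that pruning proceeds bottom-up from the leaves.
        if node.tb.results is None:
            prune(node.tb, mingain)
        if node.fb.results is None:
            prune(node.fb, mingain)

        if node.tb.results is not None and node.fb.results is not None:
            # Rebuild the label lists that reached each sibling leaf.
            tb = list(Counter(node.tb.results).elements())
            fb = list(Counter(node.fb.results).elements())
            # Combined entropy minus the average entropy of the two leaves.
            delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
            if delta < mingain:
                # The split buys too little: collapse it into a single leaf.
                node.tb, node.fb = None, None
                node.results = dict(Counter(tb + fb))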

  28. Segaran’s Trees • Same ideas, but a different structure • Each node corresponds to a test: Attr_i == Value_j? Yes / No • In lecture: examples of his trees and how they are built (a node sketch follows below)
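
A sketch of what such a binary-test node and its row-splitting helper might look like (my own names, mirroring the Attr_i == Value_j idea above rather than Segaran's verbatim code):

    # Node with a single column/value test, a true branch and a false branch.
    class Node:
        def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
            self.col = col          # index of the attribute being tested
            self.value = value      # the value the test compares against
            self.results = results  # at a leaf: dict of label -> count; else None
            self.tb = tb            # branch followed when the test is true
            self.fb = fb            # branch followed when the test is false

    def divide_rows(rows, col, value):
        """Split rows by the test rows[col] == value (>= value for numeric values)."""
        if isinstance(value, (int, float)):
            test = lambda row: row[col] >= value
        else:
            test = lambda row: row[col] == value
        true_rows = [row for row in rows if test(row)]
        false_rows = [row for row in rows if not test(row)]
        return true_rows, false_rows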
