Decision Tree

Presentation Transcript


  1. Decision Tree Saed Sayad

  2. Decision Tree (Mitchell 97) • Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

  3. Dataset

  4. Decision Tree
     Outlook → Sunny → Humidity (High → No, Normal → Yes)
     Outlook → Overcast → Yes
     Outlook → Rain → Wind (Strong → No, Weak → Yes)
     (Outlook, Humidity and Wind are attribute nodes; their values are value nodes; Yes/No are leaf nodes)

  5. Frequency Tables
     Play | Count  | Proportion
     Yes  | 9 / 14 | 0.64
     No   | 5 / 14 | 0.36

  6. Frequency Tables (Play)
     Outlook  | No | Yes
     Sunny    |  3 |  2
     Overcast |  0 |  4
     Rainy    |  2 |  3

     Temp     | No | Yes
     Hot      |  2 |  2
     Mild     |  2 |  4
     Cool     |  1 |  3

     Humidity | No | Yes
     High     |  4 |  3
     Normal   |  1 |  6

     Windy    | No | Yes
     False    |  2 |  6
     True     |  3 |  3

  7. Entropy Entropy(S) = - p log2 p - q log2 q • Entropy measures the impurity of S • S is a set of examples • p is the proportion of positive examples • q is the proportion of negative examples

  8. Entropy: One Variable
     Play: Yes = 9 / 14 = 0.64, No = 5 / 14 = 0.36
     Entropy(Play) = -p log2 p - q log2 q = -(0.64 * log2 0.64) - (0.36 * log2 0.36) = 0.94

  9. Entropy: One Variable • Example: • Entropy(5,3,2) = - (0.5 * log2 0.5) - (0.3 * log2 0.3) - (0.2 * log2 0.2) • = 1.49
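A minimal Python sketch of the entropy calculation above; the function name entropy and the use of math.log2 are choices made here, not from the slides.

    import math

    def entropy(counts):
        """Entropy of a class distribution given as raw counts."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    print(round(entropy([9, 5]), 2))     # Play = 9 Yes / 5 No  ->  0.94
    print(round(entropy([5, 3, 2]), 2))  # three classes        ->  1.49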

  10. Entropy: Two Variables
      Play by Outlook | No | Yes | Subset size
      Sunny           |  3 |  2  | 5
      Overcast        |  0 |  4  | 4
      Rainy           |  2 |  3  | 5
      Size of the set = 14
      E(Play, Outlook) = (5/14)*0.971 + (4/14)*0.0 + (5/14)*0.971 = 0.693

  11. Information Gain Gain(S, A) = E(S) – E(S, A) Example: Gain(Play,Outlook) = 0.940 – 0.693 = 0.247

  12. Selecting The Root Node
      Play = [9+, 5-], E = 0.940
      Outlook: Sunny [2+, 3-] E = 0.971; Overcast [4+, 0-] E = 0.0; Rain [3+, 2-] E = 0.971
      Gain(Play, Outlook) = 0.940 - ((5/14)*0.971 + (4/14)*0.0 + (5/14)*0.971) = 0.247

  13. Selecting The Root Node …
      Play = [9+, 5-], E = 0.940
      Temp: Hot [2+, 2-] E = 1.0; Mild [4+, 2-] E = 0.918; Cool [3+, 1-] E = 0.811
      Gain(Play, Temp) = 0.940 - ((4/14)*1.0 + (6/14)*0.918 + (4/14)*0.811) = 0.029

  14. Selecting The Root Node …
      Play = [9+, 5-], E = 0.940
      Humidity: High [3+, 4-] E = 0.985; Normal [6+, 1-] E = 0.592
      Gain(Play, Humidity) = 0.940 - ((7/14)*0.985 + (7/14)*0.592) = 0.152

  15. Selecting The Root Node …
      Play = [9+, 5-], E = 0.940
      Windy: Weak [6+, 2-] E = 0.811; Strong [3+, 3-] E = 1.0
      Gain(Play, Windy) = 0.940 - ((8/14)*0.811 + (6/14)*1.0) = 0.048

  16. Selecting The Root Node …
      Play: Outlook Gain = 0.247; Temperature Gain = 0.029; Humidity Gain = 0.152; Windy Gain = 0.048
      Outlook has the largest gain, so it is selected as the root node (a computational check follows below).
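A hedged Python sketch that reproduces the gain comparison above from the frequency tables on slide 6; the helper names (entropy, gain) and the dictionary layout are choices made here.

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def gain(class_counts, subsets):
        """Information gain = E(S) minus the size-weighted entropy of the subsets."""
        total = sum(class_counts)
        remainder = sum(sum(s) / total * entropy(s) for s in subsets)
        return entropy(class_counts) - remainder

    play = [9, 5]  # [Yes, No]
    tables = {     # [Yes, No] counts per attribute value, from the frequency tables
        "Outlook":     [[2, 3], [4, 0], [3, 2]],
        "Temperature": [[2, 2], [4, 2], [3, 1]],
        "Humidity":    [[3, 4], [6, 1]],
        "Windy":       [[6, 2], [3, 3]],
    }
    for name, subsets in tables.items():
        print(name, round(gain(play, subsets), 3))
    # Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048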

  17. Decision Tree - Classification
      Outlook → Sunny → Humidity (High → No, Normal → Yes)
      Outlook → Overcast → Yes
      Outlook → Rain → Wind (Strong → No, Weak → Yes)
      (Outlook, Humidity and Wind are attribute nodes; their values are value nodes; Yes/No are leaf nodes)

  18. ID3 Algorithm
      1. A ← the “best” decision attribute for the next node
      2. Assign A as the decision attribute for the node
      3. For each value of A, create a new descendant
      4. Sort the training examples to the leaf nodes according to the attribute value of their branch
      5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
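A compact recursive sketch of the loop above, for illustration only; representing examples as ({attribute: value}, label) pairs and all helper names are choices made here, not from the slides.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, attr):
        labels = [y for _, y in examples]
        remainder = 0.0
        for value in {x[attr] for x, _ in examples}:
            subset = [y for x, y in examples if x[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    def id3(examples, attributes):
        """examples: list of ({attribute: value}, label) pairs."""
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                      # all examples perfectly classified
            return labels[0]
        if not attributes:                             # nothing left to split on: majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: info_gain(examples, a))    # step 1: pick the "best" attribute
        tree = {best: {}}                              # step 2: assign it as the decision attribute
        for value in {x[best] for x, _ in examples}:   # step 3: one descendant per value
            subset = [(x, y) for x, y in examples if x[best] == value]  # step 4: sort examples to the branch
            tree[best][value] = id3(subset, [a for a in attributes if a != best])
        return tree                                    # step 5: recurse over the new leaf nodes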

  19. Converting Tree to Rules
      One rule per path from the root to a leaf of the tree on slide 17:
      R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
      R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
      R3: IF (Outlook=Overcast) THEN Play=Yes
      R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
      R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes

  20. Decision Tree - Regression

  21. Dataset (numeric target: Players)

  22. Decision Tree - Regression
      Outlook → Sunny → Humidity (High → 30, Normal → 45)
      Outlook → Overcast → 50
      Outlook → Rain → Wind (Strong → 25, Weak → 55)
      (attribute nodes, value nodes, and numeric leaf nodes)

  23. Standard Deviation and Mean SD (Players) = 9.32 Mean (Players) = 39.79

  24. Standard Deviation and Mean (Players)
      Outlook  | SD    | Mean
      Sunny    | 7.78  | 35.20
      Overcast | 3.49  | 46.25
      Rainy    | 10.87 | 39.20

      Temp     | SD    | Mean
      Hot      | 8.95  | 36.25
      Mild     | 7.65  | 42.67
      Cool     | 10.51 | 39.00

      Humidity | SD    | Mean
      High     | 9.36  | 37.57
      Normal   | 8.73  | 42.00

      Windy    | SD    | Mean
      False    | 7.87  | 41.36
      True     | 10.59 | 37.67
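A small Python check of the Outlook rows above; it uses the per-value Players groups listed on slide 49 and the population standard deviation (divide by n), which matches the slide's figures.

    import statistics

    groups = {  # Players values per Outlook value, taken from slide 49
        "Sunny":    [25, 30, 35, 38, 48],
        "Overcast": [46, 43, 52, 44],
        "Rainy":    [45, 52, 23, 46, 30],
    }
    for value, xs in groups.items():
        print(value, round(statistics.pstdev(xs), 2), round(statistics.mean(xs), 2))
    # Sunny 7.78 35.2, Overcast 3.49 46.25, Rainy 10.87 39.2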

  25. Standard Deviation versus Entropy: a classification tree measures node impurity with entropy; a regression tree uses the standard deviation of the numeric target in the same role.

  26. Information Gain versus Standard Deviation Reduction: a classification tree selects splits by information gain; a regression tree selects splits by standard deviation reduction (SDR).

  27. Selecting The Root Node
      Play = [14], SD = 9.32
      Outlook: Sunny [5] SD = 7.78; Overcast [4] SD = 3.49; Rain [5] SD = 10.87
      SDR(Play, Outlook) = 9.32 - ((5/14)*7.78 + (4/14)*3.49 + (5/14)*10.87) = 1.662

  28. Selecting The Root Node …
      Play = [14], SD = 9.32
      Temp: Hot [4] SD = 8.95; Mild [6] SD = 7.65; Cool [4] SD = 10.51
      SDR(Play, Temp) = 9.32 - ((4/14)*8.95 + (6/14)*7.65 + (4/14)*10.51) = 0.481

  29. Selecting The Root Node …
      Play = [14], SD = 9.32
      Humidity: High [7] SD = 9.36; Normal [7] SD = 8.73
      SDR(Play, Humidity) = 9.32 - ((7/14)*9.36 + (7/14)*8.73) = 0.275

  30. Selecting The Root Node …
      Play = [14], SD = 9.32
      Windy: Weak [8] SD = 7.87; Strong [6] SD = 10.59
      SDR(Play, Windy) = 9.32 - ((8/14)*7.87 + (6/14)*10.59) = 0.284

  31. Selecting The Root Node …
      Players: Outlook SDR = 1.662; Temperature SDR = 0.481; Humidity SDR = 0.275; Windy SDR = 0.284
      Outlook gives the largest reduction, so it becomes the root node (a computational check follows below).
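A brief Python sketch reproducing the SDR comparison above from the subset sizes and standard deviations on slides 27-30; the dictionary layout is a choice made here.

    def sdr(total_sd, subsets, n=14):
        """Standard deviation reduction: SD(S) minus the size-weighted SD of the subsets."""
        return total_sd - sum(size / n * sd for size, sd in subsets)

    candidates = {  # (subset size, subset SD) per attribute value, from slides 27-30
        "Outlook":     [(5, 7.78), (4, 3.49), (5, 10.87)],
        "Temperature": [(4, 8.95), (6, 7.65), (4, 10.51)],
        "Humidity":    [(7, 9.36), (7, 8.73)],
        "Windy":       [(8, 7.87), (6, 10.59)],
    }
    for name, subsets in candidates.items():
        print(name, round(sdr(9.32, subsets), 3))
    # Outlook 1.662, Temperature 0.481, Humidity 0.275, Windy 0.284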

  32. Decision Tree - Regression
      Outlook → Sunny → Humidity (High → 30, Normal → 45)
      Outlook → Overcast → 50
      Outlook → Rain → Wind (Strong → 25, Weak → 55)
      (attribute nodes, value nodes, and numeric leaf nodes)

  33. Decision Tree - Issues
      • Working with continuous attributes: discretization
      • Overfitting and pruning
      • Super attributes: attributes with many values
      • Working with missing values
      • Attributes with different costs

  34. Discretization
      • Equally probable intervals: create a set of N intervals with the same number of elements.
      • Equal-width intervals: divide the original range of values into N intervals with the same range.
      • Entropy based: for each numeric attribute, sort the instances and, for every possible threshold, evaluate a binary <, >= test in exactly the same way that a categorical attribute would be.
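A hedged sketch of the first two strategies in plain Python (the entropy-based search is omitted); the function names and the example values are illustrative only.

    def equal_width_bins(values, n_bins):
        """Cut points that divide the value range into n_bins intervals of equal width."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins
        return [lo + i * width for i in range(1, n_bins)]

    def equally_probable_bins(values, n_bins):
        """Cut points chosen so each interval holds (roughly) the same number of values."""
        xs = sorted(values)
        step = len(xs) / n_bins
        return [xs[int(i * step)] for i in range(1, n_bins)]

    temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]  # example numeric attribute
    print(equal_width_bins(temps, 3))       # [71.0, 78.0]
    print(equally_probable_bins(temps, 3))  # [70, 75]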

  35. Avoid Overfitting
      Overfitting occurs when the learning algorithm continues to develop hypotheses that reduce training-set error at the cost of an increased test-set error.
      • Stop growing when the data split is not statistically significant (chi-square test)
      • Grow the full tree, then post-prune
      • Minimum description length (MDL): minimize size(tree) + size(misclassifications(tree))
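As a hedged illustration of the "stop when the split is not statistically significant" idea, a chi-square test on the Outlook frequency table from slide 6; the use of scipy here is a choice made for illustration, not something the slides prescribe.

    from scipy.stats import chi2_contingency

    # Play counts (No, Yes) per Outlook value, from the frequency table on slide 6
    outlook = [[3, 2],   # Sunny
               [0, 4],   # Overcast
               [2, 3]]   # Rainy
    chi2, p_value, dof, expected = chi2_contingency(outlook)
    print(round(chi2, 2), round(p_value, 3))
    # Grow the split only if p_value falls below the chosen significance level.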

  36. Post-pruning
      • First, build the full tree; then prune it
      • A fully grown tree shows all attribute interactions
      • Problem: some subtrees might be due to chance effects
      • Two pruning operations: subtree replacement, subtree raising
      • Possible strategies: error estimation, significance testing, MDL principle
      (Witten & Eibe)

  37. Subtree Replacement • Bottom-up • Consider replacing a tree only after considering all its subtrees (Witten & Eibe)

  38. Subtree Replacement (Witten & Eibe)

  39. Error Estimation
      • Transformed value for f (subtract the mean and divide by the standard deviation): (f - p) / sqrt(p(1-p)/N)
      • Resulting equation: Pr[(f - p) / sqrt(p(1-p)/N) > z] = c
      • Solving for p: p = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
      (Witten & Eibe)

  40. Error Estimation
      • The error estimate for a subtree is the weighted sum of the error estimates for all its leaves
      • Error estimate for a node (upper bound): e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)
      • If c = 25% then z = 0.69 (from the normal distribution)
      • f is the error on the training data
      • N is the number of instances covered by the leaf
      (Witten & Eibe)

  41. Leaves: f = 0.33, e = 0.47; f = 0.5, e = 0.72; f = 0.33, e = 0.47. Combined using ratios 6:2:6 gives 0.51. Parent node: f = 5/14, e = 0.46. Since 0.46 < 0.51, prune! (Witten & Eibe)
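A Python sketch of the estimate behind this example, using the upper-bound formula from slide 40 with z = 0.69; it reproduces the leaf and combined figures, while the parent estimate comes out at about 0.45 against the slide's quoted 0.46.

    import math

    def pessimistic_error(f, n, z=0.69):
        """Upper-bound error estimate for a node with training error f over n instances."""
        return ((f + z * z / (2 * n)
                 + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
                / (1 + z * z / n))

    leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]    # (f, n) for the three leaves
    for f, n in leaves:
        print(round(pessimistic_error(f, n), 2))     # 0.47, 0.72, 0.47
    combined = sum(n * pessimistic_error(f, n) for f, n in leaves) / 14
    parent = pessimistic_error(5 / 14, 14)
    print(round(combined, 2), round(parent, 2))      # 0.51 and ~0.45
    # parent estimate < combined estimate, so the subtree is replaced by a leaf (pruned)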

  42. Super Attributes
      • The information gain measure G(S, A) is biased toward attributes with a large number of values over attributes with a smaller number of values.
      • These “super attributes” are easily selected as the root, resulting in a broad tree that classifies the training set perfectly but performs poorly on unseen instances.
      • We can penalize attributes with many values by using an alternative selection criterion, the gain ratio:
        GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

  43. Super Attributes …
      Outlook  | No | Yes | Size
      Sunny    |  3 |  2  | 5
      Overcast |  0 |  4  | 4
      Rainy    |  2 |  3  | 5
      Total = 14
      SplitInformation(Play, Outlook) = -((5/14)*log2(5/14) + (4/14)*log2(4/14) + (5/14)*log2(5/14)) = 1.577
      Gain(Play, Outlook) = 0.247
      GainRatio(Play, Outlook) = 0.247 / 1.577 = 0.156
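A short Python check of the gain-ratio computation above; note that SplitInformation is simply the entropy of the subset sizes. The helper name entropy is a choice made here.

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    sizes = [5, 4, 5]                    # Sunny, Overcast, Rainy subset sizes
    split_info = entropy(sizes)          # SplitInformation(Play, Outlook)
    subsets = [[2, 3], [4, 0], [3, 2]]   # [Yes, No] per Outlook value
    gain = entropy([9, 5]) - sum(sum(s) / 14 * entropy(s) for s in subsets)
    print(round(split_info, 3), round(gain / split_info, 3))   # 1.577 0.156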

  44. Missing Values • Most common value • Most common value at node K • Mean or Median • Nearest Neighbor • …
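A hedged sketch of the first two strategies above (most common value for categorical attributes, mean for numeric ones); the helper name and example columns are illustrative only.

    from collections import Counter
    from statistics import mean

    def impute(column, numeric=False):
        """Fill None entries with the mean (numeric) or the most common value (categorical)."""
        present = [v for v in column if v is not None]
        fill = mean(present) if numeric else Counter(present).most_common(1)[0][0]
        return [fill if v is None else v for v in column]

    print(impute(["Sunny", None, "Rainy", "Sunny"]))   # ['Sunny', 'Sunny', 'Rainy', 'Sunny']
    print(impute([25, None, 45, 30], numeric=True))    # [25, 33.33..., 45, 30]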

  45. Attributes with Different Costs • Sometimes the best attribute for splitting the training elements is very costly. In order to make the overall decision process more cost effective we may wish to penalize the information gain of an attribute by its cost.

  46. Decision Trees:
      • are simple, quick and robust
      • are non-parametric
      • can handle complex datasets
      • can handle any combination of categorical and continuous variables, as well as missing values
      • are not incremental or adaptive
      • are sometimes hard to read
      • …

  47. Thank You!

  48. Play labels grouped by attribute value (appendix):
      Outlook: Sunny 2 Yes / 3 No; Overcast 4 Yes / 0 No; Rainy 3 Yes / 2 No
      Humidity: High 3 Yes / 4 No; Normal 6 Yes / 1 No
      Windy: False 6 Yes / 2 No; True 3 Yes / 3 No
      Temperature: Hot 2 Yes / 2 No; Mild 4 Yes / 2 No; Cool 3 Yes / 1 No

  49. Standard Deviation: Players values grouped by attribute value (appendix):
      Outlook: Sunny 25 30 35 38 48 (SD=7.78); Overcast 46 43 52 44 (SD=3.49); Rainy 45 52 23 46 30 (SD=10.87)
      Windy: False 25 46 45 52 35 38 46 44 (SD=7.87); True 30 23 43 48 52 30 (SD=10.59)
      Humidity: High 25 30 46 45 35 52 30 (SD=9.36); Normal 52 23 43 38 46 48 44 (SD=8.73)
      Temperature: Hot 25 30 46 44 (SD=8.95); Mild 45 35 46 48 52 30 (SD=7.65); Cool 52 23 43 38 (SD=10.51)

  50. Overfitting (Mitchell 97)
      Consider the error of hypothesis h over
      • the training data: error_train(h)
      • the entire distribution D of the data: error_D(h)
      Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
      error_train(h) < error_train(h') and error_D(h) > error_D(h')
