Decision Trees • References: • "Artificial Intelligence: A Modern Approach, 3rd ed." (Pearson), sections 18.3-18.4 • http://onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html • http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree_overfitting.htm
What are they? • A "flowchart" of logic • Example: • If my health is low: • run to cover • Else: • if an enemy is nearby: • shoot it • else: • scavenge for treasure
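That flowchart maps directly onto if/else code. Here is a minimal Python sketch of the same logic; the function name, parameters, and the health threshold are made up for illustration.

# A minimal sketch of the "flowchart" above as plain if/else logic.
# The names and the health threshold (25) are made up for illustration.
def choose_action(health, enemy_nearby):
    if health < 25:                  # "my health is low"
        return "run to cover"
    elif enemy_nearby:
        return "shoot it"
    else:
        return "scavenge for treasure"

print(choose_action(health=80, enemy_nearby=True))   # shoot it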
Another Example • Goal: Decide if we'll wait for a table at a restaurant • Factors: • Alternate: Is there another restaurant nearby? • Bar: Does the restaurant have a bar? • Fri / Sat: Is it a Friday or Saturday? • Hungry: Are we hungry? • Patrons: How many people {None, Some, Full} • Price: Price range {$, $$, $$$} • Raining: Is it raining? • Reservation: Do we have a reservation? • Type: {French, Italian, Thai, Burger} • Wait estimate (Est): {0-10, 10-30, 30-60, >60}
Possible decision tree • [Tree diagram: the root splits on Patrons (None, Some, Full); the Full branch splits on the wait estimate (0-10, 10-30, 30-60, >60), with deeper splits on Alternate, Hungry, Reservation, Fri/Sat, Raining, and Bar leading to Yes/No leaves.]
Analysis • Pluses: • Easy to traverse • Naturally expressed as if/else chains • Negatives: • How do we build an optimal tree?
Sample Input, cont • We can also think of these as "training data" for the decision tree we want to model • In this context, the input: • comes from "experts" • exemplifies the thinking you want to encode • is raw data we want to mine • … • Note: • It doesn't contain all possibilities • There might be noise
Building a tree • So how do we build a decision tree from input? • There are a lot of possible trees: • O(2^n) • Some are good, some are bad: • good == shallowest • bad == deepest • It's intractable to find the best one • Using a greedy algorithm, we can find a pretty good one…
ID3 algorithm • By Ross Quinlan (RuleQuest Research) • Basic idea: • Choose the best attribute, i • Create a tree with n children • n is the number of values for attribute i • Divide the training set into n sub-sets • Where all items in a subset have the same value for attribute i. • If all items in the subset have the same output value, make this a leaf node. • If not, recursively create a new sub-tree • Only use those training examples in this subset • Don't consider attribute i any more.
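The steps above translate almost directly into code. Here is a minimal Python sketch, assuming training examples are stored as dicts keyed by attribute name; the "best attribute" choice is passed in as a function, since that criterion (information gain) is defined on the following slides.

from collections import Counter

def id3(examples, attributes, target, choose_attribute):
    # Build a decision tree as nested dicts, following the steps above.
    #   examples         -- list of dicts, e.g. {"Patrons": "Full", ..., "Wait": "Yes"}
    #   attributes       -- attribute names still available for splitting
    #   target           -- name of the output column
    #   choose_attribute -- picks the "best" attribute (information gain, later slides)
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all outputs agree -> leaf node
        return labels[0]
    if not attributes:                 # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(examples, attributes, target)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]   # don't consider attribute i any more
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target, choose_attribute)
    return tree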
"Best" attribute • Entropy (in information theory) • A measure of uncertainty. • Gaining info == lowering entropy • A fair coin = 1 bit of entropy • A loaded coin (always heads) = 0 bits of entropy • No uncertainty • A fair roll of a d4 = 2 bits of entropy • A fair roll of a d8 = 3 bits of entropy
Entropy, cont. • Given: • V: a random variable with values v1…vn • Entropy: • H(V) = -Σ i=1..n P(vi) · log2(P(vi)) • Where: • P(x) is the probability of x.
Entropy, cont. • Example: • We have a loaded 4-sided die: • {1: 10%, 2: 5%, 3: 25%, 4: 60%} • H = -(0.1·log2 0.1 + 0.05·log2 0.05 + 0.25·log2 0.25 + 0.6·log2 0.6) ≈ 1.49 bits • Recall: the entropy of a fair d4 is 2.0 bits, so this die is more predictable.
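As a quick check of that arithmetic, here is the entropy formula as a small Python function (the names are illustrative).

import math

def entropy(probs):
    # H = -sum(p * log2(p)); a direct transcription of the formula above
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))     # fair d4   -> 2.0 bits
print(entropy([0.10, 0.05, 0.25, 0.60]))     # loaded d4 -> ~1.49 bits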
Information Gain • The reduction in entropy • In the ID3 algorithm, • We want to split the training cases based on attribute i • Where attribute i gives us the most information • i.e. lowers entropy the most
Information Gain, cont. • Suppose: • E is a set of p training cases • Each training case has one of n possible "results" (output values): r1…rn • We're considering splitting E based on attribute i, which has m possible values: Ai1…Aim • Ej is the subset of E which has result rj, where 1 <= j <= n • size(E) is p; size(Ej) is the size of that subset • So H(E) = -Σ j=1..n (size(Ej)/p) · log2(size(Ej)/p) • The resulting tree would have m branches; let Eik be the subset of E whose value for attribute i is Aik • The gain is: • Gain(i) = H(E) - Σ k=1..m (size(Eik)/p) · H(Eik) • Split on the attribute with the largest gain.
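Here is a minimal Python sketch of that gain computation, assuming (as in the ID3 sketch earlier) that training cases are dicts keyed by attribute name; it can be used as the choose_attribute criterion there.

import math
from collections import Counter

def entropy(labels):
    # entropy (bits) of a list of output values
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    # H(E) minus the size-weighted entropy of the subsets produced by
    # splitting on `attribute`
    before = entropy([ex[target] for ex in examples])
    after = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after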
Original Example • Let's take two potential attributes: Est & Patrons. Which is best to split on? • Calculate H(E): • 6 Yes, 6 No • H(E) = -(6/12)·log2(6/12) - (6/12)·log2(6/12) = 1.0 bit
Original Example, cont. • Calculate H(Eest) • 4 possible values, so we'd end up with 4 branches • "0-10": {1, 3, 6, 7, 8, 11}; 4 Yes, 2 No • "10-30": {4, 10}; 1 Yes, 1 No • "30-60": {2, 12}; 1 Yes, 1 No • ">60": {5, 9}; 2 No • Calculate the entropy of this split group: • H(Eest) = (6/12)·0.918 + (2/12)·1.0 + (2/12)·1.0 + (2/12)·0.0 ≈ 0.79
Original Example, cont. • Calculate H(Epat) • 3 possible values, so we'd end up with 3 branches • "Some": {1,3,6,8}; 4 Yes (entropy 0.0) • "Full": {2,4,5,9,10,12}; 2 Yes, 4 No (entropy 0.918) • "None": {7,11}; 2 No (entropy 0.0) • Calculate the entropy of this split group: • H(Epat) = (4/12)·0.0 + (6/12)·0.918 + (2/12)·0.0 ≈ 0.459 • So…which is better: splitting on Est, or Pat?
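A quick way to verify these numbers is to plug the Yes/No counts of each branch into the entropy formula; the counts below are taken straight from the two splits above.

import math

def H(counts):
    # entropy (bits) of a Yes/No distribution given as raw counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (branch size, [Yes, No]) pairs from the two candidate splits above
est_branches = [(6, [4, 2]), (2, [1, 1]), (2, [1, 1]), (2, [0, 2])]   # 0-10, 10-30, 30-60, >60
pat_branches = [(4, [4, 0]), (6, [2, 4]), (2, [0, 2])]                # Some, Full, None

def weighted_entropy(branches, total=12):
    return sum(size / total * H(counts) for size, counts in branches)

print(round(weighted_entropy(est_branches), 3))   # ~0.792 -> gain = 1.0 - 0.792 ~ 0.21
print(round(weighted_entropy(pat_branches), 3))   # ~0.459 -> gain = 1.0 - 0.459 ~ 0.54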
Original Example, cont. • Pat is much better (0.541 gain vs. 0.21 gain) • Here is the tree so far: • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → {2,4,5,9,10,12} (still to be split)] • Now we need a subtree to handle the case where Patrons == Full • Note: the training set is smaller now (6 cases vs. 12)
Original Example, cont. • Look at two alternatives: Alt & Type • Calculate entropy of remaining group: • We actually already calculated it (H("Full")) • H(E)≈0.918
Original Example, cont. • Calculate the entropy if we split on Alt • Two possible values: "Yes" and "No" • "Yes": {2,4,5,10,12}; 2 Yes, 3 No (entropy 0.971) • "No": {9}; 1 No (entropy 0.0) • H(Ealt) = (5/6)·0.971 + (1/6)·0.0 ≈ 0.809
Original Example, cont. • Calculate the entropy if we split on Type • 4 possible values: "French", "Thai", "Burger", and "Italian" • "French": {5}; 1 No (entropy 0.0) • "Thai": {2,4}; 1 Yes, 1 No (entropy 1.0) • "Burger": {9,12}; 1 Yes, 1 No (entropy 1.0) • "Italian": {10}; 1 No (entropy 0.0) • H(Etype) = (2/6)·1.0 + (2/6)·1.0 ≈ 0.667 • Which is better: Alt or Type?
Original Example, cont. • Type is better (0.251 gain vs. 0.109 gain) • Hungry, Price, Reservation, and Est would give the same gain. • Here is the tree so far: • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → Type; Type: Italian → No {10}, French → No {5}, Thai → {2,4} (to be split), Burger → {9,12} (to be split)] • Recursively make two more sub-trees…
Original Example, cont. • Here's one possibility (skipping the details): • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → Type; Type: Italian → No {10}, French → No {5}, Thai → split on Fri ({2,4}), Burger → split on Alt ({9,12}); each of those bottom splits ends in one Yes leaf and one No leaf.]
Using a decision tree • This algorithm will perfectly match all training cases. • The hope is that this will generalize to novel cases. • Let's take a new case (not found in training) • Alt="No", Bar="Yes", Fri="No", Pat="Full" • Hungry="Yes", Price="$$", Rain=Yes • Reservation="Yes", Type="Italian", Est="30-60" • Will we wait?
Original Example, cont. • The new case: Alt="No", Bar="Yes", Fri="No", Pat="Full", Hungry="Yes", Price="$$", Rain="Yes", Reservation="Yes", Type="Italian", Est="30-60" • Here's the decision process: • Patrons == "Full" → follow the Full branch to the Type node • Type == "Italian" → follow the Italian branch to a No leaf • So…no, we won't wait.
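If the learned tree is stored as nested dicts (as in the ID3 sketch earlier), the decision process is a short loop. The tree below is a partial reconstruction of the one above; the Thai and Burger subtrees are elided because this case never reaches them.

# Walking the learned tree stored as nested dicts.
tree = {"Patrons": {
    "None": "No",
    "Some": "Yes",
    "Full": {"Type": {
        "French": "No",
        "Italian": "No",
        "Thai": "(elided: Fri subtree)",    # not needed for this case
        "Burger": "(elided: Alt subtree)",  # not needed for this case
    }},
}}

def classify(node, case):
    # follow attribute branches until a leaf (a plain string) is reached
    while isinstance(node, dict):
        attribute = next(iter(node))              # e.g. "Patrons"
        node = node[attribute][case[attribute]]   # follow this case's value
    return node

case = {"Alt": "No", "Bar": "Yes", "Fri": "No", "Patrons": "Full",
        "Hungry": "Yes", "Price": "$$", "Rain": "Yes",
        "Reservation": "Yes", "Type": "Italian", "Est": "30-60"}
print(classify(tree, case))   # -> "No": we won't wait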
Pruning • Sometimes an exact fit is not necessary: • The tree is too big (deep) • The tree isn't generalizing well to new cases (overfitting) • We don't have a lot of training cases • We would get close to the same results by removing the Attr node and labeling it as a leaf (r1) • [Diagram: an Attr node with branches v1, v2, v3 leading to leaves r1, r2, and r1, covering cases {98}, {47}, and {11, 41}.]
Chi-Squared Test • The chi-squared test can be used to determine whether a decision node is statistically significant. • Example 1: • Is there a statistically significant relationship between hair color and eye color?
Chi-Squared Test • Example 2: • Is there a statistically significant relationship between console preference and passing etgg1803?
Chi-Squared Test • Steps: • 1) Calculate row, column, and overall totals
Chi-Squared Test • 2) Calculate the expected value of each cell: • expected = RowTotal × ColTotal / OverallTotal • e.g. 52·44/95 ≈ 24.08 and 36·43/95 ≈ 16.3
Chi-Squared Test • 3) Calculate χ² = Σ (observed − expected)² / expected over all cells • e.g. (32 − 24.08)²/24.08 ≈ 2.6 and (22 − 16.3)²/16.3 ≈ 2.0 • χ² = 2.6 + 3.15 + 1.65 + 2.0 + 0.6 + 0.71 = 10.71
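The same arithmetic as a small Python snippet; only two of the six cells appear on the slide, so only those are checked here.

def chi_squared(observed, expected):
    # chi^2 = sum over all cells of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two of the six cells from the hair/eye table above:
print((32 - 24.08) ** 2 / 24.08)   # ~2.6
print((22 - 16.30) ** 2 / 16.30)   # ~2.0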
Chi-Squared test • 4) Look up your chi-squared value in a table • The degrees of freedom (dof) is (numRows − 1) × (numCols − 1) • http://home.comcast.net/~sharov/PopEcol/tables/chisq.html • If the table entry (usually for 0.05) is less than your chi-squared value, it's statistically significant. • Or use scipy (www.scipy.org):

import scipy.stats

if 1.0 - scipy.stats.chi2.cdf(chiSquared, dof) > 0.05:
    pass   # Statistically insignificant
Chi-squared test • For the hair/eye example, we have a χ² value of 10.71 (dof = 2) • The table entry for 5% probability (0.05) is 5.99 • 10.71 is bigger than 5.99, so this is statistically significant • For the console example: • χ² = 8.16, dof = 4 • The table entry for 5% probability is 9.49 • So…this isn't a statistically significant connection.
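The table lookups can be reproduced with scipy's inverse chi-squared CDF.

import scipy.stats

# chi2.ppf is the inverse CDF: the critical value for a given confidence level
print(scipy.stats.chi2.ppf(0.95, df=2))   # ~5.99 -> 10.71 > 5.99: significant
print(scipy.stats.chi2.ppf(0.95, df=4))   # ~9.49 -> 8.16  < 9.49: not significant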
Chi-Squared Pruning • Bottom-up • Do a depth-first traversal • Do your test only after calling the function recursively on your children
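Here is a minimal Python sketch of that bottom-up pass, assuming the nested-dict trees from the earlier ID3 sketch and using scipy for the chi-squared test; it is an illustration under those assumptions, not the exact course implementation.

from collections import Counter
import scipy.stats

def prune(node, examples, target, significance=0.05):
    # Bottom-up chi-squared pruning of a nested-dict tree.
    #   examples -- the training cases that reached this node
    #   target   -- name of the output column
    if not isinstance(node, dict):                      # leaf: nothing to prune
        return node

    attribute = next(iter(node))
    branches = node[attribute]
    for value in list(branches):                        # recurse on children first
        subset = [ex for ex in examples if ex[attribute] == value]
        branches[value] = prune(branches[value], subset, target, significance)

    # Contingency table: one row per attribute value, one column per output label
    labels = sorted(set(ex[target] for ex in examples))
    table = [[sum(1 for ex in examples if ex[attribute] == v and ex[target] == l)
              for l in labels]
             for v in sorted(branches)]
    chi2, p_value, dof, _ = scipy.stats.chi2_contingency(table, correction=False)

    if p_value > significance:                          # not significant -> collapse
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    return node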
Original Example, cont. • Look at the "Burger" node first • [Tree with training counts: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → Type [2Y,4N]; Type: Italian → No [0Y,1N], French → No [0Y,1N], Thai → Fri [1Y,1N], Burger → Alt [1Y,1N]; each of those bottom splits has one [1Y,0N] leaf and one [0Y,1N] leaf.]
Original Example, cont. • Do a chi-squared test on the Alt split under Burger [1Y,1N]: • Observed: Alt = No → [0Y,1N], Alt = Yes → [1Y,0N] • Expected (from the row and column totals): 0.5 in every cell • χ² = 0.5 + 0.5 + 0.5 + 0.5 = 2.0 • dof = (2−1)·(2−1) = 1 • Table(0.05, 1) = 3.84 • 2.0 < 3.84, so…prune it! • Note: we'll have a similar case with Thai, so…prune it too!
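That 2×2 test can be double-checked with scipy; correction=False matches the hand calculation above (scipy's default Yates correction would give a smaller value).

import scipy.stats

observed = [[0, 1],   # Alt = No:  0 Yes, 1 No
            [1, 0]]   # Alt = Yes: 1 Yes, 0 No
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed, correction=False)
print(chi2, dof)                                   # 2.0, 1
print(chi2 < scipy.stats.chi2.ppf(0.95, df=1))     # True -> not significant -> prune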
Original Example, cont. • Here's one possibility after pruning the Thai and Burger subtrees: • [Tree: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → Type [2Y,4N]; Type: Italian → No [0Y,1N], French → No [0Y,1N], Thai and Burger collapsed to leaves ([1Y,1N] each).]
Original Example, cont. • Now test the Type node [2Y,4N] (Italian → No [0Y,1N], French → No [0Y,1N], Thai → [1Y,1N], Burger → [1Y,1N]) • I got a chi-squared value of 1.52, dof = 3 • Table(0.05, 3) = 7.81, so…prune it!
Original Example, cont. • Here's one possibility after pruning the Type node: • [Tree: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → No [2Y,4N].]
Pruning Example, cont. • Finally, test the Patrons node [6Y,6N] (None → [0Y,2N], Some → [4Y,0N], Full → [2Y,4N]) • I got a chi-squared value of 6.667, dof = 2 • Table(0.05, 2) = 5.99, so…keep it! • Note: if the evidence in the Burger and Thai branches were stronger (more training cases), we wouldn't have pruned them.