Decision Trees • References: • "Artificial Intelligence: A Modern Approach, 3rd ed." (Pearson), sections 18.3-18.4 • http://onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html • http://chem-eng.utoronto.ca/~datamining/dmc/decision_tree_overfitting.htm
What are they? • A "flowchart" of logic • Example: • If my health is low: • run to cover • Else: • if an enemy is nearby: • shoot it • else: • scavenge for treasure
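That flowchart maps directly onto if/else code. Here is a minimal Python sketch of the same logic; the function name, parameters, and the health threshold are made up for illustration.

# A minimal sketch of the "flowchart" above as plain if/else logic.
# The names and the health threshold (25) are made up for illustration.
def choose_action(health, enemy_nearby):
    if health < 25:                  # "my health is low"
        return "run to cover"
    elif enemy_nearby:
        return "shoot it"
    else:
        return "scavenge for treasure"

print(choose_action(health=80, enemy_nearby=True))   # shoot it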
Another Example • Goal: Decide if we'll wait for a table at a restaurant • Factors: • Alternate: Is there another restaurant nearby? • Bar: Does the restaurant have a bar? • Fri / Sat: Is it a Friday or Saturday? • Hungry: Are we hungry? • Patrons: How many people {None, Some, Full} • Price: Price range {$, $$, $$$} • Raining: Is it raining? • Reservation: Do we have a reservation? • Type: {French, Italian, Thai, Burger} • Wait estimate (Est): {0-10, 10-30, 30-60, >60}
Possible decision tree • [Tree diagram: the root splits on Patrons (None, Some, Full); the Full branch splits on the wait estimate (0-10, 10-30, 30-60, >60), with deeper splits on Alternate, Hungry, Reservation, Fri/Sat, Raining, and Bar leading to Yes/No leaves.]
Analysis • Pluses: • Easy to traverse • Naturally expressed as if/else chains • Negatives: • How do we build an optimal tree?
Sample Input, cont • We can also think of these as "training data" for the decision tree we want to model • In this context, the input: • comes from "experts" • exemplifies the thinking you want to encode • is raw data we want to mine • … • Note: • It doesn't contain all possibilities • There might be noise
Building a tree • So how do we build a decision tree from input? • There are a lot of possible trees: • O(2^n) • Some are good, some are bad: • good == shallowest • bad == deepest • It's intractable to find the best one • Using a greedy algorithm, we can find a pretty good one…
ID3 algorithm • By Ross Quinlan (RuleQuest Research) • Basic idea: • Choose the best attribute, i • Create a tree with n children • n is the number of values for attribute i • Divide the training set into n sub-sets • Where all items in a subset have the same value for attribute i. • If all items in the subset have the same output value, make this a leaf node. • If not, recursively create a new sub-tree • Only use those training examples in this subset • Don't consider attribute i any more.
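The steps above translate almost directly into code. Here is a minimal Python sketch, assuming training examples are stored as dicts keyed by attribute name; the "best attribute" choice is passed in as a function, since that criterion (information gain) is defined on the following slides.

from collections import Counter

def id3(examples, attributes, target, choose_attribute):
    # Build a decision tree as nested dicts, following the steps above.
    #   examples         -- list of dicts, e.g. {"Patrons": "Full", ..., "Wait": "Yes"}
    #   attributes       -- attribute names still available for splitting
    #   target           -- name of the output column
    #   choose_attribute -- picks the "best" attribute (information gain, later slides)
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:          # all outputs agree -> leaf node
        return labels[0]
    if not attributes:                 # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(examples, attributes, target)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]   # don't consider attribute i any more
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, remaining, target, choose_attribute)
    return tree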
"Best" attribute • Entropy (in information theory) • A measure of uncertainty. • Gaining info == lowering entropy • A fair coin = 1 bit of entropy • A loaded coin (always heads) = 0 bits of entropy • No uncertainty • A fair roll of a d4 = 2 bits of entropy • A fair roll of a d8 = 3 bits of entropy
Entropy, cont. • Given: • V: a random variable with values v1…vn • Entropy: • H(V) = -Σ i=1..n P(vi) · log2(P(vi)) • Where: • P(x) is the probability of x.
Entropy, cont. • Example: • We have a loaded 4-sided die: • {1: 10%, 2: 5%, 3: 25%, 4: 60%} • H = -(0.1·log2 0.1 + 0.05·log2 0.05 + 0.25·log2 0.25 + 0.6·log2 0.6) ≈ 1.49 bits • Recall: the entropy of a fair d4 is 2.0 bits, so this die is more predictable.
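As a quick check of that arithmetic, here is the entropy formula as a small Python function (the names are illustrative).

import math

def entropy(probs):
    # H = -sum(p * log2(p)); a direct transcription of the formula above
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))     # fair d4   -> 2.0 bits
print(entropy([0.10, 0.05, 0.25, 0.60]))     # loaded d4 -> ~1.49 bits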
Information Gain • The reduction in entropy • In the ID3 algorithm, • We want to split the training cases based on attribute i • Where attribute i gives us the most information • i.e. lowers entropy the most
Information Gain, cont. • Suppose: • E is a set of p training cases • Each training case has one of n possible "results" (output values): r1…rn • We're considering splitting E based on attribute i, which has m possible values: Ai1…Aim • Ej is the subset of E which has result rj, where 1 <= j <= n • size(E) is p; size(Ej) is the size of that subset • So H(E) = -Σ j=1..n (size(Ej)/p) · log2(size(Ej)/p) • The resulting tree would have m branches; let Eik be the subset of E whose value for attribute i is Aik • The gain is: • Gain(i) = H(E) - Σ k=1..m (size(Eik)/p) · H(Eik) • Split on the attribute with the largest gain.
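Here is a minimal Python sketch of that gain computation, assuming (as in the ID3 sketch earlier) that training cases are dicts keyed by attribute name; it can be used as the choose_attribute criterion there.

import math
from collections import Counter

def entropy(labels):
    # entropy (bits) of a list of output values
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    # H(E) minus the size-weighted entropy of the subsets produced by
    # splitting on `attribute`
    before = entropy([ex[target] for ex in examples])
    after = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after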
Original Example • Let's take two potential attributes: Est & Patrons. Which is best to split on? • Calculate H(E): • 6 Yes, 6 No • H(E) = -(6/12)·log2(6/12) - (6/12)·log2(6/12) = 1.0 bit
Original Example, cont. • Calculate H(Eest) • 4 possible values, so we'd end up with 4 branches • "0-10": {1, 3, 6, 7, 8, 11}; 4 Yes, 2 No • "10-30": {4, 10}; 1 Yes, 1 No • "30-60": {2, 12}; 1 Yes, 1 No • ">60": {5, 9}; 2 No • Calculate the entropy of this split group: • H(Eest) = (6/12)·0.918 + (2/12)·1.0 + (2/12)·1.0 + (2/12)·0.0 ≈ 0.79
Original Example, cont. • Calculate H(Epat) • 3 possible values, so we'd end up with 3 branches • "Some": {1,3,6,8}; 4 Yes (entropy 0.0) • "Full": {2,4,5,9,10,12}; 2 Yes, 4 No (entropy 0.918) • "None": {7,11}; 2 No (entropy 0.0) • Calculate the entropy of this split group: • H(Epat) = (4/12)·0.0 + (6/12)·0.918 + (2/12)·0.0 ≈ 0.459 • So…which is better: splitting on Est, or Pat?
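A quick way to verify these numbers is to plug the Yes/No counts of each branch into the entropy formula; the counts below are taken straight from the two splits above.

import math

def H(counts):
    # entropy (bits) of a Yes/No distribution given as raw counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (branch size, [Yes, No]) pairs from the two candidate splits above
est_branches = [(6, [4, 2]), (2, [1, 1]), (2, [1, 1]), (2, [0, 2])]   # 0-10, 10-30, 30-60, >60
pat_branches = [(4, [4, 0]), (6, [2, 4]), (2, [0, 2])]                # Some, Full, None

def weighted_entropy(branches, total=12):
    return sum(size / total * H(counts) for size, counts in branches)

print(round(weighted_entropy(est_branches), 3))   # ~0.792 -> gain = 1.0 - 0.792 ~ 0.21
print(round(weighted_entropy(pat_branches), 3))   # ~0.459 -> gain = 1.0 - 0.459 ~ 0.54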
Original Example, cont. • Pat is much better (0.541 gain vs. 0.21 gain) • Here is the tree so far: • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → {2,4,5,9,10,12} (still to be split)] • Now we need a subtree to handle the case where Patrons == Full • Note: the training set is smaller now (6 cases vs. 12)
Original Example, cont. • Look at two alternatives: Alt & Type • Calculate entropy of remaining group: • We actually already calculated it (H("Full")) • H(E)≈0.918
Original Example, cont. • Calculate the entropy if we split on Alt • Two possible values: "Yes" and "No" • "Yes": {2,4,5,10,12}; 2 Yes, 3 No (entropy 0.971) • "No": {9}; 1 No (entropy 0.0) • H(Ealt) = (5/6)·0.971 + (1/6)·0.0 ≈ 0.809
Original Example, cont. • Calculate the entropy if we split on Type • 4 possible values: "French", "Thai", "Burger", and "Italian" • "French": {5}; 1 No (entropy 0.0) • "Thai": {2,4}; 1 Yes, 1 No (entropy 1.0) • "Burger": {9,12}; 1 Yes, 1 No (entropy 1.0) • "Italian": {10}; 1 No (entropy 0.0) • H(Etype) = (2/6)·1.0 + (2/6)·1.0 ≈ 0.667 • Which is better: Alt or Type?
Original Example, cont. • Type is better (0.251 gain vs. 0.109 gain) • Hungry, Price, Reservation, and Est would give the same gain. • Here is the tree so far: • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → Type; Type: Italian → No {10}, French → No {5}, Thai → {2,4} (to be split), Burger → {9,12} (to be split)] • Recursively make two more sub-trees…
Original Example, cont. • Here's one possibility (skipping the details): • [Tree: Patrons at the root; None → No {7,11}, Some → Yes {1,3,6,8}, Full → Type; Type: Italian → No {10}, French → No {5}, Thai → split on Fri ({2,4}), Burger → split on Alt ({9,12}); each of those bottom splits ends in one Yes leaf and one No leaf.]
Using a decision tree • This algorithm will perfectly match all training cases. • The hope is that this will generalize to novel cases. • Let's take a new case (not found in training) • Alt="No", Bar="Yes", Fri="No", Pat="Full" • Hungry="Yes", Price="$$", Rain=Yes • Reservation="Yes", Type="Italian", Est="30-60" • Will we wait?
Original Example, cont. • The new case: Alt="No", Bar="Yes", Fri="No", Pat="Full", Hungry="Yes", Price="$$", Rain="Yes", Reservation="Yes", Type="Italian", Est="30-60" • Here's the decision process: • Patrons == "Full" → follow the Full branch to the Type node • Type == "Italian" → follow the Italian branch to a No leaf • So…no, we won't wait.
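If the learned tree is stored as nested dicts (as in the ID3 sketch earlier), the decision process is a short loop. The tree below is a partial reconstruction of the one above; the Thai and Burger subtrees are elided because this case never reaches them.

# Walking the learned tree stored as nested dicts.
tree = {"Patrons": {
    "None": "No",
    "Some": "Yes",
    "Full": {"Type": {
        "French": "No",
        "Italian": "No",
        "Thai": "(elided: Fri subtree)",    # not needed for this case
        "Burger": "(elided: Alt subtree)",  # not needed for this case
    }},
}}

def classify(node, case):
    # follow attribute branches until a leaf (a plain string) is reached
    while isinstance(node, dict):
        attribute = next(iter(node))              # e.g. "Patrons"
        node = node[attribute][case[attribute]]   # follow this case's value
    return node

case = {"Alt": "No", "Bar": "Yes", "Fri": "No", "Patrons": "Full",
        "Hungry": "Yes", "Price": "$$", "Rain": "Yes",
        "Reservation": "Yes", "Type": "Italian", "Est": "30-60"}
print(classify(tree, case))   # -> "No": we won't wait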
Pruning • Sometimes an exact fit is not necessary: • The tree is too big (deep) • The tree isn't generalizing well to new cases (overfitting) • We don't have a lot of training cases • We would get close to the same results by removing the Attr node and labeling it as a leaf (r1) • [Diagram: an Attr node with branches v1, v2, v3 leading to leaves r1, r2, and r1, covering cases {98}, {47}, and {11, 41}.]
Chi-Squared Test • The chi-squared test can be used to determine whether a decision node is statistically significant. • Example 1: • Is there a statistically significant relationship between hair color and eye color?
Chi-Squared Test • Example 2: • Is there a statistically significant relationship between console preference and passing etgg1803?
Chi-Squared Test • Steps: • 1) Calculate row, column, and overall totals
Chi-Squared Test • 2) Calculate the expected value of each cell: • expected = RowTotal × ColTotal / OverallTotal • e.g. 52·44/95 ≈ 24.08 and 36·43/95 ≈ 16.3
Chi-Squared Test • 3) Calculate χ² = Σ (observed − expected)² / expected over all cells • e.g. (32 − 24.08)²/24.08 ≈ 2.6 and (22 − 16.3)²/16.3 ≈ 2.0 • χ² = 2.6 + 3.15 + 1.65 + 2.0 + 0.6 + 0.71 = 10.71
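The same arithmetic as a small Python snippet; only two of the six cells appear on the slide, so only those are checked here.

def chi_squared(observed, expected):
    # chi^2 = sum over all cells of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two of the six cells from the hair/eye table above:
print((32 - 24.08) ** 2 / 24.08)   # ~2.6
print((22 - 16.30) ** 2 / 16.30)   # ~2.0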
Chi-Squared test • 4) Look up your chi-squared value in a table • The degrees of freedom (dof) is (numRows − 1) × (numCols − 1) • http://home.comcast.net/~sharov/PopEcol/tables/chisq.html • If the table entry (usually for 0.05) is less than your chi-squared value, it's statistically significant. • Or use scipy (www.scipy.org):

import scipy.stats

if 1.0 - scipy.stats.chi2.cdf(chiSquared, dof) > 0.05:
    pass   # Statistically insignificant
Chi-squared test • For the hair/eye example, we have a χ² value of 10.71 (dof = 2) • The table entry for 5% probability (0.05) is 5.99 • 10.71 is bigger than 5.99, so this is statistically significant • For the console example: • χ² = 8.16, dof = 4 • The table entry for 5% probability is 9.49 • So…this isn't a statistically significant connection.
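The table lookups can be reproduced with scipy's inverse chi-squared CDF.

import scipy.stats

# chi2.ppf is the inverse CDF: the critical value for a given confidence level
print(scipy.stats.chi2.ppf(0.95, df=2))   # ~5.99 -> 10.71 > 5.99: significant
print(scipy.stats.chi2.ppf(0.95, df=4))   # ~9.49 -> 8.16  < 9.49: not significant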
Chi-Squared Pruning • Bottom-up • Do a depth-first traversal • Do your test only after calling the function recursively on your children
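Here is a minimal Python sketch of that bottom-up pass, assuming the nested-dict trees from the earlier ID3 sketch and using scipy for the chi-squared test; it is an illustration under those assumptions, not the exact course implementation.

from collections import Counter
import scipy.stats

def prune(node, examples, target, significance=0.05):
    # Bottom-up chi-squared pruning of a nested-dict tree.
    #   examples -- the training cases that reached this node
    #   target   -- name of the output column
    if not isinstance(node, dict):                      # leaf: nothing to prune
        return node

    attribute = next(iter(node))
    branches = node[attribute]
    for value in list(branches):                        # recurse on children first
        subset = [ex for ex in examples if ex[attribute] == value]
        branches[value] = prune(branches[value], subset, target, significance)

    # Contingency table: one row per attribute value, one column per output label
    labels = sorted(set(ex[target] for ex in examples))
    table = [[sum(1 for ex in examples if ex[attribute] == v and ex[target] == l)
              for l in labels]
             for v in sorted(branches)]
    chi2, p_value, dof, _ = scipy.stats.chi2_contingency(table, correction=False)

    if p_value > significance:                          # not significant -> collapse
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    return node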
Original Example, cont. • Look at the "Burger" node first • [Tree with training counts: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → Type [2Y,4N]; Type: Italian → No [0Y,1N], French → No [0Y,1N], Thai → Fri [1Y,1N], Burger → Alt [1Y,1N]; each of those bottom splits has one [1Y,0N] leaf and one [0Y,1N] leaf.]
Original Example, cont. • Do a chi-squared test on the Alt split under Burger [1Y,1N]: • Observed: Alt = No → [0Y,1N], Alt = Yes → [1Y,0N] • Expected (from the row and column totals): 0.5 in every cell • χ² = 0.5 + 0.5 + 0.5 + 0.5 = 2.0 • dof = (2−1)·(2−1) = 1 • Table(0.05, 1) = 3.84 • 2.0 < 3.84, so…prune it! • Note: we'll have a similar case with Thai, so…prune it too!
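That 2×2 test can be double-checked with scipy; correction=False matches the hand calculation above (scipy's default Yates correction would give a smaller value).

import scipy.stats

observed = [[0, 1],   # Alt = No:  0 Yes, 1 No
            [1, 0]]   # Alt = Yes: 1 Yes, 0 No
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed, correction=False)
print(chi2, dof)                                   # 2.0, 1
print(chi2 < scipy.stats.chi2.ppf(0.95, df=1))     # True -> not significant -> prune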
Original Example, cont. • Here's one possibility after pruning the Thai and Burger subtrees: • [Tree: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → Type [2Y,4N]; Type: Italian → No [0Y,1N], French → No [0Y,1N], Thai and Burger collapsed to leaves ([1Y,1N] each).]
Original Example, cont. • Now test the Type node [2Y,4N] (Italian → No [0Y,1N], French → No [0Y,1N], Thai → [1Y,1N], Burger → [1Y,1N]) • I got a chi-squared value of 1.52, dof = 3 • Table(0.05, 3) = 7.81, so…prune it!
Original Example, cont. • Here's one possibility after pruning the Type node: • [Tree: Patrons [6Y,6N]; None → No [0Y,2N], Some → Yes [4Y,0N], Full → No [2Y,4N].]
Pruning Example, cont. • Finally, test the Patrons node [6Y,6N] (None → [0Y,2N], Some → [4Y,0N], Full → [2Y,4N]) • I got a chi-squared value of 6.667, dof = 2 • Table(0.05, 2) = 5.99, so…keep it! • Note: if the evidence in the Burger and Thai branches were stronger (more training cases), we wouldn't have pruned them.