
COSC 4350 and 5350 Artificial Intelligence

Presentation Transcript


  1. COSC 4350 and 5350 Artificial Intelligence Induction and Decision Tree Learning (Part 2) Dr. Lappoon R. Tang

  2. Overview • Types of learning • History of machine learning • Inductive learning • Decision tree learning

  3. Readings • R & N: Chapter 18 • Sec 18.1 • Sec 18.2 • Sec 18.3, skim through • “Noise and overfitting” • “Broadening the applicability of decision trees”

  4. Learning decision trees Problem: decide whether to wait for a table at a restaurant, based on the following attributes: • Alternate: is there an alternative restaurant nearby? • Bar: is there a comfortable bar area to wait in? • Fri/Sat: is today Friday or Saturday? • Hungry: are we hungry? • Patrons: number of people in the restaurant (None, Some, Full) • Price: price range ($, $$, $$$) • Raining: is it raining outside? • Reservation: have we made a reservation? • Type: kind of restaurant (French, Italian, Thai, Burger) • WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
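
To make the attribute list concrete, here is a minimal Python sketch of the ten attributes and their value sets as listed above; the dictionary name and the exact value spellings are illustrative choices, not part of the slides.

```python
# Attribute domains for the restaurant-waiting problem, transcribed from the slide.
ATTRIBUTE_DOMAINS = {
    "Alternate":    [True, False],
    "Bar":          [True, False],
    "Fri/Sat":      [True, False],
    "Hungry":       [True, False],
    "Patrons":      ["None", "Some", "Full"],
    "Price":        ["$", "$$", "$$$"],
    "Raining":      [True, False],
    "Reservation":  [True, False],
    "Type":         ["French", "Italian", "Thai", "Burger"],
    "WaitEstimate": ["0-10", "10-30", "30-60", ">60"],
}
```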

  5. Attribute-based representations • Each scenario is described by attribute values (Boolean, discrete, continuous) • Classification of examples is positive (T) or negative (F) • [Table: situations where I will/won't wait for a table]
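
Using the attribute domains above, one labeled example could be represented as an attribute dictionary plus a Boolean class. This is only a sketch of the representation; the particular values below are illustrative and are not copied from the textbook's table.

```python
# One labeled training example: attribute values plus a Boolean class (WillWait).
x = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Some", "Price": "$$$", "Raining": False, "Reservation": True,
    "Type": "French", "WaitEstimate": "0-10",
}
label = True          # positive example: we decided to wait
example = (x, label)  # the (attribute_dict, label) format used in later sketches
```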

  6. Decision trees • One possible representation for hypotheses • E.g., here is the “true” tree for deciding whether to wait:

  7. Meaning of a Branch in a Decision Tree (Perspective 1: A Sequence of Questions) Q1: Is the restaurant full? A1: Yes (Patrons = Full) Q2: What is the expected waiting time? A2: 30 to 60 mins (WaitEstimate = 30-60) Q3: Do we have an alternative restaurant nearby? A3: Yes (Alternate = Yes) Q4: Is it a Fri or Sat? A4: Yes (Fri/Sat = Yes) So, the customer decided to wait (probably because it was a Fri/Sat, which means other restaurants could also be full …)

  8. Meaning of a Branch in a Decision Tree (Perspective 2: A Rule for Prediction) IF (Patrons = ‘Full’) && (WaitEstimate = ‘30-60’) && (Alternate = ‘Yes’) && (Fri/Sat = ‘Yes’) THEN Wait = True
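
Assuming the dictionary representation sketched earlier, this rule could be written directly as a Python predicate; the function name is illustrative.

```python
def wait_rule(example):
    """The branch above read as a prediction rule: returns True (wait) when it fires."""
    return (example["Patrons"] == "Full"
            and example["WaitEstimate"] == "30-60"
            and example["Alternate"] is True
            and example["Fri/Sat"] is True)
```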

  9. Meaning of a Decision Tree • Perspective 1: A set of sequences of questions asked • Each sequence of questions leads to a final answer on whether the customer should wait or not (i.e. the classification of the scenario) • Perspective 2: A set of prediction rules • Each rule is used to predict the outcome of a particular scenario depending on whether the scenario satisfies all the pre-conditions of the rule or not

  10. Big tree or Small tree? The Minimum Description Length (MDL) Principle • In the huge space of hypotheses, some are bound to be more complicated (bigger) and some are simpler • Ex: In fitting a curve to a graph of points, there are many different curves that fit the points, but they are of varying complexity • The MDL principle states that if two hypotheses are consistent with the examples of the target concept, the “smaller” one is expected to generalize better than the “bigger” one • Remember the Occam’s Razor bias? • MDL is consistent with the Occam’s Razor bias, but it is mathematically much more rigorous

  11. Decision tree learning • Goal: find a “small” tree consistent with the training examples • Idea: (recursively) choose the "most significant" attribute as root of the tree or sub-tree until all leaf nodes are class labels • Form class node if possible • If all the examples in the given set E belong to the same class C, return the decision tree T with a single node labeled with class C • Choose attribute • If not, choose the most discriminative attribute A and add A to the current decision tree as a tree node • Notice: A will “split” the set E into subsets such that each subset contains the examples having a particular value of A • Recursively construct sub-trees • For each subset S produced by A in step 2, start at step 1 again with E replaced by S
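
The three steps above can be sketched as a short recursive function. This is a minimal illustration, assuming examples are (attribute_dict, label) pairs and relying on an information_gain helper like the one sketched under slide 18 below; it is not the exact pseudocode from the textbook.

```python
from collections import Counter

def learn_tree(examples, attributes):
    """Recursively build a decision tree from (attribute_dict, label) pairs.
    Returns either a class label (leaf) or an (attribute, {value: subtree}) node."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                  # Step 1: all examples share one class -> leaf
        return labels[0]
    if not attributes:                         # no attributes left -> majority-class fallback
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))   # Step 2
    branches = {}
    for value in {x[best] for x, _ in examples}:          # Step 3: recurse on each subset
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = learn_tree(subset, remaining)
    return (best, branches)
```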

  12. Decision tree learning (cont’d) • Goal: find a small tree consistent with the training examples • Idea: (recursively) choose the "most significant" attribute as root of the (sub)tree • When no attributes remain, the full algorithm returns the majority class (Majority-Class) of the remaining examples instead of growing the tree further

  13. Choosing an attribute • Idea: a good attribute splits the examples into subsets that are ideally “all positive” or “all negative” (i.e. discriminative) • Patrons is a better choice … Why? • Splitting on Patrons: None → {X7, X11} (all negative), Some → {X1, X3, X6, X8} (all positive), Full → +: {X4, X12}, -: {X2, X5, X9, X10}

  14. Choosing an attribute • Patrons is a better choice … Why? • Intuitively, an attribute is more “discriminative” if knowing its values would allow one to make good guesses on the class label of an example • Patrons: • Knowing the value ‘none’ allows me to predict that the class is negative • Knowing the value ‘some’ allows me to predict that the class is positive • Knowing the value ‘full’ allows me to make a guess that the class would be negative (no. of negative examples > no. of positive examples) • Type: • Knowing each of the values (e.g. French) gives me no clue at all whether the class would be positive or negative because they are both equally likely

  15. Choosing an attribute Q: Is there a way to mathematically quantify how discriminative an attribute is – so that we can have a basis for choosing one over another? A: Yes – entropy and information gain

  16. Entropy • The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is: I(p+, p−) = − p+ log2(p+) − p− log2(p−), where p+ and p− are the proportions of positive and negative examples in S • If all examples belong to the same category, the entropy is 0 (by convention, 0 log 0 is defined to be 0) • If the examples are equally mixed (p+ = p− = 0.5), the entropy is at its maximum of 1.0 – the least amount of “pattern” can be found in the examples • For multiple-category problems with c categories: Entropy(S) = − Σi pi log2(pi), summing over the c categories
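
Both the binary and the multi-category definitions above can be computed by one small helper that works from class counts; this is a sketch, with the 0 log 0 = 0 convention handled explicitly.

```python
import math

def entropy(class_counts):
    """Entropy of a set of examples, given the number of examples in each class.
    Terms with zero count are skipped, matching the 0*log(0) = 0 convention."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

entropy([6, 6])    # 1.0: equally mixed, maximum disorder
entropy([12, 0])   # 0.0: all examples belong to the same class
```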

  17. Entropy (cont’d)

  18. Information Gain • The information gain of an attribute A is the expected reduction in entropy caused by partitioning S using attribute A: Gain(S, A) = Entropy(S) − Σv (|Sv| / |S|) Entropy(Sv), where the sum (over the values v of A) is the total weighted entropy of the partition and Sv is the subset of S for which A = v • IDEA: If we can find an attribute A that can “group” a lot of (or even all of) the examples in S under the same class, we want to use A as a test in the decision tree, since it is very discriminative – knowing A allows us to almost predict the class
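
One possible implementation of this quantity, reusing the entropy helper sketched under slide 16 and the (attribute_dict, label) example format from the learner sketch:

```python
from collections import Counter

def information_gain(examples, attribute):
    """Expected reduction in entropy from partitioning `examples` on `attribute`.
    `examples` is a list of (attribute_dict, label) pairs."""
    def class_counts(exs):
        return list(Counter(label for _, label in exs).values())

    total = len(examples)
    gain = entropy(class_counts(examples))             # entropy of the whole set
    for value in {x[attribute] for x, _ in examples}:  # subtract weighted entropy of each subset
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        gain -= (len(subset) / total) * entropy(class_counts(subset))
    return gain
```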

  19. Information gain (cont’d) • Given a set of attributes, compute the information gain for each attribute in the set • Choose the attribute with the largest information gain – maximum reduction in entropy of the existing set of examples

  20. Information gain (cont’d) For the training set, p = n = 6, so I(6/12, 6/12) = 1.0 bit (the entropy of the whole set). Consider the attributes Patrons and Type (among others):
IG(Patrons) = 1 − [ (2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6) ] ≈ 0.541
IG(Type) = 1 − [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0
Patrons has the highest IG of all the attributes, so it is chosen to be the root of the decision tree
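
The two numbers on this slide can be checked with a few lines of Python; here I is the two-class entropy from slide 16, redefined locally so the snippet stands alone.

```python
import math

def I(p_plus, p_minus):
    """Two-class entropy I(p+, p-), with 0*log(0) treated as 0."""
    return -sum(p * math.log2(p) for p in (p_plus, p_minus) if p > 0)

ig_patrons = 1 - (2/12 * I(0, 1) + 4/12 * I(1, 0) + 6/12 * I(2/6, 4/6))
ig_type = 1 - (2/12 * I(1/2, 1/2) + 2/12 * I(1/2, 1/2)
               + 4/12 * I(2/4, 2/4) + 4/12 * I(2/4, 2/4))
print(round(ig_patrons, 3), round(ig_type, 3))   # 0.541 0.0
```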

  21. Example (cont’d) • Decision tree learned from the 12 examples: • Substantially simpler than the “true” tree – a more complex hypothesis isn’t justified by the small amount of data

  22. Performance measurement • How do we know that h ≈ f ? • Try h on a new test set of examples (drawn from the same distribution over the example space as the training set) • The higher the test accuracy, the higher the probability that the hypothesis induced from the data is the true hypothesis • Learning curve = % correct on the test set as a function of training set size
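
As a sketch of this measurement, assuming the (attribute, branches) tree format produced by the learner sketched under slide 11, test-set accuracy could be computed as follows.

```python
def predict(tree, example):
    """Walk an (attribute, {value: subtree}) tree down to a leaf class label."""
    node = tree
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]   # assumes the value was seen during training
    return node

def accuracy(tree, test_examples):
    """Fraction of held-out (attribute_dict, label) pairs classified correctly."""
    correct = sum(predict(tree, x) == y for x, y in test_examples)
    return correct / len(test_examples)
```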

  23. Exercise • Compute the information gain for the attribute Raining

  24. Summary • Learning needed for unknown environments, lazy designers • For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples • Decision tree learning using information gain can learn a classification function given a set of attribute-value vectors • Learning performance = prediction accuracy measured on test set
