COSC 4350 and 5350 Artificial Intelligence








### COSC 4350 and 5350 Artificial Intelligence

Induction and Decision Tree Learning (Part 2)

Dr. Lappoon R. Tang

Overview
• Types of learning
• History of machine learning
• Inductive learning
• Decision tree learning
• Readings – R & N: Chapter 18
• Sec 18.1
• Sec 18.2
• Sec 18.3, skim through “Noise and overfitting” and “Broadening the applicability of decision trees”
Learning decision trees

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:

• Alternate: is there an alternative restaurant nearby?
• Bar: is there a comfortable bar area to wait in?
• Fri/Sat: is today Friday or Saturday?
• Hungry: are we hungry?
• Patrons: number of people in the restaurant (None, Some, Full)
• Price: price range ($, $$, $$$)
• Raining: is it raining outside?
• Reservation: have we made a reservation?
• Type: kind of restaurant (French, Italian, Thai, Burger)
• WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
• Each scenario is described by attribute values (Boolean, discrete, continuous)

Situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)
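As a concrete sketch of such an attribute-based representation (the dictionary format and the particular values shown are assumptions for illustration, not the slides’ notation), one example might look like:

```python
# One hypothetical training example from the restaurant domain:
# an attribute -> value mapping plus a positive/negative class label.
example = {
    "Alternate": "Yes", "Bar": "No", "Fri/Sat": "No", "Hungry": "Yes",
    "Patrons": "Some", "Price": "$$$", "Raining": "No",
    "Reservation": "Yes", "Type": "French", "WaitEstimate": "0-10",
}
label = True  # classification is positive (T): we will wait for a table
```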
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:
Meaning of a Branch in a Decision Tree
Perspective 1: A Sequence of Questions

Q1: Is the restaurant full?
A1: Yes (Patrons = Full)

Q2: What is the expected waiting time?
A2: 30 to 60 mins (WaitEstimate = 30-60)

Q3: Do we have an alternative restaurant nearby?
A3: Yes (Alternate = Yes)

Q4: Is it a Fri or Sat?
A4: Yes (Fri/Sat = Yes)

So, the customer decided to wait (probably because it was a Fri/Sat, which means other restaurants could also be full …)

IF (Patrons = ‘Full’) &&
   (WaitEstimate = ‘30-60’) &&
   (Alternate = ‘Yes’) &&
   (Fri/Sat = ‘Yes’)
THEN
   Wait = True
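Read as a prediction rule, this branch could be sketched as a predicate (a hypothetical helper, not code from the slides; the dict-of-attributes format is assumed):

```python
def branch_predicts_wait(ex):
    """One branch of the decision tree, read as an IF-THEN prediction rule.

    Returns True only when the scenario satisfies all four pre-conditions.
    """
    return (ex["Patrons"] == "Full"
            and ex["WaitEstimate"] == "30-60"
            and ex["Alternate"] == "Yes"
            and ex["Fri/Sat"] == "Yes")

scenario = {"Patrons": "Full", "WaitEstimate": "30-60",
            "Alternate": "Yes", "Fri/Sat": "Yes"}
print(branch_predicts_wait(scenario))  # True
```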

Meaning of a Decision Tree
• Perspective 1: A set of sequences of questions asked
• Each sequence of questions leads to a final answer on whether the customer should wait or not (i.e. the classification of the scenario)
• Perspective 2: A set of prediction rules
• Each rule is used to predict the outcome of a particular scenario depending on whether the scenario satisfies all the pre-conditions of the rule or not
• In the huge space of hypotheses, some are bound to be more complicated (bigger) and some simpler
• Ex: In fitting a curve to a graph of points, there are many different curves that fit the points, but they are of varying complexity
• The MDL (Minimum Description Length) principle states that if there are two hypotheses consistent with the examples of the target concept, the “smaller” one generalizes better than the “bigger” one
• Remember the Occam’s Razor bias?
• MDL is consistent with the Occam’s Razor bias but it is mathematically much more rigorous
Decision tree learning
• Goal: find a “small” tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as root of the tree or sub-tree until all leaf nodes are class labels
• Form class node if possible
• If the set of examples E given are in the same class C, return the decision tree T with the single node labeled with class C
• Choose attribute
• If not, choose the most discriminative attributeA and add A to the current decision tree as a tree node
• Notice: A will “split” the set E into a bunch of subsets such that each contains examples all having a particular value of A
• Recursively construct sub-trees
• For each subset S in all the subsets produced by A in step 2, start at step 1 again where E is replaced by S
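The three steps above can be sketched in Python (a minimal sketch, assuming examples are (attribute-dict, label) pairs; choosing the attribute with the lowest weighted entropy after the split is equivalent to choosing the highest information gain, developed later in the slides):

```python
import math
from collections import Counter

def entropy(labels):
    # entropy of a list of class labels
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_attribute(examples, attributes):
    # most discriminative attribute = lowest weighted entropy after the split
    def split_entropy(a):
        groups = {}
        for ex, lab in examples:
            groups.setdefault(ex[a], []).append(lab)
        n = len(examples)
        return sum(len(g) / n * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def learn_tree(examples, attributes):
    labels = [lab for _, lab in examples]
    if len(set(labels)) == 1:          # step 1: all examples in one class C
        return labels[0]               # leaf labeled with C
    if not attributes:                 # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    a = choose_attribute(examples, attributes)     # step 2
    tree = {a: {}}
    for v in {ex[a] for ex, _ in examples}:        # step 3: one sub-tree per value
        subset = [(ex, lab) for ex, lab in examples if ex[a] == v]
        tree[a][v] = learn_tree(subset, [x for x in attributes if x != a])
    return tree
```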
Decision tree learning (cont’d)
• Goal: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of (sub)tree

[Figure: decision-tree learning pseudocode; a leaf with no examples left is labeled with the majority class]

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are ideally “all positive” or “all negative” (i.e. discriminative)
• Patrons is a better choice … Why?

[Figure: splitting the 12 examples by Patrons.
Patrons = None: negative examples {X7, X11}
Patrons = Some: positive examples {X1, X3, X6, X8}
Patrons = Full: positive examples {X4, X12}, negative examples {X2, X5, X9, X10}]

Choosing an attribute
• Patrons is a better choice … Why?
• Intuitively, an attribute is more “discriminative” if knowing its values would allow one to make good guesses on the class label of an example
• Patrons:
• Knowing the value ‘none’ allows me to predict that the class is negative
• Knowing the value ‘some’ allows me to predict that the class is positive
• Knowing the value ‘full’ allows me to make a guess that the class would be negative (no. of negative examples > no. of positive examples)
• Type:
• Knowing each of the values (e.g. French) gives me no clue at all whether the class would be positive or negative because they are both equally likely
Choosing an attribute

Q: Is there a way to mathematically quantify how discriminative an attribute is – so that we can have a basis for choosing one over another?

A: Yes – entropy and information gain

Entropy
• The entropy (impurity, disorder) of a set of examples S, relative to a binary classification, is:

  I(p+, p-) = -p+ log2(p+) - p- log2(p-)

  where p+ and p- are the fractions of positive and negative examples in S
• If all examples belong to the same category, entropy is 0 (by def., 0·log2(0) is defined to be 0)
• If examples are equally mixed (p+ = p- = 0.5), then entropy is at its maximum of 1.0 – the least amount of “pattern” can be found in the examples
• For multiple-category problems with c categories:

  Entropy(S) = - Σ (i = 1 to c) p_i log2(p_i), where p_i is the fraction of examples in category i
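The two boundary cases can be checked numerically (a sketch; the function name is mine, not the book’s):

```python
import math

def binary_entropy(p_pos, p_neg):
    """I(p+, p-) = -p+ log2(p+) - p- log2(p-), with 0 * log2(0) taken as 0."""
    return sum(-p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(binary_entropy(1.0, 0.0))  # 0.0: all examples in the same category
print(binary_entropy(0.5, 0.5))  # 1.0: equally mixed, maximum disorder
```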
Information Gain
• The information gain of an attribute A is the expected reduction in entropy caused by partitioning S using attribute A:

  Gain(S, A) = Entropy(S) - Σ (v in Values(A)) (|S_v| / |S|) · Entropy(S_v)

  where S_v is the subset of S whose value of A is v; the summation is the total weighted entropy of the partition
• IDEA: If we can find an attribute A that can “group” a lot of (or even all of) the examples in S under the same class, we want to use A as a test in the decision tree, since it is very discriminative – knowing A allows us to almost predict the class
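A direct transcription of the formula (a sketch; the (attribute-dict, label) example format is an assumption for illustration):

```python
import math

def entropy(probs):
    # entropy of a class distribution, with 0 * log2(0) taken as 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    def label_entropy(pairs):
        labels = [lab for _, lab in pairs]
        n = len(labels)
        return entropy([labels.count(l) / n for l in set(labels)])
    n = len(examples)
    groups = {}
    for ex, lab in examples:
        groups.setdefault(ex[attribute], []).append((ex, lab))
    # total weighted entropy of the partition induced by the attribute
    remainder = sum(len(g) / n * label_entropy(g) for g in groups.values())
    return label_entropy(examples) - remainder
```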
Information gain (cont’d)
• Given a set of attributes, compute the information gain for each attribute in the set
• Choose the attribute with the largest information gain – maximum reduction in entropy of the existing set of examples

IG(Patrons) = 1 - [ (2/12)·I(0, 1) + (4/12)·I(1, 0) + (6/12)·I(2/6, 4/6) ] ≈ 0.541

IG(Type) = 1 - [ (2/12)·I(1/2, 1/2) + (2/12)·I(1/2, 1/2) + (4/12)·I(2/4, 2/4) + (4/12)·I(2/4, 2/4) ] = 0

Information gain (cont’d)

For the training set, p = n = 6, I(6/12, 6/12) = 1.0 (entropy)

Consider the attributes Patrons and Type (and others too):

Patrons has the highest IG of all attributes and so it is chosen to be the root of the decision tree
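The two computations can be replayed in Python to confirm the arithmetic (a sketch using the subset sizes and class mixes from the Patrons and Type splits above):

```python
import math

def I(p, q):
    # binary entropy I(p, q), with 0 * log2(0) taken as 0
    return sum(-x * math.log2(x) for x in (p, q) if x > 0)

# Patrons splits the 12 examples into subsets of sizes 2, 4, and 6
ig_patrons = 1 - (2/12 * I(0, 1) + 4/12 * I(1, 0) + 6/12 * I(2/6, 4/6))
# Type splits them 2/2/4/4, each subset an even positive/negative mix
ig_type = 1 - (2/12 * I(1/2, 1/2) + 2/12 * I(1/2, 1/2)
               + 4/12 * I(2/4, 2/4) + 4/12 * I(2/4, 2/4))
print(round(ig_patrons, 3))  # 0.541
print(round(ig_type, 3))     # 0.0
```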

Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by the small amount of data
Performance measurement
• How do we know that h ≈ f ?
• Try h on a new test set of examples (use same distribution over example space as training set)
• The higher the test accuracy, the higher the probability that the hypothesis induced from the data is the true hypothesis

Learning curve = % correct on test set as a function of training set size

Exercise
• Compute the information gain for the attribute Raining
Summary
• Learning needed for unknown environments, lazy designers
• For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples
• Decision tree learning using information gain can learn a classification function given a set of attribute-value vectors
• Learning performance = prediction accuracy measured on test set