### CS B351: Decision Trees

Agenda
• Decision trees
• Learning curves
• Combatting overfitting

Classification Tasks
• Supervised learning setting
• The target function f(x) takes on values True and False
• An example is positive if f is True, else it is negative
• The set X of all possible examples is the example set
• The training set is a subset of X
Logical Classification Dataset
• Here, examples (x, f(x)) take on discrete values

Note that the training set does not say whether an observable predicate is pertinent or not

• Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Figure: decision tree. Root A?: False → False; True → B?. B?: False → True; True → C?. C?: True → True; False → False.]

• Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY
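
As a quick sanity check, the mushroom rule, the logical sentence, and the decision tree can be compared on all eight truth assignments. This is only a sketch: the function names `poisonous`, `concept`, and `tree` are made up for illustration, and the sentence is taken to be A ∧ (¬B ∨ C), which is what the mushroom example implies.

```python
from itertools import product

def poisonous(yellow, big, spotted):
    # "yellow and small, or yellow, big and spotted" (small = not big)
    return (yellow and not big) or (yellow and big and spotted)

def concept(a, b, c):
    # Logical sentence: A and (not B or C)
    return a and ((not b) or c)

def tree(a, b, c):
    # Decision tree: test A, then B, then C
    if not a:
        return False
    if not b:
        return True
    return c

# All three formulations agree on every combination of A, B, C.
assert all(
    poisonous(*v) == concept(*v) == tree(*v)
    for v in product((False, True), repeat=3)
)
```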

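Possible Decision Tree

[Figure: a larger decision tree rooted at D that tests D, E, B, A, and C, shown next to the small tree for CONCEPT ⇔ A ∧ (¬B ∨ C).]

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A))))))

KIS bias ⇒ Build smallest decision tree

Computationally intractable problem ⇒ greedy algorithm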

Getting Started: Top-Down Induction of Decision Tree

The distribution of the training set is:

True: 6, 7, 8, 9, 10, 13

False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? ⇒ Greedy algorithm

Assume It’s A

• A = T: True: 6, 7, 8, 9, 10, 13; False: 11, 12
• A = F: True: none; False: 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise

⇒ The number of misclassified examples from the training set is 2

Assume It’s B

• B = T: True: 9, 10; False: 2, 3, 11, 12
• B = F: True: 6, 7, 8, 13; False: 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise

⇒ The number of misclassified examples from the training set is 5

Assume It’s C

• C = T: True: 6, 8, 9, 10, 13; False: 1, 3, 4
• C = F: True: 7; False: 1, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise

⇒ The number of misclassified examples from the training set is 4

Assume It’s D

• D = T: True: 7, 10, 13; False: 3, 5
• D = F: True: 6, 8, 9; False: 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise

⇒ The number of misclassified examples from the training set is 5

Assume It’s E

• E = T: True: 8, 9, 10, 13; False: 1, 3, 5, 12
• E = F: True: 6, 7; False: 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome

⇒ The number of misclassified examples from the training set is 6

So, the best predicate to test is A
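
The comparison of the five candidate predicates can be reproduced in a few lines. This is a minimal sketch: the `(positives, negatives)` counts per branch are read off the slides, while the names `splits` and `misclassified` are made up for illustration.

```python
# (positive count, negative count) on each branch, counted from the slides.
splits = {
    "A": {True: (6, 2), False: (0, 5)},
    "B": {True: (2, 4), False: (4, 3)},
    "C": {True: (5, 3), False: (1, 4)},
    "D": {True: (3, 2), False: (3, 5)},
    "E": {True: (4, 4), False: (2, 3)},
}

def misclassified(branches):
    # Majority rule on each branch misclassifies the minority examples there.
    return sum(min(pos, neg) for pos, neg in branches.values())

errors = {name: misclassified(branches) for name, branches in splits.items()}
best = min(errors, key=errors.get)

print(errors)  # {'A': 2, 'B': 5, 'C': 4, 'D': 5, 'E': 6}
print(best)    # A
```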

Choice of Second Predicate

[Figure: tree with root A. A = F: leaf False. A = T: test C.]

• C = T: True: 6, 8, 9, 10, 13; False: none
• C = F: True: 7; False: 11, 12

⇒ The number of misclassified examples from the training set is 1

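Choice of Third Predicate

[Figure: tree with root A. A = F: leaf False. A = T: test C. C = T: leaf True. C = F: test B.]

• B = T: True: none; False: 11, 12
• B = F: True: 7; False: none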

Final Tree

[Figure: the learned tree, shown next to the original tree. Learned tree: root A?: False → False; True → C?. C?: True → True; False → B?. B?: True → False; False → True.]

Learned tree: CONCEPT ⇔ A ∧ (C ∨ ¬B)

Original concept: CONCEPT ⇔ A ∧ (¬B ∨ C)

Top-Down Induction of a DT

DTL(D, Predicates)

• If all examples in D are positive then return True
• If all examples in D are negative then return False
• If Predicates is empty then return majority rule (with noise in the training set the remaining examples may disagree, so return the majority label instead of failure)
• A ← error-minimizing predicate in Predicates
• Return the tree whose:

- root is A,

- left branch is DTL(D+A, Predicates-A),

- right branch is DTL(D-A, Predicates-A)

Here D+A is the subset of examples that satisfy A, and D-A is the subset of examples that do not.
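
The DTL procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's reference implementation: examples are assumed to be `(attribute_dict, label)` pairs, trees are nested tuples, ties in the majority rule go to False, and an empty branch (a case the slide does not cover) inherits the parent's majority label.

```python
from itertools import product

def majority(examples):
    """Majority label of (x, label) pairs; ties go to False (an assumption)."""
    pos = sum(1 for _, label in examples if label)
    return pos > len(examples) - pos

def split_errors(examples, pred):
    """Training errors if we test only `pred` and apply majority rule per branch."""
    errors = 0
    for value in (True, False):
        labels = [label for x, label in examples if x[pred] == value]
        pos = sum(labels)
        errors += min(pos, len(labels) - pos)
    return errors

def dtl(examples, predicates, default=False):
    """Return True/False for a leaf, or (pred, true_branch, false_branch)."""
    if not examples:
        return default  # assumption: an empty branch inherits the parent's majority
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()           # all positive or all negative
    if not predicates:
        return majority(examples)     # majority rule instead of failure
    best = min(predicates, key=lambda p: split_errors(examples, p))
    rest = [p for p in predicates if p != best]
    fallback = majority(examples)
    yes = [(x, y) for x, y in examples if x[best]]
    no = [(x, y) for x, y in examples if not x[best]]
    return (best, dtl(yes, rest, fallback), dtl(no, rest, fallback))

def classify(tree, x):
    """Walk the tree until a True/False leaf is reached."""
    while isinstance(tree, tuple):
        pred, yes, no = tree
        tree = yes if x[pred] else no
    return tree

# Hypothetical noise-free data: the full truth table of A and (not B or C).
data = []
for a, b, c in product((False, True), repeat=3):
    data.append(({"A": a, "B": b, "C": c}, a and ((not b) or c)))

tree = dtl(data, ["A", "B", "C"])
print(tree)
```

On this small truth table the learned tree classifies every training example correctly; with noisy data the majority-rule leaf would leave some training examples misclassified.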

• Widely used algorithm
• Easy to extend to k-class classification
• Greedy
• Robust to noise (incorrect examples)
• Not incremental
• DTs also have the advantage of being easily understood by humans
• Legal requirement in many areas
• Loans & mortgages
• Health insurance
• Welfare
Learnable Concepts
• Some simple concepts cannot be represented compactly in DTs
• Parity(x) = X1 xor X2 xor … xor Xn
• Majority(x) = 1 if most of Xi’s are 1, 0 otherwise
• Exponential size in # of attributes
• Need exponential # of examples to learn exactly
• The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
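
One way to see why parity is awkward for greedy top-down induction: on the full truth table, testing any single attribute leaves the majority-rule error unchanged, so no first split looks better than another. The sketch below checks this for n = 4; the helper name `errors_after_split` is made up for illustration.

```python
from itertools import product

n = 4
# Full truth table of Parity(x) = X1 xor X2 xor ... xor Xn.
examples = [(bits, sum(bits) % 2 == 1) for bits in product((0, 1), repeat=n)]

def errors_after_split(i):
    """Majority-rule errors after testing only attribute i."""
    total = 0
    for value in (0, 1):
        labels = [y for bits, y in examples if bits[i] == value]
        pos = sum(labels)
        total += min(pos, len(labels) - pos)
    return total

positives = sum(y for _, y in examples)
baseline = min(positives, len(examples) - positives)  # errors with no test at all
per_attribute = [errors_after_split(i) for i in range(n)]

print(baseline, per_attribute)  # 8 [8, 8, 8, 8]
```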

Performance Issues
• Assessing performance:
• Training set and test set
• Learning curve

[Figure: typical learning curve, % correct on test set (up to 100) versus size of training set. Some concepts are unrealizable within a machine’s capacity.]

• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning: terminate recursion when # errors (or information gain) is small

The resulting decision tree + majority rule may not classify correctly all examples in the training set

Statistical Methods for Addressing Overfitting / Noise
• There may be few training examples that match the path leading to a deep node in the decision tree
• More susceptible to choosing irrelevant/incorrect attributes when sample is small
• Idea:
• Make a statistical estimate of predictive power (which increases with larger samples)
• Prune branches with low predictive power
• Chi-squared pruning
Top-down DT pruning
• Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
• At k leaf nodes, number of correct/incorrect examples are p1/n1,…,pk/nk
• Chi-squared statistical significance test:
• Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)
• Alternate hypothesis: examples not randomly chosen (X is relevant)
• Prune X if testing X is not statistically significant
Chi-Squared test
• Let $Z = \sum_{i=1}^{k} \left[ \frac{(p_i - p_i')^2}{p_i'} + \frac{(n_i - n_i')^2}{n_i'} \right]$
• Where $p_i' = p \, \frac{p_i + n_i}{p + n}$ and $n_i' = n \, \frac{p_i + n_i}{p + n}$ are the expected numbers of true/false examples at leaf node i if the null hypothesis holds
• Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom
• Look up the p-value of Z from a table, prune if p-value > α for some α (usually ~0.05)
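
The statistic can be computed directly from the leaf counts. The sketch below makes a few assumptions: expected counts are the parent's totals p and n scaled by each leaf's share of the examples (the standard form of the test), a two-way split is compared against 3.841 (the tabulated chi-squared critical value for 1 degree of freedom at α = 0.05), and the function name `chi_squared_z` is made up. The counts come from the A and E splits in the worked example (6 positive, 7 negative examples overall).

```python
def chi_squared_z(leaves):
    """leaves: list of (p_i, n_i) counts of positive/negative examples per leaf."""
    p = sum(pi for pi, _ in leaves)
    n = sum(ni for _, ni in leaves)
    z = 0.0
    for pi, ni in leaves:
        share = (pi + ni) / (p + n)
        p_exp, n_exp = p * share, n * share   # expected counts under the null hypothesis
        if p_exp > 0:                         # skip terms with zero expected count
            z += (pi - p_exp) ** 2 / p_exp
        if n_exp > 0:
            z += (ni - n_exp) ** 2 / n_exp
    return z

CRITICAL_1DOF_05 = 3.841  # chi-squared table, 1 degree of freedom, alpha = 0.05

z_a = chi_squared_z([(6, 2), (0, 5)])  # split on A
z_e = chi_squared_z([(4, 4), (2, 3)])  # split on E

print(round(z_a, 3), z_a > CRITICAL_1DOF_05)  # significant: keep the test
print(round(z_e, 3), z_e > CRITICAL_1DOF_05)  # not significant: prune
```

With only 13 examples the chi-squared approximation is rough, so these numbers illustrate the mechanics rather than a reliable significance judgment.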
Performance Issues
• Assessing performance:
• Training set and test set
• Learning curve
• Overfitting
• Tree pruning
• Incorrect examples
• Missing data
• Multi-valued and continuous attributes
Multi-Valued Attributes
• Simple change: consider splits on all values A can take on
• Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
• More values => dataset split into smaller example sets when picking attributes
• Smaller example sets => more likely to fit well to spurious noise
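
The caveat is easy to reproduce with an attribute that takes a distinct value on every example, such as an ID. The data below is entirely hypothetical, and `split_errors` is an illustrative helper that applies majority rule within each attribute value.

```python
from collections import defaultdict

def split_errors(values, labels):
    """Majority-rule training errors after splitting on every value of an attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return sum(min(sum(g), len(g) - sum(g)) for g in groups.values())

# Hypothetical data: 8 examples.
labels     = [True, False, True, True, False, False, True, False]
relevant   = [1, 0, 1, 1, 0, 0, 0, 1]   # two-valued, agrees with the label on 6 of 8
example_id = list(range(8))             # eight-valued, carries no real information

print(split_errors(relevant, labels))    # 2
print(split_errors(example_id, labels))  # 0: looks perfect on the training set
```

The ID attribute wins on training error only because each subset contains a single example, which is the "fit well to spurious noise" effect described above.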

Continuous Attributes
• Continuous attributes can be converted into logical ones via thresholds
• X => X<a
• When considering splitting on X, pick the threshold a to minimize # of errors / entropy
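
A simple way to pick the threshold is to try the midpoints between consecutive distinct values and keep the one with the fewest majority-rule errors. This is a sketch under those assumptions (midpoint candidates, error count rather than entropy); `best_threshold` and the sample data are made up for illustration.

```python
def best_threshold(values, labels):
    """Return (a, errors) minimizing training errors for the test x < a.

    Returns None if all values are equal, since no split is possible.
    """
    xs = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    best = None
    for a in candidates:
        errors = 0
        for side in (True, False):
            side_labels = [y for x, y in zip(values, labels) if (x < a) == side]
            pos = sum(side_labels)
            errors += min(pos, len(side_labels) - pos)  # minority on this side
        if best is None or errors < best[1]:
            best = (a, errors)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = [False, False, False, True, True, True]
print(best_threshold(values, labels))  # (6.5, 0)
```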
Decision Boundaries
• With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: examples plotted in the (x1, x2) plane, with a decision tree that grows over successive slides by adding the tests x1>=20, x2>=10, and x2>=15.]
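
A tree over continuous attributes is just a nest of threshold tests, so each leaf covers a region bounded by lines parallel to the axes. The sketch below reuses the thresholds x1>=20 and x2>=10 from the figure, but the True/False leaf labels are assumptions chosen for illustration.

```python
def classify(x1, x2):
    """Depth-2 tree: each internal node compares one attribute with a threshold."""
    if x1 >= 20:
        return x2 >= 10   # assumed leaf labels: True above the line x2 = 10
    return False          # assumed leaf label for the region x1 < 20

# Points in the same axis-aligned region always get the same label.
print(classify(25, 12))  # True
print(classify(25, 5))   # False
print(classify(10, 12))  # False
```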
Exercise
• With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting threshold and maximum depth:
• 1?
• 2?
• 3?
• Describe the appearance and the complexity of these decision boundaries