
CS B351: Decision Trees



Agenda

  • Decision trees

  • Learning curves

    • Combatting overfitting



Classification Tasks

  • Supervised learning setting

  • The target function f(x) takes on values True and False

  • An example is positive if f is True, else it is negative

  • The set X of all possible examples is the example set

  • The training set is a subset of X



Logical Classification Dataset

  • Here, examples (x, f(x)) take on discrete values



Logical Classification Dataset

  • Here, examples (x, f(x)) take on discrete values

Note that the training set does not say whether an observable predicate is pertinent or not.



Logical Classification Task

  • Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) (a small code sketch of such a sentence follows)
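For concreteness, here is a hypothetical rendering of such a sentence in Python, assuming an example x is given as a dict of Boolean attribute values; this encoding is chosen here for illustration and is not notation from the slides.

    # Hypothetical illustration: the concept A(x) AND (B(x) OR C(x)) as a Boolean
    # sentence over observable attributes. An example x is assumed to be a dict
    # mapping attribute names to True/False.
    def concept(x):
        return x["A"] and (x["B"] or x["C"])

    print(concept({"A": True, "B": False, "C": True}))   # True
    print(concept({"A": False, "B": True, "C": True}))   # False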


Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree: test A?; if A is False, return False; if A is True, test B?; if B is True, return True; if B is False, test C?; return True iff C is True.]

  • Example: A mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted

  • x is a mushroom

  • CONCEPT = POISONOUS

  • A = YELLOW

  • B = BIG

  • C = SPOTTED


Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) can be represented by the same decision tree as above:

  • Example: A mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted

  • x is a mushroom

  • CONCEPT = POISONOUS

  • A = YELLOW

  • B = BIG

  • C = SPOTTED

  • D = FUNNEL-CAP

  • E = BULKY


Training Set

[Table: 13 examples (1–13) with the values of the observable attributes A–E and of CONCEPT; CONCEPT is True for examples 6, 7, 8, 9, 10, 13 and False for 1, 2, 3, 4, 5, 11, 12.]


Possible Decision Tree

[Figure: a large decision tree over the attributes A–E, with D at the root, that is consistent with the training set.]


Possible Decision Tree

[Figure: the large tree from the previous slide, shown next to the much smaller tree for CONCEPT ⇔ A ∧ (B ∨ C).]

CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ C ∧ (B ∨ (¬B ∧ ((E ∧ ¬A) ∨ (¬E ∧ A)))))


Possible Decision Tree

[Figure: the same two trees as on the previous slide — the large tree and the small tree for CONCEPT ⇔ A ∧ (B ∨ C).]

KIS ("keep it simple") bias → build the smallest decision tree

Computationally intractable problem → greedy algorithm



Getting Started: Top-Down Induction of Decision Tree

The distribution of the training set is:

True: 6, 7, 8, 9, 10, 13

False: 1, 2, 3, 4, 5, 11, 12



Getting Started: Top-Down Induction of Decision Tree

The distribution of the training set is:

True: 6, 7, 8, 9, 10, 13

False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the number of misclassified examples in the training set)? → Greedy algorithm, sketched in code below.
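A minimal sketch of this greedy selection step, assuming the training set is encoded as a list of (attribute-dict, label) pairs as in the earlier illustration; the encoding and function names are assumptions made here, not part of the slides.

    # Sketch: count misclassifications when testing a single predicate and
    # applying the majority rule in each branch (ties broken arbitrarily;
    # either choice gives the same error count).
    def misclassified_if_split_on(examples, attr):
        errors = 0
        for branch_value in (True, False):
            branch = [label for (x, label) in examples if x[attr] == branch_value]
            if branch:
                majority = branch.count(True) >= branch.count(False)
                errors += sum(1 for label in branch if label != majority)
        return errors

    def best_predicate(examples, predicates):
        # The greedy step: pick the predicate with the fewest misclassifications.
        return min(predicates, key=lambda p: misclassified_if_split_on(examples, p))

The slides now work through this computation by hand for each of the predicates A–E.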


Assume It's A

A = True:  True examples 6, 7, 8, 9, 10, 13; False examples 11, 12
A = False: no True examples; False examples 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.

→ The number of misclassified examples from the training set is 2


Assume It's B

B = True:  True examples 9, 10; False examples 2, 3, 11, 12
B = False: True examples 6, 7, 8, 13; False examples 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise.

→ The number of misclassified examples from the training set is 5


Assume It's C

C = True:  True examples 6, 8, 9, 10, 13; False examples 1, 3, 4
C = False: True examples 7; False examples 1, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise.

→ The number of misclassified examples from the training set is 4


Assume It's D

D = True:  True examples 7, 10, 13; False examples 3, 5
D = False: True examples 6, 8, 9; False examples 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise.

→ The number of misclassified examples from the training set is 5


Assume It's E

E = True:  True examples 8, 9, 10, 13; False examples 1, 3, 5, 12
E = False: True examples 6, 7; False examples 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome.

→ The number of misclassified examples from the training set is 6



So, the best predicate to test is A


Choice of Second Predicate

A = False → report False.
A = True → test C:
  C = True:  True examples 6, 8, 9, 10, 13; no False examples
  C = False: True examples 7; False examples 11, 12

→ The number of misclassified examples from the training set is 1


Choice of Third Predicate

A = False → report False.
A = True, C = True → report True.
A = True, C = False → test B on the remaining examples (True: 7; False: 11, 12).


Final Tree

[Decision tree: root test A?, then C?, then B?, as constructed on the preceding slides.]

CONCEPT ⇔ A ∧ (C ∨ B)
CONCEPT ⇔ A ∧ (B ∨ C)


Top-Down Induction of a DT

[Figure: the decision tree built so far; the root test on A sends the subset of examples that satisfy A down one branch and the remaining examples down the other.]

DTL(D, Predicates)

  • If all examples in D are positive then return True

  • If all examples in D are negative then return False

  • If Predicates is empty then return failure

  • A ← error-minimizing predicate in Predicates

  • Return the tree whose:

    - root is A,

    - left branch is DTL(D+A,Predicates-A),

    - right branch is DTL(D-A,Predicates-A)


Top-Down Induction of a DT

Noise in training set! May return majority rule, instead of failure:

DTL(D, Predicates)

  • If all examples in D are positive then return True

  • If all examples in D are negative then return False

  • If Predicates is empty then return failure

  • A ← error-minimizing predicate in Predicates

  • Return the tree whose:

    - root is A,

    - left branch is DTL(D+A,Predicates-A),

    - right branch is DTL(D-A,Predicates-A)


Top-Down Induction of a DT

DTL(D, Predicates)

  • If all examples in D are positive then return True

  • If all examples in D are negative then return False

  • If Predicates is empty then return majority rule

  • A ← error-minimizing predicate in Predicates

  • Return the tree whose:

    - root is A,

    - left branch is DTL(D+A,Predicates-A),

    - right branch is DTL(D-A,Predicates-A)
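A runnable sketch of the DTL procedure above, under the same assumed encoding as the earlier snippets (examples as (attribute-dict, label) pairs, reusing the misclassified_if_split_on helper); the nested-tuple tree representation is a choice made here, not prescribed by the slides.

    # Sketch of DTL(D, Predicates). A tree is either True/False (a leaf) or a
    # tuple (A, subtree_when_A_true, subtree_when_A_false).
    def majority(examples):
        labels = [label for (_, label) in examples]
        return labels.count(True) >= labels.count(False)

    def dtl(examples, predicates):
        labels = [label for (_, label) in examples]
        if all(labels):
            return True                     # all examples are positive
        if not any(labels):
            return False                    # all examples are negative
        if not predicates:
            return majority(examples)       # majority rule instead of failure
        a = min(predicates, key=lambda p: misclassified_if_split_on(examples, p))
        d_plus = [(x, y) for (x, y) in examples if x[a]]
        d_minus = [(x, y) for (x, y) in examples if not x[a]]
        if not d_plus or not d_minus:
            return majority(examples)       # uninformative split; stop (an added
                                            # guard, not in the slide's pseudocode)
        rest = [p for p in predicates if p != a]
        return (a, dtl(d_plus, rest), dtl(d_minus, rest))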



Comments

  • Widely used algorithm

  • Easy to extend to k-class classification

  • Greedy

  • Robust to noise (incorrect examples)

  • Not incremental



Human-Readability

  • DTs also have the advantage of being easily understood by humans

  • Legal requirement in many areas

    • Loans & mortgages

    • Health insurance

    • Welfare



Learnable Concepts

  • Some simple concepts cannot be represented compactly in DTs

    • Parity(x) = X1 xor X2 xor … xor Xn

    • Majority(x) = 1 if most of Xi’s are 1, 0 otherwise

  • Exponential size in # of attributes

  • Need exponential # of examples to learn exactly

  • The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT


Performance Issues

[Figure: typical learning curve — % correct on the test set (up to 100) vs. size of the training set.]

  • Assessing performance:

    • Training set and test set

    • Learning curve



Performance Issues

  • Assessing performance:

    • Training set and test set

    • Learning curve

[Figure: typical learning curve. Some concepts are unrealizable within a machine's capacity, so the curve stays below 100% correct on the test set.]


Performance Issues

Risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set (overfitting).

[Figure: typical learning curve — % correct on the test set vs. size of the training set.]

  • Assessing performance:

    • Training set and test set

    • Learning curve

  • Overfitting



Performance Issues

  • Assessing performance:

    • Training set and test set

    • Learning curve

  • Overfitting

    • Tree pruning

Risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

Tree pruning: terminate recursion when the number of errors (or the information gain) is small.



Performance Issues

  • Assessing performance:

    • Training set and test set

    • Learning curve

  • Overfitting

    • Tree pruning

Risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set.

The resulting decision tree + majority rule may not correctly classify all examples in the training set.

Tree pruning: terminate recursion when the number of errors (or the information gain) is small.



Statistical Methods for Addressing Overfitting / Noise

  • There may be few training examples that match the path leading to a deep node in the decision tree

    • More susceptible to choosing irrelevant/incorrect attributes when sample is small

  • Idea:

    • Make a statistical estimate of predictive power (which increases with larger samples)

    • Prune branches with low predictive power

    • Chi-squared pruning



Top-down DT pruning

  • Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly

  • At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk

  • Chi-squared statistical significance test:

    • Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)

    • Alternate hypothesis: examples not randomly chosen (X is relevant)

  • Prune X if testing X is not statistically significant



Chi-Squared test

  • Let Z = Σi [ (pi − pi′)² / pi′ + (ni − ni′)² / ni′ ]

    • where pi′ = p·(pi + ni)/(p + n) and ni′ = n·(pi + ni)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds

  • Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom

  • Look up the p-value of Z in a chi-squared table; prune X if the p-value > α for some α (usually ≈ 0.05)
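A sketch of this test in code, assuming SciPy is available, that node X has both positive and negative examples beneath it, and that each leaf is non-empty; counts holds the (pi, ni) pairs for the k leaves (the names and interface are choices made here for illustration).

    # Sketch: chi-squared pruning decision for an inner node X.
    from scipy.stats import chi2

    def should_prune(counts, alpha=0.05):
        # counts: list of (p_i, n_i) = true/false example counts at each leaf below X
        p = sum(pi for pi, ni in counts)
        n = sum(ni for pi, ni in counts)
        z = 0.0
        for pi, ni in counts:
            p_exp = p * (pi + ni) / (p + n)    # expected positives under the null hypothesis
            n_exp = n * (pi + ni) / (p + n)    # expected negatives under the null hypothesis
            z += (pi - p_exp) ** 2 / p_exp + (ni - n_exp) ** 2 / n_exp
        p_value = chi2.sf(z, len(counts) - 1)  # P(chi-squared >= z), k - 1 degrees of freedom
        return p_value > alpha                 # prune X if the test on X is not significant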



Performance Issues

  • Assessing performance:

    • Training set and test set

    • Learning curve

  • Overfitting

    • Tree pruning

  • Incorrect examples

  • Missing data

  • Multi-valued and continuous attributes



Multi-Valued Attributes

  • Simple change: consider splits on all values A can take on

  • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant

    • More values => dataset split into smaller example sets when picking attributes

    • Smaller example sets => more likely to fit well to spurious noise
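As an illustration of the caveat above (same assumed (attribute-dict, label) encoding as earlier), a multiway split simply groups examples by attribute value and applies the majority rule per group; with many values the groups get tiny and the training error can look deceptively low.

    # Sketch: majority-rule training error of a multiway split on attribute attr.
    from collections import defaultdict

    def multiway_split_error(examples, attr):
        groups = defaultdict(list)
        for x, label in examples:
            groups[x[attr]].append(label)      # one branch per observed value of attr
        errors = 0
        for labels in groups.values():
            majority = labels.count(True) >= labels.count(False)
            errors += sum(1 for label in labels if label != majority)
        return errors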


Continuous Attributes

[Figure: a number line of example values (3–7) of a continuous attribute.]

  • Continuous attributes can be converted into logical ones via thresholds

    • X ⇒ X < a

  • When considering splitting on X, pick the threshold a to minimize # of errors / entropy
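A minimal sketch of threshold selection by misclassification count (the entropy variant would substitute an impurity measure); placing candidate thresholds midway between consecutive distinct values is a convention assumed here, not something stated on the slide.

    # Sketch: pick a threshold a for the test "X < a" that minimizes training error.
    # values: list of (x, label) pairs for the continuous attribute X.
    def best_threshold(values):
        xs = sorted(set(x for x, _ in values))
        best_a, best_err = None, float("inf")
        for lo, hi in zip(xs, xs[1:]):
            a = (lo + hi) / 2.0                  # candidate threshold between two values
            for true_side in (True, False):      # which side of the split predicts True
                err = sum(1 for x, label in values
                          if ((x < a) == true_side) != label)
                if err < best_err:
                    best_a, best_err = a, err
        return best_a, best_err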



Decision Boundaries

  • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: examples plotted in the (x1, x2) plane; the single test x1 ≥ 20 gives one vertical decision boundary, with one class predicted on each side.]



Decision Boundaries

  • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: a tree that tests x1 ≥ 20 and then x2 ≥ 10 splits the (x1, x2) plane into axis-aligned rectangular regions.]



Decision Boundaries

  • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: a deeper tree testing x1 ≥ 20, x2 ≥ 10, and x2 ≥ 15 yields a finer axis-aligned partition of the plane.]



Decision Boundaries

  • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples



Exercise

  • With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting threshold and maximum depth:

    • 1?

    • 2?

    • 3?

  • Describe the appearance and the complexity of these decision boundaries



Reading

  • Next class:

    • Neural networks & function learning

    • R&N 18.6-7

