CS B351: Decision Trees
Agenda
  • Decision trees
  • Learning curves
    • Combatting overfitting
Classification Tasks
  • Supervised learning setting
  • The target function f(x) takes on values True and False
  • An example is positive if f is True, else it is negative
  • The set X of all possible examples is the example set
  • The training set is a subset of X
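As a minimal sketch of this setting (the attribute names follow the mushroom example used later in these slides, but the specific examples are made up), a training set can be stored as (example, f(example)) pairs:

```python
# Hypothetical training set: each example is a dict of observable attributes,
# paired with the value of the target function f (True = positive example).
training_set = [
    ({"yellow": True,  "big": False, "spotted": True},  True),   # positive
    ({"yellow": False, "big": True,  "spotted": False}, False),  # negative
    ({"yellow": True,  "big": True,  "spotted": True},  True),   # positive
]

positives = [x for x, label in training_set if label]       # f(x) is True
negatives = [x for x, label in training_set if not label]   # f(x) is False
```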
Logical Classification Dataset
  • Here, examples (x, f(x)) take on discrete values

[Figure: a table of training examples, one row per example, with columns for the observable predicates and the CONCEPT label.]

Note that the training set does not say whether an observable predicate is pertinent or not.
Logical Classification Task
  • Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) (see the sketch below)
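As a small sketch (assuming Boolean attribute functions and dict-based examples, which are not part of the slides), such a sentence S(A, B, …) is just a Boolean expression over the observable attributes:

```python
# Observable attributes as Boolean predicates over an example x (a dict here).
def A(x): return x["A"]
def B(x): return x["B"]
def C(x): return x["C"]

# The slide's example sentence: CONCEPT(x) <=> A(x) and (not B(x) or C(x)).
def concept(x):
    return A(x) and (not B(x) or C(x))

print(concept({"A": True, "B": True, "C": True}))    # True
print(concept({"A": True, "B": True, "C": False}))   # False
```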
Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree: test A; if A is False, return False; if A is True, test B; if B is False, return True; if B is True, test C; return True if C is True, False otherwise.]

  • Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
  • x is a mushroom
  • CONCEPT = POISONOUS
  • A = YELLOW
  • B = BIG
  • C = SPOTTED
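Written as code (a sketch, using the attribute meanings from the bullets above), the tree is just nested tests:

```python
def poisonous(yellow, big, spotted):
    """Decision tree for CONCEPT = POISONOUS, with A = YELLOW, B = BIG, C = SPOTTED."""
    if not yellow:       # A? is False
        return False
    if not big:          # A? is True, B? is False: yellow and small
        return True
    return spotted       # A? and B? are True: test C?

# Same truth table as the sentence A and (not B or C):
print(poisonous(True, True, False))   # False: yellow, big, not spotted
print(poisonous(True, False, False))  # True:  yellow and small
```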
Predicate as a Decision Tree (continued)

  • The same example, with two additional observable attributes:
    • D = FUNNEL-CAP
    • E = BULKY
Possible Decision Tree

[Figure: a larger decision tree over the attributes D, E, B, A, C that is also consistent with the training set.]
Possible Decision Tree (continued)

[Figure: the larger tree shown next to the small tree for CONCEPT ⇔ A ∧ (¬B ∨ C).]

The larger tree corresponds to:

CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ ¬A))))))

KIS bias → Build the smallest decision tree

Computationally intractable problem → greedy algorithm

Getting Started:Top-Down Induction of Decision Tree

The distribution of the training set is:

True: 6, 7, 8, 9, 10, 13

False: 1, 2, 3, 4, 5, 11, 12


Without testing any observable predicate, we could report that CONCEPT is False (majority rule), with an estimated probability of error P(E) = 6/13.

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? → Greedy algorithm

Assume It's A

A = True  → True: 6, 7, 8, 9, 10, 13   False: 11, 12
A = False → True: (none)               False: 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.

⇒ The number of misclassified examples from the training set is 2
Assume It's B

B = True  → True: 9, 10         False: 2, 3, 11, 12
B = False → True: 6, 7, 8, 13   False: 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise.

⇒ The number of misclassified examples from the training set is 5
Assume It's C

C = True  → True: 6, 8, 9, 10, 13   False: 1, 3, 4
C = False → True: 7                 False: 1, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise.

⇒ The number of misclassified examples from the training set is 4
Assume It's D

D = True  → True: 7, 10, 13   False: 3, 5
D = False → True: 6, 8, 9     False: 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise.

⇒ The number of misclassified examples from the training set is 5
Assume It's E

E = True  → True: 8, 9, 10, 13   False: 1, 3, 5, 12
E = False → True: 6, 7           False: 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome.

⇒ The number of misclassified examples from the training set is 6
So, the best predicate to test is A.
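The greedy step just carried out by hand can be sketched as follows (the helper names and the toy dataset are made up; examples are (attribute-dict, label) pairs as before):

```python
from collections import Counter

def majority_rule_errors(examples):
    """Errors made by predicting the majority label of `examples`."""
    if not examples:
        return 0
    counts = Counter(label for _, label in examples)
    return len(examples) - max(counts.values())

def split_errors(examples, predicate):
    """Training-set errors if we test only `predicate` and apply the
    majority rule within each of its two branches."""
    true_branch  = [(x, y) for x, y in examples if x[predicate]]
    false_branch = [(x, y) for x, y in examples if not x[predicate]]
    return majority_rule_errors(true_branch) + majority_rule_errors(false_branch)

def best_predicate(examples, predicates):
    """Greedy choice: the error-minimizing predicate."""
    return min(predicates, key=lambda p: split_errors(examples, p))

# Hypothetical toy data (not the 13 examples from the slides).
data = [
    ({"A": True,  "B": False, "C": False}, True),
    ({"A": True,  "B": True,  "C": True},  True),
    ({"A": True,  "B": True,  "C": False}, False),
    ({"A": False, "B": False, "C": True},  False),
    ({"A": False, "B": True,  "C": True},  False),
    ({"A": False, "B": False, "C": False}, False),
]
print(best_predicate(data, ["A", "B", "C"]))   # 'A' (1 error vs. 2 for B or C)
```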

Choice of Second Predicate

A = False → False
A = True  → test C:
  C = True  → True: 6, 8, 9, 10, 13
  C = False → True: 7   False: 11, 12

⇒ The number of misclassified examples from the training set is 1

Choice of Third Predicate

A = False → False
A = True, C = True → True
A = True, C = False → test B:
  B = True  → False: 11, 12
  B = False → True: 7

Final Tree

[Figure: the induced tree — test A; if A is False, return False; if A is True, test C; if C is True, return True; if C is False, test B; return False if B is True, True otherwise — shown next to the original tree for CONCEPT.]

The induced tree represents CONCEPT ⇔ A ∧ (C ∨ ¬B), which is the same as the original CONCEPT ⇔ A ∧ (¬B ∨ C).

Top-Down Induction of a DT

[Figure: the induced tree, annotated with the subset of examples that satisfy A.]

DTL(D, Predicates)
  • If all examples in D are positive then return True
  • If all examples in D are negative then return False
  • If Predicates is empty then return failure
  • A ← error-minimizing predicate in Predicates
  • Return the tree whose:
    - root is A,
    - left branch is DTL(D+A, Predicates-A),
    - right branch is DTL(D-A, Predicates-A)

where D+A is the subset of examples in D that satisfy A, and D-A the subset that do not.

Top-Down Induction of a DT

  • Noise in the training set! If Predicates is empty, return the majority rule instead of failure:

DTL(D, Predicates)
  • If all examples in D are positive then return True
  • If all examples in D are negative then return False
  • If Predicates is empty then return majority rule
  • A ← error-minimizing predicate in Predicates
  • Return the tree whose:
    - root is A,
    - left branch is DTL(D+A, Predicates-A),
    - right branch is DTL(D-A, Predicates-A)
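A compact Python sketch of this pseudocode (not the course's reference implementation; examples are (attribute-dict, label) pairs, and empty branches fall back to the node's majority label):

```python
from collections import Counter

def majority(examples):
    """Majority label among (x, label) pairs (ties broken arbitrarily)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def split_errors(examples, p):
    """Training errors if we test only p and use the majority rule in each branch."""
    errors = 0
    for value in (True, False):
        subset = [(x, y) for x, y in examples if x[p] == value]
        if subset:
            errors += sum(1 for _, y in subset if y != majority(subset))
    return errors

def dtl(examples, predicates):
    """Top-down induction of a decision tree (the DTL pseudocode above).

    Returns True/False for a leaf, or a tuple (A, true_branch, false_branch)."""
    labels = {label for _, label in examples}
    if labels == {True}:            # all examples positive
        return True
    if labels == {False}:           # all examples negative
        return False
    if not predicates:              # no predicates left: majority rule
        return majority(examples)
    a = min(predicates, key=lambda p: split_errors(examples, p))  # error-minimizing predicate
    rest = [p for p in predicates if p != a]
    branches = {}
    for value in (True, False):
        subset = [(x, y) for x, y in examples if x[a] == value]   # D+A / D-A
        branches[value] = dtl(subset, rest) if subset else majority(examples)
    return (a, branches[True], branches[False])

# Tiny hypothetical dataset (labels follow A and (not B or C)):
data = [
    ({"A": True,  "B": False, "C": False}, True),
    ({"A": True,  "B": True,  "C": True},  True),
    ({"A": True,  "B": True,  "C": False}, False),
    ({"A": False, "B": False, "C": True},  False),
]
print(dtl(data, ["A", "B", "C"]))
# ('A', ('B', ('C', True, False), True), False)
```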

Comments
  • Widely used algorithm
  • Easy to extend to k-class classification
  • Greedy
  • Robust to noise (incorrect examples)
  • Not incremental
Human-Readability
  • DTs also have the advantage of being easily understood by humans
  • Legal requirement in many areas
    • Loans & mortgages
    • Health insurance
    • Welfare
Learnable Concepts
  • Some simple concepts cannot be represented compactly in DTs
    • Parity(x) = X1 xor X2 xor … xor Xn
    • Majority(x) = 1 if most of Xi’s are 1, 0 otherwise
  • These require decision trees of size exponential in the # of attributes
  • Need exponential # of examples to learn exactly
  • The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
Performance Issues
  • Assessing performance:
    • Training set and test set
    • Learning curve

[Figure: typical learning curve — % correct on the test set (y-axis, up to 100) vs. size of the training set (x-axis).]
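A sketch of how such a curve is produced (not from the slides; it uses scikit-learn's DecisionTreeClassifier as a stand-in learner on synthetic Boolean data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 5 Boolean attributes, target concept A and (not B or C).
X = rng.integers(0, 2, size=(500, 5))
y = (X[:, 0] == 1) & ((X[:, 1] == 0) | (X[:, 2] == 1))

X_train, y_train = X[:300], y[:300]     # training set
X_test, y_test = X[300:], y[300:]       # held-out test set

# Learning curve: train on growing prefixes of the training set and
# measure the % correct on the test set.
for m in range(20, 301, 40):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    print(m, f"{100 * clf.score(X_test, y_test):.1f}% correct on test set")
```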
Performance Issues (continued)

  • Some concepts are unrealizable within a machine's capacity: the learning curve levels off below 100% correct
  • Overfitting: the risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
  • Tree pruning to combat overfitting: terminate the recursion when the # of errors (or the information gain) is small
  • The resulting decision tree + majority rule may then not classify all examples in the training set correctly
Statistical Methods for Addressing Overfitting / Noise
  • There may be few training examples that match the path leading to a deep node in the decision tree
    • More susceptible to choosing irrelevant/incorrect attributes when sample is small
  • Idea:
    • Make a statistical estimate of predictive power (which increases with larger samples)
    • Prune branches with low predictive power
    • Chi-squared pruning
Top-down DT pruning
  • Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
  • At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk
  • Chi-squared statistical significance test:
    • Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)
    • Alternate hypothesis: examples not randomly chosen (X is relevant)
  • Prune X if testing X is not statistically significant
Chi-Squared Test
  • Let Z = Σi [ (pi – pi')² / pi' + (ni – ni')² / ni' ]
    • where pi' = p (pi + ni) / (p + n) and ni' = n (pi + ni) / (p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds
  • Z is a statistic that is approximately drawn from the chi-squared distribution with k − 1 degrees of freedom
  • Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
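A sketch of the test at a single node (assuming SciPy for the chi-squared tail probability; `leaf_counts` and the example calls are hypothetical):

```python
from scipy.stats import chi2

def prune_by_chi_squared(leaf_counts, alpha=0.05):
    """Chi-squared pruning test at a node.

    leaf_counts: list of (p_i, n_i) pairs, the positive/negative training
    examples reaching each of the node's k leaves.
    Returns True if the split is NOT statistically significant (prune it)."""
    p = sum(pi for pi, _ in leaf_counts)            # positives at the node
    n = sum(ni for _, ni in leaf_counts)            # negatives at the node
    z = 0.0
    for pi, ni in leaf_counts:
        pi_exp = p * (pi + ni) / (p + n)            # expected positives under H0
        ni_exp = n * (pi + ni) / (p + n)            # expected negatives under H0
        z += (pi - pi_exp) ** 2 / pi_exp + (ni - ni_exp) ** 2 / ni_exp
    p_value = chi2.sf(z, df=len(leaf_counts) - 1)   # P(chi-squared >= z)
    return p_value > alpha                          # not significant -> prune

print(prune_by_chi_squared([(5, 4), (4, 5)]))   # True: leaves mirror the parent mix
print(prune_by_chi_squared([(9, 0), (0, 9)]))   # False: the split is informative
```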
Performance Issues (continued)
  • Other issues:
    • Incorrect examples
    • Missing data
    • Multi-valued and continuous attributes
Multi-Valued Attributes
  • Simple change: consider splits on all values A can take on
  • Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
    • More values => dataset split into smaller example sets when picking attributes
    • Smaller example sets => more likely to fit well to spurious noise
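A sketch of the caveat (the dataset and attribute names are invented): an attribute with many values can split the data into tiny subsets that happen to fit the labels perfectly, even if it is actually irrelevant:

```python
from collections import Counter, defaultdict

def multiway_split_errors(examples, attr):
    """Training errors if we split on every value of `attr` and apply the
    majority rule within each value's subset."""
    by_value = defaultdict(list)
    for x, label in examples:
        by_value[x[attr]].append(label)
    return sum(len(labels) - max(Counter(labels).values())
               for labels in by_value.values())

# Invented data: 'shade' takes a different value on every example, so it
# "explains" the labels perfectly simply by isolating each example.
data = [
    ({"shade": "s1", "big": False}, True),
    ({"shade": "s2", "big": True},  False),
    ({"shade": "s3", "big": False}, True),
    ({"shade": "s4", "big": True},  True),
    ({"shade": "s5", "big": True},  False),
]
print(multiway_split_errors(data, "shade"))  # 0 errors (spuriously perfect)
print(multiway_split_errors(data, "big"))    # 1 error
```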
Continuous Attributes

[Figure: example values of a continuous attribute plotted along an axis.]

  • Continuous attributes can be converted into logical ones via thresholds
    • X → the Boolean test X < a
  • When considering splitting on X, pick the threshold a to minimize the # of errors / entropy (see the sketch below)
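A sketch of the threshold search (minimizing training errors; the helper and data are made up):

```python
def best_threshold(values, labels):
    """Choose a threshold a for the Boolean test (X < a) that minimizes
    training errors, trying midpoints between consecutive sorted values."""
    pairs = sorted(zip(values, labels))
    best_a, best_errors = None, len(pairs) + 1
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        a = (v1 + v2) / 2
        left  = [y for x, y in pairs if x < a]    # branch X < a
        right = [y for x, y in pairs if x >= a]   # branch X >= a
        # Majority rule in each branch; count the minority as errors.
        errors = (min(left.count(True), left.count(False)) +
                  min(right.count(True), right.count(False)))
        if errors < best_errors:
            best_a, best_errors = a, errors
    return best_a, best_errors

values = [3, 4, 4, 5, 5, 6, 7, 7]                             # hypothetical attribute values
labels = [False, False, False, False, True, True, True, True]
print(best_threshold(values, labels))                         # (4.5, 1)
```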
Decision Boundaries
  • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figures: a 2-D example space over (x1, x2); successive tests x1 >= 20, x2 >= 10, and x2 >= 15 carve it into axis-aligned rectangular regions labeled True/False, matching the corresponding decision trees.]
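As a small sketch, a depth-2 boundary like the one above corresponds to nested threshold tests (the region labels here are illustrative; the thresholds x1 >= 20 and x2 >= 10 come from the figures):

```python
def classify(x1, x2):
    """A depth-2 decision tree over two continuous attributes."""
    if x1 >= 20:            # vertical boundary at x1 = 20
        return x2 >= 10     # horizontal boundary at x2 = 10, only for x1 >= 20
    return False            # everything left of x1 = 20 is one rectangular region

print(classify(25, 12), classify(25, 5), classify(10, 12))   # True False False
```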
Exercise
  • With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting thresholds and maximum depth:
    • 1?
    • 2?
    • 3?
  • Describe the appearance and the complexity of these decision boundaries
Reading
  • Next class:
    • Neural networks & function learning
    • R&N 18.6-7