CS B351: Decision Trees

- Decision trees
- Learning curves
- Combatting overfitting

a small one!

- Supervised learning setting
- The target function f(x) takes on values True and False
- An example is positive if f is True, else it is negative
- The set X of all possible examples is the example set
- The training set is a subset of X

- Here, examples (x, f(x)) take on discrete values

Concept

Note that the training set does not say whether an observable predicate is pertinent or not

- Find a representation of CONCEPT in the form CONCEPT(x) ⇔ S(A, B, …), where S(A, B, …) is a sentence built with the observable attributes, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) v C(x))

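The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) v C(x)) can be represented by the following decision tree:

[Figure: decision tree. The root tests A?; if A is False, the leaf is False. If A is True, test B?; if B is False, the leaf is True. If B is True, test C?; the leaf is True if C is True and False otherwise.]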

- Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
- x is a mushroom
- CONCEPT = POISONOUS
- A = YELLOW
- B = BIG
- C = SPOTTED

- D = FUNNEL-CAP
- E = BULKY
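As a quick check of the mushroom example, the sketch below writes the concept once as a Boolean expression and once as the nested tests of the decision tree, then compares the two on all eight truth assignments. The function names `concept` and `tree` are illustrative only, and the negation on B is read off the example (small means not BIG).

```python
from itertools import product

def concept(a: bool, b: bool, c: bool) -> bool:
    """POISONOUS <=> YELLOW and (not BIG or SPOTTED)."""
    return a and ((not b) or c)

def tree(a: bool, b: bool, c: bool) -> bool:
    """The same concept written as the decision tree A? -> B? -> C?."""
    if not a:
        return False      # not yellow: not poisonous
    if not b:
        return True       # yellow and small
    return c              # yellow and big: poisonous only if spotted

# The tree and the formula agree on every assignment of A, B, C.
agree = all(concept(*v) == tree(*v) for v in product([False, True], repeat=3))
print(agree)
```

D and E do not appear in either function: they are observable, but this concept does not depend on them.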

[Figure: two decision trees side by side: a large tree rooted at D that tests D, E, B, A and C, and the small tree rooted at A.]

Small tree: CONCEPT ⇔ A ∧ (¬B v C)

Large tree: CONCEPT ⇔ (D ∧ (¬E v A)) v (¬D ∧ (C ∧ (B v (¬B ∧ ((E ∧ ¬A) v (¬E ∧ A))))))


KIS (keep it simple) bias => build the smallest decision tree

Computationally intractable problem => greedy algorithm

The distribution of the training set is:

True: 6, 7, 8, 9, 10, 13

False: 1, 2, 3, 4, 5, 11, 12


Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? => Greedy algorithm

Test A:

- A is True: True examples 6, 7, 8, 9, 10, 13; False examples 11, 12
- A is False: no True examples; False examples 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise

The number of misclassified examples from the training set is 2

Test B:

- B is True: True examples 9, 10; False examples 2, 3, 11, 12
- B is False: True examples 6, 7, 8, 13; False examples 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise

The number of misclassified examples from the training set is 5

Test C:

- C is True: True examples 6, 8, 9, 10, 13; False examples 1, 3, 4
- C is False: True example 7; False examples 2, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise

The number of misclassified examples from the training set is 4

Test D:

- D is True: True examples 7, 10, 13; False examples 3, 5
- D is False: True examples 6, 8, 9; False examples 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise

The number of misclassified examples from the training set is 5

Test E:

- E is True: True examples 8, 9, 10, 13; False examples 1, 3, 5, 12
- E is False: True examples 6, 7; False examples 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome

The number of misclassified examples from the training set is 6


So, the best predicate to test is A
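The single-predicate comparison can be reproduced with a few lines of Python. This is a minimal sketch: the names `TRUE_ON`, `majority_errors` and `split_errors` are illustrative, and the attribute values are read off the True/False splits listed for each predicate rather than taken from an original data table.

```python
# Examples 1..13; POSITIVE holds the examples for which CONCEPT is True.
EXAMPLES = set(range(1, 14))
POSITIVE = {6, 7, 8, 9, 10, 13}

# For each predicate, the examples on which it is True (read off the splits).
TRUE_ON = {
    "A": {6, 7, 8, 9, 10, 11, 12, 13},
    "B": {2, 3, 9, 10, 11, 12},
    "C": {1, 3, 4, 6, 8, 9, 10, 13},
    "D": {3, 5, 7, 10, 13},
    "E": {1, 3, 5, 8, 9, 10, 12, 13},
}

def majority_errors(subset):
    """Misclassified examples when the whole subset gets its majority label."""
    pos = len(subset & POSITIVE)
    return min(pos, len(subset) - pos)

def split_errors(pred, subset=EXAMPLES):
    """Errors after testing one predicate and applying majority rule per branch."""
    true_side = subset & TRUE_ON[pred]
    return majority_errors(true_side) + majority_errors(subset - true_side)

errors = {p: split_errors(p) for p in TRUE_ON}
best = min(errors, key=errors.get)
print(majority_errors(EXAMPLES), errors, best)
```

The counts come out as 6 with no test and 2, 5, 4, 5 and 6 for A through E, so the greedy choice is A.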

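Among the examples that satisfy A, test C:

- C is True: True examples 6, 8, 9, 10, 13; no False examples
- C is False: True example 7; False examples 11, 12

[Figure: partial tree. If A is False, the leaf is False; if A is True, test C.]

The number of misclassified examples from the training set is 1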

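Among the examples that satisfy A but not C, test B:

- B is True: no True examples; False examples 11, 12
- B is False: True example 7; no False examples

[Figure: partial tree. If A is False, the leaf is False; if A is True, test C. If C is True, the leaf is True; if C is False, test B.]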

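[Figure: the learned tree next to the earlier tree. Learned tree: if A is False, the leaf is False; if A is True, test C; if C is True, the leaf is True; if C is False, test B; if B is True, the leaf is False; if B is False, the leaf is True. Earlier tree: tests A?, then B?, then C?.]

Learned tree: CONCEPT ⇔ A ∧ (C v ¬B)

Earlier tree: CONCEPT ⇔ A ∧ (¬B v C)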

DTL(D, Predicates)

- If all examples in D are positive then return True
- If all examples in D are negative then return False
- If Predicates is empty then return failure
- A ← error-minimizing predicate in Predicates
- Return the tree whose:
- root is A,
- left branch is DTL(D+A, Predicates-A), where D+A is the subset of examples that satisfy A,
- right branch is DTL(D-A, Predicates-A), where D-A is the subset of examples that do not satisfy A

Noise in training set! May return majority rule, instead of failure
DTL(D, Predicates)

- If all examples in D are positive then return True
- If all examples in D are negative then return False
- If Predicates is empty then return majority rule
- A ← error-minimizing predicate in Predicates
- Return the tree whose:
- root is A,
- left branch is DTL(D+A, Predicates-A),
- right branch is DTL(D-A, Predicates-A)
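A minimal Python sketch of DTL with the majority-rule fallback, run on the 13-example training set. The data layout (`TRUE_ON`, `POSITIVE`), the tuple encoding of trees, the tie-breaking by list order, and the `default` label returned for an empty example set are assumptions made for this illustration; they are not part of the pseudocode.

```python
POSITIVE = {6, 7, 8, 9, 10, 13}
TRUE_ON = {
    "A": {6, 7, 8, 9, 10, 11, 12, 13},
    "B": {2, 3, 9, 10, 11, 12},
    "C": {1, 3, 4, 6, 8, 9, 10, 13},
    "D": {3, 5, 7, 10, 13},
    "E": {1, 3, 5, 8, 9, 10, 12, 13},
}

def count_errors(d, pred):
    """Training errors if we test `pred` on d and use majority rule per branch."""
    total = 0
    for side in (d & TRUE_ON[pred], d - TRUE_ON[pred]):
        pos = len(side & POSITIVE)
        total += min(pos, len(side) - pos)
    return total

def majority(d):
    """Majority label of d (ties go to False, an arbitrary choice)."""
    return 2 * len(d & POSITIVE) > len(d)

def dtl(d, predicates, default=False):
    if not d:
        return default                  # assumption: empty set inherits parent majority
    if d <= POSITIVE:
        return True                     # all examples positive
    if not (d & POSITIVE):
        return False                    # all examples negative
    if not predicates:
        return majority(d)              # majority rule instead of failure
    # Error-minimizing predicate; min() keeps the first one on ties.
    a = min(predicates, key=lambda p: count_errors(d, p))
    rest = [p for p in predicates if p != a]
    m = majority(d)
    # Tree encoding: (root, subtree where A is True, subtree where A is False)
    return (a, dtl(d & TRUE_ON[a], rest, m), dtl(d - TRUE_ON[a], rest, m))

tree = dtl(set(range(1, 14)), ["A", "B", "C", "D", "E"])
print(tree)
```

On this training set the sketch returns `('A', ('C', True, ('B', False, True)), False)`: test A, then C, then B. At the last step B and D both give zero errors on examples 7, 11 and 12, so the choice of B depends on the tie-breaking order.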

- Widely used algorithm
- Easy to extend to k-class classification
- Greedy
- Robust to noise (incorrect examples)
- Not incremental

- DTs also have the advantage of being easily understood by humans
- Legal requirement in many areas
- Loans & mortgages
- Health insurance
- Welfare

- Some simple concepts cannot be represented compactly in DTs
- Parity(x) = X1 xor X2 xor … xor Xn
- Majority(x) = 1 if most of Xi’s are 1, 0 otherwise

- Exponential size in # of attributes
- Need exponential # of examples to learn exactly
- The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT
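Parity also illustrates why attribute choice matters to a greedy learner. The sketch below (function names are illustrative) enumerates all examples for n = 4 and counts the training errors left after splitting on each single attribute with majority rule; no attribute does better than guessing, because each branch still contains equally many odd and even examples. A tree that computes parity exactly has to test all n attributes on every path, hence 2^n leaves.

```python
from itertools import product

def parity(x):
    """Parity(x) = X1 xor X2 xor ... xor Xn."""
    return sum(x) % 2 == 1

def split_errors(examples, target, i):
    """Errors after splitting on attribute i with majority rule in each branch."""
    total = 0
    for value in (0, 1):
        labels = [target(x) for x in examples if x[i] == value]
        pos = sum(labels)
        total += min(pos, len(labels) - pos)
    return total

n = 4
examples = list(product((0, 1), repeat=n))
positives = sum(parity(x) for x in examples)
baseline = min(positives, len(examples) - positives)   # errors with no test at all
per_attribute = [split_errors(examples, parity, i) for i in range(n)]
print(baseline, per_attribute)
```

This prints `8 [8, 8, 8, 8]`: every single-attribute test leaves the error count unchanged.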

[Figure: typical learning curve, plotting % correct on test set (up to 100) against size of training set.]

- Assessing performance:
- Training set and test set
- Learning curve
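One way to produce a learning curve is sketched below, under several assumptions that are not from the slides: the target is the concept A ∧ (¬B v C) over five Boolean attributes, the learner is a small error-minimizing tree learner with a majority-rule fallback, each training set is a random sample of the 32 possible examples, and the held-out remainder serves as the test set. The trial count and seed are arbitrary.

```python
import random
from itertools import product

def concept(x):
    """Target for this sketch: A and (not B or C); D and E are irrelevant."""
    a, b, c, _, _ = x
    return a and ((not b) or c)

def errors(data, i):
    """Training errors after splitting on attribute i (majority rule per branch)."""
    total = 0
    for v in (False, True):
        labels = [y for x, y in data if x[i] == v]
        total += min(sum(labels), len(labels) - sum(labels))
    return total

def majority(data):
    labels = [y for _, y in data]
    return 2 * sum(labels) > len(labels)

def dtl(data, attrs, default=False):
    if not data:
        return default
    labels = {y for _, y in data}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:
        return majority(data)
    i = min(attrs, key=lambda j: errors(data, j))
    rest = [j for j in attrs if j != i]
    m = majority(data)
    yes = [(x, y) for x, y in data if x[i]]
    no = [(x, y) for x, y in data if not x[i]]
    return (i, dtl(yes, rest, m), dtl(no, rest, m))

def classify(tree, x):
    while isinstance(tree, tuple):
        i, yes, no = tree
        tree = yes if x[i] else no
    return tree

def learning_curve(sizes, trials=200, seed=0):
    """Average test accuracy for each training-set size."""
    rng = random.Random(seed)
    pool = [(x, concept(x)) for x in product((False, True), repeat=5)]
    curve = []
    for m in sizes:
        correct = total = 0
        for _ in range(trials):
            rng.shuffle(pool)
            train, test = pool[:m], pool[m:]   # held-out examples form the test set
            tree = dtl(train, list(range(5)))
            correct += sum(classify(tree, x) == y for x, y in test)
            total += len(test)
        curve.append((m, correct / total))
    return curve

for m, acc in learning_curve([2, 4, 8, 16, 24]):
    print(f"{m:2d} training examples: {100 * acc:.1f}% correct on test set")
```

Plotting the printed percentages against the training-set size gives the curve; the exact numbers depend on the seed and the number of trials.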

Some concepts are unrealizable within a machine’s capacity

- Overfitting

Risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set

- Tree pruning

Terminate recursion when # errors (or information gain) is small

The resulting decision tree + majority rule may not correctly classify all examples in the training set

- There may be few training examples that match the path leading to a deep node in the decision tree
- More susceptible to choosing irrelevant/incorrect attributes when sample is small

- Idea:
- Make a statistical estimate of predictive power (which increases with larger samples)
- Prune branches with low predictive power
- Chi-squared pruning

- Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
- At the k leaf nodes, the numbers of correct/incorrect examples are $p_1/n_1, \dots, p_k/n_k$
- Chi-squared statistical significance test:
- Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)
- Alternate hypothesis: examples not randomly chosen (X is relevant)

- Prune X if testing X is not statistically significant

- Let $Z = \sum_{i=1}^{k} \left[ \frac{(p_i - p_i')^2}{p_i'} + \frac{(n_i - n_i')^2}{n_i'} \right]$
- Where $p_i' = p\,\frac{p_i + n_i}{p + n}$ and $n_i' = n\,\frac{p_i + n_i}{p + n}$ are the expected numbers of true/false examples at leaf node i if the null hypothesis holds

- Z is a statistic that is approximately drawn from the chi-squared distribution with k - 1 degrees of freedom
- Look up the p-value of Z from a table, and prune if p-value > $\alpha$ for some $\alpha$ (usually ~0.05)
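The test can be sketched in Python for the common case of a Boolean predicate, i.e. k = 2 leaves. The sketch assumes one degree of freedom for a two-way split, for which the chi-squared tail probability has the closed form erfc(sqrt(Z/2)), so no table or external library is needed. The function names are illustrative, and the counts are read as positive/negative examples at each leaf.

```python
import math

def chi_squared_split(children):
    """children: list of (p_i, n_i) positive/negative counts, one pair per leaf.

    Returns (Z, p_value) for a two-way split (k = 2, one degree of freedom).
    """
    if len(children) != 2:
        raise ValueError("this sketch only handles k = 2 leaves")
    p = sum(pi for pi, _ in children)
    n = sum(ni for _, ni in children)
    if p == 0 or n == 0:
        raise ValueError("node is already pure; nothing to test")
    z = 0.0
    for pi, ni in children:
        size = pi + ni
        if size == 0:
            continue                      # an empty leaf contributes nothing
        exp_p = p * size / (p + n)        # expected positives under the null hypothesis
        exp_n = n * size / (p + n)        # expected negatives under the null hypothesis
        z += (pi - exp_p) ** 2 / exp_p + (ni - exp_n) ** 2 / exp_n
    # Tail probability of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(z / 2))
    return z, p_value

def should_prune(children, alpha=0.05):
    """Prune when the split is not statistically significant."""
    _, p_value = chi_squared_split(children)
    return p_value > alpha

# Counts taken from the 13-example training set used earlier.
print(chi_squared_split([(6, 2), (0, 5)]), should_prune([(6, 2), (0, 5)]))   # test on A
print(chi_squared_split([(4, 4), (2, 3)]), should_prune([(4, 4), (2, 3)]))   # test on E
```

With these counts the test on A is significant at the 0.05 level (Z is about 6.96) and would be kept, while the test on E is not and would be pruned.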


- Incorrect examples
- Missing data
- Multi-valued and continuous attributes

- Simple change: consider splits on all values A can take on
- Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
- More values => dataset split into smaller example sets when picking attributes
- Smaller example sets => more likely to fit well to spurious noise
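The caveat can be seen with an extreme case, sketched here in Python: an attribute that takes a different value on every example (a hypothetical ID number, which is not part of the original data) splits the training set into singletons and so leaves zero training errors, even though it says nothing about unseen examples. The labels and predicate A are those of the 13-example training set used earlier; `split_errors` is an illustrative helper.

```python
from collections import Counter, defaultdict

def split_errors(values, labels):
    """Training errors after a multi-way split with majority rule in every branch."""
    branches = defaultdict(Counter)
    for v, y in zip(values, labels):
        branches[v][y] += 1
    # In each branch, everything except the majority label is misclassified.
    return sum(sum(c.values()) - max(c.values()) for c in branches.values())

# CONCEPT is True for examples 6-10 and 13.
labels = [i in {6, 7, 8, 9, 10, 13} for i in range(1, 14)]
a_values = [i >= 6 for i in range(1, 14)]   # predicate A: True on examples 6-13
ids = list(range(1, 14))                    # hypothetical attribute: unique ID per example

print(split_errors(a_values, labels), split_errors(ids, labels))
```

This prints `2 0`: judged by training errors alone, the irrelevant ID attribute looks better than A.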


- Continuous attributes can be converted into logical ones via thresholds
- X => X<a

- When considering splitting on X, pick the threshold a to minimize # of errors / entropy
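A minimal sketch of threshold selection by error count, in Python. It assumes candidate thresholds are the midpoints between consecutive distinct values of X, which is one common choice rather than something stated above; the data are made up for illustration, and the helper names are hypothetical.

```python
def split_errors(xs, ys, a):
    """Errors of the test X < a with majority rule on each side."""
    total = 0
    for side in (True, False):
        labels = [y for x, y in zip(xs, ys) if (x < a) == side]
        total += min(sum(labels), len(labels) - sum(labels))
    return total

def best_threshold(xs, ys):
    """Try midpoints between consecutive distinct values; return (a, errors).

    Assumes xs has at least two distinct values. Ties go to the smallest a.
    """
    values = sorted(set(xs))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return min(((a, split_errors(xs, ys, a)) for a in candidates),
               key=lambda t: t[1])

# Made-up data for illustration only.
xs = [3, 5, 8, 12, 15, 18, 21, 25]
ys = [False, False, True, False, False, True, True, True]
print(best_threshold(xs, ys))
```

For this data the result is `(16.5, 1)`: the test X < 16.5 misclassifies only the example at X = 8. Minimizing entropy instead would only require swapping the scoring function.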

- With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: the (x1, x2) plane with the decision boundaries of three successively deeper trees: one split on x1>=20; splits on x1>=20 and x2>=10; splits on x1>=20, x2>=10 and x2>=15.]

- With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting threshold and maximum depth:
- 1?
- 2?
- 3?

- Describe the appearance and the complexity of these decision boundaries

- Next class:
- Neural networks & function learning
- R&N 18.6-7