## Decision tree


Outline

- Basic concepts
- Main issues
- Advanced topics

Classification and estimation problems

- Given:
  - a finite set of (input) attributes: the features
    - Ex: District, House type, Income, Previous customer
  - a target attribute: the goal
    - Ex: Outcome: {Nothing, Respond}
  - training data: a set of classified examples in attribute-value representation
    - Ex: the previous table
- Predict the value of the goal given the values of the input attributes
  - If the goal is a discrete variable: a classification problem
  - If the goal is a continuous variable: an estimation problem

[Example decision tree (diagram): the root tests an attribute with branches Suburban (3/5), Urban (3/5), and Rural (4/4, a Respond leaf); the Suburban branch is split further on House type (Detached (2/2), Semi-detached (3/3)) and the Urban branch on Previous customer (Yes (3/3), No (2/2)), with the leaves labeled Respond or Nothing.]

Decision tree representation

- Each internal node is a test:
- Theoretically, a node can test multiple attributes
- In most systems, a node tests exactly one attribute
- Each branch corresponds to test results
- A branch corresponds to an attribute value or a range of attribute values
- Each leaf node assigns
- a class: decision tree
- a real value: regression tree

What’s a best decision tree?

- “Best”: you need a bias (e.g., prefer the “smallest” tree). Least depth? Fewest nodes? Which trees are the best predictors of unseen data?
- Occam's Razor: we prefer the simplest hypothesis that fits the data.

Find a decision tree that is as small as possible and fits the data

Finding a smallest decision tree

- A decision tree can represent any discrete function of the inputs: y=f(x1, x2, …, xn)
- The space of decision trees is too big for a systematic search for a smallest decision tree.
- Solution: greedy algorithm

Basic algorithm: top-down induction

- Find the “best” decision attribute, A, and assign A as the decision attribute for the node
- For each value of A, create a new branch, and divide up the training examples
- Repeat the process until the gain is small enough (see the sketch below)
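
For concreteness, here is a minimal Python sketch of this greedy loop, not taken from the slides: examples are assumed to be dicts with a "target" key for the goal attribute, and `information_gain` is the quality measure sketched under Q1 below. All names are illustrative.

```python
from collections import Counter

def majority_class(examples):
    """The most common target value among the examples."""
    return Counter(e["target"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, min_gain=0.0):
    """Greedy top-down induction: pick the best attribute, split, recurse."""
    classes = {e["target"] for e in examples}
    if len(classes) == 1:                      # pure node: stop
        return classes.pop()
    if not attributes:                         # no tests left: majority vote
        return majority_class(examples)
    gain, best = max((information_gain(examples, a), a) for a in attributes)
    if gain <= min_gain:                       # stop when the gain is small enough
        return majority_class(examples)
    tree = {"attribute": best, "branches": {}}
    for value in {e[best] for e in examples}:  # one branch per observed value
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree["branches"][value] = build_tree(subset, rest, min_gain)
    return tree

def classify(tree, example):
    """Follow the tests from the root down to a leaf (a class label)."""
    while isinstance(tree, dict):
        # unseen attribute values fall through to None (a misclassification)
        tree = tree["branches"].get(example[tree["attribute"]])
    return tree
```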

Major issues

Q1: Choosing best attribute: what quality measure to use?

Q2: Determining when to stop splitting: avoid overfitting

Q3: Handling continuous attributes

Q4: Handling training data with missing attribute values

Q5: Handling attributes with different costs

Q6: Dealing with continuous goal attribute

Q1: What quality measure

- Information gain
- Gain Ratio
- …

Entropy of a training set

- S is a sample of training examples
- Entropy is one way of measuring the impurity of S (a sketch follows):
  Entropy(S) = −Σ_c p_c log2(p_c)
- p_c is the proportion of examples in S whose target attribute has value c.
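
As a small illustration (the function name and list-of-labels interface are assumptions, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over values c of p_c * log2(p_c)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# The sample used in the example below: S = [9+, 5-]
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))   # 0.94
```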

Information Gain

- Gain(S, A) = expected reduction in entropy due to sorting on A:
  Gain(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v), where S_v is the subset of S with A = v
- Choose the A with the max information gain

(a.k.a. choose the A with the min average entropy)

An example: S = [9+, 5−], Entropy(S) = 0.940

[Diagram: two candidate splits of S]
- Humidity: High → [3+, 4−] (E = 0.985); Normal → [6+, 1−] (E = 0.592)
- Wind: Weak → [6+, 2−] (E = 0.811); Strong → [3+, 3−] (E = 1.00)

InfoGain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

InfoGain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
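
The same computation as a Python sketch, reusing the `entropy` helper above; this is also the quality measure assumed by the earlier `build_tree` sketch. On the class counts shown above it reproduces the 0.151 and 0.048 values.

```python
def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    labels = [e["target"] for e in examples]
    gain = entropy(labels)
    for value in {e[attribute] for e in examples}:
        subset = [e["target"] for e in examples if e[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain
```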

Other quality measures

- Problem of information gain:
  - Information gain prefers attributes with many values.
- An alternative: Gain Ratio

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −Σ_i (|S_i|/|S|) log2(|S_i|/|S|), where S_i is the subset of S for which A has value v_i.
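
A sketch under the same assumptions; note that SplitInformation is simply the entropy of the attribute's own value distribution, so the earlier `entropy` helper can be reused.

```python
def gain_ratio(examples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    split_info = entropy([e[attribute] for e in examples])
    if split_info == 0.0:          # A has a single value here: useless split
        return 0.0
    return information_gain(examples, attribute) / split_info
```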

Q2: avoiding overfitting

- Overfitting occurs when the decision tree captures too much detail, or noise, in the training data.
- Consider the error of hypothesis h over
  - the training data: ErrorTrain(h)
  - the entire distribution D of the data: ErrorD(h)
- A hypothesis h overfits the training data if there is an alternative hypothesis h’ such that
  - ErrorTrain(h) < ErrorTrain(h’), and
  - ErrorD(h) > ErrorD(h’)

How to avoid overfitting

- Stop growing the tree earlier
  - Ex: InfoGain < threshold
  - Ex: number of examples in a node < threshold
  - …
- Grow full tree, then post-prune

In practice, the latter works better than the former.

Post-pruning

- Split data into training and validation set
- Do until further pruning is harmful:
- Evaluate impact on validation set of pruning each possible node (plus those below it)
- Greedily remove the node whose removal most improves performance on the validation set
- Produces a smaller tree with the best performance measure (see the sketch below)
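
A sketch of reduced-error pruning over the dict-based trees from the earlier `build_tree` sketch (reusing `majority_class` and `classify`). For simplicity it judges each subtree only on the validation examples that reach it, a common simplification of the global greedy procedure.

```python
def prune(tree, train, validation):
    """Bottom-up: replace a subtree with a majority-class leaf whenever the
    leaf classifies the validation examples reaching it at least as well."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attr = tree["attribute"]
    for value, subtree in list(tree["branches"].items()):
        sub_train = [e for e in train if e[attr] == value]
        sub_val = [e for e in validation if e[attr] == value]
        if sub_train:
            tree["branches"][value] = prune(subtree, sub_train, sub_val)
    leaf = majority_class(train)
    leaf_hits = sum(e["target"] == leaf for e in validation)
    tree_hits = sum(classify(tree, e) == e["target"] for e in validation)
    return leaf if leaf_hits >= tree_hits else tree
```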

Performance measure

- Accuracy:
- on validation data
- K-fold cross validation
- Misclassification cost: Sometimes more accuracy is desired for some classes than others.
- MDL: size(tree) + errors(tree)

Rule post-pruning

- Convert tree to equivalent set of rules
- Prune each rule independently of others
- Sort final rules into desired sequence for use
- Perhaps most frequently used method (e.g., C4.5)

Q3: handling numeric attributes

- Continuous attribute → discrete attribute
- Example
  - Original attribute: Temperature = 82.5
  - New attribute: (Temperature > 72.3) ∈ {t, f}

Question: how to choose thresholds?

Choosing thresholds for a continuous attribute

- Sort the examples according to the continuous attribute.
- Identify adjacent examples that differ in their target classification → a set of candidate thresholds
- Choose the candidate with the highest information gain.
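
A sketch of the candidate-generation step, under the same example representation as before:

```python
def candidate_thresholds(examples, attribute):
    """Midpoints between adjacent (sorted) examples of different classes."""
    ordered = sorted(examples, key=lambda e: e[attribute])
    return [(a[attribute] + b[attribute]) / 2
            for a, b in zip(ordered, ordered[1:])
            if a["target"] != b["target"] and a[attribute] != b[attribute]]
```

Each threshold t then defines a boolean test (A > t) that can be scored with information gain like any other attribute.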

Q4: Unknown attribute values

- Treat “blank” as just another attribute value.
- Assign the most common value of A among the training data at node n.
- Assign the most common value of A among the training data at node n that have the same target class.
- Assign probability p_i to each possible value v_i of A
  - Assign a fraction (p_i) of the example to each descendant in the tree (see the sketch below)
  - This method is used in C4.5.
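
A sketch of the fractional-count idea, assuming missing values are stored as None (the bookkeeping for weighted examples is omitted):

```python
from collections import Counter

def value_probabilities(examples, attribute):
    """p_i for each observed value v_i of A, ignoring missing entries."""
    known = [e[attribute] for e in examples if e[attribute] is not None]
    counts = Counter(known)
    return {value: n / len(known) for value, n in counts.items()}

# When an example reaches a test on A but its A-value is missing, send a
# fraction p_i of its weight down each branch v_i instead of dropping it.
```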

Q5: Attributes with cost

- Consider medical diagnosis, where each test (e.g., a blood test) has a cost
- Question: how to learn a consistent tree with low expected cost?
- One approach: replace gain with Gain²(S, A) / Cost(A)
  - Tan and Schlimmer (1990)

Q6: Dealing with continuous goal attribute: regression trees

- A variant of decision trees
- Estimation problem: approximate real-valued functions: e.g., the crime rate
- A leaf node is marked with a real value or a linear function: e.g., the mean of the target values of the examples at the node.
- Measure of impurity: e.g., variance, standard deviation, …
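
For completeness, a sketch of the variance impurity:

```python
def variance(values):
    """Impurity of a regression-tree node: variance of its target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Splits are chosen to maximize the (size-weighted) reduction in variance;
# a leaf then predicts the mean of the target values that reach it.
```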

Summary of Major issues

Q1: Choosing the best attribute: different quality measures.

Q2: Determining when to stop splitting: stop earlier or post-pruning

Q3: Handling continuous attributes: find the breakpoints

Summary of major issues (cont)

Q4: Handling training data with missing attribute values: blank value, most common value, or fractional count

Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.

Q6: Dealing with continuous goal attribute: various ways of building regression trees.

Common algorithms

- ID3
- C4.5
- CART

ID3

- Proposed by Quinlan (as is C4.5)
- Can handle basic cases: discrete attributes, no missing information, etc.
- Information gain as quality measure

C4.5

- An extension of ID3:
- Several quality measures
- Incomplete information (missing attribute values)
- Numerical (continuous) attributes
- Pruning of decision trees
- Rule derivation
- Random mode and batch mode

CART

- CART (classification and regression tree)
- Proposed by Breiman et al. (1984)
- Constant numerical values in leaves
- Variance as measure of impurity

Strengths of decision tree methods

- Ability to generate understandable rules
- Ease of calculation at classification time
- Ability to handle both continuous and categorical variables
- Ability to clearly indicate best attributes

The weaknesses of decision tree methods

- Greedy algorithm: no global optimization
- Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
- Expensive to train: sorting, combination of attributes, calculating quality measures, etc.
- Trouble with non-rectangular regions: decision trees carve the decision space into axis-parallel rectangular boxes, which may not correspond well with the actual distribution of records.

Combining multiple models

- The inherent instability of top-down decision tree induction: different training datasets from a given problem domain will produce quite different trees.
- Techniques:
- Bagging
- Boosting

Bagging

- Introduced by Breiman
- It creates multiple decision trees, each trained on a bootstrap sample drawn from the training data with replacement.
- It then combines the trees’ predictions, typically by majority vote for classification or averaging for estimation (see the sketch below).
- This addresses some of the instability inherent in regular ID3.
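
A sketch of bagging on top of the `build_tree` and `classify` helpers from the earlier sketches (the function names are illustrative):

```python
import random
from collections import Counter

def bagging(examples, attributes, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(examples) for _ in examples], attributes)
            for _ in range(n_trees)]

def bagged_predict(trees, example):
    """Combine the trees' predictions by majority vote."""
    return Counter(classify(t, example) for t in trees).most_common(1)[0][0]
```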

Boosting

- Introduced by Freund and Schapire
- It builds trees in sequence, increasing the weights of the training instances that the current trees classify incorrectly.
- These weights refocus the algorithm on the hard examples; the final prediction is a vote of the trees, each weighted by its accuracy (see the sketch below).
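
A sketch of the instance-reweighting step in the AdaBoost formulation (an assumption about which boosting variant is meant), for weights that sum to 1 and a weighted error strictly between 0 and 0.5:

```python
import math

def adaboost_reweight(weights, predictions, targets):
    """One boosting round: upweight the instances the current tree got wrong."""
    err = sum(w for w, p, y in zip(weights, predictions, targets) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)    # the tree's vote in the ensemble
    new = [w * math.exp(alpha if p != y else -alpha)
           for w, p, y in zip(weights, predictions, targets)]
    total = sum(new)                           # renormalize to a distribution
    return [w / total for w in new], alpha
```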

Summary

- Basic case:
- Discrete input attributes
- Discrete goal attribute
- No missing attribute values
- Same cost for all tests and all kinds of misclassification.
- Extended cases:
- Continuous attributes
- Real-valued goal attribute
- Some examples miss some attribute values
- Some tests are more expensive than others.
- Some misclassifications are more serious than others.

Summary (cont)

- Basic algorithm:
- greedy algorithm
- top-down induction
- Bias for small trees
- Major issues:

Uncovered issues

- Incremental decision tree induction?
- How can a decision relate to other decisions? What is the order of making the decisions? (e.g., POS tagging)
- What's the difference between decision tree and decision list?
