Decision tree. LING 572 Fei Xia 1/10/06. Outline. Basic concepts Main issues Advanced topics. Basic concepts. A classification problem. Classification and estimation problems. Given a finite set of (input) attributes: features Ex: District, House type, Income, Previous customer
Find a decision tree that is as small as possible and fits the data
Q1: Choosing best attribute: what quality measure to use?
Q2: Determining when to stop splitting: avoid overfitting
Q3: Handling continuous attributes
Q4: Handling training data with missing attribute values
Q5: Handing attributes with different costs
Q6: Dealing with continuous goal attribute
(a.k.a. choose the A with the min average entropy)
Where Si is subset of S for which A has value vi.
In practice, the latter works better than the former.
Question: how to choose thresholds?
Q1: Choosing best attribute: different quality measure.
Q2: Determining when to stop splitting: stop earlier or post-pruning
Q3: Handling continuous attributes: find the breakpoints
Q4: Handling training data with missing attribute values: blank value, most common value, or fractional count
Q5: Handing attributes with different costs: use a quality measure that includes the cost factors.
Q6: Dealing with continuous goal attribute: various ways of building regression trees.