- 66 Views
- Uploaded on
- Presentation posted in: General

Classification with Decision Trees

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Classification with Decision Trees

Instructor: Qiang Yang

Hong Kong University of Science and Technology

Thanks: Eibe Frank and Jiawei Han

- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color=red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split training examples into distinct classes as much as possible
- A new case is classified by following a matching path to a leaf node.

Outlook

sunny

overcast

rain

overcast

humidity

windy

P

high

normal

false

true

N

P

N

P

- Top-down tree construction
- At start, all training examples are at the root.
- Partition the examples recursively by choosing one attribute each time.

- Bottom-up tree pruning
- Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

- At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose.
- Typical goodness functions:
- information gain (ID3/C4.5)
- information gain ratio
- gini index

- Which is the best attribute?
- The one which will result in the smallest tree
- Heuristic: choose the attribute that produces the “purest” nodes

- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets that an attribute produces

- Strategy: choose attribute that results in greatest information gain

- Information is measured in bits
- Given a probability distribution, the info required to predict an event is the distribution’s entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)

- Formula for computing the entropy:
- Suppose a set S has n values: V1, V2, …Vn, where Vi has proportion pi,
- E.g., the weather data has 2 values: Play=P and Play=N. Thus, p1=9/14, p2=5/14.

- “Outlook” = “Sunny”:
- “Outlook” = “Overcast”:
- “Outlook” = “Rainy”:
- Expected information for attribute:

Note: this is

normally not

defined.

- Information gain: information before splitting – information after splitting
- Information gain for attributes from weather data:

- Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can’t be split any further

- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
- Information gain is biased towards choosing attributes with a large number of values
- This may result in overfitting (selection of an attribute that is non-optimal for prediction)

- Another problem: fragmentation

- Gain ratio: a modification of the information gain that reduces its bias on high-branch attributes
- Gain ratio takes number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account
- Also called split ratio

- Intrinsic information: entropy of distribution of instances into branches
- (i.e. how much info do we need to tell which branch an instance belongs to)

- IntrinsicInfo should be
- Large when data is evenly spread over all branches
- Small when all data belong to one branch

- Gain ratio (Quinlan’86) normalizes info gain by IntrinsicInfo:

- Example: intrinsic information for ID code
- Importance of attribute decreases as intrinsic information gets larger
- Example of gain ratio:
- Example:

- “Outlook” still comes out top
- However: “ID code” has greater gain ratio
- Standard fix: ad hoc test to prevent splitting on that type of attribute

- Problem with gain ratio: it may overcompensate
- May choose an attribute just because its intrinsic information is very low
- Standard fix:
- First, only consider attributes with greater than average information gain
- Then, compare them on gain ratio

- If a data set T contains examples from n classes, gini index, gini(T) is defined as
where pj is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.

- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
- The attribute providing smallest ginisplit(T) is chosen to split the node.

- When all cases have the same class. The leaf node is labeled by this class.
- When there is no available attribute. The leaf node is labeled by the majority class.
- When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.
- You can make a decision at every node in a decision tree!
- How?

- Pruning simplifies a decision tree to prevent overfitting to noise in the data
- Two main pruning strategies:
- Postpruning: takes a fully-grown decision tree and discards unreliable parts
- Prepruning: stops growing a branch when information becomes unreliable

- Postpruning preferred in practice because of early stopping in prepruning

- Usually based on statistical significance test
- Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
- Most popular test: chi-squared test
- ID3 used chi-squared test in addition to information gain
- Only statistically significant attributes where allowed to be selected by information gain procedure

- Builds full tree first and prunes it afterwards
- Attribute interactions are visible in fully-grown tree

- Problem: identification of subtrees and nodes that are due to chance effects
- Two main pruning operations:
- Subtree replacement
- Subtree raising

- Possible strategies: error estimation, significance testing, MDL principle

- Bottom-up: tree is considered for replacement once all its subtrees have been considered

- Pruning operation is performed if this does not increase the estimated error
- Of course, error on the training data is not a useful estimator (would result in almost no pruning)
- One possibility: using hold-out set for pruning (reduced-error pruning)
- C4.5’s method: using upper limit of 25% confidence interval derived from the training data
- Standard Bernoulli-process-based method

- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples:

- Consider classifying E examples incorrectly out of N examples as observing E events in N trials in the binomial distribution.
- For a given confidence level CF, the upper limit on the error rate over the whole population is with CF% confidence.

Pessimistic Estimate

- Usage in Statistics: Sampling error estimation
- Example:
- population: 1,000,000 people,
- population mean: percentage of the left handed people
- sample: 100 people
- sample mean: 6 left-handed
- How to estimate the REAL population mean?

- Example:

Possibility(%)

25% confidence interval

6

2

10

L0.25(100,6)

U0.25(100,6)

- Error estimate for subtree is weighted sum of error estimates for all its leaves
- Error estimate for a node:
- If c = 25% then z = 0.69 (from normal distribution)
- f is the error on the training data
- N is the number of instances covered by the leaf

Outlook

?

sunny

cloudy

yes

overcast

yes

yes

no

Example cont.

- Consider a subtree rooted at Outlook with 3 leaf nodes:
- Sunny: Play = yes : (0 error, 6 instances)
- Overcast: Play= yes: (0 error, 9 instances)
- Cloudy: Play = no (0 error, 1 instance)

- The estimated error for this subtree is
- 6*0.074+9*0.050+1*0.323=1.217

- If the subtree is replaced with the leaf “yes”, the estimated error is
- So no pruning is performed

If we have a

single leaf node

If we split, we can

compute average

error using

ratios 6:2:6

this gives 0.51

f=5/14 e=0.46

f=0.33 e=0.47

f=0.5 e=0.72

f=0.33 e=0.47

- Classification Trees
- Current node
- Children nodes (L, R):

- Decision Trees
- Current node
- Children nodes (L, R):

- GINI index used in CART (STD= )
- Current node
- Children nodes (L, R):

- Most algorithms assume data can fit in memory.
- Recent efforts focus on disk-resident implementation for decision trees.
- Random sampling
- Partitioning
- Examples
- SLIQ (EDBT’96 -- [MAR96])
- SPRINT (VLDB96 -- [SAM96])
- PUBLIC (VLDB98 -- [RS98])
- RainForest (VLDB98 -- [GRG98])

- Consider the following variations of decision trees

- Instead of choosing a class label as the majority class label, use KNN to choose a class label

- For each leave node, use all the available information we know about the test case to make decisions
- Instead of using the majority rule, use probability/likelihood to make decisions

- If a node has N1 positive class labels P, and N2 negative class labels N,
- If N1> N2, then choose P
- The error rate = N2/(N1+N2) at this node
- The expected error at a parent node can be calculated as weighted sum of the error rates at each child node
- The weights are the proportion of training data in each child

- When the FP and FN have different costs, the leaf node label is different depending on the costs:
- If growing a tree has a smaller total cost,
- then choose an attribute with minimal total cost.
- Otherwise, stop and form a leaf.

- Label leaf according to minimal total cost:
- Suppose the leaf have P positive examples and N negative examples

- FP denotes the cost of a false positive example and FN false negative
- If (P×FN N×FP)THEN label = positive ELSE label = negative

- If growing a tree has a smaller total cost,

- Attribute selection criterion: minimal total cost(Ctotal = Cmc + Ctest) instead of minimal entropy in C4.5
- Typically, if there are missing values, then to obtain a value for a missing attribute (say Temperature) will incur new cost
- But may increase accuracy of prediction, thus reducing the miss classification costs

- In general, there is a balance between the two costs
- We care about the total cost

- Suppose that you have built a decision tree from a set of training data
- Now suppose that some additional training data comes at a regular interval, in fast speed
- How do you adapt the existing tree to fit the new data?
- Suggest an ‘online’ algorithm.
- Hint: find a paper online on this topic