Classification with decision trees
This presentation is the property of its rightful owner.
Sponsored Links
1 / 41

Classification with Decision Trees PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Classification with Decision Trees. Instructor: Qiang Yang Hong Kong University of Science and Technology [email protected] Thanks: Eibe Frank and Jiawei Han. DECISION TREE [Quinlan93]. An internal node represents a test on an attribute.

Download Presentation

Classification with Decision Trees

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Classification with Decision Trees

Instructor: Qiang Yang

Hong Kong University of Science and Technology

[email protected]

Thanks: Eibe Frank and Jiawei Han


  • An internal node represents a test on an attribute.

  • A branch represents an outcome of the test, e.g., Color=red.

  • A leaf node represents a class label or class label distribution.

  • At each node, one attribute is chosen to split training examples into distinct classes as much as possible

  • A new case is classified by following a matching path to a leaf node.

Training Set


















Building Decision Tree [Q93]

  • Top-down tree construction

    • At start, all training examples are at the root.

    • Partition the examples recursively by choosing one attribute each time.

  • Bottom-up tree pruning

    • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

Choosing the Splitting Attribute

  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose.

  • Typical goodness functions:

    • information gain (ID3/C4.5)

    • information gain ratio

    • gini index

Which attribute to select?

A criterion for attribute selection

  • Which is the best attribute?

    • The one which will result in the smallest tree

    • Heuristic: choose the attribute that produces the “purest” nodes

  • Popular impurity criterion: information gain

    • Information gain increases with the average purity of the subsets that an attribute produces

  • Strategy: choose attribute that results in greatest information gain

Computing information

  • Information is measured in bits

    • Given a probability distribution, the info required to predict an event is the distribution’s entropy

    • Entropy gives the information required in bits (this can involve fractions of bits!)

  • Formula for computing the entropy:

    • Suppose a set S has n values: V1, V2, …Vn, where Vi has proportion pi,

    • E.g., the weather data has 2 values: Play=P and Play=N. Thus, p1=9/14, p2=5/14.

Example: attribute “Outlook”

  • “Outlook” = “Sunny”:

  • “Outlook” = “Overcast”:

  • “Outlook” = “Rainy”:

  • Expected information for attribute:

Note: this is

normally not


Computing the information gain

  • Information gain: information before splitting – information after splitting

  • Information gain for attributes from weather data:

Continuing to split

The final decision tree

  • Note: not all leaves need to be pure; sometimes identical instances have different classes

     Splitting stops when data can’t be split any further

Highly-branching attributes

  • Problematic: attributes with a large number of values (extreme case: ID code)

  • Subsets are more likely to be pure if there is a large number of values

    • Information gain is biased towards choosing attributes with a large number of values

    • This may result in overfitting (selection of an attribute that is non-optimal for prediction)

  • Another problem: fragmentation

The gain ratio

  • Gain ratio: a modification of the information gain that reduces its bias on high-branch attributes

  • Gain ratio takes number and size of branches into account when choosing an attribute

    • It corrects the information gain by taking the intrinsic information of a split into account

    • Also called split ratio

  • Intrinsic information: entropy of distribution of instances into branches

    • (i.e. how much info do we need to tell which branch an instance belongs to)

Gain Ratio

  • IntrinsicInfo should be

    • Large when data is evenly spread over all branches

    • Small when all data belong to one branch

  • Gain ratio (Quinlan’86) normalizes info gain by IntrinsicInfo:

Computing the gain ratio

  • Example: intrinsic information for ID code

  • Importance of attribute decreases as intrinsic information gets larger

  • Example of gain ratio:

  • Example:

Gain ratios for weather data

More on the gain ratio

  • “Outlook” still comes out top

  • However: “ID code” has greater gain ratio

    • Standard fix: ad hoc test to prevent splitting on that type of attribute

  • Problem with gain ratio: it may overcompensate

    • May choose an attribute just because its intrinsic information is very low

    • Standard fix:

      • First, only consider attributes with greater than average information gain

      • Then, compare them on gain ratio

Gini Index

  • If a data set T contains examples from n classes, gini index, gini(T) is defined as

    where pj is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.

  • After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as

  • The attribute providing smallest ginisplit(T) is chosen to split the node.

Stopping Criteria

  • When all cases have the same class. The leaf node is labeled by this class.

  • When there is no available attribute. The leaf node is labeled by the majority class.

  • When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.

  • You can make a decision at every node in a decision tree!

    • How?


  • Pruning simplifies a decision tree to prevent overfitting to noise in the data

  • Two main pruning strategies:

    • Postpruning: takes a fully-grown decision tree and discards unreliable parts

    • Prepruning: stops growing a branch when information becomes unreliable

  • Postpruning preferred in practice because of early stopping in prepruning


  • Usually based on statistical significance test

  • Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node

  • Most popular test: chi-squared test

  • ID3 used chi-squared test in addition to information gain

    • Only statistically significant attributes where allowed to be selected by information gain procedure


  • Builds full tree first and prunes it afterwards

    • Attribute interactions are visible in fully-grown tree

  • Problem: identification of subtrees and nodes that are due to chance effects

  • Two main pruning operations:

    • Subtree replacement

    • Subtree raising

  • Possible strategies: error estimation, significance testing, MDL principle

Subtree replacement

  • Bottom-up: tree is considered for replacement once all its subtrees have been considered

Estimating error rates

  • Pruning operation is performed if this does not increase the estimated error

  • Of course, error on the training data is not a useful estimator (would result in almost no pruning)

  • One possibility: using hold-out set for pruning (reduced-error pruning)

  • C4.5’s method: using upper limit of 25% confidence interval derived from the training data

    • Standard Bernoulli-process-based method

Post-pruning in C4.5

  • Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.

    • Method 1: compute accuracy using examples not seen by the algorithm.

    • Method 2: estimate accuracy using the training examples:

  • Consider classifying E examples incorrectly out of N examples as observing E events in N trials in the binomial distribution.

    • For a given confidence level CF, the upper limit on the error rate over the whole population is with CF% confidence.

Pessimistic Estimate

  • Usage in Statistics: Sampling error estimation

    • Example:

      • population: 1,000,000 people,

      • population mean: percentage of the left handed people

      • sample: 100 people

      • sample mean: 6 left-handed

      • How to estimate the REAL population mean?


25% confidence interval






C4.5’s method

  • Error estimate for subtree is weighted sum of error estimates for all its leaves

  • Error estimate for a node:

  • If c = 25% then z = 0.69 (from normal distribution)

  • f is the error on the training data

  • N is the number of instances covered by the leaf











Example cont.

  • Consider a subtree rooted at Outlook with 3 leaf nodes:

    • Sunny: Play = yes : (0 error, 6 instances)

    • Overcast: Play= yes: (0 error, 9 instances)

    • Cloudy: Play = no (0 error, 1 instance)

  • The estimated error for this subtree is

    • 6*0.074+9*0.050+1*0.323=1.217

  • If the subtree is replaced with the leaf “yes”, the estimated error is

  • So no pruning is performed

Another Example

If we have a

single leaf node

If we split, we can

compute average

error using

ratios 6:2:6

this gives 0.51

f=5/14 e=0.46

f=0.33 e=0.47

f=0.5 e=0.72

f=0.33 e=0.47

Other Trees

  • Classification Trees

    • Current node

    • Children nodes (L, R):

  • Decision Trees

    • Current node

    • Children nodes (L, R):

  • GINI index used in CART (STD= )

    • Current node

    • Children nodes (L, R):

Efforts on Scalability

  • Most algorithms assume data can fit in memory.

  • Recent efforts focus on disk-resident implementation for decision trees.

  • Random sampling

  • Partitioning

  • Examples

    • SLIQ (EDBT’96 -- [MAR96])

    • SPRINT (VLDB96 -- [SAM96])

    • PUBLIC (VLDB98 -- [RS98])

    • RainForest (VLDB98 -- [GRG98])


  • Consider the following variations of decision trees

1. Apply KNN to each leaf node

  • Instead of choosing a class label as the majority class label, use KNN to choose a class label

2. Apply Naïve Bayesian at each leaf node

  • For each leave node, use all the available information we know about the test case to make decisions

  • Instead of using the majority rule, use probability/likelihood to make decisions

3. Use error rates instead of entropy

  • If a node has N1 positive class labels P, and N2 negative class labels N,

    • If N1> N2, then choose P

    • The error rate = N2/(N1+N2) at this node

    • The expected error at a parent node can be calculated as weighted sum of the error rates at each child node

      • The weights are the proportion of training data in each child

Cost Sensitive Decision Trees

  • When the FP and FN have different costs, the leaf node label is different depending on the costs:

    • If growing a tree has a smaller total cost,

      • then choose an attribute with minimal total cost.

      • Otherwise, stop and form a leaf.

    • Label leaf according to minimal total cost:

      • Suppose the leaf have P positive examples and N negative examples

    • FP denotes the cost of a false positive example and FN false negative

      • If (P×FN N×FP)THEN label = positive ELSE label = negative

5. When there is missing value in the test data, we allow tests to be done

  • Attribute selection criterion: minimal total cost(Ctotal = Cmc + Ctest) instead of minimal entropy in C4.5

  • Typically, if there are missing values, then to obtain a value for a missing attribute (say Temperature) will incur new cost

    • But may increase accuracy of prediction, thus reducing the miss classification costs

  • In general, there is a balance between the two costs

  • We care about the total cost

Stream Data

  • Suppose that you have built a decision tree from a set of training data

  • Now suppose that some additional training data comes at a regular interval, in fast speed

  • How do you adapt the existing tree to fit the new data?

    • Suggest an ‘online’ algorithm.

    • Hint: find a paper online on this topic

  • Login