# Efficient Determination of Dynamic Split Points in a Decision Tree


### Efficient Determination of Dynamic Split Points in a Decision Tree

Max Chickering

Chris Meek

Robert Rounthwaite

Outline
• Probabilistic decision trees
• Simple algorithm for choosing split points
• Empirical evaluation
Probabilistic Decision Trees

Input variables: X1, …, Xn

Target variable: Y

variable = feature = attribute = ...

A probabilistic decision tree is a mapping from X1, …, Xn to p(Y | X1, …, Xn).

Example

X1 binary, X2 continuous, Y binary

[Tree diagram: the root splits on X1 (0 / 1); the X1 = 0 leaf has distribution (0.5, 0.5); the X1 = 1 branch splits on X2 (< 0 / ≥ 0), with leaf distributions (0.2, 0.8) and (0.4, 0.6).]

p(Y=0 | X1 = 1, X2 = -15.6) = 0.2
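For concreteness, here is a minimal Python sketch of this example tree (my own illustration; the function and representation are not from the talk):

```python
# Minimal sketch of the example tree above (illustrative, not the authors' code).
# Leaves hold a distribution over the binary target Y: (p(Y=0), p(Y=1)).

def predict(x1, x2):
    """Return p(Y | X1=x1, X2=x2) by walking the tree."""
    if x1 == 0:
        return (0.5, 0.5)      # leaf: X1 = 0
    if x2 < 0:
        return (0.2, 0.8)      # leaf: X1 = 1, X2 < 0
    return (0.4, 0.6)          # leaf: X1 = 1, X2 >= 0

print(predict(1, -15.6)[0])    # 0.2, matching p(Y=0 | X1=1, X2=-15.6) above
```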

Applications Using Probabilities

• Recommending TV shows

p(watch show i | other viewing preferences)

• Predicting demographics from Web usage

p(Age > 34 | web sites visited)

Learning Trees from Data

Automatically learn a tree for p(X | Y, Z) from a table of records:

| Record | X | Y | Z     |
|--------|---|---|-------|
| 1      | 0 | 0 | red   |
| 2      | 2 | 0 | blue  |
| …      | … | … | …     |
| n      | 1 | 7 | green |

[Tree diagram: a split on Z (red vs. green, blue), with leaf distributions (0.2, 0.8) and (0.4, 0.6).]

Greedy Algorithm

[Diagram: candidate splits on X1, X2, …, Xn are considered at each leaf of the current tree; each candidate is scored on the data (Score1(Data), Score2(Data), …, Scoren(Data)), and the best-scoring split is applied.]

Learning Trees From Data

DL : the data relevant to leaf L

[Diagram: a partially grown tree with candidate splits on X2 (0 / 1) at its leaves, highlighting one leaf L.]

Most scores: the increase from a split can be computed locally, using only DL.

Candidate Splits: Discrete Predictors (e.g. Z with values 0, 1, 2)
• Complete split: Z → 0 | 1 | 2
• Hold-one-out: Z → 0 | {1,2}, Z → 1 | {0,2}, Z → 2 | {0,1}
• Arbitrary subsets
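A quick sketch of enumerating the hold-one-out family (illustrative; the helper name is mine):

```python
def hold_one_out_splits(values):
    """For each value v of a discrete predictor, the binary split
    {v} vs. all remaining values."""
    vs = set(values)
    return [({v}, vs - {v}) for v in sorted(vs)]

print(hold_one_out_splits([0, 1, 2]))
# [({0}, {1, 2}), ({1}, {0, 2}), ({2}, {0, 1})]
```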

Continuous Predictors
• Static discretization

+ Simple: reduces the problem to the all-discrete case

- Not context-dependent

• Dynamic discretization

Binary splits: choose a single split point c, giving branches Z < c and Z ≥ c.

Which candidate split points do we consider during search?

Continuous Predictors

Sorted Data for Predictor Z

[Diagram: the values of Z sorted on a number line from -19.3 to 49.1, with candidate split points c1, c2, c3 marked; c1 and c2 lie in the same gap between adjacent data values.]

Most scoring functions (e.g. Bayesian, entropy) cannot distinguish candidate c1 from candidate c2: since no data point falls between them, both induce exactly the same partition of the data.

Continuous Predictors

Try all midpoints (or all values) as candidate splits

[Diagram: a candidate split point at the midpoint between every pair of adjacent sorted values.]

Approach taken by many DT learning systems
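A sketch of the all-midpoints enumeration (illustrative; the function name is mine):

```python
def midpoint_candidates(values):
    """Candidate split points at the midpoint of every pair of
    adjacent distinct values; dominated by the O(n log n) sort."""
    xs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

print(midpoint_candidates([3.0, 1.0, 2.0, 2.0]))  # [1.5, 2.5]
```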

Continuous Predictors

Optimization for Discrete Target

Maximum entropy, Bayesian scores

[Diagram: the sorted values of the predictor, each annotated with its target label (0 / 1); only the points where the target label changes between adjacent values need to be considered as candidate splits.]
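If this is the classic boundary-point optimization (my reading of the slide), a sketch might look like:

```python
def boundary_candidates(pairs):
    """Candidate split midpoints only where the target label changes
    between adjacent sorted values; other midpoints score identically
    under the entropy/Bayesian scores the slide mentions."""
    pts = sorted(pairs)                 # (value, label) pairs, sorted by value
    return [(a + b) / 2.0
            for (a, la), (b, lb) in zip(pts, pts[1:])
            if la != lb and a != b]

data = [(1.0, 0), (2.0, 1), (3.0, 1), (4.0, 0)]
print(boundary_candidates(data))        # [1.5, 3.5]
```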

Complexity Analysis

nL : number of records relevant to leaf L

n : total number of data records

m : number of continuous predictor variables

• Sorting the data

Approach 1: O(m nL log nL) time per leaf

Approach 2: O(m n log n) space per tree

• Scoring the candidate splits (the majority of learning time)

O(m nL) scores per leaf
Expensive for Large Datasets!

Can we do something more scalable?
• Can we choose candidate split points without sorting?
• How many candidate split points do we need to learn good models?
A Simple Solution: Quantile Method
• Fit a distribution to the predictor points
• Find quantiles

[Diagram: a distribution fitted to the data points, divided into three equal-probability regions (1/3 each) by split points c1 and c2.]

For 2 split points, divide into 3 equal-probability regions

A Simple Solution: Quantile Method

For every leaf node in the tree:

• Use the same type of distribution
• Use the same number of split points k
Gaussian Distribution
• Need mean, SD for every continuous predictor
• Constant-time calculation for each quantile point

[Diagram: the data points, with split points placed at quantiles of a Gaussian fitted to them.]
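A minimal sketch of the Gaussian variant, using the quantile positions i/(k+1) for i = 1..k (the function name and details are mine):

```python
from statistics import NormalDist

def gaussian_split_points(values, k):
    """Fit a Gaussian to the data, then return the k points that divide
    it into k+1 equal-probability regions. No sorting required."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    dist = NormalDist(mean, max(var, 1e-12) ** 0.5)  # guard against zero SD
    return [dist.inv_cdf(i / (k + 1)) for i in range(1, k + 1)]
```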

Uniform Distribution
• Need min, max for every continuous predictor
• Constant-time calculation for each quantile point

[Diagram: the data points, with split points placed at quantiles of a uniform distribution over [min, max].]
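The uniform variant is even simpler, since equal-probability regions of a uniform distribution are equal-width intervals (again an illustrative sketch):

```python
def uniform_split_points(values, k):
    """k split points dividing [min, max] into k+1 equal-width regions."""
    lo, hi = min(values), max(values)
    return [lo + i * (hi - lo) / (k + 1) for i in range(1, k + 1)]
```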

Empirical Distribution: K-Tiles

K-tile approach: k = 1 yields the median

O(k n) selection algorithm for a small number of split points k; full sorting is better for large k

[Diagram: the data points, with split points placed at empirical quantiles (k-tiles).]
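One way to get k-tiles without a full sort is a linear-time selection per quantile rank, e.g. via numpy.partition; a sketch (my own, assuming k-tiles are the empirical quantiles at ranks i·n/(k+1)):

```python
import numpy as np

def ktile_split_points(values, k):
    """k empirical split points via selection instead of a full sort:
    one O(n) partition per point gives O(k n) total, so a single
    O(n log n) sort wins once k grows large."""
    a = np.asarray(values, dtype=float)
    n = len(a)
    ranks = [i * n // (k + 1) for i in range(1, k + 1)]
    return [np.partition(a, r)[r] for r in ranks]
```

With k = 1 this selects the element at rank n // 2, i.e. the median, as the slide notes.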

Experiments
• Varied k for all three distributions
• Datasets
• Three small datasets from the UCI (Irvine) repository
• Two real-world datasets
• Bayesian Score
• Models
• Learned a decision tree for every variable
• Potential predictors were all other variables
Evaluation
• Split data into Dtrain and Dtest
• Learn a tree for target xi using Dtrain
• Evaluate on Dtest via the log score
Method Scores
• Gaus(xi, k), Uniform(xi, k), KTile(xi, k): log score of the tree learned for xi using k split points with the given method
• All(xi): log score of the tree learned for xi using all split points

Evaluation
• Relative improvement of Gaussian method
• Relative improvement of Uniform method
• Relative improvement of KTile method
Census Results
• 37 variables
• ~300,000 records
• 31 trees contained a continuous split under at least one method
• Report the average of the 31 relative improvements
Census Results: Average Relative Improvement

[Chart: average relative improvement vs. number of candidate splits.]

Census Results: Learning Time

[Chart: learning time vs. number of candidate splits.]

Gaussian and K-tile are about the same: the savings is in the scoring.

Census Results: k = 15

[Chart: relative improvement for each tree (tree index) at k = 15.]

With 15 points, results are within 1.5% of the all-points method.

Media Metrix Results
• “Nielsen on the web”: demographics and internet-usage data
• 37 variables
• ~5,000 records
• 16 trees contained a continuous split under at least one method
• Report the simple average of the 16 relative improvements
Media Metrix Results: Average Relative Improvement

[Chart: average relative improvement vs. number of candidate splits.]

Media Metrix Results: Learning Time

[Chart: learning time vs. number of candidate splits.]

Media Metrix Results: k = 15

[Chart: relative improvement for each tree (tree index) at k = 15.]

Summary
• Gaussian and K-tile approaches yield good results with very few points (k = 15)
• The uniform method's failure is probably due to outliers (which stretch the [min, max] range)
• Gaussian approach easy to implement
• Fewer split points lead to faster learning
Future Work
• Consider k as a function of the number of records at the leaf
• More experiments

http://research.microsoft.com/~dmax

Relevant Papers

WinMine Toolkit

Evaluation
• Log score for a tree:

Log score = Σ log p(xi | other variables), summed over the records in Dtest

S*j : log predictive score of the “all splits” method on tree j

sm(k,j) : log predictive score of method m using k split points on tree j

Relative increase for method m: Incm(k,j) = (sm(k,j) − S*j) / |S*j|

We took the simple average of the increases over those trees containing a continuous split for at least one method.

Results
• Gaussian and Empirical Distributions work well
• Only about 10 split points are needed to be as accurate as using all points
Example: Gaussian, k = 3

[Diagram: a Gaussian density divided into four equal-probability (25%) regions by three split points.]

Estimate Gaussian for each continuous predictor:

Accumulate sums and sums of squares for each predictor

Calculate points that divide the Gaussian into 4 equal-probability regions
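A sketch of that single pass (illustrative; NormalDist.inv_cdf stands in for any Gaussian quantile routine):

```python
from statistics import NormalDist

def gaussian_quartile_points(values):
    """One scan accumulating the sum and sum of squares, then the three
    points that split the fitted Gaussian into four 25% regions."""
    n, s, ss = 0, 0.0, 0.0
    for v in values:            # single pass; no sorting
        n += 1
        s += v
        ss += v * v
    mean = s / n
    sd = max(ss / n - mean * mean, 1e-12) ** 0.5
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf(q) for q in (0.25, 0.50, 0.75)]
```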

Distributions
• Gaussian
• Need mean, SD for every continuous predictor
• Constant-time lookup for quantile points
• Uniform
• Need min, max for every continuous predictor
• Constant-time lookup for quantile points
• Empirical
• O(k n) selection algorithm; sorting is better for large k
Learning Trees: A Bayesian Approach

[Diagram: two candidate structures, T1 (a single split on X1, 0 / 1) and T2 (the X1 split with one child further split on X2, < 0 / ≥ 0).]

Choose the most probable tree structure given the data (choose the most likely parameters after the structure is determined).

Local Scoring

Typical Scoring Functions Decompose

[Diagram: a leaf reached along the path A = 0, B = 0 is split on X (0 / 1).]

Increase in score = f(Data with A = 0, B = 0)

The score of a split at a leaf depends only on the records relevant to that leaf.

Greedy Algorithm
• Repeatedly replace a leaf node by the split that increases the score the most
• Stop when no replacement increases the score
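In outline, the greedy loop might look like this (a sketch with a hypothetical tree/scoring API, not the authors' implementation):

```python
def grow_tree(tree, data, leaf_score, split_score, candidate_splits):
    """Repeatedly replace a leaf with its best score-improving split;
    stop when no replacement helps. All callables and tree methods
    here are placeholders."""
    improved = True
    while improved:
        improved = False
        for leaf in list(tree.leaves()):
            d_leaf = leaf.relevant_data(data)   # scoring is local to D_L
            gains = [(split_score(s, d_leaf) - leaf_score(d_leaf), s)
                     for s in candidate_splits(leaf, d_leaf)]
            if not gains:
                continue
            gain, best = max(gains, key=lambda g: g[0])
            if gain > 0:
                leaf.apply_split(best)
                improved = True
    return tree
```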
Example: p(Y | X, Z)

Current state T0: [Diagram: a single split on X (0 / 1).]

Consider all splits on Z in the left child: [Diagram: candidate trees splitting the X = 0 child on Z, e.g. Z → 1 | {0,2} and Z → 0 | {1,2}.]

If one of the splits improves the score, apply the best one.