Presentation Transcript

Efficient Determination of Dynamic Split Points in a Decision Tree

Max Chickering

Chris Meek

Robert Rounthwaite

Outline
  • Probabilistic decision trees
  • Simple algorithm for choosing split points
  • Empirical evaluation
Probabilistic Decision Trees

Input variables: X1, …, Xn

Target variable: Y

variable = feature = attribute = ...

A probabilistic decision tree is a mapping from X1, …, Xn to p(Y | X1, …, Xn)

Example

[Figure: example tree. X1 binary, X2 continuous, Y binary. The root splits on X1 (0 / 1); one branch splits on X2 (< 0 / ≥ 0). Leaf distributions over Y: (0.5, 0.5), (0.2, 0.8), (0.4, 0.6).]

p(Y=0 | X1 = 1, X2 = -15.6) = 0.2
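To make the mapping concrete, here is a minimal sketch (not the authors' code) of a probabilistic decision tree: internal nodes test a predictor, leaves hold a distribution over Y. The branch layout below mirrors the example figure but is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    dist: dict          # p(Y = y) at this leaf

@dataclass
class Node:
    test: Callable      # maps an input record to a branch label
    children: dict      # branch label -> Leaf or Node

Tree = Union[Leaf, Node]

def predict(tree: Tree, record: dict) -> dict:
    """Return p(Y | X1, ..., Xn) by walking from the root to a leaf."""
    while isinstance(tree, Node):
        tree = tree.children[tree.test(record)]
    return tree.dist

# Illustrative tree matching the example slide (branch layout assumed):
example_tree = Node(
    test=lambda r: r["X1"],                              # root splits on binary X1
    children={
        0: Leaf({0: 0.4, 1: 0.6}),
        1: Node(
            test=lambda r: "lt" if r["X2"] < 0 else "ge",  # split on continuous X2
            children={"lt": Leaf({0: 0.2, 1: 0.8}),
                      "ge": Leaf({0: 0.5, 1: 0.5})},
        ),
    },
)

print(predict(example_tree, {"X1": 1, "X2": -15.6})[0])   # -> 0.2
```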

Applications Using Probabilities
  • Which advertisements to show

p(click ad | pages visited)

  • Recommending TV shows

p(watch show i | other viewing preferences)

  • Predicting demographics from Web usage

p(Age > 34 | web sites visited)

Learning Trees from Data

[Table: n training records with columns Record, X, Y, Z (e.g. Z takes values red, blue, green).]

[Figure: learned tree splitting on Z (red vs. green, blue), with leaf distributions (0.2, 0.8) and (0.4, 0.6).]

Automatically learn a tree for p(X | Y, Z)

Greedy Algorithm

[Figure: greedy search over trees. Each candidate split (on X1, X2, …, Xn) at a leaf is scored against the data (Score1(Data), Score2(Data), …, Scoren(Data)); the best-scoring split (here on X2) is applied, and the process repeats at the new leaves.]
Learning Trees From Data

DL: the data relevant to leaf L

[Figure: a tree with splits on X1, X3, and X2 (branches 0 / 1); L marks one of its leaves.]

Most scores: the increase from a split can be scored locally.

Candidate Splits: Discrete Predictors

  • Complete split
  • Hold-one-out
  • Arbitrary subsets

[Figure: for a three-valued predictor Z, the complete split has branches 0 / 1 / 2; the hold-one-out splits are 0 vs. {1, 2}, 1 vs. {0, 2}, and 2 vs. {0, 1}.]
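A small sketch (helper names are mine, not from the talk) that enumerates the first two families of candidate splits for a discrete predictor; arbitrary subsets are omitted since their number grows exponentially in the number of values.

```python
from typing import Iterable, List, Set, Tuple

def complete_split(values: Iterable) -> List[Set]:
    """One branch per distinct value of the predictor."""
    return [{v} for v in sorted(set(values))]

def hold_one_out_splits(values: Iterable) -> List[Tuple[Set, Set]]:
    """Binary splits that isolate a single value against the rest."""
    vals = sorted(set(values))
    return [({v}, set(vals) - {v}) for v in vals]

print(complete_split([0, 1, 2]))        # [{0}, {1}, {2}]
print(hold_one_out_splits([0, 1, 2]))   # [({0}, {1, 2}), ({1}, {0, 2}), ({2}, {0, 1})]
```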

Continuous Predictors

  • Static discretization

    + Simple: reduces the problem to the all-discrete case
    - Not context-dependent

  • Dynamic discretization

    Binary splits: choose a single split point c (Z < c vs. Z ≥ c)

Which candidate split points do we consider during search?

Continuous Predictors

[Figure: the data for predictor Z sorted from -19.3 to 49.1, with candidate split points c1, c2, c3; c1 and c2 lie between the same pair of adjacent data values.]

Most scoring functions (e.g. Bayesian, entropy) cannot distinguish candidate c1 from candidate c2, because both induce the same partition of the data.

Continuous Predictors

Try all midpoints (or all values) as candidate splits.

[Figure: sorted data for the predictor, with a candidate split point at the midpoint between every pair of adjacent values.]

This is the approach taken by many decision-tree learning systems.
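As a reference point, a minimal sketch (function name assumed) of the all-midpoints strategy:

```python
def midpoint_candidates(values):
    """Candidate split points halfway between each pair of adjacent distinct values."""
    xs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

print(midpoint_candidates([3.0, -19.3, 7.5, 49.1, 7.5]))
# [-8.15, 5.25, 28.3]
```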

Continuous Predictors

Optimization for a discrete target (Fayyad and Irani, 1993), applicable to maximum-entropy and Bayesian scores:

[Figure: sorted predictor values with their target labels (0s and 1s) shown above them; only the boundary points, where the target value changes between adjacent records, need to be considered as candidate split points.]
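A hedged sketch of the boundary-point idea behind the Fayyad-Irani optimization: only midpoints between adjacent (sorted) records whose target values differ are kept as candidates. The helper name and exact interface are mine.

```python
def boundary_candidates(xs, ys):
    """Split-point candidates at midpoints between adjacent sorted values
    whose target labels differ (boundary points)."""
    pairs = sorted(zip(xs, ys))                     # sort records by predictor value
    cands = []
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if y0 != y1 and x0 != x1:                   # target changes across distinct values
            cands.append((x0 + x1) / 2.0)
    return cands

print(boundary_candidates([1.0, 2.0, 3.0, 4.0, 5.0], [0, 0, 1, 1, 0]))
# [2.5, 4.5]
```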

Complexity Analysis

nL = number of records relevant to leaf L
n = total number of data records
m = number of continuous predictor variables

  • Sorting the data

    Approach 1: O(m nL log nL) time per leaf
    Approach 2: O(m n log n) space per tree

  • Scoring the candidate splits (the majority of learning time)

    O(m nL) scores per leaf

Expensive for large datasets!

Can we do something more scalable?
  • Can we choose candidate split points without sorting?
  • How many candidate split points do we need to learn good models?
A Simple Solution: Quantile Method

  • Fit a distribution to the predictor points
  • Find quantiles

[Figure: a fitted density over the data points, with two split points c1 and c2 dividing it into three equal-probability regions (1/3 each).]

For 2 split points, divide the distribution into 3 equal-probability regions.
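The general recipe can be sketched as follows, assuming we are handed the inverse CDF of whatever distribution was fitted; the function name is mine.

```python
def quantile_split_points(inv_cdf, k):
    """k candidate split points that divide the fitted distribution
    into k + 1 equal-probability regions."""
    return [inv_cdf((i + 1) / (k + 1)) for i in range(k)]

# Example with a uniform fit on [lo, hi]: the inverse CDF is lo + p * (hi - lo).
lo, hi = -19.3, 49.1
print(quantile_split_points(lambda p: lo + p * (hi - lo), 2))
# two points splitting [lo, hi] into three equal-probability regions
```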

A Simple Solution: Quantile Method

For every leaf node in the tree:

  • Use the same type of distribution
  • Use the same number of split points k
Gaussian Distribution

  • Need the mean and SD of every continuous predictor
  • Constant-time calculation for each quantile point

[Figure: a Gaussian fitted to the data points, with candidate split points at its quantiles.]
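A minimal sketch of the Gaussian variant using Python's standard-library normal distribution; the function name is an assumption, not the paper's API.

```python
from statistics import NormalDist

def gaussian_split_points(mean, sd, k):
    """k split points at the (i+1)/(k+1) quantiles of a Gaussian
    fitted to the predictor (mean, sd)."""
    dist = NormalDist(mu=mean, sigma=sd)
    return [dist.inv_cdf((i + 1) / (k + 1)) for i in range(k)]

print(gaussian_split_points(mean=10.0, sd=4.0, k=3))
# quartile boundaries of N(10, 4^2): roughly [7.3, 10.0, 12.7]
```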

Uniform Distribution

  • Need the min and max of every continuous predictor
  • Constant-time calculation for each quantile point

[Figure: a uniform fit over [min, max], with candidate split points at its quantiles.]
Empirical Distribution: K-Tiles

K-tile approach: k = 1 results in the median

O(k n) algorithm for a small number of split points k; sorting is better for large k

[Figure: empirical data points with candidate split points at the k-tiles.]
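A sketch of one way to get k-tiles in roughly O(k n) time without a full sort, using repeated selection (numpy's partition); this illustrates the idea rather than the authors' exact algorithm.

```python
import numpy as np

def ktile_split_points(values, k):
    """k empirical split points (k-tiles) found by selection rather than a full sort:
    each np.partition call places the target rank correctly in O(n) expected time,
    so the total cost is O(k n)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    ranks = [((i + 1) * n) // (k + 1) for i in range(k)]    # ranks of the k-tiles
    return [float(np.partition(x, r)[r]) for r in ranks]

print(ktile_split_points([5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0], k=2))
# -> [4.0, 7.0]
```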

Experiments
  • Varied k for all three distributions
  • Datasets
    • Three small datasets from Irvine
    • Two real-world datasets
  • Bayesian Score
  • Models
    • Learned a decision tree for every variable
    • Potential predictors were all other variables
Evaluation

  • Split the data into Dtrain and Dtest
  • Learn a tree for each target xi using Dtrain
  • Evaluate on Dtest via the log score
Method Scores

  • Gaus(xi, k), Uniform(xi, k), KTile(xi, k)

    Log score of the tree learned for xi using k split points with the given method

  • All(xi)

    Log score of the tree learned for xi using all split points

Evaluation
  • Relative improvement of Gaussian method
  • Relative improvement of Uniform method
  • Relative improvement of KTile method
Census Results
  • 37 variables
  • ~300,000 records
  • 31 trees containing a continuous split using at least one method
  • Report average of the 31 relative improvements
Census Results: Average Relative Improvement

[Chart: average relative improvement vs. number of candidate splits.]

Census Results: Learning Time

[Chart: learning time vs. number of candidate splits.]

Gaussian and KTile are about the same: the savings is in the scoring.
Census Results: k = 15

[Chart: relative improvement vs. tree index at k = 15.]

With 15 points, within 1.5% of the all-points method.

Media Metrix Results
  • “Nielsen on the web”: demographics / internet-use
  • 37 variables
  • ~5,000 records
  • 16 trees containing a continuous split using at least one method
  • Report simple average of the 16 relative improvements
Media Metrix Results: Average Relative Improvement

[Chart: average relative improvement vs. number of candidate splits.]

Media Metrix Results: Learning Time

[Chart: learning time vs. number of candidate splits.]

Media Metrix Results: k = 15

[Chart: relative improvement vs. tree index at k = 15.]

Summary
  • Gaussian and K-tile approaches yield good results with very few points (k = 15)
  • Uniform failure probably due to outliers
  • Gaussian approach easy to implement
  • Fewer split points lead to faster learning
Future Work
  • Consider k as a function of the number of records at the leaf
  • More experiments
For More Info…

My home page:

http://research.microsoft.com/~dmax

Relevant Papers

WinMine Toolkit

Evaluation

  • Log score for a tree:

    Log score = Σ over test cases of log p(xi | other variables)

  • S*j : log predictive score of the "all splits" method on tree j
  • sm(k, j) : log predictive score of method m using k split points on tree j
  • Relative increase for method m: Incm(k, j), comparing sm(k, j) to S*j

Took the simple average of the increases over those trees containing a continuous split for at least one method.
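A sketch of the evaluation quantities. The log score follows the definition above; the relative-increase formula is only an assumed, illustrative definition, since the slide names Incm(k, j) without spelling it out.

```python
import math

def log_score(tree, test_records, target, predict):
    """Sum of log predictive probabilities of the target over the test cases
    (the log of the product of p(xi | other variables))."""
    return sum(math.log(predict(tree, r)[r[target]]) for r in test_records)

def relative_increase(s_mkj, s_star_j):
    """Assumed form of Inc_m(k, j): improvement of method m's log score over the
    all-splits score, relative to the magnitude of the all-splits score.
    The exact definition is not given on the slide; this is an illustrative guess."""
    return (s_mkj - s_star_j) / abs(s_star_j)
```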

Results
  • Gaussian and Empirical Distributions work well
  • Only about 10 split points are needed to be as accurate as using all points
Example: Gaussian, k = 3

Estimate a Gaussian for each continuous predictor:

  • Accumulate sums and sums of squares for each predictor
  • Calculate the points that divide the Gaussian into 4 equal-probability regions (25% each)

[Figure: a Gaussian density with three split points marking four 25% regions.]
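A minimal sketch of this recipe: one pass accumulating the sum and sum of squares, then the three quartile boundaries of the fitted Gaussian. Names and the numeric guard are mine.

```python
from statistics import NormalDist

def gaussian_k3_split_points(values):
    """Single pass over the data: accumulate sum and sum of squares,
    fit a Gaussian, and return the 3 points that divide it into
    4 equal-probability (25%) regions."""
    n = s = ss = 0.0
    for x in values:
        n += 1
        s += x
        ss += x * x
    mean = s / n
    var = max(ss / n - mean * mean, 1e-12)      # guard against tiny negative variance
    dist = NormalDist(mu=mean, sigma=var ** 0.5)
    return [dist.inv_cdf(q) for q in (0.25, 0.50, 0.75)]

print(gaussian_k3_split_points([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
# roughly [3.65, 5.0, 6.35] for mean 5, SD 2
```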

Distributions

  • Gaussian
    • Need the mean and SD of every continuous predictor
    • Constant-time lookup for quantile points
  • Uniform
    • Need the min and max of every continuous predictor
    • Constant-time lookup for quantile points
  • Empirical
    • O(k n) selection algorithm; sorting is better for large k
Learning Trees: A Bayesian Approach

[Figure: two candidate tree structures, T1 and T2, for the same data; one splits only on X1 (0 / 1), the other additionally splits on X2 (< 0 / ≥ 0).]

Choose the most probable tree structure given the data

(choose the most likely parameters after the structure is determined)

Local Scoring

Typical scoring functions decompose.

[Figure: a leaf reached by the path A = 0, B = 0 is split on X (0 / 1).]

Increase in score = f(Data with A = 0, B = 0)

The score of a split at a leaf depends only on the records relevant to that leaf.

Greedy Algorithm

  • Start with the empty tree (a single leaf node)
  • Repeatedly replace a leaf node by the split that increases the score the most
  • Stop when no replacement increases the score (see the sketch below)
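A minimal sketch of this greedy loop (it grows each leaf depth-first rather than choosing the globally best leaf at every step, a simplification); candidate_splits, score_increase, apply_split, and leaf_distribution are placeholder callbacks, not the authors' API.

```python
def grow(data, candidate_splits, score_increase, apply_split, leaf_distribution):
    """Greedy tree growing: pick the best score-improving split for this leaf's data,
    recurse into the resulting children, and stop when nothing improves."""
    best_gain, best_split = 0.0, None
    for split in candidate_splits(data):
        gain = score_increase(data, split)        # scored locally on this leaf's data
        if gain > best_gain:
            best_gain, best_split = gain, split
    if best_split is None:                        # no split increases the score: stay a leaf
        return {"leaf": leaf_distribution(data)}
    children = {label: grow(part, candidate_splits, score_increase,
                            apply_split, leaf_distribution)
                for label, part in apply_split(data, best_split).items()}
    return {"split": best_split, "children": children}
```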
Example: p(Y | X, Z)

Current state T0: a tree splitting on X (0 / 1).

Consider all splits on Z in the left child, e.g. Z = 1 vs. {0, 2} and Z = 0 vs. {1, 2}.

If one of the splits improves the score, apply the best one.