Machine Learning in Real World: CART


### Machine Learning in Real World: CART

Outline
• CART Overview and Gymtutor Tutorial Example
• Splitting Criteria
• Handling Missing Values
• Pruning
• Finding Optimal Tree
CART – Classification And Regression Tree
• Developed 1974-1984 by 4 statistics professors
• Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
• Focused on accurate assessment when data is noisy
• Currently distributed by Salford Systems
CART Tutorial Data: Gymtutor

CART HELP, Sec 3 in CARTManual.pdf

• ANYRAQT Racquet ball usage (binary indicator coded 0, 1)
• ONAER Number of on-peak aerobics classes attended
• NSUPPS Number of supplements purchased
• OFFAER Number of off-peak aerobics classes attended
• NFAMMEM Number of family members
• TANNING Number of visits to tanning salon
• ANYPOOL Pool usage (binary indicator coded 0, 1)
• SMALLBUS Small business discount (binary indicator coded 0, 1)
• FIT Fitness score
• HOME Home ownership (binary indicator coded 0, 1)
• PERSTRN Personal trainer (binary indicator coded 0, 1)
• CLASSES Number of classes taken.
• SEGMENT Member’s market segment (1, 2, 3) – target
View data
• CART Menu: View -> Data Info …
CART Model Setup
• Target -- required
• Predictors (default – all)
• Categorical
• ANYRAQT, ANYPOOL, SMALLBUS, HOME
• Categorical: a field is treated as categorical if its name ends in “\$”, or based on its values
• Testing
• default – 10-fold cross-validation
Key CART features
• Automated field selection
• handles any number of fields
• automatically selects relevant fields
• No data preprocessing needed
• Does not require any kind of variable transforms
• Impervious to outliers
• Missing value tolerant
• Moderate loss of accuracy due to missing values
CART: Key Parts of Tree Structured Data Analysis
• Tree growing
• Splitting rules to generate tree
• Stopping criteria: how far to grow?
• Missing values: using surrogates
• Tree pruning
• Trimming off parts of the tree that don’t work
• Ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first?
• Optimal tree selection
• Deciding on the best tree after growing and pruning
• Balancing simplicity against accuracy
CART is a form of Binary Recursive Partitioning
• Data is split into two partitions
• Q: Does C4.5 always have binary partitions?
• Partitions can also be split into sub-partitions
• hence procedure is recursive
• CART tree is generated by repeated partitioning of data set
• parent gets two children
• each child produces two grandchildren
• four grandchildren produce 8 great grandchildren
Splits always determined by questions with YES/NO answers
• Is continuous variable X ≤ c?
• Does categorical variable D take on levels i, j, or k?
• is GENDER M or F ?
• Standard split:
• if answer to question is YES a case goes left; otherwise it goes right
• this is the form of all primary splits
• example: Is AGE ≤ 62.5?
• More complex conditions possible:
• Boolean combinations: AGE<=62 OR BP<=91
• Linear combinations: .66*AGE - .75*BP< -40
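The question forms above can be sketched as small Python predicates. This is only an illustration, not how the CART software represents splits: the helper names (`numeric_split`, `categorical_split`, `route`) and the sample case are made up for the example.

```python
from typing import Callable, Iterable, Mapping

Case = Mapping[str, object]
Question = Callable[[Case], bool]

def numeric_split(field: str, c: float) -> Question:
    """Build the question 'Is field <= c?'."""
    return lambda case: case[field] <= c

def categorical_split(field: str, levels: Iterable[object]) -> Question:
    """Build the question 'Does field take on one of these levels?'."""
    allowed = frozenset(levels)
    return lambda case: case[field] in allowed

def route(question: Question, case: Case) -> str:
    """Standard split: a YES answer goes left, a NO answer goes right."""
    return "left" if question(case) else "right"

# The more complex conditions from the slide, written as plain predicates.
def boolean_combo(case: Case) -> bool:
    return case["AGE"] <= 62 or case["BP"] <= 91

def linear_combo(case: Case) -> bool:
    return 0.66 * case["AGE"] - 0.75 * case["BP"] < -40

patient = {"AGE": 60, "BP": 95, "GENDER": "F"}  # made-up case
print(route(numeric_split("AGE", 62.5), patient))          # left
print(route(categorical_split("GENDER", ["M"]), patient))  # right
print(route(boolean_combo, patient))                       # left
print(route(linear_combo, patient))                        # right
```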
Searching all Possible Splits
• For any node CART will examine ALL possible splits.
• CART allows search over a random sample if desired
• Look at first variable in our data set AGE with minimum value 40
• Test split: Is AGE ≤ 40?
• Will separate out the youngest persons to the left
• Could be many cases if many people have the same AGE
• Next increase the AGE threshold to the next youngest person
• Is AGE ≤ 43?
• This will direct additional cases to the left
• Continue increasing the splitting threshold value by value
• each value is tested for how good the split is . . . how effective it is in separating the classes from each other
• Q: Consider splits between values of the same class?
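A minimal sketch of the value-by-value search for one numeric field. It assumes the largest observed value can be skipped as a threshold, since it would send every case left; the function names and the AGE values are invented for illustration.

```python
def candidate_thresholds(values):
    """Distinct observed values in increasing order, without the largest.

    Assumption: the largest value is skipped because 'Is X <= max?'
    sends every case left and separates nothing.
    """
    distinct = sorted(set(values))
    return distinct[:-1]

def partition(values, c):
    """Indices of the cases that go left (value <= c) and right (value > c)."""
    left = [i for i, v in enumerate(values) if v <= c]
    right = [i for i, v in enumerate(values) if v > c]
    return left, right

ages = [40, 40, 43, 51, 65]  # made-up AGE column
for c in candidate_thresholds(ages):
    left, right = partition(ages, c)
    print(f"Is AGE <= {c}?  left={len(left)}  right={len(right)}")
```

Each candidate would then be scored by a splitting criterion to decide how well it separates the classes.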

Split Tables

Q: Where do splits need to be evaluated?

[Figure: split tables, one sorted by Age and one sorted by Blood Pressure]

CART Splitting Criteria: Gini Index
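• If a data set T contains examples from n classes, the Gini index gini(T) is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in T.

gini(T) is minimized when the class distribution in T is skewed; it is 0 when every example in T belongs to a single class.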

• Advanced: CART also has other splitting criteria
• Twoing is recommended for multi-class
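A short sketch of the Gini computation, using the standard definition (one minus the sum of squared class frequencies). Scoring a split by the size-weighted average of the two children's Gini values is the usual convention and is assumed here, not taken from the slides; the function names and data are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Assumed split score: children's Gini weighted by their share of cases."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

def best_threshold(values, labels):
    """Try every distinct value (except the largest) as 'Is X <= c?'."""
    best = None
    for c in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= c]
        right = [y for x, y in zip(values, labels) if x > c]
        score = gini_split(left, right)
        if best is None or score < best[1]:
            best = (c, score)
    return best

ages = [40, 40, 43, 51, 65, 70]   # made-up data
segment = [1, 1, 1, 2, 2, 2]
print(gini(segment))                  # 0.5 for two equally frequent classes
print(best_threshold(ages, segment))  # (43, 0.0): both children are pure
```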
Missing as a distinct splitter value
• CHAID treats missing as a distinct categorical value
• e.g. AGE is 25-44, 45-64, 65-95 or missing
• method also adopted by C4.5
• If missing is a distinct value then all cases with missing go the same way in the tree
• Assumption: whatever the unknown value it is the same for all cases with missing value
• Problem: can be more than one reason for a database field to be missing:
• E.g. Income as a splitter wants to separate high from low
• Levels most likely to be missing? High Income AND Low Income!
• Don’t want to send both groups to same side of tree
CART Treatment of Missing Primary Splitters: Surrogates
• CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
• surrogate should be a valid replacement for primary
• Consider our example of INCOME
• Other variables like Education or Occupation might work as good surrogates
• Higher education people usually have higher incomes
• People in high income occupations will usually (though not always) have higher incomes
• Using a surrogate means that cases missing the primary are not all treated the same way
• Whether a case goes left or right depends on its surrogate value
• thus record specific . . . some cases go left others go right
Surrogates Mimicking Alternatives to Primary Splitters
• A primary splitter is the best splitter of a node
• A surrogate is a splitter that splits in a fashion similar to the primary
• Surrogate — variable with near equivalent information
• Why Useful?
• If the primary is expensive or difficult to gather and the surrogate is not
• Then consider using the surrogate instead
• Loss in predictive accuracy might be slight
• If primary splitter is MISSING then CART will use a surrogate
• if the top surrogate is missing, CART uses the 2nd best surrogate, etc.
• If all surrogates are also missing, CART uses majority rule
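The fallback chain above (primary, then ranked surrogates, then majority rule) might be sketched as follows. This is a simplification under stated assumptions: every splitter is a numeric "field <= threshold" question pointing the same way as the primary, and the field names and thresholds are hypothetical.

```python
def route_with_surrogates(case, splitters, majority_side):
    """Route one case at a node.

    splitters: list of (field, threshold) with the primary first and the
    surrogates after it in rank order. Assumption: each one is a
    'field <= threshold goes left' question.
    majority_side: where most training cases at this node went.
    """
    for field, threshold in splitters:
        value = case.get(field)
        if value is not None:  # first non-missing splitter decides
            return "left" if value <= threshold else "right"
    return majority_side       # everything missing: majority rule

# Hypothetical node: INCOME is the primary, the other two are surrogates.
node_splitters = [("INCOME", 50_000), ("EDUCATION_YEARS", 14), ("OCCUPATION_RANK", 3)]

print(route_with_surrogates({"INCOME": 30_000}, node_splitters, "left"))   # left
print(route_with_surrogates({"INCOME": None, "EDUCATION_YEARS": 16},
                            node_splitters, "left"))                       # right
print(route_with_surrogates({}, node_splitters, "left"))                   # left
```

Because the decision depends on each record's own surrogate value, two cases that are both missing INCOME can end up on different sides of the split.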
CART Pruning Method: Grow Full Tree, Then Prune
• You will never know when to stop . . . so don’t!
• Instead . . . grow trees that are obviously too big
• Largest tree grown is called “maximal” tree
• Maximal tree could have hundreds or thousands of nodes
• usually instruct CART to grow only moderately too big
• rule of thumb: should grow trees about twice the size of the truly best tree
• This becomes first stage in finding the best tree
• Next we will have to get rid of the parts of the overgrown tree that don’t work (not supported by test data)
Tree Pruning
• Take a very large tree (“maximal” tree)
• Tree may be radically over-fit
• Tracks all the idiosyncrasies of THIS data set
• Tracks patterns that may not be found in other data sets
• At bottom of tree splits based on very few cases
• Analogous to a regression with very large number of variables
• PRUNE away branches from this large tree
• But which branch to cut first?
• CART determines a pruning sequence:
• the exact order in which each node should be removed
• pruning sequence determined for EVERY node
• sequence determined all the way back to root node
Order of Pruning: Weakest Link Goes First
• Prune away "weakest link" — the nodes that add least to overall accuracy of the tree
• contribution to overall tree a function of both increase in accuracy and size of node
• accuracy gain is weighted by share of sample
• small nodes tend to get removed before large ones
• If several nodes have same contribution they all prune away simultaneously
• Hence more than two terminal nodes could be cut off in one pruning
• Sequence determined all the way back to root node
• need to allow for possibility that entire tree is bad
• if target variable is unpredictable we will want to prune back to root . . . the no model solution
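One way to make "weakest link" concrete is the cost-complexity measure from the original CART monograph: for each internal node, divide the training-error reduction its subtree provides (as a share of the whole sample) by the number of terminal nodes the subtree adds. That this is exactly what the slide intends is an assumption; the tree encoding and node names below are invented for the sketch.

```python
from fractions import Fraction

def leaves(node):
    """Terminal nodes under (and including) this node."""
    if "left" not in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def internal_nodes(node):
    """All nodes that have children, root first."""
    if "left" not in node:
        return []
    return [node] + internal_nodes(node["left"]) + internal_nodes(node["right"])

def link_strength(node, n_total):
    """Error reduction per extra terminal node, weighted by share of sample.

    node["errors"] = training cases misclassified if this node were a leaf.
    """
    subtree_leaves = leaves(node)
    r_node = Fraction(node["errors"], n_total)
    r_subtree = Fraction(sum(leaf["errors"] for leaf in subtree_leaves), n_total)
    return (r_node - r_subtree) / (len(subtree_leaves) - 1)

def weakest_links(root, n_total):
    """Names of the internal nodes to prune first (ties prune together)."""
    scored = [(link_strength(t, n_total), t["name"]) for t in internal_nodes(root)]
    lowest = min(score for score, _ in scored)
    return [name for score, name in scored if score == lowest]

# Made-up tree grown on 100 training cases.
tree = {
    "name": "root", "errors": 40,
    "left": {"name": "A", "errors": 10,
             "left": {"name": "A1", "errors": 4},
             "right": {"name": "A2", "errors": 5}},
    "right": {"name": "B", "errors": 30,
              "left": {"name": "B1", "errors": 10},
              "right": {"name": "B2", "errors": 5}},
}
print(weakest_links(tree, 100))  # ['A']: its split removes only 1 error in 100
```

Exact fractions are used so that nodes with equal contribution compare as equal and come off in the same pruning step.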
Pruning Sequence Example

[Figure: trees from the pruning sequence with 24, 21, 20, and 18 terminal nodes]
Now we test every tree in the pruning sequence
• Take a test data set and drop it down the largest tree in the sequence and measure its predictive accuracy
• how many cases right and how many wrong
• measure accuracy overall and by class
• Do same for 2nd largest tree, 3rd largest tree, etc
• Performance of every tree in sequence is measured
• Results reported in table and graph formats
• Note that this critical stage is impossible to complete without test data
• CART procedure requires test data to guide tree evaluation
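A minimal sketch of this evaluation step: each tree in the pruning sequence is treated as a prediction function, and the test set is dropped down every one of them. The two "trees", the FIT cut-off and the test cases are made up; a real run would use the trees produced by pruning.

```python
from collections import Counter

def error_report(y_true, y_pred):
    """Overall error rate and error rate within each true class."""
    total = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    overall = sum(wrong.values()) / len(y_true)
    by_class = {c: wrong[c] / total[c] for c in total}
    return overall, by_class

def evaluate_sequence(trees, test_cases, test_labels):
    """trees maps number of terminal nodes -> predict(case) function."""
    return {
        size: error_report(test_labels, [predict(case) for case in test_cases])
        for size, predict in trees.items()
    }

# Hypothetical two-tree sequence: the root-only tree and a one-split tree.
trees = {
    1: lambda case: 1,
    2: lambda case: 1 if case["FIT"] <= 5 else 2,
}
test_cases = [{"FIT": 3}, {"FIT": 6}, {"FIT": 7}, {"FIT": 8}]
test_labels = [1, 1, 2, 2]

for size, (overall, by_class) in evaluate_sequence(trees, test_cases, test_labels).items():
    print(size, overall, by_class)
```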
Training Data Vs. Test Data Error Rates
• Compare error rates measured by
• learn data
• large test set
• Learn R(T) always decreases as tree grows (Q: Why?)
• Test R(T) first declines then increases (Q: Why?)
• Overfitting is the result of too much reliance on learn R(T)
• Can lead to disasters when applied to new data

| No. Terminal Nodes | R(T) | Rts(T) |
|---|---|---|
| 71 | .00 | .42 |
| 63 | .00 | .40 |
| 58 | .03 | .39 |
| 40 | .10 | .32 |
| 34 | .12 | .32 |
| 19 | .20 | .31 |
| **10** | **.29** | **.30** |
| 9 | .32 | .34 |
| 7 | .41 | .47 |
| 6 | .46 | .54 |
| 5 | .53 | .61 |
| 2 | .75 | .82 |
| 1 | .86 | .91 |

R(T): error rate on learn (training) data; Rts(T): error rate on test data.

Why look at training data error rates (or cost) at all?
• First, provides a rough guide of how you are doing
• Truth will typically be WORSE than training data measure
• If the tree performs poorly on training data, you may not want to pursue it further
• Training data error rate more accurate for smaller trees
• So reasonable guide for smaller trees
• Poor guide for larger trees
• At optimal tree training and test error rates should be similar
• if not something is wrong
• useful to compare not just overall error rate but also within node performance between training and test data
CART: Optimal Tree
• Within a single CART run which tree is best?
• Process of pruning the maximal tree can yield many sub-trees
• Test data set or cross-validation measures the error rate of each tree
• Current wisdom — select the tree with smallest error rate
• Only drawback — minimum may not be precisely estimated
• Typical error rate as a function of tree size has flat region
• Minimum could be anywhere in this region
One SE Rule -- One Standard Error Rule
• Original monograph recommends NOT choosing minimum error tree because of possible instability of results from run to run
• Instead suggest SMALLEST TREE within 1 SE of the minimum error tree
• Tends to provide very stable results from run to run
• Is possibly as accurate as minimum cost tree yet simpler
• Current learning: the one SE rule is good for small data sets
• For large data sets one should pick most accurate tree
• known as the zero SE rule
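The two selection rules can be sketched on the test error rates from the error-rate table in this deck. The test-set sizes are hypothetical, since the slides do not give one, and the standard error is estimated with the usual binomial formula sqrt(R(1 - R) / N), which is an assumption about how the SE is computed.

```python
from math import sqrt

# (terminal nodes, test error Rts(T)) from the error-rate table in this deck
sequence = [(71, .42), (63, .40), (58, .39), (40, .32), (34, .32), (19, .31),
            (10, .30), (9, .34), (7, .47), (6, .54), (5, .61), (2, .82), (1, .91)]

def select_tree(sequence, n_test, se_multiple=1.0):
    """Smallest tree whose test error is within se_multiple SEs of the minimum.

    se_multiple=1.0 is the one SE rule; se_multiple=0.0 is the zero SE rule.
    Assumption: SE of the minimum error rate is sqrt(R * (1 - R) / n_test).
    """
    _, best_err = min(sequence, key=lambda item: (item[1], item[0]))
    se = sqrt(best_err * (1 - best_err) / n_test)
    cutoff = best_err + se_multiple * se
    return min(size for size, err in sequence if err <= cutoff)

print(select_tree(sequence, n_test=100, se_multiple=0.0))   # 10 (minimum error)
print(select_tree(sequence, n_test=100, se_multiple=1.0))   # 9  (simpler tree)
print(select_tree(sequence, n_test=1000, se_multiple=1.0))  # 10 (SE has shrunk)
```

With the hypothetical 100 test cases the one SE rule trades a little accuracy for a smaller tree; with 1000 cases the standard error is small enough that both rules pick the same tree.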
In what sense is the optimal tree “best”?
• Optimal tree has lowest or near lowest cost as determined by a test procedure
• Tree should exhibit very similar accuracy when applied to new data
• BUT Best Tree is NOT necessarily the one that happens to be most accurate on a single test database
• trees somewhat larger or smaller than “optimal” may be preferred
• Room for user judgment
• judgment not about split variable or values
• judgment as to how much of tree to keep
• determined by story tree is telling
• willingness to sacrifice a small amount of accuracy for simplicity
CART Summary
• CART Key Features
• binary splits
• gini index as splitting criterion
• grow, then prune
• surrogates for missing values
• optimal tree – 1 SE rule
• lots of nice graphics
Decision Tree Summary
• Decision Trees
• splits – binary, multi-way
• split criteria – entropy, gini, …
• missing value treatment
• pruning
• rule extraction from trees
• Both C4.5 and CART are robust tools
• No method is always superior – experiment!

witten & eibe