## Machine Learning in Real World: CART

Outline

- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
- Finding Optimal Tree

CART – Classification And Regression Tree

- Developed 1974-1984 by 4 statistics professors
- Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
- Focused on accurate assessment when data is noisy
- Currently distributed by Salford Systems

CART Tutorial Data: Gymtutor

CART HELP, Sec 3 in CARTManual.pdf

- ANYRAQT Racquet ball usage (binary indicator coded 0, 1)
- ONAER Number of on-peak aerobics classes attended
- NSUPPS Number of supplements purchased
- OFFAER Number of off-peak aerobics classes attended
- NFAMMEM Number of family members
- TANNING Number of visits to tanning salon
- ANYPOOL Pool usage (binary indicator coded 0, 1)
- SMALLBUS Small business discount (binary indicator coded 0, 1)
- FIT Fitness score
- HOME Home ownership (binary indicator coded 0, 1)
- PERSTRN Personal trainer (binary indicator coded 0, 1)
- CLASSES Number of classes taken.
- SEGMENT Member’s market segment (1, 2, 3) – target

View data

- CART Menu: View -> Data Info …

CART Model Setup

- Target -- required
- Predictors (default – all)
- Categorical
- ANYRAQT, ANYPOOL, SMALLBUS, HOME
- Categorical: if field name ends in “$”, or from values
- Testing
- default – 10-fold cross-validation
- …

Key CART features

- Automated field selection
- handles any number of fields
- automatically selects relevant fields
- No data preprocessing needed
- Does not require any kind of variable transforms
- Impervious to outliers
- Missing value tolerant
- Moderate loss of accuracy due to missing values

CART: Key Parts of Tree Structured Data Analysis

- Tree growing
- Splitting rules to generate tree
- Stopping criteria: how far to grow?
- Missing values: using surrogates
- Tree pruning
- Trimming off parts of the tree that don’t work
- Ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first?
- Optimal tree selection
- Deciding on the best tree after growing and pruning
- Balancing simplicity against accuracy

CART is a form of Binary Recursive Partitioning

- Data is split into two partitions
- Q: Does C4.5 always have binary partitions?
- Partitions can also be split into sub-partitions
- hence procedure is recursive
- CART tree is generated by repeated partitioning of data set
- parent gets two children
- each child produces two grandchildren
- four grandchildren produce 8 great grandchildren

Splits always determined by questions with YES/NO answers

- Is continuous variable X ≤ c ?
- Does categorical variable D take on levels i, j, or k?
- is GENDER M or F ?
- Standard split:
- if answer to question is YES a case goes left; otherwise it goes right
- this is the form of all primary splits
- example: Is AGE ≤ 62.5?
- More complex conditions possible:
- Boolean combinations: AGE<=62 OR BP<=91
- Linear combinations: .66*AGE - .75*BP< -40
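
Each kind of split question listed above can be written as a function that answers YES (True, the case goes left) or NO (False, the case goes right). The following is a minimal Python sketch, not CART's own implementation; the function names and the sample case are hypothetical.

```python
def standard_split(case, variable, threshold):
    """Primary split form: is case[variable] <= threshold?"""
    return case[variable] <= threshold


def categorical_split(case, variable, levels):
    """Does the categorical variable take one of the given levels?"""
    return case[variable] in levels


def linear_combination_split(case, weights, threshold):
    """Is the weighted sum of several variables below the threshold?"""
    return sum(w * case[name] for name, w in weights.items()) < threshold


# Hypothetical case, for illustration only.
case = {"AGE": 60, "BP": 120, "GENDER": "F"}

# Standard split: Is AGE <= 62.5?
goes_left_standard = standard_split(case, "AGE", 62.5)
# Boolean combination: AGE <= 62 OR BP <= 91
goes_left_boolean = standard_split(case, "AGE", 62) or standard_split(case, "BP", 91)
# Linear combination: .66*AGE - .75*BP < -40
goes_left_linear = linear_combination_split(case, {"AGE": 0.66, "BP": -0.75}, -40)
# Categorical split: Is GENDER M?
goes_left_gender = categorical_split(case, "GENDER", {"M"})
```

For this case the first three questions answer YES, so the case goes left; the GENDER question answers NO, so the case goes right.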

Searching all Possible Splits

- For any node CART will examine ALL possible splits.
- CART allows search over a random sample if desired
- Look at first variable in our data set AGE with minimum value 40
- Test split Is AGE ≤ 40?
- Will separate out the youngest persons to the left
- Could be many cases if many people have the same AGE
- Next increase the AGE threshold to the next youngest person
- Is AGE ≤ 43?
- This will direct additional cases to the left
- Continue increasing the splitting threshold value by value
- each value is tested for how good the split is . . . how effective it is in separating the classes from each other
- Q: Consider splits between values of the same class?
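
The value-by-value search described above can be sketched for a single continuous variable. This sketch assumes each candidate split is scored by the size-weighted Gini index of the two children (the Gini index is covered on the next slide); the ages and segment labels are made up, and `best_threshold` is a hypothetical helper, not a CART function.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try 'Is X <= c?' for every distinct observed value c except the largest.

    Returns (threshold, weighted_gini) for the split with the lowest score,
    or None if the variable has only one distinct value.
    """
    n = len(values)
    best = None
    for c in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= c]
        right = [y for x, y in zip(values, labels) if x > c]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (c, score)
    return best


# Made-up data: AGE values and the class of each case.
ages = [40, 40, 43, 51, 58, 62, 65, 70]
segment = [1, 1, 1, 1, 2, 2, 2, 2]

threshold, score = best_threshold(ages, segment)
```

Here the search settles on "Is AGE ≤ 51?", which separates the two classes completely and therefore scores 0.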

CART Splitting Criteria: Gini Index

- If a data set T contains examples from n classes, the gini index, gini(T), is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T.

gini(T) is minimized if the classes in T are skewed.

- Advanced: CART also has other splitting criteria
- Twoing is recommended for multi-class
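
As a worked example, the Gini index (one minus the sum of squared class frequencies) can be computed directly from class counts. The counts below are made up for illustration, and `gini_from_counts` is a hypothetical helper name.

```python
def gini_from_counts(counts):
    """Gini index from per-class counts: 1 - sum of squared relative frequencies."""
    total = sum(counts)
    return 1.0 - sum((count / total) ** 2 for count in counts)


even_two_class = gini_from_counts([50, 50])        # classes evenly mixed
skewed_two_class = gini_from_counts([90, 10])      # mostly one class
pure_node = gini_from_counts([100, 0])             # a single class only
even_three_class = gini_from_counts([30, 30, 30])  # three classes evenly mixed
```

The evenly mixed two-class node gives 0.5, the skewed node gives 0.18, and the pure node gives 0, which is why a more skewed class distribution means a lower Gini index.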

Missing as a distinct splitter value

- CHAID treats missing as a distinct categorical value
- e.g. AGE is 25-44, 45-64, 65-95 or missing
- method also adopted by C4.5
- If missing is a distinct value then all cases with missing go the same way in the tree
- Assumption: whatever the unknown value it is the same for all cases with missing value
- Problem: can be more than one reason for a database field to be missing:
- E.g. Income as a splitter wants to separate high from low
- Levels most likely to be missing? High Income AND Low Income!
- Don’t want to send both groups to same side of tree

CART Treatment of Missing Primary Splitters: Surrogates

- CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field
- surrogate should be a valid replacement for primary
- Consider our example of INCOME
- Other variables like Education or Occupation might work as good surrogates
- Higher education people usually have higher incomes
- People in high income occupations will usually (though not always) have higher incomes
- Using a surrogate means that cases missing the primary are not all treated the same way
- Whether go left or right depends on surrogate value
- thus record specific . . . some cases go left others go right

Surrogates Mimicking Alternatives to Primary Splitters

- A primary splitter is the best splitter of a node
- A surrogate is a splitter that splits in a fashion similar to the primary
- Surrogate — variable with near equivalent information
- Why Useful?
- If the primary is expensive or difficult to gather and the surrogate is not
- Then consider using the surrogate instead
- Loss in predictive accuracy might be slight
- If primary splitter is MISSING then CART will use a surrogate
- if top surrogate missing CART uses 2nd best surrogate etc
- If all surrogates missing also CART uses majority rule
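
The fallback order described above (primary splitter, then surrogates in rank order, then majority rule) can be sketched as follows. This is a simplified illustration, not CART's implementation: every splitter is assumed to be a "value ≤ threshold goes left" question, and the variable names, thresholds, and the `route` helper are hypothetical.

```python
def route(case, primary, surrogates, majority_goes_left):
    """Return 'left' or 'right' for one case at one node.

    primary and each surrogate are (variable, threshold) pairs, tried in order.
    A missing value is represented by None or by an absent key.
    """
    for variable, threshold in [primary, *surrogates]:
        value = case.get(variable)
        if value is not None:
            return "left" if value <= threshold else "right"
    # Primary and all surrogates are missing: follow the majority of cases.
    return "left" if majority_goes_left else "right"


# Hypothetical node: INCOME is the primary splitter with two ranked surrogates.
primary = ("INCOME", 50000)
surrogates = [("EDUCATION_YEARS", 14), ("OCCUPATION_RANK", 3)]

complete = route({"INCOME": 82000, "EDUCATION_YEARS": 12}, primary, surrogates, True)
uses_first = route({"INCOME": None, "EDUCATION_YEARS": 12}, primary, surrogates, True)
uses_second = route({"OCCUPATION_RANK": 5}, primary, surrogates, True)
all_missing = route({}, primary, surrogates, True)
```

Two cases with missing INCOME can end up on different sides of the node, which is the record-specific behavior that treating missing as one distinct value cannot provide.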

CART Pruning Method: Grow Full Tree, Then Prune

- You will never know when to stop . . . so don’t!
- Instead . . . grow trees that are obviously too big
- Largest tree grown is called “maximal” tree
- Maximal tree could have hundreds or thousands of nodes
- usually instruct CART to grow only moderately too big
- rule of thumb: should grow trees about twice the size of the truly best tree
- This becomes first stage in finding the best tree
- Next we will have to get rid of the parts of the overgrown tree that don’t work (not supported by test data)

Tree Pruning

- Take a very large tree (“maximal” tree)
- Tree may be radically over-fit
- Tracks all the idiosyncrasies of THIS data set
- Tracks patterns that may not be found in other data sets
- At bottom of tree splits based on very few cases
- Analogous to a regression with very large number of variables
- PRUNE away branches from this large tree
- But which branch to cut first?
- CART determines a pruning sequence:
- the exact order in which each node should be removed
- pruning sequence determined for EVERY node
- sequence determined all the way back to root node

Order of Pruning: Weakest Link Goes First

- Prune away "weakest link" — the nodes that add least to overall accuracy of the tree
- contribution to overall tree a function of both increase in accuracy and size of node
- accuracy gain is weighted by share of sample
- small nodes tend to get removed before large ones
- If several nodes have same contribution they all prune away simultaneously
- Hence more than two terminal nodes could be cut off in one pruning
- Sequence determined all the way back to root node
- need to allow for possibility that entire tree is bad
- if target variable is unpredictable we will want to prune back to root . . . the no model solution
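
One way to make "weakest link" concrete is the cost-complexity measure from the CART monograph: for each internal node, the extra error incurred by collapsing its subtree into a single leaf, divided by the number of terminal nodes saved. The sketch below assumes that measure; the slides do not show the software's exact computation, and the misclassification counts and the `weakest_links` helper are made up for illustration.

```python
from fractions import Fraction


def weakest_links(nodes):
    """Return (names, strength) of the internal nodes to prune first.

    nodes maps a node name to (node_errors, subtree_errors, n_leaves):
    errors made if the node were a leaf, errors made by the subtree
    below it, and the number of terminal nodes in that subtree.
    """
    strength = {
        name: Fraction(node_errors - subtree_errors, n_leaves - 1)
        for name, (node_errors, subtree_errors, n_leaves) in nodes.items()
    }
    weakest = min(strength.values())
    # Nodes tied for the smallest contribution are pruned together.
    names = sorted(name for name, g in strength.items() if g == weakest)
    return names, weakest


# Made-up misclassification counts for a six-leaf tree.
# B is a child of A; D is a child of C; A and C are children of the root.
nodes = {
    "root": (50, 7, 6),
    "A": (20, 5, 3),
    "B": (10, 5, 2),
    "C": (12, 2, 3),
    "D": (8, 1, 2),
}
first_to_go, gain_per_leaf = weakest_links(nodes)
```

Nodes B and C tie at 5 errors saved per extra terminal node, so both are pruned in the same step; repeating the calculation on the pruned tree gives the next step in the sequence, all the way back to the root.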

Now we test every tree in the pruning sequence

- Take a test data set and drop it down the largest tree in the sequence and measure its predictive accuracy
- how many cases right and how many wrong
- measure accuracy overall and by class
- Do same for 2nd largest tree, 3rd largest tree, etc
- Performance of every tree in sequence is measured
- Results reported in table and graph formats
- Note that this critical stage is impossible to complete without test data
- CART procedure requires test data to guide tree evaluation

Training Data Vs. Test Data Error Rates

- Compare error rates measured by
- learn data
- large test set
- Learn R(T) always decreases as tree grows (Q: Why?)
- Test R(T) first declines then increases (Q: Why?)
- Overfitting is the result of too much reliance on learn R(T)
- Can lead to disasters when applied to new data

| No. Terminal Nodes | R(T) | Rts(T) |
|---|---|---|
| 71 | .00 | .42 |
| 63 | .00 | .40 |
| 58 | .03 | .39 |
| 40 | .10 | .32 |
| 34 | .12 | .32 |
| 19 | .20 | .31 |
| **10** | **.29** | **.30** |
| 9 | .32 | .34 |
| 7 | .41 | .47 |
| 6 | .46 | .54 |
| 5 | .53 | .61 |
| 2 | .75 | .82 |
| 1 | .86 | .91 |
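
The contrast between the two error columns can be checked directly from the numbers above. This short Python sketch copies the table values and compares the tree that looks best on learn data with the tree that is best on test data; the variable names are illustrative.

```python
# (terminal nodes, learn error R(T), test error Rts(T)), copied from the table above.
pruning_sequence = [
    (71, 0.00, 0.42), (63, 0.00, 0.40), (58, 0.03, 0.39), (40, 0.10, 0.32),
    (34, 0.12, 0.32), (19, 0.20, 0.31), (10, 0.29, 0.30), (9, 0.32, 0.34),
    (7, 0.41, 0.47), (6, 0.46, 0.54), (5, 0.53, 0.61), (2, 0.75, 0.82),
    (1, 0.86, 0.91),
]

# The tree that looks best on learn data is the largest one.
best_by_learn = min(pruning_sequence, key=lambda row: row[1])
# The tree that is best on test data is much smaller.
best_by_test = min(pruning_sequence, key=lambda row: row[2])
# Gap between test and learn error for each tree size.
gap = {nodes: round(test - learn, 2) for nodes, learn, test in pruning_sequence}
```

Learn error favors the 71-node tree, whose test error is .42, while test error favors the 10-node tree, where the two error rates differ by only .01.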

Why look at training data error rates (or cost) at all?

- First, provides a rough guide of how you are doing
- Truth will typically be WORSE than training data measure
- If the tree performs poorly on training data, you may not want to pursue it further
- Training data error rate more accurate for smaller trees
- So reasonable guide for smaller trees
- Poor guide for larger trees
- At optimal tree training and test error rates should be similar
- if not something is wrong
- useful to compare not just overall error rate but also within node performance between training and test data

CART: Optimal Tree

- Within a single CART run which tree is best?
- Process of pruning the maximal tree can yield many sub-trees
- Test data set or cross-validation measures the error rate of each tree
- Current wisdom — select the tree with smallest error rate
- Only drawback — minimum may not be precisely estimated
- Typical error rate as a function of tree size has flat region
- Minimum could be anywhere in this region

One SE Rule -- One Standard Error Rule

- Original monograph recommends NOT choosing minimum error tree because of possible instability of results from run to run
- Instead suggest SMALLEST TREE within 1 SE of the minimum error tree
- Tends to provide very stable results from run to run
- Is possibly as accurate as minimum cost tree yet simpler
- Current learning: the one SE rule is good for small data sets
- For large data sets one should pick most accurate tree
- known as the zero SE rule
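
The one SE rule can be sketched using the test error rates from the training vs. test error table earlier in the transcript. The slides give neither the test set size nor the standard errors, so this sketch assumes a binomial standard error, sqrt(R(1 - R) / N), and tries two hypothetical test set sizes; `one_se_choice` is an illustrative helper, not a CART function.

```python
import math


def one_se_choice(sequence, n_test):
    """Pick the smallest tree whose test error is within one SE of the minimum.

    sequence: list of (terminal_nodes, test_error) pairs.
    n_test: number of test cases (an assumption; the slides do not give it).
    """
    _, best_error = min(sequence, key=lambda row: row[1])
    # Binomial standard error of the minimum test error rate (assumed form).
    se = math.sqrt(best_error * (1.0 - best_error) / n_test)
    eligible = [nodes for nodes, error in sequence if error <= best_error + se]
    return min(eligible), se


# (terminal nodes, test error Rts(T)) from the earlier table.
sequence = [
    (71, 0.42), (63, 0.40), (58, 0.39), (40, 0.32), (34, 0.32), (19, 0.31),
    (10, 0.30), (9, 0.34), (7, 0.47), (6, 0.54), (5, 0.61), (2, 0.82), (1, 0.91),
]

small_sample_choice, small_se = one_se_choice(sequence, n_test=100)
large_sample_choice, large_se = one_se_choice(sequence, n_test=1000)
```

With 100 test cases the standard error is about .046, so the 9-node tree (test error .34) qualifies and is chosen over the 10-node minimum. With 1000 test cases the standard error shrinks to about .014 and the rule returns the 10-node tree itself, consistent with the advice to pick the most accurate tree when the data set is large.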

In what sense is the optimal tree “best”?

- Optimal tree has lowest or near lowest cost as determined by a test procedure
- Tree should exhibit very similar accuracy when applied to new data
- BUT Best Tree is NOT necessarily the one that happens to be most accurate on a single test database
- trees somewhat larger or smaller than “optimal” may be preferred
- Room for user judgment
- judgment not about split variable or values
- judgment as to how much of tree to keep
- determined by story tree is telling
- willingness to sacrifice a small amount of accuracy for simplicity

CART Summary

- CART Key Features
- binary splits
- gini index as splitting criterion
- grow, then prune
- surrogates for missing values
- optimal tree – 1 SE rule
- lots of nice graphics

Decision Tree Summary

- Decision Trees
- splits – binary, multi-way
- split criteria – entropy, gini, …
- missing value treatment
- pruning
- rule extraction from trees
- Both C4.5 and CART are robust tools
- No method is always superior – experiment!

witten & eibe
