Announcements • No reading assignment for next week • Prepare for exam • Midterm exam next week
Last Time • Neural nets (briefly) • Decision trees • Today • More decision trees • Ensembles • Exam Review • Next time • Exam • Advanced ML topics
Overview of ID3
[Figure: ID3 repeatedly chooses a splitting attribute (A1…A4) to separate the + and - examples (+1…+5, -1…-3) and recurses on each branch; a *NULL* branch (one with no examples) is labeled with the majority class at the parent node.]
Example Info Gain Calculation

  Color    Shape   Size    Class
  Red      ?       BIG     +
  Blue     ?       BIG     +
  Red      ?       SMALL   -
  Yellow   ?       SMALL   -
  Red      ?       BIG     +
Info Gain Calculation (contd.) Note that “Size” provides complete classification.
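The calculation can be sketched in a few lines of Python (an illustration, not part of the original slides; the example data follow the Color/Size/Class table from the earlier slide):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(examples, feature):
    """Information gain of splitting `examples` on `feature`.
    Each example is a (features_dict, label) pair."""
    labels = [label for _, label in examples]
    total = entropy(labels)
    values = {f[feature] for f, _ in examples}
    remainder = 0.0
    for v in values:
        subset = [label for f, label in examples if f[feature] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return total - remainder

# The five examples from the slide's table.
data = [
    ({"Color": "Red",    "Size": "BIG"},   "+"),
    ({"Color": "Blue",   "Size": "BIG"},   "+"),
    ({"Color": "Red",    "Size": "SMALL"}, "-"),
    ({"Color": "Yellow", "Size": "SMALL"}, "-"),
    ({"Color": "Red",    "Size": "BIG"},   "+"),
]

# "Size" provides complete classification, so its gain equals the
# full entropy of the class labels (about 0.971 bits).
print(info_gain(data, "Size"))
print(info_gain(data, "Color"))
```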
Runtime Performance of ID3 • Let E = # examples, F = # features • At level 1: look at each feature, and for each feature look at each example (to get its value) • Work to choose one feature = O(F × E)
Runtime Performance of ID3 (cont.) • In the worst case, need to consider all features along all paths (full tree): O(F² × E) • Reasonably efficient
Generating Rules
• Antecedent: conjunction of all decisions leading to a terminal node
• Consequent: label of the terminal node
• Example tree:
  COLOR?
    Green → -
    Blue  → +
    Red   → SIZE?
              Big   → +
              Small → -
Generating Rules (cont.)
• Generated rules:
  Color=Green → -
  Color=Blue → +
  Color=Red and Size=Big → +
  Color=Red and Size=Small → -
• Note:
  1. Can "clean up" the rule set (see Quinlan's)
  2. Decision trees learn disjunctive concepts
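Rule generation is just an enumeration of root-to-leaf paths. A minimal sketch in Python (the tuple-based tree representation here is my own choice for illustration):

```python
# A leaf is a class-label string; an internal node is
# (feature, {branch_value: subtree, ...}).
tree = ("Color", {
    "Green": "-",
    "Blue":  "+",
    "Red":   ("Size", {"Big": "+", "Small": "-"}),
})

def tree_to_rules(node, path=()):
    """Return one (antecedent, consequent) rule per root-to-leaf path.
    The antecedent is the conjunction of decisions along the path."""
    if isinstance(node, str):                  # terminal node: emit a rule
        return [(path, node)]
    feature, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, path + ((feature, value),)))
    return rules

for antecedent, label in tree_to_rules(tree):
    conjunction = " and ".join(f"{f}={v}" for f, v in antecedent)
    print(f"IF {conjunction} THEN {label}")
```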
Noise – A Major Issue in ML
• Worst case: + and - at the same point in feature space
• Causes:
  1. Too few features ("hidden variables") or too few possible values
  2. Incorrectly reported/measured/judged feature values
  3. Misclassified instances
Noise – A Major Issue in ML (cont.) • Issue: overfitting – producing an "awkward" concept because of a few "noisy" points • Does the complex boundary give bad performance on future examples, or better performance?
Overfitting Viewed in Terms of Function-Fitting
[Figure: data = red line + noise; an overfit f(x) threads through every noisy point instead of tracking the underlying line]
Definition of Overfitting • Concept C overfits the training data if there exists a "simpler" concept S such that: training-set accuracy of C > training-set accuracy of S, but test-set accuracy of C < test-set accuracy of S • Assuming the test set is large enough to be representative.
Remember! • It is easy to learn/fit the training data • What’s hard is generalizing well to future (“test set”) data! • Overfitting avoidance is a key issue in Machine Learning
Can One Underfit? • Sure, if not fully fitting the training set – e.g., just return the majority category (+ or -) in the training set as the learned model. • But also if there is not enough data to illustrate the important distinctions.
ID3 & Noisy Data • To avoid overfitting, allow splitting to stop before all examples are of one class. • Option 1: if the information remaining is below some threshold ε, don't split – empirically failed: bad performance on error-free data (Quinlan)
ID3 & Noisy Data (cont.) • Option 2: estimate whether all remaining features are statistically independent of the class of the remaining examples – uses the chi-squared test of the original ID3 paper – works well on error-free data
ID3 & Noisy Data (cont.) • Option 3: (not in original ID3 paper) Build complete tree, then use some “spare” (tuning) examples to decide which parts of tree can be pruned.
ID3 & Noisy Data (cont.) • Pruning is currently the best choice – see C4.5 for technical details • Repeat using a greedy algorithm.
Greedily Pruning D-trees • Sample (hill-climbing) search space
[Figure: from the initial tree, repeatedly move to the best neighboring pruned tree; stop if no improvement]
Pruning by Measuring Accuracy on Tune Set
1. Run ID3 to fully fit the TRAIN' set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf
   – label the leaf with the majority category in the pruned subtree
   – choose the best such subtree on TUNE
   – if no improvement, quit
3. Go to 2
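The loop above can be sketched as follows. The tree representation, the helper functions, and the toy tuning set are all my own illustration (and, as a simplification, the replacement leaf's label is the majority over the subtree's leaf labels rather than over the training examples reaching that node):

```python
# A leaf is a class-label string; an internal node is
# (feature, {branch_value: subtree, ...}).

def classify(tree, example):
    while not isinstance(tree, str):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

def accuracy(tree, dataset):
    return sum(classify(tree, x) == y for x, y in dataset) / len(dataset)

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [l for child in tree[1].values() for l in leaves(child)]

def majority(labels):
    # sorted() makes tie-breaking deterministic
    return max(sorted(set(labels)), key=labels.count)

def interior_paths(tree, path=()):
    """Yield the branch-value path to every interior node."""
    if isinstance(tree, str):
        return
    yield path
    for value, child in tree[1].items():
        yield from interior_paths(child, path + (value,))

def subtree_at(tree, path):
    for value in path:
        tree = tree[1][value]
    return tree

def replace_with_leaf(tree, path, leaf):
    """Return a copy of `tree` with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    feature, branches = tree
    branches = dict(branches)
    branches[path[0]] = replace_with_leaf(branches[path[0]], path[1:], leaf)
    return (feature, branches)

def prune_greedily(tree, tune):
    """Hill-climb: replace the ONE interior node whose removal most
    improves TUNE accuracy; stop when no single replacement helps."""
    best_acc = accuracy(tree, tune)
    while True:
        best = None
        for path in interior_paths(tree):
            leaf = majority(leaves(subtree_at(tree, path)))
            candidate = replace_with_leaf(tree, path, leaf)
            acc = accuracy(candidate, tune)
            if acc > best_acc:
                best_acc, best = acc, candidate
        if best is None:               # step: no improvement, quit
            return tree
        tree = best                    # go back to step 2 with pruned tree

# Toy tune set in which cutting the SIZE subtree under Red helps.
tree = ("Color", {"Green": "-", "Blue": "+",
                  "Red": ("Size", {"Big": "+", "Small": "-"})})
tune = [({"Color": "Red", "Size": "Small"}, "+"),
        ({"Color": "Red", "Size": "Small"}, "+"),
        ({"Color": "Blue", "Size": "Big"}, "+"),
        ({"Color": "Green", "Size": "Big"}, "-")]
pruned = prune_greedily(tree, tune)
print(accuracy(pruned, tune))   # improves from 0.5 to 1.0
```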
The Tradeoff in Greedy Algorithms • Efficiency vs. optimality • E.g., if the best cuts on TUNE are to discard C's and F's subtrees, but the single best cut is to discard B's subtree, then greedy search will not find the best tree • Greedy search: powerful, general-purpose, trick of the trade
[Figure: initial tree with root R and interior nodes A, B, C, D, E, F]
Hypothetical Trace of a Greedy Algorithm • Full-tree accuracy = 85% on the TUNE set • For each interior node, measure the accuracy if that node is replaced by a leaf (leaving the rest of the tree the same) • Pruning at B works best
[Figure: tree with root R and nodes A–F, each annotated with the accuracy of the corresponding candidate pruning]
Hypothetical Trace of a Greedy Algorithm (cont.) • Accuracy of the pruned tree = 89% • STOP, since no improvement results from cutting again, and return this tree.
[Figure: pruned tree with root R and nodes A, B]
Another Possibility: Rule Post-Pruning (also a greedy algorithm) • Induce a decision tree • Convert to rules (see earlier slide) • Consider dropping any one rule antecedent • Delete the one whose removal improves tuning-set accuracy the most • Repeat as long as progress is being made
Rule Post-Pruning (cont.) • Advantages • Allows an intermediate node to be pruned from some rules but retained in others • Can correct poor early decisions in tree construction • Final concept is more understandable
Training with Noisy Data • If we can clean up the training data, should we do so? • No (assuming one can’t clean up the testing data when the learned concept will be used). • Better to train with the same type of data as will be experienced when the result of learning is put into use.
Overfitting + Noise • Using the strict definition of overfitting presented earlier, is it possible to overfit noise-free data? • In general? • Using ID3?
Example of Overfitting of Noise-Free Data • Let • Correct concept = A ∧ B • Feature C is true 50% of the time, for both + and - examples • Prob(+ example) = 0.9 • Training set: • +: ABCDE, ABC¬DE, ABCD¬E • -: A¬B¬CD¬E, ¬AB¬C¬DE
Example (cont.)

  Tree                              Trainset Accuracy   Testset Accuracy
  ID3's (split on C: T → +, F → -)        100%                 50%
  Simpler "tree" (single leaf: +)          60%                 90%
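The numbers in the table can be checked directly; a short Python verification (the string encoding of the examples is my own, with "~" marking a negated feature):

```python
# Training set from the previous slide; feature C happens to be true in
# all + examples and false in all - examples.
train_pos = ["ABCDE", "ABC~DE", "ABCD~E"]
train_neg = ["A~B~CD~E", "~AB~C~DE"]

def has_C(example):
    # C is true unless it appears negated ("~C")
    return "C" in example and "~C" not in example

id3    = lambda ex: "+" if has_C(ex) else "-"  # ID3 splits on C alone
simple = lambda ex: "+"                        # simpler "tree": always +

train = [(e, "+") for e in train_pos] + [(e, "-") for e in train_neg]
train_acc = lambda h: sum(h(e) == y for e, y in train) / len(train)

# Expected TEST accuracy: C is independent of the class (true half the
# time for both + and -), and Prob(+) = 0.9.
test_acc_id3    = 0.9 * 0.5 + 0.1 * 0.5   # right on + iff C, on - iff not C
test_acc_simple = 0.9                      # always predicting +

print(train_acc(id3), test_acc_id3)        # 100% train, 50% test
print(train_acc(simple), test_acc_simple)  # 60% train, 90% test
```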
Post Pruning • There are more sophisticated methods of deciding where to prune than simply estimating accuracy on a tuning set. • See the C4.5 and CART books for details. • We won’t discuss them, except for MDL • Tuning sets also called • Pruning sets (in d-tree algorithms) • Validation sets (in general)
Tuning Sets vs MDL • Two ways to deal with overfitting • Tuning Sets • Empirically evaluate pruned trees • MDL (Minimal Description Length) • Theoretically evaluate/score pruned trees • Describe training data in as few bits as possible (“compression”)
MDL (cont.) • No need to hold aside training data • But how good is the MDL hypothesis? • Heuristic: MDL => good generalization
The Minimal Description Length (MDL) Principle (Rissanen, 1986; Quinlan and Rivest, 1989) • Informally, we want to view a training set as data = general rule + exceptions to the rule (“noise”) • Tradeoff between • Simple rule, but many exceptions • Complex rule with few exceptions • How to make this tradeoff? • Try to minimize the “description length” of the rule + exceptions
Trading Off Simplicity vs. Coverage
• Minimize: description length = (size of rules) + λ × (size of exceptions)
  – size of rules: # bits needed to represent a decision tree that covers (possibly incompletely) the training examples
  – size of exceptions: # bits needed to encode the exceptions to this decision tree
  – λ: a weighting factor, user-defined or chosen with a tuning set
• Key issue: what's the best coding strategy to use?
A Simple MDL Algorithm • Build the full tree using ID3 (and all the training examples) • Consider all/many subtrees, keeping the one that minimizes: score = (# nodes in tree) + λ × (error rate on training set) (a crude scoring function) • Some details: if # features = Nf and # examples = Ne, then we need ⌈log₂ Nf⌉ bits to encode each tree node and ⌈log₂ Ne⌉ bits to encode an exception.
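A sketch of the bit-count version of this score in Python (the specific node and exception counts below are hypothetical, chosen only to show the tradeoff):

```python
import math

def mdl_score(num_nodes, num_exceptions, num_features, num_examples,
              lam=1.0):
    """Crude MDL-style description length of a pruned tree:
    each tree node costs ceil(log2 Nf) bits (to name its feature);
    each exception, i.e. misclassified training example, costs
    ceil(log2 Ne) bits (to name the example)."""
    node_bits = math.ceil(math.log2(num_features))
    exception_bits = math.ceil(math.log2(num_examples))
    return num_nodes * node_bits + lam * num_exceptions * exception_bits

# Hypothetical comparison over 8 features and 1000 examples:
# a 15-node tree with 2 exceptions vs. a 3-node tree with 5 exceptions.
big   = mdl_score(15, 2, num_features=8, num_examples=1000)
small = mdl_score(3, 5, num_features=8, num_examples=1000)
print(big, small)   # keep whichever tree scores smaller
```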
Searching the Space of Pruned D-trees with MDL • Can use same greedy search algorithm used with pruning sets • But use MDL score rather than pruning set accuracy as the heuristic function
MDL Summarized
• The overfitting problem
  • Can exactly fit the training data, but will this generalize well to test data?
  • Trade off some training-set errors for fewer test-set errors
• One solution – the MDL hypothesis
  • Solve the MDL problem (on the training data) and you are likely to generalize well (accuracy on the test data)
• The MDL problem
  • Minimize |description of general concept| + λ × |list of exceptions (in the train set)|
Small Disjuncts (Holte et al. IJCAI 1989) • Results of learning can often be viewed as a disjunction of conjunctions • Definition: small disjuncts – Disjuncts that correctly classify few training examples • Not necessarily small in area.
The Problem with Small Disjuncts • Collectively, cover much of the training data, but account for much of the testset error • One study • Cover 41% of training data and produce 95% of the test set error • The “small-disjuncts problem” still an open issue (See Quinlan paper in MLJ for additional discussion).
Overfitting Avoidance Wrapup • Note: a fundamental issue in all of ML, not just decision trees; after all, it is easy to exactly match the training data via "table lookup" • Approaches • Use a simple ML algorithm from the start • Optimize accuracy on a tuning set • Only make distinctions that are statistically justified • Minimize |concept description| + λ |exception list| • Use ensembles to average out overfitting (next topic)
Decision “Stumps” • Holte (MLJ) compared: • Decision trees with only one decision (decision stumps) VS • Trees produced by C4.5 (with pruning algorithm used) • Decision “stumps” do remarkably well on UC Irvine data sets • Archive too easy? • Decision stumps are a “quick and dirty” control for comparing to new algorithms. • But C4.5 easy to use and probably a better control.
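A 1R-style decision stump along Holte's lines can be sketched as follows (an illustration; the data reuse the Color/Size toy table, and the helper names are my own):

```python
def one_rule(examples, features):
    """1R / decision-stump baseline (sketched): for each feature, predict
    the majority class per feature value; keep the single feature whose
    rule has the best training accuracy.  Examples are
    (features_dict, label) pairs."""
    best = None
    for f in features:
        by_value = {}
        for x, y in examples:
            by_value.setdefault(x[f], []).append(y)
        # majority class per value (sorted() for deterministic ties)
        rule = {v: max(sorted(set(ys)), key=ys.count)
                for v, ys in by_value.items()}
        acc = sum(rule[x[f]] == y for x, y in examples) / len(examples)
        if best is None or acc > best[2]:
            best = (f, rule, acc)
    return best

data = [
    ({"Color": "Red",    "Size": "BIG"},   "+"),
    ({"Color": "Blue",   "Size": "BIG"},   "+"),
    ({"Color": "Red",    "Size": "SMALL"}, "-"),
    ({"Color": "Yellow", "Size": "SMALL"}, "-"),
    ({"Color": "Red",    "Size": "BIG"},   "+"),
]
feature, rule, acc = one_rule(data, ["Color", "Size"])
print(feature, acc)   # the single best decision: split on Size
```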
C4.5 Compared to 1R (“Decision Stumps”) • Test Set Accuracy • 1st column: UCI datasets • See Holte Paper for key • Max diff: 2nd row • Min Diff: 5th row • UCI datasets too easy?
Dealing with Missing Features • Bayes nets might be the best technique if many missing features (later) • Common technique: Use EM algorithm (later) • Quinlan’s suggested approach: • During Training (on each recursive call) • Fill in missing values proportionally • If 50% red, 30% blue and 20% green (for non-missing cases), then fill missing values according to this probability distribution • Do this per output category
Simple Example • Note: by "missing features" we really mean "missing feature values" • Prob(red | +) = 2/3, Prob(blue | +) = 1/3 • Prob(red | -) = 1/2, Prob(blue | -) = 1/2 • Flip weighted coins to fill in the ?'s
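The fill-in step can be sketched in Python (an illustration of Quinlan's suggestion; the small data set below is hypothetical, arranged to match the slide's probabilities):

```python
import random

def fill_missing(examples, feature, rng=None):
    """Fill '?' values for `feature` by sampling from the distribution of
    observed values, separately per output category.  Examples are
    (features_dict, label) pairs; the dicts are filled in place."""
    rng = rng or random.Random(0)
    for category in {y for _, y in examples}:
        observed = [f[feature] for f, y in examples
                    if y == category and f[feature] != "?"]
        for f, y in examples:
            if y == category and f[feature] == "?":
                # drawing uniformly from the observed list reproduces the
                # per-category frequencies (the "weighted coin" flip)
                f[feature] = rng.choice(observed)

# Matches the slide: Prob(red|+) = 2/3, Prob(blue|+) = 1/3,
#                    Prob(red|-) = 1/2, Prob(blue|-) = 1/2.
data = [
    ({"Color": "red"},  "+"), ({"Color": "red"},  "+"),
    ({"Color": "blue"}, "+"), ({"Color": "?"},    "+"),
    ({"Color": "red"},  "-"), ({"Color": "blue"}, "-"),
    ({"Color": "?"},    "-"),
]
fill_missing(data, "Color")
print([f["Color"] for f, _ in data])   # no '?' remains
```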
Missing Features During Testing • Follow all paths, weighting answers in proportion to the probability of each path; e.g., with branch probabilities 40% red, 20% blue, 40% green: out+(color) = 0.4 out+(red) + 0.2 out+(blue) + 0.4 out+(green) (repeat for -) • Repeat throughout the subtrees
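The weighted-path computation can be sketched recursively (the tree representation and branch-probability table here are my own illustration):

```python
def vote(tree, example, branch_probs):
    """Classify with missing values at TEST time: when the example's
    value for the node's feature is '?', follow ALL branches and weight
    each branch's vote by its probability (estimated from training
    data).  Returns {label: weight}.  A leaf is a label string; an
    internal node is (feature, {branch_value: subtree})."""
    if isinstance(tree, str):
        return {tree: 1.0}
    feature, branches = tree
    value = example.get(feature, "?")
    if value != "?":                        # value known: follow one path
        return vote(branches[value], example, branch_probs)
    totals = {}                             # value missing: follow all paths
    for v, child in branches.items():
        for label, w in vote(child, example, branch_probs).items():
            totals[label] = totals.get(label, 0.0) \
                            + branch_probs[feature][v] * w
    return totals

# The slide's example: 40% red, 20% blue, 40% green.
tree = ("Color", {"red": "+", "blue": "-", "green": "+"})
probs = {"Color": {"red": 0.4, "blue": 0.2, "green": 0.4}}
print(vote(tree, {"Color": "?"}, probs))   # '+' gets 0.4 + 0.4 = 0.8
```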
Why are Features Missing? • Model on previous page implicitly assumes feature values are randomly deleted • as if hit by a cosmic ray! • But values might be missing for a reason • E.g., data collector decided the values for some features are not worth recording • One suggested solution: • Let “not-recorded” be another legal value (and, hence, a branch in the decision tree)
A D-Tree Variant that Exploits Info in "Missing" Feature Values • At each recursive call, only consider features that have no missing values • E.g., split on Shape rather than Color; then maybe all the examples with missing Color values take the same path • Could generalize this algorithm by penalizing features with missing values
ID3 Recap ~ Questions Addressed • How closely should we fit the training data? • Completely, then prune • Use MDL or tuning sets to choose • How do we judge features? • Use info theory (Shannon) • What if a feature has many values? • Correction factor based on info theory • What if some feature values are unknown (in some examples)? • Distribute based on other examples (???)
ID3 Recap (cont.) • What if some features cost more to evaluate (CAT scan vs. Temperature)? • Ad hoc correction factor • Batch vs. incremental learning? • Basically a “batch” approach; incremental variants exist but since ID3 is so fast, why not simply rerun “from scratch”?