
Evaluating What’s Been Learned



  1. Evaluating What’s Been Learned

  2. Cross-Validation • The foundation is a simple idea – the “holdout”: hold out a certain amount of the data for testing and use the rest for training • The separation should NOT be done by convenience – it should at least be random • Better – “stratified” random: the division preserves the relative proportion of the classes in both the training and the test data • Enhancement – repeated holdout: enables using more data in training while still getting a good test • 10-fold cross-validation has become the standard • This is improved if the folds are chosen in a “stratified” random way
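
A minimal sketch of stratified 10-fold cross-validation, using scikit-learn rather than WEKA; GaussianNB and the iris data are stand-ins, not part of the slides:

      # Stratified 10-fold cross-validation: shuffle=True gives the
      # "stratified random" split the slide recommends.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)                # stand-in dataset
      folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
      scores = cross_val_score(GaussianNB(), X, y, cv=folds)

      print("per-fold accuracy:", scores.round(3))
      print("mean accuracy:", scores.mean().round(3))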

  3. For Small Datasets • Leave One Out • Bootstrapping • To be discussed in turn

  4. Leave One Out • Train on all but one instance, test on that one (percent correct on each run is always 100% or 0%) • Repeat until every instance has been tested on, then average the results • Equivalent to N-fold cross-validation where N = number of instances available • Pluses: • Always trains on the maximum possible training data (without cheating) • Efficient to run – no repetition needed (since fold contents are not randomized) • No stratification or random sampling necessary • Minuses: • Guarantees a non-stratified sample – the correct class will always be at least a little under-represented in the training data • Statistical tests are not appropriate
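
A minimal sketch of leave-one-out evaluation, again with scikit-learn stand-ins (GaussianNB, iris) rather than WEKA:

      # Leave-one-out: equivalent to N-fold cross-validation with
      # N = number of instances; each per-fold score is 0 or 1.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import LeaveOneOut, cross_val_score
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)
      scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
      print("LOO accuracy:", scores.mean().round(3))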

  5. Bootstrapping • Sampling is done with replacement to form a training dataset • Particular approach – the 0.632 bootstrap • A dataset of n instances is sampled n times • Some instances will be included multiple times • Those never picked are used as test data • On a large enough dataset, about 0.632 of the distinct instances end up in the training dataset; the rest form the test set • This is a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs 90% in 10-fold cross-validation) • May try to compensate by also weighting in the performance on the training data (p 129) <but this doesn’t seem fair> • The procedure can be repeated any number of times, allowing statistical tests
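
A minimal sketch of one 0.632 bootstrap round; the classifier and dataset are stand-ins, and the 0.632/0.368 weighting is the one the slide cites from p. 129:

      # One round of the 0.632 bootstrap: sample n instances WITH
      # replacement for training; instances never picked form the test set.
      import numpy as np
      from sklearn.datasets import load_iris
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)
      n = len(y)
      rng = np.random.default_rng(1)

      train_idx = rng.integers(0, n, size=n)              # with replacement
      test_idx = np.setdiff1d(np.arange(n), train_idx)    # never picked

      clf = GaussianNB().fit(X[train_idx], y[train_idx])
      test_err = 1 - clf.score(X[test_idx], y[test_idx])
      train_err = 1 - clf.score(X[train_idx], y[train_idx])

      err_632 = 0.632 * test_err + 0.368 * train_err      # weighted estimate
      print(f"distinct training fraction: {len(set(train_idx)) / n:.3f}")  # ~0.632
      print(f"0.632 bootstrap error: {err_632:.3f}")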

  6. Counting the Cost • Some mistakes are more costly to make than others • Giving a loan to a defaulter is more costly than denying somebody who would be a good customer • Sending a mail solicitation to somebody who won’t buy is less costly than missing somebody who would buy (opportunity cost) • Looking at a confusion matrix, each position can have an associated cost (or benefit, for the correct positions) • The measurement could then be average profit/loss per prediction • To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model, …
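
A minimal sketch of the “average profit/loss per prediction” idea for the loan example; the confusion-matrix counts and the payoff values are made up for illustration:

      # Average payoff per prediction from a confusion matrix and a
      # cost/benefit matrix with the same layout (rows = actual, cols = predicted).
      import numpy as np

      confusion = np.array([[700,  50],    # actual good:      approved, denied
                            [ 30, 220]])   # actual defaulter: approved, denied

      payoff = np.array([[ 100,  -20],     # benefit/cost of each outcome
                         [-500,    0]])

      total = (confusion * payoff).sum()
      print("average payoff per prediction:", total / confusion.sum())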

  7. Lift Charts • In practice, costs are frequently not known • Decisions may be made by comparing possible scenarios • Book example – promotional mailing • Situation 1 – previous experience predicts that 0.1% of all 1,000,000 households will respond • Situation 2 – a classifier predicts that 0.4% of the 100,000 most promising households will respond • Situation 3 – a classifier predicts that 0.2% of the 400,000 most promising households will respond • The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all)
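
A minimal sketch of the lift calculation for the mailing example; the numbers are the ones on the slide:

      # Lift = response rate on the classifier-selected subset divided by
      # the baseline response rate when mailing everybody.
      baseline_rate = 0.001                            # 0.1% of 1,000,000 households

      scenarios = {"top 100,000": (0.004, 100_000),    # situation 2
                   "top 400,000": (0.002, 400_000)}    # situation 3

      for name, (rate, n) in scenarios.items():
          lift = rate / baseline_rate
          print(f"{name}: lift = {lift:.1f}, expected responders = {rate * n:.0f}")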

  8. Information Retrieval (IR) Measures • E.g., given a Web search, a search engine produces a list of supposedly relevant hits • Which is better? • Retrieving 100, of which 40 are actually relevant • Retrieving 400, of which 80 are actually relevant • It really depends on the costs

  9. Information Retrieval (IR) Measures • The IR community has developed 3 measures: • Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant) • Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved) • F-measure = (2 * recall * precision) / (recall + precision)
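
A minimal sketch of the three measures, applied to the retrieval example on the previous slide; the assumption that 100 relevant documents exist in total is mine, purely for illustration:

      # Recall, precision and F-measure for a retrieval result.
      def ir_measures(retrieved, retrieved_and_relevant, total_relevant):
          recall = retrieved_and_relevant / total_relevant
          precision = retrieved_and_relevant / retrieved
          f_measure = 2 * recall * precision / (recall + precision)
          return recall, precision, f_measure

      # 100 retrieved / 40 relevant  vs  400 retrieved / 80 relevant,
      # assuming (hypothetically) 100 relevant documents exist in total.
      for retrieved, relevant in [(100, 40), (400, 80)]:
          r, p, f = ir_measures(retrieved, relevant, total_relevant=100)
          print(f"retrieved {retrieved}: recall={r:.2f} precision={p:.2f} F={f:.2f}")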

  10. WEKA • Part of the results provided by WEKA (that we’ve ignored so far) • Let’s look at an example (Naïve Bayes on my-weather-nominal)

      === Detailed Accuracy By Class ===
      TP Rate   FP Rate   Precision   Recall   F-Measure   Class
      0.667     0.125     0.8         0.667    0.727       yes
      0.875     0.333     0.778       0.875    0.824       no

      === Confusion Matrix ===
      a b   <-- classified as
      4 2 | a = yes
      1 7 | b = no

  • TP rate and recall are the same = TP / (TP + FN) • For Yes = 4 / (4 + 2); For No = 7 / (7 + 1) • FP rate = FP / (FP + TN) • For Yes = 1 / (1 + 7); For No = 2 / (2 + 4) • Precision = TP / (TP + FP) • For Yes = 4 / (4 + 1); For No = 7 / (7 + 2) • F-measure = 2TP / (2TP + FP + FN) • For Yes = 2*4 / (2*4 + 1 + 2) = 8 / 11 • For No = 2*7 / (2*7 + 2 + 1) = 14 / 17
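
A minimal sketch that reproduces WEKA’s per-class numbers from the 2x2 confusion matrix above, treating each class in turn as the “positive” one:

      import numpy as np

      cm = np.array([[4, 2],    # actual yes: 4 predicted yes, 2 predicted no
                     [1, 7]])   # actual no:  1 predicted yes, 7 predicted no
      classes = ["yes", "no"]

      for i, label in enumerate(classes):
          tp = cm[i, i]
          fn = cm[i].sum() - tp         # actual i, predicted otherwise
          fp = cm[:, i].sum() - tp      # predicted i, actually otherwise
          tn = cm.sum() - tp - fn - fp
          print(f"{label}: TP rate={tp / (tp + fn):.3f} "
                f"FP rate={fp / (fp + tn):.3f} "
                f"precision={tp / (tp + fp):.3f} "
                f"F={2 * tp / (2 * tp + fp + fn):.3f}")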

  11. In terms of true positives etc. • True positives = TP; False positives = FP • True negatives = TN; False negatives = FN • Recall = TP / (TP + FN) // true positives / actually positive • Precision = TP / (TP + FP) // true positives / predicted positive • F-measure = 2TP / (2TP + FP + FN) • This form is derived algebraically from the previous formula • It is easier to understand this way – correct predictions are double-counted, once for recall and once for precision, and the denominator includes the corrects plus the incorrects from either perspective (relevant but not retrieved, or retrieved but not relevant) • There is no mathematics that says recall and precision must be combined this way – it is ad hoc – but it does balance the two
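
The algebra behind that derivation, spelled out by substituting the recall and precision definitions into the F-measure (notation as on the slide):

      F = \frac{2PR}{P+R}
        = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
               {\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}
        = \frac{2\,TP^{2}}{TP(TP+FN) + TP(TP+FP)}
        = \frac{2\,TP}{2\,TP + FP + FN}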

  12. WEKA • On many occasions this borders on “too much information”, but it’s all there • We can decide: are we more interested in Yes, or in No? • Are we more interested in recall or in precision?

  13. WEKA – with more than two classes • Contact Lenses with Naïve Bayes

      === Detailed Accuracy By Class ===
      TP Rate   FP Rate   Precision   Recall   F-Measure   Class
      0.8       0.053     0.8         0.8      0.8         soft
      0.25      0.1       0.333       0.25     0.286       hard
      0.8       0.444     0.75        0.8      0.774       none

      === Confusion Matrix ===
      a b  c   <-- classified as
      4 0  1 | a = soft
      0 1  3 | b = hard
      1 2 12 | c = none

  • Class exercise – show how to calculate recall, precision, and F-measure for each class
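
A minimal sketch that can be used to check the class-exercise answers: the same per-class loop as in the two-class example, applied to the 3x3 contact-lenses matrix above:

      import numpy as np

      cm = np.array([[4, 0,  1],    # actual soft
                     [0, 1,  3],    # actual hard
                     [1, 2, 12]])   # actual none
      classes = ["soft", "hard", "none"]

      for i, label in enumerate(classes):
          tp = cm[i, i]
          fn = cm[i].sum() - tp
          fp = cm[:, i].sum() - tp
          print(f"{label}: recall={tp / (tp + fn):.3f} "
                f"precision={tp / (tp + fp):.3f} "
                f"F={2 * tp / (2 * tp + fp + fn):.3f}")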

  14. Applying Action Rules to change Detractor to Passive (Accuracy ~ Precision, Coverage ~ Recall) • Let’s assume that we built action rules from the classifiers for Promoter & Detractor • The goal is to change Detractors -> Promoters • The confidence of the action rule = 0.993 * 0.849 ≈ 0.84 • Our action rule can target only 4.2 (out of 10.2) detractors • So we can expect 4.2 * 0.84 ≈ 3.5 detractors moving to promoter status
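
A minimal sketch of the slide’s expected-conversion arithmetic (numbers taken from the slide; the exact product is 3.528, i.e. about 3.5):

      rule_confidence = round(0.993 * 0.849, 2)    # 0.84, as on the slide
      detractors_covered = 4.2                     # out of 10.2 detractors
      expected = detractors_covered * rule_confidence
      print(f"expected detractors converted: {expected:.3f}")   # 3.528, ~3.5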
