
Evaluating What’s Been Learned



  1. Evaluating What’s Been Learned

  2. Cross-Validation • The foundation is a simple idea – the “holdout”: hold out a certain amount of the data for testing and use the rest for training • The separation should NOT be done by convenience – it should at least be random • Better – “stratified” random: the division preserves the relative proportion of the classes in both the training and the test data • Enhancement – repeated holdout: enables using more data in training while still getting a good test • 10-fold cross-validation has become the standard • This is improved if the folds are chosen in a “stratified” random way
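
A minimal sketch of stratified 10-fold cross-validation, using scikit-learn rather than WEKA; GaussianNB and the iris data are stand-ins, not part of the slides:

      # Stratified 10-fold cross-validation: shuffle=True gives the
      # "stratified random" split the slide recommends.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import StratifiedKFold, cross_val_score
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)                # stand-in dataset
      folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
      scores = cross_val_score(GaussianNB(), X, y, cv=folds)

      print("per-fold accuracy:", scores.round(3))
      print("mean accuracy:", scores.mean().round(3))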

  3. For Small Datasets • Leave One Out • Bootstrapping • To be discussed in turn

  4. Leave One Out • Train on all but one instance, test on that one (percent correct on each run is always 100% or 0%) • Repeat until every instance has been tested on, then average the results • Equivalent to N-fold cross-validation where N = number of instances available • Pluses: • Always trains on the maximum possible training data (without cheating) • Efficient to run – no repetition needed (since fold contents are not randomized) • No stratification or random sampling necessary • Minuses: • Guarantees a non-stratified sample – the correct class will always be at least a little under-represented in the training data • Statistical tests are not appropriate
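
A minimal sketch of leave-one-out evaluation, again with scikit-learn stand-ins (GaussianNB, iris) rather than WEKA:

      # Leave-one-out: equivalent to N-fold cross-validation with
      # N = number of instances; each per-fold score is 0 or 1.
      from sklearn.datasets import load_iris
      from sklearn.model_selection import LeaveOneOut, cross_val_score
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)
      scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
      print("LOO accuracy:", scores.mean().round(3))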

  5. Bootstrapping • Sampling is done with replacement to form a training dataset • Particular approach – the 0.632 bootstrap • A dataset of n instances is sampled n times • Some instances will be included multiple times • Those never picked are used as test data • On a large enough dataset, about 0.632 of the distinct instances end up in the training dataset; the rest form the test set • This is a somewhat pessimistic estimate of performance, since only about 63% of the data is used for training (vs 90% in 10-fold cross-validation) • May try to compensate by also weighting in the performance on the training data (p 129) <but this doesn’t seem fair> • The procedure can be repeated any number of times, allowing statistical tests
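
A minimal sketch of one 0.632 bootstrap round; the classifier and dataset are stand-ins, and the 0.632/0.368 weighting is the one the slide cites from p. 129:

      # One round of the 0.632 bootstrap: sample n instances WITH
      # replacement for training; instances never picked form the test set.
      import numpy as np
      from sklearn.datasets import load_iris
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)
      n = len(y)
      rng = np.random.default_rng(1)

      train_idx = rng.integers(0, n, size=n)              # with replacement
      test_idx = np.setdiff1d(np.arange(n), train_idx)    # never picked

      clf = GaussianNB().fit(X[train_idx], y[train_idx])
      test_err = 1 - clf.score(X[test_idx], y[test_idx])
      train_err = 1 - clf.score(X[train_idx], y[train_idx])

      err_632 = 0.632 * test_err + 0.368 * train_err      # weighted estimate
      print(f"distinct training fraction: {len(set(train_idx)) / n:.3f}")  # ~0.632
      print(f"0.632 bootstrap error: {err_632:.3f}")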

  6. Counting the Cost • Some mistakes are more costly to make than others • Giving a loan to a defaulter is more costly than denying somebody who would be a good customer • Sending a mail solicitation to somebody who won’t buy is less costly than missing somebody who would buy (opportunity cost) • Looking at a confusion matrix, each position can have an associated cost (or benefit, for the correct positions) • The measurement could then be average profit/loss per prediction • To be fair, a cost-benefit analysis should also factor in the cost of collecting and preparing the data, building the model, …
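
A minimal sketch of the “average profit/loss per prediction” idea for the loan example; the confusion-matrix counts and the payoff values are made up for illustration:

      # Average payoff per prediction from a confusion matrix and a
      # cost/benefit matrix with the same layout (rows = actual, cols = predicted).
      import numpy as np

      confusion = np.array([[700,  50],    # actual good:      approved, denied
                            [ 30, 220]])   # actual defaulter: approved, denied

      payoff = np.array([[ 100,  -20],     # benefit/cost of each outcome
                         [-500,    0]])

      total = (confusion * payoff).sum()
      print("average payoff per prediction:", total / confusion.sum())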

  7. Lift Charts • In practice, costs are frequently not known • Decisions may be made by comparing possible scenarios • Book example – promotional mailing • Situation 1 – previous experience predicts that 0.1% of all 1,000,000 households will respond • Situation 2 – a classifier predicts that 0.4% of the 100,000 most promising households will respond • Situation 3 – a classifier predicts that 0.2% of the 400,000 most promising households will respond • The increase in response rate is the lift (0.4 / 0.1 = 4 in situation 2, compared to sending to all)
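
A minimal sketch of the lift calculation for the mailing example; the numbers are the ones on the slide:

      # Lift = response rate on the classifier-selected subset divided by
      # the baseline response rate when mailing everybody.
      baseline_rate = 0.001                            # 0.1% of 1,000,000 households

      scenarios = {"top 100,000": (0.004, 100_000),    # situation 2
                   "top 400,000": (0.002, 400_000)}    # situation 3

      for name, (rate, n) in scenarios.items():
          lift = rate / baseline_rate
          print(f"{name}: lift = {lift:.1f}, expected responders = {rate * n:.0f}")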

  8. Information Retrieval (IR) Measures • E.g., given a Web search, a search engine produces a list of supposedly relevant hits • Which is better? • Retrieving 100, of which 40 are actually relevant • Retrieving 400, of which 80 are actually relevant • It really depends on the costs

  9. Information Retrieval (IR) Measures • The IR community has developed 3 measures: • Recall = (number of documents retrieved that are relevant) / (total number of documents that are relevant) • Precision = (number of documents retrieved that are relevant) / (total number of documents that are retrieved) • F-measure = (2 * recall * precision) / (recall + precision)
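
A minimal sketch of the three measures, applied to the retrieval example on the previous slide; the assumption that 100 relevant documents exist in total is mine, purely for illustration:

      # Recall, precision and F-measure for a retrieval result.
      def ir_measures(retrieved, retrieved_and_relevant, total_relevant):
          recall = retrieved_and_relevant / total_relevant
          precision = retrieved_and_relevant / retrieved
          f_measure = 2 * recall * precision / (recall + precision)
          return recall, precision, f_measure

      # 100 retrieved / 40 relevant  vs  400 retrieved / 80 relevant,
      # assuming (hypothetically) 100 relevant documents exist in total.
      for retrieved, relevant in [(100, 40), (400, 80)]:
          r, p, f = ir_measures(retrieved, relevant, total_relevant=100)
          print(f"retrieved {retrieved}: recall={r:.2f} precision={p:.2f} F={f:.2f}")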

  10. WEKA • Part of the results provided by WEKA (that we’ve ignored so far) • Let’s look at an example (Naïve Bayes on my-weather-nominal)

      === Detailed Accuracy By Class ===
      TP Rate   FP Rate   Precision   Recall   F-Measure   Class
      0.667     0.125     0.8         0.667    0.727       yes
      0.875     0.333     0.778       0.875    0.824       no

      === Confusion Matrix ===
      a b   <-- classified as
      4 2 | a = yes
      1 7 | b = no

  • TP rate and recall are the same = TP / (TP + FN) • For Yes = 4 / (4 + 2); For No = 7 / (7 + 1) • FP rate = FP / (FP + TN) • For Yes = 1 / (1 + 7); For No = 2 / (2 + 4) • Precision = TP / (TP + FP) • For Yes = 4 / (4 + 1); For No = 7 / (7 + 2) • F-measure = 2TP / (2TP + FP + FN) • For Yes = 2*4 / (2*4 + 1 + 2) = 8 / 11 • For No = 2*7 / (2*7 + 2 + 1) = 14 / 17
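
A minimal sketch that reproduces WEKA’s per-class numbers from the 2x2 confusion matrix above, treating each class in turn as the “positive” one:

      import numpy as np

      cm = np.array([[4, 2],    # actual yes: 4 predicted yes, 2 predicted no
                     [1, 7]])   # actual no:  1 predicted yes, 7 predicted no
      classes = ["yes", "no"]

      for i, label in enumerate(classes):
          tp = cm[i, i]
          fn = cm[i].sum() - tp         # actual i, predicted otherwise
          fp = cm[:, i].sum() - tp      # predicted i, actually otherwise
          tn = cm.sum() - tp - fn - fp
          print(f"{label}: TP rate={tp / (tp + fn):.3f} "
                f"FP rate={fp / (fp + tn):.3f} "
                f"precision={tp / (tp + fp):.3f} "
                f"F={2 * tp / (2 * tp + fp + fn):.3f}")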

  11. In terms of true positives etc. • True positives = TP; False positives = FP • True negatives = TN; False negatives = FN • Recall = TP / (TP + FN) // true positives / actually positive • Precision = TP / (TP + FP) // true positives / predicted positive • F-measure = 2TP / (2TP + FP + FN) • This form is derived algebraically from the previous formula • It is easier to understand this way – correct predictions are double-counted, once for recall and once for precision, and the denominator includes the corrects plus the incorrects from either perspective (relevant but not retrieved, or retrieved but not relevant) • There is no mathematics that says recall and precision must be combined this way – it is ad hoc – but it does balance the two
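
The algebra behind that derivation, spelled out by substituting the recall and precision definitions into the F-measure (notation as on the slide):

      F = \frac{2PR}{P+R}
        = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}
               {\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}
        = \frac{2\,TP^{2}}{TP(TP+FN) + TP(TP+FP)}
        = \frac{2\,TP}{2\,TP + FP + FN}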

  12. WEKA • On many occasions this borders on “too much information”, but it’s all there • We can decide: are we more interested in Yes, or in No? • Are we more interested in recall or in precision?

  13. WEKA – with more than two classes • Contact Lenses with Naïve Bayes

      === Detailed Accuracy By Class ===
      TP Rate   FP Rate   Precision   Recall   F-Measure   Class
      0.8       0.053     0.8         0.8      0.8         soft
      0.25      0.1       0.333       0.25     0.286       hard
      0.8       0.444     0.75        0.8      0.774       none

      === Confusion Matrix ===
      a b  c   <-- classified as
      4 0  1 | a = soft
      0 1  3 | b = hard
      1 2 12 | c = none

  • Class exercise – show how to calculate recall, precision, and F-measure for each class
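
A minimal sketch that can be used to check the class-exercise answers: the same per-class loop as in the two-class example, applied to the 3x3 contact-lenses matrix above:

      import numpy as np

      cm = np.array([[4, 0,  1],    # actual soft
                     [0, 1,  3],    # actual hard
                     [1, 2, 12]])   # actual none
      classes = ["soft", "hard", "none"]

      for i, label in enumerate(classes):
          tp = cm[i, i]
          fn = cm[i].sum() - tp
          fp = cm[:, i].sum() - tp
          print(f"{label}: recall={tp / (tp + fn):.3f} "
                f"precision={tp / (tp + fp):.3f} "
                f"F={2 * tp / (2 * tp + fp + fn):.3f}")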

  14. Applying Action Rules to change Detractor to Passive (Accuracy ~ Precision, Coverage ~ Recall) • Let’s assume that we built action rules from the classifiers for Promoter & Detractor • The goal is to change Detractors -> Promoters • The confidence of the action rule = 0.993 * 0.849 ≈ 0.84 • Our action rule can target only 4.2 (out of 10.2) detractors • So we can expect 4.2 * 0.84 ≈ 3.5 detractors moving to promoter status
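
A minimal sketch of the slide’s expected-conversion arithmetic (numbers taken from the slide; the exact product is 3.528, i.e. about 3.5):

      rule_confidence = round(0.993 * 0.849, 2)    # 0.84, as on the slide
      detractors_covered = 4.2                     # out of 10.2 detractors
      expected = detractors_covered * rule_confidence
      print(f"expected detractors converted: {expected:.3f}")   # 3.528, ~3.5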
