
The Expected Performance Curve Samy Bengio, Johnny Mariéthoz, Mikaela Keller


Presentation Transcript


  1. The Expected Performance Curve Samy Bengio, Johnny Mariéthoz, Mikaela Keller MI – 25 October 2007 Kresten Toftgaard Andersen

  2. Introduction to the paper • By Samy Bengio, Johnny Mariéthoz and Mikaela Keller, 2005 • For the machine learning community and researchers who need to compare models. Content of the paper: • Introduces ROC curves very briefly. • Points out some risks when using ROC curves to compare different classification models. • Argues that ROC curves can be misleading, illustrated with some results. • The authors contribute a so-called “Expected Performance Curve” and argue why it is better for comparing models. • Extends EPC with confidence intervals and statistical difference tests. • Concludes the paper by summarizing their contribution and listing strengths and weaknesses of ROC and EPC. • Acknowledgements and references

  3. Content • Motivation • Introduce terminology and notation, define the problem • Introduce ROC curves • Example: how to calculate a ROC • Present arguments for why ROC curves should be used with great care • Introduce EPC • Continue the example, showing how to calculate an EPC • Present arguments for why EPC might be better than ROC • Confidence intervals • My opinion • Discussion

  4. Motivation ROC analysis is an important way to compare binary classifier models. It can be used to select optimal models and discard suboptimal ones. Areas of use: • Medicine (diagnostic testing, evaluating evidence-based medicine approaches) • Epidemiology (factors affecting health, evaluating optimal treatment approaches) • Radiology (radar signals, evaluating new radiology techniques) • Psychology (signal detection, assessing human detection of weak signals) • Machine Learning (evaluation of machine learning techniques) • …

  5. Definition of 2-class classifiers • Definition of 2-class classification problems: • Apply the function and its associated threshold to a separate test data set (where the true class is known) and count the outcomes.

  6. Confusion matrix Given a 2-class classifier and an instance, there are four possible outcomes: • TP: instance is positive and is classified as positive • FN: instance is positive and is classified as negative • TN: instance is negative and is classified as negative • FP: instance is negative and is classified as positive
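The slide gives no code; as a minimal sketch of how these four counts could be tallied for a score-based 2-class classifier (the function name confusion_counts and the score/label encoding are illustrative, not from the paper):

import numpy as np

def confusion_counts(scores, labels, threshold):
    # Count TP, FP, TN, FN for a classifier that accepts when score >= threshold.
    # scores: real-valued model outputs (higher means "more positive")
    # labels: true classes, 1 = positive, 0 = negative
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    pred = scores >= threshold
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    tn = int(np.sum(~pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    return tp, fp, tn, fn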

  7. Performance metrics • The selected measure is a pair, generically called V1 and V2. • V1 and V2 can be calculated in many ways depending on the situation. All are simple combinations of TP, TN, FP and FN. • The exact calculation of V1 and V2 is not important in this paper.

  8. Performance metrics • A single measure, generically called V, combines V1 and V2 • V can also be calculated in several ways depending on the situation (e.g. the Half Total Error Rate)
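The paper deliberately leaves the exact choice of V1, V2 and V open; as one common instantiation (false acceptance rate and false rejection rate, combined into the Half Total Error Rate), a sketch reusing the counts from the earlier example:

def far_frr(tp, fp, tn, fn):
    # One possible (V1, V2) pair: false acceptance rate and false rejection rate.
    far = fp / (fp + tn) if (fp + tn) > 0 else 0.0   # negatives wrongly accepted
    frr = fn / (fn + tp) if (fn + tp) > 0 else 0.0   # positives wrongly rejected
    return far, frr

def hter(far, frr):
    # Half Total Error Rate: one choice for the combined measure V.
    return 0.5 * (far + frr)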

  9. What is a ROC curve? ROC • Abbreviation for ”Receiver Operating Characteristics”. • A technique for visualizing, organizing and selecting classifiers based on their performance. • ROC can be presented either as a graph or as a curve. Classifiers • Discrete classifiers (decision trees, rule sets etc.) • Probabilistic classifiers (Naive Bayes, neural networks etc.) • Varying a threshold for a probabilistic classifier will trace a curve (ROC). The following example will show this.
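As a sketch of how varying the threshold traces the curve, assuming the confusion_counts helper sketched earlier and the usual (false positive rate, true positive rate) axes; the function name roc_points is illustrative:

def roc_points(scores, labels):
    # Sweep the threshold over all observed scores and record one
    # (false positive rate, true positive rate) point per threshold.
    points = []
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        points.append((fpr, tpr))
    return points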

  10. Example

  11. Example

  12. Example Threshold

  13. Example Threshold

  14. Example Threshold

  15. Example Threshold

  16. Example

  17. Example

  18. ROC curves • BEP = Break-Even Point • The BEP corresponds to the threshold nearest to a solution such that V1 = V2 • The selected threshold has a significant impact on the model. • The threshold represents a trade-off between giving importance to V1 or to V2.
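A minimal sketch of how the BEP threshold could be located, assuming FAR/FRR as (V1, V2) and reusing the helpers sketched earlier (the function name break_even_threshold is illustrative):

def break_even_threshold(scores, labels):
    # Return the threshold whose (V1, V2) = (FAR, FRR) pair is closest to V1 == V2.
    best_t, best_gap = None, float("inf")
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        far, frr = far_frr(tp, fp, tn, fn)
        if abs(far - frr) < best_gap:
            best_t, best_gap = threshold, abs(far - frr)
    return best_t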

  19. Potential risk of using ROC • Each point corresponds to a particular setting of the threshold. But in “real applications” the threshold needs to be decided before seeing the test set. • Normally the threshold is found by searching for the BEP using some equation. • Possibility of a mismatch because the training set is different from the test set. • Situations may occur where the optimal threshold found using the training set does not correspond to the optimal threshold on the test set. • One parameter, the threshold, is tuned using the training set. There is a potential risk in expecting the training error to reflect the generalization error. “Real applications often suffer from an additional mismatch between training and test conditions”. • Risk of a different trade-off (V1, V2) in the test set. • ROC curves do not take the risk of a mismatch into account. • This possibility should be reflected in the procedure when calculating the performance curve.

  20. Potential risk of using ROC ROCs of two real models for a Text-Independent Speaker Verification task. Looking only at the curves, model B seems to be better than model A. Looking at the thresholds, A is actually the better model.

  21. Expected performance curve • EPC presents a range of possible expected performance on the test set. • The calculation takes into account the possible mismatch while estimating the desired threshold. • A parameter α is used to account for the possible mismatch of the threshold. Framework: a parametric performance measure C(V1(θ, D), V2(θ, D); α), which depends on the parameter α and on V1 and V2 computed on some data D for the threshold θ. Example: C(V1(θ, D), V2(θ, D); α) = C(Precision(θ, D), Recall(θ, D); α) = −(α · Precision(θ, D) + (1 − α) · Recall(θ, D)). Procedure: Vary α inside a reasonable range and, for each α, estimate the θ that minimizes C(·, ·; α) on a development set, then use the obtained θ to compute V on the test set. Finally, plot V with respect to α.

  22. EPC Algorithm
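The algorithm slide itself is an image; as a sketch of the procedure described on the previous slide, assuming FAR/FRR as (V1, V2), HTER as V, and the criterion C = α·FAR + (1 − α)·FRR (the function name epc_curve is illustrative, and the helpers are the ones sketched earlier):

import numpy as np

def epc_curve(dev_scores, dev_labels, test_scores, test_labels,
              alphas=np.linspace(0.0, 1.0, 21)):
    # For each alpha: choose the threshold that minimizes the alpha-weighted
    # criterion on the development set only, then freeze that threshold and
    # report the combined measure (here HTER) on the test set.
    curve = []
    for alpha in alphas:
        best_t, best_c = None, float("inf")
        for threshold in sorted(set(dev_scores)):
            tp, fp, tn, fn = confusion_counts(dev_scores, dev_labels, threshold)
            far, frr = far_frr(tp, fp, tn, fn)
            c = alpha * far + (1.0 - alpha) * frr      # C(V1, V2; alpha)
            if c < best_c:
                best_t, best_c = threshold, c
        tp, fp, tn, fn = confusion_counts(test_scores, test_labels, best_t)
        far, frr = far_frr(tp, fp, tn, fn)
        curve.append((alpha, hter(far, frr)))          # one point of the EPC
    return curve

The key point is that the threshold for each α is selected on the development set only and then kept fixed before the test set is touched, which is exactly the a-priori threshold selection the paper argues ROC curves ignore.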

  23. Example

  24. Example

  25. Example

  26. Example

  27. Example

  28. Example of a typical EPC • Alpha > 0.5: more importance given to false acceptance errors • Alpha < 0.5: more importance given to false rejection errors

  29. EPC in real applications Expected Performance Curves for person authentication, where one wants to trade off false acceptance rates against false rejection rates. Expected Performance Curves for text categorization, where one wants to trade off precision and recall and plot the F1 measure.

  30. Confidence Interval • Confidence intervals are used to indicate the reliability of an estimate
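The slide does not state how the intervals are computed; as a minimal sketch of one standard approach (a normal approximation to the binomial, not necessarily the method used in the paper; the function name error_rate_interval is illustrative):

import math

def error_rate_interval(error_rate, n_trials, z=1.96):
    # Normal-approximation confidence interval for an error rate
    # estimated from n_trials independent decisions (z = 1.96 gives ~95%).
    half = z * math.sqrt(error_rate * (1.0 - error_rate) / n_trials)
    return max(0.0, error_rate - half), min(1.0, error_rate + half)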

  31. My opinion • The authors have a point and the idea is good. • Good for comparing models… • …but it is hard to read much from an EPC; ROC is more informative. • Cumbersome to compute an EPC. • Useful… maybe? • Apparently only used by the authors?

  32. End of Line • Questions • Discussion
