This talk compares nine commonly used performance metrics by data mining the results of a massive empirical study, to uncover the relationships between them: which metrics cluster together, which are more robust, what to optimize when a different metric matters, and whether new, better metrics can be designed. The metrics span three families: threshold metrics (e.g. accuracy, lift), ordering/ranking metrics (e.g. ROC area, precision/recall), and probability metrics.
Spooky Stuff: Data Mining in Metric Space
Rich Caruana, Alex Niculescu
Cornell University
Motivation #1: Many Learning Algorithms
• Neural nets
• Logistic regression
• Linear perceptron
• K-nearest neighbor
• Decision trees
• ILP (Inductive Logic Programming)
• SVMs (Support Vector Machines)
• Bagging X
• Boosting X
• Rule learners (C2, …)
• Ripper
• Random Forests (forests of decision trees)
• Gaussian Processes
• Bayes Nets
• …
• No single learning method dominates the others
Motivation #2: SLAC B/Bbar
• Particle accelerator generates B/Bbar particles
• Use machine learning to classify tracks as B or Bbar
• Domain-specific performance measure: SLQ score
• A 5% increase in SLQ can save $1M in accelerator time
• SLAC researchers tried various DM/ML methods: good, but not great, SLQ performance
• We tried standard methods and got similar results
• We studied the SLQ metric:
  • it is similar to probability calibration
  • so we tried bagged probabilistic decision trees (good on C-Section)
Motivation #2: Bagged Probabilistic Trees
• Draw N bootstrap samples of the data
• Train a tree on each sample ==> N trees
• Final prediction = average prediction of the N trees
• Example: (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / N = 0.24
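The following is a minimal sketch of the bagging procedure just described, assuming scikit-learn and NumPy; the function name and parameters are illustrative, not part of the original talk.

```python
# Minimal sketch of bagged probabilistic decision trees (assumes scikit-learn, NumPy).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_tree_probs(X_train, y_train, X_test, n_trees=100, seed=0):
    """Average the predicted P(y=1) of n_trees trees, each fit on a bootstrap sample."""
    rng = np.random.RandomState(seed)
    n = len(y_train)
    avg = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)            # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        # Assumes each bootstrap sample contains both classes, so column 1 is P(y=1).
        avg += tree.predict_proba(X_test)[:, 1]
    return avg / n_trees                           # final prediction = average over trees
```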
Motivation #2: Bagging Improves Calibration by an Order of Magnitude
[Figure: calibration plots. Single tree: poor calibration. 100 bagged trees: excellent calibration.]
Motivation #2: Significantly Improves SLQ
[Figure: SLQ score comparison, 100 bagged trees vs. a single tree.]
Motivation #2
• Can we automate this analysis of performance metrics so that it is easier to recognize which metrics are similar to each other?
Scary Stuff
• In an ideal world:
  • Learn a model that predicts the correct conditional probabilities (Bayes optimal)
  • Such a model yields optimal performance on any reasonable metric
• In the real world:
  • Finite data
  • 0/1 targets instead of conditional probabilities
  • Hard to learn this ideal model
  • Don't have good metrics for recognizing the ideal model
  • The ideal model isn't always needed
• In practice:
  • Learning is done with many different metrics: ACC, AUC, CXE, RMS, …
  • Each metric represents different tradeoffs
  • Because of this, it is usually important to optimize to the appropriate metric
In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study.
Goals:
• Discover relationships between performance metrics
• Are the metrics really that different?
• If you optimize to metric X, do you also get good performance on metric Y?
• If you need to do well on metric Y, which metric X should you optimize to?
• Which metrics are more/less robust?
• Can we design new, better metrics?
10 Binary Classification Performance Metrics
Threshold Metrics:
• Accuracy
• F-Score
• Lift
Ordering/Ranking Metrics:
• ROC Area
• Average Precision
• Precision/Recall Break-Even Point
Probability Metrics:
• Root-Mean-Squared Error
• Cross-Entropy
• Probability Calibration
Plus a combined metric (see the sketch below): SAR = (Accuracy + ROC Area + (1 - RMS)) / 3
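As a concrete reading of the SAR formula, here is a small sketch assuming scikit-learn and NumPy; y_true holds 0/1 labels and p holds predicted probabilities, and the names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, p, threshold=0.5):
    y_true, p = np.asarray(y_true), np.asarray(p)
    acc = accuracy_score(y_true, (p >= threshold).astype(int))  # threshold metric
    auc = roc_auc_score(y_true, p)                              # ordering metric
    rms = np.sqrt(np.mean((y_true - p) ** 2))                   # probability metric (RMS)
    return (acc + auc + (1.0 - rms)) / 3.0                      # higher is better
```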
Accuracy
Confusion matrix at a fixed threshold:
           Predicted 1      Predicted 0
  True 1   a (correct)      b (incorrect)
  True 0   c (incorrect)    d (correct)
accuracy = (a + d) / (a + b + c + d)
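A small sketch of filling in the a/b/c/d cells above from predicted probabilities and a threshold, and computing accuracy from them (NumPy assumed; names are illustrative).

```python
import numpy as np

def confusion_cells(y, p, threshold=0.5):
    pred = (p >= threshold).astype(int)
    a = np.sum((y == 1) & (pred == 1))   # true positives
    b = np.sum((y == 1) & (pred == 0))   # false negatives
    c = np.sum((y == 0) & (pred == 1))   # false positives
    d = np.sum((y == 0) & (pred == 0))   # true negatives
    return a, b, c, d

def accuracy(y, p, threshold=0.5):
    a, b, c, d = confusion_cells(y, p, threshold)
    return (a + d) / (a + b + c + d)
```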
Lift
• Not interested in accuracy on the entire dataset
• Want accurate predictions for the top 5%, 10%, or 20% of the dataset
• Don't care about the remaining 95%, 90%, or 80%, respectively
• Typical application: marketing
• Lift measures how much better than random the predictions are on the fraction of the dataset predicted true (f(x) > threshold)
Lift
Same confusion matrix, with the threshold set so that the chosen fraction of the data is predicted 1:
           Predicted 1   Predicted 0
  True 1       a             b
  True 0       c             d
lift = [a / (a + c)] / [(a + b) / (a + b + c + d)]
Example from the lift chart: lift = 3.5 if mailings are sent to the top 20% of the customers.
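A sketch of lift at a fixed fraction of the dataset, matching the mailing example above; y holds 0/1 labels and p holds scores, both NumPy arrays, and the names are illustrative.

```python
import numpy as np

def lift_at_fraction(y, p, fraction=0.20):
    k = int(np.ceil(fraction * len(y)))   # size of the predicted-true set
    top = np.argsort(-p)[:k]              # the k highest-scoring examples
    return y[top].mean() / y.mean()       # positive rate in the top k vs. overall
```

A value of 3.5 at fraction=0.20 means the top 20% of the list contains 3.5 times its fair share of positives.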
Precision/Recall, F-Score, Break-Even Point
• precision = a / (a + c), recall = a / (a + b)
• F-Score = harmonic average of precision and recall: F = 2 * precision * recall / (precision + recall)
• Break-even point: the point where precision = recall as the threshold is swept
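A sketch of these quantities at a single threshold, plus the break-even point found by sweeping the threshold over the observed scores (NumPy assumed; names are illustrative).

```python
import numpy as np

def precision_recall_f(y, p, threshold):
    pred = (p >= threshold).astype(int)
    tp = np.sum((y == 1) & (pred == 1))
    fp = np.sum((y == 0) & (pred == 1))
    fn = np.sum((y == 1) & (pred == 0))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f

def break_even_point(y, p):
    # Threshold sweep: return the value where precision and recall are closest.
    best_gap, best_val = np.inf, 0.0
    for t in np.unique(p):
        prec, rec, _ = precision_recall_f(y, p, t)
        if abs(prec - rec) < best_gap:
            best_gap, best_val = abs(prec - rec), (prec + rec) / 2.0
    return best_val
```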
[Figure: precision/recall plots illustrating better vs. worse performance.]
Confusion-matrix terminology (rows: true class, columns: predicted class):
• True 1, Predicted 1: true positive (TP), a hit, P(pr1|tr1)
• True 1, Predicted 0: false negative (FN), a miss, P(pr0|tr1)
• True 0, Predicted 1: false positive (FP), a false alarm, P(pr1|tr0)
• True 0, Predicted 0: true negative (TN), a correct rejection, P(pr0|tr0)
ROC Plot and ROC Area
• Receiver Operating Characteristic
• Developed in WWII to statistically model false positive and false negative detections by radar operators
• Better statistical foundations than most other measures
• Standard measure in medicine and biology
• Becoming more popular in ML
• Sweep the threshold and plot TPR vs. FPR (Sensitivity vs. 1 - Specificity, i.e. P(pred true | true) vs. P(pred true | false)); a code sketch of this sweep follows the ROC figure below
• Sensitivity = a / (a + b) = Recall = the LIFT numerator
• 1 - Specificity = 1 - d / (c + d)
[Figure: ROC plot; the diagonal line corresponds to random prediction.]
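A sketch of the threshold sweep described above, assuming 0/1 labels y and scores p as NumPy arrays; it returns the (FPR, TPR) points and the area under the curve via the trapezoid rule.

```python
import numpy as np

def roc_curve_and_area(y, p):
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(p))[::-1]))
    pos, neg = np.sum(y == 1), np.sum(y == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = p >= t
        tpr.append(np.sum(pred & (y == 1)) / pos)  # sensitivity = a / (a + b)
        fpr.append(np.sum(pred & (y == 0)) / neg)  # 1 - specificity = 1 - d / (c + d)
    return np.array(fpr), np.array(tpr), np.trapz(tpr, fpr)  # area 0.5 = diagonal (random)
```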
Calibration
• Good calibration: if 1000 x's have pred(x) = 0.2, about 200 of them should be positive
Calibration
• A model can be accurate but poorly calibrated (a good threshold can be found even with uncalibrated probabilities)
• A model can have a good ROC but be poorly calibrated: ROC is insensitive to scaling/stretching; only the ordering has to be correct, not the probabilities themselves
• A model can have very high variance, yet be well calibrated
• A model can be stupid, yet be well calibrated
• Calibration is a real oddball
Measuring Calibration
• Bucket method: divide the predicted probabilities into buckets (e.g. ten buckets spanning [0, 1])
• In each bucket, measure the observed c-section rate and the predicted c-section rate (the average of the predicted probabilities)
• If the observed rate is similar to the predicted rate, calibration is good in that bucket
[Figure: histogram of predictions in ten buckets with midpoints 0.05, 0.15, …, 0.95 over the interval 0.0 to 1.0.]
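A sketch of the bucket method above with ten fixed-width buckets; the CAL measure used elsewhere in the study is defined somewhat differently, so this is only an illustration. y holds 0/1 outcomes and p holds predicted probabilities (NumPy assumed).

```python
import numpy as np

def calibration_buckets(y, p, n_buckets=10):
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p < hi) if hi < 1.0 else (p >= lo) & (p <= hi)
        if not mask.any():
            continue
        observed = y[mask].mean()     # observed rate in the bucket (e.g. c-section rate)
        predicted = p[mask].mean()    # average predicted probability in the bucket
        report.append((lo, hi, int(mask.sum()), observed, predicted))
    return report                     # good calibration: observed close to predicted everywhere
```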
Base-Level Learning Methods
• Decision trees
• K-nearest neighbor
• Neural nets
• SVMs
• Bagged decision trees
• Boosted decision trees
• Boosted stumps
• Each optimizes different things
• Each is best in different regimes
• Each algorithm has many variations and free parameters
• Generate about 2000 models on each test problem
Data Sets
• 7 binary classification data sets:
  • Adult
  • Cover Type
  • Letter.p1 (balanced)
  • Letter.p2 (unbalanced)
  • Pneumonia (University of Pittsburgh)
  • Hyper Spectral (NASA Goddard Space Center)
  • Particle Physics (Stanford Linear Accelerator)
• 4k train sets
• Large final test sets (usually 20k)
Massive Empirical Comparison
7 base-level learning methods
× 100s of parameter settings per method
= ~2000 models per problem
× 7 test problems
= 14,000 models
× 10 performance metrics
= 140,000 model performance evaluations
Scaling, Ranking, and Normalizing
• Problem:
  • for some metrics, 1.00 is best (e.g. ACC)
  • for some metrics, 0.00 is best (e.g. RMS)
  • for some metrics, the baseline is 0.50 (e.g. AUC)
  • for some problems/metrics, 0.60 is excellent performance
  • for some problems/metrics, 0.99 is poor performance
• Solution 1: Normalized scores (sketched below):
  • baseline performance => 0.00
  • best observed performance => 1.00 (a proxy for Bayes optimal)
  • puts all metrics on an equal footing
• Solution 2: Scale by standard deviation
• Solution 3: Rank correlation
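A sketch of Solution 1, assuming we know a metric's baseline value and the best observed performance on a problem; the function name and arguments are illustrative.

```python
def normalized_score(raw, baseline, best, higher_is_better=True):
    """Map baseline performance to 0.0 and the best observed performance to 1.0."""
    if not higher_is_better:                    # e.g. RMS or cross-entropy: flip the sign
        raw, baseline, best = -raw, -baseline, -best
    return (raw - baseline) / (best - baseline)

# Example: an AUC of 0.85 with baseline 0.5 and best observed 0.95
# gives normalized_score(0.85, 0.5, 0.95) == 0.777...
```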
Multi-Dimensional Scaling
• Find a low-dimensional embedding of the 10 × 14,000 (metrics × models) performance data
• The 10 metrics span a 2-5 dimensional subspace
Multi-Dimensional Scaling
• Look at 2-D MDS plots:
  • Scaled by standard deviation
  • Normalized scores
  • MDS of rank correlations
  • MDS on each problem individually
  • MDS averaged across all problems
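A sketch of one of these variants (the rank-correlation one), assuming SciPy, scikit-learn, and a score matrix S with one row per metric and one column per model; the exact distance used in the talk may differ, so treat this as illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

def mds_of_metrics(S, n_components=2, seed=0):
    corr, _ = spearmanr(S.T)                    # metric-by-metric rank correlations
    dist = 1.0 - corr                           # similar metrics -> small distance
    mds = MDS(n_components=n_components, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dist)              # one 2-D point per metric, ready to plot
```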
[Figure: 2-D multi-dimensional scaling plots under normalized-score scaling and rank-correlation distance.]
[Figure: per-problem 2-D MDS plots for Adult, Covertype, Hyper-Spectral, Letter, Medis, and SLAC.]
Correlation Analysis
• 2000 performances for each metric on each problem
• Compute correlations between all pairs of metrics: 10 metrics => 45 pairwise correlations
• Average the correlations over the 7 test problems
• Both standard correlation and rank correlation were computed; rank correlations are presented here (a sketch of the computation follows)
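A sketch of the pairwise rank-correlation computation, assuming SciPy and a list containing one (models × metrics) score array per problem; names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlations(scores_per_problem):
    mats = []
    for S in scores_per_problem:      # S: (n_models x n_metrics) performances for one problem
        corr, _ = spearmanr(S)        # 10 metrics -> 45 distinct pairwise correlations
        mats.append(corr)
    return np.mean(mats, axis=0)      # correlation matrix averaged over the 7 problems
```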
Rank Correlations
• The correlation analysis is consistent with the MDS analysis
• Ordering metrics have high correlations with each other
• Within each metric class, ACC, AUC, and RMS have the best correlations to the other metrics
• RMS has good correlation to the other metrics
• SAR has the best correlation to the other metrics
Summary
• The 10 metrics span a 2-5 dimensional subspace
• Results are consistent across problems and scalings
• Ordering metrics cluster: AUC ~ APR ~ BEP
• CAL is far from the ordering metrics; CAL is nearest to RMS/MXE
• RMS ~ MXE, but RMS is much more centrally located
• The threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
• Lift behaves more like an ordering metric than a threshold metric
• Old friends ACC, AUC, and RMS are the most representative
• The new SAR metric is good, but not much better than RMS
New Resources
• Want to borrow 14,000 models?
  • margin analysis
  • comparison to a new algorithm X
  • …
• PERF code: software that calculates ~2 dozen performance metrics:
  • Accuracy (at different thresholds)
  • ROC area and ROC plots
  • Precision and recall plots
  • Break-even point, F-score, average precision
  • Squared error
  • Cross-entropy
  • Lift
  • …
• Currently, most metrics are for Boolean classification problems
• We are willing to add new metrics and new capabilities
• Available at: http://www.cs.cornell.edu/~caruana
Future/Related Work
• An ensemble method that optimizes any metric (ICML'04)
• Getting good probabilities from boosted trees (AISTATS'05)
• A comparison of learning algorithms across metrics (ICML'06)
• This is a first step in analyzing different performance metrics
• Develop new metrics with better properties:
  • SAR is a good general-purpose metric
  • Does optimizing to SAR yield better models?
  • but RMS is nearly as good
  • attempts to make SAR better did not help much
• Extend to multi-class or hierarchical problems, where evaluating performance is more difficult