Spooky Stuff in Metric Space


Presentation Transcript


  1. Spooky Stuff in Metric Space

  2. Spooky Stuff: Data Mining in Metric Space. Rich Caruana, Alex Niculescu, Cornell University

  3. Motivation #1

  4. Motivation #1: Pneumonia Risk Prediction

  5. Motivation #1: Many Learning Algorithms
  • Neural nets
  • Logistic regression
  • Linear perceptron
  • K-nearest neighbor
  • Decision trees
  • ILP (Inductive Logic Programming)
  • SVMs (Support Vector Machines)
  • Bagging X
  • Boosting X
  • Rule learners (CN2, …)
  • Ripper
  • Random Forests (forests of decision trees)
  • Gaussian Processes
  • Bayes Nets
  • …
  • No single learning method (or small set of methods) dominates the others

  6. Motivation #2

  7. Motivation #2: SLAC B/Bbar
  • Particle accelerator generates B/Bbar particles
  • Use machine learning to classify tracks as B or Bbar
  • Domain-specific performance measure: SLQ-Score
  • 5% increase in SLQ can save $1M in accelerator time
  • SLAC researchers tried various DM/ML methods
    • Good, but not great, SLQ performance
  • We tried standard methods, got similar results
  • We studied the SLQ metric:
    • similar to probability calibration
    • tried bagged probabilistic decision trees (good on C-Section)

  8. Motivation #2: Bagged Probabilistic Trees
  • Draw N bootstrap samples of the data
  • Train a tree on each sample ==> N trees
  • Final prediction = average prediction of the N trees
  • Example from the slide: (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24
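A minimal sketch of this bagging procedure, assuming scikit-learn decision trees and a synthetic dataset (the data, tree settings, and N = 100 are illustrative placeholders, not the SLAC setup):

```python
# Sketch of bagged probabilistic decision trees (illustrative, not the SLAC code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, y_train, X_test = X[:3000], y[:3000], X[3000:]

n_trees = 100
per_tree_probs = []
for _ in range(n_trees):
    # Draw a bootstrap sample: sample with replacement, same size as the train set.
    idx = rng.randint(0, len(X_train), len(X_train))
    tree = DecisionTreeClassifier()          # a fully grown, probabilistic tree
    tree.fit(X_train[idx], y_train[idx])
    # Leaf class frequencies serve as probability estimates.
    per_tree_probs.append(tree.predict_proba(X_test)[:, 1])

# Final prediction = average of the N trees' probability estimates.
bagged_probs = np.mean(per_tree_probs, axis=0)
```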

  9. Motivation #2: Improves Calibration by an Order of Magnitude [calibration plots: a single tree shows poor calibration; 100 bagged trees show excellent calibration]

  10. Motivation #2: Significantly Improves SLQ [plot comparing SLQ for 100 bagged trees vs. a single tree]

  11. Motivation #2 • Can we automate this analysis of performance metrics so that it’s easier to recognize which metrics are similar to each other?

  12. Motivation #3

  13. Motivation #3

  14. Scary Stuff
  • In an ideal world:
    • Learn a model that predicts the correct conditional probabilities (Bayes optimal)
    • Such a model yields optimal performance on any reasonable metric
  • In the real world:
    • Finite data
    • 0/1 targets instead of conditional probabilities
    • Hard to learn this ideal model
    • We don't have good metrics for recognizing the ideal model
    • The ideal model isn't always needed
  • In practice:
    • Learning is done with many different metrics: ACC, AUC, CXE, RMS, …
    • Each metric represents different tradeoffs
    • Because of this, it is usually important to optimize to the appropriate metric

  15. Scary Stuff

  16. Scary Stuff

  17. In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study
  • Goals:
    • Discover relationships between the performance metrics
    • Are the metrics really that different?
    • If you optimize to metric X, do you also get good performance on metric Y?
    • If you need good performance on metric Y, which metric X should you optimize to?
    • Which metrics are more/less robust?
    • Can we design new, better metrics?

  18. 10 Binary Classification Performance Metrics
  • Threshold metrics:
    • Accuracy
    • F-Score
    • Lift
  • Ordering/ranking metrics:
    • ROC Area
    • Average Precision
    • Precision/Recall Break-Even Point
  • Probability metrics:
    • Root-Mean-Squared Error
    • Cross-Entropy
    • Probability Calibration
  • SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
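As a rough illustration, SAR can be computed directly from 0/1 labels and predicted probabilities (assumed inputs), reading "Squared Error" in the formula above as root-mean-squared error and borrowing scikit-learn only for the ROC area:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sar(y, p, threshold=0.5):
    """SAR = ((1 - RMS) + Accuracy + ROC Area) / 3, per the slide's formula.
    y: 0/1 labels, p: predicted probabilities (illustrative inputs)."""
    acc = np.mean((p >= threshold) == y)   # threshold metric
    auc = roc_auc_score(y, p)              # ordering metric
    rms = np.sqrt(np.mean((p - y) ** 2))   # probability metric
    return ((1.0 - rms) + acc + auc) / 3.0
```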

  19. Accuracy
  Confusion matrix at a given threshold:
                   Predicted 1    Predicted 0
      True 1            a              b
      True 0            c              d
  Cells a and d are correct predictions; b and c are incorrect.
  accuracy = (a + d) / (a + b + c + d)
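A tiny helper mirroring the formula, with cell names taken from the table above (the function itself is made up for illustration):

```python
def accuracy_from_confusion(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives,
    d = true negatives, following the slide's 2x2 table."""
    return (a + d) / (a + b + c + d)

# e.g. accuracy_from_confusion(40, 10, 20, 30) == 0.7
```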

  20. Lift
  • Not interested in accuracy on the entire dataset
  • Want accurate predictions for 5%, 10%, or 20% of the dataset
  • Don't care about the remaining 95%, 90%, 80%, respectively
  • Typical application: marketing
  • Lift: how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)

  21. Lift
  The same confusion matrix cells, at a chosen threshold:
                   Predicted 1    Predicted 0
      True 1            a              b
      True 0            c              d

  22. [Example lift chart] lift = 3.5 if mailings are sent to 20% of the customers
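A sketch of lift at a fixed mailing fraction, matching the definition on slide 20 and the 3.5-at-20% reading of the chart above (the labels, probabilities, and function name are illustrative):

```python
import numpy as np

def lift_at_fraction(y, p, fraction=0.20):
    """Lift = positive rate among the top `fraction` of cases (ranked by
    predicted probability) divided by the overall positive rate."""
    y, p = np.asarray(y), np.asarray(p)
    n_top = max(1, int(round(fraction * len(y))))
    top = np.argsort(-p)[:n_top]        # cases predicted most likely positive
    precision_at_top = y[top].mean()    # responders captured by the mailing
    base_rate = y.mean()                # responders under random selection
    return precision_at_top / base_rate

# A return value of 3.5 at fraction=0.20 would mean the top 20% of ranked
# customers contain 3.5x as many responders as a random 20% sample.
```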

  23. Precision/Recall, F-Score, Break-Even Point. The F-score is the harmonic average of precision and recall.
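A sketch of precision, recall, and the F-score at a fixed threshold; the break-even point is the value where precision equals recall as the threshold is swept (names are illustrative):

```python
import numpy as np

def precision_recall_f(y, p, threshold=0.5):
    y, pred = np.asarray(y), (np.asarray(p) >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score: harmonic average of precision and recall.
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```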

  24. [Precision/recall plots annotated with regions of better and worse performance]

  25. The confusion matrix cells, two ways:
                   Predicted 1               Predicted 0
      True 1    true positive (TP)        false negative (FN)
      True 0    false positive (FP)       true negative (TN)

                   Predicted 1               Predicted 0
      True 1    hits, P(pr1|tr1)          misses, P(pr0|tr1)
      True 0    false alarms, P(pr1|tr0)  correct rejections, P(pr0|tr0)

  26. ROC Plot and ROC Area
  • Receiver Operating Characteristic
  • Developed in WWII to statistically model false positive and false negative detections by radar operators
  • Better statistical foundations than most other measures
  • Standard measure in medicine and biology
  • Becoming more popular in ML
  • Sweep the threshold and plot:
    • TPR vs. FPR
    • Sensitivity vs. 1-Specificity
    • P(true|true) vs. P(true|false)
  • Sensitivity = a/(a+b) = Recall = LIFT numerator
  • 1 - Specificity = 1 - d/(c+d)
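A sketch of the threshold sweep described above: collect (FPR, TPR) pairs at every threshold and integrate with the trapezoid rule to obtain the ROC area (illustrative code, not the PERF implementation):

```python
import numpy as np

def roc_curve_and_area(y, p):
    """Sweep the threshold over every predicted value, recording
    TPR (sensitivity) vs. FPR (1 - specificity) at each setting."""
    y, p = np.asarray(y), np.asarray(p)
    pos, neg = (y == 1).sum(), (y == 0).sum()
    fpr, tpr = [], []
    for t in np.sort(np.unique(p))[::-1]:           # high threshold -> low
        pred = p >= t
        tpr.append((pred & (y == 1)).sum() / pos)   # sensitivity = a/(a+b)
        fpr.append((pred & (y == 0)).sum() / neg)   # 1 - specificity = 1 - d/(c+d)
    fpr, tpr = [0.0] + fpr + [1.0], [0.0] + tpr + [1.0]
    return fpr, tpr, np.trapz(tpr, fpr)             # trapezoid-rule ROC area
```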

  27. [ROC plot: the diagonal line is random prediction]

  28. Calibration • Good calibration: • If 1000 x’s have pred(x) = 0.2, ~200 should be positive

  29. Calibration
  • A model can be accurate but poorly calibrated
    • a good threshold with uncalibrated probabilities
  • A model can have a good ROC but be poorly calibrated
    • ROC is insensitive to scaling/stretching
    • only the ordering has to be correct, not the probabilities themselves
  • A model can have very high variance, but be well calibrated
  • A model can be stupid, but be well calibrated
  • Calibration is a real oddball

  30. Measuring Calibration
  • Bucket method: divide the prediction range into ten buckets (0.0-0.1, 0.1-0.2, …, 0.9-1.0)
  • In each bucket:
    • measure the observed c-section rate
    • measure the predicted c-section rate (the average of the predicted probabilities)
    • if the observed rate is similar to the predicted rate => good calibration in that bucket
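A sketch of the bucket method, assuming the ten equal-width probability buckets shown on the slide (names are illustrative):

```python
import numpy as np

def bucket_calibration(y, p, n_buckets=10):
    """Compare observed positive rate with mean predicted probability inside
    each bucket [0.0-0.1), [0.1-0.2), ..., [0.9-1.0]."""
    y, p = np.asarray(y, dtype=float), np.asarray(p)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (p <= hi) if hi == 1.0 else (p < hi)   # last bucket is closed
        in_bucket = (p >= lo) & upper
        if in_bucket.any():
            rows.append((lo, hi, p[in_bucket].mean(), y[in_bucket].mean()))
    # Good calibration: predicted and observed rates agree in every bucket.
    return rows
```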

  31. Calibration Plot

  32. Experiments

  33. Base-Level Learning Methods
  • Decision trees
  • K-nearest neighbor
  • Neural nets
  • SVMs
  • Bagged Decision Trees
  • Boosted Decision Trees
  • Boosted Stumps
  • Each optimizes different things
  • Each is best in different regimes
  • Each algorithm has many variations and free parameters
  • Generate about 2000 models on each test problem

  34. Data Sets
  • 7 binary classification data sets:
    • Adult
    • Cover Type
    • Letter.p1 (balanced)
    • Letter.p2 (unbalanced)
    • Pneumonia (University of Pittsburgh)
    • Hyper Spectral (NASA Goddard Space Center)
    • Particle Physics (Stanford Linear Accelerator)
  • 4k train sets
  • Large final test sets (usually 20k)

  35. Massive Empirical Comparison
  7 base-level learning methods
  × 100's of parameter settings per method = ~2000 models per problem
  × 7 test problems = 14,000 models
  × 10 performance metrics = 140,000 model performance evaluations

  36. COVTYPE: Calibration vs. Accuracy

  37. Multi Dimensional Scaling

  38. Scaling, Ranking, and Normalizing
  • Problem:
    • for some metrics, 1.00 is best (e.g. ACC)
    • for some metrics, 0.00 is best (e.g. RMS)
    • for some metrics, the baseline is 0.50 (e.g. AUC)
    • for some problems/metrics, 0.60 is excellent performance
    • for some problems/metrics, 0.99 is poor performance
  • Solution 1: Normalized scores (see the sketch below):
    • baseline performance => 0.00
    • best observed performance => 1.00 (proxy for Bayes optimal)
    • puts all metrics on an equal footing
  • Solution 2: Scale by standard deviation
  • Solution 3: Rank correlation
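A sketch of the Solution 1 normalization, flipping the sign for metrics where lower is better (the helper and the example numbers are made up):

```python
def normalized_score(perf, baseline, best, lower_is_better=False):
    """Map baseline performance to 0.0 and best observed performance to 1.0
    (best observed is used as a proxy for Bayes-optimal performance)."""
    if lower_is_better:                    # e.g. RMS, cross-entropy
        perf, baseline, best = -perf, -baseline, -best
    return (perf - baseline) / (best - baseline)

# e.g. for ACC: normalized_score(0.83, baseline=0.75, best=0.91) -> 0.5
```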

  39. Multi Dimensional Scaling
  • Find a low-dimensional embedding of the 10 x 14,000 data
  • The 10 metrics span a 2-5 dimensional subspace

  40. Multi Dimensional Scaling
  • Look at 2-D MDS plots:
    • scaled by standard deviation
    • normalized scores
    • MDS of rank correlations
    • MDS on each problem individually
    • MDS averaged across all problems
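A sketch of one variant of this analysis, assuming a models-by-metrics score matrix (all metrics oriented so larger is better): build a rank-correlation distance matrix between the metrics and embed it in 2-D with scikit-learn's MDS (not the authors' code):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

def embed_metrics_2d(scores, random_state=0):
    """scores: (n_models x n_metrics) array, e.g. 14,000 models x 10 metrics.
    Returns one 2-D point per metric."""
    rho, _ = spearmanr(scores)             # pairwise rank correlations (10 x 10)
    dist = 1.0 - rho                       # rank-correlation distance
    mds = MDS(n_components=2, dissimilarity="precomputed",
              random_state=random_state)
    return mds.fit_transform(dist)
```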

  41. 2-D Multi-Dimensional Scaling

  42. 2-D Multi-Dimensional Scaling [panels: Normalized Scores Scaling; Rank-Correlation Distance]

  43. [Per-problem 2-D MDS plots: Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC]

  44. Correlation Analysis
  • 2000 performances for each metric on each problem
  • Correlation between all pairs of metrics
    • 10 metrics => 45 pairwise correlations
  • Average of the correlations over the 7 test problems
  • Both standard correlation and rank correlation
  • Rank correlation is presented here
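A sketch of the pairwise rank-correlation computation, assuming one models-by-metrics score matrix per test problem (names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlation(per_problem_scores):
    """Spearman rank correlation between every pair of metrics
    (45 pairs for 10 metrics), averaged over the test problems."""
    mats = [spearmanr(scores)[0] for scores in per_problem_scores]
    return np.mean(mats, axis=0)           # averaged 10 x 10 correlation matrix
```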

  45. Rank Correlations
  • The correlation analysis is consistent with the MDS analysis
  • The ordering metrics have high correlations with each other
  • ACC, AUC, and RMS have the best correlations of the metrics in their respective classes
  • RMS correlates well with the other metrics
  • SAR has the best correlation with the other metrics

  46. Summary
  • The 10 metrics span a 2-5 dimensional subspace
  • Results are consistent across problems and scalings
  • The ordering metrics cluster: AUC ~ APR ~ BEP
  • CAL is far from the ordering metrics
  • CAL is nearest to RMS/MXE
  • RMS ~ MXE, but RMS is much more centrally located
  • The threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
  • Lift behaves more like an ordering metric than a threshold metric
  • Old friends ACC, AUC, and RMS are the most representative
  • The new SAR metric is good, but not much better than RMS

  47. New Resources
  • Want to borrow 14,000 models?
    • margin analysis
    • comparison to a new algorithm X
    • …
  • PERF code: software that calculates ~2 dozen performance metrics:
    • Accuracy (at different thresholds)
    • ROC Area and ROC plots
    • Precision and Recall plots
    • Break-even point, F-score, Average Precision
    • Squared Error
    • Cross-Entropy
    • Lift
    • …
  • Currently, most metrics are for boolean classification problems
  • We are willing to add new metrics and new capabilities
  • Available at: http://www.cs.cornell.edu/~caruana

  48. Future Work

  49. Future/Related Work
  • An ensemble method that optimizes any metric (ICML'04)
  • Getting good probabilities from boosted trees (AISTATS'05)
  • Comparison of learning algorithms across metrics (ICML'06)
  • This is a first step in analyzing different performance metrics
  • Develop new metrics with better properties
    • SAR is a good general-purpose metric
    • Does optimizing to SAR yield better models?
    • but RMS is nearly as good
    • attempts to make SAR better did not help much
  • Extend to multi-class or hierarchical problems where evaluating performance is more difficult

  50. Thank You.
