Presentation Transcript


  1. Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation. Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu. Dec. 21, 2006

  2. What this Paper Offers • When one probability estimation tree (PET) should be preferred over another • Many important and previously unrevealed properties of PETs • A practical guide for choosing the most appropriate PET algorithm

  3. Why Probability Estimation? - Theoretical Necessity • Statistical supervised learning: a learning set LS drawn from an unknown distribution P(X,Y), a loss/cost function L, and a learning algorithm that produces a model f(x,θ) approximating y = F(x) • Bayesian decision rule: under 0-1 loss, predict argmax_y P(y|x); under a cost-sensitive loss, predict the class minimizing the expected cost Σ_y L(prediction, y) P(y|x) (sketched below) • Challenge: P(x,y) is unknown, and so is P(y|x)!
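
To make the role of P(y|x) concrete, here is a minimal numerical sketch of the Bayesian decision rule under 0-1 and cost-sensitive loss (not from the paper; the posteriors and the cost matrix below are invented for illustration):

```python
import numpy as np

# Hypothetical posterior estimates P(y|x) for three test points
# (columns: class 0, class 1).
posterior = np.array([[0.9, 0.1],
                      [0.6, 0.4],
                      [0.2, 0.8]])

# 0-1 loss: predict the most probable class.
pred_01 = posterior.argmax(axis=1)

# Cost-sensitive loss: L[i, j] = cost of predicting class i when the truth is j.
# Here a miss (predicting 0 when the truth is 1) is five times more costly.
L = np.array([[0.0, 5.0],
              [1.0, 0.0]])

# Expected cost of decision i is sum_j L[i, j] * P(j|x); pick the minimizer.
expected_cost = posterior @ L.T
pred_cost = expected_cost.argmin(axis=1)

print(pred_01)    # [0 0 1]
print(pred_cost)  # [0 1 1] -- the 0.6/0.4 point flips under the asymmetric cost
```

Either rule needs P(y|x); since the true distribution is unknown, the question becomes how well a learner can estimate it.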

  4. Why Probability Estimation? - Practical Necessity • Direct estimation of probability is needed for decision threshold determination, non-static or skewed distributions, and unequal losses (Yang, ICDM'05) • Example domains: ozone level prediction, medical diagnosis, direct marketing

  5. Posterior Probability Estimation • Parametric methods: assume the true, unknown distribution follows a "particular form"; fit via maximum likelihood estimation; e.g. Naïve Bayes, logistic regression • Non-parametric approaches: a rather unbiased, flexible and convenient solution; probabilities are calculated directly without making any distributional assumption; e.g. decision trees, nearest neighbors

  6. PETs - Probabilistic View of Decision Trees • Class-membership probabilities P(y|x,θ) estimated from the class frequencies in the leaf that x falls into, e.g. in C4.5, CART • These estimates serve as confidences in the predicted labels • They can be appropriately thresholded for classification w.r.t. different loss functions • The dependence of P(y|x,θ) on the tree θ is non-trivial

  7. Problems of Traditional PETs • Probability estimates obtained through raw frequencies tend to be too close to the extremes of 0 and 1 • Additional inaccuracies result from the small number of examples within a leaf • The same probability is assigned to the entire region of space defined by a given leaf • Algorithms proposed to address these issues: C4.4 (Provost, 03), CFT (Ling, 03), BPET (Breiman, 96), RDT (Fan, 03); the Laplace-correction idea behind C4.4 is sketched below
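
As a concrete illustration of the first two problems and of the Laplace-style correction behind C4.4, here is a small sketch (the leaf counts are invented):

```python
def leaf_probability(pos, neg, laplace=True, num_classes=2):
    """Estimate P(y = + | leaf) from the class counts inside a leaf.

    The raw frequency pos / (pos + neg) hits exactly 0 or 1 for any pure
    leaf, no matter how few examples it holds; the Laplace correction
    (pos + 1) / (pos + neg + num_classes) pulls small-sample estimates
    back toward 1/2.
    """
    if laplace:
        return (pos + 1) / (pos + neg + num_classes)
    return pos / (pos + neg)

# A pure leaf containing only two examples: the raw estimate is a hard 1.0,
# while the smoothed estimate is a more cautious 0.75.
print(leaf_probability(2, 0, laplace=False))  # 1.0
print(leaf_probability(2, 0, laplace=True))   # 0.75
```

The third problem, one constant estimate for the whole leaf region, is what the aggregation and ensemble approaches (CFT, BPET, RDT) address.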

  8. Popular PET Algorithms • Which one should be chosen? • What performance should be expected? • Why should one PET be preferred over another?

  9. Contributions • A large-scale learning curve study using multiple evaluation metrics • The preference of a PET is tied to the signal-noise separability of the dataset • Many important and previously unrevealed properties of PETs, e.g.: in ensembles, RDT is preferable on low signal-separability datasets, while BPET is favorable when the signal separability is high • A practical guide for choosing the most appropriate PET algorithm

  10. Analytical Tool #1: AUC - Index of Signal-Noise Separability • Signal-noise separability: the correct identification of the information of interest amid other noise factors that may interfere with this identification • A good analogy for the two different populations present in every learning domain with uncertainty • A synthetic scenario - tumor diagnosis: tumor = signal present, no tumor = signal absent; based on a yes/no decision: P(yes|tumor) = hit (TP), P(yes|no tumor) = false alarm (FP), P(no|tumor) = miss (FN), P(no|no tumor) = correct reject (TN)

  11. Analytical Tool #1: AUC - Index of Signal-Noise Separability • An illustration (figure: overlapping densities f(x|signal) and f(x|noise) with a movable decision criterion, and the corresponding ROC curves in TPR/FPR space for a low-separability and a high-separability pair) • AUC: an index for the separability of signal from noise • As the decision criterion moves, the relative areas of the four outcomes (hit, miss, false alarm, correct reject) vary, but the separation of the two distributions does not! • Domains can have a high or low degree of signal separability: high = deterministic / little noise; low = stochastic / noisy (see the simulation sketch below)
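
A small simulation of the same idea, using Gaussian signal and noise densities as stand-ins for the figure (scikit-learn assumed): the AUC of the raw score reflects how far apart the two populations sit, independent of where the decision criterion is placed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000

def separability_auc(signal_mean, noise_mean, sd=1.0):
    """AUC of the raw score x for telling f(x|signal) from f(x|noise)."""
    scores = np.concatenate([rng.normal(signal_mean, sd, n),
                             rng.normal(noise_mean, sd, n)])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    return roc_auc_score(labels, scores)

# Widely separated densities (high separability) vs. heavily overlapping
# ones (low separability): no threshold needs to be chosen for the AUC.
print(round(separability_auc(3.0, 0.0), 3))  # close to 1.0
print(round(separability_auc(0.5, 0.0), 3))  # much closer to 0.5
```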

  12. Analytical Tool #2: Learning Curves • Instead of cross-validation or a training-test split at one fixed data set size • Measure the generalization performance of different models as a function of the size of the training set • Correlations between performance metrics and training-set sizes can be observed and possibly generalized across different data sets (a minimal loop is sketched below)
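
A minimal version of such a learning-curve loop, assuming scikit-learn and a synthetic dataset as a stand-in for the benchmark domains (the model, the sizes and the AUC metric are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the study each benchmark dataset plays this role.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Generalization performance as a function of training-set size,
# rather than a single estimate at one fixed size.
for size in [50, 100, 200, 400, 800, 1600]:
    model = DecisionTreeClassifier(random_state=0).fit(X_pool[:size], y_pool[:size])
    prob = model.predict_proba(X_test)[:, 1]
    print(size, round(roc_auc_score(y_test, prob), 3))
```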

  13. Analytical Tool #3: Multiple Evaluation Metrics • Area Under the ROC Curve (AUC): summarizes the "ranking capability" of a learning algorithm in ROC space • MSE (Brier Score): a proper assessment of the "accuracy" of probability estimates; admits a calibration-refinement decomposition, where calibration measures the absolute precision of the probability estimates and refinement indicates how confident the estimator is in its estimates; visualization tools: reliability plots and sharpness graphs • Error Rate: an inappropriate criterion for evaluating probability estimates (decomposition sketched below)
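
A sketch of the calibration-refinement decomposition of the Brier score, grouping test points by each unique predicted value so that the decomposition is exact (the toy predictions below are invented):

```python
import numpy as np

def brier_decomposition(prob, y):
    """MSE of probability estimates split into calibration + refinement.

    Calibration: how far each predicted value sits from the observed
    frequency of positives among the points that received that value.
    Refinement: how confident (close to 0 or 1) those observed
    frequencies are. The two terms sum exactly to the Brier score.
    """
    prob, y = np.asarray(prob, float), np.asarray(y, float)
    mse = np.mean((prob - y) ** 2)
    calibration = refinement = 0.0
    for p in np.unique(prob):
        mask = prob == p
        o = y[mask].mean()   # observed frequency of positives in this group
        w = mask.mean()      # group weight n_k / N
        calibration += w * (p - o) ** 2
        refinement += w * o * (1 - o)
    return mse, calibration, refinement

# Well-ranked but over-confident estimates on eight toy points.
prob = np.array([0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1])
y    = np.array([1,   1,   1,   0,   0,   0,   0,   1  ])
print(brier_decomposition(prob, y))  # (0.21, 0.0225, 0.1875): mse == cal + ref
```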

  14. Experiment Results

  15. Conjectures in Summary • RDT and CFT are better on AUC • RDT is preferable on low signal-separability datasets, while BPET is favorable on high signal-separability datasets • High-separability categorical datasets with limited feature values hurt RDT • Among single trees, CFT is preferable on low signal-separability datasets

  16. Behind the Scenes - Why are RDT and CFT better on AUC? • Superior capability to generate unique probability estimates • AUC calculation: trapezoidal integration (Fawcett, 03; Hand, 01) • For a larger AUC, P(y|x,θ) should vary from one test point to another • The number of unique probabilities generated follows the order RDT > BPET > CFT > C4.4 > C4.5 (see the sketch below)
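
To see how few unique probability values cap the achievable AUC, here is a toy comparison (invented scores, trapezoidal AUC via scikit-learn) between a fine-grained scorer and the same scorer collapsed onto a handful of values:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)

# A fine-grained score that ranks positives above negatives fairly well,
# and the same score rounded onto just three unique values, the way a
# small single tree assigns one probability per leaf.
fine = np.clip(0.7 * y + rng.normal(0.0, 0.2, y.size), 0.0, 1.0)
coarse = np.round(fine * 2) / 2

for name, score in [("fine", fine), ("coarse", coarse)]:
    fpr, tpr, _ = roc_curve(y, score)
    print(name, "unique values:", np.unique(score).size,
          "trapezoidal AUC:", round(auc(fpr, tpr), 3))
```

Collapsing the scores creates positive-negative ties that earn only half credit in the trapezoidal integral, which is why estimators producing many distinct probabilities tend to rank better.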

  17. Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets? • The reasons: RDT discards any criterion for optimal feature selection and is more like a structure for data summarization • When the signal separability is low, this property protects RDT from identifying noise as signal or overfitting on noise, which is very likely to be caused by the massive search and optimization adopted by BPET • RDT provides an average of the probability estimates, which approaches the mean of the true probabilistic values as more individual trees are added (averaging sketched below)
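
A rough sketch of that averaging effect. This uses scikit-learn's fully random splitter (splitter="random", max_features=1) as a loose stand-in for RDT's label-blind tree construction, not Fan's actual RDT code, and the label noise in the synthetic data merely mimics a low-separability domain:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise, imitating a low signal-separability domain.
X, y = make_classification(n_samples=2000, n_features=15, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Average the posterior estimates of many randomized trees; the averaged
# estimate typically steadies (and its Brier score improves) as trees are added.
probs = []
for seed in range(30):
    tree = DecisionTreeClassifier(splitter="random", max_features=1,
                                  max_depth=8, random_state=seed)
    probs.append(tree.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    if seed + 1 in (1, 5, 10, 30):
        avg = np.mean(probs, axis=0)
        print(seed + 1, "trees, Brier score:",
              round(float(np.mean((avg - y_te) ** 2)), 3))
```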

  18. Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets? • The evidence (I) - Spect and Sonar, low signal-separability domains

  19. Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets? • The evidence (II) - Pima, a low signal-separability domain (figures for RDT and BPET)

  20. Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal-separability datasets? • The evidence (III) - Spam, a high signal-separability domain (figures for RDT and BPET)

  21. Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT? • The observations - Tic-tac-toe and Chess

  22. Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT? • The reason: such datasets tend to restrict the degree of diversity that RDT's random feature selection can explore • RDT's random feature selection mechanism: a categorical feature may be chosen only once along a path, while a continuous feature may be chosen multiple times, but with a different splitting value each time (see the sketch below)
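
An illustrative sketch of that split-selection rule (hypothetical code, not the authors' implementation), showing how a categorical feature is consumed once per path while a continuous feature can reappear with a fresh random threshold:

```python
import random

def pick_random_split(features, used_on_path):
    """Label-blind random split choice at one node (illustration only).

    A categorical feature may appear at most once along a root-to-leaf
    path, while a continuous feature may be reused with a different
    random threshold each time; with only a few limited-value categorical
    features there is little room left for diverse random trees.
    """
    candidates = [f for f in features
                  if f["type"] == "continuous" or f["name"] not in used_on_path]
    if not candidates:
        return None  # nothing left to split on -> this node becomes a leaf
    f = random.choice(candidates)
    if f["type"] == "continuous":
        return f["name"], random.uniform(*f["range"])   # fresh random threshold
    return f["name"], None                              # categorical: branch on all values

features = [{"name": "color", "type": "categorical", "values": ["r", "g", "b"]},
            {"name": "size",  "type": "continuous",  "range": (0.0, 10.0)}]
print(pick_random_split(features, used_on_path={"color"}))  # forced onto "size"
```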

  23. Behind the Scenes - Why is CFT preferable on low signal-separability datasets? • The reasons: • On low signal-separability domains, CFT's good performance benefits from its probability aggregation mechanism, which can rectify errors introduced into the probability estimates by attribute noise • On high signal-separability domains, aggregating the estimated probabilities from other, irrelevant leaves adversely affects the final probability estimates

  24. Behind the Scenes - Why is CFT preferable on low signal-separability datasets? • The evidence (I) - Spect and Pima, low signal-separability domains

  25. Behind the Scenes - Why is CFT preferable on low signal-separability datasets? • The evidence (II) - Liver, a low signal-separability domain (figures for CFT and C4.4)

  26. Choosing the Appropriate PET Algorithm Given a New Problem • Step 1: estimate the signal-noise separability of the given dataset via the AUC score obtained with RDT or BPET; an AUC >= 0.9 indicates high signal-noise separability, an AUC < 0.9 indicates low separability • Step 2: decide between an ensemble and a single tree, taking into account the feature types and value characteristics and the metric of interest (AUC, MSE, error rate) • Recommended choices (as read from the flowchart): low separability - RDT for ensembles, CFT among single trees; high separability - BPET (or RDT) for ensembles, with BPET preferred when the features are categorical with limited values, and C4.4 or C4.5 among single trees (a code sketch of this guide follows)
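
A rough distillation of that flowchart into code. The 0.9 AUC threshold is taken from the slide, but the branch details are simplified from the stated conjectures, so treat this as a sketch of the guide rather than the authors' exact procedure:

```python
def recommend_pet(auc_estimate, ensemble=True, limited_value_categorical=False):
    """Return a PET suggestion from a rough separability probe.

    auc_estimate: AUC obtained on the dataset with RDT or BPET, used here
    as a proxy for signal-noise separability (>= 0.9 taken as "high").
    """
    high_separability = auc_estimate >= 0.9
    if ensemble:
        if not high_separability:
            return "RDT"      # robust when noise could be mistaken for signal
        if limited_value_categorical:
            return "BPET"     # limited-value categorical data restricts RDT's diversity
        return "BPET (RDT also competitive)"
    # Single trees
    return "CFT" if not high_separability else "C4.4 or C4.5"

print(recommend_pet(0.82, ensemble=True))                                   # RDT
print(recommend_pet(0.95, ensemble=True, limited_value_categorical=True))   # BPET
print(recommend_pet(0.82, ensemble=False))                                  # CFT
```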

  27. Summary • AUC serves as an index of signal-noise separability • The preference of a PET under multiple evaluation metrics is explained by the "signal-noise separability" of the dataset and other observable statistics • Many important and previously unrevealed properties of PETs are analyzed • A practical guide for choosing the most appropriate PET algorithm

  28. Thank you! Questions?
