Testing the Significance of Attribute Interactions

Aleks Jakulin & Ivan Bratko, Faculty of Computer and Information Science, University of Ljubljana, Slovenia

Overview
• Interactions:
  • The key to understanding many peculiarities in machine learning.
  • Feature importance measures the 2-way interaction between an attribute and the label, but there are interactions of higher orders.
• An information-theoretic view of interactions:
  • Information theory provides a simple “algebra” of interactions, based on summing and subtracting entropy terms (e.g., mutual information).
• Part-to-whole approximations:
  • An interaction is an irreducible dependence. Information-theoretic expressions are model comparisons!
• Significance testing:
  • As with all model comparisons, we can investigate the significance of the model difference.

Example 1: Feature Subset Selection with NBC
The calibration of the naive Bayesian classifier (NBC), measured as the expected likelihood of an instance’s label, first improves and then deteriorates as we add attributes. The optimal number is ~8 attributes. Are the first few attributes important and the rest just noise?

Example 1: Feature Subset Selection with NBC
NO! We sorted the attributes from the worst to the best. It is some of the best attributes that degrade the performance! Why?

Example 2: Spiral/XOR/Parity Problems
Each attribute (x or y) is irrelevant on its own. Together, they make a perfect blue/red classifier.

Interactions
What is going on?
[Diagram: label C connected to attribute A and attribute B. 2-Way Interactions: importance of attribute A (A with C), importance of attribute B (B with C), attribute correlation (A with B).]
3-Way Interaction: What is common to A, B and C together, and cannot be inferred from any subset of attributes.

Quantification: Shannon’s Entropy
[Diagram: two overlapping regions, A and C]
• H(C): Entropy given C’s empirical probability distribution (p = [0.2, 0.8]).
• H(A): Information which came with the knowledge of A.
• H(AC): Joint entropy.
• H(C|A) = H(C) - I(A;C): Conditional entropy. Remaining uncertainty in C after learning A.
• I(A;C) = H(A) + H(C) - H(AC): Mutual information or information gain. How much do A and C have in common?
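The quantities on this slide can be estimated directly from observed values. The following is a minimal sketch, assuming plain Python lists of discrete observations; the function names and the attribute column `A` are illustrative and not taken from the slides. Only the label distribution p = [0.2, 0.8] comes from the slide.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((k / n) * log2(k / n) for k in Counter(values).values())

def mutual_information(a, c):
    """I(A;C) = H(A) + H(C) - H(AC), from paired observations."""
    return entropy(a) + entropy(c) - entropy(list(zip(a, c)))

def conditional_entropy(c, a):
    """H(C|A) = H(C) - I(A;C): uncertainty left in C once A is known."""
    return entropy(c) - mutual_information(a, c)

# Label with the empirical distribution p = [0.2, 0.8] used on the slide.
C = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical attribute column, made up for illustration only.
A = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

label_entropy = entropy(C)
# Share of label entropy "explained" by A, in percent: 100 * I(A;C) / H(C).
explained = 100 * mutual_information(A, C) / label_entropy
```

Working from raw observations rather than from a probability table keeps the sketch short; the joint entropy H(AC) is simply the entropy of the paired column.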

Interaction Information
How informative are A and B together?
I(A;B;C) := I(AB;C) - I(A;C) - I(B;C)
          = I(B;C|A) - I(B;C)
          = I(A;C|B) - I(A;C)
• (Partial) history of independent reinventions:
  • Quastler ’53 (Info. Theory in Biology): measure of specificity
  • McGill ’54 (Psychometrika): interaction information
  • Han ’80 (Information & Control): multiple mutual information
  • Yeung ’91 (IEEE Trans. on Inf. Theory): mutual information
  • Grabisch & Roubens ’99 (I. J. of Game Theory): Banzhaf interaction index
  • Matsuda ’00 (Physical Review E): higher-order mutual inf.
  • Brenner et al. ’00 (Neural Computation): average synergy
  • Demšar ’02 (A thesis in machine learning): relative information gain
  • Bell ’03 (NIPS02, ICA2003): co-information
  • Jakulin ’02: interaction gain
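The definition above can be checked on two toy cases: the XOR problem, where neither attribute helps alone, and a fully redundant pair. This is a small sketch with illustrative helper names (`H`, `I2`, `interaction_information`), not code from the talk.

```python
from collections import Counter
from math import log2

def H(*columns):
    """Joint Shannon entropy (bits) of one or more aligned columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())

def I2(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(XY)."""
    return H(x) + H(y) - H(x, y)

def interaction_information(a, b, c):
    """I(A;B;C) = I(AB;C) - I(A;C) - I(B;C)."""
    i_ab_c = H(a, b) + H(c) - H(a, b, c)
    return i_ab_c - I2(a, c) - I2(b, c)

# XOR: each attribute alone is uninformative, together they fix the label.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
C = [x ^ y for x, y in zip(A, B)]
xor_ii = interaction_information(A, B, C)    # positive: synergy

# Redundancy: the second attribute duplicates the first, and both equal the label.
dup_ii = interaction_information(A, A, A)    # negative: redundancy
```

On these inputs the XOR case yields a positive value (synergy) and the duplicated attribute yields a negative one (redundancy), matching the sign convention used for interaction graphs.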

Applications: Interaction Graphs
CMC domain: the label is the ‘contraceptive method’ used by a couple.
• Information gain, 100% × I(A;C)/H(C): the attribute “explains” 1.98% of label entropy.
• A positive interaction, 100% × I(A;B;C)/H(C): the two attributes are in a synergy; treating them holistically may result in 1.85% extra uncertainty explained.
• A negative interaction, 100% × I(A;B;C)/H(C): the two attributes are slightly redundant; 1.15% of label uncertainty is explained by each of the two attributes.

Interaction as Attribute Proximity
[Figure legend: information gain ranges from uninformative attribute to informative attribute; attribute pairs range from weakly interacting to strongly interacting; cluster “tightness” ranges from loose to tight.]

Part-to-Whole Approximation
Mutual information:
• Whole: P(A,B). Parts: {P(A), P(B)}.
• Approximation:
• Kullback-Leibler divergence as the measure of difference:
• Also applies for predictive accuracy:
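For reference, the standard identities behind this slide can be written out as below. The symbol \hat{P} for the approximation is an assumed notation, and the last line is one common way to express the predictive case (label C predicted from attribute A); it is offered as a plausible reading, not as the slide's own formula.

```latex
% Approximating the whole P(A,B) from its parts P(A) and P(B):
\hat{P}(a,b) = P(a)\,P(b)

% The KL divergence between whole and approximation is the mutual information:
D\bigl(P(A,B)\,\|\,\hat{P}(A,B)\bigr)
  = \sum_{a,b} P(a,b)\,\log\frac{P(a,b)}{P(a)\,P(b)}
  = I(A;B)

% Predictive form: expected divergence between P(C|a) and the prior P(C):
\sum_{a} P(a)\, D\bigl(P(C \mid a)\,\|\,P(C)\bigr) = I(A;C)
```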

Kirkwood Superposition Approximation
It is a closed-form part-to-whole approximation, a special case of Kikuchi and mean-field approximations. The approximation is not normalized, which explains the negative interaction information. It is not optimal (loglinear models beat it).
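The usual form of the Kirkwood superposition approximation for three variables is P_K(a,b,c) = P(a,b) P(a,c) P(b,c) / (P(a) P(b) P(c)). The sketch below assumes that form and a joint distribution stored as a dict keyed by (a, b, c); all function names are illustrative. It shows the two points made on the slide: the approximation need not sum to 1, and the divergence of the true joint from it equals the interaction information, which can therefore be negative.

```python
from collections import Counter
from itertools import product
from math import log2

def marginal(joint, axes):
    """Marginalize a dict {(a, b, c): p} onto the given index positions."""
    out = Counter()
    for key, p in joint.items():
        out[tuple(key[i] for i in axes)] += p
    return out

def kirkwood(joint):
    """P_K(a,b,c) = P(a,b) P(a,c) P(b,c) / (P(a) P(b) P(c))."""
    subsets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
    m = {axes: marginal(joint, axes) for axes in subsets}
    values = [[key[0] for key in m[(i,)]] for i in range(3)]
    approx = {}
    for a, b, c in product(*values):
        numer = m[(0, 1)][(a, b)] * m[(0, 2)][(a, c)] * m[(1, 2)][(b, c)]
        denom = m[(0,)][(a,)] * m[(1,)][(b,)] * m[(2,)][(c,)]
        approx[(a, b, c)] = numer / denom
    return approx

def divergence(p, q):
    """sum_x p(x) log2(p(x) / q(x)); a true KL divergence only if q sums to 1."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# XOR joint: C = A xor B, with A and B uniform and independent.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Fully redundant joint: A = B = C.
redundant = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

xor_total = sum(kirkwood(xor).values())
redundant_total = sum(kirkwood(redundant).values())
xor_ii = divergence(xor, kirkwood(xor))
redundant_ii = divergence(redundant, kirkwood(redundant))
```

In the redundant case the approximation's total mass exceeds 1, which is what allows the divergence to fall below zero.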

Significance Testing
• Tries to answer the question: “When is P much better than P’?”
• It is based on the realization that even the correct probabilistic model P can expect to make an error for a sample of finite size.
• The notion of self-loss captures the distribution of loss of the complex model (“variance”).
• The notion of approximation loss captures the loss caused by using a simpler model (“bias”).
• P is significantly better than P’ when the error made by P’ is greater than the self-loss in at least 95% of cases, that is, when the P-value is at most 0.05.

Test-Bootstrap Protocol
To obtain the self-loss distribution, we perturb the test data, which is a bootstrap sample from the whole data set. As the loss function, we employ KL-divergence. This is VERY similar to assuming that D(P’||P) has a χ² distribution.
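One way to read the protocol is sketched below for the simplest case of two attributes, where the complex model P is the empirical joint and the simple model P’ is the product of marginals. The details are assumptions made for illustration: the self-loss is taken as D(P*||P) for a model P* estimated from a bootstrap resample, the P-value is the fraction of resamples whose self-loss reaches the approximation loss D(P’||P), and all function names, the number of resamples, and the seed are hypothetical.

```python
import random
from collections import Counter
from math import inf, log2

def joint(rows):
    """Empirical joint distribution of (a, b) pairs."""
    n = len(rows)
    return {key: k / n for key, k in Counter(rows).items()}

def independent(rows):
    """Part-to-whole approximation P'(a,b) = P(a) P(b)."""
    n = len(rows)
    count_a = Counter(a for a, _ in rows)
    count_b = Counter(b for _, b in rows)
    return {(a, b): count_a[a] * count_b[b] / n**2
            for a in count_a for b in count_b}

def kl(p, q):
    """D(p||q) in bits; infinite if q gives zero mass where p does not."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx == 0:
                return inf
            total += px * log2(px / qx)
    return total

def bootstrap_p_value(rows, resamples=2000, seed=0):
    """Fraction of bootstrap self-losses that reach the approximation loss."""
    rng = random.Random(seed)
    p = joint(rows)
    approximation_loss = kl(independent(rows), p)
    exceed = 0
    for _ in range(resamples):
        resample = rng.choices(rows, k=len(rows))   # sample with replacement
        if kl(joint(resample), p) >= approximation_loss:
            exceed += 1
    return exceed / resamples

# Hypothetical data: strongly dependent attributes vs. exactly independent ones.
dependent = [(0, 0)] * 45 + [(1, 1)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5
balanced = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25
p_dependent = bootstrap_p_value(dependent)
p_balanced = bootstrap_p_value(balanced)
```

A small P-value says the simple model's loss is rarely matched by sampling noise alone, so the interaction is worth modelling; a large one says the simpler model is not significantly worse.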

Self-Loss

Cross-Validation Protocol
• P-values ignore the variation in approximation loss and the generalization power of a classifier.
• CV-values are based on the following perturbation procedure:

The Myth of Average Performance
[Figure: the distribution of the loss difference between the two models. Axis: ← interaction (complex) wins | approximation (simple) wins →]
How much do the mode/median/mean of the above distribution tell you about which model to select?

Summary
• The existence of an interaction implies the need for a more complex model that joins the attributes.
• Feature relevance is an interaction of order 2.
• If there is no interaction, a complex model is unnecessary.
• Information theory provides an approximate “algebra” for investigating interactions.
• The difference between two models is a distribution, not a scalar.
• Occam’s P-Razor: Pick the simplest model among those that are not significantly worse than the best one.
