Analyzing Attribute Dependencies


Aleks Jakulin & Ivan Bratko
Faculty of Computer and Information Science
University of Ljubljana
Slovenia

Overview
• Problem: generalize the notion of "correlation" from two variables to three or more variables.
• Approach: use Shannon's entropy as the foundation for quantifying interaction.
• Application: visualization, with a focus on supervised learning domains.
• Result: several "mysteries" of machine learning can be explained through higher-order dependencies.

Problem: Attribute Dependencies

[Diagram: the label C and two attributes A and B.
2-way interactions: importance of attribute A (A-C), importance of attribute B (B-C), and attribute correlation (A-B).
3-way interaction: what is common to A, B and C together, and cannot be inferred from any subset of the attributes.]

Approach: Shannon's Entropy

[Venn diagram over the entropies of A and C.]

H(A) - entropy: the information which came with knowledge of A.

H(AB) - joint entropy of A and B.

H(C|A) = H(C) - I(A;C) - conditional entropy: the remaining uncertainty in C after knowing A.

I(A;C) = H(A) + H(C) - H(AC) - mutual information, or information gain: how much do A and C have in common?

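To make these quantities concrete, here is a minimal Python sketch (not from the slides) that estimates them from empirical frequencies of discrete attributes; the toy columns A and C are hypothetical.

```python
from collections import Counter
from math import log2

def H(*columns):
    """Empirical joint Shannon entropy, in bits, of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((k / n) * log2(k / n) for k in Counter(joint).values())

def I(a, c):
    """Mutual information I(A;C) = H(A) + H(C) - H(AC)."""
    return H(a) + H(c) - H(a, c)

A = [0, 0, 1, 1, 0, 1, 1, 0]  # hypothetical attribute
C = [0, 0, 1, 1, 1, 1, 0, 0]  # hypothetical label

print(H(A))            # H(A): information which came with knowledge of A
print(H(A, C))         # joint entropy H(AC)
print(H(C) - I(A, C))  # conditional entropy H(C|A) = H(C) - I(A;C)
```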
Interaction Information

I(A;B;C) := I(AB;C) - I(A;C) - I(B;C) = I(A;B|C) - I(A;B)
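Continuing the sketch above, the definition translates directly into code; the helper treats the pair (A,B) as a single joint attribute.

```python
def I3(a, b, c):
    """Interaction information I(A;B;C) = I(AB;C) - I(A;C) - I(B;C)."""
    ab = list(zip(a, b))               # the joint attribute AB
    i_ab_c = H(ab) + H(c) - H(ab, c)   # I(AB;C)
    return i_ab_c - I(a, c) - I(b, c)
```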

• (Partial) history of independent reinventions:
• McGill ‘54 (Psychometrika) - interaction information
• Han ‘80 (Information & Control) - multiple mutual information
• Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
• Grabisch & Roubens '99 (Int. J. of Game Theory) - Banzhaf interaction index
• Matsuda ‘00 (Physical Review E) - higher-order mutual inf.
• Brenner et al. ‘00 (Neural Computation) - average synergy
• Demšar ’02 (A thesis in machine learning) - relative information gain
• Bell ‘03 (NIPS02, ICA2003) - co-information
• Jakulin ‘03 - interaction gain
Properties
• Invariance with respect to the attribute/label division:

I(A;B;C) = I(A;C;B) = I(C;A;B) = I(B;A;C) = I(C;B;A) = I(B;C;A).

• Decomposition of mutual information:

I(AB;C) = I(A;C) + I(B;C) + I(A;B;C)

I(A;B;C) is the "synergistic information."

• A, B, C are independent ⇒ I(A;B;C) = 0.
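A worked check of the decomposition on the XOR problem (toy data, not from the slides), reusing the helpers above: each attribute alone is useless, yet the pair determines the label, so all of I(AB;C) is synergistic.

```python
# XOR: C = A xor B, over all four equally likely input combinations.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
C = [a ^ b for a, b in zip(A, B)]

print(I(A, C))      # 0.0 bits: A alone tells us nothing about C
print(I(B, C))      # 0.0 bits
print(I3(A, B, C))  # 1.0 bit: I(AB;C) = 0 + 0 + 1, pure synergy
```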
Positive and Negative Interactions
• If any pair of the attributes is conditionally independent with respect to a third attribute, the 3-way information "neutralizes" the 2-way information:

I(A;B|C) = 0 ⇒ I(A;B;C) = -I(A;B)

• Interaction information may be positive or negative:
• Positive: the XOR problem (A = B ⊕ C) - synergy.
• Negative: conditional independence, redundant attributes - redundancy.
• Zero: independence of one of the attributes, or a mix of synergy and redundancy.
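To see a negative interaction numerically, a sketch with a fully duplicated attribute (toy data, not from the slides), again reusing I and I3 from above:

```python
# B is an exact copy of A, so everything B adds about C is redundant.
A = [0, 0, 1, 1, 0, 1]
B = list(A)
C = [0, 1, 1, 1, 0, 1]

print(I3(A, B, C))  # negative
print(-I(A, C))     # equal: I(AB;C) = I(A;C), hence I3 = -I(A;C)
```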
Applications
• Visualization
• Interaction graphs
• Interaction dendrograms
• Model construction
• Feature construction
• Feature selection
• Ensemble construction
• Evaluation on the CMC domain: predicting the contraceptive method from demographics.

Interaction Graphs

Information gain: 100% × I(A;C)/H(C). The attribute "explains" 1.98% of the label entropy.

A positive interaction: 100% × I(A;B;C)/H(C). The two attributes are in a synergy: treating them holistically may result in 1.85% extra uncertainty explained.

A negative interaction: 100% × I(A;B;C)/H(C). The two attributes are slightly redundant: 1.15% of the label uncertainty is explained by each of the two attributes.
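The percentages on the graphs are simply normalized by the label entropy; with the helpers defined earlier, a sketch (the function names are mine):

```python
def info_gain_pct(a, c):
    """Share of label entropy 'explained' by one attribute: 100 * I(A;C)/H(C)."""
    return 100 * I(a, c) / H(c)

def interaction_pct(a, b, c):
    """Extra (positive) or overlapping (negative) share: 100 * I(A;B;C)/H(C)."""
    return 100 * I3(a, b, c) / H(c)
```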
Application: Feature Construction

NBC Model          Predictive perf. (Brier score)
{}                 0.2157 ± 0.0013
{Wedu, Hedu}       0.2087 ± 0.0024
{Wedu}             0.2068 ± 0.0019
{WeduHedu}         0.2067 ± 0.0019
{Age, Child}       0.1951 ± 0.0023
{AgeChild}         0.1918 ± 0.0026
{ACWH}             0.1873 ± 0.0027
{A, C, W, H}       0.1870 ± 0.0030
{A, C, W}          0.1850 ± 0.0027
{AC, WH}           0.1831 ± 0.0032
{AC, W}            0.1814 ± 0.0033

Alternatives (Brier score):
TAN                                          0.1874 ± 0.0032
NBC                                          0.1849 ± 0.0028
BEST of >100000 models ({AC, WH, MediaExp})  0.1811 ± 0.0032
GBN                                          0.1815 ± 0.0029
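The joined features in the table, such as {WeduHedu} and {AgeChild}, replace two synergistic attributes with a single Cartesian-product attribute that a model like naive Bayes can exploit. A minimal sketch; the helper name and the toy values are mine, not from the slides.

```python
def join_attributes(a, b):
    """Cartesian-product feature: one value per (a_i, b_i) pair."""
    return list(zip(a, b))

# Hypothetical values for the attributes named in the table.
Age   = ["<25", "25-35", ">35", "<25"]
Child = [0, 2, 1, 1]
AgeChild = join_attributes(Age, Child)  # e.g. ("<25", 0), ("25-35", 2), ...
```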

Dissimilarity Measures
• The relationships between attributes are to some extent transitive.
• Algorithm:
• Define a dissimilarity measure between two attributes in the context of the label C.
• Apply hierarchical clustering to summarize the dissimilarity matrix.

Interaction Dendrogram

[Dendrogram legend: information gain distinguishes informative from uninformative attributes; cluster "tightness" ranges from loose (weakly interacting) to tight (strongly interacting).]
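A sketch of the clustering step from the algorithm above, using SciPy's hierarchical clustering. The slide's dissimilarity formula is not in the transcript, so a placeholder is used here: pairs with a strong interaction (large |I(A;B;C)|) are treated as similar.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def interaction_dendrogram(attrs, names, c):
    """Cluster attributes by a placeholder interaction-based dissimilarity."""
    k = len(attrs)
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            # Placeholder, NOT the slide's measure: strongly interacting
            # pairs get a small distance.
            d[i, j] = d[j, i] = 1.0 / (1e-9 + abs(I3(attrs[i], attrs[j], c)))
    z = linkage(squareform(d), method="average")  # condensed distance form
    return dendrogram(z, labels=names)
```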

Application: Feature Selection
• Soybean domain:
• predict disease from symptoms;
• predominantly negative interactions.
• Global optimization procedure for feature selection: >5000 NBC models tested (B-Course)
• Selected features balance dissimilarity and importance.
• We can understand what global optimization did from the dendrogram.

Implication: Assumptions in Machine Learning

[Diagram: A → C ← B] A and B are independent. They may both inform us about C, but they have nothing in common. Hence I(AB;C) = I(A;C) + I(B;C). Assumed by: myopic feature importance measures (information gain), discretization algorithms.

[Diagram: A ← C → B] A and B are conditionally independent given C: if A and B have something in common, it is all due to C. Hence I(A;B|C) = 0. Assumed by: naïve Bayes, Bayesian networks (A ← C → B).
Work in Progress
• Overfitting: the interaction information computations do not account for the increase in complexity.
• Support for numerical and ordered attributes.
• Inductive learning algorithms that use these heuristics automatically.
• Models that are based on the real relationships in the data, not on our assumptions about them.
Summary
• There are relationships exclusive to groups of n attributes.
• Interaction information is an entropy-based heuristic for quantifying such relationships.
• Two visualization methods:
• Interaction graphs
• Interaction dendrograms