
Analyzing Attribute Dependencies

Aleks Jakulin & Ivan Bratko

Faculty of Computer and Information Science

University of Ljubljana

Slovenia

Overview
  • Problem:
    • Generalize the notion of “correlation” from two variables to three or more variables.
  • Approach:
    • Use Shannon’s entropy as the foundation for quantifying interaction.
  • Application:
    • Visualization, with focus on supervised learning domains.
  • Result:
    • We can explain several “mysteries” of machine learning through higher-order dependencies.
Problem: Attribute Dependencies

[Diagram: the label C and two attributes A and B, with the pairwise relationships marked.]

  • 2-Way Interactions: the importance of attribute A, the importance of attribute B, and the attribute correlation.
  • 3-Way Interaction: what is common to A, B and C together, and cannot be inferred from any subset of the attributes.
Approach: Shannon’s Entropy

[Diagram: entropy Venn diagram relating an attribute A and the label C.]

  • H(C): entropy given C’s empirical probability distribution (p = [0.2, 0.8]).
  • H(A): information which came with knowledge of A.
  • H(AB): joint entropy.
  • H(C|A) = H(C) - I(A;C): conditional entropy, the remaining uncertainty in C after knowing A.
  • I(A;C) = H(A) + H(C) - H(AC): mutual information or information gain, how much A and C have in common.
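A minimal sketch (not part of the slides) of how these quantities can be estimated from discrete data columns; the helper names H and mutual_information are our own.

```python
import numpy as np
from collections import Counter

def H(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete data columns."""
    counts = np.array(list(Counter(zip(*columns)).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a, c):
    """I(A;C) = H(A) + H(C) - H(AC): how much A and C have in common."""
    return H(a) + H(c) - H(a, c)

# Toy label with the empirical distribution p = [0.2, 0.8] from the slide.
c = [0, 1, 1, 1, 1]
a = [0, 0, 1, 1, 1]
print(H(c))                              # H(C), about 0.72 bits
print(H(c) - mutual_information(a, c))   # H(C|A) = H(C) - I(A;C), here 0.4 bits
```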

Interaction Information

I(A;B;C) := I(AB;C) - I(A;C) - I(B;C) = I(A;B|C) - I(A;B)

  • (Partial) history of independent reinventions:
      • McGill ‘54 (Psychometrika) - interaction information
      • Han ‘80 (Information & Control) - multiple mutual information
      • Yeung ‘91 (IEEE Trans. On Inf. Theory) - mutual information
      • Grabisch&Roubens ‘99 (I. J. of Game Theory) - Banzhaf interaction index
      • Matsuda ‘00 (Physical Review E) - higher-order mutual inf.
      • Brenner et al. ‘00 (Neural Computation) - average synergy
      • Demšar ’02 (A thesis in machine learning) - relative information gain
      • Bell ‘03 (NIPS02, ICA2003) - co-information
      • Jakulin ‘03 - interaction gain
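A minimal sketch of the definition above, reusing the H helper from the previous sketch (the function name is our own).

```python
def interaction_information(a, b, c):
    """I(A;B;C) = I(AB;C) - I(A;C) - I(B;C), written out in joint entropies.
    The same quantity equals I(A;B|C) - I(A;B)."""
    return (H(a, b) + H(a, c) + H(b, c)
            - H(a) - H(b) - H(c)
            - H(a, b, c))
```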
Properties
  • Invariance with respect to attribute/label division:

I(A;B;C) = I(A;C;B) = I(C;A;B) = I(B;A;C) = I(C;B;A) = I(B;C;A).

  • Decomposition of mutual information:

I(AB;C) = I(A;C)+I(B;C)+I(A;B;C)

I(A;B;C) is “synergistic information.”

  • A, B, C are independent ⇒ I(A;B;C) = 0.
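A quick numerical check of the decomposition, reusing the helpers sketched above; the data is made up for illustration.

```python
rng = np.random.default_rng(1)
a = rng.integers(0, 2, 500)
b = rng.integers(0, 3, 500)
c = (a + b) % 2

i_ab_c = H(a, b) + H(c) - H(a, b, c)   # I(AB;C)
i_a_c = mutual_information(a, c)       # I(A;C)
i_b_c = mutual_information(b, c)       # I(B;C)
assert abs(i_ab_c - (i_a_c + i_b_c + interaction_information(a, b, c))) < 1e-9
```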
Positive and Negative Interactions
  • If a pair of the attributes is conditionally independent given the third one, the 3-way information “neutralizes” the 2-way information:

I(A;B|C) = 0 ⇒ I(A;B;C) = -I(A;B)

  • Interaction information may be positive or negative:
    • Positive: the XOR problem (A = B ⊕ C), a case of synergy.
    • Negative: conditional independence or redundant attributes, a case of redundancy.
    • Zero: independence of one of the attributes, or a mix of synergy and redundancy.
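Two toy cases, reusing interaction_information from above: the XOR label gives a positive (synergistic) value, while two copies of the same noisy attribute give a negative (redundant) one.

```python
rng = np.random.default_rng(0)

# Synergy: neither attribute alone says anything about C = A xor B.
a = rng.integers(0, 2, 1000)
b = rng.integers(0, 2, 1000)
print(interaction_information(a, b, a ^ b))      # close to +1 bit

# Redundancy: both attributes are the same noisy copy of the label.
c = rng.integers(0, 2, 1000)
noisy = c ^ (rng.random(1000) < 0.1).astype(int)
print(interaction_information(noisy, noisy, c))  # negative
```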
Applications
  • Visualization
    • Interaction graphs
    • Interaction dendrograms
  • Model construction
    • Feature construction
    • Feature selection
    • Ensemble construction
  • Evaluation on the CMC domain: predicting contraception method from demographics.
Interaction Graphs

[Example interaction graph: nodes are attributes labelled with their information gain; edges are labelled with interaction gains.]

  • Information gain, 100% · I(A;C)/H(C): the attribute “explains” 1.98% of the label entropy.
  • A positive interaction, 100% · I(A;B;C)/H(C): the two attributes are in a synergy; treating them holistically may result in 1.85% extra uncertainty explained.
  • A negative interaction, 100% · I(A;B;C)/H(C): the two attributes are slightly redundant; 1.15% of label uncertainty is explained by each of the two attributes.
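A sketch of how the node and edge labels of such a graph can be computed, reusing the helpers from the earlier sketches; the function names are assumptions of ours.

```python
def info_gain_percent(a, label):
    """Node label: 100% * I(A;C) / H(C), the share of label entropy the attribute explains."""
    return 100.0 * mutual_information(a, label) / H(label)

def interaction_gain_percent(a, b, label):
    """Edge label: 100% * I(A;B;C) / H(C); positive means synergy, negative means redundancy."""
    return 100.0 * interaction_information(a, b, label) / H(label)
```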
Application: Feature Construction

NBC Model         Predictive perf. (Brier score)
{}                0.2157 ± 0.0013
{Wedu, Hedu}      0.2087 ± 0.0024
{Wedu}            0.2068 ± 0.0019
{WeduHedu}        0.2067 ± 0.0019
{Age, Child}      0.1951 ± 0.0023
{AgeChild}        0.1918 ± 0.0026
{ACWH}            0.1873 ± 0.0027
{A, C, W, H}      0.1870 ± 0.0030
{A, C, W}         0.1850 ± 0.0027
{AC, WH}          0.1831 ± 0.0032
{AC, W}           0.1814 ± 0.0033

Alternatives

Model                                        Predictive perf. (Brier score)
TAN                                          0.1874 ± 0.0032
NBC                                          0.1849 ± 0.0028
BEST (of >100000 models): {AC, WH, MediaExp} 0.1811 ± 0.0032
GBN                                          0.1815 ± 0.0029

Dissimilarity Measures
  • The relationships between attributes are to some extent transitive.
  • Algorithm:
    • Define a dissimilarity measure between two attributes in the context of the label C (one possible choice is sketched below).
    • Apply hierarchical clustering to summarize the dissimilarity matrix.
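A sketch of the clustering step with scipy. The slide's own dissimilarity formula is not preserved in this transcript, so the formula below, 1 - |I(A;B;C)| / H(C), is an assumption chosen so that strongly interacting attribute pairs end up close together; it reuses H and interaction_information from the earlier sketches.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def interaction_dissimilarity(columns, label):
    """Pairwise dissimilarity between attributes in the context of the label C.
    ASSUMED formula: 1 - |I(A;B;C)| / H(C); the slide's exact formula is not preserved."""
    k = len(columns)
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d[i, j] = d[j, i] = 1.0 - abs(interaction_information(columns[i], columns[j], label)) / H(label)
    return d

# d = interaction_dissimilarity(attribute_columns, label_column)
# dendrogram(linkage(squareform(d, checks=False), method="average"), labels=attribute_names)
```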
Interaction Dendrogram

[Dendrogram of the attributes: leaf shading marks information gain (uninformative vs. informative attributes); cluster “tightness” ranges from loose (weakly interacting) to tight (strongly interacting).]

Application: Feature Selection
  • Soybean domain:
    • predict disease from symptoms;
    • predominantly negative interactions.
  • Global optimization procedure for feature selection: >5000 NBC models tested (B-Course)
  • Selected features balance dissimilarity and importance.
  • We can understand what global optimization did from the dendrogram.
Implication: Assumptions in Machine Learning

  • A and B are independent: they may both inform us about C, but they have nothing in common, so I(AB;C) = I(A;C) + I(B;C). Assumed by myopic feature importance measures (information gain) and discretization algorithms.
  • A and B are conditionally independent given C: if A and B have something in common, it is all due to C, so I(A;B|C) = 0. Assumed by Naïve Bayes and Bayesian networks (A ← C → B).
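A small sketch, reusing H from above, of how the second assumption could be checked on data: I(A;B|C) should be close to zero if the Naive Bayes model is well founded for attributes A and B.

```python
def conditional_mutual_information(a, b, c):
    """I(A;B|C) = H(AC) + H(BC) - H(C) - H(ABC); zero exactly when A and B are
    conditionally independent given C (the Naive Bayes assumption)."""
    return H(a, c) + H(b, c) - H(c) - H(a, b, c)
```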
Work in Progress
  • Overfitting: the interaction information computations do not account for the increase in complexity.
  • Support for numerical and ordered attributes.
  • Inductive learning algorithms which use these heuristics automatically.
  • Models that are based on the real relationships in the data, not on our assumptions about them.
Summary
  • There are relationships exclusive to groups of n attributes.
  • Interaction information is an entropy-based heuristic for quantifying such relationships.
  • Two visualization methods:
    • Interaction graphs
    • Interaction dendrograms