
Feature Selection as Relevant Information Encoding

Naftali Tishby

School of Computer Science and Engineering

The Hebrew University, Jerusalem, Israel

NIPS 2001


Many thanks to:

Noam Slonim

Amir Globerson

Bill Bialek

Fernando Pereira

Nir Friedman


Feature Selection?

  • NOT generative modeling!

    • no assumptions about the source of the data

  • Extracting relevant structure from data

    • functions of the data (statistics) that preserve information

  • Information about what?

  • Approximate Sufficient Statistics

  • Need a principle that is both general and precise.

    • Good Principles survive longer!


A Simple Example...


Simple Example


A new compact representation

The document clusters preserve the relevant information between the documents and words.

[Figure: document-word co-occurrence matrix, with Documents on one axis and Words on the other]


Mutual information

How much does X tell us about Y?

I(X;Y) is a functional of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) we need to ask about x in order to learn all we can about Y.

Uncertainty removed about X when we know Y:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

[Diagram: the entropies H(X) and H(Y) overlap; the overlap is I(X;Y), and the non-overlapping parts are H(X|Y) and H(Y|X)]
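To make the definition concrete, here is a minimal sketch (not part of the original talk) that computes I(X;Y) for a toy document-word joint table using the identity I(X;Y) = H(X) - H(X|Y); the table values are invented for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly unnormalized) distribution."""
    p = p[p > 0]
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) in bits for a joint probability table pxy[x, y]."""
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)   # marginal p(x)
    py = pxy.sum(axis=0)   # marginal p(y)
    h_x = entropy(px)
    # H(X|Y) = sum_y p(y) * H(X | Y=y)
    h_x_given_y = sum(py[j] * entropy(pxy[:, j])
                      for j in range(pxy.shape[1]) if py[j] > 0)
    return h_x - h_x_given_y

# Toy joint distribution over 4 "documents" (X) and 3 "words" (Y).
pxy = np.array([[0.20, 0.05, 0.00],
                [0.15, 0.10, 0.00],
                [0.00, 0.05, 0.20],
                [0.00, 0.05, 0.20]])
print(mutual_information(pxy))   # bits the words carry about the documents
```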


Relevant Coding

  • What are the questions that we need to ask about X in order to learn about Y?

  • Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

[Diagram: the conditional distributions P(x|y1) and P(x|y2) over X mark the regions X|y1 and X|y2 that must be distinguished in order to learn about Y]


Bottlenecks and Neural Nets

  • Auto-association: forcing compact representations

  • The bottleneck (hidden) layer is a relevant code of the input w.r.t. the output.

[Diagram: a bottleneck network whose Input is Sample 1 (the Past) and whose Output is Sample 2 (the Future)]


  • Q: How many bits are needed to determine the relevant representation?

    • We need to index the maximal number of non-overlapping green blobs inside the blue blob (see the counting step below):

      (mutual information!)
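Making the counting argument explicit (a standard typical-set step, reconstructed here rather than taken from the slide): the blue blob is the typical set of X, of size about 2^{H(X)}, and each green blob is the set of x's mapped to one representative x̂, of size about 2^{H(X|X̂)}, so

\[
  \#\{\text{distinguishable representatives}\} \;\approx\; \frac{2^{H(X)}}{2^{H(X\mid\hat X)}} \;=\; 2^{I(X;\hat X)},
\]

i.e. about I(X;X̂) bits are needed to index the relevant representation.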


  • The idea: find a compressed signal X̂ that needs a short encoding (small I(X;X̂)), while preserving as much as possible of the information about the relevant signal (large I(X̂;Y)).


A Variational Principle

We want a short representation of X that keeps the information about another variable, Y, if possible.
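The corresponding functional (the slide's formula did not survive the transcript; this is the standard Information Bottleneck Lagrangian, minimized over the stochastic encoder p(x̂|x)):

\[
  \mathcal{L}\big[p(\hat x \mid x)\big] \;=\; I(X;\hat X) \;-\; \beta\, I(\hat X;Y),
\]

where β ≥ 0 sets the trade-off between compression (small I(X;X̂)) and preserved relevance (large I(X̂;Y)).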


The Self-Consistent Equations

  • Marginal: p(x̂) = Σ_x p(x) p(x̂|x)

  • Markov condition (Y ↔ X ↔ X̂): p(y|x̂) = Σ_x p(y|x) p(x|x̂)

  • Bayes’ rule: p(x|x̂) = p(x̂|x) p(x) / p(x̂)


The emergent effective distortion measure is the KL divergence d(x, x̂) = D_KL[ p(y|x) ‖ p(y|x̂) ]:

  • Regular if p(y|x) is absolutely continuous w.r.t. p(y|x̂)

  • Small if x̂ predicts y as well as x does: p(y|x̂) ≈ p(y|x)


The iterative algorithm (generalized Blahut-Arimoto): alternate the three self-consistent equations together with the encoder update p(x̂|x) ∝ p(x̂) exp(-β d(x, x̂)) until convergence.
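Below is a minimal runnable sketch of this iteration, written directly from the self-consistent equations and the encoder update above; the function name, toy joint table, and numerical safeguards are illustrative assumptions, not from the talk.

```python
import numpy as np

def information_bottleneck(pxy, n_clusters, beta, n_iter=200, seed=0):
    """Generalized Blahut-Arimoto sketch: return the encoder q(t|x) for joint p(x,y)."""
    rng = np.random.default_rng(seed)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)                      # p(x)
    py_x = pxy / px[:, None]                  # p(y|x)

    # Random soft initialization of the encoder q(t|x).
    qt_x = rng.random((len(px), n_clusters))
    qt_x /= qt_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        qt = px @ qt_x                        # marginal: q(t) = sum_x p(x) q(t|x)
        # Markov condition: q(y|t) = sum_x p(y|x) p(x|t),
        # with Bayes' rule p(x|t) = q(t|x) p(x) / q(t).
        px_t = (qt_x * px[:, None]) / qt[None, :]
        qy_t = px_t.T @ py_x                  # shape (T, Y)

        # Effective distortion d(x,t) = KL[ p(y|x) || q(y|t) ].
        d = np.zeros((len(px), n_clusters))
        for t in range(n_clusters):
            ratio = np.where(py_x > 0, py_x / np.maximum(qy_t[t], 1e-12), 1.0)
            d[:, t] = np.sum(np.where(py_x > 0, py_x * np.log(ratio), 0.0), axis=1)

        # Self-consistent update: q(t|x) proportional to q(t) exp(-beta * d(x,t)).
        log_q = np.log(np.maximum(qt, 1e-12))[None, :] - beta * d
        log_q -= log_q.max(axis=1, keepdims=True)   # numerical stability
        qt_x = np.exp(log_q)
        qt_x /= qt_x.sum(axis=1, keepdims=True)

    return qt_x

# Toy document-word table, clustered into 2 relevant document groups.
pxy = np.array([[0.20, 0.05, 0.00],
                [0.15, 0.10, 0.00],
                [0.00, 0.05, 0.20],
                [0.00, 0.05, 0.20]])
print(np.round(information_bottleneck(pxy, n_clusters=2, beta=10.0), 3))
```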


The Information Bottleneck Algorithm

The alternating updates can be read as minimization of an IB “free energy”, whose stationarity conditions are exactly the self-consistent equations above.


  • The information plane: the optimal I(X̂;Y) for a given I(X;X̂) is a concave function (the information curve):

[Plot: the information curve of I(X̂;Y) versus I(X;X̂); the region above the curve is impossible, the region below it is the possible phase]

Manifold of relevance

The self-consistent equations:

Assuming a continuous manifold for the representation X̂, these become coupled eigenfunction equations (local on the manifold), with β as an eigenvalue.


Document classification - information curves


Multivariate Information Bottleneck

  • Complex relationship between many variables

  • Multiple unrelated dimensionality reduction schemes

  • Trade between known and desired dependencies

  • Express IB in the language of Graphical Models

  • Multivariate extension of Rate-Distortion Theory


Multivariate Information Bottleneck: extending the dependency graphs

(Multi-information)
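The multi-information referred to here (the slide's formula is missing from the transcript; this is the standard definition):

\[
  \mathcal{I}(X_1,\dots,X_n) \;=\; D_{KL}\!\left[\,p(x_1,\dots,x_n)\,\middle\|\,\prod_{i=1}^{n} p(x_i)\right]
  \;=\; \sum_{i=1}^{n} H(X_i) \;-\; H(X_1,\dots,X_n).
\]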


Sufficient Dimensionality Reduction (with Amir Globerson)

  • Exponential families have sufficient statistics

  • Given a joint distribution p(x,y), find an approximation of the exponential form:
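A hedged reconstruction of the exponential form meant here (the slide's formula is missing; the parameterization below follows the published SDR formulation, with any marginal-matching terms absorbed into the features):

\[
  \tilde p(x,y) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_{k=1}^{d} \phi_k(x)\,\psi_k(y)\Big),
\]

where the d feature pairs (φ_k, ψ_k) are chosen so that the approximation preserves as much as possible of the information that X carries about Y.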

This can be done by alternating maximization of entropy under the constraints.

The resulting functions φ_k(x) and ψ_k(y) are our relevant features at rank d.


Summary

  • We present a general information-theoretic approach for extracting relevant information.

  • It is a natural generalization of Rate-Distortion theory with similar convergence and optimality proofs.

  • Unifies learning, feature extraction, filtering, and prediction...

  • Applications (so far) include:

    • Word sense disambiguation

    • Document classification and categorization

    • Spectral analysis

    • Neural codes

    • Bioinformatics,…

    • Data clustering based on multi-distance distributions

