
Presentation Transcript


Feature Selection as Relevant Information Encoding

Naftali Tishby

School of Computer Science and Engineering

The Hebrew University, Jerusalem, Israel

NIPS 2001


Many thanks to:

Noam Slonim

Amir Globerson

Bill Bialek

Fernando Pereira

Nir Friedman


Feature Selection?

  • NOT generative modeling!

    • no assumptions about the source of the data

  • Extracting relevant structure from data

    • functions of the data (statistics) that preserve information

  • Information about what?

  • Approximate Sufficient Statistics

  • Need a principle that is both general and precise.

    • Good Principles survive longer!


A Simple Example...


Simple Example


A New Compact Representation

The document clusters preserve the relevant information between the documents and words.


[Figure: Documents and Words.]


Mutual Information

How much does X tell about Y?

I(X;Y) is a functional of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) we need to ask about x in order to learn all we can about Y.

Uncertainty removed about X when we know Y:

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

[Venn diagram: I(X;Y) is the overlap of H(X) and H(Y); the non-overlapping parts are H(X|Y) and H(Y|X).]
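As a concrete check of these identities, here is a minimal sketch (my illustration, not from the slides; the joint table is invented) computing H and I(X;Y) from a discrete joint distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a distribution given as an array."""
    p = p.ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table pxy[x, y]."""
    px = pxy.sum(axis=1)          # marginal p(x)
    py = pxy.sum(axis=0)          # marginal p(y)
    return entropy(px) + entropy(py) - entropy(pxy)

# A noisy binary channel as a toy joint distribution.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))                              # ~0.278 bits
print(entropy(pxy.sum(axis=1)) - mutual_information(pxy))   # H(X|Y)
```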


Relevant Coding

  • What are the questions that we need to ask about X in order to learn about Y?

  • Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

[Figure: the conditional distributions P(x|y1) and P(x|y2) partition X into the domains X|y1 and X|y2 that distinguish y1 from y2.]


Bottlenecks and Neural Nets

  • Auto-association: forcing compact representations

  • The hidden (bottleneck) layer is a relevant code of the input w.r.t. the output

[Figure: a network with a narrow hidden layer mapping Input to Output; e.g., Sample 1 to Sample 2, or Past to Future.]
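As a toy illustration of forcing a compact representation (a linear stand-in of my own, not the slide's network), a truncated SVD implements the optimal linear auto-association through a k-dimensional bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))      # 100 samples, 10 features
X -= X.mean(axis=0)                 # center the data

k = 2                               # width of the bottleneck layer
U, s, Vt = np.linalg.svd(X, full_matrices=False)
code = X @ Vt[:k].T                 # compact k-dimensional code
X_hat = code @ Vt[:k]               # reconstruction from the code

err = np.linalg.norm(X - X_hat)**2 / np.linalg.norm(X)**2
print(f"relative reconstruction error with k={k}: {err:.3f}")
```

The narrower the code, the more structure it is forced to preserve; the bottleneck approach replaces reconstruction error with relevant information about Y.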


  • Q: How many bits are needed to determine the relevant representation?

    • We need to index the maximal number of non-overlapping green blobs inside the blue blob: roughly 2^I(X;X̂) of them, i.e. I(X;X̂) bits (mutual information!)


  • The idea: find a compressed signal X̂ that needs a short encoding (small I(X;X̂)) while preserving as much as possible of the information on the relevant signal (I(X̂;Y))


A Variational Principle

We want a short representation of X that keeps the information about another variable, Y, if possible: minimize

L[p(x̂|x)] = I(X;X̂) - β I(X̂;Y)

over the stochastic mappings p(x̂|x), where the multiplier β trades compression against preserved relevant information.


The Self-Consistent Equations

  • Marginal: p(x̂) = Σ_x p(x̂|x) p(x)

  • Markov condition: p(y|x̂) = Σ_x p(y|x) p(x|x̂)

  • Bayes' rule: p(x|x̂) = p(x̂|x) p(x) / p(x̂)


The emerged effective distortion measure:

d(x,x̂) = D_KL[p(y|x) || p(y|x̂)]

  • Regular if p(y|x) is absolutely continuous w.r.t. p(y|x̂)

  • Small if x̂ predicts y as well as x: p(y|x̂) ≈ p(y|x)
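A minimal sketch of this distortion for discrete y (the distributions are invented):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits; regularity requires q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p_y_given_x    = np.array([0.7, 0.2, 0.1])
p_y_given_xhat = np.array([0.6, 0.3, 0.1])
# Small distortion: xhat predicts y almost as well as x does.
print(kl_divergence(p_y_given_x, p_y_given_xhat))
```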


The iterative algorithm (generalized Blahut-Arimoto) alternates the self-consistent equations:

p_t(x̂|x) = p_t(x̂) / Z_t(x,β) · exp(-β d(x,x̂))

p_{t+1}(x̂) = Σ_x p(x) p_t(x̂|x)

p_{t+1}(y|x̂) = Σ_x p(y|x) p_t(x|x̂)
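These updates translate directly into code. Here is a sketch of my own implementation of the published updates (function and variable names are mine, not the authors'):

```python
import numpy as np

def ib_blahut_arimoto(pxy, m, beta, n_iter=200, seed=0):
    """Iterative IB for a discrete joint table pxy[x, y]; m = |Xhat|."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    px = pxy.sum(axis=1)                    # p(x)
    py_x = pxy / px[:, None]                # p(y|x)

    q = rng.random((len(px), m))            # q[x, t] = p(xhat=t | x)
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xhat = px @ q                     # marginal: p(xhat)
        # Bayes' rule: p(x|xhat) = p(xhat|x) p(x) / p(xhat)
        p_x_xhat = (q * px[:, None]) / (p_xhat + eps)
        # Markov condition: p(y|xhat) = sum_x p(y|x) p(x|xhat)
        py_xhat = p_x_xhat.T @ py_x         # shape (m, |Y|)
        # effective distortion: d[x, t] = D_KL[p(y|x) || p(y|xhat=t)]
        d = np.sum(py_x[:, None, :] *
                   (np.log(py_x[:, None, :] + eps) -
                    np.log(py_xhat[None, :, :] + eps)), axis=2)
        # self-consistent update: p(xhat|x) ∝ p(xhat) exp(-beta d)
        logits = np.log(p_xhat + eps) - beta * d
        logits -= logits.max(axis=1, keepdims=True)   # avoid underflow
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)

    p_xhat = px @ q                         # recompute consistent outputs
    py_xhat = ((q * px[:, None]) / (p_xhat + eps)).T @ py_x
    return q, p_xhat, py_xhat
```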


The Information Bottleneck Algorithm

The iteration is an alternating minimization of an IB "free energy",

F = I(X;X̂) + β ⟨D_KL[p(y|x) || p(y|x̂)]⟩,

which decreases monotonically, as in rate-distortion theory.


  • The information plane: the optimal I(X̂;Y) for a given I(X;X̂) is a concave function:

[Figure: the information plane; the region above the concave curve is impossible, below it is the possible phase.]
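Sweeping β traces this curve numerically. A sketch that reuses the mutual_information and ib_blahut_arimoto functions from the snippets above (the toy joint is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
pxy = rng.random((8, 4))
pxy /= pxy.sum()                    # toy joint p(x, y)
px = pxy.sum(axis=1)

for beta in [0.1, 1.0, 5.0, 20.0]:
    q, p_xhat, py_xhat = ib_blahut_arimoto(pxy, m=4, beta=beta)
    i_compress = mutual_information(q * px[:, None])            # I(X; Xhat)
    i_relevant = mutual_information(p_xhat[:, None] * py_xhat)  # I(Xhat; Y)
    print(f"beta={beta:5.1f}  I(X;Xhat)={i_compress:.3f}  I(Xhat;Y)={i_relevant:.3f}")
```

Small β lands at the compressed lower-left end of the curve; large β approaches the limit where I(X̂;Y) tends to I(X;Y).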


Manifold of Relevance

Assuming a continuous manifold for x̂, the self-consistent equations become coupled (local in x̂) eigenfunction equations, with β as an eigenvalue.


Document Classification - Information Curves


Multivariate Information Bottleneck

  • Complex relationship between many variables

  • Multiple unrelated dimensionality reduction schemes

  • Trade between known and desired dependencies

  • Express IB in the language of Graphical Models

  • Multivariate extension of Rate-Distortion Theory


Multivariate Information Bottleneck: extending the dependency graphs

The trade-off is written in terms of multi-information, the generalization of mutual information to many variables:

I(X_1; ...; X_n) = D_KL[ p(x_1,...,x_n) || p(x_1)...p(x_n) ]
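A minimal sketch computing the multi-information of a small joint table (the distribution is invented):

```python
import numpy as np

def entropy(p):
    p = p.ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def multi_information(p):
    """Sum of marginal entropies minus the joint entropy, in bits."""
    h_marginals = sum(
        entropy(p.sum(axis=tuple(j for j in range(p.ndim) if j != i)))
        for i in range(p.ndim))
    return h_marginals - entropy(p)

p = np.random.default_rng(2).random((2, 2, 2))
p /= p.sum()                       # toy joint p(x1, x2, x3)
print(multi_information(p))        # total dependence among the variables
```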


Sufficient Dimensionality Reduction (with Amir Globerson)

  • Exponential families have sufficient statistics

  • Given a joint distribution p(x,y), find an approximation of the exponential form:

p̂(x,y) = (1/Z) exp( Σ_{k=1}^d φ_k(x) ψ_k(y) )

This can be done by alternating maximization of entropy under expectation constraints on the features.

The resulting functions φ_k(x), ψ_k(y) are our relevant features at rank d.
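To see what rank-d feature functions look like, here is a crude stand-in of my own (a plain SVD of log p, not the SDR algorithm itself, which alternates entropy maximizations):

```python
import numpy as np

rng = np.random.default_rng(3)
pxy = rng.random((10, 6))
pxy /= pxy.sum()                    # toy joint distribution (all entries > 0)

d = 2                               # target rank
U, s, Vt = np.linalg.svd(np.log(pxy), full_matrices=False)
phi = U[:, :d] * s[:d]              # feature functions phi_k(x), one column per k
psi = Vt[:d]                        # feature functions psi_k(y), one row per k

approx = np.exp(phi @ psi)          # exp(sum_k phi_k(x) psi_k(y))
approx /= approx.sum()              # the 1/Z normalization
print(np.abs(approx - pxy).max())   # quality of the rank-d approximation
```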


Summary

  • We present a general information-theoretic approach for extracting relevant information.

  • It is a natural generalization of Rate-Distortion theory with similar convergence and optimality proofs.

  • Unifies learning, feature extraction, filtering, and prediction...

  • Applications (so far) include:

    • Word sense disambiguation

    • Document classification and categorization

    • Spectral analysis

    • Neural codes

    • Bioinformatics,…

    • Data clustering based on multi-distance distributions

