# Feature Selection as Relevant Information Encoding


## Feature Selection as Relevant Information Encoding

Naftali Tishby

School of Computer Science and Engineering

The Hebrew University, Jerusalem, Israel

NIPS 2001

Many thanks to:

Noam Slonim

Amir Globerson

Bill Bialek

Fernando Pereira

Nir Friedman

### Feature Selection?

• NOT generative modeling!

• no assumptions about the source of the data

• Extracting relevant structure from data

• functions of the data (statistics) that preserve information

• Approximate Sufficient Statistics

• Need a principle that is both general and precise.

• Good Principles survive longer!

### A new compact representation

The document clusters preserve the relevant information between the documents and the words.

*(Figure: Documents linked to Words through the document clusters.)*

### Mutual information

How much does X tell about Y?

I(X;Y) is a functional of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) one needs to ask about x in order to learn all one can about Y.

Uncertainty removed about X when we know Y:

$$I(X;Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X)$$

*(Figure: entropy Venn diagram showing H(X|Y), H(Y|X), and their overlap I(X;Y).)*
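To make this concrete, here is a minimal Python sketch (not from the original talk; the function name and example are illustrative) that computes $I(X;Y)$ from a discrete joint distribution:

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits for a discrete joint distribution.

    p_xy: 2-D array with p_xy[i, j] = p(x_i, y_j), entries summing to 1.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # 0 log 0 = 0 by convention
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: X and Y perfectly correlated -> I(X;Y) = H(X) = 1 bit
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # 1.0
```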

### Relevant Coding

• Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

*(Figure: the conditional distributions P(x|y1) and P(x|y2) over X, induced by the values y1 and y2 of Y.)*

### Bottlenecks and Neural Nets

• Auto-association: forcing compact representations

• The bottleneck layer $\hat{X}$ is a relevant code of the input $X$ w.r.t. the output $Y$

*(Figure: bottleneck networks for Input → Output (auto-association), Sample 1 → Sample 2, and Past → Future.)*

• Q: How many bits are needed to determine the relevant representation?

• We need to index the maximal number of non-overlapping green blobs inside the blue blob: the number of bits required is the mutual information!

• The idea: find a compressed signal $\hat{X}$ that needs a short encoding (small $I(X;\hat{X})$), while preserving as much as possible of the information on the relevant signal (large $I(\hat{X};Y)$).

### A Variational Principle

We want a short representation of X that keeps, as far as possible, the information about another variable, Y.
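The slide's formula itself is not preserved in this transcript; in the published formulation (Tishby, Pereira & Bialek, 1999), the principle is to minimize the functional

$$\mathcal{L}\big[p(\hat{x}\mid x)\big] \;=\; I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y),$$

where the Lagrange multiplier $\beta$ sets the trade-off between compression and preserved relevant information.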

### The Self-Consistent Equations

• Marginal: $p(\hat{x}) = \sum_x p(x)\,p(\hat{x}\mid x)$

• Markov condition: $p(y\mid\hat{x}) = \sum_x p(y\mid x)\,p(x\mid\hat{x})$

• Bayes' rule: $p(x\mid\hat{x}) = \dfrac{p(\hat{x}\mid x)\,p(x)}{p(\hat{x})}$

The effective distortion measure that emerges:

$$d(x,\hat{x}) = D_{KL}\big[\,p(y\mid x)\,\big\|\,p(y\mid\hat{x})\,\big]$$

• Regular if $p(y\mid x)$ is absolutely continuous w.r.t. $p(y\mid\hat{x})$

• Small if $\hat{x}$ predicts $y$ as well as $x$ does: $p(y\mid\hat{x}) \approx p(y\mid x)$

The iterative algorithm (a generalized Blahut–Arimoto algorithm) alternates the update

$$p(\hat{x}\mid x) = \frac{p(\hat{x})}{Z(x,\beta)}\,\exp\!\big(-\beta\, d(x,\hat{x})\big)$$

with the marginal and Markov equations above, until convergence.
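A minimal Python sketch of this generalized Blahut–Arimoto iteration, assuming discrete $X$ and $Y$; the code and all names are illustrative, not from the talk:

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterative (generalized Blahut-Arimoto) IB for discrete X, Y.

    p_xy: joint distribution, shape (|X|, |Y|), entries summing to 1.
    Returns the soft encoder p(xhat | x), shape (|X|, n_clusters).
    """
    rng = np.random.default_rng(seed)
    nx, ny = p_xy.shape
    p_x = p_xy.sum(axis=1)                    # p(x)
    p_y_given_x = p_xy / p_x[:, None]         # p(y|x)

    # random soft initialization of the encoder p(xhat|x)
    q = rng.random((nx, n_clusters))
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_xhat = p_x @ q                      # marginal: p(xhat)
        # Bayes: p(x|xhat), then Markov: p(y|xhat) = sum_x p(y|x) p(x|xhat)
        p_x_given_xhat = (q * p_x[:, None]) / p_xhat[None, :]
        p_y_given_xhat = p_x_given_xhat.T @ p_y_given_x

        # effective distortion d(x, xhat) = KL[ p(y|x) || p(y|xhat) ]
        eps = 1e-12
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) \
                  - np.log(p_y_given_xhat[None, :, :] + eps)
        d = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)

        # self-consistent update: p(xhat|x) ∝ p(xhat) exp(-beta d(x, xhat))
        q = p_xhat[None, :] * np.exp(-beta * d)
        q /= q.sum(axis=1, keepdims=True)

    return q
```

Sweeping $\beta$ from small to large values trades compression against preserved relevant information, tracing out the curve in the information plane discussed next.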

### The Information Bottleneck Algorithm

• The minimized functional plays the role of a “free energy”, with $\beta$ as an inverse temperature.

• In the information plane, the optimal $I(\hat{X};Y)$ for a given $I(X;\hat{X})$ is a concave function.

*(Figure: the information plane; the concave curve separates the possible phase below from the impossible region above.)*

### Manifold of relevance

The self-consistent equations:

Assuming a continuous manifold for the representation $\hat{X}$, these become coupled (local in $\hat{x}$) eigenfunction equations, with $\beta$ as an eigenvalue.

### Multivariate Information Bottleneck

• Complex relationships among many variables

• Multiple unrelated dimensionality reduction schemes

• Trade off between known and desired dependencies

• Express IB in the language of Graphical Models

• Multivariate extension of Rate-Distortion Theory

Multivariate Information Bottleneck: extending the dependency graphs, with the multi-information as the traded-off quantity.
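The slide's formula is not preserved here; the standard definition of the multi-information is

$$\mathcal{I}(X_1,\dots,X_n) \;=\; D_{KL}\!\left[\,p(x_1,\dots,x_n)\;\Big\|\;\prod_{i=1}^{n} p(x_i)\right],$$

the total dependence among the variables, which reduces to $I(X;Y)$ for $n=2$.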

### Sufficient Dimensionality Reduction (with Amir Globerson)

• Exponential families have sufficient statistics

• Given a joint distribution $p(x,y)$, find an approximation of the exponential form:
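The formula itself is not preserved in this transcript; in the published SDR model (Globerson & Tishby), the approximation takes the form

$$\tilde{p}(x,y) \;=\; \frac{1}{Z}\,\exp\!\left(\sum_{k=1}^{d} \phi_k(x)\,\psi_k(y)\right),$$

possibly augmented with terms in $x$ alone and in $y$ alone to match the marginals.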

This can be done by alternating maximization of entropy under expectation constraints on the features.

The resulting functions $\phi_k(x)$ and $\psi_k(y)$ are our relevant features at rank $d$.

### Summary

• We present a general information-theoretic approach for extracting relevant information.

• It is a natural generalization of Rate-Distortion theory with similar convergence and optimality proofs.

• Unifies learning, feature extraction, filtering, and prediction...

• Applications (so far) include:

• Word sense disambiguation

• Document classification and categorization

• Spectral analysis

• Neural codes

• Bioinformatics,…

• Data clustering based on multi-distance distributions