Feature Selection as Relevant Information Encoding



Naftali Tishby, School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel. NIPS 2001. Many thanks to: Noam Slonim, Amir Globerson, Bill Bialek, Fernando Pereira, Nir Friedman.


### Feature Selection as Relevant Information Encoding

Naftali Tishby

School of Computer Science and Engineering

The Hebrew University, Jerusalem, Israel

NIPS 2001

Many thanks to:

Noam Slonim

Amir Globerson

Bill Bialek

Fernando Pereira

Nir Friedman

Feature Selection?
• NOT generative modeling!
• no assumptions about the source of the data
• Extracting relevant structure from data
• functions of the data (statistics) that preserve information
• Approximate Sufficient Statistics
• Need a principle that is both general and precise.
• Good Principles survive longer!
A new compact representation

The document clusters preserve the relevant information between the documents and words.

Mutual information

How much does X tell us about Y?

$I(X;Y)$: a function of the joint probability distribution $p(x,y)$; the minimal number of yes/no questions (bits) we need to ask about $x$ in order to learn all we can about $Y$.

Uncertainty removed about X when we know Y:

$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

[Figure: diagram with regions labeled $I(X;Y)$, $H(X|Y)$, and $H(Y|X)$.]
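The identity above can be checked numerically on a small joint table. The following is a minimal sketch, assuming a discrete joint distribution stored as a NumPy array; the function names `entropy`, `conditional_entropy`, and `mutual_information` are illustrative and do not come from the slides.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(pxy):
    """H(X|Y) = H(X,Y) - H(Y) for a joint table pxy[i, j] = p(x_i, y_j)."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy(pxy) - entropy(pxy.sum(axis=0))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) in bits."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)

# A noisy binary channel: X is a fair bit, Y copies it with 10% flips.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
i_xy = mutual_information(pxy)
h_x = entropy(pxy.sum(axis=1))
# The two routes to I(X;Y) should agree up to rounding.
print(i_xy, h_x - conditional_entropy(pxy))
```

Computing $I(X;Y)$ through the joint entropy avoids dividing by marginals, so tables with zero cells need no special handling.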

Relevant Coding
• Need to partition X into relevant domains, or clusters, between which we really need to distinguish...

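[Figure: the spaces X and Y; the values y1 and y2 in Y induce the regions X|y1 and X|y2 in X, with conditional distributions P(x|y1) and P(x|y2).]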

Bottlenecks and Neural Nets

• Auto association: forcing compact representations
• $\hat{X}$ is a relevant code of $X$ w.r.t. $Y$

[Figure: bottleneck diagrams pairing Input and Output, Sample 1 and Sample 2, Past and Future.]

Q: How many bits are needed to determine the relevant representation?
• need to index the maximal number of non-overlapping green blobs inside the blue blob: $2^{H(X)} / 2^{H(X|\hat{X})} = 2^{I(X;\hat{X})}$ (mutual information!)

The idea: find a compressed signal $\hat{X}$ that needs short encoding (small $I(X;\hat{X})$) while preserving as much as possible the information on the relevant signal ($I(\hat{X};Y)$).

A Variational Principle

We want a short representation of X that keeps the information about another variable, Y, if possible.
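In the information bottleneck literature this trade-off is usually written as a functional of the encoder. The form below is that standard statement, not a transcription of the slide; the symbols $\hat{X}$ (the compressed representation) and $\beta$ (the trade-off multiplier) are assumed notation.

```latex
\min_{p(\hat{x}|x)} \; \mathcal{L}\big[p(\hat{x}|x)\big]
  = I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y), \qquad \beta \ge 0
```

Small $\beta$ favors compression (a short code for $X$), while large $\beta$ favors keeping the information about $Y$.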

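The Self-Consistent Equations
• Marginal: $p(\hat{x}) = \sum_x p(x)\, p(\hat{x}|x)$
• Markov condition: $p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})$
• Bayes’ rule: $p(x|\hat{x}) = \dfrac{p(\hat{x}|x)\, p(x)}{p(\hat{x})}$

The emergent effective distortion measure: $d(x,\hat{x}) = D_{KL}\left[\, p(y|x) \,\|\, p(y|\hat{x}) \,\right]$

• Regular if $p(y|x)$ is absolutely continuous w.r.t. $p(y|\hat{x})$
• Small if $\hat{x}$ predicts $y$ as well as $x$ does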
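The marginal, Markov, and Bayes steps listed above can be iterated together with an encoder update. The sketch below assumes the update commonly given in the information bottleneck literature, $p(\hat{x}|x) \propto p(\hat{x}) \exp(-\beta\, D_{KL}[p(y|x)\,\|\,p(y|\hat{x})])$, which is not spelled out in this transcript. The names `iterative_ib`, `decoder`, `kl_matrix`, and the floor `EPS` are illustrative choices, and every $x$ is assumed to have nonzero probability.

```python
import numpy as np

EPS = 1e-12  # numerical floor; an implementation choice, not part of the method

def kl_matrix(p, q):
    """d[i, j] = D_KL[p[i] || q[j]] in nats for row-stochastic arrays p and q."""
    p = np.clip(p, EPS, None)
    q = np.clip(q, EPS, None)
    return np.sum(p[:, None, :] * (np.log(p[:, None, :]) - np.log(q[None, :, :])), axis=2)

def decoder(pxy, enc):
    """Given the encoder p(xhat|x), apply the marginal, Bayes, and Markov steps."""
    px = pxy.sum(axis=1)                                # p(x), assumed > 0
    py_x = pxy / px[:, None]                            # p(y|x)
    p_xhat = np.clip(px @ enc, EPS, None)               # marginal p(xhat)
    px_xhat = (enc * px[:, None]).T / p_xhat[:, None]   # Bayes: p(x|xhat)
    py_xhat = px_xhat @ py_x                            # Markov: p(y|xhat)
    return p_xhat, py_x, py_xhat

def iterative_ib(pxy, n_codes, beta, n_iter=200, seed=0):
    """Alternate the decoder steps with the exponential encoder update."""
    pxy = np.asarray(pxy, dtype=float)
    rng = np.random.default_rng(seed)
    enc = rng.random((pxy.shape[0], n_codes))           # random soft start
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xhat, py_x, py_xhat = decoder(pxy, enc)
        logits = np.log(p_xhat)[None, :] - beta * kl_matrix(py_x, py_xhat)
        logits -= logits.max(axis=1, keepdims=True)     # stabilize the exponent
        enc = np.exp(logits)
        enc /= enc.sum(axis=1, keepdims=True)           # normalize over xhat
    p_xhat, _, py_xhat = decoder(pxy, enc)
    return enc, p_xhat, py_xhat

# Four values of x: the first two share p(y|x), as do the last two.
pxy = np.array([[0.225, 0.025],
                [0.225, 0.025],
                [0.025, 0.225],
                [0.025, 0.225]])
enc, p_xhat, py_xhat = iterative_ib(pxy, n_codes=2, beta=10.0)
print(np.round(enc, 3))
```

Values of $x$ with identical $p(y|x)$ receive identical encoder rows, which is the sense in which the distortion is small when $\hat{x}$ predicts $y$ as well as $x$ does. The result depends on the random start, so in practice several seeds are worth trying.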
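The information plane: the optimal $I(\hat{X};Y)$ for a given $I(X;\hat{X})$ is a concave function.

[Figure: the information-plane curve separating the "impossible" region from the "possible phase".]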

Manifold of relevance

Starting from the self-consistent equations and assuming a continuous manifold for $\hat{x}$, we get coupled (local in $\hat{x}$) eigenfunction equations, with $\beta$ as an eigenvalue.

Multivariate Information Bottleneck
• Complex relationship between many variables
• Multiple unrelated dimensionality reduction schemes
• Trade between known and desired dependencies
• Express IB in the language of Graphical Models
• Multivariate extension of Rate-Distortion Theory

Multivariate Information Bottleneck: extending the dependency graphs (multi-information)
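Multi-information is commonly defined as the KL divergence between a joint distribution and the product of its single-variable marginals, $I(X_1;\dots;X_n) = D_{KL}[\,p(x_1,\dots,x_n)\,\|\,\prod_i p(x_i)\,]$. The sketch below assumes that definition and a discrete joint stored as an n-dimensional NumPy array; the function name `multi_information` is illustrative.

```python
import numpy as np

def multi_information(p):
    """D_KL[p(x1..xn) || prod_i p(xi)] in bits for an n-dimensional joint table."""
    p = np.asarray(p, dtype=float)
    n = p.ndim
    product = np.ones_like(p)
    for axis in range(n):
        others = tuple(a for a in range(n) if a != axis)
        # Marginal of one variable, kept broadcastable against the full table.
        product = product * p.sum(axis=others, keepdims=True)
    mask = p > 0  # cells with zero probability contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / product[mask])))

# Three perfectly correlated fair bits: 3 bits of marginal entropy, 1 bit jointly.
p3 = np.zeros((2, 2, 2))
p3[0, 0, 0] = p3[1, 1, 1] = 0.5
print(multi_information(p3))
```

For two variables this quantity reduces to the ordinary mutual information $I(X;Y)$.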

Sufficient Dimensionality Reduction (with Amir Globerson)
• Exponential families have sufficient statistics
• Given a joint distribution $p(x,y)$, find an approximation of the exponential form:

This can be done by alternating maximization of Entropy under the constraints:

The resulting functions are our relevant features at rank d.

Summary
• We present a general information theoretic approach for extracting relevant information.
• It is a natural generalization of Rate-Distortion theory with similar convergence and optimality proofs.
• Unifies learning, feature extraction, filtering, and prediction...
• Applications (so far) include:
• Word sense disambiguation
• Document classification and categorization
• Spectral analysis
• Neural codes
• Bioinformatics,…
• Data clustering based on multi-distance distributions