# Minimum Information Inference


Minimum Information Inference. Naftali Tishby Amir Globerson ICNC, CSE The Hebrew University TAU, Jan. 2, 2005. Talk outline. Classification with probabilistic models: Generative vs. Discriminative The Minimum Information Principle Generalization error bounds


## Minimum Information Inference

Naftali Tishby

Amir Globerson

ICNC, CSE

The Hebrew University

TAU, Jan. 2, 2005

### Talk outline

• Classification with probabilistic models: Generative vs. Discriminative

• The Minimum Information Principle

• Generalization error bounds

• Game theoretic motivation

• Joint typicality

• The MinMI algorithms

• Empirical evaluations

• Related extensions: SDR and IB

### The Classification Problem

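• Learn how to classify (complex) observations X into (simple) classes Y

• Given labeled examples (xi,yi)

• Use them to construct a classifier y=g(x)

• What is a good classifier?

• Denote by p*(x,y) the true underlying law

• Want to minimize the generalization error, i.e. the probability of misclassifying a fresh example drawn from p*:

$$\epsilon_g = \Pr_{(x,y)\sim p^*}\left[\, g(x) \neq y \,\right]$$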

Problem: the generalization error can't be computed directly.

[Diagram: the truth p*(x,y) generates the observed sample (xi,yi), i=1…n, from which the classifier y=g(x) is learned.]

### Choosing a classifier

• Need to limit the search to some set of rules. If every rule is possible we will surely over-fit. Use a family $g_\theta(x)$ where $\theta$ is a parameter.

• Would be nice if the true rule is in $g_\theta(x)$

• How do we choose $\theta$ in $g_\theta(x)$?

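### Common approach: Empirical Risk Minimization

• A reasonable strategy. Find the classifier which minimizes the empirical (sample) error:

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[\, g_\theta(x_i) \neq y_i \,\right]$$

• Does not necessarily provide the best generalization, although theoretical bounds exist.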

• Computationally hard to minimize directly. Many works minimize upper bounds on the error.

• Here we focus on a different strategy.
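To make the empirical error concrete, here is a minimal sketch of empirical risk minimization over a one-parameter family. The threshold family `g_theta(x) = 1[x > theta]`, the toy data, and the function names are illustrative assumptions, not part of the talk.

```python
import numpy as np

def empirical_error(g, xs, ys):
    """Fraction of sample points that the classifier g mislabels."""
    preds = np.array([g(x) for x in xs])
    return float(np.mean(preds != ys))

def erm_threshold(xs, ys):
    """Pick theta minimizing the empirical error of g_theta(x) = 1[x > theta]."""
    s = np.sort(xs)
    # Candidate thresholds: midpoints between sorted points, plus both extremes.
    candidates = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    errors = [empirical_error(lambda x, t=t: int(x > t), xs, ys) for t in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Toy sample (hypothetical): one point is on the "wrong" side, so zero error is impossible.
xs = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
ys = np.array([0, 0, 1, 0, 1, 1])
theta, err = erm_threshold(xs, ys)
print(theta, err)
```

Exhaustive search works here only because the family has a single scalar parameter; for richer families the minimization is hard, which is the point made in the bullets above.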

### Probabilistic models for classification

• Had we known p*(x,y) the optimal predictor would be

$$g(x) = \arg\max_{y}\; p^*(y|x)$$

• But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.

### Generative Models

• Assume p(x|y) has some parametric form, e.g. a Gaussian.

• Each y has a different set of parameters $\theta_y$

• How do we estimate $\theta_y$, p(y)? Maximum Likelihood!

### Generative Models - Estimation

• Easy to see that p(y) should be set to the empirical frequency of the classes

• The parameters $\theta_y$ are obtained by collecting all x values for the class y, and generating a maximum likelihood estimate.

### Example: Gaussians

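• Assume the class conditional distribution is Gaussian

$$p(x|y) = \mathcal{N}(x;\, \mu_y, \sigma_y^2)$$

• Then $\mu_y, \sigma_y^2$ are the empirical mean and variance of the samples in class y.

[Figure: the class-conditional Gaussians for y=1 and y=2.]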
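The Gaussian example can be sketched in a few lines. This is an illustration under assumptions: one-dimensional x, toy data, and the function names `fit_gaussian_classes` and `predict` are made up for the example.

```python
import numpy as np

def fit_gaussian_classes(xs, ys):
    """Maximum likelihood estimates per class: (prior, mean, variance)."""
    params = {}
    for y in np.unique(ys):
        x_y = xs[ys == y]                      # all x values collected for class y
        # ndarray.var() defaults to the ML (biased, ddof=0) variance estimate.
        params[int(y)] = (len(x_y) / len(xs), x_y.mean(), x_y.var())
    return params

def predict(params, x):
    """Return argmax_y p(y) N(x; mu_y, var_y), computed in the log domain."""
    def log_joint(prior, mu, var):
        return np.log(prior) - 0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
    return max(params, key=lambda y: log_joint(*params[y]))

# Toy sample (hypothetical values).
xs = np.array([-1.2, -0.8, -1.0, 0.9, 1.1, 1.0, 1.4])
ys = np.array([1, 1, 1, 2, 2, 2, 2])
params = fit_gaussian_classes(xs, ys)
print(params, predict(params, -0.9), predict(params, 1.2))
```

Everything the classifier needs is an empirical mean of x or x² within a class, which is why estimation is in closed form.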

### Example: Naïve Bayes

• Say X=[X1,…,Xn] is an n dimensional observation

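• Assume the features are independent given the class:

$$p(x|y) = \prod_{i=1}^{n} p(x_i|y)$$

• Parameters are p(xi=k|y). Calculated by counting how many times xi=k in class y.

• These are empirical means of indicator functions (with $n_y$ the number of samples in class y and $x^{(j)}$ the j-th sample):

$$p(x_i = k \,|\, y) = \frac{1}{n_y}\sum_{j:\, y_j = y} \delta\!\left(x^{(j)}_i = k\right)$$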

• Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA).

• Estimation is easy. Closed form solutions in many cases (through empirical means).

• The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).

• Performance is not bad at all.
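The counting estimate for Naïve Bayes can be sketched as follows. The data, the array layout `cond[class, feature, value]`, and the function names are assumptions made for illustration; no smoothing is applied, so unseen values get probability zero.

```python
import numpy as np

def fit_naive_bayes(X, y, n_values):
    """Estimate p(y) and p(x_i = k | y) as empirical means of indicator functions."""
    classes = np.unique(y)
    prior = np.array([np.mean(y == c) for c in classes])
    # cond[c, i, k] = fraction of class-c samples with x_i == k
    cond = np.zeros((len(classes), X.shape[1], n_values))
    for ci, c in enumerate(classes):
        Xc = X[y == c]
        for k in range(n_values):
            cond[ci, :, k] = np.mean(Xc == k, axis=0)
    return classes, prior, cond

def posterior(x, prior, cond):
    """p(y | x) under the factorized model p(x|y) = prod_i p(x_i|y)."""
    lik = np.prod(cond[:, np.arange(len(x)), x], axis=1)   # shape: (n_classes,)
    joint = prior * lik
    return joint / joint.sum()

# Toy binary data (hypothetical): 6 samples, 2 features, 2 classes.
X = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
classes, prior, cond = fit_naive_bayes(X, y, n_values=2)
post = posterior(np.array([0, 1]), prior, cond)
print(post)
```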

### Discriminative Classifiers

• But, to classify we need only p(y|x).

Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.

• Assume a parametric form for p(y|x):

### Discriminative Models - Estimation

• Choose yto maximize conditional likelihood

• Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).

### Example: logistic regression

• Assume p(x|y) are Gaussians with different means and the same variance. Then

$$p(y|x) = \frac{1}{Z(x)}\, e^{a_y x + b_y}$$

• Goal is to estimate $a_y, b_y$

• This is called logistic regression, since the log of the distribution is linear in x (up to the normalization term).
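The claim that equal-variance Gaussians give a log-linear posterior can be checked numerically. Expanding the Gaussian exponent and dropping the term common to all classes gives $a_y = \mu_y/\sigma^2$ and $b_y = \log p(y) - \mu_y^2/(2\sigma^2)$. The sketch below assumes one-dimensional x; the parameter values and function names are arbitrary illustrations.

```python
import numpy as np

def posterior_from_gaussians(x, priors, mus, var):
    """Bayes rule with Gaussian class-conditionals sharing one variance."""
    joint = priors * np.exp(-(x - mus) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return joint / joint.sum()

def posterior_loglinear(x, priors, mus, var):
    """The same posterior written as exp(a_y * x + b_y) / Z(x)."""
    a = mus / var
    b = np.log(priors) - mus ** 2 / (2 * var)
    scores = a * x + b
    scores = scores - scores.max()     # stabilize the exponentials
    w = np.exp(scores)
    return w / w.sum()                 # dividing by the sum plays the role of Z(x)

# Arbitrary illustrative parameters.
priors = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
var = 1.5
for x in (-2.0, 0.0, 3.5):
    print(posterior_from_gaussians(x, priors, mus, var),
          posterior_loglinear(x, priors, mus, var))
```

The discriminative approach keeps this functional form but fits $a_y, b_y$ directly by maximizing the conditional likelihood instead of deriving them from estimated Gaussians.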

### Discriminative Naïve Bayes

• Assuming p(x|y) is in the Naïve Bayes class, the discriminative distribution is

$$p(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{n} \psi_i(x_i, y)\Big)$$

• Similar to Naïve Bayes, but the $\psi_i(x_i,y)$ functions are not distributions. This is why we need the additional normalization Z.

• Also called a conditional first order loglinear model.

• Estimates only the relevant distributions (important when X is very complex).

• Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).

• Can be shown to minimize an upper bound on the classification error.

### The best of both worlds…

• Generative models (often) employ empirical means which are easy and reliable to estimate.

• But they model each class separately so poor discrimination is obtained.

• We would like a discriminative approach based on empirical means.

### Learning from Expected values (observations, in physics)

• Assume we have some “interesting” observables $\vec{\phi}(x)$

• And we are given their sample empirical means for the different classes Y, e.g. the first two class-conditional moments: $E[x|y]$ and $E[x^2|y]$

• How can we use this information to build a classifier?

• Idea: Look for models which yield the observed expectations, but contain no other information.

### The MaxEnt approach

• The entropy H(X,Y) is a measure of uncertainty (and typicality!)

• Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …)

• “Least Committed” to the observations, most typical.

• Yields “nice” exponential forms:

$$p(x|y) = \frac{1}{Z_y}\, e^{\vec{\lambda}(y)\cdot\vec{\phi}(x)}$$

### Occam’s in Classification

• Minimum assumptions about X and Y imply independence.

• Because X behaves differently for different Y they cannot be independent

• How can we quantify their level of dependence ?

[Figure: class-conditional densities p(x|y=1) and p(x|y=2) along the X axis, with means m1 and m2.]

### Mutual Information (Shannon 48)

• The measure of the information shared by two variables

• X and Y are independent iff I(X;Y)=0

• Bounds the classification error (Hellman and Raviv 1970):

$$e_{\text{Bayes}} \le \tfrac{1}{2}\,\big(H(Y) - I(X;Y)\big)$$

• Why not minimize it subject to the observation constraints?
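As a small numerical illustration of the points above, the sketch below computes I(X;Y) in bits from a joint probability table and compares the Bayes error with the Hellman and Raviv bound. The joint tables and function names are made-up examples.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros are skipped)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table pxy[x, y]."""
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

def bayes_error(pxy):
    """Error of the optimal rule g(x) = argmax_y p(y|x)."""
    return float(1.0 - pxy.max(axis=1).sum())

# A dependent pair (hypothetical numbers) and an independent pair.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
indep = np.outer([0.3, 0.7], [0.5, 0.5])

mi = mutual_information(pxy)
bound = 0.5 * (entropy(pxy.sum(axis=0)) - mi)
print(mi, bayes_error(pxy), bound)
print(mutual_information(indep))
```

For the dependent table the Bayes error is 0.2, below the bound of roughly 0.36; for the independent table the information vanishes, as the second bullet states.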

### More for Mutual Information…

• I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.

• Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).

• Quantifies independence in a model free way

• Has a natural multivariate extension - I(X1,…,Xn).

### MinMI: Problem Setting

• Given a sample (x1,y1),…,(xn,yn)

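• For each y, calculate the empirical expected value of the observables $\vec{\phi}(X)$ (with $n_y$ the number of samples in class y):

$$\vec{a}(y) = \frac{1}{n_y}\sum_{i:\, y_i = y} \vec{\phi}(x_i)$$

• Calculate the empirical marginal p(y)

• Find the minimum Mutual Information distribution with the given empirical expected values

• The value of the minimum information is precisely the information in the observations!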
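The inputs to MinMI are just per-class sample means. A minimal sketch, assuming the first two moments as observables (the feature map `phi`, the data, and the function name are illustrative choices):

```python
import numpy as np

def empirical_constraints(xs, ys, phi):
    """Return the classes, the empirical p(y), and a(y) = mean of phi(x) within class y."""
    classes = np.unique(ys)
    p_y = np.array([np.mean(ys == c) for c in classes])
    a = np.array([np.mean([phi(x) for x in xs[ys == c]], axis=0) for c in classes])
    return classes, p_y, a

def phi(x):
    """Observables: first two moments of a scalar x (an assumption for this example)."""
    return np.array([x, x ** 2])

# Toy sample (hypothetical values).
xs = np.array([0.0, 2.0, 1.0, 3.0, 5.0])
ys = np.array([1, 1, 1, 2, 2])
classes, p_y, a = empirical_constraints(xs, ys, phi)
print(p_y)
print(a)
```

Row `a[j]` holds the constraint vector for class `classes[j]`; nothing else about the sample is passed on to the inference step.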

### MinMI Formulation

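• The (convex) set of constraints, with $\vec{\phi}(x)$ the observables, $\vec{a}(y)$ their empirical class-conditional means and $\bar{p}(y)$ the empirical marginal:

$$F(\vec{a}) = \Big\{\, p(x,y) \;:\; \sum_x p(x|y)\,\vec{\phi}(x) = \vec{a}(y),\;\; p(y) = \bar{p}(y) \;\;\forall y \,\Big\}$$

• The information minimizing distribution

$$p_{MI} = \arg\min_{p \,\in\, F(\vec{a})} I[p(x,y)]$$

• A convex problem. No local minima!

[Figure: schematic marking the distributions p and pMI.]

• The problem is convex given p(y) for any empirical means, without specifying p(x).

• The minimization generates an auxiliary sparse pMI(x): support alignments.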

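### Characterizing pMI

• The solution form

$$p_{MI}(x|y) = p_{MI}(x)\, e^{\vec{\phi}(x)\cdot\vec{\lambda}(y) + \gamma(y)}$$

• Where $\vec{\lambda}(y)$ are Lagrange multipliers and $\gamma(y)$ is set by normalization

• Via Bayes

$$p_{MI}(y|x) = p(y)\, e^{\vec{\phi}(x)\cdot\vec{\lambda}(y) + \gamma(y)}$$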

• Can be used for classification. But how do we find it?

### Careful… I cheated…

• What if pMI(x)=0 ?

• No legal pMI(y|x) …

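• But we can still define:

$$f_{MI}(y|x) = p(y)\, e^{\vec{\phi}(x)\cdot\vec{\lambda}(y) + \gamma(y)}$$

• Can show that it is subnormalized:

$$\sum_y f_{MI}(y|x) \le 1$$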

• And use f(y|x) for classification!

• Solutions are actually very sparse. Many pMI(x) are zero. “Support Assignments”…

### A dual formulation

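• Using convex duality we can show that MinMI can be formulated as

$$I_{\min} = \max_{\vec{\lambda}(y),\,\gamma(y)} \;\sum_y p(y)\,\big(\vec{\lambda}(y)\cdot\vec{a}(y) + \gamma(y)\big) \quad \text{s.t.} \quad \sum_y p(y)\, e^{\vec{\phi}(x)\cdot\vec{\lambda}(y) + \gamma(y)} \le 1 \;\;\forall x$$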

• Called a geometric program

• Strict inequalities for x such that p(x)=0

• Avoids dealing with p(x) at all!

[Figure labels: -log2 fMI(y|x) and fMI(y|x).]

### A generalization bound

• If the estimated means are equal to their true expected values, we can show that the generalization error satisfies


### A Game Theoretic Interpretation

• Among all distributions in F(a), why choose MinMI?

• The MinMI classifier minimizes the worst case loss in the class

• The loss is an upper bound on generalization error

• Minimize a worst case upper bound

### MinMI and Joint Typicality

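Given a sequence $x^n = (x_1,\dots,x_n)$, the probability that another independently drawn sequence $y^n = (y_1,\dots,y_n)$ appears to be drawn together with it from their joint distribution (i.e. that the pair is jointly typical) is asymptotically

$$2^{-n\, I(X;Y)}$$

This suggests Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.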

### I-Projections (Csiszar 75, Amari 82,…)

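• The I-projection of a distribution q(x) on a set F

$$p^* = \arg\min_{p \,\in\, F} D_{KL}[\,p \,\|\, q\,]$$

• For a set defined by linear constraints, $F = \{\, p : \sum_x p(x)\,\vec{\phi}(x) = \vec{a} \,\}$:

$$p^*(x) = \frac{1}{Z}\, q(x)\, e^{\vec{\lambda}\cdot\vec{\phi}(x)}$$

• Can be calculated using Generalized Iterative Scaling or Gradient methods.

Looks familiar?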

### The MinMI Algorithm

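• Initialize $p_0(x)$

• Iterate

• For all y: Set $p_{t+1}(x|y)$ to be the I-projection of $p_t(x)$ on $F_y = \{\, p(x) : \sum_x p(x)\,\vec{\phi}(x) = \vec{a}(y) \,\}$

• Marginalize: $p_{t+1}(x) = \sum_y p(y)\, p_{t+1}(x|y)$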
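The iteration above can be sketched for a toy problem. Several simplifying assumptions are made here and are not from the talk: X takes five discrete values, the single observable is φ(x) = x, the starting p(x) is uniform, and each I-projection is found by bisection on its scalar multiplier (a stand-in for Generalized Iterative Scaling or gradient methods). All function names are hypothetical.

```python
import numpy as np

def i_project(q, x_vals, target, n_steps=100):
    """I-projection of q onto {p : E_p[x] = target} for the scalar observable phi(x) = x.

    The projection has the form p(x) proportional to q(x) * exp(lam * x). E_p[x] is
    increasing in lam, so lam is located by bisection (an illustrative choice).
    """
    lo, hi = -50.0, 50.0
    for _ in range(n_steps):
        lam = 0.5 * (lo + hi)
        s = lam * x_vals
        w = q * np.exp(s - s.max())            # subtract the max for numerical stability
        p = w / w.sum()
        if p @ x_vals < target:
            lo = lam
        else:
            hi = lam
    return p

def min_mi(x_vals, a, p_y, n_iter=500):
    """Alternate per-class I-projections with marginalization over y."""
    p_x = np.full(len(x_vals), 1.0 / len(x_vals))     # initialize p_0(x), uniform here
    for _ in range(n_iter):
        # For all y: p(x|y) <- I-projection of the current p(x) onto the class-y constraints
        p_x_given_y = np.array([i_project(p_x, x_vals, a_y) for a_y in a])
        # Marginalize: p(x) <- sum_y p(y) p(x|y)
        p_x = p_y @ p_x_given_y
    return p_x_given_y, p_x

def mutual_information_bits(p_x_given_y, p_y):
    """I(X;Y) in bits for class-conditionals p(x|y) and marginal p(y)."""
    p_x = p_y @ p_x_given_y
    total = 0.0
    for py, row in zip(p_y, p_x_given_y):
        m = row > 0
        total += py * np.sum(row[m] * np.log2(row[m] / p_x[m]))
    return float(total)

# Toy problem (hypothetical): x in {0,...,4}, class means 1 and 3, equal class priors.
x_vals = np.arange(5.0)
a = np.array([1.0, 3.0])
p_y = np.array([0.5, 0.5])
p_x_given_y, p_x = min_mi(x_vals, a, p_y)
mi = mutual_information_bits(p_x_given_y, p_y)
print(np.round(p_x, 4), mi)
```

In this toy case the marginal p(x) concentrates on the two extreme values of x, in line with the sparsity of pMI(x) noted earlier, and the information settles near 0.19 bits.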

### Example: Two moments

• Observations are class conditional mean and variance.

• MaxEnt solution would be p(X|y) a Gaussian.

• MinMI solutions are far from Gaussians and discriminate much better.

MaxEnt

MinMI

### Example: Conditional Marginals

• Recall that in Naïve Bayes we used the empirical means of the indicator functions $\delta(x_i = k)$ within each class.

• Can use these means for MinMI.

### Naïve Bayes Analogs

Naïve Bayes

Discriminative 1st Order LogLinear

### Experiments

• 12 UCI datasets, discrete features only; used singleton marginal constraints.

• Compared to Naïve Bayes and 1st order LogLinear model.

• Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.

### Related ideas

• Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)

• Efficient representations of X with respect to Y: the Information Bottleneck approach.

• Bounding the information in neural codes from very sparse statistics.

• Statistical extension of Support Vector Machines.

### Conclusions

• MinMI outperforms the discriminative model for small sample sizes.

• Outperforms the generative model.

• Presented a method for inferring classifiers based on simple sample means.

• Unlike generative models, provides generalization guarantees.