Minimum information inference

Minimum Information Inference. Naftali Tishby Amir Globerson ICNC, CSE The Hebrew University TAU, Jan. 2, 2005. Talk outline. Classification with probabilistic models: Generative vs. Discriminative The Minimum Information Principle Generalization error bounds


Minimum Information Inference

Naftali Tishby

Amir Globerson

ICNC, CSE

The Hebrew University

TAU, Jan. 2, 2005

Talk outline

  • Classification with probabilistic models: Generative vs. Discriminative

  • The Minimum Information Principle

    • Generalization error bounds

    • Game theoretic motivation

    • Joint typicality

  • The MinMI algorithms

  • Empirical evaluations

  • Related extensions: SDR and IB

The Classification Problem

  • Learn how to classify (complex) observations X into (simple) classes Y

    • Given labeled examples (x_i, y_i)

    • Use them to construct a classifier y = g(x)

  • What is a good classifier?

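    • Denote by p*(x,y) the true underlying law

    • Want to minimize the generalization error: $e(g) = \sum_{x,y} p^*(x,y)\,\mathbf{1}[g(x) \neq y]$

Problem …

  • The generalization error cannot be computed directly, because p*(x,y) is unknown

  • All we have is the sample (x_i, y_i), i=1…n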




Choosing a classifier

  • Need to limit search to some set of rules. If every rule is possible we will surely over-fit. Use a family g_θ(x), where θ is a parameter.

  • Would be nice if the true rule is in g_θ(x)

  • How do we choose θ in g_θ(x)?

Common approach: Empirical Risk Minimization

  • A reasonable strategy. Find the classifier which minimizes the empirical (sample) error: $\hat{e}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[g_\theta(x_i) \neq y_i]$

  • Does not necessarily provide the best generalization, although theoretical bounds exist.

  • Computationally hard to minimize directly. Many works minimize upper bounds on the error.

  • Here we focus on a different strategy.
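Empirical risk minimization can be made concrete on a toy problem. The sketch below assumes a hypothetical family of one-dimensional threshold rules (not taken from the slides) and searches it exhaustively; for richer families such a search is exactly what becomes computationally hard.

```python
def empirical_error(threshold, sign, xs, ys):
    """Fraction of sample points misclassified by the rule
    g(x) = 1 if sign * (x - threshold) > 0 else 0."""
    mistakes = 0
    for x, y in zip(xs, ys):
        pred = 1 if sign * (x - threshold) > 0 else 0
        mistakes += (pred != y)
    return mistakes / len(xs)


def erm_threshold(xs, ys):
    """Exhaustive ERM over the (assumed) family of 1-D threshold rules."""
    pts = sorted(set(xs))
    # Candidate thresholds: one below all points, plus midpoints between neighbours.
    candidates = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for t in candidates:
        for s in (+1, -1):
            err = empirical_error(t, s, xs, ys)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best


# Toy sample (made up for illustration); it is not separable by a threshold.
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
ys = [0, 0, 1, 0, 1, 1]
err, theta, sign = erm_threshold(xs, ys)
```

The minimizer only tells us about the sample error; nothing in the search itself controls the generalization error.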

Probabilistic models for classification

  • Had we known p*(x,y) the optimal predictor would be $g(x) = \arg\max_y p^*(y|x)$

  • But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.

Generative Models

  • Assume p(x|y) has some parametric form, e.g. a Gaussian.

  • Each y has a different set of parameters θ_y

  • How do we estimate θ_y, p(y)? Maximum Likelihood!

Generative Models - Estimation

  • Easy to see that p(y) should be set to the empirical frequency of the classes

  • The parameters θ_y are obtained by collecting all x values for the class y, and generating a maximum likelihood estimate.

Example: Gaussians

  • Assume the class conditional distribution is Gaussian: $p(x|y) = \mathcal{N}(x;\, \mu_y, \sigma_y^2)$

  • Then μ_y and σ_y² are the empirical mean and variance of the samples in class y.
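A minimal sketch of the Gaussian generative classifier, assuming scalar observations and a toy sample made up for illustration. The function names are hypothetical. Estimation is just per-class empirical means and variances; prediction plugs the estimates into Bayes' rule.

```python
import math
from collections import defaultdict


def fit_gaussian_classes(xs, ys):
    """Maximum likelihood estimates: p(y) plus per-class mean and variance."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    params = {}
    for y, vals in groups.items():
        mu = sum(vals) / len(vals)
        # The ML variance estimate divides by the class count (not count - 1).
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        params[y] = (len(vals) / n, mu, var)
    return params


def log_joint(x, prior, mu, var):
    """log p(y) + log N(x; mu, var)."""
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mu) ** 2 / (2 * var))


def predict(x, params):
    """Plug-in Bayes rule: pick the class maximizing p(y) p(x|y)."""
    return max(params, key=lambda y: log_joint(x, *params[y]))


xs = [1.0, 2.0, 3.0, 6.0, 8.0]
ys = [0, 0, 0, 1, 1]
params = fit_gaussian_classes(xs, ys)
```

Each class needs at least two distinct values here, otherwise the estimated variance is zero and the log density is undefined.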



Example: Naïve Bayes

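  • Say X=[X_1,…,X_n] is an n-dimensional observation

  • Assume: $p(x|y) = \prod_{i=1}^{n} p(x_i|y)$

  • Parameters are p(x_i=k|y). Calculated by counting how many times x_i=k in class y.

  • Equivalently, these are the empirical means of the indicator functions δ(x_i=k) over the samples of class y.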
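The counting estimate can be sketched as follows, assuming discrete features stored as tuples. The `eps` floor for values never seen in a class is an assumption added here to keep the logarithm finite; it is not part of the slides.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(X, ys):
    """Estimate p(y) and p(x_i = k | y) as empirical means of indicators."""
    class_counts = Counter(ys)
    feature_counts = defaultdict(Counter)  # (y, i) -> counts of each value k
    for x, y in zip(X, ys):
        for i, k in enumerate(x):
            feature_counts[(y, i)][k] += 1
    prior = {y: c / len(ys) for y, c in class_counts.items()}
    cond = {key: {k: c / class_counts[key[0]] for k, c in cnt.items()}
            for key, cnt in feature_counts.items()}
    return prior, cond


def predict(x, prior, cond, eps=1e-9):
    """argmax_y p(y) * prod_i p(x_i | y), computed in log space.
    eps is an assumed floor for values unseen in a class."""
    def score(y):
        return math.log(prior[y]) + sum(
            math.log(cond[(y, i)].get(k, eps)) for i, k in enumerate(x))
    return max(prior, key=score)


# Toy sample (made up): two binary features, two classes.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]
ys = ["a", "a", "b", "b", "b"]
prior, cond = fit_naive_bayes(X, ys)
```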

Generative Classifiers: Advantages

  • Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA).

  • Estimation is easy. Closed form solutions in many cases (through empirical means).

  • The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).

  • Performance is not bad at all.

Discriminative Classifiers

  • But, to classify we need only p(y|x).

    Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.

  • Assume a parametric form for p(y|x):

Discriminative Models - Estimation

  • Choose yto maximize conditional likelihood

  • Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).

Example: logistic regression

  • Assume p(x|y) are Gaussians with different means and same variances. Then $p(y|x) = \frac{1}{Z(x)} \exp(a_y \cdot x + b_y)$

  • Goal is to estimate a_y, b_y

  • This is called logistic regression, since the log of the distribution is linear in x (up to the normalization Z(x)).
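A minimal sketch of the iterative estimation, assuming two classes and a scalar x, so that only the differences a = a_1 - a_0 and b = b_1 - b_0 matter and p(y=1|x) is a sigmoid. Plain gradient ascent with a fixed step size is an illustrative choice; the step size and iteration count are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conditional_log_likelihood(a, b, xs, ys):
    """sum_i log p(y_i | x_i) under p(y=1|x) = sigmoid(a*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p1 = sigmoid(a * x + b)
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on the (concave) conditional log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # d/da and d/db of the log-likelihood: sum of residuals (times x).
        grad_a = sum((y - sigmoid(a * x + b)) * x for x, y in zip(xs, ys))
        grad_b = sum((y - sigmoid(a * x + b)) for x, y in zip(xs, ys))
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


# Toy sample (made up), deliberately not separable so the optimum is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
a, b = fit_logistic(xs, ys)
```

Unlike the generative estimates, there is no closed form here: the parameters are reached only in the limit of the iteration.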

Discriminative Naïve Bayes

  • Assuming p(x|y) is in Naïve Bayes class, the discriminative distribution is

  • Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z.

  • Also called a conditional first order loglinear model.

Discriminative: Advantages

  • Estimates only the relevant distributions

    (important when X is very complex).

  • Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).

  • Can be shown to minimize an upper bound on the classification error.

The best of both worlds…

  • Generative models (often) employ empirical means which are easy and reliable to estimate.

  • But they model each class separately so poor discrimination is obtained.

  • We would like a discriminative approach based on empirical means.

Learning from Expected values (observations, in physics)

  • Assume we have some “interesting” observables: $\vec{\phi}(x) = [\phi_1(x), \dots, \phi_d(x)]$

  • And we are given their empirical sample means for the different classes Y, e.g. the first two moments of each class.

  • How can we use this information to build a classifier?

  • Idea: Look for models which yield the observed expectations, but contain no other information.

The MaxEnt approach

  • The Entropy H(X,Y) is a measure of uncertainty

    (and typicality!)

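  • Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …)

  • “Least Committed” to the observations, most typical.

  • Yields “nice” exponential forms: $p(x|y) \propto \exp\big(\vec{\lambda}(y) \cdot \vec{\phi}(x)\big)$, with φ(x) the observables and λ(y) Lagrange multipliers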

Occam’s in Classification

  • Minimum assumptions about X and Y imply independence.

  • Because X behaves differently for different Y they cannot be independent.

  • How can we quantify their level of dependence?






Mutual Information (Shannon 48)

  • The measure of the information shared by two variables: $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$

  • X and Y are independent iff I(X;Y)=0

  • Bounds the classification error:

    e_Bayes ≤ 0.5 (H(Y) − I(X;Y)) (Hellman and Raviv 1970).

  • Why not minimize it subject to the observation constraints?
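The quantities on this slide are easy to check numerically. The sketch below uses a made-up 2x2 joint table and computes I(X;Y) in bits, the Bayes error, and the Hellman-Raviv bound 0.5 (H(Y) − I(X;Y)), which equals half the conditional entropy H(Y|X).

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def bayes_error(joint):
    """Error of the optimal rule g(x) = argmax_y p(y|x)."""
    return 1.0 - sum(max(row) for row in joint)


# Toy joint distribution (made up): rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
py = [sum(col) for col in zip(*joint)]
mi = mutual_information(joint)
bound = 0.5 * (entropy(py) - mi)
independent = [[0.25, 0.25],
               [0.25, 0.25]]
```

For the independent table the mutual information is zero and the bound reduces to 0.5 H(Y).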

More for Mutual Information…

  • I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.

  • Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).

  • Quantifies independence in a model free way

  • Has a natural multivariate extension - I(X1,…,Xn).

MinMI: Problem Setting

  • Given a sample (x1,y1),…,(xn,yn)

  • For each y, calculate the empirical expected value a(y) of φ(X)

  • Calculate empirical marginal p(y)

  • Find the minimum Mutual Information distribution with the given empirical expected values

  • The value of the minimum information is precisely the information in the observations!
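The inputs to MinMI are only the class marginal and the class-conditional means of the observables. A minimal sketch of computing them, assuming scalar x and the first two moments as observables; the function `phi` and the names `p_y` and `a` are illustrative choices.

```python
from collections import defaultdict


def phi(x):
    """Assumed observables: the first two moments, phi(x) = (x, x^2)."""
    return (x, x * x)


def empirical_constraints(xs, ys):
    """Return the empirical marginal p(y) and the class-conditional
    empirical means a(y) of phi(X)."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        f = phi(x)
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, v in enumerate(f):
            acc[d] += v
        counts[y] += 1
    n = len(xs)
    p_y = {y: c / n for y, c in counts.items()}
    a = {y: tuple(s / counts[y] for s in sums[y]) for y in sums}
    return p_y, a


# Toy sample (made up for illustration).
xs = [1.0, 2.0, 3.0, 6.0, 8.0]
ys = [0, 0, 0, 1, 1]
p_y, a = empirical_constraints(xs, ys)
```

Everything downstream depends on the sample only through `p_y` and `a`.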

MinMI Formulation

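  • The (convex) set of constraints: $F(a) = \{\, p(x,y) : \sum_x p(x|y)\,\vec{\phi}(x) = \vec{a}(y) \ \forall y,\ \sum_x p(x,y) = p(y) \,\}$

  • The information minimizing distribution: $p_{MI}(x,y) = \arg\min_{p \in F(a)} I[p(x,y)]$

  • A convex problem. No local minima!



  • The problem is convex given p(y) for any empirical means, without specifying p(x).

  • The minimization generates an auxiliary sparse p_MI(x): support alignments.


  • The solution form: $p_{MI}(x|y) = p_{MI}(x)\, \exp\big(\vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y)\big)$

  • Where ψ(y) are Lagrange multipliers and γ(y) is set by the normalization of p_MI(x|y)

  • Via Bayes: $p_{MI}(y|x) = p(y)\, \exp\big(\vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y)\big)$

  • Can be used for classification. But how do we find it?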

Careful… I cheated…

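  • What if p_MI(x)=0?

  • No legal p_MI(y|x) …

  • But we can still define, for every x and with the Lagrange multipliers ψ(y), γ(y) of the solution: $f_{MI}(y|x) = p(y)\, \exp\big(\vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y)\big)$

  • Can show that it is subnormalized: $\sum_y f_{MI}(y|x) \le 1$

  • And use f_MI(y|x) for classification!

  • Solutions are actually very sparse. Many p_MI(x) are zero. “Support Assignments”…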

A dual formulation

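  • Using convex duality we can show that MinMI can be formulated as: $\max_{\psi,\gamma} \sum_y p(y)\,\big[\vec{\psi}(y) \cdot \vec{a}(y) + \gamma(y)\big]$ subject to $\sum_y p(y)\, \exp\big(\vec{\phi}(x) \cdot \vec{\psi}(y) + \gamma(y)\big) \le 1$ for all x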

  • Called a geometric program

  • Strict inequalities for x such that p(x)=0

  • Avoids dealing with p(x) at all!

-log2 fMI(y|x)


A generalization bound

  • If the estimated means are equal to their true expected values, we can show that the generalization error satisfies


A Game Theoretic Interpretation

  • Among all distributions in F(a), why choose MinMI?

  • The MinMI classifier minimizes the worst case loss in the class

  • The loss is an upper bound on generalization error

  • Minimize a worst case upper bound

MinMI and Joint Typicality

Given a sequence $x^n = (x_1,\dots,x_n)$, the probability that another independently drawn sequence $y^n = (y_1,\dots,y_n)$ appears to be drawn from their joint distribution

is asymptotically $2^{-n I(X;Y)}$,

suggesting Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.

I-Projections (Csiszar 75, Amari 82,…)

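  • The I-projection of a distribution q(x) on a set F: $p^* = \arg\min_{p \in F} D_{KL}[p \,\|\, q]$

  • For a set defined by linear constraints, $F = \{\, p : \sum_x p(x)\,\vec{\phi}(x) = \vec{a} \,\}$, the projection has the exponential form $p^*(x) = \frac{1}{Z}\, q(x) \exp\big(\vec{\lambda} \cdot \vec{\phi}(x)\big)$

  • Can be calculated using Generalized Iterative Scaling or Gradient methods.

Looks Familiar?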
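A minimal sketch of an I-projection onto a single expectation constraint, assuming a finite X and one scalar observable. Instead of Generalized Iterative Scaling or gradient methods it uses plain bisection on the multiplier, which works here because the mean of the exponentially tilted distribution is monotone in the multiplier. The search range [-50, 50] is an arbitrary assumption.

```python
import math


def tilt(q, phi, lam):
    """Exponentially tilted distribution p(x) proportional to q(x) exp(lam * phi(x))."""
    m = max(lam * f for f in phi)  # subtract the max exponent for stability
    w = [qx * math.exp(lam * f - m) for qx, f in zip(q, phi)]
    z = sum(w)
    return [wx / z for wx in w]


def mean(p, phi):
    return sum(px * f for px, f in zip(p, phi))


def i_projection(q, phi, a, lo=-50.0, hi=50.0, iters=200):
    """I-projection of q onto {p : E_p[phi] = a}, by bisection on lam.
    Assumes a lies strictly between the smallest and largest phi value
    on the support of q."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilt(q, phi, mid), phi) < a:
            lo = mid
        else:
            hi = mid
    return tilt(q, phi, 0.5 * (lo + hi))


# Toy example (made up): project the uniform distribution on {0,...,4}
# onto the set of distributions with mean 3.
q = [0.2] * 5
phi = [0.0, 1.0, 2.0, 3.0, 4.0]
p = i_projection(q, phi, 3.0)
```

Points where q(x) = 0 stay at zero after the projection, which is why the support of the starting distribution matters.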

The MinMI Algorithm

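  • Initialize p_0(x)

  • Iterate

    • For all y: Set p_{t+1}(x|y) to be the I-projection of p_t(x) on the set {p(x) : Σ_x p(x) φ(x) = a(y)}

    • Marginalize: p_{t+1}(x) = Σ_y p(y) p_{t+1}(x|y)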
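The iteration can be sketched for a small discrete problem. The sketch assumes a finite X, one scalar observable, a uniform initialization, and I-projections computed by bisection (an illustrative substitute for iterative scaling). The constraint values are made up. It records the mutual information after every sweep so that its monotone decrease can be inspected.

```python
import math


def tilt(q, phi, lam):
    """p(x) proportional to q(x) exp(lam * phi(x)), computed stably."""
    m = max(lam * f for f in phi)
    w = [qx * math.exp(lam * f - m) for qx, f in zip(q, phi)]
    z = sum(w)
    return [wx / z for wx in w]


def i_projection(q, phi, a, lo=-50.0, hi=50.0, iters=200):
    """I-projection of q onto {p : E_p[phi] = a}, by bisection on lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(p * f for p, f in zip(tilt(q, phi, mid), phi)) < a:
            lo = mid
        else:
            hi = mid
    return tilt(q, phi, 0.5 * (lo + hi))


def mutual_information(p_y, cond):
    """I(X;Y) in bits from p(y) and conditionals cond[y][x] = p(x|y)."""
    n_x = len(next(iter(cond.values())))
    p_x = [sum(p_y[y] * cond[y][i] for y in p_y) for i in range(n_x)]
    return sum(p_y[y] * c * math.log2(c / p_x[i])
               for y in p_y for i, c in enumerate(cond[y]) if c > 0)


def min_mi(phi, p_y, a, iters=50):
    """Alternate per-class I-projections with marginalization."""
    p_x = [1.0 / len(phi)] * len(phi)  # initialize p_0(x), uniform by assumption
    history = []
    for _ in range(iters):
        # For all y: project the current p(x) onto the class constraint set.
        cond = {y: i_projection(p_x, phi, a[y]) for y in p_y}
        # Marginalize to get the next p(x).
        p_x = [sum(p_y[y] * cond[y][i] for y in p_y) for i in range(len(phi))]
        history.append(mutual_information(p_y, cond))
    return cond, p_x, history


# Toy problem (made up): X in {0,...,4}, two equiprobable classes whose
# conditional means of phi(x) = x are constrained to 1 and 3.
phi = [0.0, 1.0, 2.0, 3.0, 4.0]
p_y = {0: 0.5, 1: 0.5}
a = {0: 1.0, 1: 3.0}
cond, p_x, history = min_mi(phi, p_y, a)
```

Each sweep keeps the constraints satisfied while the information in `history` never increases, and p(x) tends to concentrate on a few points.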


Example: Two moments

  • Observations are class conditional mean and variance.

  • MaxEnt solution would be p(X|y) a Gaussian.

  • MinMI solutions are far from Gaussians and discriminate much better.



Example: Conditional Marginals

  • Recall in Naïve Bayes we used the empirical means of the indicator functions δ(x_i=k)

  • Can use these means for MinMI.

Naïve Bayes Analogs

Naïve Bayes

Discriminative 1st Order LogLinear


  • 12 UCI datasets, discrete features only. Used singleton marginal constraints.

  • Compared to Naïve Bayes and 1st order LogLinear model.

  • Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.

Generalization error for full sample

Related ideas

  • Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)

  • Efficient representations of X with respect to Y:

    The Information Bottleneck approach.

  • Bounding the information in neural codes from very sparse statistics.

  • Statistical extension of Support Vector Machines.


  • MinMI outperforms discriminative model for small sample sizes

  • Outperforms generative model.

  • Presented a method for inferring classifiers based on simple sample means.

  • Unlike generative models, provides generalization guarantees.
