Minimum information inference
1 / 43

Minimum Information Inference - PowerPoint PPT Presentation

  • Uploaded on

Minimum Information Inference. Naftali Tishby Amir Globerson ICNC, CSE The Hebrew University TAU, Jan. 2, 2005. Talk outline. Classification with probabilistic models: Generative vs. Discriminative The Minimum Information Principle Generalization error bounds

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Minimum Information Inference' - vonda

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Minimum information inference

Minimum Information Inference

Naftali Tishby

Amir Globerson


The Hebrew University

TAU, Jan. 2, 2005

Talk outline
Talk outline

  • Classification with probabilistic models: Generative vs. Discriminative

  • The Minimum Information Principle

    • Generalization error bounds

    • Game theoretic motivation

    • Joint typicality

  • The MinMI algorithms

  • Empirical evaluations

  • Related extensions: SDR and IB

The classification problem
The Classification Problem

  • Learn how to classify (complex) observationsX into (simple) classes Y

    • Given labeled examples (xi,yi)

    • Use them to construct a classifiery=g(x)

  • What is a good classifier?

    • Denote by p *(x,y) the true underlying law

    • Want to minimize the generalization error


Generalization – Can’t be computed directly 



(xi,yi), i=1…n




Choosing a classifier
Choosing a classifier

  • Need to limit search to some set of rules. If every rule is possible we will surely over-fit. Use a family g(x) where  is a parameter.

  • Would be nice if the true rule is in g(x)

  • How do we choose in g(x) ?

Common approach empirical risk minimization
Common approach:Empirical Risk Minimization

  • A reasonable strategy. Find the classifier which minimizes the empirical (sample) error:

  • Not necessarily provides the best generalization, although theoretical bounds exist.

  • Computationally hard to minimize directly. Many works minimize upper bounds on the error.

  • Here we focus on a different strategy.

Probabilistic models for classification
Probabilistic models for classification

  • Had we known p*(x,y) the optimal predictor would be

  • But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.

Generative models
Generative Models

  • Assume p(x|y) has some parametric form, e.g. a Gaussian.

  • Each y has a different set of parameters y

  • How do we estimate y, p(y) ? Maximum Likelihood!

Generative models estimation
Generative Models -Estimation

  • Easy to see that p(y) should be set to the empirical frequency of the classes

  • The parameters yobtained by collecting all x values for the class y, and generating a maximum likelihood estimate.

Example gaussians
Example: Gaussians

  • Assume the class conditional distribution is Gaussian

  • Then are the empirical mean

    and variance of the samples in class y.



Example na ve bayes
Example: Naïve Bayes

  • Say X=[X1,…,Xn] is an n dimensional observation

  • Assume:

  • Parameters are p(xi=k|y). Calculated by counting how many times xi=k in class y.

  • Empirical means of

    indicator functions:

Generative classifiers advantages
Generative Classifiers: Advantages

  • Sometimes it makes sense to assume a generation process for p(x|y)(e.g. speech or DNA).

  • Estimation is easy. Closed form solutions in many cases (through empirical means).

  • The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).

  • Performance is not bad at all.

Discriminative classifiers
Discriminative Classifiers

  • But, to classify we need onlyp(y|x).

    Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.

  • Assume a parametric form for p(y|x):

Discriminative models estimation
Discriminative Models - Estimation

  • Choose yto maximize conditional likelihood

  • Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).

Example logistic regresion
Example: logistic regresion

  • Assume p(x|y) are Gaussians with different means and same variances. Then

  • Goal is to estimate ay,by

  • This is called logistic regression. Since the log of the distribution is linear in x

Discriminative na ve bayes
DiscriminativeNaïve Bayes

  • Assuming p(x|y) is in Naïve Bayes class, the discriminative distribution is

  • Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z.

  • Also called a conditional first order loglinear model .

Discriminative advantages
Discriminative: Advantages

  • Estimates only the relevant distributions

    (important when X is very complex).

  • Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).

  • Can be shown to minimize an upper bound on the classification error.

The best of both worlds
The best of both worlds…

  • Generative models (often) employ empirical means which are easy and reliable to estimate.

  • But they model each class separately so poor discriminationis obtained.

  • We would like a discriminative approach based on empirical means.

Learning from expected values observations in physics
Learning from Expected values(observations, in physics)

  • Assume we have some “interesting” observables:

  • And we are given their sample empirical means for different classes Y, e.g. class two moments:

  • How can we use this information to build a classifier?

  • Idea: Look for models which yield the observed expectations, but contain no other information.

The maxent approach
The MaxEnt approach

  • The Entropy H(X,Y) is a measure of uncertainty

    (and typicality!)

  • Find the distribution with the given empirical means andmaximum joint entropy H(X,Y) (Jaynes 57, …)

  • “Least Committed” to the observations, most typical.

  • Yield “nice” exponential forms:

Occam s in classification
Occam’s in Classification

  • Minimum assumptions about X and Y imply independence.

  • Because X behaves differently for different Y they cannot be independent

  • How can we quantify their level of dependence ?






Mutual information shannon 48
Mutual Information (Shannon 48)

  • The measure of the information shared by two variables

  • X and Y are independent iff I(X;Y)=0

  • Bounds the classification error:

    eBayes<0.5(H(Y)-I(X;Y)). (Hellman and Raviv 1970).

  • Why not minimizeit subject to the observation constraints?

More for mutual information
More for Mutual Information…

  • I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.

  • Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).

  • Quantifies independence in a model free way

  • Has a natural multivariate extension - I(X1,…,Xn).

Minmi problem setting
MinMI: Problem Setting

  • Given a sample (x1,y1),…,(xn,yn)

  • For each y, calculate the

    expected value of (X)

  • Calculate empirical marginal p(y)

  • Find the minimum Mutual Information distribution with the given empirical expected values

  • The valueof the minimum information is precisely the information in the observations!

Minmi formulation
MinMI Formulation

  • The (convex) set of constraints

  • The information minimizing distribution

  • A convex problem. No local minima!



  • The problem is convex given p(y) for any empirical means, without specifying p(x).

  • The minimization generates an auxiliary sparse pMI (x): support alignments.


  • The solution form

  • Where (y) are Lagrange multipliers and

  • Via Bayes

  • Can be used for classification. But how do we find it?

Careful i cheated
Careful… I cheated…

  • What if pMI(x)=0 ?

  • No legal pMI(y|x) …

  • But we can still define:

  • Can show that it is subnormalized:

  • And use f(y|x) for classification!

  • Solutions are actually very sparse. Many pMI(x) are zero. “Support Assignments”…

A dual formulation
A dual formulation

  • Using convex duality we can show that MinMI can be formulated as

  • Called a geometric program

  • Strict inequalities for x such that p(x)=0

  • Avoids dealing with p(x) at all!

A generalization bound

-log2 fMI(y|x)


A generalization bound

  • If the estimated means are equal to their true expected values, we can show that the generalization error satisfies


A game theoretic interpretation
A Game Theoretic Interpretation

  • Among all distributions in F(a), why choose MinMI?

  • The MinMI classifiers minimizes the worst case loss in the class

  • The loss is an upper bound on generalization error

  • Minimize a worst case upper bound

Minmi and joint typicality
MinMI and Joint Typicality

Given a sequence the probability that another independently drawn sequence: is drawn from their joint distribution,

Is asymptotically

Suggesting Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.

I projections csiszar 75 amari 82
I-Projections (Csiszar 75, Amari 82,…)

  • The I-projection of a distribution q(x) on a set F

  • For a set defined by linear constraints:

  • Can be calculated using Generalized Iterative Scaling or Gradient methods.

Looks Familiar ?

The minmi algorithm
The MinMI Algorithm

  • Initialize

  • Iterate

    • For all y: Set to be the projection of on

    • Marginalize

Example two moments
Example: Two moments

  • Observations are class conditional mean and variance.

  • MaxEnt solution would be p(X|y) a Gaussian.

  • MinMI solutions are far from Gaussians and discriminate much better.



Example conditional marginals
Example: Conditional Marginals

  • Recall in Naïve Bayes we used the empirical means of:

  • Can use these means for MinMI.

Na ve bayes analogs
Naïve Bayes Analogs

Naïve Bayes

Discriminative 1st Order LogLinear


  • 12 UCI Datasets. Discrete Features Only

    used singleton marginal constraints.

  • Compared to Naïve Bayes and 1st order LogLinear model.

  • Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.

Generalization error for full sample
Generalization error for full sample

Related ideas
Related ideas

  • Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)

  • Efficient representations of X with respect to Y:

    The Information Bottleneck approach.

  • Bounding the information in neural codes from very sparse statistics.

  • Statistical extension of Support Vector Machines.


  • MinMI outperforms discriminative model for small sample sizes

  • Outperforms generative model.

  • Presented a method for inferring classifiers based on simple sample means.

  • Unlike generative models, provides generalization guarantees.