Minimum Information Inference

Naftali Tishby, Amir Globerson
ICNC, CSE, The Hebrew University
TAU, Jan. 2, 2005


Talk outline
  • Classification with probabilistic models: Generative vs. Discriminative
  • The Minimum Information Principle
    • Generalization error bounds
    • Game theoretic motivation
    • Joint typicality
  • The MinMI algorithms
  • Empirical evaluations
  • Related extensions: SDR and IB
The Classification Problem
  • Learn how to classify (complex) observations X into (simple) classes Y
    • Given labeled examples (xi,yi), i=1…n
    • Use them to construct a classifier y=g(x)
  • What is a good classifier?
    • Denote by p*(x,y) the true underlying law
    • Want to minimize the generalization error, i.e. the probability under p*(x,y) that g(x)≠y
  • Problem: the generalization error can’t be computed directly, since p*(x,y) is unknown and only the sample is available
Choosing a classifier
  • Need to limit the search to some set of rules. If every rule is possible we will surely over-fit. Use a family g_θ(x), where θ is a parameter.
  • It would be nice if the true rule were in g_θ(x).
  • How do we choose θ in g_θ(x)?
Common approach: Empirical Risk Minimization
  • A reasonable strategy: find the classifier which minimizes the empirical (sample) error, i.e. the fraction of training examples with g(xi)≠yi.
  • Does not necessarily provide the best generalization, although theoretical bounds exist.
  • Computationally hard to minimize directly. Many works minimize upper bounds on the error.
  • Here we focus on a different strategy.
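As a small illustration of empirical risk minimization, the sketch below searches a family of one-dimensional threshold rules for the one with the lowest sample error. The threshold family, the function names and the toy sample are assumptions made for this example and do not come from the talk.

```python
def empirical_error(g, sample):
    """Fraction of labeled examples (x, y) that the classifier g gets wrong."""
    return sum(1 for x, y in sample if g(x) != y) / len(sample)


def threshold_classifier(theta):
    """Hypothetical rule family: predict class 1 when x exceeds theta."""
    return lambda x: 1 if x > theta else 0


def erm_threshold(sample):
    """Return the threshold with the smallest empirical error.

    Only thresholds between consecutive sorted x values (plus one below and
    one above all points) need to be tried, because the empirical error is
    constant between them.
    """
    xs = sorted(x for x, _ in sample)
    candidates = [xs[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]
    return min(candidates,
               key=lambda t: empirical_error(threshold_classifier(t), sample))


# Toy sample: the point (0.7, 0) makes the classes non-separable.
sample = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 1), (1.3, 1), (0.7, 0)]
theta_hat = erm_threshold(sample)
best_error = empirical_error(threshold_classifier(theta_hat), sample)
```

For a richer rule family this exhaustive search is not available, which is one reason the slide notes that direct minimization is computationally hard.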
Probabilistic models for classification
  • Had we known p*(x,y), the optimal predictor would be g(x) = argmax_y p*(y|x).
  • But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.
Generative Models
  • Assume p(x|y) has some parametric form, e.g. a Gaussian.
  • Each y has a different set of parameters θ_y.
  • How do we estimate θ_y and p(y)? Maximum Likelihood!
Generative Models - Estimation
  • Easy to see that p(y) should be set to the empirical frequency of the classes.
  • The parameters θ_y are obtained by collecting all x values for the class y and computing a maximum likelihood estimate from them.
Example: Gaussians
  • Assume the class conditional distribution is Gaussian.
  • Then the maximum likelihood parameters are the empirical mean and variance of the samples in class y.
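A minimal sketch of the Gaussian generative classifier described here, assuming scalar observations and a nonzero variance in every class. The function names and the toy data are illustrative assumptions, not part of the talk.

```python
import math
from collections import defaultdict


def fit_gaussian_generative(sample):
    """Maximum likelihood fit: per class, the empirical frequency, mean and variance."""
    by_class = defaultdict(list)
    for x, y in sample:
        by_class[y].append(x)
    model = {}
    for y, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)  # ML estimate divides by N
        model[y] = (len(xs) / len(sample), mean, var)
    return model


def predict(model, x):
    """Pick the class maximizing log p(y) + log p(x|y), i.e. the largest p(y|x)."""
    def log_joint(y):
        prior, mean, var = model[y]
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(model, key=log_joint)


# Toy labeled sample (assumed for illustration).
sample = [(-1.2, 0), (-0.8, 0), (-1.0, 0), (0.9, 1), (1.1, 1), (1.3, 1), (0.7, 1)]
model = fit_gaussian_generative(sample)
```

Every parameter is a closed-form empirical mean, which is the property the later slides try to keep while moving to a discriminative criterion.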



Example: Naïve Bayes
  • Say X=[X1,…,Xn] is an n dimensional observation.
  • Assume the features are independent given the class: p(x|y) = ∏i p(xi|y).
  • Parameters are p(xi=k|y), calculated by counting how many times xi=k in class y.
  • These are empirical means of indicator functions of the events xi=k.
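The counting estimate can be sketched as follows. This is an illustrative implementation with assumed names and toy data; it uses no smoothing, so a feature value never seen in a class gets probability zero.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(sample):
    """sample: list of (x, y) where x is a tuple of discrete feature values.

    Returns the class frequencies p(y) and the tables p(x_i = k | y), each of
    which is the empirical mean of an indicator function within class y.
    """
    class_counts = Counter(y for _, y in sample)
    feature_counts = defaultdict(Counter)  # (i, y) -> counts of values k
    for x, y in sample:
        for i, k in enumerate(x):
            feature_counts[(i, y)][k] += 1
    prior = {y: c / len(sample) for y, c in class_counts.items()}
    cond = {key: {k: c / class_counts[key[1]] for k, c in counts.items()}
            for key, counts in feature_counts.items()}
    return prior, cond


def posterior(prior, cond, x):
    """p(y|x) from Bayes' rule under the conditional independence assumption."""
    scores = {}
    for y, p in prior.items():
        for i, k in enumerate(x):
            p *= cond[(i, y)].get(k, 0.0)
        scores[y] = p
    z = sum(scores.values())  # assumes at least one class has nonzero score
    return {y: s / z for y, s in scores.items()}


# Toy sample with two binary features (assumed for illustration).
sample = [((1, 0), 'a'), ((1, 1), 'a'), ((0, 1), 'b'), ((0, 0), 'b'), ((1, 1), 'b')]
prior, cond = fit_naive_bayes(sample)
post = posterior(prior, cond, (1, 1))
```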

Generative Classifiers: Advantages
  • Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA).
  • Estimation is easy. Closed form solutions in many cases (through empirical means).
  • The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).
  • Performance is not bad at all.
Discriminative Classifiers
  • But to classify we need only p(y|x). Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.
  • Assume a parametric form for p(y|x), with parameters θ_y.
Discriminative Models - Estimation
  • Choose θ_y to maximize the conditional likelihood of the labels given the observations, Σi log p(yi|xi).
  • Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).
Example: logistic regression
  • Assume p(x|y) are Gaussians with different means and the same variance. Then p(y|x) = exp(a_y·x + b_y) / Z(x), where Z(x) normalizes over y.
  • Goal is to estimate a_y, b_y.
  • This is called logistic regression, since the log of the distribution is linear in x (up to the normalization).
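To make the iterative estimation concrete, here is a minimal sketch of gradient ascent on the conditional log-likelihood for scalar x. The step size, the iteration count, the function names and the toy sample are assumptions chosen for this example.

```python
import math


def softmax_posterior(params, x):
    """p(y|x) = exp(a_y * x + b_y) / Z(x) for a scalar observation x."""
    scores = {y: a * x + b for y, (a, b) in params.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def conditional_log_likelihood(params, sample):
    return sum(math.log(softmax_posterior(params, x)[y]) for x, y in sample)


def fit(sample, classes, steps=500, lr=0.1):
    """Plain gradient ascent on the average conditional log-likelihood."""
    params = {y: (0.0, 0.0) for y in classes}
    n = len(sample)
    for _ in range(steps):
        grad = {y: [0.0, 0.0] for y in classes}
        for x, y in sample:
            post = softmax_posterior(params, x)
            for c in classes:
                err = (1.0 if c == y else 0.0) - post[c]
                grad[c][0] += err * x  # derivative with respect to a_c
                grad[c][1] += err      # derivative with respect to b_c
        params = {c: (params[c][0] + lr * grad[c][0] / n,
                      params[c][1] + lr * grad[c][1] / n) for c in classes}
    return params


# Overlapping toy classes, so the maximum likelihood solution is finite.
sample = [(-2.0, 0), (-1.0, 0), (0.5, 0), (-0.5, 1), (1.0, 1), (2.0, 1)]
initial = {0: (0.0, 0.0), 1: (0.0, 0.0)}
params = fit(sample, classes=[0, 1])
```

Unlike the generative fits, no parameter here is a simple empirical mean: each update needs the current posterior on every training point.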
Discriminative Naïve Bayes
  • Assuming p(x|y) is in Naïve Bayes class, the discriminative distribution is
  • Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z.
  • Also called a conditional first order loglinear model.
Discriminative: Advantages
  • Estimates only the relevant distributions (important when X is very complex).
  • Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).
  • Can be shown to minimize an upper bound on the classification error.
The best of both worlds…
  • Generative models (often) employ empirical means which are easy and reliable to estimate.
  • But they model each class separately, so poor discrimination is obtained.
  • We would like a discriminative approach based on empirical means.
Learning from Expected Values (observations, in physics)
  • Assume we have some “interesting” observables, i.e. functions of x
  • And we are given their empirical sample means for the different classes Y, e.g. the first two moments within each class
  • How can we use this information to build a classifier?
  • Idea: Look for models which yield the observed expectations, but contain no other information.
The MaxEnt approach
  • The Entropy H(X,Y) is a measure of uncertainty (and typicality!)
  • Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …)
  • “Least Committed” to the observations, most typical.
  • Yields “nice” exponential forms.
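The exponential form can be seen in a one-variable toy case: the maximum entropy distribution on a finite set with a prescribed mean is proportional to exp(λ·x). The sketch below finds λ by bisection for a six-sided die with mean 4.5. This single-variable setting, the function name and the search interval are assumptions for illustration; the slide itself concerns the joint entropy H(X,Y) with class-conditional means.

```python
import math


def maxent_with_mean(support, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution on `support` with the given mean.

    The solution has the exponential form p(x) = exp(lam * x) / Z. The mean of
    that family is increasing in lam, so lam can be found by bisection.
    """
    def tilted(lam):
        m = max(lam * x for x in support)  # stabilize the exponentials
        w = [math.exp(lam * x - m) for x in support]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(p):
        return sum(pi * x for pi, x in zip(p, support))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return lam, tilted(lam)


support = [1, 2, 3, 4, 5, 6]
lam, p = maxent_with_mean(support, 4.5)
achieved_mean = sum(pi * x for pi, x in zip(p, support))
```

With a target mean above 3.5 the multiplier is positive and the probabilities grow geometrically from face to face, which is the exponential form in its simplest setting.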
Occam’s in Classification
  • Minimum assumptions about X and Y imply independence.
  • Because X behaves differently for different Y they cannot be independent
  • How can we quantify their level of dependence?






Mutual Information (Shannon 48)
  • The measure of the information shared by two variables: I(X;Y) = Σx,y p(x,y) log [ p(x,y) / (p(x)p(y)) ]
  • X and Y are independent iff I(X;Y)=0
  • Bounds the classification error: e_Bayes ≤ 0.5 (H(Y) − I(X;Y)) (Hellman and Raviv 1970).
  • Why not minimize it subject to the observation constraints?
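The quantities on this slide are easy to compute for a small discrete joint distribution. The sketch below evaluates I(X;Y) in bits, the Bayes error, and the Hellman and Raviv bound 0.5·(H(Y) − I(X;Y)) on a toy table; the function names and the numbers in the table are assumptions for this example.

```python
import math


def marginals(joint):
    """joint: dict mapping (x, y) to p(x, y). Returns the dicts p(x) and p(y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    px, py = marginals(joint)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


def bayes_error(joint):
    """Error of the predictor that picks the most probable y for each x."""
    px, _ = marginals(joint)
    return 1.0 - sum(max(p for (x2, _), p in joint.items() if x2 == x)
                     for x in px)


joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
_, py = marginals(joint)
info = mutual_information(joint)
bound = 0.5 * (entropy(py) - info)
error = bayes_error(joint)
```

In this toy table the Bayes error sits below the bound, and replacing the table with an independent one drives I(X;Y) to zero.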
More for Mutual Information…
  • I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.
  • Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).
  • Quantifies independence in a model free way
  • Has a natural multivariate extension - I(X1,…,Xn).
MinMI: Problem Setting
  • Given a sample (x1,y1),…,(xn,yn)
  • For each y, calculate the empirical expected value of the observables of X within class y
  • Calculate the empirical marginal p(y)
  • Find the minimum Mutual Information distribution with the given empirical expected values
  • The value of the minimum information is precisely the information in the observations!
MinMI Formulation
  • The (convex) set of constraints
  • The information minimizing distribution
  • A convex problem. No local minima!



  • The problem is convex given p(y) for any empirical means, without specifying p(x).
  • The minimization generates an auxiliary sparse pMI(x): support alignments.
  • The solution form
  • Where (y) are Lagrange multipliers and
  • Via Bayes
  • Can be used for classification. But how do we find it?
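The formulas for this slide did not survive in the transcript. The following is a sketch of how a solution of this kind can be derived from the problem as stated, not a copy of the slide. All symbols are notation assumed here: φ(x) for the observables, a(y) for their class-conditional empirical means, p̄(y) for the empirical class marginal, ψ(y) for the Lagrange multipliers of the mean constraints, and γ(y) for the multipliers that enforce normalization of p(x|y).

```latex
\begin{align*}
F(a) &= \Big\{ p(x,y) \;:\; \sum_x p(x|y)\,\phi(x) = a(y), \;\; p(y) = \bar p(y) \Big\} \\
p_{MI} &= \arg\min_{p \in F(a)} I(X;Y), \qquad
I(X;Y) = \sum_y \bar p(y) \sum_x p(x|y) \log \frac{p(x|y)}{p(x)}, \quad
p(x) = \sum_y \bar p(y)\, p(x|y) \\
\frac{\partial I}{\partial p(x|y)} &= \bar p(y) \log \frac{p(x|y)}{p(x)} \\
\Rightarrow \quad p_{MI}(x|y) &= p_{MI}(x)\, \exp\big( \phi(x) \cdot \psi(y) + \gamma(y) \big)
\qquad \text{for } p_{MI}(x) > 0 \\
p_{MI}(y|x) &= \frac{\bar p(y)\, p_{MI}(x|y)}{p_{MI}(x)}
= \bar p(y)\, \exp\big( \phi(x) \cdot \psi(y) + \gamma(y) \big)
\end{align*}
```

The third line uses the fact that the two terms coming from the derivative of log p(x) and of p(x|y) log p(x|y) cancel, so setting the gradient of the Lagrangian to zero gives the exponential form directly. The last line is Bayes' rule, and it no longer involves p(x) explicitly.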
Careful… I cheated…
  • What if pMI(x)=0?
  • No legal pMI(y|x) …
  • But we can still define:
  • Can show that it is subnormalized, i.e. Σy f(y|x) ≤ 1
  • And use f(y|x) for classification!
  • Solutions are actually very sparse. Many pMI(x) are zero. “Support Assignments”…
A dual formulation
  • Using convex duality we can show that MinMI can be formulated as
  • Called a geometric program
  • Strict inequalities for x such that p(x)=0
  • Avoids dealing with p(x) at all!
A generalization bound
  • If the estimated means are equal to their true expected values, we can show that the generalization error satisfies
  • (Slide label: −log2 fMI(y|x))
A Game Theoretic Interpretation
  • Among all distributions in F(a), why choose MinMI?
  • The MinMI classifier minimizes the worst case loss in the class
  • The loss is an upper bound on generalization error
  • Minimize a worst case upper bound
MinMI and Joint Typicality

Given a sequence x1,…,xn, the probability that another, independently drawn sequence y1,…,yn looks as if the pairs (xi,yi) were drawn from their joint distribution (i.e. is jointly typical with it) is asymptotically 2^(−n·I(X;Y)).

This suggests Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.

I-Projections (Csiszar 75, Amari 82,…)
  • The I-projection of a distribution q(x) on a set F
  • For a set defined by linear constraints:
  • Can be calculated using Generalized Iterative Scaling or Gradient methods.

Looks familiar?
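The two formulas on this slide are missing from the transcript. The standard definitions they most likely refer to are given below, with notation assumed here (φ for the constrained functions, a for their required expectation, λ for the Lagrange multipliers).

```latex
\begin{align*}
p^{*} &= \arg\min_{p \in F} D_{KL}[\, p \,\|\, q \,], \qquad
D_{KL}[\, p \,\|\, q \,] = \sum_x p(x) \log \frac{p(x)}{q(x)} \\
F &= \Big\{ p \;:\; \sum_x p(x)\, \phi(x) = a \Big\}
\quad \Rightarrow \quad
p^{*}(x) = \frac{1}{Z(\lambda)}\, q(x)\, \exp\big( \lambda \cdot \phi(x) \big), \qquad
Z(\lambda) = \sum_x q(x)\, \exp\big( \lambda \cdot \phi(x) \big)
\end{align*}
```

Here λ is chosen so that the constraint holds. When q is uniform this reduces to the maximum entropy exponential form, and in general the projection is q reweighted by an exponential factor.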

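The MinMI Algorithm
  • Initialize p(x)
  • Iterate
    • For all y: set p(x|y) to be the I-projection of the current p(x) on the set of distributions whose expected values match the empirical means of class y
    • Marginalize: p(x) = Σy p(y) p(x|y)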
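The alternating scheme can be sketched in a few lines for a finite X and a single scalar observable. This is an illustrative reading of the slide, not the authors' implementation: the uniform initialization, the bisection-based I-projection, the function names and the toy constraints are all assumptions.

```python
import math


def i_projection(q, phi, target, lo=-60.0, hi=60.0, iters=200):
    """I-projection of q on {p : sum_x p(x) * phi(x) = target}.

    The projection has the form q(x) * exp(lam * phi(x)) / Z. Its mean is
    non-decreasing in lam, so lam is found by bisection.
    """
    def tilted(lam):
        logs = [math.log(qx) + lam * f if qx > 0 else None
                for qx, f in zip(q, phi)]
        m = max(v for v in logs if v is not None)
        w = [math.exp(v - m) if v is not None else 0.0 for v in logs]
        z = sum(w)
        return [wi / z for wi in w]

    for _ in range(iters):
        mid = (lo + hi) / 2
        p = tilted(mid)
        if sum(pi * f for pi, f in zip(p, phi)) < target:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)


def mutual_information(p_y, cond):
    """I(X;Y) in bits for class prior p_y[y] and conditionals cond[y][i]."""
    n = len(next(iter(cond.values())))
    p_x = [sum(p_y[y] * cond[y][i] for y in p_y) for i in range(n)]
    return sum(p_y[y] * c * math.log2(c / p_x[i])
               for y in p_y for i, c in enumerate(cond[y]) if c > 0)


def min_mi(phi, means, p_y, iterations=100):
    """Alternate I-projections (one per class) with marginalization."""
    n = len(phi)
    p_x = [1.0 / n] * n  # assumed initialization: uniform p(x)
    history = []
    for _ in range(iterations):
        cond = {y: i_projection(p_x, phi, means[y]) for y in p_y}
        p_x = [sum(p_y[y] * cond[y][i] for y in p_y) for i in range(n)]
        history.append(mutual_information(p_y, cond))
    return p_x, cond, history


# Toy problem: X takes five values, the observable is phi(x) = x, and the two
# equally likely classes have class-conditional means 1 and 3.
phi = [0.0, 1.0, 2.0, 3.0, 4.0]
means = {0: 1.0, 1: 3.0}
p_y = {0: 0.5, 1: 0.5}
p_x, cond, history = min_mi(phi, means, p_y)
```

Each projection step lowers the weighted divergence between the conditionals and the current p(x), and each marginalization step lowers it again, so the mutual information recorded in `history` cannot increase from one iteration to the next.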
Example: Two moments
  • Observations are class conditional mean and variance.
  • The MaxEnt solution would be a Gaussian p(X|y).
  • MinMI solutions are far from Gaussians and discriminate much better.



Example: Conditional Marginals
  • Recall that in Naïve Bayes we used the empirical means of the indicator functions of xi=k within each class.
  • Can use these means for MinMI.
Naïve Bayes Analogs
  • (Figure: comparison with Naïve Bayes and with the Discriminative 1st Order LogLinear model)
  • 12 UCI datasets, discrete features. Only singleton marginal constraints were used.
  • Compared to Naïve Bayes and 1st order LogLinear model.
  • Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.
Related ideas
  • Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)
  • Efficient representations of X with respect to Y: the Information Bottleneck approach.

  • Bounding the information in neural codes from very sparse statistics.
  • Statistical extension of Support Vector Machines.
  • MinMI outperforms discriminative model for small sample sizes
  • Outperforms generative model.
  • Presented a method for inferring classifiers based on simple sample means.
  • Unlike generative models, provides generalization guarantees.