
Minimum Information Inference


Naftali Tishby

Amir Globerson

ICNC, CSE

The Hebrew University

TAU, Jan. 2, 2005

- Classification with probabilistic models: Generative vs. Discriminative
- The Minimum Information Principle
- Generalization error bounds
- Game theoretic motivation
- Joint typicality

- The MinMI algorithms
- Empirical evaluations
- Related extensions: SDR and IB

- Learn how to classify (complex) observations X into (simple) classes Y
- Given labeled examples (xi,yi)
- Use them to construct a classifier y=g(x)

- What is a good classifier?
- Denote by p*(x,y) the true underlying law
- Want to minimize the generalization error: the probability, under p*, that g(x) differs from y

Problem …

Generalization: can't be computed directly

[Diagram: Truth p*(x,y); Observed (xi,yi), i=1…n; Learned y=g(x)]

- Need to limit the search to some set of rules. If every rule is possible we will surely over-fit. Use a family g_θ(x), where θ is a parameter.
- Would be nice if the true rule were in the family g_θ(x)
- How do we choose θ?

- A reasonable strategy: find the classifier which minimizes the empirical (sample) error, i.e. the fraction of training examples with g(xi) ≠ yi
- Does not necessarily provide the best generalization, although theoretical bounds exist.
- Computationally hard to minimize directly. Many works minimize upper bounds on the error.
- Here we focus on a different strategy.
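The empirical-error strategy described above can be illustrated with a minimal sketch. The one-dimensional threshold family `g_theta(x) = 1 if x > theta else 0`, the function names, and the toy data are assumptions made for this illustration only; they do not come from the talk.

```python
def empirical_error(theta, xs, ys):
    """Fraction of samples misclassified by the rule g_theta(x) = 1 if x > theta else 0."""
    mistakes = sum(1 for x, y in zip(xs, ys) if int(x > theta) != y)
    return mistakes / len(xs)


def fit_threshold(xs, ys):
    """Exhaustive search for the threshold with the lowest empirical error."""
    pts = sorted(xs)
    # Candidates: below all points, midpoints between neighbours, above all points.
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    candidates.append(pts[-1] + 1.0)
    return min(candidates, key=lambda t: empirical_error(t, xs, ys))


# Toy labeled sample (hypothetical data); it is not separable by any threshold.
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
ys = [0, 0, 1, 0, 1, 1]
theta_hat = fit_threshold(xs, ys)
train_error = empirical_error(theta_hat, xs, ys)
```

Exhaustive search is only feasible here because the family has a single parameter; for richer families the direct minimization is hard, which is the point the slide makes.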

- Had we known p*(x,y), the optimal predictor would be g(x) = argmax_y p*(y|x)
- But we don’t know it. We can try to estimate it. Two general approaches: generative and discriminative.

- Assume p(x|y) has some parametric form, e.g. a Gaussian.
- Each y has a different set of parameters θ_y
- How do we estimate θ_y and p(y)? Maximum Likelihood!

- Easy to see that p(y) should be set to the empirical frequency of the classes
- The parameters θ_y are obtained by collecting all x values for the class y and computing a maximum likelihood estimate from them.

- Assume the class conditional distribution is Gaussian
- Then the maximum likelihood parameters (mean μ_y and variance σ_y²) are the empirical mean and variance of the samples in class y.
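As a rough illustration of the generative recipe above (class frequencies for p(y), per-class empirical mean and variance for a Gaussian p(x|y)), the sketch below fits the model and classifies by the largest p(y)p(x|y). The function names and the data are hypothetical and chosen only for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_classes(xs, ys):
    """ML estimates per class: (prior p(y), mean, variance)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    params = {}
    for y, vals in groups.items():
        mean = sum(vals) / len(vals)
        # The maximum likelihood variance divides by n, not n - 1.
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        params[y] = (len(vals) / len(xs), mean, var)
    return params


def log_joint(x, prior, mean, var):
    """log p(y) + log N(x; mean, var)."""
    return (math.log(prior)
            - 0.5 * math.log(2.0 * math.pi * var)
            - (x - mean) ** 2 / (2.0 * var))


def classify(x, params):
    """Return the class maximizing p(y) p(x|y) under the fitted model."""
    return max(params, key=lambda y: log_joint(x, *params[y]))


# Hypothetical one-dimensional sample with two classes.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, 2, 2, 2, 2]
params = fit_gaussian_classes(xs, ys)
```

Everything here is a closed-form empirical mean, which is why estimation in the generative setting is easy.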

[Figure: samples of classes y=1 and y=2]

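- Say X=[X1,…,Xn] is an n-dimensional observation
- Assume the features are independent given the class: p(x|y) = p(x1|y)·…·p(xn|y)
- Parameters are p(xi=k|y), calculated by counting how many times xi=k in class y
- These are empirical means of indicator functions (1 if xi=k, 0 otherwise), averaged over the samples of class y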
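The counting step can be sketched as follows: each table entry p(xi=k|y) is the class-conditional average of an indicator function. The function name and toy data are assumptions for illustration, and no smoothing is applied, so value combinations never seen in a class are simply absent from the table.

```python
from collections import Counter


def naive_bayes_tables(samples, labels):
    """Estimate p(x_i = k | y) as the mean of the indicator 1[x_i = k] within class y."""
    class_counts = Counter(labels)
    hits = Counter()
    for x, y in zip(samples, labels):
        for i, k in enumerate(x):
            hits[(i, k, y)] += 1  # the indicator fired once for feature i, value k, class y
    return {key: count / class_counts[key[2]] for key, count in hits.items()}


# Hypothetical sample: two binary features, two classes.
samples = [(0, 1), (0, 0), (1, 1), (1, 1)]
labels = ["a", "a", "b", "b"]
tables = naive_bayes_tables(samples, labels)
```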

- Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA).
- Estimation is easy. Closed form solutions in many cases (through empirical means).
- The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001).
- Performance is not bad at all.

- But to classify we need only p(y|x). Why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known.

- Assume a parametric form for p(y|x):

- Choose the parameters θ_y to maximize the conditional likelihood
- Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).

- Assume p(x|y) are Gaussians with different means and the same variance. Then p(y|x) = exp(ay·x + by) / Z(x)
- Goal is to estimate ay, by
- This is called logistic regression, since the log of the distribution is linear in x (up to the normalization Z(x))
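The step from equal-variance Gaussians to a log-linear p(y|x) can be spelled out for scalar x. The derivation below is a standard one, written under the assumption of class means \mu_y and a shared variance \sigma^2; these symbols are introduced here for illustration. The quadratic term in x and the Gaussian normalization constant are the same for every class, so they cancel between numerator and denominator.

```latex
\begin{align*}
p(y \mid x)
  &= \frac{p(y)\,\mathcal{N}(x;\mu_y,\sigma^2)}
          {\sum_{y'} p(y')\,\mathcal{N}(x;\mu_{y'},\sigma^2)} \\
  &= \frac{\exp\!\left(-\frac{x^2}{2\sigma^2} + \frac{\mu_y}{\sigma^2}\,x
           - \frac{\mu_y^2}{2\sigma^2} + \log p(y)\right)}
          {\sum_{y'} \exp\!\left(-\frac{x^2}{2\sigma^2} + \frac{\mu_{y'}}{\sigma^2}\,x
           - \frac{\mu_{y'}^2}{2\sigma^2} + \log p(y')\right)} \\
  &= \frac{\exp(a_y x + b_y)}{\sum_{y'} \exp(a_{y'} x + b_{y'})},
  \qquad a_y = \frac{\mu_y}{\sigma^2},
  \quad b_y = \log p(y) - \frac{\mu_y^2}{2\sigma^2}.
\end{align*}
```

The discriminative approach then treats a_y and b_y as free parameters instead of deriving them from \mu_y and \sigma^2.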

- Assuming p(x|y) is in the Naïve Bayes class, the discriminative distribution is p(y|x) = (1/Z(x)) ∏_i ψ(xi,y), with positive functions ψ
- Similar to Naïve Bayes, but the ψ(x,y) functions are not distributions. This is why we need the additional normalization Z.
- Also called a conditional first order loglinear model.

- Estimates only the relevant distributions (important when X is very complex).

- Often outperforms generative models for large enough samples (see Ng and Jordan, 2001).
- Can be shown to minimize an upper bound on the classification error.

- Generative models (often) employ empirical means which are easy and reliable to estimate.
- But they model each class separately, so discrimination can be poor.
- We would like a discriminative approach based on empirical means.

- Assume we have some “interesting” observables φ(X)
- And we are given their sample empirical means for the different classes Y, e.g. the first two moments within each class
- How can we use this information to build a classifier?
- Idea: Look for models which yield the observed expectations, but contain no other information.

- The entropy H(X,Y) is a measure of uncertainty (and typicality!)

- Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …)
- “Least Committed” to the observations, most typical.
- Yields “nice” exponential forms: for observables φ(x), p(x|y) ∝ exp(λ(y)·φ(x))

- Minimum assumptions about X and Y imply independence.
- Because X behaves differently for different Y they cannot be independent
- How can we quantify their level of dependence?

[Figure: class-conditional densities p(x|y=1) and p(x|y=2) over X, with means m1 and m2]

- Mutual information I(X;Y): the measure of the information shared by two variables
- X and Y are independent iff I(X;Y)=0
- Bounds the classification error: eBayes ≤ 0.5(H(Y)-I(X;Y)) (Hellman and Raviv 1970).
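To make these quantities concrete, the sketch below computes I(X;Y) in bits for a small joint table and compares the Bayes error with the bound 0.5(H(Y) − I(X;Y)). The joint distribution and the function names are hypothetical, picked only for illustration.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def bayes_error(joint):
    """Error of the rule that predicts argmax_y p(y|x) for every x."""
    return 1.0 - sum(max(row) for row in joint)


# Hypothetical joint distribution over two x values and two classes.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_y = [sum(col) for col in zip(*joint)]
info = mutual_information(joint)
bound = 0.5 * (entropy(p_y) - info)
error = bayes_error(joint)
```

For this table the Bayes error (0.2) sits below the bound (about 0.36); for an independent table the information is zero and the bound reduces to 0.5·H(Y).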

- Why not minimize it subject to the observation constraints?

- I(X;Y) - the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.
- Mutual Information is the generating functional for both source coding (minimization) and channel coding (maximization).
- Quantifies independence in a model-free way
- Has a natural multivariate extension - I(X1,…,Xn).

- Given a sample (x1,y1),…,(xn,yn)
- For each y, calculate the expected value of φ(X)

- Calculate empirical marginal p(y)
- Find the minimum Mutual Information distribution with the given empirical expected values
- The value of the minimum information is precisely the information in the observations!

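- The (convex) set of constraints: F(a) = {p(x,y) : the expected value of φ(X) given y equals a(y), and the marginal p(y) equals the empirical one}, where a(y) is the empirical mean of φ in class y
- The information minimizing distribution: pMI = the p in F(a) with minimal I(X;Y)
- A convex problem. No local minima!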

[Figure: the constraint set, with points labeled p and pMI]

- The problem is convex given p(y) for any empirical means, without specifying p(x).
- The minimization generates an auxiliary sparse pMI (x): support alignments.

- The solution has the form pMI(x|y) = pMI(x) exp(φ(x)·ψ(y) + γ(y))
- Where ψ(y) are Lagrange multipliers and γ(y) is set by normalization
- Via Bayes: pMI(y|x) = p(y) exp(φ(x)·ψ(y) + γ(y))
- Can be used for classification. But how do we find it?

- What if pMI(x)=0?
- No legal pMI(y|x) …
- But we can still define: f(y|x) = p(y) exp(φ(x)·ψ(y) + γ(y))
- Can show that it is subnormalized: Σ_y f(y|x) ≤ 1
- And use f(y|x) for classification!
- Solutions are actually very sparse. Many pMI(x) are zero. “Support Assignments”…

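- Using convex duality we can show that MinMI can be formulated as a maximization over the Lagrange multipliers, with one inequality constraint for every value of x
- This is called a geometric program
- The inequalities are strict for x such that pMI(x)=0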
- Avoids dealing with p(x) at all!
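One way to write the dual, consistent with the bullets above, is sketched below. It is a reconstruction, not the slide's own formula: \phi(x) denotes the observables, a(y) their empirical class means, \psi(y) the Lagrange multipliers and \gamma(y) per-class offsets, all of which are notation assumed here. Information is measured in nats.

```latex
\begin{align*}
I_{\min} \;=\; \max_{\psi,\gamma}\quad
  & \sum_y p(y)\,\bigl[\psi(y)\cdot a(y) + \gamma(y)\bigr] \\
\text{subject to}\quad
  & \sum_y p(y)\,\exp\bigl(\phi(x)\cdot\psi(y) + \gamma(y)\bigr) \;\le\; 1
  \qquad \text{for all } x .
\end{align*}
```

Weak duality can be checked directly: for any feasible (\psi,\gamma) and any joint distribution that matches p(y) and the class means a(y), the objective equals \sum_{x,y} p(x,y)\,[\phi(x)\cdot\psi(y)+\gamma(y)], and the log-sum inequality together with the constraint shows this is at most I(X;Y). The constraint involves only p(y), so no model of p(x) is needed.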

[Figure: −log2 fMI(y|x) plotted against fMI(y|x)]

- If the estimated means are equal to their true expected values, we can show that the generalization error satisfies


- Among all distributions in F(a), why choose MinMI?
- The MinMI classifier minimizes the worst-case loss over the class
- The loss is an upper bound on generalization error
- Minimize a worst case upper bound

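Given a sequence x^n = (x1,…,xn), the probability that another, independently drawn sequence y^n = (y1,…,yn) looks as if the pair was drawn from their joint distribution p(x,y) (i.e. that the two are jointly typical) is asymptotically 2^(−n·I(X;Y)).

This suggests Minimum Mutual Information (MinMI) as a general principle for joint (typical) inference.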

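- The I-projection of a distribution q(x) on a set F: the p in F that minimizes the KL divergence D[p||q]
- For a set defined by linear constraints, F = {p : E_p[φ(X)] = a}, it has the exponential form p(x) = q(x) exp(λ·φ(x)) / Z
- Can be calculated using Generalized Iterative Scaling or gradient methods.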
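A minimal sketch of the gradient route for a single scalar constraint is given below. It assumes a finite X, one scalar feature `phi`, and plain gradient ascent on the dual objective `lam * target - log Z(lam)`; the function names, step size and iteration count are illustrative choices, not taken from the talk.

```python
import math


def tilt(q, phi, lam):
    """Exponentially tilted distribution p(x) = q(x) exp(lam * phi(x)) / Z."""
    weights = [qx * math.exp(lam * f) for qx, f in zip(q, phi)]
    z = sum(weights)
    return [w / z for w in weights]


def i_projection(q, phi, target, steps=500, lr=0.5):
    """I-project q onto {p : sum_x p(x) phi(x) = target} for a scalar feature phi."""
    lam = 0.0
    for _ in range(steps):
        p = tilt(q, phi, lam)
        mean = sum(px * f for px, f in zip(p, phi))
        lam += lr * (target - mean)  # dual gradient: target minus current mean
    return tilt(q, phi, lam), lam


# Hypothetical example: project the uniform distribution on {0,1,2,3} onto mean 2.
q = [0.25, 0.25, 0.25, 0.25]
phi = [0.0, 1.0, 2.0, 3.0]
p, lam = i_projection(q, phi, 2.0)
projected_mean = sum(px * f for px, f in zip(p, phi))
```

The step size is kept below 2 divided by the largest possible variance of `phi`, which keeps the ascent stable for this example; a different feature range would need a different step.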

Looks familiar?

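- Initialize p0(x)
- Iterate:
- For all y: set pt+1(x|y) to be the I-projection of pt(x) on the set of distributions whose expected φ(X) equals the empirical mean for class y
- Marginalize: pt+1(x) = Σ_y p(y) pt+1(x|y)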
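The iteration can be sketched for a small discrete X with one scalar observable. This is an illustrative reading of the steps above, not the authors' implementation: the uniform initialization, the gradient-based inner projection, the warm start of the multipliers, and all names and numbers are assumptions.

```python
import math


def project(q, phi, target, lam=0.0, steps=100, lr=0.3):
    """I-projection of q onto {p : E_p[phi] = target}; p(x) is q(x) exp(lam*phi(x)) / Z."""
    for _ in range(steps):
        w = [qx * math.exp(lam * f) for qx, f in zip(q, phi)]
        mean = sum(wx * f for wx, f in zip(w, phi)) / sum(w)
        lam += lr * (target - mean)  # gradient ascent on the dual
    w = [qx * math.exp(lam * f) for qx, f in zip(q, phi)]
    z = sum(w)
    return [wx / z for wx in w], lam


def min_mi(phi, p_y, targets, iters=200):
    """Alternate per-class I-projections of p(x) with marginalization over y."""
    n = len(phi)
    p_x = [1.0 / n] * n          # Initialize (uniform start is an arbitrary choice)
    lams = [0.0] * len(p_y)      # multipliers, warm-started across iterations
    for _ in range(iters):
        cond = []
        for y, a in enumerate(targets):
            p, lams[y] = project(p_x, phi, a, lams[y])
            cond.append(p)       # p(x|y) now matches the class mean a
        # Marginalize: p(x) = sum_y p(y) p(x|y)
        p_x = [sum(py * c[i] for py, c in zip(p_y, cond)) for i in range(n)]
    info = sum(py * c[i] * math.log(c[i] / p_x[i])
               for py, c in zip(p_y, cond) for i in range(n) if c[i] > 0)
    return p_x, cond, info       # info is I(X;Y) in nats


# Hypothetical problem: X in {0,...,4}, phi(x) = x, class means 1 and 3.
phi = [0.0, 1.0, 2.0, 3.0, 4.0]
p_x, cond, info = min_mi(phi, [0.5, 0.5], [1.0, 3.0])
```

In this toy problem the mass of p(x) drains from the interior points and concentrates on x=0 and x=4, a small instance of the sparsity the talk describes.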

- Observations are class conditional mean and variance.
- MaxEnt solution would be p(X|y) a Gaussian.
- MinMI solutions are far from Gaussians and discriminate much better.

[Figure: MaxEnt and MinMI solutions]

- Recall that in Naïve Bayes we used the empirical means of the indicator functions for xi=k in each class.

- Can use these means for MinMI.

[Figure: Naïve Bayes and discriminative 1st order LogLinear]

- 12 UCI datasets, discrete features only; used singleton marginal constraints.

- Compared to Naïve Bayes and 1st order LogLinear model.
- Note: Naïve Bayes and MinMI use exactly the same input. LogLinear regression also approximates p(x) and uses more information.

- Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR)
- Efficient representations of X with respect to Y:
The Information Bottleneck approach.

- Bounding the information in neural codes from very sparse statistics.
- Statistical extension of Support Vector Machines.

- MinMI outperforms the discriminative model for small sample sizes
- Outperforms the generative model.
- Presented a method for inferring classifiers based on simple sample means.
- Unlike generative models, provides generalization guarantees.