Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel

Learning with Hidden Variables
Input: DATA over X1 … XN; the values of T are unobserved (? ? ? ? ? ?)
Output: A model P(X,T)
[Figure: network over T, X1, X2, X3; plot of Likelihood against Params, both axes from 0 to 1]
Problem: No closed-form solution for ML estimation
• Use Expectation Maximization (EM)
Problem: Stuck in inferior local maxima
• Random restarts
• Deterministic
• Simulated annealing
EM + information regularization for learning parameters

Learning Parameters
Input: DATA over X1 … XN
Output: A model P(X)
[Figure: network over X1, X2, X3]
Empirical distribution: Q(X)
Parametrization of P:
• P(X1) = Q(X1)
• P(X2|X1) = Q(X2|X1)
• P(X3|X1) = Q(X3|X1)
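
The parametrization above amounts to reading relative frequencies off the data. A minimal sketch, assuming complete data given as (x1, x2, x3) tuples and the structure X1 → X2, X1 → X3 implied by the three factors; the name `fit_tree` and the toy data are illustrative only:

```python
from collections import Counter

def fit_tree(data):
    """ML parameters for P(X1) P(X2|X1) P(X3|X1) from complete data.

    data: list of (x1, x2, x3) tuples. Every returned table is a relative
    frequency taken from the empirical distribution Q.
    Conditional tables are keyed as (child_value, x1_value).
    """
    n = len(data)
    c1 = Counter(x1 for x1, _, _ in data)
    c12 = Counter((x1, x2) for x1, x2, _ in data)
    c13 = Counter((x1, x3) for x1, _, x3 in data)
    p1 = {a: c / n for a, c in c1.items()}                  # P(X1) = Q(X1)
    p2 = {(b, a): c / c1[a] for (a, b), c in c12.items()}   # P(X2|X1) = Q(X2|X1)
    p3 = {(b, a): c / c1[a] for (a, b), c in c13.items()}   # P(X3|X1) = Q(X3|X1)
    return p1, p2, p3

data = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 0)]
p1, p2, p3 = fit_tree(data)
```

With a hidden variable the counts involving T are unavailable, which is why the following slides replace them with a guessed Q.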

Learning with Hidden Variables
Input: DATA over X1 … XN; the values of T are unobserved (? ? ? ? ? ?)
Desired structure: [Figure: network over T, X1, X2, X3, with an instance ID variable Y taking values 1 … M]
Empirical distribution Q(X,T) = ?
• For each instance ID, complete the value of T
• Guess of empirical distribution: Q(X,T,Y) = Q(X,Y)Q(T|Y)
EM iterations alternate between the empirical distribution Q(X,T,Y) and the parametrization for P

EM Functional
The EM Algorithm:
• E-Step: Generate the empirical distribution Q
• M-Step: Maximize the parameters of P using Q
EM is equivalent to optimizing a function of Q and P
Each step increases the value of the functional [Neal and Hinton, 1998]
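
To make the functional view concrete, here is a small sketch that assumes the usual form F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y) with a uniform distribution over instance IDs. The naive Bayes model with binary features, the parameter values, and all function names are assumptions chosen for illustration. The sketch checks that F is a lower bound on the average log-likelihood and that the bound is tight when Q(T|Y) is the posterior, which is what the E-step produces.

```python
import math

def joint(x, t, prior, cpt):
    """P(x, t) for a naive Bayes model: prior[t] * prod_i P(x_i | t)."""
    p = prior[t]
    for i, xi in enumerate(x):
        p *= cpt[t][i] if xi == 1 else 1.0 - cpt[t][i]
    return p

def posterior(x, prior, cpt):
    """P(t | x): the Q(t | y) that the standard E-step assigns to instance y."""
    w = [joint(x, t, prior, cpt) for t in range(len(prior))]
    z = sum(w)
    return [wi / z for wi in w]

def em_functional(data, q, prior, cpt):
    """F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), averaged over instances."""
    total = 0.0
    for x, qy in zip(data, q):
        for t, qt in enumerate(qy):
            if qt > 0.0:
                total += qt * (math.log(joint(x, t, prior, cpt)) - math.log(qt))
    return total / len(data)

def avg_log_likelihood(data, prior, cpt):
    """Average of log P(x[y]) over instances, with T summed out."""
    k = len(prior)
    return sum(math.log(sum(joint(x, t, prior, cpt) for t in range(k)))
               for x in data) / len(data)

data = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]
prior = [0.5, 0.5]
cpt = [[0.8, 0.6, 0.2], [0.2, 0.4, 0.7]]      # cpt[t][i] = P(x_i = 1 | t)

q_uniform = [[0.5, 0.5] for _ in data]        # an arbitrary guess for Q(T|Y)
q_post = [posterior(x, prior, cpt) for x in data]

f_uniform = em_functional(data, q_uniform, prior, cpt)
f_post = em_functional(data, q_post, prior, cpt)
loglik = avg_log_likelihood(data, prior, cpt)
```

Here `f_uniform` falls below `loglik`, while `f_post` matches it up to rounding.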

Information Bottleneck EM
Target: L_IB-EM = γ · (EM target) − (1 − γ) · (information between hidden and ID)
In the rest of the talk…
• Understanding this objective
• How to use it to learn better models

Information Regularization
Motivating idea:
• Fit training data: set T to be the instance ID to “predict” X
• Generalization: “forget” the ID and keep the essence of X
Objective: (lower bound of) likelihood of P vs. compression of instance ID
• Parameter-free regularization of Q
[Tishby et al., 1999]

Clustering example, γ=0: total compression
[Figure: instances 1 to 11; EM target vs. compression measure]

Clustering example, γ=1: total preservation (T ≡ ID)
[Figure: instances 1 to 11; EM target vs. compression measure]

Clustering example, γ=?: the desired solution, with |T| = 2
[Figure: instances 1 to 11 split into two clusters; EM target vs. compression measure]
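
The compression measure in the clustering example is the mutual information I(T;Y) between the hidden variable and the instance ID. The sketch below computes it in bits for a table Q(T|Y), assuming a uniform distribution over the 11 instance IDs. The function name and the particular two-cluster split are illustrative and are not taken from the slides.

```python
import math

def mutual_information(q_t_given_y):
    """I(T;Y) in bits for uniform Q(Y); q_t_given_y[y][t] = Q(t | y)."""
    m = len(q_t_given_y)
    k = len(q_t_given_y[0])
    q_t = [sum(row[t] for row in q_t_given_y) / m for t in range(k)]
    info = 0.0
    for row in q_t_given_y:
        for t, p in enumerate(row):
            if p > 0.0:
                info += p / m * math.log2(p / q_t[t])
    return info

M = 11
# Total compression: every instance gets the same Q(T|Y), so T says nothing about Y.
compressed = [[0.5, 0.5] for _ in range(M)]
# Total preservation: T is the instance ID itself.
preserved = [[1.0 if t == y else 0.0 for t in range(M)] for y in range(M)]
# An intermediate case with |T| = 2: a hard split into two clusters.
two_clusters = [[1.0, 0.0] if y < 6 else [0.0, 1.0] for y in range(M)]

i_compressed = mutual_information(compressed)      # 0 bits
i_preserved = mutual_information(preserved)        # log2(11) bits
i_two = mutual_information(two_clusters)           # between 0 and 1 bit
```

The two extremes bracket the useful range: no information about the ID at one end, all log2(M) bits at the other, and at most log2|T| bits once T has fewer states than there are instances.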

Information Bottleneck EM
EM functional: formal equivalence with Information Bottleneck
At γ=1, EM and Information Bottleneck coincide
[Generalizing result of Slonim and Weiss for univariate case]

Information Bottleneck EM
EM functional: formal equivalence with Information Bottleneck
The maximum over Q(T|Y) is obtained when
Q(t|y) = (1 / Z(y,γ)) · Q(t)^(1−γ) · P(x[y],t)^γ
• Z(y,γ): normalization
• Q(t): marginal of T in Q
• P(x[y],t): prediction of T using P

The IB-EM Algorithm for fixed γ
• Iterate until convergence
  E-Step: Maximize L_IB-EM by optimizing Q
  M-Step: Maximize L_IB-EM by optimizing P (same as standard M-step)
• Each step improves L_IB-EM
• Guaranteed to converge
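
The fixed-γ loop can be sketched as follows. The sketch assumes the objective L_IB-EM = γ · F[Q,P] − (1 − γ) · I(T;Y) with F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), and an E-step that iterates the self-consistent update Q(t|y) ∝ Q(t)^(1−γ) · P(x[y],t)^γ obtained by maximizing that objective over Q. The naive Bayes model with binary features, the toy data, and every function name are assumptions made for illustration.

```python
import math

def joint(x, t, prior, cpt):
    """P(x, t) for a naive Bayes model: prior[t] * prod_i P(x_i | t)."""
    p = prior[t]
    for i, xi in enumerate(x):
        p *= cpt[t][i] if xi == 1 else 1.0 - cpt[t][i]
    return p

def marginal(q):
    """Q(t) under a uniform distribution over instance IDs."""
    return [sum(row[t] for row in q) / len(q) for t in range(len(q[0]))]

def ib_e_step(data, q, prior, cpt, gamma, sweeps=50):
    """Repeat Q(t|y) <- Q(t)^(1-gamma) * P(x[y],t)^gamma / Z(y,gamma)."""
    for _ in range(sweeps):
        q_t = marginal(q)
        new_q = []
        for x in data:
            w = [q_t[t] ** (1.0 - gamma) * joint(x, t, prior, cpt) ** gamma
                 for t in range(len(prior))]
            z = sum(w)
            new_q.append([wi / z for wi in w])
        q = new_q
    return q

def m_step(data, q):
    """Standard M-step: weighted relative frequencies under Q."""
    m, n, k = len(data), len(data[0]), len(q[0])
    mass = [sum(q[y][t] for y in range(m)) for t in range(k)]
    prior = [mt / m for mt in mass]
    cpt = [[sum(q[y][t] * data[y][i] for y in range(m)) / mass[t]
            if mass[t] > 0.0 else 0.5
            for i in range(n)] for t in range(k)]
    return prior, cpt

def objective(data, q, prior, cpt, gamma):
    """L_IB-EM = gamma * F[Q,P] - (1 - gamma) * I(T;Y), in nats."""
    m, q_t = len(data), marginal(q)
    f = info = 0.0
    for x, qy in zip(data, q):
        for t, qt in enumerate(qy):
            if qt > 0.0:
                # The floor only guards against log(0) caused by rounding.
                p = max(joint(x, t, prior, cpt), 1e-300)
                f += qt / m * (math.log(p) - math.log(qt))
                info += qt / m * math.log(qt / q_t[t])
    return gamma * f - (1.0 - gamma) * info

def ib_em(data, q, prior, cpt, gamma, iters=20):
    """Alternate the two steps at a fixed gamma and record the objective."""
    history = []
    for _ in range(iters):
        q = ib_e_step(data, q, prior, cpt, gamma)
        prior, cpt = m_step(data, q)
        history.append(objective(data, q, prior, cpt, gamma))
    return q, prior, cpt, history

data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
prior0 = [0.5, 0.5]
cpt0 = [[0.7, 0.6, 0.4], [0.3, 0.4, 0.6]]
q0 = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
q, prior, cpt, history = ib_em(data, q0, prior0, cpt0, gamma=0.5)
```

Two limiting cases are easy to check by hand. At γ=0 the update gives every instance the same Q(T|Y), so the ID is fully compressed. At γ=1 it reduces to the posterior P(t | x[y]), which is the standard EM E-step.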

Information Bottleneck EM
Target: L_IB-EM = γ · (EM target) − (1 − γ) · (information between hidden and ID)
In the rest of the talk…
• Understanding this objective
• How to use it to learn better models

Continuation
Follow the ridge from the optimum at γ=0
[Figure: L_IB-EM over Q and γ, with γ running from 0 (easy) to 1 (hard)]

Continuation
• Recall: if Q is a local maximum of L_IB-EM, then the condition on Q(T|Y) holds for all t and y
• We want to follow a path in (Q, γ) space so that Q is a local maximum for all γ
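
Continuation Step
• Start at (Q, γ) so that the local-maximum condition holds
• Compute gradient
• Take direction
• Take a step in the desired direction
[Figure: path in (Q, γ) space, γ from 0 to 1, with the start point marked]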

Staying on the ridge
• Potential problem: direction is tangent to the path → miss the optimum
• Solution: use EM steps to regain the path
[Figure: path in (Q, γ) space, γ from 0 to 1, with the start point marked]

The IB-EM Algorithm
• Set γ=0 (start at easy solution)
• Iterate until γ=1 (EM solution is reached)
  • Iterate (stay on the ridge)
    • E-Step: Maximize L_IB-EM by optimizing Q
    • M-Step: Maximize L_IB-EM by optimizing P
  • Step (follow the ridge)
    • Compute gradient and direction
    • Take the step by changing γ and Q

Calibrating the step size
• Potential problem:
  • Step size too small → too slow
  • Step size too large → overshoot target
[Figure: path in (Q, γ) space, γ from 0 to 1; an overshooting step ends at an inferior solution]

Calibrating the step size
Recall that I(T;Y) measures compression of the ID
When I(T;Y) rises, more of the data is captured
• Non-parametric: involves only Q
• Can be bounded: I(T;Y) ≤ log₂|T|
[Figure: two plots of I(T;Y) (0 to 1.5) against γ (0 to 1). Naive: too sparse in the “interesting” area. Use change in I(T;Y): evaluations concentrate in the “interesting” area]
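
The idea of calibrating the step by the change in I(T;Y) can be sketched with a deliberately simplified stand-in for the continuation procedure. Instead of following the gradient-based direction in (Q, γ) space, the sketch raises γ in plain increments, warm-starts IB-EM from the previous Q, and halves the increment whenever I(T;Y) changes by more than a threshold. The naive Bayes model, the update Q(t|y) ∝ Q(t)^(1−γ) · P(x[y],t)^γ, the small symmetry-breaking perturbation, the thresholds, and all names are assumptions made for illustration.

```python
import math
import random

def joint(x, t, prior, cpt):
    """P(x, t) for a naive Bayes model with binary features."""
    p = prior[t]
    for i, xi in enumerate(x):
        p *= cpt[t][i] if xi == 1 else 1.0 - cpt[t][i]
    return p

def marginal(q):
    return [sum(row[t] for row in q) / len(q) for t in range(len(q[0]))]

def mutual_info(q):
    """I(T;Y) in bits for uniform Q(Y)."""
    q_t = marginal(q)
    return sum(p / len(q) * math.log2(p / q_t[t])
               for row in q for t, p in enumerate(row) if p > 0.0)

def perturb(q, rng, scale=0.01):
    """Add small non-negative noise so identical states can separate (assumption)."""
    out = []
    for row in q:
        noisy = [p + rng.uniform(0.0, scale) for p in row]
        s = sum(noisy)
        out.append([p / s for p in noisy])
    return out

def ib_em_fixed(data, q, gamma, iters=30):
    """Alternate the M-step and one IB-EM update of Q at a fixed gamma."""
    m, n, k = len(data), len(data[0]), len(q[0])
    for _ in range(iters):
        mass = [sum(q[y][t] for y in range(m)) for t in range(k)]
        prior = [mt / m for mt in mass]
        cpt = [[sum(q[y][t] * data[y][i] for y in range(m)) / mass[t]
                if mass[t] > 0.0 else 0.5
                for i in range(n)] for t in range(k)]
        q_t = marginal(q)
        new_q = []
        for x in data:
            w = [q_t[t] ** (1.0 - gamma) * joint(x, t, prior, cpt) ** gamma
                 for t in range(k)]
            z = sum(w)
            new_q.append([wi / z for wi in w])
        q = new_q
    return q

def continuation(data, k, max_step=0.25, max_delta=0.2, min_step=1e-3, seed=0):
    """Move gamma from 0 to 1, shrinking the step when I(T;Y) jumps."""
    rng = random.Random(seed)
    q = [[1.0 / k] * k for _ in data]     # gamma = 0: Q(T|Y) = Q(T)
    gamma, info, step, path = 0.0, 0.0, max_step, []
    while gamma < 1.0:
        trial = min(1.0, gamma + step)
        new_q = ib_em_fixed(data, perturb(q, rng), trial)
        new_info = mutual_info(new_q)
        if abs(new_info - info) > max_delta and step > min_step:
            step /= 2.0                   # too large a change: retry closer
            continue
        gamma, q, info = trial, new_q, new_info
        path.append((gamma, info))
        step = min(2.0 * step, max_step)  # quiet region: allow longer steps
    return q, path

data = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
q, path = continuation(data, k=2)
```

Each entry of `path` is an evaluated γ with its I(T;Y). Where I(T;Y) changes quickly the evaluated values of γ end up closer together, which is the effect the calibrated schedule is after.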

The Stock Dataset [Boyen et al., 1999]
Naive Bayes model
Daily changes of 20 NASDAQ stocks; 1213 train, 303 test
• IB-EM outperforms the best of the EM solutions
• I(T;Y) follows changes of likelihood
• Continuation roughly follows the region of change (marks show the evaluated values of γ)
[Figure: train likelihood (−23 to −19) and I(T;Y) (0 to 1.5) against γ (0 to 1); curves labeled IB-EM and Best of EM]

Multiple Hidden Variables
We want to learn a model with many hiddens
• Naive: potentially exponential in # of hiddens
• Variational approximation: use factorized form (Mean Field) for Q(T|Y)
L_IB-EM = γ · (Variational EM) − (1 − γ) · Regularization
[Friedman et al., 2002]

The USPS Digits dataset
400 samples, 21 hiddens
• Superior to all Mean Field EM runs
• Time ≈ single exact EM run
Running times: Mean Field EM 1 min/run; single IB-EM 27 min; exact EM 25 min/run
3/50 EM runs are comparable to IB-EM: EM needs x17 time for similar results
Offers good value for your time!
[Figure: test log-loss per instance (−342 to −330) against percentage of random runs (20 to 100)]

Yeast Stress Response
173 experiments (variables), 6152 genes (samples), 25 hidden variables
• Superior to all Mean Field EM runs
• An order of magnitude faster than exact EM
Running times: Mean Field EM ~0.5 hours; IB-EM ~6 hours; exact EM >60 hours
Effective when the exact solution becomes intractable!
[Figure: test log-loss per instance (−151.5 to −147.5) against percentage of random runs (0 to 100); label: 5-24 experiments]

Summary
New framework for learning hidden variables
• Formal relation of Bottleneck and EM
• Continuation for bypassing local maxima
• Flexible: structure / variational approximation
Future Work
• Learn optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching structure

Relation to Weight Annealing [Elidan et al., 2002]
Init: temp = hot
Iterate until temp = cold
• Perturb w according to temp
• Use Q_W and optimize
• Cool down
[Figure: DATA over X1 … XN with instance IDs Y = 1 … M and an instance weight W per row]
Similarities:
• Change in empirical Q
• Morph towards EM solution
Differences:
• IB-EM uses info. regularization
• IB-EM uses continuation
• WA requires cooling policy
• WA applicable for wider range of problems

Relation to Deterministic Annealing
Init: temp = hot
Iterate until temp = cold
• “Insert” entropy according to temp into model
• Optimize noisy model
• Cool down
[Figure: DATA over X1 … XN with instance IDs Y = 1 … M]
Similarities:
• Use information measure
• Morph towards EM solution
Differences:
• DA parameterization dependent
• IB-EM uses continuation
• DA requires cooling policy
• DA applicable for wider range of problems