Information Bottleneck EM

Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel

Learning with Hidden Variables
Input: DATA over X1 … XN, with the values of T unobserved (?)
Output: A model P(X,T)
[Figure: network over T, X1, X2, X3]
Problem: No closed-form solution for ML estimation
• Use Expectation Maximization (EM)
Problem: Stuck in inferior local maxima
• Random restarts
• Deterministic
• Simulated annealing
[Figure: Likelihood (0 to 1) plotted against Params (0 to 1)]
EM + information regularization for learning parameters

Learning Parameters
Input: DATA over X1 … XN
Output: A model P(X)
[Figure: network over X1, X2, X3]
• Empirical distribution Q(X)
• Parametrization of P:
  P(X1) = Q(X1)
  P(X2|X1) = Q(X2|X1)
  P(X3|X1) = Q(X3|X1)
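With complete data, the parametrization above amounts to counting. A minimal sketch, assuming discrete values and the structure X1 → X2, X1 → X3 implied by the three equalities (the function name `fit_tree` and the toy data are hypothetical, not from the slides):

```python
from collections import Counter

def fit_tree(data):
    """Estimate P(X1), P(X2|X1), P(X3|X1) from complete data by counting.

    data: list of (x1, x2, x3) tuples. Each returned conditional table is
    keyed by (child_value, parent_value).
    """
    n = len(data)
    c1 = Counter(x1 for x1, _, _ in data)
    c12 = Counter((x1, x2) for x1, x2, _ in data)
    c13 = Counter((x1, x3) for x1, _, x3 in data)
    p1 = {a: c / n for a, c in c1.items()}                 # P(X1=a) = Q(X1=a)
    p2 = {(b, a): c / c1[a] for (a, b), c in c12.items()}  # P(X2=b | X1=a)
    p3 = {(b, a): c / c1[a] for (a, b), c in c13.items()}  # P(X3=b | X1=a)
    return p1, p2, p3

# Toy data set, invented for illustration only.
data = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
p1, p2, p3 = fit_tree(data)
```

Value combinations that never occur in the data simply have no entry in the tables; a practical version would likely add smoothing.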

Learning with Hidden Variables
Input: DATA over X1 … XN, with T unobserved (?) and an instance ID Y = 1, 2, 3, 4, …, M
Desired structure: [Figure: network over Y, T, X1, X2, X3]
• Empirical distribution Q(X,T) = ?
• For each instance ID, complete the value of T
• Empirical distribution Q(X,T,Y) = guess of empirical distribution: Q(X,T,Y) = Q(X,Y)Q(T|Y)
• EM iterations
• Parametrization for P

EM Functional
The EM Algorithm:
• E-Step: Generate empirical distribution
• M-Step: Maximize using the empirical distribution from the E-step
⇒ EM is equivalent to optimizing a function of Q and P
⇒ Each step increases the value of the functional
[Neal and Hinton, 1998]
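The formula on this slide did not survive extraction. For reference, the functional from Neal and Hinton's analysis is usually written as below. The notation is assumed here and may differ from the slide: F names the functional, Y is the instance ID, x[y] denotes the observed values of instance y, and Q(Y) is uniform over the M instances.

```latex
F[Q, P] \;=\; \mathbb{E}_{Q}\big[\log P(X, T)\big] \;+\; H_{Q}(T \mid Y)
```

Maximizing F over Q with P fixed gives Q(t|y) = P(t | x[y]), which is the E-step; maximizing F over P with Q fixed is the M-step.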

Information Bottleneck EM
Target: L_IB-EM = γ · (EM target) − (1 − γ) · (information between hidden and ID)
In the rest of the talk…
• Understanding this objective
• How to use it to learn better models

Information Regularization
Motivating idea:
• Fit training data: Set T to be the instance ID to "predict" X
• Generalization: "Forget" the ID and keep the essence of X
Objective: (lower bound of) likelihood of P vs. compression of instance ID
• Parameter-free regularization of Q
[Tishby et al., 1999]
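Written out, the trade-off between the two sides of the objective can be sketched as follows. The symbols are assumptions: F[Q,P] stands for the EM functional (the lower bound on the likelihood of P), I_Q(T;Y) for the information between the hidden variable and the instance ID, and γ for the trade-off parameter. The form matches the "L_IB-EM = (Variational EM) − (1 − γ) Regularization" line on the later Multiple Hidden Variables slide.

```latex
\mathcal{L}_{\text{IB-EM}} \;=\; \gamma \, F[Q, P] \;-\; (1 - \gamma)\, I_{Q}(T; Y),
\qquad 0 \le \gamma \le 1
```

At γ = 0 only the compression term remains, so any Q with Q(T|Y) = Q(T) is optimal; at γ = 1 the objective reduces to the EM functional.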

Clustering example (γ = 0)
[Figure: instances 1 to 11; the target is annotated with its "EM target" and "Compression measure" terms]
• γ = 0: total compression

Clustering example (γ = 1)
[Figure: instances 1 to 11; the target is annotated with its "EM target" and "Compression measure" terms]
• γ = 1: total preservation, T ≡ ID

Clustering example (γ = ?)
[Figure: instances 1 to 11; the target is annotated with its "EM target" and "Compression measure" terms]
• Desired: γ = ?, with |T| = 2

Information Bottleneck EM
[Formula: the IB-EM target, with the "EM functional" term labeled]
• Formal equivalence with Information Bottleneck
• At γ = 1, EM and Information Bottleneck coincide
[Generalizing result of Slonim and Weiss for univariate case]

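Information Bottleneck EM
[Formula: the IB-EM target, with the "EM functional" term labeled]
• Formal equivalence with Information Bottleneck
• The maximum over Q(T|Y) is obtained when Q(t|y) is a product of three labeled factors: a normalization term, the marginal of T in Q, and the prediction of T using P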
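The equation on this slide was lost, but its three labels pin down the shape. A reconstruction, assuming the target is γ F[Q,P] − (1 − γ) I_Q(T;Y) with uniform Q(Y), follows from setting the derivative with respect to Q(t|y) to zero under the constraint that Q(·|y) sums to one. Z(y, γ) and x[y] are notation assumed here.

```latex
Q(t \mid y) \;=\; \frac{1}{Z(y, \gamma)} \; Q(t)^{\,1-\gamma} \; P\big(t \mid x[y]\big)^{\gamma},
\qquad Q(t) \;=\; \frac{1}{M} \sum_{y=1}^{M} Q(t \mid y)
```

Using the joint P(x[y], t) in place of P(t | x[y]) changes only the normalization Z(y, γ). Because Q(t) itself depends on Q(t|y), this is a self-consistent condition rather than a closed-form assignment.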

The IB-EM Algorithm for fixed γ
• Iterate until convergence
  E-Step: Maximize L_IB-EM by optimizing Q
  M-Step: Maximize L_IB-EM by optimizing P (same as standard M-step)
• Each step improves L_IB-EM
• Guaranteed to converge
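The loop above can be sketched for one concrete model. This is an illustration under stated assumptions, not the authors' implementation: the model is a naive Bayes mixture over binary features, the E-step applies one pass of the self-consistent update Q(t|y) ∝ Q(t)^(1−γ) P(x[y],t)^γ, a fixed iteration count stands in for a convergence test, and the M-step adds a small smoothing constant `eps` to avoid zero probabilities. All function names are hypothetical.

```python
import random

def joint(x, t, prior, theta):
    """P(x, T=t) for a naive Bayes model with binary features."""
    p = prior[t]
    for xi, th in zip(x, theta[t]):
        p *= th if xi == 1 else 1.0 - th
    return p

def e_step(data, q, prior, theta, gamma):
    """One pass of Q(t|y) proportional to Q(t)^(1-gamma) * P(x[y],t)^gamma."""
    m, k = len(data), len(prior)
    # Marginal of T in Q, with Q(y) uniform over the m instances.
    q_t = [sum(q[y][t] for y in range(m)) / m for t in range(k)]
    new_q = []
    for x in data:
        w = [q_t[t] ** (1.0 - gamma) * joint(x, t, prior, theta) ** gamma
             for t in range(k)]
        z = sum(w)  # normalization term for this instance
        new_q.append([wi / z for wi in w])
    return new_q

def m_step(data, q, eps=1e-3):
    """Standard M-step from expected counts; eps smoothing is an assumption."""
    m, k, n = len(data), len(q[0]), len(data[0])
    prior, theta = [], []
    for t in range(k):
        w = sum(q[y][t] for y in range(m))
        prior.append((w + eps) / (m + k * eps))
        theta.append([
            (sum(q[y][t] * data[y][i] for y in range(m)) + eps) / (w + 2 * eps)
            for i in range(n)
        ])
    return prior, theta

def ib_em_fixed_gamma(data, k, gamma, iters=50, seed=0):
    """Alternate E- and M-steps at a fixed gamma, from a random soft Q."""
    rng = random.Random(seed)
    q = []
    for _ in data:
        row = [rng.random() + 0.5 for _ in range(k)]
        s = sum(row)
        q.append([r / s for r in row])
    prior, theta = m_step(data, q)
    for _ in range(iters):
        q = e_step(data, q, prior, theta, gamma)
        prior, theta = m_step(data, q)
    return q, prior, theta

# Toy binary data set, invented for illustration only.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
q, prior, theta = ib_em_fixed_gamma(data, k=2, gamma=0.7)
```

With gamma = 0 the E-step returns the same distribution Q(t) for every instance (total compression), and with gamma = 1 it is the usual posterior computation of EM.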

Information Bottleneck EM
Target: L_IB-EM = γ · (EM target) − (1 − γ) · (information between hidden and ID)
In the rest of the talk…
• Understanding this objective
• How to use it to learn better models

Continuation
[Figure: L_IB-EM over Q and γ (0 to 1); marked "easy" and "hard"]
• Follow the ridge from the optimum at γ = 0

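Continuation
[Figure: path in the (Q, γ) plane]
• Recall: if Q is a local maximum of L_IB-EM, then the condition on Q(t|y) for a maximum holds
• We want to follow a path in (Q, γ) space so that this condition holds for all t and y
• Result: a local maximum for all γ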

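Continuation Step
[Figure: path in the (Q, γ) plane, γ from 0 to 1, with the start point marked]
• Start at (Q, γ) so that the condition for a maximum holds
• Compute the gradient
• Take the orthogonal (⊥) direction
• Take a step in the desired direction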
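The equations on the continuation slides were lost. One way to write the idea, as a sketch under assumptions, is to take the logarithm of the condition for a maximum, Q(t|y) = Q(t)^(1−γ) P(t | x[y])^γ / Z(y, γ), and require it to keep holding along the path. The name G and the notation x[y], Z(y, γ) are introduced here for illustration; how the dependence of P on Q is treated during the step is not recoverable from the slides.

```latex
\begin{align*}
G_{t,y}(Q, \gamma) &= -\log Q(t \mid y) + (1-\gamma)\log Q(t)
   + \gamma \log P\big(t \mid x[y]\big) - \log Z(y, \gamma) = 0
   && \text{for all } t, y \\
\nabla_{(Q,\gamma)}\, G_{t,y} \cdot (\Delta Q, \Delta\gamma) &= 0
   && \text{for all } t, y
\end{align*}
```

The second line says that, to first order, a step (ΔQ, Δγ) keeps every G_{t,y} at zero exactly when it is orthogonal to all of the gradients, which is the "⊥ direction" in the bullets.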

Staying on the ridge
[Figure: path in the (Q, γ) plane, γ from 0 to 1, with the start point marked]
• Potential problem: Direction is tangent to the path ⇒ miss the optimum
• Solution: Use EM steps to regain the path

The IB-EM Algorithm
• Set γ = 0 (start at easy solution)
• Iterate until γ = 1 (EM solution is reached)
  • Iterate (stay on the ridge)
    • E-Step: Maximize L_IB-EM by optimizing Q
    • M-Step: Maximize L_IB-EM by optimizing P
  • Step (follow the ridge)
    • Compute gradient and orthogonal (⊥) direction
    • Take the step by changing γ and Q

Calibrating the step size
[Figure: path in the (Q, γ) plane, γ from 0 to 1, ending at an inferior solution]
• Potential problem:
  • Step size too small ⇒ too slow
  • Step size too large ⇒ overshoot target

Calibrating the step size
• Recall that I(T;Y) measures compression of ID
• When I(T;Y) rises, more of the data is captured
• Non-parametric: involves only Q
• Can be bounded: I(T;Y) ≤ log2|T|
[Figure: two panels of I(T;Y) (0 to 1.5) against γ (0 to 1), titled "Naive" and "Use change in I(T;Y)"; annotations: "Interesting" area, "Too sparse"]
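Since I(T;Y) involves only Q, it can be computed directly from the soft assignments. A minimal sketch, assuming Q(y) is uniform over instances and `q[y][t]` holds Q(t|y) (the function name and the two example tables are hypothetical):

```python
import math

def mutual_information_bits(q):
    """I(T;Y) in bits, where q[y][t] = Q(t|y) and Q(y) is uniform."""
    m, k = len(q), len(q[0])
    q_t = [sum(row[t] for row in q) / m for t in range(k)]  # marginal Q(t)
    total = 0.0
    for row in q:
        for t, p in enumerate(row):
            if p > 0.0:  # terms with Q(t|y) = 0 contribute nothing
                total += p * math.log2(p / q_t[t])
    return total / m

# Total compression: every instance has the same distribution over T.
uniform = [[0.5, 0.5]] * 4
# Hard split of four instances into two equally sized clusters.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(mutual_information_bits(uniform))  # 0.0
print(mutual_information_bits(hard))     # 1.0, the bound log2|T| for |T| = 2
```

Tracking how this quantity changes between consecutive values of γ is one way to judge whether a step was too coarse, in the spirit of the "Use change in I(T;Y)" panel.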

The Stock Dataset
• Naive Bayes model
• Daily changes of 20 NASDAQ stocks; 1213 train, 303 test
• IB-EM outperforms best of EM solutions
• I(T;Y) follows changes of likelihood
• Continuation approximately follows the region of change (× marks evaluated γ)
[Figure: I(T;Y) (0 to 1.5) and train likelihood (-23 to -19) against γ (0 to 1); curves labeled "IB-EM" and "Best of EM"]
[Boyen et al., 1999]

Multiple Hidden Variables
[Figure: network with instance ID Y and several hidden variables]
• We want to learn a model with many hiddens
• Naive: Potentially exponential in # of hiddens
• Variational approximation: use factorized form (Mean Field), Q(T|Y) ≈ ∏_i Q(T_i|Y)
• L_IB-EM = γ (Variational EM) − (1 − γ) Regularization
[Friedman et al., 2002]

The USPS Digits dataset
• 400 samples, 21 hiddens
• Superior to all Mean Field EM runs
• Time ≈ single exact EM run
• 3/50 EM runs are ≥ IB-EM: EM needs ≈ x17 time for similar results
• Timings: single IB-EM 27 min; exact EM 25 min/run; Mean Field EM 1 min/run
[Figure: test log-loss / instance (-342 to -330) against percentage of random runs (20 to 100)]
Offers good value for your time!

Yeast Stress Response
• 173 experiments (variables), 6152 genes (samples)
• 25 hidden variables
• 5-24 experiments
• Superior to all Mean Field EM runs
• An order of magnitude faster than exact EM
• Timings: IB-EM ~6 hours; exact EM >60 hours; Mean Field EM ~0.5 hours
[Figure: test log-loss / instance (-151.5 to -147.5) against percentage of random runs (0 to 100)]
Effective when exact solution becomes intractable!

Summary
New framework for learning hidden variables
• Formal relation of Bottleneck and EM
• Continuation for bypassing local maxima
• Flexible: structure / variational approximation
Future Work
• Learn optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching structure

Relation to Weight Annealing
• Init: temp = hot
• Iterate until temp = cold
  • Perturb w ∝ temp
  • Use QW and optimize
  • Cool down
[Figure: DATA with instances Y = 1, 2, 3, 4, …, M over X1 … XN, each with a weight W]
Similarities:
• Change in empirical Q
• Morph towards EM solution
Differences:
• IB-EM uses info. regularization
• IB-EM uses continuation
• WA requires cooling policy
• WA applicable for wider range of problems
[Elidan et al., 2002]

Relation to Deterministic Annealing
• Init: temp = hot
• Iterate until temp = cold
  • "Insert" entropy ∝ temp into model
  • Optimize noisy model
  • Cool down
[Figure: DATA with instances Y = 1, 2, 3, 4, …, M over X1 … XN]
Similarities:
• Use information measure
• Morph towards EM solution
Differences:
• DA parameterization dependent
• IB-EM uses continuation
• DA requires cooling policy
• DA applicable for wider range of problems
