
NIPS 2003 Workshop on Information Theory and Learning: The Bottleneck and Distortion Approach
Organizers: Thomas Gedeon, Naftali Tishby
http://www.cs.huji.ac.il/~tishby/NIPS-Workshop



Saturday, December 13

Morning session

  • 7:30-8:20 N. Tishby: Introduction – A new look at Shannon’s Information Theory

  • 8:20-9:00 T. Gedeon: Mathematical structure of Information Distortion methods

  • 9:00-9:30 D. Miller: Deterministic annealing

  • 9:30-10:00 N. Slonim: Multivariate Information Bottleneck

  • 10:00-10:30 B. Mumey: Optimal Mutual Information Quantization is NP-complete

Afternoon session

  • 16:00-16:20 G. Elidan: The Information Bottleneck EM algorithm

  • 16:20-16:40 A. Globerson: Sufficient Dimensionality Reduction with Side Information

  • 16:40-17:00 A. Parker: Phase Transitions in the Information Distortion

  • 17:00-17:20 A. Dimitrov: Information Distortion as a model of sensory processing

    Break

  • 17:40-18:00 J. Sinkkonen: IB-type clustering for continuous finite data

  • 18:00-18:20 G. Chechik: GIB – Gaussian Information Bottleneck

  • 18:20-18:40 Y. Crammer: IB and data-representation – Bregman to the rescue

  • 18:40-19:00 B. Westover: Achievable rates for Pattern Recognition



Outline:

The fundamental dilemma: Simplicity/Complexity versus Accuracy

  • Lessons from Statistical Physics for Machine Learning

  • Is there a “right level” of description?

  • A Variational Principle

    • Shannon’s source and channel coding – dual problems

    • [Mutual] Information as the Fundamental Quantity

    • Information Bottleneck (IB) and Efficient Representations

    • Finite Data Issues

  • Algorithms, Applications and Extensions

    • Words, documents and meaning…

    • Understanding neural codes …

    • Quantization of side-information

    • Relevant linear dimension reduction: Gaussian IB

    • Using “irrelevant” side information

    • Multivariate-IB and Graphical Models

    • Sufficient Dimensionality Reduction (SDR)

  • The importance of being principled…


    Learning Theory: The fundamental dilemma…

    Tradeoff between accuracy and simplicity.

    Good models should enable Prediction of new data…

    [Figure: data points (X, Y) with a fitted model y = f(x)]


    Or in unsupervised learning …

    Cluster models?

    [Figure: unlabeled data points in the (X, Y) plane grouped into clusters]



    Lessons from Statistical Physics

    Physical systems with many degrees of freedom:

    What is the right level of description?

    • “Microscopic Variables” – dynamical variables (positions, momenta,…)

    • “Variational Functions” – Thermodynamic potentials

      (Energy, Entropy, Free-energy,…)

    • Competition between order (energy) and disorder (entropy)…

    Emergent “relevant” description: Order Parameters (magnetization,…)



    Statistical Models in Learning and AI

    Statistical models of complex systems:

    What is the right level of description?

    • “Microscopic Variables” – probability distributions (over all observed and hidden variables)

    • “Variational Functions” – Log-likelihood, Entropy, …? (Multi-Information, “Free-energy”, …)

    • Competition between accuracy and complexity

    Emergent “relevant” description: Features (clusters, sufficient statistics,…)


    The Fundamental Dilemma (of science): Model Complexity vs Prediction Accuracy

    [Figure: the space of possible models/representations, trading off Accuracy (survivability) against Complexity (efficiency), under limited data and bounded computation]



    Can we quantify it…?

    When there is a (relevant) prediction or distortion measure

    Accuracy ↔ small average distortion (good predictions)

    Complexity ↔ short description (high compression)

    A general tradeoff between distortion and compression:

    Shannon’s Information Theory


    Shannon’s Information Theory

    [Block diagram: sent messages from the source pass through the encoder (Source coding = Compression, then Channel coding = Error Correction), travel over the channel as symbols (0110101001110…), and are recovered by the decoder (Channel decoding, then Source decoding = Decompression) as received messages at the receiver]

    Key quantities: Source Entropy, Channel Capacity, Rate vs Distortion, Capacity vs Efficiency.



    Compression and Mutual Information

    We need to index the max number of non-overlapping green blobs inside the blue blob:

    (mutual information!)
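    The counting argument behind this picture (reading the blue blob as the typical set of X and each green blob as the set of x’s consistent with one value of Y – the usual interpretation of this figure) gives

    \frac{2^{H(X)}}{2^{H(X\mid Y)}} \;=\; 2^{\,H(X)-H(X\mid Y)} \;=\; 2^{\,I(X;Y)}.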


    Axioms for “(multi) information” (with I. Nemenman)

    • M1: For two nodes, X and Y, the shared information,I[X;Y], is a continuous function of p(x,y).

    • M2: When one of the variables defines the other uniquely (1-1 mapping from X to Y) and both p(x)=p(y)=1/K, then I[X;Y] is a monotonic function of K.

    • M3: If X is partitioned into two subsets, X1 and X2, then the amount of shared information is additive in the following sense: I[X;Y]=I[(X1,X2);Y]+p(x1)I[X;Y|X1]+p(x2)I[X;Y|X2]

    • M4: Shared information is symmetric: I[X;Y]=I[Y;X].

    • M5:

      Theorem:


    The Dual Variational problems of IT: Rate vs Distortion and Capacity vs Efficiency

    Q: What is the simplest representation – fewer bits(/sec) (Rate) for a given expected distortion (accuracy)?

    A: (RDT, Shannon 1960) solve:

    Q: What is the maximum number of bits(/sec) that can be sent reliably (prediction rate) for a given efficiency of the channel (power, cost)?

    A: (Capacity-power tradeoff, Shannon 1956) solve:
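    In standard notation (writing the reproduction/representation variable as X̂, the distortion as d(x, x̂), and the input cost as e(x); these symbols are assumptions here), the two problems read:

    R(D) \;=\; \min_{\,p(\hat{x}\mid x)\;:\;\langle d(x,\hat{x})\rangle \le D\,} I(X;\hat{X}),
    \qquad
    C(E) \;=\; \max_{\,p(x)\;:\;\langle e(x)\rangle \le E\,} I(X;Y).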



    Rate-Distortion theory

    • The tradeoff between expected representation size (Rate) and the expected distortion is expressed by the rate-distortion function:

    • By introducing a Lagrange multiplier, β, we have a variational principle,

    • with the optimal assignment written out below.
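    In the usual notation (X̂ the reproduction variable, ⟨·⟩ an average over p(x)p(x̂|x); a standard form, assumed here):

    R(D) \;=\; \min_{\,p(\hat{x}\mid x)\,:\,\langle d(x,\hat{x})\rangle \le D\,} I(X;\hat{X}),
    \qquad
    \mathcal{F}\big[p(\hat{x}\mid x)\big] \;=\; I(X;\hat{X}) \;+\; \beta\,\langle d(x,\hat{x})\rangle,

    and minimizing the functional gives the exponential (Boltzmann-like) assignment

    p(\hat{x}\mid x) \;=\; \frac{p(\hat{x})}{Z(x,\beta)}\, e^{-\beta\, d(x,\hat{x})}.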


    The Rate-Distortion function, R(D), is the optimal rate for a given distortion and is a convex function.

    [Figure: convex R(D) curve in the rate (bits) vs. distortion (D) plane; the region above the curve is achievable, the region below it is impossible]


    The Capacity – Efficiency Tradeoff

    [Figure: concave C(E) curve, capacity (bits/erg) vs. efficiency E (power); the region below the curve is achievable; C(E) is the capacity at cost E]


    Double matching of source to channel – eliminates the need for coding (?!)

    [Figure: the R(D) and C(E) curves side by side, matched at the operating points D and E]

    In the case of a Gaussian channel (w. power constraint) and square distortion, double matching means:

    And No coding!
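    Concretely, for a Gaussian source of variance σ² with squared-error distortion sent over the power-constrained Gaussian channel (signal power P, noise power N; a standard example sketched here with one channel use per source symbol), simply scaling the source to power P and taking the conditional-mean estimate at the receiver already gives

    D \;=\; \frac{\sigma^2 N}{P+N},
    \qquad
    R(D) \;=\; \tfrac{1}{2}\log\!\Big(\frac{\sigma^2}{D}\Big) \;=\; \tfrac{1}{2}\log\!\Big(1+\frac{P}{N}\Big) \;=\; C,

    so the rate-distortion and capacity limits are met simultaneously without any coding.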



    “There is a curious and provocative duality between the properties of a source with a distortion measure and those of a channel … if we consider channels in which there is a “cost” associated with the different input letters…

    The solution to this problem amounts, mathematically, to maximizing a mutual information under linear inequality constraint… which leads to a capacity-cost function C(E) for the channel…

    In a somewhat dual way, evaluating the rate-distortion function R(D) for a source amounts, mathematically, to minimizing a mutual information … again with a linear inequality constraint.

    This duality can be pursued further and is related to the duality between past and future and the notions of control and knowledge. Thus we may have knowledge of the past but cannot control it; we may control the future but have no knowledge of it.”

    C. E. Shannon (1959)



    Bottlenecks and Neural Nets

    • Auto association: forcing compact representations

    • The bottleneck layer provides a “relevant code” of the input (the past) w.r.t. the output (the future)

    [Figure: a bottleneck network mapping Input to Output (Sample 1 / Sample 2, Past / Future)]



    Bottlenecks in Nature…

    (with E. Schneidman, Rob de Ruyter, W. Bialek, N. Brenner)



    • The idea: find a compressed variable, T, that enables a short encoding of X (small I(X;T)), while preserving as much as possible the information on the relevant signal Y (I(T;Y)).



    A Variational Principle

    We want a short representation of X that keeps the information about another variable, Y, if possible.
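    Formally (writing the compressed representation as T, with the Markov chain T – X – Y and a tradeoff parameter β; the symbol T is a notational choice here), the principle is to minimize over the stochastic mapping p(t|x):

    \mathcal{L}\big[p(t\mid x)\big] \;=\; I(X;T) \;-\; \beta\, I(T;Y).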



    Self Consistent Equations

    • Marginal:

    • Markov condition:

    • Bayes’ rule:
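    Written out (a standard statement of the IB self-consistent equations, in the same notation with T the compressed variable), these three ingredients give:

    p(t) \;=\; \sum_x p(t\mid x)\, p(x),
    \qquad
    p(y\mid t) \;=\; \frac{1}{p(t)} \sum_x p(y\mid x)\, p(t\mid x)\, p(x),
    \qquad
    p(t\mid x) \;=\; \frac{p(t)}{Z(x,\beta)} \exp\!\big(-\beta\, D_{KL}\big[p(y\mid x)\,\|\,p(y\mid t)\big]\big).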



    The emergent effective distortion measure:

    • Regular if p(y|x) is absolutely continuous w.r.t. p(y|t)

    • Small if t predicts y as well as x does


    I-Projections on a set of distributions (Csiszar 75,84)

    • The I-projection of a distribution q(x) on a convex set of distributions L:



    The Blahut-Arimoto Algorithm
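    A minimal numerical sketch of the alternating Blahut-Arimoto updates for the rate-distortion problem (assuming a finite source alphabet, a distortion matrix d(x, x̂), and numpy; function and variable names here are illustrative, not from the slides):

    import numpy as np

    def blahut_arimoto(p_x, d, beta, n_iter=500, tol=1e-10):
        """Alternating updates for R(D) at a fixed tradeoff beta.

        p_x : (nx,) source distribution p(x); d : (nx, nt) distortion matrix.
        Returns the rate R (bits), expected distortion D, and the test channel p(x_hat|x).
        """
        nx, nt = d.shape
        q = np.full(nt, 1.0 / nt)                        # reproduction marginal q(x_hat)
        for _ in range(n_iter):
            # p(x_hat|x) is proportional to q(x_hat) * exp(-beta * d(x, x_hat))
            logits = np.log(q)[None, :] - beta * d
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            p_t_x = np.exp(logits)
            p_t_x /= p_t_x.sum(axis=1, keepdims=True)
            q_new = p_x @ p_t_x                          # q(x_hat) = sum_x p(x) p(x_hat|x)
            if np.abs(q_new - q).max() < tol:
                q = q_new
                break
            q = q_new
        joint = p_x[:, None] * p_t_x                     # joint p(x, x_hat)
        D = float((joint * d).sum())
        mask = joint > 0
        R = float((joint[mask] * np.log2(joint[mask] / (p_x[:, None] * q[None, :])[mask])).sum())
        return R, D, p_t_x

    Sweeping beta from small to large values traces out the convex R(D) curve.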



    The Information BottleneckAlgorithm

    “free energy”
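    In the notation above, this “free energy” can be written as (a standard form, equivalent to the IB functional up to the constant β I(X;Y)):

    \mathcal{F}\big[p(t\mid x)\big] \;=\; I(X;T) \;+\; \beta\,\big\langle D_{KL}\big[p(y\mid x)\,\|\,p(y\mid t)\big]\big\rangle_{p(x)\,p(t\mid x)}.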



    The IB emergent effective distortion measure:

    The IB algorithm
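    A minimal sketch of the resulting alternating IB iterations, cycling through the three self-consistent equations above (assuming numpy, a finite joint distribution p(x, y) with p(x) > 0, and a fixed number of clusters; names are illustrative):

    import numpy as np

    def kl_rows(p, q, eps=1e-12):
        """KL divergence (in nats) between every row of p and every row of q."""
        p = p[:, None, :] + eps
        q = q[None, :, :] + eps
        return (p * (np.log(p) - np.log(q))).sum(axis=2)

    def information_bottleneck(p_xy, n_clusters, beta, n_iter=300, seed=0):
        """Iterative IB for a finite joint p(x, y); returns p(t|x), p(t), p(y|t)."""
        rng = np.random.default_rng(seed)
        p_x = p_xy.sum(axis=1)                            # assumes p(x) > 0 for every x
        p_y_x = p_xy / p_x[:, None]                       # p(y|x)
        p_t_x = rng.random((p_xy.shape[0], n_clusters))   # random soft encoder p(t|x)
        p_t_x /= p_t_x.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            p_t = np.maximum(p_x @ p_t_x, 1e-12)          # p(t) = sum_x p(x) p(t|x)
            p_y_t = (p_t_x * p_x[:, None]).T @ p_y_x      # p(y|t) via the Markov chain T - X - Y
            p_y_t /= p_t[:, None]
            d = kl_rows(p_y_x, p_y_t)                     # d(x,t) = KL[ p(y|x) || p(y|t) ]
            logits = np.log(p_t)[None, :] - beta * d      # p(t|x) proportional to p(t) exp(-beta d)
            logits -= logits.max(axis=1, keepdims=True)
            p_t_x = np.exp(logits)
            p_t_x /= p_t_x.sum(axis=1, keepdims=True)
        return p_t_x, p_t, p_y_t

    Small beta favors compression (few effective clusters); larger beta preserves more of I(T;Y).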



    Key Theorems

    IB Coding Theorem (Wyner 1975; Bachrach, Navot, Tishby 2003):

    The IB variational problem provides a tight convex lower bound on the expected size of the achievable representations of X that maintain at least a prescribed amount of mutual information on the variable Y; the bound can be achieved.

    Formally equivalent to “Channel coding with Side-Information” (introduced in a very different context by Wyner, 75).

    The same bound is true for the dual channel coding problem, and the optimal representation constitutes a “perfectly matched source-channel” and requires “no further coding”.

    IB Algorithm Convergence (Tishby 99; Friedman, Slonim, Tishby 01):

    The Generalized Arimoto-Blahut IB algorithm converges to a local minimum of the IB functional (including its generalized multivariate case).



    Summary

  • Shannon’s Information Theory suggests a compelling framework for quantifying the “Complexity-Accuracy” dilemma, by unifying source and channel coding as a Min Max of mutual information.

  • The IB method turns this unified coding principle into algorithms for extracting “informative” relevant structures from empirical joint probability distributions P(X1,X2)…

  • The Multivariate IB extends this framework to extract “informative” structures from multivariate distributions, P(X1,…,Xn), via Min Max I(X1,…,Xn), and Graphical models.

  • IB can be further extended for Sufficient Dimensionality Reduction – a method for finding informative continuous low-dimensional representations via Min Max I(X1,X2).



    Many thanks to:

    Fernando Pereira

    William Bialek

    Nir Friedman

    Noam Slonim

    Gal Chechik

    Amir Globerson

    Ran Bachrach

    Amir Navot

    Ilya Nemenman

