
Information Theory and Learning

Information Theory and Learning. Tony Bell, Helen Wills Neuroscience Institute, University of California at Berkeley. One input, one output, deterministic Infomax: match the input distribution to the non-linearity. Gradient descent learning rule to maximise the transferred information.



Presentation Transcript


  1. Information Theory and Learning. Tony Bell, Helen Wills Neuroscience Institute, University of California at Berkeley.

  2. One input, one output, deterministic Infomax: match the input distribution to the non-linearity.
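A minimal sketch of the matching idea, not the author's code: if the squashing function equals the cumulative distribution of the input, the output is uniform on (0, 1) and its entropy is as large as it can be. The example assumes a logistic input and a logistic sigmoid with a hypothetical slope `w`; the histogram entropy estimate and the bin count are illustrative choices.

```python
import numpy as np

def squash(x, w=1.0, w0=0.0):
    """Logistic non-linearity y = 1 / (1 + exp(-(w*x + w0)))."""
    return 1.0 / (1.0 + np.exp(-(w * x + w0)))

def histogram_entropy(y, bins=20):
    """Plug-in entropy estimate (in nats) of outputs that lie in [0, 1]."""
    counts, _ = np.histogram(y, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
x = rng.logistic(loc=0.0, scale=1.0, size=100_000)

# Matched slope: the sigmoid is exactly the CDF of x, so y is uniform.
h_matched = histogram_entropy(squash(x, w=1.0))
# Mismatched slopes pile the outputs up at the ends or in the middle.
h_too_steep = histogram_entropy(squash(x, w=5.0))
h_too_flat = histogram_entropy(squash(x, w=0.2))
```

With 20 bins, the matched case should come out close to the maximum possible value, ln 20, and the two mismatched cases below it.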

  3. Gradient descent learning rule to maximise the transferred information (deterministic, sensory only).
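For a single logistic unit y = 1 / (1 + exp(-(w x + w0))), the single-unit rule published by Bell & Sejnowski (1995) is Δw ∝ 1/w + x(1 - 2y) and Δw0 ∝ 1 - 2y. The sketch below applies that rule with batch averages; the data distribution, learning rate and iteration count are assumptions chosen for illustration.

```python
import numpy as np

def train_single_unit(x, lr=0.1, n_iter=3000):
    """Gradient ascent on the output entropy of one logistic unit."""
    w, w0 = 1.0, 0.0
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(w * x + w0)))
        # Batch-averaged form of: dw ~ 1/w + x(1 - 2y), dw0 ~ 1 - 2y
        w += lr * (1.0 / w + np.mean(x * (1.0 - 2.0 * y)))
        w0 += lr * np.mean(1.0 - 2.0 * y)
    return w, w0

rng = np.random.default_rng(0)
loc, scale = 1.0, 0.5                      # illustrative input statistics
x = rng.logistic(loc=loc, scale=scale, size=10_000)
w, w0 = train_single_unit(x)
```

For logistic inputs the learned sigmoid should approach the input's cumulative distribution, so `w` should end up near `1 / scale` and `w0` near `-loc / scale`, up to sampling error.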

  4. Examples of score functions: logistic and Laplacian. In stochastic gradient algorithms (online training), we dispense with the ensemble averages, giving a rule for a single training example and a Laplacian ‘prior’.
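The two score functions named on this slide can be sketched as follows, taking the score to be f(u) = d/du log p(u) for a unit-scale density p (an assumed convention). The numerical derivative is only there as a sanity check; all function names are illustrative.

```python
import numpy as np

def logistic_log_density(u):
    # p(u) = exp(-u) / (1 + exp(-u))**2
    return -u - 2.0 * np.log1p(np.exp(-u))

def laplacian_log_density(u):
    # p(u) = 0.5 * exp(-|u|)
    return -np.abs(u) - np.log(2.0)

def logistic_score(u):
    # 1 - 2y with y = 1 / (1 + exp(-u)); equal to -tanh(u / 2)
    return 1.0 - 2.0 / (1.0 + np.exp(-u))

def laplacian_score(u):
    # Derivative of -|u|; undefined at u = 0, where sign() returns 0
    return -np.sign(u)

def numerical_derivative(fn, u, eps=1e-5):
    return (fn(u + eps) - fn(u - eps)) / (2.0 * eps)

u = np.array([-3.0, -0.5, 0.7, 2.0])
logistic_ok = np.allclose(
    logistic_score(u), numerical_derivative(logistic_log_density, u), atol=1e-6)
laplacian_ok = np.allclose(
    laplacian_score(u), numerical_derivative(laplacian_log_density, u), atol=1e-6)
```

In an online setting, the chosen score is simply evaluated on the current example instead of being averaged over the ensemble.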

  5. Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly. The key quantity is the absolute determinant of the Jacobian matrix, which measures how stretchy the mapping is, for square or overcomplete transforms. Undercomplete transformations are not invertible, and require a more complex formula.
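For the square case only, the argument can be written out as below. This is a sketch under assumed notation (u = Wx, y_i = g(u_i), J the Jacobian of the map from x to y); the more complex undercomplete formula mentioned on the slide is not reconstructed here.

```latex
\begin{align}
p_y(\mathbf{y}) &= \frac{p_x(\mathbf{x})}{|J|},
  \qquad |J| = |\det W| \prod_i g'(u_i), \\
H(\mathbf{y}) &= H(\mathbf{x}) + E\big[\ln |J|\big], \\
\frac{\partial}{\partial W} E\big[\ln |J|\big]
  &= W^{-T} + E\big[f(\mathbf{u})\,\mathbf{x}^T\big],
  \qquad f(u_i) = \frac{g''(u_i)}{g'(u_i)}.
\end{align}
```

Since H(x) does not depend on W, maximising the output entropy means maximising the expected log stretch E[ln |J|]. For a logistic g, the score f(u_i) reduces to 1 - 2y_i.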

  6. Same theory for multiple dimensions: fire vectors into the unit hypercube uniformly. Post-multiplying the gradient by a positive definite transform rescales it optimally (the Natural Gradient, due to Amari), giving a pleasantly simple form.
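In the standard Infomax ICA literature, post-multiplying the gradient W^{-T} + E[f(u) x^T] by W^T W gives ΔW ∝ (I + E[f(u) u^T]) W, which needs no matrix inverse. The sketch below uses that form with a logistic score to unmix two Laplacian sources; the mixing matrix, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def logistic_score(u):
    return 1.0 - 2.0 / (1.0 + np.exp(-u))

def natural_gradient_ica(x, lr=0.05, n_iter=2000):
    """Batch natural-gradient Infomax: W += lr * (I + <f(u) u^T>) W."""
    n, m = x.shape                      # n channels, m samples
    w = np.eye(n)
    for _ in range(n_iter):
        u = w @ x
        w += lr * (np.eye(n) + logistic_score(u) @ u.T / m) @ w
    return w

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 10_000))       # independent super-Gaussian sources
a = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # hypothetical mixing matrix
x = a @ s
w = natural_gradient_ica(x)
p = w @ a                               # near a scaled permutation if unmixing worked
```

Each row of `p` should be dominated by a single entry, meaning each output recovers one source up to scale and ordering.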

  7. Decorrelation is not enough (it only gives a diagonal matrix): f gives higher-order statistics through its Taylor expansion.
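One way to make the Taylor-expansion remark concrete, assuming the logistic score f(u) = -tanh(u/2) and the natural-gradient fixed point I + E[f(u) u^T] = 0:

```latex
\begin{align}
f(u) &= -\tanh\!\left(\tfrac{u}{2}\right)
      = -\frac{u}{2} + \frac{u^3}{24} - \frac{u^5}{240} + \dots \\
E\big[f(u_i)\,u_j\big] &= -\frac{1}{2}E[u_i u_j] + \frac{1}{24}E[u_i^3 u_j]
      - \frac{1}{240}E[u_i^5 u_j] + \dots = 0 \qquad (i \neq j).
\end{align}
```

Decorrelation alone only sets the first term, E[u_i u_j], to zero. The fixed-point condition also constrains the higher-order cross-moments that appear in the remaining terms.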

  8. Infomax/ICA on image patches: learn co-ordinates for natural scenes. In this linear generative model, we want u = s: recover independent sources. After training, we calculate $A = W^{-1}$ and plot the columns. For 16x16 images, we get 256 bases.
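The plotting step can be sketched as follows. The matrix `w` is a random placeholder standing in for a trained unmixing matrix, and the row-major layout of pixels within a patch is an assumption.

```python
import numpy as np

patch_side = 16
n = patch_side * patch_side             # 256 pixels, hence 256 bases

rng = np.random.default_rng(0)
# Placeholder for a trained W (hypothetical values, not a learned filter bank).
w = np.eye(n) + 0.01 * rng.standard_normal((n, n))

a = np.linalg.inv(w)                    # A = W^{-1}
# Column k of A is the k-th basis function; reshape each one to a 16x16 image.
bases = a.T.reshape(n, patch_side, patch_side)
```

Each `bases[k]` can then be passed to an image plotting routine to display the learned basis functions as a grid.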

  9. f from logistic density

  10. f from Laplacian density

  11. f from Gaussian density

  12. But this does not actually make the neurons independent. Many joint densities p(u1,u2) are decorrelated but still radially symmetric: they factorise in polar co-ordinates, but not in Cartesian co-ordinates, unless they’re Gaussian. This happens when cells have similar position, spatial frequency, and orientation selectivity, but different phase. Dependent filters can combine to make non-linear complex cells (oriented but phase insensitive).
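A small numerical illustration of this point, using assumed radius distributions: a pair (u1, u2) with uniform angle is uncorrelated for any radius distribution, but its energies u1² and u2² stay correlated unless the radius is Rayleigh distributed, which is the Gaussian case.

```python
import numpy as np

def sample_radial(radius, rng):
    """Radially symmetric 2-D sample: uniform angle, given radii."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=radius.shape)
    return radius * np.cos(theta), radius * np.sin(theta)

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
n = 200_000

# Non-Gaussian radial density: exponentially distributed radius.
u1, u2 = sample_radial(rng.exponential(1.0, size=n), rng)
# Gaussian case: a Rayleigh radius gives two independent normals.
g1, g2 = sample_radial(rng.rayleigh(1.0, size=n), rng)

linear_corr = corr(u1, u2)              # close to 0: decorrelated
energy_corr = corr(u1**2, u2**2)        # clearly positive: still dependent
gauss_energy_corr = corr(g1**2, g2**2)  # close to 0: truly independent
```

For the exponential radius the population value of the energy correlation works out to 0.25, so the dependence is invisible to second-order statistics but easy to see in the squared outputs.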

  13. ‘Dependent’ Component Analysis. First, the maximum likelihood framework. What we have been doing is: Infomax ⇔ Maximum Likelihood ⇔ Minimum KL Divergence. We are fitting a model to the data: … or equivalently: … But a much more general model is the ‘energy-based’ model (Hinton): sum of functions on subsets of … with …
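The equivalence can be sketched with a standard identity, writing p for the data density and q for the model (the notation is assumed here, since the slide's own equations did not survive):

```latex
\begin{align}
D_{\mathrm{KL}}(p \,\|\, q)
  &= \int p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{q(\mathbf{x})}\, d\mathbf{x}
   = -H(p) - E_p\big[\ln q(\mathbf{x})\big], \\
\ln q(\mathbf{x}) &= \ln |\det W| + \sum_i \ln q_i(u_i),
  \qquad \mathbf{u} = W\mathbf{x}.
\end{align}
```

H(p) does not depend on the model, so minimising the KL divergence is the same as maximising the expected log-likelihood. For the square ICA model in the second line, that log-likelihood matches the Infomax objective E[ln |J|] when each non-linearity has derivative g_i' = q_i.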

  14. ‘Dependent’ Component Analysis. For the completely general model: the learning rule is: with the 2nd term reducing to -I (identity) in the case of ICA. Unfortunately this involves an intractable integral over the model q. Nonetheless, we can still work with all dependency models which are non-loopy hypergraphs. Learn as before, but with a modified score function. (Figure: a loopy hypergraph.)

  15. For example, we can split the space into subspaces such that the cells are independent between subspaces and dependent within the subspaces. E.g., for 4 cells: 1 3 2 4. We now show a sequence of symmetry-breakings occurring as we move, training on images, from a model which is one big 256-dimensional hyperball down to a model which is 64 four-dimensional hyperballs:
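A hedged sketch of what a subspace score function can look like. It assumes each subspace has the spherically symmetric density q(u_S) ∝ exp(-‖u_S‖) and that subspaces are consecutive blocks of cells; the slides use a logistic density, so this radial form is a stand-in chosen for simplicity, and `subspace_score` is a hypothetical name.

```python
import numpy as np

def subspace_score(u, subspace_size, eps=1e-12):
    """Score d/du log q(u) for q(u) ~ prod_S exp(-||u_S||).

    u has shape (n_cells, n_samples); cells are grouped into consecutive
    blocks of `subspace_size`. Within a block the score is -u_i / ||u_S||,
    so cells in the same block are coupled through the shared radius.
    """
    n, m = u.shape
    blocks = u.reshape(n // subspace_size, subspace_size, m)
    radius = np.sqrt((blocks ** 2).sum(axis=1, keepdims=True))
    return (-blocks / (radius + eps)).reshape(n, m)

rng = np.random.default_rng(0)
u = rng.standard_normal((256, 5))

f_ball = subspace_score(u, 256)   # one 256-dimensional hyperball
f_four = subspace_score(u, 4)     # 64 four-dimensional hyperballs
f_ica = subspace_score(u, 1)      # 256 one-dimensional subspaces
```

With a subspace size of 1 this reduces to the Laplacian score `-sign(u)`, so ordinary ICA appears as the limiting case of the same family.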

  16. Logistic density 1 subspace

  17. Logistic density 2 subspaces

  18. Logistic density 4 subspaces

  19. Logistic density 8 subspaces

  20. Logistic density 16 subspaces

  21. Logistic density 32 subspaces

  22. Logistic density 64 subspaces

  23. Topographic ICA. Arrange the cells in a 2D map with a statistical model q constructed from overlapping subsets. This is a loopy hypergraph, an un-normalised model, but it still gives a nice result… The hyperedges of our hypergraph are overlapping 4x4 neighbourhoods etc.

  24. That was from Hyvarinen & Hoyer. Here’s one from Osindero & Hinton.

  25. Conclusion. Well, we did get somewhere: we seem to have an information-theoretic explanation of some properties of area V1 of visual cortex:
      - simple cells (Olshausen & Field, Bell & Sejnowski)
      - complex cells (Hyvarinen & Hoyer)
      - topographic maps with singularities (Hyvarinen & Hoyer)
      - colour receptive fields (Doi & Lewicki)
      - direction sensitivity (van Hateren & Ruderman)
      But we are stuck on:
      - the gradient of the partition function
      - still working with rate models, not spiking neurons
      - no top-down feedback
      - no sensory-motor (all passive world modeling)

  26. References. The references for all the work in these 3 talks will be forwarded separately. If you don’t have access to them, email me at tbell@berkeley.edu and I’ll send them to you.
