Efficient Coding: From Retina Ganglion Cells To V2 Cells


Presentation Transcript


  1. Efficient Coding: From Retina Ganglion Cells To V2 Cells  Honghao Shan, Garrison W. Cottrell. The Temporal Dynamics of Learning Center, Gary's Unbelievable Research Unit (GURU), Computer Science and Engineering Department, Institute for Neural Computation, UCSD

  2. Introduction and Motivation • We have 10^11 − 10^12 neurons with ~10^15 connections between them - it seems highly unlikely that the features they respond to are learned by any supervised mechanism! • Hence unsupervised learning seems much more likely. • What is the correct learning rule? • Here we focus on the visual system.

  3. Introduction and Motivation • In V1, simple cells respond to oriented visual edges • In V1, complex cells respond to visual edges at nearby locations - they appear to pool the responses of simple cells • In V2, cell responses are already hard to characterize. • Eventually, there are cells that respond to faces, and even further in, respond to identity (faces and names).

  4. Introduction and Motivation • E.g., the “Halle Berry” neuron…

  5. Introduction and Motivation • If these are learned by unsupervised learning, then what is the correct learning rule? • What is the goal of the learning rule? • Hypothesis: visual perception serves to capture statistical structure of the visual inputs • Attneave (1954): the statistical structure can be measured by the redundancy of the inputs: I(x) = Σ_i H(x_i) − H(x) (minimized to zero when the x_i are independent). • Barlow (1961) suggested what has come to be called the efficient coding theory: the goal of early vision is to remove redundancy from the visual inputs. • The coding (outputs) should be as independent as possible
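As a quick numerical illustration of Attneave's redundancy measure (not part of the original slides), the sketch below computes I(x) = Σ_i H(x_i) − H(x) for two toy joint distributions over a pair of binary pixels; the distributions themselves are made up for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy joint distributions over two binary pixels (rows: pixel 1, cols: pixel 2).
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
correlated  = np.array([[0.45, 0.05],
                        [0.05, 0.45]])

for name, joint in [("independent", independent), ("correlated", correlated)]:
    h1 = entropy(joint.sum(axis=1))                 # H(x_1), marginal entropy of pixel 1
    h2 = entropy(joint.sum(axis=0))                 # H(x_2)
    redundancy = h1 + h2 - entropy(joint.ravel())   # I(x) = sum_i H(x_i) - H(x)
    print(f"{name}: redundancy = {redundancy:.3f} bits")
```

The independent case gives zero redundancy; the correlated case leaves about half a bit of structure for a code to remove.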

  6. Introduction and Motivation There have been a variety of implementations of the efficient coding theory: • Principal Components Analysis (PCA): provably optimal (in a least-squares sense) linear dimensionality reduction technique

  7. Introduction and Motivation • Principal Components Analysis (PCA) - but this only leads to uncorrelated outputs, and global receptive fields that look nothing like V1 receptive fields.

  8. Introduction and Motivation There have been a variety of implementations of the efficient coding theory: • Principal Components Analysis (PCA) - but this only leads to uncorrelated outputs, and global receptive fields • Independent Components Analysis (ICA) (Bell & Sejnowski) • Sparse Coding (Olshausen & Field) • These last two: • Lead to Gabor-like receptive fields (as we see in V1) • Turn out to be equivalent under certain assumptions.

  9. Introduction and Motivation • When applied to natural images, ICA and sparse coding give Gabor-like receptive fields, similar to those seen in simple cells in V1
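A minimal sketch of this result (not from the slides): running scikit-learn's FastICA on image patches typically yields localized, oriented, Gabor-like filters. It assumes scikit-learn ≥ 1.1 and scikit-image are available, uses a single built-in photograph as a stand-in for a natural-image dataset, and the patch size and component count are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import FastICA
from skimage import data

image = data.camera().astype(float)  # built-in photo as a stand-in for natural images
patch = 10
rng = np.random.default_rng(0)

# Sample 10,000 random 10x10 patches and remove each patch's mean (DC component).
rows = rng.integers(0, image.shape[0] - patch, 10000)
cols = rng.integers(0, image.shape[1] - patch, 10000)
X = np.stack([image[r:r+patch, c:c+patch].ravel() for r, c in zip(rows, cols)])
X -= X.mean(axis=1, keepdims=True)

# FastICA whitens the patches internally, then finds maximally non-Gaussian
# (sparse) components; the learned filters typically look like localized,
# oriented, Gabor-like edge detectors.
ica = FastICA(n_components=64, whiten="unit-variance", max_iter=500, random_state=0)
ica.fit(X)
filters = ica.components_.reshape(-1, patch, patch)
print(filters.shape)                 # (64, 10, 10) candidate receptive fields
```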

  10. Introduction and Motivation • There have been many attempts to go beyond a single layer (Karklin & Lewicki, 2006; Schwartz & Simoncelli, 2001; Hoyer & Hyvarinen, 2002), but: • They usually require a different learning rule • And do not lead to a way to do the next layer. • Or, like deep belief networks, do not have plausible receptive fields (Hinton, 2006)

  11. Our contribution • We have previously developed a method for applying ICA over and over, in order to get higher layer representations (Shan, Zhang, & Cottrell, NIPS, 2006/2007), called RICA • Recursive Independent Components Analysis: ICA->add nonlinearity->ICA->add nonlinearity… • In our paper, we showed that the second layer of ICA had interesting neural properties

  12. Roadmap • We describe ICA and our version of a hierarchical ICA, Recursive ICA (RICA) • We illustrate PCA and describe sparse PCA (SPCA) and the initial results • We investigate the receptive fields of the higher layers of RICA.

  13. Roadmap • We describe ICA and our version of a hierarchical ICA, Recursive ICA (RICA) • We illustrate PCA and describe sparse PCA (SPCA) and the initial results • We investigate the receptive fields of the higher layers of RICA.

  14. Independent Components Analysis • Sparse coding and ICA can be summarized by a generative model, in which the observed data vector x ∈ R^L is assumed to be a linear mixture of some (unknown) underlying signal sources s ∈ R^M: x = As + ε (x is the pixels, s is the neural firings), where A ∈ R^(L×M) denotes the (unknown) mixing matrix and ε ∈ R^L denotes additive Gaussian noise • I.e., the pixels are a linear combination of neural firings (underlying signal sources - what we believe is in the input).

  15. Independent Components Analysis x = As + ε • Two assumptions are imposed on the signal sources: • They are assumed to be independent • Each dimension of s is assumed to follow a sparse distribution, usually with a peak at zero and two heavy tails, unlike PCA, where the projections tend to follow a Gaussian. • Model parameters are adapted to make the generation of the observed x's likely, and to encourage the sparse prior.
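The sketch below (not from the slides) writes this generative model and the resulting inference objective in code: sparse Laplacian sources mixed by A plus Gaussian noise, with the cost of a candidate code s trading off reconstruction error against an L1 sparseness penalty. The matrix sizes, noise level, and penalty weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 16, 16                        # number of pixels, number of sources
A = rng.standard_normal((L, M))      # the (here: random) mixing matrix

s = rng.laplace(scale=1.0, size=M)           # independent, sparse sources
x = A @ s + 0.1 * rng.standard_normal(L)     # pixels = linear mixture + Gaussian noise

def neg_log_posterior(s_hat, x, A, sigma=0.1, lam=1.0):
    """Up to constants: -log p(x | s) - log p(s) under the model above."""
    recon = np.sum((x - A @ s_hat) ** 2) / (2 * sigma ** 2)  # Gaussian likelihood term
    prior = lam * np.sum(np.abs(s_hat))                      # Laplacian (sparse) prior term
    return recon + prior

print(neg_log_posterior(s, x, A))            # the true sources score well
print(neg_log_posterior(np.zeros(M), x, A))  # the all-zero code scores much worse
```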

  16. Recursive Independent Components Analysis (RICA 1.0) • We assume that higher layers of cortex follow a learning rule similar to that of earlier layers - so we would like to apply essentially the same learning rule to subsequent layers. • BUT: • More linear processing will not lead to more interesting structure… • So we need some nonlinearity applied to the output of the first layer of ICA…

  17. Recursive Independent Components Analysis (RICA 1.0) • Notice that the generative model: x = As + ε • means that: x_i = Σ_j A_ij s_j + ε_i • Thus, each input variable (pixel) is assumed to be the sum of many independent random variables…

  18. Recursive Independent Components Analysis (RICA 1.0) • Thus, each input variable (pixel) is assumed to be the sum of many independent random variables… i.e., by the central limit theorem, it follows a (roughly) Gaussian distribution! Idea: ICA therefore expects a Gaussian-distributed input - which makes applying a second layer of ICA to a sparsely distributed input unlikely to work well. Hence, we apply a component-wise nonlinearity to the first-layer outputs to make the output follow a Gaussian distribution.

  19. Recursive Independent Components Analysis (RICA 1.0) • Another observation: the sign of the output of any s_i is statistically redundant. This is the distribution of one signal source (s_2) as a function of the value of a neighboring signal source (s_1)…

  20. Recursive Independent Components Analysis (RICA 1.0) • Hence our nonlinear activation function: • Note that ambiguous (not quite on, not quite off) responses (in BLUE) are emphasized in the activation function

  21. Recursive Independent Components Analysis (RICA 1.0) • An actual nonlinear activation function
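The slides show the actual function only as a figure. As a hedged stand-in, the sketch below implements one standard way to achieve the Gaussianization described above: discard the (redundant) sign, then map each unit's magnitude through its empirical CDF followed by the inverse standard-normal CDF. This is illustrative and not necessarily the exact RICA activation function.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussianize(responses):
    """Map each unit's |response| to an approximately Gaussian variable.

    responses: array of shape (n_units, n_samples)."""
    mag = np.abs(responses)                          # the sign carries no information
    out = np.empty_like(mag)
    for i in range(mag.shape[0]):
        u = rankdata(mag[i]) / (mag.shape[1] + 1.0)  # empirical CDF, values in (0, 1)
        out[i] = norm.ppf(u)                         # inverse standard-normal CDF
    return out

rng = np.random.default_rng(0)
s = rng.laplace(size=(64, 10000))        # stand-in for sparse layer-1 responses
g = gaussianize(s)
print(np.mean(g), np.std(g))             # roughly 0 and 1: ready for another ICA layer
```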

  22. Recursive Independent Components Analysis (RICA 1.0) • We applied RICA 1.0 to natural image patches. • Layer-1 ICA learns the standard edge/bar shaped visual features. • Layer-2 ICA learns more complex visual features that appear to capture contour and texture (Shan, Zhang & Cottrell, NIPS, 2007).

  23. Recursive Independent Components Analysis (RICA 1.0) Furthermore, these nonlinear features are useful: • We applied the layer-1 features with the nonlinearity to face recognition, and obtained state-of-the-art performance using a simple linear classifier (Shan & Cottrell, CVPR, 2008). • We also used the layer-1 features in a completely different recognition system we applied to faces, objects and flowers, and got state-of-the-art results on all three, without retraining (Kanan & Cottrell, CVPR, 2010)

  24. Results (NIPS 06/07) • Error rates on the Yale face database [figure: error rate plotted against number of training examples and number of features]

  25. CVPR 2010 [diagram: Image -> Decide where to look -> Fixate Region -> Local Features -> Local Classifier -> Decision? -> Get next fixation?] • Both the salience map and the features stored at each location are ICA features with our nonlinearity

  26. Improvement Over State-of-the-art

  27. Improvement Over State-of-the-art

  28. Roadmap • We describe ICA and our version of a hierarchical ICA, Recursive ICA (RICA 1.0) • We illustrate PCA and describe sparse PCA (SPCA) and the initial results • We investigate the receptive fields of the higher layers of RICA 2.0.

  29. The “hidden layers” • The story I told: • Recursive Independent Components Analysis: ICA->add nonlinearity->ICA->add nonlinearity… • The truth: in fact, like everyone else who does this sort of work, it is actually interleaved with PCA: • PCA->ICA->add nonlinearity->PCA->ICA->add nonlinearity… • And like everyone else, we never publish the pictures of the PCA receptive fields - because they don’t look biologically plausible!

  30. RICA 2.0 • We now combine this with our improvements to sparse PCA (Vincent et al., 2005) to get receptive fields up to V2. • SPCA->ICA->add nonlinearity->SPCA->ICA->add nonlinearity… • And, sparse PCA learns biologically-realistic receptive fields.
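Purely as an illustration of the pipeline's shape (not the authors' code), the sketch below chains a sparse-PCA stage, an ICA stage, and a Gaussianizing nonlinearity, then stacks two such stages. It uses scikit-learn's SparsePCA and FastICA as stand-ins (SparsePCA sparsifies the components, which is related to but not the same as the SPCA formulation described later) and random data in place of image patches; component counts are arbitrary.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.decomposition import SparsePCA, FastICA

def gaussianize(S):
    """Discard the sign, map magnitudes through the empirical CDF and the
    inverse normal CDF (same idea as the earlier nonlinearity sketch).
    S has shape (n_units, n_samples)."""
    ranks = np.apply_along_axis(rankdata, 1, np.abs(S)) / (S.shape[1] + 1.0)
    return norm.ppf(ranks)

def rica_layer(X, n_pca, n_ica, seed=0):
    """One SPCA -> ICA -> nonlinearity stage; X has shape (n_samples, n_units)."""
    spca = SparsePCA(n_components=n_pca, random_state=seed)
    Z = spca.fit_transform(X)            # reduced representation with sparse components
    ica = FastICA(n_components=n_ica, random_state=seed, max_iter=500)
    S = ica.fit_transform(Z)             # sparse, roughly independent responses
    return gaussianize(S.T).T            # Gaussianized outputs feed the next layer

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))             # random stand-in for (patches x pixels)
layer1 = rica_layer(X, n_pca=16, n_ica=16)     # V1-like stage
layer2 = rica_layer(layer1, n_pca=8, n_ica=8)  # V2-like stage
print(layer2.shape)                            # (500, 8)
```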

  31. A simple (unrealistic) example • Suppose two input signals (e.g., pixels) are completely correlated: [scatter plot of Pixel 2 against Pixel 1]

  32. A simple (unrealistic) example • Then we could represent that information with one (linear) “neuron”: [diagram: a single unit computing 0.5*Pixel 1 + 0.5*Pixel 2, with weights 0.5 and 0.5 from Pixel 1 and Pixel 2] • This is an example of redundancy reduction

  33. A simple (unrealistic) example • Furthermore, we can reconstruct the original pixels from that one “neural response”: [diagram: the response 0.5*Pixel 1 + 0.5*Pixel 2 projects back to Pixel 1 and Pixel 2 with weights 1.0 and 1.0]

  34. A simple (unrealistic) example • Hence the “autoencoder network”: [diagram: Pixel 1 and Pixel 2 feed a single unit (weights 0.5, 0.5) computing 0.5*Pixel 1 + 0.5*Pixel 2, which reconstructs Pixel 1 and Pixel 2 with weights 1.0, 1.0]

  35. Principal Components Analysis • Principal Components Analysis would do exactly this, because it learns representations based on correlations between the inputs. • This is an example of redundancy reduction and dimensionality reduction (from 2 dimensions to 1)

  36. Principal Components Analysis • Note that we can plot this “principal component” in image space, corresponding to the “weights” (0.5, 0.5) • The same thing applies if we have more than two pixels… so we have more than 2 principal components… capturing more correlations…

  37. Principal Components Analysis • And now we can see that the reconstruction is a weighted version of that “image” • The same thing applies if we have more than two pixels… so we have more than 2 principal components… capturing more correlations…
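A small numerical check of this two-pixel example (not from the slides): PCA on two nearly identical pixels yields a single component with equal weight on both, the unit-norm version, roughly (0.71, 0.71), of the (0.5, 0.5) weights sketched above, and both pixels can be reconstructed from that one response. The data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
# Two "pixels" that are (nearly) completely correlated.
X = np.column_stack([signal, signal + 0.01 * rng.standard_normal(1000)])
X -= X.mean(axis=0)

# The first principal component is the leading eigenvector of the covariance.
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
pc1 = eigvecs[:, -1]                 # ~ (0.71, 0.71): equal weight on both pixels
print("first principal component:", pc1)

response = X @ pc1                   # one "neural response" per sample
recon = np.outer(response, pc1)      # reconstruct both pixels from that response
print("mean squared reconstruction error:", np.mean((X - recon) ** 2))
```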

  38. Principal Components Analysis • Here are the principal components of 10x10 patches of natural images:

  39. Principal Components Analysis • But PCA learns these correlations in order of their size: so the first principal component does a lot of work [image: 1st PC]

  40. Principal Components Analysis • and the last principal component does very little work [image: last PC]

  41. Principal Components Analysis • So we can throw a lot of them away and you can’t tell the difference in an image that was reconstructed from them [images: original vs. compressed] • So PCA does two things right: It decorrelates the inputs, and it reduces dimensionality, making it “efficient” at encoding images.
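As a rough illustration of this point (not from the slides), the sketch below projects 10x10 image patches onto their top principal components only and measures how much variance survives. The image, patch size, and number of retained components are arbitrary choices, and it assumes scikit-image is available for the test photograph.

```python
import numpy as np
from skimage import data

image = data.camera().astype(float)
patch, k = 10, 30                    # 10x10 patches; keep 30 of 100 components
rng = np.random.default_rng(0)
rows = rng.integers(0, image.shape[0] - patch, 10000)
cols = rng.integers(0, image.shape[1] - patch, 10000)
X = np.stack([image[r:r+patch, c:c+patch].ravel() for r, c in zip(rows, cols)])
X -= X.mean(axis=0)

# Principal components = eigenvectors of the patch covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
top = eigvecs[:, -k:]                # throw away all but the top k components
recon = (X @ top) @ top.T
kept = 1.0 - np.sum((X - recon) ** 2) / np.sum(X ** 2)
print(f"fraction of variance kept with {k}/100 components: {kept:.3f}")
```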

  42. Principal Components Analysis • But no single neuron should have to do the work of the first principal component: so we should distribute the load evenly - this is called “response equalization.”

  43. Principal Components Analysis • Secondly, PCA is profligate with connections - every pixel is connected to every principal component “neuron”: we should try to reduce the connections also.

  44. Sparse Principal Components Analysis • We will try to minimize reconstruction error, • While trying to equalize the neural responses • And minimizing the connections.

  45. Sparse Principal Components Analysis • We minimize: [equation not transcribed; its terms are annotated “Reconstruction error” and “Minimize connections”] • Subject to the following constraint: [not transcribed; annotated “Equalize the ‘work’”]
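Since the actual equation appears only as an image in the slides, the block below is a hedged sketch of the idea in the authors' words: minimize reconstruction error plus a penalty on the connections, while pushing the output units toward equal response variance. The specific penalty forms, weights, the soft (penalty-based) treatment of the equalization constraint, and the use of random data are my assumptions, not the authors' formulation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
L, M, N = 16, 4, 500               # input dim, output units, number of samples
X = rng.standard_normal((L, N))    # random stand-in for image patches (pixels x patches)
X -= X.mean(axis=1, keepdims=True)

lam_sparse, lam_equal = 0.1, 1.0   # assumed penalty strengths

def spca_loss(a_flat):
    A = a_flat.reshape(L, M)
    S = A.T @ X                               # responses s = A^T x
    recon = A @ S                             # reconstruction x_hat = A s
    err = np.mean((X - recon) ** 2)           # reconstruction error
    sparse = lam_sparse * np.mean(np.abs(A))  # discourage connections (L1 on weights)
    var = S.var(axis=1)
    equal = lam_equal * np.var(var / (var.mean() + 1e-12))  # equalize the "work"
    return err + sparse + equal

A0 = 0.1 * rng.standard_normal(L * M)
res = minimize(spca_loss, A0, method="L-BFGS-B", options={"maxiter": 200})
A = res.x.reshape(L, M)
print("final loss:", res.fun)
print("fraction of small weights:", np.mean(np.abs(A) < 1e-2))
```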

  46. Information Kept With Sparse Connections • We applied the model to 20 X 20 image patches, and reduced the dimensionality to 100. • Results: • Our model captures 99.23% of the variance that could be captured by PCA with 100 output neurons. • 96.31% of the connection weights in our model are zero.

  47. Information Kept With Sparse Connections

  48. The model as a neural net… It is A^T that is mostly 0…

  49. Results • On grayscale images: • Note that we get essentially the same results applying the model to pink noise images…

  50. Results • suggesting the 1/f power spectrum of images is where this is coming from…
