WK7 – Hebbian Learning

CS 476: Networks of Neural Computation
Dr. Stathis Kasderidis, Dept. of Computer Science, University of Crete
Spring Semester, 2009

Contents
• Introduction to Hebbian Learning
• Definitions on Pattern Association
• Pattern Association Network
• Formal Theory of Associations: Building Correlations
• Examples
• Conclusions

Hebbian Learning
• Hebbian learning is the oldest and most famous of all learning rules. It was postulated by Donald Hebb (1949) in his book The Organization of Behavior:
• “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
• Hebb proposed this change as a basis of associative learning. We may expand this as a two-part rule:

Hebbian Learning-1
• If two neurons on either side of a synapse (connection) are activated simultaneously, then the strength of that synapse is selectively increased.
• If two neurons on either side of a synapse are activated asynchronously, then that synapse is selectively weakened or eliminated.
• Such a synapse is called a Hebbian synapse. More precisely, we define a Hebbian synapse as a synapse that uses a time-dependent, highly local, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the pre-synaptic and post-synaptic activities.

Hebbian Learning-2
• We analyse the four key mechanisms mentioned above:
• Time-dependent mechanism: The modifications in the synapse depend on the exact time of occurrence of the pre-synaptic and post-synaptic signals;
• Local mechanism: By its very nature, a synapse is the transmission site where information-bearing signals are in spatiotemporal contiguity. This locally available information is used by the synapse to produce a local modification that is input specific;
• Interactive mechanism: The occurrence of a change in a synapse depends on signals on both sides of the synapse. That is, a Hebbian form of learning depends on a “true interaction” between the pre- and post-synaptic signals, in the sense that we cannot make any prediction from either one of these two activities by itself. The interaction may be deterministic or stochastic;

Hebbian Learning-3
• Correlational mechanism: The condition for a change in synaptic efficiency is the co-occurrence of pre- and post-synaptic signals. The correlation over time between the two signals is responsible for the synaptic change.

Hebbian Learning-4
• We may classify the modifications of a synapse as:
• Hebbian: the synapse increases its strength when positively correlated pre- and post-synaptic signals are present, and decreases its strength when these signals are either uncorrelated or negatively correlated;
• Anti-Hebbian: the synapse is weakened by positively correlated pre- and post-synaptic signals and strengthened by negatively correlated signals;
• Non-Hebbian: the modification of the synapse does not involve any mechanism that is time dependent, highly local and strongly interactive in nature (as in the previous cases).

Hebbian Learning-5
• To formulate the Hebbian rule mathematically, we consider a weight $w_{kj}$ of a neuron k with pre- and post-synaptic signals denoted by $x_j$ and $y_k$ respectively. The adjustment to the weight $w_{kj}$ at time step n is given by:
  $\Delta w_{kj}(n) = F(y_k(n), x_j(n))$
  where F(·,·) is a function of both signals. The above formula can take many specific forms. Typical examples are:
• Hebb’s hypothesis: In the simplest case we have just the product of the two signals (it is also called the activity product rule):
  $\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n)$
  where $\eta$ is a learning rate.

Hebbian Learning-6
• This form emphasises the correlational nature of a Hebbian synapse. However, repeated application of this simple rule leads to exponential growth of the weights, which become unbounded. Thus we need a mechanism to stop the unbounded increase of the weights. One such mechanism is the following.
• Covariance hypothesis: In this case we replace the product of the pre- and post-synaptic signals with the product of the departures of the same signals from their respective average values over a certain time interval. Let $x^*$ and $y^*$ denote the time-averaged values of $x_j$ and $y_k$.

Hebbian Learning-7
• The covariance form is then defined by:
  $\Delta w_{kj}(n) = \eta\, (y_k(n) - y^*)\, (x_j(n) - x^*)$
• The covariance hypothesis allows for the following:
• Convergence to a non-trivial state, which is reached when $x_j(n) = x^*$ or $y_k(n) = y^*$;
• Prediction of both synaptic potentiation (i.e. increase in synaptic strength) and synaptic depression (i.e. decrease in synaptic strength).
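
The two rules above can be compared numerically. The following is a minimal sketch, not part of the original slides: it assumes a single synapse with a linear post-synaptic response y = w·x (an assumption made only to close the loop), and the function names, the learning rate 0.1 and the mean values 0.5 are illustrative choices.

```python
def hebb_update(w, x, y, eta):
    """Activity product rule: delta_w = eta * y * x."""
    return w + eta * y * x


def covariance_update(w, x, y, x_mean, y_mean, eta):
    """Covariance rule: delta_w = eta * (y - y_mean) * (x - x_mean)."""
    return w + eta * (y - y_mean) * (x - x_mean)


eta = 0.1

# Plain Hebb with a constant input x = 1 and an assumed linear response y = w * x.
# Each step multiplies w by (1 + eta), so the weight grows exponentially.
w_hebb = 0.5
hebb_trace = [w_hebb]
for _ in range(50):
    x = 1.0
    y = w_hebb * x
    w_hebb = hebb_update(w_hebb, x, y, eta)
    hebb_trace.append(w_hebb)

# Covariance rule: the sign of the change depends on how both signals
# deviate from their (here assumed) time averages.
x_mean, y_mean = 0.5, 0.5
w0 = 0.2
potentiated = covariance_update(w0, 1.0, 1.0, x_mean, y_mean, eta)   # both above average
depressed = covariance_update(w0, 1.0, 0.0, x_mean, y_mean, eta)     # one above, one below
unchanged = covariance_update(w0, x_mean, 1.0, x_mean, y_mean, eta)  # x equals its average

print(hebb_trace[-1], potentiated, depressed, unchanged)
```

Under these assumptions the plain Hebb weight keeps growing without bound, while the covariance rule produces potentiation, depression, or no change depending on the deviations from the averages.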

Pattern Association
• An associative memory is a brain-like distributed memory that learns associations. Association is a known and prominent feature of human memory.
• Association takes two forms:
• Auto-association: Here the task of a network is to store a set of patterns (vectors) by repeatedly presenting them to the network. The network is subsequently presented with a partial description or a distorted (noisy) version of an original pattern stored in it, and the task is to retrieve (recall) that particular pattern.
• Hetero-association: In this task we want to pair an arbitrary set of input patterns with an arbitrary set of output patterns.

Pattern Association-1
• Auto-association involves the use of unsupervised learning (Hebbian, Hopfield), while hetero-association involves the use of unsupervised (Hebbian) or supervised learning (e.g. MLP/BP) approaches.
• Let $x_k$ denote a key pattern applied to an associative memory and $y_k$ denote a memorised pattern. The pattern association performed by the network is described by:
  $x_k \rightarrow y_k, \quad k = 1, 2, \ldots, q$
  where q is the number of patterns stored in the network. The key pattern $x_k$ acts as a stimulus that not only determines the storage location of the memorised pattern $y_k$ but also holds the key for its retrieval.

Pattern Association-2
• In an auto-associative memory, $y_k = x_k$, so the input and output spaces have the same dimensionality. In a hetero-associative memory, $y_k \neq x_k$; hence in this case the dimensionality of the output space may or may not equal the dimensionality of the input space.
• There are two phases involved in the operation of the associative memory:
• Storage phase: the training of the network in accordance with a suitable rule;
• Recall phase: the retrieval of a memorised pattern in response to the presentation of a noisy version of a key pattern to the network.

Pattern Association-3
• Let the stimulus x (input) represent a noisy version of a key pattern $x_j$. This stimulus produces a response y (output). For perfect recall, we should find that $y = y_j$, where $y_j$ is the memorised pattern associated with the key pattern $x_j$. When $y \neq y_j$ for $x = x_j$, the associative memory is said to have made an error in recall.
• The number q of patterns stored in an associative memory provides a direct measure of the storage capacity of the network. In designing an associative memory, we want to make the storage capacity q (expressed as a percentage of the total number N of neurons) as large as possible and yet insist that a large fraction of the patterns is recalled correctly.

Pattern Association Network
• A pattern associator is a network which is able to learn hetero-associations between two patterns. A schematic representation is given below:

Pattern Association Network-1
• The net input that arrives at every output unit is calculated as:
  $\mathrm{net}_i = \sum_{j=1}^{N} w_{ij}\, a_j, \quad i = 1, \ldots, M$
  where i is the index of an output neuron and j the index of an input neuron. The dimensionality of the input space is N and that of the output space is M. $w_{ij}$ is the weight from neuron j to neuron i, and $a_j$ is the activation of neuron j.
• The activation of each neuron is produced by using a suitable threshold function and a threshold. For example, we can assume that the activations are binary (i.e. either 0 or 1); to achieve this we use the step function, which sets $a_i = 1$ when $\mathrm{net}_i$ reaches the threshold $\theta$ and $a_i = 0$ otherwise.

Pattern Association Network-2
• The training of the network takes place by using, for example, the Hebbian form. Thus what we have is a matrix of weights, all of them zero initially. Assuming an input pattern of (101010) and an output pattern of (1100), this is a 4 × 6 matrix with one row per output unit and one column per input unit:
  $W = \begin{pmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix}$

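Pattern Association Network-3
• If we assume a learning rate $\eta = 1$, then after a single learning step with input (101010) and output (1100) each weight becomes the product of its output and input activations (rows correspond to output units, columns to input units):
  $W = \begin{pmatrix} 1&0&1&0&1&0 \\ 1&0&1&0&1&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{pmatrix}$
• To recall from the matrix we simply apply the input pattern and perform the matrix multiplication of the weight matrix with the input vector. In our example we get:
  $W x = (3, 3, 0, 0)^T$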

Pattern Association Network-4
• If we assume a threshold of 2, we get the correct answer (1100) from the net inputs (3, 3, 0, 0) using a step function as activation function.
• We can learn multiple associations using the same weight matrix. For example, assume that a new input vector (110001) is given with corresponding output vector (0101).

Pattern Association Network-5
• In this case, after a single presentation (with $\eta = 1$), we get an updated weight matrix:
  $W = \begin{pmatrix} 1&0&1&0&1&0 \\ 2&1&1&0&1&1 \\ 0&0&0&0&0&0 \\ 1&1&0&0&0&1 \end{pmatrix}$
• Again we can get the correct output vectors when we present the corresponding input vectors. The net inputs are $(3, 4, 0, 1)^T$ for the input (101010) and $(1, 4, 0, 3)^T$ for the input (110001).

Pattern Association Network-6
• Again, by using the threshold of 2 and a step function, we get the correct answers of (1100) and (0101).
• However, keep in mind that only a limited number of patterns can be stored before perfect recall fails. The typical capacity of an associator network is 20% of the total number of neurons.
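
The worked example above can be reproduced in a few lines. This is a minimal sketch using plain Python lists; the helper names (`hebb_store`, `net_input`, `recall`) are hypothetical, and the choice that a unit fires when its net input is greater than or equal to the threshold is an assumption (the example gives the same answers with a strict comparison).

```python
def hebb_store(W, x, y, eta=1):
    """One Hebbian learning step: w_ij += eta * y_i * x_j."""
    return [[W[i][j] + eta * y[i] * x[j] for j in range(len(x))]
            for i in range(len(y))]


def net_input(W, x):
    """Net input of each output unit: sum_j w_ij * x_j."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def recall(W, x, threshold=2):
    """Step activation; assumption: a unit fires when net input >= threshold."""
    return [1 if h >= threshold else 0 for h in net_input(W, x)]


x1, y1 = [1, 0, 1, 0, 1, 0], [1, 1, 0, 0]
x2, y2 = [1, 1, 0, 0, 0, 1], [0, 1, 0, 1]

# Start from a 4 x 6 matrix of zeros (rows: output units, columns: input units).
W = [[0] * 6 for _ in range(4)]

W = hebb_store(W, x1, y1)
net_after_first = net_input(W, x1)

W = hebb_store(W, x2, y2)
for x in (x1, x2):
    print(x, net_input(W, x), recall(W, x))
```

Both associations share the same weight matrix; the second row, which belongs to the output unit active in both patterns, accumulates contributions from both inputs.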

Pattern Association Network-7
• Recall accuracy reflects the similarity of a key pattern to the stored patterns. The network can generalise in the sense that, when an input pattern is not exactly the same as any of the stored patterns, it returns the stored pattern which most closely resembles the input.
• Properties of pattern associators:
• Generalisation;
• Fault tolerance;
• Distributed representations are necessary for generalisation and fault tolerance;

Pattern Association Network-8
• Prototype extraction and noise removal;
• Speed;
• Interference is not necessarily a bad thing (it is the basis of generalisation).
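
Generalisation and fault tolerance can be illustrated with the two associations stored earlier, (101010) → (1100) and (110001) → (0101). The sketch below is illustrative only: the distorted key, the choice of which weight to remove, and the helper names are assumptions, not taken from the slides.

```python
PAIRS = [
    ([1, 0, 1, 0, 1, 0], [1, 1, 0, 0]),
    ([1, 1, 0, 0, 0, 1], [0, 1, 0, 1]),
]


def store(pairs):
    """Hebbian storage with learning rate 1: w_ij = sum over pairs of y_i * x_j."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * n_in for _ in range(n_out)]
    for x, y in pairs:
        for i in range(n_out):
            for j in range(n_in):
                W[i][j] += y[i] * x[j]
    return W


def recall(W, x, threshold=2):
    """Step activation; assumption: a unit fires when net input >= threshold."""
    return [1 if sum(w * x_j for w, x_j in zip(row, x)) >= threshold else 0
            for row in W]


W = store(PAIRS)

# Generalisation: the first key with one spurious extra bit switched on.
noisy_key = [1, 0, 1, 1, 1, 0]
recalled_from_noise = recall(W, noisy_key)

# Fault tolerance: remove a single synapse and recall with the clean keys.
W_lesioned = [row[:] for row in W]
W_lesioned[1][2] = 0
recalled_after_lesion = [recall(W_lesioned, x) for x, _ in PAIRS]

print(recalled_from_noise, recalled_after_lesion)
```

In this small case the distorted key still retrieves the pattern stored with the closest key, and the loss of one weight does not change the recalled outputs, because each output unit pools evidence from several input lines.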

Correlations
• We have stated that the simple Hebb form creates unbounded weights. One way to overcome this problem was the covariance rule. A second one is Oja’s rule. The latter has the benefit that it is closely related to the principal component analysis method.
• Let us restate the Hebb form for a single linear unit in the output layer and for an input vector with dimension larger than 1:
  $\Delta w_i = \eta\, V\, \xi_i$
  where V is the activation of the output unit, $\xi_i$ is the activation of input neuron i, and $\eta$ is the learning rate.

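Correlations-1
• This rule as it stands does not have any (non-trivial) stable fixed point. To see this, let us assume for the moment that there are (hypothetically) some fixed points. (A fixed point is a weight vector w for which the average weight change vanishes, $\langle \Delta w \rangle = 0$.) Using $V = \sum_j w_j \xi_j$ for the linear unit with inputs $\xi_j$, we would then have:
  $0 = \langle \Delta w_i \rangle = \eta \langle V \xi_i \rangle = \eta \langle \sum_j w_j \xi_j \xi_i \rangle = \eta \sum_j C_{ij} w_j$, i.e. $C w = 0$
  where the angle brackets indicate an average over the input distribution $P(\xi)$ and we have defined the correlation matrix C by:
  $C_{ij} \equiv \langle \xi_i \xi_j \rangle$, or in matrix form $C \equiv \langle \xi \xi^T \rangle$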

Correlations-2
• Several things should be noted about C:
• C is not the covariance matrix of the input, which would be defined in terms of the means $\mu_i = \langle \xi_i \rangle$ as $\langle (\xi_i - \mu_i)(\xi_j - \mu_j) \rangle$;
• C is symmetric, i.e. $C_{ij} = C_{ji}$, which implies that the eigenvalues are real and the eigenvectors can be taken as orthogonal;
• Because of the outer product form, C is positive semi-definite, thus all its eigenvalues are positive or zero.
• Now let us return to the equation:
  $C w = 0$

Correlations-3
• This equation says that w is an eigenvector of C with eigenvalue 0. But this will never be stable, because C has some positive eigenvalues: any small component of w along an eigenvector with a positive eigenvalue grows under the averaged update $\langle \Delta w \rangle = \eta C w$. Thus we conclude that there are only unstable fixed points for the plain Hebb learning procedure.
• One can prevent the divergence of Hebbian learning by constraining the growth of the weight vector w. There are several ways in which this can be achieved:
• One way is to renormalise all the weights after each update, $w_i' = \alpha w_i$, choosing $\alpha$ such that $|w'| = 1$;

Correlations-4
• Another way is to clip the value of each weight at a lower and an upper bound, in other words to hold the weight at the bound whenever it tries to cross it, i.e.
  $w^- \le w_i \le w^+$
• Another way is to use Oja’s rule, which we examine next.
• Oja modified the plain Hebb rule so as to make the weight vector approach a constant length $|w| = 1$, without having to do any renormalisation by hand.
• Moreover, w approaches an eigenvector of C with the largest eigenvalue $\lambda_{max}$. We call this the maximal eigenvector.

Correlations-5
• Oja’s modification corresponds to adding a weight decay proportional to $V^2$ to the plain Hebb rule:
  $\Delta w_i = \eta\, V\, (\xi_i - V w_i)$
• Note that this form looks like a delta rule where the correction $\Delta w_i$ depends on the difference between the actual input and the backpropagated output.
• We state some properties of Oja’s rule without proof:
• Unit length: $|w| = 1$;
• Eigenvector direction: w lies in the maximal eigenvector direction of C;

Correlations-6
• Variance maximisation: w lies in a direction that maximises $\langle V^2 \rangle$.
• Other rules for modifying the plain Hebb rule exist in the literature. In most cases these are more complex forms.
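
The stated properties of Oja’s rule can be checked on a toy problem. The following is a minimal sketch under stated assumptions: the four two-dimensional input vectors are hypothetical and chosen so that their correlation matrix $C = \langle \xi \xi^T \rangle$ has eigenvalues 1.0 (direction (1, 1)) and 0.25 (direction (1, -1)); the learning rate, the initial weights and the number of epochs are illustrative.

```python
import math

# Hypothetical inputs; C = [[0.625, 0.375], [0.375, 0.625]] for this set.
DATA = [(1.0, 1.0), (-1.0, -1.0), (0.5, -0.5), (-0.5, 0.5)]


def output(w, xi):
    """Linear output unit: V = sum_j w_j * xi_j."""
    return sum(w_j * xi_j for w_j, xi_j in zip(w, xi))


def hebb_step(w, xi, eta):
    """Plain Hebb: delta_w_i = eta * V * xi_i."""
    V = output(w, xi)
    return [w_i + eta * V * xi_i for w_i, xi_i in zip(w, xi)]


def oja_step(w, xi, eta):
    """Oja's rule: delta_w_i = eta * V * (xi_i - V * w_i)."""
    V = output(w, xi)
    return [w_i + eta * V * (xi_i - V * w_i) for w_i, xi_i in zip(w, xi)]


def norm(w):
    return math.sqrt(sum(w_i * w_i for w_i in w))


def train(step, w, eta, epochs):
    """Present every input once per epoch, updating after each presentation."""
    for _ in range(epochs):
        for xi in DATA:
            w = step(w, xi, eta)
    return w


w_oja = train(oja_step, [0.3, -0.1], eta=0.05, epochs=500)
w_hebb = train(hebb_step, [0.3, -0.1], eta=0.05, epochs=50)

print("Oja:", w_oja, "length", norm(w_oja))
print("Plain Hebb length after 50 epochs:", norm(w_hebb))
```

With these settings the Oja weight vector settles at unit length along the (1, 1) direction, the maximal eigenvector of C for this data set, while the plain Hebb weights keep growing.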

Examples
• Ex1 – Hippocampal Model: There has been strong support to date for the suggestion that the brain area known as the hippocampus uses Hebb-style learning for forming episodic memories.
• A model which captures the interactions of the hippocampus (DG / CA3 / CA1) with the immediately surrounding regions (entorhinal cortex, subiculum) and the neocortical areas is given below:

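Examples-1
[Figure: model of the hippocampus (DG / CA3 / CA1) and its connections with the entorhinal cortex, subiculum and neocortex]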

Examples-2
• The module details are as follows:
• Entorhinal cortex: 600 neurons, each with 200 synapses, and sparseness = 0.05;
• DG: 1000 neurons, each with 60 synapses, and sparseness = 0.05;
• CA3: 1000 neurons, with sparseness = 0.05, each with:
  • 200 recurrent synapses (from other CA3 neurons)
  • 120 synapses from the entorhinal cortex
  • 4 synapses from DG

Examples-3
• CA1: 1000 neurons, each with 200 synapses, and sparseness = 0.01;
• Sparseness is the fraction of neurons that are activated when a new stimulus arrives. It is determined from experimental data from the rat hippocampal area.
• Input arrives at the entorhinal cortex
• The connections from entorhinal cortex → DG are trained using Hebbian learning
• DG is a competitive network
• CA3 is an auto-association network
• CA3 recurrent connections use Hebbian learning

Examples-4
• The connections CA3 → CA1 are trained with a Hebbian rule
• CA1 is a competitive network
• The connections from CA1 → entorhinal cortex use Hebbian learning
• Simulations of the model showed that one-shot learning is possible, and the model matched a number of experimental data well.

Examples-5
• Ex2 – VisNet: This network is a model of biological vision and tries to solve the problem of building position- and view-invariant representations from multiple views of the same object, e.g. a human face.
• It uses a hierarchical layered structure where the neurons of each layer are connected to neurons of the previous layer through receptive fields of appropriate size. The fields become progressively wider as we move up the hierarchy.
• In each layer we have an array of 32x32 cells, which use lateral inhibition in a competitive network arrangement.

Examples-6
• Forward connections from one layer to another are trained by Hebbian-style learning.
• Each cell receives 100 connections from the previous layer, with 67% probability that a connection comes from within 4 cells of the distribution centre.
• The architecture is shown below:

Examples-7
[Figure: VisNet architecture]

Examples-8
• The input to the model is an image of a face, which is then convolved with appropriate filters so as to detect different orientations and edges in the input image. This corresponds roughly to brain area V1.
• The learning law that is used is a Hebbian rule with a memory trace:
  $\Delta w_{kj}(n) = \eta\, a_k(n)\, m_j(n)$
  $m_i(n) = (1 - \gamma)\, a_i(n) + \gamma\, m_i(n-1)$
  where $\eta$ is the learning rate and $\gamma$ is a constant which determines the relative contribution of the memory and of the current activation. $a_i(n)$ is the activation of neuron i at time n and is calculated in the usual way.
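
The effect of the memory trace can be seen on a single synapse. This is a minimal sketch of the rule as written above, not the VisNet implementation: the symbol names `eta` (learning rate) and `gamma` (trace constant), their values, and the two activation sequences are assumptions chosen for illustration. The sequences are arranged so that the two neurons are never active at the same time step.

```python
def update_trace(m_prev, a, gamma):
    """Memory trace: m(n) = (1 - gamma) * a(n) + gamma * m(n - 1)."""
    return (1.0 - gamma) * a + gamma * m_prev


def run(a_j_seq, a_k_seq, eta, gamma):
    """Apply delta_w_kj(n) = eta * a_k(n) * m_j(n) over a sequence of time steps."""
    w, m_j = 0.0, 0.0
    for a_j, a_k in zip(a_j_seq, a_k_seq):
        m_j = update_trace(m_j, a_j, gamma)
        w += eta * a_k * m_j
    return w


# Neuron j is active during the first two steps, neuron k during the last two.
a_j_seq = [1.0, 1.0, 0.0, 0.0]
a_k_seq = [0.0, 0.0, 1.0, 1.0]

w_trace = run(a_j_seq, a_k_seq, eta=0.1, gamma=0.5)
# With gamma = 0 the trace equals the current activation (plain Hebbian product).
w_plain = run(a_j_seq, a_k_seq, eta=0.1, gamma=0.0)

print(w_trace, w_plain)
```

With a non-zero trace constant the weight grows even though the two activities never coincide, because the trace carries recent activity forward in time; with the trace switched off the plain product rule leaves the weight at zero. This is the sense in which the trace lets successive views of the same object be associated.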

Examples-9
• The model successfully recognises faces at different angles and positions in the input image. For more details, see the literature (Rolls & Treves, 1998).

Conclusions
• Hebbian learning is the oldest learning law in neural networks.
• It is used mainly to build associators of patterns.
• The original Hebb rule creates unbounded weights. For this reason there are other forms which try to correct this problem. There are also temporal forms of the Hebbian rule. A hybrid case is the memory-trace rule presented earlier in the VisNet example.
• It has wide applications in pattern association problems and in models of computational neuroscience and cognitive science.
