
10. The cognitive brain



  1. 10. The cognitive brain Fundamentals of Computational Neuroscience, T. P. Trappenberg, 2010. Lecture Notes on Brain and Computation Summary by Byoung-Hee Kim Lecture by Byoung-Tak Zhang Biointelligence Laboratory School of Computer Science and Engineering Graduate Programs in Cognitive Science, Brain Science and Bioinformatics Brain-Mind-Behavior Concentration Program Seoul National University E-mail: btzhang@bi.snu.ac.kr This material is available online at http://bi.snu.ac.kr/

  2. Introduction • This chapter continues the discussion of system-level models of the brain • Layered representation of the brain • Invariant object recognition • Visual attention • Workspace hypothesis • How the brain is able to produce novel solutions to new tasks • General discussion of brain theory in bidirectional layered models of cortex – anticipation is a central feature • Bayesian formulations • Boltzmann machine / Helmholtz machine / Deep belief networks • Probabilistic reasoning: Bayesian networks, EM • Adaptive resonance theory (ART)

  3. Outline

  4. 10.1 Hierarchical maps and attentive vision • Invariant object recognition in humans • We recognize objects even though they vary in form, size, location, and viewing angle • Neural systems that learn to recognize objects through supervised learning in mapping networks are quite sensitive to changes in the input vector • Solution: hierarchical networks • Related question: attention in the visual system

  5. 10.1.1 Invariant object recognition • VisNet • Model by Edmund Rolls and Simon Stringer • A hypothesis of how invariant visual object recognition is achieved within the ventral visual pathways of the brain • Each node in a layer is connected to a spatially restricted area of nodes in the layer below. The receptive fields of the nodes increase with each level.

  6. VisNet Model • Each layer is a competitive map • Competition is implemented by adjusting the firing threshold of nodes until a predefined sparseness is reached (see Section 7.5.11) • The model is trained on sequences of patterns from movements of objects in the visual field • Weights between the layers are adjusted with Hebbian learning • Early version: Hebbian learning with a trace rule – some memory of the neural activity (the trace) was used • Exploits temporal associations between views of moving objects • Recent model: works without a trace rule when objects in consecutive time steps have some overlap with previous neural representations • Uses spatial associations instead • Able to learn invariant object recognition • Invariance to translation, rotation, and size • Multiple objects can be trained simultaneously • The information flow between the layers is strictly feedforward
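To make the trace rule concrete, here is a minimal sketch of one Hebbian update with an activity trace. It is not the VisNet code: the function name trace_rule_update, the trace parameter eta, the learning rate alpha, and the weight normalization step are illustrative assumptions for rate-coded activity vectors in a single layer pair.

```python
import numpy as np

def trace_rule_update(w, x, y, y_trace, eta=0.8, alpha=0.01):
    """One Hebbian update with a trace rule (sketch).

    w       -- weight matrix from the lower layer to this layer, shape (n_post, n_pre)
    x       -- presynaptic activity vector (lower layer)
    y       -- current postsynaptic activity vector (this layer)
    y_trace -- trace of postsynaptic activity from previous time steps
    eta     -- trace decay: how much of the old trace is kept
    alpha   -- learning rate
    """
    # Update the activity trace: a leaky memory of recent postsynaptic activity
    y_trace = eta * y_trace + (1.0 - eta) * y
    # Hebbian update uses the trace instead of the instantaneous activity, so
    # inputs that occur close together in time (e.g. successive views of a
    # moving object) become associated with the same postsynaptic nodes
    w += alpha * np.outer(y_trace, x)
    # Keep weight vectors bounded (simple normalization)
    w /= np.linalg.norm(w, axis=1, keepdims=True) + 1e-12
    return w, y_trace
```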

  7. 10.1.2 Attentive vision • Top-down information flow in cortical models • Crucial in cognitive processes • Example: visual search • Demands the top-down influence of an object bias that specifies what to look for ↔ object recognition demands spatial attention

  8. The overall scheme of the model for attentive vision • Model by Gustavo Deco et al. • Three important parts: ‘V1-V4’, ‘IT’, ‘PP’ • We are particularly interested in how the different parts influence each other

  9. Roles of parts: LGN • Representations in the LGN: Gabor functions • Bottom-up input images to the model are decomposed with Gabor functions • Tuning curves in the LGN can be parameterized with Gabor functions

  10. Roles of parts: ‘V1-V4’ • Principal role in the model • Decomposing the visual scene into features • V1 neurons respond mainly to simple features – orientation of edges • Later areas are specialized to represent other features – color, motion, combinations of basic features • Modeling issue • Simplification: features are represented in sections which correspond to the location of the object in the visual field • Each section represents a feature of a part as a vector of node activities • No global competition between the modules • Competition acts only within each section – an inhibitory pool keeps the sparseness of the activity in each section roughly constant (see Section 7.5.1)

  11. Roles of parts: ‘IT’ • IT: model of processes in the inferior-temporal cortex • Known to be involved in object recognition • Modeled as associative memory (see Ch. 8) • Attractor network: point attractors of specific objects are formed through the Hebbian-trained collateral connections within this structure • The connections between ‘V1-V4’ and ‘IT’ can also be trained with Hebbian learning • This enables simulation of translation-invariant object recognition
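A point-attractor memory of the kind assumed for ‘IT’ can be sketched as a small Hopfield-style network trained with Hebbian outer products. This only illustrates the principle, not the Deco et al. model; the function names and the use of ±1 binary patterns are assumptions.

```python
import numpy as np

def train_hebbian(patterns):
    """Form point attractors from binary (+1/-1) patterns via Hebbian learning."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)          # Hebbian outer-product (collateral) weights
    np.fill_diagonal(w, 0.0)         # no self-connections
    return w / patterns.shape[0]

def recall(w, cue, steps=20):
    """Relax from a partial or noisy cue to the nearest stored attractor."""
    s = cue.copy()
    for _ in range(steps):
        s = np.sign(w @ s)
        s[s == 0] = 1                # break ties toward +1
    return s
```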

  12. Roles of parts: ‘IT’ • The contribution of the attractor network in ‘IT’ to translation-invariant object recognition explains some recent experimental findings • the size of the receptive field of IT neurons depends on the content of the visual field and the specifics of the task • Example: Single object on a screen with a blank background – the receptive field of a neuron that responds to this object can be very large (> 30 degrees)

  13. Roles of parts: ‘IT’ • If two objects are presented simultaneously, or if the target object is shown on top of a complex background (which can be viewed as a scene with many objects) • Then the size of the receptive field shrinks markedly • Fig. 10.4(A)

  14. Roles of parts: ‘IT’ • Hypothesis given by the model • Based on the attractor dynamics of the autoassociator network in ‘IT’ • If only one object is shown, then this object would trigger the right point attractor and thus recall of the object regardless of its location, which corresponds to large receptive fields • If two or more trained objects are shown, then it is likely that the final state of the attractor network is mainly dominated by the object closest to the fovea, which gets the most weight due to cortical magnification • Fig. 10.4(B)

  15. 10.1.3 Attentional bias in visual search and object recognition • Object bias input to the attractor network in ‘IT’ • Tells the system what to look for in a visual search task • Such top-down information is thought to originate in the frontal areas of the brain • Can speed up the recognition process in ‘IT’ • Supports recognition of the input from ‘V1-V4’ that corresponds to the target object in visual search • The receptive fields of target objects are larger than the receptive fields of non-target objects

  16. Attentional bias in Object Recognition • Top-down input to a specific location in ‘PP’ • Object recognition task • The label of this module suggests processing in the posterior parietal cortex, which is part of the dorsal visual processing pathway (the ‘where’ pathway) • Modeled as a spatially organized neural sheet, which is connected to the corresponding sections in ‘V1-V4’ • Enhances the neural activity in ‘V1-V4’ for the features of the object located at the corresponding location in the visual field • Faster completion of the input patterns for the ‘IT’ network • Faster object recognition at the corresponding location

  17. Attentional bias – summary • Origins of the attentional biases • Visual search: top-down input to ‘IT’, object-based • Object recognition: acts on ‘PP’, location-based • It may be difficult to separate the different forms of attention in experiments because all connections between the parts of the model are bidirectional

  18. Numerical experiments • Simulation experiment by Deco et al. • Target: the letter ‘E’ • Distractors: visually different (‘X’, Fig. 5A) and similar (‘F’, Fig. 5B) • Reaction time • Independent of the number of objects (Fig. 5A) • Linear increase with the number of objects (Fig. 5B) • Both modes are present in the same ‘parallel architecture’ • The apparent serial search is only due to the more intense conflict-resolution demand in the recognition process

  19. 10.2 An interconnecting workspace hypothesis • Humans are very flexible in coping with the complex world • Ex. humans can drive a car with such ease that it demands little attention, despite the fact that we have to react to an unknown environment • Model to solve complex cognitive tasks • Attributes: flexible cooperation of many specialized modules • System-level perspective • Specialized processors + global system activities • Flexibility + robustness

  20. 10.2.1 The global workspace • Model by S. Dehaene et al. (1995) • Five basic subsystems • Perceptual (input), motor (output), and three cognitive subsystems • Global workspace: interconnecting network among the subsystems • Projections between cortical areas are indeed abundant • A large portion of the global workspace could be localized within cortical layers II and III

  21. 10.2.2 Demonstration of the global workspace in the Stroop task • Stroop task • Task 1: read the word, or • Task 2: name the color in which the word is written • You should respond rapidly • Idea: the global workspace has to become active to ‘re-wire’ the commonly active word-naming configuration of the brain • Example stimuli on the slide: Yellow Green Blue

  22. 10.2.2 Demonstration of the global workspace in the Stroop task • Model that can be tested on a Stroop task to demonstrate the idea • Three specialized processors • Top-down influence of workspace nodes on the nodes in the specialized processors • Reward signal – indicates a mismatch between the desired response and the actual response of the system • Vigilance parameter • Usage of the model • Predictions • Flexibility to solve different tasks

  23. 10.3 The anticipating brain • Layered architectures with bottom-up and top-down information flow are important for cognitive processes • Previous examples: VisNet, the interconnecting workspace hypothesis • Generalization (an emerging brain theory) • Brain-style information processing principles • A general hypothesis of how the brain implements cognitive functions • The brain as an anticipating memory system • The remainder of the chapter discusses • The anticipating brain hypothesis • Related model implementations

  24. The anticipating brain – factors • Factors that are essential in realizing cognitive functions • The brain can develop a model of the world, which can be used to anticipate or predict the environment • The inverse of the model can be used to recognize causes by evoking internal concepts • Hierarchical representations are essential to capture the richness of the world • Internal concepts are learned through matching the brain’s hypotheses with input from the world • An agent can learn actively by testing hypotheses through actions • The temporal domain is an important degree of freedom

  25. 10.3.1 The brain as anticipatory system in a probabilistic framework • Notation • s: sensory state • c: causal state • g: describes the physical process of generating the sensory response • a: action of the agent • s’: internal representation of sensory states in primary sensory cortex • c’: higher-order cortical representations, called concepts • Generative model G: on an abstract level, we see the brain as a generative model of the world • Recognition model Q: the inverse of G, which evokes internal concepts from causes in the environment

  26. Definition of the term ‘causes’ highlights two important functions of brain processing • First, one of the major goals of the brain is to learn what causes are by forming internal concepts • Second, the brain must learn concepts at different levels of abstraction • Learning concepts and predicting causes in our environment is central to the thesis developed in this chapter • Agent: system that can explore the environment actively via interaction with the environment • Conjecture • The brain is trying to match sensory input with internally generated states

  27. The world model in a probabilistic framework • Layered structure that includes the necessary bidirectional connections • Related models: deep belief networks • Highly interactive system: it is not easy to clearly separate the generative model from the recognition model • Model form: Bayesian network or causal graphical model

  28. Concepts at different levels of cortical representation • learned in a self-supervised way through the interaction with the environment • Engrained into a memory system • Example from the visual system • Early level can learn to recognize different sequences of retinal patterns • Sequences of these concepts can then be learned by higher-order cortical areas • Higher order concepts that are evoked by specific sensory input can influence the expectations of concepts in lower cortical representations, ultimately anticipating specific patterns of sensory input

  29. Hypothesis testing – how good is the world model? • Can only be assessed through interactions with the environment • Inference of the world model from environmental data • Hypothesis testing by the agent differs from common inference techniques in statistics in that the agent is able to interact actively with the environment • Active learning might be necessary to reduce the demands on learning in large systems • Next, we look at some recent models that implement and elaborate on the principal ideas outlined in this section

  30. 10.3.2 The Boltzmann machine – intro • Models which are able to learn expectations of sensory states • The attractor neural network (ANN) (Ch. 8) • More general dynamic models (in this chapter) • Features of the attractor neural network (ANN) • It can be seen as a predictive memory system • Simple recurrent network • Trained with Hebbian autocorrelation rules → the corresponding dynamic system has point attractors • Limitations of the ANN • Always produces the same answer given partial input of a sensory state • Does not reflect the probability of different causes in the environment with similar sensory states

  31. Toward a general dynamic system • Introducing hidden nodes to a recurrent system • Hidden nodes • In feedforward mapping networks (perceptrons), provide enough internal representations • In recurrent networks, provide enough degrees of freedom • Finding practical training rules for the dynamic system of recurrent networks with hidden nodes has been a major challenge • Extension • Distinguish visible nodes and hidden nodes • The system can still be described by an energy function • Boltzmann machines – symmetrical connections • Helmholtz machines – asymmetrical connections

  32. The Boltzmann machine - 1 • The energy between two nodes: E_ij = −w_ij^{nm} s_i^n s_j^m (the total energy is the sum over all pairs) • s: the state variable • The superscripts n or m can have values v or h to indicate visible and hidden nodes • Probabilistic update rule (Glauber dynamics): p(s_i = 1) = 1 / (1 + exp(−β ΔE_i)), where ΔE_i is the energy gap between the two states of node i • β: inverse temperature • Describes the competitive interaction between minimizing the energy and the randomizing thermal force • The probability distribution of such a stochastic system in equilibrium is called the Boltzmann-Gibbs distribution
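The Glauber dynamics described above can be sketched in a few lines, assuming binary 0/1 states, a symmetric weight matrix with zero diagonal, and asynchronous updates; the function name glauber_step and the use of a numpy random generator are illustrative choices, not the book's code.

```python
import numpy as np

def glauber_step(s, w, beta, rng):
    """One asynchronous Glauber update of a binary (0/1) stochastic network.

    s    -- current state vector of all nodes (visible and hidden together)
    w    -- symmetric weight matrix with zero diagonal
    beta -- inverse temperature (large beta = nearly deterministic updates)
    """
    i = rng.integers(len(s))                    # pick one node at random
    h = w[i] @ s                                # local field from all other nodes
    p_on = 1.0 / (1.0 + np.exp(-beta * h))      # Boltzmann probability of s_i = 1
    s[i] = 1.0 if rng.random() < p_on else 0.0  # stochastic state update
    return s

# usage: s = glauber_step(s, w, beta=1.0, rng=np.random.default_rng(0))
```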

  33. The Boltzmann machine - 2 • The distribution of visible states in thermal equilibrium: p(s^v; w) = (1/Z) Σ_{s^h} exp(−β E(s^v, s^h)) • w: the weights of the recurrent network • Z: the normalization term, called the partition function • Target of the model • With enough hidden nodes, and by choosing the right weight values, we want the dynamical system to approximate the probability function of the sensory states caused by the environment

  34. The Boltzmann machine - 3 • To derive a learning rule we need an objective function – the difference between two density functions • Kullback-Leibler (KL) divergence: KL(p || q) = Σ_s p(s) log( p(s) / q(s) ), here between the environmental distribution of sensory states and the model distribution of visible states • Minimizing the KL divergence is equivalent to maximizing the average log-likelihood function • By gradient ascent the learning rule can be written as Δw_ij ∝ <s_i s_j>_clamped − <s_i s_j>_free • ‘clamped’ – thermal average of the correlation between two nodes when the states of the visible nodes are fixed to the data • ‘free’ – thermal average when the recurrent system is running freely • Note: <x> represents the expectation of x
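The gradient rule above amounts to comparing node-node correlations measured in the clamped and the freely running condition. A minimal sketch, assuming that equilibrium samples of the full state vector have already been collected for both phases (collecting them is exactly the expensive part); the function name and learning rate are assumptions.

```python
import numpy as np

def boltzmann_weight_update(w, samples_clamped, samples_free, lr=0.01):
    """Boltzmann learning step from equilibrium samples (sketch).

    samples_* -- arrays of shape (n_samples, n_nodes) with binary node states,
                 drawn at thermal equilibrium with visible nodes clamped / free
    """
    # <s_i s_j> averaged over the collected samples in each phase
    corr_clamped = samples_clamped.T @ samples_clamped / len(samples_clamped)
    corr_free = samples_free.T @ samples_free / len(samples_free)
    # Gradient ascent on the average log-likelihood
    w += lr * (corr_clamped - corr_free)
    np.fill_diagonal(w, 0.0)   # keep the no-self-connection convention
    return w
```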

  35. Features and limitations of the Boltzmann machine • Features • In principle, the Boltzmann machine can be trained to represent any arbitrary density function, given that the network has a sufficient number of hidden nodes • The clamped phase could be associated with a sensory-driven agent during an awake state • The freely running phase could be associated with a sleep phase • Limitations • Learning is too demanding in practice • The averages have to be evaluated at thermal equilibrium • Instability of the gradient method in recurrent systems – small changes of the weights can trigger large changes in the dynamics

  36. 10.3.3 The restricted Boltzmann machine and contrastive Hebbian learning • From the Boltzmann machine to the restricted Boltzmann machine (RBM) • Training the Boltzmann machine with (10.12) is challenging because the states of the nodes are always changing • (1) The update rule is probabilistic – even with constant activity of the visible nodes, hidden nodes receive variable input • (2) The recurrent connections between hidden nodes can change the states of the hidden nodes rapidly • RBM • Keeps (1) and changes (2) by eliminating the recurrent connections within each layer of the BM • Stacking many layers still gives the abilities of general recurrent networks

  37. Contrastive Hebbian learning • Outline of the basic learning step for an RBM • A sensory input state is applied to the input layer → probabilistic recognition in the hidden layer • The pattern in the hidden layer is used to approximately reconstruct the pattern of visible nodes • Alternating Gibbs sampling • Learning rule • Contrastive divergence (CD), by Geoffrey Hinton et al. • Allows some finite number of alternations between hidden responses and the reconstruction of sensory states • Learning with only a few reconstructions is able to self-organize the system (Fig. 10.12: Alternating Gibbs sampling)
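The alternating Gibbs sampling step can be sketched as the standard CD-1 update of a binary RBM. This follows the generic textbook formulation rather than any particular figure in the chapter; the variable names (W, b_v, b_h), batch handling, and learning rate are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.05, rng=np.random.default_rng()):
    """One contrastive-divergence (CD-1) update of a binary RBM (sketch).

    W        -- weights, shape (n_visible, n_hidden)
    b_v, b_h -- visible and hidden biases
    v0       -- batch of training (visible) vectors, shape (batch, n_visible)
    """
    # Up: probabilistic recognition of the data in the hidden layer
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down: approximate reconstruction of the visible pattern
    p_v1 = sigmoid(h0 @ W.T + b_v)
    # Up again on the reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Contrastive Hebbian update: data correlations minus reconstruction correlations
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```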

  38. Deep belief networks • Building a hierarchy of RBMs • Using the activities of hidden nodes in one layer as inputs to the next layer • Many applications • Object recognition in images, information retrieval, modeling V1-V2, digit classification, music classification, etc. • Layered RBMs as auto-encoders • Restricted alternating Gibbs sampling, or contrastive divergence, is used as pre-training • Fine-tuned with the backpropagation technique • Note: for us, it is more important to understand how the brain works

  39. 10.3.4 The Helmholtz machine • A recurrent network with asymmetrical connections • General description of the learning process • Driven by differences between sensory states caused by the environment and the expectations generated by the causal model • Measure of the differences • In the Boltzmann machine – the log-likelihood of the data • Here – the Helmholtz free energy: F = −L(G) + KL( p(c; s, Q) || p(c|s; G) ) • L(G): the log-likelihood of the generative model • p(c; s, Q): the density of causes produced by the recognition model • p(c|s; G): the density of causes produced by the generative model, for a given set of visible states

  40. Learning algorithm for the Helmholtz machine • Wake-sleep algorithm • Wake phase • Data are applied to the input layer • The generative (top-down) weights are trained • Sleep phase • Random sequences are produced by the topmost layer and propagated down with the generative model to the input layer • The recognition (bottom-up) weights are trained • Comments on the wake-sleep algorithm • Resembles the expectation-maximization (EM) algorithm • In the case of stochastic sigmoidal neurons, the training algorithm takes the form of Hebbian-type delta rules
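A minimal sketch of one wake-sleep step for a Helmholtz machine with a single hidden layer of stochastic sigmoidal units. The weight names R (recognition) and G (generative), the bias b_top acting as the prior over hidden states, and the learning rate are illustrative assumptions; real models stack several layers.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(data, R, G, b_top, lr=0.05):
    """One wake-sleep update for a two-layer Helmholtz machine (sketch).

    data  -- binary sensory vector, length n_v
    R     -- recognition weights, visible -> hidden, shape (n_h, n_v)
    G     -- generative weights, hidden -> visible, shape (n_v, n_h)
    b_top -- generative bias of the hidden layer (its prior over causes)
    """
    # Wake phase: recognize the data, then train the generative model
    h = sample(sigmoid(R @ data))          # bottom-up sample of hidden causes
    recon = sigmoid(G @ h)                 # top-down expectation of the data
    G += lr * np.outer(data - recon, h)    # Hebbian-type delta rule (generative)
    b_top += lr * (h - sigmoid(b_top))     # train the prior over hidden states

    # Sleep phase: dream from the generative model, then train recognition
    h_dream = sample(sigmoid(b_top))       # random causes from the prior
    v_dream = sample(sigmoid(G @ h_dream)) # generated ('dreamed') sensory state
    h_guess = sigmoid(R @ v_dream)         # what recognition infers from the dream
    R += lr * np.outer(h_dream - h_guess, v_dream)  # delta rule (recognition)
    return R, G, b_top
```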

  41. Simulation (by Hinton et al.) • Online demonstration: http://www.cs.toronto.edu/~hinton/adi • Recognition-readout-and-stimulation layer • Trained by providing labels as inputs for the purpose of ‘reading the mind’ • Analogous to brain-computer interfaces developed with EEG, fMRI, etc. • Two possible test modes • Supplying a handwritten image and asking for recognition • Asking the system to produce images of a certain letter • The stimulation device allows us to instruct the system to ‘visualize’ specific letters • The probabilistic nature of the system more closely resembles the human ability to produce a variety of responses

  42. 10.3.5 Probabilistic reasoning: causal models and Bayesian networks • Anticipating brain system • We want to implement general learning machines which are able to self-organize from experience • Learning concepts is the basis for forming a general understanding of the environment and for enabling sophisticated anticipation of causes • Statistical models that formalize statistical reasoning in causal models • Bayesian networks • Dynamic Bayesian networks (DBN) • Hidden Markov models

  43. Bayesian networks • Node (circle): random variable • Arrows: represent conditional dependencies (conditional probabilities) • The whole density function can be factorized thanks to the conditional independence of nodes • One can answer specific questions, such as how likely it is to rain given that the weather forecast calls for rain
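The rain/forecast query mentioned above can be answered directly from the factorized joint distribution of a two-node network (rain → forecast). The numbers below are made up purely for illustration.

```python
# Hypothetical numbers, only to illustrate the factorization and the query
p_rain = 0.2                      # prior P(rain)
p_fc_given_rain = 0.9             # P(forecast says rain | it will rain)
p_fc_given_dry = 0.3              # P(forecast says rain | it stays dry)

# The joint factorizes as P(rain, forecast) = P(rain) * P(forecast | rain)
p_fc = p_rain * p_fc_given_rain + (1 - p_rain) * p_fc_given_dry
# Bayes' rule inverts the arrow: P(rain | forecast says rain)
p_rain_given_fc = p_rain * p_fc_given_rain / p_fc
print(round(p_rain_given_fc, 3))  # -> 0.429
```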

  44. Dynamic Bayesian networks and hidden Markov models • The dynamic Bayesian network (DBN) takes temporal aspects into account • The hidden Markov model (HMM) can be seen as a special case of a DBN with the following properties • A Markov chain of hidden nodes • Each observable node has a hidden node as its parent • Stationarity: the laws (conditional probabilities) do not depend on time
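As an illustration of this structure, a minimal forward-algorithm sketch computes the probability of an observation sequence under an HMM; the transition and emission numbers in the usage example are arbitrary, not from the book.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence under an HMM.

    pi  -- initial hidden-state distribution, shape (n_states,)
    A   -- transition matrix, A[i, j] = P(next state j | current state i)
    B   -- emission matrix,  B[i, k] = P(observation k | state i)
    obs -- sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]            # joint of the first observation and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate in time, then weight by emission
    return alpha.sum()                   # marginalize the last hidden state

# Tiny usage example with made-up numbers
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))
```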

  45. 10.3.6 Expectation maximization (EM) • Here we view the problem in a different way • We assume that a general form of a model is given • The problem is to estimate the parameters of the corresponding generative/recognition models in an unsupervised (or self-supervised) way • Expectation maximization (EM) • Technique for parameter estimation • Self-supervised strategy: repeat the following steps until convergence • E-step: make assumptions about the training labels (or the probability that the data were produced by a specific cause) from the current model • M-step: use this hypothesis to update the parameters of the model to maximize the likelihood of the observations
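A minimal EM sketch for a two-component 1-D Gaussian mixture shows the E-step (soft "labels", i.e. responsibilities under the current model) and the M-step (parameter re-estimation). The initialization and number of iterations are arbitrary choices, not the book's example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (sketch)."""
    # Crude initialization of means, variances, and mixing weights
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        lik = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to maximize the expected log-likelihood
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return mu, var, w
```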

  46. Example of EM

  47. Simulation of EM: recognizing data by inverting the generative model using Bayes’ formula

  48. 10.4 Adaptive resonance theory (ART) • Contents of the book so far – important concepts underlying cognitive processes • Learning • Different forms of memory • Self-organization • Attention • Anticipation • ART is an important theory that combines many of these concepts and explains how they are related • Basic ideas: Stephen Grossberg, 1976 • Formal theory: Carpenter and Grossberg, 1987 • Extensions: ART1 (binary patterns), ART2 / Fuzzy ART (real-valued patterns), ARTMAP / fuzzy ARTMAP (supervised learning)

  49. 10.4.1 The basic ART model • A theory that specifies more directly how bottom-up and top-down processes interact to guide learning • Plasticity-stability dilemma (7.2.3) • A major challenge for advanced learning machines • The learning system should learn new concepts, or refine learned concepts, quickly • The system should be stable enough not to overwrite the experience and world model it has acquired over its development • Questions when a pattern is observed in the environment • How should this experience change our world model: should it change our acquired concepts, or should the new input be learned as a new concept? • How much should a new input change an existing concept? • When is an input sufficiently different from everything the system has experienced before to warrant the creation of a new concept?

  50. The basic ART model • Three subsystems • Attentional subsystem, orienting subsystem, gain-control subsystem • Two layers: F1 and F2 • F2 holds the categories: competition among categories and selection of a winning category • F1 holds the features and receives an unspecific gain input • Selection of a category in F2 cancels the gain input to F1
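A much-simplified ART1-style sketch of the interaction between category choice (F2), the vigilance test against the feature pattern (F1), and fast learning. It omits the gain-control and orienting circuitry of the full model; the choice function, the vigilance parameter rho, and the function name art1_learn follow a common simplified formulation and are assumptions rather than the book's equations.

```python
import numpy as np

def art1_learn(patterns, rho=0.7, alpha=0.001):
    """Minimal ART1-style clustering of binary patterns (simplified sketch).

    rho   -- vigilance: how well a category prototype must match an input
    alpha -- small choice parameter favoring more specific categories
    Returns the list of category prototypes and the category index of each input.
    """
    prototypes, labels = [], []
    for I in patterns:
        I = np.asarray(I, dtype=float)
        # F2 competition: rank candidate categories by the choice function
        order = sorted(range(len(prototypes)),
                       key=lambda j: -np.minimum(I, prototypes[j]).sum()
                                     / (alpha + prototypes[j].sum()))
        for j in order:
            # Vigilance test: does the chosen prototype match the input well enough?
            match = np.minimum(I, prototypes[j]).sum() / (I.sum() + 1e-12)
            if match >= rho:                       # resonance: input fits category j
                prototypes[j] = np.minimum(I, prototypes[j])   # fast learning refines it
                labels.append(j)
                break
        else:                                      # no resonance: recruit a new category
            prototypes.append(I.copy())
            labels.append(len(prototypes) - 1)
    return prototypes, labels
```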
