Computational Cognitive Modelling COGS 511 - Lecture 4: Connectionist and Dynamic Approaches
Related Readings Readings: McLeod et al., Chaps. 1, 5, 7; Eliasmith, "The Third Contender" (in Thagard, Chap. 13) • See also (all books available in the METU Library): • Sun, R. (2008). The Cambridge Handbook of Computational Psychology. Chs. 2, 3, and 4. • Rumelhart and McClelland (1986). Parallel Distributed Processing. • Wermter and Sun (2000). Hybrid Neural Systems. • Dawson (2005). Connectionism: A Hands-on Approach. Blackwell. Also see http://www.bcp.psych.ualberta.ca/~mike/Book3/index.html • Lytton (2002). From Computer to Brain: Foundations of Computational Neuroscience. • O'Reilly and Munakata (2000). Computational Explorations in Cognitive Neuroscience. • Rolls and Treves (1998). Neural Networks and Brain Function. • Ward (2002). Dynamical Cognitive Science. • Sutton's web site on Dynamicism in Cognitive Science (see Links) • Various other books are available for studying neural networks in technical depth.
Computers vs Brains • Computers: rapidly evolving; fast cycle time; storage capacity?; parallelism?; fault tolerance?; adaptiveness? • Brains: slowly evolving; slow cycle time; inherently parallel (the 100-step constraint: time for a simple cognitive task / time for the firing of one neuron); fault tolerant; adaptive
Neurons • Major cell type in the nervous system (the other: glial cells) • About 50-100 billion neurons (10¹¹), dense connectivity (typical fan-out 10³), layered organization • Different types of neurons • Soma (cell body), dendrites (small branches), axon, myelin sheath, synaptic gap (10⁻³ mm) • Synaptic connections exhibit plasticity
Neurons (cont.) • Resting membrane potential vs. action potential ("fire!"): concentration of ions • Electrical synapses vs. chemical synapses • Excitatory vs. inhibitory • Neurotransmitters: chemicals released across the synapse, e.g. acetylcholine, dopamine, serotonin - around 30 known • A neuron's death is final! • Synaptic proliferation, pruning, graceful degradation
Connectionism • Parallel Distributed Processing (PDP), Artificial Neural Networks (ANNs) • Computational Neuroscience • Dynamicism • No supermodel (Unified Theory of Cognition) but rather a metatheory with an exploratory perspective - or not?
Connectionist Emphasis on Cognition • Parallelism • Gradedness • Interactivity • Competition • Learning
Comparing PDP vs Symbolic (Rumelhart and McClelland, 1986)
Elements of Connectionism • Computing unit: a node (neuron) as a nonlinear computing unit • Network topology and connectivity: each node has directed and weighted connections with some other nodes. Feedforward/recurrent; single-layer/multilayer • Learning policy: supervised/unsupervised/reinforcement • Problem representation: local/distributed/temporal
Objections as Stated in the PDP Book PDP models are • too weak. • Minsky and Papert's objections to perceptrons • Turing equivalence and recursion • behaviouristic. • Reply: concerned with representation and mental processing • at the implementational level of analysis only. • Reply: PDP is at the psychological level • "Macrotheories are approximations to the underlying microstructure" • The argument of total brain simulation: "A principled interpretation of our understanding of the brain that transfigures it into an understanding of the mind" • not biologically plausible. • Reply: not enough is known from neuroscience to seriously constrain cognitive theories • Quite changed since 1986. • reductionist. • Reply: interactional, not reductionist
Basic Properties of an ANN Unit • A set of input signals • A set of real-valued weights (excitatory or inhibitory) • An activation level based on • an input function of the input values and weights (the inner product of the weight and input vectors, Σj wij·aj) • an activation/threshold function: if the output of the input function is above the threshold, send activation along the output links. • A positive weight will increase a positive net input to the unit it connects to; a negative weight will reduce the effect of a positive net input to the unit it connects to.
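The unit described above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's own implementation; the function names are made up for the example, and a simple binary threshold is used as the activation function.

```python
# A minimal sketch of a single connectionist unit: net input is the inner
# product of the weight and input vectors; a threshold function decides
# whether the unit sends activation along its output links.

def net_input(weights, activations):
    """Inner product sum(w_ij * a_j) over the incoming connections."""
    return sum(w * a for w, a in zip(weights, activations))

def unit_output(weights, activations, threshold=0.0):
    """Binary threshold activation: fire (1) if net input exceeds threshold."""
    return 1 if net_input(weights, activations) > threshold else 0

# Excitatory (+) weights raise the net input; inhibitory (-) weights lower it.
print(unit_output([0.5, 0.5], [1, 1], threshold=0.7))   # → 1 (net = 1.0, fires)
print(unit_output([0.5, -0.8], [1, 1]))                 # → 0 (net = -0.3, silent)
```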
Basic ANN architecture (McLeod et al., 1998)
Sample Activation Functions (McLeod et al., 1998)
Learning • If the output units have the correct level of activity in response to the pattern of activity produced by any input in the problem domain, the model has learnt the desired task. The network is said to converge when the error becomes acceptably low. • The aim of a learning rule is to find an optimal set of weights for the network connections such that the network also correctly predicts previously unseen input data (problem: overfitting/overtraining). • Learning by changing network topology: optimizing the number of hidden units by genetic algorithms; optimizing connectivity by "optimal brain damage". • Training vs. testing: training is done in multiple presentations of the data, called sweeps, organized into epochs.
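One common guard against the overtraining problem mentioned above is early stopping on a held-out validation set: stop the epoch loop once validation error starts rising. The sketch below is an illustrative skeleton (the function name, `patience` parameter, and the error values in the demo are all fabricated for the example, not taken from the lecture).

```python
# Sketch of epoch-based training with early stopping: the per-epoch
# validation errors are passed in as a list, standing in for whatever a
# real network would measure after each epoch of sweeps.

def train(epoch_errors, patience=2):
    """Stop once validation error has risen for `patience` consecutive epochs.
    Returns (epoch at which training stopped, best error seen)."""
    best, rising, stopped_at = float("inf"), 0, len(epoch_errors)
    for epoch, err in enumerate(epoch_errors):
        if err < best:
            best, rising = err, 0       # still improving: reset the counter
        else:
            rising += 1
            if rising >= patience:      # overtraining suspected: stop
                stopped_at = epoch
                break
    return stopped_at, best

# Validation error falls, then climbs as the network starts to overfit.
print(train([0.9, 0.5, 0.3, 0.35, 0.4, 0.5]))  # → (4, 0.3): stopped early
```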
Delta Rule • Perceptrons: single-layered, feedforward networks (Rosenblatt, 1950s) • An output unit which has too low an activity can be corrected by increasing the weights of connections from units in the previous layer which provide positive input to it (decrease the weights if the input is negative) • Δwij = ε · [ai(desired) − ai(obtained)] · aj • Learning rate (ε): a constant that determines how large the changes to the weights will be on any learning trial
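As a sketch of the rule in action, the snippet below trains a single threshold output unit on logical AND (a linearly separable problem, so the delta rule suffices). The hyperparameters and the treatment of the bias as an extra weight are illustrative choices, not the lecture's.

```python
# Delta rule on a single perceptron unit:  Δw_ij = ε·(desired − obtained)·a_j

def predict(weights, bias, inputs):
    """Threshold unit: fire (1) if the net input exceeds 0."""
    return 1 if sum(w * a for w, a in zip(weights, inputs)) + bias > 0 else 0

def train_delta(patterns, epochs=20, rate=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, desired in patterns:
            error = desired - predict(weights, bias, inputs)  # a_des − a_obt
            weights = [w + rate * error * a for w, a in zip(weights, inputs)]
            bias += rate * error          # bias as a weight from a constant 1
    return weights, bias

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_delta(AND)
print([predict(weights, bias, x) for x, _ in AND])  # → [0, 0, 0, 1]
```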
Delta Rule → LMS → Gradient Descent • Netinputi = Σj wij·aj • δ = (tout − aout) • Ep = (tout − aout)² for an input pattern p • aout = F(Σin w·ain) • Δw = −ε·dE/dw • Δw = −ε·d(tout − aout)²/dw • Δw = −ε·d(tout − F(Σin w·ain))²/dw • Δw = 2·ε·δ·F′(Σin w·ain)·ain, where F′ is the slope of the activation function at the output unit
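The derivation above can be checked numerically: for a single logistic output unit, the analytic gradient dE/dw = −2·(tout − aout)·F′(net)·ain should agree with a finite-difference estimate. All the numeric values below are illustrative.

```python
import math

# Check the gradient-descent derivation for one logistic output unit by
# comparing the analytic dE/dw against a central-difference estimate.

def F(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def error(w, a_in, t):
    """E_p = (t_out - a_out)^2 for a single weight and input."""
    return (t - F(w * a_in)) ** 2

w, a_in, t = 0.4, 0.8, 1.0
a_out = F(w * a_in)
slope = a_out * (1 - a_out)                 # F'(net) for the logistic
analytic = -2 * (t - a_out) * slope * a_in  # dE/dw from the derivation

h = 1e-6                                    # numerical estimate of dE/dw
numeric = (error(w + h, a_in, t) - error(w - h, a_in, t)) / (2 * h)
print(abs(analytic - numeric) < 1e-8)       # → True: the two gradients agree
```

Note the sign: since t > aout here, dE/dw is negative, so Δw = −ε·dE/dw increases the weight, as the delta rule says it should.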
Linear Separability (McLeod et al., 1998)
Exclusive OR (XOR) Reimplemented (McLeod et al., 1998)
Gradient Descent • Error landscape for two weights • Holding all but one weight constant • Ep: error score for input pattern p (square of the difference between desired output and obtained output) (McLeod et al., 1998)
Basic Idea of Gradient Descent • If you can calculate the slope of the curve at its current position, you can change w in the direction that will reduce the error: if the slope is positive, decrease w; if negative, increase w. • The slope is calculated by taking derivatives (the rate of change of E with respect to w). • With more weights, the surface has more dimensions, but you still try to minimize the error. • Since the derivative of the activation function is also needed, a differentiable function (one which has a derivative at every point), such as the logistic function, is used as the activation function.
Backpropagation • Aka the generalized delta rule, applied to multilayer perceptrons (at least one layer of hidden units) • Propagate the error back to previous layers and update the weights. • A hidden node is responsible for some fraction of the error in each of the output nodes to which it connects, in proportion to the strength of the connection between them.
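The blame-assignment idea above can be sketched on XOR, the classic non-linearly-separable problem from the earlier slides. This is an illustrative toy, not the lecture's implementation: the architecture (3 hidden units with biases), the seed, and the hyperparameters are all arbitrary choices, and the demo only shows that the error falls during training.

```python
import math, random

# Backpropagation (generalized delta rule) on XOR with one hidden layer.

random.seed(1)
logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
H = 3  # hidden units
w_ih = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b_h = [0.0] * H
w_ho = [random.uniform(-1, 1) for _ in range(H)]
b_o = 0.0
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x):
    hidden = [logistic(sum(w * a for w, a in zip(w_ih[j], x)) + b_h[j])
              for j in range(H)]
    out = logistic(sum(w * h for w, h in zip(w_ho, hidden)) + b_o)
    return hidden, out

def sweep(rate=0.5):
    """One presentation of every training pattern; returns the summed error."""
    global b_o
    total = 0.0
    for x, t in XOR:
        hidden, out = forward(x)
        total += (t - out) ** 2
        d_out = (t - out) * out * (1 - out)     # error signal at the output
        for j in range(H):
            # each hidden unit takes blame in proportion to its connection
            d_hid = d_out * w_ho[j] * hidden[j] * (1 - hidden[j])
            w_ho[j] += rate * d_out * hidden[j]
            for i in range(2):
                w_ih[j][i] += rate * d_hid * x[i]
            b_h[j] += rate * d_hid
        b_o += rate * d_out
    return total

errors = [sweep() for _ in range(2000)]
print(errors[-1] < errors[0])  # → True: error falls over the training sweeps
```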
Local minima • For any mapping problem a multilayer network with a correct set of weights exists; backpropagation, however, is not guaranteed to find it, since gradient descent can get trapped in a local minimum. • A general problem for gradient-descent search in AI, with a number of fixes (McLeod et al., 1998)
Biological Plausibility of Backpropagation • Axons are unidirectional transmitters of information - how would the error signal travel back? • The number of hidden units is critical • Learning rules such as the Hebbian learning rule require only local information and are unsupervised. • Biologically more plausible extensions of backpropagation exist, such as the Generalized Recirculation (GeneRec) algorithm and Leabra (GeneRec + Hebbian)
Simple Recurrent Networks (SRNs) (McLeod et al., 1998)
Recurrent Network Architectures • SRNs: fixed-weight connections from the hidden units to a set of context units, which act as a memory of hidden-unit activities and feed them back to the hidden units on the next time step. Can discover sequential dependencies in training data. • The change of output over time causes the network to settle into one of several states depending on the input. These states are called attractors. Points in state space close to an attractor reach the final state more quickly.
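The SRN context mechanism can be sketched as a single time-step function: the hidden units see the current input plus the previous hidden state, and the new hidden state is copied into the context for the next step. The sizes and weight values below are illustrative, not from the lecture.

```python
import math

# Sketch of the Elman-style SRN step: context units hold a copy of the
# previous hidden activities and feed them back as extra input.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def srn_step(x, context, w_in, w_context):
    """One time step: each hidden unit sums input and context contributions;
    the new hidden state becomes the next step's context."""
    hidden = [logistic(sum(w * a for w, a in zip(w_in[j], x)) +
                       sum(w * c for w, c in zip(w_context[j], context)))
              for j in range(len(w_in))]
    return hidden, list(hidden)   # (hidden activities, new context)

w_in = [[0.5], [-0.5]]                  # 1 input unit -> 2 hidden units
w_context = [[0.3, 0.1], [0.2, -0.4]]   # fixed context-to-hidden weights
context = [0.0, 0.0]
for x in [1.0, 0.0, 1.0]:               # same inputs, evolving context
    hidden, context = srn_step([x], context, w_in, w_context)
print(context != [0.0, 0.0])  # → True: the context now carries history
```

Because the context differs at each step, identical inputs can produce different hidden states, which is what lets the network pick up sequential dependencies.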
An Attractor Space with Two Basins (McLeod et al., 1998)
Advantages • Simulate reaction time by measuring time to settle into one of the attractor states • Relatively immune to noisy input • Arbitrary mappings between input and output are allowed • Dynamic in character
Arbitrary Input-Output Mappings in Attractor Networks (McLeod et al., 1998)
Variety of ANNs • Hopfield networks • Adaptive Resonance Theory (ART) networks • Kohonen Self-Organizing Maps • Radial Basis Function networks • Boltzmann Machines, Support Vector Machines and more...
Hybrid Neural Networks • Best of both worlds? • Unified Neural Architectures • Rely solely on connectionist representations, but symbolic interpretation is possible • Hybrid Transformation Architectures • Transform symbolic representations into neural ones or vice versa, e.g. rule-extraction architectures • Hybrid Modular Architectures • Coexisting symbolic and neural modules • Coupling between them can be loose or tight
Dynamicism • Natural cognitive systems are certain kinds of dynamical systems and are best understood from the perspective of dynamics, i.e. unambiguously described interactions of a cognizer with its environment through time. • A novel set of metaphors for thinking about cognition, or real explanatory power for embodied cognition? • Brains are dynamical systems, but is the dynamicist hypothesis a new paradigm?
Dynamical Systems Theory • Terminology: state space, trajectory, attractors, topology, bifurcation points, etc. • Tools: linear and nonlinear time series analysis, chaos theory, complexity theory, relaxation oscillators
Applications • Cyclical motor behaviour model of the human neonate • Olfactory bulb model: a model of the neural processing of smell in rabbits • The A-not-B error in infants: an immature concept of object permanence, or an inability to sustain a visually cued reach in a novel direction in the presence of a strong memory of previous reaches? Prediction: it should be possible to observe the error in older children
Difficulties • Dimensionality and tractability of models • Do we reject mental representations? Alternatives? Internal states? • Predictive power? • Connectionism vs. dynamicism: not always easily separable, e.g. Elman's SRNs predicting word boundaries; is only the vocabulary and style of explanation different?
Bayes' Rule • Product rule: P(h ∧ d) = P(h | d) P(d) = P(d | h) P(h) • Bayes' rule: P(h | d) = P(d | h) P(h) / P(d) • h: hypothesis • d: data • Useful for assessing diagnostic probability from causal probability: • P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Bayes' Rule • A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows that the prior probability that a patient has meningitis is very low, 1/50,000, and the prior probability that a patient has a stiff neck is 1/20. • E.g., let M be meningitis, S be stiff neck: • P(m|s) = P(s|m) P(m) / P(s) = (0.5 × 1/50,000) / (1/20) = 0.0002 • Note: the posterior probability of meningitis is still very small! We expect only one in 5000 cases with stiff neck to have meningitis. (Russell and Norvig, 2003)
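The slide's arithmetic can be verified directly. The numbers are from the example above (Russell and Norvig, 2003); the function name is just illustrative.

```python
# Diagnostic probability from causal probability via Bayes' rule.

def bayes(p_effect_given_cause, p_cause, p_effect):
    """P(Cause | Effect) = P(Effect | Cause) * P(Cause) / P(Effect)."""
    return p_effect_given_cause * p_cause / p_effect

# P(m|s) = P(s|m) P(m) / P(s) with the slide's figures:
p_m_given_s = bayes(0.5, 1 / 50000, 1 / 20)
print(p_m_given_s)  # ≈ 0.0002: one stiff-neck case in 5000 has meningitis
```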
Bayesian Models of Cognition • Many recent models of learning and inference • Language acquisition • Visual scene perception • Categorization • Causal relations • The available data underconstrain the inferences, so we make guesses guided by prior probabilities about which structures are most likely. • Modelling at the "computational" level • Integration attempts between connectionism and Bayesian models
Lecture 5 • HWs announced - to be done individually... • Do not forget the Forum Activity • Problems and Evaluation in Cognitive Modelling. Reading: Gluck, Pew and Young (2005).