Emergent Functions of Simple Systems

J. L. McClelland, Stanford University

Topics
• Emergent probabilistic optimization in neural networks
• Relationship between competence/rational approaches and mechanistic (including connectionist) approaches
• Some models that bring connectionist and probabilistic approaches into proximal contact

Connectionist Units Calculate Posteriors based on Priors and Evidence
Given:
• A unit representing hypothesis $h_i$, with binary inputs $a_j$ representing the state of various elements of evidence $e$, where the $e_j$ are assumed conditionally independent given $h_i$
• A bias on the unit equal to $\log(\mathrm{prior}_i / (1 - \mathrm{prior}_i))$
• Weights to the unit from each input equal to $w_{ij} = \log(p(e_j \mid h_i) / p(e_j \mid \lnot h_i))$
If the output of the unit is computed from the logistic function
$a_i = 1 / [1 + \exp(-(\mathrm{bias}_i + \sum_j a_j w_{ij}))]$
then $a_i = p(h_i \mid e)$.
[Figure: input from unit j reaches unit i through weight $w_{ij}$.]
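
The claim on this slide can be checked numerically. The sketch below is illustrative only: the prior and likelihood values are made up, and it assumes that an inactive input ($a_j = 0$) means the corresponding element of evidence was simply not observed, so it contributes nothing to the net input. Under those assumptions, the logistic unit's activation should agree with a direct application of Bayes' rule.

```python
import math

def logistic(x):
    # Standard logistic function; note the minus sign in the exponent.
    return 1.0 / (1.0 + math.exp(-x))

def unit_posterior(prior, p_e_given_h, p_e_given_not_h, inputs):
    # Bias is the log prior odds; each weight is a log likelihood ratio.
    bias = math.log(prior / (1.0 - prior))
    weights = [math.log(ph / pn) for ph, pn in zip(p_e_given_h, p_e_given_not_h)]
    net = bias + sum(a * w for a, w in zip(inputs, weights))
    return logistic(net)

def bayes_posterior(prior, p_e_given_h, p_e_given_not_h, inputs):
    # Direct Bayes' rule over the observed (active) evidence elements,
    # assuming conditional independence given h and given not-h.
    joint_h = prior
    joint_not_h = 1.0 - prior
    for a, ph, pn in zip(inputs, p_e_given_h, p_e_given_not_h):
        if a:
            joint_h *= ph
            joint_not_h *= pn
    return joint_h / (joint_h + joint_not_h)

# Hypothetical numbers, chosen only for illustration.
prior = 0.2
p_e_given_h = [0.9, 0.6, 0.3]
p_e_given_not_h = [0.2, 0.5, 0.4]
inputs = [1, 1, 0]  # third evidence element not observed

unit = unit_posterior(prior, p_e_given_h, p_e_given_not_h, inputs)
bayes = bayes_posterior(prior, p_e_given_h, p_e_given_not_h, inputs)
print(unit, bayes)
```

The two printed values coincide because the net input is exactly the log posterior odds, and the logistic function converts log odds back to a probability.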

Choosing one of N alternatives
• A collection of connectionist units representing mutually exclusive alternative hypotheses can assign the posterior probability to each in a similar way, using the softmax activation function:
  $\mathrm{net}_i = \mathrm{bias}_i + \sum_j a_j w_{ij}$
  $a_i = \exp(g\,\mathrm{net}_i) / \sum_{i'} \exp(g\,\mathrm{net}_{i'})$
• If $g = 1$, this constitutes probability matching.
• As $g$ increases, more and more of the activation goes to the most likely alternative(s).
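
A small sketch of the softmax rule and the effect of the gain parameter g. It assumes, for illustration, that each unit's net input is the log prior plus the log likelihood of the evidence under that hypothesis; the priors and likelihoods are hypothetical values.

```python
import math

def softmax(nets, g=1.0):
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    top = max(nets)
    exps = [math.exp(g * (n - top)) for n in nets]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical priors p(h_i) and likelihoods p(e | h_i) for three alternatives.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.4]

# Assumed net input: log prior (bias) plus log likelihood (summed weighted input).
nets = [math.log(p) + math.log(l) for p, l in zip(priors, likelihoods)]

matching = softmax(nets, g=1.0)   # equals the Bayesian posterior
sharp = softmax(nets, g=10.0)     # activation concentrates on the best alternative
print(matching)
print(sharp)
```

With g = 1 the activations reproduce the posterior over the three alternatives; with g = 10 nearly all of the activation goes to the most probable one.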

Emergent Outcomes from Local Computations (Hopfield, ’82; Hinton & Sejnowski, ’83)
• If $w_{ij} = w_{ji}$ and units are updated asynchronously, setting $a_i = 1$ if $\mathrm{net}_i > 0$ and $a_i = 0$ otherwise, the network will settle to a state $s$ that is a local maximum of a measure Rumelhart et al. (1986) called $G$:
  $G(s) = \sum_{i<j} w_{ij} a_i a_j + \sum_i a_i (\mathrm{bias}_i + \mathrm{ext}_i)$
• If each unit sets its activation to 1 with probability $\mathrm{logistic}(g\,\mathrm{net}_i)$, then
  $p(s) = \exp(g\,G(s)) / \sum_{s'} \exp(g\,G(s'))$
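
The deterministic case on this slide can be illustrated with a short simulation. This is a sketch under stated assumptions: the weights and biases are random, the network has six units, and the external input is folded into the bias term. It tracks G after every change of activation and stops when a full sweep changes nothing.

```python
import random

def goodness(w, bias, a):
    # G(s) = sum over pairs i<j of w_ij a_i a_j, plus sum of a_i * bias_i.
    n = len(a)
    pairs = sum(w[i][j] * a[i] * a[j] for i in range(n) for j in range(i + 1, n))
    return pairs + sum(a[i] * bias[i] for i in range(n))

def net_input(w, bias, a, i):
    return bias[i] + sum(w[i][j] * a[j] for j in range(len(a)) if j != i)

def settle(w, bias, a, rng):
    # Asynchronous threshold updates in random order until nothing changes.
    a = list(a)
    history = [goodness(w, bias, a)]
    changed = True
    while changed:
        changed = False
        for i in rng.sample(range(len(a)), len(a)):
            new = 1 if net_input(w, bias, a, i) > 0 else 0
            if new != a[i]:
                a[i] = new
                changed = True
                history.append(goodness(w, bias, a))
    return a, history

rng = random.Random(0)
n = 6
w = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = rng.uniform(-1.0, 1.0)  # symmetric weights
bias = [rng.uniform(-0.5, 0.5) for _ in range(n)]
start = [rng.randint(0, 1) for _ in range(n)]

final, history = settle(w, bias, start, rng)
print(final, history)
```

Each update changes G by the unit's change in activation times its net input, which is never negative under the threshold rule, so the recorded goodness values never decrease and the final state cannot be improved by flipping any single unit.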

A Tweaked Connectionist Model (McClelland & Rumelhart, 1981) that is Also a Graphical Model
• Each pool of units in the IA model is equivalent to a Dirichlet variable (cf. Dean, 2005).
• This is enforced by using softmax to set one of the $a_j$ in each pool to 1, with unit $j$ chosen with probability $p_j = e^{g\,\mathrm{net}_j} / \sum_{j'} e^{g\,\mathrm{net}_{j'}}$
• Weight arrays linking the variables are the equivalent of the ‘edges’ encoding conditional relationships between states of these different variables.
• Biases at word level encode the prior $p(w)$.
• Weights are bi-directional, but encode generative constraints ($p(l \mid w)$, $p(f \mid l)$).
• At equilibrium with $g = 1$, the network’s probability of being in state $s$ equals $p(s \mid I)$.
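
To make the equilibrium claim concrete, here is a toy two-pool example that is far smaller than the IA model itself. Everything in it is hypothetical: two words, two letters in a single position, and made-up probabilities. It assumes biases equal to log p(w), weights equal to log p(l|w), exactly one active unit per pool, and a letter unit clamped directly in place of feature-level input. The equilibrium distribution is computed exactly by enumeration rather than by sampling.

```python
import math

# Hypothetical generative model: prior over words, letter given word.
p_word = {"cat": 0.7, "cot": 0.3}
p_letter_given_word = {
    "cat": {"a": 0.9, "o": 0.1},
    "cot": {"a": 0.2, "o": 0.8},
}

# Assumed parameterization: bias = log prior, weight = log conditional.
bias = {w: math.log(p) for w, p in p_word.items()}
weight = {
    (w, l): math.log(p)
    for w, dist in p_letter_given_word.items()
    for l, p in dist.items()
}

def goodness(word, letter):
    # With one active unit per pool, G reduces to bias + connecting weight.
    return bias[word] + weight[(word, letter)]

def equilibrium_given_letter(letter, g=1.0):
    # Boltzmann distribution over word states with the letter unit clamped.
    unnorm = {w: math.exp(g * goodness(w, letter)) for w in p_word}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

def bayes_given_letter(letter):
    joint = {w: p_word[w] * p_letter_given_word[w][letter] for w in p_word}
    z = sum(joint.values())
    return {w: v / z for w, v in joint.items()}

net_dist = equilibrium_given_letter("o")
bayes_dist = bayes_given_letter("o")
print(net_dist)
print(bayes_dist)
```

Because exp(G) for each joint state equals p(w) p(l|w), normalizing over the word pool with the letter clamped yields the posterior over words when g = 1.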

We want to learn how to represent the world and constraints among its constituents from experience, using (to the fullest extent possible) a domain-general approach.
In this context, the prototypical connectionist learning rules correspond to probability maximization or matching.
Back Propagation Algorithm:
• Treats output units (or n-way pools) as conditionally independent given the input $I$
• Maximizes $p(o_i \mid I)$ for each output unit
But that’s not the true PDP approach to Perception/Cognition/etc…
[Figure: network mapping input I to output o.]

Overcoming the Independence Assumption
• The Boltzmann Machine learning algorithm learns to match probabilities of entire output states $o$ given the current input $I$.
• That is, it minimizes $\int p(o \mid I) \log(p(o \mid I) / q(o \mid I))\,do$, where:
  • $p(o \mid I)$ is sampled from the environment (plus phase)
  • $q(o \mid I)$ is the net’s estimate of $p(o \mid I)$, obtained by settling with the input only (minus phase)
• The algorithm is beautifully simple and local: $\Delta w_{ij} = \epsilon\,(a_i^+ a_j^+ - a_i^- a_j^-)$
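
The weight update on this slide can be sketched in a few lines. The version below is a simplification, not a full Boltzmann machine: it assumes the plus-phase and minus-phase states have already been collected (here they are hand-written, hypothetical binary vectors) and averages the coactivation products over the samples in each phase, which is an added assumption beyond the single-sample form of the rule. The learning rate name `epsilon` is likewise a choice made for this example.

```python
def coactivation(states):
    # Average of a_i * a_j over a list of binary state vectors.
    n = len(states[0])
    return [
        [sum(s[i] * s[j] for s in states) / len(states) for j in range(n)]
        for i in range(n)
    ]

def boltzmann_delta_w(plus_states, minus_states, epsilon=0.1):
    # Delta w_ij = epsilon * (<a_i a_j>_plus - <a_i a_j>_minus); no self-connections.
    plus = coactivation(plus_states)
    minus = coactivation(minus_states)
    n = len(plus)
    return [
        [0.0 if i == j else epsilon * (plus[i][j] - minus[i][j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical samples: units 0 and 1 co-occur more often in the plus phase.
plus_states = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
minus_states = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]

delta = boltzmann_delta_w(plus_states, minus_states)
for row in delta:
    print(row)
```

The update uses only the activations of the two units a weight connects, which is the sense in which the rule is local. It vanishes when the two phases have identical coactivation statistics, so learning stops once the network's own behavior matches the environment.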

Recent Developments
• Hinton’s deep belief networks are fully distributed learned connectionist models that use a restricted form of the Boltzmann machine (no intra-layer connections) and learn state-of-the-art models very fast.
• Generic constraints (sparsity, locality) allow such networks to learn efficiently and generalize very well in demanding task contexts.
Hinton, Osindero, and Teh (2006). A fast learning algorithm for deep belief networks. Neural Computation, 18, 1527-54.

One take on the relationship between rational analysis and human behavior
• Characterizing what’s optimal is always a great thing to do.
• Optimality is always relative to some framework; what that framework should be isn’t always obvious.
• It is possible to construct a way of seeing virtually anything as optimal post hoc (cf. Voltaire’s Candide).
• Optimization is also relative to a set of constraints:
  • Time
  • Memory
  • Processing speed
  • Available mechanisms
  • Simplifying assumptions
  • …
• The question of whether people do behave optimally (according to some framework and constraints) in any particular situation is an empirical question.
• The question of why and how people can/do so behave in some situations and not in others is worth understanding more thoroughly.

Two perspectives
First perspective:
• People are rational; their behavior is optimal.
• They seek explicit internal models of the structure of the world, within which to reason:
  • Optimal structure type for each domain
  • Optimal structure instance within type
• Resource limits and implementation constraints are unknown, and should be ignored in determining what is rational/optimal.
• Inference is still hard, and prior domain-specific constraints are therefore essential.
Second perspective:
• People evolved through an optimization process, and are likely to approximate optimality/rationality within limits.
• Many aspects of natural/intuitive cognition may depend largely on implicit knowledge.
• Natural structure (e.g. language) does not exactly correspond to any specific structure type.
• Culture/School encourages us to think and reason explicitly, and gives us tools for this; we do so under some circumstances.

Same experienced structure leads to different outcomes under different performance conditions (Sternberg & McClelland, in prep)
• A box appears…
• Then one or two objects appear.
• Then a dot may or may not appear.
• RT condition: Respond as fast as possible when the dot appears.
• Prediction condition: Predict whether a dot will appear; get feedback after the prediction.
• Each event in the box occurs several times, interleaved, with reversal of outcome on 10% of trials.
• Half of the participants are instructed in the Causal Powers model, half not.
• All participants learn explicit relations.
• Only Instructed Prediction subjects show Blocking and Screening.
Trial types: AB+, A+; CD+, C-; EF+; GH-, G-; fillers

Two perspectives
First perspective:
• People are rational; their behavior is optimal.
• They seek explicit internal models of the structure of the world, within which to reason:
  • Optimal structure type for each domain
  • Optimal structure instance within type
• Resource limits and implementation constraints are unknown, and should be ignored in determining what is rational/optimal.
• Inference is still hard, and prior domain-specific constraints are therefore essential.
Second perspective:
• People evolved through an optimization process, and are likely to approximate optimality/rationality within limits.
• Many aspects of natural/intuitive cognition may depend largely on implicit knowledge.
• Natural structure (e.g. language) does not exactly correspond to any specific structure type.
• Culture/School encourages us to think and reason explicitly, and gives us tools for this; we do so under some circumstances.
• Many connectionist models do not directly address this kind of thinking; eventually they should be elaborated to do so.
• Human behavior won’t be understood without considering the constraints it operates under.
• Determining what is optimal sans constraints is always useful, even so.
• Such an effort should not presuppose that individual humans intend to derive an explicit model.
• Inference is hard, and domain-specific priors can help, but domain-general mechanisms subject to generic constraints deserve full exploration.
• In some cases such models may closely approximate what might be the optimal explicit model.
• But that model might only be an approximation, and the domain-specific constraints might not be necessary.

What is happening here?
• Prediction participants have both a causal framework and the time to reason explicitly about which objects have the power to make the dot appear and which do not.
• Recall of (e.g.) C- during a CD prediction trial, in conjunction with the causal powers story, licenses the inference to D+.
• This inference does not occur without both the time to think and the appropriate cover story.

The Rumelhart Semantic Attribution Model is Approximated by a Gradually Changing Mixture of Increasingly Specific Naïve Bayes Classifiers (Roger Grosse, 2007)
[Figure: correlation of the network’s attributions with the indicated classifier, in three panels: Very young, Still young, Older.]

Some models that bring connectionist and probabilistic approaches into proximal contact
• Graphical IA model of Context Effects in Perception
  • In progress; see Movellan & McClelland, 2001.
• Leaky Competing Accumulator Model of Decision Dynamics
  • Usher and McClelland, 2001, and the large family of related decision making models
• Models of Unsupervised Category Learning
  • Competitive Learning, OME, TOME (Lake et al, ICDL08).
• Subjective Likelihood Model of Recognition Memory
  • McClelland and Chappell, 1998 (cf. REM, Steyvers and Shiffrin, 1997), and a forthcoming variant using distributed item representations.
