Bayesian models of human learning and reasoning Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Comp

Bayesian models of human learning and reasoning Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL)

Collaborators Tom Griffiths Charles Kemp Noah Goodman Chris Baker Amy Perfors Vikash Mansinghka Lauren Schmidt Pat Shafto

The probabilistic revolution in AI • Principled and effective solutions for inductive inference from ambiguous data: • Vision • Robotics • Machine learning • Expert systems / reasoning • Natural language processing • Standard view: no necessary connection to how the human brain solves these problems.

Bayesian models of cognition Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...] Language acquisition and processing [Brent, de Marken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …] Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr,…] Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …] Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …] Attention [Mozer, Huber, Torralba, Oliva, Geisler, Movellan, Yu, Itti, Baldi, …] Categorization and concept learning [Anderson, Nosfosky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …] Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …] Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …] Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, …]

Everyday inductive leaps How can people learn so much about the world from such limited evidence? • Learning concepts from examples “horse” “horse” “horse”

“tufa” “tufa” “tufa” Learning concepts from examples

Everyday inductive leaps How can people learn so much about the world from such limited evidence? • Kinds of objects and their properties • The meanings of words, phrases, and sentences • Cause-effect relations • The beliefs, goals and plans of other people • Social structures, conventions, and rules

Modeling Goals • Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions. • Explain how and why human learning and reasoning works, in terms of (approximations to) optimal statistical inference in natural environments. • A framework for studying people’s implicit knowledge about the structure of the world: how it is structured, used, and acquired. • A two-way bridge to state-of-the-art AI and machine learning.

The approach: from statistics to intelligence • How does background knowledge guide learning from sparsely observed data? Bayesian inference: 2. What form does background knowledge take, across different domains and tasks? Probabilities defined over structured representations: graphs, grammars, predicate logic, schemas, theories. 3. How is background knowledge itself acquired? Hierarchical probabilistic models, with inference at multiple levels of abstraction. Flexible nonparametric models in which complexity grows with the data.

Outline • Predicting everyday events • Learning concepts from examples • The big picture

Basics of Bayesian inference • Bayes’ rule: • An example • Data: John is coughing • Some hypotheses: • John has a cold • John has lung cancer • John has a stomach flu • Likelihood P(d|h) favors 1 and 2 over 3 • Prior probability P(h) favors 1 and 3 over 2 • Posterior probability P(h|d) favors 1 over 2 and 3

Bayesian inference in perception and sensorimotor integration (Weiss, Simoncelli & Adelson 2002) (Kording & Wolpert 2004)

Everyday prediction problems(Griffiths & Tenenbaum, 2006) • You read about a movie that has made $60 million to date. How much money will it make in total? • You see that something has been baking in the oven for 34 minutes. How long until it’s ready? • You meet someone who is 78 years old. How long will they live? • Your friend quotes to you from line 17 of his favorite poem. How long is the poem? • You meet a US congressman who has served for 11 years. How long will he serve in total? • You encounter a phenomenon or event with an unknown extent or duration, ttotal, at a random time or value of t <ttotal.What is the total extent or duration ttotal?

Bayesian analysis P(ttotal|t)  P(t|ttotal) P(ttotal)  1/ttotal P(ttotal) Assume random sample (for 0 < t < ttotal else = 0) Form of P(ttotal)? e.g., uninformative (Jeffreys) prior  1/ttotal

Bayesian analysis P(ttotal|t)  1/ttotal 1/ttotal posterior probability Random sampling “Uninformative” prior P(ttotal|t) ttotal t Best guess for ttotal: t* such that P(ttotal > t*|t) = 0.5 Yields Gott’s Rule: Guess t*= 2t

Evaluating Gott’s Rule • You read about a movie that has made $78 million to date. How much money will it make in total? • “$156 million” seems reasonable. • You meet someone who is 35 years old. How long will they live? • “70 years” seems reasonable. • Not so simple: • You meet someone who is 78 years old. How long will they live? • You meet someone who is 6 years old. How long will they live?

Priors P(ttotal) based on empirically measured durations or magnitudes for many real-world events in each class: Median human judgments of the total duration or magnitude ttotal of events in each class, given that they are first observed at a duration or magnitude t, versus Bayesian predictions (median of P(ttotal|t)).

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?

You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign? How long did the typical pharaoh reign in ancient egypt?

Summary: prediction • Predictions about the extent or magnitude of everyday events follow Bayesian principles. • Contrast with Bayesian inference in perception, motor control, memory: no “universal priors” here. • Predictions depend rationally on priors that are appropriately calibrated for different domains. • Form of the prior (e.g., power-law or exponential) • Specific distribution given that form (parameters) • Non-parametric distribution when necessary. • In the absence of concrete experience, priors may be generated by qualitative background knowledge.

Learning concepts from examples “tufa” “tufa” • Word learning “tufa” • Property induction Cows have T9 hormones. Seals have T9 hormones. Squirrels have T9 hormones. All mammals have T9 hormones. Cows have T9 hormones. Sheep have T9 hormones. Goats have T9 hormones. All mammals have T9 hormones.

The computational problem(c.f., semi-supervised learning) ? Horse Cow Chimp Gorilla Mouse Squirrel Dolphin Seal Rhino Elephant ? ? ? ? ? ? ? ? New property Features (85 features from Osherson et al. E.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘quadrapedal’,…)

Similarity-based models Human judgments of argument strength Model predictions Cows have property P. Elephants have property P. Horses have property P. All mammals have property P. Gorillas have property P. Mice have property P. Seals have property P. All mammals have property P.

Beyond similarity-based induction Poodles can bite through wire. German shepherds can bite through wire. • Reasoning based on dimensional thresholds:(Smith et al., 1993) • Reasoning based on causal relations:(Medin et al., 2004; Coley & Shafto, 2003) Dobermans can bite through wire. German shepherds can bite through wire. Salmon carry E. Spirus bacteria. Grizzly bears carry E. Spirus bacteria. Grizzly bears carry E. Spirus bacteria. Salmon carry E. Spirus bacteria.

Horses have T9 hormones Rhinos have T9 hormones Cows have T9 hormones } X Y Hypotheses h Horse Cow Chimp Gorilla Mouse Squirrel Dolphin Seal Rhino Elephant ? ? ? ? ? ? ? ? ... ... Prior P(h)

Horses have T9 hormones Rhinos have T9 hormones Cows have T9 hormones } X Y Hypotheses h Prediction P(Y | X) Horse Cow Chimp Gorilla Mouse Squirrel Dolphin Seal Rhino Elephant ? ? ? ? ? ? ? ? ... ... Prior P(h)

Where does the prior come from? Horse Cow Chimp Gorilla Mouse Squirrel Dolphin Seal Rhino Elephant ... ... Prior P(h) Why not just enumerate all logically possible hypotheses along with their relative prior probabilities?

Chimps have T9 hormones. Gorillas have T9 hormones. Taxonomic similarity Poodles can bite through wire. Dobermans can bite through wire. Jaw strength Salmon carry E. Spirus bacteria. Grizzly bears carry E. Spirus bacteria. Food web relations Different sources for priors

mouse P(form) squirrel chimp gorilla P(structure | form) P(data | structure) Hierarchical Bayesian Framework F: form Tree with species at leaf nodes Background knowledge S: structure F1 F2 F3 F4 Has T9 hormones mouse squirrel chimp gorilla ? ? ? D: data …

The value of structural form knowledge: inductive bias

mouse squirrel chimp gorilla Hierarchical Bayesian Framework F: form Tree with species at leaf nodes S: structure F1 F2 F3 F4 Has T9 hormones mouse squirrel chimp gorilla ? ? ? D: data … Property induction

P(D|S): How the structure constrains the data of experience • Define a stochastic process over structure S that generates hypotheses h. • Intuitively, properties should vary smoothly over structure. Smooth: P(h) high Not smooth: P(h) low

P(D|S): How the structure constrains the data of experience S Gaussian Process (~ random walk, diffusion) [Zhu, Ghahramani & Lafferty 2003] y Threshold h

P(D|S): How the structure constrains the data of experience S Gaussian Process (~ random walk, diffusion) [Zhu, Lafferty & Ghahramani 2003] y Threshold h

Structure S Data D Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 Features 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

[c.f., Lawrence, 2004; Smola & Kondor 2003]

Structure S Data D Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 Species 7 Species 8 Species 9 Species 10 ? ? ? ? ? ? ? ? Features New property 85 features for 50 animals (Osherson et al.): e.g., for Elephant: ‘gray’, ‘hairless’, ‘toughskin’, ‘big’, ‘bulbous’, ‘longleg’, ‘tail’, ‘chewteeth’, ‘tusks’, ‘smelly’, ‘walks’, ‘slow’, ‘strong’, ‘muscle’, ‘fourlegs’,…

Cows have property P. Elephants have property P. Horses have property P. Tree 2D Gorillas have property P. Mice have property P. Seals have property P. All mammals have property P.

Inductive bias Correct bias Wrong bias No bias Too strong bias Testing different priors

A connectionist alternative(Rogers and McClelland, 2004) Species Features Emergent structure: clustering on hidden unit activation vectors

Reasoning about spatially varying properties “Native American artifacts” task

Property type “has T9 hormones” “can bite through wire” “carry E. Spirus bacteria” Theory Structure taxonomic tree directed chain directed network + diffusion process + drift process + noisy transmission Class A Class B Class C Class D Class E Class F Class G Class D Class D Class A Class A Class F Class E Class C Class C Class B Class G Class E Class B Class F Hypotheses Class G Class A Class B Class C Class D Class E Class F Class G . . . . . . . . .

Reasoning with two property types “Given that X has property P, how likely is it that Y does?” Herring Biological property Tuna Mako shark Sand shark Dolphin Human Disease property Kelp Tree Web Sand shark (Shafto, Kemp, Bonawitz, Coley & Tenenbaum) Mako shark Human Herring Tuna Kelp Dolphin

Summary so far • A framework for modeling human inductive reasoning as rational statistical inference over structured knowledge representations • Qualitatively different priors are appropriate for different domains of property induction. • In each domain, a prior that matches the world’s structure fits people’s judgments well, and better than alternative priors. • A language for representing different theories: graph structure defined over objects + probabilistic model for the distribution of properties over that graph. • Remaining question: How can we learn appropriate theories for different domains?

chimp mouse gorilla squirrel squirrel chimp gorilla mouse Hierarchical Bayesian Framework F: form Chain Tree Space mouse squirrel S: structure gorilla chimp F1 F2 F3 F4 D: data mouse squirrel chimp gorilla

Snake Turtle Crocodile Robin Ostrich Bat Orangutan Discovering structural forms Snake Turtle Bat Crocodile Robin Orangutan Ostrich Ostrich Robin Crocodile Snake Turtle Bat Orangutan

Snake Turtle Crocodile Robin Ostrich Bat Orangutan Discovering structural forms “Great chain of being” Snake Turtle Bat Crocodile Robin Plant Rock Angel Orangutan Ostrich God Linnaeus Ostrich Robin Crocodile Snake Turtle Bat Orangutan

People can discover structural forms Tree structure for biological species • Periodic structure for chemical elements • Scientific discoveries • Children’s cognitive development • Hierarchical structure of category labels • Clique structure of social groups • Cyclical structure of seasons or days of the week • Transitive structure for value “great chain of being” Systema Naturae Kingdom Animalia Phylum Chordata Class Mammalia Order Primates Family Hominidae Genus Homo Species Homo sapiens (1837) (1735) (1579)

Typical structure learning algorithms assume a fixed structural form Flat Clusters Line Circle K-Means Mixture models Competitive learning Guttman scaling Ideal point models Circumplex models Grid Tree Euclidean Space Hierarchical clustering Bayesian phylogenetics Self-Organizing Map Generative topographic mapping MDS PCA Factor Analysis

Bayesian models of human learning and reasoning Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Comp