
The Harmonic Mind


Presentation Transcript


  1. The Harmonic Mind Paul Smolensky Cognitive Science Department Johns Hopkins University with: Géraldine Legendre Donald Mathis Melanie Soderstrom Alan Prince Peter Jusczyk†

  2. Advertisement The Harmonic Mind: From neural computation to optimality-theoretic grammar Paul Smolensky & Géraldine Legendre • Blackwell 2002 (??) • Develop the Integrated Connectionist/Symbolic (ICS) Cognitive Architecture • Apply it to the theory of grammar • Present a case study in formalist multidisciplinary cognitive science; show inputs/outputs of ICS

  3. Talk Outline ‘Sketch’ the ICS cognitive architecture, pointing to contributions from/to traditional disciplines • Connectionist processing as optimization • Symbolic representations as activation patterns • Knowledge representation: Constraints • Constraint interaction I: Harmonic Grammar, Parser • Explaining ‘productivity’ in ICS (Fodor et al. ‘88 et seq.) • Constraint interaction II: Optimality Theory (‘OT’) • Nativism I: Learnability theory in OT • Nativism II: Experimental test • Nativism III: UGenome

  4. Processing I: Activation • Computational neuroscience → ICS • Key sources: • Hopfield 1982, 1984 • Cohen and Grossberg 1983 • Hinton and Sejnowski 1983, 1986 • Smolensky 1983, 1986 • Geman and Geman 1984 • Golden 1986, 1988

  5. Processing I: Activation Processing — spreading activation — is optimization: Harmony maximization [Figure: two units a1, a2 receiving external inputs i1 (0.6) and i2 (0.5), linked by a mutual inhibitory connection of weight –λ (–0.9)]

  6. Processing II: Optimization • Cognitive psychology → ICS • Key sources: • Hinton & Anderson 1981 • Rumelhart, McClelland, & the PDP Group 1986 Processing — spreading activation — is optimization: Harmony maximization [same network figure as slide 5]

  7. Processing II: Optimization Processing — spreading activation — is optimization: Harmony maximization. Harmony maximization is satisfaction of parallel, violable constraints: • a1 must be active (strength: 0.6) • a2 must be active (strength: 0.5) • a1 and a2 must not be simultaneously active (strength: λ = 0.9) • Optimal compromise: a1 = 0.79, a2 = –0.21 [network figure as before; see the sketch below]
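A minimal numerical sketch of this network (Python/NumPy). The quadratic Harmony function with a self-decay term is an assumption chosen to match the figure; with it, gradient ascent reproduces the slide's optimal compromise:

```python
import numpy as np

# Two units a1, a2 with external inputs i = (0.6, 0.5) and a mutual
# inhibitory connection of weight -0.9 (the figure's -lambda).
i = np.array([0.6, 0.5])
W = np.array([[0.0, -0.9],
              [-0.9, 0.0]])

# Assumed Harmony: H(a) = i.a + (1/2) a.W.a - (1/2)||a||^2;
# spreading activation = gradient ascent on H.
a = np.zeros(2)
for _ in range(2000):
    a += 0.05 * (i + W @ a - a)   # grad H = i + W a - a

print(np.round(a, 2))  # [ 0.79 -0.21] -- the slide's "optimal compromise"
```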

  8. Two Fundamental Questions Harmony maximization is satisfaction of parallel, violable constraints. 1. Prior question: What are the activation patterns — data structures — mental representations — evaluated by these constraints? 2. What are the constraints? (Knowledge representation)

  9. Representation • Symbolic theory → ICS • Complex symbol structures • Generative linguistics → ICS • Particular linguistic representations • PDP connectionism → ICS • Distributed activation patterns • ICS: realization of (higher-level) complex symbolic structures in distributed patterns of activation over (lower-level) units (‘tensor product representations’ etc.)

  10. Representation [Figure: the syllable structure [σ k [æ t]] for ‘cat’, realized as an activation pattern through the filler/role bindings σ/rε, k/r0, æ/r01, t/r11]
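A small sketch of the tensor-product construction (Python/NumPy). The particular filler and role vectors, and the convention that a position string such as ‘01’ is the Kronecker product of its step roles, are illustrative assumptions:

```python
import numpy as np

# Fillers (one-hot for clarity) and orthonormal role vectors for left/right child
fillers = {'k':  np.array([1., 0., 0.]),
           'ae': np.array([0., 1., 0.]),
           't':  np.array([0., 0., 1.])}
r0, r1 = np.array([1., 0.]), np.array([0., 1.])

# Recursive roles: r01 = "left child of a right child", r11 = "right of right"
r01, r11 = np.kron(r0, r1), np.kron(r1, r1)

# Bind fillers to roles with outer products; sum the bindings at each tree depth
depth1 = np.outer(fillers['k'], r0)                                  # k/r0
depth2 = np.outer(fillers['ae'], r01) + np.outer(fillers['t'], r11)  # ae/r01 + t/r11

# Unbinding: contracting with a role vector recovers its filler exactly,
# because the roles are orthonormal
print(depth2 @ r01)  # [0. 1. 0.] = the filler 'ae'
```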

  11. Constraints • Linguistics (markedness theory) → ICS • ICS → Generative linguistics: Optimality Theory • Key sources: • Prince & Smolensky 1993 [ms.; Rutgers TR] • McCarthy & Prince 1993 [ms.] • Texts: Archangeli & Langendoen 1997, Kager 1999, McCarthy 2001 • Electronic archive: http://roa.rutgers.edu

  12. Constraints NOCODA: A syllable has no coda. The parse a[σ k [æ t]] of ‘cat’ has a coda t, so it incurs a violation *: H(a[σ k [æ t]]) = –sNOCODA < 0 [figure: syllable tree for ‘cat’ with the coda violation marked]

  13. Constraint Interaction I • ICS → Grammatical theory • Harmonic Grammar • Legendre, Miyata, Smolensky 1990 et seq.

  14. Constraint Interaction I The grammar generates the representation that maximizes H: this best-satisfies the constraints, given their differential strengths. Any formal language can be so generated. [Figure: Harmony contributions to the parse of ‘cat’: H(k, σ) > 0 from ONSET (onset k), H(σ, t) < 0 from NOCODA (coda t)]
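A toy rendering of “the grammar maximizes H” as weighted constraint satisfaction (Python). The strengths are hypothetical, and the deletion-penalizing constraint, here called PARSE in anticipation of slide 41, is an assumption:

```python
# Hypothetical constraint strengths (Harmonic Grammar weights)
strength = {'PARSE': 2.0, 'NOCODA': 1.0}

# Candidate parses of /kaet/ 'cat' with their constraint violation counts
candidates = {
    '.kaet.': {'NOCODA': 1},   # faithful, but the syllable has a coda
    '.kae.':  {'PARSE': 1},    # codaless, but deletes /t/
}

def H(violations):
    """Harmony = negative strength-weighted sum of violations."""
    return -sum(strength[c] * n for c, n in violations.items())

best = max(candidates, key=lambda c: H(candidates[c]))
print(best, H(candidates[best]))  # '.kaet.' -1.0: best-satisfies the constraints
```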

  15. Harmonic Grammar Parsing • Simple, comprehensible network • Simple grammar G: X → A B, Y → B A • Language Parsing [Figure: top-down and bottom-up parses of the trees X → A B and Y → B A]

  16. Simple Network Parser • Fully self-connected, symmetric network with weight matrix W • Like the previously shown network, except with 12 units; representations and connections shown in the accompanying figure

  17. Explaining Productivity • Approaching full-scale parsing of formal languages by neural-network Harmony maximization • Have other networks that provably compute recursive functions ⇒ productive competence • How to explain?

  18. 1. Structured representations

  19. + 2. Structured connections

  20. = Proof of Productivity • Productive behavior follows mathematically from combining • the combinatorial structure of the vectorial representations encoding inputs & outputs and • the combinatorial structure of the weight matrices encoding knowledge
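A sketch of what combining the two kinds of combinatorial structure buys (Python/NumPy). The “extract left child” operator built as identity-over-fillers ⊗ role-extractor is the standard tensor-product construction; the dimensions here are illustrative:

```python
import numpy as np

r0, r1 = np.array([1., 0.]), np.array([0., 1.])             # left/right roles
f_A, f_B = np.array([1., 0., 0.]), np.array([0., 1., 0.])   # filler vectors

# Structured representation: psi = A/r0 + B/r1, flattened to an activation vector
psi = (np.outer(f_A, r0) + np.outer(f_B, r1)).flatten()

# Structured weight matrix: "extract left child" = I_filler (x) r0^T
W = np.kron(np.eye(3), r0.reshape(1, 2))

# The same W works for ANY fillers in ANY such structure: productivity follows
# from the combinatorial structure of representations and weights together
print(W @ psi)  # [1. 0. 0.] = f_A
```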

  21. Constraint Interaction II: OT • ICS → Grammatical theory • Optimality Theory • Prince & Smolensky 1993

  22. Constraint Interaction II: OT • Differential strength encoded in strict domination hierarchies: • Every constraint has complete priority over all lower-ranked constraints (combined) • Approximate numerical encoding employs special (exponentially growing) weights • “Grammars can’t count” — question period
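A sketch of the exponential-weight point (Python): with weights growing faster than any violation count, numerical Harmony comparison reproduces strict domination. Base 3 is an arbitrary choice, valid while every violation count stays below it:

```python
# Violation profiles over constraints ranked high -> low
a = (0, 2, 2)   # two violations each of the two lowest constraints
b = (1, 0, 0)   # one violation of the highest constraint

# OT: compare profiles lexicographically; b is worse despite fewer total marks
ot_prefers_a = a < b

# HG with exponentially growing weights (powers of a base > any violation count)
weights = [9, 3, 1]
def H(profile):
    return -sum(w * v for w, v in zip(weights, profile))
hg_prefers_a = H(a) > H(b)          # -8 > -9

print(ot_prefers_a, hg_prefers_a)   # True True: the two orders agree
```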

  23. Constraint Interaction II: OT • Constraints are universal (Con) • Candidate outputs are universal (Gen) • Human grammars differ only in how these constraints are ranked: ‘factorial typology’ • First true contender for a formal theory of cross-linguistic typology • 1st innovation of OT: constraint ranking • 2nd innovation: ‘Faithfulness’

  24. The Faithfulness / Markedness Dialectic • ‘cat’: /kæt/ → [kæt] violates NOCODA — why? • FAITHFULNESS requires pronunciation = lexical form • MARKEDNESS often opposes it • The Markedness/Faithfulness dialectic ⇒ diversity (see the sketch below) • English: FAITH ≫ NOCODA • Polynesian: NOCODA ≫ FAITH (~French) • Another markedness constraint M: Nasal Place Agreement [‘Assimilation’] (NPA): velar ŋg ≻ ŋb, ŋd; coronal nd ≻ md, ŋd; labial mb ≻ nb, ŋb
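The two rankings of this mini-system, evaluated OT-style (Python sketch; candidates and violation marks are hand-entered for illustration):

```python
candidates = {
    '.kaet.': {'NOCODA': 1},   # faithful, keeps the coda
    '.kae.':  {'FAITH': 1},    # drops the coda
}

def winner(ranking):
    """OT evaluation: compare violation profiles lexicographically by rank."""
    profile = lambda c: tuple(candidates[c].get(k, 0) for k in ranking)
    return min(candidates, key=profile)

print(winner(['FAITH', 'NOCODA']))  # '.kaet.' -- the English-type ranking
print(winner(['NOCODA', 'FAITH']))  # '.kae.'  -- the Polynesian-type ranking
```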

  25. Optimality Theory • Diversity of contributions to theoretical linguistics • Phonology • Syntax • Semantics • Here: New connections between linguistic theory & the cognitive science of language more generally • Learning • Neuro-genetic encoding

  26. Nativism I: Learnability • Learning algorithm • Provably correct and efficient (under strong assumptions) • Sources: • Tesar 1995 et seq. • Tesar & Smolensky 1993, …, 2000 • If you hear A when you expected to hear E, increase the Harmony of A above that of E by minimally demoting each constraint violated by A below a constraint violated by E

  27. Constraint Demotion Learning If you hear A when you expected to hear E, increase the Harmony of A above that of E by minimally demoting each constraint violated by A below a constraint violated by E. Correctly handles a difficult case: multiple violations in E
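A minimal sketch of one demotion step over a stratified hierarchy, following the recipe above (Python; the data layout is my own):

```python
def demote(strata, winner_viols, loser_viols):
    """One Constraint Demotion step (after Tesar & Smolensky).
    strata: list of sets of constraint names, highest-ranked first.
    winner = the form actually heard (A); loser = the learner's output (E)."""
    prefer_loser  = {c for c in winner_viols if winner_viols[c] > loser_viols.get(c, 0)}
    prefer_winner = {c for c in loser_viols if loser_viols[c] > winner_viols.get(c, 0)}
    # highest stratum containing a constraint the winner satisfies better
    top = min(i for i, s in enumerate(strata) if s & prefer_winner)
    if top + 1 == len(strata):
        strata.append(set())
    # minimally demote winner-violating constraints to just below that stratum
    for stratum in strata[:top + 1]:
        moved = stratum & prefer_loser
        stratum -= moved
        strata[top + 1] |= moved
    return strata

# Hearing coda-ful A = '.kaet.' when expecting E = '.kae.' demotes NOCODA below FAITH
print(demote([{'NOCODA'}, {'FAITH'}], {'NOCODA': 1}, {'FAITH': 1}))
# [set(), {'FAITH'}, {'NOCODA'}]
```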

  28. Nativism I: Learnability • M ≫ F is learnable with /in+possible/ → impossible • ‘not’ = in- except when followed by … • “the exception that proves the rule”; M = NPA • M ≫ F is not learnable from data if there are no ‘exceptions’ (alternations) of this sort — e.g., if the lexicon supplies only inputs with mp, never np: then M and F never conflict, and there is no evidence for their ranking • Thus M ≫ F must hold in the initial state, ℌ0

  29. Nativism II: Experimental Test • Collaborators • Peter Jusczyk • Theresa Allocco • (Elliott Moreton, Karen Arnold)

  30. Nativism II: Experimental Test • Linking hypothesis: More harmonic phonological stimuli ⇒ longer listening time • More harmonic: • M ≻ *M, when equal on F • F ≻ *F, when equal on M • When one must be chosen over the other, it is more harmonic to satisfy M: M ≫ F • M = Nasal Place Assimilation (NPA)

  31.–34. 4.5 Months (NPA) [four figure slides: experimental results]

  35. Nativism III: UGenome • Can we combine the connectionist realization of Harmonic Grammar with OT’s characterization of UG, to examine the biological plausibility of UG as innate knowledge? • Collaborators • Melanie Soderstrom • Donald Mathis

  36. Nativism III: UGenome • The game: take a first shot at a concrete example of a genetic encoding of UG in a Language Acquisition Device — no commitment to its (in)correctness • Introduce an ‘abstract genome’ notion parallel to (and encoding) ‘abstract neural network’ • Is connectionist empiricism clearly more biologically plausible than symbolic nativism? No!

  37. The Problem • No concrete examples of such a LAD exist • Even highly simplified cases pose a hard problem: How can genes — which regulate production of proteins — encode symbolic principles of grammar? • Test preparation: Syllable Theory

  38. Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: /dɪš+s/ → [.dɪ.šɪz.] • /CVCC/ → [.CV.CVC.] (second V epenthetic)

  39. Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: /dɪš+s/ → [.dɪ.šɪz.] • /CVCC/ → [.CV.CVC.] • Basic CV Syllable Structure Theory • Prince & Smolensky 1993: Chapter 6 • ‘Basic’ — No more than one segment per syllable position: .(C)V(C).

  40. Basic syllabification: Function • ƒ: /underlying form/ → [surface form] • Plural form of dish: /dɪš+s/ → [.dɪ.šɪz.] • /CVCC/ → [.CV.CVC.] • Basic CV Syllable Structure Theory • Correspondence Theory • McCarthy & Prince 1995 (‘M&P’) • /C1V2C3C4/ → [.C1V2.C3VC4.] (the epenthetic V has no input correspondent)

  41. Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output — “no deletion” [M&P 95: ‘MAX’]

  42. Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILL V/C: Every output V/C segment corresponds to an input V/C segment [every syllable position in the output is filled by an input segment] — “no insertion/epenthesis” [M&P 95: ‘DEP’]

  43. Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILL V/C: Every output V/C segment corresponds to an input V/C segment • ONSET: No V without a preceding C

  44. Syllabification: Constraints (Con) • PARSE: Every element in the input corresponds to an element in the output • FILL V/C: Every output V/C segment corresponds to an input V/C segment • ONSET: No V without a preceding C • NOCODA: No C without a following V
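Putting the four constraints to work on /CVCC/ (Python sketch). The candidate set is a hand-picked fragment of Gen, and the ranking shown is just one of the factorial possibilities:

```python
ranking = ['PARSE', 'ONSET', 'NOCODA', 'FILL']   # one ranking; others = other languages

# A few of Gen's candidates for /CVCC/, with violation marks ('_' = epenthetic V)
candidates = {
    '.CV.C_C.': {'FILL': 1, 'NOCODA': 1},   # epenthesis, final coda
    '.CVC.':    {'PARSE': 1, 'NOCODA': 1},  # final C unparsed
    '.CV.':     {'PARSE': 2},               # both final Cs unparsed
}

def profile(c):
    """Violation profile in ranking order; lexicographic min = OT winner."""
    return tuple(candidates[c].get(k, 0) for k in ranking)

print(min(candidates, key=profile))  # '.CV.C_C.' -- i.e. /CVCC/ -> [.CV.CVC.]
```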

  45. Network Architecture • /C1C2/ → [.C1VC2.] [Figure: network units for C and V positions realizing this mapping]

  46. Connection substructure [Figure: each weight factors into a local part — fixed, genetically determined: the content of constraint i — and a global part — variable during learning: the strength si of constraint i] • Network weight: W = Σi si Wi • Network input: ι = W a
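A toy version of this factorization (Python/NumPy). The 2×2 constraint matrices are placeholders; only the strength-times-content decomposition comes from the slide:

```python
import numpy as np

# Fixed, "genetically determined" constraint contents -- toy values
W_parse = np.array([[0.,  2.], [ 2., 0.]])
W_onset = np.array([[0., -1.], [-1., 0.]])

# Learned, global constraint strengths
s = {'PARSE': 2.0, 'ONSET': 1.0}

W = s['PARSE'] * W_parse + s['ONSET'] * W_onset   # W = sum_i s_i W_i
a = np.array([1.0, 0.0])
iota = W @ a                                      # network input: iota = W a
print(iota)
```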

  47. C 1 1 1 1 V 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 PARSE • All connection coefficients are +2

  48. ONSET [Figure: ONSET connection pattern over the C/V network units] • All connection coefficients are 1

  49. Network Dynamics • Activation: stochastic, binary Boltzmann Machine/Harmony Theory dynamics (T → 0); maximizes Harmony • Learning: gradient descent in error: during the processing of training data in phase P, whenever unit φ (of type Φ) and unit ψ (of type Ψ) are simultaneously active, modify si by ε
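A sketch of these dynamics (Python/NumPy). The logistic flip rule is standard Boltzmann-machine dynamics; the annealing schedule and toy weights are placeholder assumptions:

```python
import numpy as np
rng = np.random.default_rng(0)

def settle(W, a, T_schedule):
    """Stochastic binary dynamics: flip units with logistic probability;
    as T -> 0 the network settles into a Harmony maximum."""
    for T in T_schedule:
        k = rng.integers(len(a))
        iota = W[k] @ a                          # net input to unit k
        a[k] = float(rng.random() < 1 / (1 + np.exp(-iota / T)))
    return a

W = np.array([[0., 1.], [1., 0.]])               # toy symmetric weights
a = settle(W, np.array([1., 0.]), np.geomspace(1.0, 0.01, 500))
print(a)

# Learning (sketch): a strength s_i moves by +eps when units phi and psi
# co-fire in the teacher-clamped phase, and by -eps when they co-fire
# in the free-running phase -- gradient descent in error.
```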

  50. Crucial Open Question (Truth in Advertising) • Relation between strict domination and neural networks? • Apparently not a problem in the case of the CV Theory
