
  1. A neural network that learns graded syntactic categories and rules. Gideon Borensztajn, joint work with Willem Zuidema and Rens Bod. Institute for Logic, Language and Computation (ILLC), University of Amsterdam. Cortona, September 2009

  2. Motivating questions
  • How are syntactic categories and systematic “grammar rules” physically instantiated in a brain-like architecture?
  • How do syntactic categories acquire their abstract status (i.e., how do they extend their “scope” during a process of generalization)?
  • How is a parse executed within a neural architecture (i.e., what is the neural mechanism behind unification, or binding)?
  • How can the brain represent hierarchical constituent structure (i.e., parse trees) of unbounded depth?
  This work explores the possibility of a neural instantiation of grammar, and proposes a neural network solution for the acquisition and representation of a “syntactic category space”.

  3. A proposal for a neural theory of grammar
  • The language area of the cortex contains local neuronal assemblies that function as neural correlates of graded syntactic categories.
  • These assemblies represent temporally compressed and invariant word sequences (constituents) (cf. the Memory Prediction Framework; Hawkins, 2004).
  • The topological and hierarchical arrangement of the syntactic assemblies in the cortical hierarchy constitutes a grammar.
  • Assemblies can be dynamically and serially bound to each other, enabling (phrasal) substitution (and recursion).
  How can this be modelled within the constraints of a neural network?

  4. Problems of traditional connectionist models for language processing
  • Simple Recurrent Networks (SRNs) (e.g., Elman, 1991) do not employ a notion of categories. (But remember: to cognize is to categorize.)
  • SRNs ignore the psychological reality of phrasal grouping.
  • But children’s generalizations suggest they substitute entire phrases: “I catch the fly.” / “I catch a big turtle.”
  • Such generalizations presuppose a representation involving phrasal syntactic categories, and a neural mechanism for substitution. Our model, the “Hierarchical Prediction Network” (HPN), shows that constituent structure and substitution are NOT at odds with connectionist modeling.
  • In contrast to symbolic grammars (which assume innate categories), a connectionist grammar must explain where syntactic categories come from.
  • We need to assume that category membership is graded and dynamic (prototypical categories).

  5. Syntactic categories are regions in a continuous “substitution space” in HPN
  [Figure: nodes X1, X2, X3 and words (jump, behind, icecream, horse) positioned among category regions (VP/verb, NP/noun, PP/prep) in substitution space; the network learns complex units.]
  A node’s position in substitution space defines its graded membership of one or more syntactic categories in a continuous space. Substitutability is given by the topological distance in substitution space.
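  A minimal Python sketch of the distance-based notion of substitutability described on this slide. The Euclidean metric, the exponential kernel and the example coordinates are illustrative assumptions on my part; the slide only states that substitutability is given by topological distance in substitution space.

      import numpy as np

      def substitutability(node_a, node_b, temperature=1.0):
          # Graded substitutability: 1.0 for identical positions, decaying with distance.
          distance = np.linalg.norm(node_a - node_b)
          return np.exp(-distance / temperature)

      # Hypothetical positions of lexical nodes in a 2-D substitution space.
      jump     = np.array([0.9, 0.1])   # verb-like region
      walks    = np.array([0.8, 0.2])   # verb-like region
      icecream = np.array([0.1, 0.9])   # noun-like region

      print(substitutability(jump, walks))     # high: nearly interchangeable
      print(substitutability(jump, icecream))  # low: different category regions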

  6. The Hierarchical Prediction Network
  [Figure: an HPN network. A compressor node (X1) consists of a root and ordered slots (slot1, slot2, slot3) and performs temporal integration: it fires after all its slots have been activated in a specific order. The probability of binding a node to a slot is a function of the distance between their representations in “substitution space”. At the bottom are lexical input nodes (w1 … w8).]

  7. HPN definitions
  • Input nodes represent elementary symbols (words).
  • Compressor nodes temporally ‘compress’ (integrate) a sequence of two or more nodes.
  • Slots function as physical substitution sites where nodes are attached (bound) to each other.
  • Slot vectors form a basis for substitution space.
  • When learning, nodes develop internal representations with respect to this basis: their position in substitution space determines in which slots they fit.
  • The distance between node and slot representations determines the probability of a binding, and is a measure of substitutability.
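  To make these definitions concrete, here is a schematic Python sketch of the HPN building blocks. It assumes node and slot representations are real-valued vectors and that the binding probability is a normalised, distance-based score over competing nodes; the class names, fields and the softmax-style normalisation are my assumptions, not the original implementation.

      import numpy as np
      from dataclasses import dataclass, field

      DIM = 8  # dimensionality of substitution space (illustrative)

      def random_rep():
          return np.random.randn(DIM)

      @dataclass
      class Node:
          name: str
          rep: np.ndarray = field(default_factory=random_rep)  # position in substitution space

      @dataclass
      class Slot:
          rep: np.ndarray = field(default_factory=random_rep)  # basis vector of substitution space
          pointer: object = None  # short-term memory: which (node, state) currently fills this slot

      @dataclass
      class CompressorNode(Node):
          # Ordered slots; the node 'fires' once all slots are bound in the right order.
          slots: list = field(default_factory=list)

      def binding_prob(node, slot, candidates, temperature=1.0):
          # Probability of binding `node` to `slot`, relative to competing candidate nodes.
          def score(n):
              return np.exp(-np.linalg.norm(n.rep - slot.rep) / temperature)
          return score(node) / sum(score(c) for c in candidates)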

  8. Parsing with HPN
  A derivation in HPN is a connected trajectory through the nodes of the physical network that binds a set of nodes together through pointers stored in the slots.
  [Figure: an HPN grammar with compressor nodes X, Y, Z and numbered slots, and the bindings that derive the sentence “Sue eats a tomato”, yielding the bracketing (Sue (eats (a tomato))).]
  A node’s state is characterized by a start index j, an active slot i and a slot index n. This is needed because a single node can be used multiple times in a derivation.
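  As a toy illustration of such a derivation, the sketch below records each binding as a pointer from a slot back to the (node, state) that filled it, and reads the bracketed constituent structure off the resulting trajectory. The NodeState fields mirror the state indices mentioned on the slide; the node names X, Y, Z and all function names are hypothetical.

      from collections import namedtuple

      # start = start index j, active_slot = i, usage = slot index n (as on the slide).
      NodeState = namedtuple("NodeState", ["node", "start", "active_slot", "usage"])

      def read_off_tree(state, bindings, words):
          # Follow the stored pointers; lexical input nodes simply yield their word.
          if state.node not in bindings:
              return words[state.start]
          children = bindings[state.node]
          return "(" + " ".join(read_off_tree(c, bindings, words) for c in children) + ")"

      # Derivation of "Sue eats a tomato" with hypothetical compressor nodes X, Y, Z.
      words = ["Sue", "eats", "a", "tomato"]
      bindings = {
          "X": [NodeState("Sue", 0, 1, 0), NodeState("Y", 1, 2, 0)],
          "Y": [NodeState("eats", 1, 1, 0), NodeState("Z", 2, 2, 0)],
          "Z": [NodeState("a", 2, 1, 0), NodeState("tomato", 3, 2, 0)],
      }
      print(read_off_tree(NodeState("X", 0, 0, 0), bindings, words))  # (Sue (eats (a tomato)))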

  9. Substitution and binding
  • Whereas in a symbolic (CFG) grammar substitution is trivially allowed between categories of the same label, in a neural network a mechanistic substitution operation must be defined between connected nodes.
  • A node binds to a slot by transmitting its identity (plus a state index) to the slot, which subsequently stores this as a pointer. (This requires an extra set of weights on the slots for short-term memory.)
  • The stored pointers to bound nodes connect a trajectory through the network, from which a parse tree can be constructed (see figure).
  • All nodes that a slot can bind to are substitutable for each other in the HPN grammar. Replacing the pointer in a slot amounts to substituting a constituent in the sentence.

  10. Unsupervised learning with HPN
  • All root and slot representations are randomly initialized. A training corpus is presented. After selection of the best parse of a sentence:
  • For every binding, move the representations of the winning node n and the bound slot s closer to each other, according to Δn = λ·s and Δs = λ·n.
  • Adjust the representations of the nodes in the neighborhood h of the winning node, in proportion to their distance in substitution space.
  • As the topology self-organizes, substitutability relations are learned, and hence a grammar.
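  The sketch below spells out this update step in Python. The winner/slot update follows the rule as stated on the slide (Δn = λ·s, Δs = λ·n); the Gaussian neighbourhood kernel is my assumption, since the slide only says that neighbouring nodes are adjusted in proportion to their distance in substitution space.

      import numpy as np

      def update_after_binding(node_reps, slot_rep, winner, lam=0.05, sigma=1.0):
          # One update step after a binding in the selected best parse (in place).
          n = node_reps[winner].copy()
          # Winning node and bound slot move toward each other (rule as on the slide).
          node_reps[winner] = node_reps[winner] + lam * slot_rep
          slot_rep += lam * n
          # Neighbourhood update: nodes near the winner in substitution space are
          # dragged along, with a strength that decays with distance (assumed Gaussian).
          for name, rep in node_reps.items():
              if name == winner:
                  continue
              d = np.linalg.norm(rep - node_reps[winner])
              h = np.exp(-d ** 2 / (2 * sigma ** 2))
              node_reps[name] = rep + lam * h * slot_rep
          return node_reps, slot_rep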

  11. Generalization: from concrete to abstract
  • Initially nodes are uncorrelated. Abstract and systematic syntactic categories develop as the topology self-organizes and comes to reflect the corpus distribution.
  • The slots mediate generalization across nodes. If two nodes often occur in the same contexts (hence bind to the same slots), their representations become more similar via a process of contamination through the slot.
  [Figure: the words “dog” and “cat” bind to the same slots of a compressor node X1 in contexts with “the” and “feed”, so their representations are drawn together.]
  • HPN may serve as a computational model for Usage-Based Grammar (Tomasello, 2003).

  12. Experiments
  The network learns linguistically plausible topologies from artificial and realistic corpora.
  • 1000 sentences generated with a recursive artificial grammar
  • 10 productions with 2 slots, 5 with 3 slots
  • All nodes receive random initial values
  CFG rewrite rules (with probabilities):
  S → NP VP (1.0)
  NP → PropN (0.2) || N (0.5)
  NP → N RC (0.3)
  VP → VI (0.4) || VT NP (0.6)
  RC → WHO NP VT (0.1)
  VI → walks (0.5) || lives (0.5)
  VT → chases (0.8) || feeds (0.2)
  N → boy (0.6) || girl (0.4)
  PropN → John (0.5) || Mary (0.5)
  WHO → who (1.0)
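  To make the training setup concrete, here is a small Python generator for the artificial grammar listed above. The probabilities are taken from the slide and renormalised per left-hand side when sampling (my assumption, since the RC rule is listed with weight 0.1 as its only expansion).

      import random

      GRAMMAR = {
          "S":     [(["NP", "VP"], 1.0)],
          "NP":    [(["PropN"], 0.2), (["N"], 0.5), (["N", "RC"], 0.3)],
          "VP":    [(["VI"], 0.4), (["VT", "NP"], 0.6)],
          "RC":    [(["WHO", "NP", "VT"], 0.1)],
          "VI":    [(["walks"], 0.5), (["lives"], 0.5)],
          "VT":    [(["chases"], 0.8), (["feeds"], 0.2)],
          "N":     [(["boy"], 0.6), (["girl"], 0.4)],
          "PropN": [(["John"], 0.5), (["Mary"], 0.5)],
          "WHO":   [(["who"], 1.0)],
      }

      def generate(symbol="S"):
          # Symbols without an entry in GRAMMAR are terminal words.
          if symbol not in GRAMMAR:
              return [symbol]
          expansions, weights = zip(*GRAMMAR[symbol])
          rhs = random.choices(expansions, weights=weights)[0]
          return [word for sym in rhs for word in generate(sym)]

      # e.g. a corpus of 1000 sentences, as in the experiment:
      corpus = [" ".join(generate()) for _ in range(1000)]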

  13. Experiment with the Eve corpus from CHILDES
  • 2000 sentences from the second half of the Eve corpus
  • Bracketing is available from CHILDES, but of poor quality; the trees were binarized
  • HPN initialized with 120 productions with 2 slots
  • Result: reasonable clustering of part-of-speech categories

  14. Context-free grammars are subsumed by HPN
  A conversion procedure can be defined between (p)CFG and (p)HPN grammars.
  • HPN finds the same parses as the corresponding CFG grammar, with the same probabilities. The converse does not hold, since the expressive power of HPN is richer than that of CFG (details forthcoming).
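  The published conversion procedure is not given here (“details forthcoming”), but the following rough Python sketch conveys the intended direction: each nonterminal is assigned a point in substitution space, and each rewrite rule becomes a compressor node whose root sits at the left-hand-side point and whose slots sit at the points of the right-hand-side symbols, so that exactly the CFG-licensed bindings come out at distance zero. This is my reconstruction, not the authors’ procedure.

      import numpy as np

      def cfg_to_hpn(rules, nonterminals):
          # rules: list of (lhs, rhs) pairs, e.g. ("S", ("NP", "VP")).
          dim = len(nonterminals)
          # One point per nonterminal (assumption: orthogonal, one-hot category points).
          point = {nt: np.eye(dim)[i] for i, nt in enumerate(nonterminals)}
          compressors = []
          for lhs, rhs in rules:
              compressors.append({
                  "rule": (lhs, rhs),
                  "root": point[lhs],                                   # where the node's root lives
                  "slots": [point[sym] for sym in rhs if sym in point]  # one slot per nonterminal child
              })
          return compressors

      # Example: the S -> NP VP rule of the artificial grammar on slide 12.
      hpn = cfg_to_hpn([("S", ("NP", "VP"))], ["S", "NP", "VP"])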

  15. Conclusions
  • The HPN solution for binding and substitution greatly enhances the expressive power of neural networks, and allows explicit representation of constituent structure.
  • Continuous category representations enable incremental learning of syntactic categories, from concrete to abstract.
  • HPN offers a novel perspective on grammar acquisition as a self-organizing process of topology construction.
  • HPN offers a contribution to the systematicity debate (Fodor & Pylyshyn, 1988): by virtue of the node substitution operation, slots act as placeholders that bind nodes in their topological vicinity, and behave as variables within a “production” of a compressor node.
  • Possible extensions to other cognitive domains, and to construction grammars.

  16. Thank you! References:
  • Borensztajn, G., Zuidema, W., & Bod, R. (2009). The hierarchical prediction network: towards a neural theory of grammar acquisition. Proceedings of the 31st Annual Meeting of the Cognitive Science Society.
  • Borensztajn, G., Zuidema, W., & Bod, R. (2009). Children’s grammars grow more abstract with age: Evidence from an automatic procedure for identifying the productive units of language. Topics in Cognitive Science, 175-188.
  • Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning.
  • Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 3-71.
  • Hawkins, J., & Blakeslee, S. (2004). On Intelligence. New York: Henry Holt and Company.
  • Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.
  My homepage: staff.science.uva.nl/~gideon
