
Presentation Transcript


  1. The Hierarchical Prediction Network: towards a neural theory of grammar acquisition. Gideon Borensztajn, Willem Zuidema, Rens Bod. Institute for Logic, Language and Computation (ILLC), University of Amsterdam. Potsdam, May 2009

  2. A neural theory of language processing? Questions that linguists never dare to ask (at the risk of being called silly) • How is a grammar represented in the brain? • How are syntactic categories and rewrite rules (or the equivalent thereof) instantiated in the brain? • How is a rewrite rule selected and accessed in the brain? • How are the symbolic syntactic variables acquired, and where does their global scope derive from? • How can a parse be processed in a neural manner, through local interactions alone? • How can the brain represent parse trees of unbounded depth within limited space? Yet, these questions make sense! This work explores the possibility of a neural instantiation of grammar.

  3. Outline of the talk • What is common between language and visual processing? • The Memory Prediction Framework and its generalization to language • A new perspective on binding and substitution • The Hierarchical Prediction Network • Architecture and definitions • Processing, solution for substitution, parsing • Simulation of CFG and PCFG with HPN • Learning, topology induction • Experiments • Discussion

  4. Fragment based hierarchical object recognition • Shimon Ullman, Rumelhart Prize talk, Cognitive Science Conference 2008 • Visual processing for linguists • Decomposition of the object into informative sub-components results in a hierarchical object representation based on fragments • Visual object recognition involves construction of parse trees!

  5. Abstraction based on common context “If two fragments are interchangeable within a common context, they are likely to be semantically equivalent.” (Ullman, TiCS 2006) Familiar? In grammar induction, the context of a word is used to determine its syntactic category (merging) → invariance and generalization → parse unseen objects

  6. Top-down bottom-up interaction Object classification is an interaction between • Bottom-up grouping process of image regions based on perceptual similarity • Top-down segmentation process based on a class membership hypothesis to predict which areas belong to the figure • Is the top-down vs. bottom-up processing issue also present in language? → parsing strategies Is there a uniform cortical algorithm that underlies both visual and linguistic processing?

  7. The Memory Prediction Framework • The MPF is a neurally motivated algorithm for pattern recognition by the neocortex (Hawkins, 2004) • Cortical categories (columns) represent temporal sequences of patterns • Categories become progressively temporally compressed and invariant as one moves up in the cortical hierarchy • Hierarchical temporal compression allows top-down prediction and anticipation of future events

  8. The bold hypothesis Considerations of analogy • If visual categories are represented by neural assemblies, then why not syntactic categories? • Cortical processing in MPF is analogous to syntagmatic and paradigmatic processing in syntax (chunk and merge). • The nodes in the MPF encode temporal sequences, and when they `unfold’ they predict future events; syntactic constituents encode `chunks’ and predict future words. Hypothesis • There exist cell assemblies in the language area of the cortex that function as neural correlates of (graded) syntactic constituents. These represent temporally compressed, and invariant sequences of words. • Constituents are organized in a topologically structured hierarchical network, which constitutes a grammar.

  9. Challenges for connectionist models of language • Fodor & Pylyshyn (1988): “Connectionist models cannot represent systematic relations because they don’t have `variables’, and therefore they are unsuited for a productive symbol manipulation system like language.” • Systematic relations between categories are problematic for fully distributed neural networks, because the distributed representations don’t behave as variables (Marcus, 2001) • Productivity: How do we produce sentences that we have never seen before? How do we process an infinite number of novel configurations from stored elements? • How can a network with limited physical space represent unbounded recursion?

  10. Dynamic versus static binding • Traditional connectionist models can only process patterns that are hardwired in static bindings, and interpolations between them • In order to process novel, unseen configurations, dynamic bindings are required that can cross the existing connectivity pattern (e.g., a blue banana). Productivity in language also requires flexible bindings • Dynamic binding is the counterpart of substitution in symbolic parsing • Dynamic binding solves recursion [Figures: shape and colour nodes illustrating a static binding (banana, yellow) versus a dynamic binding (blue banana); schematic diagram of cortical processing according to the MPF]

  11. The Hierarchical Prediction Network [Figure: HPN architecture with an input node layer (nodes w1..w8) and a compressor node layer (nodes X and Y); each compressor node has ordered slots (sx1..sx3, sy1..sy3) and fires if its slots are activated in a specific sequence; node representations live in a 6D substitution space]

  12. HPN architecture • Input nodes represent elementary symbols (words). • Compressor nodes temporally `compress’ (integrate) a sequence of two or more nodes in a production • Production: unit consisting of a compressor node with two or more ordered slots • Slot: physical substitution site where nodes are coupled (bound) to each other • Slots span a basis for substitution space • Nodes develop internal representations with respect to substitution space: their position in substitution space determines in which slots they fit. • Match between node and slot representation determines whether binding will take place
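To make these definitions concrete, here is a minimal sketch in Python (not from the slides; the class names and the use of plain NumPy vectors are illustrative assumptions) of input nodes, compressor nodes with ordered slots, and the node-slot match as an inner product:

```python
import numpy as np

class InputNode:
    """Elementary symbol (word) with a representation in substitution space."""
    def __init__(self, label, representation):
        self.label = label
        self.rep = np.asarray(representation, dtype=float)

class CompressorNode:
    """A production: a compressor node with two or more ordered slots.

    Slot representations span a basis for substitution space; the node
    'fires' only if its slots are activated in the given order."""
    def __init__(self, label, representation, slot_representations):
        self.label = label
        self.rep = np.asarray(representation, dtype=float)
        self.slots = [np.asarray(s, dtype=float) for s in slot_representations]

def match(node, slot_rep):
    """Match between a node and a slot: the inner product of their
    representations decides whether a binding can take place."""
    return float(np.dot(node.rep, np.asarray(slot_rep, dtype=float)))
```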

  13. Substitution and substitutability • Substitutability is given as topological distance (inner product) between node and/or slot representations in substitution space. • Substitution: a node’s internal representation plus a temporal indicator are transmitted to the bound slot (its `context’ is passed along) • Path connectors (pointers from slots to bound nodes) store the activation path through the network • Serial binding is operationalized by temporary storage of contextual information in the path connectors • The possibility of substitution greatly enhances the expressive power of neural networks

  14. Syntactic categories in HPN • Regions in substitution space define a continuum of categories • A node’s representation in substitution space defines its graded membership to one or more categories. • Convex regions of input nodes correspond to part of speech tags, and neighboring compressor nodes correspond to higher order syntactic categories, such as NP.

  15. HPN processing • Top-down process: runs a production from a compressor node by serially activating its slots one after the other • Bottom-up process: an input token triggers an input node, and one or more compressor nodes, during a parallel feed-forward sweep • Top-down bottom-up interaction: Top-down and bottom-up processes meet at the slots. A top-down production can only proceed when the involved slots are bottom-up activated in the correct order. • The process strongly resembles left-corner parsing → it is possible to implement a syntactic parser in HPN

  16. Illustration of HPN processing [Figure: two successive stages of HPN processing, (1) and (2)]

  17. Parsing with HPN • A parse in HPN is a connected trajectory through the nodes of the physical network that binds a set of productions together through path connectors. • A parse is successful if it fulfills certain trivial conditions • A derivation is an ordered sequence of time indexed productions (states) and bindings that fully determines the network trajectory upon a successful parse. • A stack in HPN is an ordered sequence of path connectors, which bind the nodes involved in the parse. • After a successful parse one can reconstruct the parse tree by following the path connectors back

  18. Some example parses [Figure: example HPN parses of the bracketed structures (( a b c ) d e ), ( a b ( c d e )) and ( a b (( c d e ) f g ))] Any kind of branching structure can be represented in HPN

  19. An example derivation in HPN: derivation of the sentence “Sue eats a tomato”
HPN productions: X1 → S1 S2; Y2 → S3 S4; Z3 → S5 S6
Bindings: Y2 → S2; Z3 → S4; Sue1 → S1; eats2 → S3; a3 → S5; tomato4 → S6
Input nodes: Sue, eats, a, tomato
Input and compressor nodes involved in the derivation are time indexed, because they can be used multiple times
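For illustration only (not code from the talk), the productions and bindings listed above already determine the parse tree; following the bindings downward from the root production rebuilds it:

```python
# Productions (compressor node -> its ordered slots) and bindings
# (slot -> the node bound into it), copied from the slide.
productions = {"X1": ["S1", "S2"], "Y2": ["S3", "S4"], "Z3": ["S5", "S6"]}
bindings = {"S1": "Sue1", "S2": "Y2", "S3": "eats2",
            "S4": "Z3", "S5": "a3", "S6": "tomato4"}

def expand(node):
    """Follow the path connectors (bindings) downward to rebuild the parse."""
    if node in productions:
        children = " ".join(expand(bindings[s]) for s in productions[node])
        return "(" + node + " " + children + ")"
    return node

print(expand("X1"))   # (X1 Sue1 (Y2 eats2 (Z3 a3 tomato4)))
```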

  20. HPN grammar for aⁿbⁿ
CFG grammar: S → A S B; S → A B; A → a; B → b
HPN grammar: X1 → S1 S2 S3; X2 → S4 S5
Slot fillers: S1 ← a; S2 ← X1, X2; S3 ← b; S4 ← a; S5 ← b
Slot representations: S1 = 10000, S2 = 01000, S3 = 00100, S4 = 00010, S5 = 00001
Node representations: a = 10010, b = 00101, X1 = 01000, X2 = 01000
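Using the vectors listed above, which node fits which slot can be read off from inner products; a small check (illustrative, assuming the match is the plain dot product):

```python
import numpy as np

slots = {"S1": [1,0,0,0,0], "S2": [0,1,0,0,0], "S3": [0,0,1,0,0],
         "S4": [0,0,0,1,0], "S5": [0,0,0,0,1]}
nodes = {"a": [1,0,0,1,0], "b": [0,0,1,0,1],
         "X1": [0,1,0,0,0], "X2": [0,1,0,0,0]}

for name, rep in nodes.items():
    fits = [s for s, srep in slots.items() if np.dot(rep, srep) > 0]
    print(name, "fits", fits)

# a fits S1 and S4, b fits S3 and S5, X1 and X2 both fit S2:
# the production X1 -> S1 S2 S3 can therefore wrap an a, a recursive
# X1/X2, and a b, which is exactly the a^n b^n pattern.
```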

  21. Context free grammars are subsumed in HPN A conversion procedure exists between CFG and HPN grammar (illustration follows) • Create HPN productions for every non-unary rule expansion with orthogonal slots • For every non-unary production in the CFG, compute the representations of the right hand side non-terminals from the slot vector representations • For every unary production in the CFG, compute the representation of the right hand side by copying the left hand side representation. If one defines the match between a node and a slot as the inner product of their representations, then HPN parses exactly those sentences that are parsed by the corresponding CFG grammar
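A sketch of that conversion procedure (my reading of the three steps above, not the authors' code; helper names are made up, slot vectors are assumed to be standard basis vectors, and unary cycles are assumed absent):

```python
import numpy as np

def cfg_to_hpn(rules):
    """rules: list of (lhs, rhs) with rhs a tuple of symbols.
    One orthogonal slot per right-hand-side position of each non-unary rule;
    a symbol's representation is the sum of the slot vectors that expect it,
    plus whatever it inherits through unary rules (e.g. N gets NP's vector)."""
    non_unary = [(lhs, rhs) for lhs, rhs in rules if len(rhs) > 1]
    dim = sum(len(rhs) for _, rhs in non_unary)

    slots_expecting = {}                  # symbol -> indices of slots expecting it
    i = 0
    for lhs, rhs in non_unary:
        for sym in rhs:
            slots_expecting.setdefault(sym, []).append(i)
            i += 1

    unary_parents = {}                    # symbol -> left hand sides of unary rules yielding it
    for lhs, rhs in rules:
        if len(rhs) == 1:
            unary_parents.setdefault(rhs[0], []).append(lhs)

    memo = {}
    def rep(sym):
        if sym not in memo:
            v = np.zeros(dim)
            for idx in slots_expecting.get(sym, []):
                v[idx] += 1.0
            for parent in unary_parents.get(sym, []):
                v = v + rep(parent)
            memo[sym] = v
        return memo[sym]

    return {sym: rep(sym) for _, rhs in rules for sym in rhs}
```

With the non-unary rules listed in the order of slide 22, this reproduces representations such as NP = S1 + S6 + S8 and who = S7 + S10 from the next slide.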

  22. Example (P)CFG with recursive relative clauses
Rewrite rules (with probabilities P):
S → NP VP (1.0)
NP → PropN (0.2); NP → N RC (0.3); NP → N (0.5)
VP → VI (0.4); VP → VT NP (0.6)
RC → WHO NP VT (0.1); RC → WHO VP (0.9)
VI → walks (0.5); VI → lives (0.5)
VT → chases (0.8); VT → feeds (0.2)
N → boy (0.6); N → girl (0.4)
PropN → John (0.5); PropN → Mary (0.5)
WHO → who (1.0)

  23. Conversion CFG2HPN
Slots are orthogonal: S1 = 10000000000, S2 = 01000000000, S3 = 00100000000, etc.
[Figure: compressor nodes S (slots S1, S2 expecting NP, VP), NP1 (slots S3, S4 expecting N, RC), VP1 (slots S5, S6 expecting VT, NP), RC1 (slots S7, S8, S9 expecting WHO, NP, VT) and RC2 (slots S10, S11 expecting WHO, VP), with intermediate nodes NP2, NP3 and VP2]
NP1 = NP2 = NP3 ← S1 + S6 + S8 = 10000101000; VP1 = VP2 ← S2 + S11 = 01000000001; N ← S3 + NP; RC1 = RC2 ← S4 = 00010000000; VT ← S5 + S9; VI ← VP; PropN ← NP; who ← WHO ← S7 + S10; boy = girl ← N; John = Mary ← PropN; walks = lives ← VI; feeds = chases ← VT

  24. Conversion of PCFG to probabilistic HPN
Multiply the representation of the left hand side by the probability of expansion in the PCFG.
Compressor nodes: NP1 = (0.3 0 0 0 0 0.3 0 0.3 0 0 0); VP1 = (0 0.6 0 0 0 0 0 0 0 0 0.6); RC1 = (0 0 0 0.1 0 0 0 0 0 0 0); RC2 = (0 0 0 0.9 0 0 0 0 0 0 0)
Intermediate representations: NP2 = (0.2 0 0 0 0 0.2 0 0.2 0 0 0); NP3 = (0.5 0 0 0 0 0.5 0 0.5 0 0 0); VP2 = (0 0.4 0 0 0 0 0 0 0 0 0.4); N = (0.5 0 0 1 0 0.5 0 0.5 0 0 0); PropN = NP2 = (0.2 0 0 0 0 0.2 0 0.2 0 0 0); VT = (0 0 0 0 1 0 0 0 1 0 0); VI = VP2 = (0 0.4 0 0 0 0 0 0 0 0 0.4); WHO = who = (0 0 0 0 0 0 1 0 0 1 0)
Input nodes: John = Mary = (0.1 0 0 0 0 0.1 0 0.1 0 0 0); lives = walks = (0 0.2 0 0 0 0 0 0 0 0 0.2); boy = (0.3 0 0 0.6 0 0.3 0 0.3 0 0 0); girl = (0.2 0 0 0.4 0 0.2 0 0.2 0 0 0); chases = (0 0 0 0 0.8 0 0 0 0.8 0 0); feeds = (0 0 0 0 0.2 0 0 0 0.2 0 0)
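A quick numerical check of the weighting rule (illustrative only; the slot numbering follows slide 23):

```python
import numpy as np

def slot(i, dim=11):
    """Orthogonal slot vector S_i (1-indexed, as on slide 23)."""
    v = np.zeros(dim)
    v[i - 1] = 1.0
    return v

NP_cat = slot(1) + slot(6) + slot(8)   # slots that expect an NP
VP_cat = slot(2) + slot(11)            # slots that expect a VP

NP1   = 0.3 * NP_cat                   # NP -> N RC    (P = 0.3)
VP1   = 0.6 * VP_cat                   # VP -> VT NP   (P = 0.6)
PropN = 0.2 * NP_cat                   # NP -> PropN   (P = 0.2)
John  = 0.5 * PropN                    # PropN -> John (P = 0.5)

# NP1 has 0.3 at positions 1, 6 and 8, and John has 0.1 there,
# matching the vectors listed on this slide.
```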

  25. PCFG versus HPN [Figure: side-by-side PCFG and HPN parses of “boy who lives chases Mary”, annotated with rule probabilities and binding probabilities] Probability of a binding is the inner product between node and slot representation. Probability of a parse is the product of its binding probabilities.
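The two statements above amount to a very small computation; a hedged sketch (function and variable names are illustrative):

```python
import numpy as np

def parse_probability(bindings):
    """bindings: the (node_representation, slot_representation) pairs used
    in a parse.  Each binding contributes the inner product of node and
    slot; the probability of the parse is the product of these terms."""
    p = 1.0
    for node_rep, slot_rep in bindings:
        p *= float(np.dot(np.asarray(node_rep, float), np.asarray(slot_rep, float)))
    return p
```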

  26. Learning in HPN • Initialization: • Create input nodes with random representations for every distinct word in the corpus • Specify how many productions of different sizes are created • For each production, create a compressor node with a random representation and orthogonal slot representations • Parsing: • For every sentence, let HPN compute the most probable parse • Recover the bindings involved, using the path connectors • Learning: • For every binding, adjust the representation of the winning node n that participated in slot s according to Δn = λ·s • Adjust the representations of the nodes in the neighborhood h of the winning nodes, in proportion to their distance in substitution space • With every sentence, decrease λ and shrink the neighborhood
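A sketch of one learning step as described above (the exponential neighborhood kernel and all names are my assumptions; the slide only specifies Δn = λ·s for the winner plus a distance-dependent update of its neighbors):

```python
import numpy as np

def learning_step(node_reps, winner, slot_rep, lam, neighborhood):
    """node_reps: dict name -> representation vector (updated in place).
    Move the winning node toward the slot it bound to (delta = lam * s),
    and move its neighbors by an amount that decays with their distance
    to the winner in substitution space (self-organising-map style)."""
    slot_rep = np.asarray(slot_rep, dtype=float)
    winner_rep = node_reps[winner].copy()
    for name, rep in node_reps.items():
        if name == winner:
            weight = 1.0
        else:
            d = np.linalg.norm(rep - winner_rep)     # distance to the winner
            weight = np.exp(-d / max(neighborhood, 1e-9))
        node_reps[name] = rep + weight * lam * slot_rep
    return node_reps

# After each sentence, lam is decreased and the neighborhood is shrunk,
# as stated on the slide (e.g. lam from 0.3 to 0.05 in the experiment of slide 28).
```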

  27. From concrete to abstract • Categories start out as individual nodes (representing a single concrete exemplar), and more abstract syntactic categories develop with time as the node representations learn their place in the topology • Two lexical nodes that often participate in the same slot(s) will gradually be merged into a single part-of-speech category [Figure: the nodes for “cat” and “dog”, both occurring in the context “feed the ...”, merge into a single category filling the same slot of X1] • HPN may serve as a computational model for Usage Based Grammar (Tomasello, 2003)

  28. Experiment with artificial language • 1000 sentences generated with a recursive artificial grammar • 10 productions with 2 slots, 5 with 3 slots • All nodes initialized with random values • λi = 0.3, λf = 0.05 • hi = 10, hf = 0.01 • 800 training sentences with brackets • 200 test sentences without brackets • Evaluation: UP = UR = 0.864

  29. Experiment with Eve corpus from CHILDES • 2000 sentences from the second half of the Eve corpus • Brackets are available from CHILDES, but of poor quality • Binarized • Initialize HPN with 120 productions with 2 slots • No neighborhood function used • Visualization with a Kohonen map • Reasonable clustering of part-of-speech categories

  30. Conclusion: HPN’s proposed solutions for the questions in the introduction Questions that linguists never dare to ask (at the risk of being called silly) • How is a grammar represented in the brain? • How are syntactic categories and rewrite rules (or the equivalent thereof) instantiated in the brain? • How is a rewrite rule selected and accessed in the brain? • How are the symbolic syntactic variables acquired, and where does their global scope derive from? • How can a parse be processed in a neural manner, through local interactions alone? • How can the brain represent parse trees of unbounded depth within limited space?

  31. Questions and discussion Open questions: • Does HPN make any non-standard behavioral predictions, and how can these be tested? • How can some of the theoretical assumptions be tested experimentally (compressor nodes, path connectors)? • Is the proposed solution for substitution truly connectionist, and what is the definition of connectionism anyway? • How does HPN compare to other proposals for structure encoding and recursion (RAAM, SPEC, OT)? • How does HPN compare to Elman’s SRN in its treatment of time and recursion? • Can HPN represent richer languages than context-free?

  32. References • Borensztajn, G., Zuidema, W., & Bod, R. (2009). The hierarchical prediction network: towards a neural theory of grammar acquisition. Proceedings of the 31st Annual Meeting of the Cognitive Science Society. • Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3-71. • Hawkins, J., & Blakeslee, S. (2004). On intelligence. New York: Henry Holt and Company. • Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press. • Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press. • Ullman, S. (2007). Object recognition and segmentation by a fragment-based hierarchy. Trends in Cognitive Sciences, 11(2), 58-64. My e-mail: gideonbor@gmail.com My homepage: staff.science.uva.nl/~gideon

  33. Other connectionist models of language processing A) Fully distributed networks: no categories, no (phrase) structure • Simple Recurrent Network (SRN) (Elman, 1990) • Long short-term memory (Hochreiter & Schmidhuber, 1997) • Fractal encoding neural networks (Tabor, 2000): the stack state is encoded as a location in a fractal space, and implemented as hidden units in a hand-wired neural network. B) Networks that can represent/compute compositional structure • Recursive Auto-Associative Memory (RAAM) (Pollack, 1988) • SPEC (combination of SRN with RAAM) (Miikkulainen, 1996) • Optimality/Harmony Theory (Prince & Smolensky, 1997): hardcodes structure and variables, and is in fact symbolic. C) Symbolic parsers on top of connectionist hardware • Temporal Synchrony Variable Binding (Henderson, 1998)

  34. [Figure: step-by-step illustration of the derivation of “Sue eats a tomato”] HPN productions: X → S1 S2; Y → S3 S4; Z → S5 S6. Bindings: Y → S2; Z → S4; Sue → S1; eats → S3; a → S5; tomato → S6. Input nodes: Sue, eats, a, tomato
