
Sentence Processing using a Simple Recurrent Network



  1. Sentence Processing using a Simple Recurrent Network. EE 645 Final Project, Spring 2003. Dong-Wan Kang. 5/14/2003.

  2. Contents • Introduction: Motivations • Previous & Related Work: a) McClelland & Kawamoto (1986), b) Elman (1990, 1993, & 1999), c) Miikkulainen (1996) • Algorithms (Williams and Zipser, 1989): Real-Time Recurrent Learning • Simulations: Data & Encoding Schemes, Results • Discussion & Future Work

  3. Motivations • Can a neural network recognize lexical classes from sentences and learn the various types of sentences? • From a cognitive-science perspective: comparison between human language learning and neural-network learning patterns, e.g. learning the English past tense (Rumelhart & McClelland, 1986), grammaticality judgment (Allen & Seidenberg, 1999), embedded sentences (Elman, 1993; Miikkulainen, 1996; etc.)

  4. Related Works • McClelland & Kawamoto (1986) - Sentences with case-role assignments and semantic features, learned with the backpropagation algorithm - Output: 2500 case-role units for each sentence - (e.g.) input: the boy hit the wall with the ball. output: [Agent Verb Patient Instrument] + [other features] - Limitation: imposes a hard limit on the input size (a fixed number of words per sentence). - Alternative: instead of detecting input patterns displaced in space, detect patterns presented in time (sequential inputs).

  5. Related Works (continued) • Elman (1990, 1993, & 1999) - Simple Recurrent Network: a partially recurrent network using context units - A network with a dynamic memory - Context units at time t hold a copy of the hidden-unit activations from the previous time step (t-1). - The network can recognize sequences. input: Many years ago boy and girl … output: years ago boy and girl …
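The context-copy mechanism can be made concrete with a minimal NumPy sketch (a sketch only: the layer sizes are taken from the network architecture slide later in this talk, while the weight names, initialization, and sigmoid activations are assumptions for illustration, not tlearn's actual code):

```python
import numpy as np

# Minimal sketch of one forward step of an Elman SRN (illustrative only).
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 31, 150, 31                    # sizes from the architecture slide

W_ih = rng.normal(scale=0.1, size=(n_hid, n_in))    # input   -> hidden (learnable)
W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))   # context -> hidden (learnable)
W_ho = rng.normal(scale=0.1, size=(n_out, n_hid))   # hidden  -> output (learnable)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x, context):
    """One time step: the hidden layer sees the current word and the previous hidden state."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden.copy()        # new context = verbatim copy of hidden activations

context = np.zeros(n_hid)               # context starts empty at the beginning of a sequence
for x in np.eye(n_in)[:5]:              # feed a few one-hot word vectors
    prediction, context = srn_step(x, context)
```

The only difference from a plain feedforward step is the extra term from the context units, which are overwritten with the new hidden activations after every word.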

  6. Related Works (continued) • Miikkulainen (1996) - SPEC architecture (Subsymbolic Parser for Embedded Clauses), built from recurrent networks - Parser, Segmenter, and Stack: process center- and tail-embedded sentences; 98,100 sentences of 49 different sentence types, using case-role assignments - (e.g.) sequential inputs input: …, the girl, who, liked, the dog, saw, the boy, … output: …, [the girl, saw, the boy] [the girl, liked, the dog] case roles: (agent, act, patient) (agent, act, patient)

  7. Algorithms • Recurrent networks - Unlike feedforward networks, they allow connections in both directions between a pair of units, and even from a unit to itself. - Backpropagation Through Time (BPTT) unfolds the temporal operation of the network into a layered feedforward network with one layer per time step (Rumelhart et al., 1986). - Real-Time Recurrent Learning (RTRL) comes in two versions (Williams and Zipser, 1989): 1) batch: update the weights after the whole sequence has been processed; 2) on-line: update the weights while the sequence is being presented. - Simple Recurrent Network (SRN): a partially recurrent network in terms of time and space. It has context units which store the outputs of the hidden units (Elman, 1990). (It can be trained with a modified form of the RTRL algorithm.)

  8. Real Time Recurrent Learning • Williams and Zipser (1989) - This algorithm computes the derivatives of the states and outputs with respect to all weights as the network processes the sequence. • Summary of the algorithm: in a fully recurrent network any unit may be connected to any other. Writing $y_k(t)$ for the output of unit $k$, $x_i(t)$ for the external input at node $i$ at time $t$, and $z_l(t)$ for the concatenation of unit outputs and external inputs, the dynamic update rule is $$s_k(t) = \sum_{l} w_{kl}\, z_l(t), \qquad y_k(t+1) = f_k\big(s_k(t)\big),$$ where $f_k$ is the unit's squashing (sigmoid) function.

  9. RTRL (continued) • Error measure, with target outputs $d_k(t)$ defined for some units $k$ and times $t$: $$e_k(t) = \begin{cases} d_k(t) - y_k(t) & \text{if } d_k(t) \text{ is defined at time } t \\ 0 & \text{otherwise.} \end{cases}$$ • Total cost function over $t = 0, 1, \ldots, T$: $$E_{\mathrm{total}}(0, T) = \sum_{t=0}^{T} E(t), \qquad \text{where } E(t) = \tfrac{1}{2} \sum_{k} e_k(t)^2.$$

  10. RTRL (continued) • The gradient of $E_{\mathrm{total}}$ separates over time, so to do gradient descent we define the sensitivities $$p^{k}_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}}, \qquad \Delta w_{ij}(t) = \eta \sum_{k} e_k(t)\, p^{k}_{ij}(t).$$ • Differentiating the update rule gives the recursion $$p^{k}_{ij}(t+1) = f_k'\big(s_k(t)\big)\Big[\sum_{l} w_{kl}\, p^{l}_{ij}(t) + \delta_{ik}\, z_j(t)\Big],$$ with initial condition $p^{k}_{ij}(0) = 0$ at $t = 0$ (the initial state does not depend on the weights).

  11. RTRL (continued) • Depending on when the weights are updated, there are two versions of RTRL: 1) update the weights after the sequence is completed (at t = T); 2) update the weights after each time step (on-line). • Elman's "tlearn" simulator for the Simple Recurrent Network (which is used in this project) is based on the classical backpropagation algorithm together with a modified version of this RTRL algorithm.
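To make the recursion above concrete, here is a minimal NumPy sketch of the on-line version (a sketch only: the tiny network size, the variable names, and the random toy sequence are assumptions for illustration and are unrelated to tlearn's implementation):

```python
import numpy as np

# Illustrative on-line RTRL on a tiny fully recurrent network (Williams & Zipser, 1989).
rng = np.random.default_rng(0)
n_in, n_units = 3, 4                      # external inputs, fully recurrent units
n_cols = n_units + n_in                   # every unit sees all unit outputs and all inputs
eta = 0.1

W = rng.normal(scale=0.1, size=(n_units, n_cols))
p = np.zeros((n_units, n_units, n_cols))  # p[k, i, j] = d y_k / d w_ij
y = np.zeros(n_units)                     # unit outputs

def f(s):  return 1.0 / (1.0 + np.exp(-s))   # sigmoid
def df(s): return f(s) * (1.0 - f(s))        # its derivative

T = 10                                    # a random toy sequence of inputs and targets
inputs  = rng.integers(0, 2, size=(T, n_in)).astype(float)
targets = rng.integers(0, 2, size=(T, n_units)).astype(float)

for x, d in zip(inputs, targets):
    z = np.concatenate([y, x])            # z(t): unit outputs followed by external inputs
    s = W @ z
    y_new = f(s)

    # p_ij^k(t+1) = f'(s_k) [ sum_l w_kl p_ij^l(t) + delta_ik z_j(t) ]
    p_new = np.einsum('kl,lij->kij', W[:, :n_units], p)
    p_new += np.eye(n_units)[:, :, None] * z[None, None, :]
    p_new *= df(s)[:, None, None]

    e = d - y_new                         # error where targets are defined (zero elsewhere)
    W += eta * np.einsum('k,kij->ij', e, p_new)   # on-line weight update

    y, p = y_new, p_new
```

Storing the full sensitivity tensor p is what makes RTRL expensive, which is why the slide on computational cost later notes that RTRL needs more training time and resources.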

  12. Simulation • Based on Elman's data and Simple Recurrent Network (1990, 1993, & 1999), simple sentences and embedded sentences are simulated using the "tlearn" neural network program (backpropagation + a modified version of the RTRL algorithm), available at http://crl.ucsd.edu/innate/index.shtml. • Questions: 1. Can the network discover the lexical classes from word order? 2. Can the network recognize the relative pronouns and predict them?

  13. Network Architecture • 31 input nodes • 31 output nodes • 150 hidden nodes • 150 context nodes * black arrows: fully connected, learnable (distributed) weights * dotted blue arrow: linear, one-to-one copy connections from the hidden nodes to the context nodes

  14. Training Data
  Lexicon (31 words):
  NOUN-HUM: man, woman, boy, girl
  NOUN-ANIM: cat, mouse, dog, lion
  NOUN-INANIM: book, rock, car
  NOUN-AGRESS: dragon, monster
  NOUN-FRAG: glass, plate
  NOUN-FOOD: cookie, bread, sandwich
  VERB-INTRAN: think, sleep, exist
  VERB-TRAN: see, chase, like
  VERB-AGPAT: move, break
  VERB-PERCEPT: smell, see
  VERB-DESTROY: break, smash
  VERB-EAT: eat
  RELAT-HUM: who
  RELAT-INHUM: which
  Grammar (16 templates):
  NOUN-HUM VERB-EAT NOUN-FOOD
  NOUN-HUM VERB-PERCEPT NOUN-INANIM
  NOUN-HUM VERB-DESTROY NOUN-FRAG
  NOUN-HUM VERB-INTRAN
  NOUN-HUM VERB-TRAN NOUN-HUM
  NOUN-HUM VERB-AGPAT NOUN-INANIM
  NOUN-HUM VERB-AGPAT
  NOUN-ANIM VERB-EAT NOUN-FOOD
  NOUN-ANIM VERB-TRAN NOUN-ANIM
  NOUN-ANIM VERB-AGPAT NOUN-INANIM
  NOUN-ANIM VERB-AGPAT
  NOUN-INANIM VERB-AGPAT
  NOUN-AGRESS VERB-DESTROY NOUN-FRAG
  NOUN-AGRESS VERB-EAT NOUN-HUM
  NOUN-AGRESS VERB-EAT NOUN-ANIM
  NOUN-AGRESS VERB-EAT NOUN-FOOD
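A short Python sketch shows how sentences could be generated from this lexicon and these templates (hypothetical: only three templates are spelled out, and the generation procedure is an assumption about how the data were produced, following Elman's setup):

```python
import random

# Hypothetical template-based sentence generator for the training data above.
LEXICON = {
    "NOUN-HUM": ["man", "woman", "boy", "girl"],
    "NOUN-ANIM": ["cat", "mouse", "dog", "lion"],
    "NOUN-INANIM": ["book", "rock", "car"],
    "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],
    "NOUN-FOOD": ["cookie", "bread", "sandwich"],
    "VERB-INTRAN": ["think", "sleep", "exist"],
    "VERB-TRAN": ["see", "chase", "like"],
    "VERB-AGPAT": ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT": ["eat"],
}

TEMPLATES = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"],
    ["NOUN-HUM", "VERB-INTRAN"],
    # ... remaining templates from the table above
]

def generate_sentence(rng=random):
    """Pick a template, then fill each category slot with a random word from that class."""
    template = rng.choice(TEMPLATES)
    return [rng.choice(LEXICON[category]) for category in template]

print(" ".join(generate_sentence()))   # e.g. "woman eat sandwich"
```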

  15. Sample Sentences & Mapping • Simple sentences: 2 types - man think (2 words) - girl see dog (3 words) - man break glass (3 words) • Embedded sentences: 3 types (RP = relative pronoun) 1. monster eat man who sleep (RP-subject, VERB-INTRAN) 2. dog see man who eat sandwich (RP-subject, VERB-TRAN) 3. woman eat cookie which cat chase (RP-object, VERB-TRAN) • Input-output mapping (predict the next input in the sequential stream): INPUT: girl see dog man break glass cat … OUTPUT: see dog man break glass cat …
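The mapping can be stated in a few lines of Python (a sketch: the helper name next_word_pairs and the toy word stream are made up for illustration):

```python
# Next-word prediction: the target at each step is simply the following word
# in the continuous stream of concatenated sentences.
def next_word_pairs(word_stream):
    """Yield (input_word, target_word) pairs for next-word prediction."""
    for current, following in zip(word_stream, word_stream[1:]):
        yield current, following

stream = "girl see dog man break glass".split()
for x, t in next_word_pairs(stream):
    print(f"input: {x:>6}   target: {t}")
```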

  16. Encoding Scheme • Random word representation - Each of the 31 lexical items is represented by a 31-bit vector with a single, randomly assigned bit set to 1 (a one-hot code), so no two words share a bit. - This is not a semantic-feature encoding. sleep 0000000000000000000000000000001 dog 0000100000000000000000000000000 woman 0000000000000000000000000010000 …
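A minimal sketch of such an encoding (the vocabulary excerpt and the particular random bit assignment are illustrative; only the 31-bit one-hot idea comes from the slide):

```python
import numpy as np

# Randomly assign one of the 31 bit positions to each word, then encode as one-hot vectors.
rng = np.random.default_rng(0)
vocab = ["sleep", "dog", "woman", "man", "see", "which", "who"]  # ... 31 words in total
bits = rng.permutation(31)[:len(vocab)]          # one distinct random bit per word

def encode(word):
    v = np.zeros(31)
    v[bits[vocab.index(word)]] = 1.0
    return v

print(encode("dog"))
```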

  17. Training the Network • Incremental input (Elman, 1993): the "starting small" strategy • Phase I: simple sentences (Elman, 1990, used 10,000 sentences) - 1,564 sentences generated (4,636 31-bit vectors) - trained on all patterns: learning rate = 0.1, 23 epochs • Phase II: embedded sentences (Elman, 1993, used 7,500 sentences) - 5,976 sentences generated (35,688 31-bit vectors) - network loaded with the weights from Phase I - trained on all (1,564 + 5,976) sentences together: learning rate = 0.1, 4 epochs

  18. Performance • Network performance was measured by the root mean squared (RMS) error, $$\mathrm{RMS} = \sqrt{\frac{1}{P} \sum_{p=1}^{P} \lVert \mathbf{d}_p - \mathbf{y}_p \rVert^2},$$ where $P$ is the number of input patterns, $\mathbf{d}_p$ is the target output vector, and $\mathbf{y}_p$ is the actual output vector. • Phase I: after 23 epochs, RMS ≈ 0.91 • Phase II: after 4 epochs, RMS ≈ 0.84 • Why can the RMS not be lowered further? The prediction task is nondeterministic, so the network cannot produce a unique output for each input. For this simulation, RMS is NOT the best measure of performance. • Elman's simulations: RMS = 0.88 (1990), mean cosine = 0.852 (1993)
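A small NumPy sketch of this measure, using random stand-ins for the target and actual 31-bit output vectors (the exact normalization used by tlearn may differ):

```python
import numpy as np

# RMS error over P input patterns (illustrative data only).
rng = np.random.default_rng(0)
P = 100
targets = np.eye(31)[rng.integers(0, 31, size=P)]   # one-hot target output vectors
outputs = rng.random(size=(P, 31))                   # actual output vectors

rms = np.sqrt(np.mean(np.sum((targets - outputs) ** 2, axis=1)))
print(round(float(rms), 3))
```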

  19. Phase I: RMS ≈ 0.91 after 23 epochs

  20. Phase II: RMS ≈ 0.84 after 4 epochs

  21. Results and Analysis <Test results after Phase II, output vs. target (excerpt)> … which (target: which), ? (target: lion), ? (target: see), ? (target: boy), ? (target: move), ? (target: sandwich), which (target: which), ? (target: cat), ? (target: see), … ? (target: book), which (target: which), ? (target: man), see (target: see), … ? (target: dog), ? (target: chase), ? (target: man), ? (target: who), ? (target: smash), ? (target: glass), … (in the original table an arrow marked the start of each sentence; "?" marks positions with no clear prediction). In every position the word "which" is predicted correctly, but most other words, including "who", are not. Why? Since the prediction task is non-deterministic, predicting the exact next word cannot be the best performance measure. Instead, we need to look at the hidden-unit activations for each input, since they reflect what the network has learned about classes of inputs with regard to what they predict → cluster analysis, PCA.

  22. Cluster Analysis • The network successfully recognizes VERB, NOUN, and some of their subcategories. • WHO and WHICH end up at different distances in the hierarchy. • VERB-INTRAN fails to cluster with the other VERB categories. <Hierarchical cluster diagram of hidden-unit activation vectors>
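The cluster diagram can be produced along these lines (a hypothetical SciPy sketch: the word list and the random activations are stand-ins for the 150-unit hidden vectors actually recorded during testing):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Average the hidden-unit activation vectors recorded for each word, then cluster the averages.
rng = np.random.default_rng(0)
words = ["man", "woman", "dog", "cat", "eat", "see", "who", "which"]
hidden = {w: rng.random((20, 150)) for w in words}   # stand-in activations per occurrence

mean_vectors = np.array([hidden[w].mean(axis=0) for w in words])
Z = linkage(mean_vectors, method="average", metric="euclidean")
dendrogram(Z, labels=words, no_plot=True)            # set no_plot=False to draw the diagram
```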

  23. Discussion & Conclusion 1. The network can discover lexical classes from word order. NOUN and VERB form distinct classes, with the exception of VERB-INTRAN. The subclasses of NOUN are classified correctly, but some subclasses of VERB are mixed; this reflects the input examples. 2. The network can recognize and predict the relative pronoun "which", but not "who". Why? Because the sentences containing "who" are not RP-object constructions, so "who" is treated as just another subject, as in the simple sentences. 3. The organization of the input data matters a great deal to a recurrent network, since it processes the input sequentially and on-line. 4. In general, recurrent networks trained with RTRL can recognize sequential input, but they require more training time and computational resources.

  24. Future Studies • Recurrent Least Squares Support Vector Machines (Suykens & Vandewalle, 2000) - provide new perspectives for time-series prediction and nonlinear modeling - appear to be more efficient than BPTT, RTRL, and the SRN

  25. References
  Allen, J., & Seidenberg, M. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), The emergence of language (pp. 115-151). Hillsdale, NJ: Lawrence Erlbaum.
  Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
  Elman, J.L. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48, 71-99.
  Elman, J.L. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.), The emergence of language. Hillsdale, NJ: Lawrence Erlbaum.
  Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
  McClelland, J.L., & Kawamoto, A.H. (1986). Mechanisms of sentence processing: Assigning roles to constituents of sentences. In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 273-325). Cambridge, MA: MIT Press.
  Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47-73.
  Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1 (pp. 318-362). Cambridge, MA: MIT Press.
  Rumelhart, D.E., & McClelland, J.L. (1986). On learning the past tense of English verbs. In J.L. McClelland & D.E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.
  Williams, R.J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.
