Systematicity in sentence processing by recurrent neural networks

Presentation Transcript


  1. Systematicity in sentence processing by recurrent neural networks. Stefan Frank, Nijmegen Institute for Cognition and Information, Radboud University Nijmegen, The Netherlands

  2. “Please make it heavy on computers and AI and light on the psycho stuff” (Konstantopoulos, personal communication, December 23, 2005)

  3. Systematicity in language Imagine you meet someone who only knows two sentences of English: Could you please tell me where the toilet is? I can’t find my hotel. So (s)he does not know: Could you please tell me where my hotel is? I can’t find the toilet. This person has no knowledge of English but simply memorized some lines from a phrase book.

  4. Systematicity in language • Human language behavior is (more or less) systematic: if you know some sentences, you know many. • Sentences are not atomic but made up of words. • Likewise, words can be made up of morphemes (e.g., un + clear = unclear, un + stable = unstable, …). • It seems like language results from applying a set of rules (grammar, morphology) to symbols (words, morphemes).

  5. Systematicity in language • The Classical symbol system hypothesis: the mind contains word-like symbols that are manipulated by structure-sensitive processes (Fodor & Pylyshyn, 1988). E.g., for dealing with language: • boy and girl are nouns (N) • loves and sees are verbs (V) • N V N is a possible sentence structure • This hypothesis explains the systematicity found in language: if you know the N V N structure, you know all N V N sentences (boy sees girl, girl loves boy, boy sees boy, …)

  6. Some issues for the Classical theory • Lack of systematic behavior: Why are people often so unsystematic in practice? The boy plays. OK The boy who the girl likes plays. OK The boy who the girl who the man sees likes plays. OK? The athlete who the coach who the sponsor hired trained won. OK!

  7. Some issues for the Classical theory • Lack of systematic behavior: Why are people often so unsystematic in practice? • Lack of systematicity in language: Why are there exceptions to rules? help + full = helpful help + less = helpless meaning + full = meaningful meaning + less = meaningless beauty + full = beautiful beauty + less = ugly

  8. Some issues for the Classical theory • Lack of systematic behavior: Why are people often so unsystematic in practice? • Lack of systematicity in language: Why are there exceptions to rules? • Development: How do children learn the rules from what they hear? The Classical theory has answers to these questions, but no explanations.

  9. Connectionism • The “state of mind” is represented as a pattern of activity over a large number of simple, quantitative (i.e., non-logical) processing units (“neurons”). • These units are connected by weighted links, forming a (neural) network through which activation moves around. • The connection weights are adjusted to the network’s input and task. • The network develops its own internal representation of the input. • It should generalize to new (test) inputs.
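
As a minimal sketch of the kind of unit described here (assuming a standard logistic activation; the layer sizes and weights below are illustrative, not taken from any model in the talk):

```python
import numpy as np

def sigmoid(x):
    # Logistic activation: a simple, quantitative (non-logical) unit response.
    return 1.0 / (1.0 + np.exp(-x))

# One layer's activity: each unit sums its weighted inputs and squashes the result.
rng = np.random.default_rng(0)
inputs = rng.random(4)             # activation pattern over 4 "input" units
weights = rng.normal(size=(3, 4))  # weighted links from the 4 units to 3 units
activity = sigmoid(weights @ inputs)
print(activity)                    # the next layer's activation pattern
```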

  10. Connectionism and the Classical issues • Lack of systematic behavior: Systematicity is built on top of an unsystematic architecture. • Lack of systematicity in language: “Beautiless” is expected statistically but never occurs, so the network learns it doesn’t exist. • Development: The network adapts to its input. But can neural networks explain systematicity, or even behave systematically?

  11. Connectionism and systematicity • Fodor & Pylyshyn (1988): Neural networks cannot be systematic. They only learn to associate examples rather than becoming sensitive to structure. • Systematicity: knowing X → knowing Y. Generalization: training on X → learning Y. So, systematicity equals generalization (Hadley, 1994). • Demonstrations of connectionist systematicity • require many training examples but test on only a few • are not robust: oversensitive to training details • only display weak systematicity: words occur in the same ‘syntactic positions’ in training and test sentences

  12. Simple Recurrent Networks (Elman, 1990) Feedforward networks have long-term memory (LTM) but no short-term memory (STM). So how can they process sequential input, like the words of a sentence? A common SRN task is next-word prediction: the words of a sentence form the input sequence; after each word, the output should be the next word. [Figure: network diagram with input layer, hidden layer, and output layer]

  13. SRNs and systematicity: Van der Velde et al. (2004) An SRN processed a minilanguage with • 18 words (boy, girl, loves, sees, who, “.”, …) • 3 sentence types: • N V N . (boy sees girl.) • N V N who V N . (boy sees girl who loves boy.) • N who N V V N . (boy who girl sees loves boy.) • Nouns and verbs were divided into four groups, each with two nouns and two verbs. • In training sentences, nouns and verbs were from the same group: fewer than 0.44% of possible sentences were used for training. • In test sentences, nouns and verbs came from different groups. Note: this tests weak systematicity only.
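
As a sketch of that group-based train/test split: the slides name only some of the 18 words, so the group assignments below (beyond boy, girl, loves, sees) are illustrative guesses, and only the N V N . sentence type is generated:

```python
import itertools

# Hypothetical group assignment: the slides specify four groups of two nouns
# and two verbs each (16 content words + "who" + "." = 18 words), but only
# boy, girl, loves, and sees are named, so the remaining words are guesses.
GROUPS = [
    (["boy", "girl"], ["loves", "sees"]),
    (["man", "woman"], ["likes", "hears"]),
    (["dog", "cat"], ["chases", "bites"]),
    (["king", "queen"], ["helps", "calls"]),
]

def nvn_sentences(nouns, verbs):
    """All 'N V N .' sentences over the given noun and verb sets."""
    return [f"{n1} {v} {n2} ."
            for n1, v, n2 in itertools.product(nouns, verbs, nouns)]

# Training: nouns and verbs drawn from the same group.
train = [s for nouns, verbs in GROUPS for s in nvn_sentences(nouns, verbs)]

# Test: nouns from one group, verbs from a different group.
test = [s
        for (nouns, _), (_, verbs) in itertools.permutations(GROUPS, 2)
        for s in nvn_sentences(nouns, verbs)]

print(len(train), len(test))  # 32 same-group vs. 96 cross-group sentences
```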

  14. SRNs and systematicity: Van der Velde et al. (2004) • SRNs “fail” on test sentences, so • They do not generalize to structurally similar sentences • They cannot learn systematic behavior from a small training set • They do not form good models of human language behavior • But • what does it mean to “fail”? Maybe the network did better than a completely non-systematic one? • was the size of the network appropriate? larger network → more STM → better processing? smaller network → less LTM → better generalization? • was the language complex enough? With more different words there is more reason to abstract to syntactic types (nouns, verbs)

  15. SRNs and systematicity: replication of Van der Velde et al. (2004) • What if a network does not generalize at all? When given a new sentence, it can only use the last word, because combining words requires generalization. • This hypothetical, unsystematic network serves as the baseline for rating SRN performance. • Performance +1: the network never makes ungrammatical predictions • Performance 0: the network does not generalize at all, but gives the best possible output based on the last word • Performance –1: the network only makes ungrammatical predictions • Positive performance indicates systematicity
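
One way to compute such a score, as a sketch only (the exact formula used in the replication is not given on the slides): measure how much output activation falls on grammatical next words, relative to the last-word baseline.

```python
import numpy as np

def performance(output, baseline, grammatical):
    """Sketch of the -1..+1 performance scale, not the paper's exact formula.

    output, baseline: activation over the w word units (assumed to sum to 1);
    grammatical: boolean mask of words that may grammatically come next."""
    net_ok = output[grammatical].sum()     # mass the network puts on legal words
    base_ok = baseline[grammatical].sum()  # mass the baseline puts on legal words
    if net_ok >= base_ok:                  # at least as good as no generalization
        denom = 1.0 - base_ok
        return 0.0 if denom == 0 else (net_ok - base_ok) / denom  # 0 .. +1
    return (net_ok - base_ok) / base_ok                           # -1 .. 0
```

The score is +1 when all activation is on grammatical words, 0 when it matches the baseline, and –1 when none of it is grammatical.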

  16. Network architecture [Figure: input layer: w = 18 units (one for each word); recurrent hidden layer: n = 20 units; hidden layer: 10 units; output layer: w = 18 units (one for each word)]
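
Putting slides 12 and 16 together, a minimal untrained SRN with these layer sizes might look as follows; the logistic activations and weight initialization are assumptions, not details of the original implementation:

```python
import numpy as np

# Layer sizes from the slide: word, recurrent, and hidden units.
w, n, h = 18, 20, 10
rng = np.random.default_rng(1)
W_in  = rng.normal(scale=0.1, size=(n, w))   # input layer -> recurrent layer
W_rec = rng.normal(scale=0.1, size=(n, n))   # recurrent -> recurrent (the STM)
W_hid = rng.normal(scale=0.1, size=(h, n))   # recurrent layer -> hidden layer
W_out = rng.normal(scale=0.1, size=(w, h))   # hidden layer -> output (next word)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(word_indices):
    """Feed a sentence through the SRN; return one prediction per word."""
    state = np.zeros(n)                      # recurrent layer starts at rest
    predictions = []
    for i in word_indices:
        x = np.zeros(w)
        x[i] = 1.0                           # one unit per word (one-hot input)
        state = sigmoid(W_in @ x + W_rec @ state)
        hidden = sigmoid(W_hid @ state)
        predictions.append(sigmoid(W_out @ hidden))
    return predictions

print(forward([0, 2, 1])[-1].shape)          # hypothetical indices; (18,) output
```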

  17. SRN Results Positive performance at each word of each test sentence type, so there is some systematicity.

  18. SRN Results: effect of recurrent layer size [Plots for the three sentence types: N V N; N V N who V N; N who N V V N] Larger networks (n = 40) do better, but very large ones (n = 100) overfit.

  19. SRN performance and memory • SRNs do show systematicity to some extent. • But their performance is limited: • small n → limited processing capacity (STM) • large n → large LTM → overfitting. • How to combine large STM with small LTM?

  20. Echo State Networks: Jaeger (2003) • Keep the connections to and within the recurrent layer fixed at random values. • The recurrent layer becomes a “dynamical reservoir”: a non-specific STM for the input sequence. • Some constraints on the dynamical reservoir: • large enough • sparsely connected (here: 15%) • weight matrix has spectral radius < 1 • LTM capacity: • In SRNs: O(n²) • In ESNs: O(n) • So, can ESNs combine large STM with small LTM?
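
A sketch of how such a reservoir can be constructed under those constraints; the reservoir size and the exact target radius below are illustrative assumptions:

```python
import numpy as np

# Dynamical reservoir meeting the slide's constraints: sparsely connected
# (15%) and spectral radius below 1. The size n = 100 and target radius 0.9
# are illustrative choices, not values from the talk.
def make_reservoir(n, density=0.15, radius=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    W *= rng.random((n, n)) < density         # keep ~15% of the connections
    rho = np.abs(np.linalg.eigvals(W)).max()  # current spectral radius
    return W * (radius / rho)                 # rescale so the radius is < 1

W_res = make_reservoir(100)
# These weights stay fixed; only the readout side is trained, which is why
# trained LTM grows as O(n) in an ESN instead of the SRN's O(n^2).
print(np.abs(np.linalg.eigvals(W_res)).max())  # ~0.9
```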

  21. Network architecture [Figure: same layers as before (input: w = 18 units; recurrent hidden: n = 20 units; hidden: 10 units; output: w = 18 units), with the connections into and within the recurrent layer untrained and all other connections trained] The STM remains untrained, but the network does develop internal representations.

  22. ESN Results Positive performance at each word of each test sentence type, so there is some systematicity, but less than in an SRN of the same size.

  23. ESN Results: effect of recurrent layer size [Plots for the three sentence types: N V N; N V N who V N; N who N V V N] Bigger is better: no overfitting even when n = 1530!

  24. ESN Results: effect of lexicon size (n = 100) [Plots for the three sentence types: N V N; N V N who V N; N who N V V N] Note: with larger w, a smaller percentage of possible sentences is used for training.

  25. Strong systematicity • 30 words (boy(s), girl(s), like(s), see(s), who, …) • Many sentence types: • N V N . (girl sees boys.) • N V N who V N . (girl sees boys who like boy.) • N who N V V N . (girl who boy sees likes boy.) • N who V N who N V . (girls who like boys see boys who girl likes.) • Unlimited recursion (girls see boy who sees boy who sees man who …) • Number agreement between nouns and verbs

  26. Strong systematicity • In training sentences: females as grammatical subjects, males as grammatical objects (girl sees boy) • In test sentences: vice versa (boy sees girl) • Positive performance on all words of four test sentence types: • N who V N V N . (boy who likes girls sees woman.) • N V N who V N . (boy likes girls who see woman.) • N who N V V N . (boys who man likes see girl.) • N V N who N V . (boys like girl who man sees.)

  27. Conclusions • ESNs can display both weak and strong systematicity • Even with few training sentences and many test sentences • By doing less training, the network can learn more: • Training fewer connections gives better results • Training a smaller part of possible sentences gives better results • Can connectionism explain systematicity? • No, because neural networks do not need to be systematic • Yes, because they need to adapt to systematicity in the training input. • The source of systematicity is not the cognitive system, but the external world.
