
Neural Networks



Presentation Transcript


  1. Neural Networks

  2. Functions
     Input       Output
     4, 4        8
     2, 3        5
     1, 9        10
     6, 7        13
     341, 257    598

  3. Functions
     Input    Output
     rock     rock
     sing     sing
     alqz     alqz
     dark     dark
     lamb     lamb

  4. Functions
     Input    Output
     0 0      0
     1 0      0
     0 1      0
     1 1      1

  5. Functions
     Input    Output
     look     looked
     rake     raked
     sing     sang
     go       went
     want     wanted

  6. Functions
     Input                          Output
     John left                      1
     Wallace fed Gromit             1
     Fed Wallace Gromit             0
     Who do you like Mary and?      0

  7. Learning Functions • In training, the network is shown examples of what the function generates, and has to figure out what the function is. • Think of language/grammar as a very big function (or set of functions). The learning task is similar: the learner is presented with examples of what the function generates, and has to figure out what the system is. • Main question in language acquisition: what does the learner need to know in order to successfully figure out what this function is? • Questions about Neural Networks • How can a network represent a function? • How can the network discover what this function is?
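A minimal sketch, not part of the original slides, of the learning setup just described: the learner is given only input-output pairs generated by y = 2x (the example that appears on slide 19) and has to recover the function, then apply it to an input it never saw. The training set, the single weight w, the constant 0.01, and the number of passes are all illustrative choices.

    # The learner sees only input-output pairs produced by y = 2x and must
    # recover the underlying function, then generalize to an unseen input.
    examples = [(1, 2), (3, 6), (5, 10), (7, 14)]

    w = 0.0                         # the learner's single adjustable "weight"
    for _ in range(200):
        for x, y in examples:
            error = y - w * x
            w += 0.01 * error * x   # nudge w to reduce the error on this pair

    print(round(w, 3))              # close to 2.0
    print(round(w * 100, 1))        # generalizes to an unseen input: about 200.0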

  8. AND Network
     Input    Output
     0 0      0
     1 0      0
     0 1      0
     1 1      1
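A minimal sketch of how a single unit can represent AND, assuming a logistic (sigmoid) activation like tlearn's. The weights BIAS, W1, and W2 are hand-picked for illustration; they are not weights produced by tlearn.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Illustrative hand-picked weights: the bias is negative enough that one
    # active input alone cannot push the unit past 0.5, but both together can.
    BIAS, W1, W2 = -6.0, 4.0, 4.0

    def and_unit(i1, i2):
        return sigmoid(BIAS + W1 * i1 + W2 * i2)

    for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(i1, i2, round(and_unit(i1, i2), 3))
    # Output stays below 0.5 for the first three patterns and above 0.5 for (1, 1).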

  9. OR Network
     Input    Output
     0 0      0
     1 0      1
     0 1      1
     1 1      1
     • NETWORK CONFIGURED BY TLEARN
     • # weights after 10000 sweeps
     • # WEIGHTS
     • # TO NODE 1
     • -1.9083807468 ## bias to 1
     • 4.3717832565 ## i1 to 1
     • 4.3582129478 ## i2 to 1
     • 0.0000000000
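Plugging the weights listed above into a single unit reproduces the OR truth table; a minimal sketch, assuming tlearn's standard logistic activation (the function and variable names here are just labels for the slide's weight entries):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Weights copied from the slide (tlearn, after 10000 sweeps).
    BIAS = -1.9083807468   # bias to node 1
    W_I1 = 4.3717832565    # i1 to node 1
    W_I2 = 4.3582129478    # i2 to node 1

    def or_unit(i1, i2):
        return sigmoid(BIAS + W_I1 * i1 + W_I2 * i2)

    for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(i1, i2, round(or_unit(i1, i2), 3))
    # (0, 0) comes out well below 0.5; the other three patterns come out well above.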

  10. 2-layer XOR Network • In order for the network to model the XOR function, we need activation of either of the inputs to turn the output node “on” – just as in the OR network. This was achieved easily by making the negative weight on the bias smaller in magnitude than the positive weight on either input. However, in the XOR network we also want turning both inputs on to turn the output node “off”. Since turning both inputs on can only increase the total input to the output node, and the output is switched “off” when it receives less input, this effect cannot be achieved. • The XOR function is not linearly separable, and hence it cannot be represented by a two-layer network. This is a classic result in the theory of neural networks.
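The linear-separability claim can be checked by brute force. A minimal sketch, assuming the output unit is thresholded at 0.5 so that only the sign of its net input matters; the grid range and step size are arbitrary illustrative choices. The search finds no bias and input weights that reproduce XOR.

    import itertools

    XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

    def fires(bias, w1, w2, i1, i2):
        # A logistic unit thresholded at 0.5 is "on" exactly when its net
        # input is positive, so only the sign of bias + w1*i1 + w2*i2 matters.
        return 1 if bias + w1 * i1 + w2 * i2 > 0 else 0

    grid = [v / 2 for v in range(-20, 21)]   # -10 to 10 in steps of 0.5
    solutions = [
        (b, w1, w2)
        for b, w1, w2 in itertools.product(grid, repeat=3)
        if all(fires(b, w1, w2, i1, i2) == out for (i1, i2), out in XOR.items())
    ]
    print(solutions)   # [] -- no direct input-to-output weight setting computes XOR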

  11. XOR Network
     The mapping from the hidden units to the output is an OR network that never receives a [1 1] input:
     Hidden 1  Hidden 2    Output
     0         0           0
     1         0           1
     0         1           1
     1         1           1
     -4.4429202080 ## bias to output
     9.0652370453 ## 1 to output
     8.9045801163 ## 2 to output
     Hidden unit 1 (from inputs i1, i2):
     Input    Output
     0 0      0
     1 0      1
     0 1      0
     1 1      0
     Hidden unit 2 (from inputs i1, i2):
     Input    Output
     0 0      0
     1 0      0
     0 1      1
     1 1      0
     -3.0456776619 ## bias to 1
     5.5165352821 ## i1 to 1
     -5.7562727928 ## i2 to 1
     -3.6789164543 ## bias to 2
     -6.4448370934 ## i1 to 2
     6.4957633018 ## i2 to 2
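A minimal sketch wiring the weights listed above into a network with two hidden units and one output unit, assuming logistic activations throughout (the variable names are just labels for the slide's weight entries):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Weights copied from the slide.
    B1, W_I1_1, W_I2_1 = -3.0456776619, 5.5165352821, -5.7562727928   # inputs -> hidden 1
    B2, W_I1_2, W_I2_2 = -3.6789164543, -6.4448370934, 6.4957633018   # inputs -> hidden 2
    B_O, W_1_O, W_2_O = -4.4429202080, 9.0652370453, 8.9045801163     # hidden -> output

    def xor_net(i1, i2):
        h1 = sigmoid(B1 + W_I1_1 * i1 + W_I2_1 * i2)   # roughly "i1 and not i2"
        h2 = sigmoid(B2 + W_I1_2 * i1 + W_I2_2 * i2)   # roughly "i2 and not i1"
        return sigmoid(B_O + W_1_O * h1 + W_2_O * h2)  # OR of the two hidden units

    for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(i1, i2, round(xor_net(i1, i2), 3))
    # (0, 0) and (1, 1) come out well below 0.5; (1, 0) and (0, 1) come out near 1.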

  12. Learning Rate • The learning rate, which is explained in chapter 1 (pp. 12-13), is a training parameter which basically determines how strongly the network responds to an error signal at each training cycle. The higher the learning rate, the bigger the change the network will make in response to a large error. Sometimes having a high learning rate will be beneficial, at other times it can be quite disastrous for the network. An example of sensitivity to learning rate can be found in the case of the XOR network discussed in chapter 4. • Why should it be a bad thing to make big corrections in response to big errors? The reason for this is that the network is looking for the best general solution to mapping all of the input-output pairs, but the network normally adjusts weights in response to an individual input-output pair. Since the network has no knowledge of how representative any individual input-output pair is of the general trend in the training set, it would be rash for the network to respond too strongly to any individual error signal. By making many small responses to the error signals, the network learns a bit more slowly, but it is protected against being messed up by outliers in the data.
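A minimal sketch of the role the learning rate plays, using plain gradient descent on a single logistic unit as a stand-in for tlearn's training procedure (the OR training set, the value 0.3, the random initialization, and the number of sweeps are all illustrative):

    import math
    import random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Train a single logistic unit on OR. The learning rate scales every
    # weight change made in response to an error on one input-output pair.
    DATA = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]
    LEARNING_RATE = 0.3   # illustrative value

    random.seed(0)
    bias, w1, w2 = (random.uniform(-0.5, 0.5) for _ in range(3))
    for sweep in range(10000):
        (i1, i2), target = DATA[sweep % len(DATA)]
        out = sigmoid(bias + w1 * i1 + w2 * i2)
        # Error signal for this one pattern, scaled by the learning rate.
        delta = LEARNING_RATE * (target - out) * out * (1 - out)
        bias, w1, w2 = bias + delta, w1 + delta * i1, w2 + delta * i2

    for (i1, i2), target in DATA:
        print(i1, i2, round(sigmoid(bias + w1 * i1 + w2 * i2), 3))
    # The trained unit ends up below 0.5 for (0, 0) and above 0.5 for the rest.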

  13. Momentum Just as with learning rate, sometimes the learning algorithm can only find a good solution to a problem if the momentum training parameter is set to a specific value. What does this mean, and why should it make a difference? If momentum is set to a high value, then the weight changes made by the network are very similar from one cycle to the next. If momentum is set to a low value, then the weight changes made by the network can be very different on adjacent cycles. So what?

  14. Momentum In searching for the best available configuration to model the training data, the network has no ‘knowledge’ of what the best solution is, or even whether there is a particularly good solution at all. It therefore needs some efficient and reliable way of searching the range of possible weight-configurations for the best solution. One thing that can be done is for the network to test whether any small changes to its current weight-configuration lead to improved performance. If so, then it can make that change. Then it can ask the same question in its new weight-configuration, and again modify the weights if there is a small change that leads to improvement. This is a fairly effective way for a blind search to proceed, but it has inherent dangers – the network might come across a weight-configuration which is better than all very similar configurations, but is not the best configuration of all. In this situation, the network finds that no small change improves performance, and will therefore not modify its weights; it ‘thinks’ that it has reached an optimal solution, but this conclusion is incorrect. Such a configuration is known as a local maximum (of performance) or, equivalently, a local minimum (of error).

  15. Momentum Momentum can serve to help the network avoid local maxima, by controlling the ‘scale’ at which the search for a solution proceeds. If momentum is set high, then changes in the weight-configuration are very similar from one cycle to the next. A consequence of this is that early in training, when error levels are typically high, weight changes will be consistently large. Because weight changes are forced to be large, this can help the network avoid getting trapped in a local maximum. A decision about the momentum value to be used for learning amounts to a hypothesis about the nature of the problem being learned, i.e., it is a form of innate knowledge, although not of the kind that we are accustomed to dealing with.
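A minimal sketch of the usual momentum update rule (a common formulation; tlearn's own implementation may differ in detail): each weight change blends the current gradient step with the previous change, so a high momentum value keeps successive changes similar from cycle to cycle and lets step sizes build up.

    # Momentum update for a single weight:
    #   delta_w(t) = -learning_rate * gradient(t) + momentum * delta_w(t-1)
    def momentum_step(weight, gradient, prev_delta, lr=0.1, momentum=0.9):
        delta = -lr * gradient + momentum * prev_delta
        return weight + delta, delta

    # With a gradient that keeps pointing the same way, the step size grows
    # (toward lr / (1 - momentum)), which is what helps the search roll
    # through shallow local maxima instead of stopping at the first one.
    w, prev = 0.0, 0.0
    for _ in range(5):
        w, prev = momentum_step(w, gradient=1.0, prev_delta=prev)
        print(round(w, 3), round(prev, 3))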

  16. The Past Tense and Beyond

  17. Classic Developmental Story • Initial mastery of regular and irregular past tense forms • Overregularization appears only later (e.g. goed, comed) • ‘U-Shaped’ developmental pattern taken as evidence for learning of a morphological rule V + [+past] --> stem + /d/

  18. Rumelhart & McClelland 1986. Model learns to classify regulars and irregulars, based on sound similarity alone. Shows U-shaped developmental profile.

  19. What is really at stake here? • Abstraction • Operations over variables • Symbol manipulation • Algebraic computation • Learning based on input • How do learners generalize beyond input (e.g., to the rule y = 2x)?

  20. What is not at stake here • Feedback, negative evidence, etc.

  21. Who has the most at stake here? • Those who deny the need for rules/variables in language have the most to lose here…if the English past tense is hard, just wait until you get to the rest of natural language! • …but if they are successful, they bring with them a simple and attractive learning theory, and mechanisms that can readily be grounded at the neural level • However, if the advocates of rules/variables succeed here or elsewhere, they face the more difficult challenge at the neuroscientific level

  22. Pinker Ullman

  23. Beyond Sound Similarity
      Regulars and Associative Memory
      1. Are regulars different?
      2. Do regulars implicate operations over variables?
      Neuropsychological Dissociations
      Other Domains of Morphology

  24. (Pinker & Ullman 2002)

  25. Beyond Sound Similarity
     Zero-derived denominals are regular:
       Soldiers ringed the city / *Soldiers rang the city
       high-sticked, grandstanded, … / *high-stuck, *grandstood, …
     Productive in adults & children
     Shows sensitivity to morphological structure: [[ stemN ] ø V]-ed
     Provides good evidence that sound similarity is not everything
     But nothing prevents a model from using a richer similarity metric:
       morphological structure (for ringed)
       semantic similarity (for low-lifes)

  26. Beyond Sound Similarity
      Regulars and Associative Memory
      1. Are regulars different?
      2. Do regulars implicate operations over variables?
      Neuropsychological Dissociations
      Other Domains of Morphology

  27. Regulars & Associative Memory
     Regulars are productive, need not be stored
     Irregulars are not productive, must be stored
     But are regulars immune to effects of associative memory?
       frequency
       over-irregularization
     Pinker & Ullman: regulars may be stored but they can also be generated on-the-fly
       ‘race’ can determine which of the two routes wins
       some tasks more likely to show effects of stored regulars

  28. Specific Language Impairment
     Early claims that regulars show greater impairment than irregulars are not confirmed
     Pinker & Ullman 2002b: ‘The best explanation is that language-impaired people are indeed impaired with rules, […] but can memorize common regular forms.’
     Regulars show consistent frequency effects in SLI, not in controls.
     Child vs. Adult Impairments
     ‘This suggests that children growing up with a grammatical deficit are better at compensating for it via memorization than are adults who acquired their deficit later in life.’

  29. Beyond Sound Similarity
      Regulars and Associative Memory
      1. Are regulars different?
      2. Do regulars implicate operations over variables?
      Neuropsychological Dissociations
      Other Domains of Morphology

  30. Neuropsychological Dissociations
     Ullman et al. 1997
       Alzheimer’s disease patients: poor memory retrieval; poor irregulars, good regulars
       Parkinson’s disease patients: impaired motor control, good memory; good irregulars, poor regulars
       Striking correlation involving laterality of effect
     Marslen-Wilson & Tyler 1997
       Normals: past tense primes stem
       2 Broca’s patients: irregulars prime stems; inhibition for regulars
       1 patient with bilateral lesion: regulars prime stems; no priming for irregulars or semantic associates

  31. Morphological Priming
     Lexical Decision Task: CAT, TAC, BIR, LGU, DOG (press ‘Yes’ if this is a word)
     Priming: facilitation in decision times when related word precedes target (relative to unrelated control), e.g., {dog, rug} - cat
     Marslen-Wilson & Tyler 1997
       Regular: {jumped, locked} - jump
       Irregular: {found, shows} - find
       Semantic: {swan, hay} - goose
       Sound: {gravy, sherry} - grave

  32. Neuropsychological Dissociations
     Bird et al. 2003: complain that arguments for selective difficulty with regulars are confounded with the phonological complexity of the word-endings
     Pinker & Ullman 2002: weight of evidence still supports dissociation; Bird et al.’s materials contained additional confounds

  33. Brain Imaging Studies
     Jaeger et al. 1996, Language: PET study of past tense
       Task: generate past from stem
       Design: blocked conditions
       Result: different areas of activation for regulars and irregulars
       Is this evidence decisive? task demands very different; difference could show up in network; doesn’t implicate variables
     Münte et al. 1997: ERP study of violations
       Task: sentence reading
       Design: mixed
       Result: regulars: ~LAN; irregulars: ~N400
       Is this evidence decisive? allows possibility of comparison with other violations

  34. [Figure: Regular, Irregular, and Nonce conditions (Jaeger et al. 1996)]

  35. Beyond Sound Similarity
      Regulars and Associative Memory
      1. Are regulars different?
      2. Do regulars implicate operations over variables?
      Neuropsychological Dissociations
      Other Domains of Morphology

  36. (Clahsen, 1999)

  37. Low-Frequency Defaults
     German Plurals
       die Straße → die Straßen      die Frau → die Frauen
       der Apfel → die Äpfel         die Mutter → die Mütter
       das Auto → die Autos          der Park → die Parks
       die Schmidts
     -s plural: low frequency, used for loan-words, denominals, names, etc.
     Response:
       frequency is not the critical factor in a system that focuses on similarity
       distribution in the similarity space is crucial
       similarity space with islands of reliability
       network can learn islands, or network can learn to associate a form with the space between the islands

  38. Similarity Space

  39. Similarity Space

  40. German Plurals (Hahn & Nakisa 2000)

  41. Arabic Broken Plural
     • CvCC
       nafs → nufuus ‘soul’
       qidh → qidaah ‘arrow’
     • CvvCv(v)C
       xaatam → xawaatim ‘signet ring’
       jaamuus → jawaamiis ‘buffalo’
     • Sound Plural
       shuway?ir → shuway?ir-uun ‘poet (dim.)’
       kaatib → kaatib-uun ‘writing (participle)’
       hind → hind-aat ‘Hind (fem. name)’
       ramadaan → ramadaan-aat ‘Ramadan (month)’

  42. How far can a model generalize to novel forms? • All novel forms that it can represent • Only some of the novel forms that it can represent • Velar fricative [x], e.g., Bach • Could the Lab 2b model generate the past tense for Bach?
