
Recursive Neural Networks and Connectionist Systems for NLP: The Problem of First-Pass Attachment Disambiguation

Fabrizio Costa, University of Florence


Presentation Transcript


  1. Recursive Neural Networks and Connectionist Systems for NLP: The Problem of First-Pass Attachment Disambiguation. Fabrizio Costa, University of Florence

  2. Overview • What is this talk about: • brief review of connectionist architectures for NLP • introduction of a connectionist recursive system for syntactic parsing • Hybrid model: • Dynamic grammar: strong incrementality hypothesis • Recursive connectionist predictor • Investigations: • linguistic preferences expressed by the system • Advances: • enhance performance of the proposed model by information reduction, domain partitioning

  3. Connectionism and NLP • Even if strongly criticized (Fodor, Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 1988), connectionist systems are applied to a variety of NLP tasks: • grammaticality prediction • case role assignment • syntactic parsers • multi-feature systems (lexical, syntactic, semantic)

  4. Advantages • Parallel satisfaction of multiple and diverse constraints • in contrast, rule-based sequential processing needs to compute solutions that are later discarded if they do not meet the requirements • Develop and process distributed representations for complex structures

  5. Connectionist Architectures • Brief review of: • Neural Networks • Recurrent Networks • Recursive Auto Associative Memory • Introduction to Recursive Neural Networks

  6. Neural Networks • What is an Artificial Neural Network (ANN)? • A simple model: the perceptron • [Figure: inputs x1 … xn with weights w1 … wn feeding a single output unit y] • y = Σ_{i=1..n} wi·xi − θ
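
A minimal Python sketch of the perceptron above: a weighted sum of the inputs minus a threshold, passed through a hard step. The particular weights, inputs and threshold are illustrative assumptions, not values from the talk.

```python
import numpy as np

def perceptron(x, w, theta):
    """Weighted sum of the inputs minus the threshold, then a hard step."""
    a = np.dot(w, x) - theta      # y = sum_{i=1..n} w_i * x_i - theta
    return 1 if a >= 0 else 0

# Illustrative 3-input example
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2, 0.3])
print(perceptron(x, w, theta=0.4))   # -> 1
```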

  7. What can it learn? • A perceptron can learn linear decision boundaries from examples • Ex: two-dimensional input • [Figure: example points of class C1 and class C2 in the (x1, x2) plane, separated by the boundary w1·x1 + w2·x2 − θ = 0]

  8. How can it learn? • Iterative algorithm: • at every presentation of the input the procedure reduces the error • [Figure: the decision boundary after iterations 1, 2 and 3, progressively separating the two classes]
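
A sketch of such an iterative procedure, using the classic perceptron error-correcting rule as an illustration; the learning rate, epoch count and toy data are assumptions.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """At every presentation of an input, nudge the weights so as to reduce the error."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) - theta >= 0 else 0
            err = target - pred          # 0 if correct, +1/-1 otherwise
            w += lr * err * xi           # move the boundary towards the misclassified example
            theta -= lr * err
    return w, theta

# Two linearly separable classes in the plane (cf. the figure)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w, theta = train_perceptron(X, y)
```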

  9. Getting more complex • Multilayer Feed-forward Network • hidden layer • non-linearity • [Figure: input layer x1 … xn, hidden layers 1 and 2, output layer]

  10. Signal Flow • Learning takes place in 2 phases: • Forward phase (compute prediction) • Backward phase (propagate error signal) • [Figure: forward and backward signal flow through the layers]
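
A minimal sketch of the two phases for a single-hidden-layer network and one training example, assuming sigmoid units and a squared-error signal; layer sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward phase: compute the prediction layer by layer."""
    h = sigmoid(W1 @ x)                 # hidden layer (non-linearity)
    y = sigmoid(W2 @ h)                 # output layer
    return h, y

def backward(x, t, W1, W2, lr=0.1):
    """Backward phase: propagate the error signal and update the weights."""
    h, y = forward(x, W1, W2)
    delta_out = (y - t) * y * (1 - y)               # error at the output units
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error pushed back to the hidden units
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))   # 3 inputs -> 4 hidden units
W2 = rng.normal(scale=0.1, size=(1, 4))   # 4 hidden units -> 1 output
W1, W2 = backward(np.array([1.0, 0.0, 1.0]), np.array([1.0]), W1, W2)
```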

  11. Decision Boundaries • Now decision boundaries can be made arbitrarily complex by increasing the number of neurons in the hidden layer. • [Figure: a non-linear boundary separating the two classes]

  12. ANN and NLP • Since ANNs can potentially learn any mapping from input values to any desired output... • ...why not apply them to NLP problems? • How to code linguistic information to make it processable by an ANN? • What are the properties of linguistic information?

  13. Neural Networks and NLP • Problems: • in the easiest formulation NLP input consists of a variable-length sequence of tokens • the output at any time depends on input received an arbitrary number of time-steps in the past • standard neural networks can directly process only sequences of fixed size • Idea: • introduce an explicit state representation • recycle this state as input for the next time-step

  14. Recurrent Networks • [Figure: Input Units → Hidden Units → Output Units, with Holding Units feeding the previous hidden and output activations back into the network] • Rumelhart, Hinton, Williams. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986

  15. Working Principle • A complete forward propagation (from input to output) is 1 time step • the backward connections copy activations to the Holding Units for the next time step • at the first time step all Holding Units have a fixed conventional value (generally all 0s) • at a generic time step: • the Hidden Units receive input from both the Input Units and the appropriate Holding Units • the Output Units receive input from both the Hidden Units and the appropriate Holding Units

  16. Simple Recurrent Networks • J. L. Elman proposed and studied the properties of Simple Recurrent Networks applied to linguistic problems • Elman. Finding Structure in Time. Cognitive Science, 1990 • Elman. Distributed Representations, Simple Recurrent Networks and Grammatical Structure. Machine Learning, 1991 • Simple Recurrent Networks are recurrent networks that • have a single set of holding units for the hidden layer, called the context layer • truncate the backpropagation of the error signal at the context layer
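
A sketch of one forward time step of such a network: the hidden layer reads the current input together with the context layer (a copy of the previous hidden activations), which starts as all zeros. Weight shapes and the sigmoid non-linearity are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_step(x_t, context, W_in, W_ctx, W_out):
    """One time step: the hidden layer reads the current input and the context layer."""
    hidden = sigmoid(W_in @ x_t + W_ctx @ context)
    output = sigmoid(W_out @ hidden)
    return output, hidden                 # hidden is copied into the context layer

def srn_run(sequence, W_in, W_ctx, W_out, n_hidden):
    """Process a whole token sequence, recycling the hidden state as context."""
    context = np.zeros(n_hidden)          # fixed conventional start value (all 0s)
    outputs = []
    for x_t in sequence:
        y_t, context = srn_step(x_t, context, W_in, W_ctx, W_out)
        outputs.append(y_t)
    return outputs
```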

  17. Simple Recurrent Networks • [Figure: Input Units and Context Layer feed the Hidden Units, which feed the Output Units; the Context Layer holds a copy of the previous hidden activations]

  18. Elman Experiment • Task: predict next token in a sequence representing a sentence • Claim: in order to exhibit good prediction performance the system has to • learn syntactic constraints • learn to preserve relevant information of previous inputs in the Context Layer • learn to discard irrelevant information

  19. Elman Setting • Simplified grammar capable of generating embedded relative clauses • 23 words plus the end of sentence marker ‘.’ • number agreement • verb argument structure • interaction with relative clause • the agent or subject in a RC is omitted • center embedding • viable sentences • the end of sentence marker cannot occur at all positions in the sentence

  20. Network and Training • Local representations for input and output • Hidden layer with 70 units • Training set: 4 sets of 10,000 sentences • sets increasingly “difficult”, with a higher percentage of sentences containing RCs • 5 epochs
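
A sketch of how the local (one-hot) representations and the next-token prediction targets can be built; the vocabulary below is an illustrative subset of Elman's 23 words plus the end-of-sentence marker, not the actual lexicon.

```python
import numpy as np

# Illustrative subset of the vocabulary plus the end-of-sentence marker '.'
VOCAB = ['boy', 'girl', 'dog', 'chases', 'sees', 'who', '.']
word_to_idx = {w: i for i, w in enumerate(VOCAB)}

def one_hot(word):
    """Local representation: one input/output unit per vocabulary symbol."""
    v = np.zeros(len(VOCAB))
    v[word_to_idx[word]] = 1.0
    return v

def to_training_pairs(sentence):
    """Elman's task: at every position the target is the next token in the sequence."""
    tokens = sentence.split() + ['.']
    return [(one_hot(a), one_hot(b)) for a, b in zip(tokens, tokens[1:])]

pairs = to_training_pairs("boy who sees dog chases girl")
```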

  21. Results • On sentences from the training set: • the network has learned to predict verb agreement in number with the noun and all the other constraints present in the grammar • Good generalization on other sentences not present in the training set

  22. Criticisms • Possibility of rote learning • there are 23 vocabulary symbols but only 10 different classes (i.e. ‘boy’, ‘dog’, ‘girl’ are equivalent) • the number of distinct sentence patterns is small (there are 400 different sentences of 3 words but only 18 different patterns if we consider the equivalences) • very few sentence patterns in the test set do not also appear in the training set

  23. Criticisms • The simplified error backpropagation algorithm doesn’t allow learning of long-term dependencies • the backpropagation of the error signal is truncated at the context layer • this makes computation simpler (no need to store the history of activations) and local in time • the calculated gradient forces the network to transfer information about the input into the hidden layer only if that information is useful for the current output; if it is useful more than one time step in the future, there is no guarantee that it will be preserved

  24. Processing Linguistic Data • Linguistic data such as syntactic information is “naturally” represented in a structured way • [Figure: flat info vs. syntactic info — the string “The servant of the actress” and its parse tree NP(D N PP(P NP(D N)))]

  25. Processing Structured Data • [Figure: a tree with root A, children B and C, and children D E F G H under B, flattened to the string (A(B(DEFGH)C))] • It is possible to “flatten” the information and transform it into a vector representation • BUT • flattening pushes dependencies “further” apart, making them more difficult to process • We need to directly process structured information!

  26. How to process structured information with a connectionist architecture? • Idea: recursively compress subtrees into distributed vector representations • Pollack. Recursive Distributed Representations. Artificial Intelligence, 1990 • An autoencoder net learns how to compress the fields of a node into a label and uncompress the label back into the fields
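
A minimal sketch of this compress/uncompress idea for a two-field RAAM, assuming sigmoid units and illustrative, untrained weights (Pollack trains both directions jointly as an autoencoder).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RAAM:
    """Two-field RAAM: compress (left, right) into 'whole', then reconstruct them."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.W_enc = rng.normal(scale=0.1, size=(dim, 2 * dim))   # encoder weights
        self.W_dec = rng.normal(scale=0.1, size=(2 * dim, dim))   # decoder weights

    def encode(self, left, right):
        """Compress the two child fields into one label-sized representation."""
        return sigmoid(self.W_enc @ np.concatenate([left, right]))

    def decode(self, whole):
        """Expand a representation back into its two fields."""
        fields = sigmoid(self.W_dec @ whole)
        return fields[:self.dim], fields[self.dim:]
```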

  27. Recursive Auto-Associative Memory (RAAM) • [Figure: the encoder compresses the fields (left, right) into a single representation (whole); the decoder expands (whole) back into (left, right)]

  28. Example: A((BC)D) • [Figure: the tree A((BC)D) and its reduced representations — r encodes (B C), q encodes (r D), p encodes (A q)]

  29. Training • It is a moving-target training • when learning (B C) → r → (B C), the representation of r changes • this changes the target for another example (r D) → q → (r D) • this can cause instability, so the learning rate has to be kept small and hence we have very long training times

  30. Coding and Decoding • A RAAM can be viewed as 2 nets trained simultaneously: • an encoding network that learns the reduced representation r for BC • a decoding network that takes as input the reduced representation r and decodes it back to BC • Both encoding and decoding are recursive operations • The encoding net knows, from the structure of the tree, how many times it must recursively compress representations • The decoding network has to decide whether a decoded field is a terminal node or an internal node which should be further decoded

  31. Decoding • Pollack solution: • use binary codes for terminal nodes • internal representations have codes with values in [0,1] but not close to 0 or 1 • Decision rule: • if a decoded field has all its values close to 0 or 1 then it is a terminal node and it is not decoded further
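
A sketch of the recursive use of the encoder and decoder together with this decision rule, assuming the RAAM class sketched above, binary codes for the terminals, and an illustrative tolerance and depth bound.

```python
def encode_tree(raam, tree, lexicon):
    """tree is either a terminal symbol or a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        return lexicon[tree]                      # binary code of the terminal node
    left, right = tree
    return raam.encode(encode_tree(raam, left, lexicon),
                       encode_tree(raam, right, lexicon))

def is_terminal(code, eps=0.1):
    """Pollack's rule: all values close to 0 or 1 -> terminal, stop decoding."""
    return all(min(v, 1.0 - v) < eps for v in code)

def decode_tree(raam, code, max_depth=10):
    """Recursively decode; the depth bound guards against non-termination
    on an untrained network (an assumption added for safety)."""
    if max_depth == 0 or is_terminal(code):
        return code
    left, right = raam.decode(code)
    return (decode_tree(raam, left, max_depth - 1),
            decode_tree(raam, right, max_depth - 1))
```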

  32. Experiment • Task: coding and decoding of compositional propositions • Ex: • Pat thought that John loved Mary • (thought Pat (loved John (Mary)))

  33. Experimental Setting • The network has: • 48 input units • 16 hidden units • 48 output units • Training set: 13 complex propositions

  34. Results • After training the network is able to code and decode some (not all) novel propositions • Cluster analysis shows that similar trees are encoded with similar codes • Ex: (loved John Pat) and (loved Pat Mary) are more similar to each other than any of the codes for other trees

  35. Further results • Pollack tried to test whether a network could be trained to manipulate representations without decoding them • Ex: train a network to transform the reduced representation for (is liked X Y) into (likes Y X) • Results: • if trained on 12 of the 16 possible propositions • the network generalizes correctly to the other 4 propositions • Chalmers experimented with a network that transforms reduced representations from passive to active structures

  36. Problems • Generalization problems for new structures • Very long training times • Information storage limits: • all configurations have to be stored in fixed-size vector representations • as the height grows we have an exponential number of possible trees • the numerical precision limits are exceeded already by small trees • Unknown (but believed not good) scaling properties

  37. Recursive Neural Network • We want to • directly process complex tree structures • work in a supervised framework • We use Recursive Neural Networks (RNN) specialized for tree structures • Frasconi, Gori, Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks, 1998

  38. What is a RNN? • Recursive Neural Networks for trees are composed of several replicas of a Recursive Network and one Output Network • [Figure: the Recursive Network maps the label encoding and the 1st … last child states to the node state; the Output Network maps the root state to the output]

  39. How does a RNN process tree data structures? • General processing step: • Structure unfolding • Prediction phase: • Recursive state update • Learning phase: • Backpropagation through structure
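
A minimal sketch of the recursive state update and of the output network applied at the root, assuming a fixed maximum out-degree, zero vectors as the frontier (missing-child) state, sigmoid units and illustrative dimensions; learning by backpropagation through structure is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TreeRNN:
    """Replicas of one recursive network over the tree, plus one output network."""
    def __init__(self, label_dim, state_dim, out_dim, max_children, seed=0):
        rng = np.random.default_rng(seed)
        self.state_dim = state_dim
        self.max_children = max_children
        in_dim = label_dim + max_children * state_dim
        self.W_rec = rng.normal(scale=0.1, size=(state_dim, in_dim))   # recursive network
        self.W_out = rng.normal(scale=0.1, size=(out_dim, state_dim))  # output network

    def node_state(self, label, child_states):
        """Recursive state update: label encoding + child states (missing
        children are padded with the fixed all-zero frontier state)."""
        pad = [np.zeros(self.state_dim)] * (self.max_children - len(child_states))
        x = np.concatenate([label] + child_states + pad)
        return sigmoid(self.W_rec @ x)

    def root_output(self, tree):
        """Unfold the structure bottom-up; apply the output network at the root."""
        def visit(node):
            label, children = node               # node = (label_vector, [child nodes])
            return self.node_state(label, [visit(c) for c in children])
        return sigmoid(self.W_out @ visit(tree))
```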

  40.–46. Structure unfolding • [Animation over seven slides: the recursive network is unfolded bottom-up over the parse tree of “It has no bearing on” (S, VP, NP, PP nodes over the POS tags PRP, VBZ, DT, NN, IN), replacing each subtree by its state until only the root remains, where the Output network is applied]

  47.–50. Prediction phase: Information Flow • [Animation over four slides: the information flows bottom-up through the unfolded network for “It has no bearing on”, producing the prediction at the root]
