
Recursive Neural Networks and Connectionist Systems for NLP: The Problem of First-Pass Attachment Disambiguation

Fabrizio Costa, University of Florence


Presentation Transcript


  1. Recursive Neural Networks and Connectionist Systems for NLP: The Problem of First-Pass Attachment Disambiguation. Fabrizio Costa, University of Florence

  2. Overview • What is this talk about: • brief review of connectionist architectures for NLP • introduction of a connectionist recursive system for syntactic parsing • Hybrid model: • Dynamic grammar: strong incrementality hypothesis • Recursive connectionist predictor • Investigations: • linguistic preferences expressed by the system • Advances: • enhance performance of the proposed model by information reduction, domain partitioning

  3. Connectionism and NLP • Even if strongly criticized (Fodor, Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 1988), connectionist systems are applied to a variety of NLP tasks: • grammaticality prediction • case role assignment • syntactic parsers • multi-feature systems (lexical, syntactic, semantic)

  4. Advantages • Parallel satisfaction of multiple and diverse constraints • in contrast, rule-based sequential processing needs to compute solutions that are later discarded if they do not meet the requirements • Develop and process distributed representations for complex structures

  5. Connectionist Architectures • Brief review of: • Neural Networks • Recurrent Networks • Recursive Auto Associative Memory • Introduction to Recursive Neural Networks

  6. Neural Networks • What is an Artificial Neural Network (ANN)? • A simple model: the perceptron • [Figure: inputs x1 … xn with weights w1 … wn feeding a single output unit y] • y = Σ_{i=1..n} wi·xi − θ
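
A minimal Python sketch of the perceptron above: a weighted sum of the inputs minus a threshold, passed through a hard step. The particular weights, inputs and threshold are illustrative assumptions, not values from the talk.

```python
import numpy as np

def perceptron(x, w, theta):
    """Weighted sum of the inputs minus the threshold, then a hard step."""
    a = np.dot(w, x) - theta      # y = sum_{i=1..n} w_i * x_i - theta
    return 1 if a >= 0 else 0

# Illustrative 3-input example
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2, 0.3])
print(perceptron(x, w, theta=0.4))   # -> 1
```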

  7. What can it learn? • A perceptron can learn linear decision boundaries from examples • Ex: two-dimensional input • [Figure: example points of class C1 and class C2 in the (x1, x2) plane, separated by the boundary w1·x1 + w2·x2 − θ = 0]

  8. How can it learn? • Iterative algorithm: • at every presentation of the input the procedure reduces the error • [Figure: the decision boundary after iterations 1, 2 and 3, progressively separating the two classes]
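
A sketch of such an iterative procedure, using the classic perceptron error-correcting rule as an illustration; the learning rate, epoch count and toy data are assumptions.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """At every presentation of an input, nudge the weights so as to reduce the error."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) - theta >= 0 else 0
            err = target - pred          # 0 if correct, +1/-1 otherwise
            w += lr * err * xi           # move the boundary towards the misclassified example
            theta -= lr * err
    return w, theta

# Two linearly separable classes in the plane (cf. the figure)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w, theta = train_perceptron(X, y)
```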

  9. Getting more complex • Multilayer Feed-forward Network • hidden layer • non-linearity • [Figure: input layer x1 … xn, hidden layers 1 and 2, output layer]

  10. Signal Flow • Learning takes place in 2 phases: • Forward phase (compute prediction) • Backward phase (propagate error signal) • [Figure: forward and backward signal flow through the layers]
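
A minimal sketch of the two phases for a single-hidden-layer network and one training example, assuming sigmoid units and a squared-error signal; layer sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward phase: compute the prediction layer by layer."""
    h = sigmoid(W1 @ x)                 # hidden layer (non-linearity)
    y = sigmoid(W2 @ h)                 # output layer
    return h, y

def backward(x, t, W1, W2, lr=0.1):
    """Backward phase: propagate the error signal and update the weights."""
    h, y = forward(x, W1, W2)
    delta_out = (y - t) * y * (1 - y)               # error at the output units
    delta_hid = (W2.T @ delta_out) * h * (1 - h)    # error pushed back to the hidden units
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))   # 3 inputs -> 4 hidden units
W2 = rng.normal(scale=0.1, size=(1, 4))   # 4 hidden units -> 1 output
W1, W2 = backward(np.array([1.0, 0.0, 1.0]), np.array([1.0]), W1, W2)
```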

  11. Decision Boundaries • Now decision boundaries can be made arbitrarily complex by increasing the number of neurons in the hidden layer. • [Figure: a non-linear boundary separating the two classes]

  12. ANN and NLP • Since ANNs can potentially learn any mapping from input values to any desired output... • ...why not apply them to NLP problems? • How to code linguistic information to make it processable by an ANN? • What are the properties of linguistic information?

  13. Neural Networks and NLP • Problems: • in the easiest formulation NLP input consists of a variable-length sequence of tokens • the output at any time depends on input received an arbitrary number of time-steps in the past • standard neural networks can directly process only sequences of fixed size • Idea: • introduce an explicit state representation • recycle this state as input for the next time-step

  14. Recurrent Networks • [Figure: Input Units → Hidden Units → Output Units, with Holding Units feeding the previous hidden and output activations back into the network] • Rumelhart, Hinton, Williams. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986

  15. Working Principle • A complete forward propagation (from input to output) is 1 time step • the backward connections copy activations to the Holding Units for the next time step • at the first time step all Holding Units have a fixed conventional value (generally all 0s) • at a generic time step: • the Hidden Units receive input from both the Input Units and the appropriate Holding Units • the Output Units receive input from both the Hidden Units and the appropriate Holding Units

  16. Simple Recurrent Networks • J. L. Elman proposed and studied the properties of Simple Recurrent Networks applied to linguistic problems • Elman. Finding Structure in Time. Cognitive Science, 1990 • Elman. Distributed Representations, Simple Recurrent Networks and Grammatical Structure. Machine Learning, 1991 • Simple Recurrent Networks are recurrent networks that • have a single set of holding units for the hidden layer, called the context layer • truncate the backpropagation of the error signal at the context layer
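
A sketch of one forward time step of such a network: the hidden layer reads the current input together with the context layer (a copy of the previous hidden activations), which starts as all zeros. Weight shapes and the sigmoid non-linearity are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_step(x_t, context, W_in, W_ctx, W_out):
    """One time step: the hidden layer reads the current input and the context layer."""
    hidden = sigmoid(W_in @ x_t + W_ctx @ context)
    output = sigmoid(W_out @ hidden)
    return output, hidden                 # hidden is copied into the context layer

def srn_run(sequence, W_in, W_ctx, W_out, n_hidden):
    """Process a whole token sequence, recycling the hidden state as context."""
    context = np.zeros(n_hidden)          # fixed conventional start value (all 0s)
    outputs = []
    for x_t in sequence:
        y_t, context = srn_step(x_t, context, W_in, W_ctx, W_out)
        outputs.append(y_t)
    return outputs
```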

  17. Simple Recurrent Networks • [Figure: Input Units and Context Layer feed the Hidden Units, which feed the Output Units; the Context Layer holds a copy of the previous hidden activations]

  18. Elman Experiment • Task: predict next token in a sequence representing a sentence • Claim: in order to exhibit good prediction performance the system has to • learn syntactic constraints • learn to preserve relevant information of previous inputs in the Context Layer • learn to discard irrelevant information

  19. Elman Setting • Simplified grammar capable of generating embedded relative clauses • 23 words plus the end of sentence marker ‘.’ • number agreement • verb argument structure • interaction with relative clause • the agent or subject in a RC is omitted • center embedding • viable sentences • the end of sentence marker cannot occur at all positions in the sentence

  20. Network and Training • Local representations for input and output • Hidden layer with 70 units • Training set: 4 sets of 10,000 sentences • sets increasingly “difficult”, with a higher percentage of sentences containing RCs • 5 epochs
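
A sketch of how the local (one-hot) representations and the next-token prediction targets can be built; the vocabulary below is an illustrative subset of Elman's 23 words plus the end-of-sentence marker, not the actual lexicon.

```python
import numpy as np

# Illustrative subset of the vocabulary plus the end-of-sentence marker '.'
VOCAB = ['boy', 'girl', 'dog', 'chases', 'sees', 'who', '.']
word_to_idx = {w: i for i, w in enumerate(VOCAB)}

def one_hot(word):
    """Local representation: one input/output unit per vocabulary symbol."""
    v = np.zeros(len(VOCAB))
    v[word_to_idx[word]] = 1.0
    return v

def to_training_pairs(sentence):
    """Elman's task: at every position the target is the next token in the sequence."""
    tokens = sentence.split() + ['.']
    return [(one_hot(a), one_hot(b)) for a, b in zip(tokens, tokens[1:])]

pairs = to_training_pairs("boy who sees dog chases girl")
```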

  21. Results • On sentences from the training set: • the network has learned to predict verb agreement in number with the noun and all the other constraints present in the grammar • Good generalization on other sentences not present in the training set

  22. Criticisms • Possibility of rote learning • there are 23 vocabulary symbols but only 10 different classes (i.e. ‘boy’, ‘dog’, ‘girl’ are equivalent) • the number of distinct sentence patterns is small (there are 400 different sentences of 3 words but only 18 different patterns if we consider the equivalences) • very few sentence patterns in the test set do not also appear in the training set

  23. Criticisms • The simplified error backpropagation algorithm doesn’t allow learning of long-term dependencies • the backpropagation of the error signal is truncated at the context layer • this makes computation simpler (no need to store the history of activations) and local in time • the calculated gradient forces the network to transfer information about the input into the hidden layer only if that information is useful for the current output; if it is useful more than one time step in the future, there is no guarantee that it will be preserved

  24. Processing Linguistic Data • Linguistic data such as syntactic information is “naturally” represented in a structured way • [Figure: flat info vs. syntactic info — the string “The servant of the actress” and its parse tree NP(D N PP(P NP(D N)))]

  25. Processing Structured Data • [Figure: a tree with root A, children B and C, and children D E F G H under B, flattened to the string (A(B(DEFGH)C))] • It is possible to “flatten” the information and transform it into a vector representation • BUT • flattening pushes dependencies “further” apart, making them more difficult to process • We need to directly process structured information!

  26. How to process structured information with a connectionist architecture? • Idea: recursively compress subtrees into distributed vector representations • Pollack. Recursive Distributed Representations. Artificial Intelligence, 1990 • An autoencoder net learns how to compress the fields of a node into a label and uncompress the label back into the fields
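
A minimal sketch of this compress/uncompress idea for a two-field RAAM, assuming sigmoid units and illustrative, untrained weights (Pollack trains both directions jointly as an autoencoder).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RAAM:
    """Two-field RAAM: compress (left, right) into 'whole', then reconstruct them."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.W_enc = rng.normal(scale=0.1, size=(dim, 2 * dim))   # encoder weights
        self.W_dec = rng.normal(scale=0.1, size=(2 * dim, dim))   # decoder weights

    def encode(self, left, right):
        """Compress the two child fields into one label-sized representation."""
        return sigmoid(self.W_enc @ np.concatenate([left, right]))

    def decode(self, whole):
        """Expand a representation back into its two fields."""
        fields = sigmoid(self.W_dec @ whole)
        return fields[:self.dim], fields[self.dim:]
```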

  27. Recursive Auto-Associative Memory (RAAM) • [Figure: the encoder compresses the fields (left, right) into a single representation (whole); the decoder expands (whole) back into (left, right)]

  28. Example: A((BC)D) • [Figure: the tree A((BC)D) and its reduced representations — r encodes (B C), q encodes (r D), p encodes (A q)]

  29. Training • It is a moving-target training • when learning (B C) → r → (B C), the representation of r changes • this changes the target for another example (r D) → q → (r D) • this can cause instability, so the learning rate has to be kept small and hence we have very long training times

  30. Coding and Decoding • A RAAM can be viewed as 2 nets trained simultaneously: • an encoding network that learns the reduced representation r for BC • a decoding network that takes as input the reduced representation r and decodes it back to BC • Both encoding and decoding are recursive operations • The encoding net knows, from the structure of the tree, how many times it must recursively compress representations • The decoding network has to decide whether a decoded field is a terminal node or an internal node which should be further decoded

  31. Decoding • Pollack solution: • use binary codes for terminal nodes • internal representations have codes with values in [0,1] but not close to 0 or 1 • Decision rule: • if a decoded field has all its values close to 0 or 1 then it is a terminal node and it is not decoded further
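
A sketch of the recursive use of the encoder and decoder together with this decision rule, assuming the RAAM class sketched above, binary codes for the terminals, and an illustrative tolerance and depth bound.

```python
def encode_tree(raam, tree, lexicon):
    """tree is either a terminal symbol or a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        return lexicon[tree]                      # binary code of the terminal node
    left, right = tree
    return raam.encode(encode_tree(raam, left, lexicon),
                       encode_tree(raam, right, lexicon))

def is_terminal(code, eps=0.1):
    """Pollack's rule: all values close to 0 or 1 -> terminal, stop decoding."""
    return all(min(v, 1.0 - v) < eps for v in code)

def decode_tree(raam, code, max_depth=10):
    """Recursively decode; the depth bound guards against non-termination
    on an untrained network (an assumption added for safety)."""
    if max_depth == 0 or is_terminal(code):
        return code
    left, right = raam.decode(code)
    return (decode_tree(raam, left, max_depth - 1),
            decode_tree(raam, right, max_depth - 1))
```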

  32. Experiment • Task: coding and decoding of compositional propositions • Ex: • Pat thought that John loved Mary • (thought Pat (loved John (Mary)))

  33. Experimental Setting • The network has: • 48 input units • 16 hidden units • 48 output units • Training set: 13 complex propositions

  34. Results • After training the network is able to code and decode some (not all) novel propositions • Cluster analysis shows that similar trees are encoded with similar codes • Ex: (loved John Pat) and (loved Pat Mary) are more similar to each other than any of the codes for other trees

  35. Further results • Pollack tried to test whether a network could be trained to manipulate representations without decoding them • Ex: train a network to transform the reduced representation for (is liked X Y) into (likes Y X) • Results: • if trained on 12 of the 16 possible propositions • the network generalizes correctly to the other 4 propositions • Chalmers experimented with a network that transforms reduced representations from passive to active structures

  36. Problems • Generalization problems for new structures • Very long training times • Information storage limits: • all configurations have to be stored in fixed-size vector representations • as the height grows we have an exponential number of possible trees • the numerical precision limits are exceeded already by small trees • Unknown (but believed not good) scaling properties

  37. Recursive Neural Network • We want to • directly process complex tree structures • work in a supervised framework • We use Recursive Neural Networks (RNN) specialized for tree structures • Frasconi, Gori, Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks, 1998

  38. What is a RNN? • Recursive Neural Networks for trees are composed of several replicas of a Recursive Network and one Output Network • [Figure: the Recursive Network maps the label encoding and the 1st … last child states to the node state; the Output Network maps the root state to the output]

  39. How does a RNN process tree data structures? • General processing step: • Structure unfolding • Prediction phase: • Recursive state update • Learning phase: • Backpropagation through structure
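
A minimal sketch of the recursive state update and of the output network applied at the root, assuming a fixed maximum out-degree, zero vectors as the frontier (missing-child) state, sigmoid units and illustrative dimensions; learning by backpropagation through structure is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TreeRNN:
    """Replicas of one recursive network over the tree, plus one output network."""
    def __init__(self, label_dim, state_dim, out_dim, max_children, seed=0):
        rng = np.random.default_rng(seed)
        self.state_dim = state_dim
        self.max_children = max_children
        in_dim = label_dim + max_children * state_dim
        self.W_rec = rng.normal(scale=0.1, size=(state_dim, in_dim))   # recursive network
        self.W_out = rng.normal(scale=0.1, size=(out_dim, state_dim))  # output network

    def node_state(self, label, child_states):
        """Recursive state update: label encoding + child states (missing
        children are padded with the fixed all-zero frontier state)."""
        pad = [np.zeros(self.state_dim)] * (self.max_children - len(child_states))
        x = np.concatenate([label] + child_states + pad)
        return sigmoid(self.W_rec @ x)

    def root_output(self, tree):
        """Unfold the structure bottom-up; apply the output network at the root."""
        def visit(node):
            label, children = node               # node = (label_vector, [child nodes])
            return self.node_state(label, [visit(c) for c in children])
        return sigmoid(self.W_out @ visit(tree))
```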

  40.–46. Structure unfolding • [Animation over seven slides: the recursive network is unfolded bottom-up over the parse tree of “It has no bearing on” (S, VP, NP, PP nodes over the POS tags PRP, VBZ, DT, NN, IN), replacing each subtree by its state until only the root remains, where the Output network is applied]

  47.–50. Prediction phase: Information Flow • [Animation over four slides: the information flows bottom-up through the unfolded network for “It has no bearing on”, producing the prediction at the root]
