A Memory-Based Model of Syntactic Analysis: Data Oriented Parsing Remko Scha, Rens Bod, Khalil Sima’an Institute for Logic, Language and Computation University of Amsterdam
Outline of the lecture • Introduction • Disambiguation • Data Oriented Parsing • DOP1 computational aspects and experiments • Memory Based Learning framework • Conclusions
Introduction • Human language cognition: • Analogy-based processes operating on a store of past experiences • Modern linguistics: • Set of rules • Language processing algorithms • Performance model of human language processing • Competence grammar as a broad framework for performance models • Memory/analogy-based language processing
The Problem of Ambiguity Resolution • Every input string has an unmanageably large number of analyses • Uncertain input: generate guesses and choose one • Syntactic disambiguation might be a side effect of semantic disambiguation
The Problem of Ambiguity Resolution • Frequency of occurrence of lexical items and syntactic structures: • People register frequencies • People prefer analyses they have already experienced over constructing new ones • More frequent analyses are preferred to less frequent ones
From Probabilistic Competence-Grammars to Data-Oriented Parsing • Probabilistic information derived from past experience • Characterization of the possible sentence-analyses of the language • Stochastic Grammar • Define: all sentences and all analyses • Assign: a probability to each • Achieve: the preferences that people display when they choose sentences or analyses
Stochastic Grammar • The predictions of such a grammar are limited • Platitudes and conventional phrases are not captured • Allow redundancy • Use a Tree Substitution Grammar
Stochastic Tree Substitution Grammar • Set of elementary trees • Tree rewrite process • Redundant model • Statistically relevant phrases • Memory based processing model
Memory based processing model • Data oriented parsing approach: • Corpus of utterances: past experience • STSG to analyze new input • In order to describe a specific DOP model, we must specify: • A formalism for representing utterance-analyses • An extraction function • Combination operations • A probability model
A Simple Data Oriented Parsing Model: DOP1 • DOP1 is illustrated with an imaginary corpus of two trees • Possible sub trees t of a corpus tree T: • t consists of more than one node • t is connected • except for the leaf nodes of t, each node in t has the same daughter nodes as the corresponding node in T • Stochastic Tree Substitution Grammar: the set of these sub trees • Generation process by composition: • A ∘ B : B is substituted on the leftmost non-terminal leaf node of A (see the sketch below)
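A minimal sketch of the composition operation in Python, assuming a hypothetical Node representation in which a substitution site is a node whose children are not yet specified; this is an illustration of the idea, not the authors' implementation:

class Node:
    """Tree node; children is None for a substitution site (open non-terminal leaf),
    an empty list for a terminal leaf, and a list of Nodes otherwise."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children

def compose(a, b):
    """A ∘ B: substitute tree b on the leftmost non-terminal leaf node of a."""
    def substitute(node):
        if node.children is None:                 # leftmost open substitution site
            if node.label != b.label:
                raise ValueError("root of B must match the substitution site")
            return b, True
        if not node.children:                     # terminal leaf, nothing to do
            return node, False
        new_children, done = [], False
        for child in node.children:
            if not done:
                child, done = substitute(child)
            new_children.append(child)
        return Node(node.label, new_children), done

    result, done = substitute(a)
    if not done:
        raise ValueError("A has no open substitution site")
    return result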
Derivation and parse #1 She saw the dress with the telescope.
Derivation and parse #2 She saw the dress with the telescope.
Probability Computations: • Probability of substituting a sub tree t on a specific node • Probability of Derivation • Probability of Parse Tree
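A sketch of the three quantities, following the standard DOP1 definitions; counts and root_of are hypothetical helpers (counts maps each sub tree, in some hashable encoding, to its corpus frequency, and root_of returns its root label):

from math import prod

def subtree_probability(t, counts, root_of):
    """P(t): count of t divided by the total count of sub trees with the same root label."""
    same_root_total = sum(c for s, c in counts.items() if root_of(s) == root_of(t))
    return counts[t] / same_root_total

def derivation_probability(subtrees, counts, root_of):
    """P(d) for d = t1 ∘ t2 ∘ ... ∘ tn: the product of the sub tree probabilities."""
    return prod(subtree_probability(t, counts, root_of) for t in subtrees)

def parse_probability(derivations, counts, root_of):
    """P(T): the sum of the probabilities of all derivations that yield parse T."""
    return sum(derivation_probability(d, counts, root_of) for d in derivations)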
Computational Aspects of DOP1 • Parsing • Disambiguation • Most Probable Derivation • Most Probable Parse • Optimizations
Parsing • Chart-like parse forest • Derivation forest • Elementary tree t as a context-free rule: root(t) -> yield(t) • Label each phrase with its syntactic category and its full elementary tree (see the sketch below)
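A small sketch of that view, reusing the hypothetical Node class from the composition sketch above: each elementary tree becomes the rule root(t) -> yield(t), while the tree itself is kept so a chart entry can also be labeled with it:

def tree_to_rule(t):
    """Return (root(t), yield(t), t) for use as a context-free rule in a chart parser."""
    def frontier(node):
        if not node.children:               # substitution site or terminal leaf
            return [node.label]
        return [label for child in node.children for label in frontier(child)]
    return t.label, tuple(frontier(t)), t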
Elementary trees of an example STSG (chart over the input string abcd, positions 0 to 4)
Disambiguation • The derivation forest defines all derivations and parses • The most likely parse must be chosen • MPP in DOP1 • MPP vs. MPD
Most Probable Derivation • Viterbi algorithm: • Eliminate low-probability sub-derivations in a bottom-up fashion • Select the most probable sub-derivation at each chart entry and eliminate the other sub-derivations of that root node
Viterbi algorithm • Two derivations for abc • P(d1) > P(d2): eliminate the right derivation (d2)
Algorithm 1 – Computing the probability of the most probable derivation • Input: STSG, S, R, P • Elementary trees in R are in CNF • A -> t H : tree t with root A and a sequence H of labels • <A, i, j> : non-terminal A in chart entry (i, j) after parsing the input W1,...,Wn • P_MPD : probability of the MPD of the input string W1,...,Wn
Algorithm 1 – Computing the probability of most probable derivation
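The slide's pseudocode is not reproduced here; the following is a CKY/Viterbi-style sketch of the same idea under an assumed rule encoding (A, t, H, p): root label A, elementary tree t, CNF right-hand side H (one terminal or two non-terminal labels), and probability p = P(t):

from collections import defaultdict

def mpd_probability(words, rules):
    """Bottom-up computation of best[(A, i, j)], the probability of the most probable
    sub-derivation of words[i:j] rooted in A; best[(S, 0, n)] is then P_MPD."""
    n = len(words)
    best = defaultdict(float)
    for i, w in enumerate(words):                          # lexical rules A -> t w
        for A, t, H, p in rules:
            if H == (w,):
                best[(A, i, i + 1)] = max(best[(A, i, i + 1)], p)
    for length in range(2, n + 1):                         # larger spans, bottom-up
        for i in range(n - length + 1):
            j = i + length
            for A, t, H, p in rules:
                if len(H) != 2:
                    continue
                B, C = H
                for k in range(i + 1, j):                  # split point
                    score = p * best[(B, i, k)] * best[(C, k, j)]
                    if score > best[(A, i, j)]:            # keep only the best sub-derivation
                        best[(A, i, j)] = score
    return best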
The Most Probable Parse • Computing the MPP in an STSG is NP-hard • Monte Carlo method: • Sample derivations • Observe the most frequently generated parse tree • Estimate parse tree probabilities • Random-first search • The algorithm • Law of Large Numbers
Algorithm 2: Sampling a random derivation • for length := 1 to n do • for start := 0 to n - length do • for each root node X in chart-entry (start, start + length) do: 1. select at random a tree from the distribution of elementary trees with root node X 2. eliminate the other elementary trees with root node X from this chart-entry
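A sketch of this sampling step in Python, assuming a hypothetical derivation-forest layout in which chart[(start, end)] maps each root label X to a list of (elementary_tree, probability) pairs for that chart entry:

import random

def sample_random_derivation(chart, n):
    """Pick one elementary tree per (span, root label), proportionally to its probability."""
    sampled = {}
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            entry = chart.get((start, start + length), {})
            for X, candidates in entry.items():
                trees, probs = zip(*candidates)
                # select one tree at random from the distribution; the rest are discarded
                sampled[(start, start + length, X)] = random.choices(trees, weights=probs, k=1)[0]
    return sampled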
Results of Algorithm 2 • A random derivation for the whole sentence • A first guess for the MPP • Compute the size of the sampling set: • Probability of error • Upper bound • 0: index of the MPP; i: index of parse i; N: number of sampled derivations • No unique MPP: ambiguity
Conclusions – lower bound for N • Lower bound for N: • Pi is the probability of parse i • B is the probability estimated from the frequencies in N samples • Var(B) = Pi*(1-Pi)/N • 0 <= Pi*(1-Pi) <= 1/4 -> Var(B) <= 1/(4*N) • s = sqrt(Var(B)) -> s <= 1/(2*sqrt(N)) • Equivalently, N >= 1/(4*s^2) • N >= 100 -> s <= 0.05
Algorithm 3: Estimating the parse probabilities • Given a derivation forest of a sentence and a threshold sm for the standard error: • N := the smallest integer larger than 1/(4*sm^2) • repeat N times: • sample a random derivation from the derivation forest • store the parse generated by this derivation • for each parse i: • estimate the conditional probability given the sentence by pi := #(i) / N
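A sketch of Algorithm 3, assuming a hypothetical sample_parse() callback that samples one random derivation (for instance with the sampling sketch above) and returns the parse it generates:

from collections import Counter

def estimate_parse_probabilities(sample_parse, sm):
    """Estimate the conditional parse probabilities of a sentence by Monte Carlo sampling."""
    N = int(1 / (4 * sm ** 2)) + 1            # smallest integer larger than 1/(4*sm^2)
    counts = Counter(sample_parse() for _ in range(N))
    return {parse: count / N for parse, count in counts.items()}   # pi := #(i) / N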
Complexity of Algorithm 3 • Assumes a maximum allowed standard error • Samples a number of derivations that is guaranteed to achieve that error • The number of samples needed grows quadratically as the allowed error shrinks (N is proportional to 1/sm^2)
Optimizations • Sima’an: computes the MPD in time linear in the STSG size • Bod: estimates the MPP on a small random sample of sub trees • Sekine and Grishman: use only sub trees rooted in S or NP • Goodman: a different polynomial-time approach
Experimental Properties of DOP1 • Experiments on the ATIS corpus • MPP vs. MPD • Impact of fragment size • Impact of fragment lexicalization • Impact of fragment frequency • Experiments on SRI-ATIS and OVIS • Impact of sub tree depth
Experiments on ATIS corpus • ATIS = Air Travel Information System • 750 annotated sentence analyses • Annotated according to the Penn Treebank scheme • Purpose: compare the accuracy obtained with undiluted DOP1 to the accuracy obtained with restricted STSGs
Experiments on ATIS corpus • Divide into training and test sets • 90% = 675 sentences in the training set • 10% = 75 sentences in the test set • Convert the training set into fragments and enrich them with probabilities • Test set sentences are parsed with sub trees from the training set • The MPP was estimated from 100 sampled derivations • Parse accuracy = % of MPPs that are identical to the test set parses
Results • On 10 random training / test splits of ATIS: • Average parse accuracy = 84.2% • Standard deviation = 2.9 %
Impact of overlapping fragments: MPP vs. MPD • Can the MPD achieve parse accuracies similar to the MPP? • Can the MPD do better than the MPP? • Overlapping fragments • Accuracy obtained by the MPD on the test set: 69% • Compared to the accuracy achieved with the MPP on the test set: 69% vs. 85% • Conclusion: overlapping fragments play an important role in predicting the appropriate analysis of a sentence
The impact of fragment size • Large fragments capture more lexical/syntactic dependencies than small ones • The experiment: • Use DOP1 with a restricted maximum depth • Max depth 1 -> DOP1 = SCFG • Compute the accuracies for both the MPD and the MPP at each maximum depth
Impact of fragment lexicalization • Lexicalized fragments • More words -> more lexical dependencies • Experiment: • Different versions of DOP1 • Restrict the maximum number of words per fragment • Check accuracy for the MPP and the MPD
Impact of fragment frequency • Frequent fragments contribute more • Large fragments are less frequent than small ones but might contribute more • Experiment: • Restrict fragments to a minimum number of occurrences • No other restrictions • Check accuracy for the MPP
Experiments on SRI-ATIS and OVIS • Employ the MPD because these corpora are bigger • Tests performed on DOP1 and SDOP • Use a set of heuristic criteria for selecting the fragments (see the sketch below): • Constraints on the form of sub trees: • d : upper bound on depth • n : number of substitution sites • l : number of terminals • L : number of consecutive terminals • Apply the constraints to all sub trees except those of depth 1
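A sketch of how such a fragment filter might look, assuming the sub tree statistics (depth, number of substitution sites, number of terminals, longest run of consecutive terminals) have already been computed by hypothetical helper code:

def keep_fragment(depth, sites, terminals, longest_run, d, n, l, L):
    """Apply the d/n/l/L upper bounds; depth-1 sub trees are always kept."""
    if depth == 1:
        return True
    return depth <= d and sites <= n and terminals <= l and longest_run <= L

# Example with the d4 n2 l7 L3 setting used below:
keep_fragment(depth=3, sites=2, terminals=5, longest_run=2, d=4, n=2, l=7, L=3)   # -> True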
Experiments on SRI-ATIS and OVIS • d4 n2 l7 L3 (depth <= 4, at most 2 substitution sites, at most 7 terminals, at most 3 consecutive terminals) • DOP(i): DOP1 with sub tree depth bounded by i • Evaluation metrics: • Recognized • Tree Language Coverage (TLC) • Exact match • Labeled bracketing recall and precision
Experiments on SRI-ATIS • 13335 syntactically annotated utterances • The annotation scheme originates from the Core Language Engine system • Fixed parameters except the sub tree depth bound: • n2 l4 L3 • Training set: 12335 trees • Test set: 1000 trees • Experiment: • Train and test with different depth upper bounds (takes more than 10 days for DOP(4))
Experiments on OVIS corpus • 10000 syntactically and semantically annotated trees • Both annotations are treated as one • More non-terminal symbols • Utterances are answers to questions in a dialog -> short utterances (average length 3.43 words) • Sima’an's results: sentences with at least 2 words, average length 4.57 words • n2 l7 L3