
Parsing Unrestricted Text


Presentation Transcript


  1. Parsing Unrestricted Text Joakim Nivre

  2. Two Notions of Parsing • Grammar parsing: • Given a grammar G and an input string x ∈ Σ*, derive some or all of the analyses y assigned to x by G. • Text parsing: • Given a text T = (x1, …, xn), derive the correct analysis yi for every sentence xi ∈ T.
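Grammar parsing in the recognition sense can be made concrete with a minimal sketch (not from the presentation): a CKY recognizer that decides whether x ∈ L(G) for a context-free grammar G in Chomsky normal form. The grammar and sentence below are invented for illustration.

```python
from itertools import product

def cky_recognize(words, lexical, binary, start="S"):
    """Return True iff words is in L(G) for a CNF grammar.

    lexical: dict mapping terminal -> set of nonterminals (A -> a rules)
    binary:  dict mapping (B, C) -> set of nonterminals (A -> B C rules)
    """
    n = len(words)
    # chart[i][j] holds the nonterminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]

# Toy grammar: S -> NP VP, VP -> V NP, plus lexical rules.
lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
binary = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky_recognize(["she", "eats", "fish"], lexical, binary))  # True
```

Note how this makes the abstract mapping from (G, x) to y concrete only for x ∈ L(G): for any other string the chart simply never derives the start symbol, which is exactly the robustness problem discussed on slide 7.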

  3. Grammar Parsing • Properties of grammar parsing: • Abstract problem: Mapping from (G, x) to y. • Parsing implies recognition; analyses defined only if x ∈ L(G). • Correctness (consistency and completeness) can be proven without considering any input string x.

  4. Text Parsing • Properties of text parsing: • Not a well-defined abstract problem (the text language is not a formal language). • Parsing does not imply recognition (recognition presupposes a formal language). • Empirical approximation problem. • Correctness can only be established with reference to empirical samples of the text language (statistical inference).

  5. Two Methods for Text Parsing • Grammar-driven text parsing: • Text parsing approximated by grammar parsing. • Data-driven text parsing: • Text parsing approximated by statistical inference. • Not mutually exclusive methods: • Grammars can be combined with statistical inference (e.g. PCFG).

  6. Grammar-Driven Text Parsing • Basic assumption: • The text language L can be approximated by L(G). • Potential problems (evaluation criteria): • Robustness • Disambiguation • Accuracy • Efficiency

  7. Robustness • Basic issue: • What happens if x ∉ L(G)? • Two cases: • x ∉ L(G), x ∈ L (coverage) • x ∉ L(G), x ∉ L (robustness) • Techniques: • Constraint relaxation • Partial parsing

  8. Disambiguation • Basic issue: • What happens when G assigns more than one analysis y to a sentence x? • Two cases: • String ambiguity (real) (disambiguation) • Grammar ambiguity (spurious) (leakage) • Techniques: • Grammar specialization • Deterministic parsing • Eliminative parsing • Data-driven parsing (e.g. PCFG)

  9. Accuracy • Basic issue: • How often can the parser deliver a single correct analysis? • Grammar-driven techniques: • Linguistically adequate analyses? • Adequacy undermined by techniques to handle robustness and disambiguation.

  10. Efficiency • Theoretical complexity: • Many linguistically motivated formalisms have intractable parsing problems. • Even polynomially parsable formalisms often have high complexity. • Practical efficiency is also affected by: • Grammar constants • Techniques for handling robustness and disambiguation

  11. Data-Driven Text Parsing • Basic assumption: • The text language L can be approximated by statistical inference from text samples. • Components: • A formal model M defining permissible representations for sentences in L • A sample of text Tt = (x1, …, xn) from L, with or without the correct analyses At = (y1, …, yn) • An inductive inference scheme I defining actual analyses for the sentences of any text T = (x1,…,xn) in L, relative to M and Tt (and possibly At)

  12. Robustness • Basic issue: • Is M a grammar or not (cf. PCFG)? • Radical constraint relaxation: • Ensure that every string has at least one analysis. • Example (DOP3): • M permits any parse tree composed from subtrees in Tt, with free insertion of (even unseen) words from x. • Tt is annotated with context-free parse trees. • I defines the probability P(x, y) to be the sum of the probabilities of each derivation of y for x (for any x, y).

  13. Disambiguation • Basic issue: • How to rank different analyses yi of x? • Structure of I: • A parameterized stochastic model M, assigning a score S(x, yi) to each permissible analysis yi of x, relative to a set of parameters θ. • A parsing method, i.e. a method for computing the best yi according to S(x, yi) (given θ). • A learning method, i.e. a method for instantiating θ based on inductive inference from Tt. • Example: PCFG
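As a hedged illustration of this three-part structure, the sketch below instantiates it for the PCFG example: the parameters θ are rule probabilities, the learning method is relative-frequency estimation from a toy treebank Tt, and S(x, y) is the product of the probabilities of the rules used in y. The treebank and trees are invented; a full PCFG system would add a parsing method such as Viterbi CKY to find the best yi.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) productions of a tree written as nested tuples,
    e.g. ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))             # lexical rule A -> a
        return
    yield (label, tuple(c[0] for c in children))  # A -> B C ...
    for child in children:
        yield from rules(child)

def learn(treebank):
    """Learning method: relative-frequency (maximum-likelihood) estimate
    theta[A -> alpha] = count(A -> alpha) / count(A)."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}

def score(tree, theta):
    """S(x, y): product of the probabilities of the rules used in y."""
    p = 1.0
    for r in rules(tree):
        p *= theta.get(r, 0.0)
    return p

treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish"))),
    ("S", ("NP", "fish"), ("VP", ("V", "swim"))),
]
theta = learn(treebank)
print(score(treebank[0], theta))  # 1.0 * 1/3 * 1/2 * 1/2 * 2/3 = 1/18
```

An unseen rule gets probability 0.0 here, which is exactly the robustness problem of slide 12: without radical constraint relaxation, strings requiring unseen rules receive no analysis.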

  14. Accuracy • Basic issue: • How often can the parser deliver a single correct analysis? • Data-driven techniques: • Empirically adequate ranking of alternatives? • Accuracy undermined by combinatorial explosion due to radical constraint relaxation.

  15. Efficiency • Theoretical complexity: • Many data-driven models have intractable inference problems. • Even polynomially parsable models often have high complexity. • Practical efficiency is also affected by: • Model constants • Techniques for handling robustness and disambiguation

  16. Converging Approaches? • Text parsing: • Complex optimization problem • Two optimization strategies: • Start with good accuracy, improve robustness and disambiguation (while controlling efficiency). • Start with good disambiguation (and robustness), improve accuracy (while controlling efficiency). • Strategies converging on the same solution? • Constraint relaxation for robustness • Data-driven models for disambiguation • Heuristic search techniques for efficiency
