Data-Driven Dependency Parsing

Kenji Sagae, CSCI-544

Background: Natural Language Parsing
• Syntactic analysis
• String to (tree) structure
[Figure: the input string "He likes fish" goes into a PARSER, which outputs the tree (S (NP (Prn He)) (VP (V likes) (NP (N fish)))).]


[Figure: PARSER mapping "He likes fish" to its parse tree.]
• Useful in Natural Language Understanding
  • NL interfaces, conversational agents
• Language technology applications
  • Machine translation, question answering, information extraction
• Scientific study of language
  • Syntax
  • Language processing models

[Figure: PARSER mapping "He likes fish" to its parse tree, using a GRAMMAR.]
GRAMMAR:
  S → NP VP
  NP → N
  NP → NP PP
  VP → V NP
  VP → V NP PP
  VP → VP PP
  …
• Not enough coverage, too much ambiguity

[Figure: the PARSER now draws on both the GRAMMAR (S → NP VP, NP → N, NP → NP PP, VP → V NP, VP → V NP PP, VP → VP PP, …) and a TREEBANK of parsed example sentences built from words such as "The", "boy", "Dogs", "run", "runs", "fast".]
• Charniak (1996); Collins (1996); Charniak (1997)


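Phrase Structure Tree (Constituent Structure)
[Figure: constituent tree for "The boy ate the cheese sandwich": (S (NP (Det The) (N boy)) (VP (V ate) (NP (Det the) (N cheese) (N sandwich)))).]
Dependency Structure
[Figure: dependency arcs drawn over the same sentence, "The boy ate the cheese sandwich".]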

[Figure: the same constituent tree with a head word marked on each phrase (S: ate, VP: ate, NP: boy, NP: sandwich), shown next to the dependency structure for "The boy ate the cheese sandwich".]

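[Figure: labeled dependency structure for "The boy ate the cheese sandwich". Each arc connects a HEAD to a DEPENDENT and carries a LABEL.]
• ate → boy (SUBJ)
• ate → sandwich (OBJ)
• boy → The (DET)
• sandwich → the (DET)
• sandwich → cheese (MOD)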

Background: Linear Classification with the Perceptron
• Classification: given an input x, predict output y
  • Example: x is a document, y ∈ {Sports, Politics, Science}
• x is represented as a feature vector f(x)
  • Example:
    x: "Wednesday night, when the Lakers play the Mavericks at American Airlines Center, they get to see first hand …"
    f(x): # games: 5, # Lakers: 4, # said: 3, # rebounds: 3, # democrat: 0, # republican: 0, # science: 0
    y: Sports
• Just add feature weights given in a vector w

Multiclass Perceptron
• Learn a vector of feature weights w_c for each class c

    w_c = 0 for every class c
    For N iterations
        For each training example (x_i, y_i)
            z_i = argmax_z w_z · f(x_i)
            if z_i ≠ y_i
                w_{z_i} = w_{z_i} - f(x_i)
                w_{y_i} = w_{y_i} + f(x_i)

• Try to classify each example. If a mistake is made, update the weights.
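The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the bag-of-words `features` function, the three toy documents, and the tie-breaking rule (the first class in the list wins) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def features(text):
    """Bag-of-words counts, standing in for f(x) (an assumed feature map)."""
    return Counter(text.lower().split())

def score(w_c, f):
    """Sparse dot product w_c . f(x)."""
    return sum(w_c[k] * v for k, v in f.items())

def predict(w, classes, f):
    # argmax over classes; ties go to the earliest class in `classes`
    return max(classes, key=lambda c: score(w[c], f))

def train(data, classes, n_iter=10):
    w = {c: defaultdict(float) for c in classes}    # w_c = 0
    for _ in range(n_iter):
        for text, y in data:
            f = features(text)
            z = predict(w, classes, f)
            if z != y:                                # mistake: update weights
                for k, v in f.items():
                    w[z][k] -= v                      # w_z = w_z - f(x)
                    w[y][k] += v                      # w_y = w_y + f(x)
    return w

# Toy training set, invented for illustration only
data = [
    ("lakers rebounds games lakers", "Sports"),
    ("democrat republican vote", "Politics"),
    ("science experiment lab", "Science"),
]
classes = ["Politics", "Science", "Sports"]
w = train(data, classes)
print(predict(w, classes, features("lakers games")))
```

On this toy data the weights stop changing after the first pass, because every training example is then classified correctly.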

Shift-Reduce Dependency Parsing
• Two main data structures
  • Stack S (initially empty)
  • Queue Q (initialized to contain each word in the input sentence)
• Two types of actions
  • Shift: removes a word from Q, pushes it onto S
  • Reduce: pops two items from S, pushes a new item onto S
    • The new item is a tree that contains the two popped items
• This can be applied to either dependencies (Nivre, 2004) or constituents (Sagae & Lavie, 2005)

Shift
[Figure: the stack and the input string before and after a SHIFT, for a sentence beginning "Under a proposal… to expand IRAs a …"; one of the partial trees carries a PMOD arc.]
• A shift action removes the next token from the input list…
• … and pushes this new item onto the stack

Reduce
[Figure: the stack and the input before and after REDUCE-RIGHT-VMOD. The two stack items "to" and "expand" are combined into a single tree with a VMOD arc, while "IRAs a $2000 …" remains in the input.]
• A reduce action pops these two items…
• … and pushes this new item

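[Figure: step-by-step parse of "He likes fish", showing the STACK, the QUEUE, and the parser action taken at each step. Actions used: SHIFT, REDUCE-RIGHT-SUBJ, REDUCE-LEFT-OBJ. Result: "He" is the SUBJ of "likes" and "fish" is the OBJ of "likes".]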
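The stack and queue mechanics can be sketched as a small Python function that replays a given action sequence. Several things here are assumptions rather than facts from the slides: the function name `parse`, the exact order of the five actions, and the reading of REDUCE-RIGHT as "the right (top) item becomes the head" and REDUCE-LEFT as "the left item becomes the head", which is inferred from the "He likes fish" example. Words are identified by their string form, so this sketch would not distinguish repeated words.

```python
from collections import deque

def parse(words, actions):
    """Replay shift/reduce actions; return (stack, queue, arcs).

    Each arc is a (dependent, label, head) triple.
    """
    stack, queue, arcs = [], deque(words), []
    for action in actions:
        if action == "SHIFT":
            stack.append(queue.popleft())     # move next word from Q to S
        else:
            _, direction, label = action.split("-")
            right = stack.pop()               # top of the stack
            left = stack.pop()                # item just below it
            # Assumed convention: RIGHT means the right item is the head
            head, dep = (right, left) if direction == "RIGHT" else (left, right)
            arcs.append((dep, label, head))
            stack.append(head)                # the head represents the new subtree
    return stack, list(queue), arcs

actions = ["SHIFT", "SHIFT", "REDUCE-RIGHT-SUBJ", "SHIFT", "REDUCE-LEFT-OBJ"]
stack, queue, arcs = parse(["He", "likes", "fish"], actions)
print(arcs)   # [('He', 'SUBJ', 'likes'), ('fish', 'OBJ', 'likes')]
```

Each word is shifted once and removed by a reduce at most once, which is why this style of parser runs in time linear in the sentence length.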

Choosing Parser Actions
• No grammar, no action table
• Learn to associate stack/queue configurations with appropriate parser actions
• Classifier
  • Treated as a black box
  • Perceptron, SVM, maximum entropy, memory-based learning, etc.
  • Features: top two items on the stack, next input token, context, lookahead, …
  • Classes: parser actions

Features:
  stack(0) = likes    stack(0).POS = VBZ
  stack(1) = He       stack(1).POS = PRP
  stack(2) = 0        stack(2).POS = 0
  queue(0) = fish     queue(0).POS = NN
  queue(1) = 0        queue(1).POS = 0
  queue(2) = 0        queue(2).POS = 0
[Figure: the STACK holds "He" and "likes" ("likes" on top); the QUEUE holds "fish".]

Features:
  stack(0) = likes    stack(0).POS = VBZ
  stack(1) = He       stack(1).POS = PRP
  stack(2) = 0        stack(2).POS = 0
  queue(0) = fish     queue(0).POS = NN
  queue(1) = 0        queue(1).POS = 0
  queue(2) = 0        queue(2).POS = 0
Class: Reduce-Right-SUBJ
[Figure: the STACK holds "He" and "likes" ("likes" on top); the QUEUE holds "fish".]
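Turning a parser configuration into this kind of feature list might look like the following Python sketch. The function name `extract_features`, the (word, POS) pair representation, and the use of the string "0" for empty positions are assumptions chosen to mirror the slide.

```python
def extract_features(stack, queue, n=3):
    """Build features from the top n stack items and the next n queue items.

    stack[-1] is the top of the stack; every item is a (word, POS) pair.
    Missing positions are filled with "0", as on the slide.
    """
    empty = ("0", "0")
    feats = {}
    for i in range(n):
        s = stack[-1 - i] if i < len(stack) else empty
        q = queue[i] if i < len(queue) else empty
        feats[f"stack({i})"], feats[f"stack({i}).POS"] = s
        feats[f"queue({i})"], feats[f"queue({i}).POS"] = q
    return feats

stack = [("He", "PRP"), ("likes", "VBZ")]
queue = [("fish", "NN")]
feats = extract_features(stack, queue)
for name in sorted(feats):
    print(name, "=", feats[name])
```

A training instance for the classifier is then the pair (`feats`, "Reduce-Right-SUBJ"); one such pair is produced for every step of every gold-standard parse in the treebank.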

[Figure: the same features and class, shown as the action is applied. After Reduce-Right-SUBJ, "He" is attached to "likes" with a SUBJ arc; "fish" is still on the QUEUE.]

Accurate Parsing with Greedy Search
• Experiments:
  • WSJ Penn Treebank
    • 1M words of WSJ text
    • Accuracy: ~90% (unlabeled dependency links)
  • Other languages (CoNLL 2006, 2007 shared tasks)
    • Arabic, Basque, Chinese, Czech, Japanese, Greek, Hungarian, Turkish, …
    • About 75% to 92%
• Good accuracy, fast (linear time), easy to implement!

Maximum Spanning Tree Parsing (McDonald et al., 2005)
• Dependency tree is a graph (obviously)
  • Words are vertices, dependency links are edges
• Imagine instead a fully connected weighted graph
  • Each weight is the score for the dependency link
  • Each score is independent of the other dependencies
    • Edge-factored model
• Find the Maximum Spanning Tree
  • Score for the tree is the sum of the scores of its individual dependencies
• How are edge weights determined?
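To make the edge-factored objective concrete, the sketch below scores a tree as the sum of its edge scores and finds the best tree by brute-force enumeration. Note the substitutions: this is not the Chu-Liu/Edmonds algorithm that an MST parser would use, only an exhaustive search that is feasible for a four-word sentence, and the score matrix is invented for illustration (it is not the one from the lecture figures).

```python
from itertools import product

def is_tree(heads):
    """heads[d-1] is the head of word d (words are 1..n, 0 is the root).

    True if following heads from every word reaches the root (no cycles).
    """
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def tree_score(heads, score):
    """Edge-factored score: sum of score[head][dependent] over all words."""
    return sum(score[h][d] for d, h in enumerate(heads, start=1))

def best_tree(score):
    """Exhaustive argmax over all valid head assignments (tiny inputs only)."""
    n = len(score) - 1
    trees = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(trees, key=lambda h: tree_score(h, score))

# Hypothetical scores for "I ate a sandwich": score[head][dependent]
# Nodes: 0 = root, 1 = I, 2 = ate, 3 = a, 4 = sandwich
score = [
    [0,  1, 12, -8,  2],   # head = root
    [0,  0,  3,  1,  1],   # head = I
    [0,  8,  0,  3,  9],   # head = ate
    [0, -2,  0,  0, 10],   # head = a
    [0,  1,  3,  9,  0],   # head = sandwich
]
heads = best_tree(score)
print(heads, tree_score(heads, score))   # (2, 0, 4, 2) 38
```

With these scores, picking the best head for each word independently would attach "a" to "sandwich" and "sandwich" to "a", which is a cycle rather than a tree. Searching over whole trees avoids this, which is the reason the problem is posed as finding a maximum spanning tree.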

[Figure: the sentence "I ate a sandwich" (words numbered 1 to 4) drawn as a graph with nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich).]

[Figure: the same nodes as a fully connected graph with a numeric score on every edge; the scores shown range from -11 to 12. The figure appears on two consecutive slides.]

Structured Classification
• x is a sentence, G is a dependency tree, f(G) is a vector of features for the entire tree
• Features:
    h(ate):d(sandwich)    hPOS(VBD):dPOS(NN)
    h(ate):d(I)           hPOS(VBD):dPOS(PRP)
    h(sandwich):d(a)      hPOS(NN):dPOS(DT)
    hPOS(VBD)  hPOS(NN)  dPOS(NN)  dPOS(DT)  dPOS(NN)  dPOS(PRP)
    h(ate)  h(sandwich)  d(sandwich)
    … (many more)
• To assign edge weights, we learn a feature weight vector w
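One way to read the feature list above: every dependency edge contributes a handful of head/dependent features, and f(G) is the sum of these over all edges in the tree. The Python sketch below assumes that reading. The function names, the root placeholder token, and the example weights are hypothetical, and only the six templates visible on the slide are implemented.

```python
from collections import Counter

def edge_features(words, tags, h, d):
    """Features for a single edge from head index h to dependent index d."""
    return [
        f"h({words[h]}):d({words[d]})",
        f"hPOS({tags[h]}):dPOS({tags[d]})",
        f"h({words[h]})",
        f"d({words[d]})",
        f"hPOS({tags[h]})",
        f"dPOS({tags[d]})",
    ]

def tree_features(words, tags, heads):
    """f(G): feature counts summed over all edges of the tree."""
    f = Counter()
    for d, h in enumerate(heads, start=1):
        f.update(edge_features(words, tags, h, d))
    return f

def edge_score(w, words, tags, h, d):
    """Edge weight: sum of the learned weights of the edge's features."""
    return sum(w.get(k, 0.0) for k in edge_features(words, tags, h, d))

# Index 0 is an assumed placeholder for the root node
words = ["<root>", "I", "ate", "a", "sandwich"]
tags = ["ROOT", "PRP", "VBD", "DT", "NN"]
heads = (2, 0, 4, 2)      # head of I, ate, a, sandwich

f = tree_features(words, tags, heads)
w = {"h(ate):d(sandwich)": 2.0, "hPOS(VBD)": 0.5}   # hypothetical weights
tree_total = sum(w.get(k, 0.0) * v for k, v in f.items())
edge_total = sum(edge_score(w, words, tags, h, d)
                 for d, h in enumerate(heads, start=1))
print(tree_total, edge_total)   # 3.0 3.0
```

The two totals agree because the model is edge-factored: w · f(G) decomposes into a sum of per-edge scores, and those per-edge scores are the edge weights of the graph.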

Structured Perceptron
• Learn a vector of feature weights w

    w = 0
    For N iterations
        For each training example (x_i, G_i)
            G'_i = argmax_{G' ∈ GEN(x_i)} w · f(G')
            if G'_i ≠ G_i
                w = w + f(G_i) - f(G'_i)

• The same as before, but to find the argmax we use MST, since each G is a tree (which also contains the corresponding input x). If G'_i is not the right tree, update the feature weight vector.
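A compact Python sketch of this training loop follows. To stay short and self-contained it makes several substitutions that should be read as assumptions: the argmax over GEN(x) is done by brute-force enumeration instead of an MST algorithm, only two feature templates are used, and the training set is a single invented sentence.

```python
from collections import Counter, defaultdict
from itertools import product

def tree_features(sent, heads):
    """f(G) for a sentence of (word, POS) pairs; index 0 is the root."""
    f = Counter()
    for d, h in enumerate(heads, start=1):
        (hw, hp), (dw, dp) = sent[h], sent[d]
        f[f"h({hw}):d({dw})"] += 1
        f[f"hPOS({hp}):dPOS({dp})"] += 1
    return f

def is_tree(heads):
    """True if every word reaches the root (node 0) without a cycle."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def decode(sent, w):
    """argmax over GEN(x) by enumeration (a stand-in for MST search)."""
    n = len(sent) - 1
    def score(heads):
        return sum(w[k] * v for k, v in tree_features(sent, heads).items())
    trees = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(trees, key=score)

def train(data, n_iter=5):
    w = defaultdict(float)                       # w = 0
    for _ in range(n_iter):
        for sent, gold in data:
            pred = decode(sent, w)
            if pred != gold:                     # w = w + f(G_i) - f(G'_i)
                for k, v in tree_features(sent, gold).items():
                    w[k] += v
                for k, v in tree_features(sent, pred).items():
                    w[k] -= v
    return w

# One invented training example: "I ate a sandwich"
sent = [("<root>", "ROOT"), ("I", "PRP"), ("ate", "VBD"),
        ("a", "DT"), ("sandwich", "NN")]
gold = (2, 0, 4, 2)
w = train([(sent, gold)])
print(decode(sent, w))   # (2, 0, 4, 2)
```

Features shared by the gold tree and the predicted tree cancel in the update, so only the edges on which the two trees disagree change the weights.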

Question: Are there trees that an MST parser can find, but a Shift-Reduce parser* can't?
(*shift-reduce parser as described in slides 13-19)

Accurate Parsing with Edge-Factored Models
• The Maximum Spanning Tree algorithm for directed trees (Chu & Liu, 1965; Edmonds, 1967) runs in quadratic time
  • Finds the best out of exponentially many trees
  • Exact inference!
• Edge-factored: each dependency link is considered independently from the others
  • Compare to Shift-Reduce parsing
    • Greedy inference
    • Rich set of features includes partially built trees
• McDonald and Nivre (2007) show that shift-reduce and MST parsing get similar accuracy, but have different strengths

Parser Ensembles
• By using different types of classifiers and algorithms, we get several different parsers
• Ensemble idea: combine the output of several parsers to obtain a single more accurate result
[Figure: Parser A, Parser B, and Parser C each produce a dependency analysis of "I like cheese"; the three analyses are combined into a single analysis.]

Parser Ensembles with Maximum Spanning Trees (Sagae and Lavie, 2006)
• First, build a graph
  • Create a node for each word in the input sentence (plus one extra "root" node)
  • Each dependency proposed by any of the parsers is a weighted edge
  • If multiple parsers propose the same dependency, add weight to the corresponding edge
• Then, simply find the MST
  • Maximizes the votes
  • Structure guaranteed to be a dependency tree
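The voting scheme can be sketched in Python as follows. The three parser outputs are invented, every parser's vote counts equally (weighting votes differently is possible but not shown), and the best tree is again found by brute-force enumeration as a stand-in for a real MST algorithm.

```python
from collections import Counter
from itertools import product

def vote_graph(parses):
    """Edge weight = number of parsers proposing that (head, dependent) edge."""
    votes = Counter()
    for heads in parses:
        for d, h in enumerate(heads, start=1):
            votes[(h, d)] += 1
    return votes

def is_tree(heads):
    """True if every word reaches the root (node 0) without a cycle."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def combine(parses):
    """Return the tree with the most votes (enumeration, tiny inputs only)."""
    n = len(parses[0])
    votes = vote_graph(parses)
    def total(heads):
        return sum(votes[(h, d)] for d, h in enumerate(heads, start=1))
    trees = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(trees, key=total)

# Hypothetical outputs for "I ate a sandwich" (head of each word; 0 = root)
parser_a = (2, 0, 4, 2)
parser_b = (2, 0, 2, 2)   # attaches "a" to "ate"
parser_c = (4, 0, 4, 2)   # attaches "I" to "sandwich"
combined = combine([parser_a, parser_b, parser_c])
print(combined)   # (2, 0, 4, 2)
```

Each of the two attachment errors is outvoted two to one, so the combined tree is correct for every word. Because the result is chosen among trees only, it is always a well-formed dependency tree, which a plain word-by-word majority vote would not guarantee.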

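[Figure: the sentence "I ate a sandwich" (words numbered 1 to 4) as a graph with nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich); the edges are the dependencies proposed by Parser A, Parser B, and Parser C.]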

MST Parser Ensembles Are Very Accurate
• Highest accuracy in CoNLL 2007 shared task on multilingual dependency parsing (a parser bake-off with 22 teams)
  • Nilsson et al. (2007); Sagae and Tsujii (2007)
• Improvement depends on the selection of parsers for the ensemble
  • With four parsers with accuracy between 89 and 91, ensemble accuracy = 92.7
