Data-Driven Dependency Parsing

Kenji Sagae, CSCI-544

- Syntactic analysis
- String to (tree) structure

[Figure: the PARSER takes the input string "He likes fish" (Input) and produces a tree (Output) with the nodes S, NP, VP, Prn, V and N over the words He, likes, fish.]

- Useful in Natural Language Understanding
- NL interfaces, conversational agents

- Language technology applications
- Machine translation, question answering, information extraction

- Scientific study of language
- Syntax
- Language processing models

[Figure: the PARSER uses a GRAMMAR to map "He likes fish" to its tree.]

GRAMMAR:

S → NP VP
NP → N
NP → NP PP
VP → V NP
VP → V NP PP
VP → VP PP
…

Not enough coverage, too much ambiguity

[Figure: a TREEBANK of parsed sentences (trees with the nodes S, NP, VP, AdvP, Det, N, Prn, V and Adv over words such as The, Dogs, boy, He, run, runs, likes, fast, fish) is shown alongside the GRAMMAR that the PARSER uses for "He likes fish".]

Charniak (1996); Collins (1996); Charniak (1997)

Phrase Structure Tree (Constituent Structure)

[Figure: phrase structure tree for "The boy ate the cheese sandwich" with the nodes S, NP, VP, Det, N and V. In a second version each phrase is marked with its head word: S and VP with "ate", the subject NP with "boy", the object NP with "sandwich".]

Dependency Structure

[Figure: dependency tree for the same sentence. Each link connects a HEAD to a DEPENDENT and carries a LABEL.]

HEAD       LABEL   DEPENDENT
ate        SUBJ    boy
ate        OBJ     sandwich
boy        DET     The
sandwich   DET     the
sandwich   MOD     cheese
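A dependency structure like the one for "The boy ate the cheese sandwich" is often stored as one head index and one label per word. The sketch below is only an illustration: the artificial root at index 0 and the "ROOT" label on the main verb are assumptions, not something shown on the slides.

```python
# Words of the example sentence, indexed from 1; index 0 is an artificial root.
words = ["<root>", "The", "boy", "ate", "the", "cheese", "sandwich"]

# heads[i] is the index of the head of words[i]; labels[i] names that link.
# The "ROOT" label for the main verb is an assumption made for this sketch.
heads = [None, 2, 3, 0, 6, 6, 3]
labels = [None, "DET", "SUBJ", "ROOT", "DET", "MOD", "OBJ"]


def dependents(head_index):
    """Return (dependent word, label) pairs attached to the given head."""
    return [(words[i], labels[i])
            for i in range(1, len(words)) if heads[i] == head_index]


def is_tree(heads):
    """Check that every word reaches the root (index 0) without a cycle."""
    for i in range(1, len(heads)):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


print(dependents(3))   # dependents of "ate"
print(is_tree(heads))
```

Because every word has exactly one head, a single list of length n is enough to describe the whole tree.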

- Classification: given an input x, predict output y
- Example: x is a document, y ∈ {Sports, Politics, Science}

- x is represented as a feature vector f(x)
- Example:

x: Wednesday night, when the Lakers play the Mavericks at American Airlines Center, they get to see first hand …
f(x): # games: 5, # Lakers: 4, # said: 3, # rebounds: 3, # democrat: 0, # republican: 0, # science: 0
y: Sports

- Scoring: just add up the feature weights given in a vector w
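Scoring a document for a class amounts to a dot product between a weight vector and f(x). The sketch below uses the feature counts from the Lakers example; the weights are hand-picked, hypothetical values (nothing here is learned or taken from the slides), and the function names are made up for the illustration.

```python
from collections import Counter


def features(document):
    """Bag-of-words feature vector f(x): lowercased word -> count."""
    return Counter(document.lower().split())


def score(weights, feats):
    """Dot product w . f(x): each feature's weight times its count, summed."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def classify(class_weights, feats):
    """Pick the class whose weight vector gives the highest score."""
    return max(class_weights, key=lambda c: score(class_weights[c], feats))


# Hypothetical, hand-picked weights (an assumption for this sketch).
class_weights = {
    "Sports":   {"lakers": 2.0, "games": 1.5, "rebounds": 1.0},
    "Politics": {"democrat": 2.0, "republican": 2.0, "said": 0.5},
    "Science":  {"science": 2.0},
}

# Feature counts as listed in the example.
f_x = Counter({"games": 5, "lakers": 4, "said": 3, "rebounds": 3,
               "democrat": 0, "republican": 0, "science": 0})

print(classify(class_weights, f_x))   # Sports
```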

- Learn vectors of feature weights w_c, one for each class c

w_c = 0 for each class c
For N iterations
    For each training example (x_i, y_i)
        z_i = argmax_z w_z · f(x_i)
        if z_i ≠ y_i
            w_{z_i} = w_{z_i} - f(x_i)
            w_{y_i} = w_{y_i} + f(x_i)

- Try to classify each example. If a mistake is made, update the weights.
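The perceptron update above can be sketched in a few lines of Python. The training examples are toy data invented for this illustration, sparse feature vectors are stored as dicts, and ties in the argmax go to the first class in the list (a choice made here, not one stated on the slide).

```python
from collections import defaultdict


def dot(w, feats):
    """w . f(x) for a sparse feature dict."""
    return sum(w[name] * value for name, value in feats.items())


def predict(weights, classes, feats):
    """argmax over classes; ties go to the first class in `classes`."""
    return max(classes, key=lambda c: dot(weights[c], feats))


def train_perceptron(examples, classes, n_iterations=5):
    """examples: list of (feature dict, gold class). Returns class -> weights."""
    weights = {c: defaultdict(float) for c in classes}       # w_c = 0
    for _ in range(n_iterations):
        for feats, gold in examples:
            predicted = predict(weights, classes, feats)
            if predicted != gold:                            # mistake: update
                for name, value in feats.items():
                    weights[predicted][name] -= value        # w_z = w_z - f(x)
                    weights[gold][name] += value             # w_y = w_y + f(x)
    return weights


# Toy training data, invented for illustration.
classes = ["Sports", "Politics", "Science"]
examples = [
    ({"lakers": 4, "games": 5}, "Sports"),
    ({"democrat": 3, "said": 2}, "Politics"),
    ({"science": 2, "said": 1}, "Science"),
]
weights = train_perceptron(examples, classes)
print([predict(weights, classes, feats) for feats, _ in examples])
```

On this toy data the weights stop changing after the first pass, since every example is then classified correctly.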

- Two main data structures
- Stack S (initially empty)
- Queue Q (initialized to contain each word in the input sentence)

- Two types of actions
- Shift: removes a word from Q, pushes onto S
- Reduce: pops two items from S, pushes a new item onto S
- New item is a tree that contains the two popped items

- This can be applied to either dependencies (Nivre, 2004) or constituents (Sagae & Lavie, 2005)
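The two actions can be sketched directly on a Python list (the stack S) and a deque (the queue Q). Action names follow the REDUCE-RIGHT-SUBJ / REDUCE-LEFT-OBJ pattern used in the slides' example; reading RIGHT as "the top item becomes the head" and LEFT as "the item below it becomes the head" is an interpretation assumed here, and the tuple representation of items is made up for the sketch.

```python
from collections import deque


def shift(stack, queue):
    """Shift: remove the next word from Q and push it onto S."""
    stack.append(queue.popleft())


def reduce_(stack, head_side, label):
    """Reduce: pop two items from S, push one new item containing both.

    Items are (word, [(label, dependent item), ...]) tuples. With
    head_side="right" the top item is the head; with "left" the item
    below it is the head.
    """
    right = stack.pop()
    left = stack.pop()
    head, dep = (right, left) if head_side == "right" else (left, right)
    stack.append((head[0], head[1] + [(label, dep)]))


def parse(words, actions):
    """Apply a given action sequence and return the final stack and queue."""
    stack, queue = [], deque((w, []) for w in words)
    for action in actions:
        if action == "SHIFT":
            shift(stack, queue)
        else:
            _, side, label = action.split("-")   # e.g. "REDUCE-RIGHT-SUBJ"
            reduce_(stack, side.lower(), label)
    return stack, queue


actions = ["SHIFT", "SHIFT", "REDUCE-RIGHT-SUBJ", "SHIFT", "REDUCE-LEFT-OBJ"]
stack, queue = parse(["He", "likes", "fish"], actions)
print(stack)
```

A sentence of n words needs n shifts and n - 1 reduces, so a full parse takes 2n - 1 actions.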

SHIFT

[Figure: stack and input string before and after SHIFT, for a sentence beginning "Under a proposal… to expand IRAs a …". A shift action removes the next token ("to") from the input list and pushes this new item onto the stack.]

REDUCE-RIGHT-VMOD

[Figure: stack and input before and after REDUCE. A reduce action pops the top two items ("to" and "expand") and pushes a new item, "to expand", joined by a VMOD link. The remaining input (starting "IRAs a $2000") is unchanged.]

Parser Action        STACK                          QUEUE
(start)              (empty)                        He likes fish
SHIFT                He                             likes fish
SHIFT                He, likes                      fish
REDUCE-RIGHT-SUBJ    [He likes] (SUBJ)              fish
SHIFT                [He likes] (SUBJ), fish        (empty)
REDUCE-LEFT-OBJ      [He likes fish] (SUBJ, OBJ)    (empty)

- No grammar, no action table
- Learn to associate stack/queue configurations with appropriate parser actions
- Classifier
- Treated as a black-box
- Perceptron, SVM, maximum entropy, memory-based learning, etc.
- Features: top two items on the stack, next input token, context, lookahead, …
- Classes: parser actions
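With the classifier treated as a black box, the parser itself is a short greedy loop: ask for an action, apply it, repeat. In the sketch below `toy_classifier` is a hypothetical, hand-written stand-in that only handles "He likes fish"; a real system would put a trained model in its place. The return format of the classifier is also an assumption of this sketch.

```python
from collections import deque


def greedy_parse(words, choose_action):
    """Parse by repeatedly asking a classifier for the next action.

    choose_action(stack, queue) returns "SHIFT" or a (head_side, label)
    pair for a reduce. Returns a list of (head, label, dependent) links.
    """
    stack, queue, links = [], deque(words), []
    while queue or len(stack) > 1:
        action = choose_action(stack, queue)
        if action == "SHIFT":
            stack.append(queue.popleft())
        else:
            head_side, label = action
            right, left = stack.pop(), stack.pop()
            head, dep = (right, left) if head_side == "right" else (left, right)
            links.append((head, label, dep))
            stack.append(head)      # the head word stands for the new subtree
    return links


def toy_classifier(stack, queue):
    """Hypothetical stand-in for a trained classifier (one sentence only)."""
    if len(stack) == 2 and stack[-1] == "likes" and stack[-2] == "He":
        return ("right", "SUBJ")
    if len(stack) == 2 and stack[-2] == "likes":
        return ("left", "OBJ")
    return "SHIFT"


links = greedy_parse(["He", "likes", "fish"], toy_classifier)
print(links)   # [('likes', 'SUBJ', 'He'), ('likes', 'OBJ', 'fish')]
```

Each decision is final, which is what makes the parser fast but also greedy: an early mistake cannot be undone later.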

Features:

stack(0) = likes    stack(0).POS = VBZ
stack(1) = He       stack(1).POS = PRP
stack(2) = 0        stack(2).POS = 0
queue(0) = fish     queue(0).POS = NN
queue(1) = 0        queue(1).POS = 0
queue(2) = 0        queue(2).POS = 0

Class: Reduce-Right-SUBJ

[Figure: the STACK holds "He" and "likes" and the QUEUE holds "fish". After Reduce-Right-SUBJ, the stack holds "He likes" joined by a SUBJ link and the queue still holds "fish".]
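The feature table for a configuration can be produced mechanically from the stack and the queue. The sketch below assumes items are (word, POS) pairs and uses "0" for empty positions, as in the example; the function name and its parameters are made up for the illustration.

```python
def extract_features(stack, queue, depth=3, empty="0"):
    """Build the feature dict for one parser configuration.

    stack and queue hold (word, POS) pairs; stack[-1] is the top of the
    stack and queue[0] is the next input token. Missing positions get
    the value "0".
    """
    feats = {}
    for i in range(depth):
        word, pos = stack[-1 - i] if i < len(stack) else (empty, empty)
        feats[f"stack({i})"] = word
        feats[f"stack({i}).POS"] = pos
        word, pos = queue[i] if i < len(queue) else (empty, empty)
        feats[f"queue({i})"] = word
        feats[f"queue({i}).POS"] = pos
    return feats


stack = [("He", "PRP"), ("likes", "VBZ")]
queue = [("fish", "NN")]
feats = extract_features(stack, queue)
for name in sorted(feats):
    print(name, "=", feats[name])
```

Each (feature dict, action) pair collected while replaying gold-standard parses becomes one training example for the classifier.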

- Experiments:
- WSJ Penn Treebank
- 1M words of WSJ text
- Accuracy: ~90% (unlabeled dependency links)

- Other languages (CoNLL 2006, 2007 shared tasks)
- Arabic, Basque, Chinese, Czech, Japanese, Greek, Hungarian, Turkish, …
- Accuracy: about 75% to 92%

- Good accuracy, fast (linear time), easy to implement!

- Dependency tree is a graph (obviously)
- Words are vertices, dependency links are edges

- Imagine instead a fully connected weighted graph
- Each weight is the score for the dependency link
- Each score is independent of other dependencies
- Edge-factored model

- Find the Maximum Spanning Tree
- Score for the tree is the sum of the scores of its individual dependencies

- How are edge weights determined?

[Figure: the dependency tree for "I ate a sandwich" (word positions 1 to 4) drawn as a graph over the nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich), followed by the fully connected graph over the same nodes, where each directed edge carries a score, and by the highest-scoring spanning tree selected from it.]
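For a four-word sentence the best tree under an edge-factored model can be found by simply trying every head assignment. The sketch below does exactly that; it is an exhaustive search, not the Chu-Liu/Edmonds algorithm, and is only practical for tiny inputs. The edge scores are made up for the illustration, and the sketch allows more than one word to attach to the root.

```python
from itertools import product


def is_tree(heads):
    """heads[i-1] is the head of word i (0 = root). True if there is no cycle."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def best_tree(n_words, score):
    """Try every head assignment; score[(h, d)] is the weight of edge h -> d."""
    best, best_score = None, float("-inf")
    for heads in product(range(n_words + 1), repeat=n_words):
        if not is_tree(heads):          # also rejects words heading themselves
            continue
        total = sum(score[(h, d)] for d, h in enumerate(heads, start=1))
        if total > best_score:
            best, best_score = heads, total
    return best, best_score


# "I ate a sandwich": 1 = I, 2 = ate, 3 = a, 4 = sandwich, 0 = root.
# Hypothetical scores: 0 everywhere except a few hand-picked edges.
score = {(h, d): 0 for h in range(5) for d in range(1, 5) if h != d}
score.update({(0, 2): 12, (2, 1): 8, (2, 4): 9, (4, 3): 9, (1, 2): 5, (3, 4): 3})

print(best_tree(4, score))   # ((2, 0, 4, 2), 38)
```

The score of a tree is just the sum of its edge scores, which is what lets a spanning-tree algorithm replace this brute-force loop.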

- x is a sentence, G is a dependency tree, f(G) is a vector of features for the entire tree
- Features:
h(ate):d(sandwich)    hPOS(VBD):dPOS(NN)
h(ate):d(I)           hPOS(VBD):dPOS(PRP)
h(sandwich):d(a)      hPOS(NN):dPOS(DT)
hPOS(VBD)  hPOS(NN)  dPOS(NN)  dPOS(DT)  dPOS(NN)  dPOS(PRP)
h(ate)  h(sandwich)  d(sandwich)
… (many more)

- To assign edge weights, we learn a feature weight vector w
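Given a weight vector w, the weight of a single edge is the sum of the weights of the features that fire on that link. The sketch below builds feature names in the h(...):d(...) style listed above; the weights are hand-picked, hypothetical values, and the "<root>" word and "ROOT" tag are assumptions added for the artificial root node.

```python
def edge_features(words, tags, head, dep):
    """Features for one dependency link from words[head] to words[dep]."""
    return [
        f"h({words[head]}):d({words[dep]})",
        f"hPOS({tags[head]}):dPOS({tags[dep]})",
        f"h({words[head]})",
        f"d({words[dep]})",
        f"hPOS({tags[head]})",
        f"dPOS({tags[dep]})",
    ]


def edge_score(w, words, tags, head, dep):
    """Edge weight: sum of the weights of the features firing on the link."""
    return sum(w.get(name, 0.0) for name in edge_features(words, tags, head, dep))


words = ["<root>", "I", "ate", "a", "sandwich"]
tags = ["ROOT", "PRP", "VBD", "DT", "NN"]

# Hypothetical weights, chosen by hand for illustration.
w = {"h(ate):d(sandwich)": 2.0, "hPOS(VBD):dPOS(NN)": 1.5, "hPOS(DT)": -1.0}

print(edge_score(w, words, tags, 2, 4))   # ate -> sandwich: 3.5
print(edge_score(w, words, tags, 3, 4))   # a -> sandwich: -1.0
```

Since f(G) is the sum of these per-edge feature vectors, w · f(G) equals the sum of the edge scores of the tree.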

- Learn a vector of feature weights w

w = 0
For N iterations
    For each training example (x_i, G_i)
        G'_i = argmax_{G' ∈ GEN(x_i)} w · f(G')
        if G'_i ≠ G_i
            w = w + f(G_i) - f(G'_i)

- The same as before, but to find the argmax we use MST, since each G is a tree (which also contains the corresponding input x). If G'_i is not the right tree, update the weight vector
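The structured perceptron loop can be sketched as follows. Two simplifications are assumed: the argmax over GEN(x) is done by brute-force enumeration instead of an MST algorithm, and f(G) contains only hPOS:dPOS features. The single training example is invented for the illustration, with "ROOT" as an assumed tag for the root node.

```python
from collections import defaultdict
from itertools import product


def is_tree(heads):
    """heads[i-1] is the head of word i (0 = root). True if there is no cycle."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def tree_features(tags, heads):
    """f(G): counts of hPOS:dPOS features over all links of the tree."""
    feats = defaultdict(int)
    for dep, head in enumerate(heads, start=1):
        feats[f"hPOS({tags[head]}):dPOS({tags[dep]})"] += 1
    return feats


def decode(w, tags):
    """argmax over GEN(x) by brute force (a stand-in for MST search)."""
    n = len(tags) - 1
    candidates = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(candidates,
               key=lambda h: sum(w[k] * v for k, v in tree_features(tags, h).items()))


def train(examples, n_iterations=5):
    """examples: list of (tags, gold heads). Returns the weight vector w."""
    w = defaultdict(float)                                   # w = 0
    for _ in range(n_iterations):
        for tags, gold in examples:
            predicted = decode(w, tags)
            if predicted != gold:                            # w += f(G) - f(G')
                for k, v in tree_features(tags, gold).items():
                    w[k] += v
                for k, v in tree_features(tags, predicted).items():
                    w[k] -= v
    return w


# "I ate a sandwich" with an assumed ROOT tag at index 0.
tags = ["ROOT", "PRP", "VBD", "DT", "NN"]
gold = (2, 0, 4, 2)
w = train([(tags, gold)])
print(decode(w, tags))   # (2, 0, 4, 2)
```

Features shared by the gold tree and the predicted tree cancel out in the update, so only the links the parser got wrong change the weights.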

Question: Are there trees that an MST parser can find, but a Shift-Reduce parser* can’t? (*shift-reduce parser as described in slides 13-19)

- The Maximum Spanning Tree algorithm for directed trees (Chu & Liu, 1965; Edmonds, 1967) runs in quadratic time
- Finds the best out of exponentially many trees
- Exact inference!

- Edge-factored: each dependency link is considered independently from the others
- Compare to Shift-Reduce parsing
- Greedy inference
- Rich set of features includes partially built trees

- McDonald and Nivre (2007) show that shift-reduce and MST parsing get similar accuracy, but have different strengths

- By using different types of classifiers and algorithms, we get several different parsers
- Ensemble idea: combine the output of several parsers to obtain a single more accurate result

[Figure: Parser A, Parser B and Parser C each produce a dependency structure for "I like cheese"; the three outputs are combined into a single structure.]

- First, build a graph
- Create a node for each word in the input sentence (plus one extra “root” node)
- Each dependency proposed by any of the parsers is a weighted edge
- If multiple parsers propose the same dependency, add weight to the corresponding edge

- Then, simply find the MST
- Maximizes the votes
- Structure guaranteed to be a dependency tree
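The voting scheme can be sketched by counting how many parsers propose each dependency and then picking the tree with the most votes. As a simplification, the best tree is found here by brute-force enumeration rather than an MST algorithm, every parser gets an equal vote, and the three parser outputs are invented for the illustration.

```python
from collections import Counter
from itertools import product


def is_tree(heads):
    """heads[i-1] is the head of word i (0 = root). True if there is no cycle."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def combine(parser_outputs):
    """Each output is a tuple of heads, one per word. Returns the voted tree."""
    n = len(parser_outputs[0])
    votes = Counter()
    for heads in parser_outputs:
        for dep, head in enumerate(heads, start=1):
            votes[(head, dep)] += 1          # one vote per parser per link
    candidates = (h for h in product(range(n + 1), repeat=n) if is_tree(h))
    return max(candidates,
               key=lambda h: sum(votes[(head, dep)]
                                 for dep, head in enumerate(h, start=1)))


# "I ate a sandwich": 1 = I, 2 = ate, 3 = a, 4 = sandwich, 0 = root.
# Hypothetical outputs of three parsers.
parser_a = (2, 0, 4, 2)
parser_b = (2, 0, 2, 2)   # attaches "a" to "ate"
parser_c = (0, 1, 4, 2)   # makes "I" the root and the head of "ate"

print(combine([parser_a, parser_b, parser_c]))   # (2, 0, 4, 2)
```

Searching over trees rather than taking a per-word majority is what guarantees a well-formed result: per-word voting alone could produce a cycle.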

[Figure: for "I ate a sandwich" (word positions 1 to 4), the dependencies proposed by Parser A, Parser B and Parser C are drawn as edges over the nodes 0 (root), 1 (I), 2 (ate), 3 (a), 4 (sandwich); the MST of the combined graph is the ensemble output.]

- Highest accuracy in CoNLL 2007 shared task on multilingual dependency parsing (a parser bake-off with 22 teams)
- Nilson et al. (2007); Sagae and Tsujii (2007)

- Improvement depends on selection of parsers for the ensemble
- With four parsers with accuracy between 89 and 91, ensemble accuracy = 92.7