
### Learning with Probabilistic Features for Improved Pipeline Models

Razvan C. Bunescu

Electrical Engineering and Computer Science

Ohio University

Athens, OH

EMNLP, October 2008

Introduction

- NLP systems often depend on the output of other NLP systems.

[Pipeline diagram: POS Tagging → Syntactic Parsing → {Question Answering, Named Entity Recognition, Semantic Role Labeling}]

Traditional Pipeline Model: M1

- The best annotation from one stage is used in subsequent stages.

[Pipeline diagram: x → POS Tagging → Syntactic Parsing]

- Problem: Errors propagate between pipeline stages!

Probabilistic Pipeline Model: M2

- All possible annotations from one stage are used in subsequent stages.

[Pipeline diagram: x → POS Tagging → Syntactic Parsing, passing probabilistic features]

- Problem: Z(x) has exponential cardinality!

Probabilistic Pipeline Model: M2

- Feature-wise formulation: each feature value becomes an expectation over all annotations $z \in Z(x)$:
  $\phi_i(x, y) = \sum_{z \in Z(x)} p(z|x)\, \phi_i(x, y, z)$
- When the original $\phi_i$'s are count features, it can be shown that:
  $\phi_i(x, y) = \sum_{F \in F_i} p(F|x)$
- Here $F$ is an instance of feature $\phi_i$, i.e. the actual evidence used from example $(x, y, z)$, and $F_i$ is the set of all instances of feature $\phi_i$ in $(x, y, z)$, across all annotations $z \in Z(x)$.
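To make the count-feature expectation concrete, here is a minimal Python sketch. The `(feature_id, probability)` pair representation is an assumption made for illustration, not the paper's data structures.

```python
from collections import defaultdict

def expected_count_features(instances):
    """Probabilistic features phi_i(x, y) = sum over F in F_i of p(F | x).

    `instances` is an iterable of (feature_id, probability) pairs, one per
    feature instance F, where probability is the marginal p(F | x)
    supplied by the upstream stage (hypothetical layout).
    """
    phi = defaultdict(float)
    for feature_id, prob in instances:
        phi[feature_id] += prob  # expectation of a count feature
    return dict(phi)

# Instances of the feature "RB -> VBD", one per candidate arc, weighted by
# the marginal probabilities shown on the next slide:
instances = [("RB->VBD", p) for p in (0.91, 0.01, 0.1, 0.001, 0.001, 0.002)]
print(expected_count_features(instances))  # {'RB->VBD': 1.024}
```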

Example: POS → Dependency Parsing

- Feature $\phi_i \equiv$ RB → VBD (an adverb attached to a past-tense verb).
- The set of feature instances $F_i$ contains one instance for every ordered pair of word positions, each weighted by its marginal probability: 0.91, 0.01, 0.1, 0.001, 0.001, 0.002, …

[Figure: candidate RB → VBD dependency arcs over "The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11", each arc labeled with its probability]

- There are N(N-1) feature instances in $F_i$.

Example: POS → Dependency Parsing

- Feature $\phi_i \equiv$ RB → VBD uses a limited amount of evidence:
  - the set of feature instances $F_i$ has cardinality N(N-1).
  - computing the probability of one instance takes $O(N|P|^2)$ time using a constrained version of the forward-backward algorithm.
- Therefore, computing $\phi_i$ takes $O(N^3|P|^2)$ time.
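The constrained forward-backward idea can be sketched as follows: run the forward pass twice, once unrestricted and once with positions u and v clamped to the desired tags, and take the ratio of the two partition sums. This NumPy sketch assumes log potentials with emissions folded into the transitions; it is an illustration, not the paper's implementation.

```python
import numpy as np

def forward(log_pot, mask):
    """Forward pass over a linear chain.  log_pot[t, i, j] is the log
    potential for tag i at position t-1 followed by tag j at position t
    (emission folded in); mask[t, j] = False forbids tag j at position t."""
    N, P = mask.shape
    alpha = np.full((N, P), -np.inf)
    alpha[0, mask[0]] = log_pot[0, 0, mask[0]]  # dummy previous tag at t = 0
    for t in range(1, N):
        for j in range(P):
            if mask[t, j]:
                alpha[t, j] = np.logaddexp.reduce(alpha[t - 1] + log_pot[t, :, j])
    return np.logaddexp.reduce(alpha[-1])  # log partition over allowed paths

def constrained_marginal(log_pot, u, tag_u, v, tag_v):
    """p(t_u = tag_u, t_v = tag_v | x) as a ratio of partition sums."""
    N, P = log_pot.shape[0], log_pot.shape[1]
    mask = np.ones((N, P), dtype=bool)
    log_z = forward(log_pot, mask)              # unconstrained sum
    mask[u, :] = False; mask[u, tag_u] = True   # clamp position u
    mask[v, :] = False; mask[v, tag_v] = True   # clamp position v
    return np.exp(forward(log_pot, mask) - log_z)
```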

Probabilistic Pipeline Model: M2

- All possible annotations from one stage are used in subsequent stages.

[Pipeline diagram: x → POS Tagging → Syntactic Parsing, with probabilistic features computed in polynomial time]

- In general, the time complexity of computing $\phi_i$ depends on the complexity of the evidence used by feature $\phi_i$.

Probabilistic Pipeline Model: M3

- The best annotation $\hat z$ from one stage is used in subsequent stages, together with its probabilistic confidence:
  $\phi_i(x, y) = \sum_{F \in F_i(\hat z)} p(F|x)$
  where $F_i(\hat z)$ is the set of instances of feature $\phi_i$ using only the best annotation $\hat z$.

[Pipeline diagram: x → POS Tagging → Syntactic Parsing]

Probabilistic Pipeline Model: M3

- Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features.
- More efficient than M2, but less accurate.
- Example: POS → Dependency Parsing, showing features generated by the template $t_i t_j$ and their probabilities:

[Figure: POS tag sequence y = DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11 over x = "The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11", with tag-bigram confidences 0.81, 0.92, 0.85, 0.97, 0.95, 0.98, 0.91, 0.97, 0.90, 0.98]
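Under M3, the tagger emits a single best sequence plus a confidence per feature instance, so feature extraction reduces to weighting the instances from that one annotation. A minimal sketch, assuming bigram instances and dictionary-valued features (both illustrative choices):

```python
def m3_features(best_tags, bigram_conf):
    """M3-style features: instances come only from the best annotation,
    each weighted by its confidence p(F | x)."""
    phi = {}
    for (a, b), p in zip(zip(best_tags, best_tags[1:]), bigram_conf):
        key = f"{a}->{b}"
        phi[key] = phi.get(key, 0.0) + p   # confidence-weighted count
    return phi

tags = ["DT", "NNS", "RB", "VBD", "EX", "MD", "VB", "NNS", "IN", "DT", "NN"]
conf = [0.81, 0.92, 0.85, 0.97, 0.95, 0.98, 0.91, 0.97, 0.90, 0.98]
print(m3_features(tags, conf))  # {'DT->NNS': 0.81, 'NNS->RB': 0.92, ...}
```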

Two Applications

- Dependency Parsing
- Named Entity Recognition

[Pipeline diagrams: (1) x → POS Tagging → Syntactic Parsing; (2) x → POS Tagging → Syntactic Parsing → Named Entity Recognition]

1) Dependency Parsing

- Use MSTParser [McDonald et al. 2005]:
  - The score of a dependency tree is the sum of its edge scores:
    $s(x, y) = \sum_{(u,v) \in y} s(x, u, v)$
  - Feature templates use words and POS tags at positions u and v and their neighbors u ± 1 and v ± 1.
- Use a CRF [Lafferty et al. 2001] POS tagger:
  - Compute probabilistic features using a constrained forward-backward procedure.
  - Example: feature $t_i t_j$ has probability $p(t_i, t_j \mid x)$:
    constrain the state transitions to pass through tags $t_i$ and $t_j$.
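The edge-factored scoring makes the parser's search decomposable over arcs; a tiny sketch (dictionary-based edge scores are an illustrative assumption):

```python
def tree_score(arcs, edge_score):
    """MSTParser-style edge-factored scoring: the score of a dependency
    tree is the sum of the scores of its arcs (head, dependent)."""
    return sum(edge_score[(u, v)] for (u, v) in arcs)

# Toy example: arcs of a 3-word tree, scored independently.
edge_score = {(2, 1): 1.3, (0, 2): 0.7}   # (head, dependent) -> score
print(tree_score([(2, 1), (0, 2)], edge_score))  # 2.0
```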

1) Dependency Parsing

- Two approximations of model M2:
- Model M2’:
  - Consider POS tags independent:
    $p(t_i = \text{RB}, t_j = \text{VBD} \mid x) \approx p(t_i = \text{RB} \mid x) \cdot p(t_j = \text{VBD} \mid x)$
  - Ignore tags with low marginal probability, i.e. $p(t_i)$ below a threshold on the order of $1/|P|$.
- Model M2”:
  - Like M2’, but use constrained forward-backward to compute joint marginal probabilities when the tag chunks are less than 4 tokens apart.
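A minimal sketch of the M2’ approximation, assuming per-position tag marginals stored in dictionaries and taking the pruning threshold to be exactly 1/|P| (the slide only fixes its order of magnitude):

```python
def m2prime_pair_prob(marg_u, marg_v, tag_u, tag_v, num_tags):
    """M2' approximation: treat the two POS tags as independent and prune
    low-probability tags.  marg_u[t] = p(t_u = t | x) from the CRF tagger."""
    threshold = 1.0 / num_tags
    p_u, p_v = marg_u.get(tag_u, 0.0), marg_v.get(tag_v, 0.0)
    if p_u < threshold or p_v < threshold:
        return 0.0                 # ignore low-probability tags
    return p_u * p_v               # independence approximation

# p(t_3 = RB, t_4 = VBD | x) with the 45-tag Penn Treebank tagset:
print(m2prime_pair_prob({"RB": 0.91}, {"VBD": 0.97}, "RB", "VBD", 45))
```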

1) Dependency Parsing: Results

- Train MSTParser on sections 2-21 of the Penn WSJ Treebank using gold POS tags.
- Test MSTParser on section 23, using POS tags from the CRF tagger.
- Absolute error reduction of “only” 0.19%:
  - But the POS tagger has a very high accuracy of 96.25%.

- Expect more substantial improvement when upstream stages in the pipeline are less accurate.

2) Named Entity Recognition

- Model NER as a sequence tagging problem using CRFs:

[Figure: layered annotations for the sequence tagging formulation:
x: The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
z1: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11 (POS tags)
z2: dependency tree over x
y: O I O O O O O O O O O (entity tags)]

- Flat features: unigrams, bigrams and trigrams that extend either left or right:
  - sailors, the sailors, sailors RB, sailors RB thought, …
- Tree features: unigrams, bigrams and trigrams that extend in any direction in the undirected dependency tree (see the sketch below):
  - sailors-thought, sailors-thought-RB, NNS-thought-RB, …
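A minimal sketch of tree n-gram extraction over the undirected dependency tree; the head-array encoding and word-only labels are illustrative assumptions (the slide also mixes POS tags into the n-grams, which is omitted here):

```python
def tree_ngrams(words, heads, i, max_len=3):
    """Enumerate unigram/bigram/trigram features that extend from token i
    in any direction of the undirected dependency tree.
    heads[j] is the index of token j's head (-1 for the root)."""
    adj = {j: set() for j in range(len(words))}
    for j, h in enumerate(heads):
        if h >= 0:
            adj[j].add(h)
            adj[h].add(j)
    feats, paths = [], [[i]]
    for _ in range(max_len):
        feats += ["-".join(words[k] for k in path) for path in paths]
        paths = [path + [n] for path in paths
                 for n in adj[path[-1]] if n not in path]
    return feats

words = ["The", "sailors", "mistakenly", "thought", "there", "must",
         "be", "diamonds", "in", "the", "soil"]
heads = [1, 3, 3, -1, 5, 3, 5, 6, 7, 10, 8]   # hypothetical parse
print(tree_ngrams(words, heads, 1))   # 'sailors', 'sailors-thought', ...
```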

Named Entity Recognition: Model M2

[Pipeline diagram: x → POS Tagging → Syntactic Parsing → Named Entity Recognition]

- Probabilistic features: each NER feature instance is weighted by the joint probability of the POS and dependency evidence it uses.
- Example feature NNS2 ← thought4 → RB3: its probability is the joint probability $p(t_2 = \text{NNS},\, 4{\to}2,\, 4{\to}3 \mid x)$ of the POS tag and the two dependency arcs.

Named Entity Recognition: Model M3’

- M3’ is an approximation of M3 in which confidence scores are computed as follows:
  - Consider POS tagging and dependency parsing independent.
  - Consider POS tags independent.
  - Consider dependency arcs independent.
- Example feature NNS2 ← thought4 → RB3:
  $p(F \mid x) \approx p(t_2 = \text{NNS} \mid x) \cdot p(4{\to}2 \mid x) \cdot p(4{\to}3 \mid x)$
- Need to compute the arc marginals $p(u{\to}v \mid x)$.
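A minimal sketch of the M3’ confidence computation under the three independence assumptions; the dictionary-based marginals and the numbers are illustrative:

```python
def m3prime_feature_prob(tag_marg, arc_marg, tags, arcs):
    """M3' confidence for one feature instance: a product of independent
    POS-tag and dependency-arc marginals."""
    p = 1.0
    for pos, tag in tags:
        p *= tag_marg.get((pos, tag), 0.0)   # p(t_pos = tag | x)
    for head, dep in arcs:
        p *= arc_marg.get((head, dep), 0.0)  # p(head -> dep | x)
    return p

# Feature NNS2 <- thought4 -> RB3:
tag_marg = {(2, "NNS"): 0.92}
arc_marg = {(4, 2): 0.95, (4, 3): 0.90}
print(m3prime_feature_prob(tag_marg, arc_marg,
                           tags=[(2, "NNS")], arcs=[(4, 2), (4, 3)]))
```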

Probabilistic Dependency Features

- To compute probabilistic POS features, we used a constrained version of the forward-backward algorithm.
- To compute probabilistic dependency features, we use a constrained version of Eisner’s algorithm:
  - Compute normalized scores $n(u{\to}v \mid x)$ using the softmax function over tree scores:
    $n(u{\to}v \mid x) = \sum_{y:\,(u,v) \in y} e^{s(x,y)} \,/\, \sum_{y} e^{s(x,y)}$
  - Transform the scores $n(u{\to}v \mid x)$ into probabilities $p(u{\to}v \mid x)$ using isotonic regression [Zadrozny & Elkan, 2002].
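A minimal calibration sketch using scikit-learn's IsotonicRegression; the library choice and the toy data are assumptions (the paper calibrates on WSJ section 23):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Map normalized arc scores n(u->v | x) to probabilities p(u->v | x):
# fit a monotone function on held-out arcs labeled by whether they
# appear in the gold tree, in the spirit of [Zadrozny & Elkan, 2002].
scores = np.array([0.2, 0.4, 0.55, 0.7, 0.9])   # toy n(u->v | x) values
in_gold = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # 1 if arc is in the gold tree
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores, in_gold)
print(iso.predict([0.6]))    # calibrated p(u->v | x) for a new score
```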

Named Entity Recognition: Results

- Implemented the CRF models in MALLET [McCallum, 2002].
- Trained and tested on the standard split of the ACE 2002 + 2003 corpus (674 training, 97 testing).
- The POS tagger and MSTParser were trained on sections 2-21 of the Penn WSJ Treebank.
- Isotonic regression for MSTParser was calibrated on section 23.

[Table: area under the precision-recall curve for each model]

Named Entity Recognition: Results

- M3’ (probabilistic) vs. M1 (traditional) using tree features:

[Figure: precision-recall curves comparing M3’ and M1 with tree features]

Conclusions & Related Work

- A general method for improving the communication between consecutive stages in pipeline models:
  - based on computing expectations for count features.
  - an effective method for associating probabilities with output substructures.
  - adds polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.
- Can be seen as complementary to the sampling approach of [Finkel et al. 2006]:
  - approximate vs. exact in polynomial time.
  - used in testing vs. used in training and testing.

Future Work

- Try the full model M2 or its approximation M2’ on NER.
- Extend model to pipeline graphs containing cycles.
