Learning with Probabilistic Features for Improved Pipeline Models








### Learning with Probabilistic Features for Improved Pipeline Models

Razvan C. Bunescu

Electrical Engineering and Computer Science

Ohio University

Athens, OH

bunescu@ohio.edu

EMNLP, October 2008

Introduction
• NLP systems often depend on the output of other NLP systems.

POS Tagging → Syntactic Parsing → Named Entity Recognition → Semantic Role Labeling

• The best annotation from one stage is used in subsequent stages:

x → POS Tagging → Syntactic Parsing
• Problem: Errors propagate between pipeline stages!
Probabilistic Pipeline Model: M2
• All possible annotationsfrom one stage are used in subsequent stages.

x

POS Tagging

Syntactic Parsing

probabilistic features

• Problem: Z(x) has exponential cardinality!
Probabilistic Pipeline Model: M2
• Feature-wise formulation:

  φi(x, y) = Σz∈Z(x) p(z|x) · φi(x, y, z)

• When the original φi's are count features, it can be shown that:

  φi(x, y) = Σf∈Fi p(f|x)

where f is an instance of feature i, i.e. the actual evidence used from example (x, y, z), and Fi is the set of all instances of feature i in (x, y, z), across all annotations z ∈ Z(x).
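The feature-wise formulation can be checked on a toy example. This is a sketch, not the paper's implementation: p(z|x) is assumed to factor into independent per-token tag marginals (a real CRF does not factor this way), and all tags and probabilities are made up.

```python
from itertools import product

# Sketch (toy numbers): check that the expectation of a count feature
# phi_i over all annotations z in Z(x) equals the sum of p(f|x) over its
# feature instances f in F_i. p(z|x) is assumed to factor into
# independent per-token tag marginals, a simplification of the CRF.
TAGS = ["RB", "VBD", "NN"]

marginals = [  # per-token tag marginals for a 3-token sentence
    {"RB": 0.7, "VBD": 0.2, "NN": 0.1},
    {"RB": 0.1, "VBD": 0.8, "NN": 0.1},
    {"RB": 0.2, "VBD": 0.3, "NN": 0.5},
]

def seq_prob(z):
    """p(z|x) under the independence assumption."""
    p = 1.0
    for t, m in zip(z, marginals):
        p *= m[t]
    return p

def count_feature(z):
    """phi_i(x, y, z): count of ordered position pairs tagged RB, VBD."""
    n = len(z)
    return sum(1.0 for u in range(n) for v in range(n)
               if u != v and z[u] == "RB" and z[v] == "VBD")

# expectation by brute-force enumeration over Z(x)
expectation = sum(seq_prob(z) * count_feature(z)
                  for z in product(TAGS, repeat=3))

# feature-wise sum over the N(N-1) instances in F_i
instance_sum = sum(marginals[u]["RB"] * marginals[v]["VBD"]
                   for u in range(3) for v in range(3) if u != v)

assert abs(expectation - instance_sum) < 1e-12
```

Brute-force enumeration is only feasible here because |Z(x)| is tiny; the point of the feature-wise formulation is precisely to avoid it.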

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

RB

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

RB

RB

VBD

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

0.1

RB

RB

RB

VBD

VBD

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

0.1

0.001

RB

RB

RB

VBD

VBD

VBD

RB

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

0.1

0.001

0.001

RB

RB

RB

VBD

VBD

VBD

RB

RB

VBD

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

0.1

0.001

0.001

0.002

RB

RB

RB

VBD

VBD

VBD

RB

RB

RB

VBD

VBD

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

Example: POS Dependency Parsing
• Feature i RB  VBD
• The set of feature instances Fi is:

0.91

0.01

0.1

0.001

0.001

0.002

RB

RB

RB

VBD

VBD

VBD

RB

RB

RB

VBD

VBD

VBD

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

 N(N-1) feature instances in Fi .

Example: POS Dependency Parsing
• Feature i RB  VBD uses a limited amount of evidence:
•  the set of feature instances Fi has cardinality N(N-1).
•  computing takes O(N|P|2) time using a constrained version of
• the forward-backward algorithm:
• Therefore, computing i takes O(N3|P|2) time.
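The constrained forward–backward idea can be sketched on a toy linear chain: clamping two positions to fixed tags and renormalizing yields the joint tag marginal. The potentials below are arbitrary illustrative numbers, not a trained CRF, and the brute-force check only works because the chain is tiny.

```python
import itertools

# Sketch of a constrained forward pass: summing only over tag sequences
# that pass through clamped tags, then dividing by the partition sum,
# gives p(t_i = a, t_j = b | x). Potentials are toy numbers.
TAGS = [0, 1]   # two toy tags
N = 4           # chain length

em = [[1.0, 2.0], [2.0, 1.0], [1.0, 3.0], [2.0, 2.0]]  # emission potentials
tr = [[1.0, 2.0], [3.0, 1.0]]                           # transition potentials

def path_score(z):
    s = em[0][z[0]]
    for k in range(1, N):
        s *= tr[z[k - 1]][z[k]] * em[k][z[k]]
    return s

def constrained_mass(clamp):
    """Sum of path scores over all tag sequences consistent with `clamp`,
    a dict {position: required tag}, via a forward pass."""
    alpha = [em[0][t] if clamp.get(0, t) == t else 0.0 for t in TAGS]
    for k in range(1, N):
        alpha = [sum(alpha[s] * tr[s][t] for s in TAGS) * em[k][t]
                 if clamp.get(k, t) == t else 0.0
                 for t in TAGS]
    return sum(alpha)

Z = constrained_mass({})                      # unconstrained partition sum
p_pair = constrained_mass({1: 0, 3: 1}) / Z   # p(t_1 = 0, t_3 = 1 | x)

# brute-force check over all 2^4 tag sequences
brute = sum(path_score(z) for z in itertools.product(TAGS, repeat=N)
            if z[1] == 0 and z[3] == 1) / Z
assert abs(p_pair - brute) < 1e-12
```

Each forward pass is O(N|P|²); running one per clamped pair gives the O(N³|P|²) total quoted on the slide.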
Probabilistic Pipeline Model: M2
• All possible annotations from one stage are used in subsequent stages.

x → POS Tagging → Syntactic Parsing   (probabilistic features computed in polynomial time)

• In general, the time complexity of computing φi depends on the complexity of the evidence used by feature i.
Probabilistic Pipeline Model: M3
• The best annotation from one stage is used in subsequent stages, together with its probabilistic confidence:

x → POS Tagging → Syntactic Parsing

• Fi is now the set of instances of feature i that use only the best annotation.

Probabilistic Pipeline Model: M3
• Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features.
• More efficient than M2, but less accurate.
• Example: POS → Dependency Parsing
• shows features generated by template ti → tj and their probabilities.

y: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11

x: The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

(each generated feature carries a tagger confidence between 0.81 and 0.98)
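The difference between M1's hard count features and M3's confidence-weighted features can be sketched as follows. The tag sequence and per-tag confidences are illustrative values in the spirit of the running example, and the helper names are ours, not the paper's.

```python
# Sketch: M1 uses a hard 0/1 feature instance from the one-best POS
# annotation; M3 keeps the same instance but scales it by the tagger's
# confidence in the tags it uses. Values below are illustrative.
best_tags = ["DT", "NNS", "RB", "VBD"]   # one-best POS annotation
conf      = [0.81, 0.92, 0.85, 0.97]     # tagger confidence per tag

def m1_feature(u, v, ti, tj):
    """Traditional pipeline M1: a hard 0/1 feature instance."""
    return 1.0 if best_tags[u] == ti and best_tags[v] == tj else 0.0

def m3_feature(u, v, ti, tj):
    """M3: the same instance, scaled by its probabilistic confidence."""
    return m1_feature(u, v, ti, tj) * conf[u] * conf[v]

assert m1_feature(2, 3, "RB", "VBD") == 1.0
assert abs(m3_feature(2, 3, "RB", "VBD") - 0.85 * 0.97) < 1e-12
assert m3_feature(0, 1, "RB", "VBD") == 0.0   # tags do not match
```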

Two Applications
• Dependency Parsing:

x → POS Tagging → Syntactic Parsing

• Named Entity Recognition:

x → POS Tagging → Syntactic Parsing → Named Entity Recognition

1) Dependency Parsing
• Use MSTParser [McDonald et al. 2005]:
• The score of a dependency tree is the sum of its edge scores:
  s(x, y) = Σ(u→v)∈y s(u, v)
• Feature templates use words and POS tags at positions u and v and their neighbors u ± 1 and v ± 1.
• Use a CRF [Lafferty et al. 2001] POS tagger:
• Compute probabilistic features using a constrained forward–backward procedure.
• Example: feature ti → tj has probability p(ti, tj | x):
• constrain the state transitions to pass through tags ti and tj.
1) Dependency Parsing
• Two approximations of model M2:
• Model M2′:
• Consider POS tags independent:
  p(ti = RB, tj = VBD | x) ≈ p(ti = RB | x) · p(tj = VBD | x)
• Ignore tags with low marginal probability:
  p(ti) ≤ 1/(β|P|)
• Model M2″:
• Like M2′, but use constrained forward–backward to compute joint marginal probabilities when the tag chunks are less than 4 tokens apart.
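The M2′ approximation can be sketched in a few lines: a joint tag probability is replaced by a product of per-position marginals, and any tag whose marginal falls below the threshold is pruned. The β constant and the marginals below are illustrative assumptions, not values from the paper.

```python
# Sketch of the M2' approximation: independent per-position marginals
# plus pruning of tags whose marginal falls below 1/(beta * |P|).
# BETA and the marginal distributions are made-up illustrative values.
BETA = 2.0

def m2_prime(marg_u, marg_v, ti, tj):
    """Approximate p(t_u = ti, t_v = tj | x); marg_* map tag -> marginal."""
    cutoff = 1.0 / (BETA * len(marg_u))   # |P| = size of the tag set
    pu, pv = marg_u.get(ti, 0.0), marg_v.get(tj, 0.0)
    if pu < cutoff or pv < cutoff:
        return 0.0                        # pruned: tag too unlikely
    return pu * pv

marg3 = {"RB": 0.91, "VBD": 0.05, "NN": 0.04}   # e.g. "mistakenly"
marg4 = {"RB": 0.02, "VBD": 0.95, "NN": 0.03}   # e.g. "thought"

assert abs(m2_prime(marg3, marg4, "RB", "VBD") - 0.91 * 0.95) < 1e-12
assert m2_prime(marg3, marg4, "VBD", "RB") == 0.0   # both below the cutoff
```

Pruning keeps the number of surviving feature instances small, which is what makes the approximation fast.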
1) Dependency Parsing: Results
• Train MSTParser on sections 2–21 of the Penn WSJ Treebank using gold POS tags.
• Test MSTParser on section 23, using POS tags from the CRF tagger.
• Absolute error reduction of “only” 0.19% :
• But POS tagger has a very high accuracy of 96.25%.
• Expect more substantial improvement when upstream stages in the pipeline are less accurate.
2) Named Entity Recognition
• Model NER as a sequence tagging problem using CRFs:

x:  The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
z1: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11   (POS tags)
z2: (dependency tree over x)
y:  O I O O O O O O O O O   (entity labels)

• Flat features: unigrams, bigrams and trigrams that extend either left or right:
• sailors, the sailors, sailors RB, sailors RB thought, …
• Tree features: unigrams, bigrams and trigrams that extend in any direction in the undirected dependency tree:
• sailors ← thought, sailors ← thought → RB, NNS ← thought → RB, …
Named Entity Recognition: Model M2

x → POS Tagging → Syntactic Parsing → Named Entity Recognition

• Probabilistic features:
• Example feature NNS2 ← thought4 → RB3:
Named Entity Recognition: Model M3’
• M3’ is an approximation of M3 in which confidence scores are computed as follows:
• Consider POS tagging and dependency parsing independent.
• Consider POS tags independent.
• Consider dependency arcs independent.
• Example feature NNS2thought4  RB3:
• Need to compute marginals p(uv|x).
Probabilistic Dependency Features
• To compute probabilistic POS features, we used a constrained version of the forward-backward algorithm.
• To compute probabilistic dependency features, we use a constrained version of Eisner’s algorithm:
• Compute normalized scores n(u → v | x) using the softmax function.
• Transform the scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002].
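The two calibration steps can be sketched as follows. The `pav` function is a plain pool-adjacent-violators implementation standing in for the isotonic regression of [Zadrozny & Elkan, 2002], and all scores and 0/1 correctness labels are toy values.

```python
import math

# Sketch: softmax over a token's candidate head scores, then
# pool-adjacent-violators (PAV) isotonic regression to map normalized
# scores to calibrated probabilities. All inputs are toy values.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def pav(values, labels):
    """Monotone (isotonic) fit of 0/1 labels, ordered by their raw values."""
    pairs = sorted(zip(values, labels))
    merged = []                      # blocks of [mean, weight]
    for _, lab in pairs:
        merged.append([float(lab), 1.0])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            w = w1 + w2
            merged.append([(m1 * w1 + m2 * w2) / w, w])
    out = []
    for mean, w in merged:           # expand blocks back to points
        out.extend([mean] * int(w))
    return out

n = softmax([2.0, 1.0, 0.5])         # normalized scores n(u -> v | x)
cal = pav([0.1, 0.4, 0.35, 0.8], [0, 1, 1, 0])   # calibrated probabilities

assert abs(sum(n) - 1.0) < 1e-12 and n[0] > n[1] > n[2]
assert all(cal[i] <= cal[i + 1] + 1e-12 for i in range(len(cal) - 1))
```

The last two violating labels get pooled into one block, which is exactly the monotonicity repair that isotonic regression performs.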
Named Entity Recognition: Results
• Implemented the CRF models in MALLET [McCallum, 2002]
• Trained and tested on the standard split from the ACE 2002 + 2003 corpus (674 training, 97 testing).
• POS tagger and MSTParser were trained on sections 2-21 of WSJ Treebank
• Isotonic regression for MSTParser was calibrated on section 23.

(table of results: area under the precision–recall curve)

Named Entity Recognition: Results
• M3’ (probabilistic) vs. M1 (traditional) using tree features:
Conclusions & Related Work
• A general method for improving the communication between consecutive stages in pipeline models:
• based on computing expectations for count features.
• an effective method for associating probabilities with output substructures.
• adds polynomial time complexity to pipeline whenever the inference step at each stage is done in polynomial time.
• Can be seen as complementary to the sampling approach of [Finkel et al. 2006]:
• approximate vs. exact in polynomial time.
• used in testing vs. used in training and testing.
Future Work
• Try full model M2 / its approximation M2’ on NER.
• Extend model to pipeline graphs containing cycles.