# Learning with Probabilistic Features for Improved Pipeline Models

### Learning with Probabilistic Features for Improved Pipeline Models

Razvan C. Bunescu

Electrical Engineering and Computer Science

Ohio University

Athens, OH

EMNLP, October 2008

Introduction

• NLP systems often depend on the output of other NLP systems:

POS Tagging → Syntactic Parsing → Named Entity Recognition → Semantic Role Labeling

Model M1

• The best annotation from one stage is used in subsequent stages:

x → POS Tagging → Syntactic Parsing

• Problem: errors propagate between pipeline stages!

• All possible annotationsfrom one stage are used in subsequent stages.

x

POS Tagging

Syntactic Parsing

probabilistic features

• Problem: Z(x) has exponential cardinality!

• Feature-wise formulation: replace each original feature φ_i with its expectation over all annotations,

$$\tilde{\phi}_i(x, y) = \sum_{z \in Z(x)} \phi_i(x, y, z)\, P(z \mid x)$$

• When the original φ_i's are count features, it can be shown that:

$$\tilde{\phi}_i(x, y) = \sum_{F \in F_i} P(F \mid x)$$

where F is an instance of feature i (the actual evidence used from example (x, y, z)) and F_i is the set of all instances of feature i in (x, y, z), across all annotations z ∈ Z(x). A short sketch of this computation follows below.
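To make the feature-wise formulation concrete, here is a minimal sketch of the expected-count computation, assuming a hypothetical `marginal_prob` callback that returns P(F | x) for a single feature instance (e.g. from a constrained forward-backward pass); the encoding of instances is illustrative, not the paper's:

```python
from typing import Callable, Iterable, Tuple

# Hypothetical encoding: a feature instance F is the set of (position, tag)
# pairs it depends on, e.g. ((3, "RB"), (4, "VBD")).
FeatureInstance = Tuple[Tuple[int, str], ...]

def probabilistic_feature(
    instances: Iterable[FeatureInstance],
    marginal_prob: Callable[[FeatureInstance], float],
) -> float:
    """Expected count of feature i: the sum of P(F | x) over all
    instances F in F_i, per the feature-wise formulation above."""
    return sum(marginal_prob(f) for f in instances)
```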

Example: POS → Dependency Parsing

• Feature i: RB ← VBD (a VBD head with an RB modifier).

• For the sentence

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

the set of feature instances F_i contains one instance for every ordered pair of token positions that could be tagged RB and VBD, each weighted by its marginal probability: 0.91 for the pair mistakenly3 ← thought4, with much smaller values (0.01, 0.1, 0.001, 0.001, 0.002, ...) for the other pairs shown.

⇒ N(N−1) feature instances in F_i.

Example: POS → Dependency Parsing (complexity)

• Feature i: RB ← VBD uses a limited amount of evidence:

• ⇒ the set of feature instances F_i has cardinality N(N−1).

• ⇒ computing P(F | x) takes O(N·|P|²) time (P = the POS tag set) using a constrained version of the forward-backward algorithm.

• Therefore, computing φ̃_i takes O(N³·|P|²) time.

• All possible annotations from one stage are used in subsequent stages.

x

POS Tagging

Syntactic Parsing

polynomial time

• In general, the time complexity of computing i depends on the complexity of the evidence used by feature i.

• The best annotationfrom one stage is used in subsequent stages, together with its probabilistic confidence:

x

POS Tagging

Syntactic Parsing

• The best annotationfrom one stage is used in subsequent stages, together with its probabilistic confidence:

x

POS Tagging

Syntactic Parsing

The set of instances of feature i using only the best annotation

• Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features.

• More efficient than M2, but less accurate.
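A minimal sketch of how M3 turns indicator features into confidence-weighted ones, assuming a hypothetical `confidence` map holding the tagger's joint tag probability for each candidate arc (illustrative only, not the paper's code):

```python
from typing import Dict, List, Tuple

def m3_features(
    best_tags: List[str],                       # the single best POS sequence
    arcs: List[Tuple[int, int]],                # (head, dependent) arcs of a candidate tree
    confidence: Dict[Tuple[int, int], float],   # hypothetical p(t_u, t_v | x) per arc
) -> Dict[str, float]:
    """M3: generate t_i -> t_j features from the best annotation only, as
    in M1, but let each instance fire with the tagger's confidence
    instead of a count of 1."""
    feats: Dict[str, float] = {}
    for (u, v) in arcs:
        name = f"{best_tags[u]}->{best_tags[v]}"
        feats[name] = feats.get(name, 0.0) + confidence[(u, v)]
    return feats
```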

• Example: POS → Dependency Parsing

• The slide shows features generated by the template t_i → t_j and their probabilities:

x:  The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
y:  DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11

with confidence values 0.81, 0.92, 0.85, 0.97, 0.95, 0.98, 0.91, 0.97, 0.90, 0.98 attached to the generated features.


Two evaluation tasks:

1) Dependency Parsing:  x → POS Tagging → Syntactic Parsing

2) Named Entity Recognition:  x → POS Tagging → Syntactic Parsing → Named Entity Recognition

1) Dependency Parsing

• Use MSTParser [McDonald et al. 2005]:

• The score of a dependency tree is the sum of its edge scores.

• Feature templates use the words and POS tags at positions u and v and their neighbors u ± 1 and v ± 1.

• Use a CRF [Lafferty et al. 2001] POS tagger:

• Compute probabilistic features using a constrained forward-backward procedure.

• Example: feature t_i → t_j has probability p(t_i, t_j | x):

• constrain the state transitions to pass through tags t_i and t_j (see the sketch below).
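A minimal sketch of such a constrained forward pass for a linear-chain tagger, assuming log-potential matrices rather than the paper's exact CRF parameterization:

```python
import numpy as np
from scipy.special import logsumexp

def constrained_marginal(log_pot, i, a, j, b):
    """P(t_i = a, t_j = b | x) for a linear-chain model: sum over all tag
    sequences forced to pass through tag a at position i and tag b at
    position j, divided by the unconstrained sum (partition function).

    log_pot[t] is a (|P| x |P|) matrix of log-potentials for the
    transition into position t; by convention log_pot[0] carries the
    initial scores in its first row. A sketch, not the paper's code.
    """
    N, P = len(log_pot), log_pot[0].shape[1]

    def forward(mask):
        # mask[t]: boolean vector of tags allowed at position t
        alpha = np.where(mask[0], log_pot[0][0], -np.inf)
        for t in range(1, N):
            alpha = logsumexp(alpha[:, None] + log_pot[t], axis=0)
            alpha = np.where(mask[t], alpha, -np.inf)
        return logsumexp(alpha)

    free = [np.ones(P, dtype=bool) for _ in range(N)]
    constrained = [m.copy() for m in free]
    constrained[i] = np.zeros(P, dtype=bool); constrained[i][a] = True
    constrained[j] = np.zeros(P, dtype=bool); constrained[j][b] = True
    return np.exp(forward(constrained) - forward(free))
```

Each constrained pass is O(N·|P|²), matching the complexity quoted earlier; running it for all N(N−1) position pairs gives the O(N³·|P|²) total.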

1) Dependency Parsing (continued)

• Two approximations of model M2:

• Model M2’ (see the sketch below):

• Consider the POS tags independent:

p(t_i = RB, t_j = VBD | x) ≈ p(t_i = RB | x) · p(t_j = VBD | x)

• Ignore tags with low marginal probability, i.e. p(t_i) below a cutoff proportional to 1/|P|.

• Model M2”:

• Like M2’, but use constrained forward-backward to compute the joint marginals whenever the tag chunks are less than 4 tokens apart.
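A minimal sketch of the M2’ approximation, assuming per-position tag-marginal vectors from the tagger; the exact pruning constant is not recoverable from the transcript, so the cutoff is left as a parameter:

```python
import numpy as np

def m2prime_joint(p_i: np.ndarray, p_j: np.ndarray, a: int, b: int,
                  cutoff: float) -> float:
    """M2' approximation of p(t_i = a, t_j = b | x): treat the two
    positions as independent and prune low-probability tags.
    p_i and p_j are the tag-marginal vectors at positions i and j;
    cutoff would be set proportional to 1 / |P| (assumption)."""
    pa = p_i[a] if p_i[a] >= cutoff else 0.0
    pb = p_j[b] if p_j[b] >= cutoff else 0.0
    return pa * pb
```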

1) Dependency Parsing: Results

• Train MSTParser on sections 2–21 of the Penn WSJ Treebank, using gold POS tags.

• Test MSTParser on section 23, using POS tags from the CRF tagger.

• Absolute error reduction of “only” 0.19%:

• But the POS tagger has a very high accuracy of 96.25%.

• Expect more substantial improvements when upstream stages in the pipeline are less accurate.

2) Named Entity Recognition

• Model NER as a sequence tagging problem using CRFs:

x:  The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11
z1: DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11
z2: (the dependency parse of x)
y:  O I O O O O O O O O O

• Flat features: unigrams, bigrams and trigrams that extend either left or right:

• sailors, the sailors, sailors RB, sailors RB thought, …

• Tree features: unigrams, bigrams and trigrams that extend in any direction in the undirected dependency tree (see the sketch below):

• sailors – thought, sailors – thought – RB, NNS – thought – RB, …
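A minimal sketch of this tree-feature space, enumerating undirected paths of up to three nodes in the dependency tree; the paper's templates also mix words and POS tags within one n-gram, which is omitted here:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def tree_ngrams(tokens: List[str], arcs: List[Tuple[int, int]],
                max_len: int = 3) -> Set[Tuple[str, ...]]:
    """Enumerate simple paths of up to max_len nodes in the undirected
    dependency tree; each path yields one tree n-gram feature."""
    nbrs: Dict[int, List[int]] = defaultdict(list)
    for head, dep in arcs:          # treat each arc as an undirected edge
        nbrs[head].append(dep)
        nbrs[dep].append(head)

    feats: Set[Tuple[str, ...]] = set()

    def walk(path: List[int]) -> None:
        feats.add(tuple(tokens[i] for i in path))
        if len(path) < max_len:
            for n in nbrs[path[-1]]:
                if n not in path:   # keep paths simple (no revisits)
                    walk(path + [n])

    for i in range(len(tokens)):
        walk([i])
    return feats
```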

x → POS Tagging → Syntactic Parsing → Named Entity Recognition

• Probabilistic features: e.g. the tree feature NNS2 ← thought4 → RB3.

• M3’ is an approximation of M3 in which confidence scores are computed as follows:

• Consider POS tagging and dependency parsing independent.

• Consider the POS tags independent.

• Consider the dependency arcs independent.

• Under these assumptions, the confidence of the example feature NNS2 ← thought4 → RB3 factorizes into tag marginals and arc marginals, so we need to compute the marginals p(u → v | x).

• To compute the probabilistic POS features, we used a constrained version of the forward-backward algorithm.

• To compute the probabilistic dependency features, we use a constrained version of Eisner’s algorithm:

• Compute normalized scores n(u → v | x) using the softmax function.

• Transform the scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002] (see the sketch below).
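A minimal sketch of the two calibration steps, assuming softmax normalization of raw edge scores over the candidate heads of a fixed dependent, and using scikit-learn's IsotonicRegression in place of the original implementation; `n_dev` and `y_dev` are hypothetical development-set arrays:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def softmax_head_scores(scores: np.ndarray) -> np.ndarray:
    """Turn raw MSTParser edge scores s(u -> v), for all candidate heads u
    of one dependent v, into normalized scores n(u -> v | x)."""
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

# Calibrate normalized scores into probabilities: fit isotonic regression
# on held-out arcs, mapping n(u -> v | x) to 1 if the arc is in the gold
# parse and 0 otherwise (the Zadrozny & Elkan-style recipe).
n_dev = np.array([0.15, 0.35, 0.55, 0.80, 0.95])  # hypothetical dev scores
y_dev = np.array([0, 0, 1, 1, 1])                 # gold correctness labels
calibrate = IsotonicRegression(out_of_bounds="clip").fit(n_dev, y_dev)
p_arc = calibrate.predict(np.array([0.7]))        # calibrated p(u -> v | x)
```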

2) Named Entity Recognition: Results

• Implemented the CRF models in MALLET [McCallum, 2002].

• Trained and tested on the standard split of the ACE 2002 + 2003 corpus (674 training, 97 testing).

• The POS tagger and MSTParser were trained on sections 2–21 of the WSJ Treebank.

• Isotonic regression for MSTParser was calibrated on section 23.

• The results plot the area under the precision-recall curve, comparing M3’ (probabilistic) against M1 (traditional) using tree features.

Conclusions

• A general method for improving the communication between consecutive stages in pipeline models:

• based on computing expectations for count features.

• an effective method for associating probabilities with output substructures.

• adds polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.

• Can be seen as complementary to the sampling approach of [Finkel et al. 2006]:

• approximate vs. exact in polynomial time.

• used in testing vs. used in training and testing.

Future Work

• Try the full model M2 and its approximation M2’ on NER.

• Extend the model to pipeline graphs containing cycles.