
Learning with Probabilistic Features for Improved Pipeline Models

Razvan C. Bunescu

Electrical Engineering and Computer Science

Ohio University

Athens, OH

[email protected]

EMNLP, October 2008


Introduction

  • NLP systems often depend on the output of other NLP systems.

POS Tagging

Syntactic Parsing

Question Answering

Named Entity Recognition

Semantic Role Labeling


Traditional Pipeline Model: M1

  • The best annotation from one stage is used in subsequent stages.

x → POS Tagging → Syntactic Parsing

  • Problem: Errors propagate between pipeline stages!


Probabilistic Pipeline Model: M2

  • All possible annotationsfrom one stage are used in subsequent stages.

x → POS Tagging → Syntactic Parsing   (probabilistic features passed between stages)

  • Problem: Z(x) has exponential cardinality!


Probabilistic Pipeline Model: M2

  • Feature-wise formulation: each feature is computed in expectation over the annotations from the previous stage.

  • When the original φi's are count features, it can be shown that the expected value of φi reduces to a sum of instance probabilities, where:

    • an instance of feature φi is the actual evidence used from example (x, y, z).

    • Fi is the set of all instances of feature φi in (x, y, z), across all annotations z ∈ Z(x).

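The two formulas on this slide were images in the original deck and did not survive the transcript; a plausible reconstruction, assuming the notation used elsewhere in the slides (φi for features, Fi for its instances, Z(x) for the space of upstream annotations), is:

```latex
% Feature values are computed in expectation over all upstream annotations z:
\tilde{\phi}_i(x, y) \;=\; \sum_{z \in Z(x)} p(z \mid x)\, \phi_i(x, y, z)

% When \phi_i is a count feature, the expectation collapses to a sum of
% probabilities of its instances f \in F_i (the evidence each instance uses):
\tilde{\phi}_i(x, y) \;=\; \sum_{f \in F_i} p(f \mid x)
```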

Example: POS → Dependency Parsing

  • Feature φi: RB → VBD

  • The set of feature instances Fi pairs every candidate RB position with every candidate VBD position in the sentence, each instance weighted by its probability (e.g. 0.91, 0.01, 0.1, 0.001, 0.001, 0.002, …), for the sentence:

The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

 ⇒ N(N−1) feature instances in Fi.

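A minimal sketch of how the expected value of φi could be accumulated for the RB → VBD feature under this formulation; joint_tag_marginal is a hypothetical stand-in for the constrained forward-backward computation described on the next slide:

```python
from itertools import permutations

def probabilistic_feature_value(n_tokens, joint_tag_marginal,
                                dep_tag="RB", head_tag="VBD"):
    """Sum instance probabilities for the RB -> VBD feature over all
    ordered position pairs (u, v), u != v: N(N-1) instances in total."""
    total = 0.0
    for u, v in permutations(range(n_tokens), 2):
        # p(t_u = RB, t_v = VBD | x), e.g. from constrained forward-backward
        total += joint_tag_marginal(u, dep_tag, v, head_tag)
    return total
```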

Example: POS → Dependency Parsing

  • Feature φi: RB → VBD uses a limited amount of evidence:

    • the set of feature instances Fi has cardinality N(N−1).

    • computing the probability of one instance takes O(N|P|²) time using a constrained version of the forward-backward algorithm.

  • Therefore, computing the expectation of φi takes O(N³|P|²) time.


Probabilistic Pipeline Model: M2

  • All possible annotations from one stage are used in subsequent stages.

x → POS Tagging → Syntactic Parsing   (probabilistic features computed in polynomial time)

  • In general, the time complexity of computing the expectation of φi depends on the complexity of the evidence used by feature φi.


Probabilistic Pipeline Model: M3

  • The best annotation from one stage is used in subsequent stages, together with its probabilistic confidence:

x → POS Tagging → Syntactic Parsing

  • Fi is restricted to the set of instances of feature φi that use only the best annotation.

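The defining formula for M3 was also an image; based on the caption above, a plausible reconstruction (writing ẑ for the single best upstream annotation) is:

```latex
% Only instances consistent with the best annotation \hat{z} contribute,
% each weighted by its probabilistic confidence:
\hat{z} \;=\; \arg\max_{z \in Z(x)} p(z \mid x), \qquad
\tilde{\phi}_i(x, y) \;=\; \sum_{f \in F_i(\hat{z})} p(f \mid x)
```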

Probabilistic Pipeline Model: M3

  • Like the traditional pipeline model M1, except that it uses the probabilistic confidence values associated with annotation features.

  • More efficient than M2, but less accurate.

  • Example: POS → Dependency Parsing

    • The figure below shows the features generated by the template ti → tj and their probabilities.

y:  DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11

    (feature confidences: 0.81, 0.92, 0.85, 0.97, 0.95, 0.98, 0.91, 0.97, 0.90, 0.98)

x:  The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11



Two Applications

  • Dependency Parsing

  • Named Entity Recognition

x → POS Tagging → Syntactic Parsing

x → POS Tagging → Syntactic Parsing → Named Entity Recognition


1) Dependency Parsing

  • Use MSTParser [McDonald et al. 2005]:

    • The score of a dependency tree is the sum of its edge scores:

    • Feature templates use words and POS tags at positions u and v and their neighbors u ± 1 and v ± 1.

  • Use a CRF POS tagger [Lafferty et al. 2001]:

    • Compute probabilistic features using a constrained forward-backward procedure.

    • Example: feature ti → tj has probability p(ti, tj | x):

      • constrain the state transitions to pass through tags ti and tj (a sketch of this constrained computation follows below).

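A minimal sketch of that constrained computation for a linear-chain model, assuming unnormalized unary and transition potentials rather than MALLET's actual CRF internals, and ignoring the log-space arithmetic a real implementation would need:

```python
import numpy as np

def constrained_joint_marginal(unary, trans, u, tag_u, v, tag_v):
    """p(t_u = tag_u, t_v = tag_v | x) for a linear-chain model.

    unary: (N x |P|) per-position potentials; trans: (|P| x |P|) transition
    potentials. Positions u and v are clamped to the required tags, and the
    constrained partition function is divided by the unconstrained one."""
    def forward(clamp):
        alpha = unary[0].copy()
        if 0 in clamp:
            alpha *= np.eye(len(alpha))[clamp[0]]      # zero out other tags
        for t in range(1, len(unary)):
            alpha = (alpha[:, None] * trans).sum(axis=0) * unary[t]
            if t in clamp:
                alpha *= np.eye(len(alpha))[clamp[t]]
        return alpha.sum()
    return forward({u: tag_u, v: tag_v}) / forward({})
```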

1) Dependency Parsing

  • Two approximations of model M2:

    • Model M2’ (see the sketch after this list):

      • Consider POS tags independent:

        • p(ti = RB, tj = VBD | x) ≈ p(ti = RB | x) · p(tj = VBD | x)

      • Ignore tags with low marginal probability:

        • p(ti) below a threshold proportional to 1/|P|

    • Model M2”:

      • Like M2’, but use constrained forward-backward to compute joint marginal probabilities when the tag chunks are less than 4 tokens apart.

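A minimal sketch of the M2’ approximation above, assuming per-position tag marginals are already available (e.g. from standard forward-backward); the pruning constant is a placeholder, since the exact threshold did not survive extraction:

```python
def m2_prime_feature_value(marginals_u, marginals_v, tag_u, tag_v,
                           num_tags, threshold_scale=1.0):
    """Approximate p(t_u = tag_u, t_v = tag_v | x) as a product of independent
    per-position marginals, ignoring tags with low marginal probability."""
    threshold = threshold_scale / num_tags      # proportional to 1/|P|
    p_u = marginals_u.get(tag_u, 0.0)
    p_v = marginals_v.get(tag_v, 0.0)
    if p_u < threshold or p_v < threshold:
        return 0.0                              # prune low-probability tags
    return p_u * p_v
```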

1) Dependency Parsing: Results

  • Train MSTParser on sections 2-21 of the Penn WSJ Treebank using gold POS tags.

  • Test MSTParser on section 23, using POS tags from the CRF tagger.

  • Absolute error reduction of "only" 0.19%:

    • But POS tagger has a very high accuracy of 96.25%.

  • Expect more substantial improvement when upstream stages in the pipeline are less accurate.


2) Named Entity Recognition

  • Model NER as a sequence tagging problem using CRFs:

x:   The1 sailors2 mistakenly3 thought4 there5 must6 be7 diamonds8 in9 the10 soil11

z1:  DT1 NNS2 RB3 VBD4 EX5 MD6 VB7 NNS8 IN9 DT10 NN11   (POS tags)

z2:  dependency tree over x (drawn as arcs on the original slide)

y:   O I O O O O O O O O O   (entity labels)

  • Flat features: unigram, bigram and trigram features that extend either left or right:

    • sailors, the sailors, sailors RB, sailors RB thought…

  • Tree features: unigram, bigram and trigram that extend in any direction in the undirected dependency tree:

    • sailors thought, sailors  thought  RB, NNS thought  RB, …

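A minimal sketch of how such tree n-gram features might be enumerated from an undirected dependency tree; the label strings stand in for the word/POS combinations the actual templates use, and edge directions are ignored here:

```python
def tree_ngram_features(labels, edges, max_len=3):
    """Enumerate unigram, bigram, and trigram features as label paths of
    length 1..max_len through the undirected dependency tree."""
    adj = {i: set() for i in range(len(labels))}
    for head, dep in edges:                    # dependency arcs, any direction
        adj[head].add(dep); adj[dep].add(head)
    paths = {(i,) for i in range(len(labels))}
    features = {" ".join(labels[i] for i in p) for p in paths}
    for _ in range(max_len - 1):
        paths = {p + (n,) for p in paths for n in adj[p[-1]] if n not in p}
        features |= {" ".join(labels[i] for i in p) for p in paths}
    return features
```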

Named Entity Recognition: Model M2

x → POS Tagging → Syntactic Parsing → Named Entity Recognition

  • Probabilistic features:

  • Example feature NNS2thought4  RB3:


Named Entity Recognition: Model M3’

  • M3’ is an approximation of M3 in which confidence scores are computed as follows:

    • Consider POS tagging and dependency parsing independent.

    • Consider POS tags independent.

    • Consider dependency arcs independent.

    • Example feature NNS2thought4  RB3:

  • Need to compute the dependency arc marginals p(u → v | x).

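The confidence product computed for this example was an image in the deck; under the three independence assumptions above, a plausible reconstruction is:

```latex
% Confidence of the feature NNS_2 <- thought_4 -> RB_3 under M3':
% tag marginals and arc marginals are multiplied as independent events.
p(t_2 = \text{NNS} \mid x)\; p(t_3 = \text{RB} \mid x)\;
p(4 \rightarrow 2 \mid x)\; p(4 \rightarrow 3 \mid x)
```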

Probabilistic Dependency Features

  • To compute probabilistic POS features, we used a constrained version of the forward-backward algorithm.

  • To compute probabilistic dependency features, we use a constrained version of Eisner’s algorithm:

    • Compute normalized scores n(u → v | x) using the softmax function:

    • Transform the scores n(u → v | x) into probabilities p(u → v | x) using isotonic regression [Zadrozny & Elkan, 2002] (a sketch of both steps follows below).

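A minimal sketch of this two-step calibration using scikit-learn's IsotonicRegression; the arc scores are assumed to come from MSTParser, and the 0/1 gold labels from a held-out treebank section:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def softmax_normalize(arc_scores):
    """Turn raw arc scores s(u -> v | x) into normalized scores n(u -> v | x)."""
    e = np.exp(arc_scores - np.max(arc_scores))
    return e / e.sum()

def fit_arc_calibrator(normalized_scores, gold_labels):
    """Map n(u -> v | x) to p(u -> v | x) by isotonic regression on held-out
    arcs labeled 1 if they appear in the gold dependency tree, else 0."""
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(normalized_scores, gold_labels)
    return calibrator          # calibrator.predict(n) gives p(u -> v | x)
```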

Named Entity Recognition: Results

  • Implemented the CRF models in MALLET [McCallum, 2002]

  • Trained and tested on the standard split from the ACE 2002 + 2003 corpus (674 training documents, 97 test documents).

  • The POS tagger and MSTParser were trained on sections 2-21 of the Penn WSJ Treebank.

    • Isotonic regression for MSTParser was calibrated on section 23.

[Table: area under the precision-recall curve]


Named Entity Recognition: Results

  • M3’ (probabilistic) vs. M1 (traditional) using tree features:


Conclusions & Related Work

  • A general method for improving the communication between consecutive stages in pipeline models:

    • based on computing expectations for count features:

      • an effective method for associating probabilities with output substructures.

    • adds only polynomial time complexity to the pipeline whenever the inference step at each stage is done in polynomial time.

  • Can be seen as complementary to the sampling approach of [Finkel et al. 2006]:

    • approximate vs. exact in polynomial time.

    • used in testing vs. used in training and testing.


Future Work

  • Try the full model M2, or its approximation M2’, on NER.

  • Extend the model to pipeline graphs containing cycles.


