Accelerating Corpus Annotation through Active Learning


Presentation Transcript



Accelerating Corpus Annotation through Active Learning

Peter McClanahan

Eric Ringger, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, Deryle Lonsdale (PROJECT ALFA)

March 15, 2008 – AACL



Our Situation

  • Operating on a fixed budget

  • Need morphologically annotated text

  • For a significant part of the project, we can only afford to annotate a subset of the text

  • Where should we focus manual annotation effort to produce a complete annotation of the highest quality?



Improving on Manual Tagging

  • Can we reduce the amount of human effort while maintaining high levels of accuracy?

  • Utilize probabilistic models for computer-aided tagging

  • Focus: Active Learning for Sequence Labeling



Outline

  • Statistical Part-of-Speech Tagging

  • Active Learning

    • Query by Uncertainty

    • Query by Committee

  • Cost Model

  • Experimental Results

  • Conclusions


Probabilistic POS Tagging

Task: predict a tag for each word in a given context.

Actually: predict a probability distribution over possible tags.

<s>  Perhaps     the         biggest     hurdle
     RB   .87    DT  .95     JJS  .67    NN  .85
     RBS  .05    NN  .003    JJ   .17    VV  .13
     ...         ...         ...         ...
     FW  6E-7    FW  1E-8    UH  3E-7    CD  2E-7

Probabilistic POS Tagging

  • Train a model to perform precisely this task:

    • Estimate a distribution over tags for a word in a given context.

  • Probabilistic POS tagging is a fairly easy computational task

    • On the Penn Treebank (Wall Street Journal text):

      • With no context, choosing the most frequent tag for each word gives 92.45% accuracy

    • There are much smarter things to do

      • Maximum Entropy Markov Models
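The most-frequent-tag baseline mentioned above takes only a few lines: ignore context entirely and pick each word's most common tag from the training counts. A minimal sketch on toy data (not the Penn Treebank):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Most-frequent-tag baseline: no context, just per-word tag counts."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Toy training data, illustrative only
train = [[("the", "DT"), ("biggest", "JJS"), ("hurdle", "NN")],
         [("the", "DT"), ("hurdle", "NN")]]
baseline = train_baseline(train)
print(baseline["hurdle"])  # NN
```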


MEMMs

[Figure: MEMM model diagrams; images not preserved in the transcript]

Sequence Labeling

  • Instead of predicting one tag, we must predict the entire sequence

  • The best tag for each word might give an unlikely sequence

  • We use the Viterbi algorithm, a dynamic-programming approach that finds the best sequence
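The idea can be sketched in a few lines. The scoring function below is a toy stand-in for the MEMM's actual local scores, and the names are ours:

```python
def viterbi(words, tags, score):
    """Best tag sequence under score(prev_tag, tag, word) -> log-score.
    prev_tag is None for the first word."""
    # best[t] = (log-score of the best path ending in tag t, that path)
    best = {t: (score(None, t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new = {}
        for t in tags:
            p, path = max((best[s][0] + score(s, t, word), best[s][1])
                          for s in tags)
            new[t] = (p, path + [t])
        best = new
    return max(best.values())[1]

# Toy log-scores favouring DT for "the", NN for "hurdle", and DT -> NN
def score(prev, tag, word):
    emit = 0.0 if (word, tag) in {("the", "DT"), ("hurdle", "NN")} else -2.0
    trans = 0.0 if (prev, tag) in {(None, "DT"), ("DT", "NN")} else -1.0
    return emit + trans

print(viterbi(["the", "hurdle"], ["DT", "NN"], score))  # ['DT', 'NN']
```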



Features for POS Tagging

  • Combination of lexical, orthographic, contextual, and frequency-based information.

    • Approximately following Toutanova & Manning (2000)

  • For each word:

    • the textual form of the word itself

    • the POS tags of the preceding two words

    • all prefix and suffix substrings up to 10 characters long

    • whether words contain hyphens, digits, or capitalization

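A sketch of such a feature extractor; the feature names are illustrative, not the paper's:

```python
def features(word, prev_tag, prev2_tag):
    """Lexical, contextual, and orthographic features for one word."""
    feats = {
        "word=" + word: 1,                       # the word's textual form
        "t-1=" + prev_tag: 1,                    # previous tag
        "t-2,t-1=" + prev2_tag + "," + prev_tag: 1,  # two previous tags
        "has_hyphen": int("-" in word),
        "has_digit": int(any(c.isdigit() for c in word)),
        "capitalized": int(word[:1].isupper()),
    }
    # All prefix and suffix substrings up to 10 characters long
    for k in range(1, min(10, len(word)) + 1):
        feats["prefix=" + word[:k]] = 1
        feats["suffix=" + word[-k:]] = 1
    return feats

f = features("Perhaps", "<s>", "<s>")
```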



More Data for Less Work

  • The goal is to produce annotated corpora with less time and annotator effort

  • Active Learning

    • A machine learning meta-algorithm

    • A model is trained with the help of an oracle

    • Oracle helps with the annotation process

      • Human annotator(s)

      • Simulated with selective querying of hidden labels on pre-labeled data



Active Learning Data Sets

  • Annotated: data labeled by oracle

  • Unannotated: data not (yet) labeled by oracle

  • Test: labeled data used to determine accuracy



Active Learning Models

  • Sentence Selecting Model

    • Model that chooses the sentences to be annotated

  • MEMM

    • Model that tags the sentences


[Animated diagram, built up over many frames: the Active Learning Algorithm trains an Automatic Tagger and a Sentence Selector Model on the Annotated Data; the Sentence Selector Model picks the most "important" samples from the Unannotated Data; the Oracle corrects those sentences; the Corrected Sentences are added to the Annotated Data; the loop then repeats, with accuracy measured on the Test Data.]
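The loop the frames walk through can be sketched as follows. All names here are ours: train_model, importance, and oracle stand in for the MEMM, the sentence selector, and the human annotator:

```python
def active_learning(pool, oracle, train_model, importance, batch_size, rounds):
    """Pool-based active learning: repeatedly pick the most 'important'
    unannotated sentences, have the oracle correct them, and retrain."""
    annotated = [oracle(pool.pop(0))]               # seed: one sentence
    for _ in range(rounds):
        model = train_model(annotated)              # retrain on annotated data
        pool.sort(key=lambda s: importance(model, s), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        annotated += [oracle(s) for s in batch]     # oracle corrects the tags
    return train_model(annotated), annotated

# Toy run: 'sentences' are numbers and importance is the value itself
model, labeled = active_learning([3, 1, 4, 1, 5], oracle=lambda s: s,
                                 train_model=lambda data: sum(data),
                                 importance=lambda m, s: s,
                                 batch_size=2, rounds=1)
print(labeled)  # [3, 5, 4]
```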

Determining Sentence Importance

  • Determining which samples will provide the most information:

    • Random Baseline (Random)

    • Query-By-Uncertainty

      • Using entropy (QBU)

      • Using the Viterbi score (QBUV)

    • Query-By-Committee (QBC)



Random Baseline

  • Instead of giving informative sentences to the oracle, we choose random sentences.

    • This is very fast

    • Not the smartest thing to do



Query by Uncertainty

  • Importance of a sample corresponds to the uncertainty in the model’s decision

  • “Uncertainty Sampling”

  • Thrun & Moeller, 1992

  • Lewis & Gale, 1994

  • Hwa, 2000

  • Anderson, 2005

  • Ringger et al., 2007



Query-by-Uncertainty using Entropy

  • How is uncertainty determined?

    • Per-word entropy

    • Measure of the uncertainty in the distribution over tags for a single word

    • Easy to calculate

    • Our approximation to sentence entropy = the sum of per-word entropies

    • This performs equivalently to full-sequence entropy.
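Given the tagger's per-word distributions, this score is a one-liner. The distributions below are toy examples, not actual model output:

```python
import math

def word_entropy(tag_dist):
    """Entropy (in bits) of one word's distribution over tags."""
    return -sum(p * math.log2(p) for p in tag_dist.values() if p > 0)

def qbu_score(sentence_dists):
    """QBU sentence score: the sum of the per-word entropies."""
    return sum(word_entropy(d) for d in sentence_dists)

certain = {"DT": 0.99, "NN": 0.01}          # model is confident
uncertain = {"JJS": 0.67, "JJ": 0.17, "NN": 0.16}  # model is unsure
print(word_entropy(certain) < word_entropy(uncertain))  # True
```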



QBUV

  • An alternative approach to QBU

    • Ringger et al., 2007 show that it often outperforms QBU

  • Estimate per-sentence uncertainty as 1 − P(t), where t is the best tag sequence from the Viterbi decoder (hence the "V")
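In code, given the log-probability of the Viterbi-best sequence (a sketch with names of our choosing):

```python
import math

def qbuv_score(viterbi_log_prob):
    """QBUV: 1 - P(t), where t is the Viterbi-best tag sequence."""
    return 1.0 - math.exp(viterbi_log_prob)

# A near-certain best tagging scores low (poor candidate for the oracle);
# an improbable best tagging scores high (good candidate).
print(qbuv_score(math.log(0.9)) < qbuv_score(math.log(0.2)))  # True
```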



Query by Committee

  • Utilize a committee (or ensemble) of N models

    • Each model is trained on a bootstrap sample (drawn with replacement) of the annotated data

  • Each model "votes" on the correct tagging of a sentence

  • The sentences on which the committee disagrees most should yield the most useful information
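One common disagreement measure is vote entropy, summed over the words of a sentence. This is a sketch; the slides do not name the exact measure used:

```python
from collections import Counter
import math

def vote_entropy(votes_per_word):
    """Committee disagreement: entropy of the members' votes, summed over words.
    votes_per_word[i] lists the N committee members' tags for word i."""
    total = 0.0
    for votes in votes_per_word:
        n = len(votes)
        for count in Counter(votes).values():
            p = count / n
            total -= p * math.log2(p)
    return total

agree = [["DT", "DT", "DT"], ["NN", "NN", "NN"]]  # unanimous committee
split = [["DT", "DT", "IN"], ["NN", "VB", "VB"]]  # disagreement
print(vote_entropy(agree) < vote_entropy(split))  # True
```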



Previous Work on QBC

  • Seung, Opper, & Sompolinsky, 1992

  • Freund, Seung, Shamir, & Tishby, 1997

    • Analysis of QBC

  • Engelson (Argamon) & Dagan, 1996

    • QBC with HMMs for POS tagging

    • Most relevant to our work



Experimental Setup

  • We use 80% of the data for training

    • The data is the Penn Treebank

  • Start with one random sentence

  • Treat the rest of the training data as unlabeled (hide the tags)

  • We simulate annotation by the oracle by revealing the hidden labels

  • Results are averaged over 10 randomized runs


Pay by the sentence

Pay by the word

Pay by the corrected words

Which model?

  • The best active learning model depends on how the annotator is paid

  • We usually pay by the hour

  • We need a good cost model



User Study

  • See Ringger et al., 2008

  • 47 participants (undergraduate linguistics students)

  • Each tagged 36 sentences from the Penn Treebank

  • Tracked the time taken to correct the tags in each sentence


Cost Model

  • From the user study, we performed linear regression to infer a cost model

  • Cost (in seconds) = 3.795 * L + 5.387 * C + 12.57

    • L = length of the sentence

    • C = number of tags the annotator needs to change
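The regression is directly usable as a cost estimator:

```python
def annotation_cost_seconds(length, corrections):
    """Estimated time to annotate one sentence, from the regression above."""
    return 3.795 * length + 5.387 * corrections + 12.57

# e.g. a 25-word sentence where the annotator fixes 5 tags:
print(round(annotation_cost_seconds(25, 5), 1))  # 134.4
```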


Pay by the hour


How much do you actually save?

  • Say I want 95% accuracy

  • 41.6 hrs vs. 70 hrs

Is active learning better, though?

  • The Penn Treebank has approximately one million words

  • According to the cost model, an average PTB sentence would take about 4 minutes

  • This means 26,908 hours for the full corpus

  • With active learning we achieve 96% accuracy after about 70 hours


Portability

  • Do the results hold up on other data?

    • We use random selections of English poetry from the BNC

    • We also use the Kiraz Syriac New Testament


British National Corpus

Syriac

Conclusions

  • The best active learning model depends on how cost is determined

  • Demonstrated significant annotation savings for POS tagging

    • On English news prose

    • On English poetry

    • On Syriac

Questions?

