
Native Judgments of Non-Native Usage: Experiments in Preposition Error Detection

Joel Tetreault [Educational Testing Service]

Martin Chodorow [Hunter College of CUNY]

Preposition Cloze Examples
  • “We sat _____ the sunshine”
    • in
    • under
    • by
    • at
  • “…the force of gravity causes the sap to move _____ the underside of the stem.”
    • to
    • on
    • toward
    • onto
    • through
Motivation
  • The number of non-native speakers in English schools has been rising over the past decade
    • Currently 300 million Chinese ESL learners!
  • Highlights need for NLP tools to assist in language learning
  • Evaluation of NLP learner tools
    • Important for development
    • Error-annotation: time consuming and costly
    • Usually one rater
      • [Izumi ’03; Eeg-Olofsson ‘02; Han ’06; Nagata ’06; Gamon ’08]
Objective
  • Problem
    • Single human annotation has been used as the gold standard
    • Sidesteps the issue of annotator reliability
  • Objective
    • Show that rating preposition usage is actually a very contentious task
    • Recommend an approach to make annotation more feasible
Experiments in Rater Reliability
  • Judgments of Native Usage
    • Difficulty of preposition selection with cloze and choice experiments
  • Judgments of Non-Native Usage
    • Double-annotate a large ESL corpus
    • Show that one rater can be unreliable and skew system results
  • Sampling Approach
    • Propose an approach to alleviate the cost and time associated with double annotation
Background
  • Uses the preposition error detection system developed in:
    • [Tetreault and Chodorow ’08]
  • Performance:
    • Native text: as high as 79%
    • TOEFL essays: P=84%, R=19%
    • State-of-the-art when compared with other methods: [Gamon ’08] [De Felice ’08]
  • Raters:
    • Two native speakers (East Coast US)
    • Ages: 26,29
    • Two years experience with other NLP annotation
(1) Human Judgments of Native Usage
  • Is the task of preposition selection in native texts difficult for human raters?
  • Cloze Test:
    • Raters presented with 200 Encarta sentences, with one preposition replaced with a blank
    • Asked to fill in the blank with the best preposition
    • “We sat _____ the sunshine”
(1) Native Usage – Cloze Test

[Cloze test results shown on slide.]

* The system’s suggestions in mismatch cases were often not as good as the raters’.

(1) Native Usage – Choice Test
  • Many contexts license multiple prepositions
  • Using an exact-match metric can underestimate performance
  • Choice Test:
    • Raters presented with 200 Encarta sentences
    • Asked to blindly choose between system’s and writer’s preposition
    • “We sat {in/under} the sunshine”
(1) Native Usage – Choice Test
  • Results:
    • Both Raters 1 & 2 considered the system’s preposition equal to or better than the writer’s 28% of the time
    • So a system that scores 75% under the exact-match metric may actually be performing as high as 82%
      • 28% of the 25% mismatch rate = +7% (worked through in the sketch below)
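For concreteness, a minimal sketch of the adjustment arithmetic, using the values from the slide; the variable names are illustrative, not from the paper:

```python
# Minimal sketch of the exact-match adjustment (illustrative values from the slide).
exact_match_accuracy = 0.75                    # performance under the strict exact-match metric
mismatch_rate = 1.0 - exact_match_accuracy     # 25% of contexts where system and writer disagree
system_as_good_rate = 0.28                     # raters judged the system's choice equal or better in 28% of mismatches

adjusted_accuracy = exact_match_accuracy + mismatch_rate * system_as_good_rate
print(f"Adjusted accuracy: {adjusted_accuracy:.2f}")   # 0.75 + 0.25 * 0.28 = 0.82
```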
(2) Human Judgments of Non-Native Usage
  • Using one rater can be problematic
    • linguistic drift, age, location, fatigue, task difficulty
  • Question: is using only one rater reliable?
  • Experiment:
    • Two raters double-annotate TOEFL essays for preposition usage errors
    • Compute Agreement/Kappa measures
    • Evaluate system performance vs. two raters
Annotation: Error Targeting
  • Schemes target many different types of errors [Izumi ’03] [Granger ’03]
  • Problematic:
    • High cognitive load on rater to keep track of dozens of error types
    • Some contexts have several different errors (many different ways of correcting)
    • Can degrade reliability
  • Targeting one error type reduces the effects of these issues
Annotation Scheme
  • Annotators were presented with sentences from TOEFL essays, with each preposition flagged
  • Preposition Annotation (see the sketch after this list):
    • Extraneous
    • Wrong Choice – if incorrect preposition is used (and list substitution(s))
    • OK – preposition is perfect for that context
    • Equal – preposition is perfect, but there are others that are acceptable as well (list others)
    • Then mark confidence in judgment (binary)
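As an illustration only, a minimal sketch of how one such annotation record might be represented; the class and field names are hypothetical, not part of the paper’s scheme:

```python
# Hypothetical representation of a single preposition annotation record
# following the scheme above; class and field names are illustrative.
from dataclasses import dataclass, field
from typing import List

LABELS = {"Extraneous", "Wrong Choice", "OK", "Equal"}

@dataclass
class PrepAnnotation:
    sentence: str                  # TOEFL sentence containing the flagged preposition
    preposition: str               # the preposition the writer used
    label: str                     # one of LABELS
    substitutions: List[str] = field(default_factory=list)  # filled for "Wrong Choice" / "Equal"
    confident: bool = True         # binary confidence judgment

# Example: a "Wrong Choice" judgment with a suggested substitution (made-up sentence)
ann = PrepAnnotation(
    sentence="He is fond about music.",
    preposition="about",
    label="Wrong Choice",
    substitutions=["of"],
)
```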
Procedure
  • Raters given blocks of 500 preposition contexts
  • Took roughly 5 hours per block
  • After two blocks each, raters did an overlap set of ~100 contexts (1336 contexts total)
  • Every overlap set was adjudicated by two other human raters:
    • Sources of disagreement were discussed with original raters
    • Agreement and Kappa computed
How well do humans compare?
  • For all overlap segments:
    • OK and Equal are collapsed to OK
    • Agreement = 0.952
    • Kappa = 0.630
    • Kappa ranged from 0.411 to 0.786 across segments (see the computation sketch below)
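For reference, a minimal sketch of how agreement and Cohen’s kappa can be computed for two raters over binary OK/Error judgments; the example counts are made up, not the paper’s data:

```python
# Minimal sketch of agreement and Cohen's kappa for two raters over
# binary OK / Error judgments (the counts below are made up).
from collections import Counter

def agreement_and_kappa(rater1, rater2, labels=("OK", "Error")):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n    # raw agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)     # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# 100 judgments: raters agree on 88 OK and 6 Error, disagree on 6 contexts
r1 = ["OK"] * 90 + ["Error"] * 10
r2 = ["OK"] * 88 + ["Error"] * 2 + ["Error"] * 6 + ["OK"] * 4
print(agreement_and_kappa(r1, r2))   # agreement = 0.94, kappa ≈ 0.63
```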
Confusion Matrix

[Confusion matrix of Rater 1 vs. Rater 2 judgments shown on slide.]

Implications for System Evaluation
  • Comparing a system [Chodorow et al. ’07] to one rater’s judgments can skew evaluation results
  • Test: 2 native speakers rated 2,000 prepositions from TOEFL essays:
    • Differences of 10% in precision and 5% in recall, depending on which rater is used as the gold standard
Implications of Using Multiple Raters
  • Advantages of multiple raters:
    • Can indicate the variability of system evaluation
    • Allows listing of more substitutions
  • Standard annotation with multiple annotators is problematic:
    • Expensive
    • Time-Consuming (training, adjudication)
  • Is there an approach that can make annotation more efficient?
(3) Sampling Approach

[Slide diagram: a learner corpus is heavily skewed toward correct usage, roughly 90% OK (9,000) vs. 10% Error (1,000), while the sampled annotation corpus is much smaller and error-skewed (2,000 OK, 1,000 Error).]

  • Problem: building an evaluation corpus of 1,000 errors by exhaustive annotation can take ~100 hours!
  • Sampling Approach:
    • Sample the system’s output classifications
    • Annotate a smaller, error-skewed corpus
    • Estimate rates of hits, false positives, and misses
      • → Can calculate precision and recall

Sampling Methodology

[Slide flowchart: a Learner Corpus is run through the System, whose output is split into “Sys Flags Error” (forming an “Error Sub-Corpus”) and “Sys Accepts OK” (forming an “OK Sub-Corpus”); a Random Error Sample and a Random OK Sample are drawn from these sub-corpora and combined into the Annotation Corpus.]

Sampling Methodology

[Slide flowchart, worked example: a Learner Corpus of 1,000 prepositions is run through the System, which flags 100 as errors (“Error Sub-Corpus”) and accepts 900 as OK (“OK Sub-Corpus”). The error sub-corpus is sampled at rate 1.0 (Random Error Sample = 100) and the OK sub-corpus at rate 0.33 (Random OK Sample = 300), giving an Annotation Corpus of 400 contexts. Annotating the error sample yields Hits = 70 and False Positives = 30; annotating the OK sample yields Both OK = 200 and Misses = 100. Precision and recall can then be estimated as sketched below.]
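As an illustration, here is a minimal Python sketch of how precision and recall might be estimated from the sampled counts in the worked example above, assuming sampled counts are scaled back up by their sample rates; the function name and exact formulas are illustrative, not taken verbatim from the paper:

```python
# Minimal sketch of estimating precision and recall from sampled counts,
# using the rates and counts from the worked example above (illustrative).
def estimate_precision_recall(hits, false_positives, misses,
                              error_sample_rate, ok_sample_rate):
    # Scale each sampled count back up to its full sub-corpus.
    est_hits = hits / error_sample_rate
    est_fp = false_positives / error_sample_rate
    est_misses = misses / ok_sample_rate
    precision = est_hits / (est_hits + est_fp)
    recall = est_hits / (est_hits + est_misses)
    return precision, recall

# Error sample (rate 1.0): Hits = 70, FP = 30; OK sample (rate 300/900): Misses = 100
p, r = estimate_precision_recall(70, 30, 100,
                                 error_sample_rate=1.0, ok_sample_rate=300 / 900)
print(f"P = {p:.2f}, R = {r:.2f}")   # P = 0.70, R ≈ 0.19
```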

Sampling Results
  • Two raters working in tandem on sampled corpus
  • Compare against standard annotation
  • Results:
    • Standard: P = 0.79, R = 0.18
    • Sampling: P = 0.79, R = 0.16
  • Related Work
    • [Chodorow & Leacock ’00] – usage of targeted words
    • Active Learning [Dagan & Engelson ’95] – finding the most informative training examples for ML
Summary
  • Are two or more annotators better than one?
    • Annotators vary in their judgments of usage errors
    • Evaluation based on a single annotator under- or over-estimates system performance
  • Value of multiple annotators:
    • Gives information about the range of performance
      • Dependent on number of annotators
    • Multiple acceptable prepositions per context are handled better
  • Issues not unique to preposition task:
    • Collocation kappa scores: 0.504 to 0.554
Summary
  • Sampling Approach: shown to be a good alternative to exhaustive annotation

Advantages

  • Less costly & time-consuming
  • Results are similar to exhaustive annotation
  • Avoid fatigue problem

Drawbacks

  • Less reliable estimate of recall
  • Hard to re-test system
  • System comparison difficult
Future Work
  • Do another sampling comparison to validate results
  • Leverage confidence annotations