
Native Judgments of Non-Native Usage: Experiments in Preposition Error Detection


Presentation Transcript


  1. Native Judgments of Non-Native Usage: Experiments in Preposition Error Detection Joel Tetreault [Educational Testing Service] Martin Chodorow [Hunter College of CUNY]

  2. Preposition Cloze Examples
  • "We sat _____ the sunshine"
    • in / under / by / at
  • "…the force of gravity causes the sap to move _____ the underside of the stem."
    • to / on / toward / onto / through

  3. Motivation
  • Number of non-native speakers in English schools has risen over the past decade
    • Currently 300 million Chinese ESL learners!
  • Highlights the need for NLP tools to assist in language learning
  • Evaluation of NLP learner tools
    • Important for development
    • Error annotation: time-consuming and costly
    • Usually one rater
    • [Izumi '03; Eeg-Olofsson '02; Han '06; Nagata '06; Gamon '08]

  4. Objective
  • Problem
    • Single human annotation has been used as the gold standard
    • Sidesteps the issue of annotator reliability
  • Objective
    • Show that rating preposition usage is actually a very contentious task
    • Recommend an approach to make annotation more feasible

  5. Experiments in Rater Reliability
  • Judgments of Native Usage
    • Difficulty of preposition selection with cloze and choice experiments
  • Judgments of Non-Native Usage
    • Double-annotate a large ESL corpus
    • Show that one rater can be unreliable and skew system results
  • Sampling Approach
    • Propose an approach to alleviate the cost and time associated with double annotation

  6. Background
  • Uses the system for preposition error detection developed in [Tetreault and Chodorow '08]
  • Performance:
    • Native text: as high as 79%
    • TOEFL essays: P = 84%, R = 19%
    • State-of-the-art when compared with other methods: [Gamon '08] [De Felice '08]
  • Raters:
    • Two native speakers (East Coast US)
    • Ages: 26, 29
    • Two years of experience with other NLP annotation

  7. (1) Human Judgments of Native Usage
  • Is the task of preposition selection in native texts difficult for human raters?
  • Cloze Test:
    • Raters presented with 200 Encarta sentences, each with one preposition replaced with a blank
    • Asked to fill in the blank with the best preposition
    • "We sat _____ the sunshine"
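For concreteness, here is a minimal Python sketch of how a cloze item of this kind can be built by blanking out a sentence's preposition; the preposition list and the make_cloze helper are illustrative, not part of the paper's setup.

```python
# Illustrative only: blank out the first preposition in a sentence to form
# a cloze item, returning the item and the writer's original preposition.
PREPOSITIONS = {"in", "under", "by", "at", "on", "to", "toward", "onto", "through"}

def make_cloze(sentence):
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,") in PREPOSITIONS:
            answer = tok
            tokens[i] = "_____"
            return " ".join(tokens), answer
    return None  # no preposition found

print(make_cloze("We sat in the sunshine."))
# ('We sat _____ the sunshine.', 'in')
```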

  8. (1) Native Usage – Cloze Test
  * System's mismatch suggestions were often not as good as Raters'

  9. (1) Native Usage – Choice Test
  • Many contexts license multiple prepositions
  • Using an exact match can underestimate performance
  • Choice Test:
    • Raters presented with 200 Encarta sentences
    • Asked to blindly choose between the system's and the writer's preposition
    • "We sat {in/under} the sunshine"

  10. (1) Native Usage – Choice Test
  • Results:
    • Both Raters 1 & 2 considered the system's preposition equal to or better than the writer's 28% of the time
    • So a system that performs at 75% with the exact-match metric is actually performing as high as 82%
    • 28% of the 25% mismatch rate = +7%
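The adjustment is straightforward arithmetic; a few lines of Python reproduce it from the figures on the slide:

```python
exact_match = 0.75                       # performance under the exact-match metric
mismatch_rate = 1 - exact_match          # 0.25
credit_back = 0.28 * mismatch_rate       # raters judged 28% of mismatches acceptable
adjusted = exact_match + credit_back
print(round(credit_back, 2), round(adjusted, 2))   # 0.07 0.82
```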

  11. (2) Human Judgments of Non-Native Usage
  • Using one rater can be problematic
    • linguistic drift, age, location, fatigue, task difficulty
  • Question: is using only one rater reliable?
  • Experiment:
    • Two raters double-annotate TOEFL essays for preposition usage errors
    • Compute Agreement/Kappa measures
    • Evaluate system performance vs. two raters

  12. Annotation: Error Targeting
  • Schemes target many different types of errors [Izumi '03] [Granger '03]
  • Problematic:
    • High cognitive load on the rater to keep track of dozens of error types
    • Some contexts have several different errors (many different ways of correcting)
    • Can degrade reliability
  • Targeting one error type reduces the effects of these issues

  13. Annotation Scheme
  • Annotators were presented with sentences from TOEFL essays, with each preposition flagged
  • Preposition Annotation:
    • Extraneous
    • Wrong Choice – an incorrect preposition is used (list substitution(s))
    • OK – preposition is perfect for that context
    • Equal – preposition is perfect, but there are others that are acceptable as well (list others)
  • Then mark confidence in the judgment (binary)
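As a minimal sketch, a scheme like this could be represented with a small record type such as the one below; the class and field names are illustrative and are not taken from the paper's annotation tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class PrepJudgment(Enum):
    EXTRANEOUS = "extraneous"       # preposition should not be there at all
    WRONG_CHOICE = "wrong_choice"   # incorrect preposition; substitutions listed
    OK = "ok"                       # preposition is perfect for the context
    EQUAL = "equal"                 # perfect, but other prepositions also acceptable

@dataclass
class PrepAnnotation:
    sentence: str                   # TOEFL sentence containing the flagged preposition
    preposition: str                # the writer's preposition
    judgment: PrepJudgment
    alternatives: List[str] = field(default_factory=list)  # substitutions / equally good choices
    confident: bool = True          # binary confidence in the judgment

# Example record (values are illustrative):
example = PrepAnnotation("We sat at the sunshine.", "at",
                         PrepJudgment.WRONG_CHOICE, ["in"], confident=True)
```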

  14. Procedure
  • Raters given blocks of 500 preposition contexts
    • Took roughly 5 hours per block
  • After two blocks each, raters did an overlap set of ~100 contexts (1,336 contexts total)
  • Every overlap set was adjudicated by two other human raters:
    • Sources of disagreement were discussed with the original raters
    • Agreement and Kappa computed

  15. How well do humans compare?
  • For all overlap segments:
    • OK and Equal are collapsed to OK
    • Agreement = 0.952
    • Kappa = 0.630
  • Kappa ranged from 0.411 to 0.786

  16. Confusion Matrix: Rater 1 vs. Rater 2 (cell counts not transcribed)
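The agreement and kappa figures on slide 15 come from a rater-by-rater confusion matrix of this kind. Below is a minimal, self-contained Python sketch of the computation; the toy label sequences are illustrative, not the paper's data.

```python
from collections import Counter

def agreement_and_kappa(labels1, labels2):
    """Confusion matrix, observed agreement, and Cohen's kappa for two raters."""
    n = len(labels1)
    confusion = Counter(zip(labels1, labels2))   # (rater1 label, rater2 label) -> count
    observed = sum(c for (a, b), c in confusion.items() if a == b) / n
    # Chance agreement from each rater's marginal label distribution.
    m1, m2 = Counter(labels1), Counter(labels2)
    expected = sum(m1[k] * m2[k] for k in set(m1) | set(m2)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return confusion, observed, kappa

# Toy example with OK/Equal collapsed to OK, as on slide 15.
r1 = ["OK", "OK", "OK", "Error", "OK", "Error", "OK", "OK"]
r2 = ["OK", "OK", "Error", "Error", "OK", "OK", "OK", "OK"]
conf_matrix, agreement, kappa = agreement_and_kappa(r1, r2)
print(agreement, round(kappa, 3))   # 0.75 0.333
```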

  17. Implications for System Evaluation
  • Comparing a system [Chodorow et al. '07] to one rater's judgments can skew evaluation results
  • Test: 2 native speakers rated 2,000 prepositions from TOEFL essays
    • Difference of 10% precision, 5% recall depending on which rater is used as the gold standard

  18. Implications of Using Multiple Raters
  • Advantages of multiple raters:
    • Can indicate the variability of system evaluation
    • Allows listing of more substitutions
  • Standard annotation with multiple annotators is problematic:
    • Expensive
    • Time-consuming (training, adjudication)
  • Is there an approach that can make annotation more efficient?

  19. (3) Sampling Approach
  • Sampling Approach:
    • Sample the system's output classifications
    • Annotate a smaller, error-skewed corpus
    • Estimate rates of hits, false positives, and misses
    • → Can calculate precision and recall
  • [Figure: error distribution in a learner corpus – roughly 90% OK vs. 10% Error (e.g., 9,000 OK vs. 1,000 Error contexts)]
  • Problem: to make an eval corpus of 1,000 errors can take ~100 hrs!

  20. Sampling Methodology
  • Learner Corpus → System
  • Sys Flags Error → "Error Sub-Corpus" → Random Error Sample
  • Sys Accepts OK → "OK Sub-Corpus" → Random OK Sample
  • Random Error Sample + Random OK Sample → Annotation Corpus

  21. Sampling Methodology (worked example)
  • Learner Corpus: 1,000 contexts → System
  • Sys Flags Error: 100 → "Error Sub-Corpus" → Random Error Sample (sample rate = 1.0) = 100
  • Sys Accepts OK: 900 → "OK Sub-Corpus" → Random OK Sample (sample rate = 0.33) = 300
  • Annotation Corpus: 400 contexts
    • Error sample annotated: Hits = 70, FP = 30
    • OK sample annotated: Both OK = 200, Misses = 100
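One plausible way to turn these sampled counts into precision and recall estimates, following the idea on slide 19, is to project each count back to the full corpus by the inverse of its sub-corpus sampling rate. The sketch below makes that assumption; the function and variable names are illustrative, not the paper's exact formulas.

```python
def estimate_precision_recall(hits, false_positives, misses,
                              error_sample_rate, ok_sample_rate):
    """Project sampled counts back to the full corpus and estimate P/R."""
    est_hits = hits / error_sample_rate
    est_fps = false_positives / error_sample_rate
    est_misses = misses / ok_sample_rate
    precision = est_hits / (est_hits + est_fps)
    recall = est_hits / (est_hits + est_misses)
    return precision, recall

# Numbers from the slide's 1,000-context example.
p, r = estimate_precision_recall(hits=70, false_positives=30, misses=100,
                                 error_sample_rate=1.0, ok_sample_rate=0.33)
print(f"P={p:.2f} R={r:.2f}")   # P=0.70 R=0.19
```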

  22. Sampling Results
  • Two raters working in tandem on the sampled corpus
  • Compare against standard annotation
  • Results:
    • Standard: P = 0.79, R = 0.18
    • Sampling: P = 0.79, R = 0.16
  • Related Work:
    • [Chodorow & Leacock '00] – usage of targeted words
    • Active Learning [Dagan & Engelson '95] – finding the most informative training examples for ML

  23. Summary
  • Are two or more annotators better than one?
    • Annotators vary in their judgments of usage errors
    • Evaluation based on a single annotator under- or over-estimates system performance
  • Value of multiple annotators:
    • Gives information about the range of performance
    • Dependent on the number of annotators
    • Multiple prepositions per context are handled better
  • Issues not unique to the preposition task:
    • Collocation kappa scores: 0.504 to 0.554

  24. Summary
  • Sampling Approach: shown to be a good alternative to the exhaustive annotation approach
  • Advantages:
    • Less costly & time-consuming
    • Results are similar to exhaustive annotation
    • Avoids the fatigue problem
  • Drawbacks:
    • Less reliable estimate of recall
    • Hard to re-test the system
    • System comparison is difficult

  25. Future Work
  • Do another sampling comparison to validate results
  • Leverage confidence annotations
