
Detecting Missing Hyphens in Learner Text


Presentation Transcript


  1. Detecting Missing Hyphens in Learner Text Aoife Cahill*, Martin Chodorow**, Susanne Wolff* and Nitin Madnani* *Educational Testing Service, 660 Rosedale Road, Princeton, NJ 08541, USA {acahill, swolff, nmadnani}@ets.org **Hunter College and the Graduate Center, City University of New York, NY 10065, USA martin.chodorow@hunter.cuny.edu

  2. Outline • Motivation • Baselines • New Model • Experiments and Results • Conclusion

  3. Motivation • Hyphen errors are infrequent • But are an important consideration for students aiming to improve the overall quality of their writing Dogs are lucky… most of them have built in fur coats! Brrrr! From: http://daughternumberthree.blogspot.com

  4. Motivation • Missing hyphen errors are not all lexical • Schools may have more after school sports. • I went to the dentist after school today. • Language Learner text introduces additional complications • My father like play basketball with me.

  5. Baselines • Baseline 1: Collins Dictionary [5,246] • predicts a missing hyphen within a bigram that appears hyphenated in the dictionary • Baseline 2: Wiki (counts) [1,095] • predicts a missing hyphen within a bigram that occurs hyphenated more than 1,000 times in Wikipedia • Baseline 3: Wiki (probs) [673,269] • predicts a missing hyphen within a bigram whose probability of being hyphenated, as estimated from Wikipedia, is greater than 0.66
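For illustration, a minimal Python sketch of how these three bigram baselines could be implemented. The lookup tables (dict_hyphenated, wiki_hyphen_count, wiki_bigram_count) are hypothetical stand-ins for the Collins dictionary entries and Wikipedia counts, not the authors' actual resources.

```python
# Sketch of the three bigram baselines; lookup tables are hypothetical stand-ins.

def collins_baseline(w1, w2, dict_hyphenated):
    """Baseline 1: flag a missing hyphen if the bigram appears hyphenated in the dictionary."""
    return f"{w1}-{w2}" in dict_hyphenated

def wiki_count_baseline(w1, w2, wiki_hyphen_count, threshold=1000):
    """Baseline 2: flag if the hyphenated form occurs more than `threshold` times in Wikipedia."""
    return wiki_hyphen_count.get(f"{w1}-{w2}", 0) > threshold

def wiki_prob_baseline(w1, w2, wiki_hyphen_count, wiki_bigram_count, threshold=0.66):
    """Baseline 3: flag if the estimated probability of the hyphenated form exceeds `threshold`."""
    hyphenated = wiki_hyphen_count.get(f"{w1}-{w2}", 0)
    unhyphenated = wiki_bigram_count.get((w1, w2), 0)
    total = hyphenated + unhyphenated
    return total > 0 and hyphenated / total > threshold
```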

  6. New Model • Logistic Regression Model • estimates the probability of a hyphen occurring between a word w_i and the following word w_i+1
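A minimal sketch of such a classifier, assuming simple lexical and contextual features around the candidate hyphen position; the feature set and the scikit-learn setup here are illustrative assumptions, not the features reported in the paper.

```python
# Illustrative logistic-regression hyphen predictor (feature set is assumed).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    """Features for the candidate hyphen position between tokens[i] and tokens[i+1]."""
    return {
        "w_i": tokens[i].lower(),
        "w_i+1": tokens[i + 1].lower(),
        "bigram": f"{tokens[i].lower()}_{tokens[i + 1].lower()}",
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 2].lower() if i + 2 < len(tokens) else "</s>",
    }

# X_train: list of feature dicts; y_train: 1 if a hyphen belongs at that position, else 0
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# model.fit(X_train, y_train)
# p_hyphen = model.predict_proba(X_test)[:, 1]  # probability of a missing hyphen
```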

  7. Data • Training • Well-edited text (San Jose Mercury News) • Error-corrected data mined from Wikipedia Revisions • Combination • Test • Artificial errors: Brown corpus • Learner text: CLC-FCE corpus, TOEFL/GRE essays
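One way the Wikipedia-revision mining could work, sketched under the assumption that a revision pair is kept only when the edit does nothing but insert a hyphen into one bigram; this is an assumption about the mining step, not the authors' actual pipeline.

```python
# Hypothetical filter for Wikipedia revision pairs: keep (before, after) only
# when the edit inserts a hyphen between two otherwise unchanged words.
import re

def hyphen_insertion(before: str, after: str):
    """Return (w1, w2) if `after` differs from `before` only by hyphenating one bigram."""
    for match in re.finditer(r"(\w+)-(\w+)", after):
        w1, w2 = match.groups()
        candidate = after[:match.start()] + f"{w1} {w2}" + after[match.end():]
        if candidate == before:
            return (w1, w2)
    return None

print(hyphen_insertion("Schools may have more after school sports.",
                       "Schools may have more after-school sports."))
# -> ('after', 'school')
```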

  8. Evaluation on Artificial Errors • Brown Corpus: 24,243 sentences, automatically remove hyphens from 2,072 words • Each system makes a prediction for all bigrams about whether a hyphen should appear between the pair of words • precision: how many of the missing hyphen errors predicted by the system were true errors • recall: how many of the artificially removed hyphens the system detected as errors • f-score: the harmonic mean of precision and recall
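The scoring itself is standard; a short sketch, assuming predictions and gold errors are represented as sets of flagged bigram positions (a representation chosen here for illustration).

```python
# Precision, recall, and F-score over predicted vs. artificially removed hyphens.

def score(predicted_positions, gold_positions):
    """Score a system's flagged positions against the artificially removed hyphens."""
    true_positives = len(predicted_positions & gold_positions)
    precision = true_positives / len(predicted_positions) if predicted_positions else 0.0
    recall = true_positives / len(gold_positions) if gold_positions else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 3 predictions, 2 correct, 4 gold errors -> P=0.67, R=0.50, F=0.57
print(score({("after", "school"), ("built", "in"), ("make", "up")},
            {("after", "school"), ("built", "in"), ("fur", "coat"), ("well", "edited")}))
```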

  9. Artificial Error Results

  10. Evaluation on Learner Text (1) • CLC-FCE corpus: 173 instances of missing hyphen errors

  11. Evaluation on Learner Text (1) • Some observations: • Very low-frequency error type (173 instances) • Dominated by one lexical item: make-up • Errors are not independent events

  12. Evaluation on Learner Text (2) • Precision-only manual evaluation • Random sample of 100 errors per system detected in 1,000 student essays • 2 native-speaker judgements (agreement: 0.79)

  13. Evaluation on Learner Text (2) • Native Speaker Judgements (Precision)

  14. Conclusions • Logistic Regression Model for predicting missing hyphens in learner text • Trained on: • A corpus of well-edited text • A corpus of automatically mined corrections • In general, the classifiers outperform the baselines, especially in terms of precision. Thanks! Questions? Comments? http://blog.ezinearticles.com

  15. Brown Corpus: Precision/Recall

  16. CLC-FCE: Precision/Recall
