
A method for unsupervised broad-coverage lexical error detection and correction

A method for unsupervised broad-coverage lexical error detection and correction. 4th Workshop on Innovative Uses of NLP for Building Educational Applications, NAACL, June 5, 2009. Nai-Lung Tsao and David Wible, National Central University, Taiwan.


Presentation Transcript


  1. A method for unsupervised broad-coverage lexical error detection and correction
     4th Workshop on Innovative Uses of NLP for Building Educational Applications, NAACL, June 5, 2009
     Nai-Lung Tsao and David Wible, National Central University, Taiwan

  2. The Research Context
     IWiLL Online Writing Platform (www.iwillnow.org)
     • Since 2000, under the support of MOE & Taipei Bureau of Education
     • IWiLL has been used in Taiwan by:
       - 455 schools
       - 2,804 teachers
       - 161,493 students and 22,791 independent learners
     • Teachers have authored 9,429 web-based lessons with the system’s authoring tool.
     • The learner corpus (English TLC) has archived over 32,000 English essays:
       - 5 million words of machine-readable running text written by Taiwan’s learners using the IWiLL writing platform
       - 100,000 tokens of teacher comments on these student texts

  3. Second Language Learners’ Error Detection and Correction
     • Lexical and lexico-grammatical errors
       - an open-ended class
       - driving teachers crazy
       - either no rules involved or rules of very limited productivity

  4. Two components to our system
     INPUT: user-produced string ‘on my opinion’
     1. Target Language Knowledgebase: hybrid n-grams extracted from BNC
     2. Edit Distance Algorithm: compares user’s string & hybrid n-grams
     Result: Error Detection/Correction

  5. 1. Target Language Knowledgebase: The Knowledgebase of Hybrid N-grams (hybrid n-grams extracted from BNC)
     What, Why, and How
     What is a hybrid n-gram?
     An n-gram that admits items of different levels
       - Traditional n-gram: ‘in my opinion’
       - Hybrid n-gram: ‘in [dps] opinion’
     Why use hybrid n-grams? Error detection.
       - Traditional n-grams and error precision
         True positive: enjoy to canoe > unattested > marked as error
         False positive: enjoy canoeing > unattested > marked as error
       - POS n-grams and recall
         Based on attested strings like ‘enjoy hiking’ or ‘like watching’, we could extract the POS n-gram V + VVg.
         But this would accept ‘hope exploring’.
     How hybrid n-grams are extracted for the knowledgebase
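The trade-off on this slide can be illustrated with a minimal sketch. It assumes each item carries the four levels named later in the talk (word form, lexeme, detailed POS, rough POS); the `Item` type, the `matches` helper and the example patterns are hypothetical names chosen for illustration, not the authors' implementation.

```python
from typing import NamedTuple

class Item(NamedTuple):
    word: str    # surface word form, e.g. "hiking"
    lexeme: str  # lemma, e.g. "hike"
    pos: str     # detailed POS tag, e.g. "VVg"
    rough: str   # rough POS class, e.g. "V"

def matches(pattern, items):
    """True if every pattern slot equals one of the four levels of its item."""
    return len(pattern) == len(items) and all(
        slot in item for slot, item in zip(pattern, items)
    )

# Hand-annotated toy strings (the tags are assumed, not tagger output).
enjoyed_canoeing = [Item("enjoyed", "enjoy", "VVd", "V"),
                    Item("canoeing", "canoe", "VVg", "V")]
hoped_exploring = [Item("hoped", "hope", "VVd", "V"),
                   Item("exploring", "explore", "VVg", "V")]

traditional = ("enjoyed", "hiking")  # word forms only
pos_only = ("V", "VVg")              # POS tags only
hybrid = ("enjoy", "VVg")            # lexeme + detailed POS

print(matches(traditional, enjoyed_canoeing))  # False: correct string left unattested
print(matches(pos_only, hoped_exploring))      # True: erroneous string accepted
print(matches(hybrid, enjoyed_canoeing))       # True
print(matches(hybrid, hoped_exploring))        # False
```

In this toy setting the word-form pattern is too specific, the POS pattern is too general, and the mixed-level pattern separates the two strings.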

  6. 1. Target Language Knowledgebase: How the hybrid n-grams are extracted (hybrid n-grams extracted from BNC)
     Potential hybrid n-grams for a string
     4 categories of info for each item in an n-gram:
       {POS rough}      V         V
       lexeme           enjoy     hike
       [POS detailed]   VVd       VVg
       word form        enjoyed   hiking
     Some hybrid n-grams for ‘enjoyed hiking’:
       enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.
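One way to read "potential hybrid n-grams for a string" is as every combination that picks one of the four levels per item. The sketch below takes that reading as an assumption; the slide lists only some of the combinations, and whether the authors keep all of them (for example the all-rough-POS one) is not stated. The function name `hybrid_ngrams` is hypothetical.

```python
from itertools import product

# (word form, lexeme, detailed POS, rough POS) for each item of "enjoyed hiking"
ENJOYED_HIKING = [
    ("enjoyed", "enjoy", "VVd", "V"),
    ("hiking", "hike", "VVg", "V"),
]

def hybrid_ngrams(items):
    """Return every n-gram that picks exactly one level for each item."""
    return set(product(*items))

grams = hybrid_ngrams(ENJOYED_HIKING)
print(len(grams))                # 16 (4 levels x 4 levels)
print(("enjoy", "VVg") in grams)  # True
print(("V", "hiking") in grams)   # True
```

Under this reading an n-gram of length n yields up to 4^n hybrid n-grams, which is one reason a pruning step on the knowledgebase side is useful.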

  7. Two components:
     INPUT: user-produced string ‘on my opinion’
     1. Target Language Knowledgebase: hybrid n-grams extracted from BNC
     2. Edit Distance Algorithm: compares user’s string & hybrid n-grams
     Result: Error Detection/Correction

  8. Edit Distance Component
     Steps in measuring edit distance
     1. Generate all hybrid n-grams from the learner input string (Set C)
     2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
        b. Prune Set S using filter factor or coverage
        We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.
     3. Rank candidates by weighted edit distance between members of C and S

  9. Edit Distance Component
     Steps in measuring edit distance
     1. Generate all hybrid n-grams from the learner input string (Set C)
     Input from learner: enjoyed hiking
     Hybrid n-grams generated from learner string, Set C =
       enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.

  10. Edit Distance Component
      Steps in measuring edit distance
      1. Generate all hybrid n-grams from the learner input string (Set C)
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
         b. Prune Set S using filter factor or coverage
         c. Eliminate n-grams under frequency threshold
         We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.
      3. Calculate weighted edit distance between members of C and S

  11. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
         b. Prune Set S using filter factor or coverage
         c. Eliminate n-grams under frequency threshold
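Steps 2a and 2c can be sketched as an index lookup followed by a filter. This is only an illustration under stated assumptions: "derivable from content words" is read here as "contains the lexeme of a content word from the learner string", the knowledgebase is a plain dictionary, and all counts except the enjoy + V / enjoy + VVg pair are invented toy values. `KB`, `build_index` and `candidate_set` are hypothetical names, and the pruning step 2b is not shown.

```python
from collections import defaultdict

# Hypothetical knowledgebase: hybrid n-gram -> corpus frequency (toy counts).
KB = {
    ("enjoy", "V"): 100,
    ("enjoy", "VVg"): 80,
    ("V", "hike"): 3,
    ("hope", "VVt"): 60,
    ("in", "dps", "opinion"): 50,
}

def build_index(kb):
    """Map each slot value to the hybrid n-grams that contain it."""
    index = defaultdict(set)
    for gram in kb:
        for slot in gram:
            index[slot].add(gram)
    return index

def candidate_set(index, kb, content_lexemes, length, min_freq):
    """Set S: n-grams reachable from the content lexemes, with the same
    length as the learner string (substitution only) and frequent enough."""
    found = set()
    for lexeme in content_lexemes:
        found |= index.get(lexeme, set())
    return {g for g in found if len(g) == length and kb[g] >= min_freq}

index = build_index(KB)
set_s = candidate_set(index, KB, ["enjoy", "hike"], length=2, min_freq=5)
print(sorted(set_s))  # [('enjoy', 'V'), ('enjoy', 'VVg')]
```

The length filter reflects the slide's restriction to substitution: insertions and deletions would change the number of slots, so only same-length n-grams are considered.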

  13. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
      Learner string: enjoyed hiking
      Hybrid n-grams generated from learner string, Set C =
        enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.
      Content words: enjoy, hike
      Target Knowledgebase hybrid n-grams: Set S

  14. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
      Learner string: enjoyed hiking
      Hybrid n-grams generated from learner string, Set C =
        enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.
      Target Knowledgebase hybrid n-grams: Set S
        {POS rough}      V         V
        lexeme           enjoy     hike
        [POS detailed]   VVd       VVg
        word form        enjoyed   hiking

  15. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
      Target Knowledgebase hybrid n-grams: Set S
        {POS rough}      V         V
        lexeme           enjoy     hike
        [POS detailed]   VVd       VVg
        word form        enjoyed   hiking

  16. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
      Target Knowledgebase hybrid n-grams: Set S
      Diagram labels: enjoy hiking; V, hike, VVg

  17. Edit Distance Component
      Steps in measuring edit distance
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
      Target Knowledgebase hybrid n-grams: Set S
      Diagram labels: V, hike, enjoy, VVd, enjoyed

  18. Pruning Set S of Candidates
      • enjoy + V: 100 tokens (marked X, pruned)
      • enjoy + VVg: 80 tokens
      We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.

  19. Pruning Set S of Candidates
      • enjoy + VVg: 80 tokens
      We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.
      Pruning of the knowledgebase will affect error recall.
      The remaining Set S is filtered for frequency of member hybrid n-grams.
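The 80% rule can be sketched as follows, using the counts from the slides (enjoy + V with 100 tokens, enjoy + VVg with 80). The subsumption test here is deliberately simplified and is an assumption: it only knows that a rough POS tag generalizes a detailed one, via the small hypothetical table `ROUGH_OF`, and ignores the word-form and lexeme levels. `subsumes` and `prune` are illustrative names.

```python
# Toy mapping from detailed POS to rough POS (assumed subset).
ROUGH_OF = {"VVg": "V", "VVd": "V", "VVt": "V"}

def subsumes(general, specific):
    """True if `general` differs from `specific` only by replacing
    detailed POS slots with their rough POS."""
    return (
        general != specific
        and len(general) == len(specific)
        and all(g == s or ROUGH_OF.get(s) == g for g, s in zip(general, specific))
    )

def prune(freqs, threshold_pct=80):
    """Drop a subsuming n-gram when one subsumed n-gram accounts for at
    least `threshold_pct` percent of its tokens."""
    pruned = set()
    for general, g_count in freqs.items():
        for specific, s_count in freqs.items():
            # Integer arithmetic avoids floating-point edge cases at exactly 80%.
            if subsumes(general, specific) and s_count * 100 >= threshold_pct * g_count:
                pruned.add(general)
                break
    return {g: c for g, c in freqs.items() if g not in pruned}

counts = {("enjoy", "V"): 100, ("enjoy", "VVg"): 80}
print(prune(counts))  # {('enjoy', 'VVg'): 80}
```

Dropping enjoy + V means that a learner string such as ‘enjoy to hike’ no longer finds a knowledgebase n-gram general enough to license it, while enjoy + VVg remains available as a correction source.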

  20. Edit Distance Component
      Steps in measuring edit distance
      1. Generate all hybrid n-grams from the learner input string (Set C)
      2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
         b. Prune Set S using filter factor or coverage
         We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.
      3. Rank candidates by weighted edit distance between members of C and S

  21. Weighting of Edit Distance
      Learner string: ‘enjoyed to hike’
      • Generate Set C of hybrid n-grams: enjoyed to hike, enjoy VVt, enjoy V, V to hike, VVd to hike, etc.
      • Generate Set S of hybrid n-grams
      • Distance = 1: string c and string s are identical but for one slot.
      • Correction candidates are those with a distance of 1 or lower.
      Ranking of candidates with distance = 1 from learner string:
        enjoyed hiking, enjoyed hike, enjoy VVg, VVd hiking, V hiking, VVd hike, enjoy VVg, enjoy learning
      • Differing element = same lexeme but different word form is closer than different lexeme.
      • Differing element = same rough POS but different detailed POS is closer than different rough POS.
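The two ranking rules can be turned into a small cost function. The sketch below is an assumption-heavy illustration: the numeric weights are invented (the talk lists learning them as future work), ‘to hike’ is treated as a single slot tagged VVt, and candidates are represented as fully annotated items instead of hybrid n-grams. `Item`, `slot_cost` and `rank` are hypothetical names.

```python
from typing import NamedTuple

class Item(NamedTuple):
    word: str    # surface word form
    lexeme: str  # lemma
    pos: str     # detailed POS tag
    rough: str   # rough POS class

def slot_cost(a, b):
    """Cost of substituting item b for item a (illustrative weights)."""
    if a.word == b.word:
        return 0.0
    # Same lexeme with a different word form is closer than a different lexeme.
    lexeme_cost = 0.25 if a.lexeme == b.lexeme else 1.0
    # Same rough POS with a different detailed POS is closer than a different rough POS.
    if a.pos == b.pos:
        pos_cost = 0.0
    elif a.rough == b.rough:
        pos_cost = 0.5
    else:
        pos_cost = 1.0
    return lexeme_cost + pos_cost

def rank(learner, candidates):
    """Keep same-length candidates differing in at most one slot, cheapest first."""
    scored = []
    for cand in candidates:
        if len(cand) != len(learner):
            continue  # substitution only: lengths must match
        differing = [(a, b) for a, b in zip(learner, cand) if a.word != b.word]
        if len(differing) <= 1:
            cost = sum(slot_cost(a, b) for a, b in differing)
            scored.append((cost, " ".join(item.word for item in cand)))
    return sorted(scored, key=lambda pair: pair[0])

enjoyed = Item("enjoyed", "enjoy", "VVd", "V")
learner = [enjoyed, Item("to hike", "hike", "VVt", "V")]
candidates = [
    [enjoyed, Item("learning", "learn", "VVg", "V")],
    [enjoyed, Item("hiking", "hike", "VVg", "V")],
    [Item("hoped", "hope", "VVd", "V"), Item("learning", "learn", "VVg", "V")],
]
ranking = rank(learner, candidates)
print(ranking)  # [(0.75, 'enjoyed hiking'), (1.5, 'enjoyed learning')]
```

With these weights ‘enjoyed hiking’ outranks ‘enjoyed learning’ because it keeps the learner's lexeme, and ‘hoped learning’ is excluded because it differs in two slots.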

  22. Examples 1
      C-selection
      • Enjoy to swim > enjoy swimming
      • Enjoy to shop > enjoy shopping
      • Enjoy to canoe > enjoy canoeing
      • Enjoy to learn > *need to learn; ?want to learn; enjoy learning
      • Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
      • Hope finding > hope to find
      • Let us to know > let us know
      • Get used to say > *get used to; *have used to say
      Collocation with C-selection
      • Spend time to fix > spend time fixing; take time to fix
      • Take time fixing > take time to fix
      • Take time recuperating > take time to recuperate
      • Spend time to recuperate > spend time recuperating; take time to recuperate

  23. Examples 2
      Preposition
      Fixed expressions:
      • On the outset > at the outset
      • In different reasons > for different reasons
      • In that time > at that time; by that time
      • On that time > at that time; by that time
      • On my opinion > in my opinion
      • In my point of view > from my point of view
      • I am interested of > I am interested in
      • She is interested of > she is interested in
      • I am interesting in > I am interested in
      • She is interesting in > she is interested in
      • Just on the time when > just at the time when; *just to the time when

  24. Examples 3
      Preposition/Particle
      Verb + preposition (particle)
      • Discuss to each other > *discussing to each other (should be discuss WITH each other)
      • Discuss this to them > discuss this with them
      • Waited to her > waited for her
      • Waited to them > waited for them
      Noun + preposition
      • His admiration to > his admiration for
      • His accomplishment on > * No suggestion
      • The opposite side to > the opposite side of
      • A crisis on > a crisis of; a crisis in
      • A crisis on his work > a crisis of his work (*a crisis on his work)

  25. Examples 4
      Content Word Choice
      • Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
      • Frame of mood > ??change of mood; frame of mind; *frame of reference

  26. Examples 5
      Morpho-syntactic
      • She will ran > she will run
      • She will runs > she will run
      Pronoun case:
      • What made she change > *what made she change (no correction; should be made HER change)
      Noun countability or number errors:
      • In modern time > in modern times
      Number agreement in head noun and determiner:
      • Too much people > too many people
      • So much things > so many things
      • So many thing > so many things
      • One of the man > one of the men
      • One of the problem > one of the problems
      • In my opinions > in my opinion
      • A lot of problem > a lot of problems
      Complementizer selection:
      • I wonder that > I wonder if; I wonder whether

  27. Future Work
      • Improving POS tagging using 2nd order model
      • Machine learning of weighting for the various features determining edit distance
      • Incorporation of this into our IWiLL online writing environment
      • Incorporate MI for the knowledgebase’s hybrid n-grams

  28. Thank you
