GDEX: Automatically finding good dictionary examples in a corpus

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý. Lexical Computing Ltd, UK; Masaryk University, Czech Republic; A&C Black Publishers Ltd., UK; Macmillan Education, UK; Lexicography MasterClass Ltd., UK.

Presentation Transcript


  1. GDEX: Automatically finding good dictionary examples in a corpus • Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý • Lexical Computing Ltd, UK; Masaryk University, Czech Republic; A&C Black Publishers Ltd., UK; Macmillan Education, UK; Lexicography MasterClass Ltd., UK

  2. Users appreciate examples • Paper: space constraints • Electronic: no space constraints • Give lots of examples Constraint: • Cost of selection, editing

  3. Project • Macmillan English Dictionary • Licensing arrangement with A&C Black • Already had 1000 collocation boxes • See collocationality paper, ELX 2006 • Average 8 per box • New electronic version • All 8000 collocations need examples • Authentic; from corpus

  4. Old method • Lexicographer • Gets concordance for collocation • Reads through until they find a good example • Cut, paste, edit

  5. New method • Lexicographer • Gets sorted concordance • 20 best examples in spreadsheet • Less reading through • Tick the first good one, edit

  6. What makes a good example? • Readable • EFL users • Informative • Typical, for the collocation • Gives context which helps user understand the target word/phrase

  7. Readability • 70 years of research • Not just (or mainly) EFL • Educational theory • Teaching children to read • Instruction manuals • Publishing

  8. Readability tests • Flesch Reading Ease test (1948) • Average sentence length, average word length • In some word processing software • Many similar measures • Recent work • Language modelling from training data • Target levels • US grades • Common European Framework
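
For reference, the Flesch Reading Ease score mentioned on this slide combines average sentence length with average word length, where word length is measured in syllables per word. A minimal sketch follows; it assumes the word, sentence, and syllable counts have already been computed elsewhere, since syllable counting needs its own heuristic or dictionary. The function name is illustrative only.

```python
def flesch_reading_ease(total_words: int, total_sentences: int,
                        total_syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences   # average sentence length
    syllables_per_word = total_syllables / total_words   # average word length
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Example: 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))
```

Longer sentences and longer words both lower the score, which is why the same two quantities reappear in many later readability measures.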

  9. GDEX • Get concordance for collocation • For each sentence • Score it • Sort • Show best ones
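
The loop on this slide (score every concordance line, sort, show the best) can be sketched as below. This is an illustration, not the GDEX implementation: `best_examples` and `toy_score` are hypothetical names, the scoring function is a placeholder, and the default of 20 follows the "20 best examples" figure from the "New method" slide.

```python
from typing import Callable, Iterable, List


def best_examples(concordance: Iterable[str],
                  score: Callable[[str], float],
                  top_n: int = 20) -> List[str]:
    """Return the top_n sentences, highest score first."""
    # sorted() is stable, so equally scored sentences keep corpus order.
    return sorted(concordance, key=score, reverse=True)[:top_n]


def toy_score(sentence: str) -> float:
    """Placeholder score: prefer sentences close to 18 words long."""
    return -abs(len(sentence.split()) - 18)


lines = [
    "Too short.",
    "The committee made a strong case for changing the rules before "
    "the next election takes place in May.",
]
for example in best_examples(lines, toy_score, top_n=1):
    print(example)
```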

  10. GDEX heuristics • Sentence length (10-26 words) • Mostly common words: good • Rare words: bad • Sentences • Start with capital, end with one of .!? • No [, ], <, >, http, \ • Penalise: • Other punctuation, numbers • More than 2 or 3 capitals • Typicality: third collocate is a plus
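
A rough sketch of how the heuristics on this slide might be turned into per-sentence features. Everything here is an assumption made for illustration: the function name, the tiny `COMMON_WORDS` set (a stand-in for a real frequency list), the tokenisation, and the choice of 3 as the capitals threshold (the slide says "2 or 3"). The typicality heuristic is left out because it needs collocation data.

```python
import re
import string

# Strings the slide says a good example should not contain.
BANNED = ("[", "]", "<", ">", "http", "\\")

# Hypothetical stand-in for a frequency list of common words.
COMMON_WORDS = {"the", "a", "of", "to", "and", "in", "is", "was", "she",
                "he", "it", "for", "on", "with", "at", "made", "new"}


def heuristic_features(sentence: str, common_words=COMMON_WORDS) -> dict:
    """Map one sentence to a dict of heuristic scores."""
    stripped = sentence.strip()
    tokens = re.findall(r"[A-Za-z']+", stripped)
    words = [t.lower() for t in tokens]
    inner = stripped[:-1]  # everything except the final character
    return {
        # Good signs.
        "length_ok": 1.0 if 10 <= len(tokens) <= 26 else 0.0,
        "common_ratio": (sum(w in common_words for w in words) / len(words)
                         if words else 0.0),
        "is_sentence": 1.0 if (stripped[:1].isupper()
                               and stripped.endswith((".", "!", "?"))) else 0.0,
        "no_banned": 0.0 if any(b in stripped for b in BANNED) else 1.0,
        # Bad signs, to be given negative weights later.
        "other_punct": float(sum(ch in string.punctuation for ch in inner)),
        "numbers": float(len(re.findall(r"\d+", stripped))),
        "too_many_capitals": 1.0 if sum(ch.isupper() for ch in stripped) > 3 else 0.0,
    }


print(heuristic_features(
    "She made a strong case for the new policy at the meeting."))
```

Returning a dict of named scores keeps each heuristic separate, so they can be weighted and combined in a later step.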

  11. Weighting • For each sentence • Score on each heuristic • Weight scores • Add together weighted score • How to set weights?
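
The weighting step on this slide amounts to a weighted sum over the heuristic scores. A minimal sketch, assuming each sentence has already been reduced to a dict of named scores; the feature names and weight values below are invented for illustration and are not the weights used in the project.

```python
def weighted_score(features: dict, weights: dict) -> float:
    """Multiply each heuristic score by its weight and add the results."""
    # Heuristics with no entry in `weights` contribute nothing.
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


# Hypothetical weights: positive for good signs, negative for penalties.
weights = {"length_ok": 2.0, "is_sentence": 1.0, "numbers": -0.5}
features = {"length_ok": 1.0, "is_sentence": 1.0, "numbers": 2.0}
print(weighted_score(features, weights))
```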

  12. Machine learning • Two students: • Manually judged 1000 “good examples” • Weights • set to mimic students' choices
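
The slide does not say which learning method was used to set the weights. Purely as an illustration of the idea, the sketch below fits weights with a simple perceptron so that the weighted sum agrees with human good/bad labels. The perceptron, the function name, the feature names, and the toy data are all assumptions, not the authors' procedure.

```python
def fit_weights(examples, labels, epochs=20, learning_rate=0.1):
    """Perceptron: nudge weights whenever the prediction disagrees with the label.

    examples: list of dicts mapping heuristic name -> score
    labels:   list of 1 (judged good) or 0 (judged bad)
    """
    names = sorted({name for feats in examples for name in feats})
    weights = {name: 0.0 for name in names}
    bias = 0.0
    for _ in range(epochs):
        for feats, label in zip(examples, labels):
            activation = bias + sum(weights[n] * feats.get(n, 0.0) for n in names)
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # +1, 0, or -1
            if error:
                bias += learning_rate * error
                for n in names:
                    weights[n] += learning_rate * error * feats.get(n, 0.0)
    return weights, bias


# Toy training data standing in for the manually judged sentences.
examples = [
    {"length_ok": 1.0, "numbers": 0.0},
    {"length_ok": 0.0, "numbers": 1.0},
    {"length_ok": 1.0, "numbers": 0.0},
    {"length_ok": 0.0, "numbers": 2.0},
]
labels = [1, 0, 1, 0]
weights, bias = fit_weights(examples, labels)
print(weights, bias)
```

On this toy data the learned weight for `length_ok` comes out positive and the weight for `numbers` negative, which matches the direction of the heuristics.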

  13. Was it successful? • Did it save lexicographer time? • Definitely (says project manager) • Corpus choice • Started with BNC but • Too old • Not enough examples • If no good examples in corpus, GDEX can’t help • Changed to UKWaC • 20 times bigger; from web; contemporary • Better • Most web junk filtered out • Usually a good example in top twenty

  14. GDEX and TALC • TALC • Teaching and Language Corpora • Goal: bring corpora into language teaching • Usual problem • Concordances are tough for learners to read • Way forward • GDEX examples • Halfway between dictionary and corpus

  15. GDEX: Models for use • More examples for dictionaries • Speed up, as with MED or • Fully automatic “more examples” • Corpus query tool • Sort concordances, best first • Now an option in the Sketch Engine • Automatic collocations dictionary • http://forbetterenglish.com
