Evaluating word sketches and corpora

Adam Kilgarriff (Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex)


Presentation Transcript


  1. Evaluating word sketches and corpora
     Adam Kilgarriff
     Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex

  2. Word sketches
     • Over 10 years
       • Since 1999
     • Feedback
       • Good but anecdotal
     • Formal evaluation

  3. Goal
     • Collocations dictionary
       • Model: Oxford Collocations Dictionary
       • Publication-quality
     • Ask a lexicographer
       • For 42 headwords
       • For the 20 best collocates per headword
       • "Should we include this collocation in a published dictionary?"

  4. Sample of headwords
     Nouns, verbs, adjectives; random
     • High (top 3000)
       • N: space, solution, opinion, mass, corporation, leader
       • V: serve, incorporate, mix, desire
       • Adj: high, detailed, open, academic
     • Mid (3000-9999)
       • N: cattle, repayment, fundraising, elder, biologist, sanitation
       • V: grieve, classify, ascertain, implant
       • Adj: adjacent, eldest, prolific, ill
     • Low (10,000-30,000)
       • N: predicament, adulterer, bake, bombshell, candy, shellfish
       • V: slap, outgrow, plow, traipse
       • Adj: neoclassical, votive, adulterous, expandable
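
One way to draw a sample shaped like the one on this slide (three frequency bands, with six nouns, four verbs and four adjectives per band, 42 headwords in all) is a stratified random sample over a ranked lemma list. The sketch below is only an illustration: the function name, the `(rank, lemma, pos)` input format, the half-open band boundaries and the fixed seed are all assumptions, not details given in the presentation.

```python
import random

# Rank bands as read off the slide; treating rank 3000 as "high" only is an assumption.
BANDS = {"high": (1, 3000), "mid": (3001, 9999), "low": (10000, 30000)}
# Headwords per part of speech in each band, counted from the slide.
QUOTA = {"noun": 6, "verb": 4, "adj": 4}

def sample_headwords(freq_list, seed=0):
    """Draw a stratified random sample of headwords.

    freq_list: list of (rank, lemma, pos) tuples (hypothetical input format).
    Returns a dict mapping (band, pos) to a list of sampled lemmas.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for band, (lo, hi) in BANDS.items():
        for pos, k in QUOTA.items():
            pool = [lemma for rank, lemma, p in freq_list
                    if lo <= rank <= hi and p == pos]
            sample[(band, pos)] = rng.sample(pool, k)
    return sample
```

`rng.sample` raises `ValueError` if a band holds fewer lemmas of a given part of speech than the quota asks for, which is a useful check on the input list.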

  5. Precision and recall
     • We test precision
     • Recall is harder
       • How do we find all the collocations that the system should have found?
     • Current work
       • 200 collocates per headword
       • Selected from
         • All the corpora we have
         • Various parameter settings
       • Plus just-in-time evaluation for 'new' collocates
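
The recall idea on this slide can be sketched as pooling: gather candidate collocates from several runs (different corpora, different parameter settings), have them judged, and measure how many of the accepted ones a given system output contains. This is a minimal sketch under that reading; `pool_candidates` and `recall` are hypothetical names, and the slide does not say how the 200 collocates per headword are actually chosen from the pool.

```python
def pool_candidates(runs):
    """Union of candidate collocates over all runs for one headword.

    runs: dict mapping a run label (e.g. corpus plus parameter setting)
    to an iterable of collocates. The labels are illustrative only.
    """
    pool = set()
    for candidates in runs.values():
        pool.update(candidates)
    return pool

def recall(system_output, gold):
    """Fraction of gold-standard collocates that the system found."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold set is empty; recall is undefined")
    return len(gold & set(system_output)) / len(gold)
```

A collocate that a later run proposes but that is missing from the pool has no judgement yet, which is where the slide's "just-in-time evaluation" of new collocates would come in.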

  6. Four languages, three families
     • Dutch: ANW, 102m-word lexicographic corpus
     • English: UKWaC, 1.5b-word web corpus
     • Japanese: JpWaC, 400m-word web corpus
     • Slovene: FidaPlus, 620m-word lexicographic corpus

  7. User evaluation
     • Evaluate whole system
       • "Will it help with my task?"
       • E.g. preparing a collocations dictionary
     • Contrast: developer evaluation
       • "Can I make the system better?"
       • Evaluate each module separately
       • Current work

  8. Components
     • Grammar
     • NLP tools
       • Segmenter, lemmatiser, POS-tagger
     • Sketch grammar
     • Statistics

  9. Practicalities
     • Interface
       • Good, Good-but
         • Merge to good
       • Maybe, Maybe-specialised, Bad
         • Merge to bad
     • For each language
       • Two/three linguists/lexicographers
       • If they disagree
         • Don't use for computing performance
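
The scoring procedure on this slide can be sketched as follows: collapse the five interface labels to good/bad, drop any collocation on which the judges disagree, and report the share of good items among the rest. This is a sketch, not the author's code. The function name and input format are hypothetical, and checking agreement after merging (rather than on the raw labels) is an assumption the slide does not settle.

```python
# Label merging as given on the slide.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}

def precision(judgements):
    """Precision over collocations on which all judges agree.

    judgements: dict mapping (headword, collocate) to a list of labels,
    one per judge (hypothetical input format).
    """
    good = bad = 0
    for labels in judgements.values():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: excluded from the score
        if merged == {"good"}:
            good += 1
        else:
            bad += 1
    agreed = good + bad
    if agreed == 0:
        raise ValueError("no collocations with agreed judgements")
    return good / agreed
```

Because disagreements are excluded, the denominator is the number of agreed items, not the full set of collocates shown to the judges.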

  10. Results
      • Dutch: 66%
      • English: 71%
      • Japanese: 87%
      • Slovene: 71%

  11. Corpus evaluation
      • Collocation-finding
        • Typical corpus task
      • Recall
      • Hold all else constant
        • Statistic, NLP tools, grammar
      • Best results: best corpus (for collocation-finding)
      • Pomikalek: de-duplication
