
Corpora by Web Services

Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex

  1. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex. Corpora by Web Services. Kilgarriff: Corpora by Web Services

  2. Starting a PhD in NLP. Then: Prolog. Type in a few grammar rules, lexical entries, example sentences. We’re off!

  3. Now: Corpus. Which? Budget/schedule: how much can we afford? Hard disk space. Access software: build (big job, making it fast is hard) or research, acquire, install, maintain …

  4. Research question: morphology, syntax, discourse structure, semantics, anaphora. First six months at least: acquiring data, software. Complications.

  5. (no text on this slide)

  6. If you’re not super-geeky: Did I do it properly? Dumbing down: let’s choose an easier question. Looking over shoulder.

  7. Disappointment

  8. Making it easy: like picking up a hire car

  9. Corpora by web services. Possible? Already available.

  10. Sketch Engine. Corpus querying: fast, handles large corpora. In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert. Word sketches: data-driven summary of a word’s grammatical and collocational behaviour.

  11. (no text on this slide)

  12. Corpora: Arabic 174; Chinese 456; Czech 800; Dutch 128; English 5508; French 126; German 1627; Greek 149; Hindi 31; Indonesian 102; Irish 34; Italian 1910; Japanese 409; Norwegian 95; Persian 6; Portuguese 66; Romanian 53; Russian 188; Slovak 536; Slovene 738; Spanish 117; Swedish 114; Telugu 5; Thai 108; Vietnamese 174; Welsh 63

  13. Big, high-quality corpora. Big: performance (Banko and Brill 2004); there’s no data like more data; ample data for rare phenomena; big subcorpora (5b; medical: 30m).

  14. Quality. Bad data: spam, navigation bars, duplicates, lists, bungled formatting, wrong language … Less discussed: maybe a footnote; quick fixes and run.
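To make two of the problems on this slide concrete (duplicates and navigation-bar fragments), here is a minimal sketch of the kind of quick fix the slide alludes to. It is an illustration only, not the cleaning used for the corpora in this talk; the function name `clean_paragraphs` and the `min_words` threshold are assumptions made for the example.

```python
import re


def clean_paragraphs(paragraphs, min_words=5):
    """Drop very short fragments and exact duplicates (toy heuristic)."""
    seen = set()
    kept = []
    for para in paragraphs:
        # Normalise whitespace and case so trivially different copies match.
        normalised = re.sub(r"\s+", " ", para).strip().lower()
        # Very short paragraphs are often navigation bars or list debris.
        if len(normalised.split()) < min_words:
            continue
        # Skip paragraphs already seen in normalised form.
        if normalised in seen:
            continue
        seen.add(normalised)
        kept.append(para.strip())
    return kept


if __name__ == "__main__":
    raw = [
        "Home About Contact",
        "Several metals react violently with cold water.",
        "Several  metals react violently with cold water. ",
    ]
    print(clean_paragraphs(raw))
```

A length threshold and exact matching catch only the crudest cases; near-duplicates, spam and wrong-language text need more than this.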

  15. The Google/Yahoo/Bing option. Appeal: no setup costs; start googling today.

  16. But: limited hits per query; limited hits per day; sort order: 'unsorted' not possible; snippets too short for research; no (documented) morphology; limited query syntax.

  17. And: at mercy of commercial company; might change at any time; not replicable.

  18. So. Appeal: no setup costs. Serious research: many difficult practical issues; not a tool designed for linguists. Conclusion: if only SE indexes are big enough, yes; else no.

  19. Strategy. More languages: Corpus Factory, as Sharoff. Bigger and better (English): Big Web Corpus (BiWeC), 5.5b, fully processed, rich markup. New Model Corpus: collaboration model.

  20. TEDDCLOG: Taiwan English Data-Driven CLOze Generation, with Simon Smith and colleagues, Taipei. API case study.

  21. Cloze ('fill-the-gap'): Several metals _____ violently with cold water. A: behave B: react C: realise D: respond. Popular with students, teachers, testers. Unpopular with theorists :-(

  22. One objection: test item writers make them up; not naturally-occurring language (the Sinclair-Johns critique). Also: expensive. TEDDCLOG uses corpus sentences and distractors.

  23. Thesaurus module: behave, interact, respond; behave, realise, respond, react. Diffs module: metals behave x; metals respond x; metals realise x; metals react √. Concordance module: Several metals react violently with cold water. Text processing module: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond

  24. API calls. Find distractors: thesaurus. Find key-only collocate: sketch diffs (needs optimising). Find carrier sentence: concordance with GDEX module (Good Dictionary Example Finder).
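The three steps on this slide can be sketched end to end as follows. This is a hedged illustration, not TEDDCLOG's code and not the Sketch Engine API: the dictionaries `THESAURUS`, `SUBJECT_COLLOCATES` and `CONCORDANCE` are hypothetical in-memory stand-ins for the thesaurus, sketch-diffs and concordance/GDEX calls, filled with the talk's own "metals react" example, and all function names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-ins for the three web-service calls (toy data only).
THESAURUS = {"react": ["behave", "realise", "respond"]}
SUBJECT_COLLOCATES = {
    "react": {"metals"},
    "behave": set(),
    "realise": set(),
    "respond": set(),
}
CONCORDANCE = {
    ("react", "metals"): ["Several metals react violently with cold water."],
}


@dataclass
class ClozeItem:
    stem: str
    options: List[str]
    answer: str


def find_distractors(key: str, n: int = 3) -> List[str]:
    """Step 1: similar words from the (stubbed) thesaurus."""
    return THESAURUS.get(key, [])[:n]


def find_key_only_collocate(key: str, distractors: List[str]) -> Optional[str]:
    """Step 2: a collocate attested with the key but with no distractor."""
    for collocate in sorted(SUBJECT_COLLOCATES.get(key, ())):
        if all(collocate not in SUBJECT_COLLOCATES.get(d, set())
               for d in distractors):
            return collocate
    return None


def find_carrier_sentence(key: str, collocate: str) -> Optional[str]:
    """Step 3: first (stubbed) concordance line with key and collocate."""
    sentences = CONCORDANCE.get((key, collocate), [])
    return sentences[0] if sentences else None


def make_cloze(key: str) -> Optional[ClozeItem]:
    distractors = find_distractors(key)
    collocate = find_key_only_collocate(key, distractors)
    if collocate is None:
        return None
    sentence = find_carrier_sentence(key, collocate)
    if sentence is None or key not in sentence.split():
        return None
    # Blank out the key to form the stem; list options alphabetically.
    stem = sentence.replace(key, "_____", 1)
    return ClozeItem(stem, sorted(distractors + [key]), key)


if __name__ == "__main__":
    print(make_cloze("react"))
```

In a real setting each stub would be replaced by a call to the corresponding service, and the carrier sentence would be chosen by example quality rather than taken as the first hit.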

  25. Current status. TEDDCLOG: next phase, producing decent results. Corpora by Web Services: increasing server capacity; looking for users.

  26. Not just like picking up a hire car

  27. Not just like picking up a hire car: more like picking up a Ferrari

  28. Another announcement: DANTE. Lexical database for English: detailed, accurate, extensive. Highly corpus-driven. 3-year project, 18 expert lexicographers, led by Sue Atkins (BNC, FrameNet, Euralex, COBUILD...). English side of the New English-Irish dictionary. Available for NLP research imminently.
