
The Welsh Natural Language Toolkit WNLT & CYMRIE

Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope. Hypermedia Research Group. andreas.vlachidis@southwales.ac.uk http://hypermedia.research.southwales.ac.uk/kos/wnlt/ https://sourceforge.net/projects/wnlt/


Presentation Transcript


  1. The Welsh Natural Language Toolkit WNLT & CYMRIE • Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope • Hypermedia Research Group • andreas.vlachidis@southwales.ac.uk • http://hypermedia.research.southwales.ac.uk/kos/wnlt/ • https://sourceforge.net/projects/wnlt/

  2. Overview • Background • Toolkit - WNLT • Named Entity Pipeline - CYMRIE • Known Limitations • Future Directions

  3. Background • WNLT: funded by the Welsh-language technology and digital media grant • Aims: to develop a suite of open source software modules that enable Welsh-language computational linguistic applications

  4. Background • Builds on the General Architecture for Text Engineering (GATE) by adapting and expanding existing modules • Tokenizer • Sentence Splitter • Part of Speech Tagger • Lemmatizer • ANNIE (Gazetteer, NE Transducer) • http://www.gate.ac.uk

  5. The Toolkit • Adapting to Welsh: a series of steps that enable the processing of Welsh text • Algorithmic arrangements: adding new classes; expanding and overriding • Knowledge-based input: glossaries, lexicons, gazetteers; rules and configuration

  6. Tokenizer • Splits text into very simple tokens such as numbers, punctuation and words (upper case, lower case, orth) • Components: Classes (Java), Rules (RegExp), Post-process (JAPE)

  7. Adapting Tokenizer: Hyphenation • Place names: Llanarmon-yn-Ial • Commonly used prefix: cyd-ddefnyddir • Separate constituents: cybydd-dod

  8. Adapting Tokenizer: Apostrophe • Initial vowel loss (concatenation): Dw i'n hoffi (yn) • Medial vowel loss: i'engoed • Final consonant loss: cryf hapusa'

  9. Adapting Tokenizer • Ordinals: 1af, 2il, 3ydd • Special cases (compound prepositions): Ar gyfer (for) • Er mwyn (for the sake of) • Yn erbyn (against) • Oddi am (off, from) • Oddi ar (off)
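
The tokenizer adaptations on slides 7 to 9 can be illustrated with a minimal Python sketch. This is not the WNLT implementation (which uses Java classes, RegExp rules and JAPE post-processing); the clitic list, the extra ordinal suffixes (fed, ed, eg) and the decision to keep every hyphenated word as one token are assumptions made for illustration.

```python
import re

LETTER = r"[^\W\d_]"                    # any Unicode letter
CLITICS = r"(?:ch|th|n|r|i|u|m|w)"      # assumed list of apostrophe clitics ('n, 'r, ...)

# Compound prepositions from slide 9, kept together as single tokens.
COMPOUNDS = ["ar gyfer", "er mwyn", "yn erbyn", "oddi am", "oddi ar"]
compound_alt = "|".join(r"\s+".join(map(re.escape, c.split())) for c in COMPOUNDS)

TOKEN_RE = re.compile(
    rf"\b(?:{compound_alt})\b"              # compound prepositions
    rf"|\d+(?:af|il|ydd|fed|ed|eg)\b"       # ordinals such as 1af, 2il, 3ydd
    rf"|\d+"                                # plain numbers
    # words: hyphenated parts and medial apostrophes stay inside the token,
    # a clitic apostrophe is left for the next alternative,
    # a final apostrophe (hapusa') is attached to the word
    rf"|{LETTER}+(?:(?:-|'(?!{CLITICS}\b)){LETTER}+)*(?:'(?!{LETTER}))?"
    rf"|'{CLITICS}\b"                       # clitics split off as their own token
    rf"|[^\w\s]",                           # any other punctuation mark
    re.IGNORECASE,
)


def tokenize(text):
    """Return the list of token strings found in text."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]


print(tokenize("Dw i'n hoffi Llanarmon-yn-Ial."))
```

A regular expression alone cannot tell a word-final apostrophe from a closing quotation mark, so a real tokenizer would need a post-processing step for that case.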

  10. Sentence Splitter • Segments the text into sentences • Components: Classes (Java), Lexicon of Abbreviations

  11. Adapting Sentence Splitter • Uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds • Nearly 400 abbreviations

  12. Adapting Sentence Splitter: Abbreviations List • Linguistic: abs (absolute), cfst (synonym) • Narrative: Brth (British), e.e (for example) • Science: Seic (Psychology), Tiwt (Teutonic) • Spatial: Morg (Glamorgan) • Temporal: C.C (B.C), Mer (Wednesday)
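
The idea of using an abbreviation list to separate sentence-final full stops from other full stops can be sketched as follows. This is a simplified illustration, not the GATE sentence splitter: the set below holds only the nine abbreviations shown on the slide (the real list has nearly 400), and the example sentences are invented.

```python
import re

# Lower-cased abbreviations taken from the slide; the full WNLT list is much longer.
ABBREVIATIONS = {"abs", "cfst", "brth", "e.e", "seic", "tiwt", "morg", "c.c", "mer"}


def split_sentences(text):
    """Split text at ., ! or ? unless the full stop closes a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]+(?=\s+|$)", text):
        words = text[start:m.start()].split()
        last = words[-1] if words else ""
        if m.group(0) == "." and last.lower() in ABBREVIATIONS:
            continue  # abbreviation: not a sentence boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Cafodd ei eni ym Morg. yn 1950. Bu farw ddydd Mer. diwethaf."))
```

An abbreviation that happens to end a sentence is not detected by this sketch; handling that case needs extra evidence such as the capitalisation of the following word.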

  13. Part of Speech Tagger • Produces a part-of-speech tag as an annotation on each word • Components: Classes (Java), Lexicon (txt), RuleSet

  14. Part of Speech Tagger: EURFA Dictionary • Eurfa is the largest Welsh dictionary under a free license • Contains verbal inflections • Does not contain mutated forms of nouns • 210557 records • Records contain: Lemma, Part of Speech • http://www.eurfa.org.uk/

  15. Part of Speech Tagger: Mapping • EURFA tags: adj, adv, comp, cond, conj, dem, dim, f, fut, imper, hyp, imperf, infin, int, m, mf, neg, n, num, ord, past, pluperf, poss, pl, prep, preq, pron, quan, rel, sg, sp, subj, v • are mapped to • Hepple tags: CC CD DT EX FW IN JJ JJR JJS JJSS LS MD NN NNP NNPS NNS NP NPS PDT POS PP PRPR$ PRP PRP$ RB RBR RBS RP STAART SYM TO UH VBD VBG VBN VBP VB VBZ WDT WP$ WP WRB
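
The slide names the two tag sets but does not give the mapping itself. The sketch below shows one way such a mapping could be expressed; every individual correspondence in it (for example adj to JJ, or n plus pl to NNS) is a hypothetical choice made for illustration and is not taken from WNLT.

```python
# Hypothetical one-to-one correspondences between Eurfa and Hepple tags.
SIMPLE_MAP = {
    "adj": "JJ",
    "adv": "RB",
    "conj": "CC",
    "prep": "IN",
    "num": "CD",
    "int": "UH",
    "pron": "PRP",
}


def to_hepple(eurfa_tags):
    """Map the Eurfa tags of one dictionary record to a single Hepple tag.

    Returns None when no (assumed) rule applies.
    """
    tags = set(eurfa_tags)
    if "n" in tags:                      # nouns: number decides NN vs NNS
        return "NNS" if "pl" in tags else "NN"
    if "v" in tags:                      # verbs: only the past tense is distinguished here
        return "VBD" if "past" in tags else "VB"
    for tag in eurfa_tags:
        if tag in SIMPLE_MAP:
            return SIMPLE_MAP[tag]
    return None


print(to_hepple(["n", "f", "pl"]))
```

Because an Eurfa record combines several tags (category, gender, number, tense), a mapping of this kind has to look at the combination of tags rather than at each tag in isolation.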

  16. POS – Class and Lexicon

  17. Lemmatizer • Considering one token at a time, it identifies the token's lemma • Uses a range of techniques in cascading order to address the major mutations • Components: Classes (Java), Lexicon (txt), Rules (RegExp), Postprocess, Mutation, Validation

  18. Lemmatizer - Lexicon • The lexicon is invoked first

  19. Lemmatizer - Rules • The rules address some* plural forms • *Plural in Welsh is challenging

  20. Lemmatizer - Postprocess • The postprocess uses contextual rules for finding mutations (soft, aspirate, nasal) • Primarily addresses contact mutation • Reverts a mutated form to its 'original' form • The reverted form is validated via the gazetteer • Some mutation cases cannot be resolved contextually (e.g. soft mutation of G)

  21. Lemmatizer - Validation • The reverted form is validated against Eurfa • If it exists in the glossary, it is valid • Otherwise the reverted term is dropped • *Eurfa contains many nouns but not their mutated forms
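
The lemmatizer cascade (lexicon first, then reverting a mutated form and validating the result) can be sketched in Python. This is an illustration under assumptions, not the WNLT code: the lexicon is a toy dictionary standing in for Eurfa, the contextual trigger rules are left out so that every possible reversal is tried, and the dropped initial g of soft mutation is not handled (the slides note that this case cannot be resolved contextually).

```python
# Mutated initial -> possible radical initials (soft, nasal and aspirate mutation).
REVERSALS = [
    ("ngh", ["c"]), ("mh", ["p"]), ("nh", ["t"]), ("ng", ["g"]),   # nasal
    ("ph", ["p"]), ("th", ["t"]), ("ch", ["c"]),                   # aspirate
    ("dd", ["d"]), ("b", ["p"]), ("d", ["t"]), ("g", ["c"]),       # soft
    ("f", ["b", "m"]), ("l", ["ll"]), ("r", ["rh"]),               # soft
    ("m", ["b"]), ("n", ["d"]),                                    # nasal
]

# Toy lexicon (surface form -> lemma); a stand-in for the Eurfa dictionary.
LEXICON = {"cath": "cath", "cathod": "cath", "pen": "pen",
           "tad": "tad", "mam": "mam", "brawd": "brawd"}


def candidates(word):
    """Yield every form obtained by undoing one initial mutation."""
    for prefix, radicals in REVERSALS:
        if word.startswith(prefix):
            for radical in radicals:
                yield radical + word[len(prefix):]


def lemmatize(word, lexicon=LEXICON):
    """Return the lemma of word, or None if no validated form is found."""
    word = word.lower()
    if word in lexicon:                   # step 1: the lexicon is invoked first
        return lexicon[word]
    for form in candidates(word):         # step 2: revert a possible mutation
        if form in lexicon:               # step 3: validate the reverted form
            return lexicon[form]
    return None                           # unvalidated reversions are dropped


for form in ["gath", "nghath", "chath", "fam", "cathod"]:
    print(form, "->", lemmatize(form))
```

Validation against the lexicon is what keeps the over-generation harmless: a spurious candidate such as "bam" for "fam" is simply discarded because it is not a dictionary entry.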

  22. CYMRIE • Information Extraction system for Welsh • CYMRIE adapts ANNIE to Welsh • ANNIE (A Nearly-New Information Extraction System): GATE's IE system

  23. CYMRIE • CYMRIE is a modified version of ANNIE targeted at the requirements of Welsh-language input • A wide range of gazetteer lists • Named Entity Rules (NE Transducer) • CYMRIE does not currently include a co-reference resolution module

  24. CYMRIE - Gazetteer • The Gazetteer contains 70 Welsh lists • Newly introduced • Result of translation • 70674 unique entries (not including spelling variations) • 51 original ANNIE lists relating to person names, place names, company names etc.

  25. CYMRIE - Gazetteer • Major type groups: Date, Government, Facility, Job Title, Location, Organisation, Person, Stop-word, Time, Title • Minor type groups: Male, Female, Mountain, University, etc. • Vocabulary resources: TermCymru, the Welsh Assembly website, Wikipedia, National Gazetteer of Wales
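
How gazetteer entries with major and minor types turn into annotations can be sketched with a longest-match lookup over tokens. This is a generic illustration, not the GATE gazetteer: the four entries and their type labels are hypothetical examples chosen to resemble the type groups listed on the slide.

```python
# Hypothetical gazetteer: lower-cased token sequence -> (major type, minor type).
GAZETTEER = {
    ("abertawe",): ("location", "city"),
    ("caerdydd",): ("location", "city"),
    ("yr", "wyddfa"): ("location", "mountain"),
    ("prifysgol", "caerdydd"): ("organisation", "university"),
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def lookup(tokens):
    """Return (start, end, major, minor) annotations, preferring the longest match."""
    annotations, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                major, minor = GAZETTEER[key]
                annotations.append((i, i + n, major, minor))
                i += n          # continue after the matched span
                break
        else:
            i += 1              # no entry starts at this token
    return annotations


print(lookup(["Mae", "Prifysgol", "Caerdydd", "yn", "Abertawe"]))
```

Preferring the longest match means "Prifysgol Caerdydd" is annotated once as an organisation rather than leaving "Caerdydd" to be picked up separately as a city.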

  26. CYMRIE – NE Transducer • Named Entity Recognition (Semantic Tagger) • Adapting ANNIE's NE Transducer to Welsh • Person (gender: male, female) • Location (locType: region, airport, city, country, county, province) • Organization (orgType: company, department, government, newspaper) • Money / Percent • Date (kind: date, time, dateTime) • Address (kind: email, url, phone, postcode, complete, ip, other)

  27. CYMRIE – NE Transducer • The adapted rules addressed: • Syntactic behaviour (adjective after noun): Post Brenhinol; Cymdeithas Hanes Ysbyty Gogledd Cymru • Use of the definite article: Heddlu 'r Abertawe • Controlled vocabulary in rules: a, ac (and); San, Sant (saint) • Validation through noun phrase (proper nouns e.g. Mae)
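
The word-order point can be illustrated with a small rule sketch: in Welsh the organisation keyword comes first and its modifiers follow, so a rule anchors on the keyword and extends to the right. This is plain Python rather than JAPE, and both the keyword set and the linker set are hypothetical stand-ins for the gazetteer lists and controlled vocabulary that the real rules use.

```python
ORG_KEYS = {"Cymdeithas", "Heddlu", "Post", "Prifysgol"}   # hypothetical keyword list
LINKERS = {"a", "ac", "y", "yr", "'r"}                     # conjunctions and definite article


def find_organisations(tokens):
    """Return organisation names: a keyword followed by capitalised tokens,
    optionally joined by a linker word."""
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] in ORG_KEYS:
            j = end = i + 1
            while j < len(tokens):
                if tokens[j][:1].isupper():
                    j += 1
                    end = j                 # span grows only on capitalised tokens
                elif (tokens[j] in LINKERS and j + 1 < len(tokens)
                      and tokens[j + 1][:1].isupper()):
                    j += 1                  # linker is kept only if a name follows
                else:
                    break
            if end > i + 1:
                spans.append(" ".join(tokens[i:end]))
                i = end
                continue
        i += 1
    return spans


sentence = "Mae Cymdeithas Hanes Ysbyty Gogledd Cymru yn cwrdd".split()
print(find_organisations(sentence))
```

The sketch has no equivalent of the noun-phrase validation step mentioned on the slide, so a capitalised sentence-initial word that follows a keyword would be absorbed into the name.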

  28. CYMRIE – NE Transducer

  29. Performance - Evaluation • Gold Standard: 2221 tokens; 230 entities (Date, Location, Organisation, Percent, Person) • Results: • Tokenizer: Recall 99%, Precision 98%, F1 99% • POS: Recall 82%, Precision 81%, F1 81% • Lemma: Recall 80%, Precision 79%, F1 80% • NER: Recall 89%, Precision 86%, F1 87% • *Partial matches are weighted as "half-matches" (average mode)
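
The "average mode" scoring, where a partial match counts as half a match, can be written out as a short calculation. The function below is a sketch of that scheme; the counts in the usage line are invented for illustration and are not the counts behind the reported results.

```python
def scores(correct, partial, missing, spurious):
    """Precision, recall and F1 with partial matches weighted as half-matches."""
    matched = correct + 0.5 * partial
    responses = correct + partial + spurious     # everything the system produced
    keys = correct + partial + missing           # everything in the gold standard
    precision = matched / responses if responses else 0.0
    recall = matched / keys if keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented counts: 8 exact matches, 2 partial, 1 missed, 2 spurious.
p, r, f = scores(correct=8, partial=2, missing=1, spurious=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With these invented counts the matched total is 8 + 0.5 × 2 = 9, giving a precision of 9/12 and a recall of 9/11.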

  30. Known Limitations • The finite nature of knowledge resources (i.e. the dictionary) vs the non-finite nature of language • The role of contextual evidence in part-of-speech tagging, e.g. when "y" is a definite article and when a pronoun • Mutations beyond "simple" contact mutation ("transitive" relationship mutations, e.g. sosban fach wen)

  31. Known Limitations • Evaluation: currently the work is evaluated against a small Gold Standard • The Knowledge Base input is critical: enhancing Eurfa with additional terms and lemmas • Contextual analysis: combination of training (Machine Learning) and rules

  32. Future Directions • Improve performance of WNLT via: enhancing knowledge resources; improving Rules / RegExp files; adding context-driven POS rules • Expand the scope of the CYMRIE application: CorCenCC (Welsh Corpus) project?; other entities of interest

  33. Future Directions • Move into Social Media Analysis: Sentiment Analysis, Twitter feeds, etc. • Potential new modules: Co-reference, Noun Phraser, Verb Chunker

  34. Acknowledgements • Special thanks to • Welsh for Adults – Glamorgan Centre (USW) • Gareth Clee for helping with grammar • School of Welsh – Cardiff University • Jeremy Evas • Benjamin Screen for his help on evaluation and translation

  35. The Welsh Natural Language Toolkit WNLT & CYMRIE • Dr. Andreas Vlachidis, Dr. Daniel Cunliffe, Prof. Douglas Tudhope • Hypermedia Research Unit • andreas.vlachidis@southwales.ac.uk • http://hypermedia.research.southwales.ac.uk/kos/wnlt/ • https://sourceforge.net/projects/wnlt/
