
Dublin, April 3rd, 2009

MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages. Tomaž Erjavec, http://nl.ijs.si/et/, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia.


Presentation Transcript


  1. MULTEXT-East Version 4: multilingual morphosyntactic specifications for lots of languages • Tomaž Erjavec • http://nl.ijs.si/et/ • Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia • Dublin, April 3rd, 2009

  2. Overview of the talk • Part-of-speech tagging, tagsets and interoperability • MULTEXT(-East) morphosyntactic specifications • Languages, formats, transformations • An application: JOS resources for Slovene • Conclusions Dublin, 4.4.2009

  3. Part-of-speech tagging • The task of assigning the correct PoS tag to each word in a running text, e.g. Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB about/IN 123.5/CD million/CD additional/JJ Delmed/NNP common/JJ shares/NNS to/TO Fresenius/NNP … • Important HLT infrastructure • Very useful annotations for linguists • Some applications: • pre-processing step for further analyses: lemmas, syntactic structure, etc. • text indexing, e.g. nouns are more useful than verbs
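
The word/TAG notation on this slide is easy to process mechanically. The following is a minimal sketch, assuming whitespace-separated tokens with the tag after the last slash; the function name `parse_tagged` is illustrative and not part of any MULTEXT-East tool. It also shows the text-indexing idea of keeping only nouns.

```python
def parse_tagged(text):
    """Split whitespace-separated word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a token such as ",/," still parses.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = "Under/IN the/DT proposal/NN ,/, Delmed/NNP would/MD issue/VB"
pairs = parse_tagged(sentence)

# Text indexing: keep only the noun tags (NN, NNS, NNP, ...).
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```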

  4. Methods of PoS tagging • PoS tagging: • determine the ambiguity class of the word (saw → NN | VBD) • disambiguate to the correct tag in (local) context (“I saw/VBD a saw/NN”) • Tagger training: • manually annotated corpus: source of probabilities for tags given a (local) context + • (lexicon: gives possible tags for each word-form) • Popular taggers: • TnT (HMM tagger), TreeTagger (decision trees), TBL (transformation-based tagging) • Tagging usefulness as well as accuracy crucially depends on the tagset
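
The two tagging steps can be illustrated with a toy example. This is a greedy tag-bigram baseline, not the HMM, decision-tree or transformation-based methods used by the taggers named on the slide. The two training sentences are invented for illustration, and the `UNK` tag for unknown words is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Tiny hand-made "manually annotated corpus" (invented for illustration).
train = [
    [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("saw", "NN")],
    [("I", "PRP"), ("bought", "VBD"), ("a", "DT"), ("hammer", "NN")],
]

lexicon = defaultdict(set)  # word-form -> ambiguity class (set of possible tags)
bigrams = Counter()         # (previous tag, tag) -> count in the training corpus
for sent in train:
    prev = "<s>"
    for word, t in sent:
        lexicon[word].add(t)
        bigrams[(prev, t)] += 1
        prev = t


def tag_sentence(words):
    """Greedy left-to-right tagging with tag-bigram counts."""
    prev, out = "<s>", []
    for w in words:
        # Step 1: look up the ambiguity class (assumed "UNK" for unknown words).
        candidates = sorted(lexicon.get(w, {"UNK"}))
        # Step 2: disambiguate using the local context (the previous tag).
        best = max(candidates, key=lambda t: bigrams[(prev, t)])
        out.append((w, best))
        prev = best
    return out


result = tag_sentence(["I", "saw", "a", "saw"])
print(result)
```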

  5. English tagsets • Tagging was first developed for English (Brown, CLAWS, PTB tagsets) • English is an inflectionally very poor language → small tagsets, ~50 different tags • Tags are typically “synthetic”, i.e. the tag does not transparently map to features, e.g.: • to/TO (PoS?) • Delmed/NNP (number?) • shares/NNS (number?)

  6. Tagsets for other languages • will often have many more morphosyntactic features associated with a word, so tagsets will be larger • e.g. Slovene nouns: • type: common, proper • gender: masculine, feminine, neuter • number: singular, dual, plural • case: nom., gen., dat., acc., loc., ins. • (animacy: yes, no) • = 104 “PoS” tags just for nouns • Russian, Czech, Slovene: ~1000-2000 word-level syntactic tags

  7. PoS tags vs. MSDs • PoS tags: • used in corpora for corpus annotations / tagging • typically synthetic • Morphosyntactic Descriptions (MSDs): • used in inflectional lexica for lexical annotations / morphological analysis • typically analytic • Relation of PoS tagsets to MSD tagsets/features • in general: |PoS| < |MSD| • but in most MULTEXT-East languages: [PoS] ≡ [MSD]

  8. Developing a multilingual morphosyntactic framework • Interoperability: Tagsets developed for various languages (or even for the same language) have no connection with each other and are often poorly documented • Best practice: Languages that do not yet have a tagset could benefit from an operational framework in which to model it

  9. So, wouldn’t it be nice to have: • an open, standardised, documented, flexible model for MSD/PoS tagset design, • that would be instantiated for lots of languages, • and could be simply applied to any language?

  10. EU standardisation efforts • EAGLES: Expert Advisory Group for Language Engineering Standards (1993-1996) • MULTEXT: Multilingual Text Tools and Corpora (1995) • MULTEXT-East: MULTEXT for Central and Eastern European Languages: • Version 1: TELRI edition (1998) • Version 2: Concede edition (2002) • Version 3: TEI edition (2004) • Version 4: MondiLex edition (2009?) • ... • ISO / TC 37 / LMF / isoCat (2008)

  11. MULTEXT-East morphosyntactic resources • Basic Language Resource Kit: • specifications: define features and MSDs • lexica (~15,000 lemmas): triplets of word-form / lemma / MSD • parallel corpus: MSD and lemma annotated • Freely available for research: http://nl.ijs.si/ME/

  12. 1984: aligned and annotated

  13. MULTEXT-East languages

  14. The MULTEXT(-East) morphosyntactic specifications • They specify that e.g. “Ncmsn” • corresponds to the feature-structure [Noun, Type=common, Gender=masculine, Number=singular, Case=nominative] • is a valid MSD for Slovene • Specifications consist of • Front matter • Common part: common definitions for all languages (features) • Language particular parts: particulars for each language (MSD set)
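
The mapping from an MSD string to a feature-structure can be sketched as a positional lookup. The table below covers nouns only. The codes c, m, s, n are confirmed by “Ncmsn” on this slide, while the remaining one-character codes and the use of “-” for a non-applicable position are assumptions of this sketch and should be checked against the actual specifications.

```python
CATEGORIES = {"N": "Noun"}

# One (attribute, {code: value}) entry per position after the category letter.
# Codes other than c, m, s, n are assumed for illustration.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]
POSITIONS = {"N": NOUN_ATTRIBUTES}


def decode_msd(msd):
    """Expand an MSD such as 'Ncmsn' into (category, feature dictionary)."""
    category = msd[0]
    features = {}
    for (attribute, values), code in zip(POSITIONS[category], msd[1:]):
        if code != "-":  # "-" assumed to mark a non-applicable attribute
            features[attribute] = values[code]
    return CATEGORIES[category], features


print(decode_msd("Ncmsn"))
```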

  15. V4 specs draft in HTML

  16. Specifications in Version 4 • Encoded in XML / teiLite (in Version 3: LaTeX) • TEI = Text Encoding Initiative Guidelines P4 • Still “book-like” in form, to make authoring easier • XSLT into other formats: • HTML • tabular mapping formats (e.g. MSD to features) • XML/TEI feature library • (OWL)

  17. The common specifications • Define categories (“parts-of-speech”) • For each category define features, i.e. attributes and their values • For each attribute-value specify for which languages it is appropriate • Give positional mapping to MSDs: • each attribute assigned a position • each attribute-value assigned a one-character code
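
The positional mapping can also be read in the other direction, from features to an MSD string. The sketch below assumes a small hand-written noun table rather than the real common tables, and assumes that unset attributes are written as “-” with trailing hyphens dropped; `encode_msd` is an illustrative name.

```python
# Position order of attributes and the one-character code of each value.
# The table is a hand-written stand-in for the common specifications.
NOUN_TABLE = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "dual": "d", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d",
              "accusative": "a", "locative": "l", "instrumental": "i"}),
]


def encode_msd(category_code, table, features):
    """Build an MSD string from a feature dictionary using positional codes."""
    codes = [
        values[features[attribute]] if attribute in features else "-"
        for attribute, values in table
    ]
    # Trailing hyphens carry no information and are dropped (assumed convention).
    return (category_code + "".join(codes)).rstrip("-")


full = encode_msd("N", NOUN_TABLE, {"Type": "common", "Gender": "masculine",
                                    "Number": "singular", "Case": "nominative"})
partial = encode_msd("N", NOUN_TABLE, {"Type": "common", "Number": "plural"})
print(full, partial)
```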

  18. Common table (HTML)

  19. Common table (source XML/teiLite)

  20. Language particular sections • Recap the feature definitions for the language • Add “combinations”, i.e. feature co-occurrence restrictions • Add “lexicon”, i.e. list of all valid MSDs for the language • Possibly localise the features and codes • Possibly give notes and examples

  21. Combinations

  22. Lexicon

  23. JOS: Jezikoslovno označevanje slovenščine (Linguistic Annotation of Slovene) • http://nl.ijs.si/jos

  24. JOS as a bridge to MULTEXT-East Version 4 • [Diagram; box labels: FidaPLUS corpus, MTE V3 slv specifications, JOS corpora, JOS (slv) specifications, MTE V4 specifications, MTE V4 (slv) specifications]

  25. Dublin, 4.4.2009

  26. JOS specifications • XML/teiLite + XSLT transforms • Allow reordering of attribute positions (Vm-----d → Vmd) • i18n / slv+eng: • translation: specifications • localisation: attributes, values, codes • localisation: TEI element names

  27. Dublin, 4.4.2009

  28. Dublin, 4.4.2009

  29. MSD conversion tables • Tabular UTF-8 files • MSD-slv to -eng • MSD to features • Collating sequence • e.g.
      01N0101010100  Somei  Ncmsn
      01N0101010200  Somer  Ncmsg
      01N0101010300  Somed  Ncmsd

      Ncmsn  Noun  Type=common  Gender=masculine  Number=singular  Case=nominative  Animacy=0
      Ncmsg  Noun  Type=common  Gender=masculine  Number=singular  Case=genitive    Animacy=0
      Ncmsd  Noun  Type=common  Gender=masculine  Number=singular  Case=dative      Animacy=0
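
Tables of this shape are straightforward to load. The sketch below assumes whitespace-separated columns and reads the first table as collating key, Slovene MSD, English MSD, which is an interpretation of the slide rather than a documented file format. The rows are the three examples shown above.

```python
SLV_ENG_ROWS = """\
01N0101010100 Somei Ncmsn
01N0101010200 Somer Ncmsg
01N0101010300 Somed Ncmsd
"""

FEATURE_ROWS = """\
Ncmsn Noun Type=common Gender=masculine Number=singular Case=nominative Animacy=0
Ncmsg Noun Type=common Gender=masculine Number=singular Case=genitive Animacy=0
Ncmsd Noun Type=common Gender=masculine Number=singular Case=dative Animacy=0
"""

# Slovene MSD -> English MSD (column roles are assumed, see the text above).
slv_to_eng = {}
for line in SLV_ENG_ROWS.splitlines():
    collating_key, slv_msd, eng_msd = line.split()
    slv_to_eng[slv_msd] = eng_msd


def parse_feature_row(line):
    """Parse 'MSD Category Attr=value ...' into (msd, category, features)."""
    msd, category, *pairs = line.split()
    features = dict(pair.split("=", 1) for pair in pairs)
    return msd, category, features


msd_to_features = {}
for line in FEATURE_ROWS.splitlines():
    msd, category, features = parse_feature_row(line)
    msd_to_features[msd] = (category, features)

# Chain the two tables: Slovene MSD -> English MSD -> features.
category, features = msd_to_features[slv_to_eng["Somer"]]
print(category, features["Case"])
```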

  30. Adding a new language • XSLT scripts: • mtems-split.xsl: make a template for the language particular section of a new language • mtems-merge: merge a new language particular section to the common tables • Maybe shortly to be tested on new Slavic languages in the scope of MondiLex

  31. Critiques • It’s just an exercise in encoding anyway • Same is different, different is same • The Procrustean bed of standards • Policy change: from unification to harmonisation (hippy school)

  32. Conclusions • Presented work-in-progress on “standardisation” of multilingual morphosyntactic specifications • Specifications are a de-facto standard for several languages (Romanian, Slovene, Croatian) • Could serve as “hub” encoding for multilingual applications, e.g. MT • and as a framework for new languages

  33. Further work • Finishing MTE V4! • Distribution: LDC, ELDA • Relation to ISO-TC37 standards: • LMF, isoCAT • Connecting to GOLD ontology • Adding new languages: • Slavic completion • Western European: MULTEXT • Japanese: chasen tagset, jpWaC(-L2) • Irish? ☺
