
The FIDA & MULTEXT-East language resources

Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana. tomaz.erjavec@ijs.si, http://nl.ijs.si/et/. Gralis 2006, Institut für Slawistik der Universität Graz, 2006-05-09.


Presentation Transcript


  1. The FIDA & MULTEXT-East language resources Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana tomaz.erjavec@ijs.si, http://nl.ijs.si/et/ Gralis 2006 Institut für Slawistik der Universität Graz 2006-05-09

  2. Overview • Background • FIDA: a reference corpus of Slovene • MULTEXT-East: morphosyntactic resources for Central and East-European languages • Other language resources for Slovene Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute

  3. Language Resources • LRs comprise three layers of data: • corpora: mono- or multilingual, reference or specialised, … /variously annotated/ • lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies) • standards: linguistic and technical encoding • LRs, esp. corpora, are used for empirical language research: • linguistic studies: (annotated) corpus + (sophisticated) search engine • human language technology R&D: testing and training datasets

  4. Part I. The FIDA corpus: Slovene reference corpus for linguistic studies

  5. FIDA http://www.fida.net/ Joint project (1997-2000) of • Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar • Institut Jožef Stefan: Tomaž Erjavec • DZS: Simon Krek • Amebis: Peter Holozan, Miro Romih Financed by industry partners

  6. Characteristics of FIDA • monolingual • synchronic • written language • reference • representative • balanced • annotated

  7. Sizes • Total: 103,513,072 words, 29,177 texts • Avg. text length: 3,548 words • Largest texts: Leksikon DZS: 508,370 words; 69 texts > 100,000 words • Smallest texts: 2,648 texts < 100 words; 2 x <w>rezgrtshdrghgth4</w>

  8. Time Composition • Oldest/most recent text: 1989/2000 • Average date: 1997-02 • Texts/Words with unknown date: 3.94%/8.28%

  9. FIDA taxonomy: publication types
     …
     Ft.P.P.O (published) 95.72%
     Ft.P.P.O.K (books) 22.71%
     Ft.P.P.O.P (periodicals) 70.50%
     Ft.P.P.O.P.C (newspaper) 46.59%
     Ft.P.P.O.P.C.D (daily) 32.67%
     Ft.P.P.O.P.C.T (weekly) 66.18%
     Ft.P.P.O.P.C.V (multi-weekly) 17.74%
     …

  10. FIDA taxonomy: text types
     Ft.Z (text type) 99.47%
     Ft.Z.N (non-fiction) 93.57%
     Ft.Z.N.N (non-professional) 75.14%
     Ft.Z.N.S (professional) 18.37%
     Ft.Z.N.S.H (hum. & soc. sci.) 10.57%
     Ft.Z.N.S.N (nat. & tech. sci.) 6.04%
     Ft.Z.U (fiction) 5.90%
     Ft.Z.U.D (drama) 0.10%
     Ft.Z.U.P (poetry) 0.17%
     Ft.Z.U.R (prose) 5.12%
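The dotted taxonomy codes encode a hierarchy: each code's parent is the code with its last component removed. The sketch below copies the text-type shares from the slide and derives parent and child relations from the codes alone. The helper names (`parent`, `children`, `remainder`) are hypothetical, and reading the leftover share as material classified only at the parent level is an assumption, since the slide does not say what the percentages are measured against.

```python
# Text-type shares copied from the slide; the hierarchy is read off the dotted codes.
TEXT_TYPES = {
    "Ft.Z": ("text type", 99.47),
    "Ft.Z.N": ("non-fiction", 93.57),
    "Ft.Z.N.N": ("non-professional", 75.14),
    "Ft.Z.N.S": ("professional", 18.37),
    "Ft.Z.N.S.H": ("hum. & soc. sci.", 10.57),
    "Ft.Z.N.S.N": ("nat. & tech. sci.", 6.04),
    "Ft.Z.U": ("fiction", 5.90),
    "Ft.Z.U.D": ("drama", 0.10),
    "Ft.Z.U.P": ("poetry", 0.17),
    "Ft.Z.U.R": ("prose", 5.12),
}


def parent(code):
    """Return the parent code, or None if the parent is not in the table."""
    head = code.rpartition(".")[0]
    return head if head in TEXT_TYPES else None


def children(code):
    """Return the direct subcategories of a code, in table order."""
    return [c for c in TEXT_TYPES if parent(c) == code]


def remainder(code):
    """Share of `code` not covered by its listed subcategories (hypothetical helper)."""
    covered = sum(TEXT_TYPES[c][1] for c in children(code))
    return round(TEXT_TYPES[code][1] - covered, 2)


if __name__ == "__main__":
    for code in ("Ft.Z", "Ft.Z.N", "Ft.Z.N.S", "Ft.Z.U"):
        print(code, children(code), remainder(code))
```

With the figures as given, the two children of Ft.Z account for its full share, while the deeper levels leave small remainders.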

  11. Markup of FIDA • corpus elements annotated with meta-data (bibliographic, taxonomy) • text linguistically annotated • encoded according to international standards and recommendations • technical: SGML, TEI P3 • linguistic: MULTEXT-East (MULTEXT, EAGLES)

  12. Linguistic annotation

  13. Accessibility Exploitation by partners: • DZS: new dictionaries • Amebis: development of HLT • Arts faculty: teaching • IJS: research on HLT Availability to the public: • access via concordance engine by Amebis • free access, but displays only a few hits • possibility of academic licences FIDA (web site) no longer maintained!

  14. FIDA+ http://www.fidaplus.net/ • FIDA Plus project: • Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan • DZS, Amebis • Financed by the ministry + industry partners • Extend the corpus with • Web materials • spoken component • Better linguistic markup • Free concordances: up to 100 lines • Also possibility of licences

  15. Concordancer

  16. Output

  17. Extended searches

  18. Corpus “Nova Beseda” http://bos.zrc-sazu.si/ • being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin) • Web concordancer with no hit limit • now larger than FIDA • but much less varied: fiction, Delo, DZ • not linguistically annotated

  19. Part II. MULTEXT-East: multilingual morphosyntactic resources for HLT development

  20. MULTEXT-East resources • MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997) Multilingual Texts and Corpora for Eastern and Central European Languages • Based on the results of EU MULTEXT (~West) • To produce a harmonised BLARK for six languages: • corpus encoding standardisation (TEI / CES) • multilingual parallel, comparable, speech corpora • morphosyntactic specifications (EAGLES / MULTEXT) • (inflectional) lexicon • annotated corpus • language processing tools

  21. History of MULTEXT-East resources • First release 1998 on TELRI CD-ROM Vol II: already extended with new languages • Resources since 1998 available on the Web: http://nl.ijs.si/ME/ • Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation • Third release 2004: merge of first two releases, further languages • Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

  22. The Languages of MULTEXT-East • Germanic: English • Romance: Romanian • Baltic: • Latvian • Lithuanian • Finno-Ugric: • Estonian • Hungarian • Slavic: • Russian (East Slavic) • Czech (West Slavic) • Slovene (South West Slavic) • Resian (Slovene dialect) • Croatian (South West Slavic) • Serbian (South West Slavic) • Bulgarian (South East Slavic) • In progress: • Macedonian • Persian

  23. Version 3 • Available on http://nl.ijs.si/ME/V3/ • Some parts completely free, others free for research → Web licence • Web pages give: • extensive documentation • bibliography list • web licence form • resource download

  24. The MULTEXT morphosyntactic trinity • MULTEXT-East morphosyntactic specifications • MULTEXT-East morphosyntactic lexica • MULTEXT-East morphosyntactically annotated "1984" corpus

  25. 1. Morphosyntactic specifications • Based on EAGLES / MULTEXT • Define PoS, their attributes and values • The specs are a document containing: • introduction • common tables • language-particular sections • Written in LaTeX → PDF & HTML • Derived XML/TEI encoding as feature structures
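Because the specifications define each part of speech together with its attributes and values, a compact tag such as Ncfsn can be read position by position: the first character is the category, and each following character is the value of one attribute. The sketch below illustrates the idea for nouns only. The table is a small illustrative subset written for this example, not the official specification, and `decode_msd` is a hypothetical helper name.

```python
# Illustrative subset of a positional attribute table, covering nouns only.
# The real specifications define many more categories, attributes, and values.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]

CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into attribute-value pairs."""
    category, attributes = CATEGORIES[msd[0]]
    features = {"PoS": category}
    for code, (name, values) in zip(msd[1:], attributes):
        if code == "-":  # a hyphen marks a position that does not apply
            continue
        features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Ncfsn"))
```

Keeping the table as data rather than code mirrors the way the specifications separate common tables from language-particular sections: supporting another category would mean adding an entry to `CATEGORIES`, not changing the decoder.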

  26. Example common table

  27. Example language-specific section • table (shows only categories actually used) • notes • combinations • lexicon • for Slovene (FIDA): localisation of category names

  28. Morphosyntactic Complexity

  29. 2. The lexica • Medium-size morphosyntactic lexica • Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian • ~ all word-forms of ca. 15,000 lemmas • A lexical entry is composed of three fields: • the word-form: the inflected form of the word • the lemma: the base-form of the word • the morphosyntactic description (MSD)

  30. Example: Slovene lexicon
     abeced abeceda Ncfdg
     abeced abeceda Ncfpg
     abeceda = Ncfsn
     abecedah abeceda Ncfdl
     abecedah abeceda Ncfpl
     abecedam abeceda Ncfpd
     abecedama abeceda Ncfdd
     abecedama abeceda Ncfdi
     abecedami abeceda Ncfpi
     abecede abeceda Ncfpa
     abecede abeceda Ncfpn
     abecede abeceda Ncfsg
     abecedi abeceda Ncfda
     abecedi abeceda Ncfdn
     …
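Each lexicon line carries the three fields named on the previous slide: word-form, lemma, and MSD. In the example, "=" in the lemma field stands for a lemma identical to the word-form. A minimal reading sketch follows, assuming whitespace-separated fields as they appear in this transcript (the distributed files may use a different separator). The names `parse_lexicon` and `analyses_by_form` are hypothetical.

```python
from collections import defaultdict

# A few entries copied from the Slovene example.
SAMPLE = """\
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
"""


def parse_lexicon(text):
    """Parse 'word-form lemma MSD' lines into (form, lemma, msd) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, lemma, msd = line.split()
        if lemma == "=":  # "=" means the lemma equals the word-form
            lemma = form
        entries.append((form, lemma, msd))
    return entries


def analyses_by_form(entries):
    """Group the possible (lemma, MSD) analyses under each word-form."""
    index = defaultdict(list)
    for form, lemma, msd in entries:
        index[form].append((lemma, msd))
    return dict(index)


if __name__ == "__main__":
    for form, analyses in analyses_by_form(parse_lexicon(SAMPLE)).items():
        print(form, analyses)
```

Grouping by word-form makes the ambiguity visible: "abeced" has two analyses here (dual and plural genitive), which is the kind of ambiguity a tagger trained on the annotated corpus has to resolve in context.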

  31. Lexicon sizes

  32. 3. The “1984” corpus • Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…)) • Structurally annotated • Sentence aligned with English • Words annotated with lemma and MSD • Encoded in TEI P4 (XML)

  33. Example linguistic encoding: sentence alignment & context-disambiguated lemmas and MSDs
     <text id="Osl." lang="sl">
     <body>
     <div type="part" id="Osl.1">
     <div type="chapter" id="Osl.1.2">
     <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
     <w lemma="biti" ana="Vcps-sma">Bil</w>
     <w lemma="biti" ana="Vcip3s--n">je</w>
     <w lemma="jasen" ana="Afpmsnn">jasen</w>
     <c>,</c>
     <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
     <w lemma="aprilski" ana="Aopmsn">aprilski</w>
     <w lemma="dan" ana="Ncmsn">dan</w>
     <w lemma="in" ana="Ccs">in</w>
     <w lemma="ura" ana="Ncfpn">ure</w>
     <w lemma="biti" ana="Vcip3p--n">so</w>
     <w lemma="biti" ana="Vmps-pfa">bile</w>
     <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
     <c>.</c>
     </s>
     …
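In this encoding every word is a `<w>` element whose `lemma` and `ana` attributes hold the disambiguated lemma and MSD, and punctuation sits in `<c>` elements. The sketch below reads those annotations with Python's standard `xml.etree.ElementTree`. It parses only a single `<s>` element, shortened from the slide, because the excerpt shown is truncated and not well-formed as a whole. The function name `tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# One sentence element, shortened from the example on the slide.
SENTENCE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def tokens(sentence_xml):
    """Return (text, lemma, msd) per token; punctuation has no lemma or MSD."""
    root = ET.fromstring(sentence_xml)
    result = []
    for element in root:
        if element.tag == "w":
            result.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            result.append((element.text, None, None))
    return result


if __name__ == "__main__":
    for text, lemma, msd in tokens(SENTENCE):
        print(text, lemma, msd)
```

The sentence `id` (here Osl.1.2.2.1) is available as `root.get("id")` and is the natural key for looking up the aligned sentence in another language version.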

  34. Quantifying the corpus

  35. Utility of MULTEXT-East LRs • Specifications became, for some, the “national” standard • Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP • A base dataset for further annotation and experiments: • Word-sense disambiguation • WordNet development and evaluation • Syntactic parser induction • Teaching aid in HLT courses • ~ 100 registered users • As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

  36. LRs @ JSI http://nl.ijs.si/nl.html#Resource Also ours: VAYNA, GORE, sloWNet Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

  37. Overview of Slovene LRs and services @ Slovenian Language Technologies Society http://nl.ijs.si/sdjt/

  38. Thank you!
