
Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa / Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora / Application of MULTEXT-East and TEI in the compilation of parallel corpora



  1. Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa / Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora / Application of MULTEXT-East and TEI in the compilation of parallel corpora
     Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
     tomaz.erjavec@ijs.si, http://nl.ijs.si/et/

  2. Overview
     • The need for standardisation
     • Corpus encoding in TEI
     • MULTEXT-East morphosyntactic descriptions

  3. Why standards (for digital language resources)?
     • public documentation (+ software)
     • (semi)automated validation
     • application independent
     • platform independent
     • do not become obsolete (as fast)
     • However:
       • they demand time to understand and use
       • there are (too) many and not all are accepted
       • they are not perfectly tuned to the application (overhead)

  4. TEI: the Text Encoding Initiative
     • TEI Guidelines are a vocabulary to describe text for scholarly purposes
     • They consist of:
       • XML schemas
       • documentation
     • P3 (1994), P4 (2002), P5 (0.9, 2007)
     • being developed by the TEI Consortium
     • large user base, web site, mailing list, tutorials, yearly meetings
     • increasingly popular for digital libraries, text-critical editions, …, to a certain extent for corpora

  5. Jp-Sl dictionary
     <entry id="jaslo.113">
       <form type="hw">
         <orth type="roma">akeru</orth>
         <orth type="kana">あける</orth>
         <orth type="kanji">開ける</orth>
       </form>
       <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
       <form type="infl">
         <orth type="v-masu">あけます</orth>
         <orth type="v-te">あけて</orth>
         <orth type="v-nai">あけない</orth>
       </form>
       <trans><tr>odpreti</tr></trans>
       <eg><q>穴(あな)をあける</q> <tr>narediti luknjo</tr></eg>
       <eg><q>窓(まど)を開ける</q> <tr>odpreti okno</tr></eg>
       <xr type="related">
         <lbl>prim.</lbl> <ref>開く(あく)</ref> <lbl>intr.</lbl>
       </xr>
       <usg type="level">4</usg>
     </entry>

  6. Example: MULTEXT-East “1984”, Serbian
     <text id="mteo-sr." lang="sr">
       <body id="Osr" lang="sr">
         <div id="Osr.1" n="1" type="part">
           <head>Prvi deo</head>
           <div id="Osr.1.2" n="1" type="chapter">
             <head>1.</head>
             <p id="Osr.1.2.2">
               <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
               <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
             </p>
             …

  7. MULTEXT-East
     • MULTEXT-East: EU Project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
     • Based on the results of EU MULTEXT (~West)
     • To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
       • morphosyntactic specifications (EAGLES / MULTEXT)
       • morphosyntactically annotated parallel corpus
       • inflectional lexica
       • multilingual comparable, speech corpora
       • language processing tools

  8. History of MULTEXT-East resources
     • First release 1998 on CD-ROM: already extended with new languages
     • Resources since 1998 available on the Web: http://nl.ijs.si/ME/
     • Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
     • Third release 2004: merge of first two releases, further languages
     • Fourth release 2007 (?)

  9. The Languages of MULTEXT-East
     • Slavic:
       • Russian (East Slavic)
       • Czech (West Slavic)
       • Slovene (South West Slavic)
         • Resian (Slovene dialect)
       • Croatian (South West Slavic) -- Marko Tadić
       • Serbian (South West Slavic) -- C. Krstev, D. Vitas
       • Bulgarian (South East Slavic)
     • In progress:
       • Macedonian
       • Persian
     • Germanic: English
     • Romance: Romanian
     • Baltic:
       • Latvian
       • Lithuanian
     • Finno-Ugric:
       • Estonian
       • Hungarian
     • (BalkaNet: Greek, Turkish)

  10. The MULTEXT morphosyntactic trinity
      • MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
      • MULTEXT-East morphosyntactic lexica (Serbian)
      • MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

  11. 1. Morphosyntactic specifications
      • Based on EAGLES / MULTEXT
      • Define PoS, their attributes and values
      • The specs are a document containing:
        • introduction
        • common tables
        • language-particular sections
      • Written in LaTeX → PDF & HTML
      • Derived XML/TEI encoding as feature structures
      • In Version 4, specifications to be fully in TEI/XML
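The idea that the specifications define PoS categories with attributes and values can be illustrated by decoding an MSD string position by position. The following is only a minimal sketch: the attribute table is a simplified, hypothetical excerpt for nouns (the attribute order and value codes are assumptions, not copied from the specifications), and `decode_msd` is a name introduced here for illustration. It also assumes that a hyphen marks an attribute that does not apply.

```python
# Hypothetical, simplified excerpt of a noun attribute table; the real
# specifications cover more categories, attributes and values per language.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# Category code (first character of the MSD) -> (name, attribute table).
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into attribute-value pairs."""
    category_code, value_codes = msd[0], msd[1:]
    if category_code not in CATEGORIES:
        raise ValueError(f"category {category_code!r} not in this sketch")
    name, attributes = CATEGORIES[category_code]
    features = {"PoS": name}
    # Each position after the category code corresponds to one attribute.
    for (attribute, values), code in zip(attributes, value_codes):
        if code == "-":  # assumed: hyphen means "attribute not applicable"
            continue
        if code not in values:
            raise ValueError(f"unknown value {code!r} for {attribute}")
        features[attribute] = values[code]
    return features


print(decode_msd("Ncfsn"))
```

Under these assumptions, `Ncfsn` reads as a common feminine noun in the nominative singular; a fuller table would be needed for verbs, adjectives and the other categories.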

  12. Example common table

  13. Example language specific table

  14. 2. The lexica
      • Medium-size morphosyntactic lexica
      • Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian
      • ~ all word-forms of ca. 15,000 lemmas
      • A lexical entry is composed of three fields:
        • the word-form: the inflected form of the word
        • the lemma: the base-form of the word
        • the morphosyntactic description (MSD)

  15. Example: Slovene lexicon
      abeced      abeceda   Ncfdg
      abeced      abeceda   Ncfpg
      abeceda     =         Ncfsn
      abecedah    abeceda   Ncfdl
      abecedah    abeceda   Ncfpl
      abecedam    abeceda   Ncfpd
      abecedama   abeceda   Ncfdd
      abecedama   abeceda   Ncfdi
      abecedami   abeceda   Ncfpi
      abecede     abeceda   Ncfpa
      abecede     abeceda   Ncfpn
      abecede     abeceda   Ncfsg
      abecedi     abeceda   Ncfda
      abecedi     abeceda   Ncfdn
      …
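The three-field entry format (word-form, lemma, MSD) lends itself to a very small reader. The sketch below assumes whitespace-separated fields and assumes that `=` in the lemma field means "lemma identical to the word-form", which is how the `abeceda` line in the sample reads; `LexEntry`, `parse_lexicon` and `analyses` are names introduced here for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str
    lemma: str
    msd: str


def parse_lexicon(lines):
    """Parse whitespace-separated 'word-form lemma MSD' lines."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # skip blank or malformed lines
            continue
        word_form, lemma, msd = fields
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = word_form
        entries.append(LexEntry(word_form, lemma, msd))
    return entries


def analyses(entries):
    """Group the possible (lemma, MSD) readings by word-form."""
    by_form = defaultdict(list)
    for entry in entries:
        by_form[entry.word_form].append((entry.lemma, entry.msd))
    return dict(by_form)


SAMPLE = """\
abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
"""

entries = parse_lexicon(SAMPLE.splitlines())
readings = analyses(entries)
print(readings["abecede"])
```

Grouping by word-form makes the ambiguity visible: a form such as `abecede` carries several MSDs out of context, which is what the annotated corpus later disambiguates.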

  16. 3. The “1984” corpus
      • Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
      • Structurally annotated
      • Sentence-aligned with English
      • Words annotated with lemma and MSD
      • Encoded in TEI P4 (XML)

  17. Example linguistic encoding
      Context-disambiguated lemmas and MSDs
      <text id="Osl." lang="sl">
        <body>
          <div type="part" id="Osl.1">
            <div type="chapter" id="Osl.1.2">
              <p id="Osl.1.2.2">
                <s id="Osl.1.2.2.1">
                  <w lemma="biti" ana="Vcps-sma">Bil</w>
                  <w lemma="biti" ana="Vcip3s--n">je</w>
                  <w lemma="jasen" ana="Afpmsnn">jasen</w>
                  <c>,</c>
                  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
                  <w lemma="aprilski" ana="Aopmsn">aprilski</w>
                  <w lemma="dan" ana="Ncmsn">dan</w>
                  <w lemma="in" ana="Ccs">in</w>
                  <w lemma="ura" ana="Ncfpn">ure</w>
                  <w lemma="biti" ana="Vcip3p--n">so</w>
                  <w lemma="biti" ana="Vmps-pfa">bile</w>
                  <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
                  <c>.</c>
                </s>
                …
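The word-level markup above can be read with any XML parser. As a rough sketch, the following uses Python's standard `xml.etree.ElementTree` on a shortened, closed copy of the sentence (the slide's fragment is truncated, so it would not parse as given); the function name `tokens` and the choice to return `None` for the lemma and MSD of punctuation are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sentence shown on the slide.
SAMPLE = """
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""


def tokens(sentence_xml):
    """Return (form, lemma, MSD) triples for one <s> element."""
    root = ET.fromstring(sentence_xml.strip())
    result = []
    for element in root:
        if element.tag == "w":    # word: lemma and MSD are attributes
            result.append((element.text,
                           element.get("lemma"),
                           element.get("ana")))
        elif element.tag == "c":  # punctuation: no lemma or MSD
            result.append((element.text, None, None))
    return result


for form, lemma, msd in tokens(SAMPLE):
    print(form, lemma, msd)
```

Keeping the sentence `id` alongside these triples would be the natural next step when working with the sentence-aligned parallel texts.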

  18. Utility of MULTEXT-East LRs
      • Specifications became, for some, the “national” standard
      • Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
      • A base dataset for further annotation and experiments:
        • Word-sense disambiguation
        • WordNet development and evaluation
        • Syntactic parser induction
      • Teaching aid in HLT courses
      • ~ 100 registered users
      • As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

  19. Corpora using TEI+MULTEXT-East
      • Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
      • Croatian National Corpus: HNK (100Mw)
      • Various Romanian corpora, …
      • En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

  20. Conclusions
      • TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
      • MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
      • Both have already been used for a number of corpora
      • Maybe also for BKS?

  21. Thank you!
