Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa

Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora

Application of MULTEXT-East and TEI in the compilation of parallel corpora

Tomaž Erjavec

Department of Knowledge Technologies

Jožef Stefan Institute, Ljubljana

http://nl.ijs.si/et/

Overview
  • The need for standardisation
  • Corpus encoding in TEI
  • MULTEXT-East morphosyntactic descriptions

Why standards (for digital language resources)?
  • public documentation (+ software)
  • (semi)automated validation
  • application independent
  • platform independent
  • do not become obsolescent (as fast)
  • However:
    • demand time to understand and use them
    • there are (too) many and not all are accepted
    • they are not perfectly tuned to application (overhead)

TEI: the Text Encoding Initiative
  • TEI Guidelines are a vocabulary to describe text for scholarly purposes
  • They consist of:
    • XML schemas
    • documentation
  • P3 (1994), P4 (2002), P5 (0.9, 2007)
  • being developed by the TEI Consortium
  • large user base, web site, mailing list, tutorials, yearly meetings
  • increasingly popular for digital libraries, text-critical editions, …, and to a certain extent for corpora

Jp-Sl dictionary

<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <form type="infl">
    <orth type="v-masu">あけます</orth>
    <orth type="v-te">あけて</orth>
    <orth type="v-nai">あけない</orth>
  </form>
  <trans><tr>odpreti</tr></trans>
  <eg><q>穴(あな)をあける</q> <tr>narediti luknjo</tr></eg>
  <eg><q>窓(まど)を開ける</q> <tr>odpreti okno</tr></eg>
  <xr type="related">
    <lbl>prim.</lbl> <ref>開く(あく)</ref> <lbl>intr.</lbl>
  </xr>
  <usg type="level">4</usg>
</entry>
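
Because the entry is plain XML, it can be queried with any XML library. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `summarise_entry` and the shortened `ENTRY` string are illustrative assumptions, not part of the dictionary project's own tooling.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the dictionary entry shown above (examples and
# inflected forms left out to keep the sketch small).
ENTRY = """<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <trans><tr>odpreti</tr></trans>
</entry>"""


def summarise_entry(xml_text):
    """Collect headword spellings, part of speech and translations."""
    entry = ET.fromstring(xml_text)
    headword = entry.find("form[@type='hw']")
    # Map each orthography type (roma, kana, kanji) to its spelling.
    forms = {orth.get("type"): orth.text for orth in headword.findall("orth")}
    return {
        "id": entry.get("id"),
        "forms": forms,
        "pos": entry.findtext("gramGrp/pos"),
        "translations": [tr.text for tr in entry.findall("trans/tr")],
    }


summary = summarise_entry(ENTRY)
print(summary)
```

The same path expressions would work on a whole dictionary file by iterating over its `entry` elements.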

Example: MULTEXT-East “1984”, Serbian

<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprečio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
        </p>
      </div>
    </div>
  </body>
</text>
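
The structural markup makes it easy to address individual sentences by their identifiers. The sketch below, again using Python's standard `xml.etree.ElementTree`, pulls sentence text and marked-up names out of a paragraph; the helper name `sentences` is hypothetical and the second sentence of `SAMPLE` is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# Paragraph from the Serbian "1984" sample; the second sentence is shortened.
SAMPLE = """<p id="Osr.1.2.2">
<s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima
je izbijalo trinaest.</s>
<s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u
nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju
stambene zgrade <hi rend="it">Pobeda</hi>.</s>
</p>"""


def sentences(xml_text):
    """Map each sentence id to its plain text, ignoring inline markup."""
    root = ET.fromstring(xml_text)
    result = {}
    for s in root.iter("s"):
        # itertext() yields the text inside nested elements such as <name>;
        # split/join collapses the line breaks of the source layout.
        result[s.get("id")] = " ".join("".join(s.itertext()).split())
    return result


by_id = sentences(SAMPLE)
names = [n.text for n in ET.fromstring(SAMPLE).iter("name")]
print(by_id["Osr.1.2.2.1"])
print(names)
```

Stable sentence ids of this kind are what a sentence alignment with another language version would point at.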

MULTEXT-East
  • MULTEXT-East: EU Project (1995-1997) Multilingual Texts and Corpora for Eastern and Central European Languages
  • Based on the results of EU MULTEXT (~West)
  • To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
    • morphosyntactic specifications (EAGLES / MULTEXT)
    • morphosyntactically annotated parallel corpus
    • inflectional lexica
    • multilingual comparable corpora and speech corpora
    • language processing tools

History of MULTEXT-East resources
  • First release 1998 on CD-ROM: already extended with new languages
  • Resources since 1998 available on the Web: http://nl.ijs.si/ME/
  • Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
  • Third release 2004: merge of first two releases, further languages
  • Fourth release 2007 (?)

The Languages of MULTEXT-East
  • Slavic:
    • Russian (East Slavic)
    • Czech (West Slavic)
    • Slovene (South West Slavic)
    • Resian (Slovene dialect)
    • Croatian (South West Slavic) -- Marko Tadić
    • Serbian (South West Slavic) -- C. Krstev, D. Vitas
    • Bulgarian (South East Slavic)
  • In progress:
    • Macedonian
    • Persian
  • Germanic: English
  • Romance: Romanian
  • Baltic:
    • Latvian
    • Lithuanian
  • Finno-Ugric:
    • Estonian
    • Hungarian
  • (BalkaNet):
    • Greek
    • Turkish

The MULTEXT morphosyntactic trinity
  • MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
  • MULTEXT-East morphosyntactic lexica (Serbian)
  • MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

1. Morphosyntactic specifications
  • Based on EAGLES / MULTEXT
  • Define PoS, their attributes and values
  • The specs are a document containing:
    • introduction
    • common tables
    • language particular sections
  • Written in LaTeX, converted to PDF & HTML
  • Derived XML/TEI encoding as feature structures
  • In Version 4, the specifications are to be fully in TEI/XML
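
To illustrate how a positional tag defined by such specifications expands into attributes and values, here is a minimal Python sketch. The table `NOUN_POSITIONS` is an assumed, illustrative subset covering only the first four noun attributes for a language like Slovene; it is not copied from the specifications, which should be consulted for the authoritative tables.

```python
# Illustrative subset only: position -> (attribute, {code: value}) for nouns.
# These tables are assumptions for the sketch, not the official specification.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# First character of an MSD selects the part of speech and its position table.
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}


def decode_msd(msd):
    """Expand a positional MSD string into an attribute-value dictionary."""
    category, positions = CATEGORIES[msd[0]]
    features = {"PoS": category}
    for code, (attribute, values) in zip(msd[1:], positions):
        if code != "-":  # a hyphen marks an attribute that does not apply
            features[attribute] = values[code]
    return features


features = decode_msd("Ncfsn")
print(features)
```

The resulting dictionary corresponds to the feature-structure view of the same tag: one attribute-value pair per position.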

Example common table

Example language specific table

2. The lexica
  • Medium size morphosyntactic lexica
  • Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
  • nearly all word-forms of ca. 15,000 lemmas
  • Lexical entry is composed of three fields:
    • the word-form: the inflected form of the word
    • the lemma: the base-form of the word
    • the morphosyntactic description (MSD)

Example: Slovene lexicon

abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
abecedam abeceda Ncfpd
abecedama abeceda Ncfdd
abecedama abeceda Ncfdi
abecedami abeceda Ncfpi
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
abecedi abeceda Ncfda
abecedi abeceda Ncfdn
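
The three-field layout is simple to load into a lookup table. The following Python sketch assumes whitespace-separated fields and reads `=` in the lemma field as "lemma identical to the word-form", as the `abeceda` line suggests; the function name `parse_lexicon` is hypothetical.

```python
from collections import defaultdict

# A few lines from the Slovene lexicon sample above.
LEXICON = """abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecede abeceda Ncfsg"""


def parse_lexicon(text):
    """Map each word-form to the list of its (lemma, MSD) readings."""
    entries = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, lemma, msd = line.split()
        if lemma == "=":  # assumed shorthand: lemma equals the word-form
            lemma = form
        entries[form].append((lemma, msd))
    return dict(entries)


lexicon = parse_lexicon(LEXICON)
print(lexicon["abeced"])
print(lexicon["abeceda"])
```

A word-form with several readings, such as `abeced`, is exactly the ambiguity that a tagger trained on the annotated corpus has to resolve in context.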

3. The “1984” corpus
  • Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
  • Structurally annotated
  • Sentence aligned with English
  • Words annotated with lemma and MSD
  • Encoded in TEI P4 (XML)

Example linguistic encoding

Context-disambiguated lemmas and MSDs

<text id="Osl." lang="sl">
  <body>
    <div type="part" id="Osl.1">
      <div type="chapter" id="Osl.1.2">
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            <w lemma="biti" ana="Vcps-sma">Bil</w>
            <w lemma="biti" ana="Vcip3s--n">je</w>
            <w lemma="jasen" ana="Afpmsnn">jasen</w>
            <c>,</c>
            <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
            <w lemma="aprilski" ana="Aopmsn">aprilski</w>
            <w lemma="dan" ana="Ncmsn">dan</w>
            <w lemma="in" ana="Ccs">in</w>
            <w lemma="ura" ana="Ncfpn">ure</w>
            <w lemma="biti" ana="Vcip3p--n">so</w>
            <w lemma="biti" ana="Vmps-pfa">bile</w>
            <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
            <c>.</c>
          </s>
        </p>
      </div>
    </div>
  </body>
</text>
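
With this word-level markup, the (word-form, lemma, MSD) triples can be read straight off the XML. The Python sketch below uses the standard `xml.etree.ElementTree` on a shortened copy of the sentence; the function name `tokens` and the choice to give punctuation an MSD of `None` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# First tokens of the annotated Slovene sentence above (shortened).
SENTENCE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>"""


def tokens(xml_text):
    """Return (word-form, lemma, MSD) triples in document order."""
    sentence = ET.fromstring(xml_text)
    triples = []
    for element in sentence:
        if element.tag == "w":
            triples.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            # Punctuation carries no lemma or MSD in the sample markup.
            triples.append((element.text, element.text, None))
    return triples


triples = tokens(SENTENCE)
for triple in triples:
    print(triple)
```

Triples in this shape are a common input format for training and evaluating PoS taggers and lemmatizers.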

Utility of MULTEXT-East LRs
  • Specifications became, for some, the “national” standard
  • Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
  • A base dataset for further annotation and experiments:
    • Word-sense disambiguation
    • WordNet development and evaluation
    • Syntactic parser induction
  • Teaching aid in HLT courses
  • ~ 100 registered users
  • As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

Corpora using TEI+MULTEXT-East
  • Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
  • Croatian National Corpus: HNK (100Mw)
  • Various Romanian corpora, …
  • En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

Conclusions
  • TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
  • MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
  • Both have already been used for a number of corpora
  • Maybe also for BKS (Bosnian/Croatian/Serbian)?
