TEI for language resources: a missed chance or a coming opportunity?

Tomaž Erjavec

Dept. of Knowledge Technologies

Jožef Stefan Institute

Ljubljana, Slovenia


TEI for Language Resources

Overview

  • Some history

  • Why TEI isn't used for LRs (as much as expected)

  • MULTEXT-East and other case studies

  • Conclusions


History

At its inception TEI was meant to cover CL/NLP LRs, esp. corpora:

  • ACL one of the supporting associations

  • modules for corpora, linguistic analysis, feature-structures, graphs

  • BNC in TEI

  • At the time CL/NLP did not use SGML: a clear playing field


The age of XML and LRs

The release of XML (more or less) coincides with the beginning of the era of language resources:

1998: XML 1.0, First LREC conference

But developed LRs (mostly) did not use TEI. Why?


Reason 1: (X)CES

  • EAGLES Corpus Encoding Standard

    • „constraining or simplifying the TEI specifications in order to ensure interoperability“ (Ide 1998)

  • So, more compact and easier to apply than TEI

  • Almost TEI, but not quite

  • No methods for extension


Reason 2: Comp Sci attitude

  • I don't care about the data format, I want to develop algorithms! (... I even hate XML...)

  • If I use XML I will roll my own schema, optimal for my experiments / application (...that's what 'X' means...)

  • I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)


Reason 3: General gripes

  • Missing modules for syntactic analysis & lexical databases

  • Not prescriptive / precise enough

  • Elements too general

  • Too book oriented


Result

  • Project-local proposals:

    • TIGER treebank format

    • Concede lexical database format

    • GENIA NER format

    • ...

  • Semantic Web: DC, RDF, OWL

  • ISO TC 37 SC4:

    • LMF, isoCat,

    • LAF, MAF, SynAF, ...


MyTEI

  • MULTEXT-East: multilingual corpora and lexica

  • Fida(PLUS): Slovene Reference Corpus

  • IJS-ELAN, SVEZ-IJS: en-sl parallel corpora

  • jaSlo: Japanese-Slovene L2 dictionary

  • eZISS: Scholarly Digital Editions of Slovene Literature

  • JRC-ACQUIS: Parallel corpus of EC laws

  • SDT: Slovene Dependency Treebank

  • SBL: Slovene Biographic Lexicon

  • AHLib: DL/corpus of 19th century Slovene books

  • JOS: Slovene gold-standard corpus for HLT

  • MULTEXT-East...


MULTEXT-East

  • EU project 1995-97: MULTEXT sequel

  • Development of standardised language resources for Central and Eastern European languages + English hub

  • Corpora, lexica, morphosyn. specifications

  • V1: 1998, 7 languages, LaTeX + CES/SGML

  • V4: 2010, 16 languages, TEI P5

  • http://nl.ijs.si/ME/


MULTEXT-East Version 4 by language and resource type


Why TEI for MTE?

  • Because I like TEI

  • Varied resources:

    • Metadata / Documentation

    • „Document“ corpus: rich annotation structure

    • Linguistically annotated „1984“ corpus

    • Sentence alignments: stand-off markup

    • Morphosyntactic specifications: book-like

      Either choose several (moving target) schemas or use TEI.


Documentation


TEI Header

(Figure labels: v4, v3, v2, v1, eci, ota, soas)


Annotated 1984

<text xml:id="Osl." xml:lang="sl">
 <body>
  <div type="part" xml:id="Osl.1">
   <div type="chapter" xml:id="Osl.1.2">
    <p xml:id="Osl.1.2.2">
     <s xml:id="Osl.1.2.2.1">
      <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
      <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
      <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
      <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
      <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
      <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
      <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
      <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
      <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
      <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
      <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
      <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
      <c xml:id="Osl.1.2.2.1.13">.</c>
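A minimal sketch of how such token-level markup might be read back into (id, form, lemma, MSD) rows. The fragment is a hypothetical, closed-off copy of the first four tokens; it has no TEI namespace, whereas a real TEI P5 file would normally declare one, and `read_tokens` is an illustrative name, not part of any MULTEXT-East tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hypothetical, closed-off copy of the first four tokens from the slide.
SAMPLE = """
<s xml:id="Osl.1.2.2.1">
  <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
  <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
  <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
  <c xml:id="Osl.1.2.2.1.4">,</c>
</s>
"""

def read_tokens(xml_text):
    """Return (id, form, lemma, msd) per token; <c> has no lemma or MSD."""
    sentence = ET.fromstring(xml_text)
    rows = []
    for tok in sentence:
        if tok.tag not in ("w", "c"):
            continue
        ana = tok.get("ana")
        # The ana attribute is a pointer ("#Va-p-sm"); strip the leading "#".
        msd = ana.lstrip("#") if ana else None
        rows.append((tok.get(XML_NS + "id"), tok.text, tok.get("lemma"), msd))
    return rows

tokens = read_tokens(SAMPLE)
for row in tokens:
    print(row)
```

Because `ana` is a pointer rather than a plain string, the MSD can be resolved against a feature-structure library stored elsewhere; the sketch only strips the `#`.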


Whitespace

  • A long time ago „1984“ lost its spaces

  • Whitespace is brittle but important:

    • Retokenisation

    • Reading

  • TEI <space> no good!

  • So <mte:space> </mte:space>, 24:1?

  • Fence-sitting JOS solution: <S/>

  • <mte:g/>?
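To illustrate why recorded whitespace matters for reading and retokenisation, here is a small sketch that rebuilds running text from tokens. It assumes an empty marker element, written `<S/>` here, between tokens that were separated by a space in the source; the fragment and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <S/> marks a space that was present in the source text.
SAMPLE = "<s><w>Bil</w><S/><w>je</w><S/><w>jasen</w><c>,</c><S/><w>mrzel</w></s>"

def detokenise(xml_text):
    """Rebuild the text using the recorded space markers."""
    parts = []
    for el in ET.fromstring(xml_text):
        if el.tag == "S":
            parts.append(" ")
        elif el.tag in ("w", "c"):
            parts.append(el.text or "")
    return "".join(parts)

def naive_join(xml_text):
    """What is left when the whitespace was lost: a space between all tokens."""
    root = ET.fromstring(xml_text)
    return " ".join(el.text for el in root if el.tag in ("w", "c"))

print(detokenise(SAMPLE))
print(naive_join(SAMPLE))
```

The second function puts a space before the comma, which is the kind of damage that cannot be undone once the original spacing is gone.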


Sentence alignments

In MTE V3:

<?xml version="1.0" encoding="us-ascii"?>

<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">

<cesAlign version="4.1">

<linkList id="Oruen">

<linkGrp type="body" targType="s" domains="Oru Oen">

<link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>

<link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>

<link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>

<link xtargets=" ; Oen.1.3.4.3"/>
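The `xtargets` convention above packs both sides of an alignment into one string, separated by a semicolon. A short sketch of how it might be split, using the four values from the slide (`parse_xtargets` is an illustrative name):

```python
def parse_xtargets(value):
    """Split an xtargets value 'a b ; c' into (source_ids, target_ids)."""
    left, sep, right = value.partition(";")
    if not sep:
        raise ValueError(f"missing ';' in xtargets: {value!r}")
    # str.split() with no argument also copes with an empty side (0:1 links).
    return left.split(), right.split()

links = [
    "Oru.1.1.1.1 ; Oen.1.1.1.1",
    "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6",
    "Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2",
    " ; Oen.1.3.4.3",
]
for value in links:
    src, tgt = parse_xtargets(value)
    print(f"{len(src)}:{len(tgt)}", src, tgt)
```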


TEI P5 Alignments

  • The TEI way uses two-level indirection: first grouping, then alignment

  • Too complicated, esp. as 98% of alignments are 1-1

  • Chose fence-sitting one-level:

    <linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">

    <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>

    <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>

    <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>

    <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
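In the one-level scheme the two sides of a link are no longer separated explicitly; they can be recovered from the document part of each pointer. The sketch below assumes that `corresp` on the `linkGrp` lists the two documents in order, and recomputes the `n` value from `targets`; the function names are illustrative.

```python
def group_targets(targets):
    """Group 'file#id' pointers by file, keeping the order of the ids."""
    groups = {}
    for pointer in targets.split():
        doc, _, fragment = pointer.partition("#")
        groups.setdefault(doc, []).append(fragment)
    return groups

def alignment_type(targets, corresp):
    """Return e.g. '2:1', counting the ids per document in corresp order."""
    groups = group_targets(targets)
    return ":".join(str(len(groups.get(doc, []))) for doc in corresp.split())

CORRESP = "oana-mk.xml oana-sl.xml"
two_to_one = ("oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 "
              "oana-sl.xml#Osl.1.2.3.6")
print(alignment_type(two_to_one, CORRESP))
print(alignment_type("oana-sl.xml#Osl.4.12.2", CORRESP))
```

A check of this kind can flag links whose `n` attribute disagrees with the pointers actually listed.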


Morphosyntactic specifications

  • Define categories (PoS) and their features

  • Map feature-structures to morphosyntactic descriptions (MSD tagsets)

  • Specify which languages have which features and tagsets

  • E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs ∈ Tagset_sl

  • Complex morphology → complex specifications

  • MSD tagsets are grounded in lexicon and corpus
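The mapping between an MSD and its feature structure is positional: the first character gives the category, and each following position gives the value of one attribute. A minimal sketch, with a deliberately partial table that covers only the `Rgs` example; the `SPEC` dictionary is illustrative and not the real specification, and treating `-` as "attribute not set" is an assumption based on tags such as `Va-p-sm`.

```python
# Illustrative, partial table: only the codes needed for the example "Rgs".
SPEC = {
    "R": ("Adverb", [
        ("Type", {"g": "general"}),
        ("Degree", {"s": "superlative"}),
    ]),
}

def msd_to_features(msd, spec=SPEC):
    """Expand a positional MSD string into a feature dictionary."""
    category, attributes = spec[msd[0]]
    features = {"Category": category}
    for (name, values), code in zip(attributes, msd[1:]):
        if code == "-":  # assumed convention: position not applicable
            continue
        features[name] = values[code]
    return features

def features_to_msd(features, spec=SPEC):
    """Inverse mapping: feature dictionary back to the MSD string."""
    for cat_code, (category, attributes) in spec.items():
        if category == features["Category"]:
            break
    else:
        raise KeyError(features["Category"])
    codes = []
    for name, values in attributes:
        inverse = {label: code for code, label in values.items()}
        codes.append(inverse[features[name]] if name in features else "-")
    return (cat_code + "".join(codes)).rstrip("-")

print(msd_to_features("Rgs"))
print(features_to_msd({"Category": "Adverb", "Type": "general"}))
```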


Example: common specifications

<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
 <head>Common specifications for Particle</head>
 <row role="type">
  <cell role="position">0</cell>
  <cell role="name">CATEGORY</cell>
  <cell role="value">Particle</cell>
  <cell role="code">Q</cell>
  <cell role="lang">ro</cell>
  <cell role="lang">sl</cell>
  ...
 </row>
 <row role="attribute">
  <cell role="position">1</cell>
  <cell role="name">Type</cell>
  <cell>
   <table>
    <row role="value">
     <cell role="name">negative</cell>
     <cell role="code">z</cell>
     <cell role="lang">ro</cell>
    </row>
    <row role="value">
     <cell role="name">interrogative</cell>
     <cell role="code">q</cell>
     <cell role="lang">bg</cell>
     <cell role="lang">hr</cell>....
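Since the specifications are ordinary tables with `role` attributes, they can be read mechanically. The sketch below works on a hypothetical, closed-off copy of the excerpt (cells elided on the slide are simply absent) and collects, for each attribute, its value codes and the languages that use them; `read_common_spec` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, closed-off copy of the excerpt; elided cells are left out.
SAMPLE = """
<table n="msd.cat">
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""

def read_common_spec(xml_text):
    """Return (category, code, {attribute: {code: (value, [languages])}})."""
    table = ET.fromstring(xml_text)
    type_row = table.find("row[@role='type']")
    category = type_row.find("cell[@role='value']").text
    cat_code = type_row.find("cell[@role='code']").text
    attributes = {}
    for row in table.findall("row[@role='attribute']"):
        name = row.find("cell[@role='name']").text
        values = {}
        # Value rows sit in a nested table inside an unlabelled cell.
        for value_row in row.findall("cell/table/row[@role='value']"):
            code = value_row.find("cell[@role='code']").text
            label = value_row.find("cell[@role='name']").text
            langs = [c.text for c in value_row.findall("cell[@role='lang']")]
            values[code] = (label, langs)
        attributes[name] = values
    return category, cat_code, attributes

category, cat_code, attributes = read_common_spec(SAMPLE)
print(category, cat_code, attributes)
```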


Language particular specifications

<div type="section" select="sl" xml:id="msd.Q-sl">
 <head>Slovene Particle</head>
 <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
  <head>Slovene Specification for Particle</head>
  <row role="type">
   <cell role="position">0</cell>
   <cell role="name" xml:lang="sl">besedna_vrsta</cell>
   <cell role="value" xml:lang="sl">členek</cell>
   <cell role="code" xml:lang="sl">L</cell>
   <cell role="name" xml:lang="en">CATEGORY</cell>
   <cell role="value" xml:lang="en">Particle</cell>
   <cell role="code" xml:lang="en">Q</cell>
  </row>
 </table>
 <p xml:lang="sl">Opombe:
  <list>
   <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
  </list>
 </p>
 <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>

MTE_sl = JOS


Encoding

  • TEI provides needed elements, also for commentary, bibliography, ...

  • TEI XSLT used to render as HTML

  • Tables retained from MULTEXT

  • Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib

  • Interesting challenge: conversion to isoCat (Adam P. for Polish tagset), OWL
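The slides do the MSD conversions in XSLT; purely as an illustration in Python, the sketch below shows what the feature-structure side of such a conversion might produce. It serialises the earlier example feature structure (Category=Adverb, Type=general, Degree=superlative) as a TEI-style `<fs>` with `<f>` and `<symbol>` children. The function name is hypothetical and no TEI namespace is declared.

```python
import xml.etree.ElementTree as ET

def features_to_fs(features):
    """Build an <fs> element with one <f>/<symbol> pair per feature."""
    fs = ET.Element("fs")
    for name, value in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "symbol", {"value": value})
    return fs

fs = features_to_fs(
    {"Category": "Adverb", "Type": "general", "Degree": "superlative"}
)
print(ET.tostring(fs, encoding="unicode"))
```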


MTE specifications in OWL (by Christian Chiarcos)


Morals, 1

  • TEI good for in-place markup of richly annotated resources with varied structure:

    • Readable

    • Updatable (validation)

  • Not good for huge datasets with shallow annotation:

    • Processable

    • Read only

      → useful for (small, medium size) gold standard hand-corrected language resources

      / „new“ languages → localisation /


IMPACT @ JSI

  • EU IP „Improving Access to Text“

  • Make better OCR and IR for historical texts

  • JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene

  • Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns

  • Current dataset: AHLib (~100 books)

  • AHLib marked up in TEI


AHLib Digital Library


IMPACT Lexicon


Mark-up challenges

  • Text-critical apparatus vs. linguistic annotation

  • „Parallel“ corpora of transcriptions and modernisations

  • Layered linguistic annotations: tokenisation, tagsets

  • Lexicon (+dictionary) encoding


Morals, 2

  • Text-critical editions use TEI anyway

  • Ditto for DLs of historical texts

  • HLT increasingly applied also to such texts

  • TEI provides a good basis to join the two views


Current EU Projects: FlareNet

  • Fostering Language Resources Network (2008-11)

  • WG4 - Harmonisation of Formats and Standards

  • D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):

    • „For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR‐specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts.“ /underlined by T.E./

  • D4.2 Proposal of a European Language Resource Standards Framework (M24 /2010-09-01)


Research Infrastructures for the Humanities

  • DG Research funded RIs; pilot phase, 2008-2010

  • DARIAH  ask Lou...

  • EU RI CLARIN: Common Language Resources and Technology Infrastructure

  • WP5 Language Resources and Technologies Overview

  • D5C-3: Interoperability & Standards: „Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods.“


Morals, 3

  • TEI is firmly acknowledged in current work on LR encoding standardisation

  • But it is not prescriptive enough and lacks modules for many types of LRs

    → Need of constrained solutions & linkages to ISO/W3C standards:

    • Cross-walks

    • Roma & Schema „namespace“ catalogue to DC, LMF, MAF, ...


TEI for LR: SWOT

  • Strengths: Universality, Maturity, Community, Extensibility (compare ISO)

  • Weaknesses: Vagueness, Learning curve, ISO/W3C linkage

  • Opportunities: HLT (Humanities Language Technologies), New languages

  • Threats: Marginalisation, Technical obsolescence


Conclusions

  • Frontiers: DL+HLT, Gold standard LRs

  • Priority: Instantiated connections to other standards and languages

  • Connection with linguistics? SIG will tell...
