TEI for language resources: a missed chance or a coming opportunity?

Tomaž Erjavec

Dept. of Knowledge Technologies

Jožef Stefan Institute

Ljubljana, Slovenia


TEI for Language Resources


  • Some history

  • Why TEI isn't used for LRs (as much as expected)

  • MULTEXT-East and other case studies

  • Conclusions


TEI for Language Resources


At its inception TEI was meant to cover CL/NLP LRs, esp. corpora:

  • ACL: one of the supporting associations

  • modules for corpora, linguistic analysis, feature-structures, graphs

  • BNC in TEI

  • At the time, CL/NLP did not use SGML: a clear playing field

The age of XML and LRs

The release of XML (more or less) coincides with the beginning of the era of language resources:

1998: XML 1.0, First LREC conference

But developed LRs (mostly) did not use TEI. Why?

Reason 1: (X)CES

  • EAGLES Corpus Encoding Standard

    • „constraining or simplifying the TEI specifications in order to ensure interoperability“ (Ide 1998)

  • So, more compact and easier to apply than TEI

  • Almost TEI, but not quite

  • No methods for extension

Reason 2: Comp Sci attitude

  • I don't care about the data format, I want to develop algorithms! (... I even hate XML...)

  • If I use XML I will roll my own schema optimal for my experiments / application (...that's what 'X' means...)

  • I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)

Reason 3: General gripes

  • Missing modules for syntactic analysis & lexical databases

  • Not prescriptive / precise enough

  • Too general elements

  • Too book oriented


TEI for Language Resources


  • Project-local proposals:

    • TIGER treebank format

    • Concede lexical database format

    • GENIA NER format

    • ...

  • Semantic Web: DC, RDF, OWL

  • ISO TC 37 SC4:

    • LMF, isoCat,

    • LAF, MAF, SynAF, ...


TEI for Language Resources


  • MULTEXT-East: multilingual corpora and lexica

  • Fida(PLUS): Slovene Reference Corpus

  • IJS-ELAN, SVEZ-IJS: en-sl parallel corpora

  • jaSlo: Japanese-Slovene L2 dictionary

  • eZISS: Scholarly Digital Editions of Slovene Literature

  • JRC-ACQUIS: Parallel corpus of EC laws

  • SDT: Slovene Dependency Treebank

  • SBL: Slovene Biographic Lexicon

  • AHLib: DL/corpus of 19th century Slovene books

  • JOS: Slovene gold-standard corpus for HLT

  • MULTEXT-East...

MULTEXT-East


  • EU project 1995-97: MULTEXT sequel

  • Development of standardised language resources for Central and Eastern European languages + English hub

  • Corpora, lexica, morphosyn. specifications

  • V1: 1998, 7 languages, LaTeX + CES/SGML

  • V4: 2010, 16 languages, TEI P5

  • http://nl.ijs.si/ME/

MULTEXT-East Version 4 by language and resource type

Why TEI for MTE?

  • Because I like TEI

  • Varied resources:

    • Metadata / Documentation

    • „Document“ corpus: rich annotation structure

    • Linguistically annotated „1984“ corpus

    • Sentence alignments: stand-off markup

    • Morphosyntactic specifications: book-like

      Either choose several (moving target) schemas or use TEI.

TEI Header: v4, v3, v2, v1, ECI, OTA, SOAS

Annotated 1984

<text xml:id="Osl." xml:lang="sl">
  <body>
    <div type="part" xml:id="Osl.1">
      <div type="chapter" xml:id="Osl.1.2">
        <p xml:id="Osl.1.2.2">
          <s xml:id="Osl.">
            <w xml:id="Osl." lemma="biti" ana="#Va-p-sm">Bil</w>
            <w xml:id="Osl." lemma="biti" ana="#Va-r3s-n">je</w>
            <w xml:id="Osl." lemma="jasen" ana="#Agpmsnn">jasen</w>
            <c xml:id="Osl.">,</c> ← sorry!
            <w xml:id="Osl." lemma="mrzel" ana="#Agpmsnn">mrzel</w>
            <w xml:id="Osl." lemma="aprilski" ana="#Agpmsny">aprilski</w>
            <w xml:id="Osl." lemma="dan" ana="#Ncmsn">dan</w>
            <w xml:id="Osl." lemma="in" ana="#Cc">in</w>
            <w xml:id="Osl." lemma="ura" ana="#Ncfpn">ure</w>
            <w xml:id="Osl." lemma="biti" ana="#Va-r3p-n">so</w>
            <w xml:id="Osl." lemma="biti" ana="#Va-p-pf">bile</w>
            <w xml:id="Osl." lemma="trinajst" ana="#Mlc-pa">trinajst</w>
            <c xml:id="Osl.">.</c>
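A minimal sketch of how such word-level annotation could be read back out, using only the Python standard library. The sample is a shortened fragment modelled on the slide; its xml:id values (s1, w1, ...) are invented because the originals are truncated, and it carries no TEI namespace, which a real MULTEXT-East file would declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: ids are invented, the TEI namespace is omitted.
SAMPLE = """
<s xml:id="s1">
  <w xml:id="w1" lemma="biti" ana="#Va-p-sm">Bil</w>
  <w xml:id="w2" lemma="biti" ana="#Va-r3s-n">je</w>
  <w xml:id="w3" lemma="jasen" ana="#Agpmsnn">jasen</w>
  <c xml:id="c1">,</c>
</s>
"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples; punctuation gets lemma and msd None."""
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        if el.tag == "w":
            # @ana is a pointer such as "#Va-p-sm"; drop the leading '#'.
            msd = el.get("ana", "").lstrip("#")
            tokens.append((el.text, el.get("lemma"), msd))
        elif el.tag == "c":
            tokens.append((el.text, None, None))
    return tokens


for token in read_tokens(SAMPLE):
    print(token)
```

With a namespaced file the tag tests would have to compare against the qualified names instead of the bare "w" and "c".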


TEI for Language Resources


  • A long time ago „1984“ lost its spaces

  • Whitespace is brittle but important:

    • Retokenisation

    • Reading

  • TEI <space> no good!

  • So <mte:space> </mte:space>, 24:1?

  • Sitting on the fence JOS solution: </S>

  • <mte:g/>?
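Why the lost spaces matter can be shown with a small sketch. SPACE below is a hypothetical stand-in for whatever explicit whitespace element is chosen; it is not a TEI or MULTEXT-East construct.

```python
SPACE = object()  # hypothetical stand-in for an explicit whitespace element


def detokenise(items):
    """Rebuild running text from tokens and explicit whitespace markers."""
    return "".join(" " if item is SPACE else item for item in items)


with_markers = ["Bil", SPACE, "je", SPACE, "jasen", ",", SPACE, "mrzel"]
without_markers = ["Bil", "je", "jasen", ",", "mrzel"]

# With markers the original spacing is recoverable.
print(detokenise(with_markers))
# Without them a naive join puts a space before the comma.
print(" ".join(without_markers))
```

Once the markers are gone, correct spacing can only be guessed with language-specific rules, which is what makes retokenisation and reading harder.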

Sentence alignments

In MTE V3:

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
<cesAlign version="4.1">
<linkList id="Oruen">
<linkGrp type="body" targType="s" domains="Oru Oen">
<link xtargets="Oru. ; Oen."/>
<link xtargets="Oru. Oru. ; Oen."/>
<link xtargets="Oru. ; Oen. Oen."/>
<link xtargets=" ; Oen."/>
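The xtargets convention above (source ids, a semicolon, target ids, either side possibly empty) can be unpacked with a few lines of Python. This is only a sketch; the ids in the usage lines are invented, since the ones on the slide are truncated.

```python
def parse_xtargets(value):
    """Split 'src ids ; tgt ids' into two lists of ids (either may be empty)."""
    left, sep, right = value.partition(";")
    if not sep:
        raise ValueError("expected ';' between the two sides: %r" % value)
    return left.split(), right.split()


# Hypothetical ids, for illustration only.
print(parse_xtargets("Oru.1 Oru.2 ; Oen.1"))  # a 2:1 link
print(parse_xtargets(" ; Oen.2"))             # a 0:1 link
```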

TEI P5 Alignments

  • The TEI way uses two-level indirection: first grouping, then alignment

  • Too complicated, esp. as 98% of alignments are 1-1

  • Chose a fence-sitting one-level solution:

    <linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
    <link n="1:1" targets="oana-mk.xml#Omk. oana-sl.xml#Osl."/>
    <link n="2:1" targets="oana-mk.xml#Omk. oana-mk.xml#Omk. oana-sl.xml#Osl."/>
    <link n="1:2" targets="oana-mk.xml#Omk. oana-sl.xml#Osl. oana-sl.xml#Osl."/>
    <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
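In the one-level encoding the two sides of a link are told apart only by the document part of each pointer. A sketch of how a consumer might regroup the pointers and recompute the n value follows; the function names and the fragment ids in the usage lines are assumptions made for illustration.

```python
def group_targets(targets):
    """Group 'doc#id' pointers by document, keeping their order."""
    groups = {}
    for pointer in targets.split():
        doc, _, fragment = pointer.partition("#")
        groups.setdefault(doc, []).append(fragment)
    return groups


def alignment_type(targets, docs):
    """Recompute the 'm:n' label of a link for the documents listed in corresp."""
    groups = group_targets(targets)
    return ":".join(str(len(groups.get(doc, []))) for doc in docs)


docs = "oana-mk.xml oana-sl.xml".split()
# Hypothetical fragment ids; the ones on the slide are truncated.
link = "oana-mk.xml#Omk.1 oana-mk.xml#Omk.2 oana-sl.xml#Osl.1"
print(group_targets(link))
print(alignment_type(link, docs))
```

Comparing the recomputed label with the n attribute gives a cheap consistency check, and a link with pointers into only one document comes out as 0:1 or 1:0.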

Morphosyntactic specifications

  • Define categories (PoS) and their features

  • Map feature-structures to morphosyntactic descriptions (MSD tagsets)

  • Specify which languages have which features and tagsets

  • E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs ∈ Tagset_sl

  • Complex morphology → complex specifications

  • MSD tagsets are grounded in lexicon and corpus
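The mapping between an MSD and its feature structure is positional: the first character gives the category, and each following character gives the value of the attribute at that position. A minimal sketch, assuming a toy specification that contains only the codes confirmed by the Rgs example above (the real specifications are far larger), and treating a hyphen as "attribute not applicable", as in tags like Va-p-sm:

```python
# Toy specification: only the codes from the Rgs example are included.
SPECS = {
    "R": {
        "category": "Adverb",
        "attributes": [
            ("Type", {"g": "general"}),
            ("Degree", {"s": "superlative"}),
        ],
    },
}


def decode_msd(msd, specs=SPECS):
    """Expand a positional MSD string into (attribute, value) pairs."""
    spec = specs[msd[0]]
    features = [("Category", spec["category"])]
    for code, (name, values) in zip(msd[1:], spec["attributes"]):
        if code == "-":  # hyphen: attribute not applicable, skip it
            continue
        features.append((name, values[code]))
    return features


print(decode_msd("Rgs"))
```

An unknown category or value code raises KeyError, which is a reasonable failure mode when checking a corpus against its tagset.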

Example: common specifications

<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
    ...
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>....
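Because the role attributes carry the semantics, a table like this is straightforward to load into a data structure. The sketch below uses the standard library only; the sample is the slide's fragment with the elided parts simply dropped (so its language lists are incomplete) and the open elements closed, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed off where the original is elided.
SAMPLE = """
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""


def cell_text(row, role):
    """Text of the first direct child cell with the given role, or None."""
    cell = row.find("cell[@role='%s']" % role)
    return cell.text if cell is not None else None


def langs(row):
    """Language codes listed in a row."""
    return [c.text for c in row.findall("cell[@role='lang']")]


def read_spec(xml_text):
    """Load one category table into a plain dictionary."""
    table = ET.fromstring(xml_text)
    spec = {"attributes": []}
    for row in table.findall("row"):
        if row.get("role") == "type":
            spec["category"] = cell_text(row, "value")
            spec["code"] = cell_text(row, "code")
            spec["langs"] = langs(row)
        elif row.get("role") == "attribute":
            # Value rows live in a table nested inside an unlabelled cell.
            values = {
                cell_text(v, "code"): (cell_text(v, "name"), langs(v))
                for v in row.findall("cell/table/row")
            }
            spec["attributes"].append((cell_text(row, "name"), values))
    return spec


print(read_spec(SAMPLE))
```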

Language particular specifications

<div type="section" select="sl" xml:id="msd.Q-sl">
  <head>Slovene Particle</head>
  <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
    <head>Slovene Specification for Particle</head>
    <row role="type">
      <cell role="position">0</cell>
      <cell role="name" xml:lang="sl">besedna_vrsta</cell>
      <cell role="value" xml:lang="sl">členek</cell>
      <cell role="code" xml:lang="sl">L</cell>
      <cell role="name" xml:lang="en">CATEGORY</cell>
      <cell role="value" xml:lang="en">Particle</cell>
      <cell role="code" xml:lang="en">Q</cell>
    </row>
  </table>
  <p xml:lang="sl">Opombe:
    <list>
      <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
    </list>
  </p>
  <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>



TEI for Language Resources


  • TEI provides needed elements, also for commentary, bibliography, ...

  • TEI XSLT used to render as HTML

  • Tables retained from MULTEXT

  • Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib

  • Interesting challenge: conversion to isoCat (Adam P. for Polish tagset), OWL
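The conversions listed above are done with XSLT in the project itself. Purely as an illustration of the target shape, here is a Python sketch that turns one MSD and its features into a TEI-style feature structure with fs, f and symbol elements. Whether the project's fsLib uses symbol values, and the MSD as xml:id, is an assumption here.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialised as xml:id


def features_to_fs(msd, features):
    """Build an <fs> element with one <f>/<symbol> pair per feature."""
    fs = ET.Element("fs", {XML_ID: msd})
    for name, value in features:
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "symbol", {"value": value})
    return fs


fs = features_to_fs(
    "Rgs",
    [("Category", "Adverb"), ("Type", "general"), ("Degree", "superlative")],
)
print(ET.tostring(fs, encoding="unicode"))
```

Giving each feature structure the MSD as its identifier is what would let a corpus token point at it with a value such as ana="#Rgs".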

MTE specifications in OWL (by Christian Chiarcos)

Morals, 1

  • TEI good for in-place markup of richly annotated resources with varied structure:

    • Readable

    • Updatable (validation)

  • Not good for huge datasets with shallow annotation:

    • Processable

    • Read only

      → useful for (small, medium size) gold standard hand-corrected language resources

      / „new“ languages → localisation /

IMPACT @ JSI


  • EU IP „Improving Access to Text“

  • Make better OCR and IR for historical texts

  • JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene

  • Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns

  • Current dataset: AHLib (~100 books)

  • AHLib marked up in TEI

AHLib Digital Library

Mark-up challenges

  • Text-critical apparatus vs. linguistic annotation

  • „Parallel“ corpora of transcriptions and modernisations

  • Layered linguistic annotations: tokenisation, tagsets

  • Lexicon (+dictionary) encoding

Morals, 2

  • Text-critical editions use TEI anyway

  • Ditto for DLs of historical texts

  • HLT increasingly applied also to such texts

  • TEI provides a good basis to join the two views

Current EU Projects: FlareNet

  • Fostering Language Resources Network (2008-11)

  • WG4 - Harmonisation of Formats and Standards

  • D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):

    • „For academic purposes the TEI Guidelines (current version P5) has been a well-established and widely used resource of LR-specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts.“ /underlined by T.E./

  • D4.2 Proposal of a European Language Resource Standards Framework (M24 /2010-09-01)

Research Infrastructures for the Humanities

  • DG Research funded RIs; pilot phase, 2008-2010

  • DARIAH: ask Lou...

  • EU RI CLARIN: Common Language Resources and Technology Infrastructure

  • WP5 Language Resources and Technologies Overview

  • D5C-3: Interoperability & Standards: „Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods.“

Morals, 3

  • TEI is firmly acknowledged in current work on LR encoding standardisation

  • But it is not prescriptive enough and lacks modules for many types of LRs

    → Need for constrained solutions & linkages to ISO/W3C standards:

    • Cross-walks

    • Roma & Schema „namespace“ catalogue to DC, LMF, MAF, ...

TEI for LR: SWOT


  • Strengths: Universality, Maturity, Community, Extensibility (compare ISO)

  • Weaknesses: Vagueness, Learning curve, ISO/W3C linkage

  • Opportunities: HLT (Humanities Language Technologies), New languages

  • Threats: Marginalisation, Technical obsolescence


TEI for Language Resources


  • Frontiers: DL+HLT, Gold standard LRs

  • Priority: Instantiated connections to other standards and languages

  • Connection with linguistics? SIG will tell...