LEXUS: A flexible web-based Lexicon Tool Interacting with ISO Data Category Registry

Peter Wittenburg, Marc Kemps-Snijders
MPI for Psycholinguistics


Outline

  • background – problem

  • MPI motivation = NLP motivation

  • playing LEGO

  • ISO TC37/SC4

  • Data Categories

  • Lexical Markup Framework

  • LEXUS Tool Mark

  • Demo Mark

  • Outlook


Background

7 DOBES teams and 12 different lexica (structures, purposes)

[Figure: example lexicon structure with the fields Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography]

Background

[Figure: example lexicon structures, labelled "simple spreadsheet" and "little more complex incl. 1:N relations"]

Fields shown: stem orthography, sense *, lexical sub-entry *, sense nr, sense, gram cat, gram subcat, Engl. Transl., example *, orthography, Engl. Transl., [T|pr] nr

[Figure: small part of a complex lexicon structure; at top level 4 different entry types (only one is shown)]

Fields shown: entry-type = [stem|idiom|lexical word], head, outer-body-L*, headword, citation form, homograph no, phonetic form, inner-body-L, grammar, gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation*, cross-ref*, sense number, variety, meaning, etymology, table, example*, comment*, picture/photo*, housekeeping*


Problem

  • have to use one archival lexicon representation format based on XML

  • have to build one archival exploitation framework

  • however, we receive lexica

    • in various character encodings

    • in all sorts of formats (various XML, SBX, CHAT, even Word)

    • in various structures

    • with different terminologies (lexical attributes, values)

  • how to do cross-lexical searches?

  • how to do lexical merging, linking and comparison?

  • how to solve lexicon-corpus interaction?

  • etc

  • in NLP the same problems

    • lack of standards

    • lack of re-usability

    • lack of interoperability

  • you knew this already, didn't you?


Why not play LEGO?

  • a concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure.

[Figure: a lexicon schema as a tree. Components (sub-schemas) such as sense and examples group data categories (lexical attributes, linguistic concepts) such as sense nr, gram cat, engl trans, ortho, and gloss; links between components are 1:1 or 1:N]
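
The LEGO idea can be sketched in a few lines of Python. This is only an illustration, not the LEXUS implementation: the class names `DataCategory` and `Component`, the `paths` helper, and the exact grouping of the attributes under `sense` and `examples` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataCategory:
    """A lexical attribute (linguistic concept), e.g. 'gram cat'."""
    name: str


@dataclass
class Component:
    """A sub-schema that groups data categories and nested components."""
    name: str
    cardinality: str = "1:1"  # relation to the parent: "1:1" or "1:N"
    data_categories: List[DataCategory] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> List[str]:
        """Return every data category as a slash-separated tree path."""
        here = f"{prefix}/{self.name}"
        found = [f"{here}/{dc.name}" for dc in self.data_categories]
        for child in self.children:
            found.extend(child.paths(here))
        return found


# Assumed grouping: the slide only shows these names in one tree.
examples = Component(
    "examples", "1:N",
    [DataCategory("ortho"), DataCategory("engl trans"), DataCategory("gloss")],
)
sense = Component(
    "sense", "1:1",
    [DataCategory("sense nr"), DataCategory("gram cat"), DataCategory("engl trans")],
    [examples],
)

for path in sense.paths():
    print(path)
```

Because components nest, a new lexicon schema is built by recombining the same bricks rather than by writing a new format.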


What else: Relations

  • need various types of relations between attributes and units in value strings

  • each relation can be associated with features, i.e. relations can be seen as components in their own right

[Figure: relations between units in the value strings of three entries]

  • bank: breite Sitzgelegenheit (something broad to sit on)

  • sitzgelegenheit: etwas um zu sitzen (something to sit on)

  • schmal: gegenteil zu breit (contrary to broad)
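
A relation that carries its own features might be modelled as follows. This is a sketch under assumptions: the `Entry` and `Relation` classes, the relation kinds `refers_to` and `contrast`, the `scope` feature, and the direction of each link are all hypothetical; only the three entries come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    headword: str
    definition: str
    translation: str


@dataclass
class Relation:
    """A typed link from a unit in a value string to another entry."""
    kind: str                      # hypothetical relation type label
    source: Entry
    unit: str                      # word inside source.definition
    target: Entry
    features: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The linked unit must actually occur in the source value string.
        if self.unit.lower() not in self.source.definition.lower():
            raise ValueError(f"{self.unit!r} not found in source definition")


bank = Entry("bank", "breite Sitzgelegenheit", "something broad to sit on")
seat = Entry("sitzgelegenheit", "etwas um zu sitzen", "something to sit on")
narrow = Entry("schmal", "gegenteil zu breit", "contrary to broad")

relations: List[Relation] = [
    Relation("refers_to", bank, "Sitzgelegenheit", seat, {"scope": "definition"}),
    Relation("contrast", narrow, "breit", bank, {"scope": "definition"}),
]


def outgoing(entry: Entry) -> List[Relation]:
    """All relations that start from a unit in this entry."""
    return [r for r in relations if r.source is entry]
```

Treating the relation as an object with features is what lets it be handled like any other component.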


What else: Inheritance

just one example to reduce typing

[Figure: the entries b’ang, boeb’ang, and goeb’ang each consist of common attributes and particular attributes]
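
One possible way to share the common attributes is a layered lookup. The sketch below uses Python's `collections.ChainMap`; the attribute names and all values are placeholders, since the slide does not list the actual attributes, and `make_entry` is a hypothetical helper.

```python
from collections import ChainMap

# Hypothetical shared attributes; the slide does not list the actual ones.
common = {"gram cat": "noun", "semantic domain": "placeholder"}


def make_entry(headword: str, particular: dict, inherited: dict) -> ChainMap:
    """Look up particular attributes first, then fall back to the common ones."""
    return ChainMap({"headword": headword, **particular}, inherited)


entries = [
    make_entry("b’ang", {"Engl. Transl": "placeholder A"}, common),
    make_entry("boeb’ang", {"Engl. Transl": "placeholder B"}, common),
    # A particular attribute may override an inherited one.
    make_entry("goeb’ang", {"Engl. Transl": "placeholder C", "gram cat": "verb"}, common),
]

for entry in entries:
    print(entry["headword"], entry["gram cat"], entry["semantic domain"])
```

The common attributes are typed once and every entry sees later changes to them, which is the "reduce typing" benefit the slide mentions.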


What else: conditions (operations)

just one example from DOBES

[Figure: the value of lexemtype selects the entry structure]

  • if lexemtype = “stem | idiom | lexical word”: head, sense nr, outer-body-L, meaning, etc.

  • if lexemtype = “auxil | inflect affix”: sense nr, meaning effect, categorial effect, etc.

  • probably better examples around

  • if value(X) then modify constraints(Y)

  • etc
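
The rule "if value(X) then modify constraints(Y)" can be sketched as a small lookup. The assignment of components to each lexemtype group below is a reading of the slide's diagram and should be treated as an assumption; `RULES` and `allowed_components` are hypothetical names.

```python
from typing import List, Set, Tuple

# Each rule: (values of lexemtype, components allowed for those values).
RULES: List[Tuple[Set[str], List[str]]] = [
    ({"stem", "idiom", "lexical word"},
     ["head", "sense nr", "outer-body-L", "meaning"]),
    ({"auxil", "inflect affix"},
     ["sense nr", "meaning effect", "categorial effect"]),
]


def allowed_components(lexemtype: str) -> List[str]:
    """Return the components an entry may contain for this lexemtype."""
    for values, components in RULES:
        if lexemtype in values:
            return components
    raise ValueError(f"no rule for lexemtype {lexemtype!r}")


print(allowed_components("idiom"))
print(allowed_components("auxil"))
```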


ISO TC37/SC4: the solution?

  • ISO TC37/SC4 is about standardization in LR Management

    • central is data category registry

      • basically a flat list of linguistic concepts

      • will contain is_a relations that are part of the concept definition, e.g. “transitive_verb” is_a “verb”

      • with proper definitions and conceptual space (value range)

      • request for filling DCR (Metadata, morphology, syntax, …)

    • looking for abstract models (frameworks)

      • for lexica

      • for annotation structures

      • for semantic annotations

      • for syntactic annotations
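
A flat registry with is_a links can be sketched as a plain dictionary. This is an illustration only, not the ISO registry's data model: `REGISTRY` and the function `is_a` are hypothetical, the `noun` entry is added for contrast, and treating a category as being is_a itself is a design choice of this sketch.

```python
from typing import Dict, Optional

# Flat list of data categories; each maps to its is_a parent (None at the top).
REGISTRY: Dict[str, Optional[str]] = {
    "verb": None,
    "transitive_verb": "verb",
    "noun": None,  # extra entry for illustration, not from the slide
}


def is_a(category: str, ancestor: str) -> bool:
    """True if `category` equals `ancestor` or reaches it via is_a links."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = REGISTRY[current]
    return False


print(is_a("transitive_verb", "verb"))
print(is_a("noun", "verb"))
```

With such links, a search for "verb" in one lexicon can also match entries tagged "transitive_verb" in another.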


Underlying Model

[Figure: model relating conceptual domain, data element concept, value domain, data element, and XML schema declaration]

  • complex datcat: /Gender/

  • set of simple datcats: /masculine/, /feminine/, /neuter/ (Dutch system is different)

  • XML object: implemented as an XML attribute named 'gen'

  • list of values: m, f, n

  • example: <w lemme="vert" gen="f">verte</w>
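
The link between the XML attribute and the data categories can be sketched with Python's standard `xml.etree.ElementTree`. The mapping of m, f, n to /masculine/, /feminine/, /neuter/ and the helper name `gender_of` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Value domain of the 'gen' attribute, mapped to simple data categories.
GENDER_VALUES = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}


def gender_of(xml_fragment: str) -> str:
    """Return the simple data category for the element's 'gen' attribute."""
    element = ET.fromstring(xml_fragment)
    code = element.get("gen")
    if code not in GENDER_VALUES:
        raise ValueError(f"gen={code!r} is outside the value domain")
    return GENDER_VALUES[code]


print(gender_of('<w lemme="vert" gen="f">verte</w>'))  # prints /feminine/
```

A different lexicon could use another attribute name or value list and still point at the same /Gender/ data category.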


Lexical Markup Framework: General Model

[Figure: general model; a metamodel combined with a data category selection yields a lexical model]

Core Model

  • Metamodel: made of lexical layers

  • Lexical layers: made of lexical components (or components)

[Figure: UML class diagram with the classes Lexical DB, Global Info, Lexical Entry, Sense, and Form; the associations carry the multiplicities 1..1 and 0..n]

  • basis for modeling purposes is UML

  • there will be an XML-schema based instantiation
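
The core model can be approximated with plain Python dataclasses. This is a sketch, not the normative LMF model: which association carries which multiplicity is an assumption, and the fields `language`, `orthography`, and `gloss` are hypothetical placeholders for data categories.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    orthography: str  # hypothetical data category


@dataclass
class Sense:
    gloss: str  # hypothetical data category


@dataclass
class LexicalEntry:
    forms: List[Form] = field(default_factory=list)    # assumed 0..n
    senses: List[Sense] = field(default_factory=list)  # assumed 0..n


@dataclass
class GlobalInfo:
    language: str  # hypothetical data category


@dataclass
class LexicalDB:
    global_info: GlobalInfo                                     # assumed 1..1
    entries: List[LexicalEntry] = field(default_factory=list)   # assumed 0..n


db = LexicalDB(GlobalInfo("de"))
db.entries.append(
    LexicalEntry([Form("bank")], [Sense("something broad to sit on")])
)
print(len(db.entries), db.entries[0].forms[0].orthography)
```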


Extended Model

[Figure: UML class diagram extending the core model (Lexical DB, Global Info, Lexical Entry, Sense, Form) with the classes Morphology, Paradigm, and Inflexion; multiplicities shown are 1..1, 1..n, 0..1, and 0..n]

Data categories shown: /lemma/, /POS/, /gender/, /key form/, /orthography/, /variant for/, /number/, /tense/, /person/, /mood/, /identifier/


Proposed Extensions

still ongoing discussions

[Figure: UML class diagram in which Lexical Entry, Sense, and Form are extended with the classes Syntactic family, Semantic frame, Semantic formula, Construct set, Syntactic construct, Semantic argument, and Syntactic position; multiplicities shown are 1..1 and 0..n]


What will LMF be?

  • descriptions of the general model (metamodel + data category selection, DCS)

  • DCs have to be ISO 11179/12620/… compliant

  • Core model

  • including component building, relations, conditions, inheritance

  • Extension mechanism

  • Proposed but not normative extensions (morphology, syntax, …)

  • XML-schema based instantiation

  • currently version 5 of the Draft Proposal

    • ISO/TC 37/SC 4 N130 Rev.5

    • Date: 2005-03-19

    • Working draft of ISO WD 24613:2005

  • web-site: http://www.tc37sc4.org/


    Goal LEXUS

    • To provide a framework capable of handling diverse lexicon structures and formats.

    • LEXUS is based on the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework.

    • LEXUS is the first test and reference implementation of LMF.

    • Increase interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF)

    Workshop ‘Lexical Databases and digital tools’, Nijmegen, April 2004


    Current Status

    • supports full LMF core model

    • allows for flexible creation of structures and content.

    • supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF)

    • allows for dynamic editing of structures and content.

    • supports use of multimedia content.

    • import of existing lexica (Shoebox, CHAT)

    • export (Shoebox, LMF XML)

    • customizable layout


    Current Status

    • user authentication

    • personal workspace for creating and editing lexica

    • merging facilities

    • simple and advanced search


    Current Status (Technical)

    • Implemented in Java using open source components

    • Uses Spring to ‘wire’ the application

      • Modular approach avoiding ‘hard’ links

    • Uses Hibernate as the persistence framework

      • Allows use of multiple databases (Postgres, MySQL,…)

    • Uses Tomcat as Servlet Container


    Logging onto the application

    Users must authenticate before they can access the application.


    User workspace

    Each user has his/her own personal workspace where private lexica are stored.


    Lexicon creation

    New lexica may be created…


    Lexicon import

    New lexica may be imported from a lexical resource…


    Lexicon structure

    The LMF core model can be identified in this simple structure.

    Components and data categories are distinguished by different icons.

    All may be dynamically created or modified.


    Lexicon structure

    Representation of a more complex structure. When a node in the tree is selected, the content of the component or data category is shown and may be modified.


    Data category selection

    Data categories can easily be selected from data category registries.


    Lexical entry overview

    Overview of lexical entries. Selecting a lexical entry reveals its details.


    Lexical entry details

    Details of a lexical entry. Modifications to the entry structure are constrained by the schema definition, e.g. cardinality.
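
A cardinality constraint of this kind could be checked as sketched below. The function names and the `lower..upper` string notation (borrowed from the multiplicities on the model slides, e.g. `0..n`) are assumptions; the slides do not describe how LEXUS enforces the schema.

```python
from typing import Optional, Tuple


def parse_cardinality(cardinality: str) -> Tuple[int, Optional[int]]:
    """Split '0..n' or '1..1' into (lower, upper); upper None means unbounded."""
    lower, upper = cardinality.split("..")
    return int(lower), (None if upper == "n" else int(upper))


def can_add(count: int, cardinality: str) -> bool:
    """May one more child be added when `count` children already exist?"""
    _, upper = parse_cardinality(cardinality)
    return upper is None or count < upper


def can_remove(count: int, cardinality: str) -> bool:
    """May one child be removed without dropping below the lower bound?"""
    lower, _ = parse_cardinality(cardinality)
    return count > lower


print(can_add(3, "0..n"), can_add(1, "1..1"), can_remove(1, "1..1"))
```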


    Lexical entry details

    Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).


    Lexical entry details

    Example of uploading a video file.


    Lexical entry details

    Viewing multimedia content.


    Alternative entry view

    Alternative views are provided which may be customized in look and feel.


    Synchronization of lexica

    [Figure: Personal Workspace and Main Lexicon]

    Lexica may be copied to and modified in the personal workspace.


    Synchronization of lexica

    [Figure: Personal Workspace and Main Lexicon]

    Lexica may be synchronized with the main lexicon.


    Synchronization of lexica

    When synchronizing lexica, the user is notified of structural changes and is in total control of the synchronization process.
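
Detecting structural changes before a synchronization could look like the sketch below. It assumes, purely for illustration, that each lexicon structure is represented as a set of slash-separated schema paths; the function name and the example paths are hypothetical and do not describe how LEXUS compares lexica.

```python
from typing import Dict, List, Set


def structural_changes(personal: Set[str], main: Set[str]) -> Dict[str, List[str]]:
    """List schema paths that exist on only one side, for the user to review."""
    return {
        "only_in_personal": sorted(personal - main),
        "only_in_main": sorted(main - personal),
    }


main = {"entry/form/orthography", "entry/sense/gloss"}
personal = {"entry/form/orthography", "entry/sense/gloss", "entry/sense/example"}

changes = structural_changes(personal, main)
for side, paths in changes.items():
    print(side, paths)
```

Reporting the differences rather than applying them automatically keeps the decision with the user, in line with the slide.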


    Future directions

    • Support for various types of relations

    • Import of data from other sources

    • Support for other Data Category Registries, e.g. GOLD

    • Integration with MPI archive

    • Integration with exploitation tools (ELAN, ANNEX)

    • Miscellaneous user requests



    Example lexical structure

    [Figure: example lexical structure used in the TEOP project within DOBES, with the fields Stem orthography, Sense nr, Sense *, sense, Lexical subentry, Gram cat, Gram subcat, orthography, Engl. Transl., Engl. Transl., Example *, [T/pr] nr]

    The * sign stands for 1:n relations of sub-structures.

