A Unicode-based Environment for the Creation and use of LRs



  • A Unicode-based Environment for the Creation and use of LRs

  • Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery (1), Paul Baker (1), Mark Leisher (2)

  • Department of Computer Science, University of Sheffield

  • (1) University of Lancaster

  • (2) New Mexico State University

    • GATE (a General Architecture for Text Engineering) and ML LRs

    • Motivation (history of men’s underwear)

    • Short definition of GATE

    • GATE, Unicode and Java

    • EMILLE


  • Motivation for Software Infrastructure for Language Engineering

    • Analogy with recent history of men’s underwear – also supportive infrastructure:

    • The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive

    • The brave new world: boxer shorts: still supportive, but less constraining

    • The purpose of our work (the boxer shorts ideal):

    • freedom within a supportive environment


  • GATE is:

  • An architecture: a macro-level organisational picture for LE software systems.

  • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

  • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.

  • Some free components, and wrappers for other people's components

  • Tools for: evaluation; visualisation/edit; persistence; IR; IE; dialogue; ontologies; etc.

  • Free software (LGPL). Download at http://gate.ac.uk/download/


  • Architectural principles

  • Non-prescriptive, theory neutral (strength and weakness)

  • Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)

  • (Almost) everything is a component, and component sets are user-extendable

  • Component-based development

  • An OO way of chunking software: Java Beans

  • GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)

  • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.


  • GATE Language Resources

    • GATE LRs are documents, ontologies, corpora, lexicons.

    • Documents / corpora:

    • GATE documents loaded from local files or the web...

    • Diverse document formats: text, html, XML, email, RTF, SGML.

    • Multilinguality:

    • New internationalised versions of the JVM support more than 100 different encodings.

    • Other encodings: developing system for user-entry of mapping tables.

    • LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
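The multilinguality point can be illustrated with the standard Java charset API. This is a minimal sketch, not GATE code: the class name `EncodingDemo` and its `decode` helper are made up for the example, and the number of encodings reported depends on the JVM in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses the standard java.nio.charset API.
final class EncodingDemo {

    /** Decodes raw bytes into a Java String using the named encoding. */
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // Number of encodings this JVM supports (varies by vendor and version).
        System.out.println(Charset.availableCharsets().size() + " charsets available");

        // The same Hindi word stored in two different byte encodings.
        String word = "\u0939\u093f\u0928\u094d\u0926\u0940";
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = word.getBytes(StandardCharsets.UTF_16BE);

        // Both decode to the same Unicode string once the right encoding is named.
        System.out.println(decode(utf8, "UTF-8").equals(decode(utf16, "UTF-16BE")));
    }
}
```

Once the bytes are decoded, downstream code sees only Unicode strings, which is why the choice of source encoding matters only when a document is loaded.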


  • Processing Resources

    • Algorithmic components known as PRs – beans with execute methods.

    • All PRs can handle Unicode data by default.

    • Clear distinction between code and data (simple repurposing).

    • 20-30 freebies with GATE
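The idea of "beans with execute methods" and the split between code and data can be sketched roughly as follows. The interface `SimpleProcessingResource` and the class `DelimiterSplitter` are hypothetical stand-ins written for this example; they are not GATE's own API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a processing resource: a bean with an execute method. */
interface SimpleProcessingResource {
    void execute();
}

/** Splits text on a configurable delimiter; the delimiter is data, not code. */
final class DelimiterSplitter implements SimpleProcessingResource {
    private String text = "";
    private String delimiterRegex = "\\s+";
    private final List<String> pieces = new ArrayList<>();

    // Bean-style properties: parameters are set from outside before execute().
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String getDelimiterRegex() { return delimiterRegex; }
    public void setDelimiterRegex(String delimiterRegex) { this.delimiterRegex = delimiterRegex; }

    /** Results of the most recent execute() call. */
    public List<String> getPieces() { return pieces; }

    @Override
    public void execute() {
        pieces.clear();
        for (String piece : text.split(delimiterRegex)) {
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
    }

    public static void main(String[] args) {
        DelimiterSplitter splitter = new DelimiterSplitter();
        splitter.setText("Bengali Gujarati Hindi");
        splitter.execute();
        System.out.println(splitter.getPieces());

        // Repurposing: same code, different data.
        splitter.setText("Tamil,Urdu");
        splitter.setDelimiterRegex(",");
        splitter.execute();
        System.out.println(splitter.getPieces());
    }
}
```

Because the delimiter is a property rather than a hard-coded constant, the same class can be repurposed by changing its data alone.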

  • Unicode Tokeniser

    • splits text into typed tokens using an FSM that is dynamically constructed from a set of rules based on the character categories defined by the Unicode standard.

    • UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;

    • output can be localised by a later module (e.g. “don’t” is re-split into “do” and “n’t”)

    • current status:

      • 23 rules seem able to handle Indo-European languages without changes.

      • the English tokeniser: Unicode tokeniser + pattern grammar FST.
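The example rule above can be approximated with Java's built-in Unicode category lookup. This is a hand-written sketch of that single rule, not the GATE tokeniser: the class `UpperInitialRule` and its methods are hypothetical, the rule is hard-coded rather than compiled into an FSM from a rule file, and characters that do not start a match are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one rule:
// UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
final class UpperInitialRule {

    /** A token: the matched text plus the two features named by the rule. */
    static final class Token {
        final String text;
        final String orth;
        final String kind;

        Token(String text, String orth, String kind) {
            this.text = text;
            this.orth = orth;
            this.kind = kind;
        }
    }

    /** Length in chars of the longest rule match starting at 'start', or 0 if none. */
    static int matchLength(String s, int start) {
        if (start >= s.length()) {
            return 0;
        }
        int cp = s.codePointAt(start);
        if (Character.getType(cp) != Character.UPPERCASE_LETTER) {
            return 0;
        }
        int i = start + Character.charCount(cp);
        while (i < s.length()) {
            cp = s.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.LOWERCASE_LETTER && type != Character.DASH_PUNCTUATION) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i - start;
    }

    /** Emits one token per rule match; other characters are skipped in this sketch. */
    static List<Token> tokenise(String s) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int len = matchLength(s, i);
            if (len > 0) {
                tokens.add(new Token(s.substring(i, i + len), "upperInitial", "word"));
                i += len;
            } else {
                i += Character.charCount(s.codePointAt(i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenise("Sheffield and Lancaster")) {
            System.out.println(t.text + " orth=" + t.orth + " kind=" + t.kind);
        }
    }
}
```

Since the match is driven by Unicode general categories rather than by ASCII ranges, the same code accepts an upper-initial word in Cyrillic or Greek without modification.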




  • Editing Multilingual Data

  • Java provides no special support for text input (this may change)

  • GATE Unicode Kit (GUK) plugs this hole

  • Support for defining additional Input Methods; currently 30 IMs for 17 languages

  • Pluggable in other applications (e.g. MPI’s EUDICO)

  • Can use virtual keyboard or standard layouts over QWERTY

  • IMs defined in plain text files

  • GUK comes with a standalone Unicode editor
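To make the idea of an input method defined in a plain text file concrete, here is a toy sketch. The file format (`keys U+XXXX` per line, `#` for comments), the class `ToyInputMethod` and the key assignments are all assumptions invented for this example; the slides do not describe GUK's actual format or API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy input method; not GUK's format or API.
final class ToyInputMethod {
    private final Map<String, String> table = new HashMap<>();
    private int longestKey = 0;

    /** Parses lines of the assumed form "keys U+XXXX"; '#' starts a comment line. */
    ToyInputMethod(String definition) {
        for (String rawLine : definition.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            int codePoint = Integer.parseInt(parts[1].substring(2), 16);
            table.put(parts[0], new String(Character.toChars(codePoint)));
            longestKey = Math.max(longestKey, parts[0].length());
        }
    }

    /** Converts typed keys to output text, preferring the longest matching key sequence. */
    String convert(String typed) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < typed.length()) {
            boolean matched = false;
            for (int len = Math.min(longestKey, typed.length() - i); len > 0; len--) {
                String mapped = table.get(typed.substring(i, i + len));
                if (mapped != null) {
                    out.append(mapped);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(typed.charAt(i)); // pass unmapped keys through unchanged
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative mapping of QWERTY keys to three Devanagari consonants.
        String definition = "# toy Devanagari layout\nk U+0915\nkh U+0916\ng U+0917\n";
        ToyInputMethod im = new ToyInputMethod(definition);
        System.out.println(im.convert("khkg"));
    }
}
```

Keeping the table in a text file means a new layout can be added without recompiling, which matches the spirit of the "IMs defined in plain text files" point.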


  • EMILLE: Enabling Minority Language Engineering

  • 3-year EPSRC project at Lancaster University and Sheffield University.

    • Corpus development:

    • written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.

    • spoken corpora of at least 500,000 words per language.

  • Unicode developments for GATE:

    • Indic keyboard layouts.

    • encodings for Indic languages.

    • Development of basic LE tools:

    • POS tagging.

    • alignment tools for parallel corpora.


  • Encore

  • http://gate.ac.uk/

    • Other GATE-related stuff at LREC:

    • Saggion et al.: Extraction Information for MM Indexing [Weds, 19.05]

    • Baker et al.: EMILLE [Thurs, 10.25]

    • Demo and poster [Thurs, 11.00-12.20, session D1]

    • Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]

    • Fliers