A Unicode-based Environment for the Creation and use of LRs




  • Valentin Tablan, Cristian Ursu, Kalina Bontcheva,
  • Hamish Cunningham, Diana Maynard, Oana Hamza,
  • Tony McEnery¹, Paul Baker¹, Mark Leisher²
  • Department of Computer Science, University of Sheffield
  • ¹University of Lancaster
  • ²New Mexico State University
    • GATE (a General Architecture for Text Engineering) and multilingual language resources (ML LRs)
    • Motivation (history of men’s underwear)
    • Short definition of GATE
    • GATE, Unicode and Java
    • EMILLE



Motivation for Software Infrastructure for Language Engineering

    • Analogy with recent history of men’s underwear – also supportive infrastructure:
    • The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive
    • The brave new world: boxer shorts: still supportive, but less constraining
    • The purpose of our work (the boxer shorts ideal):
    • freedom within a supportive environment



GATE is:

  • An architecture: a macro-level organisational picture for LE software systems.
  • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
  • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
  • Some free components... and wrappers for other people's components
  • Tools for: evaluation; visualisation/edit; persistence; IR; IE; dialogue; ontologies; etc.
  • Free software (LGPL). Download at



Architectural principles

  • Non-prescriptive, theory neutral (strength and weakness)
  • Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)
  • (Almost) everything is a component, and component sets are user-extendable
  • Component-based development
  • An OO way of chunking software: Java Beans
  • GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
  • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.



GATE Language Resources

    • GATE LRs are documents, ontologies, corpora, lexicons.
    • Documents / corpora:
    • GATE documents loaded from local files or the web...
    • Diverse document formats: text, html, XML, email, RTF, SGML.
    • Multilinguality:
    • New internationalised versions of JVM support >100 different encodings.
    • Other encodings: developing system for user-entry of mapping tables.
    • LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
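
The mapping-table idea for encodings that the platform does not cover can be sketched as follows. This is an illustration in Python rather than GATE's Java, and the table format (one `byte  code-point` pair per line, both in hex) and the function names are hypothetical; they are not GATE's actual format or API.

```python
def parse_mapping_table(text):
    """Parse lines of the form '0xNN  0xUUUU' into a byte -> character dict.

    The format is hypothetical; '#' starts a comment.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        byte_hex, codepoint_hex = line.split()
        table[int(byte_hex, 16)] = chr(int(codepoint_hex, 16))
    return table


def decode_with_table(data, table, fallback="\ufffd"):
    """Decode single-byte data: table entries win, ASCII passes through,
    and any other byte becomes the Unicode replacement character."""
    out = []
    for b in data:
        if b in table:
            out.append(table[b])
        elif b < 0x80:
            out.append(chr(b))
        else:
            out.append(fallback)
    return "".join(out)


# A made-up two-entry table mapping bytes to Devanagari letters.
EXAMPLE_TABLE = """
0xA1  0x0915   # DEVANAGARI LETTER KA
0xA2  0x0916   # DEVANAGARI LETTER KHA
"""

table = parse_mapping_table(EXAMPLE_TABLE)
decoded = decode_with_table(b"a\xa1\xa2\xff", table)
```

Keeping the table in a plain text file means a new legacy encoding can be supported by editing data only, with no change to the decoding code.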



Processing Resources

    • Algorithmic components known as PRs – beans with execute methods.
    • All PRs can handle Unicode data by default.
    • Clear distinction between code and data (simple repurposing).
    • 20-30 freebies with GATE
  • Unicode Tokeniser
    • splits text into typed tokens using an FSM
    • the FSM is dynamically constructed from a set of rules based on the character categories defined by the Unicode standard, e.g.:
    • UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
    • output can be localised by a later module (e.g. splitting “don’t” into “do” and “n’t”)
    • current status:
      • 23 rules seem able to handle Indo-European languages without changes.
      • the English tokeniser: Unicode tokeniser + pattern grammar FST.
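
To make the example rule concrete, here is a minimal sketch of how a single rule over Unicode character categories could be applied. It is written in Python for brevity and hand-codes one rule instead of building an FSM from a rule file, so it is not GATE's implementation. The function names and the token dictionary layout are assumptions; the category codes `Lu`, `Ll` and `Pd` are the standard Unicode general-category values for uppercase letter, lowercase letter and dash punctuation.

```python
import unicodedata

# Unicode general-category codes for the names used in the rule.
UPPERCASE_LETTER = "Lu"
LOWERCASE_LETTER = "Ll"
DASH_PUNCTUATION = "Pd"


def match_upper_initial(text, start):
    """Return the end index of a match for
    UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* at `start`, or None."""
    if start >= len(text) or unicodedata.category(text[start]) != UPPERCASE_LETTER:
        return None
    end = start + 1
    while end < len(text) and unicodedata.category(text[end]) in (
        LOWERCASE_LETTER,
        DASH_PUNCTUATION,
    ):
        end += 1
    return end


def tokenise(text):
    """Return a token dict for every match of the rule; characters that
    do not start a match are skipped (a full tokeniser would have more rules)."""
    tokens = []
    i = 0
    while i < len(text):
        end = match_upper_initial(text, i)
        if end is None:
            i += 1
            continue
        tokens.append({
            "string": text[i:end],
            "start": i,
            "end": end,
            "orth": "upperInitial",
            "kind": "word",
        })
        i = end
    return tokens


tokens = tokenise("Jean-Pierre visited София")
```

Because the rule refers to character categories and not to specific letters, the same code matches the Cyrillic word as readily as the Latin ones. Taken alone, this one rule also stops at the second capital in "Jean-Pierre", which is the kind of case a later, language-specific module would clean up.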



Displaying Multilingual Data (1)

  • GATE uses standard (and imperfect) Java rendering engine for displaying text.



Displaying Multilingual Data (2)

  • All the visualisation and editing tools for ML LRs use the same facilities.



Editing Multilingual Data

  • Java provides no special support for text input (this may change)
  • GATE Unicode Kit (GUK) plugs this hole
  • Support for defining additional Input Methods; currently 30 IMs for 17 languages
  • Pluggable in other applications (e.g. MPI’s EUDICO)
  • Can use virtual keyboard or standard layouts over QWERTY
  • IMs defined in plain text files
  • GUK comes with a standalone Unicode editor
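
As a rough illustration of an input method defined in a plain text file, the sketch below parses a small key-to-character table and applies it to typed keys. The file format (`key-sequence U+XXXX` per line, `#` for comments) and the function names are hypothetical and do not reflect GUK's real definition format; the code is Python, not the Java that GUK is written in.

```python
def parse_input_method(text):
    """Parse 'key-sequence U+XXXX' lines (hypothetical format) into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keys, codepoint = line.split()
        mapping[keys] = chr(int(codepoint[2:], 16))  # drop the 'U+' prefix
    return mapping


def apply_input_method(keystrokes, mapping):
    """Convert typed keys to text, preferring the longest matching key sequence."""
    longest = max((len(k) for k in mapping), default=0)
    out = []
    i = 0
    while i < len(keystrokes):
        for size in range(min(longest, len(keystrokes) - i), 0, -1):
            chunk = keystrokes[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            out.append(keystrokes[i])  # unmapped key passes through unchanged
            i += 1
    return "".join(out)


# Made-up layout fragment: Latin keys to Devanagari consonants.
EXAMPLE_IM = """
# keys  code point
k       U+0915
kh      U+0916
g       U+0917
"""

im = parse_input_method(EXAMPLE_IM)
typed = apply_input_method("khkg!", im)
```

Longest-match lookup lets a layout over a standard QWERTY keyboard use multi-key sequences such as `kh` without them being consumed as `k` followed by `h`.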



EMILLE: Enabling Minority LE

  • 3-year EPSRC project at Lancaster University and Sheffield University.
  • Corpus development:
    • written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
    • spoken corpora of at least 500,000 words per language.
  • Unicode developments for GATE:
    • Indic keyboard layouts.
    • encodings for Indic languages.
  • Development of basic LE tools:
    • POS tagging.
    • alignment tools for parallel corpora.




    • Other GATE-related stuff at LREC:
    • Saggion et al.: Extracting Information for MM Indexing [Weds, 19.05]
    • Baker et al.: EMILLE [Thurs, 10.25]
    • Demo and poster [Thurs, 11.00-12.20, session D1]
    • Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]
    • Fliers