
A Unicode-based Environment for the Creation and use of LRs

Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery¹, Paul Baker¹, Mark Leisher². Department of Computer Science, University of Sheffield; ¹University of Lancaster; ²New Mexico State University

Presentation Transcript


  1. A Unicode-based Environment for the Creation and use of LRs
     • Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery¹, Paul Baker¹, Mark Leisher²
     • Department of Computer Science, University of Sheffield
     • ¹University of Lancaster
     • ²New Mexico State University
     • GATE (a General Architecture for Text Engineering) and multilingual language resources (ML LRs)
       • Motivation (history of men’s underwear)
       • Short definition of GATE
       • GATE, Unicode and Java
       • EMILLE

  2. Motivation for Software Infrastructure for Language Engineering
     • Analogy with the recent history of men’s underwear, which is also supportive infrastructure:
       • The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive
       • The brave new world: boxer shorts: still supportive, but less constraining
     • The purpose of our work (the boxer shorts ideal):
       • freedom within a supportive environment

  3. GATE is:
     • An architecture: a macro-level organisational picture for LE software systems.
     • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
     • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
     • Some free components, and wrappers for other people's components
     • Tools for: evaluation; visualisation/edit; persistence; IR; IE; dialogue; ontologies; etc.
     • Free software (LGPL). Download at http://gate.ac.uk/download/

  4. Architectural principles
     • Non-prescriptive, theory neutral (strength and weakness)
     • Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)
     • (Almost) everything is a component, and component sets are user-extendable
     • Component-based development
       • An OO way of chunking software: Java Beans
       • GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
       • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

  5. GATE Language Resources
     • GATE LRs are documents, ontologies, corpora, lexicons.
     • Documents / corpora:
       • GATE documents loaded from local files or the web...
       • Diverse document formats: text, HTML, XML, email, RTF, SGML.
     • Multilinguality:
       • New internationalised versions of the JVM support >100 different encodings.
       • Other encodings: developing a system for user entry of mapping tables.
     • LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
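The two multilinguality bullets can be illustrated with a minimal Java sketch. It uses the standard `java.nio.charset.Charset` class to decode bytes in a named encoding, and a hand-written lookup table for an encoding the JVM does not know. The class name `EncodingSketch`, its method names and the table layout are assumptions made for illustration; this is not GATE's own loading code or its mapping-table format.

```java
import java.nio.charset.Charset;

final class EncodingSketch {
    /** Decodes raw bytes using an encoding that the JVM supports by name. */
    static String decode(byte[] bytes, String encodingName) {
        return new String(bytes, Charset.forName(encodingName));
    }

    /**
     * Decodes a single-byte encoding unknown to the JVM through a
     * user-supplied table: table[b] is the character for byte value b.
     * (Hypothetical layout, standing in for a user-entered mapping table.)
     */
    static String decodeWithTable(byte[] bytes, char[] table) {
        StringBuilder out = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            out.append(table[b & 0xFF]); // mask so the index is 0..255
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The number of encodings depends on the JVM in use.
        System.out.println(Charset.availableCharsets().size() + " encodings available");
    }
}
```

The same character can arrive as different byte sequences depending on the encoding, which is why the encoding has to be named when a document is loaded.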

  6. Processing Resources
     • Algorithmic components known as PRs: beans with execute methods.
     • All PRs can handle Unicode data by default.
     • Clear distinction between code and data (simple repurposing).
     • 20-30 freebies with GATE
     • Unicode Tokeniser
       • splits text into typed tokens based on an FSM
       • the FSM is dynamically constructed from a set of rules based on the character categories defined by the Unicode standard, e.g.
         UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
       • output can be localised by a later module (e.g. “don’t” … “do” “n’t”)
       • current status:
         • 23 rules seem able to handle Indo-European languages without changes.
         • the English tokeniser: Unicode tokeniser + pattern grammar FST.
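The example rule on this slide can be hand-coded in Java to show what matching on Unicode character categories means. The sketch below uses the standard `Character.getType` method. The class `UpperInitialRule` and its `match` method are hypothetical names, and the code hard-wires one rule instead of building an FSM from a rule file as the slide describes. It also starts a match at any uppercase letter, whereas a real tokeniser would start only at token boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class UpperInitialRule {
    /**
     * Returns each maximal match of
     * UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*
     * found while scanning the text left to right.
     */
    static List<String> match(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) == Character.UPPERCASE_LETTER) {
                int end = i + Character.charCount(cp);
                // Extend the match over lowercase letters and dashes.
                while (end < text.length()) {
                    int next = text.codePointAt(end);
                    int type = Character.getType(next);
                    if (type != Character.LOWERCASE_LETTER
                            && type != Character.DASH_PUNCTUATION) {
                        break;
                    }
                    end += Character.charCount(next);
                }
                tokens.add(text.substring(i, end));
                i = end;
            } else {
                i += Character.charCount(cp); // not the start of a match
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(match("Sheffield and Lancaster"));
    }
}
```

Because the test is on the Unicode category and not on a fixed range such as A-Z, the same code accepts capitalised words in Cyrillic or Greek without any change, which is the point the slide makes about rules carrying over between languages.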

  7. Displaying Multilingual Data (1)
     • GATE uses the standard (and imperfect) Java rendering engine for displaying text.

  8. Displaying Multilingual Data (2)
     • All the visualisation and editing tools for ML LRs use the same facilities.

  9. Editing Multilingual Data
     • Java provides no special support for text input (this may change)
     • GATE Unicode Kit (GUK) plugs this hole
       • Support for defining additional Input Methods; currently 30 IMs for 17 languages
       • Pluggable in other applications (e.g. MPI’s EUDICO)
       • Can use virtual keyboard or standard layouts over QWERTY
       • IMs defined in plain text files
       • GUK comes with a standalone Unicode editor
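The idea of an input method defined in a plain text file can be sketched as a table from key sequences typed on a QWERTY keyboard to Unicode characters. The file format below (one `keys U+XXXX` pair per line, `#` for comments) and the class `InputMethodSketch` are invented for illustration and are not GUK's actual format or API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InputMethodSketch {
    /** Parses lines of the hypothetical form "keys U+XXXX"; '#' starts a comment line. */
    static Map<String, String> parse(String definition) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : definition.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            int codePoint = Integer.parseInt(parts[1].substring(2), 16); // drop "U+"
            mapping.put(parts[0], new String(Character.toChars(codePoint)));
        }
        return mapping;
    }

    /** Converts typed keys to output text, trying the longest key sequence first. */
    static String type(String keys, Map<String, String> mapping) {
        int longest = 0;
        for (String k : mapping.keySet()) {
            longest = Math.max(longest, k.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < keys.length()) {
            boolean matched = false;
            for (int len = Math.min(longest, keys.length() - i); len > 0; len--) {
                String output = mapping.get(keys.substring(i, i + len));
                if (output != null) {
                    out.append(output);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(keys.charAt(i++)); // unmapped keys pass through unchanged
            }
        }
        return out.toString();
    }
}
```

With a definition containing `k U+0915` and `kh U+0916`, typing `kh` yields the single Devanagari letter KHA and not KA followed by `h`, because longer key sequences are tried first.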

  10. EMILLE: Enabling Minority LE
     • 3-year EPSRC project at Lancaster University and Sheffield University.
     • Corpus development:
       • written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
       • spoken corpora of at least 500,000 words per language.
     • Unicode developments for GATE:
       • Indic keyboard layouts.
       • encodings for Indic languages.
     • Development of basic LE tools:
       • POS tagging.
       • alignment tools for parallel corpora.

  11. Encore
     • http://gate.ac.uk/
     • Other GATE-related stuff at LREC:
       • Saggion et al.: Extraction Information for MM Indexing [Weds, 19.05]
       • Baker et al.: EMILLE [Thurs, 10.25]
       • Demo and poster [Thurs, 11.00-12.20, session D1]
       • Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]
     • Fliers
