
GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction

Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham. Department of Computer Science, University of Sheffield. http://gate.ac.uk/





Presentation Transcript


  1. GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction
     Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham
     Department of Computer Science, University of Sheffield
     http://gate.ac.uk/
     Structure of the talk:
     • A brief introduction to GATE
     • Multilingual infrastructure in GATE
     • Simple multilingual IE components

  2. GATE is...
     • An architecture: a macro-level organisational picture for LE software systems.
     • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
     • A development environment: for language engineers, computational linguists et al., a graphical development environment.
     GATE comes with...
     • Some free components...
     • ...and wrappers for other people's components
     • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
     Free software (LGPL). Download at http://gate.ac.uk/download/

  3. Architectural principles
     • Non-prescriptive, theory neutral (strength and weakness)
     • Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
     • (Almost) everything is a component, and component sets are user-extendable
     • (Almost) all operations are available both from API and GUI

  4. Component-based development
     CREOLE – Collection of REusable Objects for Language Engineering:
     • Java Beans: an OO way of chunking software
     • GATE components: modified Java Beans with XML configuration
     • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
     • Three types: Language Resources, Processing Resources, Visual Resources
     Why bother?
     • Allows the system to load arbitrary language processing components

  5. Language Resources (LRs)
     • LRs are documents, ontologies, corpora, lexicons, ...
     • LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
     • Documents / corpora:
       • Diverse document formats: text, HTML, XML, email, RTF, SGML
       • Optional format-preserving markup analyse / save
       • Standoff annotation model (start, end, type, features), a derivative of TIPSTER, compatible with ATLAS and XCES
     • Coping with diverse character encodings:
       • New internationalised versions of the JVM support >100 different encodings
       • Other encodings: developing a system for user entry of mapping tables (to remove programming from the process)
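The standoff annotation model on this slide (start, end, type, features) can be sketched in a few lines of plain Java. This is an illustration only: the class and method names (`StandoffSketch`, `Annotation`, `coveredText`, `ofType`) are hypothetical and are not GATE's own API. The point is that annotations refer to the text by character offset, so the source text is never modified.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of standoff annotation; not GATE's own classes. */
class StandoffSketch {

    /** One annotation: character offsets into the text, a type, and free-form features. */
    static final class Annotation {
        final int start;                     // offset of the first character (inclusive)
        final int end;                       // offset just past the last character (exclusive)
        final String type;
        final Map<String, String> features;

        Annotation(int start, int end, String type, Map<String, String> features) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.features = Collections.unmodifiableMap(new HashMap<>(features));
        }

        /** Reads the annotated span out of the untouched source text. */
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /** Returns the annotations of the given type, in their original order. */
    static List<Annotation> ofType(List<Annotation> all, String type) {
        List<Annotation> out = new ArrayList<>();
        for (Annotation a : all) {
            if (a.type.equals(type)) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "GATE was developed in Sheffield.";

        Map<String, String> locFeatures = new HashMap<>();
        locFeatures.put("locType", "city");   // illustrative feature name and value

        List<Annotation> annotations = new ArrayList<>();
        annotations.add(new Annotation(0, 4, "Token", new HashMap<String, String>()));
        annotations.add(new Annotation(22, 31, "Location", locFeatures));

        for (Annotation a : ofType(annotations, "Location")) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") "
                    + a.coveredText(text) + " " + a.features);
        }
    }
}
```

Because the offsets live outside the text, several overlapping annotation layers can coexist over the same document, which is what makes format-preserving save possible.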

  6. Processing Resources (PRs)
     • Algorithmic components known as PRs: beans with execute methods
     • All PRs can handle Unicode data by default
     • Clear distinction between code and data (simple repurposing)
     • 20-30 freebies with GATE
     • Controllers: execute a set of PRs
       • SerialController: sequential run of an arbitrary PR set
       • SerialAnalyserController: analyser PRs over a corpus
       • Conditional controllers: execution depends on features
       • Parallel controller?
     • PRs + Controller = Applications
     • Application parameterisation state can be saved and restored, and used for embedding / batching
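The "PRs + Controller = Applications" idea can be sketched as follows. This is a minimal sketch in plain Java, not GATE's API: `ControllerSketch`, `Doc`, `ProcessingResource`, `SerialRunner` and `when` are hypothetical names chosen for illustration, and the two sample PRs are toy stand-ins for real analysers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of PRs run by a controller; not GATE's own classes. */
class ControllerSketch {

    /** A toy document: text plus a list of features added by PRs. */
    static final class Doc {
        final String text;
        final List<String> features = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** A processing resource: a component with an execute method. */
    interface ProcessingResource {
        void execute(Doc doc);
    }

    /** Runs its PRs in the order they were added. */
    static final class SerialRunner implements ProcessingResource {
        private final List<ProcessingResource> prs = new ArrayList<>();

        SerialRunner add(ProcessingResource pr) {
            prs.add(pr);
            return this;
        }

        @Override
        public void execute(Doc doc) {
            for (ProcessingResource pr : prs) {
                pr.execute(doc);
            }
        }

        /** Corpus variant: runs the whole PR sequence over each document. */
        void executeOver(List<Doc> corpus) {
            for (Doc doc : corpus) {
                execute(doc);
            }
        }
    }

    /** Conditional wrapper: the PR runs only when the condition holds for the document. */
    static ProcessingResource when(Predicate<Doc> condition, ProcessingResource pr) {
        return doc -> {
            if (condition.test(doc)) {
                pr.execute(doc);
            }
        };
    }

    /** Builds a toy application from two PRs, the second one conditional. */
    static SerialRunner sampleApplication() {
        ProcessingResource wordCounter =
                doc -> doc.features.add("words=" + doc.text.trim().split("\\s+").length);
        ProcessingResource nonAsciiFlag = doc -> doc.features.add("non-ascii");
        Predicate<Doc> hasNonAscii = doc -> doc.text.chars().anyMatch(c -> c > 127);
        return new SerialRunner().add(wordCounter).add(when(hasNonAscii, nonAsciiFlag));
    }

    public static void main(String[] args) {
        List<Doc> corpus = new ArrayList<>();
        corpus.add(new Doc("plain ascii text"));
        corpus.add(new Doc("caf\u00e9 society"));
        sampleApplication().executeOver(corpus);
        for (Doc doc : corpus) {
            System.out.println(doc.text + " -> " + doc.features);
        }
    }
}
```

Since the runner is itself a `ProcessingResource` in this sketch, pipelines can be nested, and the same application object can be reused for batch runs over a corpus.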

  7. Visual Resources (VRs)

  8. VRs (2): Coreference

  9. VRs (3): Syntax

  10. Displaying Multilingual Data
      • GATE uses the standard (and imperfect) Java rendering engine for displaying text.

  11. Editing Multilingual Data
      • GATE Unicode Kit (GUK)
      • Complements Java’s facilities
      • Support for defining Input Methods (IMs)
      • Currently 30 IMs for 17 languages
      • Pluggable in other applications (e.g. JEdit, EUDICO)
      • Can use a virtual keyboard or standard layouts over QWERTY
      • IMs defined in plain text files
      • GUK comes with a standalone Unicode editor

  12. Processing Multilingual Data
      • All processing, visualisation and editing tools use GUK

  13. Multilingual IE Components
      • The ANNIE system: a reusable and easily extendable set of components

  14. The Unicode Tokeniser
      • A very portable component for multiple languages: splits text into typed tokens using an FSM constructed dynamically from rules based on the character categories defined by the Unicode standard, e.g.:
        UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
      • Output is generally localised by a later module (e.g. “don’t” … “do” “n’t”)
      • The 23 rules seem able to handle Indo-European languages without changes
      • The English tokeniser: Unicode tokeniser + pattern grammar FST
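The example rule on this slide can be approximated with the Unicode general categories that the Java platform exposes through `Character.getType`. The sketch below hard-codes that single rule instead of compiling a rule file into an FSM, so it is not GATE's tokeniser; the names `UpperInitialRuleSketch`, `Token` and `tokenise` are hypothetical. Text that does not match the rule is simply skipped, whereas the real rule set would cover it with other rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of one Unicode-category rule; not GATE's tokeniser code. */
class UpperInitialRuleSketch {

    /** A typed token carrying the two features named in the rule. */
    static final class Token {
        final int start;
        final int end;
        final String text;
        final String orth = "upperInitial";
        final String kind = "word";

        Token(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Applies UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* from left to right. */
    static List<Token> tokenise(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) != Character.UPPERCASE_LETTER) {
                i += Character.charCount(cp);   // the rule cannot start here; move on
                continue;
            }
            // Consume the longest run of lowercase letters and dashes after the capital.
            int j = i + Character.charCount(cp);
            while (j < text.length()) {
                int next = text.codePointAt(j);
                int type = Character.getType(next);
                if (type != Character.LOWERCASE_LETTER && type != Character.DASH_PUNCTUATION) {
                    break;
                }
                j += Character.charCount(next);
            }
            tokens.add(new Token(i, j, text.substring(i, j)));
            i = j;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first word is a Cyrillic name written with Unicode escapes, followed by English text.
        String text = "\u041A\u0430\u043B\u0438\u043D\u0430 works in Sheffield";
        for (Token t : tokenise(text)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ") orth=" + t.orth
                    + " kind=" + t.kind);
        }
    }
}
```

Because the rule is written against character categories and not against specific letters, the same code matches the Cyrillic and the Latin capitalised word without any language-specific change, which is the portability argument the slide makes.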

  15. POS tagging in new languages
      • TIDES Surprise Language: used the Hepple tagger, but substituted a Cebuano/Hindi lexicon for the English one
      • Used an empty ruleset, since no training data was available
      • Used default heuristics (e.g. return NNP for capitalised words)
      • Very experimental, but reasonable results: 67% correctness for Hindi and 75% for Cebuano
      • Adaptation time per language: 2 days
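The lexicon-plus-default-heuristics setup described on this slide amounts to a very small lookup procedure, sketched below. This is not the Hepple tagger's code: `FallbackTaggerSketch` and `tag` are hypothetical names, the lexicon entry is a toy example and not taken from the lexicon used in the talk, and the `NN` default for other unknown words is an assumption (the slide only names the NNP heuristic).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lexicon lookup with default heuristics; not the Hepple tagger. */
class FallbackTaggerSketch {

    /** Tags one word: lexicon first, then heuristics, with no contextual rules applied. */
    static String tag(String word, Map<String, String> lexicon) {
        String known = lexicon.get(word);
        if (known != null) {
            return known;                       // the word is listed in the swapped-in lexicon
        }
        if (!word.isEmpty() && Character.isUpperCase(word.codePointAt(0))) {
            return "NNP";                       // heuristic from the slide: capitalised => proper noun
        }
        return "NN";                            // assumed default for any other unknown word
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("ang", "DT");               // toy entry, for illustration only

        String[] words = {"ang", "Cebu", "balay"};
        for (String word : words) {
            System.out.println(word + "/" + tag(word, lexicon));
        }
    }
}
```

With an empty ruleset, tagging quality depends entirely on lexicon coverage and on how well the capitalisation heuristic fits the language's orthography.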

  16. Porting NE grammars
      • Most English JAPE rules are based on POS tags and gazetteer lookup
      • Grammars can be reused for languages with similar word order, orthography etc.
      • No time to make a detailed study of Cebuano, but it is very similar in structure to English
      • Most of the rules were left as for English, with some adjustments, especially to handle dates
      • Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages

  17. TIDES Evaluation Results

  18. Conclusion
      • GATE: a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems
      • Requires little involvement of native speakers and very little annotated data for a basic job
      • Future work:
        • Improving multilingual support, e.g. morphology support, automatic language and encoding identification
        • Learning gazetteer lists from annotated corpora
