Software Architecture for Language Engineering (SALE) – where next?

http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
IBM TJ Watson, 1st August 2003

Structure of the Talk
• SALE and its context
  • Definitions
  • The Knowledge Economy and HLT
  • Software Lifecycle
• GATE, a General Architecture for Text Engineering
  • History
  • Summary of Features and Principles
  • Component-based development
  • Unicode support
  • Measurement
  • CREOLE: some components
  • Users and Projects
• Where Next (give up and go home)?
  • Future context
  • Desirables
• Conclusion

SALE: definitions
• Computational Linguistics: science of language that uses computation as an investigative tool.
• Natural Language Processing: science of computation whose subject matter is data structures and algorithms for human language processing.
• Language Engineering: building systems whose cost and outputs are measurable and predictable.
• Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.
• SALE: software infrastructure, architecture and development tools for applied NLP and LE.

The Knowledge Economy and Human Language
• Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  • through 2012 more than 95% of human-to-computer information input will involve textual language
• A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
• The challenge: to reconcile these two opposing tendencies

IE and Knowledge: Closing the Language Loop
Key:
• MNLG: Multilingual Natural Language Generation
• OIE: Ontology-aware Information Extraction
• AIE: Adaptive IE
• CLIE: Controlled Language IE
[Diagram: Human Language, Controlled Language, and Formal Knowledge (ontologies and instance bases), connected by OIE, (A)IE, CLIE, and (M)NLG; further labels: Semantic Web; Semantic Grid; Semantic Web Services]

Software lifecycle in collaborative research
• Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
• Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
• Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
• Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
• Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the world over (or at least the European business travel industry).

Where did GATE come from?
• Early to mid-1990s (e.g. in TIPSTER):
  • Increasing trend towards multi-site collaborative projects
  • Role of engineering in scalable, reusable, and portable HLT
  • Support for large data, in multiple media, languages, formats, and locations
  • Lower cost of creation of language processing components
  • Promote quantitative evaluation metrics via tools and a level playing field
• GATE history:
  • 1996 – 2002: GATE version 1, proof of concept
  • March 2002: version 2, rewritten in Java, component based, LGPL, more users
  • Fall 2003: new development cycle

GATE is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., a graphical development environment.
GATE comes with...
• Some free components...
• ...and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at http://gate.ac.uk/download/

Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI

Component-based development
• CREOLE: a Collection of REusable Objects for Language Engineering
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
• Why bother? Allows the system to load arbitrary language processing components

CREOLE lifecycle
• Bootstrap: stub Java class, Makefile, config
• Registration: URL / JAR / creole.xml
• Instantiation: class loading, parameterisation, bean object creation
  • load-time parameters, e.g. a document’s charset
  • run-time parameters, e.g. a parser’s lexicon
• Three types of beans (not a new religion!):
  • Language Resources, e.g. doc, corpus, lexicon
  • Processing Resources, e.g. tagger, stat modeller
  • Visual Resources, e.g. doc editor, syntax editor

Language Resources (LRs)
• GATE LRs are documents, ontologies, corpora, lexicons, …
• LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
• Documents / corpora:
  • Diverse document formats: text, HTML, XML, email, RTF, SGML
  • Optional format-preserving markup analyse / save
  • Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
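To make the standoff annotation model concrete, here is a minimal Java sketch, assuming a plain in-memory document. The classes `StandoffAnnotation` and `StandoffDocument` are hypothetical illustrations and are not GATE's own API; they only show the (start, end, type, features) idea, where annotations point into the text by character offset and the text itself is never altered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a standoff annotation; not a GATE class.
class StandoffAnnotation {
    final int start;   // character offset where the span begins
    final int end;     // character offset just past the end of the span
    final String type; // e.g. "Token" or "Person"
    final Map<String, String> features = new HashMap<>();

    StandoffAnnotation(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

// Hypothetical document holder: the text is immutable, annotations live beside it.
class StandoffDocument {
    final String text;
    final List<StandoffAnnotation> annotations = new ArrayList<>();

    StandoffDocument(String text) {
        this.text = text;
    }

    // Adds an annotation over [start, end) after checking the offsets.
    StandoffAnnotation annotate(int start, int end, String type) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("offsets outside the document");
        }
        StandoffAnnotation a = new StandoffAnnotation(start, end, type);
        annotations.add(a);
        return a;
    }

    // Recovers the annotated text from the offsets alone.
    String textOf(StandoffAnnotation a) {
        return text.substring(a.start, a.end);
    }

    // Returns all annotations of the given type, in insertion order.
    List<StandoffAnnotation> ofType(String type) {
        List<StandoffAnnotation> result = new ArrayList<>();
        for (StandoffAnnotation a : annotations) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }
}
```

Because the markup is kept apart from the text, several overlapping annotation layers can coexist over the same characters, which inline markup cannot easily express.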

Processing Resources (PRs)
• Algorithmic components known as PRs: beans with execute methods.
• Controllers: execute a set of PRs
  • SerialController: sequential run of arbitrary PR set
  • SerialAnalyserController: analyser PRs over corpus
  • Conditional controllers: execution depends on features
  • Parallel controller?
• PRs + Controller = Applications
• Application parameterisation state can be saved and restored, and used for embedding / batching
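The "PRs + Controller = Applications" idea can be sketched in a few lines of Java. Everything named here (`SketchDoc`, `SketchPR`, `SketchSerialController`, and the two example PRs) is a hypothetical stand-in rather than GATE's real interfaces; the sketch only assumes that a PR exposes an execute method and that a serial controller runs its PRs in order over a document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical document that PRs read from and write to.
class SketchDoc {
    final String text;
    final List<String> tokens = new ArrayList<>();
    int capitalised; // filled in by CapitalCounter

    SketchDoc(String text) {
        this.text = text;
    }
}

// Hypothetical processing resource: a component with an execute method.
interface SketchPR {
    void execute(SketchDoc doc);
}

// Splits the text on whitespace and stores the tokens on the document.
class WhitespaceTokeniser implements SketchPR {
    public void execute(SketchDoc doc) {
        doc.tokens.clear();
        String trimmed = doc.text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        for (String t : trimmed.split("\\s+")) {
            doc.tokens.add(t);
        }
    }
}

// Counts tokens that start with an upper-case letter; relies on a tokeniser having run first.
class CapitalCounter implements SketchPR {
    public void execute(SketchDoc doc) {
        int count = 0;
        for (String t : doc.tokens) {
            if (Character.isUpperCase(t.charAt(0))) {
                count++;
            }
        }
        doc.capitalised = count;
    }
}

// Hypothetical serial controller: runs its PRs in the order they were added.
class SketchSerialController {
    private final List<SketchPR> prs = new ArrayList<>();

    SketchSerialController add(SketchPR pr) {
        prs.add(pr);
        return this;
    }

    void execute(SketchDoc doc) {
        for (SketchPR pr : prs) {
            pr.execute(doc);
        }
    }
}
```

A pipeline is then assembled as `new SketchSerialController().add(new WhitespaceTokeniser()).add(new CapitalCounter())` and run with `execute(doc)`. The order matters: the counter sees no tokens if it runs before the tokeniser.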

Visual Resources (VRs)

VRs (2): Coreference

VRs (3): Syntax

Editing Multilingual Data
• GATE Unicode Kit (GUK)
  • Complements Java’s facilities
  • Support for defining Input Methods (IMs)
  • currently 30 IMs for 17 languages
  • Pluggable in other applications (e.g. JEdit)

Processing Multilingual Data
• All processing, visualisation and editing tools use GUK

Performance Evaluation
• At document level – annotation diff

Regression Test
• At corpus level – corpus benchmark tool – tracking system’s performance over time

More CREOLE
• JAPE, FSTs over annotations
• ANNIE, A Nearly-New IE system
• DAML+OIL, Protégé, Ontology-Aware IE
• Information Retrieval, Lucene
• WordNet
• Machine Learning support

FSTs over annotations
• JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }

Info Extraction Components
• The ANNIE system – a reusable and easily extendable set of components

Populating Ontologies with IE

Protégé and Ontology Management

Information Retrieval
• Currently based on the Lucene IR engine

WordNet support

Machine Learning support
• Uses classification: [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations. (Documents can be classified as well using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.

Attributes
Attributes can be:
• Boolean: the [lack of] presence of an annotation of a particular type [partially] overlapping the referred instance annotation.
• Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
• Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
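As a rough illustration of nominal and numeric attributes with a relative position, the Java sketch below reads a feature from the instance at a given offset from the current one. The class `SketchAttribute` is hypothetical, and it assumes that "position" means an index offset within the ordered list of instance annotations (0 for the instance itself, -1 for the one before it); the slides do not spell this out.

```java
import java.util.List;
import java.util.Map;

// Hypothetical attribute definition; not part of GATE.
class SketchAttribute {
    final String feature; // feature name to read, e.g. "category"
    final int position;   // assumed: offset relative to the current instance

    SketchAttribute(String feature, int position) {
        this.feature = feature;
        this.position = position;
    }

    // Nominal value: the raw feature string of the instance at index + position,
    // or null when that instance does not exist or lacks the feature.
    String nominalValue(List<Map<String, String>> instances, int index) {
        int target = index + position;
        if (target < 0 || target >= instances.size()) {
            return null;
        }
        return instances.get(target).get(feature);
    }

    // Numeric value: the same feature converted from String to a number,
    // or null when the value is missing or not a number.
    Double numericValue(List<Map<String, String>> instances, int index) {
        String raw = nominalValue(instances, index);
        if (raw == null) {
            return null;
        }
        try {
            return Double.valueOf(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

With tokens whose "category" features are NNP, VBD, NN, an attribute at position -1 evaluated on the second token yields NNP, and the same attribute on the first token yields no value because there is no preceding instance.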

Implementation
• Machine Learning PR in GATE.
• Has two functioning modes: training and application
• Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE>…</ENGINE>
</ML-CONFIG>

The <DATASET> element:
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      …
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  …
</DATASET>

The <ENGINE> element:
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
Now: WEKA
Soon: Torch? YASMET? TIMBL?

[Figure labels: Attributes; Position; Instances type: Token]

Standard Use Scenario
Training
• Prepare training annotations.
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]
Application
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]

Using Other ML Libraries
The MLEngine Interface
• void addTrainingInstance(List attributes): adds a new training instance to the dataset.
• Object classifyInstance(List attributes): classifies a new instance.
• void init(): this method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition): sets the definition for the dataset used.
• void setOptions(org.jdom.Element options): sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr): registers the PR using the engine with the engine.
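To show roughly what plugging in another learner involves, here is a self-contained Java sketch of a trivial engine. `SketchMLEngine` is a hypothetical, cut-down stand-in for the MLEngine interface above: it keeps only init, addTrainingInstance, and classifyInstance, and omits the dataset definition, JDOM options, and owner PR so that it compiles without GATE. The assumption that the class label sits at a fixed index in the attribute list is also ours. `MajorityClassEngine` is a baseline that always predicts the most frequent training label, not a real learner.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the engine interface; not GATE's MLEngine.
interface SketchMLEngine {
    void init();
    void addTrainingInstance(List<String> attributes);
    Object classifyInstance(List<String> attributes);
}

// Baseline engine: remembers how often each class label was seen in training
// and predicts the most frequent one, ignoring the other attributes.
class MajorityClassEngine implements SketchMLEngine {
    private final int classIndex; // assumed position of the class label in each training instance
    private final Map<String, Integer> counts = new HashMap<>();

    MajorityClassEngine(int classIndex) {
        this.classIndex = classIndex;
    }

    // Resets the engine to an untrained state.
    public void init() {
        counts.clear();
    }

    // Records the class label of one training instance.
    public void addTrainingInstance(List<String> attributes) {
        String label = attributes.get(classIndex);
        Integer seen = counts.get(label);
        counts.put(label, seen == null ? 1 : seen + 1);
    }

    // Returns the most frequent training label, or null if nothing was learnt.
    public Object classifyInstance(List<String> attributes) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A wrapper for a real library would do its work in the same three places: set up the learner in init, buffer or feed instances in addTrainingInstance, and delegate to the trained model in classifyInstance.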

A bit of a nuisance (GATE users)
GATE team projects.
• Past:
  • Conceptual indexing: MUMIS: automatic semantic indices for sports video
  • MUSE, cross-genre entity finder
  • HSL, Health-and-safety IE
  • Old Bailey: collaboration with HRI on 17th century court reports
  • Multiflora: plant taxonomy text analysis for biodiversity research e-science
• Present:
  • Advanced Knowledge Technologies: €12m UK five site collaborative project
  • EMILLE: S. Asian languages corpus
  • ACE / TIDES: Arabic, Chinese NE
  • JHU summer w/s on semtagging
• Future:
  • Five new projects (below)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

Where Next (1)?
• Can Universities cope with the long term?
• User survey
• Future context:
  • SEKT: Knowledge Management
  • KnowledgeWeb: OntoWeb II
  • PrestoSpace: audiovisual preservation (FSTs for users?)
  • hTechSight: knowledge portal for petrochemicals
  • ETCSL: Electronic Text Corpus of Sumerian Language
  • DERI: Digital Enterprise Research Institute
  • PhDs: INK, PIE

Where Next (2)?
Some desirables:
• Corpus tools (ANNIC in progress)
• Audiovisual documents
• WS-based backend server, for ML, active learning etc.
• Better dialogue support (cf. AMITIES, Galaxy)
• Better MT support
• PDF documents
• JAPE debugger, editor, 101 language extensions (e.g. quantified ops, deletion, ontology callouts)
• Cleverer treatment of large documents in the GUI
• PR reloading

Conclusion
• GATE is:
  • Addressing the need for scalable, reusable, and portable HLT solutions
  • Supporting large data, in multiple media, languages, formats, and locations
  • Lowering the cost of creation of new language processing components
  • Promoting quantitative evaluation metrics via tools and a level playing field
  • Promoting experimental repeatability by developing and supporting free software
• http://gate.ac.uk/
