
Describing and Discovering Language Resources

Describing and Discovering Language Resources. David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh. Overview. Goals: availability and interoperability Service oriented architecture and workflow NLP Components Service description and discovery


Presentation Transcript


  1. Describing and Discovering Language Resources David Illsley, Ewan Klein, Steve Renals School of Informatics University of Edinburgh

  2. Overview • Goals: availability and interoperability • Service oriented architecture and workflow • NLP Components • Service description and discovery • NLP and the Grid

  3. What are Language Resources? • Language Resources (LRs) of two kinds: • Static resources: • corpora (text, speech, multimodal) • lexicons, terminologies, ontologies • grammars, declarative rule-sets • Processing resources: • segmenters, tokenizers, zoners, taggers, entity classifiers, chunkers, parsers, …

  4. Goals • Maximize availability of static LRs for automatic processing • Maximize interoperability of processing LRs

  5. LRs on the WWW, 1 • Can use the WWW to locate corpora • Example: OLAC (Open Language Archive Community) • Provides query interface to search for corpora across multiple repositories • Requires standard metadata record for harvesting. • Does not provide access to corpora.

  6. LRs on the WWW, 2 • Can use the WWW to directly search corpora • Many examples • BNC Online Search • words (with regular expressions) • tag strings • Typically search is limited (expressiveness, number of results)

  7. LRs on the WWW, 3 • Can use the WWW to download tools • Some tools offer a demo web interface • No interoperability: • you cannot take the output of one web-interfaced tool and feed it as input to another tool

  8. LRs on the WWW, 4 • Challenges for accessing static LRs for automatic processing: • licensing restrictions • file (or database) structure • data format • data transfer • What about processing LRs? • can download, • but not execute in an interoperable manner

  9. Web Services (WS) • WS is a self-contained software resource • Can be located and invoked across the web: • identified by a URL • public interfaces and bindings are defined and described using XML • Other applications interact with it in a prescribed manner • XML-based messages conveyed by internet protocols (e.g. HTTP) • Web services can be composed into complex, distributed applications

  10. Service Oriented Architecture (SOA) • [Diagram: a Service Provider publishes its service description to Discovery Agencies on the WWW; a Service Requester (client) locates the description there and then interacts with the service] • Source: Berners-Lee

  11. Web Service: Key Ideas • Interaction with Web Services is • described by • and conducted • using XML documents exchanged over the internet • SOAP protocol • describes the form of messages and how to process them • a way of representing Remote Procedure Calls over HTTP
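The idea of representing a remote procedure call as an XML message can be sketched as follows. This is only an illustration using Python's standard library: the service namespace `http://example.org/nlp`, the `tokenize` operation and its `text` parameter are hypothetical, and a real client would normally generate such messages from the service's WSDL description rather than by hand.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/nlp"  # hypothetical namespace of an NLP service


def build_request(operation, params):
    """Wrap a procedure call (operation name plus arguments) in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The call itself is an element named after the operation ...
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    # ... and each argument is a child element of that call.
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8")


# The resulting bytes would be sent as the body of an HTTP POST to the service URL.
message = build_request("tokenize", {"text": "Hello world."})
print(message.decode("utf-8"))
```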

  12. The Appeal of Web Services • A means of building distributed systems • virtualization — not dependent on any one programming language, OS, development environment • based on well-understood underlying protocols • components can be developed independently • decentralized (apart from DNS)

  13. NLP Services • Fairly easy to wrap legacy code as web services • Allows us to deploy tools across the web as part of a larger application • Corpora can also be deployed as services • Helps with availability and interoperability • But still many challenges

  14. Building NLP Applications • Many NLP applications involve relatively few ‘conceptual’ components: • tokenizers, taggers, named entity recognizers, parsers, etc • often different versions of the same components • much repeated (and messy) labour in wiring the components together to interoperate

  15. Issues in Component Approach • Granularity • What is appropriate ‘grain size’ of functionality? • Too fine: heavy overheads in communication, lose ease of use • Too gross: loss of flexibility • Hierarchical decomposition is possible • Compatibility • informational, functional, formal

  16. Linguistic Annotation • Makes information in raw text explicit: • Classification of words and phrases • Detection of structural relationships • Annotation with general and domain-specific semantic labels • Usually proceeds from more concrete to more abstract • Earlier stages of annotation feed into the later stages • Assumed that annotation is represented as XML

  17. Idealized View

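  18. Compatible NLP Services: Substitution • [Diagram: tokenize → POS tag → parse, with alternative POS taggers that can be substituted for one another]

  19. Compatible NLP Services: Sequencing • [Diagram: the services parse, POS tag and tokenize arranged into the sequence tokenize → POS tag → parse]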

  20. WSDL File • XML document, usually on same machine as server • Describes everything involved in calling a web service: • The service URL and namespace • The type of web service • List of available functions • Arguments for each function • Data type of each argument • Return value of each function and data type of each return value

  21. Processor Input and Output Types • Composition of NL processors constrained by input and output types • Candidates for types? • WSDL provides simple data types: • strings, integers, booleans • not expressive enough • Can we build on notion of metadata for LRs?

  22. IMDI Catalogue Specification • Catalogue.Title: Arabic Treebank • Catalogue.Subject-Language: ara • Catalogue.Content-Type: written • Catalogue.Format.Text: UTF-8 • Catalogue.Smallest Annotation Unit: word • Catalogue.Publisher: LDC • Catalogue.Size: 266 Mb

  23. LR Metadata Standards • Advantages • consistency • software knows what to expect • can be designed according to agreed principles • Challenges • no generally agreed ontology for LRs • hard to get agreement (and who gets to decide?) • categorizations of LRs influenced by favourite linguistic theory • Other people are addressing this issue

  24. What’s missing: tool metadata • What kind of metadata would enable us to ensure tool interoperability? • Neither OLAC nor IMDI provide an answer.

  25. Discovering Resources • Who cares about discovering LRs? • researchers who are searching for LRs that meet specific research criteria • information providers • teachers, journalists, casual browsers • … • Current focus: automatic discovery by software agents

  26. Service Description & Discovery • What LRs can be discovered depends on how the LRs are described. • How LRs are described depends on the requirements for discovery. • Composability: • If an agent (human or software) has already selected component P, what other components Q can provide well-formed input to P ? • Query for all Q such that Q’s output type is compatible with P’s input type
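The composability query can be sketched as a lookup over component descriptions. Everything named here is an assumption made for illustration: the annotation types, the small type hierarchy, and the component names are hypothetical, and "compatible" is read as "P's input type is the same as, or more general than, Q's output type".

```python
from dataclasses import dataclass

# Hypothetical type hierarchy (child -> parent): a POS-tagged document
# is also a tokenized document, which is also raw text.
PARENT = {
    "raw-text": None,
    "tokenized": "raw-text",
    "pos-tagged": "tokenized",
    "parsed": "pos-tagged",
}


@dataclass(frozen=True)
class Component:
    name: str
    input_type: str
    output_type: str


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = PARENT[t]
    return False


def providers_for(p, registry):
    """All Q whose output type is compatible with P's input type."""
    return [q for q in registry
            if q is not p and subsumes(p.input_type, q.output_type)]


registry = [
    Component("tokenizer", "raw-text", "tokenized"),
    Component("tagger_a", "tokenized", "pos-tagged"),
    Component("tagger_b", "tokenized", "pos-tagged"),
    Component("parser", "pos-tagged", "parsed"),
]
parser = registry[3]
print([q.name for q in providers_for(parser, registry)])
```

Both hypothetical taggers are returned for the parser, which corresponds to the substitution case: either can fill the same slot in a pipeline.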

  27. Some Versions of BNC • name: British National Corpus, Version 1.0; type: text; size: 2866 MB • name: British National Corpus, Version 1.0, marked up in XML; type: text; size: 815 MB • name: British National Corpus, Version 1.0, parsed with Charniak parser; type: text; size: 419 MB • name: British National Corpus, Version 1.0, parsed with IMS parser; type: text; size: 2088 MB • name: British National Corpus, Version 1.0, parsed with Minipar; type: text; size: 448 MB

  28. Corpus Request Scenario • Agent A requests corpus C with property [key = val]. • If C with [key = val] exists, serve it to A. • Otherwise, • find processor P such that output of P(C) satisfies [key = val] • apply P to C • serve result to A • store result for future requests
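The request scenario amounts to "serve if stored, otherwise generate and cache". A minimal sketch, assuming corpora carry a flat dictionary of properties and each processor advertises the single `(key, val)` property its output satisfies; the `Corpus` and `Processor` classes and the toy parser are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Corpus:
    name: str
    props: dict
    data: str = ""


@dataclass
class Processor:
    name: str
    provides: tuple  # the (key, val) property that the output satisfies
    run: Callable[[Corpus], Corpus]


def request(store, processors, name, key, val):
    """Serve corpus `name` with property [key = val], generating it if needed."""
    # If C with [key = val] exists, serve it.
    for c in store:
        if c.name == name and c.props.get(key) == val:
            return c
    # Otherwise find a processor P whose output satisfies [key = val].
    source = next((c for c in store if c.name == name), None)
    proc = next((p for p in processors if p.provides == (key, val)), None)
    if source is None or proc is None:
        return None
    result = proc.run(source)  # apply P to C
    store.append(result)       # store the result for future requests
    return result


def toy_parse(c):
    # Stand-in for a real parser: wraps the text in a dummy bracketing.
    return Corpus(c.name, {**c.props, "parser": "toy"}, f"(S {c.data})")


store = [Corpus("BNC", {"markup": "none"}, "the cat sat")]
processors = [Processor("toy-parser", ("parser", "toy"), toy_parse)]

first = request(store, processors, "BNC", "parser", "toy")   # generated
second = request(store, processors, "BNC", "parser", "toy")  # served from store
```

From the requesting agent's point of view the two calls look identical, which is the point of hiding whether a resource is stored or generated.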

  29. Service Description • Standard approach • WSDL: describes service inputs/outputs in terms of simple data types • Doesn’t support semantically-based service discovery • Alternatives from Semantic Web • inputs and outputs specified in an ontology language • OWL and RDF both possible

  30. NLP as Document Annotation • NL Processor • takes a partially annotated document as input • yields a more richly annotated document as output

  31. Tagging as document annotation • Part of Speech Tagger • takes in a document with markup of words • yields a document with additional markup of part of speech
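Tagging as document annotation can be illustrated with a toy tagger that reads XML with word markup and returns the same document with part-of-speech attributes added. The `<s>` and `<w>` element names, the `pos` attribute and the three-word lexicon are assumptions made for this sketch; a real tagger would use a statistical model rather than a lookup table.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon standing in for a real tagging model.
TOY_LEXICON = {"the": "DT", "cat": "NN", "sat": "VBD"}


def tag_document(xml_in):
    """Add a `pos` attribute to every <w> element, leaving the rest untouched."""
    root = ET.fromstring(xml_in)
    for w in root.iter("w"):
        word = (w.text or "").lower()
        w.set("pos", TOY_LEXICON.get(word, "UNK"))
    return ET.tostring(root, encoding="unicode")


doc = "<s><w>The</w><w>cat</w><w>sat</w></s>"
tagged = tag_document(doc)
print(tagged)
```

The input markup survives unchanged in the output, so the tagged document can still be consumed by any service that only expects word markup.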

  32. Document Class • NB: This is just corpus metadata!

  33. Subsumption over the Document class

  34. Subsumption over Processors

  35. Grid & NLP • Parallelism • distribute processes over many machines • use parallel algorithms within process • redundancy and fault tolerance • Distributed data • multiple corpora • distributed annotation of single corpus • Distributed processing pipeline • different components hosted at different sites

  36. Implementation • Based on Globus Toolkit 3.2 middleware • Corpus Services and Transformation Services provide interfaces for corpora and tools • Service Data Elements describe properties of services • properties are aggregated by Index Service, can be queried by clients • Index Service extended by Model Service • provides richer description of services using RDF triples • Backward chaining used to construct pipelines that will produce a requested resource
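Backward chaining over service descriptions can be sketched as follows. This is not the Globus or Model Service API: the service table is a plain dictionary standing in for the RDF descriptions, each service is assumed to have exactly one input type and one output type, and all names are hypothetical.

```python
def plan(goal, available, services, seen=frozenset()):
    """Work backwards from `goal` to a resource type that is already available.

    Returns the list of service names to run in order, or None if no
    pipeline can produce the goal.
    """
    if goal in available:
        return []      # nothing to do: the resource already exists
    if goal in seen:
        return None    # avoid looping on cyclic service descriptions
    for name, (needs, produces) in services.items():
        if produces == goal:
            # Recursively plan for whatever this service needs as input.
            prefix = plan(needs, available, services, seen | {goal})
            if prefix is not None:
                return prefix + [name]
    return None


# Hypothetical descriptions: service name -> (input type, output type)
services = {
    "tokenizer": ("raw-text", "tokenized"),
    "tagger": ("tokenized", "pos-tagged"),
    "parser": ("pos-tagged", "parsed"),
}

pipeline = plan("parsed", {"raw-text"}, services)
print(pipeline)
```

If a POS-tagged version of the corpus is already stored, the same call with `{"pos-tagged"}` as the available set yields a shorter pipeline containing only the parser.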

  37. Summary • Corpus query • for user, no obvious distinction between raw and processed data • Corpus service • either provide existing resource, or generate it • Need to have metadata for tools which allows automatic composition • Metadata needs to allow subsumption matching • using shared controlled vocabulary
