X. Artola, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, A. Sologaistoa and A. Soroa

EULIA: a graphical web interface for creating, browsing and editing linguistically annotated corpora


  1. EULIA: a graphical web interface for creating, browsing and editing linguistically annotated corpora. X. Artola, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, A. Sologaistoa and A. Soroa

  2. Introduction (I): the context • Processing of real texts (Basque) • Integrated language processing tools (so far): • EDBL, a general-purpose lexical database (80,000+ entries). • Tokenizer: word tokens and sentences • Morpheus, a wide-coverage morphosyntactic analyzer: • A segmentizer • A morphosyntactic analyzer • A recognizer of multiword lexical units and expressions • EusLem, a tagger/lemmatizer • A shallow syntactic analyzer (currently under development) XbRAC (LREC’04) Lisbon, 13/09/2014

  3. Introduction (II) • Main motivation: linguistic annotation "chaos" • Proliferation of a variety of formats as we developed language processing tools: standardize and integrate • So, we adopted: • A stand-off style of annotation • XML-encoded TEI-conformant feature structures (FS) • EULIA is meant to be an environment to coordinate NLP tools and to exploit the data generated by them • First step of a bigger project: a good base for the integration of different areas of linguistic engineering

  4. Outline • Introduction • The annotation framework and the library • The I/O stream between linguistic analysis tools: the "production chain" • EULIA • functionality, architecture, and graphical interface • coordination module and abstraction layer • Future work • Conclusions

  5. The annotation framework [diagram labels: text elements; links; linguistic information (collections of FSs)] • Main objective: to provide a framework for the development and use of language resources and tools • The output of each analysis tool may be seen as composed of several XML documents: the annotation web • Framework schema: • Text elements: single- or multi-word tokens, text spans, etc. • Links: between text elements and linguistic information • Linguistic information: feature structures (TEI-P4 FSs)
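The three-part schema on this slide (text elements, links, linguistic information) can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration and not EULIA's actual encoding: the element names are modelled on TEI tags (`w`, `fs`, `f`, `link`), while the ids, the feature names (`lemma`, `pos`), the `sym` value element, the `targets` attribute and the Basque example word are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def make_token(token_id, form):
    """Text element: one word token, kept apart from its analyses."""
    w = ET.Element("w", {"id": token_id})
    w.text = form
    return w


def make_fs(fs_id, features):
    """Linguistic information: a flat feature structure (name -> symbol)."""
    fs = ET.Element("fs", {"id": fs_id})
    for name, value in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "sym", {"value": value})
    return fs


def make_link(token_id, fs_id):
    """Link: ties a text element to an analysis by id (stand-off style)."""
    return ET.Element("link", {"targets": f"{token_id} {fs_id}"})


# Hypothetical example data: the word form and its features are illustrative.
token = make_token("w1", "etxea")
analysis = make_fs("fs1", {"lemma": "etxe", "pos": "noun"})
link = make_link("w1", "fs1")

for element in (token, analysis, link):
    print(ET.tostring(element, encoding="unicode"))
```

Because the three pieces only refer to each other by id, each could live in its own XML document, which is the idea behind the "annotation web".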

  6. The annotation framework: general features • It uses feature structures (FS) as a general data model for exchanging information between the different tools • It provides a way to represent different types of linguistic information • Partial results and ambiguities can be easily represented • Representation independence • Many XML tools are available, on top of which we have developed our own library: LibiXaML

  7. The annotation framework: the library LibiXaML (I) [diagram labels: structure and relations in the annotation web; set of classes encapsulated in LibiXaML] • LibiXaML: implementation of a set of abstract data types (object-oriented model): general API over the annotation web • It hides XML tags and links, providing us with a set of logical objects and relationships (tokens, analyses, correspondences between them...) • Transparent, modular and scalable • Versatile for managing linguistic data • It provides an infrastructure for searching

  8. The annotation framework: the library LibiXaML (II) • Classes in LibiXaML represent: • Text anchors: text elements found in the input (single- and multi-word tokens, text spans…) • Analysis collections: feature structure sets • Links: relation between anchors and their corresponding annotations (analyses) • XML Documents: collection of anchors, analyses, and/or links • … • Most of these classes correspond to TEI guidelines tags: FS, F, W, Link, …
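The kinds of objects listed on this slide can be pictured with a few Python dataclasses. This is a hypothetical sketch of such an object model and not LibiXaML's real API: the class names, fields and methods (`Anchor`, `Link`, `AnnotationWeb`, `annotate`, `analyses_for`) are invented for illustration, and feature structures are simplified to flat dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Anchor:
    """A text element: a single- or multi-word token or a text span."""
    id: str
    form: str


@dataclass(frozen=True)
class Link:
    """Relation between one anchor and one analysis, both referenced by id."""
    anchor_id: str
    analysis_id: str


@dataclass
class AnnotationWeb:
    """Collection of anchors, analyses (simplified FSs) and links."""
    anchors: Dict[str, Anchor] = field(default_factory=dict)
    analyses: Dict[str, Dict[str, str]] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

    def annotate(self, anchor: Anchor, analysis_id: str,
                 features: Dict[str, str]) -> None:
        """Register an analysis and link it to the given anchor."""
        self.anchors[anchor.id] = anchor
        self.analyses[analysis_id] = features
        self.links.append(Link(anchor.id, analysis_id))

    def analyses_for(self, anchor_id: str) -> List[Dict[str, str]]:
        """Return every analysis linked to an anchor, in insertion order."""
        return [self.analyses[link.analysis_id]
                for link in self.links if link.anchor_id == anchor_id]


# An ambiguous token is simply an anchor with more than one link.
web = AnnotationWeb()
token = Anchor("w1", "walk")
web.annotate(token, "fs1", {"pos": "noun"})
web.annotate(token, "fs2", {"pos": "verb"})
print(web.analyses_for("w1"))  # [{'pos': 'noun'}, {'pos': 'verb'}]
```

Callers work with anchors, analyses and links only; how these are serialized as XML stays behind the object interface, which is the role the slides assign to the library.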

  9. The annotation framework: the library LibiXaML (III)

  10. Production chain (I) • Tokenizer (TK): • Input: an XML-tagged text • Output: list of the recognized tokens and sentences • Segmentizer (SG): • Input: tokenized text and the general lexicon • Output: library of segmentation analyses and link document between the tokens and their analyses • Morphosyntactic treatment (MS): • Input: output of the segmentizer • Output: library of morphosyntactic analyses and links between the tokens and their analyses

  11. Production chain (II) • Multiword treatment (HT): • Input: tokenized text and morphosyntactic analyses • Output: structure of the MWLUs identified in the text, and the morphosyntactic library enriched with MWLU information and its links • Lemmatizer (EL): • Input: tokens, MWLUs and morphosyntactic information • Output: lemmatization library and links between the single- and multi-word tokens and their corresponding lemmatizations
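The input/output contract of the production chain (TK, SG, MS, HT, EL) can be sketched as a sequence of stages that each declare what they read and what they write. The stage abbreviations come from the slides; everything else is an assumption made for illustration: the key names (`tokens`, `seg_links`, ...), the `Stage` type and the placeholder outputs, which stand in for the real analysis tools.

```python
from typing import Dict, List, NamedTuple

Web = Dict[str, object]  # hypothetical in-memory stand-in for the annotation web


class Stage(NamedTuple):
    name: str
    requires: List[str]
    produces: List[str]


def placeholder_run(stage: Stage, web: Web) -> Web:
    """Stand-in for a real tool: emits one labelled placeholder per output."""
    return {key: f"{key} produced by {stage.name}" for key in stage.produces}


CHAIN = [
    Stage("TK", ["text"], ["tokens", "sentences"]),
    Stage("SG", ["tokens", "lexicon"], ["seg_analyses", "seg_links"]),
    Stage("MS", ["seg_analyses", "seg_links"], ["morph_analyses", "morph_links"]),
    # HT enriches the morphosyntactic library, so it rewrites those outputs.
    Stage("HT", ["tokens", "morph_analyses"],
          ["mwlus", "morph_analyses", "morph_links"]),
    Stage("EL", ["tokens", "mwlus", "morph_analyses"],
          ["lem_analyses", "lem_links"]),
]


def run_chain(chain: List[Stage], web: Web) -> Web:
    """Run the stages in order, refusing to run one whose inputs are missing."""
    web = dict(web)  # work on a copy so the caller's data is left untouched
    for stage in chain:
        missing = [key for key in stage.requires if key not in web]
        if missing:
            raise ValueError(f"{stage.name} is missing inputs: {missing}")
        web.update(placeholder_run(stage, web))
    return web


result = run_chain(CHAIN, {"text": "raw XML-tagged text", "lexicon": "EDBL"})
print(sorted(result))
```

Declaring inputs and outputs per stage makes a broken chain fail early, for instance when the segmentizer is started without the general lexicon.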

  12. (No transcribed text)

  13. General libraries vs. text-specific documents • Text-specific documents, which contain the linguistic annotations corresponding to a particular input text • General analysis libraries containing the sets of all the annotations produced by the different tools: seglib, morflib, lemlib… • The second approach speeds up processing and saves a lot of disk space, but it requires a database management system underneath
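The space saving of a general library comes from storing each distinct analysis once and letting every text point to it. A minimal sketch of that idea follows, with an in-memory dictionary standing in for the database management system the slide calls for. The `AnalysisLibrary` class, its id scheme and the flat-dictionary analyses are all hypothetical.

```python
from typing import Dict, Tuple


class AnalysisLibrary:
    """Shared store in which identical analyses get one entry and one id."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[Tuple[str, str], ...], str] = {}
        self._entries: Dict[str, Dict[str, str]] = {}

    def add(self, analysis: Dict[str, str]) -> str:
        """Return the id of the analysis, creating an entry only if it is new."""
        key = tuple(sorted(analysis.items()))
        if key not in self._ids:
            new_id = f"a{len(self._ids) + 1}"
            self._ids[key] = new_id
            self._entries[new_id] = dict(analysis)
        return self._ids[key]

    def get(self, analysis_id: str) -> Dict[str, str]:
        return self._entries[analysis_id]

    def __len__(self) -> int:
        return len(self._entries)


library = AnalysisLibrary()

# Two different texts containing a word with the same analysis:
# each text keeps only a link (token id, analysis id) into the shared library.
links_text_1 = [("t1.w1", library.add({"lemma": "etxe", "pos": "noun"}))]
links_text_2 = [("t2.w7", library.add({"pos": "noun", "lemma": "etxe"}))]

print(links_text_1, links_text_2, len(library))
```

With text-specific documents the same analysis would be written out once per text; here both texts share a single entry.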

  14. Outline • Introduction • The annotation framework and the library • The I/O stream between linguistic analysis tools: the "production chain" • EULIA • functionality, architecture, and graphical interface • coordination module and abstraction layer • Future work • Conclusions

  15. EULIA (I) • EULIA is an environment to coordinate NLP tools in an integrated way, and to exploit the data generated by these tools • An integration strategy is complex: for any task it is necessary to coordinate different tools and data resources • EULIA hides this complexity from the user, helping them use the tools over a "web" of documents

  16. EULIA (II) • Main goal: to build • a system that integrates, coordinates and accesses NLP tools, • working over a general linguistic annotation framework and • offering a user-oriented linguistic data manager, with an intuitive and easy-to-use GUI

  17. EULIA: functionalities (I) • Consultation and browsing of the linguistic information attached to texts • Search, queries and analysis of results • Manual disambiguation of analysis results (link documents) • Manual annotation facilities and suitable encoding: forms generated automatically based on RelaxNG schemas (under construction)
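Since disambiguation operates on link documents, it can be pictured as pruning links while leaving the analyses themselves untouched. The following is a minimal sketch under that reading. It is not EULIA's implementation: links are simplified to `(token_id, analysis_id)` pairs and the function name is hypothetical.

```python
from typing import List, Tuple

LinkPair = Tuple[str, str]  # (token id, analysis id); a simplified link


def disambiguate(links: List[LinkPair], token_id: str,
                 chosen_analysis_id: str) -> List[LinkPair]:
    """Keep only the chosen analysis for one token; other tokens are unchanged."""
    if (token_id, chosen_analysis_id) not in links:
        raise ValueError("chosen analysis is not linked to this token")
    return [(tok, ana) for (tok, ana) in links
            if tok != token_id or ana == chosen_analysis_id]


links = [("w1", "a1"), ("w1", "a2"), ("w2", "a3")]
print(disambiguate(links, "w1", "a2"))  # [('w1', 'a2'), ('w2', 'a3')]
```

The function returns a new list, so the ambiguous link set is still available if the annotator wants to revise the choice.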

  18. EULIA: functionalities (II) • Simple text editing facilities • Possibility to submit a text to the coordination module to be analyzed • User control and personalization

  19. EULIA: Architecture • EULIA’s implementation is based on a client-server architecture • Client: a Java Applet accessible from any web browser • Server: a combination of different modules distributed across several server machines • All modules are designed using an object-oriented methodology • Robust design • Easy to extend

  20. EULIA: Graphical User Interface

  21. EULIA: Interface • It is the intermediary between users and NLP tools • User control and requests • Data browsing (XML data) according to a suitable stylesheet (XSL) • Interpretation of the data by means of RelaxNG schemas

  22. EULIA: Coordination • It acts as the GUI’s server and answers the GUI’s requests. To fulfil a request, this module distributes the tasks among the integrated tools • The goal is to create a workbench which will facilitate the integration of NLP tools and the cooperation among them

  23. EULIA: Abstraction Layer • The goal is to keep the coordination module separate from the integrated tools, the analyses and their location • This layer declares the conditions each tool requires to be used, the location of input data and how to retrieve them, and how to store and retrieve the results according to each analysis type
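One way to picture such a layer is a declarative registry that a coordinator consults to work out which tools must run for a requested analysis type. The sketch below is an assumption-laden illustration, not EULIA's design: the `ToolSpec` type, the analysis-type names, the file-name templates and the `plan` function are all hypothetical, and only the tool abbreviations (TK, SG, MS, HT, EL) and their dependencies are taken from the production-chain slides.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class ToolSpec(NamedTuple):
    tool: str                  # tool that produces this analysis type
    requires: Tuple[str, ...]  # analysis types or resources it needs first
    store: str                 # hypothetical file-name template for results


# Declarations live in data, so the planner needs no tool-specific code.
REGISTRY: Dict[str, ToolSpec] = {
    "tokens": ToolSpec("TK", ("text",), "{text_id}.tok.xml"),
    "segmentations": ToolSpec("SG", ("tokens", "lexicon"), "{text_id}.seg.xml"),
    "morphosyntax": ToolSpec("MS", ("segmentations",), "{text_id}.morf.xml"),
    "mwlus": ToolSpec("HT", ("tokens", "morphosyntax"), "{text_id}.mwlu.xml"),
    "lemmas": ToolSpec("EL", ("tokens", "mwlus", "morphosyntax"),
                       "{text_id}.lem.xml"),
}


def plan(goal: str, available: Set[str],
         registry: Dict[str, ToolSpec] = REGISTRY) -> List[str]:
    """List the tools to run, in order, to obtain `goal` from what exists.

    Assumes the declared dependencies contain no cycles.
    """
    done = set(available)
    order: List[str] = []

    def visit(kind: str) -> None:
        if kind in done:
            return
        if kind not in registry:
            raise KeyError(f"no tool declared for '{kind}'")
        spec = registry[kind]
        for needed in spec.requires:
            visit(needed)
        order.append(spec.tool)
        done.add(kind)

    visit(goal)
    return order


print(plan("lemmas", {"text", "lexicon"}))  # ['TK', 'SG', 'MS', 'HT', 'EL']
print(plan("lemmas", {"tokens", "morphosyntax"}))  # ['HT', 'EL']
```

Adding a tool then means adding one registry entry, while the planning code and the GUI-facing module stay unchanged.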

  24. EULIA: Tools and Annotations • This set is composed of the integrated tools and their outputs, the linguistic annotations • The I/O of these tools is encoded according to the annotation framework described earlier

  25. Conclusion • A methodology to integrate different linguistic tools • Based on a common annotation framework • EULIA, a general environment which facilitates the use of linguistic tools over this annotation framework: • Oriented to common and specialized users (informative, easy-to-use, intuitive) • A powerful system that is nevertheless not complex for the end user, thanks to the abstraction layer set over the actual data and tools

  26. Future work • Build general front- and back-end modules for the analysis tools • It would confirm the extensibility of our approach • Move progressively from text-specific annotation documents to general libraries stored in XML-native databases • This approach will save a lot of disk space and speed up the analysis procedures • Make use of the XLink and XPointer recommendations to standardize pointing expressions • Test the annotation scheme and EULIA intensively in manual annotation and disambiguation tasks

  27. EULIA: a graphical web interface for creating, browsing and editing linguistically annotated corpora. Thank you!