
WebLicht Application and “Workspaces”

WebLicht Application and “Workspaces”. Erhard Hinrichs & Thomas Zastrow University Tübingen. Outline. Web-based Linguistic Chaining Tool ( WebLicht ) for incremental filtering and access of language corpus data WebLicht – Motivation WebLicht - Architecture


Presentation Transcript


  1. WebLicht Application and “Workspaces” • Erhard Hinrichs & Thomas Zastrow, University of Tübingen

  2. Outline • Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data • WebLicht - Motivation • WebLicht - Architecture • WebLicht - Future Requirements • Test Case - Gutenberg Corpus

  3. CLARIN Mission • CLARIN (Common Language Resource and Technology Infrastructure Network) is committed to establishing an integrated and interoperable research infrastructure (RI) supporting easy access to and use of language resources • It aims to overcome the current fragmentation and offer a stable, persistent and extendable infrastructure • It will offer its services to researchers and scholars across a wide spectrum of domains, in particular in the humanities and social sciences • ESFRI roadmap project; the implementation phase starts in 2011

  4. Typical CLARIN user scenario • Scenario: A PhD student investigates regional differences in vocabulary and in word collocations in different variants of German. • Data: large text corpora available at BBAW in Berlin, at the Austrian Academy of Sciences in Vienna, at the Swiss Text Corpus Project in Basel, and at EURAC in Bolzano. • Tools for targeted data access: WebLicht offers customizable chains of web services for filtering and analyzing the data.

  5. WebLicht - Motivation • Many linguistic resources (corpora, dictionaries, …) and tools (tokenizers, taggers, parsers, …) are available • Most of them are implemented to run on local machines, which can be inconvenient and error-prone • Requirement: go beyond “do-it-yourself” and “download-first” strategies • The CLARIN solution: make tools and resources available as web services

  6. WebLicht - Architecture • WebLicht is a service-oriented architecture (SOA) for accessing and processing text corpora • Development started in October 2008 • WebLicht consists of the following components: • Distributed services: offer functionality (resources & tools) over the (inter-)net; implemented as web services (ca. 90 at the moment) • Repository: stores metadata and technical information about the services • Web 2.0 based user interface: interacts with the user and combines services and information from the repository; access is still possible via scripts / programming code

  7. WebLicht - Architecture • [Diagram: distributed services hosted at Stuttgart, Tübingen, Leipzig, Berlin, Finland, Romania, Iceland and the UK; a repository; standard-conformant text corpus encoding; a Web 2.0 application for tool chaining and execution]

  8. WebLicht - Architecture • Services are implemented as REST-style web services • The HTTP POST method is used to send data from the UI to the services • Anything that can speak the HTTP protocol can act as a client: • Browsers • Command-line tools (wget, curl) • Programming languages • Anyone can implement their own interface to WebLicht
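
Since any HTTP-capable client can talk to the services, a script-based client might look roughly like the sketch below. It uses only the Python standard library. The endpoint URL, the plain-text content type, and the function names are assumptions made for illustration; the slides do not specify the actual service addresses or the exchange format.

```python
import urllib.request


def build_post_request(service_url: str, text: str) -> urllib.request.Request:
    """Prepare an HTTP POST request that carries the input text as its body."""
    body = text.encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        method="POST",
        # Assumed content type; a real service may expect an XML-based format.
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def call_service(service_url: str, text: str, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body of the service."""
    request = build_post_request(service_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show how a request is assembled.
    req = build_post_request("https://example.org/tokenizer", "Das ist ein Test.")
    print(req.get_method(), req.full_url)
```

The call is synchronous: `call_service` blocks until the service has answered, which matches the behavior described for the current services.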

  9. WebLicht - Processing Chains

  10. WebLicht - Results

  11. WebLicht - Results

  12. WebLicht - Features • With REST-style web services, anyone can implement a web service for WebLicht (4-page tutorial) • The SOA infrastructure is independent of programming languages and operating systems • The chaining algorithm is independent of the data format used • From a legal point of view, the web services remain located at the institute where they were created
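
One way to picture format-independent chaining is to let the chaining logic look only at service metadata and pass the payload through as opaque bytes. The sketch below is not the WebLicht algorithm; the `Service` record, its `requires`/`provides` fields, and the toy services are all hypothetical and stand in for whatever the repository actually stores.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Service:
    """Hypothetical repository entry: what a service needs and what it adds."""
    name: str
    requires: FrozenSet[str]
    provides: FrozenSet[str]
    run: Callable[[bytes], bytes]


def validate_chain(chain: List[Service], initial: Iterable[str]) -> Set[str]:
    """Check, from metadata alone, that every service finds what it requires."""
    available = set(initial)
    for service in chain:
        missing = service.requires - available
        if missing:
            raise ValueError(f"{service.name} is missing: {sorted(missing)}")
        available |= service.provides
    return available


def run_chain(chain: List[Service], payload: bytes) -> bytes:
    """Feed the output of each service into the next; the payload stays opaque."""
    for service in chain:
        payload = service.run(payload)
    return payload


# Toy stand-ins for remote services (assumptions, not real tools).
tokenizer = Service(
    "tokenizer", frozenset({"text"}), frozenset({"tokens"}),
    lambda data: b"\n".join(data.split()),
)
tagger = Service(
    "tagger", frozenset({"tokens"}), frozenset({"tags"}),
    lambda data: b"\n".join(tok + b"/TAG" for tok in data.split(b"\n")),
)

if __name__ == "__main__":
    chain = [tokenizer, tagger]
    validate_chain(chain, {"text"})
    print(run_chain(chain, b"ein Test"))
```

Because `run_chain` never inspects the payload, the same loop works for any exchange format the services agree on.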

  13. WebLicht - Future Requirements • Web services are synchronous, but some linguistic annotation processes are very time-consuming; asynchronous behavior of these services would be desirable • Processing power is limited by local computing resources; scalability is only possible with strong centers • The current architecture is not sufficiently parallelized and therefore does not scale up. Needed: accommodating a large number of simultaneous users and parallelizing processes
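
The asynchronous behavior asked for here can be illustrated with a submit-and-poll pattern: the client receives a job identifier immediately and collects the result later, while several jobs run in parallel. This is a local sketch built on Python's `concurrent.futures`; the `JobQueue` class and its method names are hypothetical and do not describe an existing WebLicht interface.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class JobQueue:
    """Hypothetical asynchronous front end for long-running annotation tasks."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, task: Callable[[str], str], text: str) -> str:
        """Start the task in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(task, text)
        return job_id

    def status(self, job_id: str) -> str:
        """Report 'pending', 'finished' or 'failed' without blocking."""
        future = self._jobs[job_id]
        if not future.done():
            return "pending"
        return "failed" if future.exception() is not None else "finished"

    def result(self, job_id: str, timeout: Optional[float] = None) -> str:
        """Block until the job is done (or the timeout expires) and return its output."""
        return self._jobs[job_id].result(timeout=timeout)

    def shutdown(self) -> None:
        """Wait for running jobs and release the worker threads."""
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    queue = JobQueue(max_workers=2)
    job = queue.submit(str.upper, "ein langer text")  # str.upper stands in for a slow tool
    print(queue.result(job, timeout=5))
    queue.shutdown()
```

A web-facing variant would expose `submit`, `status` and `result` as separate HTTP requests, so that a client does not have to keep one connection open for the whole runtime of an annotation job.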

  14. WebLicht - Future Requirements • Currently, users have to store the input data and their results on their local machines. Needed: online storage in the form of personal workspaces with reliable backup solutions • Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …). Needed: encapsulation of individual services with common APIs for interoperability • Currently, WebLicht services are limited to processing text corpora. Needed: extending web services to spoken language and multimodal datasets (MPI is already working on this)

  15. Test Case: Gutenberg Corpus • On the basis of this structure, a part of the freely available Gutenberg Project was annotated in Tübingen • Ca. 20,000 texts from 800 authors • Runtime: ca. 3.5 weeks • Result: 217 million tokens (words), 533 million constituents, 110 GB of data
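
The figures on this slide allow a rough back-of-the-envelope estimate of throughput and data volume. The calculation below assumes that the 3.5 weeks were uninterrupted processing time and that "GB" means decimal gigabytes; neither is stated on the slide, and all inputs are approximate ("ca.") values.

```python
# Figures taken from the slide (all approximate).
TEXTS = 20_000
TOKENS = 217_000_000
CONSTITUENTS = 533_000_000
DATA_BYTES = 110 * 10**9               # assumption: decimal gigabytes
RUNTIME_SECONDS = 3.5 * 7 * 24 * 3600  # assumption: continuous runtime

tokens_per_second = TOKENS / RUNTIME_SECONDS
tokens_per_text = TOKENS / TEXTS
constituents_per_token = CONSTITUENTS / TOKENS
megabytes_per_text = DATA_BYTES / TEXTS / 10**6
bytes_per_token = DATA_BYTES / TOKENS

if __name__ == "__main__":
    print(f"tokens per second:      {tokens_per_second:.1f}")
    print(f"tokens per text:        {tokens_per_text:.0f}")
    print(f"constituents per token: {constituents_per_token:.2f}")
    print(f"MB of output per text:  {megabytes_per_text:.1f}")
    print(f"bytes per token:        {bytes_per_token:.0f}")
```

Under these assumptions the run averaged on the order of 100 tokens per second, and the annotated output takes up about 500 bytes per token.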

  16. Gutenberg Corpus - Analyzing • Full-text index (Lucene) • Database for the linear part of the data • Tree-like structures can be analyzed with XML-based techniques (XPath, XQuery) • DOM-based techniques are slow and resource-hungry
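
The contrast between DOM-based and other XML techniques can be sketched with Python's standard library. The first function loads the whole tree into memory and queries it with an XPath-style expression; the second streams through the document and discards each sentence once it has been processed. The element names (`corpus`, `sentence`, `NP`, `w`) are invented for this example and are not the format used for the Gutenberg annotation; note also that `xml.etree.ElementTree` supports only a subset of XPath.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, List

# Hypothetical miniature treebank, for illustration only.
SAMPLE = """<corpus>
  <sentence id="s1">
    <S><NP><w>Der</w><w>Hund</w></NP><VP><w>bellt</w></VP></S>
  </sentence>
  <sentence id="s2">
    <S><NP><w>Sie</w></NP><VP><w>liest</w><NP><w>Bücher</w></NP></VP></S>
  </sentence>
</corpus>"""


def noun_phrases_dom(xml_text: str) -> List[str]:
    """DOM style: parse everything, then select all NP nodes via an XPath subset."""
    root = ET.fromstring(xml_text)
    return [
        " ".join(word.text or "" for word in phrase.iter("w"))
        for phrase in root.findall(".//NP")
    ]


def count_nodes_streaming(stream: BinaryIO, tag: str) -> int:
    """Streaming style: count elements while freeing finished sentences."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
        if elem.tag == "sentence":
            elem.clear()  # drop the subtree so memory use stays bounded
    return count


if __name__ == "__main__":
    print(noun_phrases_dom(SAMPLE))
    print(count_nodes_streaming(io.BytesIO(SAMPLE.encode("utf-8")), "NP"))
```

For a corpus in the range of 110 GB, only the streaming variant is practical on a single machine, since the DOM variant needs the entire document in memory.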

  17. Links etc. • CLARIN homepage: http://www.clarin.eu • D-Spin homepage: http://www.d-spin.org • WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/ • Contact: Erhard Hinrichs, Thomas Zastrow, Seminar für Sprachwissenschaft, Universität Tübingen, Wilhelmstr. 19, D-72074 Tübingen; thomas.zastrow@uni-tuebingen.de, Erhard.hinrichs@uni-tuebingen.de

  18. WebLicht - Combinations
