
Nederlab

Nederlab. Laboratory for research on the patterns of change in the Dutch language and culture. E-Humanities Group Research Meeting, May 16th, 2013, Meertens Institute, Amsterdam.


Presentation Transcript


  1. Nederlab: Laboratory for research on the patterns of change in the Dutch language and culture. E-Humanities Group Research Meeting, May 16th, 2013, Meertens Institute, Amsterdam

  2. A bit of history • The CLARIN EU project (2008-2011) was intended to answer the digital challenge set out by the EU: how to bring together large amounts of data from all over Europe, along with the tools needed to process them? • It was followed by a number of national CLARIN projects (CLARIN-NL, D-SPIN…) tackling these challenges at a national level.

  3. A bit of history (cont) • The CatchPlus project turns scientific research results into usable tools and services for the entire Dutch heritage sector. • This software leads to better disclosure and greater accessibility of collections from heritage institutions.

  4. A bit of history (cont) • It brought us: • PID services • ‘Concept’ registries • Flexible metadata formats (CMDI) • Standard publication protocols (OAI-PMH) • Web authentication methods (SAML 2) • And a lot of tools and data sets at the national level • (Anyone remember the CLARIN-NL call 1-4 projects?)

  5. CLARIN center network • Scenario where dedicated service centres of a new type interact in a stable way and give persistent and easy-to-use services to the community. Researchers must be able to rely on the services offered. • Scenario characterized mainly by accidental and temporary interactions.

  6. Source: Riding the Wave: How Europe can gain from the rising tide of scientific data. Report of the High Level Expert Group on Scientific Data.

  7. Arguments for Nederlab • Bridge the gap between community support services and the user community/data providers • 7 points of criticism of digitization, NRC Handelsblad (science section), September 10 and 11, 2011: “Digitisation of older texts is going wrong”, “A lot of money is wasted”

  8. 7 points: 1. All the money for digitisation has to come from a single fund; the funding body is to impose quality requirements. 2. Funds are only provided if both the digitisation and the metadata meet the (international) standards. This is the only way that sub-collections can eventually be combined. 3. Link money to quality. Text quality varies greatly, from corrected OCR to messy, uncorrected OCR. 4. Scientists, researchers and other users have to be more closely involved in the development of large websites. Better cooperation with users. 5. A central register which shows what has already been digitised, as much work is unnecessarily repeated. Money is only offered to those institutions that first investigate what has already been done. 6. The central register has to be accessible to the public. This way, people can donate books which they would otherwise throw away, and which can then be cut up. This saves a lot of time when digitising. 7. A national plan should be drawn up to professionally digitise the most important sources within 10 years, at the lowest possible cost.

  9. Hypothesis • The hypothesis is that changes in language and culture – both of which express human cognition – are related to each other and that they are based on identical or comparable regularities. By means of Nederlab we want to uncover these regularities. Research into those regularities will show which parts of the Dutch language and culture are subject to change, and which remain constant.

  10. Hypothesis • Nature versus nurture debate

  11. Some research questions • Detecting new concepts, words and combinations of words. • Concept history: what is meant by ‘burgerschap’ (‘citizenship’)? • Systematically mapping linguistic changes, for example deflexion. • Determining patterns and motives: how are the nobility, the clergy, etc., described, and with which motives are these ‘groups’ associated?

  12. Some research questions • Detecting similarities in texts: who is citing whom? • When were terse phrases, idioms and expressions coined, and how were they adopted by authors and by different text genres? • In which text genre was a certain metaphor first used? • Author recognition: who was the author of a certain text?

  13. (Some) Challenges for Nederlab • Usability • Handling large amounts of data from various sources and of varying quality • Handling the editorial process • Dealing with diachronic (processing) issues • Integrating technologies from different technology providers • Integrating technologies that contribute towards answering research questions • Identifying gaps

  14. De Gids • DBNL has mass digitized all volumes of ‘De Gids’. Not only are their contents accessible now, but also the contributions by individual authors. • How did the number of contributions by female authors progress over the years? • How did the average age of the authors vary over the years? • Where do the authors come from? • What is the percentage of poetry/prose over the years? • What are the ‘new’ words occurring over the years? • Which terms are used frequently over the years? • How do these change? • Which words are used in one period, but not in another?
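
The last question on this slide (which words are used in one period but not in another) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `(period, text)` input shape, the function names, the naive tokenizer and the toy sentences are hypothetical and are not taken from Nederlab or from De Gids.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and return runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_by_period(docs):
    """Map each period label to a Counter of word frequencies.

    `docs` is an iterable of (period, text) pairs; this shape is hypothetical.
    """
    vocab = {}
    for period, text in docs:
        vocab.setdefault(period, Counter()).update(tokenize(text))
    return vocab


def only_in(vocab, period, other):
    """Return the sorted words attested in `period` but absent from `other`."""
    return sorted(set(vocab[period]) - set(vocab[other]))


# Toy data for illustration only, not real De Gids content.
docs = [
    ("1840-1849", "de trekschuit vaart langzaam"),
    ("1890-1899", "de trein vaart niet maar rijdt snel"),
]
vocab = vocabulary_by_period(docs)
new_words = only_in(vocab, "1890-1899", "1840-1849")
lost_words = only_in(vocab, "1840-1849", "1890-1899")
```

On real data the comparison would presumably run on lemmatized and spelling-normalized tokens rather than raw word forms, since historical spelling variation would otherwise show up as spurious 'new' words.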

  15. Dutch language innovations • The second research pilot concerns the hypothesis that in the 19th century innovations in the Dutch language started in the Dutch overseas territories: Indonesia, Surinam, and the Dutch Antilles. This hypothesis is supported by the fact that in this period, for the first time, relatively large contingents of bilingual speakers were living in the Dutch colonies, which is an important condition for language innovation. The hypothesis will be tested (by Sjef Barbiers and Nicoline van der Sijs) by comparing texts printed in the Netherlands and overseas.

  16. Processing pipeline (diagram) • Sources: DBNL metadata, DBNL data, KB DIDL, KB ALTO • Steps: Extract articles; Convert to FoLiA (FoLiA XML); POS tagging (Frog) + cleanup (TICCL); N-gram generation (http://software.ticc.uvt.nl/tel-0.1.tar.gz) • Indexing: Index metadata → Nederlab metadata index (SOLR); Index POS tags → BlackLab POS indices (Lucene); Index n-grams → n-gram indices (SOLR)
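
As an illustration of the n-gram generation step named on this slide, here is a minimal Python sketch. It is not the tool linked on the slide, whose interface is not described here. The function names and the toy sentences are hypothetical, and the input is assumed to be already tokenized.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over tokenized sentences, never crossing sentence boundaries."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


# Toy, pre-tokenized sentences for illustration only.
sentences = [
    ["de", "gids", "is", "een", "tijdschrift"],
    ["de", "gids", "verschijnt"],
]
bigrams = count_ngrams(sentences, 2)
```

Counts of this kind are what an n-gram index would store per document or per period. Sentences shorter than `n` tokens simply yield no n-grams.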

  17. Thank you marc.kemps.snijders@meertens.knaw.nl
