International Conference “Corpus linguistics – 2013”, St. Petersburg, June 25–27, 2013

Roland Mittmann, M.A.
Institute of Empirical Linguistics
Goethe University, Frankfurt am Main, Germany
mittmann@em.uni-frankfurt.de
Old German and Old Lithuanian: The Creation of Two Deeply-Annotated Historical Text Corpora

1. Introduction
• Aim: creation of deeply-annotated corpora of historical language stages
• Approach: depending on
  • existing resources from previous analyses
  • qualities of the language itself
• Comparison of approaches:
  • Old German Reference Corpus (OG/OGRC)
  • Old Lithuanian Reference Corpus (OL/OLRC)

2. Description of the corpora
• Old German Reference Corpus (Referenzkorpus Altdeutsch)
  • all preserved texts from the oldest stages of German
  • Old High German and Old Saxon (= Old Low German)
  • ca. 750–1050 CE
  • ca. 650,000 word tokens
  • cooperation of 3 German universities, 2008–2013:
    • Humboldt University (Berlin)
    • Goethe University (Frankfurt am Main)
    • Schiller University (Jena)
  • several subcorpora already searchable online

2. Description of the corpora OGRC: www.deutschdiachrondigital.de

2. Description of the corpora
• Old Lithuanian Reference Corpus (Senosios lietuvių kalbos korpusas)
  • preserved texts from the oldest stage of Lithuanian
  • ca. 1520–1800 CE
  • ca. 10,000,000 word tokens
  • pilot project covering 540,000 word tokens started in 2012
  • international cooperation:
    • Lithuanian Language Institute (LKI, Vilnius)
    • Goethe University (Frankfurt am Main)
    • University of Pisa, Italy
  • draws on experience gained with the OGRC, thanks to the cooperation in Frankfurt

2. Description of the corpora
• Qualities of the texts of both corpora
  • types of texts:
    • religious and secular texts
    • prose and poetry
    • translated/adapted and independently composed texts
  • language:
    • variation due to diachronic, diatopic and diastratic differences
  • foreign-language source texts and foreign-language words in the texts:
    • annotated as similarly as possible to OG/OL word tokens
    • included in the word token counts given above
  • Old Lithuanian: balanced choice of texts for pilot project

3. The unequal starting points
• Divergence from modern languages
  • OL considerably closer to Modern Lithuanian than OG to Modern (High or Low) German, and not only because of the difference in age:
    • invention of the printing press in the 15th century and spread of written texts
    • slower pace of change in European literary languages
  • moderate language development from OL to Modern Lithuanian (however, large differences in spelling; many variants in OL)
  • vs. extensive changes in the vowel system between OG and Early Modern times (e.g. reduction of unstressed vowels to schwa/zero)

3. The unequal starting points  Impacts on availability of resources • Old Lithuanian • no historic dictionary of Lithuanian, no OL grammar (but OL dictionaries) • dictionaries and grammars of Modern Lithuanian may be helpful • Old German • specific dictionaries and grammars • glossaries for every subcorpus:all attested inflected word forms, related to corresponding lemmata • OLRC: basis for compilation of OL grammar and glossary • OGRC: questioning and amending of existing works

3. The unequal starting points
• Digital availability of the texts
  • OG: one printed edition per text, digitized by the TITUS project in Frankfurt
  • OL: 10 texts in pilot project
    • 6 on TITUS
    • 3 adopted from the OL database of the Lithuanian Language Institute (LKI)
    • 1: edition being prepared
  • TITUS texts:
    • structural annotation: e.g., chapters and lines for original document and edition
    • information can be adopted directly, together with the texts

3. The unequal starting points titus.uni-frankfurt.de

3. The unequal starting points
• Referential text version
  • OGRC:
    • digitized edition as main reference layer
    • manual addition of original text forms and graphical peculiarities postponed; so far carried out only by way of example
  • OLRC:
    • digitized edition extended by version of original manuscripts or prints
    • detailed representation of amendments → digitization of original documents required

4.1. The courses of action: OGRC
• Pre-annotation
  • digitization of glossaries for the subcorpora into XML format
  • linking part-of-speech and morphological data of the word forms with the word tokens in the texts:
    • extraction of data from glossary files
    • enrichment with additional part-of-speech and morphological information manually extracted from grammars
    • most glossaries give attestations with locations in text → one-to-one attribution
  • aim of consistent spelling and consistent modern German translation → adaptation of glossary lemmata to standard dictionaries of Old High German and Old Saxon
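
The one-to-one attribution described above can be pictured as a lookup keyed on text location. The following is only a minimal Python sketch under stated assumptions: the location scheme ("chapter.line.position"), the field names and the tag values are hypothetical and are not taken from the project's XML glossaries.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    location: str  # hypothetical scheme: "chapter.line.position"
    form: str
    annotations: dict = field(default_factory=dict)


def link_glossary(tokens, glossary):
    """Attach glossary data to every token whose location the glossary attests.

    `glossary` maps a location string to a dict of lemma, part of speech and
    morphology. Tokens without an entry are returned for manual review.
    """
    unmatched = []
    for token in tokens:
        entry = glossary.get(token.location)
        if entry is None:
            unmatched.append(token)
        else:
            token.annotations.update(entry)
    return unmatched


# Illustrative data only; the tag values are placeholders, not DDDTS tags.
tokens = [Token("1.1.1", "fater"), Token("1.1.2", "unser")]
glossary = {"1.1.1": {"lemma": "fater", "pos": "NOUN", "morph": "Nom.Sg"}}
unmatched = link_glossary(tokens, glossary)
```

Tokens left in `unmatched` would be the ones passed on to manual annotation.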

4.1. The courses of action: OGRC
• Conversion and manual annotation
  • conversion into ELAN format
    • software by the Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
    • database structure
    • with part-of-speech, morphological, lemma and structural pre-annotation
  • manual annotation:
    • amendment of information
    • resolution of ambiguities
    • addition of simple syntactic annotation

4.1. The courses of action: OGRC
• automated creation of standardized version of word tokens
  • from lemmata plus part-of-speech and morphological data
  • morphological knowledge of the language stages encoded in a Perl program
  • standard word forms used to detect annotation mistakes by automated comparison with word forms in text edition
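
As an illustration of this consistency check, here is a small Python sketch (not the project's Perl program). It assumes a toy ending table, placeholder tag names and a similarity threshold of 0.75; all of these are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical toy paradigm: (part of speech, morphology) -> ending added to the lemma.
ENDINGS = {
    ("NOUN", "Nom.Sg"): "",
    ("NOUN", "Dat.Pl"): "um",
}


def standard_form(lemma, pos, morph):
    """Build a standardized word form from the lemma plus its annotation."""
    return lemma + ENDINGS[(pos, morph)]


def flag_suspicious(records, threshold=0.75):
    """Return (edition form, expected form) pairs that look inconsistent."""
    flagged = []
    for record in records:
        expected = standard_form(record["lemma"], record["pos"], record["morph"])
        similarity = SequenceMatcher(None, record["form"].lower(), expected).ratio()
        if similarity < threshold:
            flagged.append((record["form"], expected))
    return flagged


records = [
    {"form": "tagum", "lemma": "tag", "pos": "NOUN", "morph": "Dat.Pl"},
    {"form": "dagum", "lemma": "tag", "pos": "NOUN", "morph": "Dat.Pl"},  # spelling variant
    {"form": "tages", "lemma": "tag", "pos": "NOUN", "morph": "Dat.Pl"},  # annotation looks wrong
]
flagged = flag_suspicious(records)
```

A similarity threshold rather than strict equality is used here because the edition forms vary in spelling; a real check would need language-specific tolerance rules.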

4.2. The courses of action: OLRC
• Pre-annotation
  • no glossaries → annotation tool that learns from manual annotation required
  • use of Toolbox (by SIL International, Dallas, Texas)
    • applying extensible dictionaries
    • one dictionary with data from Lemuoklis
      • morphological analyser, lemmatizer and tagger by the LKI
      • enriched by semi-manually classified data from dictionaries on OL, Slavic loanwords in OL and Bible names
    • other dictionary with data from the Lithuanian language dictionary
      • retrieval of data on all lemmata in the corpus from its digital version
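
The idea of two dictionaries that grow with manual annotation can be sketched as follows. This is a Python illustration only, not Toolbox's own mechanism or file format; the function names, the dictionary layout and the tag values are assumptions.

```python
def lookup(form, primary, secondary):
    """Return (analyses, source); an empty list means manual annotation is needed."""
    key = form.lower()
    for source, dictionary in (("primary", primary), ("secondary", secondary)):
        analyses = dictionary.get(key)
        if analyses:
            return analyses, source
    return [], "manual"


def learn(form, analysis, primary):
    """Store a manual decision so the same form is recognized next time."""
    primary.setdefault(form.lower(), []).append(analysis)


# Illustrative entries; tags are placeholders, not the OL tagset.
primary = {"dievo": [{"lemma": "dievas", "pos": "NOUN", "morph": "Gen.Sg"}]}
secondary = {}

first = lookup("Dievo", primary, secondary)
before = lookup("žodis", primary, secondary)
learn("žodis", {"lemma": "žodis", "pos": "NOUN", "morph": "Nom.Sg"}, primary)
after = lookup("žodis", primary, secondary)
```

A form with several stored analyses would still need manual disambiguation; the sketch only returns the candidates.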

4.2. The courses of action: OLRC Annotation in Toolbox (OLRC)

4.2. The courses of action: OLRC
• lemmatization of word forms of OL texts: automatic if possible, otherwise manual
• creation of standardized word forms by Lemuoklis from lemmata, part-of-speech and morphological annotation
• Modern Lithuanian-English dictionary → lemma translation
• conversion of word tokens into standardized spelling: Consistent Changes Program (SIL)
  • mainly for older texts; specific rules needed for every single author
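
Author-specific spelling standardization amounts to applying an ordered list of rewrite rules per author. The Python sketch below is a stand-in for that idea and does not reproduce the syntax of the SIL program; the author key, the two rules and the input form are constructed for illustration.

```python
import re

# Hypothetical rule sets: each author gets an ordered list of (pattern, replacement).
RULES = {
    "author_a": [
        (r"ſ", "s"),  # long s -> s
        (r"w", "v"),  # w -> v
    ],
}


def standardize(form, author):
    """Apply the author's rewrite rules in order; unknown authors leave the form unchanged."""
    for pattern, replacement in RULES.get(author, []):
        form = re.sub(pattern, replacement, form)
    return form


standardized = standardize("wiſas", "author_a")  # constructed input form
untouched = standardize("wiſas", "unknown_author")
```

Rule order matters as soon as one rule's output can match a later rule's pattern, which is one reason each author needs a separately maintained rule list.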

4.2. The courses of action: OLRC
• Manual annotation and conversion
  • in Toolbox:
    • joining of texts with Lemuoklis data
    • manual disambiguation
  • Toolbox: no chart structure, limited number of annotation layers
  • transfer of data into ELAN
    • automated split-up of word forms into graphemes → annotation (also OGRC)
    • e.g., addition of information on multiword expressions, quotations and glossing of words
  • conversion into image annotation tool ImAnTo (Frankfurt University)
    • annotation of facsimiles of original documents
    • selection of details of images and linking to annotations
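
The split-up of word forms into graphemes can be approximated as follows. This Python sketch assumes that a grapheme is a base character together with its combining diacritics; the projects may define graphemes differently (for example, treating digraphs as one unit).

```python
import unicodedata


def split_graphemes(form):
    """Split a word form into base characters with their combining marks attached."""
    graphemes = []
    for char in unicodedata.normalize("NFD", form):
        if unicodedata.combining(char) and graphemes:
            graphemes[-1] += char  # attach the diacritic to the preceding base character
        else:
            graphemes.append(char)
    # Recompose each unit so precomposed letters come back as single code points.
    return [unicodedata.normalize("NFC", g) for g in graphemes]


units = split_graphemes("žodis")
```

Normalizing first means that precomposed and decomposed input yield the same segmentation.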

4.3. The courses of action: Parallel processing
• Tagsets and annotation schemes
  • part-of-speech and morphological annotation: OGRC: Deutsch Diachron Digital Tagset (DDDTS)
    • adaptation of TIGER Morphology Annotation Scheme for Modern German, based on Stuttgart-Tübingen Tagset (STTS)
    • DDDTS used as basis for creation of tagset for OL
    • distinguishing between lemma-specific and record-specific qualities of word tokens
  • language of word tokens according to ISO 639-3 (goh, osx; olt; lat)
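
One way to picture the split between lemma-specific and record-specific qualities is a token record that stores both separately. The Python sketch below is an assumption about how such a record could look: the class and field names and the tag values are hypothetical, and reading "lemma-specific" as the dictionary word class versus "record-specific" as the use in context is an interpretation. Only the four ISO 639-3 codes come from the slide.

```python
from dataclasses import dataclass

# ISO 639-3 codes named on the slide.
LANGUAGES = {
    "goh": "Old High German",
    "osx": "Old Saxon",
    "olt": "Old Lithuanian",
    "lat": "Latin",
}


@dataclass(frozen=True)
class Lemma:
    form: str
    pos: str  # lemma-specific: the word class of the dictionary entry


@dataclass
class TokenRecord:
    form: str
    lemma: Lemma
    pos: str    # record-specific: how the token is used in this passage
    morph: str  # record-specific: inflection of this occurrence
    lang: str   # ISO 639-3 code

    def __post_init__(self):
        if self.lang not in LANGUAGES:
            raise ValueError(f"unexpected language code: {self.lang}")


# Placeholder tags: an adjective lemma used as a noun in context.
record = TokenRecord("guot", Lemma("guot", "ADJ"), pos="NOUN", morph="Nom.Sg", lang="goh")
```

Keeping both values makes it possible to search for, say, every adjective lemma that is used as a noun in a given passage.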

4.3. The courses of action: Parallel processing
• The ANNIS database
  • transfer of subcorpora of both projects into ANNIS database (Potsdam University, Germany)
  • joining of texts with extensive metadata description
    • developed by Middle High German and OGRC projects, adapted by OLRC
  • complex search patterns possible; more user-friendly search tool in preparation

4.3. The courses of action: Parallel processing Representation in the ANNIS database (OGRC)

5. Conclusion
• Comparison of approaches for OL and OG
  • work on OLRC benefits from course of action applied for OGRC, in spite of various aspects diverging initially
  • OLRC can use digitized data and tools for Modern Lithuanian, an option not available to the OGRC
  • lack of glossaries for OLRC → additional adaptive annotation tool
  • special approaches required for objectives exceeding those of OGRC
    • e.g. precise annotation of facsimiles of original documents
→ however, cooperation advantageous, more time for philological work

Thank you for your attention! Спасибо за внимание!
Old German Reference Corpus: www.deutschdiachrondigital.de
