International Conference “Corpus linguistics – 2013” St. Petersburg, June 25–27, 2013

Roland Mittmann, M.A.

Institute of Empirical Linguistics

Goethe University, Frankfurt am Main, Germany

mittmann@em.uni-frankfurt.de

Old German and Old Lithuanian: The Creation of Two Deeply-Annotated Historical Text Corpora

1. Introduction
  • Aim: creation of deeply-annotated corpora of historical language stages
  • Approach: depending on
    • existing resources from previous analyses
    • qualities of the language itself
  • Comparison of approaches:
    • Old German Reference Corpus (OG/OGRC)
    • Old Lithuanian Reference Corpus (OL/OLRC)
2. Description of the corpora
  • Old German Reference Corpus (Referenzkorpus Altdeutsch)
    • all preserved texts from the oldest stages of German
      • Old High German and Old Saxon (= Old Low German)
      • ca. 750 – 1050 CE
      • ca. 650,000 word tokens
    • cooperation of 3 German universities: 2008 – 2013
      • Humboldt University (Berlin)
      • Goethe University (Frankfurt am Main)
      • Schiller University (Jena)
    • several subcorpora already searchable online

OGRC: www.deutschdiachrondigital.de

  • Old Lithuanian Reference Corpus (Senosios lietuvių kalbos korpusas)
    • preserved texts from the oldest stage of Lithuanian
      • ca. 1520 – 1800 CE
      • ca. 10,000,000 word tokens
    • pilot project covering 540,000 word tokens started in 2012
    • international cooperation
      • Lithuanian Language Institute (LKI, Vilnius)
      • Goethe University (Frankfurt am Main)
      • University of Pisa, Italy
    • experience gained with the OGRC is reused thanks to the cooperation in Frankfurt
  • Qualities of the texts of both corpora
    • types of texts:
      • religious and secular texts
      • prose and poetry
      • translated/adapted and independently composed texts
    • language:
      • variation due to diachronic, diatopic and diastratic differences
    • foreign-language source texts and foreign-language words in the texts:
      • annotated as similarly as possible to OG/OL word tokens
      • included in the word token counts given above
    • Old Lithuanian: balanced selection of texts for the pilot project
3. The unequal starting points
  • Divergence from modern languages
    • OL is considerably closer to Modern Lithuanian than OG is to Modern (High or Low) German, and not only because it is younger:
    • invention of the printing press in the 15th century and spread of written texts
    • slowdown in the pace of change of European literary languages
    • moderate language development from OL to Modern Lithuanian (however, large differences in spelling, with many variants in OL)
    • vs. extensive changes in the vowel system between OG and Early Modern times (e.g. reduction of unstressed vowels to schwa/zero)

→ Impacts on availability of resources

  • Old Lithuanian
    • no historical dictionary of Lithuanian and no OL grammar (but OL dictionaries exist)
    • dictionaries and grammars of Modern Lithuanian may be helpful
  • Old German
    • specific dictionaries and grammars
    • glossaries for every subcorpus: all attested inflected word forms, linked to their lemmata
  • OLRC: basis for compilation of an OL grammar and glossary
  • OGRC: basis for reviewing and amending existing reference works
  • Digital availability of the texts
    • OG: one printed edition per text digitized by TITUS project in Frankfurt
    • OL: 10 texts in pilot project
      • 6 on TITUS
      • 3 adopted from OL database of Lithuanian Language Institute (LKI)
      • 1: edition being prepared
    • TITUS texts:
      • structural annotation: e.g., chapters and lines for original document and edition
      • this information can be adopted directly, together with the texts

titus.uni-frankfurt.de

  • Referential text version
    • OGRC:
      • digitized edition as main reference layer
      • manual addition of original text forms and graphical peculiarities deferred; so far only carried out for sample passages
    • OLRC:
      • digitized edition extended by version of original manuscripts or prints
      • detailed representation of amendments

→ digitization of original documents required

4.1. The courses of action: OGRC
  • Pre-annotation
    • digitization of glossaries for the subcorpora into XML format
    • linking part-of-speech and morphological data of the word forms with the word tokens in the texts:
      • extraction of data from glossary files
      • enrichment with additional part-of-speech and morphological information manually extracted from grammars
    • most glossaries give attestations with their locations in the text → one-to-one attribution
    • aim: consistent spelling and consistent Modern German translation

→ adaptation of glossary lemmata to standard dictionaries of Old High German and Old Saxon
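
The glossary-based pre-annotation above could be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the XML layout (`entry`, `form`, `attestation`, `loc`), the location format and the tag values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary layout; element and attribute names are assumptions.
GLOSSARY_XML = """
<glossary>
  <entry lemma="tag" pos="NN">
    <form text="tages" morph="Gen.Sg">
      <attestation loc="1.3.2"/>
    </form>
    <form text="taga" morph="Nom.Pl">
      <attestation loc="2.1.4"/>
    </form>
  </entry>
</glossary>
"""

def build_index(xml_text):
    """Map each attested location to the analysis the glossary records there."""
    index = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        for form in entry.iter("form"):
            for att in form.iter("attestation"):
                index[att.get("loc")] = {
                    "form": form.get("text"),
                    "lemma": entry.get("lemma"),
                    "pos": entry.get("pos"),
                    "morph": form.get("morph"),
                }
    return index

def pre_annotate(tokens, index):
    """Attach glossary data to (location, text) tokens; unmatched tokens go to manual work."""
    annotated = []
    for loc, text in tokens:
        hit = index.get(loc)
        if hit is not None and hit["form"] == text:
            annotated.append({"loc": loc, "token": text, "lemma": hit["lemma"],
                              "pos": hit["pos"], "morph": hit["morph"],
                              "status": "linked"})
        else:
            annotated.append({"loc": loc, "token": text, "lemma": None,
                              "pos": None, "morph": None, "status": "manual"})
    return annotated
```

Tokens whose location or form does not match a glossary attestation are only marked for manual annotation, which corresponds to the later manual step rather than replacing it.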

  • Conversion and manual annotation
    • conversion into ELAN format
      • software by Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
    • database structure
      • with part-of-speech, morphological, lemma and structural pre-annotation
    • manual annotation:
      • correction of information
      • resolution of ambiguities
      • addition of simple syntactic annotation
  • automated creation of standardized version of word tokens
    • from lemmata plus part-of-speech and morphological data
    • morphological knowledge of the language stages encoded in a Perl program
    • standard word forms used to detect annotation mistakes by automated comparison with word forms in the text edition
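
The idea of generating a standard form from lemma plus annotation and comparing it with the attested form might look like this. The slides do not show the Perl program, so this is only a Python sketch with a toy paradigm; the class label `a-masc`, the record fields and the ending table are illustrative assumptions.

```python
# Toy ending table for one noun class (Old High German masculine a-stems),
# keyed by (case, number). Illustrative only, not the project's rule set.
A_STEM_MASC = {
    ("Nom", "Sg"): "", ("Gen", "Sg"): "es", ("Dat", "Sg"): "e", ("Acc", "Sg"): "",
    ("Nom", "Pl"): "a", ("Gen", "Pl"): "o", ("Dat", "Pl"): "um", ("Acc", "Pl"): "a",
}
PARADIGMS = {("NN", "a-masc"): A_STEM_MASC}

def standard_form(lemma, pos, infl_class, case, number):
    """Build the standardized form, or None if no paradigm is known."""
    endings = PARADIGMS.get((pos, infl_class))
    if endings is None:
        return None
    return lemma + endings[(case, number)]

def find_suspects(records):
    """Return (token, expected) pairs where annotation and text form disagree."""
    suspects = []
    for rec in records:
        expected = standard_form(rec["lemma"], rec["pos"], rec["infl_class"],
                                 rec["case"], rec["number"])
        if expected is not None and expected != rec["token"]:
            suspects.append((rec["token"], expected))
    return suspects
```

A mismatch is only a hint for a human reviewer: with real data, spelling variation in the edition would also produce mismatches, so a practical comparison would need to tolerate known variants.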
4.2. The courses of action: OLRC
  • Pre-annotation
    • no glossaries annotation tool learning from manual annotation required
    • use of Toolbox (by SIL International, Dallas, Texas)
      • applying expansible dictionaries
    • one dictionary with data of Lemuoklis
      • morphological analyser, lemmatizer and tagger by the LKI
      • enriched by semi-manually classified data from dictionaries on OL,Slavic loanwords in OL and Bible names
    • other dictionary with data of Lithuanian language dictionary
      • retrieval of data on all lemmata in the corpus from its digital version
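
How an expandable dictionary can "learn" from manual annotation may be pictured with the following sketch. It does not reproduce Toolbox or its file format; the class `LearningLexicon` and its methods are hypothetical names chosen for the illustration.

```python
from collections import Counter, defaultdict

class LearningLexicon:
    """Stores analyses confirmed by annotators and proposes them for later tokens."""

    def __init__(self):
        # word form -> counts of (lemma, pos) analyses seen so far
        self._analyses = defaultdict(Counter)

    def learn(self, form, lemma, pos):
        """Record one manually confirmed analysis of a word form."""
        self._analyses[form.lower()][(lemma, pos)] += 1

    def suggest(self, form):
        """Return known analyses for a form, most frequently confirmed first."""
        counts = self._analyses.get(form.lower())
        if not counts:
            return []
        return [analysis for analysis, _ in counts.most_common()]
```

Each manually annotated token extends the dictionary, so the share of tokens that need a fully manual analysis should shrink as annotation proceeds; ambiguous forms still return several candidates for manual disambiguation.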

Annotation in Toolbox (OLRC)

  • lemmatization of word forms of OL texts: automatic where possible, otherwise manual
  • creation of standardized word forms by Lemuoklis from lemmata, part-of-speech and morphological annotation
  • Modern Lithuanian-English dictionary → lemma translation
  • conversion of word tokens into standardized spelling: Consistent Changes program (SIL)
    • mainly for older texts; specific rules needed for every single author
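
Author-specific spelling normalization by ordered replacement rules can be sketched as below. This is written in the spirit of such rule tables, not in the syntax of the Consistent Changes program; the author key `author_a` and the individual rules are invented for illustration.

```python
# Ordered (old, new) replacement rules per author; all rules are illustrative.
RULES = {
    "default": [("w", "v")],
    "author_a": [("ſ", "s"), ("cz", "č"), ("sz", "š"), ("w", "v")],
}

def normalize(token, author):
    """Apply the author's rules in order; unknown authors get the default rules."""
    rules = RULES.get(author, RULES["default"])
    for old, new in rules:
        token = token.replace(old, new)
    return token

print(normalize("ſzwentas", "author_a"))
```

Rule order matters here: the long s has to be replaced before the digraph rule for "sz" can apply, which is one reason why rule sets have to be tuned per author.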
  • Manual annotation and conversion
    • in Toolbox:
      • joining of texts with Lemuoklis' data
      • manual disambiguation
    • Toolbox: no chart structure, restricted number of annotation layers
    • transfer of data into ELAN
      • automated split-up of word forms into graphemes → annotation (also OGRC)
      • e.g., addition of information on multiword expressions, quotations and glossing of words
    • conversion for the image annotation tool ImAnTo (Frankfurt University)
      • annotation of facsimiles of original documents
      • selection of details of images and linking to annotations
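
The automated split-up of word forms into graphemes mentioned above could be approximated as follows. The slides do not say how the projects define a grapheme, so this sketch assumes one base character plus any combining marks; it is not full Unicode grapheme-cluster segmentation.

```python
import unicodedata

def split_graphemes(word):
    """Split a word into base characters with their combining marks attached."""
    graphemes = []
    # NFD separates base letters from diacritics so they can be regrouped.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and graphemes:
            graphemes[-1] += ch
        else:
            graphemes.append(ch)
    return graphemes

print(len(split_graphemes("žodis")))
```

Each resulting unit can then carry its own annotation, for example on a separate tier below the word form.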
4.3. The courses of action: Parallel processing
  • Tagsets and annotation schemes
    • part-of-speech and morphological annotation in the OGRC: Deutsch Diachron Digital Tagset (DDDTS)
      • adaptation of TIGER Morphology Annotation Scheme for Modern German, based on Stuttgart-Tübingen Tagset (STTS)
      • DDDTS used as basis for creation of tagset for OL
      • distinguishing between lemma-specific and record-specific qualities of word tokens
    • language of word tokens according to ISO 639-3 (goh, osx; olt; lat)
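
The distinction between lemma-specific and record-specific qualities, together with the ISO 639-3 language codes, might be modelled as in the sketch below. Which features belong to which side is an assumption made for the example (gender on the lemma, case and number on the occurrence); the class and field names are hypothetical and do not reproduce the DDDTS.

```python
from dataclasses import dataclass
from typing import Optional

# ISO 639-3 codes named on the slide.
ISO_639_3 = {
    "goh": "Old High German",
    "osx": "Old Saxon",
    "olt": "Old Lithuanian",
    "lat": "Latin",
}

@dataclass(frozen=True)
class LemmaInfo:
    """Qualities that hold for every occurrence of a lemma (assumed split)."""
    lemma: str
    pos: str
    gender: Optional[str] = None

@dataclass(frozen=True)
class TokenRecord:
    """Qualities of one occurrence in a text, plus a link to its lemma."""
    form: str
    lang: str
    lemma: LemmaInfo
    case: Optional[str] = None
    number: Optional[str] = None

    def __post_init__(self):
        # Reject language codes outside the set used by the corpora.
        if self.lang not in ISO_639_3:
            raise ValueError(f"unknown language code: {self.lang}")
```

Storing lemma-level information once and referencing it from each token record avoids repeating, and possibly contradicting, the same value across occurrences.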
  • The ANNIS database
    • transfer of subcorpora of both projects into ANNIS database (Potsdam University, Germany)
    • joining of texts with extensive metadata description
      • developed by the Middle High German and OGRC projects, adapted by OLRC
    • complex search patterns possible; more convenient search tool in preparation

Representation in the ANNIS database (OGRC)

5. Conclusion
  • Comparison of approaches for OL and OG
    • work on OLRC benefits from the course of action applied for OGRC, even though various aspects diverged initially
    • OLRC can use digitized data and tools for Modern Lithuanian; this approach is inapplicable to OGRC
    • lack of glossaries for OLRC → additional adaptive annotation tool
    • special approaches required for objectives exceeding those of OGRC
      • e.g. precise annotation of facsimiles of original documents

→ however, cooperation advantageous, more time for philological work


Thank you for your attention!

Old German Reference Corpus:

www.deutschdiachrondigital.de