
outline


phoebe


Presentation Transcript


  1. outline • goals, current situation, project description • requirements • implementation

  2. the initiative goals standardisation flexibility

  3. the initiative • goal: diachronic corpus of German, Old High German (800) to Modern German (1900) for linguistic, philological and historical research • current situation: a lot of digitized texts, but • different (mostly implicit) quality standards (source, diplomaticity) • different formats (WordPerfect, WordCruncher, XML, ...) • different header structures (if any) • different positional or structural annotation (if any) • unequal coverage and different corpus composition for the language stages • availability sometimes problematic, no common search tools

  4. the initiative • linguists, philologists, corpus linguists, computer scientists from 15 German universities, international cooperation • 5 language groups + architecture group • pilot project for corpus architecture at Humboldt-Universität, Berlin • planned duration 7 years, size after 7 years • core corpus: 40 M words • extension corpus: 60 M words • current situation: funding declined by the German Science Foundation (DFG), we are looking for other financing options

  5. requirements - standardisation • standardisation • common quality standard(s) • source: original (preferred) or edited text • diplomaticity • common header structure – compatible with TEI/Menota • dialect • text type/genre • paleography/codicology • common structural annotation • graphic • logical • conflicting hierarchies

  6. requirements - standardisation • common positional annotation • levels • tagsets • lemmatisation • within language group - normalisation • across language groups – hyperlemma • multi-linguality
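The two lemmatisation levels on this slide can be pictured as two lookups: a form is first normalised to a lemma within its language stage, and that lemma is then linked to a hyperlemma shared across stages. The following is only a minimal sketch. The stage codes, the sample entries and the hyperlemma label `HAUS` are illustrative assumptions, not part of the project's tagsets.

```python
# Hypothetical two-level lookup: (stage, form) -> stage lemma -> hyperlemma.
# All entries and the label "HAUS" are illustrative assumptions.
LEMMA_BY_STAGE = {
    ("OHG", "hūse"): "hūs",
    ("MHG", "hûse"): "hûs",
    ("NHG", "Hauses"): "Haus",
}

HYPERLEMMA = {
    ("OHG", "hūs"): "HAUS",
    ("MHG", "hûs"): "HAUS",
    ("NHG", "Haus"): "HAUS",
}


def hyperlemma(stage, form):
    """Return the cross-stage hyperlemma for a form, or None if unknown."""
    lemma = LEMMA_BY_STAGE.get((stage, form))
    if lemma is None:
        return None
    return HYPERLEMMA.get((stage, lemma))
```

With such a mapping, a search for one hyperlemma could retrieve forms from every language stage, while a search within one language group would stop at the normalised lemma.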

  7. requirements - flexibility • different texts may have different annotation layers • every text (extension corpus): header information, minimal structural annotation • core corpus: additionally lemmatisation, pos-tags • presentation corpus: aligned facsimiles, sound files • multi-modality • in addition: texts may have more annotation layers (syntax, information structure, narratological information, paleographical information, ...) – the tagsets and guidelines for each layer are standardised • texts and annotation layers can be added at any time
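The tiered requirement above (every text carries a minimum of layers, core texts carry more, and any text may carry extra layers) could be checked roughly as follows. This is a sketch under assumptions: the layer names and the `missing_layers` helper are hypothetical, and the presentation corpus is left out because the slide does not say which layers it presupposes.

```python
# Hypothetical minimum layers per corpus tier, following the slide:
# extension corpus = header + minimal structure; core adds lemma and POS.
REQUIRED_LAYERS = {
    "extension": {"header", "structure"},
    "core": {"header", "structure", "lemma", "pos"},
}


def missing_layers(tier, layers):
    """Return the required layers a text of the given tier still lacks."""
    return REQUIRED_LAYERS[tier] - set(layers)
```

Extra layers such as syntax or information structure never cause a failure here, which mirrors the open-ended set of annotation layers.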

  8. requirements – character-wise addressing • token cannot be the graphemic word because of • difference between graphical word and lexeme • paleographic annotation • word-formation information • (Dipper et al. 2004, Lüdeling, Poschenrieder & Faulstich 2005)
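One way to picture character-wise addressing is stand-off annotation over character offsets, so that a graphical word and the lexemes inside it can be annotated independently. The sketch below assumes that the written unit `Swerlenrecht` contains two lexical units. The `Span` class, the layer names and the lemma values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    value: str


TEXT = "Swerlenrecht"

# One graphical word, two lexemes: both layers point into the same characters.
SPANS = [
    Span("graphical_word", 0, 12, "Swerlenrecht"),
    Span("lexeme", 0, 4, "swer"),
    Span("lexeme", 4, 12, "lenrecht"),
]


def covered_text(span, text=TEXT):
    """Return the characters a span addresses."""
    return text[span.start:span.end]


def spans_at(offset, spans=SPANS):
    """Return every span, on any layer, that covers the given character."""
    return [s for s in spans if s.start <= offset < s.end]
```

Because every layer refers to offsets rather than to a fixed token list, paleographic or word-formation annotation can address units smaller than the word without changing the other layers.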

  9. Swerlenrecht kůnnen wil•d~volge

  10. Implementation Concept

  11. Key requirements (again) • multilinguality • multimodality • open set of annotation layers • varying annotation depth • search

  12. architecture • web-based client-server architecture • corpus is stored in a relational database • additionally tools for annotation, search, presentation etc. (based on XPath, Vitt 2005, Faulstich, Leser & Vitt 2005) • extended ODAG model (Carletta et al. 2003, Dipper et al. 2004, Faulstich, Leser & Lüdeling 2005; Faulstich & Leser 2005) • why relational database? • conflicting hierarchies and partial annotation – tabular model and pure XML tree model not possible • complex search on several text versions and their annotations • complex search on syntactic structures (graphs) and alignments (graphs over spans)
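The argument about conflicting hierarchies can be illustrated with a small relational sketch: if each annotation is a row with a layer name and a character span, a graphic unit (line) and a logical unit (sentence) may overlap without either having to nest inside the other. The table layout, column names and sample rows are assumptions for illustration and not the extended ODAG schema used by the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE span (
        id INTEGER PRIMARY KEY,
        text_id TEXT NOT NULL,
        layer TEXT NOT NULL,
        start_char INTEGER NOT NULL,  -- inclusive
        end_char INTEGER NOT NULL,    -- exclusive
        value TEXT
    )
    """
)

# Hypothetical data: sentence s1 runs across the break between lines 1 and 2.
rows = [
    ("t1", "line", 0, 20, "1"),
    ("t1", "line", 20, 40, "2"),
    ("t1", "sentence", 0, 30, "s1"),
    ("t1", "sentence", 30, 40, "s2"),
]
conn.executemany(
    "INSERT INTO span (text_id, layer, start_char, end_char, value) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def overlapping(conn, layer_a, layer_b):
    """Pair every span of layer_a with the spans of layer_b it overlaps."""
    query = """
        SELECT a.value, b.value
        FROM span AS a
        JOIN span AS b
          ON a.text_id = b.text_id
         AND a.start_char < b.end_char
         AND b.start_char < a.end_char
        WHERE a.layer = ? AND b.layer = ?
        ORDER BY a.start_char, b.start_char
    """
    return conn.execute(query, (layer_a, layer_b)).fetchall()
```

Calling `overlapping(conn, "sentence", "line")` pairs sentence `s1` with both lines, a situation a single XML tree cannot express directly because the sentence and the line would each need to contain part of the other.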

  13. import & export • import of annotated files (XML) into the database • export from the database into XML for presentation • export from the database into XML for external annotation tools • Vitt (2004); Faulstich, Leser & Lüdeling (2005); Vitt (2005); Faulstich, Leser & Vitt (2006)
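The import and export steps can be sketched as a round trip between an annotated XML file and a database table. The element name `tok`, the attributes `lemma` and `pos`, the tag values and the `token` table are hypothetical and chosen only for illustration. They are not the project's interchange format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical annotated input file.
SAMPLE = (
    '<text id="t1">'
    '<tok lemma="swer" pos="PRON">Swer</tok>'
    '<tok lemma="lenrecht" pos="NOUN">lenrecht</tok>'
    "</text>"
)


def import_xml(conn, xml_string):
    """Read tokens and their annotations from XML into the token table."""
    root = ET.fromstring(xml_string)
    text_id = root.get("id")
    for position, tok in enumerate(root.iter("tok")):
        conn.execute(
            "INSERT INTO token VALUES (?, ?, ?, ?, ?)",
            (text_id, position, tok.text, tok.get("lemma"), tok.get("pos")),
        )


def export_xml(conn, text_id):
    """Rebuild an XML document for one text from the token table."""
    root = ET.Element("text", {"id": text_id})
    query = (
        "SELECT form, lemma, pos FROM token "
        "WHERE text_id = ? ORDER BY position"
    )
    for form, lemma, pos in conn.execute(query, (text_id,)):
        tok = ET.SubElement(root, "tok", {"lemma": lemma, "pos": pos})
        tok.text = form
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token "
    "(text_id TEXT, position INTEGER, form TEXT, lemma TEXT, pos TEXT)"
)
import_xml(conn, SAMPLE)
```

An export for an external annotation tool would follow the same pattern as `export_xml`, only with a different target vocabulary.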
