
GerManC




6. Analytical tools

A major objective was to adapt and develop programs for tagging and lemmatizing the corpus. The difficulties which have to be overcome for this are:

(a) Orthographic variation in a pre-standardized language variety
(b) The morphological structure of early modern German, with much lexeme-dependent allomorphy and the prevalence of vowel changes as well as affixes to mark morphosyntactic categories.

We adapted the Stuttgart-Tübingen tagset; this produced good results, with some 80% of word forms tagged and lemmatized accurately. The orthographic variation was found to be relatively systematic, with each variable tending to have a discrete set of variants. These significant regularities could be exploited in order to automate assigning basic leading forms for specific variants for each text, with a stoplist of exceptions. In this way we developed programs to normalize variant spellings, capturing the relationship between the variants and a standardized form and establishing an overall lexicon of variant forms for each lemma. This is a significant improvement on existing corpus tools, which tended to treat each variant separately, necessitating manually matching variant spellings to normalized forms.

7. Application

A number of further programs were developed for use with the corpus, e.g. to generate frequency lists for word forms with lists of the first and last occurrences of all word forms and of all forms with unique occurrence. A concordance program allows one to search for words or patterns (e.g. all words ending in -keit) and to show these in context. Another program allows the search for particular tag sequences. Thus by searching for sequences of determiner + adjective + noun it has been possible to generate lists to show the inflection of adjectives within the noun phrase.
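The tag-sequence search described above can be illustrated with a minimal sketch. It is not the project's program: the function name `find_tag_sequences`, the `(word form, tag)` token representation, and the hand-tagged sample sentence are all assumptions made for illustration. The tag labels `ART`, `ADJA`, `NN` and `VVFIN` are taken from the Stuttgart-Tübingen tagset.

```python
from typing import Iterator, List, Sequence, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag)


def find_tag_sequences(
    tokens: Sequence[Token], pattern: Sequence[str]
) -> Iterator[List[Token]]:
    """Yield every run of consecutive tokens whose tags equal `pattern`."""
    n = len(pattern)
    if n == 0:
        return
    # Slide a window of length n over the token list.
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            yield list(window)


# Invented sample, hand-tagged for illustration only (not corpus data).
sample: List[Token] = [
    ("die", "ART"), ("gute", "ADJA"), ("Leute", "NN"),
    ("sahen", "VVFIN"),
    ("die", "ART"), ("guten", "ADJA"), ("Leute", "NN"),
]

# determiner + adjective + noun
hits = list(find_tag_sequences(sample, ["ART", "ADJA", "NN"]))

# Collect the adjective form from each hit to compare inflectional endings.
adjective_forms = [hit[1][0] for hit in hits]
```

Listing the middle element of each hit gives the kind of adjective-form list mentioned above. A real search would probably also need to accept other determiner tags and to respect sentence boundaries, both of which this sketch ignores.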
In the nominative/accusative plural, adjective inflection was subject to considerable variation at this time, and the corpus shows the gradual elimination of one variant, leaving only the one that was eventually adopted into the standard language.

8. Further developments

In the course of the proposed extended project, with the compilation of the complete corpus of 800,000 words, it is intended that further tools should be developed, in particular to parse the corpus. Given the complexity of German syntax at this period, this presents a considerable challenge. In this context it would also be desirable to identify not only the part of speech of each word-form, but also its morphosyntactic properties. A start has been made with a program which identifies singular and plural nouns and their cases with a reasonable degree of accuracy (ca. 75%).

GerManC: an annotated, spatialised, multi-genre corpus of Early Modern German
Martin Durrell, Astrid Ensslin, Paul Bennett

Aims

The GerManC project involves the compilation of a representative corpus of German texts for the period 1650-1800. It is designed to parallel historical corpora of English (i.e. ARCHER, Helsinki) for this period in order to facilitate comparative synchronic study of the two languages.

Design

The corpus will consist of 2,000-word extracts from eight text types:

orally-oriented: drama, newspapers, sermons, letters
print-oriented: narrative prose, academic texts, medical texts, legal texts

To ensure representativeness there will be an equal number of extracts from:

three sub-periods: 1650-1700; 1701-1750; 1751-1800
five regions: North; West Central; East Central; South-West; South-East

This will result in a corpus of about 800,000 words and will be the first representative corpus of German for this period. It will further the synchronic study of the development of German syntax and lexis in the early modern period, and also provide material for investigating the process of standardization in German.
The regional representativeness is vital for this; these 150 years saw the decline of local linguistic norms and the emergence of a supraregional standard accepted throughout the Holy Roman Empire.

Methods

stage 1 - digitization

For the pilot project 45 extracts from German newspapers of this period were digitized by double-keying, i.e. entered independently by two people, with the results compared and checked against the original to eliminate mistakes. Scanning (apart from being potentially more prone to error) was not feasible as there is no reliable OCR program for black letter ('Gothic') typefaces.

stage 2 - annotation

The corpus was then annotated according to the standards of the Text Encoding Initiative (TEI). Each text was supplied with administrative metadata (header information, etc.) and marked for significant textual features using the TEI tagset. The TEI conventions were applied rigorously, and as this corpus consists of newspapers with a wealth of relevant detail it required a very intensive level of annotation. It was marked for loan words, passages in languages other than German, proper names (of places, people, organizations etc.), numbers, dates, times, abbreviations with expansions, special characters and other diacritics, illustrations and text decorations, and any formatting conventions. Exchanger XML was used as editing software, and CLaRK for automatic conformance checking in line with TEI P5 standards. Each stage of corpus construction and annotation was documented in detail, and any deviations from and modifications of existing TEI standards were noted and accounted for.

Analytical tools

A major objective was to develop programs for tagging and lemmatizing the corpus. The Stuttgart-Tübingen tagset was adapted and this produced good results, with some 80% of word forms tagged and lemmatized accurately.
Significant regularities in the orthographic variation could be exploited to automate assigning basic leading forms for specific variants for each text. Programs were developed to normalize variant spellings, capturing the relationship between the variants and a standardized form and establishing an overall lexicon of variant forms for each lemma.

Application

Further programs were developed, e.g. to allow searches for particular tag sequences. Thus, by searching for sequences of determiner + adjective + noun, lists can be generated to show the inflection of adjectives within the noun phrase. This was subject to considerable variation at this time, and the corpus shows the elimination of one variant to leave only the one which was eventually adopted into the standard language.

Further developments

In the proposed extended project, with the compilation of the complete corpus of 800,000 words, further tools will be developed, in particular to parse the corpus. It would also be desirable to identify the morphosyntactic properties of each word-form. A start has been made with a program identifying singular and plural nouns and their cases with a reasonable degree of accuracy (ca. 75%).

Pilot

The project was piloted by the compilation of a corpus of 100,000 words from one text type (newspapers) with this design, i.e. with an equal number of texts from the three sub-periods and five regions. This was completed with an ESRC grant (RES-000-22-1609) between March 2006 and March 2007. A bid for funding of the complete project, which will include the other text types, is currently awaiting decision.
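The spelling normalization described under Analytical tools (systematic variants mapped to a standardized form, a stoplist of exceptions, and a lexicon of variants) might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the three rewrite rules, the single stoplist entry, the sample word forms, and the names `RULES`, `EXCEPTIONS`, `normalize` and `build_lexicon` are hypothetical and are not taken from the project's programs. The lexicon here is keyed by normalized form rather than by lemma, and forms are lowercased for simplicity.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical rewrite rules (regex, replacement), applied in order.
RULES: List[Tuple[str, str]] = [
    (r"^v(?=n)", "u"),  # initial v before n, e.g. vnd -> und
    (r"ey", "ei"),      # e.g. seyn -> sein
    (r"th", "t"),       # e.g. thun -> tun
]

# Hypothetical stoplist: forms the rules must not rewrite,
# mapped to the normalization they should receive instead.
EXCEPTIONS: Dict[str, str] = {"thron": "thron"}


def normalize(form: str) -> str:
    """Map a variant spelling to a normalized form (lowercased)."""
    key = form.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def build_lexicon(forms: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the attested variants under their normalized form."""
    lexicon: Dict[str, Set[str]] = defaultdict(set)
    for form in forms:
        lexicon[normalize(form)].add(form.lower())
    return dict(lexicon)


# Invented word forms for illustration only (not corpus data).
lexicon = build_lexicon(["vnd", "und", "seyn", "sein", "thun", "Thron"])
```

Keeping the rules and the stoplist as plain data means they could be adjusted for each text without changing the code, in line with the per-text assignment of leading forms mentioned above.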
