
Corpus Annotation II



  1. Corpus Annotation II Martin Volk Universität Zürich Eurospider Information Technology AG

  2. Overview • Clean-Up and Text Structure Recognition • Sentence Boundary Recognition • Proper Name Recognition and Classification • Part-of-Speech Tagging • Tagging Correction and Sentence Boundary Correction • Lemmatisation and Lemma Filtering • NP/PP Chunk Recognition • Recognition of Local and Temporal PPs • Clause Boundary Recognition

  3. Part-of-Speech Tagging • Was done with the TreeTagger (Helmut Schmid, IMS Stuttgart). • The TreeTagger • is a statistical tagger. • uses the STTS tag set (50 PoS tags and 3 tags for punctuation). • assigns exactly one tag to each word. • preserves pre-set tags.
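
A minimal sketch of the last point, assuming the pre-set tags (e.g. from proper name recognition) and the tagger's predictions are available as parallel lists; the function name and data layout are illustrative, not the TreeTagger's actual interface.

def merge_tags(tokens, preset_tags, tagger_tags):
    """Keep a pre-set tag where one exists; otherwise use the tagger's prediction."""
    return [(tok, pre if pre is not None else pred)
            for tok, pre, pred in zip(tokens, preset_tags, tagger_tags)]

# 'IBM' was pre-tagged as NE during proper name recognition and keeps that tag.
print(merge_tags(["IBM", "wächst"], ["NE", None], ["NN", "VVFIN"]))
# [('IBM', 'NE'), ('wächst', 'VVFIN')]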

  4. Tagging Correction Correction of observed tagger problems: • Sentence-initial adjectives • are often tagged as noun (NN) • '...liche[nr]' or '...ische[nr]' → ADJA • Verb group patterns • the verb in front of 'worden' must be a perfect participle • VVXXX + 'worden' → VVPP • if verb + modal verb, then the verb must be an infinitive • VVXXX + VMYYY → VVINF • Unknown prepositions (a, via, innert, ennet)
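
A minimal sketch of these corrections over a sentence given as (token, tag) pairs; the function name and the exact ending test are illustrative assumptions, the tag names follow the STTS tag set.

import re

def correct_tags(sentence):
    """Apply the correction patterns above to a list of (token, STTS tag) pairs."""
    fixed = list(sentence)
    for i, (token, tag) in enumerate(fixed):
        # sentence-initial '...liche[nr]' / '...ische[nr]' mis-tagged as NN -> ADJA
        if i == 0 and tag == "NN" and re.search(r"(lich|isch)e[nr]$", token):
            fixed[i] = (token, "ADJA")
        next_token, next_tag = fixed[i + 1] if i + 1 < len(fixed) else (None, None)
        # a full verb directly in front of 'worden' must be a perfect participle
        if tag.startswith("VV") and next_token == "worden":
            fixed[i] = (token, "VVPP")
        # a full verb directly in front of a modal verb must be an infinitive
        elif tag.startswith("VV") and next_tag is not None and next_tag.startswith("VM"):
            fixed[i] = (token, "VVINF")
    return fixed

print(correct_tags([("Herkömmlicher", "NN"), ("Code", "NN"), ("ist", "VAFIN"),
                    ("portiert", "VVFIN"), ("worden", "VAPP")]))
# [('Herkömmlicher', 'ADJA'), ('Code', 'NN'), ('ist', 'VAFIN'), ('portiert', 'VVPP'), ('worden', 'VAPP')]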

  5. Correction of sentence boundaries • E.g.: suspected ordinal number followed by a capitalized • determiner or • pronoun or • preposition or • adverb → insert sentence boundary. • Open question: Could all sentence boundary detection be done after PoS tagging?
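
A minimal sketch of this correction, assuming the suspected ordinal number has already been found and only the following token is inspected; the tag test is a simplification (in STTS, pronoun tags start with 'P', which also catches particle tags) and the function name is illustrative.

def boundary_after_ordinal(next_token, next_tag):
    """True if a capitalized determiner, pronoun, preposition or adverb follows
    a suspected ordinal number, i.e. a sentence boundary should be inserted."""
    return (next_token[:1].isupper()
            and next_tag.startswith(("ART", "P", "APPR", "ADV")))

# '... am 3. Der neue Chip ...'  ->  boundary inserted before 'Der'
print(boundary_after_ordinal("Der", "ART"))     # True
print(boundary_after_ordinal("Quartal", "NN"))  # False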

  6. Lemmatisation • Was done with Gertwol (by Lingsoft Oy, Helsinki) • for adjectives, nouns, prepositions, and verbs. • Gertwol • is a two-level morphology analyzer for German • is lexicon-based • returns all possible interpretations for each word form • segments compound words dynamically • analyzes hyphenated compounds only if all parts are known (e.g. Software-Aktien but not Informix-Aktien) → feed the last element to Gertwol
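
A minimal sketch of the fall-back for hyphenated compounds, where 'gertwol_lemma' stands in for a Gertwol lookup that returns None for unknown word forms; the function name and lookup interface are illustrative assumptions.

def lemmatize_hyphenated(word, gertwol_lemma):
    """If the whole hyphenated compound is unknown, lemmatize only the last
    element and re-attach the unchanged parts in front of it."""
    lemma = gertwol_lemma(word)
    if lemma is not None:
        return lemma
    *head, last = word.split("-")
    last_lemma = gertwol_lemma(last)
    if head and last_lemma is not None:
        return "-".join(head + [last_lemma])
    return None

# 'Informix' is unknown, but 'Aktien' can be lemmatized to 'Aktie'.
lookup = {"Aktien": "Aktie", "Software-Aktien": "Software-Aktie"}.get
print(lemmatize_hyphenated("Informix-Aktien", lookup))   # Informix-Aktie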

  7. Lemma Filtering (a project by Julian Käser) After lemmatisation: Merging of Gertwol and tagger information • Case 1: The lemma was prespecified during proper name recognition (IBMs → IBM) • Case 2: Gertwol does not find a lemma → insert the word form as lemma (with '?')

  8. Lemma Filtering • Case 3: Gertwol finds exactly one lemma for the given PoS → insert the lemma • Case 4: Gertwol finds multiple lemmas for the given PoS → disambiguate and insert the best lemma • Disambiguation weights the segmentation symbols: • Strong compound segment boundary: 4 points • Weak compound segment boundary: 2 points • Derivational segment boundary: 1 point • the lemma with the lowest score wins! Example: Abteilungen → Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
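
A minimal sketch of this weighting, assuming the segmentation symbols #, | and ~ appear literally in Gertwol's lemma candidates; the function names are illustrative.

WEIGHTS = {"#": 4,   # strong compound segment boundary
           "|": 2,   # weak compound segment boundary
           "~": 1}   # derivational segment boundary

def segmentation_score(candidate):
    """Sum of the weights of all segmentation symbols in a lemma candidate."""
    return sum(WEIGHTS.get(ch, 0) for ch in candidate)

def best_lemma(candidates):
    """The candidate with the lowest score wins."""
    return min(candidates, key=segmentation_score)

print(segmentation_score("Abt~ei#lunge"), segmentation_score("Ab|teil~ung"))  # 5 3
print(best_lemma(["Abt~ei#lunge", "Ab|teil~ung"]))                            # Ab|teil~ung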

  9. Lemma Filtering • Case 5: Gertwol finds a lemma but not for the given PoS → this indicates a tagger error (Gertwol is more reliable than the tagger.) • Case 5.1: Gertwol finds a lemma for exactly one PoS → insert the lemma and exchange the PoS tag • Case 5.2: Gertwol finds lemmas for more than one PoS → find the closest PoS tag, or guess

  10. Lemma Filtering • 0.74% of all PoS tags were exchanged (2% of the Adj, N, V tags). • In other words: roughly 14'000 tags per annual volume of the ComputerZeitung were exchanged. • 85% are cases with exactly one Gertwol tag, 15% are guesses.

  11. Limitations of Gertwol • Compounds are lemmatized only if all parts are known. • Idea: Use the corpus for lemmatizing the remaining compounds: • Example: kaputtreden, Waferfabriken • Solution: • If the first part occurs standing alone AND the second part occurs standing alone with a lemma, • then segment and lemmatize, • and store the first part as its own lemma!
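
A minimal sketch of this corpus-based idea, assuming a dictionary mapping standalone word forms to their lemmas; the split search and the case handling are simplified assumptions.

def lemmatize_unknown_compound(word, corpus_lemmas):
    """Split an unknown compound at a position where both parts occur standing
    alone in the corpus and the second part has a known lemma; the first part
    is then stored as its own lemma. 'corpus_lemmas' (standalone word form ->
    lemma) is an assumed data structure."""
    for i in range(1, len(word)):                      # try every split point
        first, second = word[:i], word[i:]
        second_form = second.capitalize() if second.capitalize() in corpus_lemmas else second
        if first in corpus_lemmas and second_form in corpus_lemmas:
            corpus_lemmas.setdefault(first, first)     # first part becomes its own lemma
            return first + corpus_lemmas[second_form].lower()
    return None

lemmas = {"Wafer": "Wafer", "Fabriken": "Fabrik"}
print(lemmatize_unknown_compound("Waferfabriken", lemmas))   # Waferfabrik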

  12. NP/PP Chunk Recognition (a project by Dominik A. Merz) • Pattern matcher with patterns over PoS tags • Example patterns: • ADV ADJA --> AP • APPR ART ADJA NN --> PP • APPR ART AP NN --> PP • Note: The morphological information provided by Gertwol (e.g. grammatical case, number, gender) was not used!
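
A minimal sketch of such a pattern matcher over PoS tags, using the example patterns above; the data structures and the longest-match strategy are illustrative assumptions, not the original tool.

CHUNK_PATTERNS = [
    (["APPR", "ART", "ADJA", "NN"], "PP"),
    (["APPR", "ART", "AP", "NN"], "PP"),
    (["ADV", "ADJA"], "AP"),
]

def find_chunks(tags):
    """Return (label, start, end) for every pattern match over a tag sequence,
    trying longer patterns first so they are not pre-empted by shorter ones."""
    chunks, i = [], 0
    patterns = sorted(CHUNK_PATTERNS, key=lambda p: -len(p[0]))
    while i < len(tags):
        for pattern, label in patterns:
            if tags[i:i + len(pattern)] == pattern:
                chunks.append((label, i, i + len(pattern)))
                i += len(pattern) - 1
                break
        i += 1
    return chunks

print(find_chunks(["APPR", "ART", "ADJA", "NN", "VVFIN"]))   # [('PP', 0, 4)]

Note that a pattern like 'APPR ART AP NN' presupposes a cascade in which AP chunks built in an earlier pass are reused; the sketch skips that step.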

  13. Representation Format The NEGRA export format • is a line-based format • works with pointers for the tree structure • comprises node labels (constituents) and edge labels (grammatical functions) • has no provision for semantic information. Therefore: we use the comment field.
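
A hedged sketch of reading one token line of such an export file, assuming the commonly documented column layout (word form, PoS tag, morphology, edge label, parent node) and a '%%' comment marker at the end of the line; the sample values and field names are invented for illustration.

def parse_negra_token_line(line):
    """Split one NEGRA-export token line into its columns.
    Assumed layout: word, PoS tag, morphology, edge label, parent node,
    with an optional '%%' comment (used here for semantic labels)."""
    line, _, comment = line.partition("%%")
    word, tag, morph, edge, parent = line.split()[:5]
    return {"word": word, "tag": tag, "morph": morph, "edge": edge,
            "parent": int(parent), "comment": comment.strip() or None}

print(parse_negra_token_line("während APPR -- AC 500 %% temporal"))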

  14. Recognition of temporal PPs (a project by Stefan Höfler) A second step towards semantic annotation. • Starting point: • Prepositions (3) that always introduce a temporal PP: binnen, während, zeit • Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence • Additional evidence: • Temporal adverb in PP: heute, niemals, wann, ... • Temporal noun in PP: Minute, Stunde, Jahr, Anfang, ...
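
A minimal sketch of this rule scheme, assuming simple word lists and a chunked PP given as (token, tag) pairs; the list contents are just the examples from the slide (matching against lemmas would be more robust than matching word forms).

ALWAYS_TEMPORAL = {"binnen", "während", "zeit"}
MAYBE_TEMPORAL = {"ab", "an", "auf", "bis"}                # 30 prepositions in the original
TEMPORAL_EVIDENCE = {"heute", "niemals", "wann",           # temporal adverbs
                     "Minute", "Stunde", "Jahr", "Anfang"} # temporal nouns

def is_temporal_pp(pp):
    """A PP is temporal if its preposition is unambiguously temporal, or if it is
    a candidate preposition supported by a temporal adverb or noun inside the PP."""
    preposition = pp[0][0].lower()
    if preposition in ALWAYS_TEMPORAL:
        return True
    if preposition in MAYBE_TEMPORAL:
        return any(tok in TEMPORAL_EVIDENCE for tok, _ in pp[1:])
    return False

print(is_temporal_pp([("ab", "APPR"), ("nächster", "ADJA"), ("Stunde", "NN")]))  # True
print(is_temporal_pp([("auf", "APPR"), ("dem", "ART"), ("Tisch", "NN")]))        # False

The local-PP recognizer on the following slide works the same way, with local word lists and the <GEO> class from proper name recognition as additional evidence.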

  15. Recognition of temporal PPs • Evaluation corpus: 990 sentences containing 263 manually checked temporal PPs • Result: • Precision: 81% • Recall: 76%

  16. Recognition of local PPs • Starting point: • Prepositions (3) that always introduce a local PP: fern, oberhalb, südlich von • Prepositions (30) that may introduce a local PP: ab, auf, bei, ... + additional evidence • Additional evidence: • Local adverb in PP: dort, hier, oben, rechts, ... • Local noun in PP: Strasse, Quartier, Land, Norden, <GEO>, ...

  17. Recognition of temporal and local PPs

  18. A Word on Recall and Precision • The focus varies with the application! • For my project: Precision is more important than Recall! • Idea: If I annotate something, then I want to be 'sure' that it is correct.
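
For reference, the standard definitions behind the precision and recall figures quoted in this section (they are not stated on the slides):

\[
\text{precision} = \frac{\text{correct annotations}}{\text{all annotations produced}}, \qquad
\text{recall} = \frac{\text{correct annotations}}{\text{all annotations in the gold standard}}
\]

High precision means that whatever gets annotated is usually right, even at the price of missing some cases (lower recall), which is exactly the trade-off preferred here.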

  19. Clause Boundary Recognition (a project by Gaudenz Lügstenmann) • Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts. • A sentence consists of one or more clauses, and a clause consists of one or more phrases. • Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

  20. Clause Boundary Recognition • Exceptions to the definition: Clauses with more than one verb: • Coordinated verbs (e.g. Daten können überführt und verarbeitet werden 'data can be transferred and processed') • Perception verb + infinitive verb (= AcI) (e.g. die den Markt wachsen sehen 'who see the market grow') • 'lassen' + infinitive verb (e.g. lässt die Handbücher übertragen 'has the manuals transferred')

  21. Clause Boundary Recognition • Exceptions to the definition: Clauses without a verb: • Elliptical clauses (e.g. in coordinated structures) • Example: Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz. ('He watches the market, and his employees [watch] the competition.')

  22. Clause Boundary Recognition • The CB recognizer is realized as a pattern matcher over PoS tags (34 patterns). • Examples: • Comma + Relative Pronoun • Finite verb ... + Conjunction + ... Finite Verb • Most difficult: CB without an overt punctuation symbol or trigger word • Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz. ('To put simple budget reductions in IT in the foreground <CB> is the wrong approach.')
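
A minimal sketch of one such pattern (comma + relative pronoun or subordinating conjunction) over (token, STTS tag) pairs; the function name and the trigger set are an illustrative subset of the 34 patterns, not the original recognizer.

def insert_clause_boundaries(sentence):
    """Insert a <CB> marker before clause-opening triggers that follow a comma."""
    triggers = ("PRELS", "PRELAT", "KOUS")   # relative pronouns, subordinating conjunctions
    out = []
    for i, (token, tag) in enumerate(sentence):
        if i > 0 and sentence[i - 1][0] == "," and tag in triggers:
            out.append(("<CB>", "CB"))
        out.append((token, tag))
    return out

sent = [("Er", "PPER"), ("sah", "VVFIN"), (",", "$,"), ("dass", "KOUS"),
        ("es", "PPER"), ("regnet", "VVFIN")]
print([tok for tok, _ in insert_clause_boundaries(sent)])
# ['Er', 'sah', ',', '<CB>', 'dass', 'es', 'regnet']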

  23. Clause Boundary Recognition • Evaluation corpus: 1150 sentences with 754 intra-sentential CBs. • Results (counting all CBs): • Precision: 95.8% • Recall: 84.9% • Results (counting only intra-sentential CBs): • Precision: 90.5% • Recall: 61.1%

  24. Using a PoS Tagger for Clause Boundary Recognition • A CB recognizer can be seen as a disambiguator over commas and CB trigger tokens (if we disregard the CBs without a trigger). • A tagger may serve the same purpose. • Example: • ... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht. ('... wrote the president, Michael Eisner, in the annual report.') • ... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht. ('... wrote the president, who knew Michael Eisner, in the annual report.')
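
A minimal sketch of the data representation this idea implies: each comma becomes a token whose 'tag' is CB or Co, so a standard tagger (e.g. the Brill tagger mentioned on the next slide) can learn to disambiguate it from context. The Co/CB labels follow the slide; the dummy TOK tag and the function name are illustrative (in practice the other tokens keep their PoS tags).

def commas_as_taggable_tokens(tokens, clause_boundary_positions):
    """Build (token, tag) training material in which commas carry CB or Co."""
    tagged = []
    for i, token in enumerate(tokens):
        if token == ",":
            tagged.append((token, "CB" if i in clause_boundary_positions else "Co"))
        else:
            tagged.append((token, "TOK"))
    return tagged

tokens = ["schrieb", "der", "Präsident", ",", "der", "Michael", "Eisner",
          "kannte", ",", "im", "Jahresbericht"]
print(commas_as_taggable_tokens(tokens, {3, 8}))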

  25. Using a PoS Tagger for Clause Boundary Recognition • Evaluation corpus: 1150 sentences with 754 intra-sentential CBs. • Training the Brill tagger on 75% of the corpus and applying it to the remaining 25% • Results: • 93% Precision • 91% Recall • Caution: very small evaluation corpus!

  26. Clause Boundary Recognition vs. Clause Recognition • CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses. • Example: • Nur ein Projekt der Volkswagen AG, <CB> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, <CB> stößt in ähnliche Dimensionen vor. ('Only a project of Volkswagen AG, which wants to network its European authorized garages via satellite, advances into similar dimensions.') • <C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, </C> stößt in ähnliche Dimensionen vor. </C> • Clause Recognition should be done with a recursive parsing approach because of clause nesting.
