
Corpus Annotation II



  1. Corpus Annotation II Martin Volk Universität Zürich Eurospider Information Technology AG

  2. Overview • Clean-Up and Text Structure Recognition • Sentence Boundary Recognition • Proper Name Recognition and Classification • Part-of-Speech Tagging • Tagging Correction and Sentence Boundary Correction • Lemmatisation and Lemma Filtering • NP/PP Chunk Recognition • Recognition of Local and Temporal PPs • Clause Boundary Recognition

  3. Part-of-Speech Tagging • Was done with the TreeTagger (Helmut Schmid, IMS Stuttgart). • The TreeTagger • is a statistical tagger. • uses the STTS tag set (50 PoS tags and 3 tags for punctuation). • assigns exactly one tag to each word. • preserves pre-set tags.
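
A minimal sketch of the last point, assuming the pre-set tags (e.g. from proper name recognition) and the tagger's predictions are available as parallel lists; the function name and data layout are illustrative, not the TreeTagger's actual interface.

def merge_tags(tokens, preset_tags, tagger_tags):
    """Keep a pre-set tag where one exists; otherwise use the tagger's prediction."""
    return [(tok, pre if pre is not None else pred)
            for tok, pre, pred in zip(tokens, preset_tags, tagger_tags)]

# 'IBM' was pre-tagged as NE during proper name recognition and keeps that tag.
print(merge_tags(["IBM", "wächst"], ["NE", None], ["NN", "VVFIN"]))
# [('IBM', 'NE'), ('wächst', 'VVFIN')]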

  4. Tagging Correction Correction of observed tagger problems: • Sentence-initial adjectives • are often tagged as noun (NN) • '...liche[nr]' or '...ische[nr]' → ADJA • Verb group patterns • the verb in front of 'worden' must be a perfect participle • VVXXX + 'worden' → VVPP • if verb + modal verb, then the verb must be an infinitive • VVXXX + VMYYY → VVINF • Unknown prepositions (a, via, innert, ennet)
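
A minimal sketch of these corrections over a sentence given as (token, tag) pairs; the function name and the exact ending test are illustrative assumptions, the tag names follow the STTS tag set.

import re

def correct_tags(sentence):
    """Apply the correction patterns above to a list of (token, STTS tag) pairs."""
    fixed = list(sentence)
    for i, (token, tag) in enumerate(fixed):
        # sentence-initial '...liche[nr]' / '...ische[nr]' mis-tagged as NN -> ADJA
        if i == 0 and tag == "NN" and re.search(r"(lich|isch)e[nr]$", token):
            fixed[i] = (token, "ADJA")
        next_token, next_tag = fixed[i + 1] if i + 1 < len(fixed) else (None, None)
        # a full verb directly in front of 'worden' must be a perfect participle
        if tag.startswith("VV") and next_token == "worden":
            fixed[i] = (token, "VVPP")
        # a full verb directly in front of a modal verb must be an infinitive
        elif tag.startswith("VV") and next_tag is not None and next_tag.startswith("VM"):
            fixed[i] = (token, "VVINF")
    return fixed

print(correct_tags([("Herkömmlicher", "NN"), ("Code", "NN"), ("ist", "VAFIN"),
                    ("portiert", "VVFIN"), ("worden", "VAPP")]))
# [('Herkömmlicher', 'ADJA'), ('Code', 'NN'), ('ist', 'VAFIN'), ('portiert', 'VVPP'), ('worden', 'VAPP')]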

  5. Correction of sentence boundaries • E.g.: suspected ordinal number followed by a capitalized • determiner or • pronoun or • preposition or • adverb → insert sentence boundary. • Open question: Could all sentence boundary detection be done after PoS tagging?
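
A minimal sketch of this correction, assuming the suspected ordinal number has already been found and only the following token is inspected; the tag test is a simplification (in STTS, pronoun tags start with 'P', which also catches particle tags) and the function name is illustrative.

def boundary_after_ordinal(next_token, next_tag):
    """True if a capitalized determiner, pronoun, preposition or adverb follows
    a suspected ordinal number, i.e. a sentence boundary should be inserted."""
    return (next_token[:1].isupper()
            and next_tag.startswith(("ART", "P", "APPR", "ADV")))

# '... am 3. Der neue Chip ...'  ->  boundary inserted before 'Der'
print(boundary_after_ordinal("Der", "ART"))     # True
print(boundary_after_ordinal("Quartal", "NN"))  # False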

  6. Lemmatisation • Was done with Gertwol (by Lingsoft Oy, Helsinki) • for adjectives, nouns, prepositions, and verbs. • Gertwol • is a two-level morphology analyzer for German • is lexicon-based • returns all possible interpretations for each word form • segments compound words dynamically • analyzes hyphenated compounds only if all parts are known (e.g. Software-Aktien but not Informix-Aktien) → feed the last element to Gertwol
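
A minimal sketch of the fall-back for hyphenated compounds, where 'gertwol_lemma' stands in for a Gertwol lookup that returns None for unknown word forms; the function name and lookup interface are illustrative assumptions.

def lemmatize_hyphenated(word, gertwol_lemma):
    """If the whole hyphenated compound is unknown, lemmatize only the last
    element and re-attach the unchanged parts in front of it."""
    lemma = gertwol_lemma(word)
    if lemma is not None:
        return lemma
    *head, last = word.split("-")
    last_lemma = gertwol_lemma(last)
    if head and last_lemma is not None:
        return "-".join(head + [last_lemma])
    return None

# 'Informix' is unknown, but 'Aktien' can be lemmatized to 'Aktie'.
lookup = {"Aktien": "Aktie", "Software-Aktien": "Software-Aktie"}.get
print(lemmatize_hyphenated("Informix-Aktien", lookup))   # Informix-Aktie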

  7. Lemma Filtering (a project by Julian Käser) After lemmatisation: Merging of Gertwol and tagger information • Case 1: The lemma was prespecified during proper name recognition (IBMs → IBM) • Case 2: Gertwol does not find a lemma → insert the word form as lemma (with '?')

  8. Lemma Filtering • Case 3: Gertwol finds exactly one lemma for the given PoS → insert the lemma • Case 4: Gertwol finds multiple lemmas for the given PoS → disambiguate and insert the best lemma • Disambiguation weights the segmentation symbols: • Strong compound segment boundary: 4 points • Weak compound segment boundary: 2 points • Derivational segment boundary: 1 point • the lemma with the lowest score wins! Example: Abteilungen → Abt~ei#lunge (5 points) vs. Ab|teil~ung (3 points)
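
A minimal sketch of this weighting, assuming the segmentation symbols #, | and ~ appear literally in Gertwol's lemma candidates; the function names are illustrative.

WEIGHTS = {"#": 4,   # strong compound segment boundary
           "|": 2,   # weak compound segment boundary
           "~": 1}   # derivational segment boundary

def segmentation_score(candidate):
    """Sum of the weights of all segmentation symbols in a lemma candidate."""
    return sum(WEIGHTS.get(ch, 0) for ch in candidate)

def best_lemma(candidates):
    """The candidate with the lowest score wins."""
    return min(candidates, key=segmentation_score)

print(segmentation_score("Abt~ei#lunge"), segmentation_score("Ab|teil~ung"))  # 5 3
print(best_lemma(["Abt~ei#lunge", "Ab|teil~ung"]))                            # Ab|teil~ung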

  9. Lemma Filtering • Case 5: Gertwol finds a lemma but not for the given PoS → this indicates a tagger error (Gertwol is more reliable than the tagger.) • Case 5.1: Gertwol finds a lemma for exactly one PoS → insert the lemma and exchange the PoS tag • Case 5.2: Gertwol finds lemmas for more than one PoS → find the closest PoS tag, or guess

  10. Lemma Filtering • 0.74% of all PoS tags were exchanged (2% of the Adj, N, V tags). • In other words: roughly 14'000 tags per annual volume of the ComputerZeitung were exchanged. • 85% are cases with exactly one Gertwol tag, 15% are guesses.

  11. Limitations of Gertwol • Compounds are lemmatized only if all parts are known. • Idea: Use the corpus for lemmatizing the remaining compounds: • Example: kaputtreden, Waferfabriken • Solution: • If the first part occurs standing alone AND the second part occurs standing alone with a lemma, • then segment and lemmatize, • and store the first part as its own lemma!
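
A minimal sketch of this corpus-based idea, assuming a dictionary mapping standalone word forms to their lemmas; the split search and the case handling are simplified assumptions.

def lemmatize_unknown_compound(word, corpus_lemmas):
    """Split an unknown compound at a position where both parts occur standing
    alone in the corpus and the second part has a known lemma; the first part
    is then stored as its own lemma. 'corpus_lemmas' (standalone word form ->
    lemma) is an assumed data structure."""
    for i in range(1, len(word)):                      # try every split point
        first, second = word[:i], word[i:]
        second_form = second.capitalize() if second.capitalize() in corpus_lemmas else second
        if first in corpus_lemmas and second_form in corpus_lemmas:
            corpus_lemmas.setdefault(first, first)     # first part becomes its own lemma
            return first + corpus_lemmas[second_form].lower()
    return None

lemmas = {"Wafer": "Wafer", "Fabriken": "Fabrik"}
print(lemmatize_unknown_compound("Waferfabriken", lemmas))   # Waferfabrik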

  12. NP/PP Chunk Recognition (a project by Dominik A. Merz) • Pattern matcher with patterns over PoS tags • Example patterns: • ADV ADJA --> AP • APPR ART ADJA NN --> PP • APPR ART AP NN --> PP • Note: The morphological information provided by Gertwol (e.g. grammatical case, number, gender) was not used!
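
A minimal sketch of such a pattern matcher over PoS tags, using the example patterns above; the data structures and the longest-match strategy are illustrative assumptions, not the original tool.

CHUNK_PATTERNS = [
    (["APPR", "ART", "ADJA", "NN"], "PP"),
    (["APPR", "ART", "AP", "NN"], "PP"),
    (["ADV", "ADJA"], "AP"),
]

def find_chunks(tags):
    """Return (label, start, end) for every pattern match over a tag sequence,
    trying longer patterns first so they are not pre-empted by shorter ones."""
    chunks, i = [], 0
    patterns = sorted(CHUNK_PATTERNS, key=lambda p: -len(p[0]))
    while i < len(tags):
        for pattern, label in patterns:
            if tags[i:i + len(pattern)] == pattern:
                chunks.append((label, i, i + len(pattern)))
                i += len(pattern) - 1
                break
        i += 1
    return chunks

print(find_chunks(["APPR", "ART", "ADJA", "NN", "VVFIN"]))   # [('PP', 0, 4)]

Note that a pattern like 'APPR ART AP NN' presupposes a cascade in which AP chunks built in an earlier pass are reused; the sketch skips that step.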

  13. Representation Format The NEGRA export format • is a line-based format • works with pointers for the tree structure • comprises node labels (constituents) and edge labels (grammatical functions) • has no provision for semantic information. Therefore: we use the comment field.
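
A hedged sketch of reading one token line of such an export file, assuming the commonly documented column layout (word form, PoS tag, morphology, edge label, parent node) and a '%%' comment marker at the end of the line; the sample values and field names are invented for illustration.

def parse_negra_token_line(line):
    """Split one NEGRA-export token line into its columns.
    Assumed layout: word, PoS tag, morphology, edge label, parent node,
    with an optional '%%' comment (used here for semantic labels)."""
    line, _, comment = line.partition("%%")
    word, tag, morph, edge, parent = line.split()[:5]
    return {"word": word, "tag": tag, "morph": morph, "edge": edge,
            "parent": int(parent), "comment": comment.strip() or None}

print(parse_negra_token_line("während APPR -- AC 500 %% temporal"))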

  14. Recognition of temporal PPs (a project by Stefan Höfler) A second step towards semantic annotation. • Starting point: • Prepositions (3) that always introduce a temporal PP: binnen, während, zeit • Prepositions (30) that may introduce a temporal PP: ab, an, auf, bis, ... + additional evidence • Additional evidence: • Temporal adverb in PP: heute, niemals, wann, ... • Temporal noun in PP: Minute, Stunde, Jahr, Anfang, ...
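
A minimal sketch of this rule scheme, assuming simple word lists and a chunked PP given as (token, tag) pairs; the list contents are just the examples from the slide (matching against lemmas would be more robust than matching word forms).

ALWAYS_TEMPORAL = {"binnen", "während", "zeit"}
MAYBE_TEMPORAL = {"ab", "an", "auf", "bis"}                # 30 prepositions in the original
TEMPORAL_EVIDENCE = {"heute", "niemals", "wann",           # temporal adverbs
                     "Minute", "Stunde", "Jahr", "Anfang"} # temporal nouns

def is_temporal_pp(pp):
    """A PP is temporal if its preposition is unambiguously temporal, or if it is
    a candidate preposition supported by a temporal adverb or noun inside the PP."""
    preposition = pp[0][0].lower()
    if preposition in ALWAYS_TEMPORAL:
        return True
    if preposition in MAYBE_TEMPORAL:
        return any(tok in TEMPORAL_EVIDENCE for tok, _ in pp[1:])
    return False

print(is_temporal_pp([("ab", "APPR"), ("nächster", "ADJA"), ("Stunde", "NN")]))  # True
print(is_temporal_pp([("auf", "APPR"), ("dem", "ART"), ("Tisch", "NN")]))        # False

The local-PP recognizer on the following slide works the same way, with local word lists and the <GEO> class from proper name recognition as additional evidence.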

  15. Recognition of temporal PPs • Evaluation corpus: 990 sentences containing 263 manually checked temporal PPs • Result: • Precision: 81% • Recall: 76%

  16. Recognition of local PPs • Starting point: • Prepositions (3) that always introduce a local PP: fern, oberhalb, südlich von • Prepositions (30) that may introduce a local PP: ab, auf, bei, ... + additional evidence • Additional evidence: • Local adverb in PP: dort, hier, oben, rechts, ... • Local noun in PP: Strasse, Quartier, Land, Norden, <GEO>, ...

  17. Recognition of temporal and local PPs

  18. A Word on Recall and Precision • The focus varies with the application! • For my project: Precision is more important than Recall! • Idea: If I annotate something, then I want to be 'sure' that it is correct.
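
For reference, the standard definitions behind the precision and recall figures quoted in this section (they are not stated on the slides):

\[
\text{precision} = \frac{\text{correct annotations}}{\text{all annotations produced}}, \qquad
\text{recall} = \frac{\text{correct annotations}}{\text{all annotations in the gold standard}}
\]

High precision means that whatever gets annotated is usually right, even at the price of missing some cases (lower recall), which is exactly the trade-off preferred here.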

  19. Clause Boundary Recognition (a project by Gaudenz Lügstenmann) • Definition: A clause is a unit consisting of a full verb together with its (non-clausal) complements and adjuncts. • A sentence consists of one or more clauses, and a clause consists of one or more phrases. • Clauses are important for determining the cooccurrence of verbs and PPs (among other things).

  20. Clause Boundary Recognition • Exceptions to the definition: Clauses with more than one verb: • Coordinated verbs (e.g. Daten können überführt und verarbeitet werden 'data can be transferred and processed') • Perception verb + infinitive verb (= AcI) (e.g. die den Markt wachsen sehen 'who see the market grow') • 'lassen' + infinitive verb (e.g. lässt die Handbücher übertragen 'has the manuals transferred')

  21. Clause Boundary Recognition • Exceptions to the definition: Clauses without a verb: • Elliptical clauses (e.g. in coordinated structures) • Example: Er beobachtet den Markt und seine Mitarbeiter die Konkurrenz. ('He watches the market, and his employees [watch] the competition.')

  22. Clause Boundary Recognition • The CB recognizer is realized as a pattern matcher over PoS tags (34 patterns). • Examples: • Comma + Relative Pronoun • Finite verb ... + Conjunction + ... Finite Verb • Most difficult: CB without an overt punctuation symbol or trigger word • Example: Simple Budgetreduzierungen in der IT in den Vordergrund zu stellen <CB> ist der falsche Ansatz. ('To put simple budget reductions in IT in the foreground <CB> is the wrong approach.')
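
A minimal sketch of one such pattern (comma + relative pronoun or subordinating conjunction) over (token, STTS tag) pairs; the function name and the trigger set are an illustrative subset of the 34 patterns, not the original recognizer.

def insert_clause_boundaries(sentence):
    """Insert a <CB> marker before clause-opening triggers that follow a comma."""
    triggers = ("PRELS", "PRELAT", "KOUS")   # relative pronouns, subordinating conjunctions
    out = []
    for i, (token, tag) in enumerate(sentence):
        if i > 0 and sentence[i - 1][0] == "," and tag in triggers:
            out.append(("<CB>", "CB"))
        out.append((token, tag))
    return out

sent = [("Er", "PPER"), ("sah", "VVFIN"), (",", "$,"), ("dass", "KOUS"),
        ("es", "PPER"), ("regnet", "VVFIN")]
print([tok for tok, _ in insert_clause_boundaries(sent)])
# ['Er', 'sah', ',', '<CB>', 'dass', 'es', 'regnet']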

  23. Clause Boundary Recognition • Evaluation corpus: 1150 sentences with 754 intra-sentential CBs. • Results (counting all CBs): • Precision: 95.8% • Recall: 84.9% • Results (counting only intra-sentential CBs): • Precision: 90.5% • Recall: 61.1%

  24. Using a PoS Tagger for Clause Boundary Recognition • A CB recognizer can be seen as a disambiguator over commas and CB trigger tokens (if we disregard the CBs without a trigger). • A tagger may serve the same purpose. • Example: • ... schrieb der Präsident,<Co> Michael Eisner,<Co> im Jahresbericht. ('... wrote the president, Michael Eisner, in the annual report.') • ... schrieb der Präsident,<CB> der Michael Eisner kannte,<CB> im Jahresbericht. ('... wrote the president, who knew Michael Eisner, in the annual report.')
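
A minimal sketch of the data representation this idea implies: each comma becomes a token whose 'tag' is CB or Co, so a standard tagger (e.g. the Brill tagger mentioned on the next slide) can learn to disambiguate it from context. The Co/CB labels follow the slide; the dummy TOK tag and the function name are illustrative (in practice the other tokens keep their PoS tags).

def commas_as_taggable_tokens(tokens, clause_boundary_positions):
    """Build (token, tag) training material in which commas carry CB or Co."""
    tagged = []
    for i, token in enumerate(tokens):
        if token == ",":
            tagged.append((token, "CB" if i in clause_boundary_positions else "Co"))
        else:
            tagged.append((token, "TOK"))
    return tagged

tokens = ["schrieb", "der", "Präsident", ",", "der", "Michael", "Eisner",
          "kannte", ",", "im", "Jahresbericht"]
print(commas_as_taggable_tokens(tokens, {3, 8}))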

  25. Using a PoS Tagger for Clause Boundary Recognition • Evaluation corpus: 1150 sentences with 754 intra-sentential CBs. • Training the Brill tagger on 75% of the corpus and applying it to the remaining 25% • Results: • 93% Precision • 91% Recall • Caution: very small evaluation corpus!

  26. Clause Boundary Recognition vs. Clause Recognition • CB recognition marks only the boundaries. It does not identify discontinuous parts of clauses. • Example: • Nur ein Projekt der Volkswagen AG, <CB> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, <CB> stößt in ähnliche Dimensionen vor. ('Only a project of Volkswagen AG, which wants to network its European authorized garages via satellite, advances into similar dimensions.') • <C> Nur ein Projekt der Volkswagen AG, <C> die ihre europäischen Vertragswerkstätten per Satellit vernetzen will, </C> stößt in ähnliche Dimensionen vor. </C> • Clause Recognition should be done with a recursive parsing approach because of clause nesting.
