Text Analysis Meets Computational Lexicography



  1. Text Analysis Meets Computational Lexicography Hannah Kermes

  2. Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text

  3. Motivation • rising interest in using evidence derived from automatic syntactic analysis • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies

  4. Information needed • syntactic information: subcategorization patterns • semantic information: selectional preferences, collocations, MWL • morphological information: case, number, gender; compounding and derivation

  5. A corpus linguistic approach

  6. Hypothesis • The better and more detailed the off-line annotation, the better and faster the on-line extraction. • However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.

  7. Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards

  8. Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation

  9. Chunking vs. full parsing • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output [diagram label: YAC]

  10. Problems for extraction • Kübler and Hinrichs (2001): "… focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."

  11. Extended chunk definition • A chunk is a continuous part of an intra-clausal constituent, including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.

  12. A classical chunker • robust – works on unrestricted text • works fully automatically • does not provide full but partial analysis of text • no highly ambiguous attachment decisions are made

  13. YAC goes beyond • extends the chunk definition of Abney • provides additional information about annotated chunks

  14. Technical framework - CQP • regular expression matching on token and annotation strings (e.g. .*jahr) • tests for membership in user-specific word lists • feature set operations • constraints to specify dependencies
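
The first two capabilities listed on this slide can be illustrated with a minimal Python sketch. This is not CQP syntax and not the system's implementation; the token records, the STTS-style tags, and the word list `TEMPORAL_NOUNS` are assumptions made up for illustration.

```python
import re

# Hypothetical token records: a word form plus two annotation attributes.
tokens = [
    {"word": "Im", "pos": "APPRART", "lemma": "in"},
    {"word": "Frühjahr", "pos": "NN", "lemma": "Frühjahr"},
    {"word": "kommt", "pos": "VVFIN", "lemma": "kommen"},
    {"word": "er", "pos": "PPER", "lemma": "er"},
]

# Hypothetical user-specific word list.
TEMPORAL_NOUNS = {"Frühjahr", "Halbjahr", "Montag"}


def matches(token, attr, pattern):
    """Full-string regular expression match on one attribute of a token."""
    return re.fullmatch(pattern, token[attr]) is not None


# Regular expression on the token string (the slide's example pattern).
jahr_hits = [t["word"] for t in tokens if matches(t, "word", r".*jahr")]

# Regular expression on an annotation string combined with a word-list test.
noun_hits = [
    t["word"]
    for t in tokens
    if matches(t, "pos", r"N.*") and t["lemma"] in TEMPORAL_NOUNS
]
```

Both tests work on single tokens; the grammar rules described on the following slides combine such tests into patterns over token sequences.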

  15. Applying and processing rules [diagram: Perl scripts (rule application, post-processing, annotation of results); corpus; grammar rules; lexicon]

  16. Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules

  17. Annotated chunk categories • Adverbial phrases (AdvP) • Adjectival phrases (AP) • Noun phrases (NP) • Prepositional phrases (PP) • Verbal complexes (VC) • Clauses (CL)

  18. Additional information • head lemma • morpho-syntactic information • lexical-semantic properties

  19. Feature annotation

  20. Some properties of NPs

  21. Other lexical-semantic properties • VC with separated prefix (pref): Er kommt an (he arrives) • PP with contracted preposition and article (fus): am Bahnhof (at the station) • complex APs embedding PPs (pp): über die Köpfe der Apostel gesetzten (placed over the heads of the apostles) • AP with deverbal adjectives (vder)

  22. Chunking process [diagram: Corpus, Lexicon, First Level, Second Level, Third Level]

  23. Chunking process • First Level: lexical information is introduced; chunks with specific internal structure are built; non-recursive chunks are built • Second Level: main parsing level; complex (recursive) structures are built in several iterations • Third Level: chunk hierarchy is built
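
The three levels can be pictured with a toy cascade in Python. This is only a sketch under stated assumptions: the tag names, the three rules in `RULES`, and the example phrase are invented for illustration and are not the grammar of the system described here.

```python
# Toy cascade; tags, rules and the example phrase are illustrative assumptions.
RULES = [  # (sequence of labels, label of the new chunk)
    (("P", "NP"), "PP"),
    (("PP", "AP"), "AP"),  # complex AP embedding a PP
    (("ART", "AP", "N"), "NP"),
]


def first_level(tagged):
    """Build flat chunks: ART N -> NP, ADJ -> AP; other tokens stay leaves."""
    nodes, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "ART" and i + 1 < len(tagged) and tagged[i + 1][1] == "N":
            nodes.append(("NP", [("ART", word), ("N", tagged[i + 1][0])]))
            i += 2
        elif tag == "ADJ":
            nodes.append(("AP", [("ADJ", word)]))
            i += 1
        else:
            nodes.append((tag, word))
            i += 1
    return nodes


def one_pass(nodes):
    """Apply the rules once, scanning left to right."""
    out, i = [], 0
    while i < len(nodes):
        for pattern, label in RULES:
            window = nodes[i:i + len(pattern)]
            if tuple(n[0] for n in window) == pattern:
                out.append((label, window))
                i += len(pattern)
                break
        else:  # no rule matched at this position
            out.append(nodes[i])
            i += 1
    return out


def second_level(nodes):
    """Repeat passes until nothing changes; return nodes and pass count."""
    passes = 0
    while True:
        new = one_pass(nodes)
        if new == nodes:
            return nodes, passes
        nodes, passes = new, passes + 1


def third_level(node):
    """Render the chunk hierarchy as a bracketed string."""
    label, content = node
    if isinstance(content, str):
        return content
    return "[" + label + " " + " ".join(third_level(c) for c in content) + "]"


# Invented example: "die auf dem Tisch liegenden Bücher"
# (the books lying on the table).
tagged = [("die", "ART"), ("auf", "P"), ("dem", "ART"),
          ("Tisch", "N"), ("liegenden", "ADJ"), ("Bücher", "N")]
nodes, passes = second_level(first_level(tagged))
bracketed = third_level(nodes[0])
```

In this toy run the second level needs three passes: the PP is built first, then the AP that embeds it, then the enclosing NP. Because each pass only adds structure on top of existing chunks, the rule set itself stays small.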

  24. Rule blocks

  25. Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small

  26. Evaluation on ideal PoS-tags

  27. Evaluation on automatic PoS-tags

  28. Sample query • adjective + verb + finite clause: VC Adjuncts* AP CL

  29. Sample query • adjective + verb + finite clause: VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
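
Read as a pattern over a sequence of chunk labels, the query above can be sketched in Python as follows. The chunk labels come from the slide; the function `find_matches` and the example label sequences are hypothetical, and the actual queries are written in CQP rather than Python.

```python
# Chunk categories that may intervene between the verbal complex and the
# predicative AP, as listed in the query on the slide.
ADJUNCTS = {"AdvP", "PP", "NPtemp", "CLrel"}


def find_matches(labels):
    """Return (start, end) index spans matching
    VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin in a list of chunk labels."""
    spans = []
    for start, label in enumerate(labels):
        if label != "VC":
            continue
        i = start + 1
        # Skip any number of adjunct chunks.
        while i < len(labels) and labels[i] in ADJUNCTS:
            i += 1
        # The predicative AP must be directly followed by a finite clause.
        if labels[i:i + 2] == ["APpred", "CLfin"]:
            spans.append((start, i + 2))
    return spans


# Invented label sequences, e.g. for "Es war gestern klar, daß ...".
hit = find_matches(["NP", "VC", "AdvP", "APpred", "CLfin"])
miss = find_matches(["NP", "VC", "NP", "APpred", "CLfin"])
```

The second sequence does not match because a plain NP is not among the permitted adjunct categories. Restricting the material between verb and adjective in this way is what the more specific query adds over the schematic `Adjuncts*` version.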

  30. Target data • predicative(-like) constructions: Es war klar, daß ... (It was clear, that ...) • ... with adverbial pronoun: Er ist davon überzeugt, daß ... (lit.: He is of it convinced, that ...) • ... with reflexive pronoun: Es zeigt sich deutlich, daß ... (lit.: It shows itself clear, that ...)

  31. Target data • ... with infinitival clauses: Es ist möglich, ihn zu besuchen. (lit.: It is possible, him to visit.) • ... with clause in topicalized position: Daß ..., ist klar. (lit.: That ..., is clear.) Ihn zu besuchen, ist möglich. (lit.: Him to visit, is possible.)

  32. adjective + verb + finite clause

  33. adjective + verb + finite clause

  34. adjective + verb + finite clause

  35. adjective + verb + finite clause

  36. adjective + verb + infinitival clause

  37. adjective + verb + infinitival clause

  38. low-frequency adjective + verb + infinitival clause

  39. low-frequency adjective + verb + clause

  40. Conclusion • recursive chunking is a workable compromise between depth of analysis and robustness • extracted data show correlations between: collocational preference; subcategorization frames; semantic classes of adjectives; and, to a certain extent, distributional preferences
