
Text Analysis Meets Computational Lexicography



  1. Text Analysis Meets Computational Lexicography Hannah Kermes

  2. Motivation • maintenance of consistency and completeness within lexica • computer-assisted methods • lexical engineering • scalable lexicographic work process • processes reproducible on large amounts of text

  3. Motivation • growing interest in using evidence derived from automatic syntactic analysis • statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research • full parsers are not robust enough • need for analysis tools that meet the specific needs of corpus linguistic studies

  4. Information needed • syntactic information: subcategorization patterns • semantic information: selectional preferences, collocations, MWL • morphological information: case, number, gender; compounding and derivation

  5. A corpus linguistic approach

  6. Hypothesis The better and more detailed the off-line annotation, the better and faster the on-line extraction. However, the more detailed the off-line annotation, the more complex the grammar, the more time consuming and difficult the grammar development, and the slower the parsing process.

  7. Requirements for the tool • it has to work on unrestricted text • shortcomings in the grammar should not lead to a complete failure to parse • no manual checking should be required • should provide a clearly defined interface • annotation should follow linguistic standards

  8. Requirements for the annotation • head lemma • morpho-syntactic information • lexical-semantic information • structural and textual information • hierarchical representation

  9. Chunking vs. full parsing • Chunking: flat non-recursive structures; simple grammar; robust and efficient; non-ambiguous output • Full parsing: full hierarchical representation; complex grammar; not very robust; ambiguous output • YAC (shown alongside the two)

  10. A classical chunker • robust: works on unrestricted text • works fully automatically • provides only a partial, not a full, analysis of the text • makes no highly ambiguous attachment decisions

  11. YAC goes beyond • extends the chunk definition of Abney • provides additional information about annotated chunks

  12. Applying and processing rules (diagram) • inputs: corpus, grammar rules, lexicon • Perl scripts: rule application, post-processing, annotation of results

  13. Advantages of the system • efficient work even with large corpora • modular query language • interactive grammar development • powerful post-processing of rules

  14. Annotated chunk categories • Adverbial phrases (AdvP) • Adjectival phrases (AP) • Noun phrases (NP) • Prepositional phrases (PP) • Verbal complexes (VC) • Clauses (CL)

  15. Additional information • head lemma • morpho-syntactic information • lexical-semantic properties

  16. Feature annotation

  17. Some properties of NPs

  18. Other lexical-semantic properties • VC with separated prefix: pref, e.g. Er kommt an (he arrives) • PP with contracted preposition and article: fus, e.g. am Bahnhof (at the station) • complex APs embedding PPs: pp, e.g. über die Köpfe der Apostel gesetzten • AP with deverbal adjectives: vder
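
The annotation described on slides 14 to 18 (chunk category, head lemma, morpho-syntactic information, lexical-semantic properties) could be modelled roughly as follows. This is only a sketch: the class name `Chunk`, its field names, and the choice of the noun as head lemma of the PP are assumptions made for illustration, not YAC's actual data format. Only the category labels and the `fus` property come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Chunk categories listed on slide 14.
CATEGORIES = {"AdvP", "AP", "NP", "PP", "VC", "CL"}


@dataclass
class Chunk:
    """Hypothetical container for one annotated chunk."""
    category: str                      # one of CATEGORIES
    tokens: List[str]                  # surface words covered by the chunk
    head_lemma: str                    # lemma of the chunk head
    morph: Dict[str, str] = field(default_factory=dict)  # case, number, gender
    lex_sem: Set[str] = field(default_factory=set)        # e.g. pref, fus, pp, vder

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown chunk category: {self.category}")


# "am Bahnhof": PP with contracted preposition and article (property "fus").
pp = Chunk(
    category="PP",
    tokens=["am", "Bahnhof"],
    head_lemma="Bahnhof",  # assumption: the noun is taken as the head
    morph={"case": "dat", "number": "sg", "gender": "masc"},
    lex_sem={"fus"},
)
```

Keeping the lexical-semantic properties as a set of short labels mirrors how the slides name them (pref, fus, pp, vder), so a query can test for a property with a simple membership check.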

  19. Target data • predicative(-like) constructions Es war klar, daß ... It was clear, that ... • ... with adverbial pronoun Er ist davon überzeugt, daß ... He is of it convinced, that ... • ... with reflexive pronoun Es zeigt sich deutlich, daß ... It shows itself clear, that ...

  20. Target data • ... with infinitival clauses Es ist möglich, ihn zu besuchen. It is possible, him to visit. • ... with clause in topicalized position Daß ..., ist klar. That ..., is clear. Ihn zu besuchen, ist möglich. Him to visit, is possible.

  21. Sample query • adjective + verb + finite clause: VC AP CL

  22. Sample query • adjective + verb + finite clause: VC APpred CLfin

  23. Sample query • adjective + verb + finite clause: VC Adjuncts* APpred CLfin

  24. Sample query • adjective + verb + finite clause: VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
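
The final query on slide 24 is essentially a regular expression over chunk labels. A minimal sketch of that idea in Python, assuming a sentence has already been reduced to a flat list of chunk labels; the function name `matches_query` and the space-joined encoding are illustrative assumptions and do not reproduce the tool's own query language:

```python
import re

# VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin, written over space-separated labels.
QUERY = re.compile(r"\bVC(?: (?:AdvP|PP|NPtemp|CLrel))* APpred CLfin\b")


def matches_query(labels):
    """Return True if the label sequence contains the slide-24 pattern."""
    return QUERY.search(" ".join(labels)) is not None


# Verbal complex, one adverbial adjunct, predicative AP, finite clause.
with_adjunct = ["NP", "VC", "AdvP", "APpred", "CLfin"]
# An ordinary NP between verb and adjective is not an allowed adjunct.
with_np = ["NP", "VC", "NP", "APpred", "CLfin"]

print(matches_query(with_adjunct))
print(matches_query(with_np))
```

The adjunct group is optional and repeatable, which is what lets the query tolerate intervening adverbials, PPs, temporal NPs, and relative clauses without listing every combination.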

  25. adjective + verb + finite clause

  26. adjective + verb + finite clause

  27. Topicalized finite clause • adjective + verb + finite clause: CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred

  28. adjective + verb + finite clause

  29. adjective + verb + finite clause

  30. adjective + verb + infinitival clause

  31. adjective + verb + infinitival clause

  32. low-frequency adjective + verb + infinitival clause

  33. low-frequency adjective + verb + clause

  34. Conclusion • recursive chunking is a workable compromise between depth of analysis and robustness • extracted data show a correlation between: collocational preference; subcategorization frames; semantic classes of adjectives; and, to a certain extent, distributional preferences

  35. Evaluation on automatic PoS-tags

  36. Evaluation on ideal PoS-tags

  37. Chunking process (diagram; labels: Corpus, Lexicon, First Level, Second Level, Third Level)

  38. Chunking process • First Level: lexical information is introduced; chunks with specific internal structure are built; non-recursive chunks are built • Second Level: main parsing level; complex (recursive) structures are built in several iterations • Third Level: built chunk hierarchy
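
The three levels on slide 38 can be illustrated with a toy pipeline. Everything here is an assumption made for the sketch: the PoS tags, the function names, and the two combination rules (PREP + NP gives PP, NP + PP gives NP) were chosen only to show flat chunks being built first and recursive structure being added by repeated rule application. They are not YAC's grammar rules, and the lexicon lookup of the first level is left out.

```python
def first_level(tagged):
    """Build flat, non-recursive NPs (DET? ADJ* NOUN) from (word, tag) pairs."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NOUN":
            out.append(("NP", [word for word, _ in tagged[i:j + 1]]))
            i = j + 1
        else:
            # No NP starts here: keep the token as a one-word node.
            out.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return out


# Toy rules: pair of adjacent labels -> label of the new, larger chunk.
RULES = {("PREP", "NP"): "PP", ("NP", "PP"): "NP"}


def second_level(nodes):
    """Apply the rules repeatedly until no rule matches any more."""
    nodes = list(nodes)
    changed = True
    while changed:
        changed = False
        for i in range(len(nodes) - 1):
            label = RULES.get((nodes[i][0], nodes[i + 1][0]))
            if label is not None:
                nodes[i:i + 2] = [(label, [nodes[i], nodes[i + 1]])]
                changed = True
                break
    return nodes


def third_level(nodes):
    """Render the resulting chunk hierarchy as a bracketed string."""
    def render(node):
        label, children = node
        inner = " ".join(c if isinstance(c, str) else render(c) for c in children)
        return f"[{label} {inner}]"
    return " ".join(render(n) for n in nodes)


def chunk(tagged):
    return third_level(second_level(first_level(tagged)))


sentence = [("der", "DET"), ("Zug", "NOUN"), ("an", "PREP"),
            ("dem", "DET"), ("Bahnhof", "NOUN")]
print(chunk(sentence))
# [NP [NP der Zug] [PP [PREP an] [NP dem Bahnhof]]]
```

The second level needs two passes on this input: the PP can only be built once the flat NPs exist, and the larger NP only once the PP exists, which is the sense in which recursive structures are built in several iterations.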

  39. Rule blocks

  40. Advantages • specific rules do not interact with main parsing rules • additional (e.g. domain specific) rules can be included easily • main parsing rules can be kept simple • number of main parsing rules can be kept small
