
STO: A Lexical Database of Danish for Language Technology Applications

Anna Braasch, Center for Sprogteknologi, Copenhagen. SPINN Seminar, October 27, 2001.


Presentation Transcript


  1. STO: A Lexical Database of Danish for Language Technology Applications
     Anna Braasch, Center for Sprogteknologi, Copenhagen
     SPINN Seminar, October 27, 2001

  2. Background
     • EU-funded international projects
       • EAGLES: recommendations for morphological and syntactic specifications for 9 languages
       • GENELEX: development of a generic lexicon model
       • PAROLE: development of harmonized WL resources (lexicon, corpus) for 12 languages
       • SIMPLE: development of an ontology and model of semantic description for 12 languages
     • Follow-up
       • Danish, nationally funded co-operative lexicon project: STO

  3. Aims of the project
     • Monolingual aim
       • to eliminate the usual ‘bottleneck problem’: the lack of a large Danish lexical database for
         • language technology applications
         • computational language research purposes
     • Multilingual aim
       • to provide an elaborated Danish lexical database for
         • linked bi- or multilingual databases for LT/NLP applications
         • contrastive CL and lexicology research …

  4. STO development objectives
     • Requirements of monolingual applications
       • tailor the linguistic specifications for Danish
       • add more language-specific features
       • extend the linguistic and lexical coverage
       • refine the lexicon structure
       • develop customized, user-friendly interfaces...
     • but also requirements of multilingual linking
       • keep the basic, harmonised lexicon structure
       • keep the principles and language of lexical description
       • be attentive to similar follow-up projects
     → ‘more Danish’ but still consistent with the other lexicons

  5. The three linguistic layers of description
     • Main info types: 3 independent but linked layers
     • Morphology
       • Inflection (pattern-based)
       • Spelling
       • Compounding
     • Syntax (entirely pattern-based)
       • Syntactic frame (complementation structures & functional properties, etc.)
       • Control, raising (constructional properties)
     • Semantics (the layer of multilingual linking)
       • Domain (= sublanguage, source area)
       • Semantic relations (qualia)
       • Specification of meaning (SIMPLE model + core ontology)
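The layering on this slide can be pictured as a small data structure. The sketch below is only an illustration: the class names, field names, identifiers and the chain morphology → syntax → semantics are assumptions made for the example, not the actual STO or PAROLE schema. The Danish verb "spise" (to eat) is used because it has both a transitive and an intransitive use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticUnit:
    sem_id: str          # hypothetical identifier
    domain: str          # sublanguage / source area
    ontology_type: str   # placeholder label, not an actual SIMPLE ontology type


@dataclass
class SyntacticUnit:
    syn_id: str          # hypothetical identifier
    frame: str           # informal name of a complementation pattern
    semantic_units: List[SemanticUnit] = field(default_factory=list)


@dataclass
class MorphologicalUnit:
    lemma: str
    pos: str
    inflection_pattern: str  # hypothetical pattern id
    syntactic_units: List[SyntacticUnit] = field(default_factory=list)


def frames(unit: MorphologicalUnit) -> List[str]:
    """Return the syntactic frames linked to a morphological unit."""
    return [syn.frame for syn in unit.syntactic_units]


# One meaning shared by two syntactic descriptions of the same lemma.
eat = SemanticUnit("sem_spise_1", domain="general", ontology_type="EXAMPLE_TYPE")
spise = MorphologicalUnit(
    lemma="spise",
    pos="verb",
    inflection_pattern="V_EXAMPLE",
    syntactic_units=[
        SyntacticUnit("syn_spise_1", frame="subject + object", semantic_units=[eat]),
        SyntacticUnit("syn_spise_2", frame="subject only", semantic_units=[eat]),
    ],
)

print(frames(spise))
```

Keeping each layer in its own record means a pattern or a meaning can be described once and referenced from several entries, which is one way to read "independent but linked".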

  6. Between syntax and semantics
     • No clear-cut borderline: difficult to represent mutual dependencies in a strictly modular description.
     → Syntactic or semantic units?
     • Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions ...
     • Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics …
     → Different representation strategies: ‘early’ vs. ‘late’

  7. Linking lexicons at the semantic level
     • Basic method:
       • link between L1-meaning and L2-meaning
     • Basic requirement:
       • harmonized semantics (ontology, model & method)
     • Advantages:
       • proper treatment of all lexical units including
         • homonyms
         • polysemes
         • complex lexical units (collocations, idioms)
       • independent treatment of L1 and L2 with respect to morphology and syntax
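A minimal sketch of meaning-to-meaning linking, assuming hypothetical identifiers and a toy table layout rather than the real database structure. It uses the Danish homonym "gift" (noun: poison; adjective: married) to show why the link is placed between meanings rather than between word forms.

```python
from typing import Dict, List, Tuple

# Danish side: (lemma, part of speech) -> semantic unit ids (ids are hypothetical).
DANISH_MEANINGS: Dict[Tuple[str, str], List[str]] = {
    ("gift", "noun"): ["da_gift_poison"],
    ("gift", "adjective"): ["da_gift_married"],
}

# English side: semantic unit id -> lemma (ids are hypothetical).
ENGLISH_LEMMAS: Dict[str, str] = {
    "en_poison": "poison",
    "en_married": "married",
}

# The only cross-language information: links between L1 and L2 meanings.
MEANING_LINKS: Dict[str, List[str]] = {
    "da_gift_poison": ["en_poison"],
    "da_gift_married": ["en_married"],
}


def translate(lemma: str, pos: str) -> List[str]:
    """Follow L1 meaning -> L2 meaning links and return the English lemmas."""
    result = []
    for da_id in DANISH_MEANINGS.get((lemma, pos), []):
        for en_id in MEANING_LINKS.get(da_id, []):
            result.append(ENGLISH_LEMMAS[en_id])
    return result


print(translate("gift", "noun"))       # ['poison']
print(translate("gift", "adjective"))  # ['married']
```

Morphology and syntax never appear in the link table, so each language can describe them in its own way.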

  8. About the STO lexical database (V.1)
     • Point of departure: PAROLE material
       • linguistic specifications elaborated (incl. for Danish)
       • modular lexicon architecture developed
       • information structure developed
       • 20,000 general language lexicon entries encoded
     • Main STO development steps:
       • tailor and refine the LingSpecs for Danish
       • improve the information structure (DB)
       • add new entry types (complex lexical units, etc.)
       • extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)

  9. Progress report for 2001 (1)
     • New status: nationally funded co-operative project, requiring
       • more thorough project planning (incl. ‘logistics’)
       • more detailed information (guidelines, specifications, cross-checks, evaluation…)
     • Continuously ongoing supporting processes
       • Updating and refinement of LingSpecs
       • Elaboration of an Encoding Manual
       • Elaboration of various additional documentation (evaluation sheets, etc.)
       • Revision of the database/info structure

  10. Progress report for 2001 (2)
      • New supporting tools for lexicographers developed
        • Encoding tools for morphological and syntactic info
        • Browsers for retrieval of encoded info...
      • Number of entries encoded with
        • morphological information: ~50,000
        • syntactic information: ~23,000
        • semantic information: ~8,500 (from SIMPLE)
      • Other tasks (ongoing/finished)
        • selected entries (on customer’s request) downloaded
        • work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
        • corpus-related work
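To put the entry counts in proportion, the sketch below relates each layer to the 50,000-entry vocabulary target stated earlier in the talk. The counts on the slide are approximate, so the percentages are indicative only; the function and variable names are made up for the example.

```python
TARGET_ENTRIES = 50_000  # vocabulary target from the development steps

# Approximate counts from the progress report.
encoded = {
    "morphological": 50_000,
    "syntactic": 23_000,
    "semantic": 8_500,
}


def coverage(counts: dict, target: int) -> dict:
    """Return the share of the target covered by each layer, in percent."""
    return {layer: round(100 * n / target, 1) for layer, n in counts.items()}


for layer, pct in coverage(encoded, TARGET_ENTRIES).items():
    print(f"{layer}: {pct}%")
# morphological: 100.0%
# syntactic: 46.0%
# semantic: 17.0%
```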

  11. Progress report for 2001 (3)
      • Treatment of new entry types
        • domain-specific (LSP) entries
        • compounds (decomposition and linking elements implemented)
        • geographical proper nouns (inflectional and agreement properties investigated; the results are implemented)
        • collocations (information structure designed)
        • revision of the treatment of phrasal verbs
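Compound decomposition with linking elements can be illustrated with a deliberately naive splitter. This is not the procedure implemented in STO, which the slides do not describe; it only assumes a set of known lemmas and the two common Danish linking elements "s" and "e", and it handles two-part compounds only.

```python
from typing import List, Set, Tuple

# Empty string = no linking element; "s" and "e" are assumed linkers for this sketch.
LINKERS = ("", "s", "e")


def decompose(word: str, lexicon: Set[str]) -> List[Tuple[str, str, str]]:
    """Return every (first part, linking element, second part) split found in the lexicon."""
    results = []
    for i in range(1, len(word)):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for link in LINKERS:
            if rest.startswith(link) and rest[len(link):] in lexicon:
                results.append((head, link, rest[len(link):]))
    return results


lexicon = {"land", "by", "børn", "have", "hus", "båd"}

print(decompose("landsby", lexicon))    # [('land', 's', 'by')]
print(decompose("børnehave", lexicon))  # [('børn', 'e', 'have')]
print(decompose("husbåd", lexicon))     # [('hus', '', 'båd')]
```

A real lexicon would return several candidate splits for many words, so some ranking or manual validation step would be needed on top of this.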

  12. Summing up the goals
      • STO will
        • conform to ‘general’ linguistic knowledge
        • meet demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
        • satisfy monolingual language-specific requirements
        • be potentially compatible with other lexical databases for future linking
        • be reasonably easy to access, customize/use...
        • fulfil the development contract and meet the production deadlines
