
Computational Lexicography

Frank Van Eynde Centre for Computational Linguistics. Computational Lexicography. OUTLINE. 1. The token/type distinction 2. Lexicographic practice 3. Computational lexica 4. Lexical databases 5. Lexical knowledge acquisition 6. The use of lexica in text-to-speech.


Presentation Transcript


  1. Frank Van Eynde, Centre for Computational Linguistics
     Computational Lexicography

  2. OUTLINE
     1. The token/type distinction
     2. Lexicographic practice
     3. Computational lexica
     4. Lexical databases
     5. Lexical knowledge acquisition
     6. The use of lexica in text-to-speech

  3. 1. Tokens vs. types
     (1) The girl gave the flowers to the athlete.
     - 3 tokens "the": properties are context-specific
     - 1 type <THE>: properties are generalizations over the various uses
     Heracleitos vs. Plato
     (2) The sooner they come, the better it is.
     <THE, article> vs. <A, article>   NL de, het
     <THE, adverb> vs. <FAR, adverb>   NL hoe
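
The token and type counts for sentence (1) can be reproduced with a few lines of Python. This is a minimal sketch: the `tokenize` helper is an assumption of this example, and it identifies a type with a lowercased spelling only, which is exactly the simplification that sentence (2) shows to be insufficient (article "the" vs. adverb "the").

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer (illustrative only): lowercase runs of letters."""
    return re.findall(r"[a-z]+", sentence.lower())


sentence = "The girl gave the flowers to the athlete."
tokens = tokenize(sentence)      # every occurrence counts as a token
type_counts = Counter(tokens)    # one key per distinct spelling

print(len(tokens))               # number of tokens in the sentence
print(len(type_counts))          # number of distinct spellings
print(type_counts["the"])        # number of tokens of the spelling "the"
```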

  4. 1. Tokens vs. types
     (3) I do not think that the dog of that girl is really that dangerous.
     <THAT, compl> vs. <IF, compl>     FR que
     <THAT, det> vs. <THIS, det>       FR ce/cette
     <THAT, adverb> vs. <SO, adverb>   FR si
     (4) Je ne pense pas que le chien de cette fille est vraiment si dangereux.

  5. 1. Tokens vs. types
     The abstraction problem: given a word W, how many types <W,POS> do we have to distinguish?
     (5) It is not far from here.
     (6) We didn't go far.
     (7) He's living in the Far West.
     (8) Paris is far more expensive than Dublin.
     <FAR, adj> vs. <NEAR, adj>     NL ver
     <FAR, adv> vs. <LITTLE, adv>   NL veel

  6. 1. Tokens vs. types
     (9) De bal van de finale wordt verkocht op het bal van de FIFA.
     <BAL, noun [non-neuter]>   IT palla
     <BAL, noun [neuter]>       IT ballo
     (10) La palla del finale sarà venduta al ballo della FIFA.

  7. 1. Tokens vs. types
     (11) That girl has been very lucky.
     (12) That girl has a lot of hair.
     <W,POS,VAL>
     <HAVE, verb [aux], _VP[PSP]>   IT avere/essere
     <HAVE, verb [main], _NP>       IT avere

  8. 1. Tokens vs. types
     (13) The pen is in my pocket.
     (14) The pig is in the pen.
     <W, POS, VAL, SENSE>
     <PEN, noun, writing implement>   NL pen
     <PEN, noun, fenced enclosure>    NL hok
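
The tuples <W, POS, VAL, SENSE> built up over the preceding slides map naturally onto a small record type. The sketch below is one possible encoding, not the notation used in the lecture: the class name `LexType`, the field names and the `types_for` helper are assumptions of this example, and the entries are limited to the HAVE and PEN examples given on the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexType:
    """One lexical type <W, POS, VAL, SENSE>; VAL and SENSE are optional."""
    word: str
    pos: str
    val: Optional[str] = None
    sense: Optional[str] = None


# Entries copied from the slide examples; nothing else is claimed here.
LEXICON = [
    LexType("HAVE", "verb [aux]", val="_VP[PSP]"),
    LexType("HAVE", "verb [main]", val="_NP"),
    LexType("PEN", "noun", sense="writing implement"),
    LexType("PEN", "noun", sense="fenced enclosure"),
]


def types_for(word: str) -> list[LexType]:
    """Return every type recorded for a word form (case-insensitive)."""
    return [entry for entry in LEXICON if entry.word == word.upper()]


for entry in types_for("pen"):
    print(entry)
```

With this encoding, the abstraction problem becomes the question of how many `LexType` records a given word form should receive.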

  9. 2. Lexicographic practice
     The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.
     <ORTHⁿ, PHON, POS, m, (VAL,) SENSE>
     Homonymy vs. polysemy
     Problem: for any given ORTH, how many n and how many m does one have to distinguish?

  10. 2. Lexicographic practice
      The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.
      <ORTH, PHON, m, SENSE>
      There is no one-to-one correspondence between the senses in the two dictionaries.

  11. 3. Computational Lexica
      Dictionaries are made for people who already understand (much of) the language.
      Computational lexica are made for machines that do not understand (anything of) the language.
      Consequence: an NLP system can only make sense of information that is presented in the notation (or format) it employs for processing the language.

  12. 3. Computational Lexica
      <two hundred fifty-six, 256>
      <two hundred fifty-six, CCLVI>
      POS tagger
      The entry for ik in Van Dale
      The entry for ik in the lexicon of the Spoken Dutch Corpus
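
The two pairs above attach the same expression to two different notations. As a small illustration of why the notation matters to a machine, the sketch below converts between them; the function name `to_roman` and the restriction to 1..3999 are choices made for this example, not something taken from the slides.

```python
# Value/numeral pairs in descending order, including subtractive forms.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to a Roman numeral."""
    if not 0 < n < 4000:
        raise ValueError("supported range is 1..3999")
    parts = []
    for value, numeral in ROMAN:
        count, n = divmod(n, value)   # how often this numeral fits
        parts.append(numeral * count)
    return "".join(parts)


entry_arabic = ("two hundred fifty-six", 256)
entry_roman = ("two hundred fifty-six", to_roman(256))
print(entry_arabic, entry_roman)
```

A system that only processes Arabic digits cannot use the second entry until it has been converted, even though a human reader treats the two as equivalent.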

  13. 4. Lexical databases
      Computational lexica are often task-specific and application-dependent.
      The need for reusability, maintainability, extensibility
      Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible
      Two aspects of abstractness: theory-neutral and level-independent

  14. 4. Lexical databases
      Lexical knowledge representation languages
      - DATR (Gazdar and Evans)
      - Typed feature structures (HPSG)
      The number of lexical entries for any given natural language is enormous.
      The information to be captured in each lexical entry is detailed and complex.
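
One way such languages keep a large and detailed lexicon manageable is default inheritance: general information is stated once on an abstract node and overridden only where an entry deviates. The sketch below illustrates that idea in plain Python. It is not DATR syntax, and the node names, feature names and `lookup` function are hypothetical.

```python
# Each node names its parent and lists only the features it adds or overrides.
NODES = {
    "VERB": {"parent": None, "features": {"pos": "verb", "past_rule": "regular"}},
    "Walk": {"parent": "VERB", "features": {"root": "walk"}},
    "Sing": {"parent": "VERB", "features": {"root": "sing", "past_rule": "irregular"}},
}


def lookup(node: str, feature: str) -> str:
    """Return the most specific value of a feature, walking up the hierarchy."""
    current = node
    while current is not None:
        entry = NODES[current]
        if feature in entry["features"]:
            return entry["features"][feature]
        current = entry["parent"]
    raise KeyError(f"{feature!r} is not defined for {node!r}")


print(lookup("Walk", "past_rule"))   # inherited from VERB
print(lookup("Sing", "past_rule"))   # overridden locally
```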

  15. 4. Lexical databases
      WordNet
      - English nouns, verbs, adjectives and adverbs
      - Inspired by psycholinguistic and computational theories of human lexical memory
      - Organized into synonym sets, each representing one underlying concept
      - Example: call
      Extension to other languages: EuroWordNet
      Application to Dutch: Cornetto
      Other initiatives: FrameNet and VerbNet
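
The organization into synonym sets can be pictured with a toy structure. The three sets for "call" below are invented for illustration and are not copied from WordNet; the identifiers and the helper functions `synsets_of` and `synonyms` are likewise assumptions of this sketch.

```python
# Toy synonym sets: each key stands for one underlying concept.
SYNSETS = {
    "call.telephone": {"call", "telephone", "phone", "ring"},
    "call.name": {"call", "name"},
    "call.shout": {"call", "shout", "cry"},
}


def synsets_of(word: str) -> list[str]:
    """Identifiers of all synonym sets that contain the word."""
    return sorted(key for key, members in SYNSETS.items() if word in members)


def synonyms(word: str) -> set[str]:
    """All words that share at least one synonym set with the word."""
    result = set()
    for members in SYNSETS.values():
        if word in members:
            result |= members
    result.discard(word)
    return result


print(synsets_of("call"))
print(synonyms("phone"))
```

A polysemous word such as "call" belongs to several sets, while each set still represents a single concept.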

  16. 5. Lexical knowledge acquisition
      - from scratch
      - from a machine-readable dictionary
      - from an agency for the distribution of resources (TST, ELRA and LDC)
      - inductive: from a partial lexicon and a corpus
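
For the inductive route, a plausible first step is to find the word forms in a corpus that the partial lexicon does not yet cover. The sketch below shows only that candidate-detection step, under the assumption of a naive tokenizer; the lexicon contents and the function name `candidates` are made up for the example, and the corpus reuses sentences (12) to (14).

```python
import re
from collections import Counter

PARTIAL_LEXICON = {"the", "pen", "is", "in", "my", "pocket"}

CORPUS = [
    "The pen is in my pocket.",
    "The pig is in the pen.",
    "That girl has a lot of hair.",
]


def candidates(corpus: list[str], lexicon: set[str]) -> list[tuple[str, int]]:
    """Word forms absent from the lexicon, most frequent first."""
    counts = Counter(
        token
        for sentence in corpus
        for token in re.findall(r"[a-z]+", sentence.lower())
        if token not in lexicon
    )
    return counts.most_common()


for form, freq in candidates(CORPUS, PARTIAL_LEXICON):
    print(form, freq)
```

Assigning a part of speech, valency and sense to each candidate is the harder part and is not attempted here.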

  17. 6. Lexica in text-to-speech
      written text
        → text normalisation
      expanded graphemic representation
        → tagging & syntactic analysis
      graphemic representation with prosody
        → grapheme-to-phoneme
      sequence of phonemes, incl. lexical stress
        → speech synthesis
      fluent speech
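
Two of these stages, text normalisation and grapheme-to-phoneme conversion, are the ones that most obviously consult a lexicon. The sketch below chains toy versions of both. The expansion table, the hand-written ARPAbet-style transcriptions (digits mark stress) and the function names are assumptions of this example; tagging, prosody and synthesis are left out.

```python
# Toy lexica; the entries are illustrative, not taken from a real resource.
EXPANSIONS = {"2": "two", "dr.": "doctor"}
PRONUNCIATIONS = {
    "two": ["T", "UW1"],
    "pens": ["P", "EH1", "N", "Z"],
    "doctor": ["D", "AA1", "K", "T", "ER0"],
}


def normalise(text: str) -> list[str]:
    """Expand digits and abbreviations into full word forms."""
    return [EXPANSIONS.get(token, token) for token in text.lower().split()]


def grapheme_to_phoneme(words: list[str]) -> list[list[str]]:
    """Look each word up in the pronunciation lexicon; mark unknown words."""
    return [PRONUNCIATIONS.get(word, ["<unk>"]) for word in words]


words = normalise("2 pens")
phonemes = grapheme_to_phoneme(words)
print(words)
print(phonemes)
```

A real system would fall back on letter-to-sound rules for words missing from the lexicon instead of emitting an unknown marker.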
