Frank Van Eynde, Centre for Computational Linguistics
Computational Lexicography

OUTLINE
1. The token/type distinction
2. Lexicographic practice
3. Computational lexica
4. Lexical databases
5. Lexical knowledge acquisition
6. The use of lexica in text-to-speech
(1) The girl gave the flowers to the athlete.
- 3 tokens of "the": their properties are context-specific
- 1 type <THE>: its properties are generalizations over the various uses
Heraclitus vs. Plato
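The distinction can be made concrete by counting both on sentence (1). The following is a minimal sketch that treats lower-cased letter sequences as word forms, which is a simplifying assumption and not a claim about how tokenization should be done in general.

```python
import re
from collections import Counter

sentence = "The girl gave the flowers to the athlete."

# Tokens: every occurrence of a word form in the text.
tokens = re.findall(r"[a-z]+", sentence.lower())

# Types: one entry per distinct word form, generalizing over its occurrences.
type_counts = Counter(tokens)

print(len(tokens))          # number of tokens in the sentence
print(len(type_counts))     # number of types in the sentence
print(type_counts["the"])   # number of tokens of the type <THE>
```

Note that this counts types by spelling only; it does not yet separate <THE, article> from <THE, adverb>.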
(2) The sooner they come, the better it is.
<THE, article> vs. <A, article> NL de, het
<THE, adverb> vs. <FAR, adverb> NL hoe
(3) I do not think that the dog of that girl is really that dangerous.
<THAT, compl> vs. <IF, compl> FR que
<THAT, det> vs. <THIS, det> FR ce/cette
<THAT, adverb> vs. <SO, adverb> FR si
(4) Je ne pense pas que le chien de cette fille est vraiment si dangereux.
The abstraction problem: given a word W, how many types <W,POS> do we have to distinguish?
(5) It is not far from here.
(6) We didn't go far.
(7) He's living in the Far West.
(8) Paris is far more expensive than Dublin.
<FAR, adj> vs. <NEAR, adj> NL ver
<FAR, adv> vs. <LITTLE, adv> NL veel
(9) De bal van de finale wordt verkocht op het bal van de FIFA.
<BAL, noun [non-neuter]> IT palla
<BAL, noun [neuter]> IT ballo
(10) La palla del finale sarà venduta al ballo della FIFA.
(11) That girl has been very lucky.
(12) That girl has a lot of hair.
<HAVE, verb [aux], _VP[PSP]> IT avere/essere
<HAVE, verb [main], _NP> IT avere
(14) The pig is in the pen.
<W, POS, VAL, SENSE>
<PEN, noun, writing implement> NL pen
<PEN, noun, fenced enclosure> NL hok
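The tuples <W, POS, VAL, SENSE> used above can be sketched as a small data structure. The class name LexType, the field names and the helper types_for are hypothetical; the entries are copied from the examples above, with empty strings where the slides do not specify a value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexType:
    word: str
    pos: str
    val: str = ""    # valence, e.g. "_NP"; empty when not specified
    sense: str = ""  # sense gloss; empty when not specified


# Entries taken from examples (3), (11), (12) and (14).
LEXICON = [
    LexType("THAT", "compl"),
    LexType("THAT", "det"),
    LexType("THAT", "adverb"),
    LexType("HAVE", "verb [aux]", val="_VP[PSP]"),
    LexType("HAVE", "verb [main]", val="_NP"),
    LexType("PEN", "noun", sense="writing implement"),
    LexType("PEN", "noun", sense="fenced enclosure"),
]


def types_for(word):
    """Return all types distinguished for a given word form."""
    return [t for t in LEXICON if t.word == word.upper()]


for t in types_for("pen"):
    print(t)
```

The abstraction problem then amounts to deciding how many such records a word form needs: POS alone separates the three uses of that, but only VAL separates the two uses of have, and only SENSE separates the two uses of pen.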
The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.
<ORTHⁿ, PHON, POS, m, (VAL,) SENSE>
Homonymy vs. polysemy
Problem: for any given ORTH, how many homonyms (n) and how many senses (m) does one have to distinguish?
The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.
<ORTH, PHON, m, SENSE>
There is no one-to-one correspondence between the senses in the two dictionaries.
Dictionaries are made for people who already understand (much of) the language.
Computational lexica are made for machines that do not understand (anything of) the language.
Consequence: an NLP system can only make sense of information which is presented in the notation (or format) which it employs for processing the language.
<two hundred fifty-six, 256>
<two hundred fifty-six, CCLVI>
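The two pairs above carry the same information for a human reader, but not for a program. The sketch below assumes a hypothetical consumer, value_of, that only processes decimal notation; the Roman entry is useless to it until a converter (here roman_to_int, also hypothetical) is added.

```python
# Two lexica pairing the same expression with different notations.
LEXICON_DECIMAL = {"two hundred fifty-six": "256"}
LEXICON_ROMAN = {"two hundred fifty-six": "CCLVI"}


def value_of(notation):
    """A consumer that only understands decimal notation."""
    return int(notation)


def roman_to_int(numeral):
    """Convert a Roman numeral to an integer (subtractive forms included)."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    for i, ch in enumerate(numeral):
        v = values[ch]
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(numeral) and values[numeral[i + 1]] > v:
            total -= v
        else:
            total += v
    return total


print(value_of(LEXICON_DECIMAL["two hundred fifty-six"]))

try:
    value_of(LEXICON_ROMAN["two hundred fifty-six"])
except ValueError:
    print("the consumer cannot interpret the Roman entry")

print(roman_to_int(LEXICON_ROMAN["two hundred fifty-six"]))
```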
The entry for ik in Van Dale
The entry for ik in the lexicon of the Spoken Dutch Corpus
Computational lexica are often task-specific and application-dependent.
The need for reusability, maintainability, extensibility
Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible
Two aspects of abstractness: theory-neutral and level-independent
Lexical knowledge representation languages
DATR (Gazdar and Evans)
Typed feature structures (HPSG)
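One idea that such representation languages support is default inheritance: general information is stated once and individual entries only list their exceptions. The following is a rough Python sketch of that idea, not DATR syntax and not an HPSG type hierarchy; the node names, attribute names and the regular pattern "{root}ed" are assumptions made for illustration.

```python
# Each node names its parent and lists only what differs from the parent.
NODES = {
    "VERB": {"parent": None, "pos": "verb", "past": "{root}ed", "psp": "{root}ed"},
    "WALK": {"parent": "VERB", "root": "walk"},
    "GIVE": {"parent": "VERB", "root": "give", "past": "gave", "psp": "given"},
}


def lookup(node, attr):
    """Find attr at the node itself, else inherit it from the nearest ancestor."""
    current = node
    while current is not None:
        entry = NODES[current]
        if attr in entry:
            return entry[attr]
        current = entry["parent"]
    raise KeyError(f"{attr} is not defined for {node}")


def form(node, attr):
    """Instantiate a (possibly inherited) pattern with the node's own root."""
    return lookup(node, attr).format(root=lookup(node, "root"))


print(form("WALK", "past"))  # inherited regular pattern
print(form("GIVE", "past"))  # local exception overrides the default
print(lookup("GIVE", "pos")) # inherited from VERB
```

Adding a new regular verb then takes one line, and a change to the regular pattern is made in one place, which is the kind of maintainability and extensibility asked for above.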
The number of lexical entries for any given natural language is enormous.
The information to be captured in each lexical entry is detailed and complex.
WordNet
English nouns, verbs, adjectives and adverbs
Inspired by psycholinguistic and computational theories of human lexical memory
Organized into synonym sets, each representing one underlying concept
Extension to other languages: EuroWordNet
Application to Dutch: Cornetto
Other initiatives: FrameNet and VerbNet
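The organization into synonym sets can be pictured with a toy structure. The identifiers, the glosses and the third set are hand-made for illustration and are not taken from any of the resources named above; only the two senses of pen come from the earlier examples.

```python
from collections import defaultdict

# Each synonym set groups the lemmas that express one underlying concept.
SYNSETS = [
    {"id": "s1", "lemmas": ["pen"], "gloss": "writing implement"},
    {"id": "s2", "lemmas": ["pen"], "gloss": "fenced enclosure"},
    {"id": "s3", "lemmas": ["athlete", "sportsperson"],
     "gloss": "person who competes in sports"},
]

# Index from lemma to the sets it belongs to: a polysemous or homonymous
# lemma appears in more than one set.
INDEX = defaultdict(list)
for synset in SYNSETS:
    for lemma in synset["lemmas"]:
        INDEX[lemma].append(synset["id"])


def synonyms(word):
    """Return the other lemmas that share at least one set with word."""
    result = set()
    for synset in SYNSETS:
        if word in synset["lemmas"]:
            result.update(synset["lemmas"])
    result.discard(word)
    return result


print(INDEX["pen"])
print(synonyms("athlete"))
```

In this layout the unit of description is the concept rather than the word form, so the number of senses of a word is simply the number of sets it belongs to.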
from a machine-readable dictionary
from an agency for the distribution of resources (TST, ELRA and LDC)
inductive: from a partial lexicon and a corpus
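The inductive route can be illustrated with a deliberately crude heuristic: learn which part of speech goes with which word ending in the partial lexicon, then propose entries for corpus words that the lexicon lacks. This is only one possible technique, chosen here for brevity; the lexicon, the corpus and the function names are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical partial lexicon and raw corpus.
PARTIAL_LEXICON = {
    "dangerous": "adj", "famous": "adj", "nervous": "adj",
    "really": "adv", "quickly": "adv", "slowly": "adv",
}
CORPUS = "the dog is really dangerous and the girl is seriously generous".split()


def learn_suffix_rules(lexicon, length=2):
    """Map each word ending to the POS it most often has in the lexicon."""
    counts = defaultdict(Counter)
    for word, pos in lexicon.items():
        counts[word[-length:]][pos] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}


def propose_entries(corpus, lexicon, rules, length=2):
    """Propose a POS for unknown corpus words whose ending matches a rule."""
    proposals = {}
    for word in corpus:
        if word not in lexicon and word[-length:] in rules:
            proposals[word] = rules[word[-length:]]
    return proposals


rules = learn_suffix_rules(PARTIAL_LEXICON)
print(propose_entries(CORPUS, PARTIAL_LEXICON, rules))
```

The proposals are guesses that still need checking: with these rules a noun such as bus would be proposed as an adjective.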
expanded graphemic representation
tagging & syntactic analysis
graphemic representation with prosody
sequence of phonemes, incl. lexical stress
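Two of these stages depend directly on lexical resources: expansion of the written input and the mapping from words to phonemes with lexical stress. The sketch below covers only those two and skips tagging, syntactic analysis and prosody. The expansion table, the pronunciation lexicon and its transcriptions are toy data assumed for illustration.

```python
# Hypothetical expansion table: written symbols -> full word forms.
EXPANSIONS = {"2": "two"}

# Hypothetical pronunciation lexicon; "ˈ" marks lexical stress.
PRONUNCIATIONS = {
    "the": "ðə", "pig": "ˈpɪɡ", "is": "ɪz",
    "in": "ɪn", "pen": "ˈpɛn", "two": "ˈtuː",
}


def expand(text):
    """Produce the expanded graphemic representation as a list of words."""
    return [EXPANSIONS.get(token, token) for token in text.lower().split()]


def to_phonemes(words):
    """Look up each word; flag words that the lexicon does not cover."""
    return [PRONUNCIATIONS.get(w, "<unknown:" + w + ">") for w in words]


words = expand("The pig is in pen 2")
phonemes = to_phonemes(words)
print(words)
print(phonemes)
```

Words missing from the lexicon are flagged rather than guessed, which shows where a real system would need either a larger lexicon or grapheme-to-phoneme rules.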