
Frank Van Eynde

Centre for Computational Linguistics

Computational Lexicography

Outline

1. The token/type distinction

2. Lexicographic practice

3. Computational lexica

4. Lexical databases

5. Lexical knowledge acquisition

6. The use of lexica in text-to-speech

1. Tokens vs. types

(1) The girl gave the flowers to the athlete.

- 3 tokens of the: properties are context-specific

- 1 type <THE>: properties are generalizations over the various uses
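The count for sentence (1) can be checked with a few lines of Python. This is a minimal sketch: the tokenizer is deliberately crude, and a type is approximated here by the upper-cased word form, which is an assumption made for illustration only.

```python
import re
from collections import Counter

def tokenize(sentence: str) -> list:
    """Return the word tokens of a sentence (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", sentence)

def type_counts(sentence: str) -> Counter:
    """Map each type (approximated by the upper-cased form) to its token count."""
    return Counter(token.upper() for token in tokenize(sentence))

sentence = "The girl gave the flowers to the athlete."
tokens = tokenize(sentence)
counts = type_counts(sentence)

print(len(tokens))    # number of tokens in the sentence
print(len(counts))    # number of distinct types
print(counts["THE"])  # number of tokens of the type <THE>
```

Identifying a type with its spelling is too coarse once a form belongs to more than one category, since form alone cannot separate types that share a spelling.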

Heracleitos vs. Plato

(2) The sooner they come, the better it is.

<THE, article> vs. <A, article> NL de, het

<THE, adverb> vs. <FAR, adverb> NL hoe

1. Tokens vs. types

(3) I do not think that the dog of that girl is really that dangerous.

<THAT, compl> vs. <IF, compl> FR que

<THAT, det> vs. <THIS, det> FR ce/cette

<THAT, adverb> vs. <SO, adverb> FR si

(4) Je ne pense pas que le chien de cette fille soit vraiment si dangereux.
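The three types of that can be sketched as a lexicon keyed by <FORM, POS> pairs. The dictionary and function names below are hypothetical, and the categories of the three tokens in (3) are assigned by hand rather than by a tagger.

```python
# Hypothetical mini-lexicon: keys are <FORM, POS> types, values are the
# French equivalents listed on the slide.
FRENCH = {
    ("THAT", "compl"): "que",
    ("THAT", "det"): "ce/cette",
    ("THAT", "adverb"): "si",
}

# Hand-assigned categories of the three tokens of "that" in sentence (3),
# in order of occurrence.
TAGGED_THAT_TOKENS = ["compl", "det", "adverb"]

def translate_tokens(form, pos_sequence, lexicon):
    """Look up the equivalent of each token via its type, not via its form."""
    return [lexicon[(form.upper(), pos)] for pos in pos_sequence]

equivalents = translate_tokens("that", TAGGED_THAT_TOKENS, FRENCH)
print(equivalents)  # ['que', 'ce/cette', 'si']
```

A lookup keyed on the form alone would have to return the same equivalent for all three tokens.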

1. Tokens vs. types

The abstraction problem: given a word W, how many types <W,POS> do we have to distinguish?

(5) It is not far from here.

(6) We didn't go far.

(7) He's living in the Far West.

(8) Paris is far more expensive than Dublin.

<FAR, adj> vs. <NEAR, adj> NL ver

<FAR, adv> vs. <LITTLE, adv> NL veel

1. Tokens vs. types

(9) De bal van de finale wordt verkocht op het bal van de FIFA. ('The ball of the final is being sold at the FIFA ball.')

<BAL, noun [non-neuter]> IT palla

<BAL, noun [neuter]> IT ballo

(10) La palla della finale sarà venduta al ballo della FIFA.

1. Tokens vs. types

(11) That girl has been very lucky.

(12) That girl has a lot of hair.


<HAVE, verb [aux], _VP[PSP]> IT avere/essere

<HAVE, verb [main], _NP> IT avere

1. Tokens vs. types

(13) The pen is in my pocket.

(14) The pig is in the pen.

<PEN, noun, writing implement> NL pen

<PEN, noun, fenced enclosure> NL hok

2. Lexicographic practice

The entries for pen and peg in the Oxford Advanced Learner's Dictionary of Current English.


Homonymy vs. polysemy

Problem: for any given ORTH, how many n and how many m does one have to distinguish?
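The choice between homonymy and polysemy can be sketched as a data-structure decision. The sketch assumes that n indexes the homonyms of a spelling and m the senses within one homonym; this is one reading of the problem statement, and the class names are hypothetical. The two senses of pen are those of examples (13) and (14).

```python
from dataclasses import dataclass, field

@dataclass
class Homonym:
    index: int                                  # n: which homonym of the spelling
    senses: list = field(default_factory=list)  # senses, numbered m = 1, 2, ...

@dataclass
class Entry:
    orth: str
    homonyms: list = field(default_factory=list)

    def sense_ids(self):
        """Return identifiers of the form ORTH_n.m for every sense."""
        return [
            f"{self.orth}_{h.index}.{m}"
            for h in self.homonyms
            for m in range(1, len(h.senses) + 1)
        ]

# Analysis A: two unrelated words that happen to share a spelling (homonymy).
as_homonyms = Entry("pen", [Homonym(1, ["writing implement"]),
                            Homonym(2, ["fenced enclosure"])])

# Analysis B: one word with two senses (polysemy).
as_polysemy = Entry("pen", [Homonym(1, ["writing implement", "fenced enclosure"])])

print(as_homonyms.sense_ids())  # ['pen_1.1', 'pen_2.1']
print(as_polysemy.sense_ids())  # ['pen_1.1', 'pen_1.2']
```

Both analyses cover the same two uses; they differ only in where the lexicographer draws the line.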

2. Lexicographic practice

The entries for pen and peg in the Collins Cobuild Dictionary of the English Language.


There is no one-to-one correspondence between the senses in the two dictionaries.

3. Computational lexica

Dictionaries are made for people who already understand (much of) the language.

Computational lexica are made for machines that do not understand (anything of) the language.

Consequence: an NLP system can only make sense of information which is presented in the notation (or format) which it employs for processing the language.

3. Computational lexica

<two hundred fifty-six, 256>

<two hundred fifty-six, CCLVI>

POS tagger

The entry for ik in Van Dale

The entry for ik in the lexicon of the Spoken Dutch Corpus
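The point of the number example on this slide can be sketched in code: the same value is stored in two notations, and a component that expects one of them cannot use the other. The function and variable names are hypothetical.

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n: int) -> str:
    """Convert an integer in the range 1..3999 to a Roman numeral."""
    if not 0 < n < 4000:
        raise ValueError("value out of range for this sketch")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

# Two lexical entries for the same expression, in different notations.
entry_arabic = ("two hundred fifty-six", 256)
entry_roman = ("two hundred fifty-six", to_roman(256))

def can_add(value) -> bool:
    """A component that does integer arithmetic can only use integer values."""
    return isinstance(value, int)

print(entry_roman[1])            # CCLVI
print(can_add(entry_arabic[1]))  # True
print(can_add(entry_roman[1]))   # False
```

Both entries carry the same information for a human reader; only one of them is usable by a component that expects integers.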

4. Lexical databases

Computational lexica are often task-specific and application-dependent.

The need for reusability, maintainability, extensibility

Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible

Two aspects of abstractness: theory-neutral and level-independent

4. Lexical databases

Lexical knowledge representation languages

DATR (Gazdar and Evans)

Typed feature structures (HPSG)

The number of lexical entries for any given natural language is enormous.

The information to be captured in each lexical entry is detailed and complex.
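One way such languages keep a large and detailed lexicon manageable is default inheritance: general information is stated once and overridden only where an entry deviates. The sketch below shows that idea in plain Python. It is not DATR syntax and not a typed feature structure formalism; the class, node and feature names are hypothetical.

```python
class Node:
    """A lexical node that inherits feature values from its parent by default."""

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = features

    def get(self, feature):
        """Return the most specific value: local first, then up the hierarchy."""
        node = self
        while node is not None:
            if feature in node.features:
                return node.features[feature]
            node = node.parent
        raise KeyError(feature)

def plural(node):
    """Build the plural form from the stem and the (possibly inherited) suffix."""
    return node.get("stem") + node.get("plural_suffix")

# General information is stated once, on the class node.
NOUN = Node("NOUN", pos="noun", plural_suffix="s")

# A regular noun only states what is specific to it.
girl = Node("girl", parent=NOUN, stem="girl")

# An irregular noun overrides the default.
ox = Node("ox", parent=NOUN, stem="ox", plural_suffix="en")

print(plural(girl), plural(ox))  # girls oxen
print(ox.get("pos"))             # noun
```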

4. Lexical databases

WordNet

English nouns, verbs, adjectives and adverbs

Inspired by psycholinguistic and computational theories of human lexical memory

Organized into synonym sets, each representing one underlying concept

Example: call

Extension to other languages: EuroWordNet

Application to Dutch: Cornetto

Other initiatives: FrameNet and VerbNet
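The organisation into synonym sets can be sketched with a toy data structure. The three sets below are hand-made for illustration; they are not extracted from WordNet or any other resource, and the identifiers are hypothetical.

```python
# Toy synonym sets: each set stands for one underlying concept.
SYNSETS = {
    "s1": {"call", "telephone", "phone", "ring"},
    "s2": {"call", "name"},
    "s3": {"call", "shout", "cry"},
}

def synsets_of(word):
    """Return the identifiers of all synonym sets that contain the word."""
    return sorted(sid for sid, members in SYNSETS.items() if word in members)

def synonyms(word):
    """Return all words that share at least one synonym set with the word."""
    found = set()
    for members in SYNSETS.values():
        if word in members:
            found |= members
    found.discard(word)
    return sorted(found)

print(synsets_of("call"))  # ['s1', 's2', 's3']
print(synonyms("phone"))   # ['call', 'ring', 'telephone']
```

A word that belongs to several sets, like call here, is ambiguous between the corresponding concepts.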

5. Lexical knowledge acquisition

from scratch

from a machine-readable dictionary

from an agency for the distribution of resources (TST, ELRA and LDC)

inductive: from a partial lexicon and a corpus
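The inductive option can be sketched as follows: regularities are learned from the partial lexicon and then used to propose entries for unknown words in the corpus. The lexicon, the corpus and the two-letter suffix heuristic are all toy assumptions; a real system would use far richer evidence.

```python
from collections import Counter, defaultdict

# Hypothetical partial lexicon and corpus.
PARTIAL_LEXICON = {
    "walking": "verb", "talking": "verb",
    "happiness": "noun", "sadness": "noun",
    "quickly": "adv", "slowly": "adv",
}
CORPUS = "she was singing loudly about kindness".split()

def suffix_model(lexicon, length=2):
    """Count which part of speech each word-final letter sequence goes with."""
    model = defaultdict(Counter)
    for word, pos in lexicon.items():
        model[word[-length:]][pos] += 1
    return model

def induce(lexicon, corpus, length=2):
    """Propose a part of speech for unknown corpus words with a known suffix."""
    model = suffix_model(lexicon, length)
    proposals = {}
    for word in corpus:
        if word in lexicon:
            continue
        votes = model.get(word[-length:])
        if votes:
            proposals[word] = votes.most_common(1)[0][0]
    return proposals

proposals = induce(PARTIAL_LEXICON, CORPUS)
print(proposals)  # {'singing': 'verb', 'loudly': 'adv', 'kindness': 'noun'}
```

Words without a known suffix, such as she or about, receive no proposal and would still have to be added by other means.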

6. Lexica in text-to-speech

written text

↓ text normalisation

expanded graphemic representation

↓ tagging & syntactic analysis

graphemic representation with prosody

↓ grapheme-to-phoneme

sequence of phonemes, incl. lexical stress

↓ speech synthesis

fluent speech
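The first three stages can be sketched as a chain of functions, with the lexicon consulted at the grapheme-to-phoneme step. Everything here is a toy assumption: normalisation only lower-cases and strips punctuation, the tagger is a single hand-written rule, stress is shown with capital letters instead of phonemes, and speech synthesis is left out.

```python
import re

# Hypothetical stress lexicon keyed by <form, POS>: the noun "record" is
# stressed on the first syllable, the verb on the second.
STRESS_LEXICON = {
    ("record", "noun"): "RE-cord",
    ("record", "verb"): "re-CORD",
}

def normalise(text):
    """Text normalisation (toy): lower-case and keep only the words."""
    return re.findall(r"[a-z']+", text.lower())

def tag(words):
    """Tagging (toy): 'record' directly after 'to' is a verb, otherwise a noun."""
    tagged = []
    for i, word in enumerate(words):
        if word == "record":
            pos = "verb" if i > 0 and words[i - 1] == "to" else "noun"
        else:
            pos = "other"
        tagged.append((word, pos))
    return tagged

def to_pronunciation(tagged):
    """Grapheme-to-phoneme (toy): look up <form, POS>, else keep the spelling."""
    return [STRESS_LEXICON.get((word, pos), word) for word, pos in tagged]

def synthesiser_input(text):
    """Run the stages in sequence and return what would go to the synthesiser."""
    return " ".join(to_pronunciation(tag(normalise(text))))

print(synthesiser_input("They want to record a new record."))
# they want to re-CORD a new RE-cord
```

The lookup is by type rather than by form, which is why the tagging stage has to come before grapheme-to-phoneme conversion.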