Computational Lexicography

Frank Van Eynde
Centre for Computational Linguistics

OUTLINE

1. The token/type distinction
2. Lexicographic practice
3. Computational lexica
4. Lexical databases
5. Lexical knowledge acquisition
6. The use of lexica in text-to-speech

1. Tokens vs. types

(1) The girl gave the flowers to the athlete.

- 3 tokens "the": properties are context-specific
- 1 type <THE>: properties are generalizations over the various uses

Heracleitos vs. Plato

(2) The sooner they come, the better it is.

<THE, article> vs. <A, article>    NL de, het
<THE, adverb> vs. <FAR, adverb>    NL hoe
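The count for example (1) can be reproduced with a few lines of Python. This is only a sketch: it treats a type as a case-normalised word form, which is a cruder notion than the <W, POS> types discussed here, and the variable names are illustrative.

```python
import re
from collections import Counter

sentence = "The girl gave the flowers to the athlete."

# Tokens: every occurrence of a word form in the text.
tokens = re.findall(r"[A-Za-z]+", sentence)

# Types: tokens grouped under one case-normalised form. This is a crude
# stand-in for the abstraction <THE>; it ignores part of speech.
type_counts = Counter(token.upper() for token in tokens)

print(len(tokens))         # number of tokens in the sentence
print(len(type_counts))    # number of distinct types
print(type_counts["THE"])  # number of tokens of the type <THE>
```

Grouping by spelling alone would wrongly merge the two uses of "the" in example (2), which is why the type needs more than the word form.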

(3) I do not think that the dog of that girl is really that dangerous.

<THAT, compl> vs. <IF, compl>      FR que
<THAT, det> vs. <THIS, det>        FR ce/cette
<THAT, adverb> vs. <SO, adverb>    FR si

(4) Je ne pense pas que le chien de cette fille est vraiment si dangereux.

The abstraction problem: given a word W, how many types <W, POS> do we have to distinguish?

(5) It is not far from here.
(6) We didn't go far.
(7) He's living in the Far West.
(8) Paris is far more expensive than Dublin.

<FAR, adj> vs. <NEAR, adj>      NL ver
<FAR, adv> vs. <LITTLE, adv>    NL veel

(9) De bal van de finale wordt verkocht op het bal van de FIFA.

<BAL, noun [non-neuter]>    IT palla
<BAL, noun [neuter]>        IT ballo

(10) La palla del finale sarà venduta al ballo della FIFA.

(11) That girl has been very lucky.
(12) That girl has a lot of hair.

<W, POS, VAL>

<HAVE, verb [aux], _VP[PSP]>    IT avere/essere
<HAVE, verb [main], _NP>        IT avere

(13) The pen is in my pocket.
(14) The pig is in the pen.

<W, POS, VAL, SENSE>

<PEN, noun, writing implement>    NL pen
<PEN, noun, fenced enclosure>     NL hok
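A type of the form <W, POS, VAL, SENSE> maps naturally onto a record with four fields. The sketch below stores the HAVE and PEN types from the examples and retrieves all types for a given word form. The class name `LexType`, the list `LEXICON` and the function `types_for` are hypothetical names chosen for this illustration.

```python
from typing import NamedTuple, Optional


class LexType(NamedTuple):
    """One lexical type <W, POS, VAL, SENSE>; unused slots are None."""
    word: str
    pos: str
    val: Optional[str]
    sense: Optional[str]


# Toy lexicon built from the types listed in the examples.
LEXICON = [
    LexType("HAVE", "verb [aux]", "_VP[PSP]", None),
    LexType("HAVE", "verb [main]", "_NP", None),
    LexType("PEN", "noun", None, "writing implement"),
    LexType("PEN", "noun", None, "fenced enclosure"),
]


def types_for(word: str) -> list:
    """Return every type whose W slot matches the given word form."""
    return [entry for entry in LEXICON if entry.word == word.upper()]


for entry in types_for("pen"):
    print(entry)
```

A lookup by word form alone returns several candidates; choosing among them for a given token needs the context of that token.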

2. Lexicographic practice

The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.

<ORTHⁿ, PHON, POS, m, (VAL,) SENSE>

Homonymy vs. polysemy

Problem: for any given ORTH, how many n and how many m does one have to distinguish?

The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.

<ORTH, PHON, m, SENSE>

There is no one-to-one correspondence between the senses in the two dictionaries.

3. Computational Lexica

Dictionaries are made for people who already understand (much of) the language.

Computational lexica are made for machines that do not understand (anything of) the language.

Consequence: an NLP system can only make sense of information which is presented in the notation (or format) which it employs for processing the language.

<two hundred fifty-six, 256>
<two hundred fifty-six, CCLVI>

POS tagger

The entry for ik in Van Dale
The entry for ik in the lexicon of the Spoken Dutch Corpus
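The two pairs above link the same word form to two different notations. As a rough illustration of why the notation matters, the sketch below derives both values from the written form; a system that computes with integers can use the first and not the second. The function names and the small word tables are assumptions made for this example, and the converter only covers numbers below one thousand.

```python
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def number_value(phrase: str) -> int:
    """Map an English number phrase (below 1000) to an integer."""
    total = 0
    for word in phrase.lower().replace("-", " ").split():
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        elif word == "hundred":
            total *= 100
        else:
            raise ValueError(f"unknown number word: {word}")
    return total


def to_roman(n: int) -> str:
    """Render a positive integer in Roman numerals."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


value = number_value("two hundred fifty-six")
print(value, to_roman(value))
```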

4. Lexical databases

Computational lexica are often task-specific and application-dependent.

The need for reusability, maintainability, extensibility

Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible

Two aspects of abstractness: theory-neutral and level-independent

The number of lexical entries for any given natural language is enormous.

The information to be captured in each lexical entry is detailed and complex.

Lexical knowledge representation languages

- DATR (Gazdar and Evans)
- Typed feature structures (HPSG)
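One idea behind such representation languages is default inheritance: shared information is stated once on a general node, and an entry only lists what deviates from it. The sketch below imitates that idea in plain Python. It is not DATR syntax, and the node names, feature names and functions are hypothetical.

```python
# Each node names its parent and lists only its own (non-inherited) features.
HIERARCHY = {
    "VERB": {"parent": None, "features": {"pos": "verb", "past_suffix": "ed"}},
    "WALK": {"parent": "VERB", "features": {"stem": "walk"}},
    "GIVE": {"parent": "VERB", "features": {"stem": "give", "past": "gave"}},
}


def lookup(node, feature):
    """Return the nearest value for a feature, walking up the hierarchy."""
    while node is not None:
        entry = HIERARCHY[node]
        if feature in entry["features"]:
            return entry["features"][feature]
        node = entry["parent"]
    raise KeyError(feature)


def past_form(node):
    """Use a listed exception if present, else the inherited default rule."""
    try:
        return lookup(node, "past")
    except KeyError:
        return lookup(node, "stem") + lookup(node, "past_suffix")


print(past_form("WALK"))
print(past_form("GIVE"))
```

With this layout a regular verb needs only a stem, while an irregular one adds a single overriding feature, which keeps a large lexicon compact.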

WordNet

- English nouns, verbs, adjectives and adverbs
- Inspired by psycholinguistic and computational theories of human lexical memory
- Organized into synonym sets, each representing one underlying concept
- Example: call

Extension to other languages: EuroWordNet
Application to Dutch: Cornetto
Other initiatives: FrameNet and VerbNet
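The organisation into synonym sets can be pictured as a table of sets plus an index from words to the sets they belong to. The sketch below uses toy data for "call" that is invented for illustration and not copied from WordNet; the identifiers, glosses and function names are all assumptions.

```python
# Toy synonym sets: each set stands for one concept and lists its members.
SYNSETS = {
    "call_1": {"members": {"call", "telephone", "phone"},
               "gloss": "get in touch by telephone"},
    "call_2": {"members": {"call", "name"},
               "gloss": "give a name to"},
}


def synsets_of(word):
    """Return the ids of all synonym sets that contain the word."""
    return sorted(sid for sid, s in SYNSETS.items() if word in s["members"])


def synonyms_of(word):
    """Collect the other members of every set the word belongs to."""
    found = set()
    for sid in synsets_of(word):
        found |= SYNSETS[sid]["members"]
    found.discard(word)
    return found


print(synsets_of("call"))
print(sorted(synonyms_of("call")))
```

A word with several senses appears in several sets, so the number of sets returned for a word is one way to read off how many senses the resource distinguishes.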

5. Lexical knowledge acquisition

- from scratch
- from a machine-readable dictionary
- from an agency for the distribution of resources (TST, ELRA and LDC)
- inductive: from a partial lexicon and a corpus

6. Lexica in text-to-speech

written text
  ↓ text normalisation
expanded graphemic representation
  ↓ tagging & syntactic analysis
graphemic representation with prosody
  ↓ grapheme-to-phoneme
sequence of phonemes, incl. lexical stress
  ↓ speech synthesis
fluent speech
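The first three stages of this pipeline can be sketched as a chain of functions, each consulting its own small lexicon. Everything below is an assumption made for illustration: the three tables are toy data, the transcriptions use an ARPAbet-style notation with stress digits, and prosody and speech synthesis are left out.

```python
# Toy lexica, one per stage (illustrative entries only).
NORMALISATION = {"2": "two"}
POS_LEXICON = {"the": "det", "two": "num", "pens": "noun"}
PRONUNCIATION = {"the": "DH AH0", "two": "T UW1", "pens": "P EH1 N Z"}


def normalise(text):
    """Text normalisation: lowercase, split, expand digits to words."""
    return [NORMALISATION.get(tok, tok) for tok in text.lower().split()]


def tag(words):
    """Tagging: attach a part of speech from the lexicon to each word."""
    return [(word, POS_LEXICON.get(word, "unknown")) for word in words]


def to_phonemes(tagged):
    """Grapheme-to-phoneme: look up each word's transcription.

    This toy version ignores the tag; a fuller lexicon would key on
    (word, tag) so that homographs can receive different transcriptions.
    """
    return [PRONUNCIATION[word] for word, _pos in tagged]


def text_to_phonemes(text):
    """Run the three stages in the order given in the pipeline."""
    return to_phonemes(tag(normalise(text)))


print(text_to_phonemes("The 2 pens"))
```

Each stage depends on a lexicon in the notation that stage works with, which is the point made earlier about computational lexica.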
