Dictionaries
See Patrick Hanks, “Lexicography”, chapter 3 of Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.

Dictionaries/Lexicons
• Lexicography and the computer
• Corpus-based lexicography
• MRDs
• Dictionaries for NLP
• Thesauri: structured lexicons

Computational lexicography
• Restructuring and exploiting human dictionaries for use by computer programs
• Using computational techniques to compile (new) dictionaries
• Focus on English (and other well-established languages)
• Significantly different issues for other languages, especially
  • Alphabetization and arrangement
  • Compilation from scratch for previously unstudied languages

Human dictionaries
• Traditional view of what a “dictionary” is
  • List of words, arranged (usually) alphabetically
  • Inclusion in dictionary lends authority, even proscriptively
• Entry typically gives
  • spelling ... alternate spellings
  • POS, morphology (if irregular)
  • core definition (using defining vocabulary?)
  • pronunciation (using own transcription)
  • etymology
  • examples of usage
    • as justification for inclusion
    • as illustration of use (esp. learner’s dictionaries)
• Entry typically doesn’t give
  • help with spelling
  • morphology (if regular), especially derivational
  • subcategorization information
  • contrastive examples of use
  • indications of possible metaphorical extensions to meaning

Human dictionaries
• Historically
  • bilingual dictionaries for translators
  • monolingual dictionary as (pre/proscriptive) definition of language, often polemical
  • OED (1884-1928): first dictionary on purely descriptive principles, relying on citations
• Deficiencies and difficulties
  • What to include? (neologisms, slang)
  • Inclusion of names
  • Differentiating senses

Differentiating word senses
• Dictionaries disagree widely
• Probably no right answer
• General principles (look for excuse to split vs look for reason to lump)
• Keep related words of different POS together?
• Etymology can be misleading (e.g. crane, pupil)
• Metaphorical extension of original meaning: how far do you go? (e.g. rose, bar)
• Purpose of dictionary may help decide, e.g. translation

Citations
• Senses and uses identified by collecting examples of use
• Sent in on “slips” by informants
• Lexicographer’s job is to collate these
• Criteria for a new word (or new meaning)
  • Number of citations
  • Source of citations
  • Veracity of use

Corpus-based dictionaries
• Corpus: a collection of texts, usually collected with a specific purpose in mind
• British National Corpus: attempt to capture a synchronic picture of BrE of the late 1980s (100m words)
• COBUILD “Bank of English”: dynamic “monitor” corpus used to help lexicographers identify/define usage

Machine-readable dictionaries
• “Machine” means “computer”
• Dictionary stored in a format which makes it manipulable on a computer
• Originally, derived from MR version of print dictionary (from typesetter’s tapes)
• Now the other way round: data stored as a database from which hard copy can be printed (inter alia)

MRDs - advantages
• Flexibility of access and presentation
  • Not bound to alphabetical listing
  • Information presented can be filtered
  • Can be searched as a database
  • Different versions (for different users, serving different purposes) can be produced
• Increased storage capacity
  • More information can be stored, especially
    • Implicit information can be made explicit
    • More examples, including “negative data”

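The access advantages listed above can be sketched with a toy in-memory "database" of entries. This is only an illustration under stated assumptions: the `Entry` fields, the sample entries and the function names are invented here and do not follow any real MRD format.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    # Hypothetical record layout, not taken from any real dictionary.
    headword: str
    pos: str
    definition: str
    examples: list = field(default_factory=list)


# Invented sample entries for illustration only.
ENTRIES = [
    Entry("crane", "noun", "a large long-necked wading bird",
          ["A crane stood in the shallows."]),
    Entry("crane", "noun", "a machine for lifting heavy loads",
          ["The crane lowered the beam."]),
    Entry("reckon", "verb", "to think or suppose",
          ["I reckon it will rain."]),
]


def search(entries, **criteria):
    """Database-style search: keep entries matching every field=value pair."""
    return [e for e in entries
            if all(getattr(e, k) == v for k, v in criteria.items())]


def search_definitions(entries, word):
    """Non-alphabetical access: headwords whose definition contains word."""
    return sorted({e.headword for e in entries if word in e.definition.split()})


def learner_view(entry):
    """Filtered presentation: one definition and at most one example."""
    example = entry.examples[0] if entry.examples else ""
    return f"{entry.headword} ({entry.pos}): {entry.definition}. {example}".strip()
```

The same stored data serves several "versions" of the dictionary: `search` supports lookup by any field, `search_definitions` goes from meaning to headword, and `learner_view` filters what is shown.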
Lexicons for NLP
• Have to state everything we need to know about the word
• Phonology: stress pattern, possible weak forms
• Orthography: spelling alternatives, hyphenation
• Morphology: inflectional paradigms, even if regular
  • Information about derivations
• Syntax: explicit information about subcategorization
  • e.g. syntactic/semantic features of arguments
  • Any special interpretation of tenses
• Lexical combinatorics: compounds, idioms
• Semantics: definition, semantic features, semantic relations
• Pragmatics: register, collocation, connotation

Lexicons for NLP - example
• Information about derivations
• Agentive derivation (-er) is very productive
  • Usually means the actor doing the action of a verb, e.g. swimmer, dancer, killer
  • Not available for some verbs, e.g. *knower, *cycler, *hoper, *sayer (though cf. soothsayer)
  • May have a specialised meaning instead of or as well as the derived meaning, e.g. revolver, computer, washer, hitter
  • In some cases can mean the object undergoing the action (via ergative use of verb), e.g. taster

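One way an NLP lexicon might record this is a general -er rule plus per-verb exception lists. The sketch below is an assumption-laden illustration: the set names, the status labels and the simplified spelling rule (final -e, and consonant doubling only for one-vowel verbs) are invented here, while the verbs come from the examples above.

```python
VOWELS = "aeiou"

# Exception lists taken from the examples above; the set names are hypothetical.
BLOCKED = {"know", "cycle", "say", "hope"}          # *knower, *cycler, ...
SPECIALISED = {"revolve", "compute", "wash", "hit"}  # revolver, computer, ...
PATIENT = {"taste"}                                  # taster: thing tasted


def er_form(verb):
    """Spell the -er form with a deliberately simplified orthographic rule."""
    if verb.endswith("e"):
        return verb + "r"                            # dance -> dancer
    one_vowel = sum(ch in VOWELS for ch in verb) == 1
    if (len(verb) >= 3 and one_vowel
            and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "er"                # swim -> swimmer
    return verb + "er"                               # kill -> killer


def agentive(verb):
    """Apply the productive rule unless the lexicon marks an exception."""
    form = er_form(verb)
    if verb in BLOCKED:
        return {"form": "*" + form, "status": "blocked"}
    if verb in SPECIALISED:
        return {"form": form, "status": "specialised"}
    if verb in PATIENT:
        return {"form": form, "status": "patient"}
    return {"form": form, "status": "agent"}
```

Only the exceptions need to be listed verb by verb; regular cases such as swimmer, dancer and killer fall out of the rule.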
Subcategorization
• Words are assigned to categories (i.e. parts of speech, POS), e.g. noun, verb
  • on basis of form, meaning, use
• Syntactic behaviour is predictable from (or determined by) category
• Within a category there are subcategories with specific patterns of behaviour, both syntactic and semantic, e.g.
  • transitive/intransitive verb: takes a direct object? can passivize?

Subcategorization
• Subcat frames indicate complement patterns and preferences, e.g.
  • subj, obj, double obj, prep-obj, infinitival complement, that-complement etc.
  • semantic features of complements, e.g. obj of eat normally edible
• Subcat information can help to disambiguate
  • cf. He told [the man] [where the body was buried].
  • He found [the place [where the body was buried]].
• Much of this info can be captured in general rules

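The told/found contrast can be sketched as a lookup in a table of subcat frames. The frame labels and the `SUBCAT` table are invented for illustration; a real lexicon would hold many more frames and would treat them as preferences rather than hard choices.

```python
# Hypothetical subcat frames: each tuple lists the complements after the verb.
SUBCAT = {
    "told": [("NP", "WH-CLAUSE"), ("NP", "THAT-CLAUSE"), ("NP", "NP")],
    "found": [("NP",), ("NP", "NP")],
}


def attach_wh_clause(verb):
    """Decide where a wh-clause following the object should attach."""
    if ("NP", "WH-CLAUSE") in SUBCAT.get(verb, []):
        return "verb complement"
    return "relative clause on object"


def bracket(subject, verb, obj, clause):
    """Render the preferred bracketing for subject-verb-object-clause."""
    if attach_wh_clause(verb) == "verb complement":
        return f"{subject} {verb} [{obj}] [{clause}]"
    return f"{subject} {verb} [{obj} [{clause}]]"


print(bracket("He", "told", "the man", "where the body was buried"))
print(bracket("He", "found", "the place", "where the body was buried"))
```

Because told licenses an object plus a wh-clause, the clause is read as a second complement; found has no such frame, so the clause is read as modifying the object.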
• Have to state everything we need to know about the word, though not necessarily explicitly
• There can be rules to capture inheritance of properties, e.g.
  • accomplishment + progressive tense implies incompletion
  • cf. She was baking a cake when she dropped dead (→ no cake)
  • She was stroking the cat when she dropped dead

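A minimal sketch of such inheritance: each entry names its aspectual class, and a property not stated on the entry is looked up on the class. The dictionary layout and the property name `progressive_entails_simple_past` are assumptions made for this illustration.

```python
# Class-level rules (hypothetical property name).
CLASS_RULES = {
    "accomplishment": {"progressive_entails_simple_past": False},
    "activity": {"progressive_entails_simple_past": True},
}

# Entries state only their class; everything else is inherited.
ENTRIES = {
    "bake a cake": {"class": "accomplishment"},
    "stroke the cat": {"class": "activity"},
}


def lookup(predicate, prop):
    """Return an entry-level value if present, else the inherited class value."""
    entry = ENTRIES[predicate]
    if prop in entry:
        return entry[prop]
    return CLASS_RULES[entry["class"]][prop]


def progressive_entails_simple_past(predicate):
    """Does 'was X-ing' entail 'X-ed' for this predicate?"""
    return lookup(predicate, "progressive_entails_simple_past")
```

Under these rules "was baking a cake" does not license "baked a cake", while "was stroking the cat" does license "stroked the cat", and neither fact is written on the individual entry.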
Exploiting human dictionaries in NLP
• In all NLP applications, lexicon is major bottleneck
• Availability of MRD versions of human dictionaries provided possible solution
• Obviously, MRD gives list of words, and some information
• Extract further information about verb frames by analysing the examples
• Identify semantic features from definitions, e.g. a plant which..., a person who...
• Identify hidden arguments, e.g. to lock = to close something using a key
  • cf. He locked the door. The key was heavy.
  • He emptied his pockets. *The key was heavy.

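Both extraction ideas can be sketched with simple pattern matching over definition text. The regular expressions and the sample definitions for "ivy" and "lexicographer" are invented for illustration; real definitions vary far more than these patterns allow.

```python
import re

# "a plant which ...", "a person who ...": the noun before the relative pronoun.
GENUS = re.compile(r"^(?:an?|the)\s+([a-z]+)\s+(?:which|who|that)\b")
# "... using a key": a hidden instrument argument.
INSTRUMENT = re.compile(r"\busing\s+(?:an?|the)\s+([a-z]+)")

# Invented definitions, except "lock", which follows the example above.
DEFINITIONS = {
    "ivy": "a plant which climbs walls",
    "lexicographer": "a person who compiles dictionaries",
    "lock": "to close something using a key",
}


def genus_term(definition):
    """Return the defining noun (a candidate semantic feature), if any."""
    match = GENUS.match(definition.lower())
    return match.group(1) if match else None


def hidden_instrument(definition):
    """Return an instrument mentioned in the definition, if any."""
    match = INSTRUMENT.search(definition.lower())
    return match.group(1) if match else None


features = {word: genus_term(d) for word, d in DEFINITIONS.items()}
```

A hidden instrument recovered this way is what lets a system resolve "the key" after "He locked the door" but not after "He emptied his pockets".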
Exploiting human dictionaries in NLP
• Generic information about a word and its usage can be derived from definitions in which it occurs:
  Wine: alcoholic drink made from fermented juices, especially of grapes
  Vintage: a season’s yield of wine from a vineyard
  Red wine: wine having a red colour derived from the skins of the grapes used ...
  Vineyard: an orchard where grapes are grown for the purpose of wine making
  Pinot noir: a dry red Californian table wine
  Sake: Japanese rice wine
  Claret: a dry red Bordeaux or Bordeaux-like wine
  Sherry: a sweet white wine from the Jerez region of Spain
  Riesling: a dessert wine made from white grapes grown historically in Germany
  ...

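As a rough sketch of the idea, the definitions can be indexed by the words they contain, so that everything defined in terms of "wine" can be pulled out and its vocabulary counted. The definitions are the ones listed above; the stopword list and function names are assumptions made for this illustration.

```python
import re
from collections import Counter

DEFINITIONS = {
    "wine": "alcoholic drink made from fermented juices, especially of grapes",
    "vintage": "a season's yield of wine from a vineyard",
    "red wine": "wine having a red colour derived from the skins of the grapes used ...",
    "vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "pinot noir": "a dry red Californian table wine",
    "sake": "Japanese rice wine",
    "claret": "a dry red Bordeaux or Bordeaux-like wine",
    "sherry": "a sweet white wine from the Jerez region of Spain",
    "riesling": "a dessert wine made from white grapes grown historically in Germany",
}

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "from", "or", "for", "in", "are", "s",
             "where", "having", "made", "used", "like"}


def tokens(text):
    """Lower-cased set of alphabetic word forms in a definition."""
    return set(re.findall(r"[a-z]+", text.lower()))


def defined_with(word):
    """Headwords whose definition mentions the given word."""
    return sorted(h for h, d in DEFINITIONS.items() if word in tokens(d))


def associated_words(word):
    """Count, per definition, the content words co-occurring with word."""
    counts = Counter()
    for definition in DEFINITIONS.values():
        toks = tokens(definition)
        if word in toks:
            counts.update(toks - STOPWORDS - {word})
    return counts
```

Here `defined_with("wine")` returns the headwords that are kinds of wine or closely tied to it, and `associated_words("wine")` surfaces recurring properties such as grapes, red, dry and white.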
Corpus-based lexicography revisited
• Similarly, analysis of real examples can reveal patterns of usage
• Identify primary meaning: not always what you’d expect (example of reckon)
• Identify possible complementation patterns, and their relative frequency

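Counting complementation patterns might look like the sketch below. The five "corpus" sentences are invented, and the classifier is a crude heuristic that only inspects the word after the verb; real work would use a parsed or tagged corpus.

```python
from collections import Counter

# Invented example sentences standing in for corpus citations.
CORPUS = [
    "I reckon that he will win",
    "I reckon he is right",
    "They reckon on a quick sale",
    "She reckoned the cost at ten pounds",
    "We reckon that it is too late",
]

SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def pattern(sentence, lemma="reckon"):
    """Guess the complement type from the word following the verb."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        if word.startswith(lemma):
            nxt = words[i + 1] if i + 1 < len(words) else ""
            if nxt == "that":
                return "that-clause"
            if nxt in {"on", "with"}:
                return "prep-obj"
            if nxt in {"the", "a", "an"}:
                return "obj"
            if nxt in SUBJECT_PRONOUNS:
                return "clause without that"
            return "other"
    return None


frequencies = Counter(pattern(s) for s in CORPUS)
for name, count in frequencies.most_common():
    print(f"{name}: {count / len(CORPUS):.0%}")
```

The relative frequencies, rather than the mere existence of a pattern, are what a corpus adds over a lexicographer's intuition.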
Structured dictionaries
• Special type of dictionary in which words are grouped together according to their meaning: thesaurus
• Classic example: Roget’s Thesaurus (1852)
• Structured vocabulary much used in field of terminology
• Also now a valuable resource for NLP: Miller’s (Princeton) WordNet (1985)

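Grouping by meaning can be sketched as a small "is a kind of" hierarchy. The links below are hand-made for illustration and are not taken from Roget or WordNet; the function names are likewise hypothetical.

```python
# Hypothetical hypernym ("is a kind of") links, child -> parent.
HYPERNYM = {
    "claret": "red wine",
    "pinot noir": "red wine",
    "red wine": "wine",
    "sherry": "wine",
    "sake": "wine",
    "wine": "alcoholic drink",
    "alcoholic drink": "drink",
}


def hypernym_chain(word):
    """Walk upwards from a word to the top of the hierarchy."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Direct children of a word: the more specific terms grouped under it."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)


def related(a, b):
    """Two words are related here if one is above the other or they share an ancestor."""
    up_a = set(hypernym_chain(a)) | {a}
    up_b = set(hypernym_chain(b)) | {b}
    return bool(up_a & up_b)
```

Unlike an alphabetical list, this structure answers questions such as "what kinds of red wine are there?" or "are claret and sake related?" directly.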