
Modelli simulativi nelle Scienze Cognitive



  1. Simulation Models in the Cognitive Sciences (Modelli simulativi nelle Scienze Cognitive). The lexicon: linguistic models, WordNet, lexical acquisition. Massimo Poesio

  2. PART I: LEXICON AND LEXICAL SEMANTICS. WORDNET

  3. What’s in a lexicon • A lexicon is a repository of lexical knowledge • The simplest form of lexicon is a list of words • But even for English, let alone languages with a richer morphology such as Italian, it makes sense to separate WORD FORMS from LEXICAL ENTRIES, or LEXEMES: • LEXEME BANK: POS: N • WORD BANKS: LEXEME: BANK; SYN: NUM: PLUR • Lexical knowledge also includes information about the MEANING of words
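
The split between word forms and lexemes can be sketched as a small data structure. This is a minimal Python sketch, not a description of any resource used in the lectures: the names `Lexeme`, `WordForm`, `LEXICON` and `lookup` are hypothetical, and the singular entry for "bank" is an assumption added for contrast with the plural entry on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """A lexical entry: citation form plus part of speech."""
    lemma: str
    pos: str


@dataclass(frozen=True)
class WordForm:
    """An inflected form that points back to its lexeme."""
    form: str
    lexeme: Lexeme
    features: tuple = ()  # morphosyntactic features, e.g. (("NUM", "PLUR"),)


BANK = Lexeme(lemma="bank", pos="N")

# The lexicon maps each surface string to the word forms it can realise.
# The singular entry is an assumption; the slide only shows the plural.
LEXICON = {
    "bank": [WordForm("bank", BANK, (("NUM", "SING"),))],
    "banks": [WordForm("banks", BANK, (("NUM", "PLUR"),))],
}


def lookup(form):
    """Return (lemma, pos, features) triples for a surface form."""
    return [(w.lexeme.lemma, w.lexeme.pos, dict(w.features))
            for w in LEXICON.get(form.lower(), [])]


print(lookup("banks"))
```

Keeping the part of speech on the lexeme and the number feature on the word form means that information shared by all inflected forms is stored only once.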

  4. Meaning … • Characterizing the meaning of words is not easy • Most of the methods considered in these lectures characterize the meaning of a word by stating its relations with other words • This method, however, doesn’t say much about what the word ACTUALLY means (e.g., what you can do with a car)

  5. An example lexical entry: VICINO (from it.wiktionary.org) • vicino, sostantivo [noun] m (vicina f, vicini pl m, vicine pl f): 1. Colui che abita accanto [the person who lives next door] (“I miei vicini vengono da Frosinone”) • vicino, aggettivo [adjective] m (vicina f, vicini pl m, vicine pl f) (“La più vicina stella a neutroni è RX J185635-3754”) • vicino, avverbio [adverb] (invariabile) (“Itunes visto da vicino”)

  6. Lexical resources for computers: MACHINE READABLE DICTIONARIES • A traditional DICTIONARY is a database containing information about • the PRONUNCIATION of a certain word • its possible PARTS of SPEECH • its possible SENSES (or MEANINGS) • In recent years, most dictionaries have appeared in machine-readable form (MRD) • English: • Oxford English Dictionary • Collins • Longman Dictionary of Contemporary English (LDOCE) • Italian: • Garzanti • Zanichelli • Paravia • it.wiktionary.org

  7. An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE • 0100 a supply (of something) for use: a good stock of food • 0200 goods for sale: Some of the stock is being taken without being paid for • 0300 the thick part of a tree trunk • 0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side • 0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed • 0600 a group of animals used for breeding • 0700 farm animals usu. cattle; LIVESTOCK • 0800 a family line, esp. of the stated character • 0900 money lent to a government at a fixed rate of interest • 1000 the money (CAPITAL) owned by a company, divided into SHAREs • 1100 a type of garden flower with a sweet smell • 1200 a liquid made from the juices of meat, bones, etc., used in cooking …

  8. Homonymy • Word-strings like STOCK are used to express apparently unrelated senses / meanings, even in contexts in which their part of speech has been determined • Other well-known examples: BANK, LIME, RIGHT, SET, SCALE • Italian: CALCIO, OBBIETTIVO • An example of the problems homonymy may cause for IR systems: • Search for 'West Bank' with Google

  9. CALCIO, from “Il grande dizionario Garzanti” • calcio1 [càl-cio] s.m. [kick; football] 1. colpo dato con il piede o con la zampa; pedata; dare, assestare, ricevere un ~ 2. (sport) gioco che si svolge tra due squadre di undici giocatori ciascuna … 3. nel football, colpo dato con il piede al pallone: ~ di punizione, … ~ di rigore … ~ d’angolo … ~ piazzato • calcio2 [butt of a rifle] parte inferiore della cassa di un fucile … derivato del lat. calx calcis … • calcio3 [calcium] elemento chimico il cui simbolo è Ca; metallo alcalinoterroso …

  10. Homonymy in an MRD for Italian (ItalWordNet) • obbiettivo, Nome [noun] • [1] - scopo di un'operazione militare [aim of a military operation] (obbiettivo [1], obiettivo [1]) • [2] - bersaglio nel tiro di artiglieria [target of artillery fire] (obbiettivo [2], obiettivo [2]) • [4] - sistema di lenti per proiettare l'immagine reale di un oggetto [lens system that projects the real image of an object] (obbiettivo [4], obiettivo [4])

  11. Homonymy and machine translation

  12. Meaning in MRDs, 2: SYNONYMY • Two words are SYNONYMS if they have the same meaning at least in some contexts • E.g., PRICE and FARE; CHEAP and INEXPENSIVE; LAPTOP and NOTEBOOK; HOME and HOUSE • I’m looking for a CHEAP FLIGHT / INEXPENSIVE FLIGHT • From Roget’s thesaurus: • OBLITERATION, erasure, cancellation, deletion • But few words are truly synonymous in ALL contexts: • I wanna go HOME / ?? I wanna go HOUSE • The flight was CANCELLED / ?? OBLITERATED / ??? DELETED • Knowing about synonyms may help in IR: • NOTEBOOK (get LAPTOPs as well) • CHEAP PRICE (get INEXPENSIVE FARE)

  13. Synonymy in Italian • scorza, Nome [noun] • [1] - (corteccia [1], scorza [1]) [bark] • [2] - parte esterna, involucro dei frutti [outer part, peel of a fruit] (buccia [1], scorza [2]) • [4] - (scorza [4]) "sotto la sua scorza scortese si nasconde un animo nobile" [beneath his rude exterior lies a noble soul]

  14. Problems and limitations of MRDs • Identifying distinct senses is always difficult • Sense distinctions are often subjective • Definitions are often circular • Very limited characterization of the meaning of words

  15. Homonymy vs polysemy • 0100 a supply (of something) for use: a good stock of food • 0200 goods for sale: Some of the stock is being taken without being paid for • 0300 the thick part of a tree trunk • 0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side • 0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed • 0600 a group of animals used for breeding • 0700 farm animals usu. cattle; LIVESTOCK • 0800 a family line, esp. of the stated character • 0900 money lent to a government at a fixed rate of interest • 1000 the money (CAPITAL) owned by a company, divided into SHAREs • 1100 a type of garden flower with a sweet smell • 1200 a liquid made from the juices of meat, bones, etc., used in cooking …

  16. POLYSEMY vs HOMONYMY • In cases like BANK, it’s fairly easy to identify two distinct senses (the etymology is also different). But in other cases the distinctions are more questionable • E.g., senses 0100 and 0200 of stock are clearly related, as are 0600 and 0700, or 0900 and 1000 • In some cases, syntactic tests may help. E.g., KEEP (Hirst, 1987): • Ross KEPT staring at Nadia’s decolletage • Nadia KEPT calm and made a cutting remark • Ross wrote of his embarrassment in the diary that he KEPT. • POLYSEMOUS WORDS: meanings are related to each other • Cf. a human’s foot vs. a mountain’s foot • In general, the distinction between HOMONYMY and POLYSEMY is not always easy to draw (especially with VERBS)

  17. Other aspects of lexical meaning not captured by MRDs • Other semantic relations: • HYPONYMY • ANTONYMY • A lot of other information typically considered part of ENCYCLOPEDIAs: • Trees grow bark and twigs • Adult trees are much taller than human beings

  18. Hyponymy and Hypernymy • HYPONYMY is the relation between a subclass and a superclass: • CAR and VEHICLE • DOG and ANIMAL • BUNGALOW and HOUSE • Generally speaking, a hyponymy relation holds between X and Y whenever it is possible to substitute Y for X: • That is an X -> That is a Y • E.g., That is a CAR -> That is a VEHICLE. • HYPERNYMY is the opposite relation • Knowledge about TAXONOMIES is useful to classify web pages • E.g., Semantic Web • Automatically (e.g., Udo Kruschwitz’s system) • This information is not generally contained in MRDs
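
The substitution test can be illustrated with a toy taxonomy in which each word points to its direct hypernym. This is a minimal sketch under stated assumptions: the table `HYPERNYM` and the helpers `hypernyms` and `is_hyponym_of` are hypothetical names, the first three pairs come from this slide, and the robin chain is copied from the WordNet output shown on slide 25.

```python
# Toy taxonomy: each word maps to its direct hypernym (superclass).
HYPERNYM = {
    "car": "vehicle",
    "dog": "animal",
    "bungalow": "house",
    # Chain taken from the WordNet output for "robin" (slide 25).
    "robin": "thrush",
    "thrush": "oscine",
    "oscine": "passerine",
    "passerine": "bird",
    "bird": "vertebrate",
}


def hypernyms(word):
    """Return the chain of hypernyms of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_hyponym_of(x, y):
    """True if 'That is an X' licenses 'That is a Y' in the toy taxonomy."""
    return y in hypernyms(x)


print(hypernyms("robin"))
print(is_hyponym_of("car", "vehicle"))
print(is_hyponym_of("vehicle", "car"))
```

Because the check walks the whole chain, hyponymy is treated as transitive here: a robin counts as a bird even though the direct link is to thrush.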

  19. The organization of the lexicon [diagram: the WORD-FORMS “eat”, “eats”, “ate”, “eaten” link to the LEXEME EAT-LEX-1, which links to the SENSES eat0600 and eat0700]

  20. The organization of the lexicon [diagram: the WORD-STRING “stock” links to the LEXEMES STOCK-LEX-1, STOCK-LEX-2, STOCK-LEX-3, which link to the SENSES stock0100, stock0200, stock0600, stock0700, stock0900, stock1000]

  21. Synonymy [diagram: WORD-STRINGS “cheap”, “inexpensive”; LEXEMES CHEAP-LEX-1, CHEAP-LEX-2, INEXP-LEX-3; SENSES cheap0100 … cheapXXXX, inexp0900 … inexpYYYY]

  22. A more advanced lexical resource: WordNet • A lexical database created at Princeton • Freely available for research from the Princeton site • http://www.cogsci.princeton.edu/~wn/ • Information about a variety of SEMANTIC RELATIONS • Three sub-databases (supported by psychological research as early as (Fillenbaum and Jones, 1965)) • NOUNS • VERBS • ADJECTIVES and ADVERBS • Each database is organized around SYNSETS

  23. The noun database • About 90,000 forms, 116,000 senses • Relations:

  24. Synsets • Senses (or ‘lexicalized concepts’) are represented in WordNet by the set of words that can be used in AT LEAST ONE CONTEXT to express that sense / lexicalized concept: the SYNSET • E.g., {chump, fish, fool, gull, mark, patsy, fall guy, sucker, shlemiel, soft touch, mug} (gloss: person who is gullible and easy to take advantage of)
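
A synset-based lexicon can be sketched as a list of (words, gloss) pairs plus an index from words to the synsets they belong to. This is only an illustration: `SYNSETS`, `INDEX`, `senses` and `synonyms` are hypothetical names and do not reproduce WordNet's actual storage format. The first synset is quoted from the slide; the second is an assumption added to show that one word can appear in several synsets.

```python
from collections import defaultdict

# Each synset is a (set of words, gloss) pair.
SYNSETS = [
    (frozenset({"chump", "fish", "fool", "gull", "mark", "patsy", "fall guy",
                "sucker", "shlemiel", "soft touch", "mug"}),
     "person who is gullible and easy to take advantage of"),
    # Hypothetical second sense of "fish", added for illustration only.
    (frozenset({"fish"}), "aquatic vertebrate (illustrative gloss)"),
]

# Index from each word to every synset that contains it.
INDEX = defaultdict(list)
for words, gloss in SYNSETS:
    for w in words:
        INDEX[w].append((words, gloss))


def senses(word):
    """Glosses of all synsets that contain `word`."""
    return [gloss for _, gloss in INDEX.get(word, [])]


def synonyms(word):
    """All words sharing at least one synset with `word`."""
    result = set()
    for words, _ in INDEX.get(word, []):
        result |= words
    result.discard(word)
    return result


print(senses("fish"))
print(sorted(synonyms("patsy")))
```

In this layout a sense is identified by the synset itself, so synonymy is simply shared membership and polysemy is membership in more than one synset.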

  25. Hypernyms
      2 senses of robin
      Sense 1
      robin, redbreast, robin redbreast, Old World robin, Erithacus rubecola -- (small Old World songbird with a reddish breast)
        => thrush -- (songbirds characteristically having brownish upper plumage with a spotted breast)
          => oscine, oscine bird -- (passerine bird having specialized vocal apparatus)
            => passerine, passeriform bird -- (perching birds mostly small and living near the ground with feet having 4 toes arranged to allow for gripping the perch; most are songbirds; hatchlings are helpless)
              => bird -- (warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings)
                => vertebrate, craniate -- (animals having a bony or cartilaginous skeleton with a segmented spinal column and a large brain enclosed in a skull or cranium)
                  => chordate -- (any animal of the phylum Chordata having a notochord or spinal column)
                    => animal, animate being, beast, brute, creature, fauna -- (a living organism characterized by voluntary movement)
                      => organism, being -- (a living thing that has (or can develop) the ability to act or function independently)
                        => living thing, animate thing -- (a living (or once living) entity)
                          => object, physical object --
                            => entity, physical thing --

  26. Meronymy
      wn beak -holon
      Holonyms of noun beak
      1 of 3 senses of beak
      Sense 2
      beak, bill, neb, nib
          PART OF: bird

  27. The verb database • About 10,000 forms, 20,000 senses • Relations between verb meanings:

  28. Relations between verbal meanings • V1 ENTAILS V2 when “Someone V1” (logically) entails “Someone V2”, e.g., snore entails sleep • TROPONYMY holds when “to do V1” is “to do V2 in some manner”, e.g., limp is a troponym of walk

  29. The adjective and adverb database • About 20,000 adjective forms, 30,000 senses • 4,000 adverbs, 5,600 senses • Relations:

  30. How to use • Online: http://cogsci.princeton.edu/cgi-bin/webwn • Command line: • Get synonyms: wn bank -synsn • Get hypernyms: wn robin -hypen • Get antonyms (also available for adjectives and verbs): wn right -antsa

  31. ItalWordNet (a local production) • EuroWordNet: created by a European consortium • ItalWordNet: created by ITC • http://www.ilc.cnr.it/iwndb_php/

  32. Other machine-readable lexical resources • Machine readable dictionaries: • LDOCE • Roget’s Thesaurus • The biggest encyclopedia: CYC • Italian: • http://multiwordnet.itc.it/ (IRST)

  33. Readings • WordNet online manuals • C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, The MIT Press

  34. PART II: VECTOR-BASED MODELS OF THE LEXICON AND LEXICAL ACQUISITION

  35. VECTOR-BASED LEXICAL MODELS • Both in Linguistics and in Psychology, researchers have developed theories of the lexicon in which concepts are characterized in terms of FEATURES • E.g., Smith and Medin, 1981; Sartori and Job, 1988 • This type of approach leads to a ‘geometrical’ view of lexical entries as points, or VECTORS, in FEATURE SPACE • This type of model can account for which words ‘mean the same’ • A particularly simple version of this theory is the one in which the ‘features’ are simply other words • Vector-space models have been shown to correlate well with the results of psychological experiments, particularly on SEMANTIC PRIMING

  36. VECTOR-BASED MODELS AND LEXICAL ACQUISITION • Vector-based models (both the feature-based and the word-based variety) are also interesting because they can serve as the basis for models of lexical acquisition • These models are interesting • From a psychological point of view, to explain how concepts are stored in memory • In neuroscience, they are being used to investigate SEMANTIC CATEGORY DEFICITS (e.g., Caramazza, Tyler et al, Vigliocco et al) • From a linguistic point of view, because they can address the problems encountered by lexicographers when trying to specify word senses • From a practical point of view: most MRDs these days contain at least some information derived by computational means

  37. Feature-based lexical semantics • Very old idea in Linguistics: the meaning of a word can be specified in terms of the values of certain ‘features’ (‘DECOMPOSITIONAL SEMANTICS’) • dog: ANIMATE=+, EAT=MEAT, SOCIAL=+ • horse: ANIMATE=+, EAT=GRASS, SOCIAL=+ • cat: ANIMATE=+, EAT=MEAT, SOCIAL=- • E.g., Katz and Fodor, 1968
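
The three feature bundles can be written down directly as dictionaries and compared. A minimal sketch: the feature values are the ones on the slide, while the `overlap` measure (the fraction of features on which two words agree) is an assumption chosen for simplicity, not a measure proposed in the lectures.

```python
# Feature bundles copied from the slide.
FEATURES = {
    "dog":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "+"},
    "horse": {"ANIMATE": "+", "EAT": "GRASS", "SOCIAL": "+"},
    "cat":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "-"},
}


def overlap(a, b):
    """Fraction of features on which words `a` and `b` have the same value."""
    fa, fb = FEATURES[a], FEATURES[b]
    keys = fa.keys() | fb.keys()
    same = sum(1 for k in keys if fa.get(k) == fb.get(k))
    return same / len(keys)


for pair in [("dog", "horse"), ("dog", "cat"), ("horse", "cat")]:
    print(pair, round(overlap(*pair), 2))
```

With only three features, dog is equally close to horse and to cat, and horse and cat are furthest apart, which shows how strongly the result depends on the choice of features.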

  38. PSYCHOLOGY: THE FUSS MODEL (Vinson and Vigliocco, 2002, 2003)

  39. Vector-based lexical semantics [figure: CAT, DOG, HORSE]

  40. WORD-BASED VECTOR-SPACE LEXICAL MODELS, I

  41. WORD-BASED VECTOR SPACE MODELS, II

  42. WORD-BASED VECTOR-SPACE MODELS, III

  43. Measures of semantic similarity • Euclidean distance • Cosine • Manhattan metric
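
The formulas for these three measures are not preserved in the transcript. The sketch below uses their standard textbook definitions for two vectors x and y of equal length, which may differ in notation from the original slide. The function names and the example vectors are illustrative assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance: sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance: sum_i |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    """Cosine similarity: (x . y) / (|x| |y|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical co-occurrence vectors: y points in the same direction as x
# but is twice as long.
x = [1.0, 0.0, 2.0]
y = [2.0, 0.0, 4.0]

print(euclidean(x, y), manhattan(x, y), cosine(x, y))
```

The example shows why cosine is often preferred for frequency-based vectors: it ignores vector length, so a frequent word and a rare word with the same distribution over contexts come out as maximally similar, while both distance measures treat them as different.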

  44. DIMENSIONALITY REDUCTION

  45. Concept clustering (aka: automatic taxonomy discovery) [figure: the words Year, Month, Day grouped under Time; Car, Van, Airplane under Vehicle; Joy, Love, Fear under Feeling]

  46. Some psychological evidence for vector-space representations • Burgess and Lund (1996, 1997): the clusters found with HAL correlate well with those observed in semantic priming experiments. • Landauer, Foltz, and Laham (1997): scores overlap with those of humans on standard vocabulary and topic tests; mimic human scores on category judgments; etc. • Evidence about ‘prototype theory’ (Rosch et al, 1976) • Posner and Keele, 1968 • subjects were presented with patterns of dots that had been obtained by variations from a single pattern (‘prototype’) • Later, they recalled prototypes better than samples they had actually seen • Rosch et al, 1976: ‘basic level’ categories (apple, orange, potato, carrot) have higher ‘cue validity’ than elements higher in the hierarchy (fruit, vegetable) or lower (red delicious, cox)

  47. General characterization of vector-based semantics (from Charniak) • Vectors as models of concepts • The CLUSTERING approach to lexical semantics: • Define properties one cares about, and give values to each property (generally, numerical) • Create a vector of length n for each item to be classified • Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another • What changes between models: • The properties used in the vector • The distance metric used to decide if two points are ‘close’ • The algorithm used to cluster
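
The three steps of the clustering approach can be sketched with a small agglomerative procedure. This is one possible instantiation under stated assumptions, not the algorithm used in any of the models cited: the distance metric is Euclidean, clusters are merged by single link (distance between their two closest members), and the 2-dimensional vectors are invented for illustration.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_link(c1, c2, vectors, dist):
    """Distance between two clusters: that of their two closest members."""
    return min(dist(vectors[a], vectors[b]) for a in c1 for b in c2)


def cluster(vectors, k, dist=euclidean):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]],
                                      vectors, dist),
        )
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical property values, invented for this example.
VECTORS = {
    "car": (9.0, 1.0), "van": (8.0, 1.5), "airplane": (7.5, 0.5),
    "joy": (1.0, 8.0), "love": (1.5, 9.0), "fear": (0.5, 7.5),
}

print(cluster(VECTORS, 2))
```

Each of the three points of variation named on the slide corresponds to one replaceable part of the sketch: the contents of `VECTORS`, the `dist` argument, and the merge rule in `single_link`.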

  48. Using words as features in a vector-based semantics • The old decompositional semantics approach requires • Specifying the features • Characterizing the value of these features for each lexeme • Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry • Intuition: “You can tell a word’s meaning from the company it keeps” • More specifically, you can use as ‘values’ of these features • The FREQUENCIES with which these words occur near the words whose meaning we are defining • Or perhaps the PROBABILITIES that these words occur next to each other • Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)

  49. Using neighboring words to specify the meaning of words • Take, e.g., the following corpus: • John ate a banana. • John ate an apple. • John drove a lorry. • We can extract the following co-occurrence matrix:
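
The matrix itself is not preserved in the transcript. The sketch below recomputes one from the three sentences, assuming that two words co-occur whenever they appear in the same sentence; the original slide may have used a different notion of neighbourhood. The function names are illustrative.

```python
from collections import Counter, defaultdict

CORPUS = ["John ate a banana.", "John ate an apple.", "John drove a lorry."]


def tokenize(sentence):
    """Lowercase, drop the final full stop and split on whitespace."""
    return sentence.lower().rstrip(".").split()


def cooccurrence(sentences):
    """Count, for each word w, how often each other word v shares a sentence with it."""
    counts = defaultdict(Counter)
    for s in sentences:
        tokens = tokenize(s)
        for i, w in enumerate(tokens):
            for j, v in enumerate(tokens):
                if i != j:
                    counts[w][v] += 1
    return counts


counts = cooccurrence(CORPUS)
vocab = sorted(counts)

# Print the matrix: one row per word, one column per co-occurring word.
print(" " * 8 + "".join(f"{v:>8}" for v in vocab))
for w in vocab:
    print(f"{w:<8}" + "".join(f"{counts[w][v]:>8}" for v in vocab))
```

In the resulting matrix the rows for banana and apple both have a non-zero count for ate and a zero count for drove, while the row for lorry shows the opposite pattern, which is the kind of contrast the vector representation is meant to capture.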

  50. Acquiring lexical vectors from a corpus (Schuetze, 1991; Burgess and Lund, 1997) • To construct vectors C(w) for each word w: • Scan a text • Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size • Differences among methods: • Size of window • Weighted or not • Whether every word in the vocabulary counts as a dimension (including function words such as ‘the’ or ‘and’) or whether instead only some specially chosen words are used (typically, the m most common content words in the corpus; or perhaps modifiers only). The words chosen as dimensions are often called CONTEXT WORDS • Whether dimensionality reduction methods are applied
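
The scanning procedure and the first three points of variation can be sketched as a single function with parameters. This is an illustration under stated assumptions rather than the implementation of either cited system: `build_vectors` and its parameter names are hypothetical, the optional weighting of 1/distance is one simple choice among many, and the three-sentence corpus from slide 49 is treated as one token stream, so windows may cross sentence boundaries.

```python
from collections import defaultdict


def build_vectors(tokens, window=2, context_words=None, weighted=False):
    """Return {w: {v: count}} for words v within `window` positions of w.

    context_words: if given, only these words are used as dimensions.
    weighted: if True, a neighbour at distance d contributes 1/d instead of 1.
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            v = tokens[j]
            if context_words is not None and v not in context_words:
                continue
            vectors[w][v] += (1.0 / abs(i - j)) if weighted else 1.0
    return vectors


tokens = "john ate a banana john ate an apple john drove a lorry".split()

# Unweighted counts, with only three hand-picked context words as dimensions.
vecs = build_vectors(tokens, window=2, context_words={"john", "ate", "drove"})
print(dict(vecs["banana"]))
print(dict(vecs["lorry"]))

# Weighted counts over the full vocabulary.
wvecs = build_vectors(tokens, window=2, weighted=True)
print(dict(wvecs["lorry"]))
```

Restricting the dimensions to a set of context words keeps function words such as "a" and "an" out of the vectors, and the window size controls how much of the neighbourhood counts as the "company" a word keeps.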
