Word association thesaurus as a resource for building wordnet

Word Association Thesaurus as a Resource for Building Wordnet

Anna Sinopalnikova

Masaryk University, Brno, Czech Republic

Saint-Petersburg State University, Russia

anna@fi.muni.cz


Overview

  • Types of LRs used

  • What is Word Association?

  • Information to be extracted from WAT

  • WAT vs. Corpus

  • Conclusions

  • Future plans


What kind of language resources are used to build wordnets?

Primary resources

  • e.g. text corpora

  • present (more or less) ‘raw’ data on the language in use

  • information is given implicitly

Derived resources

  • e.g. explanatory dictionaries, Roget-type thesauri

  • present explications of internal knowledge of language

  • based on primary resources + intuition

  • information is given explicitly


What is better?

  • To build an adequate and reliable lexical database (e.g. a wordnet) it is not enough to rely upon information produced by ‘experts’ (i.e. linguists, lexicographers).

  • One should rather explore the raw data, and extract information from language in its actual and its potential use.

  • Corpora reign!


Word Association

  • Association: “connection or relation between 2 entities (perceptions, ideas or words), that manifests in a following way: an appearance of one entity entails the appearance of the other in the mind”

  • Word Association: an appearance of one word entails the appearance of the other in the mind


Association: examples

  • Kill -> Bill

  • … -> Nike

  • … -> Kill Bill


Word Association Test

  • Generally, a list of words (stimuli) is given to subjects (either in writing or in oral form). The subjects are asked to respond with the first word that comes into their mind (responses).

  • Other methods: controlled association test, priming etc.


‘Cat’ stimulates

  • Dog 49, mouse 8, black 4, animal 2, eyes, gut, kitten, tom 2, bit, Cheshire, claw, claws, enigma, feline, furry, hearth, house, kin, kittens, milk, pet, pussy, todd 1

    (of 100 people asked)
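
The response counts above can be turned into relative association strengths by dividing each count by the number of subjects. The following is a minimal Python sketch under that assumption; it uses only the three most frequent responses from the slide, and the names `cat_responses` and `association_strength` are illustrative, not taken from the presentation.

```python
# Response counts for the stimulus 'cat' (the three most frequent
# responses from the slide; 100 subjects were asked).
cat_responses = {"dog": 49, "mouse": 8, "black": 4}
SUBJECTS = 100


def association_strength(counts, n_subjects):
    """Return, for each response, the share of subjects who gave it."""
    return {response: count / n_subjects for response, count in counts.items()}


strengths = association_strength(cat_responses, SUBJECTS)

# Print the responses from strongest to weakest association.
for response, strength in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(f"cat -> {response}: {strength:.2f}")
```

Relative strengths make stimuli comparable even when the number of subjects differs between experiments.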


Word Association Norms (WAN)

  • WAN represents the data collected through a series of WA tests carried out according to the standard technique.

  • The body of WAN: list of responses and their absolute frequencies for each stimulus word

    E.g. Kent & Rosanoff (1910) 100 stimuli - 1000 subjects

    Palermo & Jenkins (1964) 200 stimuli - 1000 subjects


Word Association Thesaurus (WAT)

  • WAT is a kind of WAN

  • WAT differs from other WAN not only in volume but also in the procedure of data collection, which is cyclic: a small set of stimuli is used as the starting point of the experiment, the responses obtained for them are used as stimuli in the next stage, and the cycle is repeated at least 3 times.

  • Being a thesaurus, WAT is expected to cover ‘all’ the vocabulary (all the words relevant for the language) and reflect the basic structure of a particular language (all the relations between words relevant for this particular language system).

  • E.g. Kiss et al. (1972): about 54,000 words; Nelson et al. (1973-1990): about 75,000 words; Karaulov et al. (1994-1998): 23,000 words
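
The cyclic collection procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `collect_wat`, the `ask` callback, and the `toy_answers` table are hypothetical stand-ins for a real experiment with human subjects.

```python
def collect_wat(seed_stimuli, ask, cycles=3):
    """Run the cyclic procedure: responses of one stage become the next stimuli."""
    thesaurus = {}                 # stimulus -> list of responses
    stimuli = list(seed_stimuli)
    for _ in range(cycles):
        next_stimuli = []
        for stimulus in stimuli:
            if stimulus in thesaurus:
                continue           # already tested in this or an earlier stage
            responses = ask(stimulus)
            thesaurus[stimulus] = responses
            # Only words not yet used as stimuli go on to the next stage.
            next_stimuli.extend(r for r in responses if r not in thesaurus)
        stimuli = next_stimuli
    return thesaurus


# Toy stand-in for the subjects: a fixed lookup table (hypothetical data).
toy_answers = {
    "cat": ["dog", "mouse"],
    "dog": ["cat", "bone"],
    "mouse": ["cat", "cheese"],
    "bone": ["dog"],
    "cheese": ["mouse"],
}

wat = collect_wat(["cat"], lambda word: toy_answers.get(word, []))
print(sorted(wat))
```

Starting from the single seed word, each cycle widens the vocabulary, which is how a small starting set can grow into a thesaurus-sized resource.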


What kind of linguistic information could be extracted from WAT?

  • The core concepts of the language

  • Syntagmatic & paradigmatic relations between words presented explicitly (as opposed to text corpora)

  • Relevance of word senses for native speakers

  • Relevance of relations for native speakers

  • Domain information that is shown (as opposed to dictionaries)

  • Semantic classification of words obtained by using formal criteria


The core concepts of the language

  • In every language there is a finite number of words that appear as responses more frequently than other words. This set is quite stable:

    • it does not change much over time;

    • it doesn’t depend on the starting circumstances, e.g. on the words that were chosen as stimulus words

      Russian: ‘man’, ‘house’, ‘love’, ‘life’, ‘be/eat’, ‘think’, ‘live’, ‘go’, ‘big/large’, ‘good’, ‘bad’, ‘no/not’...

      295 words with more than 100 relations

      English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small…

      586 words with more than 100 relations

  • Cf. EuroWordNet Basic Concepts
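
One way to find such core words is to count, for every response word, how many distinct stimuli elicit it and keep the words above a threshold. The sketch below assumes that a ‘relation’ means a distinct stimulus-response link; the function `core_concepts` and the data in `toy_wat` are hypothetical, and the threshold is lowered from 100 to 1 to fit the toy data.

```python
from collections import defaultdict


def core_concepts(wat, min_relations):
    """Return response words elicited by more than `min_relations` distinct stimuli."""
    incoming = defaultdict(set)    # response -> set of stimuli that elicit it
    for stimulus, responses in wat.items():
        for response in responses:
            incoming[response].add(stimulus)
    return sorted(word for word, stimuli in incoming.items()
                  if len(stimuli) > min_relations)


# Hypothetical miniature thesaurus: stimulus -> responses.
toy_wat = {
    "woman": ["man", "love"],
    "boy": ["man", "girl"],
    "husband": ["man", "wife", "love"],
}

core = core_concepts(toy_wat, min_relations=1)
print(core)
```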


Syntagmatic relations

E.g. Cat -> black, Cheshire, pussy;

Cat -> mat, nip, purr

  • Law of contiguity: through life we learn “what goes together” and reproduce it together

  • Right and left contexts of a word

  • Help to acquire:

    • Selectional preferences, valency frames

    • Semantic relations between words (e.g. ROLE/INVOLVED)

    • Distinguishing different senses of a word

    • Establishing relations of synonymy, hyponymy, and antonymy

      Cf. text corpora


Paradigmatic relations

E.g. Cat-> dog, mouse, animal, pet;

Cat-> eyes, claw

  • Synonyms, hyponyms/hyperonyms/co-hyponyms, meronyms/holonyms, or antonyms

  • Law of contiguity???

  • Help us to acquire:

    • This information may be included directly in terms of semantic relations between wordnet entries

    • Also it helps us to enrich and to check out the set of relations encoded earlier



Domain information

  • E.g., hospital –> nurse, doctor, pain, ill, injury, load…

  • This type of data is not so easily extracted from corpora; in explanatory dictionaries it is only partly presented

  • It is crucial when we consider wordnet usage in information retrieval (IR).


Relevance of word senses for native speakers

  • WAT: for each word 80% of associations are related to 1-3 of its senses.

  • Cf. Corpus: 90% of occurrences of a word

  • That allows us:

    • to measure the relevance of a particular word sense for native speakers.

    • to find an appropriate place for it in the hierarchy of senses.

    • to define the necessary level of sense granularity: to include in a wordnet no more and no fewer senses of each word than native speakers actually differentiate.

  • Problem: emotionally coloured senses are thus overestimated. E.g. Russian дать (‘to give’) -> в рожу (‘in the face’, i.e. ‘to punch someone in the face’)


Relevance of relations for native speakers

  • It is clear that in a WN words must have at least a hyperonym and desirably a synonym.

  • Other relations???

  • Relations are not the same for different PoS, but also they are not the same for different words within the same PoS.

    E.g. buy CONVERSIVE sell, while cry INVOLVED_AGENT baby.


WAT vs. Corpus

  • Compare a corpus to WAT:

    Wettler & Rapp (1996), Willners (2001): correlation between the frequency of co-occurrence of words X and Y in a corpus and the strength of the X-Y association in WAT.

  • Compare WAT to a corpus?
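
A comparison of this kind can be sketched as a correlation between two number series. The Python example below computes a plain Pearson coefficient; this choice of measure is an assumption (the cited studies may have used a different one), the co-occurrence counts in `pairs` are hypothetical, and only the association strengths for ‘cat’ come from the earlier slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (stimulus, response) -> (corpus co-occurrence count, WAT association strength).
# The co-occurrence counts are invented for illustration only.
pairs = {
    ("cat", "dog"): (120, 0.49),
    ("cat", "mouse"): (40, 0.08),
    ("cat", "black"): (25, 0.04),
}

cooccurrence = [count for count, _ in pairs.values()]
strength = [s for _, s in pairs.values()]

r = pearson(cooccurrence, strength)
print(f"correlation: {r:.3f}")
```

With real data the two lists would cover all stimulus-response pairs of the thesaurus, and a rank correlation might be preferable because frequency distributions are heavily skewed.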


WAT vs. Corpus (2)

Coverage: 64% of word associations do not occur in the corpus
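
A coverage figure of this kind can be computed as the share of WAT pairs for which no co-occurrence is found in the corpus. The sketch below is hypothetical throughout: the function `missing_share`, the pair lists, and the idea of representing the corpus as a ready-made set of co-occurring pairs are assumptions, since the slides do not say how co-occurrence was determined.

```python
def missing_share(wat_pairs, corpus_pairs):
    """Return the share of WAT pairs absent from the corpus, and the pairs themselves."""
    missing = [pair for pair in wat_pairs if pair not in corpus_pairs]
    return len(missing) / len(wat_pairs), missing


# Hypothetical stimulus-response pairs taken from a thesaurus.
wat_pairs = [
    ("cat", "dog"),
    ("cat", "Cheshire"),
    ("hospital", "nurse"),
    ("cat", "enigma"),
]

# Hypothetical set of word pairs that co-occur somewhere in a corpus.
corpus_pairs = {("cat", "dog"), ("hospital", "nurse")}

share, missing = missing_share(wat_pairs, corpus_pairs)
print(f"{share:.0%} of associations are not found in the corpus: {missing}")
```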


WAT vs. Corpus (3)

Table 1. Distribution of word associations that do not occur in the corpus.

NB! It is mostly syntagmatic WAs that are missing, not paradigmatic ones


Conclusions

  • The advantages of using WAT in wordnet construction:

    • Great variety of linguistic information extracted.

      WAT is equal to or excels other LRs in several respects.

    • ‘Raw’ data (as opposed to theoretical data, cf. conventional dictionaries, which involve the researcher’s introspection and intuition and hence lead to over- and underestimation of language phenomena).

      WAT is comparable to a balanced text corpus, and could supply all the necessary empirical information in the absence of the latter.

    • Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of language phenomena).

  • Parallel usage of WAT and other LRs is an effective way of:

    • constantly checking wordnet construction,

    • refining wordnet and

    • expanding wordnet


Future plans

  • WAT vs. Corpus vs. Wordnet

    • Czech: small – large – middle

    • English: large – large – large

    • Russian: large – middle - small


Thank you…