Survey on WSD and IR Apex@SJTU
WSD: Introduction • Problems in online news retrieval system: query: “major” Articles retrieved: • about “Prime Minister John Major MP” • “major” appears as an adjective • “major” appears as a military rank
WSD: Introduction • Gale, Church and Yarowsky (1992) cite work dating back to 1950. • For many years, WSD was applied only to limited domains and a small vocabulary. • In recent years, disambiguators are applied to resolve the senses of words in a large heterogeneous corpus. • With a more accurate representation and a query also marked up with word sense, researchers believe that the accuracy of retrieval would have to improve.
Approaches to disambiguation • Disambiguation based on manually generated rules • Disambiguation using evidence from existing corpora.
Disambiguation based on manually generated rules • Weiss (1973): • general context rule: If the word “type” appears near to “print”, it most likely meant a small block of metal bearing a raised character on one end. • template rule: If “of” appears immediately after “type”, it most likely meant a subdivision of a particular kind of thing.
Weiss (1973): • Template rules were more reliable, so they were applied first. • To create rules: • Examine 20 occurrences of an ambiguous word. • Test these manually created rules on a further 30 occurrences. • Accuracy: 90% • Cause of errors: idiomatic uses.
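The two rule types for "type" can be sketched in a few lines of Python. This is only an illustration of the rule ordering described on the slides (template rule first, then general context rule); the sense labels, the function name `disambiguate_type`, and the window size of 5 tokens are assumptions, not details from Weiss's system.

```python
import re

def disambiguate_type(sentence, window=5):
    """Return a (hypothetical) sense label for the first 'type', or None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if "type" not in tokens:
        return None
    i = tokens.index("type")
    # Template rule, tried first: "of" immediately after "type".
    if i + 1 < len(tokens) and tokens[i + 1] == "of":
        return "kind"
    # General context rule: "print" within `window` tokens of "type".
    nearby = tokens[max(0, i - window): i + window + 1]
    if "print" in nearby:
        return "printing_block"
    # No rule fired: leave the word unresolved.
    return None
```

Because the template rule is checked first, a sentence such as "This type of print is rare" is assigned the "kind" sense even though "print" is nearby.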
Disambiguation based on manually generated rules • Kelly and Stone (1975): • created a set of rules for 6,000 words • consisted of contextual rules similar to those of Weiss • in addition, used grammatical category of a word as a strong indicator of sense: • “the train” and “to train”
Kelly and Stone (1975): • The grammar and context rules were grouped into sets so that only certain rules were applied in certain situations. • Conditional statements controlled the application of rule sets. • Unlike Weiss’s system, this disambiguator was designed to process a whole sentence at the same time. • Accuracy: not a success
Disambiguation based on manually generated rules • Small and Rieger (1982) came to similar conclusions. • When this type of disambiguator was extended to work on a larger vocabulary, the effort involved in building it became too great. • Since the 1980s, WSD research has concentrated on automatically generated rules based on sense evidence derived from a machine-readable corpus.
Disambiguation using evidence from existing corpora • Lesk (1988): • Resolve the sense of “ash” in: There was ash from the coal fire. • Dictionary definitions looked up: • ash(1): The soft grey powder that remains after something has been burnt. • ash(2): A forest tree common in Britain. • Definitions of context words looked up: • coal(1): A black mineral which is dug from the earth, which can be burnt to give heat. • fire(1): The condition of burning; flames, light and great heat. • fire(2): The act of firing weapons or artillery at an enemy.
Lesk (1988): • Sense definitions are ranked by scoring function based on the number of words that co-occur. • Questionable: how often the word overlap necessary for disambiguation occurred. • Accuracy: “very brief experimentation”, 50%--70% • No analysis for the failure, although definition length is recognized as a possible factor in deciding which dictionary to use.
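The overlap scoring for the "ash" example can be sketched as follows. This is a simplified variant, assuming each sense of the target word is scored against the pooled definitions of the context words; the stopword list and the function names are illustrative assumptions, and no stemming is applied (so "burning" does not match "burnt").

```python
import re

# Small hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "of", "which", "is", "in", "and", "to", "that",
             "be", "can", "from", "at", "or", "has", "been", "after", "something"}

def content_words(text):
    """Lowercased set of words in `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def rank_senses(target_senses, context_definitions):
    """Rank senses by how many words they share with the context definitions."""
    context = set()
    for definition in context_definitions:
        context |= content_words(definition)
    scores = {sense: len(content_words(text) & context)
              for sense, text in target_senses.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ASH = {
    "ash(1)": "The soft grey powder that remains after something has been burnt.",
    "ash(2)": "A forest tree common in Britain.",
}
CONTEXT = [
    "A black mineral which is dug from the earth, which can be burnt to give heat.",
    "The condition of burning; flames, light and great heat.",
    "The act of firing weapons or artillery at an enemy.",
]
ranking = rank_senses(ASH, CONTEXT)
```

Here ash(1) wins only through the single shared word "burnt", which illustrates why the frequency of usable word overlap was questioned.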
Disambiguation using evidence from existing corpora • Wilks et al. (1990): • addressed this word overlap problem by using a technique of expanding a dictionary definition with words that commonly co-occurred with the text of that definition. • Co-occurrence information was derived from all definition texts in the dictionary.
Wilks et al. (1990): • Longman’s Dictionary of Contemporary English (LDOCE): all its definitions were written using a simplified vocabulary of around 2,000 words. • Few synonyms, which are a distracting element in the co-occurrence calculation. • “bank”: • for economic sense: “money”, “check”, “rob” • for geographical sense: “river”, “flood”, “bridge” • Accuracy: “bank” in 200 sentences, judged correct if the chosen sense coincides with the one manually chosen: 53% at the fine-grained level (13 senses) and 85% at the coarse-grained level (5 senses). • They suggested using simulated annealing to disambiguate a whole sentence simultaneously.
Disambiguating simultaneously • Cowie et al. (1992): • Accuracy: tested on 67 sentences, 47% for fine-grained senses and 72% for coarse-grained ones. • No comparison with Wilks et al.’s results. • No baseline. • A possible baseline: senses chosen at random. • A better one: always select the most common sense.
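The two baselines mentioned above are easy to compute. The sketch below assumes a list of gold sense labels per occurrence; the toy labels and function names are invented for illustration and are not data from any of the cited experiments.

```python
from collections import Counter

def most_common_sense_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training sense."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for sense in test_labels if sense == most_common)
    return hits / len(test_labels)

def random_baseline_expected_accuracy(num_senses):
    """Expected accuracy of picking uniformly among `num_senses` senses."""
    return 1.0 / num_senses

# Hypothetical sense-tagged occurrences of one ambiguous word.
train = ["money"] * 8 + ["river"] * 2
test = ["money", "money", "river", "money"]

mcs = most_common_sense_accuracy(train, test)
rnd = random_baseline_expected_accuracy(2)
```

With a skewed sense distribution the most-common-sense baseline is much harder to beat than random choice, which is why it is the more informative comparison.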
Manually tagging a corpus • A technique from POS tagging: • manually mark up a large text corpus with POS tags, and then train a statistical classifier to associate features with occurrences of the tags. • Ng and Lee (1996): • disambiguate 192,000 occurrences of 191 words. • examine the following features: • POS and morphological form of the sense-tagged word • unordered set of its surrounding words • local collocations relative to it • and, if the sense-tagged word was a noun, the presence of a verb was also noted.
Ng and Lee (1996): • Experiments: • separated their corpus into training and test sets on an 89%--11% split • accuracy: 63.7% (baseline: 58.1%) • sense definitions used were from WordNet, 7.8 senses per word for nouns and 12.0 senses for verbs • no comparison possible between WordNet and LDOCE definitions
Using thesauri: Yarowsky (1992) • Roget’s thesaurus: 1,042 semantic categories • Grolier Multimedia Encyclopedia • To decide which semantic category an ambiguous word occurrence should be assigned: • a set of clue words, one set for each category, was derived from a POS tagged corpus • the context of each occurrence was gathered • a term selection process similar to relevance feedback was used to derive clue words
Yarowsky (1992) • E.g. clue words for animal/insects: species, family, bird, fish, cm, animal, tail, egg, wild, common, coat, female, inhabit, eat, nest • Comparison between words in the context and the clue word sets • Accuracy: 12 ambiguous words, several hundred occurrences, 92% accuracy on average • Comparisons were suspect.
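The comparison between context words and clue word sets can be sketched as a weighted vote per category. All numeric weights below are invented, as are the TOOLS/MACHINERY clue words and the example sentences; only some of the animal/insect clue words come from the slide. This is a sketch of the general idea, not Yarowsky's actual weighting scheme.

```python
import re

# Category -> {clue word: weight}. Weights are hypothetical.
CLUE_WEIGHTS = {
    "ANIMAL/INSECT": {"species": 2.0, "family": 1.0, "bird": 2.0, "fish": 1.5,
                      "nest": 2.0, "egg": 1.5, "wild": 1.0},
    "TOOLS/MACHINERY": {"tool": 2.0, "machine": 2.0, "engine": 1.5,
                        "lift": 1.0, "steel": 1.0},
}

def classify(context, clue_weights=CLUE_WEIGHTS):
    """Return (best category, per-category scores) for a context string."""
    tokens = re.findall(r"[a-z]+", context.lower())
    scores = {category: sum(weights.get(tok, 0.0) for tok in tokens)
              for category, weights in clue_weights.items()}
    best = max(scores, key=scores.get)
    return best, scores

animal, animal_scores = classify(
    "The crane is a wild bird that builds its nest near water")
machine, machine_scores = classify(
    "The crane is a machine used to lift steel beams")
```

The same ambiguous word ("crane") is assigned to different categories depending on which clue words appear in its context.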
Testing disambiguators • Few “pre-disambiguated” test corpora are publicly available. • A sense-tagged version of the Brown corpus, called SEMCOR, is available. • A TREC-like evaluation effort, called SENSEVAL, is underway.
WSD and IR experiments • Voorhees (1993): based on WordNet: • Each of 90,000 words and phrases is assigned to one or more synsets. • A synset is a set of words that are synonyms of each other; the words of a synset define it and its meaning. • All synsets are linked together to form a mostly hierarchical semantic network based on hypernymy and hyponymy. • Other relations: meronymy, holonymy, antonymy.
Voorhees (1993): • the hood of a word sense contained in synset s: • the largest connected subgraph that • contains s; • contains only descendants of an ancestor of s; • contains no synset that has a descendant that includes another instance of a member of s. • Result: retrieval was consistently worse, because senses were tagged inaccurately
The hood of the first sense of “house” would include the words: housing, lodging, apartment, flat, cabin, gatehouse, bungalow, cottage.
Wallis (1993) • replace words with definitions from LDOCE. • “ocean” and “sea”: ocean: The great mass of salt water that covers most of the earth; sea: the great body of salty water that covers much of the earth’s surface. • disappointing results. • no analysis of the cause.
Sussna (1993) • Assign a weight to all relations and calculate the semantic distance between two synsets. • Calculate the semantic distance between the context words and each of the candidate synsets to rank the synsets. • Parameters: size of context (41 as optimal), the number of words disambiguated simultaneously (only 10, because of computational cost). • Accuracy: 56%
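Ranking candidate synsets by semantic distance can be sketched with a shortest-path search over a weighted graph. The toy network, the fixed edge weights, and the synset names below are all assumptions made for illustration; Sussna's actual weighting of relations is not reproduced here.

```python
import heapq

def build_graph(edges):
    """Build an undirected weighted graph from (a, b, weight) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def shortest_distances(graph, source):
    """Dijkstra's algorithm: distance from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

def rank_synsets(graph, candidates, context_synsets):
    """Rank candidate synsets by total distance to the context synsets."""
    totals = {}
    for cand in candidates:
        dist = shortest_distances(graph, cand)
        totals[cand] = sum(dist.get(s, float("inf")) for s in context_synsets)
    return sorted(totals.items(), key=lambda kv: kv[1])

# Hypothetical fragment of a semantic network with made-up weights.
GRAPH = build_graph([
    ("bank#finance", "financial_institution", 1.0),
    ("financial_institution", "money", 1.0),
    ("financial_institution", "loan", 1.0),
    ("financial_institution", "entity", 2.0),
    ("bank#river", "slope", 1.0),
    ("slope", "river", 1.0),
    ("slope", "entity", 2.0),
])
ranking = rank_synsets(GRAPH, ["bank#finance", "bank#river"], ["money", "loan"])
```

The candidate with the smallest total distance to the context is ranked first. Running one shortest-path search per candidate over a large network hints at why the number of words disambiguated together had to be limited.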
Analyses of WSD & IR • Krovetz & Croft: sense mismatches were significantly more likely to occur in non-relevant documents. • word collocation • skewed frequency distribution • Situations under which WSD may prove useful: • where collocation is less prevalent • where query words were used in a minority sense
Analyses of WSD & IR • Sanderson (1994,1997): • pseudo-words: banana/kalashnikov/anecdote • experiments on the factor of query length: effectiveness of retrievals based on short query was greatly affected by the introduction of ambiguity but much less so for longer queries.
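The pseudo-word technique adds artificial ambiguity by replacing every occurrence of several unrelated words with one shared token. A minimal sketch, assuming simple alphabetic tokenisation; the function names and the example sentence are illustrative.

```python
import re

def make_pseudo_word_map(groups):
    """Map each member word to the pseudo-word formed by joining its group."""
    mapping = {}
    for group in groups:
        pseudo = "/".join(group)
        for word in group:
            mapping[word] = pseudo
    return mapping

def add_ambiguity(text, mapping):
    """Replace every mapped word in `text` with its pseudo-word."""
    def replace(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", replace, text)

MAPPING = make_pseudo_word_map([("banana", "kalashnikov", "anecdote")])
doc = add_ambiguity("He told an anecdote about a banana", MAPPING)
```

After the replacement, "anecdote" and "banana" are indistinguishable to the retrieval system. Since the original word is known, the correct "sense" of each pseudo-word occurrence is available for free, so the effect of ambiguity on retrieval can be measured directly.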
Analyses of WSD & IR • Gonzalo et al. (1998): experiments based on SEMCOR: write a summary for each document and use it as a query, so that each query has exactly one relevant document. • Cause of errors: senses may be too specific, e.g. newspaper as a business concern as opposed to the physical object
Gonzalo et al. (1998): • synset-based representation: retrieval based on synsets seems to be the best • erroneous disambiguation and its impact on retrieval effectiveness: • baseline precision: 52.6% • with 30% disambiguation error: precision 54.4% • with 60% disambiguation error: precision 49.1%
Sanderson (1997): • output word sense in a list ranked by a confidence score • accuracy: worse than the one without sense, better than the one tagged with one sense. • possible cause: errors.
Disambiguation without sense definition • Zernik (1991): • generate clusters for an ambiguous word using three criteria: context words, grammatical category and derivational morphology. • associate each cluster with a dictionary sense. • e.g. “train”: 95% accuracy, owing to grammatical category • “office”: full of errors
Disambiguation without sense definition • Schütze and Pedersen (1995): • One of the very few results to show an improvement (14%) • Clusters based on context words only: words with similar contexts are put into the same cluster, but a cluster is recognized only if the context appears more than fifty times in the corpus • Similar contexts of “ball”: tennis, football, cricket. Thus this method breaks up a word’s commonest sense into a number of uses (e.g. the sporting sense of ball).
Schütze and Pedersen (1995): • score each use of a word • represent a word occurrence by • just the word • the word with its commonest use • the word with n of its uses
WSD in IR Revisited (SIGIR ’03) • Skewed frequency distributions coupled with the query term co-occurrence effect are the reasons why traditional IR techniques that don’t take sense into account are not penalized severely. • Inaccurate fine-grained WSD has an extremely negative effect on the performance of an IR system. • To achieve increases in performance, it is imperative to minimize the impact of inaccurate disambiguation. • The need for 90% accurate disambiguation in order to see performance increases remains questionable.
The WSD methods applied • A number of experiments were tried, but nothing better than the following was found: applying each knowledge source (collocations, co-occurrence, and sense frequency) in a stepwise fashion: • use a context window consisting of the sentence surrounding the target word to identify the sense of the word • examine the surrounding sentence to see whether it contains any collocates observed in SEMCOR • specific sense data
WSD in IR Revisited: Conclusions • Reasons for success: • high-precision WSD technique • sense frequency statistics • Resilience of the vector space model • Analysis of Schütze and Pedersen’s success: added tolerance
“A highly accurate bootstrapping algorithm for word sense disambiguation” Rada M. 2000 • Disambiguate all nouns and verbs: • step 1: complex nominals • step 2: named entities • step 3: word pairs, based on SEMCOR: (previous word, word) pair, (word, successive word) pair • step 4: context, based on SEMCOR and WordNet; in WordNet, a word’s hypernyms are also treated as its context
“A highly accurate bootstrapping algorithm for word sense disambiguation” (cont’d) • step 5: words at semantic distance 0 from a word that has already been disambiguated • step 6: words at semantic distance 1 from a word that has already been disambiguated • step 7: words at semantic distance 0 from each other among the ambiguous words • step 8: words at semantic distance 1 from each other among the ambiguous words
“An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases” sigir 04 • Significant increase for short query • Only WSD on Query and Query Expansion • Phrase-based and Term-based • PSEUDO-RELEVANCE
Phrase identification • 4 types of phrases: proper names (named entities), dictionary phrases (from WordNet), simple phrases, complex phrases • Decide the window size of simple/complex phrases by calculating correlation
WSD • Unlike Rada Miha’s WSD, Liu did not use SEMCOR, only WordNet • 6 steps; basic ideas: use hypernyms, hyponyms, cross-references, etc.
Query Expansion • Add synonyms (conditional) • Add definition words (only the first shortest noun phrase), on the condition that it is highly globally correlated • Add hyponyms (conditional) • Add compound words (conditional)
PSEUDO RELEVANCE FEEDBACK • Using Global Correlations and WordNet • Global_cor > 1 and one of the following conditions: • 1: the term is monosemous (has a single sense) • 2: its definition contains some other query terms • 3: it is in the top 10 ranked documents • Combining Local and Global Correlations:
Results • SO: standard Okapi (term-similarity) • NO: enhanced SO • NO+P: +phrase-similarity • NO+P+D: +WSD • NO+P+D+F: +Pseudo-feedback
Model conclusion • WSD on the query only • WSD using WordNet only, no SEMCOR • Complex query expansion • Pseudo-relevance feedback • Phrase-based and term-based