Statistical Methods in NLP, Course 10. Diana Trandabăț, 2013-2014. Word Sense Disambiguation. One of the central challenges in NLP. Ubiquitous across all languages. Needed in Machine Translation: for correct lexical choice.
Knowledge Based Approaches
Rely on knowledge resources like WordNet, Thesaurus etc.
May use grammar rules for disambiguation.
May use hand coded rules.
Machine Learning Based Approaches
Rely on corpus evidence.
Train a model using tagged or untagged corpus.
Use corpus evidence as well as semantic relations from WordNet.
This airline serves dinner on the evening flight.
object – edible
This airline serves the sector between Agra & Delhi.
object – sector
Requires exhaustive enumeration of:
E.g. This flight serves the “region” between Mumbai and Delhi
How do you decide if “region” is compatible with “sector”?
Require a Machine Readable Dictionary (MRD).
Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
These features could be sense definitions, example sentences, hypernyms etc.
The sense which has the maximum overlap is selected as the contextually appropriate sense.
Sense Bag: contains the words in the definition of a candidate sense of the ambiguous word.
Context Bag: contains the words in the definition of each sense of each context word.
E.g. “On burning coal we get ash.”
Trees of the olive family with pinnate leaves, thin furrowed bark and gray branches.
The solid residue left when combustible material is thoroughly burned or oxidized.
To convert into ash
A piece of glowing carbon or burnt wood.
A black solid combustible substance formed by the partial decomposition of vegetable matter without free access to air and under the influence of moisture and often increased pressure and temperature that is widely used as a fuel for burning.
In this case Sense 2 of ash would be the winner sense.
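The overlap computation above can be sketched in a few lines of Python. This is a minimal illustration, not the full algorithm: the glosses are the ones quoted in the example, while the stopword list and the helper names (`bag`, `lesk`) are assumptions made for this sketch. No stemming is applied, so "burned" and "burning" do not count as a match.

```python
import re

# Small illustrative stopword list (an assumption, not part of the source).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "is", "to", "with", "by",
             "as", "for", "that", "when", "into", "under", "without"}

def bag(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

# Sense bag sources: the three glosses of "ash" from the example.
ASH_SENSES = {
    1: "Trees of the olive family with pinnate leaves, thin furrowed bark and gray branches.",
    2: "The solid residue left when combustible material is thoroughly burned or oxidized.",
    3: "To convert into ash",
}
# Context bag sources: the two glosses of the context word "coal".
COAL_SENSES = [
    "A piece of glowing carbon or burnt wood.",
    "A black solid combustible substance formed by the partial decomposition "
    "of vegetable matter without free access to air and under the influence "
    "of moisture and often increased pressure and temperature that is widely "
    "used as a fuel for burning",
]

def lesk(sense_glosses, context_glosses):
    """Return the sense whose gloss shares the most words with the context bag."""
    context_bag = set()
    for gloss in context_glosses:
        context_bag |= bag(gloss)
    overlaps = {sense: len(bag(gloss) & context_bag)
                for sense, gloss in sense_glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps

best_sense, overlaps = lesk(ASH_SENSES, COAL_SENSES)
print(best_sense, overlaps)
```

With these glosses, the only shared content words are "solid" and "combustible", both from Sense 2 of ash.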
Two different words are likely to have similar meanings if they occur in identical local contexts.
E.g. The facility will employ 500 new employees.
Senses of facility
Subjects of “employ”
To maximize similarity select the sense which has the same hypernym as most of the other words in the context
A Thesaurus Based approach.
Step 1: For each sense of the target word find the thesaurus category to which that sense belongs.
Step 2: Calculate the score for each sense by using the context words. A context word adds 1 to the score of a sense if the thesaurus category of the word matches that of the sense.
E.g. The money in this bank fetches an interest of 8% per annum
Target word: bank
Clue words from the context: money, interest, annum, fetch
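The two scoring steps for the "bank" example can be sketched as follows. The thesaurus categories and sense labels below are hypothetical placeholders chosen for illustration; a real system would look them up in a thesaurus.

```python
# Hypothetical thesaurus categories (assumptions for this sketch only).
SENSE_CATEGORY = {"bank#finance": "FINANCE", "bank#river": "GEOGRAPHY"}
WORD_CATEGORIES = {
    "money": {"FINANCE"},
    "interest": {"FINANCE", "EMOTION"},
    "annum": {"TIME"},
    "fetch": {"MOTION"},
}

def thesaurus_scores(sense_category, word_categories, clue_words):
    """Add 1 to a sense's score for every clue word sharing its category."""
    scores = {sense: 0 for sense in sense_category}
    for word in clue_words:
        for sense, category in sense_category.items():
            if category in word_categories.get(word, set()):
                scores[sense] += 1
    return scores

scores = thesaurus_scores(SENSE_CATEGORY, WORD_CATEGORIES,
                          ["money", "interest", "annum", "fetch"])
best = max(scores, key=scores.get)
print(best, scores)
```

Under these assumed categories, "money" and "interest" vote for the financial sense and no clue word votes for the river sense.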
Select a sense based on the relatedness of that word-sense to the context.
Relatedness is measured in terms of conceptual distance
(i.e. how close the concept represented by the word and the concept represented by its context words are)
This approach uses a structured hierarchical semantic net (WordNet) for finding the conceptual distance.
The smaller the conceptual distance, the higher will be the conceptual density.
(i.e. if all words in the context are strong indicators of a particular concept then that concept will have a higher density.)
c = concept
nhyp = mean number of hyponyms
h = height of the sub-hierarchy
m = no. of senses of the word and senses of context words contained in the sub-hierarchy
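The formula these symbols belong to did not survive in the text. As a hedged sketch, the function below uses the form commonly attributed to Agirre and Rigau, CD(c, m) = (sum over i = 0..m-1 of nhyp^(i^0.20)) / (sum over i = 0..h-1 of nhyp^i). Treat both the formula and the 0.20 exponent as assumptions rather than as the definition used on the original slide.

```python
def conceptual_density(nhyp, h, m, alpha=0.20):
    """Conceptual density of a concept c (assumed Agirre-Rigau form).

    nhyp  -- mean number of hyponyms per node in the sub-hierarchy
    h     -- height of the sub-hierarchy
    m     -- number of relevant senses contained in the sub-hierarchy
    alpha -- smoothing exponent (0.20 is an assumed default)
    """
    # Expected size of a sub-hierarchy that contains m marked senses.
    expected = sum(nhyp ** (i ** alpha) for i in range(m))
    # Total number of descendants of c in a tree of height h.
    descendants = sum(nhyp ** i for i in range(h))
    return expected / descendants

# Hypothetical values: binary branching, height 3, two senses inside.
cd = conceptual_density(nhyp=2, h=3, m=2)
print(cd)
```

For a fixed sub-hierarchy, the value grows as more senses of the context words fall inside it, which matches the intuition that strong contextual support yields a higher density.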
Conceptual Density (example)
The jury(2) praised the administration(3) and operation(8) of Atlanta Police Department(1)
CD = 0.062
CD = 0.256
Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of resultant concepts (sub-hierarchies).
Step 3: The concept with highest CD is selected.
Step 4: Select the senses below the selected concept as the correct sense for the respective words.
Bell ring church Sunday
Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition based semantic similarity (Lesk’s method).
Step 3: Apply graph based ranking algorithm to find score of each vertex (i.e. for each word sense).
Step 4: Select the vertex (sense) which has the highest score.
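The four steps can be sketched with a small weighted PageRank, used here as one possible graph based ranking algorithm. The sense labels and edge weights are hypothetical stand-ins for definition-overlap scores, and `weighted_pagerank` and `best_senses` are names introduced only for this sketch.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Rank vertices of an undirected weighted graph.

    edges maps (u, v) pairs to a similarity weight.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    scores = {v: 1.0 for v in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for v in neighbors:
            # Each neighbor u passes on its score in proportion to edge weight.
            incoming = sum(w / sum(neighbors[u].values()) * scores[u]
                           for u, w in neighbors[v].items())
            new_scores[v] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores

def best_senses(scores):
    """Pick the highest-scoring vertex for each word (label format word#sense)."""
    best = {}
    for vertex, score in scores.items():
        word = vertex.split("#")[0]
        if word not in best or score > scores[best[word]]:
            best[word] = vertex
    return best

# Hypothetical sense vertices and similarity weights.
EDGES = {
    ("bell#instrument", "ring#sound"): 3,
    ("bell#instrument", "church#building"): 2,
    ("ring#sound", "church#building"): 2,
    ("ring#jewelry", "church#building"): 1,
    ("bell#flower", "ring#jewelry"): 1,
}
chosen = best_senses(weighted_pagerank(EDGES))
print(chosen)
```

Senses that are strongly connected to each other reinforce one another, so the weakly connected alternatives end up with lower scores.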
E.g. “Roger Federer” will be a strong indicator of the category “sports” in Roger Federer plays tennis.
$\hat{s} = \arg\max_{s \in \mathrm{senses}} \Pr(s \mid V_w)$
$\hat{s} = \arg\max_{s \in \mathrm{senses}} \Pr(s) \prod_{i=1}^{n} \Pr(V_w^i \mid s)$
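The decision rule above can be sketched as a small Naive Bayes classifier. The training examples are invented for illustration, and add-one smoothing is an assumption added so that unseen features do not zero out the product. Log probabilities replace the raw product to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sense, feature list) pairs for one ambiguous word."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for sense, features in examples:
        sense_counts[sense] += 1
        for f in features:
            feature_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feature_counts, vocab

def classify(features, model):
    """Return argmax over senses of log Pr(s) + sum_i log Pr(f_i | s)."""
    sense_counts, feature_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_logp = None, float("-inf")
    for sense, n in sense_counts.items():
        logp = math.log(n / total)
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for f in features:
            # Add-one smoothing (an assumption of this sketch).
            logp += math.log((feature_counts[sense][f] + 1) / denom)
        if logp > best_logp:
            best_sense, best_logp = sense, logp
    return best_sense

# Hypothetical sense-tagged contexts for the word "bank".
model = train([
    ("finance", ["money", "interest", "loan"]),
    ("finance", ["deposit", "money"]),
    ("river", ["water", "shore"]),
    ("river", ["fishing", "water", "boat"]),
])
prediction = classify(["money", "interest"], model)
print(prediction)
```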
Assuming there are only two senses for the word. Of course, this can be extended to ‘k’ senses.
DECISION LIST ALGORITHM (CONTD.)
Classification of a test sentence is based on the highest ranking collocation found in the test sentence.
…plucking flowers affects plant growth…
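A minimal two-sense decision list can be sketched as below. It assumes the usual ranking criterion, the absolute log ratio of the two sense probabilities given a collocation, which is not spelled out in the surviving text. The collocation counts and the smoothing constant are hypothetical.

```python
import math

def build_decision_list(counts, smoothing=0.1):
    """counts maps a collocation to (count with sense A, count with sense B)."""
    rules = []
    for collocation, (count_a, count_b) in counts.items():
        # Smoothed log-likelihood ratio; its sign picks the sense.
        llr = math.log((count_a + smoothing) / (count_b + smoothing))
        rules.append((abs(llr), collocation, "A" if llr > 0 else "B"))
    rules.sort(reverse=True)  # strongest evidence first
    return rules

def classify(tokens, rules, default="A"):
    """Use only the highest-ranking collocation present in the sentence."""
    present = set(tokens)
    for _, collocation, sense in rules:
        if collocation in present:
            return sense, collocation
    return default, None

# Hypothetical counts for "plant": sense A = living thing, sense B = factory.
COUNTS = {
    "flowers": (20, 1),
    "growth": (15, 2),
    "manufacturing": (1, 25),
    "workers": (0, 12),
}
rules = build_decision_list(COUNTS)
result = classify("plucking flowers affects plant growth".split(), rules)
print(result)
```

Both "flowers" and "growth" occur in the test sentence, but only the higher-ranked of the two decides the sense.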
Instead of using “dictionary defined senses” extract the “senses from the corpus” itself
These “corpus senses” or “uses” correspond to clusters of similar contexts for a word.
Different uses of a target word form highly interconnected bundles (or high density components)
In each high density component one of the nodes (hub) has a higher degree than the others.
Step 1: Construct co-occurrence graph, G.
Step 2: Arrange nodes in G in decreasing order of in-degree.
Step 3: Select the node from G which has the highest frequency. This node will be the hub of the first high density component.
Step 4: Delete this hub and all its neighbors from G.
Step 5: Repeat Steps 3 and 4 to detect the hubs of other high density components.
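The hub detection loop can be sketched as follows. The text mentions both in-degree and frequency as the ranking criterion; this sketch assumes node degree in an undirected co-occurrence graph. The toy graph and the `min_degree` cut-off are hypothetical.

```python
def find_hubs(graph, min_degree=2):
    """graph maps each node to the set of its neighbors (undirected)."""
    g = {node: set(neigh) for node, neigh in graph.items()}
    hubs = []
    while g:
        # Highest-degree remaining node; ties are broken alphabetically.
        hub = max(sorted(g), key=lambda n: len(g[n]))
        if len(g[hub]) < min_degree:
            break
        hubs.append(hub)
        # Delete the hub and all its neighbors from the graph.
        removed = g[hub] | {hub}
        for node in removed:
            g.pop(node, None)
        for neigh in g.values():
            neigh -= removed
    return hubs

# Hypothetical co-occurrence graph around the target word "barrage".
GRAPH = {
    "eau": {"pluie", "riviere", "lac"},
    "pluie": {"eau"},
    "riviere": {"eau", "lac"},
    "lac": {"eau", "riviere"},
    "match": {"but", "equipe", "stade"},
    "but": {"match", "equipe"},
    "equipe": {"match", "but"},
    "stade": {"match"},
}
hubs = find_hubs(GRAPH)
print(hubs)
```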
The four components for “barrage” can be characterized as:
Attach each node to the root hub closest to it.
The distance between two nodes is measured as the smallest sum of the weights of the edges on the paths linking them.
Add the target word to the graph G.
Compute a Minimum Spanning Tree (MST) over G taking the target word as the root.
Each node in the MST is assigned a score vector with as many dimensions as there are components.
E.g. pluie (rain) belongs to the component EAU (water) and d(eau, pluie) = 0.82, so s_pluie = (0.55, 0, 0, 0)
For a given context, add the score vectors of all words in that context.
Select the component that receives the highest weight.
Le barrage recueille l’eau à la saison des pluies.
The dam collects water during the rainy season.
EAU is the winner in this case.
A reliability coefficient (ρ) can be calculated as the difference between the best score and the second best score.
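The scoring and selection steps can be sketched as below. The sketch assumes a node's score in its own component is 1 / (1 + d), where d is its distance to the hub; this agrees with the pluie example, since 1 / 1.82 is about 0.55. Only the pluie distance comes from the text. The other distances and the component names C2 to C4 are placeholders, and the context words are assumed to be lemmatized.

```python
COMPONENTS = ["EAU", "C2", "C3", "C4"]  # only EAU is named in the text

# node -> (component, distance to that component's hub); mostly hypothetical.
NODES = {
    "eau": ("EAU", 0.0),
    "pluie": ("EAU", 0.82),   # value taken from the example
    "saison": ("C4", 1.5),
}

def score_vector(component, distance):
    """One dimension per component; non-zero only for the node's own component."""
    vec = [0.0] * len(COMPONENTS)
    vec[COMPONENTS.index(component)] = 1.0 / (1.0 + distance)
    return vec

def disambiguate(context_words):
    """Sum the score vectors of the context words and pick the best component."""
    totals = [0.0] * len(COMPONENTS)
    for word in context_words:
        if word in NODES:
            for i, s in enumerate(score_vector(*NODES[word])):
                totals[i] += s
    ranked = sorted(zip(totals, COMPONENTS), reverse=True)
    reliability = ranked[0][0] - ranked[1][0]  # best minus second best
    return ranked[0][1], reliability

winner, rho = disambiguate(["barrage", "recueillir", "eau", "saison", "pluie"])
print(winner, round(rho, 3))
```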
If A is a “Hill” and B is a “Coast” then the commonality between A and B is that “A is a GeoForm and B is a GeoForm”.
sim(Hill, Coast) =
In general, similarity is directly proportional to the probability that the two words have the same super class (Hypernym)
To maximize similarity select that sense which has the same hypernym as most of the Selector words.
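The right-hand side of sim(Hill, Coast) is missing from the text. A common information-theoretic form, assumed here, is 2 log P(GeoForm) / (log P(Hill) + log P(Coast)): the information in what the two concepts share, divided by the information needed to describe each of them. The probabilities in the sketch are hypothetical, not corpus estimates.

```python
import math

def lin_similarity(p_common, p_a, p_b):
    """Similarity of concepts A and B given their closest common super class.

    p_common -- probability of the shared super class (e.g. GeoForm)
    p_a, p_b -- probabilities of the two concepts themselves
    """
    return 2 * math.log(p_common) / (math.log(p_a) + math.log(p_b))

# Hypothetical probabilities for GeoForm, Hill and Coast.
sim = lin_similarity(p_common=0.01, p_a=0.0001, p_b=0.0002)
print(round(sim, 3))
```

A more specific shared super class has a lower probability and therefore carries more information, which raises the similarity toward 1.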
A word having multiple senses in one language will have distinct translations in another language, based on the context in which it is used.
The translations can thus be considered as contextual indicators of the sense of the word.
Uses semantic relations (synonymy and hypernymy) from WordNet.
Extracts collocational and contextual information from WordNet (gloss) and a small amount of tagged data.
Monosemic words in the context serve as a seed set of disambiguated words.
In each iteration new words are disambiguated based on their semantic distance from already disambiguated words.
It would be interesting to exploit other semantic relations available in WordNet.
Uses some tagged data to build a semantic language model for words seen in the training corpus.
Uses WordNet to derive semantic generalizations for words which are not observed in the corpus.
Semantic Language Model
For each POS tag, using the corpus, a training set is constructed.
Each training example is represented as a feature vector and a class label which is word#sense
In the testing phase, for each test sentence, a similar feature vector is constructed.
The trained classifier is used to predict the word and the sense.
If the predicted word is same as the observed word then the predicted sense is selected as the correct sense.
If “drink water” is observed in the corpus then using the hypernymy tree we can derive the syntactic dependency “take-in liquid”.
“take-in liquid” can then be used to disambiguate an instance of the word tea as in “take tea”, by using the hypernymy-hyponymy relations.
A semantic relations graph for the two senses of the word bus (i.e. vehicle and connector)
Using Search Engines
Construct search queries using monosemic words and phrases from the gloss of a synset.
Feed these queries to a search engine.
From the retrieved documents extract the sentences which contain the search queries.
Using Equivalent Pseudo Words
Use monosemic words belonging to each sense of an ambiguous word.
Use the occurrences of these words in the corpus as training examples for the ambiguous word.
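The pseudo-word idea can be sketched as a simple corpus rewrite. The monosemic words, sense labels and sentences below are hypothetical, and `pseudo_word_examples` is a name introduced only for this illustration.

```python
def pseudo_word_examples(corpus, ambiguous_word, monosemic_to_sense):
    """Turn occurrences of monosemic words into labeled examples.

    Each occurrence of a monosemic word is replaced by the ambiguous word,
    and the sentence is labeled with the sense that monosemic word stands for.
    """
    examples = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in monosemic_to_sense:
                rewritten = tokens[:i] + [ambiguous_word] + tokens[i + 1:]
                examples.append((" ".join(rewritten), monosemic_to_sense[token]))
    return examples

# Hypothetical monosemic stand-ins for two senses of "bank".
MONOSEMIC = {"lender": "bank#finance", "riverbank": "bank#river"}
CORPUS = [
    "The lender raised its interest rate",
    "They walked along the riverbank",
]
examples = pseudo_word_examples(CORPUS, "bank", MONOSEMIC)
print(examples)
```

The resulting pairs can be fed to any supervised classifier as if they were hand-tagged examples of the ambiguous word.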
Contradictory results have been published, so it is difficult to decide conclusively.
Depends on the quality of the underlying MT model.
The bias of BLEU score towards phrasal coherency often gives misleading results.
E.g. (Chinese to English translation)
Hiero (SMT model): Australian minister said that North Korea bad behavior will be more aid.
Hiero (SMT model) + WSD : Australian minister said that North Korea bad behavior will be unable to obtain more aid.
Here the second sentence is more appropriate. But since the phrase “unable to obtain” was not observed in the language model, the second sentence gets a lower BLEU score.
Dictionary defined senses do not provide enough surface cues.
Complete dependence on dictionary defined senses is the primary reason for low accuracies in Knowledge Based approaches.
Extracting “sense definitions” or “usage patterns” from the corpus greatly improves the accuracy.
Word-specific classifiers are able to attain extremely good accuracies but suffer from the problem of non-reusability.
Unsupervised algorithms are capable of performing on par with supervised algorithms.
Relying on single most predictive evidence increases the accuracy.
Classifiers that exploit syntactic dependencies between words are able to perform large scale disambiguation (generic classifiers) and at the same time give reasonably good accuracies.
Using a diverse set of features improves WSD accuracy.
WSD results are better when the degree of polysemy is reduced.
Use unsupervised or hybrid approaches to develop a multilingual WSD engine. (focusing on MT)
Automatically generate sense tagged data.
Explore the possibility of using an ensemble of WSD algorithms.
See you tomorrow!