Using Information Content to Evaluate Semantic Similarity in a Taxonomy

Philip Resnik, Sun Microsystems Laboratories • Presenter: Cosmin Adrian Bejan

Overview • The paper presents an alternative way to evaluate semantic similarity in a taxonomy, based on the notion of information content. • It provides a way of adapting a static knowledge structure to multiple contexts by combining a taxonomic structure with empirical probability estimates.

Introduction • Semantic similarity is a special case of semantic relatedness. For example, car and gasoline seem more closely related than car and bicycle, but the latter pair is certainly more similar. • A natural way to evaluate semantic similarity in a taxonomy is to measure the distance between the nodes corresponding to the items being compared: the shorter the path from one node to another, the more similar they are. • One problem with this approach is that it relies on the notion that links in the taxonomy represent uniform distances. Examples of single links that cover very different conceptual distances: • rabbit ears IS-A television antenna (a narrow step) • phytoplankton IS-A living thing (a wide step)
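The path-based approach described in this slide can be sketched as a breadth-first search over IS-A links. This is only an illustration: the taxonomy `PARENTS` and the helper names `neighbours` and `path_length` are hypothetical and are not taken from WordNet or from the paper.

```python
from collections import deque

# Hypothetical toy IS-A taxonomy (child -> parents); not taken from WordNet.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "substance": ["entity"],
    "vehicle": ["object"],
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "gasoline": ["substance"],
}


def neighbours(node):
    """Concepts one IS-A link away from node, in either direction."""
    children = [c for c, parents in PARENTS.items() if node in parents]
    return PARENTS[node] + children


def path_length(start, goal):
    """Number of links on the shortest path between two concepts (BFS)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected


print(path_length("car", "bicycle"))   # 2
print(path_length("car", "gasoline"))  # 5
```

Every link counts as one step here, which is exactly the uniform-distance assumption the slide criticises.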

Similarity of two concepts • Intuitively, one key to the similarity of two concepts is the extent to which they share information, indicated in an IS-A taxonomy by a highly specific concept that subsumes them both. • By associating probabilities with concepts in the taxonomy, it is possible to capture the same idea while avoiding the unreliability of edge distances.

Similarity and Information Content • Let: • C be the set of concepts in an IS-A taxonomy, permitting multiple inheritance • p: C → [0,1] be a function such that for any c ∈ C, p(c) is the probability of encountering an instance of concept c. • Notes: • p is monotonic as one moves up the taxonomy: if c1 IS-A c2, then p(c1) ≤ p(c2) • if the taxonomy has a unique top node, then its probability is 1.

Similarity and Information Content • the information content of a concept c can be quantified as the negative log likelihood, $-\log p(c)$ • as probability increases, informativeness decreases, so the more abstract a concept, the lower its information content. • the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy: $\mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ -\log p(c) \right]$, where $S(c_1, c_2)$ is the set of concepts that subsume both $c_1$ and $c_2$

Similarity and Information Content • although similarity is computed by considering all upper bounds for the two concepts, the information measure has the effect of identifying minimal upper bounds since no class is less informative than its superordinates.
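A minimal sketch of the information-content similarity on a toy taxonomy may help make the definition concrete. The taxonomy, the probabilities in `P`, and all function names are hypothetical. Two further assumptions are made here: logarithms are taken to base 2, and a concept is treated as subsuming itself.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parents); not taken from WordNet.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "substance": ["entity"],
    "vehicle": ["object"],
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "gasoline": ["substance"],
}

# Hypothetical concept probabilities, monotonic up the taxonomy.
P = {
    "entity": 1.0,
    "object": 0.5,
    "substance": 0.25,
    "vehicle": 0.125,
    "car": 0.0625,
    "bicycle": 0.03125,
    "gasoline": 0.0625,
}


def subsumers(concept):
    """The concept itself plus every concept reachable through IS-A links."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def information_content(concept):
    """-log2 p(c), written as log2(1/p) to avoid a negative zero at the root."""
    return math.log2(1.0 / P[concept])


def ic_similarity(c1, c2):
    """Maximum information content over all concepts subsuming both inputs."""
    shared = subsumers(c1) & subsumers(c2)
    return max(information_content(c) for c in shared)


print(ic_similarity("car", "bicycle"))   # 3.0 (most informative subsumer: vehicle)
print(ic_similarity("car", "gasoline"))  # 0.0 (only the top node is shared)
```

Although the maximum is taken over all shared subsumers, the winner for car and bicycle is their minimal upper bound, vehicle, as the slide notes.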

Implementation • WordNet provides the taxonomy of concepts (nouns and compound nominals) • frequencies of concepts in the taxonomy were estimated using noun frequencies from the Brown Corpus of American English • each noun that occurred in the corpus was counted as an occurrence of each taxonomic class containing it: $\mathrm{freq}(c) = \sum_{n \in \mathrm{words}(c)} \mathrm{count}(n)$, where $\mathrm{words}(c)$ is the set of words subsumed by concept $c$ • concept probabilities were computed simply as relative frequency: $\hat{p}(c) = \mathrm{freq}(c) / N$, where $N$ is the total number of nouns observed in the corpus that also appear in WordNet
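The counting scheme on this slide can be sketched as follows. Everything concrete here is hypothetical: the toy taxonomy, the `SENSES` mapping from nouns to concepts, and the corpus counts stand in for WordNet and the Brown Corpus.

```python
from collections import Counter

# Hypothetical toy IS-A taxonomy (child -> parents); not taken from WordNet.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "substance": ["entity"],
    "vehicle": ["object"],
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "gasoline": ["substance"],
}

# Hypothetical mapping from a noun to the concepts (senses) it can denote.
SENSES = {"car": ["car"], "bicycle": ["bicycle"], "gasoline": ["gasoline"]}

# Hypothetical corpus counts; "widget" is not in the taxonomy and is skipped.
NOUN_COUNTS = Counter({"car": 4, "bicycle": 2, "gasoline": 2, "widget": 5})


def subsumers(concept):
    """The concept itself plus every concept reachable through IS-A links."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def concept_probabilities(noun_counts):
    """Relative frequency freq(c) / N for every concept in the taxonomy."""
    freq = Counter()
    total = 0  # N: noun occurrences that are covered by the taxonomy
    for noun, count in noun_counts.items():
        if noun not in SENSES:
            continue
        total += count
        # Credit each class containing the noun once, whatever the path.
        classes = set()
        for sense in SENSES[noun]:
            classes |= subsumers(sense)
        for concept in classes:
            freq[concept] += count
    return {concept: freq[concept] / total for concept in PARENTS}


probs = concept_probabilities(NOUN_COUNTS)
print(probs["entity"], probs["vehicle"], probs["gasoline"])  # 1.0 0.75 0.25
```

Because every noun occurrence is credited to all of its superordinate classes, the resulting probabilities are monotonic up the taxonomy and the top node receives probability 1.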

Task • one way to evaluate computational measures of semantic similarity is to measure how well they correlate with human similarity ratings. • an experiment with human subjects was replicated: ten subjects were given 30 noun pairs and asked to rate “similarity of meaning” for each pair on a scale from 0 (no similarity) to 4 (perfect synonymy). • the average correlation over the 10 subjects was r = 0.8848
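The correlation used for this kind of evaluation can be sketched as a plain Pearson coefficient. The function name `pearson_r` and both lists of scores are hypothetical placeholders, not data from the experiment.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: mean human ratings (0 to 4) and a model's scores
# for the same noun pairs, listed in the same order.
human_ratings = [3.9, 3.1, 1.2, 0.4]
model_scores = [9.5, 7.0, 2.1, 0.0]

print(round(pearson_r(human_ratings, model_scores), 4))
```

Pearson's r is insensitive to the scale of either variable, so a measure expressed in bits can be compared directly with ratings on a 0 to 4 scale.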

Results • Three similarity measures were used: • the similarity measure based on information content • a variant of the edge-counting method • a measure that simply uses the probability of a concept, rather than the information content: $\mathrm{sim}_{p(c)}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ 1 - p(c) \right]$
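The probability-based measure differs from the information-content measure only in the quantity being maximised. A minimal sketch, assuming a hypothetical toy taxonomy, hypothetical probabilities, and that a concept subsumes itself:

```python
# Hypothetical toy IS-A taxonomy (child -> parents); not taken from WordNet.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "substance": ["entity"],
    "vehicle": ["object"],
    "car": ["vehicle"],
    "bicycle": ["vehicle"],
    "gasoline": ["substance"],
}

# Hypothetical concept probabilities, monotonic up the taxonomy.
P = {
    "entity": 1.0,
    "object": 0.5,
    "substance": 0.25,
    "vehicle": 0.125,
    "car": 0.0625,
    "bicycle": 0.03125,
    "gasoline": 0.0625,
}


def subsumers(concept):
    """The concept itself plus every concept reachable through IS-A links."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def probability_similarity(c1, c2):
    """Maximum of 1 - p(c) over all concepts subsuming both inputs."""
    shared = subsumers(c1) & subsumers(c2)
    return max(1.0 - P[c] for c in shared)


print(probability_similarity("car", "bicycle"))   # 0.875
print(probability_similarity("car", "gasoline"))  # 0.0
```

Both measures select the same subsumer, because `1 - p(c)` and `-log p(c)` both decrease as `p(c)` grows; they differ only in how the score is scaled.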
