
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis



Presentation Transcript


  1. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis Evgeniy Gabrilovich and Shaul Markovitch Israel Institute of Technology IJCAI 2007

  2. Introduction • How related are “cat” and “mouse”? • ESA represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia • Wikipedia is currently the largest knowledge repository on the Web • The English version is the largest of all, with 400+ million words in over one million articles

  3. Explicit Semantic Analysis • Let T = {w1 … wn} be the input text • Let <vi> be T’s TFIDF vector, where vi is the weight of word wi • Let c1, …, cN be the Wikipedia concepts, where N is the total number of Wikipedia concepts • Let <kj> be the inverted index entry for word wi, where kj quantifies the strength of association of word wi with Wikipedia concept cj, cj ∈ {c1, …, cN}

  4. Explicit Semantic Analysis • The semantic interpretation vector V for text T is a vector of length N, in which the weight of each concept cj is defined as Σwi∈T vi · kj • To compute the semantic relatedness of a pair of text fragments, we compare their vectors using the cosine metric
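
The two slides above can be sketched in a few lines: build the interpretation vector by summing, for each word, its TFIDF weight times its inverted-index association weights, then compare two vectors with the cosine metric. The inverted index and TFIDF weights below are toy, illustrative values, not real data from the paper.

```python
# Sketch of ESA; the inverted index (word -> {concept: k_j}) and the
# TFIDF weights are assumed precomputed -- the toy values below are
# illustrative only.
import math
from collections import defaultdict

def esa_vector(tfidf, inverted_index):
    """Semantic interpretation vector: the weight of concept c_j is
    the sum over words w_i in the text of v_i * k_j."""
    v = defaultdict(float)
    for word, weight in tfidf.items():
        for concept, k in inverted_index.get(word, {}).items():
            v[concept] += weight * k
    return v

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy inverted index over hypothetical Wikipedia concepts
index = {
    "cat":   {"Cat": 0.9, "Pet": 0.5},
    "mouse": {"Mouse (rodent)": 0.8, "Pet": 0.3},
}
v1 = esa_vector({"cat": 1.0}, index)
v2 = esa_vector({"mouse": 1.0}, index)
print(round(cosine(v1, v2), 3))  # "cat" and "mouse" overlap via "Pet"
```

In the real system the sparse inverted index is what makes this tractable: only the concepts associated with words actually present in the text receive non-zero weight.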

  5. Example 1 First ten concepts in sample interpretation vectors

  6. First ten concepts in sample interpretation vectors

  7. Example (texts with ambiguous words) First ten concepts in sample interpretation vectors

  8. Empirical Evaluation • Wikipedia • Parsing the Wikipedia XML dump yielded 2.9 GB of text in 1,187,839 articles • After removing small and overly specific concepts (those having fewer than 100 words and fewer than 5 incoming or outgoing links), 241,393 articles remained • 389,202 distinct terms
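
The filtering rule above can be written as a small predicate. The exact combination of the two conditions is an assumption based on the slide's wording ("fewer than 100 words and fewer than 5 incoming or outgoing links"); the function name and fields are hypothetical.

```python
# Hedged sketch of the concept-filtering rule: drop an article if it
# is both small (< 100 words) and poorly linked (< 5 incoming or
# outgoing links). The exact predicate is an assumption from the
# slide's phrasing.
def keep_article(word_count, in_links, out_links):
    too_small = word_count < 100
    too_isolated = in_links < 5 or out_links < 5
    return not (too_small and too_isolated)

print(keep_article(50, 2, 3))    # small and isolated -> dropped
print(keep_article(500, 10, 8))  # substantial, well linked -> kept
```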

  9. Empirical Evaluation • Open Directory Project • A hierarchy of over 400,000 concepts and 2,800,000 URLs • Crawling all of its URLs and taking the first 10 pages encountered at each site yielded 70 GB of textual data • After removing stop words and rare words, we obtained 20,700,000 distinct terms

  10. Datasets and Evaluation Procedure • The WordSimilarity-353 collection • contains 353 word pairs; each pair has 13–16 human judgements • The Spearman rank-order correlation coefficient was used to compare computed relatedness scores with human judgements • Spearman rank-order correlation (http://webclass.ncu.edu.tw/~tang0/Chap8/sas8.htm)
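
The Spearman coefficient is just the Pearson correlation of the ranks of the two score lists. A minimal sketch (ties ignored for simplicity); the score lists below are illustrative, not taken from the paper:

```python
# Minimal Spearman rank-order correlation, as used to compare computed
# relatedness scores with WordSim-353 human judgements. Scores below
# are toy values.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    # Pearson correlation of the ranks (no tie correction)
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

human = [7.35, 8.08, 2.5, 5.0, 9.1]   # averaged human judgements (toy)
model = [0.62, 0.71, 0.10, 0.55, 0.90]  # computed relatedness (toy)
print(spearman(human, model))  # identical rankings -> 1.0
```

Because only the rank order matters, Spearman is insensitive to the absolute scale of the model's scores, which is why it fits the word-relatedness evaluation.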

  11. Datasets and Evaluation Procedure • 50 documents from the Australian Broadcasting Corporation’s news mail service [Lee et al., 2005] • These documents were paired in all possible ways, and each of the 1,225 pairs has 8–12 human judgements • After the human judgements are averaged for each pair, the collection of 1,225 relatedness scores has only 67 distinct values • Spearman correlation is not appropriate in this case, and therefore we used Pearson’s linear correlation coefficient • http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
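
With so few distinct values, rank-based correlation is dominated by ties, hence the switch to Pearson's linear coefficient. A minimal sketch, with illustrative data:

```python
# Minimal Pearson linear correlation, used for the document-pair
# scores where Spearman is unsuitable (only 67 distinct averaged
# values). The data below is toy, not from the paper.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

human = [1.0, 2.0, 3.0, 4.0]    # averaged judgements (toy)
model = [2.1, 3.9, 6.2, 7.8]    # computed scores (toy)
print(round(pearson(human, model), 3))
```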

  12. Results • word relatedness • text relatedness

  13. Related Work • Lexical databases • WordNet or Roget’s Thesaurus • Differences from ESA • Inherently limited to individual words • Need word sense disambiguation • ESA provides a much more sophisticated mapping of words (in WordNet this amounts to a simple lookup, without any weights)

  14. Related Work • LSA • LSA is essentially a dimensionality reduction technique that identifies the most prominent dimensions in the data • Notoriously difficult to interpret • Its computed concepts cannot be readily mapped into the natural concepts manipulated by humans

  15. Related Work • WikiRelate! [Strube and Ponzetto 2006] • Given a pair of words w1 and w2, it • searches for Wikipedia articles, p1 and p2, that respectively contain w1 and w2 in their titles • Semantic relatedness is then computed using various distance measures between p1 and p2 • Differences from ESA • Can only process words that actually occur in article titles • Limited to single words

  16. Conclusions • A novel approach to computing the semantic relatedness of natural language texts • Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness • The ESA model is easy to explain to human users
