Computing Semantic Relatedness



  1. Computing Semantic Relatedness B.Tech Project – Second Stage Rohitashwa Bhotica (04005010) Under the guidance of :- Prof. Pushpak Bhattacharyya

  2. OUTLINE • Introduction • Wiktionary • Semantic Relatedness • Page Rank • Implementation Steps • Results and Testing • Conclusion

  3. Introduction • Computing Semantic Relatedness between words has uses in various applications • Many measures exist, all using WordNet • Wiktionary models lexical semantic knowledge in a way similar to conventional wordnets • Wiktionary can therefore be a substitute for WordNet • We see how the Concept-Vector approach and PageRank are used to measure Semantic Relatedness using Wiktionary as a corpus

  4. Wiktionary • Freely available, multilingual, web-based dictionary in over 151 languages • A project of the Wikimedia Foundation • Written collaboratively by online volunteers • The English version has over 800,000 entries • Contains many relation types such as synonyms, etymology, hypernymy, etc.

  5. Comparison with WordNets

  6. Differences between WordNet & Wiktionary • Wiktionary is constructed by users on the web rather than by expert linguists • This reduces creation costs and increases the size and speed of creation of entries • Wiktionary is available in more languages • The Wiktionary schema is fixed but not enforced • Older entries are not updated and hence are inconsistent • Wiktionary entries are not necessarily complete and may contain stubs; the link structure is also not symmetrical

  7. Similarities Between Wiktionary & WordNet • Wiktionary contains concepts connected to each other by lexical semantic relations • Both have glosses giving short descriptions • The major language editions are all large • Wiktionary articles are monitored by the community on the web, just as WordNet entries are monitored by its maintainers

  8. Structure of a Wiktionary Entry • An entry is in XML format with tags for title, author, creation date, comments, etc. • Meanings and various forms with examples • List of synonyms and related terms • Linked to other words represented by “[[ ]]” • Contains a list of translations of the word in other languages and the categories to which it belongs • Pronunciation and rhyming words are given as well

  9. Example • http://en.wiktionary.org/wiki/bank • We can see the various meanings for the different forms of the word “bank” • List of derived and related terms present • Contains translations into other languages

  10. Semantic Relatedness • Defines the resemblance between two words • A more general concept than similarity • Similar and dissimilar entries can be related by lexical relationships such as meronymy • Car–petrol is more related than car–bicycle, even though car–bicycle is more similar • Humans can judge relatedness easily, unlike computers • Computers need a vast amount of common sense and world knowledge

  11. Measures of Semantic Relatedness • Concept – Vector Based Approach • A word is represented as a high-dimensional concept vector, v(w) = (v1, …, vn), where n is the number of documents • The tf.idf score is stored in each vector element • The vector v represents the word w in concept space • Semantic Relatedness can be calculated using:- sr(w1, w2) = v(w1) · v(w2) / (|v(w1)| |v(w2)|) • This is also known as cosine similarity and the score varies from 0 to 1
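
A minimal sketch of the concept-vector approach, assuming a toy corpus in which each headword's content words form one document; the tf_idf_vectors and cosine_similarity helpers and the example words are illustrative, not the project's code:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a tf.idf concept vector for each document
    (here: each headword's content words treated as one document)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for words in docs.values():
        df.update(set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        vectors[name] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine_similarity(v1, v2):
    """sr(w1, w2) = v(w1).v(w2) / (|v(w1)| |v(w2)|)."""
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Toy corpus: each "document" is the list of content words for one headword.
docs = {
    "bank":  ["money", "deposit", "river", "slope", "institution"],
    "money": ["currency", "deposit", "bank", "institution"],
    "river": ["water", "slope", "bank", "stream"],
}
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors["bank"], vectors["money"]))
print(cosine_similarity(vectors["bank"], vectors["river"]))
```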

  12. Measures of Semantic Relatedness • Path – Length Based Measure • Computes Semantic Relatedness in WordNet • Views WordNet as a graph and looks at the path length between concepts: the shorter the path, the more related the concepts are • Gives good results when the path consists of is-a links • Concepts are nodes and the semantic relations between them are treated as edges • SR is calculated by relPL(c1, c2) = Lmax – L(c1, c2) • Lmax is the length of the longest non-cyclic path and L(c1, c2) gives the number of edges from concept c1 to c2
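
A small illustration of the path-length measure on a toy is-a graph; the graph, the helper names and the assumed Lmax value are made up for the example:

```python
from collections import deque

def shortest_path_length(graph, c1, c2):
    """Number of edges on the shortest path between two concepts (BFS),
    treating semantic relations as undirected edges."""
    if c1 == c2:
        return 0
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in graph.get(node, []):
            if nb == c2:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # no path between the concepts

# Toy is-a hierarchy, stored undirected for path counting.
graph = {
    "coin": ["nickel", "dime", "medium_of_exchange"],
    "nickel": ["coin"],
    "dime": ["coin"],
    "medium_of_exchange": ["coin", "credit_card"],
    "credit_card": ["medium_of_exchange"],
}
L_MAX = 3  # longest non-cyclic path in this toy graph

def rel_pl(c1, c2):
    """relPL(c1, c2) = Lmax - L(c1, c2)."""
    length = shortest_path_length(graph, c1, c2)
    return None if length is None else L_MAX - length

print(rel_pl("nickel", "dime"))         # short path -> higher relatedness
print(rel_pl("nickel", "credit_card"))  # longer path -> lower relatedness
```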

  13. Measures of Semantic Relatedness • The problem is that it considers all links to be uniform in distance, which may not always be the case • Many improvements use Information Content • The Resnik Measure • An information-content-based relatedness measure • Concepts with higher information content are specific to particular topics, those with lower values to more general topics • Carving fork – HIGH, entity – LOW • The idea is that two concepts are semantically related in proportion to the amount of information they share

  14. Measures of Semantic Relatedness • Considers the position of nouns in the is-a hierarchy • SR is determined by the information content of the lowest common concept which subsumes both concepts • For example: Nickel and Dime are subsumed by Coin, Nickel and Credit Card by Medium of Exchange • P(c) is the probability of encountering concept c • If a is-a b, then P(a) ≤ P(b) • Information content is calculated by the formula:- IC(concept) = – log(P(concept))

  15. Measures of Semantic Relatedness • Thus relatedness is given by:- simres(c1, c2) = IC(LCS(c1, c2)) • Does not consider the information content of the concepts themselves, nor the path length • A problem is that many concepts might share the same subsumer and thus get the same score • May give high scores on the basis of some inappropriate word senses, e.g. tobacco and horse • Newer methods include the Jiang-Conrath, Lin and Leacock-Chodorow measures
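
A sketch of the Resnik measure on an assumed toy taxonomy; the PARENT and COUNTS tables and the helper names are hypothetical values chosen only to show IC(LCS(c1, c2)) at work:

```python
import math

# Toy is-a taxonomy (child -> parent) and corpus counts, where each
# concept's count includes everything it subsumes.
PARENT = {
    "nickel": "coin", "dime": "coin", "coin": "medium_of_exchange",
    "credit_card": "medium_of_exchange", "medium_of_exchange": "entity",
}
COUNTS = {"nickel": 10, "dime": 10, "coin": 40, "credit_card": 15,
          "medium_of_exchange": 100, "entity": 1000}
TOTAL = COUNTS["entity"]

def ancestors(c):
    """The concept plus all of its subsumers up to the root."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def ic(c):
    """IC(concept) = -log P(concept), P estimated from corpus counts."""
    return -math.log(COUNTS[c] / TOTAL)

def sim_res(c1, c2):
    """simres(c1, c2) = IC(LCS(c1, c2))."""
    anc1, anc2 = ancestors(c1), set(ancestors(c2))
    lcs = next(a for a in anc1 if a in anc2)  # lowest common subsumer
    return ic(lcs)

print(sim_res("nickel", "dime"))         # LCS = coin -> higher IC
print(sim_res("nickel", "credit_card"))  # LCS = medium_of_exchange -> lower IC
```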

  16. Page Rank • Developed by Larry Page and Sergey Brin • A link analysis algorithm that assigns a numerical weighting to a hyperlinked set of documents • Measures the relative importance of a page in the set • A link to a page is a vote of support which increases the rank of that particular page • It is a probability distribution representing the likelihood that a person randomly clicking on links will ultimately end up on a specific page

  17. Simplified Algorithm • Assume the universe has 4 pages A, B, C and D • The initial value of each page is 0.25 • Now suppose B, C and D link only to A • The rank of A is then given by:- PR(A) = PR(B) + PR(C) + PR(D) • If B links to other pages as well, then the rank of A becomes:- PR(A) = PR(B) / L(B) + PR(C) + PR(D) • L(B) is the number of outbound links from B

  18. Simplified Algorithm • The Page Rank of U depends on the rank of each page V linking to U, divided by the number of links from V • Page Rank can be given by the general formula:- PR(u) = Σ PR(v) / L(v), where the sum runs over all pages v that link to u • Thus the page ranks of all pages in the corpus sum to 1

  19. Final Algorithm • Damping Factor: the imaginary surfer will stop clicking on links after some time • d is the probability that the user will continue clicking • The damping factor is estimated at 0.85 here • The new page rank formula using this is:- PR(u) = (1 – d) / N + d · Σ PR(v) / L(v), where N is the number of pages and the sum runs over the pages v linking to u • To get the actual rank of a page we have to iterate this formula many times • Problem of Dangling Links
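
A minimal power-iteration sketch of the damped formula above, assuming the dangling-links fix described later (pages with no out-links are treated as linking to every page); the link graph and function name are toy examples:

```python
def page_rank(links, d=0.85, iterations=30):
    """Power iteration for PR(u) = (1 - d)/N + d * sum(PR(v) / L(v))
    over all pages v linking to u."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for v, outs in links.items():
            targets = outs if outs else pages   # dangling-links fix
            share = d * rank[v] / len(targets)
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank

# Toy link graph: B and C link to A; D has no out-links (dangling).
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": []}
for page, score in sorted(page_rank(links).items()):
    print(page, round(score, 3))
```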

  20. Page Rank in our Implementation • Wiktionary contains a link structure within its articles • The Page Rank of every word in the corpus can be calculated using the same algorithm • Higher-ranking words have a higher probability of being reached by random clicking • The algorithm is iterated 30 times • A problem is that the link structure is not symmetric and can be improved

  21. Implementation Steps • We use the Wiktionary corpus dated 15th March, 2008 • Parsing:- • Split the large Wiktionary dump file into smaller files • Parse articles, removing irrelevant information such as comments and leaving only content words • Content words consist of the words in the glosses of an article and the synonyms, antonyms, etc. of the word • Content words are then stemmed with the Porter stemmer to maintain uniformity across word forms • Stop words are removed to leave only the main words
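
A rough sketch of this content-word extraction step, assuming NLTK's PorterStemmer and a small illustrative stop list; the project's actual stop list and tokenisation rules are not specified here:

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

# A small illustrative stop list; not the project's actual list.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "or", "for"}

stemmer = PorterStemmer()

def extract_content_words(article_text):
    """Lower-case, tokenise, drop stop words and Porter-stem the rest,
    so only stemmed content words survive."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

gloss = "An institution where one can place and borrow money"
print(extract_content_words(gloss))   # e.g. 'institution' is stemmed to 'institut'
```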

  22. Implementation Steps • Calculating SR using the C-V based approach :- • We have a list of all words and their content words • Treating each word as a different document, calculate the concept vector of each word • Calculate SR using these concept vectors • Example :-

  23. Implementation Steps • List of Linked Words :- • Each linked word is enclosed in “[[ ]]”s • We parse Wiktionary and store all these words • Calculating Page Rank :- • We have a list of all links for all words in the corpus • Words that do not link to any other word are linked to all the words, to solve the dangling links problem • The Page and Brin algorithm is then used to calculate the ranks
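
A possible way to pull out the linked words, assuming standard wiki markup where a link target sits inside double square brackets, optionally followed by a pipe and display text; the regex and sample article text are illustrative only:

```python
import re

# Capture text after '[[' up to ']]', '|' or '#'.
LINK_PATTERN = re.compile(r"\[\[([^\]|#]+)")

def extract_links(wikitext):
    """Return the words linked from an article body, e.g. [[money]]."""
    return [m.strip().lower() for m in LINK_PATTERN.findall(wikitext)]

body = "A [[financial]] institution that accepts [[deposit]]s; also the [[slope]] beside a [[river]]."
print(extract_links(body))  # ['financial', 'deposit', 'slope', 'river']
```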

  24. Implementation Steps • Calculating SR using Page Rank :- • The concept vector of each word has already been computed • Multiply each element of the concept vector by its corresponding Page Rank • Compute the cosine similarity using these vectors • Example :-
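
A short sketch of this Page Rank weighted variant: each concept-vector element is scaled by the Page Rank of the document (word) that dimension corresponds to before taking the cosine. The vectors, ranks and helper names below are made-up toy values:

```python
import math

def weight_by_page_rank(vector, page_rank):
    """Scale each concept-vector element by the Page Rank of the document
    (word) that the dimension corresponds to."""
    return {doc: value * page_rank.get(doc, 0.0) for doc, value in vector.items()}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy tf.idf concept vectors and page ranks, keyed by document (word) id.
v_bank  = {"money": 1.2, "river": 0.7}
v_money = {"money": 1.5, "river": 0.1}
ranks   = {"money": 0.4, "river": 0.1}

print(cosine(weight_by_page_rank(v_bank, ranks),
             weight_by_page_rank(v_money, ranks)))
```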

  25. Results and Testing • The Miller & Charles (30), Rubenstein & Goodenough (65) and Finkelstein (353) datasets are used for testing • Pearson's correlation coefficient and Spearman's rank order correlation coefficient are calculated for the results obtained on these datasets

  26. Results and Testing • Pearson's correlation coefficient formula :- r = Σ (xi – x̄)(yi – ȳ) / √( Σ (xi – x̄)² · Σ (yi – ȳ)² ) • Results :-

  27. Results and Testing • Spearman's rank correlation coefficient formula :- ρ = 1 – 6 Σ di² / ( n (n² – 1) ) • di = xi − yi is the difference between the ranks of the values Xi and Yi • All entries with 0 values are removed for this • Results :-
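
Both coefficients can be computed with scipy.stats; the human and computed scores below are made-up placeholders, not the project's results:

```python
from scipy.stats import pearsonr, spearmanr  # pip install scipy

# Hypothetical scores: human judgements vs. computed relatedness for five word pairs.
human    = [3.92, 3.84, 0.42, 1.68, 2.37]
computed = [0.81, 0.75, 0.10, 0.30, 0.55]

pearson_r, _ = pearsonr(human, computed)      # linear correlation
spearman_rho, _ = spearmanr(human, computed)  # rank-order correlation
print(round(pearson_r, 3), round(spearman_rho, 3))
```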

  28. Conclusion • The coverage of Wiktionary is very high for the datasets • Pearson's and Spearman's correlation coefficients are lower for the second method, which uses Page Rank • Entries are still in a nascent stage, with no well-defined and symmetric link structure • Entries are not properly authored and edited • For the tougher datasets Fin1 and Fin2 the score is low • The second method will improve once the link structure and the content of articles improve

  29. Conclusion (contd.) • Semantic Relatedness between words can be used to solve word sense disambiguation, word choice problems, etc. • We have seen the features of Wiktionary and measures for calculating semantic relatedness between words • We have studied the concept of Page Rank and its application in calculating semantic relatedness • The results show that Wiktionary is a good and emerging semantic resource which will improve in the future

  30. Bibliography • Using Wiktionary for Computing Semantic Relatedness • Alexander Budanitsky, Graeme Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness, 2006 • Philip Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI 1995 • Siddharth Patwardhan, Satanjeev Banerjee, Ted Pedersen. Using Measures of Semantic Relatedness for Word Sense Disambiguation, 2003 • Wikimedia Foundation. Wikipedia, www.wikipedia.com • Philip Resnik, Mona Diab. Measuring Verb Similarity, 2000 • Larry Page, Sergey Brin. The PageRank Citation Ranking: Bringing Order to the Web, 1998

  31. Bibliography • Wikimedia Foundation. Wiktionary, www.wiktionary.com • Siddharth Patwardhan, Ted Pedersen. Using WordNet-Based Concept Vectors to Estimate the Semantic Relatedness of Concepts, 2006 • Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, Eytan Ruppin. Placing search in context: The concept revisited, ACM TOIS, 2002 • Herbert Rubenstein, John B. Goodenough. Contextual correlates of synonymy, 1965 • George A. Miller, Walter G. Charles. Contextual correlates of semantic similarity, 1991 • Jay J. Jiang, David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy, ROCLING 1997
