
Comparing Word Relatedness Measures Based on Google n-grams



Presentation Transcript


  1. Comparing Word Relatedness Measures Based on Google n-grams Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ Faculty of Computer Science Dalhousie University, Halifax, Canada islam@cs.dal.ca, eem@cs.dal.ca, vlado@cs.dal.ca COLING 2012

  2. Introduction • Word relatedness has a wide range of applications • IR: image retrieval, query expansion… • Paraphrase recognition • Malapropism detection and correction • Automatic creation of thesauri • Speech recognition • …

  3. Introduction • Methods can be categorized into 3 classes: • Corpus-based • Supervised • Unsupervised • Knowledge-based • Built on semantic resources • Hybrid

  4. Introduction • This paper focuses on unsupervised corpus-based measures • 6 measures are compared

  5. Problem • Unsupervised corpus-based measures usually use co-occurrence statistics, mostly word n-grams and their frequencies • These co-occurrence statistics are corpus-specific • Most corpora do not come with pre-computed co-occurrence statistics, so they cannot be used on-line • Some measures use web search results, but those results vary over time

  6. Motivation • How can different measures be compared fairly? • Observation • All of them use co-occurrence statistics • A corpus that comes with co-occurrence information, e.g. Google n-grams, is probably a good common resource

  7. Google N-Grams • A publicly available corpus with • Co-occurrence statistics (uni-grams to 5-grams) • A large volume of digitized books: over 5.2 million books published since 1500 • Data format: • ngram year match_count volume_count • e.g.: • analysis is often described as 1991 1 1 1
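A few lines of code suffice to aggregate such records into per-n-gram frequencies; a minimal sketch, assuming tab-separated fields as in the released corpus files and a hypothetical local file name:

```python
# Minimal sketch: sum match_count over years for each n-gram, following the
# "ngram <TAB> year <TAB> match_count <TAB> volume_count" format shown above.
# The file name below is hypothetical.
from collections import defaultdict

def load_ngram_counts(path):
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            counts[ngram] += int(match_count)  # aggregate over publication years
    return counts

# counts = load_ngram_counts("googlebooks-eng-5gram-sample.tsv")
# counts["analysis is often described as"]  -> total frequency across all years
```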

  8. Another Motivation • To find an indirect mapping between Google n-grams and web search results • Thus, the measures might be used on-line

  9. How About WordNet? • In 2006, Budanitsky and Hirst evaluated 5 knowledge-based measures using WordNet • Creating a resource like WordNet requires a lot of effort • Its word coverage is not sufficient for many NLP tasks • Such a resource is language-specific, while the Google n-gram corpus covers more than 10 languages

  10. Notations • C(w1 … wn): frequency of the n-gram w1 … wn • D(w1 … wn): number of web documents containing the n-gram (up to 5-grams) • M(w1, w2): C(w1 wi w2)

  11. Notations • μ(w1, w2): ½ [ C(w1 wi w2) + C(w2 wi w1) ] • N: number of documents used in Google n-grams • |V|: number of uni-grams in Google n-grams • Cmax: maximum n-gram frequency in Google n-grams

  12. Assumptions • Some measures use web search results and document co-occurrence counts that the Google n-gram corpus does not provide directly, but • C(w1) ≥ D(w1) • C(w1 w2) ≥ D(w1 w2) • This is because a uni-gram or bi-gram may occur multiple times in a single document

  13. Assumptions • Taking the lower limits as approximations • C(w1) ≈ D(w1) • C(w1 w2) ≈ D(w1 w2)

  14. Measures • Jaccard Coefficient • Simpson Coefficient

  15. Measures • Dice Coefficient • Pointwise Mutual Information
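In the document-count notation of slides 10–13 (D(·) for document counts, N for the number of documents), the standard forms of these four coefficients can be sketched as below; the paper's exact variants, shown as formulas on the slides, may differ slightly.

```python
import math

# Textbook forms of four co-occurrence measures, written with
# d1 = D(w1), d2 = D(w2), d12 = D(w1 w2), n = N. These are standard
# definitions, not necessarily the exact variants used in the paper.

def jaccard(d1, d2, d12):
    denom = d1 + d2 - d12
    return d12 / denom if denom > 0 else 0.0

def simpson(d1, d2, d12):  # also known as the overlap coefficient
    denom = min(d1, d2)
    return d12 / denom if denom > 0 else 0.0

def dice(d1, d2, d12):
    denom = d1 + d2
    return 2 * d12 / denom if denom > 0 else 0.0

def pmi(d1, d2, d12, n):
    # Pointwise mutual information: log of observed vs. expected co-occurrence.
    if d1 == 0 or d2 == 0 or d12 == 0:
        return 0.0
    return math.log((d12 * n) / (d1 * d2), 2)
```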

  16. Measures • Normalized Google Distance (NGD) variation
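The variation on this slide is derived from the standard Normalized Google Distance; a sketch of the standard formula in the same document-count notation follows. Converting the distance to a relatedness score via exp(−2·NGD) is one common choice, not necessarily the variation used here.

```python
import math

def ngd(d1, d2, d12, n):
    # Standard Normalized Google Distance from document counts d1, d2, d12
    # and corpus size n; smaller values mean more related words.
    if d1 == 0 or d2 == 0 or d12 == 0:
        return float("inf")
    lx, ly, lxy, ln = math.log(d1), math.log(d2), math.log(d12), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

def ngd_relatedness(d1, d2, d12, n):
    # One common way to map the distance to a [0, 1] relatedness score.
    dist = ngd(d1, d2, d12, n)
    return 0.0 if math.isinf(dist) else math.exp(-2 * dist)
```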

  17. Measures • Relatedness based on Tri-grams (RT)

  18. Evaluation • Compare with human judgments • Human judgment is considered the upper limit • Evaluate the measures with respect to a particular application • Word relatedness (synonym questions) • Text similarity

  19. Compare With Human Judgments • Rubenstein and Goodenough's 65 Word Pairs • 51 people rated 65 English word pairs on a scale of 0.0 to 4.0 • Miller and Charles' 28 Noun Pairs • Restricted R&G's set to 30 pairs, rated by 38 human judges • Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
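Evaluation against these data sets reduces to correlating a measure's scores with the mean human ratings; a minimal sketch using SciPy, with illustrative numbers rather than the actual R&G or M&C data:

```python
# Sketch of the comparison with human judgments: correlate the scores a
# relatedness measure assigns to the word pairs with the mean human ratings.
# The numbers below are illustrative placeholders, not the real data sets.
from scipy.stats import pearsonr

human_ratings = [3.92, 3.84, 0.04]   # mean human ratings for three pairs
measure_scores = [0.91, 0.88, 0.02]  # scores from one of the measures above

r, p = pearsonr(measure_scores, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```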

  20. Result

  21. Result

  22. Application-based Evaluation • TOEFL's 80 Synonym Questions • Given a problem word, e.g. infinite, and four alternative words, limitless, relative, unusual, structural, choose the most related word • ESL's 50 Synonym Questions • Same as the TOEFL 80 synonym questions task • Except that the synonym questions come from English as a Second Language tests
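With any of the measures above, answering such a question amounts to picking the alternative most related to the problem word; a minimal sketch using the slide's example, where the toy scores stand in for a real relatedness function:

```python
# Sketch: answer a TOEFL-style synonym question by choosing the alternative
# with the highest relatedness to the problem word. The relatedness function
# is a stand-in for any of the measures above.
def answer_synonym_question(problem, alternatives, relatedness):
    return max(alternatives, key=lambda alt: relatedness(problem, alt))

# Example from the slide; the scores below are illustrative only.
toy_scores = {("infinite", "limitless"): 0.80, ("infinite", "relative"): 0.20,
              ("infinite", "unusual"): 0.10, ("infinite", "structural"): 0.05}
pick = answer_synonym_question(
    "infinite", ["limitless", "relative", "unusual", "structural"],
    lambda a, b: toy_scores.get((a, b), 0.0))
print(pick)  # -> "limitless"
```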

  23. Result

  24. Result

  25. Text Similarity • Find the similarity between two text items • Plug the different word relatedness measures into a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set • 30 sentence pairs from one of the most widely used data sets were used
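One simple way to turn a word relatedness measure into a sentence-level score is to match each word of the shorter sentence with its most related word in the other sentence and average those scores; this is only a rough sketch of the idea, not the specific text similarity method of Islam and Inkpen (2008).

```python
# Rough sketch: sentence similarity from word relatedness. Each word of the
# shorter sentence is matched with its most related word in the other
# sentence, and the matched scores are averaged. The relatedness function is
# a stand-in for any of the measures above.
def text_similarity(tokens_a, tokens_b, relatedness):
    short, long_ = sorted((tokens_a, tokens_b), key=len)
    if not short:
        return 0.0
    best = [max(relatedness(w, v) for v in long_) for w in short]
    return sum(best) / len(best)
```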

  26. Result

  27. Result • Pearson correlation coefficient with mean human similarity ratings: • Ho et al. (2010), who used a WordNet-based measure and applied its scores within Islam and Inkpen (2008)'s method, achieved 0.895 • Tsatsaronis et al. (2010) achieved 0.856 • Islam et al. (2012) achieved 0.916 • The improvement over Ho et al. (2010) is statistically significant at the 0.05 level

  28. Conclusion • Any measure that uses n-gram statistics can readily apply the Google n-gram corpus and be fairly evaluated against existing work on standard data sets for different tasks • Found an indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine using a few assumptions

  29. Conclusion • Measures based on n-grams are language-independent • They can be implemented for any other language that has a sufficiently large n-gram corpus
