The Google Similarity Distance

We've been talking about natural language parsing. Understanding the meaning of a sentence requires knowing the relationships between words, e.g.
• house -> square
• house -> home
• house -> rooms
There are many of these in our language!

There are ongoing attempts to build databases of these relationships, but they are time- and labour-intensive. The Web is the largest text database on Earth, and it contains low-grade information in abundance. There are two kinds of objects about which knowledge can be obtained: actual objects (a graph) and names of objects ("a graph"). Actual objects can be compared for similarity through their features. Names of objects can be compared for similarity through 'Google semantics', i.e. how they occur together on the Web.

The idea: Define a new kind of semantics understandable by a computer.
• Google semantics of a word: the content of the pages returned for a query on that word.
• For a pair of words: the pages returned after querying the words singly, and then together.
• Semantics is the context in which the words appear.
• Links from the pages to additional context are ignored.
• This only identifies associations, not similarity of meaning. For example, "rich" and "poor" will often occur together.

The method: Count how many pages are returned by Google for "monkey", "president" and "monkey president".
• monkey: 74,200,000
• president: 363,000,000
• monkey president: 2,230,000

The Google distribution: The set of pages returned for a word x is the event x. The set of pages returned for the words x and y together is the event x∩y. The probability L of an event is its page count divided by the total number of pages (8 × 10^9).
• L(monkey) = 74,200,000 / (8 × 10^9) = 0.009275
• L(president) = 363,000,000 / (8 × 10^9) = 0.045375
• L(monkey∩president) = 2,230,000 / (8 × 10^9) = 0.00027875
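The probabilities above are simple ratios, so they are easy to reproduce. The following is a minimal sketch in Python using the page counts quoted in these slides; the names PAGE_COUNTS, TOTAL_PAGES and page_probability are illustrative choices, not part of the original method, and the counts are hard-coded rather than fetched from a search engine.

```python
# Page counts quoted in the slides (hard-coded, not queried live).
PAGE_COUNTS = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}

# Total number of indexed pages assumed in the slides: 8 x 10^9.
TOTAL_PAGES = 8 * 10**9


def page_probability(count, total=TOTAL_PAGES):
    """Return the fraction of all pages that match a query."""
    return count / total


# Probability L for each query: page count divided by total pages.
probabilities = {
    query: page_probability(count) for query, count in PAGE_COUNTS.items()
}

for query, p in probabilities.items():
    print(f"L({query}) = {p:.8f}")
```

Running the sketch should print the three values listed above, which confirms the arithmetic on the slide.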

Normalisation:
• The values are normalised to produce a normalised Google distance (NGD).
• N = the sum of the three page counts: 74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
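The normalising constant N can be checked with a few lines of Python. The slides define only N itself; dividing each count by N, as done in the last step of this sketch, is an assumption about how the normalised values are meant to be obtained. The variable names are illustrative.

```python
# Page counts quoted in the slides (hard-coded, not queried live).
page_counts = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}

# N as defined in the slides: the sum of the three page counts.
N = sum(page_counts.values())
print(f"N = {N:,}")

# ASSUMPTION: each count is normalised by dividing it by N, so that the
# three normalised values sum to 1. The slides do not spell this step out.
normalised = {query: count / N for query, count in page_counts.items()}

for query, value in normalised.items():
    print(f"{query}: {value:.6f}")
```

Under this assumption the three normalised values behave like a probability distribution over the three queries, which is different from dividing by the total number of indexed pages.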
