The Google Similarity Distance
We’ve been talking about Natural Language parsing
Understanding the meaning in a sentence requires knowing relationships between words
e.g. house -> square
house -> home
house -> rooms
There are many of these in our language!
There are ongoing attempts to build databases of these relationships. These efforts are time- and labour-intensive.
The Web is the largest text database on Earth. It contains low-grade information in abundance.
There are two kinds of objects about which knowledge can be obtained: actual objects (a graph) and names of objects (“a graph”).
Actual objects can be compared for similarity through features.
Names of objects can be compared for similarity through ‘Google semantics’, i.e. how they occur together on the Web.
Define a new kind of semantics understandable by a computer.
Google semantics: content of the pages returned for a query on a word.
For a pair of words: the pages after querying the words singly, and then together.
Semantics is the context in which the words appear. Links from the pages to additional context are ignored.
This only identifies associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.
Count how many pages are returned by Google for “monkey”, “president” and “monkey president”.
Monkey: 74,200,000
President: 363,000,000
Monkey president: 2,230,000
The set of pages returned for a word x is the event x; the number of pages is its size.
The set of pages returned for words x and y together is the event x∩y.
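A minimal sketch of these definitions, assuming a toy "web" of four hand-written pages (made-up data, not real search results). Each event is treated as the set of pages containing a word, and its size is the page count. The helper name `event` is hypothetical and chosen only for illustration.

```python
# Toy stand-in for the web: each string is one "page" (made-up data).
PAGES = [
    "the monkey climbed the tree",
    "the president gave a speech",
    "a monkey met the president",
    "rooms in a house",
]


def event(term, pages=PAGES):
    """Return the set of page indices whose text contains `term` as a word."""
    return {i for i, text in enumerate(pages) if term in text.split()}


x = event("monkey")        # pages containing "monkey"
y = event("president")     # pages containing "president"
both = x & y               # pages containing both words: the event x∩y

# Page counts for x, y and x∩y.
print(len(x), len(y), len(both))
```

Representing events as sets makes "the words together" a plain set intersection, which is why the joint page count can never exceed either single-word count.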
Probability L of monkey is
74,200,000 / total number of pages (8 × 10^9)
Probability L of president is
363,000,000 / total number of pages
Probability L of monkey∩president is
2,230,000 / total number of pages
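The three probabilities can be worked out directly from the page counts quoted in these slides. The sketch below assumes the total of 8 × 10^9 pages given above; the names `TOTAL_PAGES`, `COUNTS` and `probability` are illustrative only.

```python
# Page counts quoted in the slides.
TOTAL_PAGES = 8e9  # total number of pages, 8 x 10^9
COUNTS = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}


def probability(count, total=TOTAL_PAGES):
    """Fraction of all pages that a query returns."""
    return count / total


# L maps each query to its probability.
L = {query: probability(n) for query, n in COUNTS.items()}

for query, p in L.items():
    print(f"L({query}) = {p:.8f}")
```

With these figures, L of monkey is about 0.0093, L of president is about 0.0454, and L of monkey∩president is about 0.00028.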
74,200,000 + 363,000,000 + 2,230,000 = 439,430,000