
The Google Similarity Distance


Presentation Transcript


  1. The Google Similarity Distance. We’ve been talking about natural language parsing. Understanding the meaning of a sentence requires knowing relationships between words, e.g. house -> square, house -> home, house -> rooms. There are many of these in our language!

  2. There are ongoing attempts to build databases of these relationships, but they are time- and labour-intensive. The Web is the largest text database on Earth, and it contains low-grade information in abundance. There are two kinds of objects about which knowledge can be attained: actual objects (a graph) and names of objects (“a graph”). Actual objects can be compared for similarity through their features. Names of objects can be compared for similarity through ‘Google semantics’, i.e. how they occur together on the Web.

  3. The Idea: Define a new kind of semantics understandable by a computer. Google semantics: the content of the pages returned for a query on a word. For a pair of words: the pages returned after querying the words singly, and then together. The semantics is the context in which the words appear. Links from the pages to additional context are ignored. This only identifies associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.

  4. The method: Count how many pages are returned by Google for “monkey”, “president” and “monkey president”. Monkey: 74,200,000; President: 363,000,000; Monkey president: 2,230,000.
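
A minimal Python sketch of this counting step. The helper page_count is hypothetical: Google does not expose exact hit counts through a public API, so here it simply returns the figures quoted on the slide.

```python
# Minimal sketch of the counting step. `page_count` is a hypothetical helper:
# a real implementation would query whatever search API is available, but
# here it just returns the hit counts quoted on the slide.
def page_count(query: str) -> int:
    counts = {
        "monkey": 74_200_000,
        "president": 363_000_000,
        "monkey president": 2_230_000,  # pages containing both terms
    }
    return counts[query]

f_x = page_count("monkey")             # f(x)
f_y = page_count("president")          # f(y)
f_xy = page_count("monkey president")  # f(x, y)
```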

  5. The Google Distribution: The number of pages returned for a word x is event x. The number of pages returned for words x and y together is event x∩y. Probability L of monkey is 74,200,000 / total number of pages (8×10^9) = 0.009275. Probability L of president is 363,000,000 / total number of pages = 0.045375. Probability L of monkey∩president is 2,230,000 / total number of pages = 0.00027875.
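
The arithmetic on this slide, assuming a total index size of 8×10^9 pages:

```python
# Worked arithmetic for the slide's figures: each probability is a page count
# divided by the assumed total index size of 8 * 10**9 pages.
TOTAL_PAGES = 8 * 10**9

p_monkey = 74_200_000 / TOTAL_PAGES               # 0.009275
p_president = 363_000_000 / TOTAL_PAGES           # 0.045375
p_monkey_and_president = 2_230_000 / TOTAL_PAGES  # 0.00027875
```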

  6. Normalisation: • The values are normalised to produce the normalised Google distance (NGD). • N = the sum of the three sets: 74,200,000 + 363,000,000 + 2,230,000 = 439,430,000.
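
The slide does not spell out the formula itself; for reference, the usual NGD definition from Cilibrasi and Vitányi is NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}). The sketch below applies that formula to the slide's counts, using the slide's choice of N (the sum of the three counts); note that the original paper takes N to be the total number of indexed pages instead.

```python
import math

# NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
#             / (log N - min(log f(x), log f(y)))
# Counts from the slides; N follows the slide's definition (sum of the three
# counts), not the paper's total-index-size definition.
f_x, f_y, f_xy = 74_200_000, 363_000_000, 2_230_000
N = f_x + f_y + f_xy  # 439,430,000

ngd = (max(math.log(f_x), math.log(f_y)) - math.log(f_xy)) / (
    math.log(N) - min(math.log(f_x), math.log(f_y))
)
print(ngd)  # roughly 2.86 with this choice of N
```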
