The Google Similarity Distance

We’ve been talking about natural language parsing.

Understanding the meaning of a sentence requires knowing the relationships between words.

e.g. house -> square

house -> home

house -> rooms

There are many of these in our language!


There are ongoing attempts to build databases of these relationships. Building them is time- and labour-intensive.

The Web is the largest text database on Earth. It contains low-grade information in abundance.

There are two kinds of objects about which knowledge can be acquired: actual objects (a graph) and names of objects (“a graph”).

Actual objects can be compared for similarity through features.

Names of objects can be compared for similarity through ‘Google Semantics’, i.e. how they occur together on the Web.


The Idea

Define a new kind of semantics understandable by a computer.

Google semantics: content of the pages returned for a query on a word.

For a pair of words: the pages returned when querying each word singly, and then both together.

Semantics is the context in which the words appear. Links from the pages to additional context are ignored.

This identifies only associations, not similarity of meaning. For example, “rich” and “poor” will often occur together.
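To make the querying idea concrete, here is a minimal sketch that counts pages for single words and for a pair. The five-string corpus and the `page_count` helper are hypothetical stand-ins for the Web and for a search engine's hit count; they are not part of the original slides.

```python
# Toy stand-in for the Web: each "page" is one string (hypothetical data).
PAGES = [
    "the rich get richer and the poor get poorer",
    "a rich man and a poor man met at the market",
    "the monkey climbed the tree",
    "the president gave a speech",
    "rich soil helps crops grow",
]


def page_count(*terms):
    """Return the number of pages that contain every one of the given terms."""
    return sum(all(term in page.split() for term in terms) for page in PAGES)


# Query each word singly, then both together.
print(page_count("rich"))          # pages containing "rich"
print(page_count("poor"))          # pages containing "poor"
print(page_count("rich", "poor"))  # pages containing both words
```

In this toy corpus "rich" and "poor" co-occur on most pages that mention either one, so they come out as strongly associated even though their meanings are opposite.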


The Method

Count how many pages are returned by Google for “monkey”, “president” and “monkey president”.

Monkey: 74,200,000

President: 363,000,000

Monkey president: 2,230,000


The Google Distribution

The set of pages returned for a word x is the event x.

The set of pages returned for the words x and y together is the event x∩y.

Probability L of monkey is

74,200,000 / total number of pages (8 × 10^9)

= 0.009275

Probability L of president is

363,000,000 / total number of pages

= 0.045375

Probability L of monkey∩president is

2,230,000 / total number of pages

= 0.00027875
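The arithmetic above can be checked with a short sketch. It assumes the page counts and the total of 8 × 10^9 pages quoted on the slides; the names `TOTAL_PAGES`, `COUNTS` and `probability` are illustrative only.

```python
TOTAL_PAGES = 8e9  # total number of indexed pages, as assumed on the slide

# Page counts quoted on the slides for each query.
COUNTS = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}


def probability(page_count, total=TOTAL_PAGES):
    """Probability L of an event: its page count divided by the total page count."""
    return page_count / total


for query, count in COUNTS.items():
    print(f"{query}: {probability(count):.8f}")
```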


Normalisation

  • The values are normalised to produce a normalised Google distance (NGD).

  • N = the sum of the three page counts:

    74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
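The transcript stops before the NGD formula itself is shown. As a sketch only, the function below assumes the formula published by Cilibrasi and Vitányi, NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f is a page count. The function name `ngd` and its parameter names are illustrative, and N is passed in as the sum computed above.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalised Google distance from page counts.

    fx, fy: page counts for each word alone
    fxy:    page count for both words together
    n:      normalising constant N
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Counts and N taken from the slides.
N = 74_200_000 + 363_000_000 + 2_230_000
print(ngd(74_200_000, 363_000_000, 2_230_000, N))
```

Two words that always appear together (equal counts for x, y and x∩y) get a distance of 0. The result depends strongly on the choice of N, so a value computed with this N is not directly comparable with one computed using the total number of indexed pages.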

