The Google Similarity Distance






We’ve been talking about natural language parsing.

Understanding the meaning of a sentence requires knowing the relationships between words, e.g.

house -> square

house -> home

house -> rooms

There are many of these in our language!


There are ongoing attempts to build databases of these relationships. Building them is time- and labour-intensive.

The Web is the largest text database on Earth. It contains low-grade information in abundance.

There are two kinds of objects about which knowledge can be obtained: actual objects (a graph) and names of objects (“a graph”).

Actual objects can be compared for similarity through features.

Names of objects can be compared for similarity through ‘Google semantics’, i.e. how often they occur together on the web.


The Idea:

Define a new kind of semantics understandable by a computer.

Google semantics of a word: the content of the pages returned when the word is queried.

For a pair of words: the pages returned when each word is queried singly, and then when both are queried together.

Semantics is the context in which the words appear. Links from the pages to additional context are ignored.

This only identifies associations, not similarity of meaning. For example, “rich” and “poor” often occur together.


The method:

Count how many pages are returned by Google for “monkey”, “president” and “monkey president”.

Monkey: 74,200,000

President: 363,000,000

Monkey president: 2,230,000
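
The counting step can be sketched on a toy scale. The example below assumes a tiny in-memory list of invented "pages" in place of Google's index; the names PAGES and page_count are hypothetical and only illustrate counting pages for each word singly and then for both words together.

```python
# Toy stand-in for a search engine index: each string is one "page".
# These pages are invented for illustration only.
PAGES = [
    "the monkey climbed the tree",
    "the president gave a speech",
    "a monkey met the president at the zoo",
    "rooms in the house",
]


def page_count(*terms: str) -> int:
    """Count the pages that contain every one of the given terms."""
    return sum(
        1
        for page in PAGES
        if all(term in page.split() for term in terms)
    )


monkey = page_count("monkey")                      # pages mentioning "monkey"
president = page_count("president")                # pages mentioning "president"
together = page_count("monkey", "president")       # pages mentioning both
print(monkey, president, together)
```

A real search engine reports estimated hit counts rather than exact ones, so the figures on the slide should be read as approximate.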


The Google Distribution:

The set of pages returned for a word x is the event x; its size is the number of pages returned.

The set of pages returned for the words x and y together is the event x∩y.

Probability L of monkey is

74,200,000 / total number of pages (8 × 10^9) = 0.009275

Probability L of president is

363,000,000 / total number of pages = 0.045375

Probability L of monkey∩president is

2,230,000 / total number of pages = 0.00027875
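
The arithmetic above can be checked with a short script. This is a minimal sketch using the page counts and the 8 × 10^9 total quoted on the slide; the names COUNTS, TOTAL_PAGES and probability are illustrative choices, not part of the original presentation.

```python
# Page counts quoted on the slide.
COUNTS = {
    "monkey": 74_200_000,
    "president": 363_000_000,
    "monkey president": 2_230_000,
}

# Total number of indexed pages assumed on the slide (8 x 10^9).
TOTAL_PAGES = 8_000_000_000


def probability(count: int, total: int = TOTAL_PAGES) -> float:
    """Fraction of all indexed pages that match a query."""
    return count / total


for query, count in COUNTS.items():
    print(f"L({query}) = {probability(count):.8f}")
```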

Normalisation:
  • The values are normalised to produce a normalised Google distance (NGD).
  • N = the sum of the three page counts:

74,200,000 + 363,000,000 + 2,230,000 = 439,430,000
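
The normalisation step can be sketched as follows. Two assumptions are made here that the slide does not state: that "normalising" means dividing each page count by N, and that the distance is computed with the NGD formula published by Cilibrasi and Vitányi, (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))). The function and variable names are hypothetical.

```python
import math

f_x = 74_200_000     # pages for "monkey"
f_y = 363_000_000    # pages for "president"
f_xy = 2_230_000     # pages for "monkey president"

# Normalising constant as defined on the slide: the sum of the three counts.
N = f_x + f_y + f_xy

# Assumption: each count is divided by N so the three values sum to 1.
g_x, g_y, g_xy = f_x / N, f_y / N, f_xy / N


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """NGD per the published formula (an assumption; not given on the slide)."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


print(N)
print(g_x, g_y, g_xy)
print(ngd(f_x, f_y, f_xy, N))
```

Words that always appear together get a distance of 0, and the value grows as the words co-occur less often relative to how often they appear on their own.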
