The Google Similarity Distance

The Google Similarity Distance • Presenter: Chien-Hsing Chen • Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi • TKDE, 2007

Outline • Motivation • Objective • NGD • Experiments • Conclusions • Personal Opinion

Motivation • Designing structures capable of manipulating knowledge is very costly • These structures must be filled with high-quality content by knowledgeable human experts • Such efforts are long-running and large-scale

Objective • The authors develop a method that uses only the name of an object to obtain knowledge about the similarity of objects • By contrast, regular formal concept analysis (FCA), as used in ontology work, derives similarity from objects and their attributes

The Google Similarity Distance Kolmogorov complexity

The Google Similarity Distance • NGD(horse, rider) = 0.443 • “horse”: 46,700,000 pages • “rider”: 12,200,000 pages • “horse, rider”: 2,630,000 pages • N = 8,058,044,651 indexed pages • Further values: NGD(pepsi, cola) = 0.797 • NGD(賓拉登 "bin Laden", 攻擊 "attack") = 0.64 • NGD(horse, rider) = 0.898 • NGD(book, drink) = 0.694 • NGD(web, network) = 0.2768
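The horse/rider value above can be checked against the NGD formula from the paper, NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), where f(x) is the number of pages containing x and f(x, y) the number containing both terms. The following is a minimal sketch: the function name `ngd` and its parameter names are illustrative choices, and the counts are the ones listed on the slide.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance computed from raw page counts.

    fx, fy -- number of pages containing each term on its own
    fxy    -- number of pages containing both terms
    n      -- total number of indexed pages (the normalizing constant N)
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Counts for "horse", "rider", and "horse, rider" as given on the slide.
horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(horse_rider, 3))  # 0.443
```

The base of the logarithm cancels in the ratio, so natural logarithms give the same result as base 2.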

Applications and Experiments • Hierarchical Clustering • Given a set of objects in a space equipped with a distance measure, the distance matrix has as its entries the pairwise distances between the objects.
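The distance matrix described above can be sketched as follows. The helper name `distance_matrix` is hypothetical, and the distance used in the demonstration is a stand-in on plain numbers rather than NGD; any symmetric distance function (such as NGD over page counts) could be passed in its place.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def distance_matrix(
    objects: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[float]]:
    """Return the matrix whose (i, j) entry is dist(objects[i], objects[j])."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(objects[i], objects[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the distance is assumed to be symmetric
    return matrix


# Stand-in distance on numbers, used only to show the matrix layout.
demo = distance_matrix([1.0, 4.0, 6.0], lambda a, b: abs(a - b))
for row in demo:
    print(row)
```

A hierarchical clustering method then takes this matrix as its only input; the objects themselves are no longer needed.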

Applications and Experiments • Hierarchical Clustering • Dataset: 17th-century painters

Applications and Experiments • SVM-NGD Learning • The authors use six anchor words a1, …, a6 to convert each of the 40 training words w1, …, w40 into a 6-dimensional training vector v1, …, v40. • The entry vj,i of vj = (vj,1, …, vj,6) is defined as vj,i = NGD(wj, ai) (1 ≤ j ≤ 40, 1 ≤ i ≤ 6)
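The conversion from words to training vectors can be sketched as below. The names `to_training_vectors` and `toy_distance` are hypothetical; `toy_distance` is a placeholder based on string length, not NGD, and is there only so the example runs without querying a search engine. In practice the `distance` argument would be an NGD function backed by page counts.

```python
from typing import Callable, List, Sequence


def to_training_vectors(
    words: Sequence[str],
    anchors: Sequence[str],
    distance: Callable[[str, str], float],
) -> List[List[float]]:
    """Map each word w_j to the vector (distance(w_j, a_1), ..., distance(w_j, a_k))."""
    return [[distance(word, anchor) for anchor in anchors] for word in words]


def toy_distance(word: str, anchor: str) -> float:
    """Placeholder distance (NOT NGD): relative difference in string length."""
    return abs(len(word) - len(anchor)) / max(len(word), len(anchor))


# Two words and two anchors give two 2-dimensional vectors;
# 40 words and 6 anchors would give forty 6-dimensional vectors.
vectors = to_training_vectors(["horse", "rider"], ["web", "network"], toy_distance)
print(len(vectors), len(vectors[0]))  # 2 2
```

The resulting vectors, together with their labels, are what an SVM would be trained on.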

NGD Translation

Comparison to WordNet semantics • 100 semantic categories were selected at random from the WordNet database • For each category, an SVM is trained on 50 labeled training samples; the positive examples come from WordNet, the others from a dictionary • Each experiment uses six anchors in total: three from WordNet and three from a dictionary • The test set consists of 20 new examples • 100 experiments were run • The authors ignore false negatives

Conclusion • The WordNet knowledge base was created over the course of decades by paid human experts. • Google has already indexed more than 8 billion pages and shows no signs of slowing down. • The estimate of 8 billion indexed pages dates from 2004.

Opinion • Advantage • The Google search engine has recently gained recognition as a basis for similarity measurement. • Drawback • How the anchors are chosen; the accuracy measure ignores false negatives • NGD is nothing novel, merely a straightforward demonstration • Application
