Varun Rao

Algorithmic Information Theory, Similarity Metrics and Google

Algorithmic Information Theory

  • Kolmogorov Complexity

  • Information Distance

  • Normalized Information Distance

  • Normalized Compression Distance

  • Normalized Google Distance

Kolmogorov Complexity

  • The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program of the fixed reference computing system that produces x as output. [1]

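  • Example: the first million bits of π vs. the first million bits of your favourite song recording. A short program generates the bits of π, so their complexity is low; the recording has no comparably short description.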
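
Kolmogorov complexity itself cannot be computed, but the output size of any lossless compressor gives a crude upper bound on it. The sketch below assumes Python's standard `bz2` module as the compressor; the two test strings and the helper name `compressed_bits` are illustrative choices, not part of the original slides.

```python
import bz2
import random

def compressed_bits(data: bytes) -> int:
    """Length in bits of the bz2-compressed form: a crude upper bound on K(data)."""
    return 8 * len(bz2.compress(data))

n = 100_000

# A string produced by a very short rule: the same ten bytes repeated.
regular = b"0123456789" * (n // 10)

# Bytes with no pattern the compressor can find (fixed seed for repeatability).
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(n))

regular_bits = compressed_bits(regular)
noisy_bits = compressed_bits(noisy)
print("regular:", regular_bits, "bits")
print("noisy:  ", noisy_bits, "bits")
```

Note that the "noisy" bytes come from a seeded generator, so, like the bits of π, they are in fact produced by a short program. The compressor does not discover that program, which is why compressed size is only an upper bound on the true complexity.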

Information Distance

  • Given two strings x and y, Information Distance is the length of the shortest binary program that computes output y from input x, and also output x from input y. [1]

  • ID minorizes all other computable distance metrics: up to an additive constant, it is never larger than any of them

Normalized Information Distance

  • Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea being that if two pieces are more similar, then we can more succinctly describe one given the other. [2]

  • NID is characterized as the most informative metric

  • Sadly, NID is completely and utterly uncomputable, because it is defined in terms of Kolmogorov complexity
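
For reference, the two distances above are usually written as follows in the Bennett et al. and Li et al. papers listed in the references. Here K(x) is the Kolmogorov complexity of x and K(x|y) is the conditional complexity of x given y, a symbol the slides do not introduce. The first identity holds only up to an additive logarithmic term, so treat this as a summary rather than a precise statement.

```latex
% Information distance: length of the shortest program converting x to y and y to x
E(x, y) = \max\{K(x \mid y),\; K(y \mid x)\} + O(\log \max\{K(x \mid y),\; K(y \mid x)\})

% Normalized information distance
\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\; K(y \mid x)\}}{\max\{K(x),\; K(y)\}}
```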

Normalized Compression Distance

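  • But we do have (lossless) compressors

  • If C is a compressor and C(x) is the compressed length of a string x, then

    $$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$

    where xy is the concatenation of x and y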

  • NCD gets closer to NID as the compressor approximates the ultimate compression, Kolmogorov complexity

Normalized Compression Distance II

  • Basic process to compute NCD for x and y

    • Use the compressor to compute C(x) and C(y)

    • Concatenate x and y and compute C(xy)

    • Combine the three compressed lengths into NCD(x, y)

  • Feed the pairwise NCD values to a relatively simple clustering method to group similar strings
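
The process above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard `bz2` module as the compressor C and the usual form NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}. The synthetic "files", the helper names, and the nearest-neighbour grouping are all hypothetical stand-ins; the slides do not say which clustering method was used.

```python
import bz2
import random

def c(data: bytes) -> int:
    """Compressed length in bytes, playing the role of C(.) in the slides."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def synthetic_text(seed: int, length: int = 3000) -> bytes:
    """Hypothetical stand-in for a real file: seeded pseudo-random lowercase letters."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"abcdefghijklmnopqrstuvwxyz") for _ in range(length))

def lightly_edited(data: bytes, step: int = 100) -> bytes:
    """Copy of data with every step-th byte overwritten, i.e. a near-duplicate."""
    out = bytearray(data)
    for i in range(0, len(out), step):
        out[i] = ord("#")
    return bytes(out)

a = synthetic_text(1)
b = synthetic_text(2)
items = {
    "a": a,
    "a_edited": lightly_edited(a),
    "b": b,
    "b_edited": lightly_edited(b),
}

def nearest(name: str) -> str:
    """Name of the other item with the smallest NCD to items[name]."""
    others = [other for other in items if other != name]
    return min(others, key=lambda other: ncd(items[name], items[other]))

# Simplest possible grouping: pair each item with its nearest neighbour.
for name in items:
    print(name, "->", nearest(name))
```

A real experiment would replace the synthetic strings with file contents and feed the full pairwise distance matrix to a proper clustering method. With real compressors NCD(x, x) is small but not exactly zero, because compressing xx costs slightly more than compressing x.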

Normalized Compression Distance III

  • Using Bzip2 on various types of files

Normalized Compression Distance IV

  • The evolutionary tree built from complete mammalian mtDNA sequences of 24 species [2]

Normalized Compression Distance V

  • Clustering of Native-American, Native-African, and Native-European languages (translations of The Universal Declaration of Human Rights) [2]

Normalized Compression Distance VI

  • Optical Character Recognition using NCD. With more complex clustering techniques, NCD achieved an 85% success rate, compared with the industry standard of 90%–95% [2]

What about Semantic Meaning?

  • Or what about how different a horse is from a car, or a hawk from a handsaw for that matter?

  • Compressors are semantically indifferent to their data

  • To insert semantic relationships, turn to Google

Google


  • Massive database, containing lots of information about semantic relationships

  • The Quick Brown ___ ?

  • Use simple page counts as indicators of closeness

  • Use the relative number of hits as a measure of probability to create a Google Distribution, i.e. p(x) = (number of hits for a search on x) / (total number of pages indexed)

Google II

  • Given that we can construct a distribution, we can (conceptually) construct a Google Shannon-Fano code, because the Kraft inequality applies (after some normalization)

    .... ???
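
To make the step from distribution to code a little less mysterious, here is a toy sketch. It assumes hypothetical page counts for a four-word vocabulary and normalizes by their sum, which is a simplification of the normalization used in the Google paper. Each term then gets a Shannon-Fano style code length of ceil(log2(1/p(x))) bits, and the Kraft inequality confirms that a prefix code with those lengths exists.

```python
import math

# Hypothetical page counts for a toy vocabulary; these are not real search results.
hits = {"horse": 4_000, "rider": 1_000, "car": 2_500, "handsaw": 500}

# Simplified normalizer (an assumption): divide by the sum so probabilities add to 1.
total = sum(hits.values())
p = {term: count / total for term, count in hits.items()}

# Shannon-Fano style code length: ceil(log2(1 / p(x))) bits per term.
code_len = {term: math.ceil(-math.log2(prob)) for term, prob in p.items()}

# Kraft inequality: the sum over x of 2^(-l(x)) must not exceed 1.
kraft_sum = sum(2.0 ** -length for length in code_len.values())

print("probabilities:", p)
print("code lengths: ", code_len)
print("Kraft sum:    ", kraft_sum)
```

Frequent terms get short codewords and rare terms get long ones, which is the sense in which hit counts can be read as description lengths.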

Normalized Google Distance

  • After all that hand waving, we can create a distance-like measure, NGD, that has all kinds of nice properties
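
In Cilibrasi and Vitányi's paper the resulting measure is written NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the number of pages indexed. A minimal sketch follows; the function name and the page counts passed to it are illustrative assumptions, not live query results.

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from page counts.

    fx, fy: pages containing each term alone
    fxy:    pages containing both terms
    n:      total number of pages indexed
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Illustrative counts only (assumed values, not fetched from a search engine).
d = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
print(round(d, 3))
```

Terms that almost always appear together give a value near 0, and terms that rarely co-occur give larger values. The base of the logarithm cancels in the ratio, so natural logs work as well as base 2.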

Applying NGD

  • NGD as applied to 15 painting names by 3 Dutch artists

Applying NGD II

  • Using SVM to learn the concept of primes [2]

Applying NGD III

  • Using SVM to learn “electrical” terms [2]

Applying NGD IV

  • Using SVM to learn “religious” terms [2]

References


  • [1] R. Cilibrasi and P. Vitányi, “Automatic Meaning Discovery Using Google”.

  • [2] R. Cilibrasi and P. Vitányi, “Clustering by Compression”, submitted to IEEE Trans. Information Theory.

  • [3] C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W. Zurek, “Information Distance”, IEEE Trans. Information Theory, 44:4 (1998), 1407–1423.

  • [4] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi, “The Similarity Metric”, IEEE Trans. Information Theory, 50:12 (2004), 3250–3264.

  • [5] “Algorithmic Information Theory”, Wikipedia, accessed 25 January 2005.

  • [6] Greg Harfst, “Kolmogorov Complexity”, accessed 25 January 2005.