Clustering by Compression

Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)


Overview

  • Input to the software is a set of files

  • Output is a hierarchical clustering shown as an unrooted binary tree

  • This is a case of unsupervised learning

  • (example follows)

Process Overview

  • 1. File translations, if necessary, for example from MIDI to “player-piano” type format.

  • 2. Calculation of Normalized Compression Distance, or NCD.

  • 3. Representation as an unrooted binary tree.

What’s Unique?

  • This clustering system is unique in that it can be described as feature-free

  • There are no parameters to tune, and no domain-specific knowledge went into it.

  • Using general-purpose data compressors gives us a parameterized family of features automatically for each domain

Featureless Clustering

  • Having no parameters and no customized features makes it convenient to develop as well as to use

  • Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions

  • So how does it work?

Midi Translation

  • In order to restrict information entering the algorithm, we removed undesirable MIDI fields such as artist or composer name, headers, and other non-musical data.

  • We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.

Gene sequence translation

  • Genetic sequences are represented in ASCII using a four-letter alphabet: A, T, G, C

  • Almost no translation at all

Image Translation

  • Black and white images are converted to ASCII using spaces for black and # for white

  • Newlines are used to separate rows
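
The image translation described above can be sketched in a few lines. This is a minimal illustration, not the authors' converter: the function name `image_to_ascii` and the input convention (a list of rows, with 0 for a black pixel and 1 for a white pixel) are assumptions made for the example.

```python
def image_to_ascii(pixels):
    """Render a 2D grid of pixels as text, one line per image row.

    Assumption: each pixel is 0 (black) or 1 (white). The slides do not
    say how pixels are stored before conversion.
    """
    lines = []
    for row in pixels:
        # A space stands for black and '#' for white, as described above.
        lines.append("".join("#" if pixel else " " for pixel in row))
    # Newlines separate the rows.
    return "\n".join(lines)


if __name__ == "__main__":
    glyph = [
        [1, 0, 1],
        [0, 0, 0],
        [1, 0, 1],
    ]
    print(image_to_ascii(glyph))
```

The resulting text can then be handed to a compressor like any other file.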

Clustering by compression

  • Once a group of songs has been acquired and translated, a quantity is computed on each pair in the group

  • Normalized Compression Distance measures how different two files are from one another.

  • NCD is based on an earlier idea called Normalized Information Distance.

  • NID uses as its compressor a mathematical abstraction called Kolmogorov complexity, often abbreviated K.

  • K represents a perfect data compressor, and is therefore uncomputable.

  • Since we cannot compute K, we approximate it using real general-purpose file-compressors like gzip, bzip2, winzip, ppmz, and others

  • NCD depends on a particular compressor and NCD with different compressors may give different results for the same pair of objects
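
To make the compressor dependence concrete, the sketch below measures the compressed size of the same data under three compressors. The modules `zlib`, `bz2` and `lzma` from the Python standard library are stand-ins chosen for this example (an assumption; the slides name gzip, bzip2, winzip and ppmz), and `compressed_sizes` is a hypothetical helper name.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the standard library (an assumption; these are
# not necessarily the exact tools used by the authors).
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size of data, in bytes, under each compressor."""
    return {name: len(compress(data)) for name, compress in COMPRESSORS.items()}


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog " * 200
    for name, size in compressed_sizes(sample).items():
        print(f"{name}: {len(sample)} bytes -> {size} bytes")
```

Each compressor reports a different size for the same input, which is why a distance built on these sizes can differ from one compressor to the next.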

  • C(x) means “the compressed size of x”

  • C(xy) means “the compressed size of x and y concatenated”

  • 0 <= NCD(x,y) <= 1 (roughly)
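
For reference, the definition of NCD given in the Clustering by Compression paper, written in the notation above, is:

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

The "roughly" in the bound reflects the fact that real compressors are imperfect, so computed values can land slightly above 1.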

  • NCD measures how similar or different two strings (or equivalently, files) are.

  • NCD(x,x) = 0, because nothing is different from itself

  • NCD(x,y) = 1 means that x and y are completely unrelated

  • Real cases usually give less extreme values
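
These properties can be checked with a small sketch. It assumes the definition from the Clustering by Compression paper, NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and uses `zlib` as a stand-in compressor. The helper names `csize`, `ncd` and `random_bytes` are made up for the example, and the inputs are kept well under zlib's 32 KB window so that a repeated string is still visible to the compressor.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """C(x): compressed size in bytes, with zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, assuming the paper's definition."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_bytes(seed: int, n: int) -> bytes:
    """Reproducible pseudo-random bytes, used here as 'unrelated' data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    x = random_bytes(1, 4096)
    y = random_bytes(2, 4096)
    print("NCD(x, x) =", ncd(x, x))  # should be close to 0, not exactly 0
    print("NCD(x, y) =", ncd(x, y))  # should be close to 1
```

With a real compressor NCD(x,x) is small rather than exactly 0, because compressing the second copy of x still costs a few bytes.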

  • Computing NCD of every song with every other song yields a 2-dimensional symmetric distance matrix

  • Next step is transforming this array of distances into something easier to grasp

  • We use the Quartet Method to construct an unrooted binary tree from the NCD matrix
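
Building the distance matrix can be sketched as follows. This is an illustration under assumptions rather than the authors' software: it uses `zlib` as a stand-in compressor, the helper names `csize` and `ncd_matrix` are hypothetical, the diagonal is set to 0 directly, and symmetry is enforced by computing each pair once and mirroring it (with a real compressor, C(xy) and C(yx) may differ slightly).

```python
import zlib
from itertools import combinations


def csize(data: bytes) -> int:
    """Compressed size in bytes, with zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd_matrix(items):
    """Return an n x n list of lists holding the pairwise NCD values."""
    n = len(items)
    sizes = [csize(item) for item in items]  # C(x) is computed once per item
    matrix = [[0.0] * n for _ in range(n)]   # diagonal stays at 0
    for i, j in combinations(range(n), 2):
        cxy = csize(items[i] + items[j])
        d = (cxy - min(sizes[i], sizes[j])) / max(sizes[i], sizes[j])
        matrix[i][j] = matrix[j][i] = d      # mirror to keep the matrix symmetric
    return matrix


if __name__ == "__main__":
    files = [
        b"do re mi fa sol la ti do " * 40,
        b"do re mi fa sol la ti re " * 40,
        b"ACGTTGCAAGCTTACGGATCCA" * 40,
    ]
    for row in ncd_matrix(files):
        print(" ".join(f"{value:.3f}" for value in row))
```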

Quartet Method

  • Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction popular for the last 30 years

  • The input is a matrix of distances (NCD)

  • The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections.

  • The tree is just one visualization of the NCD matrix
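
The basic building block of a quartet method can be sketched briefly. Four leaves u, v, w, x can be split into two pairs in three ways, and the sketch assumes the cost used in the Clustering by Compression paper, where the pairing uv|wx costs d(u,v) + d(w,x). It covers only this per-quartet scoring; the search for a full tree that agrees with as many low-cost quartets as possible is not shown. All function names are hypothetical.

```python
from itertools import combinations


def quartet_topologies(a, b, c, d):
    """The three ways to split four leaves into two pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def topology_cost(dist, topology):
    """Cost of the pairing uv|wx: d(u, v) + d(w, x) (assumed cost function)."""
    (u, v), (w, x) = topology
    return dist[u][v] + dist[w][x]


def best_topologies(dist):
    """Map each set of four leaf indices to its cheapest pairing."""
    best = {}
    for quartet in combinations(range(len(dist)), 4):
        best[quartet] = min(
            quartet_topologies(*quartet),
            key=lambda topology: topology_cost(dist, topology),
        )
    return best


if __name__ == "__main__":
    # Toy distance matrix: items 0 and 1 are close, as are items 2 and 3.
    dist = [
        [0.0, 0.1, 0.9, 0.9],
        [0.1, 0.0, 0.9, 0.9],
        [0.9, 0.9, 0.0, 0.2],
        [0.9, 0.9, 0.2, 0.0],
    ]
    print(best_topologies(dist))
```

In the toy matrix the cheapest pairing keeps the two close items together on each side, which is the grouping a tree should preserve.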

Newer developments

  • Since the original Algorithmic Clustering of Music paper, we have further developed the mathematical formalisms underlying the method in a new paper, Clustering by Compression

  • We’ve included experiments from many other areas: biology, astronomy, images…

Current and future work

  • This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces

  • We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm

New directions

  • Combination of NCD and Support Vector Machine (SVM) learning for providing scalable generalization in a wide class of domains both musical and otherwise

  • Application of our techniques to real outstanding questions within the musical community
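
One possible way to combine NCD with an SVM, sketched here as an assumption rather than as the authors' design, is to describe each object by its vector of NCD values to a fixed set of anchor objects and to train the SVM on those vectors. The sketch covers only the feature construction: the SVM itself is not shown, `zlib` is a stand-in compressor, and the names `ncd_features` and `random_bytes` are hypothetical.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, with zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, assuming the paper's definition."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_features(obj: bytes, anchors) -> list:
    """Feature vector for obj: one NCD value per anchor object."""
    return [ncd(obj, anchor) for anchor in anchors]


def random_bytes(seed: int, n: int) -> bytes:
    """Reproducible pseudo-random bytes, used as toy data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    anchors = [random_bytes(seed, 2048) for seed in (1, 2, 3)]
    sample = anchors[0]  # an object identical to the first anchor
    print(ncd_features(sample, anchors))
```

The vectors have a fixed length regardless of the size of the original files, which is what lets a standard learner consume them.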

Contact and more info

  • Related papers and information:

  • Software: