
### Clustering by Compression

Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)

Overview

- Input to the software is a set of files
- Output is a hierarchical clustering shown as an unrooted binary tree
- This is a case of unsupervised learning
- (example follows)

Process Overview

1. File translations, if necessary, for example from MIDI to “player-piano” type format.
2. Calculation of Normalized Compression Distance, or NCD.
3. Representation as an unrooted binary tree.

What’s Unique?

- This clustering system is unique in that it can be described as feature-free
- There are no parameters to tune, and no domain-specific knowledge went into it.
- Using general-purpose data compressors gives us a parameterized family of features automatically for each domain

Featureless Clustering

- Having no parameters and no customized features makes it convenient to develop as well as to use
- Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions
- So how does it work?

Midi Translation

- In order to restrict information entering the algorithm, we removed undesirable MIDI fields such as artist or composer name, headers, and other non-musical data.
- We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.
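
The filtering described above can be sketched as follows. This is a minimal illustration, not the authors' converter: the event records (dictionaries with `type`, `time`, `note`, and `velocity` keys) are a hypothetical stand-in for whatever a real MIDI parser would produce.

```python
# Hypothetical event format: each track is a list of dicts. A real MIDI
# parser would yield richer records; these keys are assumptions.
KEEP = {"note_on", "note_off"}

def strip_track(events):
    """Keep note timing events only; drop metadata and per-note volume."""
    kept = []
    for ev in events:
        if ev["type"] in KEEP:
            # The velocity (volume) field is deliberately left out.
            kept.append((ev["type"], ev["time"], ev["note"]))
    return kept

def strip_song(tracks):
    """Filter track by track so the track decomposition is preserved."""
    return [strip_track(track) for track in tracks]

song = [
    [
        {"type": "track_name", "time": 0, "text": "Piano"},
        {"type": "note_on", "time": 0, "note": 60, "velocity": 90},
        {"type": "note_off", "time": 480, "note": 60, "velocity": 0},
    ],
]
stripped = strip_song(song)
```

Note duration is implied by the gap between a `note_on` and its matching `note_off`, so it survives the filter without a separate field.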

Gene sequence translation

- Genetic sequences are represented in ASCII using a four-letter alphabet: A, T, G, C
- Almost no translation at all

Image Translation

- Black and white images are converted to ASCII using spaces for black and # for white
- Newlines are used to separate rows
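
A minimal sketch of this image translation, assuming the image arrives as rows of booleans where `True` means a white pixel (that input convention and the function name are assumptions, not taken from the slides):

```python
def image_to_ascii(pixels):
    """Render rows of booleans as text: '#' for white, space for black."""
    return "\n".join(
        "".join("#" if is_white else " " for is_white in row)
        for row in pixels
    )

# A 3x3 white cross on a black background.
image = [
    [False, True, False],
    [True, True, True],
    [False, True, False],
]
text = image_to_ascii(image)
```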

NCD

- Once a group of songs has been acquired and translated, a quantity is computed on each pair in the group
- Normalized Compression Distance measures how different two files are from one another.

NCD

- NCD is based on an earlier idea called Normalized Information Distance.
- NID uses as its compressor a mathematical abstraction called Kolmogorov complexity, often abbreviated K.
- K represents a perfect data compressor, and is therefore uncomputable.

NCD

- Since we cannot compute K, we approximate it using real general-purpose file-compressors like gzip, bzip2, winzip, ppmz, and others
- NCD depends on a particular compressor and NCD with different compressors may give different results for the same pair of objects
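
To illustrate how the choice of compressor changes the measured size, here is a small sketch using the compression modules in Python's standard library. `zlib` (the DEFLATE algorithm also used by gzip) and `bz2` stand in for the gzip and bzip2 tools named above; `lzma` is an extra compressor added here for comparison and is not mentioned in the slides.

```python
import bz2
import lzma
import zlib

# Each entry maps a label to a function returning the compressed bytes.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}

def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size of `data` under each compressor."""
    return {name: len(fn(data)) for name, fn in COMPRESSORS.items()}

sample = b"the quick brown fox jumps over the lazy dog. " * 200
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes (from {len(sample)})")
```

Because each compressor reports a different size for the same input, any distance built on those sizes inherits that dependence.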

NCD

- C(x) means “the compressed size of x”
- C(xy) means “the compressed size of x and y concatenated”
- 0 <= NCD(x,y) <= 1 (roughly)
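
The transcript does not carry the formula itself. The definition given in the Clustering by Compression paper is

$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}$$

A minimal sketch of that formula, assuming `zlib` as the compressor (any of the compressors above could be substituted; the helper names are illustrative):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size of `data` in bytes, using zlib at maximum level."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

first = b"the quick brown fox jumps over the lazy dog. " * 50
other = b"colorless green ideas sleep furiously tonight. " * 50
d = ncd(first, other)
print(f"NCD = {d:.3f}")
```

With a real compressor the value can fall slightly outside the interval from 0 to 1, which is why the bound above is only approximate.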

NCD

- NCD measures how similar or different two strings (or equivalently, files) are.
- NCD(x,x) = 0 (approximately, with a real compressor), because nothing is different from itself
- NCD(x,y) = 1 means that x and y are completely unrelated
- Often less extreme values in real cases
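
The two extremes can be checked with a small experiment. This is a sketch under assumptions: `zlib` is the compressor, and seeded pseudo-random bytes play the role of "completely unrelated" inputs.

```python
import random
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD with zlib as the compressor (an assumed choice)."""
    size = lambda s: len(zlib.compress(s, 9))
    cx, cy = size(x), size(y)
    return (size(x + y) - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)  # fixed seed so the run is repeatable
x = bytes(rng.getrandbits(8) for _ in range(8000))
y = bytes(rng.getrandbits(8) for _ in range(8000))

same = ncd(x, x)       # near 0: the second copy adds almost nothing
unrelated = ncd(x, y)  # near 1: knowing x does not help compress y
print(f"NCD(x,x) = {same:.3f}, NCD(x,y) = {unrelated:.3f}")
```

The inputs are kept well under zlib's 32 KB window so that the second copy of `x` can be matched against the first; with larger files or a different compressor NCD(x,x) may drift further from 0.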

NCD

- Computing NCD of every song with every other song yields a 2-dimensional symmetric distance matrix
- Next step is transforming this array of distances into something easier to grasp
- We use the Quartet Method to construct an unrooted binary tree from the NCD matrix
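
A sketch of the distance-matrix step, again assuming `zlib` as the compressor. Two choices here are assumptions rather than something stated in the slides: the diagonal is set to 0 by convention, and the two concatenation orders are averaged so that the matrix is exactly symmetric even though a real compressor may give slightly different sizes for xy and yx.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    size = lambda s: len(zlib.compress(s, 9))
    cx, cy = size(x), size(y)
    return (size(x + y) - min(cx, cy)) / max(cx, cy)

def ncd_matrix(items):
    """Pairwise NCD for a list of byte strings, as a list of lists."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at 0
    for i in range(n):
        for j in range(i + 1, n):
            # Average both orders to force exact symmetry.
            d = (ncd(items[i], items[j]) + ncd(items[j], items[i])) / 2
            matrix[i][j] = matrix[j][i] = d
    return matrix

items = [b"abc" * 100, b"abd" * 100, b"xyz" * 100]
matrix = ncd_matrix(items)
```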

Quartet Method

- Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction popular for the last 30 years
- The input is a matrix of distances (NCD)
- The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections.
- Tree is just one visualization of NCD matrix
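
The building block of a quartet method can be sketched as follows. Four leaves can be split into two pairs in three ways; this example assumes the cost of a pairing is the sum of the two within-pair distances and picks the cheapest pairing for every group of four. It covers only that scoring step, not the authors' search for a full tree, and the function name is illustrative.

```python
from itertools import combinations

def best_quartet_topologies(dist):
    """Map each 4-leaf group to its lowest-cost pairing."""
    best = {}
    for a, b, c, d in combinations(range(len(dist)), 4):
        # The three possible ways to split four leaves into two pairs.
        options = {
            ((a, b), (c, d)): dist[a][b] + dist[c][d],
            ((a, c), (b, d)): dist[a][c] + dist[b][d],
            ((a, d), (b, c)): dist[a][d] + dist[b][c],
        }
        best[(a, b, c, d)] = min(options, key=options.get)
    return best

# Toy matrix: leaves 0 and 1 are close, and leaves 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
result = best_quartet_topologies(dist)
```

A candidate tree can then be judged by how many of these preferred pairings it is consistent with. Since every non-leaf node has exactly three connections, a tree over n leaves has n - 2 internal nodes.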

Newer developments

- Since the original Algorithmic Clustering of Music paper, we have further developed the mathematical formalisms underlying the method in a new paper, Clustering by Compression
- We’ve included experiments from many other areas: biology, astronomy, images…

Current and future work

- This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces
- We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm

New directions

- Combination of NCD and Support Vector Machine (SVM) learning for providing scalable generalization in a wide class of domains both musical and otherwise
- Application of our techniques to real outstanding questions within the musical community

Contact and more info

- Related papers and information: http://www.cwi.nl/~cilibrar

- Software: http://complearn.sourceforge.net/
