Similarity Analysis by Data Compression

Similarity Analysis by Data Compression • Peter Grünwald, CWI, Amsterdam • Petri Myllymäki, University of Helsinki, CoSCo • Research carried out by Rudi Cilibrasi, Teemu Roos, Hannes Wettig • CWI is the National Centre of Mathematics and Computer Science in the Netherlands • CoSCo is the Complex Systems Computation Research Group

Data Compression… • Consider two files A and B • Let’s compress these with your favourite general-purpose data compressor, e.g. gzip • Let L(A) and L(B) be the compressed lengths (in bits) of A and B, respectively

…and Similarity • Suppose we want to compress both A and B • We can either compress A first and then B, giving total length L(A) + L(B) • Or we can glue A and B together and compress the resulting file AB, giving length L(AB) • Claim: if (and only if) A and B are ‘similar’, then L(AB) << L(A) + L(B)
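The claim above can be tried out directly. The following is a minimal sketch, assuming Python's standard `zlib` module (the DEFLATE algorithm used inside gzip) as the compressor; the inputs are synthetic and the helper names `compressed_bits` and `random_bytes` are made up for this illustration.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """L(data): size in bits after zlib compression at the highest level."""
    return 8 * len(zlib.compress(data, 9))


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible filler data that is essentially incompressible on its own."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


# Hypothetical inputs: B is A with a few bytes changed, C is unrelated to A.
a = random_bytes(4000, seed=1)
b_buf = bytearray(a)
for pos in range(0, len(b_buf), 500):
    b_buf[pos] ^= 0xFF  # alter one byte every 500 positions
b = bytes(b_buf)
c = random_bytes(4000, seed=2)

separate_similar = compressed_bits(a) + compressed_bits(b)
joint_similar = compressed_bits(a + b)
separate_unrelated = compressed_bits(a) + compressed_bits(c)
joint_unrelated = compressed_bits(a + c)

print("similar:   L(A)+L(B) =", separate_similar, " L(AB) =", joint_similar)
print("unrelated: L(A)+L(C) =", separate_unrelated, " L(AC) =", joint_unrelated)
```

For the similar pair the joint length should come out far below the sum, while for the unrelated pair the two should be nearly equal. Note that zlib only looks back 32 KB, so this particular compressor cannot exploit similarity between much larger files.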

“Domain-Independent” Notion of Similarity • Consider the same ASCII text in many different languages, e.g., the Declaration of Human Rights • English close to German • English reasonably close to French • German farther from French • All three far from, say, Polish • Consider DNA of different species • Human very close to Chimpanzee, somewhat less close to Gorilla, even less close to Baboon…and very far from Wheat • Consider MIDI files of popular songs…

Background • For a given compressor with length function L, define the Normalized Compression Distance as NCD(A, B) = (L(AB) − min{L(A), L(B)}) / max{L(A), L(B)} • If L is taken to be Kolmogorov complexity, this becomes a “universal metric” • essentially, whenever two objects are close according to some computable distance function, they will be close according to NCD as well • For practical applications, use a computationally practical general-purpose compressor • gzip, bzip, ppm, etc.
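As an illustration of the definition, here is a minimal sketch of the Normalized Compression Distance with a pluggable compressor. It assumes the compressors in Python's standard library (`zlib`, `bz2`, `lzma`) as stand-ins for the general-purpose compressors named on the slide; the function name `ncd` and the test data are made up for the example, and the formula used is the standard NCD, (L(AB) − min{L(A), L(B)}) / max{L(A), L(B)}.

```python
import bz2
import lzma
import random
import zlib
from typing import Callable

Compressor = Callable[[bytes], bytes]


def ncd(a: bytes, b: bytes, compress: Compressor = zlib.compress) -> float:
    """Normalized Compression Distance with L(x) = len(compress(x)).

    Lengths are measured in bytes; the ratio is the same as with bits.
    """
    la, lb = len(compress(a)), len(compress(b))
    lab = len(compress(a + b))
    return (lab - min(la, lb)) / max(la, lb)


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible synthetic data (an assumption of this example)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


original = random_bytes(3000, seed=1)
near_copy = original[:1500] + b"?" + original[1501:]  # one byte replaced
unrelated = random_bytes(3000, seed=2)

for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    print(name,
          "near copy:", round(ncd(original, near_copy, compress), 3),
          "unrelated:", round(ncd(original, unrelated, compress), 3))
```

With a real compressor the value is only approximately in the range 0 to 1: it can slightly exceed 1, and NCD(A, A) is small but not exactly zero.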

Applications • For a set of N possibly related files, compute N² pairwise normalized compression distances • To visualize, create a binary tree such that close objects are close to each other on the tree • e.g. using quartet puzzling method You can do this at home! You cannot do this at home!
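The first step above, computing all pairwise distances, can be sketched as follows. This is an illustrative example only: it assumes `zlib` as the compressor, uses four synthetic "files", and stops at the distance matrix and a nearest-neighbour lookup. It does not implement the quartet-based tree construction, and the names `ncd_matrix`, `noise` and `perturb` are hypothetical helpers.

```python
import random
import zlib
from typing import Dict, List


def clen(data: bytes) -> int:
    """Compressed length in bytes."""
    return len(zlib.compress(data, 9))


def ncd_matrix(files: Dict[str, bytes]) -> List[List[float]]:
    """All N x N normalized compression distances, in insertion order."""
    names = list(files)
    single = {name: clen(files[name]) for name in names}  # N single lengths
    matrix = []
    for x in names:
        row = []
        for y in names:
            joint = clen(files[x] + files[y])
            lo, hi = sorted((single[x], single[y]))
            row.append((joint - lo) / hi)
        matrix.append(row)
    return matrix


def noise(n: int, seed: int) -> bytes:
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def perturb(data: bytes, step: int) -> bytes:
    """Return a copy with one byte altered every `step` positions."""
    buf = bytearray(data)
    for pos in range(0, len(buf), step):
        buf[pos] ^= 0xFF
    return bytes(buf)


# Two made-up "families": x with a near copy, y with a near copy.
files = {"x": noise(3000, 1), "y": noise(3000, 2)}
files["x_copy"] = perturb(files["x"], 400)
files["y_copy"] = perturb(files["y"], 400)

names = list(files)
matrix = ncd_matrix(files)
nearest = {}
for i, name in enumerate(names):
    others = [(matrix[i][j], names[j]) for j in range(len(names)) if j != i]
    nearest[name] = min(others)[1]
    print(name, "-> nearest:", nearest[name])
```

Because L(AB) and L(BA) generally differ for a real compressor, the matrix is only approximately symmetric, which is one reason to compute all N² entries rather than just one triangle.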

Pump-Priming • Pre-Pump Priming: • Theory developed and tested on several data sets at CWI; featured in New Scientist, Pour La Science, Izvestija… • Successes include: SARS is CORONA • Pump Priming: • Development of popular Open-Source Package CompLearn (www.complearn.org, Rudi Cilibrasi) • Application of CompLearn and other compression-based methods to stemmatology

Compression-Based Methods in Stemmatic Analysis Legend of St. Henry of Finland, Manuscript H, Helsinki University Library

Before Gutenberg... • Historical manuscripts were repeatedly copied by hand • Typical ’errors’ include misspellings, omissions, changes of word order, etc.

Manuscript Evolution • The texts spread out in a number of copies, following a tree-like graph • Typically only a fraction of the manuscripts survive to the present day

Stemmatic Analysis • Stemmatology: ”Discipline that attempts to reconstruct the transmission of a text on the basis of relations between the various surviving manuscripts.” • Cf. Phylogenetics: ”The study of evolutionary relatedness among various groups of organisms.” • manuscript ↔ individual • written text ↔ DNA • copying ↔ reproduction • modification ↔ mutation • ’contamination’ ↔ horizontal transfer

Compression-Based Approach • Most existing approaches (distance-based methods, parsimony methods, Bayesian methods, etc.) are based on methods developed for biological phylogeny • Pascal pump-priming project: a compression-based approach for stemmatic analysis • Cost function: the amount of information required to describe B given A.
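The slide does not say how the cost function is computed. One common compression-based approximation of "information required to describe B given A" is L(AB) − L(A), sketched below under that assumption with `zlib` as the compressor. The "manuscripts" are randomly generated word sequences, and every helper name here is hypothetical.

```python
import random
import zlib
from typing import List


def bits(data: bytes) -> int:
    return 8 * len(zlib.compress(data, 9))


def cost_given(b: bytes, a: bytes) -> int:
    """Approximate information (in bits) needed to describe b once a is known."""
    return max(0, bits(a + b) - bits(a))


def make_text(n_words: int, seed: int) -> List[str]:
    """A made-up text: random words drawn from a random 200-word vocabulary."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 8)))
             for _ in range(200)]
    return [rng.choice(vocab) for _ in range(n_words)]


def copy_with_errors(words: List[str], n_errors: int, seed: int) -> List[str]:
    """Copy the text, corrupting a few words as a crude stand-in for scribal errors."""
    rng = random.Random(seed)
    copy = list(words)
    for _ in range(n_errors):
        i = rng.randrange(len(copy))
        copy[i] = copy[i][::-1] + "x"
    return copy


def encode(words: List[str]) -> bytes:
    return " ".join(words).encode("ascii")


exemplar_words = make_text(800, seed=1)
exemplar = encode(exemplar_words)
copy = encode(copy_with_errors(exemplar_words, n_errors=16, seed=7))
unrelated = encode(make_text(800, seed=2))

cost_copy = cost_given(copy, exemplar)
cost_unrelated = cost_given(unrelated, exemplar)
print("cost(copy | exemplar)      =", cost_copy, "bits")
print("cost(unrelated | exemplar) =", cost_unrelated, "bits")
```

A copy with a handful of errors should be much cheaper to describe given its exemplar than an unrelated text is, which is what makes such a cost usable for scoring candidate parent-child links in a stemma.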

Constructing the stemma • Dynamic programming for handling the missing nodes • With 52 existing documents, the number of trees is about 2.7 × 10^78 → simulated annealing search
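The slide does not state which trees are being counted. The figure is, however, consistent with the standard count of unrooted binary trees on n labelled leaves, (2n − 5)!! = 3 · 5 · … · (2n − 5), as this short check shows (the function name is made up for the example):

```python
def num_unrooted_binary_trees(n_leaves: int) -> int:
    """(2n - 5)!!: the number of unrooted binary trees with n labelled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


print(num_unrooted_binary_trees(4))            # 3
print(num_unrooted_binary_trees(5))            # 15
print(f"{num_unrooted_binary_trees(52):.1e}")  # 2.7e+78
```

A search space of this size rules out exhaustive enumeration, which is why a stochastic search such as simulated annealing is used instead.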

How Does It Work? • Actually, surprisingly well! • In Helsinki, we have started a 2-year project with the historians, funded by the Emil Aaltonen Foundation, to study this approach further

The Pascal Computer-Assisted Stemmatology Challenge • Data set #1: Heinrichi data, collected specifically for this challenge • Data set #2: The Parzival data; the text is the beginning of the German poem Parzival by Wolfram von Eschenbach (translated into English by A.T. Hatto). Data kindly provided to us by M. Spencer and H. F. Windram • Data set #3: Notre Besoin; the text is from Stig Dagerman's Notre besoin de consolation est impossible à rassasier, Paris: Actes Sud, 1952 (translated into French from Swedish by P. Bouquet). Data kindly provided to us by Caroline Macé.

Challenge results • No clear overall winner across all data sets • CompLearn performed very well on Parzival but poorly on Heinrichi; why? → more research is required • Nice side result: the Heinrichi data set is quite unique internationally → a platform for future collaboration with other sciences?

Future work • Analysis of Challenge results • New Challenge? • Application to the Finnish Cultural Foundation to fund a two-year European research network on stemmatology • built around a series of 4-5 international workshops gathering top experts in the field • names in the application represent various disciplines including historical studies, theology, philology, computer science, mathematics and biology • Workshop on information-theoretic approaches to modeling in Helsinki? • July 2008, during ICML, UAI & COLT
