From Kolmogorov and Shannon to Bioinformatics and Grid Computing

Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo

Presentation Transcript


  1. From Kolmogorov and Shannon to Bioinformatics and Grid Computing • Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo

  2. Aim • Give a flavour of fundamental novel discoveries about indexing and compression: a string, and any compact encoding of it, is the best index for itself • Give a flavour of some fundamental novel discoveries about distance functions and classification, particularly relevant for Bioinformatics • On the way, mention uses of: suffix trees, suffix arrays, Burrows-Wheeler Transform, Move to Front… • In 30 min., an incredibly long journey: from Kolmogorov and Shannon to Grid Computing • References: available on-line

  3. What do we mean by “Indexing”? • Types of data: DNA sequences, audio-video files, executables; in general, a raw sequence of characters or bytes • Types of query: character-based query, arbitrary substring • Indexing approaches: full-text indexes, suffix array, suffix tree, …

  4. What do we mean by “Compression”? • Any algorithm that squeezes data: lossless, lossy • CPU speed nowadays makes (de)compression “costless”! • Moral: it is more economical to store data in compressed form than uncompressed • Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServers x330MXT • Same performance as a PC with double the memory, but at half the cost

  5. What do we mean by “Classification”? • Any tool that can group “related” objects together, e.g. the unaligned mitochondrial genomes • [Figure: NCBI classification]

  6. Compression and Indexing: Two sides of the same coin! • Do we witness a paradoxical situation? • An index injects redundant data, in order to speed up the pattern searches • Compression removes redundancy, in order to squeeze the space occupancy • NO, new results proved a mutually reinforcing behaviour! • Better indexes can be designed by exploiting compression techniques (in terms of space occupancy) • Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio) • Classification is the “third side” of the coin: Kolmogorov Complexity, Information Theory, Compression and Indexing

  7. Our journey, today... • Index design (Weiner ’73) • Compressor design (Shannon ’48) • Suffix Array (1990) • Burrows-Wheeler Transform (1994) • Compressed Index: space close to gzip, bzip; query time close to O(|P|) • Compression Booster: a tool to transform a poor compressor into a better compression algorithm • Kolmogorov Universal Distances and Classification

  8. First Lap… in record time! • Investigate: Indexing ideas → Compressor design (Booster)

  9. Key Idea 1: Suffix Tree [Weiner 73, McCreight 76, Ukkonen 92] • String: mississippi# • [Figure: suffix tree of mississippi#, with edges labelled by substrings and leaves labelled by the starting positions 1 to 12 of the suffixes]

  10. Key Idea 2: Burrows-Wheeler Compression (1994) • Let us be given a string s = mississippi# • Form all the cyclic rotations of s: mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi • Sort the rows: #mississippi, i#mississipp, ippi#mississ, issippi#miss, ississippi#m, mississippi#, pi#mississip, ppi#mississi, sippi#missis, sissippi#mis, ssippi#missi, ssissippi#mi • bwt(s) is the last column of the sorted rows: ipssm#pissii
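
The construction on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `bwt` is hypothetical, the rotations are materialised explicitly (quadratic space), and the end marker `#` is assumed to be unique and to sort before every letter, as it does in ASCII.

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform computed by sorting all cyclic rotations.

    Assumption: s ends with a unique end marker (here '#') that sorts
    before every other character, as on the slide.
    """
    # Row i of the matrix is s rotated left by i positions.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sort the rows lexicographically.
    rotations.sort()
    # The transform is the last character of each sorted row.
    return "".join(row[-1] for row in rotations)


if __name__ == "__main__":
    print(bwt("mississippi#"))  # ipssm#pissii
```

Practical implementations avoid building the rotation matrix and derive the same column from a suffix array instead.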

  11. Burrows and Wheeler Compression • Why it works: • BWT creates a locally homogeneous string: abaababa → bbbaaaaa • MTF transforms it into a globally homogeneous sequence of integers: bbbaaaaa → 00010000 • The final string is “easy” to compress • Experimentally: compressibility is proportional to the percentage of zeros
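
A minimal sketch of the Move-to-Front step follows. The function name `move_to_front` and the explicit `alphabet` argument are assumptions made for illustration. The slide's output 00010000 corresponds to an initial list with `b` in front of `a`; starting from the list `a, b` the first symbol would be encoded as 1 instead of 0.

```python
def move_to_front(text: str, alphabet: str) -> list[int]:
    """Encode text as the positions of its symbols in a self-organising list.

    alphabet gives the initial order of the list (an assumption: the slide
    does not state which initial order it uses).
    """
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)            # current position of the symbol
        out.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return out


if __name__ == "__main__":
    # Runs of equal symbols become runs of zeros.
    print(move_to_front("bbbaaaaa", "ba"))  # [0, 0, 0, 1, 0, 0, 0, 0]
    # The untransformed string has no long runs, hence few zeros.
    print(move_to_front("abaababa", "ab"))  # [0, 1, 1, 0, 1, 1, 1, 1]
```

The contrast between the two outputs is the point of the slide: after the BWT, most MTF codes are zero, which a simple entropy coder handles well.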

  12. Boosting [Ferragina, Giancarlo, Manzini, Sciortino, 03, 04, 05] • The technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee • [Diagram: A compresses s into c; the Booster, using A, compresses s into c’] • The better A is, the better Aboost is • The more compressible s is, the better Aboost is • Qualitatively, it can be shown: • c’ is shorter than c, if s is compressible • Time(Aboost) = Time(A), i.e. no slowdown • A is used as a black box

  13. Second Lap… Even faster • We investigated: Index ideas → Compression design • Let’s now turn to the other direction: Compression ideas → Index design (Compressed Indexes)

  14. Suffix Array vs. BW-transform • Text T = mississippi# • SA = 12 11 8 5 2 1 10 9 7 4 6 3 • Sorted rotated text (each row without its last character): #mississipp, i#mississip, ippi#missis, issippi#mis, ississippi#, mississippi, pi#mississi, ppi#mississ, sippi#missi, sissippi#mi, ssippi#miss, ssissippi#m • L (last column) = i p s s m # p i s s i i • L includes SA and T. Can we search within L?
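
The link between the two columns can be made concrete with a short sketch. The names `suffix_array` and `last_column` are hypothetical, positions are 1-based as on the slide, and the suffix array is built by naive sorting, which is fine for a toy string but not for real data.

```python
def suffix_array(t: str) -> list[int]:
    """1-based starting positions of the suffixes of t in lexicographic order.

    Naive construction by sorting the suffixes; assumes t ends with a
    unique end marker that sorts before every other character.
    """
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def last_column(t: str, sa: list[int]) -> str:
    """BWT column L: the character that precedes each sorted suffix.

    For 1-based position i the preceding character is t[i - 2] in Python's
    0-based indexing; for i == 1 the index -1 wraps to the end marker.
    """
    return "".join(t[i - 2] for i in sa)


if __name__ == "__main__":
    t = "mississippi#"
    sa = suffix_array(t)
    print(sa)                  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(last_column(t, sa))  # ipssm#pissii
```

In other words, L is the text read one position to the left of each suffix array entry, which is why the two structures carry closely related information.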

  15. A compressed index [Ferragina-Manzini, IEEE FOCS 2000] • The theoretical result: • Query complexity: O(p + occ log^ε N) time, for a pattern of length p with occ occurrences • Space occupancy: O(N H_k(T)) + o(N) bits, where H_k(T) is the k-th order empirical entropy • In practice, the index is very appealing: • Space close to the best known compressors, e.g. bzip • Query time of a few milliseconds on hundreds of MBs
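
To give an idea of how a pattern can be searched using only the BWT column, here is a sketch of backward search that counts occurrences. It is not the compressed index itself: the names `bwt` and `count_occurrences` are hypothetical, the rank queries are answered by scanning the string (a real index would use compressed rank structures), and the text is assumed to end with a unique smallest end marker.

```python
def bwt(s: str) -> str:
    """BWT by sorting rotations (toy construction, quadratic space)."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text via backward search on the BWT."""
    L = bwt(text)
    # C[c] = number of characters in the text that are smaller than c,
    # i.e. the first row of the sorted matrix that starts with c.
    C = {}
    for i, c in enumerate(sorted(L)):
        C.setdefault(c, i)

    def rank(c: str, i: int) -> int:
        # Occurrences of c in L[0:i]; naive linear scan for illustration.
        return L[:i].count(c)

    lo, hi = 0, len(L)  # half-open range of rows prefixed by the current suffix
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences("mississippi#", "ssi"))  # 2
```

The loop performs one step per pattern character, which is where the O(p) term of the query bound comes from; locating the occ positions is what adds the second term.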

  16. Third Lap… Universal Distances and Classification

  17. Large Data Sets • Classification of Sequences on a Genome-wide Scale • Distances based on alignments are either not applicable or too slow • Fast and reliable alignment-free methods are badly needed • Classification of Proteins, both for Function and Structure: lagging behind sequence data

  18. Proteins and Their String Representations • Amino acid sequence (FASTA format); • Atomic coordinates (Atom lines);

  19. Protein Representations • Topological Models (Top Diagrams)

  20. Kolmogorov Complexity • The Kolmogorov Complexity K(x) of a string x is defined as the length of the shortest binary program that produces x. • The conditional Kolmogorov Complexity K(x|y) represents the minimum amount of information required to generate x by an effective computation when y is given as an input to the computation. • The Kolmogorov Complexity K(x,y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart.

  21. Universal Similarity Metric (USM) • Problem: • USM(x,y) is based on Kolmogorov Complexity, which is non-computable in the Turing sense. • Solution: • K(x) can be approximated via data compression by using its relationship with Shannon Information Theory. • USM is a methodology rather than a formula quantifying the similarity of two strings.

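  22. Approximations of USM • K(x) can be approximated by C(x), K(x,y) by C(xy) and K(x|y*) by C(xy) – C(x), where C(·) denotes the length of the output of a compressor C • We obtain three approximations to USM: UCD, NCD and CD [formulas not preserved in this transcript]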
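
A sketch of how such compression-based dissimilarities can be computed is given below. The exact formulas shown on the slide are not available here, so the definitions of `ucd`, `ncd` and `cd` are assumptions taken from the way these measures are usually stated in the compression-based classification literature, and `zlib` is only a convenient stand-in for C (the experiments used many different compressors).

```python
import random
import zlib


def C(data: bytes) -> int:
    """Compressed length in bytes; zlib is an arbitrary stand-in compressor."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    # Assumed definition: max{C(xy) - C(x), C(yx) - C(y)} / max{C(x), C(y)}
    return max(C(x + y) - C(x), C(y + x) - C(y)) / max(C(x), C(y))


def ncd(x: bytes, y: bytes) -> float:
    # Assumed definition: (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))


def cd(x: bytes, y: bytes) -> float:
    # Assumed definition: C(xy) / (C(x) + C(y))
    return C(x + y) / (C(x) + C(y))


if __name__ == "__main__":
    # Two unrelated pseudo-random DNA-like strings (illustrative data only).
    x = "".join(random.Random(1).choices("ACGT", k=2000)).encode()
    z = "".join(random.Random(2).choices("ACGT", k=2000)).encode()
    for name, d in (("UCD", ucd), ("NCD", ncd), ("CD", cd)):
        print(name, "same:", round(d(x, x), 3), "unrelated:", round(d(x, z), 3))
```

With any reasonable compressor, a string paired with itself should score clearly lower than a pair of unrelated strings, which is the property that clustering methods then exploit.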

  23. Experiments [Ferragina, Giancarlo, Greco, Manzini, Valiente, 2007] • Experimental setup: • Five benchmark datasets of proteins (several alternative representations); • A benchmark dataset of genomic sequences (complete unaligned mitochondrial genomes); • Twenty-five compression algorithms; • Three dissimilarity functions based on USM. • Two sets of experiments to compare USM with both alignment-based and alignment-free methods: • via ROC Analysis; • via UPGMA and NJ.

  24. An example • Unaligned mitochondrial DNA complete Genomes

  25. Results and Conclusions • Useful guidelines for the use of the USM methodology for biological investigation • Which compressor to use • Which among UCD, NCD and CD to use • Which data representation is best • Etc…

  26. Software • Kolmogorov Library: http://www.math.unipa.it/~raffaele/kolmogorov/ • Sequential processing is too slow even for relatively small data sets: e.g., the classification of 278 files (1.5Mb) takes 12 hours on a state-of-the-art PC… half an hour on Grid • Soon available as a Grid-aware Web Service on the COMETA Portal

  27. Advertisement 2 • 20th Edition of the Lipari International Summer School for Computer Scientists • TOPIC: Algorithms, Science and Engineering • See Lipari School Website
