
Towards Compression-based Information Retrieval



  1. Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

  2. Scope [Overview diagram of the main contributions, spanning Theory, Methods and Applications: Algorithmic Information Theory, Shannon Information Theory, Data Compression, Algorithmic KL-divergence, Pattern Matching, Grammar-based Similarity, Dictionary-based Similarity, Content-based Image Retrieval, Classification, Clustering, Detection, Complexity Estimation for Annotated Datasets]

  3. Outline • The core: Compression-based similarity measures (CBSM) • Theory: a “Complex” Web • Speeding up CBSM • Applications and Experiments • Conclusions and Perspectives

  4. Outline • The core: Compression-based similarity measures (CBSM) • Theory: a “Complex” Web • Speeding up CBSM • Applications and Experiments • Conclusions and Perspectives

  5. Compression-based Similarity Measures [Block diagram: x and y are fed to a coder separately and jointly, producing C(x), C(y) and C(xy), which are combined into the NCD] • Most well-known: Normalized Compression Distance (NCD) • General distance between any two strings x and y • Similarity metric under some assumptions • Essentially parameter-free • Applicable with any off-the-shelf compressor (such as Gzip) • If two objects compress better together than separately, they share common patterns and are similar (see the sketch below). Li, M. et al., “The similarity metric”, IEEE Trans. Inf. Theory, vol. 50, no. 12, 2004
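A minimal sketch of how such a distance can be computed with an off-the-shelf compressor. The use of zlib and the helper names (C, ncd) are assumptions made for illustration; they are not taken from the original implementation.

```python
# Minimal NCD sketch: any standard lossless compressor can play the role of C().
import zlib

def C(data: bytes) -> int:
    """Size in bytes of the compressed representation of `data`."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox jumps over the lazy cat " * 20
    c = bytes(range(256)) * 4
    print(ncd(a, b))  # small: the two strings share most of their patterns
    print(ncd(a, c))  # larger: few shared patterns
```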

  6. Applications of CBSM Clustering and classification of: • Texts • Music • DNA genomes • Chain letters • Images • Time series • … [Figures: dendrogram of DNA genomes grouping primates and rodents; seismic signals from landslides and explosions; satellite images]

  7. Assessment and discussion of NCD results • Results obtained by NCD often outperform state-of-the-art methods • Comparisons with 51 other distances* • But NCD-like measures have always been applied to restricted datasets • Size < 100 objects in the main papers on the topic • All information retrieval systems use at least thousands of objects • More thorough experiments are required • NCD is too slow to be applied to a large dataset • 1 second (on a 2.65 GHz machine) to process 10 strings of 10 KB each and output 5 distances • Since NCD is data-driven, the full data has to be processed again and again to compute each distance from a given object • The price to pay for a parameter-free approach is that a compact representation of the data in an explicit parameter space is not possible * Keogh, E., Lonardi, S. & Ratanamahatana, C., “Towards Parameter-free Data Mining”, SIGKDD 2004.

  8. Outline • The core: Compression-based similarity measures (CBSM) • Theory: a “Complex” Web • Speeding up CBSM • Applications and Experiments • Conclusions and Perspectives

  9. A “Complex” Web • How to quantify information? [Diagram linking Classic Information Theory (H(X), Shannon Mutual Information), Algorithmic Information Theory (K(x), Algorithmic Mutual Information, NID) and Data Compression (Compression, NCD)]

  10. Shannon Entropy • Information content of the output of a random variable X • Example: entropy of the outcome of the toss of a biased/unbiased coin • H(X) is maximal when the coin is not biased • Every toss then carries a full bit of information! • Note: H(X) can be (much) greater than 1 bit if X can take more than two values!
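For completeness, the standard definition behind the slide (not reconstructed from the transcript), with the coin example worked out:

$$ H(X) = -\sum_{x \in \mathcal{X}} p(x)\,\log_2 p(x), \qquad H_{\mathrm{coin}}(p) = -p\log_2 p - (1-p)\log_2(1-p) $$

A fair coin (p = 1/2) gives H = 1 bit per toss; a biased coin with p = 0.9 gives only about 0.47 bits per toss.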

  11. Entropy & Compression: Shannon-Fano Code X = {A, B, C, D, E} • With a fixed-length code we would need 3 bits per symbol to encode the outcomes of X • A Shannon-Fano code assigns shorter codewords to the more probable symbols, so fewer bits per symbol are needed on average • Compression is achieved! (A worked example follows below.)
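The symbol probabilities shown on the original slide are not recoverable from the transcript. Assuming, purely for illustration, the dyadic distribution p(A) = 1/2, p(B) = 1/4, p(C) = 1/8, p(D) = p(E) = 1/16 and the prefix code A→0, B→10, C→110, D→1110, E→1111, the average code length becomes

$$ \bar{\ell} = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{16}\cdot 4 + \tfrac{1}{16}\cdot 4 = 1.875 \ \text{bits per symbol} < 3 \ \text{bits per symbol}. $$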

  12. The gap in the probabilistic approach • H(X) is related to the probability density function of a data source X • It cannot tell the amount of information contained in an isolated object! • Algorithmic Information Theory comes to the rescue… s = {I_carry_important_information} • What is the information content of s? • Source S: unknown

  13. Algorithmic Information Theory: Kick-off Three parallel definitions: • R. J. Solomonoff, “A formal theory of inductive inference”, 1964. • A. N. Kolmogorov, “Three approaches to the quantitative definition of information”, 1965. • G. J. Chaitin, “On the length of programs for computing finite binary sequences”, 1966.

  14. Kolmogorov Complexity • Also known as: • Kolmogorov-Chaitin complexity • Algorithmic complexity • Length of the shortest program q that outputs the string x and halts on a universal Turing machine • K(x) is a measure of the computational resources needed to specify the object x • K(x) is uncomputable
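In symbols, the standard definition (stated here for reference, not copied from the slide):

$$ K_U(x) = \min \{\, \ell(q) : U(q) = x \,\} $$

where \ell(q) is the length of the program q and U is a fixed universal Turing machine.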

  15. Kolmogorov Complexity • A string with low complexity • 001001001001001001001001001001001001001001 • 14 x (Write 001) • A string with high complexity • 0100110111100100111011100011001001011100101 • Write 0100110111100100111011100011001001011100101
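Although K(x) itself is uncomputable, a real compressor makes the contrast between the two strings tangible. A small sketch, assuming zlib as a rough stand-in for K and using longer versions of the strings so the effect is visible:

```python
# Low- vs. high-complexity strings, measured with an off-the-shelf compressor.
import os
import zlib

regular = b"001" * 1000      # low complexity: "1000 x (Write 001)"
random_ = os.urandom(3000)   # high complexity: no shorter description expected

print(len(zlib.compress(regular, 9)))  # a few dozen bytes
print(len(zlib.compress(random_, 9)))  # close to (or above) 3000 bytes
```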

  16. Two approaches to information content: Probabilistic (classic) vs. Algorithmic
  • Probabilistic: Information ↔ Uncertainty (Shannon Entropy) • Related to a discrete random variable X on a finite alphabet A with a probability mass function p(x) • Measure of the average uncertainty in X • Measures the average number of bits required to describe X • Computable if p(x) is known
  • Algorithmic: Information ↔ Complexity (Kolmogorov Complexity) • Related to a single object x • Length of the shortest program q, among all programs that output the finite binary string x and halt on a Universal Turing Machine • Measures how difficult it is to describe x from scratch • Uncomputable

  17. A “Complex” Web • How to measure the information shared between two objects? [Same diagram as slide 9: Classic Information Theory (H(X), Shannon Mutual Information), Algorithmic Information Theory (K(x), Algorithmic Mutual Information, NID), Data Compression (Compression, NCD)]

  18. Shannon/Kolmogorov Parallelisms: Mutual Information — Probabilistic (classic) vs. Algorithmic
  • (Statistic) Mutual Information • Measure in bits of the amount of information a random variable X carries about another variable Y • The joint entropy H(X,Y) is the entropy of the pair (X,Y) with joint distribution p(x,y) • Symmetric, non-negative • If I(X;Y) = 0 then • H(X,Y) = H(X) + H(Y) • X and Y are statistically independent
  • Algorithmic Mutual Information • Amount of computational resources shared by the shortest programs which output the strings x and y • The joint Kolmogorov complexity K(x,y) is the length of the shortest program which outputs x followed by y • Symmetric, non-negative • If the algorithmic mutual information is zero, then • K(x,y) = K(x) + K(y) • x and y are algorithmically independent
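The parallel can be summarized with the standard identities (the algorithmic one holds only up to logarithmic additive terms):

$$ I(X;Y) = H(X) + H(Y) - H(X,Y), \qquad I_K(x:y) \approx K(x) + K(y) - K(x,y). $$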

  19. A “Complex” Web • How to derive a computable similarity measure? [Same diagram: H(X) and Shannon Mutual Information; K(x), Algorithmic Mutual Information and NID; Compression and NCD]

  20. Compression: Approximating Kolmogorov Complexity • K(x) represents a lower bound for what an off-the-shelf compressor can achieve • Vitányi et al. suggest the approximation K(x) ≈ C(x) • C(x) is the size of the file obtained by compressing x with a standard lossless compressor (such as Gzip) [Example: image A, original size 65 KB, compressed size 47 KB; image B, original size 65 KB, compressed size 2 KB]

  21. Back to NCD — from the algorithmic Normalized Information Distance (NID) to the computable Normalized Compression Distance (NCD)
  • NID • Derived from algorithmic mutual information • Normalized length of the shortest program that computes x knowing y, as well as computing y knowing x • Similarity metric that minorizes every admissible distance
  • NCD • The size K(x) of the shortest program which outputs x is approximated by the size C(x) of the compressed version of x • Normalized measure of the elements that a compressor may use twice when compressing two objects x and y
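For reference, the two measures contrasted on this slide are usually written as (Li et al., 2004):

$$ \mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}, \qquad \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}. $$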

  22. A “Complex” Web • What if we add some more? [Same diagram, now extended with Occam’s razor alongside H(X), K(x), the mutual informations, NID, NCD and Compression]

  23. Occam’s Razor William of Occam, 14th century: all other things being equal, the simplest solution is the best! Maybe today William of Occam would say: if two explanations exist for a given phenomenon, pick the one with the smallest Kolmogorov Complexity!

  24. An Example of Occam’s Razor • The Copernican model (Ma) is simpler than the Ptolemaic one (Mb) • Mb, in order to explain the motion of Mercury relative to Venus, introduces epicycles in the orbits of the planets • Ma accounts for this motion with no further assumptions • Eventually, the model Ma was acknowledged as correct • Note that we can assume: K(Ma) < K(Mb)

  25. Outline • The core: Compression-based similarity measures (CBSM) • Theory: a “Complex” Web • Speeding up CBSM • Applications and Experiments • Conclusions and Perspectives

  26. Evolution of CBSM • 1993 Ziv & Merhav • First use of relative entropy to classify texts • 2000 Frank et al., Khmelev • First compression-based experiments on text categorization • 2001 Benedetto et al. • Intuitively defined a compression-based relative entropy • Caused a rise of interest in compression-based methods • 2002 Watanabe et al. • Pattern Representation based on Data Compression (PRDC) • Dictionary-based • First to classify general data after a preliminary conversion into strings • Developed independently of information-theoretic concepts • 2004 NCD • Solid theoretical foundations (Algorithmic Information Theory) • 2005-2006 Other similarity measures • Keogh et al. (Compression-based Dissimilarity Measure) • Chen & Li (Chen-Li Metric for DNA classification) • Sculley & Brodley (Cosine Similarity) • These differ from NCD only in their normalization factors (Sculley & Brodley, 2006) • 2008 Macedonas et al. • Independent definition of a dictionary distance

  27. Pattern Representation based on Data Compression (PRDC) — Watanabe et al., 2002 • The PRDC equation is not normalized with respect to the complexity of x and skips the joint compression step • After normalizing the equation used in PRDC, measures almost identical to NCD are obtained (small average difference over 400 measurements) • PRDC can thus be added to the list of measures which differ from NCD only in the normalization factor (Sculley & Brodley, 2006) [Legend of the terms in the equations: length of the string x coded with the dictionary D(y) extracted from y; length of the string x coded with the dictionary D(xy) extracted from x and y jointly; distance between two objects encoded into strings x and y; distance of object x from object y; length of the string representing the object x; length of the string representing the object x coded with the dictionary extracted from itself]

  28. Grammar-based Approximation • A dictionary extracted from a string x in PRDC may be regarded as a model for x • To better approximate K(x), consider the smallest Context-Free Grammar (CFG) generating x • The grammar’s set of rules can be regarded as the smallest dictionary and generative model for x • Two-part complexity representation: model + data given the model (MDL-like) • Complexity overestimations are intuitively accounted for and decreased in the second term [Example: a sample CFG G(z) for the string z = {aaabaaacaaadaaaf}; the complexity of a string x of length N is represented through the size of its smallest context-free grammar G(x) and the number of rules contained in the grammar. Figure: average distances over 40,000 measurements]
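To make the idea concrete, one small grammar (not necessarily the minimal one, and not taken from the original slide) that generates exactly z = aaabaaacaaadaaaf factors out the repeated block aaa and uses two rules:

$$ S \to A\,b\,A\,c\,A\,d\,A\,f, \qquad A \to a\,a\,a $$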

  29. Normalized PRDC Comparisons...

  30. Comparisons... • Note: using a compressor specific to DNA genomes improves the NCD results considerably [Figure: dendrogram comparison between the grammar-based distance and NCD on DNA genomes of primates, rodents, seals and bears; annotations mark where primates appear divided, rodents scattered and bears placed far apart]

  31. Drawbacks of the Introduced CBSM [Table ranking the measures as better than, comparable to, or worse than NCD in terms of accuracy and speed] • How to combine accuracy and speed? • Solution: extract dictionaries from the data, possibly offline, and compare only those

  32. Outline • The core: Compression-based similarity measures (CBSM) • Theory: a “Complex” Web • Speeding up CBSM • Applications and Experiments • Conclusions and Perspectives

  33. Image preprocessing: 1D encoding • Conversion to the Hue Saturation Value (HSV) color space • Scalar quantization (see the sketch below) • 4 bits for Hue • The human eye is more sensitive to changes in hue • 2 bits for Saturation • 2 bits for Value • What about the loss of textural information? • Horizontal textural information is already implicit in the dictionaries • Basic vertical interactions are stored for each pixel • Smooth / rough: 1 bit of information • Other solutions (e.g. Peano scanning) gave worse performance [Figures: HSV color space; horizontal scanning direction]
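A sketch of the quantization step described above. The 4+2+2 bit allocation follows the slide, while the use of colorsys and the packing of each pixel into a single 8-bit symbol are implementation choices assumed here:

```python
# HSV quantization: 4 bits for hue, 2 for saturation, 2 for value (one byte per pixel).
import colorsys

def quantize_pixel(r: int, g: int, b: int) -> int:
    """Map an RGB pixel (0-255 per channel) to an 8-bit HSV symbol: HHHHSSVV."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hq = min(int(h * 16), 15)  # 4 bits: the eye is most sensitive to hue
    sq = min(int(s * 4), 3)    # 2 bits for saturation
    vq = min(int(v * 4), 3)    # 2 bits for value
    return (hq << 4) | (sq << 2) | vq

def image_to_string(pixels) -> bytes:
    """Scan the image row by row (horizontal direction) into a 1D symbol string."""
    return bytes(quantize_pixel(r, g, b) for row in pixels for (r, g, b) in row)

# Tiny 2x2 example image
img = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (128, 128, 128)]]
print(list(image_to_string(img)))
```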

  34. Dictionary-based Distance: Dictionary Extraction (LZW) • Convert each image to a string and extract meaningful patterns into dictionaries using an LZW-like compressor (a sketch follows below) • Unlike LZW, loose (or no) constraints on the dictionary size, and a flexible alphabet size • Sort the entries in the dictionaries in order to enable binary searches • Store only the dictionary • LZW: dictionary-based universal compression algorithm • Improvement by Welch (1984) over the LZ78 compressor (Lempel & Ziv, 1978) • Searches for matches between the text to be compressed and a set of previously found strings contained in a dictionary • When a match is found, a substring is substituted by a code representing a pattern in the dictionary [Example: encoding of “TOBEORNOTTOBEORTOBEORNOT!”; a string ..ABABBCA.. yielding dictionary entries such as AA, BB, CA]
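A compact sketch of LZW-style dictionary extraction in the spirit of the slide: only the dictionary of patterns is kept (and sorted, so that binary searches are possible), while the coded output is discarded. Function and variable names are illustrative:

```python
def extract_dictionary(s: str) -> list:
    """Collect the patterns an LZW-like coder would add while scanning s."""
    dictionary = {ch for ch in s}  # start from the single-symbol alphabet
    w = ""
    for ch in s:
        wc = w + ch
        if wc in dictionary:
            w = wc                 # keep extending the current match
        else:
            dictionary.add(wc)     # store the newly found pattern
            w = ch                 # restart matching from the current symbol
    return sorted(dictionary)      # sorted entries enable binary searches

print(extract_dictionary("TOBEORNOTTOBEORTOBEORNOT!"))
```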

  35. Fast Compression Distance • Consider two dictionaries and compute a distance between them as the difference between shared and not-shared patterns • The joint compression step of NCD is now replaced by an inner join of two sets • NCD acts like a black box; the FCD simplifies it by making the dictionaries explicit • Dictionaries have to be extracted only once (possibly offline) [In SQL-like terms, the quantities involved are Count(select * from D(x)) and Count(select * from Inner_Join(D(x), D(y))); a sketch follows below]
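A sketch of the idea, with the dictionaries represented as plain sets so the inner join becomes a set intersection. The normalization by the size of D(x) reflects my reading of the FCD definition and should be treated as an assumption; refer to the original paper for the exact formula:

```python
def fcd(dict_x: set, dict_y: set) -> float:
    """Fast Compression Distance sketch: share of patterns of x not found in y."""
    shared = len(dict_x & dict_y)  # "inner join" of the two dictionaries
    return (len(dict_x) - shared) / len(dict_x)

# Toy dictionaries (in practice, produced offline by the LZW-like extractor)
dx = {"TO", "BE", "OR", "NO", "OT", "TOB"}
dy = {"TO", "BE", "NO", "AND"}
print(fcd(dx, dy))  # 0.5: half of x's patterns are shared with y
```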

  36. How fast is FCD with respect to NCD? • The number of operations needed for the joint compression step (LZW-based NCD) grows with the number of elements in x and the number of patterns in x • Further advantages • If in the search a pattern gives a mismatch, ignore all extensions of that pattern • Ensured by LZW’s prefix-closure property • Ignore the shortest patterns (regard them as noise) • To reduce storage space, ignore all redundant patterns which are prefixes of others • Again no losses, ensured by LZW’s prefix-closure property • Complexity decreases by approximately one order of magnitude

  37. Datasets • Corel: 1500 digital photos and hand-drawn images • Nister-Stewenius: 10,200 photographs of objects pictured from 4 different points of view • Lola: 164 video frames from the movie “Run, Lola, Run” • Liber Liber: 90 books by known Italian authors • Fawns & Meadows: 144 infrared images of meadows, some of which contain fawns

  38. Authorship Attribution
  Author        Texts  Successes
  Dante           8       8
  D’Annunzio      4       4
  Deledda        15      15
  Fogazzaro       5       5
  Guicciardini    6       6
  Machiavelli    12      10
  Manzoni         4       4
  Pirandello     11      11
  Salgari        11      11
  Svevo           5       5
  Verga           9       9
  TOTAL          90      88
  [Figures: FCD classification accuracy compared with 6 other compression-based methods; running-time comparison for the top 3 methods]

  39. Example of NCD’s Failure: Wild Animals Detection • Dataset: 41 fawns and 103 meadows • Image size: 160x120 • Compressor used with NCD: LZW • The limited buffer size in the compressor and the total loss of vertical texture cause NCD’s performance to decrease! [Figures: confusion matrices; the 3 detections missed by the FCD]

  40. Content-based image retrieval system • Classical (Smeulders, 2000): many steps and parameters to set • Proposed: compare each object in the set to a query image Q and rank the results on the basis of their distance from Q

  41. Applications: COREL Dataset • 1500 images, 15 classes, 100 images per class • Classes: Africans, Beach, Architecture, Buses, Flowers, Dinosaurs, Elephants, Horses, Food, Mountains, Caves, Postcards, Tigers, Women, Sunsets

  42. Precision (P) vs. Recall (R) Evaluation • Comparison against Minimum Distortion Information Retrieval (Jeong & Gray, 2004) and a Jointly Trained Codebook (Daptardar & Storer, 2008) • Running time: 18 min (images resampled to 64x64) on 2 processors (2 GHz) + 2 GB RAM [Figure: precision-recall curves; precision and recall are computed from true positives, false positives and false negatives, as recalled below]
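Precision and recall are computed in the usual way from the counts of true positives (TP), false positives (FP) and false negatives (FN):

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}. $$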

  43. Confusion Matrix • Classification according to the minimum average distance from a class

  44. Typical images belonging to the class “Africans” • The 10 misclassified images • Mostly misclassified as “Tigers”: no human presence • One “extreme” image misclassified as “Food” [Figure: example images, with possible false alarms marked “(?)”]

  45. Estimation of the “complexity” of a dataset [Figure: images ranked by complexity, from reduced to average to high]

  46. A larger dataset and a comparison with state-of-the-art methods: Nister-Stewenius • 10,200 images • 2,550 objects photographed under 4 points of view • The score (from 1 to 4) represents the number of meaningful objects retrieved in the top 4 • The SIFT-based NS1 and NS2 use different training sets and parameter settings, and yield different results • FCD is independent of parameters • Only 1,000 images processed for NCD • Query time: 8 seconds

  47. SAR Scene Hierarchical Clustering • 32 TerraSAR-X subsets acquired over Paris [Figure: hierarchical clustering dendrogram, with one false alarm marked]

  48. Outline • The core: Compression-based similarity measures (CBSM) • Theoretical Foundations • Contributions: Theory • Contributions: Applications and Experiments • Conclusions

  49. What compression-based techniques can (and cannot) do • The FCD allows validating compression-based techniques on larger scales • Largest dataset tested with NCD: 100 objects • Largest dataset tested with FCD: 10,200 objects • CBSM do NOT outperform every existing technique • Results obtained so far on small datasets could be misleading • On the larger datasets analyzed, results are often inferior to the state of the art • Open question: could they be somehow improved? • BUT they yield results comparable to other existing techniques • while drastically reducing (ideally skipping) the definition of parameters • skipping feature extraction • skipping clustering (in the case of classification) • without assuming a priori knowledge of the data • Data-driven, applicable to any kind of data

  50. Compression
