Similarity Search

Michael Springmann, PhD Seminar, October 11th, 2007

Projects
• DELOS (EU FP6)
  • Network of Excellence on Digital Libraries
  • http://www.delos.info/
  • Task 1.6 Management of and Access to Virtual Electronic Health Records
  • Task 1.8 DelosDLMS
• DILIGENT (EU FP6)
  • A Digital Library Infrastructure on Grid Enabled Technology
  • Work Package 1.4 Index & Search – Feature Extraction
  • ARTE Scenario
Michael Springmann - Database & Information Systems Group

What is similarity search?
From a collection, return a ranked list of items for a given reference object.
[Figure: a reference object and the ranked result list with similarity scores 0.999, 0.873, 0.722, 0.712, 0.503, 0.442, 0.392]

Steps to compute similarity
• Define query (reference object)
• Select feature to use for comparison, e.g. ColorHistogram
• Extract feature of reference object
• Compare feature with each element of collection
• Return (subset of) ranked list, e.g. 5-NN
[Figure: example with the values 203, 236, 172, 210, 78]
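The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the 4-bin grey-value histogram, the Euclidean distance, the toy collection, and the names `color_histogram` and `knn_search` are all hypothetical and not taken from the talk. The five numbers from the slide are reused as hypothetical intensity values of the query.

```python
import math

def color_histogram(pixels, bins=4):
    """Count pixels per intensity bin and normalize so the bins sum to 1."""
    hist = [0] * bins
    for value in pixels:                      # value expected in 0..255
        hist[min(value * bins // 256, bins - 1)] += 1
    return [count / len(pixels) for count in hist]

def euclidean(a, b):
    """Plain L2 distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(query_pixels, collection, k=5):
    """Return the k collection items closest to the query, best first."""
    query_feature = color_histogram(query_pixels)          # extract feature
    scored = [(euclidean(query_feature, color_histogram(p)), name)
              for name, p in collection.items()]           # compare with each element
    scored.sort()                                          # rank by distance
    return [(name, dist) for dist, name in scored[:k]]     # return subset

# Hypothetical toy collection of "images" given as flat lists of grey values.
collection = {
    "dark": [10, 20, 30, 40],
    "bright": [200, 220, 240, 250],
    "mixed": [10, 100, 150, 250],
}
ranking = knn_search([203, 236, 172, 210, 78], collection, k=2)
```

Here the ranking is by distance (smaller is better); a similarity score like the ones on the previous slide would be some decreasing function of that distance.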

Similarity Search: Media Types
• Image – Color, Texture, Shape
• Text – TF/IDF, Edit Distance
• Audio – Spectrum, Rhythm, Beat, Pitch
• Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data
• Combinations of several types
• Complex Documents
Result: high-dimensional feature vectors

Goals
Efficiency
• Theme: Retrieve the results fast!
• Measure: Execution time
• Question: How can we achieve this with algorithmic optimizations?
Effectiveness
• Theme: Find good/better results!
• Measure: Quality, e.g. for benchmark collections Precision, Recall, MAP
• Question: How can we find better results w.r.t. the information need of the user?
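The quality measures named above have standard definitions, which the following sketch spells out. The function names and the toy ranking are illustrative assumptions; benchmark campaigns normally ship their own evaluation tools.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall of the top-k results against a set of relevant items."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: average precision averaged over several (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical example: items "a" and "c" are relevant for this query.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2, r_at_2 = precision_recall(ranked, relevant, k=2)
ap = average_precision(ranked, relevant)
```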

Similarity Search: What it is...
A way to order / rank things
• May help to group objects
• Limitations:
  • Feature must match the categorization criterion
  • No sharp borders


ISIS (Interactive Similarity Search)
• Originated at ETH Zurich, continued at UMIT and UNIBAS
• VA-File can handle collections of more than 600,000 images while still achieving interactive response times
• Image features used: Color Moments, Texture Moments
• Extracted globally and for 5 fuzzy regions

5 Fuzzy Regions

Similarity Search: What it is ... and what it ain't
A way to order / rank things
• May help to group objects
• Limitations:
  • Feature must match the categorization criterion
  • No sharp borders
Feature extraction will not find out: "One person sleeping" ... at least not without application-specific adjustments / training

ImageCLEF (http://www.imageclef.org)
I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multi-lingual descriptions. Main challenge: Short annotations.
II. Object Retrieval Task: PASCAL Visual Object, 2,617 images, 4,754 objects in realistic scenes. Main challenge: Pure visual, not pre-segmented.
III. Medical Image Retrieval: Casimage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML.
IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes). Main challenge: Pure visual, classification domain specific. Example IRMA code: 1123-127-500-000

IRMA Code Classification Example: 1123-127-500-000
4 independent axes:
• Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy
• Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine
• Anatomical code (A) refers to the body region examined, here: chest
• Biological code (B) describes the biological system examined
0 always means unspecific and is therefore always followed by further 0s or by -.
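A small sketch of how such a code can be split into its axes and hierarchy levels. The axis order T-D-A-B and the 4-3-3-3 digit layout are assumptions read off the example code on this slide; `parse_irma` and `hierarchy_levels` are hypothetical helper names.

```python
AXES = ("T", "D", "A", "B")      # assumed order: technical, directional, anatomical, biological
AXIS_LENGTHS = [4, 3, 3, 3]      # assumed layout, taken from the example 1123-127-500-000

def parse_irma(code):
    """Split an IRMA code string into a dict of its four axis codes."""
    parts = code.split("-")
    if [len(p) for p in parts] != AXIS_LENGTHS:
        raise ValueError("expected a code shaped like 1123-127-500-000: " + code)
    return dict(zip(AXES, parts))

def hierarchy_levels(axis_code):
    """Each additional digit refines the previous one, e.g. 1 -> 11 -> 112 -> 1123."""
    return [axis_code[:i] for i in range(1, len(axis_code) + 1)]

def is_unspecified(axis_code):
    """True if the axis carries no information at all (only zeros)."""
    return set(axis_code) == {"0"}

example = parse_irma("1123-127-500-000")
```

For the example, `hierarchy_levels(example["T"])` walks the same refinement chain as the technical code description above, and the biological axis comes out as unspecified.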

Image Distortion Model (IDM)
Uses images scaled down to at most 32 pixels in width/height
[Figure: corresponding pixels]
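A minimal sketch of the idea behind an IDM-style distance: every query pixel may be matched to the best-fitting pixel inside a small window of the reference image, and small patches around the pixels are compared. Several simplifications are assumptions of this sketch: both images have the same size, plain grey values are compared, and border coordinates are clamped. The parameter names `warp` and `context` are hypothetical.

```python
def idm_distance(query, reference, warp=2, context=1):
    """Sum over all query pixels of the cost of the best local match.

    query, reference: equally sized 2D lists of grey values.
    warp=2 searches a 5x5 window, context=1 compares 3x3 patches.
    """
    height, width = len(query), len(query[0])

    def pixel(img, y, x):
        # Clamp coordinates so that patches at the border stay inside the image.
        return img[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    def patch_cost(qy, qx, ry, rx):
        # Squared difference of the patches centred at (qy, qx) and (ry, rx).
        return sum((pixel(query, qy + dy, qx + dx) - pixel(reference, ry + dy, rx + dx)) ** 2
                   for dy in range(-context, context + 1)
                   for dx in range(-context, context + 1))

    total = 0
    for y in range(height):
        for x in range(width):
            # Best displacement within the warp window; (0, 0) is always included.
            total += min(patch_cost(y, x, y + wy, x + wx)
                         for wy in range(-warp, warp + 1)
                         for wx in range(-warp, warp + 1))
    return total

# Hypothetical 5x5 test images: one bright pixel, shifted by one column.
a = [[0] * 5 for _ in range(5)]
b = [[0] * 5 for _ in range(5)]
a[2][2] = 255
b[2][3] = 255
rigid = idm_distance(a, b, warp=0)
deformable = idm_distance(a, b, warp=2)
```

Because the window always contains the zero displacement, allowing deformation can only lower the distance compared to the rigid pixel-by-pixel comparison.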

Edge Detection (Sobel Filter)
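For reference, the Sobel operator convolves the image with two 3×3 kernels, one per direction. The sketch below uses plain Python lists and leaves border pixels at 0, which is an arbitrary choice made here for brevity.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]

def sobel(image):
    """Return (gx, gy), the horizontal and vertical gradient images."""
    height, width = len(image), len(image[0])
    gx = [[0] * width for _ in range(height)]
    gy = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            for ky in range(3):
                for kx in range(3):
                    value = image[y + ky - 1][x + kx - 1]
                    gx[y][x] += SOBEL_X[ky][kx] * value
                    gy[y][x] += SOBEL_Y[ky][kx] * value
    return gx, gy

# Hypothetical image with a vertical edge between the second and third column.
edge = [[0, 0, 10, 10] for _ in range(4)]
gx, gy = sobel(edge)
```

The vertical edge shows up only in `gx`; `gy` stays at zero because the image does not change from row to row.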

Efficiency: Speeding up IDM
Algorithmic optimization
• Idea: Only k ≤ 5 of the 10,000 reference images are used for the subsequent kNN classification.
• Terminate the distance computation early for images that will not be used
• Base the decision on a threshold derived from the best k images seen so far
[Figure: pixels not evaluated because the threshold was exceeded]
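The early termination idea can be sketched as follows. To keep the example short, a squared L2 sum over feature values stands in for the per-pixel IDM cost; this substitution, the function names, and the toy data are assumptions of the sketch.

```python
import heapq

def bounded_distance(a, b, threshold):
    """Accumulate the distance, but give up (return None) once it exceeds threshold."""
    total = 0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > threshold:
            return None            # this reference can no longer enter the top k
    return total

def knn_early_termination(query, references, k=5):
    """k nearest neighbours; the threshold is the k-th best distance seen so far."""
    best = []                      # max-heap via negated distances: (-distance, index)
    for index, ref in enumerate(references):
        threshold = -best[0][0] if len(best) == k else float("inf")
        dist = bounded_distance(query, ref, threshold)
        if dist is None:
            continue
        if len(best) < k:
            heapq.heappush(best, (-dist, index))
        else:
            heapq.heapreplace(best, (-dist, index))   # drop the current k-th best
    return sorted((-neg, index) for neg, index in best)

# Hypothetical data: squared distances to the query are 25, 1, 4, 50, 2.
references = [[3, 4], [1, 0], [0, 2], [5, 5], [1, 1]]
result = knn_early_termination([0, 0], references, k=2)
```

The result is the same as for an exhaustive computation, since a partial sum that already exceeds the k-th best distance can only grow further.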

Early Termination Strategy - Experimental results
For IDM: fewer than 30% of all pixels need to be evaluated

Speaking of numbers…
• The original RWTH Aachen implementation of IDM requires, for X×32 images and IDM with gradients, a 5×5 window and a 3×3 context, about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz.
• Using the L2 distance in a Sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results.
• Our Java implementation takes only 16.0 seconds for the same window and context area on a standard Pentium 4 PC at 2.4 GHz using the threshold (no degradation). The L2 distance can benefit from the threshold as well: our Sieve function implementation takes only 2.0 seconds per sample.
• We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since it is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the Sieve function becomes I/O-bound.
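The slide does not spell out how the Sieve function works. Assuming it is a filter-and-refine step (a cheap distance pre-selects candidates, the expensive distance re-ranks only those), a sketch could look like this. The two toy distance functions and the parameter `keep` are hypothetical stand-ins, not the L2 and IDM implementations from the talk.

```python
def sieve_then_refine(query, references, cheap, expensive, keep=3, k=1):
    """Pre-select `keep` candidates with the cheap distance, re-rank them with the expensive one."""
    # Stage 1: rank everything with the cheap distance and keep the best few.
    candidates = sorted(range(len(references)),
                        key=lambda i: cheap(query, references[i]))[:keep]
    # Stage 2: evaluate the expensive distance only for the survivors.
    refined = sorted((expensive(query, references[i]), i) for i in candidates)
    return refined[:k]

# Hypothetical stand-ins: the cheap distance looks at one coordinate only.
def cheap(q, r):
    return (q[0] - r[0]) ** 2

def expensive(q, r):
    return sum(abs(x - y) for x, y in zip(q, r))

references = [[1, 9], [2, 0], [3, 1]]
wide = sieve_then_refine([0, 0], references, cheap, expensive, keep=2)
narrow = sieve_then_refine([0, 0], references, cheap, expensive, keep=1)
```

With `keep=2` the true nearest neighbour (index 1) survives the sieve; with `keep=1` it is filtered out and a worse match is returned. That is one way a sieve can cause the slight degradation mentioned above.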

Multithreading - Implementation
• Several Java worker threads; each computes the similarity between one reference image and the query.
• Dispatcher keeps track of the distance threshold for early termination.
• IDM with early termination takes 4.3 seconds on a Fujitsu-Siemens Celsius M450 workstation (Intel Core 2 Duo E6600, 2.4 GHz) and only 1.5 seconds on an IBM xSeries 445 with 8x Intel Xeon MP 2.8 GHz.
• Opens the possibility of optimizing the second goal…
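The worker/dispatcher structure can be sketched as below. The talk describes a Java implementation; this Python version only illustrates the structure (CPython threads will not give the reported speed-ups), and the class and function names as well as the squared L2 stand-in for the IDM cost are assumptions.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor

class Dispatcher:
    """Keeps the k best results and the current early-termination threshold."""

    def __init__(self, k):
        self.k = k
        self._best = []                    # max-heap via negated distances
        self._lock = threading.Lock()

    def threshold(self):
        with self._lock:
            return -self._best[0][0] if len(self._best) == self.k else float("inf")

    def report(self, dist, index):
        with self._lock:
            if len(self._best) < self.k:
                heapq.heappush(self._best, (-dist, index))
            elif dist < -self._best[0][0]:
                heapq.heapreplace(self._best, (-dist, index))

    def results(self):
        with self._lock:
            return sorted((-neg, index) for neg, index in self._best)

def compare(query, ref, index, dispatcher):
    """Worker task: one query/reference comparison with early termination."""
    total = 0
    for x, y in zip(query, ref):
        total += (x - y) ** 2
        # The threshold only shrinks over time, so exceeding it is always final.
        if total > dispatcher.threshold():
            return
    dispatcher.report(total, index)

def parallel_knn(query, references, k=5, workers=4):
    dispatcher = Dispatcher(k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, ref in enumerate(references):
            pool.submit(compare, query, ref, index, dispatcher)
    # Leaving the with-block waits for all submitted comparisons.
    return dispatcher.results()

references = [[3, 4], [1, 0], [0, 2], [5, 5], [1, 1]]
result = parallel_knn([0, 0], references, k=2)
```

A reference is only dropped when its partial distance already exceeds the k-th best distance known at that moment, so the result does not depend on the order in which the threads finish, as long as the distances are distinct.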

Effectiveness: Adjusting IDM
Uses images scaled down to at most 32 pixels in width/height
[Figure: corresponding pixels]

Multithreading - Results

ImageCLEF 2007 Results
• BLOOM, SVM: SIFT + Pixels
• RWTHi6, SVM/ME: Image Patches
• UFR, SVM: Color Moments + Texture (DWT) + Edge Orientation
• RWTH_mi, KNN: IDM + CCF + TTF
• UNIBAS_DBIS, KNN: IDM
• OHSU, Neural Network: GIST; SVM: SIFT
• BIOMOD, Decision Trees: Random Subwindow
Slide annotations: "Use Machine Learning" (the SVM, SVM/ME, neural network and decision tree entries); "Experts on Domain – Provided Dataset, won 2005" (RWTH); "No Machine Learning… yet" (UNIBAS_DBIS)

What's next? More expressive query definition!
• Region of interest
• Blobworld (http://elib.cs.berkeley.edu/blobworld/)

Query by Sketch (SNF Project)

Compound Document Matching
E.g. patient records

Conclusion
• Similarity Search allows for a variety of applications: new means for browsing, data mining, classification
• Is computationally intensive
  • Algorithmic optimization can speed up IDM by factors of 3.5-4.9
  • Multithreading / distributed execution
• Query requires example object
  • QbS (Query by Sketch) may help
