
Similarity Search

Michael Springmann, PhD Seminar, October 11th, 2007. Similarity Search. Projects: DELOS (EU FP6), Network of Excellence on Digital Libraries, http://www.delos.info/ (Task 1.6 Management of and Access to Virtual Electronic Health Records; Task 1.8 DelosDLMS); DILIGENT (EU FP6)

Presentation Transcript


  1. Michael Springmann, PhD Seminar, October 11th, 2007: Similarity Search

  2. Projects • DELOS (EU FP6): Network of Excellence on Digital Libraries, http://www.delos.info/ • Task 1.6 Management of and Access to Virtual Electronic Health Records • Task 1.8 DelosDLMS • DILIGENT (EU FP6): A Digital Library Infrastructure on Grid Enabled Technology • Work Package 1.4 Index & Search – Feature Extraction • ARTE Scenario. (Michael Springmann - Database & Information Systems Group)

  3. What is similarity search? From a collection, return a ranked list of items for a given reference object. [Figure: a reference object and its ranked result list with similarity scores 0.999, 0.873, 0.722, 0.712, 0.503, 0.442, 0.392]

  4. Steps to compute similarity • Define query (reference object) • Select feature to use for comparison (e.g. color histogram) • Extract feature of reference object • Compare feature with each element of collection • Return (subset of) ranked list (e.g. 5-NN) [Figure labels: 203, 236, 172, 210, 78]
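
The steps above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the talk: the function names, the coarse RGB histogram with 2 bins per channel, and the L1 distance are all assumptions chosen to keep the example small.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_histogram(pixels: List[Pixel], bins: int = 2) -> List[float]:
    """Extract a normalized RGB histogram with bins**3 entries (the feature)."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    for r, g, b in pixels:
        # Map each channel to its bin, clamping to the last bin for safety.
        ri = min(int(r / step), bins - 1)
        gi = min(int(g / step), bins - 1)
        bi = min(int(b / step), bins - 1)
        hist[(ri * bins + gi) * bins + bi] += 1.0
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]


def l1_distance(a: List[float], b: List[float]) -> float:
    """Compare two feature vectors; smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn(query: List[Pixel], collection: Dict[str, List[Pixel]],
        k: int = 5) -> List[Tuple[str, float]]:
    """Rank the whole collection against the query and return the k best."""
    q = color_histogram(query)
    ranked = sorted(
        ((name, l1_distance(q, color_histogram(px)))
         for name, px in collection.items()),
        key=lambda item: item[1],
    )
    return ranked[:k]
```

Note that the ranking here is by distance (ascending), whereas the scores on the previous slide look like similarities (descending); either convention yields a ranked list.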

  5. Similarity Search: Media Types • Image – Color, Texture, Shape • Text – TF/IDF, Edit Distance • Audio – Spectrum, Rhythm, Beat, Pitch • Video Sequences – Visual, Subtitles / Audio Transcripts, (rich) Meta Data • Combinations of several types • Complex Documents. High-dimensional feature vectors.

  6. Goals. Efficiency • Theme: Retrieve the results fast! • Measure: Execution time • Question: How can we achieve this with algorithmic optimizations? Effectiveness • Theme: Find good/better results! • Measure: Quality, e.g. for benchmark collections: Precision, Recall, MAP • Question: How can we find better results w.r.t. the information need of the user?
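
For the effectiveness measures named above, a minimal sketch of precision, recall and (mean) average precision for a ranked result list follows. The function names are illustrative, and the sketch assumes binary relevance judgments as found in typical benchmark collections.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(ranked: Sequence[str], relevant: Set[str],
                     k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k entries of a ranked list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(
        runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over several queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```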

  7. Similarity Search: What it is... A way to order / rank things • May help to group objects • Limitations: • Works only if the feature matches the categorization criterion • No sharp borders

  10. ISIS (Interactive Similarity Search) • Originated at ETH Zurich, continued at UMIT and UNIBAS • The VA-File can handle collections of more than 600,000 images while still achieving interactive response times • Image features used: Color Moments, Texture Moments • Computed globally and for 5 fuzzy regions

  11. 5 Fuzzy Regions

  12. Similarity Search: What it is ... and what it ain't. A way to order / rank things • May help to group objects • Limitations: • Works only if the feature matches the categorization criterion • No sharp borders. Feature extraction will not find out, for example: "One person sleeping" ... at least not without application-specific adjustments / training.

  13. ImageCLEF (http://www.imageclef.org) • I. Ad-hoc photographic retrieval task: IAPR TC-12 Benchmark, 20,000 (tourist) images, multilingual descriptions. Main challenge: short annotations. • II. Object Retrieval Task: PASCAL Visual Object, 2,617 images, 4,754 objects in realistic scenes. Main challenge: purely visual, not pre-segmented. • III. Medical Image Retrieval: c@simage, PEIR, MIR, PathoPic, mypacs.net: > 70,000 images, heterogeneous case notes in XML. • IV. Medical Automatic Annotation Task: IRMA Database, 11,000 medical images, annotated with IRMA Code (116 classes), e.g. 1123-127-500-000. Main challenge: purely visual, classification is domain specific.

  14. IRMA Code Classification Example: 1123-127-500-000. 4 independent axes: • Technical code (T) describes the image modality, e.g. 1 = x-ray, 11 = plain radiography, 112 = analog, 1123 = high beam energy • Biological code (B) describes the biological system examined • Directional code (D) models body orientation, here: anteroposterior (AP, coronal), supine • Anatomical code (A) refers to the body region examined, here: chest • 0 always means unspecified and is therefore always followed by other 0s or -.
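
A small sketch of how such a code can be split into its axes and expanded into its hierarchy levels. Two things are assumptions here, inferred from the example rather than stated on the slide: that the axis order within the string is technical, directional, anatomical, biological, and that the digit 0 marks "unspecified" so the hierarchy stops there. The function names are illustrative.

```python
from typing import Dict, List

# Assumed axis order within a code such as "1123-127-500-000".
AXES = ("technical", "directional", "anatomical", "biological")


def parse_irma_code(code: str) -> Dict[str, str]:
    """Split an IRMA code into its four independent axes."""
    parts = code.split("-")
    if len(parts) != len(AXES):
        raise ValueError(f"expected {len(AXES)} axes, got {len(parts)}")
    return dict(zip(AXES, parts))


def hierarchy_levels(axis_code: str) -> List[str]:
    """Expand one axis into its hierarchy prefixes, e.g. 1, 11, 112, 1123.

    Expansion stops at the first '0', which is taken to mean 'unspecified'.
    """
    levels = []
    for i, ch in enumerate(axis_code, start=1):
        if ch == "0":
            break
        levels.append(axis_code[:i])
    return levels
```

For the example code, the technical axis expands to the four levels listed on the slide (1, 11, 112, 1123), while the last axis, 000, has no specified level at all.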

  15. Image Distortion Model (IDM) • Uses reduced-size images of at most 32 pixels width/height [Figure: corresponding pixels]
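
The idea of matching corresponding pixels can be sketched as follows. This is a deliberately simplified variant, not the implementation discussed in the talk: it works on raw gray values of two equally sized images, lets every query pixel pick its best match inside a small warp window of the reference image, and leaves out the gradients and the local context mentioned later in the slides. The name `idm_distance` and the parameter `warp` are illustrative.

```python
from typing import List

Image = List[List[float]]  # gray values, rows x columns


def idm_distance(query: Image, reference: Image, warp: int = 1) -> float:
    """Simplified image distortion distance between two same-sized images.

    For each query pixel, take the smallest squared difference to any
    reference pixel within +/- warp positions (warp=2 gives a 5x5 window).
    """
    rows, cols = len(query), len(query[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            best = float("inf")
            for di in range(-warp, warp + 1):
                for dj in range(-warp, warp + 1):
                    ri, rj = i + di, j + dj
                    if 0 <= ri < rows and 0 <= rj < cols:
                        diff = query[i][j] - reference[ri][rj]
                        best = min(best, diff * diff)
            total += best
    return total
```

With `warp=0` this degenerates to the plain squared L2 distance; a larger window tolerates small local displacements at the cost of more pixel comparisons.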

  16. Edge Detection (Sobel Filter)
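
For reference, a minimal sketch of the Sobel filter using the standard 3×3 kernels on a gray-value image stored as nested lists. Leaving the border pixels at zero is a simplification chosen for this example; the function names are illustrative.

```python
import math
from typing import List, Tuple

Image = List[List[float]]

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient


def sobel(image: Image) -> Tuple[Image, Image]:
    """Return horizontal and vertical gradients; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    gx = [[0.0] * cols for _ in range(rows)]
    gy = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            sx = sy = 0.0
            for di in range(3):
                for dj in range(3):
                    p = image[i + di - 1][j + dj - 1]
                    sx += SOBEL_X[di][dj] * p
                    sy += SOBEL_Y[di][dj] * p
            gx[i][j], gy[i][j] = sx, sy
    return gx, gy


def gradient_magnitude(gx: Image, gy: Image) -> Image:
    """Edge strength per pixel: sqrt(gx^2 + gy^2)."""
    return [[math.hypot(x, y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]
```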

  17. Efficiency: Speeding up IDM. Algorithmic optimization • Idea: Only the k ≤ 5 nearest of the 10,000 reference images are used for the subsequent kNN classification. • Early termination of the distance computation for images that will not be used • Base the decision on a threshold derived from the best k images seen so far [Figure: pixels not evaluated because the threshold was exceeded]
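
A sketch of this early-termination strategy, under simplifying assumptions: images are flat lists of gray values, the distance is a plain sum of squared differences rather than IDM, and all names are illustrative. The point is only that the partial sum can never decrease, so a comparison can stop as soon as it exceeds the distance of the k-th best image seen so far.

```python
import heapq
from typing import List, Optional, Sequence, Tuple


def distance_with_threshold(query: Sequence[float], reference: Sequence[float],
                            threshold: float) -> Tuple[Optional[float], int]:
    """Sum of squared differences, aborted once it exceeds the threshold.

    Returns (distance or None if aborted, number of pixels evaluated).
    """
    total = 0.0
    for n, (q, r) in enumerate(zip(query, reference), start=1):
        total += (q - r) ** 2
        if total > threshold:
            return None, n
    return total, len(query)


def knn_early_termination(query: Sequence[float],
                          references: List[Tuple[str, Sequence[float]]],
                          k: int = 5) -> Tuple[List[Tuple[float, str]], int]:
    """Return the k nearest references and the total pixels evaluated."""
    best: List[Tuple[float, str]] = []  # max-heap via negated distances
    evaluated = 0
    for name, ref in references:
        # Threshold = distance of the current k-th best, once k are known.
        threshold = -best[0][0] if len(best) == k else float("inf")
        dist, n = distance_with_threshold(query, ref, threshold)
        evaluated += n
        if dist is None:
            continue  # cannot be among the k best, skip it
        if len(best) < k:
            heapq.heappush(best, (-dist, name))
        else:
            heapq.heapreplace(best, (-dist, name))
    return sorted((-d, name) for d, name in best), evaluated
```

Because only comparisons that provably cannot enter the top k are aborted, the result is identical to an exhaustive search; the saving depends on how early good candidates are encountered.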

  18. Early Termination Strategy - Experimental results. For IDM: Less than 30% of all pixels need to be evaluated

  19. Speaking of numbers… • The original RWTH Aachen implementation of IDM requires, for X×32 images (gradients, 5×5 window, 3×3 context), about 190 seconds per sample (= comparison) on a standard Pentium 4 PC running at 2.6 GHz. • Using the L2 distance in a sieve function, they reduced this to 16.8 seconds, but this causes a slight degradation of results. • Our Java implementation takes only 16.0 seconds for the same window and context area on a standard 2.4 GHz Pentium 4 PC when using the threshold (no degradation). The L2 distance can also benefit from the threshold: our sieve function implementation takes only 2.0 seconds per sample. • We cached all features in main memory (only 60 MB). Reading directly from disk takes less than 5 seconds in total. Since reading is performed in parallel to the computation, the penalty for IDM is only about 0.3 seconds; the sieve function becomes I/O-bound.

  20. Multithreading - Implementation • Several Java worker threads, each computing the similarity between one reference image and the query. • A dispatcher keeps track of the distance threshold for early termination. • IDM with early termination takes 4.3 seconds on a Fujitsu-Siemens Celsius M450 workstation (Intel Core 2 Duo E6600, 2.4 GHz) and only 1.5 seconds on an IBM xSeries 445 (8x Intel Xeon MP 2.8 GHz). • Opens the possibility of optimizing the second goal…
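
The worker/dispatcher structure can be sketched as below. The talk describes a Java implementation; this Python version only mirrors the structure, and the class and function names are hypothetical. The distance is again a plain sum of squared differences, and in CPython the global interpreter lock means these threads illustrate the coordination rather than deliver a real speedup.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


class Dispatcher:
    """Keeps the k best results and the shared early-termination threshold."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._best: List[Tuple[float, str]] = []  # max-heap via negation
        self._lock = threading.Lock()

    def threshold(self) -> float:
        with self._lock:
            if len(self._best) < self.k:
                return float("inf")
            return -self._best[0][0]

    def report(self, name: str, dist: float) -> None:
        with self._lock:
            if len(self._best) < self.k:
                heapq.heappush(self._best, (-dist, name))
            elif dist < -self._best[0][0]:
                heapq.heapreplace(self._best, (-dist, name))

    def results(self) -> List[Tuple[float, str]]:
        with self._lock:
            return sorted((-d, name) for d, name in self._best)


def worker(dispatcher: Dispatcher, query: Sequence[float],
           name: str, reference: Sequence[float]) -> None:
    """Compare one reference image with the query, aborting early if possible."""
    threshold = dispatcher.threshold()  # may be stale, which is still safe
    total = 0.0
    for q, r in zip(query, reference):
        total += (q - r) ** 2
        if total > threshold:
            return
    dispatcher.report(name, total)


def parallel_knn(query: Sequence[float],
                 references: List[Tuple[str, Sequence[float]]],
                 k: int = 5, workers: int = 4) -> List[Tuple[float, str]]:
    dispatcher = Dispatcher(k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, dispatcher, query, name, ref)
                   for name, ref in references]
        for future in futures:
            future.result()  # propagate any exception from a worker
    return dispatcher.results()
```

A stale threshold can only be larger than the current one, so a worker may do some unnecessary work but never discards an image that belongs in the final top k.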

  21. Effectiveness: Adjusting IDM • Uses reduced-size images of at most 32 pixels width/height [Figure: corresponding pixels]

  22. Multithreading - Results

  23. ImageCLEF 2007 Results • BLOOM (SVM: SIFT + Pixels) • RWTHi6 (SVM/ME: Image Patches) • Use Machine Learning • UFR (SVM: Color Moments + Texture (DWT) + Edge Orientation) • Experts on Domain – Provided Dataset, won 2005 • RWTH_mi (KNN: IDM + CCF + TTF) • UNIBAS_DBIS (KNN: IDM) • No Machine Learning… yet • OHSU (Neural Network: GIST; SVM: SIFT) • Use Machine Learning • BIOMOD (Decision Trees: Random Subwindow)

  24. What’s next? More expressive query definition! • Region of interest • Blobworld (http://elib.cs.berkeley.edu/blobworld/)

  25. Query by Sketch (SNF Project)

  26. Compound Document Matching, e.g. patient records

  27. Conclusion • Similarity search allows for a variety of applications: new means for browsing, data mining, classification • It is computationally intensive • Algorithmic optimization can speed up IDM by factors of 3.5-4.9 • Multithreading / distributed execution • A query requires an example object • QbS (Query by Sketch) may help
