Multi-modal image search for large-scale applications

Petra Budikova, Michal Batko, Pavel Zezula
Masaryk University, Czech Republic

Outline
• Motivation
• Importance of being multi-modal
• Diversity of approaches, lack of evaluation
• Especially large-scale
• Contributions
• Classification of approaches
• General multi-modal searching
• Implementation in image search domain
• Experimental evaluation setup
• Queries, ground truth, relevance measures
• Selected test datasets
• Early analysis of results
• Conclusions
MDDE 2012, Istanbul, August 31

Evolution of Multimedia Retrieval
• Text-based searching
• Search by annotations, category search
• Limits: missing/erroneous annotations, “image is worth a thousand words”
• Content-based retrieval
• Query-by-example paradigm
• Limits: semantic gap
• Multi-modal approaches
• Combining orthogonal views on similarity
• Overcoming limitations of individual approaches
• Text and visual, video and text, location and visual, …

Multi-modal retrieval: state-of-the-art
• Commerce:
• Text-based searching refined by visual rank
• Google, Bing, …
• Research:
• Multi-modal indices
• Specialized
• Text&visual, location&text, …
• General metric space indexing
• M-tree family, M-index, …
• Threshold algorithm for modality fusion
• Fagin: Combining Fuzzy Information from Multiple Systems.
• Late fusion, ranking methods
• Visual search+text rank, relevance-feedback ranking, …
• Numerous solutions presented at ImageCLEF, …

Multi-modal retrieval: state-of-the-art II
A number of strategies and techniques, but no reasonable comparison!
• Only for some pairs
• Small-scale

Our Objective
Large-scale comparison of fundamental approaches to multi-modal retrieval
• Classification of solutions
• How are the individual modalities processed and fused?
• Efficiency, flexibility?
• Comparable implementations
• Image data domain
• MESSIF implementation framework
• Real-world evaluation platform
• 2 different datasets with 20 million images
• Human-evaluated relevance

Classification I: query processing phases
• Basic search: evaluate query over the whole database
• Postprocessing: evaluate query over candidate objects

Classification II: modality fusion type
• All modalities are equal
• Early fusion
• Specialized indices
• Should be efficient
• Usually not flexible
• May imply costly evaluations of distances
• Late fusion
• Threshold Algorithm
• In theory exact
• Can be extremely costly
• Medium flexibility
• Fusion ranking
• Flexible, efficient
• Approximate solution
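
The Threshold Algorithm named above can be sketched as follows. This is a minimal textbook-style version of Fagin's algorithm, not the implementation evaluated in the talk. It assumes similarity scores where higher is better (the evaluated system works with distances), that every object appears in every modality's list, and that the aggregation function is monotone. The function and variable names are hypothetical.

```python
import heapq


def threshold_algorithm(sorted_lists, k, aggregate=sum):
    """Top-k late fusion over per-modality result lists.

    sorted_lists: one list per modality of (object_id, score) pairs,
    sorted by descending score. Assumption: every object occurs in
    every list, so all lists have the same length.
    Returns (top_k, depth), where depth is the number of sorted-access
    rounds that were needed.
    """
    if not sorted_lists or not sorted_lists[0]:
        return [], 0
    # Random access: per-modality lookup tables object_id -> score.
    lookup = [dict(lst) for lst in sorted_lists]
    scores = {}
    depth = len(sorted_lists[0])
    for pos in range(depth):
        # Sorted access: read one more entry from every list and
        # complete the aggregated score of each newly seen object.
        for lst in sorted_lists:
            obj = lst[pos][0]
            if obj not in scores:
                scores[obj] = aggregate(table[obj] for table in lookup)
        # No unseen object can score higher than this threshold.
        threshold = aggregate(lst[pos][1] for lst in sorted_lists)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, pos + 1
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]), depth


# Toy data (made up for illustration only).
text_list = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
visual_list = [("b", 0.9), ("c", 0.7), ("a", 0.2), ("d", 0.1)]
top, depth = threshold_algorithm([text_list, visual_list], k=1)
```

The stopping depth shows where the cost comes from: when the modalities disagree, the threshold drops slowly and many sorted and random accesses are needed before the top-k is confirmed.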

Classification III: modality fusion type
• Some modalities are more important
• Ranking
• Google, Bing, …
• Very flexible and efficient
• May exploit (pseudo-)relevance feedback
• Quality of results strongly influenced by the performance of the primary modality
• Inherent fusion
• MUFIN
• Very flexible
• Little added cost compared to ranking
• Possibly better quality than with ranking

Evaluation Domain
• Image retrieval
• Popular application, easy to evaluate
• Text and general metric features
• The same model is applicable to many other domains!
• Selected modalities
• Text: keywords, tf-idf measure
• Visual
• MPEG7 global descriptors
• SIFT local descriptors
• Aggregation function: weighted sum
• Implementation platform
• MESSIF library for large-scale metric searching
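
As an illustration of the weighted-sum aggregation named above, the sketch below combines per-modality distances into one value. The modality names, the weights, and the assumption that each distance is already normalized to a comparable range are hypothetical; the slides do not state the weights that were used.

```python
def weighted_sum(distances, weights):
    """Combine per-modality distances into a single distance.

    distances, weights: dicts keyed by modality name. Both must cover
    the same modalities, otherwise the combination is undefined.
    """
    if distances.keys() != weights.keys():
        raise ValueError("distances and weights must cover the same modalities")
    return sum(weights[m] * distances[m] for m in distances)


# Hypothetical example: the visual modality counts twice as much as text.
combined = weighted_sum(
    {"text": 0.4, "visual": 0.2},
    {"text": 1.0, "visual": 2.0},
)
```

With non-negative weights, a weighted sum of metric distances is itself a metric (strictly so when all weights are positive), so the combined distance can still be handled by a general metric index.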

Selected Techniques
• Single modality retrieval
• Baseline for evaluation
• Needed in some search&postprocess strategies
• Text search:
• tf-idf relevance measure
• Lucene implementation
• Visual search:
• Only by global descriptors – weighted sum of five MPEG7 features
• Local descriptors not feasible
• Centralized M-index

Selected Techniques II
• Early fusion: combined text&visual basic search
• “joint features model”
• Fixed combination of modalities => not flexible
• Implementation: metric index (M-index) by text&visual similarity
• Late fusion: separate TBIR and CBIR followed by results aggregation
• Most frequent technique of image-text fusion
• Efficient (parallel evaluation of single-modality retrievals), can interconnect existing systems
• Aggregation can be costly
• Implementations: Threshold Algorithm, fusion ranking

Selected Techniques III
• Text-based retrieval with inherent fusion
• Text used for selection of candidates
• Combined text&visual distances evaluated
• “Large-scale ranking”; in a distributed environment it can be executed in parallel on partial candidate sets
• Implementation: text search, all objects with non-zero text score ranked by combined similarity
• Content-based retrieval with inherent fusion
• Complementary to the previous technique
• Implementation: candidate data regions identified in visual-based index, combined similarity evaluated
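
The text-based variant described above (text search selects candidates, then every object with a non-zero text score is ranked by combined similarity) can be sketched as follows. This is an illustrative toy, not the MESSIF or Lucene implementation: the smoothed tf-idf formula, the conversion of a text score to a distance, the Euclidean visual distance, and all names and weights are assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Return {doc_id: score} for documents with a non-zero text score.

    docs: dict doc_id -> list of keywords. Uses a simple smoothed
    tf-idf variant (an assumption, not Lucene's scoring formula).
    """
    n = len(docs)
    df = Counter(t for kws in docs.values() for t in set(kws))
    scores = {}
    for doc_id, kws in docs.items():
        tf = Counter(kws)
        s = sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if tf[t])
        if s > 0:
            scores[doc_id] = s
    return scores


def inherent_fusion_search(query_terms, query_vec, docs, vectors, k,
                           w_text=0.5, w_visual=0.5):
    """Rank all text-matching objects by a combined text&visual distance."""
    text = tf_idf_scores(query_terms, docs)
    if not text:
        return []
    best = max(text.values())
    ranked = []
    for doc_id, score in text.items():
        d_text = 1.0 - score / best                       # 0 for the best text match
        d_visual = math.dist(query_vec, vectors[doc_id])  # toy visual distance
        ranked.append((w_text * d_text + w_visual * d_visual, doc_id))
    ranked.sort()
    return [doc_id for _, doc_id in ranked[:k]]


# Toy collection (made up): p3 is visually closest but has no matching keyword.
docs = {"p1": ["bee", "flower"], "p2": ["bee"], "p3": ["car"]}
vectors = {"p1": (0.0, 1.0), "p2": (0.0, 0.1), "p3": (0.0, 0.0)}
result = inherent_fusion_search(["bee"], (0.0, 0.0), docs, vectors, k=3)
```

Objects without a matching keyword never enter the candidate set, which is why the quality of the primary modality bounds the quality of the final result.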

Selected Techniques IV
• Result ranking techniques
• Rank by text
• Implementation: tf-idf
• Rank by visual similarity
• Implementation: rank by global descriptors (MPEG7), rank by local descriptors (SIFT)
• Pseudo-RF ranking
• Popular, rapidly developing methods
• Trying to overcome the semantic gap
• Explore properties of objects in basic search result, relationships
• Implementation: important descriptors rank (low variance), reverse kNN rank, clustering rank
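
One plausible reading of the reverse kNN rank listed above is sketched below: each candidate from the basic search is scored by how many other candidates count it among their k nearest neighbours, so objects that are central to the candidate set move up. The slides do not define the method in detail, so this definition, the Euclidean distance, and all names are assumptions.

```python
import math


def reverse_knn_rank(candidates, vectors, k):
    """Re-rank candidates by their reverse-kNN count within the candidate set.

    candidates: object ids in basic-search order.
    vectors: dict object_id -> feature vector (toy visual descriptor).
    Ties keep the original basic-search order (sorted() is stable).
    """
    counts = {c: 0 for c in candidates}
    for c in candidates:
        others = [o for o in candidates if o != c]
        others.sort(key=lambda o: math.dist(vectors[c], vectors[o]))
        # Every object among c's k nearest neighbours gets one vote.
        for o in others[:k]:
            counts[o] += 1
    return sorted(candidates, key=lambda c: -counts[c])


# Toy 1-D descriptors (made up): "d" is an outlier in the candidate set.
candidates = ["a", "b", "c", "d"]
vectors = {"a": (0.0,), "b": (1.0,), "c": (1.2,), "d": (10.0,)}
reranked = reverse_knn_rank(candidates, vectors, k=1)
```

The sketch compares every pair of candidates, which is affordable only because postprocessing works on a small candidate set (100 to 2000 objects in the evaluated setup).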

Selected Techniques V
• Overview of implemented techniques
[Figure: pipeline diagram with labels Query, Basic search, Candidate objects (sizes: 100 – 2000), Results postprocessing, Result]

Evaluation
• Queries: 100 image+keyword queries
• Frequent queries from photostock company logs
• Easy and difficult queries selected by experience
• Ground truth
• Pooling approach, human assessors
• 3-grade relevance: very good, acceptable, irrelevant
• Translated to relevance percentage, averaged
• Result quality measures
• Precision@k, DCG, NDCG
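
The quality measures listed above can be computed as in the following sketch. The numeric mapping of the three relevance grades (1.0, 0.5, 0.0) is a hypothetical choice, since the slides only say that grades were translated to a relevance percentage. Precision here counts any non-zero grade as relevant, and the ideal ranking for NDCG is taken from the judged list itself instead of the full assessment pool.

```python
import math

# Hypothetical grade-to-relevance mapping (not given in the slides).
GRADES = {"very good": 1.0, "acceptable": 0.5, "irrelevant": 0.0}


def precision_at_k(relevances, k):
    """Fraction of the first k results with non-zero relevance."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of a made-up result list, in ranked order.
judged = [GRADES[g] for g in ("very good", "irrelevant", "acceptable")]
p3 = precision_at_k(judged, 3)
dcg3 = dcg_at_k(judged, 3)
ndcg3 = ndcg_at_k(judged, 3)
```

Unlike Precision@k, DCG and NDCG reward placing the "very good" results ahead of the merely "acceptable" ones, which is why graded judgments are useful here.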

Evaluation III
Evaluation datasets:
• Profimedia dataset
• Real-world photo collection created for sale
• 20M high-quality images, rich and precise keyword annotations
• CoPhIR dataset (Flickr photos)
• Real-world photo collection created for fun
• 20M images of different quality, sparse and erroneous keyword annotations
[Figure: example image annotations from the two datasets, interleaved in the transcript: Close-up of bee sitting on pink field flower; animal apis apis mellifera arthropod beauty in nature bee macro flower spring insect animals pollination bloom blossom brown closeup closeup collecting color creative_tag extreme close-up flora floral flower front view hairy honeybee horizontal image insect invertebrate honey-bee no honeybee]

Results: bimodal fusion performance
[Figure: chart, axis unit ms]
• Text and MPEG7 modalities fusion
• Text-based solutions the best
• Expected – rich annotations
• Ranking significantly improves search
• For both text and visual search
• Choice of primary modality extremely important!
• Threshold algorithm very costly
• Not suitable for large-scale
• Inherent fusion costs acceptable
• Result quality slightly better than with ranking

Results: limits of text-based searching
• Queries where text-based solutions do not provide best results
[Figure: example query, query text: bird]
• Complex queries
• “two coins”
• Ambiguous queries
• “shells”, “stamp”
• Too broad queries
• “bird”
• Future work: a more detailed analysis of aspects that influence the performance of a given modality for a given query

Results: multi-modal search with ranking
• Effectiveness and efficiency of ranking techniques
[Figure: two charts, NDCG at 30 vs. # of ranked objects]
• Text search ranking:
• MPEG7 rank performs as well as SIFT rank while being more efficient
• Visual search ranking:
• The most complementary modality is the best
• Influence of the number of ranked objects
• Differs for text and visual search
• May be related to search space dimensionality – future work

Conclusion
• State-of-the-art
• Multi-modal search paradigm
• Rapid development of approaches
• Real-world evaluations needed
• Our contribution
• First extensive evaluation of fundamental approaches to large-scale multi-modal retrieval
• No big surprises, but valuable insights gained
• Effectiveness vs. efficiency tradeoff, strengths and limits of text-based solutions, performance of various ranking methods
• Future work
• Evaluate again on qualitatively different data
• Determine conditions of usability of individual methods
