
Accounting for the relative importance of objects in image retrieval


Presentation Transcript


  1. Accounting for the relative importance of objects in image retrieval. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin.

  2. Image retrieval. Content-based retrieval from an image database. [Figure: a query image is matched against the image database, returning Image 1, Image 2, …, Image k.]

  3. Relative importance of objects. Which image is more relevant to the query? [Figure: a query image and candidate matches from the image database.]

  4. Relative importance of objects. Which image is more relevant to the query? [Figure: the same query and candidates, now labeled with their objects, e.g. cow, mud; cow, fence; water, sky, bird.]

  5. Relative importance of objects. An image can contain many different objects, but some are more “important” than others. [Figure: a scene containing architecture, sky, mountain, bird, cow, and water.]

  6. Relative importance of objects. Some objects are background. [Same figure, with the background objects highlighted.]

  7. Relative importance of objects. Some objects are less salient. [Same figure, with the less salient objects highlighted.]

  8. Relative importance of objects. Some objects are more prominent or perceptually define the scene. [Same figure, with the scene-defining objects highlighted.]

  9. Our goal. Retrieve those images that share important objects with the query image. [Figure: a retrieval that shares the query’s important objects versus one that does not.] How to learn a representation that accounts for this?

  10. Idea: image tags as importance cue. The order in which a person assigns tags provides implicit cues about the objects’ importance to the scene. [Figure: a tagged image. TAGS: Cow, Birds, Architecture, Water, Sky.]

  11. Idea: image tags as importance cue. The order in which a person assigns tags provides implicit cues about the objects’ importance to the scene. Learn this connection to improve cross-modal retrieval and CBIR. [Figure: the same tagged image. TAGS: Cow, Birds, Architecture, Water, Sky.]

  12. Related work. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009; also Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, … Related work builds richer image representations from “two-view” text+image data: Hardoon et al. 2004; Gupta et al. 2008; Blaschko & Lampert 2008; also Bekkerman & Jeon 2007, Qi et al. 2009, Quack et al. 2008, Quattoni et al. 2007, Yakhnenko & Honavar 2009, … [Figure: an example text+image pair, a player photo with profile text: height: 6-11, weight: 235 lbs, position: forward, croatia, college: …]

  13. Approach overview: building the image database. From tagged training images (e.g. Cow, Grass; Horse, Grass; Car, House, Grass, Sky; …), extract visual and tag-based features, then learn projections from each feature space into a common “semantic space”.

  14. Approach overview: retrieval from the database. The learned space supports three retrieval tasks: • Image-to-image retrieval: an untagged query image retrieves images from the database. • Image-to-tag auto annotation: an untagged query image retrieves a tag list (e.g. Cow, Tree, Grass). • Tag-to-image retrieval: a tag-list query (e.g. Cow, Tree) retrieves images.

  15. Dual-view semantic space. Visual features and tag lists are two views generated by the same concept. [Figure: both views mapping into a shared semantic space.]

  16. Learning mappings to semantic space. Canonical Correlation Analysis (CCA): choose projection directions that maximize the correlation of the two views projected from the same instance. [Figure: View 1 and View 2 both project into the semantic space, a new common feature space.]

  17. Kernel Canonical Correlation Analysis [Akaho 2001, Fyfe et al. 2001, Hardoon et al. 2004]. Linear CCA: given paired data $\{(x_i, y_i)\}_{i=1}^{n}$, select directions $w_x$ and $w_y$ so as to maximize

$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{(w_x^\top C_{xx} w_x)(w_y^\top C_{yy} w_y)}},$$

where $C_{xx}$, $C_{yy}$ are the within-view and $C_{xy}$ the between-view covariance matrices. Kernel CCA: given a pair of kernel functions $k_x$ and $k_y$ with kernel matrices $K_x$ and $K_y$, the objective is the same, but the projections live in kernel space, $w_x = \sum_i \alpha_i \phi_x(x_i)$ and $w_y = \sum_i \beta_i \phi_y(y_i)$:

$$\rho = \max_{\alpha, \beta} \frac{\alpha^\top K_x K_y \beta}{\sqrt{(\alpha^\top K_x^2 \alpha)(\beta^\top K_y^2 \beta)}}.$$
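
For concreteness, here is a minimal NumPy/SciPy sketch of regularized KCCA written as a generalized eigenproblem, one standard way to solve the objective above; the function name, the regularizer kappa, and n_components are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.linalg import eigh

def kcca(Kx, Ky, kappa=0.1, n_components=10):
    """Regularized kernel CCA via a generalized eigenproblem
    (one standard formulation; cf. Hardoon et al. 2004).

    Kx, Ky : (n, n) centered kernel matrices for the two views.
    Returns (alpha, beta): projection coefficients, one column per
    semantic-space dimension.
    """
    n = Kx.shape[0]
    Z = np.zeros((n, n))
    I = np.eye(n)
    # Off-diagonal blocks couple the two views.
    A = np.block([[Z, Kx @ Ky],
                  [Ky @ Kx, Z]])
    # Regularized within-view blocks keep B positive definite.
    Rx = (Kx + kappa * I) @ (Kx + kappa * I)
    Ry = (Ky + kappa * I) @ (Ky + kappa * I)
    B = np.block([[Rx, Z],
                  [Z, Ry]])
    vals, vecs = eigh(A, B)                    # eigenvalues ascending
    order = np.argsort(vals)[::-1][:n_components]
    alpha, beta = vecs[:n, order], vecs[n:, order]
    return alpha, beta
```

A new point is then mapped into the semantic space via its kernel values against the training set, e.g. `s = k_x_new @ alpha` for the visual view.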

  18. Building the kernels for each view. [Figure: word frequency and rank kernels on the tag side and visual kernels on the image side both feed the semantic space.]

  19. Visual features. Visual words capture local appearance (k-means on DoG+SIFT); Gist captures the total scene structure [Torralba et al.]; the color histogram captures the HSV color distribution. Average the component χ² kernels to build a single visual kernel.
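
A sketch of how the component kernels could be computed and averaged into the single visual kernel the slide describes; chi2_kernel uses one common convention for the exponentiated χ² kernel, and gamma is an illustrative bandwidth.

```python
import numpy as np

def chi2_kernel(X, Y, gamma=1.0):
    """Exponentiated chi-squared kernel between histogram features.
    X: (m, d), Y: (n, d); rows are (near-)normalized histograms."""
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :] + 1e-10   # epsilon for empty bins
    d = 0.5 * np.sum(diff ** 2 / summ, axis=2)     # pairwise chi2 distance
    return np.exp(-gamma * d)

def visual_kernel(features_a, features_b, gamma=1.0):
    """Average one chi2 kernel per feature type (color histogram,
    visual words, gist) into a single visual kernel."""
    Ks = [chi2_kernel(Fa, Fb, gamma) for Fa, Fb in zip(features_a, features_b)]
    return np.mean(Ks, axis=0)
```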

  20. Tag features: word frequency (the traditional bag-of-(text)words). For the example tag list (Cow, Bird, Water, Architecture, Mountain, Sky), tag → count: Cow 1, Bird 1, Water 1, Architecture 1, Mountain 1, Sky 1, Car 0, Person 0.

  21. Tag features: absolute rank in this image’s tag list. For the same tag list, tag → value: Cow 1, Bird 0.63, Water 0.50, Architecture 0.43, Mountain 0.39, Sky 0.36, Car 0, Person 0.

  22. Tag features: relative rank, the percentile rank obtained from the rank distribution of that word in all tag lists. For the same tag list, tag → value: Cow 0.9, Bird 0.6, Water 0.8, Architecture 0.5, Mountain 0.8, Sky 0.8, Car 0, Person 0. Average the component χ² kernels to build a single tag kernel.
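
A sketch of the three tag encodings on slides 20–22. The 1/log2(1+r) decay for absolute rank reproduces the example values on slide 21 (1, 0.63, 0.50, 0.43, 0.39, 0.36); rank_percentile is a hypothetical interface to the per-word rank distributions estimated from the training tag lists.

```python
import numpy as np

def tag_feature_vectors(tag_list, vocab, rank_percentile):
    """Three tag-based feature vectors for one image's ordered tag list.

    tag_list        : tags in the order the annotator gave them.
    vocab           : the full tag vocabulary (list of words).
    rank_percentile : rank_percentile[word][r] = percentile of rank r in
                      that word's rank distribution over all tag lists
                      (hypothetical lookup).
    """
    V = len(vocab)
    word_freq = np.zeros(V)   # slide 20: bag-of-words presence
    abs_rank = np.zeros(V)    # slide 21: decays with list position
    rel_rank = np.zeros(V)    # slide 22: percentile of this word's rank
    for r, word in enumerate(tag_list, start=1):
        i = vocab.index(word)
        word_freq[i] = 1.0
        abs_rank[i] = 1.0 / np.log2(1.0 + r)   # 1, 0.63, 0.50, ...
        rel_rank[i] = rank_percentile[word][r]
    return word_freq, abs_rank, rel_rank
```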

  23. Recap: building the image database. [Figure: the visual feature space and the tag feature space are both projected into the common semantic space.]

  24. Experiments. We compare the retrieval performance of our method with two baselines: a Visual-Only baseline, and a Words+Visual baseline (a KCCA semantic space learned without the rank-based importance features) [Hardoon et al. 2004, Yakhnenko et al. 2009]. [Figure: a query image and the first retrieved image for each method.]

  25. Evaluation. We use Normalized Discounted Cumulative Gain at top K (NDCG@K) to evaluate retrieval performance [Kekalainen & Jarvelin, 2002]:

$$\mathrm{NDCG@}K = \frac{1}{Z}\sum_{p=1}^{K}\frac{s(p)}{\log_2(1+p)},$$

where $s(p)$ is the reward-term score for the $p$th ranked example and $Z$ is the sum of the discounted scores for the perfect ranking (the normalization). The logarithmic discount means doing well in the top ranks is more important.
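
A small sketch computing the measure above for one query; this follows the simple discounted-gain variant written there, which is one common form of the Kekalainen & Jarvelin definition.

```python
import numpy as np

def ndcg_at_k(rewards, k):
    """NDCG@K for one query. `rewards` holds the reward-term score s(p)
    of each retrieved example, in the system's ranked order."""
    r = np.asarray(rewards, dtype=float)
    disc = 1.0 / np.log2(np.arange(2, len(r) + 2))  # top ranks weigh most
    dcg = np.sum(r[:k] * disc[:k])
    ideal = np.sort(r)[::-1]                        # perfect ranking
    idcg = np.sum(ideal[:k] * disc[:k])
    return dcg / idcg if idcg > 0 else 0.0
```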

  26. Evaluation. We report the NDCG@K score using two different reward terms: • Object presence/scale: rewards similarity between the query’s objects and scales and those in the retrieved image(s). • Ordered tag similarity: rewards similarity between the query’s ground-truth tag ranks (absolute and relative rank) and those in the retrieved image(s). [Figure: example ground-truth tag lists, e.g. Cow, Tree, Grass, Person and Cow, Tree, Fence, Grass.]

  27. Datasets. LabelMe: • 6352 images • Database: 3799 images • Query: 2553 images • Scene-oriented • Ordered tag lists derived from the order in which labels were added • 56 unique taggers • ~23 tags/image. PASCAL: • 9963 images • Database: 5011 images • Query: 4952 images • Object-centric • Tag lists obtained on Mechanical Turk • 758 unique taggers • ~5.5 tags/image.

  28. Image-to-image retrieval. We want to retrieve the images most similar to the given query image in terms of object importance. [Figure: an untagged query image is projected from the visual kernel space into the semantic space, where it retrieves images from the database; the tag-list kernel space is only needed during training.]
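
A sketch of this retrieval step under the setup above: the query has no tags, so only its visual kernel values and the visual-view coefficients from kcca() are used; cosine similarity is one reasonable choice of metric, not necessarily the paper's.

```python
import numpy as np

def image_to_image_retrieval(kx_query, alpha, S_db, topk=5):
    """Rank database images for an untagged query image.

    kx_query : (n_train,) visual-kernel values between the query and the
               training images used to learn the projections.
    alpha    : visual-view coefficients from kcca() above.
    S_db     : (n_db, d) database images already projected into the
               semantic space.
    """
    s_q = kx_query @ alpha                 # query's semantic coordinates
    sims = S_db @ s_q / (np.linalg.norm(S_db, axis=1)
                         * np.linalg.norm(s_q) + 1e-10)
    return np.argsort(sims)[::-1][:topk]   # indices of best matches first
```

Tag-to-image retrieval (slide 32) is the mirror case: project the keyword query with the tag-view coefficients beta and rank against the same semantic-space database.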

  29. Image-to-image retrieval results. [Figure: example queries with the top retrieval from Visual-only, Words+Visual, and our method.]

  30. Image-to-image retrieval results. [Figure: further example queries with the top retrieval from Visual-only, Words+Visual, and our method.]

  31. Image-to-image retrieval results. Our method better retrieves images that share the query’s important objects, by both measures (up to a 39% improvement). [Charts: retrieval accuracy measured by object+scale similarity and by ordered tag-list similarity.]

  32. Tag-to-image retrieval. We want to retrieve the images that are best described by the given tag list. [Figure: a query tag list (Cow, Person, Tree, Grass) is projected from the tag-list kernel space into the semantic space to retrieve images from the database.]

  33. Tag-to-image retrieval results. Our method better respects the importance cues implied by the user’s keyword query (a 31% improvement). [Chart: retrieval accuracy.]

  34. Image-to-tag auto annotation. We want to annotate a query image with ordered tags that best describe the scene. [Figure: an untagged query image is projected into the semantic space and matched against the database, outputting ordered tag lists such as Cow, Tree, Grass, Field.]
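
One plausible reading of this step, since slide 35 reports results as a function of k nearest neighbors: transfer tags from the query's k nearest database images in the semantic space, weighting tags that sit high in a neighbor's list. The exact annotation rule in the paper may differ; this is a hedged sketch.

```python
import numpy as np

def auto_annotate(s_query, S_db, db_tag_lists, k=3):
    """Order tags for a query image by borrowing from its k nearest
    database images in the semantic space."""
    sims = S_db @ s_query
    votes = {}
    for i in np.argsort(sims)[::-1][:k]:
        for rank, tag in enumerate(db_tag_lists[i], start=1):
            # A tag counts more when ranked high in a neighbor's list.
            votes[tag] = votes.get(tag, 0.0) + 1.0 / np.log2(1.0 + rank)
    return sorted(votes, key=votes.get, reverse=True)
```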

  35. Image-to-tag auto annotation results. [Figure: example images with output tag lists, e.g. Tree, Boat, Grass, Water, Person; Boat, Person, Water, Sky, Rock; Person, Tree, Car, Chair, Window; Bottle, Knife, Napkin, Light, Fork. Charts plot accuracy against k, the number of nearest neighbors used.]

  36. Implicit tag cues as a localization prior [Hwang & Grauman, CVPR 2010]. Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags). [Figure: training images with tag lists such as Desk, Mug, Office, Computer, Poster; Desk, Screen, Mug, Poster; Mug, Eiffel; Mug, Coffee, Woman, Table; Mug, Ladder.] Testing: given a novel image, localize objects based on both tags and appearance, combining an object detector with the implicit tag features. [Figure: a test image with tags Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it.]

  37. Conclusion. • We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image. • The approach requires minimal supervision to learn the connection between the importance conveyed by tags and visual features. • It yields consistent gains over both content-based visual search and a tag+visual approach that disregards importance.

  38. Thank you
