
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags


Presentation Transcript


  1. Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin.

  2. Detecting tagged objects. Images tagged with keywords clearly tell us which object to search for. Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.

  3. Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence (Berg et al. 2004; Duygulu et al. 2002; Fergus et al. 2005; Vijayanarasimhan & Grauman 2008).

  4. Main idea. The list of tags on an image may give useful information beyond just what objects are present. Given the tag lists (Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it) and (Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, Computer), can you guess where and at what size the mug will appear in each image?

  5. Main idea: tags as context. In the first list, larger objects are absent and the mug is named first; in the second, larger objects are present and the mug is named later in the list.

  6. Feature: word presence/absence. The presence or absence of other objects, and the number of those objects, affects the scene layout. The presence of smaller objects such as a key, together with the absence of larger objects, hints that the image may be a close-up; the presence of larger objects such as a desk and bookshelf hints that the image depicts a typical office scene.

  7. Feature: word presence/absence. A plain bag-of-words feature describing word frequency: W = (w_1, ..., w_N), where w_i is the count of word i in the tag list. (In the slide's examples, blue marks larger objects and red marks smaller ones.) A minimal sketch appears below.
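A minimal sketch of computing this word-count feature; the vocabulary and tag list here are illustrative stand-ins, not the paper's actual data:

```python
from collections import Counter

def word_count_feature(tags, vocabulary):
    """Bag-of-words feature: w_i = count of vocabulary word i in the tag list."""
    counts = Counter(tags)
    return [counts[word] for word in vocabulary]

# Illustrative example with a tiny vocabulary
vocabulary = ["mug", "key", "keyboard", "desk", "bookshelf", "screen"]
tags = ["computer", "poster", "desk", "bookshelf", "screen",
        "keyboard", "screen", "mug", "poster", "computer"]
print(word_count_feature(tags, vocabulary))  # [1, 0, 1, 1, 1, 2]
```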

  8. Feature: tag rank. People tag the 'important' objects earlier. If an object is tagged first, there is a high chance it is the main object: large and centered. If it is tagged later, the object may not be salient: it may be far from the center or small in scale.

  9. Feature: tag rank. Encoded as the percentile of the tag's rank in this image's list compared against that word's typical rank: r_i = percentile of the rank for tag i. (In the slide's examples, blue marks high relative rank (>0.6), green medium (0.4–0.6), and red low (<0.4).) A sketch follows.
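One plausible way to compute this relative-rank percentile, assuming we have collected, for each word, the ranks at which it appeared across training tag lists; the data below is made up for illustration:

```python
def rank_percentile(word, rank, typical_ranks):
    """Fraction of the word's training-set ranks that this occurrence beats.

    `typical_ranks[word]` lists the positions at which `word` appeared in
    training tag lists; `rank` is its position in the current list.
    A value near 1 means the word is tagged unusually early for that word.
    """
    history = typical_ranks[word]
    return sum(r >= rank for r in history) / len(history)

# Illustrative history: "mug" usually shows up around rank 5 in training lists
typical_ranks = {"mug": [4, 5, 6, 5, 7, 3, 8]}
print(rank_percentile("mug", 1, typical_ranks))  # 1.0  -> tagged much earlier than usual
print(rank_percentile("mug", 8, typical_ranks))  # ~0.14 -> tagged later than usual
```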

  10. Feature: proximity. People tend to move their eyes to nearby objects, so objects that are close to each other in the tag list are likely to be close in the image. Example lists with ranks: 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it; and 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster, 10) Computer.

  11. Feature: proximity. Encoded as the inverse of the average rank difference between tag words: p_i,j is derived from the rank difference between tags i and j, so tags adjacent in the list get high proximity values. (In the slide's example, blue marks objects close to each other.) A sketch follows.
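A sketch of one plausible reading of this inverse-average-rank-difference encoding; the function name and pairing scheme are assumptions for illustration:

```python
from itertools import combinations

def proximity_features(tags):
    """For each pair of distinct tag words, the inverse of the average rank
    difference between their occurrences (ranks are 1-indexed).
    Adjacent tags get values near 1; distant tags get small values."""
    positions = {}
    for rank, word in enumerate(tags, start=1):
        positions.setdefault(word, []).append(rank)

    features = {}
    for wi, wj in combinations(sorted(positions), 2):
        diffs = [abs(ri - rj) for ri in positions[wi] for rj in positions[wj]]
        features[(wi, wj)] = 1.0 / (sum(diffs) / len(diffs))
    return features

tags = ["computer", "poster", "desk", "bookshelf", "screen"]
print(proximity_features(tags)[("bookshelf", "desk")])  # 1.0 (adjacent tags)
```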

  12. Overview of the approach. Given an image and its tags (e.g., Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), we extract the implicit tag features, e.g., W = {1, 0, 2, ..., 3}, R = {0.9, 0.5, ..., 0.2}, P = {0.25, 0.33, ..., 0.1}, and model P(X|W), P(X|R), and P(X|P) to predict where to look. The appearance model answers what to look for: a sliding-window detector gives P(X|A). The tag-based predictions prime the detector, yielding the localization result.

  13. Overview of the approach (alternative use). The same tag-based predictions P(X|W), P(X|R), and P(X|P) can instead modulate the detector's appearance-based scores P(X|A), combining both into a final confidence for each window before producing the localization result.

  14. Approach: modeling P(X|T). We wish to know the conditional PDF of the location and scale of the target object given the tag features: P(X|T), where X = (s, x, y) and T is the tag feature. We model this conditional PDF directly, without computing the joint distribution P(X,T), using a mixture density network (MDN). (The slide shows training images tagged with words such as Lamp, Car, Wheel, Window, House, Road, and Building, the top 30 most likely positions for the class car, and bounding boxes sampled according to P(X|T).) A sketch of the MDN forward pass follows.
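A minimal numpy sketch of what an MDN forward pass computes; the layer sizes are arbitrary and the random weights stand in for the trained network, so this shows the shape of the computation rather than the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: tag feature dim D, hidden units H, K mixture components
# over X = (s, x, y). The weights are random stand-ins for trained parameters.
D, H, K, X_DIM = 12, 32, 4, 3
W1, b1 = rng.normal(size=(H, D)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(K * (1 + 2 * X_DIM), H)) * 0.1, np.zeros(K * (1 + 2 * X_DIM))

def mdn_params(t):
    """Map a tag feature vector t to mixture parameters (pi, mu, sigma)."""
    h = np.tanh(W1 @ t + b1)
    out = W2 @ h + b2
    logits, mu, log_sigma = np.split(out, [K, K + K * X_DIM])
    pi = np.exp(logits - logits.max()); pi /= pi.sum()   # mixing weights (softmax)
    mu = mu.reshape(K, X_DIM)                            # component means
    sigma = np.exp(log_sigma).reshape(K, X_DIM)          # diagonal std devs
    return pi, mu, sigma

def p_x_given_t(x, t):
    """Evaluate P(X = x | T = t) under the mixture of diagonal Gaussians."""
    pi, mu, sigma = mdn_params(t)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(pi @ dens.prod(axis=1))

t = rng.normal(size=D)                  # stand-in tag feature vector
print(p_x_given_t(np.zeros(X_DIM), t))  # density at scale/position (0, 0, 0)
```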

  15. Approach: priming the detector. How can we use the learned distribution P(X|T)? First, to speed up the detection process: rank the candidate windows by the learned P(X|T), then search only the probable regions and scales, in rank order; windows at unlikely positions and scales are ignored entirely. A sketch follows.
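A sketch of how such priming might look; `p_x_given_t`, `detector_score`, and the fixed search budget are illustrative assumptions, not the paper's interface:

```python
def prime_detector(windows, tag_feature, detector_score, p_x_given_t, budget=0.1):
    """Rank candidate windows by the tag-based prior P(X|T) and run the
    (expensive) appearance detector only on the most probable fraction.

    `windows` is a list of (s, x, y) hypotheses; `budget` is the fraction
    of windows we can afford to score; the rest are ignored.
    """
    ranked = sorted(windows, key=lambda w: p_x_given_t(w, tag_feature), reverse=True)
    keep = ranked[: max(1, int(budget * len(ranked)))]
    scored = [(w, detector_score(w)) for w in keep]   # sliding-window scores
    return max(scored, key=lambda ws: ws[1])          # best surviving hypothesis
```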

  16. Approach: modulating the detector. Second, to modulate the detection confidence score: a logistic regression classifier learns a weight for each prediction, P(X|A) from the appearance detector and P(X|W), P(X|R), and P(X|P) from the image tags, and fuses them into a single score. A sketch follows.
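A sketch of this score fusion for a single window; the weights and bias below are made-up illustrations of what the learned logistic regression might produce:

```python
import numpy as np

def combined_score(p_a, p_w, p_r, p_p, weights, bias):
    """Logistic-regression fusion of the appearance score P(X|A) and the
    three tag-based predictions P(X|W), P(X|R), P(X|P) for one window.
    `weights` and `bias` would be learned from training detections."""
    z = np.dot(weights, [p_a, p_w, p_r, p_p]) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights: the detector dominates, tags modulate the score
weights, bias = np.array([2.0, 0.8, 0.6, 0.4]), -2.0
print(combined_score(0.8, 0.9, 0.5, 0.3, weights, bias))  # fused confidence
```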

  17. Approach: modulating the detector. Predictions based on the original detector score alone, e.g., 0.7, 0.8, and 0.9 for three candidate windows.

  18. Approach: modulating the detector. Predictions based on the tag features for the same three windows, e.g., 0.9, 0.3, and 0.2.

  19. Approach: modulating the detector. Combining the two reorders the hypotheses: 0.9 × 0.7 = 0.63, 0.3 × 0.8 = 0.24, and 0.2 × 0.9 = 0.18, so the window supported by both the detector and the tags is now ranked first.

  20. Experiments. We compare detection speed (the number of windows to search) and detection accuracy (AUROC and AP) across three methods: appearance-only, appearance + Gist, and appearance + tag features (ours). A sketch of the AP metric follows.
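For reference, a minimal sketch of computing AP over a ranked list of detections; this is the plain, non-interpolated form, whereas PASCAL's official evaluation uses an interpolated variant:

```python
def average_precision(ranked_hits):
    """AP over a ranked detection list: 1 marks a correct detection,
    0 a false positive. Precision is averaged at each correct detection."""
    hits, precisions = 0, []
    for i, correct in enumerate(ranked_hits, start=1):
        if correct:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(1, hits)

print(average_precision([1, 1, 0, 1, 0]))  # (1/1 + 2/2 + 3/4) / 3 ≈ 0.917
```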

  21. Experiments: datasets. LabelMe contains ordered tag lists; here we use Dalal & Triggs' HOG detector. PASCAL VOC 2007 contains images with high variance in composition; its tag lists were obtained from anonymous workers on Mechanical Turk, and we use Felzenszwalb's LSVM detector.

  22. LabelMe: performance evaluation. Using a modified version of the HOG detector by Dalal and Triggs, detection is faster because we know where to look first, and more accurate because we know which hypotheses to trust most.

  23. Results: LabelMe. Comparing HOG, HOG+Gist, and HOG+Tags on street scenes (tags such as Sky, Buildings, Person, Sidewalk, Car, Road, Window, Wheel, Sign): Gist and tags tend to predict the same position but different scales, and most of the accuracy gain from the tag features comes from more accurate scale prediction.

  24. Results: LabelMe (further examples). Office scenes tagged with, e.g., (Desk, Keyboard, Screen, Bookshelf) and (Desk, Keyboard, Screen, Mug, Keyboard, Screen, CD), again comparing HOG, HOG+Gist, and HOG+Tags.

  25. PASCAL VOC 2007: performance evaluation. Using a modified version of Felzenszwalb's LSVM detector, our method needs to test fewer windows to achieve the same detection rate, and yields a 9.2% improvement in accuracy (average precision) over all classes.

  26. Per-class localization accuracy. Significant improvement on bird, boat, cat, dog, and potted plant.

  27. PASCAL VOC 2007 (examples). Ours versus the LSVM baseline, e.g., on an image tagged (Aeroplane, Aeroplane, Aeroplane, Aeroplane, Aeroplane, Building, Aeroplane, Smoke, Aeroplane) and one tagged (Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food, Lamp, Person, Bottle, Dog, Sofa, Painting, Table, Bottle).

  28. PASCAL VOC 2007 (examples). Further examples with tag lists such as (Dog, Floor, Hairclip, Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf, Dog, Person, Microphone, Light) and (Horse, Person, Tree, House, Building, Ground, Hurdle, Fence, Person).

  29. PASCAL VOC 2007 (failure cases). Example tag lists: (Bottle, Glass, Wine, Table), (Aeroplane, Sky, Building, Shadow), (Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall), and (Person, Person, Pole, Building, Sidewalk, Grass, Road).

  30. Some observations. The implicit features often predict scale better for indoor objects and position better for outdoor objects. Gist is usually better for the y position, while tags are generally stronger for scale, which agrees with previous experiments using Gist. In general, the method needs to have learned about target objects from a variety of examples with different contexts.

  31. Conclusion. We showed how to exploit the implicit information present in human tagging behavior to improve object localization in both speed and accuracy.

  32. Future work. Joint multi-object detection; from tags to natural language sentences; image retrieval; using WordNet to group words with similar meanings.

  33. Conclusion. We showed how to exploit the implicit information present in human tagging behavior to improve object localization in both speed and accuracy.
