Building Text Features for Object Image Classification Gang Wang, Derek Hoiem, David Forsyth
Main Idea • Text-based image features are built using an auxiliary dataset of internet images annotated with tags. • A visual classifier struggles when an object is viewed under novel circumstances; the text feature can help in such cases. So, basically: text classifier + image classifier → unified classifier.
Challenges • Determine which objects are present in an image based on the text that surrounds similar images drawn from large collections. • Sounds easy but: • Object appearance • Pose • Illumination
Low Level Features Can Rescue But….. • Color • Texture • SIFT features These could help if we had millions of training samples, but that is unrealistic. So what can help????? There are millions of images on the internet; they are not tagged, but the text associated with them helps classification.
Eureka!!!!!! • Easier to determine image content using surrounding text than with currently available image features. • Given a large enough dataset, we are bound to find very similar images to an input image. So they infer likely text for an input image based on similar images
The Common Approach Approach • Improve annotation quality or filter spurious search results that can be used for training. The Problem • Noise or ambiguity in annotations can easily nullify any benefit Proposal • Learn a distance metric that causes images with similar surrounding text to be similar in visual feature space.
Their Approach • Build text features for object image classification as they are expected to capture direct semantic meaning of an image.
Approach Explained • Dataset = training + test images. • Auxiliary dataset = internet images (Flickr) with associated text. • For each training image: • Extract visual features. • Find the K nearest neighbor images in the internet dataset. • Use the text associated with these internet images to build a text feature. • Train!! • Repeat for visual features and combine both.
Visual Features • SIFT: • Used for image matching and object recognition. • Used here to detect and describe local patches. • Extract 1000 local patches from each image. • Descriptors are quantized into 1000 clusters and each patch is assigned a cluster index. • Finally, each image is represented as a normalized histogram of cluster indices.
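The quantization step above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the cluster centers are already available (for example from k-means), uses squared Euclidean distance for the assignment, and the names `nearest_cluster` and `bow_histogram` are hypothetical. The toy data is 2-D with 3 clusters; real SIFT descriptors are 128-dimensional and the slides use 1000 clusters.

```python
from collections import Counter


def nearest_cluster(descriptor, centers):
    """Return the index of the center closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda k: sq_dist(descriptor, centers[k]))


def bow_histogram(descriptors, centers):
    """Normalized histogram of cluster indices for one image."""
    counts = Counter(nearest_cluster(d, centers) for d in descriptors)
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(centers)
    return [counts[k] / total for k in range(len(centers))]


# Toy example (assumed data): 3 cluster centers, 4 patch descriptors.
centers = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
hist = bow_histogram(descriptors, centers)
print(hist)
```

Each entry of `hist` is the fraction of patches assigned to that cluster, so the entries sum to 1.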
GIST: • Powerful in scene categorization and retrieval. • They represent each image as a 960-dimensional GIST descriptor. • Color: • Quantize each channel to 8 bins. • Each pixel value is represented as an integer between 1 and 512. • 512-dimensional histogram for each image.
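The color feature can be sketched like this: 8 bins per channel gives 8 × 8 × 8 = 512 possible codes per pixel. The sketch assumes 8-bit RGB input, a particular ordering of the channels inside the code, and a histogram normalized by the pixel count; none of these details are stated on the slide, and the function names are hypothetical.

```python
def color_index(r, g, b, bins=8):
    """Map an 8-bit RGB pixel to an integer code in 1..bins**3."""
    step = 256 // bins  # 32 intensity levels per bin when bins == 8
    rq, gq, bq = r // step, g // step, b // step
    return rq * bins * bins + gq * bins + bq + 1


def color_histogram(pixels, bins=8):
    """Histogram over pixel codes, normalized by the number of pixels (assumption)."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        hist[color_index(r, g, b, bins) - 1] += 1.0
    n = len(pixels)
    return [h / n for h in hist] if n else hist


# Toy "image" of four pixels (assumed data).
pixels = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (31, 0, 0)]
hist = color_histogram(pixels)
print(len(hist), hist[0], hist[511])
```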
Gradient • Can be considered a global and coarse SIFT feature. • Divide the image into 4×4 cells. • In each cell, quantize the gradient into 16 bins. • The whole image is represented as a 256-dimensional vector. • Unified • Concatenation of the 4 previously described features. • Let the above features be f1, f2, f3, f4. • Resultant feature: [w1f1, w2f2, w3f3, w4f4]
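The unified feature is a weighted concatenation. With the dimensionalities listed above it would have 1000 + 960 + 512 + 256 = 2728 entries. A minimal sketch, using short toy vectors and made-up weights (the function name `unified_feature` and all values are illustrative assumptions):

```python
def unified_feature(features, weights):
    """Concatenate feature vectors after scaling each by its weight."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    out = []
    for f, w in zip(features, weights):
        out.extend(w * v for v in f)
    return out


# Toy stand-ins for the SIFT, GIST, color and gradient features (assumed data).
f1, f2, f3, f4 = [0.5, 0.5], [1.0], [0.2, 0.8], [0.0, 1.0]
weights = [0.4, 0.3, 0.2, 0.1]  # hypothetical values; the slides learn these
u = unified_feature([f1, f2, f3, f4], weights)
print(len(u))
```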
How to find weights: • Learn weights from training images. • Aim to force images from the same category to be close and images from different categories to be far apart. • Randomly select N pairs of images from the training set. • For the ith pair, Si = 1 if the two images share at least one object class, otherwise Si = 0. • Calculate the chi-square distance for each feature fj for the ith pair. • Learn weights: can be solved directly using “fmincon” in MATLAB.
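The pair-sampling and labeling step can be sketched as below. It covers only the construction of the pairs and the labels Si, not the weight optimization itself. The representation of each image's annotation as a set of class names, the fixed seed, and the function names are assumptions made for illustration.

```python
import random


def share_class(labels_a, labels_b):
    """Si = 1 if the two images share at least one object class, else 0."""
    return 1 if labels_a & labels_b else 0


def sample_labeled_pairs(image_labels, n_pairs, seed=0):
    """Randomly pick n_pairs of distinct images and attach the label Si to each."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(len(image_labels)), 2)
        pairs.append((a, b, share_class(image_labels[a], image_labels[b])))
    return pairs


# Toy training set: one set of object classes per image (assumed data).
image_labels = [{"dog"}, {"dog", "person"}, {"car"}, {"person"}]
pairs = sample_labeled_pairs(image_labels, n_pairs=5)
print(pairs)
```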
Chi square??? • Chi-square distance (http://www.stat.lsu.edu/faculty/moser/exst7037/geometry.pdf): • Denominator is the normalization component for each point in X. • So for n dimensions:
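For reference, one form of the chi-square distance that is widely used to compare two n-dimensional histograms x and y is shown below. Whether this matches the formula on the original slide exactly is an assumption: the factor 1/2 is a convention, and the linked note may normalize by a different quantity (for example, column totals) in the denominator. Terms with x_k + y_k = 0 are taken to be zero.

```latex
\chi^2(x, y) \;=\; \frac{1}{2} \sum_{k=1}^{n} \frac{(x_k - y_k)^2}{x_k + y_k}
```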
Fmincon????? • Finds the minimum of a constrained nonlinear multivariable function. • x = fmincon(fun,x0,A,b) • x = fmincon(fun,x0,A,b,Aeq,beq) • x = fmincon(fun,x0,A,b,Aeq,beq,lb,ub) • ….. • http://www.mathworks.com/help/toolbox/optim/ug/fmincon.html
Auxiliary Dataset • Collected from Flickr. • 1 million images in total. • Of these, 700,000 images were collected for 58 object categories whose names come from the PASCAL and Caltech 256 datasets. • The rest are random images collected from a group called “10 million photos”.
Text Features • For each training/test image: • Find the K nearest neighbor images in the auxiliary dataset. • Extract the text associated with these images. • Build text features. • “Dogs! Dogs! Dogs!” is treated as a single item. • Use only frequent tags and group names (6000) in the auxiliary dataset. • The text feature is a normalized histogram of tag and group name counts.
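The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: neighbors are found by brute-force squared Euclidean distance (the slides do not fix the distance used for retrieval), each tag occurrence among the neighbors counts once, and the names `k_nearest` and `text_feature` plus all data are hypothetical. A four-word vocabulary stands in for the 6000 frequent tags and group names.

```python
from collections import Counter


def k_nearest(query, aux_features, k):
    """Indices of the k auxiliary images closest to the query (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    order = sorted(range(len(aux_features)),
                   key=lambda i: sq_dist(query, aux_features[i]))
    return order[:k]


def text_feature(neighbor_tags, vocabulary):
    """Normalized histogram of vocabulary items over the neighbors' tags."""
    vocab = set(vocabulary)
    counts = Counter(t for tags in neighbor_tags for t in tags if t in vocab)
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in vocabulary]


# Toy auxiliary dataset: visual features plus tags / group names (assumed data).
aux_features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.2)]
aux_tags = [["dog", "puppy", "myvacation"],
            ["dog", "Dogs! Dogs! Dogs!"],  # a group name kept as a single item
            ["car"],
            ["puppy"]]
vocabulary = ["dog", "puppy", "car", "Dogs! Dogs! Dogs!"]

neighbors = k_nearest((0.0, 0.0), aux_features, k=3)
feat = text_feature([aux_tags[i] for i in neighbors], vocabulary)
print(neighbors, feat)
```

Tags outside the vocabulary (here `myvacation`) are simply ignored, which mirrors the "frequent tags only" rule.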
Classifier • SVM classifier with a chi-squared kernel for text features. • Same used for visual features as well.
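A chi-squared kernel can be sketched as below. The slides do not give the exact kernel form, so the exponential form exp(-gamma · chi2(x, y)) and the parameter `gamma` are assumptions, as are the function names. The resulting Gram matrix could then be handed to any SVM implementation that accepts precomputed kernels.

```python
import math


def chi2_distance(x, y):
    """0.5 * sum((x_k - y_k)^2 / (x_k + y_k)), skipping bins where both are zero."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel (assumed form)."""
    return math.exp(-gamma * chi2_distance(x, y))


def gram_matrix(samples, gamma=1.0):
    """Kernel value for every pair of samples."""
    return [[chi2_kernel(a, b, gamma) for b in samples] for a in samples]


# Three toy normalized histograms (assumed data).
samples = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
K = gram_matrix(samples)
print(K[0][1], K[0][2])
```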
Fusion • Build a visual classifier. • Build a text classifier. • A third classifier is trained to combine the confidence values of the two above and give the final prediction. • The final classifier is a logistic regression trained on a validation set.
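The fusion step can be sketched as a two-input logistic regression fitted by plain gradient descent. Everything specific here is an assumption for illustration: the toy confidence values lie in [0, 1] (real SVM outputs may be unbounded decision values), and the names `train_fusion` and `fuse`, the learning rate, and the epoch count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion(conf_pairs, labels, lr=0.5, epochs=2000):
    """Fit weights for (visual, text) confidences with batch gradient descent."""
    w_v, w_t, b = 0.0, 0.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        g_v = g_t = g_b = 0.0
        for (v, t), y in zip(conf_pairs, labels):
            err = sigmoid(w_v * v + w_t * t + b) - y  # gradient of the log loss
            g_v += err * v
            g_t += err * t
            g_b += err
        w_v -= lr * g_v / n
        w_t -= lr * g_t / n
        b -= lr * g_b / n
    return w_v, w_t, b


def fuse(params, v, t):
    """Final confidence from the visual and text confidences."""
    w_v, w_t, b = params
    return sigmoid(w_v * v + w_t * t + b)


# Toy validation set: (visual confidence, text confidence) and labels (assumed data).
val_conf = [(0.9, 0.8), (0.7, 0.9), (0.6, 0.4), (0.2, 0.3), (0.1, 0.2), (0.4, 0.1)]
val_labels = [1, 1, 1, 0, 0, 0]
params = train_fusion(val_conf, val_labels)
print(fuse(params, 0.9, 0.9), fuse(params, 0.1, 0.1))
```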
Results • PASCAL VOC 2006: 10 object categories • PASCAL VOC 2007: 20 object categories • Performance is measured quantitatively using AUC (area under the ROC curve) for the 2006 dataset and AP (average precision) for the 2007 dataset. • 150 nearest neighbor images are used in all experiments.
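For readers unfamiliar with AUC, a minimal sketch is given below. It uses the pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. This is a generic illustration with made-up scores, not the evaluation code behind the reported results, and `roc_auc` is a hypothetical name.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy classifier confidences and ground-truth labels (assumed data).
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
print(auc)
```

Here five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6.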
Performance Metrics • Performance of text features built with different visual features. • Effects of combining text and visual classifiers. • Effects of varying number of training images • Performance of the text features built with varying number of internet images • Effects of category names
For 2006 Dataset: Text classifier outperforms GIST KNN for each feature. Unified is best amongst all. Combination(V) etc. are obtained by training a logistic regression classifier on the validation dataset using the confidence values returned by the individual classifiers.