Using large scale web data to facilitate textual query based retrieval of consumer photos

Using Large-Scale Web Data to Facilitate Textual QueryBased Retrieval of Consumer Photos

Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo

Nanyang Technological University & Kodak Research Lab


  • Digital cameras and mobile phone cameras are spreading rapidly:

    • More and more personal photos;

    • Retrieving images from enormous personal photo collections has become an important topic.


How to retrieve?



Previous Work

  • Content-Based Image Retrieval (CBIR)

    • Users provide images as queries to retrieve personal photos.

  • The paramount challenge -- semantic gap:

    • The gap between the low-level visual features and the high-level semantic concepts.



[Diagram: an image carrying a high-level concept is represented only by low-level feature vectors, which must be matched against the feature vectors in the DB.]







A More Natural Way For Consumer Applications

  • Image annotation is used to classify images w.r.t. high-level semantic concepts.

    • Semantic concepts are analogous to the textual terms describing document contents.

  • An intermediate stage for textual query based image retrieval.

  • Lets the user retrieve the desired personal photos using textual queries.


[Diagram: consumer photos annotated with high-level concepts, e.g. "people, family", "people, wedding".]

Our Goal

  • Leverage information from web images to retrieve consumer photos in personal photo collections.

  • A real-time textual query based consumer photo retrieval system without any intermediate annotation stage.

  • Web images are accompanied by tags, categories and titles.

No intermediate image annotation process.

[Diagram: a large collection of web images (with descriptive words) feeds automatic web image retrieval; the resulting classifier then performs photo retrieval on raw consumer photos.]

System Framework

  • When the user provides a textual query, it is first used to find relevant/irrelevant images in the web image collection.

  • Then, a classifier is trained based on these web images.

  • Consumer photos are then ranked based on the classifier’s decision values.

  • The user can also give relevance feedback to refine the retrieval results.

[Diagram: semantic word trees, built from WordNet, organize the descriptive words of the web image collection.]

Automatic Web Image Retrieval


  • For the user’s textual query, first search for it in the semantic word trees.

  • Web images whose descriptions contain the query word are considered “relevant web images”.

  • Web images whose descriptions contain neither the query word nor any of its two-level descendants are considered “irrelevant web images”.
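This selection step could be sketched as follows; the tiny hand-built word tree and the helper names (`descendants`, `split_web_images`) are illustrative stand-ins for the WordNet-based semantic word trees, not the paper's implementation:

```python
def descendants(tree, word, depth=2):
    """The word plus its descendants down to `depth` levels of the tree."""
    out = {word}
    if depth > 0:
        for child in tree.get(word, []):
            out |= descendants(tree, child, depth - 1)
    return out

def split_web_images(images, tree, query):
    """images: list of (image_id, set of description words).
    Relevant: description contains the query word.
    Irrelevant: description contains neither the query word
    nor any of its two-level descendants."""
    block = descendants(tree, query, depth=2)
    relevant = [i for i, words in images if query in words]
    irrelevant = [i for i, words in images if not (words & block)]
    return relevant, irrelevant
```

Images that mention a descendant of the query (e.g. "dog" under "animal") are excluded from the irrelevant set, since they may still depict the queried concept.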

Decision Stump Ensemble

  • Train a decision stump on each dimension.

  • Combine them with their training error rates.
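A minimal sketch of such an ensemble. The slides say only that stumps are combined using their training error rates, so weighting each stump by one minus its training error is an assumption here:

```python
import numpy as np

def train_stump(x, y):
    """Best threshold/polarity on one feature dimension.
    x: (n,) feature values; y: (n,) labels in {+1, -1}."""
    best = (1.0, 0.0, 1)  # (error, threshold, polarity)
    for t in np.unique(x):
        for s in (1, -1):
            pred = np.where(s * (x - t) >= 0, 1, -1)
            err = float(np.mean(pred != y))
            if err < best[0]:
                best = (err, float(t), s)
    return best

def stump_ensemble(X, y):
    """Train one stump per dimension; weight each by (1 - training error)."""
    stumps = []
    for d in range(X.shape[1]):
        err, t, s = train_stump(X[:, d], y)
        stumps.append((d, t, s, 1.0 - err))
    return stumps

def decision_value(stumps, x):
    """Weighted vote of all stumps on a single sample x."""
    return sum(w * (1 if s * (x[d] - t) >= 0 else -1) for d, t, s, w in stumps)
```

Each dimension is handled independently, which is what makes training and testing cheap and trivially parallelizable.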

Why Decision Stump Ensemble?

  • Main reason: low time cost

    • Our goal: a (quasi) real-time retrieval system.

    • For basic classifiers: SVMs are much slower;

    • For combination: boosting is also much slower.

  • The advantage of decision stump ensemble:

    • Low training cost;

    • Low testing cost;

    • Very easy to parallelize;

Asymmetric Bagging

  • Imbalance: count(irrelevant) >> count(relevant)

    • This causes side effects, e.g. overfitting.

  • Solution: asymmetric bagging

    • Repeat the training 100 times, each time with a different random sample of the irrelevant web images.
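The sampling scheme could be sketched as follows; pairing all relevant images with an equally sized random sample of irrelevant ones in each bag is an assumption about the balance ratio:

```python
import random

def asymmetric_bagging(relevant, irrelevant, n_rounds=100, seed=0):
    """Build n_rounds balanced training sets: each keeps all relevant
    images and draws a fresh random sample of irrelevant images."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_rounds):
        neg = rng.sample(irrelevant, k=min(len(relevant), len(irrelevant)))
        sets.append((list(relevant), neg))
    return sets
```

One classifier is trained per bag, and their outputs are averaged, so no single sample of irrelevant images dominates.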


Relevance Feedback

  • The user labels nl relevant or irrelevant consumer photos.

    • Use this information to further refine the retrieval results;

  • Challenge 1: Usually nl is small;

  • Challenge 2: Cross-domain learning

    • Source classifier is trained on the web image domain.

    • The user labels some personal photos.

Method 1: Cross-Domain Combination of Classifiers

  • Re-train classifiers with data from both domains?

    • Neither effective nor efficient;

  • A simple but effective method:

    • Train an SVM on the consumer photo domain with user-labeled photos;

    • Convert the responses of the source classifier and the SVM classifier to probabilities, and add them up;

    • Rank consumer photos based on this sum value.

  • Referred to as DS_S+SVM_T.
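A sketch of this fusion; the plain sigmoid is a stand-in for whatever probability calibration (e.g. Platt-style scaling) the system actually uses, and `fused_score` is a hypothetical helper name:

```python
import math

def sigmoid(v):
    """Map a raw decision value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def fused_score(ds_decision, svm_decision):
    """DS_S+SVM_T: convert both decision values to probabilities and
    add them; consumer photos are then ranked by this sum."""
    return sigmoid(ds_decision) + sigmoid(svm_decision)
```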

Method 2: Cross-Domain Regularized Regression (CDRR)

  • Construct a linear regression function fT(x):

    • For labeled photos: fT(xi) ≈ yi;

    • For unlabeled photos: fT(xi) ≈ fs(xi);



[Diagram: the CDRR objective — for user-labeled images x1,…,xl, fT(x) should equal the user’s label y(x); for other images, fT(x) should equal fs(x); a regularizer controls the complexity of the target classifier fT(x).]

  • Design a target linear classifier fT(x) = wᵀx.

  • This problem can be solved with a least-squares solver.
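Under the stated objective, a minimal least-squares sketch might look like this; the trade-off weights `lam` and `gamma` are illustrative, not the paper's values:

```python
import numpy as np

def cdrr(X_l, y_l, X_u, fs_u, lam=1.0, gamma=0.1):
    """Fit w for f_T(x) = w.x so that f_T matches the user labels y on
    labeled photos, matches the source classifier outputs f_s on
    unlabeled photos, and is kept simple by a ridge-style penalty."""
    d = X_l.shape[1]
    # Normal equations of the regularized least-squares objective.
    A = X_l.T @ X_l + lam * (X_u.T @ X_u) + gamma * np.eye(d)
    b = X_l.T @ y_l + lam * (X_u.T @ fs_u)
    return np.linalg.solve(A, b)
```

Because the problem reduces to one linear solve in the feature dimension, it is fast enough for interactive relevance feedback.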

Hybrid Method

  • A combination of two methods.

  • For labeled consumer photos:

    • Measure the average distance davg to their 30 nearest unlabeled neighbors in feature space;

    • If davg < ε: Use DS_S+SVM_T;

    • Otherwise: Use CDRR.

  • Reason:

    • Consumer photos that are visually similar to the user-labeled images should be influenced more by those labeled images.
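The selection rule above can be sketched as follows; `eps` and the helper name `choose_method` are illustrative:

```python
import numpy as np

def choose_method(labeled, unlabeled, eps, k=30):
    """Hybrid rule: if the labeled photos' average distance to their k
    nearest unlabeled neighbors is below eps, use DS_S+SVM_T;
    otherwise fall back to CDRR."""
    k = min(k, unlabeled.shape[0])
    d = np.linalg.norm(labeled[:, None, :] - unlabeled[None, :, :], axis=2)
    d_avg = np.sort(d, axis=1)[:, :k].mean()
    return "DS_S+SVM_T" if d_avg < eps else "CDRR"
```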

Dataset and Experimental Setup

  • Web Image Database:

    • 1.3 million photos from photoSIG.

    • Relatively professional photos.

  • Text descriptions for web images:

    • Titles, portfolios, and categories accompanying the web images;

    • Remove the common high-frequency words;

    • Remove the rarely-used words;

    • Finally, 21,377 words remain in our vocabulary.

Dataset and Experimental Setup

  • Testing Dataset #1: Kodak dataset

    • Collected by Eastman Kodak Company:

      • From about 100 real users.

      • Over a period of one year.

    • 1358 images:

      • The first keyframe from each video.

    • 21 concepts:

      • We merge “group_of_two” and “group_of_three_or_more” into one concept.

Dataset and Experimental Setup

  • Testing Dataset #2: Corel dataset

    • 4999 images

      • 192x128 or 128x192.

    • 43 concepts:

      • We remove all concepts in which there are fewer than 100 images.

Visual Features

  • Grid-based color moments (225D)

    • Three moments of three color channels from each block of a 5×5 grid.

  • Edge direction histogram (73D)

    • 72 edge direction bins plus one non-edge bin.

  • Wavelet texture (128D)

  • Concatenate all three kinds of features:

    • Normalize each dimension to avg = 0, stddev = 1

    • Use first 103 principal components.
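The feature pipeline above could be sketched in plain NumPy; SVD-based PCA is used here as a stand-in for whatever PCA implementation the authors used:

```python
import numpy as np

def normalize_and_project(X, n_components=103):
    """Standardize each dimension to mean 0 / stddev 1, then project
    onto the top principal components (PCA via SVD of the
    standardized data matrix)."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T
```

In the paper's setting, X would hold the concatenated 225 + 73 + 128 = 426 raw dimensions per image before reduction to 103 components.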

Retrieval without Relevance Feedback

  • For all concepts:

    • Average number of relevant images: 3703.5.

Retrieval without Relevance Feedback

  • kNN: rank consumer photos by their average distance to the 300 nearest neighbors among the relevant web images.

  • DS_S: decision stump ensemble.
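The kNN baseline could be sketched as below; this is a brute-force NumPy version, whereas a large-scale system would need an index structure:

```python
import numpy as np

def knn_rank(photos, relevant_web, k=300):
    """Return photo indices sorted by average Euclidean distance to
    their k nearest neighbors among the relevant web images
    (closest, i.e. most relevant, first)."""
    d = np.linalg.norm(photos[:, None, :] - relevant_web[None, :, :], axis=2)
    k = min(k, relevant_web.shape[0])
    avg = np.sort(d, axis=1)[:, :k].mean(axis=1)
    return np.argsort(avg)
```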

Retrieval without Relevance Feedback

  • Time cost:

    • We use OpenMP to parallelize our method;

    • With 8 threads, both methods achieve interactive speed;

    • But kNN is expected to become costly on large-scale datasets.

Retrieval with Relevance Feedback

  • In each round, the user labels at most 1 positive and 1 negative image from the top 40;

  • Methods for comparison:

    • kNN_RF: add user-labeled photos into relevant image set, and re-apply kNN;

    • SVM_T: train SVM based on the user-labeled images in the target domain;

    • A-SVM: Adaptive SVM;

    • MR: Manifold Ranking based relevance feedback method;

Retrieval with Relevance Feedback

  • Setting of y(x) for CDRR:

    • Positive: +1.0;

    • Negative: -0.1;

  • Reason:

    • The top-ranked negative images are not extremely negative;

    • Positive: “what is”; Negative: “what is not”.





Retrieval with Relevance Feedback

  • Time cost:

    • All methods except A-SVM can achieve real-time speed.


Conclusion

  • Our goal: (quasi) real-time textual query based consumer photo retrieval.

  • Our method:

    • Use web images and their surrounding text descriptions as an auxiliary database;

    • Asymmetric bagging with decision stumps;

    • Several simple but effective cross-domain learning methods to help relevance feedback.

Future Work

  • How to efficiently use more powerful source classifiers?

  • How to further improve the speed:

    • Keep training time within 1 second;

    • Keep testing time acceptable when the consumer photo set is very large.

Thank you!

  • Any questions?