“Semantic” Image Annotation and Retrieval

R. Manmatha, Center for Intelligent Information Retrieval, University of Massachusetts Amherst

Motivation
• How can we retrieve images?
• Image retrieval based on similarity search of image features between a visual query and a database image.
• Not very effective.
• Does not lend itself to textual queries.
• Manually annotate images with text, then use a text search engine.
• Used by many libraries.
• Expensive and tedious.
• Aside: Google image search is based on textual cues and surrounding text.
• Example: gandhi.jpg may be a picture of Gandhi.

Automatic Image Annotation
• Can we automatically annotate unseen images with keywords?
• Given a training set of blobs and image annotations, learn a model and annotate a test set of images.
• Example: the annotation for this picture would be: tiger, grass.
• Question: Do we need to recognize objects?

Object Recognition
• A classic unsolved problem in computer vision.
• Humans can do it easily.
• How? We don’t know.
• Two main questions:
• What is the object in a picture?
• Where is it in the picture?
• Annotation: it is enough to answer the first question. Answering the second question requires labeling image segments.
• In some cases, as with faces, finding a face in an image is called face detection, while finding the identity of the specific individual is called face recognition.

Statistical Data-driven Approaches
• Statistical data-driven approaches have been successful in many areas:
• Information Retrieval, Machine Translation, Optical Character Recognition, Information Extraction, …
• Early work on vision focused on single images, pairs of images, or short video sequences.
• Computational cost.
• Recent successful work in object detection and recognition uses large amounts of training and test data.

Object Detection/Recognition
• Focused on a few specific objects such as faces and cars.
• Learn the joint probability for different regions forming the object.
• Train on examples of the specific object.
• Solve a two-class classification problem.
• Features: wavelets, Gabor filters, …
• Examples:
• Schneiderman and Kanade’s face detector.
• Requires training images of cut-out faces.
• Works fairly well but still makes mistakes.
• (Pictures from their webpage.)
• Fergus, Perona and Zisserman, CVPR 2003.
• Other objects such as motorcycles, cars, …
• Training images of object + background.
• Learns a single class at a time.
• (Picture from their webpage.)

Image Annotation
• Describe an image using words or image features.
• Vocabularies in two different languages to describe the same thing.
• “Visterms” and words.
• Visterms, e.g.: segment the image, compute features over each region, and cluster the features into a discrete vocabulary.
• Alternatively, partition the image into regions and compute a set of continuous features.
• “Translate” and retrieve.
• “Cross-lingual” retrieval.
• Use a training set of images and captions and learn annotations.
• Characteristics of most of these approaches:
• Associate words (“semantics”) with pictures.
• Context is important in images. Take advantage of the association of different regions in the image.
• A tiger is associated with grass and not with a computer.
• Note: For text, performance on cross-lingual retrieval is equal to or exceeds mono-lingual retrieval.

Image Vocabulary
• Images are segmented into semantic regions (Blobworld, normalized-cuts algorithm).
• Segments are clustered into a fixed number of blobs (visterms). Each blob is a word in the image vocabulary.
• Any image can be represented by a small set of blobs (Duygulu et al., ECCV 2002).
[Figure: images, their segments, and the resulting blobs]
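The clustering step can be sketched as follows. This is a minimal illustration that assumes plain k-means over region feature vectors; the function names (`kmeans`, `image_to_blobs`) and the toy two-dimensional features are hypothetical, and real systems would use features computed from segmented regions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], vec))

def kmeans(features, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids (the 'blob' vocabulary)."""
    rng = random.Random(seed)
    centroids = rng.sample(features, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for vec in features:
            clusters[nearest(centroids, vec)].append(vec)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members)
                                  for col in zip(*members)]
    return centroids

def image_to_blobs(region_features, centroids):
    """Represent an image by the sorted set of blob ids of its regions."""
    return sorted({nearest(centroids, vec) for vec in region_features})

# Toy data (hypothetical): four regions forming two tight groups.
region_features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
centroids = kmeans(region_features, k=2)
print(image_to_blobs(region_features[:2], centroids))
```

Each image then becomes a short list of discrete blob ids, which is what the discrete models below operate on.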

Models
• Co-occurrence Model (Mori et al.): Compute the co-occurrence of visterms and words.
• Mean precision of 0.07.
• Translation Model (Duygulu, Barnard, de Freitas and Forsyth): Treat annotation as a problem of translating from the vocabulary of blobs (discrete visterms) to that of words. (Also tries labeling regions.)
• Mean precision of 0.14.
• Cross Media Relevance Model (Jeon, Lavrenko, Manmatha): Use a relevance-based language model. Discrete model.
• Mean precision of 0.33.
• Correspondence Latent Dirichlet Allocation (Blei and Jordan): The model generates words and regions based on a latent factor. Direct comparison on the same dataset is not available. (Also tries labeling regions.)
• Our guess is that it is comparable to or slightly worse than the relevance model.
• Continuous Relevance Model (Lavrenko, Manmatha, Jeon): Relevance model with continuous features.
• Mean precision of 0.6.
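As a rough illustration of the simplest entry in this list, the sketch below estimates P(word | blob) from co-occurrence counts. It is a simplified reading of the co-occurrence idea, not the exact procedure of Mori et al.; the function name and the toy training pairs are hypothetical.

```python
from collections import Counter, defaultdict

def cooccurrence_model(training_set):
    """Estimate P(word | blob) from how often they appear in the same image.

    training_set: list of (blobs, words) pairs, one pair per training image.
    """
    counts = defaultdict(Counter)
    for blobs, words in training_set:
        for blob in set(blobs):
            for word in set(words):
                counts[blob][word] += 1
    model = {}
    for blob, word_counts in counts.items():
        total = sum(word_counts.values())
        model[blob] = {word: n / total for word, n in word_counts.items()}
    return model

# Toy data (hypothetical): blob ids paired with caption words.
training_set = [
    ([0, 1], ["tiger", "grass"]),
    ([1, 2], ["grass", "sky"]),
]
model = cooccurrence_model(training_set)
print(model[1])
```

Because every word in a caption is credited to every blob in the image, frequent words tend to dominate, which is one plausible reason for the low precision reported for this model.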

Cross Media Relevance Models
• Goal: estimate the relevance model, i.e. the joint distribution of words and visterms.
• Find the probability P(w, b_1, …, b_m) of observing word w and visterms b_i together.
• To annotate an image with the visterms grass, tiger, water, road, compute P(w | b_grass, b_tiger, b_water, b_road).
• If the top three probabilities are for the words grass, water, tiger,
• then annotate the image with grass, water, tiger.
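The conditional probability used for annotation follows from the joint distribution by normalizing over the vocabulary. This is just the definition of conditional probability written in the slide's notation; the symbol V for the word vocabulary is an assumption introduced here.

```latex
P(w \mid b_1, \dots, b_m)
  = \frac{P(w, b_1, \dots, b_m)}{\sum_{w' \in V} P(w', b_1, \dots, b_m)}
```

Since the denominator is the same for every word, ranking words by the joint probability gives the same top words as ranking by the conditional.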

Relevance Models
• Annotation: [equation]
• Or: [equation]
• J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, to appear in SIGIR’03.

Training
• The joint distribution is computed as an expectation over the images J in the training set.
• Given J, the events (observing the word and observing each visterm) are independent.
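A minimal sketch of this computation, assuming a uniform prior P(J) over training images and simple interpolation smoothing against collection-wide frequencies. The function names, the smoothing weights `alpha` and `beta`, and the toy data are illustrative assumptions; words and blobs are normalized separately here for simplicity, which may differ from the cited paper.

```python
from collections import Counter

def smoothed(count_in_image, image_size, count_in_collection,
             collection_size, weight):
    """Interpolate a per-image relative frequency with the collection-wide one."""
    return ((1 - weight) * count_in_image / image_size
            + weight * count_in_collection / collection_size)

def joint_probability(word, query_blobs, training_set, alpha=0.1, beta=0.9):
    """P(w, b_1..b_m) = sum over J of P(J) * P(w|J) * prod_i P(b_i|J)."""
    all_words = Counter(w for _, words in training_set for w in words)
    all_blobs = Counter(b for blobs, _ in training_set for b in blobs)
    total_words = sum(all_words.values())
    total_blobs = sum(all_blobs.values())
    total = 0.0
    for blobs, words in training_set:
        # Independence given J: multiply the word term by each blob term.
        p = smoothed(words.count(word), len(words),
                     all_words[word], total_words, alpha)
        for b in query_blobs:
            p *= smoothed(blobs.count(b), len(blobs),
                          all_blobs[b], total_blobs, beta)
        total += p / len(training_set)  # uniform P(J), an assumption
    return total

# Toy data (hypothetical): (blobs, words) per training image.
training_set = [
    (["b_tiger", "b_grass"], ["tiger", "grass"]),
    (["b_sky", "b_grass"], ["sky", "grass"]),
]
test_blobs = ["b_tiger", "b_grass"]
p_tiger = joint_probability("tiger", test_blobs, training_set)
p_sky = joint_probability("sky", test_blobs, training_set)
print(p_tiger, p_sky)
```

On this toy data the word "tiger" scores higher than "sky" for a test image containing the tiger and grass blobs, as one would hope.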

Annotation
• Compute P(w|I) for different w.
• Probabilistic annotation:
• Annotate the image with every possible w in the vocabulary, with associated probabilities.
• Useful for retrieval but not for people.
• Fixed-length annotation:
• For people, take the top (say 3 or 4) words for every image and annotate images with them.
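The two annotation modes can be sketched as below, assuming per-word scores for an image are already available. The function names and the scores in `joint_scores` are made up for illustration.

```python
def probabilistic_annotation(joint_scores):
    """Normalize per-word scores into P(w | I) over the whole vocabulary."""
    total = sum(joint_scores.values())
    return {word: score / total for word, score in joint_scores.items()}

def fixed_length_annotation(word_probs, length=3):
    """Keep only the `length` most probable words, in descending order."""
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return ranked[:length]

# Hypothetical joint scores for one image.
joint_scores = {"tiger": 0.03, "grass": 0.06, "sky": 0.01, "water": 0.02}
word_probs = probabilistic_annotation(joint_scores)
top_words = fixed_length_annotation(word_probs, length=3)
print(top_words)
```

The full distribution is what a retrieval system would index, while the short list is what a person would read as the caption.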

Retrieval
• Language modeling approach:
• Given a query Q, the probability of drawing Q from image I is: [equation]
• Or use the probabilistic annotation.
• Rank images according to this probability.
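A minimal sketch of ranking with the probabilistic annotation, assuming the query likelihood is the product of P(w | I) over the query words (the usual unigram language-modeling assumption). The function names, the `floor` value for unseen words, and the toy annotations are hypothetical.

```python
import math

def query_likelihood(query_words, word_probs, floor=1e-9):
    """P(Q | I) as a product of per-word probabilities P(w | I)."""
    return math.prod(word_probs.get(w, floor) for w in query_words)

def rank_images(query_words, annotations):
    """Return image ids sorted by decreasing query likelihood."""
    scores = {image_id: query_likelihood(query_words, probs)
              for image_id, probs in annotations.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical probabilistic annotations P(w | I) for two images.
annotations = {
    "img_a": {"tiger": 0.5, "grass": 0.4, "sky": 0.1},
    "img_b": {"tiger": 0.05, "grass": 0.35, "sky": 0.6},
}
ranking = rank_images(["tiger", "grass"], annotations)
print(ranking)
```

For long queries, summing log probabilities instead of multiplying avoids numerical underflow.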

Samples (good)
• Annotation examples: CMRM.
• Retrieval examples: top 4 images, CMRM. Queries: tiger, pillar.
