“Semantic” Image Annotation and Retrieval R. Manmatha Center for Intelligent Information Retrieval University of Massachusetts Amherst
Motivation • How can we retrieve images? • Image retrieval based on similarity search of image features between a visual query and a database image. • Not very effective. • Doesn’t lend itself to textual queries. • Alternative: manually annotate images using text and then use a text search engine. • Used by many libraries. • Expensive, tedious. • Aside: Google image search is based on textual cues such as filenames and surrounding text. • Example: gandhi.jpg may be a picture of Gandhi.
Automatic Image Annotation • Can we automatically annotate unseen images with keywords? • Given a training set of blobs and image annotations, learn a model and annotate a test set of images. • Example: The annotation for this picture would be • Tiger, grass. • Question: Do we need to recognize objects?
Object Recognition • A classic unsolved problem in computer vision. • Humans can do it easily. • How? – we don’t know. • Two main questions. • What is the object in a picture? • Where is it in the picture? • Annotation: enough to answer the first question. Answering the second question requires labeling image segments. • In some cases, as in face recognition, finding a face in an image is considered face detection, and finding the identity of the specific individual is considered face recognition.
Statistical Data-driven Approaches • Statistical data-driven approaches have been successful in many areas • Information Retrieval, Machine Translation, Optical Character Recognition, Information Extraction, … • Early work in vision focused on single images, pairs of images, or short video sequences, partly because of computational cost. • Recent successful work in object detection and recognition uses large amounts of data (training examples) for training and testing.
Object Detection/Recognition • Focused on a few specific objects like faces, cars. • Learn the joint probability for different regions forming the object. • Train on examples of the specific object. • Solve a two-class classification problem. • Features: wavelets, Gabor filters, … • Examples: • Schneiderman and Kanade’s face detector • Requires training images of cut-out faces. • Works fairly well but still makes mistakes. • (pictures from their webpage) • Fergus, Perona and Zisserman, CVPR 2003. • Other objects like motorcycles, cars, … • Training images of object + background. • Learns a single class at a time. • (picture from their webpage).
Image Annotation • Describe an image using words or image features. • Vocabularies in two different languages describe the same thing: “visterms” and words. • Visterms: segment the image, compute features over each region, cluster the features into a discrete vocabulary. • Alternatively, partition the image into regions and compute a set of continuous features. • “Translate” and retrieve. • “Cross-lingual” retrieval. • Use a training set of images and captions and learn annotations. • Characteristics of most of these approaches: • Associate words (“semantics”) with pictures. • Context is important in images. Take advantage of the association of different regions in the image • A tiger is associated with grass, not a computer. • Note: for text, performance on cross-lingual retrieval equals or exceeds mono-lingual retrieval.
Image Vocabulary • Images are segmented into semantic regions (Blobworld, Normalized-cuts algorithm). • Segments are clustered into a fixed number of blobs (visterms). Each blob is a word in the image vocabulary. • Any image can be represented by a small set of blobs (Duygulu et al., ECCV 2002). • (Figure: example images, their segments, and the corresponding blobs.)
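The segment-and-cluster step above can be sketched with a plain k-means quantizer. This is a minimal illustration over hypothetical region feature vectors; the actual systems use Blobworld or normalized-cuts segments with color/texture/shape features and a much larger blob vocabulary (the function names and parameters here are illustrative, not from the paper):

```python
import numpy as np

def build_visterm_vocab(region_features, n_blobs=8, n_iters=20, seed=0):
    """Cluster region feature vectors into a discrete visterm vocabulary
    with plain k-means; each cluster index is one 'blob' (visterm)."""
    rng = np.random.default_rng(seed)
    feats = np.asarray(region_features, dtype=float)
    # Initialize centroids from randomly chosen regions.
    centroids = feats[rng.choice(len(feats), n_blobs, replace=False)]
    for _ in range(n_iters):
        # Assign each region to its nearest centroid.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids (keep the old centroid if a cluster empties).
        for k in range(n_blobs):
            if np.any(assign == k):
                centroids[k] = feats[assign == k].mean(axis=0)
    return centroids, assign

def image_to_blobs(image_region_feats, centroids):
    """Represent one image as the blob ids of its segments."""
    feats = np.asarray(image_region_feats, dtype=float)
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

After this step, every image is just a small bag of blob ids, which is what makes the discrete “translation”-style models applicable.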
Models • Co-occurrence Model (Mori et al.): Compute the co-occurrence of visterms and words. • Mean precision of 0.07 • Translation Model (Duygulu, Barnard, de Freitas and Forsyth): Treat it as a problem of translating from the vocabulary of blobs (discrete visterms) to that of words. (Also try labeling regions.) • Mean precision of 0.14 • Cross Media Relevance Model (Jeon, Lavrenko, Manmatha): Use a relevance model (based on language models). Discrete model. • Mean precision of 0.33 • Correlation Latent Dirichlet Allocation (Blei and Jordan): Model generates words and regions based on a latent factor. Direct comparison on the same dataset not available. (Also try labeling regions.) • The guess is that it’s comparable to or slightly worse than the relevance model. • Continuous Relevance Model (Lavrenko, Manmatha, Jeon): Relevance model with continuous features. • Mean precision of 0.6
Cross Media Relevance Models • Goal: Estimate a relevance model – the joint distribution of words and visterms. • Find the probability of observing word w and visterms b_1, …, b_m together: P(w, b_1, …, b_m). • To annotate an image with visterms for grass, tiger, water, road, compute P(w | b_grass, b_tiger, b_water, b_road). • If the top three probabilities are for the words grass, water, tiger, then annotate the image with grass, water, tiger. • (Figure: image with regions labeled tiger, water, grass.)
Relevance Models • Annotation: P(w | b_1, …, b_m) = P(w, b_1, …, b_m) / P(b_1, …, b_m) • Or rank words directly by the joint probability P(w, b_1, …, b_m). • J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models, To appear in SIGIR’03.
Training • The joint distribution is computed as an expectation over images J in the training set T: P(w, b_1, …, b_m) = Σ_{J ∈ T} P(J) P(w, b_1, …, b_m | J) • Given J, the events are independent: P(w, b_1, …, b_m | J) = P(w | J) Π_{i=1}^{m} P(b_i | J)
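This estimate can be sketched directly. The sketch below assumes each training image J is a (word list, blob list) pair with a uniform prior P(J), and, for simplicity, smooths P(w|J) and P(b|J) against a uniform background; the actual CMRM interpolates with collection frequencies, so the numbers here are illustrative only:

```python
def cmrm_joint(word, blobs, training_set, vocab_w, vocab_b,
               alpha=0.1, beta=0.1):
    """Estimate P(w, b_1..b_m) = sum_J P(J) P(w|J) prod_i P(b_i|J).

    training_set: list of (words_J, blobs_J) pairs, one per training image.
    alpha, beta: smoothing weights for the word and blob estimates.
    """
    total = 0.0
    p_J = 1.0 / len(training_set)  # uniform prior over training images
    for words_J, blobs_J in training_set:
        # Smoothed P(w|J): relative frequency in J mixed with a
        # uniform background (the paper uses collection frequencies).
        p_w = ((1 - alpha) * words_J.count(word) / max(len(words_J), 1)
               + alpha / len(vocab_w))
        # Conditional independence of the blobs given J.
        p_b = 1.0
        for b in blobs:
            p_b *= ((1 - beta) * blobs_J.count(b) / max(len(blobs_J), 1)
                    + beta / len(vocab_b))
        total += p_J * p_w * p_b
    return total
```

Note how images in which both the word and the query blobs occur dominate the sum, which is exactly the expectation over the training set described above.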
Annotation • Compute P(w|I) for different w. • Probabilistic Annotation: • Annotate the image with every possible w in the vocabulary with associated probabilities. • Useful for retrieval but not for people. • Fixed Length Annotation: • For people, take the top (say 3 or 4) words for every image and annotate images with them.
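Both annotation modes above reduce to one small step once the joint probabilities are in hand. A minimal sketch, assuming a dict of joint probabilities P(w, b_1..b_m) for one test image has already been computed (the function name and top_k parameter are illustrative):

```python
def annotate(joint_probs, top_k=4):
    """joint_probs: {word: P(w, b_1..b_m)} for a single test image.

    Returns (probabilistic annotation over the whole vocabulary,
             fixed-length annotation of the top_k words).
    """
    z = sum(joint_probs.values())
    # Probabilistic annotation: every vocabulary word gets P(w|I).
    p_w_given_I = {w: p / z for w, p in joint_probs.items()}
    # Fixed-length annotation: keep only the top_k words, for people.
    top = sorted(p_w_given_I, key=p_w_given_I.get, reverse=True)[:top_k]
    return p_w_given_I, top
```

The full distribution feeds retrieval; the truncated list is what a human reader would see as the caption.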
Retrieval • Language Modeling Approach: Given a query Q = {w_1, …, w_k}, the probability of drawing Q from image I is P(Q | I) = Π_{w ∈ Q} P(w | I) • Or use the probabilistic annotation P(w | I) computed for every vocabulary word. • Rank images according to this probability.
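The ranking step can be sketched as follows, assuming the probabilistic annotations {word: P(w|I)} have been precomputed for every database image; the small floor for unseen words stands in for proper smoothing (names and the eps value are illustrative):

```python
def rank_images(query_words, annotations, eps=1e-9):
    """annotations: {image_id: {word: P(w|I)}} from probabilistic annotation.

    Score each image by P(Q|I) = prod over w in Q of P(w|I),
    then return image ids ranked by descending score.
    """
    scores = {}
    for img, p_w in annotations.items():
        s = 1.0
        for w in query_words:
            s *= p_w.get(w, eps)  # small floor for words missing from the annotation
        scores[img] = s
    return sorted(scores, key=scores.get, reverse=True)
```

Because every image carries a probability for every word, a purely textual query can be answered with no image features at query time.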
Samples - good • Annotation examples - CMRM • Retrieval examples - top 4 images returned by CMRM for Query: Tiger and Query: Pillar.