Associating video frames with text

Associating Video Frames with Text

Pinar Duygulu and Howard D. Wactlar

Informedia Project

Carnegie Mellon University

ACM SIGIR 2003


Abstract

  • Integration of visual and textual data

    • in order to annotate video frames with more reliable labels and descriptions

  • Correspondence problem between video frames and associated text

    • solved using joint statistics of words and image regions

  • Better annotations can improve the performance of text-based queries


Introduction (1/2)

  • Video retrieval

    • visual vs. textual features

    • A system that combines the two types of features is more powerful

  • images/videos with some descriptive text

    • Corel data set, some museum collections, and news photographs with captions on the web

  • Correspondence problem

    • Some methods are proposed by modeling the joint statistics of words and image regions


Introduction (2/2)

  • Correspondence problems in video data

    • Because transcripts and frames may not co-occur at the same time

    • e.g., the query “president” submitted to the Informedia system

  • Goal

    • Determine the correspondence between the video frames and associated text to annotate the video frames with more reliable descriptions


Multimedia Translation (1/3)

  • Analogy

    • learning a lexicon for machine translation vs. learning a correspondence model for associating words with image regions

  • Missing data problem

    • Assuming an unknown one-to-one correspondence between words, missing data is the major problem when estimating the joint probability distribution linking words in the two languages

    • dealt with by the EM algorithm


Multimedia Translation (2/3)

  • Method

    • a set of images and a set of associated words

    • each image is segmented into regions, and from each region a set of features (color, texture, shape, position, and size) is extracted

    • Vector-quantize the set of features representing an image region using k-means

    • Each region then gets a single label (blob token)

    • Then construct a probability table that links the blob tokens with word tokens
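The region-quantization step above can be sketched with a plain k-means loop (a minimal sketch; the paper's exact clustering settings, initialization, and iteration count are not given on the slide and are assumptions here):

```python
import numpy as np

def quantize_regions(features, k=500, iters=20, seed=0):
    """Vector-quantize region feature vectors into k blob tokens
    with a plain k-means loop (a sketch of the slide's step)."""
    rng = np.random.default_rng(seed)
    # initialize centers from randomly chosen region vectors
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each region to its nearest center (squared distance)
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned regions
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels, centers
```

Each region's blob token is then just its cluster index.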


Multimedia Translation (3/3)

  • Method (cont.)

    • The table is initialized to the co-occurrence of blobs and words

    • The final translation probability table is constructed using the EM algorithm, which iterates between two steps:

      • use an estimate of the probability table to predict correspondence

      • then use the correspondences to refine the estimate of the probability table

    • Once learned, the table is used to predict words corresponding to a particular image
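The two EM steps above can be sketched as an IBM-Model-1-style iteration over a p(word | blob) table, initialized from co-occurrence counts as the previous slide states (the exact update form used in the paper is an assumption here):

```python
import numpy as np

def train_translation_table(pairs, n_blobs, n_words, iters=10):
    """Estimate p(word | blob) by EM. `pairs` is a list of
    (blob_tokens, word_tokens) per image. The IBM-Model-1-style
    update below is an assumed form of the slide's two steps."""
    # initialize the table from blob-word co-occurrence counts
    # (plus one to avoid zero probabilities)
    t = np.ones((n_blobs, n_words))
    for blobs, words in pairs:
        for b in blobs:
            for w in words:
                t[b, w] += 1.0
    t /= t.sum(axis=1, keepdims=True)
    for _ in range(iters):
        counts = np.zeros_like(t)
        for blobs, words in pairs:
            for w in words:
                # E-step: expected correspondence of word w to each blob
                p = np.array([t[b, w] for b in blobs])
                p /= p.sum()
                for b, pb in zip(blobs, p):
                    counts[b, w] += pb
        # M-step: renormalize each blob's row into a distribution
        t = counts + 1e-9
        t /= t.sum(axis=1, keepdims=True)
    return t
```

Predicting words for an image then amounts to ranking words by their probability under the image's blob tokens.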


Correspondences on Video (1/3)

  • Broadcast news is very challenging data

    • Due to its nature, it is mostly about people and requires person detection/recognition

  • Data set

    • Subsets of Chinese culture and TREC 2001 data sets which are relatively simpler

    • Consists of video frames and the associated transcripts extracted from the audio (by the Sphinx-III speech recognizer)

    • The frames and transcripts are associated on a shot basis


Correspondences on Video (2/3)

  • Keyframe

    • Segmented into regions by a fixed-size grid

    • A feature vector of size 46 is formed to represent each region

      • Position: (x, y) of the region center

      • Color: using the mean and variance of the HSV and RGB

      • Texture: using the mean and variance of 16 filter responses

      • Four difference-of-Gaussian filters with different sigmas, and twelve oriented filters aligned in 30-degree increments
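The 46 dimensions above break down as 2 (position) + 12 (mean and variance of H, S, V, R, G, B) + 32 (mean and variance of the 16 filter responses). Assembling the descriptor from per-pixel values can be sketched as follows (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def region_feature_vector(center_xy, hsv, rgb, filter_responses):
    """Assemble the 46-dim region descriptor the slide lists:
    2 (region center position) + 12 (mean/variance of H, S, V, R, G, B)
    + 32 (mean/variance of the 16 filter responses).
    hsv and rgb are (n_pixels, 3) arrays; filter_responses is (n_pixels, 16)."""
    color = np.concatenate([hsv.mean(0), hsv.var(0),
                            rgb.mean(0), rgb.var(0)])       # 12 dims
    texture = np.concatenate([filter_responses.mean(0),
                              filter_responses.var(0)])     # 32 dims
    return np.concatenate([np.asarray(center_xy, float), color, texture])
```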


Correspondences on Video (3/3)

  • Vocabulary

    • Consists of only nouns, extracted by applying Brill’s tagger to the transcripts

    • The vocabulary contains noisy words


TREC 2001 Data (1/3)

  • TREC 2001 Data set

    • 2232 keyframes and 1938 nouns

  • Difference between still images and video frames

    • Text for the surrounding frames is also considered by setting the window size to five

  • Process

    • Each image is divided into 7 × 7 blocks (49 regions)

    • Feature space is vector quantized using k-means (k=500)

    • Apply EM to obtain the final translation probability table between 500 blob tokens and 1938 word tokens
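Gathering text from surrounding frames (the window of five mentioned above) can be sketched as merging the transcript words of nearby shots. Whether "window size" counts shots in total or per side is not specified on the slides; this sketch takes `window` shots on each side:

```python
def window_words(shot_words, i, window):
    """Collect candidate words for shot i from surrounding shots.
    `shot_words` is a list of word lists, one per shot; the
    per-side interpretation of `window` is an assumption."""
    lo = max(0, i - window)
    hi = min(len(shot_words), i + window + 1)
    # merge and deduplicate the words of all shots in the window
    return sorted({w for words in shot_words[lo:hi] for w in words})
```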


TREC 2001 Data (2/3)

  • Example annotation results for TREC 2001 data


TREC 2001 Data (3/3)

  • Experimental results (Statue of Liberty): annotations before vs. after


Chinese Culture Data

  • Example of “great wall”

  • 3785 shots and 2597 words

  • After a pruning process, 2785 shots and 626 words remain


Chinese Culture Data

  • Experimental results (panda, wall, emperor)


Chinese Culture Data

  • Evaluate the results on a larger scale: 189 images for the word “panda”


Chinese Culture Data

  • The rank of the word panda as the predicted word for the corresponding frames

    • Red: test set

    • Green: training set

    • Problem: frames showing a woman highly co-occur with the word “panda”


Chinese Culture Data

  • The effect of window size

    • a single shot, and window size is set to 1, 2 or 3

    • Recall: # of correct predictions over the # of times that the word occurs in the data

    • Precision: # of correct predictions over all predictions
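The per-word recall and precision defined above can be sketched directly (the per-frame word-set representation is an assumption):

```python
def recall_precision(word, predicted, actual):
    """Per-word recall and precision as the slide defines them:
    recall = correct predictions / occurrences of the word in the data,
    precision = correct predictions / all predictions of the word.
    `predicted` and `actual` are per-frame sets of words."""
    correct = sum(1 for p, a in zip(predicted, actual)
                  if word in p and word in a)
    n_predicted = sum(1 for p in predicted if word in p)
    n_occurs = sum(1 for a in actual if word in a)
    recall = correct / n_occurs if n_occurs else 0.0
    precision = correct / n_predicted if n_predicted else 0.0
    return recall, precision
```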


Chinese Culture Data

  • Experimental results of the effect of window size: a single shot vs. window size = 3


Chinese Culture Data

  • Experimental results of the effect of window size (cont.)

    • For some selected words


Discussion and Future Work

  • Discussion

    • Solve the correspondence problem between video frames and associated text

    • relatively simpler and smaller data sets are used

    • Broadcast news is a harder data set, since there are terabytes of video and it requires focusing on people


Discussion and Future Work

  • Better visual features

    • some detectors (face detector)

    • motion information

    • segmenter

    • use the temporal information to segment the moving objects

  • Text

    • Noun phrases or compound words

    • More lexical analysis