
Associating Video Frames with Text



  1. Associating Video Frames with Text Pinar Duygulu and Howard D. Wactlar Informedia Project Carnegie Mellon University ACM SIGIR 2003

  2. Abstract • Integration of visual and textual data in order to annotate video frames with more reliable labels and descriptions • The correspondence problem between video frames and the associated text is attacked using joint statistics • Better annotations can improve the performance of text-based queries

  3. Introduction (1/2) • Video retrieval: visual vs. textual features • A system that combines the two kinds of features is more powerful • Images/videos often come with some descriptive text • e.g., the Corel data set, museum collections, and captioned news photographs on the web • Correspondence problem • Methods have been proposed that model the joint statistics of words and image regions

  4. Introduction (2/2) • Correspondence problems in video data • because transcripts and frames may not co-occur at the same time • e.g., the query "president" posed to the Informedia system • Goal • determine the correspondence between video frames and the associated text, in order to annotate the frames with more reliable descriptions

  5. Multimedia Translation (1/3) • Analogy • learning a lexicon for machine translation vs. learning a correspondence model for associating words with image regions • Missing data problem • assuming an unknown one-to-one correspondence between words, the missing correspondences are the main obstacle to estimating the joint probability distribution linking words in the two languages • dealt with by the EM algorithm

  6. Multimedia Translation (2/3) • Method • a set of images and a set of associated words • each image is segmented into regions, and from each region a set of features (color, texture, shape, position, and size) is extracted • the set of features representing the image regions is vector-quantized using k-means • each region then gets a single label (blob token) • a probability table linking blob tokens with word tokens is then constructed (see the sketch below)
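
A minimal sketch of the vector-quantization step, assuming the region feature vectors have already been extracted; the function name and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_regions(region_features, k=500, seed=0):
    """Vector-quantize region features into discrete blob tokens.

    region_features: (n_regions, n_dims) array, one row per image region,
    pooled over the whole training collection so that a single codebook
    labels every region.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    blob_tokens = km.fit_predict(region_features)  # one cluster id per region
    return km, blob_tokens

# Usage (hypothetical data layout):
# km, blobs = quantize_regions(np.vstack(all_region_features), k=500)
```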

  7. Multimedia Translation (3/3) • Method (cont.) • the table is initialized from the co-occurrences of blobs and words • the final translation probability table is constructed with the EM algorithm, which iterates between two steps: • use the current estimate of the probability table to predict correspondences • then use those correspondences to refine the estimate of the probability table • once learned, the table is used to predict the words corresponding to a particular image
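
The two EM steps on this slide follow the IBM Model 1 style translation training the approach builds on. Below is a hedged sketch of that loop plus the prediction step; the data layout (one (blob-token list, word-id list) pair per image) and the smoothing constant are assumptions:

```python
import numpy as np

def train_translation_table(pairs, n_blobs, n_words, n_iters=20):
    """Estimate p(word | blob) with EM, IBM Model 1 style.

    pairs: list of (blob_ids, word_ids), one pair per image/caption.
    """
    # Initialization: blob-word co-occurrence counts, normalized per blob
    t = np.full((n_blobs, n_words), 1e-6)
    for blobs, words in pairs:
        for b in blobs:
            for w in words:
                t[b, w] += 1.0
    t /= t.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        expected = np.full_like(t, 1e-6)  # smoothed expected counts
        for blobs, words in pairs:
            blobs = np.asarray(blobs)
            for w in words:
                # E-step: posterior that word w corresponds to each blob
                p = t[blobs, w]
                p = p / p.sum()
                np.add.at(expected[:, w], blobs, p)  # handles repeated blobs
        # M-step: re-normalize expected counts into p(word | blob)
        t = expected / expected.sum(axis=1, keepdims=True)
    return t

def predict_words(blobs, t, top_n=5):
    """Annotate a frame: sum p(word | blob) over its blob tokens, rank words."""
    scores = t[np.asarray(blobs)].sum(axis=0)
    return np.argsort(scores)[::-1][:top_n]  # top word ids
```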

  8. Correspondences on Video (1/3) • Broadcast news is very challenging data • due to its nature it is mostly about people and requires person detection/recognition • Data set • subsets of the Chinese culture and TREC 2001 data sets, which are relatively simpler • consists of video frames and the associated transcript extracted from the audio (Sphinx-III speech recognizer) • the frames and transcripts are associated on a per-shot basis

  9. Correspondences on Video (2/3) • Keyframe • segmented into regions by a fixed-size grid • a feature vector of size 46 is formed to represent each region • position: (x, y) of the region center • color: mean and variance of the HSV and RGB channels • texture: mean and variance of 16 filter responses • four difference-of-Gaussian filters with different sigmas and twelve oriented filters, aligned in 30-degree increments
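
The 46 dimensions break down as 2 (position) + 12 (mean and variance of the 3 HSV and 3 RGB channels) + 32 (mean and variance of 16 filter responses). A sketch of assembling the vector for one region, assuming the color planes and filter responses are already computed (the filter bank itself is omitted):

```python
import numpy as np

def region_features(region_hsv, region_rgb, filter_responses, cx, cy):
    """Build the 46-dim feature vector for one grid region.

    region_hsv, region_rgb: (h, w, 3) arrays for the region
    filter_responses: (h, w, 16) responses of 4 DoG + 12 oriented filters
    """
    feats = [cx, cy]  # position of the region center
    for img in (region_hsv, region_rgb):  # 2 * 3 channels * (mean, var) = 12
        for c in range(3):
            ch = img[:, :, c]
            feats += [ch.mean(), ch.var()]
    for f in range(16):  # 16 responses * (mean, var) = 32
        r = filter_responses[:, :, f]
        feats += [r.mean(), r.var()]
    return np.asarray(feats)  # shape (46,)
```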

  10. Correspondences on Video (3/3) • Vocabulary • consists only of nouns, extracted by applying Brill's tagger to the transcript • the vocabulary contains noisy words
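
The paper uses Brill's tagger; the sketch below substitutes NLTK's default part-of-speech tagger purely for illustration (it assumes nltk plus its tokenizer and tagger models are installed):

```python
import nltk  # assumes the punkt tokenizer and perceptron tagger data are downloaded

def extract_nouns(transcript):
    """Keep only nouns (tags NN, NNS, NNP, NNPS) from a transcript string.

    Stand-in for the Brill tagger used in the paper.
    """
    tagged = nltk.pos_tag(nltk.word_tokenize(transcript))
    return [word.lower() for word, tag in tagged if tag.startswith('NN')]

# e.g. extract_nouns("The president visited the Great Wall yesterday.")
# -> ['president', 'wall', 'yesterday']  -- noisy words remain, as the slide warns
```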

  11. TREC 2001 Data (1/3) • TREC 2001 data set • 2232 keyframes and 1938 nouns • Difference between still images and video frames: text from the surrounding frames is also considered, by setting the window size to five • Process • each image is divided into 7 × 7 blocks (49 regions) • the feature space is vector-quantized using k-means (k = 500) • EM is applied to obtain the final translation probabilities between the 500 blob tokens and 1938 word tokens
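
A sketch of how the transcript text might be pooled over surrounding shots before building training pairs; whether a window of five spans five shots total or five to each side is not specified on the slide, so the `window` parameter and the data layout here are assumptions:

```python
def words_in_window(shot_words, i, window=5):
    """Pool transcript word ids from shots near shot i.

    shot_words: list of per-shot word-id lists in temporal order.
    Pooling compensates for speech and imagery not co-occurring
    at exactly the same time.
    """
    lo = max(0, i - window)
    hi = min(len(shot_words), i + window + 1)
    pooled = set()
    for j in range(lo, hi):
        pooled.update(shot_words[j])
    return sorted(pooled)

# Training pairs for EM would then be, per keyframe i (hypothetical layout):
# (blob_tokens_of_keyframe[i], words_in_window(shot_words, i))
```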

  12. TREC 2001 Data (2/3) • Example annotation results for TREC 2001 data

  13. TREC 2001 Data (3/3) • Experimental results (Statue of Liberty): before vs. after

  14. Chinese Culture Data • Example: "great wall" • 3785 shots and 2597 words • after pruning, 2785 shots and 626 words

  15. Chinese Culture Data • Experimental results (panda, wall, emperor)

  16. Chinese Culture Data • Evaluating the results on a larger scale: 189 images for the word panda

  17. Chinese Culture Data • The rank of the word panda as the predicted word for the corresponding frames • red: test set • green: training set • problem: frames showing the woman co-occur strongly with the word panda

  18. Chinese Culture Data • The effect of window size • a single shot vs. window sizes of 1, 2, or 3 • recall: # of correct predictions over the # of times the word occurs in the data • precision: # of correct predictions over all predictions (see the sketch below)
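
These per-word measures are simple ratios; a small helper with a worked example:

```python
def recall_precision(n_correct, n_occurrences, n_predictions):
    """Per-word recall and precision as defined on the slide.

    recall    = correct predictions / times the word occurs in the data
    precision = correct predictions / all predictions of the word
    """
    recall = n_correct / n_occurrences if n_occurrences else 0.0
    precision = n_correct / n_predictions if n_predictions else 0.0
    return recall, precision

# e.g. a word occurring 20 times, predicted 25 times, 15 of them correctly:
# recall_precision(15, 20, 25) -> (0.75, 0.6)
```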

  19. Chinese Culture Data • Experimental results of the effect of window size: a single shot vs. window size = 3

  20. Chinese Culture Data • Experimental results of the effect of window size (cont.) • For some selected words

  21. Discussion and Future work • Discussion • solved the correspondence problem between video frames and associated text • relatively simple and small data sets were used • broadcast news is a harder data set, since there are terabytes of video and it requires focusing on people

  22. Discussion and Future work • Better visual features • some detectors (e.g., a face detector) • motion information • segmenter • use temporal information to segment moving objects • Text • noun phrases or compound words • more lexical analysis
