
Efficient Visual Search of Videos Cast as Text Retrieval


Presentation Transcript


  1. Efficient Visual Search of Videos Cast as Text Retrieval Josef Sivic and Andrew Zisserman PAMI 2009 Presented by: John Paisley, Duke University

  2. Outline • Introduction • Text retrieval review • Object retrieval in video • Experiments • Conclusion

  3. Introduction • Goal: Retrieve objects in a video database similar to a queried object. • This work casts the problem as a text retrieval problem. • In text retrieval, each document is an object and each word is given an index. Each document is then represented by a vector of the counts of each word. • Can we treat video the same way? Each frame is treated as a document. Multiple feature vectors are extracted from a single frame. These are quantized, with each quantized value then being treated as a visual word. • Text retrieval algorithms can then be used.
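To make the analogy concrete, here is a minimal sketch (Python/NumPy, not from the paper) of turning a frame into a "document" vector, assuming each region descriptor has already been quantized to an integer visual-word ID:

    import numpy as np

    def frame_to_bow(word_ids, vocab_size):
        # word_ids: quantized visual-word index for each region in the frame
        # returns a vocab_size-dimensional count vector (the "document" vector)
        return np.bincount(word_ids, minlength=vocab_size)

    # e.g. a frame whose regions were assigned words 3, 7, 3, 0 in a 10-word vocabulary
    bow = frame_to_bow(np.array([3, 7, 3, 0]), vocab_size=10)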

  4. Text Retrieval • As mentioned, each document is represented by a vector. The standard weighting for this vector is "term frequency-inverse document frequency" (tf-idf): each word's frequency within the document is multiplied by the log of the inverse fraction of documents containing that word. • Document retrieval then ranks documents in descending order of the similarity (normalized scalar product) between the query vector and each document vector. • If the vectors are L2-normalized, ranking by Euclidean distance gives the same ordering.
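A hedged sketch of tf-idf weighting and retrieval over such frame count vectors (Python/NumPy; the function and variable names are illustrative, not the authors' code):

    import numpy as np

    def tfidf(counts):
        # counts: (num_frames, vocab_size) matrix of visual-word counts
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = (counts > 0).sum(axis=0)                      # frames containing each word
        idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse document frequency
        return tf * idf

    def rank_frames(query_vec, frame_vecs):
        # cosine similarity between the query vector and every frame vector, highest first
        q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
        f = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
        return np.argsort(-(f @ q))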

  5. Object Retrieval in Video: Viewpoint Invariant Description • Goal: Extract a description of an object that is unaffected by changes in viewpoint, scale, illumination, etc. • To do this, two types of region detectors are used to define regions of interest in each frame. Roughly 1,200 regions are computed per frame, and each region is represented as a 128-dimensional vector using the SIFT descriptor. • To get rid of unstable regions, each region is tracked over a few frames, and only regions that persist (and are therefore potentially interesting) are kept. This reduces the number of feature vectors to about 600 per frame.
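As a rough stand-in for the paper's two region detectors, the sketch below uses OpenCV's SIFT detector and descriptor (assuming the opencv-python package is available); it illustrates obtaining 128-dimensional descriptors per frame, not the exact detectors used in the paper:

    import cv2

    def frame_descriptors(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create()
        # detect regions and compute one 128-D SIFT descriptor per region
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        return keypoints, descriptors  # descriptors: (num_regions, 128) array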

  6. Object Retrieval in Video: Building a Visual Vocabulary • Each frame is now represented by roughly a 128 x 600 matrix (about 600 descriptors of dimension 128). • To go from images to words, build a global dictionary using vector quantization (e.g., K-means) and quantize the feature vectors. In this paper, K-means clustering with the Mahalanobis distance is used. • The clusters are found separately for each of the two region types. In all, the authors use 16,000 clusters (visual words). • Each frame is now represented as a 16,000-dimensional vector counting the number of descriptors assigned to each cluster. Words that occur very frequently across frames are thrown out as stop words.
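A sketch of vocabulary building and quantization using scikit-learn's MiniBatchKMeans. Standard K-means uses the Euclidean distance, so the descriptors are whitened first as a rough (diagonal) approximation to the paper's Mahalanobis distance; the names and batch size are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def build_vocabulary(all_descriptors, vocab_size=16000):
        # whiten descriptors so that Euclidean K-means approximates clustering
        # under a (diagonal) Mahalanobis distance
        mean = all_descriptors.mean(axis=0)
        std = all_descriptors.std(axis=0) + 1e-8
        whitened = (all_descriptors - mean) / std
        kmeans = MiniBatchKMeans(n_clusters=vocab_size, batch_size=10000).fit(whitened)
        return kmeans, mean, std

    def quantize(descriptors, kmeans, mean, std):
        # map each 128-D descriptor to the index of its nearest cluster centre (visual word)
        return kmeans.predict((descriptors - mean) / std)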

  7. Object Retrieval in Video: Spatial Consistency • Given a queried object, there is information in the spatial layout of the matched regions that can help the ranking. • This is done by first returning results using the text retrieval algorithm discussed above, and then re-ranking them with a spatial consistency score: for each pair of matched regions, count how many other matches fall among the K nearest spatial neighbors of that region in both the query and the retrieved frame.
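A simplified sketch of this kind of spatial consistency voting (illustrative only; the paper's exact scheme differs in details such as how the neighborhoods and search areas are defined):

    import numpy as np

    def spatial_score(query_xy, frame_xy, matches, k=15):
        # query_xy, frame_xy: (num_regions, 2) region positions in each image
        # matches: list of (i, j) pairs, query region i matched to frame region j
        # each match earns a vote for every other match that lies among the
        # k nearest neighbors of both its query region and its frame region
        score = 0
        for (i, j) in matches:
            dq = np.linalg.norm(query_xy - query_xy[i], axis=1)
            df = np.linalg.norm(frame_xy - frame_xy[j], axis=1)
            near_q = set(np.argsort(dq)[1:k + 1])
            near_f = set(np.argsort(df)[1:k + 1])
            for (i2, j2) in matches:
                if (i2, j2) != (i, j) and i2 in near_q and j2 in near_f:
                    score += 1
        return score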

  8. Object Retrieval Process • A feature-length film usually has 100K-150K frames; using one frame per second reduces this to 4K-6K frames. • Features are extracted and quantized as discussed above. • The user selects a query region; its visual "words" are extracted along with their spatial relationships. • A desired number of frames is returned using the text retrieval algorithm and re-ranked using the spatial consistency method. A sketch that ties these steps together follows below.
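Putting the pieces together, a hypothetical query function built from the earlier sketches might look as follows (query_matches_fn, which returns region positions and word matches for a candidate frame, is an assumed helper, not something defined in the paper):

    def query(query_vec, query_xy, query_matches_fn, frame_vecs, top_n=500):
        # 1) rank all frames by tf-idf / cosine similarity
        candidates = rank_frames(query_vec, frame_vecs)[:top_n]
        # 2) re-rank the candidates by spatial consistency of the matched regions
        rescored = []
        for idx in candidates:
            frame_xy, matches = query_matches_fn(idx)   # region positions and word matches
            rescored.append((spatial_score(query_xy, frame_xy, matches), idx))
        rescored.sort(reverse=True)
        return [idx for _, idx in rescored]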

  9. Experiments • Results are shown for the movies “Groundhog Day,” “Run Lola Run” and “Casablanca.” • Six objects of interest were selected and searched for. • An additional benefit of the proposed method is speed.

  10. Experiments • Fig. 16 shows the effect of vocabulary size. • Table 2 shows the effect of building the dictionary from the same movie, from a different movie, and from two movies combined. • Table 3 shows the combination of the two.

  11. Experiments: Different Distance Measures

  12. Conclusion • Vector quantization does not seem to degrade retrieval performance, while making retrieval significantly faster. • Using spatial information via spatial consistency re-ranking was shown to significantly improve results. • This can be extended to temporal information as well.
